Case Study: Centrica Succeeds with Data Discovery at Scale


“When you’ve got a mass of data, how do you analyze that data and get to a point where you can get the gemstones, the diamonds out of it?” Mike Young, Chief Information Officer with Centrica, knows what it’s like to wade through a sea of petabytes and terabytes to find value. Centrica is an energy utility and services company, providing gas, oil, and renewables to businesses and consumer markets in North America through Direct Energy, and in the UK through British Gas. Young spoke with DATAVERSITY® about how Centrica was able to turn its massive data lake into a strategic asset in the process of becoming General Data Protection Regulation (GDPR)-compliant.

A Size Problem

Centrica has two SAP-based back-end systems, one serving the consumer side of the business and the other serving the business side. Those systems rank third and fifth among the biggest systems SAP supports worldwide. What makes them big is not so much their configuration as the data that resides in them.

Centrica’s data lake, mostly composed of customer data gathered since 2015, is fairly sophisticated. “But it’s still a mass of data,” Young said. “It’s terabytes and petabytes of data that sit in that lake.” The company uses that data, for instance, to run a rewards program for British Gas customers. Besides providing gas and electricity, Centrica offers a number of services, such as boiler repair or general plumbing, that sit outside of the energy portfolio. The data needed to find customers’ pain points and offer those related services was sitting in the data lake, but the company was unable to use it efficiently.

Overall Assessment

In the past, Young’s team built some of their own algorithms to find those gems, but when the GDPR arrived, they realized they needed a much more sophisticated approach to customer data. Anything that counts as private under the GDPR (name, address, credit card numbers, telephone numbers) would need additional layers of security. “We found that quite difficult, because all of this data sits in the mass of a big lake along with a lot of other data.” They had no way to control and track access to, use of, or deletion of private information. Young knew the clock was running on becoming GDPR-compliant by the deadline.

Primary Considerations

Besides GDPR considerations and the desire for a more comprehensive approach to meeting customer needs, Centrica wanted to improve the level of information being shared between business units across the company, according to Young’s colleague Daljit Rehal, Centrica’s Senior VP of Digital & Data and Chief Data Officer. They realized that with the right solution, they could combine information from multiple units in the data lake and turn it into a strategic asset.

“You can have a data lake, but not necessarily all of the data in there has value,” said Young. Getting into and understanding those datasets, which usually consist of both structured and unstructured data, is a difficult task no matter how many data scientists are dedicated to it. “We knew we needed something much more sophisticated with the new legislation emerging insisting that all corporates had a sense of where all of their sensitive data resides.”

Young said they searched “far and wide, looking for the solution that could aid us in that endeavor.” In the process, they concluded that nothing on the horizon could compete with the Smart Data Discovery platform from Io-Tahoe.

Early Testing

They started with a small pilot across four data sources that grew into a four-month exercise allowing Io-Tahoe to trawl nearly 900 applications, 22 of them fairly significant. “These are big box type servers, a number of tables, a number of columns, in order to identify where we, from a legislation point of view, had sensitive data and what that sensitive data looked like.” Rehal said they were able to process 30 billion records and 1.7 million columns in a fraction of the time they anticipated, a task that would have been impossible to attempt manually.
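To make the idea of scanning columns for sensitive data concrete, here is a minimal Python sketch. It is not Io-Tahoe’s method, which the article does not describe; the patterns, the 0.8 threshold, and the names `PATTERNS`, `classify_column`, and `scan_table` are all assumptions chosen for illustration. It simply samples the values in each column and labels the column when most of them look like one kind of personal data.

```python
import re

# Hypothetical, deliberately simple patterns; real detection needs far more care.
PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "uk_phone": re.compile(r"^(?:\+44|0)\d{9,10}$"),
    "card_number": re.compile(r"^\d{13,19}$"),
}


def classify_column(values, threshold=0.8):
    """Return the first label matching at least `threshold` of the non-empty values, else None."""
    # Drop empty cells and strip spaces so "07700 900123" and "07700900123" look alike.
    cleaned = [str(v).replace(" ", "") for v in values if v not in (None, "")]
    if not cleaned:
        return None
    for label, pattern in PATTERNS.items():
        hits = sum(1 for v in cleaned if pattern.match(v))
        if hits / len(cleaned) >= threshold:
            return label
    return None


def scan_table(table):
    """table maps column name -> list of sampled values; returns only the flagged columns."""
    flagged = {}
    for name, values in table.items():
        label = classify_column(values)
        if label is not None:
            flagged[name] = label
    return flagged


sample = {
    "contact": ["a@example.com", "b@example.org", None],
    "mobile": ["07700 900123", "+44 7700 900456"],
    "notes": ["boiler repair", "plumbing"],
}
flagged_columns = scan_table(sample)
```

Pattern matching on sampled values is only one of several possible signals; column names, data types, and relationships between tables could also feed a discovery tool, and nothing here should be read as a description of the product Centrica used.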

Duplicate data across multiple sources was also identified in the process, allowing Centrica to quickly locate and clean it up. In comparison, Rehal said they had a side project using a third party that involved finding personal information residing in non-production systems: “They spent eight months cataloging four data sources. Using Io-Tahoe, we did 22 data sources in one month.” One thing Young sees as a strong feature is that the solution works in the lake and in the database arena as well. “We’re big, heavy data users, and if it works for us, it’ll work for any entity that’s trying to find a discovery solution for datasets.”
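As a rough illustration of how duplicates across sources can be surfaced, the Python sketch below fingerprints each record on a few normalized fields and groups records that share a fingerprint. The article does not say how Io-Tahoe does this, so the field names (`name`, `postcode`), the normalization rules, and the function names are assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def fingerprint(record, keys=("name", "postcode")):
    """Hash the chosen fields after lowercasing and collapsing whitespace."""
    normalized = "|".join(
        " ".join(str(record.get(k, "")).lower().split()) for k in keys
    )
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(sources):
    """sources maps source name -> list of record dicts.

    Returns fingerprint -> list of (source, row index) for records seen more than once.
    """
    seen = defaultdict(list)
    for source, records in sources.items():
        for index, record in enumerate(records):
            seen[fingerprint(record)].append((source, index))
    return {fp: locations for fp, locations in seen.items() if len(locations) > 1}


sources = {
    "billing": [{"name": "Ann Lee", "postcode": "SW1A 1AA"}],
    "rewards": [
        {"name": " ann  lee", "postcode": "sw1a 1aa"},
        {"name": "Bob Ray", "postcode": "M1 1AE"},
    ],
}
duplicates = find_duplicates(sources)
```

Exact matching on normalized fields misses near-duplicates such as misspelled names, so a production approach would likely need fuzzy matching as well.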

Extending it All

Building on the success they found using Io-Tahoe to meet and manage GDPR compliance in the pilot, Young said they expanded its use to keeping datasets healthy. “We’re only retaining data now when it’s pertinent to certain products and services, and to our customers, and we’re cleaning up our data as we go along,” he said. That cleanup also allows them to use their datasets to build future products and services on behalf of the company. “Which is why, in the latter part of last year, we chose to extend the use of the Io-Tahoe product at an enterprise level.” With an enterprise-wide license, they can now allow all users in the business to use the tool within the data lake.


Io-Tahoe performed well enough that Centrica was inspired to expand its use far beyond the initial pilot, he said, and into the future. “We recognize we’ve got a platform that we think is going to be fit for the purpose for a long time to come.” Young said that previously, teams on the B-to-B side and on the B-to-C side, used to working with data in silos, would ask his team to retrieve certain datasets from the data lake and report on them.

With the increased agility and processing power they now have, “We’ve emboldened our business groups to think about data as their asset, not necessarily just as a group asset, and to start using the platform more extensively across the grid.” Using the discovery that Io-Tahoe provides, users can access the data lake on their own, drawing reports with whatever criteria they need.

The self-service discovery and reporting has increased the level of confidence across data groups within the Centrica portfolio, he said. “It’s become the go-to platform that they use on an everyday type basis.”


Io-Tahoe and its smart data discovery capability are a key part of Centrica’s future. “In our world, and increasingly so across the globe, we’re all moving into real time datasets and the ability to interact with your customer in real time.” As Io-Tahoe builds out its portfolio of additional services, Centrica can pick them up as they become available. “We found a rare solution when we found Io-Tahoe,” Young said. “Solutions that allow you to do rich data discovery are worth their weight in gold.”

