Case Study: Implementing Data Governance for Data Lakes and Big Data

Shannon Fuller says that knowing your priorities is the key to efficiently developing a governance structure for the Data Lake. Fuller is the Director of Data Governance at Carolinas Healthcare System, where he piloted an HDInsight Hadoop implementation on Microsoft Azure. Speaking at the DATAVERSITY® Enterprise Data Governance Online Conference, Fuller shared practical approaches he used to create a system that supports innovation and insights in health care service delivery.

Assessment

“Our approach was to look at it from an Analytics perspective and to develop an Analytics Platform” that could be used to serve health care organizations, he said. The trend in health care now is coordination of physician networks for continuum of care – being aware of all providers a patient sees and looking at an entire treatment plan, “instead of how it used to be where you go and see the specialist and then you come back to your primary care.”

Another trend in health care is about moving from “a volume-based model to a value-based model when it comes to compensation for care,” he said. Internal knowledge, patient data, billing data, clinical data, and provider data were already being collected, but, “The challenge was, how do you take into account all these disparate pieces of information that we have and provide that information back to the organization to be able to make decisions?”

Program Development

“Our guiding principles around how we are going to engage this is, you want your physicians and your operational leaders to be able to consume this data and make decisions. That was a driving factor: To take these insights, and put them in plain language, so that they can make operations improvements,” he said. “You have to govern the data, and the Analytics Products that we’re developing also have to be governed, so that you’re getting consistent answers across the organization, and then this should be used to help us build a more data-driven culture.”

He noted that it’s important to ask some questions about what your priorities are and your desired result before focusing on how you get there. “You don’t want to start with the tool. You need to understand what you’re trying to do so you properly evaluate which of these distributions or offerings would be best for you.”

Fuller then discussed the questions they used to develop their Data Governance program:

Where to govern data?
Is the Data Lake a playground, an incubator for innovation, or an operational data store?
What factors to consider before tiering, segregating or restricting data to protect it, yet still make it useful for innovation?
What tools are available, or will we develop our own?
How do you protect intellectual property when working with third parties?

Fuller said that their initial idea was to bring in all of the data they had, add their Data Quality care measures, and the third-party data that they had from outside agencies, so they could, “Aggregate, cleanse, identify, apply any algorithms, and then provide that in an easily digestible manner back to the organization.”

Their key Data Governance priorities were protecting patient information, encouraging innovation, protecting intellectual property and having a common, accessible repository. Fuller said that protecting patient data is absolutely critical because, “We’re going to be joining many different data sources together, so there will be a lot of detailed information about our patients being brought together in one place.”

Program Components

Innovation was important, so they decided to create smaller, stand-alone products that could support rapid development of Analytics Prototypes and Predictive Models, he said, and to minimize risk with third parties, they wanted to build in intellectual property protections.

“Looking at the assets we were going to develop, we made a decision that we were not just creating another Enterprise Data Warehouse. We were going to take the approach that each of the analytic assets that we developed should be a stand-alone asset,” he said. “It’s taking a smaller structured approach to provide information back to the organization. You’re not trying to boil the ocean with any one asset.”

By taking smaller chunks, they could enable quicker availability of new data sources, he said. “It’s easier to be able to structure – especially with non-structured data – if you can focus on exactly what you’re looking for instead of, for example, trying to pull everything out of physician notes and structuring it.” Instead, he said, you can choose to look only at the key pieces needed to support a given asset.
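
As a rough illustration of structuring only the key pieces, the sketch below pulls two fields out of a free-text note and ignores everything else. The note text, the field names, and the patterns are hypothetical examples chosen for illustration; they are not from Fuller's presentation and do not represent the tooling his team used.

```python
import re

# Hypothetical patterns for the only fields this one analytic asset needs.
# Everything else in the note is deliberately left unstructured.
KEY_FIELD_PATTERNS = {
    "blood_pressure": re.compile(r"\bBP[:\s]+(\d{2,3}/\d{2,3})\b", re.IGNORECASE),
    "hba1c": re.compile(r"\bHbA1c[:\s]+(\d+(?:\.\d+)?)\s*%", re.IGNORECASE),
}


def extract_key_fields(note_text):
    """Return only the targeted fields that appear in a free-text note."""
    found = {}
    for field_name, pattern in KEY_FIELD_PATTERNS.items():
        match = pattern.search(note_text)
        if match:
            found[field_name] = match.group(1)
    return found


# Hypothetical note, invented for this example.
note = "Patient seen for follow-up. BP: 128/82. HbA1c 6.9% today. Discussed diet."
print(extract_key_fields(note))
```

Adding a new asset under this approach would mean adding a small set of patterns for that asset, rather than attempting to structure every note in full.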

They also wanted to use the Data Lake concept to its fullest, he said:

“We didn’t want to re-create the sins of the past by standing up silos of data inside the Data Lake, with people only having access to certain pieces of data. We wanted to create a common repository and have a way of protecting that information, but still maintaining that one common repository.”

Big Data Tools

After deciding their priorities, they were ready to consider third-party tools. They chose a Hadoop implementation on Microsoft Azure. Fuller noted that Hadoop offers several different storage options, and of those, they decided to use Azure storage blobs as well as an Azure Data Lake store. There are file size limits for storage blobs and they are not truly distributed, he said.

One of the advantages of Hadoop is its capability for distributed processing, so using storage blobs exclusively imposes limitations simply because of how the data is stored, he said. “There is still a place – and reasons why – you would use storage blobs, and again, it just points out why you need to understand your use case as you’re evaluating these options.”

To offset these limitations, they use the Data Lake store, an implementation of HDFS in the Cloud that provides user-based security, which they administer using Microsoft HDInsight. He sees the lack of file size or storage limits as the biggest advantage of the Data Lake store.

“Since this is a Cloud environment, you can grow your storage as large as you need to fit your need,” he said. “I also like the fact that it’s flexible, so if you’re working on a specific initiative, you can grow your environment as big as you want, and then once that initiative is over, you can shrink it back down. That’s functionality that they provide that gives you a lot of flexibility so you’re not wasting money and paying for space that you don’t need.”

“For this project we partnered with a company called Tresata,” he said. “Their product can help you catalog all of that data that you’re bringing into the Data Lake, structure it, and be able to give access to various users,” so they can “provide some level of reporting and organization on that data.” He said there are many companies to partner with that can do this type of work, but, “Tresata is who we chose and it was a very successful partnership.”

To address the issue of IP protection, they set up edge nodes in the Azure environment for their Data Scientists to develop their models. He compared an edge node to a server where applications and files can be stored. “The third-party edge node is where we were loading the Tresata software, so it creates a natural firewall in between those two environments.”

They also didn’t want to segregate out data, he said:

“And this is another reason I like the flexibility of this environment – you can have separation of your applications, but you’re all still working on common storage, and we have a common compute environment as well, so you don’t have to have separate environments in order to protect that data.”

Fuller presented a slide with a diagram of how the different components they chose worked together. Source data is extracted from on-site databases using HDInsight tools and stored in an Azure Data Lake store, then refined, enriched and catalogued by Tresata. Cleansed and enriched data is available to be used for modeling, reporting, and to populate executive dashboards as needed.

Following their objective of having a common environment and bringing in raw data, he said, “We have to curate it and make it usable for people.” To manage this process, they created layers in the Data Lake. The raw data all lands in an operational layer:

“There are many considerations when you’re going down this path, and what we thought about was, Do you want to govern the data coming into the environment? Do you want to land it in the environment and then try and put structure and controls around it? Do you want to create an absolute clean dataset before you even send it to this environment?”

After the data lands in the operational layer, it’s cleansed, matched, and merged, algorithms are applied to it, and access controls are instituted, he said.
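
A minimal sketch of the cleanse, match, and merge steps is shown below. It assumes two hypothetical source records for the same patient and uses a deliberately simple exact match on normalized name and date of birth. The field names and the matching rule are illustrative assumptions only; production matching is usually far more sophisticated, and this is not a description of how Tresata or Fuller's team implemented it.

```python
def cleanse(record):
    """Normalize whitespace and casing on the fields used for matching."""
    cleaned = dict(record)
    cleaned["name"] = " ".join(record["name"].split()).title()
    cleaned["dob"] = record["dob"].strip()
    return cleaned


def match_key(record):
    """Hypothetical matching rule: exact match on name and date of birth."""
    return (record["name"].lower(), record["dob"])


def merge_records(records):
    """Cleanse each record, then merge records that share a match key."""
    merged = {}
    for raw in records:
        cleaned = cleanse(raw)
        merged.setdefault(match_key(cleaned), {}).update(cleaned)
    return list(merged.values())


# Hypothetical records from two source systems, invented for this example.
billing = {"name": "  jane   doe ", "dob": "1980-02-14 ", "balance": 120.0}
clinical = {"name": "Jane Doe", "dob": "1980-02-14", "last_visit": "2017-03-01"}

print(merge_records([billing, clinical]))
```

Because later records overwrite earlier ones on shared fields, the order of the sources matters in this sketch; a real pipeline would need explicit rules for resolving conflicts.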

“We created this analytic sandbox, because there are going to be times when they’re doing prototyping,” which requires a controlled dataset, he said. The system allows Data Scientists and analysts to “pull that dataset from the curated space and load it into the analytic sandbox.” Fuller said the persistent layer is where security controls are applied, “Because that data is being used in specific reporting, in your BI environment, your BI tools, so not every one of your BI users would have access to all the data in your persistent layer.”
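
One way to picture the layers is as named zones over common storage, with each copy of a dataset carrying its own access list. The sketch below is only that: the allowed moves between layers, the role names, and the `Dataset`, `promote`, and `can_read` names are all assumptions made for illustration, since the talk does not spell out the exact transitions or how the controls were enforced.

```python
from dataclasses import dataclass, field

# Assumed flow between layers: raw data lands in "operational", is curated,
# and curated data can feed either the sandbox or the persistent layer.
ALLOWED_MOVES = {
    "operational": {"curated"},
    "curated": {"sandbox", "persistent"},
}


@dataclass
class Dataset:
    name: str
    layer: str
    allowed_roles: set = field(default_factory=set)


def promote(dataset, target_layer, allowed_roles):
    """Create a copy of the dataset in the target layer with its own access list."""
    if target_layer not in ALLOWED_MOVES.get(dataset.layer, set()):
        raise ValueError(f"cannot move {dataset.layer} -> {target_layer}")
    return Dataset(dataset.name, target_layer, set(allowed_roles))


def can_read(dataset, role):
    """Return True if the role is on the dataset's access list."""
    return role in dataset.allowed_roles


# Hypothetical dataset and roles, invented for this example.
raw = Dataset("claims", "operational", {"data_engineer"})
curated = promote(raw, "curated", {"data_engineer", "data_scientist"})
prototype = promote(curated, "sandbox", {"data_scientist"})
report = promote(curated, "persistent", {"finance_analyst"})

print(can_read(report, "finance_analyst"))
print(can_read(report, "data_scientist"))
```

The point of the sketch is that access is decided per dataset and per layer, so separate storage environments are not needed to keep some BI users away from some data.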

Lessons Learned

Fuller said the most important piece of advice he’d offer is that, “having a plan and understanding where you’re trying to go can help narrow down options, but don’t get intimidated” – it will keep changing. “Like any other technology, you may have fits and starts, where you go down a path and figure out you need to do something else,” he said, “And that’s just the nature of the environment.”

Check out Enterprise Data Governance Online at www.datagovernanceonline.com/


Photo Credit: Toria/Shutterstock.com
