Six Key Principles to Accelerate Data Preparation in the Data Lake

By Paul Barth

In the digital economy, most companies are looking to significantly increase their use of data and analytics for competitive advantage. Many are investing in new technologies like Hadoop that promise the speed and flexibility they need, and early pilots look promising. However, companies are struggling to scale these platforms to broad adoption across the enterprise, because the longest part of the process, getting data ready for business consumption, consistently takes too much time. A 2015 O’Reilly Data Scientist Salary Survey, for example, reported that most data scientists spend a third of their time performing basic extraction/transformation/load (ETL), data cleaning, and basic data exploration rather than true analytics or data modeling.

If we think of these new data platforms as a marketplace, preparing data for business use is the “transaction cost” of delivering value.  In an economic marketplace, high transaction costs limit the creation and exchange of value.  In a data marketplace, time-consuming data preparation dramatically limits the use of data and analytics in business processes, and more significantly, the rate at which companies can innovate.

We have identified several root causes for slow data preparation and six key principles that continuously accelerate data preparation. These principles have proven very effective, with some companies achieving a 30x increase in analytics productivity and 100x increase in user adoption. These results have convinced us that the most important driver for innovating with data is making data access and preparation frictionless.

Data Preparation in the Data Lake

There are two common methods to prepare data in the data lake: writing code and using data wrangling tools. Data scientists often use programming interfaces such as Spark, Python, or R to work with data in the lake. Data wrangling tools can lighten that burden. Because they offer a graphical interface rather than requiring programming skills, wrangling tools let data scientists search data already in the lake and lightly cleanse and prepare new data sets for analytic projects. What wrangling tools don’t deliver is support for loading and cleansing complex data into the lake, or the ability to manage, secure, and govern data consistent with the enterprise-scale demands of most large companies.

Moreover, with wrangling tools, work done by an individual data scientist is available only to that person or that person’s immediate team. Wrangling tools don’t support shared learning across the enterprise, in which multiple data scientists crowdsource and enhance data in the data lake for easy reuse by others.

The challenge is that both approaches to data preparation force data scientists to spend too much valuable time on low-value-added tasks.

Why Not Write Code or Use a Data Wrangling Approach?

For an individual analyst, especially one with deep technical skills who needs direct access to data without waiting for IT, writing code to load data into the lake and then accessing it directly can be cost effective, quick, and pragmatic. Likewise, when most of the needed data is already in the lake, a team of data scientists, even those with limited programming skills, can easily use a data wrangling tool to search, lightly cleanse, and prepare new data sets.

These approaches break down in two situations: when large quantities of new, complex, or dirty data must be loaded into the lake, and when many analysts, including some with limited technical skills, need to work with data in the lake and share their analyses and data enhancements with one another. Let’s examine some of these challenges in more detail.

Data Cleansing

Data wrangling tools offer strong support to prepare data that is relatively simple, well-organized, and data-type/schema compliant, such as relational tables or flat files. These tools allow users to enhance or standardize data in existing fields, create new fields, define relationships between tables, and create new data sets.

What wrangling tools do not do well is support the advanced data cleaning and preparation steps required to ingest complex enterprise data sources such as COBOL files or XML files into a data lake. The list below describes some data quality challenges that are typical of complex enterprise and legacy data sources that data wrangling tools don’t address. If left untreated, these data quality challenges will cause dirty data to corrupt the lake.

·       Embedded delimiters where a value contains the delimiter, such as a comma, that separates fields

·       Corrupted records where an operational system may have inadvertently put control characters in values

·       Data type mismatches such as alphabetic characters in a numeric field

·       Nonstandard representations of numbers and dates

·       Headers and trailers that contain control and quality information that need to be processed differently than traditional records

·      Multiple record types in a single file

·       Mainframe file formats from legacy systems that use character sets Hadoop does not know how to recognize or process

A data scientist using various programming interfaces to work with data in the lake could theoretically write code to address many of these advanced data cleansing needs. In practice, however, this effort requires such advanced technical skills and so much time that it is impractical for most data scientists and analytic teams.
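To make the cleansing problems listed above concrete, here is a minimal Python sketch of what hand-written cleansing code might look like for one small delimited extract. The file layout (an `HDR` header row, a `TRL` trailer row carrying a record count, and three-column detail rows), the function names, and the choice of the `cp037` EBCDIC code page are all illustrative assumptions, not taken from any particular tool or source system.

```python
import csv
import io
import re

# Control characters other than tab, line feed, and carriage return.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def clean_value(value):
    """Remove stray control characters and surrounding whitespace."""
    return CONTROL_CHARS.sub("", value).strip()


def parse_amount(text):
    """Return a float, or None on a data type mismatch.

    Accepts a nonstandard representation with thousands separators,
    such as '1,234.50'.
    """
    try:
        return float(text.replace(",", ""))
    except ValueError:
        return None


def decode_ebcdic(raw_bytes, codec="cp037"):
    """Decode mainframe bytes; cp037 is an assumed EBCDIC code page."""
    return raw_bytes.decode(codec)


def load_extract(raw_text):
    """Split a delimited extract into good rows and rejected rows.

    Assumed layout: first row is a header record, last row is a trailer
    of the form TRL,<detail record count>, everything between is detail.
    """
    # The csv module handles quoted values that contain the delimiter.
    rows = list(csv.reader(io.StringIO(raw_text, newline="")))
    body, trailer = rows[1:-1], rows[-1]

    good, rejects = [], []
    for row in body:
        cleaned = [clean_value(v) for v in row]
        amount = parse_amount(cleaned[2]) if len(cleaned) == 3 else None
        if amount is None:
            rejects.append(cleaned)  # keep bad records out of the lake
        else:
            good.append((cleaned[0], cleaned[1], amount))

    # The trailer is control information, not data: use it as a check.
    count_ok = trailer[0] == "TRL" and int(trailer[1]) == len(body)
    return good, rejects, count_ok
```

Even this toy version needs separate handling for quoting, control characters, numeric formats, record types, and character sets, and it covers only one assumed file layout. Each new source would need its own variant.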

The bottom line is that when data scientists write code to cleanse data in a data lake they spend too much time on manual data preparation at the expense of real data analysis. Likewise, when data wrangling tools cleanse data, their ability to solve enterprise-scale quality issues is limited, and they are at risk of bringing dirty, inaccurate data into the lake without knowing it.

The Insight Gap

Metadata is another area where data scientists face challenges when using custom code or data wrangling tools to work with a data lake. Data scientists need metadata to understand what data means, how it is organized, and how it might be used in various analytic projects. When data is loaded into a data lake without the simultaneous creation of robust metadata, the burden shifts to data scientists, who must spend more time and effort finding, exploring, collecting, and preparing data for their particular needs. Lack of metadata in the lake effectively creates an “insight gap” between valuable data that could drive analytic insights and the data scientists and business users responsible for developing those insights.

This insight gap springs from two causes. First, when data is initially ingested into a data lake it is critical that robust metadata describing that data source is added to the data lake at the same time.  Not only should metadata from the source environment be pulled into the lake, but the data itself should be fully profiled and validated during ingestion to confirm its structure, contents, and quality. Results from the profiling and validation process should be captured and made available to users as new metadata.
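As a rough illustration of profiling during ingestion, the sketch below computes a few basic statistics per column and returns them as a dictionary that could be stored as metadata. The function names, the choice of statistics, and the simple type-inference rule are assumptions made for this example; a production profiler would capture much more (value ranges, patterns, outliers, and so on).

```python
def _all_parse(values, cast):
    """Return True if every value can be converted with cast()."""
    try:
        for value in values:
            cast(value)
    except ValueError:
        return False
    return True


def infer_type(values):
    """Guess a column type from its non-empty string values."""
    if not values:
        return "empty"
    if _all_parse(values, int):
        return "integer"
    if _all_parse(values, float):
        return "decimal"
    return "string"


def profile(rows):
    """Profile a list of dict rows (column name -> string value).

    Returns one entry per column with row count, null count,
    distinct count, and an inferred type.
    """
    columns = list(rows[0].keys()) if rows else []
    report = {}
    for col in columns:
        raw = [row.get(col) for row in rows]
        present = [v for v in raw if v not in (None, "")]
        report[col] = {
            "rows": len(raw),
            "nulls": len(raw) - len(present),
            "distinct": len(set(present)),
            "type": infer_type(present),
        }
    return report
```

Running `profile` on each new data set as it lands, and saving the result next to the data, is one simple way to turn validation results into metadata that later users can search.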

When data is ingested into a data lake through custom coding, none of these steps of metadata collection, validation, or profiling occurs automatically. Data wrangling tools do import metadata from source environments as they bring data into the lake, but they don’t provide comprehensive profiling and validation. Both approaches, therefore, create the first half of the insight gap by failing to provide adequate metadata capabilities for new data entering the lake.

The second part of the insight gap stems from the fact that neither programmatic approaches to working with data in a data lake nor data wrangling tools allow analysts and other users to create and share metadata as they work with data in the lake. User-generated metadata that addresses the meaning, quality, or usefulness of particular data elements allows analysts and others to crowdsource and share valuable context and meaning.

The Collaboration Gap

There are significant efficiency and productivity gains available to teams of data scientists who communicate, collaborate, and build on one another’s work by sharing insights about data or prepared data sets in a data lake. However, neither data wrangling tools nor programming approaches to working with data in a data lake allow easy sharing of data, analyses, useful metadata, and automated data preparation processes among multiple users and business units.

Nor do custom coding approaches or data wrangling tools provide easy access to data at multiple points along its progression from raw to ready. An important part of the collaboration process is preserving data at every point on the raw-to-ready path (as it is loaded, explored, cleansed, enhanced, and finally delivered to users) and making it available for data scientists to combine and enhance into “fit for purpose” data for their particular needs. This is especially true for analytic teams who want to build and share data sets that mix and match data in various states of readiness.

Programmatic approaches to accessing data in the lake and data wrangling tools also cannot provide the governance and security required to make data useful in an enterprise setting. This includes rules that manage authentication of users, role- and group-based access controls, encryption and obfuscation of sensitive data, and auditing of user access and data interactions.
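The kinds of controls described here can be sketched in a few lines of Python. Everything in this example is hypothetical: the `POLICY` table, the role names, the column names, and the in-memory `AUDIT_LOG` stand in for what an enterprise platform would provide through its own security services.

```python
import hashlib

# Hypothetical role-based policy: which columns a role may read,
# and which of those must be obfuscated before delivery.
POLICY = {
    "analyst": {"read": {"claim_id", "amount", "ssn"}, "masked": {"ssn"}},
    "steward": {"read": {"claim_id", "amount", "ssn"}, "masked": set()},
}

# In-memory stand-in for an audit trail of user access.
AUDIT_LOG = []


def mask(value):
    """Obfuscate a sensitive value with a truncated SHA-256 digest.

    Illustration only: an unsalted hash is weak protection for
    low-entropy values and is not a substitute for real encryption.
    """
    digest = hashlib.sha256(value.encode("utf-8")).hexdigest()
    return "masked:" + digest[:8]


def read_row(user, role, row):
    """Return the columns this role may see, masking where required."""
    rules = POLICY.get(role)
    if rules is None:
        AUDIT_LOG.append((user, role, "denied"))
        raise PermissionError(f"role {role!r} has no access")

    visible = {}
    for column, value in row.items():
        if column not in rules["read"]:
            continue  # column is not granted to this role
        visible[column] = mask(value) if column in rules["masked"] else value

    AUDIT_LOG.append((user, role, "read"))
    return visible
```

The point of the sketch is that access rules, obfuscation, and auditing have to be applied on every read. When each analyst writes their own access code, nothing guarantees that happens consistently.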

Finally, when access to data in the lake is only possible by writing custom code, the list of people who can generate analytic insights is limited to only those with advanced technical skills. A different approach is needed to empower larger groups of less technical business users to generate value from the lake.

The Transaction Cost Is Still Too High

Data lakes go a long way toward helping data scientists be more productive and self-sufficient as they work to generate analytic insights for the business. By giving data scientists fast, direct, self-service access to data for analytic projects, data lakes eliminate dependence on IT and cut the time required to gain access to a new data source from months to hours. However, when access to the data in the lake is via data wrangling tools or custom code, the transaction cost to work with data in the lake is still too high. Data scientists who spend a large share of their time on ETL, data cleansing, and basic data exploration tasks are hamstrung in their efforts to apply their valuable and rare analytic skills to high-value data modeling and analytic projects.

What’s needed is a better way, one that makes it easier and faster for data analysts and other analytically oriented business users to work with data in the lake.  Here are some key aspects of what that better way should look like.

#1 – An Easy-to-Use Graphical User Interface

Applications that allow users to interact with the data lake via a graphical interface, including purpose-built data lake management applications and data wrangling tools, accelerate, simplify, and expand access to data in the lake by eliminating the need for specialized programming skills and hand-coding efforts.

#2 – Support for the Whole Data Lake Process

Today, when data analysts try to work with data in a data lake using only Hadoop and its associated open source tools, they are effectively building from scratch or cobbling together the functionality required to drive the data lake process. A significant part of lowering the transaction cost for data lakes would be the adoption of data lake applications with comprehensive, built-in support to both deploy and maintain a data lake over time.

#3 – A Robust Metadata Layer

Data lakes with robust metadata lower the transaction cost of working with data in the lake by giving users immediate, accurate information about data in the lake.  This robust metadata layer is generated by importing metadata from source systems, profiling each new data set as it enters the lake, collecting new business metadata from users, and documenting every time data in the lake is touched as part of the data lake process.
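One way to picture such a metadata layer is as a catalog entry per data set that holds all four kinds of information named above. The sketch below is a hypothetical data structure, not the schema of any real catalog product; the class name, field names, and event format are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CatalogEntry:
    """Hypothetical catalog record for one data set in the lake."""

    name: str
    source_metadata: dict                            # imported from the source system
    profile: dict = field(default_factory=dict)      # captured at ingestion
    annotations: list = field(default_factory=list)  # business metadata from users
    events: list = field(default_factory=list)       # every touch of the data

    def _log(self, user, action):
        """Record who did what, and when (UTC)."""
        self.events.append({
            "user": user,
            "action": action,
            "at": datetime.now(timezone.utc).isoformat(),
        })

    def set_profile(self, user, profile):
        """Attach profiling results produced during ingestion."""
        self.profile = profile
        self._log(user, "profile")

    def annotate(self, user, note):
        """Let any user contribute business context for others to reuse."""
        self.annotations.append({"user": user, "note": note})
        self._log(user, "annotate")
```

For example, `CatalogEntry("claims", {"system": "mainframe"})` followed by `annotate("ana", "amount is in USD")` leaves one annotation and one logged event on the entry, which later users can read before they touch the data.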

#4 – Good Collaboration Tools

Data analysts can make better use of data in a lake if they can learn from one another and build on one another’s work by sharing improvements made to data in the lake, such as the addition of comments or business metadata added to data sets, data cleansing and enhancement measures, or prepared data sets.

#5 – Built-in Publishing

A data lake that gives data analysts easy-to-use tools to curate custom data sets, and then publishes those data sets to user groups for access via their preferred analytic applications, simplifies the last mile of data delivery and further lowers the transaction cost of using data in the lake.

#6 – Enterprise-grade Governance and Security

Finally, organizations can lower the transaction cost of data lakes for data analysts by simply and automatically enforcing enterprise-grade data governance and data security measures for all data and users in the lake. By ensuring this functionality is built into the data lake platform itself, organizations can remove any burden on data analysts to attend to this part of the data lake process while ensuring full compliance and integration with the organization’s overall data security and management standards and systems.


Data lakes were introduced in 2010 (source), but it took until recent years for many mainstream companies to deploy lakes as a more cost-effective, agile way to meet the data needs of data scientists and the broader analytic and business user communities. This broader use of data lakes has highlighted the issue of transaction cost and brought into focus the need for new solutions that empower data scientists to address all aspects of data preparation more efficiently, so they can spend more time on analytics and less time on grunt work. As outlined here, these new solutions will need to offer more efficient, complete, and automated support for data cleansing, metadata management, collaboration and sharing, and use of almost-ready data, all within a framework of enterprise-grade security and governance. Most importantly, you should focus on solving these problems and partner closely with data scientists to reduce the transaction costs of working with data lakes, so they can deliver more analytic insights and ultimately make a greater impact on their business partners.
