ETL vs. Data Preparation

Extract, Transform, and Load (ETL) technologies, managed exclusively by IT, have until recently been the primary tool used to combine data from multiple sources and thus provide the ability to drive important business decision making for organizations. But, with the advent of self-service data preparation, business users and subject matter experts (SMEs) can find those insights on their own.

“In some organizations, each data engineer/IT resource is expected to support over hundreds of business users or subject matter experts, creating a bottleneck that can delay the arrival of critical business insights,” said Farnaz Erfan, that mean ETL is on its way out? Or are there use cases for both? Erfan talked about the similarities and differences between ETL and data preparation in a recent DATAVERSITY® interview.

Similarities

She said that many people assume that ETL and data preparation are the same thing because conceptually they are very similar, and both solve the same problem. Data in its raw format is never ready for analytics or machine learning, and both tools transform the data into a shape and form that brings it closer to being a single source of truth for uses such as analytics or machine learning, but they are not the same thing, she said.

Differences

Where ETL and data prep differ is in the complexity of the paradigm, the type of preparation required, and the user persona—and these differences determine the kind of use cases they serve in the market.

1. The User Persona

ETL tools were created over 25 years ago for technical users, and data preparation tools are geared toward modern business users. “That could be an analyst, a sales operations manager, a marketing manager—many different people within the line of business can use these tools,” commented Erfan.

Vendors offering data prep tools tend to follow a visual form of data presentation, similar to an Excel spreadsheet, where users can see the data in the center of their workspace. This allows non-technical users to investigate Data Quality issues, prepare data, validate it, and see how data values change as different rules or conditions are applied.

2. The Paradigm

ETL relies on a predetermined set of rules and workflows, she said. Potential issues, such as misspellings or extra characters, must be anticipated beforehand so rules for how to deal with those issues can be built into the end-to-end workflow.

Conversely, a data prep tool using built-in algorithms is capable of discovery and investigation of the data as it proceeds through the workflow. “For example, algorithms based on machine learning or natural language processing can recognize things that are spelled differently but are really the same.”

She gave the example of a city called “St. Louis”, and how it could be entered in multiple ways, or there may be several cities with the same name spelled differently. In an ETL workflow, rules for encountering each particular variation must be programmed ahead of time, and variations not programmed are skipped. A data prep tool can find spelling differences without help, so that the user does not have to anticipate every possible variation. The tool can prompt for a decision on each different variation on the name of this city, providing an opportunity to improve the data before it’s used, she said.

“That’s a huge shift in terms of the quality of the data that comes out of it, as well as increasing the speed with which I can get through my data. Contrast that with an ETL tool which requires that the user pre-determine the rules that need to be applied and guess which possible ways it might have been spelled.”

3. Data Complexity

Erfan who started her career in the traditional Data Management/ETL space said that most of the ETL focus is still on relational data sources, because their strong points lie in bulk-loading, and transformation of tables and relational data sources, as well as with CRM, Salesforce, and marketing applications.

Data preparation is still strong on both the relational side as well as with more complex data types, such as JSON or XML, where there are nested structures of data. A data prep tool flattens data out into a tabular format, going back to that easier-to-understand Excel paradigm that non-technical users find familiar. “You can clearly see the entire data content to prepare it for analysis, rather than trying to understand it in a nested structure,” she said.

Benefits of Data Preparation Tools

The ETL space requires conformity and a well-defined structure, or a multidimensional model for publishing into. Changes in data structure require a “back-to-the-drawing-board” approach to incorporate transformations, and data must then be re-published, resulting in a much longer process. Conversely, by using a data prep solution that can process the data in its entirety, not just samples, machine learning algorithms handles more freeform data, and therefore iterations can be done quickly, allowing for a more agile process. “Rather than taking months to do, they can be done in days,” she said.

The greatest benefit to data preparation from Erfan’s perspective is not having a bottleneck in IT. Instead, the burden to create and prepare data assets for the entire organization can be shared with end users by giving them the tools and ability to do it on their own. Business teams have already been doing visualizations and using dashboards on their own, “So why couldn’t they also prepare data for analytics on their own without waiting for IT?”

Steps to Success

For some companies, a small group or a single individual expresses the need for a data preparation solution. In others, it arises from an organizational-wide data democratization process. Either way, she said, the first step is for the organization to define “maturity” according to their own goals in the long-term.

Appropriate starting questions for that process might be:

What are the steps to maturity?
Where do you want to take your data prep?
Will users be confined to a small group or is collaboration more important?

The next step toward maturity is automation. Questions for this phase might be:

Will all data preparation workloads stay at an ad-hoc level?
Will data prep stay within a small number of users or will it be expanded to offer organization-wide certified data assets that other people can use?
How will data preparation be integrated with the data catalog and existing BI tools?

The Intersection: Where ETL Meets Data Prep

Erfan believes that ETL is at a change point. Data prep solutions are disrupting the market but, at the same, time, “ETL and BI companies are also adding data preparation because that’s a need that they see in the market,” she said. There are many situations where an ETL solution is still the best choice, especially for use cases where billions of rows of data are transformed and bulk-loaded into data warehouses and use cases where jobs must run automatically every day with no change to the scheme of the data. “Things that are predictable and repeatable go through the ETL workflow.”

One intersecting area is between ETL and visualization tools. An embedded catalog, for example, within the platform is important, it’s where certified data assets can become available to business teams. “You still have your ETL, but you also have your data visualization tools.” Data prep tools in general, can fill the gap between ETL and visualization tools, serving the data investigation and preparation needs of business users outside of IT.

Another use case she cited is using a data preparation tool to inform the programming logic of bulk load transformations using ETL. “It’s sort of like the design platform, if you will, to add a new source to existing ETL flows.”

Paxata

Paxata has been important in the self-service data preparation space, creating it as a new product category. The company was founded with the goal of offering business teams and subject matter experts the same capabilities that technical and IT teams have had access to. The product is based on the concept of self-service data prep, where an individual analyst can source and blend the data. As the workload grows to the enterprise level, governance and automation become important, and the platform has the capability to scale along with changing priorities.

Erfan said, “As you grow, how can you ensure that your data prep grows with you across all the use cases that you envision? Paying attention to that is critical for making the right investment.”

Image used under license from Shutterstock.com

BECOME A DATAVERSITY INSIDER FOR ACCESS TO 160+ COURSES

Data Topics