According to Pat Patterson, Community Champion at StreamSets, “data drift” is such a problem now that “only about one fifth of a data analyst’s time is actually spent analyzing the data.” The remainder is spent “wrangling it into shape and getting it from where it is to the actual analysis platform.” Speaking at the Enterprise Data World Conference in a session called Dealing with Drift: Building an Enterprise Data Lake, Patterson and Michael Gay (Lead Technical Architect at Cox Automotive) shared how StreamSets solved the data drift problem for Cox Automotive.
Cox Automotive consists of more than 25 different companies within the automotive space. Those companies include Kelley Blue Book, Autotrader, VinSolutions, NextGear, and international companies based in China and the UK. Gay noted:
“We do all sorts of things from buying cars, selling cars, transporting cars across the country, maintenance, scheduling; every little thing you can think of with a car.”
The Problem: Data Drift
The ability to share data from multiple areas of the same industry gives Cox a significant competitive advantage, and as their website says, “data is the integration point” for all 25+ companies. But the process used to share that data was problematic. Gay illustrated the situation:
“Kelley Blue Book [KBB] would ask Autotrader for a particular dataset. Autotrader would then ask VinSolutions for a dataset, and then KBB would also ask VinSolutions for the same dataset. Autotrader would ask KBB for VinSolutions’ dataset. So, there was this big spider web of overlap — of sharing of data — but they were never the same.”
Data would be changed, modified, affected by extract, transform, and load (ETL) processes, and subject to three different types of data drift: structure drift, semantic drift, and infrastructure drift. Patterson explained:
“Back in the days of traditional ETL, you would build your map of every incoming field to every outgoing field and the transformations that had to happen, and if you were lucky, that map would stay current for a few weeks, at most. Since those days, change has accelerated.”
At the same time, the variety of data sources has increased. “We’ve got data coming from devices, log files, click-streams – and it’s much more diverse than the traditional set of databases sitting around your enterprise,” Patterson added. New latitude and longitude columns added to the customer address to allow for more effective truck routing, for example, cause a change to the schema – what Patterson calls “structure drift.”
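As a rough illustration of structure drift, the sketch below compares an incoming record against an expected set of fields and reports what was added or dropped. The field names, the sample record, and the `detect_structure_drift` helper are hypothetical and are not taken from the talk or from StreamSets.

```python
from typing import Any, Dict, Set

# Hypothetical baseline schema for a customer address record (an assumption).
EXPECTED_FIELDS: Set[str] = {"customer_id", "street", "city", "zip"}


def detect_structure_drift(record: Dict[str, Any], expected: Set[str]) -> Dict[str, Set[str]]:
    """Report fields that appeared or disappeared relative to the expected set."""
    actual = set(record)
    return {"added": actual - expected, "missing": expected - actual}


# A record that has gained latitude and longitude columns for truck routing.
record = {
    "customer_id": 17,
    "street": "1 Main St",
    "city": "Atlanta",
    "zip": "30301",
    "latitude": 33.749,
    "longitude": -84.388,
}

drift = detect_structure_drift(record, EXPECTED_FIELDS)
print(drift)
```

A pipeline that only warns on the `added` set and keeps running is one way to tolerate this kind of change instead of breaking on it.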
“Semantic drift,” he continued, “is a little bit subtler. That is where the structure stays the same, but your interpretation of the data changes.” For example, zip codes stored in a numeric field can cause problems if the company starts selling overseas and the field now has to be alphanumeric. “What’s going to happen in moving that data around? Is it still going to work, or has any component made an assumption about the data semantics there?” Patterson asked.
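The zip code case can be sketched in a few lines. Here a hypothetical downstream component assumes every postal code is numeric, and a small check separates the values that still satisfy that assumption from those that do not. The function names and sample values are illustrative assumptions.

```python
from typing import List, Tuple


def zip_as_int(value: str) -> int:
    """A hypothetical component that assumes postal codes are purely numeric."""
    return int(value)


def check_zip_semantics(values: List[str]) -> Tuple[List[str], List[str]]:
    """Split values into those that fit the numeric assumption and those that do not."""
    ok: List[str] = []
    drifted: List[str] = []
    for value in values:
        (ok if value.isdigit() else drifted).append(value)
    return ok, drifted


# Two US zip codes and one UK postcode: same field, different meaning.
ok, drifted = check_zip_semantics(["30301", "90210", "SW1A 1AA"])
print(ok, drifted)

try:
    zip_as_int("SW1A 1AA")
except ValueError as exc:
    # The numeric assumption fails loudly here; a silent cast elsewhere might not.
    print("numeric assumption broken:", exc)
```

The structure of the record never changed, which is why a schema comparison alone would not catch this.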
“Infrastructure drift” happens when some components in the chain are upgraded, and suddenly there’s a change in related log files. “Best case: data drift happens and something breaks, and you know about it. Worst case: it just insidiously pollutes the data.”
Filling the Data Lake: Finding a Solution
Cox created a data lake as “our central repository for holding all of the data assets from all of the business units,” Gay said. “It is the single place where all of these teams come and access their data. This all lives in a current Cloudera Hadoop cluster where they access it via Hive, Spark, or MapReduce.” But getting multiple types of data from 25+ companies into one place is an expensive process. Gay noted:
“There is one single Oracle system in Autotrader that has over 1600 tables in it alone, so if we were to go and write a custom Sqoop job to pull this data in, we’d have to do it 1600 times. And we were guessing it was taking a developer about six hours to build a complete workflow.”
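To put those figures in perspective, the arithmetic can be worked through directly. The table count and the six-hour estimate come from Gay's remarks; the 40-hour working week is an assumption added here for scale.

```python
TABLES = 1600            # tables in the single Oracle system, per Gay
HOURS_PER_WORKFLOW = 6   # Gay's rough estimate per workflow

total_hours = TABLES * HOURS_PER_WORKFLOW

# Assumption (not from the talk): one developer working a 40-hour week.
HOURS_PER_WEEK = 40
developer_weeks = total_hours / HOURS_PER_WEEK

print(total_hours, "developer-hours, or about", developer_weeks, "developer-weeks")
```

That is for one source system out of many, which is the cost Gay was pointing at.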
Gay described the data lake with a tiny, pinprick-sized dot on it to illustrate how little data they had managed to ingest so far. After heavily testing a custom solution, they found that it “just didn’t work, and we weren’t able to do half the things we wanted to do. So we decided to go out and start looking for a new ingestion tool.”
Gay then compared the eight different tools they tried in their search for the right Hadoop ingestion platform. Each tool was ranked according to its ability to provide alignment with strategy, alignment with data architecture, operations manageability, development capabilities, and quality and monitoring features. Among the tools they considered were NiFi, Gobblin, RedPoint Global, Informatica, and StreamSets Data Collector. Basing their rankings on use cases, they discovered that Data Collector was the best fit.
StreamSets Data Collector: The Solution
According to Patterson, StreamSets was founded just over two years ago “with a mission of easing the transfer of data between systems, and a particular focus on big data.” The team has “a deep pedigree in enterprise data” with roots in a diverse group of organizations: Cloudera, Informatica, Apache, Salesforce, Square, Elastic, and Facebook. Data Collector was StreamSets’ first product, designed to build complex any-to-any dataflows.
Data Collector is designed for “data engineers, data scientists, and developers to build data pipelines to get your data from where it is to where you want it to be with some optional transformations along the way,” said Patterson. It is built around a web UI. “So effectively, it’s a Java application. It starts up, and then you can connect to it with a browser, and you actually build your pipeline graphically.”
Gay discussed the architecture they are using and how Data Collector fits within it:
“We wanted to separate acquisition from ingestion, meaning we wanted to be able to decouple those two things from each other so that we were able to troubleshoot and find issues faster and without breaking ingestion. So, ingestion becomes a black box where we always do the exact same thing every time regardless of data set.”
They put Kafka between acquisition and ingestion to absorb bursts and relieve back pressure from acquisition. Gay explained:
“We have spurts where data comes in really fast from a whole bunch of sources at midnight, and then at 2:00 it slows down, but then at 3:00 we get a whole bunch more data. So instead of a direct one-to-one connection, we wanted to have Kafka in the middle to hold that data so that we could process it.”
Both sides are independently scalable, he said.
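The decoupling Gay describes can be sketched with a buffer sitting between a bursty producer and a steady consumer. The example below uses Python's standard `queue.Queue` as an in-process stand-in for a Kafka topic, so it shows the pattern rather than Cox's actual setup. The burst contents and function names are hypothetical.

```python
import queue
import threading
from typing import List

# In-process stand-in for a Kafka topic; a real deployment would use a Kafka client.
buffer: "queue.Queue" = queue.Queue()
SENTINEL = None  # marks the end of the stream for this demo


def acquire(bursts: List[List[str]]) -> None:
    """Acquisition side: pushes records in bursts without waiting on ingestion."""
    for burst in bursts:
        for record in burst:
            buffer.put(record)
    buffer.put(SENTINEL)


def ingest(sink: List[str]) -> None:
    """Ingestion side: drains the buffer at its own pace, the same way for every record."""
    while True:
        record = buffer.get()
        if record is SENTINEL:
            break
        sink.append(record)


ingested: List[str] = []
# A busy burst at midnight, a quiet spell, then another burst at 3:00.
bursts = [[f"midnight-{i}" for i in range(5)], [], [f"3am-{i}" for i in range(3)]]

producer = threading.Thread(target=acquire, args=(bursts,))
consumer = threading.Thread(target=ingest, args=(ingested,))
producer.start()
consumer.start()
producer.join()
consumer.join()

print(len(ingested), "records ingested")
```

Because neither side calls the other directly, either one can be scaled or restarted while the buffer holds the backlog.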
Gay reported, “We have only really started to pull in data from a handful of the companies, and it’s been on-demand, but we’ve been able to do this with StreamSets really quickly.” Cox had a four-person team that previously managed to complete fewer than 100 ingestion jobs in a month.
“The moment we released StreamSets for them to be used, they had a seven-time performance increase. This was huge. This made it so that we were able to bring more people into our cluster and have more data in the Data Lake.”
Another goal was to make it much easier to acquire needed data, so they now use a standard form to trigger the process. “It’s very simple to just fill out a form that then deploys that pipeline onto an acquisition box,” a process that makes data accessible to any user, Gay said.
“We’ve had our vice president [who had no experience with the system] fill out the form with the values he wanted to put in, and he started to acquire data very quickly. This data would then show up on Hadoop, and [he was] able to query it very fast.”
Gay said there was a minor bump with file submission. “We have found that sometimes our acquisition team doesn’t always fill out the form 100% properly.” In one case, a group submitted jobs containing zip files but told StreamSets they were text files, so “gobbledly gunk” came through that the team had to deal with.
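One way to guard against that kind of mislabeled submission is to look at the first bytes of a file before trusting the declared type. The sketch below checks for the standard ZIP signatures. It is an assumed pre-check written for illustration, not a description of how StreamSets or Cox handled the problem, and the helper names are hypothetical.

```python
import io
import zipfile

# Standard ZIP signatures: local file header, empty archive, spanned archive.
ZIP_SIGNATURES = (b"PK\x03\x04", b"PK\x05\x06", b"PK\x07\x08")


def looks_like_zip(head: bytes) -> bool:
    """Return True if the leading bytes match a ZIP signature."""
    return head.startswith(ZIP_SIGNATURES)


def declared_type_matches(head: bytes, declared: str) -> bool:
    """Check a declared type ('text' or 'zip') against the leading bytes."""
    if declared == "text":
        return not looks_like_zip(head)
    if declared == "zip":
        return looks_like_zip(head)
    return True  # other types are not checked in this sketch


# Build a small zip archive in memory to stand in for a mislabeled submission.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("data.txt", "id,name\n1,example\n")
zip_bytes = archive.getvalue()

print(declared_type_matches(zip_bytes[:4], "text"))   # zip declared as text
print(declared_type_matches(b"id,name\n", "text"))    # genuine text
```

Rejecting the form submission at this point would keep compressed bytes from being parsed as text further down the pipeline.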
Gay said that Cox is seriously considering Amazon Web Services (AWS) for their next generation Data Lake. “One of the reasons we selected StreamSets is that it has the ability to work in AWS along with writing to AWS destinations. StreamSets allows us to write to Amazon S3 very easily.”
They are also anticipating the addition of StreamSets Dataflow Performance Manager for its monitoring capabilities. Being able to see the overall picture “would be fantastic,” Gay said. A big subject has been how to do “a centralized, federalized, or democratized acquisition model for data ingestion. Have we gotten to a point where we can give this to everybody and anybody and have that work?” They plan to add StreamSets’ newly released post-processing framework options, as well.
Other goals include streamlining access to sources, change data capture, ingestion post-processing, and integration with their existing enterprise data catalog, Gay said. “We have a homemade, homegrown data catalog that chose what our Data Lake looked like.” They want to integrate StreamSets with their data catalog as much as they can.
Data Drift Issues Disappear
“At the micro level, where you’re just talking about one or two sources or one destination, data drift leads to breakage,” Patterson said. “Best case: errors. In the worst case, at the macro level, it can bring the system to a halt, or even just prevent you getting started.”
The month before they deployed Data Collector, Gay said they had ten Sqoop jobs where a developer had to spend about four hours on each job, fixing column changes and re-running the job. Since deployment, they have seen drift occur “and nobody said a word, which is the best thing. There have been no complaints.” Previously, Patterson said, they only “knew there was a problem when somebody tried to do analysis on the data and it just wasn’t there.”
Gay added, “It either wasn’t there or it was unqueryable, and so we’d get complaints. Now we don’t have to deal with the complaints. It’s running. Just walk away.”