Click to learn more about author Richard Mohrmann.
Look, I’m telling you this as a friend. Seriously, it’s time to let go of your batch updates.
It’s not a criticism; no one is at fault. It’s just that it’s time to move on. Yes, I know what you’re thinking. Your vendors and other providers send you files. You have to FTP stuff. You have the standard excuses. “That’s just how my business operates.” “I’m trapped in my batch processes.” “No one understands me like my batch processes.” “My batch processes aren’t the problem, OK??”
It doesn’t have to be this way. Just because your friends stay stuck in the past doesn’t mean you have to be. Your ETL processes can evolve independently from your data providers.
It goes like this: When you’re setting up your ETL processes initially, you look at a file you need to incorporate and set up a loader for it. So far, so good. You extract that sucker, transform it into a form you can use, and load it right up into your database. Then you add another file and another (batch) update, and another, and another. You start to discover there are cross dependencies in your ETL process, and you introduce rules like “don’t process file B until file A has been loaded.” As you develop your systems, the number and size of the files you need to ETL grow, and they take more and more time to process. It’s still not a huge deal. So what if it takes a few hours to ETL everything? You are a U.S. (or European or PacRim) business, and there’s lots of time in the overnight run.
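A rule like “don’t process file B until file A has been loaded” is really a dependency graph. Here is a minimal sketch of how such rules might be turned into a run order, assuming Python 3.9 or later. The loader names (file_a, file_b, file_c) and the run_order helper are hypothetical, made up for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical loaders. Each key maps to the set of loaders
# that must finish before it is allowed to start.
dependencies = {
    "file_a": set(),
    "file_b": {"file_a"},            # don't process file B until file A has loaded
    "file_c": {"file_a", "file_b"},  # file C needs both of the others
}

def run_order(deps):
    """Return one run order that respects every dependency.

    Raises graphlib.CycleError if the rules contradict each other.
    """
    return list(TopologicalSorter(deps).static_order())

order = run_order(dependencies)
```

Every loader you add puts another edge in this graph, which is part of why the overnight run keeps getting longer.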
As your business and its ETL processes continue to evolve, you introduce new Data Quality measures. Of course, you can’t divine what all of those DQ checks need to be up front, so you discover them the hard way: Your users report the problem (occasionally in colorful language). So, what do you do? Well, first you need to get that client back up and running, so you change your data store to reflect corrected information and you introduce new Data Quality checks to avoid the problem in the future.
But then you discover, possibly the hard way, that you can’t just update a record someplace to implement a correction. There are interdependencies, and that innocent-looking update in the middle of the day can cause a number of other problems. You can’t add a widget that references a new gizmo until a record for that gizmo has been created. You ask yourself, where is the business logic that allows me to make changes like this consistently? And the light bulb goes on. All of that logic is contained in your ETL process! Cool. So you re-run last night’s loader with this new information, it updates the widget and gizmo tables, and it seems to work fine. That particular loader only takes 20 minutes to run. You tell your client they will be back online in a half hour. Problem solved … until the next time.
As time goes on, you find you’re ETL-ing more and more complex data sets with intricate interdependencies. You launched a new office in London, reducing your batch-run window from 12 hours to 6 hours. These new challenges mean that you can’t just run one loader during business hours. You need to run a full set of them to ensure consistency. Besides the inefficiency, doing so creates a number of risks including caching problems, dev-ops problems, and interference with your normal production overnight batch stream. We’re talking about the same batch stream that already has barely enough time to finish during that much more cramped six-hour window.
And there is talk of a new office in Singapore.
Of course, by now, you see that this just doesn’t work. It doesn’t scale. You grab your best data expert and offer to buy her a coffee. You have a heart-to-heart about the logic and production problems and she helps you list all the reasons you can’t just run a single update. You explain the strain this all puts on the business. She gets it right away and has actually been thinking about the issue and possible solutions for some time. She tells you that what’s needed is a separate tool that can tap into the loader business logic and make only the necessary changes to incorporate the update and maintain consistency without re-running the full overnight batch. You need a tool that can correctly process a single event without impacting production.
So you make a change, and your business continues to thrive. You still run your overnight except now you have an arsenal of Data Management tools that can keep you and your clients on track when adjustments or corrections need to be made. And then your data expert asks you for a sit-down. She explains how she is basically maintaining two very different systems that are implementing the same basic set of rules. She needs additional staff, hardware, and other resources to keep these two systems (batch and event-driven) up and running. And it’s unnecessary. If (and it can be a big if) we can really nail these event-driven updates, why do we even need the batch updates? Isn’t the batch process just really processing each record in the file as an individual event?
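Her question can be made concrete. What follows is only a sketch under assumed names: the store layout, the CSV columns (kind, id, name, gizmo_id), and the functions apply_event and load_batch are all hypothetical, not taken from any real system. The point is that the business rule lives in one per-event function, and the “batch loader” is nothing more than a loop that feeds it one record at a time.

```python
import csv
import io

def apply_event(store, event):
    """Apply a single record, enforcing the gizmo-before-widget rule."""
    kind = event["kind"]
    if kind == "gizmo":
        store["gizmos"][event["id"]] = event["name"]
    elif kind == "widget":
        # A widget may only reference a gizmo that already exists.
        if event["gizmo_id"] not in store["gizmos"]:
            raise ValueError(f"unknown gizmo {event['gizmo_id']!r}")
        store["widgets"][event["id"]] = event["gizmo_id"]
    else:
        raise ValueError(f"unknown event kind {kind!r}")

def load_batch(store, file_obj):
    """The batch path: every row in the file is just another event."""
    count = 0
    for row in csv.DictReader(file_obj):
        apply_event(store, row)
        count += 1
    return count

store = {"gizmos": {}, "widgets": {}}

# Overnight file (hypothetical contents), processed row by row.
overnight = io.StringIO(
    "kind,id,name,gizmo_id\n"
    "gizmo,g1,Sprocket,\n"
    "widget,w1,,g1\n"
)
loaded = load_batch(store, overnight)

# Midday correction: same function, same rule, no batch re-run.
apply_event(store, {"kind": "widget", "id": "w2", "gizmo_id": "g1"})
```

With the rule defined in exactly one place, the overnight file and the midday fix cannot drift apart, and there is only one system left to staff and maintain.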
Let’s do some performance testing and find out. Sure, it involves rolling up your sleeves and buying the data and development teams snacks and coffee for the late-night push, but you see the vision and you make it happen. More importantly, you see how developing the early batch system was, in large part, throw-away development. Sure, it gave you a platform that helped you hone your business logic, but as you now see, that logic could have been developed more quickly and less painfully if you had adopted an event-driven ETL model from the start.
We live in a world of streaming video, music, Twitter, equipment sensors, and blockchains coming out of firehoses in every direction. We can no longer survive processing this stuff in discrete chunks with huge bulk updates. We need to be just as agile as the fastest source of data we have. We need to remove batch from our strategy.
When it comes to batch updates, the first step is admitting you have a problem.