As companies look for ways to use their data to find new opportunities, the importance of real-time insights is growing. But many companies still rely on older, slower approaches to Data Management that struggle to keep up with the pace of business today. Companies know they need better ways to manage data without slowing down production systems, yet reliably combining data from multiple systems in one place is hard to do without an advanced replication approach. Data replication allows for consolidated access to operational data for real-time analytics, data streaming, and machine learning use cases.
Change data capture (CDC) is a Data Management strategy that helps companies continuously transfer data as it changes, with latency down to single-digit seconds. With faster access to the latest data, businesses can accelerate the speed at which they make decisions. I’ll explain here how log-based change data capture works, and why it’s the go-to foundation for synchronizing data from multiple database systems.
One of the biggest challenges with older data extraction approaches is that they can add significant overhead to the database systems managing the business. If a company needs the data only on a regular schedule, it can run batch processes or take snapshots when the load is low, say in the middle of the night, as long as that data isn’t likely to be needed or changed during the transfer process.
But with the 24/7 global nature of online transaction processing today, companies may not have scheduled downtime. When businesses implement predictive analytics or real-time product recommendations, yesterday’s data is not very useful. Companies need to consider how quickly they are making decisions and use this requirement to inform how frequently they update their systems. My colleague Alexander Lovell wrote about this recently in the article “What You Don’t Know About Real-Time Data Is Killing You.”
For databases, log-based change data capture replication delivers the best of both worlds in terms of speed and convenience: pipeline setup is minimal, and source system administrators have little reason to worry about performance degradation. Once data resides in the target destination, analytics teams can perform analysis without impacting production data sources or database systems.
Log-Based Change Data Capture and Alternatives
Most ACID-compliant online transaction processing (OLTP) databases maintain a transaction log that records every change in order, so all changes are available at any time and can be read by a small agent running on the database server. The transaction log is the foundation for database recovery, which means the record of changes stays consistent even if a host or source system crashes mid-transfer.
Some enterprises use high watermark-based attributes like last-modified date to track changes. This approach captures inserts and updates, because a changed row carries a newer timestamp than the last one transferred. But if a row is deleted, there’s no more last-modified date, making it hard to track deletes. One workaround is to compute the differences between source and target, but this is resource-intensive and can slow system performance.
Deletes also pose a challenge for filtered batch processing. If the data is no longer stored on the source system, a batch process may not “know” to delete the relevant row on the destination system.
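The delete problem is easy to reproduce. The following is a minimal sketch in Python, using two in-memory SQLite databases as stand-ins for a source and a target. The customers table, its columns, and the sync_by_watermark function are hypothetical names chosen for illustration, not part of any particular product.

```python
import sqlite3

def make_db():
    # Hypothetical table: one row per customer plus a last_modified counter.
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE customers ("
        "id INTEGER PRIMARY KEY, name TEXT, last_modified INTEGER)"
    )
    return db

def sync_by_watermark(source, target, last_seen):
    """Copy rows modified after last_seen and return the new watermark."""
    rows = source.execute(
        "SELECT id, name, last_modified FROM customers WHERE last_modified > ?",
        (last_seen,),
    ).fetchall()
    for row in rows:
        target.execute("INSERT OR REPLACE INTO customers VALUES (?, ?, ?)", row)
    return max([last_seen] + [r[2] for r in rows])

source, target = make_db(), make_db()
source.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", 100), (2, "Grace", 101)],
)
watermark = sync_by_watermark(source, target, 0)

# One update and one delete happen on the source.
source.execute("UPDATE customers SET name = 'Ada L.', last_modified = 102 WHERE id = 1")
source.execute("DELETE FROM customers WHERE id = 2")
watermark = sync_by_watermark(source, target, watermark)

target_rows = target.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
print(target_rows)  # the update arrived, but the deleted row 2 is still present
```

The second sync picks up the update because the row carries a newer timestamp. The deleted row leaves nothing behind to query, so it stays on the target.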
With snapshots, companies can get a complete copy of their data, but the data is frozen at the point when the snapshot is taken. Any updates made afterward can’t be accessed until the next snapshot. When companies need to transform or consolidate data without impacting the source systems, it is very hard to verify which data changed by using a snapshot alone.
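To see why working out changes from snapshots is costly, consider this small Python sketch that compares two full copies of a table. The diff_snapshots function and the sample data are hypothetical, and each snapshot is modeled as a plain dictionary keyed by primary key.

```python
def diff_snapshots(old, new):
    """Compare two full snapshots keyed by primary key."""
    inserted = {k: new[k] for k in new.keys() - old.keys()}
    deleted = {k: old[k] for k in old.keys() - new.keys()}
    # Every key present in both snapshots has to be compared value by value.
    updated = {k: new[k] for k in new.keys() & old.keys() if new[k] != old[k]}
    return inserted, updated, deleted

monday = {1: "Ada", 2: "Grace", 3: "Linus"}
tuesday = {1: "Ada L.", 3: "Linus", 4: "Barbara"}

inserted, updated, deleted = diff_snapshots(monday, tuesday)
print(inserted, updated, deleted)
```

The comparison touches every row in both copies, even when only a handful changed, and it only sees the net difference. A row that was changed several times between the two snapshots shows up as a single update, and the intermediate values are not recoverable.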
A big advantage of the real-time log-based change data capture approach is that it only adds a minuscule load to the primary source database system to track changes. Change data capture enables data to be transferred as changes happen. This approach uses a binary log reader to parse the transaction log directly, with no intermediate API layers that could slow down or limit data transfer. Reading the log does not interfere with transaction processing, and the reader can even run on a standby system or read from backups of the log.
Analytics algorithms often need to combine data from multiple sources such as CRM, production data, or first-party behavior to run machine learning models, and this work is computationally intensive. Change data capture gives companies data that is fresh enough to support better-informed decisions and predictions, without impacting the main database.
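The consuming side of this flow can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any specific product works: the ChangeEvent shape, its position field, and the apply_changes function are hypothetical, and the database-specific binary log format that a real reader parses is not modeled here.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ChangeEvent:
    position: int               # hypothetical ordering key from the log
    op: str                     # "insert", "update", or "delete"
    key: int                    # primary key of the affected row
    row: Optional[dict] = None  # new row contents; None for deletes

def apply_changes(target, events, last_applied):
    """Apply events in log order and return the new checkpoint."""
    for event in sorted(events, key=lambda e: e.position):
        if event.position <= last_applied:
            continue  # already applied, so replaying after a restart is safe
        if event.op == "delete":
            target.pop(event.key, None)
        else:
            target[event.key] = event.row
        last_applied = event.position
    return last_applied

replica = {}
log = [
    ChangeEvent(1, "insert", 1, {"name": "Ada"}),
    ChangeEvent(2, "insert", 2, {"name": "Grace"}),
    ChangeEvent(3, "update", 1, {"name": "Ada L."}),
    ChangeEvent(4, "delete", 2),
]

checkpoint = apply_changes(replica, log, 0)
# Replaying the same events from the saved checkpoint changes nothing.
checkpoint = apply_changes(replica, log, checkpoint)
print(replica, checkpoint)
```

Because the delete arrives as an explicit event, the target removes row 2 without any comparison against the source. Keeping a checkpoint of the last applied position is one simple way to resume after an interruption without applying a change twice.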
Use Cases for Real-Time Data
As online interactions become more customized for each person, there’s a bigger need for consolidated data that can solve time-bound problems. Marketing requirements like instant recommendations, adjusting the flow of a website, or creating unique offers all call for decisions in milliseconds based on real-time customer behaviors.
Combining multiple real-time data streams is more complicated still, but it is essential for industrial companies. Take the example of an industrial company that builds and maintains railroad locomotives. These types of companies face a significant data science problem when trying to implement predictive maintenance with older Data Management techniques. Sensors in the equipment collect wear-and-tear data that can indicate when failures are likely to happen. Parts must be replaced by a qualified engineer before failure happens. However, the locomotive moves around, and the goal is to minimize downtime so the railroad can maximize equipment utilization. For example, if an engine needs work, the railroad needs to confirm that Locomotive 7280-45 will be in Chicago on Tuesday and ensure the right parts and mechanics are also available at the same time and place. These companies use an AI system to combine asset location data, scheduling, part inventory, and labor needs in one place so they know the work can be done as scheduled. But working from yesterday’s data makes this almost impossible.
For a major beverage manufacturer, these problems take on a different scale when the company combines SAP ERP data, bottler capacity, sales forecasts, and other signals all in one data warehouse. A company this complex needs tons of data from multiple sources to find insights. If real-time data and analysis aren’t available, a business may not be able to respond to market changes quickly enough.
Is Log-Based Change Data Capture Right for You?
When evaluating how to improve your Data Management system to get the type of real-time analysis that’s essential for business today, there are a few key areas to check for readiness.
- Identify the connectors and data sources you’ll be pulling from and consider what sources might be needed in the future. A good framework showing how data is coming in and flowing out will guide you to the best way to manage that data, and thinking about how you might incorporate other data sources will set you up for future success.
- Ensure you have access to the transaction log on your database server. Work with your IT team to identify who will need access, and any permissions needed to transfer data successfully. Running an agent on the server is particularly worth considering, because it helps achieve the best performance with minimal impact and track the highest volume of data. Look at which systems have the ability to run an agent architecturally close to where the data is getting processed.
- Finally, ask yourself what more real-time access to data would mean to your organization. What can you learn about your business with better insight? Where are the opportunities for growth, or areas to cut back on resources? If you had more control of your company’s data, how could you save money and be more efficient?
As businesses move more of their data to the cloud to leverage SaaS reliability and maximize machine learning results, feeding those systems with the latest data is crucial to business success. Log-based change data capture is one of the best ways to quickly capture and utilize valuable data while also storing it safely for further analysis. By reliably transferring data using a log-based change data capture approach, companies can have access to new insights without slowing down high-demand database systems.