Using CDC to Power Real-time Analytics

Click here to learn more about author Joe DeBuzna.

Snowflake is the first data warehouse and analytics service to be built for the cloud. Since Snowflake’s introduction in 2012, companies have flocked to it in order to minimize costs and simplify implementation and management of their data warehouse infrastructure. With Snowflake’s cloud-native architecture, organizations no longer need to purchase, install, configure, and manage the infrastructure required for data warehouse administration, scaling, optimization, availability, and data protection.

As organizations adopt Snowflake they will seek solutions for migrating data from their operational systems to this cloud-based data warehouse more quickly, efficiently, and effectively.

Previously, when organizations transferred data between on-premises transactional databases and on-premises data warehouses, data moved over their internal local area network (LAN). LANs offer plentiful bandwidths of 100 Mbps, 1 Gbps, or 10 Gbps. But as organizations adopt Snowflake (or any other cloud data platform), they must transfer data over a wide area network (WAN). With bandwidths of 4 Mbps, 8 Mbps, 20 Mbps, 50 Mbps, or 100 Mbps, WAN bandwidth is quite limited.

Organizations need a solution to help them transfer data to Snowflake in a way that makes the most effective use of available bandwidth securely and with minimal latency.

This post is about three common approaches organizations use to migrate their data to Snowflake.

Greenfield

With the Greenfield approach, organizations leave their existing data warehouse as is. But they turn to Snowflake when they add a new data analytics use case. For example, when a user or business unit comes to IT with a new set of requirements, such as a new type of report or the need to use sentiment analysis, the IT organization meets these demands using Snowflake.

Snowflake is enticing for organizations implementing new data warehousing use cases because it allows them to “fail fast.” Instead of having to physically add new servers, disks, memory, and concurrency to support a new data mart, which involves significant budget and time for approvals, organizations can provision new data warehousing capabilities instantly in the cloud with Snowflake. That means they can quickly and easily create and test a pilot project. If the pilot works, they can scale it up. If it doesn’t, they can simply kill the project and start over with a new technology. Thus, the Greenfield approach gives organizations the agility to easily experiment with different technologies.

Brownfield

Brownfield is a lift and shift approach. Organizations simply migrate off of their old analytics datastores and start replicating the data from their operational systems to a new instance of Snowflake. One common reason to take this approach is scalability. Data is increasing exponentially. As an organization’s data grows, it needs to add, configure, and manage hardware in order to scale up. Adding this hardware can be a costly and time-consuming process. Snowflake provides horizontal scalability on demand. The organization doesn’t have to know what its workload will be five years from now. It can take advantage of Snowflake’s elastic scalability to have data warehousing power on demand.

Bluefield

With the Bluefield approach, organizations migrate a subset of their data to Snowflake. Organizations have different use cases for their data warehouse. One use case might be financial reporting, another compliance auditing, still another could be predictive analytics for a recommendation engine. Different use cases might operate better using different data warehousing technologies. The Bluefield approach gives organizations the flexibility they need to mix and match the technology with the use case that’s most appropriate, whether that’s a new or old solution.

What to Look for in a Data Replication Solution

Regardless of which approach is chosen to perform a Snowflake migration, consider looking for a data replication solution that offers change data capture (CDC) technology. CDC uses bandwidth in a highly efficient manner to streamline data transfer over the WAN. Look for a solution that is also real-time, secure, agile, and scalable and that offers native support for Snowflake. Most importantly, a solution that will not require you to open your firewall to your on-premises production databases to perform this migration is ideal.

Greater Bandwidth Efficiency: CDC uses network bandwidth more efficiently than other replication solutions because it only moves the data that’s changing, rather than moving the entire table during a single batch update as is typical with ETL processing. A CDC solution that compresses data for transmission optimizes network bandwidth further.
Real-time Data Movement: Log-based CDC solutions enable you to move changes virtually as soon as they occur in the source. Such a solution mines source transaction logs in real time (if permitted by the database), reads the data directly from the I/O cache to achieve high performance, and minimizes the impact on the source databases.
Data Security: A CDC data replication solution can secure data both in transit and at rest. Look for a solution that uses SSL or TLS to secure data in transit and provides transparent data encryption for data at rest. When using an Agent or Proxy to integrate data between on-premises data warehouses and Snowflake in the cloud, ensure that the solution’s agents/hubs are trusted entities with built-in validation. Find out if the solution provides additional security by requiring that the firewall open only a single machine and port pair in a single direction.
Distributed Architecture: This is a best practice for performing data transfers from on-premises transactional systems to the cloud-based Snowflake system over a WAN. In this mode, an agent local to the source database listens to transactions for committed changes, captures only the necessary subset of changes, compresses and encrypts the data and then streams the data in a proprietary format to the hub. The hub then routes the data to an agent close to the target (Snowflake). The target agent decrypts and decompresses the data and applies the data locally. This configuration eliminates the need to open the database or access the data across a WAN, delivering greater flexibility and better performance for customers that demand high-volume and high throughput. A CDC that employs a distributed architecture can scale to enterprise levels. A distributed architecture has minimal impact on the source system because the load capture reads directly from the source system’s I/O cache; it does not compete with the source system for disk resources. In addition, data is efficiently compressed before being sent across the network to reduce bandwidth requirements and latency over long distances.

Native Support for Snowflake. Snowflake takes data from a staging location. A solution that supports that staging area natively makes data transfer much faster.

Data Topics

Leave a Reply Cancel reply