Virtualize or Replicate? Accessing Your Data in Hybrid Cloud Architecture

Click here to learn more about Joe deBuzna.

The term “hybrid cloud” has no single definition. When it was first used, it typically meant a combination of a private and public cloud. Since then, the term has taken on a broader meaning, including (public) cloud offerings from multiple vendors. For this post, I’m using Gartner’s definition: “Hybrid cloud computing refers to policy-based and coordinated service provisioning, use, and management across a mixture of internal and external cloud services.”

Why Do You Need a Hybrid Cloud?

There are multiple reasons to consider hybrid cloud computing:

Security and data privacy concerns drive your organization to maintain an on-premises cloud service and store less sensitive information in a public cloud with the flexibility to scale up and down as needed.
To avoid vendor lock-in, your organization chooses multiple cloud vendors instead of one.
As a result of the use of Software as a Service (SaaS), you end up using multiple clouds.
You need or want access to a technology service that is only available on a specific cloud.

Data Virtualization vs. Data Replication

In hybrid cloud computing, data needs to be integrated between multiple cloud-based and on-prem sources. There are two dominant schools of thought on how best to do this:

1. Data virtualization, which combines the data upon request

2. Data replication, which makes a copy of the data

Data virtualization enables access to data without knowing how or where the data is stored. In a virtual — or federated — database, data is interconnected via a network, but it isn’t moved. Data stays in place and is accessed from the heterogeneous sources only when queried.
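To make the idea concrete, here is a minimal sketch of a federated query. It assumes two in-memory SQLite databases standing in for an on-premises source and a public cloud source; the table names, sample rows, and the order_totals_by_region function are hypothetical and only illustrate the pattern, not any particular data virtualization product.

```python
import sqlite3


def make_source(ddl, insert_sql, rows):
    """Create an in-memory SQLite database standing in for one data source."""
    conn = sqlite3.connect(":memory:")
    conn.execute(ddl)
    conn.executemany(insert_sql, rows)
    return conn


# Hypothetical on-premises source: customers.
on_prem = make_source(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT)",
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Acme", "EU"), (2, "Globex", "US"), (3, "Initech", "EU")],
)

# Hypothetical public cloud source: orders.
cloud = make_source(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)",
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 250.0), (11, 2, 90.0), (12, 3, 40.0), (13, 1, 60.0)],
)


def order_totals_by_region(region):
    """Answer one query by combining both sources at request time."""
    # Filter and project at the first source so only matching customers leave it.
    customers = on_prem.execute(
        "SELECT id, name FROM customers WHERE region = ?", (region,)
    ).fetchall()
    if not customers:
        return {}
    ids = [cid for cid, _ in customers]
    placeholders = ",".join("?" * len(ids))
    # Ask the second source only for those customers, aggregated at the source.
    totals = dict(
        cloud.execute(
            "SELECT customer_id, SUM(amount) FROM orders "
            f"WHERE customer_id IN ({placeholders}) GROUP BY customer_id",
            ids,
        ).fetchall()
    )
    return {name: totals.get(cid, 0.0) for cid, name in customers}


print(order_totals_by_region("EU"))
```

Nothing is copied ahead of time in this sketch: both sources are hit on every call, which is why the questions about volume, source capacity, and concurrency matter.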

You need to ask a few questions when deciding if data virtualization could work for your organization:

What are the data volumes involved?
How smart is the data virtualization layer at avoiding the extraction of large data sets for each and every query?
How much surplus capacity does your data source system have available to allow (ad-hoc) data retrieval?
To what extent is data from different sources combined in a single query, and what are the performance requirements for response times?
How many applications or users will be making data virtualization requests at any one time? In other words, what load will this put on the source systems?
What kind of infrastructure is needed to host the data virtualization technology?

An alternative to data federation is data consolidation, which requires data to be replicated.

When data is replicated, access no longer results in load or latency on the data source. Once a copy is replicated, it resides in a separate data store. Also, a lot of the (heavy) processing of combining data sets, filtering data, and computing aggregate information can be pushed down into the data platform instead of being done in the virtualization layer. In addition to taking a full copy of the data, there are multiple ways to perform change data capture (CDC), a method of replication that keeps the target in sync in near real time, with less impact on the data sources and lower latency than repeatedly taking full copies.
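As a rough illustration of the apply side of CDC, the sketch below assumes change events arrive as an ordered list of inserts, updates, and deletes keyed by primary key. The event format and the apply_changes function are hypothetical; how changes are actually captured at the source (log reading, triggers, timestamps) is not shown.

```python
def apply_changes(target, changes):
    """Apply an ordered list of change events to a replica keyed by primary key."""
    for change in changes:
        op, key = change["op"], change["key"]
        if op in ("insert", "update"):
            target[key] = change["row"]
        elif op == "delete":
            target.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return target


# Replica as it looks after the initial full copy.
replica = {
    1: {"name": "Acme", "region": "EU"},
    2: {"name": "Globex", "region": "US"},
}

# Hypothetical change events captured from the source since the last sync.
changes = [
    {"op": "update", "key": 2, "row": {"name": "Globex", "region": "EU"}},
    {"op": "insert", "key": 3, "row": {"name": "Initech", "region": "US"}},
    {"op": "delete", "key": 1},
]

apply_changes(replica, changes)
print(replica)
```

Only three small events cross the network here, regardless of how large the full table is.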

Network Efficiency

A hybrid cloud computing architecture introduces Wide Area Network (WAN) communication between the different clouds. Even though the available bandwidth on this WAN is generally high, speed and responsiveness (latency) don’t quite match Local Area Network (LAN) connectivity. So then, how do you get the most efficiency out of your network?

1. Move only the data you need. For a data virtualization environment, you pass the minimal data to satisfy a query, knowing it will get retrieved for every query. For a replicated data set, change data capture (CDC) is favorable over any approach that would repeatedly perform full data extracts, as it only extracts changes in the data. Also, filter and project data (eliminate columns/fields) that aren’t required.

2. Use data compression. On top of moving only the data that you need, you can “magnify” the available bandwidth by compressing the data before it moves: with a 10:1 compression ratio, transferring the same volume of data takes one-tenth of the bandwidth.

3. Bundle data that goes over the wire to maximize bandwidth despite higher latency. A WAN connection introduces extra latency over a LAN connection. Sending larger bundles means fewer round trips spent waiting for the acknowledgment that data was correctly received, which lowers the sensitivity to this higher latency while still achieving good throughput.
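The three steps above can be sketched together using only the Python standard library. The sample rows, column names, and batch size of 250 are made up for illustration, and the compression ratio you get in practice depends entirely on your data.

```python
import json
import zlib


def project(rows, columns):
    """Keep only the columns the target actually needs."""
    return [{c: row[c] for c in columns} for row in rows]


def batches(rows, size):
    """Yield successive bundles of up to `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def encode(batch):
    """Serialize one bundle and compress it before it goes over the wire."""
    return zlib.compress(json.dumps(batch).encode("utf-8"))


def decode(payload):
    """Reverse of encode, as the receiving side would run it."""
    return json.loads(zlib.decompress(payload).decode("utf-8"))


# Hypothetical source rows with a wide column the target does not need.
rows = [
    {"id": i, "status": "shipped", "notes": "n/a " * 20, "amount": i * 1.5}
    for i in range(1000)
]

needed = project(rows, ["id", "status", "amount"])
raw_bytes = len(json.dumps(needed).encode("utf-8"))
payloads = [encode(b) for b in batches(needed, 250)]
sent_bytes = sum(len(p) for p in payloads)

print(f"{len(payloads)} bundles, {raw_bytes} bytes raw, {sent_bytes} bytes compressed")
```

With 1,000 rows in bundles of 250, the sender waits for four acknowledgments instead of a thousand.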

Security

“There are only two types of companies — those that know they’ve been compromised and those that don’t know. If you have anything that may be valuable to a competitor, you will be targeted and almost certainly compromised.” – Dmitri Alperovitch, VP of Threat Research for McAfee

Data security is top of mind for most organizations. A data breach can result in significant reputational damage and can also have severe financial ramifications. Make sure your hybrid cloud architecture implements sound security best practices, especially with data flowing into and out of public cloud infrastructure.

1. Use encryption, both in flight and at rest. Encrypt your data using a strong industry-standard algorithm such as AES-256. Use unique (securely stored) certificates so that if your organization is compromised, the perpetrator(s) will have a hard time making sense of the data.

2. Lock down firewalls as much as possible to lower the likelihood of getting compromised, especially access into your primary data processing systems. Consider using a proxy that is both the gateway to a data endpoint and the gatekeeper to prevent unauthorized access.

3. Use secure and strong authentication, again to reduce the likelihood of data getting compromised.
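For the in-flight half of the first point, a minimal sketch of a client-side TLS configuration with Python's standard ssl module might look like the following. It covers only the connection settings; encryption at rest, key storage, and authentication depend on your platform and are not shown. Requiring TLS 1.2 as the minimum version is an assumption, not a recommendation from this post.

```python
import ssl

# Client-side context for connections to a remote data endpoint.
# The default context verifies server certificates and host names.
context = ssl.create_default_context()

# Assumption for this sketch: refuse protocol versions older than TLS 1.2.
context.minimum_version = ssl.TLSVersion.TLSv1_2

print(context.verify_mode, context.check_hostname, context.minimum_version)
```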

Data Accuracy

The last consideration is data accuracy. In a data virtualization architecture, this shouldn’t be a concern since you use direct access to the data sources. However, for a replicated data set, data accuracy is an important consideration deserving of a bit more attention.

How can you be sure that the consolidated target data set is a correct representation of the data source? Confidence comes in the form of a data validation solution that routinely checks data values beyond simply comparing row counts. The technology should have a strategy to validate the data against data sources that are active 24/7, where rows keep changing while the comparison runs.
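One way to check values rather than row counts is to compare a digest of each row on both sides. The sketch below assumes both data sets fit in memory as dictionaries keyed by primary key; the row_digest and compare functions and the sample data are hypothetical. It does not handle an active source: a row that changes mid-comparison would show up as a difference and would need to be rechecked once replication catches up.

```python
import hashlib


def row_digest(row, columns):
    """Hash the values of one row in a fixed column order."""
    joined = "\x1f".join(repr(row[c]) for c in columns)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def compare(source, target, columns):
    """Return keys that are missing, unexpected, or different in the target."""
    src = {k: row_digest(r, columns) for k, r in source.items()}
    tgt = {k: row_digest(r, columns) for k, r in target.items()}
    return {
        "missing": sorted(src.keys() - tgt.keys()),
        "extra": sorted(tgt.keys() - src.keys()),
        "different": sorted(k for k in src.keys() & tgt.keys() if src[k] != tgt[k]),
    }


# Both sides have two rows, so a row-count check alone would pass.
source = {1: {"name": "Acme", "amount": 250.0}, 2: {"name": "Globex", "amount": 90.0}}
target = {1: {"name": "Acme", "amount": 250.0}, 2: {"name": "Globex", "amount": 9.0}}

report = compare(source, target, ["name", "amount"])
print(report)
```

Comparing digests means only a short hash per row has to be exchanged between the two sides, not the full row.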

Hybrid cloud systems are only going to increase in popularity, so companies need to have a sound strategy for integrating the data. Will data virtualization or replication work better for your hybrid cloud architecture? The factor that usually carries the most weight is the size and complexity of your workload. The more complex your use case or the higher your data volume (or both), the more likely it is that data replication is the best choice for your architecture. Regardless of the integration method you choose, virtualization or replication, considering all the angles will help you make an informed, and thus confident, choice.
