Data virtualization software provides an integrated view of disparate data sources, regardless of their physical locations. Its abstraction layer accommodates structured, semi-structured, and unstructured data, enabling queries and results in real time, or caching results for near-real-time access.
Once the connectors are in place to all of the appropriate sources, the canonical business views can be tailored to access and security requirements as well as to metadata and governance concerns. Publication can occur through a variety of means, such as Service-Oriented Architecture, SQL, SharePoint, or other platforms.
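The pattern described above can be illustrated with a minimal sketch in Python. All names here are hypothetical stand-ins, not a real product API: two source connectors are federated into one canonical view, and column-level access rules are enforced before results ever reach the consumer.

```python
# Hypothetical sketch of a canonical business view over two connectors,
# with column-level security applied in the abstraction layer.

def crm_connector():
    # Stand-in for a structured source (e.g., a CRM database).
    return [{"customer_id": 1, "name": "Acme", "ssn": "123-45-6789"}]

def billing_connector():
    # Stand-in for a second source keyed on the same customer_id.
    return [{"customer_id": 1, "balance": 250.0}]

# Governance rule: the "analyst" role never sees the ssn column.
ALLOWED_COLUMNS = {"analyst": {"customer_id", "name", "balance"}}

def canonical_customer_view(role):
    """Join the sources on customer_id, then mask columns by role."""
    billing = {row["customer_id"]: row for row in billing_connector()}
    allowed = ALLOWED_COLUMNS[role]
    results = []
    for row in crm_connector():
        merged = {**row, **billing.get(row["customer_id"], {})}
        results.append({k: v for k, v in merged.items() if k in allowed})
    return results
```

Calling `canonical_customer_view("analyst")` returns the joined customer record with the sensitive `ssn` column stripped out, which is the essence of publishing a governed view rather than raw source tables.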
Despite the increasing presence of this technology and its applications, particularly in the realms of Business Intelligence and Big Data, there are a number of industry-wide myths regarding its efficacy that still persist. This article clarifies a number of these misconceptions.
We’ve spent years building up our warehouse—why do we need virtualization?
There is still a tendency within the industry to polarize the concepts of Data Virtualization and data warehousing. The reality is that oftentimes the former can augment the latter. With the number of semi-structured and unstructured data sources growing daily thanks to the prevalence of Big Data technologies, it has become increasingly difficult to consolidate all of that data into a proprietary warehouse. Additionally, the process by which such data is queried and analyzed oftentimes creates a substantial IT backlog, which alienates business users and decreases reliance on data, resulting in less-informed decision-making.
- Mergers and Acquisitions
One of the most convincing use cases for Data Virtualization was presented at Enterprise Data World 2013 in Masha Bykin and Anthony Kopec’s session entitled “AAA Changes Wheels While Driving: Virtualized Data Improves Structure and Flexible Architecture.” The pair detailed how Data Virtualization provided AAA with an ideal means of federating data from different sources, allowing it to combine data dating back several years while enhancing its agility, minimizing point-to-point connections, and increasing data availability.
Denodo Senior Vice-President Suresh Chandrasekaren stated that the costs associated with implementing Data Virtualization software are approximately one-third of those associated with implementing a data warehouse. Those savings increase when one factors in maintenance costs and the time IT personnel spend consolidating data and processing queries, which detracts from valuable time that could be spent meeting business needs. According to Chandrasekaren:
“The analyst time that it takes to define requirements, define the schema of your tables, substantiate your warehouse, build your tables, and populate the ETL is a longer stack than the whole idea of Agile BI which is okay, I may not get it 100 percent right at first, but let me build up the views that my business users want, expose it to them quickly and make changes along the way so we get feedback.”
Traditional data warehouses will never lose their value for pre-staging information for analytics or for maintaining historical persistence. However, they can be supplemented with a hybrid approach that enables wider integration of sources in real time.
Aren’t there scalability and query performance issues with this maturing technology?
Some of these negative perceptions of Data Virtualization are attributable to its initial phase as data federation technology, in which there were substantial performance issues. But with the computing advances of the last several years (including increased bandwidth, chip speed, and memory), virtualization software is not only able to execute queries in real time, but also to cache results for more profound (and time-consuming) analysis.
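The real-time-versus-cached distinction can be sketched in a few lines of Python. This is a simplified illustration with hypothetical names, not how any particular vendor implements caching: results are served from a recent snapshot while a time-to-live has not elapsed, and fresh queries go back to the sources once it has.

```python
import time

# Hypothetical sketch: cache federated query results with a time-to-live
# (TTL), so repeated or expensive analyses run against a recent snapshot
# ("near real time") while expired entries trigger a fresh source query.

_cache = {}  # query string -> (timestamp, result)
TTL_SECONDS = 300

def query_sources(query):
    # Stand-in for a real-time federated query against the live sources.
    return f"rows for {query!r}"

def cached_query(query, ttl=TTL_SECONDS):
    now = time.time()
    entry = _cache.get(query)
    if entry and now - entry[0] < ttl:
        return entry[1]            # near real time: serve the cached snapshot
    result = query_sources(query)  # real time: go back to the sources
    _cache[query] = (now, result)
    return result
```

The trade-off mirrors the one the article describes: a cached result is slightly stale but cheap to serve, while a direct query is current but costs a round trip to every underlying source.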
According to a white paper issued by Business Intelligence industry expert and founder of BI Leader Consulting Wayne Eckerson entitled “Data Virtualization: Perceptions and Market Trends,” virtually the only flaw still associated with Data Virtualization in terms of query performance is the fact that it is:
“…susceptible to unexpected changes in source systems that can hamper performance or alter query results. Although this is true in conventional data warehousing environments as well, data virtualization software exposes changes immediately to end users. In some sense, this direct connection to source systems is a good thing because it alerts business users to problems that only they can fix through political pressure…”
Additionally, scalability concerns with virtualization software should be largely assuaged by the fact that Big Data sources such as Hadoop are compatible with many of today’s virtualization products. A survey reported in the aforementioned white paper reveals that professionals’ lack of knowledge and expertise regarding Data Virtualization is more of an inhibitor to adoption than concerns about performance and scalability.
Isn’t Data Virtualization primarily used for BI?
The analytics capability of Data Virtualization software is impressive, and one of the principal drivers of its adoption. According to Eckerson:
“The problem with BI is getting access to data to users. If the data is all over the place, it’s almost impossible for your average user to get it and even makes it harder for your power user. Giving users one place to go to get any data they want no matter where it’s located? That’s pretty revolutionary.”
However, contemporary Data Virtualization extends its benefits well beyond analytics. There are its inherent security benefits, since data is accessed through an abstraction layer that strictly regulates who can view which facets of the data. Its utility for integration extends beyond BI, as the aforementioned AAA use case demonstrates. It is a principal tool for enhancing agility, since it provides expedient access to data and virtually instant feedback on how that data is being used. The reduced time to market and increased simplicity associated with Data Virtualization may be even more valuable than its real-time querying, although these aspects of its performance are certainly related.
Is Data Virtualization Really Real Time?
Responses to queries in applications that utilize Data Virtualization are issued in real time, with the software choosing among a number of methods (such as parallelization or delegating the queries to the sources) to provide an optimal response. According to Chandrasekaren, approximately two-thirds of the queries run through Denodo’s virtualization products access data in real time. The rest utilize cached data, which is why there is still significant value in utilizing data warehouses alongside virtualization tools. Chandrasekaren commented on the real-time benefits of this technology in relation to its other advantages:
“One thing I want to emphasize is that the primary benefit of data virtualization I would almost argue is not just the fact that it’s real-time information. That’s an important benefit, don’t get me wrong. But the primary benefit is agility, time to market, and simplicity both in terms of how quickly and easily to integrate different sources, but do so without being bogged down by the physical infrastructure.”
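One of the optimization methods mentioned above, delegating sub-queries to each source and running them in parallel, can be sketched as follows. The source names and the `delegate` helper are hypothetical placeholders for whatever push-down mechanism a real engine would use, not any vendor's actual API.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical sketch: a federated query is split into sub-queries, each
# delegated (pushed down) to its source system, executed in parallel, and
# the partial results are merged into a single response.

def delegate(source_name, predicate):
    # Stand-in for pushing a filter down to the source system itself,
    # so each source returns only its matching rows.
    return [f"{source_name}:{predicate}"]

def federated_query(predicate, sources=("warehouse", "crm", "hadoop")):
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        partials = pool.map(lambda s: delegate(s, predicate), sources)
    # Merge the per-source partial results into one flat result set.
    return [row for part in partials for row in part]
```

Because the sources are queried concurrently rather than one after another, total latency approaches that of the slowest single source, which is part of how modern engines overcome the performance problems of early data federation.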
There are actually few areas of Enterprise Data Management that Data Virtualization does not enhance. It is difficult to match in its ability to integrate and federate disparate data sources, and provides a layer of security between the actual data and its use. It can certainly help to optimize analytics by presenting results in real time, and is an ideal means of facilitating Agile BI. Additionally, it supports the use of multiple BI tools, which has distinct advantages.
However, Data Virtualization is also useful as a means of prototyping and assisting Data Scientists with analytical sandboxes. Lastly, it functions well as a means to augment traditional data warehouses, although it certainly can be employed as a virtual warehouse as well.