The ability to provide a single place for instantaneous data access can mean business continuity or closure. Many nations found this out during the recent global crisis, as countries needed to know the number of tests taken and the infection rate in order to determine both the virus’ spread and who to quarantine. Unfortunately, the required data did not become available in time to prevent a widespread lockdown, shuttering many businesses. Simultaneously, demands for immediate integrated critical data only increased.
Getting high-quality data now, for discovery and solutions, is a top priority for Ravi Shankar, the Senior Vice President and Chief Marketing Officer for Denodo. In a recent DATAVERSITY® interview, Shankar explained how data virtualization technologies are evolving and creating a logical data fabric, bringing all data into one place and enabling better and faster business decisions.
He described past attempts to do this physically and the resulting challenges. He also outlined the characteristics a logical data fabric needs and explained how data virtualization delivers them, weaving data together into one place.
Data Integration and the Physical Platform
Ravi Shankar has seen the data market come a long way in the last thirty years, by improving on the physical platforms, starting with databases. Shankar noted:
“People needed to analyze and make sense of data in one place. The database became a popular solution to store and find all this information, rather than dealing with data spread out across different locations.”
As the drive for databases gained momentum, multiple databases and vendors emerged to solve different kinds of business problems. Companies like Microsoft and Oracle sold database products at a low cost, overtaking older solutions. Fast forward to the 1990s. Business required integrating all the data from transactional and operational databases into one location, said Shankar. Companies that wished to consolidate data spurred the data warehouse concept. He remarked, “Once again, we tried to get a single source of information, where we could do our analysis.”
Data warehouses became widespread and diverse. As a result, many companies ended up with several data warehouses. Data marts (subject-oriented repositories that pipe information from the data warehouse for specific business services) emerged, adding to the data warehouse glut. To solve this problem, data warehousing vendors promoted one single enterprise data warehouse, which received some adoption. So, data warehouses solved the problem of housing all data in one location for analytical purposes, but they were limited only to structured data.
However, around the turn of the millennium, unstructured data from social media grew rapidly and also began to appear in the cloud. Data warehouse technology could not handle this unstructured data. So, “data lakes grew as desired repositories that could store any data type in its raw format and leave the normalization of the data to the time of access,” explained Shankar. In turn, companies ended up with multiple data lakes (one for marketing, one for sales, etc.). But today, the preference remains to centralize all data. Shankar said:
“There has always been a need to have data in a single place because it is easy to find and use. But data gravity pulls the data back to different sources, since the data is continuously updated there as the business operates. I hypothesize that the rate of change of data in the sources, as well as invention of new data types, far exceeds the human capacity to physically pull them all to one focal point.”
Having one physical data repository has not been working. He advocates instead for a data view through a logical data fabric. He added:
“A logical data fabric is becoming more popular in knitting all the disparate data together. Leave the data where it is stored and give the business a unified view. If you try to replicate and bring that data together in a repository, moving that data takes time and costs more to store. In the meantime, the data gets out of synch, needs to be fixed and accessible to use. This approach, a physical data fabric, does not provide information immediately. A logical data fabric makes sense, providing real time access to the source much faster.”
Data Virtualization and the Logical Data Fabric
The logical data fabric, according to Shankar, relies on three characteristics:
- Physical Location Should Not Matter: Logical data fabric can connect multiple sources in different places. Whether that data lives in the cloud, a corporate data center, or with a third-party entity (e.g., a supplier or vendor), logical data fabric can link all this data together.
- Data Format Should Not Matter: A logical data fabric solution focuses on joining data in one view, whether structured, unstructured, or semi-structured. Picture data in data warehouses, XML documents, email, Word documents, and Hadoop all sewn together.
- Latency Should Not Matter: Data may be static, as in the records sitting in a data warehouse, or in motion, as it streams from instant messaging or live video. A logical data fabric needs to handle both types of data in whatever timeframe generated.
Shankar sees data virtualization as a data abstraction layer knitting together “disparate data in real time.” Data virtualization keeps the integrated data up to date in this layer and accessible to the user in real time as data is updated or changed in the sources. It provides a “universal semantic layer across multiple consuming applications,” and:
“The data virtualization layer knows what data resides and where. The technology brings them all together immediately into a coherent view (e.g., a chart, a table, or a report). The moment the business user receives the information, he or she can take action, just in time, say to contact the most profitable customers and upsell and improve those users’ experience.”
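The idea of a layer that knows where data resides and assembles a view on request can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the two sources (a SQLite database standing in for a CRM, and CSV text standing in for an order export), the `VirtualCustomerRevenue` class, and its `rows` method are hypothetical and do not represent Denodo's product or API. The point is only that the view stores no data of its own and reads the sources at query time.

```python
import csv
import io
import sqlite3

# Source 1 (hypothetical): a CRM held in a relational database.
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
crm.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])

# Source 2 (hypothetical): order records exported as CSV text.
ORDERS_CSV = "customer_id,amount\n1,120.0\n2,75.5\n1,30.0\n"


class VirtualCustomerRevenue:
    """A logical view: it holds no data, only references to where the sources live."""

    def __init__(self, crm_conn, orders_reader):
        self._crm = crm_conn
        self._orders_reader = orders_reader  # callable returning a fresh text stream

    def rows(self):
        # Both sources are read at query time, so results reflect their current state.
        totals = {}
        for rec in csv.DictReader(self._orders_reader()):
            cid = int(rec["customer_id"])
            totals[cid] = totals.get(cid, 0.0) + float(rec["amount"])
        result = []
        for cid, name in self._crm.execute("SELECT id, name FROM customers ORDER BY id"):
            result.append({"customer": name, "revenue": totals.get(cid, 0.0)})
        return result


view = VirtualCustomerRevenue(crm, lambda: io.StringIO(ORDERS_CSV))
report = view.rows()
```

Because nothing is copied into the view, a customer added to the CRM source afterward appears the next time `view.rows()` is called, with no replication step in between.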
Data virtualization underpins the logical data fabric, independent of data placement, type, and lag between inputs and outputs. Shankar says (and Gartner affirms) that data virtualization represents one of the more stable technologies and the fastest-growing data integration style. Many companies leverage data virtualization. But what data virtualization solution merges data fast enough for a business to chart a course of action and implement it successfully?
A Data Virtualization Product that Performs Promptly
Denodo provides data virtualization that performs promptly. The company’s experience goes back about twenty years, to when the CEO and the CTO published a research paper that led the way. Gartner featured Denodo as the top fastest-growing data integration center. Shankar explained that the company has grown “fifty percent year over year” due to increasing data virtualization usage. But Shankar and Denodo are not resting on their laurels, especially with the rapidly escalating Coronavirus pandemic situation.
The company has launched an initiative called the Coronavirus Data Portal, an open, collaborative platform that uses the power of data virtualization to integrate various COVID-19 datasets from across the world and make the combined data available to researchers to help accelerate solutions to this deadly disease.
“Speed remains key to delivering data, especially in real-time. The person who has 1 billion records of data in Hadoop and 200 million records in a customer relationship management (CRM) database needs to have it all combined in real time. Denodo has three or four layers of performance optimization techniques to address this need and make data more quickly accessible.”
Denodo’s product currently includes a number of capabilities, including:
- Dynamic Query Optimization: Denodo innovated and implemented dynamic query optimization. According to Shankar, this feature “figures the computing load on the systems during runtime, and then shifts the work to a system less burdened, to search the data much faster.” While most other data virtualization engines could perform static optimization, Denodo could do this optimization in real time.
- MPP Engine: Denodo’s product leverages in-memory processing in a massively parallel processing (MPP) engine, an engineering approach that makes algorithms more efficient by requiring fewer central processing unit (CPU) resources. As Hadoop systems have become more popular and require more memory, Denodo can handle this workload.
“In the case that computing needs to pull tons of data, we put the data to an in-memory system and process it on the spot. Denodo returns the data much faster. Furthermore, a cache in our system accelerates queries.”
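The runtime, load-based routing that Shankar describes can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the `Engine` class, the `choose_engine` function, and the load figures are hypothetical, and a real optimizer would weigh far more than a single load number (data volumes and the cost of moving data, for instance). It is not Denodo's algorithm.

```python
from dataclasses import dataclass


@dataclass
class Engine:
    """A hypothetical execution target, such as a source database or a cluster."""
    name: str
    load: float  # fraction of capacity in use right now, from 0.0 to 1.0


def choose_engine(candidates):
    """Pick the least-loaded engine at the moment the query runs."""
    if not candidates:
        raise ValueError("no engine can run this query")
    return min(candidates, key=lambda e: e.load)


engines = [Engine("warehouse", 0.85), Engine("hadoop", 0.40), Engine("crm_db", 0.65)]
first = choose_engine(engines).name

# Loads change as the business operates, so the next query may be routed elsewhere.
engines[1].load = 0.95
second = choose_engine(engines).name
```

A static optimizer would fix the routing decision before execution. Here the decision is re-evaluated on each call, which is what lets the second query move away from the engine that has since become busy.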
Other features include:
- Summarizing Information: Summary tables deliver aggregated information fast, speeding up queries.
- Automation Through AI and Machine Learning: AI and ML capabilities automate some manual functions and repetitive tasks. “The system learns and proposes resources needed for more efficient processing,” said Shankar.
- Adding a Data Catalog: Through a data catalog, customers have self-service access to search the data, understand its meaning, and analyze it.
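To make the summary table idea from the list above concrete, here is a minimal Python sketch using the standard `sqlite3` module. The table names (`sales`, `sales_summary`), the sample rows, and the `total_for` helper are assumptions for illustration only, not part of any Denodo product. The aggregate is computed once and stored, and later queries read the small summary instead of scanning the detail rows.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE sales (region TEXT, amount REAL)")
db.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("east", 50.0), ("west", 70.0)],
)

# Precompute the aggregate once and store it as a summary table.
db.execute(
    "CREATE TABLE sales_summary AS "
    "SELECT region, SUM(amount) AS total, COUNT(*) AS n FROM sales GROUP BY region"
)


def total_for(region):
    """Answer from the summary table instead of scanning the detail rows."""
    row = db.execute(
        "SELECT total FROM sales_summary WHERE region = ?", (region,)
    ).fetchone()
    return row[0] if row else 0.0


east_total = total_for("east")
```

The trade-off in this sketch is freshness: the summary reflects the detail rows as of the moment it was built, so it has to be refreshed when the underlying data changes.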
Enabling analytic use cases with big data, where data scientists can search data at once, is a key necessity. Shankar said:
“We bring data together so that data scientists build models required for cost and price optimization, and salespeople can leverage this information immediately when pricing to their customers.”
In a world where crisis events can unfold rapidly, business functioning means everything. People need data now. Shankar stated, “We provide abstraction. We have the location-agnostic capability to take care of data from on-premises to the cloud. We provide translation for our business users.” Most of all, it’s important to get data promptly to the user for just-in-time action.