The Big Data marketplace capped another successful year in 2014, with vendors securing billions of dollars in funding. Headlines surrounding the Internet of Things, the trend toward mobility, and Cognitive Computing have solidified the future of Big Data technologies and their centrality to the Data Management landscape.
The coming year will herald a focus on real-time results for the access, analytics, and integration of data. That focus will bring a newfound agility to keep pace with rapidly evolving business objectives, customer needs, and a burgeoning number of data sources, which together are transforming the very nature of business in the 21st century.
MapR CEO John Schroeder characterizes the crux of the agility now necessary for successful Big Data deployments as:
“…a broad range of data sources with a number of different access methods and the ability to blend completely unstructured with self describing data in a changed format with traditionally centrally structured data…To move into this century with things like operational analytics, the real-time speeds have to be there too.”
The distinctions between real-time data access, integration, and analytics become less pronounced when considering the current capabilities of SQL approaches on Big Data sets. NoSQL stores and repositories such as Hadoop were initially valued for their ability to ingest data quickly regardless of schema or modeling. The continuing relevance of SQL to legacy systems, however, along with attempts to integrate any variety of non-Big Data sources across the enterprise, has created a growing demand for SQL options on Big Data.
Although there is no shortage of vendors and tools that can facilitate this task (including Apache Spark, an engine optimized for SQL access, Machine Learning algorithms, and massive quantities of data), the ideal is to do so in a way that is not only expedient but also agile. Apache Drill, a SQL engine for Big Data that provides interactive query response times while handling large result sets, works around SQL's conventional requirement that data be structured before it can be accessed.
“What we’ve done with Apache Drill is to extend SQL to also be able to query self describing data in a changed format like JSON or…Hbase tables, and things like that. That gives you the agility with the app which is basically publishing the structure and the consuming application finds the structure of the data at execution time. So it’s much more agile.”
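The schema-on-read approach Schroeder describes can be sketched in plain Python. This is an illustrative sketch only, not Drill's implementation: the point is that no schema is declared up front, and the consuming code discovers each record's structure at execution time. The field names and sample data are invented for the example.

```python
import json

# Self-describing records (e.g., JSON lines): each record carries its own structure,
# and records need not share identical fields.
raw = """
{"user": "ann", "page": "/home", "ms": 120}
{"user": "bob", "page": "/cart", "ms": 340, "referrer": "/home"}
{"user": "ann", "page": "/checkout", "ms": 95}
"""

def query(lines, predicate, columns):
    """Schema-on-read: structure is discovered per record at execution time."""
    for line in lines:
        record = json.loads(line)
        if predicate(record):
            # A field absent from a record yields None instead of
            # violating a fixed, pre-declared schema.
            yield {c: record.get(c) for c in columns}

rows = list(query(
    raw.strip().splitlines(),
    predicate=lambda r: r["ms"] > 100,
    columns=["user", "page", "referrer"],
))
print(rows)
```

Note that the second record's extra `referrer` field is handled without any schema change, which is the agility the quote refers to.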
Apache Drill also provides a way to access structured, unstructured, or semi-structured data while eschewing conventional Data Lakes, which are useful for Data Scientists but present numerous governance complications when used for production. Despite the relative popularity of Data Lakes, Gartner predicted: “Through 2018, 90% of deployed data lakes will be useless as they are overwhelmed with information assets captured for uncertain use cases” (Heudecker et al, 2014). SQL options for accessing Big Data quickly in a structured format could help curb the deployment of Data Lakes for general enterprise use in 2015.
Real-time analytics will play a critical role in the adoption rates and technology lifecycle maturity of Big Data, which is still relatively young. The demand for real-time or near real-time analytics is partly attributable to the predictive and prescriptive nature of certain advanced analytics, which may involve Machine Learning. IDC predicted that “Growth in applications incorporating advanced and predictive analytics, including machine learning, will accelerate in 2015. These apps will grow 65% faster than apps without predictive functionality.”
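The kind of predictive functionality IDC describes being built directly into applications can be as simple as an embedded forecasting step. The following is a minimal sketch under invented data, using an ordinary least-squares trend line rather than any particular vendor's Machine Learning library:

```python
def linear_forecast(series, steps_ahead=1):
    """Fit y = a + b*x by ordinary least squares and extrapolate the trend."""
    n = len(series)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(series) / n
    # Slope: covariance of x and y over variance of x.
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, series)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a + b * (n - 1 + steps_ahead)

# Hypothetical monthly order counts; the app forecasts next month in-line.
monthly_orders = [100, 110, 120, 130]
print(linear_forecast(monthly_orders))  # the linear trend continues: 140.0
```

An application incorporating even this much prediction, recomputed as each new data point is ingested, is "real-time" in the sense the article uses: insight is produced at ingestion speed rather than in a separate batch pass.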
The proclivity for real-time analytics represents a critical juncture in the lifecycle of Big Data, which is transitioning from batch-oriented analytics processes to much more expedient ones that frequently involve self-service options. As analytics become more ubiquitous throughout the Data Management landscape (evolving from dedicated tools to inclusion in various apps), their speed is attributable to in-memory computing, repositories that can read and write data simultaneously, and self-service functionality such as search tools on Big Data sets.
There are some trade-offs with real-time analytics, which tend to prioritize speed over complex transactional functionality in order to deliver insight on data as it is ingested and to account for operational intelligence in the Internet of Things. However, their integration with conventional Business Intelligence tools greatly increases Big Data’s overall utility. Schroeder reflected that:
“One of our demo use cases that we have on our website shows an analyst being able to do traditional self-service BI against sales data, but also being able to query click stream data from their dot com site and be able to see what the web behaviors were of buyers… It’s a great demonstration of going against traditionally centrally structured data for the product sales information but then being able to go against the self describing data in a changed format.”
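The use case Schroeder describes boils down to joining traditionally structured sales records with self-describing clickstream events on a shared key. A hypothetical sketch of that join in plain Python (a SQL engine such as Drill would express it declaratively; the field names and data here are invented):

```python
import json

# Traditionally structured sales records (think: rows from a relational table).
sales = [
    {"order_id": 1, "user": "ann", "product": "widget", "amount": 19.99},
    {"order_id": 2, "user": "bob", "product": "gadget", "amount": 4.50},
]

# Self-describing clickstream events (think: JSON from the dot com site).
clicks = json.loads("""
[{"user": "ann", "page": "/widget",   "ts": 1},
 {"user": "ann", "page": "/checkout", "ts": 2},
 {"user": "cat", "page": "/home",     "ts": 3}]
""")

# Index clickstream events by the shared key, then join against sales,
# revealing the web behaviors of buyers.
pages_by_user = {}
for c in clicks:
    pages_by_user.setdefault(c["user"], []).append(c["page"])

behavior = [
    {"order_id": s["order_id"], "product": s["product"],
     "pages_before_buying": pages_by_user.get(s["user"], [])}
    for s in sales
]
print(behavior)
```

The structured side (sales) and the self-describing side (clicks) never need to share a schema; only the join key must line up, which is what makes the combined query agile.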
Integration of Big Data with various sources is one of the most fundamental aspects of the real-time value it can provide to the enterprise. Gartner reported: “…big data projects need to combine various data types from internal or external sources…This leads to challenges to linking data…When identified early, these challenges can be addressed by adding iteration to the pilot. When discovered at a later stage…this can…jeopardize the project” (Heudecker et al, 2014).
Yet an agile approach to real-time integration can result in the addition of new and changing data sources with minimal jeopardy. Many of the architecture concerns for Big Data stores and repositories are being addressed by a robust series of connectors to the leading vendors in this space. Additionally, there is a growing trend for these repositories to offer native support for the engines and processes of leading Big Data and analytics vendors, which eases integration concerns since such support enables a single copy of the data. According to Schroeder, such singularity is ideal for:
- Governance: Fewer replications and less movement of data provide a single place to enforce governance protocols.
- Lineage: It is less complicated to trace data lineage when data is located in one place instead of multiple locations.
- Backups: Backing up data is relatively straightforward when it is assembled in a single place.
- Security: Less infrastructure means there are fewer components to monitor.
Utilizing the Cloud for Big Data initiatives directly impacts issues of architecture, integration, and analytics. Many real-time, self-service options for Big Data are deployed through the Cloud, and a number of these involve predictive analytics. Alongside the reduced infrastructure costs of the Cloud’s scalability and pricing model, however, Cloud offerings for Big Data may require replication of data between on-premises and Cloud sources, which can impact business continuity and the rate of integration. With analytics choices running the gamut from SaaS to PaaS, the Cloud’s impact on Big Data (and its adoption rates) will be formidable in the years to come. According to Forrester:
“Whether you are a scientist, DBA, coder, rapid application designer, or business intelligence professional there are cloud-based solutions suited to your skillset that you can leverage in a pay-per-use fashion right now. And nearly all of these services can be consumed via cloud economics – pay for what you use, only when you use it.”
Real-time usage in terms of access, analytics, and integration is an important frontier that Big Data technologies will continue to address throughout the remainder of the year. The nearly instantaneous feedback that real time provides is crucial for improving the overall agility of Big Data initiatives, rendering them less niche projects and more adaptable means of informing and accomplishing business objectives. The reality of Big Data’s role in the current Data Management landscape is that a multitude of expensive, extremely lucrative Big Data deployments hinge on these technologies for success. Those use cases, however, are outnumbered by the vast majority of organizations that are merely considering a Big Data project, or have launched one that has not yet reached the production stage.
Schroeder believes that changing these facts will simply require a maturing of the Big Data technologies lifecycle, which will involve a reduced degree of complexity. The movement towards self-service analytics and advanced analytics on Big Data sets via the Cloud is useful in this regard. Schroeder remarked:
“I think there’s been a tremendous amount of headway with providing high availability and backup, disaster recovery and things like that. The interactive SQL capabilities open up usage dramatically so you get a population of people that understand SQL based technology and use SQL based tools. I think that broadens adoption. And I think the final one is really the operational side of it.”
Heudecker, N., Randall, L., Edjlali, R., Buytendijk, F., Laney, D., Casonato, R., Beyer, M.A., Adrian, M. (2014). Predicts 2015: Big data challenges move from technology to the organization. www.gartner.com