The increasing prevalence of Big Data in today’s business climate is indisputable, yet several issues related to its integration with the enterprise are still preventing organizations from adopting it on a wider scale. The speed, size, and variety of Big Data present challenges to a number of conventional processes, including governance, analytics, metadata, and storage. All too often, organizations that do incorporate Big Data do so in silos, which undercuts the enterprise-wide integration from which Big Data derives much of its value.
Intelligent Business Strategies’ Managing Director Mike Ferguson addressed a number of these concerns at length during an Enterprise Data World 2013 presentation entitled “Integrating BIG Data Analytics Into The Enterprise.” In addition to outlining many of the key attributes of Big Data and their ramifications for integration, Ferguson detailed a variety of solutions to such issues and current products that address them. Ultimately, he concluded that Big Data was merely a launching point for increased data integration throughout the enterprise.
Big Data Governance
Governance concerns for Big Data are similar to those for little data. The goal is to ensure data quality and a manageable format in which data is easily archived, stored, and accessed in order to assist those professionals who use it most. Still, there are a number of key aspects of Big Data that make its governance concerns unique. The primary distinction between Big Data and little data is that the latter is structured, conforms to a universal definition of metadata, and can be readily categorized and accessed by professionals accordingly.
One may argue that the entire point of capturing and utilizing Big Data is to glean insights from unstructured data – insights that users themselves may not anticipate at the point of capture. Therefore, there is an aspect of data exploration (ideally performed by data scientists) that is vital to the integration of Big Data and occupies a primary place in its governance – one which may be secondary or unnecessary for traditional data.
This distinction manifests itself in a number of different ways. Whereas data stewards are seen as the principal curators of the governance of traditional data, data scientists are often the front-line professionals who are responsible not only for exploring Big Data, but also for providing its essential governance principles. Regardless of what technology an organization uses to access Big Data (Apache Hadoop is certainly one of the most popular), the first level of governance is for data scientists to explore various aspects of data in sandboxes to analyze and stratify its characteristics.
Thus, governance issues related to Big Data involve policies about data science projects, policies for the results of data once it has been moved into a warehouse, and policies for discovered schema and data processed through BI tools. Other governance concerns include what sources can be integrated into Big Data technologies and who can access such data, while attempting to present as much structure (and as little duplication) as possible. EMC’s Greenplum Chorus is a tool that enables organizations to govern different sandboxes and workspaces; IBM’s Big Data Platform also has governance capabilities. However, Ferguson claims there may be a more pressing issue:
“I think we will see more data governance capabilities in the Hadoop world, but remember it’s un-modeled data and so some of the things associated with data governance – like common definitions for a data model – may not apply yet. Instead, we’ll need a data scientist team to work on a data source to derive structure from unstructured data. Then we’ll want to map that into some kind of model that may adhere to our standard canonical data names and definitions, so that we can then consume that data easily in the enterprise.”
Analytics Challenges
One particularly insightful aspect of Ferguson’s presentation was that it helped to clarify the analytics challenges of Big Data. Organizations may take terms such as structured, unstructured, semi-structured, and poly-structured data for granted – until they hope to actually transform such data into information. Depending on an enterprise’s particular area of focus, Big Data encompasses not just sentiment data from social media and other websites, but clickstream data, transactional and vertical industry data, and event and sensor data, all of which can range from text (in various languages and jargon) to audio/video and sensor readings. Most of this data is in a constant state of flux, steadily streaming in and leaving little time for analysis.
The approach to analyzing Big Data inverts that of a conventional data warehouse, where users perform analytics on data that has already been stored. The trick with Big Data is to analyze it first and then determine whether or not it should be stored. The primary drivers for Big Data analytics are transaction volume and analytics complexity.
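The “analyze first, store selectively” pattern described above can be sketched in a few lines. This is a minimal illustration, not any vendor’s implementation: the event stream, the scoring function, and the threshold are all hypothetical assumptions.

```python
# Toy sketch of analyze-first analytics: score each incoming event,
# persist only those worth keeping. All names and values are illustrative.

def relevance_score(event):
    """Quick analysis step: score an event before deciding to store it."""
    keywords = {"error", "purchase", "churn"}
    words = set(event["text"].lower().split())
    return len(words & keywords) / max(len(words), 1)

def analyze_then_store(stream, threshold=0.2):
    """Analyze each event as it arrives; keep only high-scoring ones."""
    stored = []
    for event in stream:
        if relevance_score(event) >= threshold:
            stored.append(event)  # stands in for a warehouse write
    return stored

events = [
    {"id": 1, "text": "customer purchase completed"},
    {"id": 2, "text": "heartbeat ok ok ok"},
    {"id": 3, "text": "error in checkout churn risk"},
]
print([e["id"] for e in analyze_then_store(events)])  # → [1, 3]
```

The point of the inversion is visible here: the low-value event is discarded before it ever reaches storage, rather than being warehoused and filtered later.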
The three principal platform types for Big Data analytics are SQL-based relational databases, NoSQL databases, and Hadoop. There are also hybrid solutions that combine Hadoop and SQL databases, such as Teradata Aster, and conventional RDBMS that work with a finite amount of data volume. Developments in SQL technologies such as in-database analytics, columnar storage, and in-memory processing have greatly expanded their analytics capabilities relative to the size concerns of Big Data, while Hadoop’s extreme scalability and inexpensiveness (it is open source) make it one of the most sought-after platforms for Big Data.
These two aspects of Hadoop, as well as its other frequently used components such as MapReduce – a parallel, distributed data processing model – and its data warehouse layer Hive, have contributed to the fact that numerous SQL-based technologies have created applications that allow users to access and analyze Big Data through Hadoop. A number of top vendors package products with Hadoop, such as SAS and IBM’s BigInsights. Hortonworks’ Stinger Initiative aims to increase the speed of Hive significantly (up to 100 times), enabling self-service BI querying. Cloudera Impala allows users to make real-time queries with SQL technologies and is supported by a number of top BI vendors.
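The MapReduce model mentioned above can be illustrated in pure Python. A real Hadoop job distributes these phases across a cluster, but the map/shuffle/reduce flow is the same; this word-count sketch is purely illustrative.

```python
# Pure-Python sketch of the MapReduce model: map emits key/value pairs,
# shuffle groups them by key, reduce aggregates each group.
from collections import defaultdict

def map_phase(records):
    """Map: emit (word, 1) for every word in every input record."""
    for record in records:
        for word in record.split():
            yield (word, 1)

def shuffle(pairs):
    """Shuffle: group values by key, as Hadoop does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data big insights", "data hub"]
print(reduce_phase(shuffle(map_phase(lines))))
# → {'big': 2, 'data': 2, 'insights': 1, 'hub': 1}
```

Hive’s contribution, in effect, is to generate jobs like this from SQL so that analysts never write the phases by hand.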
The principal benefit of querying and analyzing data with SQL is that it simplifies data integration processes. Virtually all of the aforementioned solutions have data integration tools, while some platforms, such as Teradata Enterprise Access for Hadoop (part of its Unified Data Architecture), provide access to Hadoop and a myriad of other data sources in a fully integrated data warehouse. Thus, users can run advanced analytics on Big Data in nearly real-time and readily integrate it with the rest of their data.
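How SQL simplifies that integration can be shown with a single join. In this sketch, SQLite stands in for a SQL-on-Hadoop engine or integrated warehouse, and the table and column names are invented for illustration: one table plays the role of warehouse data, the other the role of Big-Data-derived results.

```python
# Illustrative only: SQLite as a stand-in for an integrated SQL platform.
# Table names and data are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE warehouse_customers (id INTEGER, name TEXT);
    CREATE TABLE hadoop_sentiment (customer_id INTEGER, score REAL);
    INSERT INTO warehouse_customers VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO hadoop_sentiment VALUES (1, 0.9), (2, -0.4);
""")

# One SQL join integrates warehouse records with Big-Data-derived scores
rows = conn.execute("""
    SELECT c.name, s.score
    FROM warehouse_customers c
    JOIN hadoop_sentiment s ON s.customer_id = c.id
    WHERE s.score < 0
""").fetchall()
print(rows)  # → [('Globex', -0.4)]
```

The appeal is that analysts need only one query language and one join to combine the two worlds, rather than custom integration code per source.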
More Than Just Big
Ferguson commented on this trend:
“Data management vendors are working to exploit the Hadoop platform to get scalable ETL processing. This opens up the opportunity to potentially offload that kind of processing from data warehouses into a Hadoop environment which raises another question: could we turn Hadoop into a data hub where we load data in there and process it and then move it on to wherever it has to go for subsequent analysis? Maybe it could start moving just within the Hadoop cluster to data scientists’ sandboxes, or into other platforms like data warehouses for subsequent analysis.”
The implications of Hadoop as a data hub are significant because they suggest the possibility of utilizing numerous data sources and their respective tools all on the same platform. According to Ferguson, this usage of Hadoop could represent the wave of the future:
“There’s going to constantly be a call to move data between platforms, and it’s going to get faster and the degree of integration is going to get more tightly controlled. You may have Hadoop on a cluster in the cloud and your relational database in-house, but already we’re seeing vendors put both of them in the same place. We’re seeing relational databases putting multiple NoSQL stores in the database. These are all signs of deeper integration in order to allow you to control analytical workloads for multiple platforms, exploit the best platform for analytics and make sure you’re using your data.”
Consequently, Hadoop’s potential for integrating Big Data with other enterprise data sources could result in a more profound integration throughout the enterprise. Such integration could require moving master data into Hadoop and truly streamlining various ETL, NoSQL, Hadoop, and warehousing technologies via common metadata terminology, effectively stretching Data Management to include all data solutions and assets.