Like Cloud computing, the concept of Big Data spans many areas within Data Management. It is arguable that the growth of NoSQL and the Cloud is partially due to enterprises trying to deal with the difficulty of gleaning meaningful business information from the mass of Big Data.
In addition to volume, Big Data also means velocity. Considering today’s increasingly mobile and connected social environment, it is vital for businesses to be able to track and analyze data in real time – and this data changes by the second. Many retailers are becoming more dependent on customer buying patterns; such information comes from Big Data.
2013 looks to be another year of opportunity and change in the Big Data space. This article looks at new products, vendors, and what things may be in store for established names throughout the world of Big Data.
The Decline of Hadoop?
There is little denying the importance of Hadoop to Big Data. Its distributed processing framework and MapReduce functionality allow organizations to manage large data sets effectively, even when they are spread out over multiple physical locations.
But as the technology closes in on its first decade of use, there are growing questions in the professional community about whether Hadoop’s technology has run its course. Considering that Big Data is becoming more widely used in the industry, don’t expect 2013 to be the “End of Hadoop,” but is the technology itself in decline?
Some pundits feel that Hadoop’s success is what is leading to its eventual downfall, as other vendors produce similar or more advanced technologies to handle the distributed processing of large amounts of data. In InfoStor’s Top Ten Storage Predictions of 2013, Spectra Logic’s Molly Rector predicts the Big Data market will expand beyond its current Hadoop focus.
Cloudant’s Chief Scientist, Mike Miller, questioned whether Hadoop’s best days were behind it in a column for GigaOm in the summer of 2012. Reminding the reader that the seeds of Hadoop, namely its file system design and MapReduce processing pattern, originally came from the work of Google, Miller notes that those technologies are not as prominent at Google as they were five years ago, and wonders whether Hadoop itself is as innovative as many hold it to be. Such a question has as many opinions as there are practitioners in the industry.
To meet the real-time processing needs of today’s Big Data workloads, Google developed Percolator, a system for quickly processing incremental updates to large data sets; it is used in Google’s web search index. Miller feels technologies like Percolator, along with two other Google research projects he mentions (the ad hoc analysis framework Dremel and the graph data processor Pregel), are poised to drive innovations in Big Data processing throughout 2013 and over the next few years.
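To make the idea of incremental updates concrete, here is a minimal Python sketch of a search index that is adjusted one document at a time instead of being rebuilt from scratch. It is only an illustration of the general idea and not a description of how Percolator works internally; the class and method names (IncrementalIndex, upsert, search) are hypothetical.

```python
from collections import defaultdict


class IncrementalIndex:
    """Toy inverted index updated one document at a time.

    Hypothetical illustration only; not Percolator's actual design.
    """

    def __init__(self):
        self.docs = {}                    # doc_id -> set of words in that document
        self.postings = defaultdict(set)  # word -> set of doc_ids containing it

    def upsert(self, doc_id, text):
        """Add or replace one document, touching only the words that changed."""
        new_words = set(text.lower().split())
        old_words = self.docs.get(doc_id, set())
        # Remove the document from postings for words it no longer contains.
        for word in old_words - new_words:
            self.postings[word].discard(doc_id)
            if not self.postings[word]:
                del self.postings[word]
        # Add the document to postings for words that are new to it.
        for word in new_words - old_words:
            self.postings[word].add(doc_id)
        self.docs[doc_id] = new_words

    def search(self, word):
        """Return the sorted ids of documents containing the word."""
        return sorted(self.postings.get(word.lower(), set()))


if __name__ == "__main__":
    index = IncrementalIndex()
    index.upsert("a", "big data")
    index.upsert("b", "fast data")
    index.upsert("a", "big graphs")  # only the entries for "data" and "graphs" change
    print(index.search("data"))
```

The cost of each update depends on the size of the change rather than the size of the whole data set, which is the property that matters when data changes by the second.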
Of course, none of this means Hadoop is going away any time soon. Miller feels it will remain the enterprise standard in Big Data processing for at least the next decade. But the need for real-time analysis, fast updates to large data sets, and the processing of huge amounts of graph data will lead many commercial Hadoop distributions to add technologies like Percolator, Dremel, and Pregel to their products.
New Opportunities for Using Hadoop, Microsoft Azure, and .NET Together
Bruno Terkaly, a Microsoft engineer, recently published an article that explains how to do basic MapReduce processing in the C# programming language and looks at the implementation of Hadoop on Microsoft’s Azure Cloud computing platform.
Terkaly makes good points about how the “Big” in Big Data refers to velocity as well as volume. He gives a good introductory rundown of the sources that make up Big Data as well as the variety of formats it takes. Terkaly also gives a couple of real-world example problems that the Hadoop MapReduce pattern attempts to solve.
His C# programming examples do a nice job of explaining MapReduce to the programmer only exposed to Microsoft’s .NET Framework. Terkaly leverages Hadoop’s word counting problem to serve as the equivalent to the “Hello World” example commonly seen in introductory programming texts.
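For readers who have not seen the word counting problem, the following is a minimal single-process sketch of the MapReduce pattern. It is written in Python rather than C# and is not taken from Terkaly’s article; the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical, and a real Hadoop job would spread these steps across many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle step: group emitted values by key, a job the framework normally does."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Reduce step: sum the counts collected for a single word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence inside one process."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["big data big ideas", "data in motion"]
    print(word_count(sample))
```

Because each call to map_phase depends only on its own line, and each call to reduce_phase depends only on its own word, both steps can run in parallel on different nodes, which is what makes the pattern suitable for very large inputs.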
A rundown of the components that make up the Hadoop platform (HDFS, YARN, etc.) is followed by a mention of other modules commonly used with the platform, like the Apache Pig data analysis tool and the Hive data warehouse system.
After giving a full background of all things Hadoop, Terkaly gets into the meat of the matter by taking the reader through signing up for Microsoft’s Hadoop on Azure implementation, also known as Windows Azure HDInsight. While currently in “preview” status, the product is scheduled for a full rollout in 2013, which should strengthen Hadoop’s place in the industry as an enterprise solution for Big Data processing.
The sheer number of Microsoft professional users and developers means they have the potential to become a significant part of the overall Hadoop community – a trend to watch for 2013.
Cloudera Looks to Improve Big Data Analytics in 2013
Despite the popularity of Hadoop as a framework for Big Data processing, there exists a notable lack of analytical tools. The previously mentioned Apache Pig is an open source option providing analytics functionality for Hadoop systems.
As one of the leading tool providers in the Hadoop community, Cloudera looks to help fill the analytical gap in 2013 with Cloudera Impala, its real-time engine for querying data persisted in Hadoop. Inspired by Google’s Dremel technology, Impala provides real-time querying of data stored in either the Hadoop Distributed File System (HDFS) or HBase, including selects, joins, and aggregates.
Impala leverages the same interface technology as the Apache Hive data warehouse utility used with Hadoop. The tool provides superior performance by bypassing MapReduce, using its own query engine similar to those available in commercial RDBMS offerings. Impala is freely available under the Apache License; its source code can be found on GitHub.
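To show what a query combining a select, a join, and an aggregate looks like, here is a small sketch that uses Python’s built-in sqlite3 module as a stand-in. It does not connect to Impala or Hadoop, and the customers and orders tables, their columns, and the sample rows are all invented for illustration.

```python
import sqlite3


def orders_by_region():
    """Run a select with a join and aggregates against an in-memory database."""
    conn = sqlite3.connect(":memory:")  # stand-in engine, not Impala
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
        CREATE TABLE orders (customer_id INTEGER, amount REAL);
        INSERT INTO customers VALUES (1, 'east'), (2, 'west'), (3, 'east');
        INSERT INTO orders VALUES (1, 20.0), (1, 5.0), (2, 12.5), (3, 7.5);
    """)
    query = """
        SELECT c.region, COUNT(*) AS order_count, SUM(o.amount) AS total
        FROM orders AS o
        JOIN customers AS c ON c.id = o.customer_id
        GROUP BY c.region
        ORDER BY c.region
    """
    rows = conn.execute(query).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for region, order_count, total in orders_by_region():
        print(region, order_count, total)
```

The point of an engine like Impala is that analysts can ask this style of question interactively rather than writing and scheduling a MapReduce job for each one.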
Datameer Combines Data Integration and Big Data Analytics
Continuing the thread of improved analytics on Hadoop is Datameer. The company offers a suite-like solution containing modules for data integration, data analytics, and data visualization. Its product is available under three different licenses – single-user workstation, single server, and Hadoop cluster.
Datameer’s integration module handles a wide variety of structured and unstructured sources – from most of the popular SQL engines, to HBase and Cassandra, to social networking and email. The data visualization functionality can be leveraged on most mobile devices in addition to the desktop.
Datameer supports Hadoop environments secured with Kerberos and integrates with Active Directory and LDAP. The company counts Sears Holdings among its clients. Datameer is another vendor worth keeping an eye on as analytics on Hadoop becomes a big trend in 2013.
Hadapt Also in the Hadoop Analytics Game for 2013
In addition to Datameer and Cloudera, Hadapt offers analytical software to be used with Hadoop. Their solution – the Adaptive Analytical Platform – offers a unified view of SQL and Hadoop, allowing companies to analyze an integrated view of their data.
Hadapt’s platform offers a query interface that supports SQL and MapReduce, as well as ODBC and JDBC connectors. Data is persisted either in a proprietary relational format or, for non-relational data, in HDFS. This parallel approach between Hadoop and relational databases also applies to query execution.
MapR Leverages NFS Instead of HDFS for Hadoop
Despite Hadoop’s popularity, many companies feel there are performance and stability issues with the enterprise use of Hadoop’s native HDFS and the HBase NoSQL database. As companies wait for those technologies to mature, one 2013 Big Data trend involves vendors swapping out parts of Hadoop for more stable alternatives in their distribution packages. This practice is common in open source software; Linux distributions are a familiar example.
One company trying to improve Hadoop is MapR. The company’s distribution of Hadoop replaces HDFS with its own file system, which is accessible over standard NFS. The company claims its version of Hadoop will run twice as fast on half the hardware of its competitors.
MapR provides three different distributions of Hadoop, called M3, M5, and M7. The M3 version is free, and while suitable for enterprise use, it provides only a community support option. M5 is sold on a subscription model; for the cost, users get support, high availability (HA) functionality, and mirroring.
M7 is the flagship of MapR’s Hadoop product line. It offers the best performance and support options, plus a fully optimized HBase distribution. Its Instant Recovery feature is vital for enterprises with an exposure to disaster recovery scenarios. Expect premium-level distributions of Hadoop, like M7, to continue to proliferate throughout 2013.
So 2013 in Big Data can be called “The Year Hadoop Improves.” Third parties are filling the analytical gap by providing tools that work with Hadoop, while other companies offer premium Hadoop distributions with improved performance and stability as well as better file system and database functionality.