Big Data is all the rage these days, so it behooves us to delve a little deeper into it and understand what it means and the opportunities it presents. Is Big Data truly a new phenomenon or an old concept presented in a new package? Will investing in Big Data solutions provide a significant return on investment? What do Hadoop, MapReduce, HBase, and NoSQL mean? Most importantly, if it is truly a new phenomenon, then what is the potential for Big Data solutions from a business value perspective? Lastly, we should be aware of the land mines and challenges before jumping on the Big Data bandwagon.
What is Big Data?
Big Data is most commonly explained in terms of the three “V’s”: Velocity, Volume, and Variety. Velocity is the speed at which data is being generated via multiple channels, especially social media and social networks. Volume is the sheer amount of data being generated and stored. Estimates vary on how much data is considered “Big”: a few dozen Petabytes, or thousands? The jury is still out on such questions; ask any industry professional and they’ll have an opinion. Variety is the advent of unstructured and semi-structured data alongside traditional structured data: Twitter feeds, Facebook entries, LinkedIn updates, audio files, video files, various types of documents and rich transactional data. The combination of the three V’s and these new data formats is what uniquely defines Big Data.
Is Big Data new?
After analyzing industry research, the most appropriate conclusion to such a question is “Yes and No.” Certain aspects of Big Data are indeed new, while others are not. Let’s review a few facts related to the “V’s” to confirm:
- Velocity: Meteorology, space exploration, defense applications, business transactional systems, embedded systems, academic research and medical research are some examples of existing applications that have dealt with the high velocity of data for several decades. So, if such primary examples are not new, then what is new? It’s the number of new “data producing” channels (e.g. mobile devices, social networking applications, etc.) that weren’t prevalent before. These channels are designed for real-time communication across national boundaries and hence generate extremely large amounts of raw data.
- Volume: Firms have always dealt with large data sets (e.g. Business Transactions, Data Warehouses and Operational Data Stores), so processing large volumes of data isn’t something new. However, the new “data producing” channels are generating data that far exceeds the volumes that were generated by the previous generation of applications. Businesses can mine this rich data for business intelligence and business analytics.
- Variety: Traditional Database Management Systems (DBMS) were designed to process structured data. Free form content or unstructured data was stored in its native form in Content Management Systems (CMS). Integrating and analyzing it across two different systems – DBMS and CMS – was cumbersome and inefficient. So new frameworks, persistence mechanisms and distributed processing methods were required to persist, integrate and analyze large data sets comprised of structured, unstructured and semi-structured data efficiently.
Big Data Frameworks and Supporting Infrastructure
Hadoop, MapReduce, HBase and NoSQL databases were developed to overcome the deficiencies in existing technology and address the new requirements associated with the three “V’s”. Let us review each of these:
- Hadoop: Apache Hadoop is an open source framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. According to the Apache Hadoop team, it is designed to scale up from a single server to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the Hadoop software library itself is designed to detect and handle failures at the application layer. Therefore, it delivers a highly available service on top of a cluster of computers, each of which may be prone to failures.
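- MapReduce: MapReduce is a component of the Hadoop software library. It is a software framework for distributed processing of large data sets on computer clusters. A job is split into a map step, which processes input records in parallel and emits intermediate key-value pairs, and a reduce step, which aggregates the values emitted for each key.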
- HBase: HBase is another Apache project, built on top of Hadoop and described as the Hadoop database. It is a non-relational, scalable and distributed database that supports structured data storage for large tables.
- NoSQL Databases: Relational database management systems use Structured Query Language (SQL) for accessing and manipulating data that resides in structured columns in relational tables. Unstructured and semi-structured data, however, is often stored as key-value pairs (or as documents, column families or graphs) in data stores that do not use SQL as their primary interface. Such data stores are called NoSQL data stores; the simplest of them are accessed via get and put commands.
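To make the MapReduce and key-value ideas above more concrete, here is a minimal single-machine sketch in plain Python. It does not use Hadoop or any real NoSQL product: the function names, the KeyValueStore class and the sample documents are all hypothetical, chosen only to illustrate the map, shuffle and reduce steps and the get/put access pattern.

```python
from collections import defaultdict


def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for text in documents:
        for word in text.lower().split():
            yield word, 1


def shuffle_phase(pairs):
    """Shuffle: group emitted values by key, as a framework would do
    between the map and reduce steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each key's list of values into one count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    """Run the three phases in sequence on one machine."""
    return reduce_phase(shuffle_phase(map_phase(documents)))


class KeyValueStore:
    """Hypothetical in-memory stand-in for a key-value data store."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)


# Made-up sample input, for illustration only.
documents = ["big data is big", "data about data"]
counts = word_count(documents)

# Persist the results with put, then read them back with get.
store = KeyValueStore()
for word, count in counts.items():
    store.put(word, count)

print(store.get("data"))  # 3
print(store.get("big"))   # 2
```

In a real cluster the map and reduce functions would run on many machines at once and the framework would handle the shuffle, scheduling and failure recovery; the sketch only shows the shape of the programming model.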
Jaspersoft, a prominent Business Intelligence company, recently released some key findings from the 2012 Big Data Index; it gathered the data from captured downloads between January 2011 and January 2012. The information indicates:
- Over 15,000 Big Data connectors were downloaded in 2011.
- Demand for MongoDB, the document-oriented NoSQL database, saw the biggest spike with over 200 percent growth in 2011.
- Hadoop Hive, the SQL interface to Hadoop MapReduce, represented 60 percent of all Hadoop-based connectors.
- Hadoop HBase, the distributed Hadoop environment, was the second most popular Hadoop-based connector.
- Cassandra, the high availability NoSQL database, was among the top four most downloaded Big Data sources in 2011.
- Over 27 percent of Big Data connector downloads were for Riak, Infinispan, Neo4J, Redis, CouchDB, VoltDB or others.
The diagram below is from Dion Hinchcliffe’s article titled “The enterprise opportunity of Big Data: Closing the ‘clue gap’”. It visually depicts the tremendous growth of data and the three major moving parts used to meet business objectives: Fast Data, Big Analytics and Deep Insight.
Social Network Facts
Here are a few facts related to traffic generated by Google, Twitter, Facebook and LinkedIn:
- Google set a staggering record in May 2011, according to new comScore data: It reeled in more than 1 billion unique visitors, marking the first time an Internet company can claim that honor.
- Twitter posted 32.8 million unique U.S. visitors in July 2011. Twitter remains the second biggest social platform in terms of unique visitors with 40.4 Million unique visitors at the end of December 2011.
- In 2011, Facebook had 800 million users and this number continues to grow significantly each month. Facebook attracted over 170 million unique visitors in December 2011.
- LinkedIn was at 24 million unique visitors at the end of 2011.
The proliferation of social networks, mobile devices and new social media applications is creating a perfect “data storm” that firms must leverage to their advantage.
Social Networks and Mobile Devices – Some Projections
Since social networks and mobile devices will generate a significant portion of the Big Data volume, let us review projections related to them. The following figures were presented in a recent Cisco Systems White Paper on mobile traffic:
- The number of mobile-connected devices will exceed the number of people on earth by the end of 2012.
- By 2016, there will be 1.4 mobile devices per capita. That year, there will be over 10 billion mobile-connected devices, including machine-to-machine (M2M) modules. Again, the number will exceed the world’s population at that time (7.3 billion).
- Monthly global mobile data traffic will surpass 10 exabytes in 2016.
- Monthly mobile tablet traffic will surpass 1 exabyte per month in 2016.
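As a rough sanity check on the projections above, the per-capita and population figures can be multiplied out, and the monthly traffic divided across the resulting device count. The sketch below assumes decimal units (1 exabyte = 10^18 bytes, 1 gigabyte = 10^9 bytes) and treats "surpass 10 exabytes" as exactly 10, so the per-device number is a floor rather than a forecast; the function and constant names are illustrative only.

```python
# Figures quoted in the list above (projections for 2016).
DEVICES_PER_CAPITA = 1.4
WORLD_POPULATION = 7.3e9
MONTHLY_TRAFFIC_BYTES = 10e18  # assumption: 10 exabytes, decimal units


def projected_devices(per_capita, population):
    """Total mobile-connected devices implied by the per-capita figure."""
    return per_capita * population


def traffic_per_device_gb(total_bytes, devices):
    """Average monthly traffic per device, in decimal gigabytes."""
    return total_bytes / devices / 1e9


devices = projected_devices(DEVICES_PER_CAPITA, WORLD_POPULATION)
per_device = traffic_per_device_gb(MONTHLY_TRAFFIC_BYTES, devices)

print(f"{devices / 1e9:.2f} billion devices")           # 10.22 billion devices
print(f"{per_device:.2f} GB per device per month")      # 0.98 GB per device per month
```

The product agrees with the "over 10 billion" figure, and the traffic projection works out to roughly one gigabyte per device per month on average.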
Business Potential of Big Data Solutions
The facts on the ground and market reality point to significant business potential if firms develop a Big Data Strategy to find new opportunities, develop new products, spot trends, and improve customer service. Here’s a snippet from a recent study by McKinsey Global Institute (MGI) that highlights some of the business potential that Big Data solutions can generate:
A retailer using Big Data to the full could increase its operating margin by more than 60 percent. Harnessing big data in the public sector has enormous potential, too. If US healthcare were to use big data creatively and effectively to drive efficiency and quality, the sector could create more than $300 billion in value every year. Two-thirds of that would be in the form of reducing US healthcare expenditure by about 8 percent. In the developed economies of Europe, government administrators could save more than €100 billion ($149 billion) in operational efficiency improvements alone by using big data, not including using Big Data to reduce fraud and errors and boost the collection of tax revenues. And users of services enabled by personal-location data could capture $600 billion in consumer surplus.
The Big Data market is projected to grow at an astounding Compound Annual Growth Rate (CAGR) of 58% between now and 2017, hitting the $50 billion mark within five years. Well, that should quiet the doubters who claim Big Data is all hype and no substance.
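The growth figures above can be unpacked with a little compound-growth arithmetic. The sketch below assumes exactly five annual compounding periods ending at $50 billion, which is one reading of "within five years"; the starting market size it prints is implied by that assumption and is not a figure stated in this article.

```python
def implied_starting_value(final_value, cagr, years):
    """Back out the starting value from a final value and a CAGR."""
    return final_value / (1 + cagr) ** years


def project(start_value, cagr, years):
    """Year-by-year values, from year 0 through the final year."""
    return [start_value * (1 + cagr) ** n for n in range(years + 1)]


FINAL_MARKET_BILLIONS = 50.0  # quoted above
CAGR = 0.58                   # quoted above
YEARS = 5                     # assumption: five full compounding periods

start = implied_starting_value(FINAL_MARKET_BILLIONS, CAGR, YEARS)
path = project(start, CAGR, YEARS)

print(f"Implied starting market: ${start:.2f} billion")  # about $5.08 billion
for year, value in enumerate(path):
    print(f"Year {year}: ${value:.1f} billion")
```

Under these assumptions, a 58% CAGR means the market multiplies nearly tenfold in five years, which is why a starting point of roughly $5 billion is enough to reach the $50 billion mark.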
Here’s an Infographic developed by ColumnFiveMedia in collaboration with GetSatisfaction, that visually depicts how Big Data has the potential to become the next frontier for innovation, competition and profit.
Challenges and Land Mines
Since Big Data is an emerging domain, firms will have to tackle some challenges and look out for potential land mines that could disrupt progress. Let us consider a few that are most pertinent:
- Technology: Tools and technology related to Big Data are flooding the market, making it extremely difficult to separate the “wheat from the chaff.” Depending on a firm’s business requirements, it should consider some well-known and robust solutions currently available on the market. The tool and technology choices will vary depending on the types of data to be manipulated (e.g. XML documents, social media, structured, semi-structured etc.), business drivers (e.g. customer service, customer trends, product development, etc.) and data usage (analytic or product development focused).
- Data Governance: With the addition of new data producing channels and the significant growth of data volumes, governing that data will become more challenging. Governance is especially important from the standpoints of data security, privacy and quality. There are major legal and financial ramifications if firms do not comply with existing and new regulations related to these areas.
- Skilled Resources: The MGI study mentioned above highlights the fact that there will be a shortage of talent necessary for organizations to take advantage of Big Data. MGI predicts that by 2018, the United States alone could face a shortage of 140,000 to 190,000 people with deep analytical skills, as well as 1.5 million managers and analysts with the know-how to use the analysis of Big Data to make effective decisions.
- Awareness and Education: Vendors will have to offer solutions that abstract the complexities of the Big Data framework and data infrastructure from users to enable them to better extract business value from Big Data and to justify their investment. Vendors and training organizations will have to educate users about the underlying methodologies, architectures, tools and technologies to encourage adoption.
The Bottom Line
It is quite clear that Big Data is a new phenomenon that has tremendous potential. Data practitioners must engage their business, operations and technology counterparts to educate them about Big Data’s potential, find specific business problems that it can solve and leverage the power of Big Data to gain a competitive advantage. To gain traction and showcase its capabilities, firms should consider designing and developing narrowly scoped “pilot” or “proof-of-concept” business applications first, before embarking on a full-blown project. Data management experts and Big Data practitioners should be consulted before, during and after the “pilot” stage. They can use lessons learned from earlier projects, industry best practices and technical skills to help build a firm’s Business Case, Big Data Strategy, Approach, Data Architecture and Deployment Model.
Apache Hadoop – A software framework for distributed processing of large data sets – http://hadoop.apache.org/
HBase Overview and Details – http://hbase.apache.org/
MapReduce Tutorial – http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
NoSQL – A resource for non-relational databases – http://nosql-database.org/
Amazon’s Dynamo – A highly available key value store – http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
Google’s Big Table – A Distributed storage system for structured data – http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/bigtable-osdi06.pdf
What is Big Data – An introduction to the Big Data landscape by Edd Dumbill – http://radar.oreilly.com/2012/01/what-is-big-data.html