by Charles Roe
What do Claude Shannon, Paul Baran, Ted Nelson, Leonard Kleinrock, Lawrence Roberts, Jon Postel, Vinton Cerf, Robert Kahn, Brian Carpenter, Tim Berners-Lee, Esther Dyson and so many others have in common? They were pioneers who helped to build the foundation, write the code, and create the structures and systems that are the Internet today. Their contributions range from the creation of HTML and the TCP/IP networking protocols, to the formation of the Electronic Frontier Foundation, the groundwork for Modern Information Theory, packet switching networks, ARPANET, and security systems.
Their visions, along with the work of untold others, helped develop the groundwork of what today has become one of the foremost topics in Data Management: Big Data and its co-conspirator, Unstructured Data. Both terms are inundated with meanings, and are spoken about during keynote speeches, tutorial sessions, workshops and the dark, quiet corners of data conferences around the world. Every Data Management professional who hears them quivers in anticipation as they see dollar signs, promotions, and job security flash before their eyes; yet inside they tremble with apprehension at the sheer engulfing size of the data structures filling up their data warehouses, non-relational distributed systems, and email inboxes at rates never seen in the world before.
Why all the Hype?
Unstructured data are simply data that do not fit easily into traditional relational systems; such a definition is of course fraught with questions, concerns and lack of depth, but it works for discussion purposes. The term Unstructured Data therefore includes emails, word processing documents, multimedia, video, PDF files, spreadsheets, messaging content, digital pictures and graphics, mobile phone GPS records, and social media content, which combines all the other elements on a gargantuan scale. Such data are scattered across the Internet, within the Intranets of corporations around the world, and buried in the hard disks of personal computers. They are now one of the driving forces in the growth of non-relational, distributed, horizontally scalable systems like NoSQL; such data are Big Data (along with the ever-expanding relational structures) and are getting bigger all the time. According to John Thielens’s article “Big Data Wizardry: Pay Attention To What's Behind The Curtain,” Big Data is “[L]ike the explosive thrust blowing out of a rocket nozzle,” and “how to maximize its value remains a mystery to most of us.”
Everyone reads the articles, listens to the presentations, and watches the videos that say the amount of data is now doubling every two years. They understand that it doesn’t really matter if Mark Logic’s estimate in “The Post-Relational Reality Sets in: 2011 Survey on Unstructured Data” of 295 exabytes of data in the IT world is correct, or if IDC’s estimate of over a zettabyte of digital data in the world is more accurate. In reality both are correct since they are talking about somewhat different entities: Mark Logic’s numbers cover the Information Technology industry and IDC’s are about all the digital data created in the world. The numbers are so staggering as to make anyone step back in awe:
- A megabyte is 10^6 or 1,000,000 bytes or 1000 kilobytes, or approximately 6 seconds of uncompressed CD-quality audio.
- A gigabyte is 10^9 or 1,000,000,000 bytes or 1000 megabytes. One DVD holds about 4.7GB of information.
- A terabyte is 10^12 or 1,000,000,000,000 bytes or 1000 gigabytes. In April 2011, the Library of Congress digital data amounted to about 235 terabytes of data; it adds around 5TB per month.
- A petabyte is 10^15 bytes or 1000 terabytes or 1 million gigabytes. Google processes around 24PB of data per day and as of August 2011 IBM built a single 120PB storage array (the largest ever).
- An exabyte is 10^18 bytes or 1 billion gigabytes or 1 trillion megabytes. According to Cisco’s June 2009 Visual Networking Index, annual worldwide IP traffic will reach 667 exabytes by 2013, with 18 exabytes of Internet video generated per month.
- A zettabyte is 10^21 or 1,000,000,000,000,000,000,000 bytes or 1 trillion gigabytes or 1 quadrillion megabytes. In 2003, Mark Liberman, a linguist at the University of Pennsylvania, calculated (with a nice bit of humor thrown in for good measure) the total amount of all human speech ever spoken to be 42 zettabytes.
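For readers who want to check the ladder of units above, here is a minimal Python sketch. It assumes the decimal (powers of 1000) definitions used in this list; the names `UNITS`, `bytes_in` and `convert` are made up for illustration and do not come from any library.

```python
# Decimal byte units as used in the list above: each step up is a factor of 1000.
UNITS = ["kilobyte", "megabyte", "gigabyte", "terabyte",
         "petabyte", "exabyte", "zettabyte"]


def bytes_in(unit: str) -> int:
    """Return the number of bytes in one decimal `unit`."""
    return 1000 ** (UNITS.index(unit) + 1)


def convert(amount: float, from_unit: str, to_unit: str) -> float:
    """Convert `amount` from one decimal unit to another."""
    return amount * bytes_in(from_unit) / bytes_in(to_unit)


if __name__ == "__main__":
    for position, unit in enumerate(UNITS, start=1):
        print(f"1 {unit} = 10^{3 * position} bytes")
    # How many 4.7GB DVDs would it take to hold one zettabyte?
    dvds = convert(1, "zettabyte", "gigabyte") / 4.7
    print(f"1 zettabyte fills roughly {dvds:.3g} DVDs")
```

Because every unit is defined relative to the one before it, a slip such as confusing gigabytes with terabytes shows up immediately when the conversions are run.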
With numbers like that, when Big Data and Unstructured Data walk into a room, everyone else has less oxygen to breathe: data analysts hold their BI Tools close to their chests and try to hide in the corner, fearing the ramifications; data modelers turn their heads and tell ERwin that everything will be ok; C-level executives start planning vacations so they can escape the siren’s call of new Data Governance initiatives and Data Quality issues; programmers and developers giggle to themselves as they see a horizon where they can start mastering open source systems and maybe even create the new HBase, Riak, Redis, Cassandra, or MongoDB and become rich. Everyone wants a piece of the cluster, with bank accounts so large they have their own node and sharding of the account becomes necessary; where keynote addresses make their resumes too voluminous to send as an attachment, and buying a new Jaguar or Maserati (or any dream car) is schemaless, no structured planning necessary.
It’s all in the Numbers
IDC’s newest estimate says that in 2011 there were 1.8 zettabytes of digital data (created and replicated) in the world, growing to 7.9 zettabytes by 2015. So the real question is: where is all this data coming from? How are we creating, replicating, saving, mining, and analyzing such colossal amounts of data? A veritable plethora of sites detail the statistics on Internet usage and digital data growth, with special attention to social networking. Some statistics from 2010 and 2011 include:
- Twitter has 200 million tweets per day or approximately 46MB/sec of data created (August 2011)
- Facebook has 640 million users, with 50% logging in daily (March 2011)
- LinkedIn has over 100 million users (mid-2011)
- The largest Yahoo! Hadoop cluster is 82PB, and over 40,000 servers are running its operations (June 2011)
- Facebook collects an average of 15TB of data every day or 5000+ TB per year, and has more than 30PB in one cluster (March 2011)
- 107 trillion emails were sent in 2010
- There were 152 million blogs in 2010
- Google has more than 50 billion pages in its index (December 2011)
- YouTube has 3 billion views per day, and 48 hours of video are uploaded per minute (May 2011)
- Amazon’s S3 cloud service had some 262 billion objects at the end of 2010, with approximately 200,000 requests per second.
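The figures quoted so far can be sanity-checked with a little arithmetic. The Python sketch below is only an illustration: it assumes smooth compound growth between IDC’s two estimates (1.8 zettabytes in 2011, 7.9 zettabytes in 2015), and the function names are hypothetical helpers rather than part of any library.

```python
import math

SECONDS_PER_DAY = 86_400


def annual_growth_rate(start: float, end: float, years: float) -> float:
    """Compound annual growth rate implied by growing from `start` to `end`."""
    return (end / start) ** (1 / years) - 1


def doubling_time(rate: float) -> float:
    """Years needed to double at a constant annual growth `rate`."""
    return math.log(2) / math.log(1 + rate)


def per_second(per_day: float) -> float:
    """Convert a daily count into an average per-second rate."""
    return per_day / SECONDS_PER_DAY


def per_year(per_day: float) -> float:
    """Convert a daily amount into a yearly amount (365-day year)."""
    return per_day * 365


if __name__ == "__main__":
    rate = annual_growth_rate(1.8, 7.9, 2015 - 2011)
    print(f"Implied annual growth: {rate:.1%}")
    print(f"Implied doubling time: {doubling_time(rate):.2f} years")
    print(f"Tweets per second: {per_second(200e6):.0f}")
    print(f"Facebook data per year: {per_year(15):.0f} TB")
```

Under that assumption the IDC estimates imply growth of roughly 45% a year, or a doubling time a little under two years, which agrees with the often-repeated claim that data doubles every two years. The same helpers turn 200 million tweets per day into an average rate of about 2,300 tweets per second, and 15TB per day into the "5000+ TB per year" quoted for Facebook.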
The statistics above are just a microcosm of what is happening with the growth of Big Data and Unstructured Data across the globe; they only highlight some of the statistics released by a few of the larger players over the past couple of years. Such numbers may not exist for everyone, but even smaller enterprises that are trying to data mine and analyze their own Unstructured Data and expand their BI operations into social networking have problems coping with the massive amount of data filling up their servers. The growth of non-relational database systems allows enterprises to capture data BLOBs (binary large objects), emails and other text documents, and scour forums and blogs for company sensitive information, while also integrating that information with their traditional relational data warehouses. But the integration is still slow; the BI tools are not yet up to par in terms of off-the-shelf availability and the dashboard simplicity with full BI functionality that many business users expect. Those problems are being alleviated quickly, and solutions will soon be much more cost effective, so even small businesses can get big gains from Big Data.
The Bottom Line is the Band Wagon
So what exactly is everyone supposed to do with all this data? We will continue creating more and more, that is assured; the proverbial stack of DVDs that now reaches to the moon and back will reach Mars then Jupiter then maybe even the Oort cloud - humans like data. Steven Lohr’s assessment in his article “The Age of Big Data” is correct: “Despite the caveats, there seems to be no turning back. Data is in the driver’s seat. It’s there, it’s useful and it’s valuable, even hip.” As the volume of data continues to grow, luckily our ability to deal with it efficiently is also growing:
- Distributed computing systems - These allow horizontal expansion of power and storage capacity, lower costs, more effective solutions to system failures, and an ease of deployment that vertical models do not offer.
- New BI Tools - NoSQL systems have more BI tool integrations entering the market all the time, and with new tools come better crossovers with traditional data warehouse structures, so the IT and business elements of enterprises can better work together to use all that data.
- Better Processes - Many corporations are now instituting separate processes for governance, security, quality and management of their Unstructured Data systems. Such systems require different actions than traditional data systems and such differences are now better understood.
In the end it all comes down to the bottom line; corporations want to spend less money and make more money, while expanding services and potentialities. A recent report from the World Economic Forum titled “Big Data, Big Impact: New Possibilities for International Development” outlines the opportunities. The report specifically discusses the use of mobile phone data to help the growth of services for low income individuals and emerging economies, but its essential focus can easily be extrapolated to all individuals, corporations and the use of data as a whole: “[B]uilding user-centric solutions offers compelling possibilities for providing better access to services in health, education, financial services, and agriculture for people living in poverty.” The successful collection, analysis and use of the 1.2 zettabytes of world digital data (which includes the 295 exabytes of IT data) allows everyone to better understand consumer markets, information highways, idea exchange, areas of potential growth, and problem zones; this in turn allows everyone to create better financial systems, educational opportunities, distribution networks, regulatory bodies, sharing incentives, crisis models and structures that aid all of us in ways we are only beginning to understand. We are in the driver’s seat; data is our Maserati Gran Turismo; we have to learn to drive it like it’s supposed to be driven.