
The Growth of Unstructured Data: What To Do with All Those Zettabytes?

By Charles Roe  /  March 15, 2012

What do Claude Shannon, Paul Baran, Ted Nelson, Leonard Kleinrock, Lawrence Roberts, Jon Postel, Vinton Cerf, Robert Kahn, Brian Carpenter, Tim Berners-Lee, Esther Dyson and so many others have in common? They were pioneers who helped to build the foundation, write the code, and create the structures and systems that are the Internet today. Their contributions range from the creation of the HTML markup language and the TCP/IP networking protocols, to the formation of the Electronic Frontier Foundation, the groundwork for modern Information Theory, packet-switching networks, ARPANET, and security systems.

Their visions, along with the work of untold others, laid the groundwork for what has become one of the foremost topics in Data Management: Big Data and its co-conspirator, Unstructured Data. Both terms are freighted with meanings, and are discussed in keynote speeches, tutorial sessions, workshops, and the dark, quiet corners of data conferences around the world. Every Data Management professional who hears them quivers in anticipation, seeing dollar signs, promotions, and job security flash before their eyes; yet inside they tremble with apprehension at the sheer engulfing size of the data structures filling up their data warehouses, non-relational distributed systems, and email inboxes at rates never before seen.

Why all the Hype?

Unstructured data is simply data that does not fit easily into traditional relational systems; such a definition is of course fraught with questions, concerns, and a lack of depth, but it works for discussion purposes. The term Unstructured Data thus includes emails, word processing documents, multimedia, video, PDF files, spreadsheets, messaging content, digital pictures and graphics, mobile phone GPS records, and social media content, which combines all the other elements on a gargantuan scale. Such data is scattered across the Internet, within the intranets of corporations around the world, and buried in the hard disks of personal computers, and it is now one of the driving forces behind the growth of non-relational, distributed, horizontally scalable systems such as NoSQL; such data is Big Data (along with the ever-expanding relational structures), and it is getting bigger all the time. According to John Thielens’s article “Big Data Wizardry: Pay Attention To What’s Behind The Curtain,” Big Data is “[L]ike the explosive thrust blowing out of a rocket nozzle,” and “how to maximize its value remains a mystery to most of us.”
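
The "does not fit easily into relational systems" distinction can be made concrete. As an illustration only (the record and field names below are invented, not drawn from any real system), the same customer activity that would need a rigid row in a relational table can be kept as a free-form document of the kind NoSQL stores hold:

```python
import json

# A relational row demands a fixed schema: every record has the same columns.
relational_row = ("cust-001", "2012-03-01", "support_call", 14.5)

# A schemaless document can nest, omit, or add fields per record -- the kind
# of email, social-media, and multimedia metadata that resists tabular form.
document = {
    "customer_id": "cust-001",
    "date": "2012-03-01",
    "interactions": [
        {"type": "support_call", "duration_min": 14.5},
        {"type": "tweet", "text": "Great service today!", "retweets": 3},
    ],
    "attachments": ["receipt.pdf"],
}

# Document stores typically persist such records as JSON-like blobs.
print(json.dumps(document, indent=2))
```

Adding a new attribute to the document requires no schema migration, which is exactly why these systems absorb heterogeneous data so easily.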

Everyone reads the articles, listens to the presentations, and watches the videos saying that the amount of data is now doubling every two years. It doesn’t really matter whether Mark Logic’s estimate in “The Post-Relational Reality Sets in: 2011 Survey on Unstructured Data” of 295 exabytes of data in the IT world is correct, or whether IDC’s estimate of over a zettabyte of digital data in the world is more accurate. In reality both are correct, since they measure somewhat different entities: Mark Logic’s numbers cover the Information Technology industry, while IDC’s cover all the digital data created in the world. The numbers are staggering enough to make anyone step back in awe:

  • A megabyte is 10^6 or 1,000,000 bytes, or 1,000 kilobytes, or approximately 6 seconds of uncompressed CD-quality audio.
  • A gigabyte is 10^9 or 1,000,000,000 bytes, or 1,000 megabytes. One DVD holds about 4.7GB of information.
  • A terabyte is 10^12 or 1,000,000,000,000 bytes, or 1,000 gigabytes. In April 2011, the Library of Congress’s digital data amounted to about 235 terabytes; it adds around 5TB per month.
  • A petabyte is 10^15 bytes, or 1,000 terabytes, or 1 million gigabytes. Google processes around 24PB of data per day, and as of August 2011 IBM had built a single 120PB storage array (the largest ever).
  • An exabyte is 10^18 bytes, or 1 billion gigabytes, or 1 trillion megabytes. According to Cisco’s June 2009 Visual Networking Index, annual worldwide IP traffic will reach 667 exabytes by 2013, with 18 exabytes of Internet video generated per month.
  • A zettabyte is 10^21 or 1,000,000,000,000,000,000,000 bytes, or 1 trillion gigabytes, or 1 quadrillion megabytes. In 2003, Mark Liberman, a linguist at the University of Pennsylvania, calculated (with a nice bit of humor thrown in for good measure) the total amount of all human speech ever spoken at 42 zettabytes.
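
Those decimal prefixes are easy to get wrong by hand. A few lines of Python can convert a raw byte count into the largest sensible unit (the helper name is my own, not a standard library function):

```python
def humanize_bytes(n, units=("B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB")):
    """Convert a raw byte count to the largest decimal (SI) unit below 1000."""
    for unit in units:
        if n < 1000:
            return f"{n:.1f} {unit}"
        n /= 1000                       # each SI step is a factor of 1000
    return f"{n:.1f} YB"

# The Library of Congress figure above: 235 terabytes.
print(humanize_bytes(235 * 10**12))     # 235.0 TB
# IDC's 2011 estimate of the world's digital data: 1.8 zettabytes.
print(humanize_bytes(1.8 * 10**21))     # 1.8 ZB
```

Note that storage vendors use these decimal (powers-of-1000) units, while operating systems often report binary (powers-of-1024) units, which is a perennial source of confusion.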

With numbers like that, when Big Data and Unstructured Data walk into a room, everyone else has less oxygen to breathe: data analysts hold their BI tools close to their chests and try to hide in the corner, fearing the ramifications; data modelers turn their heads and tell ERwin that everything will be ok; C-level executives start planning vacations so they can escape the siren’s call of new Data Governance initiatives and Data Quality issues; programmers and developers giggle to themselves as they see a horizon where they can start mastering open source systems and maybe even create the new HBase, Riak, Redis, Cassandra, or MongoDB and become rich. Everyone wants a piece of the cluster, with bank accounts so large they have their own node and sharding of the account becomes necessary; where keynote addresses make their resumes too voluminous to send as an attachment, and buying a new Jaguar or Maserati (or any dream car) is schemaless, no structured planning necessary.

It’s all in the Numbers

IDC’s newest estimate says that in 2011 there were 1.8 zettabytes of digital data (created and replicated) in the world, growing to 7.9 zettabytes by 2015. So where is all this data coming from? How are we creating, replicating, saving, mining, and analyzing such colossal amounts of data? There is a veritable plethora of sites detailing statistics on Internet usage and digital data growth, with special attention to social networking. Some statistics from 2010 and 2011 include:

  • Twitter has 200 million tweets per day, or approximately 46MB/sec of data created (August 2011)
  • Facebook has 640 million users, with 50% logging in daily (March 2011)
  • LinkedIn has over 100 million users (mid-2011)
  • The largest Yahoo! Hadoop cluster is 82PB, and over 40,000 servers are running its operations (June 2011)
  • Facebook collects an average of 15TB of data every day or 5000+ TB per year, and has more than 30PB in one cluster (March 2011)
  • 107 trillion emails were sent in 2010
  • There were 152 million blogs in 2010
  • Google has more than 50 billion pages in its index (December 2011)
  • YouTube receives 3 billion views per day, and 48 hours of video are uploaded per minute (May 2011)
  • Amazon’s S3 cloud service had some 262 billion objects at the end of 2010, with approximately 200,000 requests per second.
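
Daily and yearly totals like those above translate into striking per-second rates. A quick back-of-the-envelope check (figures taken from the list; the rounding is mine):

```python
SECONDS_PER_DAY = 24 * 60 * 60                # 86,400
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY      # 31,536,000

# Twitter, August 2011: 200 million tweets per day.
tweets_per_sec = 200_000_000 / SECONDS_PER_DAY
# All of 2010: 107 trillion emails sent.
emails_per_sec = 107_000_000_000_000 / SECONDS_PER_YEAR
# Facebook, March 2011: 15 TB of data collected per day.
facebook_tb_per_hour = 15 / 24

print(f"{tweets_per_sec:,.0f} tweets/sec")    # ~2,315 tweets every second
print(f"{emails_per_sec:,.0f} emails/sec")    # ~3.4 million emails every second
```

Even "modest" daily figures, spread over 86,400 seconds, describe sustained ingest rates that traditional single-server systems were never designed to absorb.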

The statistics above are just a microcosm of what is happening with the growth of Big Data and Unstructured Data across the globe; they highlight only some of the figures released by a few of the larger players over the past couple of years. Such numbers may not exist for everyone, but even smaller enterprises trying to mine and analyze their own Unstructured Data, and to expand their BI operations into social networking, have trouble coping with the massive amounts of data filling up their servers. The growth of non-relational database systems allows enterprises to capture data BLOBs (binary large objects), emails, and other text documents, and to scour forums and blogs for company-sensitive information, while also integrating that information with their traditional relational data warehouses. But the integration is still slow; the BI tools do not yet offer the off-the-shelf availability and ease-of-use dashboard simplicity, with full BI functionality, that many business users expect. Those problems are being alleviated quickly, and solutions will soon be much more cost effective, so that even small businesses can get big gains from Big Data.

The Bottom Line is the Band Wagon

So what exactly is everyone supposed to do with all this data? We will continue creating more and more, that much is assured; the proverbial stack of DVDs that now reaches to the moon and back will reach Mars, then Jupiter, then maybe even the Oort cloud – humans like data. Steven Lohr’s assessment in his article “The Age of Big Data” is correct: “Despite the caveats, there seems to be no turning back. Data is in the driver’s seat. It’s there, it’s useful and it’s valuable, even hip.” Luckily, as the volume of data continues to grow, so does our ability to deal with it efficiently:

  • Distributed computing systems – These allow horizontal expansion of power and storage capacity, lower costs, more effective responses to system failures, and an ease of deployment that vertical models do not offer.
  • New BI Tools – NoSQL systems have more BI tool integrations entering the market all the time, and with new tools come better crossovers with traditional data warehouse structures, so the IT and business sides of enterprises can work together to use all that data.
  • Better Processes – Many corporations are now instituting separate processes for governance, security, quality and management of their Unstructured Data systems. Such systems require different actions than traditional data systems and such differences are now better understood.
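
The horizontal-scaling idea behind those distributed systems can be sketched in a few lines: hash each record’s key to pick the node that stores it, so adding nodes adds capacity. This is a deliberately minimal sketch (the node names are invented for illustration), not how any particular product works:

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]   # hypothetical cluster

def node_for(key, nodes=NODES):
    """Pick a storage node by hashing the key -- the essence of hash sharding."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

# The same key always maps to the same node, so reads find their data.
assert node_for("user:1001") == node_for("user:1001")

placement = {key: node_for(key) for key in ("user:1001", "user:1002", "user:1003")}
print(placement)
```

Real systems such as Cassandra and Riak refine this with consistent hashing, so that adding or removing a node reshuffles only a fraction of the keys rather than nearly all of them, as the naive modulo scheme above would.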

In the end it all comes down to the bottom line: corporations want to spend less money and make more money, while expanding services and possibilities. A recent report from the World Economic Forum titled “Big Data, Big Impact: New Possibilities for International Development” outlines the opportunities. The report specifically discusses the use of mobile phone data to help grow services for low-income individuals and emerging economies, but its essential focus can easily be extrapolated to all individuals and corporations, and to the use of data as a whole: “[B]uilding user-centric solutions offers compelling possibilities for providing better access to services in health, education, financial services, and agriculture for people living in poverty.” The successful collection, analysis, and use of the 1.2 zettabytes of world digital data (which includes the 295 exabytes of IT data) allows everyone to better understand consumer markets, information highways, idea exchange, areas of potential growth, and problem zones; this in turn allows everyone to create better financial systems, educational opportunities, distribution networks, regulatory bodies, sharing incentives, crisis models, and structures that aid all of us in ways we are only beginning to understand. We are in the driver’s seat; data is our Maserati GranTurismo; we have to learn to drive it like it’s supposed to be driven.

About the author

Charles Roe has been a professional freelance writer and copy editor for more than 15 years, and has been writing for the Data Management industry since 2009. He is the founder of CRScribes.com, his own writing and editing business. Charles has written on a range of industry topics in numerous articles, white papers, and research reports, including Data Governance, Big Data, NoSQL technologies, Data Science, Cognitive Computing, Business Intelligence & Analytics, Information Architecture, Data Modeling, Executive Management, Metadata Management, and a host of others. He holds advanced degrees in English and History, and a Cambridge degree in Language Instruction. He has worked for almost 20 years as an instructor of English, History, Culture, and Writing at the college level in the USA, Europe, and Turkey. He writes creatively in his spare time.

