by Charles Roe
Jonathan Hsieh, an HBase engineer at Cloudera, began his four-part NoSQL Now! 2011 Conference tutorial-presentation on Hadoop and HBase by discussing two different computer programs. The first was an old psychotherapist program from his youth, where you could ask a question like, “What is the capital of Argentina?” and possibly get back a correct answer. The second was Watson, IBM’s artificial intelligence system, which defeated two of the most successful human champions in the history of Jeopardy!. The point of Mr. Hsieh’s introduction was not to show the advances in computer programming or artificial intelligence. It was to highlight that the primary difference between the two examples was the amount of data each system had available: the psychotherapist program had a very small amount of accessible data, while Watson had a vast store of human knowledge drawn from the Internet. All that data was cleaned, organized and indexed so that Watson's algorithms could access it quickly.
The modern world of data management is now grounded on terabytes and petabytes, not megabytes and gigabytes. The evolution of social media and social analytics has pushed data storage requirements to levels barely imagined 10 or 15 years ago. For many organizations – no matter their size – data is their livelihood, whether internally created or mined from the vast stores of information on the Internet, but much of that data is dirty, raw and uncooked.
People care about data because it allows them to make better decisions; organizations need data because it allows them to get ahead in the marketplace, make better predictions, create better models and achieve better results. One of the primary problems with big data today is how to manage it efficiently and as cost-effectively as possible.
What are Apache HBase and Hadoop?
These related and popular NoSQL systems enable organizations to handle petabyte-level volumes of raw data. They are both distributed systems that work on different assumptions and different design concepts than conventional relational databases. Apache Hadoop is an open source, horizontally scalable system for reliably storing and processing massive amounts of data across many commodity servers. Apache HBase is an open source, horizontally scalable, sorted map store built on top of Apache Hadoop. But what do all those terms mean?
- Open Source is a term that refers to an array of community projects that work to create something better through cooperation and sharing. Many different companies, developers, programmers, business-minded people and other individuals have helped make the idea of open source a reality. Both Apache HBase and Apache Hadoop are top-level open source projects released under the Apache 2.0 license; anyone can use, modify, share and build on the software under the terms of that license. Companies involved in their development include Cloudera, Facebook, Yahoo!, eBay, StumbleUpon, TrendMicro and numerous others.
- Horizontally scalable refers to the ability of an organization to expand its computing power by adding more commodity servers, as opposed to vertically scalable, which refers to upgrading a single machine. These Apache open source platforms allow enterprises to store and access data on literally thousands of servers, with a roughly linear increase in storage capacity, processing power and input/output operations. If a company has 100 machines that take 24 hours to do a job, then doubling the number of machines should cut that time in half (see Graphic One).
- Commodity servers are used in commodity computing, whereby a large number of servers are clustered together to allow for significantly enhanced performance at lower costs. A typical commodity server, circa 2010-11, has two quad-core CPUs, each running at 2-2.5GHz or more, with 24-32GB of RAM, 12x1TB hard disks (though more is available now) in a JBOD (Just a Bunch of Disks) configuration and a gigabit Ethernet connection, with each server costing around $5-10k. There is certainly a much greater range of performance and price for such servers, but the goal of commodity computing is to reach the sweet spot where an enterprise gets the most power at the lowest cost.
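The scaling and hardware figures above lend themselves to a quick back-of-the-envelope calculation. The following is a minimal sketch, not a model of any real cluster: the function names are hypothetical, the scaling is assumed to be perfectly linear (as in the 100-machine example), and the storage figure assumes the 12x1TB disk configuration described for a typical commodity server. Real clusters rarely scale perfectly, and usable capacity is lower once data is replicated.

```python
def ideal_runtime_hours(base_machines: int, base_hours: float, machines: int) -> float:
    """Estimate job runtime under perfectly linear scaling (an assumption).

    Total work in machine-hours is held constant, so runtime is inversely
    proportional to the number of machines.
    """
    if base_machines <= 0 or machines <= 0:
        raise ValueError("machine counts must be positive")
    return base_machines * base_hours / machines


def raw_storage_tb(servers: int, disks_per_server: int = 12, tb_per_disk: float = 1.0) -> float:
    """Raw (unreplicated) disk capacity of a cluster, in terabytes."""
    return servers * disks_per_server * tb_per_disk


# Baseline from the text: 100 machines finish the job in 24 hours.
for n in (100, 200, 400):
    hours = ideal_runtime_hours(100, 24, n)
    print(f"{n} machines: {hours:.1f} h, {raw_storage_tb(n):.0f} TB raw disk")
```

Holding the total machine-hours constant is what "linear" means here: every doubling of the server count halves the estimated runtime and doubles the raw disk capacity.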
Hadoop is actually built from two different parts: the Hadoop Distributed File System (HDFS) and Hadoop MapReduce. The project began in 2002 as Nutch, an open-source web crawler created by Doug Cutting and Mike Cafarella. They wanted to create a project everyone could benefit from, not just large search engine providers. Then in 2003 and 2004 two groundbreaking papers were published:
- SOSP 2003: “The Google File System” by Sanjay Ghemawat, Howard Gobioff and Shun-Tak Leung
- OSDI 2004: “MapReduce: Simplified Data Processing on Large Clusters” by Jeffrey Dean and Sanjay Ghemawat
The two papers inspired many people in the industry and gave a number of projects important insights for moving forward. In 2006, Doug Cutting released Hadoop (reportedly named after his son’s toy elephant), an open source project combining a distributed file system with MapReduce, which today is used and developed by many of the top enterprises in the industry. In 2008, Yahoo! claimed to be running Hadoop in a 4000 node cluster; in the same year Hadoop won the Terabyte sort benchmark by sorting a terabyte of data in 209 seconds. Hadoop in production today includes:
- Yahoo! Hadoop clusters: As of June 2011 they have over 82PB and 40,000 machines.
- Facebook: Adds at least 15TB of data per day to its Hadoop system, runs over 1200 machines and claims to have a single cluster of 30PB.
- Twitter: Adds over 1TB of data per day and has more than 120 nodes.
- There are hundreds, if not thousands, of examples of enterprises running 5-40 node clusters that do not have petabytes of data, but prefer the flexibility of NoSQL and Hadoop systems.
Enterprises today are generating, collecting, processing and reporting on volumes of data never seen before. Hadoop and other similar systems were designed specifically to deal with those data loads, but building such a system is not necessarily the best approach for all organizations. It may be cheaper, easier and more efficient for a given organization to stay with a vertical model of computing and keep its standard relational databases and traditional data warehousing structures intact. Many enterprises are now taking a dual approach, whereby they use Hadoop alongside their legacy systems to gain the benefits of both. But there are certain reasons why an organization should consider moving to a NoSQL, HBase/Hadoop, commodity cluster computing model.
Building a Web Index – an example of Google
Google beta was released in the late 1990s (circa 1998-99) by fellow Stanford students Larry Page and Sergey Brin. Their goal was to build an index that encompassed the entire Internet: store all of it, analyze all of it to build the index, create page rankings and repeat the process endlessly. No problem, right? The numbers involved are staggering, even when estimated at only about 50KB per web page and 500 bytes per URL:
- 1998: 26 million indexed pages with the total size about 1.3TB
- 2000: 1 billion indexed pages @ 50TB in size
- 2008: ~40 billion pages @ 2,000TB
- 2008: 1 trillion URLs @ 500TB just in URL names
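Any figure in this list can be checked against the two stated estimates. The sketch below is illustrative only: the constant and function names are hypothetical, the page and URL sizes are the rough averages quoted above rather than measured values, and decimal units (1TB = 10^12 bytes) are assumed.

```python
TB = 1000 ** 4            # bytes per terabyte, assuming decimal units
PAGE_BYTES = 50 * 1000    # ~50KB per web page (estimate from the text)
URL_BYTES = 500           # ~500 bytes per URL (estimate from the text)


def size_tb(count: int, bytes_each: int) -> float:
    """Total size in terabytes of `count` items of `bytes_each` bytes."""
    return count * bytes_each / TB


# (label, number of items, assumed bytes per item)
estimates = [
    ("1998 pages", 26_000_000, PAGE_BYTES),
    ("2000 pages", 1_000_000_000, PAGE_BYTES),
    ("2008 pages", 40_000_000_000, PAGE_BYTES),
    ("2008 URLs", 1_000_000_000_000, URL_BYTES),
]

for label, count, bytes_each in estimates:
    print(f"{label}: {size_tb(count, bytes_each):,.1f} TB")
```

Changing `PAGE_BYTES` or `URL_BYTES` shows how sensitive the totals are to the assumed averages: the estimates scale linearly with both the item count and the per-item size.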
As of 2011, those numbers are even larger and they are growing faster all the time. Everyone wants constantly updated web stats, search queries, reporting metrics and other information from Google – now, not two weeks from now. Dealing with all that data poses three primary problems:
- Volume – How does an organization store that much data? Where can they store it?
- Variety – How can they deal with it? The Internet is dirty and uncooked: not all pages parse properly, programmers and web designers use bad code in a host of formats, pages are badly organized and there are no strict rules. The sheer variety of data includes standard text pages, movies, Flash, XML, microformats and innumerable others.
- Velocity – How can any organization, even one without the size and complexity of Google, keep up with the constant changes on the Internet? Enterprises must constantly pull in more data, organize it, clean it and analyze it, and the process never stops. It isn’t going to help if eBay updates its search index every other day; people want immediate results.
Google built its own infrastructure along these lines (the file system and MapReduce framework described in the papers above), and eBay has deployed Hadoop, because such systems offer the flexibility, scaling options and capacity needed to handle such voluminous workloads. But such a deployment may also aid smaller organizations that are not downloading terabytes of data every day or trying to analyze petabytes of data.