An Article on Jonathan Hsieh’s NoSQL Now! 2011 Conference Presentation
by Charles Roe
In Part 1 of the article “Cloud Deployment of Hadoop and HBase,” the essential features of Hadoop and HBase were discussed. This article is based on Jonathan Hsieh’s four-part NoSQL Now! 2011 Conference tutorial and only covers part one of that presentation. If you’d like to watch all four parts, links are provided at the end of this article.
Apache Hadoop and HBase are related open source projects. Apache Hadoop is an open source, horizontally scalable system for reliably storing and processing massive amounts of data across many commodity servers. Apache HBase is an open source, horizontally scalable, sorted map store built on top of Apache Hadoop. Hadoop itself is built from two parts: the Hadoop Distributed File System (HDFS) and Hadoop MapReduce. Together they allow organizations to reliably collect and store volumes of data that would not be efficient or cost effective to manage in traditional relational database systems.
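To make the “sorted map store” idea concrete, here is a minimal sketch that writes and reads one cell through the HBase Java client. The table name, column family and row key are hypothetical examples, not taken from the presentation, and the calls shown are from the HBase 1.x+ client API (the client available at the time of the 2011 talk used the older HTable class, but the data model is the same).

```java
// A minimal sketch (hypothetical table/column names) of HBase as a sorted map:
// each cell is addressed by (row key, column family, column qualifier), and rows
// are kept sorted by row key across the cluster.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSortedMapSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml from the classpath
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("clicks"))) {

      // Write one cell: row key "user42|2011-09-20", family "raw", qualifier "url".
      Put put = new Put(Bytes.toBytes("user42|2011-09-20"));
      put.addColumn(Bytes.toBytes("raw"), Bytes.toBytes("url"), Bytes.toBytes("/index.html"));
      table.put(put);

      // Read it back by row key; lookups and range scans are cheap because rows are sorted.
      Result result = table.get(new Get(Bytes.toBytes("user42|2011-09-20")));
      byte[] url = result.getValue(Bytes.toBytes("raw"), Bytes.toBytes("url"));
      System.out.println("url = " + Bytes.toString(url));
    }
  }
}
```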
The first part of this article gave a general history of the systems, named some of the biggest companies using them today, gave an idea of the sizes of these systems and explained why it is becoming ever more necessary for enterprises to start considering NoSQL platforms. When an organization is collecting some 200GB or more of data per day (and many collect far more), and that data comes from the Internet, it arrives raw and dirty. Enterprises need to collect and store it without having to force it into a schema upon collection. They need the ability to deal with the enormous volume, variety and increasing velocity of data on the Internet, along with their standard data warehouse loads, in ways that were never necessary before.
Part 2 of this article will discuss some of the reasons why an enterprise may want to implement a NoSQL option like Hadoop, how traditional relational assumptions need to be left behind in such a scenario and how Hadoop gives an enterprise data agility it has never had before.
Telltale Signs an Organization Might Need NoSQL
Traditional relational databases have the ability to store all the data an organization can collect; they are very good at it and have been since their development. But under many circumstances a relational database may not be the right tool for the problem at hand, and it may be necessary to consider other options, including NoSQL systems like HBase, MongoDB, Hadoop or one of the many other players in this area:
- Already scaled vertically? The simplest and most cost-effective way to deal with an increase in data is to get a bigger server or servers. If an organization’s applications are already built on top of a relational database like MySQL or Postgres, then scaling vertically may be the most viable option. Going from an m1.large machine to an m2.4xlarge is certainly easier than completely reworking the entire information-processing architecture. But if this option no longer works, it’s time to try something else.
- Changed schema and queries? If more processing power has not been enough to keep up with increasing data loads, then a few schema and query “tricks” may work, at least for the time being. First, remove text-search queries such as LIKE clauses; they are expensive, and removing them will streamline processing. Remove joins from the database and move them into the application if they are still needed or wanted. Twenty years ago seeks and sequential reads cost about the same; today seeks are far more expensive than sequential reads/writes, so reduce them as much as possible (multiple seeks are even worse). Remove constraints such as foreign keys and encode those relationships in the application, giving up the checks the database would otherwise enforce. Next, denormalize so that everything a query needs lives in a single table, which allows for better optimization. If these measures are not working well enough, or if the organization is doing lots of full table scans, then it is probably time for something like Hadoop; it is built for large sequential scans of data.
- Need to scale reads? Database replication can handle a situation where it is necessary to go beyond the standard, reasonable 80:20 read-to-write ratio, and if more reads are required, changing the schema can also help. Adding a caching tier such as memcached helps too, since frequently read items can be served from memory (and network calls to the cache can be batched), but this won’t work for everyone. As long as some replication lag is tolerable, read replicas are an option; if not, look for a better solution.
- Need to scale writes? Eventually write volume grows to the point where a single database cannot keep up, and scaling writes brings diminishing returns. It is possible to shard and federate the database, but that ultimately costs consistency and any guarantee on the order of operations across the different shards. Since related information must be written to several shards, it also becomes necessary to propagate that information to the different databases along the way. Graphic Three shows this problem: the blue line is ideal horizontal scaling; the red line is what happens when replication is used across multiple databases, with diminishing returns as the constraints and coordination between one database and another increase. A minimal sketch of this kind of application-level sharding follows this list.
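As a rough illustration of the sharding approach described in the last point, the sketch below routes rows to shards by hashing a key at the application layer and shows how one logical operation fans out into separate writes on different shards with no transaction spanning them. The Shard class, key layout and follower example are hypothetical stand-ins rather than anything from the presentation.

```java
// A minimal, self-contained sketch of application-level sharding: each key is
// routed to one of several databases by hashing it. follow() shows why one
// logical update becomes separate writes on different shards, with no single
// transaction covering both -- the lost consistency and ordering guarantees.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ShardingSketch {

  /** Hypothetical stand-in for one physical database holding a slice of the rows. */
  static class Shard {
    final String name;
    final Map<String, List<String>> rows = new HashMap<>();
    Shard(String name) { this.name = name; }
    void append(String key, String value) {
      rows.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
      System.out.printf("%s: append %s -> %s%n", name, key, value);
    }
  }

  private final List<Shard> shards = new ArrayList<>();

  ShardingSketch(int shardCount) {
    for (int i = 0; i < shardCount; i++) shards.add(new Shard("shard-" + i));
  }

  /** The application, not the database, now decides where every row lives. */
  Shard shardFor(String key) {
    return shards.get(Math.floorMod(key.hashCode(), shards.size()));
  }

  /** One logical operation, two independent writes that may land on different shards. */
  void follow(String follower, String followed) {
    shardFor(followed).append("followers:" + followed, follower);
    shardFor(follower).append("following:" + follower, followed);
    // If the process dies between these two lines, the shards disagree, and no
    // database-level transaction exists to roll either write back.
  }

  public static void main(String[] args) {
    ShardingSketch db = new ShardingSketch(4);
    db.follow("alice", "bob");
    db.follow("carol", "bob");
  }
}
```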
In all of these telltale signs, the possibility of a machine going down and taking data with it remains a reality – a problem that cluster computing and Hadoop are designed to absorb. Moreover, the question could be asked: “did we really just optimize the database by discarding some of the fundamental features of a SQL/relational database?” The answer is yes. The options listed above give up many features that traditional databases spent years adding. If giving up those features is an acceptable trade-off compared with entirely changing the organization’s assumptions about data management, then moving to a NoSQL system is probably not necessary. If it is not acceptable, then moving to a new system with an entirely different set of assumptions may be the best choice.
Transform the Assumptions
NoSQL systems are built with assumptions distinct from those of traditional SQL/relational systems. The best way to begin changing how an enterprise stores data is to look at those traditional assumptions, see how a NoSQL system differs and make educated choices about which assumptions to drop. Only then is it possible to implement a data storage system that deals more efficiently with the needs of modern big data:
- Assumption 1 – The whole workload fits on one machine: This assumption has already been dropped by many organizations that need considerably more computing power than a single machine can provide. As demonstrated in the last section, it is possible to scale vertically to meet more processing requirements, or to add more machines and shard the database system across them. But with many companies collecting upwards of 200GB of logs or more per day (every API call, every web click, every audit trail and much more that the organization needs), such an option is not viable; storage at that scale demands processing that can grow along with it. Other options include keeping only a few days of data and sending the rest to dev/null, keeping just a sample of the data to get a more longitudinal view, or archiving it to other media, but those options are neither preferable nor necessarily cost effective. If an enterprise wants to collect, store and analyze all of its data, then cluster computing becomes necessary – NoSQL systems can do this.
- Assumption 2 – Machines deserve identities: In the early years of Facebook, when each school (Harvard, Stanford, Berkeley) had a separate database and a separate identity, sharding along those identities was possible. When Facebook expanded to include everyone, with innumerable relationships across the entire system, that approach became untenable. In a sharded database system, each database runs one part of the system, but consistency, diminishing returns, ordering of operations and loss of data become crucial problems. Cluster computing with a system like Hadoop fixes this; machines don’t need identities, clusters work more efficiently and data loss becomes much less of a worry.
- Assumption 3 – Machines can be reliable: A 2007 Google study showed an average hard disk failure rate of 2-8% per year. In a 100-node cluster with 1200 drives, that means a disk fails every 15-60 days. In a vertically scaled, sharded setup that is a fairly significant exposure to data loss even with replication, or at least a much higher chance of such problems. When a drive fails in a cluster, there are no red flags. Hardware techs don’t get a call at 3 a.m. to get out of bed and replace a drive, there is no loss of consistency and no data loss (though of course in extreme situations some can still occur, since no system is fail-proof). The hardware techs or the organization using the hardware can run a report, see that a disk failed, put in a work order and have it replaced in a timely fashion, rather than in an immediate “oh my, the world has ended” fashion.
Hadoop/HDFS is one of many available systems that can rework these assumptions for an enterprise. It provides cost-effective, workable solutions to the real-world problems so many companies face today amid the seemingly exponential growth of data over the past few years.
Conclusion – Hadoop Gives Agility
The ability to store data in a raw, uncooked format is a significant benefit for many organizations. In a traditional “schema on write” process within a standard relational database, the data is cleaned up and forced into a schema at the moment it is collected. If an organization knows exactly what schema it wants to use, this is a functional and effective way to deal with that data. But if the data is the dirty, voluminous kind collected from the Internet, then putting it into a non-relational store allows considerably more “cooking” options. Hadoop allows an enterprise to put all the raw data into HDFS, use MapReduce to clean it up, then run an ETL-type job into a relational database and employ traditional BI tools to analyze it. If the resulting reports are not what the enterprise is looking for, the uncooked data still exists within HDFS; there is no need to force it into a “schema on write,” because a “schema on read” approach can be applied to that data over and over, as many times as needed (a minimal sketch of such a cleanup pass appears below). Hadoop gives an organization a kind of data agility that it could never achieve within a relational database system. There are certainly drawbacks still being worked out in the NoSQL, non-relational development world, but as more of those issues are dealt with, platforms like Hadoop are going to become ever more important to enterprises of all shapes and sizes.
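As a rough sketch of the “schema on read” cleanup step described above, the map-only MapReduce job below parses raw, whitespace-separated log lines into a tab-separated form each time it runs, leaving the original data untouched in HDFS. The class name, input paths and field layout are assumptions for illustration only, not details from the presentation.

```java
// A minimal "schema on read" sketch: the raw lines stay in HDFS as-is, and a
// schema is imposed only when this map-only job runs over them.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SchemaOnReadJob {

  public static class CleanupMapper
      extends Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      // Hypothetical raw format: "timestamp userId action ..." separated by whitespace.
      String[] fields = line.toString().trim().split("\\s+");
      if (fields.length < 3) {
        return; // dirty record: skip it; the raw line is still in HDFS for a later pass
      }
      String cleaned = fields[0] + "\t" + fields[1] + "\t" + fields[2];
      context.write(NullWritable.get(), new Text(cleaned));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "schema-on-read cleanup");
    job.setJarByClass(SchemaOnReadJob.class);
    job.setMapperClass(CleanupMapper.class);
    job.setNumReduceTasks(0); // map-only: just reshape the raw data
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // raw logs in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // cleaned, schema-shaped output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The cleaned output could then be loaded into a relational database for BI tools, while the raw data remains available for a different “reading” of the schema later.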
If you’d like to watch the video of Jonathan Hsieh’s tutorial, along with the other three parts of his NoSQL Now! 2011 Conference presentation, see the links below: