Just as the Kleenex brand became synonymous with facial tissue, Google’s Bigtable has become synonymous with the subset of NoSQL databases known as Big Table databases. Bigtable defined the entire sector, so tabular databases are now commonly called “Big Table” databases in the nascent NoSQL industry. This final entry in DATAVERSITY’s™ series on the NoSQL database movement takes a look at Big Table databases.
Practically every other Big Table database in the industry (with some exceptions) is modeled after or inspired by Google’s Bigtable, so it makes sense that tabular databases are now commonly identified with the Big Table moniker. Google’s influence in technology goes far beyond search algorithms, search advertising, and mobile operating systems.
When trying to find a definition for Big Table databases, it is a good idea to look to the source. Google’s white paper on Bigtable describes the technology behind their tabular data store as follows:
“Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.”
This article focuses mainly on Bigtable, but it also looks at other Big Table databases in the industry and mentions a few systems that combine tabular database functionality with other NoSQL database types.
Google Bigtable Defines Big Table Databases
Given Google’s primary business focus on web search, the ability to quickly access huge amounts of data distributed across a wide array of servers became vital. This is just the kind of application, unsuitable for the relational model, that created the NoSQL movement and related technologies like MapReduce. Google built Bigtable primarily for internal applications, and only makes it available externally as a data store for customers who use the Google App Engine Platform as a Service offering.
In some ways it is easiest to think of a Bigtable database as one giant table that acts as a three-dimensional Key-Value data store. As mentioned in the Google white paper, this single table is indexed by three discrete values: a row key, a column key, and a timestamp. Together, these keys point at a value, which is a byte array that can hold almost anything.
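The three-part index described above can be sketched as a small in-memory map. This is only an illustration of the data model, not Google's implementation; the `ToyTable` class, its method names, and the sample row and column keys are all assumptions made up for this example.

```python
from typing import Dict, Optional, Tuple

# Hypothetical toy model: (row key, column key, timestamp) -> raw bytes.
Cell = Tuple[str, str, int]


class ToyTable:
    """A sparse map: only the cells that were written take up space."""

    def __init__(self) -> None:
        self._cells: Dict[Cell, bytes] = {}

    def put(self, row: str, column: str, timestamp: int, value: bytes) -> None:
        # The value is stored as opaque bytes; the table never interprets it.
        self._cells[(row, column, timestamp)] = value

    def get(self, row: str, column: str, timestamp: int) -> Optional[bytes]:
        # Returns None for any cell that was never written.
        return self._cells.get((row, column, timestamp))


table = ToyTable()
table.put("com.example.www", "contents:", 3, b"<html>v3</html>")
table.put("com.example.www", "contents:", 5, b"<html>v5</html>")

page = table.get("com.example.www", "contents:", 5)
missing = table.get("com.example.www", "anchor:other", 5)
```

Because the table stores nothing for cells that were never written, a row can have a handful of columns or thousands without any schema change, which is what "sparse" means in the white paper's definition.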
After its first appearance in 2004, Bigtable came to be used by more than 60 applications throughout Google, and MapReduce jobs handle much of the reading and writing of Bigtable data. Google Maps, YouTube, and Gmail also leverage the technology. Bigtable replaced older databases previously used at Google.
The uninterpreted byte array described in the Bigtable white paper can contain a wide variety of structured and unstructured data. For instance, Google’s internal clients serialize and deserialize data objects in and out of Bigtable. Parameters are used to select between memory- or disk-based storage for the underlying data.
The row keys in Bigtable are stored in string format. This allows rows to be sorted in lexicographical (or alphabetical) order, which helps the overall efficiency of the system, especially with regard to load distribution. Since Bigtable sees wide use in web-based search applications, this design allows information from the same domain to be stored close together in the database, with the URLs being used as row keys.
A range of rows in Bigtable is known as a “tablet.” The rows in a tablet tend to be stored on the same server, to enhance overall performance and load distribution. In a Bigtable implementation, one master server manages the location of tablets across a cluster of other servers. An internal service known as “Chubby” supports this process and handles the metadata for the Bigtable instance, as well as the locking and replication of data.
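A short sketch shows why sorted string keys keep a domain's pages together. The Bigtable white paper describes reversing the hostname portion of each URL for this purpose; the `row_key` helper and the sample URLs below are hypothetical and exist only to illustrate the idea.

```python
def row_key(url: str) -> str:
    """Reverse the hostname labels so pages of one domain sort together."""
    host, _, path = url.partition("/")
    reversed_host = ".".join(reversed(host.split(".")))
    return reversed_host + ("/" + path if path else "")


# Sample URLs (scheme omitted); these are made-up inputs for the sketch.
urls = [
    "maps.google.com/index.html",
    "www.example.org/about",
    "mail.google.com/inbox",
    "www.google.com",
]

# Plain string sorting now groups every google.com page side by side.
keys = sorted(row_key(u) for u in urls)
for key in keys:
    print(key)
```

With the hostnames reversed, a scan over one contiguous key range reaches all pages of a domain, instead of hopping around the table.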
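Since each tablet covers a contiguous range of sorted row keys, finding the server for a row is essentially a binary search over tablet boundaries. The sketch below assumes a fixed, made-up list of tablet end keys and server names; in a real deployment the boundaries shift as tablets split and move, and the lookup goes through the system's own metadata rather than a Python list.

```python
import bisect

# Hypothetical layout: each tablet covers row keys up to and including its
# end key. The last end key is a sentinel that sorts after the sample keys.
tablet_end_keys = ["com.google.www", "org.example.www", "\uffff"]
tablet_servers = ["server-a", "server-b", "server-c"]


def locate(row: str) -> str:
    """Return the server holding the tablet whose key range contains `row`."""
    index = bisect.bisect_left(tablet_end_keys, row)
    return tablet_servers[index]


first = locate("com.google.maps/index.html")
boundary = locate("com.google.www")
middle = locate("net.example.www")
last = locate("org.wikipedia.en")
```

Because neighboring row keys land in the same tablet, a scan over one domain's pages usually touches a single server.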
Column keys in Bigtable get grouped together as column families; usually data within a column family share the same data type. Google uses column families in their Webtable implementation of Bigtable to store all the anchors that refer to a web page. Once again, this design makes reads and writes more efficient in a distributed environment.
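To make the grouping concrete, the sketch below treats a column key as a `family:qualifier` string and reads back every column in one family. The `read_family` helper and the sample anchor data are assumptions for illustration, not part of any Bigtable API.

```python
from typing import Dict

# One hypothetical Webtable-style row: column key -> raw bytes.
row: Dict[str, bytes] = {
    "contents:": b"<html>...</html>",
    "anchor:cnnsi.com": b"CNN",
    "anchor:my.look.ca": b"CNN.com",
}


def read_family(cells: Dict[str, bytes], family: str) -> Dict[str, bytes]:
    """Return qualifier -> value for every column in the given family."""
    prefix = family + ":"
    return {
        column[len(prefix):]: value
        for column, value in cells.items()
        if column.startswith(prefix)
    }


anchors = read_family(row, "anchor")
```

Here each referring site becomes its own column qualifier inside the `anchor` family, so new referrers can be added without changing the table's definition.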
The timestamp index allows each cell in a Bigtable to contain multiple versions of the data indexed by time. Webtable stores the time Google crawled a specific URL in its timestamp index. The timestamps are stored in decreasing order, so the most recent version of a record can be read first.
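The versioning behavior can be sketched with a list of versions per cell, kept newest first. The function names and the integer timestamps are hypothetical; the sketch only mirrors the ordering described above and says nothing about how Bigtable stores versions on disk.

```python
from collections import defaultdict
from typing import DefaultDict, List, Optional, Tuple

# (row key, column key) -> list of (timestamp, value), newest first.
Versions = List[Tuple[int, bytes]]
cells: DefaultDict[Tuple[str, str], Versions] = defaultdict(list)


def put(row: str, column: str, timestamp: int, value: bytes) -> None:
    versions = cells[(row, column)]
    versions.append((timestamp, value))
    # Keep versions in decreasing timestamp order.
    versions.sort(key=lambda version: version[0], reverse=True)


def latest(row: str, column: str) -> Optional[bytes]:
    """The most recent version is simply the first entry."""
    versions = cells.get((row, column))
    return versions[0][1] if versions else None


def as_of(row: str, column: str, timestamp: int) -> Optional[bytes]:
    """Newest version written at or before `timestamp`, if any."""
    for version_time, value in cells.get((row, column), []):
        if version_time <= timestamp:
            return value
    return None


# Writes arrive out of order on purpose.
put("com.example.www", "contents:", 3, b"crawl at t=3")
put("com.example.www", "contents:", 6, b"crawl at t=6")
put("com.example.www", "contents:", 5, b"crawl at t=5")

newest = latest("com.example.www", "contents:")
older = as_of("com.example.www", "contents:", 5)
```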
As mentioned earlier, Bigtable works with MapReduce, and Google also provides a C++ API for creating and deleting tables and column families. The MapReduce wrapper allows Bigtable to be used as either an input source for or an output target of a MapReduce function call.
The fact that Google has kept Bigtable primarily in-house, with the exception of the availability through App Engine, led to the development of other data stores following the Big Table model.
Apache Accumulo Adds a Security Layer
The open source Apache Software Foundation has spawned a host of worthy technical initiatives, especially in the database industry. Two of these projects, Accumulo and HBase, leverage the design principles behind Google Bigtable, but in a form available for commercial use. Apache Cassandra received an earlier mention in DATAVERSITY’s Key-Value database article; its use of super columns is very similar to Bigtable’s column families.
Apache Accumulo actually saw its genesis at the National Security Agency in 2008; the agency then donated the product to Apache last year. Written in Java, Accumulo also uses three other Apache technologies: Hadoop for distributed processing of large data sets, Thrift to provide API functionality, and ZooKeeper to serve as traffic cop for configuration and synchronization, which is more or less what Chubby does for Bigtable.
One innovation Accumulo added to the Bigtable model was the concept of column visibility. This adds a layer of security to the data stored in Accumulo allowing users access to only the information they are authorized to view. This new feature makes sense considering the database’s history at the NSA.
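The idea behind column visibility can be sketched as a label attached to each cell and checked against the reader's authorizations. The evaluator below is a simplified stand-in written for this article: it is not Accumulo's API, and the label syntax (`&` for AND, `|` for OR, parentheses for grouping), the `visible` function, and the sample labels are assumptions used only for illustration.

```python
import re
from typing import List, Set


def visible(expression: str, authorizations: Set[str]) -> bool:
    """Evaluate a visibility label against the reader's authorizations."""
    tokens: List[str] = re.findall(r"[A-Za-z0-9_]+|[&|()]", expression)
    pos = 0

    def parse_or() -> bool:
        nonlocal pos
        result = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            right = parse_and()  # always parse, then combine
            result = result or right
        return result

    def parse_and() -> bool:
        nonlocal pos
        result = parse_term()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            right = parse_term()
            result = result and right
        return result

    def parse_term() -> bool:
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            result = parse_or()
            pos += 1  # skip the closing ")"
            return result
        return token in authorizations

    if not tokens:
        return True  # in this sketch, an unlabeled cell is visible to everyone
    return parse_or()


label = "admin|(audit&finance)"
auditor_can_read = visible(label, {"audit", "finance"})
intern_can_read = visible(label, {"audit"})
```

A scan would apply a check like this to every cell and silently skip the ones that fail, so two users running the same query can see different results.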
Additional Accumulo features include a framework for unit testing and TDD, improved data management functionality, and plug-ins that add functionality for load balancing and memory management.
Apache HBase Powers Facebook Messaging
Sharing a close relationship with Hadoop is Apache HBase, which is currently used for Facebook’s messaging application. Apache describes HBase as being built on top of the Hadoop Distributed File System in the same manner as Google Bigtable is built on top of the Google File System (GFS). If any users are looking for an open source version of Bigtable, here it is!
HBase leverages MapReduce as well as a Java API for client programming. Distributed processing features include configurable automatic sharding of tables and failover support between servers.
As with many other Apache projects, a robust community has grown around HBase. In addition to Facebook, the social bookmarking site StumbleUpon also uses the technology. Cloudera, a company providing commercial consulting services for Hadoop and HBase, hosted an HBase user conference in San Francisco earlier this year.
Hypertable Offers Commercial Support for Their Big Table Database
Hypertable is a tabular database inspired by Bigtable. It is available under the GNU General Public License, but Hypertable the company also offers commercial support and consulting services. Windows users are out of luck because the database only runs on Linux or Mac servers.
Written in C++, Hypertable also offers an API supporting a host of client languages including Java, PHP, Python, and Ruby. The database features true consistency: each successful write is reflected in every subsequent read.
Hypertable is 100 percent compatible with Hadoop, running on top of the Hadoop Distributed File System (HDFS) in a similar manner to HBase. It uses a facility called Hyperspace in the role of Google’s Chubby to manage locking and system metadata.
The Chinese search engine Baidu was a major sponsor of Hypertable, and current customers include the online auction site eBay, the coupon service Groupon, and the Indian email provider Rediff.com.
NoSQL Databases Including Tabular Functionality
As mentioned earlier, the super columns in the Key-Value database Apache Cassandra offer functionality similar to Bigtable’s column families, allowing more complex queries than those typically supported by Key-Value data stores.
OpenLink Virtuoso, covered in DATAVERSITY’s article on Graph databases, also includes tabular database functionality. Virtuoso offers practically every available database technology in its product!
Ultimately, tabular databases combine the model simplicity of a Key-Value database with the flexibility in persistence provided by a Document database, while offering the performance needed by highly distributed applications. Google’s Bigtable efforts have enriched the entire database industry, and are an important part of the NoSQL movement.