by Charles Roe
Taxonomies are important; they provide classification systems for scientists, anthropologists, sociologists, psychologists and, for the purposes of this article, Data Management (DM) professionals. Without a taxonomy it is nigh impossible to create hierarchies, categories, descriptions, structures, queries, values, references and any number of other terms used in DM daily life. The creation of a taxonomy, aka semantic architecture, in a given enterprise and across the entire industry is about assigning accurate terminology to various concepts, practices, configurations and other “things” to allow for reliable communication between the various participants. Therefore, to clear up any preconceptions before beginning, let’s create a simple taxonomy that will be used throughout this article:
- SQL: Structured Query Language. This term refers only to the programming language used to access and manipulate traditional relational databases.
- RDBMS: Relational Database Management Systems. In 1970, Edgar F. Codd wrote the article “A Relational Model of Data for Large Shared Data Banks” that laid the foundation for relational databases. RDBMS have gone through innumerable changes since the 1970s, but have fundamentally remained the same. Some of the best known RDBMS include MS SQL Server, IBM DB2, MySQL, Sybase, PostgreSQL, Oracle RDBMS and Informix (though there are also many others).
- NoSQL: Not Only SQL. The purpose of this article is to give a general history of this concept. But, since it has gone through various revisions over the course of many years, its meaning needs to be separated into the past and the present. In practice it currently covers a large range of non-relational databases that have recently become the driving forces of the industry.
- NoREL: Not only Relational? No Relational? A term not in general use, but one that should probably be used to define non-relational databases, as it is more semantically correct for this burgeoning taxonomy. We will not employ the term NoREL to any great extent; we just wanted to point it out for discussion purposes.
Most readers of this article should be nodding their heads and wondering why it is necessary to include terminologies that are such common industry standards. They have been added to make a point: traditional relational databases have been around for decades, they are known entities across the industry, they have names everyone knows and understands, they have large tool sets, and they have been implemented as primary storage systems for literally millions of companies worldwide. Such truisms are not so clearly delineated when we step into the labyrinth that is now known collectively as NoSQL.
The Past History of NoSQL
Carlo Strozzi coined the term “NoSQL” in 1998 with the development of his new, relational database model that no longer used SQL as its base programming language. The point of Strozzi’s new system was not to break away from RDBMS, but instead to streamline many features of RDBMS:
- Simplify the entire structure so more casual (non-computer experts) users could actually work with the system
- Make the system portable so it could work between different types of machines
- Allow the system to run effectively on any UNIX machine as a shell-level tool
- Cost less than standard (and expensive) commercial products and not be so feature packed as most other RDBMS
- Remove arbitrary limits on such elements as data field size, number of columns and others
Strozzi did not conceive of NoSQL as some sort of “Big Name Database” or “a complex monolithic piece of software that sits on a local or a network socket listening for connections by client programs.” In fact, NoSQL wasn’t even a DBMS in any real sense of the word. It was only a “set of shell utilities meant to manipulate ordinary text files and relate them to one another in a database-like structure.” Strozzi’s new tool essentially harkened back to the humble, and by 1998, fairly antiquated DBMS of decades past; those that laid the groundwork for today’s surfeit of database options. Knut Haugen’s article “A Brief History of NoSQL” gives an excellent timeline of the primary developments in database technology since the 1960s; some of his essential points are enumerated below:
- IBM’s IMS: a hierarchical database developed in 1966 for the Apollo space program.
- AT&T DBM: created by Ken Thompson in 1979. It means “Database Manager.”
- NDBM: The University of California, Berkeley’s 1986 version of DBM, which meant simply “New Database Manager.” It allowed many databases to be open at the same time. Later manifestations include TDBM, SDBM and GDBM.
- GT.M: Developed by Greystone Technology in the 1980s, which is where the database gets its name (Greystone Technology M). GT.M was the first version of a Key/Value store with high throughput processing. It was eventually open sourced in 2000.
- BerkeleyDB: Also developed at Berkeley (it seems everything was at this time). It was created between 1986 and 1994, during the transitional period between 4.3BSD and 4.4BSD (Berkeley Software Distribution or Berkeley UNIX).
- Lotus Domino: This is the server piece of Lotus Notes and was originally released in 1989.
- Mnesia: Developed in the 1990s by Ericsson, it was built with the Erlang programming language (not SQL) as a real-time, relational database for the telecom industry.
This list is by no means complete and should also include MultiValue (PICK) databases from TRW in the mid-1960s, M[umps] from Mass General Hospital in 1966, InterSystems Caché from 1997, and Metakit (the “generally accepted” first document-oriented DB) also from 1997. There are of course numerous other precursors to modern database incarnations, both relational and non-relational, but those above serve as some of the primary players in the long history of database development that has led us to the modern world of NoSQL.
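The DBM family listed above still has direct descendants in many standard libraries. As a minimal sketch of the key/value model these early systems offered, the Python snippet below uses the standard-library `dbm.dumb` module, a portable DBM-style store. The database name and the keys are hypothetical, chosen only for illustration; this is not a reconstruction of any of the historical systems.

```python
import dbm.dumb
import os
import tempfile


def demo_key_value_store():
    """Store and fetch records by key, with no schema and no SQL."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "missions")  # hypothetical database name
        # "c" opens the database read/write and creates it if it is missing.
        with dbm.dumb.open(path, "c") as db:
            db[b"apollo11"] = b"1969"
            db[b"apollo13"] = b"1970"
        # Reopen read-only to show that the pairs were persisted to disk.
        with dbm.dumb.open(path, "r") as db:
            return db[b"apollo11"], sorted(db.keys())


value, keys = demo_key_value_store()
print(value, keys)
```

The only operations are "put a value under a key" and "get the value for a key," which is the essential contrast with the relational model's tables, joins and query language.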
The Current History of NoSQL
Skip forward from 1998 to June 11, 2009: Eric Evans, a Rackspace employee at the time, goes to a meetup on open source, distributed, non-relational databases. He reintroduces the term NoSQL as specifically referring to non-relational databases. Strozzi has since said that the term should really be NoREL, but NoSQL has stuck and is now the apex word that refers to a vast taxonomy of non-relational database structures, classifications, definitions, references, architectures and other semantic elements. Johan Oskarsson was the organizer of the event, so there is some question as to whether Evans or Oskarsson really re-coined the term as a reference to the rapidly multiplying number of distributed, non-relational systems erupting into the Data Management universe; they should probably both get credit. A few necessary highlights in the timeline between the years 1998 and 2009 were:
- 2000: The release of the graph database Neo4j and object database db4o. These are more developed precursors to what would enter the landscape later in the decade.
- 2003: Memcached is developed by Danga. It is not a database (memcachedb is though), but provided a necessary element in the growth of distributed systems, with its “distributed memory object caching system.”
- 2003 SOSP Conference: Publishing of the paper “The Google File System” by Sanjay Ghemawat, Howard Gobioff and Shun-Tak Leung. It provided an innovative system for dealing with the Big Data needs of Google and set the foundation for most of the current distributed systems on the market.
- 2004 OSDI Conference: Publishing of the paper “MapReduce: Simplified Data Processing on Large Clusters” by Jeffrey Dean and Sanjay Ghemawat. MapReduce “is a programming framework popularized by Google and used to simplify data processing across massive data sets.”
- 2004: Work on Google Bigtable begins; the research paper “Bigtable: A Distributed Storage System for Structured Data” by Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes and Robert E. Gruber is published in 2006. Bigtable is Google’s “distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers.”
- 2005: CouchDB is released as a document database and eventually moves to the Apache Foundation in 2008.
- 2006-2009: The distributed, non-relational “NoSQL” systems JackRabbit, Amazon Dynamo, MongoDB, Cassandra, Project Voldemort, Terrastore, Redis, Riak, and HBase all enter the industry in one version or another.
There were many other systems released and hundreds of papers published during those years that could be listed as important forces in the rapid expansion of NoSQL into the DM marketplace. Those detailed above only stand as a few highlights of the many possible examples that have now forever transformed the landscape of Data Management.
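For readers who have not seen the MapReduce programming model mentioned in the timeline above, the following is a minimal single-process sketch of its three stages (map, shuffle, reduce) applied to a word count. It is an illustration only, not Google's implementation; all function names are hypothetical, and a real framework would distribute the map and reduce calls across many machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each map_phase and reduce_phase call is independent of the others,
    # which is what lets a real cluster run them in parallel.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = word_count(["big data", "big table"])
print(counts)
```

Because the programmer only writes the map and reduce functions, the framework is free to handle partitioning, scheduling and failure recovery, which is the simplification the paper's title refers to.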
Conclusion – Why did it All Happen?
If there is blame to be laid at the feet of anyone or anything, the blame certainly can go to the Internet when discussing the ever-increasing need for distributed, non-relational systems: from the Internet came terminologies like social media, social networking, social analytics and crowdsourcing; from the industry side arose time-honored terms that took on new meanings and new exigencies like Data Warehousing, Data Mining, Data Governance and Master Data Management; from the amalgamation of all those terms ascended the new giants of modern Data Management, vast exabyte and zettabyte monoliths called Unstructured Data, Big Data and The Cloud.
In all of this seemingly formless chaos there are other taxonomies that need to be discussed further. In 2000, Eric Brewer presented his keynote speech “Towards Robust Distributed Systems” at the ACM Symposium on Principles of Distributed Computing and the CAP Theorem was born. Enterprises had to decide what was more important: Consistency, Availability or Partition Tolerance? They could effectively only choose two and the other would have to suffer. The respected database reliability test, ACID (Atomicity, Consistency, Isolation, Durability), has now fallen prey to its nemesis, BASE (Basically Available, Soft state, Eventually consistent). The taxonomies so often discussed in the cubicles, rack spaces and break rooms worldwide have changed.
Everyone in the industry knew that the needs of Big Data were altering the landscape. According to the Couchbase paper “NoSQL Database Technology: Post-relational data management for interactive software systems,” RDBMS technology could not keep pace with the needs of modern web applications; new systems were needed. Vocabularies like sharding, denormalizing, clustering, horizontal scalability and literal dictionaries full of others have now become commonplace at DM conferences around the globe. In a white paper from DataStax titled “NoSQL in the Enterprise: A Guide for Technology Leaders and Decision-Makers,” clarity is presented in a few simple sentences:
There hasn’t been such a rapid shift to a new method for storing data since the move from hierarchical to relational data stores. Conferences devoted to addressing modern data management challenges have been sold out – and most have focused agendas on NoSQL topics. Technology leaders are no longer addressing the question of if they’ll have a NoSQL strategy, but rather when their NoSQL strategy will roll out – and more importantly, what it will be comprised of.
The beginning of this article brought up the point that maybe NoSQL should really be called NoREL. But in reality such a semantic difference is now immaterial. NoSQL has solidified itself in the vernacular of DM professionals everywhere; with it comes an ever-unfolding taxonomy of terminologies, classifications, references, structures and practices that have revolutionized the world of Data Management.
Taxonomy debates aside, it’s a good time to be part of the industry.
Further Useful Resources
Grijalva, D. (2011). Selecting the Right NoSQL Tool for the Job [1-part video]. Retrieved from http://www.dataversity.net/archives/6774.
Goodman, N. (2011). BI/Analytics on NoSQL: Review of Architectures [1-part video]. Retrieved from http://www.dataversity.net/archives/6632.
Haugen, K. (2010). Analysis of the NoSQL Landscape. Retrieved from http://blog.knuthaugen.no/2010/03/the-nosql-landscape.html.
McCreary, D. (2011). NoSQL 101: An Introduction to Newcomers [4-part video]. Retrieved from http://www.dataversity.net/archives/6548.
NoSQL.org (2011). Links, Articles, Blobs. Retrieved from http://nosql-database.org/links.html.
NoSQL.org (2011). List of NoSQL Databases. Retrieved from http://nosql-database.org/.
Oracle (2011). An Oracle White Paper: Oracle NoSQL Database. Retrieved from http://www.oracle.com/technetwork/database/nosqldb/learnmore/nosql-database-498041.pdf?ssSourceSiteId=ocomen.
Steier, S. (2011). NoSQL? How About NoDBMS? [1-part video]. Retrieved from http://www.dataversity.net/archives/6792.
Stonebraker, M. NewSQL vs NoSQL for New OLTP [1-part video]. Retrieved from http://www.dataversity.net/archives/5287.
Vogels, W. (2012). Amazon DynamoDB – a Fast and Scalable NoSQL Database Service Designed for Internet Scale Applications. Retrieved from http://www.allthingsdistributed.com/2012/01/amazon-dynamodb.html.