The NoSQL Movement: Document Databases

by Paul Williams

This installment of the DATAVERSITY series on the wide-ranging NoSQL movement covers Document Databases. Document Stores center on the concept of data stored within a document. The encoding of this document can be either text in some form of markup language or a binary format. Examples of binary encoding include Microsoft Word documents and other user-created files managed by proprietary software programs, such as Excel, Photoshop, and Pro Tools.

Text markup languages used for data purposes are plentiful: from YAML to JSON to the nearly ubiquitous XML. Both flavors of document types feature in the overall world of Document databases.

Keys are normally used to retrieve items from a Document Store. Usually in a string format, the key can represent the path to a stored document, although performance improvements result from indexing those keys. In addition to faster access, indexes using other data types provide flexibility in the queries used for document retrieval – for example returning all stored documents containing a certain phrase.
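As a rough illustration of the two access paths described above, here is a minimal Python sketch of an in-memory store. The class name `TinyDocumentStore` and its methods are hypothetical and do not belong to any real product. Documents are fetched directly by a path-like string key, while a simple word index narrows down the candidates for a phrase search.

```python
from collections import defaultdict


class TinyDocumentStore:
    """Toy in-memory document store (hypothetical, for illustration only)."""

    def __init__(self):
        self._docs = {}                      # path-like key -> document text
        self._word_index = defaultdict(set)  # lowercase word -> keys containing it

    def put(self, key, text):
        """Store a document and index each of its whitespace-separated words."""
        self._docs[key] = text
        for word in text.lower().split():
            self._word_index[word].add(key)

    def get(self, key):
        """Direct retrieval by key; returns None when the key is unknown."""
        return self._docs.get(key)

    def find_phrase(self, phrase):
        """Return the sorted keys of all documents containing the phrase."""
        words = phrase.lower().split()
        if not words:
            return []
        # The index cheaply narrows the search to documents holding every word.
        candidates = set.intersection(
            *(self._word_index.get(word, set()) for word in words)
        )
        # Confirm the words actually appear next to each other, in order.
        needle = " ".join(words)
        return sorted(
            key for key in candidates
            if needle in " ".join(self._docs[key].lower().split())
        )


store = TinyDocumentStore()
store.put("/reports/2012/q3.txt", "Document stores retrieve items by key")
store.put("/notes/todo.txt", "Compare key value stores with document stores")
print(store.get("/reports/2012/q3.txt"))
print(store.find_phrase("document stores"))
```

The sketch ignores punctuation, stemming, and index cleanup on overwrite, all of which a production search index would need to handle.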

XML Databases

A markup language developed to facilitate data exchange between different systems, XML saw increased use throughout the 2000s. XML is widely used to describe RSS feeds, data objects using SOAP markup, as well as various communication protocols.

XML can be used to mark up relational data as well. In the Microsoft .NET Framework ADO.NET library, XML files are treated essentially like any other relational data source (SQL Server, Oracle Database, etc.) for CRUD transactions.

Many relational databases, including DB2, SQL Server, and PostgreSQL, support persistence to and from the XML format, with the ability to serialize and deserialize XML from binary formats as necessary. This helps when exchanging data between two different databases.

Other traditional Document-oriented data stores, known as Native XML Databases (NXD), use XML as a data type in a logical model, with the actual data stored in various physical formats depending on the individual database.

NXDs group sets of XML documents into collections that follow a similar pattern to a directory structure on a computer. Many of these databases leverage XPath and XQuery for querying purposes, and some of them provide XSLT functionality to actually transform the native XML into other usable formats, including HTML for rendering in a web page.
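To give a feel for path-based querying and transformation, the following Python sketch uses the standard library's `xml.etree.ElementTree`, which supports only a limited subset of XPath and is not a native XML database. The sample collection and the helper names (`titles_for_topic`, `titles_for_year`, `to_html_list`) are invented for illustration, and the HTML step merely stands in for what an XSLT stylesheet would do.

```python
import xml.etree.ElementTree as ET
from html import escape

# A made-up "collection" of XML records, held in a single string for brevity.
COLLECTION = """
<library>
  <book year="2009"><title>XQuery Basics</title><topic>xml</topic></book>
  <book year="2012"><title>Document Stores</title><topic>nosql</topic></book>
  <book year="2012"><title>Native XML Databases</title><topic>xml</topic></book>
</library>
"""

root = ET.fromstring(COLLECTION)


def titles_for_topic(root, topic):
    """Select books whose <topic> child has the given text."""
    return [b.findtext("title") for b in root.findall(f"./book[topic='{topic}']")]


def titles_for_year(root, year):
    """Select books by the value of their year attribute."""
    return [b.findtext("title") for b in root.findall(f"./book[@year='{year}']")]


def to_html_list(titles):
    """Render query results as an HTML list (a stand-in for an XSLT transform)."""
    items = "".join(f"<li>{escape(title)}</li>" for title in titles)
    return f"<ul>{items}</ul>"


print(to_html_list(titles_for_topic(root, "xml")))
```

A full XPath or XQuery processor, such as those shipped with the NXDs discussed here, offers far richer expressions than this subset.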

Some of the more popular NXDs today include the open source BaseX, which features a powerful XPath/XQuery processor as well as a friendly GUI for administration. Another open source option is Sedna, which uses the Apache license model. Sedna offers a robust API supporting multiple languages, as well as XQuery and ACID transaction support.

A commercial NXD option is MarkLogic Server, a powerful, flexible system capable of supporting many Big Data applications, including search, open source intelligence, social media analysis and data virtualization. The product scales nicely through the use of Hadoop’s MapReduce functionality. The company recently added JSON as a storage format option.

Apache CouchDB and the Cluster of Unreliable Commodity Hardware

CouchDB is a popular open source Document-oriented database developed by Damien Katz and others. Its technology stack is relatively straightforward: JSON documents for storage, HTTP as the primary API, and MapReduce support using JavaScript.
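The map and reduce idea behind that stack can be imitated in a few lines. In CouchDB itself the view functions are written in JavaScript. The Python sketch below only mimics the concept on a handful of invented JSON documents, and the names `map_post`, `reduce_sum`, and `run_view` are hypothetical rather than part of any CouchDB API.

```python
import json
from collections import defaultdict

# Invented sample documents, stored as JSON text the way a document store would.
RAW_DOCS = [
    '{"_id": "a1", "type": "post", "author": "ann", "words": 120}',
    '{"_id": "a2", "type": "post", "author": "ann", "words": 80}',
    '{"_id": "b1", "type": "post", "author": "bob", "words": 45}',
    '{"_id": "c1", "type": "comment", "author": "bob", "words": 10}',
]


def map_post(doc):
    """Map step: emit one (key, value) pair per document of interest."""
    if doc.get("type") == "post":
        yield doc["author"], doc["words"]


def reduce_sum(values):
    """Reduce step: collapse all values emitted for one key."""
    return sum(values)


def run_view(raw_docs, map_fn, reduce_fn):
    """Apply the map function to every document, then reduce per key."""
    grouped = defaultdict(list)
    for raw in raw_docs:
        doc = json.loads(raw)
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(values) for key, values in sorted(grouped.items())}


print(run_view(RAW_DOCS, map_post, reduce_sum))
```

Because the map step looks at one document at a time, the work can be spread across many machines, which is the property that makes the approach attractive for distributed stores.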

It features native transformation functionality, allowing users to essentially serve web applications directly from CouchDB. In terms of the CAP (Consistency, Availability, Partition Tolerance) theorem, the database favors availability and partition tolerance for distributed scaling, and uses a model of eventual consistency.

Mobile applications are well suited for CouchDB, considering its ease of replication for the syncing of data between a mobile device and a desktop. The database works equally well in online and offline states.

Some of the major enterprise users of CouchDB include the BBC and Credit Suisse, as well as the scientists working on the Large Hadron Collider project. Many Facebook apps leverage CouchDB’s abilities in effectively managing a growing social media-driven database.

CouchDB’s major strength may be its active user community of open source aficionados. The database, written in the Erlang language, is constantly gaining new features and functionality.

Previously mentioned in the DATAVERSITY article on Key-Value stores, Couchbase arose out of the merging of the minds behind CouchDB and the Membase memory cache database tool, effectively combining the strengths of both products.

MongoDB Serves Humongous Data

MongoDB derives its name from humongous. It is an open source Document database written in C++ that stores documents on the back end in BSON, a binary encoding of JSON-like markup. It is available for most major operating systems, including Windows, Mac OS X, Linux, and Solaris.

10gen, the company that develops MongoDB, also offers commercial support for the database. MongoDB provides many features suitable for large enterprises, including MapReduce, auto-sharding to facilitate horizontal scaling, and easy replication across WANs.

MongoDB features robust support for a variety of index options. Read operations see improvement for frequently used queries, and the accessing of documents larger than the available RAM becomes more efficient.

Support for dynamic, ad-hoc queries is a major selling point for MongoDB. This feature is common on relational databases, and MongoDB users migrating from an older relational store will find that many of their dynamic SQL queries seamlessly translate to MongoDB’s query language.
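To show roughly how a SQL predicate maps onto a query document, consider `SELECT * FROM users WHERE age > 30 AND city = 'Oslo'`, which corresponds to the filter `{"age": {"$gt": 30}, "city": "Oslo"}`. The Python sketch below evaluates such filters against plain dictionaries. It is a simplified stand-in and not MongoDB's implementation or driver. The functions `matches` and `find` and the sample data are hypothetical, and only a handful of comparison operators are handled.

```python
import operator

# A small subset of MongoDB-style comparison operators.
OPERATORS = {
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, options: value in options,
}


def matches(doc, query):
    """Return True when the document satisfies every condition in the query."""
    for field, condition in query.items():
        if field not in doc:
            return False
        value = doc[field]
        if isinstance(condition, dict):
            # Operator form, e.g. {"$gt": 30}; all operators must hold.
            if not all(OPERATORS[op](value, operand)
                       for op, operand in condition.items()):
                return False
        elif value != condition:
            # Plain form, e.g. "Oslo"; means equality.
            return False
    return True


def find(collection, query):
    """Return the documents in the collection that match the query."""
    return [doc for doc in collection if matches(doc, query)]


USERS = [
    {"name": "Ann", "age": 34, "city": "Oslo"},
    {"name": "Bob", "age": 28, "city": "Oslo"},
    {"name": "Cy", "age": 41, "city": "Bergen"},
]

# SELECT * FROM users WHERE age > 30 AND city = 'Oslo'
print(find(USERS, {"age": {"$gt": 30}, "city": "Oslo"}))
```

The filter is itself just data, so an application can assemble it at run time, much like dynamic SQL built from user input.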

MongoDB’s list of current production users is large and diverse. Disney uses the database for a gaming application. The humongous classified ad website Craigslist is also a client, and the location-based social networking giant foursquare depends on MongoDB’s state-of-the-art database features.

Clusterpoint Delivers Superior Search Functionality

A commercial Document-oriented database, Clusterpoint is known for its high-end search functionality. Written in C++, the database provides an out-of-the-box search facility able to quickly find relevant material from a collection of documents formatted as XML, JSON, or HTML. The tool can also search emails and text documents.

Clusterpoint provides a web-style search interface, so users of Google or Bing (meaning almost everybody!) will feel right at home. Its ranking algorithm is another calling card, greatly increasing the quality of the search result set. This theoretically improves overall enterprise efficiency by reducing the time it takes employees to find the right business information out of a mass of enterprise data.

The database includes elastic clustering and replication functionality that promises high scalability. It also provides an API for Java, PHP, and the .NET languages. Although Clusterpoint is a commercial product, it is available as a single-server trial edition, and the full cluster or site license version also includes a 30-day free trial.

OrientDB Combines the Worlds of Document and Graph Databases

Combining features of Document and Graph databases, OrientDB is an open source NoSQL database written in Java and available under the Apache software license. Nuvolabase offers professional services centered on the use of OrientDB, including support, training, and consulting options.

What makes OrientDB unique among other Document-oriented databases is its ability to manage relationships within the data using Graph database technology. This gives the database superior performance when processing large amounts of data containing many relationships. It is also possible to use OrientDB purely as a Graph database.
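One way to picture relationships living inside documents is to let each document carry links to other records and to follow those links as graph edges. The Python sketch below does this with a breadth-first traversal. It is not OrientDB's API or storage format, and the record ids, the `follows` field, and the `reachable` function are all invented for illustration.

```python
from collections import deque

# Invented documents; each one links to other records by id.
DOCS = {
    "#1": {"name": "Ann", "follows": ["#2", "#3"]},
    "#2": {"name": "Bob", "follows": ["#4"]},
    "#3": {"name": "Cy", "follows": ["#4"]},
    "#4": {"name": "Dee", "follows": []},
    "#5": {"name": "Eve", "follows": ["#1"]},
}


def reachable(docs, start, link_field="follows", max_depth=None):
    """Breadth-first walk over document links; returns names in visit order."""
    seen = {start}
    names = []
    queue = deque([(start, 0)])
    while queue:
        key, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # do not expand links beyond the requested depth
        for linked in docs[key].get(link_field, []):
            if linked not in seen:
                seen.add(linked)
                names.append(docs[linked]["name"])
                queue.append((linked, depth + 1))
    return names


print(reachable(DOCS, "#1"))               # everyone Ann can reach
print(reachable(DOCS, "#1", max_depth=1))  # only direct links
```

Following a stored link avoids the join a relational store would need for the same question, which is the intuition behind the performance claim for relationship-heavy data.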

Squarely a “not only” SQL database, OrientDB offers the ability to use SQL queries to access relationships and data. It also provides support for ACID transactions and the superfast capability of storing 150,000 records per second.

Document-oriented databases make up an important part of the overall NoSQL picture. In some cases closely related to Key-Value stores, the ability to manage data stored in actual documents or as marked-up data objects provides flexibility and speed not always available with the relational model. Some Document Stores, like OrientDB, even offer Graph database functionality.

Speaking of Graph databases, the next article in the DATAVERSITY NoSQL series covers these superfast relationship-managing data stores.

 


  4 comments for “The NoSQL Movement: Document Databases”

  1. October 23, 2012 at 10:45 am

    Hi Paul,

    Nice article.

    However, there’s one DB you missed mentioning: ArangoDB.
    I’d like to share my little story with it and also present some facts surrounding it.

    In search for a database that would fit my needs and having evaluated some of the ones mentioned in this article I found a very promising new kid on the block.
    Don’t get me wrong. I have used MongoDB in the past and evaluated OrientDB for a short time. Both are very good DBs, each with great features, but they didn’t fit my needs for the project I have been working on.
    While ArangoDB doesn’t yet meet one primary need of mine, which is transaction support, it will provide it very soon, in Version 1.2 according to the roadmap. You’ll find out below why I chose it nevertheless.

    Besides the great features and amazing ideas that are implemented in this database, I came to really appreciate the openness and responsiveness of the developers to any questions I threw at them during my evaluation. The experience with the ArangoDB developers came to a high point when I proposed some enhancements and features. The devs checked whether the features made sense and could be done without introducing problems to the DB. They implemented the suggested features mostly on the same day! That’s something I had never seen before with any other DB. Having had that nice experience, I just had to give back to the project, and so I started contributing by helping out with the ArangoDB PHP API. :)

    I must say, it’s been a very nice experience so far and it shows every sign of staying that way.

    So, finally, here are some facts on ArangoDB:

    ArangoDB is an open-source, multi-threaded, highly flexible all-in-one database solution.

    You can mix “key-value store”, “document store” and “graph database”, all in the same DB.

    Indexes: You have “Full-text”, “Geo”, “Hash”, “Bitmap” and “Skip-list” indexes to choose from. You have the freedom to use what you need for your specific application.

    Querying: There are several ways to do queries in ArangoDB. Simple queries by id or example and ArangoDB’s Query Language AQL for the more complex queries, also supporting PATH queries on Graphs.

    ArangoDB also features Actions, which allow you to write server-side code in JavaScript and provide a REST interface to call them.

    It makes use of shapes to store and reuse common document schemas instead of writing the same document attributes in every document over and over again. This results in far less space usage in memory and on disk, which automatically translates into increased performance.

    Speed: Here’s a nice one! ArangoDB uses HTTP REST as its interface. So, one would think that would be slow…. This blog post (http://www.arangodb.org/2012/10/04/gain-factor-of-5-using-batch-updates) shows quite the opposite. The tests that were run in that post show that up to 500,000 document inserts in less than 5 seconds are possible. This depends of course on the hardware, document size and various other factors, but bottom line: It’s amazingly fast!

    MVCC: ArangoDB uses AppendOnly/MVCC to update documents. That way locks are minimized and documents are quickly appended to the datafiles without having to reorganize or find empty spots, etc… Of course there is garbage collection which gets rid of the replaced documents.

    Drivers: Since access to the DB is through the REST interface, there are no binary drivers. Anyone can write their own driver if one doesn’t yet exist. To date, there are drivers for PHP, Node.js, Ruby, Java, Perl, Python & D.

    ArangoDB 1.1 will soon be released with many new features and performance improvements.

    Transactions will be introduced in Version 1.2. Replication and sharding are on the roadmap for Version 2.0.

    So, in the end ArangoDB is definitely a database to keep an eye on :)
