This installment of the DATAVERSITY series on the wide-ranging NoSQL movement covers Document Databases. Document Stores center on the concept of data stored within a document. The document can be encoded either as text in some form of markup language or as binary. Examples of binary encoding include Microsoft Word files or any other user-created file managed by a proprietary software program, such as Excel, Photoshop, Pro Tools, and others.
Text markup languages used for data purposes are plentiful: from YAML to JSON to the nearly ubiquitous XML. Both flavors of document types feature in the overall world of Document databases.
Keys are normally used to retrieve items from a Document Store. The key, usually a string, can represent the path to a stored document, and indexing those keys improves retrieval performance. In addition to faster access, indexes built on other data types add flexibility to the queries used for document retrieval, for example returning all stored documents containing a certain phrase.
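To make the idea concrete, here is a minimal sketch of key-based retrieval combined with a simple word index for phrase lookups. The class name `TinyDocumentStore`, its methods, and the sample keys are all hypothetical and exist only for illustration; a real Document Store would use far more sophisticated indexing and would handle punctuation, stemming, and persistence.

```python
from collections import defaultdict


class TinyDocumentStore:
    """Toy in-memory store: lookup by key, plus a word index for phrase search."""

    def __init__(self):
        self._docs = {}                      # key -> list of lowercase tokens
        self._raw = {}                       # key -> original text
        self._word_index = defaultdict(set)  # word -> keys whose text contains it

    def put(self, key, text):
        # Remove stale index entries when an existing key is overwritten.
        for word in self._docs.get(key, []):
            self._word_index[word].discard(key)
        tokens = text.lower().split()
        self._docs[key] = tokens
        self._raw[key] = text
        for word in tokens:
            self._word_index[word].add(key)

    def get(self, key):
        """Direct retrieval by key (here, a path-like string)."""
        return self._raw.get(key)

    def find_phrase(self, phrase):
        """Return keys of documents containing the words of `phrase` in order."""
        words = phrase.lower().split()
        if not words:
            return []
        # The index narrows the search to documents containing every word...
        candidates = set.intersection(
            *(self._word_index.get(w, set()) for w in words)
        )
        # ...then each candidate is checked for the words appearing consecutively.
        n = len(words)
        hits = []
        for key in candidates:
            tokens = self._docs[key]
            if any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1)):
                hits.append(key)
        return sorted(hits)


store = TinyDocumentStore()
store.put("/reports/2012/q1", "Quarterly revenue grew in the first quarter")
store.put("/reports/2012/q2", "Revenue grew slowly while costs fell")
store.put("/memos/hr", "The first revenue memo of the quarter")

by_key = store.get("/memos/hr")
by_phrase = store.find_phrase("revenue grew")
```

Here `by_key` fetches a single document by its path-like key, while `by_phrase` uses the word index to avoid scanning every stored document.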
A markup language developed to facilitate data exchange between different systems, XML saw increased use throughout the 2000s. XML is widely used to describe RSS feeds, data objects using SOAP markup, as well as various communication protocols.
XML can be used to represent relational data as well. In the Microsoft .NET Framework ADO.NET library, XML files are treated essentially like any other relational data source (SQL Server, Oracle Database, etc.) for CRUD operations.
Many relational databases, including DB2, SQL Server, and PostgreSQL, support persistence to and from the XML format, with the ability to serialize and deserialize XML from binary formats as necessary. This helps when exchanging data between two different databases.
Other traditional Document-oriented data stores, known as Native XML Databases (NXD), use XML as a data type in a logical model, with the actual data stored in various physical formats depending on the individual database.
NXDs group sets of XML documents into collections that follow a similar pattern to a directory structure on a computer. Many of these databases leverage XPath and XQuery for querying purposes, and some of them provide XSLT functionality to actually transform the native XML into other usable formats, including HTML for rendering in a web page.
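As a rough illustration of querying a directory-like collection with a path expression, the sketch below uses Python's standard `xml.etree.ElementTree` module, which supports only a limited subset of XPath and no XQuery. The collection contents, the document names, and the `query_collection` helper are hypothetical; an actual NXD exposes its own query interface.

```python
import xml.etree.ElementTree as ET

# A hypothetical collection: names resemble paths in a directory structure.
collection = {
    "books/dune.xml":
        "<book year='1965'><title>Dune</title><author>Frank Herbert</author></book>",
    "books/neuromancer.xml":
        "<book year='1984'><title>Neuromancer</title><author>William Gibson</author></book>",
    "articles/nosql.xml":
        "<article><title>NoSQL Overview</title></article>",
}


def query_collection(collection, prefix, path):
    """Apply a path expression to every document whose name starts with `prefix`."""
    results = []
    for name in sorted(collection):
        if not name.startswith(prefix):
            continue
        root = ET.fromstring(collection[name])
        for element in root.findall(path):
            results.append((name, element.text))
    return results


# Titles of every document in the "books/" sub-collection.
titles = query_collection(collection, "books/", "./title")
```

The same helper could be pointed at `"./author"` or at a different prefix, which mirrors how an NXD scopes a query to one collection.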
Some of the more popular NXDs today include the open source BaseX, which features a powerful XPath/XQuery processor as well as a friendly GUI for administration. Another open source option is Sedna, which uses the Apache license model. Sedna offers a robust API supporting multiple languages, as well as XQuery and ACID transaction support.
A commercial NXD option is MarkLogic Server, a powerful, flexible system capable of supporting many Big Data applications, including search, open source intelligence, social media analysis and data virtualization. The product scales nicely through the use of Hadoop’s MapReduce functionality. The company recently added JSON as a storage format option.
Apache CouchDB and the Cluster of Unreliable Commodity Hardware
Apache CouchDB, whose name is an acronym for "Cluster of Unreliable Commodity Hardware," features native transformation functionality, allowing users to essentially serve web applications directly from the database. In terms of the CAP (Consistency, Availability, Partition Tolerance) theorem, CouchDB favors availability and partition tolerance for distributed scaling, and uses a model of eventual consistency.
Mobile applications are well suited for CouchDB, considering its ease of replication for the syncing of data between a mobile device and a desktop. The database works equally well in online and offline states.
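The following sketch shows the general idea of two replicas that accept writes while disconnected and converge once they sync. It is a deliberately simplified model and not CouchDB's actual replication algorithm: the `Replica` class, the `sync` function, and the integer revision counter are assumptions made for illustration. CouchDB itself tracks revision histories and retains conflicting revisions rather than discarding them.

```python
class Replica:
    """A toy replica holding documents as doc_id -> (revision, body)."""

    def __init__(self, name):
        self.name = name
        self.docs = {}

    def write(self, doc_id, body):
        # Each local write bumps the document's revision number.
        revision = self.docs.get(doc_id, (0, None))[0] + 1
        self.docs[doc_id] = (revision, body)


def sync(a, b):
    """Two-way sync: both replicas end up holding the same winning revision."""
    for doc_id in set(a.docs) | set(b.docs):
        version_a = a.docs.get(doc_id, (0, None))
        version_b = b.docs.get(doc_id, (0, None))
        # Higher revision wins; ties are broken deterministically so that
        # both sides independently choose the same winner.
        winner = max(version_a, version_b, key=lambda v: (v[0], repr(v[1])))
        a.docs[doc_id] = winner
        b.docs[doc_id] = winner


phone = Replica("phone")
desktop = Replica("desktop")

# Both devices work offline and accept writes independently.
phone.write("note-1", "buy milk")
phone.write("note-1", "buy milk and eggs")
desktop.write("note-2", "call the bank")

# Once a connection is available, a sync brings them to the same state.
sync(phone, desktop)
```

After `sync`, both replicas hold `note-1` at its second revision and `note-2` at its first, which is the "eventually consistent" outcome in miniature.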
Some of the major enterprise users of CouchDB include the BBC and Credit Suisse, as well as the scientists working on the Large Hadron Collider project. Many Facebook apps leverage CouchDB’s abilities in effectively managing a growing social media-driven database.
CouchDB’s major strength may be its active user community of open source aficionados. The database, written in the Erlang language, regularly gains new features and functionality.
Previously mentioned in the DATAVERSITY article on Key-Value stores, Couchbase arose out of the merging of the minds behind CouchDB and the Membase memory cache database tool, effectively combining the strengths of both products.
MongoDB Serves Humongous Data
MongoDB derives its name from humongous. It is an open source Document database written in C++ that stores JSON-style documents on the back end in BSON, a binary-encoded form of JSON. It is available for most major operating systems, including Windows, Mac OS X, Linux, and Solaris.
MongoDB is developed by 10gen, which also offers commercial support for the database. MongoDB provides many features suitable for large enterprises, including MapReduce, auto-sharding to facilitate horizontal scaling, and easy replication across WANs.
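Horizontal scaling through sharding comes down to routing each document to one of several servers based on a chosen key. The sketch below shows a bare-bones hashed approach; the function names, the `user_id` field, and the modulo routing are illustrative assumptions. MongoDB itself divides data into chunks over ranges of the shard key (optionally a hashed key) and rebalances those chunks automatically, which this sketch does not attempt.

```python
import hashlib


def shard_for(shard_key, shard_count):
    """Map a shard key to a shard number using a stable hash."""
    digest = hashlib.md5(str(shard_key).encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


def distribute(documents, key_field, shard_count):
    """Place each document into the shard selected by its key field."""
    shards = [[] for _ in range(shard_count)]
    for doc in documents:
        shards[shard_for(doc[key_field], shard_count)].append(doc)
    return shards


documents = [{"user_id": n, "score": n * 10} for n in range(1, 101)]
shards = distribute(documents, "user_id", shard_count=4)

# A read for one user only needs to visit a single shard.
target = shard_for(42, 4)
found = [doc for doc in shards[target] if doc["user_id"] == 42]
```

Because the hash is stable, a query for a given `user_id` always lands on the same shard, so adding servers spreads both storage and query load.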
MongoDB features robust support for a variety of index options. Indexes speed up read operations for frequently used queries, and they make access more efficient when the stored data is larger than the available RAM.
Support for dynamic, ad-hoc queries is a major selling point for MongoDB. This feature is common on relational databases, and MongoDB users migrating from an older relational store will find that many of their dynamic SQL queries translate readily to MongoDB’s query language.
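To give a feel for how a SQL filter maps onto a document query, consider `SELECT * FROM users WHERE city = 'Boston' AND age > 30`. In MongoDB's query language the same filter is written as the document `{"city": "Boston", "age": {"$gt": 30}}`. The sketch below evaluates such filter documents against plain Python dictionaries; the `matches` and `find` helpers and the sample data are hypothetical stand-ins, not MongoDB's driver API, and they cover only a handful of comparison operators.

```python
import operator

# A small subset of MongoDB-style comparison operators.
OPERATORS = {
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, options: value in options,
}


def matches(document, query):
    """Return True if the document satisfies every condition in the query."""
    for field, condition in query.items():
        if field not in document:
            return False
        value = document[field]
        if isinstance(condition, dict):
            # Operator form, e.g. {"$gt": 30}; all operators must hold.
            if not all(OPERATORS[op](value, operand)
                       for op, operand in condition.items()):
                return False
        elif value != condition:
            # Plain form, e.g. {"city": "Boston"}, means equality.
            return False
    return True


def find(collection, query):
    """Ad-hoc query: no schema or predefined view is required."""
    return [doc for doc in collection if matches(doc, query)]


users = [
    {"name": "Ana", "city": "Boston", "age": 34},
    {"name": "Ben", "city": "Boston", "age": 28},
    {"name": "Cy", "city": "Denver", "age": 41},
]

# SELECT * FROM users WHERE city = 'Boston' AND age > 30
boston_over_30 = find(users, {"city": "Boston", "age": {"$gt": 30}})
```

Because the filter is itself just a document, an application can assemble it at run time, much as it would build a dynamic SQL string.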
MongoDB’s list of current production users is large and diverse. Disney leverages the database for a gaming application. The humongous classified ad website Craigslist is also a client. Location-based social networking giant foursquare depends on MongoDB’s state-of-the-art database features as well.
Clusterpoint Delivers Superior Search Functionality
A commercial Document-oriented database, Clusterpoint is known for its high-end search functionality. Written in C++, the database provides an out-of-the-box search facility able to quickly find relevant material from a collection of documents formatted as XML, JSON, or HTML. The tool can also search emails and text documents.
Clusterpoint provides a web-style search interface, so users of Google or Bing (meaning almost everybody!) will feel right at home. Its ranking algorithm is another calling card, greatly increasing the quality of the search result set. This theoretically improves overall enterprise efficiency by reducing the time it takes employees to find the right business information out of a mass of enterprise data.
The database includes elastic clustering and replication functionality that promises high scalability. It also provides an API for Java, PHP, and the .NET languages. Although Clusterpoint is a commercial product, it is available as a single-server trial edition, and the full cluster or site license version includes a 30-day free trial.
OrientDB Combines the Worlds of Document and Graph Databases
Combining features of Document and Graph databases, OrientDB is an open source NoSQL database written in Java and available under the Apache software license. Nuvolabase offers professional services centered on the use of OrientDB, including support, training, and consulting options.
What makes OrientDB unique among other Document-oriented databases is its ability to manage relationships within the data using Graph database technology. This gives the database superior performance when processing large amounts of data containing many relationships. It is also possible to use OrientDB purely as a Graph database.
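The benefit of graph-style relationships can be sketched as documents that hold direct links to other documents, so related records are reached by following references rather than by joining tables. The example below is a generic illustration, not OrientDB's API: the record identifiers, the `friends` field, and the `reachable` function are all hypothetical.

```python
from collections import deque

# Each document stores direct links (by id) to related documents.
documents = {
    "person:1": {"name": "Ada", "friends": ["person:2", "person:3"]},
    "person:2": {"name": "Bo", "friends": ["person:4"]},
    "person:3": {"name": "Cy", "friends": []},
    "person:4": {"name": "Di", "friends": ["person:1"]},
}


def reachable(documents, start, link_field, max_depth):
    """Breadth-first traversal over links, returning (name, depth) pairs."""
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        doc_id, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the requested depth
        for neighbor in documents[doc_id].get(link_field, []):
            if neighbor not in seen:  # guards against cycles in the graph
                seen.add(neighbor)
                found.append((documents[neighbor]["name"], depth + 1))
                queue.append((neighbor, depth + 1))
    return found


# Friends and friends-of-friends of Ada.
network = reachable(documents, "person:1", "friends", max_depth=2)
```

Each hop is a direct lookup by identifier, so the cost of a traversal depends on the number of links followed rather than on the total size of the data set.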
Squarely a “not only” SQL database, OrientDB offers the ability to use SQL queries to access relationships and data. It also provides support for ACID transactions and a reported write speed of 150,000 records per second.
Document-oriented databases make up an important part of the overall NoSQL picture. They are in some cases closely related to Key-Value stores, and their ability to manage data stored in actual documents or as marked-up data objects provides flexibility and speed not always available with the relational model. Some Document Stores, like OrientDB, even offer Graph database functionality.
Speaking of Graph databases, the next article in the DATAVERSITY NoSQL series covers these superfast relationship-managing data stores.