How to Choose the Right NoSQL Database

This is the second part of an ongoing series on NoSQL Databases. The first part was NoSQL Data Architecture & Data Governance: Everything You Need to Know. In that first part, I explained the different NoSQL Database types and provided a few use cases suitable for each type. But that is not sufficient when you are planning a new application and need to choose a database for your use case. Arriving at a decision is even more difficult given the variety of database vendors in the market today.

In this article, I will outline a framework for performing a fit analysis to help you choose the right NoSQL Database for your application. The fit analysis comprises four stages that will help you narrow down the choices and arrive at a decision. But first, make sure that at least the high-level requirements, access paths and query patterns are elicited, analyzed and finalized for your application. This is critical because NoSQL database types are designed for specific use cases, and a NoSQL data model is built around your application’s access paths and query patterns.

RDBMS or NoSQL?

The first stage is to determine whether you really need a NoSQL database or whether an RDBMS already in use in your organization will suffice. Understanding ACID vs BASE properties will help you make this decision.

As you may know, an RDBMS is characterized by ACID properties, which are:

Atomic: Each task in a transaction succeeds or the entire transaction is rolled back.
Consistent: A transaction maintains a valid state for the database before and after its completion and cannot leave the database in an inconsistent state.
Isolated: A transaction not yet committed must not interfere with another transaction and must remain isolated.
Durable: Committed transactions persist in the database and can be recovered in case of database failure.

While these characteristics seem like obvious requirements for most applications, enforcing them across many nodes makes horizontal scaling, high availability, performance and fault tolerance harder to achieve.
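To make atomicity and consistency concrete, here is a minimal sketch using Python's built-in sqlite3 module. The accounts table, the names and the transfer function are hypothetical and exist only for illustration. The credit is applied first so that a failing debit has something to roll back.

```python
import sqlite3

# Hypothetical two-account ledger; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates commit together or not at all."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            # Fails the CHECK constraint if src would go negative.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


first = transfer(conn, "alice", "bob", 30)    # succeeds
second = transfer(conn, "alice", "bob", 500)  # rejected, credit to bob is rolled back
print(first, second, balances(conn))
```

The second transfer leaves both balances exactly as they were after the first one, which is the all-or-nothing behavior described under Atomic and Consistent.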

The alternative to ACID is BASE which is what NoSQL databases follow:

Basically Available: The system is guaranteed to be available in the event of failure.
Soft State: The state of the data could change without application interactions due to eventual consistency.
Eventual Consistency: The system will be eventually consistent after the application input. The data will be replicated to different nodes and will eventually reach a consistent state. But the consistency is not guaranteed at a transaction level.

BASE systems allow horizontal scaling, fault tolerance and high availability at the cost of consistency. So, if your application requires high availability and scalability, a NoSQL Database built on BASE properties might be suitable.
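The three BASE properties can be illustrated with a toy in-memory model. This is only a sketch under simplifying assumptions: the class names are hypothetical, replication is a simple queue, and conflict resolution between concurrent writes is ignored entirely.

```python
from collections import deque


class Replica:
    """One node holding its own local copy of the data."""

    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """Toy model: a write lands on one replica and reaches the others later."""

    def __init__(self, replica_count=3):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.pending = deque()  # queued (replica_index, key, value) messages

    def write(self, key, value, node=0):
        # Acknowledge as soon as one replica has the value (Basically Available).
        self.replicas[node].data[key] = value
        for i in range(len(self.replicas)):
            if i != node:
                self.pending.append((i, key, value))

    def read(self, key, node):
        # May return stale data until replication catches up (Soft State).
        return self.replicas[node].data.get(key)

    def replicate(self):
        # Drain the queue; afterwards all replicas agree (Eventual Consistency).
        while self.pending:
            i, key, value = self.pending.popleft()
            self.replicas[i].data[key] = value


store = EventuallyConsistentStore()
store.write("cart:42", ["book"], node=0)
stale = store.read("cart:42", node=1)   # replica 1 has not seen the write yet
store.replicate()
fresh = store.read("cart:42", node=1)   # now consistent with replica 0
print(stale, fresh)
```

A read from a replica that has not yet received the write returns nothing, and the same read returns the written value once replication has run. Real systems differ in how and when that propagation happens.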

Other Factors to Consider While Choosing Between NoSQL and RDBMS

Choose NoSQL if you have or need:

Semi-structured or Unstructured data / flexible schema
Limited pre-defined access paths and query patterns
No complex queries, stored procedures, or views
High velocity transactions
Large volume of data (in Terabyte range) requiring quick and cheap scalability
Distributed computing and storage
No Data Warehouse, Analytics or BI use cases

Choose an RDBMS if you have or need:

Consistent data/ACID transactions
Complex dynamic queries requiring stored procedures or views
Option to migrate to another database without significant change to existing application’s access paths or logic
Data Warehouse, Analytics or BI use case

Based on the above considerations, if your application aligns better with NoSQL’s BASE properties and the other selection factors above, you can proceed to stage 2 and narrow the NoSQL choices through the CAP theorem.
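The stage 1 checklist can be sketched as a simple tally. The factor labels below are my own shorthand for the items in the two lists above, and counting every factor equally is an assumption. In practice a single hard requirement, such as ACID transactions, may outweigh everything else.

```python
# Shorthand labels for the factors in the two lists above (labels are hypothetical).
NOSQL_FACTORS = {
    "flexible_schema",
    "limited_predefined_access_paths",
    "no_complex_queries",
    "high_velocity_transactions",
    "terabyte_scale_data",
    "distributed_computing_and_storage",
    "no_analytics_or_bi",
}
RDBMS_FACTORS = {
    "acid_transactions",
    "complex_dynamic_queries",
    "database_portability",
    "analytics_or_bi",
}


def stage1_lean(requirements):
    """Return 'NoSQL', 'RDBMS' or 'unclear' for a set of factor labels.

    Every factor counts equally, which is a simplifying assumption.
    """
    unknown = requirements - NOSQL_FACTORS - RDBMS_FACTORS
    if unknown:
        raise ValueError(f"unrecognized factors: {sorted(unknown)}")
    nosql = len(requirements & NOSQL_FACTORS)
    rdbms = len(requirements & RDBMS_FACTORS)
    if nosql == rdbms:
        return "unclear"
    return "NoSQL" if nosql > rdbms else "RDBMS"


lean = stage1_lean({"flexible_schema", "high_velocity_transactions", "acid_transactions"})
print(lean)
```

Treat the result as a conversation starter for the team rather than a verdict.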

Narrow the NoSQL Choices Through CAP Theorem

The CAP Theorem describes the tradeoffs between ACID and BASE. It states that a distributed system can provide only two of the following three guarantees at the same time: Consistency, Availability, and Partition Tolerance. One of them will not be supported.

Consistency: All nodes in the cluster have consistent data and a read request returns the most recent write from any node.
Availability: A non-failing node must always respond to requests in a reasonable time.
Partition Tolerance: The system continues to operate during network or node failures.

As per the CAP theorem, we must choose CA, AP or CP characteristics for a given system. This offers a way to categorize databases and provides guidance on determining which database will be a good fit for your application.

Consistent and Available System: If your application requires high consistency and availability with no partition tolerance, a CA system is a good fit. Most traditional RDBMSs are CA systems, but we ruled them out in stage 1 of the fit analysis. A Graph Database such as Neo4j is also a CA system and will be analyzed in stage 3 of the fit analysis.
Consistent and Partition Tolerant System: If your application requires high consistency and partition tolerance, a CP system is a good fit. CP systems are not able to guarantee availability, as the system returns errors until the partitioned state is resolved. Redis (K:V), MongoDB (Doc Store) and HBase (Col Oriented) are examples.
Available and Partition Tolerant System: If your application requires high availability and partition tolerance, an AP system is a good fit. AP systems are not able to guarantee consistency, as writes/updates can be made to either side of the partition. Such systems usually provide GDHA (Geographically Dispersed High Availability), where data is bi-directionally replicated across two datacenters and both are in an Active-Active configuration, i.e. the application can write to and read from either datacenter. Riak (K:V), Couchbase (Doc Store) and Cassandra (Col Oriented) are examples.

After analyzing the CAP requirements for your application, you can narrow down to a set of NoSQL databases from the selected CAP category for further consideration in stage 3.
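The narrowing step can be sketched as a lookup. The table below contains only the example databases named in the three categories above, as this article categorizes them. The function names are hypothetical, and many products can be configured to behave differently from their default category.

```python
# Example databases exactly as listed under the three CAP categories above.
CAP_EXAMPLES = {
    "CA": {"Graph": ["Neo4j"]},
    "CP": {
        "K:V": ["Redis"],
        "Document Store": ["MongoDB"],
        "Column Oriented": ["HBase"],
    },
    "AP": {
        "K:V": ["Riak"],
        "Document Store": ["Couchbase"],
        "Column Oriented": ["Cassandra"],
    },
}


def cap_category(consistency, availability, partition_tolerance):
    """Map exactly two required guarantees to 'CA', 'CP' or 'AP'."""
    flags = (("C", consistency), ("A", availability), ("P", partition_tolerance))
    letters = "".join(letter for letter, wanted in flags if wanted)
    if letters not in CAP_EXAMPLES:
        raise ValueError("choose exactly two of C, A and P")
    return letters


def candidates(category, db_type=None):
    """List example databases for a CAP category, optionally for one database type."""
    by_type = CAP_EXAMPLES[category]
    if db_type is not None:
        return list(by_type.get(db_type, []))
    return [name for names in by_type.values() for name in names]


category = cap_category(consistency=True, availability=False, partition_tolerance=True)
print(category, candidates(category))
```

Passing a database type as the second argument anticipates stage 3, where the type is chosen.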

Determine NoSQL Database Type

As you may have noticed in stage 2, each CAP category contains more than one NoSQL Database type (K:V/Document Store/Column Oriented/Graph). In this stage, we further analyze the application purpose & use case to determine which NoSQL Database type should be considered from the CAP category chosen for your application.

NoSQL Database types are designed for a specific group of use cases. I have listed some of the key use cases for each NoSQL Database type. You can use this list as a starting point for analyzing your application’s requirements.

Choose K:V Stores if you have or need:

Simple schema
High velocity read/write with no frequent updates
High performance and scalability
No complex queries involving multiple keys or joins

Choose Document Stores if you have or need:

Flexible schema with complex querying
JSON/BSON or XML data formats
Complex indexes (multikey, geospatial, full text search etc.)
High performance and balanced R:W ratio

Choose Column-Oriented Database if you have or need:

High volume of data
Extreme write speeds with relatively lower-velocity reads
Data extractions by columns using row keys
No ad-hoc query patterns, complex indices or high level of aggregations

Choose Graph Database if you have or need:

Applications requiring traversal between data points
Ability to store properties of each data point as well as relationship between them
Complex queries to determine relationships between data points
Need to detect patterns between data points
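To see why each type suits the use cases listed above, it can help to shape the same small dataset in all four ways. This is a sketch using plain Python structures, not any vendor's API. The users, keys and the "follows" relationship are made up for illustration.

```python
from collections import deque

# Key-value: an opaque value fetched by a single key, no querying inside the value.
kv_store = {"user:1": '{"name": "Ana", "city": "Lisbon"}'}

# Document: the store understands the structure, so fields can be queried.
documents = [
    {"_id": 1, "name": "Ana", "city": "Lisbon", "tags": ["admin"]},
    {"_id": 2, "name": "Ben", "city": "Oslo"},
]
in_lisbon = [d["name"] for d in documents if d.get("city") == "Lisbon"]

# Column-oriented: row key -> column family -> columns, read by row key and column.
wide_rows = {
    "user#1": {"profile": {"name": "Ana", "city": "Lisbon"}},
    "user#2": {"profile": {"name": "Ben", "city": "Oslo"}},
}
names_by_row = {key: row["profile"]["name"] for key, row in wide_rows.items()}

# Graph: nodes with properties plus explicit relationships that can be traversed.
nodes = {1: {"name": "Ana"}, 2: {"name": "Ben"}, 3: {"name": "Cy"}}
follows = {1: [2], 2: [3], 3: []}


def reachable(start):
    """Return the ids reachable from `start` by following edges (breadth-first)."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in follows[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


print(in_lisbon, names_by_row, sorted(reachable(1)))
```

The key-value lookup is the simplest and fastest path, while the graph form is the only one where "who can Ana reach?" is a natural question to ask.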

Now you have decided on the CAP category and the NoSQL type for your application. At this stage, if we perform a fit analysis based on the selected NoSQL databases shown in Fig 1, our decision matrix would look as follows:

But as a last step, we also need to consider the database and technology characteristics of each NoSQL Database, along with the requirements of the application and the organization, to finalize a selection. These are detailed in stage 4.

Select NoSQL Database (Vendor)

Even after selecting a CAP category and NoSQL Database type, the fit analysis is not complete. Selection of a NoSQL Database also depends on the database technology, its configuration, the available infrastructure, the proposed architecture of your application, your budget and the skill set available in your organization.

Database considerations:

Backup and recovery configurations
Cluster topology: GDHA / HADR, Active-Active / Active-Passive
Replication: Synchronous, Asynchronous or Quorum
Read/Write concerns and Indexing strategies
Concurrency control: Locks, MVCC (Multi Version Concurrency Control), Read Your Own Write (RYOW)
Security, access controls and encryption at rest
Available APIs and Query methods: JSON, XML, REST, Thrift, CQL, MapReduce, SPARQL, Cypher, Gremlin etc.
Infrastructure: On-premise or Cloud / Dedicated or Shared
Database uptime categorization (99.9% up to 99.999%)
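The uptime categories in the last item translate directly into a downtime budget. The arithmetic below is a small sketch that assumes a 365-day year. The function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def allowed_downtime_minutes(uptime_percent):
    """Yearly downtime budget, in minutes, for a given uptime percentage."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return MINUTES_PER_YEAR * (100 - uptime_percent) / 100


for target in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {minutes:.2f} minutes of downtime per year")
```

Under that assumption, 99.9% leaves roughly 8.76 hours of downtime per year, while 99.999% leaves a little over 5 minutes. That gap usually drives the cluster topology and replication choices listed above.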

Architecture/Application considerations:

Application Requirements: Use cases, R:W patterns, performance expectations/SLAs, upstream/downstream systems, criticality to the business etc.
Implementation Language and SDKs: C/C++, Java, Python, Node.js etc.
Application Architecture: Web Application, Microservices, Mobile etc.
Data Integration: Batch processing, ETL, Streaming, Message broker, ESB etc.
Complementary Technologies: Spark, Storm, Kafka, ELK, Solr, Splunk etc.

Organization considerations:

Budget and cost considerations
Team skillset
Preferred vendors / existing technology stack
Motivation for NoSQL/Big Data
Business / Technology leadership sponsorship & support

Once all such questions are answered, the application and data team should shortlist a couple of NoSQL Database vendors and perform a Proof of Concept to evaluate the technology and benchmark the performance in order to finalize the selection.
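For the benchmarking part of a Proof of Concept, a minimal latency harness might look like the sketch below. The fake_write function is a stand-in for a real database client call, and the iteration count is arbitrary. Both are assumptions made only so that the example runs on its own.

```python
import math
import time


def percentile(sorted_samples, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_samples)))
    return sorted_samples[rank - 1]


def benchmark(operation, iterations=1000):
    """Time `operation(i)` for each iteration and summarize the latencies."""
    latencies = []
    for i in range(iterations):
        start = time.perf_counter()
        operation(i)
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    total = sum(latencies)
    return {
        "ops_per_sec": iterations / total if total > 0 else float("inf"),
        "p50_ms": percentile(latencies, 50) * 1000,
        "p95_ms": percentile(latencies, 95) * 1000,
        "max_ms": latencies[-1] * 1000,
    }


# Stand-in for a real client call such as a single-key write (hypothetical).
fake_store = {}


def fake_write(i):
    fake_store[f"key:{i}"] = i


result = benchmark(fake_write, iterations=500)
print(result)
```

In a real evaluation you would swap fake_write for calls made through each shortlisted vendor's client library, and run the same workload against the same hardware. Tail latencies (p95 and max) tend to tell you more than the average.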
