This is the second part of an ongoing series on NoSQL Databases. The first part was NoSQL Data Architecture & Data Governance: Everything You Need to Know. In that first part, I explained the different NoSQL Database types and provided a few use cases suitable for each type. But that is not sufficient when you are planning a new application and need to choose a database for your use case. Arriving at a decision is even more difficult given the variety of database vendors in the market today.
In this article, I will outline a framework to perform a fit analysis to help you choose the right NoSQL Database for your application. The fit analysis comprises four stages that will help you narrow down choices and arrive at a decision. But first, make sure that at least the high-level requirements, access paths and query patterns have been elicited, analyzed and finalized for your application. This is critical because NoSQL database types are designed for specific use cases, and the database design is driven by your application’s access paths and query patterns.
RDBMS or NoSQL?
The first stage is to determine whether you really need a NoSQL database or whether an RDBMS already in use at your organization will suffice. Understanding ACID vs BASE properties will help you with this decision.
As you may know, an RDBMS is characterized by ACID properties, which are:
- Atomic: Each task in a transaction succeeds or the entire transaction is rolled back.
- Consistent: A transaction maintains a valid state for the database before and after its completion and cannot leave the database in an inconsistent state.
- Isolated: A transaction not yet committed must not interfere with another transaction and must remain isolated.
- Durable: Committed transactions persist in the database and can be recovered in case of database failure.
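To make atomicity and consistency concrete, here is a minimal sketch using Python's built-in sqlite3 module. The accounts table, the transfer function and the "no negative balance" rule are hypothetical and chosen only for illustration. A transfer either applies both updates or neither.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; both updates apply or neither does."""
    try:
        # The connection context manager commits on success and
        # rolls back if an exception escapes the block.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # Hypothetical consistency rule: balances may not go negative.
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False

def balances(conn):
    """Return the current balances as a {name: balance} dict."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))

# Illustrative setup: an in-memory database with two accounts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

print(transfer(conn, "alice", "bob", 30), balances(conn))   # succeeds
print(transfer(conn, "alice", "bob", 500), balances(conn))  # rolled back
```

The second transfer would overdraw the source account, so the whole transaction is rolled back and the balances are left exactly as they were after the first transfer.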
While these characteristics seem like obvious requirements for most applications, enforcing them across many nodes is costly, which makes strict ACID systems harder to scale horizontally and harder to keep highly available, fast and fault tolerant.
The alternative to ACID is BASE which is what NoSQL databases follow:
- Basically Available: The system remains available even in the event of a failure.
- Soft State: The state of the data could change without application interactions due to eventual consistency.
- Eventual Consistency: The system will be eventually consistent after the application input. The data will be replicated to different nodes and will eventually reach a consistent state. But the consistency is not guaranteed at a transaction level.
BASE systems allow horizontal scaling, fault tolerance and high availability at the cost of consistency. So, if your application requires high availability and scalability, a NoSQL Database built on BASE properties might be suitable.
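The BASE behaviour described above can be illustrated with a toy simulation. This is only a sketch under simplifying assumptions: the EventuallyConsistentStore class is hypothetical, replication is triggered manually, and conflicts are resolved by "highest version wins". Real databases use far more elaborate mechanisms.

```python
from collections import deque

class EventuallyConsistentStore:
    """Toy store: a write lands on one replica and is propagated later."""

    def __init__(self, n_replicas=3):
        self.replicas = [{} for _ in range(n_replicas)]
        self.pending = deque()  # queued (replica_index, key, version, value)
        self.version = 0

    def write(self, key, value, replica=0):
        """Accept the write on one replica and queue it for the others."""
        self.version += 1
        self.replicas[replica][key] = (self.version, value)
        for i in range(len(self.replicas)):
            if i != replica:
                self.pending.append((i, key, self.version, value))

    def read(self, key, replica):
        """Read from a single replica; the result may be stale."""
        entry = self.replicas[replica].get(key)
        return entry[1] if entry else None

    def replicate(self):
        """Deliver all queued updates; the highest version wins."""
        while self.pending:
            i, key, version, value = self.pending.popleft()
            current = self.replicas[i].get(key)
            if current is None or current[0] < version:
                self.replicas[i][key] = (version, value)

store = EventuallyConsistentStore()
store.write("cart", ["book"], replica=0)
print(store.read("cart", replica=0))  # the replica that took the write
print(store.read("cart", replica=2))  # stale until replication runs
store.replicate()
print(store.read("cart", replica=2))  # now consistent with replica 0
```

Every replica answers reads immediately (basically available), a replica's contents can change without a new application write (soft state), and all replicas agree once the queued updates are delivered (eventual consistency).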
Other Factors to Consider While Choosing Between NoSQL and RDBMS
Choose NoSQL if you have or need:
- Semi-structured or Unstructured data / flexible schema
- Limited pre-defined access paths and query patterns
- No complex queries, stored procedures, or views
- High velocity transactions
- Large volume of data (in the terabyte range) requiring quick and cheap scalability
- Distributed computing and storage
- No Data Warehouse, Analytics or BI use cases
Choose an RDBMS if you have or need:
- Consistent data/ACID transactions
- Complex dynamic queries requiring stored procedures or views
- Option to migrate to another database without significant change to existing application’s access paths or logic
- Data Warehouse, Analytics or BI use case
Based on the above considerations, if your application aligns better with NoSQL’s BASE properties and the other selection factors above, we can proceed to stage 2 and narrow the NoSQL choices through the CAP theorem.
Narrow the NoSQL Choices Through CAP Theorem
The CAP theorem frames the tradeoffs between ACID and BASE. It states that a distributed system can provide only two of the following three guarantees at the same time: Consistency, Availability, and Partition Tolerance. One of the three will always be sacrificed.
- Consistency: All nodes in the cluster hold the same data, so a read request to any node returns the most recent write.
- Availability: A non-failing node must always respond to requests in a reasonable time.
- Partition Tolerance: The system continues to operate during network or node failures.
As per the CAP theorem, we must choose from CA, AP or CP characteristics for a given system. This offers a way to categorize databases and provides guidance on determining which database will be a good fit for your application.
- Consistent and Available System: If your application requires high consistency and availability with no partition tolerance, a CA system is a good fit. Most traditional RDBMSs are CA systems, but we ruled them out in stage 1 of the fit analysis. A Graph Database such as Neo4j is also a CA system and will be analyzed in stage 3 of the fit analysis.
- Consistent and Partition Tolerant System: If your application requires high consistency and partition tolerance, a CP system is a good fit. CP systems are not able to guarantee availability, as the system returns an error until the partitioned state is resolved. Redis (K:V), MongoDB (Doc Store) and HBase (Col Oriented) are examples.
- Available and Partition Tolerant System: If your application requires high availability and partition tolerance, an AP system is a good fit. AP systems are not able to guarantee consistency, as writes/updates can be made to either side of the partition. Such systems usually provide GDHA (Geographically Dispersed High Availability), where data is bi-directionally replicated across two datacenters and both are in an Active-Active configuration, i.e. the application can read from and write to either datacenter. Riak (K:V), Couchbase (Doc Store) and Cassandra (Col Oriented) are examples.
After analyzing the CAP requirements for your application, you can narrow down to a set of NoSQL databases from the selected CAP category for further consideration in stage 3.
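As a rough sketch, the stage 2 narrowing can be expressed as a small lookup. The table below contains only the example databases named in this article and is not an exhaustive or authoritative classification; the function and type names (cap_category, candidates, "key-value" and so on) are hypothetical.

```python
# Example databases per CAP category and NoSQL type, as named in this article.
CAP_EXAMPLES = {
    "CA": {"graph": ["Neo4j"]},
    "CP": {
        "key-value": ["Redis"],
        "document": ["MongoDB"],
        "column-oriented": ["HBase"],
    },
    "AP": {
        "key-value": ["Riak"],
        "document": ["Couchbase"],
        "column-oriented": ["Cassandra"],
    },
}

def cap_category(consistency, availability, partition_tolerance):
    """Return 'CA', 'CP' or 'AP' for exactly two chosen guarantees."""
    chosen = [consistency, availability, partition_tolerance]
    if sum(chosen) != 2:
        raise ValueError("pick exactly two of the three CAP guarantees")
    return "".join(letter for letter, picked in zip("CAP", chosen) if picked)

def candidates(category, db_type=None):
    """List example databases for a category, optionally filtered by type."""
    by_type = CAP_EXAMPLES[category]
    if db_type is not None:
        return list(by_type.get(db_type, []))
    return sorted(name for names in by_type.values() for name in names)

category = cap_category(consistency=False, availability=True, partition_tolerance=True)
print(category, candidates(category))
print(candidates("CP", "document"))
```

Passing a db_type filter anticipates stage 3, where the NoSQL Database type narrows the list within the chosen CAP category.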
Determine NoSQL Database Type
As you may have noticed in stage 2, each CAP category contains more than one NoSQL Database type (K:V/Document Store/Column Oriented/Graph). In this stage, we further analyze the application purpose & use case to determine which NoSQL Database type should be considered from the CAP category chosen for your application.
NoSQL Database types are designed for a specific group of use cases. I have listed some of the key use cases for each NoSQL Database type. You can use this list as a starting point for analyzing your application’s requirements.
Choose K:V Stores if you have or need:
- Simple schema
- High velocity read/write with no frequent updates
- High performance and scalability
- No complex queries involving multiple keys or joins
Choose Document Stores if you have or need:
- Flexible schema with complex querying
- JSON/BSON or XML data formats
- Leverage complex Indexes (multikey, geospatial, full text search etc)
- High performance and balanced R:W ratio
Choose a Column-Oriented Database if you have or need:
- High volume of data
- Extreme write speeds with relatively lower-velocity reads
- Data extractions by columns using row keys
- No ad-hoc query patterns, complex indices or high level of aggregations
Choose a Graph Database if you have or need:
- Traversal between data points
- Ability to store properties of each data point as well as the relationships between them
- Complex queries to determine relationships between data points
- Need to detect patterns between data points
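To show why each type suits its use cases, the sketch below models the four data shapes with plain Python structures. This is an illustration only and does not use any database's actual API; all sample data and helper names are hypothetical.

```python
from collections import deque

# Key-value: the store only knows "value for key"; the value is opaque.
kv_store = {"session:42": "user=ada;expires=3600"}

# Document: nested fields that a database can index and query.
documents = [
    {"_id": 1, "name": "Ada", "skills": ["math", "programming"]},
    {"_id": 2, "name": "Grace", "skills": ["programming"]},
]

def find_by_skill(docs, skill):
    """Query inside the documents, e.g. via a multikey index in a real store."""
    return [doc["name"] for doc in docs if skill in doc["skills"]]

# Column-oriented: row key -> column name -> value.
wide_rows = {
    "sensor-7": {"temp:09:00": 21.5, "temp:09:05": 21.7, "humidity:09:00": 40},
}

def read_columns(rows, row_key, prefix):
    """Extract a slice of columns for one row key."""
    return {c: v for c, v in rows[row_key].items() if c.startswith(prefix)}

# Graph: relationships are stored explicitly, so traversal is the natural query.
follows = {"ada": ["grace"], "grace": ["linus"], "linus": []}

def reachable(graph, start):
    """Breadth-first traversal: everyone reachable from `start`."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen

print(kv_store["session:42"])
print(find_by_skill(documents, "programming"))
print(read_columns(wide_rows, "sensor-7", "temp:"))
print(sorted(reachable(follows, "ada")))
```

The key-value lookup needs the exact key, the document query looks inside the stored value, the column read slices one row by column name, and the graph query follows relationships across several hops.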
You have now decided on the CAP category and the NoSQL type for your application. At this stage, if we perform a fit analysis based on the selected NoSQL databases shown in Fig 1, our decision matrix would look as follows:
But as a last step, we also need to consider the database and technology characteristics of each NoSQL Database and the requirements from the application and organization to finalize a selection. These are detailed in stage 4.
Select NoSQL Database (Vendor)
Even after selecting a CAP category and NoSQL Database type, the fit analysis is not complete. Selection of a NoSQL Database also depends on the database technology, its configuration and available infrastructure, the proposed architecture of your application, the budget, and the skill set available at your organization. Factors to evaluate include:
- Backup and recovery configurations
- Cluster topology: GDHA / HADR, Active-Active / Active-Passive
- Replication: Synchronous, Asynchronous or Quorum
- Read/Write concerns and Indexing strategies
- Concurrency control: Locks, MVCC (Multi Version Concurrency Control), Read Your Own Write (RYOW)
- Security, access controls and encryption at rest
- Available APIs and Query methods: JSON, XML, REST, Thrift, CQL, MapReduce, SPARQL, Cypher, Gremlin etc.
- Infrastructure: On-premise or Cloud / Dedicated or Shared
- Database uptime categorization (99.9% up to 99.999%)
- Application Requirements: Use cases, R:W patterns, performance expectations/SLAs, upstream/downstream systems, criticality to the business etc.
- Implementation Language and SDKs: C/C++, Java, Python, Node.js etc.
- Application Architecture: Web Application, Microservices, Mobile etc.
- Data Integration: Batch processing, ETL, Streaming, Message broker, ESB etc.
- Complementary Technologies: Spark, Storm, Kafka, ELK, Solr, Splunk etc.
- Budget and cost considerations
- Team skillset
- Preferred vendors / existing technology stack
- Motivation for NoSQL/Big Data
- Business / Technology leadership sponsorship & support
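One possible way to compare shortlisted vendors against the factors above is a weighted scoring matrix. The sketch below is an assumption about how a team might do this, not a prescribed method. The criteria names, weights, scores and vendor labels are all hypothetical placeholders.

```python
def rank_vendors(weights, scores):
    """Rank vendors by the weighted sum of their per-criterion scores."""
    totals = {}
    for vendor, vendor_scores in scores.items():
        missing = set(weights) - set(vendor_scores)
        if missing:
            raise ValueError(f"{vendor} has no score for: {sorted(missing)}")
        totals[vendor] = sum(weights[c] * vendor_scores[c] for c in weights)
    # Highest total first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

# Hypothetical weights (importance, 1-3) and scores (fit, 1-5).
weights = {"replication": 3, "security": 2, "team_skillset": 1}
scores = {
    "vendor_a": {"replication": 4, "security": 3, "team_skillset": 5},
    "vendor_b": {"replication": 5, "security": 4, "team_skillset": 2},
}

for vendor, total in rank_vendors(weights, scores):
    print(vendor, total)
```

The weights encode what matters most to your application and organization, so the ranking is only as good as the requirements analysis behind them.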
Once all of these factors have been assessed, the application and data team should shortlist a couple of NoSQL Database vendors and perform a Proof of Concept to evaluate the technology and benchmark the performance in order to finalize the selection.
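A Proof of Concept benchmark can start from a very small harness like the sketch below. InMemoryStore is a hypothetical stand-in with put and get methods. In a real PoC you would replace it with a thin wrapper around each shortlisted database's client and use a workload that mirrors your application's R:W pattern.

```python
import time

class InMemoryStore:
    """Hypothetical stand-in for a real database client wrapper."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

def benchmark(store, n_ops=10_000):
    """Time n_ops sequential writes, then n_ops reads, against `store`."""
    start = time.perf_counter()
    for i in range(n_ops):
        store.put(f"key-{i}", i)
    write_s = time.perf_counter() - start

    start = time.perf_counter()
    hits = sum(1 for i in range(n_ops) if store.get(f"key-{i}") == i)
    read_s = time.perf_counter() - start

    return {
        "ops": n_ops,
        "hits": hits,  # reads that returned the value written earlier
        "write_ops_per_s": n_ops / max(write_s, 1e-9),
        "read_ops_per_s": n_ops / max(read_s, 1e-9),
    }

print(benchmark(InMemoryStore()))
```

Running the same harness against each candidate, on comparable infrastructure, gives a first like-for-like comparison before investing in more realistic load tests.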