A Report on Dave Grijalva’s NoSQL NOW! 2011 Conference Presentation
by Charles Roe
The plethora of NoSQL products on the market today makes choosing the right tool for your organization a complicated matter at best. All of them offer diverse functionality and different levels of usability and support, as well as distinct approaches to schema and indexing, failure cases, clustering preferences and a range of other options. How is an enterprise, no matter its size, supposed to choose from products like Cassandra, Redis, MongoDB, CouchDB, Hadoop, Riak and MarkLogic? That is only a short list of some of the best-known players in the field; there are many more, and all have pros and cons depending on the needs of a particular organization. Dave Grijalva – the Director of Platform Technology at ngmoco:) – discussed the many questions involved in choosing the correct NoSQL tool for a given enterprise during his NoSQL NOW! 2011 Conference presentation. He focused on the primary issues that need to be discussed when selecting a tool and the tradeoffs that must be evaluated, since all of the products have various advantages and disadvantages, and ended with a case study of ngmoco:). No system is perfect, and no system will ever fulfill every single need of an organization; the point is to weigh all the options and arrive at the best solution. Mr. Grijalva’s presentation provided some of the most relevant information for making such a decision. All graphical elements in this report are taken directly from Mr. Grijalva’s presentation.
Inherent in the problem are four essential features that each NoSQL system addresses to one extent or another; a decision must be made about which of them are most and least important for a given organization. Those features are: Access Patterns, Schema and Indexing, Clustering, and Operations.
Access Patterns
The first major consideration in choosing the right NoSQL tool is how all that data will be accessed. The amount of content within a system can range from finite, if there is a specific list of users who will ever use it, to essentially unbounded, if the project will include user-generated content. The two approaches require different configurations and have completely different sets of requirements. In an unbounded system such as a social media or gaming platform, the content volume will probably be low as the system moves into production, but could climb exponentially very quickly. In a finite system with a known list of users, the access patterns will remain principally the same, with known periods of increased traffic. Each system requires knowledge of whether it will be read- or write-heavy, the primary access patterns of the content in terms of times and server loads, the frequency of data changes, the ratios of inserts/updates/reads/deletes, and the total number of operations expected at given times of day. Those questions do not always have answers during the early phases of a project, and often not until it moves into full production with users.
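As a concrete illustration of that last point, the sketch below tallies an operation mix from a hypothetical application log. The log format and operation labels are assumptions chosen for illustration, not anything prescribed in the presentation.

```python
from collections import Counter

# Hypothetical operation log: one label per request, e.g. gathered
# from application-level instrumentation during a load test.
op_log = ["read", "read", "insert", "read", "update", "read", "delete"]

counts = Counter(op_log)
total = sum(counts.values())

# The insert/update/read/delete mix and the read/write ratio --
# two of the numbers worth pinning down early.
for op in ("insert", "update", "read", "delete"):
    print(f"{op:>6}: {counts[op] / total:.0%}")

writes = counts["insert"] + counts["update"] + counts["delete"]
print(f"reads per write: {counts['read'] / writes:.1f}")
```

Even rough numbers like these, gathered from a load test, make the read-heavy-versus-write-heavy discussion far more concrete than guesswork.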
Furthermore, how valuable is the data? In many instances, the stakeholders of a project will say “it’s all important,” when in reality some data is usually more important than the rest. Which data matters most, which matters least, what percentage or amount of data loss is acceptable, and how fast the data must be accessible all have to be weighed. If “100% data accuracy, 100% of the time” is truly the most crucial requirement, then access speeds, redundancy and other factors must be part of the tradeoff.
Schema and Indexing
There are seemingly as many schemas as tools on the market today, and the schema has a direct bearing on how the data will be accessed. What will your schema look like? Will you use a schema like the one in Cassandra, with different tiers of column storage? A document store similar to MarkLogic? A key/value store? A BigTable implementation? A graph database? Or some other arrangement? This is an essential question when choosing the correct tool. The schema directly affects the kinds of queries you can use, the loads and speeds of reads/writes, performance, latency and a variety of other interactions within the system. One critical aspect of the chosen schema is how it deals with indexing. If you are serving user requests or have API requirements that need real-time processing, then high indexing throughput and low latency are a necessity. If most of the database requirements come from backend processing, with far less real-time processing, then you may be able to accept more system latency. The final decision to be made in terms of schema and indexing is the necessity of secondary indexes. Many key/value stores don’t provide them built in, though in some instances you can build a secondary index yourself or go to an external storage solution. Do you need secondary indexes in your tool? If so, that becomes an important consideration when choosing.
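To make the “build it yourself” option concrete, here is a minimal sketch of a hand-rolled secondary index over a key/value store. A plain Python dict stands in for the store, and the record fields are invented for illustration; no particular product’s API is implied.

```python
# Stand-in key/value store: primary key -> user record.
store = {
    "u1": {"name": "Ada", "country": "UK"},
    "u2": {"name": "Ben", "country": "US"},
    "u3": {"name": "Cai", "country": "UK"},
}

# Hand-rolled secondary index: country -> set of primary keys.
# It must be updated on every write, which is the hidden cost of
# building secondary indexes yourself on a key/value store.
by_country = {}
for key, record in store.items():
    by_country.setdefault(record["country"], set()).add(key)

def put(key, record):
    """Write a record and keep the secondary index in sync."""
    old = store.get(key)
    if old is not None:
        by_country[old["country"]].discard(key)
    store[key] = record
    by_country.setdefault(record["country"], set()).add(key)

# Query by the indexed attribute instead of scanning every key.
put("u2", {"name": "Ben", "country": "UK"})
print(sorted(by_country["UK"]))  # ['u1', 'u2', 'u3']
```

The design cost is visible even at this scale: every write path must update the index as well as the store, and keeping the two consistent under concurrency and node failure is precisely what built-in secondary indexes spare you.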
Clustering
The server configuration is of utmost importance when choosing a given NoSQL tool. If you have a small data set, then staying with a traditional master/slave configuration, or a small cluster, is probably enough. But if that data set and its accompanying number of operations are expected to grow over time to levels that a master/slave configuration cannot handle, then scalability becomes paramount. The many tools available all manage scalability in different ways. Some of the crucial questions to consider are:
- Sharding: How is sharding done? Cassandra, Hadoop, Redis and all the other systems deal with sharding in their own fashions; all have pros and cons that need to be considered up front.
- Cost of Adding/Removing Nodes: The repartitioning of a cluster can cause significant downtime and carries a range of costs and challenges depending on the given tool. If you expect such a challenge as your system grows, then you need to know how each candidate tool constrains adding and removing nodes, and the costs associated with those changes (the sketch after this list illustrates that cost).
- Where (in the world) is your Data? Is your data all in one center, or is it spread across many centers around the world? Some tools are adept at dealing with the separation of massive amounts of data across large geographic distances; others are not. Does the tool you’re looking at provide adequate geo-distribution for your needs?
- Failure Cases: This is perhaps the most important issue to address before implementing a new tool. The literature for any tool on the market lists numerous failure-case scenarios and explains how the system copes with them. But make sure you look at each and every possible case before moving forward. What happens if one node goes down? Many nodes at once? A whole data center? A single disk, or several? An application server?
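The following sketch shows why node changes can be so costly under the simplest sharding scheme, naive hash-modulo partitioning. The key format and node counts are arbitrary assumptions chosen for illustration.

```python
import hashlib

def shard_for(key: str, num_nodes: int) -> int:
    """Naive hash partitioning: hash the key, take it modulo the node count."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % num_nodes

keys = [f"user:{i}" for i in range(10_000)]

# Count how many keys land on a different node when the cluster
# grows from 4 nodes to 5 -- with modulo placement, most of them
# move, which is exactly the repartitioning cost described above.
moved = sum(1 for k in keys if shard_for(k, 4) != shard_for(k, 5))
print(f"{moved / len(keys):.0%} of keys relocate")  # roughly 80%
```

With modulo placement, growing the cluster from four nodes to five relocates roughly 80% of the keys; consistent hashing, the approach used by systems such as Cassandra and Riak, limits the movement to roughly the one fifth of keys that actually belong on the new node.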
Operations
Once the application is designed and ready for deployment, it must run in production, and many aspects need careful consideration at the operational level:
- Backup and Recovery: Traditional RDBMSs have many time-tested techniques for backup and recovery; they work well and have for years. Among the newer NoSQL systems, no such consistency exists. Some use fairly standard backup techniques, while others rely on replication strategies instead of backups. Some deal well with multiple-data-center backup and recovery, while others still have serious trouble with it. Some systems have relatively low I/O costs for backing up to other nodes or clusters, while others have significantly higher costs. What does your organization require, and how can you best implement those needs? (A minimal backup sketch follows this list.)
- Multiple Data Center Issues: This issue has already been touched on in other sections, but deserves specific highlighting. How do the various tools you are considering deal with multi-data-center failures, replication, configuration and other such expansive issues on a regional, national and global scale?
- Ease of Deployment: If you are going to use cloud deployments, then you need a way to essentially push a button and deploy more nodes. If you manage your own data centers, then the same deployments require many people. How long can you wait for a specific node deployment? How does the given system handle recovery cases?
- Tools Integration: Once the application is deployed and users are interacting with it, you are guaranteed to see anomalous behaviors you didn’t plan for. Even with excellent load testing, abnormal occurrences will happen. The more distributed your system is, the harder it is to have tools that keep track of all the interactions going on at any given time. How much information do you need from your tools? How smooth does integration need to be? How is your system prepared for anomalies?
- Support: Many of the top NoSQL tools available are open source and community-managed; such systems bring specific challenges that traditional off-the-shelf products do not. How much does support cost? What are the various support options? How available is support? If you have a problem at 2am, do you then have to post to a mailing list and hope somebody is awake to help you, or is there a 24/7/365 support system you can easily access?
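As a concrete anchor for the backup discussion above, here is a minimal snapshot-style backup sketch, assuming a file-based data directory. The paths and the verify-by-checksum step are illustrative choices, not any particular product’s procedure.

```python
import hashlib
import shutil
import time
from pathlib import Path

def snapshot(data_dir: str, backup_root: str) -> Path:
    """Copy a node's data directory to a timestamped backup directory
    and verify the copy by comparing per-file SHA-256 checksums."""
    src = Path(data_dir)
    dest = Path(backup_root) / time.strftime("%Y%m%d-%H%M%S")
    shutil.copytree(src, dest)

    # Re-read both copies and compare digests before trusting the backup.
    for f in src.rglob("*"):
        if f.is_file():
            copied = dest / f.relative_to(src)
            if (hashlib.sha256(f.read_bytes()).hexdigest()
                    != hashlib.sha256(copied.read_bytes()).hexdigest()):
                raise IOError(f"backup verification failed for {f}")
    return dest

# e.g. snapshot("/var/lib/mydb/data", "/mnt/backups")
```

Even this trivial approach makes the I/O cost visible: every file is read in full (twice, counting verification) and written once, which is one reason some distributed systems lean on replication rather than explicit backups.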
To be continued…