Data Modeling In The Age Of NoSQL And Big Data

By Jennifer Zaino

Hadoop HBase. MongoDB. Cassandra. Couchbase. Neo4j. Riak.

Those are just a few members of the sprawling community of NoSQL databases, a category that originally sprang up in response to the internal needs of companies such as Google, Amazon, Facebook, LinkedIn, Yahoo and more – needs for better scalability, lower latency, greater flexibility, and a better price/performance ratio in an age of Big Data and Cloud computing.

They come in many forms, from key-value stores to wide-column stores to data grids and document, graph, and object databases. And as a group – however still informally defined – NoSQL (considered by most to mean “not only SQL”) is growing fast. The worldwide NoSQL market is expected to reach $3.4 billion by 2018, growing at a CAGR of 21 percent between last year and 2018, according to Market Research Media.

“There are all kinds of applications being built now with NoSQL systems,” says Rick van der Lans, an independent analyst, consultant, speaker and author who specializes in data warehousing, Business Intelligence (BI), database technology, and data virtualization. His book, Introduction to SQL, was the first book about SQL databases available in English.

Web sites and BI provide two examples of where NoSQL databases are being adopted. “Both worlds are Big Data worlds,” he says. With web sites like Amazon, for example, “they are analyzing what you as a customer do on the spot. You are looking for a book and they are hosting recommendations, so they are analyzing live the enormous web logs of highly complex and multi-structured data created in the background as users move around their site.”

Indeed, one of the advantages that NoSQL brings to the table for Big Data is that it allows storage of schema-less data, which makes it well-suited to environments where the data doesn’t have a particular structure. The data may be unstructured, like text, and it may be open to many different structures being applied to the same data:

“That’s why some call the data multi-structured, meaning that you can look at the same data from different angles,” says van der Lans: perhaps from the customer’s point of view today and from the supplier’s tomorrow. “It’s as if you are using different filters when looking at the same object.”

Instead of coming up with the structure for modeling the data in advance, as is the case with relational databases, “NoSQL systems let us store the data as it comes in,” he says, in nested and hierarchical structures, in records in tables that can have different structures, and to which values can be added for which no column has been defined yet. “When we access the data, when we query it, then we determine the structure we want to use. That means it’s more flexible.”
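As a rough illustration of storing data as it comes in and deciding on a structure only at query time, here is a minimal Python sketch. It uses plain dictionaries in place of a real NoSQL store, and the records, field names, and the spend_by function are all hypothetical, invented for this example rather than taken from any product van der Lans describes.

```python
# Hypothetical records stored "as they come in": no shared schema,
# nested values allowed, and fields nobody declared in advance.
events = [
    {"customer": "alice", "supplier": "acme", "item": "book", "price": 12.5},
    {"customer": "bob", "supplier": "acme", "item": "lamp", "price": 30.0,
     "gift": {"wrap": True}},                                   # extra nested field
    {"customer": "alice", "supplier": "zenith", "item": "pen"},  # no price at all
]


def spend_by(records, key):
    """Impose a structure at query time: total price grouped by `key`."""
    totals = {}
    for rec in records:
        group = rec.get(key, "unknown")
        # Missing fields are handled here, by the reader, not by the store.
        totals[group] = totals.get(group, 0.0) + rec.get("price", 0.0)
    return totals


# The same stored data, looked at through two different "filters".
by_customer = spend_by(events, "customer")
by_supplier = spend_by(events, "supplier")
print(by_customer)
print(by_supplier)
```

Nothing about the stored records changes between the two calls; only the question asked of them does.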

Data Modeling Still A Priority

Data modeling, then, still has an important role to play in NoSQL environments. “The data modeling process is always there,” he says. You can look at that role in a simple way, van der Lans explains, by thinking of it as a process that leads to a diagram. In the process of creating the diagram, you are trying to understand what the data means and how the data elements relate to one another. Thus, “understanding” is a key aspect of data modeling.

Just as is the case when you are doing data modeling for SQL environments, data modeling for NoSQL requires doing the same homework: Talk to end users and read reports to come up with some logical model that specifies the structure and meaning of the data. That’s the business-oriented step, he notes. “The moment we want to interpret data we have to understand it,” he says. Implementing that logical model – the physical and technical aspect of data modeling – is what changes dramatically in NoSQL environments compared to SQL environments.

In the SQL environment, the data modeling process that leads to such an understanding lives inside the database server. In NoSQL environments, however, the data modeling ends up in the code of the application that reads the data, van der Lans says. “Twenty years ago, if you would do data modeling, the result would always be a database structure – tables and columns.” In today’s NoSQL environments, “what will happen is the data model ends up as lines of application code….The structure is there but in the lines of application code.” With that approach, changing how you want to look at the data means changing the code; there’s no requirement to reorganize the physical database.
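One way to picture a data model that ends up as lines of application code is the following Python sketch. The raw record, the CustomerView class, and its fields are hypothetical and exist only to illustrate the idea; the types, defaults, and derived values are all decided by the code that reads the data, not by the store.

```python
from dataclasses import dataclass

# Hypothetical raw record as stored; the store enforces no structure on it.
raw = {"id": "42", "name": "Alice",
       "orders": [{"total": "19.5"}, {"total": "5.5"}]}


@dataclass
class CustomerView:
    """The 'data model' lives here, in application code."""
    customer_id: int
    name: str
    lifetime_value: float

    @classmethod
    def from_raw(cls, rec):
        # Types, defaults, and derived fields are chosen by the reader.
        totals = [float(o.get("total", 0)) for o in rec.get("orders", [])]
        return cls(int(rec["id"]), rec.get("name", ""), sum(totals))


view = CustomerView.from_raw(raw)
print(view)
```

A different view of the same record would be a different class or function; the stored record itself would stay as it is.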

Keep The Differences in Mind

He also points out that successfully conducting these implementations will require making a distinction between the different types of NoSQL systems. “There’s a tendency to classify NoSQL as one big, homogeneous group of products, but they are very different,” and you have to build data models in such a way as to get the best performance out of the kind of system you are working with.

“In NoSQL systems everything is secondary to performance and scalability. In a SQL world, the tendency is to come up with a very neutral database structure that is good for everything. With a NoSQL system your goal is to make one application incredibly fast and scalable.”

Take, as an example, designing a data model for MongoDB, a document NoSQL database that some people also refer to as a JSON data store. To attain speed and scale, the focus should be on your transactions. Data that is inserted and updated together, such as orders and order lines, would be two separate tables in an SQL database. “In MongoDB we’d say that’s just one table, so if someone does an insert of an order then the order heading and lines form one object logically and physically,” he says. So, the design of the data model in MongoDB is heavily influenced by the transactions, whereas in Hadoop HBase the design may be more aimed at reporting, which can lead to denormalized data models.
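The order example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the lists below stand in for the two relational tables, the field names and the to_document helper are hypothetical, and no database driver is involved.

```python
import json

# Relational-style shape: two separate "tables" linked by order_id.
orders = [{"order_id": 1, "customer": "alice"}]
order_lines = [
    {"order_id": 1, "sku": "BOOK-1", "qty": 2},
    {"order_id": 1, "sku": "PEN-7", "qty": 10},
]


def to_document(order, lines):
    """Fold an order heading and its lines into one nested document."""
    doc = dict(order)
    doc["lines"] = [
        # The foreign key is redundant once the lines are embedded.
        {k: v for k, v in line.items() if k != "order_id"}
        for line in lines
        if line["order_id"] == order["order_id"]
    ]
    return doc


order_doc = to_document(orders[0], order_lines)
print(json.dumps(order_doc, indent=2))
```

In a document store, the order heading and its lines would then be written and read as this single object, which is the transaction-driven design choice described above.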

The Changing Landscape

The message to reinforce to those becoming acquainted with NoSQL systems and modeling data for them is that they work differently from classic database servers and must be optimized differently, he says. That includes taking on work that SQL systems handle automatically, such as deciding where the data is stored or finding the most efficient way of extracting it. In NoSQL, programmers have to solve that for themselves in the systems’ low-level programming languages; though that means more control over performance, it can be a drag on time to market for new reports or apps, he states. Building a data security and integrity layer is also in the hands of the programmer, assuming there’s no concern that adding it will slow down performance when performance is the top requirement.

Interestingly, those data specialists who’ve been around long enough – from the pre-relational database era – may be able to apply some data modeling tricks from that long-ago past to this new world. “There’s a slight relation between NoSQL and pre-relational databases – they were based on a hierarchical structure and some of the ways they used then to optimize systems almost in the same way apply to some NoSQL systems,” he says.

But there’s no reason to believe that things are standing still. The market is moving very fast, he says, noting that in just the last year a lot of new software layers to support SQL on top of NoSQL systems have come to the fore too, so that you can use SQL to access these NoSQL environments. “This new class of products changes everything.”
