Click to learn more about author Niels Brinch
Experts in relational database structure may find it very confusing to structure data in a document database. However, there are actually more similarities than differences.
Here are some observations that may make it easier for experts in relational databases to make the transition to design their data structure in a document database.
Forget the Distractions
There are many popular ideas about how document databases should be structured, but many of these are for very specialized circumstances and are not useful in terms of designing a basic document database well.
So, forget everything you have heard and start over now with the foundation being just your knowledge about normal relational database design.
Types are Still Good
While you are bound by certain columns in a table in a relational database, you are bound by no such limitations in a document database. It contains JSON documents, and they can have any fields you want. That doesn’t mean that it is a good idea to mix different types of data together.
Some document databases provide a way to indicate the type, such as via a collection or container, but even if they do not, you can simply indicate the type by including a “type” field on the document. This would be equivalent to a table, and all documents of the same type should hold the same fields.
However, whenever you add a new field, that field will not be present in old documents. You do not need to update old documents. Instead, you just accept that some documents do not contain all fields and then handle it when you load them in your application.
It is not a good idea to change the name of a field, but that was never uncomplicated in a relational database either, so not much of a difference there.
A relational database excels at being able to relate data to each other via foreign keys.
Some people have heard that you are supposed to put all your data in a single document and keep it related that way. Say you have customers and orders; then, they would say that you should have one customer document, and inside that document have a list of orders. That is wrong and a distraction. Forget about that.
Instead, keep it neatly separated as you would in a relational database and keep them related using foreign keys. Needless to say, you cannot have referential integrity, but that is an issue for the application, not the database.
Generally speaking, you cannot query efficiently on a document database unless you know what you are looking for. In a relational database, you can ask for anything at any time, but it will only be truly efficient with an index. It is the same for a document database, except indexes are far more important – and they are not called indexes.
Instead, you copy data, so it is ready for the queries you intend to make. It is very similar to indexes with included columns or even views, but it’s usually more cumbersome to change.
Some providers, like Cosmos DB, provide a way to query your data any way you want, and it is quite fast, but it comes at a cost in terms of hard cold cash to pay for the Request Units required for the queries. Some people say Cosmos DB is very expensive, but that is because they query it mindlessly instead of structuring it diligently.
Partition by Foreign Key
The most important method of structuring copied data is by partitions. It is basically the same data as before, but it is partitioned, usually by one or more fields on the document.
For example, you may have a type called “file” and a type called “folder.” On the “file” documents, you have a foreign key called “folderId.” So far, so good. However, efficiently getting all the “file” documents that belong to a certain “folder” will require you to partition for that.
The partition may look something like this: “file-12345” where 12345 is the folderId. All “file” documents belonging to that folder are placed in the partition and are incredibly fast to fetch.
You can make as many partitions as you want, but you pay in one of two ways:
- Insertion time (just like a table with many indexes)
- Delay before the data becomes available in a partition
Partition by Query
Say you have blog posts with likes and comments in your document database. That is documents of the type “post,” “like,” and “comment.”
The most common query will be opening a single blog post, so you want to efficiently get the blog post and all the likes and comments. The most efficient way of fetching them will be to put them in the same partition.
That partition might be called “post-6789” and will contain three types of documents in no particular order. The calling application can check the “type” field to see what type of document was retrieved.
Single Document Partitioning
If you have one partition with all items relating to a blog post, why not put all that data inside a single large document that can be fetched even faster?
That is indeed a useful strategy in some scenarios, but it is rarely worth it. Think about the trade-off.
Firstly, you have to update this large document every time the blog post is changed, or a like or comment is added. This complicates the application, which has to ensure that two users do not update the document at the same time.
Secondly, many document databases have a limitation on the size of a single document.
And the benefit is supposed to be faster query time, but you are already getting all documents in a partition incredibly fast. It is very unlikely that it is not fast enough if you structure your data well.
Large Data Volumes
You may want to put all your “comment” documents in one big partition and then make efficient partitions for each method of a query.
However, that partition becomes very big, and although you never query it directly, the size alone can become a problem.
Document database literally partitions logically and even physically by their partition, and it does not usually split up a single partition, so it is stuck with a large volume of data.
In Cosmos DB the upper limit for a single logical partition is 10 GB.
If you ever have a volume of data that may grow too big, you must partition it in advance. Hopefully, you have a natural foreign key to partition by. It could be by country, by author, or by month. When you partition your base data, think about one question: How will you iterate through all your data?
Say you partition by the author; then, you need to know all the possible authors. If you delete one author, then you will be unable to actually find the data relating to that author because it is in a partition with an author id you no longer know about.
If you have experience with relational database design, you will be able to easily design a document database as well, as long as you think about partitioning, as described in this article. The rest is more or less following the same principles you’re used to.