What are Data Models and Query Languages?

Chapter 2 of Designing Data Intensive Applications: Data Models and Query Languages

May 30, 2021

Welcome to the 10% Smarter newsletter. I’m Brian, a software engineer writing about tech and growing skills as an engineer.

If you’re reading this but haven’t subscribed, consider joining!

The database. For most people, it’s a mysterious piece of software, they’ve vaguely heard of that businesses use. Maybe you’ve used it to store or read some data, but you’re probably full of questions. Why it’s such an important tool in so many applications? What are relational databases? What is NoSQL? Well, let’s find out!

Today, we’ll go through chapter 2 of Designing Data-Intensive Applications. This chapter deals with Data Models and Query Languages, giving an overview of the data modeling side of databases. Let’s start with the most common type of database model, the relational model.

The Relational Model

What is the relational model? The relational model stores data in a relation (table) and is simply a collection of tuples (rows). To relate two tables, a JOIN is used on a common feature.

Databases that use the relational model are called Relational Database Management Systems (RDMS) or SQL Databases. This model is simple to use and has 40+ years of production battle-tested work from IBM, Oracle, to now open source variants such as Postgres and MySQL. It’s even on the cloud, such as Amazon Relational Database Service (Amazon RDS) and Amazon Aurora.

Here’s an example of how you might store data as a collection of tables.

Bill Gates' LinkedIn represented as relations of tables

Types of Relationships

When thinking about relating data as tables or relationships, it’s important to understand the types of relationships possible. There are four types of possible relationships:

One-to-One: For each entity in A, it is associated with one and only one entities in B. An example is each country has exactly one capital city.
One-to-Many: For each entity in A, it may have a relationship to multiple entities in B. For example, a student may take many classes.
Many-to-One: There may be multiple entities in A, but each must have at most a single relationship to an entities in B. For example, multiple student enrolls in a university, but cannot enroll in multiple universities at the same time.
Many-to-Many: There may be multiple entities in A, which may have multiple relationships to entities in B. For example, a student may enrolls in multiple classes and classes may have multiple students. Many-to-Many relations can encompass multiple One-to-Many relations.

From our example of Bill Gates’ LinkedIn, his region and industry are a One-to-One relationship because he can only be part of one region and industry. His positions, education, and contact info are One-to-Many as he may have worked in multiple positions, had education at multiple schools and have different types of contact info.

Relational Model Benefits and Weaknesses

What are the benefits and weaknesses of the Relational Model? The benefits of the Relational Model are it is easy to read and write; you don’t have to worry about records relationships or access paths, and have a query optimizer that automatically decides which parts of the query to execute, which order to execute, and which indexes to use (the access path).

The relational model is effective for many-to-one and many-to-many relations as it can JOIN tables and match these relations.

The weakness of the Relational Model is data may have an awkward translation, known as impedance mismatch, as it has to be reorganized as tables. Most data in applications are represented as objects. Bill Gates’ LinkedIn, would includes his education and experience as attributes of his person object, but in a relational database, all three table would have to be joined together.

Additionally, the data for an entity in tables which are located separately in memory and joining them has a performance cost.

Now to NoSQL, an alternative data model that has provided an alternative since it was started in the early 2010s.

What is NoSQL?

NoSQL stands for Not Only SQL and aimed to explore new data models that have

greater scalability avoiding JOINs and using multiple machines
more dynamic and expressive data models such as JSON.

Some examples are

Document Model databases such as MongoDB, DynamoDB
Column-Family databases like Cassandra or BigTable
Key-Value databases such as Redis, Memecached, or RocksDB
Graph databases such as Neo4j and Amazon Neptune.

We’ll focus on the Document Model as it is the best alternative to the Relational Model and the rest are used for specialized cases.

The Document Model

The Document Model represents data in a document tree like structure like JSON.

For example, a person may have attended multiple schools, which is a natural one-to-many tree relationship. In the relational model, this would require joining a person and schools table but is a natural tree structure in the document model. One can think of a document as a pre-joined table.

Bill Gates' LinkedIn represented as a JSON document.

If we look at Bill Gates’ LinkedIn in the relational model and document model, the document model seems like a better match for how the data is displayed and structured! The relational model seems complex, having to pool data from different relations. Then why is the relational model still so popular? We can look at the tradeoffs of the document model to see.

Document Model Benefits and Weaknesses

The benefits of the Document Model is that storing data as JSON is close to a natural data representation. Most data is stored this way in a one-to-many representation. A lack of schema also allows the organization of data to be easily updated, compared to creating new tables in the relational model.

The Document Model has better locality as all the relevant information is in one place, and requires one query.

The weaknesses of the Document Model is that it cannot handle many-to-one and many-to-many well. The two options are to manually duplicate data or manually perform joins in code.

In LinkedIn, while a user to schools and jobs relationship is One-to-Many, it makes sense to have an official entity for schools and jobs so they do not get duplicated.

This quickly becomes complex and shows why the relational model is still logical option when modeling many applications.

The Convergence of the Document and Relational Model

In recent times, both the document model and relational model have learned from each other and taken features from each other. PostgreSQL now supports JSON documents. MongoDB supports simple server-side joins.

Additionally, Column-Family databases are a hybrid of document store and a relational database. They still uses tables, rows, and columns like a relational database, but the names and formats of the columns can be different for various rows, similar to the flexibility of a document store.

What is the Graph Model?

As a quick aside, the graph model can be a good fit if your data is graph like. If many-to-many relationships are very common in your application, it becomes more natural to start modeling your data as a graph. A graph consists of vertices and edges and is stored as such. Some examples for modeling are social graphs and web graphs. These databases are still a special cases and less commonly used.

Query Languages and SQL

A final innovation of the relational database are query languages. SQL is a query language used for querying data in relational databases.

In Declarative languages, you specify the goal of the data you want, and lets the system decide how to fetch the data. In Imperative languages, you tell the computer to perform certain operations in order to achieve a goal.

SQL is Declarative because it hides the implementation, making it possible for the database to use performance optimizations, such as parallel execution, without requiring any changes to queries .

Conclusion

Hopefully this gives a good overview of data models such as the relational model and document model. What should you use? You should aim to use the model that fits your data, but in most cases it is better start with the relational model. Andy Pavlo recommends a Relational Database like Postgres first, because it has the most features and is well battle tested for all sorts of applications. Then if you need to upgrade, you can hire smart people to improve your database or system.

Well that concludes today’s chapter on Data Models and Query Languages. Next week we’ll go over the internals of databases. What are LSM-Trees or B-Trees? Stay tune to find out! Please give a share if you like this article.

10% Smarter