Data Encoding and Modes of Dataflow
Chapter 4 of Designing Data Intensive Applications: Encoding and Evolution
Welcome to the 10% Smarter newsletter. I’m Brian, a software engineer writing about tech and growing skills as an engineer.
If you’re reading this but haven’t subscribed, consider joining!
Hello! So far, we’ve explored how data is stored in databases, but data has to be manipulated and moved around to be useful for its users. Today we’ll look at how data is encoded and the different modes of dataflow, through databases, services, and message-passing.
Data generally lives in one of two representations. The first is in-memory data structures, which are optimized for efficient access and manipulation by the CPU. The other is a self-contained sequence of bytes (e.g. JSON) for writing to a file or transmitting over the network.
For translating between these representations we have encoding (also called serialization), which is the translation from an in-memory representation to a byte sequence, and decoding (deserialization), which is the translation from a byte sequence back to an in-memory representation.
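As a minimal sketch in Python, using the standard json module, encoding and decoding look like this:

```python
import json

# In-memory representation: objects, lists, hash maps, optimized for CPU access
user = {"name": "Martin", "interests": ["daydreaming", "hacking"]}

# Encoding (serialization): in-memory representation -> sequence of bytes
encoded = json.dumps(user).encode("utf-8")

# Decoding (deserialization): sequence of bytes -> in-memory representation
decoded = json.loads(encoded.decode("utf-8"))

assert decoded == user  # the round trip preserves the data
```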
Let’s start by looking at different kinds of interchange data formats for sequences of bytes.
Data Interchange Formats
JSON is the most widely used text-based interchange format, but it has some issues:
Ambiguity around the encoding of numbers, especially large numbers, which cannot be represented exactly in languages that parse JSON numbers as floating point.
No support for binary strings. People work around this by encoding binary data as Base64, but this increases the data size by about 33%.
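The overhead is easy to see: Base64 emits 4 output characters for every 3 input bytes. A quick Python sketch:

```python
import base64

binary = b"\x00" * 3000             # 3,000 bytes of binary data
encoded = base64.b64encode(binary)  # 4 output chars per 3 input bytes

overhead = (len(encoded) - len(binary)) / len(binary)
print(len(binary), len(encoded), f"{overhead:.0%}")  # 3000 4000 33%
```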
Binary encoding is an alternative approach designed for compact binary representations. It is used because the efficiency of the data format can have a big impact, especially when the dataset is on the order of terabytes.
Some popular binary encoding libraries are Protocol Buffers (protobuf) and Apache Thrift. Protocol Buffers were designed by Google and are widely used, and Apache Thrift was designed by Facebook for similar purposes.
Let’s look at an example. Here, we have an example JSON record we’d like to encode.
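The record from the book’s example looks like this:

```json
{
  "userName": "Martin",
  "favoriteNumber": 1337,
  "interests": ["daydreaming", "hacking"]
}
```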
If we’d like to store this data efficiently, we might use Protocol Buffers. Protocol Buffers require a schema that defines each field, its type, and how the data will be encoded. The schema also assigns every field a tag number, which clients use to identify fields and stay compatible with the data as it evolves.
Encoded with Protocol Buffers, the record takes only 33 bytes. The byte sequence contains everything needed to encode and decode the record: field tags, types, value lengths, and the data values themselves.
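The Protocol Buffers schema for this record, following the book’s example, looks like this. Note that the encoding stores each field’s tag number, not its name:

```protobuf
message Person {
  required string user_name       = 1;  // tag 1 appears in the encoding, not "user_name"
  optional int64  favorite_number = 2;  // tag 2
  repeated string interests       = 3;  // tag 3
}
```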
Backwards and Forward Compatibility
One major benefit of Protocol Buffers and Apache Thrift is their backward and forward compatibility. This matters when designing applications that deal with data, because both the data and the applications evolve over time. Backward compatibility means that newer code can read data that was written by older code. Forward compatibility means that older code can read data written by newer code. The latter is trickier because it requires older code to ignore additions made by a newer version of the code.
Protocol Buffers and Apache Thrift achieve backward and forward compatibility through field tag numbers.
Backward compatibility: As long as each field has a unique tag number, new code can always read old data. Every field added after the initial deployment of the schema must be optional or have a default value, so that new code can still read old records that lack it.
Forward compatibility: When you add new fields, you give them new tag numbers. Old code reading data written by new code can simply ignore fields with unrecognized tags.
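As a sketch, suppose a hypothetical Person schema later gains a photo_url field (an invented example): the new field gets a fresh tag number and must be optional:

```protobuf
message Person {
  required string user_name       = 1;
  optional int64  favorite_number = 2;
  repeated string interests       = 3;
  optional string photo_url       = 4;  // new field: new tag number, must be optional
}
```

Old code reading a record written with this schema skips tag 4; new code reading old records simply sees photo_url as absent.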
We’ve seen how data can be encoded to be used and transferred between different applications, but how is the data actually transferred? Dataflow is how the data is transferred and moved between different applications.
Dataflow Through Databases, Services, and Message-Passing
Dataflow through Databases
In a database, one process writes to the database and encodes the data, and another process reads from the database and decodes it. Backward and forward compatibility are both important here. A useful saying is that data outlives code: your code is refactored far more often than your data is migrated. Backward compatibility is needed because a value in the database may have been written by an older version of the code long ago, and the current code must still be able to read it. Forward compatibility is needed because during a rolling upgrade, older code may read rows written by newer code, and it must ignore any new columns it does not recognize.
Dataflow through Services
Whenever there is communication between processes over a network, a common arrangement is to organize them as clients (such as web browsers) and servers. A server can also be a client to another service. A web app server is also usually a client to a database.
When designing a system with multiple services, it’s common to use a service-oriented architecture (SOA), more recently refined into microservices: the practice of decomposing a large application into smaller services by area of functionality.
This data can be transferred over a network through either web requests to a REST API or a remote procedure call (RPC).
REST
REST stands for REpresentational State Transfer, and it is a set of design principles for API design. A server exposes an API over the network, and clients make requests to it over HTTP. Resources are identified by URLs and requested using HTTP methods, with data exchanged in simple formats such as JSON. REST APIs are also easy to experiment with using the curl command-line client.
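To make this concrete, here is a self-contained Python sketch (standard library only; the /users/1 resource and its fields are invented for illustration) of a client fetching a resource identified by a URL:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import urlopen

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The URL path identifies the resource being requested
        if self.path == "/users/1":
            body = json.dumps({"id": 1, "name": "Martin"}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # silence per-request logging
        pass

# Start the server on a free port in a background thread
server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client: an HTTP GET on the resource's URL returns a JSON body
port = server.server_address[1]
with urlopen(f"http://127.0.0.1:{port}/users/1") as resp:
    user = json.loads(resp.read())
print(user)
server.shutdown()
```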
Remote Procedure Call (RPC)
RPC stands for Remote Procedure Call, and an RPC framework works by attempting to make a request to a remote network service look the same as calling a function within the same process.
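As an illustration of the “looks like a local call” idea, Python’s built-in xmlrpc module (a simple RPC framework, used here as a stand-in for something like gRPC) makes this visible:

```python
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

# Server side: expose an ordinary function over the network
def add(a, b):
    return a + b

server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(add, "add")
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: the remote call reads exactly like a local function call,
# but every invocation is really a network round trip that can fail or time out
port = server.server_address[1]
proxy = ServerProxy(f"http://127.0.0.1:{port}")
result = proxy.add(2, 3)
print(result)  # 5
server.shutdown()
```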
RPCs can be dangerous. Some points of caution are:
A network call is unpredictable: request or response packets can get lost, or the remote machine may be slow or unavailable.
A local function call always produces a result (or throws an exception), but a network request may time out and return nothing at all, leaving you unsure whether the request got through.
Retrying will cause the action to be performed multiple times, unless you build a mechanism for deduplication (idempotence).
The RPC framework must translate datatypes from one language to another, as not all languages have the same types.
Despite their issues, RPCs are incredibly popular thanks to better networks and client libraries that deal with these problems. gRPC from Google is a popular example.
Dataflow through Message-Passing
Asynchronous message-passing systems sit somewhere between RPC and databases. They are similar to RPC in that a client’s request (a message) is delivered to another process with low latency. They are similar to databases in that the message is not sent via a direct network connection, but via an intermediary called a message broker or message queue.
The advantages of a message queue are that:
It can act as a buffer if the recipient is unavailable or overloaded.
It avoids the sender needing to know the IP address and port number of the recipient (useful in a cloud environment).
A message can be delivered to multiple recipients.
It can retry delivery to a process that has crashed and restarted, preventing messages from being lost.
It decouples the sender from the recipient. The sender does not need to know anything about the recipient.
The communication pattern here is usually asynchronous - the sender does not wait for the message to be delivered, but simply sends it and forgets about it. In a message queue, a process sends a message to a named queue or topic and the queue broker ensures that the message is delivered to one or more consumers or subscribers to that queue or topic. A topic can have many producers and many consumers.
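The fire-and-forget pattern can be sketched with Python’s standard queue module standing in for a real broker (which would add durability, retries, and network delivery on top):

```python
import queue
import threading

# A stand-in for a broker topic: producers put messages, consumers get them
topic = queue.Queue()
results = []

def consumer(name):
    while True:
        message = topic.get()
        if message is None:        # sentinel: shut the consumer down
            break
        results.append((name, message))

worker = threading.Thread(target=consumer, args=("worker-1",))
worker.start()

# Producer: send and forget -- no waiting for the consumer, no knowledge of
# its address, and the queue buffers messages if the consumer falls behind
for event in ["signup", "login", "logout"]:
    topic.put(event)

topic.put(None)
worker.join()
print(results)
```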
Apache Kafka is among the most popular message brokers.
Distributed Actor Model
The distributed actor model is a programming model for concurrency that scales across multiple nodes. An actor is an entity that encapsulates its own local state and communicates with other actors by sending and receiving asynchronous messages, processing one message at a time from its mailbox (a message queue).
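A minimal single-process sketch of an actor (a made-up Counter class, not a real framework like Akka or Orleans): state is private to the actor, and all interaction goes through its mailbox:

```python
import queue
import threading

class Counter:
    """Minimal actor sketch: private state plus a mailbox of async messages."""

    def __init__(self):
        self._count = 0                 # state only this actor's thread touches
        self._mailbox = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            message, reply_to = self._mailbox.get()  # one message at a time
            if message == "increment":
                self._count += 1
            elif message == "get":
                reply_to.put(self._count)
            elif message == "stop":
                break

    def tell(self, message):            # fire-and-forget send
        self._mailbox.put((message, None))

    def ask(self, message):             # send, then wait for the reply
        reply_to = queue.Queue()
        self._mailbox.put((message, reply_to))
        return reply_to.get()

actor = Counter()
for _ in range(3):
    actor.tell("increment")
result = actor.ask("get")   # processed after the increments, in mailbox order
print(result)  # 3
actor.tell("stop")
```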
This concludes this week’s overview as we look at different data encoding and the different modes of dataflow including databases, services, and message-passing.
Next time, we’ll start a deep dive into distributed data! We’ll look at how data in a database is replicated across many machines. Stay tuned for more, and share if you liked this article!