The Future of Data Systems
Chapter 12 of Designing Data-Intensive Applications: The Future of Data Systems
Welcome to the 10% Smarter newsletter. I’m Brian, a software engineer writing about tech and growing skills as an engineer.
Hello! Thank you all for reading the 10% Smarter newsletter. We’ve come to the end of a term as I present the last chapter of Designing Data-Intensive Applications. I hope you have all enjoyed this as much as I have. This final week, we’ll look at the future of data systems.
The Evolution of Data Systems
The first thing Martin Kleppmann notes is how data systems have evolved. Apache Spark was designed for batch processing and Apache Flink for stream processing, but both systems have evolved to handle the opposite workload too. Spark can perform stream processing on top of a batch engine by cutting the stream into small batches, while Flink can perform batch processing on top of a stream engine by treating a bounded dataset as a stream that happens to end.
Another area where data systems are evolving is the database itself. This is known as the unbundling of the database: the components of a database are split into separate, specialized systems. Each component can then be developed, improved, and maintained independently by a different team. The cloud has accelerated this, because storage and compute can be decoupled, as seen in Snowflake or Google BigQuery.
Another evolution in software architecture is the separation of application code and state. The strategy is to keep stateless application logic separate from state management (databases). This makes a system easier to reason about and easier to audit when issues occur.
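To make "stream processing on top of batch processing" concrete, here's a minimal sketch of the idea of cutting a stream into small batches and running an ordinary batch job on each one. This is a toy illustration, not Spark's API: the names micro_batches, batch_sum, and stream_over_batch are made up for this example.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List


def micro_batches(events: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Cut a (possibly unbounded) event stream into small bounded batches."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # the stream ended
        yield batch


def batch_sum(batch: List[int]) -> int:
    """An ordinary batch job: it only ever sees a complete, bounded input."""
    return sum(batch)


def stream_over_batch(
    events: Iterable[int], batch_size: int, job: Callable[[List[int]], int]
) -> Iterator[int]:
    """Run the batch job once per small batch, emitting results as they are ready."""
    for batch in micro_batches(events, batch_size):
        yield job(batch)


# Seven events, processed three at a time.
results = list(stream_over_batch(range(1, 8), batch_size=3, job=batch_sum))
print(results)
```

The batch job never needs to know it is part of a stream. The trade-off is latency: a result is only available once its batch has filled up, which is why batch size matters so much in this style of system.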
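Here's a small sketch of what that separation can look like in code. It assumes a hypothetical InMemoryStore standing in for a real database, and a made-up handle_deposit function as the application logic. The point is only that the handler keeps nothing between calls.

```python
from typing import Dict


class InMemoryStore:
    """Hypothetical stand-in for a database; all mutable state lives here."""

    def __init__(self) -> None:
        self._balances: Dict[str, int] = {}

    def get(self, account: str) -> int:
        return self._balances.get(account, 0)

    def put(self, account: str, balance: int) -> None:
        self._balances[account] = balance


def handle_deposit(store: InMemoryStore, account: str, amount: int) -> int:
    """Stateless application logic: it reads and writes only through the store."""
    if amount <= 0:
        raise ValueError("deposit amount must be positive")
    new_balance = store.get(account) + amount
    store.put(account, new_balance)
    return new_balance


store = InMemoryStore()
first = handle_deposit(store, "alice", 50)
second = handle_deposit(store, "alice", 25)
print(first, second)
```

Because the handler holds no state of its own, any copy of it can serve any request, and everything worth auditing sits in one place: the store.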
Theory and Practice
Some data systems techniques have been developed in research but are not yet widely adopted in practice.
One example is effectively-once semantics (often called exactly-once). If something goes wrong while processing a message, you can either give up or try again, and if the retry succeeds, the result looks as if the message was processed only once. This has proven very useful in database ACID transactions, but has not been widely adopted outside of databases.
Another example is coordination-avoiding data systems. These systems can maintain consistency without atomic commit, linearizability, or synchronous cross-partition coordination, so they can achieve better performance and fault tolerance than systems that need synchronous coordination. Conflict-free replicated data types (CRDTs) are one example. Anna KVS is a research key-value store that uses these ideas and has shown that coordination-free consistency can achieve high performance and elasticity via wait-free execution.
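One common way to get the "processed only once" effect described above is to make processing idempotent by remembering which message IDs have already been applied. Here's a minimal sketch under that assumption. The Account class and its apply method are hypothetical names for illustration, and a real system would need to persist the balance change and the message ID together atomically.

```python
from typing import Set


class Account:
    """Toy consumer that deduplicates messages by ID."""

    def __init__(self) -> None:
        self.balance = 0
        self.processed_ids: Set[str] = set()

    def apply(self, message_id: str, amount: int) -> bool:
        """Apply a message unless it was already applied; return True if applied."""
        if message_id in self.processed_ids:
            return False  # a retry of something we already handled: ignore it
        # Assumption: in a real system these two updates happen in one transaction.
        self.balance += amount
        self.processed_ids.add(message_id)
        return True


account = Account()
applied_first = account.apply("msg-1", 100)
applied_retry = account.apply("msg-1", 100)  # the sender retried after a timeout
applied_other = account.apply("msg-2", 20)
print(account.balance)
```

The sender is free to retry as often as it likes, since a duplicate delivery has no further effect on the balance.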
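To show how a CRDT stays consistent without coordination, here's a minimal sketch of a grow-only counter (G-Counter), one of the simplest CRDTs. The class and replica names are illustrative only, and this is not how Anna KVS is implemented.

```python
from typing import Dict


class GCounter:
    """Grow-only counter: each replica increments only its own slot."""

    def __init__(self, replica_id: str) -> None:
        self.replica_id = replica_id
        self.counts: Dict[str, int] = {}

    def increment(self, amount: int = 1) -> None:
        if amount < 1:
            raise ValueError("a grow-only counter cannot decrease")
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        """Take the per-replica maximum; safe to repeat and to apply in any order."""
        for replica_id, count in other.counts.items():
            self.counts[replica_id] = max(self.counts.get(replica_id, 0), count)


# Two replicas accept writes independently, with no locks or consensus.
replica_a = GCounter("a")
replica_b = GCounter("b")
replica_a.increment()
replica_a.increment()
replica_b.increment()

# Later they exchange state in any order, possibly more than once.
replica_a.merge(replica_b)
replica_b.merge(replica_a)
replica_a.merge(replica_b)
print(replica_a.value(), replica_b.value())
```

Because merge is commutative, associative, and idempotent, the replicas converge to the same value no matter how the state exchanges are ordered or duplicated.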
A final note is on the importance of user privacy in data systems. We should not retain a user's data forever, but remove it as soon as it is no longer needed. This is increasingly important as we become more and more connected online and are given the responsibility to be good stewards of a user’s personal data.
Thank you all for reading along as we went through Designing Data-Intensive Applications by Martin Kleppmann. I hope this has been interesting, and feel free to subscribe or share this article if you liked it!