So You Want to Up Your Software Engineering Game?

Building Reliable, Scalable, and Maintainable Applications

May 23, 2021

Welcome to the 10% Smarter newsletter. I’m Brian, a software engineer writing about tech and growing skills as an engineer.

If you’re reading this but haven’t subscribed, consider joining!

Are you just getting started as an engineer and already feeling overwhelmed? Fear not! Every week, this newsletter aims to give a quick 10-minute summary of concepts to help you become a better software engineer.

I’ve decided to start off with System Design: how do you design a large software application like Instagram, YouTube, or Netflix? You might know how to program in a language like Python or Java and have some data structures and algorithms knowledge, but System Design isn’t taught much in college, and it is an important skill for being an effective engineer. So let’s get started!

This week I’ll be going through chapter 1 of Designing Data-Intensive Applications: Reliable, Scalable, and Maintainable Applications.

What are Reliability, Scalability, and Maintainability?

Reliability is the ability of a system to keep working correctly while tolerating mistakes and surprises. A reliable system performs well even under load. The goal is to prevent faults from turning into failures.

Scalability is the ability to cope with increased load in terms of data, traffic, or complexity.

Maintainability is how easily different people can keep working on the software over time. This involves making a system operable, simple, and evolvable. Anticipating future problems and designing good abstractions like APIs improve maintainability.

We can see these would be important in designing a service like Instagram, YouTube, or Netflix. Now, how would we design these systems?

The Basics of Designing Systems

Before designing a system, you must gather requirements to ensure the system is successful.

Functional requirements are what the application should do, such as allowing data to be stored, retrieved, searched, and processed. For example, Instagram lets users upload photos, follow other users, and like their photos.

Nonfunctional requirements are general properties like reliability, scalability, maintainability, and security.

Once you have gathered your requirements, you can design out a basic architecture. Software engineering used to be hard, but today you can build systems with Lego-like building blocks using cloud services such as AWS! Some basic building blocks are:

Databases: to store data
Caches: to speed up reads
Search Indices: to speed up searches
Stream Processing: to send messages between processes asynchronously
Batch Processing: to periodically crunch data
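
To make one of these building blocks concrete, here is a minimal sketch of how a cache can speed up reads. It assumes a plain dict standing in for a slow database; the CachedStore class and its db_reads counter are hypothetical names made up for illustration, not any particular cloud service's API.

```python
class CachedStore:
    """Check the cache first and only fall back to the database on a miss."""

    def __init__(self, database):
        self.database = database  # stand-in for a slow database (here just a dict)
        self.cache = {}           # stand-in for a fast in-memory cache
        self.db_reads = 0         # counts how often we had to hit the database

    def get(self, key):
        if key in self.cache:
            return self.cache[key]  # cache hit: no database work
        self.db_reads += 1
        value = self.database[key]
        self.cache[key] = value     # remember the value for the next read
        return value

    def put(self, key, value):
        self.database[key] = value
        self.cache.pop(key, None)   # invalidate so the next read is fresh


store = CachedStore({"user:1": "alice"})
store.get("user:1")
store.get("user:1")
print("database reads:", store.db_reads)
```

Two reads only cost one trip to the database. The price is that you now have to keep the cache and the database in sync, which is why put drops the cached copy.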

Let’s go more in-depth. How can we make the system fault-tolerant, scalable, and performant?

Fault Tolerance, Scalability, and Performance

What are common kinds of faults in systems?

Hardware faults: When hardware and entire machines fail. Hard disks have a mean time to failure (MTTF) of 10 to 50 years. On a storage cluster with 10,000 disks, we should expect on average one disk to die per day.
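
Where does "one disk per day" come from? Here is a quick back-of-the-envelope sketch. It assumes disks fail independently and at a constant rate, which is a simplification; the function name is made up for illustration.

```python
DAYS_PER_YEAR = 365


def expected_failures_per_day(num_disks: int, mttf_years: float) -> float:
    """Expected disk failures per day, assuming independent disks
    that each fail on average once every mttf_years."""
    return num_disks / (mttf_years * DAYS_PER_YEAR)


for mttf_years in (10, 50):
    rate = expected_failures_per_day(10_000, mttf_years)
    print(f"MTTF of {mttf_years} years: about {rate:.2f} failures per day")
```

With 10,000 disks this works out to roughly 0.5 to 2.7 failures per day, so "about one disk per day" is the right order of magnitude.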

Software Errors: Bugs and systematic errors, such as runaway processes or cascading failures.

Human Errors: Human mistakes are the leading cause of errors, and configuration errors in particular are the leading cause of outages.

Here is a good list of post-mortems you can check out to see fault and failure cases in action!

Now onto scalability. We can look at an example from Twitter.

Case Study: Scalability at Twitter

Twitter's main operations are:

Posting a tweet: 4.6k requests/sec on average, 12k requests/sec at peak
Reading a home timeline: 300k requests/sec

For Twitter, simply handling 12,000 writes per second (the peak rate for posting tweets) is easy. The hard part is fan-out: each user follows many people, and each user is followed by many people.

Two ways to implement this are:

Approach 1: Posting a tweet inserts it into a global database table. When a user requests their home timeline, look up the people they follow, find the tweets by those users with a SQL JOIN, and sort them by time.
Approach 2: Keep a cache of each user's home timeline. When a user posts a tweet, look up the people who follow that user and insert the new tweet into each of their home timeline caches.
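
The two approaches can be sketched in a few lines of Python. This is a toy in-memory model, not Twitter's actual implementation: the class names FanOutOnRead and FanOutOnWrite are made up, a list stands in for the global tweets table, and a dict of lists stands in for the per-user caches.

```python
from collections import defaultdict
from itertools import count


class FanOutOnRead:
    """Approach 1: one global list of tweets, merged at read time."""

    def __init__(self):
        self.follows = defaultdict(set)  # user -> users they follow
        self.tweets = []                 # global "table" of (seq, author, text)
        self._seq = count()              # increasing number used as a timestamp

    def follow(self, follower, followee):
        self.follows[follower].add(followee)

    def post(self, author, text):
        self.tweets.append((next(self._seq), author, text))  # one cheap write

    def home_timeline(self, user):
        # The expensive part: scan and sort on every read (the "JOIN").
        followed = self.follows[user]
        matching = [t for t in self.tweets if t[1] in followed]
        return [text for _, _, text in sorted(matching, reverse=True)]


class FanOutOnWrite:
    """Approach 2: a precomputed home timeline per user, filled at write time."""

    def __init__(self):
        self.followers = defaultdict(set)   # user -> users who follow them
        self.timelines = defaultdict(list)  # user -> cached timeline, newest first

    def follow(self, follower, followee):
        self.followers[followee].add(follower)

    def post(self, author, text):
        # The expensive part: one cache write per follower.
        for follower in self.followers[author]:
            self.timelines[follower].insert(0, text)

    def home_timeline(self, user):
        return list(self.timelines[user])  # one cheap read


for store in (FanOutOnRead(), FanOutOnWrite()):
    store.follow("alice", "bob")
    store.follow("alice", "carol")
    store.post("bob", "hello")
    store.post("carol", "world")
    print(type(store).__name__, store.home_timeline("alice"))
```

Both versions return the same timeline for alice. The only difference is where the work happens: in home_timeline for the first, in post for the second.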

Looking at the tradeoffs:

Approach 1 is simpler and better for users with many followers, since posting a tweet is a single write. However, the home timeline join query puts a lot of load on the database at read time.
Approach 2 makes reads cheap. The average rate of published tweets is almost two orders of magnitude lower than the rate of home timeline reads, so it is preferable to do more work at write time and less work at read time. However, if a user has 30 million followers, a single tweet requires 30 million home timeline cache writes, which is also a lot of load.
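
Here is the arithmetic behind that tradeoff as a rough sketch. The request rates come from the numbers above. AVG_FOLLOWERS is an assumption that is not stated in this post: 75 is the average the book chapter quotes, so treat it as illustrative.

```python
TWEETS_PER_SEC = 4_600             # average post rate
TIMELINE_READS_PER_SEC = 300_000   # home timeline read rate
AVG_FOLLOWERS = 75                 # assumption: average followers per tweet author
CELEBRITY_FOLLOWERS = 30_000_000   # the extreme case mentioned above

# How many timeline reads happen for every tweet posted?
read_write_ratio = TIMELINE_READS_PER_SEC / TWEETS_PER_SEC

# Approach 2 turns every tweet into one cache write per follower.
cache_writes_per_sec = TWEETS_PER_SEC * AVG_FOLLOWERS

print(f"reads per tweet posted: about {read_write_ratio:.0f}")
print(f"cache writes per second: {cache_writes_per_sec:,}")
print(f"cache writes for one celebrity tweet: {CELEBRITY_FOLLOWERS:,}")
```

Under these assumptions, fanning out on write costs about 345,000 cache writes per second on average, which is in the same ballpark as the read rate. The celebrity case is the outlier that breaks it.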

Twitter uses a hybrid of both approaches. Most users' tweets are fanned out to home timeline caches at write time. Tweets from celebrities with a large number of followers are instead fetched separately and merged into the timeline at read time.
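
A hybrid could look something like the toy sketch below. The class name HybridTimelines and the follower-count threshold are assumptions for illustration; how Twitter actually decides who counts as a celebrity is not covered here.

```python
from collections import defaultdict
from itertools import count

CELEBRITY_THRESHOLD = 2  # hypothetical cutoff, kept tiny so the demo triggers it


class HybridTimelines:
    def __init__(self, celebrity_threshold=CELEBRITY_THRESHOLD):
        self.threshold = celebrity_threshold
        self.followers = defaultdict(set)         # user -> users who follow them
        self.follows = defaultdict(set)           # user -> users they follow
        self.cached = defaultdict(list)           # user -> [(seq, text)] written at post time
        self.celebrity_tweets = defaultdict(list) # author -> [(seq, text)] read at request time
        self._seq = count()                       # increasing number used as a timestamp

    def follow(self, follower, followee):
        self.followers[followee].add(follower)
        self.follows[follower].add(followee)

    def post(self, author, text):
        entry = (next(self._seq), text)
        if len(self.followers[author]) >= self.threshold:
            # Celebrity: store once, skip the expensive fan-out.
            self.celebrity_tweets[author].append(entry)
        else:
            # Regular user: fan out to each follower's cache.
            for follower in self.followers[author]:
                self.cached[follower].append(entry)

    def home_timeline(self, user):
        entries = list(self.cached[user])
        # Merge in tweets from any celebrities this user follows.
        for followee in self.follows[user]:
            entries.extend(self.celebrity_tweets.get(followee, []))
        return [text for _, text in sorted(entries, reverse=True)]


store = HybridTimelines()
store.follow("alice", "bob")   # bob has 1 follower: regular user
store.follow("alice", "star")
store.follow("dave", "star")   # star has 2 followers: "celebrity" in this demo
store.post("bob", "lunch")
store.post("star", "new album")
print(store.home_timeline("alice"))
print(store.home_timeline("dave"))
```

Posting as star is a single write no matter how many followers there are, and alice still sees both tweets in time order.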

Twitter dedicated whole server racks to Justin Bieber, who accounted for 3% of their traffic.

Performance

Performance is another important consideration when thinking about a system. How do you describe performance?

In a batch processing system like Spark, we care about throughput: the number of records processed per second.

In online systems, we care about the service’s response time: the time between a client sending a request and receiving a response.

An important distinction between response time and latency is that response time is what the client sees: the time to process the request plus network and queueing delays. Latency is the duration that a request is waiting to be handled.

Additionally, we should use percentiles to describe response time. Since web services receive many requests, we want the vast majority of them to be fast. An average doesn't tell us that, because it is skewed by outliers.

p99 and p999 are the thresholds that 99% or 99.9% of requests are handled faster than. For example, a p99 response time of 1s means 99 out of 100 requests take less than 1s.

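Amazon describes response time requirements for its internal services in terms of the 99.9th percentile. Percentiles also show up in service level agreements (SLAs), the contracts that define expected performance. For example, an SLA might promise a median (p50) response time under 200ms and a p99 under 1s, and a customer can demand a refund if this is not met.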
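
To see why percentiles say more than an average, here is a small sketch. The response times are made-up sample data, the percentile helper uses the simple nearest-rank method (one of several common definitions), and the SLA limits are the example numbers from above.

```python
import math


def percentile(sorted_values, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the values are less than or equal to it."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def meets_sla(times_ms, p50_limit_ms=200, p99_limit_ms=1000):
    """Check the example SLA: p50 under 200ms and p99 under 1s."""
    ordered = sorted(times_ms)
    return (percentile(ordered, 50) < p50_limit_ms
            and percentile(ordered, 99) < p99_limit_ms)


# Hypothetical sample: 98 fast requests and 2 slow outliers (milliseconds).
response_times_ms = sorted([50] * 98 + [900, 4000])

mean_ms = sum(response_times_ms) / len(response_times_ms)
p50 = percentile(response_times_ms, 50)
p99 = percentile(response_times_ms, 99)

print(f"mean: {mean_ms:.0f}ms, p50: {p50}ms, p99: {p99}ms")
print("meets SLA:", meets_sla(response_times_ms))
```

In this sample the mean is 98ms, almost double what a typical request takes (50ms), while telling you nothing about the request that took 4 seconds. The p50 and p99 describe both the typical case and the tail.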

Ways to Scale

Finally, we want our systems to scale. There are three ways to scale with load:

Vertical Scaling: using a more powerful machine
Horizontal Scaling: distributing the load across multiple smaller machines
Elastic Systems: automatically adding computing resources when load increases
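
As a rough idea of what an elastic system decides, here is a minimal sketch of the sizing rule. It assumes load spreads evenly and every machine handles the same number of requests per second; the function name and the 1,000 requests/sec capacity are made up for illustration, and real autoscalers are more involved.

```python
import math


def instances_needed(load_rps: float,
                     capacity_per_instance_rps: float,
                     min_instances: int = 1) -> int:
    """How many equally sized machines are needed to serve load_rps."""
    return max(min_instances, math.ceil(load_rps / capacity_per_instance_rps))


# Hypothetical capacity of 1,000 requests/sec per machine.
for load in (800, 4_600, 12_000):
    print(f"{load} req/sec -> {instances_needed(load, 1_000)} machines")
```

Horizontal scaling means picking that number by hand. An elastic system reruns the calculation as load changes and adds or removes machines for you.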

Conclusion

This week, we’ve covered an overview of designing Reliable, Scalable, and Maintainable Applications! It might be a lot to digest, so feel free to go back over it and review. Always keep learning and you’ll for sure grow as an engineer.

Next week we’ll go over Data Models and Query Languages. Ever wonder what NoSQL is?
