Batch Processing - How do MapReduce and Spark work?
Chapter 10 of Designing Data Intensive Applications: Batch Processing
Welcome to the 10% Smarter newsletter. I’m Brian, a software engineer writing about tech and growing skills as an engineer.
If you’re reading this but haven’t subscribed, consider joining!
Have you ever wondered how Google creates search indexes? Google uses web crawlers that search for links on websites and stores them in a distributed file system. This data can be processed using batch processing to create search indexes and rankings for website. Let’s look at at batch processing and see how you can use it to process big amounts of data.
Batch Processing vs Stream Processing
There are two common data processing jobs. Batch processing is an offline (not real-time) process takes a large amount of input data, runs a job to process it, and produces some output. Stream processing is a process that consumes input, process it, and produces outputs in near-real-time.
Google can use batch processing to create search indexes because website content does not change frequently. An app like TikTok might chose to use stream processing, and giving hourly short vides recommendations in its recommendation algorithm.
MapReduce and Distributed File Systems
To process and store large amounts of data, Google created MapReduce as a batch processing framework and the Google File System (GFS) for storing data in a distributed file system.
MapReduce is a distributed programming framework with which you can write code to process large datasets in a distributed filesystem such as GFS, HDFS, or S3. It was created by Google and uses the Google File System (GFS) to build search indexes. Open source variants include Apache Hadoop or dataflow frameworks like Apache Spark.
HDFS, also known as the Hadoop Distributed File System, is an open source shared-nothing distributed file system. File blocks are replicated to multiple machines.
How Does MapReduce Work?
MapReduce does batch processing with the following steps:
Read a set of input files, and break it up into records.
Call the mapper function to extract a key and value from each input record.
Perform a Shuffle, a step which sorts all of the key-value pairs by key and copies data partitions from mappers to reducers. The data is partitioned by reducer to merge on.
Call the reducer function to iterate over the sorted key-value pairs.
MapReduce has an incredibly simple programming model, only requiring a user to provide two functions:
Mapper Function: Called once for every input record, and its job is to extract the key and value from the input record.
Reducer Function: Takes the key-value pairs produced by the mappers, collects all the values belonging to the same key, and calls the reducer with an iterator over that collection of values.
With only a Mapper and Reducer Function, any user can processing huge amounts of data with no distributed systems knowledge. MapReduce can parallelize this computation across many machines, without you having to write code to explicitly handle the parallelism.
MapReduce is also fault tolerant, if a machine crashes, the computation can be recomputed.
The Evolution of MapReduce
While MapReduce is extremely powerful, it still had glaring limitation leading to future batch processing systems based on it.
MapReduce has poor performance as it writes everything to disk in the distributed filesystem. This process of writing to disk is called materialization. MapReduce jobs are usually chained together which do not require data to be materialized.
Dataflow engines such as Apache Spark solve this problem. Spark can handle an entire workflow as one job. It keeps data in memory and uses Resilient Distributed Datasets (RDD), an abstraction that instead knows the full dataflow and can recompute data from saved points for fault tolerance.
Additionally, while the MapReduce programming model was simple, it is inflexible to the workflows that use it. One example is the chaining of multiple MapReduce jobs that require several Mapper and Reducer Function. To solve this problem, several higher-level programming models have been developed to come on top of MapReduce. They use SQL or similar declarative languages to model the dataflow. Some examples SparkSQL, Presto, Hive, and Pig.
More information about the evolution of MapReduce can be found here.
This article explored batch processing and how Google creates search indexes using MapReduce and the Google File System. Batch processing large amounts of data has become widely adapted by many companies and will continue to do so as companies collect and process data to perform machine learning on. You can also use these tools such as Spark and SparkSQL and perform your own processing on big data. Feel free to subscribe as we look at stream processing next week and share this article if you liked it!