How groupBy works


In Apache Spark, a groupBy is a wide transformation: it triggers a shuffle, moving data across the cluster so that all records sharing the same key end up in the same partition on the same executor.

The process happens in two distinct phases to minimize network traffic: Local Aggregation and Global Aggregation.
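To make the phases below concrete, here is a minimal sketch of the kind of query this section traces, run in a local spark-shell session; the toy data and column names are invented for illustration.

```scala
import org.apache.spark.sql.SparkSession

// A tiny local session and toy DataFrame, just to have a concrete
// groupBy to follow through the phases described below.
val spark = SparkSession.builder()
  .appName("groupBy-walkthrough")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("a", 2), ("b", 3), ("c", 4)).toDF("key", "value")

// groupBy plus an aggregation: the pattern that triggers the shuffle.
val counts = df.groupBy("key").count()
counts.show()
```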


The Lifecycle of a GroupBy

1. Local Aggregation (Map Side)

Before sending data over the network, Spark performs "Map-side combines." Each executor looks at its local partitions and calculates a partial result for each key.

  • Driver: Plans the execution and ships the grouping logic to the executors.

  • Executors: Process their own data locally. If you are doing a count(), each executor first counts occurrences within its own partitions (see the plan sketch after this list).
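One way to observe the map-side combine is to print the physical plan: for a groupBy/count, Spark places a HashAggregate computing partial_count before the Exchange (the shuffle) and a second HashAggregate after it. A sketch, continuing the session above:

```scala
// Continuing the session above. The physical plan shows two
// HashAggregate operators: the first computes partial_count locally on
// each executor's partitions before any data crosses the network; the
// Exchange between them is the shuffle.
df.groupBy("key").count().explain()
```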

2. The Shuffle (Exchange)

This is the most expensive part of the process: rows are serialized, written to shuffle files on local disk, and then pulled across the network by the executors that own each key's partition.

  • Data Movement: Spark hashes the grouping key to determine which shuffle partition (and therefore which executor) each record belongs to.

  • Partitions: The number of resulting partitions is controlled by spark.sql.shuffle.partitions (200 by default); the sketch after this list shows how keys map to them.
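The routing can be mimicked by hand: Spark's built-in hash() column function uses the same Murmur3 hash as the shuffle's hash partitioning, so pmod(hash(key), numPartitions) shows which shuffle partition each key lands in. A sketch, continuing the session above; the shuffle_partition alias is ours:

```scala
import org.apache.spark.sql.functions.{hash, lit, pmod}

// Continuing the session above. Route each key the way hash
// partitioning does: Murmur3-hash the key, then take it modulo the
// number of shuffle partitions.
spark.conf.set("spark.sql.shuffle.partitions", "8")
df.select($"key", pmod(hash($"key"), lit(8)).as("shuffle_partition"))
  .distinct()
  .show()
```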

3. Global Aggregation (Reduce Side)

Once all data for a specific key is gathered on a single executor, Spark performs the final calculation.

  • Executors: Merge the partial results received from the shuffle to produce the final value for each key.

  • Result: The final dataset is distributed across the cluster in new partitions (verified in the sketch after this list).
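You can check the resulting layout by counting the partitions of the grouped output. A sketch, continuing the session above; adaptive query execution is disabled here because AQE may coalesce shuffle partitions and mask the configured number:

```scala
// Continuing the session above. With AQE off, the grouped result lands
// in exactly spark.sql.shuffle.partitions partitions.
spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.conf.set("spark.sql.shuffle.partitions", "8")

val grouped = df.groupBy("key").count()
println(grouped.rdd.getNumPartitions)  // 8
```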


Visualizing GroupBy with Mermaid

(Mermaid diagram of the groupBy lifecycle; not rendered here.)

