How groupBy works


In Apache Spark, a groupBy is a wide transformation: it triggers a shuffle, moving data across the cluster so that all records sharing the same key end up in the same partition on the same executor.

The process happens in two distinct phases to minimize network traffic: Local Aggregation and Global Aggregation.
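To make the phases below concrete, here is a minimal sketch of the kind of query this section traces, run in a local spark-shell session; the toy data and column names are invented for illustration.

```scala
import org.apache.spark.sql.SparkSession

// A tiny local session and toy DataFrame, just to have a concrete
// groupBy to follow through the phases described below.
val spark = SparkSession.builder()
  .appName("groupBy-walkthrough")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("a", 2), ("b", 3), ("c", 4)).toDF("key", "value")

// groupBy plus an aggregation: the pattern that triggers the shuffle.
val counts = df.groupBy("key").count()
counts.show()
```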


The Lifecycle of a GroupBy

1. Local Aggregation (Map Side)

Before sending data over the network, Spark performs "Map-side combines." Each executor looks at its local partitions and calculates a partial result for each key.

  • Driver: Plans the execution and ships the grouping logic to the executors.

  • Executors: Process their own data locally. If you are doing a count(), each executor first counts occurrences within its own partitions (see the plan sketch after this list).
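One way to observe the map-side combine is to print the physical plan: for a groupBy/count, Spark places a HashAggregate computing partial_count before the Exchange (the shuffle) and a second HashAggregate after it. A sketch, continuing the session above:

```scala
// Continuing the session above. The physical plan shows two
// HashAggregate operators: the first computes partial_count locally on
// each executor's partitions before any data crosses the network; the
// Exchange between them is the shuffle.
df.groupBy("key").count().explain()
```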

2. The Shuffle (Exchange)

This is the most expensive part of the process: rows are serialized, written to shuffle files on local disk, and then pulled across the network by the executors that own each key's partition.

  • Data Movement: Spark hashes the grouping key to determine which shuffle partition (and therefore which executor) each record belongs to.

  • Partitions: The number of resulting partitions is controlled by spark.sql.shuffle.partitions (200 by default); the sketch after this list shows how keys map to them.
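The routing can be mimicked by hand: Spark's built-in hash() column function uses the same Murmur3 hash as the shuffle's hash partitioning, so pmod(hash(key), numPartitions) shows which shuffle partition each key lands in. A sketch, continuing the session above; the shuffle_partition alias is ours:

```scala
import org.apache.spark.sql.functions.{hash, lit, pmod}

// Continuing the session above. Route each key the way hash
// partitioning does: Murmur3-hash the key, then take it modulo the
// number of shuffle partitions.
spark.conf.set("spark.sql.shuffle.partitions", "8")
df.select($"key", pmod(hash($"key"), lit(8)).as("shuffle_partition"))
  .distinct()
  .show()
```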

3. Global Aggregation (Reduce Side)

Once all data for a specific key is gathered on a single executor, Spark performs the final calculation.

  • Executors: Merge the partial results received from the shuffle to produce the final value for each key.

  • Result: The final dataset is distributed across the cluster in new partitions (verified in the sketch after this list).
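You can check the resulting layout by counting the partitions of the grouped output. A sketch, continuing the session above; adaptive query execution is disabled here because AQE may coalesce shuffle partitions and mask the configured number:

```scala
// Continuing the session above. With AQE off, the grouped result lands
// in exactly spark.sql.shuffle.partitions partitions.
spark.conf.set("spark.sql.adaptive.enabled", "false")
spark.conf.set("spark.sql.shuffle.partitions", "8")

val grouped = df.groupBy("key").count()
println(grouped.rdd.getNumPartitions)  // 8
```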


Visualizing GroupBy with Mermaid

(Mermaid diagram of the groupBy lifecycle; not rendered here.)

