# How groupBy works

***

In Apache Spark, `groupBy` is a lazy transformation. When an aggregation on the grouped data is executed, it triggers a **Shuffle**, which moves data across the cluster so that all records sharing the same key end up in the same partition, and therefore on the same executor.

To minimize network traffic, the aggregation is split into two phases, **Local Aggregation** and **Global Aggregation**, with the shuffle in between.

***

### The Lifecycle of a GroupBy

#### 1. Local Aggregation (Map Side)

Before sending data over the network, Spark performs a map-side combine (partial aggregation). Each executor scans its local partitions and computes a partial result for each key.

* **Driver:** Plans the execution and sends tasks containing the grouping logic to the executors.
* **Executors:** Process their own data locally. If you are doing a `count()`, each task counts occurrences within its own partition first.
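As a rough illustration of the map-side combine, the plain-Python sketch below (a simulation, not Spark code) counts keys inside each partition independently. The partition contents mirror the two executors in the diagram on this page, and `local_combine` is a hypothetical helper name introduced only for this example.

```python
from collections import Counter

# Two input partitions, each held by a different executor (illustrative data).
partitions = [
    ["A", "B", "A"],  # Executor 1
    ["B", "A", "C"],  # Executor 2
]

def local_combine(partition):
    """Count occurrences of each key within a single partition only."""
    return dict(Counter(partition))

# Each partition is reduced on its own; nothing crosses the network yet.
partial_results = [local_combine(p) for p in partitions]
print(partial_results)
```

Executor 1 now holds one record for key `A` instead of two, which is what reduces the volume of data sent through the shuffle.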

#### 2. The Shuffle (Exchange)

This is the most expensive part of the process, because records must be serialized, written to local disk, and transferred over the network.

* **Data Movement:** Spark hashes the grouping key to determine which shuffle partition (and thus which executor) each record belongs to.
* **Partitions:** The number of resulting partitions is controlled by `spark.sql.shuffle.partitions` (200 by default).
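The routing rule can be sketched in plain Python as "hash of the key, modulo the number of shuffle partitions". This is a simplified simulation under stated assumptions: `zlib.crc32` stands in for Spark's own hash function, `NUM_SHUFFLE_PARTITIONS` is set to 4 for readability, and `target_partition` is a hypothetical helper name.

```python
import zlib

# Stand-in for spark.sql.shuffle.partitions (kept small for the example).
NUM_SHUFFLE_PARTITIONS = 4

def target_partition(key, num_partitions=NUM_SHUFFLE_PARTITIONS):
    """Map a key to a shuffle partition: hash(key) mod num_partitions."""
    # Assumption: crc32 is only a deterministic stand-in for Spark's hash.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Partial results produced by the map side on two executors (illustrative data).
partial_results = [{"A": 2, "B": 1}, {"B": 1, "A": 1, "C": 1}]

# Route every (key, partial_count) record to its target partition.
shuffled = {}
for partial in partial_results:
    for key, count in partial.items():
        shuffled.setdefault(target_partition(key), []).append((key, count))

for partition_id, records in sorted(shuffled.items()):
    print(partition_id, records)
```

Because the partition depends only on the key, every partial result for `A` lands in the same partition regardless of which executor produced it. Several different keys may also share one partition.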

#### 3. Global Aggregation (Reduce Side)

Once all data for a specific key is gathered in a single partition, Spark performs the final calculation.

* **Executors:** Merge the partial results received from the map side to produce the final value for each key.
* **Result:** The final dataset is distributed across the cluster in new partitions.
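A minimal plain-Python sketch of the reduce-side merge for a `count()`, assuming the partial counts from the diagram on this page have already arrived at their target partitions. The `received` structure is illustrative and is not a Spark data structure.

```python
# Partial counts that arrived on the reduce side, grouped by key (illustrative data).
received = {
    "A": [2, 1],  # from Executor 1 and Executor 2
    "B": [1, 1],
    "C": [1],
}

# For a count, the final value is simply the sum of the partial counts.
final = {key: sum(partials) for key, partials in received.items()}
print(final)  # {'A': 3, 'B': 2, 'C': 1}
```

Summing works here because a count can be decomposed into partial counts. An aggregate such as an average cannot be merged by averaging the partial averages, so it has to carry a partial sum and a partial count instead.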

***

### Visualizing GroupBy with Mermaid

{% @mermaid/diagram content="graph TD
subgraph "Phase 1: Local Partitions (Executors)"
E1[Executor 1: A:1, B:1, A:1]
E2[Executor 2: B:1, A:1, C:1]
end

subgraph "Local Map-Side Combine"
E1 --> L1[Local Result: A:2, B:1]
E2 --> L2[Local Result: B:1, A:1, C:1]
end

subgraph "SHUFFLE (Network Transfer)"
L1 -- Key A --> R1
L1 -- Key B --> R2
L2 -- Key A --> R1
L2 -- Key B --> R2
L2 -- Key C --> R3
end

subgraph "Phase 2: Final Aggregation (Reducers)"
R1[Executor X: A -> 2+1 = 3]
R2[Executor Y: B -> 1+1 = 2]
R3[Executor Z: C -> 1]
end" %}

***
