# Internals of BigQuery

***

More information about the internals of BigQuery is available [here](https://cloud.google.com/blog/products/data-analytics/new-blog-series-bigquery-explained-overview) and [here](https://cloud.google.com/blog/products/bigquery/bigquery-under-the-hood).

***

To understand the internals of BigQuery, you have to look at it not as a single database, but as an **orchestrated collection of massive-scale Google technologies**.

BigQuery’s "secret sauce" is its **disaggregated architecture**, meaning storage and compute are completely separate, connected by a hyper-fast network.

### General Architecture

{% @mermaid/diagram content="graph TD
%% Client Layer
subgraph Client["1. Client Interface"]
    UI[Console/CLI/SDKs]
end

%% Control Plane
subgraph Control["2. Control Plane (The Brain)"]
    API[BigQuery API]
    Meta[Metadata Store: Partition info]
    Planner[Query Optimizer]
end

%% Orchestration
subgraph Borg["3. Borg (The Manager)"]
    Borg_S[Resource Scheduler]
end

%% Compute & Storage
subgraph Compute["4. Dremel & Colossus (The Muscle & Memory)"]
    Root[Dremel Root Server]
    Mixer[Dremel Mixers]
    Leaf[Dremel Slots]

    subgraph Jupiter["Jupiter Network Fabric"]
        Network((Petabit Network))
    end

    Storage[(Colossus: Columnar Data)]
end

%% Interaction Flow
UI --> API
API --> Meta
API --> Planner
Planner --> Borg_S
Borg_S -->|Allocates Slots| Root
Root --> Mixer
Mixer --> Leaf
Leaf <--> Network
Network <--> Storage

%% Style
style Network fill:#f9f,stroke:#333,stroke-width:2px
style Borg_S fill:#fff4dd,stroke:#d4a017" %}

#### Dremel: The Execution Engine (Compute)

Dremel is the heart of BigQuery. It is a distributed query execution engine that turns your SQL into an **execution tree**.

* **The Tree Structure**: When you run a query, a "root server" receives it and breaks it down into hundreds or thousands of small tasks. These tasks are passed to intermediate "mixers" and finally to "leaf nodes" (slots).
* **Slots**: In BigQuery terminology, a "slot" is a unit of computational capacity (CPU and RAM).
* **Multitenancy**: Dremel allows thousands of users to share the same massive cluster of machines, dynamically allocating slots based on query complexity.

{% @mermaid/diagram content="graph TD
%% Input
Query([SQL Query / DML Command]) --> Root

subgraph "Compute Layer (Dremel)"
    Root[Root Server: Plan & Orchestrate]

    Root --> M1[Mixer 1]
    Root --> M2[Mixer 2]

    M1 --> L1[Leaf Node / Slot]
    M1 --> L2[Leaf Node / Slot]
    M2 --> L3[Leaf Node / Slot]
    M2 --> L4[Leaf Node / Slot]
end

%% Storage Layer
subgraph "Storage Layer (Colossus)"
    Disk[(Colossus: Permanent Storage)]
end

%% Bi-directional Logic
L1 <== "Read Shards / Write Results" ==> Disk
L2 <== "Read Shards / Write Results" ==> Disk
L3 <== "Read Shards / Write Results" ==> Disk
L4 <== "Read Shards / Write Results" ==> Disk

%% Styling
style Disk fill:#f9f,stroke:#333,stroke-width:2px
style Query fill:#dfd,stroke:#333" %}

*Dremel uses a multi-level tree structure to break a single SQL query into thousands of tiny parallel tasks.*
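The fan-out/fan-in idea can be sketched in a few lines. This is a toy model of the execution tree for a `SELECT SUM(x)` query, not BigQuery internals: the function names (`leaf_scan`, `mixer`, `root`) and the 2-mixer, 4-leaf shape are illustrative assumptions.

```python
# Toy model of Dremel's execution tree: the root shards the table,
# leaves compute partial aggregates, mixers combine them, and the
# root returns the final answer.

def leaf_scan(shard):
    """A leaf slot scans one shard and returns a partial sum."""
    return sum(shard)

def mixer(shards):
    """A mixer aggregates the partial results of the leaves beneath it."""
    return sum(leaf_scan(s) for s in shards)

def root(table, num_mixers=2):
    """The root splits the table into shards, assigns shard groups to
    mixers, and combines the mixers' partial aggregates."""
    num_leaves = num_mixers * 2
    shards = [table[i::num_leaves] for i in range(num_leaves)]
    per_mixer = [shards[i::num_mixers] for i in range(num_mixers)]
    return sum(mixer(group) for group in per_mixer)

print(root(list(range(1, 101))))  # 5050, same as a single-node SUM(1..100)
```

The key property this models: every shard is touched by exactly one leaf, so the work parallelizes without coordination until the final merge.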

#### Colossus: The Storage Layer

BigQuery stores data in **Colossus**, Google’s latest-generation distributed file system (the successor to GFS).

* **Durability**: Data is automatically replicated and sharded across many physical disks.
* **Columnar Format (Capacitor)**: Data isn't stored as "rows" in Colossus; it's stored in a highly compressed columnar format called **Capacitor**. This allows BigQuery to scan only the columns you mention in your `SELECT` or `WHERE` clauses.
* **Separation**: Because storage is on Colossus and compute is in Dremel, you can scale your data to petabytes without needing to add more "servers" manually.

{% @mermaid/diagram content="graph LR
subgraph "Colossus File System"
    direction TB
    subgraph "File A (Sharded)"
        C1[Column 1 Chunk]
        C2[Column 2 Chunk]
        C3[Column 3 Chunk]
    end
    subgraph "Replication"
        R1[Replica 1 - Zone A]
        R2[Replica 2 - Zone B]
    end
end

Capacitor[Capacitor Format] --> C1
Capacitor --> C2
C1 -.-> R1
C1 -.-> R2" %}

*Colossus handles the physical sharding and replication of data, using the Capacitor columnar format.*
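A minimal sketch of why the columnar layout matters: a query that references one column reads only that column's chunks. The dict-of-lists layout and the "cells read" accounting are simplifications for illustration, not the Capacitor format itself.

```python
# Columnar storage sketch: each column lives in its own chunk, so a
# query touching 1 of 3 columns reads 1/3 of the data.

table = {
    "user_id": list(range(1000)),
    "country": ["DE"] * 1000,
    "revenue": [1.0] * 1000,
}

def columnar_scan(table, columns):
    """Read only the chunks for the referenced columns;
    report how many cells were touched."""
    touched = {c: table[c] for c in columns}
    cells_read = sum(len(v) for v in touched.values())
    return touched, cells_read

# SELECT SUM(revenue) FROM t  -> touches 1 of 3 columns
cols, cells = columnar_scan(table, ["revenue"])
print(cells)                   # 1000 cells, vs 3000 for a full row scan
print(sum(cols["revenue"]))    # 1000.0
```

This is also why BigQuery bills by bytes scanned: `SELECT *` touches every column chunk, while a narrow projection does not.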

#### Jupiter: The Network

Since compute (Dremel) and storage (Colossus) are physically separate, the data has to travel from disk to CPU extremely fast.

* **The Backbone**: Jupiter is Google's custom-built petabit-scale data center network.
* **Speed**: It provides enough bandwidth for Dremel to read data from Colossus at **terabytes per second**. This is why BigQuery doesn't need traditional "indexes"—it can simply brute-force scan massive amounts of data in seconds.

{% @mermaid/diagram content="graph TD
subgraph Dremel["Dremel (Compute)"]
    S[Slots / CPUs]
end

subgraph Jupiter["Jupiter Network"]
    J1((1 Pbps Switch Fabric))
end

subgraph Colossus["Colossus (Storage)"]
    D[Disks / Storage]
end

S <--> J1
J1 <--> D" %}

*Jupiter acts as the high-speed "highway" that allows the compute nodes to read data from storage as if it were on a local disk.*
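Back-of-the-envelope arithmetic makes the "brute-force scan" claim concrete. The numbers below are illustrative round figures, not published BigQuery specs:

```python
# If the network is the only bottleneck, scan time is just
# bytes-to-move divided by fabric bandwidth.

def scan_seconds(table_bytes, fabric_bits_per_sec):
    """Lower bound on full-scan time over the network fabric."""
    return table_bytes * 8 / fabric_bits_per_sec

one_pb_table = 10**15     # a 1 PB table (illustrative)
jupiter_fabric = 10**15   # 1 Pbps bisection bandwidth (illustrative)
print(scan_seconds(one_pb_table, jupiter_fabric))  # 8.0 seconds
```

With columnar pruning and compression the bytes actually moved are far smaller, which is how interactive latencies stay in the low seconds even without indexes.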

#### Borg: The Cluster Manager

Borg is Google's precursor to Kubernetes: the large-scale cluster management system that coordinates all of these resources.

* **Orchestration**: Borg handles the "where" and "how" of running Dremel. It assigns the physical hardware, manages health checks, and ensures that if a machine fails, the query task is instantly moved to a healthy one without the user noticing.

{% @mermaid/diagram content="graph TD
Borg{Borg Master}

subgraph "Physical Datacenter"
    Node1[Machine 1]
    Node2[Machine 2]
    Node3[Machine 3]
end

Borg -->|Schedules Task| Node1
Borg -->|Schedules Task| Node2
Node3 -.->|Health Heartbeat| Borg

subgraph "Workloads"
    Node1 --- D1[Dremel Slot]
    Node2 --- C1[Colossus Process]
end" %}

*Borg is the management layer that schedules tasks and ensures the system remains healthy.*
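The failure-handling behavior can be sketched as a tiny scheduler: tasks are placed on machines, and when a health check fails, the failed machine's tasks are transparently moved to healthy ones. The round-robin placement and task names are illustrative, not Borg's actual scheduling policy.

```python
# Toy cluster scheduler: place tasks, then reschedule on machine failure.

def schedule(tasks, machines):
    """Round-robin tasks onto machines."""
    placement = {m: [] for m in machines}
    for i, task in enumerate(tasks):
        placement[machines[i % len(machines)]].append(task)
    return placement

def handle_failure(placement, failed):
    """Move every task off the failed machine onto the survivors."""
    orphans = placement.pop(failed)
    healthy = list(placement)
    for i, task in enumerate(orphans):
        placement[healthy[i % len(healthy)]].append(task)
    return placement

p = schedule(["scan-A", "scan-B", "join-1", "join-2"], ["m1", "m2", "m3"])
p = handle_failure(p, "m2")
# All four tasks are still placed; the user's query never notices.
print(sorted(t for tasks in p.values() for t in tasks))
```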

#### The Client Interface & Control Plane

This is the entry point for your requests.

* **The Gatekeeper**: When you use the UI, CLI (`gcloud`), or an SDK, you are hitting the BigQuery **Control Plane**.
* **Optimization**: It handles metadata (such as table and partition info), query optimization, and IAM security checks before passing the work down to Dremel.
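One concrete job of the control plane is partition pruning: before any slot reads data, the planner consults partition metadata and discards partitions that cannot match the `WHERE` clause. The metadata shape below is hypothetical; it only illustrates the idea.

```python
# Partition pruning sketch: keep only partitions whose date range
# overlaps the query's filter, using metadata alone (no data read).

partitions = {
    "p_2024_01": {"min_date": "2024-01-01", "max_date": "2024-01-31"},
    "p_2024_02": {"min_date": "2024-02-01", "max_date": "2024-02-29"},
    "p_2024_03": {"min_date": "2024-03-01", "max_date": "2024-03-31"},
}

def prune(partitions, lo, hi):
    """Return partitions whose [min_date, max_date] overlaps [lo, hi].
    ISO date strings compare correctly as plain strings."""
    return [name for name, meta in partitions.items()
            if meta["max_date"] >= lo and meta["min_date"] <= hi]

# WHERE dt BETWEEN '2024-02-10' AND '2024-02-20' -> only February scanned
print(prune(partitions, "2024-02-10", "2024-02-20"))  # ['p_2024_02']
```

This is why filtering on the partition column cuts both cost and latency: two-thirds of the table here is never sent to Dremel at all.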

***

#### The Internal Stack

| **Component** | **Responsibility**  | **Analogous To...**                             |
| ------------- | ------------------- | ----------------------------------------------- |
| **Dremel**    | Query Execution     | The "Chef" (Cooks the data)                     |
| **Colossus**  | Data Storage        | The "Pantry" (Holds the ingredients)            |
| **Jupiter**   | Networking          | The "Conveyor Belt" (Moves ingredients to chef) |
| **Borg**      | Resource Management | The "Manager" (Assigns tasks to staff)          |

***

### Example: Shuffle Join

The Shuffle phase is the "magical" middle layer where BigQuery re-distributes data across the network so that matching keys from two different tables end up on the same physical worker.

#### BigQuery Shuffle Architecture (JOIN)

In BigQuery, shuffling doesn't happen on the leaf nodes' local disks. It happens in a dedicated, distributed memory layer (Remote RAM) that sits between the mixers.

{% @mermaid/diagram content="graph TD
%% Input Stage
subgraph "Stage 1: Input & Partial Processing"
    L_Scan[Leaf Nodes: Scan Table A]
    R_Scan[Leaf Nodes: Scan Table B]
end

%% The Shuffle Layer
subgraph "The Shuffle Layer (Distributed Memory)"
    Shuffle_Srv{Jupiter Network / Shuffle Service}
    Hash_A[Hash & Partition A by Join Key]
    Hash_B[Hash & Partition B by Join Key]
end

%% The Join Stage
subgraph "Stage 2: Shuffled Join"
    W1[Join Worker 1: Processes Key Range 0-99]
    W2[Join Worker 2: Processes Key Range 100-199]
    W3[Join Worker 3: Processes Key Range 200-299]
end

%% Flow logic
L_Scan --> Hash_A
R_Scan --> Hash_B

Hash_A -->|Key Hash 50| Shuffle_Srv
Hash_B -->|Key Hash 50| Shuffle_Srv

Shuffle_Srv -->|Both rows with Key 50| W1
Shuffle_Srv -->|Rows with Key 150| W2
Shuffle_Srv -->|Rows with Key 250| W3

W1 --> Final[Final Aggregation / Result]
W2 --> Final
W3 --> Final" %}

#### Why this architecture matters

* **Decoupled Shuffle:** By using a dedicated shuffle service instead of local disk, BigQuery avoids the "disk I/O bottleneck" found in traditional MapReduce.
* **Parallelism**: Because the data is hashed by the Join Key, Worker 1 can perform its part of the JOIN completely independently of Worker 2.
* **Jupiter's Role**: The Jupiter network fabric provides the massive bandwidth (Petabits/sec) required to move millions of rows from the "Input" stage to the "Join" stage in milliseconds.
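The shuffle join above can be sketched end to end: hash each row by its join key so matching keys land on the same worker, then let each worker join its partition independently. The three-worker split and the tiny tables are illustrative.

```python
# Shuffle-join sketch: same hash function on both sides guarantees that
# rows with equal keys end up in the same partition (same worker).

def shuffle(rows, key_index, num_workers):
    """Partition rows across workers by hashing the join key."""
    partitions = [[] for _ in range(num_workers)]
    for row in rows:
        partitions[hash(row[key_index]) % num_workers].append(row)
    return partitions

def worker_join(left_rows, right_rows):
    """Each worker joins only its own partition -- no cross-worker traffic."""
    right_by_key = {}
    for r in right_rows:
        right_by_key.setdefault(r[0], []).append(r)
    return [(l[0], l[1], r[1])
            for l in left_rows
            for r in right_by_key.get(l[0], [])]

table_a = [(50, "order-1"), (150, "order-2"), (250, "order-3")]   # orders
table_b = [(50, "alice"), (150, "bob"), (250, "carol")]           # users

a_parts = shuffle(table_a, 0, 3)
b_parts = shuffle(table_b, 0, 3)
result = [row for i in range(3) for row in worker_join(a_parts[i], b_parts[i])]
print(sorted(result))
# [(50, 'order-1', 'alice'), (150, 'order-2', 'bob'), (250, 'order-3', 'carol')]
```

Because both tables are partitioned with the same hash, correctness does not depend on which worker gets which key—only on matching keys landing together, which is exactly the guarantee the shuffle service provides.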

***
