Internals of BigQuery




To understand the internals of BigQuery, you have to look at it not as a single database, but as an orchestrated collection of massive-scale Google technologies.

BigQuery’s "secret sauce" is its disaggregated architecture, meaning storage and compute are completely separate, connected by a hyper-fast network.

General Architecture


Dremel: The Execution Engine (Compute)

Dremel is the heart of BigQuery. It is a distributed query execution engine that turns your SQL into an execution tree.

  • The Tree Structure: When you run a query, a "root server" receives it and breaks it down into hundreds or thousands of small tasks. These tasks are passed to intermediate "mixers" and finally to "leaf nodes" (slots).

  • Slots: In BigQuery terminology, a "slot" is a unit of computational capacity (CPU and RAM).

  • Multitenancy: Dremel allows thousands of users to share the same massive cluster of machines, dynamically allocating slots based on query complexity.


Dremel uses a multi-level tree structure to break a single SQL query into thousands of tiny parallel tasks.
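The tree described above can be sketched in a few lines of Python. This is an illustrative toy, not Google's code: leaf nodes (slots) scan their shards in parallel and produce partial aggregates, mixers merge subsets of those partials, and the root performs the final merge.

```python
# Toy sketch of a Dremel-style execution tree for a query like
# SELECT COUNT(*), SUM(x): leaves scan shards, mixers and the root merge.

def leaf_scan(shard):
    """A leaf node (slot) scans one shard and returns a partial aggregate."""
    return {"count": len(shard), "sum": sum(shard)}

def merge(partials):
    """Mixers and the root merge partial aggregates in exactly the same way."""
    return {
        "count": sum(p["count"] for p in partials),
        "sum": sum(p["sum"] for p in partials),
    }

# A table split into shards across many disks.
shards = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

# Level 1: every leaf works in parallel on its own shard.
leaf_results = [leaf_scan(s) for s in shards]

# Level 2: each mixer merges a subset of leaves (here: pairs).
mixer_results = [merge(leaf_results[i:i + 2])
                 for i in range(0, len(leaf_results), 2)]

# Root: one final merge produces the query result.
result = merge(mixer_results)
print(result)  # {'count': 9, 'sum': 45}
```

Because `merge` is associative, the tree can be arbitrarily deep and wide, which is what lets Dremel fan a single query out to thousands of slots.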

Colossus: The Storage Layer

BigQuery stores data in Colossus, Google’s latest generation distributed file system (the successor to GFS).

  • Durability: Data is automatically replicated and sharded across many physical disks.

  • Columnar Format (Capacitor): Data isn't stored as "rows" in Colossus; it's stored in a highly compressed columnar format called Capacitor. This allows BigQuery to scan only the columns you mention in your SELECT or WHERE clauses.

  • Separation: Because storage is on Colossus and compute is in Dremel, you can scale your data to petabytes without needing to add more "servers" manually.


Colossus handles the physical sharding and replication of data, using the Capacitor columnar format.
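The column-pruning benefit of a Capacitor-style layout can be shown with a minimal sketch (my own illustration, not the actual format): the table is stored as one array per column, so a query that references only `price` never touches the bytes of the other columns.

```python
# Toy column-oriented table: one array per column, not one record per row.
table = {
    "name":     ["apple", "banana", "cherry", "date"],
    "category": ["fruit", "fruit", "fruit", "fruit"],
    "price":    [1.2, 0.5, 3.0, 4.5],
}

def scan(columns, predicate=None):
    """Read only the requested columns; rows are reassembled on demand,
    so unreferenced columns are never loaded at all."""
    n = len(next(iter(table.values())))
    rows = (tuple(table[c][i] for c in columns) for i in range(n))
    return [r for r in rows if predicate is None or predicate(r)]

# SELECT price FROM table WHERE price > 1.0  -- touches a single column
expensive = scan(["price"], predicate=lambda r: r[0] > 1.0)
print(expensive)  # [(1.2,), (3.0,), (4.5,)]
```

This is also why `SELECT *` is an anti-pattern in BigQuery: it forces a read of every column and you are billed for the bytes scanned.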

Jupiter: The Network

Since compute (Dremel) and storage (Colossus) are physically separate, the data has to travel from disk to CPU extremely fast.

  • The Backbone: Jupiter is Google's custom-built petabit-scale data center network.

  • Speed: It provides enough bandwidth for Dremel to read data from Colossus at terabytes per second. This is why BigQuery doesn't need traditional "indexes"—it can simply brute-force scan massive amounts of data in seconds.


Jupiter acts as the high-speed "highway" that allows the compute nodes to read data from storage as if it were on a local disk.
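Quick back-of-envelope arithmetic (illustrative figures, not published Google specs) shows why bandwidth makes brute-force scanning viable:

```python
# Time to full-scan a table at a given aggregate read bandwidth.
def scan_seconds(table_tb, bandwidth_tb_per_s):
    return table_tb / bandwidth_tb_per_s

# A 100 TB full scan:
print(scan_seconds(100, 1.0))   # 100.0 s at 1 TB/s aggregate bandwidth
print(scan_seconds(100, 0.01))  # 10000.0 s (~2.8 h) at 10 GB/s
```

At terabytes per second of aggregate bandwidth, scanning everything relevant is fast enough that maintaining traditional B-tree indexes stops being worth the cost.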

Borg: The Cluster Manager

Borg is Google's large-scale cluster management system and the precursor to Kubernetes. It coordinates the physical resources that everything else runs on.

  • Orchestration: Borg handles the "where" and "how" of running Dremel. It assigns the physical hardware, manages health checks, and ensures that if a machine fails, the query task is instantly moved to a healthy one without the user noticing.


Borg is the management layer that schedules tasks and ensures the system remains healthy.

The Client Interface & Control Plane

This is the entry point for your requests.

  • The Gatekeeper: When you use the UI, CLI (gcloud), or an SDK, you are hitting the BigQuery Control Plane.

  • Optimization: It handles metadata (like the partitioning info discussed earlier), query optimization, and IAM security checks before passing the work down to Dremel.


The Internal Stack

| Component | Responsibility | Analogous To... |
| --- | --- | --- |
| Dremel | Query Execution | The "Chef" (Cooks the data) |
| Colossus | Data Storage | The "Pantry" (Holds the ingredients) |
| Jupiter | Networking | The "Conveyor Belt" (Moves ingredients to chef) |
| Borg | Resource Management | The "Manager" (Assigns tasks to staff) |


Example: Shuffle Join

The Shuffle phase is the "magical" middle layer where BigQuery re-distributes data across the network so that matching keys from two different tables end up on the same physical worker.

BigQuery Shuffle Architecture (JOIN)

In BigQuery, shuffling doesn't happen on the leaf nodes' local disks. It happens in a dedicated, distributed memory layer (Remote RAM) that sits between the mixers.


Why this architecture matters:

  • Decoupled Shuffle: By using a dedicated shuffle service instead of local disk, BigQuery avoids the "disk I/O bottleneck" found in traditional MapReduce.

  • Parallelism: Because the data is hashed by the Join Key, Worker 1 can perform its part of the JOIN completely independently of Worker 2.

  • Jupiter's Role: The Jupiter network fabric provides the massive bandwidth (Petabits/sec) required to move millions of rows from the "Input" stage to the "Join" stage in milliseconds.
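The whole shuffle-join flow can be simulated in a short sketch (an illustration of the technique, not BigQuery's implementation): rows from both tables are redistributed by `hash(join_key)`, so every matching key lands on the same worker and each worker joins its partition independently.

```python
# Toy shuffle (hash) join: hash-partition both inputs by the join key,
# then let each worker join its own partition with no cross-worker traffic.

NUM_WORKERS = 3

def shuffle(rows, key_index):
    """Redistribute rows across workers by hash of the join key."""
    partitions = [[] for _ in range(NUM_WORKERS)]
    for row in rows:
        partitions[hash(row[key_index]) % NUM_WORKERS].append(row)
    return partitions

orders = [("o1", "alice"), ("o2", "bob"), ("o3", "alice")]  # (order_id, user)
users  = [("alice", "US"), ("bob", "DE")]                   # (user, country)

# The "shuffle" stage: same hash function on both sides guarantees that
# matching keys end up in the same partition.
orders_parts = shuffle(orders, key_index=1)
users_parts  = shuffle(users,  key_index=0)

# The "join" stage: each worker processes only its own partition.
joined = []
for w in range(NUM_WORKERS):
    lookup = dict(users_parts[w])            # build side: user -> country
    for order_id, user in orders_parts[w]:   # probe side
        if user in lookup:
            joined.append((order_id, user, lookup[user]))

print(sorted(joined))
# [('o1', 'alice', 'US'), ('o2', 'bob', 'DE'), ('o3', 'alice', 'US')]
```

In BigQuery the partitions live in the remote shuffle memory layer rather than in worker-local lists, but the partition-then-join-locally logic is the same.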

