Internals of BigQuery
To understand the internals of BigQuery, look at it not as a single database, but as an orchestrated collection of massive-scale Google technologies.
BigQuery’s "secret sauce" is its disaggregated architecture, meaning storage and compute are completely separate, connected by a hyper-fast network.
General Architecture
Dremel: The Execution Engine (Compute)
Dremel is the heart of BigQuery. It is a distributed query execution engine that turns your SQL into an execution tree.
The Tree Structure: When you run a query, a "root server" receives it and breaks it down into hundreds or thousands of small tasks. These tasks are passed to intermediate "mixers" and finally to "leaf nodes" (slots).
Slots: In BigQuery terminology, a "slot" is a unit of computational capacity (CPU and RAM).
Multitenancy: Dremel allows thousands of users to share the same massive cluster of machines, dynamically allocating slots based on query complexity.
Dremel uses a multi-level tree structure to break a single SQL query into thousands of tiny parallel tasks.
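The tree structure above can be sketched as a toy model (illustrative only, not Google's implementation): a root splits a COUNT query into per-shard leaf tasks, and a mixer aggregates the partial results back up. In the real system the leaf scans run in parallel across thousands of slots.

```python
# Toy model of Dremel's execution tree: root -> mixer -> leaf nodes (slots).
# Names and structure are illustrative, not the actual Dremel internals.

def leaf_scan(shard):
    # Leaf node (slot): scans its shard and returns a partial count.
    return sum(1 for row in shard if row["country"] == "DE")

def mixer(partials):
    # Intermediate mixer: combines partial results from its children.
    return sum(partials)

def root(table, num_leaves=4):
    # Root server: splits the table into shards, fans work out, merges results.
    shards = [table[i::num_leaves] for i in range(num_leaves)]
    partials = [leaf_scan(s) for s in shards]  # executed in parallel in reality
    return mixer(partials)

rows = [{"country": c} for c in ["DE", "US", "DE", "FR", "DE", "US"]]
print(root(rows))  # 3
```

The key property this models is that partial aggregates compose: each leaf only ever sees its own shard, so adding slots scales the scan horizontally.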
Colossus: The Storage Layer
BigQuery stores data in Colossus, Google’s latest generation distributed file system (the successor to GFS).
Durability: Data is automatically replicated and sharded across many physical disks.
Columnar Format (Capacitor): Data isn't stored as "rows" in Colossus; it's stored in a highly compressed columnar format called Capacitor. This allows BigQuery to scan only the columns you reference in your SELECT or WHERE clauses.
Separation: Because storage is on Colossus and compute is in Dremel, you can scale your data to petabytes without needing to add more "servers" manually.
Colossus handles the physical sharding and replication of data, using the Capacitor columnar format.
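A minimal sketch of why the columnar layout matters (this is not Capacitor itself, just an illustration): when the same table is laid out column-wise, a query that touches only one column reads one contiguous, highly compressible array. Run-length encoding (RLE) below stands in for the compression techniques columnar formats rely on.

```python
# Illustrative only: row-wise vs. column-wise layout, plus simple RLE.

rows = [
    {"id": 1, "status": "ok"},
    {"id": 2, "status": "ok"},
    {"id": 3, "status": "error"},
]

# Column-oriented layout: one contiguous array per column.
columns = {
    "id": [r["id"] for r in rows],
    "status": [r["status"] for r in rows],
}

def rle(values):
    # Run-length encode a column as [(value, run_length), ...].
    out = []
    for v in values:
        if out and out[-1][0] == v:
            out[-1] = (v, out[-1][1] + 1)
        else:
            out.append((v, 1))
    return out

# A query filtering on `status` reads only this compressed column;
# the bytes of `id` are never touched.
print(rle(columns["status"]))  # [('ok', 2), ('error', 1)]
```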
Jupiter: The Network
Since compute (Dremel) and storage (Colossus) are physically separate, the data has to travel from disk to CPU extremely fast.
The Backbone: Jupiter is Google's custom-built petabit-scale data center network.
Speed: It provides enough bandwidth for Dremel to read data from Colossus at terabytes per second. This is why BigQuery doesn't need traditional "indexes"—it can simply brute-force scan massive amounts of data in seconds.
Jupiter acts as the high-speed "highway" that allows the compute nodes to read data from storage as if it were on a local disk.
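A quick back-of-envelope calculation (the numbers are illustrative, not official specs) shows why brute-force scanning can substitute for indexes once the fabric reaches petabit scale:

```python
# Illustrative arithmetic: streaming a 1 TB table over a 1 Pb/s fabric.
fabric_bytes_per_sec = 1e15 / 8      # 1 petabit/s expressed in bytes/s
table_bytes = 1e12                   # a 1 TB table
scan_seconds = table_bytes / fabric_bytes_per_sec
print(f"{scan_seconds * 1000:.0f} ms to stream 1 TB")  # 8 ms
```

In practice the scan is bounded by many other factors (slots, decompression, shuffle), but the point stands: the network is fast enough that it stops being the bottleneck.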
Borg: The Cluster Manager
Borg is Google's internal cluster manager and the precursor to Kubernetes. It is the large-scale cluster management system that coordinates resources across the fleet.
Orchestration: Borg handles the "where" and "how" of running Dremel. It assigns the physical hardware, manages health checks, and ensures that if a machine fails, the query task is instantly moved to a healthy one without the user noticing.
Borg is the management layer that schedules tasks and ensures the system remains healthy.
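The failover behaviour described above can be sketched in a few lines (a toy model with made-up names, not Borg's actual API): a task assigned to an unhealthy machine is simply rescheduled onto the next healthy one, and the caller only ever sees the result.

```python
# Toy model of scheduler failover: reassign a task until a healthy
# worker completes it; the caller never observes the dead machine.

def run_with_failover(task, workers, healthy):
    for w in workers:
        if healthy[w]:
            return task(), w
        # Health check failed: task is rescheduled on the next worker.
    raise RuntimeError("no healthy workers left")

workers = ["machine-a", "machine-b", "machine-c"]
healthy = {"machine-a": False, "machine-b": True, "machine-c": True}
result, ran_on = run_with_failover(lambda: 42, workers, healthy)
print(result, ran_on)  # 42 machine-b
```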
The Client Interface & Control Plane
This is the entry point for your requests.
The Gatekeeper: When you use the UI, the CLI (gcloud), or an SDK, you are hitting the BigQuery Control Plane.
Optimization: It handles metadata (such as table partitioning information), query optimization, and IAM security checks before passing the work down to Dremel.
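A simplified model of those control-plane steps (all names here are illustrative, not the real BigQuery API): check IAM first, consult table metadata to prune partitions the query cannot touch, and only then hand the plan down to the execution engine.

```python
# Simplified control-plane flow: IAM check -> metadata-based partition
# pruning -> hand-off to execution. Names are hypothetical.

def control_plane(query, user, iam_policy, table_metadata):
    # 1. IAM check happens before any compute is scheduled.
    if "bigquery.jobs.create" not in iam_policy.get(user, []):
        raise PermissionError(f"{user} may not run queries")
    # 2. Metadata lookup: keep only partitions the query can touch.
    partitions = [p for p in table_metadata["partitions"]
                  if p >= query["min_date"]]
    # 3. Hand the pruned plan down to Dremel (a dict stands in for it here).
    return {"sql": query["sql"], "partitions": partitions}

iam = {"alice@example.com": ["bigquery.jobs.create"]}
meta = {"partitions": ["2024-01-01", "2024-02-01", "2024-03-01"]}
plan = control_plane(
    {"sql": "SELECT ...", "min_date": "2024-02-01"},
    "alice@example.com", iam, meta,
)
print(plan["partitions"])  # ['2024-02-01', '2024-03-01']
```

Pruning at the metadata layer is why partitioned tables can cut both cost and latency: Dremel never even schedules slots for partitions that were filtered out here.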
The Internal Stack
| Component | Responsibility | Analogous To... |
| --- | --- | --- |
| Dremel | Query Execution | The "Chef" (Cooks the data) |
| Colossus | Data Storage | The "Pantry" (Holds the ingredients) |
| Jupiter | Networking | The "Conveyor Belt" (Moves ingredients to the chef) |
| Borg | Resource Management | The "Manager" (Assigns tasks to staff) |
Example: Shuffle Join
The Shuffle phase is the "magical" middle layer where BigQuery re-distributes data across the network so that matching keys from two different tables end up on the same physical worker.
BigQuery Shuffle Architecture (JOIN)
In BigQuery, shuffling doesn't happen on the leaf nodes' local disks. It happens in a dedicated, distributed memory layer (Remote RAM) that sits between the mixers.
Why this architecture matters:
Decoupled Shuffle: By using a dedicated shuffle service instead of local disk, BigQuery avoids the "disk I/O bottleneck" found in traditional MapReduce.
Parallelism: Because the data is hashed by the Join Key, Worker 1 can perform its part of the JOIN completely independently of Worker 2.
Jupiter's Role: The Jupiter network fabric provides the massive bandwidth (Petabits/sec) required to move millions of rows from the "Input" stage to the "Join" stage in milliseconds.
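The stages above can be sketched as a minimal hash-partitioned (shuffle) join, with table and column names invented for the example: hash each row's join key to route it to a worker, then each worker runs a plain hash join over only the rows that landed on it, independently of every other worker.

```python
# Minimal shuffle-join sketch: rows with the same join key always hash
# to the same worker, so each worker can join its bucket in isolation.

NUM_WORKERS = 2

def shuffle(rows, key):
    # Shuffle stage: route each row to a worker by hash of its join key.
    buckets = [[] for _ in range(NUM_WORKERS)]
    for row in rows:
        buckets[hash(row[key]) % NUM_WORKERS].append(row)
    return buckets

def local_join(left, right, key):
    # Join stage: an ordinary hash join over one worker's bucket.
    index = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]

orders = [{"user_id": u, "amount": a} for u, a in [(1, 10), (2, 20), (1, 5)]]
users = [{"user_id": u, "name": n} for u, n in [(1, "ana"), (2, "bob")]]

left_buckets = shuffle(orders, "user_id")
right_buckets = shuffle(users, "user_id")
joined = []
for w in range(NUM_WORKERS):  # each worker joins independently
    joined.extend(local_join(left_buckets[w], right_buckets[w], "user_id"))
print(sorted(j["amount"] for j in joined))  # [5, 10, 20]
```

In BigQuery the buckets live in the remote in-memory shuffle layer rather than in Python lists, but the invariant is the same: co-locating equal keys is what makes the join stage embarrassingly parallel.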