cockroachdb/cockroach

CockroachDB Wiki

Last updated on Dec 18, 2025 (Commit: 0e89695)

Overview

Relevant Files
  • README.md
  • CLAUDE.md
  • docs/design.md

CockroachDB is a cloud-native distributed SQL database designed for scalability, strong consistency, and survivability. It combines a PostgreSQL-compatible SQL interface with a distributed key-value store backed by Raft consensus, enabling horizontal scaling across multiple nodes and datacenters while maintaining ACID guarantees.

Core Design Principles

CockroachDB achieves three primary goals:

  1. Scalability - Horizontal scaling by adding nodes; queries distribute across the cluster for linear throughput increases
  2. Strong Consistency - Raft-based consensus ensures ACID semantics; distributed transactions use non-locking commit protocols
  3. Survivability - Tolerates disk, machine, rack, and datacenter failures with minimal latency disruption and no manual intervention

Layered Architecture

The system follows a three-tier architecture:

  • SQL Layer (/pkg/sql/) - PostgreSQL-compatible query processing, optimization, and execution
  • Distributed KV Layer (/pkg/kv/) - Transaction management with Serializable Snapshot Isolation (SSI)
  • Storage Layer (/pkg/storage/) - RocksDB/Pebble integration with MVCC (Multi-Version Concurrency Control)

Key Components

Range-Based Partitioning: The keyspace is divided into ranges (default 512MB), each replicated across multiple nodes using Raft consensus. Ranges automatically split and merge to maintain target size and balance load.

Consensus & Replication: Every range uses Raft for synchronous replication. With N replicas, the system tolerates up to F failures where N = 2F + 1 (e.g., 3 replicas tolerate 1 failure).

Transaction Model: Supports both snapshot isolation and serializable snapshot isolation. Lock-free reads and writes enable high concurrency while maintaining external consistency.

Enterprise Features (/pkg/ccl/): Backup, restore, multi-tenancy, and other commercial capabilities.

Development Workflow

Use the unified ./dev tool (Bazel wrapper) for all operations:

./dev doctor              # Verify environment setup
./dev build short         # Fast iterative builds
./dev test pkg/sql        # Run package tests
./dev generate            # Generate code (protobuf, parsers)
./dev testlogic           # Run SQL logic tests

The codebase is heavily generated—always run ./dev generate after modifying .proto files, SQL grammar, or optimizer rules.

Architecture & System Design

Relevant Files
  • docs/design.md
  • docs/tech-notes/life_of_a_query.md
  • docs/tech-notes/txn_coord_sender.md
  • pkg/sql/doc.go
  • pkg/kv/doc.go
  • pkg/storage/doc.go
  • pkg/server/doc.go

CockroachDB implements a layered, distributed architecture designed for scalability, strong consistency, and survivability. The system abstracts a single monolithic sorted key-value map across a cluster of nodes, with each layer providing higher-level abstractions.

Three-Layer Architecture

SQL Layer (/pkg/sql/) transforms SQL queries into key-value operations. It handles parsing (via an LALR parser), logical planning and optimization, physical planning, and execution. The layer is PostgreSQL-compatible and exposes a pgwire protocol endpoint.

Distributed KV Layer (/pkg/kv/) provides transactional semantics with Serializable Snapshot Isolation (SSI). It manages transaction coordination, automatic retries, conflict resolution, and request routing across the cluster. The layer abstracts away range addressing and replication details.

Storage Layer (/pkg/storage/) implements Multi-Version Concurrency Control (MVCC) on top of RocksDB/Pebble. It provides lock-free reads and writes with snapshot isolation, enabling externally consistent transactions without explicit locking.

Range-Based Partitioning & Replication

The keyspace is divided into ranges (default 512MB), each identified by a [start, end) key interval. Each range:

  • Is replicated across multiple nodes (typically 3x) using Raft consensus
  • Automatically splits when exceeding target size
  • Automatically merges when becoming too small
  • Maintains strong consistency through Raft-mediated mutations
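
These mechanics can be sketched in a few lines. The example below is illustrative only; the real descriptor type lives in pkg/roachpb and the split logic in pkg/kv/kvserver.

// Illustrative sketch of a range descriptor and the size-based split trigger
package rangesketch

import "bytes"

type RangeDescriptor struct {
  StartKey []byte // inclusive
  EndKey   []byte // exclusive
  Replicas []int  // IDs of the nodes holding Raft replicas
}

// ContainsKey reports whether key falls inside the [StartKey, EndKey) interval.
func (d RangeDescriptor) ContainsKey(key []byte) bool {
  return bytes.Compare(key, d.StartKey) >= 0 && bytes.Compare(key, d.EndKey) < 0
}

// ShouldSplit mirrors the size-based trigger: a range splits once its size
// exceeds the configured target (default 512MB).
func ShouldSplit(rangeSizeBytes, targetSizeBytes int64) bool {
  return rangeSizeBytes > targetSizeBytes
}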

Query Execution Flow

A SQL query follows this path:

  1. Network: PostgreSQL wire protocol (pgwire package) receives the query
  2. Parsing: An LALR parser generates an Abstract Syntax Tree (AST)
  3. Planning: Logical planner optimizes the AST; physical planner generates execution plan
  4. Execution: Executor translates plan into KV operations
  5. Transaction Coordination: KV layer batches operations, handles retries, manages timestamps
  6. Range Routing: Requests routed to appropriate range replicas
  7. Raft Consensus: Leader replicates writes to followers
  8. Storage: MVCC engine persists to RocksDB/Pebble

Key Design Principles

  • Lock-Free Transactions: No explicit locks; conflicts resolved via timestamp ordering and retries
  • Automatic Retry Handling: Transaction layer transparently retries on conflicts
  • Monolithic Keyspace: Clients see a single sorted map; ranges are internal implementation detail
  • Symmetric Nodes: All nodes run the same binary; any node can serve as SQL gateway
  • Zone Configuration: Replication factor, storage type, and datacenter location configurable per zone

SQL Engine & Query Execution

Relevant Files
  • pkg/sql/pgwire/conn.go - PostgreSQL wire protocol connection handling
  • pkg/sql/conn_executor.go - Connection executor and statement coordination
  • pkg/sql/conn_executor_exec.go - Statement execution and plan dispatch
  • pkg/sql/planner.go - Query planner and optimizer coordination
  • pkg/sql/plan_opt.go - Optimizer integration and plan building
  • pkg/sql/opt/xform/optimizer.go - Cost-based query optimizer
  • pkg/sql/distsql_physical_planner.go - Distributed SQL physical planning
  • pkg/sql/distsql_running.go - Distributed SQL execution runtime

Query Execution Pipeline

CockroachDB processes SQL queries through a multi-stage pipeline coordinated by the connExecutor. Each query follows this path:

  1. Network Reception: PostgreSQL wire protocol (pgwire) receives the query and buffers it in a StmtBuf
  2. Parsing: An LALR parser generates an Abstract Syntax Tree (AST) from the SQL text
  3. Logical Planning: Optimizer transforms AST into a logical query plan
  4. Physical Planning: DistSQL planner converts logical plan to distributed execution plan
  5. Execution: Plan is executed locally or distributed across nodes
  6. Result Delivery: Results streamed back to client via ClientComm
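
Any node can serve this pipeline through its pgwire endpoint. The sketch below is a minimal client-side example, assuming a local insecure node on port 26257 and the github.com/lib/pq driver (any PostgreSQL driver works):

// Minimal pgwire client sketch; connection details are assumptions
package main

import (
  "database/sql"
  "fmt"
  "log"

  _ "github.com/lib/pq" // registers the "postgres" driver used below
)

func main() {
  // The statement below travels through pgwire, the parser, the optimizer,
  // and the execution engine before results stream back over ClientComm.
  db, err := sql.Open("postgres", "postgresql://root@localhost:26257/defaultdb?sslmode=disable")
  if err != nil {
    log.Fatal(err)
  }
  defer db.Close()

  var version string
  if err := db.QueryRow("SELECT version()").Scan(&version); err != nil {
    log.Fatal(err)
  }
  fmt.Println(version)
}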

The Planner: Central Coordinator

The planner struct is the centerpiece of SQL execution. It combines session state, database state, and execution logic for a single statement. Key responsibilities:

  • Name Resolution: Binds table and column references to schema objects
  • Type Checking: Validates expression types and semantic rules
  • Optimizer Invocation: Calls makeOptimizerPlan() to generate logical plans
  • DistSQL Coordination: Works with DistSQLPlanner for distributed execution
  • Memory Tracking: Monitors memory usage via BytesMonitor

The planner is scoped to a single statement and is not thread-safe.

Cost-Based Optimizer

The optimizer in pkg/sql/opt/xform/optimizer.go uses a memo-based approach to find the lowest-cost execution plan:

  • Memo Structure: Stores a forest of logically equivalent expression trees organized into groups
  • Normalization Phase: Applies cost-agnostic transformations to canonical form
  • Exploration Phase: Generates alternate equivalent expressions (join reordering, predicate pushdown, etc.)
  • Costing: Estimates execution cost using table statistics and operator-specific models
  • Search: Uses dynamic programming with branch-and-bound pruning to find optimal plan

The optimizer handles both prepared statements (cached memo) and ad-hoc queries.
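
The memo idea can be illustrated with a toy structure. This is not the actual pkg/sql/opt/memo API, just a sketch of groups of equivalent expressions and the lowest-cost selection step:

// Illustrative memo sketch (the real memo is far richer)
package memosketch

type Expr struct {
  Op       string  // e.g. "scan", "hash-join", "merge-join"
  Children []int   // child group IDs
  Cost     float64 // estimated from table statistics
}

type Group struct {
  Exprs []Expr // logically equivalent alternatives for one sub-expression
}

type Memo struct {
  Groups []Group
}

// Best returns the lowest-cost expression in a group, standing in for the
// search phase that follows normalization and exploration.
func (m *Memo) Best(group int) Expr {
  best := m.Groups[group].Exprs[0]
  for _, e := range m.Groups[group].Exprs[1:] {
    if e.Cost < best.Cost {
      best = e
    }
  }
  return best
}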

Distributed SQL Execution

For queries that benefit from parallelization, the DistSQLPlanner converts logical plans into physical plans:

  • Table Readers: Scans distributed across nodes holding relevant ranges
  • Processing Stages: Intermediate operators (joins, aggregations, sorts) connected via routers
  • Flow Setup: Sends SetupFlowRequest RPCs to remote nodes to initialize execution
  • Result Collection: Gateway node collects results from remote processors

Distribution decisions are based on data locality, query complexity, and cluster configuration.

Execution Dispatch

The dispatchToExecutionEngine() method routes plans to appropriate executors:

  • Local Execution: Single-node plans execute directly on gateway
  • Distributed Execution: Multi-node plans use DistSQL engine with remote processors
  • Columnar Execution: The vectorized engine processes column-oriented batches for higher throughput
  • Row-by-Row Fallback: Traditional row-oriented execution for complex operations

Key Optimization Techniques

  • Predicate Pushdown: Moves filters below joins to reduce intermediate rows
  • Join Reordering: Explores different join orders using statistics
  • Index Selection: Chooses covering indexes to minimize data access
  • Distinct/GroupBy Elimination: Removes redundant operations based on keys
  • Decorrelation: Converts correlated subqueries to semi-joins and anti-joins

Table cardinality statistics and column histograms guide these decisions.

Distributed Transactions & KV Layer

Relevant Files
  • pkg/kv/txn.go
  • pkg/kv/sender.go
  • pkg/kv/kvclient/kvcoord/txn_coord_sender.go
  • pkg/kv/kvclient/kvcoord/dist_sender.go
  • pkg/kv/kvclient/kvcoord/doc.go
  • pkg/kv/kvclient/rangefeed/doc.go

CockroachDB implements distributed transactions using a two-phase commit protocol coordinated by the KV client layer. The transaction system bridges the SQL layer with the underlying distributed key-value store, handling consistency, conflict resolution, and fault recovery transparently.

Transaction Architecture

The transaction layer consists of three main components:

  1. TxnCoordSender - Maintains transaction state and coordinates all operations for a single transaction; each transaction gets its own instance. It manages the transaction lifecycle, heartbeating, lock spans, and retriable-error handling.

  2. DistSender - Routes requests to appropriate ranges based on key ranges. It subdivides batches that span multiple ranges and sends them in parallel, then recombines responses.

  3. Interceptor Stack - A pluggable pipeline of request/response transformers that handle specific transaction concerns, applied in the order below (a simplified sketch follows the list):

    • Heartbeater: Keeps transaction record alive
    • Sequence Number Allocator: Assigns sequence numbers to operations
    • Write Buffer: Buffers writes before sending
    • Pipeliner: Pipelines writes using asynchronous consensus
    • Committer: Handles transaction commit logic
    • Span Refresher: Refreshes read spans on conflicts
    • Metric Recorder: Collects transaction metrics
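
Each interceptor wraps the next sender in the chain, adjusting the batch before delegating downward. The sketch below shows the pattern only; the real interfaces in pkg/kv/kvclient/kvcoord differ in detail:

// Simplified chained-sender sketch of the interceptor stack
package interceptorsketch

import "context"

type Request struct {
  Method string
  SeqNum int
}

type BatchResponse struct {
  Err error
}

// Sender is the hook each interceptor implements; the bottom of the chain
// hands the batch to the DistSender.
type Sender interface {
  Send(ctx context.Context, reqs []Request) BatchResponse
}

// seqNumAllocator stamps a monotonically increasing sequence number on each
// request before passing the batch to the wrapped sender.
type seqNumAllocator struct {
  next    int
  wrapped Sender
}

func (s *seqNumAllocator) Send(ctx context.Context, reqs []Request) BatchResponse {
  for i := range reqs {
    s.next++
    reqs[i].SeqNum = s.next
  }
  return s.wrapped.Send(ctx, reqs)
}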

Transaction Types

CockroachDB supports two transaction types:

  • RootTxn - The primary transaction coordinator, responsible for aggregating state and finalizing the transaction. Only root transactions heartbeat the transaction record.

  • LeafTxn - Used in distributed SQL flows. Leaf transactions execute on remote nodes and propagate their state back to the root transaction. They do not heartbeat, relying on the root to keep the transaction alive.

Request Flow

A batch of requests flows from the client's Txn handle through the TxnCoordSender's interceptor stack, then to the DistSender, which splits the batch by range, routes each piece to the appropriate replica, and recombines the responses.

Key Optimizations

Pipelining - Writes are proposed asynchronously through Raft consensus. The pipeliner tracks in-flight writes and ensures interfering requests chain on them, proving they succeeded before considering the transaction committed.

Parallel Commit - The EndTxn request that stages the transaction record is sent in parallel with the transaction's remaining writes, reducing commit latency from two rounds of consensus to one.

Span Refreshing - On transaction conflicts, instead of aborting, the system attempts to refresh read spans to a newer timestamp, allowing the transaction to proceed without restart.

1PC Optimization - Transactions whose operations all fall on a single range can commit in a single round-trip using the one-phase commit (1PC) fast path.

Transaction Lifecycle

  1. Begin - Transaction created with unique ID and initial timestamp
  2. Execute - Reads and writes accumulate; intents are written to the KV store
  3. Heartbeat - Root transaction periodically heartbeats to prevent cleanup
  4. Commit/Rollback - EndTxn request finalizes the transaction
  5. Cleanup - Intents are resolved; transaction record is garbage collected
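
Internally, this lifecycle is driven through the pkg/kv client: a closure passed to kv.DB.Txn performs the reads and writes, and the commit (or rollback) is issued when the closure returns. A hedged sketch of that pattern, with placeholder keys and simplified error handling:

// Hedged sketch of the internal pkg/kv transaction pattern
package txnsketch

import (
  "context"

  "github.com/cockroachdb/cockroach/pkg/kv"
)

func copyMarker(ctx context.Context, db *kv.DB) error {
  // The closure may be re-run on retriable errors; the TxnCoordSender
  // handles heartbeating and intent tracking behind the scenes.
  return db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
    v, err := txn.Get(ctx, "example/src") // placeholder key
    if err != nil {
      return err
    }
    if !v.Exists() {
      return nil // nothing to copy
    }
    return txn.Put(ctx, "example/dst", v.ValueBytes()) // placeholder key
  })
}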

Error Handling

The transaction layer handles several error types:

  • Retriable Errors - Transaction conflicts trigger automatic retry with timestamp bump (epoch increment)
  • Non-Retriable Errors - Abort the transaction; client must create new TxnCoordSender
  • TransactionAbortedError - Transaction was cleaned up; requires new transaction attempt
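
At the SQL client level, the cockroach-go helper library exposes the same retry discipline: crdb.ExecuteTx re-runs the supplied closure whenever the server reports a retriable error. A sketch, assuming an accounts table exists:

// Client-side retry loop using the cockroach-go helper (table is assumed)
package main

import (
  "context"
  "database/sql"

  "github.com/cockroachdb/cockroach-go/v2/crdb"
)

func transfer(ctx context.Context, db *sql.DB, from, to, amount int) error {
  // ExecuteTx opens a transaction, runs the closure, and retries it on
  // retriable errors before committing.
  return crdb.ExecuteTx(ctx, db, nil, func(tx *sql.Tx) error {
    if _, err := tx.ExecContext(ctx,
      "UPDATE accounts SET balance = balance - $1 WHERE id = $2", amount, from); err != nil {
      return err
    }
    _, err := tx.ExecContext(ctx,
      "UPDATE accounts SET balance = balance + $1 WHERE id = $2", amount, to)
    return err
  })
}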

Storage, MVCC & Replication

Relevant Files
  • pkg/storage/mvcc.go - MVCC operations (Get, Put, Delete, Merge)
  • pkg/storage/mvcc_key.go - MVCC key versioning structure
  • pkg/storage/mvcc_value.go - MVCC value encoding and metadata
  • pkg/storage/pebble.go - Pebble storage engine integration
  • pkg/storage/engine.go - Storage engine interface
  • pkg/raft/raft.go - Raft consensus implementation
  • pkg/kv/kvserver/replica.go - Replica state machine and Raft integration
  • pkg/kv/kvserver/replica_raftlog.go - Raft log storage interface

MVCC: Multi-Version Concurrency Control

CockroachDB uses MVCC to enable lock-free, snapshot-isolated transactions. Each key stores multiple timestamped versions, allowing readers to see consistent snapshots without blocking writers.

Key Structure:

  • Metadata key: Stores the most recent version timestamp and optional transaction intent
  • Versioned keys: Stored in decreasing timestamp order, with metadata at timestamp zero
  • Visibility: Readers with timestamp T see all versions with timestamp ≤ T
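
A simplified sketch of that key shape and sort order follows; the real MVCCKey type and comparator live in pkg/storage and encode the timestamp into the Pebble key suffix:

// Illustrative MVCC key and sort rule (simplified)
package mvccsketch

import "bytes"

type Timestamp struct {
  WallTime int64
  Logical  int32
}

func (t Timestamp) IsEmpty() bool { return t.WallTime == 0 && t.Logical == 0 }

type MVCCKey struct {
  Key       []byte
  Timestamp Timestamp // an empty timestamp denotes the metadata key
}

// Less sorts ascending by user key; for the same user key, the metadata key
// sorts first, then versioned keys in decreasing timestamp order.
func Less(a, b MVCCKey) bool {
  if c := bytes.Compare(a.Key, b.Key); c != 0 {
    return c < 0
  }
  if a.Timestamp.IsEmpty() || b.Timestamp.IsEmpty() {
    return a.Timestamp.IsEmpty() && !b.Timestamp.IsEmpty()
  }
  if a.Timestamp.WallTime != b.Timestamp.WallTime {
    return a.Timestamp.WallTime > b.Timestamp.WallTime
  }
  return a.Timestamp.Logical > b.Timestamp.Logical
}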

Core Operations:

  • MVCCGet() - Read a key at a specific timestamp, checking for intents and conflicts
  • MVCCPut() - Write a new version, optionally acquiring locks for transactions
  • MVCCDelete() - Mark a key as deleted (empty value)
  • MVCCMerge() - Merge inline values (used for stats and time series)
// Example: Reading a key at a specific timestamp
result, err := MVCCGet(ctx, reader, key, timestamp, opts)
// Returns the value visible at that timestamp, or a conflict error

Pebble Storage Engine

CockroachDB uses Pebble (a Go-native LSM storage engine inspired by RocksDB and LevelDB) as its underlying storage engine. Pebble provides:

  • LSM-tree architecture: Efficient writes via in-memory memtables and sorted SSTables
  • Compaction: Automatic background merging of SSTables to optimize read performance
  • MVCC-aware comparator: Understands CockroachDB's key encoding and timestamp ordering
  • Merge operator: Supports efficient aggregation for stats and time series

Pebble handles the physical storage of MVCC keys and values, with CockroachDB managing the logical MVCC semantics above it.
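
Pebble is also usable as a standalone library, which makes the division of labor visible: the sketch below exercises the raw key-value API that CockroachDB's MVCC layer builds on (the directory path and keys are placeholders).

// Standalone Pebble usage sketch
package main

import (
  "fmt"
  "log"

  "github.com/cockroachdb/pebble"
)

func main() {
  db, err := pebble.Open("demo-data", &pebble.Options{})
  if err != nil {
    log.Fatal(err)
  }
  defer db.Close()

  // Writes land in the WAL and memtable; background compaction later merges SSTables.
  if err := db.Set([]byte("hello"), []byte("world"), pebble.Sync); err != nil {
    log.Fatal(err)
  }

  value, closer, err := db.Get([]byte("hello"))
  if err != nil {
    log.Fatal(err)
  }
  fmt.Printf("hello = %s\n", value)
  closer.Close() // the returned value is only valid until Close
}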

Raft Consensus & Replication

Each range (contiguous keyspace) is replicated across multiple nodes using Raft consensus. Raft ensures:

  • Strong consistency: All replicas apply the same commands in the same order
  • Fault tolerance: With N replicas, tolerates up to (N-1)/2 failures (e.g., one failure with the typical 3 replicas)
  • Leader election: Automatically elects a leader to coordinate writes

Replication Flow:

  1. Proposal: Client sends write to the leader replica
  2. Log replication: Leader appends entry to its log and sends to followers
  3. Commitment: Once a majority acknowledges, entry is committed
  4. Application: Committed entries are applied to the state machine (storage engine)
  5. Response: Client receives acknowledgment after application
// Raft state machine loop (simplified)
for {
  select {
  case <-ticker:
    node.Tick()  // Advance Raft election and heartbeat timers
  case rd := <-node.Ready():
    saveToStorage(rd.HardState, rd.Entries, rd.Snapshot)  // Persist hard state and log entries to Pebble
    send(rd.Messages)  // Send outgoing Raft messages to peer replicas
    for _, entry := range rd.CommittedEntries {
      applyToStateMachine(entry)  // Apply committed entries to the replica state machine
    }
    node.Advance()  // Signal that this Ready batch has been processed
  }
}

Snapshots & Log Truncation

Raft logs grow indefinitely without management. CockroachDB uses snapshots to:

  • Capture the state machine at a point in time
  • Allow followers to catch up quickly without replaying all log entries
  • Enable log truncation to reclaim disk space

When a follower falls too far behind, the leader sends a snapshot instead of individual entries, dramatically reducing replication overhead.

Integration: MVCC + Raft

The combination enables externally consistent distributed transactions:

  • MVCC provides snapshot isolation at the storage layer
  • Raft ensures all replicas see the same mutations in order
  • Timestamps from the hybrid logical clock (HLC) coordinate visibility across the cluster
  • Intents (transaction markers) prevent dirty reads and enable conflict detection

This architecture allows CockroachDB to provide PostgreSQL-compatible ACID semantics across a distributed system without explicit locking.

Cluster Management & Operations

Relevant Files
  • pkg/server/server.go - Server initialization and component setup
  • pkg/server/init.go - Cluster bootstrap and join logic
  • pkg/server/node.go - Node management and ID allocation
  • pkg/gossip/gossip.go - Peer-to-peer gossip protocol
  • pkg/rpc/context.go - RPC communication infrastructure
  • pkg/spanconfig/spanconfig.go - Span configuration management
  • pkg/jobs/registry.go - Background job coordination

Node Initialization & Cluster Joining

When a CockroachDB node starts, it must determine whether it's joining an existing cluster or bootstrapping a new one. The initServer handles this critical phase by attempting to contact nodes specified in the --join flag.

Join RPC Flow:

  1. New node sends JoinNodeRequest to an existing cluster node
  2. Receiving node allocates unique NodeID and StoreID via atomic increments on system keys
  3. Response includes ClusterID, allocated IDs, and active cluster version
  4. Joining node validates binary compatibility before accepting IDs

The Node.Join() RPC handler manages ID allocation atomically using the KV layer, ensuring no two nodes receive the same ID even under concurrent joins.
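
The allocation step can be pictured as an atomic increment on a system key through the KV client. The sketch below is a simplified illustration; the real handler in pkg/server/node.go also allocates StoreIDs and validates versions, and the exact key and helper names here are assumptions:

// Simplified illustration of NodeID allocation via an atomic increment
package joinsketch

import (
  "context"

  "github.com/cockroachdb/cockroach/pkg/keys"
  "github.com/cockroachdb/cockroach/pkg/kv"
)

func allocateNodeID(ctx context.Context, db *kv.DB) (int64, error) {
  // Increment the cluster-wide NodeID counter; concurrent joins each
  // observe a distinct result.
  res, err := db.Inc(ctx, keys.NodeIDGenerator, 1)
  if err != nil {
    return 0, err
  }
  return res.ValueInt(), nil
}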

Gossip Protocol

The gossip subsystem implements a self-assembling peer-to-peer network for sharing cluster metadata. Each node maintains connections to multiple peers and exchanges information about other nodes, ranges, and system state.

Key characteristics:

  • Minimizes hops between nodes (target: <= 5 hops)
  • Maintains minimum peer connections (3+ peers)
  • Automatically discovers new nodes and culls inefficient connections
  • Propagates critical metadata like cluster ID, node liveness, and range information

Gossip runs continuously in the background, ensuring all nodes eventually converge on the same cluster state.

RPC Communication

The rpc.Context establishes secure, authenticated communication between nodes using gRPC with optional TLS. Each node maintains connection pools and implements circuit breakers for fault tolerance.

Features:

  • Heartbeat-based liveness detection
  • Clock offset monitoring across nodes
  • Tenant-aware authorization for multi-tenant deployments
  • Compression and efficient serialization

Span Configuration Management

Span configurations define replication, zone, and performance settings for key ranges. The SpanConfigManager coordinates a background reconciliation job that ensures zone configurations are translated into span configs applied across the cluster.

Process:

  1. Zone configurations are stored in system tables
  2. Reconciliation job periodically reads zone configs
  3. Converts them to span configs and updates system.span_configurations
  4. Subscribers (like the KV layer) receive notifications of changes
  5. Ranges adjust replication and behavior accordingly

Background Job Coordination

The jobs.Registry manages long-running operations like backups, restores, and schema changes. It uses SQL liveness to track which nodes are actively executing jobs and automatically reassigns work if a node fails.

Job lifecycle:

  • Jobs are created with metadata stored in system.jobs
  • Registry adopts unclaimed jobs up to a configurable rate
  • Resumer goroutines execute job logic with independent contexts
  • Progress is persisted periodically for crash recovery
  • Non-cancelable jobs (like span config reconciliation) run to completion

Cluster Startup Sequence

1. Create engines and load existing state
2. Initialize gossip and RPC infrastructure
3. Run initServer to determine NodeID/ClusterID
   - If joining: send Join RPC to bootstrap addresses
   - If bootstrapping: wait for explicit init command
4. Start Node with allocated IDs
5. Initialize stores and begin accepting traffic
6. Start background tasks (gossip, jobs, span config reconciliation)

The entire process is coordinated through a Stopper that manages graceful shutdown and ensures all components quiesce in the correct order.

Monitoring & Observability

Cluster health is monitored through:

  • Liveness records: Track which nodes are responsive
  • Metrics: Connection counts, gossip latency, job progress
  • Tracing: Distributed traces for debugging multi-node operations
  • Logs: Structured logging with node/store context

These signals enable operators to detect node failures, network partitions, and resource exhaustion.

Enterprise Features & Extensions

Relevant Files
  • pkg/backup/backup_job.go
  • pkg/backup/restore_job.go
  • pkg/ccl/changefeedccl/doc.go
  • pkg/ccl/changefeedccl/sink.go
  • pkg/multitenant/doc.go
  • pkg/crosscluster/physical/stream_ingestion_job.go
  • pkg/crosscluster/logical/logical_replication_job.go

CockroachDB's enterprise features, housed in /pkg/ccl/, provide critical capabilities for production deployments: backup & restore, changefeeds, multi-tenancy, and cross-cluster replication. These features are built on top of the core distributed SQL and KV layers.

Backup & Restore

Backup exports a snapshot of every KV entry into non-overlapping SSTable files stored in cloud storage or local filesystems. The system supports full backups, incremental backups, and point-in-time restore.

Key Components:

  • BackupJob (pkg/backup/backup_job.go): Orchestrates distributed backup execution across cluster nodes
  • BackupDataProcessor: Each node exports assigned key ranges to SSTable format
  • BackupManifest: Metadata proto containing timestamps, descriptors, span mappings, and file paths
  • Encryption: Optional KMS-based encryption at rest with key rotation support
  • Compaction: Merges incremental backups into consolidated full backups

The backup flow uses distributed SQL execution: the coordinator plans spans, workers export data in parallel, and progress is checkpointed to enable resumption on failure.

Changefeeds

Changefeeds emit KV events on user-specified tables to external sinks, enabling real-time data streaming and event-driven architectures.

Architecture:

  • kvfeed: Coordinates rangefeed consumption with schema tracking. Holds events until schema is known at the event's timestamp
  • changeAggregator: Reads KV events from kvfeed, encodes them, and emits to sink. Forwards resolved timestamps to changeFrontier
  • changeFrontier: Tracks high-watermark of resolved timestamps across spans. Periodically checkpoints progress to the job system
  • Sinks: Kafka, Pub/Sub, Webhook, Cloud Storage, SQL, Pulsar, and null sink for testing

Event Types: Changes include INSERT, UPDATE, DELETE with optional "before" and "after" values. Resolved timestamps guarantee no new events will appear at or below that timestamp.
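
A consumer only needs to understand the emitted envelope. The sketch below classifies messages assuming a JSON envelope with the "before"/"after" row images and "resolved" timestamps described above; the transport (Kafka, webhook, etc.) is out of scope:

// Sketch of classifying changefeed messages from a JSON envelope
package changefeedsketch

import "encoding/json"

type Envelope struct {
  Before   json.RawMessage `json:"before,omitempty"`
  After    json.RawMessage `json:"after,omitempty"`
  Resolved string          `json:"resolved,omitempty"`
}

// Classify reports whether a message is a resolved timestamp, a delete
// (no "after" image), or an insert/update.
func Classify(msg []byte) (string, error) {
  var e Envelope
  if err := json.Unmarshal(msg, &e); err != nil {
    return "", err
  }
  switch {
  case e.Resolved != "":
    return "resolved", nil
  case len(e.After) == 0 || string(e.After) == "null":
    return "delete", nil
  default:
    return "upsert", nil
  }
}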

Multi-Tenancy

Multi-tenant deployments allow multiple isolated SQL tenants to share a single CockroachDB cluster, each with its own keyspace, schema, and users.

Key Features:

  • Tenant Isolation: Each tenant has a dedicated key range and SQL codec
  • Tenant Creation: CreateTenantRecord allocates tenant ID and initializes keyspace
  • Cost Tracking: Tenant cost server tracks resource usage for billing
  • Capabilities: Per-tenant feature flags control access to enterprise features

Cross-Cluster Replication

Two replication modes enable disaster recovery and data distribution:

Physical Replication (pkg/crosscluster/physical/): Streams raw KV events from producer to consumer cluster. Ingestion job applies events, maintains frontier, and supports cutover to standby.

Logical Replication (pkg/crosscluster/logical/): Streams SQL-level changes with conflict resolution. Supports last-write-wins (LWW) and user-defined function (UDF) processors for custom conflict handling. Includes dead-letter queue for failed rows.

Both modes use the job system for resumable, fault-tolerant replication with protected timestamps to prevent GC of needed data.

Development, Build & Testing

Relevant Files
  • pkg/cmd/dev/dev.go - Main dev tool implementation
  • pkg/cmd/dev/build.go - Build command logic
  • pkg/cmd/dev/test.go - Test command logic
  • pkg/sql/logictest/logic.go - SQL logic test framework
  • pkg/cmd/roachtest/test/test_interface.go - Roachtest interface
  • build/github/unit-tests.sh - CI unit test runner
  • GNUmakefile - Build system entry point

CockroachDB uses a unified development workflow powered by the ./dev tool, a wrapper around Bazel. It streamlines building, testing, and code generation across the entire codebase.

The ./dev Tool

The ./dev script is the primary interface for developers. It automatically builds and runs the dev binary with supplied arguments. Key commands include:

./dev build cockroach          # Build full binary
./dev build short              # Build without UI (faster)
./dev test pkg/sql             # Run unit tests
./dev testlogic                # Run SQL logic tests
./dev generate go              # Generate code (protos, stringer, etc.)
./dev lint                      # Run all linters
./dev doctor                    # Verify environment setup

Build System

Bazel is the primary build system, configured through .bazelrc and wrapped by the dev tool. Key build targets include:

  • cockroach - Full binary with UI
  • cockroach-short - Binary without UI (faster for development)
  • roachtest - Integration test runner
  • workload - Load testing tool
  • optgen - Query optimizer code generator

Build configurations support cross-compilation via --cross flag (linux, linuxarm, macos, macosarm, windows) and distributed caching for faster builds.

Testing Infrastructure

CockroachDB has comprehensive testing at multiple levels:

Unit Tests - Standard Go tests throughout /pkg/ packages, run via dev test pkg/[package]. Supports filtering with -f=TestName*, race detection with --race, and stress testing with --stress.

Logic Tests - SQL correctness tests using dev testlogic. Tests run against multiple cluster configurations (local, fakedist, 5node, etc.) to verify SQL semantics. Supports filtering by file (--files=pattern) and subtest (--subtests=ids).
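
For example, the flags above combine as follows (test, file, and subtest names are placeholders):

./dev test pkg/sql -f=TestFoo* --race           # Filtered unit tests under the race detector
./dev test pkg/kv --stress                      # Stress a package's tests repeatedly
./dev testlogic --files=join --subtests=inner   # Logic tests filtered by file and subtest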

Roachtests - Distributed system integration tests in pkg/cmd/roachtest/tests/. These tests run against real clusters and verify end-to-end functionality. Organized by suite (acceptance, nightly, weekly, etc.).

Acceptance Tests - End-to-end tests in pkg/acceptance/ that verify driver compatibility and basic functionality.

Linting - Code quality checks via dev lint (full) or dev lint --short (fast subset).

CI/CD Pipeline

GitHub Actions runs automated tests on every commit:

  • Unit Tests - bazel test //pkg:all_tests with up to 200 parallel jobs
  • Acceptance Tests - Validates driver compatibility
  • Lint Tests - Code style and static analysis
  • Coverage - Generates LCOV coverage reports for changed files

The build system uses EngFlow for distributed caching and remote execution in CI, significantly reducing build times.

Code Generation

Generated code must be kept in sync with source definitions:

./dev generate go              # Protos, stringer, etc.
./dev generate bazel           # BUILD.bazel files
./dev generate protobuf        # Protocol buffer definitions

Always run ./dev generate after modifying .proto files, SQL grammar, or optimizer rules.

Development Workflow

  1. Setup - Run ./dev doctor to verify dependencies
  2. Build - Use ./dev build short for iterative development
  3. Test - Run ./dev test pkg/[package] for unit tests
  4. Generate - Run ./dev generate go after schema changes
  5. Lint - Run ./dev lint --short before committing (full lint is slow)