facebook/rocksdb

RocksDB Wiki

Last updated on Dec 16, 2025 (Commit: 9065ace)

Overview

Relevant Files
  • README.md
  • include/rocksdb/db.h
  • include/rocksdb/options.h
  • db/db_impl/db_impl.h
  • db/db_impl/db_impl_write.cc
  • db/compaction/compaction_job.cc

RocksDB is a high-performance, persistent key-value store library optimized for flash and RAM storage. Developed and maintained by Meta (formerly Facebook), it builds on LevelDB and is designed to handle massive datasets with multi-threaded compactions, making it ideal for systems storing multiple terabytes of data.

Core Architecture

RocksDB uses a Log-Structured-Merge (LSM) tree design that provides flexible tradeoffs between three critical factors:

  • Write-Amplification-Factor (WAF) - How many times data is rewritten during compaction
  • Read-Amplification-Factor (RAF) - How many disk seeks are needed per read operation
  • Space-Amplification-Factor (SAF) - Overhead from storing redundant data across levels


Key Components

Memtable - In-memory data structure (default: SkipList) that buffers writes. When it reaches a size threshold (typically 64 MB), it becomes immutable and is flushed to disk as an SST file.

Write-Ahead Log (WAL) - Ensures durability by logging all writes to disk before they are applied to the memtable. On recovery, the WAL is replayed to restore writes that had not yet been flushed to SST files.

SST Files - Sorted String Table files stored on disk in a multi-level hierarchy. Each level has a target size, and files are organized to optimize both reads and writes.

Compaction - Background process that merges SST files from one level into the next, removing obsolete entries and maintaining read performance. RocksDB supports multi-threaded compactions for efficiency.

Column Families - Logical partitions within a single database, each with independent options and memtables. Useful for separating different data types or access patterns.

Write Path

  1. User calls DB::Put() or DB::Write() with a write batch
  2. Data is written to the WAL for durability
  3. Data is inserted into the active memtable with a sequence number
  4. When memtable reaches size limit, it is flushed to Level 0 as an SST file
  5. Compaction gradually moves data through levels, optimizing for read performance
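
A minimal sketch of this path from the application side, assuming an open DB* db and the rocksdb namespace (error handling omitted):

// Single write: recorded in the WAL (if enabled), then inserted into the active memtable
WriteOptions wopts;
wopts.sync = true;  // fsync the WAL before acknowledging, for stronger durability
db->Put(wopts, "key1", "value1");

// Batched write: the whole batch is committed atomically under one WAL record
WriteBatch batch;
batch.Put("key2", "value2");
batch.Delete("key1");
db->Write(wopts, &batch);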

Read Path

  1. User calls DB::Get() with a key
  2. Search begins in the active memtable
  3. If not found, search immutable memtables
  4. If still not found, search SST files starting from Level 0 through higher levels
  5. Block cache accelerates repeated reads of the same data blocks
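
The corresponding lookup, again assuming an open DB* db; the key is either found along this path or Status::NotFound is returned:

std::string value;
Status s = db->Get(ReadOptions(), "key2", &value);
if (s.IsNotFound()) {
  // key is absent from the memtables and every SST level
}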

Configuration & Tuning

RocksDB provides extensive options for tuning performance:

  • DBOptions - Database-level settings (WAL, compaction threads, cache size)
  • ColumnFamilyOptions - Per-column-family settings (compression, bloom filters, compaction style)
  • ReadOptions / WriteOptions - Per-operation settings (consistency level, durability guarantees)

Optimization helpers like OptimizeForSmallDb(), OptimizeForPointLookup(), and OptimizeLevelStyleCompaction() provide sensible defaults for common workloads.
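
A sketch of applying these helpers (each targets a different workload, so treat them as alternatives; the values shown are illustrative):

Options options;

// Leveled-compaction workload: derive memtable and level sizes from a memory budget
options.OptimizeLevelStyleCompaction(512 << 20 /* memtable memory budget */);

// Get-heavy workload: bloom filters, hash index, block cache sized in MB
// options.OptimizeForPointLookup(64);

// Tiny databases: conservative memory settings
// options.OptimizeForSmallDb();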

Architecture & LSM Tree Design

Relevant Files
  • db/version_set.h - Version and VersionSet management
  • db/version_set.cc - Version storage info and compaction scoring
  • db/memtable.h - In-memory data structure interface
  • db/compaction/compaction.h - Compaction metadata and logic
  • db/version_edit.h - File metadata and version edits

RocksDB uses a Log-Structured Merge (LSM) tree to organize data across multiple levels on disk. This architecture optimizes write performance through sequential I/O while maintaining efficient reads through multi-level organization.

LSM Tree Structure

The LSM tree consists of multiple levels (typically 0-6), each with exponentially increasing capacity:

  • Level 0 (L0) - Receives freshly flushed memtables as SST files. Files may overlap in key ranges. When L0 reaches a threshold (default 4 files), compaction is triggered.
  • Levels 1-N - Each level has a target size (e.g., L1 = 64 MB, L2 = 640 MB with 10x multiplier). Files within a level are non-overlapping and sorted by key range.
  • Last Level - The bottommost level stores all data that has been fully compacted. No further compaction occurs here.
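
These thresholds map to a handful of options; a sketch with commonly cited values (actual defaults vary by version):

Options options;
options.write_buffer_size = 64 << 20;             // memtable size before it becomes immutable
options.level0_file_num_compaction_trigger = 4;   // L0 file count that triggers compaction
options.max_bytes_for_level_base = 256 << 20;     // target size of L1
options.max_bytes_for_level_multiplier = 10;      // each level is ~10x the previous one
options.num_levels = 7;                           // L0 through L6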

Memtable to Disk Flow

  1. Active Memtable - Accepts all writes in a SkipList structure (default). When it reaches write_buffer_size (typically 64 MB), it becomes immutable.
  2. Immutable Memtables - Queued for flushing to disk. Multiple immutable memtables can exist simultaneously.
  3. Flush to L0 - The immutable memtable is converted to an SST file and placed in Level 0. A VersionEdit records this change in the manifest.
  4. Version Update - The VersionSet creates a new Version reflecting the updated LSM state.

Compaction Process

Compaction merges overlapping files from adjacent levels to maintain the LSM invariant:

Compaction picks files from level N and level N+1
↓
Merges them while removing obsolete entries (deletes, old versions)
↓
Writes output files to level N+1
↓
Updates Version with new file metadata

Compaction Scoring - VersionStorageInfo::ComputeCompactionScore() calculates which level needs compaction most urgently based on:

  • Ratio of level size to target size
  • Number of L0 files
  • Files marked for compaction

Version Management

A Version is an immutable snapshot of the LSM tree state at a point in time:

  • VersionStorageInfo - Stores file metadata per level, compaction scores, and file indexing structures.
  • VersionSet - Maintains a linked list of versions. The current version is used for reads; older versions support iterators and snapshots.
  • VersionEdit - Records changes (file additions/deletions) applied to create a new version.

The manifest file persists all version edits for recovery.

Key Optimizations

  • Trivial Move - If a file doesn't overlap with the output level, it's moved without rewriting.
  • Compaction Picker - Selects which level to compact based on scoring and policy (leveled vs. universal).
  • Bloom Filters & Indexes - Cached in memory to accelerate point lookups across levels.

Write Path & Memtable Management

Relevant Files
  • db/db_impl/db_impl_write.cc
  • db/write_batch.cc
  • db/memtable.h and db/memtable.cc
  • memtable/skiplist.h
  • memtable/hash_skiplist_rep.cc
  • memtable/write_buffer_manager.cc
  • db/write_thread.cc

Write Path Overview

RocksDB's write path is optimized for high throughput through batching and parallel memtable writes. When a write operation (Put, Delete, Merge) is submitted, it goes through several coordinated stages: WAL (Write-Ahead Log) writing, memtable insertion, and optional post-processing.

Write Thread Coordination

The write thread system uses a leader-follower pattern to batch multiple writes together. When WriteImpl() is called, the writer joins a batch group via JoinBatchGroup(). The first writer becomes the group leader and coordinates WAL writes for all followers. This batching significantly reduces lock contention and improves throughput.

Key states include:

  • STATE_GROUP_LEADER: Coordinates WAL writes for the batch group
  • STATE_PARALLEL_MEMTABLE_WRITER: Parallel memtable insertion when enabled
  • STATE_MEMTABLE_WRITER_LEADER: Leads memtable write phase in pipelined mode

Memtable Insertion

Once a batch is assigned sequence numbers, WriteBatchInternal::InsertInto() applies each operation to the appropriate memtable. The insertion process:

  1. Encodes the key-value pair with internal metadata (sequence number, value type)
  2. Allocates space in the memtable's arena
  3. Inserts into the memtable representation (SkipList, HashSkipList, etc.)
  4. Updates bloom filters and statistics

For concurrent writes, InsertConcurrently() uses lock-free techniques on the underlying data structure. The SkipList and InlineSkipList support concurrent insertion via compare-and-swap operations, allowing multiple threads to insert simultaneously without global locks.
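
This behavior is controlled by DBOptions flags; a sketch (real option names, illustrative combination):

Options options;
options.allow_concurrent_memtable_write = true;  // parallel memtable insertion (requires a skiplist-based memtable)
options.enable_pipelined_write = true;           // separate WAL and memtable write stages
options.unordered_write = false;                 // keep the default, stronger ordering guarantee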

Memtable Representations

RocksDB supports multiple memtable data structures:

  • SkipList (default): Ordered structure enabling efficient range queries and prefix seeks
  • HashSkipList: Combines hash table bucketing with skip lists for faster prefix-based lookups
  • VectorRep: Simple vector for small memtables or specific workloads
  • InlineSkipList: Memory-optimized variant storing nodes inline with keys

Each representation implements the MemTableRep interface with Insert(), InsertConcurrently(), and iterator methods.
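
The representation is selected per column family through memtable_factory; a sketch switching to HashSkipList, which needs a prefix extractor for its buckets:

Options options;
options.prefix_extractor.reset(NewFixedPrefixTransform(8));   // bucket keys by an 8-byte prefix
options.memtable_factory.reset(NewHashSkipListRepFactory());  // default is SkipListFactory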

Write Buffer Management

The WriteBufferManager tracks total memtable memory usage across all column families. When memory usage exceeds configured thresholds, it triggers memtable flushing. Key features:

  • Tracks both active (mutable) and total memory usage
  • Supports stalling writes when buffer is full
  • Integrates with block cache for memory charging
  • Maintains a queue of stalled writers to resume when space becomes available
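
A sketch of sharing one write buffer budget across column families (or databases), optionally charging memtable memory to the block cache:

auto block_cache = NewLRUCache(1 << 30);  // 1 GB block cache
auto wbm = std::make_shared<WriteBufferManager>(
    256 << 20 /* total memtable budget */, block_cache /* charge memtables against this cache */);

Options options;
options.write_buffer_manager = wbm;  // the same manager can be passed to multiple DB instances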

Memtable Lifecycle

Active memtables accept writes until they reach size limits or are explicitly switched. When SwitchMemtable() is called:

  1. Current memtable is marked immutable
  2. New memtable is created and becomes active
  3. Immutable memtable is added to the immutable list for flushing
  4. New WAL file may be created

Immutable memtables remain in memory until flushed to SST files. The MemTableList maintains both active and immutable memtables, with reference counting ensuring safe cleanup after flushing.

Write-Ahead Logging & Recovery

Relevant Files
  • db/wal_manager.h
  • db/log_writer.h
  • db/log_reader.h
  • db/log_format.h
  • db/db_impl/db_impl_open.cc
  • db/db_wal_test.cc

Write-Ahead Logging (WAL) is RocksDB's durability mechanism that ensures data consistency across crashes and failures. Every write operation is first recorded in a WAL file before being applied to the memtable, guaranteeing that committed data survives system failures.

WAL Architecture

The WAL system consists of three main components:

  1. Log Writer (log::Writer): Appends records to WAL files in a structured format with checksums and record types
  2. Log Reader (log::Reader): Reads and validates WAL records during recovery, handling fragmented records and corruption
  3. WAL Manager (WalManager): Manages multiple WAL files, handles purging obsolete logs, and provides recovery coordination

Log File Format

WAL files are divided into 32 KB blocks. Each record has a header containing:

  • CRC (4 bytes): Checksum for integrity verification
  • Size (2 bytes): Payload length
  • Type (1 byte): Record type (kFullType, kFirstType, kMiddleType, kLastType, etc.)
  • Log Number (4 bytes, recyclable format): Distinguishes records from different log writer instances

Records larger than available block space are fragmented across multiple blocks using kFirstType, kMiddleType, and kLastType markers.

Recovery Process

Recovery happens during database open and follows these stages:

Key Recovery Steps:

  1. WAL Discovery: Scan WAL directory and identify all log files
  2. Sorting: Order WAL files by log number to ensure sequential replay
  3. Filtering: Skip WALs that don't contain unflushed data (non-2PC mode)
  4. Record Reading: Use log::Reader to extract records with corruption handling
  5. Memtable Insertion: Apply records to column family memtables via WriteBatchInternal::InsertInto
  6. Conditional Flushing: Flush memtables if they exceed size limits or if avoid_flush_during_recovery is false
  7. Sequence Number Tracking: Update next sequence number to maintain consistency

WAL Recovery Modes

RocksDB supports four recovery modes controlled by wal_recovery_mode:

  • kTolerateCorruptedTailRecords: Ignores incomplete records at the end of the log (legacy LevelDB behavior)
  • kAbsoluteConsistency: Expects a clean shutdown; fails on any corrupted or incomplete record
  • kPointInTimeRecovery (default): Stops replay at the first inconsistency to prevent holes in the recovered data
  • kSkipAnyCorruptedRecords: Skips corrupted records and continues replay, at the risk of losing data
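
A sketch of selecting a mode at open time (the enum values are real; the combination is illustrative):

Options options;
options.wal_recovery_mode = WALRecoveryMode::kPointInTimeRecovery;
options.avoid_flush_during_recovery = true;  // keep recovered data in memtables for a faster open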

Corruption Handling

The recovery system handles corruption gracefully:

  • Checksum Verification: CRC32c checksums detect bit flips and corruption
  • Record Fragmentation: Detects incomplete fragmented records and handles them per recovery mode
  • Old Log Records: Identifies records from previous log writer instances and treats them as EOF
  • WAL Filters: Optional user-defined filters can skip, modify, or reject records during recovery

Memtable Flushing During Recovery

Recovery may trigger memtable flushes when:

  • Memtable reaches size limit during record insertion
  • avoid_flush_during_recovery is false (default)
  • Final memtable contains data after all WAL records are processed

Flushed data becomes Level-0 SST files, and sequence numbers are preserved for consistency.

WAL Purging and Lifecycle

WAL files progress through states:

  1. Alive: Currently receiving writes
  2. Archived: Closed but retained for recovery
  3. Obsolete: Safe to delete after data is flushed to SST files

WalManager::PurgeObsoleteWALFiles() removes obsolete logs based on:

  • Column family log numbers (data flushed to SST)
  • TTL settings (WAL_ttl_seconds)
  • Size limits (WAL_size_limit_MB)
  • Recycling configuration (recycle_log_file_num)
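
A sketch of the DBOptions that drive archiving and purging (real option names, illustrative values):

Options options;
options.WAL_ttl_seconds = 60 * 60;   // keep archived WALs for up to an hour
options.WAL_size_limit_MB = 1024;    // ...or until the archive exceeds 1 GB
options.recycle_log_file_num = 4;    // reuse up to 4 closed WAL files instead of deleting them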

Best Practices

  • Enable track_and_verify_wals for production systems to detect WAL corruption early
  • Use kPointInTimeRecovery for systems with disk controller caches
  • Set avoid_flush_during_recovery=true to speed up DB open, at the cost of keeping recovered data in memtables until the next flush
  • Monitor WAL directory size and configure appropriate TTL/size limits
  • Test recovery scenarios with fault injection to validate durability guarantees

Table Formats & SST Files

Relevant Files
  • table/format.h
  • table/block_based/block.h
  • table/block_based/block_based_table_reader.h
  • table/plain/plain_table_factory.h
  • table/cuckoo/cuckoo_table_builder.h
  • include/rocksdb/sst_file_writer.h

RocksDB supports multiple SST (Sorted String Table) file formats, each optimized for different use cases. The format determines how data is organized, compressed, and accessed within the file.

Block-Based Table (Default)

Block-based table is RocksDB's default and most widely used format. Data is organized into blocks with a configurable target size (4 KB by default), each containing multiple key-value pairs. This design enables efficient compression and caching.

Key characteristics:

  • Data divided into blocks with configurable size
  • Each block has a 5-byte trailer (1 byte compression type + 4 bytes checksum)
  • Supports compression (Snappy, LZ4, Zstd, etc.)
  • Block cache for frequently accessed blocks
  • Index and filter blocks for fast lookups
  • Suitable for disk and flash storage

Structure:

[Data Block 1] [Data Block 2] ... [Data Block N]
[Filter Block] [Index Block] [Metaindex Block]
[Footer]
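
A sketch of configuring the block-based format (real option names, illustrative values):

BlockBasedTableOptions table_opts;
table_opts.block_size = 16 * 1024;                           // target block size, not a hard limit
table_opts.block_cache = NewLRUCache(512 << 20);             // cache for uncompressed blocks
table_opts.filter_policy.reset(NewBloomFilterPolicy(10.0));  // ~10 bits per key

Options options;
options.table_factory.reset(NewBlockBasedTableFactory(table_opts));
options.compression = kLZ4Compression;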

Plain Table Format

Plain table is optimized for memory-mapped file systems and in-memory databases. It stores data sequentially without block organization, enabling very fast access patterns.

Key characteristics:

  • No block structure — data stored sequentially
  • No compression support
  • No checksums
  • Fixed or variable-length keys
  • Prefix-based hash indexing available
  • Best for tmpfs and pure-memory workloads

Limitations:

  • Not recommended for persistent storage
  • Higher memory overhead for large datasets
  • Limited query optimization

Cuckoo Table Format

Cuckoo hashing provides an alternative format using hash-based lookups instead of binary search. It offers O(1) lookup time with careful tuning.

Key characteristics:

  • Hash-based key lookup
  • Fixed bucket size
  • Configurable hash functions
  • Lower query latency for specific workloads
  • Requires careful parameter tuning

SST File Structure

All formats share a common footer structure that identifies the file type and contains metadata:

Magic Number (8 bytes) - Identifies format type
Format Version (4 bytes) - Version within format
Checksum Type (1 byte) - Checksum algorithm used
Metaindex Handle - Pointer to metaindex block
Index Handle - Pointer to index block (format-dependent)

The magic number distinguishes between formats:

  • Block-based: kBlockBasedTableMagicNumber
  • Plain: kPlainTableMagicNumber
  • Cuckoo: kCuckooTableMagicNumber

Creating SST Files with SstFileWriter

The SstFileWriter API allows external SST file creation for bulk ingestion:

SstFileWriter writer(env_options, options);
writer.Open(file_path);
writer.Put(key, value);
writer.Finish();

All keys in externally-created files have sequence number 0, making them suitable for bulk loading into a fresh database or specific levels.

Format Selection

Choose your format based on workload:

  • Block-based: General purpose, disk/flash storage, compression needed
  • Plain: In-memory databases, memory-mapped storage, maximum speed
  • Cuckoo: Specialized hash-based lookups, tuned workloads

Format is configured via TableFactory in column family options. Once chosen, all new SST files use that format, though RocksDB can read all formats simultaneously.
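
For example, switching to the plain table format, which relies on a prefix extractor for its hash index and works best with memory-mapped reads (a sketch; values are illustrative):

Options options;
options.allow_mmap_reads = true;  // plain table is designed for mmap-friendly storage
options.prefix_extractor.reset(NewFixedPrefixTransform(8));

PlainTableOptions plain_opts;
plain_opts.user_key_len = 16;     // fixed key length; the default is variable-length keys
options.table_factory.reset(NewPlainTableFactory(plain_opts));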

Caching & Block Cache Management

Relevant Files
  • cache/cache.cc
  • cache/lru_cache.h and cache/lru_cache.cc
  • cache/clock_cache.h and cache/clock_cache.cc
  • cache/secondary_cache.h
  • cache/compressed_secondary_cache.h and cache/compressed_secondary_cache.cc
  • cache/secondary_cache_adapter.h and cache/secondary_cache_adapter.cc
  • include/rocksdb/cache.h

RocksDB implements a sophisticated multi-tier caching strategy to optimize read performance. The system supports primary block caches with optional secondary (persistent) caches, enabling flexible memory-to-storage trade-offs.

Primary Cache Implementations

RocksDB provides two primary cache implementations, each optimized for different workload patterns:

LRU Cache - The traditional Least Recently Used cache divides the key space into shards (2^num_shard_bits), with each shard maintaining its own LRU list and mutex. This design reduces contention but requires exclusive access even for reads to update the LRU list. LRU Cache supports priority pools (high and low) to protect important entries from eviction during scans.

HyperClockCache (HCC) - A lock-free alternative specifically designed for block cache workloads under high concurrency. HCC uses a generalized CLOCK eviction algorithm with aging, where each entry maintains a countdown score. Most Lookup() and Release() operations are single atomic operations, making it superior for high-contention scenarios. HCC is now recommended over LRUCache for block cache use. It supports both fixed-size (with estimated_entry_charge) and automatic (estimated_entry_charge = 0) variants.

// Creating an LRU cache
auto lru_cache = NewLRUCache(capacity, num_shard_bits);

// Creating a HyperClockCache (estimated_entry_charge = 0 selects the automatic variant)
HyperClockCacheOptions hcc_opts(1024 * 1024 * 1024 /* capacity: 1GB */,
                                0 /* estimated_entry_charge */);
auto hcc_cache = hcc_opts.MakeSharedCache();

Secondary Cache Layer

The secondary cache provides a non-volatile tier for evicted blocks, reducing disk I/O on cache misses. When a block is evicted from the primary cache, it can be stored in the secondary cache. On subsequent lookups, if the block is not in the primary cache, RocksDB checks the secondary cache before reading from disk.

CompressedSecondaryCache - The primary secondary cache implementation uses LRU internally and compresses blocks to reduce memory footprint. It supports selective compression (e.g., excluding filter blocks) and custom split/merge for better memory allocator bin fitting.
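
A sketch of wiring a compressed secondary cache behind an LRU primary cache (real type and option names, illustrative sizes):

CompressedSecondaryCacheOptions sec_opts;
sec_opts.capacity = 256 << 20;                  // 256 MB compressed tier
sec_opts.compression_type = kLZ4Compression;

LRUCacheOptions lru_opts;
lru_opts.capacity = 1 << 30;                    // 1 GB primary tier
lru_opts.secondary_cache = NewCompressedSecondaryCache(sec_opts);

BlockBasedTableOptions table_opts;
table_opts.block_cache = NewLRUCache(lru_opts);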

Cache Entry Management

Cache entries are classified by role (data blocks, filter blocks, index blocks, etc.) to enable fine-grained statistics and selective compression. Each entry tracks:

  • Reference count - External references prevent eviction
  • Priority level - HIGH, LOW, or BOTTOM priority affects eviction order
  • Charge - Memory cost (can include metadata overhead)
  • Helper callbacks - Custom serialization/deserialization for secondary cache

Configuration & Tuning

Key configuration options:

  • capacity - Total cache size in bytes
  • num_shard_bits - Number of shards (2^num_shard_bits); negative values use defaults
  • strict_capacity_limit - If true, Insert() fails when over capacity
  • high_pri_pool_ratio / low_pri_pool_ratio - Fraction of cache reserved for each priority
  • metadata_charge_policy - Whether to charge cache metadata against capacity

For tiered caches, TieredCacheOptions distributes capacity between primary and secondary caches proportionally, with compressed_secondary_ratio controlling the split.

Transactions & Concurrency Control

Relevant Files
  • include/rocksdb/utilities/transaction_db.h
  • utilities/transactions/pessimistic_transaction_db.h
  • utilities/transactions/pessimistic_transaction.h
  • utilities/transactions/write_prepared_txn_db.h
  • utilities/transactions/write_unprepared_txn_db.h
  • utilities/transactions/lock/lock_manager.h

RocksDB provides two primary concurrency control models: pessimistic (lock-based) and optimistic (conflict detection at commit). Pessimistic transactions use locks to prevent conflicts, while optimistic transactions detect conflicts during the commit phase.

Pessimistic Transactions

Pessimistic transactions acquire locks on keys before modifying them. The PessimisticTransactionDB class manages lock acquisition, deadlock detection, and transaction lifecycle. Locks are managed by a pluggable LockManager that supports both point locks (individual keys) and range locks.

Key components:

  • PessimisticTransaction: Implements locking semantics with TryLock() for acquiring locks and automatic release on commit/rollback
  • LockManager: Handles lock acquisition, tracking, and deadlock detection
  • Lock striping: Configurable via num_stripes in TransactionDBOptions to reduce contention

Write Policies

RocksDB supports three write policies for pessimistic transactions, controlled by TxnDBWritePolicy:

WRITE_COMMITTED (default): Data is written to the database only after transaction commit. Simplest model but limits transaction size and throughput.

WRITE_PREPARED: Data is written after the prepare phase of two-phase commit (2PC). Enables higher throughput and larger transactions. Uses snapshot caching and commit tracking to distinguish committed from uncommitted data.

WRITE_UNPREPARED: Data is written before the prepare phase, flushing writes incrementally during the transaction. Provides maximum throughput for large transactions but requires careful rollback handling.
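
The policy is chosen when the TransactionDB is opened; a sketch (options and db_path as in the examples below):

TransactionDBOptions txn_db_opts;
txn_db_opts.write_policy = TxnDBWritePolicy::WRITE_PREPARED;  // default is WRITE_COMMITTED

TransactionDB* txn_db = nullptr;
TransactionDB::Open(options, txn_db_opts, db_path, &txn_db);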

Deadlock Detection

When deadlock_detect is enabled in TransactionOptions, the lock manager performs cycle detection in the wait-for graph. If a deadlock is detected, the transaction returns Status::Busy with kDeadlock subcode. The deadlock_timeout_us parameter controls when detection runs, allowing tuning between CPU usage and latency.

Snapshot Isolation

Transactions can establish a snapshot via SetSnapshot() to read a consistent view of the database. The snapshot is used for conflict detection and read validation. Timestamped snapshots enable point-in-time reads and recovery semantics.

Optimistic Transactions

Optimistic transactions skip locking and instead validate the read-set at commit time. Two validation policies exist: kValidateSerial (single-threaded validation in write group) and kValidateParallel (parallel validation before write group). This approach reduces lock contention but may have higher abort rates under high contention.
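
A sketch of the optimistic variant, where a conflict surfaces as a failed Commit() rather than a blocked lock acquisition:

OptimisticTransactionDB* otxn_db = nullptr;
OptimisticTransactionDB::Open(options, db_path, &otxn_db);

Transaction* txn = otxn_db->BeginTransaction(write_opts);
txn->Put("key1", "value1");
Status s = txn->Commit();  // read-set validation happens here
if (s.IsBusy()) {
  // another writer touched the read set; retry the transaction
}
delete txn;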

// Example: Pessimistic transaction with deadlock detection
TransactionOptions txn_opts;
txn_opts.deadlock_detect = true;
txn_opts.lock_timeout = 1000;  // 1 second
Transaction* txn = txn_db->BeginTransaction(write_opts, txn_opts);
txn->Put(cf, "key1", "value1");
txn->Put(cf, "key2", "value2");
Status s = txn->Commit();
delete txn;

Concurrency Control Optimization

Applications can skip concurrency control for known non-conflicting writes via TransactionDBWriteOptimizations::skip_concurrency_control. This is useful during recovery or when the application guarantees no conflicts. Large transaction commits can bypass memtable writes using commit_bypass_memtable for improved performance.

Utilities & Advanced Features

Relevant Files
  • include/rocksdb/utilities/backup_engine.h
  • utilities/backup/backup_engine_impl.h
  • include/rocksdb/utilities/checkpoint.h
  • include/rocksdb/utilities/options_util.h
  • options/options_parser.h
  • include/rocksdb/utilities/db_ttl.h
  • include/rocksdb/utilities/transaction_db.h
  • include/rocksdb/utilities/write_batch_with_index.h
  • include/rocksdb/utilities/sim_cache.h

RocksDB provides a rich set of utilities and advanced features that extend core functionality for specialized use cases. These utilities enable backup/restore, point-in-time snapshots, transaction support, TTL-based expiration, and performance analysis.

Backup & Recovery

BackupEngine provides incremental backup and restore capabilities with file deduplication:

BackupEngineOptions backup_opts("/path/to/backups");
BackupEngine* backup_engine;
BackupEngine::Open(env, backup_opts, &backup_engine);

backup_engine->CreateNewBackup(db);
backup_engine->RestoreDBFromLatestBackup(db_path, wal_dir);

Key features include share_table_files for incremental backups, rate limiting, and checksum verification. Checkpoint creates openable snapshots without stopping the database:

Checkpoint* checkpoint;
Checkpoint::Create(db, &checkpoint);
checkpoint->CreateCheckpoint("/path/to/checkpoint");

Checkpoints hard-link SST files on the same filesystem or copy them otherwise, making them suitable for point-in-time recovery.

Options Management

OptionsUtil enables persisting and loading database configurations:

LoadLatestOptions(config_opts, db_path, &db_options, &cf_descs);
LoadOptionsFromFile(config_opts, options_file, &db_options, &cf_descs);
CheckOptionsCompatibility(config_opts, db_path, db_options, cf_descs);

Options are automatically persisted to INI files on DB::Open(), SetOptions(), and column family operations. This ensures configuration consistency across restarts.

Time-To-Live (TTL) Support

DBWithTTL automatically expires keys after a specified duration:

DBWithTTL* db;
DBWithTTL::Open(options, db_path, &db, ttl_seconds);
db->Put(write_opts, key, value);  // Expires after ttl_seconds

Timestamps are appended to values internally. Expired entries are removed during compaction, though Get/Iterator may return stale entries until compaction runs.

Transactions

TransactionDB supports ACID transactions with three write policies:

  • WRITE_COMMITTED: Data written at commit time (default, most stable)
  • WRITE_PREPARED: Data written after 2PC prepare phase (experimental)
  • WRITE_UNPREPARED: Data written before prepare phase (experimental)

TransactionDB* txn_db;
TransactionDB::Open(options, txn_opts, db_path, &txn_db);

Transaction* txn = txn_db->BeginTransaction(write_opts);
txn->Put(key, value);
txn->Commit();

Supports optimistic and pessimistic concurrency control, deadlock detection, and timestamped snapshots.

WriteBatchWithIndex

WriteBatchWithIndex provides searchable write batches with point-in-time iteration:

WriteBatchWithIndex batch;
batch.Put(key1, value1);
batch.Put(key2, value2);

WBWIIterator* iter = batch.NewIterator();  // iterates only the batch contents
iter->Seek(key1);  // Search within batch

Useful for transactions and applications needing to query uncommitted writes before applying them.

Cache Simulation

SimCache predicts block cache hit rates without allocating actual memory:

auto sim_cache = NewSimCache(real_cache, sim_capacity, num_shard_bits);
// Use sim_cache as the block cache; it simulates hit/miss behavior for sim_capacity
uint64_t hits = sim_cache->get_hit_counter();
uint64_t misses = sim_cache->get_miss_counter();
double hit_rate = static_cast<double>(hits) / (hits + misses);

Helps tune cache sizes and measure efficiency without production impact.

Additional Utilities

  • Merge Operators: String append, uint64 addition, max, bytes XOR for custom value merging
  • Compaction Filters: Remove empty values, compact on deletion thresholds
  • Secondary Indexes: Simple and FAISS-based vector indexing
  • Persistent Cache: Multi-tier caching with disk-backed secondary cache
  • Trace & Replay: Record and replay database operations for analysis

Column Families & Data Organization

Relevant Files
  • db/column_family.h
  • db/column_family.cc
  • db/memtable_list.h
  • include/rocksdb/metadata.h
  • examples/column_families_example.cc

Column families enable logical partitioning of data within a single RocksDB instance. Each column family maintains its own memtables, SST files, and compaction state, allowing independent configuration and lifecycle management while sharing the same write-ahead log (WAL) and database infrastructure.

Core Architecture

RocksDB uses three primary classes to manage column families:

ColumnFamilyHandle is the user-facing interface. Clients obtain handles through DB::Open() or DB::CreateColumnFamily() and use them to specify which column family to operate on for reads, writes, and deletes.

ColumnFamilyData (CFD) is the internal metadata container holding all state for a column family: the mutable memtable, immutable memtables list, current version (SST file structure), options, comparator, and compaction picker. Each CFD maintains reference counts to track active users and can be marked as dropped.

ColumnFamilySet is a global registry managed by DBImpl that maintains all active column families. It provides lookup by name or ID and manages the circular linked list of CFDs. The default column family always exists and cannot be dropped.

SuperVersion & Versioning

Each CFD maintains a SuperVersion that captures a point-in-time snapshot of the LSM-tree state:

struct SuperVersion {
  ColumnFamilyData* cfd;
  ReadOnlyMemTable* mem;           // Current mutable memtable
  MemTableListVersion* imm;        // Immutable memtables
  Version* current;                // Current SST file structure
  uint64_t version_number;         // Ordinal for this snapshot
};

SuperVersions enable lock-free reads. A reader acquires a reference to a SuperVersion and can safely access its memtables and SST files even if concurrent flushes or compactions modify the current state. Old SuperVersions remain alive until all references are released.

Memtable Management

MemTableList manages the mutable and immutable memtables for a CFD. When a memtable reaches its size limit, it becomes immutable and is added to the immutable list. MemTableListVersion is a snapshot of immutable memtables at a point in time, used by readers and compactions to ensure consistency.

Each memtable is assigned a unique ID for tracking flush progress. The immutable list maintains both unflushed memtables (pending flush) and flushed memtables (retained for transaction validation).

Lifecycle & Reference Counting

Column families use atomic reference counting. When a CFD is created, it starts with a reference count of zero. Each ColumnFamilyHandle increments the count in its constructor and decrements it in its destructor. When DropColumnFamily() is called, the CFD is marked as dropped but remains alive until all handles are destroyed.

void Ref() { refs_.fetch_add(1); }
bool UnrefAndTryDelete() {
  // Decrements refs and deletes CFD if refs reaches zero
}

Metadata & File Organization

Each column family has independent SST file organization across levels. ColumnFamilyMetaData provides a snapshot of the LSM-tree structure:

struct ColumnFamilyMetaData {
  uint64_t size;                    // Total bytes across all levels
  std::vector<LevelMetaData> levels; // Files per level
  std::vector<BlobMetaData> blob_files;
};

Files are organized by level, with level 0 containing recently flushed memtables and higher levels containing compacted data. Each CFD has its own compaction picker that decides which files to compact based on the column family's options.
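
A sketch of retrieving this snapshot at runtime:

ColumnFamilyMetaData meta;
db->GetColumnFamilyMetaData(&meta);  // default column family; pass a handle to query another
for (const auto& level : meta.levels) {
  std::cout << "L" << level.level << ": " << level.files.size()
            << " files, " << level.size << " bytes\n";
}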

Usage Example

// Open with multiple column families
std::vector<ColumnFamilyDescriptor> families;
families.push_back(ColumnFamilyDescriptor(
    kDefaultColumnFamilyName, ColumnFamilyOptions()));
families.push_back(ColumnFamilyDescriptor(
    "analytics", ColumnFamilyOptions()));

std::vector<ColumnFamilyHandle*> handles;
DB* db = nullptr;
DB::Open(DBOptions(), db_path, families, &handles, &db);

// Write to specific column family
db->Put(WriteOptions(), handles[1], key, value);

// Atomic writes across column families
WriteBatch batch;
batch.Put(handles[0], key1, value1);
batch.Put(handles[1], key2, value2);
db->Write(WriteOptions(), &batch);

Thread Safety

Most CFD operations require the DB mutex. Exceptions include thread-safe methods like GetID(), GetName(), and user_comparator(). SuperVersion access is lock-free via thread-local storage, enabling concurrent reads without mutex contention. The write thread holds the mutex during memtable writes and flush/compaction operations.

Monitoring, Tracing & Diagnostic Tools

Relevant Files
  • monitoring/statistics.h & monitoring/statistics_impl.h
  • monitoring/perf_context.h & monitoring/perf_context_imp.h
  • monitoring/iostats_context.h & monitoring/iostats_context_imp.h
  • monitoring/perf_step_timer.h
  • trace_replay/io_tracer.h
  • tools/db_bench_tool.cc
  • tools/ldb_cmd.cc

RocksDB provides a comprehensive monitoring and tracing infrastructure to measure performance, diagnose bottlenecks, and understand system behavior. The framework consists of three main components: Statistics, Performance Context, and IO Statistics Context, complemented by IO Tracing for detailed operation recording.

Statistics: Aggregate Metrics

The Statistics class tracks cumulative counters and histograms across the entire database. It records tickers (counters) and histograms (distributions) for events like block cache hits/misses, compaction operations, and write amplification.

Key features:

  • Per-core aggregation for lock-free performance
  • Configurable stats levels to control overhead
  • Ticker types: BLOCK_CACHE_HIT, BLOCK_CACHE_MISS, COMPACTION_KEY_DROP_*, etc.
  • Histogram types: DB_GET, DB_WRITE, COMPACTION_TIME, etc.

// Enable statistics collection
options.statistics = CreateDBStatistics();

// Query metrics
uint64_t hits = options.statistics->getTickerCount(BLOCK_CACHE_HIT);
HistogramData hist;
options.statistics->histogramData(DB_GET, &hist);

Performance Context: Thread-Local Timing

PerfContext is a thread-local structure that captures fine-grained timing information for individual operations. It measures time spent in specific code paths like block reads, memtable searches, and key comparisons.

Key metrics:

  • Block operations: block_read_time, block_read_cpu_time, block_cache_hit_count
  • Memtable operations: get_from_memtable_time, seek_on_memtable_time
  • Iterator operations: iter_next_cpu_nanos, iter_seek_cpu_nanos
  • Filesystem operations: env_new_sequential_file_nanos, env_delete_file_nanos

// Enable perf context
SetPerfLevel(PerfLevel::kEnableTimeExceptForMutex);

// Use PERF_TIMER_GUARD macro for automatic timing
{
  PERF_TIMER_GUARD(get_from_table_nanos);
  // code to measure
}

// Access results
std::cout << get_perf_context()->ToString();

IO Statistics Context: I/O Tracking

IOStatsContext tracks I/O operations at the filesystem level, including bytes read/written, operation counts, and latencies. It supports tiered storage metrics (hot, warm, cool, cold, ice files).

Key metrics:

  • Bytes: bytes_read, bytes_written
  • Latencies: open_nanos, read_nanos, write_nanos, fsync_nanos
  • CPU time: cpu_read_nanos, cpu_write_nanos
  • Temperature-based I/O: hot_file_bytes_read, warm_file_read_count, etc.

// Access thread-local IO stats
IOStatsContext* io_stats = get_iostats_context();
std::cout << io_stats->ToString();

IO Tracing: Detailed Operation Recording

IOTracer records detailed traces of I/O operations for offline analysis. Each trace captures timestamp, operation type, latency, file name, and offset information.

Diagnostic Tools

db_bench: Comprehensive benchmarking tool with built-in statistics collection and performance reporting.

ldb: Command-line database inspection tool supporting manifest dumps, SST inspection, and statistics queries.

io_tracer_parser: Converts binary IO trace files to human-readable format for performance analysis.

Performance Levels

Control monitoring overhead via PerfLevel:

  • kDisable: No monitoring
  • kEnableCount: Only counters
  • kEnableTimeExceptForMutex: Timing except mutex operations
  • kEnableTime: Full timing including mutexes
  • kEnableTimeAndCPUTimeExceptForMutex: Timing plus CPU time
  • kEnableTimeAndCPUTime: Full timing and CPU metrics
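
A sketch of measuring a single operation at one of these levels (assuming an open DB* db):

SetPerfLevel(PerfLevel::kEnableTimeExceptForMutex);
get_perf_context()->Reset();
get_iostats_context()->Reset();

std::string value;
db->Get(ReadOptions(), "key1", &value);

std::cout << get_perf_context()->ToString() << "\n"
          << get_iostats_context()->ToString() << std::endl;
SetPerfLevel(PerfLevel::kDisable);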