Overview
Relevant Files
- README.asc
- doc/modules/cassandra/pages/architecture/overview.adoc
- src/java/org/apache/cassandra/service/CassandraDaemon.java
- doc/modules/cassandra/pages/architecture/storage-engine.adoc
- src/java/org/apache/cassandra/cql3/QueryProcessor.java
- src/java/org/apache/cassandra/gms/Gossiper.java
Apache Cassandra is a highly-scalable, distributed NoSQL database designed for massive scale and high availability. It implements a partitioned wide-column storage model with eventually consistent semantics, combining the best ideas from Amazon's Dynamo and Google's Bigtable.
Core Design Philosophy
Cassandra was built to address emerging requirements for globally distributed systems requiring low-latency reads and writes across multiple datacenters. Unlike traditional relational databases, Cassandra explicitly trades off some consistency guarantees for availability and partition tolerance, making it ideal for applications that prioritize uptime and responsiveness over strict ACID semantics.
Key Design Objectives:
- Full multi-primary database replication across datacenters
- Global availability with low-latency reads and writes
- Linear scalability on commodity hardware
- Online load balancing and cluster growth without downtime
- Partitioned key-oriented queries for predictable performance
Architecture Overview
Data Organization
Cassandra organizes data hierarchically using CQL (Cassandra Query Language), an SQL-like interface:
- Keyspace: A container for tables that specifies replication strategy and replication factor across datacenters
- Table: Composed of rows and columns with flexible schema; partitioned by partition key
- Partition: A group of rows sharing the same partition key; the partition key (the mandatory part of the primary key) determines which node stores the data
- Row: A collection of columns identified by a unique primary key (partition key + clustering keys)
- Column: A single typed datum within a row
Storage Engine
Cassandra uses a Log Structured Merge (LSM) tree architecture optimized for write-heavy workloads:
- Commit Log: Append-only write-ahead log ensuring durability
- Memtable: In-memory write-back cache buffering writes per table
- SSTable: Immutable sorted files on disk containing flushed data
- Compaction: Background process merging SSTables to optimize reads
The write path is highly optimized for throughput, while reads benefit from Bloom filters and index structures within SSTables.
Distributed Coordination
Cassandra uses a gossip protocol for cluster membership and failure detection. Nodes periodically exchange state information through three-message gossip rounds, enabling:
- Automatic detection of node failures
- Cluster topology awareness
- Metadata propagation across the cluster
- Decentralized cluster management without a single point of failure
Query Execution
The QueryProcessor handles CQL statement parsing, validation, and execution. Queries are routed through a replication strategy that determines which replicas should handle the request based on consistency level requirements. The system supports various consistency levels from ONE (fastest, least consistent) to ALL (slowest, most consistent).
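As a concrete illustration of per-request consistency levels, the sketch below uses the DataStax Java driver 4.x (a separate client library, not part of this codebase) to issue a QUORUM read; the keyspace and table names are hypothetical.

```java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
import com.datastax.oss.driver.api.core.cql.ResultSet;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class QuorumReadExample {
    public static void main(String[] args) {
        // Connects to a locally running node; ks.users is a hypothetical table.
        try (CqlSession session = CqlSession.builder().build()) {
            SimpleStatement read = SimpleStatement.builder(
                    "SELECT * FROM ks.users WHERE user_id = ?")
                .addPositionalValue(42L)
                // QUORUM: the coordinator waits for a majority of replicas before answering.
                .setConsistencyLevel(DefaultConsistencyLevel.QUORUM)
                .build();
            ResultSet rows = session.execute(read);
            rows.forEach(row -> System.out.println(row.getFormattedContents()));
        }
    }
}
```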
Advanced Features
Cassandra supports sophisticated data modeling and querying capabilities:
- Collection types (sets, maps, lists)
- User-defined types, functions, and aggregates
- Storage-Attached Indexing (SAI) for secondary indexes
- Single-partition lightweight transactions with atomic compare-and-set semantics
- Materialized views (experimental)
Intentional Limitations
Cassandra explicitly avoids operations requiring cross-partition coordination, as these are difficult to provide with high availability guarantees:
- No cross-partition transactions
- No distributed joins
- No foreign keys or referential integrity
This design choice enables Cassandra to maintain its availability and partition tolerance guarantees while scaling to massive datasets.
Architecture & Core Design
Relevant Files
- doc/modules/cassandra/pages/architecture/dynamo.adoc
- doc/modules/cassandra/pages/architecture/storage-engine.adoc
- doc/modules/cassandra/pages/architecture/guarantees.adoc
- src/java/org/apache/cassandra/tcm/ClusterMetadata.java
- src/java/org/apache/cassandra/dht/IPartitioner.java
Cassandra's architecture is built on three foundational pillars: distributed data partitioning, multi-master replication, and a write-optimized storage engine. These design principles enable Cassandra to scale horizontally across commodity hardware while maintaining high availability and durability.
Distributed Data Partitioning
Cassandra uses consistent hashing to distribute data across nodes without requiring a central coordinator. Each node is assigned one or more tokens on a continuous hash ring. When a key is written, it is hashed to a token position, and the data is stored on the nodes that own that token range.
Virtual Nodes (vnodes) allow a single physical node to own multiple token positions on the ring. This enables:
- Even data distribution when adding or removing nodes
- Balanced query load across the cluster
- Simplified cluster scaling with minimal data movement
The IPartitioner interface defines how keys are hashed to tokens. The default Murmur3Partitioner provides uniform distribution across the ring.
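A minimal sketch of the token-ring idea, assuming a TreeMap-based ring and a stand-in hash function rather than Cassandra's actual Murmur3Partitioner; all class and method names here are illustrative:

```java
import java.nio.charset.StandardCharsets;
import java.util.*;

public class TokenRingSketch {
    private final NavigableMap<Long, String> ring = new TreeMap<>();

    // Assign several tokens (vnodes) to each physical node.
    void addNode(String node, int vnodes) {
        Random rnd = new Random(node.hashCode());
        for (int i = 0; i < vnodes; i++)
            ring.put(rnd.nextLong(), node);
    }

    // Hash the partition key to a token and walk clockwise to the owning vnode.
    String ownerOf(String partitionKey) {
        long token = hash(partitionKey);
        Map.Entry<Long, String> e = ring.ceilingEntry(token);
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    private static long hash(String key) {
        // Stand-in hash; Cassandra uses Murmur3 for uniform token distribution.
        byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
        long h = 1125899906842597L;
        for (byte b : bytes) h = 31 * h + b;
        return h;
    }

    public static void main(String[] args) {
        TokenRingSketch ring = new TokenRingSketch();
        for (String n : List.of("node1", "node2", "node3")) ring.addNode(n, 16);
        System.out.println("key 'alice' -> " + ring.ownerOf("alice"));
    }
}
```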
Multi-Master Replication
Every partition is replicated to multiple nodes determined by the replication strategy and replication factor (RF). With RF=3, each partition exists on three distinct nodes, allowing the cluster to tolerate node failures.
Consistency Levels provide tunable consistency through the R + W > RF principle:
- ONE, TWO, THREE: Fixed replica counts
- QUORUM: Majority of replicas (n/2 + 1)
- LOCAL_QUORUM: Majority in the local datacenter
- ALL: All replicas must respond
Writes are always sent to all replicas; the consistency level only controls how many responses the coordinator waits for before acknowledging the client.
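To make the tunable-consistency arithmetic concrete, here is an illustrative sketch of the "how many replicas must respond" calculation; it is a simplified single-datacenter view, not Cassandra's internal code:

```java
// Sketch of the coordinator's "block for" calculation under the R + W > RF rule.
public class BlockForSketch {
    enum Level { ONE, TWO, THREE, QUORUM, ALL }

    static int blockFor(Level level, int replicationFactor) {
        switch (level) {
            case ONE:    return 1;
            case TWO:    return 2;
            case THREE:  return 3;
            case QUORUM: return replicationFactor / 2 + 1;
            case ALL:    return replicationFactor;
            default:     throw new AssertionError();
        }
    }

    public static void main(String[] args) {
        int rf = 3;
        int reads  = blockFor(Level.QUORUM, rf);  // 2
        int writes = blockFor(Level.QUORUM, rf);  // 2
        // 2 + 2 > 3, so every quorum read overlaps the latest quorum write.
        System.out.println("strongly consistent: " + (reads + writes > rf));
    }
}
```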
Storage Engine: Log-Structured Merge Trees
Cassandra's storage engine is optimized for write-heavy workloads using Log-Structured Merge (LSM) trees:
- CommitLog: Append-only write-ahead log ensuring durability
- Memtable: In-memory buffer for recent writes, sorted by key
- SSTable: Immutable sorted files on disk containing flushed data
When a memtable reaches its size limit, it is flushed to disk as an SSTable. Compaction merges multiple SSTables, removing obsolete data and improving read performance. Bloom filters accelerate reads by quickly determining if a key exists in an SSTable.
Cluster Metadata & Membership
The ClusterMetadata class maintains the authoritative cluster state, including:
- Directory: Node identity, location, and addressing
- TokenMap: Mapping of tokens to physical endpoints
- DataPlacements: Replica assignments for each keyspace
- DistributedSchema: Table definitions and replication parameters
Gossip protocol propagates cluster state changes (node joins, failures, schema updates) in a decentralized manner. Every second, each node exchanges state with a random peer, ensuring eventual consistency of cluster membership.
Consistency Guarantees
Cassandra provides:
- Eventual Consistency: All replicas converge to the same state
- Durability: Data persists via replication and commit logs
- High Availability: Continues operating despite node failures
- Lightweight Transactions: Paxos-based compare-and-set operations for linearizable consistency
Query Processing & CQL
Relevant Files
- src/java/org/apache/cassandra/cql3/QueryProcessor.java
- src/java/org/apache/cassandra/cql3/CQLStatement.java
- src/java/org/apache/cassandra/cql3/statements/SelectStatement.java
- src/java/org/apache/cassandra/cql3/statements/ModificationStatement.java
- src/java/org/apache/cassandra/cql3/statements/BatchStatement.java
- src/java/org/apache/cassandra/service/StorageProxy.java
Overview
Query processing in Cassandra transforms a CQL string into a distributed execution plan. The pipeline consists of three main stages: parsing (text to AST), preparation (semantic validation and optimization), and execution (distributed coordination). Each stage is handled by specialized components that work together to ensure correctness and performance.
Parsing & Preparation Pipeline
The journey begins when a CQL query string arrives at QueryProcessor.parseStatement(). This method uses ANTLR-generated parsers to tokenize and parse the query into a CQLStatement.Raw object. The grammar (defined in src/antlr/Parser.g) recognizes all CQL statement types: SELECT, INSERT, UPDATE, DELETE, BATCH, and DDL operations.
After parsing, the raw statement enters the preparation phase via CQLStatement.Raw.prepare(ClientState). This is where semantic validation occurs:
- Keyspace resolution: Statements are qualified with the current keyspace from ClientState
- Schema validation: Column names, table names, and types are verified against the schema
- Bind variable analysis: Placeholder markers (?) are catalogued and typed
- Authorization checks: Permissions are validated for the user
The result is a fully-prepared CQLStatement object, ready for execution.
Statement Types & Execution
Cassandra supports multiple statement categories, each implementing the CQLStatement interface:
- SelectStatement: Queries data via StorageProxy.read(), supporting pagination and aggregation
- ModificationStatement (INSERT, UPDATE, DELETE): Generates PartitionUpdate mutations and routes them via StorageProxy.mutate()
- BatchStatement: Coordinates multiple mutations with optional Paxos consensus for LWT (Lightweight Transactions)
- Schema statements: Alter cluster metadata (CREATE TABLE, DROP KEYSPACE, etc.)
Each statement type implements execute(QueryState, QueryOptions, RequestTime) to handle distributed coordination.
Key Components
QueryProcessor: Singleton managing the full lifecycle. Maintains caches for prepared statements (using MD5 digest IDs) and internal queries. Handles both client-facing and internal query paths.
QueryOptions: Encapsulates execution parameters—consistency level, bind values, page size, timestamp. Prepared via options.prepare(bindVariables) to bind parameter values.
ClientState: Tracks user identity, current keyspace, and permissions. Used during preparation to resolve keyspace-qualified names and enforce authorization.
StorageProxy: Routes mutations and reads to appropriate replicas based on consistency level and token distribution. Handles read repair, write acknowledgments, and Paxos consensus for LWT.
Prepared Statement Caching
Prepared statements are cached by QueryProcessor using an MD5 digest of the query string (optionally including keyspace). The cache reduces parsing and preparation overhead for frequently-executed queries. Clients can explicitly prepare statements via the native protocol, receiving a statement ID for reuse.
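A hedged client-side view of the same mechanism, assuming the DataStax Java driver 4.x (not part of this repository) and a hypothetical ks.events table: the statement is prepared once, and subsequent executions send only the statement id and bind values.

```java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.BoundStatement;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;

public class PreparedStatementExample {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // Parsed and prepared once server-side; cached by QueryProcessor under its digest id.
            PreparedStatement insert = session.prepare(
                "INSERT INTO ks.events (id, payload) VALUES (?, ?)");
            for (int i = 0; i < 10; i++) {
                BoundStatement bound = insert.bind(java.util.UUID.randomUUID(), "event-" + i);
                session.execute(bound); // only the statement id and bind values go over the wire
            }
        }
    }
}
```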
Consistency & Coordination
Every statement execution includes a ConsistencyLevel parameter that determines replica coordination:
- Read: Waits for responses from N replicas before returning
- Write: Waits for N replicas to acknowledge before returning
- Paxos: For LWT, uses consensus to ensure linearizability
The Dispatcher.RequestTime tracks enqueue and start times for metrics and timeout enforcement.
Storage Engine & Data Persistence
Relevant Files
- src/java/org/apache/cassandra/db/ColumnFamilyStore.java
- src/java/org/apache/cassandra/db/memtable/Memtable.java
- src/java/org/apache/cassandra/db/commitlog/CommitLog.java
- src/java/org/apache/cassandra/io/sstable
Cassandra's storage engine is built on a Log Structured Merge (LSM) tree architecture, optimized for write-heavy workloads. Data flows through three key components: the commit log for durability, memtables for in-memory buffering, and SSTables for persistent disk storage.
Write Path & Durability
When a write arrives, Cassandra performs two operations in parallel:
- Commit Log - An append-only write-ahead log (WAL) records every mutation to disk immediately, ensuring durability even if the node crashes before the memtable is flushed to disk.
- Memtable - The mutation is also written to an in-memory data structure (typically a skip list or trie-based structure) for fast reads.
The commit log is shared by all tables on a node and uses configurable sync modes: periodic (fsync on a timer, acknowledging writes before the sync), batch (writes are acknowledged only after an fsync), or group (multiple writes are batched into a shared fsync before acknowledgment). On startup, any mutations in the commit log are replayed into memtables to recover data that had not yet been flushed.
Memtables
Memtables are in-memory write buffers managed by ColumnFamilyStore. Each table typically has one active memtable at a time. Key characteristics:
- Allocation: Can be stored entirely on-heap or partially off-heap (configurable via memtable_allocation_type)
- Sharding: Large memtables can be sharded across multiple internal structures for better concurrency
- Implementations: SkipListMemtable (default) or TrieMemtable (newer, more efficient)
- Lifecycle: When a memtable reaches size thresholds, it's switched out and flushed to disk as an SSTable
Memtables track partition counts, data size, and encoding statistics. They support both write operations (put) and read operations via iterators.
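The following toy sketch captures the memtable lifecycle described above (a sorted in-memory buffer that is switched out and flushed once it exceeds a size threshold); the class, field names, and threshold are illustrative, not Cassandra's Memtable API:

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.*;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicLong;

public class MemtableSketch {
    private final ConcurrentSkipListMap<String, String> rows = new ConcurrentSkipListMap<>();
    private final AtomicLong liveBytes = new AtomicLong();
    private static final long FLUSH_THRESHOLD_BYTES = 1 << 20; // 1 MiB for the sketch

    void put(String key, String value) throws IOException {
        rows.put(key, value);
        if (liveBytes.addAndGet(key.length() + value.length()) > FLUSH_THRESHOLD_BYTES)
            flush(Paths.get("sstable-" + System.nanoTime() + ".txt"));
    }

    // Write the sorted contents to an immutable file, analogous to an SSTable flush.
    void flush(Path file) throws IOException {
        try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(file))) {
            for (Map.Entry<String, String> e : rows.entrySet())
                out.println(e.getKey() + "\t" + e.getValue());
        }
        rows.clear();
        liveBytes.set(0);
    }

    public static void main(String[] args) throws IOException {
        MemtableSketch memtable = new MemtableSketch();
        memtable.put("user:1", "alice");
        memtable.flush(Paths.get("sstable-demo.txt")); // force a flush for the demo
    }
}
```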
SSTables & Disk Storage
SSTables (Sorted String Tables) are immutable files containing sorted data. Each SSTable consists of multiple component files:
- Data.db - Actual row data, organized by partition (sorted by token) and clustering key
- Index.db / Partitions.db / Rows.db - Index structures mapping keys to data locations
- Summary.db - Sampled index (every 128th entry by default) for faster lookups
- Filter.db - Bloom filter for partition keys, enabling efficient negative lookups
- Statistics.db - Metadata about timestamps, tombstones, TTLs, and compaction info
- CompressionInfo.db - Metadata for block-based compression
- Digest.crc32 - CRC-32 checksum of Data.db
Data within an SSTable is sorted by partition token, then by clustering key within each partition, enabling efficient range scans and merge operations during compaction.
Compaction & Write Amplification
Compaction is the background process that merges multiple SSTables into fewer, larger ones. This is necessary because:
- Memtable flushes create many small SSTables over time
- Reads must check multiple SSTables, degrading performance
- Compaction consolidates data and processes deletes/updates
The tradeoff is write amplification: data is rewritten multiple times as SSTables are compacted. Cassandra mitigates this with Bloom filters and index summaries to minimize read lookups.
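A small illustrative sketch of what compaction fundamentally does: merge several sorted runs into one, letting the newest version of each key win. Tombstone purging, TTLs, and streaming merges are omitted.

```java
import java.util.*;

public class CompactionMergeSketch {
    // Each input run is sorted by key; runs are ordered newest-first.
    static NavigableMap<String, String> compact(List<NavigableMap<String, String>> newestFirst) {
        NavigableMap<String, String> merged = new TreeMap<>();
        for (NavigableMap<String, String> run : newestFirst)
            for (Map.Entry<String, String> e : run.entrySet())
                merged.putIfAbsent(e.getKey(), e.getValue()); // a newer run already claimed the key
        return merged;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> newer = new TreeMap<>(Map.of("a", "2", "c", "9"));
        NavigableMap<String, String> older = new TreeMap<>(Map.of("a", "1", "b", "5"));
        System.out.println(compact(List.of(newer, older))); // {a=2, b=5, c=9}
    }
}
```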
Key Components
ColumnFamilyStore - Central coordinator managing memtables, SSTables, and flushes for a single table. Handles write routing, read coordination, and lifecycle management.
Memtable.Factory - Pluggable interface for creating different memtable implementations. Factories can declare whether they provide write durability independent of the commit log.
CommitLog - Singleton managing all commit log segments. Supports archiving, compression, and encryption. Tracks which segments can be safely deleted after memtable flushes.
Read Path
Reads check memtables first (fastest), then SSTables in reverse creation order. Bloom filters eliminate most negative lookups. Index summaries and key caches accelerate SSTable access. The row cache can further optimize repeated reads of hot partitions.
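The read order described above can be sketched as follows; the SSTable and Bloom-filter types are hypothetical stand-ins, and the sketch returns the first (newest) hit rather than merging row fragments by timestamp as the real path does:

```java
import java.util.*;
import java.util.function.Predicate;

public class ReadPathSketch {
    record SSTable(NavigableMap<String, String> data, Predicate<String> mightContain) {}

    static Optional<String> read(String key,
                                 NavigableMap<String, String> memtable,
                                 List<SSTable> newestFirst) {
        String hit = memtable.get(key);                       // memtable first: fastest
        if (hit != null) return Optional.of(hit);
        for (SSTable sstable : newestFirst) {
            if (!sstable.mightContain().test(key)) continue;  // Bloom filter: "definitely absent"
            String value = sstable.data().get(key);
            if (value != null) return Optional.of(value);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        NavigableMap<String, String> memtable = new TreeMap<>(Map.of("k1", "v1"));
        SSTable older = new SSTable(new TreeMap<>(Map.of("k2", "v2")), key -> key.equals("k2"));
        System.out.println(read("k2", memtable, List.of(older))); // Optional[v2]
    }
}
```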
Replication & Consensus Protocols
Relevant Files
- src/java/org/apache/cassandra/service/accord/AccordService.java
- src/java/org/apache/cassandra/service/paxos/Paxos.java
- src/java/org/apache/cassandra/tcm/ClusterMetadataService.java
- doc/modules/cassandra/pages/architecture/accord-architecture.adoc
Cassandra supports two consensus protocols for transactional operations: Paxos (traditional) and Accord (modern). Both ensure strong consistency guarantees, but they differ fundamentally in their approach to coordination, recovery, and execution.
Paxos Protocol
Paxos is a single-decree consensus algorithm that performs consensus on the exact set of writes to apply. The protocol operates in three phases:
- Prepare Phase: A coordinator seeks promises from replicas (the electorate) to not accept lower-numbered proposals. This phase also completes any in-progress proposals and commits.
- Propose Phase: The coordinator proposes a value with a higher ballot number.
- Commit Phase: Once a quorum accepts the proposal, the coordinator broadcasts the commit to all replicas.
Paxos recovery is straightforward: replaying the committed writes deterministically recreates the state. The protocol is implemented in Paxos.java and related classes like PaxosPrepare.java, PaxosPropose.java, and PaxosCommit.java.
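A minimal single-decree acceptor sketch showing the promise/accept rules behind the phases above; Cassandra's real implementation uses time-based UUID ballots and layers LWT reads and commits on top, so everything here is illustrative:

```java
public class PaxosAcceptorSketch {
    record Promise(boolean promised, long acceptedBallot, String acceptedValue) {}

    private long promisedBallot = -1;
    private long acceptedBallot = -1;
    private String acceptedValue = null;

    // Prepare phase: promise not to accept lower ballots; report any previously accepted value.
    synchronized Promise prepare(long ballot) {
        if (ballot <= promisedBallot)
            return new Promise(false, acceptedBallot, acceptedValue);   // reject
        promisedBallot = ballot;
        return new Promise(true, acceptedBallot, acceptedValue);        // may carry an in-progress value
    }

    // Propose phase: accept the value only if no higher ballot has been promised since.
    synchronized boolean accept(long ballot, String value) {
        if (ballot < promisedBallot) return false;
        promisedBallot = ballot;
        acceptedBallot = ballot;
        acceptedValue = value;
        return true;
    }

    public static void main(String[] args) {
        PaxosAcceptorSketch acceptor = new PaxosAcceptorSketch();
        System.out.println(acceptor.prepare(10));             // promised
        System.out.println(acceptor.accept(10, "UPDATE x"));  // true
        System.out.println(acceptor.prepare(5));              // rejected: lower ballot
    }
}
```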
Accord Protocol
Accord is a newer consensus protocol that performs consensus on the transaction definition, dependencies, and execution timestamp rather than the exact writes. This enables more efficient coordination and better performance under contention.
Key characteristics:
- PreAccept & Accept Phases: Coordinator determines when the transaction should execute by gathering dependency information from replicas.
- Execute Phase: Replicas execute the transaction at the agreed timestamp, computing writes deterministically.
- Asynchronous Commit: Accord defaults to asynchronous commit, improving latency.
The Accord implementation is split into coordinator and replica-local components. The AccordService.java manages the Accord node lifecycle, while AccordVerbHandler processes all Accord messages on replicas. Each replica maintains CommandStore instances (local shards) that track transaction metadata and execute commands sequentially.
Key Differences
┌─────────────────────┬──────────────────┬──────────────────┐
│ Aspect │ Paxos │ Accord │
├─────────────────────┼──────────────────┼──────────────────┤
│ Consensus On │ Exact writes │ Txn definition │
│ Recovery │ Replay writes │ Recompute writes │
│ Commit Model │ Synchronous │ Asynchronous │
│ Read Consistency │ QUORUM │ ONE (after setup) │
│ Contention Handling │ Ballot increment │ Timestamp adjust │
└─────────────────────┴──────────────────┴──────────────────┘
Cluster Metadata Service (TCM)
The ClusterMetadataService manages cluster topology and configuration using Paxos for consensus on metadata changes. It maintains a distributed metadata table replicated across CMS (Cluster Metadata Service) nodes, ensuring all nodes have consistent cluster state.
Live Migration
Cassandra supports live migration between Paxos and Accord on a per-table, per-range basis. This requires careful coordination:
- Paxos to Accord: Two-phase migration ensures Accord can safely read non-SERIAL writes after data repair completes.
- Accord to Paxos: Single-phase migration; Accord key repair ensures all Accord transactions are committed at QUORUM before switching.
Key barriers (per-key synchronization points) bridge the gap between protocols, allowing mixed-mode execution where different ranges use different consensus mechanisms.
Recovery Mechanisms
Both protocols support automatic recovery of in-progress transactions:
- Paxos: Recovery coordinator replays committed writes.
- Accord: Recovery coordinator recomputes writes from transaction definition and dependencies, requiring repeatable reads.
The recovery protocol ensures that even if the original coordinator fails, another node can safely complete the transaction deterministically.
Networking & Inter-node Communication
Relevant Files
- src/java/org/apache/cassandra/net/MessagingService.java
- src/java/org/apache/cassandra/net/SocketFactory.java
- src/java/org/apache/cassandra/net/OutboundConnections.java
- src/java/org/apache/cassandra/net/InboundSockets.java
- src/java/org/apache/cassandra/gms/Gossiper.java
- src/java/org/apache/cassandra/transport/Server.java
Cassandra's networking layer handles two distinct communication patterns: inter-node messaging for cluster coordination and data operations, and client-server communication via the native transport protocol. The architecture prioritizes reliability, performance, and preventing head-of-line blocking through sophisticated connection management and flow control.
Inter-node Messaging Architecture
MessagingService is the core component managing all inter-node communication. It maintains three separate TCP connections to each peer node, each optimized for different message sizes:
- Urgent Messages - Small, time-critical messages (gossip, failure detection)
- Small Messages - Messages under 65 KiB (most queries, mutations)
- Large Messages - Messages exceeding 65 KiB (bulk operations, streaming)
This separation prevents large messages from blocking smaller, latency-sensitive operations. Each connection type has its own queue and resource limits.
Wire Format & Framing
Messages are grouped into frames with integrity protection using CRC24 for headers and CRC32 for payloads. Optional LZ4 compression reduces bandwidth. Small messages are batched together, while large messages span multiple frames. The frame-based approach provides application-level error detection independent of TCP checksums.
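The checksum-on-write, verify-on-read idea can be sketched with the JDK's CRC32 (Cassandra's actual frame layout, with its CRC24 header and optional LZ4 block, is more involved):

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

public class FrameChecksumSketch {
    static ByteBuffer encode(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload);
        return ByteBuffer.allocate(payload.length + Integer.BYTES + Long.BYTES)
                         .putInt(payload.length)    // length prefix
                         .put(payload)              // payload bytes
                         .putLong(crc.getValue())   // trailing checksum
                         .flip();
    }

    static byte[] decode(ByteBuffer frame) {
        byte[] payload = new byte[frame.getInt()];
        frame.get(payload);
        long expected = frame.getLong();
        CRC32 crc = new CRC32();
        crc.update(payload);
        if (crc.getValue() != expected)
            throw new IllegalStateException("corrupt frame: checksum mismatch");
        return payload;
    }

    public static void main(String[] args) {
        ByteBuffer frame = encode("MUTATION for partition 42".getBytes());
        System.out.println(new String(decode(frame))); // round-trips, or throws on corruption
    }
}
```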
Outbound Message Flow
OutboundConnection handles message serialization and delivery. For small and urgent messages, serialization occurs on the event loop. For large messages, serialization is offloaded to a thread pool to prevent blocking. Each connection enforces strict memory limits: 4 MiB per connection, 128 MiB per endpoint, 512 MiB globally.
Inbound Message Flow
Incoming frames are decoded by FrameDecoder, which validates integrity and decompresses if needed. InboundMessageHandler then deserializes messages and routes them to appropriate verb handlers. Small messages are deserialized immediately; large messages defer deserialization until execution on the target stage to conserve memory.
Inbound connections also enforce resource limits with flow control: if an endpoint exceeds its quota, the handler pauses frame reading until permits become available.
Gossip Protocol
Cassandra uses a push-pull gossip protocol for cluster state propagation. Every second, each node initiates a gossip round with a random peer:
- SYN - Initiator sends GossipDigestSyn containing digests of known endpoint states
- ACK - Receiver responds with GossipDigestAck containing state deltas
- ACK2 - Initiator sends GossipDigestAck2 with additional state updates
State is versioned using (generation, version) tuples where generation is a monotonic timestamp and version is a logical clock. This allows nodes to ignore stale information by comparing version numbers.
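A small sketch of that comparison rule, with illustrative record names rather than Gossiper's actual classes:

```java
public class GossipVersionSketch {
    record HeartbeatState(int generation, int version) {}

    static boolean isNewer(HeartbeatState candidate, HeartbeatState known) {
        if (candidate.generation() != known.generation())
            return candidate.generation() > known.generation(); // node restarted with a newer generation
        return candidate.version() > known.version();           // same lifetime, later logical clock
    }

    public static void main(String[] args) {
        HeartbeatState known = new HeartbeatState(1700000000, 42);
        HeartbeatState gossiped = new HeartbeatState(1700000000, 41);
        System.out.println(isNewer(gossiped, known)); // false: stale state, ignore it
    }
}
```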
Gossip also includes failure detection: the Phi Accrual Failure Detector monitors heartbeat intervals and calculates a phi value. When phi exceeds a threshold, the node is marked down and subscribers are notified.
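A simplified phi calculation, assuming heartbeat intervals are exponentially distributed with the observed mean (the real FailureDetector keeps a bounded window of intervals; the window size and threshold below are illustrative):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class PhiAccrualSketch {
    private final Deque<Long> intervalsMillis = new ArrayDeque<>();
    private long lastHeartbeatMillis = -1;
    private static final int WINDOW = 1000;
    static final double CONVICT_THRESHOLD = 8.0; // phi above this marks the node down

    void heartbeat(long nowMillis) {
        if (lastHeartbeatMillis >= 0) {
            intervalsMillis.addLast(nowMillis - lastHeartbeatMillis);
            if (intervalsMillis.size() > WINDOW) intervalsMillis.removeFirst();
        }
        lastHeartbeatMillis = nowMillis;
    }

    double phi(long nowMillis) {
        if (intervalsMillis.isEmpty()) return 0.0;
        double mean = intervalsMillis.stream().mapToLong(Long::longValue).average().orElse(1.0);
        double sinceLast = nowMillis - lastHeartbeatMillis;
        // -log10(P(no heartbeat yet)) with P = exp(-t/mean)  ==>  t / (mean * ln 10)
        return sinceLast / (mean * Math.log(10.0));
    }

    public static void main(String[] args) {
        PhiAccrualSketch fd = new PhiAccrualSketch();
        for (long t = 0; t <= 10_000; t += 1_000) fd.heartbeat(t); // steady 1s heartbeats
        System.out.println(fd.phi(10_500)); // small phi: node looks alive
        System.out.println(fd.phi(40_000)); // large phi: likely down
    }
}
```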
Native Transport (Client Protocol)
Server manages client connections using Netty. It supports multiple protocol versions and handles authentication, query execution, and event subscriptions. Connections are pooled and tracked for resource management. The server enforces per-connection and global limits on concurrent requests to prevent resource exhaustion.
Connection Lifecycle
Outbound connections are initiated on-demand by OutboundConnectionInitiator and maintained with automatic reconnection on failure. Inbound connections are accepted by InboundSockets listening on configured ports. Both support optional TLS encryption via SocketFactory, which configures Netty pipelines with SSL handlers.
Resource Management
All connections participate in a hierarchical resource limit system:
- Connection-level - Exclusive quota for pending messages
- Endpoint-level - Shared quota across all connections to a peer
- Global-level - Cluster-wide aggregate limit
When limits are exceeded, messages are rejected with callbacks invoked immediately, preventing unbounded queue growth and memory exhaustion.
Repair & Data Streaming
Relevant Files
- src/java/org/apache/cassandra/repair/RepairSession.java
- src/java/org/apache/cassandra/service/ActiveRepairService.java
- src/java/org/apache/cassandra/streaming/StreamManager.java
- src/java/org/apache/cassandra/dht/RangeStreamer.java
- src/java/org/apache/cassandra/streaming/StreamPlan.java
- src/java/org/apache/cassandra/streaming/StreamSession.java
- src/java/org/apache/cassandra/repair/RepairJob.java
- src/java/org/apache/cassandra/repair/ValidationTask.java
- src/java/org/apache/cassandra/repair/SyncTask.java
Cassandra's repair and streaming subsystems work together to maintain data consistency across the cluster. Repair detects and fixes inconsistencies between replicas, while streaming transfers data efficiently during repair, bootstrap, and topology changes.
Repair Architecture
Repair is a multi-phase process coordinated by RepairSession and RepairJob:
- Paxos Repair Phase: Completes any unfinished Paxos operations in the target range/keyspace/table.
- Validation Phase: Each replica generates a Merkle tree of its data for the target range. ValidationTask sends validation requests to replicas and collects their trees. Validation tasks run sequentially per job to ensure trees are created roughly simultaneously across replicas.
- Synchronization Phase: Once all trees are received, RepairJob compares them to identify differences. For each pair of replicas with mismatches, a SyncTask is created to stream the differences. Sync phases run concurrently with each other and with validation phases.
Sync Task Types
Cassandra supports three sync task implementations:
- LocalSyncTask: Coordinator streams with a remote replica. The coordinator initiates bidirectional streaming using StreamPlan.
- SymmetricRemoteSyncTask: Two remote replicas stream with each other. The coordinator sends sync requests to both nodes.
- AsymmetricRemoteSyncTask: Optimized streaming where only necessary data flows between replicas based on Merkle tree differences.
Streaming System
StreamPlan and StreamCoordinator manage data transfer:
- StreamPlan: Builder class that configures streaming operations (bootstrap, repair, decommission). It specifies source/destination pairs and ranges to transfer.
- StreamCoordinator: Maintains multiple StreamSession instances per peer, handling connection pooling and session lifecycle.
- StreamSession: Manages bidirectional data exchange with a single peer. It handles incoming/outgoing streams, flow control, and completion tracking.
- StreamManager: Global singleton tracking all active streaming operations, rate limiting, and JMX metrics.
Rate Limiting & Flow Control
Streaming respects configurable throughput limits:
- stream_throughput_outbound_megabits_per_sec: General streaming rate limit
- inter_dc_stream_throughput_outbound_megabits_per_sec: Cross-datacenter limit
- entire_sstable_stream_throughput_outbound_megabits_per_sec: Full SSTable transfer limit
StreamRateLimiter enforces these limits using Guava's RateLimiter, with separate limiters for intra-DC and inter-DC traffic.
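A hedged sketch of the same throttling idea using Guava's RateLimiter directly; the byte-based conversion and the single limiter are simplifications of what StreamRateLimiter actually does:

```java
import com.google.common.util.concurrent.RateLimiter;

public class StreamThrottleSketch {
    public static void main(String[] args) {
        double bytesPerSecond = 25_000_000; // roughly 200 megabits per second, illustrative value
        RateLimiter limiter = RateLimiter.create(bytesPerSecond);

        byte[] chunk = new byte[64 * 1024];
        for (int i = 0; i < 100; i++) {
            limiter.acquire(chunk.length); // blocks until enough byte-permits are available
            sendChunk(chunk);              // hypothetical network write
        }
    }

    static void sendChunk(byte[] chunk) { /* placeholder for the actual socket write */ }
}
```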
Key Concepts
Merkle Trees: Hierarchical hash structures enabling efficient difference detection. Leaves represent data partitions; parent nodes hash their children. Comparing trees identifies mismatched ranges without transferring all data.
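A toy version of that comparison, hashing a handful of token-range "leaves" with SHA-256 and descending only into differing subtrees; real repair trees are far deeper and hash row content, so everything here is illustrative:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.*;

public class MerkleSketch {
    static byte[] hash(byte[]... parts) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            for (byte[] p : parts) md.update(p);
            return md.digest();
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    // Build a complete binary tree bottom-up; index 1 is the root, leaves occupy the last level.
    static byte[][] build(List<String> leafData) {
        int n = leafData.size(); // assume n is a power of two for the sketch
        byte[][] nodes = new byte[2 * n][];
        for (int i = 0; i < n; i++)
            nodes[n + i] = hash(leafData.get(i).getBytes(StandardCharsets.UTF_8));
        for (int i = n - 1; i >= 1; i--)
            nodes[i] = hash(nodes[2 * i], nodes[2 * i + 1]);
        return nodes;
    }

    // Recursively compare; collect leaf indexes (token ranges) whose hashes differ.
    static void diff(byte[][] a, byte[][] b, int node, int leafCount, List<Integer> out) {
        if (Arrays.equals(a[node], b[node])) return;  // identical subtree: skip it entirely
        if (node >= leafCount) { out.add(node - leafCount); return; }
        diff(a, b, 2 * node, leafCount, out);
        diff(a, b, 2 * node + 1, leafCount, out);
    }

    public static void main(String[] args) {
        List<String> replica1 = List.of("r1", "r2", "r3", "r4");
        List<String> replica2 = List.of("r1", "r2", "rX", "r4");
        byte[][] t1 = build(replica1), t2 = build(replica2);
        List<Integer> mismatched = new ArrayList<>();
        diff(t1, t2, 1, replica1.size(), mismatched);
        System.out.println("ranges to stream: " + mismatched); // [2]
    }
}
```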
Pending Repair: Incremental repair marks repaired data with a session ID, allowing future repairs to skip already-repaired ranges. StreamPlan tracks pending repair IDs for proper SSTable marking.
Preview Repair: Non-destructive repair mode that validates consistency without streaming fixes. Useful for testing repair impact before committing changes.
Transient Replicas: Streaming distinguishes between full and transient replicas. Transient replicas receive data but don't participate in quorum reads, reducing replication overhead.
Indexing, Caching & Performance
Relevant Files
- src/java/org/apache/cassandra/index/sai
- src/java/org/apache/cassandra/service/CacheService.java
- src/java/org/apache/cassandra/cache/CaffeineCache.java
- src/java/org/apache/cassandra/db/compaction
- src/java/org/apache/cassandra/db/ReadCommand.java
Cassandra provides multiple layers of indexing and caching to optimize query performance. Understanding these mechanisms is critical for designing efficient data models and tuning cluster performance.
Storage-Attached Indexing (SAI)
SAI is Cassandra's modern secondary indexing system, designed to overcome limitations of legacy indexes. Unlike traditional secondary indexes that require read-before-write operations, SAI indexes are attached to SSTables and updated synchronously with mutations.
Key advantages:
- Reduced disk footprint: 20-35% overhead compared to unindexed data, with shared index structures across multiple columns
- Efficient on-disk formats: Strings use byte-ordered trie structures; numeric types use k-dimensional trees for range queries
- Zero-copy streaming: Indexes stream with SSTables during bootstrap/decommission without rebuilding
- Flexible query support: Equality, range queries, AND/OR logic, vector search, and CONTAINS operations
SAI maintains separate indexes for memtables (in-memory) and SSTables (on-disk), resolving differences at read time. Index builds execute on compaction threads and are visible as compaction operations.
Caching Layers
Cassandra implements three independent cache types managed by CacheService:
Key Cache: Caches partition key positions in SSTable index files, reducing disk seeks for partition lookups. Stores KeyCacheKey <-> AbstractRowIndexEntry mappings with configurable capacity.
Row Cache: Caches entire partitions in memory using a weighted eviction policy (W-TinyLFU via Caffeine). Useful for read-heavy workloads with hot partitions. Invalidated on writes to prevent stale data.
Counter Cache: Specialized cache for counter columns, storing clock-and-count pairs to optimize counter increments without full reads.
All caches support periodic persistence to disk, allowing recovery after restarts. Cache sizes are configured in cassandra.yaml and adjustable via JMX at runtime.
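As an illustration of the weighted-eviction idea, the sketch below builds a row-cache-like structure directly on Caffeine (the library noted above); the key and value types, sizes, and helper methods are hypothetical, not Cassandra's CacheService API:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public class RowCacheSketch {
    private final Cache<String, byte[]> rowCache = Caffeine.newBuilder()
        .maximumWeight(64L * 1024 * 1024)                          // ~64 MiB of cached partitions
        .weigher((String key, byte[] partition) -> partition.length)
        .build();

    byte[] readPartition(String key) {
        byte[] cached = rowCache.getIfPresent(key);
        if (cached != null) return cached;             // cache hit: no SSTable access needed
        byte[] fromDisk = readFromSSTables(key);       // hypothetical slow path
        rowCache.put(key, fromDisk);
        return fromDisk;
    }

    void writePartition(String key, byte[] newData) {
        persist(key, newData);                         // hypothetical write path
        rowCache.invalidate(key);                      // drop the stale cached copy, as the text describes
    }

    private byte[] readFromSSTables(String key) { return new byte[0]; }
    private void persist(String key, byte[] data) {}
}
```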
Read Path Optimization
When executing a read command, Cassandra follows this sequence:
- Index selection: If indexes exist and row filters are present, the query planner selects the best index using getBestIndexQueryPlanFor()
- Cache lookup: For single-partition reads, the row cache is checked first (if enabled)
- Memtable & SSTable query: On a cache miss, data is fetched from memtables and SSTables in timestamp order
- Index-assisted filtering: SAI indexes narrow the result set before fetching full rows, reducing I/O
The ReadExecutionController manages the lifecycle of read operations, acquiring references to SSTables to prevent compaction from removing data mid-read.
Compaction and Index Maintenance
Compaction merges SSTables and triggers index updates. Different compaction strategies balance read vs. write amplification:
- Unified Compaction Strategy (UCS): Recommended for most workloads; combines tiered and leveled approaches with parallel sharding
- Leveled Compaction Strategy (LCS): Optimized for read-heavy workloads; maintains non-overlapping SSTables per level
- Size-Tiered Compaction Strategy (STCS): Default; good for write-heavy workloads
- Time Window Compaction Strategy (TWCS): Optimized for TTL'd time-series data
During compaction, indexes are incrementally rebuilt for affected SSTables. SAI supports both full rebuilds and partial incremental builds, minimizing overhead.
Performance Tuning
For indexing: Create SAI indexes on frequently filtered columns. Avoid indexing high-cardinality columns or partition keys. Monitor index disk usage and query latency via metrics.
For caching: Enable row cache for hot partitions; disable for write-heavy tables. Tune key cache size based on working set. Use counter cache for counter-heavy workloads.
For compaction: Choose UCS for balanced workloads, LCS for read-heavy, STCS for write-heavy. Monitor compaction throughput and read amplification metrics.