Overview
Relevant Files
- README.asc
- doc/modules/cassandra/pages/architecture/overview.adoc
- src/java/org/apache/cassandra/service/CassandraDaemon.java
- doc/modules/cassandra/pages/architecture/storage-engine.adoc
- src/java/org/apache/cassandra/cql3/QueryProcessor.java
- src/java/org/apache/cassandra/gms/Gossiper.java
Apache Cassandra is a highly-scalable, distributed NoSQL database designed for massive scale and high availability. It implements a partitioned wide-column storage model with eventually consistent semantics, combining the best ideas from Amazon's Dynamo and Google's Bigtable.
Core Design Philosophy
Cassandra was built to address emerging requirements for globally distributed systems requiring low-latency reads and writes across multiple datacenters. Unlike traditional relational databases, Cassandra explicitly trades off some consistency guarantees for availability and partition tolerance, making it ideal for applications that prioritize uptime and responsiveness over strict ACID semantics.
Key Design Objectives:
- Full multi-primary database replication across datacenters
- Global availability with low-latency reads and writes
- Linear scalability on commodity hardware
- Online load balancing and cluster growth without downtime
- Partitioned key-oriented queries for predictable performance
Architecture Overview
Data Organization
Cassandra organizes data hierarchically using CQL (Cassandra Query Language), an SQL-like interface:
- Keyspace: A container for tables that specifies replication strategy and replication factor across datacenters
- Table: Composed of rows and columns with flexible schema; partitioned by partition key
- Partition: A group of rows sharing the same partition key; the partition key (the mandatory part of the primary key) determines which node stores the data
- Row: A collection of columns identified by a unique primary key (partition key + clustering keys)
- Column: A single typed datum within a row
Storage Engine
Cassandra uses a Log Structured Merge (LSM) tree architecture optimized for write-heavy workloads:
- Commit Log: Append-only write-ahead log ensuring durability
- Memtable: In-memory write-back cache buffering writes per table
- SSTable: Immutable sorted files on disk containing flushed data
- Compaction: Background process merging SSTables to optimize reads
The write path is highly optimized for throughput, while reads benefit from Bloom filters and index structures within SSTables.
Distributed Coordination
Cassandra uses a gossip protocol for cluster membership and failure detection. Nodes periodically exchange state information through three-message gossip rounds, enabling:
- Automatic detection of node failures
- Cluster topology awareness
- Metadata propagation across the cluster
- Decentralized cluster management without a single point of failure
Query Execution
The QueryProcessor handles CQL statement parsing, validation, and execution. Queries are routed through a replication strategy that determines which replicas should handle the request based on consistency level requirements. The system supports various consistency levels from ONE (fastest, least consistent) to ALL (slowest, most consistent).
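As a concrete illustration of per-request consistency levels, the sketch below uses the DataStax Java driver 4.x (a separate client library, not part of this codebase) to issue a QUORUM read; the keyspace and table names are hypothetical.

```java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.DefaultConsistencyLevel;
import com.datastax.oss.driver.api.core.cql.ResultSet;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public class QuorumReadExample {
    public static void main(String[] args) {
        // Connects to a locally running node; ks.users is a hypothetical table.
        try (CqlSession session = CqlSession.builder().build()) {
            SimpleStatement read = SimpleStatement.builder(
                    "SELECT * FROM ks.users WHERE user_id = ?")
                .addPositionalValue(42L)
                // QUORUM: the coordinator waits for a majority of replicas before answering.
                .setConsistencyLevel(DefaultConsistencyLevel.QUORUM)
                .build();
            ResultSet rows = session.execute(read);
            rows.forEach(row -> System.out.println(row.getFormattedContents()));
        }
    }
}
```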
Advanced Features
Cassandra supports sophisticated data modeling and querying capabilities:
- Collection types (sets, maps, lists)
- User-defined types, functions, and aggregates
- Storage-Attached Indexing (SAI) for secondary indexes
- Single-partition lightweight transactions with atomic compare-and-set semantics
- Materialized views (experimental)
Intentional Limitations
Cassandra explicitly avoids operations requiring cross-partition coordination, as these are difficult to provide with high availability guarantees:
- No cross-partition transactions
- No distributed joins
- No foreign keys or referential integrity
This design choice enables Cassandra to maintain its availability and partition tolerance guarantees while scaling to massive datasets.
Architecture & Core Design
Relevant Files
- doc/modules/cassandra/pages/architecture/dynamo.adoc
- doc/modules/cassandra/pages/architecture/storage-engine.adoc
- doc/modules/cassandra/pages/architecture/guarantees.adoc
- src/java/org/apache/cassandra/tcm/ClusterMetadata.java
- src/java/org/apache/cassandra/dht/IPartitioner.java
Cassandra's architecture is built on three foundational pillars: distributed data partitioning, multi-master replication, and a write-optimized storage engine. These design principles enable Cassandra to scale horizontally across commodity hardware while maintaining high availability and durability.
Distributed Data Partitioning
Cassandra uses consistent hashing to distribute data across nodes without requiring a central coordinator. Each node is assigned one or more tokens on a continuous hash ring. When a key is written, it is hashed to a token position, and the data is stored on the nodes that own that token range.
Virtual Nodes (vnodes) allow a single physical node to own multiple token positions on the ring. This enables:
- Even data distribution when adding or removing nodes
- Balanced query load across the cluster
- Simplified cluster scaling with minimal data movement
The IPartitioner interface defines how keys are hashed to tokens. The default Murmur3Partitioner provides uniform distribution across the ring.
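A minimal sketch of the token-ring idea, assuming a TreeMap-based ring and a stand-in hash function rather than Cassandra's actual Murmur3Partitioner; all class and method names here are illustrative:

```java
import java.nio.charset.StandardCharsets;
import java.util.*;

public class TokenRingSketch {
    private final NavigableMap<Long, String> ring = new TreeMap<>();

    // Assign several tokens (vnodes) to each physical node.
    void addNode(String node, int vnodes) {
        Random rnd = new Random(node.hashCode());
        for (int i = 0; i < vnodes; i++)
            ring.put(rnd.nextLong(), node);
    }

    // Hash the partition key to a token and walk clockwise to the owning vnode.
    String ownerOf(String partitionKey) {
        long token = hash(partitionKey);
        Map.Entry<Long, String> e = ring.ceilingEntry(token);
        return (e != null ? e : ring.firstEntry()).getValue();
    }

    private static long hash(String key) {
        // Stand-in hash; Cassandra uses Murmur3 for uniform token distribution.
        byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
        long h = 1125899906842597L;
        for (byte b : bytes) h = 31 * h + b;
        return h;
    }

    public static void main(String[] args) {
        TokenRingSketch ring = new TokenRingSketch();
        for (String n : List.of("node1", "node2", "node3")) ring.addNode(n, 16);
        System.out.println("key 'alice' -> " + ring.ownerOf("alice"));
    }
}
```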
Multi-Master Replication
Every partition is replicated to multiple nodes determined by the replication strategy and replication factor (RF). With RF=3, each partition exists on three distinct nodes, allowing the cluster to tolerate node failures.
Consistency Levels provide tunable consistency through the R + W > RF principle:
- ONE, TWO, THREE: Fixed replica counts
- QUORUM: Majority of replicas (n/2 + 1)
- LOCAL_QUORUM: Majority in the local datacenter
- ALL: All replicas must respond
Writes are always sent to all replicas; the consistency level only controls how many responses the coordinator waits for before acknowledging the client.
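To make the tunable-consistency arithmetic concrete, here is an illustrative sketch of the "how many replicas must respond" calculation; it is a simplified single-datacenter view, not Cassandra's internal code:

```java
// Sketch of the coordinator's "block for" calculation under the R + W > RF rule.
public class BlockForSketch {
    enum Level { ONE, TWO, THREE, QUORUM, ALL }

    static int blockFor(Level level, int replicationFactor) {
        switch (level) {
            case ONE:    return 1;
            case TWO:    return 2;
            case THREE:  return 3;
            case QUORUM: return replicationFactor / 2 + 1;
            case ALL:    return replicationFactor;
            default:     throw new AssertionError();
        }
    }

    public static void main(String[] args) {
        int rf = 3;
        int reads  = blockFor(Level.QUORUM, rf);  // 2
        int writes = blockFor(Level.QUORUM, rf);  // 2
        // 2 + 2 > 3, so every quorum read overlaps the latest quorum write.
        System.out.println("strongly consistent: " + (reads + writes > rf));
    }
}
```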
Storage Engine: Log-Structured Merge Trees
Cassandra's storage engine is optimized for write-heavy workloads using Log-Structured Merge (LSM) trees:
- CommitLog: Append-only write-ahead log ensuring durability
- Memtable: In-memory buffer for recent writes, sorted by key
- SSTable: Immutable sorted files on disk containing flushed data
When a memtable reaches its size limit, it is flushed to disk as an SSTable. Compaction merges multiple SSTables, removing obsolete data and improving read performance. Bloom filters accelerate reads by quickly determining if a key exists in an SSTable.
Cluster Metadata & Membership
The ClusterMetadata class maintains the authoritative cluster state, including:
- Directory: Node identity, location, and addressing
- TokenMap: Mapping of tokens to physical endpoints
- DataPlacements: Replica assignments for each keyspace
- DistributedSchema: Table definitions and replication parameters
Gossip protocol propagates cluster state changes (node joins, failures, schema updates) in a decentralized manner. Every second, each node exchanges state with a random peer, ensuring eventual consistency of cluster membership.
Consistency Guarantees
Cassandra provides:
- Eventual Consistency: All replicas converge to the same state
- Durability: Data persists via replication and commit logs
- High Availability: Continues operating despite node failures
- Lightweight Transactions: Paxos-based compare-and-set operations for linearizable consistency
Query Processing & CQL
Relevant Files
- src/java/org/apache/cassandra/cql3/QueryProcessor.java
- src/java/org/apache/cassandra/cql3/CQLStatement.java
- src/java/org/apache/cassandra/cql3/statements/SelectStatement.java
- src/java/org/apache/cassandra/cql3/statements/ModificationStatement.java
- src/java/org/apache/cassandra/cql3/statements/BatchStatement.java
- src/java/org/apache/cassandra/service/StorageProxy.java
Overview
Query processing in Cassandra transforms a CQL string into a distributed execution plan. The pipeline consists of three main stages: parsing (text to AST), preparation (semantic validation and optimization), and execution (distributed coordination). Each stage is handled by specialized components that work together to ensure correctness and performance.
Parsing & Preparation Pipeline
The journey begins when a CQL query string arrives at QueryProcessor.parseStatement(). This method uses ANTLR-generated parsers to tokenize and parse the query into a CQLStatement.Raw object. The grammar (defined in src/antlr/Parser.g) recognizes all CQL statement types: SELECT, INSERT, UPDATE, DELETE, BATCH, and DDL operations.
After parsing, the raw statement enters the preparation phase via CQLStatement.Raw.prepare(ClientState). This is where semantic validation occurs:
- Keyspace resolution: Statements are qualified with the current keyspace from ClientState
- Schema validation: Column names, table names, and types are verified against the schema
- Bind variable analysis: Placeholder markers (?) are catalogued and typed
- Authorization checks: Permissions are validated for the user
The result is a fully-prepared CQLStatement object, ready for execution.
Statement Types & Execution
Cassandra supports multiple statement categories, each implementing the CQLStatement interface:
- SelectStatement: Queries data via StorageProxy.read(), supporting pagination and aggregation
- ModificationStatement (INSERT, UPDATE, DELETE): Generates PartitionUpdate mutations and routes them via StorageProxy.mutate()
- BatchStatement: Coordinates multiple mutations with optional Paxos consensus for LWT (Lightweight Transactions)
- Schema statements: Alter cluster metadata (CREATE TABLE, DROP KEYSPACE, etc.)
Each statement type implements execute(QueryState, QueryOptions, RequestTime) to handle distributed coordination.
Key Components
QueryProcessor: Singleton managing the full lifecycle. Maintains caches for prepared statements (using MD5 digest IDs) and internal queries. Handles both client-facing and internal query paths.
QueryOptions: Encapsulates execution parameters—consistency level, bind values, page size, timestamp. Prepared via options.prepare(bindVariables) to bind parameter values.
ClientState: Tracks user identity, current keyspace, and permissions. Used during preparation to resolve keyspace-qualified names and enforce authorization.
StorageProxy: Routes mutations and reads to appropriate replicas based on consistency level and token distribution. Handles read repair, write acknowledgments, and Paxos consensus for LWT.
Prepared Statement Caching
Prepared statements are cached by QueryProcessor using an MD5 digest of the query string (optionally including keyspace). The cache reduces parsing and preparation overhead for frequently-executed queries. Clients can explicitly prepare statements via the native protocol, receiving a statement ID for reuse.
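A hedged client-side view of the same mechanism, assuming the DataStax Java driver 4.x (not part of this repository) and a hypothetical ks.events table: the statement is prepared once, and subsequent executions send only the statement id and bind values.

```java
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.BoundStatement;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;

public class PreparedStatementExample {
    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder().build()) {
            // Parsed and prepared once server-side; cached by QueryProcessor under its digest id.
            PreparedStatement insert = session.prepare(
                "INSERT INTO ks.events (id, payload) VALUES (?, ?)");
            for (int i = 0; i < 10; i++) {
                BoundStatement bound = insert.bind(java.util.UUID.randomUUID(), "event-" + i);
                session.execute(bound); // only the statement id and bind values go over the wire
            }
        }
    }
}
```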
Consistency & Coordination
Every statement execution includes a ConsistencyLevel parameter that determines replica coordination:
- Read: Waits for responses from N replicas before returning
- Write: Waits for N replicas to acknowledge before returning
- Paxos: For LWT, uses consensus to ensure linearizability
The Dispatcher.RequestTime tracks enqueue and start times for metrics and timeout enforcement.
Storage Engine & Data Persistence
Relevant Files
- src/java/org/apache/cassandra/db/ColumnFamilyStore.java
- src/java/org/apache/cassandra/db/memtable/Memtable.java
- src/java/org/apache/cassandra/db/commitlog/CommitLog.java
- src/java/org/apache/cassandra/io/sstable
Cassandra's storage engine is built on a Log Structured Merge (LSM) tree architecture, optimized for write-heavy workloads. Data flows through three key components: the commit log for durability, memtables for in-memory buffering, and SSTables for persistent disk storage.
Write Path & Durability
When a write arrives, Cassandra performs two operations in parallel:
- Commit Log - An append-only write-ahead log (WAL) records every mutation to disk immediately, ensuring durability even if the node crashes before the memtable is flushed to disk.
- Memtable - The mutation is also written to an in-memory data structure (typically a skip list or trie-based structure) for fast reads.
The commit log is shared by all tables on a node and uses configurable sync modes: periodic (fsync on a timer, acknowledging writes before the sync), batch (writes are acknowledged only after an fsync), or group (multiple writes are batched into a shared fsync before acknowledgment). On startup, any mutations in the commit log are replayed into memtables to recover data that had not yet been flushed.
Memtables
Memtables are in-memory write buffers managed by ColumnFamilyStore. Each table typically has one active memtable at a time. Key characteristics:
- Allocation: Can be stored entirely on-heap or partially off-heap (configurable via memtable_allocation_type)
- Sharding: Large memtables can be sharded across multiple internal structures for better concurrency
- Implementations: SkipListMemtable (default) or TrieMemtable (newer, more efficient)
- Lifecycle: When a memtable reaches size thresholds, it's switched out and flushed to disk as an SSTable
Memtables track partition counts, data size, and encoding statistics. They support both write operations (put) and read operations via iterators.
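The following toy sketch captures the memtable lifecycle described above (a sorted in-memory buffer that is switched out and flushed once it exceeds a size threshold); the class, field names, and threshold are illustrative, not Cassandra's Memtable API:

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.*;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.atomic.AtomicLong;

public class MemtableSketch {
    private final ConcurrentSkipListMap<String, String> rows = new ConcurrentSkipListMap<>();
    private final AtomicLong liveBytes = new AtomicLong();
    private static final long FLUSH_THRESHOLD_BYTES = 1 << 20; // 1 MiB for the sketch

    void put(String key, String value) throws IOException {
        rows.put(key, value);
        if (liveBytes.addAndGet(key.length() + value.length()) > FLUSH_THRESHOLD_BYTES)
            flush(Paths.get("sstable-" + System.nanoTime() + ".txt"));
    }

    // Write the sorted contents to an immutable file, analogous to an SSTable flush.
    void flush(Path file) throws IOException {
        try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(file))) {
            for (Map.Entry<String, String> e : rows.entrySet())
                out.println(e.getKey() + "\t" + e.getValue());
        }
        rows.clear();
        liveBytes.set(0);
    }

    public static void main(String[] args) throws IOException {
        MemtableSketch memtable = new MemtableSketch();
        memtable.put("user:1", "alice");
        memtable.flush(Paths.get("sstable-demo.txt")); // force a flush for the demo
    }
}
```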
SSTables & Disk Storage
SSTables (Sorted String Tables) are immutable files containing sorted data. Each SSTable consists of multiple component files:
- Data.db - Actual row data, organized by partition (sorted by token) and clustering key
- Index.db / Partitions.db / Rows.db - Index structures mapping keys to data locations
- Summary.db - Sampled index (every 128th entry by default) for faster lookups
- Filter.db - Bloom filter for partition keys, enabling efficient negative lookups
- Statistics.db - Metadata about timestamps, tombstones, TTLs, and compaction info
- CompressionInfo.db - Metadata for block-based compression
- Digest.crc32 - CRC-32 checksum of Data.db
Data within an SSTable is sorted by partition token, then by clustering key within each partition, enabling efficient range scans and merge operations during compaction.
Compaction & Write Amplification
Compaction is the background process that merges multiple SSTables into fewer, larger ones. This is necessary because:
- Memtable flushes create many small SSTables over time
- Reads must check multiple SSTables, degrading performance
- Compaction consolidates data and processes deletes/updates
The tradeoff is write amplification: data is rewritten multiple times as SSTables are compacted. Cassandra mitigates this with Bloom filters and index summaries to minimize read lookups.
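A small illustrative sketch of what compaction fundamentally does: merge several sorted runs into one, letting the newest version of each key win. Tombstone purging, TTLs, and streaming merges are omitted.

```java
import java.util.*;

public class CompactionMergeSketch {
    // Each input run is sorted by key; runs are ordered newest-first.
    static NavigableMap<String, String> compact(List<NavigableMap<String, String>> newestFirst) {
        NavigableMap<String, String> merged = new TreeMap<>();
        for (NavigableMap<String, String> run : newestFirst)
            for (Map.Entry<String, String> e : run.entrySet())
                merged.putIfAbsent(e.getKey(), e.getValue()); // a newer run already claimed the key
        return merged;
    }

    public static void main(String[] args) {
        NavigableMap<String, String> newer = new TreeMap<>(Map.of("a", "2", "c", "9"));
        NavigableMap<String, String> older = new TreeMap<>(Map.of("a", "1", "b", "5"));
        System.out.println(compact(List.of(newer, older))); // {a=2, b=5, c=9}
    }
}
```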
Key Components
ColumnFamilyStore - Central coordinator managing memtables, SSTables, and flushes for a single table. Handles write routing, read coordination, and lifecycle management.
Memtable.Factory - Pluggable interface for creating different memtable implementations. Factories can declare whether they provide write durability independent of the commit log.
CommitLog - Singleton managing all commit log segments. Supports archiving, compression, and encryption. Tracks which segments can be safely deleted after memtable flushes.
Read Path
Reads check memtables first (fastest), then SSTables in reverse creation order. Bloom filters eliminate most negative lookups. Index summaries and key caches accelerate SSTable access. The row cache can further optimize repeated reads of hot partitions.
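The read order described above can be sketched as follows; the SSTable and Bloom-filter types are hypothetical stand-ins, and the sketch returns the first (newest) hit rather than merging row fragments by timestamp as the real path does:

```java
import java.util.*;
import java.util.function.Predicate;

public class ReadPathSketch {
    record SSTable(NavigableMap<String, String> data, Predicate<String> mightContain) {}

    static Optional<String> read(String key,
                                 NavigableMap<String, String> memtable,
                                 List<SSTable> newestFirst) {
        String hit = memtable.get(key);                       // memtable first: fastest
        if (hit != null) return Optional.of(hit);
        for (SSTable sstable : newestFirst) {
            if (!sstable.mightContain().test(key)) continue;  // Bloom filter: "definitely absent"
            String value = sstable.data().get(key);
            if (value != null) return Optional.of(value);
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        NavigableMap<String, String> memtable = new TreeMap<>(Map.of("k1", "v1"));
        SSTable older = new SSTable(new TreeMap<>(Map.of("k2", "v2")), key -> key.equals("k2"));
        System.out.println(read("k2", memtable, List.of(older))); // Optional[v2]
    }
}
```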
Replication & Consensus Protocols
Relevant Files
- src/java/org/apache/cassandra/service/accord/AccordService.java
- src/java/org/apache/cassandra/service/paxos/Paxos.java
- src/java/org/apache/cassandra/tcm/ClusterMetadataService.java
- doc/modules/cassandra/pages/architecture/accord-architecture.adoc
Cassandra supports two consensus protocols for transactional operations: Paxos (traditional) and Accord (modern). Both ensure strong consistency guarantees, but they differ fundamentally in their approach to coordination, recovery, and execution.
Paxos Protocol
Paxos is a single-decree consensus algorithm that performs consensus on the exact set of writes to apply. The protocol operates in three phases:
- Prepare Phase: A coordinator seeks promises from replicas (the electorate) to not accept lower-numbered proposals. This phase also completes any in-progress proposals and commits.
- Propose Phase: The coordinator proposes a value with a higher ballot number.
- Commit Phase: Once a quorum accepts the proposal, the coordinator broadcasts the commit to all replicas.
Paxos recovery is straightforward: replaying the committed writes deterministically recreates the state. The protocol is implemented in Paxos.java and related classes like PaxosPrepare.java, PaxosPropose.java, and PaxosCommit.java.
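A minimal single-decree acceptor sketch showing the promise/accept rules behind the phases above; Cassandra's real implementation uses time-based UUID ballots and layers LWT reads and commits on top, so everything here is illustrative:

```java
public class PaxosAcceptorSketch {
    record Promise(boolean promised, long acceptedBallot, String acceptedValue) {}

    private long promisedBallot = -1;
    private long acceptedBallot = -1;
    private String acceptedValue = null;

    // Prepare phase: promise not to accept lower ballots; report any previously accepted value.
    synchronized Promise prepare(long ballot) {
        if (ballot <= promisedBallot)
            return new Promise(false, acceptedBallot, acceptedValue);   // reject
        promisedBallot = ballot;
        return new Promise(true, acceptedBallot, acceptedValue);        // may carry an in-progress value
    }

    // Propose phase: accept the value only if no higher ballot has been promised since.
    synchronized boolean accept(long ballot, String value) {
        if (ballot < promisedBallot) return false;
        promisedBallot = ballot;
        acceptedBallot = ballot;
        acceptedValue = value;
        return true;
    }

    public static void main(String[] args) {
        PaxosAcceptorSketch acceptor = new PaxosAcceptorSketch();
        System.out.println(acceptor.prepare(10));             // promised
        System.out.println(acceptor.accept(10, "UPDATE x"));  // true
        System.out.println(acceptor.prepare(5));              // rejected: lower ballot
    }
}
```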
Accord Protocol
Accord is a newer consensus protocol that performs consensus on the transaction definition, dependencies, and execution timestamp rather than the exact writes. This enables more efficient coordination and better performance under contention.
Key characteristics:
- PreAccept & Accept Phases: Coordinator determines when the transaction should execute by gathering dependency information from replicas.
- Execute Phase: Replicas execute the transaction at the agreed timestamp, computing writes deterministically.
- Asynchronous Commit: Accord defaults to asynchronous commit, improving latency.
The Accord implementation is split into coordinator and replica-local components. The AccordService.java manages the Accord node lifecycle, while AccordVerbHandler processes all Accord messages on replicas. Each replica maintains CommandStore instances (local shards) that track transaction metadata and execute commands sequentially.
Key Differences
┌─────────────────────┬──────────────────┬──────────────────┐
│ Aspect │ Paxos │ Accord │
├─────────────────────┼──────────────────┼──────────────────┤
│ Consensus On │ Exact writes │ Txn definition │
│ Recovery │ Replay writes │ Recompute writes │
│ Commit Model │ Synchronous │ Asynchronous │
│ Read Consistency │ QUORUM │ ONE (after setup) │
│ Contention Handling │ Ballot increment │ Timestamp adjust │
└─────────────────────┴──────────────────┴──────────────────┘
Cluster Metadata Service (TCM)
The ClusterMetadataService manages cluster topology and configuration using Paxos for consensus on metadata changes. It maintains a distributed metadata table replicated across CMS (Cluster Metadata Service) nodes, ensuring all nodes have consistent cluster state.
Live Migration
Cassandra supports live migration between Paxos and Accord on a per-table, per-range basis. This requires careful coordination:
- Paxos to Accord: Two-phase migration ensures Accord can safely read non-SERIAL writes after data repair completes.
- Accord to Paxos: Single-phase migration; Accord key repair ensures all Accord transactions are committed at QUORUM before switching.
Key barriers (per-key synchronization points) bridge the gap between protocols, allowing mixed-mode execution where different ranges use different consensus mechanisms.
Recovery Mechanisms
Both protocols support automatic recovery of in-progress transactions:
- Paxos: Recovery coordinator replays committed writes.
- Accord: Recovery coordinator recomputes writes from transaction definition and dependencies, requiring repeatable reads.
The recovery protocol ensures that even if the original coordinator fails, another node can safely complete the transaction deterministically.
Networking & Inter-node Communication
Relevant Files
- src/java/org/apache/cassandra/net/MessagingService.java
- src/java/org/apache/cassandra/net/SocketFactory.java
- src/java/org/apache/cassandra/net/OutboundConnections.java
- src/java/org/apache/cassandra/net/InboundSockets.java
- src/java/org/apache/cassandra/gms/Gossiper.java
- src/java/org/apache/cassandra/transport/Server.java
Cassandra's networking layer handles two distinct communication patterns: inter-node messaging for cluster coordination and data operations, and client-server communication via the native transport protocol. The architecture prioritizes reliability, performance, and preventing head-of-line blocking through sophisticated connection management and flow control.
Inter-node Messaging Architecture
MessagingService is the core component managing all inter-node communication. It maintains three separate TCP connections to each peer node, each optimized for different message sizes:
- Urgent Messages - Small, time-critical messages (gossip, failure detection)
- Small Messages - Messages under 65 KiB (most queries, mutations)
- Large Messages - Messages exceeding 65 KiB (bulk operations, streaming)
This separation prevents large messages from blocking smaller, latency-sensitive operations. Each connection type has its own queue and resource limits.
Wire Format & Framing
Messages are grouped into frames with integrity protection using CRC24 for headers and CRC32 for payloads. Optional LZ4 compression reduces bandwidth. Small messages are batched together, while large messages span multiple frames. The frame-based approach provides application-level error detection independent of TCP checksums.
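The checksum-on-write, verify-on-read idea can be sketched with the JDK's CRC32 (Cassandra's actual frame layout, with its CRC24 header and optional LZ4 block, is more involved):

```java
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

public class FrameChecksumSketch {
    static ByteBuffer encode(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload);
        return ByteBuffer.allocate(payload.length + Integer.BYTES + Long.BYTES)
                         .putInt(payload.length)    // length prefix
                         .put(payload)              // payload bytes
                         .putLong(crc.getValue())   // trailing checksum
                         .flip();
    }

    static byte[] decode(ByteBuffer frame) {
        byte[] payload = new byte[frame.getInt()];
        frame.get(payload);
        long expected = frame.getLong();
        CRC32 crc = new CRC32();
        crc.update(payload);
        if (crc.getValue() != expected)
            throw new IllegalStateException("corrupt frame: checksum mismatch");
        return payload;
    }

    public static void main(String[] args) {
        ByteBuffer frame = encode("MUTATION for partition 42".getBytes());
        System.out.println(new String(decode(frame))); // round-trips, or throws on corruption
    }
}
```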
Outbound Message Flow
OutboundConnection handles message serialization and delivery. For small and urgent messages, serialization occurs on the event loop. For large messages, serialization is offloaded to a thread pool to prevent blocking. Each connection enforces strict memory limits: 4 MiB per connection, 128 MiB per endpoint, 512 MiB globally.
Inbound Message Flow
Incoming frames are decoded by FrameDecoder, which validates integrity and decompresses if needed. InboundMessageHandler then deserializes messages and routes them to appropriate verb handlers. Small messages are deserialized immediately; large messages defer deserialization until execution on the target stage to conserve memory.
Inbound connections also enforce resource limits with flow control: if an endpoint exceeds its quota, the handler pauses frame reading until permits become available.
Gossip Protocol
Cassandra uses a push-pull gossip protocol for cluster state propagation. Every second, each node initiates a gossip round with a random peer:
- SYN - Initiator sends GossipDigestSyn containing digests of known endpoint states
- ACK - Receiver responds with GossipDigestAck containing state deltas
- ACK2 - Initiator sends GossipDigestAck2 with additional state updates
State is versioned using (generation, version) tuples where generation is a monotonic timestamp and version is a logical clock. This allows nodes to ignore stale information by comparing version numbers.
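A small sketch of that comparison rule, with illustrative record names rather than Gossiper's actual classes:

```java
public class GossipVersionSketch {
    record HeartbeatState(int generation, int version) {}

    static boolean isNewer(HeartbeatState candidate, HeartbeatState known) {
        if (candidate.generation() != known.generation())
            return candidate.generation() > known.generation(); // node restarted with a newer generation
        return candidate.version() > known.version();           // same lifetime, later logical clock
    }

    public static void main(String[] args) {
        HeartbeatState known = new HeartbeatState(1700000000, 42);
        HeartbeatState gossiped = new HeartbeatState(1700000000, 41);
        System.out.println(isNewer(gossiped, known)); // false: stale state, ignore it
    }
}
```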
Gossip also includes failure detection: the Phi Accrual Failure Detector monitors heartbeat intervals and calculates a phi value. When phi exceeds a threshold, the node is marked down and subscribers are notified.
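A simplified phi calculation, assuming heartbeat intervals are exponentially distributed with the observed mean (the real FailureDetector keeps a bounded window of intervals; the window size and threshold below are illustrative):

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class PhiAccrualSketch {
    private final Deque<Long> intervalsMillis = new ArrayDeque<>();
    private long lastHeartbeatMillis = -1;
    private static final int WINDOW = 1000;
    static final double CONVICT_THRESHOLD = 8.0; // phi above this marks the node down

    void heartbeat(long nowMillis) {
        if (lastHeartbeatMillis >= 0) {
            intervalsMillis.addLast(nowMillis - lastHeartbeatMillis);
            if (intervalsMillis.size() > WINDOW) intervalsMillis.removeFirst();
        }
        lastHeartbeatMillis = nowMillis;
    }

    double phi(long nowMillis) {
        if (intervalsMillis.isEmpty()) return 0.0;
        double mean = intervalsMillis.stream().mapToLong(Long::longValue).average().orElse(1.0);
        double sinceLast = nowMillis - lastHeartbeatMillis;
        // -log10(P(no heartbeat yet)) with P = exp(-t/mean)  ==>  t / (mean * ln 10)
        return sinceLast / (mean * Math.log(10.0));
    }

    public static void main(String[] args) {
        PhiAccrualSketch fd = new PhiAccrualSketch();
        for (long t = 0; t <= 10_000; t += 1_000) fd.heartbeat(t); // steady 1s heartbeats
        System.out.println(fd.phi(10_500)); // small phi: node looks alive
        System.out.println(fd.phi(40_000)); // large phi: likely down
    }
}
```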
Native Transport (Client Protocol)
Server manages client connections using Netty. It supports multiple protocol versions and handles authentication, query execution, and event subscriptions. Connections are pooled and tracked for resource management. The server enforces per-connection and global limits on concurrent requests to prevent resource exhaustion.
Connection Lifecycle
Outbound connections are initiated on-demand by OutboundConnectionInitiator and maintained with automatic reconnection on failure. Inbound connections are accepted by InboundSockets listening on configured ports. Both support optional TLS encryption via SocketFactory, which configures Netty pipelines with SSL handlers.
Resource Management
All connections participate in a hierarchical resource limit system:
- Connection-level - Exclusive quota for pending messages
- Endpoint-level - Shared quota across all connections to a peer
- Global-level - Cluster-wide aggregate limit
When limits are exceeded, messages are rejected with callbacks invoked immediately, preventing unbounded queue growth and memory exhaustion.
Repair & Data Streaming
Relevant Files
- src/java/org/apache/cassandra/repair/RepairSession.java
- src/java/org/apache/cassandra/service/ActiveRepairService.java
- src/java/org/apache/cassandra/streaming/StreamManager.java
- src/java/org/apache/cassandra/dht/RangeStreamer.java
- src/java/org/apache/cassandra/streaming/StreamPlan.java
- src/java/org/apache/cassandra/streaming/StreamSession.java
- src/java/org/apache/cassandra/repair/RepairJob.java
- src/java/org/apache/cassandra/repair/ValidationTask.java
- src/java/org/apache/cassandra/repair/SyncTask.java
Cassandra's repair and streaming subsystems work together to maintain data consistency across the cluster. Repair detects and fixes inconsistencies between replicas, while streaming transfers data efficiently during repair, bootstrap, and topology changes.
Repair Architecture
Repair is a multi-phase process coordinated by RepairSession and RepairJob:
- Paxos Repair Phase: Completes any unfinished Paxos operations in the target range/keyspace/table.
- Validation Phase: Each replica generates a Merkle tree of its data for the target range. ValidationTask sends validation requests to replicas and collects their trees. Validation tasks run sequentially per job to ensure trees are created roughly simultaneously across replicas.
- Synchronization Phase: Once all trees are received, RepairJob compares them to identify differences. For each pair of replicas with mismatches, a SyncTask is created to stream the differences. Sync phases run concurrently with each other and with validation phases.
Sync Task Types
Cassandra supports three sync task implementations:
- LocalSyncTask: Coordinator streams with a remote replica. The coordinator initiates bidirectional streaming using StreamPlan.
- SymmetricRemoteSyncTask: Two remote replicas stream with each other. The coordinator sends sync requests to both nodes.
- AsymmetricRemoteSyncTask: Optimized streaming where only necessary data flows between replicas based on Merkle tree differences.
Streaming System
StreamPlan and StreamCoordinator manage data transfer:
- StreamPlan: Builder class that configures streaming operations (bootstrap, repair, decommission). It specifies source/destination pairs and ranges to transfer.
- StreamCoordinator: Maintains multiple StreamSession instances per peer, handling connection pooling and session lifecycle.
- StreamSession: Manages bidirectional data exchange with a single peer. It handles incoming/outgoing streams, flow control, and completion tracking.
- StreamManager: Global singleton tracking all active streaming operations, rate limiting, and JMX metrics.
Rate Limiting & Flow Control
Streaming respects configurable throughput limits:
- stream_throughput_outbound_megabits_per_sec: General streaming rate limit
- inter_dc_stream_throughput_outbound_megabits_per_sec: Cross-datacenter limit
- entire_sstable_stream_throughput_outbound_megabits_per_sec: Full SSTable transfer limit
StreamRateLimiter enforces these limits using Guava's RateLimiter, with separate limiters for intra-DC and inter-DC traffic.
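A hedged sketch of the same throttling idea using Guava's RateLimiter directly; the byte-based conversion and the single limiter are simplifications of what StreamRateLimiter actually does:

```java
import com.google.common.util.concurrent.RateLimiter;

public class StreamThrottleSketch {
    public static void main(String[] args) {
        double bytesPerSecond = 25_000_000; // roughly 200 megabits per second, illustrative value
        RateLimiter limiter = RateLimiter.create(bytesPerSecond);

        byte[] chunk = new byte[64 * 1024];
        for (int i = 0; i < 100; i++) {
            limiter.acquire(chunk.length); // blocks until enough byte-permits are available
            sendChunk(chunk);              // hypothetical network write
        }
    }

    static void sendChunk(byte[] chunk) { /* placeholder for the actual socket write */ }
}
```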
Key Concepts
Merkle Trees: Hierarchical hash structures enabling efficient difference detection. Leaves represent data partitions; parent nodes hash their children. Comparing trees identifies mismatched ranges without transferring all data.
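A toy version of that comparison, hashing a handful of token-range "leaves" with SHA-256 and descending only into differing subtrees; real repair trees are far deeper and hash row content, so everything here is illustrative:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.*;

public class MerkleSketch {
    static byte[] hash(byte[]... parts) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            for (byte[] p : parts) md.update(p);
            return md.digest();
        } catch (Exception e) { throw new RuntimeException(e); }
    }

    // Build a complete binary tree bottom-up; index 1 is the root, leaves occupy the last level.
    static byte[][] build(List<String> leafData) {
        int n = leafData.size(); // assume n is a power of two for the sketch
        byte[][] nodes = new byte[2 * n][];
        for (int i = 0; i < n; i++)
            nodes[n + i] = hash(leafData.get(i).getBytes(StandardCharsets.UTF_8));
        for (int i = n - 1; i >= 1; i--)
            nodes[i] = hash(nodes[2 * i], nodes[2 * i + 1]);
        return nodes;
    }

    // Recursively compare; collect leaf indexes (token ranges) whose hashes differ.
    static void diff(byte[][] a, byte[][] b, int node, int leafCount, List<Integer> out) {
        if (Arrays.equals(a[node], b[node])) return;  // identical subtree: skip it entirely
        if (node >= leafCount) { out.add(node - leafCount); return; }
        diff(a, b, 2 * node, leafCount, out);
        diff(a, b, 2 * node + 1, leafCount, out);
    }

    public static void main(String[] args) {
        List<String> replica1 = List.of("r1", "r2", "r3", "r4");
        List<String> replica2 = List.of("r1", "r2", "rX", "r4");
        byte[][] t1 = build(replica1), t2 = build(replica2);
        List<Integer> mismatched = new ArrayList<>();
        diff(t1, t2, 1, replica1.size(), mismatched);
        System.out.println("ranges to stream: " + mismatched); // [2]
    }
}
```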
Pending Repair: Incremental repair marks repaired data with a session ID, allowing future repairs to skip already-repaired ranges. StreamPlan tracks pending repair IDs for proper SSTable marking.
Preview Repair: Non-destructive repair mode that validates consistency without streaming fixes. Useful for testing repair impact before committing changes.
Transient Replicas: Streaming distinguishes between full and transient replicas. Transient replicas receive data but don't participate in quorum reads, reducing replication overhead.
Indexing, Caching & Performance
Relevant Files
- src/java/org/apache/cassandra/index/sai
- src/java/org/apache/cassandra/service/CacheService.java
- src/java/org/apache/cassandra/cache/CaffeineCache.java
- src/java/org/apache/cassandra/db/compaction
- src/java/org/apache/cassandra/db/ReadCommand.java
Cassandra provides multiple layers of indexing and caching to optimize query performance. Understanding these mechanisms is critical for designing efficient data models and tuning cluster performance.
Storage-Attached Indexing (SAI)
SAI is Cassandra's modern secondary indexing system, designed to overcome limitations of legacy indexes. Unlike traditional secondary indexes that require read-before-write operations, SAI indexes are attached to SSTables and updated synchronously with mutations.
Key advantages:
- Reduced disk footprint: 20-35% overhead compared to unindexed data, with shared index structures across multiple columns
- Efficient on-disk formats: Strings use byte-ordered trie structures; numeric types use k-dimensional trees for range queries
- Zero-copy streaming: Indexes stream with SSTables during bootstrap/decommission without rebuilding
- Flexible query support: Equality, range queries, AND/OR logic, vector search, and CONTAINS operations
SAI maintains separate indexes for memtables (in-memory) and SSTables (on-disk), resolving differences at read time. Index builds execute on compaction threads and are visible as compaction operations.
Caching Layers
Cassandra implements three independent cache types managed by CacheService:
Key Cache: Caches partition key positions in SSTable index files, reducing disk seeks for partition lookups. Stores KeyCacheKey <-> AbstractRowIndexEntry mappings with configurable capacity.
Row Cache: Caches entire partitions in memory using a weighted eviction policy (W-TinyLFU via Caffeine). Useful for read-heavy workloads with hot partitions. Invalidated on writes to prevent stale data.
Counter Cache: Specialized cache for counter columns, storing clock-and-count pairs to optimize counter increments without full reads.
All caches support periodic persistence to disk, allowing recovery after restarts. Cache sizes are configured in cassandra.yaml and adjustable via JMX at runtime.
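As an illustration of the weighted-eviction idea, the sketch below builds a row-cache-like structure directly on Caffeine (the library noted above); the key and value types, sizes, and helper methods are hypothetical, not Cassandra's CacheService API:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

public class RowCacheSketch {
    private final Cache<String, byte[]> rowCache = Caffeine.newBuilder()
        .maximumWeight(64L * 1024 * 1024)                          // ~64 MiB of cached partitions
        .weigher((String key, byte[] partition) -> partition.length)
        .build();

    byte[] readPartition(String key) {
        byte[] cached = rowCache.getIfPresent(key);
        if (cached != null) return cached;             // cache hit: no SSTable access needed
        byte[] fromDisk = readFromSSTables(key);       // hypothetical slow path
        rowCache.put(key, fromDisk);
        return fromDisk;
    }

    void writePartition(String key, byte[] newData) {
        persist(key, newData);                         // hypothetical write path
        rowCache.invalidate(key);                      // drop the stale cached copy, as the text describes
    }

    private byte[] readFromSSTables(String key) { return new byte[0]; }
    private void persist(String key, byte[] data) {}
}
```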
Read Path Optimization
When executing a read command, Cassandra follows this sequence:
- Index selection: If indexes exist and row filters are present, the query planner selects the best index using getBestIndexQueryPlanFor()
- Cache lookup: For single-partition reads, the row cache is checked first (if enabled)
- Memtable & SSTable query: On a cache miss, data is fetched from memtables and SSTables in timestamp order
- Index-assisted filtering: SAI indexes narrow the result set before fetching full rows, reducing I/O
The ReadExecutionController manages the lifecycle of read operations, acquiring references to SSTables to prevent compaction from removing data mid-read.
Compaction and Index Maintenance
Compaction merges SSTables and triggers index updates. Different compaction strategies balance read vs. write amplification:
- Unified Compaction Strategy (UCS): Recommended for most workloads; combines tiered and leveled approaches with parallel sharding
- Leveled Compaction Strategy (LCS): Optimized for read-heavy workloads; maintains non-overlapping SSTables per level
- Size-Tiered Compaction Strategy (STCS): Default; good for write-heavy workloads
- Time Window Compaction Strategy (TWCS): Optimized for TTL'd time-series data
During compaction, indexes are incrementally rebuilt for affected SSTables. SAI supports both full rebuilds and partial incremental builds, minimizing overhead.
Performance Tuning
For indexing: Create SAI indexes on frequently filtered columns. Avoid indexing high-cardinality columns or partition keys. Monitor index disk usage and query latency via metrics.
For caching: Enable row cache for hot partitions; disable for write-heavy tables. Tune key cache size based on working set. Use counter cache for counter-heavy workloads.
For compaction: Choose UCS for balanced workloads, LCS for read-heavy, STCS for write-heavy. Monitor compaction throughput and read amplification metrics.