Overview
Relevant Files
- README.rst
- python/ray/__init__.py
- doc/source/ray-overview/index.md
- doc/source/ray-overview/getting-started.md
Ray is a unified, open-source framework for scaling AI and Python applications. It provides a simple, universal API for building distributed applications that seamlessly scale from a laptop to a large cluster without requiring deep distributed systems expertise.
What Ray Solves
Modern ML workloads are increasingly compute-intensive. Ray bridges the gap between single-node development and production-scale distributed computing by enabling the same Python code to run efficiently on any infrastructure—from your laptop to cloud clusters.
Ray consists of three core layers:
- Ray AI Libraries – Domain-specific tools for ML workflows (Data, Train, Tune, Serve, RLlib)
- Ray Core – General-purpose distributed computing primitives (Tasks, Actors, Objects)
- Ray Clusters – Deployment infrastructure with autoscaling support
Core Abstractions
Tasks are stateless functions decorated with @ray.remote that execute in parallel across the cluster. Actors are stateful worker processes that maintain internal state and execute remote method calls sequentially. Objects are immutable values stored in Ray's distributed object store, accessible across the cluster via object references.
Ray AI Libraries
Ray provides five specialized libraries for common ML tasks:
- Data – Scalable, framework-agnostic data loading and transformation
- Train – Distributed multi-node training with fault tolerance (PyTorch, TensorFlow)
- Tune – Hyperparameter tuning at scale with efficient search algorithms
- Serve – Production-ready model serving with optional microbatching
- RLlib – Distributed reinforcement learning with high-performance algorithms
Key Features
- Seamless Scaling – Same code runs on laptop or cluster without modification
- Fault Tolerance – Automatic recovery from node failures
- Auto-scaling – Dynamic resource allocation based on workload demand
- Multi-language Support – Python and Java APIs available
- Cloud Integration – Native support for AWS, GCP, Azure, Kubernetes
- Observability – Built-in dashboard and state APIs for monitoring
Getting Started
Install Ray with pip install ray. For specific libraries, use pip install "ray[data]", pip install "ray[train]", etc. Ray automatically initializes on first use of distributed APIs, or explicitly with ray.init().
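As a quick sanity check after installation, a minimal sketch that starts Ray explicitly and inspects the resources it detected (output values depend on your machine):

import ray

# Start a local Ray instance explicitly (also happens implicitly on first use of distributed APIs).
ray.init()

# Inspect the resources Ray detected on this machine.
print(ray.cluster_resources())  # e.g. {'CPU': 8.0, 'memory': ..., ...}

# Shut down the local instance when finished.
ray.shutdown()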
Architecture & Core Components
Relevant Files
- src/ray/gcs/gcs_server.h
- src/ray/gcs/gcs_server.cc
- src/ray/raylet/node_manager.h
- src/ray/core_worker/core_worker.h
- src/ray/object_manager/object_manager.h
Ray Core is built on a distributed architecture with three primary components: the Global Control Service (GCS), Raylets, and Core Workers. These components communicate via gRPC and manage tasks, actors, objects, and cluster resources.
System Overview
Global Control Service (GCS)
The GCS is the cluster's control plane, managing all cluster-level metadata and operations. It runs as a single centralized service and coordinates with all nodes.
Key Managers:
- GcsNodeManager - Tracks alive, draining, and dead nodes; handles node registration and health monitoring
- GcsActorManager - Manages actor lifecycle: registration, scheduling, creation, and failure recovery
- GcsJobManager - Tracks job lifecycle, runtime environments, and job completion
- GcsTaskManager - Collects and stores task event data for observability and debugging
- GcsPlacementGroupManager - Schedules placement groups across nodes for resource-aware task placement
- GcsResourceManager - Maintains a cluster-wide view of resource availability used for scheduling decisions
- GcsWorkerManager - Tracks worker registration and lifecycle
The GCS uses GcsTableStorage (backed by Redis or in-memory) for persistent state and GcsPublisher for pub/sub notifications to keep all components synchronized.
Raylet (Node Manager)
Each node runs a Raylet, which is the local resource manager and task scheduler. It manages worker processes, object storage, and local task execution.
Responsibilities:
- Worker Pool Management - Starts, monitors, and recycles worker processes
- Local Scheduling - Schedules tasks on local workers using ClusterResourceScheduler
- Object Management - Coordinates with ObjectManager for object storage and transfer
- Resource Reporting - Reports node resources to GCS via RaySyncer
- Lease Management - Manages worker leases through LocalLeaseManager and ClusterLeaseManager
Core Worker
Each worker process runs a CoreWorker, which executes tasks and manages object references.
Key Responsibilities:
- Task Execution - Receives and executes tasks via TaskReceiver and SchedulingQueue
- Task Submission - Submits new tasks via NormalTaskSubmitter or ActorTaskSubmitter
- Object Store Access - Puts/gets objects from Plasma store or memory store
- Reference Counting - Tracks object ownership via ReferenceCounter for garbage collection
- Actor Management - Handles actor method calls and state management
Object Manager
The ObjectManager handles distributed object storage and transfer across nodes using the Plasma in-memory object store.
Operations:
- Pull - Fetches objects from remote nodes when needed by tasks
- Push - Transfers objects to nodes that request them
- Object Directory - Tracks which nodes contain which objects
- Spilling - Moves objects to disk when memory is full
Communication Flow
- Task Submission: Driver invokes a remote function via .remote() → CoreWorker submits to GCS → GCS schedules on Raylet
- Task Execution: Raylet assigns task to worker → CoreWorker executes → Returns object reference
- Object Access: CoreWorker requests object → ObjectManager pulls from remote Plasma store → Returns to caller
- Actor Creation: CoreWorker registers actor → GCS schedules creation task → Raylet executes on worker
Key Design Patterns
- Asynchronous I/O - All components use ASIO for non-blocking operations
- Pub/Sub Messaging - GCS publishes state changes; components subscribe for updates
- Reference Counting - Objects are garbage collected when no references remain
- Fault Tolerance - GCS can recover from Redis; tasks can be retried on failure
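As a user-level illustration of the retry-based fault tolerance above, a minimal sketch (the retry counts here are arbitrary):

import ray

# Tasks can be retried automatically if the worker or node executing them dies.
@ray.remote(max_retries=3)
def flaky_task(x):
    return x * 2

# Actors can be restarted after failure; their method calls can also be retried.
@ray.remote(max_restarts=2, max_task_retries=2)
class FlakyActor:
    def ping(self):
        return "pong"

print(ray.get(flaky_task.remote(21)))   # 42, retried up to 3 times on failure
actor = FlakyActor.remote()
print(ray.get(actor.ping.remote()))     # "pong"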
Ray Core: Tasks, Actors & Objects
Relevant Files
- python/ray/remote_function.py
- python/ray/actor.py
- src/ray/core_worker/core_worker.cc
- src/ray/core_worker/task_submission/normal_task_submitter.cc
- src/ray/core_worker/actor_manager.cc
- src/ray/core_worker/reference_counter.cc
Ray Core provides three fundamental primitives for distributed computing: tasks, actors, and objects. These work together to enable flexible, scalable applications.
Tasks
A task is an asynchronous function execution on a remote worker. When you decorate a function with @ray.remote and call .remote(), Ray creates a RemoteFunction wrapper that submits the task for execution.
The task submission flow:
- .remote() pickles the function and stores it in the GCS key-value store (once per function definition)
- Arguments are classified into three types: pass-by-reference (ObjectRefs), pass-by-value inline (small objects), and pass-by-value non-inline (large objects put in the object store)
- A TaskSpecification is built containing the function ID, arguments, and resource requirements
- The spec is submitted to NormalTaskSubmitter, which immediately returns an ObjectRef
The scheduler then waits for all ObjectRef arguments to be available, requests a worker lease from the raylet, and sends a PushTask RPC to execute the task.
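To make the argument classification concrete, an illustrative sketch (the exact size threshold for inlining is an internal detail and not shown here):

import numpy as np
import ray

@ray.remote
def consume(a, b, c):
    return a + b + int(c.sum())

small = 7                            # small value: inlined into the task spec (pass-by-value)
large = np.zeros(1_000_000)          # large value: placed in the object store, passed by reference
ref = ray.put(3)                     # explicit ObjectRef: passed by reference, resolved by the worker

result_ref = consume.remote(small, ref, large)
print(ray.get(result_ref))           # 10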
Actors
An actor is a stateful worker that maintains mutable state across method calls. Creating an actor with ActorClass.remote() submits an ACTOR_CREATION_TASK that instantiates the actor on a dedicated worker.
Key differences from tasks:
- Actors are registered with the GCS before execution begins
- All methods on an actor execute on the same worker, preserving state
- Actor methods are submitted as ACTOR_TASK types to the actor's dedicated worker
- The actor handle maintains a cursor to track method execution order
Objects & References
Remote objects are immutable values stored in the distributed object store (one per node). Objects are referenced via ObjectRef, which is a unique ID that can be passed between workers.
Objects are created by:
- Task return values (automatically stored if large)
- ray.put() calls (explicitly store a value)
The reference counter tracks object ownership and lifecycle. When an ObjectRef goes out of scope, the reference counter decrements the count. Once all references are released, the object becomes eligible for garbage collection and eviction from the object store.
import ray

@ray.remote
def add(a, b):
    return a + b

# Task submission returns immediately
result_ref = add.remote(1, 2)

# Fetch the result
result = ray.get(result_ref)  # 3

@ray.remote
class Counter:
    def __init__(self):
        self.count = 0

    def increment(self):
        self.count += 1
        return self.count

# Create actor
counter = Counter.remote()

# Call actor method
ref = counter.increment.remote()
print(ray.get(ref))  # 1
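A further sketch showing explicit object creation with ray.put() and reference release; when eviction actually occurs is decided by Ray's reference counting and memory pressure:

import ray

# Explicitly store a value in the object store; returns an ObjectRef.
obj_ref = ray.put({"weights": [1, 2, 3]})

# The ObjectRef can be passed to tasks/actors or fetched directly.
print(ray.get(obj_ref))  # {'weights': [1, 2, 3]}

# Once the last reference goes out of scope, the reference counter
# releases the object and it becomes eligible for eviction.
del obj_ref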
Ray Data: Scalable Data Processing
Relevant Files
- python/ray/data/__init__.py
- python/ray/data/dataset.py
- python/ray/data/block.py
- python/ray/data/_internal/execution/streaming_executor.py
- doc/source/data/data.rst
Ray Data is a scalable data processing library for AI workloads built on Ray. It provides flexible APIs for batch inference, data preprocessing, and ML training data loading. Unlike traditional distributed data systems, Ray Data features a streaming execution engine that efficiently processes large datasets while maintaining high utilization across CPU and GPU workloads.
Core Concepts
Datasets and Blocks
A Dataset is a distributed data collection partitioned into blocks, where each block is the basic unit of data stored in Ray's object store. Blocks are represented as either PyArrow Tables or Pandas DataFrames, enabling efficient parallel processing. The block-level parallelism allows Ray Data to scale from single machines to hundreds of nodes processing terabytes of data.
Lazy Execution
Dataset transformations are lazy—they don't execute immediately. Instead, Ray Data builds a logical plan describing the operations. Execution is triggered only when you call a consumption method like .show(), .take(), or .iter_batches(). This enables optimization of the entire pipeline before execution begins.
Key Operations
Ray Data supports three categories of operations:
- Transformations (return a new Dataset):
  - map() and map_batches() for row-wise or batch-wise transformations
  - filter() for row filtering with predicates
  - groupby() for grouping and aggregation
  - sort(), random_shuffle(), repartition() for data reorganization
  - join() for combining datasets
- Aggregations (return scalar values):
  - min(), max(), mean(), sum() for simple statistics
  - aggregate() with custom aggregation functions
  - groupby().aggregate() for grouped aggregations
- Consumption (trigger execution):
  - iter_batches() for streaming iteration
  - take(), take_all() for collecting results
  - write_*() methods for saving to external storage
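For example, the transformation and aggregation paths above can be combined as follows (a minimal sketch over an in-memory dataset):

import ray

# Build a small in-memory dataset of 10 rows.
ds = ray.data.from_items([{"group": i % 2, "value": i} for i in range(10)])

# Transformation: keep rows with value >= 2 (lazy, returns a new Dataset).
ds = ds.filter(lambda row: row["value"] >= 2)

# Grouped aggregation: sum values per group, then trigger execution.
ds.groupby("group").sum("value").show()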
Streaming Execution Engine
Ray Data's streaming executor uses an event-loop approach to maximize throughput under resource constraints:
import ray

# Transformations are lazy
ds = ray.data.read_csv("data.csv")
ds = ds.map_batches(preprocess_fn, batch_size=64)
ds = ds.filter(lambda x: x["score"] > 0.5)
# Execution starts here
for batch in ds.iter_batches(batch_size=32):
    process(batch)
The executor manages:
- Operator scheduling: Selects which operator to run next based on resource availability
- Resource management: Tracks CPU, GPU, and object store memory usage
- Backpressure policies: Prevents memory bloat by throttling upstream operators when downstream queues fill up
- Task dispatch: Launches Ray tasks or actors based on compute strategy
Compute Strategies
Ray Data supports two execution strategies:
- TaskPoolStrategy: Launches stateless Ray tasks for each operation (default, lower overhead)
- ActorPoolStrategy: Creates a pool of stateful Ray actors for reuse (better for stateful operations)
# Using actors for stateful transformations
ds.map_batches(
    ModelInference,
    compute=ray.data.ActorPoolStrategy(size=4),
    num_gpus=1
)
Data Format Support
Ray Data integrates with multiple data sources and formats:
- File formats: Parquet, CSV, JSON, Avro, TFRecords, Lance, images, audio, video
- Cloud storage: S3, GCS, Azure Blob Storage, HDFS
- Databases: BigQuery, Snowflake, SQL databases, MongoDB
- Frameworks: Pandas, PyArrow, Spark, Dask, Modin, HuggingFace, PyTorch, TensorFlow
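For instance, reading Parquet from cloud storage (a sketch; the bucket path is a placeholder):

import ray

# Read a Parquet dataset directly from S3 (path is a placeholder).
ds = ray.data.read_parquet("s3://my-bucket/path/to/table/")

# Peek at the schema and a few rows without materializing everything.
print(ds.schema())
ds.show(limit=5)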
Preprocessing and ML Integration
Ray Data includes built-in preprocessors for common ML tasks:
- Scaling and normalization (StandardScaler, MinMaxScaler)
- Encoding (LabelEncoder, OneHotEncoder)
- Imputation and discretization
- Tokenization and vectorization
- Chaining multiple preprocessors together
Preprocessors can be fit on training data and applied to test data, enabling reproducible ML pipelines.
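A minimal preprocessor sketch, assuming a numeric column named "x":

import ray
from ray.data.preprocessors import StandardScaler

train_ds = ray.data.from_items([{"x": float(i)} for i in range(100)])
test_ds = ray.data.from_items([{"x": float(i)} for i in range(100, 110)])

# Fit statistics (mean/std) on the training data only...
scaler = StandardScaler(columns=["x"])
scaler.fit(train_ds)

# ...then apply the same transformation to the test data.
scaled_test = scaler.transform(test_ds)
scaled_test.show(limit=3)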
Performance Characteristics
Ray Data optimizes for:
- Streaming efficiency: Data flows through the pipeline without materializing intermediate results
- GPU utilization: Overlaps CPU preprocessing with GPU inference/training
- Memory efficiency: Adaptive batching and backpressure prevent object store overflow
- Fault tolerance: Integrates with Ray's fault tolerance mechanisms for reliable execution
Ray Train: Distributed Model Training
Relevant Files
- python/ray/train/__init__.py
- python/ray/train/base_trainer.py
- python/ray/train/data_parallel_trainer.py
- python/ray/train/backend.py
- python/ray/air/config.py
- python/ray/train/_internal/data_config.py
Ray Train is a scalable machine learning library for distributed training and fine-tuning. It abstracts away the complexities of distributed computing, allowing you to scale model training from a single machine to a cluster with minimal code changes.
Core Architecture
Ray Train follows a trainer-based architecture where trainers orchestrate distributed training jobs. The main components are:
- Trainers - High-level APIs for different frameworks (PyTorch, TensorFlow, XGBoost, etc.)
- Backends - Framework-specific distributed setup (NCCL for PyTorch, Gloo for CPU, etc.)
- Workers - Ray Actors that execute the training function on distributed nodes
- Datasets - Ray Data integration for efficient data loading and sharding
Trainer Lifecycle
When you call trainer.fit(), the following sequence occurs:
- Initialization - Trainer is instantiated locally with configuration
- Serialization - Trainer is pickled and sent to a remote Ray actor
- Setup - trainer.setup() runs on the remote actor for heavyweight initialization
- Training Loop - trainer.training_loop() executes the main training logic
- Results - Returns a Result object with metrics and checkpoints
Key Configuration Classes
ScalingConfig - Controls distributed training scale:
from ray.train import ScalingConfig

scaling_config = ScalingConfig(
    num_workers=4,                 # Number of training workers
    use_gpu=True,                  # Enable GPU per worker
    resources_per_worker={"GPU": 1, "CPU": 8},
    placement_strategy="SPREAD"    # Spread workers across nodes
)
RunConfig - Manages execution and checkpointing:
from ray.train import RunConfig, CheckpointConfig, FailureConfig

run_config = RunConfig(
    name="my_training_run",
    storage_path="s3://my-bucket/checkpoints",
    checkpoint_config=CheckpointConfig(num_to_keep=3),
    failure_config=FailureConfig(max_failures=3)
)
DataConfig - Configures dataset preprocessing and sharding:
from ray.train import DataConfig

data_config = DataConfig(
    datasets_to_split=["train"],   # Which datasets to shard
    enable_shard_locality=True     # Prefer local data
)
Data Parallel Training
DataParallelTrainer implements SPMD (Single Program, Multiple Data) training where the same function runs on all workers with different data shards:
from ray.train import ScalingConfig
from ray.train.data_parallel_trainer import DataParallelTrainer
from ray import train

def train_loop_per_worker():
    # Each worker gets a shard of the training dataset
    dataset_shard = train.get_dataset_shard("train")

    # Access distributed context
    world_size = train.get_context().get_world_size()
    rank = train.get_context().get_world_rank()

    # Report metrics and checkpoints
    train.report({"loss": 0.5})

trainer = DataParallelTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    datasets={"train": ray_dataset}
)
result = trainer.fit()
Framework-Specific Trainers
Ray Train provides specialized trainers for popular frameworks:
- TorchTrainer - Automatic NCCL/Gloo setup for PyTorch
- TensorflowTrainer - Distributed TensorFlow environment configuration
- XGBoostTrainer - XGBoost distributed training with Ray
- JAXTrainer - JAX multi-device training
Each trainer handles framework-specific initialization while inheriting the core trainer lifecycle.
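For instance, the PyTorch path looks like this (a minimal sketch; the training function body is a stub and the worker count is arbitrary):

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
from ray import train

def train_func():
    # Build the model/optimizer here and wrap them with
    # train.torch.prepare_model(...) / prepare_data_loader(...) as needed.
    train.report({"loss": 0.0})

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
)
result = trainer.fit()
print(result.metrics)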
Backend System
The Backend and BackendConfig classes enable framework-specific distributed setup:
from ray.train.backend import Backend, BackendConfig

class MyBackendConfig(BackendConfig):
    @property
    def backend_cls(self):
        return MyBackend

class MyBackend(Backend):
    def on_start(self, worker_group, backend_config):
        # Setup distributed environment
        pass
Backends handle CUDA device visibility, process group initialization, and framework-specific environment variables.
Ray Tune & Serve: Hyperparameter Tuning & Model Serving
Relevant Files
- python/ray/tune/tuner.py
- python/ray/tune/tune.py
- python/ray/tune/tune_config.py
- python/ray/tune/search/
- python/ray/tune/schedulers/
- python/ray/serve/deployment.py
- python/ray/serve/api.py
- python/ray/serve/handle.py
Ray Tune and Ray Serve are complementary systems for machine learning workflows: Tune optimizes hyperparameters during training, while Serve deploys trained models for inference at scale.
Ray Tune: Hyperparameter Tuning
Ray Tune is a distributed hyperparameter tuning framework that runs multiple training trials in parallel. The core workflow uses the Tuner class:
from ray import tune
from ray.tune.search.bayesopt import BayesOptSearch
from ray.tune.schedulers import AsyncHyperBandScheduler

tuner = tune.Tuner(
    trainable=my_training_function,
    param_space={"lr": tune.loguniform(1e-4, 1e-1), "batch_size": tune.choice([32, 64])},
    tune_config=tune.TuneConfig(
        metric="loss",
        mode="min",
        num_samples=10,
        search_alg=BayesOptSearch(),
        scheduler=AsyncHyperBandScheduler()
    ),
    run_config=tune.RunConfig(name="my_experiment", storage_path="~/ray_results")
)
results = tuner.fit()
Key Components:
- Tuner: Entry point that orchestrates the entire tuning job. Manages configuration, execution, and result collection.
- TuneConfig: Specifies tuning behavior including the metric to optimize, search algorithm, and trial scheduler.
- RunConfig: Handles job-level settings like storage, checkpointing, and fault tolerance.
- Search Algorithms: Generate hyperparameter configurations. BasicVariantGenerator is the default; others include BayesOptSearch, OptunaSearch, and AxSearch.
- Trial Schedulers: Control trial execution and early stopping. FIFOScheduler runs trials sequentially; AsyncHyperBandScheduler (ASHA) and HyperBandScheduler perform adaptive resource allocation; PopulationBasedTraining (PBT) mutates hyperparameters during training.
- ConcurrencyLimiter: Wrapper that limits concurrent trials from any search algorithm.
Tune reports metrics via tune.report() and supports checkpointing for fault tolerance and warm-start resumption.
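A minimal trainable sketch that reports a metric each step (the loss formula is a stand-in; the dict-style tune.report() call assumes a recent Ray 2.x API):

from ray import tune

def trainable(config):
    for step in range(10):
        # Stand-in "loss" that improves as lr approaches 0.01.
        loss = (config["lr"] - 0.01) ** 2 + 1.0 / (step + 1)
        tune.report({"loss": loss})

tuner = tune.Tuner(
    trainable,
    param_space={"lr": tune.loguniform(1e-4, 1e-1)},
    tune_config=tune.TuneConfig(metric="loss", mode="min", num_samples=4),
)
results = tuner.fit()
print(results.get_best_result().config)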
Ray Serve: Model Serving
Ray Serve deploys models as scalable HTTP/gRPC services. The core pattern uses the @serve.deployment decorator:
from ray import serve

@serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 1})
class MyModel:
    def __init__(self, model_path: str):
        self.model = load_model(model_path)

    def __call__(self, request):
        return self.model.predict(request)

app = MyModel.bind(model_path="path/to/model")
handle = serve.run(app)
Key Components:
- Deployment: A class or function decorated with @serve.deployment. Each deployment runs as Ray actors (replicas) that handle requests.
- Application: One or more deployments bound with arguments. The ingress deployment receives external traffic.
- DeploymentHandle: Enables inter-deployment communication for model composition. Allows one deployment to call another without HTTP overhead.
- Replicas: Individual actor instances of a deployment. Scale horizontally to handle load.
Deployments compose via Deployment.bind(), enabling multi-stage pipelines. The serve.run() function deploys an application and returns a handle for programmatic access.
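A sketch of composition through a DeploymentHandle, assuming a recent Serve release where bound deployments are passed to constructors as handles (the doubling "preprocessor" stands in for real preprocessing logic):

from ray import serve

@serve.deployment
class Preprocessor:
    def process(self, value: int) -> int:
        return value * 2

@serve.deployment
class Pipeline:
    def __init__(self, preprocessor):
        # A DeploymentHandle to the bound Preprocessor deployment.
        self.preprocessor = preprocessor

    async def __call__(self, value: int) -> int:
        # Call the downstream deployment directly, without HTTP overhead.
        return await self.preprocessor.process.remote(value)

app = Pipeline.bind(Preprocessor.bind())
handle = serve.run(app)
print(handle.remote(3).result())  # 6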
Integration Pattern
Tune and Serve work together in ML workflows:
- Use Tune to find optimal hyperparameters
- Save the best model from results.get_best_result()
- Deploy the model with Serve using the tuned hyperparameters
- Scale inference with multiple replicas and autoscaling policies
best_result = results.get_best_result(metric="loss", mode="min")
best_model_path = best_result.checkpoint.path

@serve.deployment(num_replicas=4)
class TunedModel:
    def __init__(self, model_path):
        self.model = load_model(model_path)

    def __call__(self, x):
        return self.model.predict(x)

app = TunedModel.bind(best_model_path)
serve.run(app)
Both systems leverage Ray's distributed execution engine for efficient resource utilization across clusters.
RLlib: Scalable Reinforcement Learning
Relevant Files
- rllib/__init__.py
- rllib/algorithms/algorithm.py
- rllib/algorithms/algorithm_config.py
- rllib/policy/policy.py
- rllib/core/rl_module/rl_module.py
- rllib/connectors/connector_v2.py
- rllib/env/env_runner.py
- rllib/core/learner/learner.py
RLlib is Ray's production-grade reinforcement learning library, providing scalable, fault-tolerant RL training with support for single-agent, multi-agent, offline, and externally-connected environments. It abstracts distributed training complexity while maintaining flexible APIs for custom algorithms.
Core Architecture
RLlib's architecture centers on three main components working in concert:
Algorithm (rllib/algorithms/algorithm.py) is the runtime orchestrator that manages the entire training loop. It coordinates environment sampling, model updates, and checkpointing. You configure it via AlgorithmConfig, which provides a fluent API for setting hyperparameters, environment details, and resource allocation.
EnvRunners (rllib/env/env_runner.py) are distributed actors that collect training data by stepping through environments. They use RLModules to compute actions and connectors to transform environment observations into model-readable tensors. Multiple EnvRunners run in parallel across a Ray cluster for scalable data collection.
Learners (rllib/core/learner/learner.py) are distributed actors that perform gradient-based updates on RLModules using collected data. They handle loss computation, backpropagation, and weight synchronization across multiple GPUs or machines.
RLModules and Neural Networks
RLModules (rllib/core/rl_module/rl_module.py) are framework-specific neural network wrappers that define three forward passes:
- forward_exploration() - Computes actions with exploration during data collection
- forward_inference() - Computes deterministic actions for evaluation or deployment
- forward_train() - Prepares data for loss computation during training
RLlib provides default RLModules with configurable architectures (MLPs, CNNs, RNNs), or you can implement custom modules in PyTorch. For multi-agent scenarios, MultiRLModule contains multiple sub-modules, each identified by a ModuleID.
Connectors: Data Transformation Pipelines
Connectors (rllib/connectors/connector_v2.py) are composable data transformation pipelines that bridge environments and models. Three pipeline types exist:
- Env-to-Module - Transforms raw environment observations (preprocessing, filtering, RNN state handling)
- Module-to-Env - Converts model outputs to environment actions (sampling, clipping)
- Learner - Converts episodes to training batches for loss computation
Connectors enable custom data handling without modifying core algorithm logic, supporting stateful operations like observation normalization.
Training Loop
A typical training iteration follows this flow:
from ray.rllib.algorithms.ppo import PPOConfig

config = PPOConfig().environment("CartPole-v1").training(lr=0.0001)
algo = config.build()

num_iterations = 10  # arbitrary number of training iterations

for _ in range(num_iterations):
    # EnvRunners sample episodes in parallel
    # Connectors transform episodes to batches
    # Learners compute losses and update RLModule weights
    result = algo.train()
    print(result)

algo.stop()
Multi-Agent and Offline RL
RLlib natively supports multi-agent training via MultiAgentEnv and policy mapping functions. For offline RL, algorithms can train from pre-collected datasets without environment interaction, using specialized input readers and off-policy estimators.
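A hedged sketch of a multi-agent configuration (the environment name, agent IDs, and policy IDs are placeholders; the policy_mapping_fn signature follows the new API stack):

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    # "my_multi_agent_env" is a placeholder for a registered MultiAgentEnv.
    .environment("my_multi_agent_env")
    .multi_agent(
        policies={"policy_0", "policy_1"},
        # Route one agent to each policy; real mappings can be arbitrary.
        policy_mapping_fn=lambda agent_id, episode, **kw: (
            "policy_0" if agent_id == "agent_0" else "policy_1"
        ),
    )
)
algo = config.build()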
Key Design Patterns
RLlib uses composition over inheritance - algorithms are built from configurable components rather than deep class hierarchies. The connector pattern decouples data transformation from algorithm logic. Ray actors provide transparent distributed execution with fault tolerance and auto-scaling.
Cluster Management & Autoscaling
Relevant Files
- python/ray/autoscaler/__init__.py
- python/ray/autoscaler/node_provider.py
- python/ray/autoscaler/_private/autoscaler.py
- python/ray/autoscaler/v2/autoscaler.py
- python/ray/autoscaler/v2/scheduler.py
- python/ray/autoscaler/_private/resource_demand_scheduler.py
- python/ray/autoscaler/_private/monitor.py
- python/ray/autoscaler/_private/load_metrics.py
Ray's autoscaler automatically adjusts cluster size based on workload demands. It monitors resource utilization, schedules pending tasks and actors, and manages node lifecycle across multiple cloud providers.
Architecture Overview
Core Components
Monitor Loop (monitor.py): Runs periodically on the head node, collecting cluster state and triggering autoscaling decisions. It queries the GCS for resource demands and node status.
Load Metrics (load_metrics.py): Aggregates cluster telemetry from raylet heartbeats, including static/dynamic resources, pending placement groups, and waiting task bundles. Tracks both feasible and infeasible resource requests.
Resource Demand Scheduler: Determines which nodes to launch or terminate using bin-packing algorithms. Considers node type constraints, min/max node counts, and placement group requirements. Two implementations exist: v1 (resource_demand_scheduler.py) and v2 (v2/scheduler.py).
Node Provider (node_provider.py): Abstract interface for cloud-specific operations. Implementations exist for AWS, GCP, Azure, Kubernetes, and local providers. Must be thread-safe for concurrent node creation/termination.
Instance Manager (v2/instance_manager/): Manages instance lifecycle in v2 autoscaler. Tracks instance state transitions (QUEUED → REQUESTED → ALLOCATED → RAY_INSTALLING → RAY_RUNNING) and coordinates with subscribers for Ray installation and monitoring.
Scaling Decision Flow
- Collect Metrics: Monitor queries GCS for pending tasks, actors, and placement groups
- Compute Demands: Load metrics aggregates resource requirements into demand vectors
- Schedule: Scheduler performs bin-packing to fit demands onto existing/new nodes
- Execute: Instance manager launches new nodes or terminates idle ones via cloud provider
- Install: Ray runtime installed on new nodes; nodes join cluster when ready
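Demand can also be injected programmatically through the autoscaler SDK; a sketch with arbitrary values:

from ray.autoscaler.sdk import request_resources

# Ask the autoscaler to provision capacity for 16 CPUs in total.
request_resources(num_cpus=16)

# Or request specific resource bundles (e.g., two 1-GPU workers).
request_resources(bundles=[{"GPU": 1}] * 2)

# Each call overrides the previous request; requesting zero lets the
# cluster scale back down once it is idle.
request_resources(num_cpus=0)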
Key Features
- Multi-node-type support: Different node types (CPU, GPU, memory-optimized) with separate min/max constraints
- Placement group awareness: Respects gang scheduling requirements and affinity constraints
- Idle node termination: Removes underutilized nodes after configurable timeout
- Cloud provider abstraction: Pluggable providers enable multi-cloud deployments
- GCS fault tolerance: Integrates with Redis-backed GCS for cluster state recovery