Overview
Relevant Files
- README.rst
- python/ray/__init__.py
- doc/source/ray-overview/index.md
- doc/source/ray-overview/getting-started.md
Ray is a unified, open-source framework for scaling AI and Python applications. It provides a simple, universal API for building distributed applications that seamlessly scale from a laptop to a large cluster without requiring deep distributed systems expertise.
What Ray Solves
Modern ML workloads are increasingly compute-intensive. Ray bridges the gap between single-node development and production-scale distributed computing by enabling the same Python code to run efficiently on any infrastructure—from your laptop to cloud clusters.
Ray consists of three core layers:
- Ray AI Libraries – Domain-specific tools for ML workflows (Data, Train, Tune, Serve, RLlib)
- Ray Core – General-purpose distributed computing primitives (Tasks, Actors, Objects)
- Ray Clusters – Deployment infrastructure with autoscaling support
Core Abstractions
Tasks are stateless functions decorated with @ray.remote that execute in parallel across the cluster. Actors are stateful worker processes that maintain internal state and execute remote method calls sequentially. Objects are immutable values stored in Ray's distributed object store, accessible across the cluster via object references.
Ray AI Libraries
Ray provides five specialized libraries for common ML tasks:
- Data – Scalable, framework-agnostic data loading and transformation
- Train – Distributed multi-node training with fault tolerance (PyTorch, TensorFlow)
- Tune – Hyperparameter tuning at scale with efficient search algorithms
- Serve – Production-ready model serving with optional microbatching
- RLlib – Distributed reinforcement learning with high-performance algorithms
Key Features
- Seamless Scaling – Same code runs on laptop or cluster without modification
- Fault Tolerance – Automatic recovery from node failures
- Auto-scaling – Dynamic resource allocation based on workload demand
- Multi-language Support – Python and Java APIs available
- Cloud Integration – Native support for AWS, GCP, Azure, Kubernetes
- Observability – Built-in dashboard and state APIs for monitoring
Getting Started
Install Ray with pip install ray. For specific libraries, use pip install "ray[data]", pip install "ray[train]", etc. Ray automatically initializes on first use of distributed APIs, or explicitly with ray.init().
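As a quick sanity check after installation, a minimal sketch that starts Ray explicitly and inspects the resources it detected (output values depend on your machine):

import ray

# Start a local Ray instance explicitly (also happens implicitly on first use of distributed APIs).
ray.init()

# Inspect the resources Ray detected on this machine.
print(ray.cluster_resources())  # e.g. {'CPU': 8.0, 'memory': ..., ...}

# Shut down the local instance when finished.
ray.shutdown()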
Architecture & Core Components
Relevant Files
- src/ray/gcs/gcs_server.h
- src/ray/gcs/gcs_server.cc
- src/ray/raylet/node_manager.h
- src/ray/core_worker/core_worker.h
- src/ray/object_manager/object_manager.h
Ray Core is built on a distributed architecture with three primary components: the Global Control Service (GCS), Raylets, and Core Workers. These components communicate via gRPC and manage tasks, actors, objects, and cluster resources.
System Overview
Global Control Service (GCS)
The GCS is the cluster's control plane, managing all cluster-level metadata and operations. It runs as a single centralized service and coordinates with all nodes.
Key Managers:
- GcsNodeManager - Tracks alive, draining, and dead nodes; handles node registration and health monitoring
- GcsActorManager - Manages actor lifecycle: registration, scheduling, creation, and failure recovery
- GcsJobManager - Tracks job lifecycle, runtime environments, and job completion
- GcsTaskManager - Collects and stores task event data for observability and debugging
- GcsPlacementGroupManager - Schedules placement groups across nodes for resource-aware task placement
- GcsResourceManager - Maintains a cluster-wide view of resource availability used for scheduling decisions
- GcsWorkerManager - Tracks worker registration and lifecycle
The GCS uses GcsTableStorage (backed by Redis or in-memory) for persistent state and GcsPublisher for pub/sub notifications to keep all components synchronized.
Raylet (Node Manager)
Each node runs a Raylet, which is the local resource manager and task scheduler. It manages worker processes, object storage, and local task execution.
Responsibilities:
- Worker Pool Management - Starts, monitors, and recycles worker processes
- Local Scheduling - Schedules tasks on local workers using ClusterResourceScheduler
- Object Management - Coordinates with ObjectManager for object storage and transfer
- Resource Reporting - Reports node resources to GCS via RaySyncer
- Lease Management - Manages worker leases through LocalLeaseManager and ClusterLeaseManager
Core Worker
Each worker process runs a CoreWorker, which executes tasks and manages object references.
Key Responsibilities:
- Task Execution - Receives and executes tasks via TaskReceiver and SchedulingQueue
- Task Submission - Submits new tasks via NormalTaskSubmitter or ActorTaskSubmitter
- Object Store Access - Puts/gets objects from Plasma store or memory store
- Reference Counting - Tracks object ownership via ReferenceCounter for garbage collection
- Actor Management - Handles actor method calls and state management
Object Manager
The ObjectManager handles distributed object storage and transfer across nodes using the Plasma in-memory object store.
Operations:
- Pull - Fetches objects from remote nodes when needed by tasks
- Push - Transfers objects to nodes that request them
- Object Directory - Tracks which nodes contain which objects
- Spilling - Moves objects to disk when memory is full
Communication Flow
- Task Submission: Driver invokes a remote function via .remote() → CoreWorker submits to GCS → GCS schedules on Raylet
- Task Execution: Raylet assigns task to worker → CoreWorker executes → Returns object reference
- Object Access: CoreWorker requests object → ObjectManager pulls from remote Plasma store → Returns to caller
- Actor Creation: CoreWorker registers actor → GCS schedules creation task → Raylet executes on worker
Key Design Patterns
- Asynchronous I/O - All components use ASIO for non-blocking operations
- Pub/Sub Messaging - GCS publishes state changes; components subscribe for updates
- Reference Counting - Objects are garbage collected when no references remain
- Fault Tolerance - GCS can recover from Redis; tasks can be retried on failure
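As a user-level illustration of the retry-based fault tolerance above, a minimal sketch (the retry counts here are arbitrary):

import ray

# Tasks can be retried automatically if the worker or node executing them dies.
@ray.remote(max_retries=3)
def flaky_task(x):
    return x * 2

# Actors can be restarted after failure; their method calls can also be retried.
@ray.remote(max_restarts=2, max_task_retries=2)
class FlakyActor:
    def ping(self):
        return "pong"

print(ray.get(flaky_task.remote(21)))   # 42, retried up to 3 times on failure
actor = FlakyActor.remote()
print(ray.get(actor.ping.remote()))     # "pong"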
Ray Core: Tasks, Actors & Objects
Relevant Files
- python/ray/remote_function.py
- python/ray/actor.py
- src/ray/core_worker/core_worker.cc
- src/ray/core_worker/task_submission/normal_task_submitter.cc
- src/ray/core_worker/actor_manager.cc
- src/ray/core_worker/reference_counter.cc
Ray Core provides three fundamental primitives for distributed computing: tasks, actors, and objects. These work together to enable flexible, scalable applications.
Tasks
A task is an asynchronous function execution on a remote worker. When you decorate a function with @ray.remote and call .remote(), Ray creates a RemoteFunction wrapper that submits the task for execution.
The task submission flow:
- .remote() pickles the function and stores it in the GCS key-value store (once per function definition)
- Arguments are classified into three types: pass-by-reference (ObjectRefs), pass-by-value inline (small objects), and pass-by-value non-inline (large objects put in the object store)
- A TaskSpecification is built containing the function ID, arguments, and resource requirements
- The spec is submitted to NormalTaskSubmitter, which immediately returns an ObjectRef
The scheduler then waits for all ObjectRef arguments to be available, requests a worker lease from the raylet, and sends a PushTask RPC to execute the task.
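To make the argument classification concrete, an illustrative sketch (the exact size threshold for inlining is an internal detail and not shown here):

import numpy as np
import ray

@ray.remote
def consume(a, b, c):
    return a + b + int(c.sum())

small = 7                            # small value: inlined into the task spec (pass-by-value)
large = np.zeros(1_000_000)          # large value: placed in the object store, passed by reference
ref = ray.put(3)                     # explicit ObjectRef: passed by reference, resolved by the worker

result_ref = consume.remote(small, ref, large)
print(ray.get(result_ref))           # 10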
Actors
An actor is a stateful worker that maintains mutable state across method calls. Creating an actor with ActorClass.remote() submits an ACTOR_CREATION_TASK that instantiates the actor on a dedicated worker.
Key differences from tasks:
- Actors are registered with the GCS before execution begins
- All methods on an actor execute on the same worker, preserving state
- Actor methods are submitted as ACTOR_TASK types to the actor's dedicated worker
- The actor handle maintains a cursor to track method execution order
Objects & References
Remote objects are immutable values stored in the distributed object store (one per node). Objects are referenced via ObjectRef, which is a unique ID that can be passed between workers.
Objects are created by:
- Task return values (automatically stored if large)
- ray.put() calls (explicitly store a value)
The reference counter tracks object ownership and lifecycle. When an ObjectRef goes out of scope, the reference counter decrements the count. Once all references are released, the object becomes eligible for garbage collection and eviction from the object store.
import ray

@ray.remote
def add(a, b):
    return a + b

# Task submission returns immediately
result_ref = add.remote(1, 2)

# Fetch the result
result = ray.get(result_ref)  # 3

@ray.remote
class Counter:
    def __init__(self):
        self.count = 0

    def increment(self):
        self.count += 1
        return self.count

# Create actor
counter = Counter.remote()

# Call actor method
ref = counter.increment.remote()
print(ray.get(ref))  # 1
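A further sketch showing explicit object creation with ray.put() and reference release; when eviction actually occurs is decided by Ray's reference counting and memory pressure:

import ray

# Explicitly store a value in the object store; returns an ObjectRef.
obj_ref = ray.put({"weights": [1, 2, 3]})

# The ObjectRef can be passed to tasks/actors or fetched directly.
print(ray.get(obj_ref))  # {'weights': [1, 2, 3]}

# Once the last reference goes out of scope, the reference counter
# releases the object and it becomes eligible for eviction.
del obj_ref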
Ray Data: Scalable Data Processing
Relevant Files
- python/ray/data/__init__.py
- python/ray/data/dataset.py
- python/ray/data/block.py
- python/ray/data/_internal/execution/streaming_executor.py
- doc/source/data/data.rst
Ray Data is a scalable data processing library for AI workloads built on Ray. It provides flexible APIs for batch inference, data preprocessing, and ML training data loading. Unlike traditional distributed data systems, Ray Data features a streaming execution engine that efficiently processes large datasets while maintaining high utilization across CPU and GPU workloads.
Core Concepts
Datasets and Blocks
A Dataset is a distributed data collection partitioned into blocks, where each block is the basic unit of data stored in Ray's object store. Blocks are represented as either PyArrow Tables or Pandas DataFrames, enabling efficient parallel processing. The block-level parallelism allows Ray Data to scale from single machines to hundreds of nodes processing terabytes of data.
Lazy Execution
Dataset transformations are lazy—they don't execute immediately. Instead, Ray Data builds a logical plan describing the operations. Execution is triggered only when you call a consumption method like .show(), .take(), or .iter_batches(). This enables optimization of the entire pipeline before execution begins.
Key Operations
Ray Data supports three categories of operations:
- Transformations (return a new Dataset):
  - map() and map_batches() for row-wise or batch-wise transformations
  - filter() for row filtering with predicates
  - groupby() for grouping and aggregation
  - sort(), random_shuffle(), repartition() for data reorganization
  - join() for combining datasets
- Aggregations (return scalar values):
  - min(), max(), mean(), sum() for simple statistics
  - aggregate() with custom aggregation functions
  - groupby().aggregate() for grouped aggregations
- Consumption (trigger execution):
  - iter_batches() for streaming iteration
  - take(), take_all() for collecting results
  - write_*() methods for saving to external storage
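For example, the transformation and aggregation paths above can be combined as follows (a minimal sketch over an in-memory dataset):

import ray

# Build a small in-memory dataset of 10 rows.
ds = ray.data.from_items([{"group": i % 2, "value": i} for i in range(10)])

# Transformation: keep rows with value >= 2 (lazy, returns a new Dataset).
ds = ds.filter(lambda row: row["value"] >= 2)

# Grouped aggregation: sum values per group, then trigger execution.
ds.groupby("group").sum("value").show()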
Streaming Execution Engine
Ray Data's streaming executor uses an event-loop approach to maximize throughput under resource constraints:
import ray

# Transformations are lazy
ds = ray.data.read_csv("data.csv")
ds = ds.map_batches(preprocess_fn, batch_size=64)
ds = ds.filter(lambda x: x["score"] > 0.5)
# Execution starts here
for batch in ds.iter_batches(batch_size=32):
    process(batch)
The executor manages:
- Operator scheduling: Selects which operator to run next based on resource availability
- Resource management: Tracks CPU, GPU, and object store memory usage
- Backpressure policies: Prevents memory bloat by throttling upstream operators when downstream queues fill up
- Task dispatch: Launches Ray tasks or actors based on compute strategy
Compute Strategies
Ray Data supports two execution strategies:
- TaskPoolStrategy: Launches stateless Ray tasks for each operation (default, lower overhead)
- ActorPoolStrategy: Creates a pool of stateful Ray actors for reuse (better for stateful operations)
# Using actors for stateful transformations
ds.map_batches(
    ModelInference,
    compute=ray.data.ActorPoolStrategy(size=4),
    num_gpus=1
)
Data Format Support
Ray Data integrates with multiple data sources and formats:
- File formats: Parquet, CSV, JSON, Avro, TFRecords, Lance, images, audio, video
- Cloud storage: S3, GCS, Azure Blob Storage, HDFS
- Databases: BigQuery, Snowflake, SQL databases, MongoDB
- Frameworks: Pandas, PyArrow, Spark, Dask, Modin, HuggingFace, PyTorch, TensorFlow
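For instance, reading Parquet from cloud storage (a sketch; the bucket path is a placeholder):

import ray

# Read a Parquet dataset directly from S3 (path is a placeholder).
ds = ray.data.read_parquet("s3://my-bucket/path/to/table/")

# Peek at the schema and a few rows without materializing everything.
print(ds.schema())
ds.show(limit=5)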
Preprocessing and ML Integration
Ray Data includes built-in preprocessors for common ML tasks:
- Scaling and normalization (StandardScaler, MinMaxScaler)
- Encoding (LabelEncoder, OneHotEncoder)
- Imputation and discretization
- Tokenization and vectorization
- Chaining multiple preprocessors together
Preprocessors can be fit on training data and applied to test data, enabling reproducible ML pipelines.
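A minimal preprocessor sketch, assuming a numeric column named "x":

import ray
from ray.data.preprocessors import StandardScaler

train_ds = ray.data.from_items([{"x": float(i)} for i in range(100)])
test_ds = ray.data.from_items([{"x": float(i)} for i in range(100, 110)])

# Fit statistics (mean/std) on the training data only...
scaler = StandardScaler(columns=["x"])
scaler.fit(train_ds)

# ...then apply the same transformation to the test data.
scaled_test = scaler.transform(test_ds)
scaled_test.show(limit=3)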
Performance Characteristics
Ray Data optimizes for:
- Streaming efficiency: Data flows through the pipeline without materializing intermediate results
- GPU utilization: Overlaps CPU preprocessing with GPU inference/training
- Memory efficiency: Adaptive batching and backpressure prevent object store overflow
- Fault tolerance: Integrates with Ray's fault tolerance mechanisms for reliable execution
Ray Train: Distributed Model Training
Relevant Files
- python/ray/train/__init__.py
- python/ray/train/base_trainer.py
- python/ray/train/data_parallel_trainer.py
- python/ray/train/backend.py
- python/ray/air/config.py
- python/ray/train/_internal/data_config.py
Ray Train is a scalable machine learning library for distributed training and fine-tuning. It abstracts away the complexities of distributed computing, allowing you to scale model training from a single machine to a cluster with minimal code changes.
Core Architecture
Ray Train follows a trainer-based architecture where trainers orchestrate distributed training jobs. The main components are:
- Trainers - High-level APIs for different frameworks (PyTorch, TensorFlow, XGBoost, etc.)
- Backends - Framework-specific distributed setup (NCCL for PyTorch, Gloo for CPU, etc.)
- Workers - Ray Actors that execute the training function on distributed nodes
- Datasets - Ray Data integration for efficient data loading and sharding
Trainer Lifecycle
When you call trainer.fit(), the following sequence occurs:
- Initialization - Trainer is instantiated locally with configuration
- Serialization - Trainer is pickled and sent to a remote Ray actor
- Setup - trainer.setup() runs on the remote actor for heavyweight initialization
- Training Loop - trainer.training_loop() executes the main training logic
- Results - Returns a Result object with metrics and checkpoints
Key Configuration Classes
ScalingConfig - Controls distributed training scale:
from ray.train import ScalingConfig

scaling_config = ScalingConfig(
    num_workers=4,                 # Number of training workers
    use_gpu=True,                  # Enable GPU per worker
    resources_per_worker={"GPU": 1, "CPU": 8},
    placement_strategy="SPREAD"    # Spread workers across nodes
)
RunConfig - Manages execution and checkpointing:
from ray.train import RunConfig, CheckpointConfig, FailureConfig

run_config = RunConfig(
    name="my_training_run",
    storage_path="s3://my-bucket/checkpoints",
    checkpoint_config=CheckpointConfig(num_to_keep=3),
    failure_config=FailureConfig(max_failures=3)
)
DataConfig - Configures dataset preprocessing and sharding:
from ray.train import DataConfig

data_config = DataConfig(
    datasets_to_split=["train"],   # Which datasets to shard
    enable_shard_locality=True     # Prefer local data
)
Data Parallel Training
DataParallelTrainer implements SPMD (Single Program, Multiple Data) training where the same function runs on all workers with different data shards:
from ray.train import ScalingConfig
from ray.train.data_parallel_trainer import DataParallelTrainer
from ray import train

def train_loop_per_worker():
    # Each worker gets a shard of the training dataset
    dataset_shard = train.get_dataset_shard("train")

    # Access distributed context
    world_size = train.get_context().get_world_size()
    rank = train.get_context().get_world_rank()

    # Report metrics and checkpoints
    train.report({"loss": 0.5})

trainer = DataParallelTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
    datasets={"train": ray_dataset}
)
result = trainer.fit()
Framework-Specific Trainers
Ray Train provides specialized trainers for popular frameworks:
- TorchTrainer - Automatic NCCL/Gloo setup for PyTorch
- TensorflowTrainer - Distributed TensorFlow environment configuration
- XGBoostTrainer - XGBoost distributed training with Ray
- JAXTrainer - JAX multi-device training
Each trainer handles framework-specific initialization while inheriting the core trainer lifecycle.
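For instance, the PyTorch path looks like this (a minimal sketch; the training function body is a stub and the worker count is arbitrary):

from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer
from ray import train

def train_func():
    # Build the model/optimizer here and wrap them with
    # train.torch.prepare_model(...) / prepare_data_loader(...) as needed.
    train.report({"loss": 0.0})

trainer = TorchTrainer(
    train_func,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=False),
)
result = trainer.fit()
print(result.metrics)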
Backend System
The Backend and BackendConfig classes enable framework-specific distributed setup:
from ray.train.backend import Backend, BackendConfig

class MyBackendConfig(BackendConfig):
    @property
    def backend_cls(self):
        return MyBackend

class MyBackend(Backend):
    def on_start(self, worker_group, backend_config):
        # Setup distributed environment
        pass
Backends handle CUDA device visibility, process group initialization, and framework-specific environment variables.
Ray Tune & Serve: Hyperparameter Tuning & Model Serving
Relevant Files
- python/ray/tune/tuner.py
- python/ray/tune/tune.py
- python/ray/tune/tune_config.py
- python/ray/tune/search/
- python/ray/tune/schedulers/
- python/ray/serve/deployment.py
- python/ray/serve/api.py
- python/ray/serve/handle.py
Ray Tune and Ray Serve are complementary systems for machine learning workflows: Tune optimizes hyperparameters during training, while Serve deploys trained models for inference at scale.
Ray Tune: Hyperparameter Tuning
Ray Tune is a distributed hyperparameter tuning framework that runs multiple training trials in parallel. The core workflow uses the Tuner class:
from ray import tune
from ray.tune.search.bayesopt import BayesOptSearch
from ray.tune.schedulers import AsyncHyperBandScheduler

tuner = tune.Tuner(
    trainable=my_training_function,
    param_space={"lr": tune.loguniform(1e-4, 1e-1), "batch_size": tune.choice([32, 64])},
    tune_config=tune.TuneConfig(
        metric="loss",
        mode="min",
        num_samples=10,
        search_alg=BayesOptSearch(),
        scheduler=AsyncHyperBandScheduler()
    ),
    run_config=tune.RunConfig(name="my_experiment", storage_path="~/ray_results")
)
results = tuner.fit()
Key Components:
- Tuner: Entry point that orchestrates the entire tuning job. Manages configuration, execution, and result collection.
- TuneConfig: Specifies tuning behavior including the metric to optimize, search algorithm, and trial scheduler.
- RunConfig: Handles job-level settings like storage, checkpointing, and fault tolerance.
- Search Algorithms: Generate hyperparameter configurations. BasicVariantGenerator is the default; others include BayesOptSearch, OptunaSearch, and AxSearch.
- Trial Schedulers: Control trial execution and early stopping. FIFOScheduler runs trials sequentially; AsyncHyperBandScheduler (ASHA) and HyperBandScheduler perform adaptive resource allocation; PopulationBasedTraining (PBT) mutates hyperparameters during training.
- ConcurrencyLimiter: Wrapper that limits concurrent trials from any search algorithm.
Tune reports metrics via tune.report() and supports checkpointing for fault tolerance and warm-start resumption.
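A minimal trainable sketch that reports a metric each step (the loss formula is a stand-in; the dict-style tune.report() call assumes a recent Ray 2.x API):

from ray import tune

def trainable(config):
    for step in range(10):
        # Stand-in "loss" that improves as lr approaches 0.01.
        loss = (config["lr"] - 0.01) ** 2 + 1.0 / (step + 1)
        tune.report({"loss": loss})

tuner = tune.Tuner(
    trainable,
    param_space={"lr": tune.loguniform(1e-4, 1e-1)},
    tune_config=tune.TuneConfig(metric="loss", mode="min", num_samples=4),
)
results = tuner.fit()
print(results.get_best_result().config)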
Ray Serve: Model Serving
Ray Serve deploys models as scalable HTTP/gRPC services. The core pattern uses the @serve.deployment decorator:
from ray import serve

@serve.deployment(num_replicas=2, ray_actor_options={"num_gpus": 1})
class MyModel:
    def __init__(self, model_path: str):
        self.model = load_model(model_path)

    def __call__(self, request):
        return self.model.predict(request)

app = MyModel.bind(model_path="path/to/model")
handle = serve.run(app)
Key Components:
- Deployment: A class or function decorated with @serve.deployment. Each deployment runs as Ray actors (replicas) that handle requests.
- Application: One or more deployments bound with arguments. The ingress deployment receives external traffic.
- DeploymentHandle: Enables inter-deployment communication for model composition. Allows one deployment to call another without HTTP overhead.
- Replicas: Individual actor instances of a deployment. Scale horizontally to handle load.
Deployments compose via Deployment.bind(), enabling multi-stage pipelines. The serve.run() function deploys an application and returns a handle for programmatic access.
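A sketch of composition through a DeploymentHandle, assuming a recent Serve release where bound deployments are passed to constructors as handles (the doubling "preprocessor" stands in for real preprocessing logic):

from ray import serve

@serve.deployment
class Preprocessor:
    def process(self, value: int) -> int:
        return value * 2

@serve.deployment
class Pipeline:
    def __init__(self, preprocessor):
        # A DeploymentHandle to the bound Preprocessor deployment.
        self.preprocessor = preprocessor

    async def __call__(self, value: int) -> int:
        # Call the downstream deployment directly, without HTTP overhead.
        return await self.preprocessor.process.remote(value)

app = Pipeline.bind(Preprocessor.bind())
handle = serve.run(app)
print(handle.remote(3).result())  # 6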
Integration Pattern
Tune and Serve work together in ML workflows:
- Use Tune to find optimal hyperparameters
- Save the best model from results.get_best_result()
- Deploy the model with Serve using the tuned hyperparameters
- Scale inference with multiple replicas and autoscaling policies
best_result = results.get_best_result(metric="loss", mode="min")
best_model_path = best_result.checkpoint.path

@serve.deployment(num_replicas=4)
class TunedModel:
    def __init__(self, model_path):
        self.model = load_model(model_path)

    def __call__(self, x):
        return self.model.predict(x)

app = TunedModel.bind(best_model_path)
serve.run(app)
Both systems leverage Ray's distributed execution engine for efficient resource utilization across clusters.
RLlib: Scalable Reinforcement Learning
Relevant Files
- rllib/__init__.py
- rllib/algorithms/algorithm.py
- rllib/algorithms/algorithm_config.py
- rllib/policy/policy.py
- rllib/core/rl_module/rl_module.py
- rllib/connectors/connector_v2.py
- rllib/env/env_runner.py
- rllib/core/learner/learner.py
RLlib is Ray's production-grade reinforcement learning library, providing scalable, fault-tolerant RL training with support for single-agent, multi-agent, offline, and externally-connected environments. It abstracts distributed training complexity while maintaining flexible APIs for custom algorithms.
Core Architecture
RLlib's architecture centers on three main components working in concert:
Algorithm (rllib/algorithms/algorithm.py) is the runtime orchestrator that manages the entire training loop. It coordinates environment sampling, model updates, and checkpointing. You configure it via AlgorithmConfig, which provides a fluent API for setting hyperparameters, environment details, and resource allocation.
EnvRunners (rllib/env/env_runner.py) are distributed actors that collect training data by stepping through environments. They use RLModules to compute actions and connectors to transform environment observations into model-readable tensors. Multiple EnvRunners run in parallel across a Ray cluster for scalable data collection.
Learners (rllib/core/learner/learner.py) are distributed actors that perform gradient-based updates on RLModules using collected data. They handle loss computation, backpropagation, and weight synchronization across multiple GPUs or machines.
RLModules and Neural Networks
RLModules (rllib/core/rl_module/rl_module.py) are framework-specific neural network wrappers that define three forward passes:
- forward_exploration() - Computes actions with exploration during data collection
- forward_inference() - Computes deterministic actions for evaluation or deployment
- forward_train() - Prepares data for loss computation during training
RLlib provides default RLModules with configurable architectures (MLPs, CNNs, RNNs), or you can implement custom modules in PyTorch. For multi-agent scenarios, MultiRLModule contains multiple sub-modules, each identified by a ModuleID.
Connectors: Data Transformation Pipelines
Connectors (rllib/connectors/connector_v2.py) are composable data transformation pipelines that bridge environments and models. Three pipeline types exist:
- Env-to-Module - Transforms raw environment observations (preprocessing, filtering, RNN state handling)
- Module-to-Env - Converts model outputs to environment actions (sampling, clipping)
- Learner - Converts episodes to training batches for loss computation
Connectors enable custom data handling without modifying core algorithm logic, supporting stateful operations like observation normalization.
Training Loop
A typical training iteration follows this flow:
from ray.rllib.algorithms.ppo import PPOConfig

config = PPOConfig().environment("CartPole-v1").training(lr=0.0001)
algo = config.build()

num_iterations = 10  # arbitrary number of training iterations

for _ in range(num_iterations):
    # EnvRunners sample episodes in parallel
    # Connectors transform episodes to batches
    # Learners compute losses and update RLModule weights
    result = algo.train()
    print(result)

algo.stop()
Multi-Agent and Offline RL
RLlib natively supports multi-agent training via MultiAgentEnv and policy mapping functions. For offline RL, algorithms can train from pre-collected datasets without environment interaction, using specialized input readers and off-policy estimators.
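A hedged sketch of a multi-agent configuration (the environment name, agent IDs, and policy IDs are placeholders; the policy_mapping_fn signature follows the new API stack):

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    # "my_multi_agent_env" is a placeholder for a registered MultiAgentEnv.
    .environment("my_multi_agent_env")
    .multi_agent(
        policies={"policy_0", "policy_1"},
        # Route one agent to each policy; real mappings can be arbitrary.
        policy_mapping_fn=lambda agent_id, episode, **kw: (
            "policy_0" if agent_id == "agent_0" else "policy_1"
        ),
    )
)
algo = config.build()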
Key Design Patterns
RLlib uses composition over inheritance - algorithms are built from configurable components rather than deep class hierarchies. The connector pattern decouples data transformation from algorithm logic. Ray actors provide transparent distributed execution with fault tolerance and auto-scaling.
Cluster Management & Autoscaling
Relevant Files
- python/ray/autoscaler/__init__.py
- python/ray/autoscaler/node_provider.py
- python/ray/autoscaler/_private/autoscaler.py
- python/ray/autoscaler/v2/autoscaler.py
- python/ray/autoscaler/v2/scheduler.py
- python/ray/autoscaler/_private/resource_demand_scheduler.py
- python/ray/autoscaler/_private/monitor.py
- python/ray/autoscaler/_private/load_metrics.py
Ray's autoscaler automatically adjusts cluster size based on workload demands. It monitors resource utilization, schedules pending tasks and actors, and manages node lifecycle across multiple cloud providers.
Architecture Overview
Core Components
Monitor Loop (monitor.py): Runs periodically on the head node, collecting cluster state and triggering autoscaling decisions. It queries the GCS for resource demands and node status.
Load Metrics (load_metrics.py): Aggregates cluster telemetry from raylet heartbeats, including static/dynamic resources, pending placement groups, and waiting task bundles. Tracks both feasible and infeasible resource requests.
Resource Demand Scheduler: Determines which nodes to launch or terminate using bin-packing algorithms. Considers node type constraints, min/max node counts, and placement group requirements. Two implementations exist: v1 (resource_demand_scheduler.py) and v2 (v2/scheduler.py).
Node Provider (node_provider.py): Abstract interface for cloud-specific operations. Implementations exist for AWS, GCP, Azure, Kubernetes, and local providers. Must be thread-safe for concurrent node creation/termination.
Instance Manager (v2/instance_manager/): Manages instance lifecycle in v2 autoscaler. Tracks instance state transitions (QUEUED → REQUESTED → ALLOCATED → RAY_INSTALLING → RAY_RUNNING) and coordinates with subscribers for Ray installation and monitoring.
Scaling Decision Flow
- Collect Metrics: Monitor queries GCS for pending tasks, actors, and placement groups
- Compute Demands: Load metrics aggregates resource requirements into demand vectors
- Schedule: Scheduler performs bin-packing to fit demands onto existing/new nodes
- Execute: Instance manager launches new nodes or terminates idle ones via cloud provider
- Install: Ray runtime installed on new nodes; nodes join cluster when ready
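Demand can also be injected programmatically through the autoscaler SDK; a sketch with arbitrary values:

from ray.autoscaler.sdk import request_resources

# Ask the autoscaler to provision capacity for 16 CPUs in total.
request_resources(num_cpus=16)

# Or request specific resource bundles (e.g., two 1-GPU workers).
request_resources(bundles=[{"GPU": 1}] * 2)

# Each call overrides the previous request; requesting zero lets the
# cluster scale back down once it is idle.
request_resources(num_cpus=0)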
Key Features
- Multi-node-type support: Different node types (CPU, GPU, memory-optimized) with separate min/max constraints
- Placement group awareness: Respects gang scheduling requirements and affinity constraints
- Idle node termination: Removes underutilized nodes after configurable timeout
- Cloud provider abstraction: Pluggable providers enable multi-cloud deployments
- GCS fault tolerance: Integrates with Redis-backed GCS for cluster state recovery