Overview
Relevant Files
- README.md
- sdks/python/README.md
- sdks/go/README.md
- sdks/java/core/src/main/java/org/apache/beam/sdk/package-info.java
- runners/ (all runner implementations)
Apache Beam is a unified, open-source framework for defining both batch and streaming data-parallel processing pipelines. It provides a portable programming model that abstracts away the complexity of distributed data processing, allowing developers to write pipelines once and execute them on multiple distributed backends.
The Beam Model
The Beam Model evolved from Google's internal data processing systems, including MapReduce, FlumeJava, and MillWheel. It introduces four core abstractions that form the foundation of every Beam pipeline:
- PCollection: Represents a distributed collection of data, which can be bounded (batch) or unbounded (streaming) in size.
- PTransform: Represents a computation that transforms input PCollections into output PCollections. Transforms are composable and can be chained together.
- Pipeline: Manages a directed acyclic graph (DAG) of PTransforms and PCollections, orchestrating the entire data processing workflow.
- PipelineRunner: Specifies where and how the pipeline executes on a distributed processing backend.
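A minimal Python sketch of how these four pieces fit together (the transform labels are illustrative; with no options supplied, execution defaults to a local runner):

```python
import apache_beam as beam

# Pipeline: the DAG that owns every transform and collection.
with beam.Pipeline() as pipeline:
    # PTransforms (Create, Filter, Map) consume and produce PCollections.
    numbers = pipeline | 'MakeNumbers' >> beam.Create([1, 2, 3, 4])   # a bounded PCollection
    evens = numbers | 'KeepEvens' >> beam.Filter(lambda n: n % 2 == 0)
    evens | 'Print' >> beam.Map(print)
# PipelineRunner: which backend runs the DAG is chosen via pipeline options (see below).
```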
Multi-Language SDKs
Beam supports language-specific SDKs that implement the Beam Model:
- Java SDK (sdks/java/): Full-featured SDK with extensive I/O connectors and transforms
- Python SDK (sdks/python/): Pythonic API with support for streaming, type hints, and ML inference
- Go SDK (sdks/go/): Lightweight SDK for Go developers with portable execution support
- TypeScript SDK (sdks/typescript/): JavaScript/TypeScript implementation for web and Node.js environments
Execution Runners
Beam pipelines can execute on multiple distributed backends through pluggable runners:
- DirectRunner: Local execution for development and testing
- PrismRunner: Portable local runner using Beam's FnAPI
- DataflowRunner: Google Cloud Dataflow for managed cloud execution
- FlinkRunner: Apache Flink cluster execution
- SparkRunner: Apache Spark cluster execution
- JetRunner: Hazelcast Jet cluster execution
- Twister2Runner: Twister2 cluster execution
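The runner is normally chosen through pipeline options rather than code changes. A hedged sketch using the Python SDK (the Dataflow project, region, and bucket values are placeholders):

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Local development and testing.
local_options = PipelineOptions(['--runner=DirectRunner'])

# The same pipeline code can target a managed or cluster backend purely by
# changing options (the project, region, and bucket below are placeholders).
dataflow_options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-gcp-project',
    '--region=us-central1',
    '--temp_location=gs://my-bucket/tmp',
])
```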
Key Features
- Unified Model: Single programming model for batch and streaming workloads
- Portability: Write once, run on any supported backend via the Beam Portability Framework
- Extensibility: Create custom transforms, I/O connectors, and runners
- Type Safety: Python SDK includes type hints for early error detection
- Multi-Language Pipelines: Combine transforms written in different languages within a single pipeline
- Rich Ecosystem: Extensive built-in transforms for filtering, grouping, windowing, and aggregation
Architecture & Core Model
Relevant Files
- model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto
- model/fn-execution/src/main/proto/org/apache/beam/model/fn_execution/v1/beam_fn_api.proto
- model/job-management/src/main/proto/org/apache/beam/model/job_management/v1/beam_job_api.proto
- sdks/python/apache_beam/pipeline.py
- sdks/go/pkg/beam/pipeline.go
Beam's architecture is built on a unified model that separates pipeline definition from execution. The core abstractions—PCollection, PTransform, Pipeline, and PipelineRunner—form the foundation, while the portable execution layer enables language-agnostic SDK and runner interoperability.
Core Abstractions
The Beam Model defines four fundamental concepts:
- PCollection: A distributed, immutable collection of data that can be bounded (batch) or unbounded (streaming). PCollections are produced and consumed by PTransforms.
- PTransform: A computation that transforms zero or more input PCollections into zero or more output PCollections. Transforms are composable and can be chained using the pipe operator (|).
- Pipeline: A directed acyclic graph (DAG) of PTransforms and PCollections. The Pipeline manages the entire workflow and coordinates execution through a runner.
- PipelineRunner: Specifies where and how the pipeline executes on a distributed backend (e.g., Flink, Spark, Dataflow, or local execution).
Pipeline Representation
Pipelines are represented in two forms:
- SDK-level: Language-specific objects (Python, Java, Go, TypeScript) that developers use to construct pipelines.
- Runner API (Proto): A language-neutral Protocol Buffer representation that enables runners to understand and execute pipelines regardless of the SDK used.
The Runner API defines a hierarchical graph where transforms reference PCollections by ID, and all components (coders, windowing strategies, environments) are stored in a shared Components map for efficient reuse.
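In the Python SDK this proto form can be inspected directly; a small sketch, assuming a trivially constructed pipeline:

```python
import apache_beam as beam

p = beam.Pipeline()
_ = p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x * 2)

# Convert the SDK-level pipeline into its language-neutral Runner API proto.
proto = p.to_runner_api()

# Transforms reference PCollections, coders, etc. by ID through the shared Components map.
for transform_id, transform in proto.components.transforms.items():
    print(transform_id, transform.spec.urn)
print('roots:', list(proto.root_transform_ids))
```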
Portable Execution Model
The portable execution model decouples SDKs from runners through three key APIs:
SDK (Python/Java/Go) → Job API → Runner → Fn API → SDK Harness
Job API (beam_job_api.proto): Handles job submission, lifecycle management, and state tracking. SDKs submit pipelines to a JobService, which prepares, stages artifacts, and runs the job.
Fn API (beam_fn_api.proto): Defines the execution contract between runners and SDK harnesses. It includes:
- Control API: Bidirectional gRPC stream for work instructions (ProcessBundle, FinalizeBundleRequest)
- Data API: Streams elements between runner and SDK harness
- State API: Manages user state and side inputs
- Logging API: Collects logs from SDK harnesses
Environments
Environments specify where user code executes:
- Docker: Containerized SDK harness (default for distributed execution)
- Process: Native process spawned by the runner
- External: Pre-existing SDK harness managed outside the runner
- Loopback: SDK harness runs in the same process as the runner (testing/local)
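With the Python SDK, the environment is typically selected through portable pipeline options; a sketch (the EXTERNAL address is a placeholder):

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Docker is the usual default for distributed portable execution.
docker_options = PipelineOptions(['--environment_type=DOCKER'])

# Loopback: the SDK harness runs inside the submitting process (local testing).
loopback_options = PipelineOptions(['--environment_type=LOOPBACK'])

# External: connect to an SDK harness managed outside the runner
# (the address is a placeholder).
external_options = PipelineOptions([
    '--environment_type=EXTERNAL',
    '--environment_config=localhost:50000',
])
```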
Standard Transforms
The Runner API defines primitive transforms that runners must implement:
- ParDo: Parallel element-wise processing with user-defined functions
- GroupByKey: Shuffle operation that groups values by key
- Flatten: Merges multiple PCollections
- Combine: Aggregates elements (e.g., sum, count)
- Window: Assigns elements to logical windows for aggregation
Composite transforms are built by combining primitives and can be nested hierarchically.
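As an illustration, a composite transform in the Python SDK built only from primitives (ParDo, via Map, and GroupByKey); the CountPerKey name is illustrative:

```python
import apache_beam as beam

class CountPerKey(beam.PTransform):
    """Composite transform: counts occurrences of each input element."""

    def expand(self, pcoll):
        return (
            pcoll
            | 'PairWithOne' >> beam.Map(lambda element: (element, 1))  # ParDo under the hood
            | 'GroupByKey' >> beam.GroupByKey()                        # primitive shuffle
            | 'SumCounts' >> beam.Map(lambda kv: (kv[0], sum(kv[1])))  # ParDo again
        )

# Usage: counts = words | CountPerKey()
```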
Data Flow Architecture
The runner translates the Runner API into executable stages, each containing a subset of transforms. Stages are sent to SDK harnesses via the Fn API, which execute user code and stream results back through the Data API.
Key Design Principles
- Language Neutrality: Proto-based APIs enable any language to implement SDKs and runners
- Separation of Concerns: SDKs define pipelines; runners handle execution strategy
- Composability: Transforms can be nested and reused across pipelines
- Portability: The same pipeline can run on multiple runners without modification
Python SDK
Relevant Files
- sdks/python/apache_beam/__init__.py
- sdks/python/apache_beam/pipeline.py
- sdks/python/apache_beam/pvalue.py
- sdks/python/apache_beam/transforms/core.py
- sdks/python/apache_beam/io/iobase.py
The Apache Beam Python SDK provides a unified programming model for building batch and streaming data processing pipelines. It exposes core Beam concepts through Python classes and enables developers to define complex data transformations using familiar Python idioms.
Core Concepts
The SDK is built around four fundamental abstractions:
- Pipeline - A directed acyclic graph (DAG) that manages transforms and data flows. Created with beam.Pipeline(), it orchestrates execution and handles runner selection.
- PCollection - An immutable, distributed dataset that can be bounded (finite) or unbounded (streaming). Elements flow through transforms and are tagged with optional metadata.
- PTransform - A computation that transforms input PCollections into output PCollections. Transforms can be composite (containing sub-transforms) or primitive (leaf nodes).
- Runner - Executes the pipeline on a specific execution engine (DirectRunner, Dataflow, Spark, etc.).
Pipeline Construction
Pipelines use a fluent API with the pipe operator (|) to chain transforms:
import apache_beam as beam
with beam.Pipeline() as p:
    result = (
        p
        | 'Create' >> beam.Create([1, 2, 3])
        | 'Double' >> beam.Map(lambda x: x * 2)
        | 'Write' >> beam.io.WriteToText('./output')
    )
The with statement ensures proper resource cleanup: on exit it automatically calls run() and waits for the pipeline to finish.
Key Transform Types
ParDo - The most flexible transform, applies a user-defined function (DoFn) to each element. Supports side inputs, stateful processing, and multiple outputs via TaggedOutput.
Map/FlatMap - Convenience wrappers around ParDo for simple element-wise transformations.
GroupByKey - Groups elements by key, essential for aggregations. Requires KV pairs as input.
CombinePerKey - Combines values per key using a CombineFn, more efficient than GroupByKey for associative operations.
Flatten - Merges multiple PCollections into one.
Windowing - Divides unbounded streams into finite windows (fixed, sliding, session-based).
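For example, a hedged sketch of a DoFn that emits to a main output and an additional tagged output (the tag names are illustrative):

```python
import apache_beam as beam
from apache_beam import pvalue

class SplitEvenOdd(beam.DoFn):
    def process(self, n):
        if n % 2 == 0:
            yield n                              # main output
        else:
            yield pvalue.TaggedOutput('odd', n)  # additional tagged output

with beam.Pipeline() as p:
    results = (
        p
        | 'Numbers' >> beam.Create([1, 2, 3, 4, 5])
        | 'Split' >> beam.ParDo(SplitEvenOdd()).with_outputs('odd', main='even')
    )
    results.even | 'PrintEven' >> beam.Map(print)
    results.odd | 'PrintOdd' >> beam.Map(print)
```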
Data I/O
Read - Ingests data from external sources via BoundedSource or UnboundedSource implementations. Supports splitting for parallel reads.
Write - Outputs data to external sinks. Handles batching and retry logic.
Common I/O connectors include ReadFromText, WriteToText, ReadFromBigQuery, and many database/messaging system adapters.
Type Hints and Inference
The SDK includes a sophisticated type system using @with_input_types and @with_output_types decorators. Type variables (K, V, T) enable generic transforms that adapt to concrete input types through pattern matching and substitution. Type checking can be enabled via pipeline options for early error detection.
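A short sketch of the decorators in use (the DoubleValues class is illustrative):

```python
import apache_beam as beam

@beam.typehints.with_input_types(beam.typehints.KV[str, int])
@beam.typehints.with_output_types(beam.typehints.KV[str, int])
class DoubleValues(beam.DoFn):
    def process(self, element):
        key, value = element
        yield key, value * 2

# Construction-time type checking is on by default; per-element runtime checks
# can be enabled with the --runtime_type_check pipeline option.
```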
Side Inputs
PCollections can be passed as side inputs using view wrappers: AsSingleton (single value), AsIter (iterable), AsList (materialized list), AsDict (key-value lookup), and AsMultiMap (multi-value keys).
main_data | beam.ParDo(MyDoFn(), beam.pvalue.AsSingleton(side_data))
Execution Model
The pipeline is lazily evaluated. Transforms are applied during construction to build the DAG, but actual computation occurs when run() is called. The runner converts the Python DAG to its native execution format (e.g., Dataflow proto, Spark RDD operations).
Java SDK
Relevant Files
- sdks/java/core/src/main/java/org/apache/beam/sdk/Pipeline.java
- sdks/java/core/src/main/java/org/apache/beam/sdk/values/PCollection.java
- sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/PTransform.java
- sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java
- sdks/java/core/src/main/java/org/apache/beam/sdk/io/BoundedSource.java
- sdks/java/core/src/main/java/org/apache/beam/sdk/io/UnboundedSource.java
Core Concepts
The Java SDK provides a powerful model for building batch and streaming data processing pipelines. The four fundamental abstractions are:
- Pipeline: Manages a directed acyclic graph (DAG) of transforms and collections. Each pipeline is self-contained and isolated; pipelines can execute concurrently.
- PCollection: An immutable collection of values of type T. Can be bounded (finite) or unbounded (infinite). Produced by transforms and consumed by other transforms.
- PTransform: An operation that takes an input (subtype of PInput) and produces an output (subtype of POutput). Transforms are composable and chainable.
- PValue: Base interface for pipeline values. Includes PCollection, PBegin (root input), and PDone (terminal output).
Building a Pipeline
Pipelines are constructed declaratively—transforms are added to the graph but not executed until run() is called:
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);
PCollection<String> lines = p.apply(TextIO.read().from("gs://bucket/file*.txt"));
PCollection<String> words = lines.apply(FlatMapElements.into(TypeDescriptors.strings())
.via(line -> Arrays.asList(line.split("\\s+"))));
p.run();
Transforms are chained using apply(), which returns the output collection for the next transform.
Data Sources and I/O
The Read transform provides access to data sources through two abstractions:
- BoundedSource: Reads a finite amount of data. Supports splitting into bundles, size estimation, and dynamic work rebalancing. Use Read.from(boundedSource).
- UnboundedSource: Reads an unbounded stream. Supports checkpointing (for fault tolerance), watermarks (for progress tracking), and record IDs (for deduplication). Use Read.from(unboundedSource).
Both sources must be serializable so they can be distributed to worker machines. Sources are effectively immutable; mutable fields should be marked transient.
Coders and Type Safety
Each PCollection has an associated Coder that specifies how elements are serialized. Coders are inferred from Java types when possible, but can be explicitly set:
PCollection<String> lines = p.apply(TextIO.read().from("file.txt"))
.setCoder(StringUtf8Coder.of());
The Pipeline's CoderRegistry maintains the mapping from Java types to default coders. Custom types require explicit coder registration or manual specification.
Composite Transforms
Most transforms are composites—built from other transforms. Implement custom transforms by extending PTransform and overriding expand():
public class CountWords extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
  @Override
  public PCollection<KV<String, Long>> expand(PCollection<String> input) {
    return input
        .apply(FlatMapElements.into(TypeDescriptors.strings()).via(line -> Arrays.asList(line.split("\\s+"))))
        .apply(Count.perElement());
  }
}
Windowing and Timestamps
Elements in a PCollection have associated timestamps and window assignments. By default, all elements are in a single global window. Override with the Window transform:
PCollection<String> windowed = input
.apply(Window.into(FixedWindows.of(Duration.standardMinutes(5))));
Readers assign timestamps when creating collections; transforms propagate them downstream.
Pipeline Execution
The runner is determined by PipelineOptions. The direct runner executes locally in a single process (useful for testing and development). Distributed runners (Dataflow, Spark, Flink) execute on clusters. Transforms are not executed when applied—only when pipeline.run() is called.
Runners & Execution
Relevant Files
- runners/core-java/src/main/java/org/apache/beam/runners/core/DoFnRunner.java
- runners/direct-java/src/main/java/org/apache/beam/runners/direct/DirectRunner.java
- runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkRunner.java
- runners/spark/src/main/java/org/apache/beam/runners/spark/SparkRunner.java
- runners/prism/java/src/main/java/org/apache/beam/runners/prism/PrismRunner.java
Beam runners execute pipelines by translating the pipeline DAG into a format executable by their target platform, then orchestrating the execution of transforms and data flow.
Runner Architecture
Each runner extends PipelineRunner and implements the run() method to execute a Pipeline. The execution flow follows a consistent pattern:
- Validation & Optimization - Validate pipeline options and apply optimizations (e.g., projection pushdown, splittable DoFn conversion)
- Translation - Convert the pipeline DAG to the target platform's native format (Spark RDD operations, Flink operators, etc.)
- Execution - Submit the translated pipeline to the execution engine and monitor progress
- Result Handling - Return a PipelineResult for tracking job state and metrics
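A Python sketch of the result-handling step, assuming local execution with the DirectRunner:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

p = beam.Pipeline(options=PipelineOptions(['--runner=DirectRunner']))
_ = p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x + 1)

result = p.run()                 # returns a PipelineResult immediately
result.wait_until_finish()       # block until the job reaches a terminal state
print(result.state)              # e.g. DONE, FAILED, CANCELLED
print(result.metrics().query())  # counters, distributions, and gauges
```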
Native vs. Portable Runners
Native Runners (DirectRunner, SparkRunner, FlinkRunner) translate pipelines directly to their platform's APIs:
- DirectRunner - Executes locally using Java threads and in-memory data structures. Enforces Beam model semantics (encodability, immutability) for testing.
- SparkRunner - Translates transforms to Spark RDDs/DataFrames. Supports both batch and streaming via SparkPipelineTranslator.
- FlinkRunner - Converts to Flink operators. Detects streaming vs. batch and applies appropriate optimizations.
Portable Runners (PrismRunner, PortableRunner) use the Beam Fn API to execute user code in separate SDK harnesses:
- PrismRunner - Launches a Prism service (Go-based) and delegates execution via the Fn API.
- PortableRunner - Submits pipelines to a remote job service using the Job API and Fn API.
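A hedged Python sketch of submitting to a portable job service (the endpoint is a placeholder):

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Submit to a portable job service (for example one started by a Flink/Spark
# job server or by Prism); the endpoint below is a placeholder.
portable_options = PipelineOptions([
    '--runner=PortableRunner',
    '--job_endpoint=localhost:8099',
    '--environment_type=LOOPBACK',  # run the SDK harness in the submitting process
])
```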
Bundle Execution Model
Transforms execute in bundles - small groups of elements processed together for efficiency. The DoFnRunner interface orchestrates this:
void startBundle(); // Initialize state, call @StartBundle
void processElement(elem); // Call @ProcessElement for each element
void finishBundle(); // Flush outputs, call @FinishBundle
void onTimer(timerId, ...); // Fire timers
Each runner implements bundle execution differently. Native runners use local executors; portable runners send bundles to SDK harnesses via gRPC.
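On the SDK side, the same bundle lifecycle is exposed through DoFn methods; a Python sketch (the batching behaviour is illustrative):

```python
import apache_beam as beam

class BatchingDoFn(beam.DoFn):
    """Illustrative DoFn that batches work per bundle."""

    def start_bundle(self):
        # Runs once per bundle, before any elements (mirrors @StartBundle).
        self._batch = []

    def process(self, element):
        # Runs once per element (mirrors @ProcessElement).
        self._batch.append(element)
        yield element  # pass the element through unchanged

    def finish_bundle(self):
        # Runs once per bundle, after all elements (mirrors @FinishBundle).
        # A real DoFn might flush self._batch to an external system here.
        self._batch = []
```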
Beam Fn API
The Fn API enables language-independent execution by defining a gRPC contract between runners and SDK harnesses:
- ProcessBundleRequest - Runner sends a bundle descriptor and input data
- ProcessBundleResponse - SDK harness returns output data and metrics
- Data Plane - Separate gRPC streams for high-throughput element transmission
This allows Python, Go, Java, and other SDKs to run on any Fn API-compatible runner.
I/O Connectors & Transforms
Relevant Files
- sdks/python/apache_beam/io/__init__.py
- sdks/python/apache_beam/io/iobase.py
- sdks/python/apache_beam/transforms/core.py
- sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java
- sdks/java/core/src/main/java/org/apache/beam/sdk/io/BoundedSource.java
- sdks/java/core/src/main/java/org/apache/beam/sdk/io/UnboundedSource.java
Apache Beam provides a unified I/O framework for reading from and writing to external data sources. The framework separates concerns between sources (for reading) and sinks (for writing), with corresponding Read and Write transforms that integrate these into pipelines.
Sources: Reading Data
Sources define how data enters a pipeline. Beam distinguishes between two types:
BoundedSource - Reads a finite amount of data. Supports:
- Size estimation via estimate_size()
- Splitting into bundles for parallel reads via split()
- Range tracking for dynamic work rebalancing via get_range_tracker()
- Reading data respecting position boundaries via read()
UnboundedSource - Reads infinite or continuously growing data streams. Supports:
- Checkpointing to resume from failure points
- Watermark estimation for event time semantics
- Record IDs for deduplication
- Splitting into multiple partitions
The Read transform wraps a source and expands it into a PCollection. In Python, beam.io.Read(source) handles both bounded and unbounded sources, automatically selecting the appropriate execution strategy.
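A minimal Python sketch of a custom bounded source, modeled on the counting-source pattern from the Beam documentation (the CountingSource class is illustrative):

```python
from apache_beam.io import iobase
from apache_beam.io.range_trackers import OffsetRangeTracker

class CountingSource(iobase.BoundedSource):
    """Illustrative bounded source that emits the integers [0, count)."""

    def __init__(self, count):
        self._count = count

    def estimate_size(self):
        return self._count  # a rough estimate is sufficient

    def get_range_tracker(self, start_position, stop_position):
        start = 0 if start_position is None else start_position
        stop = self._count if stop_position is None else stop_position
        return OffsetRangeTracker(start, stop)

    def read(self, range_tracker):
        position = range_tracker.start_position()
        while range_tracker.try_claim(position):  # supports dynamic splitting
            yield position
            position += 1

    def split(self, desired_bundle_size, start_position=None, stop_position=None):
        start = 0 if start_position is None else start_position
        stop = self._count if stop_position is None else stop_position
        for bundle_start in range(start, stop, desired_bundle_size):
            bundle_stop = min(bundle_start + desired_bundle_size, stop)
            yield iobase.SourceBundle(
                weight=bundle_stop - bundle_start,
                source=self,
                start_position=bundle_start,
                stop_position=bundle_stop)

# Usage: numbers = pipeline | beam.io.Read(CountingSource(100))
```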
Sinks: Writing Data
Sinks define where data exits a pipeline. The Sink class (deprecated but still used) manages three-phase writes:
- Initialization - Sequential setup (e.g., creating output directories)
- Parallel Write - Workers write bundles via Writer instances
- Finalization - Sequential cleanup (e.g., merging files, committing transactions)
The Write transform applies a sink to a PCollection. Methods like initialize_write(), open_writer(), and finalize_write() must be idempotent to handle retries and failures.
Range Tracking and Dynamic Splitting
The RangeTracker enables dynamic work rebalancing for bounded sources. It tracks:
- Position ranges [start, stop)
- Consumed vs. remaining positions
- Split points for parallelizable records
Sources call try_claim(position) for split points and set_current_position() for non-split records. The try_split() method allows runners to dynamically rebalance work during execution.
Common I/O Connectors
Beam includes built-in connectors for popular systems:
- File formats: TextIO, AvroIO, ParquetIO, TFRecordIO
- Cloud storage: GCS (Google Cloud Storage), S3, Azure Blob Storage
- Databases: BigQuery, Datastore, MongoDB, JDBC
- Messaging: Pub/Sub, Kafka, Kinesis, JMS
- Other: Elasticsearch, Splunk, Debezium CDC
Each connector provides fluent builder APIs for configuration:
# Python example
pipeline | beam.io.ReadFromText('gs://bucket/file.txt')
pipeline | beam.io.WriteToText('gs://bucket/output')
// Java example
pipeline.apply(Read.from(new MyBoundedSource()))
pipeline.apply(TextIO.write().to("gs://bucket/output"))
Extensibility
Custom sources and sinks extend BoundedSource/UnboundedSource and Sink respectively. Implementations must be serializable for distribution to workers. The framework handles bundling, retries, and fault tolerance automatically.
Portability & Cross-Language
Relevant Files
- runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/control/SdkHarnessClient.java
- runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/environment/ProcessEnvironmentFactory.java
- sdks/python/apache_beam/portability/common_urns.py
- sdks/go/pkg/beam/xlang.go
- sdks/go/pkg/beam/core/runtime/harness/harness.go
- model/fn-execution/src/main/proto/org/apache/beam/model/fn_execution/v1/beam_fn_api.proto
Beam's portability framework enables language-agnostic execution by defining standardized protocols and data structures. This allows SDKs in different languages (Python, Java, Go, TypeScript) to work with any runner, and enables cross-language pipelines where transforms from one SDK can be used in another.
Core Concepts
URNs (Uniform Resource Names) are the foundation of portability. They uniquely identify transforms, coders, environments, and other components in a language-neutral way. For example, beam:transform:pardo:v1 identifies the ParDo transform across all SDKs. The common_urns.py module centralizes URN definitions, making them accessible to all language SDKs.
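In Python, these identifiers can be read straight from common_urns; a small sketch:

```python
from apache_beam.portability import common_urns

# The same URNs identify standard components in every SDK and runner.
print(common_urns.primitives.PAR_DO.urn)        # beam:transform:pardo:v1
print(common_urns.primitives.GROUP_BY_KEY.urn)  # beam:transform:group_by_key:v1
print(common_urns.primitives.FLATTEN.urn)       # beam:transform:flatten:v1
```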
The Beam Fn API is a gRPC-based protocol that defines three communication planes between runners and SDK harnesses:
- Control Plane (BeamFnControl) - Runner sends instructions to SDKs (e.g., process bundles)
- Data Plane (BeamFnData) - Bidirectional streaming of PCollection elements
- State Plane (BeamFnState) - SDK queries runner for state (side inputs, user state)
SDK Harness Architecture
The SDK harness is the worker-side component that executes user code. SdkHarnessClient in Java provides a high-level wrapper around the Fn API, managing bundle processing, state delegation, and data flow. Each harness connects to the runner via gRPC endpoints and processes bundles according to ProcessBundleDescriptor messages.
Key responsibilities:
- Bundle Management - Creates ActiveBundle instances that handle input/output receivers
- State Handling - Delegates state requests to StateRequestHandler implementations
- Data Routing - Manages logical endpoints for data and timer streams
Environment Abstraction
ProcessEnvironmentFactory demonstrates how runners spawn language-specific worker processes. It:
- Extracts environment configuration (URN, payload) from the pipeline
- Starts a subprocess with the appropriate SDK harness binary
- Waits for the harness to connect back via gRPC
- Provides provisioning and control endpoints
This abstraction allows runners to support multiple languages without language-specific code.
Cross-Language Transforms
Cross-language execution uses expansion services to convert external transforms into native pipeline graphs. The Go SDK's CrossLanguage function demonstrates the pattern:
- Client sends ExpansionRequest with URN and payload to the expansion service
- Service returns the expanded PTransform graph
- Client integrates the expanded graph into its pipeline
This enables composing transforms across language boundaries at pipeline construction time.
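A hedged Python sketch of the same pattern using ExternalTransform; the URN, payload fields, and expansion-service address are all placeholders:

```python
import apache_beam as beam
from apache_beam.transforms.external import ExternalTransform, ImplicitSchemaPayloadBuilder

# Apply a transform that a separately running expansion service knows how to
# expand. The URN, payload fields, and endpoint are placeholders.
with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create(['a', 'b', 'c'])
        | ExternalTransform(
            'my:external:transform:v1',
            ImplicitSchemaPayloadBuilder({'prefix': 'example-'}),
            expansion_service='localhost:8097')
    )
```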
Standardization Benefits
The portability framework achieves:
- SDK Interoperability - New SDKs automatically work with existing runners
- Runner Flexibility - Runners support all SDKs without custom integration
- Cross-Language Pipelines - Mix transforms from different SDKs in one pipeline
- Consistent Semantics - Windowing, state, and metrics work uniformly across languages
Beam Playground
Relevant Files
- playground/README.md
- playground/backend/README.md
- playground/frontend/README.md
- playground/api/v1/api.proto
- playground/backend/internal/
Beam Playground is a web-based interactive environment that allows users to write, run, and share Apache Beam code snippets without requiring local setup. It supports multiple SDKs (Java, Python, Go, and SCIO) and provides a modern browser interface for learning and experimenting with Beam pipelines.
Architecture Overview
The Playground consists of three main components:
Frontend (Flutter/Dart): A responsive web UI built with Flutter that provides code editing, execution controls, and result visualization. It communicates with backend services via gRPC and supports embedding in documentation via iframes.
Backend (Go): A gRPC server that orchestrates code execution, manages pipeline lifecycle, and handles caching. It processes code through validation, preparation, compilation, and execution stages specific to each SDK.
API (Protocol Buffers): Defines the gRPC contract between frontend and backend, including request/response messages for code execution, status checking, and output retrieval.
Core Workflow
When a user submits code, the backend executes a multi-stage pipeline:
- Validation: SDK-specific validators check syntax and semantic correctness
- Preparation: Preparers set up dependencies and environment (e.g., Go mod files, Python packages)
- Compilation: Code is compiled to executable form
- Execution: The compiled code runs with optional pipeline options and datasets
- Output Retrieval: Results, logs, errors, and execution graphs are cached and returned
Each stage produces output that can be retrieved independently via the gRPC API.
Key Features
Multi-SDK Support: Separate validator, preparer, and executor implementations for Java, Python, Go, and SCIO SDKs in playground/backend/internal/.
Caching: Supports both local and Redis-based caching to avoid re-executing identical code snippets.
Precompiled Examples: Catalog of examples, katas, and unit tests loaded from playground/categories.yaml and SDK directories.
Emulators: Kafka emulator support for demonstrating streaming I/O patterns.
Shareable Links: Users can share code snippets via unique identifiers or embed them in documentation using URL parameters.
Development Setup
Prerequisites include Go 1.23+, Flutter, protoc, Docker, and gcloud CLI. Run ./gradlew :playground:dockerComposeLocalUp to start a local deployment with all services.
For backend development, set environment variables like BEAM_SDK, APP_WORK_DIR, and DATASTORE_EMULATOR_HOST, then run go run ./cmd/server in the backend directory.
Frontend development uses flutter run -d chrome for local testing or flutter build web for production builds.
Deployment
Playground deploys to Google Cloud Platform using Terraform configurations in playground/terraform/. Infrastructure scripts in playground/infrastructure/ handle CI/CD pipelines for verifying and uploading examples to Cloud Datastore.