Overview
Relevant Files
- README.md
- sdks/python/README.md
- sdks/go/README.md
- sdks/java/core/src/main/java/org/apache/beam/sdk/package-info.java
- runners/ (all runner implementations)
Apache Beam is a unified, open-source framework for defining both batch and streaming data-parallel processing pipelines. It provides a portable programming model that abstracts away the complexity of distributed data processing, allowing developers to write pipelines once and execute them on multiple distributed backends.
The Beam Model
The Beam Model evolved from Google's internal data processing systems, including MapReduce, FlumeJava, and MillWheel. It introduces four core abstractions that form the foundation of every Beam pipeline:
- PCollection: Represents a distributed collection of data, which can be bounded (batch) or unbounded (streaming) in size.
- PTransform: Represents a computation that transforms input PCollections into output PCollections. Transforms are composable and can be chained together.
- Pipeline: Manages a directed acyclic graph (DAG) of PTransforms and PCollections, orchestrating the entire data processing workflow.
- PipelineRunner: Specifies where and how the pipeline executes on a distributed processing backend.
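A minimal Python sketch of how these four pieces fit together (the transform labels are illustrative; with no options supplied, execution defaults to a local runner):

```python
import apache_beam as beam

# Pipeline: the DAG that owns every transform and collection.
with beam.Pipeline() as pipeline:
    # PTransforms (Create, Filter, Map) consume and produce PCollections.
    numbers = pipeline | 'MakeNumbers' >> beam.Create([1, 2, 3, 4])   # a bounded PCollection
    evens = numbers | 'KeepEvens' >> beam.Filter(lambda n: n % 2 == 0)
    evens | 'Print' >> beam.Map(print)
# PipelineRunner: which backend runs the DAG is chosen via pipeline options (see below).
```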
Multi-Language SDKs
Beam supports language-specific SDKs that implement the Beam Model:
- Java SDK (sdks/java/): Full-featured SDK with extensive I/O connectors and transforms
- Python SDK (sdks/python/): Pythonic API with support for streaming, type hints, and ML inference
- Go SDK (sdks/go/): Lightweight SDK for Go developers with portable execution support
- TypeScript SDK (sdks/typescript/): JavaScript/TypeScript implementation for web and Node.js environments
Execution Runners
Beam pipelines can execute on multiple distributed backends through pluggable runners:
- DirectRunner: Local execution for development and testing
- PrismRunner: Portable local runner using Beam's FnAPI
- DataflowRunner: Google Cloud Dataflow for managed cloud execution
- FlinkRunner: Apache Flink cluster execution
- SparkRunner: Apache Spark cluster execution
- JetRunner: Hazelcast Jet cluster execution
- Twister2Runner: Twister2 cluster execution
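The runner is normally chosen through pipeline options rather than code changes. A hedged sketch using the Python SDK (the Dataflow project, region, and bucket values are placeholders):

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Local development and testing.
local_options = PipelineOptions(['--runner=DirectRunner'])

# The same pipeline code can target a managed or cluster backend purely by
# changing options (the project, region, and bucket below are placeholders).
dataflow_options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-gcp-project',
    '--region=us-central1',
    '--temp_location=gs://my-bucket/tmp',
])
```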
Key Features
- Unified Model: Single programming model for batch and streaming workloads
- Portability: Write once, run on any supported backend via the Beam Portability Framework
- Extensibility: Create custom transforms, I/O connectors, and runners
- Type Safety: Python SDK includes type hints for early error detection
- Multi-Language Pipelines: Combine transforms written in different languages within a single pipeline
- Rich Ecosystem: Extensive built-in transforms for filtering, grouping, windowing, and aggregation
Architecture & Core Model
Relevant Files
- model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto
- model/fn-execution/src/main/proto/org/apache/beam/model/fn_execution/v1/beam_fn_api.proto
- model/job-management/src/main/proto/org/apache/beam/model/job_management/v1/beam_job_api.proto
- sdks/python/apache_beam/pipeline.py
- sdks/go/pkg/beam/pipeline.go
Beam's architecture is built on a unified model that separates pipeline definition from execution. The core abstractions—PCollection, PTransform, Pipeline, and PipelineRunner—form the foundation, while the portable execution layer enables language-agnostic SDK and runner interoperability.
Core Abstractions
The Beam Model defines four fundamental concepts:
- PCollection: A distributed, immutable collection of data that can be bounded (batch) or unbounded (streaming). PCollections are produced and consumed by PTransforms.
- PTransform: A computation that transforms zero or more input PCollections into zero or more output PCollections. Transforms are composable and can be chained using the pipe operator (|).
- Pipeline: A directed acyclic graph (DAG) of PTransforms and PCollections. The Pipeline manages the entire workflow and coordinates execution through a runner.
- PipelineRunner: Specifies where and how the pipeline executes on a distributed backend (e.g., Flink, Spark, Dataflow, or local execution).
Pipeline Representation
Pipelines are represented in two forms:
- SDK-level: Language-specific objects (Python, Java, Go, TypeScript) that developers use to construct pipelines.
- Runner API (Proto): A language-neutral Protocol Buffer representation that enables runners to understand and execute pipelines regardless of the SDK used.
The Runner API defines a hierarchical graph where transforms reference PCollections by ID, and all components (coders, windowing strategies, environments) are stored in a shared Components map for efficient reuse.
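In the Python SDK this proto form can be inspected directly; a small sketch, assuming a trivially constructed pipeline:

```python
import apache_beam as beam

p = beam.Pipeline()
_ = p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x * 2)

# Convert the SDK-level pipeline into its language-neutral Runner API proto.
proto = p.to_runner_api()

# Transforms reference PCollections, coders, etc. by ID through the shared Components map.
for transform_id, transform in proto.components.transforms.items():
    print(transform_id, transform.spec.urn)
print('roots:', list(proto.root_transform_ids))
```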
Portable Execution Model
The portable execution model decouples SDKs from runners through three key APIs:
SDK (Python/Java/Go) → Job API → Runner → Fn API → SDK Harness
Job API (beam_job_api.proto): Handles job submission, lifecycle management, and state tracking. SDKs submit pipelines to a JobService, which prepares, stages artifacts, and runs the job.
Fn API (beam_fn_api.proto): Defines the execution contract between runners and SDK harnesses. It includes:
- Control API: Bidirectional gRPC stream for work instructions (ProcessBundle, FinalizeBundleRequest)
- Data API: Streams elements between runner and SDK harness
- State API: Manages user state and side inputs
- Logging API: Collects logs from SDK harnesses
Environments
Environments specify where user code executes:
- Docker: Containerized SDK harness (default for distributed execution)
- Process: Native process spawned by the runner
- External: Pre-existing SDK harness managed outside the runner
- Loopback: SDK harness runs in the same process as the runner (testing/local)
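With the Python SDK, the environment is typically selected through portable pipeline options; a sketch (the EXTERNAL address is a placeholder):

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Docker is the usual default for distributed portable execution.
docker_options = PipelineOptions(['--environment_type=DOCKER'])

# Loopback: the SDK harness runs inside the submitting process (local testing).
loopback_options = PipelineOptions(['--environment_type=LOOPBACK'])

# External: connect to an SDK harness managed outside the runner
# (the address is a placeholder).
external_options = PipelineOptions([
    '--environment_type=EXTERNAL',
    '--environment_config=localhost:50000',
])
```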
Standard Transforms
The Runner API defines primitive transforms that runners must implement:
- ParDo: Parallel element-wise processing with user-defined functions
- GroupByKey: Shuffle operation that groups values by key
- Flatten: Merges multiple PCollections
- Combine: Aggregates elements (e.g., sum, count)
- Window: Assigns elements to logical windows for aggregation
Composite transforms are built by combining primitives and can be nested hierarchically.
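As an illustration, a composite transform in the Python SDK built only from primitives (ParDo, via Map, and GroupByKey); the CountPerKey name is illustrative:

```python
import apache_beam as beam

class CountPerKey(beam.PTransform):
    """Composite transform: counts occurrences of each input element."""

    def expand(self, pcoll):
        return (
            pcoll
            | 'PairWithOne' >> beam.Map(lambda element: (element, 1))  # ParDo under the hood
            | 'GroupByKey' >> beam.GroupByKey()                        # primitive shuffle
            | 'SumCounts' >> beam.Map(lambda kv: (kv[0], sum(kv[1])))  # ParDo again
        )

# Usage: counts = words | CountPerKey()
```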
Data Flow Architecture
The runner translates the Runner API into executable stages, each containing a subset of transforms. Stages are sent to SDK harnesses via the Fn API, which execute user code and stream results back through the Data API.
Key Design Principles
- Language Neutrality: Proto-based APIs enable any language to implement SDKs and runners
- Separation of Concerns: SDKs define pipelines; runners handle execution strategy
- Composability: Transforms can be nested and reused across pipelines
- Portability: The same pipeline can run on multiple runners without modification
Python SDK
Relevant Files
- sdks/python/apache_beam/__init__.py
- sdks/python/apache_beam/pipeline.py
- sdks/python/apache_beam/pvalue.py
- sdks/python/apache_beam/transforms/core.py
- sdks/python/apache_beam/io/iobase.py
The Apache Beam Python SDK provides a unified programming model for building batch and streaming data processing pipelines. It exposes core Beam concepts through Python classes and enables developers to define complex data transformations using familiar Python idioms.
Core Concepts
The SDK is built around four fundamental abstractions:
- Pipeline - A directed acyclic graph (DAG) that manages transforms and data flows. Created with beam.Pipeline(), it orchestrates execution and handles runner selection.
- PCollection - An immutable, distributed dataset that can be bounded (finite) or unbounded (streaming). Elements flow through transforms and are tagged with optional metadata.
- PTransform - A computation that transforms input PCollections into output PCollections. Transforms can be composite (containing sub-transforms) or primitive (leaf nodes).
- Runner - Executes the pipeline on a specific execution engine (DirectRunner, Dataflow, Spark, etc.).
Pipeline Construction
Pipelines use a fluent API with the pipe operator (|) to chain transforms:
import apache_beam as beam
with beam.Pipeline() as p:
    result = (
        p
        | 'Create' >> beam.Create([1, 2, 3])
        | 'Double' >> beam.Map(lambda x: x * 2)
        | 'Write' >> beam.io.WriteToText('./output')
    )
The with statement ensures proper resource cleanup: on exit it automatically calls run() and waits for the pipeline to finish.
Key Transform Types
ParDo - The most flexible transform, applies a user-defined function (DoFn) to each element. Supports side inputs, stateful processing, and multiple outputs via TaggedOutput.
Map/FlatMap - Convenience wrappers around ParDo for simple element-wise transformations.
GroupByKey - Groups elements by key, essential for aggregations. Requires KV pairs as input.
CombinePerKey - Combines values per key using a CombineFn, more efficient than GroupByKey for associative operations.
Flatten - Merges multiple PCollections into one.
Windowing - Divides unbounded streams into finite windows (fixed, sliding, session-based).
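For example, a hedged sketch of a DoFn that emits to a main output and an additional tagged output (the tag names are illustrative):

```python
import apache_beam as beam
from apache_beam import pvalue

class SplitEvenOdd(beam.DoFn):
    def process(self, n):
        if n % 2 == 0:
            yield n                              # main output
        else:
            yield pvalue.TaggedOutput('odd', n)  # additional tagged output

with beam.Pipeline() as p:
    results = (
        p
        | 'Numbers' >> beam.Create([1, 2, 3, 4, 5])
        | 'Split' >> beam.ParDo(SplitEvenOdd()).with_outputs('odd', main='even')
    )
    results.even | 'PrintEven' >> beam.Map(print)
    results.odd | 'PrintOdd' >> beam.Map(print)
```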
Data I/O
Read - Ingests data from external sources via BoundedSource or UnboundedSource implementations. Supports splitting for parallel reads.
Write - Outputs data to external sinks. Handles batching and retry logic.
Common I/O connectors include ReadFromText, WriteToText, ReadFromBigQuery, and many database/messaging system adapters.
Type Hints and Inference
The SDK includes a sophisticated type system using @with_input_types and @with_output_types decorators. Type variables (K, V, T) enable generic transforms that adapt to concrete input types through pattern matching and substitution. Type checking can be enabled via pipeline options for early error detection.
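A short sketch of the decorators in use (the DoubleValues class is illustrative):

```python
import apache_beam as beam

@beam.typehints.with_input_types(beam.typehints.KV[str, int])
@beam.typehints.with_output_types(beam.typehints.KV[str, int])
class DoubleValues(beam.DoFn):
    def process(self, element):
        key, value = element
        yield key, value * 2

# Construction-time type checking is on by default; per-element runtime checks
# can be enabled with the --runtime_type_check pipeline option.
```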
Side Inputs
PCollections can be passed as side inputs using view wrappers: AsSingleton (single value), AsIter (iterable), AsList (materialized list), AsDict (key-value lookup), and AsMultiMap (multi-value keys).
main_data | beam.ParDo(MyDoFn(), beam.pvalue.AsSingleton(side_data))
Execution Model
The pipeline is lazily evaluated. Transforms are applied during construction to build the DAG, but actual computation occurs when run() is called. The runner converts the Python DAG to its native execution format (e.g., Dataflow proto, Spark RDD operations).
Java SDK
Relevant Files
- sdks/java/core/src/main/java/org/apache/beam/sdk/Pipeline.java
- sdks/java/core/src/main/java/org/apache/beam/sdk/values/PCollection.java
- sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/PTransform.java
- sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java
- sdks/java/core/src/main/java/org/apache/beam/sdk/io/BoundedSource.java
- sdks/java/core/src/main/java/org/apache/beam/sdk/io/UnboundedSource.java
Core Concepts
The Java SDK provides a powerful model for building batch and streaming data processing pipelines. The four fundamental abstractions are:
- Pipeline: Manages a directed acyclic graph (DAG) of transforms and collections. Each pipeline is self-contained and isolated; pipelines can execute concurrently.
- PCollection: An immutable collection of values of type T. Can be bounded (finite) or unbounded (infinite). Produced by transforms and consumed by other transforms.
- PTransform: An operation that takes an input (subtype of PInput) and produces an output (subtype of POutput). Transforms are composable and chainable.
- PValue: Base interface for pipeline values. Includes PCollection, PBegin (root input), and PDone (terminal output).
Building a Pipeline
Pipelines are constructed declaratively—transforms are added to the graph but not executed until run() is called:
PipelineOptions options = PipelineOptionsFactory.create();
Pipeline p = Pipeline.create(options);
PCollection<String> lines = p.apply(TextIO.read().from("gs://bucket/file*.txt"));
PCollection<String> words = lines.apply(FlatMapElements.into(TypeDescriptors.strings())
.via(line -> Arrays.asList(line.split("\\s+"))));
p.run();
Transforms are chained using apply(), which returns the output collection for the next transform.
Data Sources and I/O
The Read transform provides access to data sources through two abstractions:
- BoundedSource: Reads a finite amount of data. Supports splitting into bundles, size estimation, and dynamic work rebalancing. Use Read.from(boundedSource).
- UnboundedSource: Reads an unbounded stream. Supports checkpointing (for fault tolerance), watermarks (for progress tracking), and record IDs (for deduplication). Use Read.from(unboundedSource).
Both sources must be serializable so they can be distributed to worker machines. Sources are effectively immutable; mutable fields should be marked transient.
Coders and Type Safety
Each PCollection has an associated Coder that specifies how elements are serialized. Coders are inferred from Java types when possible, but can be explicitly set:
PCollection<String> lines = p.apply(TextIO.read().from("file.txt"))
.setCoder(StringUtf8Coder.of());
The Pipeline's CoderRegistry maintains the mapping from Java types to default coders. Custom types require explicit coder registration or manual specification.
Composite Transforms
Most transforms are composites—built from other transforms. Implement custom transforms by extending PTransform and overriding expand():
public class CountWords extends PTransform<PCollection<String>, PCollection<KV<String, Long>>> {
  @Override
  public PCollection<KV<String, Long>> expand(PCollection<String> input) {
    return input
        .apply(FlatMapElements.into(TypeDescriptors.strings()).via(line -> Arrays.asList(line.split("\\s+"))))
        .apply(Count.perElement());
  }
}
Windowing and Timestamps
Elements in a PCollection have associated timestamps and window assignments. By default, all elements are in a single global window. Override with the Window transform:
PCollection<String> windowed = input
.apply(Window.into(FixedWindows.of(Duration.standardMinutes(5))));
Readers assign timestamps when creating collections; transforms propagate them downstream.
Pipeline Execution
The runner is determined by PipelineOptions. The direct runner executes locally in a single process (useful for testing and development). Distributed runners (Dataflow, Spark, Flink) execute on clusters. Transforms are not executed when applied—only when pipeline.run() is called.
Runners & Execution
Relevant Files
- runners/core-java/src/main/java/org/apache/beam/runners/core/DoFnRunner.java
- runners/direct-java/src/main/java/org/apache/beam/runners/direct/DirectRunner.java
- runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkRunner.java
- runners/spark/src/main/java/org/apache/beam/runners/spark/SparkRunner.java
- runners/prism/java/src/main/java/org/apache/beam/runners/prism/PrismRunner.java
Beam runners execute pipelines by translating the pipeline DAG into a format executable by their target platform, then orchestrating the execution of transforms and data flow.
Runner Architecture
Each runner extends PipelineRunner and implements the run() method to execute a Pipeline. The execution flow follows a consistent pattern:
- Validation & Optimization - Validate pipeline options and apply optimizations (e.g., projection pushdown, splittable DoFn conversion)
- Translation - Convert the pipeline DAG to the target platform's native format (Spark RDD operations, Flink operators, etc.)
- Execution - Submit the translated pipeline to the execution engine and monitor progress
- Result Handling - Return a PipelineResult for tracking job state and metrics
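A Python sketch of the result-handling step, assuming local execution with the DirectRunner:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

p = beam.Pipeline(options=PipelineOptions(['--runner=DirectRunner']))
_ = p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x + 1)

result = p.run()                 # returns a PipelineResult immediately
result.wait_until_finish()       # block until the job reaches a terminal state
print(result.state)              # e.g. DONE, FAILED, CANCELLED
print(result.metrics().query())  # counters, distributions, and gauges
```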
Native vs. Portable Runners
Native Runners (DirectRunner, SparkRunner, FlinkRunner) translate pipelines directly to their platform's APIs:
- DirectRunner - Executes locally using Java threads and in-memory data structures. Enforces Beam model semantics (encodability, immutability) for testing.
- SparkRunner - Translates transforms to Spark RDDs/DataFrames. Supports both batch and streaming via SparkPipelineTranslator.
- FlinkRunner - Converts to Flink operators. Detects streaming vs. batch and applies appropriate optimizations.
Portable Runners (PrismRunner, PortableRunner) use the Beam Fn API to execute user code in separate SDK harnesses:
- PrismRunner - Launches a Prism service (Go-based) and delegates execution via the Fn API.
- PortableRunner - Submits pipelines to a remote job service using the Job API and Fn API.
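A hedged Python sketch of submitting to a portable job service (the endpoint is a placeholder):

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Submit to a portable job service (for example one started by a Flink/Spark
# job server or by Prism); the endpoint below is a placeholder.
portable_options = PipelineOptions([
    '--runner=PortableRunner',
    '--job_endpoint=localhost:8099',
    '--environment_type=LOOPBACK',  # run the SDK harness in the submitting process
])
```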
Bundle Execution Model
Transforms execute in bundles - small groups of elements processed together for efficiency. The DoFnRunner interface orchestrates this:
void startBundle(); // Initialize state, call @StartBundle
void processElement(elem); // Call @ProcessElement for each element
void finishBundle(); // Flush outputs, call @FinishBundle
void onTimer(timerId, ...); // Fire timers
Each runner implements bundle execution differently. Native runners use local executors; portable runners send bundles to SDK harnesses via gRPC.
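On the SDK side, the same bundle lifecycle is exposed through DoFn methods; a Python sketch (the batching behaviour is illustrative):

```python
import apache_beam as beam

class BatchingDoFn(beam.DoFn):
    """Illustrative DoFn that batches work per bundle."""

    def start_bundle(self):
        # Runs once per bundle, before any elements (mirrors @StartBundle).
        self._batch = []

    def process(self, element):
        # Runs once per element (mirrors @ProcessElement).
        self._batch.append(element)
        yield element  # pass the element through unchanged

    def finish_bundle(self):
        # Runs once per bundle, after all elements (mirrors @FinishBundle).
        # A real DoFn might flush self._batch to an external system here.
        self._batch = []
```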
Beam Fn API
The Fn API enables language-independent execution by defining a gRPC contract between runners and SDK harnesses:
- ProcessBundleRequest - Runner sends a bundle descriptor and input data
- ProcessBundleResponse - SDK harness returns output data and metrics
- Data Plane - Separate gRPC streams for high-throughput element transmission
This allows Python, Go, Java, and other SDKs to run on any Fn API-compatible runner.
I/O Connectors & Transforms
Relevant Files
- sdks/python/apache_beam/io/__init__.py
- sdks/python/apache_beam/io/iobase.py
- sdks/python/apache_beam/transforms/core.py
- sdks/java/core/src/main/java/org/apache/beam/sdk/io/Read.java
- sdks/java/core/src/main/java/org/apache/beam/sdk/io/BoundedSource.java
- sdks/java/core/src/main/java/org/apache/beam/sdk/io/UnboundedSource.java
Apache Beam provides a unified I/O framework for reading from and writing to external data sources. The framework separates concerns between sources (for reading) and sinks (for writing), with corresponding Read and Write transforms that integrate these into pipelines.
Sources: Reading Data
Sources define how data enters a pipeline. Beam distinguishes between two types:
BoundedSource - Reads a finite amount of data. Supports:
- Size estimation via estimate_size()
- Splitting into bundles for parallel reads via split()
- Range tracking for dynamic work rebalancing via get_range_tracker()
- Reading data respecting position boundaries via read()
UnboundedSource - Reads infinite or continuously growing data streams. Supports:
- Checkpointing to resume from failure points
- Watermark estimation for event time semantics
- Record IDs for deduplication
- Splitting into multiple partitions
The Read transform wraps a source and expands it into a PCollection. In Python, beam.io.Read(source) handles both bounded and unbounded sources, automatically selecting the appropriate execution strategy.
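A minimal Python sketch of a custom bounded source, modeled on the counting-source pattern from the Beam documentation (the CountingSource class is illustrative):

```python
from apache_beam.io import iobase
from apache_beam.io.range_trackers import OffsetRangeTracker

class CountingSource(iobase.BoundedSource):
    """Illustrative bounded source that emits the integers [0, count)."""

    def __init__(self, count):
        self._count = count

    def estimate_size(self):
        return self._count  # a rough estimate is sufficient

    def get_range_tracker(self, start_position, stop_position):
        start = 0 if start_position is None else start_position
        stop = self._count if stop_position is None else stop_position
        return OffsetRangeTracker(start, stop)

    def read(self, range_tracker):
        position = range_tracker.start_position()
        while range_tracker.try_claim(position):  # supports dynamic splitting
            yield position
            position += 1

    def split(self, desired_bundle_size, start_position=None, stop_position=None):
        start = 0 if start_position is None else start_position
        stop = self._count if stop_position is None else stop_position
        for bundle_start in range(start, stop, desired_bundle_size):
            bundle_stop = min(bundle_start + desired_bundle_size, stop)
            yield iobase.SourceBundle(
                weight=bundle_stop - bundle_start,
                source=self,
                start_position=bundle_start,
                stop_position=bundle_stop)

# Usage: numbers = pipeline | beam.io.Read(CountingSource(100))
```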
Sinks: Writing Data
Sinks define where data exits a pipeline. The Sink class (deprecated but still used) manages three-phase writes:
- Initialization - Sequential setup (e.g., creating output directories)
- Parallel Write - Workers write bundles via Writer instances
- Finalization - Sequential cleanup (e.g., merging files, committing transactions)
The Write transform applies a sink to a PCollection. Methods like initialize_write(), open_writer(), and finalize_write() must be idempotent to handle retries and failures.
Range Tracking and Dynamic Splitting
The RangeTracker enables dynamic work rebalancing for bounded sources. It tracks:
- Position ranges [start, stop)
- Consumed vs. remaining positions
- Split points for parallelizable records
Sources call try_claim(position) for split points and set_current_position() for non-split records. The try_split() method allows runners to dynamically rebalance work during execution.
Common I/O Connectors
Beam includes built-in connectors for popular systems:
- File formats: TextIO, AvroIO, ParquetIO, TFRecordIO
- Cloud storage: GCS (Google Cloud Storage), S3, Azure Blob Storage
- Databases: BigQuery, Datastore, MongoDB, JDBC
- Messaging: Pub/Sub, Kafka, Kinesis, JMS
- Other: Elasticsearch, Splunk, Debezium CDC
Each connector provides fluent builder APIs for configuration:
# Python example
pipeline | beam.io.ReadFromText('gs://bucket/file.txt')
pipeline | beam.io.WriteToText('gs://bucket/output')
// Java example
pipeline.apply(Read.from(new MyBoundedSource()))
pipeline.apply(TextIO.write().to("gs://bucket/output"))
Extensibility
Custom sources and sinks extend BoundedSource/UnboundedSource and Sink respectively. Implementations must be serializable for distribution to workers. The framework handles bundling, retries, and fault tolerance automatically.
Portability & Cross-Language
Relevant Files
- runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/control/SdkHarnessClient.java
- runners/java-fn-execution/src/main/java/org/apache/beam/runners/fnexecution/environment/ProcessEnvironmentFactory.java
- sdks/python/apache_beam/portability/common_urns.py
- sdks/go/pkg/beam/xlang.go
- sdks/go/pkg/beam/core/runtime/harness/harness.go
- model/fn-execution/src/main/proto/org/apache/beam/model/fn_execution/v1/beam_fn_api.proto
Beam's portability framework enables language-agnostic execution by defining standardized protocols and data structures. This allows SDKs in different languages (Python, Java, Go, TypeScript) to work with any runner, and enables cross-language pipelines where transforms from one SDK can be used in another.
Core Concepts
URNs (Uniform Resource Names) are the foundation of portability. They uniquely identify transforms, coders, environments, and other components in a language-neutral way. For example, beam:transform:pardo:v1 identifies the ParDo transform across all SDKs. The common_urns.py module centralizes URN definitions, making them accessible to all language SDKs.
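In Python, these identifiers can be read straight from common_urns; a small sketch:

```python
from apache_beam.portability import common_urns

# The same URNs identify standard components in every SDK and runner.
print(common_urns.primitives.PAR_DO.urn)        # beam:transform:pardo:v1
print(common_urns.primitives.GROUP_BY_KEY.urn)  # beam:transform:group_by_key:v1
print(common_urns.primitives.FLATTEN.urn)       # beam:transform:flatten:v1
```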
The Beam Fn API is a gRPC-based protocol that defines three communication planes between runners and SDK harnesses:
- Control Plane (BeamFnControl) - Runner sends instructions to SDKs (e.g., process bundles)
- Data Plane (BeamFnData) - Bidirectional streaming of PCollection elements
- State Plane (BeamFnState) - SDK queries runner for state (side inputs, user state)
SDK Harness Architecture
The SDK harness is the worker-side component that executes user code. SdkHarnessClient in Java provides a high-level wrapper around the Fn API, managing bundle processing, state delegation, and data flow. Each harness connects to the runner via gRPC endpoints and processes bundles according to ProcessBundleDescriptor messages.
Key responsibilities:
- Bundle Management - Creates ActiveBundle instances that handle input/output receivers
- State Handling - Delegates state requests to StateRequestHandler implementations
- Data Routing - Manages logical endpoints for data and timer streams
Environment Abstraction
ProcessEnvironmentFactory demonstrates how runners spawn language-specific worker processes. It:
- Extracts environment configuration (URN, payload) from the pipeline
- Starts a subprocess with the appropriate SDK harness binary
- Waits for the harness to connect back via gRPC
- Provides provisioning and control endpoints
This abstraction allows runners to support multiple languages without language-specific code.
Cross-Language Transforms
Cross-language execution uses expansion services to convert external transforms into native pipeline graphs. The Go SDK's CrossLanguage function demonstrates the pattern:
- Client sends ExpansionRequest with URN and payload to the expansion service
- Service returns the expanded PTransform graph
- Client integrates the expanded graph into its pipeline
This enables composing transforms across language boundaries at pipeline construction time.
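A hedged Python sketch of the same pattern using ExternalTransform; the URN, payload fields, and expansion-service address are all placeholders:

```python
import apache_beam as beam
from apache_beam.transforms.external import ExternalTransform, ImplicitSchemaPayloadBuilder

# Apply a transform that a separately running expansion service knows how to
# expand. The URN, payload fields, and endpoint are placeholders.
with beam.Pipeline() as p:
    _ = (
        p
        | beam.Create(['a', 'b', 'c'])
        | ExternalTransform(
            'my:external:transform:v1',
            ImplicitSchemaPayloadBuilder({'prefix': 'example-'}),
            expansion_service='localhost:8097')
    )
```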
Standardization Benefits
The portability framework achieves:
- SDK Interoperability - New SDKs automatically work with existing runners
- Runner Flexibility - Runners support all SDKs without custom integration
- Cross-Language Pipelines - Mix transforms from different SDKs in one pipeline
- Consistent Semantics - Windowing, state, and metrics work uniformly across languages
Beam Playground
Relevant Files
- playground/README.md
- playground/backend/README.md
- playground/frontend/README.md
- playground/api/v1/api.proto
- playground/backend/internal/
Beam Playground is a web-based interactive environment that allows users to write, run, and share Apache Beam code snippets without requiring local setup. It supports multiple SDKs (Java, Python, Go, and SCIO) and provides a modern browser interface for learning and experimenting with Beam pipelines.
Architecture Overview
The Playground consists of three main components:
Frontend (Flutter/Dart): A responsive web UI built with Flutter that provides code editing, execution controls, and result visualization. It communicates with backend services via gRPC and supports embedding in documentation via iframes.
Backend (Go): A gRPC server that orchestrates code execution, manages pipeline lifecycle, and handles caching. It processes code through validation, preparation, compilation, and execution stages specific to each SDK.
API (Protocol Buffers): Defines the gRPC contract between frontend and backend, including request/response messages for code execution, status checking, and output retrieval.
Core Workflow
When a user submits code, the backend executes a multi-stage pipeline:
- Validation: SDK-specific validators check syntax and semantic correctness
- Preparation: Preparers set up dependencies and environment (e.g., Go mod files, Python packages)
- Compilation: Code is compiled to executable form
- Execution: The compiled code runs with optional pipeline options and datasets
- Output Retrieval: Results, logs, errors, and execution graphs are cached and returned
Each stage produces output that can be retrieved independently via the gRPC API.
Key Features
Multi-SDK Support: Separate validator, preparer, and executor implementations for Java, Python, Go, and SCIO SDKs in playground/backend/internal/.
Caching: Supports both local and Redis-based caching to avoid re-executing identical code snippets.
Precompiled Examples: Catalog of examples, katas, and unit tests loaded from playground/categories.yaml and SDK directories.
Emulators: Kafka emulator support for demonstrating streaming I/O patterns.
Shareable Links: Users can share code snippets via unique identifiers or embed them in documentation using URL parameters.
Development Setup
Prerequisites include Go 1.23+, Flutter, protoc, Docker, and gcloud CLI. Run ./gradlew :playground:dockerComposeLocalUp to start a local deployment with all services.
For backend development, set environment variables like BEAM_SDK, APP_WORK_DIR, and DATASTORE_EMULATOR_HOST, then run go run ./cmd/server in the backend directory.
Frontend development uses flutter run -d chrome for local testing or flutter build web for production builds.
Deployment
Playground deploys to Google Cloud Platform using Terraform configurations in playground/terraform/. Infrastructure scripts in playground/infrastructure/ handle CI/CD pipelines for verifying and uploading examples to Cloud Datastore.