
apache/airflow

Apache Airflow Wiki

Last updated on Dec 18, 2025 (Commit: e19c6f5)

Overview

Relevant Files
  • README.md
  • airflow-core/README.md
  • airflow-core/docs/index.rst
  • airflow-core/docs/core-concepts/overview.rst
  • task-sdk/docs/index.rst

Apache Airflow is an open-source platform for developing, scheduling, and monitoring batch-oriented workflows. It enables you to define workflows as Python code, providing a flexible, extensible framework for orchestrating complex data pipelines and business processes.

What is Airflow?

Airflow represents workflows as Directed Acyclic Graphs (DAGs), where each node is a task and edges represent dependencies. The platform handles scheduling, execution, monitoring, and debugging of these workflows through a web-based UI and command-line interface.

Key characteristics:

  • Workflows as Code: Define pipelines entirely in Python, enabling version control, testing, and collaboration
  • Extensible: Built-in operators for common tasks, with support for custom operators and providers
  • Flexible: Leverages Jinja templating for rich customizations and dynamic pipeline generation
  • Scalable: Runs from a single machine to distributed systems handling massive workloads

Core Architecture

Airflow consists of several essential components:

  1. Scheduler: Triggers scheduled workflows and submits tasks to the executor
  2. Executor: Runs tasks (LocalExecutor, CeleryExecutor, KubernetesExecutor, or custom)
  3. DAG Processor: Parses DAG files and serializes them into the metadata database
  4. Webserver: Provides UI for visualization, management, and debugging
  5. Metadata Database: Stores state of tasks, DAGs, and variables (PostgreSQL, MySQL, or SQLite)
  6. Triggerer (optional): Executes deferred tasks in an asyncio event loop

Repository Structure

The monorepo contains multiple interconnected packages:

  • airflow-core: Core Airflow functionality (scheduler, API server, DAG processor, triggerer)
  • task-sdk: Stable interface for DAG authoring, decoupled from Airflow internals
  • providers: 70+ community-managed provider packages for integrations
  • chart: Kubernetes Helm chart for deploying Airflow
  • clients: Python client libraries for interacting with Airflow APIs
  • contributing-docs: Comprehensive contributor guidelines and development setup

Key Concepts

DAGs: Workflows defined as Python code with tasks and dependencies. Use the airflow.sdk namespace for stable, forward-compatible DAG authoring.

Tasks: Individual units of work represented by operators (BashOperator, PythonOperator, etc.) or decorated Python functions.

Executors: Determine how tasks are executed. LocalExecutor runs tasks in parallel processes on a single machine; CeleryExecutor and KubernetesExecutor enable distributed execution.

Providers: Separate packages extending Airflow with integrations for external systems (AWS, GCP, Databricks, etc.).

Use Cases

Airflow excels at:

  • Batch data processing and ETL pipelines
  • Scheduled workflows with clear start and end points
  • Complex task orchestration with dependencies
  • Monitoring and alerting on workflow execution

Airflow is not designed for streaming, continuously running, or event-driven workloads, though it complements streaming systems like Apache Kafka.

Architecture & Core Components

Relevant Files
  • airflow-core/src/airflow/jobs/scheduler_job_runner.py
  • airflow-core/src/airflow/dag_processing/processor.py
  • airflow-core/src/airflow/executors/base_executor.py
  • airflow-core/src/airflow/models/dag.py
  • airflow-core/src/airflow/models/dagrun.py
  • airflow-core/src/airflow/models/taskinstance.py

Airflow's architecture centers on a scheduler-executor pattern that orchestrates distributed task execution. The system consists of several interconnected components that work together to parse DAGs, schedule tasks, and execute them reliably.

Core Components

Scheduler (SchedulerJobRunner) is the central orchestrator. It runs continuously, executing a main loop that:

  1. Harvests DAG parsing results from the DAG processor
  2. Creates new DAG runs based on schedules
  3. Identifies executable tasks and queues them to executors
  4. Processes task completion events and updates state

Executor (BaseExecutor) is a pluggable component that handles actual task execution. The scheduler queues workloads to the executor, which manages parallelism, resource allocation, and task lifecycle. Different executor implementations (LocalExecutor, CeleryExecutor, KubernetesExecutor) provide various execution strategies.

DAG Processor (DagFileProcessorProcess) runs in isolated subprocesses to parse DAG files and serialize them into the metadata database. This isolation prevents user code from affecting the scheduler. The processor converts Python DAG definitions into serialized representations that the scheduler uses for decision-making.

Data Models

The core data models represent the workflow state:

  • DAG (DagModel): Represents a workflow definition with scheduling metadata, concurrency limits, and next run information
  • DagRun: An instance of a DAG execution with a specific logical date and state
  • TaskInstance: Represents a single task execution within a DAG run, storing state, retry information, and execution metadata
  • Task: The operator definition within a DAG (not persisted; loaded from serialized DAG)

Scheduling Loop

The scheduler's main loop executes these steps repeatedly:

# Simplified scheduler loop structure
1. Harvest DAG parsing results
2. Create DAG runs for scheduled DAGs
3. Find executable tasks (dependencies met, resources available)
4. Queue tasks to executors (with row-level locking for concurrency)
5. Heartbeat executors (trigger execution, sync task states)
6. Process task completion events
7. Handle expired deadlines

The critical section uses database row locks on the Pool model to ensure thread-safe task queuing across multiple scheduler instances.
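A minimal sketch of that pattern, assuming a SQLAlchemy session and a pool model (the helper below is illustrative, not Airflow's actual code):

from sqlalchemy import select

def lock_pools(session, pool_model):
    # SELECT ... FOR UPDATE SKIP LOCKED: rows already locked by another
    # scheduler are skipped instead of being waited on.
    pools = session.scalars(
        select(pool_model).with_for_update(skip_locked=True)
    ).all()
    # ...compute open slots and queue task instances while the lock is held...
    return pools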

Key Architectural Principles

Isolation: User code (DAGs, operators) runs in separate processes to prevent scheduler crashes. The DAG processor and task runtime are isolated from the scheduler.

Concurrency Control: Multiple schedulers can run simultaneously using database-level row locks. Pool limits and concurrency constraints prevent resource exhaustion.

State Authority: The metadata database is the single source of truth for task and DAG state. All state transitions are persisted before execution.

Asynchronous Execution: The scheduler queues tasks and continues; executors handle actual execution asynchronously. The scheduler polls for completion events.

DAG & Task Execution

Relevant Files
  • task-sdk/src/airflow/sdk/definitions/dag.py
  • airflow-core/src/airflow/models/dag.py
  • airflow-core/src/airflow/models/taskinstance.py
  • airflow-core/src/airflow/executors/base_executor.py
  • airflow-core/src/airflow/jobs/scheduler_job_runner.py

DAG Fundamentals

A DAG (Directed Acyclic Graph) is a collection of tasks with directional dependencies that represents a workflow. Each DAG has a schedule, start date, and optional end date. The DAG itself doesn't execute logic—it defines how tasks should run: their order, retry policies, timeouts, and other operational details.

DAGs are instantiated into DAG Runs each time they execute according to their schedule. For example, a daily DAG creates one run per day.

Declaring DAGs

There are three ways to declare a DAG:

1. Context Manager (with statement):

from airflow.sdk import DAG
from airflow.providers.standard.operators.empty import EmptyOperator
import datetime

with DAG(
    dag_id="my_dag",
    start_date=datetime.datetime(2021, 1, 1),
    schedule="@daily",
):
    EmptyOperator(task_id="task1")

2. Constructor:

my_dag = DAG(
    dag_id="my_dag",
    start_date=datetime.datetime(2021, 1, 1),
    schedule="@daily",
)
EmptyOperator(task_id="task1", dag=my_dag)

3. Decorator:

from airflow.sdk import dag

@dag(start_date=datetime.datetime(2021, 1, 1), schedule="@daily")
def generate_dag():
    EmptyOperator(task_id="task1")

generate_dag()

Tasks and Task Instances

A Task is the basic unit of execution in a DAG. Tasks are arranged with upstream and downstream dependencies to express execution order. There are three kinds of tasks:

  • Operators: Predefined task templates (e.g., BashOperator, PythonOperator)
  • Sensors: Special operators that wait for external events
  • TaskFlow tasks: Custom Python functions decorated with @task
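For example, a TaskFlow task is an ordinary Python function decorated with @task; calling it inside a DAG wires dependencies through its return value (a minimal sketch):

import datetime
from airflow.sdk import dag, task

@dag(start_date=datetime.datetime(2021, 1, 1), schedule="@daily")
def taskflow_example():
    @task
    def extract():
        return {"a": 1, "b": 2}

    @task
    def summarize(data: dict):
        print(sum(data.values()))

    summarize(extract())  # extract >> summarize, result passed via XCom

taskflow_example()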

Much like DAGs become DAG Runs, tasks become Task Instances—specific executions of a task for a given DAG run. Task instances have state representing their lifecycle stage.

Task Instance States

Task instances flow through these states:

  • none: Dependencies not yet met
  • scheduled: Scheduler determined it should run
  • queued: Assigned to executor, awaiting worker
  • running: Currently executing
  • success: Completed without errors
  • failed: Encountered an error
  • skipped: Bypassed due to branching logic
  • upstream_failed: Upstream task failed
  • up_for_retry: Failed but has retry attempts remaining
  • deferred: Waiting for a trigger event

Task Dependencies

Define dependencies using bitshift operators or explicit methods:

task1 >> task2 >> [task3, task4]  # Bitshift operators
task1.set_downstream(task2)        # Explicit method

By default, a task runs when all upstream tasks succeed. Use trigger rules to modify this behavior.
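For example, a cleanup task that should run regardless of how its upstream tasks finished can declare a trigger rule (string values such as "all_done" are accepted on operators):

from airflow.providers.standard.operators.empty import EmptyOperator

cleanup = EmptyOperator(
    task_id="cleanup",
    trigger_rule="all_done",  # run once all upstream tasks have finished, in any state
)
[task3, task4] >> cleanup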

Executors and Task Execution

Executors are the mechanism by which task instances get executed. They're pluggable—you can swap executors based on your deployment needs. The scheduler queues workloads to the executor, which manages parallelism, resource allocation, and task lifecycle.

Executor Types

Local Executors run tasks within the scheduler process:

  • LocalExecutor: Runs tasks locally using multiprocessing (default)
  • Pros: Easy setup, low latency
  • Cons: Limited scalability, shares resources with scheduler

Remote Executors distribute tasks to external workers:

  • Queued/Batch: Tasks sent to central queue (e.g., CeleryExecutor, BatchExecutor)
  • Containerized: Each task runs in isolated container (e.g., KubernetesExecutor, EcsExecutor)

Execution Flow

  1. Scheduler parses DAGs and creates DAG runs based on schedules
  2. Scheduler identifies executable task instances (dependencies met, state valid)
  3. Scheduler creates workloads and queues them to the executor
  4. Executor manages task assignment to workers/processes
  5. Worker executes the task instance
  6. Executor reports completion state back to scheduler
  7. Scheduler updates task instance state and processes downstream tasks

Configuration

Set the executor in your Airflow configuration:

[core]
executor = LocalExecutor

For multiple executors (Airflow 2.10+):

[core]
executor = LocalExecutor,CeleryExecutor

The first executor listed is the default; the others can be selected explicitly on individual tasks or DAGs, as shown below.
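For example, a single resource-heavy task can be routed to a non-default executor via the operator's executor argument, available since Airflow 2.10 (a sketch; the callable is illustrative):

from airflow.providers.standard.operators.python import PythonOperator

def crunch_numbers():
    return sum(range(1_000_000))

heavy_task = PythonOperator(
    task_id="heavy_task",
    python_callable=crunch_numbers,
    executor="CeleryExecutor",  # overrides the default executor for this task only
)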

Task SDK & Execution Runtime

Relevant Files
  • task-sdk/src/airflow/sdk/__init__.py
  • task-sdk/src/airflow/sdk/execution_time/supervisor.py
  • task-sdk/src/airflow/sdk/execution_time/comms.py
  • task-sdk/src/airflow/sdk/execution_time/task_runner.py
  • task-sdk/src/airflow/sdk/api/client.py

The Task SDK provides a stable, forward-compatible interface for defining DAGs and executing tasks in isolated subprocesses. It decouples task authoring from Airflow internals, enabling remote execution and language-agnostic task support.

Core Architecture

The Task SDK introduces a service-oriented architecture with three key components:

1. DAG Authoring Interface (airflow.sdk namespace)

  • Stable public API for defining DAGs, tasks, and operators
  • Replaces internal imports like airflow.models.dag.DAG and airflow.decorators.task
  • Includes decorators (@dag, @task, @setup, @teardown), classes (DAG, TaskGroup, BaseOperator), and utilities (Context, Variable, Connection)

2. Execution Runtime (Supervisor & Task Runner)

  • Supervisor: Parent process that manages task execution, proxies API calls, and monitors subprocess health
  • Task Runner: Isolated subprocess where user task code executes
  • Communication via binary length-prefixed msgpack frames over stdin/stdout

3. Execution API Client (airflow.sdk.api.client)

  • HTTP client for communicating with the Execution API server
  • Handles task state transitions, heartbeats, XCom operations, and resource fetching
  • Implements retry logic with exponential backoff


Communication Protocol

The Supervisor and Task Runner communicate via a binary protocol:

  • Request Frame: 4-byte big-endian length prefix + msgpack-encoded _RequestFrame (id, body)
  • Response Frame: 4-byte length prefix + msgpack-encoded _ResponseFrame (id, body, error)
  • Log Messages: Dedicated socket with line-based JSON encoding
  • No unsolicited messages: Task process only receives responses to its requests

This design reduces API server load (single connection per task) and prevents user code from accessing JWT tokens.
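An illustrative sketch of that framing (function names here are for explanation only, not the actual Supervisor code):

import struct
import msgpack

def write_frame(stream, frame_id: int, body: dict) -> None:
    # msgpack-encode the frame, then prefix it with a 4-byte big-endian length.
    payload = msgpack.packb({"id": frame_id, "body": body})
    stream.write(struct.pack(">I", len(payload)) + payload)
    stream.flush()

def read_frame(stream) -> dict:
    # Read the length prefix, then exactly that many payload bytes.
    (length,) = struct.unpack(">I", stream.read(4))
    return msgpack.unpackb(stream.read(length))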

Key Runtime Operations

Task Startup

  • Supervisor calls POST /run to mark task as running
  • API returns TIRunContext with DAG run info, variables, connections, and retry metadata

Runtime Requests

  • Task code requests variables, connections, XComs, or asset information
  • Supervisor intercepts requests, calls Execution API, and relays responses

Heartbeats & Token Renewal

  • Task Runner periodically sends heartbeats through Supervisor
  • API server returns refreshed JWT tokens in Refreshed-API-Token header

State Transitions

  • On completion/failure/deferral, Task Runner sends final state to Supervisor
  • Supervisor calls PATCH /state with terminal status and metadata

Testing & In-Process Execution

The InProcessTestSupervisor class enables testing DAGs without spawning subprocesses:

from airflow.sdk.execution_time.supervisor import InProcessTestSupervisor

result = InProcessTestSupervisor.start(
    what=task_instance,
    task=my_task,
    logger=log
)

This is useful for dag.test() workflows where the DAG is already parsed in memory.
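For example, assuming my_dag from the earlier declaration examples, a DAG file can end with a guarded call so it can be executed directly for local debugging:

if __name__ == "__main__":
    my_dag.test()  # runs the DAG in-process, without a separate scheduler or executor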

Providers & Extensibility

Relevant Files
  • airflow-core/src/airflow/providers_manager.py
  • PROVIDERS.rst
  • providers-summary-docs/index.rst
  • contributing-docs/12_provider_distributions.rst
  • providers/standard/provider.yaml
  • airflow-core/src/airflow/provider_info.schema.json

Airflow is built on a modular architecture where the core provides scheduling and orchestration, while providers extend capabilities through integrations with external systems. Providers are independently versioned packages that can be installed, upgraded, or downgraded without affecting the core.

Provider Architecture

The provider system uses a discovery and registration pattern. When Airflow starts, the ProvidersManager scans installed packages for the apache_airflow_provider entry point, loads provider metadata from provider.yaml files, and registers all available extensions. This lazy-loading approach means components are only imported when actually needed.

Core Extension Points

Providers can extend Airflow through multiple mechanisms:

Connections & Hooks - Define custom connection types with UI customizations. Each connection type maps to a Hook class that handles authentication and interaction with external systems.

Operators & Sensors - Task types for orchestrating external services. Operators perform actions; sensors wait for conditions.

Executors - Custom task execution strategies (e.g., Kubernetes, Celery, cloud-native executors).

Auth Managers - Handle user authentication and authorization for UI and API access.

Logging Handlers - Remote task logging to S3, Cloudwatch, HDFS, or other storage systems.

Secret Backends - Read connections and variables from external secret managers instead of the database.

Notifications - Send alerts via Slack, email, SNS, or custom channels when task/DAG states change.

Plugins - General-purpose extensions for custom UI components or functionality.

Task Decorators - Simplified task definition syntax (e.g., @task.python, @task.bash).
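As an example of a provider-supplied decorator, @task.bash (from the standard provider) turns a function that returns a shell command string into a bash task; a minimal sketch:

from airflow.sdk import task

@task.bash
def archive_logs(day: str):
    # The returned string is executed as a bash command.
    return f"tar -czf /tmp/logs_{day}.tar.gz /var/log/airflow"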

Provider Metadata Structure

Each provider includes a provider.yaml file declaring its capabilities:

package-name: apache-airflow-providers-amazon
connection-types:
  - hook-class-name: airflow.providers.amazon.aws.hooks.s3.S3Hook
    connection-type: aws_s3
operators:
  - integration-name: Amazon S3
    python-modules:
      - airflow.providers.amazon.aws.operators.s3
notifications:
  - airflow.providers.amazon.aws.notifications.sns.SnsNotifier
logging:
  - airflow.providers.amazon.aws.log.s3_task_handler.S3TaskHandler

Community vs. Third-Party Providers

Community providers are maintained by the Airflow project, released with constraints ensuring compatibility, and included in convenience Docker images. They follow Apache governance and release processes.

Third-party providers are independently maintained and released. They have the same capabilities as community providers but are not subject to Airflow's release cycle or constraint management.

Creating Custom Providers

Custom providers follow the same structure as community providers. A minimal provider requires:

  • pyproject.toml - Package metadata and dependencies
  • provider.yaml - Extension declarations
  • src/airflow/providers/YOUR_PROVIDER/ - Implementation code
  • Entry point in pyproject.toml pointing to a get_provider_info() function

This enables organizations to build proprietary integrations with the same extensibility as official providers.
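A minimal get_provider_info() simply returns the provider.yaml content as a dictionary; the package and class names below are hypothetical:

def get_provider_info():
    # Metadata returned here mirrors provider.yaml and is read by the ProvidersManager.
    return {
        "package-name": "acme-airflow-provider",
        "name": "Acme",
        "versions": ["1.0.0"],
        "connection-types": [
            {
                "hook-class-name": "airflow.providers.acme.hooks.acme.AcmeHook",
                "connection-type": "acme",
            }
        ],
    }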

REST API & Web Interface

Relevant Files
  • airflow-core/src/airflow/api_fastapi/app.py
  • airflow-core/src/airflow/api_fastapi/core_api/app.py
  • airflow-core/src/airflow/api_fastapi/execution_api/app.py
  • airflow-core/src/airflow/ui/src/main.tsx
  • airflow-core/docs/administration-and-deployment/web-stack.rst

Airflow 3 uses a modern FastAPI-based REST API with a React frontend, replacing the legacy Flask-based system. The architecture separates concerns into two independent API servers that can be deployed together or separately.

API Architecture

The REST API is built on FastAPI and organized into two main components:

Core API (/api/v2) provides stable, public endpoints for DAG management, task execution, monitoring, and configuration. These endpoints are backward compatible and safe for external consumption.

Execution API (/execution) is a private, versioned API designed for task execution and internal communication. It uses JWT authentication and supports API versioning through Cadwyn, allowing breaking changes while maintaining backward compatibility.

Both APIs are mounted under a single FastAPI application in app.py; each can be enabled or disabled via the --apps flag:

airflow api-server --apps core,execution  # Both (default)
airflow api-server --apps core            # Core API only
airflow api-server --apps execution       # Execution API only

API Routing & Organization

Routes are organized hierarchically using routers:

  • Public routes (/api/v2) include DAGs, task instances, connections, assets, and monitoring endpoints
  • UI routes (/ui) are internal endpoints for frontend consumption, subject to breaking changes
  • Execution routes (/execution) handle task execution, XComs, variables, and asset events with JWT authentication

Each route module is self-contained with its own dependencies, request/response models, and security checks. The AirflowRouter wrapper provides common functionality like access control and error handling.

Web UI Stack

The frontend is a React + TypeScript single-page application built with Vite and deployed as static assets. Key technologies:

  • React Router for client-side navigation
  • TanStack React Query for server state management and caching
  • Chakra UI for component library and theming
  • OpenAPI client generation for type-safe API calls

The UI is mounted at the root path and serves as a catch-all for all non-API routes, enabling client-side routing. Static assets are served from /static, and the frontend communicates with the API via dynamically configured base URLs.

Plugin System

The UI supports dynamic React plugin loading through a plugin system. Plugins are loaded as separate bundles and share the host application's React instance via globalThis. This allows third-party extensions without modifying core code.

Configuration & Deployment

CORS, authentication, and middleware are configured through Airflow's configuration system. The API root path can be customized via api.base_url, allowing deployment behind URL prefixes without frontend rebuilds. Middleware for JWT refresh, authentication, and error handling is applied globally.

Kubernetes & Helm Deployment

Relevant Files
  • chart/Chart.yaml
  • chart/values.yaml
  • chart/README.md
  • chart/templates/scheduler/scheduler-deployment.yaml
  • chart/docs/index.rst

The Apache Airflow Helm chart provides a production-ready deployment mechanism for Kubernetes. It abstracts the complexity of deploying Airflow's distributed components (scheduler, workers, webserver, API server) into a single, configurable package.

Chart Overview

The chart (version 1.19.0-dev, supporting Airflow 3.1.5) deploys Airflow with support for multiple executors: LocalExecutor, CeleryExecutor, KubernetesExecutor, and hybrid variants. It includes PostgreSQL as a dependency and provides optional Redis for Celery message brokering.

Key features:

  • Multi-executor support with automatic pod launching capabilities
  • Automatic database migrations and admin user creation via Helm hooks
  • Built-in monitoring with StatsD/Prometheus and Flower UI
  • Security enhancements including Service Account Token Volume configuration
  • KEDA-based autoscaling for Celery workers
  • Kerberos authentication support

Core Components

Scheduler: Deployed as Deployment or StatefulSet (when using LocalExecutor with persistence). Manages DAG parsing and task scheduling. Supports multiple replicas for high availability with MySQL 8+ or PostgreSQL.

Workers: Celery workers deployed as StatefulSet with persistent volumes. Configurable replicas with KEDA autoscaling based on queued tasks or HPA for CPU metrics.

API Server: Airflow 3.0+ component providing REST API. Supports horizontal pod autoscaling with configurable metrics.

Webserver: Legacy UI component (Airflow <3.0). Replaced by API Server in Airflow 3.0+.

Triggerer: Manages async task triggers. Deployed as Deployment with configurable replicas.

Configuration via values.yaml

The chart uses a comprehensive values.yaml with sections for each component:

executor: "CeleryExecutor"
scheduler:
  replicas: 1
  resources:
    limits:
      cpu: 100m
      memory: 128Mi
workers:
  replicas: 1
  persistence:
    enabled: true
    size: 100Gi
apiServer:
  replicas: 1
  hpa:
    enabled: false
    minReplicaCount: 1
    maxReplicaCount: 5

Key configuration areas include resource limits, security contexts, ingress rules, environment variables, and custom volumes. The chart supports templating for dynamic values.

Deployment Patterns

LocalExecutor: Scheduler acts as worker. Uses StatefulSet with persistence when enabled. Suitable for single-node or small deployments.

CeleryExecutor: Distributed task execution across multiple worker pods. Requires Redis broker and PostgreSQL/MySQL backend. Scales horizontally via KEDA or HPA.

KubernetesExecutor: Each task runs in its own pod. No separate workers needed. Ideal for dynamic workloads with varying resource requirements.
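A typical installation from the official chart repository looks like this (release and namespace names are illustrative):

# Add the official chart repo and install with the KubernetesExecutor
helm repo add apache-airflow https://airflow.apache.org
helm repo update
helm install airflow apache-airflow/airflow \
  --namespace airflow --create-namespace \
  --set executor=KubernetesExecutor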

Database & Secrets

The chart manages Airflow metadata database connections, Fernet keys, and API secrets through Kubernetes Secrets. PostgreSQL can be deployed as a chart dependency or configured externally. PgBouncer provides connection pooling for high-concurrency scenarios.

Ingress & Networking

Ingress resources can be configured for API Server, Webserver, Flower, and StatsD endpoints. Network policies are optional. Service discovery uses standard Kubernetes DNS within the cluster.

Monitoring & Observability

StatsD exporter collects Airflow metrics for Prometheus scraping. Flower provides Celery worker monitoring. Pod disruption budgets ensure availability during cluster maintenance.

Testing Infrastructure

Relevant Files
  • contributing-docs/09_testing.rst
  • contributing-docs/testing/unit_tests.rst
  • contributing-docs/testing/integration_tests.rst
  • airflow-core/tests/conftest.py
  • devel-common/src/tests_common/pytest_plugin.py
  • pyproject.toml (pytest configuration)

Airflow uses a comprehensive, multi-layered testing infrastructure built on pytest to ensure reliability across different deployment scenarios. All tests use pytest as the standard framework, with custom plugins and fixtures providing specialized functionality.

Test Categories

The testing framework includes several distinct test types:

  • Unit Tests - Python tests without external integrations, runnable in local virtualenv or Breeze. Required for all PRs unless documentation-only.
  • Integration Tests - Tests requiring external services (Postgres, MySQL, Kerberos, Celery, etc.), run only in Breeze with --integration flag.
  • System Tests - End-to-end DAG execution tests using external systems like Google Cloud and AWS.
  • Docker Compose Tests - Verify quick-start Docker Compose setup.
  • Kubernetes Tests - Validate Kubernetes deployment and Pod Operator functionality.
  • Helm Unit Tests - Verify Helm Chart rendering for various configurations.
  • Task SDK Integration Tests - Specialized tests for Task SDK integration with running Airflow.
  • Airflow Ctl Tests - Verify airflowctl command-line tool functionality.

Unit Test Architecture

Unit tests are split into two categories:

DB Tests - Access the metadata database and run sequentially, so they are slower. Tests that need a specific backend are marked with @pytest.mark.backend("postgres", "mysql").

Non-DB Tests - Run with the none backend (any database access fails), execute in parallel via pytest-xdist, and are much faster. Select them with the --skip-db-tests flag.
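A sketch of how a DB test is marked in practice (the test body is illustrative; fixtures come from the shared pytest plugin):

import pytest

# Module-level marker: every test in this file is treated as a DB test.
pytestmark = pytest.mark.db_test

@pytest.mark.backend("postgres", "mysql")
def test_requires_a_real_database():
    ...  # talks to the metadata database via fixtures provided by the plugin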

Pytest Configuration & Plugins

The testing infrastructure uses a custom pytest plugin (tests_common.pytest_plugin) that:

  • Configures Airflow in unit test mode before any imports
  • Manages environment variables and integrations
  • Provides fixtures for DAGs, operators, and task instances
  • Handles database setup and teardown
  • Enforces test file naming conventions (test_*.py for unit tests, example_*.py or test_*.py for system tests)
  • Captures and validates warnings (prohibits AirflowProviderDeprecationWarning by default)

Key pytest options in pyproject.toml:

[tool.pytest.ini_options]
addopts = [
    "--tb=short",
    "-rasl",
    "--verbosity=2",
    "-p", "no:flaky",
    "-p", "no:nose",
    "-p", "no:legacypath",
    "--disable-warnings",
    "--asyncio-mode=strict",
]

Running Tests

Local virtualenv:

pytest airflow-core/tests/unit/core/test_core.py

Breeze (with integrations):

breeze testing core-tests --run-in-parallel
breeze testing core-tests --skip-db-tests --use-xdist
breeze testing providers-tests --run-in-parallel

The CI pipeline uses scripts/ci/testing/run_unit_tests.sh with test scopes: DB, Non-DB, All, Quarantined, System.

Best Practices

  • Use standard Python assert statements and pytest decorators, not unittest classes
  • Mock all external communications in unit tests
  • Use pytest.mark.parametrize for parameter variations
  • Mock time.sleep() and asyncio.sleep() to speed up tests
  • Use pytest.warns() to capture expected warnings
  • Avoid deprecated methods; test legacy features only with explicit warning capture
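A small sketch combining two of these practices, parametrization and expected-warning capture (the function under test is hypothetical):

import warnings

import pytest

def legacy_join(parts, sep=","):
    warnings.warn("legacy_join is deprecated", DeprecationWarning)
    return sep.join(parts)

@pytest.mark.parametrize("parts, expected", [(["a", "b"], "a,b"), ([], "")])
def test_legacy_join_warns(parts, expected):
    with pytest.warns(DeprecationWarning, match="deprecated"):
        assert legacy_join(parts) == expected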

Development & Contributing

Relevant Files
  • contributing-docs/README.rst
  • contributing-docs/03a_contributors_quick_start_beginners.rst
  • contributing-docs/07_local_virtualenv.rst
  • contributing-docs/08_static_code_checks.rst
  • contributing-docs/09_testing.rst
  • contributing-docs/11_documentation_building.rst
  • CONTRIBUTING.rst

Quick Start for New Contributors

Apache Airflow welcomes contributions from developers of all experience levels. The project provides two main paths for getting started:

Breeze (Local Development) – Run Airflow in Docker containers on your machine. Requires Docker or Podman, the uv package manager, at least 4GB of RAM, and 40GB of free disk space.

GitHub Codespaces – One-click cloud-based development environment with VS Code web IDE. No local setup required.

Both paths guide you through making your first pull request in approximately 15 minutes.

Development Environment Setup

Using uv for Virtual Environment Management

As of November 2024, Airflow recommends uv for managing Python virtual environments. It is a fast, modern package manager that handles Python versions, dependencies, and development tools.

# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create and sync virtual environment
uv venv
uv sync

Breeze Development Container

Breeze replicates the CI environment locally and includes all necessary services (databases, message brokers, etc.) for integration testing.

# Install Breeze
uv tool install -e ./dev/breeze

# Start development environment
breeze start-airflow

Code Quality & Static Checks

Prek Hooks

Airflow uses prek (a Rust-based replacement for pre-commit) to run code quality checks before commits. Hooks run only on staged files, making them fast and non-intrusive.

# Install prek
uv tool install prek
prek install -f
prek install -f --hook-type pre-push  # for mypy checks

Checks include formatting, linting, type checking, and bug detection. They use the same environment as CI, ensuring local validation matches CI results.

Testing Framework

Airflow features a comprehensive testing infrastructure:

  • Unit tests – Python tests without external dependencies; required for most PRs
  • Integration tests – Tests requiring services (Postgres, MySQL, Kerberos) in Breeze
  • System tests – End-to-end tests using external systems (Google Cloud, AWS)
  • Docker Compose tests – Validation of quick-start Docker setup
  • Kubernetes & Helm tests – Deployment and chart rendering verification

Run tests locally with pytest or via Breeze for full integration testing.

Documentation Building

Documentation is built using Sphinx and organized by distribution:

# Build docs locally (requires Python 3.11)
uv python pin 3.11
uv run --group docs build-docs

Key documentation locations:

  • airflow-core/docs – Core Airflow documentation
  • providers/*/docs – Provider-specific documentation
  • chart/docs – Helm Chart documentation
  • task-sdk/docs – Task SDK documentation

Pull Request Workflow

  1. Fork the repository and clone your fork
  2. Create a branch for your changes
  3. Make changes and run prek run --all-files to validate
  4. Commit & push to your fork
  5. Open a PR – GitHub shows a "Compare & pull request" button
  6. Respond to reviews and push updates as needed
  7. Merge – A committer merges once CI passes and reviews are approved

Keep your branch rebased with git fetch upstream && git rebase upstream/main && git push --force-with-lease.

Key Resources

  • New Contributors – Start with the 15-minute quick start guide
  • Seasoned Developers – Full development environment guide with advanced tooling
  • Contribution Workflow – Overview of how to contribute to Airflow
  • Git Workflow – Branching strategy, syncing forks, and rebasing PRs
  • Provider Development – Guide for contributing to Airflow providers