Overview
Relevant Files
- README.md
- airflow-core/README.md
- airflow-core/docs/index.rst
- airflow-core/docs/core-concepts/overview.rst
- task-sdk/docs/index.rst
Apache Airflow is an open-source platform for developing, scheduling, and monitoring batch-oriented workflows. It enables you to define workflows as Python code, providing a flexible, extensible framework for orchestrating complex data pipelines and business processes.
What is Airflow?
Airflow represents workflows as Directed Acyclic Graphs (DAGs), where each node is a task and edges represent dependencies. The platform handles scheduling, execution, monitoring, and debugging of these workflows through a web-based UI and command-line interface.
Key characteristics:
- Workflows as Code: Define pipelines entirely in Python, enabling version control, testing, and collaboration
- Extensible: Built-in operators for common tasks, with support for custom operators and providers
- Flexible: Leverages Jinja templating for rich customizations and dynamic pipeline generation
- Scalable: Runs from a single machine to distributed systems handling massive workloads
Core Architecture
Airflow consists of several essential components:
- Scheduler: Triggers scheduled workflows and submits tasks to the executor
- Executor: Runs tasks (LocalExecutor, CeleryExecutor, KubernetesExecutor, or custom)
- DAG Processor: Parses DAG files and serializes them into the metadata database
- Webserver: Provides UI for visualization, management, and debugging
- Metadata Database: Stores state of tasks, DAGs, and variables (PostgreSQL, MySQL, or SQLite)
- Triggerer (optional): Executes deferred tasks in an asyncio event loop
Repository Structure
The monorepo contains multiple interconnected packages:
- airflow-core: Core Airflow functionality (scheduler, API server, DAG processor, triggerer)
- task-sdk: Stable interface for DAG authoring, decoupled from Airflow internals
- providers: 70+ community-managed provider packages for integrations
- chart: Kubernetes Helm chart for deploying Airflow
- clients: Python client libraries for interacting with Airflow APIs
- contributing-docs: Comprehensive contributor guidelines and development setup
Key Concepts
DAGs: Workflows defined as Python code with tasks and dependencies. Use the airflow.sdk namespace for stable, forward-compatible DAG authoring.
Tasks: Individual units of work represented by operators (BashOperator, PythonOperator, etc.) or decorated Python functions.
Executors: Determine how tasks are executed. LocalExecutor runs tasks in parallel processes on a single machine; CeleryExecutor and KubernetesExecutor enable distributed execution.
Providers: Separate packages extending Airflow with integrations for external systems (AWS, GCP, Databricks, etc.).
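A minimal sketch of how these pieces fit together, using the airflow.sdk namespace mentioned above (the dag_id, schedule, and task bodies are illustrative):
import datetime
from airflow.sdk import dag, task
@dag(schedule="@daily", start_date=datetime.datetime(2024, 1, 1))
def minimal_pipeline():
    @task
    def extract():
        # Return value is passed downstream via XCom
        return {"rows": 42}
    @task
    def load(payload):
        print(f"loading {payload['rows']} rows")
    # Passing extract()'s output into load() creates the dependency
    load(extract())
minimal_pipeline()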
Use Cases
Airflow excels at:
- Batch data processing and ETL pipelines
- Scheduled workflows with clear start and end points
- Complex task orchestration with dependencies
- Monitoring and alerting on workflow execution
Airflow is not designed for streaming, continuously running, or event-driven workloads, though it complements streaming systems like Apache Kafka.
Architecture & Core Components
Relevant Files
- airflow-core/src/airflow/jobs/scheduler_job_runner.py
- airflow-core/src/airflow/dag_processing/processor.py
- airflow-core/src/airflow/executors/base_executor.py
- airflow-core/src/airflow/models/dag.py
- airflow-core/src/airflow/models/dagrun.py
- airflow-core/src/airflow/models/taskinstance.py
Airflow's architecture centers on a scheduler-executor pattern that orchestrates distributed task execution. The system consists of several interconnected components that work together to parse DAGs, schedule tasks, and execute them reliably.
Core Components
Scheduler (SchedulerJobRunner) is the central orchestrator. It runs continuously, executing a main loop that:
- Harvests DAG parsing results from the DAG processor
- Creates new DAG runs based on schedules
- Identifies executable tasks and queues them to executors
- Processes task completion events and updates state
Executor (BaseExecutor) is a pluggable component that handles actual task execution. The scheduler queues workloads to the executor, which manages parallelism, resource allocation, and task lifecycle. Different executor implementations (LocalExecutor, CeleryExecutor, KubernetesExecutor) provide various execution strategies.
DAG Processor (DagFileProcessorProcess) runs in isolated subprocesses to parse DAG files and serialize them into the metadata database. This isolation prevents user code from affecting the scheduler. The processor converts Python DAG definitions into serialized representations that the scheduler uses for decision-making.
Data Models
The core data models represent the workflow state:
- DAG (DagModel): Represents a workflow definition with scheduling metadata, concurrency limits, and next run information
- DagRun: An instance of a DAG execution with a specific logical date and state
- TaskInstance: Represents a single task execution within a DAG run, storing state, retry information, and execution metadata
- Task: The operator definition within a DAG (not persisted; loaded from serialized DAG)
Scheduling Loop
The scheduler's main loop executes these steps repeatedly:
# Simplified scheduler loop structure
1. Harvest DAG parsing results
2. Create DAG runs for scheduled DAGs
3. Find executable tasks (dependencies met, resources available)
4. Queue tasks to executors (with row-level locking for concurrency)
5. Heartbeat executors (trigger execution, sync task states)
6. Process task completion events
7. Handle expired deadlines
The critical section uses database row locks on the Pool model to ensure thread-safe task queuing across multiple scheduler instances.
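The underlying idiom is SELECT ... FOR UPDATE SKIP LOCKED; the sketch below shows its shape in SQLAlchemy terms as an illustration of the locking pattern rather than the actual scheduler code (the function and parameter names are hypothetical):
from sqlalchemy import select
def lock_pools_for_scheduling(session, pool_model):
    # Lock the pool rows this scheduler is inspecting; rows already locked
    # by another scheduler are skipped instead of blocking on them.
    stmt = select(pool_model).with_for_update(skip_locked=True)
    return session.scalars(stmt).all()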
Data Flow
The DAG processor parses DAG files and writes serialized DAGs to the metadata database. The scheduler reads those serialized DAGs, creates DAG runs, and queues ready task instances to the executor, which dispatches them to workers. Workers record task state in the database, and the scheduler picks up the completion events to schedule downstream work.
Key Architectural Principles
Isolation: User code (DAGs, operators) runs in separate processes to prevent scheduler crashes. The DAG processor and task runtime are isolated from the scheduler.
Concurrency Control: Multiple schedulers can run simultaneously using database-level row locks. Pool limits and concurrency constraints prevent resource exhaustion.
State Authority: The metadata database is the single source of truth for task and DAG state. All state transitions are persisted before execution.
Asynchronous Execution: The scheduler queues tasks and continues; executors handle actual execution asynchronously. The scheduler polls for completion events.
DAG & Task Execution
Relevant Files
- task-sdk/src/airflow/sdk/definitions/dag.py
- airflow-core/src/airflow/models/dag.py
- airflow-core/src/airflow/models/taskinstance.py
- airflow-core/src/airflow/executors/base_executor.py
- airflow-core/src/airflow/jobs/scheduler_job_runner.py
DAG Fundamentals
A DAG (Directed Acyclic Graph) is a collection of tasks with directional dependencies that represents a workflow. Each DAG has a schedule, start date, and optional end date. The DAG itself doesn't execute logic—it defines how tasks should run: their order, retry policies, timeouts, and other operational details.
DAGs are instantiated into DAG Runs each time they execute according to their schedule. For example, a daily DAG creates one run per day.
Declaring DAGs
There are three ways to declare a DAG:
1. Context Manager (with statement):
from airflow.sdk import DAG
from airflow.providers.standard.operators.empty import EmptyOperator
import datetime
with DAG(
    dag_id="my_dag",
    start_date=datetime.datetime(2021, 1, 1),
    schedule="@daily",
):
    EmptyOperator(task_id="task1")
2. Constructor:
my_dag = DAG(
    dag_id="my_dag",
    start_date=datetime.datetime(2021, 1, 1),
    schedule="@daily",
)
EmptyOperator(task_id="task1", dag=my_dag)
3. Decorator:
from airflow.sdk import dag
@dag(start_date=datetime.datetime(2021, 1, 1), schedule="@daily")
def generate_dag():
    EmptyOperator(task_id="task1")
generate_dag()
Tasks and Task Instances
A Task is the basic unit of execution in a DAG. Tasks are arranged with upstream and downstream dependencies to express execution order. There are three kinds of tasks:
- Operators: Predefined task templates (e.g., BashOperator, PythonOperator)
- Sensors: Special operators that wait for external events
- TaskFlow tasks: Custom Python functions decorated with @task
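A rough sketch showing the three kinds side by side (assuming the standard provider's import paths; the command and delay are illustrative):
import datetime
from airflow.sdk import dag, task
from airflow.providers.standard.operators.bash import BashOperator
from airflow.providers.standard.sensors.time_delta import TimeDeltaSensor
@dag(schedule=None, start_date=datetime.datetime(2024, 1, 1))
def three_task_kinds():
    # Operator: a predefined task template
    run_script = BashOperator(task_id="run_script", bash_command="echo hello")
    # Sensor: waits for a condition (here, simply a time delay)
    wait = TimeDeltaSensor(task_id="wait", delta=datetime.timedelta(minutes=5))
    # TaskFlow task: a plain Python function decorated with @task
    @task
    def summarize():
        print("done")
    run_script >> wait >> summarize()
three_task_kinds()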
Much like DAGs become DAG Runs, tasks become Task Instances—specific executions of a task for a given DAG run. Task instances have state representing their lifecycle stage.
Task Instance States
Task instances flow through these states:
- none: Dependencies not yet met
- scheduled: Scheduler determined it should run
- queued: Assigned to executor, awaiting worker
- running: Currently executing
- success: Completed without errors
- failed: Encountered an error
- skipped: Bypassed due to branching logic
- upstream_failed: Upstream task failed
- up_for_retry: Failed but has retry attempts remaining
- deferred: Waiting for a trigger event
Task Dependencies
Define dependencies using bitshift operators or explicit methods:
task1 >> task2 >> [task3, task4] # Bitshift operators
task1.set_downstream(task2) # Explicit method
By default, a task runs when all upstream tasks succeed. Use trigger rules to modify this behavior.
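For example, a cleanup task can be configured to run once its upstream tasks have finished regardless of outcome (a hedged sketch that reuses task3 and task4 from the snippet above; the command is illustrative):
from airflow.providers.standard.operators.bash import BashOperator
# trigger_rule="all_done" fires once all upstream tasks are finished,
# whether they succeeded, failed, or were skipped
cleanup = BashOperator(
    task_id="cleanup",
    bash_command="echo cleaning up",
    trigger_rule="all_done",
)
[task3, task4] >> cleanup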
Executors and Task Execution
Executors are the mechanism by which task instances get executed. They're pluggable—you can swap executors based on your deployment needs. The scheduler queues workloads to the executor, which manages parallelism, resource allocation, and task lifecycle.
Executor Types
Local Executors run tasks within the scheduler process:
- LocalExecutor: Runs tasks locally using multiprocessing (default)
- Pros: Easy setup, low latency
- Cons: Limited scalability, shares resources with scheduler
Remote Executors distribute tasks to external workers:
- Queued/Batch: Tasks sent to a central queue (e.g., CeleryExecutor, BatchExecutor)
- Containerized: Each task runs in an isolated container (e.g., KubernetesExecutor, EcsExecutor)
Execution Flow
- Scheduler parses DAGs and creates DAG runs based on schedules
- Scheduler identifies executable task instances (dependencies met, state valid)
- Scheduler creates workloads and queues them to the executor
- Executor manages task assignment to workers/processes
- Worker executes the task instance
- Executor reports completion state back to scheduler
- Scheduler updates task instance state and processes downstream tasks
Configuration
Set the executor in your Airflow configuration:
[core]
executor = LocalExecutor
For multiple executors (Airflow 2.10+):
[core]
executor = LocalExecutor,CeleryExecutor
The first executor is the default; others are available when explicitly specified on tasks or DAGs.
Task SDK & Execution Runtime
Relevant Files
- task-sdk/src/airflow/sdk/__init__.py
- task-sdk/src/airflow/sdk/execution_time/supervisor.py
- task-sdk/src/airflow/sdk/execution_time/comms.py
- task-sdk/src/airflow/sdk/execution_time/task_runner.py
- task-sdk/src/airflow/sdk/api/client.py
The Task SDK provides a stable, forward-compatible interface for defining DAGs and executing tasks in isolated subprocesses. It decouples task authoring from Airflow internals, enabling remote execution and language-agnostic task support.
Core Architecture
The Task SDK introduces a service-oriented architecture with three key components:
1. DAG Authoring Interface (airflow.sdk namespace)
- Stable public API for defining DAGs, tasks, and operators
- Replaces internal imports like airflow.models.dag.DAG and airflow.decorators.task
- Includes decorators (@dag, @task, @setup, @teardown), classes (DAG, TaskGroup, BaseOperator), and utilities (Context, Variable, Connection); a short usage sketch follows this list
2. Execution Runtime (Supervisor & Task Runner)
- Supervisor: Parent process that manages task execution, proxies API calls, and monitors subprocess health
- Task Runner: Isolated subprocess where user task code executes
- Communication via binary length-prefixed msgpack frames over stdin/stdout
3. Execution API Client (airflow.sdk.api.client)
- HTTP client for communicating with the Execution API server
- Handles task state transitions, heartbeats, XCom operations, and resource fetching
- Implements retry logic with exponential backoff
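As a small usage sketch of the authoring surface from (1): a task can read a Variable through the SDK, and at run time the lookup is proxied by the Supervisor and Execution API client from (2) and (3) rather than read from the metadata database directly (the variable key and task body are illustrative):
from airflow.sdk import task, Variable
@task
def report_target():
    # Resolved at runtime through the Supervisor, not via a direct DB query
    target = Variable.get("reporting_bucket")
    print(f"writing report to {target}")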
Task Execution Flow
The Supervisor starts the Task Runner in an isolated subprocess, marks the task as running via the Execution API, proxies the task's runtime requests for variables, connections, and XComs, forwards heartbeats, and reports the final state back to the API when the task finishes.
Communication Protocol
The Supervisor and Task Runner communicate via a binary protocol:
- Request Frame: 4-byte big-endian length prefix + msgpack-encoded _RequestFrame(id, body)
- Response Frame: 4-byte length prefix + msgpack-encoded _ResponseFrame(id, body, error)
- Log Messages: Dedicated socket with line-based JSON encoding
- No unsolicited messages: Task process only receives responses to its requests
This design reduces API server load (single connection per task) and prevents user code from accessing JWT tokens.
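A rough sketch of writing and reading one such frame (the dict layout mirrors the frame description above, but the helper names and exact field handling are illustrative, not the SDK's actual implementation):
import struct
import msgpack
def write_frame(stream, frame_id, body):
    # Encode the request with msgpack, then prefix it with a 4-byte
    # big-endian length so the reader knows how many bytes to consume
    payload = msgpack.packb({"id": frame_id, "body": body})
    stream.write(struct.pack(">I", len(payload)) + payload)
    stream.flush()
def read_frame(stream):
    # Read the 4-byte length prefix, then exactly that many payload bytes
    (length,) = struct.unpack(">I", stream.read(4))
    return msgpack.unpackb(stream.read(length))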
Key Runtime Operations
Task Startup
- Supervisor calls POST /run to mark the task as running
- API returns TIRunContext with DAG run info, variables, connections, and retry metadata
Runtime Requests
- Task code requests variables, connections, XComs, or asset information
- Supervisor intercepts requests, calls Execution API, and relays responses
Heartbeats & Token Renewal
- Task Runner periodically sends heartbeats through Supervisor
- API server returns refreshed JWT tokens in the Refreshed-API-Token header
State Transitions
- On completion/failure/deferral, Task Runner sends final state to Supervisor
- Supervisor calls PATCH /state with terminal status and metadata
Testing & In-Process Execution
The InProcessTestSupervisor class enables testing DAGs without spawning subprocesses:
from airflow.sdk.execution_time.supervisor import InProcessTestSupervisor
result = InProcessTestSupervisor.start(
    what=task_instance,
    task=my_task,
    logger=log
)
This is useful for dag.test() workflows where the DAG is already parsed in memory.
Providers & Extensibility
Relevant Files
- airflow-core/src/airflow/providers_manager.py
- PROVIDERS.rst
- providers-summary-docs/index.rst
- contributing-docs/12_provider_distributions.rst
- providers/standard/provider.yaml
- airflow-core/src/airflow/provider_info.schema.json
Airflow is built on a modular architecture where the core provides scheduling and orchestration, while providers extend capabilities through integrations with external systems. Providers are independently versioned packages that can be installed, upgraded, or downgraded without affecting the core.
Provider Architecture
The provider system uses a discovery and registration pattern. When Airflow starts, the ProviderManager scans installed packages for the apache_airflow_provider entry point, loads provider metadata from provider.yaml files, and registers all available extensions. This lazy-loading approach means components are only imported when actually needed.
Core Extension Points
Providers can extend Airflow through multiple mechanisms:
Connections & Hooks - Define custom connection types with UI customizations. Each connection type maps to a Hook class that handles authentication and interaction with external systems (see the sketch after this list).
Operators & Sensors - Task types for orchestrating external services. Operators perform actions; sensors wait for conditions.
Executors - Custom task execution strategies (e.g., Kubernetes, Celery, cloud-native executors).
Auth Managers - Handle user authentication and authorization for UI and API access.
Logging Handlers - Remote task logging to S3, Cloudwatch, HDFS, or other storage systems.
Secret Backends - Read connections and variables from external secret managers instead of the database.
Notifications - Send alerts via Slack, email, SNS, or custom channels when task/DAG states change.
Plugins - General-purpose extensions for custom UI components or functionality.
Task Decorators - Simplified task definition syntax (e.g., @task.python, @task.bash).
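As an illustration of the Connections & Hooks extension point, a provider hook is typically a thin wrapper that turns an Airflow connection into a client for the external system. The sketch below is hypothetical (the service, class, and connection type are made up, and the BaseHook import path may differ between Airflow versions):
from airflow.hooks.base import BaseHook
class WidgetServiceHook(BaseHook):
    """Hypothetical hook wrapping a fictional widget service API."""
    conn_name_attr = "widget_conn_id"
    default_conn_name = "widget_default"
    conn_type = "widget"
    hook_name = "Widget Service"
    def __init__(self, widget_conn_id: str = default_conn_name):
        super().__init__()
        self.widget_conn_id = widget_conn_id
    def get_conn(self):
        # Resolve the Airflow connection and build a client object from it
        conn = self.get_connection(self.widget_conn_id)
        return {"host": conn.host, "token": conn.password}  # placeholder client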
Provider Metadata Structure
Each provider includes a provider.yaml file declaring its capabilities:
package-name: apache-airflow-providers-amazon
connection-types:
  - hook-class-name: airflow.providers.amazon.aws.hooks.s3.S3Hook
    connection-type: aws_s3
operators:
  - integration-name: Amazon S3
    python-modules:
      - airflow.providers.amazon.aws.operators.s3
notifications:
  - airflow.providers.amazon.aws.notifications.sns.SnsNotifier
logging:
  - airflow.providers.amazon.aws.log.s3_task_handler.S3TaskHandler
Community vs. Third-Party Providers
Community providers are maintained by the Airflow project, released with constraints ensuring compatibility, and included in convenience Docker images. They follow Apache governance and release processes.
Third-party providers are independently maintained and released. They have the same capabilities as community providers but are not subject to Airflow's release cycle or constraint management.
Creating Custom Providers
Custom providers follow the same structure as community providers. A minimal provider requires:
- pyproject.toml - Package metadata and dependencies
- provider.yaml - Extension declarations
- src/airflow/providers/YOUR_PROVIDER/ - Implementation code
- Entry point in pyproject.toml pointing to a get_provider_info() function
This enables organizations to build proprietary integrations with the same extensibility as official providers.
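A hedged sketch of the get_provider_info() function referenced by the apache_airflow_provider entry point (the package, module paths, and metadata values are hypothetical; the authoritative key set is defined by provider_info.schema.json):
# src/airflow/providers/your_provider/get_provider_info.py (hypothetical)
def get_provider_info():
    # The returned mapping is what the ProviderManager reads at discovery time;
    # its keys mirror the provider.yaml structure.
    return {
        "package-name": "acme-airflow-provider-widget",
        "name": "Widget",
        "description": "Operators and hooks for a fictional widget service.",
        "versions": ["1.0.0"],
        "connection-types": [
            {
                "hook-class-name": "airflow.providers.widget.hooks.widget.WidgetServiceHook",
                "connection-type": "widget",
            }
        ],
    }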
REST API & Web Interface
Relevant Files
- airflow-core/src/airflow/api_fastapi/app.py
- airflow-core/src/airflow/api_fastapi/core_api/app.py
- airflow-core/src/airflow/api_fastapi/execution_api/app.py
- airflow-core/src/airflow/ui/src/main.tsx
- airflow-core/docs/administration-and-deployment/web-stack.rst
Airflow 3 uses a modern FastAPI-based REST API with a React frontend, replacing the legacy Flask-based system. The architecture separates concerns into two independent API servers that can be deployed together or separately.
API Architecture
The REST API is built on FastAPI and organized into two main components:
Core API (/api/v2) provides stable, public endpoints for DAG management, task execution, monitoring, and configuration. These endpoints are backward compatible and safe for external consumption.
Execution API (/execution) is a private, versioned API designed for task execution and internal communication. It uses JWT authentication and supports API versioning through Cadwyn, allowing breaking changes while maintaining backward compatibility.
Both APIs are mounted under a single FastAPI application in app.py, which can be selectively enabled via the --apps flag:
airflow api-server --apps core,execution # Both (default)
airflow api-server --apps core # Core API only
airflow api-server --apps execution # Execution API only
API Routing & Organization
Routes are organized hierarchically using routers:
- Public routes (/api/v2) include DAGs, task instances, connections, assets, and monitoring endpoints
- UI routes (/ui) are internal endpoints for frontend consumption, subject to breaking changes
- Execution routes (/execution) handle task execution, XComs, variables, and asset events with JWT authentication
Each route module is self-contained with its own dependencies, request/response models, and security checks. The AirflowRouter wrapper provides common functionality like access control and error handling.
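For example, a client could list DAGs through the public API roughly like this (the base URL, port, and bearer-token authentication are assumptions about a particular deployment, not something fixed by Airflow):
import requests
BASE_URL = "http://localhost:8080"  # hypothetical api-server address
TOKEN = "<token issued by your auth manager>"
resp = requests.get(
    f"{BASE_URL}/api/v2/dags",
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"limit": 10},
)
resp.raise_for_status()
for dag in resp.json()["dags"]:
    print(dag["dag_id"], dag["is_paused"])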
Web UI Stack
The frontend is a React + TypeScript single-page application built with Vite and deployed as static assets. Key technologies:
- React Router for client-side navigation
- TanStack React Query for server state management and caching
- Chakra UI for component library and theming
- OpenAPI client generation for type-safe API calls
The UI is mounted at the root path and serves as a catch-all for all non-API routes, enabling client-side routing. Static assets are served from /static, and the frontend communicates with the API via dynamically configured base URLs.
Plugin System
The UI supports dynamic React plugin loading through a plugin system. Plugins are loaded as separate bundles and share the host application's React instance via globalThis. This allows third-party extensions without modifying core code.
Configuration & Deployment
CORS, authentication, and middleware are configured through Airflow's configuration system. The API root path can be customized via api.base_url, allowing deployment behind URL prefixes without frontend rebuilds. Middleware for JWT refresh, authentication, and error handling is applied globally.
Kubernetes & Helm Deployment
Relevant Files
- chart/Chart.yaml
- chart/values.yaml
- chart/README.md
- chart/templates/scheduler/scheduler-deployment.yaml
- chart/docs/index.rst
The Apache Airflow Helm chart provides a production-ready deployment mechanism for Kubernetes. It abstracts the complexity of deploying Airflow's distributed components (scheduler, workers, webserver, API server) into a single, configurable package.
Chart Overview
The chart (version 1.19.0-dev, supporting Airflow 3.1.5) deploys Airflow with support for multiple executors: LocalExecutor, CeleryExecutor, KubernetesExecutor, and hybrid variants. It includes PostgreSQL as a dependency and provides optional Redis for Celery message brokering.
Key features:
- Multi-executor support with automatic pod launching capabilities
- Automatic database migrations and admin user creation via Helm hooks
- Built-in monitoring with StatsD/Prometheus and Flower UI
- Security enhancements including Service Account Token Volume configuration
- KEDA-based autoscaling for Celery workers
- Kerberos authentication support
Core Components
Scheduler: Deployed as Deployment or StatefulSet (when using LocalExecutor with persistence). Manages DAG parsing and task scheduling. Supports multiple replicas for high availability with MySQL 8+ or PostgreSQL.
Workers: Celery workers deployed as StatefulSet with persistent volumes. Configurable replicas with KEDA autoscaling based on queued tasks or HPA for CPU metrics.
API Server: Airflow 3.0+ component providing REST API. Supports horizontal pod autoscaling with configurable metrics.
Webserver: Legacy UI component (Airflow <3.0). Replaced by API Server in Airflow 3.0+.
Triggerer: Manages async task triggers. Deployed as Deployment with configurable replicas.
Configuration via values.yaml
The chart uses a comprehensive values.yaml with sections for each component:
executor: "CeleryExecutor"
scheduler:
  replicas: 1
  resources:
    limits:
      cpu: 100m
      memory: 128Mi
workers:
  replicas: 1
  persistence:
    enabled: true
    size: 100Gi
apiServer:
  replicas: 1
  hpa:
    enabled: false
    minReplicaCount: 1
    maxReplicaCount: 5
Key configuration areas include resource limits, security contexts, ingress rules, environment variables, and custom volumes. The chart supports templating for dynamic values.
Deployment Patterns
LocalExecutor: Scheduler acts as worker. Uses StatefulSet with persistence when enabled. Suitable for single-node or small deployments.
CeleryExecutor: Distributed task execution across multiple worker pods. Requires Redis broker and PostgreSQL/MySQL backend. Scales horizontally via KEDA or HPA.
KubernetesExecutor: Each task runs in its own pod. No separate workers needed. Ideal for dynamic workloads with varying resource requirements.
Database & Secrets
The chart manages Airflow metadata database connections, Fernet keys, and API secrets through Kubernetes Secrets. PostgreSQL can be deployed as a chart dependency or configured externally. PgBouncer provides connection pooling for high-concurrency scenarios.
Ingress & Networking
Ingress resources can be configured for API Server, Webserver, Flower, and StatsD endpoints. Network policies are optional. Service discovery uses standard Kubernetes DNS within the cluster.
Monitoring & Observability
StatsD exporter collects Airflow metrics for Prometheus scraping. Flower provides Celery worker monitoring. Pod disruption budgets ensure availability during cluster maintenance.
Testing Infrastructure
Relevant Files
- contributing-docs/09_testing.rst
- contributing-docs/testing/unit_tests.rst
- contributing-docs/testing/integration_tests.rst
- airflow-core/tests/conftest.py
- devel-common/src/tests_common/pytest_plugin.py
- pyproject.toml (pytest configuration)
Airflow uses a comprehensive, multi-layered testing infrastructure built on pytest to ensure reliability across different deployment scenarios. All tests use pytest as the standard framework, with custom plugins and fixtures providing specialized functionality.
Test Categories
The testing framework includes several distinct test types:
- Unit Tests - Python tests without external integrations, runnable in local virtualenv or Breeze. Required for all PRs unless documentation-only.
- Integration Tests - Tests requiring external services (Postgres, MySQL, Kerberos, Celery, etc.), run only in Breeze with the --integration flag.
- System Tests - End-to-end DAG execution tests using external systems like Google Cloud and AWS.
- Docker Compose Tests - Verify quick-start Docker Compose setup.
- Kubernetes Tests - Validate Kubernetes deployment and Pod Operator functionality.
- Helm Unit Tests - Verify Helm Chart rendering for various configurations.
- Task SDK Integration Tests - Specialized tests for Task SDK integration with running Airflow.
- Airflow Ctl Tests - Verify airflowctl command-line tool functionality.
Unit Test Architecture
Unit tests are split into two categories:
DB Tests - Access the database, run sequentially, and execute more slowly. Marked with @pytest.mark.backend("postgres", "mysql") when they require specific backends.
Non-DB Tests - Run with the none backend (any database access fails), execute in parallel using pytest-xdist, and are much faster. Selected with the --skip-db-tests flag.
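A hypothetical sketch of how the two categories look in test code, assuming the dag_maker fixture provided by the shared pytest plugin (test names and assertions are illustrative):
import pytest
from airflow.providers.standard.operators.empty import EmptyOperator
@pytest.mark.backend("postgres", "mysql")
def test_dag_run_is_persisted(dag_maker):
    # DB test: relies on the metadata database behind the dag_maker fixture
    with dag_maker(dag_id="example_db_test"):
        EmptyOperator(task_id="noop")
    dag_run = dag_maker.create_dagrun()
    assert dag_run.dag_id == "example_db_test"
def test_pure_logic():
    # Non-DB test: no database access, runs under --skip-db-tests
    assert {"a": 1} | {"b": 2} == {"a": 1, "b": 2}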
Pytest Configuration & Plugins
The testing infrastructure uses a custom pytest plugin (tests_common.pytest_plugin) that:
- Configures Airflow in unit test mode before any imports
- Manages environment variables and integrations
- Provides fixtures for DAGs, operators, and task instances
- Handles database setup and teardown
- Enforces test file naming conventions (test_*.py for unit tests, example_*.py or test_*.py for system tests)
- Captures and validates warnings (prohibits AirflowProviderDeprecationWarning by default)
Key pytest options in pyproject.toml:
[tool.pytest.ini_options]
addopts = [
    "--tb=short",
    "-rasl",
    "--verbosity=2",
    "-p", "no:flaky",
    "-p", "no:nose",
    "-p", "no:legacypath",
    "--disable-warnings",
    "--asyncio-mode=strict",
]
Running Tests
Local virtualenv:
pytest airflow-core/tests/unit/core/test_core.py
Breeze (with integrations):
breeze testing core-tests --run-in-parallel
breeze testing core-tests --skip-db-tests --use-xdist
breeze testing providers-tests --run-in-parallel
CI pipeline uses scripts/ci/testing/run_unit_tests.sh with test scopes: DB, Non-DB, All, Quarantined, System.
Best Practices
- Use standard Python assert statements and pytest decorators, not unittest classes
- Mock all external communications in unit tests
- Use pytest.mark.parametrize for parameter variations
- Mock time.sleep() and asyncio.sleep() to speed up tests
- Use pytest.warns() to capture expected warnings
- Avoid deprecated methods; test legacy features only with explicit warning capture
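Putting a few of these practices together in one hypothetical test (the function under test is made up for illustration):
import warnings
import pytest
def emit(flag):
    # Stand-in for code under test that warns on a legacy code path
    if flag:
        warnings.warn("legacy path", DeprecationWarning)
        return "legacy"
    return "modern"
@pytest.mark.parametrize("flag, expected", [(False, "modern"), (True, "legacy")])
def test_emit(flag, expected):
    if flag:
        # Expected warnings are captured explicitly rather than silenced
        with pytest.warns(DeprecationWarning, match="legacy path"):
            assert emit(flag) == expected
    else:
        assert emit(flag) == expected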
Development & Contributing
Relevant Files
- contributing-docs/README.rst
- contributing-docs/03a_contributors_quick_start_beginners.rst
- contributing-docs/07_local_virtualenv.rst
- contributing-docs/08_static_code_checks.rst
- contributing-docs/09_testing.rst
- contributing-docs/11_documentation_building.rst
- CONTRIBUTING.rst
Quick Start for New Contributors
Apache Airflow welcomes contributions from developers of all experience levels. The project provides two main paths for getting started:
Breeze (Local Development) – Run Airflow in Docker containers on your machine. Requires Docker/Podman, uv package manager, and 4GB RAM with 40GB disk space.
GitHub Codespaces – One-click cloud-based development environment with VS Code web IDE. No local setup required.
Both paths guide you through making your first pull request in approximately 15 minutes.
Development Environment Setup
Using uv for Virtual Environment Management
As of November 2024, Airflow recommends uv for managing Python virtual environments. It is a fast, modern package manager that handles Python versions, dependencies, and development tools.
# Install uv
curl -LsSf https://astral.sh/uv/install.sh | sh
# Create and sync virtual environment
uv venv
uv sync
Breeze Development Container
Breeze replicates the CI environment locally and includes all necessary services (databases, message brokers, etc.) for integration testing.
# Install Breeze
uv tool install -e ./dev/breeze
# Start development environment
breeze start-airflow
Code Quality & Static Checks
Prek Hooks
Airflow uses prek (a Rust-based replacement for pre-commit) to run code quality checks before commits. Hooks run only on staged files, making them fast and non-intrusive.
# Install prek
uv tool install prek
prek install -f
prek install -f --hook-type pre-push # for mypy checks
Checks include formatting, linting, type checking, and bug detection. They use the same environment as CI, ensuring local validation matches CI results.
Testing Framework
Airflow features a comprehensive testing infrastructure:
- Unit tests – Python tests without external dependencies; required for most PRs
- Integration tests – Tests requiring services (Postgres, MySQL, Kerberos) in Breeze
- System tests – End-to-end tests using external systems (Google Cloud, AWS)
- Docker Compose tests – Validation of quick-start Docker setup
- Kubernetes & Helm tests – Deployment and chart rendering verification
Run tests locally with pytest or via Breeze for full integration testing.
Documentation Building
Documentation is built using Sphinx and organized by distribution:
# Build docs locally (requires Python 3.11)
uv python pin 3.11
uv run --group docs build-docs
Key documentation locations:
- airflow-core/docs – Core Airflow documentation
- providers/*/docs – Provider-specific documentation
- chart/docs – Helm Chart documentation
- task-sdk/docs – Task SDK documentation
Pull Request Workflow
- Fork the repository and clone your fork
- Create a branch for your changes
- Make changes and run prek --all-files to validate
- Commit & push to your fork
- Open a PR – GitHub shows a "Compare & pull request" button
- Respond to reviews and push updates as needed
- Merge – A committer merges once CI passes and reviews are approved
Keep your branch rebased with git fetch upstream && git rebase upstream/main && git push --force-with-lease.
Key Resources
- New Contributors – Start with the 15-minute quick start guide
- Seasoned Developers – Full development environment guide with advanced tooling
- Contribution Workflow – Overview of how to contribute to Airflow
- Git Workflow – Branching strategy, syncing forks, and rebasing PRs
- Provider Development – Guide for contributing to Airflow providers