postgres/postgres

PostgreSQL Database System

Overview

Relevant Files
  • README.md
  • src/DEVELOPERS
  • src/backend/main/main.c
  • src/backend/postmaster/postmaster.c
  • src/backend/tcop/postgres.c

PostgreSQL is an advanced object-relational database management system supporting transactions, foreign keys, subqueries, triggers, user-defined types, and functions. This repository contains the complete source code distribution, organized into a modular backend architecture with multiple specialized subsystems.

Architecture Overview

The PostgreSQL backend is organized into distinct functional layers:

Core Subsystems

Postmaster & Process Management (src/backend/postmaster/)

  • Central coordinator that manages all server processes
  • Handles client connections, forking backend processes
  • Manages auxiliary processes: autovacuum, background workers, WAL writer, checkpointer

Query Processing Pipeline (src/backend/parser/, src/backend/optimizer/, src/backend/executor/)

  • Parser: Converts SQL text into parse trees using Yacc/Lex
  • Analyzer: Performs semantic analysis and builds query trees
  • Planner: Generates optimal execution plans using cost-based optimization
  • Executor: Executes plans via node-based evaluation engine

Storage & Access Methods (src/backend/storage/, src/backend/access/)

  • Buffer management for efficient disk I/O
  • Multiple index types: B-tree, Hash, GiST, GIN, BRIN, SP-GiST
  • Heap table access and sequence management
  • Transaction management and MVCC implementation

Catalog & Metadata (src/backend/catalog/)

  • System catalogs storing schema definitions
  • Namespace and object management
  • Dependency tracking and access control

Replication & WAL (src/backend/replication/, src/backend/access/transam/)

  • Write-Ahead Logging for crash recovery
  • Logical and physical replication support
  • Slot management for replication streams

Entry Points

The main entry point (src/backend/main/main.c) dispatches to different modes:

  • Postmaster: Normal server operation (PostmasterMain)
  • Single-user: Direct database access (PostgresSingleUserMain)
  • Bootstrap: Database initialization (BootstrapModeMain)
  • Check mode: Configuration validation

Build System

PostgreSQL uses both traditional Makefiles and the Meson build system. Key build files:

  • GNUmakefile.in / Makefile: Traditional build configuration
  • meson.build: Modern build system files throughout the tree
  • configure.ac: Autoconf configuration

Development Resources

For detailed development information, see the PostgreSQL wiki at http://wiki.postgresql.org/wiki/Development_information. Developer tools are located in src/tools/.

Architecture & Query Processing Pipeline

Relevant Files
  • src/backend/parser/README
  • src/backend/optimizer/README
  • src/backend/executor/README
  • src/backend/tcop/postgres.c

PostgreSQL processes queries through a well-defined pipeline: Parse → Analyze → Optimize → Execute. Each stage transforms the query representation, ultimately producing executable plan trees that the executor runs.

Query Processing Pipeline

Parser Stage

The parser (src/backend/parser/) tokenizes SQL and builds a raw parse tree. Key components:

  • scan.l - Lexer that breaks queries into tokens
  • gram.y - Grammar rules that produce parse trees
  • analyze.c - Transforms raw parse trees into Query structures with semantic information
  • parse_*.c - Specialized handlers for clauses (WHERE, GROUP BY), expressions, functions, operators, and type coercion

The parser also handles complex features like Common Table Expressions (CTEs), aggregates, and MERGE statements. Output is a Query node ready for optimization.

Optimizer Stage

The optimizer (src/backend/optimizer/) generates efficient execution plans by exploring alternative join orders and access methods. It uses a dynamic programming algorithm to build join trees:

  1. Path Generation - For each base relation, create Paths representing different scan methods (sequential, index, bitmap)
  2. Join Enumeration - Recursively combine relations using nested-loop, merge, or hash joins
  3. Cost Estimation - Evaluate each Path's cost and select the cheapest
  4. Plan Conversion - Transform the optimal Path tree into a Plan tree

Key concepts:

  • RelOptInfo - Represents a relation or join; contains all Paths for that relation
  • Path - One way to generate a RelOptInfo (e.g., IndexPath, HashPath, NestPath)
  • EquivalenceClass - Groups expressions known to be equal (e.g., a.x = b.y = 42)
  • PathKey - Represents sort ordering to enable merge joins without explicit sorts

Executor Stage

The executor (src/backend/executor/) runs the plan tree as a demand-pull pipeline. Each plan node produces tuples when called:

  • Plan Tree - Read-only tree of Plan nodes from the optimizer
  • State Tree - Parallel tree of PlanState nodes holding runtime state
  • Expression Evaluation - ExprState nodes flatten expression trees into linear steps for efficient interpretation

Key execution features:

  • Tuple-at-a-time - Nodes pull tuples from children on demand
  • Memory Management - Per-query and per-tuple contexts for efficient cleanup
  • Rescan Support - Nodes can reset and regenerate output (used in nested loops)
  • ModifyTable - Handles INSERT, UPDATE, DELETE, and MERGE operations

Data Flow Example

For SELECT * FROM t1 JOIN t2 ON t1.id = t2.id WHERE t1.x > 5:

  1. Parser creates a Query with join and WHERE clause information
  2. Optimizer builds Paths for t1 (seq scan, index scans), t2 (seq scan, index scans), then join Paths (nested loop, hash, merge)
  3. Optimizer selects cheapest Path (e.g., index scan on t1 + hash join with t2)
  4. Executor runs the plan: scan t1 with filter, build hash table from t2, probe and return matching rows

Parser & Query Analysis

Relevant Files
  • src/backend/parser/gram.y
  • src/backend/parser/scan.l
  • src/backend/parser/analyze.c
  • src/backend/parser/parse_expr.c
  • src/backend/parser/parser.c

The parser transforms raw SQL text into structured query trees through three distinct phases: lexical analysis, grammatical parsing, and semantic analysis.

Lexical Analysis (scan.l)

The lexer tokenizes input using Flex. It recognizes keywords, identifiers, operators, and literals while maintaining position tracking for error reporting. Keywords are looked up against a static keyword table, and identifiers are normalized to lowercase. The scanner handles complex constructs like dollar-quoted strings and Unicode escapes without backtracking, optimizing performance.

Grammatical Parsing (gram.y)

Bison processes tokens according to grammar rules defined in gram.y, producing a raw parse tree. The grammar supports multiple parse modes (SQL statements, type names, PL/pgSQL expressions) via the parse_toplevel production. The parser uses location tracking to record token positions for error messages. Importantly, the grammar performs minimal semantic analysis—it focuses on structural validation only.

Semantic Analysis (analyze.c & parse_expr.c)

The raw parse tree undergoes transformation into a Query object through parse_analyze_fixedparams() and related functions. This phase:

  • Validates references: Resolves table and column names against the catalog
  • Type checking: Determines expression types and applies implicit coercions
  • Expression transformation: Converts raw expression nodes (A_Expr, FuncCall, etc.) into typed expression trees

Expression Transformation Pipeline

transformExpr() in parse_expr.c recursively processes expressions via transformExprRecurse(). This dispatcher handles:

  • Operators: Binary, unary, comparison, and special operators (IN, BETWEEN, DISTINCT)
  • Functions: Function calls with argument type coercion
  • Subqueries: SubLink nodes that embed SELECT statements
  • Aggregates: Aggregate functions with GROUP BY validation
  • Complex constructs: CASE expressions, array constructors, row expressions, JSON operations

Each expression type has a dedicated transformation function that validates context (e.g., aggregates only in SELECT lists) and applies type coercion rules.

Data Flow

Key Structures

  • ParseState: Maintains context during analysis (namespace, parameter types, expression kind)
  • RawStmt: Wraps raw parse tree nodes with location information
  • Query: Final analyzed query object with resolved types and validated semantics

Query Optimizer & Planning

Relevant Files
  • src/backend/optimizer/plan/planner.c - Main planner entry point
  • src/backend/optimizer/plan/planmain.c - Core join planning logic
  • src/backend/optimizer/util/pathnode.c - Path node creation and management
  • src/backend/optimizer/util/relnode.c - Relation node construction
  • src/backend/optimizer/geqo/geqo_main.c - Genetic query optimizer
  • src/backend/optimizer/path/allpaths.c - Access path generation

The query optimizer transforms parsed SQL into an efficient execution plan by exploring alternative join orders and access methods. It uses a cost-based approach to select the cheapest plan.

Core Architecture

The optimizer operates in two main phases:

  1. Path Generation - Explores all feasible ways to execute the query (different join orders, scan methods, join algorithms)
  2. Plan Selection - Chooses the cheapest path and converts it to an executable plan

Key data structures:

  • RelOptInfo - Represents a relation (base table or join result) with its possible access paths
  • Path - Represents one way to scan or join a relation, including cost estimates
  • PlannerInfo - Global state for a single query optimization

Join Tree Construction

The optimizer uses dynamic programming to build join trees:

Level 1: {A}, {B}, {C}           (base relations)
Level 2: {A,B}, {B,C}, {A,C}     (2-way joins)
Level 3: {A,B,C}                 (final join)

At each level, the planner considers all legal join combinations and generates paths for each using different join methods (nested loop, hash join, merge join).
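
The level-by-level construction can be sketched as dynamic programming over bitmask-encoded relation subsets. The cost formula and the 10% join selectivity below are illustrative stand-ins, not PostgreSQL's cost model:

```c
/* Dynamic programming over relation subsets: best[S] holds the cheapest
 * cost found for joining the set of relations encoded by bitmask S. */
#define NREL 3

double cheapest_join_cost(const double rows[NREL])
{
    int full = (1 << NREL) - 1;
    double best[1 << NREL];
    double outrows[1 << NREL];

    for (int S = 1; S <= full; S++)
        best[S] = -1.0;

    /* Level 1: base relations (scan cost treated as 0 in this toy model). */
    for (int i = 0; i < NREL; i++) {
        best[1 << i] = 0.0;
        outrows[1 << i] = rows[i];
    }

    /* Higher levels: split each subset S into two disjoint halves L and R. */
    for (int S = 1; S <= full; S++) {
        for (int L = (S - 1) & S; L > 0; L = (L - 1) & S) {
            int R = S ^ L;
            if (best[L] < 0 || best[R] < 0)
                continue;
            /* Toy cost: children's costs plus the row-product of the join. */
            double c = best[L] + best[R] + outrows[L] * outrows[R];
            if (best[S] < 0 || c < best[S]) {
                best[S] = c;
                outrows[S] = outrows[L] * outrows[R] * 0.1; /* fake selectivity */
            }
        }
    }
    return best[full];
}
```

With row counts {100, 10, 1000}, the sketch discovers that joining the two small relations first is cheapest, which is exactly the intuition the real planner's cost model formalizes.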

Path Selection Strategy

For each relation, the optimizer maintains a pathlist containing the cheapest paths for different sort orderings and parameterizations. When adding a new path, it compares against existing paths using:

  • Total cost - Cost to retrieve all rows
  • Startup cost - Cost to retrieve first row
  • Sort order - Useful orderings for upper query levels

Paths that are dominated on all criteria are discarded.
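
That domination test can be sketched as follows; the struct and the integer encoding of sort orderings are illustrative simplifications of the real Path comparison in pathnode.c:

```c
#include <stdbool.h>

/* Illustrative path descriptor: costs plus a token naming its sort order. */
typedef struct {
    double total_cost;
    double startup_cost;
    int    sort_order;   /* 0 = unordered; equal nonzero ids = same ordering */
} ToyPath;

/* cand can be discarded only if an existing path is at least as good on
 * every criterion: costs no higher, and providing any ordering cand has. */
bool path_is_dominated(const ToyPath *cand, const ToyPath *have)
{
    bool order_ok = (cand->sort_order == 0 ||
                     have->sort_order == cand->sort_order);
    return order_ok &&
           have->total_cost <= cand->total_cost &&
           have->startup_cost <= cand->startup_cost;
}
```

Note that a more expensive path survives if it offers a sort order a cheaper path lacks; an upper level might exploit that ordering to avoid an explicit sort.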

GEQO: Genetic Query Optimizer

For queries with many joins (>= geqo_threshold, default 12), the standard exhaustive search becomes prohibitively expensive. GEQO uses a genetic algorithm to explore the join space probabilistically:

  • Represents join orders as chromosomes
  • Evolves a population across generations using selection, crossover, and mutation
  • Evaluates fitness by computing path costs for candidate join orders
  • Returns the best solution found within the generation limit

GEQO trades optimality for speed on complex queries, making it practical for large join problems.

Cost Estimation

The planner estimates costs for all operations (scans, joins, sorts) using:

  • Table statistics (row counts, page counts, column distributions)
  • Index information
  • Selectivity estimates from WHERE clauses
  • Cardinality estimates for intermediate results

These estimates guide the choice between sequential scans, index scans, and different join algorithms.

Executor & Runtime Execution

Relevant Files
  • src/backend/executor/execMain.c
  • src/backend/executor/execExpr.c
  • src/backend/executor/nodeSeqscan.c
  • src/backend/executor/nodeHashjoin.c
  • src/backend/executor/README

The executor is PostgreSQL's runtime engine that transforms a plan tree into actual query results. It operates as a demand-pull pipeline where each plan node produces tuples on demand, creating a tree of operations that flow data upward from leaf nodes (scans) to the root.

Execution Architecture

The executor maintains two parallel tree structures:

  1. Plan Tree - Read-only tree of Plan nodes from the optimizer, representing the query strategy
  2. State Tree - Parallel tree of PlanState nodes holding runtime state, allowing plan reuse across queries

Each PlanState node has an ExecProcNode function pointer that returns the next tuple when called. This design separates immutable query plans from mutable execution state.

Query Execution Lifecycle

ExecutorStart initializes the executor state and recursively builds the state tree. ExecutorRun drives tuple production through the plan tree. ExecutorFinish handles post-processing (e.g., AFTER triggers). ExecutorEnd releases resources.

Expression Evaluation

Expressions are compiled into flat arrays of ExprEvalStep instructions during initialization via ExecInitExpr(). This linear representation avoids tree-walk overhead during evaluation:

  • Each step has an opcode (e.g., EEOP_VAR, EEOP_FUNCEXPR) and stores results in resvalue/resnull
  • Steps chain together: a + operator step receives inputs from prior VAR steps
  • The array ends with EEOP_DONE_RETURN or EEOP_DONE_NO_RETURN
  • ExecReadyExpr() selects the evaluation method (interpreted via ExecInterpExpr or JIT-compiled)
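
The flat-array dispatch idea can be sketched with a toy opcode interpreter. Note that real ExprEvalStep code writes into preallocated resvalue/resnull slots rather than a stack; the opcode names and stack machine here are illustrative only:

```c
/* Toy flattened expression program: a linear array of steps evaluated
 * in one loop, instead of recursively walking an expression tree. */
typedef enum { OP_CONST, OP_VAR, OP_ADD, OP_GT, OP_DONE } OpCode;

typedef struct {
    OpCode op;
    double d;      /* OP_CONST: literal value */
    int    attno;  /* OP_VAR: column number   */
} Step;

double exec_steps(const Step *steps, const double *row)
{
    double stack[8];
    int sp = 0;
    for (const Step *s = steps; ; s++) {
        switch (s->op) {
            case OP_CONST: stack[sp++] = s->d; break;
            case OP_VAR:   stack[sp++] = row[s->attno]; break;
            case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
            case OP_GT:    sp--; stack[sp - 1] = stack[sp - 1] > stack[sp]; break;
            case OP_DONE:  return stack[sp - 1];
        }
    }
}
```

A predicate like `a + b > 10` becomes the program `VAR 0, VAR 1, ADD, CONST 10, GT, DONE`, evaluated per tuple with no tree traversal, which is also the representation JIT compilation translates into machine code.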

Memory Management

Execution uses hierarchical memory contexts:

  • Per-query context - Created by CreateExecutorState(), holds plan/state trees
  • Per-tuple contexts - Reset after each tuple to prevent memory leaks
  • ExprContext nodes manage per-tuple memory for expression evaluation

Plan Node Types

Common node types include:

  • Scan nodes (SeqScan, IndexScan) - Fetch tuples from tables
  • Join nodes (HashJoin, NestLoop, MergeJoin) - Combine tuples from child nodes
  • Aggregate nodes (Agg, GroupAggregate) - Compute aggregates
  • Sort/Limit nodes - Order or restrict output
  • ModifyTable - Handles INSERT/UPDATE/DELETE/MERGE operations

Hash Join Example

Hash joins use a hybrid algorithm with multi-batch support. The build phase hashes inner relation tuples into buckets; the probe phase streams outer tuples and looks up matches. If the hash table exceeds work_mem, batches are spilled to disk and processed iteratively. Parallel hash joins coordinate multiple backends via barrier synchronization.

Rescan and Parameter Handling

Nodes support ExecReScan() to reset and regenerate output. When query parameters change, affected nodes rescan. Intelligent schemes avoid unnecessary rescans (e.g., Sort reuses cached sorted data if input parameters unchanged).

Asynchronous Execution

For I/O-bound operations (e.g., ForeignScan), nodes can indicate they cannot produce tuples immediately. Append nodes coordinate async-capable children via ExecAsyncRequest, ExecAsyncConfigureWait, and ExecAsyncNotify callbacks, enabling concurrent execution of multiple branches.

Storage Management & I/O

Relevant Files
  • src/backend/storage/buffer/bufmgr.c
  • src/backend/storage/smgr/smgr.c
  • src/backend/storage/page/bufpage.c
  • src/backend/storage/lmgr/lock.c
  • src/backend/storage/buffer/freelist.c
  • src/backend/storage/buffer/buf_init.c

PostgreSQL's storage layer manages physical data access through a multi-layered architecture combining buffer management, storage abstraction, page formatting, and concurrency control. This section covers how data flows from disk to memory and back.

Buffer Manager Architecture

The buffer manager (bufmgr.c) maintains a shared pool of in-memory buffers that cache disk pages. Key concepts:

  • Pinning: Processes must "pin" a buffer (increment its reference count) before accessing it. Unpinned buffers can be evicted at any time.
  • Buffer Tags: Each buffer is identified by a BufferTag containing the relation locator, fork number, and block number.
  • Shared vs. Local Buffers: Shared buffers are accessible to all backends; local buffers are per-backend and used for temporary relations.

The buffer pool is initialized in shared memory with NBuffers slots, each containing a BufferDesc structure (padded to 64 bytes to avoid false sharing) and a corresponding data page.

Clock Sweep Replacement Algorithm

When a new page is needed and no free buffers exist, the buffer manager uses a clock sweep algorithm to find a victim:

  1. A "clock hand" (nextVictimBuffer) circularly advances through all buffers
  2. Each buffer has a usage_count that increments when pinned
  3. The algorithm skips buffers with non-zero usage counts, decrementing their counts
  4. The first unpinned buffer with zero usage count becomes the victim

This approximates LRU replacement without the overhead of maintaining explicit LRU lists.
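
The sweep logic can be sketched as follows. Field names are illustrative simplifications of BufferDesc, and unlike the real code this sketch loops forever if every buffer is pinned (bufmgr.c errors out in that case):

```c
/* Clock-sweep sketch: advance the hand, decrement nonzero usage counts,
 * and pick the first unpinned buffer whose count has reached zero. */
typedef struct { int usage_count; int pin_count; } ToyBuf;

int clock_sweep(ToyBuf *bufs, int nbuffers, int *hand)
{
    for (;;) {
        int victim = *hand;
        *hand = (*hand + 1) % nbuffers;
        if (bufs[victim].pin_count > 0)
            continue;                    /* in use: cannot evict */
        if (bufs[victim].usage_count > 0)
            bufs[victim].usage_count--;  /* recently used: spare it this lap */
        else
            return victim;               /* cold and unpinned: evict */
    }
}
```

Because each pin bumps usage_count and each sweep pass decrements it, frequently used pages survive several laps of the hand while cold pages are reclaimed quickly.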

Storage Manager Interface

The storage manager (smgr.c) abstracts physical file operations through a function pointer table (f_smgr). Currently, only magnetic disk storage is implemented (md.c), but the interface supports alternative storage backends.

Key operations:

  • Read/Write: smgr_readv(), smgr_writev() for vectored I/O
  • Extend: smgr_extend() adds new blocks to relations
  • Prefetch: smgr_prefetch() initiates asynchronous reads
  • Sync: smgr_immedsync() forces immediate disk synchronization

SMgrRelation objects are cached in a hash table and pinned by the relation cache to prevent premature closure of file descriptors.

Page Structure and Management

Pages are fixed-size blocks (typically 8KB, BLCKSZ) with a standardized header (PageHeader). The bufpage.c module provides:

  • PageInit(): Initializes a new page with header and free space tracking
  • PageAddItem(): Inserts items (tuples) into pages, managing line pointers
  • Checksums: Optional page checksums computed at write time for corruption detection

Pages track free space using pd_lower (end of header/line pointers) and pd_upper (start of item data), allowing efficient space management.
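
The pd_lower/pd_upper arithmetic can be sketched directly; the struct below is a toy stand-in for PageHeaderData, and the 4-byte line pointer matches the size of ItemIdData:

```c
#include <stdint.h>

/* Toy page header: line pointers grow up from the header (pd_lower),
 * item data grows down from the end of the page (pd_upper). */
#define TOY_BLCKSZ 8192

typedef struct {
    uint16_t pd_lower;  /* end of the line pointer array */
    uint16_t pd_upper;  /* start of item data            */
} ToyPageHeader;

int page_free_space(const ToyPageHeader *p)
{
    return p->pd_upper - p->pd_lower;
}

/* Adding an item of len bytes also consumes one 4-byte line pointer. */
int page_has_room(const ToyPageHeader *p, int len)
{
    return page_free_space(p) >= len + 4;
}
```

On a fresh page, pd_lower sits just past the header and pd_upper at BLCKSZ, so the two pointers converge toward each other as items are added.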

Concurrency Control

Buffer access is protected by multiple lock levels:

  • BufMappingLock: Partitioned locks protect the buffer tag hash table
  • Buffer Header Spinlock: Protects buffer state modifications
  • Content Lock: Shared/exclusive locks for data access
  • Pin Count: Prevents buffer eviction while in use

Lock Manager (lock.c) handles relation-level locks and transaction locks, preventing conflicts at a higher level than buffer locks.

I/O and Writeback

Dirty buffers are written back through:

  • Checkpoint: Periodic flush of all dirty buffers
  • Background Writer: Continuous flushing to reduce checkpoint stalls
  • Writeback Context: Coalesces multiple flush requests to reduce system calls

The WritebackContext structure batches pending flushes, improving I/O efficiency by combining multiple page writes into single system calls.

Access Methods & Indexes

Relevant Files
  • src/backend/access/heap/heapam.c
  • src/backend/access/nbtree/nbtree.c
  • src/backend/access/hash/hash.c
  • src/backend/access/gin/gininsert.c
  • src/include/access/amapi.h
  • src/backend/access/index/indexam.c

PostgreSQL supports multiple index types through a pluggable access method (AM) interface. Each index type implements a standardized API defined by IndexAmRoutine, allowing the query executor to work uniformly with different index structures.

Index Access Method Interface

The IndexAmRoutine struct defines the contract between the core system and index implementations. Every index AM must provide:

  • Build operations: ambuild (create index), ambuildempty (create empty index)
  • Modification: aminsert (add entries), ambulkdelete (remove entries), amvacuumcleanup (post-vacuum cleanup)
  • Scanning: ambeginscan, amrescan, amgettuple, amgetbitmap, amendscan
  • Metadata: amcostestimate (query planner hints), amoptions (parse index options), amvalidate (verify opclass definitions)

Each AM also declares capabilities via boolean flags: whether it supports unique indexes, multi-column indexes, backward scans, parallel scans, and index-only scans.
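
The pluggable-AM idea boils down to a struct of function pointers that core code calls through. The sketch below mirrors the shape of IndexAmRoutine in heavily simplified, illustrative form (the real callbacks take scan descriptors, relations, and so on):

```c
#include <stdio.h>

/* Toy AM routine table: capability flags plus callback slots. */
typedef struct {
    const char *name;
    int  amcanunique;                 /* capability flag */
    void (*aminsert)(int key);        /* illustrative signatures */
    int  (*amgettuple)(void);
} ToyAmRoutine;

static void btree_insert(int key) { printf("btree insert %d\n", key); }
static int  btree_gettuple(void)  { return 42; }

/* A "B-tree" AM fills in the table; other AMs would supply their own. */
const ToyAmRoutine btree_routine = {
    "btree", /* amcanunique = */ 1, btree_insert, btree_gettuple
};

/* Core code stays AM-agnostic: it only ever sees the routine table. */
int index_getnext(const ToyAmRoutine *am) { return am->amgettuple(); }
```

This is why adding a new index type requires no executor changes: the executor calls through the table, and the handler function for each AM returns a filled-in routine struct.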

Built-in Index Types

B-tree (nbtree) is the default general-purpose index. It implements Lehman & Yao's high-concurrency algorithm with right-sibling pointers and high keys, allowing searches to proceed without blocking on concurrent page splits. B-trees support range queries, equality comparisons, and ordered retrieval, are clusterable, and handle unique constraints and multi-column indexes.

Hash indexes use dynamic hashing with bucket splitting. Tuples hash to buckets, which chain overflow pages as needed. Hash indexes are compact (storing only hash codes, not values) and support equality queries only. They cannot be used for unique constraints or multi-column indexes.

GIN (Generalized Inverted Index) is optimized for composite values like arrays and full-text search. It stores (key, posting list) pairs where posting lists are heap row references. GIN uses a B-tree over keys with posting trees for large posting lists. Fast-update pending lists accelerate bulk inserts. GIN excels at "contains" and "overlaps" queries on arrays.

GiST (not among the files listed above) is a generalized search tree supporting custom operators and spatial data types.

Index Scan Flow

Scans begin with index_beginscan, which calls the AM's ambeginscan to initialize an IndexScanDesc. The executor repeatedly calls index_getnext_tid to retrieve heap tuple IDs. For each ID, index_fetch_heap fetches the actual tuple from the heap using the table AM interface. This separation allows the index AM to release locks before heap access.

Heap Access Method

The heap is the default table storage format. heapam.c implements sequential scans, tuple insertion/deletion/update, and MVCC visibility checks. The heap AM integrates with indexes through index_fetch_begin/end callbacks, allowing indexes to prepare for tuple fetching. HOT (Heap-Only Tuple) optimization reduces index entries when updated columns aren't indexed.

Index Maintenance

During INSERT, UPDATE, or DELETE, the executor calls index_insert for each modified index. The index AM's aminsert callback adds or removes entries. Vacuum calls ambulkdelete with a callback to identify dead tuples, then amvacuumcleanup for final cleanup. GIN's pending list is merged into the main B-tree during vacuum to maintain search performance.

Catalog & Metadata Management

Relevant Files
  • src/backend/catalog/catalog.c
  • src/backend/catalog/heap.c
  • src/backend/catalog/index.c
  • src/backend/catalog/namespace.c

PostgreSQL stores all schema metadata in system catalogs—regular tables that describe tables, columns, types, indexes, and other database objects. The catalog system is the foundation for schema management and object lifecycle operations.

Core System Catalogs

The primary catalogs are:

  • pg_class - Describes all relations (tables, indexes, views, sequences, etc.). Each relation has an OID, name, namespace, owner, and storage metadata.
  • pg_attribute - Describes columns within relations. Each attribute references its parent relation and stores type, collation, and constraint information.
  • pg_type - Defines data types, including built-in types and composite types created from relations.
  • pg_namespace - Organizes objects into schemas. Provides namespace isolation and search path support.
  • pg_index - Metadata for indexes, including key columns, uniqueness constraints, and index expressions.
  • pg_depend - Tracks object dependencies for cascading operations (e.g., dropping a table drops its indexes).

System catalogs are pinned (OID < FirstUnpinnedObjectId) and protected from accidental modification.

Relation Creation Pipeline

Creating a cataloged relation involves multiple coordinated steps:

  1. Validation - CheckAttributeNamesTypes() validates column definitions and types.
  2. Storage Creation - heap_create() allocates physical disk storage and creates a relation descriptor.
  3. Type Registration - AddNewRelationType() creates a composite type for the relation's row type.
  4. Catalog Registration - AddNewRelationTuple() inserts a row into pg_class.
  5. Attribute Registration - AddNewAttributeTuples() inserts column metadata into pg_attribute.
  6. Dependency Recording - Dependencies on types and collations are recorded in pg_depend.

The main entry point is heap_create_with_catalog(), which orchestrates these steps and returns the new relation's OID.

Namespace Search Path

Object lookup uses a configurable search path defined by the search_path GUC variable. The path is a list of namespace OIDs searched in order:

  • Implicit namespaces - Temp namespace (if initialized) and pg_catalog are always searched.
  • Explicit namespaces - User-specified schemas from search_path.
  • Creation target - The first namespace in the path is the default for new objects.

The search path is cached and recomputed only when invalidated by syscache changes or GUC updates. Functions like recomputeNamespacePath() and fetch_search_path() manage this state. Visibility functions (RelationIsVisible(), TypeIsVisible(), etc.) check if objects are accessible in the current search path.
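
The in-order lookup semantics can be sketched as below. The flat catalog array and the function name are illustrative; the real implementation consults the syscache and caches the computed path rather than scanning lists:

```c
#include <string.h>

/* Toy catalog: (namespace, object) pairs. */
typedef struct { const char *ns; const char *obj; } CatalogEntry;

/* Return the index of the first namespace in path where obj is found,
 * or -1 if the object is not visible anywhere on the path. */
int lookup_in_path(const char *const *path, int npath,
                   const CatalogEntry *cat, int ncat, const char *obj)
{
    for (int i = 0; i < npath; i++)
        for (int j = 0; j < ncat; j++)
            if (strcmp(cat[j].ns, path[i]) == 0 &&
                strcmp(cat[j].obj, obj) == 0)
                return i;
    return -1;
}
```

Because the scan stops at the first hit, an object in an earlier namespace shadows a same-named object later in the path, which is exactly how pg_catalog names take precedence when it is searched implicitly first.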

Index Metadata

Indexes are tracked in pg_index, which stores:

  • Index and heap relation OIDs
  • Key columns and included columns
  • Uniqueness and constraint flags
  • Index expressions and predicates (for partial indexes)

The index_create() function creates indexes by constructing a tuple descriptor, registering the index in pg_index, and building the index structure via the access method API.

Catalog Access Patterns

Catalog access is optimized through:

  • System cache - Frequently accessed catalogs (e.g., pg_class, pg_type) are cached in memory for fast lookups.
  • Catalog indexes - Unique indexes on pg_class (by OID and name/namespace) enable efficient searches.
  • Pinned OIDs - System catalog OIDs are hardcoded to avoid catalog lookups during bootstrap.

All catalog modifications are transactional and trigger invalidation messages to keep caches consistent across backends.

Replication & Write-Ahead Logging

Relevant Files
  • src/backend/access/transam/xlog.c
  • src/backend/replication/walsender.c
  • src/backend/replication/walreceiver.c
  • src/backend/replication/slot.c
  • src/backend/replication/syncrep.c

PostgreSQL's replication system ensures data durability and high availability through Write-Ahead Logging (WAL) and streaming replication. All database changes are first written to WAL before being applied to data pages, enabling crash recovery and replication to standby servers.

Write-Ahead Logging (WAL)

WAL is the core durability mechanism. Every transaction writes its changes to the WAL buffer before modifying the actual data pages. The xlog.c module manages WAL buffers, insertion, and flushing to disk.

Key WAL Operations:

  1. WAL Insertion - XLogInsertRecord() appends records to the in-memory WAL buffer
  2. WAL Flushing - XLogFlush() writes buffered records to disk, ensuring durability
  3. WAL Writing - XLogWrite() handles the actual I/O operations to WAL files
  4. Checkpointing - Periodic snapshots that allow old WAL to be discarded

The WAL buffer is organized as a circular ring, with insertion and write positions tracked separately. This allows concurrent insertions while background processes flush data to disk.
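
The ring arithmetic behind separate insert and write positions can be sketched as follows; the structure, names, and fixed buffer size are illustrative, not the real XLogCtl layout:

```c
#include <stdint.h>

/* Toy WAL buffer: byte positions grow monotonically, and the modulo
 * maps a position onto a slot in the circular buffer. */
#define WALBUF_SIZE 4096

typedef struct { uint64_t insert_pos; uint64_t write_pos; } ToyWal;

int wal_buffer_offset(uint64_t pos) { return (int)(pos % WALBUF_SIZE); }

/* An insert may run at most one buffer-length ahead of the write
 * position, or it would overwrite bytes not yet flushed to disk. */
int wal_has_room(const ToyWal *w, int len)
{
    return w->insert_pos + len - w->write_pos <= WALBUF_SIZE;
}
```

When `wal_has_room` fails, an inserter must wait for the writer to advance, which is the back-pressure that keeps WAL generation from outrunning WAL I/O.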

Streaming Replication Architecture

Primary Side (WAL Sender):

  • Spawned when a standby connects requesting replication
  • Reads WAL records from disk and sends them via the COPY protocol
  • Tracks replication progress from standbys
  • Handles both physical and logical replication

Standby Side (WAL Receiver):

  • Connects to primary and requests WAL streaming
  • Receives WAL records and writes them to pg_wal directory
  • Updates WalRcvData->flushedUpto to signal startup process
  • Sends periodic status updates to primary

Replication Slots

Replication slots prevent premature WAL removal and maintain state about replication streams. Each slot tracks the restart LSN (Log Sequence Number) needed by its consumer.

Slot Types:

  • Physical Slots - For streaming replication; prevent WAL removal until replicated
  • Logical Slots - For logical decoding; maintain snapshot and transaction state

Slots are stored on disk in $PGDATA/pg_replslot/ with state files containing metadata. Slot state is protected by ReplicationSlotControlLock and a per-slot mutex for safe concurrent access.

Synchronous Replication

Synchronous replication ensures transactions wait for acknowledgment from standbys before committing. The syncrep.c module implements this on the primary.

Modes:

  • FIRST - Priority-based; wait for N highest-priority standbys
  • ANY - Quorum-based; wait for any N standbys from the list

Backends waiting for replication are queued and released when walsenders confirm that WAL has been written, flushed, or applied on the standbys. This behavior is controlled by the synchronous_standby_names GUC parameter.

WAL Lifecycle

  1. Generation - Transactions generate WAL records during execution
  2. Buffering - Records accumulate in the WAL buffer
  3. Flushing - Background writer or explicit XLogFlush() writes to disk
  4. Replication - WAL sender streams to standbys
  5. Archival - Archive command copies WAL files for long-term storage
  6. Recycling - Old WAL files are recycled when no longer needed

The system maintains careful coordination between these stages to ensure durability while managing disk space efficiently.

Utilities & Command-Line Tools

Relevant Files
  • src/bin/psql/mainloop.c
  • src/bin/pg_dump/pg_dump.c
  • src/bin/initdb/initdb.c
  • src/bin/pg_ctl/pg_ctl.c
  • src/bin/pg_basebackup/pg_basebackup.c
  • src/bin/pgbench/pgbench.c
  • src/bin/scripts/ (createdb, dropdb, vacuumdb, etc.)

PostgreSQL provides a comprehensive suite of command-line utilities in src/bin/ for database administration, client interaction, and server management. These tools are essential for day-to-day operations and cluster maintenance.

Client & Interactive Tools

psql (src/bin/psql/) is the primary interactive terminal for PostgreSQL. Its architecture centers on a re-entrant main loop (MainLoop() in mainloop.c) that processes user input from stdin or files. The loop uses a flex-based lexer (psqlscanslash.l) to parse both SQL queries and backslash commands (\ prefix). Key components include:

  • Query Accumulation: Buffers (query_buf, previous_buf) collect multi-line queries until a semicolon terminates them
  • Command Processing: HandleSlashCmds() routes backslash commands (e.g., \d, \i, \e) to specialized handlers
  • Conditional Execution: A conditional stack manages \if/\else/\endif blocks for scripting
  • Signal Handling: Uses sigsetjmp() to handle Ctrl-C interrupts gracefully

The lexer maintains state across interactive and file-based input, supporting variable expansion and nested file inclusion via \i (include) commands.

Backup & Restore Tools

pg_dump (src/bin/pg_dump/) serializes database schemas and data into SQL scripts or custom binary formats. It queries system catalogs to enumerate all dumpable objects (tables, indexes, functions, etc.) and emits them in dependency order. Compression support includes gzip, LZ4, and Zstandard formats.

pg_basebackup (src/bin/pg_basebackup/) performs physical backups by streaming WAL and data files from a running server using the replication protocol. It supports incremental backups and manifest verification.

pg_restore reconstructs databases from pg_dump's custom or directory formats, with parallel restore capability.

Server Management Tools

pg_ctl (src/bin/pg_ctl/) controls the PostgreSQL server lifecycle. It dispatches commands (init, start, stop, restart, reload, status, promote) to appropriate handlers. The tool manages PID files, log rotation, and shutdown modes (smart, fast, immediate).

initdb (src/bin/initdb/) initializes a new cluster by creating the data directory structure, bootstrap databases (template0, template1, postgres), and configuration files. It handles locale and encoding setup.

Maintenance & Diagnostic Tools

pgbench (src/bin/pgbench/) is a benchmarking tool that generates synthetic workloads. It supports custom SQL scripts, variable substitution, and concurrent client simulation.

vacuumdb and reindexdb (src/bin/scripts/) wrap SQL VACUUM and REINDEX commands with connection management and parallel execution options.

pg_upgrade performs in-place cluster version upgrades by comparing old and new schemas, transferring data, and validating consistency.

Architecture Pattern

All utilities follow a common pattern: parse command-line options, establish a libpq connection (if needed), execute operations, and report results. They use shared frontend utilities (src/fe_utils/) for connection management, option parsing, and output formatting.

Repository: https://github.com/postgres/postgres (generated 2025-12-17)