hashicorp/consul

Consul - Distributed Service Networking

Last updated on Dec 12, 2025 (Commit: 187756b)

Overview

Relevant Files
  • main.go
  • agent/agent.go
  • agent/consul/server.go
  • agent/consul/client.go
  • README.md

Consul is a distributed, highly available solution for service discovery, health checking, and dynamic configuration across datacenters. It operates as a cluster of agents that can run in either server or client mode, providing multi-datacenter awareness and service mesh capabilities.

Core Components

Agent (agent/agent.go) is the central long-running process on every machine. It exposes RPC, HTTP, DNS, and gRPC interfaces and can operate in two modes:

  • Server Mode - Runs a full Consul server with Raft consensus, state management, and leadership election
  • Client Mode - Forwards requests to Consul servers via RPC, maintaining local service and check state

Consul Server (agent/consul/server.go) manages cluster state using Raft consensus, maintains the state store, handles service registration, and coordinates with other servers across datacenters.

Consul Client (agent/consul/client.go) maintains a connection pool to servers, routes RPC requests, and manages local agent state without participating in consensus.

Key Features

  • Service Discovery - Services register themselves; clients discover them via DNS or the HTTP API (see the sketch after this list)
  • Health Checking - Monitors service health; prevents routing to unhealthy instances
  • Service Mesh - Enables secure service-to-service communication with automatic TLS and identity-based authorization
  • Multi-Datacenter - Servers federate across datacenters; clients forward to local servers
  • Dynamic Configuration - HTTP API for storing indexed configuration and metadata
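
As a minimal illustration of the service discovery and health checking features above, the official Go api client can register a service with an HTTP check and then look up its healthy instances. This is a sketch that assumes a local agent on the default HTTP address; the service name, port, and health endpoint are illustrative.

package main

import (
    "fmt"
    "log"

    "github.com/hashicorp/consul/api"
)

func main() {
    // Connect to the local agent (default 127.0.0.1:8500).
    client, err := api.NewClient(api.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }

    // Register a service with an HTTP health check on the local agent.
    err = client.Agent().ServiceRegister(&api.AgentServiceRegistration{
        Name: "web",
        Port: 8080,
        Check: &api.AgentServiceCheck{
            HTTP:     "http://127.0.0.1:8080/health",
            Interval: "10s",
            Timeout:  "2s",
        },
    })
    if err != nil {
        log.Fatal(err)
    }

    // Discover healthy instances of the service.
    entries, _, err := client.Health().Service("web", "", true, nil)
    if err != nil {
        log.Fatal(err)
    }
    for _, e := range entries {
        fmt.Println(e.Node.Node, e.Service.Address, e.Service.Port)
    }
}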

Communication Protocols

Consul uses multiple protocols for different purposes:

  • Serf Gossip - LAN and WAN membership management and failure detection
  • Raft - Server consensus for state replication (servers only)
  • RPC - Client-server and server-server communication
  • gRPC - Modern API for proxies and external services
  • DNS - Service discovery queries
  • HTTP - REST API for all operations

Startup Flow

When the agent starts, it initializes configuration, creates either a Server or Client delegate based on mode, sets up local state tracking, and starts listening on HTTP, DNS, and gRPC ports. The agent then begins service synchronization and retry join logic to connect with the cluster.

Architecture & Core Components

Relevant Files
  • agent/consul/server.go
  • agent/consul/client.go
  • agent/consul/rpc.go
  • agent/consul/fsm/fsm.go
  • agent/consul/state/state_store.go
  • agent/pool/pool.go

Consul's architecture is built on a distributed consensus model with clear separation between server and client components. The system uses Raft for strong consistency and Serf for gossip-based cluster membership.

Core Components

Server (agent/consul/server.go) is the primary stateful component that manages the cluster. Each server maintains:

  • A Raft instance for distributed consensus across the datacenter
  • A Finite State Machine (FSM) that applies committed log entries to the state store
  • A State Store (in-memory MemDB) holding all cluster data (nodes, services, ACLs, etc.)
  • Multiple Serf pools for cluster membership (LAN for local DC, WAN for cross-DC)
  • RPC servers for handling both traditional net/rpc and gRPC requests

Client (agent/consul/client.go) is a lightweight agent that runs on every node. Clients:

  • Do not participate in Raft consensus
  • Use a connection pool to communicate with servers
  • Maintain a router to discover and select healthy servers
  • Apply rate limiting to outbound RPC requests
  • Listen to Serf events for cluster membership changes

Consensus & State Management

The FSM (agent/consul/fsm/fsm.go) implements Raft's state machine interface. When Raft commits a log entry:

  1. The FSM receives the log entry via Apply()
  2. It dispatches to a registered command handler based on message type
  3. The handler modifies the State Store (MemDB)
  4. Changes are published to event subscribers for real-time updates

The State Store (agent/consul/state/state_store.go) uses MemDB for fast, queryable in-memory storage with MVCC semantics. It supports:

  • Blocking queries (clients wait for state changes)
  • Snapshots for Raft recovery
  • Transaction-based updates for consistency

RPC & Communication

RPC Layer (agent/consul/rpc.go) handles all request routing:

  • Clients forward requests to servers via the connection pool
  • Servers accept connections and route to appropriate handlers
  • Forwarding logic sends requests to the leader if needed, or to other datacenters
  • Rate limiting prevents overload on both client and server sides

Connection Pool (agent/pool/pool.go) manages persistent connections:

  • Multiplexes multiple RPC streams over single TCP connections using Yamux
  • Caches connections for reuse
  • Supports TLS encryption and certificate verification
  • Handles connection timeouts and cleanup
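
The multiplexing pattern is easiest to see with the underlying hashicorp/yamux library in isolation. The sketch below is not Consul's actual wire protocol (real RPC connections also negotiate a protocol byte and optional TLS first); the address and payloads are illustrative.

package main

import (
    "fmt"
    "log"
    "net"

    "github.com/hashicorp/yamux"
)

func main() {
    // Dial a single TCP connection to a hypothetical RPC listener.
    conn, err := net.Dial("tcp", "127.0.0.1:8300")
    if err != nil {
        log.Fatal(err)
    }

    // Wrap it in a yamux session so many logical streams share one connection.
    session, err := yamux.Client(conn, nil)
    if err != nil {
        log.Fatal(err)
    }
    defer session.Close()

    // Each Open() returns a net.Conn-like stream that could carry one RPC.
    for i := 0; i < 3; i++ {
        stream, err := session.Open()
        if err != nil {
            log.Fatal(err)
        }
        fmt.Fprintf(stream, "request %d\n", i)
        stream.Close()
    }
}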

Key Design Patterns

  • Strong Consistency: Write operations go through Raft; reads can be stale or consistent
  • Blocking Queries: Clients can wait for state changes without polling
  • Gossip Membership: Serf maintains cluster topology; Raft manages state
  • Multiplexing: Single TCP connection carries multiple concurrent RPC streams
  • Rate Limiting: Protects servers from client overload and clients from server limits

State Management & Persistence

Relevant Files
  • agent/consul/state/state_store.go
  • agent/consul/state/catalog.go
  • agent/consul/fsm/fsm.go
  • agent/consul/fsm/snapshot.go
  • agent/consul/server.go

Consul's state management system is built on a Raft-based finite state machine (FSM) that ensures strong consistency across the cluster. All state is stored in-memory using MemDB, a fast in-memory database, and is reconstructed from Raft logs through the FSM.

State Store (MemDB)

The Store struct in state_store.go is the core in-memory database containing all Consul state:

  • MemDB: A thread-safe, in-memory database with MVCC (Multi-Version Concurrency Control) semantics
  • Tables: Organized into logical tables for nodes, services, checks, KV pairs, ACLs, sessions, and more
  • Transactions: Read and write transactions provide isolation and consistency guarantees
  • Change Tracking: All writes are tracked and published as events for subscribers

The state store is entirely reconstructed from the Raft log through the FSM, ensuring it can be rebuilt on any server.
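
To make the MemDB model concrete, here is a small standalone sketch built on the same underlying library (hashicorp/go-memdb). The table name and record type are illustrative rather than Consul's actual schema; the watch channel returned by FirstWatch is the primitive that blocking queries build on.

package main

import (
    "fmt"
    "log"

    "github.com/hashicorp/go-memdb"
)

// Service is a toy record; Consul's real tables hold nodes, services, checks, and so on.
type Service struct {
    ID   string
    Name string
    Port int
}

func main() {
    // Minimal schema: every go-memdb table needs an "id" index.
    schema := &memdb.DBSchema{
        Tables: map[string]*memdb.TableSchema{
            "services": {
                Name: "services",
                Indexes: map[string]*memdb.IndexSchema{
                    "id": {
                        Name:    "id",
                        Unique:  true,
                        Indexer: &memdb.StringFieldIndex{Field: "ID"},
                    },
                },
            },
        },
    }

    db, err := memdb.NewMemDB(schema)
    if err != nil {
        log.Fatal(err)
    }

    // Write transaction: insert and commit atomically.
    txn := db.Txn(true)
    if err := txn.Insert("services", &Service{ID: "web-1", Name: "web", Port: 8080}); err != nil {
        log.Fatal(err)
    }
    txn.Commit()

    // Read transaction: FirstWatch also returns a channel that closes when the
    // matching entry changes, which is what index-based blocking builds on.
    read := db.Txn(false)
    watchCh, raw, err := read.FirstWatch("services", "id", "web-1")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("found: %+v\n", raw.(*Service))
    _ = watchCh // a caller would select on this channel to wait for changes
}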

Finite State Machine (FSM)

The FSM in fsm.go applies Raft log entries to the state store:

  • Apply: Processes each Raft log entry by dispatching to registered command handlers
  • Command Registry: Commands are registered at package init time and mapped by message type
  • Atomic Updates: Each log entry updates the state store within a single transaction
  • Event Publishing: Changes trigger events that are published to subscribers

Snapshots & Persistence

Snapshots enable fast recovery and cluster bootstrap:

Persist (snapshot.go):

  • Captures a point-in-time snapshot of the entire state store
  • Encodes all tables (nodes, services, ACLs, KV, etc.) using msgpack
  • Includes a header with the last Raft index for consistency tracking
  • Persisted to disk by Raft for recovery

Restore:

  • Reads snapshot data and reconstructs the state store
  • Replaces the entire state store atomically to prevent inconsistency
  • Restores chunking state and resource storage separately
  • Signals watchers that the state has changed
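
Consul's FSM dispatches typed, msgpack-encoded messages to registered handlers, so the real code is considerably richer, but the Apply/Snapshot/Restore shape can be shown with a toy key-value FSM built on the hashicorp/raft library (illustrative only, not Consul's FSM).

package kvfsm

import (
    "encoding/json"
    "io"
    "sync"

    "github.com/hashicorp/raft"
)

// command stands in for Consul's typed Raft messages.
type command struct {
    Op    string `json:"op"`
    Key   string `json:"key"`
    Value string `json:"value"`
}

// KVFSM is a toy raft.FSM: committed log entries mutate an in-memory map.
type KVFSM struct {
    mu   sync.Mutex
    data map[string]string
}

func New() *KVFSM { return &KVFSM{data: map[string]string{}} }

// Apply is called by Raft for every committed log entry.
func (f *KVFSM) Apply(l *raft.Log) interface{} {
    var c command
    if err := json.Unmarshal(l.Data, &c); err != nil {
        return err
    }
    f.mu.Lock()
    defer f.mu.Unlock()
    switch c.Op {
    case "set":
        f.data[c.Key] = c.Value
    case "delete":
        delete(f.data, c.Key)
    }
    return nil
}

// Snapshot captures a point-in-time copy used for log compaction and recovery.
func (f *KVFSM) Snapshot() (raft.FSMSnapshot, error) {
    f.mu.Lock()
    defer f.mu.Unlock()
    copied := make(map[string]string, len(f.data))
    for k, v := range f.data {
        copied[k] = v
    }
    return &snapshot{data: copied}, nil
}

// Restore rebuilds the state from a previously persisted snapshot.
func (f *KVFSM) Restore(rc io.ReadCloser) error {
    defer rc.Close()
    m := map[string]string{}
    if err := json.NewDecoder(rc).Decode(&m); err != nil {
        return err
    }
    f.mu.Lock()
    f.data = m
    f.mu.Unlock()
    return nil
}

type snapshot struct{ data map[string]string }

// Persist writes the snapshot to the sink that Raft stores on disk.
func (s *snapshot) Persist(sink raft.SnapshotSink) error {
    if err := json.NewEncoder(sink).Encode(s.data); err != nil {
        sink.Cancel()
        return err
    }
    return sink.Close()
}

func (s *snapshot) Release() {}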

Consistency Model

  • Strong Consistency: All writes go through Raft consensus before applying to state
  • Read Consistency: Reads from the state store reflect all committed writes
  • Blocking Queries: Clients can watch for changes using index-based blocking
  • Snapshot Consistency: Snapshots capture a consistent view at a specific Raft index

Key Operations

Write Path: RPC request → Raft leader → FSM.Apply() → State Store update → Event published

Read Path: Query → State Store snapshot → MemDB transaction → Results returned

Recovery Path: Snapshot restored → State Store rebuilt → Raft logs replayed from snapshot index
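
From a client's point of view, the read path can also be driven as an index-based blocking query by echoing back the last observed index. A sketch with the official api package; the key name is illustrative.

package main

import (
    "fmt"
    "log"
    "time"

    "github.com/hashicorp/consul/api"
)

func main() {
    client, err := api.NewClient(api.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }

    var lastIndex uint64
    for {
        // Passing the last observed index turns this read into a blocking query:
        // it returns when the index advances or the wait time elapses.
        pair, meta, err := client.KV().Get("config/app", &api.QueryOptions{
            WaitIndex: lastIndex,
            WaitTime:  5 * time.Minute,
        })
        if err != nil {
            log.Fatal(err)
        }
        lastIndex = meta.LastIndex
        if pair != nil {
            fmt.Printf("config/app = %s (index %d)\n", pair.Value, lastIndex)
        }
    }
}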

Service Discovery & Catalog

Relevant Files
  • agent/consul/catalog_endpoint.go
  • agent/consul/health_endpoint.go
  • agent/consul/state/catalog.go
  • agent/structs/catalog.go
  • agent/dns.go
  • api/catalog.go

The service catalog is Consul's core registry that tracks all nodes, services, and their health status. It enables service discovery by maintaining a queryable database of what services are available and where they run.

Core Architecture

The catalog system has three main layers:

  1. Endpoints (catalog_endpoint.go, health_endpoint.go) - HTTP/RPC API handlers that accept registration and query requests
  2. State Store (agent/consul/state/catalog.go) - In-memory database using memdb that stores and indexes catalog data
  3. Data Structures (agent/structs/catalog.go, api/catalog.go) - Request/response types and core entities

Registration Flow

Services register through the Catalog.Register RPC endpoint:

// Register a service and/or check(s) in a node
func (c *Catalog) Register(args *structs.RegisterRequest, reply *struct{}) error

The registration process:

  1. Validates ACL permissions and enterprise metadata
  2. Pre-applies validation rules to node, service, and check data
  3. Stores the node, service, and health checks in the state store
  4. Triggers replication to other servers via Raft

A RegisterRequest can include:

  • Node information (name, address, metadata)
  • Service definition (name, port, tags, metadata)
  • Health checks (HTTP, TCP, script-based, TTL)
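
The same fields are exposed to external clients through the api package's CatalogRegistration type, which mirrors RegisterRequest. A hedged sketch that registers a node, service, and check in one request; names, addresses, and IDs are illustrative.

package main

import (
    "log"

    "github.com/hashicorp/consul/api"
)

func main() {
    client, err := api.NewClient(api.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }

    // External registration directly against the catalog.
    reg := &api.CatalogRegistration{
        Node:    "db-node-1",
        Address: "10.0.0.20",
        Service: &api.AgentService{
            ID:      "postgres-1",
            Service: "postgres",
            Tags:    []string{"primary"},
            Port:    5432,
        },
        Check: &api.AgentCheck{
            Node:      "db-node-1",
            CheckID:   "service:postgres-1",
            Name:      "postgres liveness",
            Status:    api.HealthPassing,
            ServiceID: "postgres-1",
        },
    }
    if _, err := client.Catalog().Register(reg, nil); err != nil {
        log.Fatal(err)
    }
}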

Query Operations

The Health endpoint provides multiple query patterns:

  • Health.ServiceNodes - Returns all instances of a service with their health status
  • Health.ServiceChecks - Returns checks for a specific service
  • Health.NodeChecks - Returns all checks for a node
  • Health.ChecksInState - Returns checks matching a health state (passing, warning, critical)

Queries support:

  • Tag filtering to find service instances with specific tags
  • Connect-aware queries for service mesh proxies
  • Ingress gateway queries for external traffic routing
  • Blocking queries for real-time updates

DNS Integration

The DNS server (agent/dns.go) queries the catalog to resolve service names:

// Service lookup queries the catalog for service instances
args := structs.ServiceSpecificRequest{
    ServiceName: lookup.Service,
    ServiceTags: serviceTags,
    // ...
}
out, _, err := d.agent.rpcClientHealth.ServiceNodes(context.TODO(), args)

DNS queries like redis.service.consul are resolved by:

  1. Parsing the service name from the DNS query
  2. Calling Health.ServiceNodes to get available instances
  3. Returning A/AAAA records for healthy instances
  4. Supporting SRV records for port information
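
The same resolution can be exercised against the agent's DNS interface (port 8600 by default) using Go's standard resolver. A sketch assuming a local agent and an illustrative service name.

package main

import (
    "context"
    "fmt"
    "log"
    "net"
    "time"
)

func main() {
    // Point a resolver at the local Consul agent's DNS port.
    r := &net.Resolver{
        PreferGo: true,
        Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
            d := net.Dialer{Timeout: 2 * time.Second}
            return d.DialContext(ctx, network, "127.0.0.1:8600")
        },
    }

    // A/AAAA lookup returns addresses of healthy instances.
    ips, err := r.LookupIPAddr(context.Background(), "redis.service.consul")
    if err != nil {
        log.Fatal(err)
    }
    for _, ip := range ips {
        fmt.Println(ip.String())
    }

    // SRV lookup also returns the registered port.
    _, srvs, err := r.LookupSRV(context.Background(), "", "", "redis.service.consul")
    if err != nil {
        log.Fatal(err)
    }
    for _, s := range srvs {
        fmt.Println(s.Target, s.Port)
    }
}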

State Storage

The state store maintains multiple indexes for efficient queries:

  • Nodes table - Indexed by node ID and name
  • Services table - Indexed by node, service name, and tags
  • Checks table - Indexed by node and service
  • Service virtual IPs - Maps services to their assigned VIPs

Queries use memdb watch sets for blocking query support, allowing clients to wait for changes without polling.

Health Status Management

Every node has a built-in serfHealth check that reflects cluster membership status. Services can have multiple checks:

  • HTTP checks - Periodic HTTP requests to a health endpoint
  • TCP checks - TCP connection attempts
  • Script checks - Custom scripts executed by the agent
  • TTL checks - Manual status updates with expiration

The catalog aggregates check statuses to determine if a service instance is passing, warning, or critical.

Service Mesh & Connect

Relevant Files
  • agent/proxycfg/state.go - Proxy configuration state management
  • agent/xds/server.go - XDS gRPC server for Envoy configuration
  • connect/resolver.go - Service discovery and resolution
  • connect/tls.go - TLS certificate verification and mTLS setup
  • agent/consul/leader_connect_ca.go - Certificate Authority management

Overview

Consul's service mesh (Connect) provides secure service-to-service communication using mutual TLS (mTLS) encryption, identity-based authentication, and explicit service authorization. The architecture consists of a control plane (Consul servers and agents) that manages configuration and a data plane (Envoy sidecar proxies) that enforces policies.

Core Architecture

The service mesh operates through three main layers:

  1. Certificate Authority (CA) - Issues and manages SPIFFE X.509 certificates for service identity
  2. Proxy Configuration - Generates and distributes Envoy proxy configurations
  3. XDS Server - Delivers configuration updates to proxies via gRPC

Certificate Authority (CA)

The CA subsystem manages the PKI infrastructure for service mesh. Key components:

  • CAManager (agent/consul/leader_connect_ca.go) - Runs on the leader and manages CA state, certificate rotation, and provider lifecycle
  • CA Providers - Pluggable implementations (built-in, Vault, etc.) that handle certificate signing
  • Root Certificates - Distributed to all agents and proxies for trust chain validation
  • Leaf Certificates - Issued per service instance with SPIFFE URIs for identity

The CA maintains state through caState transitions: UNINITIALIZED → INITIALIZING → INITIALIZED → RENEWING/RECONFIGURING.

Proxy Configuration Management

The proxycfg package coordinates data fetching for proxy configuration:

  • Manager - Tracks registered proxies and coordinates state updates
  • State - Maintains configuration for a single proxy, watching multiple data sources (roots, leaf certs, intentions, upstreams, discovery chains)
  • ConfigSnapshot - Immutable snapshot of all configuration needed by a proxy at a point in time

The state machine watches for updates from the catalog, ACL system, and configuration entries, coalescing changes into snapshots that are pushed to consumers.

XDS Server & Envoy Integration

The XDS server (agent/xds/server.go) implements Envoy's Aggregated Discovery Service (ADS) protocol:

  • DeltaAggregatedResources - Primary gRPC endpoint for Envoy proxy connections
  • Resource Types - Listeners, Routes, Clusters, Endpoints, Secrets (certificates)
  • Authorization - Validates that proxy tokens have service:write permission for their service
  • Streaming - Long-lived gRPC streams push configuration updates to proxies in real-time

Service Discovery & Resolution

The connect package provides client-side service discovery:

  • Resolver Interface - Abstracts service discovery mechanisms
  • ConsulResolver - Queries Consul catalog for healthy service instances
  • StaticResolver - For known endpoints without discovery
  • Service - High-level API for establishing mTLS connections with automatic certificate management
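
For connect-native Go applications, the connect package ties these pieces together. Below is a minimal server-side sketch using the public connect.NewService API; the service name, port, and handler are illustrative.

package main

import (
    "log"
    "net/http"

    "github.com/hashicorp/consul/api"
    "github.com/hashicorp/consul/connect"
)

func main() {
    client, err := api.NewClient(api.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }

    // Register as a connect-native service; leaf certificates and roots are
    // fetched and rotated automatically in the background.
    svc, err := connect.NewService("web", client)
    if err != nil {
        log.Fatal(err)
    }
    defer svc.Close()

    // Serve plain HTTP behind an mTLS listener whose TLS config enforces
    // SPIFFE identity verification and intention-based authorization.
    server := &http.Server{
        Addr:      ":8443",
        TLSConfig: svc.ServerTLSConfig(),
        Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            w.Write([]byte("hello over mTLS\n"))
        }),
    }
    log.Fatal(server.ListenAndServeTLS("", ""))
}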

mTLS & Certificate Verification

TLS configuration in connect/tls.go enforces security:

  • Minimum TLS 1.2 with strong cipher suites (ECDHE + AES/ChaCha20)
  • Client Certificate Verification - Both sides verify peer certificates
  • SPIFFE URI Validation - Certificates must contain correct service identity URIs
  • Custom Verifiers - Server-side verifier checks authorization; client-side verifier validates chain

The verifyServerCertMatchesURI function ensures the peer certificate identity matches the expected service URI, preventing man-in-the-middle attacks.

Data Flow

  1. Service registers with Consul; proxycfg Manager creates state object
  2. Envoy proxy connects to XDS server via gRPC
  3. XDS server watches proxycfg state for configuration changes
  4. proxycfg state watches CA for certificate updates and catalog for service topology
  5. Configuration snapshots are serialized to Envoy resources (Listeners, Routes, Clusters, Secrets)
  6. Envoy applies configuration and enforces mTLS, authorization, and routing policies

ACL & Security

Relevant Files
  • acl/acl.go
  • acl/authorizer.go
  • acl/policy.go
  • acl/policy_authorizer.go
  • acl/chained_authorizer.go
  • agent/consul/acl.go
  • agent/acl_endpoint.go

Consul's ACL system provides role-based access control (RBAC) for authenticating and authorizing access to HTTP API and RPC operations. The system is built on a foundation of tokens, policies, and roles that work together to enforce fine-grained permissions across the cluster.

Core Components

Tokens are the primary authentication mechanism. Each token has a secret ID used for authentication and an accessor ID for logging. Tokens can be associated with policies, roles, or service/node identities. The system includes special tokens like the anonymous token (used when no token is provided) and agent recovery tokens (for emergency access).

Policies define sets of rules that grant or deny access to resources. Rules are organized by resource type (agent, key, node, service, session, event, query, keyring, operator, mesh, peering) and support both exact-match and prefix-based matching. Each rule specifies an access level: deny, read, list, or write.

Roles group policies and identities together, allowing administrators to manage permissions at a higher level of abstraction. Service identities and node identities are synthetic policies that automatically grant permissions for specific services or nodes.
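
Policies and tokens are normally managed through the ACL HTTP API, and the Go api client exposes the same operations. A sketch assuming a token with acl:write permission; the rules, names, and token placeholder are illustrative.

package main

import (
    "fmt"
    "log"

    "github.com/hashicorp/consul/api"
)

func main() {
    // ACL management calls require a suitably privileged token.
    cfg := api.DefaultConfig()
    cfg.Token = "<management-token>"
    client, err := api.NewClient(cfg)
    if err != nil {
        log.Fatal(err)
    }

    // A policy with an exact-match service rule and a key prefix rule.
    policy, _, err := client.ACL().PolicyCreate(&api.ACLPolicy{
        Name: "web-service",
        Rules: `
service "web" { policy = "write" }
key_prefix "config/web/" { policy = "read" }
`,
    }, nil)
    if err != nil {
        log.Fatal(err)
    }

    // A token linked to the policy; callers present the SecretID.
    token, _, err := client.ACL().TokenCreate(&api.ACLToken{
        Description: "token for the web service",
        Policies:    []*api.ACLTokenPolicyLink{{Name: policy.Name}},
    }, nil)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("accessor:", token.AccessorID, "secret:", token.SecretID)
}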

Authorization Flow

The ACLResolver handles token and policy resolution. It maintains caches for identities, policies, and roles to minimize RPC calls. When a token is presented, the resolver:

  1. Checks if ACLs are enabled
  2. Attempts local resolution (agent recovery tokens, server management tokens)
  3. Consults the cache if available
  4. Falls back to remote RPC resolution if needed

Policy Enforcement

The Authorizer interface defines methods for checking permissions on each resource type. The PolicyAuthorizer implements this interface using radix trees for efficient prefix matching. Each resource type maintains its own radix tree for exact and prefix-based rules.
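
The longest-prefix idea can be sketched with the go-radix library on its own. This is a simplification rather than Consul's authorizer (the real PolicyAuthorizer also separates exact-match from prefix rules); keys and rules are illustrative.

package main

import (
    "fmt"

    radix "github.com/armon/go-radix"
)

func main() {
    // One tree per resource type; values are the access levels from policy rules.
    keyRules := radix.New()
    keyRules.Insert("config/", "read")           // key_prefix "config/" { policy = "read" }
    keyRules.Insert("config/app/secret", "deny") // key "config/app/secret" { policy = "deny" }

    check := func(key string) {
        // The longest matching prefix decides which rule applies.
        prefix, access, ok := keyRules.LongestPrefix(key)
        if !ok {
            fmt.Printf("%-20s -> no rule, defer to the default policy\n", key)
            return
        }
        fmt.Printf("%-20s -> %v (rule %q)\n", key, access, prefix)
    }

    check("config/app/port")   // read (prefix "config/")
    check("config/app/secret") // deny (the longer prefix wins)
    check("other/key")         // no rule, default policy applies
}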

The ChainedAuthorizer combines multiple authorizers in sequence, allowing the first non-default decision to take precedence. This enables layering of authorization logic.

Access Control Decisions

Three enforcement decisions are possible:

  • Allow: A matching rule explicitly grants access
  • Deny: A matching rule explicitly denies access
  • Default: No matching rule found; decision deferred to default policy

The default policy is configurable via acl_default_policy (typically deny for secure deployments). When no rule matches, the system falls back to this default.

Token Resolution Strategies

The ACL down policy determines behavior when the ACL datacenter is unavailable:

  • allow: Permit all requests (unsafe)
  • deny: Deny all requests (conservative)
  • extend-cache: Use cached values indefinitely
  • async-cache: Use cached values while fetching updates asynchronously

This enables graceful degradation during network partitions while maintaining security posture.

Cluster Membership & Gossip

Relevant Files
  • agent/consul/server_serf.go
  • agent/consul/client_serf.go
  • agent/consul/leader_registrator_v1.go
  • agent/consul/merge.go
  • internal/gossip/libserf/serf.go

Consul uses the Serf gossip protocol to manage cluster membership and detect node failures. This distributed protocol enables all agents to maintain a consistent view of the cluster without requiring a central authority.

Gossip Protocol Overview

Serf implements a modified SWIM (Scalable Weakly-consistent Infection-style Process Group Membership) protocol. Each node periodically exchanges membership information with random peers, allowing state changes to propagate exponentially through the cluster. This approach scales to thousands of nodes with minimal overhead.

Key characteristics:

  • Decentralized: No single point of failure for membership management
  • Probabilistic: Uses random peer selection for efficient propagation
  • Failure detection: Detects node failures within seconds
  • Event broadcasting: Disseminates custom events and queries across the cluster

LAN vs WAN Gossip Pools

Consul maintains separate gossip pools for different network topologies:

LAN Pool (serfLAN):

  • Connects agents within a single datacenter
  • Handles local node discovery and health monitoring
  • Supports segments for logical grouping within a datacenter
  • Processes member join, leave, fail, and reap events

WAN Pool (serfWAN):

  • Connects servers across multiple datacenters
  • Uses mesh gateway transport for cross-datacenter federation
  • Validates that joining nodes are servers (not clients)
  • Prevents datacenter mismatches during cluster merges

Member Lifecycle

When a node joins the cluster, Serf broadcasts an EventMemberJoin event. The event handler processes this through several stages:

  1. Join Detection (lanNodeJoin): Identifies Consul servers and updates the server lookup table
  2. Reconciliation (localMemberEvent): Leaders reconcile Serf state with the catalog
  3. Registration (HandleAliveMember): Registers the node in the catalog with a passing health check

When a node fails or leaves, similar handlers (lanNodeFailed, HandleFailedMember) mark it critical or deregister it.

Merge Delegates

Merge delegates validate cluster merges when partitioned networks rejoin:

LAN Merge Delegate:

  • Checks for conflicting node IDs
  • Validates all nodes are in the same datacenter
  • Prevents duplicate node IDs across the cluster

WAN Merge Delegate:

  • Ensures only servers join the WAN pool
  • Validates server metadata consistency
  • Can disable federation if misconfiguration is detected

Event Handling

Serf events flow through dedicated event channels:

// Serf events arriving on the LAN event channel are dispatched by type.
switch e.EventType() {
case serf.EventMemberJoin:
    s.lanNodeJoin(e.(serf.MemberEvent))
case serf.EventMemberLeave, serf.EventMemberFailed, serf.EventMemberReap:
    s.lanNodeFailed(e.(serf.MemberEvent))
case serf.EventUser:
    s.localEvent(e.(serf.UserEvent))
case serf.EventMemberUpdate:
    s.lanNodeUpdate(e.(serf.MemberEvent))
}

User events enable custom workflows like remote execution and cluster-wide notifications.
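
Outside of Consul, the same Serf machinery can be wired up directly. A standalone sketch using the hashicorp/serf library; the node name, tags, and join address are illustrative.

package main

import (
    "fmt"
    "log"

    "github.com/hashicorp/serf/serf"
)

func main() {
    events := make(chan serf.Event, 16)

    conf := serf.DefaultConfig()
    conf.NodeName = "node-a"
    conf.Tags = map[string]string{"role": "node", "dc": "dc1"} // Consul-style tags
    conf.EventCh = events

    s, err := serf.Create(conf)
    if err != nil {
        log.Fatal(err)
    }
    defer s.Leave()

    // Join an existing cluster member to start gossiping.
    if _, err := s.Join([]string{"10.0.0.10:8301"}, false); err != nil {
        log.Printf("join failed: %v", err)
    }

    // Handle membership and user events, mirroring the switch shown above.
    for e := range events {
        switch ev := e.(type) {
        case serf.MemberEvent:
            for _, m := range ev.Members {
                fmt.Printf("%s: %s (role=%s)\n", e.EventType(), m.Name, m.Tags["role"])
            }
        case serf.UserEvent:
            fmt.Printf("user event %s: %s\n", ev.Name, string(ev.Payload))
        }
    }
}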

Bootstrap Coordination

During cluster bootstrap, servers use gossip to discover peers and coordinate Raft initialization. The maybeBootstrap function:

  1. Scans LAN members for expected server count
  2. Queries each peer for existing Raft state
  3. Initializes Raft cluster configuration if no existing state is found
  4. Prevents spurious elections by ensuring only one bootstrap occurs

Configuration & Tuning

Consul-specific Serf defaults in libserf.DefaultConfig():

  • MinQueueDepth: 4096 (dynamically sized based on cluster size)
  • LeavePropagateDelay: 3 seconds (allows graceful leave propagation)
  • QueueDepthWarning: 1,000,000 (effectively disabled)

These settings optimize for large clusters while maintaining responsiveness.

Member Metadata

Serf tags encode critical node information:

  • role: "consul" (server) or "node" (client)
  • dc: Datacenter name
  • id: Unique node ID
  • vsn: Protocol version
  • port: RPC port
  • grpc_port, grpc_tls_port: gRPC endpoints
  • bootstrap, expect: Bootstrap configuration
  • read_replica: Read-only server flag

This metadata enables intelligent routing and version compatibility checks.

HTTP API & Endpoints

Relevant Files
  • agent/http.go
  • agent/http_register.go
  • agent/health_endpoint.go
  • agent/catalog_endpoint.go
  • api/api.go

Consul exposes a comprehensive HTTP API for service discovery, health checks, configuration management, and cluster operations. The HTTP server is built on Go's standard net/http package with a custom routing and middleware layer.

Endpoint Registration System

Endpoints are registered at package initialization time using the registerEndpoint() function in http_register.go. Each endpoint maps a URL pattern to an HTTP method set and a handler function:

registerEndpoint("/v1/catalog/services", []string{"GET"}, (*HTTPHandlers).CatalogServices)
registerEndpoint("/v1/agent/service/register", []string{"PUT"}, (*HTTPHandlers).AgentRegisterService)

The registration system maintains two global maps:

  • endpoints: Maps URL patterns to unbound endpoint handler functions
  • allowedMethods: Maps patterns to supported HTTP methods (GET, PUT, DELETE, POST)

Request Handling Pipeline

When a request arrives, it flows through multiple layers:

  1. Routing: The http.ServeMux matches the request path to a registered pattern
  2. Middleware: Gzip compression, metrics collection, and ACL authorization
  3. Handler Execution: The endpoint handler processes the request and returns (interface{}, error)
  4. Response Encoding: Results are JSON-encoded and sent to the client

The wrap() function standardizes response handling by converting endpoint results into HTTP responses with proper status codes, headers, and error formatting.
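
The idea behind wrap() can be sketched in isolation. The adapter below is not Consul's implementation (which also layers gzip, metrics, ACL resolution, and its richer error types); it only shows how handlers returning (interface{}, error) become http.HandlerFuncs. The route and payload are illustrative.

package main

import (
    "encoding/json"
    "log"
    "net/http"
)

// endpointFn mirrors the shape of Consul's endpoint handlers: they return a
// result object and an error instead of writing the response themselves.
type endpointFn func(resp http.ResponseWriter, req *http.Request) (interface{}, error)

// wrap adapts an endpointFn into a standard http.HandlerFunc by JSON-encoding
// the result and translating errors into HTTP status codes.
func wrap(handler endpointFn) http.HandlerFunc {
    return func(resp http.ResponseWriter, req *http.Request) {
        obj, err := handler(resp, req)
        if err != nil {
            http.Error(resp, err.Error(), http.StatusInternalServerError)
            return
        }
        if obj == nil {
            return
        }
        resp.Header().Set("Content-Type", "application/json")
        if err := json.NewEncoder(resp).Encode(obj); err != nil {
            http.Error(resp, err.Error(), http.StatusInternalServerError)
        }
    }
}

func main() {
    mux := http.NewServeMux()
    mux.HandleFunc("/v1/example/services", wrap(func(resp http.ResponseWriter, req *http.Request) (interface{}, error) {
        return map[string][]string{"web": {"primary"}}, nil
    }))
    log.Fatal(http.ListenAndServe(":8080", mux))
}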

Key Endpoint Categories

  • ACL: Token management, policy creation, role-based access control
  • Agent: Service registration, health checks, node information
  • Catalog: Service discovery, node listing, datacenter information
  • Health: Health status queries, service health checks
  • Connect: mTLS certificate management, intentions, authorization
  • KV Store: Key-value storage operations
  • Operator: Raft configuration, autopilot, keyring management
  • Session: Session creation and management for distributed locks

Error Handling

Three error types provide flexible HTTP response control:

  • HTTPError: Returns custom status code with plain text reason
  • CodeWithPayloadError: Returns non-200 status with custom content type
  • MethodNotAllowedError: Handles unsupported HTTP methods

Performance Features

  • Gzip Compression: Automatically applied to responses (configurable minimum size)
  • Metrics: All requests tracked with method and path labels via Prometheus
  • Caching: Client-side caching via blocking queries and watch mechanisms
  • Rate Limiting: Configurable per-endpoint rate limits via middleware

Configuration & Agent Setup

Relevant Files
  • agent/config/builder.go
  • agent/consul/config.go
  • command/agent/agent.go
  • agent/agent.go

Consul's configuration system is built on a multi-layered approach that merges configuration from multiple sources with a well-defined precedence order. The agent startup process loads and validates configuration, then initializes all necessary components.

Configuration Loading Pipeline

The configuration builder processes sources in this order:

  1. Default configuration – Built-in defaults for all settings
  2. Config files – HCL or JSON files in alphabetical order
  3. Command-line flags – Override file-based settings
  4. Overrides – Final programmatic overrides

The LoadOpts struct in builder.go controls this process. It accepts ConfigFiles (paths to HCL/JSON files), FlagValues (command-line arguments), and optional Overrides for testing or special cases. The builder validates file extensions and skips non-HCL/JSON files in directories with a warning.

RuntimeConfig Construction

The RuntimeConfig struct represents the fully resolved configuration after all sources are merged. Key sections include:

  • Network Configuration – Bind addresses, advertise addresses, ports for RPC, DNS, HTTP, gRPC
  • Cluster Settings – Datacenter, node name, bootstrap mode, Raft parameters
  • ACL Configuration – Token settings, policy TTLs, default policies
  • TLS Configuration – Certificate paths, verification modes, minimum TLS versions
  • Gossip Protocol – Serf LAN/WAN settings, probe intervals, suspicion multipliers
  • Service Discovery – DNS settings, service TTLs, recursors
  • Connect/Service Mesh – CA provider, virtual IP CIDRs, mesh gateway settings

Agent Initialization

The Agent struct in agent/agent.go orchestrates startup through the New() and Start() methods:

New() creates the agent instance and registers cache types. It initializes:

  • Token store for ACL tokens
  • Service manager for proxy configuration
  • RPC clients for health, config entries, and peering
  • File watcher for auto-reload capability

Start() brings up all agent subsystems:

  • Creates local state and anti-entropy synchronizer
  • Initializes Consul server or client based on ServerMode
  • Starts DNS, HTTP, HTTPS, and gRPC listeners
  • Launches proxy configuration manager
  • Begins retry join attempts and watch plan execution

Consul Server/Client Configuration

The newConsulConfig() function translates RuntimeConfig into consul.Config. This includes:

  • Raft configuration (election timeout, heartbeat timeout, snapshot settings)
  • Serf LAN/WAN configuration (bind addresses, gossip parameters)
  • ACL resolver settings (token TTLs, default policies)
  • Autopilot configuration (dead server cleanup, stabilization time)
  • Connect CA configuration (provider type, certificate TTLs)

Configuration Validation

The builder validates:

  • Port ranges (DNS, HTTP, gRPC, Serf, RPC)
  • Address formats (IPv4/IPv6 compatibility)
  • Raft multiplier bounds (1 to 10)
  • Virtual IP CIDR blocks for Connect
  • Deprecated configuration keys with warnings

Invalid configurations return errors during the Load() call, preventing agent startup with broken settings.

Dynamic Configuration Reload

Agents support configuration reload via SIGHUP signal. The ReloadConfig() method updates:

  • Request rate limits
  • RPC timeouts and burst settings
  • Raft snapshot thresholds
  • Config entry bootstrap entries
  • Reporting settings

File watchers can trigger automatic reloads when TLS certificates or config files change, controlled by the AutoReloadConfig setting.

Health Checking System

Relevant Files
  • agent/checks/check.go
  • agent/consul/health_endpoint.go
  • agent/agent.go
  • agent/structs/check_type.go

Consul's health checking system enables agents to monitor service and node health through periodic checks. The system supports multiple check types, each suited for different monitoring scenarios.

Check Types

Consul supports nine distinct check types, each with specific use cases:

  • Script: Executes a custom script at regular intervals. Exit code 0 = passing, 1 = warning, other = critical.
  • HTTP: Makes periodic HTTP requests. Status codes 2xx = passing, 429 = warning, others = critical.
  • TCP: Attempts TCP connections to verify service availability.
  • UDP: Sends UDP datagrams and validates responses.
  • gRPC: Sends gRPC health check requests following the standard gRPC health protocol.
  • H2PING: Performs HTTP/2 ping operations to verify connectivity.
  • TTL: Client-driven checks where the client must periodically update status. Automatically marks critical if TTL expires.
  • Docker: Executes scripts inside Docker containers using the Docker API.
  • OS Service: Monitors Windows services or systemd services on Linux.

Check Lifecycle

Each check type (except TTL) runs in its own goroutine with a configurable interval. The lifecycle follows this pattern:

  1. Initialization: Check is created with configuration (interval, timeout, target URL/address, etc.)
  2. Start: Start() method launches the monitoring goroutine
  3. Periodic Execution: Check runs at specified intervals with randomized initial delay to prevent thundering herd
  4. Status Update: Results are passed to StatusHandler which applies threshold logic
  5. Stop: Stop() method gracefully terminates the goroutine

Status Handling and Thresholds

The StatusHandler implements failure and success thresholds to prevent flapping:

  • Success Before Passing: Number of consecutive passing checks before status changes to passing
  • Failures Before Warning: Threshold for transitioning to warning state
  • Failures Before Critical: Threshold for transitioning to critical state

This prevents temporary network glitches from immediately marking services as unhealthy.
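
Thresholds are set per check at registration time. A sketch with the api client that registers an HTTP check with anti-flapping thresholds plus a TTL check the application refreshes itself; IDs, URLs, and values are illustrative.

package main

import (
    "log"
    "time"

    "github.com/hashicorp/consul/api"
)

func main() {
    client, err := api.NewClient(api.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }
    agent := client.Agent()

    // An HTTP check with anti-flapping thresholds.
    err = agent.CheckRegister(&api.AgentCheckRegistration{
        ID:   "web-http",
        Name: "web HTTP health",
        AgentServiceCheck: api.AgentServiceCheck{
            HTTP:                   "http://127.0.0.1:8080/health",
            Interval:               "10s",
            Timeout:                "2s",
            SuccessBeforePassing:   3,
            FailuresBeforeCritical: 2,
        },
    })
    if err != nil {
        log.Fatal(err)
    }

    // A TTL check that must be refreshed before it expires.
    err = agent.CheckRegister(&api.AgentCheckRegistration{
        ID:   "web-heartbeat",
        Name: "web heartbeat",
        AgentServiceCheck: api.AgentServiceCheck{
            TTL: "30s",
        },
    })
    if err != nil {
        log.Fatal(err)
    }
    for {
        if err := agent.UpdateTTL("web-heartbeat", "ok", api.HealthPassing); err != nil {
            log.Println("heartbeat update failed:", err)
        }
        time.Sleep(10 * time.Second)
    }
}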

Health Endpoint

The Health RPC endpoint (agent/consul/health_endpoint.go) provides query capabilities:

  • ChecksInState: Retrieve all checks in a specific state (passing, warning, critical)
  • NodeChecks: Get all checks for a specific node
  • ServiceChecks: Get all checks for a specific service
  • ServiceNodes: Get healthy nodes running a service with health information

All queries support ACL filtering, bexpr filtering, and node metadata filtering.
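
These queries map directly onto the api package's Health client. A brief sketch; the service name is illustrative.

package main

import (
    "fmt"
    "log"

    "github.com/hashicorp/consul/api"
)

func main() {
    client, err := api.NewClient(api.DefaultConfig())
    if err != nil {
        log.Fatal(err)
    }
    health := client.Health()

    // All checks currently in the critical state (ChecksInState).
    critical, _, err := health.State(api.HealthCritical, nil)
    if err != nil {
        log.Fatal(err)
    }
    for _, c := range critical {
        fmt.Printf("%s/%s is critical: %s\n", c.Node, c.CheckID, c.Output)
    }

    // Healthy instances of a service (ServiceNodes with passing-only filtering).
    entries, _, err := health.Service("web", "", true, nil)
    if err != nil {
        log.Fatal(err)
    }
    for _, e := range entries {
        fmt.Printf("%s:%d on %s\n", e.Service.Address, e.Service.Port, e.Node.Node)
    }
}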

Output Management

Check output is captured in circular buffers with configurable maximum size (default 4KB) to prevent excessive memory consumption. Output is truncated with a notice if it exceeds the limit.

Timeout and Execution Safety

  • Minimum interval enforced to prevent fork bombing (1 second minimum)
  • Configurable timeouts prevent hung checks from blocking the system
  • Script checks use process subtrees to ensure proper cleanup
  • Concurrent check execution is prevented through synchronization primitives