kubernetes/kubernetes

Kubernetes Core Architecture

Last updated on Dec 18, 2025 (Commit: 3347801)

Overview

Relevant Files
  • README.md
  • pkg/apis/core/register.go
  • pkg/apis/apps/types.go
  • pkg/apis/batch/types.go
  • pkg/apis (API definitions)
  • cmd (Command-line tools)
  • pkg/controller (Controllers)

Kubernetes is an open-source container orchestration platform that automates deployment, scaling, and management of containerized applications across clusters of machines. Built on lessons from Google's Borg system, it provides a declarative approach to infrastructure management.

Core Architecture

The codebase is organized around Kubernetes' fundamental design principles:

  • API-Driven Design: Everything in Kubernetes is a resource exposed through REST APIs. The pkg/apis directory contains type definitions for core resources like Pods, Services, Deployments, and Jobs.
  • Controllers: The pkg/controller directory houses reconciliation loops that continuously work to match the desired state (spec) with the actual state (status).
  • Command-Line Tools: The cmd directory contains binaries for key components: kube-apiserver, kube-controller-manager, kube-scheduler, kubelet, and kubectl.

Key Resource Types

The API layer defines several resource categories:

Core API (pkg/apis/core): Fundamental resources including Pods, Services, Nodes, Namespaces, ConfigMaps, Secrets, and PersistentVolumes. These form the foundation of cluster operations.

Apps API (pkg/apis/apps): Workload controllers including Deployments, StatefulSets, DaemonSets, and ReplicaSets. These manage pod replicas with different guarantees and update strategies.

Batch API (pkg/apis/batch): Job and CronJob resources for running batch workloads with completion semantics and scheduling capabilities.

Declarative Model

Kubernetes resources declare a desired state in their spec; controllers continuously observe the actual state and reconcile toward it, recording progress in the resource's status.

Workload Management

Kubernetes provides multiple abstractions for running workloads:

  • Pods: Smallest deployable units, consisting of one or more containers
  • Deployments: Declarative updates for stateless applications with rolling updates
  • StatefulSets: Ordered, stable pod identities for stateful applications
  • DaemonSets: Ensure pods run on every node (or selected nodes)
  • Jobs: Run pods to completion with configurable retry and parallelism policies

API Registration System

The pkg/apis/core/register.go file demonstrates Kubernetes' type registration pattern. All resource types must be registered with the scheme to enable serialization, deserialization, and API discovery. This allows the system to handle multiple API versions and maintain backward compatibility.

Development Structure

The repository follows a modular layout:

  • pkg/: Core libraries and controllers
  • cmd/: Executable binaries for cluster components
  • test/: Comprehensive test suites (unit, integration, e2e)
  • hack/: Build and development scripts
  • staging/: Publishable libraries extracted as separate modules

Architecture & Core Components

Relevant Files
  • cmd/kube-apiserver/app/server.go
  • cmd/kube-controller-manager/controller-manager.go
  • cmd/kube-scheduler/scheduler.go
  • cmd/kubelet/kubelet.go
  • pkg/controlplane/apiserver/server.go
  • pkg/scheduler/scheduler.go
  • pkg/scheduler/framework/runtime/framework.go

Kubernetes is built on a distributed control plane architecture where multiple independent components work together to manage cluster state. Each component has a specific responsibility and communicates through the API server.

Core Components

kube-apiserver is the central hub of the Kubernetes control plane. It validates and configures data for cluster objects (pods, services, deployments, etc.) and serves the REST API that all other components use. The API server is in fact a delegation chain of three distinct servers: the core Kubernetes API, the API extensions server (CRDs), and the aggregation layer. This composition is assembled through a layered configuration system that begins with command-line flags parsed into a Config object.

kube-controller-manager monitors the cluster state and drives it toward the desired state. It runs multiple controllers (Deployment, StatefulSet, DaemonSet, Job, etc.) that watch for resource changes via informers and reconcile actual state with desired state. Controllers use leader election for high availability and can be selectively enabled or disabled.

kube-scheduler assigns unscheduled pods to nodes. It watches for new pods in the scheduling queue and uses a pluggable framework to evaluate nodes. The scheduler runs two cycles: a scheduling cycle that finds a suitable node and assumes the pod, and a binding cycle that persists the binding asynchronously.

kubelet runs on each node and ensures containers are running in pods. It syncs pod specifications from multiple sources (API server, config files, HTTP endpoints) and communicates with the container runtime to start, stop, and monitor containers.

Scheduler Framework

The scheduler uses an extensible plugin framework with multiple extension points:

  1. PreEnqueue – Filter pods before adding to queue
  2. QueueSort – Sort pods in the scheduling queue
  3. PreFilter – Pre-processing before filtering
  4. Filter – Eliminate nodes that cannot run the pod
  5. PostFilter – Run only if no feasible nodes found
  6. PreScore – Pre-processing before scoring
  7. Score – Rank feasible nodes
  8. Reserve – Reserve resources for the pod
  9. Permit – Approve or delay pod binding
  10. PreBind – Pre-binding setup
  11. Bind – Persist the pod-to-node binding
  12. PostBind – Post-binding cleanup

Data Flow

The API server stores all cluster state in etcd. Controllers watch resources via informers and react to changes. The scheduler watches for unscheduled pods and assigns them to nodes. Kubelets watch their assigned pods and ensure the container runtime maintains the desired state. This watch-and-reconcile pattern ensures the cluster converges toward the desired state even when components fail or restart.

API Server & Storage

Relevant Files
  • pkg/kubeapiserver/options
  • pkg/registry/core/rest/storage_core.go
  • staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go
  • staging/src/k8s.io/apiserver/pkg/server/genericapiserver.go
  • staging/src/k8s.io/apiserver/pkg/server/handler.go
  • staging/src/k8s.io/apiserver/pkg/endpoints/installer.go

Overview

The Kubernetes API server is the central control plane component that exposes the Kubernetes API and manages all cluster state. It receives HTTP requests, applies authentication and authorization policies, validates operations through admission controllers, and persists data to etcd3. The architecture separates concerns into request handling, storage abstraction, and backend persistence.

Request Flow

Incoming HTTP requests follow a structured pipeline:

  1. HTTP Handler Chain: Requests enter through APIServerHandler, which chains multiple filters for authentication, authorization, request info extraction, and audit logging.
  2. Routing: The Director routes requests to either the GoRestful container (for registered API endpoints) or the NonGoRestful mux (for other paths).
  3. API Installation: The APIInstaller registers resource handlers for each API group version, mapping HTTP verbs (GET, POST, PUT, DELETE) to storage operations.
  4. Admission Control: Before persistence, requests pass through admission webhooks and plugins for validation and mutation.
  5. Storage Backend: Operations are delegated to the appropriate storage implementation (etcd3 for persistent data).

GenericAPIServer Architecture

The GenericAPIServer is the core server component that manages:

  • Handler Chain: Coordinates authentication, authorization, and request processing
  • API Groups: Manages multiple API group versions with their respective storage backends
  • Lifecycle: Handles server startup, graceful shutdown, and post-start hooks
  • Discovery: Serves API discovery endpoints at /apis and /api
  • OpenAPI: Exposes OpenAPI v2 and v3 specifications for API documentation

type GenericAPIServer struct {
    Handler *APIServerHandler
    LoopbackClientConfig *restclient.Config
    StorageVersionManager storageversion.Manager
    // ... additional fields
}

Storage Layer

The storage layer abstracts the backend persistence mechanism through the storage.Interface:

  • etcd3 Store: Primary implementation using etcd3 as the distributed key-value store
  • Key Prefixing: Resources are stored with hierarchical prefixes (e.g., /registry/pods/default/my-pod)
  • Versioning: Each object has a resource version for optimistic concurrency control
  • Transformers: Support encryption and other value transformations before persistence
  • Watchers: Enable real-time change notifications through etcd3 watch streams

type store struct {
    client *kubernetes.Client
    codec runtime.Codec
    versioner storage.Versioner
    transformer value.Transformer
    pathPrefix string
}

Resource Storage Registration

Core resources (Pods, Services, Nodes, etc.) are registered through storage_core.go:

  • Each resource gets a dedicated REST storage handler implementing CRUD operations
  • Storage handlers are composed into an APIGroupInfo structure
  • The RESTOptionsGetter provides configuration for storage backends (etcd endpoints, encryption keys, etc.)
  • Allocators manage IP ranges for Services and node ports

Key Concepts

Storage Versions: The API server tracks which version it uses to encode objects in etcd. This enables safe schema evolution and rolling upgrades across multiple API server instances.

Admission Pipeline: Requests are validated and mutated by admission controllers before reaching storage, enabling policy enforcement and resource defaults.

Graceful Shutdown: The server drains in-flight requests and closes connections before terminating, ensuring data consistency.

Controllers & Reconciliation

Relevant Files
  • cmd/kube-controller-manager/app/controller_descriptor.go
  • cmd/kube-controller-manager/app/apps.go
  • pkg/controller/deployment/deployment_controller.go
  • pkg/controller/replicaset/replica_set.go
  • pkg/controller/statefulset/stateful_set.go

Overview

Kubernetes controllers implement the reconciliation pattern: they continuously observe the current state of resources and take actions to match the desired state. The controller manager orchestrates multiple controllers (Deployment, ReplicaSet, StatefulSet, etc.), each responsible for synchronizing specific resource types with actual cluster state.

Controller Registration & Initialization

Controllers are registered via ControllerDescriptor objects that wrap controller implementations with metadata:

type ControllerDescriptor struct {
    name                      string
    constructor               ControllerConstructor
    requiredFeatureGates      []featuregate.Feature
    aliases                   []string
    isDisabledByDefault       bool
    isCloudProviderController bool
    requiresSpecialHandling   bool
}

Each descriptor holds a constructor function that instantiates the controller when the manager starts. The NewControllerDescriptors() function registers all available controllers, validating that each has a unique name and valid constructor.

The Reconciliation Loop

The reconciliation pattern follows a standard worker queue pattern:

  1. Event Handlers - Informers watch API resources and trigger event handlers on Add/Update/Delete
  2. Work Queue - Events enqueue resource keys (namespace/name) into a rate-limited queue
  3. Worker Goroutines - Multiple workers dequeue items and invoke the sync handler
  4. Sync Handler - Fetches the resource, compares desired vs. actual state, and takes corrective actions

Deployment Controller Example

The DeploymentController manages Deployments by reconciling their ReplicaSets and Pods:

  • Watches: Deployments, ReplicaSets, and Pods
  • Sync Logic: Determines which ReplicaSets to scale up/down based on deployment strategy (rolling update or recreate)
  • Adoption: Uses ControllerRef to claim orphaned ReplicaSets matching the deployment's selector

The controller's syncDeployment method orchestrates the entire reconciliation, handling rollouts, rollbacks, and cleanup.

ReplicaSet Controller

The ReplicaSetController manages ReplicaSets by ensuring the correct number of Pods exist:

  • Expectations: Tracks expected Pod creations/deletions to avoid thrashing on transient state changes
  • Slow Start: When creating many Pods, batches them with exponential backoff to prevent API overload
  • Pod Adoption: Claims Pods matching the ReplicaSet's selector via ControllerRef
  • Status Updates: Continuously updates ReplicaSet status with replica counts and conditions

StatefulSet Controller

The StatefulSetController manages StatefulSets with ordered, stable Pod identities:

  • Monotonic Updates: Scales up in ordinal order; no new Pod created while any is unhealthy
  • Revision History: Maintains ControllerRevision objects for rollback support
  • Persistent Volumes: Manages PVC lifecycle tied to Pod ordinals
  • Burst Mode: Optional relaxed ordering for faster scaling (with consistency trade-offs)

Error Handling & Retries

Controllers use exponential backoff for failed reconciliations:

  • Rate Limiting: Failed items are re-enqueued with increasing delays (5ms, 10ms, 20ms, …)
  • Max Retries: After 15 retries, items are dropped from the queue to prevent infinite loops
  • Namespace Termination: Special handling for resources in terminating namespaces (no retries)

Key Patterns

ControllerRef: Ownership mechanism using metadata.ownerReferences to prevent multiple controllers from managing the same resource.

Expectations: TTL-based tracking of expected state changes, allowing controllers to wait for informer events before re-syncing.

Informer Sync: Controllers wait for all informer caches to sync before starting workers, ensuring consistent initial state.

Scheduler & Plugin Framework

Relevant Files
  • pkg/scheduler/scheduler.go
  • pkg/scheduler/schedule_one.go
  • pkg/scheduler/framework/interface.go
  • pkg/scheduler/framework/runtime/framework.go
  • pkg/scheduler/apis/config/v1/default_plugins.go

The Kubernetes scheduler uses a plugin framework to make scheduling decisions. This architecture allows extensibility while maintaining a clear, predictable scheduling workflow.

Core Components

Scheduler (pkg/scheduler/scheduler.go) is the main orchestrator that:

  • Watches for unscheduled pods in the scheduling queue
  • Calls ScheduleOne() to process each pod
  • Manages the node cache and extenders
  • Handles scheduling failures and retries

Framework (pkg/scheduler/framework/runtime/framework.go) manages plugin execution at defined extension points. It initializes plugins from a registry and orchestrates their execution in sequence.

Scheduling Cycle

Each pod goes through a single scheduling cycle via ScheduleOne():

// Simplified flow
1. Get next pod from queue
2. Create CycleState (shared state for this cycle)
3. Run scheduling algorithm (find feasible nodes)
4. Bind pod to selected node
5. Handle success or failure

Plugin Extension Points

Plugins hook into the scheduling workflow at these points (in order):

  1. PreFilter – Reject pods early or compute pod-level info (e.g., required ports, topology constraints)
  2. Filter – Eliminate nodes that cannot run the pod (e.g., insufficient resources, port conflicts)
  3. PostFilter – Run only if no feasible nodes found; can trigger preemption or pod rejection
  4. PreScore – Compute pod-level scoring data before ranking nodes
  5. Score – Rank feasible nodes; plugins return scores that are summed per node
  6. Reserve – Reserve resources on the chosen node (or Unreserve on failure)
  7. Permit – Final gate before binding; can delay or reject the pod
  8. PreBind – Perform setup work before binding (e.g., mount volumes)
  9. Bind – Bind the pod to the node (default: write to API server)
  10. PostBind – Cleanup or notifications after successful binding

Default Plugins

The default plugin set (from default_plugins.go) includes:

  • Filtering: NodeUnschedulable, NodeName, TaintToleration, NodeAffinity, NodePorts, NodeResourcesFit, VolumeRestrictions, VolumeBinding
  • Scoring: TaintToleration, NodeAffinity, NodeResourcesBalancedAllocation, ImageLocality, PodTopologySpread, InterPodAffinity
  • Binding: DefaultBinder, DefaultPreemption

Plugins can be enabled/disabled or configured via scheduler profiles.

Plugin State Management

CycleState stores data shared across plugins in a single scheduling cycle. Plugins write computed state (e.g., prefilter results) that later plugins read:

state.Write(key, data)      // Store data
data, err := state.Read(key) // Retrieve data

This avoids recomputing expensive operations across multiple plugins.

Extenders

Extenders are HTTP webhooks that run after the framework’s filter and score phases. They allow external systems to influence scheduling decisions without modifying the core scheduler.


Configuration

Scheduler profiles allow different configurations for different workloads. Each profile specifies which plugins are enabled at each extension point and their weights (for scoring plugins). The MultiPoint field simplifies configuration by enabling a plugin across all applicable extension points.

Kubelet & Pod Lifecycle

Relevant Files
  • pkg/kubelet/kubelet.go
  • pkg/kubelet/pod_workers.go
  • pkg/kubelet/kuberuntime/kuberuntime_manager.go
  • pkg/kubelet/kuberuntime/kuberuntime_container.go
  • pkg/kubelet/cm/container_manager_linux.go
  • pkg/kubelet/lifecycle/handlers.go

The kubelet manages pod and container lifecycles through a multi-layered architecture that reconciles desired state with runtime state. Understanding this flow is critical for debugging pod issues and extending kubelet functionality.

Pod Worker Architecture

The pod worker system is the core orchestration engine. Each pod gets its own goroutine (pod worker) that drives it through four sequential phases:

  1. Wait to start – Ensures no two pods with the same UID run simultaneously
  2. Sync – Reconciles desired pod spec with runtime state
  3. Terminating – Stops all running containers gracefully
  4. Terminated – Cleans up resources before pod deletion

Pod workers are managed by podWorkers in pod_workers.go, which is the source of truth for what pods should be active on a node. The UpdatePod() method receives configuration changes or termination signals, and the podWorkerLoop() processes these updates sequentially.

Pod Sync Workflow

When a pod enters the sync phase, the kubelet executes SyncPod() which is reentrant and converges the pod toward its desired state:

// High-level SyncPod workflow
1. Record pod start latency metrics
2. Generate API pod status
3. Update status manager
4. Create mirror pod (if static pod)
5. Create pod data directories
6. Wait for volumes to attach/mount
7. Fetch pull secrets
8. Call container runtime SyncPod
9. Update traffic shaping rules

If SyncPod() completes without error, the pod's runtime state matches the desired configuration. Transient errors trigger retries with backoff. If containers reach a terminal phase under restart policy Never or OnFailure, the pod transitions to terminating.

Container Runtime Integration

The kubeGenericRuntimeManager implements the actual container lifecycle through SyncPod():


The runtime manager executes 8 steps: compute changes, kill sandbox if needed, kill unwanted containers, create sandbox, create ephemeral containers, create init containers, resize containers (if scaling), and create normal containers.

Container Lifecycle Hooks

Containers support PostStart and PreStop lifecycle hooks defined in the pod spec. These execute via the handler runner:

  • PostStart – Runs after container starts; blocks container readiness
  • PreStop – Runs before container termination; respects grace period

Handlers support three types: Exec (run command), HTTPGet (HTTP request), and Sleep (delay). Failures are logged but don't prevent container startup/termination.

Pod Termination

When a pod is deleted or evicted, it transitions to terminating state. The kubelet:

  1. Sends SIGTERM to all containers with the grace period
  2. Runs PreStop hooks (if defined)
  3. Waits for containers to exit gracefully
  4. Sends SIGKILL if grace period expires
  5. Transitions to terminated state for resource cleanup

The pod worker ensures termination completes before the pod can be deleted from the node.

Container Manager Integration

The container manager handles resource allocation and isolation:

  • CPU and memory management via cgroups
  • Topology-aware resource allocation
  • Dynamic resource allocation (DRA) for device plugins
  • Internal lifecycle hooks for resource cleanup

When containers stop, PostStopContainer() is called to release allocated resources immediately, allowing reallocation if the container restarts.

Networking & Service Proxy

Relevant Files
  • pkg/proxy/iptables/proxier.go
  • pkg/proxy/ipvs/proxier.go
  • pkg/proxy/nftables/proxier.go
  • cmd/kube-proxy/app/server.go
  • cmd/kube-proxy/app/server_linux.go
  • pkg/proxy/types.go
  • pkg/proxy/config/config.go

Overview

Kubernetes networking is implemented through kube-proxy, a node-level component that manages service networking and load balancing. The proxy layer translates Kubernetes Service abstractions into actual network rules on each node, enabling pod-to-service communication and external traffic routing.

Architecture

kube-proxy runs on every node, watches Services and EndpointSlices through the API server, and programs the kernel's packet-processing rules to match.

Proxy Modes

Kubernetes supports three proxy implementations on Linux, each using the kernel's netfilter subsystem:

1. IPTables Mode (Default)

  • Uses iptables rules for packet filtering and NAT
  • Simpler rule structure but can have performance issues with many services
  • Each service creates multiple iptables rules
  • Suitable for clusters with <1000 services

2. IPVS Mode (Deprecated)

  • IP Virtual Server provides kernel-level load balancing
  • Better performance and scalability than iptables
  • Supports advanced load balancing algorithms (round-robin, least connections, locality-aware)
  • Relies on iptables together with ipset for packet filtering and masquerading
  • Marked for deprecation in favor of nftables

3. NFTables Mode (Modern)

  • Newer netfilter framework with improved performance
  • Unified rule syntax replacing iptables
  • Better scalability for large clusters
  • Supports both IPv4 and IPv6 natively
  • Recommended for new deployments

Core Components

Provider Interface (pkg/proxy/types.go) All proxiers implement the Provider interface with key methods:

  • Sync() - Immediately synchronizes proxy rules
  • SyncLoop() - Runs periodic synchronization
  • Handlers for Services, EndpointSlices, and Node topology

MetaProxier (pkg/proxy/metaproxier/meta_proxier.go) Enables dual-stack (IPv4 & IPv6) operation by dispatching calls to separate single-stack proxier instances.

Service Proxying Flow

  1. Service Discovery: kube-proxy watches Services and EndpointSlices via informers
  2. Rule Generation: Based on service type and configuration, generates appropriate netfilter rules
  3. DNAT (Destination NAT): Rewrites traffic from service IPs to endpoint IPs
  4. SNAT/Masquerade: Ensures return traffic routes correctly back through the node
  5. Filtering: Applies policies like LoadBalancerSourceRanges and traffic policies

Configuration

Proxy mode is selected via --proxy-mode flag or KubeProxyConfiguration:

  • Default: iptables
  • Requires kernel support and appropriate binaries (iptables, nft, etc.)
  • Platform-specific setup in server_linux.go validates prerequisites

Key Features

  • Dual-Stack Support: IPv4 and IPv6 simultaneously via MetaProxier
  • Health Checking: Service endpoint health monitoring
  • Connection Tracking: Manages stateful connections via conntrack
  • Metrics: Prometheus metrics for monitoring proxy performance
  • Local Traffic Detection: Identifies local vs. remote endpoints for traffic policies

Storage & Volumes

Relevant Files
  • pkg/volume/plugins.go
  • pkg/volume/volume.go
  • pkg/volume/csi/csi_plugin.go
  • pkg/kubelet/volumemanager/volume_manager.go
  • pkg/controller/volume/attachdetach/
  • pkg/apis/storage/types.go

Kubernetes volumes provide persistent and ephemeral storage to pods. The volume system is built on a plugin architecture that supports diverse storage backends, from local directories to cloud storage and CSI drivers.

Volume Plugin Architecture

The volume system uses a plugin-based design where each storage type implements the VolumePlugin interface. Plugins are registered at kubelet startup and managed by the VolumePluginMgr. Each plugin must implement:

  • Init: Initialize the plugin with a VolumeHost reference
  • GetPluginName: Return a namespaced plugin identifier (e.g., kubernetes.io/csi)
  • NewMounter/NewUnmounter: Create volume mount/unmount handlers
  • CanSupport: Determine if the plugin handles a given volume spec

Built-in plugins include: EmptyDir, HostPath, Local, NFS, iSCSI, Fibre Channel, ConfigMap, Secret, Projected, and CSI.

Volume Lifecycle: Attach, Mount, Unmount, Detach

Volumes follow a four-stage lifecycle managed by the VolumeManager (kubelet) and AttachDetach Controller (control plane):

  1. Attach: For attachable volumes (e.g., cloud disks), the controller attaches the device to the node
  2. Mount: The kubelet mounts the volume to a pod-specific directory
  3. Unmount: When a pod terminates, the volume is unmounted from the pod
  4. Detach: The controller detaches the device from the node when no pods need it

Desired State vs. Actual State

The volume system maintains two state caches to reconcile reality with intent:

  • Desired State of World (DSW): Tracks volumes that should be attached/mounted based on pod specs
  • Actual State of World (ASW): Tracks volumes that are actually attached/mounted on the node

A reconciler runs periodically, comparing DSW and ASW, and triggering attach/mount/unmount/detach operations to converge them. This ensures volumes are available when pods start and cleaned up when pods terminate.

CSI and Ephemeral Volumes

The Container Storage Interface (CSI) plugin enables third-party storage drivers to integrate with Kubernetes. CSI drivers support two volume lifecycle modes:

  • Persistent: Traditional PersistentVolume/PersistentVolumeClaim model
  • Ephemeral: Inline volumes with pod lifecycle (created and destroyed with the pod)

CSI drivers register with the kubelet via a socket-based protocol, allowing dynamic driver discovery and hot-reload without kubelet restarts.

Volume Metrics and Monitoring

Volumes implement the MetricsProvider interface to expose usage statistics:

  • Capacity: Total storage size
  • Used: Bytes currently in use
  • Available: Free space remaining
  • Inodes: Filesystem inode usage

Metrics are collected via du (disk usage), statfs (filesystem stats), or block device queries depending on volume type.

Security & Authorization

Relevant Files
  • pkg/kubeapiserver/authenticator/config.go
  • pkg/kubeapiserver/authorizer/config.go
  • pkg/kubeapiserver/options/authentication.go
  • pkg/kubeapiserver/options/authorization.go
  • pkg/auth/authorizer/abac/abac.go
  • plugin/pkg/auth/authorizer/rbac/rbac.go
  • plugin/pkg/auth/authorizer/node/node_authorizer.go
  • pkg/apis/rbac/types.go
  • pkg/serviceaccount/jwt.go

Kubernetes security is built on two foundational pillars: authentication (verifying who you are) and authorization (determining what you can do). The API server enforces both before processing any request.

Authentication: Identifying the User

Authentication is handled by a chain of authenticators that attempt to identify the caller. Each authenticator is tried in sequence until one succeeds. The chain includes:

  • Client Certificates (mTLS) - Verifies X.509 certificates from the client
  • Bearer Tokens - Validates tokens from the Authorization header
    • Service Account tokens (JWT format)
    • Bootstrap tokens (for cluster initialization)
    • Token files (static token list)
    • OIDC tokens (external identity providers)
    • Webhook tokens (custom external validation)
  • Request Headers - Extracts identity from HTTP headers (for reverse proxies)
  • Anonymous - Allows unauthenticated requests if explicitly enabled

The authenticator chain is built in pkg/kubeapiserver/authenticator/config.go. Token authenticators are wrapped with optional caching to improve performance. If authentication succeeds, the user information is attached to the request context.

Authorization: Controlling Access

After authentication, the authorization chain determines if the authenticated user can perform the requested action. Multiple authorization modes can be chained together:

  • RBAC (Role-Based Access Control) - The default and most common mode. Uses Role, RoleBinding, ClusterRole, and ClusterRoleBinding objects to define permissions
  • Node - Specialized authorizer for kubelet requests, restricting node access to their own resources and pods
  • ABAC (Attribute-Based Access Control) - Legacy mode using a policy file with attribute matching
  • Webhook - Delegates authorization decisions to an external service
  • AlwaysAllow - Permits all requests (development only)
  • AlwaysDeny - Rejects all requests

RBAC: The Permission Model

RBAC is structured around PolicyRules that define what actions are allowed:

type PolicyRule struct {
    Verbs           []string // get, create, update, delete, etc.
    APIGroups       []string // API groups (e.g., "apps", "" for core)
    Resources       []string // pods, services, etc.
    ResourceNames   []string // specific resource names (optional)
    NonResourceURLs []string // non-resource paths (e.g., "/api")
}

Rules are grouped into Roles (namespace-scoped) or ClusterRoles (cluster-wide). RoleBindings and ClusterRoleBindings connect rules to Subjects (users, groups, or service accounts). Authorization checks ClusterRoleBindings first, then RoleBindings in the target namespace, returning Allow on the first match or Deny by default.

Node Authorization

The Node authorizer enforces kubelet-specific restrictions. Kubelets can only access resources related to their own node: their Node object, pods scheduled on them, and secrets/configmaps referenced by those pods. This prevents one node from accessing another node's data.

Request Flow

  1. Request arrives at API server
  2. Authentication chain identifies the user
  3. Authorization chain checks if user can perform the action
  4. If both pass, the request proceeds to admission controllers and handlers
  5. If either fails, the request is rejected with 401 (auth) or 403 (authz)

Authorization decisions can be cached and reloaded dynamically, allowing policy changes without restarting the server.

Client Tools & CLI

Relevant Files
  • cmd/kubectl/kubectl.go
  • staging/src/k8s.io/kubectl/pkg/cmd/cmd.go
  • cmd/kubeadm/kubeadm.go
  • cmd/kubeadm/app/cmd/cmd.go
  • staging/src/k8s.io/kubectl/pkg/cmd/plugin.go

Kubernetes provides two primary command-line tools for cluster management: kubectl for day-to-day operations and kubeadm for cluster bootstrapping and lifecycle management.

kubectl: The Kubernetes Control Tool

kubectl is the primary CLI for interacting with Kubernetes clusters. It follows a hierarchical command structure organized into logical groups:

  • Basic Commands: create, get, delete, apply, replace
  • Cluster Management: certificate, cluster-info, top, drain, taint
  • Troubleshooting: describe, logs, attach, exec, port-forward, debug
  • Advanced Commands: diff, patch, wait, kustomize
  • Settings: label, annotate, completion

The command initialization flow starts in cmd/kubectl/kubectl.go, which calls cmd.NewDefaultKubectlCommand(). This creates a Cobra command tree with all subcommands registered through the Factory pattern. The Factory provides abstractions for REST clients, dynamic clients, and resource builders, allowing commands to work with any Kubernetes resource.

Plugin System

kubectl supports extensibility through a plugin system that allows users to add custom commands. Plugins are discovered and executed via the PluginHandler interface:

  • Discovery: Plugins are executable files on the user's PATH prefixed with kubectl- (e.g., kubectl-myplugin)
  • Lookup: The DefaultPluginHandler searches for plugins using exec.LookPath() with valid prefixes
  • Execution: Plugins are invoked via syscall.Exec() on Unix or cmd.Run() on Windows
  • Subcommand Plugins: Only the create command allows plugins as subcommands (e.g., kubectl create myplugin)

Plugin discovery happens in NewDefaultKubectlCommandWithArgs() before command execution. Users can list available plugins with kubectl plugin list.

kubeadm: Cluster Bootstrapping Tool

kubeadm initializes and manages Kubernetes clusters through a phase-based workflow system. Main commands include:

  • init: Bootstrap a control plane node
  • join: Add nodes to an existing cluster
  • reset: Revert changes made by init or join
  • upgrade: Upgrade cluster components
  • token: Manage bootstrap tokens
  • certs: Manage certificates
  • config: Manage kubeadm configuration

Each command (init, join, reset) uses a Runner that executes an ordered sequence of phases. For example, kubeadm init runs phases like preflight checks, certificate generation, kubeconfig creation, etcd setup, control plane deployment, and addon installation. Phases can have nested sub-phases and dependencies, enabling atomic execution of individual phases via kubeadm init phase <phase-name>.

Command Structure

Both tools use Cobra for command parsing and flag management. kubectl commands receive a Factory for accessing Kubernetes clients and resources. kubeadm commands use a RunData interface to share state across phases. Both support global flags like verbosity (-v) and configuration paths, with flag normalization via cliflag.WordSepNormalizeFunc.

// kubectl command initialization
command := cmd.NewDefaultKubectlCommand()
cli.RunNoErrOutput(command)

// kubeadm command initialization
cmd := cmd.NewKubeadmCommand(os.Stdin, os.Stdout, os.Stderr)
cmd.Execute()

The CLI architecture emphasizes modularity: kubectl's Factory pattern decouples commands from client implementations, while kubeadm's phase system allows reusable, composable workflows across different cluster operations.