torvalds/linux

Linux Kernel Architecture & Subsystems

Last updated on Dec 19, 2025 (Commit: dd9b004)

Overview

Relevant Files
  • README
  • init/main.c
  • kernel/fork.c
  • include/linux/kernel.h

The Linux kernel is a monolithic operating system kernel that manages hardware resources, process scheduling, memory management, and provides core services for all user-space applications. This repository contains the complete source code for the Linux kernel, supporting multiple architectures and configurations.

Core Architecture

The kernel is organized into several major subsystems:

  • Process Management (kernel/fork.c, kernel/sched/) - Task creation, scheduling, and lifecycle management
  • Memory Management (mm/) - Virtual memory, page allocation, and memory protection
  • Filesystem (fs/) - VFS layer, filesystem implementations, and I/O operations
  • Networking (net/) - Network stack, protocols, and device drivers
  • Device Drivers (drivers/) - Hardware abstraction and device support
  • Architecture Support (arch/) - CPU-specific code for x86, ARM, RISC-V, and others

Boot Process

The kernel initialization begins in init/main.c with the start_kernel() function, which orchestrates the boot sequence:

  1. Early Setup - CPU initialization, memory setup, and interrupt handling
  2. Core Subsystems - RCU, scheduler, memory allocator, and workqueues
  3. Device Initialization - Driver loading and hardware detection
  4. Userspace Launch - Execution of the init process (PID 1)
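
For orientation, the sketch below gives a heavily abridged view of start_kernel() in init/main.c; the real function calls many more setup routines, and the exact names and ordering vary between kernel versions.

/* Heavily abridged sketch of init/main.c:start_kernel(); order approximate */
asmlinkage __visible void __init start_kernel(void)
{
    char *command_line;

    setup_arch(&command_line);   /* early, architecture-specific setup */
    trap_init();                 /* exception and interrupt vectors */
    mm_core_init();              /* core memory allocators (mm_init() in older kernels) */
    sched_init();                /* runqueues and scheduling classes */
    init_IRQ();                  /* generic IRQ infrastructure */
    rcu_init();                  /* RCU state machine */
    rest_init();                 /* spawns kernel_init, which execs PID 1 */
}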

Process Creation

Process creation is handled by kernel/fork.c through the copy_process() function, which:

  • Duplicates the parent task structure and memory space
  • Allocates kernel stack and thread-local storage
  • Copies file descriptors, signal handlers, and credentials
  • Initializes scheduling and accounting structures
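
In rough outline, fork() and clone() funnel into kernel_clone(), which calls copy_process() and then wakes the child. The sketch below is heavily abridged from kernel/fork.c; error handling, tracing, and vfork completion are omitted, and signatures differ slightly between versions.

/* Abridged sketch of kernel/fork.c:kernel_clone() */
pid_t kernel_clone(struct kernel_clone_args *args)
{
    struct task_struct *p;
    struct pid *pid;
    pid_t nr;

    p = copy_process(NULL, 0, NUMA_NO_NODE, args);  /* duplicate the current task */
    if (IS_ERR(p))
        return PTR_ERR(p);

    pid = get_task_pid(p, PIDTYPE_PID);
    nr = pid_vnr(pid);                              /* child PID as seen by the caller */
    wake_up_new_task(p);                            /* place the child on a runqueue */
    put_pid(pid);
    return nr;
}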

Key Concepts

System States - The kernel tracks its operational state through system_state (defined in include/linux/kernel.h):

  • SYSTEM_BOOTING - Initial boot phase
  • SYSTEM_SCHEDULING - Scheduler ready
  • SYSTEM_RUNNING - Normal operation
  • SYSTEM_HALT/POWER_OFF/RESTART - Shutdown states
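
Code throughout the kernel consults system_state to pick a boot-safe or normal path. A minimal illustration (the enum values and variable are real; the helper and its policy are hypothetical):

#include <linux/kernel.h>

/* Hypothetical helper: only defer work once the system is fully up. */
static bool can_defer_work(void)
{
    /* system_state stays below SYSTEM_RUNNING until late in boot. */
    return system_state == SYSTEM_RUNNING;
}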

Preemption - The kernel supports several preemption models (none, voluntary, full), selectable at build time and switchable at boot on kernels built with dynamic preemption; helpers such as might_resched() and related macros mark safe voluntary preemption points.

Build System

The kernel uses a hierarchical Kbuild system with:

  • Makefile - Top-level build configuration
  • Kconfig - Configuration options and dependencies
  • scripts/ - Build utilities and code generation tools
  • Architecture-specific makefiles in arch/*/

Development Workflow

The kernel follows a structured development process with:

  • Patch submission via mailing lists (kernel.org)
  • Code review and testing requirements
  • Stable kernel maintenance branches
  • Regular release cycles with long-term support (LTS) versions

Architecture & Core Subsystems

Relevant Files
  • kernel/sched/core.c - Process scheduler and task management
  • mm/page_alloc.c - Memory allocation and page management
  • fs/super.c - Filesystem superblock and mount operations
  • net/core/dev.c - Network device management
  • kernel/irq/handle.c - Interrupt handling core
  • arch/x86/entry/entry_64.S - x86-64 entry point assembly

The Linux kernel is organized into several interconnected subsystems that manage hardware resources and provide abstractions for user-space applications. Understanding these core components is essential for kernel development.

Process Scheduler

The scheduler (kernel/sched/core.c) manages CPU time allocation among tasks. It uses a hierarchical class-based design with multiple scheduling classes:

  • Stop Class - Highest priority, used for CPU hotplug and migration
  • Deadline Class - Real-time tasks with deadline guarantees
  • Real-Time Class - Fixed-priority real-time scheduling
  • Fair Class - CFS (Completely Fair Scheduler) for normal tasks
  • Idle Class - Lowest priority, runs when no other tasks are ready

Each CPU has a runqueue (struct rq) containing tasks organized by their scheduling class. Within the fair class, the scheduler tracks virtual runtime (vruntime) to ensure fairness, selecting the runnable task with the smallest vruntime for execution.

/* Abridged per-CPU runqueue (kernel/sched/sched.h) */
struct rq {
    raw_spinlock_t __lock;      /* protects the runqueue */
    struct cfs_rq cfs;          /* fair (CFS) tasks */
    struct rt_rq rt;            /* real-time tasks */
    struct dl_rq dl;            /* deadline tasks */
    struct task_struct *curr;   /* currently running task */
};

Memory Management

The memory subsystem (mm/page_alloc.c) handles physical memory allocation and virtual address mapping. Key concepts include:

  • Zones - Physical memory regions (ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM)
  • Free Areas - Buddy allocator maintains free lists for each order (2^order pages)
  • Per-CPU Lists - Fast allocation path using per-CPU page caches
  • Watermarks - Thresholds triggering kswapd reclaim daemon

The allocator uses a two-level strategy: fast path via per-CPU lists, and slow path with zone locks and reclaim.

Interrupt Handling

Interrupts flow through a standardized path across architectures:

  1. Hardware IRQ - CPU receives interrupt signal
  2. Entry Code - Assembly saves registers and switches to kernel stack
  3. IRQ Handler - The generic IRQ layer (e.g., generic_handle_irq()) dispatches to the handlers registered for that IRQ
  4. Handler Execution - Device-specific interrupt service routine runs
  5. Exit - Restores context and returns to interrupted code

The generic IRQ subsystem (kernel/irq/handle.c) provides architecture-independent interrupt management with support for threaded handlers and dynamic IRQ allocation.

Filesystem Layer

The VFS (fs/super.c) provides a unified interface to different filesystems. Key structures:

  • Superblock - Filesystem metadata and operations
  • Inode - File metadata and operations
  • Dentry - Directory entry cache
  • File - Open file instance

Mount operations register filesystems and establish the directory tree hierarchy.

Networking Stack

The network subsystem (net/core/dev.c) manages network devices and packet flow:

  • Device Registration - Drivers register network devices
  • Packet Reception - Interrupts trigger packet processing
  • Protocol Handlers - IP, TCP, UDP process packets
  • Transmission - Queuing discipline (qdisc) manages outgoing packets

Synchronization Primitives

The kernel provides multiple synchronization mechanisms:

  • Spinlocks - Busy-wait locks for short critical sections
  • Mutexes - Sleep-based locks for longer operations
  • Semaphores - Counting synchronization primitives
  • RCU - Read-Copy-Update for lock-free reads

These primitives protect shared data structures across CPUs and prevent race conditions.

Boot and Initialization

Kernel initialization (init/main.c) follows a staged approach:

  1. Early Boot - Architecture-specific setup, memory initialization
  2. Core Subsystems - Scheduler, memory, IRQ initialization
  3. Device Drivers - Device discovery and registration
  4. Filesystem - Mount root filesystem
  5. Userspace - Execute init process

Each subsystem registers initialization functions that execute in dependency order, ensuring proper setup of the kernel environment.
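
This registration is done with initcall macros, which place functions into level-ordered sections executed by do_initcalls(). A minimal, hypothetical example:

#include <linux/init.h>
#include <linux/printk.h>

/* Hypothetical subsystem init: runs after core setup, before most device drivers. */
static int __init example_subsys_init(void)
{
    pr_info("example subsystem initialized\n");
    return 0;
}
subsys_initcall(example_subsys_init);
/* Other levels include core_initcall(), device_initcall() (module_init() for built-in
 * drivers), and late_initcall(), executed in that order during boot. */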

Process Management & Scheduler

Relevant Files
  • kernel/sched/core.c
  • kernel/sched/fair.c
  • kernel/fork.c
  • kernel/exit.c
  • kernel/signal.c

The Linux kernel manages process execution through a sophisticated scheduler that allocates CPU time fairly among competing tasks. The system uses a hierarchical class-based design where each CPU maintains a runqueue (struct rq) containing all runnable tasks organized by scheduling class.

Scheduling Classes

The kernel implements five scheduling classes in priority order:

  1. Stop Class - Highest priority, used for CPU hotplug and migration operations
  2. Deadline Class - Real-time tasks with explicit deadline guarantees (SCHED_DEADLINE)
  3. Real-Time Class - Fixed-priority real-time scheduling (SCHED_FIFO, SCHED_RR)
  4. Fair Class - CFS (Completely Fair Scheduler) for normal tasks (SCHED_NORMAL, SCHED_BATCH)
  5. Idle Class - Lowest priority, runs only when no other tasks are ready

Each class implements a sched_class interface with hooks for task enqueueing, dequeueing, and selection.
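
An abridged view of that interface is shown below; the full definition lives in kernel/sched/sched.h, has many more hooks, and its exact signatures vary between kernel versions.

/* Abridged sched_class; see kernel/sched/sched.h for the real definition. */
struct sched_class {
    void (*enqueue_task)(struct rq *rq, struct task_struct *p, int flags);
    void (*dequeue_task)(struct rq *rq, struct task_struct *p, int flags);
    void (*yield_task)(struct rq *rq);
    struct task_struct *(*pick_next_task)(struct rq *rq, struct task_struct *prev);
    void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
};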

Completely Fair Scheduler (CFS)

The Fair Class uses CFS, which ensures fairness by tracking virtual runtime (vruntime). Each task accumulates vruntime proportional to actual CPU time used, weighted by its nice level. The scheduler always selects the task with the smallest vruntime, ensuring all tasks receive their fair share of CPU time.

CFS maintains a red-black tree (tasks_timeline) where tasks are sorted by vruntime. The leftmost node represents the next task to run. When a task executes, its vruntime increases; once it exceeds other tasks' vruntime by a scheduling granularity threshold, preemption occurs.

Process Lifecycle

Fork & Initialization (kernel/fork.c): When a process is created via fork() or clone(), sched_fork() initializes scheduler state. The new task is marked TASK_NEW to prevent premature execution. The scheduler class is assigned based on priority: real-time tasks use rt_sched_class, while normal tasks use fair_sched_class.

Scheduling (kernel/sched/core.c): The schedule() function is called when a task yields or blocks. It invokes __schedule(), which selects the next runnable task via pick_next_task(). This function iterates through scheduling classes in priority order, allowing each class to select its highest-priority task.
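
Conceptually, that class iteration looks like the sketch below (simplified from kernel/sched/core.c; the real code short-circuits the common case where only fair and idle tasks are runnable).

/* Simplified sketch of the pick_next_task() class loop. */
static struct task_struct *pick_next_task_sketch(struct rq *rq, struct task_struct *prev)
{
    const struct sched_class *class;
    struct task_struct *p;

    for_each_class(class) {                    /* highest-priority class first */
        p = class->pick_next_task(rq, prev);
        if (p)
            return p;
    }
    return rq->idle;                           /* the idle class always has a task */
}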

Exit (kernel/exit.c): When a process terminates, do_exit() performs cleanup: releasing resources, notifying parent processes, and removing the task from the scheduler. The task transitions to TASK_DEAD state.

Task States

Tasks cycle through several states:

  • TASK_RUNNING - Runnable or currently executing
  • TASK_INTERRUPTIBLE - Sleeping, woken by signals or events
  • TASK_UNINTERRUPTIBLE - Sleeping, only woken by explicit wake-up
  • TASK_STOPPED - Stopped by debugger or job control
  • TASK_DEAD - Exited, awaiting reaping by parent
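
The classic wait-queue pattern that moves a task through these states uses real primitives from <linux/wait.h>; the wait queue and condition callback below are hypothetical.

#include <linux/sched/signal.h>
#include <linux/wait.h>

/* Hypothetical: sleep interruptibly until cond() becomes true or a signal arrives. */
static void wait_for_condition(wait_queue_head_t *wq, bool (*cond)(void))
{
    DEFINE_WAIT(wait);

    for (;;) {
        prepare_to_wait(wq, &wait, TASK_INTERRUPTIBLE);  /* RUNNING -> INTERRUPTIBLE */
        if (cond() || signal_pending(current))
            break;
        schedule();                                      /* yield the CPU until woken */
    }
    finish_wait(wq, &wait);                              /* back to TASK_RUNNING */
}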

Context Switching

Context switching occurs in context_switch(), which saves the previous task's state and loads the next task's state. This includes switching memory contexts (switch_mm()), updating CPU-local data, and performing architecture-specific register swaps via switch_to().

/* Simplified flow inside __schedule() */
rq_lock(rq);
next = pick_next_task(rq, prev);
if (next != prev)
    context_switch(rq, prev, next);  /* switch_mm() + switch_to(); releases the rq lock */
else
    rq_unlock(rq);

Load Balancing

The scheduler periodically balances load across CPUs to prevent some CPUs sitting idle while others are overloaded. The load_balance() function runs from the periodic scheduler tick (via the scheduler softirq) and when a CPU is about to go idle, migrating tasks between runqueues to even out utilization.

Memory Management

Relevant Files
  • mm/page_alloc.c - Buddy allocator and page allocation
  • mm/vma.c - Virtual memory area management
  • mm/mmap.c - Memory mapping syscalls
  • mm/slub.c - SLUB slab allocator
  • mm/swap.c - Page swapping and LRU management
  • mm/vmscan.c - Memory reclamation and kswapd

Linux memory management is a multi-layered system balancing performance, fairness, and resource constraints. It operates at three primary levels: physical page allocation, virtual address mapping, and object-level allocation.

Physical Page Allocation (Buddy Allocator)

The buddy allocator in page_alloc.c manages free physical pages using a power-of-2 free list strategy. Pages are organized into zones (ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM) and grouped by order (2^order pages). When a page is freed, it merges with its buddy if available, reducing fragmentation.

The allocator uses a two-path strategy:

  1. Fast Path: Per-CPU page caches (PCP lists) allow allocation without zone locks, reducing contention
  2. Slow Path: Zone-locked allocation with fallback logic, reclaim, and compaction when PCP lists are empty

Watermarks (min, low, high) trigger background reclaim via kswapd when memory pressure increases.
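
From a caller's point of view the whole machinery sits behind a small API; the sketch below uses real functions (alloc_pages(), __free_pages()) with a hypothetical use case.

#include <linux/gfp.h>

/* Hypothetical: allocate four physically contiguous pages (order 2). */
static struct page *grab_buffer(void)
{
    /* GFP_KERNEL may sleep and may trigger reclaim/compaction on the slow path. */
    return alloc_pages(GFP_KERNEL, 2);
}

static void release_buffer(struct page *page)
{
    __free_pages(page, 2);   /* returns the pages to the buddy allocator */
}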

Virtual Memory Areas (VMAs)

VMAs in vma.c represent contiguous virtual address ranges with consistent permissions. Each process has an mm_struct containing a tree of VMAs. The mmap_region() function handles mapping creation, merging adjacent VMAs when possible to reduce overhead.

Key operations:

  • VMA Insertion: Validates overlaps, checks memory limits, and integrates with file mappings
  • VMA Merging: Combines adjacent regions with identical flags to reduce VMA count
  • Page Table Setup: Establishes mappings between virtual and physical addresses
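
A minimal sketch of how kernel code walks a process's VMAs, using real helpers (mmap_read_lock(), find_vma()); the check itself is hypothetical.

#include <linux/mm.h>

/* Hypothetical: is 'addr' covered by a writable mapping in 'mm'? */
static bool addr_is_writable(struct mm_struct *mm, unsigned long addr)
{
    struct vm_area_struct *vma;
    bool writable = false;

    mmap_read_lock(mm);                      /* protects the VMA tree against changes */
    vma = find_vma(mm, addr);                /* first VMA with vm_end > addr */
    if (vma && vma->vm_start <= addr)
        writable = !!(vma->vm_flags & VM_WRITE);
    mmap_read_unlock(mm);

    return writable;
}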

SLUB Allocator

The SLUB allocator in slub.c manages small kernel objects (typically <4KB). It organizes objects into slabs (contiguous page ranges) with per-CPU caches for fast allocation. Each slab maintains a freelist of available objects.

Allocation path:

  1. Try per-CPU cache (no locking)
  2. Grab partial slab from node list
  3. Allocate new slab if needed

This design minimizes lock contention while maintaining cache locality.
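
Kernel code reaches SLUB through the slab API; a minimal sketch with a hypothetical object type:

#include <linux/errno.h>
#include <linux/init.h>
#include <linux/slab.h>

struct foo {                     /* hypothetical object cached by SLUB */
    int id;
};

static struct kmem_cache *foo_cache;

static int __init foo_cache_init(void)
{
    foo_cache = kmem_cache_create("foo_cache", sizeof(struct foo), 0,
                                  SLAB_HWCACHE_ALIGN, NULL);
    return foo_cache ? 0 : -ENOMEM;
}

static struct foo *foo_alloc(void)
{
    return kmem_cache_alloc(foo_cache, GFP_KERNEL);   /* fast path: per-CPU freelist */
}

static void foo_free(struct foo *f)
{
    kmem_cache_free(foo_cache, f);
}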

Page Reclamation (kswapd)

The vmscan.c module implements memory reclamation through kswapd, a background daemon that scans LRU lists when memory pressure rises. It evicts pages based on access patterns, prioritizing:

  • File-backed pages: Reclaimed first (can be re-read from disk)
  • Anonymous pages: Swapped to disk if necessary
  • Active vs. Inactive: Inactive pages evicted before active ones

The scan_control structure coordinates reclaim parameters across zones and memory cgroups.

Page Swapping

The swap.c module manages the LRU (Least Recently Used) lists and page aging. Folios are batched into per-CPU caches before being added to the node-level LRU lists, reducing lock overhead. Swap operations move pages between memory and disk storage.

// Simplified allocation flow (entry-point names vary by kernel version)
alloc_pages(gfp, order)
  → get_page_from_freelist()   // fast path: per-CPU lists, then rmqueue() into the buddy lists
  → __alloc_pages_slowpath()   // on failure: wake kswapd, reclaim/compact, retry the freelist

Memory Pressure Handling

When free memory drops below watermarks, the system triggers:

  1. Direct Reclaim: Allocating task reclaims pages synchronously
  2. Kswapd Wakeup: Background daemon begins scanning
  3. Compaction: Moves pages to create larger contiguous regions
  4. OOM Killer: Last resort—terminates processes to free memory

This multi-level approach balances responsiveness with system stability under memory pressure.

File Systems & Storage

Relevant Files
  • fs/super.c
  • fs/inode.c
  • fs/namei.c
  • fs/ext4/super.c
  • block/blk-core.c
  • block/bio.c

The Linux kernel abstracts storage through a layered architecture: the Virtual File System (VFS) layer provides a unified interface for all filesystems, while the block I/O layer handles communication with physical storage devices.

Superblocks and Filesystem Mounting

A superblock (struct super_block) represents a mounted filesystem instance. It contains critical metadata like block size, magic numbers, and pointers to filesystem-specific operations. When a filesystem is mounted, the kernel allocates a superblock and initializes it with filesystem-specific data read from disk. The superblock maintains lists of inodes, manages writeback operations, and tracks filesystem state (mounted, read-only, frozen, etc.).

Each filesystem type registers a file_system_type structure defining how to mount and unmount instances. The ext4 filesystem, for example, implements ext4_fill_super() to read the on-disk superblock and populate the in-memory structure with block size, inode count, and journal information.

Inodes and Dentries

An inode (struct inode) represents a file or directory on disk. It stores metadata: file size, permissions, timestamps, and block pointers. The kernel maintains an inode cache to avoid repeated disk reads. Inodes are allocated per-superblock and tracked in hash tables for fast lookup.

A dentry (struct dentry) is a directory entry that maps a filename to an inode. Dentries form a tree structure representing the filesystem hierarchy. The dentry cache (dcache) accelerates pathname lookups by caching recently accessed paths. When resolving /path/to/file, the kernel walks the dentry tree, looking up each component until reaching the target inode.

Pathname Resolution

The namei.c module implements pathname lookup through iterative traversal rather than recursion. The core walker, link_path_walk(), resolves one component at a time, handling symlink resolution, permission checks, and mount point crossing along the way. This design prevents stack overflow from deeply nested symlinks and improves performance through dcache hits.

Block I/O and BIO Structures

The block layer abstracts physical storage through bio structures (struct bio). A bio represents a single I/O operation: it contains the device, sector address, operation type (read/write), and a list of memory pages (bi_io_vec). Filesystems submit bios to the block layer, which queues them for the device driver.

struct bio {
    struct bio *bi_next;           /* chains bios within a request */
    struct block_device *bi_bdev;  /* target device */
    blk_opf_t bi_opf;              /* operation (read/write) and flags */
    unsigned short bi_flags;
    struct bvec_iter bi_iter;      /* current sector and position in the bio_vec array */
    bio_end_io_t *bi_end_io;       /* completion callback */
    void *bi_private;              /* owner-private data for the callback */
};
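
A minimal sketch of building and submitting a bio follows; the helpers (bio_alloc(), __bio_add_page(), submit_bio()) are real, but their exact signatures have shifted across kernel versions, and the completion handler here is hypothetical.

#include <linux/bio.h>

static void example_end_io(struct bio *bio)
{
    /* Hypothetical completion: check bio->bi_status, then drop our reference. */
    bio_put(bio);
}

/* Hypothetical: asynchronously read one page starting at 'sector' from 'bdev'. */
static void read_one_page(struct block_device *bdev, sector_t sector, struct page *page)
{
    struct bio *bio = bio_alloc(bdev, 1, REQ_OP_READ, GFP_KERNEL);

    bio->bi_iter.bi_sector = sector;
    __bio_add_page(bio, page, PAGE_SIZE, 0);   /* fits: we allocated one bio_vec */
    bio->bi_end_io = example_end_io;
    submit_bio(bio);                           /* completion runs example_end_io */
}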

Buffer Heads and Filesystem I/O

For traditional filesystems, buffer heads (struct buffer_head) bridge inodes and bios. A buffer head represents a single disk block in memory, tracking its state (dirty, locked, uptodate). The submit_bh() function wraps a buffer head in a bio and submits it to the block layer. This abstraction allows filesystems to work with logical blocks while the block layer handles physical sectors.

Data Flow

When reading a file: the VFS calls the filesystem's read_folio() operation, which allocates bios for the required blocks, submits them to the block layer, and waits for completion. The block layer schedules I/O through the device driver. Upon completion, the bio's bi_end_io callback marks pages uptodate and wakes waiting processes.

Filesystem-Specific Operations

Each filesystem implements super_operations and inode_operations callbacks. These define how to allocate inodes, read/write data, manage metadata, and handle special operations. The ext4 filesystem, for instance, implements journaling through these callbacks to ensure crash consistency.

The address space operations (address_space_operations) handle page-level I/O, including read_folio() for reading pages and writepages() for writeback. This abstraction allows filesystems to customize caching and I/O behavior while maintaining a consistent interface.

Networking Stack

Relevant Files
  • net/core/dev.c
  • net/ipv4/ip_input.c
  • net/ipv6/ip6_input.c
  • net/socket.c
  • net/core/skbuff.c
  • include/linux/skbuff.h

The Linux networking stack is a layered architecture that processes packets from hardware devices through protocol handlers to user applications. At its core is the socket buffer (sk_buff), a metadata structure that tracks packet data as it flows through the system.

Socket Buffers (sk_buff)

The sk_buff structure is the fundamental data structure for all network packets. It does not hold packet data directly; instead, it maintains pointers to data buffers and metadata about the packet. Key fields include:

  • head - pointer to the main data buffer
  • data - current position in the buffer (adjusted as headers are parsed)
  • dev - network device the packet arrived on or will leave from
  • sk - associated socket (if any)
  • protocol - packet protocol type
  • len and data_len - total and fragmented data lengths

The buffer is divided into a linear data section and a skb_shared_info structure containing page fragments for zero-copy operations.
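
The separate head/data pointers are what make header processing cheap: handlers move skb->data instead of copying packet bytes. A minimal sketch with real helpers (ip_hdr(), skb_pull()) and a hypothetical purpose:

#include <linux/ip.h>
#include <linux/skbuff.h>

/* Hypothetical: inspect the IPv4 header, then advance past it. */
static void example_strip_ip_header(struct sk_buff *skb)
{
    struct iphdr *iph = ip_hdr(skb);     /* header at skb->network_header */
    unsigned int hlen = iph->ihl * 4;    /* IHL is in 32-bit words */

    skb_pull(skb, hlen);                 /* skb->data now points at the transport header */
}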

Receive Path

Packet reception begins when a network driver calls netif_receive_skb() with a populated sk_buff. The receive pipeline in net/core/dev.c follows this sequence:

  1. XDP Processing - eBPF programs can inspect and redirect packets before protocol processing
  2. VLAN Untagging - Strip VLAN headers if present
  3. Traffic Control (TC) Ingress - Qdisc and BPF-based ingress filtering
  4. Protocol Dispatch - Route to appropriate protocol handler (IPv4, IPv6, etc.)

The __netif_receive_skb_core() function orchestrates this flow, maintaining packet type handlers and rx_handler callbacks for devices like bridges and bonds.
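
Protocol dispatch is driven by packet_type registration. The sketch below uses the real API (struct packet_type, dev_add_pack()); the handler and its choice of EtherType are purely illustrative.

#include <linux/if_ether.h>
#include <linux/netdevice.h>

static int example_rcv(struct sk_buff *skb, struct net_device *dev,
                       struct packet_type *pt, struct net_device *orig_dev)
{
    /* Hypothetical handler: just consume the packet. */
    kfree_skb(skb);
    return NET_RX_SUCCESS;
}

static struct packet_type example_ptype = {
    .type = cpu_to_be16(ETH_P_IP),   /* illustrative; IPv4 is normally handled by ip_rcv() */
    .func = example_rcv,
};

/* dev_add_pack(&example_ptype) adds the handler to the list walked by
 * __netif_receive_skb_core(); dev_remove_pack() unregisters it. */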

IP Layer Processing

IPv4 packets enter through ip_rcv() in net/ipv4/ip_input.c, which performs:

  • Header validation and checksum verification
  • IP option processing
  • Early demux for TCP/UDP (optional fast path)
  • Routing table lookup via ip_route_input_noref()
  • Fragment reassembly if needed

Packets destined for the local host proceed to ip_local_deliver(), which reassembles fragments and dispatches to transport layer handlers (TCP, UDP, ICMP) via ip_protocol_deliver_rcu().

Transmit Path

Outgoing packets leave the protocol layers and are handed to dev_queue_xmit() in net/core/dev.c:

  1. Netfilter Hooks - NF_INET_POST_ROUTING firewall rules run in the IP output path, before the packet reaches dev_queue_xmit()
  2. Traffic Control (TC) Egress - BPF egress programs and qdisc scheduling
  3. Queue Selection - Choose a transmit queue based on the packet hash
  4. Driver Transmission - Hand off to the device driver via its ndo_start_xmit() method

Key Abstractions

Netfilter Hooks - NF_HOOK() macro allows protocol-independent packet filtering at defined points (PRE_ROUTING, LOCAL_IN, FORWARD, LOCAL_OUT, POST_ROUTING).

Traffic Control - Qdisc (queuing discipline) and BPF programs provide rate limiting, prioritization, and packet manipulation.

RCU Synchronization - Read-Copy-Update protects the packet type handler and device lists, allowing lockless reads on the receive fast path.

Device Drivers & Hardware Abstraction

Relevant Files
  • drivers/base/core.c
  • drivers/base/bus.c
  • drivers/pci/pci-driver.c
  • include/linux/device.h
  • kernel/irq/manage.c

The Linux kernel's device driver model provides a unified abstraction layer for managing hardware devices and their drivers. This architecture decouples hardware specifics from driver logic, enabling scalable and maintainable device support across diverse platforms.

Core Device Model Architecture

The device model is built on three fundamental concepts: devices, drivers, and buses. Devices represent physical or virtual hardware, drivers contain the logic to control them, and buses provide the communication infrastructure. The kernel maintains a hierarchical device tree where each device has a parent (typically a bus), and devices are matched with drivers through a registration and probing mechanism.

Device Registration and Lifecycle

Devices are registered through device_initialize() and device_add() (or combined via device_register()). During initialization, the kernel sets up internal structures including kobject hierarchies, DMA pools, power management state, and device links. The device is then added to the sysfs filesystem, making it visible to userspace. Reference counting via get_device() and put_device() ensures safe device lifecycle management.

Bus and Driver Management

Buses are registered via bus_register(), which creates the /sys/bus/<name> hierarchy and initializes subsystem-private structures. Drivers register with buses through subsystem-specific functions (e.g., pci_register_driver() for PCI). The bus matching algorithm compares device IDs with driver supported IDs, triggering the driver's probe() callback on match. This decoupling allows multiple drivers to coexist for the same device type.

Interrupt Request and Management

Interrupt handling is managed through request_irq() and related functions in kernel/irq/manage.c. When a driver requests an IRQ, the kernel allocates an interrupt descriptor, configures trigger types (level/edge), and registers the handler. The IRQF_SHARED flag enables multiple drivers to share a single IRQ line. Threaded interrupt handlers (request_threaded_irq()) allow long-running interrupt processing in kernel threads, improving system responsiveness.
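
A minimal sketch of a threaded IRQ registration is shown below; the functions and flags are real, while the device name and handler bodies are hypothetical.

#include <linux/interrupt.h>

static irqreturn_t example_hardirq(int irq, void *dev_id)
{
    /* Quick check in hard-IRQ context: was it our device? */
    return IRQ_WAKE_THREAD;              /* defer the heavy work to the thread */
}

static irqreturn_t example_thread_fn(int irq, void *dev_id)
{
    /* Long-running processing in a kernel thread; may sleep. */
    return IRQ_HANDLED;
}

/* Hypothetical registration, typically from a driver's probe():
 *     ret = request_threaded_irq(irq, example_hardirq, example_thread_fn,
 *                                IRQF_SHARED, "example-dev", dev_priv);
 * paired with free_irq(irq, dev_priv) on teardown. */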

PCI Driver Integration

PCI drivers exemplify the device model. The PCI subsystem enumerates devices, creates device structures, and matches them with registered PCI drivers. Dynamic device ID registration via pci_add_dynid() (the sysfs new_id interface) lets an existing driver bind to additional device IDs at runtime. PCI drivers implement probe/remove callbacks and manage device-specific resources like memory-mapped I/O regions and interrupts; a skeleton follows below.
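
The skeleton below is a minimal, hypothetical PCI driver showing the probe/remove contract; the structures and helpers (pci_driver, PCI_DEVICE(), pcim_enable_device(), module_pci_driver()) are real, and the IDs and bodies are placeholders.

#include <linux/module.h>
#include <linux/pci.h>

static const struct pci_device_id example_ids[] = {
    { PCI_DEVICE(0x1234, 0x5678) },   /* placeholder vendor/device IDs */
    { }
};
MODULE_DEVICE_TABLE(pci, example_ids);

static int example_probe(struct pci_dev *pdev, const struct pci_device_id *id)
{
    int ret = pcim_enable_device(pdev);   /* devres-managed enable */

    if (ret)
        return ret;
    /* Map BARs, request IRQs, allocate driver state... */
    return 0;
}

static void example_remove(struct pci_dev *pdev)
{
    /* devres-managed resources are released automatically on unbind. */
}

static struct pci_driver example_driver = {
    .name     = "example-pci",
    .id_table = example_ids,
    .probe    = example_probe,
    .remove   = example_remove,
};
module_pci_driver(example_driver);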

Key Abstractions

  • Device Links: Track supplier-consumer relationships for power management and probe ordering
  • devres: Managed device resources automatically freed on driver unbind
  • sysfs Attributes: Expose device properties and driver controls to userspace
  • Power Management: Integrated PM callbacks for suspend/resume across the device tree

Security, Locking & Synchronization

Relevant Files
  • kernel/locking/mutex.c
  • kernel/locking/spinlock.c
  • kernel/rcu/tree.c
  • kernel/futex/core.c
  • security/security.c

The Linux kernel provides multiple synchronization primitives to protect shared data and coordinate access across CPUs and processes. These mechanisms range from lightweight spinlocks to heavyweight mutexes, each optimized for different scenarios.

Mutexes: Blocking Synchronization

Mutexes in kernel/locking/mutex.c are sleeping locks designed for longer critical sections. When a thread cannot acquire a mutex, it blocks and yields the CPU rather than spinning. Key features include:

  • Adaptive spinning: A waiter spins briefly while the lock owner is running on another CPU, avoiding a sleep/wake cycle for short hold times
  • Optimistic spin queue (OSQ): An MCS-style queue allows only one waiter to spin on the owner at a time, avoiding cache-line contention among spinners
  • Owner tracking: Stores the owning task pointer with flags for handoff and pickup semantics
  • Wait queues: Maintains an ordered list of waiting threads for fair wakeup

void __sched mutex_lock(struct mutex *lock)
{
    might_sleep();

    if (!__mutex_trylock_fast(lock))
        __mutex_lock_slowpath(lock);
}

Spinlocks: Busy-Wait Synchronization

Spinlocks in kernel/locking/spinlock.c are non-sleeping locks where waiters continuously poll the lock. They are essential for protecting code that cannot sleep, such as interrupt handlers. The implementation:

  • Disables preemption so the lock holder cannot be scheduled out while holding the lock
  • Supports IRQ-safe variants (spin_lock_irq, spin_lock_irqsave) that also disable local interrupts
  • Uses architecture-specific relaxation in the wait loop (e.g., cpu_relax())
  • Includes read-write variants for reader-heavy workloads
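
The usual IRQ-safe pattern for data shared with an interrupt handler looks like the sketch below; the primitives are real, the queue is hypothetical.

#include <linux/list.h>
#include <linux/spinlock.h>

/* Hypothetical queue shared between process context and an IRQ handler. */
static DEFINE_SPINLOCK(example_lock);
static LIST_HEAD(example_queue);

static void enqueue_item(struct list_head *item)
{
    unsigned long flags;

    spin_lock_irqsave(&example_lock, flags);    /* disables local IRQs and preemption */
    list_add_tail(item, &example_queue);
    spin_unlock_irqrestore(&example_lock, flags);
}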

RCU: Lock-Free Reads

Read-Copy-Update (RCU) in kernel/rcu/tree.c enables lock-free reads by deferring reclamation. The pattern is:

  1. Removal phase: Remove data from structure (readers see old or new version, never partial)
  2. Grace period: Wait for all existing readers to finish
  3. Reclamation phase: Free the removed data

RCU readers use rcu_read_lock() / rcu_read_unlock(), which cost at most a counter increment (or nothing at all) depending on configuration. Writers call synchronize_rcu() to wait for a grace period, or call_rcu() for asynchronous reclamation.

rcu_read_lock();
p = rcu_dereference(ptr);  // Safe pointer access
// Use p
rcu_read_unlock();
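
The matching writer side looks like the sketch below; the RCU primitives are real, while the config structure, global pointer, and writer lock are hypothetical.

#include <linux/rcupdate.h>
#include <linux/slab.h>
#include <linux/spinlock.h>

struct config { int value; };                 /* hypothetical RCU-protected data */
static struct config __rcu *global_cfg;
static DEFINE_SPINLOCK(cfg_lock);             /* serializes writers */

static void update_config(int new_value)
{
    struct config *newc = kmalloc(sizeof(*newc), GFP_KERNEL);
    struct config *oldc;

    if (!newc)
        return;
    newc->value = new_value;

    spin_lock(&cfg_lock);
    oldc = rcu_dereference_protected(global_cfg, lockdep_is_held(&cfg_lock));
    rcu_assign_pointer(global_cfg, newc);     /* publish the new version */
    spin_unlock(&cfg_lock);

    synchronize_rcu();                        /* wait for pre-existing readers to finish */
    kfree(oldc);                              /* now safe to reclaim the old version */
}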

Futexes: User-Space Synchronization

Futexes in kernel/futex/core.c bridge user-space and kernel synchronization. They provide:

  • Fast path in user-space (atomic operations only)
  • Kernel fallback for contention via syscalls
  • Priority inheritance for real-time applications
  • Robust semantics for crash recovery

Security Framework

The security/security.c module implements the Linux Security Module (LSM) framework, providing:

  • Pluggable security policies (SELinux, AppArmor, Smack, etc.)
  • Lockdown mode, which restricts operations that could modify the running kernel even for privileged users
  • Audit hooks for security-relevant events
  • Capability-based access control

Synchronization Strategy Selection

Choose based on:

  • Mutex: Long critical sections, can sleep, need fairness
  • Spinlock: Short critical sections, cannot sleep, interrupt context
  • RCU: Read-heavy workloads, can tolerate stale reads
  • Futex: User-space coordination, priority inheritance needed