Overview
Relevant Files
README.md, setup.py, Stable_Diffusion_v1_Model_Card.md, main.py, ldm/ (core model implementation)
Stable Diffusion is a latent text-to-image diffusion model that generates high-quality images from text prompts. This repository contains the official implementation of Stable Diffusion v1, a collaboration between Stability AI, Runway, and the research team behind Latent Diffusion Models.
What is Stable Diffusion?
Stable Diffusion is a diffusion-based generative model that operates in the latent space of a pre-trained autoencoder. Unlike pixel-space diffusion models, this approach significantly reduces computational requirements while maintaining image quality. The model uses a frozen CLIP ViT-L/14 text encoder to condition generation on text prompts, similar to Google's Imagen architecture.
Key specifications:
- Model size: 860M UNet + 123M text encoder (relatively lightweight)
- Training resolution: 256×256 initially, then fine-tuned on 512×512
- Training data: LAION-5B subset (512M+ images with English captions)
- Hardware requirement: GPU with at least 10GB VRAM
Core Architecture
The model combines three main components:
- Text Encoder: CLIP ViT-L/14 converts text prompts into embeddings that guide generation
- Autoencoder: Compresses images to latent space (8× downsampling factor) for efficient processing
- Diffusion Model: UNet backbone with cross-attention layers that iteratively denoises latents conditioned on text embeddings
Capabilities
The model supports multiple image generation and modification tasks:
- Text-to-Image: Generate images from text descriptions using PLMS or DDIM sampling
- Image-to-Image: Modify existing images based on text prompts with controllable strength parameter
- Inpainting: Fill masked regions of images with generated content
- Upscaling: Enhance resolution of generated or existing images
Training Details
Stable Diffusion v1 was trained on 32×8 A100 GPUs with:
- Batch size: 2048 (32×8×2×4)
- Learning rate: 0.0001 (constant after 10k warmup steps)
- Optimizer: AdamW with gradient accumulation
- Training stages: 237k steps at 256×256, then 194k-515k steps at 512×512 depending on checkpoint version
Four checkpoint versions are provided, progressively improving through aesthetic filtering and classifier-free guidance (10% text-conditioning dropout in v1-3 and v1-4).
Usage
The repository provides reference sampling scripts and integrations:
- Reference script: scripts/txt2img.py for text-to-image generation with safety checker and watermarking
- Diffusers integration: Simple Python API via the HuggingFace diffusers library (see the sketch below)
- Training framework: PyTorch Lightning-based training pipeline in main.py supporting distributed training and checkpointing
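As a hedged illustration of the diffusers route, the snippet below loads the v1-4 checkpoint from the HuggingFace Hub; the model ID, dtype, and device choices are common defaults rather than requirements of this repository.

from diffusers import StableDiffusionPipeline
import torch

# Load Stable Diffusion v1-4 weights from the HuggingFace Hub
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

image = pipe("a photograph of an astronaut riding a horse").images[0]
image.save("astronaut.png")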
Important Considerations
The model reflects biases present in LAION-5B training data and has known limitations including imperfect photorealism, inability to render legible text, and difficulty with complex compositional tasks. The weights are research artifacts intended for research purposes, with safety mechanisms recommended for production deployment.
Architecture & Core Components
Relevant Files
ldm/models/diffusion/ddpm.py, ldm/models/autoencoder.py, ldm/modules/diffusionmodules/model.py, ldm/modules/attention.py, ldm/modules/encoders/modules.py
This repository implements Latent Diffusion Models, a generative framework that combines autoencoders with diffusion processes. The architecture operates in a compressed latent space rather than pixel space, enabling efficient training and inference.
Core Architecture Overview
The system consists of three main components working together:
- Autoencoder (First Stage Model) - Compresses images into a latent representation
- Diffusion Model (UNet) - Learns to denoise latent representations
- Conditioning Encoder - Encodes text or other conditions for guided generation
Autoencoder (ldm/models/autoencoder.py)
The autoencoder compresses images into a lower-dimensional latent space using an encoder-decoder architecture. Two variants are implemented:
- VQModel - Uses vector quantization for discrete latent codes
- AutoencoderKL - Uses a Gaussian distribution with KL divergence regularization
The encoder progressively downsamples the input through residual blocks and attention layers, while the decoder reconstructs the image. This compression reduces computational cost for the diffusion process by 4-16x depending on the compression factor.
Diffusion Model (ldm/models/diffusion/ddpm.py)
The DDPM (Denoising Diffusion Probabilistic Model) class implements the core diffusion training and sampling logic:
- Forward Process - Gradually adds Gaussian noise to latents over timesteps
- Reverse Process - Trains a UNet to predict and remove noise at each timestep
- Timestep Embedding - Sinusoidal positional encoding injected into the UNet
The LatentDiffusion subclass extends DDPM to work in latent space and supports conditioning through multiple mechanisms: concatenation, cross-attention, or hybrid approaches.
UNet Architecture (ldm/modules/diffusionmodules/model.py)
The diffusion model uses a U-shaped architecture with:
- Encoder Path - Downsampling blocks with residual connections
- Bottleneck - Middle layers with attention mechanisms
- Decoder Path - Upsampling blocks with skip connections from encoder
Key components include:
- ResnetBlock - Residual blocks with group normalization and timestep conditioning
- Attention Layers - Both spatial self-attention and linear attention variants
- Channel Multipliers - Progressive channel expansion at each resolution level
Attention Mechanisms (ldm/modules/attention.py)
Two attention implementations enable context integration:
- SpatialSelfAttention - Spatial attention within feature maps
- CrossAttention - Attends to external conditioning (text embeddings)
- LinearAttention - Efficient linear-complexity attention for high-resolution features
The SpatialTransformer module wraps transformer blocks for image-like data, projecting spatial features to sequence format and back.
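To make the cross-attention mechanism concrete, the sketch below implements a minimal multi-head cross-attention layer in the spirit of the CrossAttention module; the class name, dimensions, and example shapes are illustrative rather than copied from the implementation.

import torch
import torch.nn as nn

class CrossAttentionSketch(nn.Module):
    """Queries come from spatial features; keys/values come from the conditioning context."""
    def __init__(self, query_dim, context_dim, heads=8, dim_head=64):
        super().__init__()
        inner_dim = heads * dim_head
        self.heads = heads
        self.scale = dim_head ** -0.5
        self.to_q = nn.Linear(query_dim, inner_dim, bias=False)
        self.to_k = nn.Linear(context_dim, inner_dim, bias=False)
        self.to_v = nn.Linear(context_dim, inner_dim, bias=False)
        self.to_out = nn.Linear(inner_dim, query_dim)

    def forward(self, x, context=None):
        context = x if context is None else context  # falls back to self-attention
        q, k, v = self.to_q(x), self.to_k(context), self.to_v(context)

        # Split heads: (batch, seq, inner) -> (batch, heads, seq, dim_head)
        def split(t):
            return t.view(t.shape[0], t.shape[1], self.heads, -1).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(x.shape[0], x.shape[1], -1)
        return self.to_out(out)

# Example: 32x32 latent features (1024 tokens) attending to 77 text-token embeddings
layer = CrossAttentionSketch(query_dim=320, context_dim=768)
y = layer(torch.randn(2, 1024, 320), context=torch.randn(2, 77, 768))  # -> (2, 1024, 320)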
Conditioning System (ldm/modules/encoders/modules.py)
Multiple encoder types support different conditioning modalities:
- FrozenCLIPEmbedder - Encodes text prompts with a frozen CLIP text transformer (HuggingFace CLIPTextModel under the hood)
- BERTEmbedder - Alternative text encoding with BERT tokenizer
- ClassEmbedder - Class label embeddings for class-conditional generation
The DiffusionWrapper routes conditioning through the appropriate mechanism based on the conditioning_key parameter (concat, crossattn, hybrid, or adm).
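A condensed sketch of that routing logic follows; it paraphrases the behavior described above, with simplified argument handling rather than the exact DiffusionWrapper code.

import torch
import torch.nn as nn

class DiffusionWrapperSketch(nn.Module):
    def __init__(self, diffusion_model, conditioning_key=None):
        super().__init__()
        self.diffusion_model = diffusion_model      # the UNet
        self.conditioning_key = conditioning_key    # None, 'concat', 'crossattn', 'hybrid', or 'adm'

    def forward(self, x, t, c_concat=None, c_crossattn=None):
        if self.conditioning_key is None:
            return self.diffusion_model(x, t)                                   # unconditional
        if self.conditioning_key == 'concat':
            return self.diffusion_model(torch.cat([x] + c_concat, dim=1), t)    # channel concat
        if self.conditioning_key == 'crossattn':
            return self.diffusion_model(x, t, context=torch.cat(c_crossattn, dim=1))
        if self.conditioning_key == 'hybrid':
            return self.diffusion_model(torch.cat([x] + c_concat, dim=1), t,
                                        context=torch.cat(c_crossattn, dim=1))
        if self.conditioning_key == 'adm':
            return self.diffusion_model(x, t, y=c_crossattn[0])                 # class label input
        raise NotImplementedError(self.conditioning_key)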
Data Flow During Training
- Image is encoded to latent space via autoencoder
- Random timestep is sampled
- Noise is added to latent according to noise schedule
- Conditioning (text) is encoded
- UNet predicts noise given noisy latent, timestep, and condition
- Loss is computed between predicted and actual noise
- Gradients flow back through the trainable components (the first-stage autoencoder stays frozen; the conditioning encoder is updated only if configured as trainable)
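The steps above can be condensed into rough pseudocode. This is a minimal sketch, not the actual DDPM/LatentDiffusion code: q_sample stands in for the noise-schedule arithmetic, and autoencoder, text_encoder, and unet are assumed to be instantiated elsewhere.

import torch
import torch.nn.functional as F

def training_step_sketch(image, caption_tokens, num_timesteps=1000):
    with torch.no_grad():
        z = autoencoder.encode(image)         # 1. image -> latent (first stage is frozen)
        c = text_encoder(caption_tokens)      # 4. encode the text conditioning
    t = torch.randint(0, num_timesteps, (z.shape[0],), device=z.device)  # 2. sample timestep
    noise = torch.randn_like(z)
    z_noisy = q_sample(z, t, noise)           # 3. add noise according to the schedule
    noise_pred = unet(z_noisy, t, context=c)  # 5. UNet predicts the noise
    return F.mse_loss(noise_pred, noise)      # 6. loss between predicted and actual noise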
Data Flow During Sampling
- Start with random noise in latent space
- For each timestep (reversed):
- Condition is encoded
- UNet predicts noise
- Noise is removed from latent
- Optional masking applied for inpainting
- Final latent is decoded to image space via autoencoder
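A similarly rough sketch of the reverse loop (a simplified outline, not the DDIM/PLMS implementation; denoise_step is a placeholder for the sampler-specific update rule, and the mask blending used for inpainting is omitted):

import torch

@torch.no_grad()
def sampling_sketch(unet, autoencoder, c, shape, num_steps=50):
    z = torch.randn(shape)                         # start from pure noise in latent space
    for t in reversed(range(num_steps)):           # iterate timesteps in reverse
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = unet(z, t_batch, context=c)          # UNet predicts noise for this step
        z = denoise_step(z, eps, t)                # placeholder: remove noise per the schedule
    return autoencoder.decode(z)                   # decode the final latent to image space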
Training Pipeline
Relevant Files
main.py, ldm/data/base.py, ldm/data/imagenet.py, ldm/lr_scheduler.py
The training pipeline orchestrates model training using PyTorch Lightning, with configuration-driven setup for models, data, and optimization. The entry point is main.py, which loads YAML configs, instantiates components, and manages the training loop.
Configuration System
Training is controlled via YAML configuration files (in configs/) that define three main sections:
- Model: Specifies the model class, base learning rate, and architecture parameters
- Data: Defines the data module, batch size, and dataset configurations for train/validation/test splits
- Lightning: Optional trainer settings, logger, callbacks, and checkpointing behavior
Configs are merged left-to-right, allowing layered composition. Command-line arguments override config values using dot notation (e.g., model.params.key=value).
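The merge behavior can be reproduced with OmegaConf directly; the config path and override keys below are illustrative.

from omegaconf import OmegaConf

base = OmegaConf.load("configs/latent-diffusion/txt2img-1p4B-eval.yaml")
overrides = OmegaConf.from_dotlist(["model.base_learning_rate=5.0e-5",
                                    "data.params.batch_size=8"])
config = OmegaConf.merge(base, overrides)   # later arguments take precedence
print(config.data.params.batch_size)        # -> 8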
Data Loading Pipeline
Data Module Architecture
- DataModuleFromConfig: PyTorch Lightning data module that instantiates datasets from config
- Txt2ImgIterableBaseDataset: Base class for iterable datasets (text-to-image training)
- ImageNetTrain / ImageNetValidation: ImageNet dataset loaders with automatic download and extraction
- ImageNetSR: Super-resolution variant with image degradation pipeline
The DataModuleFromConfig class wraps dataset instantiation and creates PyTorch DataLoaders. For iterable datasets, a custom worker_init_fn distributes data across workers by splitting valid_ids. Non-iterable datasets use standard shuffling. Batch size is configurable, and worker count defaults to batch_size * 2.
Learning Rate Scheduling
Three scheduler implementations in ldm/lr_scheduler.py support different training strategies:
- LambdaWarmUpCosineScheduler: Linear warmup followed by cosine annealing decay
- LambdaWarmUpCosineScheduler2: Multi-cycle variant with configurable warmup and decay per cycle
- LambdaLinearScheduler: Linear warmup followed by linear decay
All schedulers emit a multiplicative factor (typically peaking at 1.0) that is applied to the model's configured learning rate via PyTorch's LambdaLR. The pipeline also supports optional learning rate scaling: lr = accumulate_grad_batches × num_gpus × batch_size × base_lr, as in the example below.
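A quick worked example of that scaling rule, with hypothetical run settings:

# Hypothetical run: 2 accumulation steps, 8 GPUs, per-GPU batch size 4
accumulate_grad_batches, num_gpus, batch_size = 2, 8, 4
base_lr = 1.0e-4                        # base_learning_rate from the model config
learning_rate = accumulate_grad_batches * num_gpus * batch_size * base_lr
print(learning_rate)                    # 0.0064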
Training Loop
The trainer is created with callbacks for checkpointing, image logging, learning rate monitoring, and CUDA memory tracking. Signal handlers allow checkpointing via SIGUSR1 and debugging via SIGUSR2. On exception, a checkpoint is automatically saved.
Key Callbacks
- SetupCallback: Creates log directories and saves configs at training start
- ImageLogger: Logs generated images at configurable batch frequency
- LearningRateMonitor: Tracks learning rate changes per step
- CUDACallback: Monitors GPU memory and epoch timing
- ModelCheckpoint: Saves best models based on monitored metrics (e.g., validation loss)
Resume & Checkpointing
Training can resume from a checkpoint via --resume flag. The system automatically loads the last checkpoint and previous configs. Checkpoints are saved to logs/{timestamp}_{name}/checkpoints/, with the latest always available as last.ckpt. Optional per-step checkpointing saves intermediate states without deletion.
Sampling & Inference
Relevant Files
scripts/txt2img.py, scripts/img2img.py, scripts/inpaint.py, ldm/models/diffusion/ddim.py, ldm/models/diffusion/plms.py, ldm/models/diffusion/dpm_solver/sampler.py
Sampling and inference are the core processes for generating images from text prompts or modifying existing images. The system supports multiple sampling algorithms, each with different speed-quality tradeoffs.
Sampling Algorithms
The codebase provides three primary samplers:
- DDIM (Denoising Diffusion Implicit Models) - Fast, deterministic sampling with configurable stochasticity via the eta parameter. Default choice for most tasks.
- PLMS (Pseudo Linear Multistep) - Higher-order solver for faster convergence. Requires eta=0 (deterministic only).
- DPM-Solver - Advanced ODE solver with multistep methods. Offers the best quality-speed balance with configurable order.
All samplers share a common interface: sample(S, batch_size, shape, conditioning, ...) where S is the number of steps.
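A hedged usage sketch of that interface, modeled on how the reference txt2img script drives the DDIM sampler; model is assumed to be a loaded LatentDiffusion instance, and c / uc are the conditional and unconditional embeddings described below. Keyword names may vary slightly between samplers.

from ldm.models.diffusion.ddim import DDIMSampler

sampler = DDIMSampler(model)
samples, _ = sampler.sample(
    S=50,                              # number of denoising steps
    batch_size=4,
    shape=[4, 64, 64],                 # latent channels, H/8, W/8 for 512x512 output
    conditioning=c,                    # text embeddings
    unconditional_guidance_scale=7.5,
    unconditional_conditioning=uc,     # empty-prompt embeddings
    eta=0.0,                           # deterministic DDIM
    verbose=False,
)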
Text Conditioning & Classifier-Free Guidance
Text prompts are encoded into a sequence of embeddings using a frozen CLIP text encoder. The get_learned_conditioning() method tokenizes the prompt and embeds it into a 77-token sequence of embeddings.
Classifier-free guidance enables control over prompt adherence:
# Unconditional embedding (empty prompt)
uc = model.get_learned_conditioning(batch_size * [""])
# Conditional embedding (actual prompt)
c = model.get_learned_conditioning(prompts)
# During sampling, guidance is applied:
# e_t = e_t_uncond + scale * (e_t_cond - e_t_uncond)
The unconditional_guidance_scale parameter controls strength (1.0 = no guidance, 7.5 = typical default).
Sampling Workflows
Text-to-Image (txt2img):
- Encode text prompt to conditioning
- Sample noise in latent space
- Iteratively denoise with guidance
- Decode latent to image via VAE
Image-to-Image (img2img):
- Encode input image to latent space
- Add noise based on the strength parameter (0.0 = no change, 1.0 = full regeneration)
- Denoise from the noisy latent with text guidance
- Decode result
Inpainting:
- Encode masked image and mask to latent space
- Concatenate mask with conditioning
- Denoise only masked regions
- Blend with original image
Key Parameters
- ddim_steps / S: Number of denoising steps (50-100 typical)
- ddim_eta: Stochasticity (0.0 = deterministic, 1.0 = maximum noise)
- scale: Guidance scale for prompt adherence
- strength: Image-to-image noise level (see the sketch below)
- seed: Reproducible sampling
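The sketch below shows how strength and ddim_steps interact in the image-to-image workflow. It is modeled on the reference img2img script, with sampler, model, c, uc, init_image, batch_size, and device assumed to be set up as in the earlier snippets; stochastic_encode and decode are the DDIM sampler's partial-noising and denoising entry points.

import torch

strength, ddim_steps = 0.75, 50
sampler.make_schedule(ddim_num_steps=ddim_steps, ddim_eta=0.0)
t_enc = int(strength * ddim_steps)                       # how many noising steps to apply

init_latent = model.get_first_stage_encoding(
    model.encode_first_stage(init_image))                # encode the input image
z_enc = sampler.stochastic_encode(
    init_latent, torch.tensor([t_enc] * batch_size).to(device))  # partially noise it
samples = sampler.decode(z_enc, c, t_enc,
                         unconditional_guidance_scale=7.5,
                         unconditional_conditioning=uc)  # denoise with text guidance
x_samples = model.decode_first_stage(samples)            # back to image space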
Latent Space Operations
All sampling occurs in compressed latent space (8x downsampling). The VAE encoder/decoder handles conversion:
# Encode image to latent
z = model.encode_first_stage(image)
# Decode latent to image
x_sample = model.decode_first_stage(z)
This reduces computation while preserving semantic information.
Unconditional Guidance Implementation
During each denoising step, the model predicts noise for both conditional and unconditional inputs:
x_in = torch.cat([x] * 2)   # Duplicate batch
t_in = torch.cat([t] * 2)   # Duplicate timesteps to match
c_in = torch.cat([uc, c])   # Unconditional + conditional
e_t_uncond, e_t = model.apply_model(x_in, t_in, c_in).chunk(2)
e_t = e_t_uncond + scale * (e_t - e_t_uncond)
This doubles computation but enables fine-grained control over generation.
Text Conditioning & Encoders
Relevant Files
ldm/modules/encoders/modules.py, ldm/modules/x_transformer.py, ldm/models/diffusion/ddpm.py, ldm/modules/attention.py
Text conditioning is the mechanism that allows diffusion models to generate images guided by text prompts, class labels, or other conditioning signals. The system converts raw conditioning inputs into learned embeddings that the diffusion model uses during generation.
Encoder Architecture
The codebase provides multiple encoder implementations for different conditioning modalities:
FrozenCLIPEmbedder - The primary text encoder used in Stable Diffusion. It leverages OpenAI's CLIP model to encode text prompts into 768-dimensional embeddings. The model is frozen (non-trainable) to preserve CLIP's semantic understanding.
BERTEmbedder - An alternative text encoder combining BERT tokenization with custom transformer layers. Useful for models trained before CLIP integration or for specialized text understanding tasks.
TransformerEmbedder - A lightweight custom transformer encoder that tokenizes and embeds text using configurable transformer layers.
ClassEmbedder - Handles class-conditional generation by embedding discrete class labels (e.g., ImageNet classes) into learned embeddings.
SpatialRescaler - Preprocesses spatial conditioning inputs (images, segmentation maps) by resizing and optionally remapping channels.
Encoding Pipeline
# Text encoding flow (illustrative; mirrors what the CLIP-based encoder does internally)
text_input = "a dog wearing sunglasses"
tokens = tokenizer(text_input, truncation=True, max_length=77,
                   padding="max_length", return_tensors="pt")
embeddings = encoder(tokens.input_ids)  # Shape: [batch, 77, 768]
The encoding process follows these steps:
- Tokenization - Convert text to token IDs using the encoder's tokenizer
- Token Embedding - Map token IDs to dense vectors
- Positional Encoding - Add position information to preserve sequence order
- Transformer Processing - Apply attention layers to contextualize embeddings
- Output - Return sequence of embeddings for cross-attention in the diffusion model
Integration with Diffusion Model
The LatentDiffusion class manages conditioning through the cond_stage_model and conditioning_key parameters:
- conditioning_key='crossattn' - Embeddings are passed to cross-attention layers in the UNet. The diffusion model attends to text embeddings at each denoising step.
- conditioning_key='concat' - Embeddings are concatenated with the noisy latent before processing.
- conditioning_key='hybrid' - Combines both concatenation and cross-attention.
- conditioning_key='adm' - Class embeddings are passed to the UNet as label inputs and added to the timestep embeddings (ADM-style class conditioning).
Key Design Patterns
Frozen Encoders - Text encoders are typically frozen during diffusion training to preserve pre-trained semantic knowledge. This reduces training cost and improves stability.
Fixed Sequence Length - All text encoders use a fixed maximum sequence length (typically 77 tokens). Longer text is truncated; shorter text is padded.
Embedding Dimension Matching - The encoder output dimension must match the context_dim parameter in the UNet's spatial transformer blocks (e.g., 768 for CLIP, 1280 for larger models).
Batch Processing - Encoders process entire batches of text simultaneously, enabling efficient GPU utilization during training and inference.
Latent Space & Autoencoders
Relevant Files
ldm/models/autoencoder.py, ldm/modules/diffusionmodules/model.py, ldm/modules/distributions/distributions.py
Latent Diffusion Models operate in a compressed latent space rather than pixel space, dramatically reducing computational cost. The autoencoder is the first-stage model that learns this compression, enabling efficient diffusion training and inference.
Why Latent Space?
Working in latent space provides several advantages:
- Computational Efficiency - Reduces memory and compute by 4–16x depending on compression factor
- Semantic Compression - Learns meaningful representations rather than pixel-level details
- Faster Diffusion - Fewer timesteps needed for denoising in compressed space
- Better Generalization - Focuses on high-level image structure
Autoencoder Architecture
The autoencoder consists of three components:
- Encoder - Progressively downsamples input images through residual blocks and attention layers
- Quantization/Distribution Layer - Compresses to discrete codes (VQ) or continuous distribution (KL)
- Decoder - Reconstructs images by upsampling from latent codes
# Encoding pipeline
h = self.encoder(x) # Downsample: 256x256 → 32x32
h = self.quant_conv(h) # Project to latent dimension
z = self.quantize(h) # Quantize or sample
Two Autoencoder Variants
VQModel (Vector Quantization)
Uses discrete codebook entries for latent representation:
- Encoder outputs are mapped to nearest codebook vectors
- Produces discrete latent codes (indices into codebook)
- Supports decode_code() for direct code-to-image generation
- Configuration: n_embed (codebook size), embed_dim (code dimension)
AutoencoderKL (Variational)
Uses continuous Gaussian distributions with KL regularization:
- Encoder outputs mean and log-variance parameters
- Samples from DiagonalGaussianDistribution during training
- Uses deterministic mode (mean) during inference
- Supports stochastic sampling for diversity
# AutoencoderKL encoding
posterior = self.encode(x) # Returns DiagonalGaussianDistribution
z = posterior.sample() # Stochastic: mean + std * noise
z = posterior.mode() # Deterministic: just mean
Latent Distribution
The DiagonalGaussianDistribution class handles probabilistic sampling:
- Parameters - Splits encoder output into mean and log-variance
- Clamping - Log-variance clamped to [–30, 20] for stability
- Sampling - z = mean + std * randn() (reparameterization trick)
- KL Divergence - Computed against a standard normal for regularization
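A condensed sketch of this distribution class, paraphrasing the behavior listed above (close in spirit to, but not copied verbatim from, ldm/modules/distributions/distributions.py):

import torch

class DiagonalGaussianSketch:
    def __init__(self, parameters):
        self.mean, logvar = torch.chunk(parameters, 2, dim=1)  # split channels into mean / logvar
        self.logvar = torch.clamp(logvar, -30.0, 20.0)         # clamp for numerical stability
        self.std = torch.exp(0.5 * self.logvar)

    def sample(self):
        return self.mean + self.std * torch.randn_like(self.mean)  # reparameterization trick

    def mode(self):
        return self.mean                                       # deterministic latent

    def kl(self):
        # KL divergence to a standard normal, summed over non-batch dimensions
        return 0.5 * torch.sum(self.mean ** 2 + torch.exp(self.logvar) - 1.0 - self.logvar,
                               dim=[1, 2, 3])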
Integration with Diffusion
The diffusion model receives latent codes produced by the autoencoder.
During training, the autoencoder is frozen and the diffusion model learns to denoise latent representations. At inference, the pipeline reverses: sample latents from noise, then decode to image space.
Configuration Parameters
Key settings in autoencoder configs:
- z_channels - Latent feature channels (typically 3–16)
- double_z - For KL models, outputs 2x channels for mean and logvar
- ch_mult - Channel multipliers for encoder/decoder blocks
- attn_resolutions - Resolutions where attention is applied
- embed_dim - Embedding dimension for the latent codes / VQ codebook
Configuration & Utilities
Relevant Files
ldm/util.py, configs/stable-diffusion/v1-inference.yaml, configs/latent-diffusion/txt2img-1p4B-eval.yaml, main.py
Configuration System
The codebase uses a declarative YAML-based configuration system powered by OmegaConf. Configuration files define model architectures, training parameters, and data pipelines in a hierarchical structure. The instantiate_from_config() function dynamically instantiates Python classes from config dictionaries, enabling flexible model composition without code changes.
Core Configuration Pattern
Each config file follows a standard structure with three main sections:
model:
  base_learning_rate: 1.0e-04
  target: ldm.models.diffusion.ddpm.LatentDiffusion
  params:
    # Model-specific parameters
    timesteps: 1000
    channels: 4
    unet_config:
      target: ldm.modules.diffusionmodules.openaimodel.UNetModel
      params:
        model_channels: 320
        attention_resolutions: [4, 2, 1]
    first_stage_config:
      target: ldm.models.autoencoder.AutoencoderKL
      params:
        embed_dim: 4
    cond_stage_config:
      target: ldm.modules.encoders.modules.FrozenCLIPEmbedder

data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 64
    train:
      target: ldm.data.imagenet.ImageNetTrain
      params:
        config:
          size: 256
The target field specifies the full Python import path (e.g., ldm.models.diffusion.ddpm.LatentDiffusion), and params contains constructor arguments.
Dynamic Instantiation
How instantiate_from_config Works
- instantiate_from_config(config) reads a config dict and returns an instantiated object
- Extracts the target string and uses get_obj_from_str() to dynamically import the class
- Passes config["params"] as keyword arguments to the class constructor
- Supports special values: '__is_first_stage__' and '__is_unconditional__' return None
- Nested configs (e.g., unet_config, first_stage_config) are instantiated in turn by the parent class's constructor
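A condensed sketch of these helpers (close to, but not verbatim, the code in ldm/util.py):

import importlib

def get_obj_from_str(string):
    module, cls = string.rsplit(".", 1)
    return getattr(importlib.import_module(module), cls)

def instantiate_from_config(config):
    if "target" not in config:
        if config in ("__is_first_stage__", "__is_unconditional__"):
            return None
        raise KeyError("Expected key `target` to instantiate.")
    return get_obj_from_str(config["target"])(**config.get("params", dict()))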
Utility Functions
The ldm/util.py module provides essential helper functions:
- count_params(model, verbose=False) – Counts total model parameters (reported in millions when verbose)
- isimage(x) – Checks if a tensor is an image (4D with 1 or 3 channels)
- ismap(x) – Checks if a tensor is a feature map (4D with >3 channels)
- exists(x) – Returns True if x is not None
- default(val, d) – Returns val if it exists, otherwise d (callable or value)
- mean_flat(tensor) – Computes the mean over all non-batch dimensions
- log_txt_as_img(wh, xc, size=10) – Renders text captions as image tensors for logging
- parallel_data_prefetch(func, data, n_proc, ...) – Parallelizes data preprocessing across CPU cores or threads
Configuration Loading in Training
The training pipeline (main.py) loads configs using OmegaConf:
configs = [OmegaConf.load(cfg) for cfg in opt.base]
cli = OmegaConf.from_dotlist(unknown)
config = OmegaConf.merge(*configs, cli)
model = instantiate_from_config(config.model)
data = instantiate_from_config(config.data)
Multiple config files can be merged, and command-line arguments override YAML values. This enables easy experimentation with different model sizes, datasets, and hyperparameters.
Configuration Variants
Different model variants are defined in separate YAML files:
- Stable Diffusion v1 (v1-inference.yaml) – Uses the CLIP embedder, 768-dim context
- Text-to-Image 1.4B (txt2img-1p4B-eval.yaml) – Uses the BERT embedder, 1280-dim context
- Latent Diffusion variants – Different autoencoder and conditioning configurations
Each variant specifies its own UNet architecture, conditioning mechanism, and first-stage autoencoder, allowing rapid prototyping of different model configurations.
Safety & Watermarking
Relevant Files
scripts/txt2img.py, scripts/tests/test_watermark.py
Stable Diffusion implements two complementary safety mechanisms: NSFW content detection and invisible watermarking. These systems work together to reduce harmful outputs and help identify machine-generated images.
Safety Checker: NSFW Detection
The safety checker uses a pre-trained CLIP-based model to detect and filter potentially unsafe content before images are saved.
Architecture:
The system loads a specialized safety model from Hugging Face:
from transformers import AutoFeatureExtractor
from diffusers.pipelines.stable_diffusion.safety_checker import StableDiffusionSafetyChecker

safety_model_id = "CompVis/stable-diffusion-safety-checker"
safety_feature_extractor = AutoFeatureExtractor.from_pretrained(safety_model_id)
safety_checker = StableDiffusionSafetyChecker.from_pretrained(safety_model_id)
Detection Pipeline:
- Feature Extraction - Images are converted to PIL format and processed by the feature extractor
- Classification - The safety checker analyzes pixel values and CLIP embeddings to detect NSFW concepts
- Replacement - If unsafe content is detected, the image is replaced with a fallback image (assets/rick.jpeg)
def check_safety(x_image):
    safety_checker_input = safety_feature_extractor(numpy_to_pil(x_image), return_tensors="pt")
    x_checked_image, has_nsfw_concept = safety_checker(images=x_image, clip_input=safety_checker_input.pixel_values)
    for i in range(len(has_nsfw_concept)):
        if has_nsfw_concept[i]:
            x_checked_image[i] = load_replacement(x_checked_image[i])
    return x_checked_image, has_nsfw_concept
The function returns both the checked images and a boolean array indicating which samples triggered the safety filter.
Invisible Watermarking
Invisible watermarks are embedded into generated images using discrete wavelet transform (DWT) and discrete cosine transform (DCT) techniques. This helps identify images as machine-generated without visible artifacts.
Watermark Encoding:
During image generation, a watermark encoder is initialized with the model identifier:
from imwatermark import WatermarkEncoder

wm_encoder = WatermarkEncoder()
wm_encoder.set_watermark('bytes', "StableDiffusionV1".encode('utf-8'))
The watermark is applied to both individual samples and grid outputs:
def put_watermark(img, wm_encoder=None):
    if wm_encoder is not None:
        img = cv2.cvtColor(np.array(img), cv2.COLOR_RGB2BGR)
        img = wm_encoder.encode(img, 'dwtDct')
        img = Image.fromarray(img[:, :, ::-1])
    return img
Watermark Decoding:
The watermark can be extracted from generated images using the decoder:
from imwatermark import WatermarkDecoder

def testit(img_path):
    bgr = cv2.imread(img_path)
    decoder = WatermarkDecoder('bytes', 136)   # 136 bits = 17 bytes ("StableDiffusionV1")
    watermark = decoder.decode(bgr, 'dwtDct')
    dec = watermark.decode('utf-8')
    print(dec)  # Outputs: "StableDiffusionV1"
Integration in Generation Pipeline
Both safety and watermarking are applied sequentially after image decoding:
- Images are decoded from latent space
- Safety check is performed; unsafe images are replaced
- Watermark is embedded into all output images
- Images are saved to disk
This ensures every generated image carries both safety guarantees and provenance information.
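Pieced together from the snippets above, the tail of the generation loop looks roughly like this; it is a sketch rather than the exact script, and variables such as samples, wm_encoder, sample_path, and base_count are assumed to be set up earlier.

import os
import numpy as np
import torch
from PIL import Image

x_samples = model.decode_first_stage(samples)                        # 1. decode latents
x_samples = torch.clamp((x_samples + 1.0) / 2.0, min=0.0, max=1.0)
x_checked, has_nsfw = check_safety(
    x_samples.cpu().permute(0, 2, 3, 1).numpy())                     # 2. NSFW filter
for x in x_checked:
    img = Image.fromarray((255. * x).astype(np.uint8))
    img = put_watermark(img, wm_encoder)                             # 3. embed watermark
    img.save(os.path.join(sample_path, f"{base_count:05}.png"))      # 4. write to disk
    base_count += 1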