Overview
Relevant Files
README.md, setup.py, Stable_Diffusion_v1_Model_Card.md, main.py, ldm/ (core model implementation)
Stable Diffusion is a latent text-to-image diffusion model that generates high-quality images from text prompts. This repository contains the official implementation of Stable Diffusion v1, a collaboration between Stability AI, Runway, and the research team behind Latent Diffusion Models.
What is Stable Diffusion?
Stable Diffusion is a diffusion-based generative model that operates in the latent space of a pre-trained autoencoder. Unlike pixel-space diffusion models, this approach significantly reduces computational requirements while maintaining image quality. The model uses a frozen CLIP ViT-L/14 text encoder to condition generation on text prompts, similar to Google's Imagen architecture.
Key specifications:
- Model size: 860M UNet + 123M text encoder (relatively lightweight)
- Training resolution: 256×256 initially, then fine-tuned on 512×512
- Training data: LAION-5B subset (512M+ images with English captions)
- Hardware requirement: GPU with at least 10GB VRAM
Core Architecture
The model combines three main components:
- Text Encoder: CLIP ViT-L/14 converts text prompts into embeddings that guide generation
- Autoencoder: Compresses images to latent space (8× downsampling factor) for efficient processing
- Diffusion Model: UNet backbone with cross-attention layers that iteratively denoises latents conditioned on text embeddings
Capabilities
The model supports multiple image generation and modification tasks:
- Text-to-Image: Generate images from text descriptions using PLMS or DDIM sampling
- Image-to-Image: Modify existing images based on text prompts with controllable strength parameter
- Inpainting: Fill masked regions of images with generated content
- Upscaling: Enhance resolution of generated or existing images
Training Details
Stable Diffusion v1 was trained on 32×8 A100 GPUs with:
- Batch size: 2048 (32×8×2×4)
- Learning rate: 0.0001 (constant after 10k warmup steps)
- Optimizer: AdamW with gradient accumulation
- Training stages: 237k steps at 256×256, then 194k-515k steps at 512×512 depending on checkpoint version
Four checkpoint versions are provided, progressively improving through aesthetic filtering and classifier-free guidance (10% text-conditioning dropout in v1-3 and v1-4).
Usage
The repository provides reference sampling scripts and integrations:
- Reference script: scripts/txt2img.py for text-to-image generation with safety checker and watermarking
- Diffusers integration: Simple Python API via the HuggingFace diffusers library (see the sketch below)
- Training framework: PyTorch Lightning-based training pipeline in main.py supporting distributed training and checkpointing
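As a hedged illustration of the diffusers route, the snippet below loads the v1-4 checkpoint from the HuggingFace Hub; the model ID, dtype, and device choices are common defaults rather than requirements of this repository.

from diffusers import StableDiffusionPipeline
import torch

# Load Stable Diffusion v1-4 weights from the HuggingFace Hub
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
).to("cuda")

image = pipe("a photograph of an astronaut riding a horse").images[0]
image.save("astronaut.png")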
Important Considerations
The model reflects biases present in LAION-5B training data and has known limitations including imperfect photorealism, inability to render legible text, and difficulty with complex compositional tasks. The weights are research artifacts intended for research purposes, with safety mechanisms recommended for production deployment.
Architecture & Core Components
Relevant Files
ldm/models/diffusion/ddpm.py, ldm/models/autoencoder.py, ldm/modules/diffusionmodules/model.py, ldm/modules/attention.py, ldm/modules/encoders/modules.py
This repository implements Latent Diffusion Models, a generative framework that combines autoencoders with diffusion processes. The architecture operates in a compressed latent space rather than pixel space, enabling efficient training and inference.
Core Architecture Overview
The system consists of three main components working together:
- Autoencoder (First Stage Model) - Compresses images into a latent representation
- Diffusion Model (UNet) - Learns to denoise latent representations
- Conditioning Encoder - Encodes text or other conditions for guided generation
Autoencoder (ldm/models/autoencoder.py)
The autoencoder compresses images into a lower-dimensional latent space using an encoder-decoder architecture. Two variants are implemented:
- VQModel - Uses vector quantization for discrete latent codes
- AutoencoderKL - Uses a Gaussian distribution with KL divergence regularization
The encoder progressively downsamples the input through residual blocks and attention layers, while the decoder reconstructs the image. This compression reduces computational cost for the diffusion process by 4-16x depending on the compression factor.
Diffusion Model (ldm/models/diffusion/ddpm.py)
The DDPM (Denoising Diffusion Probabilistic Model) class implements the core diffusion training and sampling logic:
- Forward Process - Gradually adds Gaussian noise to latents over timesteps
- Reverse Process - Trains a UNet to predict and remove noise at each timestep
- Timestep Embedding - Sinusoidal positional encoding injected into the UNet
The LatentDiffusion subclass extends DDPM to work in latent space and supports conditioning through multiple mechanisms: concatenation, cross-attention, or hybrid approaches.
UNet Architecture (ldm/modules/diffusionmodules/model.py)
The diffusion model uses a U-shaped architecture with:
- Encoder Path - Downsampling blocks with residual connections
- Bottleneck - Middle layers with attention mechanisms
- Decoder Path - Upsampling blocks with skip connections from encoder
Key components include:
- ResnetBlock - Residual blocks with group normalization and timestep conditioning
- Attention Layers - Both spatial self-attention and linear attention variants
- Channel Multipliers - Progressive channel expansion at each resolution level
Attention Mechanisms (ldm/modules/attention.py)
Two attention implementations enable context integration:
- SpatialSelfAttention - Spatial attention within feature maps
- CrossAttention - Attends to external conditioning (text embeddings)
- LinearAttention - Efficient linear-complexity attention for high-resolution features
The SpatialTransformer module wraps transformer blocks for image-like data, projecting spatial features to sequence format and back.
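To make the cross-attention mechanism concrete, the sketch below implements a minimal multi-head cross-attention layer in the spirit of the CrossAttention module; the class name, dimensions, and example shapes are illustrative rather than copied from the implementation.

import torch
import torch.nn as nn

class CrossAttentionSketch(nn.Module):
    """Queries come from spatial features; keys/values come from the conditioning context."""
    def __init__(self, query_dim, context_dim, heads=8, dim_head=64):
        super().__init__()
        inner_dim = heads * dim_head
        self.heads = heads
        self.scale = dim_head ** -0.5
        self.to_q = nn.Linear(query_dim, inner_dim, bias=False)
        self.to_k = nn.Linear(context_dim, inner_dim, bias=False)
        self.to_v = nn.Linear(context_dim, inner_dim, bias=False)
        self.to_out = nn.Linear(inner_dim, query_dim)

    def forward(self, x, context=None):
        context = x if context is None else context  # falls back to self-attention
        q, k, v = self.to_q(x), self.to_k(context), self.to_v(context)

        # Split heads: (batch, seq, inner) -> (batch, heads, seq, dim_head)
        def split(t):
            return t.view(t.shape[0], t.shape[1], self.heads, -1).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(x.shape[0], x.shape[1], -1)
        return self.to_out(out)

# Example: 32x32 latent features (1024 tokens) attending to 77 text-token embeddings
layer = CrossAttentionSketch(query_dim=320, context_dim=768)
y = layer(torch.randn(2, 1024, 320), context=torch.randn(2, 77, 768))  # -> (2, 1024, 320)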
Conditioning System (ldm/modules/encoders/modules.py)
Multiple encoder types support different conditioning modalities:
- FrozenCLIPEmbedder - Encodes text prompts with a frozen CLIP text transformer (HuggingFace CLIPTextModel under the hood)
- BERTEmbedder - Alternative text encoding with BERT tokenizer
- ClassEmbedder - Class label embeddings for class-conditional generation
The DiffusionWrapper routes conditioning through the appropriate mechanism based on the conditioning_key parameter (concat, crossattn, hybrid, or adm).
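A condensed sketch of that routing logic follows; it paraphrases the behavior described above, with simplified argument handling rather than the exact DiffusionWrapper code.

import torch
import torch.nn as nn

class DiffusionWrapperSketch(nn.Module):
    def __init__(self, diffusion_model, conditioning_key=None):
        super().__init__()
        self.diffusion_model = diffusion_model      # the UNet
        self.conditioning_key = conditioning_key    # None, 'concat', 'crossattn', 'hybrid', or 'adm'

    def forward(self, x, t, c_concat=None, c_crossattn=None):
        if self.conditioning_key is None:
            return self.diffusion_model(x, t)                                   # unconditional
        if self.conditioning_key == 'concat':
            return self.diffusion_model(torch.cat([x] + c_concat, dim=1), t)    # channel concat
        if self.conditioning_key == 'crossattn':
            return self.diffusion_model(x, t, context=torch.cat(c_crossattn, dim=1))
        if self.conditioning_key == 'hybrid':
            return self.diffusion_model(torch.cat([x] + c_concat, dim=1), t,
                                        context=torch.cat(c_crossattn, dim=1))
        if self.conditioning_key == 'adm':
            return self.diffusion_model(x, t, y=c_crossattn[0])                 # class label input
        raise NotImplementedError(self.conditioning_key)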
Data Flow During Training
- Image is encoded to latent space via autoencoder
- Random timestep is sampled
- Noise is added to latent according to noise schedule
- Conditioning (text) is encoded
- UNet predicts noise given noisy latent, timestep, and condition
- Loss is computed between predicted and actual noise
- Gradients flow back through the trainable components (the first-stage autoencoder stays frozen; the conditioning encoder is updated only if configured as trainable)
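The steps above can be condensed into rough pseudocode. This is a minimal sketch, not the actual DDPM/LatentDiffusion code: q_sample stands in for the noise-schedule arithmetic, and autoencoder, text_encoder, and unet are assumed to be instantiated elsewhere.

import torch
import torch.nn.functional as F

def training_step_sketch(image, caption_tokens, num_timesteps=1000):
    with torch.no_grad():
        z = autoencoder.encode(image)         # 1. image -> latent (first stage is frozen)
        c = text_encoder(caption_tokens)      # 4. encode the text conditioning
    t = torch.randint(0, num_timesteps, (z.shape[0],), device=z.device)  # 2. sample timestep
    noise = torch.randn_like(z)
    z_noisy = q_sample(z, t, noise)           # 3. add noise according to the schedule
    noise_pred = unet(z_noisy, t, context=c)  # 5. UNet predicts the noise
    return F.mse_loss(noise_pred, noise)      # 6. loss between predicted and actual noise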
Data Flow During Sampling
- Start with random noise in latent space
- For each timestep (reversed):
- Condition is encoded
- UNet predicts noise
- Noise is removed from latent
- Optional masking applied for inpainting
- Final latent is decoded to image space via autoencoder
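A similarly rough sketch of the reverse loop (a simplified outline, not the DDIM/PLMS implementation; denoise_step is a placeholder for the sampler-specific update rule, and the mask blending used for inpainting is omitted):

import torch

@torch.no_grad()
def sampling_sketch(unet, autoencoder, c, shape, num_steps=50):
    z = torch.randn(shape)                         # start from pure noise in latent space
    for t in reversed(range(num_steps)):           # iterate timesteps in reverse
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = unet(z, t_batch, context=c)          # UNet predicts noise for this step
        z = denoise_step(z, eps, t)                # placeholder: remove noise per the schedule
    return autoencoder.decode(z)                   # decode the final latent to image space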
Training Pipeline
Relevant Files
main.py, ldm/data/base.py, ldm/data/imagenet.py, ldm/lr_scheduler.py
The training pipeline orchestrates model training using PyTorch Lightning, with configuration-driven setup for models, data, and optimization. The entry point is main.py, which loads YAML configs, instantiates components, and manages the training loop.
Configuration System
Training is controlled via YAML configuration files (in configs/) that define three main sections:
- Model: Specifies the model class, base learning rate, and architecture parameters
- Data: Defines the data module, batch size, and dataset configurations for train/validation/test splits
- Lightning: Optional trainer settings, logger, callbacks, and checkpointing behavior
Configs are merged left-to-right, allowing layered composition. Command-line arguments override config values using dot notation (e.g., model.params.key=value).
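The merge behavior can be reproduced with OmegaConf directly; the config path and override keys below are illustrative.

from omegaconf import OmegaConf

base = OmegaConf.load("configs/latent-diffusion/txt2img-1p4B-eval.yaml")
overrides = OmegaConf.from_dotlist(["model.base_learning_rate=5.0e-5",
                                    "data.params.batch_size=8"])
config = OmegaConf.merge(base, overrides)   # later arguments take precedence
print(config.data.params.batch_size)        # -> 8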
Data Loading Pipeline
Data Module Architecture
- DataModuleFromConfig: PyTorch Lightning data module that instantiates datasets from config
- Txt2ImgIterableBaseDataset: Base class for iterable datasets (text-to-image training)
- ImageNetTrain / ImageNetValidation: ImageNet dataset loaders with automatic download and extraction
- ImageNetSR: Super-resolution variant with image degradation pipeline
The DataModuleFromConfig class wraps dataset instantiation and creates PyTorch DataLoaders. For iterable datasets, a custom worker_init_fn distributes data across workers by splitting valid_ids. Non-iterable datasets use standard shuffling. Batch size is configurable, and worker count defaults to batch_size * 2.
Learning Rate Scheduling
Three scheduler implementations in ldm/lr_scheduler.py support different training strategies:
- LambdaWarmUpCosineScheduler: Linear warmup followed by cosine annealing decay
- LambdaWarmUpCosineScheduler2: Multi-cycle variant with configurable warmup and decay per cycle
- LambdaLinearScheduler: Linear warmup followed by linear decay
All schedulers emit a multiplicative factor (typically peaking at 1.0) that is applied to the model's configured learning rate via PyTorch's LambdaLR. The pipeline also supports optional learning rate scaling: lr = accumulate_grad_batches × num_gpus × batch_size × base_lr, as in the example below.
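A quick worked example of that scaling rule, with hypothetical run settings:

# Hypothetical run: 2 accumulation steps, 8 GPUs, per-GPU batch size 4
accumulate_grad_batches, num_gpus, batch_size = 2, 8, 4
base_lr = 1.0e-4                        # base_learning_rate from the model config
learning_rate = accumulate_grad_batches * num_gpus * batch_size * base_lr
print(learning_rate)                    # 0.0064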
Training Loop
The trainer is created with callbacks for checkpointing, image logging, learning rate monitoring, and CUDA memory tracking. Signal handlers allow checkpointing via SIGUSR1 and debugging via SIGUSR2. On exception, a checkpoint is automatically saved.
Key Callbacks
- SetupCallback: Creates log directories and saves configs at training start
- ImageLogger: Logs generated images at configurable batch frequency
- LearningRateMonitor: Tracks learning rate changes per step
- CUDACallback: Monitors GPU memory and epoch timing
- ModelCheckpoint: Saves best models based on monitored metrics (e.g., validation loss)
Resume & Checkpointing
Training can resume from a checkpoint via --resume flag. The system automatically loads the last checkpoint and previous configs. Checkpoints are saved to logs/{timestamp}_{name}/checkpoints/, with the latest always available as last.ckpt. Optional per-step checkpointing saves intermediate states without deletion.
Sampling & Inference
Relevant Files
scripts/txt2img.py, scripts/img2img.py, scripts/inpaint.py, ldm/models/diffusion/ddim.py, ldm/models/diffusion/plms.py, ldm/models/diffusion/dpm_solver/sampler.py
Sampling and inference are the core processes for generating images from text prompts or modifying existing images. The system supports multiple sampling algorithms, each with different speed-quality tradeoffs.
Sampling Algorithms
The codebase provides three primary samplers:
- DDIM (Denoising Diffusion Implicit Models) - Fast, deterministic sampling with configurable stochasticity via the eta parameter. Default choice for most tasks.
- PLMS (Pseudo Linear Multistep) - Higher-order solver for faster convergence. Requires eta=0 (deterministic only).
- DPM-Solver - Advanced ODE solver with multistep methods. Offers the best quality-speed balance with configurable order.
All samplers share a common interface: sample(S, batch_size, shape, conditioning, ...) where S is the number of steps.
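A hedged usage sketch of that interface, modeled on how the reference txt2img script drives the DDIM sampler; model is assumed to be a loaded LatentDiffusion instance, and c / uc are the conditional and unconditional embeddings described below. Keyword names may vary slightly between samplers.

from ldm.models.diffusion.ddim import DDIMSampler

sampler = DDIMSampler(model)
samples, _ = sampler.sample(
    S=50,                              # number of denoising steps
    batch_size=4,
    shape=[4, 64, 64],                 # latent channels, H/8, W/8 for 512x512 output
    conditioning=c,                    # text embeddings
    unconditional_guidance_scale=7.5,
    unconditional_conditioning=uc,     # empty-prompt embeddings
    eta=0.0,                           # deterministic DDIM
    verbose=False,
)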
Text Conditioning & Classifier-Free Guidance
Text prompts are encoded into a sequence of embeddings using a frozen CLIP text encoder. The get_learned_conditioning() method tokenizes the prompt and embeds it into a 77-token sequence of embeddings.
Classifier-free guidance enables control over prompt adherence:
# Unconditional embedding (empty prompt)
uc = model.get_learned_conditioning(batch_size * [""])
# Conditional embedding (actual prompt)
c = model.get_learned_conditioning(prompts)
# During sampling, guidance is applied:
# e_t = e_t_uncond + scale * (e_t_cond - e_t_uncond)
The unconditional_guidance_scale parameter controls strength (1.0 = no guidance, 7.5 = typical default).
Sampling Workflows
Text-to-Image (txt2img):
- Encode text prompt to conditioning
- Sample noise in latent space
- Iteratively denoise with guidance
- Decode latent to image via VAE
Image-to-Image (img2img):
- Encode input image to latent space
- Add noise based on the strength parameter (0.0 = no change, 1.0 = full regeneration)
- Denoise from the noisy latent with text guidance
- Decode result
Inpainting:
- Encode masked image and mask to latent space
- Concatenate mask with conditioning
- Denoise only masked regions
- Blend with original image
Key Parameters
- ddim_steps / S: Number of denoising steps (50-100 typical)
- ddim_eta: Stochasticity (0.0 = deterministic, 1.0 = maximum noise)
- scale: Guidance scale for prompt adherence
- strength: Image-to-image noise level (see the sketch below)
- seed: Reproducible sampling
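The sketch below shows how strength and ddim_steps interact in the image-to-image workflow. It is modeled on the reference img2img script, with sampler, model, c, uc, init_image, batch_size, and device assumed to be set up as in the earlier snippets; stochastic_encode and decode are the DDIM sampler's partial-noising and denoising entry points.

import torch

strength, ddim_steps = 0.75, 50
sampler.make_schedule(ddim_num_steps=ddim_steps, ddim_eta=0.0)
t_enc = int(strength * ddim_steps)                       # how many noising steps to apply

init_latent = model.get_first_stage_encoding(
    model.encode_first_stage(init_image))                # encode the input image
z_enc = sampler.stochastic_encode(
    init_latent, torch.tensor([t_enc] * batch_size).to(device))  # partially noise it
samples = sampler.decode(z_enc, c, t_enc,
                         unconditional_guidance_scale=7.5,
                         unconditional_conditioning=uc)  # denoise with text guidance
x_samples = model.decode_first_stage(samples)            # back to image space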
Latent Space Operations
All sampling occurs in compressed latent space (8x downsampling). The VAE encoder/decoder handles conversion:
# Encode image to latent
z = model.encode_first_stage(image)
# Decode latent to image
x_sample = model.decode_first_stage(z)
This reduces computation while preserving semantic information.
Unconditional Guidance Implementation
During each denoising step, the model predicts noise for both conditional and unconditional inputs:
x_in = torch.cat([x] * 2)   # Duplicate batch
t_in = torch.cat([t] * 2)   # Duplicate timesteps to match
c_in = torch.cat([uc, c])   # Unconditional + conditional
e_t_uncond, e_t = model.apply_model(x_in, t_in, c_in).chunk(2)
e_t = e_t_uncond + scale * (e_t - e_t_uncond)
This doubles computation but enables fine-grained control over generation.
Text Conditioning & Encoders
Relevant Files
ldm/modules/encoders/modules.py, ldm/modules/x_transformer.py, ldm/models/diffusion/ddpm.py, ldm/modules/attention.py
Text conditioning is the mechanism that allows diffusion models to generate images guided by text prompts, class labels, or other conditioning signals. The system converts raw conditioning inputs into learned embeddings that the diffusion model uses during generation.
Encoder Architecture
The codebase provides multiple encoder implementations for different conditioning modalities:
FrozenCLIPEmbedder - The primary text encoder used in Stable Diffusion. It leverages OpenAI's CLIP model to encode text prompts into 768-dimensional embeddings. The model is frozen (non-trainable) to preserve CLIP's semantic understanding.
BERTEmbedder - An alternative text encoder combining BERT tokenization with custom transformer layers. Useful for models trained before CLIP integration or for specialized text understanding tasks.
TransformerEmbedder - A lightweight custom transformer encoder that tokenizes and embeds text using configurable transformer layers.
ClassEmbedder - Handles class-conditional generation by embedding discrete class labels (e.g., ImageNet classes) into learned embeddings.
SpatialRescaler - Preprocesses spatial conditioning inputs (images, segmentation maps) by resizing and optionally remapping channels.
Encoding Pipeline
# Text encoding flow (illustrative; mirrors what the CLIP-based encoder does internally)
text_input = "a dog wearing sunglasses"
tokens = tokenizer(text_input, truncation=True, max_length=77,
                   padding="max_length", return_tensors="pt")
embeddings = encoder(tokens.input_ids)  # Shape: [batch, 77, 768]
The encoding process follows these steps:
- Tokenization - Convert text to token IDs using the encoder's tokenizer
- Token Embedding - Map token IDs to dense vectors
- Positional Encoding - Add position information to preserve sequence order
- Transformer Processing - Apply attention layers to contextualize embeddings
- Output - Return sequence of embeddings for cross-attention in the diffusion model
Integration with Diffusion Model
The LatentDiffusion class manages conditioning through the cond_stage_model and conditioning_key parameters:
- conditioning_key='crossattn' - Embeddings are passed to cross-attention layers in the UNet. The diffusion model attends to text embeddings at each denoising step.
- conditioning_key='concat' - Embeddings are concatenated with the noisy latent before processing.
- conditioning_key='hybrid' - Combines both concatenation and cross-attention.
- conditioning_key='adm' - Class embeddings are passed to the UNet as label inputs and added to the timestep embeddings (ADM-style class conditioning).
Key Design Patterns
Frozen Encoders - Text encoders are typically frozen during diffusion training to preserve pre-trained semantic knowledge. This reduces training cost and improves stability.
Fixed Sequence Length - All text encoders use a fixed maximum sequence length (typically 77 tokens). Longer text is truncated; shorter text is padded.
Embedding Dimension Matching - The encoder output dimension must match the context_dim parameter in the UNet's spatial transformer blocks (e.g., 768 for CLIP, 1280 for larger models).
Batch Processing - Encoders process entire batches of text simultaneously, enabling efficient GPU utilization during training and inference.
Latent Space & Autoencoders
Relevant Files
ldm/models/autoencoder.py, ldm/modules/diffusionmodules/model.py, ldm/modules/distributions/distributions.py
Latent Diffusion Models operate in a compressed latent space rather than pixel space, dramatically reducing computational cost. The autoencoder is the first-stage model that learns this compression, enabling efficient diffusion training and inference.
Why Latent Space?
Working in latent space provides several advantages:
- Computational Efficiency - Reduces memory and compute by 4–16x depending on compression factor
- Semantic Compression - Learns meaningful representations rather than pixel-level details
- Faster Diffusion - Fewer timesteps needed for denoising in compressed space
- Better Generalization - Focuses on high-level image structure
Autoencoder Architecture
The autoencoder consists of three components:
- Encoder - Progressively downsamples input images through residual blocks and attention layers
- Quantization/Distribution Layer - Compresses to discrete codes (VQ) or continuous distribution (KL)
- Decoder - Reconstructs images by upsampling from latent codes
# Encoding pipeline
h = self.encoder(x) # Downsample: 256x256 → 32x32
h = self.quant_conv(h) # Project to latent dimension
z = self.quantize(h) # Quantize or sample
Two Autoencoder Variants
VQModel (Vector Quantization)
Uses discrete codebook entries for latent representation:
- Encoder outputs are mapped to nearest codebook vectors
- Produces discrete latent codes (indices into codebook)
- Supports decode_code() for direct code-to-image generation
- Configuration: n_embed (codebook size), embed_dim (code dimension)
AutoencoderKL (Variational)
Uses continuous Gaussian distributions with KL regularization:
- Encoder outputs mean and log-variance parameters
- Samples from DiagonalGaussianDistribution during training
- Uses deterministic mode (mean) during inference
- Supports stochastic sampling for diversity
# AutoencoderKL encoding
posterior = self.encode(x) # Returns DiagonalGaussianDistribution
z = posterior.sample() # Stochastic: mean + std * noise
z = posterior.mode() # Deterministic: just mean
Latent Distribution
The DiagonalGaussianDistribution class handles probabilistic sampling:
- Parameters - Splits encoder output into mean and log-variance
- Clamping - Log-variance clamped to [–30, 20] for stability
- Sampling - z = mean + std * randn() (reparameterization trick)
- KL Divergence - Computed against a standard normal for regularization
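A condensed sketch of this distribution class, paraphrasing the behavior listed above (close in spirit to, but not copied verbatim from, ldm/modules/distributions/distributions.py):

import torch

class DiagonalGaussianSketch:
    def __init__(self, parameters):
        self.mean, logvar = torch.chunk(parameters, 2, dim=1)  # split channels into mean / logvar
        self.logvar = torch.clamp(logvar, -30.0, 20.0)         # clamp for numerical stability
        self.std = torch.exp(0.5 * self.logvar)

    def sample(self):
        return self.mean + self.std * torch.randn_like(self.mean)  # reparameterization trick

    def mode(self):
        return self.mean                                       # deterministic latent

    def kl(self):
        # KL divergence to a standard normal, summed over non-batch dimensions
        return 0.5 * torch.sum(self.mean ** 2 + torch.exp(self.logvar) - 1.0 - self.logvar,
                               dim=[1, 2, 3])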
Integration with Diffusion
The diffusion model receives latent codes produced by the autoencoder.
During training, the autoencoder is frozen and the diffusion model learns to denoise latent representations. At inference, the pipeline reverses: sample latents from noise, then decode to image space.
Configuration Parameters
Key settings in autoencoder configs:
- z_channels - Latent feature channels (typically 3–16)
- double_z - For KL models, outputs 2x channels for mean and logvar
- ch_mult - Channel multipliers for encoder/decoder blocks
- attn_resolutions - Resolutions where attention is applied
- embed_dim - Embedding dimension for the latent codes / VQ codebook
Configuration & Utilities
Relevant Files
ldm/util.py, configs/stable-diffusion/v1-inference.yaml, configs/latent-diffusion/txt2img-1p4B-eval.yaml, main.py
Configuration System
The codebase uses a declarative YAML-based configuration system powered by OmegaConf. Configuration files define model architectures, training parameters, and data pipelines in a hierarchical structure. The instantiate_from_config() function dynamically instantiates Python classes from config dictionaries, enabling flexible model composition without code changes.
Core Configuration Pattern
Each config file follows a standard structure with three main sections:
model:
  base_learning_rate: 1.0e-04
  target: ldm.models.diffusion.ddpm.LatentDiffusion
  params:
    # Model-specific parameters
    timesteps: 1000
    channels: 4
    unet_config:
      target: ldm.modules.diffusionmodules.openaimodel.UNetModel
      params:
        model_channels: 320
        attention_resolutions: [4, 2, 1]
    first_stage_config:
      target: ldm.models.autoencoder.AutoencoderKL
      params:
        embed_dim: 4
    cond_stage_config:
      target: ldm.modules.encoders.modules.FrozenCLIPEmbedder

data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 64
    train:
      target: ldm.data.imagenet.ImageNetTrain
      params:
        config:
          size: 256
The target field specifies the full Python import path (e.g., ldm.models.diffusion.ddpm.LatentDiffusion), and params contains constructor arguments.
Dynamic Instantiation
How instantiate_from_config Works
- instantiate_from_config(config) reads a config dict and returns an instantiated object
- Extracts the target string and uses get_obj_from_str() to dynamically import the class
- Passes config["params"] as keyword arguments to the class constructor
- Supports special values: '__is_first_stage__' and '__is_unconditional__' return None
- Nested configs (e.g., unet_config, first_stage_config) are instantiated in turn by the parent class's constructor
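A condensed sketch of these helpers (close to, but not verbatim, the code in ldm/util.py):

import importlib

def get_obj_from_str(string):
    module, cls = string.rsplit(".", 1)
    return getattr(importlib.import_module(module), cls)

def instantiate_from_config(config):
    if "target" not in config:
        if config in ("__is_first_stage__", "__is_unconditional__"):
            return None
        raise KeyError("Expected key `target` to instantiate.")
    return get_obj_from_str(config["target"])(**config.get("params", dict()))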
Utility Functions
The ldm/util.py module provides essential helper functions:
- count_params(model, verbose=False) – Counts total model parameters (reported in millions when verbose)
- isimage(x) – Checks if a tensor is an image (4D with 1 or 3 channels)
- ismap(x) – Checks if a tensor is a feature map (4D with >3 channels)
- exists(x) – Returns True if x is not None
- default(val, d) – Returns val if it exists, otherwise d (callable or value)
- mean_flat(tensor) – Computes the mean over all non-batch dimensions
- log_txt_as_img(wh, xc, size=10) – Renders text captions as image tensors for logging
- parallel_data_prefetch(func, data, n_proc, ...) – Parallelizes data preprocessing across CPU cores or threads
Configuration Loading in Training
The training pipeline (main.py) loads configs using OmegaConf:
configs = [OmegaConf.load(cfg) for cfg in opt.base]
cli = OmegaConf.from_dotlist(unknown)
config = OmegaConf.merge(*configs, cli)
model = instantiate_from_config(config.model)
data = instantiate_from_config(config.data)
Multiple config files can be merged, and command-line arguments override YAML values. This enables easy experimentation with different model sizes, datasets, and hyperparameters.
Configuration Variants
Different model variants are defined in separate YAML files:
- Stable Diffusion v1 (v1-inference.yaml) – Uses the CLIP embedder, 768-dim context
- Text-to-Image 1.4B (txt2img-1p4B-eval.yaml) – Uses the BERT embedder, 1280-dim context
- Latent Diffusion variants – Different autoencoder and conditioning configurations
Each variant specifies its own UNet architecture, conditioning mechanism, and first-stage autoencoder, allowing rapid prototyping of different model configurations.
Safety & Watermarking
Relevant Files
scripts/txt2img.py, scripts/tests/test_watermark.py
Stable Diffusion implements two complementary safety mechanisms: NSFW content detection and invisible watermarking. These systems work together to reduce harmful outputs and help identify machine-generated images.
Safety Checker: NSFW Detection
The safety checker uses a pre-trained CLIP-based model to detect and filter potentially unsafe content before images are saved.
Architecture:
The system loads a specialized safety model from Hugging Face:
from transformers import AutoFeatureExtractor
from diffusers.pipelines.stable_diffusion.safety_checker import StableDiffusionSafetyChecker

safety_model_id = "CompVis/stable-diffusion-safety-checker"
safety_feature_extractor = AutoFeatureExtractor.from_pretrained(safety_model_id)
safety_checker = StableDiffusionSafetyChecker.from_pretrained(safety_model_id)
Detection Pipeline:
- Feature Extraction - Images are converted to PIL format and processed by the feature extractor
- Classification - The safety checker analyzes pixel values and CLIP embeddings to detect NSFW concepts
- Replacement - If unsafe content is detected, the image is replaced with a fallback image (assets/rick.jpeg)
def check_safety(x_image):
    safety_checker_input = safety_feature_extractor(numpy_to_pil(x_image), return_tensors="pt")
    x_checked_image, has_nsfw_concept = safety_checker(images=x_image, clip_input=safety_checker_input.pixel_values)
    for i in range(len(has_nsfw_concept)):
        if has_nsfw_concept[i]:
            x_checked_image[i] = load_replacement(x_checked_image[i])
    return x_checked_image, has_nsfw_concept
The function returns both the checked images and a boolean array indicating which samples triggered the safety filter.
Invisible Watermarking
Invisible watermarks are embedded into generated images using discrete wavelet transform (DWT) and discrete cosine transform (DCT) techniques. This helps identify images as machine-generated without visible artifacts.
Watermark Encoding:
During image generation, a watermark encoder is initialized with the model identifier:
from imwatermark import WatermarkEncoder

wm_encoder = WatermarkEncoder()
wm_encoder.set_watermark('bytes', "StableDiffusionV1".encode('utf-8'))
The watermark is applied to both individual samples and grid outputs:
def put_watermark(img, wm_encoder=None):
    if wm_encoder is not None:
        img = cv2.cvtColor(np.array(img), cv2.COLOR_RGB2BGR)
        img = wm_encoder.encode(img, 'dwtDct')
        img = Image.fromarray(img[:, :, ::-1])
    return img
Watermark Decoding:
The watermark can be extracted from generated images using the decoder:
from imwatermark import WatermarkDecoder

def testit(img_path):
    bgr = cv2.imread(img_path)
    decoder = WatermarkDecoder('bytes', 136)   # 136 bits = 17 bytes ("StableDiffusionV1")
    watermark = decoder.decode(bgr, 'dwtDct')
    dec = watermark.decode('utf-8')
    print(dec)  # Outputs: "StableDiffusionV1"
Integration in Generation Pipeline
Both safety and watermarking are applied sequentially after image decoding:
- Images are decoded from latent space
- Safety check is performed; unsafe images are replaced
- Watermark is embedded into all output images
- Images are saved to disk
This ensures every generated image carries both safety guarantees and provenance information.
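Pieced together from the snippets above, the tail of the generation loop looks roughly like this; it is a sketch rather than the exact script, and variables such as samples, wm_encoder, sample_path, and base_count are assumed to be set up earlier.

import os
import numpy as np
import torch
from PIL import Image

x_samples = model.decode_first_stage(samples)                        # 1. decode latents
x_samples = torch.clamp((x_samples + 1.0) / 2.0, min=0.0, max=1.0)
x_checked, has_nsfw = check_safety(
    x_samples.cpu().permute(0, 2, 3, 1).numpy())                     # 2. NSFW filter
for x in x_checked:
    img = Image.fromarray((255. * x).astype(np.uint8))
    img = put_watermark(img, wm_encoder)                             # 3. embed watermark
    img.save(os.path.join(sample_path, f"{base_count:05}.png"))      # 4. write to disk
    base_count += 1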