
huggingface/transformers

Hugging Face Transformers Library

Last updated on Dec 18, 2025 (Commit: 3e4baf8)

Overview

Relevant Files
  • README.md
  • src/transformers/__init__.py
  • docs/source/en/index.md
  • src/transformers/models/
  • src/transformers/pipelines/
  • src/transformers/tokenization_utils_base.py
  • src/transformers/processing_utils.py
  • src/transformers/modeling_utils.py

Transformers is a unified model-definition framework for state-of-the-art machine learning across text, computer vision, audio, video, and multimodal tasks. It serves as the central hub for model definitions, enabling compatibility across training frameworks (PyTorch, JAX, TensorFlow), inference engines (vLLM, SGLang, TGI), and adjacent libraries (llama.cpp, mlx).

Core Purpose

The library centralizes model definitions so the ecosystem agrees on a single source of truth. Rather than reimplementing models for each framework or inference engine, developers define a model once in Transformers and it works everywhere. This reduces maintenance burden and democratizes access to state-of-the-art models.


Key Components

Models (src/transformers/models/): Over 400 model architectures including BERT, GPT, T5, Vision Transformers, and multimodal models. Each model inherits from PreTrainedModel and provides PyTorch implementations with consistent APIs.

Pipelines (src/transformers/pipelines/): High-level inference interface abstracting away preprocessing and postprocessing. Supports 30+ tasks: text generation, image classification, question answering, automatic speech recognition, and more.

Tokenizers (src/transformers/tokenization_utils_base.py): Convert raw text to token IDs. Base class PreTrainedTokenizerBase provides unified interface for both slow (Python) and fast (Rust-based) tokenizers.

Processors (src/transformers/processing_utils.py): Combine tokenizers with feature extractors and image processors for multimodal inputs. Handle audio, images, and text in a single unified interface.

Trainer (src/transformers/trainer.py): Comprehensive training loop supporting mixed precision, distributed training (FSDP, DeepSpeed), gradient checkpointing, and optimization strategies.
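
These components share a consistent from_pretrained() interface. A minimal sketch wiring a tokenizer and a task-specific model together (the classification head on bert-base-uncased is randomly initialized here, since the base checkpoint has no task head):

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

inputs = tokenizer("Transformers keeps the APIs consistent.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch_size, num_labels)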

Design Philosophy

The library prioritizes self-contained model files over shared abstractions. Each model implementation is complete and readable, allowing researchers to iterate quickly without diving into complex inheritance hierarchies. Code duplication is managed through "Copied from" comments that enable automated synchronization across models.

Ecosystem Integration

Transformers models integrate with 1M+ pretrained checkpoints on Hugging Face Hub. The library supports quantization (bitsandbytes, GPTQ, AWQ), parameter-efficient fine-tuning (PEFT), and distributed training frameworks, making it production-ready for inference and training at scale.

Architecture & Core Components

Relevant Files
  • src/transformers/modeling_utils.py
  • src/transformers/configuration_utils.py
  • src/transformers/models/__init__.py
  • AGENTS.md

The Hugging Face Transformers library is built on a modular architecture centered around three core components: configurations, models, and utilities. This design enables flexible model composition while maintaining consistency across 400+ model architectures.


PreTrainedConfig

The PreTrainedConfig class is the foundation for model configuration. It:

  • Stores all hyperparameters needed to reconstruct a model (hidden size, number of layers, attention heads, etc.)
  • Handles loading and saving configurations to JSON files
  • Provides a standardized interface for all model types
  • Supports model-specific attributes via subclassing

Each model architecture (BERT, GPT, T5, etc.) has its own config class inheriting from PreTrainedConfig.
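
For example, a config can be loaded from the Hub, inspected, adjusted, and written back to disk (sketched here with the BERT config class):

from transformers import BertConfig

config = BertConfig.from_pretrained("bert-base-uncased")
print(config.hidden_size, config.num_hidden_layers, config.num_attention_heads)

# Override a hyperparameter and persist the result as config.json
config.hidden_dropout_prob = 0.2
config.save_pretrained("./my_bert_config")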

PreTrainedModel

The PreTrainedModel class is the base for all model implementations. It provides:

  • Loading & Saving: from_pretrained() and save_pretrained() methods for Hub integration
  • Weight Management: Handles state dict loading, sharding, and device placement
  • Embedding Access: Unified interface for input/output embeddings via EmbeddingAccessMixin
  • Module Utilities: Device and dtype properties via ModuleUtilsMixin
  • Adapter Support: PEFT integration via PeftAdapterMixin

Key class attributes allow customization:

  • config_class: The configuration class for this model
  • base_model_prefix: Identifies the base model in composite architectures
  • main_input_name: Primary input tensor name (e.g., input_ids, pixel_values)
  • _no_split_modules: Modules to keep together during distributed training
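
As an illustration, a model package typically declares these attributes as class-level constants. The sketch below uses hypothetical MyModel* names (the config base class is exported as PretrainedConfig in current releases):

from transformers import PretrainedConfig, PreTrainedModel

class MyModelConfig(PretrainedConfig):           # hypothetical config subclass
    model_type = "my_model"

class MyModelPreTrainedModel(PreTrainedModel):   # hypothetical model base class
    config_class = MyModelConfig
    base_model_prefix = "my_model"               # attribute name of the base model
    main_input_name = "input_ids"                # primary input tensor
    _no_split_modules = ["MyModelDecoderLayer"]  # keep these modules on one device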

Model Organization

Models are organized hierarchically in /src/transformers/models/:

  • Base Models: Core architecture (e.g., BertModel, GPT2Model)
  • Task-Specific Models: Add heads for specific tasks (e.g., BertForSequenceClassification)
  • Modular Files: New models use modular_*.py files that compose existing components

The auto module provides factory classes (AutoModel, AutoConfig, AutoTokenizer) that automatically select the correct class based on model type.

Mixins & Utilities

ModuleUtilsMixin provides:

  • Device and dtype property access
  • Attention mask creation and manipulation
  • Parameter counting utilities

EmbeddingAccessMixin provides:

  • Unified embedding getter/setter interface
  • Support for different embedding layouts (direct, nested, encoder-decoder)

PeftAdapterMixin enables:

  • Parameter-Efficient Fine-Tuning (LoRA, prefix tuning, etc.)
  • Adapter loading and management
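
In practice these mixins surface as ordinary attributes and methods on any loaded model, for example:

from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")

print(model.device, model.dtype)           # ModuleUtilsMixin properties
print(model.num_parameters())              # parameter counting utility
embeddings = model.get_input_embeddings()  # embedding accessor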

Code Reuse Strategy

The codebase uses two mechanisms to maintain consistency:

  1. "Copied from" Comments: Functions marked with # Copied from transformers.models.llama.modeling_llama.rotate_half are automatically synchronized via make fixup
  2. Modular Files: New models prefer composition over duplication, with modular_*.py files auto-generating complete implementations

This approach balances self-contained model files with DRY principles, ensuring each model is independently understandable while staying synchronized with shared components.

Tokenization System

Relevant Files
  • src/transformers/tokenization_utils_base.py
  • src/transformers/tokenization_utils_tokenizers.py
  • src/transformers/tokenization_python.py
  • src/transformers/tokenization_utils_sentencepiece.py
  • src/transformers/convert_slow_tokenizer.py

Architecture Overview

The tokenization system provides a unified interface for converting raw text into token IDs that models can process. It supports multiple backends: slow tokenizers (pure Python), fast tokenizers (Rust-based via the tokenizers library), and SentencePiece-based tokenizers. All backends inherit from PreTrainedTokenizerBase, ensuring consistent APIs across implementations.


Key Components

PreTrainedTokenizerBase is the abstract base class defining the core interface. It manages:

  • Special tokens (bos, eos, unk, sep, pad, cls, mask)
  • Vocabulary conversion (tokens <-> IDs)
  • Padding, truncation, and sequence length handling
  • Loading/saving tokenizer configurations

Backend Implementations:

  • PythonBackend: Pure Python tokenizers with full control over tokenization logic
  • TokenizersBackend: Wraps the Rust-based tokenizers library for speed and alignment tracking
  • SentencePieceBackend: Handles SentencePiece models (.model files)

Encoding Pipeline

The __call__ method orchestrates the encoding process:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoding = tokenizer(
    "Hello world",
    max_length=512,
    padding="max_length",
    truncation=True,
    return_tensors="pt"
)
# Returns: BatchEncoding with input_ids, attention_mask, token_type_ids

Steps:

  1. Normalization: Clean and standardize text
  2. Pre-tokenization: Split into words/subwords
  3. Tokenization: Convert to token strings
  4. Post-processing: Add special tokens, apply attention masks
  5. Conversion: Map tokens to IDs

BatchEncoding

BatchEncoding wraps tokenizer outputs as a dictionary-like object with additional methods for fast tokenizers:

  • Alignment methods: char_to_token(), token_to_chars() map between character and token spaces
  • Tensor conversion: Automatic conversion to PyTorch/NumPy tensors via return_tensors parameter
  • Batch indexing: Access individual samples or slices of batches
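
With a fast tokenizer, the alignment helpers map between character offsets and token indices, for example:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # fast by default
encoding = tokenizer("Hello world", return_tensors="pt")

token_index = encoding.char_to_token(6)           # character 6 ("w" in "world") -> token index
char_span = encoding.token_to_chars(token_index)  # back to a (start, end) character span
print(token_index, char_span)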

Special Tokens Management

Special tokens are managed through named attributes and extra tokens:

tokenizer.add_special_tokens({
    "cls_token": "[CLS]",
    "sep_token": "[SEP]",
    "pad_token": "[PAD]"
})

The system distinguishes between:

  • Named special tokens: Standard attributes (bos, eos, unk, sep, pad, cls, mask)
  • Model-specific tokens: Custom tokens for multimodal or domain-specific models
  • Extra special tokens: Additional tokens beyond the standard set
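
When new tokens are added, the model's embedding matrix usually has to grow to match the new vocabulary size. A typical pattern (sketched with GPT-2, which ships without a pad token):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

num_added = tokenizer.add_special_tokens({"pad_token": "[PAD]"})  # named special token
num_added += tokenizer.add_tokens(["<entity>"])                   # extra domain token

# Grow the input (and tied output) embeddings to cover the new tokens
model.resize_token_embeddings(len(tokenizer))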

Slow-to-Fast Conversion

The convert_slow_tokenizer.py module converts slow tokenizers to fast equivalents. Key converters include:

  • SentencePieceExtractor: Extracts vocab and merges from .model files
  • BertConverter: Converts BERT tokenizers to WordPiece-based fast tokenizers
  • BpeConverter: Handles BPE tokenizers with merge operations

This enables fast inference while maintaining compatibility with existing slow tokenizer implementations.
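
A sketch of converting a slow tokenizer and wrapping the result as a fast one (assuming the Rust tokenizers package is installed; the exact wrapper class may vary by release):

from transformers import BertTokenizer, PreTrainedTokenizerFast
from transformers.convert_slow_tokenizer import convert_slow_tokenizer

slow_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Build a Rust-backed tokenizers.Tokenizer from the slow implementation ...
backend = convert_slow_tokenizer(slow_tokenizer)

# ... then wrap it in the fast tokenizer class used throughout the library
fast_tokenizer = PreTrainedTokenizerFast(tokenizer_object=backend)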

Loading and Saving

Tokenizers are loaded via from_pretrained() and saved with save_pretrained():

# Load from Hub
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Save locally
tokenizer.save_pretrained("./my_tokenizer")

File formats:

  • Fast tokenizers: Single tokenizer.json file (config + vocab + added tokens)
  • Slow tokenizers: Multiple files (vocab, special_tokens_map.json, tokenizer_config.json, added_tokens.json)

Pipelines & Inference

Relevant Files
  • src/transformers/pipelines/base.py
  • src/transformers/pipelines/text_generation.py
  • src/transformers/pipelines/image_classification.py
  • src/transformers/pipelines/__init__.py

Pipelines provide a high-level, task-oriented interface for running inference with transformer models. They abstract away the complexity of preprocessing, model inference, and postprocessing, making it easy to use models for common NLP and computer vision tasks.

Pipeline Architecture

The pipeline system follows a standardized workflow:

Input → Preprocess → Forward → Postprocess → Output

Every pipeline inherits from the base Pipeline class and implements four key methods:

  • _sanitize_parameters() - Validates and organizes parameters from __init__ and __call__ into three dictionaries: preprocess, forward, and postprocess parameters.
  • preprocess() - Converts raw input (text, images, audio) into model-ready tensors using tokenizers, image processors, or feature extractors.
  • _forward() - Runs the model inference on preprocessed tensors.
  • postprocess() - Transforms raw model outputs into user-friendly results (e.g., class labels with scores).
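
Putting the four methods together, a custom pipeline might look like the following sketch (the class name is hypothetical; it assumes a sequence classification model and tokenizer are supplied at construction time):

from transformers import Pipeline

class SimpleClassificationPipeline(Pipeline):
    def _sanitize_parameters(self, top_k=None, **kwargs):
        postprocess_kwargs = {}
        if top_k is not None:
            postprocess_kwargs["top_k"] = top_k
        return {}, {}, postprocess_kwargs  # preprocess, forward, postprocess params

    def preprocess(self, text, **kwargs):
        return self.tokenizer(text, return_tensors=self.framework)

    def _forward(self, model_inputs, **kwargs):
        return self.model(**model_inputs)

    def postprocess(self, model_outputs, top_k=1):
        scores = model_outputs.logits.softmax(dim=-1)[0]
        best = scores.topk(top_k)
        return [{"label": self.model.config.id2label[i.item()], "score": s.item()}
                for s, i in zip(best.values, best.indices)]

A class like this can be registered under a task name through the pipeline registry described below.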

Creating and Using Pipelines

The pipeline() factory function is the primary entry point:

from transformers import pipeline

# Create a pipeline by task name
classifier = pipeline("text-classification")
result = classifier("This movie is great!")

# Specify a custom model
generator = pipeline("text-generation", model="gpt2")
output = generator("Once upon a time")

# Pass preprocessing options
qa_pipeline = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")
answer = qa_pipeline(question="What is AI?", context="AI is...")

Pipeline Registry

Pipelines are registered in PIPELINE_REGISTRY, which maps task names to pipeline implementations. The registry stores:

  • Task name - Identifier like "text-classification" or "image-segmentation"
  • Pipeline class - The implementation (e.g., TextClassificationPipeline)
  • Model classes - Compatible model types for the task
  • Default model - A pre-trained model to use if none is specified

from transformers import AutoModelForSequenceClassification
from transformers.pipelines import PIPELINE_REGISTRY

# Register a custom pipeline
PIPELINE_REGISTRY.register_pipeline(
    "custom-task",
    pipeline_class=MyCustomPipeline,
    pt_model=AutoModelForSequenceClassification,
    default={"model": ("user/model-name", "revision")}
)

Batch Processing and Iteration

Pipelines support efficient batch processing:

# Process multiple inputs at once
texts = ["Great movie!", "Terrible experience", "Not bad"]
results = classifier(texts)

# Use with datasets for large-scale inference
from datasets import load_dataset
dataset = load_dataset("glue", "sst2", split="validation")
predictions = classifier(dataset["sentence"], batch_size=32, num_workers=4)

Device Management

Pipelines automatically handle device placement (CPU, GPU, TPU):

# Specify device
pipe = pipeline("text-generation", device=0)  # GPU 0

# Use device context manager
with pipe.device_placement():
    output = pipe("Hello world")

Common Pipeline Tasks

The library includes 30+ built-in pipelines:

  • Text - text-classification, text-generation, token-classification, question-answering, summarization, translation
  • Vision - image-classification, object-detection, image-segmentation, depth-estimation
  • Audio - automatic-speech-recognition, audio-classification, text-to-audio
  • Multimodal - visual-question-answering, image-to-text, document-question-answering

Each pipeline is optimized for its specific task with appropriate preprocessing and postprocessing logic.

Text Generation & Decoding

Relevant Files
  • src/transformers/generation/utils.py
  • src/transformers/generation/logits_process.py
  • src/transformers/generation/stopping_criteria.py
  • src/transformers/generation/configuration_utils.py

The text generation system in Transformers provides a flexible framework for auto-regressive decoding with multiple strategies, logits processors, and stopping criteria. The core entry point is the generate() method, which orchestrates the entire generation pipeline.

Decoding Strategies

The GenerationMixin.generate() method supports several decoding strategies controlled by parameter combinations:

  • Greedy Search (num_beams=1, do_sample=False): Selects the highest probability token at each step. Fast but may miss better sequences.
  • Multinomial Sampling (num_beams=1, do_sample=True): Randomly samples from the probability distribution. Enables diversity through temperature and top-k/top-p filtering.
  • Beam Search (num_beams>1, do_sample=False): Maintains multiple hypotheses and explores the most promising paths. Balances quality and diversity.
  • Beam Sampling (num_beams>1, do_sample=True): Combines beam search with sampling for controlled diversity.
  • Assisted Generation: Uses a smaller assistant model to generate candidate tokens, which the main model validates. Speeds up decoding without quality loss.
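
These strategies are selected purely through generate() arguments; a short sketch with GPT-2:

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tokenizer("Once upon a time", return_tensors="pt")

greedy = model.generate(**inputs, max_new_tokens=20)                 # greedy search
sampled = model.generate(**inputs, max_new_tokens=20,
                         do_sample=True, temperature=0.8, top_p=0.9)  # multinomial sampling
beams = model.generate(**inputs, max_new_tokens=20, num_beams=4)      # beam search

print(tokenizer.decode(greedy[0], skip_special_tokens=True))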

Logits Processing Pipeline

Logits processors modify token probabilities before sampling. They form a LogitsProcessorList that applies transformations sequentially:

class LogitsProcessor:
    def __call__(self, input_ids: torch.LongTensor, 
                 scores: torch.FloatTensor) -> torch.FloatTensor:
        # Modify and return scores
        pass

Common processors include:

  • Temperature: Scales logits to control randomness (scores / temperature)
  • Top-K/Top-P: Filters to top-k tokens or cumulative probability threshold
  • Repetition Penalty: Penalizes previously generated tokens to reduce repetition
  • Min/Max Length: Enforces length constraints, e.g. masking the EOS token (score set to -inf) until a minimum length is reached, or forcing EOS at the maximum length
  • Forced Tokens: Ensures specific tokens appear at designated positions
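
Custom processors follow the same interface and are passed to generate() through the logits_processor argument. A minimal sketch that forbids one token id (the id used here is arbitrary):

import torch
from transformers import LogitsProcessor, LogitsProcessorList

class BanTokenLogitsProcessor(LogitsProcessor):  # hypothetical custom processor
    def __init__(self, banned_token_id: int):
        self.banned_token_id = banned_token_id

    def __call__(self, input_ids: torch.LongTensor,
                 scores: torch.FloatTensor) -> torch.FloatTensor:
        scores[:, self.banned_token_id] = -float("inf")  # never sample this token
        return scores

processors = LogitsProcessorList([BanTokenLogitsProcessor(banned_token_id=50256)])
# outputs = model.generate(**inputs, logits_processor=processors)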

Stopping Criteria

Stopping criteria determine when generation terminates. They return a boolean tensor indicating which sequences are done:

class StoppingCriteria:
    def __call__(self, input_ids: torch.LongTensor, 
                 scores: torch.FloatTensor, **kwargs) -> torch.BoolTensor:
        # Return True to stop, False to continue
        pass

Built-in criteria include:

  • MaxLengthCriteria: Stops when sequence reaches max_length
  • MaxTimeCriteria: Stops after elapsed time exceeds threshold
  • EosTokenCriteria: Stops when EOS token is generated
  • StopStringCriteria: Stops when specific strings appear in output
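
Built-in and custom criteria are combined in a StoppingCriteriaList and handed to generate(); for example, capping wall-clock time:

from transformers import MaxTimeCriteria, StoppingCriteriaList

# Stop decoding once generation has run for more than 5 seconds
criteria = StoppingCriteriaList([MaxTimeCriteria(max_time=5.0)])
# outputs = model.generate(**inputs, stopping_criteria=criteria, max_new_tokens=200)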

Generation Configuration

GenerationConfig centralizes all generation parameters. It can be saved with models and loaded automatically:

from transformers import GenerationConfig

config = GenerationConfig(
    max_new_tokens=100,
    num_beams=4,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)
model.generate(input_ids, generation_config=config)


The system is extensible: custom logits processors and stopping criteria can be created by subclassing the base classes and passed to generate().

Training Framework

Relevant Files
  • src/transformers/trainer.py
  • src/transformers/training_args.py
  • src/transformers/trainer_callback.py
  • src/transformers/data/data_collator.py

The Transformers training framework provides a high-level Trainer class that abstracts away the complexity of PyTorch training loops. It handles distributed training, mixed precision, checkpointing, and evaluation automatically.

Core Components

Trainer is the main orchestrator that manages the entire training lifecycle. It accepts a model, datasets, and configuration, then handles forward passes, backward passes, optimization, and evaluation. The Trainer is optimized for Transformers models but works with any torch.nn.Module.

TrainingArguments is a dataclass that centralizes all training hyperparameters: learning rate, batch size, number of epochs, evaluation strategy, save strategy, and more. It can be converted to command-line arguments using HfArgumentParser, making it easy to configure training from scripts.

Data Collators batch individual samples into tensors. The framework provides DefaultDataCollator for simple cases and DataCollatorWithPadding for sequence models that need padding. Custom collators can implement special preprocessing logic.
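
For example, DataCollatorWithPadding pads each batch dynamically to its longest sequence, which is usually cheaper than padding everything to max_length up front:

from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="pt")

features = [tokenizer("short text"), tokenizer("a somewhat longer example sentence")]
batch = data_collator(features)  # input_ids padded to the longest sample in the batch
print(batch["input_ids"].shape, batch["attention_mask"].shape)

The collator is passed to the Trainer via its data_collator argument.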

Training Loop Architecture


The training loop follows this sequence: initialize the Trainer with model, args, and datasets; set up distributed training via Accelerator; iterate through epochs and batches; compute loss and gradients; update parameters; periodically evaluate and save checkpoints.

Callbacks System

Callbacks provide hooks into the training loop without modifying core code. The TrainerCallback base class defines event methods like on_train_begin, on_step_end, on_evaluate, and on_save. Callbacks receive TrainingArguments, TrainerState (current training metrics), and TrainerControl (to signal early stopping or checkpoint saving).

Built-in callbacks include DefaultFlowCallback (handles logging, evaluation, and saving intervals), ProgressCallback (progress bars), and integration callbacks for TensorBoard, Weights & Biases, and other platforms.
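
A sketch of a custom callback that logs evaluation metrics and requests an extra checkpoint (the class name is hypothetical):

from transformers import TrainerCallback

class EvalLoggerCallback(TrainerCallback):
    def on_evaluate(self, args, state, control, metrics=None, **kwargs):
        print(f"step {state.global_step}: {metrics}")

    def on_step_end(self, args, state, control, **kwargs):
        if state.global_step == 10:
            control.should_save = True  # ask the Trainer to save a checkpoint now
        return control

# trainer = Trainer(..., callbacks=[EvalLoggerCallback()])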

Key Features

  • Gradient Accumulation: Accumulate gradients over multiple batches before updating, enabling larger effective batch sizes.
  • Mixed Precision: Automatic FP16/BF16 training reduces memory and speeds up computation.
  • Distributed Training: Seamless multi-GPU and multi-node training via Accelerator.
  • Checkpointing: Save model states at intervals or when metrics improve; resume from checkpoints.
  • Evaluation Strategies: Evaluate every N steps, every epoch, or only at the end.
  • Hyperparameter Search: Integrate with Optuna or Ray Tune for automated hyperparameter tuning.

Basic Usage

from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=5e-5,
    eval_strategy="epoch",
    save_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

The Trainer automatically handles device placement, distributed setup, mixed precision, and logging. Call train() to start training, evaluate() for evaluation, or predict() for inference on a dataset.

Model Zoo & Auto Classes

Relevant Files
  • src/transformers/models/auto/modeling_auto.py
  • src/transformers/models/auto/tokenization_auto.py
  • src/transformers/models/auto/image_processing_auto.py
  • src/transformers/models/auto/auto_factory.py
  • src/transformers/models/auto/configuration_auto.py

The Auto Classes system provides a unified, model-agnostic interface for loading pretrained models, tokenizers, and processors. Instead of manually importing specific model classes, you use AutoModel, AutoTokenizer, and related classes that automatically detect and instantiate the correct implementation based on the model name or configuration.

Core Architecture

The Auto system is built on three layers:

  1. Mappings: OrderedDicts that map model types (e.g., "bert", "gpt2") to their corresponding class names
  2. Lazy Loading: _LazyAutoMapping defers class imports until needed, reducing startup time
  3. Factory Classes: _BaseAutoModelClass and similar base classes implement the from_pretrained() and from_config() methods

# Example: Loading a model automatically
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

Model Mappings

The system maintains multiple specialized mappings for different tasks:

  • Base Models: MODEL_MAPPING for generic model architectures
  • Task-Specific: MODEL_FOR_SEQUENCE_CLASSIFICATION, MODEL_FOR_CAUSAL_LM, MODEL_FOR_OBJECT_DETECTION, etc.
  • Modality-Specific: MODEL_FOR_IMAGE_CLASSIFICATION, MODEL_FOR_AUDIO_CLASSIFICATION, MODEL_FOR_VISION_2_SEQ

Each mapping is a _LazyAutoMapping that pairs config classes with model classes. When you call from_pretrained(), the system:

  1. Loads the config file from the model repository
  2. Looks up the config class in the mapping
  3. Retrieves the corresponding model class
  4. Instantiates and returns the model
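
The same lookup can be traced manually, which is useful for seeing what an Auto class resolves to:

from transformers import AutoConfig, AutoModelForSequenceClassification

config = AutoConfig.from_pretrained("bert-base-uncased")
print(config.model_type)      # "bert" -> key used to look up the concrete classes
print(type(config).__name__)  # BertConfig

model = AutoModelForSequenceClassification.from_config(config)  # randomly initialized
print(type(model).__name__)   # BertForSequenceClassification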

Tokenizer & Processor Auto Classes

Similar to models, the system provides:

  • AutoTokenizer: Maps model types to tokenizer classes (fast or slow variants)
  • AutoImageProcessor: Maps model types to image processing classes
  • AutoProcessor: Combines multiple processors for multimodal models (e.g., vision + text)
  • AutoFeatureExtractor: For audio and video feature extraction

from transformers import AutoTokenizer, AutoImageProcessor, AutoProcessor

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
image_processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
processor = AutoProcessor.from_pretrained("microsoft/layoutlmv2-base-uncased")

Registration & Custom Models

You can register custom models with the Auto system:

from transformers import AutoModel, AutoConfig

AutoConfig.register("my_model_type", MyCustomConfig)
AutoModel.register(MyCustomConfig, MyCustomModel)

This enables AutoModel.from_pretrained() to work with your custom implementations, including models with trust_remote_code=True from the Hub.

Key Design Patterns

Lazy Imports: Classes are only imported when accessed, not at module load time, improving performance.

Config-Driven Selection: The model type is determined from the config's model_type attribute, ensuring consistency across the ecosystem.

Task Specialization: Over 40 task-specific Auto classes allow precise model selection without loading unnecessary code.

Fallback Mechanism: If a model type isn't found in the mapping, the system attempts pattern matching on the model name or path.

Quantization & Optimization

Relevant Files
  • src/transformers/quantizers/base.py
  • src/transformers/quantizers/quantizer_*.py
  • src/transformers/integrations/bitsandbytes.py
  • src/transformers/integrations/flash_attention.py
  • src/transformers/utils/quantization_config.py
  • src/transformers/training_args.py

Transformers provides a comprehensive quantization and optimization framework to reduce model size, memory usage, and inference latency while maintaining performance. This section covers the core systems for model compression and acceleration.

Quantization System

The quantization framework is built on the HfQuantizer abstract base class, which standardizes how different quantization methods integrate with model loading. Each quantization technique (GPTQ, AWQ, BitsAndBytes, etc.) implements this interface to handle pre-quantized model loading and optional calibration.

Supported Quantization Methods:

  • BitsAndBytes (4-bit & 8-bit): GPU-optimized quantization with CPU offloading support
  • GPTQ: Post-training quantization with calibration via GPTQModel
  • AWQ: Activation-aware weight quantization for 4-bit compression
  • AQLM, VPTQ, Quanto, EETQ: Specialized quantization techniques
  • Compressed Tensors: Framework-agnostic compression format
  • TorchAO: PyTorch-native quantization and sparsity techniques
  • FP8 Variants: FBGEMM FP8, Fine-grained FP8, FPQuant

Configuration & Loading:

Quantization configs inherit from QuantizationConfigMixin and define method-specific parameters. When loading a quantized model via from_pretrained(), the framework automatically:

  1. Validates environment dependencies
  2. Preprocesses the model skeleton on the meta device
  3. Replaces modules with quantized equivalents
  4. Loads and deserializes quantized weights
  5. Postprocesses the model for inference

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype="float16")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b", quantization_config=config)

Optimization Techniques

Gradient Checkpointing:

Reduces memory usage during training by recomputing activations instead of storing them. Enable via model.gradient_checkpointing_enable() or --gradient_checkpointing in training args. Trades compute for memory with minimal overhead.

Mixed Precision Training:

Combines float32 with float16/bfloat16 computation. Configure via Accelerate's --mixed_precision flag (options: no, fp16, bf16, fp8) or the fp16/bf16 flags in TrainingArguments; Accelerate handles automatic casting and loss scaling.
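
A sketch combining both optimizations through TrainingArguments (bf16 assumes suitable hardware):

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./results",
    gradient_checkpointing=True,  # recompute activations to save memory
    bf16=True,                    # or fp16=True on GPUs without bfloat16 support
    per_device_train_batch_size=4,
)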

Flash Attention:

Optimized attention implementation reducing memory and compute. The flash_attention_forward() function handles dtype casting for quantized models and autocast contexts, ensuring compatibility across different precision settings.
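
Flash Attention is requested per model at load time; a sketch (requires the flash-attn package and a supported GPU):

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    attn_implementation="flash_attention_2",  # alternatives include "sdpa" and "eager"
    torch_dtype=torch.bfloat16,
)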

Kernel Fusion:

Fuses multiple operations into single kernels via the Kernels library. Loads optimized compute kernels from the Hub without installation, supporting FlashAttention-2 and other specialized kernels.


Key Integration Points

  • Device Mapping: Quantizers can override device placement (e.g., BitsAndBytes forces device_map=auto)
  • Weight Conversion: Custom deserializers handle format-specific weight reconstruction
  • Trainability: Some methods support QLoRA fine-tuning; others are inference-only
  • Serialization: Not all quantization formats support saving; check is_serializable() before saving

The framework prioritizes flexibility—each quantization method can customize preprocessing, weight loading, and postprocessing while maintaining a consistent API.

Distributed Training & Integrations

Relevant Files
  • src/transformers/integrations/deepspeed.py
  • src/transformers/integrations/fsdp.py
  • src/transformers/integrations/accelerate.py
  • src/transformers/integrations/tensor_parallel.py
  • src/transformers/distributed/configuration_utils.py
  • src/transformers/trainer.py

Transformers provides seamless integration with multiple distributed training backends through the Accelerate library. The framework abstracts away backend complexity, allowing users to switch between strategies with minimal code changes.

Core Distributed Backends

DeepSpeed Integration (deepspeed.py) enables training of massive models using Zero Redundancy Optimizer (ZeRO) stages. The HfTrainerDeepSpeedConfig class automatically synchronizes DeepSpeed configuration with TrainingArguments values, handling batch size calculations, optimizer settings, and gradient clipping. Key features include:

  • Automatic configuration of ZeRO stages (0, 1, 2, 3) for memory optimization
  • Tensor parallelism support via deepspeed_tp_model_init()
  • Checkpoint management and resume capabilities
  • Integration with Accelerate's DeepSpeedPlugin

FSDP (Fully Sharded Data Parallel) (fsdp.py) provides PyTorch's native distributed training. The module includes:

  • Detection of FSDP-managed modules via is_fsdp_managed_module()
  • Environment-based FSDP enablement checking
  • Support for both FSDP v1 and v2 implementations
  • Auto-wrap policies for automatic layer sharding

Accelerate Integration (accelerate.py) serves as the unified orchestration layer:

  • Device map inference for optimal GPU placement
  • Model dispatching and memory management
  • Support for mixed precision training
  • Integration with quantization backends

Configuration & Initialization

from transformers import TrainingArguments, Trainer

# DeepSpeed example
training_args = TrainingArguments(
    deepspeed="path/to/deepspeed_config.json",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
)

# FSDP example
training_args = TrainingArguments(
    fsdp="full_shard",
    fsdp_config={"fsdp_auto_wrap_policy": "TRANSFORMER_BASED_WRAP"},
)

The Trainer class automatically detects and initializes the appropriate backend during create_accelerator_and_postprocess(). Configuration validation ensures compatibility between distributed settings and training parameters.

Tensor Parallelism

The tensor_parallel.py module enables splitting model layers across devices. initialize_tensor_parallelism() sets up device meshes and initializes the backend, supporting:

  • Automatic device detection (CUDA, XPU, CPU)
  • Distributed process group initialization
  • DTensor-based sharding with custom placement strategies
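
As a sketch of how this surfaces in user code, recent releases expose a tp_plan argument on from_pretrained() that shards supported models across the processes of a torchrun launch; treat the exact argument name and model coverage as version-dependent:

# Launched with: torchrun --nproc-per-node 4 run_tp.py
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
    tp_plan="auto",  # shard layers across the devices in the process group
)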

Distributed Configuration

The DistributedConfig dataclass (configuration_utils.py) provides a base for distributed training settings. It supports:

  • JSON serialization for reproducibility
  • Dictionary-based configuration loading
  • Expert parallelism flags for future extensibility

Training Loop Integration

The Trainer integrates distributed backends throughout the training lifecycle:

  • Model wrapping: Applies appropriate wrappers (FSDP, DDP, DeepSpeed) based on configuration
  • Optimizer initialization: DeepSpeed manages its own optimizer; FSDP uses standard PyTorch optimizers
  • Gradient synchronization: Handled transparently by the backend
  • Checkpoint management: Backend-specific save/load logic via Accelerate utilities

Best Practices

  • Use deepspeed for very large models requiring memory optimization
  • Use fsdp for balanced memory and performance on multi-node setups
  • Set gradient_checkpointing=True to reduce memory footprint
  • Validate configuration compatibility before training
  • Monitor distributed communication overhead with profiling tools

Testing & Utilities

Relevant Files
  • src/transformers/testing_utils.py
  • tests/test_modeling_common.py
  • utils/check_repo.py

The Hugging Face Transformers library provides a comprehensive testing infrastructure with utilities, decorators, and base classes to ensure consistent and reliable model testing across the codebase.

Core Testing Utilities

testing_utils.py is the central hub for testing infrastructure. It provides:

  • Decorators for conditional test execution: @slow, @require_torch, @require_torch_gpu, @require_accelerate, @require_bitsandbytes, etc. These skip tests when required dependencies are unavailable or when the controlling environment variable is not set (for example, tests marked @slow only run when RUN_SLOW=1).

  • Context managers for test isolation: CaptureStd, CaptureStdout, CaptureStderr, and CaptureLogger capture output streams for assertion. LoggingLevel temporarily adjusts logging verbosity. TemporaryHubRepo creates and cleans up temporary Hub repositories.

  • TestCasePlus base class: Extends unittest.TestCase with path accessors (test_file_path, tests_dir, repo_root_dir) and auto-removable temporary directories via get_auto_remove_tmp_dir().
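
A sketch of how the decorators and context managers combine in a test:

from transformers import logging
from transformers.testing_utils import CaptureLogger, TestCasePlus, require_torch, slow

class ExampleUtilityTest(TestCasePlus):
    @slow
    @require_torch
    def test_logging_and_tmp_dir(self):
        tmp_dir = self.get_auto_remove_tmp_dir()  # removed automatically after the test
        logger = logging.get_logger("transformers")
        with CaptureLogger(logger) as cl:
            logger.warning("hello")
        self.assertIn("hello", cl.out)
        self.assertTrue(tmp_dir)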

Model Testing Patterns

test_modeling_common.py defines the standard testing framework:

  • ModelTesterMixin: The primary base class for model tests. It provides 100+ test methods covering forward passes, gradient checkpointing, serialization, attention mechanisms, and more. Tests iterate over all_model_classes and use model_tester.prepare_config_and_inputs_for_common() to generate test data.

  • Model tester classes: Each model has a corresponding tester (e.g., BertModelTester) that generates configs and inputs. These inherit from base testers and implement prepare_config_and_inputs() and prepare_config_and_inputs_for_common().

  • Helper functions: ids_tensor(), floats_tensor(), random_attention_mask() generate random test inputs. seeded_weight_init() and skip_weight_init() context managers control weight initialization for deterministic testing.

Repository Consistency Checks

check_repo.py validates repository structure:

  • Ensures all models are properly defined in __init__ files
  • Verifies models are in auto classes and documented
  • Checks for deprecated models and naming consistency
  • Validates tokenizer, processor, and feature extractor mappings

Run with: python utils/check_repo.py

Test Execution

Tests use pytest with parameterization for comprehensive coverage. Key patterns:

import unittest

from transformers.testing_utils import require_torch
from ...test_modeling_common import ModelTesterMixin

@require_torch
class MyModelTest(ModelTesterMixin, unittest.TestCase):
    all_model_classes = (MyModel, MyModelForCausalLM)  # classes under test

    def setUp(self):
        self.model_tester = MyModelTester(self)  # generates configs and inputs

Environment variables control test scope: RUN_SLOW=1 enables slow tests, RUN_TRAINING_TESTS=0 skips training tests. Use make fixup to apply style fixes and propagate copied code changes.