scikit-learn/scikit-learn

scikit-learn Machine Learning Library

Last updated on Dec 18, 2025 (Commit: bf567c7)

Overview

Relevant Files
  • README.rst
  • sklearn/__init__.py
  • doc/getting_started.rst
  • doc/user_guide.rst
  • pyproject.toml

scikit-learn is a mature, production-ready Python machine learning library built on NumPy, SciPy, and joblib. It provides a comprehensive suite of supervised and unsupervised learning algorithms, along with tools for model evaluation, selection, and data preprocessing. The library emphasizes a consistent API design where all estimators follow the same fit-predict pattern.

Core Purpose

scikit-learn aims to make machine learning accessible and practical for both researchers and practitioners. It integrates classical ML algorithms into the scientific Python ecosystem, offering simple yet efficient solutions for learning problems across science and engineering domains.

Key Features

  • Unified API: All estimators inherit from BaseEstimator, providing consistent fit(), predict(), and transform() methods
  • Comprehensive Algorithms: Classification, regression, clustering, dimensionality reduction, and feature selection
  • Data Preprocessing: Transformers for scaling, encoding, imputation, and feature engineering
  • Model Selection: Cross-validation, hyperparameter tuning, and evaluation metrics
  • Pipelines: Chain preprocessing and estimators to prevent data leakage and simplify workflows
  • Inspection Tools: Feature importance, partial dependence, and model introspection utilities
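
The unified API can be seen in a short sketch: every estimator is configured in its constructor, learns in fit(), and is applied via predict() (or transform()). This example assumes only the bundled iris dataset and LogisticRegression defaults.

```python
# Minimal sketch of the unified estimator API on the bundled iris dataset.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

clf = LogisticRegression(max_iter=1000)  # 1. configure via constructor
clf.fit(X, y)                            # 2. learn from data
preds = clf.predict(X)                   # 3. predict (here, on training data)
print(preds.shape)                       # one label per sample: (150,)
```

The same three lines work for any estimator in the library, which is what makes pipelines and model selection utilities composable.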

Architecture Overview

Module Organization

The library is organized into functional submodules accessible via lazy imports:

  • cluster - Clustering algorithms (KMeans, DBSCAN, etc.)
  • ensemble - Ensemble methods (RandomForest, GradientBoosting, etc.)
  • linear_model - Linear regression and classification
  • tree - Decision trees
  • svm - Support Vector Machines
  • preprocessing - Data transformation and scaling
  • model_selection - Cross-validation and hyperparameter tuning
  • metrics - Evaluation metrics and scoring functions
  • decomposition - Dimensionality reduction (PCA, NMF, etc.)
  • neighbors - Nearest neighbors methods
  • neural_network - Multi-layer perceptron models

Dependencies

Required: Python (>=3.11), NumPy (>=1.24.1), SciPy (>=1.10.0), joblib (>=1.3.0), threadpoolctl (>=3.2.0)

Optional: Matplotlib (plotting), pandas (data handling), scikit-image (image processing)

Development Status

scikit-learn is actively maintained by a volunteer team and is in production/stable status. The codebase emphasizes code quality, comprehensive testing, and backward compatibility while continuously adding new algorithms and features.

Architecture & Estimator Interface

Relevant Files
  • sklearn/base.py
  • sklearn/pipeline.py
  • sklearn/utils/validation.py
  • sklearn/utils/_param_validation.py
  • sklearn/utils/metadata_routing.py

Core Estimator Architecture

Scikit-learn's architecture is built on a hierarchy of base classes that define the contract for all estimators. The BaseEstimator class is the foundation, providing parameter management, serialization, and validation capabilities. All estimators inherit from this class and must follow the scikit-learn API convention: estimators are objects with fit() and predict() (or transform()) methods.

The estimator hierarchy uses mixin classes to specify estimator type and behavior:

  • ClassifierMixin – Adds score() method (accuracy by default) and marks estimator as a classifier
  • RegressorMixin – Adds score() method (R² by default) and marks estimator as a regressor
  • TransformerMixin – Adds fit_transform() method and output formatting capabilities
  • ClusterMixin – Adds fit_predict() method for clustering algorithms
  • OutlierMixin – Adds fit_predict() for outlier detection

Parameter Management

BaseEstimator provides two critical methods for parameter handling:

  • get_params(deep=True) – Returns all constructor parameters as a dictionary. With deep=True, recursively retrieves nested estimator parameters using __ notation (e.g., pipeline__step__param)
  • set_params(**params) – Sets parameters on the estimator and nested estimators. Enables grid search and hyperparameter tuning

Parameters are introspected from the __init__ signature, so all estimator parameters must be explicit keyword arguments (no *args or **kwargs).
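
A short sketch of the double-underscore convention, using a Pipeline of StandardScaler and SVC as the nested estimator:

```python
# Nested parameter access via the "__" convention on a two-step pipeline.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC(C=1.0))])

print(pipe.get_params()["svc__C"])  # read a nested parameter: 1.0
pipe.set_params(svc__C=10.0)        # write it back through the pipeline
print(pipe.get_params()["svc__C"])  # 10.0
```

This is exactly the mechanism grid search relies on: a parameter grid keyed by `svc__C` reaches the nested SVC through set_params().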

Pipeline & Composition

The Pipeline class chains transformers and a final estimator sequentially. Intermediate steps must implement fit() and transform(), while the final step only needs fit(). Pipelines support:

  • Caching – Intermediate transformer results can be cached via the memory parameter
  • Parameter routing – Metadata (sample weights, groups) can be routed to specific steps
  • Combined methods – fit_transform() and fit_predict() fit and apply the pipeline in a single pass, avoiding redundant computation

FeatureUnion combines multiple transformers in parallel, concatenating their outputs.
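
A sketch of both composition patterns, assuming the iris dataset: FeatureUnion runs transformers in parallel and concatenates their outputs column-wise, while Pipeline chains steps sequentially into a single estimator.

```python
# Pipeline chains steps sequentially; FeatureUnion concatenates in parallel.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

union = FeatureUnion([("pca", PCA(n_components=2)),
                      ("kbest", SelectKBest(f_classif, k=2))])
print(union.fit_transform(X, y).shape)   # (150, 4): 2 + 2 columns

pipe = Pipeline([("scale", StandardScaler()),
                 ("features", union),
                 ("clf", LogisticRegression(max_iter=1000))])
pipe.fit(X, y)                           # each step fits, then transforms
print(pipe.predict(X).shape)             # (150,)
```

Because the whole pipeline is itself an estimator, it can be cross-validated or grid-searched as one object.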

Validation & Constraints

Parameter validation occurs through _parameter_constraints class attributes. The validate_parameter_constraints() function checks types and values against constraint specifications:

_parameter_constraints = {
    "C": [Interval(Real, 0, None, closed="neither")],
    "kernel": [StrOptions({"linear", "rbf", "poly"})],
    "random_state": ["random_state"],
}

Data validation uses check_array(), check_is_fitted(), and feature name validation to ensure inputs meet requirements.
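
The fit-state check can be sketched directly: check_is_fitted() raises NotFittedError until fitting creates the trailing-underscore attributes (coef_, intercept_, etc.).

```python
# check_is_fitted() raises NotFittedError before fit(), passes after.
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LinearRegression
from sklearn.utils.validation import check_is_fitted

reg = LinearRegression()
try:
    check_is_fitted(reg)
except NotFittedError:
    print("not fitted yet")

reg.fit([[0.0], [1.0], [2.0]], [0.0, 1.0, 2.0])
check_is_fitted(reg)  # passes silently now that coef_ exists
print("fitted")
```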

Metadata Routing

Modern scikit-learn supports metadata routing to safely pass metadata (sample weights, groups, etc.) through pipelines and meta-estimators. The MetadataRouter and _MetadataRequester classes manage this flow, allowing estimators to declare which metadata they consume and how it should be routed.

Estimator Lifecycle

  1. Initialization – Constructor sets parameters; no data is processed
  2. Fitting – fit() learns from training data; fitted attributes end with _
  3. Prediction/Transform – predict() or transform() applies the learned model to new data
  4. Validation – check_is_fitted() ensures the estimator has been fitted before prediction
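
The lifecycle can be illustrated with a deliberately tiny custom estimator. MeanRegressor is a hypothetical name for this sketch, not part of sklearn; it follows the conventions above (parameters in __init__, learned state with a trailing underscore, validation before prediction).

```python
# Hypothetical minimal estimator showing the four lifecycle stages.
import numpy as np
from sklearn.base import BaseEstimator, RegressorMixin
from sklearn.utils.validation import check_is_fitted


class MeanRegressor(RegressorMixin, BaseEstimator):
    def fit(self, X, y):
        self.mean_ = np.mean(y)      # fitted attribute ends with "_"
        return self                  # fit() returns self by convention

    def predict(self, X):
        check_is_fitted(self)        # raises NotFittedError if unfitted
        return np.full(len(X), self.mean_)


model = MeanRegressor().fit([[1], [2], [3]], [1.0, 2.0, 3.0])
print(model.predict([[10], [20]]))   # [2. 2.]
```

Because it subclasses BaseEstimator, the class gets get_params()/set_params() for free and composes with pipelines and grid search.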

Supervised Learning Algorithms

Relevant Files
  • sklearn/linear_model/__init__.py
  • sklearn/tree/__init__.py
  • sklearn/ensemble/__init__.py
  • sklearn/svm/__init__.py
  • sklearn/neighbors/__init__.py
  • sklearn/neural_network/__init__.py
  • sklearn/gaussian_process/__init__.py
  • sklearn/naive_bayes.py
  • sklearn/discriminant_analysis.py

Supervised learning algorithms learn patterns from labeled training data to make predictions on new, unseen data. Scikit-learn provides a comprehensive collection of algorithms for both classification and regression tasks, organized into distinct modules based on their underlying mathematical principles.

Linear Models

Linear models form the foundation of many machine learning applications. They assume a linear relationship between input features and the target variable.

Regression: LinearRegression fits ordinary least squares, minimizing the residual sum of squares. Ridge and Lasso add L2 and L1 regularization respectively to prevent overfitting. ElasticNet combines both penalties. Specialized variants like BayesianRidge, HuberRegressor, and QuantileRegressor handle different data distributions and outliers.

Classification: LogisticRegression performs binary and multiclass classification using regularized logistic loss. Perceptron and PassiveAggressiveClassifier offer online learning capabilities. SGDClassifier and SGDRegressor enable stochastic gradient descent optimization for large datasets.
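
The effect of L2 regularization can be seen on a toy regression task (synthetic data, illustrative coefficients): Ridge shrinks the coefficient vector relative to the ordinary least squares solution.

```python
# Toy sketch contrasting OLS with Ridge: the L2 penalty shrinks coefficients.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.RandomState(0)
X = rng.randn(50, 3)
y = X @ np.array([1.0, 2.0, 0.0]) + 0.1 * rng.randn(50)

ols = LinearRegression().fit(X, y)
ridge = Ridge(alpha=10.0).fit(X, y)

# Ridge coefficients have smaller L2 norm than the unpenalized solution
print(np.linalg.norm(ridge.coef_) < np.linalg.norm(ols.coef_))  # True
```

Lasso behaves analogously but with an L1 penalty, which drives some coefficients exactly to zero.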

Tree-Based Methods

Decision trees recursively partition the feature space, creating interpretable models.

DecisionTreeClassifier and DecisionTreeRegressor support multiple splitting criteria (Gini impurity, entropy, MSE). ExtraTreeClassifier and ExtraTreeRegressor use random thresholds for faster training. Trees are prone to overfitting but serve as building blocks for ensemble methods.

Ensemble Methods

Ensemble methods combine multiple base learners to improve generalization.

Bagging: BaggingClassifier and BaggingRegressor train independent models on random subsets. RandomForestClassifier and RandomForestRegressor are specialized bagging ensembles using decision trees with feature subsampling.

Boosting: AdaBoostClassifier and AdaBoostRegressor sequentially train models, emphasizing misclassified samples. GradientBoostingClassifier and GradientBoostingRegressor fit trees to residuals. HistGradientBoostingClassifier and HistGradientBoostingRegressor use histogram-based learning for efficiency.

Stacking & Voting: StackingClassifier and StackingRegressor train meta-learners on base model predictions. VotingClassifier and VotingRegressor combine predictions via averaging or majority voting.

Support Vector Machines

SVMs find optimal decision boundaries by maximizing the margin between classes.

SVC and SVR support kernel methods (linear, RBF, polynomial). LinearSVC and LinearSVR are optimized for linear kernels. NuSVC and NuSVR use alternative parameterizations. OneClassSVM detects outliers.

Nearest Neighbors

Instance-based methods classify by finding similar training examples.

KNeighborsClassifier and KNeighborsRegressor use k-nearest neighbors voting. RadiusNeighborsClassifier and RadiusNeighborsRegressor use fixed-radius neighborhoods. Efficient spatial indexing via KDTree and BallTree accelerates queries.

Neural Networks

MLPClassifier and MLPRegressor implement multi-layer perceptrons with backpropagation. They support multiple activation functions and solvers (SGD, Adam, L-BFGS) for flexible neural-network modeling of tabular data.

Gaussian Processes

GaussianProcessClassifier and GaussianProcessRegressor provide probabilistic predictions with uncertainty estimates. They use kernel functions to define covariance structures and support various kernels via the kernels module.

Probabilistic Classifiers

GaussianNB, MultinomialNB, BernoulliNB, and CategoricalNB implement Naive Bayes variants assuming feature independence. LinearDiscriminantAnalysis and QuadraticDiscriminantAnalysis model class-conditional distributions using covariance estimation.

All estimators follow scikit-learn's consistent API: fit(X, y) for training, predict(X) for inference, and score(X, y) for evaluation. Cross-validation utilities in model_selection help select optimal hyperparameters and assess generalization performance.

Unsupervised Learning & Decomposition

Relevant Files
  • sklearn/cluster/__init__.py
  • sklearn/decomposition/__init__.py
  • sklearn/manifold/__init__.py
  • sklearn/mixture/__init__.py
  • sklearn/covariance/__init__.py

Unsupervised learning discovers patterns in unlabeled data through clustering, decomposition, and manifold learning. scikit-learn provides a comprehensive toolkit for these tasks, organized into five main modules.

Clustering Algorithms

Clustering partitions data into groups based on similarity. The module offers diverse algorithms suited to different data geometries and scales:

  • K-Means & Variants: Fast, scalable centroid-based clustering. KMeans works well for convex clusters; MiniBatchKMeans handles large datasets; BisectingKMeans uses hierarchical bisection for efficiency.
  • Hierarchical Clustering: AgglomerativeClustering builds dendrograms via bottom-up merging with configurable linkage criteria (Ward, complete, average). FeatureAgglomeration clusters features instead of samples.
  • Density-Based Methods: DBSCAN finds arbitrary-shaped clusters and identifies outliers; OPTICS extends this with multi-scale analysis; HDBSCAN adds hierarchical structure.
  • Graph-Based: SpectralClustering uses normalized Laplacian eigenvectors for non-convex clusters (e.g., nested circles).
  • Other Approaches: MeanShift finds modes in the feature space; AffinityPropagation uses message passing; Birch provides online clustering with memory efficiency.
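
The geometry trade-off above can be demonstrated on the classic two-moons dataset (synthetic; eps is an illustrative choice): KMeans needs the number of clusters up front and assumes convex groups, while DBSCAN discovers arbitrary shapes from density alone.

```python
# KMeans vs. DBSCAN on the two-moons dataset.
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)

km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
db_labels = DBSCAN(eps=0.3).fit_predict(X)  # no n_clusters; -1 marks noise

print(len(set(km_labels)))           # 2 (requested explicitly)
print(len(set(db_labels) - {-1}))    # clusters discovered from density
```

KMeans splits the moons with a straight boundary, whereas DBSCAN follows each crescent; plotting the labels makes the difference obvious.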

Matrix Decomposition

Decomposition factorizes data into interpretable components for dimensionality reduction and feature extraction:

  • PCA Family: PCA performs linear dimensionality reduction via SVD; IncrementalPCA processes data in batches; KernelPCA enables non-linear reduction through kernels.
  • Non-Negative Factorization: NMF and MiniBatchNMF decompose into non-negative factors, useful for topic modeling and source separation.
  • Independent Components: FastICA extracts statistically independent sources from mixed signals.
  • Sparse Methods: SparsePCA and DictionaryLearning learn sparse representations; SparseCoder encodes data using learned dictionaries.
  • Probabilistic Models: FactorAnalysis assumes Gaussian latent factors; LatentDirichletAllocation models discrete topics in text.
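
A minimal PCA sketch on the iris data shows the typical workflow: reduce dimensionality, then inspect how much variance the retained components explain.

```python
# PCA reduces 4-D iris data to 2 components.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

pca = PCA(n_components=2)
X2 = pca.fit_transform(X)

print(X2.shape)                                    # (150, 2)
print(pca.explained_variance_ratio_.sum() > 0.95)  # first 2 PCs dominate
```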

Manifold Learning

Manifold learning uncovers low-dimensional structure in high-dimensional data by preserving local or global geometry:

  • Distance-Preserving: MDS preserves pairwise distances (with metric and non-metric variants); Isomap preserves geodesic distances along manifolds.
  • Neighborhood-Based: LocallyLinearEmbedding reconstructs each point from local neighbors; SpectralEmbedding uses graph Laplacian eigenvectors.
  • Probabilistic Embedding: TSNE minimizes Kullback-Leibler divergence for visualization, excelling at revealing local cluster structure.

Mixture Models

Probabilistic clustering via Gaussian mixtures:

  • GaussianMixture fits soft clusters with the EM (expectation-maximization) algorithm; BayesianGaussianMixture adds Bayesian priors for automatic model selection.

Covariance Estimation

Robust covariance and precision matrix estimation for Gaussian graphical models:

  • EmpiricalCovariance computes standard covariance; ShrunkCovariance, LedoitWolf, and OAS apply shrinkage for stability.
  • GraphicalLasso learns sparse precision matrices via L1 regularization.
  • MinCovDet and EllipticEnvelope detect outliers using robust covariance.

Choosing an Algorithm

Select based on your data and goal: use K-Means for speed on large datasets with convex clusters; DBSCAN for arbitrary shapes and outlier detection; hierarchical clustering for dendrograms; manifold learning for visualization; NMF for interpretable non-negative factors; PCA for fast linear reduction.

Preprocessing & Feature Engineering

Relevant Files
  • sklearn/preprocessing/__init__.py
  • sklearn/preprocessing/_data.py
  • sklearn/feature_extraction/__init__.py
  • sklearn/feature_selection/__init__.py
  • sklearn/impute/__init__.py
  • sklearn/compose/__init__.py
  • sklearn/pipeline.py

Preprocessing and feature engineering are foundational steps in machine learning pipelines. scikit-learn provides a comprehensive suite of tools organized into five main modules that handle data transformation, feature extraction, feature selection, missing value imputation, and pipeline composition.

Data Scaling & Normalization

The preprocessing module offers multiple scalers for normalizing feature ranges. StandardScaler applies z-score normalization (mean=0, std=1), while MinMaxScaler rescales features to a fixed range like [0, 1]. RobustScaler uses median and interquartile range, making it resistant to outliers. MaxAbsScaler scales by the maximum absolute value, preserving sparsity. Normalizer applies L1 or L2 normalization per sample. QuantileTransformer maps features to uniform or normal distributions, useful for skewed data.

Encoding & Discretization

Categorical features require encoding before model training. OneHotEncoder converts categorical variables into binary columns, with options for handling unknown categories and sparse output. OrdinalEncoder maps categories to integers, suitable for ordinal data. LabelEncoder encodes target labels. KBinsDiscretizer bins continuous features into discrete intervals using equal-width, equal-frequency, or k-means strategies. TargetEncoder encodes categories based on target statistics, reducing dimensionality while capturing predictive information.

Feature Extraction

The feature_extraction module handles raw data conversion. TfidfVectorizer and CountVectorizer (in the text submodule) extract features from text documents. DictVectorizer converts dictionaries to sparse matrices. FeatureHasher uses hashing for memory-efficient feature extraction. Image utilities like img_to_graph and grid_to_graph extract spatial features from images.

Missing Value Imputation

The impute module provides strategies for handling missing data. SimpleImputer fills missing values using mean, median, most frequent, or constant strategies. KNNImputer uses k-nearest neighbors to estimate missing values, preserving local structure. MissingIndicator creates binary features indicating missing values, useful for capturing missingness patterns.
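
A short imputation sketch on a toy matrix with two missing entries:

```python
# SimpleImputer fills NaNs with the column mean;
# MissingIndicator records where values were missing.
import numpy as np
from sklearn.impute import MissingIndicator, SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, np.nan]])

imputed = SimpleImputer(strategy="mean").fit_transform(X)
print(imputed[1, 0])   # 2.0, the mean of column 0 (1.0 and 3.0)

mask = MissingIndicator().fit_transform(X)
print(mask.sum())      # 2 missing entries flagged
```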

Feature Selection

The feature_selection module reduces dimensionality by identifying relevant features. Univariate methods like SelectKBest and SelectPercentile rank features using statistical tests (f_classif, f_regression, chi2, mutual_info_classif). VarianceThreshold removes low-variance features. RFE and RFECV recursively eliminate features based on model weights. SelectFromModel selects features with importance above a threshold. SequentialFeatureSelector uses forward or backward selection.
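
Univariate selection in two lines, using the ANOVA F-test on iris:

```python
# SelectKBest keeps the k features with the strongest univariate
# association with the target.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

selector = SelectKBest(f_classif, k=2).fit(X, y)
print(selector.transform(X).shape)  # (150, 2)
print(selector.get_support())       # boolean mask of kept features
```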

Pipeline Composition

ColumnTransformer applies different transformers to different column subsets, essential for heterogeneous data. Pipeline chains transformers sequentially, ensuring fit/transform consistency and preventing data leakage. TransformedTargetRegressor applies transformations to target variables. Helper functions like make_column_transformer and make_pipeline simplify construction.
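
A sketch of a heterogeneous pipeline, assuming pandas is available (it is an optional dependency); the column names and data are purely illustrative:

```python
# Scale the numeric column, one-hot encode the categorical one, classify.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({"age": [25, 32, 47, 51],
                   "city": ["NY", "SF", "NY", "LA"]})
y = [0, 1, 0, 1]

pre = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
])
model = Pipeline([("pre", pre), ("clf", LogisticRegression())])
model.fit(df, y)
print(model.predict(df).shape)  # (4,)
```

Because preprocessing is fitted inside the pipeline, cross-validation refits the scalers and encoders on each training fold, preventing data leakage.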

Design Patterns

All transformers inherit from TransformerMixin and BaseEstimator, implementing fit(), transform(), and fit_transform() methods. This consistent interface enables composition in pipelines. Transformers support both dense and sparse matrices. The _fit_context decorator manages state validation. Metadata routing enables parameter passing through pipelines for cross-validation and sample weighting.

Model Selection & Evaluation

Relevant Files
  • sklearn/model_selection/__init__.py
  • sklearn/model_selection/_search.py
  • sklearn/model_selection/_split.py
  • sklearn/model_selection/_validation.py
  • sklearn/metrics/__init__.py
  • sklearn/metrics/_scorer.py
  • sklearn/inspection/__init__.py
  • sklearn/calibration.py

Model selection and evaluation are critical for building robust machine learning systems. scikit-learn provides comprehensive tools for hyperparameter tuning, cross-validation, performance metrics, and model inspection.

Cross-Validation Strategies

Cross-validation splits data into multiple train-test folds to assess model generalization. The framework supports various splitting strategies:

  • K-Fold variants: KFold, StratifiedKFold (preserves class distribution), RepeatedKFold
  • Group-aware splits: GroupKFold, StratifiedGroupKFold for grouped data
  • Leave-One-Out: LeaveOneOut, LeaveOneGroupOut for exhaustive evaluation
  • Shuffle splits: ShuffleSplit, StratifiedShuffleSplit for random partitions
  • Time series: TimeSeriesSplit for temporal data respecting order

Use cross_val_score() for quick evaluation or cross_validate() for multiple metrics and detailed results.

Hyperparameter Tuning

Two primary search strategies optimize model parameters:

Grid Search (GridSearchCV): Exhaustively evaluates all parameter combinations. Best for small parameter spaces.

Randomized Search (RandomizedSearchCV): Samples random combinations. More efficient for large spaces.

Both support parallel execution via n_jobs and custom scoring functions. Results include the best parameters, per-fold cross-validation scores, and an estimator refitted on the full training set.

Performance Metrics

Classification metrics include accuracy, precision, recall, F1-score, ROC-AUC, and confusion matrices. Regression metrics cover MSE, MAE, R², and specialized losses. Clustering metrics evaluate unsupervised quality (silhouette, Davies-Bouldin, adjusted Rand index).

Use make_scorer() to wrap custom metrics for use in search and validation functions.

Model Inspection

Understand model decisions through:

  • Permutation importance: Feature importance via prediction degradation
  • Partial dependence: Feature-target relationships
  • Decision boundaries: Visual classification regions
  • Calibration: Probability reliability assessment via CalibratedClassifierCV

from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import make_scorer, f1_score

# Quick evaluation
scores = cross_val_score(model, X, y, cv=5, scoring='f1')

# Hyperparameter tuning
grid = GridSearchCV(
    model, 
    {'C': [0.1, 1, 10]}, 
    cv=5,
    scoring=make_scorer(f1_score)
)
grid.fit(X, y)

Utilities & Infrastructure

Relevant Files
  • sklearn/utils/__init__.py
  • sklearn/utils/validation.py
  • sklearn/utils/_param_validation.py
  • sklearn/_config.py
  • sklearn/exceptions.py
  • sklearn/datasets/__init__.py

Scikit-learn provides a comprehensive utilities and infrastructure layer that underpins the entire library. This layer handles data validation, configuration management, exception handling, and dataset loading—critical functions that ensure consistency and reliability across all estimators and algorithms.

Data Validation & Input Checking

The validation module (sklearn/utils/validation.py) is the backbone of scikit-learn's input handling. Key functions include:

  • check_array() - Validates and converts input arrays, ensuring they meet requirements (2D, finite values, correct dtype, etc.)
  • check_X_y() - Validates feature matrix X and target y together, enforcing consistent length and proper shapes
  • column_or_1d() - Ensures 1D arrays or column vectors for target variables
  • assert_all_finite() - Checks for NaN and infinite values
  • check_consistent_length() - Verifies all inputs have matching sample counts

These functions accept parameters like accept_sparse, dtype, ensure_2d, and ensure_min_samples to customize validation behavior.
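
A quick sketch of check_array in action: a plain Python list comes back as a validated 2D float array, and non-finite input is rejected by default.

```python
# check_array validates shape/dtype and rejects NaN by default.
import numpy as np
from sklearn.utils.validation import check_array

X = check_array([[1, 2], [3, 4]], dtype=np.float64, ensure_2d=True)
print(X.dtype, X.shape)  # float64 (2, 2)

try:
    check_array([[1.0, np.nan]])  # NaN is rejected unless explicitly allowed
except ValueError:
    print("rejected non-finite input")
```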

Parameter Validation & Constraints

The _param_validation.py module provides a decorator-based system for validating function and method parameters:

  • @validate_params - Decorator that enforces parameter type and value constraints
  • Constraint types - Interval, StrOptions, Options, HasMethods, MissingValues, and more
  • InvalidParameterError - Custom exception for invalid parameters

This system enables early error detection with clear, user-friendly messages.

Global Configuration

The _config.py module manages scikit-learn's global settings via thread-local storage:

  • get_config() - Retrieve current configuration
  • set_config() - Modify global settings (e.g., assume_finite, working_memory, display)
  • config_context() - Context manager for temporary configuration changes

Key settings include transform_output (pandas/polars support), enable_metadata_routing, and skip_parameter_validation.
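
The context manager makes the scoping behavior concrete: the setting changes only inside the with block and is restored on exit.

```python
# config_context applies a temporary, automatically reverted setting.
from sklearn import config_context, get_config

print(get_config()["assume_finite"])      # default: False

with config_context(assume_finite=True):
    print(get_config()["assume_finite"])  # True inside the context

print(get_config()["assume_finite"])      # restored: False
```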

Exception Hierarchy

Custom exceptions in sklearn/exceptions.py provide semantic error handling:

  • NotFittedError - Raised when using unfitted estimators
  • ConvergenceWarning - Convergence issues in iterative algorithms
  • DataConversionWarning - Implicit type conversions
  • EfficiencyWarning - Inefficient computation patterns
  • UnsetMetadataPassedError - Metadata routing violations

Dataset Loading & Generation

The sklearn/datasets module provides utilities for loading real and synthetic datasets:

  • Loaders - load_iris(), load_digits(), load_wine(), load_breast_cancer()
  • Fetchers - fetch_openml(), fetch_california_housing(), fetch_20newsgroups()
  • Generators - make_classification(), make_regression(), make_blobs(), make_moons()
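
Both families in one sketch: a bundled real dataset and a synthetic one.

```python
# Loading a bundled dataset and generating a synthetic clustering problem.
from sklearn.datasets import load_wine, make_blobs

X, y = load_wine(return_X_y=True)
print(X.shape)                 # (178, 13): 178 samples, 13 features

Xb, yb = make_blobs(n_samples=100, centers=3, random_state=0)
print(Xb.shape, len(set(yb)))  # (100, 2) 3
```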

Utility Functions

Additional utilities in sklearn/utils/__init__.py:

  • Bunch - Dictionary-like object for dataset containers
  • get_tags() - Retrieve estimator capability tags
  • compute_class_weight() - Balance class weights for imbalanced data
  • resample(), shuffle() - Data manipulation helpers
  • estimator_html_repr() - HTML representation for Jupyter notebooks

This infrastructure ensures that all estimators operate on clean, validated data with consistent configuration, making scikit-learn robust and user-friendly.

Specialized Techniques & Extensions

Relevant Files
  • sklearn/multiclass.py
  • sklearn/multioutput.py
  • sklearn/semi_supervised/__init__.py
  • sklearn/semi_supervised/_label_propagation.py
  • sklearn/semi_supervised/_self_training.py
  • sklearn/frozen/__init__.py
  • sklearn/experimental/__init__.py

Multiclass Classification Strategies

Scikit-learn provides three meta-estimators for extending binary classifiers to multiclass problems, each with distinct trade-offs:

One-vs-Rest (OvR) trains n_classes binary classifiers, where each classifier distinguishes one class from all others. This is the most commonly used strategy due to its computational efficiency (O(n_classes) complexity) and interpretability. Each class has exactly one dedicated classifier, making it easy to inspect class-specific patterns.

One-vs-One (OvO) trains n_classes * (n_classes - 1) / 2 binary classifiers, one for each class pair. At prediction time, the class receiving the most votes wins. While slower (O(n_classes²) complexity), OvO is advantageous for kernel-based algorithms that don't scale well with sample size, since each binary problem uses only a subset of data.

Error-Correcting Output Codes (ECOC) represents each class as a binary code and trains one classifier per bit. The code_size parameter controls the number of classifiers: values between 0 and 1 compress the model, while values > 1 add redundancy for error correction. This provides flexible trade-offs between model size and robustness.

from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier, OutputCodeClassifier
from sklearn.svm import LinearSVC

# One-vs-Rest: n_classes classifiers
ovr = OneVsRestClassifier(LinearSVC(random_state=0))

# One-vs-One: n_classes * (n_classes - 1) / 2 classifiers
ovo = OneVsOneClassifier(LinearSVC(random_state=0))

# Error-Correcting Output Codes: code_size * n_classes classifiers
ecoc = OutputCodeClassifier(LinearSVC(random_state=0), code_size=1.5)

Multi-Output Learning

The multioutput module extends single-output estimators to handle multiple targets simultaneously:

MultiOutputClassifier and MultiOutputRegressor fit one independent estimator per target variable. This is useful when targets are unrelated and can be predicted independently.

ClassifierChain and RegressorChain model target dependencies by training estimators sequentially, where each estimator uses previous predictions as additional features. The order parameter controls the sequence; 'random' or custom arrays enable experimentation with different orderings.

from sklearn.multioutput import MultiOutputClassifier, ClassifierChain
from sklearn.ensemble import RandomForestClassifier

# Independent targets
multi = MultiOutputClassifier(RandomForestClassifier())

# Dependent targets with learned ordering
chain = ClassifierChain(RandomForestClassifier(), order='random')

Semi-Supervised Learning

Semi-supervised algorithms leverage unlabeled data alongside limited labeled data:

LabelPropagation and LabelSpreading construct a graph connecting all samples and propagate labels through it. They support RBF and KNN kernels; KNN is faster for large datasets. LabelSpreading's alpha parameter controls clamping: alpha=0 keeps the initial labels fixed (hard clamping), while values closer to 1 let labels absorb more information from their neighbors.

SelfTrainingClassifier wraps any supervised classifier with predict_proba to iteratively add high-confidence pseudo-labels. The criterion parameter selects labels by threshold or k-best; max_iter controls iterations until convergence.

from sklearn.semi_supervised import LabelPropagation, SelfTrainingClassifier
from sklearn.linear_model import LogisticRegression

# Graph-based propagation
lp = LabelPropagation(kernel='knn', n_neighbors=7)

# Iterative pseudo-labeling
st = SelfTrainingClassifier(LogisticRegression(), threshold=0.75)

Frozen Estimators

FrozenEstimator wraps a fitted estimator to prevent re-fitting. Calling fit() becomes a no-op, and fit_predict/fit_transform are disabled. This is essential when using pre-trained models as transformers in pipelines—it ensures pipeline.fit() doesn't accidentally retrain the frozen step.

from sklearn.frozen import FrozenEstimator
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression().fit(X_train, y_train)
frozen = FrozenEstimator(clf)
frozen.fit(X_new, y_new)  # No-op; clf remains unchanged

Experimental Features

The experimental module provides access to unstable features not yet ready for production. These estimators may change without deprecation cycles. Always check documentation before using experimental features in production systems.