Overview
Relevant Files
- README.md
- tensorflow/__init__.py
- tensorflow/python/__init__.py
- tensorflow/core/BUILD
TensorFlow is an end-to-end open-source platform for machine learning, originally developed by Google Brain. It provides a comprehensive ecosystem of tools, libraries, and community resources that enable researchers to push the state-of-the-art in ML and developers to build and deploy ML-powered applications.
Core Purpose
TensorFlow is fundamentally a computational dataflow graph library. It allows you to define machine learning models as directed acyclic graphs where nodes represent mathematical operations and edges represent multi-dimensional data arrays (tensors) flowing between operations. This abstraction enables efficient execution across diverse hardware platforms.
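As a concrete illustration (a minimal sketch, not part of the source docs), tracing a Python function with tf.function in TF 2.x produces exactly such a graph; the operations recorded during tracing can be inspected on the resulting FuncGraph:

import tensorflow as tf

@tf.function
def f(x, y):
    return tf.matmul(x, y) + 1.0  # two graph nodes: MatMul and AddV2

concrete = f.get_concrete_function(
    tf.TensorSpec([2, 2], tf.float32), tf.TensorSpec([2, 2], tf.float32))
print([op.type for op in concrete.graph.get_operations()])  # nodes of the dataflow graph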
Key Characteristics
- Multi-language support: Stable Python and C++ APIs, plus APIs for other languages without backward-compatibility guarantees
- Hardware acceleration: Supports CPU, GPU (CUDA-enabled), and specialized accelerators (TPU, mobile devices)
- Production-ready: Used in research and production environments at scale
- Modular architecture: Separate components for different use cases (full framework, Lite for mobile, etc.)
Main Components
tensorflow/core: The C++ foundation containing:
- Ops: Operation definitions (mathematical operations like matmul, conv2d, etc.)
- Kernels: Hardware-specific implementations of ops (CPU, GPU, TPU variants)
- Graph: Graph construction and manipulation utilities
- Common Runtime: Execution engine for running computational graphs
- Platform: Abstraction layer for OS-specific functionality
tensorflow/python: Python bindings and high-level APIs that wrap the C++ core, providing user-friendly interfaces for model building and training.
tensorflow/compiler: MLIR-based compilation infrastructure for optimizing graphs across different backends (XLA, TensorFlow Lite, etc.).
tensorflow/lite: Lightweight inference framework optimized for mobile and embedded devices with reduced binary size and latency.
tensorflow/cc: C++ API for building and executing TensorFlow graphs directly in C++ applications.
Architecture Layers
- User-facing APIs (Python, C++, Java, Go, JavaScript)
- High-level frameworks (Keras, tf.function, eager execution)
- Graph construction and optimization (MLIR, XLA compiler)
- Core execution engine (graph executor, session management)
- Hardware abstraction (platform-specific kernels and runtimes)
Getting Started
Install via pip:
pip install tensorflow        # TF 2.x packages include GPU support on supported platforms
pip install tensorflow-gpu    # legacy package name used by older releases
Basic usage:
import tensorflow as tf
result = tf.add(1, 2).numpy() # Returns 3
The framework handles the complexity of distributing computation across devices while providing a simple, intuitive API for users.
Architecture & Core Components
Relevant Files
- tensorflow/core/framework - Op and kernel definitions, device management
- tensorflow/core/graph - Graph representation and manipulation
- tensorflow/core/common_runtime - Graph execution, placement, optimization
- tensorflow/core/kernels - Kernel implementations for operations
- tensorflow/core/public/session.h - Session API for graph execution
TensorFlow's core architecture is organized into distinct layers that work together to define, optimize, and execute computation graphs.
Framework Layer
The framework defines the fundamental abstractions:
- OpDef - Specifies an operation's signature (inputs, outputs, attributes)
- NodeDef - Represents a single node in a graph with specific op type and attribute values
- Device - Abstracts computation hardware (CPU, GPU, TPU) with resource management and op segment caching
- OpRegistry - Global registry mapping op names to their definitions, used during graph construction
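These abstractions are also visible from Python; a minimal, illustrative sketch of reading a node's NodeDef and the OpDef it instantiates:

import tensorflow as tf

g = tf.Graph()
with g.as_default():
    tf.add(tf.constant(1), tf.constant(2), name="sum")

op = g.get_operation_by_name("sum")
print(op.op_def.name)  # OpDef: the registered signature this node uses ("AddV2")
print(op.node_def)     # NodeDef: this node's op type, inputs, and attribute values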
Graph Representation
The graph layer provides the computational graph abstraction:
- Graph - In-memory representation of a computation graph with nodes and edges
- GraphDef - Protobuf serialization format for graphs (portable, versionable)
- Node - Graph node with type information, edges, and device assignment
- Edge - Data or control dependency between nodes
GraphDef is converted to Graph via GraphConstructor::Construct(), which validates versions and builds the in-memory representation.
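A minimal Python sketch of the same round trip (illustrative; it uses the public graph_util API rather than calling the C++ GraphConstructor directly):

import tensorflow as tf

g = tf.Graph()
with g.as_default():
    a = tf.constant(1.0, name="a")
    b = tf.constant(2.0, name="b")
    tf.add(a, b, name="c")

graph_def = g.as_graph_def()                             # Graph -> GraphDef protobuf
g2 = tf.Graph()
with g2.as_default():
    tf.graph_util.import_graph_def(graph_def, name="")   # GraphDef -> in-memory Graph
print(g2.get_operation_by_name("c").type)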
Kernel System
Operations are implemented as kernels registered for specific devices and data types:
- OpKernel - Base class for operation implementations
- KernelDef - Specifies which op, device, and type constraints a kernel handles
- KernelRegistry - Maps (op_type, device_type) pairs to kernel registrations
- REGISTER_KERNEL_BUILDER - Macro for registering kernels at compile time
Kernel lookup matches NodeDef attributes against KernelDef constraints to find the best implementation.
Execution Pipeline
GraphExecutionState transforms a GraphDef into an executable graph by:
- Constructing the in-memory Graph
- Placing nodes on devices
- Partitioning for distributed execution
- Creating executors for each device
Executor
The Executor runs a graph on a single device:
- Manages node scheduling and dependency tracking
- Executes nodes when all inputs are ready
- Handles control flow (Switch, Merge, Enter, Exit)
- Supports async and sync execution modes
ExecutorImpl uses ExecutorState to track runtime state and propagate tensor readiness through the graph.
Session API
The Session provides the high-level interface:
- Create(GraphDef) - Register a graph
- Run(inputs, outputs, targets) - Execute the graph
- Extend(GraphDef) - Add operations to an existing graph
DirectSession implements the Session interface by managing executors, handling device placement, and coordinating distributed execution.
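A minimal sketch of that lifecycle through the TF1-compatibility endpoints (assumes tf.compat.v1 is available in the installed TF 2.x build):

import tensorflow as tf

g = tf.Graph()
with g.as_default():
    x = tf.compat.v1.placeholder(tf.float32, shape=[], name="x")
    y = tf.multiply(x, 2.0, name="y")

with tf.compat.v1.Session(graph=g) as sess:   # Create: registers the graph with the session
    print(sess.run(y, feed_dict={x: 21.0}))   # Run: executes the fetches given the feeds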
Modern Runtime: TFRT
TensorFlow also includes TFRT (TensorFlow Runtime), a newer execution engine:
- GraphExecutor - Compiles graphs to bytecode or BEF (Binary Executable Format)
- MLRT - Machine Learning Runtime for efficient graph execution
- Supports both fallback to legacy kernels and native TFRT operations
Python API & High-Level Interface
Relevant Files
- tensorflow/python/framework/ops.py
- tensorflow/python/eager/context.py
- tensorflow/python/eager/polymorphic_function/polymorphic_function.py
- tensorflow/python/keras/engine/base_layer.py
- tensorflow/python/eager/backprop.py
- tensorflow/python/ops/
TensorFlow's Python API provides a high-level interface for building and executing machine learning models. The API is organized around two core execution modes: eager execution (default in TF 2.x) and graph mode, with seamless interoperability through tf.function.
Execution Modes
Eager Execution (tensorflow/python/eager/context.py) enables immediate operation evaluation. Operations execute line-by-line, returning concrete values that can be inspected with .numpy(). This is the default in TensorFlow 2.x and provides an intuitive, Pythonic interface.
Graph Mode constructs a computational graph before execution, enabling optimizations and deployment. The tf.Graph class (tensorflow/python/framework/ops.py) represents a dataflow graph where tf.Operation nodes compute on tf.Tensor data. Graph mode is primarily accessed through tf.function or legacy TF 1.x APIs.
Core Abstractions
Tensors are the fundamental data structure. tf.Tensor represents symbolic tensors in graphs, while EagerTensor holds concrete values in eager mode. Both support standard NumPy-like operations.
Operations (tf.Operation) are graph nodes created by calling ops like tf.matmul(), tf.add(), etc. These are defined in tensorflow/python/ops/ and automatically added to the default graph or executed eagerly.
Variables (tf.Variable) maintain mutable state across executions. Unlike tensors, variables persist and can be updated via .assign() methods.
tf.function: Bridging Eager and Graph
tf.function (in tensorflow/python/eager/polymorphic_function/) is the primary mechanism for graph compilation. Decorating a Python function with @tf.function traces it with symbolic arguments, creating an optimized graph:
@tf.function
def compute(x, y):
    return x ** 2 + y

result = compute(tf.constant(2.0), tf.constant(3.0))
The function is traced once per unique input signature; AutoGraph rewrites Python control flow into graph operations, and the resulting graph can then be optimized or compiled with XLA while the call site keeps its eager-style semantics.
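A small illustration of the per-signature tracing behavior (a sketch; the print statement is a Python side effect, so it fires only while a new trace is being created):

import tensorflow as tf

@tf.function
def square(x):
    print("tracing")          # runs during tracing, not on every call
    return x * x

square(tf.constant(2))        # traces for int32 inputs
square(tf.constant(3))        # reuses the int32 trace; nothing printed
square(tf.constant(2.0))      # new dtype, new trace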
Keras Layers and Models
tf.keras.layers.Layer (tensorflow/python/keras/engine/base_layer.py) is the building block for neural networks. Layers encapsulate computation (call() method) and state (weights). Models compose layers into trainable architectures:
class MyModel(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.dense = tf.keras.layers.Dense(10)

    def call(self, inputs):
        return self.dense(inputs)
Automatic Differentiation
tf.GradientTape (tensorflow/python/eager/backprop.py) records operations for gradient computation. Within a tape context, operations are tracked, enabling backpropagation:
x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x ** 2
grads = tape.gradient(y, x)  # dy/dx = 2x = 6.0
Device Management
The context system (tensorflow/python/eager/context.py) manages device placement. Use tf.device() to specify execution devices:
with tf.device('/GPU:0'):
    result = tf.matmul(a, b)
Key Design Patterns
- Eager-first development: Write code naturally in eager mode, then wrap with @tf.function for performance.
- Composable abstractions: Layers, models, and functions compose seamlessly.
- Automatic shape inference: TensorFlow infers shapes dynamically, supporting variable-length inputs.
- Distributed training: tf.distribute strategies abstract multi-device/multi-machine training.
Execution Engine & Runtime
Relevant Files
- tensorflow/core/common_runtime/eager/execute.cc
- tensorflow/core/common_runtime/eager/eager_executor.cc
- tensorflow/core/common_runtime/executor.cc
- tensorflow/core/distributed_runtime/master.h
- tensorflow/core/tfrt/graph_executor/graph_executor.cc
- tensorflow/python/eager/execute.py
- tensorflow/python/eager/pywrap_tfe_src.cc
TensorFlow's execution engine is the core system that runs computational graphs and eager operations. It bridges Python-level operations with low-level kernel execution across CPUs, GPUs, and other accelerators.
Eager Execution Path
Eager execution enables immediate operation evaluation. When a Python operation is called, the execution flow is:
- Python API (tensorflow/python/eager/execute.py) calls TFE_Py_Execute() via pybind
- C++ Wrapper (pywrap_tfe_src.cc) constructs a TFE_Op and calls TFE_Execute()
- Core Execution (tensorflow/core/common_runtime/eager/execute.cc) routes to either local or remote execution
- Kernel Execution (EagerKernelExecute()) runs the actual kernel on the assigned device
# Python side
result = tf.add(a, b) # Calls quick_execute()
The EagerExecutor class manages asynchronous or synchronous execution of EagerNode objects. In sync mode, operations execute inline immediately. In async mode, nodes are queued and processed by a background thread, enabling pipelined execution.
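From Python, the sync/async dispatch mode can be toggled through the experimental config API; a hedged sketch (assuming tf.config.experimental.set_synchronous_execution is available in the installed version):

import tensorflow as tf

tf.config.experimental.set_synchronous_execution(False)  # ops are queued on the EagerExecutor
a = tf.random.uniform([1024, 1024])
b = tf.matmul(a, a)          # may still be executing on the background thread
print(b[0, 0].numpy())       # reading the value forces synchronization
tf.config.experimental.set_synchronous_execution(True)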
Graph Execution Model
For graph-based execution (used in tf.function and distributed training), the system follows:
- Graph Construction → Build computational graph from operations
- Placement (placer.cc) → Assign each node to a device using colocation constraints
- Optimization → Apply graph optimizations (constant folding, dead code elimination)
- Executor Creation (executor.cc) → Compile the graph into executable form
- Execution → Run kernels respecting data dependencies
The Executor class uses a propagator-based model: nodes become ready when all inputs are available, then execute on their assigned device. A Rendezvous mechanism coordinates data transfer between nodes.
Device Placement & Colocation
The Placer uses a colocation graph to determine device assignments:
- Nodes with explicit device requests are pinned to those devices
- Nodes without requests are placed on available devices (CPU by default)
- Colocation constraints ensure related operations stay together
- Soft placement allows fallback to other devices if constraints cannot be satisfied
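A minimal sketch of explicit placement plus soft-placement fallback from Python (illustrative; the op lands on CPU if no GPU is visible):

import tensorflow as tf

tf.config.set_soft_device_placement(True)  # allow fallback when a request cannot be satisfied
with tf.device("/GPU:0"):                  # explicit device request
    x = tf.random.uniform([4, 4])
    y = tf.matmul(x, x)
print(y.device)                            # the device actually chosen by the placer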
Distributed Execution
For multi-machine training, the Master and Worker components coordinate:
- Master (distributed_runtime/master.h) manages sessions and schedules graph execution
- Worker processes execute subgraphs on local devices
- GraphMgr partitions the graph across workers and manages execution
- Rendezvous handles inter-worker tensor communication via gRPC
TFRT Integration
TensorFlow Runtime (TFRT) provides an alternative execution backend optimized for latency:
- GraphExecutor (tfrt/graph_executor/graph_executor.cc) compiles graphs to MLRT bytecode
- MLRT Interpreter executes compiled functions with minimal overhead
- Supports both synchronous and asynchronous execution modes
- Integrates with fallback mechanisms for unsupported operations
Execution Context
The EagerContext maintains:
- Device manager and available devices
- Function library runtime for executing tf.function
- Rendezvous for inter-op communication
- Thread pool for intra-op parallelism
- Collective executor for distributed operations (AllReduce, etc.)
Key Abstractions
- TensorHandle: Reference to a tensor on a device (may be remote)
- EagerNode: Unit of work in async execution (operation, copy, etc.)
- KernelAndDevice: Pairs a kernel implementation with its target device
- ExecutorState: Manages pending operations and ready queue during graph execution
Distributed Training & Strategies
Relevant Files
- tensorflow/python/distribute/distribute_lib.py
- tensorflow/python/distribute/mirrored_strategy.py
- tensorflow/python/distribute/collective_all_reduce_strategy.py
- tensorflow/python/distribute/parameter_server_strategy_v2.py
- tensorflow/python/distribute/tpu_strategy.py
- tensorflow/core/distributed_runtime/
- tensorflow/python/distribute/cluster_resolver/
TensorFlow's distributed training system enables training across multiple GPUs, TPUs, and machines with minimal code changes. The tf.distribute.Strategy API abstracts the complexity of distributed execution while maintaining compatibility with high-level APIs like Keras.
Core Concepts
Data Parallelism is the primary distribution model: multiple replicas of the model run on different data slices, with gradients aggregated before parameter updates. Key terminology includes:
- Replica: One copy of the model running on one device with one data slice
- Worker: A physical machine containing one or more replicas
- Synchronous Training: Replicas synchronize before updating parameters (via all-reduce)
- Asynchronous Training: Replicas update independently without synchronization
- Mirrored Variables: Variables replicated across devices and kept in sync
- PerReplica Values: Different values per replica, only readable in replica context
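A minimal sketch of replica context and cross-replica aggregation (illustrative; it runs with a single CPU replica if no GPUs are visible):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()   # one replica per visible GPU

@tf.function
def step():
    ctx = tf.distribute.get_replica_context()
    return tf.cast(ctx.replica_id_in_sync_group, tf.float32)  # differs per replica

per_replica = strategy.run(step)              # a PerReplica value, one element per replica
print(strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica, axis=None))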
Distribution Strategies
MirroredStrategy synchronously trains on multiple GPUs on a single machine. Variables are replicated across all devices and kept synchronized using all-reduce operations. Ideal for single-machine multi-GPU setups.
MultiWorkerMirroredStrategy extends mirroring to multiple machines using collective all-reduce operations. Requires cluster configuration via TF_CONFIG environment variable. All workers must run identical code.
ParameterServerStrategy uses a parameter server architecture where variables live on dedicated servers and workers fetch/update them asynchronously. Supports fault tolerance and preemptible instances. Requires a coordinator task to dispatch work.
TPUStrategy optimizes for TPU Pods with specialized collective operations. Requires TPU initialization and cluster resolver setup.
CentralStorageStrategy places all variables on a single device (CPU or GPU) while replicating compute across devices. Useful for testing and small models.
Distributed Runtime Architecture
The runtime coordinates execution across machines using gRPC:
- Master: Manages sessions and schedules graph execution across workers
- Worker: Executes subgraphs on local devices and communicates via gRPC
- GraphMgr: Partitions computation graphs across workers
- Rendezvous: Handles inter-worker tensor communication
- Collective Operations: All-reduce, all-gather, and broadcast for synchronization
Usage Pattern
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    model = tf.keras.Sequential([...])
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)
    model.compile(optimizer=optimizer, loss='mse')

dataset = tf.data.Dataset.from_tensor_slices((x, y))
dist_dataset = strategy.experimental_distribute_dataset(dataset)
model.fit(dist_dataset, epochs=10)
Variables and models created within strategy.scope() become strategy-aware. The strategy automatically handles replication, synchronization, and gradient aggregation.
Cluster Configuration
Multi-worker training requires the TF_CONFIG environment variable, which specifies the cluster topology:
import json, os

os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["worker0:12345", "worker1:12345"]
    },
    "task": {"type": "worker", "index": 0}
})
Cluster resolvers abstract this configuration, supporting GCE, Kubernetes, and custom environments.
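A hedged sketch of using a resolver instead of reading TF_CONFIG by hand (assumes TF_CONFIG is already set in the worker's environment):

import tensorflow as tf

resolver = tf.distribute.cluster_resolver.TFConfigClusterResolver()  # parses TF_CONFIG
strategy = tf.distribute.MultiWorkerMirroredStrategy(cluster_resolver=resolver)
print(resolver.task_type, resolver.task_id)  # e.g. "worker", 0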
Compilation & Optimization
Relevant Files
- tensorflow/compiler/mlir - MLIR-based compilation infrastructure
- tensorflow/compiler/tf2xla - TensorFlow to XLA conversion and bridge
- tensorflow/core/grappler - Graph optimization framework
- tensorflow/compiler/aot - Ahead-of-time compilation for static graphs
- tensorflow/compiler/jit - Just-in-time compilation and clustering
- tensorflow/python/compiler - Python compiler APIs
TensorFlow's compilation and optimization pipeline transforms high-level computation graphs into efficient device-specific code. The system uses multiple layers of optimization: graph-level optimizations via Grappler, MLIR-based transformations, and XLA compilation for accelerators.
Graph Optimization with Grappler
Grappler is TensorFlow's graph optimization framework that runs before execution. The MetaOptimizer coordinates a sequence of specialized passes:
- Constant Folding - Evaluates constant expressions at compile time
- Arithmetic Optimization - Simplifies mathematical operations (e.g., x * 1 = x)
- Layout Optimization - Reorders tensor dimensions for cache efficiency
- Remapping - Fuses compatible operations into single kernels
- Common Subgraph Elimination - Deduplicates identical computation patterns
- Loop Optimization - Optimizes control flow structures
- Function Optimization - Inlines and specializes function calls
Each optimizer can be enabled or disabled via ConfigProto.graph_options.rewrite_options. The MetaOptimizer applies its passes iteratively until the graph stabilizes, and skips graphs below a minimum node count.
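A minimal sketch of toggling individual Grappler passes from Python (illustrative option names; the full key list is documented with tf.config.optimizer):

import tensorflow as tf

tf.config.optimizer.set_experimental_options({
    "constant_folding": True,        # evaluate constant subgraphs ahead of execution
    "arithmetic_optimization": True,
    "remapping": False,              # disable op fusion, e.g. while debugging
})
print(tf.config.optimizer.get_experimental_options())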
MLIR Bridge and tf2xla
The MLIR Bridge (mlir_bridge_pass.cc) converts TensorFlow graphs to XLA-compatible form:
- Clustering - Groups operations for XLA compilation using RunFunctionTf2xlaClusteringBridge()
- Legalization - Converts TensorFlow ops to HLO (High-Level Optimizer) IR
- Runtime Lowering - Inserts device-specific execution ops via RunLowerClusterToRuntimeOpsPassPipeline()
The bridge supports both replicated (TPU) and non-replicated (GPU/CPU) execution paths. Fallback mode allows unsupported ops to execute on the host.
XLA Compilation Pipeline
XLA compiles HLO modules through device-specific backends:
HLO Module → HLO Passes → Layout Assignment → Backend Codegen → Machine Code
Key HLO optimization passes:
- Algebraic simplification and constant propagation
- Fusion - combines multiple ops into single kernels
- Dead code elimination
- Memory layout optimization
- Collective operation optimization (AllReduce, AllGather)
CPU and GPU backends apply specialized passes. GPU compilation includes CUDA kernel generation and cuDNN integration. CPU compilation produces LLVM IR for multiple architectures.
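A minimal sketch of opting a function into this pipeline with jit_compile (illustrative; compilation fails if an op has no XLA lowering):

import tensorflow as tf

@tf.function(jit_compile=True)   # lowered via tf2xla to an HLO module and compiled by XLA
def dense_relu(x, w, b):
    return tf.nn.relu(tf.matmul(x, w) + b)

x = tf.random.uniform([8, 16])
w = tf.random.uniform([16, 4])
b = tf.zeros([4])
print(dense_relu(x, w, b).shape)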
Ahead-of-Time (AOT) Compilation
The tfcompile tool compiles static graphs to standalone C++ libraries:
GraphDef + Config → tf2xla → XLA HLO → Backend → Object Files + Header
AOT compilation generates optimized code without runtime overhead, useful for embedded and mobile deployment.
Configuration and Control
Compilation behavior is controlled via:
- tf.config.optimizer.set_experimental_options() - Enable/disable specific passes
- tf.function(jit_compile=True) - Force XLA compilation
- Environment variables - TF_XLA_FLAGS, TF_DUMP_GRAPH_PREFIX for debugging
- ConfigProto - Fine-grained rewriter configuration
The system automatically selects optimization levels based on graph size and device type, balancing compilation time against execution performance.
TensorFlow Lite & Mobile Inference
Relevant Files
- tensorflow/lite/core/interpreter.h
- tensorflow/lite/core/interpreter_builder.h
- tensorflow/lite/kernels
- tensorflow/lite/delegates
- tensorflow/lite/c/common.h
- tensorflow/lite/core/api/op_resolver.h
TensorFlow Lite is TensorFlow's lightweight solution for on-device machine learning inference on mobile, embedded, and edge devices. It enables low-latency model execution with minimal binary size and fast performance through hardware acceleration.
Core Inference Pipeline
The TensorFlow Lite inference process follows a structured pipeline:
- Model Loading - Load a .tflite model (FlatBuffers format) into memory
- Interpreter Creation - Build an interpreter using InterpreterBuilder with an OpResolver
- Tensor Allocation - Call AllocateTensors() to prepare memory for computation
- Input Preparation - Copy input data into input tensors
- Inference Execution - Call Invoke() to run the model
- Output Retrieval - Read results from output tensors
// Load the FlatBuffers model and build an interpreter with the built-in op resolver.
auto model = tflite::FlatBufferModel::BuildFromFile("model.tflite");
tflite::ops::builtin::BuiltinOpResolver resolver;
std::unique_ptr<tflite::Interpreter> interpreter;
tflite::InterpreterBuilder(*model, resolver)(&interpreter);
interpreter->AllocateTensors();                              // plan memory for all tensors
float* input = interpreter->typed_input_tensor<float>(0);    // first input tensor
// Fill input data...
interpreter->Invoke();                                       // run inference
float* output = interpreter->typed_output_tensor<float>(0);  // first output tensor
Interpreter Architecture
The Interpreter class manages the computation graph and tensor lifecycle. Key responsibilities include:
- Graph Management - Maintains operator nodes and tensor connectivity
- Memory Planning - Uses arena-based allocation for efficient memory usage
- Execution Scheduling - Executes nodes in topologically sorted order
- Tensor Access - Provides typed access to input/output tensors
The interpreter is not thread-safe; clients must serialize access to avoid data races.
Operation Resolution
The OpResolver interface maps operator codes in the FlatBuffers model to executable kernel implementations. Two main implementations exist:
- BuiltinOpResolver - Registers all built-in TensorFlow Lite operations
- MutableOpResolver - Allows selective registration of specific operations for reduced binary size
Hardware Acceleration via Delegates
Delegates enable GPU, DSP, and specialized hardware acceleration by intercepting subgraphs and executing them on alternative backends:
Common Delegates:
- GPU Delegate - Metal (iOS), OpenGL ES (Android) for parallel computation
- NNAPI Delegate - Android Neural Networks API for vendor-specific acceleration
- Hexagon Delegate - Qualcomm DSP for quantized models
- CoreML Delegate - Apple Neural Engine on iOS 12+
- XNNPACK Delegate - CPU optimization for ARM NEON and x86 SSE
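A hedged sketch of attaching a delegate through the Python API (the delegate library name and path are placeholders and platform-specific):

import tensorflow as tf

delegate = tf.lite.experimental.load_delegate("libexample_delegate.so")  # hypothetical library
interpreter = tf.lite.Interpreter(
    model_path="model.tflite",
    experimental_delegates=[delegate],   # the delegate claims the subgraphs it supports
)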
Kernel Implementation
Kernels implement individual operations. The framework supports multiple implementations per operation:
- Reference - Portable C++ implementation
- Optimized - NEON (ARM), SSE (x86), or specialized implementations
- Quantized - Integer-only kernels for reduced memory and latency
Kernel selection happens at runtime based on input data types and available hardware capabilities.
Memory Management
TensorFlow Lite uses a custom arena-based memory allocator (SimpleMemoryArena) that:
- Pre-allocates a single contiguous buffer for all tensors
- Reuses memory across operations when tensors are no longer needed
- Minimizes fragmentation and allocation overhead
- Supports both static and dynamic tensor shapes
Platform Support
TensorFlow Lite provides APIs across multiple platforms and languages:
- C++ - Core API with full control and performance
- Java/Kotlin - Android convenience API with JNI bindings
- Swift/Objective-C - Native iOS APIs
- Python - Development and testing via tf.lite.Interpreter
- C - Stable ABI for embedded systems and microcontrollers
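For development and testing, the same pipeline looks like this in Python (a minimal sketch; model.tflite is a placeholder path):

import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="model.tflite")
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

interpreter.set_tensor(input_details[0]["index"],
                       np.zeros(input_details[0]["shape"], dtype=np.float32))
interpreter.invoke()
print(interpreter.get_tensor(output_details[0]["index"]))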
Data Pipeline & Input Processing
Relevant Files
- tensorflow/core/framework/dataset.h
- tensorflow/python/data/ops/dataset_ops.py
- tensorflow/python/data/ops/from_tensor_slices_op.py
- tensorflow/python/data/ops/map_op.py
- tensorflow/python/data/ops/batch_op.py
- tensorflow/python/data/ops/prefetch_op.py
- tensorflow/core/data/standalone.h
Overview
The tf.data pipeline system provides a composable, efficient API for building input pipelines. It follows a three-step pattern: create a source dataset, apply transformations, and iterate over elements. The architecture spans both Python and C++ layers, with lazy evaluation enabling optimization and streaming execution.
Core Architecture
The pipeline is built on two fundamental abstractions:
DatasetBase (C++): Represents a potentially infinite range of outputs where each output is a tuple of tensors. It defines the logical structure and metadata of the pipeline.
IteratorBase (C++): Represents the current position in a dataset's outputs. Multiple iterators can be created from the same dataset, each maintaining independent state.
The Python API (tf.data.Dataset) wraps these C++ abstractions, providing a user-friendly interface while delegating execution to the TensorFlow runtime.
Source Datasets
Source datasets create initial data from various inputs:
- from_tensor_slices: Slices tensors along the first dimension, creating individual elements
- TextLineDataset: Reads lines from text files
- TFRecordDataset: Reads serialized TFRecord format files
- from_generator: Creates datasets from Python generators
- range: Generates sequences of integers
Each source implements DatasetSource and produces a variant tensor representing the dataset graph.
Transformations
Transformations create new datasets by applying operations to input datasets:
- map: Applies a function to each element (supports parallel execution with num_parallel_calls)
- batch: Groups consecutive elements into batches
- shuffle: Randomizes element order using a buffer
- filter: Selects elements matching a predicate
- prefetch: Overlaps data loading with model training
- interleave: Merges multiple datasets in parallel
- repeat: Cycles through the dataset multiple times
Transformations are composable and lazy—they build a computation graph without executing until iteration begins.
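A minimal end-to-end sketch combining a source with several transformations (illustrative values):

import tensorflow as tf

dataset = (
    tf.data.Dataset.from_tensor_slices(tf.range(10))              # source
    .shuffle(buffer_size=10)                                      # randomize order
    .map(lambda x: x * 2, num_parallel_calls=tf.data.AUTOTUNE)    # parallel element-wise map
    .batch(4)
    .prefetch(tf.data.AUTOTUNE)                                   # overlap input with training
)
for batch in dataset:   # lazy: the pipeline graph executes only now
    print(batch.numpy())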
Execution Model
When iterating, the runtime:
- Creates an iterator from the dataset graph
- Calls GetNext() to fetch elements on demand
- Executes the computation graph lazily
- Supports checkpointing iterator state for resumable pipelines
Optimization
The system includes automatic optimizations:
- Fused operations: map + batch are fused into a single kernel
- Autotune: Dynamically adjusts buffer sizes and parallelism
- Graph rewriting: Simplifies and reorders operations for efficiency
- Cardinality tracking: Determines dataset size when possible
Advanced Features
tf.data.experimental.service: Offloads dataset processing to a distributed service, enabling data sharing across multiple training workers.
Checkpointing: Iterator state can be saved and restored, enabling resumable training pipelines.
Options: Configure behavior by applying a tf.data.Options object with dataset.with_options(), controlling determinism, memory optimization, and performance tuning.
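A short sketch of setting options (illustrative; attribute names follow the tf.data.Options API):

import tensorflow as tf

dataset = tf.data.Dataset.range(100)
options = tf.data.Options()
options.deterministic = False      # trade strict ordering for throughput
options.autotune.enabled = True    # let tf.data tune parallelism and buffer sizes
dataset = dataset.with_options(options)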
Model Persistence & Serialization
Relevant Files
- tensorflow/python/saved_model/save.py
- tensorflow/python/saved_model/load.py
- tensorflow/python/checkpoint/checkpoint.py
- tensorflow/core/protobuf/saved_model.proto
- tensorflow/core/protobuf/meta_graph.proto
- tensorflow/core/protobuf/saved_object_graph.proto
- tensorflow/core/protobuf/trackable_object_graph.proto
TensorFlow provides two complementary persistence mechanisms: SavedModel for complete model export and Checkpoints for training state management. Both rely on Protocol Buffers and a trackable object graph system.
SavedModel Format
SavedModel is the universal serialization format for TensorFlow models. It captures the complete model state in a language-neutral, hermetic format suitable for production serving and cross-platform deployment.
Directory Structure:
saved_model/
├── saved_model.pb # Main protobuf (SavedModel message)
├── assets/ # Auxiliary files (vocabularies, etc.)
├── assets.extra/ # User-provided assets
└── variables/ # Variable checkpoints
├── variables.index
└── variables.data-?????-of-?????
Core Components:
- SavedModel proto (saved_model.proto): Top-level container with schema version and MetaGraphDef list
- MetaGraphDef (meta_graph.proto): Contains graph definition, signatures, assets, and object graph
- SavedObjectGraph (saved_object_graph.proto): Flattened object dependency graph with function and type information
- Signatures: Named input/output specifications for inference (SignatureDef)
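A minimal export/reload sketch (illustrative; /tmp/adder is a placeholder path):

import tensorflow as tf

class Adder(tf.Module):
    @tf.function(input_signature=[tf.TensorSpec([None], tf.float32)])
    def add_one(self, x):
        return x + 1.0

tf.saved_model.save(Adder(), "/tmp/adder")    # writes saved_model.pb, variables/, assets/
reloaded = tf.saved_model.load("/tmp/adder")
print(reloaded.add_one(tf.constant([1.0, 2.0])))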
Checkpoint System
Checkpoints preserve training state including variables, optimizer state, and object relationships. The checkpoint format uses a two-file structure: an index file and sharded data files.
Key Components:
- TrackableObjectGraph (trackable_object_graph.proto): Maps Python objects to checkpoint variables
- Checkpoint class: Manages save/restore operations with automatic dependency tracking
- SaveableObject: Wraps objects that need custom serialization logic
- Slot variables: Optimizer state (momentum, velocity) linked to original variables
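A minimal save/restore sketch with tf.train.Checkpoint (illustrative; /tmp/ckpt is a placeholder prefix):

import tensorflow as tf

var = tf.Variable(3.0)
opt = tf.keras.optimizers.SGD(learning_rate=0.1)
ckpt = tf.train.Checkpoint(variable=var, optimizer=opt)  # dependencies tracked automatically

path = ckpt.save("/tmp/ckpt/train")   # writes an index file plus sharded data files
var.assign(0.0)
ckpt.restore(path)                    # restores the tracked variable from the checkpoint
print(var.numpy())                    # 3.0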
Key Mechanisms
Trackable System: Objects inherit from Trackable base class, enabling automatic dependency discovery. The framework traverses the object graph to identify all variables and nested objects.
Function Tracing: @tf.function decorated methods are traced into concrete functions and stored as SavedBareConcreteFunction with captured tensors and input signatures.
Asset Management: External files (vocabularies, lookup tables) are copied to the assets/ directory and referenced in the SavedModel proto.
Fingerprinting: Optional fingerprints (content hashes) uniquely identify a SavedModel and detect when its contents have changed.
Loading and Restoration
tf.saved_model.load() reconstructs the object graph bottom-up: creates all objects first (ordered by dependencies), then connects edges. tf.train.Checkpoint.restore() uses the trackable graph to selectively restore variables, enabling flexible model evolution.
Compatibility: SavedModel supports forward/backward compatibility through schema versioning and stripped default attributes in graph definitions.
Language Bindings & APIs
Relevant Files
- tensorflow/c/c_api.h - Core C API for TensorFlow
- tensorflow/c/eager/c_api.h - Eager execution C API
- tensorflow/cc/client/client_session.h - C++ session management
- tensorflow/cc/framework/scope.h - C++ graph construction
- tensorflow/go/session.go - Go bindings for sessions
- tensorflow/go/graph.go - Go graph construction
- tensorflow/java/src/main/native/ - Java JNI bindings
TensorFlow provides language bindings for C, C++, Go, and Java, each designed for different use cases and deployment scenarios. These bindings wrap the core TensorFlow runtime and expose APIs tailored to each language’s idioms and performance characteristics.
C API: The Foundation Layer
The C API (tensorflow/c/c_api.h) is the lowest-level public interface and serves as the foundation for all other language bindings. It prioritizes simplicity and uniformity over convenience, making it ideal for language-specific wrappers.
Key Design Principles:
- Opaque struct pointers for all objects (no direct memory layout exposure)
- Prefix TF_ for all symbols
- TF_Status for error handling across the ABI boundary
- Stable ABI for shared library compatibility
Core Components:
- Graph construction (TF_Graph, TF_Operation)
- Session execution (TF_Session, TF_Run)
- Tensor management (TF_Tensor, TF_Buffer)
- Status and error reporting
Use Case: Embedding TensorFlow in C/C++ applications, creating language bindings, and systems requiring stable ABI boundaries.
C++ API: High-Level Convenience
The C++ API (tensorflow/cc/) provides idiomatic C++ abstractions built on top of the C API. It includes:
- Scope & Ops Framework (tensorflow/cc/framework/scope.h): Fluent API for graph construction with automatic dependency tracking
- ClientSession (tensorflow/cc/client/client_session.h): Session management with RAII semantics
- Gradients (tensorflow/cc/framework/gradients.h): Automatic differentiation support
- SavedModel (tensorflow/cc/saved_model/): Model loading and inference
Example Usage:
// Assumes: using namespace tensorflow; using namespace tensorflow::ops;
Scope root = Scope::NewRootScope();          // root scope owns the graph under construction
auto a = Placeholder(root, DT_INT32);
auto c = Add(root, a, {41});                 // ops framework adds nodes to root's graph
ClientSession session(root);                 // session bound to the scope's graph (RAII)
std::vector<Tensor> outputs;
session.Run({ {a, {1}} }, {c}, &outputs);    // feed a = {1}, fetch c -> {42}
Use Case: Production inference servers, model serving, and C++ applications requiring full TensorFlow functionality.
Go Bindings: Graph Construction & Execution
The Go bindings (tensorflow/go/) wrap the C API through cgo, providing idiomatic Go interfaces for graph construction and execution.
Key Components:
- Graph (graph.go): Build computation graphs
- Session (session.go): Execute graphs with concurrent Run() support
- Tensor (tensor.go): Type-safe tensor representation
- Operations (op/op.go): Generated operation wrappers
Example Usage:
scope := op.NewScope()                        // NewScope owns the graph being built
input := op.Placeholder(scope, tf.String)
output := op.StringUpper(scope, input)
graph, _ := scope.Finalize()                  // finish construction, obtain the *tf.Graph
session, _ := tf.NewSession(graph, nil)
defer session.Close()
feed, _ := tf.NewTensor("hello")
result, _ := session.Run(
    map[tf.Output]*tf.Tensor{input: feed},    // feeds must be *tf.Tensor values
    []tf.Output{output},                      // fetches
    nil,                                      // no extra targets
)
fmt.Println(result[0].Value())                // "HELLO"
Use Case: Go microservices, edge inference, and systems where Go’s concurrency model is beneficial.
Java Bindings: JNI Integration
The Java bindings (tensorflow/java/) use JNI to bridge Java and the C API. The legacy version in this repository has been superseded by the TensorFlow Java repository.
Architecture:
- JNI wrappers in tensorflow/java/src/main/native/ (C++ code)
- Java interfaces in tensorflow/java/src/main/java/
- Support for graph-based and eager execution
Note: For modern JVM usage, refer to the external TensorFlow Java repository. For Android, use TensorFlow Lite.
Choosing the Right Binding
- C API: Low-level integration, custom language bindings, ABI stability required
- C++ API: Full-featured applications, model serving, maximum performance
- Go: Microservices, cloud-native deployments, concurrent workloads
- Java: Legacy JVM applications (use external TensorFlow Java for new projects)
Each binding maintains the same underlying execution semantics while adapting to language-specific idioms and deployment patterns.