Overview
Relevant Files
- hudi-common/src/main/java/org/apache/hudi/common/HoodieTableFormat.java
- hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java
- hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java
- hudi-common/src/main/java/org/apache/hudi/common/model/HoodieTableType.java
Apache Hudi is an open data lakehouse platform that provides a high-performance table format for ingesting, storing, and querying data across cloud environments. It combines the benefits of data lakes and data warehouses by offering ACID transactions, efficient upserts, and multiple query types on cloud storage.
Core Architecture
Hudi's architecture is built on several interconnected layers:
HoodieTable is the central abstraction representing a Hudi table. It manages all table operations including reads, writes, and maintenance tasks. The table holds references to critical components: the write configuration, metadata client, index, storage layout, and timeline.
HoodieTableMetaClient provides access to table metadata and the timeline. It tracks all commits, compactions, cleanups, and other actions performed on the table. The timeline is computed lazily and cached for performance.
HoodieTableFormat is an extensible interface allowing external table formats to integrate with Hudi. It provides hooks for lifecycle events like commits, cleanups, and rollbacks, enabling custom metadata management and format-specific behavior.
Timeline tracks the complete history of table changes. Each action (commit, compaction, clean, etc.) creates an instant with a timestamp. The timeline enables point-in-time queries, incremental reads, and change data capture.
Table Types
Hudi supports two table types with different write and read tradeoffs:
Copy-on-Write (COW) tables rewrite entire files on updates. All data is stored in base files (Parquet), providing zero read amplification and excellent query performance. This is the default table type, ideal for read-heavy workloads.
Merge-on-Read (MOR) tables append updates to log files instead of rewriting base files. Compaction asynchronously merges logs into base files. MOR provides faster writes but requires merging during reads, making it suitable for write-heavy streaming scenarios.
Write Operations
Hudi supports multiple write operations:
- Insert - Add new records to the table
- Upsert - Insert or update records based on record keys
- Bulk Insert - Efficiently load large datasets without caching
- Delete - Remove records by key
- Insert Overwrite - Replace data in specific partitions
- Cluster - Reorganize data layout for query optimization
- Compact - Merge log files into base files (MOR tables only)
Key Components
FileSystemViewManager provides views of the table's file system state, supporting both base-file-only and full slice views for efficient query planning.
HoodieIndex maintains record-level indexes using bloom filters and column statistics to accelerate lookups during upserts and deletes.
HoodieStorageLayout defines how files are organized within the table. Supported layouts include DEFAULT (unordered) and BUCKET (hash-based partitioning).
HoodieTableMetadata maintains auxiliary metadata partitions for file listings, column statistics, and bloom filters, enabling efficient query planning without full file system scans.
Query Types
Hudi supports multiple query modes on a single table:
- Snapshot Query - Latest committed state with index acceleration
- Incremental Query - Changes since a point in time
- Time-Travel Query - Table state at a specific instant
- Change Data Capture - Stream of inserts, updates, and deletes
- Read-Optimized Query - Pure columnar reads (MOR tables only)
Architecture & Core Components
Relevant Files
- hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java
- hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java
- hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndex.java
- hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieTimeline.java
- hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java
Apache Hudi's architecture is built on four core layers that work together to provide a transactional data lake platform. Understanding these layers is essential for working with the codebase.
Core Components
HoodieTableMetaClient is the entry point for accessing table metadata. It manages the .hoodie directory structure, loads table configuration, and provides access to the timeline. All timelines are computed lazily and cached, so you must call reload() to refresh them after mutations.
HoodieTimeline represents a view of metadata instants (specific points in time) in the table. It is immutable and supports filtering operations that chain together. The timeline tracks all actions: commits, delta commits, compactions, cleanups, rollbacks, savepoints, and more. Each action has three states: requested, inflight, and completed.
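For orientation, a minimal sketch of how these two classes are typically used together. It assumes a Hadoop-style configuration object and a placeholder table path, and the builder/reload method names can differ slightly across Hudi versions:
// Build a meta client for an existing table (the .hoodie directory lives under the base path)
HoodieTableMetaClient metaClient = HoodieTableMetaClient.builder()
    .setConf(hadoopConf)            // storage/Hadoop configuration (type varies by Hudi version)
    .setBasePath("/path/to/table")  // placeholder table base path
    .build();
// Timelines are immutable views; filter operations chain and return new timelines
HoodieTimeline completedCommits = metaClient.getActiveTimeline()
    .getCommitsTimeline()
    .filterCompletedInstants();
Option<HoodieInstant> lastCommit = completedCommits.lastInstant();
// After the table is mutated elsewhere, reload to pick up new instants
metaClient.reloadActiveTimeline();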
HoodieTable is the abstraction layer for table operations. It encapsulates write configuration, metadata access, and file system views. Engine-specific implementations (Spark, Flink, Java) extend this class to provide concrete write, read, and compaction operations. HoodieTable holds references to the metaClient, index, and metadata writer.
BaseHoodieWriteClient orchestrates write operations and manages the commit lifecycle. It creates HoodieTable instances, manages indexes, and handles commit metadata. The client is responsible for starting commits, writing data, updating indexes, and finalizing transactions.
Index Layer
HoodieIndex is an abstract base class for different indexing strategies. It provides two core operations: tagLocation() finds existing records by looking up their file locations, and updateLocation() updates the index after writes. Multiple implementations exist (Bloom, Bucket, Simple, In-Memory Hash) to support different use cases and performance characteristics.
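In practice the concrete index is selected through the write configuration rather than instantiated directly. A hedged sketch, assuming the HoodieIndexConfig builder and a BLOOM index (adjust for your engine and version):
// Choose an index implementation via the write config (BLOOM shown as an example)
HoodieWriteConfig writeConfig = HoodieWriteConfig.newBuilder()
    .withPath("/path/to/table")   // placeholder table base path
    .withIndexConfig(HoodieIndexConfig.newBuilder()
        .withIndexType(HoodieIndex.IndexType.BLOOM)
        .build())
    .build();
// During upserts the write client calls index.tagLocation(...) to locate existing records,
// and index.updateLocation(...) after writes to persist the new file locations.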
Timeline Management
The timeline uses a state machine for each instant: requested → inflight → completed. This ensures atomicity and allows recovery from failures. The timeline layout is versioned to support schema evolution. The active timeline tracks recent and in-progress instants, while the archived timeline stores older, completed instants to keep the active timeline compact.
Table Types
Hudi supports two table types, each with different write and read tradeoffs:
- Copy-on-Write (CoW): Synchronous compaction during writes. Faster reads, slower writes.
- Merge-on-Read (MoR): Asynchronous compaction. Faster writes, slower reads with log file merging.
Both types are implemented as concrete HoodieTable subclasses that override abstract methods for upsert, insert, compact, and clean operations.
Table Types & Storage Format
Relevant Files
- hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/HoodieSparkCopyOnWriteTable.java
- hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/HoodieSparkMergeOnReadTable.java
- hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormat.java
- hudi-common/src/main/java/org/apache/hudi/common/model/HoodieLogFile.java
- hudi-common/src/main/java/org/apache/hudi/common/model/HoodieFileFormat.java
Hudi supports two table types with fundamentally different write and read tradeoffs. Each type is implemented as a concrete subclass of HoodieTable that overrides methods for upsert, insert, delete, and compaction operations.
Copy-on-Write (CoW) Tables
CoW tables prioritize read performance by maintaining all data in base files. When records are updated or deleted, Hudi rewrites the entire base file with the new values. This eliminates read amplification since queries always read from base files directly.
Storage Structure:
- Base files only (Parquet or ORC format)
- File naming: {fileId}_{writeToken}_{instantTime}.parquet
- Updates trigger immediate file rewrites
Write Operations:
- Inserts: Create new base files or merge with the smallest existing file
- Updates: Rewrite the entire base file with updated records
- Deletes: Rewrite the base file excluding deleted records
Characteristics:
- Zero read amplification
- Synchronous compaction during writes
- Slower writes, faster reads
- Default table type, ideal for read-heavy workloads
Merge-on-Read (MoR) Tables
MoR tables optimize write performance by deferring merges until compaction. Updates append to log files instead of rewriting base files. During reads, Hudi merges log files with base files on-the-fly, providing faster writes at the cost of read latency.
Storage Structure:
- Base files (Parquet/ORC) plus log files
- Log file naming: .{fileId}_{instantTime}.log.{version}_{writeToken}
- Log files use Avro binary format with block-based structure
- Default log file size threshold: 512 MB
Write Operations:
- Inserts: Same as CoW (create new base files)
- Updates: Append to rolling log files per file ID
- Deletes: Write delete markers to log files
- Compaction: Asynchronously merges log files into base files
Log File Format:
Log files consist of blocks separated by magic markers (#HUDI#). Each block can be:
- Data Block: Records serialized in Avro binary format
- Delete Block: List of keys to delete (tombstones)
- Command Block: Rollback or other control commands
Characteristics:
- Higher write throughput
- Read amplification due to log merging
- Asynchronous compaction
- Ideal for write-heavy streaming scenarios
File Organization
Both table types organize data by partitions. Within each partition:
partition_path/
├── base_file_1.parquet
├── base_file_2.parquet
├── .file_id_1_timestamp.log.1_token
├── .file_id_1_timestamp.log.2_token
└── .file_id_2_timestamp.log.1_token
File Naming Convention:
- Base files: {fileId}_{writeToken}_{instantTime}.{extension}
- Log files: .{fileId}_{instantTime}.log.{version}_{writeToken}
- Write tokens track concurrent writes to prevent conflicts
Choosing a Table Type
| Aspect | CoW | MoR |
|---|---|---|
| Read Latency | Low | Higher (log merging) |
| Write Latency | Higher | Low |
| Storage | Base files only | Base + log files |
| Compaction | Synchronous | Asynchronous |
| Use Case | Analytics, reporting | Real-time streaming |
CoW is the default and recommended for most use cases. Choose MoR when write throughput is critical and you can tolerate read latency.
Timeline & Metadata Management
Relevant Files
- hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieInstant.java
- hudi-common/src/main/java/org/apache/hudi/common/table/timeline/versioning/v2/ActiveTimelineV2.java
- hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java
- hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java
Timeline Architecture
The timeline is Hudi's core mechanism for tracking all table operations and maintaining consistency. Every action (commit, compaction, clean, rollback) is represented as an instant with a unique timestamp and state.
Instant States
Each instant progresses through three states:
- REQUESTED – Action is requested but not yet started (used for compactions, cleanups)
- INFLIGHT – Action is in progress; metadata is being written
- COMPLETED – Action finished successfully; data is committed
State transitions follow a strict lifecycle: REQUESTED → INFLIGHT → COMPLETED. This ensures atomicity and allows recovery from failures.
Key Timeline Operations
// Create a new pending instant
timeline.createNewInstant(instant);
// Transition from REQUESTED to INFLIGHT
timeline.transitionRequestedToInflight(requestedInstant, metadata);
// Mark as complete (INFLIGHT to COMPLETED)
timeline.saveAsComplete(inflightInstant, metadata);
// Revert completed instant back to inflight
timeline.revertToInflight(completedInstant);
Metadata Table Structure
The metadata table is an internal Hudi table that maintains auxiliary indexes and statistics about the data table. It enables efficient query planning without full filesystem scans.
Metadata Partitions
The metadata table is organized into specialized partitions:
- files – Lists all files in each data partition; enables fast file discovery
- column_stats – Min/max values and null counts per column; enables predicate pushdown
- bloom_filters – Bloom filter indexes for record-level filtering
- record_index – Maps record keys to file locations; accelerates upserts and lookups
- partition_stats – Statistics aggregated at partition level
- expr_index – Expression-based indexes for custom predicates
- secondary_index – Indexes on non-primary key columns
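These partitions are enabled through the metadata configuration on the writer. A rough sketch, assuming the HoodieMetadataConfig builder; flag names vary somewhat across releases:
// Enable the metadata table plus selected index partitions on the write path
HoodieMetadataConfig metadataConfig = HoodieMetadataConfig.newBuilder()
    .enable(true)                        // files partition
    .withMetadataIndexColumnStats(true)  // column_stats partition
    .withMetadataIndexBloomFilter(true)  // bloom_filters partition
    .withEnableRecordIndex(true)         // record_index partition
    .build();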
Metadata Initialization and Sync
The metadata table is lazily initialized when first accessed. It maintains synchronization with the data table through:
- Valid instant timestamps – Only instants with corresponding completed data commits are read
- File slice caching – Latest merged file slices are cached per partition for performance
- Reusable readers – File handles can be reused across multiple lookups to reduce I/O
// Initialize metadata table
HoodieBackedTableMetadata metadata = new HoodieBackedTableMetadata(
engineContext, storage, metadataConfig, datasetBasePath);
// Reset and reload from storage
metadata.reset();
// Get synced instant time
Option<String> syncedTime = metadata.getSyncedInstantTime();
Timeline and Metadata Integration
The timeline and metadata work together to provide consistency:
- Write operations create instants and update the timeline
- Metadata table is updated asynchronously with new file and statistics records
- Readers query the timeline to find valid instants, then use metadata for optimization
- Rollbacks revert both timeline state and metadata records to maintain consistency
Spark Integration & Datasource
Relevant Files
- hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala
- hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
- hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala
- hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/MergeOnReadSnapshotRelation.scala
- hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
- hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieStreamingSink.scala
Hudi integrates with Apache Spark through a comprehensive datasource provider that enables reading and writing Hudi tables using Spark SQL and DataFrames. The integration supports multiple query modes, table types, and streaming operations.
Architecture Overview
The Spark datasource is built on Spark's DataSourceRegister API. The DefaultSource class implements multiple provider interfaces to support different operations:
- RelationProvider & SchemaRelationProvider for reading tables
- CreatableRelationProvider for writing DataFrames
- StreamSinkProvider & StreamSourceProvider for streaming operations
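These providers are exercised through the standard DataFrame API. A minimal sketch (Java); it assumes a SparkSession spark and a DataFrame df are in scope, and the table path and key/partition fields are placeholders:
// Write: upsert a DataFrame into a Hudi table (MERGE_ON_READ chosen here for illustration)
df.write().format("hudi")
    .option("hoodie.table.name", "my_table")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.table.type", "MERGE_ON_READ")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.partitionpath.field", "date")
    .mode(SaveMode.Append)
    .save("/path/to/table");
// Read: snapshot query of the latest committed state
Dataset<Row> snapshot = spark.read().format("hudi").load("/path/to/table");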
Query Types
Hudi supports three primary query modes, each optimized for different use cases:
- Snapshot Query (default): Returns the latest view of data by merging base files with log files. Provides complete, up-to-date records. Used for both COW and MOR tables.
- Read-Optimized Query: Reads only base files, skipping log files. Faster but may miss recent updates in MOR tables. Ideal for analytical workloads where slight staleness is acceptable.
- Incremental Query: Fetches only records changed since a specified instant time. Supports Change Data Capture (CDC) format to track inserts, updates, and deletes. Essential for streaming pipelines and change propagation.
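For example, an incremental pull since a given instant might look like this; the begin instant time is a placeholder value:
// Incremental query: only records written after the given instant time
Dataset<Row> changes = spark.read().format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20240101000000000")
    .load("/path/to/table");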
File Index & Partition Pruning
The HoodieFileIndex class handles intelligent file discovery and partition pruning:
- Loads all files and partition values from the table path
- Applies partition filters to reduce scanned partitions
- Leverages metadata table indices (Record-Level Index, Partition Bucket Index) for data skipping
- Converts partition filters for custom key generators
- Returns PartitionDirectory objects mapping partitions to file slices
This optimization significantly reduces I/O for queries with partition predicates.
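Data skipping via the metadata table is driven by read options. A sketch, assuming the hoodie.metadata.enable and hoodie.enable.data.skipping keys (names may differ by release):
// Enable metadata-table-backed file pruning for a snapshot read
Dataset<Row> pruned = spark.read().format("hudi")
    .option("hoodie.metadata.enable", "true")       // use metadata table for listings and stats
    .option("hoodie.enable.data.skipping", "true")  // prune files using column statistics
    .load("/path/to/table");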
Reading Relations
Different relation classes handle different query scenarios:
- BaseFileOnlyRelation: Reads only base files (COW snapshot, MOR read-optimized)
- MergeOnReadSnapshotRelation: Merges base files with log files for MOR snapshot queries
- IncrementalRelation: Handles incremental and CDC queries by tracking timeline changes
Each relation composes RDDs with appropriate readers and applies filters at the file and record level.
Writing & Streaming
The HoodieSparkSqlWriter handles all write operations through a unified interface:
- Supports batch writes (insert, upsert, delete, bulk insert)
- Manages table creation and schema evolution
- Handles deduplication and conflict resolution
- Supports async compaction and clustering
For streaming, HoodieStreamingSink processes micro-batches:
- Integrates with Spark Structured Streaming
- Manages checkpoint state for exactly-once semantics
- Triggers async compaction and clustering services
- Handles batch deduplication and idempotency
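A hedged sketch of wiring the sink into Structured Streaming (Java); it assumes a streaming Dataset streamingDf is in scope, and the checkpoint location, paths, and key fields are placeholders:
// Structured Streaming write into a Hudi table; each micro-batch becomes one commit
StreamingQuery query = streamingDf.writeStream()
    .format("hudi")
    .option("hoodie.table.name", "my_table")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.partitionpath.field", "date")
    .option("checkpointLocation", "/path/to/checkpoints")
    .outputMode("append")
    .start("/path/to/table");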
Version Compatibility
The datasource is split across multiple modules to support different Spark versions:
- hudi-spark-common: Shared code for Spark 3.x and 4.x
- hudi-spark3-common: Spark 3.x specific implementations
- hudi-spark3.3.x, hudi-spark3.4.x, hudi-spark3.5.x: Version-specific adapters
- hudi-spark4-common, hudi-spark4.0.x: Spark 4.x support
Each version has a SparkAdapter that abstracts API differences, enabling a single codebase to work across multiple Spark releases.
Flink Integration & Streaming
Relevant Files
- hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java
- hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableFactory.java
- hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteFunction.java
- hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java
- hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/Pipelines.java
- hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/HoodiePipeline.java
- hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/streamer/HoodieFlinkStreamer.java
Hudi provides first-class support for Apache Flink streaming through a comprehensive datasource and sink implementation. This integration enables real-time data ingestion, transformation, and querying with exactly-once semantics and support for multiple write modes.
Core Components
HoodieTableSource implements Flink's ScanTableSource interface, supporting both bounded and streaming reads. It handles snapshot queries, incremental reads, and CDC (Change Data Capture) modes. The source supports projection pushdown, filter pushdown, and limit pushdown for query optimization.
HoodieTableFactory creates both sources and sinks from Flink SQL table definitions. It automatically infers table configuration from the Hudi table metadata and validates schema compatibility.
StreamWriteFunction buffers incoming records in a binary in-memory sort buffer and flushes them when batch size thresholds are exceeded or checkpoints trigger. It implements exactly-once semantics by coordinating with the operator coordinator.
StreamWriteOperatorCoordinator manages the write lifecycle across all parallel tasks. It creates new instants on checkpoint boundaries, commits metadata, and schedules compaction or clustering operations.
Write Modes
Hudi Flink supports three primary write modes:
- Upsert Mode (default): Merges incoming records with existing data using primary keys. Requires bootstrap to load the index. Supports both Copy-on-Write (COW) and Merge-on-Read (MOR) tables.
- Append Mode: Writes records directly to base files without index lookups. Faster for insert-only workloads. Disables compaction and supports optional async clustering.
- Bulk Insert: Optimized for batch operations. Shuffles data by partition path, sorts within partitions, and writes in a single pass. Requires batch execution mode.
Pipeline Construction
The Pipelines utility class orchestrates complex streaming pipelines. For upsert operations, the pipeline chains: bootstrap → bucket assignment → stream write → compaction/clustering → clean. For append mode, it skips bootstrap and bucket assignment, writing directly to files.
The HoodiePipeline builder provides a fluent API for programmatic pipeline construction:
// Table options (base path, write mode, etc.) are supplied as key/value pairs
Map<String, String> options = new HashMap<>();
options.put("path", "/path/to/table");

HoodiePipeline.Builder builder = HoodiePipeline.builder("myTable");
DataStreamSink<?> sink = builder
    .column("id int")
    .column("name string")
    .pk("id")
    .partition("date")
    .options(options)
    .sink(dataStream, false);
Streaming Configuration
Key configuration options control streaming behavior:
- write.tasks: Number of parallel write tasks
- write.batch.size: Records per batch before flush
- compaction.schedule.enabled: Enable async compaction
- clustering.schedule.enabled: Enable async clustering
- read.as.streaming: Enable streaming reads with monitoring
- operation: Write mode (UPSERT, INSERT, BULK_INSERT, INSERT_OVERWRITE)
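These keys are passed as table options, for example through the HoodiePipeline builder shown above or a Flink SQL WITH (...) clause. A sketch (Java); the path key and values are illustrative:
// Flink write options keyed by the names listed above
Map<String, String> options = new HashMap<>();
options.put("path", "/path/to/table");            // table base path (assumed key)
options.put("operation", "upsert");
options.put("write.tasks", "4");
options.put("write.batch.size", "256");           // flush threshold
options.put("compaction.schedule.enabled", "true");
// pass via HoodiePipeline.builder(...).options(options)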
Exactly-Once Semantics
Hudi achieves exactly-once semantics through checkpoint coordination. The operator coordinator creates a new instant before each checkpoint, and tasks buffer data until the checkpoint completes. Failed writes are rolled back, and the coordinator retries on recovery. One Hudi instant spans exactly one checkpoint interval.
Streaming Reads
Streaming reads use StreamReadMonitoringFunction to detect new commits and StreamReadOperator to read file splits. The source supports snapshot queries, incremental reads from specific commits, and CDC mode for capturing all changes, including deletes.
Data Ingestion & Streaming
Relevant Files
- hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/HoodieStreamer.java
- hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java
- hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/Source.java
- hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/KafkaSource.java
- hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JsonKafkaSource.java
- hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/AvroKafkaSource.java
- hudi-kafka-connect/src/main/java/org/apache/hudi/kafka/connect
Hudi provides multiple mechanisms for ingesting data from streaming and batch sources. The primary entry point is HoodieStreamer, which orchestrates the entire ingestion pipeline in continuous or batch mode.
HoodieStreamer: Main Orchestrator
HoodieStreamer is the primary utility for incremental data ingestion. It operates in two modes:
- Continuous Mode: Runs in a loop, repeatedly pulling data, writing to Hudi, and scheduling compactions
- Bootstrap Mode: Performs initial bulk load of historical data before switching to continuous ingestion
The streamer manages the complete lifecycle: source reading, schema resolution, write operations, compaction scheduling, and metadata synchronization (Hive, Iceberg, etc.).
Source Abstraction
All data sources extend the Source<T> abstract class. The key method is fetchNext(), which returns an InputBatch containing:
- Batch data: JavaRDD or Dataset in various formats (JSON, Avro, Protobuf, Row)
- Checkpoint: Tracks progress for resumable ingestion
- SchemaProvider: Supplies schema for data transformation
Kafka Integration
Hudi supports multiple Kafka source implementations:
JsonKafkaSource: Reads JSON-serialized messages from Kafka topics. Optionally appends Kafka metadata (offset, partition, timestamp) to records.
AvroKafkaSource: Reads Avro-serialized messages. Supports schema registry integration for schema evolution.
ProtoKafkaSource: Reads Protocol Buffer messages with automatic conversion to Avro.
All Kafka sources use KafkaOffsetGen to manage offset ranges and checkpoint state, enabling exactly-once semantics through offset tracking.
Kafka Connect Sink
The Hudi Kafka Connect Sink provides a distributed connector for Kafka Connect clusters. It:
- Writes Kafka records directly to Hudi tables
- Coordinates transactions across multiple tasks via a control topic
- Supports both Copy-On-Write and Merge-On-Read table types
- Schedules async compaction and clustering as needed
StreamSync: Write Pipeline
StreamSync handles the actual write operations:
- Reads from source via checkpoint-based resumption
- Applies schema transformations and sanitization
- Writes records using SparkRDDWriteClient
- Manages compaction scheduling for MOR tables
- Syncs metadata to external systems (Hive, Iceberg)
Checkpoint Management
Hudi uses two checkpoint versions:
- V1: Request-time based (legacy)
- V2: Completion-time based (recommended for Hudi 1.0+)
Checkpoints enable resumable ingestion after failures, tracking the last successfully processed batch.
Configuration
Key configurations for ingestion:
# Source selection
hoodie.streamer.source.class=org.apache.hudi.utilities.sources.JsonKafkaSource
# Kafka-specific
hoodie.kafka.bootstrap.servers=localhost:9092
hoodie.kafka.topic=my-topic
# Schema provider
hoodie.streamer.schemaprovider.class=org.apache.hudi.utilities.schema.SchemaRegistryProvider
hoodie.streamer.schemaprovider.registry.url=http://localhost:8081
# Write options
hoodie.datasource.write.recordkey.field=id
hoodie.datasource.write.partitionpath.field=date
Error Handling
Sources implement resilience through:
- HoodieRetryingKafkaConsumer: Retries failed Kafka reads
- Checkpoint recovery: Resumes from last successful batch
- Error tables: Optional table for tracking failed records
Metadata Sync & Catalog Integration
Relevant Files
- hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/HoodieSyncTool.java
- hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncTool.java
- hudi-sync/hudi-adb-sync/src/main/java/org/apache/hudi/sync/adb/AdbSyncTool.java
- hudi-sync/hudi-datahub-sync/src/main/java/org/apache/hudi/sync/datahub/DataHubSyncTool.java
- hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/util/SyncUtilHelpers.java
Overview
Hudi's metadata sync system bridges the gap between Hudi tables and external data catalogs and metastores. After data is written to a Hudi table, the sync layer automatically propagates schema, partition, and table metadata to external systems like Hive, Alibaba Cloud AnalyticDB (ADB), AWS Glue, and DataHub. This enables seamless querying through external tools while maintaining a single source of truth in Hudi.
Architecture
The sync system follows a pluggable, extensible design.
Core Components
HoodieSyncTool is the abstract base class that all sync implementations extend. It provides:
- Configuration management via HoodieSyncConfig
- Metrics collection for sync operations
- Hadoop configuration and filesystem access
- Abstract syncHoodieTable() method for subclasses to implement
HoodieSyncClient is the common client interface that handles:
- Schema extraction from Parquet files
- Partition discovery and change tracking
- Timeline management to identify new/modified partitions
- Partition value extraction using configurable extractors
Sync Implementations
HiveSyncTool syncs to Apache Hive metastore. For Copy-on-Write tables, it creates a single snapshot table. For Merge-on-Read tables, it can create read-optimized (_ro) and real-time (_rt) views based on the configured strategy. It handles schema evolution, partition updates, and table recreation on errors.
AdbSyncTool targets Alibaba Cloud AnalyticDB with similar partition and schema sync logic, but uses ADB-specific APIs and configuration.
DataHubSyncTool syncs metadata to the DataHub data governance platform via REST APIs, including schema metadata and table properties.
Execution Flow
Sync is triggered after successful commits in multiple contexts:
- Spark DataSource: HoodieSparkSqlWriter calls SyncUtilHelpers.runHoodieMetaSync() after writes
- Flink Streaming: StreamWriteOperatorCoordinator executes sync asynchronously after checkpoints
- DeltaStreamer: StreamSync runs sync after each batch write
- Kafka Connect: KafkaConnectTransactionServices syncs after transaction completion
Concurrency & Locking
SyncUtilHelpers maintains per-table locks using ReentrantLock to prevent concurrent modifications to the same metastore. This avoids ConcurrentModificationException errors when multiple writers target the same table. Each table (identified by base path) gets its own lock, allowing parallel syncs for different tables.
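The per-table locking idea is the classic map-of-locks pattern. A generic illustration of the approach (not the exact Hudi code):
// One lock per table base path; computeIfAbsent guarantees a single lock instance per key
private static final Map<String, ReentrantLock> TABLE_LOCKS = new ConcurrentHashMap<>();

static void runWithTableLock(String basePath, Runnable syncAction) {
  ReentrantLock lock = TABLE_LOCKS.computeIfAbsent(basePath, p -> new ReentrantLock());
  lock.lock();
  try {
    syncAction.run();   // e.g. invoke a sync tool against the metastore
  } finally {
    lock.unlock();
  }
}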
Configuration
Key sync configurations:
- hoodie.datasource.meta.sync.enable – Enable/disable syncing
- hoodie.datasource.hive_sync.database – Target database name
- hoodie.datasource.hive_sync.table – Target table name
- hoodie.meta.sync.no_partition_metadata – Skip partition sync for large tables
- hoodie.datasource.meta.sync.classes – Comma-separated list of sync tool class names
Multiple sync tools can run sequentially for the same table, enabling simultaneous updates to multiple catalogs.
Table Services & Maintenance
Relevant Files
- hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCompactor.java
- hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieClusteringJob.java
- hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieSnapshotExporter.java
- hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCleaner.java
Hudi provides three core table maintenance services to optimize storage, performance, and data management: Compaction, Clustering, and Cleaning. These services can run independently or as part of an automated pipeline.
Compaction
Compaction merges log files with base files in Merge-on-Read (MOR) tables, reducing read latency and improving query performance. The HoodieCompactor utility supports three execution modes:
- Schedule: Creates a compaction plan without executing it
- Execute: Runs a previously scheduled compaction plan
- ScheduleAndExecute: Plans and executes compaction in one operation
Compaction strategies determine which file groups are prioritized. Built-in strategies include:
- LogFileSizeBasedCompactionStrategy (default): Prioritizes log files with the most accumulated unmerged data
- BoundedIOCompactionStrategy: Limits total I/O (read + write) per compaction run
- DayBasedCompactionStrategy: Compacts files based on age
- UnBoundedCompactionStrategy: Compacts all eligible files without filtering
java org.apache.hudi.utilities.HoodieCompactor \
--base-path /path/to/table \
--table-name my_table \
--mode scheduleandexecute \
--parallelism 200
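When compaction is driven from the write client instead of the standalone job, the strategy and trigger are set through the compaction config. A sketch, assuming the HoodieCompactionConfig builder (builder method names may vary by release):
// Configure inline compaction on a MOR writer with the default size-based strategy
HoodieCompactionConfig compactionConfig = HoodieCompactionConfig.newBuilder()
    .withInlineCompaction(true)                    // compact synchronously after writes
    .withMaxNumDeltaCommitsBeforeCompaction(5)     // trigger after every 5 delta commits
    .withCompactionStrategy(new LogFileSizeBasedCompactionStrategy())
    .build();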
Clustering
Clustering reorganizes data files to improve query performance by co-locating related records. The HoodieClusteringJob utility supports similar modes as compaction:
- Schedule: Creates a clustering plan
- Execute: Executes a scheduled clustering plan
- ScheduleAndExecute: Plans and executes clustering immediately
- PurgePendingInstant: Cleans up failed clustering operations
Clustering can be triggered based on commit count thresholds and supports retry logic for failed jobs.
java org.apache.hudi.utilities.HoodieClusteringJob \
--base-path /path/to/table \
--table-name my_table \
--mode scheduleandexecute \
--parallelism 4
Cleaning
Cleaning removes obsolete file versions based on retention policies, preventing unbounded storage growth. The HoodieCleaner utility supports three cleaning policies:
- KEEP_LATEST_COMMITS: Retains files from the last N commits (default: 10)
- KEEP_LATEST_FILE_VERSIONS: Keeps only the latest N versions of each file (default: 1)
- KEEP_LATEST_BY_HOURS: Retains files written in the last N hours (default: 24)
Cleaning can run synchronously after each commit or asynchronously in the background.
java org.apache.hudi.utilities.HoodieCleaner \
--target-base-path /path/to/table \
--props /path/to/hudi.properties
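Programmatically, the retention policy maps onto the cleaner config. A sketch, assuming the HoodieCleanConfig builder (in older releases these settings live on HoodieCompactionConfig):
// Keep files referenced by the last 10 commits; clean asynchronously
HoodieCleanConfig cleanConfig = HoodieCleanConfig.newBuilder()
    .withCleanerPolicy(HoodieCleaningPolicy.KEEP_LATEST_COMMITS)
    .retainCommits(10)
    .withAsyncClean(true)
    .build();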
Snapshot Export
The HoodieSnapshotExporter exports table snapshots to external formats (Parquet, ORC, JSON, or Hudi) for analytics or data sharing. It captures the latest committed version of all records and supports optional data transformations.
java org.apache.hudi.utilities.HoodieSnapshotExporter \
--source-base-path /path/to/source/table \
--target-output-path /path/to/export \
--output-format parquet
Multi-Table Services
For managing multiple tables, HoodieMultiTableServicesMain orchestrates compaction, clustering, cleaning, and archival across tables with auto-discovery and configurable parallelism.