apache/hudi

Apache Hudi - Data Lakehouse Platform

Last updated on Dec 18, 2025 (Commit: e081762)

Overview

Relevant Files
  • hudi-common/src/main/java/org/apache/hudi/common/HoodieTableFormat.java
  • hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java
  • hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java
  • hudi-common/src/main/java/org/apache/hudi/common/model/HoodieTableType.java

Apache Hudi is an open data lakehouse platform that provides a high-performance table format for ingesting, storing, and querying data across cloud environments. It combines the benefits of data lakes and data warehouses by offering ACID transactions, efficient upserts, and multiple query types on cloud storage.

Core Architecture

Hudi's architecture is built on several interconnected layers:

HoodieTable is the central abstraction representing a Hudi table. It manages all table operations including reads, writes, and maintenance tasks. The table holds references to critical components: the write configuration, metadata client, index, storage layout, and timeline.

HoodieTableMetaClient provides access to table metadata and the timeline. It tracks all commits, compactions, cleanups, and other actions performed on the table. The timeline is computed lazily and cached for performance.

HoodieTableFormat is an extensible interface allowing external table formats to integrate with Hudi. It provides hooks for lifecycle events like commits, cleanups, and rollbacks, enabling custom metadata management and format-specific behavior.

Timeline tracks the complete history of table changes. Each action (commit, compaction, clean, etc.) creates an instant with a timestamp. The timeline enables point-in-time queries, incremental reads, and change data capture.

Table Types

Hudi supports two table types with different write and read tradeoffs:

Copy-on-Write (COW) tables rewrite entire files on updates. All data is stored in base files (Parquet), providing zero read amplification and excellent query performance. This is the default table type, ideal for read-heavy workloads.

Merge-on-Read (MOR) tables append updates to log files instead of rewriting base files. Compaction asynchronously merges logs into base files. MOR provides faster writes but requires merging during reads, making it suitable for write-heavy streaming scenarios.

Write Operations

Hudi supports multiple write operations:

  • Insert - Add new records to the table
  • Upsert - Insert or update records based on record keys
  • Bulk Insert - Efficiently load large datasets without caching
  • Delete - Remove records by key
  • Insert Overwrite - Replace data in specific partitions
  • Cluster - Reorganize data layout for query optimization
  • Compact - Merge log files into base files (MOR tables only)
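
Through engines such as Spark, these operations are selected via write options. The following is a minimal sketch, assuming an existing SparkSession and an input Dataset<Row> named df with columns id, ts, and date; the hoodie.datasource.write.* keys shown are the standard datasource options.

// Sketch: upsert a DataFrame into a Hudi table through the Spark datasource.
df.write()
  .format("hudi")
  .option("hoodie.table.name", "my_table")
  .option("hoodie.datasource.write.operation", "upsert")         // insert | upsert | bulk_insert | delete | ...
  .option("hoodie.datasource.write.recordkey.field", "id")       // record key used to match updates and deletes
  .option("hoodie.datasource.write.precombine.field", "ts")      // keeps the latest record per key when deduplicating
  .option("hoodie.datasource.write.partitionpath.field", "date")
  .mode("append")
  .save("/path/to/table");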

Key Components

FileSystemViewManager provides views of the table's file system state, supporting both base-file-only and full slice views for efficient query planning.

HoodieIndex maintains record-level indexes using bloom filters and column statistics to accelerate lookups during upserts and deletes.

HoodieStorageLayout defines how files are organized within the table. Supported layouts include DEFAULT (unordered) and BUCKET (hash-based partitioning).

HoodieTableMetadata maintains auxiliary metadata partitions for file listings, column statistics, and bloom filters, enabling efficient query planning without full file system scans.

Query Types

Hudi supports multiple query modes on a single table:

  • Snapshot Query - Latest committed state with index acceleration
  • Incremental Query - Changes since a point in time
  • Time-Travel Query - Table state at a specific instant
  • Change Data Capture - Stream of inserts, updates, and deletes
  • Read-Optimized Query - Pure columnar reads (MOR tables only)

Architecture & Core Components

Relevant Files
  • hudi-common/src/main/java/org/apache/hudi/common/table/HoodieTableMetaClient.java
  • hudi-client/hudi-client-common/src/main/java/org/apache/hudi/client/BaseHoodieWriteClient.java
  • hudi-client/hudi-client-common/src/main/java/org/apache/hudi/index/HoodieIndex.java
  • hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieTimeline.java
  • hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/HoodieTable.java

Apache Hudi's architecture is built on four core layers that work together to provide a transactional data lake platform. Understanding these layers is essential for working with the codebase.

Core Components

HoodieTableMetaClient is the entry point for accessing table metadata. It manages the .hoodie directory structure, loads table configuration, and provides access to the timeline. All timelines are computed lazily and cached, so you must call reload() to refresh them after mutations.

HoodieTimeline represents a view of metadata instants (specific points in time) in the table. It is immutable and supports filtering operations that chain together. The timeline tracks all actions: commits, delta commits, compactions, cleanups, rollbacks, savepoints, and more. Each action has three states: requested, inflight, and completed.
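
For example, a meta client is typically built from the table base path and then used to obtain and filter the timeline. This is a minimal sketch, assuming a Hadoop Configuration named hadoopConf; the exact argument accepted by setConf() differs across Hudi versions (recent releases wrap it in a storage configuration).

// Sketch: open a table's metadata and list completed commits.
HoodieTableMetaClient metaClient = HoodieTableMetaClient.builder()
    .setConf(new HadoopStorageConfiguration(hadoopConf))   // older releases take the Hadoop Configuration directly
    .setBasePath("/path/to/table")
    .build();

HoodieTimeline completedCommits = metaClient.getActiveTimeline()
    .getCommitsTimeline()            // commits, delta commits, replace commits
    .filterCompletedInstants();      // each filter returns a new, immutable timeline view

metaClient.reloadActiveTimeline();   // timelines are cached; reload after the table changes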

HoodieTable is the abstraction layer for table operations. It encapsulates write configuration, metadata access, and file system views. Engine-specific implementations (Spark, Flink, Java) extend this class to provide concrete write, read, and compaction operations. HoodieTable holds references to the metaClient, index, and metadata writer.

BaseHoodieWriteClient orchestrates write operations and manages the commit lifecycle. It creates HoodieTable instances, manages indexes, and handles commit metadata. The client is responsible for starting commits, writing data, updating indexes, and finalizing transactions.

Index Layer

HoodieIndex is an abstract base class for different indexing strategies. It provides two core operations: tagLocation() finds existing records by looking up their file locations, and updateLocation() updates the index after writes. Multiple implementations exist (Bloom, Bucket, Simple, In-Memory Hash) to support different use cases and performance characteristics.
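
The index implementation is selected through the write configuration rather than instantiated directly. A hedged sketch using the configuration builders (builder method names may vary slightly between releases):

// Sketch: configure a write client to use the Bloom index.
HoodieWriteConfig writeConfig = HoodieWriteConfig.newBuilder()
    .withPath("/path/to/table")
    .withIndexConfig(HoodieIndexConfig.newBuilder()
        .withIndexType(HoodieIndex.IndexType.BLOOM)   // other values include SIMPLE, BUCKET, INMEMORY
        .build())
    .build();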

Timeline Management

The timeline uses a state machine for each instant: requested → inflight → completed. This ensures atomicity and allows recovery from failures. The timeline layout is versioned to support schema evolution. Active timelines track in-progress operations, while archived timelines store historical data for efficient queries.

Table Types

Hudi supports two table types, each with different write and read tradeoffs:

  1. Copy-on-Write (CoW): Synchronous compaction during writes. Faster reads, slower writes.
  2. Merge-on-Read (MoR): Asynchronous compaction. Faster writes, slower reads with log file merging.

Both types are implemented as concrete HoodieTable subclasses that override abstract methods for upsert, insert, compact, and clean operations.

Table Types & Storage Format

Relevant Files
  • hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/HoodieSparkCopyOnWriteTable.java
  • hudi-client/hudi-spark-client/src/main/java/org/apache/hudi/table/HoodieSparkMergeOnReadTable.java
  • hudi-common/src/main/java/org/apache/hudi/common/table/log/HoodieLogFormat.java
  • hudi-common/src/main/java/org/apache/hudi/common/model/HoodieLogFile.java
  • hudi-common/src/main/java/org/apache/hudi/common/model/HoodieFileFormat.java

Hudi supports two table types with fundamentally different write and read tradeoffs. Each type is implemented as a concrete subclass of HoodieTable that overrides methods for upsert, insert, delete, and compaction operations.

Copy-on-Write (CoW) Tables

CoW tables prioritize read performance by maintaining all data in base files. When records are updated or deleted, Hudi rewrites the entire base file with the new values. This eliminates read amplification since queries always read from base files directly.

Storage Structure:

  • Base files only (Parquet or ORC format)
  • File naming: {fileId}_{writeToken}_{instantTime}.parquet
  • Updates trigger immediate file rewrites

Write Operations:

  • Inserts: Create new base files or merge with the smallest existing file
  • Updates: Rewrite the entire base file with updated records
  • Deletes: Rewrite the base file excluding deleted records

Characteristics:

  • Zero read amplification
  • Synchronous compaction during writes
  • Slower writes, faster reads
  • Default table type, ideal for read-heavy workloads

Merge-on-Read (MoR) Tables

MoR tables optimize write performance by deferring merges until compaction. Updates append to log files instead of rewriting base files. During reads, Hudi merges log files with base files on-the-fly, providing faster writes at the cost of read latency.

Storage Structure:

  • Base files (Parquet/ORC) plus log files
  • Log file naming: .{fileId}_{instantTime}.log.{version}_{writeToken}
  • Log files use Avro binary format with block-based structure
  • Default log file size threshold: 512 MB

Write Operations:

  • Inserts: Same as CoW (create new base files)
  • Updates: Append to rolling log files per file ID
  • Deletes: Write delete markers to log files
  • Compaction: Asynchronously merges log files into base files

Log File Format: Log files consist of blocks separated by magic markers (#HUDI#). Each block can be:

  • Data Block: Records serialized in Avro binary format
  • Delete Block: List of keys to delete (tombstones)
  • Command Block: Rollback or other control commands

Characteristics:

  • Higher write throughput
  • Read amplification due to log merging
  • Asynchronous compaction
  • Ideal for write-heavy streaming scenarios

File Organization

Both table types organize data by partitions. Within each partition:

partition_path/
├── base_file_1.parquet
├── base_file_2.parquet
├── .file_id_1_timestamp.log.1_token
├── .file_id_1_timestamp.log.2_token
└── .file_id_2_timestamp.log.1_token

File Naming Convention:

  • Base files: {fileId}_{writeToken}_{instantTime}.{extension}
  • Log files: .{fileId}_{instantTime}.log.{version}_{writeToken}
  • Write tokens track concurrent writes to prevent conflicts

Choosing a Table Type

Aspect        | CoW                  | MoR
------------- | -------------------- | ---------------------
Read Latency  | Low                  | Higher (log merging)
Write Latency | Higher               | Low
Storage       | Base files only      | Base + log files
Compaction    | Synchronous          | Asynchronous
Use Case      | Analytics, reporting | Real-time streaming

CoW is the default and recommended for most use cases. Choose MoR when write throughput is critical and you can tolerate read latency.
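
The table type is fixed when the table is first created and recorded in the table's hoodie.properties. The sketch below is illustrative only; the builder entry point has been renamed across versions (withPropertyBuilder() in older releases, newTableBuilder() in newer ones), so treat the exact method names as assumptions.

// Sketch: initialize a Merge-on-Read table at the given base path.
// Assumes hadoopConf is a Hadoop Configuration wrapped as required by the Hudi version in use.
HoodieTableMetaClient.newTableBuilder()
    .setTableType(HoodieTableType.MERGE_ON_READ)   // or COPY_ON_WRITE (the default)
    .setTableName("my_table")
    .setRecordKeyFields("id")
    .setPartitionFields("date")
    .initTable(new HadoopStorageConfiguration(hadoopConf), "/path/to/table");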

Timeline & Metadata Management

Relevant Files
  • hudi-common/src/main/java/org/apache/hudi/common/table/timeline/HoodieInstant.java
  • hudi-common/src/main/java/org/apache/hudi/common/table/timeline/versioning/v2/ActiveTimelineV2.java
  • hudi-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadata.java
  • hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java

Timeline Architecture

The timeline is Hudi's core mechanism for tracking all table operations and maintaining consistency. Every action (commit, compaction, clean, rollback) is represented as an instant with a unique timestamp and state.

Instant States

Each instant progresses through three states:

  1. REQUESTED – Action is requested but not yet started (used for compactions, cleanups)
  2. INFLIGHT – Action is in progress; metadata is being written
  3. COMPLETED – Action finished successfully; data is committed

State transitions follow a strict lifecycle: REQUESTED → INFLIGHT → COMPLETED. This ensures atomicity and allows recovery from failures.

Key Timeline Operations

// Create a new pending instant
timeline.createNewInstant(instant);

// Transition from REQUESTED to INFLIGHT
timeline.transitionRequestedToInflight(requestedInstant, metadata);

// Mark as complete (INFLIGHT to COMPLETED)
timeline.saveAsComplete(inflightInstant, metadata);

// Revert completed instant back to inflight
timeline.revertToInflight(completedInstant);

Metadata Table Structure

The metadata table is an internal Hudi table that maintains auxiliary indexes and statistics about the data table. It enables efficient query planning without full filesystem scans.

Metadata Partitions

The metadata table is organized into specialized partitions:

  • files – Lists all files in each data partition; enables fast file discovery
  • column_stats – Min/max values and null counts per column; enables predicate pushdown
  • bloom_filters – Bloom filter indexes for record-level filtering
  • record_index – Maps record keys to file locations; accelerates upserts and lookups
  • partition_stats – Statistics aggregated at partition level
  • expr_index – Expression-based indexes for custom predicates
  • secondary_index – Indexes on non-primary key columns
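
Apart from the files partition, most of these partitions are opt-in via the metadata configuration. A hedged sketch (builder method names are assumptions and may differ by release):

// Sketch: enable the metadata table plus the column stats and bloom filter partitions.
HoodieMetadataConfig metadataConfig = HoodieMetadataConfig.newBuilder()
    .enable(true)                          // hoodie.metadata.enable
    .withMetadataIndexColumnStats(true)    // hoodie.metadata.index.column.stats.enable
    .withMetadataIndexBloomFilter(true)    // hoodie.metadata.index.bloom.filter.enable
    .build();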

Metadata Initialization and Sync

The metadata table is lazily initialized when first accessed. It maintains synchronization with the data table through:

  • Valid instant timestamps – Only instants with corresponding completed data commits are read
  • File slice caching – Latest merged file slices are cached per partition for performance
  • Reusable readers – File handles can be reused across multiple lookups to reduce I/O

// Initialize metadata table
HoodieBackedTableMetadata metadata = new HoodieBackedTableMetadata(
    engineContext, storage, metadataConfig, datasetBasePath);

// Reset and reload from storage
metadata.reset();

// Get synced instant time
Option<String> syncedTime = metadata.getSyncedInstantTime();

Timeline and Metadata Integration

The timeline and metadata work together to provide consistency:

  1. Write operations create instants and update the timeline
  2. Metadata table is updated asynchronously with new file and statistics records
  3. Readers query the timeline to find valid instants, then use metadata for optimization
  4. Rollbacks revert both timeline state and metadata records to maintain consistency

Spark Integration & Datasource

Relevant Files
  • hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/DefaultSource.scala
  • hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala
  • hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/BaseFileOnlyRelation.scala
  • hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/MergeOnReadSnapshotRelation.scala
  • hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieSparkSqlWriter.scala
  • hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieStreamingSink.scala

Hudi integrates with Apache Spark through a comprehensive datasource provider that enables reading and writing Hudi tables using Spark SQL and DataFrames. The integration supports multiple query modes, table types, and streaming operations.

Architecture Overview

The Spark datasource is built on Spark's DataSourceRegister API. The DefaultSource class implements multiple provider interfaces to support different operations:

  • RelationProvider & SchemaRelationProvider for reading tables
  • CreatableRelationProvider for writing DataFrames
  • StreamSinkProvider & StreamSourceProvider for streaming operations

Query Types

Hudi supports three primary query modes, each optimized for different use cases:

  1. Snapshot Query (default): Returns the latest view of data by merging base files with log files. Provides complete, up-to-date records. Used for both COW and MOR tables.

  2. Read-Optimized Query: Reads only base files, skipping log files. Faster but may miss recent updates in MOR tables. Ideal for analytical workloads where slight staleness is acceptable.

  3. Incremental Query: Fetches only records changed since a specified instant time. Supports Change Data Capture (CDC) format to track inserts, updates, and deletes. Essential for streaming pipelines and change propagation.
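
A hedged sketch of issuing these query types through the datasource, assuming an existing SparkSession named spark; the option keys are the standard hoodie.datasource.* read options, and the instant time value is illustrative.

// Snapshot query (default): latest committed view of the table.
Dataset<Row> snapshot = spark.read().format("hudi").load("/path/to/table");

// Incremental query: only records changed after the given instant time.
Dataset<Row> changes = spark.read().format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20240101000000")
    .load("/path/to/table");

// Read-optimized query (MOR only): base files only, no log merging.
Dataset<Row> readOptimized = spark.read().format("hudi")
    .option("hoodie.datasource.query.type", "read_optimized")
    .load("/path/to/table");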

File Index & Partition Pruning

The HoodieFileIndex class handles intelligent file discovery and partition pruning:

  • Loads all files and partition values from the table path
  • Applies partition filters to reduce scanned partitions
  • Leverages metadata table indices (Record-Level Index, Partition Bucket Index) for data skipping
  • Converts partition filters for custom key generators
  • Returns PartitionDirectory objects mapping partitions to file slices

This optimization significantly reduces I/O for queries with partition predicates.

Reading Relations

Different relation classes handle different query scenarios:

  • BaseFileOnlyRelation: Reads only base files (COW snapshot, MOR read-optimized)
  • MergeOnReadSnapshotRelation: Merges base files with log files for MOR snapshot queries
  • IncrementalRelation: Handles incremental and CDC queries by tracking timeline changes

Each relation composes RDDs with appropriate readers and applies filters at the file and record level.

Writing & Streaming

The HoodieSparkSqlWriter handles all write operations through a unified interface:

  • Supports batch writes (insert, upsert, delete, bulk insert)
  • Manages table creation and schema evolution
  • Handles deduplication and conflict resolution
  • Supports async compaction and clustering

For streaming, HoodieStreamingSink processes micro-batches:

  • Integrates with Spark Structured Streaming
  • Manages checkpoint state for exactly-once semantics
  • Triggers async compaction and clustering services
  • Handles batch deduplication and idempotency
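
A minimal sketch of wiring a Structured Streaming query into the Hudi sink, assuming a streaming Dataset<Row> named streamingDf; checkpointLocation here is Spark's streaming checkpoint directory, separate from Hudi's timeline.

// Sketch: continuous upserts from a streaming DataFrame into a Hudi table.
StreamingQuery query = streamingDf.writeStream()
    .format("hudi")
    .option("hoodie.table.name", "my_table")
    .option("hoodie.datasource.write.operation", "upsert")
    .option("hoodie.datasource.write.recordkey.field", "id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("checkpointLocation", "/path/to/spark/checkpoints")
    .outputMode("append")
    .start("/path/to/table");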

Version Compatibility

The datasource is split across multiple modules to support different Spark versions:

  • hudi-spark-common: Shared code for Spark 3.x and 4.x
  • hudi-spark3-common: Spark 3.x specific implementations
  • hudi-spark3.3.x, hudi-spark3.4.x, hudi-spark3.5.x: Version-specific adapters
  • hudi-spark4-common, hudi-spark4.0.x: Spark 4.x support

Each version has a SparkAdapter that abstracts API differences, enabling a single codebase to work across multiple Spark releases.

Flink Integration & Streaming

Relevant Files
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableSource.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/table/HoodieTableFactory.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteFunction.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/StreamWriteOperatorCoordinator.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/sink/utils/Pipelines.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/util/HoodiePipeline.java
  • hudi-flink-datasource/hudi-flink/src/main/java/org/apache/hudi/streamer/HoodieFlinkStreamer.java

Hudi provides first-class support for Apache Flink streaming through a comprehensive datasource and sink implementation. This integration enables real-time data ingestion, transformation, and querying with exactly-once semantics and support for multiple write modes.

Core Components

HoodieTableSource implements Flink's ScanTableSource interface, supporting both bounded and streaming reads. It handles snapshot queries, incremental reads, and CDC (Change Data Capture) modes. The source supports projection pushdown, filter pushdown, and limit pushdown for query optimization.

HoodieTableFactory creates both sources and sinks from Flink SQL table definitions. It automatically infers table configuration from the Hudi table metadata and validates schema compatibility.

StreamWriteFunction buffers incoming records in a binary in-memory sort buffer and flushes them when batch size thresholds are exceeded or checkpoints trigger. It implements exactly-once semantics by coordinating with the operator coordinator.

StreamWriteOperatorCoordinator manages the write lifecycle across all parallel tasks. It creates new instants on checkpoint boundaries, commits metadata, and schedules compaction or clustering operations.

Write Modes

Hudi Flink supports three primary write modes:

  1. Upsert Mode (default): Merges incoming records with existing data using primary keys. Requires bootstrap to load the index. Supports both Copy-on-Write (COW) and Merge-on-Read (MOR) tables.

  2. Append Mode: Writes records directly to base files without index lookups. Faster for insert-only workloads. Disables compaction and supports optional async clustering.

  3. Bulk Insert: Optimized for batch operations. Shuffles data by partition path, sorts within partitions, and writes in a single pass. Requires batch execution mode.

Pipeline Construction

The Pipelines utility class orchestrates complex streaming pipelines. For upsert operations, the pipeline chains: bootstrap → bucket assignment → stream write → compaction/clustering → clean. For append mode, it skips bootstrap and bucket assignment, writing directly to files.

The HoodiePipeline builder provides a fluent API for programmatic pipeline construction:

HoodiePipeline.Builder builder = HoodiePipeline.builder("myTable");
DataStreamSink<?> sink = builder
    .column("id int")
    .column("name string")
    .pk("id")
    .partition("date")
    .sink(dataStream, false);

Streaming Configuration

Key configuration options control streaming behavior:

  • write.tasks: Number of parallel write tasks
  • write.batch.size: Records per batch before flush
  • compaction.schedule.enabled: Enable async compaction
  • clustering.schedule.enabled: Enable async clustering
  • read.as.streaming: Enable streaming reads with monitoring
  • operation: Write mode (UPSERT, INSERT, BULK_INSERT, INSERT_OVERWRITE)
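
These options can be supplied in a Flink SQL WITH clause or programmatically on the HoodiePipeline builder, as in the hedged sketch below; the keys come from the list above, the values are illustrative, and the path option for the table base path is an assumption.

// Sketch: pass streaming write options to the pipeline builder via an options map.
Map<String, String> options = new HashMap<>();
options.put("path", "/path/to/table");               // table base path (assumed key)
options.put("write.tasks", "4");
options.put("write.batch.size", "256");              // MB buffered per task before flushing
options.put("compaction.schedule.enabled", "true");

HoodiePipeline.Builder builder = HoodiePipeline.builder("myTable")
    .column("id int")
    .column("name string")
    .pk("id")
    .partition("date")
    .options(options);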

Exactly-Once Semantics

Hudi achieves exactly-once semantics through checkpoint coordination. The operator coordinator creates a new instant before each checkpoint, and tasks buffer data until the checkpoint completes. Failed writes are rolled back, and the coordinator retries on recovery. One Hudi instant spans exactly one checkpoint interval.

Streaming Reads

Streaming reads use StreamReadMonitoringFunction to detect new commits and StreamReadOperator to read file splits. They support snapshot queries, incremental reads from specific commits, and CDC mode for capturing all changes, including deletes.

Data Ingestion & Streaming

Relevant Files
  • hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/HoodieStreamer.java
  • hudi-utilities/src/main/java/org/apache/hudi/utilities/streamer/StreamSync.java
  • hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/Source.java
  • hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/KafkaSource.java
  • hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/JsonKafkaSource.java
  • hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/AvroKafkaSource.java
  • hudi-kafka-connect/src/main/java/org/apache/hudi/kafka/connect

Hudi provides multiple mechanisms for ingesting data from streaming and batch sources. The primary entry point is HoodieStreamer, which orchestrates the entire ingestion pipeline in continuous or batch mode.

HoodieStreamer: Main Orchestrator

HoodieStreamer is the primary utility for incremental data ingestion. It operates in two modes:

  1. Continuous Mode: Runs in a loop, repeatedly pulling data, writing to Hudi, and scheduling compactions
  2. Bootstrap Mode: Performs initial bulk load of historical data before switching to continuous ingestion

The streamer manages the complete lifecycle: source reading, schema resolution, write operations, compaction scheduling, and metadata synchronization (Hive, Iceberg, etc.).

Source Abstraction

All data sources extend the Source<T> abstract class. The key method is fetchNext(), which returns an InputBatch containing:

  • Batch data: JavaRDD or Dataset in various formats (JSON, Avro, Protobuf, Row)
  • Checkpoint: Tracks progress for resumable ingestion
  • SchemaProvider: Supplies schema for data transformation

Kafka Integration

Hudi supports multiple Kafka source implementations:

JsonKafkaSource: Reads JSON-serialized messages from Kafka topics. Optionally appends Kafka metadata (offset, partition, timestamp) to records.

AvroKafkaSource: Reads Avro-serialized messages. Supports schema registry integration for schema evolution.

ProtoKafkaSource: Reads Protocol Buffer messages with automatic conversion to Avro.

All Kafka sources use KafkaOffsetGen to manage offset ranges and checkpoint state, enabling exactly-once semantics through offset tracking.

Kafka Connect Sink

The Hudi Kafka Connect Sink provides a distributed connector for Kafka Connect clusters. It:

  • Writes Kafka records directly to Hudi tables
  • Coordinates transactions across multiple tasks via a control topic
  • Supports both Copy-On-Write and Merge-On-Read table types
  • Schedules async compaction and clustering as needed

StreamSync: Write Pipeline

StreamSync handles the actual write operations:

  1. Reads from source via checkpoint-based resumption
  2. Applies schema transformations and sanitization
  3. Writes records using SparkRDDWriteClient
  4. Manages compaction scheduling for MOR tables
  5. Syncs metadata to external systems (Hive, Iceberg)

Checkpoint Management

Hudi uses two checkpoint versions:

  • V1: Request-time based (legacy)
  • V2: Completion-time based (recommended for Hudi 1.0+)

Checkpoints enable resumable ingestion after failures, tracking the last successfully processed batch.

Configuration

Key configurations for ingestion:

# Source selection
hoodie.streamer.source.class=org.apache.hudi.utilities.sources.JsonKafkaSource

# Kafka-specific
hoodie.kafka.bootstrap.servers=localhost:9092
hoodie.kafka.topic=my-topic

# Schema provider
hoodie.streamer.schemaprovider.class=org.apache.hudi.utilities.schema.SchemaRegistryProvider
hoodie.streamer.schemaprovider.registry.url=http://localhost:8081

# Write options
hoodie.datasource.write.recordkey.field=id
hoodie.datasource.write.partitionpath.field=date

Error Handling

Sources implement resilience through:

  • HoodieRetryingKafkaConsumer: Retries failed Kafka reads
  • Checkpoint recovery: Resumes from last successful batch
  • Error tables: Optional table for tracking failed records

Metadata Sync & Catalog Integration

Relevant Files
  • hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/HoodieSyncTool.java
  • hudi-sync/hudi-hive-sync/src/main/java/org/apache/hudi/hive/HiveSyncTool.java
  • hudi-sync/hudi-adb-sync/src/main/java/org/apache/hudi/sync/adb/AdbSyncTool.java
  • hudi-sync/hudi-datahub-sync/src/main/java/org/apache/hudi/sync/datahub/DataHubSyncTool.java
  • hudi-sync/hudi-sync-common/src/main/java/org/apache/hudi/sync/common/util/SyncUtilHelpers.java

Overview

Hudi's metadata sync system bridges the gap between Hudi tables and external data catalogs and metastores. After data is written to a Hudi table, the sync layer automatically propagates schema, partition, and table metadata to external systems like Hive, Alibaba Cloud AnalyticDB (ADB), AWS Glue, and DataHub. This enables seamless querying through external tools while maintaining a single source of truth in Hudi.

Architecture

The sync system follows a pluggable, extensible design.

Core Components

HoodieSyncTool is the abstract base class that all sync implementations extend. It provides:

  • Configuration management via HoodieSyncConfig
  • Metrics collection for sync operations
  • Hadoop configuration and filesystem access
  • Abstract syncHoodieTable() method for subclasses to implement

HoodieSyncClient is the common client interface that handles:

  • Schema extraction from Parquet files
  • Partition discovery and change tracking
  • Timeline management to identify new/modified partitions
  • Partition value extraction using configurable extractors

Sync Implementations

HiveSyncTool syncs to Apache Hive metastore. For Copy-on-Write tables, it creates a single snapshot table. For Merge-on-Read tables, it can create read-optimized (_ro) and real-time (_rt) views based on the configured strategy. It handles schema evolution, partition updates, and table recreation on errors.

AdbSyncTool targets Alibaba Cloud AnalyticDB with similar partition and schema sync logic, but uses ADB-specific APIs and configuration.

DataHubSyncTool syncs metadata to the DataHub data governance platform via REST APIs, including schema metadata and table properties.

Execution Flow

Sync is triggered after successful commits in multiple contexts:

  1. Spark DataSource: HoodieSparkSqlWriter calls SyncUtilHelpers.runHoodieMetaSync() after writes
  2. Flink Streaming: StreamWriteOperatorCoordinator executes sync asynchronously after checkpoints
  3. DeltaStreamer: StreamSync runs sync after each batch write
  4. Kafka Connect: KafkaConnectTransactionServices syncs after transaction completion

Concurrency & Locking

SyncUtilHelpers maintains per-table locks using ReentrantLock to prevent concurrent modifications to the same metastore. This avoids ConcurrentModificationException errors when multiple writers target the same table. Each table (identified by base path) gets its own lock, allowing parallel syncs for different tables.

Configuration

Key sync configurations:

  • hoodie.datasource.meta.sync.enable – Enable/disable syncing
  • hoodie.datasource.hive_sync.database – Target database name
  • hoodie.datasource.hive_sync.table – Target table name
  • hoodie.meta.sync.no_partition_metadata – Skip partition sync for large tables
  • hoodie.datasource.meta.sync.classes – Comma-separated list of sync tool class names

Multiple sync tools can run sequentially for the same table, enabling simultaneous updates to multiple catalogs.
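
Sync tools can also be invoked programmatically. The sketch below is illustrative only: the HiveSyncTool constructor and the exact property keys have shifted between releases, so treat the base-path key and the try-with-resources usage as assumptions.

// Sketch: run Hive metastore sync for an existing Hudi table.
// Assumes hadoopConf is a Hadoop Configuration that can reach the target metastore.
Properties props = new Properties();
props.setProperty("hoodie.datasource.hive_sync.database", "analytics");
props.setProperty("hoodie.datasource.hive_sync.table", "my_table");
props.setProperty("hoodie.datasource.hive_sync.mode", "hms");                  // sync through the Hive metastore client
props.setProperty("hoodie.datasource.meta.sync.base.path", "/path/to/table");  // assumed key for the table base path

try (HiveSyncTool syncTool = new HiveSyncTool(props, hadoopConf)) {
    syncTool.syncHoodieTable();
}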

Table Services & Maintenance

Relevant Files
  • hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCompactor.java
  • hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieClusteringJob.java
  • hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieSnapshotExporter.java
  • hudi-utilities/src/main/java/org/apache/hudi/utilities/HoodieCleaner.java

Hudi provides three core table maintenance services to optimize storage, performance, and data management: Compaction, Clustering, and Cleaning. These services can run independently or as part of an automated pipeline.

Compaction

Compaction merges log files with base files in Merge-on-Read (MOR) tables, reducing read latency and improving query performance. The HoodieCompactor utility supports three execution modes:

  • Schedule: Creates a compaction plan without executing it
  • Execute: Runs a previously scheduled compaction plan
  • ScheduleAndExecute: Plans and executes compaction in one operation

Compaction strategies determine which file groups are prioritized. Built-in strategies include:

  • LogFileSizeBasedCompactionStrategy (default): Prioritizes log files with the most accumulated unmerged data
  • BoundedIOCompactionStrategy: Limits total I/O (read + write) per compaction run
  • DayBasedCompactionStrategy: Compacts files based on age
  • UnBoundedCompactionStrategy: Compacts all eligible files without filtering

java org.apache.hudi.utilities.HoodieCompactor \
  --base-path /path/to/table \
  --table-name my_table \
  --mode scheduleandexecute \
  --parallelism 200

Clustering

Clustering reorganizes data files to improve query performance by co-locating related records. The HoodieClusteringJob utility supports similar modes as compaction:

  • Schedule: Creates a clustering plan
  • Execute: Executes a scheduled clustering plan
  • ScheduleAndExecute: Plans and executes clustering immediately
  • PurgePendingInstant: Cleans up failed clustering operations

Clustering can be triggered based on commit count thresholds and supports retry logic for failed jobs.

java org.apache.hudi.utilities.HoodieClusteringJob \
  --base-path /path/to/table \
  --table-name my_table \
  --mode scheduleandexecute \
  --parallelism 4
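
Clustering can also run inline as part of regular writes instead of through the standalone job. A hedged sketch of the relevant write options on a Spark batch write (the hoodie.clustering.* keys are shown with illustrative values; the sort-columns key is an assumption):

// Sketch: schedule and execute clustering inline every 4 commits during normal writes.
df.write()
  .format("hudi")
  .option("hoodie.table.name", "my_table")
  .option("hoodie.datasource.write.operation", "insert")
  .option("hoodie.clustering.inline", "true")
  .option("hoodie.clustering.inline.max.commits", "4")
  .option("hoodie.clustering.plan.strategy.sort.columns", "id")   // assumed key for sort-based layout
  .mode("append")
  .save("/path/to/table");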

Cleaning

Cleaning removes obsolete file versions based on retention policies, preventing unbounded storage growth. The HoodieCleaner utility supports three cleaning policies:

  • KEEP_LATEST_COMMITS: Retains files from the last N commits (default: 10)
  • KEEP_LATEST_FILE_VERSIONS: Keeps only the latest N versions of each file (default: 1)
  • KEEP_LATEST_BY_HOURS: Retains files written in the last N hours (default: 24)

Cleaning can run synchronously after each commit or asynchronously in the background.

java org.apache.hudi.utilities.HoodieCleaner \
  --target-base-path /path/to/table \
  --props /path/to/hudi.properties

Snapshot Export

The HoodieSnapshotExporter exports table snapshots to external formats (Parquet, ORC, JSON, or Hudi) for analytics or data sharing. It captures the latest committed version of all records and supports optional data transformations.

java org.apache.hudi.utilities.HoodieSnapshotExporter \
  --source-base-path /path/to/source/table \
  --target-output-path /path/to/export \
  --output-format parquet

Multi-Table Services

For managing multiple tables, HoodieMultiTableServicesMain orchestrates compaction, clustering, cleaning, and archival across tables with auto-discovery and configurable parallelism.