RFC-99: Hudi Type System #14253
Replies: 3 comments 6 replies
+1 for Option 1. A type system is basic infrastructure for a storage layer; it's better that we keep it format-agnostic and avoid dependencies on other projects. We already see issues like compatibility and efficiency with the current Avro schema, so this will pay off in the long run. Also, do you think we should reference the SQL standard for these type naming conventions: https://blog.ansi.org/ansi/sql-standard-iso-iec-9075-2023-ansi-x3-135/? That would make integration friendlier for SQL, which is the major user API based on the investigation here: #13894
+1 for Option 1. I shared the same concern in RFC-88 here: #12795 (comment). It's essential for reliable interoperability and long-term data correctness.

The core problem: a "many-to-many" mapping nightmare. Without a standardized table-level type system, every component in the stack is forced into a "many-to-many" mapping problem:
- Compute engines (C): Flink, Trino, Spark, Hive, etc. all have their own internal logical type representations.
- File formats (F): Parquet, ORC, and Avro all have their own physical types.
Going with Option 1 allows for a strict contract with a well-defined "hub-and-spoke" abstraction: the type system acts as the central logical hub, and each engine or file format only converts to and from it. It breaks the C × F pairwise mapping problem down into a much simpler C + F problem.
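The hub-and-spoke contract described above can be sketched in code. This is purely illustrative; none of these names (`LogicalType`, `EngineTypeConverter`, `engineToFormat`, the `SPARK`/`PARQUET` converters) are actual Hudi APIs, and real engine/format types are stood in by strings:

```java
// Illustrative hub-and-spoke sketch; not an actual Hudi API. Each engine and
// each file format implements a single converter against the central logical
// type, so N engines and M formats need N + M converters instead of N * M
// pairwise mappings.
public class HubAndSpokeSketch {

    // The central logical "hub": a table-level type independent of any
    // engine or file format.
    public enum LogicalType { INT, LONG, STRING, TIMESTAMP_MICROS, BINARY }

    // One spoke per compute engine.
    public interface EngineTypeConverter<E> {
        E fromLogical(LogicalType type);
        LogicalType toLogical(E engineType);
    }

    // One spoke per file format.
    public interface FormatTypeConverter<F> {
        F fromLogical(LogicalType type);
        LogicalType toLogical(F physicalType);
    }

    // Engine-to-format conversion composes through the hub; no direct
    // engine-to-format mapping is ever written.
    public static <E, F> F engineToFormat(E engineType,
                                          EngineTypeConverter<E> engine,
                                          FormatTypeConverter<F> format) {
        return format.fromLogical(engine.toLogical(engineType));
    }

    // Toy converters using prefixed strings as stand-ins for real types.
    public static final EngineTypeConverter<String> SPARK = new EngineTypeConverter<>() {
        public String fromLogical(LogicalType t) { return "spark:" + t; }
        public LogicalType toLogical(String e) { return LogicalType.valueOf(e.substring(6)); }
    };

    public static final FormatTypeConverter<String> PARQUET = new FormatTypeConverter<>() {
        public String fromLogical(LogicalType t) { return "parquet:" + t; }
        public LogicalType toLogical(String f) { return LogicalType.valueOf(f.substring(8)); }
    };

    public static void main(String[] args) {
        // A "Spark" type reaches "Parquet" only through the logical hub.
        System.out.println(engineToFormat("spark:INT", SPARK, PARQUET));
    }
}
```

Adding a new engine here means writing one converter against the hub, never one per file format, which is the core of the C + F argument.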
Option 2 feels fragile: we would just be replacing Avro with another borrowed type system without really solving the underlying problem. Option 1 is the only reliable way to achieve the interoperability and long-term correctness goals above.
See the board for tracking type system subtasks (#14263) for further discussion.
Background
Wanted to start a discussion around @balaji-varadarajan-ai's proposal in RFC-99 (#13743) to introduce a native type system within Apache Hudi, and around what an initial first step toward an MVP would look like.

For Hudi 1.2.0 we want to let users in the AI/ML space define some way of representing "BLOB-like" content (encompassing the binary content of an image, video, or audio file), and to store vector embeddings for these pieces of data in order to perform similarity search (for more details see RFC-102: #14218). We will need to ensure our type system can account for those. We may also need to specify some granularity around how large this binary content may be, as well as the dimensions of the vectors (@balaji-varadarajan-ai's RFC captured these details as well).
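As a thought experiment, the two AI/ML shapes mentioned above could be parameterized types. This is a hypothetical sketch only: `BlobType`, `VectorType`, and `ElementType` are illustrative names, not types from RFC-99 or any Hudi API.

```java
// Hypothetical sketch of parameterized AI/ML types; not actual Hudi types.
// A blob carries an optional size bound; a vector carries its element type
// and dimension so similarity search can validate inputs up front.
public class AiMlTypesSketch {

    public enum ElementType { FLOAT32, FLOAT64 }

    // Binary content of an image, video, or audio file, with an upper bound
    // on its size in bytes (values <= 0 meaning unbounded).
    public record BlobType(long maxSizeBytes) {
        public boolean bounded() { return maxSizeBytes > 0; }
    }

    // A fixed-dimension embedding vector.
    public record VectorType(ElementType elementType, int dimension) {
        public VectorType {
            if (dimension <= 0) {
                throw new IllegalArgumentException("dimension must be positive");
            }
        }
    }

    public static void main(String[] args) {
        BlobType image = new BlobType(64L * 1024 * 1024);            // 64 MiB bound
        VectorType embedding = new VectorType(ElementType.FLOAT32, 768);
        System.out.println(image.bounded() + " " + embedding.dimension());
    }
}
```

Making the size bound and dimension part of the type (rather than per-row metadata) is one way engines could plan storage and index layout before seeing any data.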
What types to start with from RFC 99?
From @balaji-varadarajan-ai's RFC, I think the following initial types would be the first step to support, covering both structured and unstructured use cases within a Hudi table. If you feel we should add another type for the initial MVP, or cut this list down, feel free to leave a comment.
Primitive Types
AI/ML types
Nested Types
Temporal types
What should the type system be backed by?
Option 1
Looking at other table format projects in the space, such as Apache Iceberg, the approach they take is to define a native type system (https://iceberg.apache.org/spec/#primitive-types) that is not backed by any particular file format (Parquet, Avro, Arrow): https://iceberg.apache.org/spec/?h=parquet+types#parquet.
One option is to follow a similar approach: define our own type system as Java constructs (https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/types/Types.java) and then have the different engines and file formats convert to and from our representation. See, for example, https://github.com/apache/iceberg/blob/main/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/ArrowReader.java#L68
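A minimal sketch of what such Java constructs might look like, in the spirit of Iceberg's `Types.java` (singletons for parameter-free primitives, factory methods plus value equality for parameterized types). The names here (`NativeTypesSketch`, `IntType.GET`, `DecimalType.of`) are illustrative assumptions, not a proposed Hudi API:

```java
import java.util.Objects;

// Minimal sketch of a native type system as Java constructs; illustrative
// only, not an actual Hudi API.
public class NativeTypesSketch {

    public interface Type { String typeName(); }

    // Parameter-free primitives are shared singletons.
    public static final class IntType implements Type {
        public static final IntType GET = new IntType();
        private IntType() {}
        public String typeName() { return "int"; }
    }

    public static final class StringType implements Type {
        public static final StringType GET = new StringType();
        private StringType() {}
        public String typeName() { return "string"; }
    }

    // Parameterized types carry their parameters and define value equality,
    // so decimal(10,2) from two different code paths compares equal.
    public static final class DecimalType implements Type {
        private final int precision;
        private final int scale;

        public static DecimalType of(int precision, int scale) {
            return new DecimalType(precision, scale);
        }

        private DecimalType(int precision, int scale) {
            this.precision = precision;
            this.scale = scale;
        }

        public String typeName() { return "decimal(" + precision + "," + scale + ")"; }

        @Override public boolean equals(Object o) {
            return o instanceof DecimalType d
                && d.precision == precision && d.scale == scale;
        }

        @Override public int hashCode() { return Objects.hash(precision, scale); }
    }

    public static void main(String[] args) {
        System.out.println(DecimalType.of(10, 2).typeName());
    }
}
```

Engines and file formats would then each implement a converter between their own types and these constructs, as Iceberg's Arrow reader does for its type system.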
Option 2
Similar to how we use Avro today within the Hudi project for schema representation, we could instead leverage Apache Arrow and use Arrow types as first-class citizens within the project. For example, Lance currently does not have its own type system and instead leverages Arrow directly.