RFC-99: Hudi Type System #14253
Replies: 3 comments 6 replies
+1 for Option 1. A type system is basic infrastructure for a storage layer; it's better that we keep it format-agnostic and avoid dependencies on other projects. We already see issues like compatibility and efficiency with the current Avro schema, so this will pay off in the long run. Also, do you think we should reference the SQL standard for these type naming conventions: https://blog.ansi.org/ansi/sql-standard-iso-iec-9075-2023-ansi-x3-135/? That would make integration friendlier for SQL, which is the major user API based on the investigation here: #13894
+1 for Option 1. I shared the same concern in RFC-88 here: #12795 (comment). It's essential for reliable interoperability and long-term data correctness.

The core problem: a "many-to-many" mapping nightmare. Without a standardized table-level type system, every component in the stack is forced into a "many-to-many" mapping problem:
- Compute engines (C): Flink, Trino, Spark, Hive, etc. all have their own internal logical type representations.
- File formats (F): Parquet, ORC, and Avro all have their own physical types.
Going with Option 1 allows for a strict contract with a well-defined "hub-and-spoke" abstraction: the type system acts as the central logical hub, and each engine or file format only converts to and from it. It breaks the C × F pairwise mapping problem down into a much simpler C + F problem.
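The hub-and-spoke contract described above can be sketched in code. This is purely illustrative; none of these names (`LogicalType`, `EngineTypeConverter`, `engineToFormat`, the `SPARK`/`PARQUET` converters) are actual Hudi APIs, and real engine/format types are stood in by strings:

```java
// Illustrative hub-and-spoke sketch; not an actual Hudi API. Each engine and
// each file format implements a single converter against the central logical
// type, so N engines and M formats need N + M converters instead of N * M
// pairwise mappings.
public class HubAndSpokeSketch {

    // The central logical "hub": a table-level type independent of any
    // engine or file format.
    public enum LogicalType { INT, LONG, STRING, TIMESTAMP_MICROS, BINARY }

    // One spoke per compute engine.
    public interface EngineTypeConverter<E> {
        E fromLogical(LogicalType type);
        LogicalType toLogical(E engineType);
    }

    // One spoke per file format.
    public interface FormatTypeConverter<F> {
        F fromLogical(LogicalType type);
        LogicalType toLogical(F physicalType);
    }

    // Engine-to-format conversion composes through the hub; no direct
    // engine-to-format mapping is ever written.
    public static <E, F> F engineToFormat(E engineType,
                                          EngineTypeConverter<E> engine,
                                          FormatTypeConverter<F> format) {
        return format.fromLogical(engine.toLogical(engineType));
    }

    // Toy converters using prefixed strings as stand-ins for real types.
    public static final EngineTypeConverter<String> SPARK = new EngineTypeConverter<>() {
        public String fromLogical(LogicalType t) { return "spark:" + t; }
        public LogicalType toLogical(String e) { return LogicalType.valueOf(e.substring(6)); }
    };

    public static final FormatTypeConverter<String> PARQUET = new FormatTypeConverter<>() {
        public String fromLogical(LogicalType t) { return "parquet:" + t; }
        public LogicalType toLogical(String f) { return LogicalType.valueOf(f.substring(8)); }
    };

    public static void main(String[] args) {
        // A "Spark" type reaches "Parquet" only through the logical hub.
        System.out.println(engineToFormat("spark:INT", SPARK, PARQUET));
    }
}
```

Adding a new engine here means writing one converter against the hub, never one per file format, which is the core of the C + F argument.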
Option 2 feels fragile: we would just be replacing Avro with another borrowed type system without really solving the underlying problem. Option 1 is the only reliable way to achieve the interoperability and long-term correctness goals above.
See the board for tracking type system subtasks (#14263) for further discussion.
Background
Wanted to start a discussion around @balaji-varadarajan-ai's proposal in RFC-99 (#13743) to introduce a native type system within Apache Hudi, and around what an initial first step toward an MVP would look like.

For Hudi 1.2.0 we want to let users in the AI/ML space define some way of representing "BLOB-like" content (encompassing the binary content of an image, video, or audio file), and to store vector embeddings for these pieces of data in order to perform similarity search (for more details see RFC-102: #14218). We will need to ensure our type system can account for those. We may also need to specify some granularity around how large this binary content may be, as well as the dimensions of the vectors (@balaji-varadarajan-ai's RFC captured these details as well).
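As a thought experiment, the two AI/ML shapes mentioned above could be parameterized types. This is a hypothetical sketch only: `BlobType`, `VectorType`, and `ElementType` are illustrative names, not types from RFC-99 or any Hudi API.

```java
// Hypothetical sketch of parameterized AI/ML types; not actual Hudi types.
// A blob carries an optional size bound; a vector carries its element type
// and dimension so similarity search can validate inputs up front.
public class AiMlTypesSketch {

    public enum ElementType { FLOAT32, FLOAT64 }

    // Binary content of an image, video, or audio file, with an upper bound
    // on its size in bytes (values <= 0 meaning unbounded).
    public record BlobType(long maxSizeBytes) {
        public boolean bounded() { return maxSizeBytes > 0; }
    }

    // A fixed-dimension embedding vector.
    public record VectorType(ElementType elementType, int dimension) {
        public VectorType {
            if (dimension <= 0) {
                throw new IllegalArgumentException("dimension must be positive");
            }
        }
    }

    public static void main(String[] args) {
        BlobType image = new BlobType(64L * 1024 * 1024);            // 64 MiB bound
        VectorType embedding = new VectorType(ElementType.FLOAT32, 768);
        System.out.println(image.bounded() + " " + embedding.dimension());
    }
}
```

Making the size bound and dimension part of the type (rather than per-row metadata) is one way engines could plan storage and index layout before seeing any data.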
What types to start with from RFC 99?
From @balaji-varadarajan-ai's RFC, I think the following initial types would be the first step to support, covering both structured and unstructured use cases within a Hudi table. If you feel we should add another type for the initial MVP, or cut this list down, feel free to leave a comment.
Primitive Types
AI/ML types
Nested Types
Temporal types
What should the type system be backed by?
Option 1
Looking at other table format projects in the space, such as Apache Iceberg, the approach they take is to define a native type system (https://iceberg.apache.org/spec/#primitive-types) that is not backed by any particular file format (Parquet, Avro, Arrow): https://iceberg.apache.org/spec/?h=parquet+types#parquet.
One option is to follow a similar approach: define our own type system as Java constructs (https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/types/Types.java) and then have the different engines and file formats convert to and from our representation. See, for example, https://github.com/apache/iceberg/blob/main/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/ArrowReader.java#L68
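A minimal sketch of what such Java constructs might look like, in the spirit of Iceberg's `Types.java` (singletons for parameter-free primitives, factory methods plus value equality for parameterized types). The names here (`NativeTypesSketch`, `IntType.GET`, `DecimalType.of`) are illustrative assumptions, not a proposed Hudi API:

```java
import java.util.Objects;

// Minimal sketch of a native type system as Java constructs; illustrative
// only, not an actual Hudi API.
public class NativeTypesSketch {

    public interface Type { String typeName(); }

    // Parameter-free primitives are shared singletons.
    public static final class IntType implements Type {
        public static final IntType GET = new IntType();
        private IntType() {}
        public String typeName() { return "int"; }
    }

    public static final class StringType implements Type {
        public static final StringType GET = new StringType();
        private StringType() {}
        public String typeName() { return "string"; }
    }

    // Parameterized types carry their parameters and define value equality,
    // so decimal(10,2) from two different code paths compares equal.
    public static final class DecimalType implements Type {
        private final int precision;
        private final int scale;

        public static DecimalType of(int precision, int scale) {
            return new DecimalType(precision, scale);
        }

        private DecimalType(int precision, int scale) {
            this.precision = precision;
            this.scale = scale;
        }

        public String typeName() { return "decimal(" + precision + "," + scale + ")"; }

        @Override public boolean equals(Object o) {
            return o instanceof DecimalType d
                && d.precision == precision && d.scale == scale;
        }

        @Override public int hashCode() { return Objects.hash(precision, scale); }
    }

    public static void main(String[] args) {
        System.out.println(DecimalType.of(10, 2).typeName());
    }
}
```

Engines and file formats would then each implement a converter between their own types and these constructs, as Iceberg's Arrow reader does for its type system.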
Option 2
Similar to how we use Avro today within the Hudi project for schema representation, we could instead leverage Apache Arrow and use Arrow types as first-class citizens within the project. For example, Lance currently does not have its own type system and instead leverages Arrow directly.