Variant in Lance #5238
wojiaodoubao
started this conversation in
Ideas
Hi @Xuanwo, @westonpace, @wjones127, @jackye1995, the proposal is ready for review now. Looking forward to your ideas and suggestions! Thanks very much!
Background
Semi-structured data is a common data storage format in AI scenarios. Previously, we discussed JSON document search in Lance and added support for JSON data, including creating scalar and full-text indexes on JSON fields.
I think we can continue that discussion by introducing VARIANT in Lance. Parquet VARIANT is a semi-structured data encoding proposed by the Parquet community; it is intended to allow efficient access to nested data even in the presence of very wide or deep structures. We can refer to Parquet's design to create the Lance Variant encoding.
Lance Variant Encoding
A Variant represents a type that contains one of:
Lance Variant is an encoding method for VARIANT data that provides efficient per-field queries and field shredding. Lance Variant can also be seen as a logical extension of Arrow, where the encoding defines how data is organized in the meta, value, and typed fields.
Lance Variant can encode common semi-structured types, such as JSON. We can encode a JSON object as a Lance Variant Type value and store it in a Lance table. Conversely, we can convert a Lance Variant encoded value back into a JSON object.
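To make the round trip concrete, here is a minimal Python sketch of the general idea: field names are collected into a shared dictionary (standing in for meta) and the document body refers to them by id (standing in for value). The `encode` and `decode` helpers and the tagged-list layout are hypothetical and purely illustrative; they are not the binary format from the proposal or from Parquet.

```python
import json

def encode(doc):
    """Toy encoder: returns (meta, value). Illustrative only, not the real format."""
    names = set()

    def collect(node):
        # Gather every object key that appears anywhere in the document.
        if isinstance(node, dict):
            for key, child in node.items():
                names.add(key)
                collect(child)
        elif isinstance(node, list):
            for child in node:
                collect(child)

    collect(doc)
    meta = sorted(names)                      # dictionary of field names
    ids = {name: i for i, name in enumerate(meta)}

    def rewrite(node):
        # Replace each key with its id in the meta dictionary.
        if isinstance(node, dict):
            return ["obj", [[ids[k], rewrite(v)] for k, v in node.items()]]
        if isinstance(node, list):
            return ["arr", [rewrite(v) for v in node]]
        return ["prim", node]

    return meta, rewrite(doc)

def decode(meta, value):
    """Rebuild the original JSON-like object from (meta, value)."""
    tag, body = value
    if tag == "obj":
        return {meta[i]: decode(meta, v) for i, v in body}
    if tag == "arr":
        return [decode(meta, v) for v in body]
    return body

doc = json.loads('{"user": {"name": "a", "tags": ["x", "y"]}, "name": "b"}')
meta, value = encode(doc)
restored = decode(meta, value)
```

Note that the key `name` appears twice in the document but only once in `meta`, which is the point of keeping field names in a separate dictionary.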
The design of Lance Variant is based on Parquet's Variant design, using meta, value, and typed fields for encoding. The encoding of the assembled fields (meta, value) is nearly the same as in Parquet Variant. The shredding encoding differs from Parquet's: Lance Variant leverages Lance's zero-copy data evolution, making shredding flexible and efficient.
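As a rough illustration of what shredding means in general, the Python sketch below pulls one field out of each record into a typed column and keeps the remainder as a residual value. The `shred` and `unshred` helpers are hypothetical, and the layout is an assumption for illustration; the actual Lance shredding layout is described in the proposal and is not reproduced here.

```python
import json

def shred(records, field):
    """Split records into a typed int column for `field` plus residual JSON text."""
    typed, residual = [], []
    for rec in records:
        rest = dict(rec)
        v = rest.get(field)
        # Only values matching the column type (int here) are shredded;
        # anything else stays in the residual value.
        if isinstance(v, int) and not isinstance(v, bool):
            typed.append(rest.pop(field))
        else:
            typed.append(None)
        residual.append(json.dumps(rest, sort_keys=True))
    return typed, residual

def unshred(typed, residual, field):
    """Reassemble the original records from the typed column and the residual."""
    out = []
    for t, r in zip(typed, residual):
        rec = json.loads(r)
        if t is not None:
            rec[field] = t
        out.append(rec)
    return out

records = [{"id": 1, "name": "a"}, {"id": "x", "name": "b"}, {"name": "c"}]
typed, residual = shred(records, "id")
restored = unshred(typed, residual, "id")
```

A query that filters on `id` can then scan only the typed column, while rows whose `id` is missing or not an integer remain recoverable from the residual.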
Proposal
https://docs.google.com/document/d/1aXEoVlRYUMTIQ290Th-sL0kFy0-zwKhe5AKuYzXwjDc/edit?usp=sharing