feat: support nested-column compression metadata in TBLPROPERTIES #490
Open
LuciferYang wants to merge 1 commit into lance-format:main
ce00a3a to 5b7eac4
…nce-format#434)

Add a new TBLPROPERTIES key format `lance.<encoding>.column.<path>` that reaches struct, array, and map fields at any depth. The legacy `<column>.lance.<encoding>` format remains supported (top-level only); when both target the same path, the new format wins.

Path resolution is type-guided: struct children use literal field names, array elements use the literal token `element`, map keys/values use `key` / `value`. Roles compose for chained array/map nesting. Path depth is bounded at 16 segments.

Metadata that crosses an array or map boundary is smuggled on the nearest enclosing StructField under a `lance-nested.` prefix and unpacked by `LanceArrowUtils.toArrowField` onto the corresponding Arrow child Field.

Error messages strip control characters (CR/LF/NUL/NEL/LS/PS plus bidi overrides) before interpolating user-controlled paths and values, so hostile column names cannot inject log lines or spoof terminal output. Validators reject null values up-front (previously `Float.parseFloat(null)` leaked an NPE through `validateRleThreshold`'s `NumberFormatException` catch).

Stale legacy entries are dropped before validation when the new format covers the same `(path, rule)` pair, so migrating to the new format no longer trips on a pre-existing invalid legacy value.

Adds 44 unit cases covering the walker (struct/array/map/fixed-size-list combinations, role composition, deep nesting at the depth limit, fields literally named `column`/`element`/`key`/`value`, control-char identity collision), every validator's invalid-value branch, the legacy-vs-new override semantics, and that nested-prefix keys never leak onto outer Arrow Field metadata. The float16 + nested-compression path is gated by `Assumptions.assumeTrue` so it runs on Spark 4.0+ (Arrow 18+) and skips on 3.5.
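For readers who want the type-guided path rules in concrete form, here is a minimal sketch. It is not the PR's `SchemaConverter` walker: `NestedPathSketch`, `Node`, `Kind`, and `resolve` are hypothetical names, the schema is a toy model rather than Spark's `StructType`, and fixed-size lists are assumed to behave like arrays. Only the token rules (`element`, `key`, `value`, literal struct field names) and the 16-segment bound come from the commit message; field names containing dots are not handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical toy model of type-guided path resolution; not the PR's code.
final class NestedPathSketch {
    // The 16-segment bound is taken from the commit message.
    static final int MAX_SEGMENTS = 16;

    enum Kind { PRIMITIVE, STRUCT, ARRAY, MAP }

    static final class Node {
        final Kind kind;
        // Legal next tokens depend on the node's type:
        // struct -> field names, array -> "element", map -> "key" / "value".
        final Map<String, Node> children = new LinkedHashMap<>();

        Node(Kind kind) { this.kind = kind; }

        Node field(String name, Node child) {
            if (kind != Kind.STRUCT) {
                throw new IllegalStateException("only structs have named fields");
            }
            children.put(name, child);
            return this;
        }
    }

    static Node primitive() { return new Node(Kind.PRIMITIVE); }

    static Node struct() { return new Node(Kind.STRUCT); }

    static Node array(Node element) {
        Node n = new Node(Kind.ARRAY);
        n.children.put("element", element);
        return n;
    }

    static Node map(Node key, Node value) {
        Node n = new Node(Kind.MAP);
        n.children.put("key", key);
        n.children.put("value", value);
        return n;
    }

    // Walks the dotted path from the root, consulting the current node's type
    // at every step, so a struct field literally named "element" is still
    // resolved as a field name rather than as an array role.
    static Node resolve(Node root, String path) {
        String[] segments = path.split("\\.", -1);
        if (segments.length > MAX_SEGMENTS) {
            throw new IllegalArgumentException(
                "path has " + segments.length + " segments, limit is " + MAX_SEGMENTS);
        }
        Node current = root;
        for (String segment : segments) {
            Node next = current.children.get(segment);
            if (next == null) {
                throw new IllegalArgumentException(
                    "no '" + segment + "' under a " + current.kind + " node");
            }
            current = next;
        }
        return current;
    }

    public static void main(String[] args) {
        // items: ARRAY<MAP<K, V>>, addressed as items.element.value
        Node schema = struct().field("items", array(map(primitive(), primitive())));
        System.out.println(resolve(schema, "items.element.value").kind); // prints PRIMITIVE
    }
}
```

Because each node only exposes the tokens valid for its own type, `items.value` is rejected (an array has no `value` role) while `items.element.value` resolves to the map's value type.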
5b7eac4 to 88f6473
Closes #434.
Adds a new TBLPROPERTIES key format that addresses struct, array, and map fields at any depth. The legacy top-level format is kept and remains supported indefinitely.
- Legacy (top-level only): `<column>.lance.<key>`
- New (any depth): `lance.<key>.column.<segment1>.<segment2>...`

Path tokens (type-guided): struct child → field name; array / fixed-size-list element → `element`; map → `key` / `value`. Roles compose for chained nesting (e.g. `lance.compression.column.items.element.value` for `ARRAY<MAP<…, V>>`). Depth bounded at 16 segments.

How it lands: paths through only struct children write metadata directly on the deepest `StructField`. Paths that cross an array/map boundary (no per-element `StructField`) smuggle on the nearest enclosing `StructField` under a `lance-nested.` prefix; `LanceArrowUtils.toArrowField` unpacks them onto the corresponding Arrow child `Field.metadata`.

Format precedence rules:
- When both formats target the same `(path, rule)`, the new-format entry wins; the colliding legacy entry is dropped before validation, so a stale invalid legacy value doesn't throw after migration.
- Keys ending in `.lance.<rule>` are never interpreted as nested new-format paths.

Hardening along the way:
- `validateRleThreshold` accepted `Float.NaN` (every NaN comparison is `false`); fixed via positive predicate.
- Validators reject `null` up-front (previously `Float.parseFloat(null)` leaked NPE).

Files:
- `LanceEncodingUtils.java` (parsers, validator dispatch, sanitizer, legacy-shape filter)
- `SchemaConverter.java` (recursive walker)
- `LanceArrowUtils.scala` (Arrow-side unpack)
- `docs/src/operations/ddl/create-table.md` (docs + 4-language nested example)
- `SchemaConverterNestedCompressionTest.java` (51 cases, new)

Test plan
- `make lint` clean
- `SchemaConverterTest` (29) and `SchemaConverterFloat16Test` (5) unchanged
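
To make the two validator hardening points concrete, here is a small sketch of why a negated range check lets `NaN` through while a positive predicate does not, along with the up-front `null` rejection. This is not the PR's `validateRleThreshold`: the class and method names are hypothetical, and the `[0, 1]` range is an assumption chosen purely for illustration, since the description does not state the real bounds.

```java
// Hypothetical illustration; the [0, 1] bounds are an assumption, not the PR's.
final class RleThresholdSketch {

    // Negated form: "reject if out of range". Both comparisons are false
    // for NaN, so NaN is never rejected.
    static boolean acceptedByNegatedCheck(float v) {
        return !(v < 0.0f || v > 1.0f);
    }

    // Positive form: "accept only if in range". Both comparisons are false
    // for NaN, so NaN is never accepted.
    static boolean acceptedByPositiveCheck(float v) {
        return v >= 0.0f && v <= 1.0f;
    }

    static float parseThreshold(String raw) {
        // Reject null before parsing: Float.parseFloat(null) throws
        // NullPointerException, which a NumberFormatException catch misses.
        if (raw == null) {
            throw new IllegalArgumentException("threshold must not be null");
        }
        final float value;
        try {
            value = Float.parseFloat(raw.trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("threshold is not a number", e);
        }
        if (!acceptedByPositiveCheck(value)) {
            throw new IllegalArgumentException("threshold out of range");
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(acceptedByNegatedCheck(Float.NaN));  // prints true
        System.out.println(acceptedByPositiveCheck(Float.NaN)); // prints false
        System.out.println(parseThreshold("0.5"));              // prints 0.5
    }
}
```

With this shape, `parseThreshold("NaN")` and `parseThreshold(null)` both fail with `IllegalArgumentException` instead of returning `NaN` or leaking an NPE.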