
feat: support nested-column compression metadata in TBLPROPERTIES#490

Open
LuciferYang wants to merge 1 commit into lance-format:main from LuciferYang:nested-tblproperties-metadata

Conversation

Contributor

@LuciferYang LuciferYang commented Apr 28, 2026

Closes #434.

Adds a new TBLPROPERTIES key format that addresses struct, array, and map fields at any depth. The legacy top-level format is kept and remains supported indefinitely.

| Format | Shape | Targets |
| --- | --- | --- |
| Legacy | `<column>.lance.<key>` | top-level columns only |
| New | `lance.<key>.column.<segment1>.<segment2>...` | top-level and nested |

Path tokens are type-guided: a struct child is addressed by its field name; an array / fixed-size-list element by the literal token `element`; a map by `key` / `value`. Roles compose for chained nesting (e.g. `lance.compression.column.items.element.value` for `ARRAY<MAP<…, V>>`). Path depth is bounded at 16 segments.
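The key shape and depth bound above can be sketched as a small parser. This is an illustrative sketch only, not the PR's actual implementation in `LanceEncodingUtils.java`; the class and method names here are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NestedKeyParser {
    static final int MAX_DEPTH = 16; // depth bound stated in the PR

    /** Returns [rule, segment1, segment2, ...] for a new-format key, or null otherwise. */
    public static List<String> parse(String key) {
        String[] parts = key.split("\\.", -1);
        // New shape: lance.<rule>.column.<segment1>...<segmentN>
        if (parts.length < 4 || !"lance".equals(parts[0]) || !"column".equals(parts[2])) {
            return null;
        }
        // Legacy reservation: keys ending in ".lance.<rule>" are never read as new-format.
        if ("lance".equals(parts[parts.length - 2])) {
            return null;
        }
        int depth = parts.length - 3;
        if (depth > MAX_DEPTH) {
            throw new IllegalArgumentException("path depth " + depth + " exceeds " + MAX_DEPTH);
        }
        List<String> out = new ArrayList<>();
        out.add(parts[1]); // the rule, e.g. "compression"
        out.addAll(Arrays.asList(parts).subList(3, parts.length));
        return out;
    }
}
```

A legacy-shaped key like `c1.lance.compression` falls through to null here, leaving it for the legacy parser.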

How it lands: for paths that pass through only struct children, the metadata is written directly on the deepest StructField. Paths that cross an array/map boundary (which has no per-element StructField) are smuggled onto the nearest enclosing StructField under a `lance-nested.` prefix; `LanceArrowUtils.toArrowField` unpacks those entries onto the corresponding Arrow child Field's metadata.
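The smuggle-then-unpack idea can be sketched as a pair of helpers: pack a residual path under the prefix, then peel one role segment off when descending to the Arrow child. Names are illustrative, not the PR's API, and the unpack step is simplified (it does not model re-prefixing when further boundaries remain).

```java
import java.util.HashMap;
import java.util.Map;

public class NestedMetadata {
    static final String PREFIX = "lance-nested.";

    /** Pack a residual path plus rule into one prefixed metadata key. */
    public static String pack(String residualPath, String rule) {
        return PREFIX + residualPath + "." + rule;
    }

    /**
     * Extract the entries belonging to one child role ("element", "key", "value"),
     * stripping that role segment so the child sees only its own residual path.
     */
    public static Map<String, String> unpackFor(Map<String, String> meta, String role) {
        Map<String, String> out = new HashMap<>();
        for (Map.Entry<String, String> e : meta.entrySet()) {
            String k = e.getKey();
            if (!k.startsWith(PREFIX)) continue;
            String rest = k.substring(PREFIX.length());
            if (rest.startsWith(role + ".")) {
                out.put(rest.substring(role.length() + 1), e.getValue());
            }
        }
        return out;
    }
}
```

Because the filter requires the `lance-nested.` prefix, ordinary struct-level metadata never leaks through this path onto a child.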

Format precedence rules:

  • When two distinct properties target the same (path, rule), the new-format entry wins; the colliding legacy entry is dropped before validation, so a stale invalid legacy value doesn't throw after migration.
  • When a single literal key parses both ways (because a top-level column name itself looks like a new-format key), the legacy interpretation is reserved: keys ending in `.lance.<rule>` are never interpreted as nested new-format paths.
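The first precedence rule can be sketched as a two-pass resolution over the raw properties: legacy entries land first, then new-format entries overwrite any colliding `(path, rule)` target, so the legacy value never reaches validation. This is a hedged sketch with hypothetical names, not the PR's filter in `LanceEncodingUtils.java`.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class Precedence {
    /** Resolve properties into a map keyed "path\u0000rule" -> value; new format wins. */
    public static Map<String, String> resolve(Map<String, String> props) {
        Map<String, String> out = new HashMap<>();
        // Pass 1: legacy "<column>.lance.<rule>" entries (top-level only).
        for (Map.Entry<String, String> e : props.entrySet()) {
            String[] p = e.getKey().split("\\.");
            if (p.length == 3 && "lance".equals(p[1])) {
                out.put(p[0] + "\u0000" + p[2], e.getValue());
            }
        }
        // Pass 2: new "lance.<rule>.column.<path>" entries overwrite colliding legacy ones,
        // so a stale invalid legacy value is dropped before validation.
        for (Map.Entry<String, String> e : props.entrySet()) {
            String[] p = e.getKey().split("\\.");
            if (p.length >= 4 && "lance".equals(p[0]) && "column".equals(p[2])) {
                String path = String.join(".", Arrays.copyOfRange(p, 3, p.length));
                out.put(path + "\u0000" + p[1], e.getValue());
            }
        }
        return out;
    }
}
```

With both `c1.lance.compression` and `lance.compression.column.c1` present, only the new-format value survives resolution.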

Hardening along the way:

  • validateRleThreshold accepted Float.NaN (every comparison involving NaN is false) — fixed by switching to a positive range predicate.
  • All validators now reject null up-front (previously Float.parseFloat(null) leaked an NPE).
  • Error messages sanitize CR/LF/NUL/NEL/LS/PS + bidi overrides (Trojan-Source defense).
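The NaN and null fixes above can be sketched together. A rejection-style check like `if (v < 0 || v > 1) throw` silently accepts NaN because both comparisons are false; the positive predicate only passes values provably inside the range. The method name mirrors the PR's `validateRleThreshold`, but the accepted range [0, 1] is an assumption for illustration.

```java
public class RleValidator {
    /** Parse and validate an RLE threshold; rejects null, non-numbers, NaN, out-of-range. */
    public static float validateRleThreshold(String raw) {
        // Null is rejected up-front; otherwise Float.parseFloat(null) would throw NPE.
        if (raw == null) {
            throw new IllegalArgumentException("rle threshold must not be null");
        }
        float v;
        try {
            v = Float.parseFloat(raw);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("rle threshold is not a float: " + raw);
        }
        // Positive predicate: NaN fails both comparisons, so !(false && false) rejects it.
        if (!(v >= 0.0f && v <= 1.0f)) {
            throw new IllegalArgumentException("rle threshold out of range: " + raw);
        }
        return v;
    }
}
```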

Files: LanceEncodingUtils.java (parsers, validator dispatch, sanitizer, legacy-shape filter); SchemaConverter.java (recursive walker); LanceArrowUtils.scala (Arrow-side unpack); docs/src/operations/ddl/create-table.md (docs + 4-language nested example); SchemaConverterNestedCompressionTest.java (51 cases — new).

Test plan

  • make lint clean
  • Spark 3.5: 89/89 (1 skipped — float16 needs Arrow 18+)
  • Spark 4.0 & 4.1: 51/51 each (float16 path exercised)
  • Existing SchemaConverterTest (29) and SchemaConverterFloat16Test (5) unchanged
  • Scala 2.13 cross-compile clean

@github-actions github-actions Bot added the enhancement New feature or request label Apr 28, 2026
@LuciferYang LuciferYang force-pushed the nested-tblproperties-metadata branch from ce00a3a to 5b7eac4 on April 28, 2026 at 07:45
…nce-format#434)

Add a new TBLPROPERTIES key format `lance.<encoding>.column.<path>` that
reaches struct, array, and map fields at any depth. The legacy
`<column>.lance.<encoding>` format remains supported (top-level only); when
both target the same path, the new format wins.

Path resolution is type-guided: struct children use literal field names,
array elements use the literal token `element`, map keys/values use `key` /
`value`. Roles compose for chained array/map nesting. Path depth is bounded
at 16 segments. Metadata that crosses an array or map boundary is smuggled
on the nearest enclosing StructField under a `lance-nested.` prefix and
unpacked by `LanceArrowUtils.toArrowField` onto the corresponding Arrow
child Field.

Error messages strip control characters (CR/LF/NUL/NEL/LS/PS plus bidi
overrides) before interpolating user-controlled paths and values, so
hostile column names cannot inject log lines or spoof terminal output.
Validators reject null values up-front (previously `Float.parseFloat(null)`
leaked an NPE through `validateRleThreshold`'s `NumberFormatException`
catch). Stale legacy entries are dropped before validation when the new
format covers the same `(path, rule)` pair, so migrating to the new format
no longer trips on a pre-existing invalid legacy value.

Adds 44 unit cases covering the walker (struct/array/map/fixed-size-list
combinations, role composition, deep nesting at the depth limit, fields
literally named `column`/`element`/`key`/`value`, control-char identity
collision), every validator's invalid-value branch, the legacy-vs-new
override semantics, and that nested-prefix keys never leak onto outer
Arrow Field metadata. The float16 + nested-compression path is gated by
`Assumptions.assumeTrue` so it runs on Spark 4.0+ (Arrow 18+) and skips on
3.5.
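The error-message sanitization described above (strip CR/LF/NUL, the Unicode line separators NEL/LS/PS, and bidi controls before interpolating user-controlled text) can be sketched as a character filter. The character set follows the PR description; including the bidi isolates alongside the embeddings/overrides is an assumption, and the class name is illustrative.

```java
public class MessageSanitizer {
    /** Drop line breaks and bidi control characters from untrusted text. */
    public static String sanitize(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean drop =
                c == '\r' || c == '\n' || c == '\u0000'  // CR / LF / NUL
                || c == '\u0085'                         // NEL
                || c == '\u2028' || c == '\u2029'        // LS / PS
                || (c >= '\u202A' && c <= '\u202E')      // bidi embeddings/overrides
                || (c >= '\u2066' && c <= '\u2069');     // bidi isolates (assumed)
            if (!drop) sb.append(c);
        }
        return sb.toString();
    }
}
```

Dropping rather than escaping keeps the message single-line, so a hostile column name cannot inject a fake log record or reorder terminal output with an RLO override.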

Successfully merging this pull request may close these issues.

Support nested columns in tblproperties field metadata