Core: V4 write direction wrappers#16936
Conversation
| * <p>REPLACED files are the prior-state entries of v4 REPLACED/MODIFIED pairs and are not live. | ||
| * Returns null for manifest files written by pre-v4 writers. | ||
| */ | ||
| default Integer replacedFilesCount() { |
There was a problem hiding this comment.
Generally, public-interface changes should be minimized and delayed. Justification for keeping these two on the interface in this PR:
In-tree consumers (this PR + close follow-ups):
ContentEntryAdapters.manifestInfo()(this PR) populates the v4 root manifest'smanifest_infostruct frommanifest.replacedFilesCount()/replacedRowsCount().MergingSnapshotProducer.validateAddedDVs(follow-up phase) filters concurrent data manifests byreplacedFilesCount > 0to detect v4 colocated-DV write conflicts — without it, every concurrent data manifest must be deep-scanned.
Convention. The interface-default pattern matches v3's existing extensions for spec-defined optional fields on ManifestFile: firstRowId (row lineage), keyMetadata (encryption), containsNaN (partition NaN). v4 spec-defined manifest_info counts belong on the same interface in the same shape.
A TrackedFile-based parallel v4 API was considered. Would require rewriting ManifestGroup and every engine that consumes FileScanTask. The cost is disproportionate to the savings.
There was a problem hiding this comment.
Projected new default methods across the full v4 stack, all spec-defined additions backing v4 manifest_info / content_entry fields:
In this PR:
default Integer replacedFilesCount()default Long replacedRowsCount()
In the root-manifest writer/reader PR (next in this stack):
default int writerFormatVersion()— read byRootManifestWriter.addEntry()for per-leaf format dispatch in the root manifest.
when root-level manifest DV bitmaps are implemented:
default ByteBuffer deletedPositions()default ByteBuffer replacedPositions()
gaborkaszab
left a comment
There was a problem hiding this comment.
Hey @stevenzwu ,
I went through mostly for my own understanding. Left some comments, mostly I'm a bit hesitant to expose setting status and sequence numbers directly through the builders. Let me know what you think!
|
|
||
| private ContentEntryAdapters() {} | ||
|
|
||
| static TrackedFile fromDataFile( |
There was a problem hiding this comment.
I don't see how the user will use these functions, but would it make sense to merge the fromDataFile and fromDeleteFile into a common fromManifestEntry? Then the user doesn't have to decide if it's data or delete, the adapter can do it instead.
There was a problem hiding this comment.
I actually think we should keep them separate. The validation for delete files is different from data files and so is the error handling and messages, I think it's cleaner that way.
The callers also always know which type they have so it should be easier.
| // - v2 standalone position delete file (POSITION_DELETES stored in Parquet/Avro/ORC) has no | ||
| // v4 representation; it can only live in pre-v4 legacy manifests carried over via a | ||
| // writer_format_version=0 manifest reference. | ||
| if (file.content() == FileContent.POSITION_DELETES) { |
There was a problem hiding this comment.
nit: this if is only necessary to produce specific error messages based on how pos-deletes are represented. Isn't the check needed on line 109 that only eq-deletes are allowed?
There was a problem hiding this comment.
yeah, the purpose is to provide a more specific error msg. I can remove it if people think it is unnecessary
| static TrackedFile fromManifestFile( | ||
| ManifestFile manifest, int writerFormatVersion, EntryStatus status, Long firstRowId) { | ||
| Preconditions.checkArgument(manifest != null, "Invalid manifest file: null"); | ||
| Preconditions.checkArgument( |
There was a problem hiding this comment.
Should this check go into the TrackedFileBuilder instead?
There was a problem hiding this comment.
the writerFormatVersion argument will be removed when it is added to the ManifestFile interface in later phases. so probably good to keep it here for now.
| .specId(manifest.partitionSpecId()) | ||
| .manifestInfo(info); | ||
|
|
||
| if (firstRowId != null) { |
There was a problem hiding this comment.
I recall there was an argument not to put the firstRowId setter to the builder and have a separate setter method. Seeing the usage here, could be part of the builder unless there is any other usage that I don't know of. PR
| .status(status) | ||
| .dataSequenceNumber(dataSequenceNumber) | ||
| .fileSequenceNumber(fileSequenceNumber); | ||
| Long firstRowId = isDataFile ? ((DataFile) file).firstRowId() : null; |
There was a problem hiding this comment.
This seems identical to the check in the status == EntryStatus.ADDED branch. Could this be moved out of this if after line 237?
| * accumulated field values and bypasses {@link TrackingBuilder}. Cannot be combined with the | ||
| * source-tracking path ({@link #from(TrackedFile, long)}). | ||
| */ | ||
| TrackedFileBuilder status(EntryStatus newStatus) { |
There was a problem hiding this comment.
As written in an earlier comment, I think this and the seq num setters are too much freedom given to the user. Can't we avoid exposing these?
status()is needed for EXISTING state. I think this can be achieved with the existingfrom()builder method- can't we avoid these using the
inheritFrom()method?
There was a problem hiding this comment.
I agree with @gaborkaszab.
One thought here, can we combine all the setters?
TrackedFileBuilder explicitTracking(EntryStatus status, Long dataSeqNum, Long fileSeqNum){..}This makes it impossible to call status() without sequence number which is invalid for existing entries.
There was a problem hiding this comment.
I wouldn't allow the user explicitly setting these fields in any form. They could easily break internal invariants of TrackedFile if we do so. We currently have mechanism in the builder to prevent breaking invariants like replaced state for manifest entries and many others. This won't guard against these, or we have to duplicate all these checks here.
I'd look for existing methods on the builder/TrackedFile API to achieve what we need here.
There was a problem hiding this comment.
I switched the TrackingBuilder pattern as @gaborkaszab 's PR #16964
| return this; | ||
| } | ||
|
|
||
| TrackingBuilder firstRowId(Long newFirstRowId) { |
There was a problem hiding this comment.
I have a PR to set firstRowID. I had the impression that we don't want to do that through the builder and rather use a setter, but I can change the approach.
c78d6b1 to
828e0c7
Compare
828e0c7 to
fa86dc1
Compare
53b37e6 to
fa46d3d
Compare
Introduce two package-private helpers used by the v4 manifest writer path: TrackedFileWrapper is a StructLike adapter exposing a TrackedFile via positional access matching TrackedFile.schemaWithContentStats(...). It mirrors V4Metadata.ManifestEntryWrapper for the new content_entry layout, supporting Parquet writes without materializing intermediate records. ContentEntryAdapter converts legacy ManifestEntry and ManifestFile inputs into TrackedFileStruct rows. The leaf factories accept any ManifestEntry whose file is DATA or EQUALITY_DELETES, so ManifestWriter.V4Writer and V4DeleteWriter share one entry point. A DV-specific overload takes ManifestEntry<DataFile> for colocated deletion vector emission used by Phase 6's REPLACED/MODIFIED pairs. fromManifestFile remains for root manifest references and accepts the writer_format_version override (1 for v4 leaves, 0 for v3 leaves carried through a v3->v4 upgrade). Status derivation is delegated to TrackingBuilder; content stats construction goes through MetricsUtil.fromMetrics. ManifestEntry.Status is mapped to EntryStatus in toEntryStatus(...), the only mapping needed since the legacy enum has no REPLACED or MODIFIED. Per the v4 plan, validation of writer_format_version against the supported set lives in the Phase 2 ContentEntryReader, not at storage or write time. The adapter validates only on the manifest factory, the single caller that may legitimately set 0. No callers wired yet; round-trip tests land with the Phase 2 reader and writer rewrite. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
fa46d3d to
55f5573
Compare
No description provided.