Skip to content

Core: V4 write direction wrappers#16936

Draft
stevenzwu wants to merge 1 commit into
apache:mainfrom
stevenzwu:v4_write_direction_wrappers
Draft

Core: V4 write direction wrappers#16936
stevenzwu wants to merge 1 commit into
apache:mainfrom
stevenzwu:v4_write_direction_wrappers

Conversation

@stevenzwu

Copy link
Copy Markdown
Contributor

No description provided.

* <p>REPLACED files are the prior-state entries of v4 REPLACED/MODIFIED pairs and are not live.
* Returns null for manifest files written by pre-v4 writers.
*/
default Integer replacedFilesCount() {

@stevenzwu stevenzwu Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, public-interface changes should be minimized and delayed. Justification for keeping these two on the interface in this PR:

In-tree consumers (this PR + close follow-ups):

  • ContentEntryAdapters.manifestInfo() (this PR) populates the v4 root manifest's manifest_info struct from manifest.replacedFilesCount() / replacedRowsCount().
  • MergingSnapshotProducer.validateAddedDVs (follow-up phase) filters concurrent data manifests by replacedFilesCount > 0 to detect v4 colocated-DV write conflicts — without it, every concurrent data manifest must be deep-scanned.

Convention. The interface-default pattern matches v3's existing extensions for spec-defined optional fields on ManifestFile: firstRowId (row lineage), keyMetadata (encryption), containsNaN (partition NaN). v4 spec-defined manifest_info counts belong on the same interface in the same shape.

A TrackedFile-based parallel v4 API was considered. Would require rewriting ManifestGroup and every engine that consumes FileScanTask. The cost is disproportionate to the savings.

@stevenzwu stevenzwu Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Projected new default methods across the full v4 stack, all spec-defined additions backing v4 manifest_info / content_entry fields:

In this PR:

  • default Integer replacedFilesCount()
  • default Long replacedRowsCount()

In the root-manifest writer/reader PR (next in this stack):

  • default int writerFormatVersion() — read by RootManifestWriter.addEntry() for per-leaf format dispatch in the root manifest.

when root-level manifest DV bitmaps are implemented:

  • default ByteBuffer deletedPositions()
  • default ByteBuffer replacedPositions()

@gaborkaszab gaborkaszab left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @stevenzwu ,
I went through mostly for my own understanding. Left some comments, mostly I'm a bit hesitant to expose setting status and sequence numbers directly through the builders. Let me know what you think!


private ContentEntryAdapters() {}

static TrackedFile fromDataFile(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see how the user will use these functions, but would it make sense to merge the fromDataFile and fromDeleteFile into a common fromManifestEntry? Then the user doesn't have to decide if it's data or delete, the adapter can do it instead.

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually think we should keep them separate. The validation for delete files is different from data files and so is the error handling and messages, I think it's cleaner that way.

The callers also always know which type they have so it should be easier.

// - v2 standalone position delete file (POSITION_DELETES stored in Parquet/Avro/ORC) has no
// v4 representation; it can only live in pre-v4 legacy manifests carried over via a
// writer_format_version=0 manifest reference.
if (file.content() == FileContent.POSITION_DELETES) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this if is only necessary to produce specific error messages based on how pos-deletes are represented. Isn't the check needed on line 109 that only eq-deletes are allowed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, the purpose is to provide a more specific error msg. I can remove it if people think it is unnecessary

static TrackedFile fromManifestFile(
ManifestFile manifest, int writerFormatVersion, EntryStatus status, Long firstRowId) {
Preconditions.checkArgument(manifest != null, "Invalid manifest file: null");
Preconditions.checkArgument(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this check go into the TrackedFileBuilder instead?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the writerFormatVersion argument will be removed when it is added to the ManifestFile interface in later phases. so probably good to keep it here for now.

#16936 (comment)

.specId(manifest.partitionSpecId())
.manifestInfo(info);

if (firstRowId != null) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recall there was an argument not to put the firstRowId setter to the builder and have a separate setter method. Seeing the usage here, could be part of the builder unless there is any other usage that I don't know of. PR

.status(status)
.dataSequenceNumber(dataSequenceNumber)
.fileSequenceNumber(fileSequenceNumber);
Long firstRowId = isDataFile ? ((DataFile) file).firstRowId() : null;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems identical to the check in the status == EntryStatus.ADDED branch. Could this be moved out of this if after line 237?

* accumulated field values and bypasses {@link TrackingBuilder}. Cannot be combined with the
* source-tracking path ({@link #from(TrackedFile, long)}).
*/
TrackedFileBuilder status(EntryStatus newStatus) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As written in an earlier comment, I think this and the seq num setters are too much freedom given to the user. Can't we avoid exposing these?

  1. status() is needed for EXISTING state. I think this can be achieved with the existing from() builder method
  2. can't we avoid these using the inheritFrom() method?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @gaborkaszab.

One thought here, can we combine all the setters?

TrackedFileBuilder explicitTracking(EntryStatus status, Long dataSeqNum, Long fileSeqNum){..}

This makes it impossible to call status() without sequence number which is invalid for existing entries.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't allow the user explicitly setting these fields in any form. They could easily break internal invariants of TrackedFile if we do so. We currently have mechanism in the builder to prevent breaking invariants like replaced state for manifest entries and many others. This won't guard against these, or we have to duplicate all these checks here.
I'd look for existing methods on the builder/TrackedFile API to achieve what we need here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I switched the TrackingBuilder pattern as @gaborkaszab 's PR #16964

return this;
}

TrackingBuilder firstRowId(Long newFirstRowId) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a PR to set firstRowID. I had the impression that we don't want to do that through the builder and rather use a setter, but I can change the approach.

@stevenzwu stevenzwu force-pushed the v4_write_direction_wrappers branch 5 times, most recently from c78d6b1 to 828e0c7 Compare June 25, 2026 21:35
@stevenzwu stevenzwu force-pushed the v4_write_direction_wrappers branch from 828e0c7 to fa86dc1 Compare June 25, 2026 22:18
@github-actions github-actions Bot added the API label Jun 25, 2026
@stevenzwu stevenzwu force-pushed the v4_write_direction_wrappers branch 4 times, most recently from 53b37e6 to fa46d3d Compare June 26, 2026 05:41
Introduce two package-private helpers used by the v4 manifest writer
path:

TrackedFileWrapper is a StructLike adapter exposing a TrackedFile via
positional access matching TrackedFile.schemaWithContentStats(...). It
mirrors V4Metadata.ManifestEntryWrapper for the new content_entry
layout, supporting Parquet writes without materializing intermediate
records.

ContentEntryAdapter converts legacy ManifestEntry and ManifestFile
inputs into TrackedFileStruct rows. The leaf factories accept any
ManifestEntry whose file is DATA or EQUALITY_DELETES, so
ManifestWriter.V4Writer and V4DeleteWriter share one entry point. A
DV-specific overload takes ManifestEntry<DataFile> for colocated
deletion vector emission used by Phase 6's REPLACED/MODIFIED pairs.
fromManifestFile remains for root manifest references and accepts
the writer_format_version override (1 for v4 leaves, 0 for v3 leaves
carried through a v3->v4 upgrade).

Status derivation is delegated to TrackingBuilder; content stats
construction goes through MetricsUtil.fromMetrics. ManifestEntry.Status
is mapped to EntryStatus in toEntryStatus(...), the only mapping
needed since the legacy enum has no REPLACED or MODIFIED.

Per the v4 plan, validation of writer_format_version against the
supported set lives in the Phase 2 ContentEntryReader, not at storage
or write time. The adapter validates only on the manifest factory,
the single caller that may legitimately set 0.

No callers wired yet; round-trip tests land with the Phase 2 reader
and writer rewrite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

3 participants