Skip to content

Conversation

thisisArjit
Copy link
Contributor

@thisisArjit thisisArjit commented Sep 5, 2025

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • Instead of adding all the data files to Post publish step for iceberg partition copy, this PR adds Data file to each Iceberg partition copyable file during work unit generation & then adds all the data files to post publish step during commit step, hence avoiding serialisation & deserialisation of all the data files at once

  • Changes:

  • During WU generation, Copyable file is created and serialised. This PR adds an extension to Copyable file, IcebergPartitionCopyable file which contains the corresponding data file

  • During Commit step, in Iceberg post commit step, all the data files are collected from all the iceberg copyable file & the then gets added to IcebergOverwritePartitionsStep

Tests

  • Updated existing tests
  • Manually copied 35tb (66k) iceberg partition

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

@thisisArjit thisisArjit marked this pull request as ready for review September 5, 2025 09:15
Comment on lines 28 to 30
public String getBase64EncodedDataFile() {
return this.base64EncodedDataFile;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can be removed

Comment on lines 193 to 205
private Map<Path, FileStatus> calcSrcFileStatusByDestFilePath(
Map<Path, DataFile> destDataFileBySrcPath,
Map<Path, Path> destPathToSrcPath) throws IOException {
Function<Path, FileStatus> getFileStatus = CheckedExceptionFunction.wrapToTunneled(this.sourceFs::getFileStatus);
Map<Path, FileStatus> srcFileStatusByDestFilePath = new ConcurrentHashMap<>();
final Map<Path, FileStatus> srcFileStatusByDestFilePath = new ConcurrentHashMap<>();
try {
srcFileStatusByDestFilePath = destDataFileBySrcPath.entrySet()
destDataFileBySrcPath.entrySet()
.parallelStream()
.collect(Collectors.toConcurrentMap(entry -> new Path(entry.getValue().path().toString()),
entry -> getFileStatus.apply(entry.getKey())));
.forEach(entry -> {
Path destPath = new Path(entry.getValue().path().toString());
destPathToSrcPath.put(destPath, entry.getKey());
srcFileStatusByDestFilePath.put(destPath, getFileStatus.apply(entry.getKey()));
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit : Let's move this logic to caller itself and see if we can parallelize that for-loop to reduce runtime

Comment on lines 358 to 369
List<DataFile> icebergDataFiles = new ArrayList<>();
for (WorkUnitState wus : statesHelper.getNonPostPublishStates()) {
if (wus.getWorkingState() == WorkingState.SUCCESSFUL) {
wus.setWorkingState(WorkUnitState.WorkingState.COMMITTED);
}
CopyEntity copyEntity = CopySource.deserializeCopyEntity(wus);
if (copyEntity instanceof CopyableFile) {
if (copyEntity instanceof CopyableFile || copyEntity instanceof IcebergPartitionCopyableFile) {
if (copyEntity instanceof IcebergPartitionCopyableFile) {
icebergDataFiles.add(((IcebergPartitionCopyableFile) copyEntity).getDataFile());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets add a comment to describe why this is done

Comment on lines +39 to +58

public DataFile getDataFile() {
return SerializationUtil.deserializeFromBase64(base64EncodedDataFile);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit : lets add java doc here too

@thisisArjit thisisArjit force-pushed the iceberg-data-files branch 3 times, most recently from 734354f to 71bb739 Compare September 5, 2025 16:05
@thisisArjit thisisArjit changed the title Construct iceberg data files during commit step [GOBBLIN-2226] Construct iceberg data files during commit step Sep 6, 2025
@Blazer-007 Blazer-007 merged commit 5ad93db into apache:master Sep 8, 2025
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants