[GOBBLIN-2226] Construct iceberg data files during commit step #4140

thisisArjit · 2025-09-05T05:02:24Z

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

My PR addresses the following Gobblin JIRA issues and references them in the PR title.
- https://issues.apache.org/jira/browse/GOBBLIN-2226

Description

Previously, all the data files for an Iceberg partition copy were added to the post-publish step up front. This PR instead attaches a data file to each Iceberg partition copyable file during work unit generation, and then adds the collected data files to the post-publish step during the commit step. This avoids serialising and deserialising all the data files at once.
Changes:
- During work unit (WU) generation, a CopyableFile is created and serialised. This PR adds an extension of CopyableFile, IcebergPartitionCopyableFile, which contains the corresponding data file.
- During the commit step, in the Iceberg post-commit step, the data files are collected from all the Iceberg copyable files and then added to IcebergOverwritePartitionsStep.
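
The flow described above can be illustrated with a minimal, dependency-free sketch. It is not the code in this PR: `DataFileStub` and `PartitionCopyableFileStub` are hypothetical stand-ins for Iceberg's `DataFile` and for `IcebergPartitionCopyableFile`, and plain Java serialisation with `java.util.Base64` stands in for whatever the real classes use. It only shows the shape of the idea: each copyable file carries one small encoded payload, and the full list is assembled at commit time.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Base64;
import java.util.List;

class PerFileDataFileSketch {

  /** Hypothetical stand-in for an Iceberg data file; holds only what the sketch needs. */
  static final class DataFileStub implements Serializable {
    private static final long serialVersionUID = 1L;
    final String path;
    final long sizeBytes;

    DataFileStub(String path, long sizeBytes) {
      this.path = path;
      this.sizeBytes = sizeBytes;
    }
  }

  /** Hypothetical stand-in for the copyable file: carries exactly one encoded data file. */
  static final class PartitionCopyableFileStub {
    private final String base64EncodedDataFile;

    PartitionCopyableFileStub(DataFileStub dataFile) {
      this.base64EncodedDataFile = encode(dataFile);
    }

    /** Decodes the single data file carried by this copyable file. */
    DataFileStub getDataFile() {
      return decode(this.base64EncodedDataFile);
    }
  }

  /** Work unit generation side: serialise one data file to a Base64 string. */
  static String encode(DataFileStub dataFile) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
      out.writeObject(dataFile);
    } catch (IOException e) {
      throw new IllegalStateException("Could not serialise data file", e);
    }
    return Base64.getEncoder().encodeToString(bytes.toByteArray());
  }

  /** Commit side: restore one data file from its Base64 string. */
  static DataFileStub decode(String encoded) {
    byte[] raw = Base64.getDecoder().decode(encoded);
    try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
      return (DataFileStub) in.readObject();
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException("Could not deserialise data file", e);
    }
  }

  /** Commit step: decode the data files one by one and gather them into a single list. */
  static List<DataFileStub> collectAtCommit(List<PartitionCopyableFileStub> copyableFiles) {
    List<DataFileStub> dataFiles = new ArrayList<>();
    for (PartitionCopyableFileStub copyableFile : copyableFiles) {
      dataFiles.add(copyableFile.getDataFile());
    }
    return dataFiles;
  }
}
```

In this sketch no single serialised object ever holds the whole list of data files; the list exists only in memory at commit time, which is the property the description is after.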

Tests

- Updated existing tests
- Manually copied a 35 TB (66k) Iceberg partition

Commits

My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
1. Subject is separated from body by a blank line
2. Subject is limited to 50 characters
3. Subject does not end with a period
4. Subject uses the imperative mood ("add", not "adding")
5. Body wraps at 72 characters
6. Body explains "what" and "why", not "how"

Blazer-007 · 2025-09-05T09:18:10Z

.../main/java/org/apache/gobblin/data/management/copy/iceberg/IcebergPartitionCopyableFile.java

+  public String getBase64EncodedDataFile() {
+    return this.base64EncodedDataFile;
+  }


this can be removed

Blazer-007 · 2025-09-05T09:36:45Z

...t/src/main/java/org/apache/gobblin/data/management/copy/iceberg/IcebergPartitionDataset.java

+  private Map<Path, FileStatus> calcSrcFileStatusByDestFilePath(
+      Map<Path, DataFile> destDataFileBySrcPath,
+      Map<Path, Path> destPathToSrcPath) throws IOException {
    Function<Path, FileStatus> getFileStatus = CheckedExceptionFunction.wrapToTunneled(this.sourceFs::getFileStatus);
-    Map<Path, FileStatus> srcFileStatusByDestFilePath = new ConcurrentHashMap<>();
+    final Map<Path, FileStatus> srcFileStatusByDestFilePath = new ConcurrentHashMap<>();
    try {
-      srcFileStatusByDestFilePath = destDataFileBySrcPath.entrySet()
+      destDataFileBySrcPath.entrySet()
          .parallelStream()
-          .collect(Collectors.toConcurrentMap(entry -> new Path(entry.getValue().path().toString()),
-              entry -> getFileStatus.apply(entry.getKey())));
+          .forEach(entry -> {
+                Path destPath = new Path(entry.getValue().path().toString());
+                destPathToSrcPath.put(destPath, entry.getKey());
+                srcFileStatusByDestFilePath.put(destPath, getFileStatus.apply(entry.getKey()));
+          });


nit: Let's move this logic into the caller itself and see whether we can parallelize that for-loop to reduce runtime.
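
One way to read this suggestion is sketched below, under stated assumptions. It is not the change that was merged: `ParallelStatusLookupSketch`, `Result`, and `lookup` are hypothetical names, paths are plain strings, and a `Function<String, Long>` returning a file length stands in for `sourceFs::getFileStatus`. The point is only that a caller can populate both maps from a single parallel pass, provided both are concurrent maps.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

class ParallelStatusLookupSketch {

  /** Holds both maps built in one parallel pass; both are safe for concurrent writes. */
  static final class Result {
    final Map<String, String> destPathToSrcPath = new ConcurrentHashMap<>();
    final Map<String, Long> srcLengthByDestPath = new ConcurrentHashMap<>();
  }

  /**
   * Builds both maps in the caller with one parallel stream.
   * getLength is a hypothetical stand-in for a file status lookup and must not
   * return null, because ConcurrentHashMap rejects null values.
   */
  static Result lookup(Map<String, String> destPathBySrcPath, Function<String, Long> getLength) {
    Result result = new Result();
    destPathBySrcPath.entrySet().parallelStream().forEach(entry -> {
      String srcPath = entry.getKey();
      String destPath = entry.getValue();
      result.destPathToSrcPath.put(destPath, srcPath);
      result.srcLengthByDestPath.put(destPath, getLength.apply(srcPath));
    });
    return result;
  }
}
```

The lookups are independent per entry, so ordering does not matter; the only shared state is the two concurrent maps.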

Blazer-007 · 2025-09-05T09:37:23Z

...ement/src/main/java/org/apache/gobblin/data/management/copy/publisher/CopyDataPublisher.java

+    List<DataFile> icebergDataFiles = new ArrayList<>();
    for (WorkUnitState wus : statesHelper.getNonPostPublishStates()) {
      if (wus.getWorkingState() == WorkingState.SUCCESSFUL) {
        wus.setWorkingState(WorkUnitState.WorkingState.COMMITTED);
      }
      CopyEntity copyEntity = CopySource.deserializeCopyEntity(wus);
-      if (copyEntity instanceof CopyableFile) {
+      if (copyEntity instanceof CopyableFile || copyEntity instanceof IcebergPartitionCopyableFile) {
+        if (copyEntity instanceof IcebergPartitionCopyableFile) {
+          icebergDataFiles.add(((IcebergPartitionCopyableFile) copyEntity).getDataFile());


Let's add a comment describing why this is done.

Blazer-007 · 2025-09-05T09:39:02Z

.../main/java/org/apache/gobblin/data/management/copy/iceberg/IcebergPartitionCopyableFile.java

+
+  public DataFile getDataFile() {
+    return SerializationUtil.deserializeFromBase64(base64EncodedDataFile);
+  }


nit: Let's add Javadoc here too.

thisisArjit marked this pull request as ready for review September 5, 2025 09:15

Blazer-007 reviewed Sep 5, 2025

thisisArjit force-pushed the iceberg-data-files branch 3 times, most recently from 734354f to 71bb739 Compare September 5, 2025 16:05

thisisArjit changed the title ~~Construct iceberg data files during commit step~~ [GOBBLIN-2226] Construct iceberg data files during commit step Sep 6, 2025

Construct iceberg data files during commit step

e0ae58f

thisisArjit force-pushed the iceberg-data-files branch from 71bb739 to e0ae58f Compare September 8, 2025 05:28

Blazer-007 merged commit 5ad93db into apache:master Sep 8, 2025
6 checks passed
