Conversation

@thisisArjit (Contributor) commented Aug 21, 2025

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

  • GOBBLIN-2223

Description

  • string.getBytes() performs an additional memory allocation to produce a byte[]
  • Instead of the above approach, write the string by iterating over each character, with no additional allocation (see the sketch below)
  • Once we start writing Iceberg data files to the file system, so that the length of the serialised work unit stays within a reasonable limit, we can move back to the original approach
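
A minimal sketch of the char-by-char idea, assuming a java.io.DataOutput sink; the class and method names are illustrative, not the PR's actual API:

import java.io.DataOutput;
import java.io.IOException;

public class CharByCharWrite {
    // Sketch: write the string without materialising the byte[] that
    // str.getBytes(...) would allocate. Each char goes out as 2 bytes
    // (high-order byte, then low-order byte), matching DataOutput.writeChar
    // semantics. A length prefix (the PR uses a vlong helper) would precede
    // this loop in the real serializer.
    public static void writeStringChars(DataOutput stream, String str) throws IOException {
        for (int i = 0; i < str.length(); i++) {
            stream.writeChar(str.charAt(i));
        }
    }
}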

Tests

  • Updated the existing unit test to also cover characters outside the ASCII range (> 127)
  • Manual replication of 13.5k work units, copying ~35 TB of data across 66k files

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

@thisisArjit thisisArjit marked this pull request as ready for review August 22, 2025 06:18
@thisisArjit thisisArjit force-pushed the optimise-writing-workunit branch from 9482920 to 9eeca09 on August 26, 2025 05:37
@thisisArjit thisisArjit changed the title Optimise writing of serialised Work Unit to File system [GOBBLIN-2223] Optimise writing of serialised Work Unit to File system Aug 26, 2025
@Blazer-007 Blazer-007 requested a review from Copilot September 1, 2025 17:08
Copilot AI left a comment

Pull Request Overview

This PR optimizes the serialization and deserialization of Work Units to the file system by eliminating unnecessary memory allocations in the TextSerializer class. The optimization replaces string.getBytes(), which creates an additional byte array, with direct character-by-character writing and reading.

  • Removed UTF-8 byte array creation during serialization, writing strings directly character-by-character
  • Modified deserialization to use StringBuilder instead of byte array allocation
  • Eliminated import of StandardCharsets as it's no longer needed


Comment on lines +33 to +34
writeVLong(stream, str.length());
stream.writeBytes(str);
Copilot AI Sep 1, 2025

Using str.length() for the length will cause deserialization errors for multi-byte UTF-8 characters. The length should represent the number of bytes, not the number of characters. Multi-byte UTF-8 characters will have different byte lengths than character counts.
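
To make the mismatch concrete, a minimal JDK-only illustration (the sample string is arbitrary):

import java.nio.charset.StandardCharsets;

public class LengthMismatchDemo {
    public static void main(String[] args) {
        String s = "länge"; // 5 chars, but 'ä' encodes to 2 bytes in UTF-8
        System.out.println(s.length());                                // 5
        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 6
    }
}

A char-count prefix followed by UTF-8 bytes would therefore under-read the payload by one byte here.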


- writeVLong(stream, utf8Encoded.length);
- stream.write(utf8Encoded);
+ writeVLong(stream, str.length());
+ stream.writeBytes(str);
Copilot AI Sep 1, 2025

DataOutput.writeBytes() only writes the low 8 bits of each character, which will corrupt any characters outside the ASCII range (0-127). This breaks Unicode support that was previously handled by UTF-8 encoding.
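
A minimal JDK-only demonstration of the truncation (the katakana character is arbitrary):

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class WriteBytesTruncationDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeBytes("ス"); // U+30B9: writeBytes keeps only the low 8 bits
        // Prints 0xB9; the high-order byte 0x30 is silently lost.
        System.out.printf("0x%02X%n", bos.toByteArray()[0] & 0xFF);
    }
}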


Member

I think this is a good suggestion - https://www.cs.helsinki.fi/group/boi2016/doc/java/api/java/io/DataOutput.html#writeBytes-java.lang.String-
Should we have some handling for this as well? @thisisArjit

for (int i = 0; i < str.length(); i++) {
    if (str.charAt(i) > 0x7F) {
        throw new IllegalArgumentException("Non-ASCII character detected.");
    }
}
writeVLong(stream, str.length());
stream.writeBytes(str); // writes 1 byte per character


- return new String(buf, StandardCharsets.UTF_8);
+ for (int i = 0; i < bufLen; i++) {
+     sb.append((char) in.readByte());
Copilot AI Sep 1, 2025

Casting a byte directly to char will produce incorrect results for multi-byte UTF-8 characters. This approach only works correctly for ASCII characters (0-127) and will corrupt Unicode text.
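
A minimal JDK-only illustration of that corruption:

import java.nio.charset.StandardCharsets;

public class ByteToCharCastDemo {
    public static void main(String[] args) {
        byte[] utf8 = "é".getBytes(StandardCharsets.UTF_8); // two bytes: 0xC3 0xA9
        StringBuilder sb = new StringBuilder();
        for (byte b : utf8) {
            sb.append((char) (b & 0xFF)); // each byte treated as its own char
        }
        System.out.println(sb); // prints "Ã©" (mojibake), not "é"
    }
}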


Comment on lines 32 to 35
  public static void writeStringAsText(DataOutput stream, String str) throws IOException {
-     byte[] utf8Encoded = str.getBytes(StandardCharsets.UTF_8);
-     writeVLong(stream, utf8Encoded.length);
-     stream.write(utf8Encoded);
+     writeVLong(stream, str.length());
+     stream.writeBytes(str);
  }
Copilot AI Sep 1, 2025

The method name suggests Hadoop Text compatibility, but the implementation is no longer compatible with Hadoop's Text serialization format, which uses UTF-8 byte encoding. This could break interoperability with Hadoop systems.


@thisisArjit thisisArjit force-pushed the optimise-writing-workunit branch from 152e06b to 83acb7e on September 2, 2025 06:24
}

@Test
public void testDeserialize() throws IOException {
thisisArjit (Contributor Author)

Removed this test as it reads using Hadoop Text, which reads byte by byte. For every character, we are writing 2 bytes: one for the higher-order byte and another for the lower-order byte.
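
A minimal sketch of why a byte-by-byte Hadoop Text reader no longer applies, assuming DataOutput.writeChar semantics for the 2-byte writes:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class TwoBytesPerCharDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeChar('A'); // high-order byte 0x00, then low-order byte 0x41
        byte[] bytes = bos.toByteArray();
        System.out.printf("%d bytes: 0x%02X 0x%02X%n", bytes.length, bytes[0], bytes[1]);
        // A byte-by-byte UTF-8 reader (Hadoop Text) would treat the 0x00 as data.
    }
}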

@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 55.20%. Comparing base (26b3b77) to head (83acb7e).
⚠️ Report is 7 commits behind head on master.

Additional details and impacted files
@@              Coverage Diff              @@
##             master    #4133       +/-   ##
=============================================
+ Coverage     42.81%   55.20%   +12.39%     
+ Complexity     2480     1594      -886     
=============================================
  Files           513      310      -203     
  Lines         21744    10697    -11047     
  Branches       2478     1074     -1404     
=============================================
- Hits           9309     5905     -3404     
+ Misses        11481     4286     -7195     
+ Partials        954      506      -448     


@thisisArjit thisisArjit force-pushed the optimise-writing-workunit branch from 83acb7e to d5e28e9 on September 2, 2025 10:26
@Blazer-007 Blazer-007 merged commit 88555c2 into apache:master Sep 2, 2025
6 checks passed