[GOBBLIN-2223] Optimise writing of serialised Work Unit to File system #4133
Conversation
Force-pushed from 9482920 to 9eeca09 (compare).
Pull Request Overview
This PR optimizes the serialization and deserialization of Work Units to the file system by eliminating unnecessary memory allocations in the TextSerializer class. It replaces string.getBytes(), which allocates an additional byte array, with direct character-by-character writing and reading.
- Removed UTF-8 byte array creation during serialization, writing strings directly character-by-character
- Modified deserialization to use StringBuilder instead of byte array allocation
- Eliminated import of StandardCharsets as it's no longer needed
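To illustrate the allocation the PR targets: each call to `String.getBytes(...)` returns a freshly allocated copy of the encoded bytes, so serializing many Work Units this way produces one throwaway array per string. A minimal sketch (the string content here is made up for illustration):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class GetBytesAllocationDemo {
    public static void main(String[] args) {
        String s = "some serialized work unit state";

        // Two calls, two distinct arrays: getBytes() cannot return a cached buffer
        // because the caller is free to mutate the result.
        byte[] a = s.getBytes(StandardCharsets.UTF_8);
        byte[] b = s.getBytes(StandardCharsets.UTF_8);

        System.out.println(a != b);            // true: separate allocations
        System.out.println(Arrays.equals(a, b)); // true: identical contents
    }
}
```

Writing characters directly to the stream avoids this per-string allocation, which is the trade the PR makes (at the cost of the encoding concerns raised in the review below).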
```java
writeVLong(stream, str.length());
stream.writeBytes(str);
```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using str.length() for the length will cause deserialization errors for multi-byte UTF-8 characters: the length prefix should represent the number of encoded bytes, not the number of characters, and for multi-byte UTF-8 characters the two differ.
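A small sketch of the char-count vs byte-count mismatch this comment describes (example strings are illustrative):

```java
import java.nio.charset.StandardCharsets;

public class Utf8LengthDemo {
    public static void main(String[] args) {
        String ascii = "work-unit";    // ASCII only
        String unicode = "wörk-ünit";  // ö and ü each encode to 2 UTF-8 bytes

        // For pure ASCII, character count equals UTF-8 byte count.
        System.out.println(ascii.length()
                == ascii.getBytes(StandardCharsets.UTF_8).length); // true

        // For multi-byte characters the counts diverge, so a reader that
        // trusts str.length() as a byte count will read too few bytes.
        System.out.println(unicode.length());                                // 9
        System.out.println(unicode.getBytes(StandardCharsets.UTF_8).length); // 11
    }
}
```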
```diff
-writeVLong(stream, utf8Encoded.length);
-stream.write(utf8Encoded);
+writeVLong(stream, str.length());
+stream.writeBytes(str);
```
DataOutput.writeBytes() only writes the low 8 bits of each character, which will corrupt any character outside the ASCII range (0-127). This breaks the Unicode support that was previously provided by UTF-8 encoding.
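A minimal demonstration of the truncation: `writeBytes` keeps only the low-order byte of each `char`, so a two-byte UTF-8 character like `é` (U+00E9) is reduced to a single byte that no longer decodes back to the original string:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class WriteBytesTruncationDemo {
    public static void main(String[] args) throws IOException {
        String s = "é"; // U+00E9; UTF-8 encodes it as two bytes: 0xC3 0xA9

        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new DataOutputStream(bos).writeBytes(s);
        byte[] viaWriteBytes = bos.toByteArray(); // one byte: 0xE9 (low 8 bits)

        byte[] viaUtf8 = s.getBytes(StandardCharsets.UTF_8); // two bytes

        System.out.println(viaWriteBytes.length + " vs " + viaUtf8.length); // 1 vs 2
        // The truncated byte is not valid UTF-8 for "é", so the round trip fails.
        System.out.println(new String(viaWriteBytes, StandardCharsets.UTF_8).equals(s)); // false
    }
}
```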
I think this is a good suggestion - https://www.cs.helsinki.fi/group/boi2016/doc/java/api/java/io/DataOutput.html#writeBytes-java.lang.String-
Should we also have some handling for this? @thisisArjit
```java
for (int i = 0; i < str.length(); i++) {
  if (str.charAt(i) > 0x7F) {
    throw new IllegalArgumentException("Non-ASCII character detected.");
  }
}
writeVLong(stream, str.length());
stream.writeBytes(str); // writes 1 byte per character
```
```diff
-return new String(buf, StandardCharsets.UTF_8);
+for (int i = 0; i < bufLen; i++) {
+  sb.append((char) in.readByte());
```
Casting a byte directly to char will produce incorrect results for multi-byte UTF-8 characters. This approach only works correctly for ASCII characters (0-127) and will corrupt Unicode text.
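A short sketch of why the byte-to-char cast corrupts multi-byte input: bytes above 0x7F are sign-extended during the cast, so the two bytes of `é` become two garbage chars instead of being decoded as one code point:

```java
import java.nio.charset.StandardCharsets;

public class ByteCastDecodeDemo {
    public static void main(String[] args) {
        byte[] utf8 = "café".getBytes(StandardCharsets.UTF_8); // 5 bytes: 'é' takes 2

        // Reconstruct by casting each byte to char, as the patch does.
        StringBuilder sb = new StringBuilder();
        for (byte b : utf8) {
            sb.append((char) b); // bytes >= 0x80 sign-extend to e.g. 0xFFC3
        }
        String castDecoded = sb.toString();

        // Proper UTF-8 decoding recovers the original string.
        String utf8Decoded = new String(utf8, StandardCharsets.UTF_8);

        System.out.println(castDecoded.equals("café")); // false
        System.out.println(utf8Decoded.equals("café")); // true
    }
}
```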
```diff
 public static void writeStringAsText(DataOutput stream, String str) throws IOException {
-  byte[] utf8Encoded = str.getBytes(StandardCharsets.UTF_8);
-  writeVLong(stream, utf8Encoded.length);
-  stream.write(utf8Encoded);
+  writeVLong(stream, str.length());
+  stream.writeBytes(str);
 }
```
The method name suggests Hadoop Text compatibility, but the implementation is no longer compatible with Hadoop's Text serialization format, which uses UTF-8 byte encoding. This could break interoperability with Hadoop systems.
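One hedged way to keep the vint-plus-UTF-8-bytes wire format while still avoiding the intermediate `getBytes()` array would be to compute the UTF-8 byte length directly from the char sequence and stream the encoded bytes one at a time. This is a sketch, not the PR's code; the helper names `utf8Length` and `writeUtf8` are illustrative:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8StreamWrite {
    // Counts UTF-8 bytes without allocating an encoded array.
    static int utf8Length(String s) {
        int len = 0;
        for (int i = 0; i < s.length(); i++) {
            int cp = s.codePointAt(i);
            if (cp >= 0x10000) i++; // skip the low surrogate of a pair
            len += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        }
        return len;
    }

    // Streams the UTF-8 encoding byte by byte, no intermediate array.
    static void writeUtf8(DataOutput out, String s) throws IOException {
        for (int i = 0; i < s.length(); i++) {
            int cp = s.codePointAt(i);
            if (cp >= 0x10000) i++;
            if (cp < 0x80) {
                out.write(cp);
            } else if (cp < 0x800) {
                out.write(0xC0 | (cp >> 6));
                out.write(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                out.write(0xE0 | (cp >> 12));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            } else {
                out.write(0xF0 | (cp >> 18));
                out.write(0x80 | ((cp >> 12) & 0x3F));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            }
        }
    }

    public static void main(String[] args) throws IOException {
        String s = "wörk 🚀"; // mixes 1-, 2-, and 4-byte code points
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        writeUtf8(new DataOutputStream(bos), s);
        // Matches the JDK's own UTF-8 byte count and encoding.
        System.out.println(utf8Length(s) == s.getBytes(StandardCharsets.UTF_8).length); // true
        System.out.println(Arrays.equals(bos.toByteArray(), s.getBytes(StandardCharsets.UTF_8))); // true
    }
}
```

This keeps the byte-length prefix and UTF-8 payload the old format used, so existing readers would remain compatible, while only the temporary array is eliminated.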
Force-pushed from 152e06b to 83acb7e (compare).
```diff
-}
-
-@Test
-public void testDeserialize() throws IOException {
```
Removed this test, as it reads using Hadoop Text, which reads byte by byte. For every character, we are writing two bytes: one for the higher-order byte and another for the lower-order byte.
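Assuming the two-bytes-per-character behavior referred to here is that of `DataOutput.writeChars()`, which emits each `char` as a high-order byte followed by a low-order byte, a minimal sketch:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class WriteCharsDemo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new DataOutputStream(bos).writeChars("AB");
        byte[] out = bos.toByteArray();

        // Each char becomes two bytes: high-order byte first, then low-order.
        System.out.println(out.length);            // 4
        System.out.println(out[0] + " " + out[1]); // 0 65  ('A' == 0x0041)
    }
}
```

A reader that consumes this stream one byte at a time, expecting one byte per character, would see interleaved zero bytes, which is consistent with the test being removed.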
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@             Coverage Diff              @@
##             master    #4133      +/-  ##
=============================================
+ Coverage      42.81%   55.20%   +12.39%
+ Complexity      2480     1594      -886
=============================================
  Files            513      310      -203
  Lines          21744    10697    -11047
  Branches        2478     1074     -1404
=============================================
- Hits            9309     5905     -3404
+ Misses         11481     4286     -7195
+ Partials         954      506      -448
```

View full report in Codecov by Sentry.
Force-pushed from 83acb7e to d5e28e9 (compare).
Dear Gobblin maintainers,
Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
- JIRA
- Description
- Tests
- Commits