Skip to content

Conversation

thisisArjit
Copy link
Contributor

@thisisArjit thisisArjit commented Aug 21, 2025

…son string

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

Description

  • Use random id for workunits if recovery helper is not initialised.
  • This is done to remove unnecessary serialisation while getting workunit guid which can add significant memory pressure on service.

Tests

  • Manual replication for 13.5k workunits for copying ~35tb data & 66k files

Commits

  • My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

@thisisArjit thisisArjit marked this pull request as ready for review August 22, 2025 06:18
@thisisArjit thisisArjit force-pushed the change-copyEntity-id-logic branch from 2580b54 to 1d3aef3 Compare August 26, 2025 05:05
Comment on lines 70 to 74
@Override
public Guid guid() throws IOException {
return Guid.fromStrings(toString());
return id;
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change is not fully backward compatible, given guid usage in DistcpSplitter and RecoveryHelper which are used in CopyDataPublisher and FIleAwareInputStreamWriter, lets make this change atleast backward compatible by adding a new function which accepts boolean createGuidFromUuid and pass this as yes from the caller where we are anticipating this to be called and in case no default to original way

@Blazer-007
Copy link
Member

@thisisArjit Add Jira ticket to description and title

@Blazer-007 Blazer-007 requested a review from Copilot September 1, 2025 17:09
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds a unique identifier to the CopyEntity class to improve performance by avoiding expensive serialization operations when computing entity IDs. Instead of computing the GUID from a serialized JSON string, each CopyEntity now has a pre-generated UUID-based identifier.

  • Replace GUID computation from JSON serialization with pre-generated UUID
  • Add debug logging guard to prevent unnecessary string concatenation
  • Import UUID class for unique ID generation

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@@ -54,6 +55,8 @@ public class CopyEntity implements HasGuid {

public static final Gson GSON = GsonInterfaceAdapter.getGson(Object.class);

private final Guid id = new Guid(UUID.randomUUID().toString().getBytes());
Copy link
Preview

Copilot AI Sep 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The UUID is converted to string then to bytes, which is inefficient. Consider using UUID.randomUUID().toString().getBytes(StandardCharsets.UTF_8) to specify the charset explicitly, or better yet, use UUID.randomUUID().toString() directly if the Guid constructor accepts strings.

Copilot uses AI. Check for mistakes.

@thisisArjit thisisArjit force-pushed the change-copyEntity-id-logic branch from 5f0bb3e to 21489bb Compare September 2, 2025 05:42
@thisisArjit thisisArjit changed the title Add unique id to CopyEntity Use random id for Workunit if recovery helper is not initialised Sep 2, 2025
@thisisArjit thisisArjit changed the title Use random id for Workunit if recovery helper is not initialised [GOBBLIN-2225] Use random id for Workunit if recovery helper is not initialised Sep 2, 2025
@codecov-commenter
Copy link

codecov-commenter commented Sep 2, 2025

Codecov Report

❌ Patch coverage is 75.00000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 51.64%. Comparing base (26b3b77) to head (ed8f859).
⚠️ Report is 7 commits behind head on master.

Files with missing lines Patch % Lines
...pache/gobblin/data/management/copy/CopyEntity.java 0.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #4134      +/-   ##
============================================
+ Coverage     42.81%   51.64%   +8.82%     
- Complexity     2480     7660    +5180     
============================================
  Files           513     1397     +884     
  Lines         21744    52892   +31148     
  Branches       2478     5807    +3329     
============================================
+ Hits           9309    27314   +18005     
- Misses        11481    23265   +11784     
- Partials        954     2313    +1359     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@thisisArjit thisisArjit force-pushed the change-copyEntity-id-logic branch from ed8f859 to ce77ad6 Compare September 2, 2025 14:19
@Blazer-007 Blazer-007 merged commit 6849776 into apache:master Sep 3, 2025
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants