Skip to content

feat: Chunkify single parts to generate them in parallel #155

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Jul 29, 2025

Conversation

clflushopt
Copy link
Owner

@clflushopt clflushopt commented Jul 14, 2025

@clflushopt clflushopt requested review from alamb and removed request for alamb July 14, 2025 03:50
Copy link
Collaborator

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me

I ran with

cargo run --release -- --tables lineitem --scale-factor 1000 --part 1 --parts 100 --format=parquet

And it kept all my cores busy as promised.

Thank yoU @clflushopt

@alamb alamb marked this pull request as ready for review July 14, 2025 20:16
@alamb
Copy link
Collaborator

alamb commented Jul 15, 2025

I made a PR to add some integration tests here

I think #156 covers the correctness aspect of this PR

I am not sure it covers the "make it parallel" functional aspect, but then again I don 't really have a great story about that

@alamb
Copy link
Collaborator

alamb commented Jul 15, 2025

I wrote up some high level thoughts here

@clflushopt
Copy link
Owner Author

Closing this PR since the approach used doesn't guarantee consistent output.

@clflushopt clflushopt closed this Jul 22, 2025
@alamb alamb reopened this Jul 23, 2025
@alamb
Copy link
Collaborator

alamb commented Jul 23, 2025

I merged up from main -- there are a few test items i need to fix

Copy link
Collaborator

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@clflushopt -- if you have a moment perhaps you could look at this PR

I found:

  1. The "split the job into multiple parts" works well (we just have to avoid Region and Nation tables) -- it passes tests well
  2. The interplay of parts / part and multiple files is a little confusing
  3. There is no good way to return errors

My suggested path forward is:

  1. Go with the approach in this PR
  2. Make a follow on refactoring PR that changes parts and part to Option<i32> and defaults them to None to make the logic clear

What are your thoughts

@clflushopt
Copy link
Owner Author

clflushopt commented Jul 29, 2025

Sounds good @alamb did a sanity check at scale factor 10 with master it all looks good ! About your comment yes, I do agree that parts, part is a bit confusing and switching to Option<i32> could make the logic more clear.

Copy link
Owner Author

@clflushopt clflushopt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be merged once you approve @alamb thanks for picking this one up !

Copy link
Collaborator

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @clflushopt

@alamb alamb merged commit d74abf7 into main Jul 29, 2025
7 checks passed
@alamb alamb deleted the cl/feat/parallel-gen-single-parts branch July 29, 2025 11:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEATURE] Mutli-threaded generation of specific parts (when --part and --parts are specified)
2 participants