
Add ability to save synthesizers and data when running benchmark_single_table #415


Merged
merged 14 commits into from
Jul 14, 2025

Conversation

R-Palazzo
Contributor

Resolve #410
CU-86b5dy0pa

Thanks in advance for the review. Here are a few questions:

  1. Should we crash if the output_destination already exists? Or should we always overwrite (for instance, if two benchmarks are launched on the same day)?
  2. What should the meta.yaml file include? It is expected to be saved in SDGym_results_mm_dd_yyyy/<dataset_name_mm_dd_yyyy>. I did not create it yet because I was not sure what to put inside it.
  3. Compared to the naming given in the issue:
    • I don't save with a leading underscore (synthetic_data.csv instead of _synthetic_data.csv). Is that okay?
    • The run<id>.yaml is at the output_destination, alongside SDGym_results_mm_dd_yyyy. Is that correct, or should it be inside SDGym_results_mm_dd_yyyy?
    • All the combined results are saved in a results.csv file inside SDGym_results_mm_dd_yyyy.
  4. In run<id>.yaml I defined starting_date and completed_date, which correspond to the times the benchmark was started and fully completed.

@R-Palazzo R-Palazzo requested review from rwedge and amontanez24 June 27, 2025 13:32
@R-Palazzo R-Palazzo self-assigned this Jun 27, 2025
@R-Palazzo R-Palazzo requested a review from a team as a code owner June 27, 2025 13:32

@R-Palazzo R-Palazzo removed the request for review from a team June 27, 2025 13:32

codecov bot commented Jun 27, 2025

Codecov Report

Attention: Patch coverage is 98.92473% with 1 line in your changes missing coverage. Please review.

Project coverage is 68.56%. Comparing base (9795485) to head (662786d).
Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
sdgym/benchmark.py 98.92% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #415      +/-   ##
==========================================
+ Coverage   66.46%   68.56%   +2.09%     
==========================================
  Files          20       20              
  Lines        1330     1422      +92     
==========================================
+ Hits          884      975      +91     
- Misses        446      447       +1     
Flag Coverage Δ
integration 58.43% <89.24%> (+2.19%) ⬆️
unit 54.92% <82.79%> (+1.91%) ⬆️

Flags with carried forward coverage won't be shown.


@amontanez24
Contributor

Resolve #410 CU-86b5dy0pa

Thanks in advance for the review. Here are a few questions:

  1. Should we crash if the output_destination already exists? Or should we always overwrite (for instance, if two benchmarks are launched on the same day)?

  2. What should the meta.yaml file include? It is expected to be saved in SDGym_results_mm_dd_yyyy/<dataset_name_mm_dd_yyyy>. I did not create it yet because I was not sure what to put inside it.

  3. Compared to the naming given in the issue:

    • I don't save with a leading underscore (synthetic_data.csv instead of _synthetic_data.csv). Is that okay?
    • The run<id>.yaml is at the output_destination, alongside SDGym_results_mm_dd_yyyy. Is that correct, or should it be inside SDGym_results_mm_dd_yyyy?
    • All the combined results are saved in a results.csv file inside SDGym_results_mm_dd_yyyy.
  4. In run<id>.yaml I defined starting_date and completed_date, which correspond to the times the benchmark was started and fully completed.

  1. If the output_destination already exists, I think we can still write to it. If the same synthesizer/dataset combination is passed, then we can overwrite it. Otherwise, we can just make the new folders within the SDGym_results_mm_dd_yyyy folder.
  2. This might not have been worded well, but I think the run<id>.yaml can replace the meta.yaml. It should have the sdgym version, the synthesizer library version, and the list of jobs (e.g. [(ctgan, adult), (ctgan, census), ...]).
  3. a. I think the issue got badly formatted. It should be <synthesizer_name>_synthetic_data.csv. For example, ctgan_synthetic_data.csv.
    b. I think it's fine to have both. I'm also ok if you only have the outer one.
    c. That's good.
  4. This is good.
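To make the agreed layout concrete, here is a minimal sketch of the folder structure discussed above. The function name `build_output_paths` and the exact path-joining logic are illustrative assumptions based on this thread, not the merged SDGym implementation:

```python
from datetime import datetime
from pathlib import Path


def build_output_paths(output_destination, dataset_name, synthesizer_name):
    """Sketch of the output layout discussed in this PR (names are assumed).

    Creates SDGym_results_mm_dd_yyyy/<dataset_name>_mm_dd_yyyy under the
    output destination, reusing existing folders so that runs launched on
    the same day share the same results directory.
    """
    today = datetime.now().strftime('%m_%d_%Y')
    results_dir = Path(output_destination) / f'SDGym_results_{today}'
    dataset_dir = results_dir / f'{dataset_name}_{today}'
    dataset_dir.mkdir(parents=True, exist_ok=True)  # overwrite-friendly
    return {
        'results': results_dir / 'results.csv',
        'synthetic_data': dataset_dir / f'{synthesizer_name}_synthetic_data.csv',
    }
```

This reflects answers 1 and 3a: existing folders are reused rather than crashed on, and the synthetic data file is prefixed with the synthesizer name.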

Comment on lines 875 to 893
message = (
    f"Parameters '{parameters}' are deprecated in the `benchmark_single_table` "
    'function and will be removed in October 2025. '
    'Please consider using `output_destination` instead.'
)
Contributor Author

Let me know if this warning message makes sense. I introduced output_destination, but not all the deprecated parameters relate to saving data.

Contributor

hmm good question. I think we should deprecate run_on_ec2 in the next issue when you add the new benchmark function. You can be more descriptive here and say:
For saving results, please use 'output_destination'. For running SDGym remotely on AWS, please use ...
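A hedged sketch of the more descriptive warning suggested here. The wording and the `parameters` argument are illustrative; the AWS-related half of the message was left elided ("please use ...") in the review, so it is only noted in a comment:

```python
import warnings


def warn_deprecated(parameters):
    """Emit the deprecation warning suggested in this review (wording assumed)."""
    message = (
        f"Parameters '{parameters}' are deprecated in the `benchmark_single_table` "
        'function and will be removed in October 2025. '
        "For saving results, please use 'output_destination'."
        # The review leaves the AWS replacement unspecified ("please use ...");
        # run_on_ec2 is to be deprecated in a follow-up issue.
    )
    warnings.warn(message, DeprecationWarning)
    return message
```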

Comment on lines +922 to +936
with open(run_file, 'r') as f:
    run_data = yaml.safe_load(f) or {}
Contributor

do we have to grab a lock here or worry about multiple runs trying to modify this file at the same time?

Contributor Author

No we're safe here because the method is called after all the jobs are run and the results generated.
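For illustration, a read-merge-write helper in the spirit of the snippet under discussion. The function name and fields are hypothetical, not SDGym's actual API; as noted above, no lock is needed under the assumption that this runs once, after all jobs finish:

```python
import yaml  # PyYAML; the run metadata files in this PR are YAML
from pathlib import Path


def update_run_file(run_file, updates):
    """Merge new fields into a run<id>.yaml metadata file (sketch).

    Assumes single-writer access: per the discussion, this is called
    once after all benchmark jobs have run, so no file lock is taken.
    """
    run_file = Path(run_file)
    run_data = {}
    if run_file.exists():
        with open(run_file, 'r') as f:
            run_data = yaml.safe_load(f) or {}
    run_data.update(updates)
    with open(run_file, 'w') as f:
        yaml.safe_dump(run_data, f)
    return run_data
```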

Comment on lines 627 to 628
else:
    scores.to_csv(result_file, index=False, mode='a', header=False)
Contributor

is it possible that two runs might try to access this file at the same time?
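For context, a minimal sketch of the append-or-create pattern the snippet above is part of, assuming `scores` is a pandas DataFrame (the `to_csv` call implies this). `append_scores` is a hypothetical helper, not SDGym code:

```python
import os

import pandas as pd


def append_scores(scores, result_file):
    """Append benchmark scores to a shared results.csv (sketch).

    Writes the header only when creating the file; subsequent calls
    append rows without repeating the header.
    """
    if not os.path.exists(result_file):
        scores.to_csv(result_file, index=False)
    else:
        scores.to_csv(result_file, index=False, mode='a', header=False)
```

Note this pattern is not safe against two processes appending concurrently; it relies on the single-process assumption discussed above.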

@R-Palazzo R-Palazzo force-pushed the issue-410-save-synthesizers branch from eab6ecb to 662786d Compare July 14, 2025 07:27
@R-Palazzo R-Palazzo merged commit 66b76b9 into main Jul 14, 2025
55 checks passed
@R-Palazzo R-Palazzo deleted the issue-410-save-synthesizers branch July 14, 2025 07:53