
Add ability to save synthesizers and data when running benchmark_single_table #415


Merged
merged 14 commits into from
Jul 14, 2025

Conversation

R-Palazzo
Contributor

Resolve #410
CU-86b5dy0pa

Thanks in advance for the review. Here are a few questions:

  1. Should we crash if the output_destination already exists? Or should we always overwrite (for instance, if two benchmarks are launched on the same day)?
  2. What should the meta.yaml file include? It is expected to be saved in SDGym_results_mm_dd_yyyy/<dataset_name_mm_dd_yyyy>. I did not create it yet because I was not sure what to put inside it.
  3. Compared to the naming given in the issue:
    • I don't save with a leading underscore (synthetic_data.csv instead of _synthetic_data.csv). Is that okay?
    • The run<id>.yaml is at the output_destination, alongside SDGym_results_mm_dd_yyyy. Is that correct, or should it be inside SDGym_results_mm_dd_yyyy?
    • All the combined results are saved in a results.csv file inside SDGym_results_mm_dd_yyyy.
  4. In run<id>.yaml I defined starting_date and completed_date, which correspond to the times the benchmark was started and fully completed.

@R-Palazzo R-Palazzo requested review from rwedge and amontanez24 June 27, 2025 13:32
@R-Palazzo R-Palazzo self-assigned this Jun 27, 2025
@R-Palazzo R-Palazzo requested a review from a team as a code owner June 27, 2025 13:32

@R-Palazzo R-Palazzo removed the request for review from a team June 27, 2025 13:32

codecov bot commented Jun 27, 2025

Codecov Report

Attention: Patch coverage is 98.92473% with 1 line in your changes missing coverage. Please review.

Project coverage is 68.56%. Comparing base (9795485) to head (662786d).
Report is 1 commit behind head on main.

Files with missing lines Patch % Lines
sdgym/benchmark.py 98.92% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #415      +/-   ##
==========================================
+ Coverage   66.46%   68.56%   +2.09%     
==========================================
  Files          20       20              
  Lines        1330     1422      +92     
==========================================
+ Hits          884      975      +91     
- Misses        446      447       +1     
Flag Coverage Δ
integration 58.43% <89.24%> (+2.19%) ⬆️
unit 54.92% <82.79%> (+1.91%) ⬆️

Flags with carried forward coverage won't be shown.


@amontanez24
Contributor

Resolve #410 CU-86b5dy0pa

Thanks in advance for the review. Here are a few questions:

  1. Should we crash if the output_destination already exists? Or should we always overwrite (for instance, if two benchmarks are launched on the same day)?

  2. What should the meta.yaml file include? It is expected to be saved in SDGym_results_mm_dd_yyyy/<dataset_name_mm_dd_yyyy>. I did not create it yet because I was not sure what to put inside it.

  3. Compared to the naming given in the issue:

    • I don't save with a leading underscore (synthetic_data.csv instead of _synthetic_data.csv). Is that okay?
    • The run<id>.yaml is at the output_destination, alongside SDGym_results_mm_dd_yyyy. Is that correct, or should it be inside SDGym_results_mm_dd_yyyy?
    • All the combined results are saved in a results.csv file inside SDGym_results_mm_dd_yyyy.
  4. In run<id>.yaml I defined starting_date and completed_date, which correspond to the times the benchmark was started and fully completed.

  1. If the output_destination already exists, I think we can still write to it. If the same synthesizer/dataset combination is passed, then we can overwrite it. Otherwise, we can just make the new folders within the SDGym_results_mm_dd_yyyy folder.
  2. This might not have been worded well, but I think the run<id>.yaml can replace the meta.yaml. It should have the sdgym version, the synthesizer library version, and the list of jobs (e.g. [(ctgan, adult), (ctgan, census), ...]).
  3. a. I think the issue got badly formatted. It should be <synthesizer_name>_synthetic_data.csv. For example, ctgan_synthetic_data.csv.
    b. I think it's fine to have both. I'm also ok if you only have the outer one.
    c. That's good.
  4. This is good.
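To make the agreed layout concrete, here is a minimal sketch of the folder structure discussed above. The function name `build_output_paths` and the exact path-joining logic are illustrative assumptions based on this thread, not the merged SDGym implementation:

```python
from datetime import datetime
from pathlib import Path


def build_output_paths(output_destination, dataset_name, synthesizer_name):
    """Sketch of the output layout discussed in this PR (names are assumed).

    Creates SDGym_results_mm_dd_yyyy/<dataset_name>_mm_dd_yyyy under the
    output destination, reusing existing folders so that runs launched on
    the same day share the same results directory.
    """
    today = datetime.now().strftime('%m_%d_%Y')
    results_dir = Path(output_destination) / f'SDGym_results_{today}'
    dataset_dir = results_dir / f'{dataset_name}_{today}'
    dataset_dir.mkdir(parents=True, exist_ok=True)  # overwrite-friendly
    return {
        'results': results_dir / 'results.csv',
        'synthetic_data': dataset_dir / f'{synthesizer_name}_synthetic_data.csv',
    }
```

This reflects answers 1 and 3a: existing folders are reused rather than crashed on, and the synthetic data file is prefixed with the synthesizer name.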

Comment on lines 875 to 893
message = (
    f"Parameters '{parameters}' are deprecated in the `benchmark_single_table` "
    'function and will be removed in October 2025. '
    'Please consider using `output_destination` instead.'
)
Contributor Author

Let me know if this warning message makes sense. I introduced output_destination, but not all the deprecated parameters relate to saving data.

Contributor

hmm good question. I think we should deprecate run_on_ec2 in the next issue when you add the new benchmark function. You can be more descriptive here and say:
For saving results, please use 'output_destination'. For running SDGym remotely on AWS, please use ...
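A hedged sketch of the more descriptive warning suggested here. The wording and the `parameters` argument are illustrative; the AWS-related half of the message was left elided ("please use ...") in the review, so it is only noted in a comment:

```python
import warnings


def warn_deprecated(parameters):
    """Emit the deprecation warning suggested in this review (wording assumed)."""
    message = (
        f"Parameters '{parameters}' are deprecated in the `benchmark_single_table` "
        'function and will be removed in October 2025. '
        "For saving results, please use 'output_destination'."
        # The review leaves the AWS replacement unspecified ("please use ...");
        # run_on_ec2 is to be deprecated in a follow-up issue.
    )
    warnings.warn(message, DeprecationWarning)
    return message
```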

Comment on lines +922 to +936
with open(run_file, 'r') as f:
    run_data = yaml.safe_load(f) or {}
Contributor

do we have to grab a lock here or worry about multiple runs trying to modify this file at the same time?

Contributor Author

No we're safe here because the method is called after all the jobs are run and the results generated.
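For illustration, a read-merge-write helper in the spirit of the snippet under discussion. The function name and fields are hypothetical, not SDGym's actual API; as noted above, no lock is needed under the assumption that this runs once, after all jobs finish:

```python
import yaml  # PyYAML; the run metadata files in this PR are YAML
from pathlib import Path


def update_run_file(run_file, updates):
    """Merge new fields into a run<id>.yaml metadata file (sketch).

    Assumes single-writer access: per the discussion, this is called
    once after all benchmark jobs have run, so no file lock is taken.
    """
    run_file = Path(run_file)
    run_data = {}
    if run_file.exists():
        with open(run_file, 'r') as f:
            run_data = yaml.safe_load(f) or {}
    run_data.update(updates)
    with open(run_file, 'w') as f:
        yaml.safe_dump(run_data, f)
    return run_data
```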

Comment on lines 627 to 628
else:
    scores.to_csv(result_file, index=False, mode='a', header=False)
Contributor

is it possible that two runs might try to access this file at the same time?
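For context, a minimal sketch of the append-or-create pattern the snippet above is part of, assuming `scores` is a pandas DataFrame (the `to_csv` call implies this). `append_scores` is a hypothetical helper, not SDGym code:

```python
import os

import pandas as pd


def append_scores(scores, result_file):
    """Append benchmark scores to a shared results.csv (sketch).

    Writes the header only when creating the file; subsequent calls
    append rows without repeating the header.
    """
    if not os.path.exists(result_file):
        scores.to_csv(result_file, index=False)
    else:
        scores.to_csv(result_file, index=False, mode='a', header=False)
```

Note this pattern is not safe against two processes appending concurrently; it relies on the single-process assumption discussed above.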

@R-Palazzo R-Palazzo force-pushed the issue-410-save-synthesizers branch from eab6ecb to 662786d Compare July 14, 2025 07:27
@R-Palazzo R-Palazzo merged commit 66b76b9 into main Jul 14, 2025
55 checks passed
@R-Palazzo R-Palazzo deleted the issue-410-save-synthesizers branch July 14, 2025 07:53