feat(synthetic-sf): add script to generate synthetic structure factors #234
DorisMai wants to merge 6 commits into
…ng gemmi structure.
📝 Walkthrough
This PR introduces synthetic structure factor generation via a new CLI that leverages SFC_Torch. The changes refactor common structure-loading logic into reusable utilities, update an existing density-generation module to use those utilities, and add comprehensive argument parsing and batch-processing support for computing structure factors with optional solvent scaling and R-free flags.
Changes
Synthetic Structure Factor Generation with Refactored Utilities
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
🚥 Pre-merge checks | ✅ 4 passed | ❌ 1 failed (1 warning)
…rd into synthetic_utils
Actionable comments posted: 2
🧹 Nitpick comments (4)
src/sampleworks/eval/synthetic_utils.py (3)
158-158: 💤 Low value
Add explanatory comment for type ignore.
Per coding guidelines, type ignores should include explanatory comments. The `# ty: ignore[invalid-argument-type]` suppresses a type error but doesn't explain why it's safe.
♻️ Proposed fix
- altloc_info = detect_altlocs(atom_array)  # ty: ignore[invalid-argument-type]
+ # ty: ignore[invalid-argument-type] - atom_array is AtomArray after stripping ops,
+ # detect_altlocs signature may be overly narrow
+ altloc_info = detect_altlocs(atom_array)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/sampleworks/eval/synthetic_utils.py` at line 158, The call to detect_altlocs(atom_array) uses a type ignore comment (# ty: ignore[invalid-argument-type]) without justification; update the invocation in synthetic_utils.py to keep the ignore but add a short explanatory comment after it explaining why the type error is safe (e.g., atom_array is known at runtime to match the expected type/has required attributes or originates from a validated source) and reference the specific symbols detect_altlocs and atom_array so future readers/linters understand the rationale.
81-91: ⚡ Quick win
Extra `occ_values` are silently ignored.
When `len(occ_values) > len(altloc_info.altloc_ids)`, extra values are silently discarded without warning. This asymmetry with the "too few values" case (which warns) may surprise callers.
Consider adding a warning for the excess values case:
♻️ Proposed fix
 if len(occ_values) != len(altloc_info.altloc_ids):
+    if len(occ_values) > len(altloc_info.altloc_ids):
+        logger.warning(
+            f"Expected {len(altloc_info.altloc_ids)} occupancy values, got {len(occ_values)}. "
+            "Extra values will be ignored."
+        )
+        occ_values = occ_values[:len(altloc_info.altloc_ids)]
+    else:
-    logger.warning(
-        f"Expected {len(altloc_info.altloc_ids)} occupancy values, got {len(occ_values)}. "
-        "The missing values are automatically set to 0."
-    )
-    occ_values = occ_values + [0.0] * (len(altloc_info.altloc_ids) - len(occ_values))
+        logger.warning(
+            f"Expected {len(altloc_info.altloc_ids)} occupancy values, got {len(occ_values)}. "
+            "The missing values are automatically set to 0."
+        )
+        occ_values = occ_values + [0.0] * (len(altloc_info.altloc_ids) - len(occ_values))
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/sampleworks/eval/synthetic_utils.py` around lines 81 - 91, The code currently only warns when occ_values is shorter than altloc_info.altloc_ids but silently ignores extra occ_values; update the logic around occ_values handling (before the for loop that zips sorted(altloc_info.altloc_ids) with occ_values) to detect when len(occ_values) > len(altloc_info.altloc_ids) and emit a logger.warning indicating how many extra occupancy values were provided and that they will be ignored, then trim occ_values to the expected length; keep the existing branch that pads with zeros when occ_values is shorter and continue assigning occupancy using occupancy[altloc_info.atom_masks[altloc]] in the existing for altloc, occ loop so behavior remains consistent.
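The pad-or-trim behavior discussed in this comment can also be sketched in isolation. The helper below is hypothetical (`normalize_occ_values` and its `n_altlocs` parameter are illustrative names, not part of the PR) and only assumes that the caller knows how many altloc IDs were detected. It is a minimal sketch of the symmetric handling, not the code in `synthetic_utils.py`.

```python
import logging

logger = logging.getLogger(__name__)


def normalize_occ_values(occ_values: list[float], n_altlocs: int) -> list[float]:
    """Return a copy of occ_values with exactly n_altlocs entries.

    Hypothetical helper: warns in both directions instead of only
    when too few values are supplied.
    """
    if len(occ_values) > n_altlocs:
        # Too many values: warn and drop the excess.
        logger.warning(
            "Expected %d occupancy values, got %d. Extra values will be ignored.",
            n_altlocs,
            len(occ_values),
        )
        return list(occ_values[:n_altlocs])
    if len(occ_values) < n_altlocs:
        # Too few values: warn and pad with zero occupancy.
        logger.warning(
            "Expected %d occupancy values, got %d. "
            "The missing values are automatically set to 0.",
            n_altlocs,
            len(occ_values),
        )
        return list(occ_values) + [0.0] * (n_altlocs - len(occ_values))
    # Lengths already match: return an unmodified copy.
    return list(occ_values)
```

Returning a new list in every branch keeps the caller's input untouched, which may matter if the same `occ_values` list is reused across structures in a batch.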
91-91: 💤 Low value
Return type cast may be incorrect for `AtomArrayStack` inputs.
The function signature accepts `AtomArray | AtomArrayStack`, but `cast(AtomArray, result)` always asserts `AtomArray`. If an `AtomArrayStack` is passed, this cast is misleading to type checkers and callers.
If the function genuinely only handles `AtomArray`, narrow the signature. Otherwise, remove the cast or make it conditional.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/sampleworks/eval/synthetic_utils.py` at line 91, The return cast to AtomArray is incorrect when the function accepts AtomArray | AtomArrayStack: update the function (in synthetic_utils.py) so it either narrows the signature to AtomArray if it truly never returns a stack, or remove the unconditional cast and return result with its original type; alternatively, if both are valid, return a union type or perform a runtime check (e.g., isinstance(result, AtomArrayStack)) and cast/annotate conditionally so callers and type checkers see the correct AtomArray vs AtomArrayStack type instead of always using cast(AtomArray, result).
tests/eval/test_generate_synthetic_sf.py (1)
39-44: 💤 Low value
Add return type annotation for consistency.
The `stripped_atom_array` fixture is missing a return type annotation, unlike the other fixtures.
Suggested fix
+from biotite.structure import AtomArray
+
 @pytest.fixture(scope="module")
-def stripped_atom_array(resources_dir: Path):
+def stripped_atom_array(resources_dir: Path) -> AtomArray:
     arr = load_structure_with_altlocs(resources_dir / "6b8x" / "6b8x_final.pdb")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/eval/test_generate_synthetic_sf.py` around lines 39 - 44, stripped_atom_array fixture is missing a return type; update its signature to include the same return type used by the other fixtures (e.g. def stripped_atom_array(resources_dir: Path) -> AtomArray:) and add the appropriate type import if not present. Keep the body unchanged (calls to load_structure_with_altlocs, remove_hydrogens, remove_waters, keep_amino_acids, keep_polymer) but ensure the declared return type matches the actual returned value.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 724f9f85-5506-461e-aa16-c524bebf8acf
⛔ Files ignored due to path filters (1)
- `pixi.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (5)
- pyproject.toml
- src/sampleworks/eval/generate_synthetic_density.py
- src/sampleworks/eval/generate_synthetic_sf.py
- src/sampleworks/eval/synthetic_utils.py
- tests/eval/test_generate_synthetic_sf.py
Parallel(n_jobs=n_jobs, backend="loky")(
    delayed(_process_single_row)(
        row=row,
        base_dir=base_dir,
        output_dir=output_dir,
        dmin=dmin,
        mode=mode,
        occ_mode=occ_mode,
        test_fraction=test_fraction,
        seed=seed,
        device=device,
        strip_hydrogens=strip_hydrogens,
        strip_waters=strip_waters,
        strip_ligands=strip_ligands,
        simulate_solvent_and_scale=simulate_solvent_and_scale,
        save_structure=save_structure,
    )
    for row in rows
)
GPU contention risk when running parallel jobs on CUDA device.
When device is a CUDA GPU and n_jobs > 1, multiple loky worker processes will create separate CUDA contexts and compete for GPU memory. This can cause OOM errors or severe performance degradation. Consider:
- Defaulting to `n_jobs=1` when device is CUDA, or
- Adding a warning when `n_jobs != 1` and device is GPU.
Proposed fix
 def process_batch(
     csv_path: Path,
     base_dir: Path,
     output_dir: Path,
     dmin: float,
     mode: str,
     occ_mode: str,
     test_fraction: float,
     seed: int | None,
     device: torch.device,
     n_jobs: int = -1,
     strip_hydrogens: bool = False,
     strip_waters: bool = False,
     strip_ligands: bool = False,
     simulate_solvent_and_scale: bool = False,
     save_structure: bool = False,
 ) -> None:
     ...
     from joblib import delayed, Parallel
     rows = load_batch_csv(csv_path)
+    if device.type == "cuda" and n_jobs != 1:
+        logger.warning(
+            f"Running {n_jobs} parallel jobs on CUDA device may cause GPU memory contention. "
+            "Consider using n_jobs=1 for GPU-based computation."
+        )
     logger.info(f"Processing {len(rows)} structures from {csv_path} using {n_jobs} jobs")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/sampleworks/eval/generate_synthetic_sf.py` around lines 442 - 460, The
Parallel invocation may spawn multiple loky workers that each initialize CUDA if
device is a GPU, causing GPU contention; update the code around the
Parallel(...) call to detect GPU devices (check the passed-in device variable
for values like "cuda" or starting with "cuda:") and if a GPU is detected and
n_jobs > 1 either set n_jobs = 1 (safe default) or emit a clear warning (e.g.,
via logging or warnings.warn) and fall back to n_jobs = 1; ensure this logic is
applied before calling Parallel and reference the Parallel(...) call and the
_process_single_row invocation so the change prevents multiple worker processes
from initializing separate CUDA contexts.
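The proposed fix above only warns. The other option named in this comment, falling back to a single job on CUDA, could look roughly like the sketch below. `resolve_n_jobs` is a hypothetical helper, not code from the PR, and it takes the device as a plain string (for example `str(device)` for a `torch.device`) so the example stays free of a torch dependency.

```python
import logging

logger = logging.getLogger(__name__)


def resolve_n_jobs(device: str, n_jobs: int) -> int:
    """Return the worker count to use for a given device string.

    Hypothetical helper: forces serial execution on CUDA so that
    parallel worker processes do not each create a CUDA context.
    """
    # Matches "cuda" and indexed forms such as "cuda:0".
    is_cuda = device == "cuda" or device.startswith("cuda:")
    if is_cuda and n_jobs != 1:
        logger.warning(
            "Requested n_jobs=%d on device %s; falling back to n_jobs=1 "
            "to avoid GPU memory contention.",
            n_jobs,
            device,
        )
        return 1
    # CPU (or any non-CUDA device): keep the caller's setting.
    return n_jobs
```

A caller would apply it once before building the `Parallel(...)` call, for instance `n_jobs = resolve_n_jobs(str(device), n_jobs)`. Whether silently overriding the user's `n_jobs` is preferable to a warning alone is a design decision left to the PR author.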
def test_fprotein_matches_direct_gemmi(
    self, gemmi_structure_from_atomarray, stripped_gemmi, device
):
    f_atomarray = _compute_fprotein(gemmi_structure_from_atomarray, device)
    f_direct = _compute_fprotein(stripped_gemmi, device)
    np.testing.assert_allclose(np.abs(f_atomarray), np.abs(f_direct), atol=1e-3)

def test_fprotein_changes_with_occupancy(self, stripped_atom_array, stripped_gemmi, device):
    altloc_info = detect_altlocs(stripped_atom_array)

    arr_default = assign_occupancies(stripped_atom_array, altloc_info, "default")
    f_default = _compute_fprotein(
        atomarray_to_gemmi(arr_default, stripped_gemmi.cell, stripped_gemmi.spacegroup_hm),
        device,
    )

    arr_custom = assign_occupancies(stripped_atom_array, altloc_info, "custom", [0.2, 0.8, 0.0])
    f_custom = _compute_fprotein(
        atomarray_to_gemmi(arr_custom, stripped_gemmi.cell, stripped_gemmi.spacegroup_hm),
        device,
    )

    assert not np.allclose(np.abs(f_default), np.abs(f_custom), atol=1e-3)
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Mark GPU-dependent tests with @pytest.mark.slow.
The tests test_fprotein_matches_direct_gemmi and test_fprotein_changes_with_occupancy use the device fixture and SFcalculator, which may require GPU resources. As per coding guidelines, tests requiring GPU should be marked with @pytest.mark.slow to exclude them from fast CI runs.
Suggested fix
+    @pytest.mark.slow
     def test_fprotein_matches_direct_gemmi(
         self, gemmi_structure_from_atomarray, stripped_gemmi, device
     ):
         f_atomarray = _compute_fprotein(gemmi_structure_from_atomarray, device)
         f_direct = _compute_fprotein(stripped_gemmi, device)
         np.testing.assert_allclose(np.abs(f_atomarray), np.abs(f_direct), atol=1e-3)
+    @pytest.mark.slow
     def test_fprotein_changes_with_occupancy(self, stripped_atom_array, stripped_gemmi, device):
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/eval/test_generate_synthetic_sf.py` around lines 100 - 122, These two
GPU-dependent tests (test_fprotein_matches_direct_gemmi and
test_fprotein_changes_with_occupancy) should be annotated with the pytest slow
marker: add import pytest if missing and place `@pytest.mark.slow` immediately
above each test function definition that uses the device fixture/SFcalculator so
they are excluded from fast CI runs. Ensure you only add the decorator (no other
behavioral changes) above the def lines for test_fprotein_matches_direct_gemmi
and test_fprotein_changes_with_occupancy.
this test is not slow and should be able to run with cpu too
@DorisMai, thanks for the clarification! Since SFcalculator supports CPU execution and the device fixture falls back to CPU when no GPU is available, the @pytest.mark.slow annotation is not appropriate here. I'll withdraw that suggestion.
(ミ` ◡ ´ミ)
✏️ Learnings added
Learnt from: DorisMai
Repo: diff-use/sampleworks PR: 234
File: tests/eval/test_generate_synthetic_sf.py:100-122
Timestamp: 2026-05-16T00:14:29.452Z
Learning: In the `diff-use/sampleworks` repo, SFcalculator (SFC_Torch) tests such as those in `tests/eval/test_generate_synthetic_sf.py` (e.g., `test_fprotein_matches_direct_gemmi`, `test_fprotein_changes_with_occupancy`) can run on CPU and are not considered slow. Do NOT suggest adding `pytest.mark.slow` to these tests solely because they use `SFcalculator` or the `device` fixture — the `device` fixture falls back to CPU and execution is fast enough for regular CI runs.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
🧠 Learnings used
Learnt from: k-chrispens
Repo: diff-use/sampleworks PR: 73
File: src/sampleworks/core/samplers/protocol.py:1-18
Timestamp: 2026-02-02T19:13:51.594Z
Learning: In Python files across the repo, do not suggest or add inline or file-level suppressions for Ruff F722 (jaxtyping dimension strings). Instead rely on a global ignore configured in pyproject.toml (as done by the sampleworks project). Verify that F722 is included in the global ignore, and avoid recommending per-file # ruff: noqa: F722 directives unless explicitly documented as an exception.
Learnt from: marcuscollins
Repo: diff-use/sampleworks PR: 132
File: src/sampleworks/utils/guidance_script_utils.py:585-586
Timestamp: 2026-03-05T16:30:40.514Z
Learning: In Python code, if enums subclassing StrEnum are used (e.g., GuidanceType, StructurePredictor), they serialize to plain strings with json.dump and pickle without special handling. Do not flag these as non-serializable in code reviews. Treat StrEnum values as strings for JSON serialization and ensure tests cover that behavior; no extra pickle handling needed.
Summary
- `generate_synthetic_sf.py` to produce MTZ files of structure factors using SFcalculator (SFC), with optional bulk solvent and default scaling, as well as R-free flag generation. Assumes no anomalous scattering for now.
- `generate_synthetic_density.py` in terms of supporting single_structure vs batch_csv generation, altloc occupancy overwrite, and selection and stripping of hydrogens/waters/ligands.
- `assign_occupancies` and selection/stripping from `generate_density_sf.py` into `eval/synthetic_utils` for reuse in the SF scripts.
- `atomarray_to_gemmi`. As this function is likely to be reused in prototyping the new reward function, a test is added. The test is specific to the SFC environment.
Test plan
- `analysis-dev-sfc`, `boltz-dev-sfc`, and `rf3-dev-sfc`, which will eventually be merged (or added to workflows). ❗ There is a dependency conflict with Protenix (issue "Dependency conflict betwen sfcalculator-torch and protenix" #235), hence no corresponding env.
- `tests/eval/test_generate_synthetic_sf.py`
Next steps
Summary by CodeRabbit
New Features
Refactor
Tests
Chores