CLI tool, valdo.pipeline by minhuanli · Pull Request #45 · Hekstra-Lab/valdo

minhuanli · 2026-05-08T12:47:38Z

No description provided.

Converts the pipeline.ipynb workflow into 9 independently runnable CLI stages (standardize, reindex, scale, preprocess, train, reconstruct, rescale, add_phases_and_blobs, tag_blobs). Each stage is configured via a YAML or JSON file; `valdo.pipeline init <stage>` prints a commented template. Includes docs/pipeline_cli.md with a full tutorial, stage reference, and guidance on when the reindex stage can be skipped.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
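To illustrate the "one YAML or JSON file per stage" idea, here is a minimal sketch of a loader that dispatches on the file extension and checks for required keys. The function name `load_stage_config`, the `REQUIRED_KEYS` table, and the key names are assumptions made for this example and are not taken from valdo's actual schema; the YAML branch additionally assumes PyYAML is installed.

```python
import json
from pathlib import Path

# Hypothetical required keys per stage; the real schema lives in the PR's config.py.
REQUIRED_KEYS = {
    "scale": {"input_files", "output_folder"},
}


def load_stage_config(path, stage):
    """Load a YAML or JSON stage config and check that required keys exist."""
    path = Path(path)
    text = path.read_text()
    suffix = path.suffix.lower()
    if suffix in {".yaml", ".yml"}:
        import yaml  # PyYAML; imported lazily so JSON-only use needs no extra package
        config = yaml.safe_load(text)
    elif suffix == ".json":
        config = json.loads(text)
    else:
        raise ValueError(f"unsupported config extension: {path.suffix}")
    if not isinstance(config, dict):
        raise ValueError("config must be a mapping at the top level")
    missing = REQUIRED_KEYS.get(stage, set()) - set(config)
    if missing:
        raise KeyError(f"missing keys for stage '{stage}': {sorted(missing)}")
    return config
```

Validating keys at load time means a misconfigured stage fails before any expensive work starts, which matters when each stage is run on its own.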

Generates CC difference histogram and scatter plot after reindexing. If reindex_record.pkl already exists in output_folder, reindexing is skipped and only the plots are regenerated.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ariant Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Generates end_corr histogram and start_LS vs end_LS scatter after scaling. If scaling_metrics.pkl already exists, scaling is skipped and only plots are regenerated.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Add PTP1B_pipeline/PIPELINE_RUN.md documenting steps 1-4
- Add configs for reindex, scale, preprocess, and train stages
- Add scaled_filtered_files.txt (1,651 paths after Rf filtering)
- Fix VAE training defaults to match notebook: stdof=128, activation=relu, ml_recon=true, sigdF_pct=95.0, extrapolate_factors=[2,4,6,8,16], cutoff=3.5, radius_in_A=4.0
- Add activation config field to train stage (relu/tanh/sigmoid)
- Add training loss curve plot (vae_loss_curves.png) saved after training

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Drop all-NaN input rows in run_train (dataset 0110_1 had all-NaN F-obs-scaled due to silent scaling divergence)
- Add logvar clamping hook [-10,10] on encoder to prevent KL overflow
- Add gradient clipping (max_norm=1.0) as additional safeguard
- Add post-scaling check in run_scale that warns about all-NaN output files
- Remove 0110_1.mtz from scaled_filtered_files.txt (now 1,650 datasets)
- Fix train default activation to tanh (matches notebook)
- Add vae_loss_curves.png from clean 1,650-sample training run
- Update PIPELINE_RUN.md with step 5, NaN debugging notes, correct counts

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
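The reason clamping logvar guards against KL overflow can be seen from the arithmetic alone. The sketch below is a plain-Python illustration, not the PR's PyTorch encoder hook: it evaluates the standard Gaussian KL term against a unit normal, with and without clamping to the [-10, 10] range quoted above. The function names are made up for this example.

```python
import math

LOGVAR_MIN, LOGVAR_MAX = -10.0, 10.0  # clamp range quoted in the commit message


def clamp(x, lo=LOGVAR_MIN, hi=LOGVAR_MAX):
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def kl_divergence(mu, logvar, clamp_logvar=True):
    """KL( N(mu, exp(logvar)) || N(0, 1) ), summed over latent dimensions."""
    total = 0.0
    for m, lv in zip(mu, logvar):
        if clamp_logvar:
            lv = clamp(lv)
        # exp(lv) is the term that blows up when lv is very large.
        total += -0.5 * (1.0 + lv - m * m - math.exp(lv))
    return total
```

With clamping, a runaway log-variance contributes a large but finite penalty. Without it, `math.exp` raises `OverflowError` for a value such as 1000.0 in double precision; in lower-precision tensor arithmetic the analogous failure is an infinite or NaN loss.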

- Add configs/config_reconstruct.yaml and config_rescale.yaml
- Update PIPELINE_RUN.md with steps 6 and 7
- Reconstruct: recons.npy shape (1650, 77821), all finite
- Rescale: 1,650 MTZ files with recons and diff columns

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- run_tag_blobs: add bound column from optional bound_models_folder, create temp flat PDB dir for tag_lig_blobs, skip ligand tagging if bound_models_folder is omitted (bound=0, ligand=0)
- config.py: make bound_models_folder optional in tag_blobs schema
- config_tag_blobs.yaml: PTP1B config with Cys215 focal residue
- plot_auc.py: standalone script to plot ROC/AUC from filtered blob stats
- PIPELINE_RUN.md: document steps 8 and 9 with output counts (9318 blobs, 8316 filtered, AUC=0.9399)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
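For readers unfamiliar with the AUC figure quoted above, here is a minimal sketch of how an ROC AUC can be computed from per-blob scores and binary labels. It uses the pairwise (Mann-Whitney) definition in plain Python; how plot_auc.py actually computes it is not shown in this PR page, and the function name and inputs are assumptions for illustration.

```python
def roc_auc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: labels play the role of a ligand tag, scores the role of a blob statistic.
print(roc_auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))  # prints 0.75
```

The pairwise loop is quadratic in the number of blobs, which is fine for a sketch; a rank-based formulation would be preferable for thousands of blobs.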

- PIPELINE_RUN.md: full rewrite as a how-to guide: Prerequisites section, per-step explanations, Troubleshooting callouts, apo refinement instructions, Step 10 evaluation section; inline Python replaced with script references
- parse_refine_logs.py: parse PHENIX log files into refine_summary.csv
- filter_datasets.py: filter scaled MTZ files by R-free, CC, and all-NaN F-obs-scaled; resolves reindexing ambiguity; writes filtered file list

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
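The three filtering criteria named above (R-free, CC, all-NaN F-obs-scaled) can be sketched as a single pass over per-dataset records. The record layout, field names, and thresholds below are hypothetical placeholders chosen for the example; they are not the columns or cutoffs used by filter_datasets.py.

```python
import math


def is_all_nan(values):
    """True if every value is NaN (an empty column is treated as unusable too)."""
    return all(math.isnan(v) for v in values)


def filter_datasets(records, max_rfree, min_cc):
    """Return the paths of datasets that pass all three checks.

    Each record is assumed to be a dict with the hypothetical keys
    'path', 'rfree', 'cc', and 'f_obs_scaled'.
    """
    kept = []
    for rec in records:
        if rec["rfree"] > max_rfree:
            continue  # refinement quality too poor
        if rec["cc"] < min_cc:
            continue  # scaling correlation too low
        if is_all_nan(rec["f_obs_scaled"]):
            continue  # scaling diverged and left no usable amplitudes
        kept.append(rec["path"])
    return kept
```

Writing the surviving paths to a text file, one per line, would then give a file list of the kind the commit describes.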

…ode quality

pipeline.py bug fixes:
- run_scale: Scaler_pool does not support when_opt; fall back to single-process Scaler whenever when_opt != "never" (pool silently ignored it before)
- run_scale: restore skip-if-exists with --force override (was accidentally removed)
- run_add_phases_and_blobs: drop NaN rows after extrapolation (mirrors pipeline_doeke.py)
- run_train: add torch.manual_seed for reproducible weight initialisation

pipeline.py new features:
- --force flag to override skip-if-exists on any stage
- Skip-if-exists guards added to all stages (reindex already had one)

pipeline.py code quality:
- Move all imports to function top (no imports inside loops or mid-function)
- Remove import glob as _glob alias (module-level glob already imported)
- Move matplotlib.use("Agg") before pyplot import in run_train
- Remove duplicate import tempfile in run_tag_blobs
- Update _TEMPLATES defaults to match Doeke's settings (when_opt, random_seed, activation)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
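The skip-if-exists guard with a `--force` override can be sketched in a few lines. This is an illustration under assumptions: `should_run` and `build_parser` are hypothetical names, and the real pipeline.py may decide what counts as "already done" differently for each stage.

```python
import argparse
from pathlib import Path


def should_run(marker, force=False):
    """Run the stage if forced, or if its marker output does not exist yet."""
    return force or not Path(marker).exists()


def build_parser():
    """Minimal parser exposing only the --force flag described above."""
    parser = argparse.ArgumentParser(prog="stage")
    parser.add_argument(
        "--force",
        action="store_true",
        help="rerun the stage even if its outputs already exist",
    )
    return parser
```

A stage would call `should_run("output/scaling_metrics.pkl", args.force)` (path shown only as an example) and, when it returns False, skip the expensive work and go straight to regenerating plots.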

… results

- Integrate filter_datasets.py as valdo.pipeline filter stage (remove standalone script)
- Add AUC-vs-N-blobs plot to plot_auc.py (sort by peakz, evaluate at N=500..6000)
- Fix bound model naming: standardize to {id}.pdb for tag_lig_blobs lookup
- Add configs for filter and add_phases_and_blobs stages
- Update PIPELINE_RUN.md with new run counts, AUC=0.9748, bound model naming guide, and stale-file cleanup warnings for reruns
- Add CLAUDE.md to .gitignore

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- valdo/helper.py: fix find_phase_file to prefer *_001.mtz over *_data.mtz when multiple glob matches exist, fixing silent phasing failures for ~833 datasets
- valdo/commandline/config.py, pipeline.py: filter stage additions
- PTP1B_pipeline/compute_valdo_metrics.py: apo peak and heavy atom WDF metric script
- notebooks/vae_metric_apo_peak_value.py, vae_metric_heavy_atom_peak_value.py: converted from Doeke's notebooks for reference

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
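The preference rule described for find_phase_file can be sketched as a small selection function over the glob matches. This is not the code in valdo/helper.py; `pick_phase_file` is a hypothetical stand-in that only shows the ordering logic, with an alphabetical fallback added as an assumption for names matching neither pattern.

```python
import fnmatch


def pick_phase_file(candidates):
    """Pick one MTZ path, preferring *_001.mtz over *_data.mtz.

    Returns None when there are no candidates at all.
    """
    if not candidates:
        return None
    for pattern in ("*_001.mtz", "*_data.mtz"):
        matches = sorted(c for c in candidates if fnmatch.fnmatch(c, pattern))
        if matches:
            return matches[0]
    # Assumed fallback: neither naming pattern matched, so take the first name alphabetically.
    return sorted(candidates)[0]
```

Making the preference explicit avoids depending on the order in which glob happens to return matches, which is the kind of silent failure the commit describes.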

- compute_valdo_metrics.py: add --recons-phased argparse, expose compute_metrics() for import, move mapping txt to local path
- collect_ablation_metrics.py: loop all ablation settings, print comparison table, write ablation/ablation_metrics.csv
- ligand_cif_to_dataset_mapping.txt: copy from notebooks/ so PTP1B_pipeline/ is self-contained

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

minhuanli and others added 20 commits May 4, 2026 11:37

Remove twinning-specific language from reindex documentation

9e3b8d7

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add valdo.refine prerequisite guidance before add_phases_and_blobs

f103984

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add PTP1B_pipeline/ to .gitignore

f09b994

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Fix Scaler_pool.batch_scaling call — when_opt not supported in pool v…

e76470d

…ariant Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Support .txt file path in expand_glob_field for explicit file lists

868d621

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
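As a rough picture of what accepting a .txt path alongside a glob pattern could look like, the sketch below expands a config field into a list of file paths. The name `expand_file_field` and the exact rules (one path per line, blank lines skipped) are assumptions for illustration and may differ from the PR's expand_glob_field.

```python
import glob
from pathlib import Path


def expand_file_field(value):
    """Expand a config field into a list of paths.

    - a list or tuple is returned as a list of strings
    - a path ending in .txt is read as an explicit file list, one path per line
    - anything else is treated as a glob pattern
    """
    if isinstance(value, (list, tuple)):
        return [str(v) for v in value]
    value = str(value)
    if value.endswith(".txt"):
        lines = Path(value).read_text().splitlines()
        return [line.strip() for line in lines if line.strip()]
    return sorted(glob.glob(value))
```

An explicit list keeps the order given in the file, whereas glob results are sorted here so that reruns see the files in a stable order.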

Add scale validation plots and skip-if-done detection

743aebb

Generates end_corr histogram and start_LS vs end_LS scatter after scaling. If scaling_metrics.pkl already exists, scaling is skipped and only plots are regenerated.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add validation plots from reindex and scale stages

439b87c

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Update AUC/ROC plots from full 1617-dataset run

1f5864a

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add Keedy/Ginn split to heavy atom peak metric

cffba70

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

minhuanli wants to merge 20 commits into main from feature/pipeline-cli
