Add new_parse_input module and move species_mapper to module_settings… by wolski · Pull Request #1012 · Proteobench/ProteoBench

wolski · 2026-04-15T08:17:44Z

….toml

Phase 1 of separating parsing from benchmarking: create unified parsing entry point (parse_input, load_module_settings, process_species) and add species_mapper to module_settings.toml where it belongs as module-level config. Per-tool TOMLs retain species_mapper marked for future removal.

Phase 2 of parsing/benchmarking separation:
- Remove _process_species_information, species_dict, species_expected_ratio from ParseSettingsQuant; callers now use process_species() from new_parse_input.py with ModuleSettings loaded from module_settings.toml
- Reorder convert_to_standard_format: melt (wide->long) now happens early
- convert_to_standard_format returns DataFrame only (not tuple)
- Rename _create_replicate_mapping to public create_replicate_mapping
- Update all 11 callers, plot generators, and webinterface tab3
- Remove [species_mapper] from 77 per-tool TOMLs
- Add 8 tests for new_parse_input.py functions

- Generalize run_benchmarking() to accept quant_score_class and datapoint_class parameters instead of hardcoding QuantScoresHYE
- Base class QuantModule.benchmarking() now delegates to run_benchmarking() with self.quant_score_class and self.datapoint_class
- Delete benchmarking() overrides from all 10 subclass module files
- Rename precursor_name to precursor_column_name in singlecell and Plasma
- Add quant_score_class/datapoint_class class attributes to QuantModule (Plasma overrides with QuantScoresPYE/QuantDatapointPYE)
- Delete unused run_benchmarking_with_timing() and benchmarking_2()
- Net: -964 lines of duplicated code
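The delegation pattern in the list above can be sketched as follows. This is a minimal illustration, not the ProteoBench implementation: the class bodies are empty stand-ins, `QuantDatapointHYE` and `PlasmaModule` are assumed names, and the simplified `run_benchmarking()` signature and return value are hypothetical.

```python
class QuantScoresHYE:
    """Stand-in for the default score class (real logic not shown)."""
    label = "HYE"


class QuantDatapointHYE:
    """Stand-in for the default datapoint class (name is an assumption)."""
    label = "HYE"


class QuantScoresPYE:
    """Stand-in for the Plasma score class."""
    label = "PYE"


class QuantDatapointPYE:
    """Stand-in for the Plasma datapoint class."""
    label = "PYE"


def run_benchmarking(standard_format,
                     quant_score_class=QuantScoresHYE,
                     datapoint_class=QuantDatapointHYE):
    # Hypothetical simplified signature: the classes are injected
    # instead of being hardcoded inside the function.
    scores = quant_score_class()
    datapoint = datapoint_class()
    return scores.label, datapoint.label


class QuantModule:
    # Class attributes that subclasses override instead of
    # re-implementing benchmarking().
    quant_score_class = QuantScoresHYE
    datapoint_class = QuantDatapointHYE

    def benchmarking(self, standard_format):
        return run_benchmarking(
            standard_format,
            quant_score_class=self.quant_score_class,
            datapoint_class=self.datapoint_class,
        )


class PlasmaModule(QuantModule):
    # Only the two attributes differ; benchmarking() is inherited.
    quant_score_class = QuantScoresPYE
    datapoint_class = QuantDatapointPYE
```

With this shape a subclass only declares which classes it needs, which is what lets the per-module benchmarking() overrides be deleted.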

Tests verify: 3-tuple return, default HYE classes work, custom quant_score_class/datapoint_class are actually used (via mocks), add_datapoint_func is called when provided and skipped when None.

run_benchmarking() no longer does any parsing — it receives standard_format, replicate_to_raw, and ModuleSettings directly. Base class benchmarking() calls parse_input() then delegates. input_df kept in base class return for webinterface compatibility. get_plots.py migrated to use parse_input(). Tests updated for new 2-tuple return signature.

ZenoTOF still had the old benchmarking() override with removed imports. Add .flake8 config to exclude build/ from linting. Fix pre-existing numpydoc warnings for branch parameter.

Step 1 of moving sample annotation from per-tool TOMLs to module_settings.toml. Each module now has a [[samples]] array with raw_file, sample_name, and condition per sample. Not yet consumed by code — per-tool condition_mapper/run_mapper still active.

- Move 7 ProForma conversion helpers from parse_ion.py to proforma.py
- Add 22 direct unit tests in test_proforma.py
- Add SampleAnnotation dataclass and ModuleSettings properties (condition_mapper, run_mapper, replicate_to_raw)
- Fix numpydoc docstrings for dataclasses, properties, and ParseSettingsDeNovo

- Rename parse_ion.py -> load_input.py (all loaders in one module)
- Add _load_proteome_discoverer from parse_peptidoform.py
- Delete parse_peptidoform.py (dead code, never imported)
- Add 8 direct tests in test_load_input.py
- Update imports in 4 files
- Fix pre-existing numpydoc issues in quant_base_module.py

…sses
- Rename parse_settings.py -> convert_to_intermediate.py
- Rename test_parse_settings.py -> test_convert_to_intermediate.py
- Rename ParseSettingsQuant -> IntermediateFormatConverter
- Rename ParseModificationSettings -> ModificationConverter
- Rename ParseSettingsBuilder -> ConverterBuilder
- Add backward-compatible aliases for transition
- Add 5 direct tests for ModificationConverter, 3 for ConverterBuilder
- Update imports in 18 files
- Exclude webinterface/ and modules/denovo/ from numpydoc validation (pre-existing violations, separate CI workflow)

…ong-format TOMLs

Converter fallback: IntermediateFormatConverter.__init__ now checks for [condition_mapper] in per-tool TOML first, then falls back to [[samples]] from module_settings.toml. run_mapper removed (dead code: it was stored but never read after init).

Long-format tools (bare names or extensions already handled by _clean_run_name):
- DDA QExactive/Astral: MaxQuant, AlphaPept, DIA-NN, MetaMorpheus, i2MassChroQ, quantms/msstats
- DIA AIF/Astral/diaPASEF/ZenoTOF/singlecell/plasma: DIA-NN, MaxDIA, Spectronaut, MSAID

Wide-format tools keep their per-tool condition_mapper for now (separate step).

TODO reorganisation: moved completed TODOs to DONE/ subfolder, added TODO_file_parsing_melt.md documenting 42 wide-format loaders to refactor, TODO_parsing_structure.md and TODO_step_by_step_apr16.md as master plans.

… remaining per-tool TOMLs

Extends IntermediateFormatConverter to resolve condition_mapper via three paths:
1. [condition_mapper] in per-tool TOML (highest priority; legacy/override)
2. [run_mapper] + [[samples]] two-step lookup: column → sample_name → condition. Used for tools with tool-specific column names (WOMBAT, Proteome Discoverer, PEAKS peptidoform, MSAngel QExactive) that cannot be matched to raw file names by regex alone.
3. [[samples]] raw_file → condition directly (long-format and bare-name tools)

Extends default _clean_run_name() regex to strip: ' Intensity' (FragPipe), ' Normalized Area' (PEAKS ion), '.mgf' (ProlineStudio)

Per-tool TOML changes (39 files):
- Tier 0 (bare names / .mzML already handled): AlphaDIA x6, Custom x10, Sage x3, FragPipe singlecell x1; remove both sections
- Tier 1 (suffix strip via extended regex): FragPipe DDA/DIA x6, PEAKS DDA/DIA ion x8; remove both sections
- Tier 2 (prefix strip via run_name_cleanup): MSAngel Astral, ProlineStudio QEx/Astral; add run_name_cleanup, remove both sections
- Tier 2+3 hybrid: MSAngel QExactive; add run_name_cleanup + run_mapper, no condition_mapper (columns don't match raw_file names after prefix strip)
- Tier 3 (hard, run_mapper two-step): WOMBAT x3, Proteome Discoverer x1, PEAKS peptidoform x1; remove condition_mapper, keep run_mapper

Result: zero condition_mapper entries in any per-tool TOML. run_mapper remains only in 6 files (WOMBAT x3, PD x1, PEAKS peptidoform x1, MSAngel QExactive x1).

Tests: add TestCleanRunNameExtended (5 tests) and TestTier3ConditionMapper (4 tests).
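The three resolution paths can be sketched as a single function. This is an illustrative sketch under assumptions, not the code in IntermediateFormatConverter.__init__: the function name `resolve_condition_mapper`, its arguments (an already-parsed per-tool TOML dict and a list of sample dicts), and the example column names are all hypothetical.

```python
def resolve_condition_mapper(tool_toml, samples):
    """Return a mapping of input column / run name -> condition.

    tool_toml: parsed per-tool TOML as a dict (hypothetical shape).
    samples:   list of dicts with raw_file, sample_name, condition.
    """
    # Path 1: an explicit [condition_mapper] wins (legacy/override).
    if "condition_mapper" in tool_toml:
        return dict(tool_toml["condition_mapper"])

    # Path 2: [run_mapper] gives column -> sample_name, and
    # [[samples]] gives sample_name -> condition.
    if "run_mapper" in tool_toml:
        condition_by_sample = {s["sample_name"]: s["condition"] for s in samples}
        return {
            column: condition_by_sample[sample_name]
            for column, sample_name in tool_toml["run_mapper"].items()
        }

    # Path 3: raw_file -> condition directly from [[samples]].
    return {s["raw_file"]: s["condition"] for s in samples}


samples = [
    {"raw_file": "run_A_01", "sample_name": "A_rep1", "condition": "A"},
    {"raw_file": "run_B_01", "sample_name": "B_rep1", "condition": "B"},
]

# Hypothetical tool whose columns cannot be matched to raw file names.
tool_with_run_mapper = {"run_mapper": {"abundance_1": "A_rep1", "abundance_2": "B_rep1"}}

two_step = resolve_condition_mapper(tool_with_run_mapper, samples)
direct = resolve_condition_mapper({}, samples)
```

Checking the paths in this order keeps an explicit per-tool override authoritative while letting most tools rely on the module-level sample list alone.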

PEAKS wide-format columns use an LFQ_<sample> prefix that _clean_run_name cannot strip, so condition_mapper keys never intersected the cleaned column names and melt() received an empty value_vars set. The run_mapper maps each LFQ_<sample> key to its bare sample name, enabling Tier-2 resolution in IntermediateFormatConverter.__init__. Adds a regression test that synthesizes a PEAKS diaPASEF DataFrame and verifies both the column overlap and the full convert_to_standard_format path.

Adds test_data_download/ with:
- extract_raw_file_db.py: CLI with catalog/select/download subcommands for assembling a reproducible test dataset from ProteoBench result repos
- Makefile: drives the three subcommands in sequence
- marimo/benchmark_analysis.py + index.py: marimo notebooks for per-module in-depth analysis; marimo/Makefile exports all 8 quant modules to HTML
- .gitignore entries for generated CSVs, json_dir, HTML exports, and zip archives
- TODO/TODO_creating_test_harness.md: design doc for the test harness
- .pre-commit-config.yaml: exclude test_data_download/ from numpydoc validation

…oteobench/ProteoBench into intermediate_format_interface # Conflicts: # webinterface/pages/base_pages/denovo.py

…downloaded.csv

For each (module, software, version) row in raw_file_db_downloaded.csv, re-runs the module's benchmarking() against the cached input file and asserts that the regenerated metrics at min_nr_observed=3 match the values in the original submission JSON, within pytest.approx(rel=1e-4, abs=1e-6).

Skipped entirely when test_data_download/ is absent so a fresh clone and CI stay green. Opt in locally with:

    cd test_data_download && make catalog && make select && make download

Accepts known drift via test/catalog_regression_overrides.json, a hash-keyed file whose metrics override the upstream JSON for specific cases. Seeded with 4 FragPipe (DIA-NN quant) entries where a proteobench 0.13.2 -> 0.15.x parsing refactor dropped ~1% of precursors consistently; drop the entry once the upstream Results JSON is resubmitted.

Current run on a populated catalog: 69 pass + 4 pass via override + 7 fail. The 7 failures are all PEAKS wide-format tools and hit the same _clean_run_name path-stripping bug that was latent in commit 794d25e's test_peaks_diapasef_convert_to_standard_format_succeeds: m/z column names collapse to z, producing duplicate columns that break the subsequent melt(). Root-cause fix in _fix_colnames / _clean_run_name is a separate commit.

Registers a `regression` pytest marker in pyproject.toml so the harness can be excluded with `pytest -m "not regression"` without touching the test file.
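The tolerance check and override lookup described above can be sketched with the standard library alone. This is not the harness code: `close_enough` is a stand-in that mirrors the documented behaviour of pytest.approx when both tolerances are given (the larger of the two applies), and the function names, the hash key, the metric name, and the shape of the overrides dictionary are assumptions.

```python
def close_enough(actual, expected, rel=1e-4, abs_tol=1e-6):
    # Allowed deviation is the larger of the relative and absolute
    # tolerance, as with pytest.approx(expected, rel=..., abs=...).
    return abs(actual - expected) <= max(rel * abs(expected), abs_tol)


def expected_metrics(submission_hash, upstream, overrides):
    # A hash-keyed override replaces the upstream metrics for a
    # known-drift case; otherwise the upstream values are used.
    return overrides.get(submission_hash, upstream)


def metrics_match(regenerated, expected):
    # Same metric names, and every value within tolerance.
    return regenerated.keys() == expected.keys() and all(
        close_enough(regenerated[name], expected[name]) for name in expected
    )


# Hypothetical case: a refactor dropped about 1% of precursors.
upstream = {"nr_prec": 1000.0}
regenerated = {"nr_prec": 990.0}
overrides = {"abc123": {"nr_prec": 990.0}}

fails_without_override = not metrics_match(
    regenerated, expected_metrics("other", upstream, overrides)
)
passes_with_override = metrics_match(
    regenerated, expected_metrics("abc123", upstream, overrides)
)
```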

_fix_colnames applied _clean_run_name unchanged to every DataFrame column. _clean_run_name's path-prefix stripper (a/b/c -> c) then ate legitimate metadata column names containing '/', most notably PEAKS' 'm/z' column and its per-sample siblings 'LFQ_<sample> m/z'. All seven variants collapsed to 'z', producing duplicate column names that broke the subsequent melt() with

    AttributeError: 'DataFrame' object has no attribute 'dtype'

Path stripping is correct for raw-file row values (the "Raw file" column) and for condition_mapper keys supplied in the TOML, which are the other two call sites; column names are the outlier.

Add a `strip_path: bool = True` parameter to _clean_run_name and call it with strip_path=False from _fix_colnames. Extension / suffix stripping (.mzML, .raw, " Intensity", " Normalized Area", etc.) still applies everywhere.

Effect on the suite:
- test/test_peaks_diapasef_regression.py::test_peaks_diapasef_convert_to_standard_format_succeeds now passes (was latent-failing since 794d25e).
- test/test_catalog_regression.py: all 7 PEAKS cases now pass (dia_diapasef x2, dia_aif, dia_zenotof, dia_astral x2, dda_qexactive).
- No other tests affected: 250 passed, 3 skipped in the non-regression suite (was 249 passed, 1 failed, 3 skipped).
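The collapse and the fix can be reproduced with a simplified stand-in for _clean_run_name. The function below is a sketch, not the ProteoBench implementation: its suffix list is a small subset taken from the description, the handling of backslashes is an assumption, and the column names are invented examples.

```python
import re

# Subset of the suffixes named in the description above.
_SUFFIXES = re.compile(r"(\.mzML|\.raw| Intensity| Normalized Area)$")


def clean_run_name(name: str, strip_path: bool = True) -> str:
    """Simplified stand-in for _clean_run_name."""
    if strip_path:
        # a/b/c -> c: correct for file paths, harmful for 'm/z'.
        name = re.split(r"[/\\]", name)[-1]
    # Extension / suffix stripping applies in both modes.
    return _SUFFIXES.sub("", name)


columns = ["m/z", "LFQ_S1 m/z", "LFQ_S2 m/z", "S1 Normalized Area"]

# Old behaviour on column names: the three m/z columns become 'z'.
collapsed = [clean_run_name(c) for c in columns]

# Fixed behaviour: column names keep their '/', suffixes still go.
preserved = [clean_run_name(c, strip_path=False) for c in columns]

# Raw-file values still need the path stripped.
raw_value = clean_run_name("D:/data/run01.raw")
```

Keeping `strip_path=True` as the default leaves the raw-file and condition_mapper call sites unchanged, so only the column-name call has to opt out.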

The test_data_download/ Makefile and scripts fetch ProteoBench result JSONs and raw quant files to feed AnnData mapping development. That work now lives in the sibling anndata_proteomics_bridge repo, so the tool moves with its consumer. Also drop the now-redundant TODO/TODO_anndata_mapping.md (superseded by anndata_proteomics_bridge/docs/toml_schema.md) and ignore jupyter_notebooks/.

wolski added 8 commits April 14, 2026 16:45

Add direct tests for run_benchmarking() function

bdadc5f

style: black formatting fixes

820da02

Fix ZenoTOF stale benchmarking() and add .flake8 config

7039795

Update TODO files to reflect current refactoring state

f313000

wolski requested review from RobbinBouwmeester and SamvPy April 15, 2026 15:23

wolski and others added 15 commits April 15, 2026 17:28

Merge branch 'main' into intermediate_format_interface

b481692

Merge branch 'intermediate_format_interface' of https://github.com/Pr…

52e9911

chore(todo): reorganize completed TODO files

0eec11d

Add AnnData mapping TOML design TODO

c654afd

wolski wants to merge 23 commits into main from intermediate_format_interface

wolski commented Apr 15, 2026

2 participants
