Add new_parse_input module and move species_mapper to module_settings… #1012

Open
wolski wants to merge 23 commits into main from intermediate_format_interface

wolski (Contributor) commented Apr 15, 2026

….toml

Phase 1 of separating parsing from benchmarking: create unified parsing entry point (parse_input, load_module_settings, process_species) and add species_mapper to module_settings.toml where it belongs as module-level config. Per-tool TOMLs retain species_mapper marked for future removal.

wolski added 8 commits April 14, 2026 16:45
….toml

Phase 1 of separating parsing from benchmarking: create unified parsing
entry point (parse_input, load_module_settings, process_species) and
add species_mapper to module_settings.toml where it belongs as module-level
config. Per-tool TOMLs retain species_mapper marked for future removal.
Phase 2 of parsing/benchmarking separation:
- Remove _process_species_information, species_dict, species_expected_ratio
  from ParseSettingsQuant — callers now use process_species() from
  new_parse_input.py with ModuleSettings loaded from module_settings.toml
- Reorder convert_to_standard_format: melt (wide->long) now happens early
- convert_to_standard_format returns DataFrame only (not tuple)
- Rename _create_replicate_mapping to public create_replicate_mapping
- Update all 11 callers, plot generators, and webinterface tab3
- Remove [species_mapper] from 77 per-tool TOMLs
- Add 8 tests for new_parse_input.py functions
- Generalize run_benchmarking() to accept quant_score_class and
  datapoint_class parameters instead of hardcoding QuantScoresHYE
- Base class QuantModule.benchmarking() now delegates to run_benchmarking()
  with self.quant_score_class and self.datapoint_class
- Delete benchmarking() overrides from all 10 subclass module files
- Rename precursor_name to precursor_column_name in singlecell and Plasma
- Add quant_score_class/datapoint_class class attributes to QuantModule
  (Plasma overrides with QuantScoresPYE/QuantDatapointPYE)
- Delete unused run_benchmarking_with_timing() and benchmarking_2()
- Net: -964 lines of duplicated code
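
The delegation described in the list above can be sketched as follows. This is a minimal illustration, not the project's code: the class bodies are empty stand-ins, `QuantDatapointHYE` and `PlasmaModule` are assumed names, and the `run_benchmarking()` signature is hypothetical. Only the pattern (class attributes on the base class, overridden by one subclass) comes from the commit message.

```python
class QuantScoresHYE:
    """Stand-in for the default score class."""
    name = "HYE"


class QuantDatapointHYE:  # assumed name for the default datapoint class
    name = "HYE"


class QuantScoresPYE:
    name = "PYE"


class QuantDatapointPYE:
    name = "PYE"


def run_benchmarking(input_data, quant_score_class=QuantScoresHYE,
                     datapoint_class=QuantDatapointHYE):
    """Hypothetical signature: the classes are parameters, not hardcoded."""
    scores = quant_score_class()
    datapoint = datapoint_class()
    return scores.name, datapoint.name, input_data


class QuantModule:
    # Class attributes select the implementation; subclasses override them
    # instead of overriding benchmarking() itself.
    quant_score_class = QuantScoresHYE
    datapoint_class = QuantDatapointHYE

    def benchmarking(self, input_data):
        return run_benchmarking(
            input_data,
            quant_score_class=self.quant_score_class,
            datapoint_class=self.datapoint_class,
        )


class PlasmaModule(QuantModule):  # hypothetical subclass name
    quant_score_class = QuantScoresPYE
    datapoint_class = QuantDatapointPYE
```

With this shape a subclass needs only two attribute assignments, which is presumably why the per-module `benchmarking()` overrides could be deleted.
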
Tests verify: 3-tuple return, default HYE classes work, custom
quant_score_class/datapoint_class are actually used (via mocks),
add_datapoint_func is called when provided and skipped when None.
run_benchmarking() no longer does any parsing — it receives
standard_format, replicate_to_raw, and ModuleSettings directly.
Base class benchmarking() calls parse_input() then delegates.
input_df kept in base class return for webinterface compatibility.
get_plots.py migrated to use parse_input().
Tests updated for new 2-tuple return signature.
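
A rough sketch of the parse-then-delegate split described above. Everything here is an assumption made for illustration: the real `parse_input()` and `run_benchmarking()` signatures are not shown in this PR text, `ModuleSettings` is replaced by a plain dict, and the "metrics" are a toy mean per condition.

```python
def parse_input(input_rows, module_settings):
    """Stand-in parser: the only function that touches raw tool output."""
    standard_format = [
        {"raw_file": row["Raw file"], "intensity": float(row["Intensity"])}
        for row in input_rows
    ]
    replicate_to_raw = {}
    for sample in module_settings["samples"]:
        replicate_to_raw.setdefault(sample["condition"], []).append(sample["raw_file"])
    return standard_format, replicate_to_raw


def run_benchmarking(standard_format, replicate_to_raw, module_settings):
    """Stand-in benchmark: consumes parsed data only and returns a 2-tuple."""
    by_file = {row["raw_file"]: row["intensity"] for row in standard_format}
    intermediate = {
        condition: sum(by_file[f] for f in files) / len(files)
        for condition, files in replicate_to_raw.items()
    }
    datapoint = {"n_conditions": len(intermediate)}
    return intermediate, datapoint


def benchmarking(input_rows, module_settings):
    """Base-class style wrapper: parse first, then delegate."""
    standard_format, replicate_to_raw = parse_input(input_rows, module_settings)
    intermediate, datapoint = run_benchmarking(
        standard_format, replicate_to_raw, module_settings
    )
    # The unparsed input is passed through for callers that still expect it.
    return intermediate, datapoint, input_rows
```
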
ZenoTOF still had the old benchmarking() override with removed imports.
Add .flake8 config to exclude build/ from linting.
Fix pre-existing numpydoc warnings for branch parameter.
wolski and others added 15 commits April 15, 2026 17:28
Step 1 of moving sample annotation from per-tool TOMLs to
module_settings.toml. Each module now has a [[samples]] array
with raw_file, sample_name, and condition per sample. Not yet
consumed by code — per-tool condition_mapper/run_mapper still active.
- Move 7 ProForma conversion helpers from parse_ion.py to proforma.py
- Add 22 direct unit tests in test_proforma.py
- Add SampleAnnotation dataclass and ModuleSettings properties
  (condition_mapper, run_mapper, replicate_to_raw)
- Fix numpydoc docstrings for dataclasses, properties, and ParseSettingsDeNovo
- Rename parse_ion.py -> load_input.py (all loaders in one module)
- Add _load_proteome_discoverer from parse_peptidoform.py
- Delete parse_peptidoform.py (dead code, never imported)
- Add 8 direct tests in test_load_input.py
- Update imports in 4 files
- Fix pre-existing numpydoc issues in quant_base_module.py
…sses

- Rename parse_settings.py -> convert_to_intermediate.py
- Rename test_parse_settings.py -> test_convert_to_intermediate.py
- Rename ParseSettingsQuant -> IntermediateFormatConverter
- Rename ParseModificationSettings -> ModificationConverter
- Rename ParseSettingsBuilder -> ConverterBuilder
- Add backward-compatible aliases for transition
- Add 5 direct tests for ModificationConverter, 3 for ConverterBuilder
- Update imports in 18 files
- Exclude webinterface/ and modules/denovo/ from numpydoc validation
  (pre-existing violations, separate CI workflow)
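
One common way to provide the backward-compatible aliases mentioned above is plain name binding at module level. The class bodies below are empty stand-ins, and whether the project uses simple aliases or something more elaborate (for example a deprecation warning) is not stated in this PR text.

```python
class IntermediateFormatConverter:
    """Stand-in for the renamed converter class."""

    def __init__(self, toml_path):
        self.toml_path = toml_path


class ModificationConverter:
    """Stand-in."""


class ConverterBuilder:
    """Stand-in."""


# Old names stay importable during the transition and refer to the same objects.
ParseSettingsQuant = IntermediateFormatConverter
ParseModificationSettings = ModificationConverter
ParseSettingsBuilder = ConverterBuilder
```

Because the aliases are the same class objects, existing `isinstance` checks and imports keep working unchanged.
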
…ong-format TOMLs

Converter fallback: IntermediateFormatConverter.__init__ now checks for
[condition_mapper] in per-tool TOML first, then falls back to [[samples]]
from module_settings.toml. run_mapper removed (dead code — was stored but
never read after init).

Long-format tools (bare names or extensions already handled by _clean_run_name):
- DDA QExactive/Astral: MaxQuant, AlphaPept, DIA-NN, MetaMorpheus, i2MassChroQ, quantms/msstats
- DIA AIF/Astral/diaPASEF/ZenoTOF/singlecell/plasma: DIA-NN, MaxDIA, Spectronaut, MSAID

Wide-format tools keep their per-tool condition_mapper for now (separate step).

TODO reorganisation: moved completed TODOs to DONE/ subfolder, added
TODO_file_parsing_melt.md documenting 42 wide-format loaders to refactor,
TODO_parsing_structure.md and TODO_step_by_step_apr16.md as master plans.
… remaining per-tool TOMLs

Extends IntermediateFormatConverter to resolve condition_mapper via three paths:
1. [condition_mapper] in per-tool TOML (highest priority — legacy/override)
2. [run_mapper] + [[samples]] two-step lookup: column → sample_name → condition
   Used for tools with tool-specific column names (WOMBAT, Proteome Discoverer,
   PEAKS peptidoform, MSAngel QExactive) that cannot be matched to raw file names
   by regex alone.
3. [[samples]] raw_file → condition directly (long-format and bare-name tools)
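
The three resolution paths above can be sketched as a single function. This is an illustration under assumptions: `resolve_condition_mapper` is a hypothetical name, both inputs are plain dicts/lists as a TOML parser would produce, and the real logic lives in `IntermediateFormatConverter.__init__`.

```python
def resolve_condition_mapper(tool_toml, samples):
    """Return a mapping of column or raw-file name -> condition."""
    # Path 1: explicit [condition_mapper] in the per-tool TOML wins.
    if tool_toml.get("condition_mapper"):
        return dict(tool_toml["condition_mapper"])

    # Path 2: [run_mapper] gives column -> sample_name, and [[samples]]
    # gives sample_name -> condition (two-step lookup).
    run_mapper = tool_toml.get("run_mapper")
    if run_mapper:
        condition_by_sample = {s["sample_name"]: s["condition"] for s in samples}
        return {col: condition_by_sample[name] for col, name in run_mapper.items()}

    # Path 3: fall back to raw_file -> condition straight from [[samples]].
    return {s["raw_file"]: s["condition"] for s in samples}
```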

Extends default _clean_run_name() regex to strip:
  ' Intensity' (FragPipe), ' Normalized Area' (PEAKS ion), '.mgf' (ProlineStudio)
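
A minimal sketch of suffix stripping of this kind. The actual regex in `_clean_run_name()` is not shown in this PR text, so the pattern and the public name `clean_run_name` are assumptions.

```python
import re

# Assumed pattern: strip one known suffix at the end of the name.
_SUFFIX_RE = re.compile(r"( Intensity| Normalized Area|\.mgf|\.mzML|\.raw)$")


def clean_run_name(name):
    """Remove a trailing tool-specific suffix or file extension, if present."""
    return _SUFFIX_RE.sub("", name)
```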

Per-tool TOML changes (39 files):
- Tier 0 (bare names / .mzML already handled): AlphaDIA x6, Custom x10,
  Sage x3, FragPipe singlecell x1 — remove both sections
- Tier 1 (suffix strip via extended regex): FragPipe DDA/DIA x6,
  PEAKS DDA/DIA ion x8 — remove both sections
- Tier 2 (prefix strip via run_name_cleanup):
  MSAngel Astral, ProlineStudio QEx/Astral — add run_name_cleanup, remove both sections
- Tier 2+3 hybrid: MSAngel QExactive — add run_name_cleanup + run_mapper,
  no condition_mapper (columns don't match raw_file names after prefix strip)
- Tier 3 (hard — run_mapper two-step): WOMBAT x3, Proteome Discoverer x1,
  PEAKS peptidoform x1 — remove condition_mapper, keep run_mapper

Result: zero condition_mapper entries in any per-tool TOML.
run_mapper remains only in 6 files (WOMBAT x3, PD x1, PEAKS peptidoform x1,
MSAngel QExactive x1).

Tests: add TestCleanRunNameExtended (5 tests) and TestTier3ConditionMapper (4 tests).
PEAKS wide-format columns use an LFQ_<sample> prefix that _clean_run_name
cannot strip, so condition_mapper keys never intersected the cleaned column
names and melt() received an empty value_vars set. The run_mapper maps each
LFQ_<sample> key to its bare sample name, enabling Tier-2 resolution in
IntermediateFormatConverter.__init__.

Adds a regression test that synthesizes a PEAKS diaPASEF DataFrame and
verifies both the column overlap and the full convert_to_standard_format path.
Adds test_data_download/ with:
- extract_raw_file_db.py: CLI with catalog/select/download subcommands for
  assembling a reproducible test dataset from ProteoBench result repos
- Makefile: drives the three subcommands in sequence
- marimo/benchmark_analysis.py + index.py: marimo notebooks for per-module
  in-depth analysis; marimo/Makefile exports all 8 quant modules to HTML
- .gitignore entries for generated CSVs, json_dir, HTML exports, and zip archives
- TODO/TODO_creating_test_harness.md: design doc for the test harness
- .pre-commit-config.yaml: exclude test_data_download/ from numpydoc validation
…oteobench/ProteoBench into intermediate_format_interface

# Conflicts:
#	webinterface/pages/base_pages/denovo.py
…downloaded.csv

For each (module, software, version) row in raw_file_db_downloaded.csv, re-runs
the module's benchmarking() against the cached input file and asserts that the
regenerated metrics at min_nr_observed=3 match the values in the original
submission JSON, within pytest.approx(rel=1e-4, abs=1e-6).
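
The tolerance check can be illustrated without pytest. The helper below is a dependency-free stand-in, not the harness code: `metrics_match` is a hypothetical name, and it assumes the comparison accepts a value when either the relative tolerance (scaled by the expected value) or the absolute tolerance is met, which is how `pytest.approx` behaves when both are given.

```python
def metrics_match(regenerated, expected, rel=1e-4, abs_tol=1e-6):
    """Compare two metric dicts key by key within the stated tolerances."""
    if regenerated.keys() != expected.keys():
        return False
    return all(
        abs(regenerated[key] - expected[key])
        <= max(rel * abs(expected[key]), abs_tol)
        for key in expected
    )
```

The absolute tolerance matters mainly for metrics whose expected value is zero or very small, where a purely relative bound would reject any difference.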

Skipped entirely when test_data_download/ is absent so a fresh clone and CI stay
green. Opt in locally with:
    cd test_data_download && make catalog && make select && make download

Accepts known drift via test/catalog_regression_overrides.json — a hash-keyed
file whose metrics override the upstream JSON for specific cases. Seeded with 4
FragPipe (DIA-NN quant) entries where a proteobench 0.13.2 -> 0.15.x parsing
refactor dropped ~1% of precursors consistently; drop the entry once the
upstream Results JSON is resubmitted.
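
One way such a hash-keyed override lookup might work is sketched below. The key derivation (SHA-256 over module, software, and version), the `"metrics"` field, and both function names are assumptions for illustration; the PR text does not say what is hashed or how the JSON is laid out.

```python
import hashlib


def case_key(module, software, version):
    """Hypothetical key: SHA-256 of the identifying triple."""
    return hashlib.sha256(f"{module}|{software}|{version}".encode()).hexdigest()


def expected_metrics(upstream_metrics, overrides, key):
    """Use upstream metrics unless an override entry exists for this key."""
    override = overrides.get(key)
    if override is None:
        return upstream_metrics
    # Overridden values replace upstream ones; other metrics are kept.
    return {**upstream_metrics, **override["metrics"]}
```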

Current run on a populated catalog: 69 pass + 4 pass via override + 7 fail.
The 7 failures are all PEAKS wide-format tools and hit the same
_clean_run_name path-stripping bug that was latent in commit 794d25e's
test_peaks_diapasef_convert_to_standard_format_succeeds: m/z column names
collapse to z, producing duplicate columns that break the subsequent melt().
Root-cause fix in _fix_colnames / _clean_run_name is a separate commit.

Registers a `regression` pytest marker in pyproject.toml so the harness can be
excluded with `pytest -m "not regression"` without touching the test file.
_fix_colnames applied _clean_run_name unchanged to every DataFrame column.
_clean_run_name's path-prefix stripper (a/b/c -> c) then ate legitimate
metadata column names containing '/', most notably PEAKS' 'm/z' column and
its per-sample siblings 'LFQ_<sample> m/z'. All seven variants collapsed to
'z', producing duplicate column names that broke the subsequent melt() with
    AttributeError: 'DataFrame' object has no attribute 'dtype'

Path stripping is correct for raw-file row values (the "Raw file" column)
and for condition_mapper keys supplied in the TOML, which are the other two
call sites — column names are the outlier. Add a `strip_path: bool = True`
parameter to _clean_run_name and call it with strip_path=False from
_fix_colnames. Extension / suffix stripping (.mzML, .raw, " Intensity",
" Normalized Area", etc.) still applies everywhere.

Effect on the suite:
- test/test_peaks_diapasef_regression.py::test_peaks_diapasef_convert_to_standard_format_succeeds
  now passes (was latent-failing since 794d25e).
- test/test_catalog_regression.py: all 7 PEAKS cases now pass
  (dia_diapasef x2, dia_aif, dia_zenotof, dia_astral x2, dda_qexactive).
- No other tests affected: 250 passed, 3 skipped in the non-regression suite
  (was 249 passed, 1 failed, 3 skipped).
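
The `m/z` collapse and the `strip_path` fix described above can be reproduced in a few lines. This is a simplified sketch: the function names are public stand-ins for `_clean_run_name` and `_fix_colnames`, and the suffix regex is assumed rather than copied from the project.

```python
import re

# Assumed suffix pattern; the real one is not shown in this PR text.
_SUFFIX_RE = re.compile(r"(\.mzML|\.raw| Intensity| Normalized Area)$")


def clean_run_name(name, strip_path=True):
    """Strip an optional path prefix (a/b/c -> c), then known suffixes."""
    if strip_path:
        name = re.split(r"[/\\]", name)[-1]
    return _SUFFIX_RE.sub("", name)


def fix_colnames(columns):
    """Column names are not paths, so path stripping is turned off here."""
    return [clean_run_name(col, strip_path=False) for col in columns]


columns = ["m/z", "LFQ_s1 m/z", "s1 Intensity"]
old_behaviour = [clean_run_name(col) for col in columns]  # path stripping on
new_behaviour = fix_colnames(columns)
```

With path stripping on, the first two names both reduce to `z`, which is the duplicate-column situation the commit describes; with `strip_path=False` they survive intact while the ` Intensity` suffix is still removed.
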
The test_data_download/ Makefile and scripts fetch ProteoBench result
JSONs and raw quant files to feed AnnData mapping development. That
work now lives in the sibling anndata_proteomics_bridge repo, so the
tool moves with its consumer.

Also drop the now-redundant TODO/TODO_anndata_mapping.md (superseded by
anndata_proteomics_bridge/docs/toml_schema.md) and ignore jupyter_notebooks/.