Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 4 additions & 1 deletion .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -153,4 +153,7 @@ jupyter_notebooks/analysis/temp_results/*
uv.lock

# Large files
*.gz
*.gz
# Atomic-write / editor temp files
*.tmp
*.tmp.*
77 changes: 76 additions & 1 deletion CLAUDE.md
Original file line number Diff line number Diff line change
Expand Up @@ -190,7 +190,7 @@ run_name_cleanup = "(?:\\.custom_suffix|\\.raw)$"
- `[species_expected_ratio.<SPECIES>]`: `A_vs_B` (float ratio), `color` (hex)
- `[general]`: `min_count_multispec` (int), `level` ("ion" or "peptidoform")

The `MODULE_TO_CLASS` dict at the bottom of `parse_settings.py` routes module IDs to parser classes: all `quant_lfq_*` modules use `ParseSettingsQuant`, and `denovo_DDA_HCD` uses `ParseSettingsDeNovo`. (The dict contains a benign duplicate `quant_lfq_DDA_ion_Astral` key, and has no `quant_lfq_DIA_peptidoform` entry, matching that module being an unregistered stub.)
The `MODULE_TO_CLASS` dict at the bottom of `parse_settings.py` routes module IDs to parser classes: all `quant_lfq_*` modules use `ParseSettingsQuant`, and `denovo_DDA_HCD` uses `ParseSettingsDeNovo`. (It has no `quant_lfq_DIA_peptidoform` entry, matching that module being an unregistered stub.) The validation layer reuses this dict to infer a module's default validation profile.

`io/parsing/utils.py` provides ProForma fixed-modification helpers: `add_fixed_mod(proforma, mod_name, aas)` and `add_maxquant_fixed_modifications(params, result_perf)`.

Expand Down Expand Up @@ -326,6 +326,78 @@ value = st.session_state[st.session_state[variables.slider_id_uuid]]

Session state serves 4 purposes: widget state (via UUID keys), data cache (DataFrames), plot cache (Plotly figures), and flow control (submit flags, highlight lists, tour state).

### Submission Validation Layer (`proteobench/validation/`)

A framework-agnostic, **registry-driven** validation package that checks uploaded submissions for internal consistency before the public datapoint is created. It returns a structured `ValidationReport` instead of raising generic exceptions. Which checks run is determined by a *validation profile*, so the orchestrator is generic: adding a module of an existing category is config-only, and adding a new category only requires registering a new profile.

**Validation does not block submission.** All findings (including `error`-severity ones) are surfaced in the UI and embedded in the pull-request description via `ValidationReport.summary()` so reviewers see them, but submission always proceeds. The `error`/`warning`/`info` severities only control display prominence and PR inclusion, not gating. (The earlier blocking behavior in `submit_to_repository` was removed.)

**Package layout:**

| File | Contents |
|------|----------|
| `report.py` | `Severity` enum (`error`/`warning`/`info`), `ValidationIssue` dataclass, `ValidationReport` collection |
| `exceptions.py` | `SubmissionValidationError` — wraps a report for programmatic callers |
| `fasta.py` | `FastaReference` — builds the expected protein set from FASTA text, path, bytes, zip/gzip, or URL; parses UniProt `sp|P49327|FAS_HUMAN`, `tr|...`, bare accessions, isoforms |
| `protein_ids.py` | `split_protein_groups()` (`;`/`,` separators), `extract_identifiers()`, `is_decoy_or_contaminant()` |
| `context.py` | `ValidationContext` — bundles all inputs a check might need (`standard_df`, `parameters`, `config`, `fasta`, `input_format`, generic `reference`, `extras`) so every check has the uniform signature `ctx -> list[ValidationIssue]` |
| `config.py` | `ModuleValidationConfig` — column names, contaminant flag, decoy prefixes, FASTA URL, and the resolved `validation_profile`; built via `from_parse_settings(parse_settings_dir, module_id, input_format)` |
| `checks.py` | Pure check functions: `check_protein_ids`, `check_charge_range`, `check_peptide_length`, `check_enzyme`, `check_modifications`, `check_max_modifications`, `check_mass_tolerances`, `check_fdr_psm`, `check_run_consistency` (kept individually unit-testable) |
| `profiles.py` | `Check` (named `ctx`-callable), `ValidationProfile` (ordered list of checks), the profile **registry** (`register_profile`/`unregister_profile`/`get_profile`/`available_profiles`), and the built-in `quant_lfq` and `denovo` profiles |
| `validator.py` | `validate_submission(standard_df, parameters, fasta, config, input_format, profile)` — resolves the profile, builds the context, runs the profile's checks; each check is fault-tolerant (unexpected exceptions become warnings) |

**Profile registry (the extensibility surface).** A `ValidationProfile` is a named, ordered list of `Check`s; checks are reusable across profiles (e.g. `run_consistency` is shared by `quant_lfq` and `denovo`). `validate_submission` looks up `config.validation_profile` in the registry and runs only that profile's checks — there is no `if module_type == ...` branching. To support a brand-new module category: register a profile in `profiles.py` (or from third-party code via `register_profile`) and point the module at it; the orchestrator never changes.

**Profile resolution** (`ModuleValidationConfig.from_parse_settings`, in precedence order):
1. explicit `[validation].profile` in the module's `module_settings.toml` (declarative);
2. inferred from the parser class via the existing `MODULE_TO_CLASS` registry (`ParseSettingsQuant` → `quant_lfq`, `ParseSettingsDeNovo` → `denovo`);
3. `DEFAULT_VALIDATION_PROFILE` (`quant_lfq`).

An unregistered profile name produces a single `unknown_validation_profile` warning and runs nothing (never blocks).

**`ValidationIssue` fields:** `code` (machine-readable), `severity`, `message`, `check`, `field`, `observed`, `expected`, `examples` (up to 20 offending protein identifiers via `MAX_PROTEIN_EXAMPLES`, or up to 10 example rows for the other checks via `MAX_ROW_EXAMPLES`).

**`quant_lfq` profile checks and their default severity** (severity affects display/PR prominence only; nothing blocks):

| Check | Default severity | Notes |
|-------|------------------|-------|
| `protein_ids` (vs FASTA) | ERROR | Skips decoys and contaminants; splits groups; case-insensitive |
| `charge_range` | ERROR | Uses `min_precursor_charge` / `max_precursor_charge` from params |
| `peptide_length` | ERROR | Uses `min_peptide_length` / `max_peptide_length`; counts alpha chars |
| `enzyme` (missed cleavages) | WARNING | Supports trypsin, trypsin/P, Lys-C, Arg-C, Glu-C, chymotrypsin via `_ENZYME_CLEAVAGE_RULES`; N-terminal cleavers (Asp-N/Lys-N) and unknown enzymes skipped as info; heuristic, ignores ragged termini |
| `modifications` (names) | WARNING | Human-readable names in `proforma` vs declared mods; mass/UniMod tokens skipped |
| `max_modifications` | WARNING | Counts bracketed mods per peptidoform vs `max_mods`; upper bound (includes fixed mods written into the sequence) |
| `mass_tolerances` | WARNING | Sanity check of `precursor_mass_tolerance`/`fragment_mass_tolerance` (parseable, positive, plausible ppm/Da); no per-result comparison exists |
| `fdr_psm` | WARNING | `ident_fdr_psm` within `[0,1]` and ≤ `config.recommended_max_fdr_psm` (default 0.01) |
| `run_consistency` (software identity) | ERROR | `params.software_name` vs `input_format` |
| (absent parameter) | WARNING | Each param-dependent check self-reports when its constraint was not parsed; never crashes |

The `denovo` profile currently runs only `run_consistency` plus a `denovo_pending` info placeholder (de novo uses a different standardized schema — `spectrum_id`/`peptide_str`/`aa_scores` — and a ground-truth table rather than a FASTA; content checks are a documented TODO in `profiles.py`).

**Reference FASTA configuration.** Each quant module's `module_settings.toml` contains an optional section:

```toml
[reference_database]
"fasta_url" = "https://proteobench.cubimed.rub.de/fasta/ProteoBenchFASTA_MixedSpecies_HYE.zip"

[validation]
"profile" = "quant_lfq" # optional; usually inferred from the parser class
```

`[reference_database]` is populated for all 9 quant modules (HYE modules share the HYE FASTA; single-cell uses HY; Plasma uses Distler-PYE). The de novo `module_settings.toml` declares `[validation] profile = "denovo"`. If `[reference_database]` is absent the FASTA check is skipped with an info message.

**Integration point.** Validation runs inside `submit_to_repository()` in `webinterface/pages/base_pages/tabs/tab6_submit_results.py`, after the "I really want to upload it" button press and before `create_pull_request()`. The standardized DataFrame is re-derived by rerunning the existing parser on the `input_df` already in session state (no duplicated tool-specific parsing logic). All findings (errors and warnings) are rendered in the UI and appended to the PR description via `ValidationReport.summary()`; none of them block the PR. The local Tab 2 upload path is unaffected.

**Streamlit glue.** `webinterface/pages/base_pages/utils/validation_ui.py` provides:
- `run_submission_validation(variables, ionmodule, user_input, params)` — orchestrates the full flow (re-parse, load FASTA from cache, run checks), returns a `ValidationReport`.
- `_load_fasta_reference(fasta_url, fasta_filename)` — `@st.cache_data` wrapper around `FastaReference.from_url()`.
- `render_validation_report(report)` — renders errors as `st.error`, warnings as `st.warning`, and info items inside a collapsed `st.expander`.

**Documented limitations (intentionally skipped checks):**
- Full enzyme-specificity checks require reference protein sequences (not available); only internal K/R counting is done.
- Cross-tool modification normalization is not implemented (MaxQuant uses human-readable names, DIA-NN UniMod accessions, Sage raw mass dicts); only matching human-readable names against declared mods is attempted.
- Run-level matching (raw-file, sample, experiment) is not possible because `ProteoBenchParameters` does not expose those fields; only software identity is compared.

### Streamlit Coupling in Core Library

Three files in `proteobench/` import Streamlit (coupling violations):
Expand All @@ -347,6 +419,7 @@ Tests in `test/` use pytest. No `conftest.py` (uses defaults). ~74 `def test_` f
- `test/data/quant/quant_lfq_peptidoform_DDA/` - Peptidoform sample files
- `test/data/denovo/` - de novo configs and result files (~8.5MB)
- `test/data/intermediate_files/` - Reference intermediate results
- `test/data/validation/ProteoBench_validation_reference.fasta` - Small 6-protein FASTA fixture for validation unit tests
- `test/params/` - Parameter files for all tools (~124 top-level files, ~130 including `test/params/denovo/`, ~9MB)
- `test/data/quant/quant_lfq_proteingroup_DIA_Astral/` exists but is currently empty; empty legacy dirs `test/data/dda_quant/` and `test/data/dia_quant/` remain.

Expand Down Expand Up @@ -375,6 +448,7 @@ Only two module-level test files exist: `test_module_quant_ion_DDA_QExactive.py`
- `test_plot_quant.py` - `TestPlotDataPoint`
- `test_modules_constants.py` - parametrized check that every `MODULE_SETTINGS_DIRS` entry resolves to an existing directory
- `test_github_repo.py` - tests for `GithubProteobotRepo`
- `test_validation.py` - 63 tests for the submission-validation layer; covers FASTA parsing, protein-ID matching/groups/decoy-contaminant skipping, charge/length violations, missing-parameter warnings, multi-enzyme missed-cleavage checks, modification/max-modification checks, mass-tolerance and PSM-FDR sanity checks, the profile registry (resolution, custom-profile registration, unknown-profile handling, de novo routing), report serialization, and lightweight integration through the real MaxQuant and Sage parsers

### What CI Validates

Expand Down Expand Up @@ -406,6 +480,7 @@ Separate workflow for the webinterface (`test-streamlit.yml`):
### Adding a new benchmark module

1. Create a module class in `modules/quant/` inheriting `QuantModule`, setting `module_id`, the precursor column, and repo names
- For submission validation (see the Submission Validation Layer section): add a `[reference_database] fasta_url = "..."` entry to the new module's `module_settings.toml` (if absent, protein-identifier validation is silently skipped). A quant module auto-resolves to the `quant_lfq` profile; a module of a new category should declare `[validation] profile = "<name>"` and register a matching profile in `proteobench/validation/profiles.py`.
Comment thread
RobbinBouwmeester marked this conversation as resolved.
2. Add the settings directory to `MODULE_SETTINGS_DIRS` in `modules/constants.py`
3. Register the parse settings class in `MODULE_TO_CLASS` in `parse_settings.py`
4. Create TOML configs: `module_settings.toml` + per-tool parse settings in the new directory
Expand Down
19 changes: 18 additions & 1 deletion docs/developer-guide/adding-module.rst
Original file line number Diff line number Diff line change
Expand Up @@ -44,7 +44,7 @@ that allow for a more modular and portable implementation.
Backend
-------

The backend is organized into six main components that you can extend or customize:
The backend is organized into seven main components that you can extend or customize:

**1. Module implementation** - Define how your benchmarking is performed
- For quantification: Subclass :class:`~proteobench.modules.quant.quant_base_module.QuantModule`
Expand Down Expand Up @@ -88,6 +88,17 @@ The backend is organized into six main components that you can extend or customi
- Functions in :file:`proteobench/io/params` parse parameter setting files
- Customize per software tool in :file:`proteobench/io/params/json/`

**7. Submission validation** - Check uploaded submissions for consistency
- Package: :file:`proteobench/validation/` (framework-agnostic, registry-driven)
- A quantification module is configuration-only: add a ``[reference_database]``
section to its ``module_settings.toml`` and the ``quant_lfq`` profile is
resolved automatically from the parser class
- A new module category registers its own *validation profile*
- Validation is non-blocking: findings are shown to the submitter and added to
the pull-request description, but submission always proceeds
- See :doc:`submission-validation` for integrating, extending, and maintaining
the checks

Architecture example
....................

Expand Down Expand Up @@ -1053,3 +1064,9 @@ a new type of module:
``REPO_MODULE_REGISTRY`` with the format:
``"Results_repo_name": ("module_id", ModuleClass, path_to_params_json)``
This enables automatic datapoint reprocessing for your module.
12. **Configure submission validation.** For a quantification module, add a
``[reference_database]`` section (with ``fasta_url``) to the module's
``module_settings.toml``; the ``quant_lfq`` profile is resolved automatically.
For a new module category, register a *validation profile* and point the
module at it via ``[validation] profile = "<name>"``.
See :doc:`submission-validation`.
3 changes: 2 additions & 1 deletion docs/developer-guide/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -20,7 +20,8 @@ Developer guide

repo_layout
adding-module

submission-validation

modules/index

.. toctree::
Expand Down
Loading
Loading