Improve adduct/isotope generation [WIP] #315

Open
joewandy wants to merge 8 commits into master from codex/review-adduct-and-isotope-implementation

@joewandy joewandy commented Jan 27, 2026

Summary

This PR improves the realism and configurability of simulated isotope envelopes and adduct mixtures. It replaces the previous carbon-only isotope approximation with a multi-element approach based on natural isotope abundances, adds configurable adduct priors/profiles for different ionisation regimes, and introduces a small deisotoping utility that we use in tests to validate the generated patterns.

Weaknesses in the previous implementation

Previously, isotope generation used a carbon-only binomial shortcut, so it could not model common heteroatom isotope patterns (for example chlorine/bromine M+2 and sulfur fine structure). Adduct proportions were generated using placeholder heuristics and had limited negative-mode coverage. There was also no lightweight deisotoping routine available to sanity-check that generated isotope envelopes behave as expected.

What we changed

Isotope generation

Isotope generation now approximates the full-formula isotope envelope by convolving per-element natural isotope distributions (NATURAL_ISOTOPES).

  • Build per-element (mass_shift, abundance) distributions relative to each element’s monoisotope.
  • Raise each element distribution to the element count via repeated convolution (exponentiation-by-squaring).
  • Merge/prune/cap intermediate states (mass_precision, min_prob, max_states) to keep runtime and memory bounded.
  • Convolve all elements, then keep peaks up to total_proportion (or max_peaks) and renormalise.
  • Always preserve the monoisotopic (zero-shift) peak as isotopes[0] because downstream code assumes that invariant.

Implementation: Isotopes._get_isotope_distribution / _power_distribution / _convolve_distributions in vimms/Chemicals.py.
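
For reviewers, here is a minimal standalone sketch of the convolution idea described above. It is not the code in vimms/Chemicals.py: the three-element ISOTOPE_TABLE, the function names (convolve, power, isotope_envelope) and the default values are illustrative assumptions, and only the parameter names mass_precision, min_prob, max_states, total_proportion and max_peaks are taken from this description.

```python
from collections import defaultdict

# Illustrative subset of natural isotope data (an assumption, not vimms' table):
# element -> [(mass shift from the monoisotope in Da, natural abundance)]
ISOTOPE_TABLE = {
    "C": [(0.0, 0.9893), (1.003355, 0.0107)],
    "H": [(0.0, 0.999885), (1.006277, 0.000115)],
    "Cl": [(0.0, 0.7576), (1.997050, 0.2424)],
}


def convolve(a, b, mass_precision=4, min_prob=1e-8, max_states=200):
    """Convolve two (shift, prob) lists, then merge, prune and cap the states."""
    merged = defaultdict(float)
    for shift_a, prob_a in a:
        for shift_b, prob_b in b:
            # Rounding merges states whose shifts agree to mass_precision decimals.
            merged[round(shift_a + shift_b, mass_precision)] += prob_a * prob_b
    kept = sorted(
        ((s, p) for s, p in merged.items() if p >= min_prob),
        key=lambda sp: sp[1],
        reverse=True,
    )[:max_states]
    # Never lose the zero-shift (monoisotopic) state, however small it is.
    if all(s != 0.0 for s, _ in kept):
        kept.append((0.0, merged[0.0]))
    return sorted(kept)


def power(dist, count):
    """Distribution for `count` atoms of one element (exponentiation by squaring)."""
    result, base = [(0.0, 1.0)], dist
    while count > 0:
        if count & 1:
            result = convolve(result, base)
        count >>= 1
        if count:
            base = convolve(base, base)
    return result


def isotope_envelope(formula_counts, total_proportion=0.99, max_peaks=10):
    """Return [(shift, proportion)] with the monoisotopic peak always first."""
    dist = [(0.0, 1.0)]
    for element, count in formula_counts.items():
        dist = convolve(dist, power(ISOTOPE_TABLE[element], count))
    mono = next(sp for sp in dist if sp[0] == 0.0)
    others = sorted((sp for sp in dist if sp[0] != 0.0),
                    key=lambda sp: sp[1], reverse=True)
    kept, covered = [mono], mono[1]
    for shift, prob in others:
        if covered >= total_proportion or len(kept) >= max_peaks:
            break
        kept.append((shift, prob))
        covered += prob
    total = sum(p for _, p in kept)
    # Renormalise; keep the monoisotope at index 0, the rest ordered by shift.
    return [(s, p / total) for s, p in [kept[0]] + sorted(kept[1:])]
```

With these assumed abundances, isotope_envelope({"C": 2, "H": 5, "Cl": 1}) keeps a peak near +1.997 Da whose height relative to the monoisotope is 0.2424/0.7576 (about 0.32), which is the chlorine M+2 pattern the carbon-only shortcut could not produce.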

Adduct generation

Adduct generation now samples adduct weights from a Dirichlet distribution with configurable priors and presets.

  • Defaults come from ADDUCT_NAMES_POS/NEG and prior presets from ADDUCT_PRIOR_POS/NEG + ADDUCT_PROFILE_PRESETS.
  • Sample weights ~ Dirichlet(prior * adduct_concentration).
  • Apply adduct_proportion_cutoff; if all weights are cut, pick the single most likely adduct.
  • Otherwise, rescale by max(weights) so the dominant adduct has weight 1.0 (preserving the historical “scale by max” semantics).

Implementation: profile resolution in Adducts.__init__ and sampling in Adducts._get_adduct_proportions in vimms/Chemicals.py.
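
The sampling steps above can be sketched roughly as follows. This is an illustration rather than the body of Adducts._get_adduct_proportions: the function name, argument names, default values and the example prior are assumptions, and "most likely adduct" is interpreted here as the adduct with the largest sampled weight.

```python
import numpy as np


def sample_adduct_proportions(names, prior, adduct_concentration=10.0,
                              adduct_proportion_cutoff=0.05, rng=None):
    """Sample adduct weights from a Dirichlet prior and rescale by the maximum."""
    rng = rng if rng is not None else np.random.default_rng()
    prior = np.asarray(prior, dtype=float)
    prior = prior / prior.sum()  # treat the prior as relative weights
    # A larger adduct_concentration keeps samples closer to the prior.
    weights = rng.dirichlet(prior * adduct_concentration)
    keep = weights >= adduct_proportion_cutoff
    if not keep.any():
        # Everything was cut: fall back to one adduct (assumed: largest sample).
        return {names[int(np.argmax(weights))]: 1.0}
    top = weights[keep].max()
    # Dominant adduct gets weight 1.0; the others are expressed relative to it.
    return {name: float(w / top)
            for name, w, kept in zip(names, weights, keep) if kept}


# Example with a made-up positive-mode prior (not the values in ADDUCT_PRIOR_POS).
proportions = sample_adduct_proportions(
    ["M+H", "M+Na", "M+K"], [0.7, 0.2, 0.1], rng=np.random.default_rng(0)
)
```

Because Dirichlet weights sum to 1, the fallback branch is only reached when the cutoff is larger than the biggest sampled weight.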

We also expanded common negative-mode adduct coverage and removed the default [M+NH3]+H entry because it has the same mass shift as M+NH4 and would otherwise duplicate signal at the same m/z.
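
A quick arithmetic check of why the two entries coincide, using standard monoisotopic masses. The variable names are illustrative, and how vimms encodes adduct mass shifts internally is not shown here.

```python
# Monoisotopic masses in Da (standard values, rounded).
H = 1.00782503
N = 14.0030740
ELECTRON = 0.00054858
PROTON = H - ELECTRON

# [M+NH3]+H: a neutral NH3 plus a proton.
shift_nh3_plus_h = (N + 3 * H) + PROTON
# M+NH4: an NH4 group minus one electron for the positive charge.
shift_nh4 = (N + 4 * H) - ELECTRON

print(round(shift_nh3_plus_h, 4), round(shift_nh4, 4))
```

Both shifts come out at about 18.0338 Da, so keeping both entries would place two adduct peaks at the same m/z.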

Deisotoping

Deisotoping is implemented in vimms/Deisotoping.py as a small utility used in tests to validate generated isotope patterns.

  • Sort peaks by m/z and walk from low to high.
  • For each unassigned peak, guess the charge z by checking which candidate charge best matches the expected 13C spacing of about 1.00336/z in m/z (within ppm_tolerance).
  • Grow a cluster by linking peaks that are within tolerance of any expected single-isotope mass difference (includes fine-structure differences derived from NATURAL_ISOTOPES).
  • Emit a cluster only if it has at least min_isotopes peaks, and mark its peaks as assigned.

Implementation: Deisotoper.deisotope / _guess_charge / _grow_cluster in vimms/Deisotoping.py.
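
A simplified sketch of this walk, for illustration only. It links peaks using the 13C spacing alone and omits the fine-structure differences derived from NATURAL_ISOTOPES, and the function names, the max_charge parameter, the default values and the returned dictionary layout are assumptions rather than the API of vimms/Deisotoping.py.

```python
C13_SPACING = 1.0033548  # mass difference between 13C and 12C in Da


def ppm_error(observed, expected):
    return abs(observed - expected) / expected * 1e6


def guess_charge(mz, later_mzs, ppm_tolerance, max_charge):
    """Return the charge whose 13C spacing is matched best, or None."""
    best_charge, best_err = None, None
    for charge in range(1, max_charge + 1):
        target = mz + C13_SPACING / charge
        for other in later_mzs:
            err = ppm_error(other, target)
            if err <= ppm_tolerance and (best_err is None or err < best_err):
                best_charge, best_err = charge, err
    return best_charge


def deisotope(peaks, ppm_tolerance=10.0, min_isotopes=2, max_charge=3):
    """Group (mz, intensity) peaks into isotope clusters, walking low to high m/z."""
    peaks = sorted(peaks)
    assigned = [False] * len(peaks)
    clusters = []
    for i, (mz, _) in enumerate(peaks):
        if assigned[i]:
            continue
        later = [peaks[j][0] for j in range(i + 1, len(peaks)) if not assigned[j]]
        charge = guess_charge(mz, later, ppm_tolerance, max_charge)
        if charge is None:
            continue
        # Grow the cluster one isotope step at a time from the last linked peak.
        members, last_mz = [i], mz
        for j in range(i + 1, len(peaks)):
            if assigned[j]:
                continue
            target = last_mz + C13_SPACING / charge
            if ppm_error(peaks[j][0], target) <= ppm_tolerance:
                members.append(j)
                last_mz = peaks[j][0]
        if len(members) >= min_isotopes:
            for j in members:
                assigned[j] = True
            clusters.append({
                "mono_mz": mz,
                "charge": charge,
                "peaks": [peaks[j] for j in members],
            })
    return clusters
```

For example, given peaks at m/z 100.0, 101.0033548 and 102.0067096 plus an unrelated peak at 150.5, this sketch emits one singly charged cluster with monoisotopic m/z 100.0 and leaves the unrelated peak unassigned.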

How we know it’s correct

We added tests/test_deisotoping.py to cover multi-element isotope generation and deisotoping end-to-end. The tests verify that isotope envelopes are ordered and normalised, that chlorine-containing formulae produce an M+2 pattern, that the monoisotopic peak is preserved under aggressive filtering, and that the deisotoper recovers the expected monoisotopic m/z from adducted isotope peaks.

I ran the full test suite locally and it passes (pytest: 119 tests).

Notes

UnknownChemical behaviour is unchanged. UnknownChemical instances are typically created from ROI-picked peaks during re-simulation of an existing mzML; those peaks may already correspond to a specific isotope/adduct peak (not a full compound) or may simply be noise. If we generated extra isotopes/adducts from them we’d be inventing correlated peaks and potentially double-counting signal, so we continue to treat UnknownChemical as a single-peak representation.

@joewandy changed the title from "Make isotope/adduct modeling configurable and add improved deisotoper" to "Improve adduct/isotope generation [WIP]" on Jan 27, 2026

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 36ddc497a3
