Feat feature customisation #222

Open
JemmaLDaniel wants to merge 2 commits into fix/calibration-bugfixes from feat-hf-feature-customisation

@JemmaLDaniel JemmaLDaniel commented Jul 1, 2026

Collaborator

Summary

  • Add edit_distance to BeamFeatures: the normalised token-level Levenshtein distance between the top-1 and top-2 beam sequences, with I and L treated as equivalent. Undefined cases (fewer than two candidates, null/empty beam rows, missing sequences) return 1.0.
  • Add active_feature_columns to ProbabilityCalibrator: optional override of which metadata columns feed the sklearn MLP (excluding confidence). When unset, behaviour is unchanged: all registered feature columns are used.
  • Harden beam computation for partial inputs: _beam_len, null-safe z-score, and _beam_edit_distance helper.
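
As a rough illustration of the edit_distance feature described above, here is a minimal sketch, not the code in beam.py. The function names, the choice to normalise by the longer sequence length, and the representation of a beam as a list of token lists are all assumptions made for this example; the PR only states that the distance is normalised, token-level, I/L-equivalent, and 1.0 when undefined.

```python
from typing import Optional, Sequence


def _canonical(token: str) -> str:
    """Map isoleucine to leucine so the two residues compare equal."""
    return "L" if token == "I" else token


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Token-level Levenshtein distance, keeping only one table row at a time."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if _canonical(tok_a) == _canonical(tok_b) else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def beam_edit_distance(
    beams: Optional[Sequence[Optional[Sequence[str]]]],
) -> float:
    """Normalised top-1 vs top-2 distance; 1.0 whenever it is undefined."""
    if not beams or len(beams) < 2:
        return 1.0  # null row or fewer than two candidates
    top1, top2 = beams[0], beams[1]
    if not top1 or not top2:
        return 1.0  # missing or empty sequence
    # Assumption: normalise by the longer sequence so the value lies in [0, 1].
    return levenshtein(top1, top2) / max(len(top1), len(top2))
```

Under these assumptions, two candidates that differ only by an I/L swap score 0.0, while a beam with a single candidate scores 1.0.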

This replaces the revisions pattern of per-feature excluded_columns with calibrator-level selection only. The HF general model already expresses its trained subset via top-level feature_columns; this PR makes that pattern work on the sklearn calibrator path. I intend for this PR to be merged before #190, which will then be rebased onto main.
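
The calibrator-level selection described here could look roughly like the sketch below. The class is a hypothetical stand-in, not ProbabilityCalibrator itself: the register method, the error raised for unknown column names, and the column names other than edit_distance are assumptions for illustration. Only the active_feature_columns argument, the columns property, and the "unset means all registered columns" default come from the PR description.

```python
from typing import List, Optional, Sequence


class ColumnSelectingCalibrator:
    """Hypothetical stand-in showing only the column-selection logic."""

    def __init__(
        self, active_feature_columns: Optional[Sequence[str]] = None
    ) -> None:
        self._registered: List[str] = []
        # None means "no override": fall back to every registered column.
        self._active: Optional[List[str]] = (
            list(active_feature_columns)
            if active_feature_columns is not None
            else None
        )

    def register(self, columns: Sequence[str]) -> None:
        """Record the columns contributed by one feature (assumed API)."""
        self._registered.extend(columns)

    @property
    def columns(self) -> List[str]:
        """Feature columns fed to the model, excluding confidence."""
        if self._active is None:
            return list(self._registered)
        unknown = [c for c in self._active if c not in self._registered]
        if unknown:
            # Assumption: an override naming an unregistered column is an error.
            raise ValueError(f"Unregistered feature columns: {unknown}")
        return list(self._active)


# Column names other than edit_distance are placeholders.
calibrator = ColumnSelectingCalibrator(active_feature_columns=["margin"])
calibrator.register(["margin", "edit_distance"])
print(calibrator.columns)  # edit_distance is still computed, just not selected
```

Keeping the selection on the calibrator means each feature keeps computing all of its columns for diagnostics, and only the model input is narrowed.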


Motivation

The HF general model config.json lists 10 feature columns (plus confidence, giving 11 model inputs). Beam edit_distance, among other features, is computed for diagnostics but excluded from the trained set. Calibrator-level column selection is required to match that layout without BeamFeatures(excluded_columns=…) on main.
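
To make the "10 feature columns plus confidence" arithmetic concrete, the sketch below derives the model input layout from a config-like JSON string. The helper name, the placeholder column names, and the choice to put confidence first are assumptions; the PR only states that config.json carries a top-level feature_columns list and that confidence is added on top of it.

```python
import json
from typing import List


def model_input_columns(
    config_text: str, confidence_column: str = "confidence"
) -> List[str]:
    """Return the model inputs: confidence plus the configured feature columns."""
    config = json.loads(config_text)
    feature_columns = list(config["feature_columns"])
    if confidence_column in feature_columns:
        raise ValueError("confidence must not be listed as a feature column")
    # Assumption: confidence comes first; the real ordering may differ.
    return [confidence_column, *feature_columns]


# Ten placeholder names standing in for the trained feature subset.
example_config = json.dumps({"feature_columns": [f"feature_{i}" for i in range(10)]})
inputs = model_input_columns(example_config)
print(len(inputs))  # 10 feature columns + confidence
```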

Full HF checkpoint compatibility (safetensors load, feature_columns deserialisation from config.json) will land with #190; this PR is the sklearn-side prerequisite.


Changes

| File | Change |
|---|---|
| winnow/calibration/features/beam.py | Levenshtein helpers, edit_distance column, null-safe beam handling |
| winnow/calibration/calibrator.py | active_feature_columns ctor arg; columns property override |
| tests/calibration/features/test_beam.py | Levenshtein unit tests, edit_distance integration tests, edge-case coverage |
| tests/calibration/test_calibrator.py | TestFeatureColumns for override vs default behaviour |

Follow-ups

@JemmaLDaniel JemmaLDaniel self-assigned this Jul 1, 2026
@JemmaLDaniel JemmaLDaniel added the enhancement New feature or request label Jul 1, 2026
@JemmaLDaniel JemmaLDaniel changed the title Feat hf feature customisation Feat feature customisation Jul 1, 2026

github-actions Bot commented Jul 1, 2026


Coverage

Coverage Report
| File | Stmts | Miss | Cover | Missing |
|---|---|---|---|---|
| __init__.py | 0 | 0 | 100% | |
| data_types.py | 4 | 0 | 100% | |
| calibration/__init__.py | 0 | 0 | 100% | |
| calibration/calibration_features.py | 9 | 0 | 100% | |
| calibration/calibrator.py | 109 | 7 | 93% | 77, 116, 143–144, 146, 172, 177 |
| calibration/diagnostics.py | 156 | 47 | 69% | 74, 104, 108, 130, 192–207, 250–251, 255, 296, 298–313, 324–330 |
| calibration/features/__init__.py | 10 | 0 | 100% | |
| calibration/features/base.py | 8 | 0 | 100% | |
| calibration/features/beam.py | 91 | 0 | 100% | |
| calibration/features/chimeric.py | 82 | 1 | 98% | 213 |
| calibration/features/constants.py | 9 | 0 | 100% | |
| calibration/features/fragment_match.py | 78 | 1 | 98% | 203 |
| calibration/features/mass_error.py | 68 | 2 | 97% | 17, 21 |
| calibration/features/retention_time.py | 160 | 14 | 91% | 110, 113–114, 195, 202, 218, 269–271, 281, 284–285, 290–291 |
| calibration/features/sequence.py | 19 | 0 | 100% | |
| calibration/features/token_score.py | 37 | 1 | 97% | 82 |
| calibration/features/utils.py | 261 | 3 | 98% | 96, 368, 594 |
| compat/__init__.py | 0 | 0 | 100% | |
| compat/instanovo.py | 10 | 6 | 40% | 12, 14–15, 17, 24–25 |
| datasets/__init__.py | 0 | 0 | 100% | |
| datasets/calibration_dataset.py | 113 | 18 | 84% | 155, 169, 171, 173, 180, 189, 202, 255, 257–258, 264–267, 269–272 |
| datasets/interfaces.py | 3 | 0 | 100% | |
| datasets/psm_dataset.py | 25 | 0 | 100% | |
| datasets/data_loaders/__init__.py | 5 | 0 | 100% | |
| datasets/data_loaders/instanovo.py | 106 | 16 | 84% | 90, 93, 119, 142, 168–169, 172–174, 176–177, 179, 182–184, 191 |
| datasets/data_loaders/mztab.py | 179 | 32 | 82% | 103, 106, 157, 161, 210–211, 223, 268, 271, 283–284, 296–298, 300–301, 303, 305, 311, 314–315, 322–324, 331, 333–336, 340, 472, 479 |
| datasets/data_loaders/pointnovo.py | 7 | 0 | 100% | |
| datasets/data_loaders/utils.py | 202 | 6 | 97% | 14, 145–148, 216 |
| datasets/data_loaders/winnow.py | 44 | 4 | 90% | 55–56, 101–102 |
| fdr/__init__.py | 0 | 0 | 100% | |
| fdr/base.py | 58 | 15 | 74% | 81, 85–86, 91, 98–99, 105, 126, 129–130, 135, 137–138, 144, 186 |
| fdr/database_grounded.py | 25 | 0 | 100% | |
| fdr/nonparametric.py | 25 | 4 | 84% | 62, 68–69, 72 |
| scripts/__init__.py | 0 | 0 | 100% | |
| scripts/main.py | 260 | 260 | 0% | 8, 10–13, 16–20, 23–24, 26–28, 32, 39, 44, 47, 53, 55–56, 59, 68, 70–73, 75, 80, 87, 89–91, 93, 95–100, 103, 105–106, 111, 126, 129, 136–142, 145–146, 149, 162–164, 167, 170, 175, 177–179, 181, 183–184, 187–188, 191, 193–194, 196, 198, 200–201, 203, 206–207, 210–211, 214–215, 218–220, 222–225, 228–230, 232, 235, 249–251, 253, 255, 260, 262–264, 266–267, 269, 271–272, 274–276, 278, 280, 282–283, 287–290, 292–293, 295–296, 298–299, 301, 304, 318–320, 323, 326, 331, 333–335, 337–339, 341–342, 345–346, 349, 351–352, 354, 356, 358–359, 361, 364–365, 371–373, 375–378, 381–382, 385–386, 389–390, 393–394, 402–404, 408, 411, 415, 418, 424–426, 428–429, 436–437, 439, 441, 446, 448–450, 452–453, 456, 458–459, 461–464, 466–467, 469–470, 472–474, 480–481, 485–486, 489, 496, 501–502, 507–509, 512, 517, 527, 534, 536, 540, 542–543, 547–548, 551, 574, 587–588, 591, 613, 625–626, 629, 654, 667–668, 671, 686, 698–699, 702, 717, 729–730, 733, 745, 757–758, 761, 776, 788–789, 792, 801, 813–814 |
| utils/__init__.py | 4 | 0 | 100% | |
| utils/config_formatter.py | 53 | 40 | 24% | 29, 37–38, 40–42, 44, 55, 58–60, 62–63, 66–69, 72–74, 77–78, 80, 91, 102, 113, 127–128, 130–132, 145–147, 150, 153–154, 157–158, 160 |
| utils/config_path.py | 76 | 5 | 93% | 24–26, 117–118 |
| utils/peptide.py | 25 | 1 | 96% | 32 |
| TOTAL | 2321 | 483 | 79% | |

| Tests | Skipped | Failures | Errors | Time |
|---|---|---|---|---|
| 506 | 0 💤 | 0 ❌ | 0 🔥 | 38.203s ⏱️ |
