Feat feature customisation #222
Open
JemmaLDaniel wants to merge 2 commits into
Summary

- Adds `edit_distance` to `BeamFeatures`: normalised token-level Levenshtein distance between the top-1 and top-2 beam sequences, with I/L equivalence. Undefined cases (fewer than two candidates, null/empty beam rows, missing sequences) return `1.0`.
- Adds `active_feature_columns` to `ProbabilityCalibrator`: an optional override of which metadata columns feed the sklearn MLP (excluding `confidence`). When unset, behaviour is unchanged: all registered feature columns are used.
- `_beam_len`, null-safe z-score, and `_beam_edit_distance` helper.

This replaces the `revisions` pattern of per-feature `excluded_columns` with calibrator-level selection only. The HF general model already expresses its trained subset via top-level `feature_columns`; this PR makes that pattern work on the sklearn calibrator path. I intend for this PR to be merged before #190, which will then be rebased onto `main`.

Motivation
The HF general model's `config.json` lists 10 feature columns (plus `confidence`, giving 11 model inputs). Beam `edit_distance`, among other features, is computed for diagnostics but excluded from the trained set. Calibrator-level column selection is required to match that layout without `BeamFeatures(excluded_columns=…)` on `main`.

Full HF checkpoint compatibility (safetensors load, `feature_columns` deserialisation from `config.json`) will land with #190; this PR is the sklearn-side prerequisite.

Changes
| File | Change |
| --- | --- |
| `winnow/calibration/features/beam.py` | `edit_distance` column, null-safe beam handling |
| `winnow/calibration/calibrator.py` | `active_feature_columns` ctor arg; `columns` property override |
| `tests/calibration/features/test_beam.py` | `edit_distance` integration tests, edge-case coverage |
| `tests/calibration/test_calibrator.py` | `TestFeatureColumns` for override vs default behaviour |

Follow-ups

- `active_feature_columns` deserialisation on load
- `config.json`: remove dead `excluded_columns`, alter `feature_columns` to `active_feature_columns` and Koina server keys
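
For reviewers, here is a minimal sketch of the `edit_distance` semantics described in the Summary. It is not the code in `beam.py`: the function names are hypothetical, and two details are assumptions on my part, namely that the distance is normalised by the length of the longer sequence and that I/L equivalence applies only to tokens that are exactly `I` or `L`.

```python
from typing import List, Optional, Sequence


def _canon(token: str) -> str:
    # Assumption: only the bare residue token "I" is mapped to "L";
    # modified tokens are left untouched.
    return "L" if token == "I" else token


def _levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    # Classic two-row dynamic programme over tokens (not characters).
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def beam_edit_distance_sketch(
    beams: Optional[Sequence[Optional[Sequence[str]]]],
) -> float:
    # Undefined cases return 1.0: null/empty beam rows,
    # fewer than two candidates, or a missing/empty sequence.
    if not beams or len(beams) < 2:
        return 1.0
    top1, top2 = beams[0], beams[1]
    if not top1 or not top2:
        return 1.0
    a: List[str] = [_canon(t) for t in top1]
    b: List[str] = [_canon(t) for t in top2]
    # Assumption: normalise by the longer sequence so the result lies in [0, 1].
    return _levenshtein(a, b) / max(len(a), len(b))
```

Under these assumptions, two beams that differ only by an I/L swap score `0.0`, and every undefined case scores `1.0`, so a missing runner-up reads as the maximum distance.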