Normalize component search tokens for better matching by Mbeaulne · Pull Request #2424 · TangleML/tangle-ui

Mbeaulne · 2026-06-18T16:56:34Z

Description

Improves the component search index by normalizing indexed and query text beyond simple lowercasing. Specifically:

- Identifier splitting: snake_case, kebab-case, and camelCase component names are split into individual words before indexing, so a query like "train model" matches a component named train-model or train_model, and "load csv file" matches loadCSVFile.
- Lightweight stemming: a stemToken function reduces common English inflections (plurals via -s/-ies, gerunds via -ing, past tense via -ed, sibilant plurals) to their base forms. Both the original token and its stem are stored in the index, so queries like "training", "datasets", or "batch" match components described with "train", "dataset", or "batches".
- Normalized query tokenization: the same normalizeSearchText pipeline is applied to query text before scoring, ensuring query tokens and indexed tokens are in the same form.
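
For readers without the diff at hand, the identifier-splitting step might look roughly like the sketch below. This is an illustration under assumptions, not the PR's code: the body of splitIdentifierText and the helper matchesAllTokens are hypothetical, and only the function name splitIdentifierText comes from this PR.

```typescript
// Hypothetical sketch of identifier splitting; the PR's real implementation may differ.
function splitIdentifierText(text: string): string {
  return text
    // Acronym boundary: "CSVFile" -> "CSV File"
    .replace(/([A-Z])([A-Z][a-z])/g, "$1 $2")
    // Lowercase or digit followed by a capital: "loadCSV" -> "load CSV"
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2")
    // snake_case and kebab-case separators become spaces
    .replace(/[_-]+/g, " ")
    .replace(/\s+/g, " ")
    .trim()
    .toLowerCase();
}

// Hypothetical matcher: every normalized query token must occur in the normalized name.
function matchesAllTokens(query: string, name: string): boolean {
  const haystack = splitIdentifierText(name);
  return splitIdentifierText(query)
    .split(" ")
    .every((token) => haystack.includes(token));
}

console.log(splitIdentifierText("loadCSVFile")); // "load csv file"
console.log(matchesAllTokens("train model", "train-model")); // true
```

Because the same function is applied to both the component name and the query, the two sides end up in the same form before they are compared.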

Related Issue and Pull requests

Type of Change

Checklist

I have tested this does not break current pipelines / runs functionality
I have tested the changes on staging

Screenshots (if applicable)

Test Instructions

Run the existing test suite (componentSearchIndex.test.ts) to verify the new normalization cases pass:
- Snake/kebab/camelCase names matched by space-separated queries.
- Stemmed query terms (training, datasets, batch) matching indexed descriptions.
Manually search for components using inflected or hyphenated terms in the UI and confirm relevant results surface.

Additional Comments

The stemmer is intentionally minimal — it handles the most common English suffixes without introducing a full NLP dependency. Both the raw token and its stem are stored so that exact matches are never lost.
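
A minimal sketch of what "store both the raw token and its stem" could mean in practice. The suffix rules and length thresholds below are assumptions chosen for illustration, and indexTokens is a hypothetical helper; the PR's actual stemToken may produce different stems for some words.

```typescript
// Hypothetical suffix-stripping stemmer; deliberately lossy, not a Porter stemmer.
function stemToken(token: string): string {
  if (token.length <= 3) return token;
  // "queries" -> "query"
  if (token.endsWith("ies") && token.length > 4) return token.slice(0, -3) + "y";
  // Sibilant plurals: "batches" -> "batch", "classes" -> "class"
  if (/(ches|shes|sses|xes|zes)$/.test(token)) return token.slice(0, -2);
  // Gerunds: "training" -> "train"
  if (token.endsWith("ing") && token.length > 5) return token.slice(0, -3);
  // Past tense: "trained" -> "train"
  if (token.endsWith("ed") && token.length > 4) return token.slice(0, -2);
  // Plain plurals, skipping -ss/-is/-us endings such as "status"
  if (token.endsWith("s") && !/(ss|is|us)$/.test(token)) return token.slice(0, -1);
  return token;
}

// Keep the surface form next to its stem so exact matches are never lost.
function indexTokens(text: string): string[] {
  const out = new Set<string>();
  for (const token of text.toLowerCase().split(/\s+/).filter(Boolean)) {
    out.add(token);
    out.add(stemToken(token));
  }
  return Array.from(out);
}

console.log(indexTokens("Training datasets"));
// [ "training", "train", "datasets", "dataset" ]
```

The Set removes the duplicate when a token is its own stem, so unchanged words are stored only once.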

github-actions · 2026-06-18T16:56:44Z

🎩 Preview

A preview build has been created at: 06-18-normalize_component_search_tokens_for_better_matching/0494d71

Mbeaulne · 2026-06-18T16:56:50Z

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite. Learn more about stacking.

- splitIdentifierText: anchor first capital group to a single char to remove O(n²) regex backtracking on long uppercase runs (behavior-preserving)
- stemToken: guard -is/-us endings so status/analysis/axis aren't over-stemmed

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
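
To make the first bullet concrete, here is a sketch of the kind of change it describes. Both patterns are assumptions reconstructed from the commit message, not copied from the diff, and splitWith is a hypothetical helper. The idea is that a greedy first group can rescan a long uppercase run from every start position, while a single-character first group looks at a fixed three-character window and inserts the same space.

```typescript
// Assumed "before" pattern: the first group may span a whole uppercase run.
const greedyBoundary = /([A-Z]+)([A-Z][a-z])/g;
// Assumed "after" pattern: the first group is anchored to a single capital.
const anchoredBoundary = /([A-Z])([A-Z][a-z])/g;

// Insert a space at each acronym boundary found by the given pattern.
function splitWith(pattern: RegExp, text: string): string {
  return text.replace(pattern, "$1 $2");
}

// Includes a long uppercase run with no trailing lowercase letter,
// which is the input shape the commit message is concerned with.
const samples = ["loadCSVFile", "HTTPServerError", "ABcDEf", "A".repeat(2000)];
for (const sample of samples) {
  const same = splitWith(greedyBoundary, sample) === splitWith(anchoredBoundary, sample);
  console.log(sample.slice(0, 20), same); // true for every sample here
}
```

Both patterns place the space before a capital that follows a capital and precedes a lowercase letter, which is why the change can be behavior-preserving.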

camielvs · 2026-06-19T22:04:51Z

🤖 Code review — Normalize component search tokens for better matching

Solid normalization layer: splitIdentifierText handles camelCase/PascalCase/snake/kebab (including the acronym boundary CSVFile → CSV File), and stemToken gives lightweight plural/-ing/-ed folding. Crucially, normalization is applied symmetrically to both the index and the query, and the original surface form is always retained alongside the split/stemmed forms — that, plus substring matching, makes the lossy stemmer forgiving. Good tests for the three casing styles and the plural/stem cases.

Two things worth weighing:

Token expansion can double-count one concept, skewing scores. normalizeSearchText emits [original, splitText, stem], so the index for training_data contains both training and train. A query for training also expands to [training, train], and scoreEntry adds the field weight once per token, so both hit:
- query training → matches training (+5) and train (+5) = 10 on the name field
- query train → matches train (+5) = 5
Same component, same conceptual match, 2× score purely because the query used the inflected form. This incidentally rewards exact-inflection matches (arguably nice), but the magnitude (a full extra field weight) is an undesigned side effect, and it only fires for words that have a distinct stem, so it is inconsistent across queries. Worth deciding whether to score per concept (collapse original+stem to one canonical token on one side) rather than per surface token. (The synonym expansion in #2425, "Add synonym expansion to component lexical search", compounds this; see that review.)
Index strings grow ~3×. Each field now stores original + split + stemmed text. Combined with the uncapped annotation concatenation in #2423 ("Expand component search indexing fields"), the per-entry searchable footprint is worth keeping an eye on for large registered libraries. Functionally fine; just a memory note.
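
The double-counting in the first point can be reproduced with a toy model. Everything below is a hypothetical sketch: the stemmer only strips "ing", expand stands in for normalizeSearchText, and the two scoring functions are illustrations rather than the PR's scoreEntry. Only the field weight of 5 and the training_data example come from the review.

```typescript
const NAME_WEIGHT = 5; // name-field weight quoted in the review

// Toy stemmer for illustration only: strips a trailing "ing".
function stem(token: string): string {
  return token.length > 5 && token.endsWith("ing") ? token.slice(0, -3) : token;
}

// Split text into lowercase words.
function words(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
}

// Surface tokens plus their stems, deduplicated.
function expand(text: string): string[] {
  const out = new Set<string>();
  for (const w of words(text)) {
    out.add(w);
    out.add(stem(w));
  }
  return Array.from(out);
}

// Per-surface-token scoring: each expanded query token that hits adds the weight.
function scorePerToken(query: string, name: string): number {
  const indexed = expand(name).join(" ");
  return expand(query).filter((t) => indexed.includes(t)).length * NAME_WEIGHT;
}

// Per-concept scoring: each query word counts once if its surface form or stem hits.
function scorePerConcept(query: string, name: string): number {
  const indexed = expand(name).join(" ");
  return words(query)
    .filter((w) => indexed.includes(w) || indexed.includes(stem(w))).length * NAME_WEIGHT;
}

console.log(scorePerToken("training", "training_data")); // 10
console.log(scorePerToken("train", "training_data")); // 5
console.log(scorePerConcept("training", "training_data")); // 5
console.log(scorePerConcept("train", "training_data")); // 5
```

Collapsing on the query side, as in scorePerConcept, is only one way to score per concept; collapsing on the index side would be another.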

Nit: the hand-rolled stemmer is intentionally approximate and produces some odd stems (string → str, processed → proces while process is unchanged). Because the original form is kept and matching is substring-based, these don't break results, but a one-line comment on stemToken noting that it is a deliberately lossy heuristic (not a real Porter stemmer) would save the next reader some confusion.

Normalize component search tokens for better matching

e7b76a8

Mbeaulne mentioned this pull request Jun 18, 2026

Expand component search indexing fields #2423

Open

8 tasks

Mbeaulne mentioned this pull request Jun 18, 2026

Add synonym expansion to component lexical search #2425

Open

8 tasks

Mbeaulne wants to merge 2 commits into 06-18-expand_component_search_indexing_fields from 06-18-normalize_component_search_tokens_for_better_matching
