Normalize component search tokens for better matching #2424

Open
Mbeaulne wants to merge 2 commits into 06-18-expand_component_search_indexing_fields from 06-18-normalize_component_search_tokens_for_better_matching

Conversation

@Mbeaulne

@Mbeaulne Mbeaulne commented Jun 18, 2026

Collaborator

Description

Improves the component search index by normalizing indexed and query text beyond simple lowercasing. Specifically:

  • Identifier splitting: snake_case, kebab-case, and camelCase component names are split into individual words before indexing, so a query like "train model" matches a component named train-model or train_model, and "load csv file" matches loadCSVFile.
  • Lightweight stemming: A stemToken function reduces common English inflections (plurals in -s and -ies, sibilant plurals in -es, gerunds in -ing, past tense in -ed) to their base forms. Both the original token and its stem are stored in the index, so queries like "training", "datasets", or "batch" match components described with "train", "dataset", or "batches".
  • Normalized query tokenization: The same normalizeSearchText pipeline is applied to query text before scoring, ensuring query tokens and indexed tokens are in the same form.
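A minimal sketch of the identifier-splitting step described above, for illustration only. The function name splitIdentifierText is taken from the review thread, but this body and the identifierWords helper are assumptions, not the code in this PR:

```typescript
// Hypothetical sketch; the real splitIdentifierText in this PR may differ.
function splitIdentifierText(text: string): string {
  return text
    // snake_case and kebab-case: separators become spaces
    .replace(/[_-]+/g, " ")
    // acronym boundary: "CSVFile" -> "CSV File"
    // (single-character groups keep the scan linear on long uppercase runs)
    .replace(/([A-Z])([A-Z][a-z])/g, "$1 $2")
    // lowercase or digit followed by uppercase: "loadCSV" -> "load CSV"
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2")
    // collapse any repeated whitespace introduced above
    .replace(/\s+/g, " ")
    .trim();
}

// Hypothetical helper: lowercased words, as an index or query would see them.
function identifierWords(name: string): string[] {
  return splitIdentifierText(name).toLowerCase().split(" ").filter(Boolean);
}

console.log(identifierWords("train-model")); // [ 'train', 'model' ]
console.log(identifierWords("train_model")); // [ 'train', 'model' ]
console.log(identifierWords("loadCSVFile")); // [ 'load', 'csv', 'file' ]
```

Running the same helper over both the indexed name and the query is what lets "load csv file" line up with loadCSVFile.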

Related Issue and Pull requests

Type of Change

  • Bug fix
  • New feature
  • Improvement
  • Cleanup/Refactor
  • Breaking change
  • Documentation update

Checklist

  • I have tested that this does not break current pipelines / runs functionality
  • I have tested the changes on staging

Screenshots (if applicable)

Test Instructions

  1. Run the existing test suite (componentSearchIndex.test.ts) to verify the new normalization cases pass:
    • Snake/kebab/camelCase names matched by space-separated queries.
    • Stemmed query terms (training, datasets, batch) matching indexed descriptions.
  2. Manually search for components using inflected or hyphenated terms in the UI and confirm relevant results surface.

Additional Comments

The stemmer is intentionally minimal — it handles the most common English suffixes without introducing a full NLP dependency. Both the raw token and its stem are stored so that exact matches are never lost.
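As a rough illustration of what such a minimal stemmer can look like, here is a sketch under assumptions. The name stemToken comes from this PR, but the rule order, the length thresholds, and the expandToken helper are hypothetical, and the stems it produces will not match the PR's implementation in every case:

```typescript
// Hypothetical sketch of a deliberately lossy suffix stemmer.
// This is a heuristic, not a Porter stemmer, and not the PR's exact rules.
function stemToken(token: string): string {
  if (token.length <= 3) return token;
  // -ies plurals: "libraries" -> "library"
  if (token.length > 4 && token.endsWith("ies")) return token.slice(0, -3) + "y";
  // sibilant plurals: "batches" -> "batch", "boxes" -> "box"
  if (/(ch|sh|ss|x|z)es$/.test(token)) return token.slice(0, -2);
  // gerunds: "training" -> "train"
  if (token.length > 5 && token.endsWith("ing")) return token.slice(0, -3);
  // past tense: "loaded" -> "load"
  if (token.length > 4 && token.endsWith("ed")) return token.slice(0, -2);
  // plain plurals, guarding -ss/-is/-us so "process", "axis", "status" survive
  if (token.endsWith("s") && !/(ss|is|us)$/.test(token)) return token.slice(0, -1);
  return token;
}

// Hypothetical helper: keep the raw token and add the stem only if it differs,
// so an exact match on the original form is never lost.
function expandToken(token: string): string[] {
  const stem = stemToken(token);
  return stem === token ? [token] : [token, stem];
}

console.log(expandToken("training")); // [ 'training', 'train' ]
console.log(expandToken("datasets")); // [ 'datasets', 'dataset' ]
console.log(expandToken("batches"));  // [ 'batches', 'batch' ]
console.log(expandToken("status"));   // [ 'status' ]
```

Because rules like these are purely suffix-based, odd stems are expected (this sketch turns "string" into "str"); storing the raw token alongside the stem is what keeps such cases harmless.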

@github-actions

github-actions Bot commented Jun 18, 2026


🎩 Preview

A preview build has been created at: 06-18-normalize_component_search_tokens_for_better_matching/0494d71

4 comment threads on src/services/componentSearchIndex.ts
- splitIdentifierText: anchor first capital group to a single char to remove
  O(n²) regex backtracking on long uppercase runs (behavior-preserving)
- stemToken: guard -is/-us endings so status/analysis/axis aren't over-stemmed

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@camielvs

Collaborator

🤖 Code review — Normalize component search tokens for better matching

Solid normalization layer: splitIdentifierText handles camelCase/PascalCase/snake/kebab (including the acronym boundary CSVFile → CSV File), and stemToken gives lightweight plural/-ing/-ed folding. Crucially, normalization is applied symmetrically to both the index and the query, and the original surface form is always retained alongside the split/stemmed forms — that, plus substring matching, makes the lossy stemmer forgiving. Good tests for the three casing styles and the plural/stem cases.

Two things worth weighing:

  • Token expansion can double-count one concept, skewing scores. normalizeSearchText emits [original, splitText, stem], so the index for training_data contains both training and train. A query for training also expands to [training, train] — and scoreEntry adds the field weight once per token, so both hit:

    • query training → matches training (+5) and train (+5) = 10 on the name field
    • query train → matches train (+5) = 5

    Same component, same conceptual match, 2× score purely because the query used the inflected form. It incidentally rewards exact-inflection matches (arguably nice), but the magnitude (a full extra field weight) is an undesigned side effect, and it only fires for words that have a distinct stem, so it's inconsistent across queries. Worth deciding whether to score per concept (collapse original+stem to one canonical token on one side) rather than per surface token. (The synonym expansion in "Add synonym expansion to component lexical search" #2425 compounds this; see that review.)

  • Index strings grow ~3×. Each field now stores original + split + stemmed text. Combined with the uncapped annotation concatenation in "Expand component search indexing fields" #2423, the per-entry searchable footprint is worth keeping an eye on for large registered libraries. Functionally fine; just a memory note.
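To make the double-counting in the first point concrete, here is a small sketch that contrasts per-token scoring with per-concept scoring. Everything in it is an assumption for illustration: the weight of 5 is taken from the example above, while the toy stem rule, the indexed string, and the function names scorePerToken and scorePerConcept are hypothetical and do not mirror scoreEntry exactly.

```typescript
// Hypothetical sketch; not the scoreEntry implementation from this PR.
const NAME_WEIGHT = 5; // field weight used in the example above

// Toy stem rule covering only the -ing case needed for this illustration.
function toyStem(token: string): string {
  return token.length > 5 && token.endsWith("ing") ? token.slice(0, -3) : token;
}

// One "concept" is a query word plus its stem, when the stem differs.
function expandWord(word: string): string[] {
  const stem = toyStem(word);
  return stem === word ? [word] : [word, stem];
}

function queryWords(query: string): string[] {
  return query.toLowerCase().split(/\s+/).filter(Boolean);
}

// Per-token: every expanded form that hits adds the full field weight.
function scorePerToken(query: string, fieldText: string): number {
  let score = 0;
  for (const word of queryWords(query)) {
    for (const form of expandWord(word)) {
      if (fieldText.includes(form)) score += NAME_WEIGHT;
    }
  }
  return score;
}

// Per-concept: a word scores once if any of its forms hits.
function scorePerConcept(query: string, fieldText: string): number {
  let score = 0;
  for (const word of queryWords(query)) {
    if (expandWord(word).some((form) => fieldText.includes(form))) score += NAME_WEIGHT;
  }
  return score;
}

// Assumed normalized name field for a component called training_data.
const indexedName = "training_data training data train";

console.log(scorePerToken("training", indexedName));   // 10
console.log(scorePerToken("train", indexedName));      // 5
console.log(scorePerConcept("training", indexedName)); // 5
console.log(scorePerConcept("train", indexedName));    // 5
```

Collapsing on the query side, as sketched here, is only one option; deduplicating stems on the index side would presumably have a similar effect on scores.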

Nit: the hand-rolled stemmer is intentionally approximate and produces some odd stems (string → str, processed → proces while process is unchanged). Because the original form is kept and matching is substring-based, these don't break results, but a one-line comment on stemToken noting that it's a deliberately lossy heuristic (not a real Porter stemmer) would save the next reader some confusion.


2 participants