Expand component search indexing fields #2423

Open
Mbeaulne wants to merge 1 commit into master from 06-18-expand_component_search_indexing_fields

@Mbeaulne (Collaborator) commented Jun 18, 2026

Description

Expands the component search index to include richer input/output details and a new metadata match field.

Previously, the io searchable field only contained input and output names. It now includes descriptions, types, and annotations for each input and output spec. A new metadata field has been added that indexes component-level metadata annotations (with a blocklist for noisy keys like python_original_code, editor state, and similar large/irrelevant blobs) as well as the source label and published_by value from the component reference.
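As a rough illustration of the expanded io field, the sketch below concatenates names, descriptions, types, and annotation values for every input and output. The `IoSpec` shape and the `toSearchText` and `buildIoText` helpers are hypothetical stand-ins, not the code in this PR.

```typescript
// Hypothetical spec shape; the real input/output spec type may differ.
interface IoSpec {
  name: string;
  type?: unknown; // a plain string such as "parquet", or a structured type
  description?: string;
  annotations?: Record<string, unknown>;
}

// Render a spec value as text; structured values fall back to JSON.
function toSearchText(value: unknown): string {
  if (value === null || value === undefined) return "";
  if (typeof value === "string") return value;
  return JSON.stringify(value) ?? "";
}

// Build the searchable io text from all input and output specs.
function buildIoText(inputs: IoSpec[], outputs: IoSpec[]): string {
  const parts: string[] = [];
  for (const spec of [...inputs, ...outputs]) {
    parts.push(spec.name, spec.description ?? "", toSearchText(spec.type));
    for (const value of Object.values(spec.annotations ?? {})) {
      parts.push(toSearchText(value));
    }
  }
  // Drop empty fragments so the joined text has no stray separators.
  return parts.filter((part) => part.trim().length > 0).join(" ");
}
```

With this shape, a query such as parquet can hit the io text even when no input or output is named parquet.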

The MatchField type and all related scoring, labeling, and UI display logic have been updated to include metadata alongside the existing fields. Annotation values longer than 500 characters are excluded from indexing to avoid polluting search with large blobs.
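A minimal sketch of the two filters described above (key blocklist plus per-value length cap), for readers who want the behavior spelled out. The identifier names mirror those mentioned in this thread, but the bodies are assumptions: the real blocklist contents beyond the keys named here are not shown in this PR text, `editor.position` is a made-up placeholder key, and indexing values only (not keys) is also an assumption.

```typescript
// Keys named in this thread, plus one hypothetical placeholder for editor state.
const ANNOTATION_KEYS_EXCLUDED_FROM_SEARCH = new Set<string>([
  "python_original_code",
  "python_dependencies",
  "editor.position", // hypothetical key name, for illustration only
]);

const MAX_ANNOTATION_TEXT_LENGTH = 500;

// Assumed behavior: primitives are stringified directly, objects via JSON.
function stringifySearchValue(value: unknown): string {
  if (value === null || value === undefined) return "";
  if (typeof value === "string") return value;
  if (typeof value === "number" || typeof value === "boolean") return String(value);
  return JSON.stringify(value) ?? "";
}

// Collect annotation values that are neither blocklisted nor too long.
function extractAnnotationsText(annotations?: Record<string, unknown>): string {
  if (!annotations) return "";
  const parts: string[] = [];
  for (const [key, value] of Object.entries(annotations)) {
    if (ANNOTATION_KEYS_EXCLUDED_FROM_SEARCH.has(key)) continue;
    const text = stringifySearchValue(value).trim();
    // Over-long values are skipped entirely rather than truncated.
    if (text.length === 0 || text.length > MAX_ANNOTATION_TEXT_LENGTH) continue;
    parts.push(text);
  }
  return parts.join(" ");
}
```

Skipping an over-long value instead of truncating it follows the wording above ("excluded from indexing"); truncation would be an equally plausible design.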

Related Issue and Pull requests

Type of Change

  • Bug fix
  • New feature
  • Improvement
  • Cleanup/Refactor
  • Breaking change
  • Documentation update

Checklist

  • I have tested this does not break current pipelines / runs functionality
  • I have tested the changes on staging

Screenshots (if applicable)

Test Instructions

  1. Open the component dashboard and search for a term that appears in a component's metadata annotations (e.g. a framework name like sklearn or lightgbm).
  2. Verify the result surfaces with metadata listed as a matched field.
  3. Search for a publisher email address and confirm the matching component appears.
  4. Search for a term that exists only in python_original_code or other excluded annotation keys and confirm it does not return results.
  5. Search for an input/output description or type (e.g. parquet, artifact) and confirm results appear with io as the matched field.
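The expectations in steps 2, 4, and 5 can be sketched as a small check over per-field searchable text. This is an assumption-laden illustration: the `MatchField` union here is a stand-in (only io and metadata are named in this PR), `findMatchedFields` is hypothetical, and the real scoring and weighting logic is not reproduced.

```typescript
// Stand-in union; the real MatchField type likely has more members.
type MatchField = "name" | "io" | "metadata";

type SearchableFields = Record<MatchField, string>;

// Return every field whose text contains the query, case-insensitively.
function findMatchedFields(fields: SearchableFields, query: string): MatchField[] {
  const needle = query.trim().toLowerCase();
  if (needle.length === 0) return [];
  return (Object.keys(fields) as MatchField[]).filter((field) =>
    fields[field].toLowerCase().includes(needle),
  );
}

// Example component text; excluded annotation content never reaches these strings.
const exampleComponent: SearchableFields = {
  name: "Train model",
  io: "training_data parquet model artifact",
  metadata: "sklearn user alice@example.com",
};

const metadataHit = findMatchedFields(exampleComponent, "sklearn");
const ioHit = findMatchedFields(exampleComponent, "parquet");
const excludedMiss = findMatchedFields(exampleComponent, "import lightgbm");
```

Under these assumptions, `metadataHit` contains only metadata, `ioHit` contains only io, and `excludedMiss` is empty.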

Additional Comments

The annotation exclusion list (ANNOTATION_KEYS_EXCLUDED_FROM_SEARCH) and the 500-character value length cap are the primary mechanisms for keeping the metadata index clean. These can be extended as new noisy annotation keys are identified.

@github-actions

🎩 Preview

A preview build has been created at: 06-18-expand_component_search_indexing_fields/327681f

@camielvs (Collaborator)

🤖 Code review — Expand component search indexing fields

Reviewed as the base of the AI-search stack (#2423 → #2433). Overall this is a clean, well-tested expansion: input/output descriptions, types, and annotations now feed the io field, and a new metadata field indexes component-level annotations, the source label, and published_by. The excluded-key set and the 500-char per-value cap sensibly keep large blobs (python_original_code, editor positions) out. The extractAnnotationsText / stringifySearchValue helpers are tidy and the test coverage matches the new behavior well.

A few small things worth a look — none blocking:

  • source.label indexed into metadata makes generic tokens match broadly. Because the source label is folded into the searchable metadata text, typing user, standard, or published now matches every component from that source (at weight 1). On short queries this can inject noise. The ComponentSearchSource.id doc comment already anticipates "future filter chips / URL state" — source feels more like a filter facet than a free-text token. Worth confirming this is the intended UX.

  • No aggregate cap on annotation text. extractAnnotationsText caps each value at MAX_ANNOTATION_TEXT_LENGTH (500) but there's no bound on the number of annotations concatenated. A component with many sub-500-char annotations could produce a large searchable string. Bounded in practice for real components, but a total-length guard would make the index size predictable.

  • published_by is an email and is now searchable. It's already surfaced in the UI (ComponentHistoryTimeline, ComponentItem), so this is consistent rather than a new disclosure — just flagging that searching by author email is now possible by design.

  • Question: python_dependencies is in the excluded-keys set. Users sometimes search by library/dependency ("tensorflow", "lightgbm"). Implementation text (image + command) covers some of this, but was excluding dependencies a deliberate signal/size tradeoff?
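For the aggregate-cap point above, one possible shape for a total-length guard is sketched here. `MAX_TOTAL_ANNOTATION_TEXT_LENGTH` and `joinWithinBudget` are hypothetical names, and the 4000-character budget is an arbitrary illustrative value, not a recommendation.

```typescript
// Hypothetical overall budget for the concatenated annotation text.
const MAX_TOTAL_ANNOTATION_TEXT_LENGTH = 4000;

// Join values with spaces, stopping before the total would exceed the budget.
function joinWithinBudget(
  values: string[],
  budget: number = MAX_TOTAL_ANNOTATION_TEXT_LENGTH,
): string {
  const parts: string[] = [];
  let used = 0;
  for (const value of values) {
    // Each value after the first also costs one joining space.
    const cost = value.length + (parts.length > 0 ? 1 : 0);
    // Stop at the first overflow so the result depends only on input order.
    if (used + cost > budget) break;
    parts.push(value);
    used += cost;
  }
  return parts.join(" ");
}
```

Stopping at the first value that does not fit keeps the output a stable prefix of the input; skipping that value and continuing would pack more text in but make the indexed content harder to predict.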

Nice cleanup folding the duplicated name/type guards into isNonEmptyString.
