Add negative constraint parsing to lexical search #2428

Open
Mbeaulne wants to merge 1 commit into 06-18-add_safe_typo_tolerance_for_names_and_io_fields from 06-18-parse_negative_constraints_without_not_no_exclude_

Conversation

@Mbeaulne Mbeaulne commented Jun 18, 2026

Collaborator

Description

Adds support for negative constraints in lexical search queries. When a user includes phrases like "not GCS", "excluding GCS", or "without GCS" in their search, components matching those excluded terms are filtered out of the results. This allows users to express intent more naturally, such as "I want to upload a file but not to GCS", and receive only the relevant components.

This is implemented by parsing the query text before tokenization, extracting negative constraint phrases using a regex pattern, and scoring any index entry that matches a negative token as zero. The word "but" has also been added to the stop words list to avoid it interfering with scoring.
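The flow described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `parseQuery`, `ParsedQuery`, and the toy `scoreEntry` below are hypothetical stand-ins, the regex is a simplified single-word version built from the trigger words listed in this description, and stop-word handling is omitted.

```typescript
// Hypothetical names throughout; the real implementation lives in
// src/services/componentSearchIndex.ts and may differ.
interface ParsedQuery {
  positiveText: string; // query text with negative phrases removed
  negativeTokens: string[]; // lower-cased terms the user wants excluded
}

// Simplified assumption: trigger word, optional "to"/"use"/"using", one term.
const NEGATIVE_PHRASE =
  /\b(?:without|excluding|exclude|not|no)\s+(?:(?:to|use|using)\s+)?(\w+)/gi;

// Runs before tokenization: pull negative phrases out of the raw query text.
function parseQuery(query: string): ParsedQuery {
  const negativeTokens: string[] = [];
  const positiveText = query
    .replace(NEGATIVE_PHRASE, (_match: string, term: string) => {
      negativeTokens.push(term.toLowerCase());
      return " ";
    })
    .replace(/\s+/g, " ")
    .trim();
  return { positiveText, negativeTokens };
}

// Toy scorer: zero if any negated token is present, otherwise the number of
// remaining query words found among the entry's tokens.
function scoreEntry(entryTokens: string[], parsed: ParsedQuery): number {
  const tokens = new Set(entryTokens.map((t) => t.toLowerCase()));
  if (parsed.negativeTokens.some((t) => tokens.has(t))) return 0;
  return parsed.positiveText
    .toLowerCase()
    .split(" ")
    .filter((w) => w.length > 0 && tokens.has(w)).length;
}
```

With this sketch, parsing "upload a file but not to GCS" leaves "upload a file but" as the positive text and yields `gcs` as the only negative token, so an entry whose tokens include `gcs` scores zero while a local-upload entry still scores on "upload" and "file".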

Related Issue and Pull requests

Type of Change

  • Bug fix
  • New feature
  • Improvement
  • Cleanup/Refactor
  • Breaking change
  • Documentation update

Checklist

  • I have tested this does not break current pipelines / runs functionality
  • I have tested the changes on staging

Screenshots (if applicable)

Test Instructions

  1. Build the search index with components that have overlapping keywords (e.g., two upload components, one for local and one for GCS).
  2. Run a search query containing a negative constraint such as "upload a file but not to GCS" or "upload a file excluding GCS".
  3. Verify that only the component not matching the excluded term is returned.

Additional Comments

The negative constraint pattern currently recognises the trigger words without, excluding, exclude, not, and no, optionally followed by a connecting word such as to, use, or using.

github-actions Bot commented Jun 18, 2026


🎩 Preview

A preview build has been created at: 06-18-parse_negative_constraints_without_not_no_exclude_/554c927

@Mbeaulne Mbeaulne changed the title from "Parse negative constraints: “without”, “not”, “no”, “exclude”." to "Add negative constraint parsing to lexical search" Jun 18, 2026
@Mbeaulne Mbeaulne marked this pull request as ready for review June 18, 2026 17:35
@Mbeaulne Mbeaulne requested a review from a team as a code owner June 18, 2026 17:35
Comment thread src/services/componentSearchIndex.ts Outdated
Comment thread src/services/componentSearchIndex.ts
Comment thread src/services/componentSearchIndex.ts Outdated
Comment thread src/services/componentSearchIndex.ts Outdated
Comment thread src/services/componentSearchIndex.ts Outdated
@Mbeaulne Mbeaulne force-pushed the 06-18-parse_negative_constraints_without_not_no_exclude_ branch from 97e37c0 to 4f20ff2 Compare June 18, 2026 19:12
@Mbeaulne Mbeaulne force-pushed the 06-18-add_safe_typo_tolerance_for_names_and_io_fields branch 2 times, most recently from 0a7d588 to e379e64 Compare June 18, 2026 20:28
@Mbeaulne Mbeaulne force-pushed the 06-18-parse_negative_constraints_without_not_no_exclude_ branch from 4f20ff2 to 638c7b7 Compare June 18, 2026 20:28
@Mbeaulne Mbeaulne force-pushed the 06-18-add_safe_typo_tolerance_for_names_and_io_fields branch from e379e64 to 89029f0 Compare June 18, 2026 20:49
@Mbeaulne Mbeaulne force-pushed the 06-18-parse_negative_constraints_without_not_no_exclude_ branch from 638c7b7 to 3f91762 Compare June 18, 2026 20:49
@Mbeaulne Mbeaulne force-pushed the 06-18-add_safe_typo_tolerance_for_names_and_io_fields branch from 89029f0 to fc80727 Compare June 18, 2026 21:02
@Mbeaulne Mbeaulne force-pushed the 06-18-parse_negative_constraints_without_not_no_exclude_ branch from 3f91762 to 790c426 Compare June 18, 2026 21:02
@Mbeaulne Mbeaulne force-pushed the 06-18-parse_negative_constraints_without_not_no_exclude_ branch from 790c426 to 554c927 Compare June 18, 2026 21:16
@camielvs

Collaborator

🤖 Code review — Add negative constraint parsing to lexical search

Nicely thought-through. The two design decisions that matter most are both correct and well-tested: negation is literal (negated terms are not synonym-expanded, so "not gcs" doesn't also drop storage/bucket components), and exclusion is a whole-token match rather than substring (a short negated token can't knock a component out by appearing inside an unrelated word). The conjunction-barrier in the capture group (not gcs and train → exclude only gcs, keep train) is a genuinely subtle case and the regex + comment handle it well. Zero added cost when there's no negation (empty negativeTokens short-circuits).
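For readers following along, here is a rough sketch of the conjunction barrier, the whole-token match, and the empty-list short-circuit. It is an illustration under assumptions, not the regex from the PR: the barrier words (`and`, `or`, `but`), the helper names, and the choice to let the capture run until a barrier or the end of the phrase are all hypothetical.

```typescript
// Hypothetical sketch; the pattern in componentSearchIndex.ts may differ.
const BARRIER = "(?:and|or|but)";
// A word that is not itself a barrier conjunction.
const WORD = `(?!${BARRIER}\\b)\\w+`;
const NEGATION_SOURCE =
  `\\b(?:without|excluding|exclude|not|no)\\s+(?:(?:to|use|using)\\s+)?` +
  `(${WORD}(?:\\s+${WORD})*)`;

// Collects negated words; the capture stops at the first conjunction.
function extractNegativeTokens(query: string): string[] {
  const pattern = new RegExp(NEGATION_SOURCE, "gi");
  const tokens: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(query)) !== null) {
    const captured = match[1] ?? "";
    for (const word of captured.toLowerCase().split(/\s+/)) {
      if (word.length > 0) tokens.push(word);
    }
  }
  return tokens;
}

// Whole-token exclusion: "ml" excludes "Train ML model" but not "HTML".
function isExcluded(fieldText: string, negativeTokens: string[]): boolean {
  if (negativeTokens.length === 0) return false; // no negation, no extra work
  const fieldTokens = new Set(
    fieldText.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 0),
  );
  return negativeTokens.some((t) => fieldTokens.has(t));
}
```

In this sketch "upload not gcs and train" yields only `gcs`, because the negative lookahead refuses to let `and` join the captured group, and a negated `ml` does not exclude a field containing "HTML" because the comparison is against whole tokens rather than substrings.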

The main thing to weigh:

  • Hard exclusion across all fields is more aggressive than the AI layer's negation, and it happens upstream of it. scoreEntry returns score: 0 if a negated token matches any field, including description, implementation, and metadata. Because the AI rerank candidate pool is built from lexicalSearch results, an excluded component never reaches the reranker at all. So a multi-cloud uploader whose description reads "uploads to GCS, S3, and Azure" is silently dropped by "upload not gcs", even though it genuinely satisfies "upload, but not only GCS". Keyword negation can't distinguish "uses X" from "mentions X", and the LLM (which already has negative-constraint rules in its system prompt) handles this with far more nuance. Two options worth considering:

  • Stemmed negation can over-exclude on common words. negativeTokens come from baseSearchTokens (stemmed), so "no models" excludes both models and model. Usually desirable, but model/data/input appear across many components' io/description, so a negated common term can wipe out a large slice. This is inherent to keyword negation on NL phrasing — mostly a "be aware" note, and limiting exclusion to name/io (above) would also blunt it.
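To make the field-scope trade-off concrete, here is a small sketch of what restricting hard exclusion to name and io could look like. Everything in it is hypothetical: the `IndexEntry` shape, the `tokenize` helper, and `isHardExcluded` are assumptions for illustration and are not taken from the PR.

```typescript
// Hypothetical entry shape; field names follow the fields mentioned in this
// review (name, io, description), but the real index entry may differ.
interface IndexEntry {
  name: string;
  io: string[];
  description: string;
}

function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 0);
}

// Hard-exclude only when a negated token appears in name or io.
// A mention in the description alone leaves the entry in the candidate pool.
function isHardExcluded(entry: IndexEntry, negativeTokens: string[]): boolean {
  if (negativeTokens.length === 0) return false;
  const scoped = new Set(tokenize(entry.name + " " + entry.io.join(" ")));
  return negativeTokens.some((t) => scoped.has(t));
}

const gcsOnly: IndexEntry = {
  name: "Upload to GCS",
  io: ["file", "bucket"],
  description: "Uploads a file to a GCS bucket",
};

const multiCloud: IndexEntry = {
  name: "Multi-cloud upload",
  io: ["file", "destination"],
  description: "Uploads to GCS, S3, and Azure",
};
```

Under this scoping, a negated `gcs` drops `gcsOnly` (the token is in its name) but keeps `multiCloud`, whose only GCS mention is in the description, so it would still reach the reranker. The same two entries would also serve as fixtures for a test that pins down whichever field scope is chosen.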

Regex looks safe from catastrophic backtracking (word groups separated by required \s+, no ambiguous nested quantifiers), and the excluding|exclude ordering correctly prefers the longer alternative. Tests cover the literal-vs-synonym distinction well — a negative test for "not gcs" against a description-only GCS mention would be a good addition to pin down whichever field-scope decision you land on.
