
chore(schema-design): add evals for skill boundary and skill activation MCP-474 #34

Merged
Anemy merged 5 commits into main from MCP-474-skill-boundary-and-activation-schema-design
Apr 24, 2026

Anemy (Member) commented Apr 22, 2026

MCP-474

Adds an activation eval, with 40 tests, and a script to run it.
Adds a skill boundary for query-optimizer vs schema-design.

I ran the script with the trigger-eval.json in the mongodb-natural-language-querying skill and got a 20/20 result.

Example run from the script:

Results: 38/40 passed, 2 failed
Total cost:  $0.7797

  True positives:  25   (correctly triggered)
  True negatives:  13   (correctly skipped)
  False positives: 0   (triggered when shouldn't)
  False negatives: 2   (missed when should trigger)

  Precision: 100.0%
  Recall:    92.6%
  F1 Score:  96.2%

Failures:
  [19] MISSED (false negative)
       "How can I optimize this aggregation pipeline that has three $lookup stages?"
       Skills called: [mongodb-query-optimizer]
  [20] MISSED (false negative)
       "Why are my write operations so slow on this collection?"
       Skills called: [mongodb-query-optimizer]

Results written to trigger-eval-results.json
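
For readers checking the numbers, here is a minimal sketch of how the precision, recall, and F1 figures above follow from the four counts. The `classify` and `computeMetrics` helpers and their types are hypothetical names for illustration, not the code in run-trigger-eval.ts.

```typescript
// Hypothetical helpers for illustration; the PR's actual runner may differ.
type Outcome = "TP" | "TN" | "FP" | "FN";

interface Counts {
  tp: number;
  tn: number;
  fp: number;
  fn: number;
}

// Classify one prompt by comparing whether the skill should have
// triggered with whether it appears in the list of skills called.
function classify(shouldTrigger: boolean, skillsCalled: string[], skill: string): Outcome {
  const triggered = skillsCalled.includes(skill);
  if (shouldTrigger) return triggered ? "TP" : "FN";
  return triggered ? "FP" : "TN";
}

// Standard definitions, guarded against division by zero.
function computeMetrics(c: Counts): { precision: number; recall: number; f1: number } {
  const precision = c.tp + c.fp === 0 ? 0 : c.tp / (c.tp + c.fp);
  const recall = c.tp + c.fn === 0 ? 0 : c.tp / (c.tp + c.fn);
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}

const pct = (x: number): string => `${(x * 100).toFixed(1)}%`;

// Counts taken from the example run above.
const m = computeMetrics({ tp: 25, tn: 13, fp: 0, fn: 2 });
console.log(pct(m.precision), pct(m.recall), pct(m.f1)); // 100.0% 92.6% 96.2%
```

Under this reading, failure [19] is a prompt that should have triggered mongodb-schema-design but only called mongodb-query-optimizer, so `classify` would label it "FN".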

The 2 false negatives from the run are somewhat ambiguous. I think the query-optimizer could potentially do a fine job on those prompts, although they are part of the skill boundary.

I've run the skill boundaries eval 3 times; it gets 100% on everything except the description gaps. The description-gap scores vary between runs, though all are above 50%.

Copilot AI (Contributor) left a comment


Pull request overview

Adds a new skills-boundary eval suite and a runnable activation/trigger eval for the mongodb-schema-design skill to validate routing decisions and description gaps between schema design and query optimization.

Changes:

  • Added a new boundary test suite (query-optimizer-vs-schema-design.json) with 40 cases, including ambiguous and “neither” suppression cases plus description-gap probes.
  • Documented the new boundary suite in testing/skills-boundaries/README.md.
  • Added a mongodb-schema-design trigger-eval dataset (40 prompts) and a TS runner script to execute it via the Claude CLI.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| testing/skills-boundaries/query-optimizer-vs-schema-design.json | New boundary test suite defining expected routing between query-optimizer and schema-design, including “neither” and description-gap cases. |
| testing/skills-boundaries/README.md | Documents the new boundary suite and its case breakdown. |
| testing/mongodb-schema-design/trigger-eval.json | New activation eval dataset for schema-design skill triggering. |
| testing/mongodb-schema-design/run-trigger-eval.ts | New CLI-based runner to execute trigger-eval prompts and report pass/fail. |


@paula-stacho (Contributor)

I agree that the last 2 are ambiguous; they fall exactly in the space where we might need either of the two skills. Since actually ensuring that both are considered is out of scope for now, I've created this ticket to at least collect examples, as this is the first step we agreed on: https://jira.mongodb.org/browse/MCP-477
I propose leaving the ambiguous prompts out for the time being and adding them as a comment on this ticket. What do you think?

Copilot AI review requested due to automatic review settings April 24, 2026 14:31
Copilot AI (Contributor) left a comment


Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.


Anemy merged commit 10781ad into main on Apr 24, 2026
7 checks passed
Anemy deleted the MCP-474-skill-boundary-and-activation-schema-design branch on April 24, 2026 16:03