chore(schema-design): add evals for skill boundary and skill activation MCP-474 by Anemy · Pull Request #34 · mongodb/agent-skills

Anemy · 2026-04-22T15:36:22Z

Adds an activation eval, with 40 tests, and a script to run it.
Adds a skill boundary for query-optimizer vs schema-design.

I ran the script with the trigger-eval.json in the mongodb-natural-language-querying skill and got a 20/20 result.

Example run from the script:

```
Results: 38/40 passed, 2 failed
Total cost:  $0.7797

  True positives:  25   (correctly triggered)
  True negatives:  13   (correctly skipped)
  False positives: 0   (triggered when shouldn't)
  False negatives: 2   (missed when should trigger)

  Precision: 100.0%
  Recall:    92.6%
  F1 Score:  96.2%

Failures:
  [19] MISSED (false negative)
       "How can I optimize this aggregation pipeline that has three $lookup stages?"
       Skills called: [mongodb-query-optimizer]
  [20] MISSED (false negative)
       "Why are my write operations so slow on this collection?"
       Skills called: [mongodb-query-optimizer]

Results written to trigger-eval-results.json
```
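
For readers checking the numbers, the precision, recall, and F1 figures in the run above follow directly from the four confusion counts. The sketch below recomputes them; `computeMetrics` and the interface names are hypothetical and are not taken from run-trigger-eval.ts, whose implementation is not shown here.

```typescript
// Confusion counts copied from the example run above.
interface ConfusionCounts {
  truePositives: number;
  trueNegatives: number;
  falsePositives: number;
  falseNegatives: number;
}

interface Metrics {
  precision: number;
  recall: number;
  f1: number;
}

// Hypothetical helper for illustration; not the PR's actual code.
function computeMetrics(c: ConfusionCounts): Metrics {
  const predictedPositive = c.truePositives + c.falsePositives;
  const actualPositive = c.truePositives + c.falseNegatives;
  // Guard against division by zero when nothing triggered or nothing should have.
  const precision = predictedPositive === 0 ? 0 : c.truePositives / predictedPositive;
  const recall = actualPositive === 0 ? 0 : c.truePositives / actualPositive;
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
  return { precision, recall, f1 };
}

const exampleRun: ConfusionCounts = {
  truePositives: 25,
  trueNegatives: 13,
  falsePositives: 0,
  falseNegatives: 2,
};

const metrics = computeMetrics(exampleRun);
const pct = (x: number): string => `${(x * 100).toFixed(1)}%`;

console.log(`Precision: ${pct(metrics.precision)}`); // 100.0%
console.log(`Recall:    ${pct(metrics.recall)}`); // 92.6%
console.log(`F1 Score:  ${pct(metrics.f1)}`); // 96.2%
```

Recall is 25 / (25 + 2), which is why two missed prompts out of 27 expected triggers lower it to 92.6% while precision stays at 100%.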

The 2 false negatives from this run are somewhat ambiguous. I think the query-optimizer could potentially do a fine job on those, although this is part of the skill boundary.

I've run the skill boundaries eval 3 times; it gets 100% on everything except the description gaps. The description-gap results vary between runs, though all are above 50%.

Copilot

Pull request overview

Adds a new skills-boundary eval suite and a runnable activation/trigger eval for the mongodb-schema-design skill to validate routing decisions and description gaps between schema design vs query optimization.

Changes:

- Added a new boundary test suite (query-optimizer-vs-schema-design.json) with 40 cases, including ambiguous and “neither” suppression cases plus description-gap probes.
- Documented the new boundary suite in testing/skills-boundaries/README.md.
- Added a mongodb-schema-design trigger-eval dataset (40 prompts) and a TS runner script to execute it via the Claude CLI.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| testing/skills-boundaries/query-optimizer-vs-schema-design.json | New boundary test suite defining expected routing between query-optimizer vs schema-design, including “neither” and description-gap cases. |
| testing/skills-boundaries/README.md | Documents the new boundary suite and its case breakdown. |
| testing/mongodb-schema-design/trigger-eval.json | New activation eval dataset for schema-design skill triggering. |
| testing/mongodb-schema-design/run-trigger-eval.ts | New CLI-based runner to execute trigger-eval prompts and report pass/fail. |
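
To make the pass/fail reporting concrete, here is a minimal sketch of how a single trigger-eval prompt could be scored against the skills that were actually called. The case shape (`prompt`, `shouldTrigger`) and the `classifyCase` helper are assumptions for illustration only; the real schema of trigger-eval.json and the logic in run-trigger-eval.ts are not shown on this page.

```typescript
// Assumed shape of one eval case; the actual trigger-eval.json schema may differ.
interface TriggerCase {
  prompt: string;
  shouldTrigger: boolean;
}

type Outcome = "true positive" | "true negative" | "false positive" | "false negative";

// Skill under test, as named in this PR.
const TARGET_SKILL = "mongodb-schema-design";

// Hypothetical helper: compares the expectation with the skills the agent invoked.
function classifyCase(
  testCase: TriggerCase,
  skillsCalled: string[],
  target: string = TARGET_SKILL,
): Outcome {
  const triggered = skillsCalled.some((skill) => skill === target);
  if (testCase.shouldTrigger) {
    return triggered ? "true positive" : "false negative";
  }
  return triggered ? "false positive" : "true negative";
}

// A case passes when the expectation and the observed behaviour agree.
function passed(outcome: Outcome): boolean {
  return outcome === "true positive" || outcome === "true negative";
}

// Failure [20] from the example run in the PR description:
// the prompt was expected to trigger schema-design, but only query-optimizer was called.
const missedOutcome = classifyCase(
  { prompt: "Why are my write operations so slow on this collection?", shouldTrigger: true },
  ["mongodb-query-optimizer"],
);

console.log(missedOutcome); // false negative
console.log(passed(missedOutcome)); // false
```

Under these assumptions, a prompt routed only to mongodb-query-optimizer counts as a miss for the schema-design activation eval even if that skill would have answered it adequately, which is the ambiguity discussed in the comments.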

paula-stacho · 2026-04-23T11:42:15Z

I agree that the last 2 are ambiguous; they fall exactly in the space where we might need either of the two skills. Since actually ensuring that both are considered is out of scope for now, I've created this ticket to at least collect examples, as that is the first step we agreed on: https://jira.mongodb.org/browse/MCP-477
I propose leaving the ambiguous prompts out for the time being and adding them as a comment to this ticket. What do you think?

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

Anemy added 3 commits April 22, 2026 08:11

chore(schema-design): add trigger evals and test script MCP-474

78dea5e

chore: add boundary eval

495367b

fixup: cleanup script and readme

5d2c8b7

Anemy requested review from Copilot and paula-stacho April 22, 2026 15:36

Anemy requested review from a team as code owners April 22, 2026 15:36

Copilot started reviewing on behalf of Anemy April 22, 2026 15:36

Copilot AI reviewed Apr 22, 2026

fixup: remove ambiguous cases for now

2ec6f3c

paula-stacho reviewed Apr 24, 2026

Comment thread testing/skills-boundaries/README.md Outdated

fixup: remove explicit numbers in README

395d475

Copilot AI review requested due to automatic review settings April 24, 2026 14:31

Copilot started reviewing on behalf of Anemy April 24, 2026 14:32

Copilot AI reviewed Apr 24, 2026

paula-stacho approved these changes Apr 24, 2026

Anemy merged commit 10781ad into main Apr 24, 2026
7 checks passed

Anemy deleted the MCP-474-skill-boundary-and-activation-schema-design branch April 24, 2026 16:03

Anemy merged 5 commits into main from MCP-474-skill-boundary-and-activation-schema-design


3 participants
