… and context size MCP-475
Pull request overview
This PR updates the mongodb-natural-language-querying skill prompt to be less Compass-specific and to reduce ambiguity/context requirements so it can be used more generally across agentic workflows.
Changes:
- Removes Compass-specific framing and the prior Compass-style JSON/stringified-query output format.
- Updates response-format guidance to prefer the “workspace language” (or default to MongoDB shell syntax).
- Tweaks wording in best practices and replaces the “Size Limits” section with “Managing Context Size”.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Pull request overview
This PR updates the mongodb-natural-language-querying skill prompt to be less Compass-specific by simplifying instructions, reducing required context, and switching examples away from a Compass JSON wrapper toward direct MongoDB shell/driver query output.
Changes:
- Removes the Compass-specific output format and shows shell-style `find()`/`aggregate()` examples.
- Trims/rewrites guidance to reduce ambiguity and context-size overhead.
- Minor wording improvements in best-practices guidance.
[Diff excerpt: the old stringified-JSON output example, ending in `"limit": "10"`, is replaced with a ` ```js ` code block.]
If we can omit the mongosh syntax here, that would be ideal. In my testing, I've observed agents conflating mongosh APIs with Node.js APIs, and occasional API contamination across programming languages due to Programming Language Confusion.
How do you feel about removing the examples altogether? I feel that the way they were before, combined with the instructions to render them as MongoDB shell/Extended JSON syntax inside a JSON object, would lead to more programming language and output confusion. My assumption is that folks are asking these questions with some context for the agent, like an existing workspace with a language. If it's a totally isolated prompt without that context, I think we would want to default to mongosh.
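For illustration, here is a minimal sketch of what defaulting to mongosh output could look like for an isolated prompt; the `movies` collection and `year` field are my own assumptions, not taken from the skill, and the renderer is just a toy to show the shell-style shape:

```javascript
// Illustrative only: the mongosh-style query an agent might default to for an
// isolated prompt like "find all the movies released in 1983".
// (Collection name "movies" and field name "year" are assumptions.)
const filter = { year: 1983 };

// In mongosh this would be written as:
//   db.movies.find({ year: 1983 })
// whereas in a workspace with a known language the agent could emit
// driver syntax for that language instead.
function renderMongoshFind(collection, filterDoc) {
  // Render with unquoted keys, matching shell conventions.
  const body = Object.entries(filterDoc)
    .map(([k, v]) => `${k}: ${JSON.stringify(v)}`)
    .join(", ");
  return `db.${collection}.find({ ${body} })`;
}

console.log(renderMongoshFind("movies", filter));
// → db.movies.find({ year: 1983 })
```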
  "id": 15,
  "name": "basic-aggregate",
- "prompt": "find all the movies released in 1983",
+ "prompt": "aggregate all the movies released in 1983",
Curious about the change from "find" to "aggregate" here, and the more specific requests below. Are you optimizing for test success? I might argue that "find all the movies" is closer to a natural language query than "aggregate" or quoting specific pipeline stages, and represents a better test of how well the skill guides agents to form correct queries from natural language.
I would suggest if we do want to test the more refined language specifically, we might want to do that in addition to keeping the more natural-language style, instead of replacing it, so we have a better idea of how the skill performs with varying levels of specificity.
When running the evals more strictly, a lot of the evals expecting an aggregation would fail because a find would be generated instead. The prompt could be fulfilled with a find/project, etc., consistent with what we have written in the skill.
We already have this same eval for a basic find with the same user prompt, so it was a duplicate test before. When making this change, I was thinking it would reduce the duplication by giving it a different prompt. Now, thinking about it more, I'm not sure it's doing anything more than the first prompt. I agree it's likely not something someone would ask. I'm leaning towards removing it altogether.
It looks like these tests were adopted from Compass where they are expected to be run against either an explicit aggregation or find generator, not one that does both. The tests in Compass are also a bit more deterministic, with a stricter output. It would be nice to have that expectation out of these evals, but that would be some additional work.
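To make the ambiguity concrete, here is a hedged sketch (collection and field names are assumptions, not taken from the eval file) of the two equivalent answers an agent could give for the same natural-language prompt:

```javascript
// Assumed: a "movies" collection with a numeric "year" field.
// The prompt "find all the movies released in 1983" can be satisfied in two
// equivalent ways, which is why an eval that insists on an aggregation fails
// when the model emits a find.

// Option 1: a plain find filter.
// mongosh: db.movies.find({ year: 1983 })
const findFilter = { year: 1983 };

// Option 2: a single-stage aggregation pipeline with the same predicate.
// mongosh: db.movies.aggregate([{ $match: { year: 1983 } }])
const pipeline = [{ $match: { year: 1983 } }];

// The predicates are structurally identical.
const samePredicate =
  JSON.stringify(findFilter) === JSON.stringify(pipeline[0].$match);
console.log(samePredicate); // true
```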
Gotcha. Ok, why don't we put a pin in this one for now. The skills working group has been having conversations about spinning up more formal eval tooling, so maybe we can revisit this once we have more robust tooling in place.
Pull request overview
This PR aims to generalize the MongoDB natural-language-querying skill by removing Compass-specific assumptions, reducing context requirements, and updating the eval suite to better match the intended “find or aggregate” behavior (including new coverage for maxTimeMS and Java driver output).
Changes:
- Simplified/adjusted `SKILL.md` guidance around required context, find vs. aggregate selection, output formatting, and context size management.
- Updated eval prompts/expected outputs to reduce ambiguity and allow dynamic date handling.
- Added new eval cases for `maxTimeMS` usage in find queries and for Java driver syntax.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| testing/mongodb-natural-language-querying/evals/evals.json | Refines eval prompts/expected outputs and adds new evals (maxTimeMS + Java). |
| skills/mongodb-natural-language-querying/SKILL.md | Removes Compass-specific guidance and revises output/context instructions for broader agentic use. |
> Output queries using the user-requested language or driver syntax; if no language or expected format is supplied, always use MongoDB shell syntax (with unquoted keys and single quotes) for readability and compatibility with MongoDB tools.

[Diff excerpt: the former `**Find Query Response:**` heading and its ` ```json ` example block are removed.]
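A small sketch of what that output-format instruction amounts to; the collection and field names here are my own examples. It renders the same filter in strict JSON versus shell style with unquoted keys and single quotes:

```javascript
// Sketch: MongoDB shell style (unquoted keys, single-quoted strings)
// versus strict JSON. The document and field names are illustrative.
const filter = { title: "Videodrome", year: 1983 };

function toShellStyle(doc) {
  const body = Object.entries(doc)
    .map(([k, v]) => `${k}: ${typeof v === "string" ? `'${v}'` : v}`)
    .join(", ");
  return `{ ${body} }`;
}

console.log(JSON.stringify(filter)); // {"title":"Videodrome","year":1983}
console.log(toShellStyle(filter));   // { title: 'Videodrome', year: 1983 }
```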
MCP-475
This skill was adapted from Compass's natural language prompts. As a result, it contains instructions specific to how the LLM output is used in Compass. This PR should improve the skill's quality by reducing instruction ambiguity, reducing the overall size of the required context, and removing the Compass-specific expected output format to allow for more generalized agentic uses. There are likely more improvements to make here; this is mostly aimed at low-hanging fruit.
This PR also updates the evals to reduce ambiguity. They were also adopted from Compass and were expected to be run against either an explicit aggregation or find generator, not one that does both. Added an eval for Java and an eval that checks that a find with a `maxTimeMS` can be created.
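For reference, a hedged sketch of what the new `maxTimeMS` eval presumably expects; the 5000 ms value and the collection/field names are placeholders, not taken from the eval file:

```javascript
// Placeholder values: the actual eval's collection, filter, and timeout
// are not shown in this thread.
const filter = { year: 1983 };
const options = { maxTimeMS: 5000 }; // server-side execution time limit

// mongosh:        db.movies.find({ year: 1983 }).maxTimeMS(5000)
// Node.js driver: collection.find({ year: 1983 }, { maxTimeMS: 5000 })
console.log(options.maxTimeMS); // 5000
```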
The evals below were run with the system prompt `You are a MongoDB assistant.` and the user's prompt. Ping me if you'd like the full log and judge notes. This will lead to bias in the no-skill vs. skill results: oftentimes without the skill, the model deduces the result and does not show the query it ran to the user. It should still give an indicator of whether these changes have any impact.

Eval results (before)
📊 SUMMARY
Total cost (including judging): $6.8234
Skill wins: 25 | Losses: 2 | Ties: 9
Eval results (with changes)
📊 SUMMARY
Total cost (including judging): $6.4988
Skill wins: 22 | Losses: 2 | Ties: 12