… and context size MCP-475
Pull request overview
This PR updates the mongodb-natural-language-querying skill prompt to be less Compass-specific and to reduce ambiguity/context requirements so it can be used more generally across agentic workflows.
Changes:
- Removes Compass-specific framing and the prior Compass-style JSON/stringified-query output format.
- Updates response-format guidance to prefer the “workspace language” (or default to MongoDB shell syntax).
- Tweaks wording in best practices and replaces the “Size Limits” section with “Managing Context Size”.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Pull request overview
This PR updates the mongodb-natural-language-querying skill prompt to be less Compass-specific by simplifying instructions, reducing required context, and switching examples away from a Compass JSON wrapper toward direct MongoDB shell/driver query output.
Changes:
- Removes the Compass-specific output format and shows shell-style `find()`/`aggregate()` examples.
- Trims/rewrites guidance to reduce ambiguity and context-size overhead.
- Minor wording improvements in best-practices guidance.
[Diff excerpt: the old stringified-JSON output example, ending in `"limit": "10"`, is replaced with a ` ```js ` code block.]
If we can omit the mongosh syntax here, that would be ideal. In my testing, I've observed agents conflating mongosh APIs with Node.js APIs, and occasional API contamination across programming languages due to Programming Language Confusion.
How do you feel about removing the examples altogether? I feel that the way they were before, combined with the instructions to render them as MongoDB shell/Extended JSON syntax inside a JSON object, would lead to more programming language and output confusion. My assumption is that folks are asking these questions with some context for the agent, like an existing workspace with a language. If it's a totally isolated prompt without that context, I think we would want to default to mongosh.
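For illustration, here is a minimal sketch of what defaulting to mongosh output could look like for an isolated prompt; the `movies` collection and `year` field are my own assumptions, not taken from the skill, and the renderer is just a toy to show the shell-style shape:

```javascript
// Illustrative only: the mongosh-style query an agent might default to for an
// isolated prompt like "find all the movies released in 1983".
// (Collection name "movies" and field name "year" are assumptions.)
const filter = { year: 1983 };

// In mongosh this would be written as:
//   db.movies.find({ year: 1983 })
// whereas in a workspace with a known language the agent could emit
// driver syntax for that language instead.
function renderMongoshFind(collection, filterDoc) {
  // Render with unquoted keys, matching shell conventions.
  const body = Object.entries(filterDoc)
    .map(([k, v]) => `${k}: ${JSON.stringify(v)}`)
    .join(", ");
  return `db.${collection}.find({ ${body} })`;
}

console.log(renderMongoshFind("movies", filter));
// → db.movies.find({ year: 1983 })
```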
  "id": 15,
  "name": "basic-aggregate",
- "prompt": "find all the movies released in 1983",
+ "prompt": "aggregate all the movies released in 1983",
Curious about the change from "find" to "aggregate" here, and the more specific requests below. Are you optimizing for test success? I might argue that "find all the movies" is closer to a natural language query than "aggregate" or quoting specific pipeline stages, and represents a better test of how well the skill guides agents to form correct queries from natural language.
I would suggest if we do want to test the more refined language specifically, we might want to do that in addition to keeping the more natural-language style, instead of replacing it, so we have a better idea of how the skill performs with varying levels of specificity.
When running the evals more strictly, a lot of the evals expecting an aggregation would fail because a find would be generated instead. The prompt could be fulfilled with a find/project, etc., consistent with what we have written in the skill.
We already have this same eval for a basic find with the same user prompt, so it was a duplicate test before. When making this change, I was thinking it would reduce the duplication by giving it a different prompt. Now, thinking about it more, I'm not sure it's doing anything more than the first prompt. I agree it's likely not something someone would ask. I'm leaning towards removing it altogether.
It looks like these tests were adopted from Compass where they are expected to be run against either an explicit aggregation or find generator, not one that does both. The tests in Compass are also a bit more deterministic, with a stricter output. It would be nice to have that expectation out of these evals, but that would be some additional work.
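To make the ambiguity concrete, here is a hedged sketch (collection and field names are assumptions, not taken from the eval file) of the two equivalent answers an agent could give for the same natural-language prompt:

```javascript
// Assumed: a "movies" collection with a numeric "year" field.
// The prompt "find all the movies released in 1983" can be satisfied in two
// equivalent ways, which is why an eval that insists on an aggregation fails
// when the model emits a find.

// Option 1: a plain find filter.
// mongosh: db.movies.find({ year: 1983 })
const findFilter = { year: 1983 };

// Option 2: a single-stage aggregation pipeline with the same predicate.
// mongosh: db.movies.aggregate([{ $match: { year: 1983 } }])
const pipeline = [{ $match: { year: 1983 } }];

// The predicates are structurally identical.
const samePredicate =
  JSON.stringify(findFilter) === JSON.stringify(pipeline[0].$match);
console.log(samePredicate); // true
```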
Gotcha. Ok, why don't we put a pin in this one for now. The skills working group has been having conversations about spinning up more formal eval tooling, so maybe we can revisit this once we have more robust tooling in place.
Pull request overview
This PR aims to generalize the MongoDB natural-language-querying skill by removing Compass-specific assumptions, reducing context requirements, and updating the eval suite to better match the intended “find or aggregate” behavior (including new coverage for maxTimeMS and Java driver output).
Changes:
- Simplified/adjusted `SKILL.md` guidance around required context, find vs. aggregate selection, output formatting, and context size management.
- Updated eval prompts/expected outputs to reduce ambiguity and allow dynamic date handling.
- Added new eval cases for `maxTimeMS` usage in find queries and for Java driver syntax.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| testing/mongodb-natural-language-querying/evals/evals.json | Refines eval prompts/expected outputs and adds new evals (maxTimeMS + Java). |
| skills/mongodb-natural-language-querying/SKILL.md | Removes Compass-specific guidance and revises output/context instructions for broader agentic use. |
> Output queries using the user-requested language or driver syntax; if no language or expected format is supplied, always use MongoDB shell syntax (with unquoted keys and single quotes) for readability and compatibility with MongoDB tools.

[Diff excerpt: the former `**Find Query Response:**` heading and its ` ```json ` example block are removed.]
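A small sketch of what that output-format instruction amounts to; the collection and field names here are my own examples. It renders the same filter in strict JSON versus shell style with unquoted keys and single quotes:

```javascript
// Sketch: MongoDB shell style (unquoted keys, single-quoted strings)
// versus strict JSON. The document and field names are illustrative.
const filter = { title: "Videodrome", year: 1983 };

function toShellStyle(doc) {
  const body = Object.entries(doc)
    .map(([k, v]) => `${k}: ${typeof v === "string" ? `'${v}'` : v}`)
    .join(", ");
  return `{ ${body} }`;
}

console.log(JSON.stringify(filter)); // {"title":"Videodrome","year":1983}
console.log(toShellStyle(filter));   // { title: 'Videodrome', year: 1983 }
```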
MCP-475
This skill was adapted from Compass's natural language prompts. As a result, it contains instructions specific to how the LLM output is used in Compass. This PR should improve the skill's quality by reducing instruction ambiguity, reducing the overall size of the required context, and removing the Compass-specific expected output format to allow for more generalized agentic uses. There are likely more improvements to make here; this is mostly aimed at low-hanging fruit.
This PR also updates the evals to reduce ambiguity. They were also adopted from Compass and were expected to be run against either an explicit aggregation or find generator, not one that does both. Added an eval for Java and an eval that checks that a find with a `maxTimeMS` can be created.
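For reference, a hedged sketch of what the new `maxTimeMS` eval presumably expects; the 5000 ms value and the collection/field names are placeholders, not taken from the eval file:

```javascript
// Placeholder values: the actual eval's collection, filter, and timeout
// are not shown in this thread.
const filter = { year: 1983 };
const options = { maxTimeMS: 5000 }; // server-side execution time limit

// mongosh:        db.movies.find({ year: 1983 }).maxTimeMS(5000)
// Node.js driver: collection.find({ year: 1983 }, { maxTimeMS: 5000 })
console.log(options.maxTimeMS); // 5000
```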
The evals below were run with the system prompt `You are a MongoDB assistant.` and the user's prompt. Ping me if you'd like the full log and judge notes. This will lead to bias in the no-skill vs. skill results: oftentimes without the skill, the model deduces the result and does not show the query it ran to the user. It should still give an indicator of whether these changes have any impact.

Eval results (before)
📊 SUMMARY
Total cost (including judging): $6.8234
Skill wins: 25 | Losses: 2 | Ties: 9
Eval results (with changes)
📊 SUMMARY
Total cost (including judging): $6.4988
Skill wins: 22 | Losses: 2 | Ties: 12