
feat - e2e tests + evals #22


Open · wants to merge 1 commit into main

Conversation

@mclenhard commented May 2, 2025

Adds a new e2e test that loads an MCP client, which in turn runs the server and processes an actual tool call. The response is then graded for correctness.

Disclosure: I'm the package author.


Important

Adds end-to-end tests for Linear MCP server functionality using mcp-evals, updates documentation, and adds the mcp-evals dependency.

  • Behavior:
    • Adds end-to-end tests in evals.ts for evaluating Linear MCP server functionality using mcp-evals.
    • Tests include linear_create_issue, linear_update_issue, linear_search_issues, linear_get_user_issues, and linear_add_comment.
  • Documentation:
    • Updates README.md with instructions for running evaluations with the mcp-eval command.
  • Dependencies:
    • Adds mcp-evals to dependencies in package.json.

This description was created by Ellipsis for f8a3071. You can customize this summary. It will automatically update as commits are pushed.

@ellipsis-dev (bot) left a comment


Caution

Changes requested ❌

Reviewed everything up to f8a3071 in 2 minutes and 29 seconds.
  • Reviewed 98 lines of code in 3 files
  • Skipped 1 file when reviewing.
  • Skipped posting 7 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. README.md:49
  • Draft comment:
    Ensure consistency in the eval command name. The README uses 'npx mcp-eval evals.ts index.ts' while the dependency is 'mcp-evals'. Confirm the correct command name.
  • Reason this comment was not posted:
    Comment looked like it was already resolved.
2. evals.ts:12
  • Draft comment:
    Consider wrapping JSON.parse(result) in a try-catch to handle potential JSON parsing errors.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50%. This is a reasonable suggestion since JSON.parse can throw. However, in an evaluation context, a JSON parse error would indicate a fundamental problem with the grade() function or model output that should fail the test. Adding try-catch here would hide real issues. The grade() function likely already handles malformed responses appropriately. I could be wrong about the grade() function's error handling. Maybe explicit error handling here would provide better debugging information. In test/evaluation code, failing fast on invalid JSON is actually preferable to masking the error. If we need better error messages, that should be handled in the grade() function itself. The comment should be removed. While technically correct about JSON.parse throwing, adding try-catch here would be counterproductive for evaluation code that should fail on invalid responses.
3. evals.ts:59
  • Draft comment:
    Ensure a newline at the end of the file for proper POSIX compliance.
  • Reason this comment was not posted:
    Confidence changes required: 50% <= threshold 50%.
4. README.md:50
  • Draft comment:
    Ensure CLI command consistency: The README uses 'npx mcp-eval' while the dependency is 'mcp-evals'. Align the naming to avoid confusion.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50%. The comment makes an assumption about package naming that isn't clearly supported by the evidence. We don't see the actual package name in the diff or full file. The comment also asks for verification/alignment, which violates our rules about not asking for confirmations or explanations. Even if there was an inconsistency, this would be caught when users try to run the command. What if the package really is named mcp-evals and this inconsistency would cause user confusion? What if this is a real issue that should be fixed? Without clear evidence of the actual package name in the dependencies, we can't be certain there's really an inconsistency. Additionally, if the command doesn't work, users will quickly discover the correct command name through npm errors. Delete this comment because it makes assumptions without evidence and asks for verification/alignment, which violates our review rules.
5. evals.ts:59
  • Draft comment:
    Please add a newline at the end of the file to conform with standard POSIX practices.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50%. While having a newline at the end of files is a common standard, this is typically handled automatically by IDE settings and git configurations. Most modern development environments handle this automatically. Additionally, this is a very minor formatting issue that doesn't affect functionality and would likely be caught by linting rules if it was important to the project. The comment is technically correct about POSIX standards. Missing newlines can cause issues with some Unix tools. While technically correct, this is too minor of an issue to warrant a PR comment. It's better handled by automated tooling and IDE settings rather than manual review comments. This comment should be removed as it's too minor of an issue and better handled by automated tooling rather than manual review comments.
6. evals.ts:7
  • Draft comment:
    Adopt a consistent naming convention for eval function names and their 'name' properties (e.g. consistently using 'Evaluation' or a uniform suffix) to improve clarity.
  • Reason this comment was not posted:
    Marked as duplicate.
7. README.md:56
  • Draft comment:
    Typographical suggestion: In the 'linear_create_issue' section, consider changing 'Create a new Linear issues' to 'Create a new Linear issue' to ensure consistency and clarity.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.

Workflow ID: wflow_F776lGNMOM5Gkyf5

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

description: "Evaluates the correctness and completeness of the linear_create_issue tool usage",
run: async () => {
const result = await grade(openai("gpt-4"), "Please create a new Linear issue using the linear_create_issue tool. The issue should have the title 'Fix login bug for Safari users', teamId 'team123', description 'Safari users are unable to log in properly', priority 2, and status 'Open'. Return the issue identifier and URL.");
return JSON.parse(result);

Consider adding error handling (e.g. try/catch) around JSON.parse to gracefully handle unexpected non-JSON responses.
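
For illustration, a minimal sketch of that guard (`parseGrade` is a hypothetical helper, not part of this diff; `result` stands in for the string returned by `grade()`):

```typescript
// Hypothetical guard: surface the raw model output in the error instead of
// letting JSON.parse throw an opaque SyntaxError mid-eval.
function parseGrade(result: string): unknown {
  try {
    return JSON.parse(result);
  } catch {
    throw new Error(`Expected JSON from grade(), got: ${result}`);
  }
}
```

Note that the draft-comment reasoning above takes the opposite view: in evaluation code, failing fast on invalid JSON can be preferable to masking it.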

A second excerpt, from the top of evals.ts:

```typescript
import { grade, EvalFunction } from "mcp-evals";

const linear_create_issueEval: EvalFunction = {
  name: "Linear Create Issue Tool Evaluation",
  // ...description and run follow, as in the excerpt above
};
```

The evaluation functions use inconsistent naming formats for their name properties. For example, 'Linear Create Issue Tool Evaluation' (line 8) uses title casing and includes 'Tool Evaluation', whereas 'linear_update_issue Evaluation' (line 17), 'linear_search_issues' (line 26), 'linear_get_user_issues Evaluation' (line 35), and 'linear_add_comment Evaluation' (line 44) vary in capitalization and inclusion of the word 'Evaluation'. For better readability and consistency, consider standardizing these names.
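
One lightweight way to enforce a single format (illustrative only; `evalName` is a hypothetical helper, not something this PR adds):

```typescript
// Hypothetical helper: derive every display name from the tool id so all five
// evals share the "<tool_name> Evaluation" format flagged in the comment above.
const evalName = (tool: string): string => `${tool} Evaluation`;

// e.g. evalName("linear_create_issue") === "linear_create_issue Evaluation"
```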
