
feat - e2e tests + evals #22


Open · wants to merge 1 commit into main

Conversation

@mclenhard commented May 2, 2025

Adds a new e2e test that loads an MCP client, which in turn runs the server and processes an actual tool call. The response is then graded for correctness.

Disclosure: I'm the package author.


Important

Adds end-to-end tests for Linear MCP server functionality using mcp-evals, updates documentation, and adds the mcp-evals dependency.

  • Behavior:
    • Adds end-to-end tests in evals.ts for evaluating Linear MCP server functionality using mcp-evals.
    • Tests include linear_create_issue, linear_update_issue, linear_search_issues, linear_get_user_issues, and linear_add_comment.
  • Documentation:
    • Updates README.md with instructions for running evaluations with the mcp-eval command.
  • Dependencies:
    • Adds mcp-evals to dependencies in package.json.

This description was created by Ellipsis for f8a3071. You can customize this summary. It will automatically update as commits are pushed.

@ellipsis-dev (bot) left a comment


Caution

Changes requested ❌

Reviewed everything up to f8a3071 in 2 minutes and 29 seconds.
  • Reviewed 98 lines of code in 3 files
  • Skipped 1 file when reviewing.
  • Skipped posting 7 draft comments. View those below.
  • Modify your settings and rules to customize what types of comments Ellipsis leaves. And don't forget to react with 👍 or 👎 to teach Ellipsis.
1. README.md:49
  • Draft comment:
    Ensure consistency in the eval command name. The README uses 'npx mcp-eval evals.ts index.ts' while the dependency is 'mcp-evals'. Confirm the correct command name.
  • Reason this comment was not posted:
    Comment looked like it was already resolved.
2. evals.ts:12
  • Draft comment:
    Consider wrapping JSON.parse(result) in a try-catch to handle potential JSON parsing errors.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50%. This is a reasonable suggestion since JSON.parse can throw. However, in an evaluation context, a JSON parse error would indicate a fundamental problem with the grade() function or model output that should fail the test. Adding try-catch here would hide real issues. The grade() function likely already handles malformed responses appropriately. I could be wrong about the grade() function's error handling. Maybe explicit error handling here would provide better debugging information. In test/evaluation code, failing fast on invalid JSON is actually preferable to masking the error. If we need better error messages, that should be handled in the grade() function itself. The comment should be removed. While technically correct about JSON.parse throwing, adding try-catch here would be counterproductive for evaluation code that should fail on invalid responses.
3. evals.ts:59
  • Draft comment:
    Ensure a newline at the end of the file for proper POSIX compliance.
  • Reason this comment was not posted:
    Confidence changes required: 50% <= threshold 50%.
4. README.md:50
  • Draft comment:
    Ensure CLI command consistency: The README uses 'npx mcp-eval' while the dependency is 'mcp-evals'. Align the naming to avoid confusion.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 10% vs. threshold = 50%. The comment makes an assumption about package naming that isn't clearly supported by the evidence. We don't see the actual package name in the diff or full file. The comment also asks for verification/alignment, which violates our rules about not asking for confirmations or explanations. Even if there was an inconsistency, this would be caught when users try to run the command. What if the package really is named mcp-evals and this inconsistency would cause user confusion? What if this is a real issue that should be fixed? Without clear evidence of the actual package name in the dependencies, we can't be certain there's really an inconsistency. Additionally, if the command doesn't work, users will quickly discover the correct command name through npm errors. Delete this comment because it makes assumptions without evidence and asks for verification/alignment, which violates our review rules.
5. evals.ts:59
  • Draft comment:
    Please add a newline at the end of the file to conform with standard POSIX practices.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable: usefulness confidence = 20% vs. threshold = 50%. While having a newline at the end of files is a common standard, this is typically handled automatically by IDE settings and git configurations. Most modern development environments handle this automatically. Additionally, this is a very minor formatting issue that doesn't affect functionality and would likely be caught by linting rules if it was important to the project. The comment is technically correct about POSIX standards. Missing newlines can cause issues with some Unix tools. While technically correct, this is too minor of an issue to warrant a PR comment. It's better handled by automated tooling and IDE settings rather than manual review comments. This comment should be removed as it's too minor of an issue and better handled by automated tooling rather than manual review comments.
6. evals.ts:7
  • Draft comment:
    Adopt a consistent naming convention for eval function names and their 'name' properties (e.g. consistently using 'Evaluation' or a uniform suffix) to improve clarity.
  • Reason this comment was not posted:
    Marked as duplicate.
7. README.md:56
  • Draft comment:
    Typographical suggestion: In the 'linear_create_issue' section, consider changing 'Create a new Linear issues' to 'Create a new Linear issue' to ensure consistency and clarity.
  • Reason this comment was not posted:
    Comment was not on a location in the diff, so it can't be submitted as a review comment.

Workflow ID: wflow_F776lGNMOM5Gkyf5

You can customize Ellipsis by changing your verbosity settings, reacting with 👍 or 👎, replying to comments, or adding code review rules.

description: "Evaluates the correctness and completeness of the linear_create_issue tool usage",
run: async () => {
const result = await grade(openai("gpt-4"), "Please create a new Linear issue using the linear_create_issue tool. The issue should have the title 'Fix login bug for Safari users', teamId 'team123', description 'Safari users are unable to log in properly', priority 2, and status 'Open'. Return the issue identifier and URL.");
return JSON.parse(result);

Consider adding error handling (e.g. try/catch) around JSON.parse to gracefully handle unexpected non-JSON responses.
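
For illustration, a minimal sketch of that guard (`parseGrade` is a hypothetical helper, not part of this diff; `result` stands in for the string returned by `grade()`):

```typescript
// Hypothetical guard: surface the raw model output in the error instead of
// letting JSON.parse throw an opaque SyntaxError mid-eval.
function parseGrade(result: string): unknown {
  try {
    return JSON.parse(result);
  } catch {
    throw new Error(`Expected JSON from grade(), got: ${result}`);
  }
}
```

Note that the draft-comment reasoning above takes the opposite view: in evaluation code, failing fast on invalid JSON can be preferable to masking it.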

A second excerpt, from the top of evals.ts:

```typescript
import { grade, EvalFunction } from "mcp-evals";

const linear_create_issueEval: EvalFunction = {
  name: "Linear Create Issue Tool Evaluation",
  // ...description and run follow, as in the excerpt above
};
```

The evaluation functions use inconsistent naming formats for their name properties. For example, 'Linear Create Issue Tool Evaluation' (line 8) uses title casing and includes 'Tool Evaluation', whereas 'linear_update_issue Evaluation' (line 17), 'linear_search_issues' (line 26), 'linear_get_user_issues Evaluation' (line 35), and 'linear_add_comment Evaluation' (line 44) vary in capitalization and inclusion of the word 'Evaluation'. For better readability and consistency, consider standardizing these names.
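
One lightweight way to enforce a single format (illustrative only; `evalName` is a hypothetical helper, not something this PR adds):

```typescript
// Hypothetical helper: derive every display name from the tool id so all five
// evals share the "<tool_name> Evaluation" format flagged in the comment above.
const evalName = (tool: string): string => `${tool} Evaluation`;

// e.g. evalName("linear_create_issue") === "linear_create_issue Evaluation"
```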
