Conversation

@centdix centdix (Collaborator) commented Dec 9, 2025

No description provided.

cloudflare-workers-and-pages bot commented Dec 9, 2025

Deploying windmill with Cloudflare Pages

Latest commit: b3e40ee
Status: ✅  Deploy successful!
Preview URL: https://15798ab1.windmill.pages.dev
Branch Preview URL: https://fg-evallm.windmill.pages.dev

@centdix centdix marked this pull request as ready for review December 9, 2025 21:02
claude bot (Contributor) commented Dec 9, 2025

Claude finished @centdix's task in 3m 42s


PR Review: Add test pipeline for AI chat flow

Overview

This PR adds a comprehensive evaluation framework for testing the AI chat flow generation feature. The implementation includes:

  • 7 test cases covering flow creation and modification scenarios
  • A variant system for comparing different prompting strategies and models
  • LLM-based evaluation comparing generated flows against expected outputs
  • Results written to markdown files with detailed metrics

Code Quality & Architecture

✅ Strengths

  1. Well-structured evaluation framework: The separation of concerns is excellent:

    • evalRunner.ts - Main test execution engine
    • evalHelpers.ts - Mock helpers for flow state
    • evalFlowComparison.ts - LLM-based flow comparison
    • evalResultsWriter.ts - Result file generation
    • evalVariants.ts - Variant configuration system
  2. Flexible variant system: The variant configuration allows testing different combinations of the following (a possible shape is sketched after this list):

    • System prompts (default, custom, minimal-single-tool)
    • Tool configurations (all tools, subset, custom)
    • LLM models (Gemini 2.5 Flash, Claude Haiku 4.5, GPT-4o)
  3. Comprehensive test coverage: Tests cover both creation (tests 1-4) and modification (tests 5-7) scenarios with varying complexity.

  4. Smart file organization: Results are excluded from git via .gitignore, and test data is well-organized in expected/ and initial/ directories.

  5. Parallel execution: Variants run in parallel using Promise.all for efficiency.
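
To make the variant system concrete, here is a hypothetical sketch of what a variant configuration could look like; the field names and the model identifier are assumptions, and the real definitions live in evalVariants.ts:

interface VariantConfig {
  name: string
  // 'default', 'custom', or 'minimal-single-tool', per the prompts listed above
  systemPrompt: 'default' | 'custom' | 'minimal-single-tool'
  customSystemPrompt?: string
  // 'all' for every tool, or an explicit subset of tool names
  tools: 'all' | string[]
  // OpenRouter model identifier (assumed format)
  model: string
}

const EXAMPLE_VARIANT: VariantConfig = {
  name: 'baseline',
  systemPrompt: 'default',
  tools: 'all',
  model: 'google/gemini-2.5-flash' // assumed OpenRouter id
}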

Issues & Recommendations

🔴 Critical Issues

  1. Non-deterministic test assertions (flowChat.eval.test.ts:87-104 and similar)

    • All tests use expect(true).toBe(true) which makes them always pass
    • Impact: Tests don't validate actual quality of LLM outputs
    • Fix: Add meaningful assertions based on resemblance scores (see inline comment)
  2. Missing error handling in file operations (evalResultsWriter.ts:35-165)

    • File write operations lack try-catch blocks
    • Impact: Tests will crash with unclear errors on filesystem issues
    • Fix: Wrap file operations in error handling (see inline comment)
  3. Untyped API key access with @ts-ignore (flowChat.eval.test.ts:30, evalFlowComparison.ts:77)

    • Using @ts-ignore to bypass type checking is fragile
    • Impact: Runtime errors if API key is missing, poor developer experience
    • Fix: Add proper types and validation (see inline comment)

🟡 Major Issues

  1. Commented-out code (evalRunner.ts:272-280, flowChat.eval.test.ts:28-29, 56)

    • Significant dead code without explanation
    • Fix: Remove or document why it's preserved (see inline comment)
  2. Missing type safety for JSON imports (flowChat.eval.test.ts:5-24)

    • All JSON imports use @ts-ignore
    • Impact: No compile-time validation of test data structure
    • Fix: Enable resolveJsonModule or add proper type definitions (see inline comment)
  3. Inconsistent timeout values (throughout flowChat.eval.test.ts)

    • Tests use different timeout multipliers (5x, 2x) without clear reasoning
    • Fix: Document rationale or standardize
  4. No validation of expected flow files

    • Test doesn't validate expected flows match the schema
    • Impact: Malformed test data only discovered at runtime
    • Fix: Add schema validation when loading expected flows (see the sketch after this list)
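
A minimal sketch of load-time validation, assuming zod is available in the test toolchain and using a deliberately loose stand-in for the real flow schema:

import { z } from 'zod'

// Loose stand-in for the expected-flow shape; the real flow schema is richer
const expectedFlowSchema = z.object({
  summary: z.string().optional(),
  value: z.object({
    modules: z.array(z.record(z.any()))
  }),
  schema: z.record(z.any()).optional()
})

export function loadExpectedFlow(data: unknown) {
  // parse() throws a descriptive error for malformed fixtures,
  // surfacing bad test data at load time instead of mid-run
  return expectedFlowSchema.parse(data)
}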

🟢 Minor Issues

  1. Magic numbers (evalRunner.ts:85, flowChat.eval.test.ts:59)

    • Hard-coded values should be named constants (see inline comment)
  2. Duplicate model/variant name handling (flowChat.eval.test.ts:39-54)

    • Variant names generated by simple string replacement could collide
    • Fix: Use more robust naming or add uniqueness validation
  3. Inconsistent error formatting (evalRunner.ts:204-219)

    • APIError vs Error handled differently
    • Fix: Extract to a helper function (see the sketch after this list)
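
A possible shape for that helper, assuming APIError comes from the openai package; adapt the branches to whatever error types evalRunner.ts actually encounters:

import { APIError } from 'openai'

// Normalize the different error shapes into one log-friendly string
function formatEvalError(err: unknown): string {
  if (err instanceof APIError) {
    return `API error ${err.status ?? 'unknown status'}: ${err.message}`
  }
  if (err instanceof Error) {
    return err.message
  }
  return String(err)
}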

Performance Considerations

✅ Good

  • Parallel variant execution using Promise.all
  • Temperature set to 0 for deterministic outputs
  • Mock helpers avoid expensive operations

🟡 Areas for Improvement

  • No caching mechanism: each test re-runs the full LLM evaluation (a possible approach is sketched after this list)
  • Large result files are written for every variant
  • The commented-out sequential execution code suggests undocumented rate-limiting concerns
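
A minimal sketch of on-disk caching keyed by prompt, variant, and model, assuming temperature-0 outputs are stable enough to reuse; the cache location and function names are hypothetical:

import { createHash } from 'node:crypto'
import { mkdir, readFile, writeFile } from 'node:fs/promises'
import { join } from 'node:path'

const CACHE_DIR = '.eval-cache' // hypothetical location

function cacheKey(userPrompt: string, variantName: string, model: string): string {
  return createHash('sha256').update([userPrompt, variantName, model].join('\n')).digest('hex')
}

async function withCache<T>(key: string, run: () => Promise<T>): Promise<T> {
  const path = join(CACHE_DIR, `${key}.json`)
  try {
    // Reuse a previous result for this exact prompt/variant/model combination
    return JSON.parse(await readFile(path, 'utf8')) as T
  } catch {
    const result = await run()
    await mkdir(CACHE_DIR, { recursive: true })
    await writeFile(path, JSON.stringify(result, null, 2))
    return result
  }
}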

Security Review

✅ Good

  • API keys from environment variables, not hardcoded
  • No sensitive data in test fixtures
  • Test results properly excluded from version control

🟡 Considerations

  • Ensure CI logs are secured (could contain API keys in stack traces)
  • If extended to user prompts, add validation to prevent prompt injection

Configuration Changes Review

All configuration changes are appropriate:

  • vite.config.js - Proper test setup with Node environment
  • .gitignore - Results directory excluded
  • test-setup.ts - Standard storage mocks
  • core.ts - Minimal, focused changes (export function, clean up schema)

Summary

This is a well-architected evaluation framework that provides valuable testing infrastructure for the AI chat flow feature. The code demonstrates good separation of concerns and thoughtful design.

Before merging, address the critical issues:

  1. Replace expect(true).toBe(true) with meaningful assertions
  2. Add error handling to file operations
  3. Improve type safety for environment variables and JSON imports

The major issues (commented code, timeout inconsistencies, schema validation) should also be addressed for production-ready code.

Recommendation: Request changes to fix critical issues, then approve after fixes.


Testing Instructions for Manual Verification

To manually verify the changes added in this PR:

Prerequisites:

  • Set the OPENROUTER_API_KEY environment variable with a valid OpenRouter API key that has access to the models being tested (Gemini 2.5 Flash, Claude Haiku 4.5, GPT-4o)

Running the evaluation tests:

  1. Navigate to the frontend directory: cd frontend
  2. Install dependencies if not already done: npm install
  3. Export your API key: export OPENROUTER_API_KEY=your-key-here
  4. Run the evaluation test suite: npm test -- flowChat.eval.test.ts

What to verify:

  • All 7 test cases should execute (may take several minutes due to LLM API calls)
  • Check the console output for resemblance scores - they should be logged for each variant
  • After tests complete, verify results are written to frontend/src/lib/components/copilot/chat/flow/__tests__/eval/results/
  • Each test run should create a timestamped folder containing:
    • summary.md - Overview with scores and evaluation details
    • Multiple {variant-name}.json files with detailed results
    • Multiple {variant-name}_flow.json files with generated flow definitions

What should work:

  • Tests should run without crashes (though current assertions always pass)
  • LLM should generate flows matching the prompts
  • Evaluation should produce resemblance scores comparing generated vs expected flows
  • Results should be written to disk with proper formatting

Known limitations (from review):

  • Tests will always show as "passed" even if LLM generates poor quality flows (due to expect(true).toBe(true) assertions)
  • No validation that generated flows are functionally correct beyond LLM-based comparison


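Inline review comments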
// Assert all variants succeeded
for (const result of results) {
  expect(true).toBe(true)

claude bot (Contributor) commented on flowChat.eval.test.ts:

Critical Issue: This assertion expect(true).toBe(true) always passes, making the test ineffective.

Consider adding meaningful assertions based on the evaluation results:

expect(result.success).toBe(true)
if (result.evaluationResult) {
  expect(result.evaluationResult.resemblanceScore).toBeGreaterThanOrEqual(70)
  expect(result.evaluationResult.missingRequirements?.length || 0).toBe(0)
}

This pattern applies to all test cases in this file.

// Get API key from environment - tests will be skipped if not set
// @ts-ignore
// const OPENAI_API_KEY = process.env.OPENAI_API_KEY
const OPENROUTER_API_KEY = process.env.OPENROUTER_API_KEY

claude bot (Contributor) commented on flowChat.eval.test.ts:

Major Issue: Using @ts-ignore to bypass type checking is fragile. Consider:

  1. Adding proper types for process.env:
declare global {
  namespace NodeJS {
    interface ProcessEnv {
      OPENROUTER_API_KEY?: string
    }
  }
}
  2. Adding validation before use:
const OPENROUTER_API_KEY = process.env.OPENROUTER_API_KEY
if (!OPENROUTER_API_KEY) {
  throw new Error('OPENROUTER_API_KEY environment variable is required for tests')
}

})
)
return results
// const results: EvalResult[] = []

claude bot (Contributor) commented on evalRunner.ts:

Major Issue: Remove this commented-out code or add a TODO comment explaining why it's preserved.

If sequential execution is needed for rate limiting, consider making it configurable:

export async function runVariantComparison(
  userPrompt: string,
  variants: VariantConfig[],
  openaiApiKey: string,
  baseOptions?: Omit<EvalOptions, 'variant'> & { sequential?: boolean }
): Promise<EvalResult[]> {
  if (baseOptions?.sequential) {
    const results: EvalResult[] = []
    for (const variant of variants) {
      results.push(await runFlowEval(userPrompt, openaiApiKey, { ...baseOptions, variant }))
    }
    return results
  }
  // default: run all variants in parallel
  return await Promise.all(
    variants.map((variant) => runFlowEval(userPrompt, openaiApiKey, { ...baseOptions, variant }))
  )
}

const timestamp = generateTimestamp()

// Ensure results directory exists
await mkdir(resultsDir, { recursive: true })

claude bot (Contributor) commented on evalResultsWriter.ts:

Critical Issue: File operations lack error handling. If the file system is read-only or disk space is insufficient, tests will crash.

Add error handling:

try {
  await mkdir(resultsDir, { recursive: true })
  const resultFolder = join(resultsDir, timestamp)
  await mkdir(resultFolder, { recursive: true })
} catch (err) {
  throw new Error(`Failed to create results directory: ${err instanceof Error ? err.message : String(err)}`)
}

Apply similar error handling to all writeFile calls below.

import { runVariantComparison, type ExpectedFlow } from './evalRunner'
import { writeComparisonResults } from './evalResultsWriter'
import { BASELINE_VARIANT, MINIMAL_SINGLE_TOOL_VARIANT, NO_FULL_SCHEMA_VARIANT } from './variants'
// @ts-ignore - JSON import

claude bot (Contributor) commented on flowChat.eval.test.ts:

Major Issue: Multiple @ts-ignore comments for JSON imports bypass type checking.

Consider:

  1. Enable resolveJsonModule in tsconfig
  2. Create a shared type definition:
import type { FlowModule } from '$lib/gen'

interface ExpectedFlowData {
  summary?: string
  value: {
    modules: FlowModule[]
  }
  schema?: Record<string, any>
}
  3. Use type assertions:
import expectedTest1Data from './expected/test1.json'
const expectedTest1 = expectedTest1Data as ExpectedFlowData

const toolsCalled: string[] = []
const toolCallDetails: ToolCallDetail[] = []
let iterations = 0
const maxIterations = options?.maxIterations ?? 20

claude bot (Contributor) commented on evalRunner.ts:

Minor Issue: Magic number. Extract to a named constant:

const DEFAULT_MAX_ITERATIONS = 20
const maxIterations = options?.maxIterations ?? DEFAULT_MAX_ITERATIONS

This improves readability and makes the value easier to adjust.

@hugocasa hugocasa merged commit 36cb5bd into main Dec 10, 2025
3 checks passed
@hugocasa hugocasa deleted the fg/evallm branch December 10, 2025 13:09
@github-actions github-actions bot locked and limited conversation to collaborators Dec 10, 2025