Conversation

@centdix centdix (Collaborator) commented Dec 9, 2025

No description provided.

cloudflare-workers-and-pages bot commented Dec 9, 2025

Deploying windmill with Cloudflare Pages

Latest commit: b3e40ee
Status: ✅  Deploy successful!
Preview URL: https://15798ab1.windmill.pages.dev
Branch Preview URL: https://fg-evallm.windmill.pages.dev

@centdix centdix marked this pull request as ready for review December 9, 2025 21:02
claude bot (Contributor) commented Dec 9, 2025

Claude finished @centdix's task in 3m 42s


PR Review: Add test pipeline for AI chat flow

Overview

This PR adds a comprehensive evaluation framework for testing the AI chat flow generation feature. The implementation includes:

  • 7 test cases covering flow creation and modification scenarios
  • A variant system for comparing different prompting strategies and models
  • LLM-based evaluation comparing generated flows against expected outputs
  • Results written to markdown files with detailed metrics

Code Quality & Architecture

✅ Strengths

  1. Well-structured evaluation framework: The separation of concerns is excellent:

    • evalRunner.ts - Main test execution engine
    • evalHelpers.ts - Mock helpers for flow state
    • evalFlowComparison.ts - LLM-based flow comparison
    • evalResultsWriter.ts - Result file generation
    • evalVariants.ts - Variant configuration system
  2. Flexible variant system: The variant configuration allows testing different combinations of the following (a possible shape is sketched after this list):

    • System prompts (default, custom, minimal-single-tool)
    • Tool configurations (all tools, subset, custom)
    • LLM models (Gemini 2.5 Flash, Claude Haiku 4.5, GPT-4o)
  3. Comprehensive test coverage: Tests cover both creation (tests 1-4) and modification (tests 5-7) scenarios with varying complexity.

  4. Smart file organization: Results are excluded from git via .gitignore, and test data is well-organized in expected/ and initial/ directories.

  5. Parallel execution: Variants run in parallel using Promise.all for efficiency.
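
To make the variant system concrete, here is a hypothetical sketch of what a variant configuration could look like; the field names and the model identifier are assumptions, and the real definitions live in evalVariants.ts:

interface VariantConfig {
  name: string
  // 'default', 'custom', or 'minimal-single-tool', per the prompts listed above
  systemPrompt: 'default' | 'custom' | 'minimal-single-tool'
  customSystemPrompt?: string
  // 'all' for every tool, or an explicit subset of tool names
  tools: 'all' | string[]
  // OpenRouter model identifier (assumed format)
  model: string
}

const EXAMPLE_VARIANT: VariantConfig = {
  name: 'baseline',
  systemPrompt: 'default',
  tools: 'all',
  model: 'google/gemini-2.5-flash' // assumed OpenRouter id
}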

Issues & Recommendations

🔴 Critical Issues

  1. Non-deterministic test assertions (flowChat.eval.test.ts:87-104 and similar)

    • All tests use expect(true).toBe(true) which makes them always pass
    • Impact: Tests don't validate actual quality of LLM outputs
    • Fix: Add meaningful assertions based on resemblance scores (see inline comment)
  2. Missing error handling in file operations (evalResultsWriter.ts:35-165)

    • File write operations lack try-catch blocks
    • Impact: Tests will crash with unclear errors on filesystem issues
    • Fix: Wrap file operations in error handling (see inline comment)
  3. Untyped API key access with @ts-ignore (flowChat.eval.test.ts:30, evalFlowComparison.ts:77)

    • Using @ts-ignore to bypass type checking is fragile
    • Impact: Runtime errors if API key is missing, poor developer experience
    • Fix: Add proper types and validation (see inline comment)

🟡 Major Issues

  1. Commented-out code (evalRunner.ts:272-280, flowChat.eval.test.ts:28-29, 56)

    • Significant dead code without explanation
    • Fix: Remove or document why it's preserved (see inline comment)
  2. Missing type safety for JSON imports (flowChat.eval.test.ts:5-24)

    • All JSON imports use @ts-ignore
    • Impact: No compile-time validation of test data structure
    • Fix: Enable resolveJsonModule or add proper type definitions (see inline comment)
  3. Inconsistent timeout values (throughout flowChat.eval.test.ts)

    • Tests use different timeout multipliers (5x, 2x) without clear reasoning
    • Fix: Document rationale or standardize
  4. No validation of expected flow files

    • Test doesn't validate expected flows match the schema
    • Impact: Malformed test data only discovered at runtime
    • Fix: Add schema validation when loading expected flows (see the sketch after this list)
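
A minimal sketch of load-time validation, assuming zod is available in the test toolchain and using a deliberately loose stand-in for the real flow schema:

import { z } from 'zod'

// Loose stand-in for the expected-flow shape; the real flow schema is richer
const expectedFlowSchema = z.object({
  summary: z.string().optional(),
  value: z.object({
    modules: z.array(z.record(z.any()))
  }),
  schema: z.record(z.any()).optional()
})

export function loadExpectedFlow(data: unknown) {
  // parse() throws a descriptive error for malformed fixtures,
  // surfacing bad test data at load time instead of mid-run
  return expectedFlowSchema.parse(data)
}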

🟢 Minor Issues

  1. Magic numbers (evalRunner.ts:85, flowChat.eval.test.ts:59)

    • Hard-coded values should be named constants (see inline comment)
  2. Duplicate model/variant name handling (flowChat.eval.test.ts:39-54)

    • Variant names generated by simple string replacement could collide
    • Fix: Use more robust naming or add uniqueness validation
  3. Inconsistent error formatting (evalRunner.ts:204-219)

    • APIError vs Error handled differently
    • Fix: Extract to a helper function (see the sketch after this list)
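
A possible shape for that helper, assuming APIError comes from the openai package; adapt the branches to whatever error types evalRunner.ts actually encounters:

import { APIError } from 'openai'

// Normalize the different error shapes into one log-friendly string
function formatEvalError(err: unknown): string {
  if (err instanceof APIError) {
    return `API error ${err.status ?? 'unknown status'}: ${err.message}`
  }
  if (err instanceof Error) {
    return err.message
  }
  return String(err)
}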

Performance Considerations

✅ Good

  • Parallel variant execution using Promise.all
  • Temperature set to 0 for deterministic outputs
  • Mock helpers avoid expensive operations

🟡 Areas for Improvement

  • No caching mechanism: each test re-runs the full LLM evaluation (a possible approach is sketched after this list)
  • Large result files are written for every variant
  • The commented-out sequential execution code suggests undocumented rate-limiting concerns
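
A minimal sketch of on-disk caching keyed by prompt, variant, and model, assuming temperature-0 outputs are stable enough to reuse; the cache location and function names are hypothetical:

import { createHash } from 'node:crypto'
import { mkdir, readFile, writeFile } from 'node:fs/promises'
import { join } from 'node:path'

const CACHE_DIR = '.eval-cache' // hypothetical location

function cacheKey(userPrompt: string, variantName: string, model: string): string {
  return createHash('sha256').update([userPrompt, variantName, model].join('\n')).digest('hex')
}

async function withCache<T>(key: string, run: () => Promise<T>): Promise<T> {
  const path = join(CACHE_DIR, `${key}.json`)
  try {
    // Reuse a previous result for this exact prompt/variant/model combination
    return JSON.parse(await readFile(path, 'utf8')) as T
  } catch {
    const result = await run()
    await mkdir(CACHE_DIR, { recursive: true })
    await writeFile(path, JSON.stringify(result, null, 2))
    return result
  }
}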

Security Review

✅ Good

  • API keys from environment variables, not hardcoded
  • No sensitive data in test fixtures
  • Test results properly excluded from version control

🟡 Considerations

  • Ensure CI logs are secured (could contain API keys in stack traces)
  • If extended to user prompts, add validation to prevent prompt injection

Configuration Changes Review

All configuration changes are appropriate:

  • vite.config.js - Proper test setup with Node environment
  • .gitignore - Results directory excluded
  • test-setup.ts - Standard storage mocks
  • core.ts - Minimal, focused changes (export function, clean up schema)

Summary

This is a well-architected evaluation framework that provides valuable testing infrastructure for the AI chat flow feature. The code demonstrates good separation of concerns and thoughtful design.

Before merging, address the critical issues:

  1. Replace expect(true).toBe(true) with meaningful assertions
  2. Add error handling to file operations
  3. Improve type safety for environment variables and JSON imports

The major issues (commented code, timeout inconsistencies, schema validation) should also be addressed for production-ready code.

Recommendation: Request changes to fix critical issues, then approve after fixes.


Testing Instructions for Manual Verification

To manually verify the changes added in this PR:

Prerequisites:

  • Set the OPENROUTER_API_KEY environment variable with a valid OpenRouter API key that has access to the models being tested (Gemini 2.5 Flash, Claude Haiku 4.5, GPT-4o)

Running the evaluation tests:

  1. Navigate to the frontend directory: cd frontend
  2. Install dependencies if not already done: npm install
  3. Export your API key: export OPENROUTER_API_KEY=your-key-here
  4. Run the evaluation test suite: npm test -- flowChat.eval.test.ts

What to verify:

  • All 7 test cases should execute (may take several minutes due to LLM API calls)
  • Check the console output for resemblance scores - they should be logged for each variant
  • After tests complete, verify results are written to frontend/src/lib/components/copilot/chat/flow/__tests__/eval/results/
  • Each test run should create a timestamped folder containing:
    • summary.md - Overview with scores and evaluation details
    • Multiple {variant-name}.json files with detailed results
    • Multiple {variant-name}_flow.json files with generated flow definitions

What should work:

  • Tests should run without crashes (though current assertions always pass)
  • LLM should generate flows matching the prompts
  • Evaluation should produce resemblance scores comparing generated vs expected flows
  • Results should be written to disk with proper formatting

Known limitations (from review):

  • Tests will always show as "passed" even if LLM generates poor quality flows (due to expect(true).toBe(true) assertions)
  • No validation that generated flows are functionally correct beyond LLM-based comparison


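Inline review comments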
// Assert all variants succeeded
for (const result of results) {
  expect(true).toBe(true)

claude bot (Contributor) commented on flowChat.eval.test.ts:

Critical Issue: This assertion expect(true).toBe(true) always passes, making the test ineffective.

Consider adding meaningful assertions based on the evaluation results:

expect(result.success).toBe(true)
if (result.evaluationResult) {
  expect(result.evaluationResult.resemblanceScore).toBeGreaterThanOrEqual(70)
  expect(result.evaluationResult.missingRequirements?.length || 0).toBe(0)
}

This pattern applies to all test cases in this file.

// Get API key from environment - tests will be skipped if not set
// @ts-ignore
// const OPENAI_API_KEY = process.env.OPENAI_API_KEY
const OPENROUTER_API_KEY = process.env.OPENROUTER_API_KEY

claude bot (Contributor) commented on flowChat.eval.test.ts:

Major Issue: Using @ts-ignore to bypass type checking is fragile. Consider:

  1. Adding proper types for process.env:
declare global {
  namespace NodeJS {
    interface ProcessEnv {
      OPENROUTER_API_KEY?: string
    }
  }
}
  2. Adding validation before use:
const OPENROUTER_API_KEY = process.env.OPENROUTER_API_KEY
if (!OPENROUTER_API_KEY) {
  throw new Error('OPENROUTER_API_KEY environment variable is required for tests')
}

})
)
return results
// const results: EvalResult[] = []

claude bot (Contributor) commented on evalRunner.ts:

Major Issue: Remove this commented-out code or add a TODO comment explaining why it's preserved.

If sequential execution is needed for rate limiting, consider making it configurable:

export async function runVariantComparison(
  userPrompt: string,
  variants: VariantConfig[],
  openaiApiKey: string,
  baseOptions?: Omit<EvalOptions, 'variant'> & { sequential?: boolean }
): Promise<EvalResult[]> {
  if (baseOptions?.sequential) {
    const results: EvalResult[] = []
    for (const variant of variants) {
      results.push(await runFlowEval(userPrompt, openaiApiKey, { ...baseOptions, variant }))
    }
    return results
  }
  // default: run all variants in parallel
  return await Promise.all(
    variants.map((variant) => runFlowEval(userPrompt, openaiApiKey, { ...baseOptions, variant }))
  )
}

const timestamp = generateTimestamp()

// Ensure results directory exists
await mkdir(resultsDir, { recursive: true })

claude bot (Contributor) commented on evalResultsWriter.ts:

Critical Issue: File operations lack error handling. If the file system is read-only or disk space is insufficient, tests will crash.

Add error handling:

try {
  await mkdir(resultsDir, { recursive: true })
  const resultFolder = join(resultsDir, timestamp)
  await mkdir(resultFolder, { recursive: true })
} catch (err) {
  throw new Error(`Failed to create results directory: ${err instanceof Error ? err.message : String(err)}`)
}

Apply similar error handling to all writeFile calls below.

import { runVariantComparison, type ExpectedFlow } from './evalRunner'
import { writeComparisonResults } from './evalResultsWriter'
import { BASELINE_VARIANT, MINIMAL_SINGLE_TOOL_VARIANT, NO_FULL_SCHEMA_VARIANT } from './variants'
// @ts-ignore - JSON import

claude bot (Contributor) commented on flowChat.eval.test.ts:

Major Issue: Multiple @ts-ignore comments for JSON imports bypass type checking.

Consider:

  1. Enable resolveJsonModule in tsconfig
  2. Create a shared type definition:
import type { FlowModule } from '$lib/gen'

interface ExpectedFlowData {
  summary?: string
  value: {
    modules: FlowModule[]
  }
  schema?: Record<string, any>
}
  3. Use type assertions:
import expectedTest1Data from './expected/test1.json'
const expectedTest1 = expectedTest1Data as ExpectedFlowData

const toolsCalled: string[] = []
const toolCallDetails: ToolCallDetail[] = []
let iterations = 0
const maxIterations = options?.maxIterations ?? 20

claude bot (Contributor) commented on evalRunner.ts:

Minor Issue: Magic number. Extract to a named constant:

const DEFAULT_MAX_ITERATIONS = 20
const maxIterations = options?.maxIterations ?? DEFAULT_MAX_ITERATIONS

This improves readability and makes the value easier to adjust.

@hugocasa hugocasa merged commit 36cb5bd into main Dec 10, 2025
3 checks passed
@hugocasa hugocasa deleted the fg/evallm branch December 10, 2025 13:09
@github-actions github-actions bot locked and limited conversation to collaborators Dec 10, 2025