Commit 36152cb

docs(ai): assistant evals development workflow (supabase#45840)
Adds `README.md` to `apps/studio/evals` explaining the development workflow for updating offline and online evals for Studio's AI Assistant.

Resolves AI-681

## Summary by CodeRabbit

* **Documentation**
  * Added comprehensive documentation for Studio Assistant Evals, covering evaluation setup, configuration of scoring methods, and deployment workflows for both offline and online evaluation processes.
1 parent d143571 commit 36152cb

1 file changed

Lines changed: 56 additions & 0 deletions

apps/studio/evals/README.md

@@ -0,0 +1,56 @@
# Studio Assistant Evals

We use [Braintrust](https://www.braintrust.dev/) to evaluate Assistant behaviors against a tracked dataset (offline evals) and against live traces (online evals).

## Offline Evals

Add offline eval test cases to `dataset.ts`. If needed, add new scorers (see below) for the specific dimension you wish to test. Expect to update and run offline evals when adding new Assistant behaviors.

You may wish to run offline evals when:

- You updated the eval suite with a new test case or scorer
- You changed the Assistant's behavior and want to check for improvements/regressions
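
As a rough illustration of what adding a test case involves, the sketch below models a dataset entry with an `input.prompt` field (the field the `--filter` example in this README targets). The `EvalCase` type, the `expected.requiredTools` field, the sample prompts, and the `duplicatePrompts` helper are all hypothetical and not taken from `dataset.ts`; check that file for the real shape.

```typescript
// Hypothetical shape of an offline eval case; the real types live in `dataset.ts`.
type EvalCase = {
  input: { prompt: string }
  // Optional ground truth that scorers may compare against (field name is illustrative).
  expected?: { requiredTools?: string[] }
}

// Illustrative cases only; these are not entries from the tracked dataset.
const cases: EvalCase[] = [
  {
    input: { prompt: 'How many projects do I have?' },
    expected: { requiredTools: ['list_projects'] },
  },
  { input: { prompt: 'Write a SQL query that lists all tables' } },
]

// Returns prompts that appear more than once, so each case can still be
// selected individually by its prompt.
function duplicatePrompts(dataset: EvalCase[]): string[] {
  const seen = new Set<string>()
  const duplicates = new Set<string>()
  for (const { input } of dataset) {
    if (seen.has(input.prompt)) duplicates.add(input.prompt)
    seen.add(input.prompt)
  }
  return Array.from(duplicates)
}
```

Keeping prompts unique is a design choice assumed here, not a documented requirement; it makes it easier to target one case when filtering on `input.prompt`.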

### Running Offline Evals in CI

Add the `run-evals` label to a PR in this repo, and Braintrust's GitHub Action will run the evals and post a summary comment ([example](https://github.com/supabase/supabase/pull/44729)).

You can find detailed results in the "Experiments" tab of the "Assistant" project on Braintrust.

### Running Offline Evals in Local Dev

Within `apps/studio`:

```bash
# Set up WASM files
pnpm evals:setup

# Run all evals and upload results to Braintrust
pnpm evals:upload

# Run all evals without uploading results
pnpm evals:run

# Run and upload a single test case
pnpm braintrust eval evals/assistant.eval.ts --filter "input.prompt=How many projects"
```

Upload results when you want to inspect Experiments or Logs in the Braintrust dashboard or via the API. You can use developer tools like [Braintrust MCP](https://www.braintrust.dev/docs/integrations/developer-tools/mcp) or the [`bt` CLI](https://www.braintrust.dev/docs/reference/cli/quickstart) to analyze results with an agent.

## Scorers

Scorers look at a `thread` or task `output` and assign a score deterministically or via LLM-as-a-judge. Optionally, they can consider `expected` values.

Define scorers in `scorer.ts` and include them in `assistant.eval.ts` to run them in offline evals.
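
To make the deterministic case concrete, here is a minimal sketch of a scorer that reads the task `output` and optional `expected` values. The `ScorerArgs` and `Score` types, the `mustMention` field, and the `mentionsExpected` function are hypothetical names for illustration; the actual signatures in `scorer.ts` may differ. It also assumes that a `null` score means "skip this case".

```typescript
// Hypothetical, simplified types; the real ones are defined alongside `scorer.ts`.
type ScorerArgs = {
  output: string
  expected?: { mustMention?: string[] }
}
type Score = { name: string; score: number | null }

// Deterministic scorer: the fraction of expected phrases found in the output.
// Returns a null score when no ground truth is provided (assumed to mean "skip").
function mentionsExpected({ output, expected }: ScorerArgs): Score {
  const phrases = expected?.mustMention ?? []
  if (phrases.length === 0) return { name: 'mentions_expected', score: null }

  const haystack = output.toLowerCase()
  const hits = phrases.filter((phrase) => haystack.includes(phrase.toLowerCase())).length
  return { name: 'mentions_expected', score: hits / phrases.length }
}
```

An LLM-as-a-judge scorer would keep the same input and output shape but replace the string matching with a model call.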

### Updating Online Scorers

Online scorers run as serverless functions on Braintrust infrastructure. They're deployed from the `scorer-online.ts` script. Since they score production traces, they can't rely on ground truth `expected` values. Structure scoring logic and LLM prompts accordingly. Not every scorer needs to be an online scorer.

To opt in to online scoring, add the scorer to `scorer-online-manifest.json` and add a corresponding handler in `scorer-online.ts`.
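
The pairing between manifest entries and handlers can be pictured with the sketch below. Everything in it is an assumption made for illustration: the manifest is modeled as a plain list of scorer names, and `OnlineHandler`, `handlers`, `non_empty_answer`, and `missingHandlers` are hypothetical names, not the real contents of `scorer-online-manifest.json` or `scorer-online.ts`.

```typescript
// Assumed manifest shape: a plain list of scorer names. The real
// `scorer-online-manifest.json` format may differ.
type Manifest = { scorers: string[] }

// Online handlers only see the production trace output, never `expected`.
type OnlineHandler = (trace: { output: string }) => number

const handlers: Record<string, OnlineHandler> = {
  // Reference-free check: did the Assistant produce a non-empty answer?
  non_empty_answer: ({ output }) => (output.trim().length > 0 ? 1 : 0),
}

// Lists manifest entries that have no corresponding handler.
function missingHandlers(manifest: Manifest, registry: Record<string, OnlineHandler>): string[] {
  return manifest.scorers.filter(
    (name) => !Object.prototype.hasOwnProperty.call(registry, name)
  )
}
```

A check along these lines could catch a scorer that was added to the manifest without a handler before it is deployed.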

### Testing & Deploying Online Scorers

Add the `preview-scorers` label to a PR to deploy branch-prefixed scorers to the "Assistant (Staging Scorers)" Braintrust project ([example](https://github.com/supabase/supabase/pull/45654#issuecomment-4398433047)). From that project dashboard, you can manually test the scorer against a trace from any project.

After merge to `master`, preview scorers are cleaned up automatically and the scorers are deployed to production in the "Assistant" Braintrust project. Then update the "Online Scoring" automation on the Logs page to include the new scorer function.
