|
| 1 | +# Studio Assistant Evals |
| 2 | + |
| 3 | +We use [Braintrust](https://www.braintrust.dev/) to evaluate Assistant behaviors against a tracked dataset (offline evals) and against live traces (online evals). |
| 4 | + |
| 5 | +## Offline Evals |
| 6 | + |
| 7 | +Add offline eval test cases to `dataset.ts`. If needed, add new scorers (see below) for the specific dimension you wish to test. Expect to update and run offline evals when adding new Assistant behaviors. |
| 8 | + |
| 9 | +You may wish to run offline evals when: |
| 10 | + |
| 11 | +- You updated the eval suite with a new test case or scorer |
| 12 | +- You changed Assistant's behavior and want to check for improvements/regressions |
| 13 | + |
| 14 | +### Running Offline Evals in CI |
| 15 | + |
| 16 | +Add the `run-evals` label to a PR in this repo, and Braintrust's GitHub Action will run the evals and post a summary comment ([example](https://github.com/supabase/supabase/pull/44729)). |
| 17 | + |
| 18 | +You can find detailed results in the "Experiments" tab of the "Assistant" project on Braintrust. |
| 19 | + |
| 20 | +### Running Offline Evals in Local Dev |
| 21 | + |
| 22 | +Within `apps/studio`: |
| 23 | + |
| 24 | +```bash |
| 25 | +# To set up WASM files |
| 26 | +pnpm evals:setup |
| 27 | + |
| 28 | +# Run all evals and upload results to Braintrust |
| 29 | +pnpm evals:upload |
| 30 | + |
| 31 | +# Run all evals without uploading results |
| 32 | +pnpm evals:run |
| 33 | + |
| 34 | +# Run and upload a single test case |
| 35 | +pnpm braintrust eval evals/assistant.eval.ts --filter "input.prompt=How many projects" |
| 36 | +``` |
| 37 | + |
| 38 | +Upload results when you want to inspect Experiments or Logs in the Braintrust dashboard or through the API. You can use developer tools like [Braintrust MCP](https://www.braintrust.dev/docs/integrations/developer-tools/mcp) or the [`bt` CLI](https://www.braintrust.dev/docs/reference/cli/quickstart) to analyze results with an agent. |
| 39 | + |
| 40 | +## Scorers |
| 41 | + |
| 42 | +Scorers look at a `thread` or task `output` and assign a score, either deterministically or via LLM-as-a-judge. They can optionally take `expected` values into account. |
| 43 | + |
| 44 | +Define scorers in `scorer.ts` and include them in `assistant.eval.ts` to run them in offline evals. |
| 45 | + |
| 46 | +### Updating Online Scorers |
| 47 | + |
| 48 | +Online scorers run as serverless functions on Braintrust infrastructure. They're deployed from the `scorer-online.ts` script. Since they score production traces, they can't rely on ground truth `expected` values. Structure scoring logic and LLM prompts accordingly. Not every scorer needs to be an online scorer. |
| 49 | + |
| 50 | +To opt in to online scoring, add the scorer to `scorer-online-manifest.json` and add a corresponding handler in `scorer-online.ts`. |
| 51 | + |
| 52 | +### Testing & Deploying Online Scorers |
| 53 | + |
| 54 | +Add the `preview-scorers` label to a PR to deploy branch-prefixed scorers to the "Assistant (Staging Scorers)" Braintrust project ([example](https://github.com/supabase/supabase/pull/45654#issuecomment-4398433047)). From that project dashboard, you can manually test the scorer against a trace from any project. |
| 55 | + |
| 56 | +After merge to `master`, preview scorers are cleaned up automatically and the scorers deploy to production in the "Assistant" Braintrust project. Then update the "Online Scoring" automation on the Logs page to include the new scorer function. |