Skip to content

[pull] master from supabase:master#911

Merged
pull[bot] merged 2 commits into
code:masterfrom
supabase:master
May 12, 2026
Merged

[pull] master from supabase:master#911
pull[bot] merged 2 commits into
code:masterfrom
supabase:master

Conversation

@pull
Copy link
Copy Markdown

@pull pull Bot commented May 12, 2026

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.4)

Can you help keep this open source service alive? 💖 Please sponsor : )

…h needsApproval (#45654)

## Motivation

When Assistant runs a potentially destructive tool like `execute_sql`,
it stops the LLM request and prompts for client-side approval and
execution of the tool. After approval, a second request kicks off under
a separate trace. This has made scoring and
[Topics](https://www.braintrust.dev/blog/topics) classification
challenging, as the generated `output` is split across stateless
requests. The [span-level
scoring](https://www.braintrust.dev/docs/evaluate/custom-code#score-spans)
approach we've used thusfar (after the LLM call, we massage the result
into an `output` payload that's stuck onto the root span) has been
cumbersome and led to invalid scores / topics where only part of the
assistant response is considered. It's also inefficient, as we're
duplicating potentially large info (like the `search_docs` output) that
already exists within the trace.

An alternative to scoring spans is to [score
traces](https://www.braintrust.dev/docs/evaluate/custom-code#score-traces).
Braintrust [best
practices](https://www.braintrust.dev/docs/evaluate/score-online#best-practices)
advise:

> Use span scope for evaluating individual operations or outputs. Use
trace scope for evaluating multi-turn conversations, overall workflow
completion, or when your scorer needs access to the full execution
context.

We've also received [direct
guidance](https://supabase.slack.com/archives/C05QYJBLX89/p1777925770927149?thread_ts=1777905716.911979&cid=C05QYJBLX89)
from their team to use this approach.

## Changes

Migrates eval scorers from custom `AssistantEvalOutput` shape to
trace-level scoring via `trace.getThread()` / `trace.getSpans()`, with
thread parsing that scores the full latest Assistant turn and passes
prior conversation separately where relevant.

Moves `execute_sql` and `deploy_edge_function` from client-side
execution after approval to AI SDK `needsApproval` + server-side
`execute()`. SQL results returned to the model are gated by AI opt-in
level, so row data is only included with `schema_and_log_and_data`;
otherwise the tool returns the no-data-permissions sentinel.

Adds `metadata.isFinalStep` to disambiguate multiple LLM requests within
an "assistant" turn due to tool call requests/responses. For online
evals, this means we should configure automations to only score traces
with `metadata.isFinalStep = true` to ensure we're judging the complete
generated response.

Other minor kaizen changes:
- Renamed `promptProviderOptions` to `systemProviderOptions` to clarify
that this is associated with the "system" message and disambiguate from
the root `providerOptions`
- Adds `evals/trace-utils.ts` to handle Zod validation of the `unknown`
span shapes from Braintrust, to more easily access typed inputs/output
on tool spans.
- Bumps AI SDK floor version `^6.0.116` → `^6.0.174`
- Tweaked the "Conciseness" scorer to not unfairly dock points for the
new `[called tool_name]` labels in serialized assistant response

## Verification

In the studio staging build, I asked Assistant to create a todos table
with 3 sample todos. I manually approved the `execute_sql` call and saw
Assistant generate text before & after the call.

In Braintrust I verified two traces were produced (see [filtered
logs](https://www.braintrust.dev/app/supabase.io/p/Assistant/logs?v=Staging&tvt=trace&search={%22filter%22:[{%22text%22:%22metadata.environment%2520%253D%2520%27staging%27%22,%22label%22:%22metadata.environment%2520%253D%2520%27staging%27%22,%22originType%22:%22btql%22},{%22text%22:%22%2560Chat%2520ID%2560%2520%253D%2520%25221cb2ac45-e5e7-458c-9da4-3bf6863b8842%2522%22,%22label%22:%22Chat%2520ID%2520equals%25201cb2ac45-e5e7-458c-9da4-3bf6863b8842%22,%22originType%22:%22form%22}]})),
the first with `metadata.isFinalStep = false` and the second with
`metadata.isFinalStep = true`.

In the Braintrust staging scorers, I ran the preview Completeness scorer
on the second trace and verified it sees the complete Assistant response
including markers for tool calls ([link to
trace](https://www.braintrust.dev/app/supabase.io/p/Assistant%20(Staging%20Scorers)/trace?object_type=project_logs&object_id=b5214b62-ad1e-4929-9d5b-40b1daebe948&r=0ed0a4f8-8aff-4a34-bb1d-1df1d88a5070&s=ff9015f8-6bf7-4ab3-83a9-ca4e69e27e82))

<img width="1193" height="960" alt="CleanShot 2026-05-07 at 11 27 10@2x"
src="https://github.com/user-attachments/assets/509d4858-c3a1-4068-986d-3aa4d5617d1a"
/>

I also tested the `deploy_edge_function` workflow and verified it still
prompts for permission and warns on deployment of existing functions.

**References**
- https://www.braintrust.dev/docs/evaluate/custom-code#score-traces
-
https://ai-sdk.dev/docs/ai-sdk-core/tools-and-tool-calling#tool-execution-approval

Supercedes #45556 and
#45339

Closes AI-473

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **New Features**
* Tool actions (SQL execution, edge-function deploy) now require
explicit user Approve/Deny before proceeding.

* **Improvements**
* Assistant pauses for approval responses before sending follow-ups,
giving clearer control over risky actions.
  * Deploy/replace flows show confirmation and clearer replace warnings.
* Evaluation/scoring updated to use richer trace data for more accurate
assistant performance signals.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Adds `README.md` to `apps/studio/evals` explaining the development
workflow for updating offline and online evals for Studio's AI
Assistant.

Resolves AI-681

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Documentation**
* Added comprehensive documentation for Studio Assistant Evals, covering
evaluation setup, configuration of scoring methods, and deployment
workflows for both offline and online evaluation processes.

[![Review Change
Stack](https://storage.googleapis.com/coderabbit_public_assets/review-stack-in-coderabbit-ui.svg)](https://app.coderabbit.ai/change-stack/supabase/supabase/pull/45840)

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
@pull pull Bot locked and limited conversation to collaborators May 12, 2026
@pull pull Bot added the ⤵️ pull label May 12, 2026
@pull pull Bot merged commit 36152cb into code:master May 12, 2026
1 of 4 checks passed
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant