Commit d143571

feat(assistant): trace-level scorers + server-side tool execution with needsApproval (supabase#45654)
## Motivation

When Assistant runs a potentially destructive tool like `execute_sql`, it stops the LLM request and prompts for client-side approval and execution of the tool. After approval, a second request kicks off under a separate trace. This has made scoring and [Topics](https://www.braintrust.dev/blog/topics) classification challenging, as the generated `output` is split across stateless requests.

The [span-level scoring](https://www.braintrust.dev/docs/evaluate/custom-code#score-spans) approach we've used thus far (after the LLM call, we massage the result into an `output` payload that's stuck onto the root span) has been cumbersome and has led to invalid scores / topics where only part of the assistant response is considered. It's also inefficient, as we're duplicating potentially large info (like the `search_docs` output) that already exists within the trace.

An alternative to scoring spans is to [score traces](https://www.braintrust.dev/docs/evaluate/custom-code#score-traces). Braintrust [best practices](https://www.braintrust.dev/docs/evaluate/score-online#best-practices) advise:

> Use span scope for evaluating individual operations or outputs. Use trace scope for evaluating multi-turn conversations, overall workflow completion, or when your scorer needs access to the full execution context.

We've also received [direct guidance](https://supabase.slack.com/archives/C05QYJBLX89/p1777925770927149?thread_ts=1777905716.911979&cid=C05QYJBLX89) from their team to use this approach.

## Changes

Migrates eval scorers from the custom `AssistantEvalOutput` shape to trace-level scoring via `trace.getThread()` / `trace.getSpans()`, with thread parsing that scores the full latest Assistant turn and passes prior conversation separately where relevant.

Moves `execute_sql` and `deploy_edge_function` from client-side execution after approval to AI SDK `needsApproval` + server-side `execute()`. SQL results returned to the model are gated by AI opt-in level, so row data is only included with `schema_and_log_and_data`; otherwise the tool returns the no-data-permissions sentinel.

Adds `metadata.isFinalStep` to disambiguate multiple LLM requests within an "assistant" turn due to tool call requests/responses. For online evals, this means we should configure automations to only score traces with `metadata.isFinalStep = true` to ensure we're judging the complete generated response.

Other minor kaizen changes:

- Renamed `promptProviderOptions` to `systemProviderOptions` to clarify that this is associated with the "system" message and disambiguate from the root `providerOptions`
- Adds `evals/trace-utils.ts` to handle Zod validation of the `unknown` span shapes from Braintrust, to more easily access typed inputs/output on tool spans.
- Bumps AI SDK floor version `^6.0.116` → `^6.0.174`
- Tweaked the "Conciseness" scorer to not unfairly dock points for the new `[called tool_name]` labels in the serialized assistant response

## Verification

In the studio staging build, I asked Assistant to create a todos table with 3 sample todos. I manually approved the `execute_sql` call and saw Assistant generate text before & after the call.

In Braintrust I verified two traces were produced (see [filtered logs](https://www.braintrust.dev/app/supabase.io/p/Assistant/logs?v=Staging&tvt=trace&search={%22filter%22:[{%22text%22:%22metadata.environment%2520%253D%2520%27staging%27%22,%22label%22:%22metadata.environment%2520%253D%2520%27staging%27%22,%22originType%22:%22btql%22},{%22text%22:%22%2560Chat%2520ID%2560%2520%253D%2520%25221cb2ac45-e5e7-458c-9da4-3bf6863b8842%2522%22,%22label%22:%22Chat%2520ID%2520equals%25201cb2ac45-e5e7-458c-9da4-3bf6863b8842%22,%22originType%22:%22form%22}]})), the first with `metadata.isFinalStep = false` and the second with `metadata.isFinalStep = true`.

In the Braintrust staging scorers, I ran the preview Completeness scorer on the second trace and verified it sees the complete Assistant response including markers for tool calls ([link to trace](https://www.braintrust.dev/app/supabase.io/p/Assistant%20(Staging%20Scorers)/trace?object_type=project_logs&object_id=b5214b62-ad1e-4929-9d5b-40b1daebe948&r=0ed0a4f8-8aff-4a34-bb1d-1df1d88a5070&s=ff9015f8-6bf7-4ab3-83a9-ca4e69e27e82))

<img width="1193" height="960" alt="CleanShot 2026-05-07 at 11 27 10@2x" src="https://github.com/user-attachments/assets/509d4858-c3a1-4068-986d-3aa4d5617d1a" />

I also tested the `deploy_edge_function` workflow and verified it still prompts for permission and warns on deployment of existing functions.

**References**

- https://www.braintrust.dev/docs/evaluate/custom-code#score-traces
- https://ai-sdk.dev/docs/ai-sdk-core/tools-and-tool-calling#tool-execution-approval

Supersedes supabase#45556 and supabase#45339

Closes AI-473

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **New Features**
  * Tool actions (SQL execution, edge-function deploy) now require explicit user Approve/Deny before proceeding.
* **Improvements**
  * Assistant pauses for approval responses before sending follow-ups, giving clearer control over risky actions.
  * Deploy/replace flows show confirmation and clearer replace warnings.
  * Evaluation/scoring updated to use richer trace data for more accurate assistant performance signals.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
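The opt-in gating of SQL results and the `metadata.isFinalStep` filter described under Changes can be sketched roughly as below. This is a minimal illustration, not the code in this commit: `gateSqlResult`, `selectScorableTraces`, the `NO_DATA_PERMISSIONS` string, and the `TraceLike` shape are hypothetical names; only the `schema_and_log_and_data` level and the `metadata.isFinalStep` field come from the description.

```typescript
// Hypothetical sentinel; the real value returned by the tool is not shown here.
const NO_DATA_PERMISSIONS = 'NO_DATA_PERMISSIONS'

// The only opt-in level the description names as allowing row data.
const DATA_OPT_IN_LEVEL = 'schema_and_log_and_data'

/** Returns row data to the model only when the opt-in level permits it. */
function gateSqlResult(optInLevel: string, rows: unknown[]): unknown[] | string {
  return optInLevel === DATA_OPT_IN_LEVEL ? rows : NO_DATA_PERMISSIONS
}

// Hypothetical, simplified trace shape used only for this sketch.
interface TraceLike {
  id: string
  metadata: { isFinalStep?: boolean }
}

/** Keeps only traces that contain the complete assistant turn. */
function selectScorableTraces(traces: TraceLike[]): TraceLike[] {
  return traces.filter((trace) => trace.metadata.isFinalStep === true)
}
```

In the verification scenario above, the first trace (`isFinalStep = false`) would be skipped and only the second would be scored.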
1 parent 90d383f commit d143571

30 files changed

Lines changed: 1026 additions & 428 deletions

apps/studio/components/ui/AIAssistantPanel/AIAssistant.tsx

Lines changed: 5 additions & 5 deletions
@@ -1,6 +1,6 @@
 import type { UIMessage as MessageType } from '@ai-sdk/react'
 import { useChat } from '@ai-sdk/react'
-import { lastAssistantMessageIsCompleteWithToolCalls } from 'ai'
+import { lastAssistantMessageIsCompleteWithApprovalResponses } from 'ai'
 import { LOCAL_STORAGE_KEYS, useFlag } from 'common'
 import { useParams, useSearchParamsShallow } from 'common/hooks'
 import { AnimatePresence, motion } from 'framer-motion'
@@ -158,15 +158,15 @@ export const AIAssistant = ({ className }: AIAssistantProps) => {
     error,
     sendMessage,
     setMessages,
-    addToolResult,
+    addToolApprovalResponse,
     stop,
     regenerate,
   } = useChat({
     id: snap.activeChatId,
     ...(snap.activeChatId && snap.chatInstances[snap.activeChatId]
       ? { chat: snap.chatInstances[snap.activeChatId] }
       : {}),
-    sendAutomaticallyWhen: lastAssistantMessageIsCompleteWithToolCalls,
+    sendAutomaticallyWhen: lastAssistantMessageIsCompleteWithApprovalResponses,
     onError: onErrorChat,
   })

@@ -281,7 +281,7 @@ export const AIAssistant = ({ className }: AIAssistantProps) => {
           message={message}
           isLoading={chatStatus === 'submitted' || chatStatus === 'streaming'}
           readOnly={message.role === 'user'}
-          addToolResult={addToolResult}
+          addToolApprovalResponse={addToolApprovalResponse}
           onDelete={deleteMessageFromHere}
           onEdit={editMessage}
           isAfterEditedMessage={isAfterEditedMessage}
@@ -300,7 +300,7 @@ export const AIAssistant = ({ className }: AIAssistantProps) => {
     cancelEdit,
     editingMessageId,
     chatStatus,
-    addToolResult,
+    addToolApprovalResponse,
     handleRateMessage,
     messageRatings,
   ]
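The switch to `lastAssistantMessageIsCompleteWithApprovalResponses` means the chat resubmits only once the user has answered the pending approval. The sketch below approximates that kind of predicate and is not the AI SDK's implementation. `shouldAutoSend` and the `*Like` types are hypothetical, and the `'approval-responded'` state name is an assumption based on the AI SDK tool approval docs; the other state names appear in this commit's diffs.

```typescript
// Simplified tool-part states. 'approval-responded' is assumed; the rest
// appear in the DisplayBlockRenderer diff.
type ToolState =
  | 'input-streaming'
  | 'input-available'
  | 'approval-requested'
  | 'approval-responded'
  | 'output-available'
  | 'output-error'

// Hypothetical, pared-down message shapes for illustration only.
interface ToolPartLike {
  type: 'tool'
  state: ToolState
}
interface TextPartLike {
  type: 'text'
  text: string
}
interface MessageLike {
  role: 'user' | 'assistant'
  parts: Array<ToolPartLike | TextPartLike>
}

/** True when the last assistant message has an approval answer and nothing is still pending. */
function shouldAutoSend(messages: MessageLike[]): boolean {
  const last = messages[messages.length - 1]
  if (!last || last.role !== 'assistant') return false

  const toolParts = last.parts.filter((p): p is ToolPartLike => p.type === 'tool')
  const hasResponse = toolParts.some((p) => p.state === 'approval-responded')
  const nonePending = toolParts.every(
    (p) =>
      p.state === 'approval-responded' ||
      p.state === 'output-available' ||
      p.state === 'output-error'
  )
  return hasResponse && nonePending
}
```

Under this sketch a message whose tool part is still in `approval-requested` does not trigger a follow-up request, which matches the "pauses for approval responses" behavior in the release notes.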

apps/studio/components/ui/AIAssistantPanel/DisplayBlockRenderer.tsx

Lines changed: 17 additions & 17 deletions
@@ -1,6 +1,7 @@
 import { acceptUntrustedSql, type UntrustedSqlFragment } from '@supabase/pg-meta'
 import { PermissionAction } from '@supabase/shared-types/out/constants'
 import { useQueryClient } from '@tanstack/react-query'
+import type { ToolUIPart } from 'ai'
 import { useParams } from 'common'
 import { useRouter } from 'next/router'
 import { useRef, useState, type DragEvent, type PropsWithChildren } from 'react'
@@ -31,13 +32,19 @@ interface DisplayBlockRendererProps {
     yAxis?: string
   }
   initialResults?: unknown
-  onResults?: (args: { messageId: string; results: unknown }) => void
+  /** Called when locally running SQL fails before or during client-side execution. */
   onError?: (args: { messageId: string; errorText: string }) => void
-  toolState?: 'input-streaming' | 'input-available' | 'output-available' | 'output-error'
+  /** Responds affirmatively to an AI SDK tool approval request; does not run SQL directly. */
+  onApprove?: () => void
+  /** Responds negatively to an AI SDK tool approval request; does not run SQL directly. */
+  onDeny?: () => void
+  /** AI SDK tool state used to show approval UI for pending tool calls. */
+  toolState?: ToolUIPart['state']
   isLastPart?: boolean
   isLastMessage?: boolean
   showConfirmFooter?: boolean
   onChartConfigChange?: (chartConfig: ChartConfig) => void
+  /** Called when the user clicks the query block play button to run SQL locally. */
   onQueryRun?: (queryType: 'select' | 'mutation') => void
 }

@@ -46,8 +53,9 @@ export const DisplayBlockRenderer = ({
   toolCallId,
   initialArgs,
   initialResults,
-  onResults,
   onError,
+  onApprove,
+  onDeny,
   toolState,
   isLastPart = false,
   isLastMessage = false,
@@ -169,10 +177,6 @@ export const DisplayBlockRenderer = ({
     onSuccess: (data) => {
       setRows(Array.isArray(data.result) ? data.result : undefined)
       setIsWriteQuery(queryType === 'mutation' || initialArgs.isWriteQuery || false)
-      onResults?.({
-        messageId,
-        results: Array.isArray(data.result) ? data.result : undefined,
-      })
       if (queryType === 'mutation') {
         queryClient.invalidateQueries({ queryKey: lintKeys.lint(ref) })
         queryClient.invalidateQueries({ queryKey: entityTypeKeys.list(ref) })
@@ -219,13 +223,13 @@ export const DisplayBlockRenderer = ({
     )
   }

-  const resolvedHasDecision = initialResults !== undefined || rows !== undefined
   const shouldShowConfirmFooter =
     showConfirmFooter &&
-    !resolvedHasDecision &&
-    toolState === 'input-available' &&
+    toolState === 'approval-requested' &&
     isLastPart &&
-    isLastMessage
+    isLastMessage &&
+    !!onApprove &&
+    !!onDeny

   return (
     <div className="display-block w-auto overflow-x-hidden">
@@ -252,12 +256,8 @@ export const DisplayBlockRenderer = ({
           cancelLabel="Skip"
           confirmLabel={executeSqlLoading ? 'Running...' : 'Run Query'}
           isLoading={executeSqlLoading}
-          onCancel={async () => {
-            onResults?.({ messageId, results: 'User skipped running the query' })
-          }}
-          onConfirm={() => {
-            handleExecute(isWriteQuery ? 'mutation' : 'select')
-          }}
+          onCancel={onDeny}
+          onConfirm={onApprove}
         />
       </div>
     )}
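The new footer condition in the hunk above can be read as a pure predicate. The sketch below lifts it into a standalone function for illustration; the component computes it inline, and `ConfirmFooterInputs` is a hypothetical type, with `toolState` widened to `string` to avoid depending on the `ai` package.

```typescript
// Hypothetical input shape mirroring the props used by the inline condition.
interface ConfirmFooterInputs {
  showConfirmFooter: boolean
  toolState?: string
  isLastPart: boolean
  isLastMessage: boolean
  onApprove?: () => void
  onDeny?: () => void
}

/**
 * The footer is shown only for the most recent tool part that is waiting on
 * an approval decision, and only when both handlers are wired up.
 */
function shouldShowConfirmFooter(inputs: ConfirmFooterInputs): boolean {
  return (
    inputs.showConfirmFooter &&
    inputs.toolState === 'approval-requested' &&
    inputs.isLastPart &&
    inputs.isLastMessage &&
    !!inputs.onApprove &&
    !!inputs.onDeny
  )
}
```

Compared with the removed `resolvedHasDecision` check, the decision state now comes from the AI SDK tool state rather than from locally held results.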
Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
+import { screen } from '@testing-library/react'
+import userEvent from '@testing-library/user-event'
+import { beforeEach, describe, expect, it, vi } from 'vitest'
+
+import { EdgeFunctionRenderer } from './EdgeFunctionRenderer'
+import { render } from '@/tests/helpers'
+
+const {
+  mockSendEvent,
+  mockUseEdgeFunctionQuery,
+  mockUseParams,
+  mockUseProjectSettingsV2Query,
+  mockUseSelectedOrganizationQuery,
+} = vi.hoisted(() => ({
+  mockSendEvent: vi.fn(),
+  mockUseEdgeFunctionQuery: vi.fn(),
+  mockUseParams: vi.fn(),
+  mockUseProjectSettingsV2Query: vi.fn(),
+  mockUseSelectedOrganizationQuery: vi.fn(),
+}))
+
+vi.mock('common', async () => {
+  const actual = await vi.importActual<typeof import('common')>('common')
+
+  return {
+    ...actual,
+    useParams: mockUseParams,
+  }
+})
+
+vi.mock('@/data/config/project-settings-v2-query', () => ({
+  useProjectSettingsV2Query: mockUseProjectSettingsV2Query,
+}))
+
+vi.mock('@/data/edge-functions/edge-function-query', () => ({
+  useEdgeFunctionQuery: mockUseEdgeFunctionQuery,
+}))
+
+vi.mock('@/data/telemetry/send-event-mutation', () => ({
+  useSendEventMutation: () => ({ mutate: mockSendEvent }),
+}))
+
+vi.mock('@/hooks/misc/useSelectedOrganization', () => ({
+  useSelectedOrganizationQuery: mockUseSelectedOrganizationQuery,
+}))
+
+vi.mock('../EdgeFunctionBlock/EdgeFunctionBlock', () => ({
+  EdgeFunctionBlock: ({
+    showReplaceWarning,
+    onCancelReplace,
+    onConfirmReplace,
+  }: {
+    showReplaceWarning?: boolean
+    onCancelReplace?: () => void
+    onConfirmReplace?: () => void
+  }) => (
+    <div>
+      {showReplaceWarning && (
+        <div>
+          <p>An edge function with this name already exists.</p>
+          <button onClick={onCancelReplace}>Cancel</button>
+          <button onClick={onConfirmReplace}>Replace function</button>
+        </div>
+      )}
+    </div>
+  ),
+}))
+
+vi.mock('./ConfirmFooter', () => ({
+  ConfirmFooter: ({
+    confirmLabel,
+    onConfirm,
+  }: {
+    confirmLabel?: string
+    onConfirm?: () => void
+  }) => <button onClick={onConfirm}>{confirmLabel ?? 'Confirm'}</button>,
+}))
+
+describe('EdgeFunctionRenderer', () => {
+  beforeEach(() => {
+    mockSendEvent.mockReset()
+    mockUseEdgeFunctionQuery.mockReset()
+    mockUseParams.mockReturnValue({ ref: 'project-ref' })
+    mockUseProjectSettingsV2Query.mockReturnValue({ data: undefined })
+    mockUseSelectedOrganizationQuery.mockReturnValue({ data: { slug: 'org-slug' } })
+  })
+
+  it('only deploys an existing function from the replace warning confirmation', async () => {
+    const user = userEvent.setup()
+    const onApprove = vi.fn()
+
+    mockUseEdgeFunctionQuery.mockReturnValue({ data: { slug: 'hello-world' } })
+
+    render(
+      <EdgeFunctionRenderer
+        label="Deploy Edge Function"
+        code="Deno.serve(() => new Response('ok'))"
+        functionName="hello-world"
+        onApprove={onApprove}
+      />
+    )
+
+    await user.click(screen.getByRole('button', { name: 'Deploy' }))
+    expect(screen.getByText('An edge function with this name already exists.')).toBeInTheDocument()
+    expect(onApprove).not.toHaveBeenCalled()
+
+    await user.click(screen.getByRole('button', { name: 'Deploy' }))
+    expect(onApprove).not.toHaveBeenCalled()
+    expect(mockSendEvent).not.toHaveBeenCalled()
+
+    await user.click(screen.getByRole('button', { name: 'Replace function' }))
+    expect(onApprove).toHaveBeenCalledTimes(1)
+    expect(mockSendEvent).toHaveBeenCalledTimes(1)
+  })
+
+  it('deploys immediately when no existing function is found', async () => {
+    const user = userEvent.setup()
+    const onApprove = vi.fn()
+
+    mockUseEdgeFunctionQuery.mockReturnValue({ data: undefined })
+
+    render(
+      <EdgeFunctionRenderer
+        label="Deploy Edge Function"
+        code="Deno.serve(() => new Response('ok'))"
+        functionName="hello-world"
+        onApprove={onApprove}
+      />
+    )
+
+    await user.click(screen.getByRole('button', { name: 'Deploy' }))
+
+    expect(onApprove).toHaveBeenCalledTimes(1)
+    expect(mockSendEvent).toHaveBeenCalledTimes(1)
+  })
+})
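The behavior these two tests pin down can be modeled as a small state machine. The sketch below is an assumption about the flow inferred from the assertions, not the `EdgeFunctionRenderer` implementation (which is not shown on this page); `createDeployFlow` and its method names are hypothetical.

```typescript
/**
 * Hypothetical model of the deploy approval flow:
 * - clicking Deploy on a new function approves immediately;
 * - clicking Deploy on an existing function only raises the replace warning;
 * - approval for an existing function happens only via the replace confirmation.
 */
function createDeployFlow(functionExists: boolean, onApprove: () => void) {
  let showReplaceWarning = false

  return {
    get showReplaceWarning(): boolean {
      return showReplaceWarning
    },
    clickDeploy(): void {
      if (functionExists) {
        // Never approve directly when a function with this name already exists.
        showReplaceWarning = true
        return
      }
      onApprove()
    },
    confirmReplace(): void {
      if (!showReplaceWarning) return
      showReplaceWarning = false
      onApprove()
    },
    cancelReplace(): void {
      showReplaceWarning = false
    },
  }
}
```

Repeated Deploy clicks while the warning is visible leave the approval count unchanged, which is the second assertion group in the first test.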
