diff --git a/docs/plans/README.md b/docs/plans/README.md
index e17612d..c640ad3 100644
--- a/docs/plans/README.md
+++ b/docs/plans/README.md
@@ -14,8 +14,8 @@ choices, immutable log).
   which tools are actually attached. The demand-side counterpart to `mcp.md`.
 - [Compare two runs side-by-side](compare-runs.md) — pick any two runs and view
   them in a split layout that surfaces what diverged.
-- [Evals](evals.md) — ingestion shape for eval results plus the open
-  questions on data model and UI.
+- [Evaluation](evaluation.md) — scores, human annotations, datasets, and an
+  in-app LLM-judge runner, built emitter-agnostic on OTel `gen_ai.evaluation.*`.
 - [HTTP API for LLM debugging](http-api.md) — expose loupe's classification /
   reconstruction / aggregation views over plain endpoints so an LLM-driven dev
   tool can pull run data while a developer is debugging.
diff --git a/docs/plans/compare-runs.md b/docs/plans/compare-runs.md
index 23686a1..88738c4 100644
--- a/docs/plans/compare-runs.md
+++ b/docs/plans/compare-runs.md
@@ -57,7 +57,7 @@ Goal: pick any two runs (same agent, different agents, regression vs main, etc.)
 - [ ] Permalink that captures view mode + selection.
 - [ ] "Pick a third run" → N-way compare (probably never; 2 covers 95% of regression workflows).
 - [ ] Diff rules editor: ignore timing drift under X ms, ignore certain attribute keys.
-- [ ] Save a comparison as a "baseline" — ties into the evals comparison story in `docs/plans/evals.md`.
+- [ ] Save a comparison as a "baseline" — ties into the evals comparison story in `docs/plans/evaluation.md`.
 
 ## Non-goals (v1)
 
diff --git a/docs/plans/evals.md b/docs/plans/evals.md
deleted file mode 100644
index 2bd4e3c..0000000
--- a/docs/plans/evals.md
+++ /dev/null
@@ -1,157 +0,0 @@
-# Evals — Feature Plan
-
-Status: draft. Ingestion shape is roughly settled (push + OTel + drop + manual). The questions below need answers before we lock the data model and UI.
-
----
-
-## Open questions (decide before building)
-
-### 1. Presentation & organization
-
-How does a user think about "evals" inside loupe? Pick the mental model first; the schema falls out of it.
-
-- **Flat list of runs, tagged?** Every ingested `ScenarioRunResult` is just another timestamped row; we filter by tag (`name=qa-bot-regression`, `env=ci`, `git_sha=…`). Cheapest. No "definition" concept at all — the eval *is* the set of runs that share a name.
-- **Definition + runs (two levels)?** A named eval definition (durable card) has a stream of runs underneath. Matches how Foundry / MEAI users think. Slightly more schema, much better landing page.
-- **Suites (three levels)?** Suite → definition → run. Probably premature; revisit when a customer asks.
-
-Sub-questions:
-- Do eval definitions live per-project, or are they global with a project filter?
-- Are definitions explicit (user creates one) or implicit (first ingest with a new `name` auto-creates)? Implicit is friendlier; explicit is tidier.
-- Where do evals appear in the nav — sibling to `/runs`, or nested under a run? Probably sibling; an eval can cover many runs.
-
-### 2. Storage — can we be minimalist (or skip our own store entirely)?
-
-Three options, increasing in commitment:
-
-- **A. No DB.** Eval results live as OTel logs in OpenObserve. Every list / detail page is a query. Pros: zero new tables, free trace linkage, single source of truth. Cons: comparison/aggregation across many runs is slow and awkward in OpenObserve; pass-rate-over-time charts need expensive scans; no good place to store user-authored metadata (notes, baselines).
-- **B. Thin index.** Tiny relational table (`eval_runs`: id, definition_name, external_id, started_at, summary_json, trace_id). Detail rows stay in OpenObserve and are fetched on demand. Pros: fast list/compare on summary metrics; detail stays "free." Cons: two stores to keep in sync; what happens when OpenObserve retention drops old detail?
-- **C. Full mirror.** Both `eval_runs` and `eval_run_results` in our DB. Pros: every query is fast, retention is ours to set, comparison is trivial. Cons: duplicate storage; we now own a real data pipeline.
-
-Decision driver: **how often do users compare across >10 runs?** Rare → A. Common → B or C. The OpenObserve query budget per page load is the real constraint.
-
-Worth a 1-day spike: build option A end-to-end with synthetic data; see if listing 200 runs and a 30-day pass-rate chart is acceptable.
-
-### 3. History & comparison
-
-The comparison primitives we need to support (pick which are v1):
-
-- **Run vs run on the same definition** — "did the latest CI run regress?" Requires stable identity for *test cases* across runs (i.e., `scenario_name` + `iteration` is the key).
-- **Pass rate over time** — line chart of % passed for a definition over the last N runs / days.
-- **Per-row diff** — given two runs, show which cases flipped pass↔fail and which metric values moved.
-- **Baseline / "blessed" run** — pin one run as the baseline; all later runs are diffed against it.
-- **Bisect by `git_sha`** — only useful if the user tags runs with their commit SHA on ingest. Cheap to support if we just store the tag.
-- **Cross-definition comparison** — same dataset, two models. Probably v2.
-
-Stable case identity is the single most important schema decision: if `(definition_name, scenario_name, iteration)` isn't stable across runs, none of the above work. Ingest should refuse runs that change this shape silently, or surface the drift loudly.
-
-Retention: if we go option A or B above, OpenObserve retention dictates how far back comparison reaches. Need to decide: do we keep summaries forever in our DB even if detail rolls off?
-
-### 4. Triggering existing evals against captured sessions
-
-Assume sessions from other users are already visible to us (per `docs/plans/sessions.md`). Can a user pick a saved eval definition and "run it" against an incoming session — without breaking the ingest-only stance?
-
-Tentative compromise: **we orchestrate, we don't evaluate.**
-
-1. User clicks "Run eval X against session Y" in the UI.
-2. loupe POSTs the session messages + eval criteria to a **user-registered evaluator endpoint** (an HTTP webhook the user owns — could be their CI, a Lambda, an LLM-judge service).
-3. The evaluator runs wherever the user wants and POSTs results back to the existing `/api/evals/ingest`, tagged with the source session id.
-4. The result row links to both the eval definition and the originating session.
-
-What changes vs. today:
-- **Eval definition** gains an optional `evaluator_endpoint_url` (or webhook ref). Without it, the definition remains ingest-only as today — a card that displays results, not one that fires them.
-- **`eval_runs`** gains an optional `triggered_against_session_id` (and/or `triggered_against_run_id` for finer granularity).
-- **Non-goal still holds**: we don't host evaluator code. The endpoint is the user's; we just hand off and wait.
-
-Sub-questions:
-- Input granularity — whole session, or one specific run inside it? Probably let the eval definition declare which it wants.
-- Async by default — evaluators may take minutes (LLM judge). UI shows a `pending` run row; result lands when the evaluator POSTs back.
-- Auto-trigger on new matching session, or manual only? Manual first. Auto-eval (e.g., "score every new session matching filter F with eval E") is a v2 once the manual flow works.
-- Authorization — when the registered evaluator is external, we sign outbound requests so the evaluator can verify the source.
-
----
-
-## Feature overview
-
-Users run evals on their agents (MEAI in .NET, `agent_framework` in Python, or custom). loupe collects the results and shows them on an eval page with history and trace linkage. No outbound calls to Microsoft, no Foundry dependency.
-
-## Ingestion — four paths, one landing zone
-
-All paths normalize to the same internal shape, so the UI doesn't care which path produced a row.
-
-### Path 1 — Direct push (default, easiest)
-
-```
-POST /api/evals/ingest
-Authorization: Bearer
-Body: ScenarioRunResult JSON (single or batched)
-```
-
-Idempotent on `(project_id, definition_name, run.external_id)`. Callers: a `loupe-upload` CLI, a GitHub Action, or a tiny `@loupe/evals` SDK they import in their test setup.
-
-### Path 2 — OTel piggyback (for users who already ship OTel to OpenObserve)
-
-When this lands, the `eval.*` attrs below should be declared in [`../explanation/02-spec.md`](../explanation/02-spec.md) alongside the existing `gen_ai.*` and `task.*` sets — loupe's convention spec is the canonical home for "what attrs loupe reads", and eval attrs are a natural spec extension. Cross-link both directions when the spec gets a new section.
-
-Ship a small MEAI `IEvaluationResultStore` / Python equivalent that emits each result as an **OTel log record** with a known attribute schema:
-
-```
-eval.run.external_id = ""
-eval.definition.name = "qa-bot-regression"
-eval.scenario = "..."
-eval.iteration = "..."
-eval.metric.name = "Relevance"
-eval.metric.value = 0.82
-eval.metric.passed = true
-eval.metric.reason = "..."
-```
-
-The log inherits the parent agent span's trace context → trace linkage is free. loupe queries OpenObserve for `event.name = "loupe.eval"` on a cron (or lazily) and materializes into the same tables as Path 1.
-
-### Path 3 — Object-storage drop (no OTel, no outbound HTTP from CI)
-
-User writes results to a blob container they own (MEAI's `AzureStorageReportingConfiguration` does this natively). They grant us read-only creds. Worker lists new objects every minute, ingests, marks done.
-
-### Path 4 — Manual upload in UI
-
-Drag-drop a folder or zip of `ScenarioRunResult` JSONs onto the eval page. Reuses the ingest validator. For air-gapped / one-off use, and as the "try it without writing code" path.
-
-## Data model (placeholder — depends on storage decision above)
-
-If we land on option B (thin index):
-
-```
-eval_definitions (id, project_id, name, created_at, baseline_run_id?)
-eval_runs (id, definition_id, external_id, status,
- started_at, ended_at, summary jsonb,
- git_sha?, env?, trace_id?)
-eval_run_results — kept in OpenObserve; fetched on detail page
-```
-
-Indexes: `(definition_id, started_at desc)` for history; `(project_id, name)` for upsert.
-
-## UI
-
-- `/evals` — list of definitions, last-run badge, pass-rate sparkline.
-- `/evals/$id` — definition header + run history table + pass-rate chart + "compare to baseline" toggle.
-- `/evals/$id/runs/$rid` — per-case table, filter to failed, click row → trace in OpenObserve.
-- `/evals/$id/compare?a=…&b=…` — per-case diff between two runs.
-- On existing `/runs/$runId`: side panel "Evaluated by: …" linking out.
-
-## Build order
-
-1. Decide the three open questions above (½ day spike on storage option A).
-2. Ingest endpoint + minimal tables + idempotency.
-3. `/evals` list and `/evals/$id` history page on synthetic data.
-4. Manual upload (Path 4) so we can dogfood without writing the CLI.
-5. CLI / GH Action wrapper (Path 1).
-6. Per-run detail page with trace linkage.
-7. Comparison view + baseline pinning.
-8. OTel piggyback shim (Path 2) — only after a real .NET user appears.
-9. Object-storage drop (Path 3) — only on request.
-
-## Non-goals (v1)
-
-- Authoring eval definitions inside loupe (we receive, we don't define).
-- Running evals ourselves / hosting evaluator LLMs.
-- Foundry integration (skipped per current direction).
-- Real-time streaming of a long-running eval. Batch on completion is fine.
diff --git a/src/components/app-sidebar.tsx b/src/components/app-sidebar.tsx index f56d959..d4f468a 100644 --- a/src/components/app-sidebar.tsx +++ b/src/components/app-sidebar.tsx @@ -1,5 +1,4 @@ import { - InboxIcon, KeyboardIcon, Logout01Icon, Moon01Icon, @@ -41,9 +40,9 @@ import { SidebarMenuButton, SidebarMenuItem, } from '#/components/ui/sidebar' +import { useChangelogUnseen } from '#/hooks/use-changelog-unseen' import { useUser, useUserId } from '#/hooks/use-user' import { DEFAULT } from '#/lib/time-range' -import { inboxUnreadCountQuery } from '#/routes/inbox/-data' import { currentUserSessionsQuery } from '#/routes/sessions/-data' const APP_VERSION = `v${__APP_VERSION__}` @@ -54,7 +53,7 @@ const WORKBENCH_NAV = NAV_ITEMS.filter((n) => n.group === 'workbench') export function AppSidebar() { const pathname = useRouterState({ select: (s) => s.location.pathname }) const [settingsOpen, setSettingsOpen] = useState(false) - const { data: unreadCount = 0 } = useQuery(inboxUnreadCountQuery()) + const changelogUnseen = useChangelogUnseen(__APP_VERSION__) const [userId] = useUserId() const { data: recentData } = useQuery(currentUserSessionsQuery(DEFAULT, userId)) const recentSessions = recentData?.sessions ?? [] @@ -138,22 +137,18 @@ export function AppSidebar() { - - - - - {unreadCount > 0 && ( - - - - {unreadCount > 99 ? 
'99+' : unreadCount} - - - )} - - Inbox + + + + Changelog + {changelogUnseen && ( + + + New release + + )} @@ -251,12 +246,6 @@ function NavUser() { Keyboard shortcuts - - - - Changelog - - diff --git a/src/components/inspect/view-bar.tsx b/src/components/inspect/view-bar.tsx index 2019b70..4f53ad0 100644 --- a/src/components/inspect/view-bar.tsx +++ b/src/components/inspect/view-bar.tsx @@ -7,7 +7,6 @@ import { INSPECT_AUTO_REFRESH_OPTIONS, } from '#/components/auto-refresh-select' import { IconTabs } from '#/components/icon-tabs' -import { Separator } from '#/components/ui/separator' import { Toggle } from '#/components/ui/toggle' import { Tooltip, TooltipContent, TooltipTrigger } from '#/components/ui/tooltip' @@ -73,7 +72,6 @@ export function InspectViewBar({ )} - {showRawAll && (hasRefreshGroup || extras != null) && } {hasRefreshGroup && ( ('unread') + const { data: count = 0 } = useQuery(inboxUnreadCountQuery()) + const { data: unread = [] } = useQuery(inboxQuery()) + const { data: all = [] } = useQuery(recentInboxQuery()) + + const invalidate = () => + Promise.all([ + queryClient.invalidateQueries({ queryKey: queryKeys.inbox.all() }), + queryClient.invalidateQueries({ queryKey: queryKeys.inbox.recent() }), + queryClient.invalidateQueries({ queryKey: queryKeys.inbox.unreadCount() }), + queryClient.invalidateQueries({ queryKey: queryKeys.home.all() }), + ]) + const markRead = useMutation({ mutationFn: () => markAllInboxReadFn(), onSuccess: invalidate }) + + const items = tab === 'unread' ? unread : all + const unreadIds = new Set(unread.map((i) => i.id)) + + return ( + + + + + +
+ Notifications + {count > 0 && ( + + )} +
+ +
+ setTab(v as 'unread' | 'all')}> + + + Unread + + + All + + + +
+ + {items.length === 0 ? ( +
+ {tab === 'unread' ? "You're all caught up." : 'No notifications.'} +
+ ) : ( +
    + {items.map((item) => ( + + ))} +
+ )} + +
+ +
+
+
+ ) +} + +function NotificationItem({ item, unread }: { item: InboxRow; unread: boolean }) { + const inner = ( +
+ +
+

+ {item.summary} +

+ +
+
+ ) + + const to = item.sessionId + ? { + to: '/sessions/$sessionId' as const, + params: { sessionId: item.sessionId }, + search: { range: 1, view: 'conversation' as const }, + } + : item.traceId + ? { to: '/traces/$traceId' as const, params: { traceId: item.traceId } } + : null + + if (!to) return
  • {inner}
  • + return ( +
  • + + {inner} + +
  • + ) +} diff --git a/src/components/site-header.tsx b/src/components/site-header.tsx index 51aab9b..96a8f30 100644 --- a/src/components/site-header.tsx +++ b/src/components/site-header.tsx @@ -1,5 +1,6 @@ import type { ReactNode } from 'react' import { CommandPaletteTrigger } from '#/components/command-palette' +import { NotificationBell } from '#/components/notification-bell' import { Separator } from '#/components/ui/separator' import { SidebarTrigger } from '#/components/ui/sidebar' @@ -13,6 +14,7 @@ export function SiteHeader({ title, actions }: { title: ReactNode; actions?: Rea
    {actions} +
diff --git a/src/hooks/use-changelog-unseen.ts b/src/hooks/use-changelog-unseen.ts
new file mode 100644
index 0000000..1943788
--- /dev/null
+++ b/src/hooks/use-changelog-unseen.ts
@@ -0,0 +1,10 @@
+import { useCallback, useSyncExternalStore } from 'react'
+import { getChangelogLastSeen, subscribeChangelogSeen } from '#/lib/changelog-seen'
+
+// Server snapshot reports the current version ("seen") so the dot never renders
+// during SSR — it appears only after the client reads an older localStorage value.
+export function useChangelogUnseen(currentVersion: string): boolean {
+  const getServerSnapshot = useCallback(() => currentVersion, [currentVersion])
+  const lastSeen = useSyncExternalStore(subscribeChangelogSeen, getChangelogLastSeen, getServerSnapshot)
+  return lastSeen !== currentVersion
+}
diff --git a/src/lib/changelog-seen.ts b/src/lib/changelog-seen.ts
new file mode 100644
index 0000000..3bb6d52
--- /dev/null
+++ b/src/lib/changelog-seen.ts
@@ -0,0 +1,27 @@
+// Last changelog version the user opened, as an external store so the sidebar dot
+// and changelog page stay in sync within the tab and across tabs.
+
+const STORAGE_KEY = 'changelog-last-seen-version'
+const EVENT = 'changelog-seen-change'
+
+export function getChangelogLastSeen(): string | null {
+  if (typeof window === 'undefined') return null
+  return window.localStorage.getItem(STORAGE_KEY)
+}
+
+export function markChangelogSeen(version: string): void {
+  if (typeof window === 'undefined') return
+  if (window.localStorage.getItem(STORAGE_KEY) === version) return
+  window.localStorage.setItem(STORAGE_KEY, version)
+  window.dispatchEvent(new Event(EVENT))
+}
+
+export function subscribeChangelogSeen(onChange: () => void): () => void {
+  if (typeof window === 'undefined') return () => {}
+  window.addEventListener(EVENT, onChange) // same-tab writes
+  window.addEventListener('storage', onChange) // other tabs
+  return () => {
+    window.removeEventListener(EVENT, onChange)
+    window.removeEventListener('storage', onChange)
+  }
+}
diff --git a/src/lib/query-keys.ts b/src/lib/query-keys.ts
index f59cf93..8652f92 100644
--- a/src/lib/query-keys.ts
+++ b/src/lib/query-keys.ts
@@ -18,6 +18,7 @@ export const queryKeys = {
   },
   inbox: {
     all: () => ['inbox'] as const,
+    recent: () => ['inbox', 'recent'] as const,
     unreadCount: () => ['inbox', 'unread-count'] as const,
   },
   home: {
diff --git a/src/routes/changelog/index.tsx b/src/routes/changelog/index.tsx
index 50824cc..ff73b17 100644
--- a/src/routes/changelog/index.tsx
+++ b/src/routes/changelog/index.tsx
@@ -1,10 +1,12 @@
 import { ArrowUpRight01Icon } from '@hugeicons/core-free-icons'
 import { HugeiconsIcon } from '@hugeicons/react'
 import { createFileRoute } from '@tanstack/react-router'
+import { useEffect } from 'react'
 import { Markdown } from '#/components/markdown'
 import { Page } from '#/components/page'
 import { Badge } from '#/components/ui/badge'
 import { Card, CardContent, CardHeader, CardTitle } from '#/components/ui/card'
+import { markChangelogSeen } from '#/lib/changelog-seen'
 import changelogRaw from '../../../CHANGELOG.md?raw'
 import { type ChangelogVersion, parseChangelog } from './-changelog-data'
 
@@ -16,6 +18,9 @@ export const Route = createFileRoute('/changelog/')({
 })
 
 function ChangelogPage() {
+  useEffect(() => {
+    markChangelogSeen(APP_VERSION)
+  }, [])
   return (
diff --git a/src/routes/inbox/-data.ts b/src/routes/inbox/-data.ts
index ce42c6e..2f5e946 100644
--- a/src/routes/inbox/-data.ts
+++ b/src/routes/inbox/-data.ts
@@ -1,12 +1,25 @@
 import { queryOptions } from '@tanstack/react-query'
 import { createServerFn } from '@tanstack/react-start'
 import { queryKeys, STALE_TELEMETRY_MS } from '#/lib/query-keys'
-import { countOpenInboxItems, dismissInboxItem, listOpenInboxItems, snoozeInboxItem } from '#/server/inbox'
+import {
+  countOpenInboxItems,
+  dismissInboxItem,
+  listOpenInboxItems,
+  listRecentInboxItems,
+  markAllInboxRead,
+  snoozeInboxItem,
+} from '#/server/inbox'
 
 const fetchInbox = createServerFn({ method: 'GET' }).handler(() => listOpenInboxItems())
 
+const fetchRecentInbox = createServerFn({ method: 'GET' }).handler(() => listRecentInboxItems())
+
 const fetchInboxUnreadCount = createServerFn({ method: 'GET' }).handler(() => countOpenInboxItems())
 
+export const markAllInboxReadFn = createServerFn({ method: 'POST' }).handler(async () => {
+  await markAllInboxRead()
+})
+
 export const dismissInboxItemFn = createServerFn({ method: 'POST' })
   .inputValidator((id: number) => id)
   .handler(async ({ data }) => {
@@ -27,6 +40,14 @@ export const inboxQuery = () =>
     refetchInterval: STALE_TELEMETRY_MS,
   })
 
+export const recentInboxQuery = () =>
+  queryOptions({
+    queryKey: queryKeys.inbox.recent(),
+    queryFn: () => fetchRecentInbox(),
+    staleTime: STALE_TELEMETRY_MS,
+    refetchInterval: STALE_TELEMETRY_MS,
+  })
+
 export const inboxUnreadCountQuery = () =>
   queryOptions({
     queryKey: queryKeys.inbox.unreadCount(),
diff --git a/src/server/inbox.ts b/src/server/inbox.ts
index 38f774f..e32cc7b 100644
--- a/src/server/inbox.ts
+++ b/src/server/inbox.ts
@@ -44,10 +44,31 @@ export async function countOpenInboxItems(): Promise<number> {
   return row?.value ?? 0
 }
 
+// Recent items regardless of dismissal (drives the "All" tab); future-snoozed stay hidden.
+export async function listRecentInboxItems(limit = 100): Promise<InboxRow[]> {
+  const now = new Date()
+  const rows = await db
+    .select()
+    .from(inboxItems)
+    .where(or(isNull(inboxItems.snoozeUntil), lte(inboxItems.snoozeUntil, now)))
+    .orderBy(desc(inboxItems.firedAt))
+    .limit(limit)
+
+  return rows.map(toInboxRow)
+}
+
 export async function dismissInboxItem(id: number): Promise<void> {
   await db.update(inboxItems).set({ dismissedAt: new Date() }).where(eq(inboxItems.id, id))
 }
 
+export async function markAllInboxRead(): Promise<void> {
+  const now = new Date()
+  await db
+    .update(inboxItems)
+    .set({ dismissedAt: now })
+    .where(and(isNull(inboxItems.dismissedAt), or(isNull(inboxItems.snoozeUntil), lte(inboxItems.snoozeUntil, now))))
+}
+
 export async function snoozeInboxItem(id: number, until: Date): Promise<void> {
   await db.update(inboxItems).set({ snoozeUntil: until }).where(eq(inboxItems.id, id))
 }
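
Review note: the store in `src/lib/changelog-seen.ts` reads `window` directly, so its "write once, notify subscribers" behaviour is awkward to exercise outside a browser. Below is a minimal sketch of the same subscribe / snapshot / write pattern with the storage injected. `KeyValueStorage`, `MemoryStorage`, `createSeenStore`, `isUnseen`, and the version strings are hypothetical names chosen for illustration and are not part of this change. The sketch also leaves out the cross-tab `storage` event.

```typescript
// Hypothetical sketch, not code from this diff: an injectable variant of the
// "last seen changelog version" external store.
type Listener = () => void

interface KeyValueStorage {
  getItem(key: string): string | null
  setItem(key: string, value: string): void
}

// In-memory stand-in for window.localStorage (assumption: only these two methods are needed).
class MemoryStorage implements KeyValueStorage {
  private data = new Map<string, string>()
  getItem(key: string): string | null {
    const value = this.data.get(key)
    return value === undefined ? null : value
  }
  setItem(key: string, value: string): void {
    this.data.set(key, value)
  }
}

function createSeenStore(storage: KeyValueStorage, key = 'changelog-last-seen-version') {
  const listeners = new Set<Listener>()
  return {
    // Snapshot read, the role getChangelogLastSeen plays in the diff.
    getSnapshot: (): string | null => storage.getItem(key),
    // Returns an unsubscribe function, the shape useSyncExternalStore expects.
    subscribe(onChange: Listener): () => void {
      listeners.add(onChange)
      return () => {
        listeners.delete(onChange)
      }
    },
    // Skip the write and the notification when the value is unchanged.
    markSeen(version: string): void {
      if (storage.getItem(key) === version) return
      storage.setItem(key, version)
      listeners.forEach((listener) => listener())
    },
  }
}

// Same comparison the hook performs: anything other than the current version counts as unseen.
function isUnseen(lastSeen: string | null, currentVersion: string): boolean {
  return lastSeen !== currentVersion
}

// Usage with made-up version strings.
const store = createSeenStore(new MemoryStorage())
let notifications = 0
const unsubscribe = store.subscribe(() => {
  notifications += 1
})
const unseenBeforeVisit = isUnseen(store.getSnapshot(), '1.4.0')
store.markSeen('1.4.0')
store.markSeen('1.4.0') // identical write: no second notification
const unseenAfterVisit = isUnseen(store.getSnapshot(), '1.4.0')
```

Injecting the storage keeps the early-return guard and the notification fan-out testable in plain Node. Whether that is worth the extra indirection for a ten-line module is a judgment call for the author.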