Skip to content

fix(api-proxy): write placeholder token-usage record when usage extraction fails#5251

Closed
lpcox wants to merge 2 commits into
mainfrom
fix/api-proxy-token-usage-missing-record
Closed

fix(api-proxy): write placeholder token-usage record when usage extraction fails#5251
lpcox wants to merge 2 commits into
mainfrom
fix/api-proxy-token-usage-missing-record

Conversation

@lpcox

@lpcox lpcox commented Jun 18, 2026

Copy link
Copy Markdown
Collaborator

Problem

A successful (2xx) LLM completion response from which no usage payload could be extracted previously produced no line at all in token-usage.jsonl. The hard gate is in finalizeHttpTracking():

const normalized = normalizeUsage(usage);
if (!normalized) {            // no usage extracted from the response
  if (typeof onSpanEnd === 'function') onSpanEnd(proxyRes.statusCode);
  return;                     // ← returns WITHOUT writing any record
}

When this fires, the request becomes invisible to every downstream consumer — step summaries, OTEL fan-out, and AI-credit aggregation. This is the producer-side root cause of the regression where Copilot calls stopped appearing in token-usage logs, as called out in github/gh-aw#40085:

The root cause — api-proxy no longer writing token-usage.jsonl for Copilot calls — requires an upstream fix.

I verified the parsers/injection are correct for all documented payload shapes (chat-completions with stream_options.include_usage, Responses API response.completed/response.done, non-streaming JSON). The only no-usage case is a streaming chat-completions response whose final usage chunk is absent — but whatever the exact upstream trigger, the proxy silently dropping the entire request is the brittle behavior worth fixing.

Fix

For completion-style endpoints (chat/completions, responses, messages, :generateContent, …), when a 2xx response yields no usage, write a placeholder record with zeroed token counts and usage_missing: true:

  • The request stays observable downstream (dedup by request_id, request counts, OTEL spans).
  • The extraction gap is diagnosable in artifacts (currently it leaves no trace), enabling a precise follow-up that recovers the real tokens.
  • The usage_missing flag clearly marks the counts as non-measured, so consumers don't treat them as a genuine zero-cost call.
  • Non-completion traffic (e.g. /models, health checks) is still skipped to avoid noise.
  • Behavior is unchanged when usage is present.

Changes

  • token-parsers.js: add looksLikeCompletionRequest() path heuristic.
  • token-tracker-http.js: add writeMissingUsageRecord(); gate the no-usage branch on it.
  • schemas/token-usage.schema.json: document the usage_missing field.
  • Unit tests for the new helper and the placeholder-record behavior (streaming + non-streaming, completion vs non-completion paths, onSpanEnd still fires).

Testing

  • cd containers/api-proxy && npm test1247 passed (54 suites), +20 new tests.

Relationship to github/gh-aw#40085

That PR hardens the consumer (honor ai_credits_this_response, dedup across both log paths, fix the .jsonl/.json artifact extension). This PR is the producer-side fix it asks for: the api-proxy will now always emit a row per completion request, so a row can never silently vanish again.

lpcox and others added 2 commits June 17, 2026 22:08
Replaces 'sleep 3' with proper health check loops that wait up to 30 seconds
for proxies to be ready before running tests.

Root cause: Squid/Envoy containers were not fully initialized before agent
containers tried to connect, causing spurious test failures.

Changes:
- Squid runc test: Wait for proxy port 3128 to respond
- Squid gVisor test: Wait for proxy port 3128 to respond
- Envoy runc test: Wait for admin /ready endpoint
- Envoy gVisor test: Wait for admin /ready endpoint
- Squid perf test: Wait for proxy port 3128 to respond
- Envoy perf test: Wait for admin /ready endpoint

Each health check:
- Retries for up to 30 seconds
- Uses lightweight curl container for network checks
- Shows container logs on failure
- Exits with error if proxy fails to start

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
A successful (2xx) LLM completion response from which no usage payload
could be extracted previously produced NO line in `token-usage.jsonl` at
all: `finalizeHttpTracking()` early-returned on `!normalizeUsage(usage)`.
This made the request invisible to every downstream consumer (step
summaries, OTEL fan-out, AI-credit aggregation) and is the root cause of
the regression where Copilot calls stopped appearing in token-usage logs
(see github/gh-aw#40085).

Now, for completion-style endpoints (chat/completions, responses,
messages, generateContent, ...), a placeholder record is written with
zeroed token counts and `usage_missing: true` so the request stays
observable and the extraction gap is diagnosable, while clearly marking
the counts as non-measured. Non-completion traffic (e.g. /models, health
checks) is still skipped to avoid noise. Behavior is unchanged when usage
is present.

- Add `looksLikeCompletionRequest()` path heuristic to token-parsers
- Add `writeMissingUsageRecord()` and gate the no-usage branch on it
- Document `usage_missing` in schemas/token-usage.schema.json
- Add unit tests for the new helper and placeholder-record behavior

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 18, 2026 15:24

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@github-actions

Copy link
Copy Markdown
Contributor

✅ Coverage Check Passed

Overall Coverage

Metric Base PR Delta
Lines 97.57% 97.61% 📈 +0.04%
Statements 97.50% 97.54% 📈 +0.04%
Functions 98.84% 98.84% ➡️ +0.00%
Branches 92.95% 92.98% 📈 +0.03%
📁 Per-file Coverage Changes (1 files)
File Lines (Before → After) Statements (Before → After)
src/workdir-setup.ts 92.7% → 94.5% (+1.82%) 92.7% → 94.5% (+1.82%)

Coverage comparison generated by scripts/ci/compare-coverage.ts

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants