feat: token-receipt accuracy rework + 2026-05-09 follow-ups#2
Open
kingdoooo wants to merge 22 commits intoHchen1218:mainfrom
Open
feat: token-receipt accuracy rework + 2026-05-09 follow-ups#2kingdoooo wants to merge 22 commits intoHchen1218:mainfrom
kingdoooo wants to merge 22 commits intoHchen1218:mainfrom
Conversation
Covers all 11 reported issues: token total math, cache-aware USD estimation, 1h TTL support, Bedrock auto-detection, Claude-projects jsonl aggregation with dedup, render gating, CLI flags, and HTML link fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Remove the min(cached, input) clamp that was under-reporting sessions. Each bucket (input, cache-read, cache-write, output) is now billed at its own rate. Add resolve_cache_write_split to route cache-write tokens between 5m and 1h TTL via CLI override, per-message snapshot fields, the ENABLE_PROMPT_CACHING_1H_BEDROCK env flag, or the 5m default. Introduce compute_total summing all billable buckets, plus PARTIAL status (with partial_reasons) when required rates are missing from pricing.json; 1h falls back to 2 * input_rate per Anthropic docs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The deepseek case had cached_input_tokens (500k) less than input_tokens (1M). The old clamp billed the 500k overlap as cached and only the remaining 500k at the input rate; the new math bills the full 1M at input AND the 500k at cached (per-bucket independent). Expected amount goes from $0.364000 to $0.434000. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…--write Adds six smoke-test cases to scripts/validate_receipt.py covering the Part 06/05/08/09 behaviors, scrubs 4 leaky env vars in run_script (same list as tests/test_cli_flags.py::_run), fixes a manual-mode TOTAL regression in load_manual_snapshot (leaving total_tokens=0 so cli.main's compute_total covers all four buckets), and prepends the 2026-05-08 CHANGELOG block.
Lock in the three open choices from the 2026-05-09 brainstorm: - FU-2: take Option A, preserve Optional[str] contract, new tests/test_data.py. - FU-3: LEAKY_ENV_VARS lives in token_receipt/_envguard.py; guard test required. - Execution order: FU-3 first, then FU-2, then FU-1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Claude Code on Bedrock dev shells export ENABLE_PROMPT_CACHING_1H_BEDROCK, CLAUDE_CODE_USE_BEDROCK, ANTHROPIC_MODEL, ANTHROPIC_SMALL_FAST_MODEL, which leak into estimate_cost (1h TTL branch) and resolve_provider_and_model (force-rewrite to aws bedrock). The subprocess scrubs in tests/test_cli_flags and scripts/validate_receipt already handled this; the in-process unit tests in tests/test_pricing_math and tests/test_bedrock did not. Add token_receipt/_envguard.py holding LEAKY_ENV_VARS, import it from both subprocess scrub sites (replacing two duplicated literals) and from setUp in EstimateCostTest and ResolveSnapshotWiresInBedrockTest. Adds a guard test that pins the four known leakers so a future trim can't silently regress the dev-shell matrix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…SION_ID Claude Code actually exports CLAUDE_CODE_SESSION_ID (confirmed with `env | grep SESSION` in a live Claude Code shell); the resolver was reading CLAUDE_SESSION_ID only, so --scope session/session-all raised "sessionId not set" even when a real session was live. runtime_claude_session_id now checks CLAUDE_CODE_SESSION_ID first and falls back to CLAUDE_SESSION_ID (which scripts/validate_receipt.py continues to exercise via its fixture). Optional[str] return type is preserved so _load_claude_aggregate's `if not session_id:` guard stays valid. Also updates the SystemExit message in _load_claude_aggregate to name both env vars. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
_MAJOR_MINOR_DASH_RE only matched claude-<word>-N-M (e.g. claude-opus-4-7 → claude-opus-4.7), not claude-N-M-<word>. APAC Bedrock Haiku 3.5 came out as claude-3-5-haiku, which find_price missed — the receipt showed PROVIDER: AWS BEDROCK but USD ESTIMATE: UNMAPPED even though the pricing row was already present (Part 03 added cache_write_1h_per_million for it). Adds _MAJOR_MINOR_SLUG_RE as a second branch after the existing dash-RE. The first branch early-returns so the two shapes stay independent, keeping the existing normalization tests untouched. Tests cover the APAC-only case, all four region prefixes, and an end-to-end check that find_price resolves the normalized model against DEFAULT_PRICING. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The original DoD Gate 6 used `--model apac.anthropic.claude-3-5-haiku`, but that path hits the intentional `_finalize` contract where explicit `--model` is preserved verbatim (pinned by tests/test_bedrock.py:108 and scripts/validate_receipt.py:820). FU-1's normalization fix runs before `_finalize`, so the auto-detect path (Bedrock model from env, no explicit `--model`) demonstrates the fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace 11 occurrences of a hard-coded /Users/... absolute path in the followups plan's shell examples with `cd <repo-root>` and plain `git ...` commands. The absolute paths were artifacts of drafting the plan for a subagent to copy-paste verbatim; they leaked developer-local path information once the branch opened as a public PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This PR bundles two waves of work that accumulated on the Kent/kingdoooo fork over the past week. Numbers and commands below are all pulled from
CHANGELOG.mdentries## 2026-05-08and## 2026-05-09.Wave 1 — Accuracy & UX rework (2026-05-08, 18 commits)
Fixes a 50-70× USD under-reporting bug and adds Bedrock / aggregator / PARTIAL / scope / hide-fields / silent-write features. Key outcomes:
Full rationale and decision log in `docs/superpowers/specs/2026-05-08-token-receipt-accuracy-design.md` + `docs/superpowers/plans/2026-05-08-token-receipt-accuracy/`.
Wave 2 — 2026-05-09 follow-ups (3 commits)
Three narrow fixes surfaced during per-part code review that were out of scope for any specific Wave 1 part:
FU-1 — Bedrock `claude-N-M-` normalization (`fix(bedrock)`)
APAC Bedrock Haiku 3.5 (`apac.anthropic.claude-3-5-haiku`) was normalizing to `claude-3-5-haiku`, which missed the `claude-3.5-haiku` entry in `references/pricing.json`. Users saw `PROVIDER: AWS BEDROCK` but `USD ESTIMATE: UNMAPPED`. Added `_MAJOR_MINOR_SLUG_RE` as a second branch after the existing `_MAJOR_MINOR_DASH_RE`; the original regex (which handles `claude--N-M` for e.g. `claude-opus-4-7 → claude-opus-4.7`) is left untouched.
FU-2 — Widen Claude Code session-id env lookup (`fix(data)`)
Claude Code actually exports `CLAUDE_CODE_SESSION_ID`, not `CLAUDE_SESSION_ID`. `runtime_claude_session_id` now checks the new name first and falls back to the legacy one. `--scope session`/`session-all` used to fail with "sessionId not set" inside real Claude Code sessions; now works. `Optional[str]` return contract preserved.
FU-3 — Scrub leaky env vars in in-process tests (`refactor(tests)`)
Claude Code on Bedrock dev shells export `ENABLE_PROMPT_CACHING_1H_BEDROCK` / `CLAUDE_CODE_USE_BEDROCK` / `ANTHROPIC_MODEL` / `ANTHROPIC_SMALL_FAST_MODEL`, which leak into `estimate_cost` (flips TTL resolver) and `resolve_provider_and_model` (force-rewrites provider). Subprocess-based tests already scrubbed these; the in-process `EstimateCostTest` and `ResolveSnapshotWiresInBedrockTest` did not. Extracted `LEAKY_ENV_VARS` into `token_receipt/_envguard.py` (imported by tests and `scripts/validate_receipt.py`), added `setUp` that rebuilds `os.environ` via `patch.dict(..., clear=True)`, plus a guard test pinning the 4 known leakers.
Full spec: `docs/superpowers/specs/2026-05-09-followups-after-accuracy-rework.md`.
Test plan
Notes for reviewers
🤖 Generated with Claude Code