Skip to content

feat: token-receipt accuracy rework + 2026-05-09 follow-ups#2

Open
kingdoooo wants to merge 22 commits intoHchen1218:mainfrom
kingdoooo:feat/2026-05-09-followups
Open

feat: token-receipt accuracy rework + 2026-05-09 follow-ups#2
kingdoooo wants to merge 22 commits intoHchen1218:mainfrom
kingdoooo:feat/2026-05-09-followups

Conversation

@kingdoooo
Copy link
Copy Markdown

Summary

This PR bundles two waves of work that accumulated on the Kent/kingdoooo fork over the past week. Numbers and commands below are all pulled from CHANGELOG.md entries ## 2026-05-08 and ## 2026-05-09.

Wave 1 — Accuracy & UX rework (2026-05-08, 18 commits)

Fixes a 50-70× USD under-reporting bug and adds Bedrock / aggregator / PARTIAL / scope / hide-fields / silent-write features. Key outcomes:

  • `TOTAL` now includes cache read and cache write tokens (previously undercounted on cache-heavy sessions).
  • `USD ESTIMATE` bills cached input and cache write at their own rates (previously dropped them).
  • `PROVIDER`/`MODEL` auto-detect AWS Bedrock from `CLAUDE_CODE_USE_BEDROCK=1` or `.anthropic.*[1m]` model strings.
  • New `--cache-ttl {auto,5m,1h}` flag for Anthropic cache-write rate selection.
  • New `--hide-fields supplier,model,context,price-mapping,price-date,rate-note` for surgical row removal.
  • New `--scope today` and `--scope session-all` for Claude Code cross-jsonl aggregation.
  • New `PARTIAL` pricing status — rendered as `USD ESTIMATE*` + `PRICE: PARTIAL` when the rate table is incomplete.
  • `--write` is now fully silent; `--write --write-html` prints exactly two `wrote to:` lines.
  • Pricing, Bedrock normalization, and Claude aggregation live in new single-purpose modules under `token_receipt/`.
  • Test scaffold (`tests/`): unittest discover + a comprehensive `scripts/validate_receipt.py` smoke script.

Full rationale and decision log in `docs/superpowers/specs/2026-05-08-token-receipt-accuracy-design.md` + `docs/superpowers/plans/2026-05-08-token-receipt-accuracy/`.

Wave 2 — 2026-05-09 follow-ups (3 commits)

Three narrow fixes surfaced during per-part code review that were out of scope for any specific Wave 1 part:

FU-1 — Bedrock `claude-N-M-` normalization (`fix(bedrock)`)
APAC Bedrock Haiku 3.5 (`apac.anthropic.claude-3-5-haiku`) was normalizing to `claude-3-5-haiku`, which missed the `claude-3.5-haiku` entry in `references/pricing.json`. Users saw `PROVIDER: AWS BEDROCK` but `USD ESTIMATE: UNMAPPED`. Added `_MAJOR_MINOR_SLUG_RE` as a second branch after the existing `_MAJOR_MINOR_DASH_RE`; the original regex (which handles `claude--N-M` for e.g. `claude-opus-4-7 → claude-opus-4.7`) is left untouched.

FU-2 — Widen Claude Code session-id env lookup (`fix(data)`)
Claude Code actually exports `CLAUDE_CODE_SESSION_ID`, not `CLAUDE_SESSION_ID`. `runtime_claude_session_id` now checks the new name first and falls back to the legacy one. `--scope session`/`session-all` used to fail with "sessionId not set" inside real Claude Code sessions; now works. `Optional[str]` return contract preserved.

FU-3 — Scrub leaky env vars in in-process tests (`refactor(tests)`)
Claude Code on Bedrock dev shells export `ENABLE_PROMPT_CACHING_1H_BEDROCK` / `CLAUDE_CODE_USE_BEDROCK` / `ANTHROPIC_MODEL` / `ANTHROPIC_SMALL_FAST_MODEL`, which leak into `estimate_cost` (flips TTL resolver) and `resolve_provider_and_model` (force-rewrites provider). Subprocess-based tests already scrubbed these; the in-process `EstimateCostTest` and `ResolveSnapshotWiresInBedrockTest` did not. Extracted `LEAKY_ENV_VARS` into `token_receipt/_envguard.py` (imported by tests and `scripts/validate_receipt.py`), added `setUp` that rebuilds `os.environ` via `patch.dict(..., clear=True)`, plus a guard test pinning the 4 known leakers.

Full spec: `docs/superpowers/specs/2026-05-09-followups-after-accuracy-rework.md`.

Test plan

  • Clean-env test suite: `env -i PATH="$PATH" HOME="$HOME" python3 -m unittest discover -s tests -v` → 70/70 OK
  • Dev-shell test suite (with 4 Bedrock leakers exported): `python3 -m unittest discover -s tests -v` → 70/70 OK
  • Clean-env smoke: `env -i PATH="$PATH" HOME="$HOME" python3 scripts/validate_receipt.py` → `token-receipt validation passed`
  • Dev-shell smoke: `python3 scripts/validate_receipt.py` → `token-receipt validation passed`
  • FU-1 auto-detect Haiku 3.5 sanity: `env -i ... CLAUDE_CODE_USE_BEDROCK=1 ANTHROPIC_MODEL=apac.anthropic.claude-3-5-haiku python3 scripts/token_receipt.py --agent-tool claude-code --input-tokens 1000 --output-tokens 1000 --width 48` renders `AWS BEDROCK` / `claude-3.5-haiku` / `$0.004800`
  • Upstream `--chat-reply` (merged in during rebase) still works end-to-end

Notes for reviewers

  • Branch was rebased onto `Hchen1218/main` (the 3 upstream commits `6f5d353` / `507e75b` / `fa443ce` are not present in the diff; they land as the rebase base). Three files had conflicts that were resolved as additive merges: `token_receipt/cli.py` (our `--cache-ttl`/`--hide-fields`/silent-write + upstream `--chat-reply`), `SKILL.md` (file:// note merged into the `--chat-reply` section), and `scripts/validate_receipt.py` (LEAKY_ENV_VARS import alongside new codex-thread imports).
  • All three FUs are surgical: FU-1 is 5 LOC + 3 tests; FU-2 is 8 LOC + 3 tests; FU-3 is ~20 LOC + 1 guard test + 2 setUp blocks + 6-site import swap.
  • The 18 Wave 1 commits form a complete narrative when reviewed with `git log --reverse`. Specs and plans are checked in under `docs/superpowers/` for traceability — those are 100% docs, safe to skim or skip.

🤖 Generated with Claude Code

kingdoooo and others added 22 commits May 9, 2026 12:10
Covers all 11 reported issues: token total math, cache-aware USD
estimation, 1h TTL support, Bedrock auto-detection, Claude-projects
jsonl aggregation with dedup, render gating, CLI flags, and HTML link fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Remove the min(cached, input) clamp that was under-reporting sessions.
Each bucket (input, cache-read, cache-write, output) is now billed at its
own rate. Add resolve_cache_write_split to route cache-write tokens
between 5m and 1h TTL via CLI override, per-message snapshot fields,
the ENABLE_PROMPT_CACHING_1H_BEDROCK env flag, or the 5m default.
Introduce compute_total summing all billable buckets, plus PARTIAL
status (with partial_reasons) when required rates are missing from
pricing.json; 1h falls back to 2 * input_rate per Anthropic docs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The deepseek case had cached_input_tokens (500k) less than input_tokens
(1M). The old clamp billed the 500k overlap as cached and only the
remaining 500k at the input rate; the new math bills the full 1M at
input AND the 500k at cached (per-bucket independent). Expected amount
goes from $0.364000 to $0.434000.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…--write

Adds six smoke-test cases to scripts/validate_receipt.py covering the
Part 06/05/08/09 behaviors, scrubs 4 leaky env vars in run_script
(same list as tests/test_cli_flags.py::_run), fixes a manual-mode TOTAL
regression in load_manual_snapshot (leaving total_tokens=0 so cli.main's
compute_total covers all four buckets), and prepends the 2026-05-08
CHANGELOG block.
Lock in the three open choices from the 2026-05-09 brainstorm:
- FU-2: take Option A, preserve Optional[str] contract, new tests/test_data.py.
- FU-3: LEAKY_ENV_VARS lives in token_receipt/_envguard.py; guard test required.
- Execution order: FU-3 first, then FU-2, then FU-1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Claude Code on Bedrock dev shells export ENABLE_PROMPT_CACHING_1H_BEDROCK,
CLAUDE_CODE_USE_BEDROCK, ANTHROPIC_MODEL, ANTHROPIC_SMALL_FAST_MODEL, which
leak into estimate_cost (1h TTL branch) and resolve_provider_and_model
(force-rewrite to aws bedrock). The subprocess scrubs in tests/test_cli_flags
and scripts/validate_receipt already handled this; the in-process unit tests
in tests/test_pricing_math and tests/test_bedrock did not.

Add token_receipt/_envguard.py holding LEAKY_ENV_VARS, import it from both
subprocess scrub sites (replacing two duplicated literals) and from setUp in
EstimateCostTest and ResolveSnapshotWiresInBedrockTest. Adds a guard test
that pins the four known leakers so a future trim can't silently regress
the dev-shell matrix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…SION_ID

Claude Code actually exports CLAUDE_CODE_SESSION_ID (confirmed with `env |
grep SESSION` in a live Claude Code shell); the resolver was reading
CLAUDE_SESSION_ID only, so --scope session/session-all raised
"sessionId not set" even when a real session was live.

runtime_claude_session_id now checks CLAUDE_CODE_SESSION_ID first and falls
back to CLAUDE_SESSION_ID (which scripts/validate_receipt.py continues to
exercise via its fixture). Optional[str] return type is preserved so
_load_claude_aggregate's `if not session_id:` guard stays valid.

Also updates the SystemExit message in _load_claude_aggregate to name both
env vars.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
_MAJOR_MINOR_DASH_RE only matched claude-<word>-N-M (e.g. claude-opus-4-7 →
claude-opus-4.7), not claude-N-M-<word>. APAC Bedrock Haiku 3.5 came out as
claude-3-5-haiku, which find_price missed — the receipt showed PROVIDER:
AWS BEDROCK but USD ESTIMATE: UNMAPPED even though the pricing row was
already present (Part 03 added cache_write_1h_per_million for it).

Adds _MAJOR_MINOR_SLUG_RE as a second branch after the existing dash-RE.
The first branch early-returns so the two shapes stay independent, keeping
the existing normalization tests untouched.

Tests cover the APAC-only case, all four region prefixes, and an end-to-end
check that find_price resolves the normalized model against DEFAULT_PRICING.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The original DoD Gate 6 used `--model apac.anthropic.claude-3-5-haiku`,
but that path hits the intentional `_finalize` contract where explicit
`--model` is preserved verbatim (pinned by tests/test_bedrock.py:108 and
scripts/validate_receipt.py:820). FU-1's normalization fix runs before
`_finalize`, so the auto-detect path (Bedrock model from env, no explicit
`--model`) demonstrates the fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace 11 occurrences of a hard-coded /Users/... absolute path in the
followups plan's shell examples with `cd <repo-root>` and plain
`git ...` commands. The absolute paths were artifacts of drafting the
plan for a subagent to copy-paste verbatim; they leaked developer-local
path information once the branch opened as a public PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant