
fix: deduplicate message hashes across files, not just within each file#647

Open
enzonaute wants to merge 1 commit into steipete:main from enzonaute:fix/global-dedup-cross-file

Conversation

@enzonaute

parseClaudeFile() resets its seenKeys set per file. Claude Code logs subagent messages to both the parent session JSONL and the subagent's own JSONL under <session>/subagents/. Since both files share the same messageId:requestId pairs, per-file dedup misses the cross-file duplicates and inflates the totals.

On a real dataset (30 days, 538 files):

  • per-file dedup: 7.98B tokens (current behavior)
  • global dedup: 2.59B tokens (correct, matches ccusage)
  • ratio: 3.08x overcounting

Fix: pass the seenKeys set through ClaudeScanState so it accumulates across all files in a single scan pass. The new existingSeenKeys parameter defaults to [] so existing call sites and tests are unaffected.
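The mechanism can be sketched as follows. This is an illustrative TypeScript model of the fix, not the project's actual code; the file format and function shapes are simplified stand-ins that mirror the names in the PR (parseClaudeFile, existingSeenKeys, globalSeenKeys).

```typescript
// Illustrative sketch: thread one seen-keys set through the whole scan so
// duplicates are caught across files, not just within each file.

interface ParsedFile {
  tokens: number;
  seenKeys: Set<string>;
}

// Hypothetical stand-in for parseClaudeFile(): each line is "messageId:requestId:tokens".
// existingSeenKeys defaults to [] so single-file callers behave as before.
function parseClaudeFile(lines: string[], existingSeenKeys: string[] = []): ParsedFile {
  const seenKeys = new Set(existingSeenKeys);
  let tokens = 0;
  for (const line of lines) {
    const [messageId, requestId, count] = line.split(":");
    const key = `${messageId}:${requestId}`;
    if (seenKeys.has(key)) continue; // duplicate, possibly from an earlier file
    seenKeys.add(key);
    tokens += Number(count);
  }
  return { tokens, seenKeys };
}

// The scan state accumulates keys across files, so a message logged in both the
// parent session JSONL and the subagent JSONL is only counted once.
const files = [
  ["m1:r1:100", "m2:r1:50"], // parent session JSONL
  ["m1:r1:100"],             // subagent JSONL repeating the same message
];
let globalSeenKeys: string[] = [];
let total = 0;
for (const file of files) {
  const parsed = parseClaudeFile(file, globalSeenKeys);
  globalSeenKeys = [...parsed.seenKeys];
  total += parsed.tokens;
}
// total is 150, not 250: the cross-file duplicate is deduplicated.
```

With per-file dedup, the same example would report 250 tokens, which is the same overcounting pattern the 7.98B-vs-2.59B numbers above show at scale.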


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 07c7558007


Comment on lines +329 to +330
existingSeenKeys: state.globalSeenKeys)
state.globalSeenKeys = parsed.seenKeys


P1 Badge Recompute duplicate ownership when files are removed

Passing globalSeenKeys into each parse makes later duplicate files persist with empty days, but that per-file attribution is then cached. If the file that originally “won” a duplicate key is deleted in a later refresh, stale cleanup subtracts its usage while the surviving duplicate file is skipped as unchanged, so those tokens vanish from totals until a forced rescan. This creates incorrect undercounting in normal log-rotation/deletion flows for Claude sessions with parent/subagent duplicates.
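The failure mode described above can be reduced to a small cache model. This is a hypothetical TypeScript sketch of the scenario, not the project's cache implementation; file names and the cache shape are invented for illustration.

```typescript
// Minimal model of the stale-cache undercount: per-file token attribution is
// cached, and duplicates are owned by whichever file claimed the key first.

const cache = new Map<string, number>(); // file path -> cached token attribution
cache.set("parent.jsonl", 100);  // "won" the duplicate key, owns the tokens
cache.set("subagent.jsonl", 0);  // parsed later; key already seen, cached as empty

let total = [...cache.values()].reduce((a, b) => a + b, 0); // 100

// parent.jsonl is deleted (e.g. log rotation). Stale cleanup subtracts its
// cached usage, but subagent.jsonl is unchanged on disk, so it is not
// reparsed and keeps its cached 0 -- even though it still contains the message.
cache.delete("parent.jsonl");
total = [...cache.values()].reduce((a, b) => a + b, 0); // 0: the tokens vanished

// A fix along the lines Codex suggests would recompute duplicate ownership
// (or invalidate files that shared a key) whenever the winning file is removed.
```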

