
fix(scheduler): correctly handle torrent eviction from cache #635

Merged: Anton-Kalpakchiev merged 2 commits into master from fix-stale-torrent-control-500-error on Jun 19, 2026
Anton-Kalpakchiev (Collaborator) commented Jun 18, 2026

TLDR

This PR fixes a bug in the agent where blob downloads fail if 1) the blob is being seeded to other agents, 2) it has been evicted by the disk cache, and 3) fewer than 5 minutes have passed since the eviction.

Details

The agent decides whether a blob is in the cache based on its presence on disk. However, the disk is not the single source of truth. The P2P scheduler also briefly keeps track, in memory, of which torrents (aka blobs) are available for seeding. It only does so temporarily, when another peer requests a blob. The bug happens when the two states are out of sync, i.e. when the scheduler's in-memory state says a blob exists, but the disk cache's eviction mechanism has deleted it to make space for other blobs.

The specific order to trigger the bug is the following:

  1. Agent downloads a blob.
  2. Another peer requests the blob. This triggers the scheduler to create a new Torrent and record in memory that the torrent is available.
  3. The disk cache's cleanup job deletes the blob from disk (e.g. due to the 12h TTL we enforce). The scheduler's in-memory state now says the torrent (blob) is present, while it actually is not.
  4. A download request for the blob comes in to our agent.
  5. The agent checks the disk cache and sees the blob is missing.
  6. The agent calls the scheduler and triggers a blob download.
  7. However, instead of triggering the download, the scheduler INCORRECTLY sees in its in-memory state that the blob is already present and returns as a no-op.
  8. The caller of the scheduler's Download function then looks for the blob in the disk cache, expecting the scheduler to have downloaded it, but the scheduler never did.
  9. The agent returns a 500 error due to the invariant violation.
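
The sequence above can be modeled with a small, self-contained Go sketch. It is not the agent's actual code: `diskCache`, `staleScheduler`, `agentDownload`, and the digest string are hypothetical stand-ins, and the peer fetch is simulated by writing straight into the cache.

```go
package main

import (
	"errors"
	"fmt"
)

// diskCache is a hypothetical stand-in for the agent's on-disk blob cache.
type diskCache struct{ blobs map[string]bool }

func (c *diskCache) has(d string) bool { return c.blobs[d] }
func (c *diskCache) put(d string)      { c.blobs[d] = true }
func (c *diskCache) evict(d string)    { delete(c.blobs, d) }

// staleScheduler models the pre-fix behavior: it trusts its own in-memory
// record of which torrents are available, independent of the disk cache.
type staleScheduler struct {
	cache    *diskCache
	torrents map[string]bool
}

// seed models step 2: a peer request makes the scheduler record the torrent.
func (s *staleScheduler) seed(d string) { s.torrents[d] = true }

// download models step 7: a no-op whenever the in-memory record exists.
func (s *staleScheduler) download(d string) error {
	if s.torrents[d] {
		return nil // stale record: nothing is fetched
	}
	s.cache.put(d) // simulated fetch from peers
	s.torrents[d] = true
	return nil
}

var errInvariant = errors.New("blob missing from cache after scheduler download")

// agentDownload models steps 5 through 9 on the agent side.
func agentDownload(c *diskCache, s *staleScheduler, d string) error {
	if c.has(d) {
		return nil
	}
	if err := s.download(d); err != nil {
		return err
	}
	if !c.has(d) {
		return errInvariant // surfaced to the client as a 500
	}
	return nil
}

// reproduce walks through the nine steps and returns the final error.
func reproduce() error {
	c := &diskCache{blobs: map[string]bool{}}
	s := &staleScheduler{cache: c, torrents: map[string]bool{}}
	const d = "sha256:example" // hypothetical digest

	c.put(d)                      // step 1: agent downloads the blob
	s.seed(d)                     // step 2: another peer requests it
	c.evict(d)                    // step 3: cleanup job deletes it from disk
	return agentDownload(c, s, d) // steps 4-9
}

func main() {
	fmt.Println(reproduce())
}
```

In this model, `reproduce` returns `errInvariant` because the scheduler's map still contains the digest after the cache has dropped it.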

This actually happened in production and can be seen in the agent's logs - a blob gets downloaded and then ~12h later the bug happens, leading to download errors (12h because the blob gets evicted due to the 12h TTL).

5 minutes after the blob is evicted, the agent cleans up the scheduler's in-memory state, as it has a 5m TTI (time-to-idle) on in-memory torrent entries that are not being seeded. So the bug only happens if the blob is requested within that 5-minute window.
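
As a rough illustration of the timing, the sketch below takes the description at face value and measures the window from the moment of eviction. The function names and the timestamps are assumptions made for the example, not values from the agent.

```go
package main

import (
	"fmt"
	"time"
)

// inStaleWindow reports whether a request arriving at requestAt would still
// find the stale in-memory torrent entry, assuming the entry is cleaned up
// exactly tti after the blob was evicted.
func inStaleWindow(evictedAt, requestAt time.Time, tti time.Duration) bool {
	return !requestAt.Before(evictedAt) && requestAt.Sub(evictedAt) < tti
}

// exampleWindow evaluates two requests: 2 minutes and 6 minutes after eviction.
func exampleWindow() (early, late bool) {
	downloaded := time.Date(2026, 6, 1, 0, 0, 0, 0, time.UTC) // arbitrary start
	evicted := downloaded.Add(12 * time.Hour)                 // 12h TTL
	tti := 5 * time.Minute                                    // 5m TTI

	early = inStaleWindow(evicted, evicted.Add(2*time.Minute), tti)
	late = inStaleWindow(evicted, evicted.Add(6*time.Minute), tti)
	return early, late
}

func main() {
	early, late := exampleWindow()
	fmt.Println(early) // true: request lands inside the window
	fmt.Println(late)  // false: the stale entry is already gone
}
```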

The fix

To fix this, I change the behavior in step 7: the scheduler no longer uses its in-memory state to decide whether a Download call should be a no-op, but instead checks the Torrent struct's fields, which do not go out of sync.
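
A minimal sketch of the idea, under the assumption that the torrent can report its own completeness from the underlying storage. The `torrent` type, its `complete` method, and `fixedScheduler` are hypothetical and are not the actual change in `lib/torrent/scheduler`.

```go
package main

import "fmt"

// diskCache is a hypothetical stand-in for the agent's on-disk blob cache.
type diskCache struct{ blobs map[string]bool }

func (c *diskCache) has(d string) bool { return c.blobs[d] }
func (c *diskCache) put(d string)      { c.blobs[d] = true }
func (c *diskCache) evict(d string)    { delete(c.blobs, d) }

// torrent is a hypothetical type whose completeness is derived from the
// backing storage rather than from a flag cached by the scheduler.
type torrent struct {
	digest string
	cache  *diskCache
}

func (t *torrent) complete() bool { return t.cache.has(t.digest) }

// fixedScheduler models the post-fix behavior.
type fixedScheduler struct {
	cache    *diskCache
	torrents map[string]*torrent
}

// seed records a torrent when another peer requests the blob.
func (s *fixedScheduler) seed(d string) {
	s.torrents[d] = &torrent{digest: d, cache: s.cache}
}

// download is a no-op only if the torrent itself reports it is complete.
// The mere presence of an in-memory entry is no longer enough.
func (s *fixedScheduler) download(d string) error {
	if t, ok := s.torrents[d]; ok && t.complete() {
		return nil
	}
	s.cache.put(d) // simulated fetch from peers
	s.torrents[d] = &torrent{digest: d, cache: s.cache}
	return nil
}

// downloadAfterEviction replays the failing sequence against the fixed
// scheduler and reports whether the blob ends up back in the cache.
func downloadAfterEviction() bool {
	c := &diskCache{blobs: map[string]bool{}}
	s := &fixedScheduler{cache: c, torrents: map[string]*torrent{}}
	const d = "sha256:example" // hypothetical digest

	c.put(d)   // agent downloads the blob
	s.seed(d)  // another peer requests it
	c.evict(d) // cleanup job deletes it from disk
	_ = s.download(d)
	return c.has(d)
}

func main() {
	fmt.Println(downloadAfterEviction()) // true
}
```

With this check, a stale map entry no longer short-circuits the download, so the caller finds the blob in the cache afterwards.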

Test Plan

I added a unit test that covers the scenario; it fails without this change and passes with it.

Comment thread lib/torrent/scheduler/events.go Outdated
@Anton-Kalpakchiev Anton-Kalpakchiev merged commit a1f3ff7 into master Jun 19, 2026
11 checks passed
@Anton-Kalpakchiev Anton-Kalpakchiev deleted the fix-stale-torrent-control-500-error branch June 19, 2026 13:33