A few targeted tweaks to address HF rate limits #2009

Merged
ciaranbor merged 1 commit into main from fix-hf-rate-limits
Apr 30, 2026

@ciaranbor
Member

Motivation

  • exo bursts ~200 HF Hub-API requests on every cold start, blowing past the anonymous 500-req/5-min budget.
  • The existing retry loop catches 429 generically and gives up in ~3s — well before HF's reset window.
  • file_meta and _download_file had no 429 handling at all, so a 429 surfaced as an AssertionError.
  • Disk file-list cache was bypassed on every process restart.

Changes

All in src/exo/download/download_utils.py + tests.

  • Parse t= from HF's RateLimit header on 429; sleep min(t, 300s) + jitter.
  • Handle 429 at all three call sites (_fetch_file_list, file_meta, _download_file).
  • n_attempts: 3 → 5.
  • Disk cache now primary across restarts (24h mtime TTL).
  • ?recursive=true instead of N+1 subdir walks.
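
A minimal sketch of the first bullet, parsing t= and capping the sleep at 300s. The helper names, the regex, the 1s fallback, and the jitter range are assumptions for illustration, not the code in download_utils.py; the sample header value assumes the `"api";r=0;t=158` shape.

```python
import random
import re
from typing import Optional

# Cap on any single sleep, matching the 300s ceiling in the bullet above.
MAX_SLEEP_SECONDS = 300.0

# Matches a t=<seconds> parameter inside a RateLimit header value.
# The exact header grammar is an assumption; only t= is taken from the PR text.
_RESET_RE = re.compile(r"(?:^|[;,\s])t=(\d+(?:\.\d+)?)")


def parse_reset_seconds(header_value: Optional[str]) -> Optional[float]:
    """Return the t= value from a RateLimit header, or None if it is absent."""
    if not header_value:
        return None
    match = _RESET_RE.search(header_value)
    return float(match.group(1)) if match else None


def sleep_duration(
    reset_seconds: Optional[float],
    fallback: float = 1.0,    # hypothetical default when no t= is present
    max_jitter: float = 1.0,  # hypothetical jitter range
) -> float:
    """Compute min(t, 300s) + jitter, using the fallback when t is unknown."""
    base = fallback if reset_seconds is None else min(reset_seconds, MAX_SLEEP_SECONDS)
    return base + random.uniform(0.0, max_jitter)
```

The jitter keeps several nodes that were throttled at the same moment from retrying in lockstep once the window resets.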

Why It Works

t=<seconds> is HF's "wait this long and you'll be unblocked" — sleeping that long lets the window reset. Disk-cache-as-primary plus recursive listing cuts cold-start Hub-API traffic by ~10×.
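
The disk-cache-as-primary idea can be sketched as an mtime freshness check. This is a hedged illustration: `is_cache_fresh` and its parameters are hypothetical names, and only the 24h mtime TTL comes from the description above.

```python
import time
from pathlib import Path
from typing import Optional

# 24h TTL, as stated in the Changes list.
CACHE_TTL_SECONDS = 24 * 60 * 60


def is_cache_fresh(
    cache_path: Path,
    ttl: float = CACHE_TTL_SECONDS,
    now: Optional[float] = None,  # injectable clock, handy for tests
) -> bool:
    """Return True if the cache file exists and was modified less than ttl seconds ago."""
    try:
        mtime = cache_path.stat().st_mtime
    except FileNotFoundError:
        return False
    current = time.time() if now is None else now
    return (current - mtime) < ttl
```

A caller would read the cached file list when this returns True and fall through to the Hub API only when it returns False, which is what keeps a process restart from triggering a fresh burst of requests.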

Test Plan

Manual Testing

MacBook Pro M1 Max. Tripped the real HF 429. Pre-fix: failed in 3.4s. Post-fix: slept (HF returned t=158) and recovered.

Automated Testing

  • New test_rate_limit_handling.py (19 tests) — header parsing, retry-loop behaviour, plus HTTP-level coverage that mocks aiohttp to return a 429 and asserts each call site raises HuggingFaceRateLimitError(retry_after=52.0).
  • New TestFileListCacheTTL in test_offline_mode.py — fresh cache hits, stale cache refetches.
  • 421 tests pass; basedpyright / ruff / nix fmt clean.
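
The retry-loop behaviour those tests exercise might look roughly like the sketch below. The exception class here is a stand-in that only mirrors the retry_after attribute named above; the wrapper name, the injected sleep parameter, and the 1s fallback are assumptions, and jitter is left out so the delays are deterministic.

```python
import asyncio
from typing import Awaitable, Callable, Optional, TypeVar

T = TypeVar("T")


class HuggingFaceRateLimitError(Exception):
    """Stand-in for the PR's exception; only retry_after is taken from the description."""

    def __init__(self, retry_after: Optional[float] = None) -> None:
        super().__init__(f"rate limited, retry_after={retry_after}")
        self.retry_after = retry_after


async def with_rate_limit_retry(
    fetch: Callable[[], Awaitable[T]],
    n_attempts: int = 5,  # matches the new n_attempts value
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Call fetch, sleeping for the server-provided delay after each 429."""
    for attempt in range(n_attempts):
        try:
            return await fetch()
        except HuggingFaceRateLimitError as err:
            if attempt == n_attempts - 1:
                raise  # out of attempts: surface the rate-limit error
            # Hypothetical 1s fallback when no retry_after is known; cap at 300s.
            delay = min(err.retry_after if err.retry_after is not None else 1.0, 300.0)
            await sleep(delay)
    raise RuntimeError("n_attempts must be at least 1")
```

Passing the sleep function in as a parameter lets a test record the requested delays instead of actually waiting out the reset window.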

api_url = f"{get_hf_endpoint()}/api/models/{model_id}/tree/{revision}"
url = f"{api_url}/{path}" if path else api_url
# ?recursive=true returns the whole subtree in one request
if recursive:
Collaborator

Scary

Collaborator

@rltakashige left a comment

Sloppy slop slop but I'll approve this to unblock us as I don't think it breaks any existing functionality.

on_connection_lost: Callable[[], None] = lambda: None,
) -> list[FileListEntry]:
- n_attempts = 3
+ n_attempts = 5
Collaborator

weird change

return None


# reset window is 5 min
Collaborator

nit: cmon lol

@@ -0,0 +1,355 @@
"""Tests for HuggingFace 429 rate-limit handling in download_utils."""
Collaborator

oh my god a billion more lines of tests :)

@ciaranbor enabled auto-merge (squash) April 30, 2026 18:01
@ciaranbor merged commit 8dae3ec into main Apr 30, 2026
8 of 9 checks passed
@ciaranbor deleted the fix-hf-rate-limits branch April 30, 2026 18:06
2 participants