Skip to content

feat: stale-while-revalidate cache for large Slack workspaces#225

Open
flacoste wants to merge 3 commits intokorotovsky:masterfrom
flacoste:feat/stale-while-revalidate-cache
Open

feat: stale-while-revalidate cache for large Slack workspaces#225
flacoste wants to merge 3 commits intokorotovsky:masterfrom
flacoste:feat/stale-while-revalidate-cache

Conversation

@flacoste
Copy link
Copy Markdown

@flacoste flacoste commented Mar 6, 2026

Summary

  • Problem: On large Slack workspaces (41K+ users, 81MB cache), the server blocks for ~90 seconds during startup fetching all users/channels from the Slack API, exceeding MCP client connection timeouts (e.g., Claude Code's default). Related: Be smarter about users caching maybe? #93, Server errors if caches are not ready #143
  • Solution: Implement stale-while-revalidate caching — load expired cache files immediately on startup, mark the server as ready, then refresh data in the background via a goroutine
  • Result: Server starts in under 1 second regardless of workspace size when a cache file exists

Changes

pkg/provider/api.go

  • Stale-while-revalidate in refreshUsersInternal and refreshChannelsInternal: when cache exists but is expired, load stale data into snapshot, set ready flag, release lock, spawn background goroutine for fresh fetch, return nil immediately
  • atomic.Bool for ready flags: Convert usersReady/channelsReady from bool to atomic.Bool to fix data race (read without lock in IsReady(), written under lock in refresh)
  • Concurrent refresh coalescing: Add refreshingUsers/refreshingChannels atomic.Bool with CompareAndSwap to prevent duplicate background fetches
  • Default TTL: Change defaultCacheTTL from 1h to 24h (user data rarely changes, stale-while-revalidate eliminates penalty for longer TTL)

cmd/slack-mcp-server/main.go

  • Remove the stdio IsReady() polling loop — redundant with stale-while-revalidate (ready flags set immediately from stale cache). On first run (no cache), handlers return appropriate "not ready" errors.

docs/03-configuration-and-usage.md

  • Document SLACK_MCP_CACHE_TTL and SLACK_MCP_MIN_REFRESH_INTERVAL env vars

pkg/provider/cache_test.go

  • Test: default TTL is 24 hours
  • Test: atomic.Bool ready flags with concurrent access (passes go test -race)
  • Test: refresh-in-progress flag prevents concurrent refreshes via CompareAndSwap
  • Test: stale cache loaded and ready flag set immediately (before API fetch)

Design decisions

  1. Keep bulk cache, don't block on it — bulk fetch is needed for DM name resolution across all tools. The change is internal to the refresh functions; callers see a fast return.
  2. Background goroutine is fire-and-forget — logs warnings on API failure, never fatals (server continues serving stale data)
  3. First-run behavior — no cache file means server starts with empty caches. Handlers already return ErrUsersNotReady/ErrChannelsNotReady. This only happens once per workspace.

Test plan

  • go test ./pkg/provider/ -race -v — all new and existing cache tests pass
  • go test ./... -race — all packages pass (handler integration tests require SLACK_MCP_XOXP_TOKEN, pre-existing skip)
  • go build ./... — compiles clean
  • Manual test on large workspace with expired cache file — verify sub-second startup

Post-Deploy Monitoring & Validation

No additional operational monitoring required: this is a library/server startup optimization with no new external dependencies or state changes. Existing log messages ("Serving stale users cache, background refresh starting", "Background users refresh failed") provide observability.


🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com

On large workspaces (41K+ users), the server blocks for ~90 seconds
during startup while fetching all users/channels from the Slack API,
exceeding MCP client connection timeouts.

Changes:
- Load expired cache files immediately, mark server ready, then refresh
  in background via goroutine (stale-while-revalidate pattern)
- Convert usersReady/channelsReady to atomic.Bool for race-free reads
- Add refreshingUsers/refreshingChannels atomic.Bool to coalesce
  concurrent background refreshes via CompareAndSwap
- Remove stdio IsReady() polling loop (no longer needed)
- Increase default cache TTL from 1h to 24h
- Document SLACK_MCP_CACHE_TTL and SLACK_MCP_MIN_REFRESH_INTERVAL
  env vars in docs/03-configuration-and-usage.md

Fixes startup timeout on large workspaces. Server now starts in under
1 second regardless of workspace size when a cache file exists.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

@chriscoey chriscoey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We run this on a 1000+ channel workspace and hit exactly this blocking-refresh timeout — great fix. Two issues before merge:

1. Stdio cold-start regression. Removing the IsReady() polling loop means ServeStdio() starts before caches are populated on first run (no cache file). Tool calls return ErrUsersNotReady for 60-90s. The SWR case is fine — ready flag is set synchronously from stale data, so the loop would exit in one tick. Suggest keeping the polling loop unconditionally; it's effectively free with SWR.

2. Concurrent fetchAndStore race. On master, defer usersMu.Unlock() serializes the entire refresh including API call + file write. The PR unlocks before fetchAndStoreUsers, which has no synchronization. The refreshingUsers CAS guards background refreshes but ForceRefreshUsers bypasses it, so force + background can overlap — racing on os.WriteFile (truncate-then-write) and usersSnapshot.Store (last writer wins). Suggest either extending the CAS to cover force refreshes too, or using atomic file writes (temp file + os.Rename).

Rebase note: Master added len(cachedUsers) == 0 guards since this PR's base — need to integrate those into the new code flow during rebase.

Nit: Typo TestRefreshingFlagPreventsConucrrentRefreshesConcurrent

flacoste and others added 2 commits March 23, 2026 20:36
- Restore IsReady() polling loop before ServeStdio() to fix cold-start
  regression where tool calls fail for 60-90s on first run (no cache)
- Add fetchUsersMu/fetchChannelsMu mutexes to serialize fetchAndStore*
  calls, preventing race between ForceRefresh and background refresh
- Use atomic file writes (temp + os.Rename) to prevent corrupt cache
  files on crash
- Guard against empty API results overwriting valid cache
- Guard against empty cache files being treated as valid data
- Fix typo: TestRefreshingFlagPreventsConucrrentRefreshes → Concurrent

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix infinite loop on cold start when API returns zero results: return
  error instead of nil when no existing cache is available, so the
  watcher calls Fatal rather than spinning IsReady() forever
- Secure temp file handling: use os.CreateTemp for unpredictable names
  (prevents symlink attacks) and clean up temp files on any failure
- Restrict cache file permissions from 0644 to 0600 and cache directory
  from 0755 to 0700 (cache contains user PII)
- Extract atomicWriteFile helper to deduplicate temp+rename pattern
- Fix stale docstring: getCacheTTL default is 24h not 1h

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@flacoste
Copy link
Copy Markdown
Author

Addressed review feedback in two commits:

Commit 1 (cc20f47): Direct fixes from @chriscoey's review

  • Restored IsReady() polling loop before ServeStdio() — fixes cold-start regression where tool calls fail for 60-90s on first run
  • Added fetchUsersMu/fetchChannelsMu sync.Mutex to serialize fetchAndStore* calls — prevents race between ForceRefresh and background SWR refresh on both os.WriteFile and snapshot.Store
  • Atomic file writes (temp file + os.Rename) for crash safety
  • Empty-result guards: API returning zero users/channels no longer overwrites valid cache; empty cache files treated as cache miss instead of valid data
  • Fixed typo: TestRefreshingFlagPreventsConucrrentRefreshesConcurrent

Commit 2 (db9c161): Fixes from self-review

  • Fixed infinite loop when API returns zero results on cold start — now returns an error (so Fatal fires) instead of nil with ready=false
  • Secured temp file handling: os.CreateTemp for unpredictable filenames (prevents symlink attacks), cleanup on any failure
  • Restricted cache file permissions from 0644 to 0600 and cache directory from 0755 to 0700 (cache contains user PII)
  • Fixed stale docstring on getCacheTTL (said "1 hour", should be "24 hours")

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants