feat: stale-while-revalidate cache for large Slack workspaces #225
flacoste wants to merge 3 commits into korotovsky:master
On large workspaces (41K+ users), the server blocks for ~90 seconds during startup while fetching all users/channels from the Slack API, exceeding MCP client connection timeouts.

Changes:
- Load expired cache files immediately, mark server ready, then refresh in background via goroutine (stale-while-revalidate pattern)
- Convert usersReady/channelsReady to atomic.Bool for race-free reads
- Add refreshingUsers/refreshingChannels atomic.Bool to coalesce concurrent background refreshes via CompareAndSwap
- Remove stdio IsReady() polling loop (no longer needed)
- Increase default cache TTL from 1h to 24h
- Document SLACK_MCP_CACHE_TTL and SLACK_MCP_MIN_REFRESH_INTERVAL env vars in docs/03-configuration-and-usage.md

Fixes startup timeout on large workspaces. Server now starts in under 1 second regardless of workspace size when a cache file exists.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
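The stale-while-revalidate flow and the CompareAndSwap coalescing described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `cache` type, its fields, and the `loadStale`/`refreshInBackground` names are hypothetical stand-ins for the provider's real structures.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// cache is a hypothetical stand-in for the provider's users cache.
type cache struct {
	ready      atomic.Bool    // true once some data (fresh or stale) is loaded
	refreshing atomic.Bool    // true while a background refresh is in flight
	snapshot   atomic.Value   // always holds a []string
	fetches    atomic.Int32   // number of refreshes actually started
	wg         sync.WaitGroup // lets callers wait for background work
}

// loadStale publishes expired data immediately, marks the cache ready,
// and then kicks off a background refresh.
func (c *cache) loadStale(stale []string, fetch func() []string) {
	c.snapshot.Store(stale)
	c.ready.Store(true)
	c.refreshInBackground(fetch)
}

// refreshInBackground starts at most one refresh at a time: only the
// caller that wins the CompareAndSwap spawns the goroutine.
func (c *cache) refreshInBackground(fetch func() []string) {
	if !c.refreshing.CompareAndSwap(false, true) {
		return // another refresh is already running
	}
	c.wg.Add(1)
	go func() {
		defer c.wg.Done()
		defer c.refreshing.Store(false)
		c.fetches.Add(1)
		c.snapshot.Store(fetch())
	}()
}

func main() {
	c := &cache{}
	c.loadStale([]string{"stale-user"}, func() []string {
		time.Sleep(50 * time.Millisecond) // simulated slow Slack API call
		return []string{"fresh-user"}
	})
	fmt.Println("ready:", c.ready.Load(), "data:", c.snapshot.Load())
	c.wg.Wait()
	fmt.Println("after refresh:", c.snapshot.Load())
}
```

The CAS only coalesces callers that go through `refreshInBackground`; any path that calls the fetch directly would bypass it.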
chriscoey left a comment
We run this on a 1000+ channel workspace and hit exactly this blocking-refresh timeout — great fix. Two issues before merge:
1. Stdio cold-start regression. Removing the IsReady() polling loop means ServeStdio() starts before caches are populated on first run (no cache file). Tool calls return ErrUsersNotReady for 60-90s. The SWR case is fine — ready flag is set synchronously from stale data, so the loop would exit in one tick. Suggest keeping the polling loop unconditionally; it's effectively free with SWR.
2. Concurrent fetchAndStore race. On master, defer usersMu.Unlock() serializes the entire refresh including API call + file write. The PR unlocks before fetchAndStoreUsers, which has no synchronization. The refreshingUsers CAS guards background refreshes but ForceRefreshUsers bypasses it, so force + background can overlap — racing on os.WriteFile (truncate-then-write) and usersSnapshot.Store (last writer wins). Suggest either extending the CAS to cover force refreshes too, or using atomic file writes (temp file + os.Rename).
Rebase note: Master added len(cachedUsers) == 0 guards since this PR's base — need to integrate those into the new code flow during rebase.
Nit: Typo TestRefreshingFlagPreventsConucrrentRefreshes → Concurrent
- Restore IsReady() polling loop before ServeStdio() to fix cold-start regression where tool calls fail for 60-90s on first run (no cache)
- Add fetchUsersMu/fetchChannelsMu mutexes to serialize fetchAndStore* calls, preventing race between ForceRefresh and background refresh
- Use atomic file writes (temp + os.Rename) to prevent corrupt cache files on crash
- Guard against empty API results overwriting valid cache
- Guard against empty cache files being treated as valid data
- Fix typo: TestRefreshingFlagPreventsConucrrentRefreshes → Concurrent

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix infinite loop on cold start when API returns zero results: return error instead of nil when no existing cache is available, so the watcher calls Fatal rather than spinning IsReady() forever
- Secure temp file handling: use os.CreateTemp for unpredictable names (prevents symlink attacks) and clean up temp files on any failure
- Restrict cache file permissions from 0644 to 0600 and cache directory from 0755 to 0700 (cache contains user PII)
- Extract atomicWriteFile helper to deduplicate temp+rename pattern
- Fix stale docstring: getCacheTTL default is 24h not 1h

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Addressed review feedback in two commits: Commit 1 (
Commit 2 (
Summary
Changes
pkg/provider/api.go
cmd/slack-mcp-server/main.go
docs/03-configuration-and-usage.md
pkg/provider/cache_test.go
Design decisions
Test plan
Post-Deploy Monitoring & Validation
No additional operational monitoring required: this is a library/server startup optimization with no new external dependencies or state changes. Existing log messages ("Serving stale users cache, background refresh starting", "Background users refresh failed") provide observability.
🤖 Generated with Claude Code
Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com