feat: stale-while-revalidate cache for large Slack workspaces by flacoste · Pull Request #225 · korotovsky/slack-mcp-server

flacoste · 2026-03-06T02:53:59Z

Summary

Problem: On large Slack workspaces (41K+ users, 81MB cache), the server blocks for ~90 seconds during startup fetching all users/channels from the Slack API, exceeding MCP client connection timeouts (e.g., Claude Code's default). Related: Be smarter about users caching maybe? #93, Server errors if caches are not ready #143
Solution: Implement stale-while-revalidate caching — load expired cache files immediately on startup, mark the server as ready, then refresh data in the background via a goroutine
Result: Server starts in under 1 second regardless of workspace size when a cache file exists

Changes

pkg/provider/api.go

Stale-while-revalidate in refreshUsersInternal and refreshChannelsInternal: when cache exists but is expired, load stale data into snapshot, set ready flag, release lock, spawn background goroutine for fresh fetch, return nil immediately
atomic.Bool for ready flags: Convert usersReady/channelsReady from bool to atomic.Bool to fix data race (read without lock in IsReady(), written under lock in refresh)
Concurrent refresh coalescing: Add refreshingUsers/refreshingChannels atomic.Bool with CompareAndSwap to prevent duplicate background fetches
Default TTL: Change defaultCacheTTL from 1h to 24h (user data rarely changes, stale-while-revalidate eliminates penalty for longer TTL)

cmd/slack-mcp-server/main.go

Remove the stdio IsReady() polling loop — redundant with stale-while-revalidate (ready flags set immediately from stale cache). On first run (no cache), handlers return appropriate "not ready" errors.

docs/03-configuration-and-usage.md

Document SLACK_MCP_CACHE_TTL and SLACK_MCP_MIN_REFRESH_INTERVAL env vars

pkg/provider/cache_test.go

Test: default TTL is 24 hours
Test: atomic.Bool ready flags with concurrent access (passes go test -race)
Test: refresh-in-progress flag prevents concurrent refreshes via CompareAndSwap
Test: stale cache loaded and ready flag set immediately (before API fetch)

Design decisions

Keep bulk cache, don't block on it — bulk fetch is needed for DM name resolution across all tools. The change is internal to the refresh functions; callers see a fast return.
Background goroutine is fire-and-forget — logs warnings on API failure, never fatals (server continues serving stale data)
First-run behavior — no cache file means server starts with empty caches. Handlers already return ErrUsersNotReady/ErrChannelsNotReady. This only happens once per workspace.

Test plan

go test ./pkg/provider/ -race -v — all new and existing cache tests pass
go test ./... -race — all packages pass (handler integration tests require SLACK_MCP_XOXP_TOKEN, pre-existing skip)
go build ./... — compiles clean
Manual test on large workspace with expired cache file — verify sub-second startup

Post-Deploy Monitoring & Validation

No additional operational monitoring required: this is a library/server startup optimization with no new external dependencies or state changes. Existing log messages ("Serving stale users cache, background refresh starting", "Background users refresh failed") provide observability.

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 noreply@anthropic.com

On large workspaces (41K+ users), the server blocks for ~90 seconds during startup while fetching all users/channels from the Slack API, exceeding MCP client connection timeouts. Changes: - Load expired cache files immediately, mark server ready, then refresh in background via goroutine (stale-while-revalidate pattern) - Convert usersReady/channelsReady to atomic.Bool for race-free reads - Add refreshingUsers/refreshingChannels atomic.Bool to coalesce concurrent background refreshes via CompareAndSwap - Remove stdio IsReady() polling loop (no longer needed) - Increase default cache TTL from 1h to 24h - Document SLACK_MCP_CACHE_TTL and SLACK_MCP_MIN_REFRESH_INTERVAL env vars in docs/03-configuration-and-usage.md Fixes startup timeout on large workspaces. Server now starts in under 1 second regardless of workspace size when a cache file exists. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

chriscoey

We run this on a 1000+ channel workspace and hit exactly this blocking-refresh timeout — great fix. Two issues before merge:

1. Stdio cold-start regression. Removing the IsReady() polling loop means ServeStdio() starts before caches are populated on first run (no cache file). Tool calls return ErrUsersNotReady for 60-90s. The SWR case is fine — ready flag is set synchronously from stale data, so the loop would exit in one tick. Suggest keeping the polling loop unconditionally; it's effectively free with SWR.

2. Concurrent fetchAndStore race. On master, defer usersMu.Unlock() serializes the entire refresh including API call + file write. The PR unlocks before fetchAndStoreUsers, which has no synchronization. The refreshingUsers CAS guards background refreshes but ForceRefreshUsers bypasses it, so force + background can overlap — racing on os.WriteFile (truncate-then-write) and usersSnapshot.Store (last writer wins). Suggest either extending the CAS to cover force refreshes too, or using atomic file writes (temp file + os.Rename).

Rebase note: Master added len(cachedUsers) == 0 guards since this PR's base — need to integrate those into the new code flow during rebase.

Nit: Typo TestRefreshingFlagPreventsConucrrentRefreshes → Concurrent

- Restore IsReady() polling loop before ServeStdio() to fix cold-start regression where tool calls fail for 60-90s on first run (no cache) - Add fetchUsersMu/fetchChannelsMu mutexes to serialize fetchAndStore* calls, preventing race between ForceRefresh and background refresh - Use atomic file writes (temp + os.Rename) to prevent corrupt cache files on crash - Guard against empty API results overwriting valid cache - Guard against empty cache files being treated as valid data - Fix typo: TestRefreshingFlagPreventsConucrrentRefreshes → Concurrent Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Fix infinite loop on cold start when API returns zero results: return error instead of nil when no existing cache is available, so the watcher calls Fatal rather than spinning IsReady() forever - Secure temp file handling: use os.CreateTemp for unpredictable names (prevents symlink attacks) and clean up temp files on any failure - Restrict cache file permissions from 0644 to 0600 and cache directory from 0755 to 0700 (cache contains user PII) - Extract atomicWriteFile helper to deduplicate temp+rename pattern - Fix stale docstring: getCacheTTL default is 24h not 1h Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

flacoste · 2026-03-24T00:50:13Z

Addressed review feedback in two commits:

Commit 1 (cc20f47): Direct fixes from @chriscoey's review

Restored IsReady() polling loop before ServeStdio() — fixes cold-start regression where tool calls fail for 60-90s on first run
Added fetchUsersMu/fetchChannelsMu sync.Mutex to serialize fetchAndStore* calls — prevents race between ForceRefresh and background SWR refresh on both os.WriteFile and snapshot.Store
Atomic file writes (temp file + os.Rename) for crash safety
Empty-result guards: API returning zero users/channels no longer overwrites valid cache; empty cache files treated as cache miss instead of valid data
Fixed typo: TestRefreshingFlagPreventsConucrrentRefreshes → Concurrent

Commit 2 (db9c161): Fixes from self-review

Fixed infinite loop when API returns zero results on cold start — now returns an error (so Fatal fires) instead of nil with ready=false
Secured temp file handling: os.CreateTemp for unpredictable filenames (prevents symlink attacks), cleanup on any failure
Restricted cache file permissions from 0644 to 0600 and cache directory from 0755 to 0700 (cache contains user PII)
Fixed stale docstring on getCacheTTL (said "1 hour", should be "24 hours")

chriscoey reviewed Mar 23, 2026

View reviewed changes

flacoste and others added 2 commits March 23, 2026 20:36

jhutar mentioned this pull request Mar 25, 2026

Enable sending DM to user I have not chatted before #253

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: stale-while-revalidate cache for large Slack workspaces#225

feat: stale-while-revalidate cache for large Slack workspaces#225
flacoste wants to merge 3 commits intokorotovsky:masterfrom
flacoste:feat/stale-while-revalidate-cache

flacoste commented Mar 6, 2026

Uh oh!

chriscoey left a comment

Uh oh!

flacoste commented Mar 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

flacoste commented Mar 6, 2026

Summary

Changes

pkg/provider/api.go

cmd/slack-mcp-server/main.go

docs/03-configuration-and-usage.md

pkg/provider/cache_test.go

Design decisions

Test plan

Post-Deploy Monitoring & Validation

Uh oh!

chriscoey left a comment

Choose a reason for hiding this comment

Uh oh!

flacoste commented Mar 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants