# Implement websocket rate limiting #731
## Conversation
## Code Review: WebSocket Rate Limiting Implementation

This is a well-designed and thorough implementation of WebSocket rate limiting. The code demonstrates excellent architectural decisions, strong security practices, and comprehensive test coverage. Below is my detailed review.

### ✅ Strengths

1. Excellent Architecture & Design
2. Security Best Practices
3. Code Quality
4. Test Coverage
5. Excellent Changelog
### 🔍 Issues Found

#### CRITICAL: Race Condition in Cache Operations

Location:

Issue: The check-then-increment sequence is not atomic:

```python
# Line 164: Get current count
current = cache.get(full_key, 0)
is_limited = current >= max_count

if increment and not is_limited:
    if current == 0:
        cache.set(full_key, 1, period_seconds)  # Line 172
    else:
        try:
            cache.incr(full_key)  # Line 176 - RACE CONDITION
        except ValueError:
            cache.set(full_key, 1, period_seconds)
```

Problem: Between lines 164 and 176, multiple concurrent requests can read the same count and each pass the limit check before any of them writes an updated value.
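To make the interleaving concrete, here is a small illustrative simulation (a plain dict stands in for the cache backend, and all names are made up for the sketch) showing how a non-atomic read-modify-write loses updates under concurrency:

```python
import threading

# Illustrative only: a plain dict plays the role of the cache, with the same
# non-atomic read-then-write shape as the get/set code above.
store = {}

def naive_incr(key: str) -> None:
    current = store.get(key, 0)  # read
    # another thread may read the same value right here
    store[key] = current + 1     # write: the later writer silently wins

def worker() -> None:
    for _ in range(10_000):
        naive_incr("requests")

threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# With atomic increments this would be exactly 80_000; lost updates can leave
# it lower, which is exactly how extra requests slip past the limit check.
print(store["requests"] <= 80_000)  # True
```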
Impact: Under high concurrency, more requests could be allowed through than the configured limit.

Recommended Fix: Use atomic cache operations. Most Django cache backends support `add()` and `incr()`:

```python
# Atomic increment with TTL
try:
    # Atomically create the key with value 0 and the period TTL
    # (add() only succeeds if the key doesn't already exist)
    cache.add(full_key, 0, period_seconds)
    # incr() is atomic on backends that support it natively (e.g. Redis, Memcached)
    current = cache.incr(full_key)
except ValueError:
    # Key expired between add and incr, retry
    cache.set(full_key, 1, period_seconds)
    current = 1

is_limited = current > max_count
```

Alternatively, use Redis's atomic operations directly.

#### MEDIUM: DRY Violation - Period Name Mapping

Locations: Multiple files have identical period name mapping:
Issue: The period mapping dictionary is duplicated 8+ times:

```python
period_name = {1: "second", 60: "minute", 3600: "hour", 86400: "day"}.get(period, "period")
```

Recommended Fix: Extract to a shared utility function:

```python
def get_period_name(period_seconds: int) -> str:
    """Convert period in seconds to human-readable name."""
    return {
        1: "second",
        60: "minute",
        3600: "hour",
        86400: "day",
    }.get(period_seconds, "period")
```

Then import and use it everywhere the mapping appears.

#### MEDIUM: Inconsistent Import of RateLimitedConsumerMixin

Location:

Issue:

Questions:
Recommended Action: Either:

#### LOW: Missing Return Type Annotation

Location:

Issue: Line 289 returns a string built from arithmetic whose operand types are not enforced:

```python
return f"{count * 5}/{period}"  # period is int, but type not enforced
```

Recommended Fix: Ensure

#### LOW: Potential for Confusion in Test Coverage

Location:

Issue: The test's mock rate limiter branches on the group name:

```python
if "ws_connect" in group or call_count < 3:
```

Concern: String matching on the group name is fragile and obscures what the test actually checks.

Recommended Improvement: Make the test more explicit about what it's checking:

```python
# Allow connection checks and first 2 message checks
if group == "ws_connect":
    return False, {...}  # Always allow connections
elif call_count < 3:
    return False, {...}  # Allow first 2 messages
else:
    return True, {...}   # Rate limit subsequent messages
```

### 🎯 Suggestions for Future Enhancements

These are not blockers for this PR, but nice-to-haves for future work:
### 📊 Summary

Recommendation: Approve with requested changes

The race condition in cache operations should be fixed before merging, as it could allow rate limit bypass under high concurrency. The DRY violation is a medium-priority cleanup item. Otherwise, this is excellent work that significantly improves the security posture of the WebSocket infrastructure. Great job on this implementation! 🎉

### 🔧 Required Changes Before Merge

### 🧹 Optional Cleanup
---
## Code Review: WebSocket Rate Limiting Implementation

This is a well-architected PR that successfully implements comprehensive rate limiting for WebSocket connections. The code quality is high and follows the repository's conventions. Here's my detailed review:

### ✅ Strengths

#### 1. Excellent Architecture & Code Reuse

#### 2. Two-Level Rate Limiting

The implementation correctly applies rate limiting at both levels:

This is the right approach and prevents both connection flooding and message flooding.

#### 3. Comprehensive Test Coverage

The test suite (

#### 4. Proper Async Handling

The

#### 5. Consistent User Tier Multipliers

Superusers get 10x limits (not 5x as mentioned in the PR description), authenticated users get 2x, and usage-capped users get 0.5x, consistent with the GraphQL tier multipliers.

#### 6. CHANGELOG Documentation

The CHANGELOG entry is detailed and follows the project's format, documenting file locations and line numbers.

### 🔍 Issues & Recommendations

#### 1. Race Condition in Cache Increment
---
## Code Review - WebSocket Rate Limiting Implementation

### Overview

This PR successfully implements comprehensive WebSocket rate limiting that mirrors the existing GraphQL rate limiting infrastructure. The implementation is well-architected with good separation of concerns and excellent test coverage.

### ✅ Strengths

#### Architecture & Code Quality
---
## Code Review - PR #731: Implement WebSocket Rate Limiting

### Summary

This PR implements comprehensive WebSocket rate limiting that mirrors the existing GraphQL rate limiting infrastructure. The implementation is well-architected with a shared rate limiting module and appropriate separation of concerns.

### ✅ Strengths

1. Excellent Architecture & Code Reuse
2. Security Best Practices
3. WebSocket-Specific Features
4. Test Coverage
5. Documentation
### 🔍 Issues & Concerns

#### CRITICAL - Race Condition in Cache Operations

File:

The atomic increment pattern has a subtle issue:

```python
if increment:
    cache.add(full_key, 0, period_seconds)  # Returns True if created, False if exists
    try:
        current = cache.incr(full_key)
    except ValueError:
        # Key expired between add and incr (rare race condition)
        cache.set(full_key, 1, period_seconds)
        current = 1
    is_limited = current > max_count  # ❌ ISSUE: Should be >= for first request
```

Problem: Line 104 checks `current > max_count`

Impact: Users get one more request per time period than configured

Fix:

```python
# Line 104 should be:
is_limited = current > max_count  # Correct for increment=True

# But line 108 should remain:
is_limited = current >= max_count  # Correct for increment=False
```

Actually, looking more closely, the current logic IS correct:
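A quick illustrative trace (plain Python standing in for the cache calls; `max_count = 3` is an arbitrary example value) confirms that increment-then-check with `>` admits exactly `max_count` requests per window:

```python
# Walk-through: add() seeds the counter at 0, every request increments first,
# and '>' only starts limiting once the limit is fully consumed.
max_count = 3
current = 0  # the value created by cache.add(full_key, 0, period_seconds)
decisions = []
for _ in range(5):
    current += 1                           # cache.incr(full_key)
    decisions.append(current > max_count)  # True means rate limited
print(decisions)  # [False, False, False, True, True]
```

Requests 1-3 are allowed and requests 4-5 are limited, so `>` after the increment is equivalent to `>=` before it.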
RETRACTED - The logic is actually correct as-is.

#### Medium - Inconsistent Tier Multiplier in Comment

File:

PR Description claims: "Superusers get 5x the normal limits"

Actual implementation:

```python
TIER_MULTIPLIERS = {
    "superuser": 10.0,  # Superusers get 10x the base limit
    "authenticated": 2.0,
    "anonymous": 1.0,
}
```

Issue: Documentation mismatch. The code correctly implements 10x for superusers (consistent with GraphQL), but the PR description says 5x.

Fix: Update the PR description to reflect the 10x multiplier for superusers.

#### Medium - Missing Error Context in Logs

File:

```python
except Exception as e:
    fail_open = getattr(settings, "RATELIMIT_FAIL_OPEN", False)
    if fail_open:
        logger.error(f"Rate limit cache error (failing open): {e}", exc_info=True)
        return False, {"limit": 0, "remaining": 0, "reset_time": 0}
```

Issue: When the cache fails, the logs don't include the rate limit key or group, making debugging harder.

Suggestion:

```python
logger.error(
    f"Rate limit cache error (failing open): {e} - Group: {group}, Key: {key}",
    exc_info=True,
)
```

#### Low - Potential Confusion with Rate Limit Remaining

File:

```python
remaining = max(0, max_count - current)
```

Issue: When rate limited,

Clarification needed: Is this intentional? It's correct behavior, but might be worth a comment explaining that

#### Low - Magic Number in Tier Multipliers

File:

```python
assert rate == "300/m"
```

Issue: The test hardcodes the expected value instead of calculating it from constants:

```python
# Better:
base_rate_count, period = parse_rate("30/m")
expected_count = int(base_rate_count * TIER_MULTIPLIERS["superuser"])
assert rate == f"{expected_count}/{period}"
```

This makes tests resilient to changes in tier multipliers.

### 🎯 Suggestions for Improvement

#### 1. Add Circuit Breaker Pattern (Future Enhancement)

For production resilience, consider adding a circuit breaker for cache failures:

```python
# If cache has failed N times in M seconds, skip rate limiting entirely
# to prevent cascade failures
```

This is especially important if using Redis and it becomes unavailable.

#### 2. Add Metrics/Monitoring Hooks

Consider adding hooks for metrics collection:

```python
# In cache.py:119-123
if is_limited:
    logger.warning(...)
    # ADD: Signal for monitoring (e.g., increment rate_limit_exceeded counter)
    # This helps track abuse patterns
```

#### 3. Consider Adding Rate Limit Headers to WebSocket Messages

GraphQL sets

```python
await self.send_standard_message(
    msg_type="RESPONSE",
    content=result,
    data={
        "rate_limit": {
            "limit": info.get("limit"),
            "remaining": info.get("remaining"),
            "reset_after": info.get("reset_time"),
        }
    },
)
```

This helps clients implement proper backoff strategies.

#### 4. Document Rate Limit Keys in Admin Interface

Consider adding admin commands to:
```shell
# Example:
python manage.py rate_limit_status --user 42
python manage.py rate_limit_reset --ip 192.168.1.1
```

### 📝 Minor Nitpicks

#### 1. Unused Import in Tests

File:

```python
from config.ratelimits.tiers import TIER_MULTIPLIERS, get_tier_multiplier
```
#### 2. Inconsistent Naming
#### 3. Test Timeout Values

File:

```python
response = await communicator.receive_from(timeout=5)
```

A 5-second timeout might be too long for unit tests. Consider reducing it to 1-2 seconds.

### ✅ Security Review

No security vulnerabilities found. The implementation:
### 📊 Performance Considerations

Positive:

Potential concerns:

Recommendation: Monitor cache performance in production. If the cache becomes a bottleneck, consider:

### 🎓 Alignment with CLAUDE.md

✅ Follows all baseline commit rules:

✅ Follows critical concepts:

✅ Proper documentation:

### 🚀 Recommendation

APPROVE with minor suggestions

This is a well-architected PR that successfully implements WebSocket rate limiting with proper security, testing, and documentation. The refactoring to create a shared rate limiting infrastructure is a significant improvement.

Before merging:

Post-merge:

Great work! 🎉

---
### Codecov Report

❌ Patch coverage is
---
Implements per-user and per-IP rate limiting for WebSocket connections, mirroring the existing GraphQL rate limiting infrastructure.

Changes:
- Add new rate limiting module (config/websocket/ratelimits.py) with check_rate_limit functions, WebSocketRateLimits config class, and RateLimitedConsumerMixin
- Add rate limiting middleware for connection-level limits that rejects excessive connection attempts with close code 4029
- Add message-level rate limiting to all WebSocket consumers: UnifiedAgentConsumer, DocumentQueryConsumer, CorpusQueryConsumer, StandaloneDocumentQueryConsumer, and ThreadUpdatesConsumer
- Add configurable rate limit settings via environment variables
- Integrate rate limiting middleware into ASGI stack after authentication
- Add comprehensive test suite for rate limiting utilities and integration

Rate limits:
- Authenticated users: 30 connections/min, 60 messages/min, 20 AI queries/min
- Anonymous users: 10 connections/min, 20 messages/min, 5 AI queries/min
- Superusers get 5x the normal limits

Closes #730
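The connection-level rejection described above can be sketched in bare ASGI terms. Close code 4029 comes from the PR; the middleware class name and the `is_limited` hook below are illustrative assumptions, not the PR's actual API:

```python
import asyncio

class WebsocketRateLimitMiddleware:
    """Sketch: reject over-limit websocket connections before they reach the app."""

    def __init__(self, app, is_limited):
        self.app = app
        self.is_limited = is_limited  # callable(scope) -> bool; assumed for the sketch

    async def __call__(self, scope, receive, send):
        if scope["type"] == "websocket" and self.is_limited(scope):
            await receive()  # consume the pending "websocket.connect" event
            await send({"type": "websocket.close", "code": 4029})
            return
        await self.app(scope, receive, send)

# Tiny demonstration with stub callables:
events = []

async def inner_app(scope, receive, send):
    events.append("accepted")

async def fake_receive():
    return {"type": "websocket.connect"}

async def fake_send(message):
    events.append(message)

mw = WebsocketRateLimitMiddleware(inner_app, is_limited=lambda scope: True)
asyncio.run(mw({"type": "websocket"}, fake_receive, fake_send))
print(events)  # [{'type': 'websocket.close', 'code': 4029}]
```

Because the middleware short-circuits before calling the inner app, a rate-limited client never triggers consumer setup.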
… module
Extract common rate limiting infrastructure into config/ratelimits/:
- core.py: parse_rate(), format_rate_limit_message(), period constants
- config.py: unified RateLimits singleton for both GraphQL and WebSocket
- tiers.py: user tier multiplier logic (superuser 10x, auth 2x, anon 1x)
- ip.py: IP extraction utilities for both request and scope contexts
- cache.py: cache-based rate limiting for WebSocket consumers
Key improvements:
- Eliminates ~400 lines of duplicate code
- Standardizes superuser multiplier to 10x (was 5x for WebSocket)
- Single source of truth for rate limit configuration
- Maintains full backward compatibility with existing decorators
All existing usages continue to work unchanged:
- @graphql_ratelimit(rate=RateLimits.WRITE_MEDIUM)
- @graphql_ratelimit_dynamic(get_rate=get_user_tier_rate("READ_MEDIUM"))
- WebSocketRateLimits.get_rate_for_user()
- Remove unused Tuple import from cache.py and core.py
- Update type hints to use built-in tuple[] syntax
- Fix import ordering in __init__.py (isort)
- Fix line formatting (black)
Changes:
- Fix race condition in cache-based rate limiting by using atomic cache.add() + cache.incr() operations instead of get-then-set
- Fix test failure by adding @override_settings(RATELIMIT_DISABLE=False) since test settings disable rate limiting by default
- Update CHANGELOG.md to correctly state superusers get 10x (not 5x) rate limit multiplier, consistent with GraphQL tier
- Remove default values from environment variable loading in config/settings/ratelimit.py; config/ratelimits/config.py is now the single source of truth for default values
- Replace inline period_name dictionaries in all 5 consumers with shared period_to_name() function for consistency
- Extract magic number 123 (WebSocket close reason max bytes) to WS_CLOSE_REASON_MAX_BYTES constant in config/ratelimits/core.py
- Add missing test coverage for:
  - User tier multipliers (anonymous, authenticated, superuser, usage-capped)
  - Cache failure behavior (RATELIMIT_FAIL_OPEN vs closed)
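For context on the magic number 123 extracted above: RFC 6455 caps a close frame's payload at 125 bytes, 2 of which carry the status code, leaving 123 bytes of UTF-8 for the reason text. A minimal sketch (the helper name is assumed; only the constant name comes from the commit):

```python
# WS_CLOSE_REASON_MAX_BYTES comes from the commit above; truncate_close_reason
# is an illustrative helper, not the repository's actual function.
WS_CLOSE_REASON_MAX_BYTES = 123

def truncate_close_reason(reason: str) -> str:
    """Trim a close reason to the RFC 6455 byte budget without splitting a character."""
    encoded = reason.encode("utf-8")[:WS_CLOSE_REASON_MAX_BYTES]
    # errors="ignore" drops a trailing partially-cut multibyte character
    return encoded.decode("utf-8", errors="ignore")

print(len(truncate_close_reason("rate limited: " + "x" * 200).encode("utf-8")))  # 123
```

Truncating on bytes rather than characters matters because a multibyte character cut at byte 123 would otherwise produce invalid UTF-8 in the close frame.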
…miting

Add documentation note to get_user_tier_rate() explaining that the function works correctly for both authenticated and anonymous users, with anonymous users receiving a 1x multiplier.
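The tier logic described across these commits (superuser 10x, authenticated 2x, anonymous 1x) can be sketched as follows. `parse_rate` and `TIER_MULTIPLIERS` are names the commits mention; their bodies here are illustrative assumptions, not the repository's implementation:

```python
# Illustrative sketch of tier-multiplied rates; values match the commit notes.
TIER_MULTIPLIERS = {"superuser": 10.0, "authenticated": 2.0, "anonymous": 1.0}

def parse_rate(rate: str) -> tuple[int, str]:
    """Split a rate string like '30/m' into (30, 'm')."""
    count, period = rate.split("/")
    return int(count), period

def rate_for_tier(base_rate: str, tier: str) -> str:
    """Scale a base rate by the user's tier multiplier, keeping the period."""
    count, period = parse_rate(base_rate)
    return f"{int(count * TIER_MULTIPLIERS[tier])}/{period}"

print(rate_for_tier("30/m", "superuser"))      # 300/m
print(rate_for_tier("30/m", "authenticated"))  # 60/m
print(rate_for_tier("30/m", "anonymous"))      # 30/m
```

This also shows why the reviewers suggested computing expected test values from the constants: a test asserting `"300/m"` directly would break if the superuser multiplier ever changed.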
82d78e6 to b70c0fd
## Code Review: WebSocket Rate Limiting (PR #731)

Overall, this is an excellent implementation that brings robust rate limiting to WebSocket connections. The code is well-architected, thoroughly tested, and follows the project's conventions. Here's my detailed review:

### ✅ Strengths

#### 1. Architecture & Code Organization

#### 2. Security
#### 3. Race Condition Handling

The atomic cache operations in cache.py handle concurrent increments safely:

```python
# 1. Atomically create key if it doesn't exist
cache.add(full_key, 0, period_seconds)
# 2. Atomically increment
current = cache.incr(full_key)
# 3. Check AFTER incrementing
is_limited = current > max_count
```

However, there's a minor edge case (see Issues section below).

#### 4. Test Coverage
#### 5. Documentation
---