generated from kubernetes/kubernetes-template-project
Open
Labels
needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.
Description
Hello maintainers
What would you like to be added:
I'd like to propose implementing a rate limiting plugin leveraging the existing Request Control framework. This plugin would integrate as an AdmissionPlugin (and potentially other hook points) to enforce configurable request rate limits per user, model, IP address, or other dimensions.
Goals
- Prevent abuse or overuse of inference resources.
- Support multi-tenancy with fair resource allocation.
- Provide flexible, dynamic rate limit configuration (e.g., different quotas for free vs. premium users).
- Integrate cleanly with the current plugin architecture (PreRequest, AdmissionPlugin, etc.).
Suggested Design
- Use a token bucket algorithm for smooth rate limiting.
- Extract rate-limiting keys from request metadata (e.g., x-user-id header, target model name).
- Allow runtime updates to rate limit rules (e.g., via config map or API).
- Return appropriate error codes (e.g., 429 Too Many Requests) when limits are exceeded.
Integration Point
The plugin would be registered via `requestcontrol.NewConfig().WithAdmissionPlugins(...)` and invoked during the admission phase before scheduling.
Why is this needed:
- Limit free-tier users to 10 requests/minute.
- Enforce stricter limits on expensive models (e.g., gpt-4).
- Protect backend pods from traffic spikes.
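One way these use cases could be expressed as rules is sketched below. The `Rule` type, the `Resolve` function, and the strictest-rule-wins policy are assumptions for discussion. The free-tier value of 10 requests/minute comes from the list above, while the `gpt-4` value of 5 is purely illustrative.

```go
package main

// Rule is a hypothetical limit rule. An empty Tier or Model matches anything.
type Rule struct {
	Tier      string // e.g. "free" or "premium"
	Model     string // e.g. "gpt-4"
	PerMinute int    // allowed requests per minute
}

// Resolve returns the strictest (lowest) per-minute limit among all
// rules matching the given tier and model. The boolean is false when
// no rule matches, meaning the request is not rate limited.
func Resolve(rules []Rule, tier, model string) (int, bool) {
	best, found := 0, false
	for _, r := range rules {
		if r.Tier != "" && r.Tier != tier {
			continue
		}
		if r.Model != "" && r.Model != model {
			continue
		}
		if !found || r.PerMinute < best {
			best, found = r.PerMinute, true
		}
	}
	return best, found
}

// exampleRules mirrors the use cases above.
var exampleRules = []Rule{
	{Tier: "free", PerMinute: 10},  // free-tier users: 10 requests/minute
	{Model: "gpt-4", PerMinute: 5}, // illustrative value for an expensive model
}
```

Under this policy a free-tier user calling `gpt-4` gets the stricter of the two limits. Whether rules should combine this way or by explicit priority is an open design question.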