Feature Request: Implement a Rate Limiting Plugin Based on the Request Control Framework #1912

Hello maintainers,

**What would you like to be added**:

I'd like to propose a rate limiting plugin built on the existing Request Control framework. The plugin would be implemented as an `AdmissionPlugin` (and potentially at other hook points) and would enforce configurable request rate limits per user, model, IP address, or other dimension.

#### Goals
- Prevent abuse or overuse of inference resources.
- Support multi-tenancy with fair resource allocation.
- Provide flexible, dynamic rate limit configuration (e.g., different quotas for free vs. premium users).
- Integrate cleanly with the current plugin architecture (`PreRequest`, `AdmissionPlugin`, etc.).

#### Suggested Design
- Use a token bucket algorithm for smooth rate limiting.
- Extract rate-limiting keys from request metadata (e.g., the `x-user-id` header or the target model name).
- Allow runtime updates to rate limit rules (e.g., via config map or API).
- Return appropriate error codes (e.g., 429 Too Many Requests) when limits are exceeded.

#### Integration Point
The plugin would be registered via `requestcontrol.NewConfig().WithAdmissionPlugins(...)` and invoked during the admission phase before scheduling.


**Why is this needed**:

Rate limiting at the admission phase would enable use cases such as:

- Limiting free-tier users to 10 requests/minute.
- Enforcing stricter limits on expensive models (e.g., gpt-4).
- Protecting backend pods from traffic spikes.
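
The use cases listed here could be expressed as a small rule table. The sketch below is one possible shape, not a proposed API: `Rule`, `Rules`, and `Resolve` are hypothetical names, and every limit except the 10 requests/minute free-tier figure is a made-up placeholder. It assumes a tier rule overrides the default and a per-model rule applies only when it is stricter.

```go
package main

import (
	"fmt"
	"time"
)

// Rule allows Requests requests per window Per.
type Rule struct {
	Requests int
	Per      time.Duration
}

// RatePerSecond converts the rule into a token bucket refill rate.
func (r Rule) RatePerSecond() float64 {
	return float64(r.Requests) / r.Per.Seconds()
}

// Rules is a hypothetical rule set keyed by user tier and model name.
type Rules struct {
	Default Rule
	ByTier  map[string]Rule
	ByModel map[string]Rule
}

// Resolve returns the tier rule (or the default when the tier is unknown),
// tightened by the model rule when that one is stricter.
func (rs Rules) Resolve(tier, model string) Rule {
	rule := rs.Default
	if r, ok := rs.ByTier[tier]; ok {
		rule = r
	}
	if r, ok := rs.ByModel[model]; ok && r.RatePerSecond() < rule.RatePerSecond() {
		rule = r
	}
	return rule
}

// exampleRules returns placeholder limits; only the free-tier value
// comes from the use cases in this issue.
func exampleRules() Rules {
	return Rules{
		Default: Rule{Requests: 30, Per: time.Minute}, // placeholder
		ByTier: map[string]Rule{
			"free":    {Requests: 10, Per: time.Minute},
			"premium": {Requests: 100, Per: time.Minute}, // placeholder
		},
		ByModel: map[string]Rule{
			"gpt-4": {Requests: 5, Per: time.Minute}, // placeholder
		},
	}
}

func main() {
	rs := exampleRules()
	for _, c := range [][2]string{{"free", "small-model"}, {"premium", "gpt-4"}} {
		r := rs.Resolve(c[0], c[1])
		fmt.Printf("tier=%s model=%s -> %d requests per %s\n", c[0], c[1], r.Requests, r.Per)
	}
}
```

Keeping rules as plain data like this would make it straightforward to reload them at runtime from a config map or an API.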
