Feature Request: Implement a Rate Limiting Plugin Based on the Request Control Framework #1912

@googs1025

Hello maintainers

What would you like to be added:

I'd like to propose implementing a rate limiting plugin leveraging the existing Request Control framework. This plugin would integrate as an AdmissionPlugin (and potentially other hook points) to enforce configurable request rate limits per user, model, IP address, or other dimensions.

Goals

  • Prevent abuse or overuse of inference resources.
  • Support multi-tenancy with fair resource allocation.
  • Provide flexible, dynamic rate limit configuration (e.g., different quotas for free vs. premium users).
  • Integrate cleanly with the current plugin architecture (PreRequest, AdmissionPlugin, etc.).

Suggested Design

  • Use a token bucket algorithm for smooth rate limiting.
  • Extract rate-limiting keys from request metadata (e.g., x-user-id header, target model name).
  • Allow runtime updates to rate limit rules (e.g., via config map or API).
  • Return an appropriate error code (e.g., HTTP 429 Too Many Requests) when a limit is exceeded.

Integration Point

The plugin would be registered via requestcontrol.NewConfig().WithAdmissionPlugins(...) and invoked during the admission phase before scheduling.

Why is this needed:

Example use cases this would enable:

  • Limit free-tier users to 10 requests/minute.
  • Enforce stricter limits on expensive models (e.g., gpt-4).
  • Protect backend pods from traffic spikes.

Metadata

Labels: needs-triage (indicates an issue or PR lacks a `triage/foo` label and requires one)