Token-Based Rate Limiting¶
AI services bill on a per-token basis, so uncontrolled usage can lead to unexpected costs. The AI Workspace provides two ways to apply token-based rate limits:
| Approach | Configured Via | Applies To |
|---|---|---|
| Built-in Rate Limiting | Rate Limiting tab on LLM Providers | All proxies using that provider |
| Token Based Rate Limit Policy | Guardrails tab on LLM Providers or Proxies | The specific provider or proxy it is attached to |
Built-in Rate Limiting¶
Built-in rate limiting is configured through the Rate Limiting tab on an LLM Provider and supports limits on both request count and token count.
Scopes¶
| Scope | Controls | Status |
|---|---|---|
| Backend | Requests from the gateway to the upstream LLM provider | Available |
| Per Consumer | Requests from client applications to the gateway | Coming soon |
Per Consumer Rate Limiting — Coming Soon
Only the Backend scope is currently configurable.
Configuration Modes¶
- Provider-wide — A single limit across all API endpoints.
- Per Resource — Individual limits per API endpoint.
Note
Provider-wide and Per Resource modes are mutually exclusive. Clear existing limits before switching.
Limit Criteria¶
| Criterion | Description |
|---|---|
| Request Count | Maximum requests within the time window |
| Token Count | Maximum tokens (prompt + completion) within the time window |
| Cost | (Coming soon) |
For each criterion, set a Quota and a Reset Duration (second, minute, hour).
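The quota-and-reset-duration model above behaves like a fixed-window counter: counts accumulate during the window and reset when the duration elapses. The following is an illustrative sketch of that logic, not the gateway's actual implementation (which may, for example, use distributed counters):

```python
import time

class FixedWindowLimiter:
    """Illustrative fixed-window limiter for request and token quotas."""

    def __init__(self, request_quota, token_quota, reset_duration_s):
        self.request_quota = request_quota
        self.token_quota = token_quota
        self.reset_duration_s = reset_duration_s
        self.window_start = time.monotonic()
        self.requests = 0
        self.tokens = 0

    def allow(self, tokens_used):
        """Return True if the request fits in the current window."""
        now = time.monotonic()
        if now - self.window_start >= self.reset_duration_s:
            # Reset Duration elapsed: start a fresh window.
            self.window_start = now
            self.requests = 0
            self.tokens = 0
        if (self.requests + 1 > self.request_quota
                or self.tokens + tokens_used > self.token_quota):
            return False  # over quota; caller would respond with 429
        self.requests += 1
        self.tokens += tokens_used
        return True
```

For example, with a token quota of 500 per window, a request consuming 300 tokens is admitted, but a second 300-token request in the same window is rejected.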
Configure¶
- Navigate to AI Workspace > LLM Providers.
- Click on the provider name and go to the Rate Limiting tab.
- Select Provider-wide or Per Resource under the Backend section.
- Enable Request Count and/or Token Count, set the Quota and Reset Duration.
- For Per Resource: expand Limit per Resource for defaults, then expand individual resources to override.
- Click Save and redeploy the provider.
Token Based Rate Limit Policy¶
A policy attached via the Guardrails tab. It tracks prompt tokens, completion tokens, and total tokens separately, giving finer-grained control than the built-in Rate Limiting tab, and can be applied to both LLM Providers and LLM Proxies.
Configuration Parameters¶
Configure limits for any combination of the three token categories. At least one must be specified.
| Parameter | Required | Description |
|---|---|---|
| Prompt Token Count | Conditional | Maximum prompt tokens allowed within the duration. |
| Completion Token Count | Conditional | Maximum completion tokens allowed within the duration. |
| Total Token Count | Conditional | Maximum combined prompt + completion tokens within the duration. |
| Duration | Yes | Time window for the limit (e.g., 60s, 1m, 1h). |
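The rules in the table (at least one token limit set, and a Duration written with an s/m/h suffix such as 60s, 1m, or 1h) can be sketched as a small validation helper. This is a hypothetical illustration of the constraints, not part of the product's API:

```python
import re

# Duration strings use a numeric value plus an s/m/h suffix, e.g. "60s".
_DURATION = re.compile(r"^(\d+)([smh])$")
_SECONDS = {"s": 1, "m": 60, "h": 3600}

def validate_policy(prompt=None, completion=None, total=None, duration="1m"):
    """Check the parameter rules and return the window length in seconds."""
    if prompt is None and completion is None and total is None:
        # The three token limits are conditionally required:
        # at least one of them must be specified.
        raise ValueError("At least one token limit must be specified")
    match = _DURATION.match(duration)
    if not match:
        raise ValueError(f"Invalid duration: {duration!r}")
    return int(match.group(1)) * _SECONDS[match.group(2)]
```

For example, a policy with only a Total Token Count and a Duration of 1h is valid (a 3600-second window), while a policy with a Duration but no token limits is rejected.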
Add This Policy¶
- Navigate to AI Workspace > LLM Providers or LLM Proxies.
- Click on the provider or proxy name and go to the Guardrails tab.
- Click + Add Guardrail and select Token Based Rate Limit.
- Configure limits for prompt tokens, completion tokens, total tokens, or any combination.
- Click Add (for providers) or Submit (for proxies).
- Redeploy the provider or proxy to apply the changes.
Behavior¶
- Token usage is read from the LLM provider's response (usage.prompt_tokens, usage.completion_tokens, usage.total_tokens).
- When any limit is exceeded, subsequent requests receive 429 Too Many Requests.
- When multiple limit rules are configured, the most restrictive limit is enforced.
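The behavior above can be sketched as follows: usage fields from the provider's response are accumulated per window, and once any configured limit is exceeded, further requests get a 429. The function and state shapes here are illustrative assumptions; only the usage field names come from the provider response format described above:

```python
def check_limits(state, usage, limits):
    """Accumulate token usage and report the resulting status code.

    state  -- running totals, e.g. {"prompt": 0, "completion": 0, "total": 0}
    usage  -- the provider response's usage object,
              e.g. {"prompt_tokens": 10, "completion_tokens": 5, "total_tokens": 15}
    limits -- configured limits, e.g. {"total": 1000}; unset categories
              are unlimited
    """
    state["prompt"] += usage.get("prompt_tokens", 0)
    state["completion"] += usage.get("completion_tokens", 0)
    state["total"] += usage.get("total_tokens", 0)
    # Most restrictive rule wins: exceeding any single configured
    # limit is enough to block subsequent requests.
    exceeded = any(state[key] > limit for key, limit in limits.items())
    return 429 if exceeded else 200
```

For example, with a total-token limit of 100, a response reporting 80 total tokens leaves the window open (200), while the next response pushing the running total to 120 triggers 429 for subsequent requests.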
Related¶
- Manage LLM Provider — Rate Limiting tab reference
- Rate Limit - Basic — Simple request count-based rate limiting
- Policy Hub — Full policy specification and latest version