Token-Based Rate Limit¶

The Token-Based Rate Limit policy enforces rate limits based on token consumption rather than request count. Limits can be applied to prompt (input) tokens, completion (output) tokens, or total tokens — independently or in combination.

Configuration Parameters¶

At least one of the three limit categories must be configured. Any combination is valid.

Parameter	Required	Description
Prompt Token Limits	Conditional	Rate limits applied to prompt (input) tokens.
Completion Token Limits	Conditional	Rate limits applied to completion (output) tokens.
Total Token Limits	Conditional	Rate limits applied to combined prompt + completion tokens.

Each entry within a category requires:

Field	Required	Description
count	Yes	Maximum tokens allowed within the duration. Minimum: `1`.
duration	Yes	Time window as a Go duration string (e.g., `"1m"`, `"1h"`, `"24h"`).

When multiple limits are configured within a category, the most restrictive limit is enforced.

Add This Policy¶

Navigate to AI Workspace > LLM Providers or LLM Proxies.
Click on the provider or proxy name.
Go to the Guardrails tab.
Click + Add Guardrail and select Token Based Rate Limit from the sidebar.
Configure limits for prompt tokens, completion tokens, total tokens, or any combination.
Click Add (for providers) or Submit (for proxies).
Deploy the provider or proxy to apply the changes.

Behavior¶

Token counts are extracted from the LLM provider's response using paths defined in the provider template.
When any configured limit is exceeded, the gateway returns 429 Too Many Requests.
Each counter resets after its configured duration elapses.
Rate limit headers are included in responses:

Header	Description
`X-RateLimit-Limit`	Configured token limit
`X-RateLimit-Remaining`	Remaining tokens in the current window
`X-RateLimit-Reset`	Time (epoch seconds) when the window resets
`RateLimit-*`	IETF equivalents of the above

Example: Cap Total Tokens Per Minute¶

Block requests once 100,000 total tokens have been consumed in a 1-minute window.

Parameter	Value
Total Token Limits — count	`100000`
Total Token Limits — duration	`1m`

Example: Separate Prompt and Completion Limits¶

Limit prompt tokens to 50,000 per hour and completion tokens to 20,000 per hour independently.

Category	count	duration
Prompt Token Limits	`50000`	`1h`
Completion Token Limits	`20000`	`1h`

LLM Cost-Based Rate Limit — Enforce monetary spending budgets instead of token counts
Rate Limit - Basic — Simple request count-based rate limiting
Policy Hub — Full policy specification and latest version