Virgil edited this page 2026-02-19 16:57:54 +00:00

Model Quotas

go-ratelimit ships with default quotas for Google Generative AI models and supports server-side token counting via the Google API.

Default Quotas

New() pre-configures quotas based on the rate limits observed on Tier 1 Google AI accounts:

Model                   RPM        TPM        RPD
gemini-3-pro-preview    150        1,000,000  1,000
gemini-3-flash-preview  150        1,000,000  1,000
gemini-2.5-pro          150        1,000,000  1,000
gemini-2.0-flash        150        1,000,000  Unlimited
gemini-2.0-flash-lite   Unlimited  Unlimited  Unlimited

In a quota, a value of 0 means unlimited for that dimension; the table shows such entries as Unlimited.
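The zero-means-unlimited convention can be sketched as a small check. This is only an illustration: withinLimit is a hypothetical helper, and whether the library compares with <= or < at the boundary is an assumption here.

```go
package main

import "fmt"

// withinLimit reports whether used+want stays within max.
// A max of 0 is treated as "no limit", following the convention above.
// Hypothetical helper for illustration; not part of go-ratelimit.
func withinLimit(used, want, max int) bool {
	if max == 0 {
		return true
	}
	return used+want <= max
}

func main() {
	fmt.Println(withinLimit(149, 1, 150)) // true: lands exactly on the cap
	fmt.Println(withinLimit(150, 1, 150)) // false: would exceed the cap
	fmt.Println(withinLimit(9999, 1, 0))  // true: 0 means unlimited
}
```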

Custom Quotas

Quotas are stored in the Quotas map and can be modified directly:

rl, err := ratelimit.New()
if err != nil {
    log.Fatalf("ratelimit init failed: %v", err)
}

// Add a custom model
rl.Quotas["my-fine-tuned-model"] = ratelimit.ModelQuota{
    MaxRPM: 60,
    MaxTPM: 500000,
    MaxRPD: 500,
}

// Remove a quota (model becomes unlimited)
delete(rl.Quotas, "gemini-2.0-flash-lite")

Unknown models (those not in the Quotas map) are always allowed through: CanSend returns true for any model without a configured quota.
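The pass-through behaviour for unknown models amounts to a map lookup that succeeds by default. The sketch below uses a local modelQuota type and a hypothetical canSendSketch function; it is deliberately simplified (it compares one request against MaxTPM only and ignores accumulated usage), so it should not be read as the library's actual CanSend logic.

```go
package main

import "fmt"

// modelQuota is a local stand-in for ratelimit.ModelQuota.
type modelQuota struct {
	MaxRPM, MaxTPM, MaxRPD int
}

// canSendSketch is a hypothetical, simplified check: a model with no
// entry in the map is always allowed; otherwise a single request is
// compared against MaxTPM (0 = unlimited).
func canSendSketch(quotas map[string]modelQuota, model string, tokens int) bool {
	q, ok := quotas[model]
	if !ok {
		return true // no configured quota: always allowed
	}
	return q.MaxTPM == 0 || tokens <= q.MaxTPM
}

func main() {
	quotas := map[string]modelQuota{
		"my-fine-tuned-model": {MaxRPM: 60, MaxTPM: 500000, MaxRPD: 500},
	}
	fmt.Println(canSendSketch(quotas, "my-fine-tuned-model", 600000)) // false
	fmt.Println(canSendSketch(quotas, "some-unknown-model", 600000))  // true

	// Deleting the entry makes the model "unknown", hence unlimited.
	delete(quotas, "my-fine-tuned-model")
	fmt.Println(canSendSketch(quotas, "my-fine-tuned-model", 600000)) // true
}
```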

Token Counting

CountTokens calls the Google Generative AI countTokens endpoint to get an exact server-side token count for a prompt:

count, err := ratelimit.CountTokens(apiKey, "gemini-2.5-pro", promptText)
if err != nil {
    log.Printf("token count failed: %v", err)
    // Fall back to an estimate
    count = len(promptText) / 4
}

if rl.CanSend("gemini-2.5-pro", count) {
    // Safe to send
}

API Details

The function sends a POST request to:

https://generativelanguage.googleapis.com/v1beta/models/{model}:countTokens

Request body:

{
  "contents": [
    {
      "parts": [
        {"text": "your prompt here"}
      ]
    }
  ]
}

Authentication is via the x-goog-api-key header. The response contains a totalTokens integer.
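For readers who want to see the request end to end, here is a minimal sketch of the call described above using only the standard library. countTokensSketch and newStubServer are hypothetical names, and the baseURL parameter is an assumption added so the example can run against a local stub server instead of the real endpoint; the library's own implementation may differ.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Request and response shapes as described above.
type part struct {
	Text string `json:"text"`
}
type content struct {
	Parts []part `json:"parts"`
}
type countRequest struct {
	Contents []content `json:"contents"`
}
type countResponse struct {
	TotalTokens int `json:"totalTokens"`
}

// countTokensSketch is a hypothetical helper, not the library's function.
// baseURL would be https://generativelanguage.googleapis.com in real use.
func countTokensSketch(c *http.Client, baseURL, apiKey, model, text string) (int, error) {
	body, err := json.Marshal(countRequest{
		Contents: []content{{Parts: []part{{Text: text}}}},
	})
	if err != nil {
		return 0, err
	}
	url := fmt.Sprintf("%s/v1beta/models/%s:countTokens", baseURL, model)
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return 0, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("x-goog-api-key", apiKey)

	resp, err := c.Do(req)
	if err != nil {
		return 0, err // network error
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return 0, fmt.Errorf("countTokens: unexpected status %d", resp.StatusCode)
	}
	var out countResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return 0, err // response JSON could not be decoded
	}
	return out.TotalTokens, nil
}

// newStubServer returns a local server that answers with a fixed count
// and rejects requests that carry no API key.
func newStubServer(total int) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("x-goog-api-key") == "" {
			http.Error(w, "missing key", http.StatusUnauthorized)
			return
		}
		json.NewEncoder(w).Encode(countResponse{TotalTokens: total})
	}))
}

func main() {
	srv := newStubServer(7)
	defer srv.Close()
	n, err := countTokensSketch(srv.Client(), srv.URL, "test-key", "gemini-2.5-pro", "your prompt here")
	fmt.Println(n, err) // 7 <nil>
}
```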

Error Handling

CountTokens returns an error if:

  • The HTTP request fails (network error)
  • The API returns a non-200 status (invalid key, model not found, quota exceeded)
  • The response JSON cannot be decoded

In production code, always keep a fallback estimate (e.g. len(text) / 4, roughly four bytes per token) for when the API is unavailable.

Quota Structure

type ModelQuota struct {
    MaxRPM int `yaml:"max_rpm"` // Requests per minute (0 = unlimited)
    MaxTPM int `yaml:"max_tpm"` // Tokens per minute (0 = unlimited)
    MaxRPD int `yaml:"max_rpd"` // Requests per day (0 = unlimited)
}

Quotas are persisted alongside usage state in ~/.core/ratelimits.yaml via the YAML tags.

See also: Home | Usage-Tracking