go-ratelimit/docs/architecture.md
Snider cde6443e4c docs: graduate TODO/FINDINGS into production documentation
Replace internal task tracking (TODO.md, FINDINGS.md) with structured
documentation in docs/. Trim CLAUDE.md to agent instructions only.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-20 15:01:55 +00:00


# Architecture
go-ratelimit is a provider-agnostic rate limiter for LLM API calls. It enforces
three independent quota dimensions per model — requests per minute (RPM), tokens
per minute (TPM), and requests per day (RPD) — using an in-memory sliding window
that can be persisted across process restarts via YAML or SQLite.
Module path: `forge.lthn.ai/core/go-ratelimit`
---
## Sliding Window Algorithm
The limiter maintains per-model `UsageStats` structs in memory:
```go
type UsageStats struct {
	Requests []time.Time  // timestamps of recent requests (1-minute window)
	Tokens   []TokenEntry // token counts with timestamps (1-minute window)
	DayStart time.Time    // when the current daily window started
	DayCount int          // total requests recorded since DayStart
}
```
Every call to `CanSend()` or `Stats()` first calls `prune()`, which scans both
slices and discards entries older than `now - 1 minute`. Pruning is done
in-place to avoid allocation on the hot path:
```go
validReqs := 0
for _, t := range stats.Requests {
	if t.After(window) {
		stats.Requests[validReqs] = t
		validReqs++
	}
}
stats.Requests = stats.Requests[:validReqs]
```
The same loop runs for token entries. After pruning, `CanSend()` checks each
quota dimension in priority order: RPD first (cheapest check), then RPM, then
TPM. A zero value for any dimension means that dimension is unlimited. If all
three are zero the model is treated as fully unlimited and the check short-circuits
before touching any state.
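
The check order described above can be sketched as follows. This is a minimal illustration, not the library's code: `canSend`, the `TokenEntry` field names, and the exact comparison operators (`>=` for request counts, `>` for the projected token total) are assumptions made for the example, and the stats are assumed to have been pruned already.

```go
package main

import (
	"fmt"
	"time"
)

// TokenEntry's fields are assumed for this sketch.
type TokenEntry struct {
	Time  time.Time
	Count int
}

// UsageStats is a trimmed copy of the struct shown earlier.
type UsageStats struct {
	Requests []time.Time
	Tokens   []TokenEntry
	DayCount int
}

type ModelQuota struct {
	MaxRPM, MaxTPM, MaxRPD int // 0 = unlimited
}

// canSend is a hypothetical stand-in for CanSend(); it expects pruned stats.
func canSend(q ModelQuota, s *UsageStats, estimatedTokens int) bool {
	// Fully unlimited models short-circuit before any state is read.
	if q.MaxRPM == 0 && q.MaxTPM == 0 && q.MaxRPD == 0 {
		return true
	}
	// RPD: a single integer comparison, so it is checked first.
	if q.MaxRPD > 0 && s.DayCount >= q.MaxRPD {
		return false
	}
	// RPM: the length of the pruned request slice.
	if q.MaxRPM > 0 && len(s.Requests) >= q.MaxRPM {
		return false
	}
	// TPM: sum the window, then add the pending request's estimate.
	if q.MaxTPM > 0 {
		used := 0
		for _, t := range s.Tokens {
			used += t.Count
		}
		if used+estimatedTokens > q.MaxTPM {
			return false
		}
	}
	return true
}

// demoCanSend exercises the unlimited path and both sides of the TPM limit.
func demoCanSend() (unlimited, underTPM, overTPM bool) {
	s := &UsageStats{Tokens: []TokenEntry{{Time: time.Now(), Count: 900}}}
	unlimited = canSend(ModelQuota{}, s, 1_000_000)
	underTPM = canSend(ModelQuota{MaxTPM: 1000}, s, 100)
	overTPM = canSend(ModelQuota{MaxTPM: 1000}, s, 101)
	return
}

func main() {
	fmt.Println(demoCanSend())
}
```

Ordering the checks from cheapest to most expensive means a model that has exhausted its daily quota is rejected without iterating over the token slice.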
### Daily Reset
The daily counter resets automatically inside `prune()`. When
`now - stats.DayStart >= 24h`, `DayCount` is set to zero and `DayStart` is set
to the current time. The daily window is therefore a 24-hour period anchored to
the moment the counter was last reset (typically the first call after the
previous window expired), not to a calendar boundary.
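
A small sketch of the reset rule, assuming a hypothetical `dayWindow` type and `resetIfExpired` method (the real logic lives inside `prune()`). It shows that crossing midnight does not reset the counter; only 24 elapsed hours since `DayStart` do.

```go
package main

import (
	"fmt"
	"time"
)

// dayWindow holds only the daily fields of UsageStats (hypothetical type).
type dayWindow struct {
	DayStart time.Time
	DayCount int
}

// resetIfExpired applies the rule: now - DayStart >= 24h starts a new window.
func (d *dayWindow) resetIfExpired(now time.Time) {
	if now.Sub(d.DayStart) >= 24*time.Hour {
		d.DayStart = now
		d.DayCount = 0
	}
}

// demoDailyReset returns the count just before and just at the 24-hour mark.
func demoDailyReset() (before, after int) {
	start := time.Date(2026, 2, 20, 14, 30, 0, 0, time.UTC)
	d := &dayWindow{DayStart: start, DayCount: 42}

	// 23h59m later: midnight has passed, but the window has not expired.
	d.resetIfExpired(start.Add(23*time.Hour + 59*time.Minute))
	before = d.DayCount

	// Exactly 24h later: the counter resets and the window re-anchors.
	d.resetIfExpired(start.Add(24 * time.Hour))
	after = d.DayCount
	return
}

func main() {
	before, after := demoDailyReset()
	fmt.Println(before, after)
}
```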
### Concurrency
All reads and writes are protected by a single `sync.RWMutex`. Methods that
write state — `CanSend()`, `RecordUsage()`, `Reset()`, `Load()` — acquire a
full write lock. `Persist()`, `Stats()`, and `AllStats()` acquire a read lock
where possible. The `CanSend()` method acquires a write lock because it calls
`prune()`, which mutates the state slices.
`go test -race ./...` passes clean with 20 goroutines performing concurrent
`CanSend()`, `RecordUsage()`, and `Stats()` calls.
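
The locking discipline can be illustrated with a reduced model. The `limiter` type and its `record` and `count` methods below are hypothetical; they only show the pattern of a write lock for mutating calls and a read lock for read-only ones, exercised by 20 goroutines.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// limiter is a hypothetical, reduced stand-in for the real RateLimiter.
type limiter struct {
	mu       sync.RWMutex
	requests map[string][]time.Time
}

// record mutates state, so it takes the full write lock.
func (l *limiter) record(model string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.requests[model] = append(l.requests[model], time.Now())
}

// count only reads, so a shared read lock is sufficient.
func (l *limiter) count(model string) int {
	l.mu.RLock()
	defer l.mu.RUnlock()
	return len(l.requests[model])
}

// demoConcurrent runs workers goroutines that interleave writes and reads.
func demoConcurrent(workers, perWorker int) int {
	l := &limiter{requests: make(map[string][]time.Time)}
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < perWorker; j++ {
				l.record("model-a")
				_ = l.count("model-a")
			}
		}()
	}
	wg.Wait()
	return l.count("model-a")
}

func main() {
	fmt.Println(demoConcurrent(20, 50))
}
```

Any method that prunes must use `Lock()` rather than `RLock()`, because rewriting a slice under a shared lock is a data race even when the caller only wants to read.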
---
## Provider and Quota Configuration
### Types
```go
type Provider string // "gemini", "openai", "anthropic", "local"

type ModelQuota struct {
	MaxRPM int `yaml:"max_rpm"` // 0 = unlimited
	MaxTPM int `yaml:"max_tpm"`
	MaxRPD int `yaml:"max_rpd"`
}

type Config struct {
	FilePath  string                // default: ~/.core/ratelimits.yaml
	Backend   string                // "yaml" (default) or "sqlite"
	Quotas    map[string]ModelQuota // explicit per-model overrides
	Providers []Provider            // provider profiles to load
}
```
### Quota Resolution
1. Provider profiles are loaded first (from `DefaultProfiles()`).
2. Explicit `Config.Quotas` are merged on top, overriding any matching model.
3. If neither `Providers` nor `Quotas` are specified, Gemini defaults are used.
`SetQuota()` and `AddProvider()` allow runtime modification; both are
mutex-protected. `AddProvider()` is additive — it does not remove existing
quotas for models outside the new provider's profile.
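
The resolution order can be sketched as a two-pass map merge. Everything here is illustrative: `resolveQuotas` and the `profiles` variable are hypothetical stand-ins for the library's internals and `DefaultProfiles()`, and the sketch assumes an override replaces the whole `ModelQuota` rather than merging field by field.

```go
package main

import "fmt"

type Provider string

type ModelQuota struct {
	MaxRPM, MaxTPM, MaxRPD int // 0 = unlimited
}

// profiles is a hypothetical, abbreviated stand-in for DefaultProfiles().
var profiles = map[Provider]map[string]ModelQuota{
	"gemini":    {"gemini-2.5-pro": {MaxRPM: 150, MaxTPM: 1_000_000, MaxRPD: 1000}},
	"anthropic": {"claude-opus-4": {MaxRPM: 50, MaxTPM: 40_000}},
}

// resolveQuotas applies the three rules: profiles, then overrides, with a
// Gemini fallback when nothing at all is configured.
func resolveQuotas(providers []Provider, overrides map[string]ModelQuota) map[string]ModelQuota {
	if len(providers) == 0 && len(overrides) == 0 {
		providers = []Provider{"gemini"}
	}
	out := make(map[string]ModelQuota)
	for _, p := range providers {
		for model, q := range profiles[p] {
			out[model] = q
		}
	}
	// Explicit quotas win over any profile entry for the same model.
	for model, q := range overrides {
		out[model] = q
	}
	return out
}

func main() {
	q := resolveQuotas(
		[]Provider{"anthropic"},
		map[string]ModelQuota{"claude-opus-4": {MaxRPM: 10}},
	)
	fmt.Println(q["claude-opus-4"].MaxRPM)
	fmt.Println(len(resolveQuotas(nil, nil)))
}
```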
### Default Quotas (as of February 2026)
| Provider | Model | MaxRPM | MaxTPM | MaxRPD |
|-----------|------------------------|-----------|-----------|-----------|
| Gemini | gemini-3-pro-preview | 150 | 1,000,000 | 1,000 |
| Gemini | gemini-3-flash-preview | 150 | 1,000,000 | 1,000 |
| Gemini | gemini-2.5-pro | 150 | 1,000,000 | 1,000 |
| Gemini | gemini-2.0-flash | 150 | 1,000,000 | unlimited |
| Gemini | gemini-2.0-flash-lite | unlimited | unlimited | unlimited |
| OpenAI | gpt-4o, gpt-4-turbo | 500 | 30,000 | unlimited |
| OpenAI | gpt-4o-mini, o1-mini | 500 | 200,000 | unlimited |
| OpenAI | o1, o3-mini | 500 | varies | unlimited |
| Anthropic | claude-opus-4 | 50 | 40,000 | unlimited |
| Anthropic | claude-sonnet-4 | 50 | 40,000 | unlimited |
| Anthropic | claude-haiku-3.5 | 50 | 50,000 | unlimited |
| Local     | (none by default)      | user-defined | user-defined | user-defined |
The Local provider exists for local inference backends (Ollama, MLX, llama.cpp)
where the throttle limit is hardware rather than an API quota. No defaults are
provided; callers add per-model limits via `Config.Quotas` or `SetQuota()`.
---
## YAML Persistence (Legacy)
The default backend serialises the entire `RateLimiter` struct — both the
`Quotas` map and the `State` map — to a YAML file at `~/.core/ratelimits.yaml`.
```yaml
quotas:
  gemini-3-pro-preview:
    max_rpm: 150
    max_tpm: 1000000
    max_rpd: 1000
state:
  gemini-3-pro-preview:
    requests:
      - 2026-02-20T14:32:01.123456789Z
    tokens:
      - time: 2026-02-20T14:32:01.123456789Z
        count: 1500
    day_start: 2026-02-20T00:00:00Z
    day_count: 42
```
`Persist()` creates parent directories with `os.MkdirAll` before writing.
`Load()` treats a missing file as an empty state (no error). Corrupt or
unreadable files return an error.
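
The file-handling behaviour can be sketched with the standard library alone. `loadState` and `saveState` are hypothetical helpers that move raw bytes instead of decoding YAML, and the permission bits are assumptions; the point is the missing-file check and the directory creation.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// loadState returns (nil, nil) for a missing file and an error otherwise.
func loadState(path string) ([]byte, error) {
	data, err := os.ReadFile(path)
	if errors.Is(err, fs.ErrNotExist) {
		return nil, nil // missing file: start from empty state
	}
	if err != nil {
		return nil, fmt.Errorf("read %s: %w", path, err)
	}
	return data, nil
}

// saveState creates parent directories before writing the file.
func saveState(path string, data []byte) error {
	if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o644)
}

// demoLoadSave loads a missing file, saves into a new directory, reloads.
func demoLoadSave() (bool, string, error) {
	dir, err := os.MkdirTemp("", "ratelimit-sketch")
	if err != nil {
		return false, "", err
	}
	defer os.RemoveAll(dir)

	path := filepath.Join(dir, "nested", "ratelimits.yaml")
	data, err := loadState(path) // the file does not exist yet
	if err != nil {
		return false, "", err
	}
	missingOK := data == nil

	if err := saveState(path, []byte("quotas: {}\n")); err != nil {
		return missingOK, "", err
	}
	data, err = loadState(path)
	return missingOK, string(data), err
}

func main() {
	fmt.Println(demoLoadSave())
}
```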
**Limitations of YAML backend:**
- Single-process only. Concurrent writes from multiple processes can corrupt the
  file because the write is not atomic at the OS level.
- The entire state is serialised on every `Persist()` call, so the cost grows
  linearly with the number of tracked models and entries.
- Timestamps are serialised as RFC 3339 strings; nanosecond precision is
  preserved by Go's time marshaller, but round-tripping it depends on the YAML
  library.
---
## SQLite Backend
The SQLite backend was added in Phase 2 to support multi-process scenarios and
provide a more robust persistence layer. It uses `modernc.org/sqlite` — a pure
Go port of SQLite that compiles without CGO.
### Connection Settings
```go
db.SetMaxOpenConns(1) // single connection for PRAGMA consistency
db.Exec("PRAGMA journal_mode=WAL") // WAL mode for concurrent readers
db.Exec("PRAGMA busy_timeout=5000") // 5-second busy timeout
```
WAL mode allows one writer and multiple concurrent readers. The 5-second busy
timeout prevents immediate failure when a second process is mid-commit. A single
`sql.DB` connection is used because SQLite's WAL mode handles reader concurrency
at the file level; additional Go connections to the same file from within one
process would not add throughput but would complicate locking.
### Schema
```sql
CREATE TABLE IF NOT EXISTS quotas (
    model   TEXT PRIMARY KEY,
    max_rpm INTEGER NOT NULL DEFAULT 0,
    max_tpm INTEGER NOT NULL DEFAULT 0,
    max_rpd INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE IF NOT EXISTS requests (
    model TEXT NOT NULL,
    ts    INTEGER NOT NULL -- UnixNano
);
CREATE TABLE IF NOT EXISTS tokens (
    model TEXT NOT NULL,
    ts    INTEGER NOT NULL, -- UnixNano
    count INTEGER NOT NULL
);
CREATE TABLE IF NOT EXISTS daily (
    model     TEXT PRIMARY KEY,
    day_start INTEGER NOT NULL, -- UnixNano
    day_count INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX IF NOT EXISTS idx_requests_model_ts ON requests(model, ts);
CREATE INDEX IF NOT EXISTS idx_tokens_model_ts ON tokens(model, ts);
```
Timestamps are stored as `INTEGER` UnixNano values. This preserves nanosecond
precision without relying on SQLite's text date format, and allows efficient
range queries using the composite indices.
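
The timestamp encoding can be checked with a short round-trip. `toColumn`, `fromColumn`, and `windowStart` are hypothetical helper names for this sketch; the conversions themselves are standard `time` package calls.

```go
package main

import (
	"fmt"
	"time"
)

// toColumn converts a timestamp to the INTEGER value stored in the ts column.
func toColumn(t time.Time) int64 { return t.UnixNano() }

// fromColumn restores a time.Time; the original location is not stored,
// so this sketch normalises to UTC.
func fromColumn(ns int64) time.Time { return time.Unix(0, ns).UTC() }

// windowStart is the lower bound for a "last minute" range query on ts.
func windowStart(now time.Time) int64 { return now.Add(-time.Minute).UnixNano() }

// demoRoundTrip reports whether a nanosecond-precision value survives storage.
func demoRoundTrip() bool {
	t := time.Date(2026, 2, 20, 14, 32, 1, 123456789, time.UTC)
	return fromColumn(toColumn(t)).Equal(t)
}

func main() {
	now := time.Date(2026, 2, 20, 14, 32, 1, 0, time.UTC)
	fmt.Println(demoRoundTrip())
	fmt.Println(toColumn(now) - windowStart(now)) // one minute in nanoseconds
}
```

Because the values are plain integers, a window query reduces to a comparison such as `ts > ?` on the `(model, ts)` index, with the bound computed in Go.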
### Save Strategy
`saveState()` uses a delete-then-insert pattern inside a single transaction.
All three state tables are truncated and rewritten atomically:
```go
tx.Exec("DELETE FROM requests")
tx.Exec("DELETE FROM tokens")
tx.Exec("DELETE FROM daily")
// then INSERT for every model in state
tx.Commit()
```
`saveQuotas()` uses `INSERT ... ON CONFLICT(model) DO UPDATE` (upsert) so
existing quota rows are updated in place without deleting unrelated models.
### Constructors
```go
// YAML backend (default)
rl, err := ratelimit.New()
rl, err := ratelimit.NewWithConfig(cfg)
// SQLite backend
rl, err := ratelimit.NewWithSQLite(dbPath)
rl, err := ratelimit.NewWithSQLiteConfig(dbPath, cfg)
defer rl.Close() // releases the database connection
```
`Close()` is a no-op on YAML-backed limiters.
---
## Migration Path
`MigrateYAMLToSQLite(yamlPath, sqlitePath string) error` reads an existing YAML
state file and writes all quotas and usage state to a new SQLite database. The
function is idempotent — running it again on the same YAML file overwrites the
SQLite database state.
Typical one-time migration:
```go
err := ratelimit.MigrateYAMLToSQLite(
	filepath.Join(home, ".core", "ratelimits.yaml"),
	filepath.Join(home, ".core", "ratelimits.db"),
)
```
After migration, switch the constructor:
```go
// Before
rl, _ := ratelimit.New()
// After
rl, _ := ratelimit.NewWithSQLite(filepath.Join(home, ".core", "ratelimits.db"))
defer rl.Close()
```
The YAML file can be kept as a backup; the two backends do not share state.
---
## CountTokens
`CountTokens(apiKey, model, text string) (int, error)` calls the Google
Generative Language API to obtain an exact token count for a prompt string. It
is Gemini-specific and hardcodes the `generativelanguage.googleapis.com`
endpoint. The URL is not configurable, which prevents unit testing of the
success path without network access.
For other providers, callers must supply `estimatedTokens` directly to
`CanSend()` and `RecordUsage()`. Accurate token counts are typically available
in API response metadata after a call completes.
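
When no exact count is available before the call, a caller might fall back to a rough heuristic. The sketch below assumes roughly four characters per token, a common rule of thumb for English text that is not part of this library and varies by tokenizer and language; `estimateTokens` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// charsPerToken is an assumed average, not a value taken from any provider.
const charsPerToken = 4

// estimateTokens returns a rounded-up character-based estimate.
func estimateTokens(text string) int {
	n := utf8.RuneCountInString(text)
	return (n + charsPerToken - 1) / charsPerToken
}

func main() {
	prompt := "Summarise the following document in three bullet points."
	fmt.Println(estimateTokens(prompt))
}
```

Such an estimate is only suitable for the pre-flight `CanSend()` check; the exact count from the response metadata is the better value to record afterwards.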