core-agent-ide/codex-rs/state
Charley Cunningham 7f3dbaeb25
state: enforce 10 MiB log caps for thread and threadless process logs (#12038)
## Summary
- enforce a 10 MiB cap per `thread_id` in state log storage
- enforce a 10 MiB cap per `process_uuid` for threadless (`thread_id IS
NULL`) logs
- scope pruning to only keys affected by the current insert batch
- add a cheap per-key `SUM(...)` precheck so windowed prune queries only
run for keys that are currently over the cap
- add SQLite indexes used by the pruning queries
- add focused runtime tests covering both pruning behaviors

## Why
This keeps log growth bounded by the intended partition semantics while
preserving a small, readable implementation localized to the existing
insert path.

## Local Latency Snapshot (No Truncation-Pressure Run)
Collected from session `019c734f-1d16-7002-9e00-c966c9fbbcae` using
local-only (uncommitted) instrumentation, while not specifically
benchmarking the truncation-heavy regime.

### Percentiles By Query (ms)
| query | count | p50 | p90 | p95 | p99 | max |
|---|---:|---:|---:|---:|---:|---:|
| `insert_logs.insert_batch` | 110 | 0.332 | 0.999 | 1.811 | 2.978 |
3.493 |
| `insert_logs.precheck.process` | 106 | 0.074 | 0.152 | 0.206 | 0.258 |
0.426 |
| `insert_logs.precheck.thread` | 73 | 0.118 | 0.206 | 0.253 | 1.025 |
1.025 |
| `insert_logs.prune.process` | 58 | 0.291 | 0.576 | 0.607 | 1.088 |
1.088 |
| `insert_logs.prune.thread` | 44 | 0.318 | 0.467 | 0.728 | 0.797 |
0.797 |
| `insert_logs.prune_total` | 110 | 0.488 | 0.976 | 1.237 | 1.593 |
1.684 |
| `insert_logs.total` | 110 | 1.315 | 2.889 | 3.623 | 5.739 | 5.961 |
| `insert_logs.tx_begin` | 110 | 0.133 | 0.235 | 0.282 | 0.412 | 0.546 |
| `insert_logs.tx_commit` | 110 | 0.259 | 0.689 | 0.772 | 1.065 | 1.080
|

### `insert_logs.total` Histogram (ms)
| bucket | count |
|---|---:|
| `<= 0.100` | 0 |
| `<= 0.250` | 0 |
| `<= 0.500` | 7 |
| `<= 1.000` | 33 |
| `<= 2.000` | 40 |
| `<= 5.000` | 28 |
| `<= 10.000` | 2 |
| `<= 20.000` | 0 |
| `<= 50.000` | 0 |
| `<= 100.000` | 0 |
| `> 100.000` | 0 |

## Local Latency Snapshot (Truncation-Heavy / Cap-Hit Regime)
Collected from a run where cap-hit behavior was frequent (`135/180`
insert calls), using local-only (uncommitted) instrumentation and a
temporary local cap of `10_000` bytes for stress testing (not the merged
`10 MiB` cap).

### Percentiles By Query (ms)
| query | count | p50 | p90 | p95 | p99 | max |
|---|---:|---:|---:|---:|---:|---:|
| `insert_logs.insert_batch` | 180 | 0.524 | 1.645 | 2.163 | 3.424 |
3.777 |
| `insert_logs.precheck.process` | 171 | 0.086 | 0.235 | 0.373 | 0.758 |
1.147 |
| `insert_logs.precheck.thread` | 100 | 0.105 | 0.251 | 0.291 | 1.176 |
1.622 |
| `insert_logs.prune.process` | 109 | 0.386 | 0.839 | 1.146 | 1.548 |
2.588 |
| `insert_logs.prune.thread` | 56 | 0.253 | 0.550 | 1.148 | 2.484 |
2.484 |
| `insert_logs.prune_total` | 180 | 0.511 | 1.221 | 1.695 | 4.548 |
5.512 |
| `insert_logs.total` | 180 | 1.631 | 3.902 | 5.103 | 8.901 | 9.095 |
| `insert_logs.total_cap_hit` | 135 | 1.876 | 4.501 | 5.547 | 8.902 |
9.096 |
| `insert_logs.total_no_cap_hit` | 45 | 0.520 | 1.700 | 2.079 | 3.294 |
3.294 |
| `insert_logs.tx_begin` | 180 | 0.109 | 0.253 | 0.287 | 1.088 | 1.406 |
| `insert_logs.tx_commit` | 180 | 0.267 | 0.813 | 1.170 | 2.497 | 2.574
|

### `insert_logs.total` Histogram (ms)
| bucket | count |
|---|---:|
| `<= 0.100` | 0 |
| `<= 0.250` | 0 |
| `<= 0.500` | 16 |
| `<= 1.000` | 39 |
| `<= 2.000` | 60 |
| `<= 5.000` | 54 |
| `<= 10.000` | 11 |
| `<= 20.000` | 0 |
| `<= 50.000` | 0 |
| `<= 100.000` | 0 |
| `> 100.000` | 0 |

### `insert_logs.total` Histogram When Cap Was Hit (ms)
| bucket | count |
|---|---:|
| `<= 0.100` | 0 |
| `<= 0.250` | 0 |
| `<= 0.500` | 0 |
| `<= 1.000` | 22 |
| `<= 2.000` | 51 |
| `<= 5.000` | 51 |
| `<= 10.000` | 11 |
| `<= 20.000` | 0 |
| `<= 50.000` | 0 |
| `<= 100.000` | 0 |
| `> 100.000` | 0 |

### Performance Takeaways
- Even in a cap-hit-heavy run (`75%` cap-hit calls), `insert_logs.total`
stays sub-10ms at p99 (`8.901ms`) and max (`9.095ms`).
- Calls that did **not** hit the cap are materially cheaper
(`insert_logs.total_no_cap_hit` p95 `2.079ms`) than cap-hit calls
(`insert_logs.total_cap_hit` p95 `5.547ms`).
- Compared to the earlier non-truncation-pressure run, overall
`insert_logs.total` rose from p95 `3.623ms` to p95 `5.103ms`
(+`1.48ms`), indicating bounded overhead when pruning is active.
- This truncation-heavy run used an intentionally low local cap for
stress testing; with the real 10 MiB cap, cap-hit frequency should be
much lower in normal sessions.

## Testing
- `just fmt` (in `codex-rs`)
- `cargo test -p codex-state` (in `codex-rs`)
2026-02-18 17:08:08 -08:00
..
migrations state: enforce 10 MiB log caps for thread and threadless process logs (#12038) 2026-02-18 17:08:08 -08:00
src state: enforce 10 MiB log caps for thread and threadless process logs (#12038) 2026-02-18 17:08:08 -08:00
BUILD.bazel feat: sqlite 1 (#10004) 2026-01-28 15:29:14 +01:00
Cargo.toml feat: drop sqlx logging (#10398) 2026-02-02 19:26:58 +00:00