## Align with the new phase-1 design
Basically we know run phase 1 in parallel by considering:
* Max 64 rollouts
* Max 1 month old
* Consider the most recent first
This PR also adds stronger parallelization capabilities by detecting
stale jobs, retry policies, ownership of computation to prevent double
computations etc etc
Summary:
- read conversation summaries and cwd info from the state DB when
possible so we no longer rely on rollout files for metadata and avoid
extra I/O
- persist CLI version in thread metadata, surface it through summary
builders, and add the necessary DB migration hooks
- simplify thread listing by using enriched state DB data directly
rather than reading rollout heads
Testing:
- Not run (not requested)
## Summary
This PR makes SQLite rollout backfill resumable and repeatable instead
of one-shot-on-db-create.
## What changed
- Added a persisted backfill state table:
- state/migrations/0008_backfill_state.sql
- Tracks status (pending|running|complete), last_watermark, and
last_success_at.
- Added backfill state model/types in codex-state:
- BackfillState, BackfillStatus (state/src/model/backfill_state.rs)
- Added runtime APIs to manage backfill lifecycle/progress:
- get_backfill_state
- mark_backfill_running
- checkpoint_backfill
- mark_backfill_complete
- Updated core startup behavior:
- Backfill now runs whenever state is not Complete (not only when DB
file is newly created).
- Reworked backfill execution:
- Collect rollout files, derive deterministic watermark per path, sort,
resume from last_watermark.
- Process in batches (BACKFILL_BATCH_SIZE = 200), checkpoint after each
batch.
- Mark complete with last_success_at at the end.
## Why
Previous behavior could leave users permanently partially backfilled if
the process exited during initial async backfill. This change allows
safe continuation across restarts and avoids restarting from scratch.
- Schema: thread_id (PK, FK to threads.id with cascade delete),
trace_summary, memory_summary, updated_at.
- Migration: creates the table and an index on (updated_at DESC,
thread_id DESC) for efficient recent-first reads.
- Runtime API (DB-only):
- `get_thread_memory(thread_id)`: fetch one memory row.
- `upsert_thread_memory(thread_id, trace_summary, memory_summary)`:
insert/update by thread id and always advance updated_at.
- `get_last_n_thread_memories_for_cwd(cwd, n)`: join thread_memory with
threads and return newest n rows for an exact cwd match.
- Model layer: introduced ThreadMemory and row conversion types to keep
query decoding typed and consistent with existing state models.
Persist thread_dynamic_tools in sqlite and read first from it. Fall back
to rollout files if it's not found. Persist dynamic tools to both sqlite
and rollout files.
Saw that new sessions get populated to db correctly & old sessions get
backfilled correctly at startup:
```
celia@com-92114 codex-rs % sqlite3 ~/.codex/state.sqlite \ "select thread_id, position,name,description,input_schema from thread_dynamic_tools;"
019c0cad-ec0d-74b2-a787-e8b33a349117|0|geo_lookup|lookup a city|{"properties":{"city":{"type":"string"}},"required":["city"],"type":"object"}
....
019c10ca-aa4b-7620-ae40-c0919fbd7ea7|0|geo_lookup|lookup a city|{"properties":{"city":{"type":"string"}},"required":["city"],"type":"object"}
```
Add a `.sqlite` database to be used to store rollout metatdata (and
later logs)
This PR is phase 1:
* Add the database and the required infrastructure
* Add a backfill of the database
* Persist the newly created rollout both in files and in the DB
* When we need to get metadata or a rollout, consider the `JSONL` as the
source of truth but compare the results with the DB and show any errors