Brings rocmModel into compliance with the updated inference.TextModel
interface from go-inference.
- Classify: simulates a prefill-only pass via max_tokens=1, temperature=0
- BatchGenerate: sequential autoregressive generation per prompt via
/v1/completions (both sketched after this list)
- Info: populates ModelInfo from GGUF metadata (architecture, layers, quant)
- Metrics: captures timing + VRAM usage via sysfs after each operation (sysfs sketch below)
- Refactors duplicate server-exit error handling into setServerExitErr()
- Adds timing instrumentation to existing Generate and Chat methods
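
A hedged sketch of how Classify and BatchGenerate might drive llama-server's
/v1/completions endpoint. Only max_tokens=1, temperature=0 and the endpoint
path come from this change; the request struct, helper names and response
handling below are illustrative assumptions.

```go
package rocm

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// completionRequest mirrors the subset of the /v1/completions payload used here.
type completionRequest struct {
	Prompt      string  `json:"prompt"`
	MaxTokens   int     `json:"max_tokens"`
	Temperature float64 `json:"temperature"`
}

// complete posts one completion request and returns the raw response body.
func complete(ctx context.Context, baseURL string, r completionRequest) ([]byte, error) {
	payload, err := json.Marshal(r)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequestWithContext(ctx, http.MethodPost,
		baseURL+"/v1/completions", bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("completion failed: %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}

// classify simulates a prefill-only pass: one deterministic token.
func classify(ctx context.Context, baseURL, prompt string) ([]byte, error) {
	return complete(ctx, baseURL, completionRequest{
		Prompt:      prompt,
		MaxTokens:   1, // prefill plus a single token
		Temperature: 0, // deterministic
	})
}

// batchGenerate runs the prompts sequentially, one completion per prompt.
func batchGenerate(ctx context.Context, baseURL string, prompts []string, maxTokens int) ([][]byte, error) {
	out := make([][]byte, 0, len(prompts))
	for _, p := range prompts {
		body, err := complete(ctx, baseURL, completionRequest{Prompt: p, MaxTokens: maxTokens})
		if err != nil {
			return nil, err
		}
		out = append(out, body)
	}
	return out, nil
}
```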
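
The VRAM figure for Metrics can be sampled from the amdgpu sysfs interface; a
minimal sketch, assuming card0 is the device backing the model (the helper
name and the card selection are assumptions):

```go
package rocm

import (
	"os"
	"strconv"
	"strings"
)

// vramUsedBytes returns the VRAM currently in use as reported by the
// amdgpu driver for card0.
func vramUsedBytes() (uint64, error) {
	raw, err := os.ReadFile("/sys/class/drm/card0/device/mem_info_vram_used")
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(raw)), 10, 64)
}
```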
Co-Authored-By: Virgil <virgil@lethean.io>
Clear lastErr at the start of each Generate/Chat call so that Err()
reflects the most recent call, not a stale cancellation from a prior one.
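
A minimal sketch of the reset, assuming lastErr is guarded by a mutex on
rocmModel (field and helper names are illustrative):

```go
package rocm

import "sync"

// Subset of rocmModel relevant to error propagation.
type rocmModel struct {
	mu      sync.Mutex
	lastErr error
}

// resetErr runs at the top of Generate and Chat so that Err() describes
// only the call in progress, not an earlier cancellation.
func (m *rocmModel) resetErr() {
	m.mu.Lock()
	m.lastErr = nil
	m.mu.Unlock()
}

// Err reports how the most recent Generate/Chat call ended.
func (m *rocmModel) Err() error {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.lastErr
}
```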
Add two integration tests:
- GracefulShutdown: cancel mid-stream then generate again on the same
model, verifying the server survives cancellation (rough shape below).
- ConcurrentRequests: three goroutines calling Generate() simultaneously,
verifying no panics or deadlocks (llama-server serialises via slots).
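
Rough shape of the GracefulShutdown test, assuming a Generate method that
returns iter.Seq[inference.Token], an Err() accessor, and a hypothetical
loadTestModel helper that boots llama-server for the test:

```go
package rocm_test

import (
	"context"
	"testing"
)

func TestGracefulShutdown(t *testing.T) {
	m := loadTestModel(t) // hypothetical helper, not part of this change

	// Cancel mid-stream: take one token, then cancel the context.
	ctx, cancel := context.WithCancel(context.Background())
	for range m.Generate(ctx, "count slowly to one hundred") {
		cancel()
		break
	}

	// The same model must still answer a fresh request afterwards.
	got := 0
	for range m.Generate(context.Background(), "say hello") {
		got++
	}
	if err := m.Err(); err != nil {
		t.Fatalf("Generate after cancellation: %v", err)
	}
	if got == 0 {
		t.Fatal("expected tokens from the post-cancellation Generate")
	}
}
```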
Co-Authored-By: Virgil <virgil@lethean.io>
rocmModel implements inference.TextModel with Generate() and Chat()
methods that delegate to the llamacpp HTTP client, mapping go-inference
types to llama-server's OpenAI-compatible API. Tokens are streamed via
iter.Seq[inference.Token], with errors propagated under a mutex (sketched below).
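
The streaming shape, sketched under the assumption that errors are stashed
on the model rather than yielded; Token and streamCompletions below are
stand-ins for the go-inference type and the llamacpp client, not their
real definitions:

```go
package rocm

import (
	"context"
	"iter"
	"sync"
)

// Token stands in for inference.Token.
type Token struct{ Text string }

type rocmModel struct {
	mu      sync.Mutex
	lastErr error
}

// setErr records a terminal error for a later Err() call.
func (m *rocmModel) setErr(err error) {
	m.mu.Lock()
	m.lastErr = err
	m.mu.Unlock()
}

// streamCompletions stands in for the llamacpp HTTP client; the real code
// posts to /v1/completions with streaming enabled and yields each chunk.
func streamCompletions(ctx context.Context, prompt string) iter.Seq2[string, error] {
	return func(yield func(string, error) bool) { _, _ = ctx, prompt }
}

// Generate streams tokens to the caller; errors never escape the iterator
// directly but are stored under the mutex for Err().
func (m *rocmModel) Generate(ctx context.Context, prompt string) iter.Seq[Token] {
	return func(yield func(Token) bool) {
		for text, err := range streamCompletions(ctx, prompt) {
			if err != nil {
				m.setErr(err)
				return
			}
			if !yield(Token{Text: text}) {
				return // consumer stopped early, e.g. cancelled context
			}
		}
	}
}
```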
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>