Clear lastErr at the start of each Generate/Chat call so that Err()
reflects the most recent call, not a stale cancellation from a prior one.
Add two integration tests:
- GracefulShutdown: cancel mid-stream then generate again on the same
model, verifying the server survives cancellation.
- ConcurrentRequests: three goroutines calling Generate() simultaneously,
verifying no panics or deadlocks (llama-server serialises via slots).
Co-Authored-By: Virgil <virgil@lethean.io>
rocmModel implements inference.TextModel with Generate() and Chat()
methods that delegate to the llamacpp HTTP client, mapping go-inference
types to llama-server's OpenAI-compatible API. Tokens stream via
iter.Seq[inference.Token] with mutex-protected error propagation.
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>