BenchmarkDecode, BenchmarkTTFT, BenchmarkConcurrent across 3 models.
Uses testing.B with b.ReportMetric for Go benchstat integration.
Co-Authored-By: Virgil <virgil@lethean.io>
Integration test verifies model discovery on real GGUF files.
All 9 models in /data/lem/gguf/ discovered with correct metadata.
Co-Authored-By: Virgil <virgil@lethean.io>
Replaces filename-based guessModelType with GGUF header parsing.
Caps default context at 4096 to prevent VRAM exhaustion on models
with 128K+ native context.
Co-Authored-By: Virgil <virgil@lethean.io>
Reads GGUF v2/v3 binary headers to extract model metadata (architecture,
name, quantisation type, context length, block count). Includes string
length limits for malformed input protection and uint64 value support
for compatibility with varied GGUF producers.
Co-Authored-By: Virgil <virgil@lethean.io>
Clear lastErr at the start of each Generate/Chat call so that Err()
reflects the most recent call, not a stale cancellation from a prior one.
Add two integration tests:
- GracefulShutdown: cancel mid-stream then generate again on the same
model, verifying the server survives cancellation.
- ConcurrentRequests: three goroutines calling Generate() simultaneously,
verifying no panics or deadlocks (llama-server serialises via slots).
Co-Authored-By: Virgil <virgil@lethean.io>
LoadModel → Generate → Chat → Close on real AMD GPU hardware.
Build-tagged //go:build rocm so normal go test skips them.
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace stub backend with real implementation: Available() checks
/dev/kfd and llama-server presence, LoadModel() wires up server
lifecycle to return a rocmModel. Add guessModelType() for architecture
detection from GGUF filenames (handles hyphenated variants like
Llama-3). Add TestAvailable and TestGuessModelType.
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
rocmModel implements inference.TextModel with Generate() and Chat()
methods that delegate to the llamacpp HTTP client, mapping go-inference
types to llama-server's OpenAI-compatible API. Token streaming via
iter.Seq[inference.Token] with mutex-protected error propagation.
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds server.go with the process lifecycle layer that manages spawning
llama-server, waiting for readiness, and graceful shutdown. Includes
three helper functions (findLlamaServer, freePort, serverEnv) and the
full startServer/waitReady/stop lifecycle. The serverEnv function
critically filters HIP_VISIBLE_DEVICES to mask the Ryzen 9 iGPU
which crashes llama-server if not excluded.
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use sync.Once to ensure resp.Body is closed exactly once, preventing
TCP connection leaks when the iterator is never consumed and
double-close when iterated twice. Also adds Accept: text/event-stream
header to both SSE endpoints.
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add TestComplete_Streaming (multi-chunk SSE with three tokens) and
TestComplete_HTTPError (400 status propagation) to exercise the
Complete() method alongside the existing chat tests.
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add ChatComplete() and Complete() methods to the llamacpp Client,
backed by a shared parseSSE() line parser. Types include ChatMessage,
ChatRequest, CompletionRequest and their chunked response structs.
Tests cover multi-chunk streaming, empty responses, HTTP errors, and
context cancellation — all using httptest SSE servers.
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add internal/llamacpp package with Client type and Health() method.
Client communicates with llama-server via HTTP; Health checks the
/health endpoint and reports readiness. Foundation type for the
streaming methods (Tasks 2-3).
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Token.ID = 0 acceptable for Phase 1 (no consumer uses it)
- StopTokens: ignore in Phase 1 (YAGNI)
- serverEnv() should filter existing HIP_VISIBLE_DEVICES before appending
- guessModelType() fine for now, upgrade to /props endpoint in Phase 2
- Integration test build tag approach approved
Charon, 19 Feb 2026
Co-Authored-By: Virgil <virgil@lethean.io>