Commit graph

38 commits

Author SHA1 Message Date
Claude
870ee232bf
feat: benchmark suite for decode speed, TTFT, and concurrent throughput
BenchmarkDecode, BenchmarkTTFT, BenchmarkConcurrent across 3 models.
Uses testing.B with b.ReportMetric for Go benchstat integration.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 23:16:40 +00:00
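The decode benchmark shape this commit describes can be sketched as follows. Here `decodeOnce` is a hypothetical stand-in for a real model call; only the `b.ReportMetric` plumbing for benchstat reflects the commit itself.

```go
package main

import (
	"testing"
	"time"
)

// decodeOnce is a hypothetical stand-in for a real decode call against
// a loaded model; it returns the number of tokens produced.
func decodeOnce() int {
	time.Sleep(time.Millisecond) // simulate inference work
	return 32
}

// benchmarkDecode reports a custom tokens-per-second metric via
// b.ReportMetric, which benchstat picks up alongside the stock ns/op.
func benchmarkDecode(b *testing.B) {
	total := 0
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		total += decodeOnce()
	}
	b.ReportMetric(float64(total)/b.Elapsed().Seconds(), "tok/s")
}
```

Runs from two revisions can then be compared with `benchstat old.txt new.txt`, which groups by the custom unit.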
Claude
72120bb200
feat: pass --parallel N to llama-server for concurrent inference slots
Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 23:13:19 +00:00
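A minimal sketch of wiring the flag through, assuming a hypothetical `buildServerArgs` helper; `--parallel N` is llama-server's flag for allocating N concurrent inference slots, as named in the commit.

```go
package main

import "strconv"

// buildServerArgs is a hypothetical helper showing where --parallel
// would be appended to the llama-server command line. A value of 1 is
// the server default, so the flag is only passed when it matters.
func buildServerArgs(modelPath string, port, parallel int) []string {
	args := []string{
		"--model", modelPath,
		"--port", strconv.Itoa(port),
	}
	if parallel > 1 {
		args = append(args, "--parallel", strconv.Itoa(parallel))
	}
	return args
}
```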
Claude
4b6cffb9c4
docs: Phase 4 performance implementation plan
5 tasks: go-inference ParallelSlots, wire --parallel, benchmark
suite, flash attention comparison, documentation.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 23:11:30 +00:00
Claude
31bf0e8850
docs: Phase 4 performance design
Benchmark suite (testing.B), parallel slots via go-inference,
flash attention manual comparison.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 23:09:56 +00:00
Claude
d7db2d6e95
docs: Phase 3 complete — GGUF metadata, discovery, auto context
Integration test verifies model discovery on real GGUF files.
All 9 models in /data/lem/gguf/ discovered with correct metadata.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 22:24:52 +00:00
Claude
2c77f6f968
feat: use GGUF metadata for model type and context window auto-detection
Replaces filename-based guessModelType with GGUF header parsing.
Caps default context at 4096 to prevent VRAM exhaustion on models
with 128K+ native context.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 22:23:07 +00:00
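The capping logic is small enough to sketch in full; the function name is an assumption, but the 4096 cap and its rationale come from the commit: a 128K-native model would otherwise allocate a KV cache large enough to exhaust 16GB of VRAM.

```go
package main

// defaultContext caps a model's native context window at 4096 tokens.
// Models advertising 128K+ native context would otherwise exhaust VRAM
// when the KV cache is allocated at full size.
func defaultContext(native uint32) uint32 {
	const maxDefault = 4096
	if native > maxDefault {
		return maxDefault
	}
	return native
}
```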
Claude
af235653ca
feat: model discovery scanning directories for GGUF files
Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 22:21:48 +00:00
Claude
c7c9389749
feat: GGUF metadata parser for model discovery
Reads GGUF v2/v3 binary headers to extract model metadata (architecture,
name, quantisation type, context length, block count). Includes string
length limits for malformed input protection and uint64 value support
for compatibility with varied GGUF producers.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 22:20:02 +00:00
Claude
f8b091f511
docs: Phase 3 model support implementation plan
4 tasks: GGUF metadata parser, model discovery, LoadModel
enrichment, integration tests + documentation.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 22:12:31 +00:00
Claude
2454761e34
docs: Phase 3 model support design — approved
GGUF metadata parser, model discovery, LoadModel enrichment,
chat template verification.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 22:04:18 +00:00
Claude
34f02fdcd8
docs: Phase 2 complete — robustness features implemented
All 5 Phase 2 items done: crash detection, port retry,
graceful shutdown verification, VRAM monitoring, concurrent
requests. Concurrency findings documented.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 21:54:34 +00:00
Claude
a6e647c5b7
test: graceful shutdown and concurrent request integration tests
Clear lastErr at the start of each Generate/Chat call so that Err()
reflects the most recent call, not a stale cancellation from a prior one.

Add two integration tests:
- GracefulShutdown: cancel mid-stream then generate again on the same
  model, verifying the server survives cancellation.
- ConcurrentRequests: three goroutines calling Generate() simultaneously,
  verifying no panics or deadlocks (llama-server serialises via slots).

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 21:50:47 +00:00
Claude
954c57071a
fix: clamp VRAM Free to prevent uint64 underflow
Guard against transient sysfs inconsistency where used > total.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 21:48:19 +00:00
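The clamp itself is a one-liner worth spelling out, since the failure mode is subtle: `total` and `used` are read from two separate sysfs files, so a reading taken between allocations can briefly show `used > total`, and unsigned subtraction would wrap to a near-max uint64. Function name is an assumption.

```go
package main

// freeVRAM subtracts used from total, clamping at zero. The two sysfs
// reads are not atomic with respect to each other, so a transient
// used > total must not be allowed to underflow the uint64 result.
func freeVRAM(total, used uint64) uint64 {
	if used > total {
		return 0
	}
	return total - used
}
```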
Claude
501de83d3b
feat: VRAM monitoring via sysfs with dGPU auto-detection
Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 21:45:02 +00:00
Claude
b7342ec819
fix: only retry startServer on process exit, not timeout
Distinguishes retryable failures (process exited, e.g. port conflict)
from non-retryable ones (60s timeout, e.g. stuck server). Avoids
3x timeout penalty.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 21:43:06 +00:00
Claude
c50a8e9e9b
feat: retry port selection in startServer on process failure
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 21:40:05 +00:00
Claude
c07f37afe9
fix: guard nil exitErr wrapping, document concurrency invariant
Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 21:38:01 +00:00
Claude
2c4966e652
feat: detect server crash before Generate/Chat calls
Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 21:34:46 +00:00
Claude
d963cbf787
docs: Phase 2 robustness implementation plan
5 tasks: crash detection, port retry, VRAM monitoring,
graceful shutdown test, concurrent requests test.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 21:31:24 +00:00
Claude
2f743c5772
docs: Phase 2 robustness design — approved
Covers: graceful shutdown verification, port conflict retry,
server crash detection, VRAM monitoring via sysfs, concurrent
request testing.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 21:26:41 +00:00
Claude
6744a7c78f
docs: mark Phase 1 tasks complete
All 5 Phase 1 items done: GPU detection, server lifecycle,
HTTP client with SSE streaming, TextModel, integration tests.

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 21:16:11 +00:00
Claude
0e68d71c8a
test: integration tests for full ROCm inference pipeline
LoadModel → Generate → Chat → Close on real AMD GPU hardware.
Build-tagged //go:build rocm so normal go test skips them.

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 21:15:02 +00:00
Claude
1d8d65f55b
feat: Backend Available() and LoadModel() with GPU detection
Replace stub backend with real implementation: Available() checks
/dev/kfd and llama-server presence, LoadModel() wires up server
lifecycle to return a rocmModel. Add guessModelType() for architecture
detection from GGUF filenames (handles hyphenated variants like
Llama-3). Add TestAvailable and TestGuessModelType.

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 21:12:02 +00:00
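The two `Available()` checks the commit names can be sketched as below, with the paths parameterised for testability (in the real code `/dev/kfd` and the `llama-server` binary name would be fixed).

```go
package main

import (
	"os"
	"os/exec"
)

// available reports whether this machine can run ROCm inference:
// the ROCm kernel driver's /dev/kfd device node must exist, and a
// llama-server binary must be findable on PATH.
func available(kfdPath, serverBinary string) bool {
	if _, err := os.Stat(kfdPath); err != nil {
		return false
	}
	_, err := exec.LookPath(serverBinary)
	return err == nil
}
```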
Claude
a8c494771d
feat: TextModel implementation wrapping llama-server
rocmModel implements inference.TextModel with Generate() and Chat()
methods that delegate to the llamacpp HTTP client, mapping go-inference
types to llama-server's OpenAI-compatible API. Token streaming via
iter.Seq[inference.Token] with mutex-protected error propagation.

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 21:11:55 +00:00
Claude
9aa7f624ba
feat: server lifecycle and helpers for llama-server subprocess
Adds server.go with the process lifecycle layer that manages spawning
llama-server, waiting for readiness, and graceful shutdown. Includes
three helper functions (findLlamaServer, freePort, serverEnv) and the
full startServer/waitReady/stop lifecycle. The serverEnv function
critically filters HIP_VISIBLE_DEVICES to mask the Ryzen 9 iGPU
which crashes llama-server if not excluded.

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 21:08:07 +00:00
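The serverEnv filtering can be sketched as below: any inherited `HIP_VISIBLE_DEVICES` entries are dropped before appending the pinned value, so the subprocess sees exactly one. Pinning to device 0 as the dGPU reflects this machine's layout per the commit; the parent environment is a parameter here for testability.

```go
package main

import "strings"

// serverEnv strips any pre-existing HIP_VISIBLE_DEVICES entries from
// the parent environment before appending its own, so llama-server
// sees exactly one value. Masking the iGPU this way is critical:
// llama-server crashes if the iGPU is visible.
func serverEnv(parent []string) []string {
	env := make([]string, 0, len(parent)+1)
	for _, kv := range parent {
		if strings.HasPrefix(kv, "HIP_VISIBLE_DEVICES=") {
			continue
		}
		env = append(env, kv)
	}
	return append(env, "HIP_VISIBLE_DEVICES=0")
}
```

The freePort helper from the same commit would typically bind `127.0.0.1:0`, read back the kernel-assigned port, and close the listener before handing the port to the subprocess.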
Claude
5778f1f011
fix: guard response body lifecycle in SSE streaming client
Use sync.Once to ensure resp.Body is closed exactly once, preventing
TCP connection leaks when the iterator is never consumed and
double-close when iterated twice. Also adds Accept: text/event-stream
header to both SSE endpoints.

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 21:04:02 +00:00
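The sync.Once guard can be sketched as a small wrapper (names assumed); the counting closer below exists only to demonstrate the idempotence.

```go
package main

import (
	"io"
	"sync"
)

// bodyGuard wraps an HTTP response body so Close is idempotent: the
// iterator's cleanup path and an abandon-before-iteration path can
// both call it without double-closing, and neither leaks the TCP
// connection.
type bodyGuard struct {
	rc   io.ReadCloser
	once sync.Once
}

func (g *bodyGuard) Close() error {
	var err error
	g.once.Do(func() { err = g.rc.Close() })
	return err
}

// countingCloser is a tiny helper for demonstrating idempotence.
type countingCloser struct{ closes int }

func (c *countingCloser) Read(p []byte) (int, error) { return 0, io.EOF }
func (c *countingCloser) Close() error               { c.closes++; return nil }
```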
Claude
1bc8c9948b
test: completion streaming tests for llamacpp client
Add TestComplete_Streaming (multi-chunk SSE with three tokens) and
TestComplete_HTTPError (400 status propagation) to exercise the
Complete() method alongside the existing chat tests.

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 20:59:21 +00:00
Claude
def3167199
feat: llamacpp SSE streaming client for chat completions
Add ChatComplete() and Complete() methods to the llamacpp Client,
backed by a shared parseSSE() line parser. Types include ChatMessage,
ChatRequest, CompletionRequest and their chunked response structs.

Tests cover multi-chunk streaming, empty responses, HTTP errors, and
context cancellation — all using httptest SSE servers.

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 20:58:46 +00:00
Claude
d5a92c7212
fix: health check includes response body in errors, adds 503 test
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 20:54:52 +00:00
Claude
3c756771ec
feat: llamacpp health check client
Add internal/llamacpp package with Client type and Health() method.
Client communicates with llama-server via HTTP; Health checks the
/health endpoint and reports readiness. Foundation type for the
streaming methods (Tasks 2-3).

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 20:50:36 +00:00
Claude
9dda860df4
docs: incorporate Charon review — safer serverEnv() filtering
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 20:47:16 +00:00
Claude
78e244f26f
docs: Phase 1 plan review — approved with notes
- Token.ID = 0 acceptable for Phase 1 (no consumer uses it)
- StopTokens: ignore in Phase 1 (YAGNI)
- serverEnv() should filter existing HIP_VISIBLE_DEVICES before appending
- guessModelType() fine for now, upgrade to /props endpoint in Phase 2
- Integration test build tag approach approved

Charon, 19 Feb 2026

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 20:44:37 +00:00
Claude
ff9cf550e8
docs: flag Token.ID and StopTokens interface questions for Virgil
QUESTION: Token.ID always 0 — llama-server SSE doesn't include token IDs
QUESTION: StopTokens []int32 vs llama-server stop []string mismatch

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 20:41:53 +00:00
Claude
acf79e3351
docs: Phase 1 implementation plan
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 20:38:11 +00:00
Claude
34407a69ca
docs: Phase 1 core implementation design
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 20:32:22 +00:00
Claude
68bc7300aa
docs: Phase 0 complete — environment validated, llama-server built
- ROCm 7.2, gfx1100 (corrected from gfx1101), kernel 6.17
- llama-server built with HIP from llama.cpp 11c325c
- Gemma3-4B baseline: 109 tok/s decode, 396 tok/s prefill
- Critical: iGPU crash requires HIP_VISIBLE_DEVICES=0
- All Phase 0 tasks marked done

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 19:57:14 +00:00
Snider
aa42cff417
feat: scaffold go-rocm AMD GPU inference package
Implements inference.Backend via llama-server subprocess (llama.cpp + HIP/ROCm).
Targets RX 7800 XT (gfx1101, RDNA 3, 16GB VRAM).

Includes:
- Backend registration with build tags (linux/amd64)
- Stub backend.go with llama-server lifecycle outline
- CLAUDE.md with build instructions for llama.cpp + ROCm
- TODO.md with 5-phase task queue
- FINDINGS.md with hardware specs, VRAM budget, design rationale

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 19:39:40 +00:00
252e28e81e
Initial commit
2026-02-19 19:35:55 +00:00