go-rocm

Author	SHA1	Message	Date
Claude	add2ba1dbd	chore: sync workspace dependency versions Run go work sync to align dependency versions across workspace. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-22 21:41:04 +00:00
Snider	76b843e116	docs: add README with quick start and docs links Co-Authored-By: Virgil <virgil@lethean.io>	2026-02-20 15:11:26 +00:00
Snider	7915f7ad3c	docs: graduate TODO/FINDINGS into production documentation Co-Authored-By: Virgil <virgil@lethean.io>	2026-02-20 15:03:17 +00:00
Claude	61a95e4d4f	docs: Phase 4 complete — benchmarks, flash attention, parallel slots Gemma3-4B: 103 tok/s decode, 14ms TTFT, 239 tok/s concurrent (4 slots). Flash attention shows no benefit at ctx=2048 (designed for larger contexts). Co-Authored-By: Virgil <virgil@lethean.io>	2026-02-19 23:22:04 +00:00
Claude	870ee232bf	feat: benchmark suite for decode speed, TTFT, and concurrent throughput BenchmarkDecode, BenchmarkTTFT, BenchmarkConcurrent across 3 models. Uses testing.B with b.ReportMetric for Go benchstat integration. Co-Authored-By: Virgil <virgil@lethean.io>	2026-02-19 23:16:40 +00:00
Claude	72120bb200	feat: pass --parallel N to llama-server for concurrent inference slots Co-Authored-By: Virgil <virgil@lethean.io>	2026-02-19 23:13:19 +00:00
Claude	4b6cffb9c4	docs: Phase 4 performance implementation plan 5 tasks: go-inference ParallelSlots, wire --parallel, benchmark suite, flash attention comparison, documentation. Co-Authored-By: Virgil <virgil@lethean.io>	2026-02-19 23:11:30 +00:00
Claude	31bf0e8850	docs: Phase 4 performance design Benchmark suite (testing.B), parallel slots via go-inference, flash attention manual comparison. Co-Authored-By: Virgil <virgil@lethean.io>	2026-02-19 23:09:56 +00:00
Claude	d7db2d6e95	docs: Phase 3 complete — GGUF metadata, discovery, auto context Integration test verifies model discovery on real GGUF files. All 9 models in /data/lem/gguf/ discovered with correct metadata. Co-Authored-By: Virgil <virgil@lethean.io>	2026-02-19 22:24:52 +00:00
Claude	2c77f6f968	feat: use GGUF metadata for model type and context window auto-detection Replaces filename-based guessModelType with GGUF header parsing. Caps default context at 4096 to prevent VRAM exhaustion on models with 128K+ native context. Co-Authored-By: Virgil <virgil@lethean.io>	2026-02-19 22:23:07 +00:00
Claude	af235653ca	feat: model discovery scanning directories for GGUF files Co-Authored-By: Virgil <virgil@lethean.io>	2026-02-19 22:21:48 +00:00
Claude	c7c9389749	feat: GGUF metadata parser for model discovery Reads GGUF v2/v3 binary headers to extract model metadata (architecture, name, quantisation type, context length, block count). Includes string length limits for malformed input protection and uint64 value support for compatibility with varied GGUF producers. Co-Authored-By: Virgil <virgil@lethean.io>	2026-02-19 22:20:02 +00:00
Claude	f8b091f511	docs: Phase 3 model support implementation plan 4 tasks: GGUF metadata parser, model discovery, LoadModel enrichment, integration tests + documentation. Co-Authored-By: Virgil <virgil@lethean.io>	2026-02-19 22:12:31 +00:00
Claude	2454761e34	docs: Phase 3 model support design — approved GGUF metadata parser, model discovery, LoadModel enrichment, chat template verification. Co-Authored-By: Virgil <virgil@lethean.io>	2026-02-19 22:04:18 +00:00
Claude	34f02fdcd8	docs: Phase 2 complete — robustness features implemented All 5 Phase 2 items done: crash detection, port retry, graceful shutdown verification, VRAM monitoring, concurrent requests. Concurrency findings documented. Co-Authored-By: Virgil <virgil@lethean.io>	2026-02-19 21:54:34 +00:00
Claude	a6e647c5b7	test: graceful shutdown and concurrent request integration tests Clear lastErr at the start of each Generate/Chat call so that Err() reflects the most recent call, not a stale cancellation from a prior one. Add two integration tests: - GracefulShutdown: cancel mid-stream then generate again on the same model, verifying the server survives cancellation. - ConcurrentRequests: three goroutines calling Generate() simultaneously, verifying no panics or deadlocks (llama-server serialises via slots). Co-Authored-By: Virgil <virgil@lethean.io>	2026-02-19 21:50:47 +00:00
Claude	954c57071a	fix: clamp VRAM Free to prevent uint64 underflow Guard against transient sysfs inconsistency where used > total. Co-Authored-By: Virgil <virgil@lethean.io>	2026-02-19 21:48:19 +00:00
Claude	501de83d3b	feat: VRAM monitoring via sysfs with dGPU auto-detection Co-Authored-By: Virgil <virgil@lethean.io>	2026-02-19 21:45:02 +00:00
Claude	b7342ec819	fix: only retry startServer on process exit, not timeout Distinguishes retryable failures (process exited, e.g. port conflict) from non-retryable ones (60s timeout, e.g. stuck server). Avoids 3x timeout penalty. Co-Authored-By: Virgil <virgil@lethean.io>	2026-02-19 21:43:06 +00:00
Claude	c50a8e9e9b	feat: retry port selection in startServer on process failure Co-Authored-By: Virgil <virgil@lethean.io> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 21:40:05 +00:00
Claude	c07f37afe9	fix: guard nil exitErr wrapping, document concurrency invariant Co-Authored-By: Virgil <virgil@lethean.io>	2026-02-19 21:38:01 +00:00
Claude	2c4966e652	feat: detect server crash before Generate/Chat calls Co-Authored-By: Virgil <virgil@lethean.io>	2026-02-19 21:34:46 +00:00
Claude	d963cbf787	docs: Phase 2 robustness implementation plan 5 tasks: crash detection, port retry, VRAM monitoring, graceful shutdown test, concurrent requests test. Co-Authored-By: Virgil <virgil@lethean.io>	2026-02-19 21:31:24 +00:00
Claude	2f743c5772	docs: Phase 2 robustness design — approved Covers: graceful shutdown verification, port conflict retry, server crash detection, VRAM monitoring via sysfs, concurrent request testing. Co-Authored-By: Virgil <virgil@lethean.io>	2026-02-19 21:26:41 +00:00
Claude	6744a7c78f	docs: mark Phase 1 tasks complete All 5 Phase 1 items done: GPU detection, server lifecycle, HTTP client with SSE streaming, TextModel, integration tests. Co-Authored-By: Virgil <virgil@lethean.io> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 21:16:11 +00:00
Claude	0e68d71c8a	test: integration tests for full ROCm inference pipeline LoadModel → Generate → Chat → Close on real AMD GPU hardware. Build-tagged //go:build rocm so normal go test skips them. Co-Authored-By: Virgil <virgil@lethean.io> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 21:15:02 +00:00
Claude	1d8d65f55b	feat: Backend Available() and LoadModel() with GPU detection Replace stub backend with real implementation: Available() checks /dev/kfd and llama-server presence, LoadModel() wires up server lifecycle to return a rocmModel. Add guessModelType() for architecture detection from GGUF filenames (handles hyphenated variants like Llama-3). Add TestAvailable and TestGuessModelType. Co-Authored-By: Virgil <virgil@lethean.io> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 21:12:02 +00:00
Claude	a8c494771d	feat: TextModel implementation wrapping llama-server rocmModel implements inference.TextModel with Generate() and Chat() methods that delegate to the llamacpp HTTP client, mapping go-inference types to llama-server's OpenAI-compatible API. Token streaming via iter.Seq[inference.Token] with mutex-protected error propagation. Co-Authored-By: Virgil <virgil@lethean.io> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 21:11:55 +00:00
Claude	9aa7f624ba	feat: server lifecycle and helpers for llama-server subprocess Adds server.go with the process lifecycle layer that manages spawning llama-server, waiting for readiness, and graceful shutdown. Includes three helper functions (findLlamaServer, freePort, serverEnv) and the full startServer/waitReady/stop lifecycle. The serverEnv function critically filters HIP_VISIBLE_DEVICES to mask the Ryzen 9 iGPU which crashes llama-server if not excluded. Co-Authored-By: Virgil <virgil@lethean.io> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 21:08:07 +00:00
Claude	5778f1f011	fix: guard response body lifecycle in SSE streaming client Use sync.Once to ensure resp.Body is closed exactly once, preventing TCP connection leaks when the iterator is never consumed and double-close when iterated twice. Also adds Accept: text/event-stream header to both SSE endpoints. Co-Authored-By: Virgil <virgil@lethean.io> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 21:04:02 +00:00
Claude	1bc8c9948b	test: completion streaming tests for llamacpp client Add TestComplete_Streaming (multi-chunk SSE with three tokens) and TestComplete_HTTPError (400 status propagation) to exercise the Complete() method alongside the existing chat tests. Co-Authored-By: Virgil <virgil@lethean.io> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 20:59:21 +00:00
Claude	def3167199	feat: llamacpp SSE streaming client for chat completions Add ChatComplete() and Complete() methods to the llamacpp Client, backed by a shared parseSSE() line parser. Types include ChatMessage, ChatRequest, CompletionRequest and their chunked response structs. Tests cover multi-chunk streaming, empty responses, HTTP errors, and context cancellation — all using httptest SSE servers. Co-Authored-By: Virgil <virgil@lethean.io> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 20:58:46 +00:00
Claude	d5a92c7212	fix: health check includes response body in errors, adds 503 test Co-Authored-By: Virgil <virgil@lethean.io> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 20:54:52 +00:00
Claude	3c756771ec	feat: llamacpp health check client Add internal/llamacpp package with Client type and Health() method. Client communicates with llama-server via HTTP; Health checks the /health endpoint and reports readiness. Foundation type for the streaming methods (Tasks 2-3). Co-Authored-By: Virgil <virgil@lethean.io> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 20:50:36 +00:00
Claude	9dda860df4	docs: incorporate Charon review — safer serverEnv() filtering Co-Authored-By: Virgil <virgil@lethean.io> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 20:47:16 +00:00
Claude	78e244f26f	docs: Phase 1 plan review — approved with notes - Token.ID = 0 acceptable for Phase 1 (no consumer uses it) - StopTokens: ignore in Phase 1 (YAGNI) - serverEnv() should filter existing HIP_VISIBLE_DEVICES before appending - guessModelType() fine for now, upgrade to /props endpoint in Phase 2 - Integration test build tag approach approved Charon, 19 Feb 2026 Co-Authored-By: Virgil <virgil@lethean.io>	2026-02-19 20:44:37 +00:00
Claude	ff9cf550e8	docs: flag Token.ID and StopTokens interface questions for Virgil QUESTION: Token.ID always 0 — llama-server SSE doesn't include token IDs QUESTION: StopTokens []int32 vs llama-server stop []string mismatch Co-Authored-By: Virgil <virgil@lethean.io> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 20:41:53 +00:00
Claude	acf79e3351	docs: Phase 1 implementation plan Co-Authored-By: Virgil <virgil@lethean.io> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 20:38:11 +00:00
Claude	34407a69ca	docs: Phase 1 core implementation design Co-Authored-By: Virgil <virgil@lethean.io> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-19 20:32:22 +00:00
Claude	68bc7300aa	docs: Phase 0 complete — environment validated, llama-server built - ROCm 7.2, gfx1100 (corrected from gfx1101), kernel 6.17 - llama-server built with HIP from llama.cpp 11c325c - Gemma3-4B baseline: 109 tok/s decode, 396 tok/s prefill - Critical: iGPU crash requires HIP_VISIBLE_DEVICES=0 - All Phase 0 tasks marked done Co-Authored-By: Virgil <virgil@lethean.io>	2026-02-19 19:57:14 +00:00
Snider	aa42cff417	feat: scaffold go-rocm AMD GPU inference package Implements inference.Backend via llama-server subprocess (llama.cpp + HIP/ROCm). Targets RX 7800 XT (gfx1101, RDNA 3, 16GB VRAM). Includes: - Backend registration with build tags (linux/amd64) - Stub backend.go with llama-server lifecycle outline - CLAUDE.md with build instructions for llama.cpp + ROCm - TODO.md with 5-phase task queue - FINDINGS.md with hardware specs, VRAM budget, design rationale Co-Authored-By: Virgil <virgil@lethean.io>	2026-02-19 19:39:40 +00:00
Virgil	252e28e81e	Initial commit	2026-02-19 19:35:55 +00:00

42 commits
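
Two of the fixes recorded in this log are small enough to sketch in a few lines of Go. Commit 954c57071a clamps the free-VRAM figure so that a transient sysfs reading with used > total cannot underflow a uint64, and commits 78e244f26f and 9aa7f624ba describe filtering any existing HIP_VISIBLE_DEVICES entry out of the environment before appending the value that masks the iGPU. The code below is only an illustration under assumptions: the names `vramFree` and `filterHIPEnv` and their signatures are hypothetical and not taken from the repository, and the value `HIP_VISIBLE_DEVICES=0` is the one reported in commit 68bc7300aa.

```go
package main

import (
	"fmt"
	"strings"
)

// vramFree is a hypothetical helper (not the repository's code).
// It returns total-used, clamped to zero so that an inconsistent
// reading where used > total cannot wrap around as a uint64.
func vramFree(total, used uint64) uint64 {
	if used > total {
		return 0
	}
	return total - used
}

// filterHIPEnv is a hypothetical helper (not the repository's serverEnv).
// It copies base, dropping every existing HIP_VISIBLE_DEVICES entry,
// then appends a single HIP_VISIBLE_DEVICES=0 so the subprocess sees
// exactly one value for the variable.
func filterHIPEnv(base []string) []string {
	env := make([]string, 0, len(base)+1)
	for _, kv := range base {
		if strings.HasPrefix(kv, "HIP_VISIBLE_DEVICES=") {
			continue // drop the inherited value
		}
		env = append(env, kv)
	}
	return append(env, "HIP_VISIBLE_DEVICES=0")
}

func main() {
	// Inconsistent reading: used exceeds total, so the result is clamped.
	fmt.Println(vramFree(100, 150))

	// An inherited value listing two devices is replaced by a single entry.
	env := filterHIPEnv([]string{"PATH=/usr/bin", "HIP_VISIBLE_DEVICES=0,1"})
	fmt.Println(strings.Join(env, " "))
}
```

Filtering before appending avoids depending on how duplicate environment entries are resolved by the child process; in real use the base slice would presumably come from `os.Environ()` and the result would be assigned to `exec.Cmd.Env`.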