Commit graph

51 commits

Author SHA1 Message Date
Snider
ed9bdd1306 fix: migrate module paths from forge.lthn.ai to dappco.re
Some checks are pending
Security Scan / security (push) Waiting to run
Test / Vet & Build (push) Waiting to run
Co-Authored-By: Virgil <virgil@lethean.io>
2026-04-04 16:21:13 +01:00
2fa87bfeb6 Merge pull request '[agent/claude:opus] DX audit and fix. 1) Review CLAUDE.md — update any outdate...' (#1) from agent/dx-audit-and-fix--1--review-claude-md into main
Some checks failed
Security Scan / security (push) Successful in 7s
Test / Vet & Build (push) Failing after 20s
2026-03-17 08:50:45 +00:00
Snider
9aaa404397 fix(dx): audit coding standards and add tests for untested paths
Some checks failed
Security Scan / security (pull_request) Successful in 7s
Test / Vet & Build (pull_request) Failing after 21s
- CLAUDE.md: document coreerr.E() error handling and go-io exclusion
- server_test.go: replace fmt.Errorf with coreerr.E() in test fixtures
- gguf_test.go: add tests for v2 format, skipValue (all type branches),
  readTypedValue uint64 path, unsupported version, truncated file
- discover_test.go: add test for corrupt GGUF file skipping
- vram_test.go: add tests for invalid/empty sysfs content

Coverage: 65.8% → 79.2% (+13.4%)

Co-Authored-By: Virgil <virgil@lethean.io>
2026-03-17 08:50:17 +00:00
Snider
5dc79971e2 chore: sync dependencies for v0.0.1
Some checks failed
Security Scan / security (push) Successful in 8s
Test / Vet & Build (push) Failing after 22s
Co-Authored-By: Virgil <virgil@lethean.io>
2026-03-16 22:20:49 +00:00
Snider
4669cc503d refactor: replace fmt.Errorf/errors.New with coreerr.E()
Some checks failed
Security Scan / security (push) Successful in 8s
Test / Vet & Build (push) Failing after 23s
Co-Authored-By: Virgil <virgil@lethean.io>
2026-03-16 21:08:52 +00:00
Claude
c0b7485129
docs: archive completed phase 1-4 plans
Some checks failed
Security Scan / security (push) Successful in 10s
Test / Vet & Build (push) Failing after 29s
All four phases fully implemented and tested on RX 7800 XT.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-24 19:42:07 +00:00
Claude
402bdc2205
test: add integration tests for Classify, BatchGenerate, Info, Metrics
Some checks failed
Security Scan / security (push) Successful in 8s
Test / Vet & Build (push) Failing after 29s
Verified on RX 7800 XT (gfx1100, ROCm 7.2):
- Classify: greedy single-token via max_tokens=1
- BatchGenerate: sequential multi-prompt generation
- Info: GGUF metadata (gemma3, 26 layers, Q5_K_M)
- Metrics: 250 tok/s decode, 3244 MiB VRAM

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-24 18:52:10 +00:00
Claude
b03f357f5d
feat: implement Classify, BatchGenerate, Info, Metrics on rocmModel
Some checks failed
Security Scan / security (push) Successful in 10s
Test / Vet & Build (push) Failing after 34s
Brings rocmModel into compliance with the updated inference.TextModel
interface from go-inference.

- Classify: simulates prefill-only via max_tokens=1, temperature=0
- BatchGenerate: sequential autoregressive per prompt via /v1/completions
- Info: populates ModelInfo from GGUF metadata (architecture, layers, quant)
- Metrics: captures timing + VRAM usage via sysfs after each operation
- Refactors duplicate server-exit error handling into setServerExitErr()
- Adds timing instrumentation to existing Generate and Chat methods

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-24 18:50:37 +00:00
Claude
197c537e9f
ci: add Forgejo Actions test and security scan workflows
Some checks failed
Security Scan / security (push) Successful in 9s
Test / Vet & Build (push) Failing after 39s
Uses reusable workflows from core/go-devops for Go testing
(with race detector and coverage) and security scanning
(govulncheck, gitleaks, trivy).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 03:28:08 +00:00
Claude
add2ba1dbd
chore: sync workspace dependency versions
Run go work sync to align dependency versions across workspace.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 21:41:04 +00:00
Snider
76b843e116 docs: add README with quick start and docs links
Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-20 15:11:26 +00:00
Snider
7915f7ad3c docs: graduate TODO/FINDINGS into production documentation
Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-20 15:03:17 +00:00
Claude
61a95e4d4f
docs: Phase 4 complete — benchmarks, flash attention, parallel slots
Gemma3-4B: 103 tok/s decode, 14ms TTFT, 239 tok/s concurrent (4 slots).
Flash attention shows no benefit at ctx=2048 (designed for larger contexts).

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 23:22:04 +00:00
Claude
870ee232bf
feat: benchmark suite for decode speed, TTFT, and concurrent throughput
BenchmarkDecode, BenchmarkTTFT, BenchmarkConcurrent across 3 models.
Uses testing.B with b.ReportMetric for Go benchstat integration.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 23:16:40 +00:00
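The `testing.B` + `b.ReportMetric` pattern the benchmark suite uses can be sketched as follows; `fakeDecode` is a placeholder for a real token decode step, and `testing.Benchmark` lets the sketch run outside `go test -bench`:

```go
package main

import (
	"fmt"
	"testing"
	"time"
)

// fakeDecode stands in for one token decode step; the real suite streams
// from llama-server instead.
func fakeDecode() { time.Sleep(time.Microsecond) }

// decodeBenchmark reports tokens/sec as a custom metric so benchstat can
// compare runs across models, as the commit describes.
func decodeBenchmark(b *testing.B) {
	start := time.Now()
	for i := 0; i < b.N; i++ {
		fakeDecode()
	}
	b.ReportMetric(float64(b.N)/time.Since(start).Seconds(), "tok/s")
}

func main() {
	// testing.Benchmark drives the b.N ramp-up without the test runner.
	r := testing.Benchmark(decodeBenchmark)
	fmt.Println(r.N, "iterations,", r.Extra["tok/s"], "tok/s")
}
```

Custom units reported via `ReportMetric` land in `BenchmarkResult.Extra`, which is what makes the benchstat integration work.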
Claude
72120bb200
feat: pass --parallel N to llama-server for concurrent inference slots
Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 23:13:19 +00:00
Claude
4b6cffb9c4
docs: Phase 4 performance implementation plan
5 tasks: go-inference ParallelSlots, wire --parallel, benchmark
suite, flash attention comparison, documentation.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 23:11:30 +00:00
Claude
31bf0e8850
docs: Phase 4 performance design
Benchmark suite (testing.B), parallel slots via go-inference,
flash attention manual comparison.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 23:09:56 +00:00
Claude
d7db2d6e95
docs: Phase 3 complete — GGUF metadata, discovery, auto context
Integration test verifies model discovery on real GGUF files.
All 9 models in /data/lem/gguf/ discovered with correct metadata.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 22:24:52 +00:00
Claude
2c77f6f968
feat: use GGUF metadata for model type and context window auto-detection
Replaces filename-based guessModelType with GGUF header parsing.
Caps default context at 4096 to prevent VRAM exhaustion on models
with 128K+ native context.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 22:23:07 +00:00
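The context-capping rule above is a one-line clamp; a minimal sketch (the constant matches the commit, the function name is illustrative):

```go
package main

import "fmt"

// A model may advertise a 128K+ native context in its GGUF metadata, but
// allocating the full KV cache would exhaust a 16 GB card, so the
// default is clamped.
const defaultContextCap = 4096

func defaultContext(nativeCtx uint64) uint64 {
	if nativeCtx > defaultContextCap {
		return defaultContextCap
	}
	return nativeCtx
}

func main() {
	fmt.Println(defaultContext(131072)) // 128K native context → capped to 4096
	fmt.Println(defaultContext(2048))   // small native context kept as-is
}
```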
Claude
af235653ca
feat: model discovery scanning directories for GGUF files
Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 22:21:48 +00:00
Claude
c7c9389749
feat: GGUF metadata parser for model discovery
Reads GGUF v2/v3 binary headers to extract model metadata (architecture,
name, quantisation type, context length, block count). Includes string
length limits for malformed input protection and uint64 value support
for compatibility with varied GGUF producers.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 22:20:02 +00:00
Claude
f8b091f511
docs: Phase 3 model support implementation plan
4 tasks: GGUF metadata parser, model discovery, LoadModel
enrichment, integration tests + documentation.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 22:12:31 +00:00
Claude
2454761e34
docs: Phase 3 model support design — approved
GGUF metadata parser, model discovery, LoadModel enrichment,
chat template verification.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 22:04:18 +00:00
Claude
34f02fdcd8
docs: Phase 2 complete — robustness features implemented
All 5 Phase 2 items done: crash detection, port retry,
graceful shutdown verification, VRAM monitoring, concurrent
requests. Concurrency findings documented.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 21:54:34 +00:00
Claude
a6e647c5b7
test: graceful shutdown and concurrent request integration tests
Clear lastErr at the start of each Generate/Chat call so that Err()
reflects the most recent call, not a stale cancellation from a prior one.

Add two integration tests:
- GracefulShutdown: cancel mid-stream then generate again on the same
  model, verifying the server survives cancellation.
- ConcurrentRequests: three goroutines calling Generate() simultaneously,
  verifying no panics or deadlocks (llama-server serialises via slots).

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 21:50:47 +00:00
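The stale-error fix above can be reduced to a small sketch; `model` is a pared-down stand-in for `rocmModel`, with the mutex present because the real token iterator can set the error from another goroutine:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// model is an illustrative stand-in: Err() reports the error from the
// most recent Generate call, guarded by a mutex.
type model struct {
	mu      sync.Mutex
	lastErr error
}

func (m *model) Generate(fail bool) {
	// Clear the previous call's error first, so Err() never reports a
	// stale cancellation from an earlier stream — the fix this commit makes.
	m.mu.Lock()
	m.lastErr = nil
	m.mu.Unlock()
	if fail {
		m.mu.Lock()
		m.lastErr = errors.New("stream cancelled")
		m.mu.Unlock()
	}
}

func (m *model) Err() error {
	m.mu.Lock()
	defer m.mu.Unlock()
	return m.lastErr
}

func main() {
	m := &model{}
	m.Generate(true)     // first call is cancelled mid-stream
	m.Generate(false)    // second call succeeds
	fmt.Println(m.Err()) // nil: no stale error from the first call
}
```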
Claude
954c57071a
fix: clamp VRAM Free to prevent uint64 underflow
Guard against transient sysfs inconsistency where used > total.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 21:48:19 +00:00
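The clamp described above guards a subtraction of two values read from separate sysfs files; without it, a transient `used > total` reading would wrap the unsigned subtraction to an absurd "free" figure. A minimal sketch:

```go
package main

import "fmt"

// freeVRAM computes free = total - used with the underflow guard: sysfs
// exposes total and used in separate files, so a read can briefly see
// used > total, and an unguarded uint64 subtraction would wrap around.
func freeVRAM(total, used uint64) uint64 {
	if used > total {
		return 0
	}
	return total - used
}

func main() {
	fmt.Println(freeVRAM(16384, 3244))  // normal case, MiB
	fmt.Println(freeVRAM(16384, 16500)) // transient used > total → clamped to 0
}
```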
Claude
501de83d3b
feat: VRAM monitoring via sysfs with dGPU auto-detection
Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 21:45:02 +00:00
Claude
b7342ec819
fix: only retry startServer on process exit, not timeout
Distinguishes retryable failures (process exited, e.g. port conflict)
from non-retryable ones (60s timeout, e.g. stuck server). Avoids
3x timeout penalty.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 21:43:06 +00:00
Claude
c50a8e9e9b
feat: retry port selection in startServer on process failure
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 21:40:05 +00:00
Claude
c07f37afe9
fix: guard nil exitErr wrapping, document concurrency invariant
Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 21:38:01 +00:00
Claude
2c4966e652
feat: detect server crash before Generate/Chat calls
Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 21:34:46 +00:00
Claude
d963cbf787
docs: Phase 2 robustness implementation plan
5 tasks: crash detection, port retry, VRAM monitoring,
graceful shutdown test, concurrent requests test.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 21:31:24 +00:00
Claude
2f743c5772
docs: Phase 2 robustness design — approved
Covers: graceful shutdown verification, port conflict retry,
server crash detection, VRAM monitoring via sysfs, concurrent
request testing.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 21:26:41 +00:00
Claude
6744a7c78f
docs: mark Phase 1 tasks complete
All 5 Phase 1 items done: GPU detection, server lifecycle,
HTTP client with SSE streaming, TextModel, integration tests.

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 21:16:11 +00:00
Claude
0e68d71c8a
test: integration tests for full ROCm inference pipeline
LoadModel → Generate → Chat → Close on real AMD GPU hardware.
Build-tagged //go:build rocm so normal go test skips them.

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 21:15:02 +00:00
Claude
1d8d65f55b
feat: Backend Available() and LoadModel() with GPU detection
Replace stub backend with real implementation: Available() checks
/dev/kfd and llama-server presence, LoadModel() wires up server
lifecycle to return a rocmModel. Add guessModelType() for architecture
detection from GGUF filenames (handles hyphenated variants like
Llama-3). Add TestAvailable and TestGuessModelType.

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 21:12:02 +00:00
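The filename-based `guessModelType()` mentioned above (later superseded by GGUF header parsing) can be sketched as a normalise-then-prefix-match; the prefix table here is illustrative, not the project's actual list:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// guessModelType sketches architecture detection from a GGUF filename.
// Lowercasing and dropping hyphens is what lets hyphenated variants like
// "Llama-3" match the "llama" prefix, as the commit notes.
func guessModelType(path string) string {
	name := strings.ToLower(filepath.Base(path))
	name = strings.ReplaceAll(name, "-", "")
	for _, arch := range []string{"llama", "gemma", "qwen", "mistral"} {
		if strings.HasPrefix(name, arch) {
			return arch
		}
	}
	return "unknown"
}

func main() {
	fmt.Println(guessModelType("/models/Llama-3-8B-Q5_K_M.gguf")) // llama
	fmt.Println(guessModelType("gemma3-4b.gguf"))                 // gemma
	fmt.Println(guessModelType("mystery.gguf"))                   // unknown
}
```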
Claude
a8c494771d
feat: TextModel implementation wrapping llama-server
rocmModel implements inference.TextModel with Generate() and Chat()
methods that delegate to the llamacpp HTTP client, mapping go-inference
types to llama-server's OpenAI-compatible API. Token streaming via
iter.Seq[inference.Token] with mutex-protected error propagation.

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 21:11:55 +00:00
Claude
9aa7f624ba
feat: server lifecycle and helpers for llama-server subprocess
Adds server.go with the process lifecycle layer that manages spawning
llama-server, waiting for readiness, and graceful shutdown. Includes
three helper functions (findLlamaServer, freePort, serverEnv) and the
full startServer/waitReady/stop lifecycle. The serverEnv function
critically filters HIP_VISIBLE_DEVICES to mask the Ryzen 9 iGPU
which crashes llama-server if not excluded.

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 21:08:07 +00:00
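The `serverEnv` filtering described above — and tightened in the Charon review commit further down — can be sketched as a drop-then-append over the parent environment, so an inherited `HIP_VISIBLE_DEVICES` can never expose the iGPU. The device index `0` reflects the dGPU on the machine described:

```go
package main

import (
	"fmt"
	"strings"
)

// serverEnv removes any inherited HIP_VISIBLE_DEVICES before pinning the
// value. Filtering rather than just appending matters: relying on
// "last entry wins" semantics is ambiguous across exec environments.
func serverEnv(parent []string) []string {
	env := make([]string, 0, len(parent)+1)
	for _, kv := range parent {
		if strings.HasPrefix(kv, "HIP_VISIBLE_DEVICES=") {
			continue
		}
		env = append(env, kv)
	}
	return append(env, "HIP_VISIBLE_DEVICES=0")
}

func main() {
	parent := []string{"PATH=/usr/bin", "HIP_VISIBLE_DEVICES=0,1", "HOME=/root"}
	fmt.Println(serverEnv(parent))
}
```

In the real code the `parent` slice would come from `os.Environ()` and the result would be assigned to `exec.Cmd.Env`.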
Claude
5778f1f011
fix: guard response body lifecycle in SSE streaming client
Use sync.Once to ensure resp.Body is closed exactly once, preventing
TCP connection leaks when the iterator is never consumed and
double-close when iterated twice. Also adds Accept: text/event-stream
header to both SSE endpoints.

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 21:04:02 +00:00
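The `sync.Once` guard this fix introduces makes closing idempotent, so the error path before iteration, the deferred close when the stream ends, and a never-consumed iterator can all safely attempt a close. A minimal sketch with a counter standing in for `resp.Body.Close()`:

```go
package main

import (
	"fmt"
	"sync"
)

// bodyCloser wraps the response-body lifecycle guard: sync.Once ensures
// the underlying close runs exactly once no matter how many code paths
// reach it.
type bodyCloser struct {
	once   sync.Once
	closed int // stands in for resp.Body.Close() side effects
}

func (b *bodyCloser) Close() {
	b.once.Do(func() { b.closed++ })
}

func main() {
	var b bodyCloser
	b.Close() // e.g. error path before iteration starts
	b.Close() // e.g. deferred close when the iterator finishes
	fmt.Println(b.closed) // 1: closed exactly once, no leak, no double-close
}
```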
Claude
1bc8c9948b
test: completion streaming tests for llamacpp client
Add TestComplete_Streaming (multi-chunk SSE with three tokens) and
TestComplete_HTTPError (400 status propagation) to exercise the
Complete() method alongside the existing chat tests.

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 20:59:21 +00:00
Claude
def3167199
feat: llamacpp SSE streaming client for chat completions
Add ChatComplete() and Complete() methods to the llamacpp Client,
backed by a shared parseSSE() line parser. Types include ChatMessage,
ChatRequest, CompletionRequest and their chunked response structs.

Tests cover multi-chunk streaming, empty responses, HTTP errors, and
context cancellation — all using httptest SSE servers.

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 20:58:46 +00:00
Claude
d5a92c7212
fix: health check includes response body in errors, adds 503 test
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 20:54:52 +00:00
Claude
3c756771ec
feat: llamacpp health check client
Add internal/llamacpp package with Client type and Health() method.
Client communicates with llama-server via HTTP; Health checks the
/health endpoint and reports readiness. Foundation type for the
streaming methods (Tasks 2-3).

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 20:50:36 +00:00
Claude
9dda860df4
docs: incorporate Charon review — safer serverEnv() filtering
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 20:47:16 +00:00
Claude
78e244f26f
docs: Phase 1 plan review — approved with notes
- Token.ID = 0 acceptable for Phase 1 (no consumer uses it)
- StopTokens: ignore in Phase 1 (YAGNI)
- serverEnv() should filter existing HIP_VISIBLE_DEVICES before appending
- guessModelType() fine for now, upgrade to /props endpoint in Phase 2
- Integration test build tag approach approved

Charon, 19 Feb 2026

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 20:44:37 +00:00
Claude
ff9cf550e8
docs: flag Token.ID and StopTokens interface questions for Virgil
QUESTION: Token.ID always 0 — llama-server SSE doesn't include token IDs
QUESTION: StopTokens []int32 vs llama-server stop []string mismatch

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 20:41:53 +00:00
Claude
acf79e3351
docs: Phase 1 implementation plan
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 20:38:11 +00:00
Claude
34407a69ca
docs: Phase 1 core implementation design
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 20:32:22 +00:00
Claude
68bc7300aa
docs: Phase 0 complete — environment validated, llama-server built
- ROCm 7.2, gfx1100 (corrected from gfx1101), kernel 6.17
- llama-server built with HIP from llama.cpp 11c325c
- Gemma3-4B baseline: 109 tok/s decode, 396 tok/s prefill
- Critical: iGPU crash requires HIP_VISIBLE_DEVICES=0
- All Phase 0 tasks marked done

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 19:57:14 +00:00
Snider
aa42cff417 feat: scaffold go-rocm AMD GPU inference package
Implements inference.Backend via llama-server subprocess (llama.cpp + HIP/ROCm).
Targets RX 7800 XT (gfx1101, RDNA 3, 16GB VRAM).

Includes:
- Backend registration with build tags (linux/amd64)
- Stub backend.go with llama-server lifecycle outline
- CLAUDE.md with build instructions for llama.cpp + ROCm
- TODO.md with 5-phase task queue
- FINDINGS.md with hardware specs, VRAM budget, design rationale

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 19:39:40 +00:00