Phase 1: Core Implementation Design

Approved 19 Feb 2026.

Component Structure

4 files, layered bottom-up:

internal/llamacpp/client.go   <- HTTP + SSE (no inference types)
server.go                     <- process lifecycle (spawn, health, kill)
model.go                      <- inference.TextModel (wraps server + client)
backend.go                    <- inference.Backend (fill in stubs)

Each layer only knows about the one below it. internal/llamacpp/ is a pure HTTP client with zero go-inference dependency — it speaks llama-server's OpenAI-compatible API and returns plain Go types.

Data Flow

LoadModel("/path/to/model.gguf")
  -> backend.go: find llama-server binary, apply LoadConfig
  -> server.go:  spawn process (HIP_VISIBLE_DEVICES=0, random port)
                  poll GET /health until {"status":"ok"} (timeout 60s)
  -> model.go:   return &rocmModel{server, client}

Generate(ctx, "prompt")
  -> model.go:   build request body from GenerateConfig
  -> client.go:  POST /v1/completions, stream=true
                  parse SSE lines: data: {json}
                  yield Token{Text: chunk} via iter.Seq[Token]
  -> on ctx cancel: close HTTP response body (stops SSE)

Close()
  -> server.go:  SIGTERM -> wait 5s -> SIGKILL if needed

Key Decisions

  • Server lifecycle: Long-lived. Spawn on LoadModel(), reuse for all Generate/Chat calls, kill on Close().
  • Port selection: net.Listen(":0") for kernel-assigned free port. Close listener, pass port to llama-server.
  • SSE parsing: Hand-rolled with bufio.Scanner. llama-server's SSE format is simple (data: {json}\n\n), ~20 lines.
  • Health polling: Simple loop, 100ms interval, 60s timeout. Returns error if exceeded.
  • Error handling: Err() stores last error. Process death mid-stream detected via response body EOF + process state.
  • Thread safety: rocmModel safe for concurrent calls (llama-server handles via slots). lastErr protected by mutex.
  • iGPU mitigation: HIP_VISIBLE_DEVICES=0 set in process env unconditionally.

Testing

  • Unit tests (no GPU): mock HTTP responses for SSE parsing in client.go
  • Integration test (GPU + model): LoadModel -> Generate -> Close, t.Skip() when unavailable
  • Test model: LEK-Gemma3-1B-layered-v2-Q5_K_M.gguf (smallest, fast load)