docs: Phase 3 complete — GGUF metadata, discovery, auto context

Integration test verifies model discovery on real GGUF files.
All 9 models in /data/lem/gguf/ discovered with correct metadata.

Co-Authored-By: Virgil <virgil@lethean.io>
This commit is contained in:
Claude 2026-02-19 22:24:52 +00:00
parent 2c77f6f968
commit d7db2d6e95
No known key found for this signature in database
GPG key ID: AF404715446AEB41
3 changed files with 60 additions and 3 deletions

View file

@ -250,3 +250,39 @@ Note: sysfs reads are non-atomic. Total and Used are read separately, so transie
### lastErr Design Limitation
`rocmModel.lastErr` is a single mutex-protected field shared across all callers. With concurrent Generate/Chat calls, errors can be clobbered (last writer wins). `Err()` is only reliable in single-caller scenarios. This matches the go-inference interface contract (single `Err() error` method), so it's a known limitation, not a bug. Per-call error returns would require an interface change in go-inference.
---
## 2026-02-19: Phase 3 Model Support (Charon)
### GGUF Metadata Parser
New `internal/gguf/` package reads GGUF v2/v3 binary headers. Extracts metadata KV pairs without reading tensor data (<1ms per file). Supports all 13 GGUF value types (uint8..float64, string, array, bool). String length capped at 1 MiB to prevent memory exhaustion from malformed files. Handles uint64 values for context_length/block_count (some producers use uint64 instead of uint32).
### Model Inventory
Discovered models from `/data/lem/gguf/` using GGUF metadata:
| Model | Architecture | Size | Quant | Context | Blocks |
|-------|-------------|------|-------|---------|--------|
| Gemma3-1B Q5_K_M | gemma3 | 1B | Q5_K_M | 32768 | 26 |
| Gemma3-1B Q8_0 | gemma3 | 1B | Q8_0 | 32768 | 26 |
| Gemma3-4B Q4_K_M | gemma3 | 4B | Q4_K_M | 131072 | 34 |
| Gemma3-12B Q4_K_M | gemma3 | 12B | Q4_K_M | 131072 | 42 |
| Gemma3-27B Q4_K_M | gemma3 | 27B | Q4_K_M | 131072 | 46 |
| Llama-3.1-8B Q4_K_M | llama | 8B | Q4_K_M | 131072 | 32 |
| Mistral-7B-v0.3 Q4_K_M | llama | 7B | Q4_K_M | 32768 | 32 |
| Qwen-2.5-7B Q4_K_M | qwen2 | 7B | Q4_K_M | 32768 | 28 |
Key observations:
- Mistral-7B-v0.3 reports `general.architecture = "llama"` (correct — Mistral is a Llama architecture variant). Old `guessModelType` returned "mistral", GGUF metadata returns "llama".
- Qwen-2.5-7B reports `general.architecture = "qwen2"` (not "qwen3"). Old `guessModelType` would have returned "qwen" due to filename matching.
- Gemma3-4B/12B/27B have 131072 native context — without auto-capping at 4096, these would exhaust VRAM.
### Chat Templates
llama-server reads `tokenizer.chat_template` from the GGUF and applies it automatically on `/v1/chat/completions`. No go-rocm code needed. Verified working with Gemma3 integration tests.
### Context Window Auto-Detection
Default context capped at `min(model_context_length, 4096)` when user doesn't specify `inference.WithContextLen(N)`. Without this cap, Llama-3.1 would try to allocate 131072 context (~4GB KV cache), which combined with model weights would not fit in 16GB VRAM for larger models.

View file

@ -33,9 +33,11 @@ The Ryzen 9 9950X iGPU shows up as ROCm Device 1, reports 100GB free (system RAM
## Phase 3: Model Support
- [ ] **GGUF model discovery** — Implement model path scanning: find .gguf files, parse metadata (model name, params, quant level, size). Return structured inventory.
- [ ] **Chat templates** — llama-server handles chat templates natively via `--chat-template`. Verify Gemma3, Qwen3, Llama3 templates work. If not, add template formatting in model.go.
- [ ] **Context window sizing** — Auto-detect optimal context window from model metadata. Default to 4096 if unknown.
- [x] **GGUF metadata parser**`internal/gguf/` reads GGUF v2/v3 binary headers. Extracts architecture, name, file type, size label, context length, block count. String length limits for malformed input protection. Commit `c7c9389`. (Charon, 19 Feb 2026)
- [x] **GGUF model discovery**`DiscoverModels(dir)` scans directory for `.gguf` files, parses metadata via GGUF parser, returns `[]ModelInfo`. Commit `af23565`. (Charon, 19 Feb 2026)
- [x] **LoadModel enrichment** — Replaced `guessModelType` with GGUF metadata for real architecture. Auto-caps context at 4096 when user doesn't specify. Commit `2c77f6f`. (Charon, 19 Feb 2026)
- [x] **Chat templates** — llama-server reads `tokenizer.chat_template` from GGUF natively on `/v1/chat/completions`. No go-rocm code needed. Verified with Gemma3 integration test. (Charon, 19 Feb 2026)
- [x] **Context window sizing** — Auto-detected from GGUF metadata. Default caps at `min(model_context_length, 4096)` to prevent VRAM exhaustion. (Charon, 19 Feb 2026)
## Phase 4: Performance

View file

@ -5,6 +5,7 @@ package rocm
import (
"context"
"os"
"path/filepath"
"strings"
"sync"
"testing"
@ -199,3 +200,21 @@ func TestROCm_ConcurrentRequests(t *testing.T) {
assert.NotEmpty(t, result, "goroutine %d produced no output", i)
}
}
func TestROCm_DiscoverModels(t *testing.T) {
dir := filepath.Dir(testModel)
if _, err := os.Stat(dir); err != nil {
t.Skip("model directory not available")
}
models, err := DiscoverModels(dir)
require.NoError(t, err)
require.NotEmpty(t, models, "expected at least one model in %s", dir)
for _, m := range models {
t.Logf("Found: %s (%s %s %s, ctx=%d)", filepath.Base(m.Path), m.Architecture, m.Parameters, m.Quantisation, m.ContextLen)
assert.NotEmpty(t, m.Architecture)
assert.NotEmpty(t, m.Name)
assert.Greater(t, m.FileSize, int64(0))
}
}