Models
Claude edited this page 2026-02-19 20:07:58 +00:00

VRAM Budget

The RX 7800 XT has 16GB GDDR6. Budget:

Total VRAM:     16,384 MiB
Model:          varies (see table)
KV cache:       ~80-500 MiB (depends on context length + layers)
Compute buffer: ~500 MiB
Overhead:       ~200 MiB
─────────────────────────────
Available for model: ~15,000 MiB

Available Models

All models live at /data/lem/gguf/ (an SMB mount from the M3 Ultra at //10.69.69.108/Data).

Model            File                                  Size   VRAM (est)  Fits?  Notes
Gemma3-1B Q5     LEK-Gemma3-1B-layered-v2-Q5_K_M.gguf  ~0.9G  ~1.5G       Yes    Smallest, good for quick tests
Gemma3-1B Q8     LEK-Gemma3-1B-layered-v2-Q8_0.gguf    ~1.4G  ~2G         Yes    Higher quality 1B
Gemma3-4B Q4     LEK-Gemma3-4B-Q4_K_M.gguf             2.7G   ~3.4G       Yes    Baseline tested: 109 tok/s
Gemma3-12B Q4    LEK-Gemma3-12B-Q4_K_M.gguf            ~7.5G  ~9G         Yes    Good quality/speed balance
Gemma3-27B Q4    LEK-Gemma3-27B-Q4_K_M.gguf            ~16G   ~18G        Tight  May need reduced context
Llama-3.1-8B Q4  LEK-Llama-3.1-8B-Q4_K_M.gguf          ~5G    ~6G         Yes
Mistral-7B Q4    LEK-Mistral-7B-v0.3-Q4_K_M.gguf       ~4G    ~5G         Yes
Qwen-2.5-7B Q4   LEK-Qwen-2.5-7B-Q4_K_M.gguf           ~4G    ~5G         Yes

Use LEK-Gemma3-4B-Q4_K_M.gguf for development/testing:

  • Small enough to load fast (~6s)
  • Leaves plenty of VRAM headroom
  • Has validated baseline: 396 tok/s prefill, 109 tok/s decode
  • Chat template works natively in llama-server

For integration tests that need to be fast, use LEK-Gemma3-1B-layered-v2-Q5_K_M.gguf.

SMB Mount Note

Models are on an SMB mount from the M3 Ultra. If the mount is unavailable, tests should skip rather than fail:

// Skip on any Stat error: a downed SMB mount typically surfaces as an I/O
// error rather than NotExist, so checking os.IsNotExist alone would let the
// test fail instead of skip.
if _, err := os.Stat(modelPath); err != nil {
    t.Skip("model not available (SMB mount down?)")
}

For CI, models would need to be copied locally or tests skipped.

Baseline Benchmarks

Gemma3-4B-Q4_K_M on RX 7800 XT (Charon, 19 Feb 2026):

Metric               Value
Prefill              396 tok/s (2.5 ms/tok)
Decode               109 tok/s (9.2 ms/tok)
Time to first token  ~40 ms (16-token prompt)
Startup              ~6 s (load + warmup)
Context              4096 (model supports 131072)
Flash attention      Auto-enabled
Concurrent slots     4