# Models
Claude edited this page 2026-02-19 20:07:58 +00:00
## VRAM Budget

The RX 7800 XT has 16 GB of GDDR6. Budget:

- Total VRAM: 16,384 MiB
- Model: varies (see table)
- KV cache: ~80–500 MiB (depends on context length and layer count)
- Compute buffer: ~500 MiB
- Overhead: ~200 MiB
- **Available for model: ~15,000 MiB**
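As a sanity check, the headroom figure can be derived directly from the budget above (a minimal sketch; the KV cache is taken at the worst case of its range):

```go
package main

import "fmt"

func main() {
	// Figures from the VRAM budget above, all in MiB.
	const (
		totalMiB    = 16384
		kvCacheMiB  = 500 // worst case of the ~80-500 MiB range
		computeMiB  = 500
		overheadMiB = 200
	)
	available := totalMiB - kvCacheMiB - computeMiB - overheadMiB
	fmt.Printf("available for model weights: ~%d MiB\n", available)
}
```

With a short context (KV cache near 80 MiB) the headroom rises toward ~15,600 MiB, which is why the 27B Q4 is listed as "tight" rather than impossible.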
## Available Models

All models live at /data/lem/gguf/ (an SMB mount from the M3 Ultra at //10.69.69.108/Data).
| Model | File | Size | VRAM (est) | Fits? | Notes |
|---|---|---|---|---|---|
| Gemma3-1B Q5 | LEK-Gemma3-1B-layered-v2-Q5_K_M.gguf | ~0.9G | ~1.5G | Yes | Smallest, good for quick tests |
| Gemma3-1B Q8 | LEK-Gemma3-1B-layered-v2-Q8_0.gguf | ~1.4G | ~2G | Yes | Higher quality 1B |
| Gemma3-4B Q4 | LEK-Gemma3-4B-Q4_K_M.gguf | 2.7G | ~3.4G | Yes | Baseline tested: 109 tok/s |
| Gemma3-12B Q4 | LEK-Gemma3-12B-Q4_K_M.gguf | ~7.5G | ~9G | Yes | Good quality/speed balance |
| Gemma3-27B Q4 | LEK-Gemma3-27B-Q4_K_M.gguf | ~16G | ~18G | Tight | May need reduced context |
| Llama-3.1-8B Q4 | LEK-Llama-3.1-8B-Q4_K_M.gguf | ~5G | ~6G | Yes | |
| Mistral-7B Q4 | LEK-Mistral-7B-v0.3-Q4_K_M.gguf | ~4G | ~5G | Yes | |
| Qwen-2.5-7B Q4 | LEK-Qwen-2.5-7B-Q4_K_M.gguf | ~4G | ~5G | Yes | |
## Recommended Test Model

Use LEK-Gemma3-4B-Q4_K_M.gguf for development and testing:
- Small enough to load fast (~6s)
- Leaves plenty of VRAM headroom
- Has validated baseline: 396 tok/s prefill, 109 tok/s decode
- Chat template works natively in llama-server
For integration tests that need to be fast, use LEK-Gemma3-1B-layered-v2-Q5_K_M.gguf.
## SMB Mount Note

Models live on an SMB mount from the M3 Ultra. If the mount is unavailable, tests should skip rather than fail:

```go
if _, err := os.Stat(modelPath); os.IsNotExist(err) {
	t.Skip("model not available (SMB mount down?)")
}
```
For CI, models would need to be copied locally or tests skipped.
## Baseline Benchmarks

Gemma3-4B-Q4_K_M on the RX 7800 XT (Charon, 19 Feb 2026):
| Metric | Value |
|---|---|
| Prefill | 396 tok/s (2.5ms/tok) |
| Decode | 109 tok/s (9.2ms/tok) |
| Time to first token | ~40ms (16 token prompt) |
| Startup | ~6s (load + warmup) |
| Context | 4096 (model supports 131072) |
| Flash attention | Auto-enabled |
| Concurrent slots | 4 |
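The table's figures are internally consistent: time to first token is roughly prompt length divided by prefill rate. A quick check:

```go
package main

import "fmt"

func main() {
	// TTFT ≈ prompt_tokens / prefill_rate, using the benchmark figures above.
	promptTokens := 16.0
	prefillTokPerSec := 396.0
	ttftMs := promptTokens / prefillTokPerSec * 1000
	fmt.Printf("estimated TTFT: %.0f ms\n", ttftMs) // matches the ~40ms measured
}
```

The same arithmetic gives the per-token figures: 1000/396 ≈ 2.5 ms/tok prefill and 1000/109 ≈ 9.2 ms/tok decode.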