Brings rocmModel into compliance with the updated inference.TextModel
interface from go-inference.
- Classify: simulates prefill-only via max_tokens=1, temperature=0
- BatchGenerate: sequential autoregressive per prompt via /v1/completions
- Info: populates ModelInfo from GGUF metadata (architecture, layers, quant)
- Metrics: captures timing + VRAM usage via sysfs after each operation
- Refactors duplicate server-exit error handling into setServerExitErr()
- Adds timing instrumentation to existing Generate and Chat methods
Co-Authored-By: Virgil <virgil@lethean.io>
Replaces filename-based guessModelType with GGUF header parsing.
Caps default context at 4096 to prevent VRAM exhaustion on models
with 128K+ native context.
Co-Authored-By: Virgil <virgil@lethean.io>
Replace stub backend with real implementation: Available() checks
/dev/kfd and llama-server presence, LoadModel() wires up server
lifecycle to return a rocmModel. Add guessModelType() for architecture
detection from GGUF filenames (handles hyphenated variants like
Llama-3). Add TestAvailable and TestGuessModelType.
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>