Replace filename-based guessModelType with GGUF header parsing.
Cap the default context at 4096 tokens to prevent VRAM exhaustion on
models with 128K+ native context.
Co-Authored-By: Virgil <virgil@lethean.io>
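
A minimal sketch of the two ideas in this commit, not the code that was merged. It assumes the GGUF v2+ layout (4-byte magic "GGUF", uint32 version, uint64 tensor count, uint64 metadata pair count, all little-endian); the architecture name itself sits in the metadata pairs that follow, which this sketch does not parse. The names ggufHeader, readGGUFHeader, headerFromBytes and effectiveContext are hypothetical, and the rule that an explicit request overrides the cap is an assumption.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
	"io"
)

// ggufHeader holds the fixed-size fields at the start of a GGUF v2+ file.
type ggufHeader struct {
	Magic       [4]byte
	Version     uint32
	TensorCount uint64
	KVCount     uint64 // number of metadata key/value pairs that follow
}

// readGGUFHeader decodes the little-endian fixed header and checks the magic.
func readGGUFHeader(r io.Reader) (ggufHeader, error) {
	var h ggufHeader
	if err := binary.Read(r, binary.LittleEndian, &h); err != nil {
		return h, err
	}
	if string(h.Magic[:]) != "GGUF" {
		return h, errors.New("not a GGUF file")
	}
	return h, nil
}

// headerFromBytes is a convenience wrapper for in-memory data.
func headerFromBytes(b []byte) (ggufHeader, error) {
	return readGGUFHeader(bytes.NewReader(b))
}

// defaultContextCap is the 4096-token default named in the commit message.
const defaultContextCap = 4096

// effectiveContext is hypothetical: an explicit request wins (assumption),
// otherwise the model's native length is capped at defaultContextCap.
func effectiveContext(requested, native int) int {
	if requested > 0 {
		return requested
	}
	if native <= 0 || native > defaultContextCap {
		return defaultContextCap
	}
	return native
}

func main() {
	raw := append([]byte("GGUF"), 3, 0, 0, 0) // magic, then version 3
	raw = append(raw, make([]byte, 16)...)     // zero tensors, zero metadata pairs
	h, err := headerFromBytes(raw)
	fmt.Println(h.Version, err)
	fmt.Println(effectiveContext(0, 131072)) // 128K native, no explicit request
}
```

Capping only the default keeps large contexts available to callers who ask for them, while a bare load of a 128K model no longer allocates a full-size KV cache.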
Replace stub backend with real implementation: Available() checks
/dev/kfd and llama-server presence, LoadModel() wires up server
lifecycle to return a rocmModel. Add guessModelType() for architecture
detection from GGUF filenames (handles hyphenated variants like
Llama-3). Add TestAvailable and TestGuessModelType.
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
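
A rough sketch of the checks described above, under stated assumptions. It treats "llama-server presence" as a PATH lookup, which may differ from how the real backend locates the binary. The architecture list and the "unknown" fallback are illustrative only, and the lowercase function names are hypothetical stand-ins for Available() and guessModelType().

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
	"strings"
)

// available reports whether the ROCm kernel device node exists and a
// llama-server binary can be found on PATH (the PATH lookup is an assumption).
func available() bool {
	if _, err := os.Stat("/dev/kfd"); err != nil {
		return false
	}
	_, err := exec.LookPath("llama-server")
	return err == nil
}

// knownArchs is an illustrative list, not the project's actual table.
var knownArchs = []string{"llama", "mistral", "qwen", "gemma"}

// guessModelType guesses an architecture from a GGUF filename.
func guessModelType(path string) string {
	name := strings.ToLower(filepath.Base(path))
	// Collapse separators so "Llama-3", "llama_3" and "llama3" look alike.
	name = strings.NewReplacer("-", "", "_", "").Replace(name)
	for _, arch := range knownArchs {
		if strings.Contains(name, arch) {
			return arch
		}
	}
	return "unknown"
}

func main() {
	fmt.Println(available())
	fmt.Println(guessModelType("/models/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf"))
}
```

Substring matching on filenames is fragile, since a renamed file or an unrelated name that happens to contain an architecture string will be misclassified.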
Add server.go with the process lifecycle layer that spawns
llama-server, waits for readiness, and shuts it down gracefully.
Includes three helper functions (findLlamaServer, freePort, serverEnv)
and the full startServer/waitReady/stop lifecycle. Critically,
serverEnv filters HIP_VISIBLE_DEVICES to mask the Ryzen 9 iGPU, which
crashes llama-server if not excluded.
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
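
Two of the helpers named above could look roughly like this. This is a sketch, not the contents of server.go. The filterEnv helper is hypothetical, and pinning HIP_VISIBLE_DEVICES to "0" assumes the discrete GPU is HIP device 0, which the commit message does not state.

```go
package main

import (
	"fmt"
	"net"
	"os"
	"strings"
)

// freePort asks the kernel for an unused TCP port by binding to port 0,
// then releases it so the child process can bind it.
func freePort() (int, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return 0, err
	}
	defer l.Close()
	return l.Addr().(*net.TCPAddr).Port, nil
}

// filterEnv drops any inherited HIP_VISIBLE_DEVICES entry and appends a
// single pinned value, leaving every other variable untouched.
func filterEnv(base []string, devices string) []string {
	out := make([]string, 0, len(base)+1)
	for _, kv := range base {
		if strings.HasPrefix(kv, "HIP_VISIBLE_DEVICES=") {
			continue
		}
		out = append(out, kv)
	}
	return append(out, "HIP_VISIBLE_DEVICES="+devices)
}

// serverEnv builds the child environment. Device index "0" is an assumption.
func serverEnv() []string {
	return filterEnv(os.Environ(), "0")
}

func main() {
	port, err := freePort()
	fmt.Println(port, err)
	env := serverEnv()
	fmt.Println(env[len(env)-1])
}
```

The port is released before llama-server binds it, so another process could claim it in between. A readiness check after spawn is what catches that case.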