Backends
Virgil edited this page 2026-02-19 18:36:26 +00:00
Backend Interface
All inference backends implement:
```go
type Backend interface {
	Generate(ctx context.Context, prompt string, opts GenOpts) (string, error)
	Chat(ctx context.Context, messages []Message, opts GenOpts) (string, error)
	Name() string
	Available() bool
}

type StreamingBackend interface {
	Backend
	GenerateStream(ctx context.Context, prompt string, opts GenOpts, cb TokenCallback) error
	ChatStream(ctx context.Context, messages []Message, opts GenOpts, cb TokenCallback) error
}
```
Implementations
MLX Backend (backend_mlx.go)
Native Apple Metal GPU inference via core/go-mlx. Only available on darwin/arm64.
- Loads safetensors models via model.LoadModel()
- Token-by-token generation with streaming callbacks
- Memory management: cache limits, GC pressure, active memory monitoring
- Supports Gemma3 and Qwen3 chat templates
HTTP Backend (backend_http.go)
Generic HTTP API client compatible with Ollama, vLLM, and OpenAI-compatible endpoints.
- Configurable base URL and model selection
- Request/response marshalling
- Tested with mock HTTP server
Llama Backend (backend_llama.go)
llama.cpp integration via subprocess execution.
- Manages GGUF model files
- Spawns llama.cpp server process
- Communicates via local HTTP API
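Spawning the server reduces to building an `exec.Cmd` for the llama.cpp server binary pointed at a GGUF file. A sketch, assuming a `llama-server`-style binary with its usual `-m`, `--port`, and `-c` (context size) flags; the binary path and flag set here are illustrative:

```go
package main

import (
	"fmt"
	"os/exec"
)

// llamaServerCmd builds the command used to spawn a llama.cpp server for a
// GGUF model. Flags shown are the common llama.cpp server options; the
// backend's real flag set may differ.
func llamaServerCmd(binPath, modelPath string, port, ctxSize int) *exec.Cmd {
	args := []string{
		"-m", modelPath, // GGUF model file
		"--port", fmt.Sprint(port), // local HTTP API port
		"-c", fmt.Sprint(ctxSize), // context window size
	}
	return exec.Command(binPath, args...)
}

func main() {
	cmd := llamaServerCmd("llama-server", "/models/example.gguf", 8081, 4096)
	fmt.Println(cmd.Args)
	// The backend would then cmd.Start() the process, poll its local HTTP API
	// until it reports ready, and route requests to it like the HTTP backend.
}
```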
Ollama Backend (ollama.go)
Ollama-specific helpers built on the HTTP backend.
- Model pulling and management
- Ollama API conveniences