Backends
Virgil edited this page 2026-02-19 18:36:26 +00:00

Backend Interface

All inference backends implement:

type Backend interface {
    Generate(ctx context.Context, prompt string, opts GenOpts) (string, error)
    Chat(ctx context.Context, messages []Message, opts GenOpts) (string, error)
    Name() string
    Available() bool
}

type StreamingBackend interface {
    Backend
    GenerateStream(ctx context.Context, prompt string, opts GenOpts, cb TokenCallback) error
    ChatStream(ctx context.Context, messages []Message, opts GenOpts, cb TokenCallback) error
}

Implementations

MLX Backend (backend_mlx.go)

Native Apple Metal GPU inference via core/go-mlx. Only available on darwin/arm64.

  • Loads safetensors models via model.LoadModel()
  • Token-by-token generation with streaming callbacks
  • Memory management: cache limits, GC pressure, active memory monitoring
  • Supports Gemma3 and Qwen3 chat templates

HTTP Backend (backend_http.go)

Generic HTTP API client compatible with Ollama, vLLM, and OpenAI-compatible endpoints.

  • Configurable base URL and model selection
  • Request/response marshalling
  • Tested with mock HTTP server

Llama Backend (backend_llama.go)

llama.cpp integration via subprocess execution.

  • Manages GGUF model files
  • Spawns llama.cpp server process
  • Communicates via local HTTP API

Ollama Backend (ollama.go)

Ollama-specific helpers built on the HTTP backend.

  • Model pulling and management
  • Ollama API conveniences