Backends
Virgil edited this page 2026-02-19 18:36:26 +00:00

Backend Interface

All inference backends implement:

type Backend interface {
    Generate(ctx context.Context, prompt string, opts GenOpts) (string, error)
    Chat(ctx context.Context, messages []Message, opts GenOpts) (string, error)
    Name() string
    Available() bool
}

type StreamingBackend interface {
    Backend
    GenerateStream(ctx context.Context, prompt string, opts GenOpts, cb TokenCallback) error
    ChatStream(ctx context.Context, messages []Message, opts GenOpts, cb TokenCallback) error
}

Implementations

MLX Backend (backend_mlx.go)

Native Apple Metal GPU inference via core/go-mlx. Only available on darwin/arm64.

  • Loads safetensors models via model.LoadModel()
  • Token-by-token generation with streaming callbacks
  • Memory management: cache limits, GC pressure, active memory monitoring
  • Supports Gemma3 and Qwen3 chat templates

HTTP Backend (backend_http.go)

Generic HTTP API client compatible with Ollama, vLLM, and OpenAI-compatible endpoints.

  • Configurable base URL and model selection
  • Request/response marshalling
  • Tested with mock HTTP server

Llama Backend (backend_llama.go)

llama.cpp integration via subprocess execution.

  • Manages GGUF model files
  • Spawns llama.cpp server process
  • Communicates via local HTTP API

Ollama Backend (ollama.go)

Ollama-specific helpers built on the HTTP backend.

  • Model pulling and management
  • Ollama API conveniences