LEM Distill Backend Migration Implementation Plan
For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.
Goal: Replace raw go-inference usage in distill.go with go-ml Backend interface, adding Metal memory management to prevent unbounded memory growth.
Architecture: distill.go currently calls inference.LoadModel() directly and iterates tokens via model.Chat() (iter.Seq). We replace this with ml.NewMLXBackend() which wraps the same model in an InferenceAdapter providing memory limits (mlx.SetCacheLimit/SetMemoryLimit), GC discipline between probes, and the new Result{Text, Metrics} return type for tok/s logging. The reference implementation is go-ml/cmd/cmd_ab.go.
Tech Stack: Go 1.25, forge.lthn.ai/core/go-ml (Backend, GenOpts, Result, Message, NewMLXBackend), forge.lthn.ai/core/go-mlx (SetCacheLimit, SetMemoryLimit), forge.lthn.ai/core/go-inference (GenerateMetrics — via Result.Metrics)
Design doc: docs/plans/2026-02-22-distill-backend-migration-design.md
Task 1: Add go-ml to go.mod
go-ml is in the replace block but not in the require block. The compiler will refuse to import it until it's required.
Files:
- Modify:
go.mod
Step 1: Add go-ml to require block
Add this line to the first require block in go.mod, between go-inference and go-duckdb:
forge.lthn.ai/core/go-ml v0.0.0-00010101000000-000000000000
The version doesn't matter because the replace directive overrides it.
Step 2: Run go mod tidy
Run: cd /Users/snider/Code/LEM && go mod tidy
This will resolve the version and pull in any transitive deps from go-ml.
Step 3: Verify build still works
Run: cd /Users/snider/Code/LEM && go build ./...
Expected: Clean build (go-ml is now available but not yet imported)
Step 4: Commit
```shell
cd /Users/snider/Code/LEM
git add go.mod go.sum
git commit -m "$(cat <<'EOF'
chore: add go-ml to go.mod require block

Prerequisite for distill migration from raw go-inference to
go-ml Backend interface with memory management.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
EOF
)"
```
Task 2: Add memory config fields to DistillConfig
Add CacheLimit and MemoryLimit fields to DistillConfig in config.go, and add corresponding YAML entries to ai.yaml.
Files:
- Modify:
pkg/lem/config.go:38-42
- Modify:
.core/ai/ai.yaml:27-29
Step 1: Add fields to DistillConfig
In pkg/lem/config.go, replace the DistillConfig struct (lines 39-42):
```go
// DistillConfig holds distillation defaults.
type DistillConfig struct {
	Runs        int `yaml:"runs"`
	MinChars    int `yaml:"min_chars"`
	CacheLimit  int `yaml:"cache_limit"`  // Metal cache limit in GB (0 = no limit)
	MemoryLimit int `yaml:"memory_limit"` // Metal memory limit in GB (0 = no limit)
}
```
Step 2: Add YAML entries to ai.yaml
In .core/ai/ai.yaml, replace the distill: block (lines 27-29):
```yaml
# Distillation defaults.
distill:
  runs: 3          # Generations per probe (best kept)
  min_chars: 20    # Reject responses shorter than this
  cache_limit: 8   # Metal cache limit in GB (0 = no limit)
  memory_limit: 16 # Metal memory limit in GB (0 = no limit)
```
Step 3: Verify build
Run: cd /Users/snider/Code/LEM && go build ./...
Expected: Clean build
Step 4: Commit
```shell
cd /Users/snider/Code/LEM
git add pkg/lem/config.go .core/ai/ai.yaml
git commit -m "$(cat <<'EOF'
feat(distill): add Metal memory limit config fields

CacheLimit (8GB) and MemoryLimit (16GB) in DistillConfig control
mlx.SetCacheLimit/SetMemoryLimit before model load. Conservative
defaults for 1B model on 96GB machine.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
EOF
)"
```
Task 3: Add --cache-limit and --mem-limit flags to RunDistill
Wire the new config fields into CLI flags so they can be overridden per-run.
Files:
- Modify:
pkg/lem/distill.go:38-51 (flag parsing section)
Step 1: Add flags after existing flag declarations
In pkg/lem/distill.go, add these two flags after the root flag (after line 47, before fs.Parse):
```go
cacheLimit := fs.Int("cache-limit", 0, "Metal cache limit in GB (0 = use ai.yaml default)")
memLimit := fs.Int("mem-limit", 0, "Metal memory limit in GB (0 = use ai.yaml default)")
```
Step 2: Add flag-to-config merge after existing overrides
After the *runs override block (after line 71), add:
```go
cacheLimitGB := aiCfg.Distill.CacheLimit
if *cacheLimit > 0 {
	cacheLimitGB = *cacheLimit
}
memLimitGB := aiCfg.Distill.MemoryLimit
if *memLimit > 0 {
	memLimitGB = *memLimit
}
```
Step 3: Add memory limits to dry-run output
In the dry-run block, after the Generate: line (after line 121), add:
```go
fmt.Printf("Memory: cache=%dGB limit=%dGB\n", cacheLimitGB, memLimitGB)
```
Step 4: Verify build
Run: cd /Users/snider/Code/LEM && go build ./...
Expected: Clean build (flags are parsed but not yet used for model loading)
Step 5: Commit
```shell
cd /Users/snider/Code/LEM
git add pkg/lem/distill.go
git commit -m "$(cat <<'EOF'
feat(distill): add --cache-limit and --mem-limit flags

Override ai.yaml memory config per-run. Values in GB.
Not yet wired to model loading.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
EOF
)"
```
Task 4: Replace inference.LoadModel with ml.NewMLXBackend
The core migration: swap inference.LoadModel() + raw iter.Seq for ml.NewMLXBackend() + backend.Chat(). This is the biggest task.
Files:
- Modify:
pkg/lem/distill.go (imports, model loading, inference loop, metrics)
Step 1: Update imports
Replace the import block (lines 3-16) with:
```go
import (
	"context"
	"encoding/json"
	"flag"
	"fmt"
	"log"
	"os"
	"path/filepath"
	"runtime"
	"strings"
	"time"

	"forge.lthn.ai/core/go-i18n/reversal"
	ml "forge.lthn.ai/core/go-ml"
	"forge.lthn.ai/core/go-mlx"
)
```
Key changes:
- Remove "forge.lthn.ai/core/go-inference"
- Add ml "forge.lthn.ai/core/go-ml" (named import to avoid collision with the package name)
- Add "forge.lthn.ai/core/go-mlx" (for mlx.SetCacheLimit, mlx.SetMemoryLimit)
- Add "runtime" (for runtime.GC())
Step 2: Replace model loading with memory-managed backend
Replace the model loading block (lines 138-147):
```go
// Set Metal memory limits before loading model.
if cacheLimitGB > 0 {
	mlx.SetCacheLimit(uint64(cacheLimitGB) * 1024 * 1024 * 1024)
	log.Printf("metal cache limit: %dGB", cacheLimitGB)
}
if memLimitGB > 0 {
	mlx.SetMemoryLimit(uint64(memLimitGB) * 1024 * 1024 * 1024)
	log.Printf("metal memory limit: %dGB", memLimitGB)
}

// Load model via go-ml Backend (wraps go-inference with memory management).
log.Printf("loading model: %s", modelCfg.Paths.Base)
backend, err := ml.NewMLXBackend(modelCfg.Paths.Base)
if err != nil {
	log.Fatalf("load model: %v", err)
}
defer backend.Close()
log.Printf("model loaded via %s backend", backend.Name())
```
Note: backend.Close() replaces model.Close(). We lose model.Info() for the architecture log line — that's fine, NewMLXBackend already logs arch/layers/quant via slog.
Step 3: Build GenOpts from merged config
Add this after the model loading block, before the tokeniser init (before the tok := reversal.NewTokeniser() line):
// Build generation options from merged config.
genOpts := ml.GenOpts{
MaxTokens: genCfg.MaxTokens,
Temperature: genCfg.Temperature,
TopP: genCfg.TopP,
TopK: genCfg.TopK,
RepeatPenalty: genCfg.RepeatPenalty,
}
Step 4: Replace the inference loop
Replace the inner inference block (lines 178-201):
Old code (lines 178-201):
```go
// Inference uses bare probe — the model generates from its weights.
// Sandwich wrapping is only for the training output format.
messages := []inference.Message{
	{Role: "user", Content: probe.Prompt},
}

// Generate via native Metal inference.
start := time.Now()
var sb strings.Builder
for token := range model.Chat(ctx, messages,
	inference.WithMaxTokens(genCfg.MaxTokens),
	inference.WithTemperature(float32(genCfg.Temperature)),
	inference.WithTopP(float32(genCfg.TopP)),
	inference.WithTopK(genCfg.TopK),
	inference.WithRepeatPenalty(float32(genCfg.RepeatPenalty)),
) {
	sb.WriteString(token.Text)
}
if err := model.Err(); err != nil {
	fmt.Fprintf(os.Stderr, " → ERROR: %v\n", err)
	continue
}
response := sb.String()
elapsed := time.Since(start)
```
New code:
```go
// Inference uses bare probe — the model generates from its weights.
// Sandwich wrapping is only for the training output format.
messages := []ml.Message{
	{Role: "user", Content: probe.Prompt},
}

// Generate via go-ml Backend (memory-managed Metal inference).
start := time.Now()
result, err := backend.Chat(ctx, messages, genOpts)
if err != nil {
	fmt.Fprintf(os.Stderr, " → ERROR: %v\n", err)
	continue
}
response := result.Text
elapsed := time.Since(start)
```
Step 5: Replace metrics access
Replace the metrics line (line 214):
Old:
```go
met := model.Metrics()
fmt.Fprintf(os.Stderr, " → %d chars, g=%.1f up=%+.1f echo=%.2f enr=%+.1f, %.1fs (%.0f tok/s)\n",
	len(response), grammar.Composite,
	delta.Uplift, delta.Echo, delta.Enrichment,
	elapsed.Seconds(), met.DecodeTokensPerSec)
```
New:
```go
tokPerSec := 0.0
if result.Metrics != nil {
	tokPerSec = result.Metrics.DecodeTokensPerSec
}
fmt.Fprintf(os.Stderr, " → %d chars, g=%.1f up=%+.1f echo=%.2f enr=%+.1f, %.1fs (%.0f tok/s)\n",
	len(response), grammar.Composite,
	delta.Uplift, delta.Echo, delta.Enrichment,
	elapsed.Seconds(), tokPerSec)
```
Step 6: Add runtime.GC() after each probe
After the quality gate block's closing brace (after line 257 — the closing } of the if best != nil / else block), add:
```go
// Release GPU memory between probes to prevent incremental leak.
runtime.GC()
```
Step 7: Update the summary footer
Replace the model info line in the summary (line 263):
Old:
fmt.Fprintf(os.Stderr, "Model: %s (%s)\n", modelCfg.Name, info.Architecture)
New:
fmt.Fprintf(os.Stderr, "Model: %s (%s)\n", modelCfg.Name, backend.Name())
Step 8: Verify build
Run: cd /Users/snider/Code/LEM && go build ./...
Expected: Clean build. No remaining references to go-inference in distill.go.
Step 9: Verify no stale inference imports
Run: grep -n 'go-inference' /Users/snider/Code/LEM/pkg/lem/distill.go
Expected: No output (import fully removed)
Step 10: Commit
```shell
cd /Users/snider/Code/LEM
git add pkg/lem/distill.go
git commit -m "$(cat <<'EOF'
feat(distill): migrate from go-inference to go-ml Backend

Replace inference.LoadModel() with ml.NewMLXBackend() which wraps
the same Metal model with memory management (SetCacheLimit,
SetMemoryLimit). Replace raw iter.Seq token loop with backend.Chat()
returning Result{Text, Metrics}. Add runtime.GC() between probes
to prevent incremental memory leak.

Reference: go-ml/cmd/cmd_ab.go memory management pattern.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
EOF
)"
```
Task 5: Update backend_metal.go
backend_metal.go currently blank-imports go-mlx to register the Metal backend. Since ml.NewMLXBackend() (which we now call from distill.go) already does this import via go-ml/backend_mlx.go, the LEM-side blank import may be redundant. However, keep it for safety — it ensures the Metal backend is registered even if distill.go isn't the only consumer.
Files:
- Modify:
pkg/lem/backend_metal.go
Step 1: Verify the file is still needed
Read pkg/lem/backend_metal.go. It should contain:
```go
//go:build darwin && arm64

package lem

import _ "forge.lthn.ai/core/go-mlx"
```
This is still valid. go-mlx registers itself via init(), and ml.NewMLXBackend() also imports it. The double import is harmless (Go deduplicates). No change needed here — leave as-is.
Step 2: Verify build on darwin/arm64
Run: cd /Users/snider/Code/LEM && go build ./...
Expected: Clean build
No commit needed — no changes.
Task 6: Run go mod tidy and verify
After all code changes, clean up the dependency graph.
Files:
- Modify:
go.mod, go.sum
Step 1: Run go mod tidy
Run: cd /Users/snider/Code/LEM && go mod tidy
This may remove go-inference from the direct require block if distill.go was the only direct consumer. Check: backend_metal.go imports go-mlx (not go-inference), and no other .go files in pkg/lem/ import go-inference directly.
Step 2: Check if go-inference moved to indirect
Run: grep 'go-inference' /Users/snider/Code/LEM/go.mod
Expected: Either removed entirely (if go-ml pulls it transitively) or moved to // indirect. Either is correct.
Step 3: Full build
Run: cd /Users/snider/Code/LEM && go build ./...
Expected: Clean build
Step 4: Run go vet
Run: cd /Users/snider/Code/LEM && go vet ./...
Expected: Clean (no issues)
Step 5: Commit if go.mod/go.sum changed
```shell
cd /Users/snider/Code/LEM
git add go.mod go.sum
git commit -m "$(cat <<'EOF'
chore: go mod tidy after distill migration

go-inference moves to indirect (pulled transitively via go-ml).
go-ml is now a direct dependency.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
EOF
)"
```
Task 7: Smoke test with --dry-run
Verify the full flag pipeline works end-to-end without loading a model.
Files: None (test only)
Step 1: Build the lem binary
Run: cd /Users/snider/Code/LEM && go build -o lem .
Expected: Binary built successfully
Step 2: Run dry-run
Run: cd /Users/snider/Code/LEM && ./lem distill --model gemma3/1b --probes core --dry-run
Expected output (approximate):
```
Model: gemma-3-1b-it (path...)
Backend: metal
Probes: 101
Runs: 3 per probe (303 total generations)
Gate: grammar v3 composite >= 40.0
Generate: temp=0.80 max_tokens=4096 top_p=0.95
Memory: cache=8GB limit=16GB
Output: (path to lesson file)
  core-001: ...
  core-002: ...
  ... and 91 more
```
Key checks:
- Memory: line appears with values from ai.yaml (8/16)
- No crash, no import errors
Step 3: Test flag override
Run: cd /Users/snider/Code/LEM && ./lem distill --model gemma3/1b --probes core --dry-run --cache-limit 4 --mem-limit 8
Expected: Memory: cache=4GB limit=8GB (flag overrides config)
No commit needed — test only.
Task 8: Live inference test (optional, requires GPU)
Only run this if on a machine with the model downloaded and Metal GPU available.
Files: None (test only)
Step 1: Run a single probe with memory limits
Run:
cd /Users/snider/Code/LEM
./lem distill --model gemma3/1b --probes core --runs 1 --cache-limit 8 --mem-limit 16 2>&1 | head -30
Expected:
- Model loads with memory limit logs
- First probe generates, shows tok/s
- No memory pressure red zone
- runtime.GC() runs between probes (no visible output, but memory stays bounded)
Step 2: Monitor memory
In a separate terminal: watch -n1 'sysctl hw.memsize; vm_stat | head -5'
Or check Activity Monitor → Memory Pressure. Should stay green/yellow, not red.
No commit needed — test only.
Summary of Changes
| File | Change |
|---|---|
| go.mod | Add go-ml to require; go-inference moves to indirect |
| go.sum | Updated transitively |
| pkg/lem/config.go:39-42 | Add CacheLimit, MemoryLimit to DistillConfig |
| .core/ai/ai.yaml:27-29 | Add cache_limit: 8, memory_limit: 16 |
| pkg/lem/distill.go | Full migration: imports, model loading, inference loop, metrics, GC |
| pkg/lem/backend_metal.go | No change (blank import still valid) |
What Stays the Same
- Grammar v3 scoring (go-i18n/reversal) — unchanged
- Sandwich output format — unchanged
- Bare probe inference (model sees probe only) — unchanged
- Best-of-N selection — unchanged
- Quality gate — unchanged
- All probe loading, config merging, output writing — unchanged
- main.go routing — unchanged