LEM/docs/plans/2026-02-22-distill-migration.md

LEM Distill Backend Migration Implementation Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers:executing-plans to implement this plan task-by-task.

Goal: Replace raw go-inference usage in distill.go with go-ml Backend interface, adding Metal memory management to prevent unbounded memory growth.

Architecture: distill.go currently calls inference.LoadModel() directly and iterates tokens via model.Chat() (iter.Seq). We replace this with ml.NewMLXBackend() which wraps the same model in an InferenceAdapter providing memory limits (mlx.SetCacheLimit/SetMemoryLimit), GC discipline between probes, and the new Result{Text, Metrics} return type for tok/s logging. The reference implementation is go-ml/cmd/cmd_ab.go.

Tech Stack: Go 1.25, forge.lthn.ai/core/go-ml (Backend, GenOpts, Result, Message, NewMLXBackend), forge.lthn.ai/core/go-mlx (SetCacheLimit, SetMemoryLimit), forge.lthn.ai/core/go-inference (GenerateMetrics — via Result.Metrics)

Design doc: docs/plans/2026-02-22-distill-backend-migration-design.md


Task 1: Add go-ml to go.mod

go-ml is in the replace block but not in the require block. The Go toolchain will refuse to build any import of it until it appears in require.

Files:

  • Modify: go.mod

Step 1: Add go-ml to require block

Add this line to the first require block in go.mod, between go-inference and go-duckdb:

forge.lthn.ai/core/go-ml v0.0.0-00010101000000-000000000000

The version doesn't matter because the replace directive overrides it.
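To make the interplay concrete, the relevant go.mod fragment looks roughly like this (the replace target path is a placeholder; keep whatever the existing replace block already contains):

```
require (
	forge.lthn.ai/core/go-ml v0.0.0-00010101000000-000000000000
	// ...other requires...
)

// Existing entry. The placeholder version above is never resolved
// remotely because this redirect takes precedence.
replace forge.lthn.ai/core/go-ml => ../go-ml
```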

Step 2: Run go mod tidy

Run: cd /Users/snider/Code/LEM && go mod tidy

This will resolve the version and pull in any transitive deps from go-ml.

Step 3: Verify build still works

Run: cd /Users/snider/Code/LEM && go build ./...

Expected: Clean build (go-ml is now available but not yet imported)

Step 4: Commit

cd /Users/snider/Code/LEM
git add go.mod go.sum
git commit -m "$(cat <<'EOF'
chore: add go-ml to go.mod require block

Prerequisite for distill migration from raw go-inference to
go-ml Backend interface with memory management.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
EOF
)"

Task 2: Add memory config fields to DistillConfig

Add CacheLimit and MemoryLimit fields to DistillConfig in config.go, and add corresponding YAML entries to ai.yaml.

Files:

  • Modify: pkg/lem/config.go:38-42
  • Modify: .core/ai/ai.yaml:27-29

Step 1: Add fields to DistillConfig

In pkg/lem/config.go, replace the DistillConfig struct (lines 39-42):

// DistillConfig holds distillation defaults.
type DistillConfig struct {
	Runs        int `yaml:"runs"`
	MinChars    int `yaml:"min_chars"`
	CacheLimit  int `yaml:"cache_limit"`  // Metal cache limit in GB (0 = no limit)
	MemoryLimit int `yaml:"memory_limit"` // Metal memory limit in GB (0 = no limit)
}

Step 2: Add YAML entries to ai.yaml

In .core/ai/ai.yaml, replace the distill: block (lines 27-29):

# Distillation defaults.
distill:
  runs: 3                  # Generations per probe (best kept)
  min_chars: 20            # Reject responses shorter than this
  cache_limit: 8           # Metal cache limit in GB (0 = no limit)
  memory_limit: 16         # Metal memory limit in GB (0 = no limit)

Step 3: Verify build

Run: cd /Users/snider/Code/LEM && go build ./...

Expected: Clean build

Step 4: Commit

cd /Users/snider/Code/LEM
git add pkg/lem/config.go .core/ai/ai.yaml
git commit -m "$(cat <<'EOF'
feat(distill): add Metal memory limit config fields

CacheLimit (8GB) and MemoryLimit (16GB) in DistillConfig control
mlx.SetCacheLimit/SetMemoryLimit before model load. Conservative
defaults for 1B model on 96GB machine.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
EOF
)"

Task 3: Add --cache-limit and --mem-limit flags to RunDistill

Wire the new config fields into CLI flags so they can be overridden per-run.

Files:

  • Modify: pkg/lem/distill.go:38-51 (flag parsing section)

Step 1: Add flags after existing flag declarations

In pkg/lem/distill.go, add these two flags after the root flag (after line 47, before fs.Parse):

	cacheLimit := fs.Int("cache-limit", 0, "Metal cache limit in GB (0 = use ai.yaml default)")
	memLimit := fs.Int("mem-limit", 0, "Metal memory limit in GB (0 = use ai.yaml default)")

Step 2: Add flag-to-config merge after existing overrides

After the *runs override block (after line 71), add:

	cacheLimitGB := aiCfg.Distill.CacheLimit
	if *cacheLimit > 0 {
		cacheLimitGB = *cacheLimit
	}
	memLimitGB := aiCfg.Distill.MemoryLimit
	if *memLimit > 0 {
		memLimitGB = *memLimit
	}

Step 3: Add memory limits to dry-run output

In the dry-run block, after the Generate: line (after line 121), add:

		fmt.Printf("Memory:   cache=%dGB limit=%dGB\n", cacheLimitGB, memLimitGB)

Step 4: Verify build

Run: cd /Users/snider/Code/LEM && go build ./...

Expected: Clean build (flags are parsed but not yet used for model loading)

Step 5: Commit

cd /Users/snider/Code/LEM
git add pkg/lem/distill.go
git commit -m "$(cat <<'EOF'
feat(distill): add --cache-limit and --mem-limit flags

Override ai.yaml memory config per-run. Values in GB.
Not yet wired to model loading.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
EOF
)"

Task 4: Replace inference.LoadModel with ml.NewMLXBackend

The core migration: swap inference.LoadModel() + raw iter.Seq for ml.NewMLXBackend() + backend.Chat(). This is the biggest task.

Files:

  • Modify: pkg/lem/distill.go (imports, model loading, inference loop, metrics)

Step 1: Update imports

Replace the import block (lines 3-16) with:

import (
	"context"
	"encoding/json"
	"flag"
	"fmt"
	"log"
	"os"
	"path/filepath"
	"runtime"
	"strings"
	"time"

	"forge.lthn.ai/core/go-i18n/reversal"
	ml "forge.lthn.ai/core/go-ml"
	"forge.lthn.ai/core/go-mlx"
)

Key changes:

  • Remove "forge.lthn.ai/core/go-inference"
  • Add ml "forge.lthn.ai/core/go-ml" (named import to avoid collision with the package name)
  • Add "forge.lthn.ai/core/go-mlx" (for mlx.SetCacheLimit, mlx.SetMemoryLimit)
  • Add "runtime" (for runtime.GC())

Step 2: Replace model loading with memory-managed backend

Replace the model loading block (lines 138-147):

	// Set Metal memory limits before loading model.
	if cacheLimitGB > 0 {
		mlx.SetCacheLimit(uint64(cacheLimitGB) * 1024 * 1024 * 1024)
		log.Printf("metal cache limit: %dGB", cacheLimitGB)
	}
	if memLimitGB > 0 {
		mlx.SetMemoryLimit(uint64(memLimitGB) * 1024 * 1024 * 1024)
		log.Printf("metal memory limit: %dGB", memLimitGB)
	}

	// Load model via go-ml Backend (wraps go-inference with memory management).
	log.Printf("loading model: %s", modelCfg.Paths.Base)
	backend, err := ml.NewMLXBackend(modelCfg.Paths.Base)
	if err != nil {
		log.Fatalf("load model: %v", err)
	}
	defer backend.Close()

	log.Printf("model loaded via %s backend", backend.Name())

Note: backend.Close() replaces model.Close(). We lose model.Info() for the architecture log line — that's fine, NewMLXBackend already logs arch/layers/quant via slog.

Step 3: Build GenOpts from merged config

Add this after the model loading block, before the tokeniser init (before the tok := reversal.NewTokeniser() line):

	// Build generation options from merged config.
	genOpts := ml.GenOpts{
		MaxTokens:     genCfg.MaxTokens,
		Temperature:   genCfg.Temperature,
		TopP:          genCfg.TopP,
		TopK:          genCfg.TopK,
		RepeatPenalty: genCfg.RepeatPenalty,
	}

Step 4: Replace the inference loop

Replace the inner inference block (lines 178-201):

Old code (lines 178-201):

			// Inference uses bare probe — the model generates from its weights.
			// Sandwich wrapping is only for the training output format.
			messages := []inference.Message{
				{Role: "user", Content: probe.Prompt},
			}

			// Generate via native Metal inference.
			start := time.Now()
			var sb strings.Builder
			for token := range model.Chat(ctx, messages,
				inference.WithMaxTokens(genCfg.MaxTokens),
				inference.WithTemperature(float32(genCfg.Temperature)),
				inference.WithTopP(float32(genCfg.TopP)),
				inference.WithTopK(genCfg.TopK),
				inference.WithRepeatPenalty(float32(genCfg.RepeatPenalty)),
			) {
				sb.WriteString(token.Text)
			}
			if err := model.Err(); err != nil {
				fmt.Fprintf(os.Stderr, " → ERROR: %v\n", err)
				continue
			}
			response := sb.String()
			elapsed := time.Since(start)

New code:

			// Inference uses bare probe — the model generates from its weights.
			// Sandwich wrapping is only for the training output format.
			messages := []ml.Message{
				{Role: "user", Content: probe.Prompt},
			}

			// Generate via go-ml Backend (memory-managed Metal inference).
			start := time.Now()
			result, err := backend.Chat(ctx, messages, genOpts)
			if err != nil {
				fmt.Fprintf(os.Stderr, " → ERROR: %v\n", err)
				continue
			}
			response := result.Text
			elapsed := time.Since(start)

Step 5: Replace metrics access

Replace the metrics line (line 214):

Old:

			met := model.Metrics()
			fmt.Fprintf(os.Stderr, " → %d chars, g=%.1f up=%+.1f echo=%.2f enr=%+.1f, %.1fs (%.0f tok/s)\n",
				len(response), grammar.Composite,
				delta.Uplift, delta.Echo, delta.Enrichment,
				elapsed.Seconds(), met.DecodeTokensPerSec)

New:

			tokPerSec := 0.0
			if result.Metrics != nil {
				tokPerSec = result.Metrics.DecodeTokensPerSec
			}
			fmt.Fprintf(os.Stderr, " → %d chars, g=%.1f up=%+.1f echo=%.2f enr=%+.1f, %.1fs (%.0f tok/s)\n",
				len(response), grammar.Composite,
				delta.Uplift, delta.Echo, delta.Enrichment,
				elapsed.Seconds(), tokPerSec)

Step 6: Add runtime.GC() after each probe

After the quality gate block's closing brace (after line 257 — the closing } of the if best != nil / else block), add:


		// Release GPU memory between probes to prevent incremental leak.
		runtime.GC()

Step 7: Update the summary footer

Replace the model info line in the summary (line 263):

Old:

	fmt.Fprintf(os.Stderr, "Model:    %s (%s)\n", modelCfg.Name, info.Architecture)

New:

	fmt.Fprintf(os.Stderr, "Model:    %s (%s)\n", modelCfg.Name, backend.Name())

Step 8: Verify build

Run: cd /Users/snider/Code/LEM && go build ./...

Expected: Clean build. No remaining references to go-inference in distill.go.

Step 9: Verify no stale inference imports

Run: grep -n 'go-inference' /Users/snider/Code/LEM/pkg/lem/distill.go

Expected: No output (import fully removed)

Step 10: Commit

cd /Users/snider/Code/LEM
git add pkg/lem/distill.go
git commit -m "$(cat <<'EOF'
feat(distill): migrate from go-inference to go-ml Backend

Replace inference.LoadModel() with ml.NewMLXBackend() which wraps
the same Metal model with memory management (SetCacheLimit,
SetMemoryLimit). Replace raw iter.Seq token loop with backend.Chat()
returning Result{Text, Metrics}. Add runtime.GC() between probes
to prevent incremental memory leak.

Reference: go-ml/cmd/cmd_ab.go memory management pattern.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
EOF
)"

Task 5: Update backend_metal.go

backend_metal.go currently blank-imports go-mlx to register the Metal backend. Since ml.NewMLXBackend() (which we now call from distill.go) already does this import via go-ml/backend_mlx.go, the LEM-side blank import may be redundant. However, keep it for safety — it ensures the Metal backend is registered even if distill.go isn't the only consumer.

Files:

  • Modify: pkg/lem/backend_metal.go

Step 1: Verify the file is still needed

Read pkg/lem/backend_metal.go. It should contain:

//go:build darwin && arm64

package lem

import _ "forge.lthn.ai/core/go-mlx"

This is still valid. go-mlx registers itself via init(), and ml.NewMLXBackend() also imports it. The double import is harmless (Go deduplicates). No change needed here — leave as-is.

Step 2: Verify build on darwin/arm64

Run: cd /Users/snider/Code/LEM && go build ./...

Expected: Clean build

No commit needed — no changes.


Task 6: Run go mod tidy and verify

After all code changes, clean up the dependency graph.

Files:

  • Modify: go.mod, go.sum

Step 1: Run go mod tidy

Run: cd /Users/snider/Code/LEM && go mod tidy

This may remove go-inference from the direct require block if distill.go was the only direct consumer. Check: backend_metal.go imports go-mlx (not go-inference), and no other .go files in pkg/lem/ import go-inference directly.

Step 2: Check if go-inference moved to indirect

Run: grep 'go-inference' /Users/snider/Code/LEM/go.mod

Expected: Either removed entirely (if go-ml pulls it transitively) or moved to // indirect. Either is correct.

Step 3: Full build

Run: cd /Users/snider/Code/LEM && go build ./...

Expected: Clean build

Step 4: Run go vet

Run: cd /Users/snider/Code/LEM && go vet ./...

Expected: Clean (no issues)

Step 5: Commit if go.mod/go.sum changed

cd /Users/snider/Code/LEM
git add go.mod go.sum
git commit -m "$(cat <<'EOF'
chore: go mod tidy after distill migration

go-inference moves to indirect (pulled transitively via go-ml).
go-ml is now a direct dependency.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
EOF
)"

Task 7: Smoke test with --dry-run

Verify the full flag pipeline works end-to-end without loading a model.

Files: None (test only)

Step 1: Build the lem binary

Run: cd /Users/snider/Code/LEM && go build -o lem .

Expected: Binary built successfully

Step 2: Run dry-run

Run: cd /Users/snider/Code/LEM && ./lem distill --model gemma3/1b --probes core --dry-run

Expected output (approximate):

Model:    gemma-3-1b-it (path...)
Backend:  metal
Probes:   101
Runs:     3 per probe (303 total generations)
Gate:     grammar v3 composite >= 40.0
Generate: temp=0.80 max_tokens=4096 top_p=0.95
Memory:   cache=8GB limit=16GB
Output:   (path to lesson file)

  core-001: ...
  core-002: ...
  ... and 91 more

Key checks:

  • Memory: line appears with values from ai.yaml (8/16)
  • No crash, no import errors

Step 3: Test flag override

Run: cd /Users/snider/Code/LEM && ./lem distill --model gemma3/1b --probes core --dry-run --cache-limit 4 --mem-limit 8

Expected: Memory: cache=4GB limit=8GB (flag overrides config)

No commit needed — test only.


Task 8: Live inference test (optional, requires GPU)

Only run this if on a machine with the model downloaded and Metal GPU available.

Files: None (test only)

Step 1: Run a single probe with memory limits

Run:

cd /Users/snider/Code/LEM
./lem distill --model gemma3/1b --probes core --runs 1 --cache-limit 8 --mem-limit 16 2>&1 | head -30

Expected:

  • Model loads with memory limit logs
  • First probe generates, shows tok/s
  • No memory pressure red zone
  • runtime.GC() runs between probes (no visible output, but memory stays bounded)

Step 2: Monitor memory

In a separate terminal: watch -n1 'sysctl hw.memsize; vm_stat | head -5'

Or check Activity Monitor → Memory Pressure. Should stay green/yellow, not red.

No commit needed — test only.


Summary of Changes

| File | Change |
| --- | --- |
| go.mod | Add go-ml to require; go-inference moves to indirect |
| go.sum | Updated transitively |
| pkg/lem/config.go:39-42 | Add CacheLimit, MemoryLimit to DistillConfig |
| .core/ai/ai.yaml:27-29 | Add cache_limit: 8, memory_limit: 16 |
| pkg/lem/distill.go | Full migration: imports, model loading, inference loop, metrics, GC |
| pkg/lem/backend_metal.go | No change (blank import still valid) |

What Stays the Same

  • Grammar v3 scoring (go-i18n/reversal) — unchanged
  • Sandwich output format — unchanged
  • Bare probe inference (model sees probe only) — unchanged
  • Best-of-N selection — unchanged
  • Quality gate — unchanged
  • All probe loading, config merging, output writing — unchanged
  • main.go routing — unchanged