go-ml/FINDINGS.md
Snider 58854390eb docs: add Phase 3 agent.go structure audit to FINDINGS.md
Documented 6 logical sections, hardcoded infrastructure,
and testable functions (AdapterMeta, FindUnscored, buffer).

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-20 02:13:06 +00:00

# FINDINGS.md — go-ml Research & Discovery
## 2026-02-19: Split from go-ai (Virgil)
### Origin
Split from go-ai on 19 Feb 2026. Was `ai/ml/` subpackage inside `forge.lthn.ai/core/go-ai`. Zero internal go-ai dependencies — imports go-mlx (external module) and core/go framework only.
### What Was Extracted
- 41 Go files (~7,494 LOC excluding tests)
- 6 test files (backend_http, exact, heuristic, judge, probes, score)
- ml/ was 53% of go-ai's total LOC. After extraction, go-ai drops from ~14K to ~3.4K LOC (ai/ facade + mcp/ hub).
### Dependencies
- `forge.lthn.ai/core/go-mlx` — Metal GPU inference (backend_mlx.go, darwin/arm64 only)
- `forge.lthn.ai/core/go-inference` — Shared TextModel/Backend/Token interfaces (target for Phase 1)
- `forge.lthn.ai/core/go` — Framework services, process management, logging
- `github.com/marcboeker/go-duckdb` — Analytics storage
- `github.com/parquet-go/parquet-go` — Columnar data I/O
- `github.com/stretchr/testify` — Test assertions
### Consumers
- `go-ai/mcp/tools_ml.go` — Exposes ML as MCP tools (uses `ml.Service`, `ml.GenOpts`, `ml.Backend`)
- LEM Lab — Uses MLXBackend for chat inference
- go-i18n Phase 2a — Needs 5K sentences/sec Gemma3-1B classification (blocked on go-inference)
## go-inference Interface Mapping
### Type Correspondence
| go-ml | go-inference | Notes |
|-------|-------------|-------|
| `ml.Backend` | `inference.Backend` | Different semantics: ml returns string, inference returns TextModel |
| `ml.StreamingBackend` | (built into TextModel) | iter.Seq[Token] is inherently streaming |
| `ml.GenOpts` | `inference.GenerateConfig` | Use functional options: `WithMaxTokens(n)` etc. |
| `ml.Message` | `inference.Message` | Identical struct: Role + Content |
| `ml.TokenCallback` | (not needed) | iter.Seq[Token] replaces callbacks |
| (no equivalent) | `inference.Token` | `{ID int32, Text string}` |
| (no equivalent) | `inference.TextModel` | Generate/Chat return iter.Seq[Token] |
### Method Mapping
```
ml.Backend.Generate(ctx, prompt, GenOpts) → (string, error)
↕ InferenceAdapter collects tokens
inference.TextModel.Generate(ctx, prompt, ...GenerateOption) → iter.Seq[Token]
ml.StreamingBackend.GenerateStream(ctx, prompt, opts, TokenCallback) → error
↕ InferenceAdapter forwards tokens to callback
inference.TextModel.Generate(ctx, prompt, ...GenerateOption) → iter.Seq[Token]
ml.GenOpts{Temperature: 0.7, MaxTokens: 2048}
↕ convertOpts helper
inference.WithTemperature(0.7), inference.WithMaxTokens(2048)
```
### backend_mlx.go Before/After
**Before** (253 LOC — BROKEN, old subpackage imports):
```go
import (
	"forge.lthn.ai/core/go-mlx"
	"forge.lthn.ai/core/go-mlx/cache"     // REMOVED
	"forge.lthn.ai/core/go-mlx/model"     // REMOVED
	"forge.lthn.ai/core/go-mlx/sample"    // REMOVED
	"forge.lthn.ai/core/go-mlx/tokenizer" // REMOVED
)

type MLXBackend struct {
	model   model.Model
	tok     *tokenizer.Tokenizer
	caches  []cache.Cache
	sampler sample.Sampler
	// ... manual tokenisation, KV cache mgmt, sampling loop, memory cleanup
}
```
**After** (~60 LOC — uses go-inference + InferenceAdapter):
```go
import (
	"fmt"

	"forge.lthn.ai/core/go-inference"
	_ "forge.lthn.ai/core/go-mlx" // registers "metal" backend via init()
)

func NewMLXBackend(modelPath string) (*InferenceAdapter, error) {
	m, err := inference.LoadModel(modelPath)
	if err != nil {
		return nil, fmt.Errorf("mlx: %w", err)
	}
	return &InferenceAdapter{model: m, name: "mlx"}, nil
}
```
All tokenisation, KV cache management, sampling, and memory management are now handled inside go-mlx's `internal/metal/` package, accessed through the go-inference `TextModel` interface.
## Scoring Engine Architecture
### 5 Suites
| Suite | Method | LLM needed? | Metrics |
|-------|--------|-------------|---------|
| **Heuristic** | Regex + word analysis | No | 9 metrics → LEK composite |
| **Semantic** | LLM-as-judge | Yes | 4 dimensions (sovereignty, ethical, creative, self-concept) |
| **Content** | LLM-as-judge | Yes | 6 sovereignty probes (CCP, truth, engagement, etc.) |
| **Standard** | LLM-as-judge | Yes | TruthfulQA, DoNotAnswer, Toxigen |
| **Exact** | Numeric extraction | No | GSM8K answer matching |
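The Exact suite's numeric extraction can be illustrated with a short sketch. The regex and matching rules here are illustrative assumptions, not the exact logic in exact.go:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// numRe matches integers or decimals, with optional comma grouping.
var numRe = regexp.MustCompile(`-?\d[\d,]*(?:\.\d+)?`)

// extractFinalNumber returns the last number in the model's output,
// which for GSM8K-style answers is usually the final result.
func extractFinalNumber(s string) (string, bool) {
	matches := numRe.FindAllString(s, -1)
	if len(matches) == 0 {
		return "", false
	}
	return strings.ReplaceAll(matches[len(matches)-1], ",", ""), true
}

// exactMatch scores 1.0 when the extracted number equals the gold answer.
func exactMatch(output, gold string) float64 {
	got, ok := extractFinalNumber(output)
	if ok && got == gold {
		return 1.0
	}
	return 0.0
}

func main() {
	out := "She sells 16 - 3 - 4 = 9 eggs daily, so she makes 9 * 2 = $18."
	fmt.Println(exactMatch(out, "18")) // 1
}
```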
### LEK Score Formula
```
LEK = EngagementDepth×2 + CreativeForm×3 + EmotionalRegister×2 + FirstPerson×1.5
- ComplianceMarkers×5 - FormulaicPreamble×3 - Degeneration×4 - EmptyBroken×20
```
Positive signals: engagement depth, creative form, emotional register, first-person voice.
Negative signals: RLHF compliance markers, formulaic preambles, text degeneration, empty/broken output.
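As a sketch, the composite reduces to one weighted sum. Field names follow the formula above; the actual struct in the heuristic scorer may be shaped differently:

```go
package main

import "fmt"

// HeuristicMetrics holds the signals named in the LEK formula.
// Field names follow the formula; score.go's real struct may differ.
type HeuristicMetrics struct {
	EngagementDepth   float64
	CreativeForm      float64
	EmotionalRegister float64
	FirstPerson       float64
	ComplianceMarkers float64
	FormulaicPreamble float64
	Degeneration      float64
	EmptyBroken       float64
}

// LEK computes the composite exactly as written in the formula block.
func (m HeuristicMetrics) LEK() float64 {
	return m.EngagementDepth*2 + m.CreativeForm*3 + m.EmotionalRegister*2 + m.FirstPerson*1.5 -
		m.ComplianceMarkers*5 - m.FormulaicPreamble*3 - m.Degeneration*4 - m.EmptyBroken*20
}

func main() {
	m := HeuristicMetrics{EngagementDepth: 1, CreativeForm: 1, FirstPerson: 1, ComplianceMarkers: 1}
	fmt.Println(m.LEK()) // 2 + 3 + 1.5 - 5 = 1.5
}
```

Note how the `EmptyBroken×20` penalty dominates: a single empty/broken output wipes out any positive signal.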
### Concurrency Model
`Engine.ScoreAll()` fans out goroutines bounded by semaphore (`concurrency` setting). Heuristic runs inline (instant). Semantic/content/standard run via worker pool with `sync.WaitGroup`. Results collected into `[]PromptScore` via mutex.
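That model can be sketched as follows; signatures are illustrative, and the real `Engine.ScoreAll()` operates on richer types than a plain string scorer:

```go
package main

import (
	"fmt"
	"sync"
)

// PromptScore is a stand-in for the real result type.
type PromptScore struct {
	Prompt string
	Score  float64
}

// scoreAll fans out one goroutine per prompt, bounded by a semaphore
// of `concurrency` slots, and collects results under a mutex — the
// shape described for Engine.ScoreAll (sketch; real signature differs).
func scoreAll(prompts []string, concurrency int, score func(string) float64) []PromptScore {
	var (
		mu      sync.Mutex
		wg      sync.WaitGroup
		results []PromptScore
		sem     = make(chan struct{}, concurrency)
	)
	for _, p := range prompts {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot
		go func(p string) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			s := score(p)
			mu.Lock()
			results = append(results, PromptScore{Prompt: p, Score: s})
			mu.Unlock()
		}(p)
	}
	wg.Wait()
	return results
}

func main() {
	out := scoreAll([]string{"a", "bb", "ccc"}, 2, func(p string) float64 { return float64(len(p)) })
	fmt.Println(len(out)) // 3
}
```

Because results arrive in completion order, any consumer that needs prompt order must sort afterwards.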
## Phase 2 Audit: StreamingBackend Usage (Virgil, 20 Feb 2026)
### Callers of GenerateStream/ChatStream
Only 2 files across the entire ecosystem call StreamingBackend methods:
1. **`host-uk/cli/cmd/ml/cmd_serve.go`** (lines 146, 201, 319)
- Type-asserts `backend.(ml.StreamingBackend)` for SSE streaming
- `/v1/completions` → `streamer.GenerateStream()` (line 201)
- `/v1/chat/completions` → `streamer.ChatStream()` (line 319)
- Has non-streaming fallback: `backend.Generate()` when assertion fails
2. **`host-uk/cli/cmd/ml/cmd_chat.go`**
- Direct `ChatStream()` call for terminal token-by-token echo
- No fallback — assumes backend supports streaming
### Non-streaming consumers (use Backend.Generate only)
| File | Method | Notes |
|------|--------|-------|
| service.go | `Backend.Generate()` | Backend registry dispatch |
| judge.go | `Backend.Generate()` | Via judgeChat() |
| agent.go | `Backend.Generate()` | Probe evaluation |
| expand.go | `Backend.Generate()` | Prompt expansion |
| go-ai/mcp/tools_ml.go | `ml.Service` | Via service layer |
### Backend Implementation Status
| Backend | Backend? | StreamingBackend? | Notes |
|---------|----------|-------------------|-------|
| InferenceAdapter | YES | YES | Bridges iter.Seq[Token] → callbacks |
| HTTPBackend | YES | NO | Returns complete string from API |
| LlamaBackend | YES | NO | Returns complete string via HTTP |
### Conclusion
StreamingBackend is only needed by `host-uk/cli` (2 files, out of go-ml scope). Safe to deprecate in go-ml with a comment. The actual migration of those CLI files is a separate task for the cli repo.
### GenOpts vs GenerateConfig Field Comparison
| ml.GenOpts | inference.GenerateConfig | Type |
|-----------|--------------------------|------|
| Temperature | Temperature | float64 vs float32 |
| MaxTokens | MaxTokens | int (same) |
| Model | (none) | string |
| (none) | TopK | int |
| (none) | TopP | float32 |
| (none) | StopTokens | []int32 |
| (none) | RepeatPenalty | float32 |
| (none) | ReturnLogits | bool |
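The `convertOpts` helper named in the method mapping could look roughly like this. The option constructors mirror the names cited above (`WithTemperature`, `WithMaxTokens`); the config and option types are reproduced locally as assumptions, and the full go-inference option set is wider:

```go
package main

import "fmt"

// GenOpts mirrors go-ml's options struct (fields from the table).
type GenOpts struct {
	Temperature float64
	MaxTokens   int
	Model       string // no go-inference equivalent; handled by the caller
}

// GenerateConfig and the functional options mirror go-inference's
// shape as described above (sketch of the real API).
type GenerateConfig struct {
	Temperature float32
	MaxTokens   int
}

type GenerateOption func(*GenerateConfig)

func WithTemperature(t float32) GenerateOption { return func(c *GenerateConfig) { c.Temperature = t } }
func WithMaxTokens(n int) GenerateOption       { return func(c *GenerateConfig) { c.MaxTokens = n } }

// convertOpts translates GenOpts into functional options, narrowing
// Temperature from float64 to float32 as the table notes.
func convertOpts(o GenOpts) []GenerateOption {
	var opts []GenerateOption
	if o.Temperature != 0 {
		opts = append(opts, WithTemperature(float32(o.Temperature)))
	}
	if o.MaxTokens != 0 {
		opts = append(opts, WithMaxTokens(o.MaxTokens))
	}
	return opts
}

func main() {
	var cfg GenerateConfig
	for _, opt := range convertOpts(GenOpts{Temperature: 0.7, MaxTokens: 2048}) {
		opt(&cfg)
	}
	fmt.Println(cfg.Temperature, cfg.MaxTokens)
}
```

Treating zero values as "unset" keeps backend defaults in play; `Model` stays a caller concern since GenerateConfig has no counterpart.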
## Known Issues
- ~~**backend_mlx.go imports dead subpackages**~~ — FIXED in Phase 1 (`c3c2c14`)
- **agent.go too large** — 1,070 LOC, SSH + InfluxDB + scoring + publishing mixed together
- **Hardcoded infrastructure** — InfluxDB endpoint `10.69.69.165:8181`, M3 SSH details in agent.go
- **No tests for backend_llama and backend_mlx** — Only backend_http_test.go exists
- **score.go concurrency untested** — No race condition tests
- ~~**Message type duplication**~~ — FIXED in Phase 2 (`747e703`): type alias `Message = inference.Message`
## Phase 3 Audit: agent.go Structure (Virgil, 20 Feb 2026)
### File Layout (1,070 LOC)
| Section | Lines | LOC | Purpose |
|---------|-------|-----|---------|
| Types & Config | 19-112 | ~95 | `AgentConfig`, `Checkpoint`, config maps, `AdapterMeta()` |
| Main Loop | 141-343 | ~200 | `RunAgentLoop()`, checkpoint discovery, unscored filtering |
| Evaluation | 345-700 | ~355 | MLX-native + conversion paths, 4 probe functions |
| Judge & Push | 708-887 | ~180 | Scoring, InfluxDB line protocol, DuckDB dual-write |
| Buffering | 926-977 | ~50 | JSONL buffer for InfluxDB failures |
| SSH/SCP | 979-1070 | ~90 | `SSHCommand()`, `SCPFrom()`, `SCPTo()`, utility helpers |
### Hardcoded Infrastructure
- SSH options duplicated across 3 functions: `ConnectTimeout=10, BatchMode=yes, StrictHostKeyChecking=no`
- InfluxDB timestamp base: `1739577600` (15 Feb 2025 00:00 UTC)
- InfluxDB measurements: `probe_score`, `capability_score`, `capability_judge`, `content_score`
- DuckDB tables: `checkpoint_scores`, `probe_results`
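One low-risk cleanup is centralising the SSH options duplicated across the three functions. A sketch, with hypothetical helper names and the option values from the audit above:

```go
package main

import "fmt"

// sshBaseArgs centralises the options currently duplicated across
// SSHCommand, SCPFrom, and SCPTo (values from the audit above).
func sshBaseArgs() []string {
	return []string{
		"-o", "ConnectTimeout=10",
		"-o", "BatchMode=yes",
		"-o", "StrictHostKeyChecking=no",
	}
}

// sshCommand builds the argv for an ssh invocation against a host —
// a sketch of how agent.go's three copies could share one source.
func sshCommand(host, remoteCmd string) []string {
	args := append([]string{"ssh"}, sshBaseArgs()...)
	return append(args, host, remoteCmd)
}

func main() {
	fmt.Println(sshCommand("m3", "ls"))
}
```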
### Test Coverage
Zero tests for agent.go. Testable without infrastructure:
- `AdapterMeta()` — pure function, dirname → metadata
- `FindUnscored()` — filtering logic
- `BufferInfluxResult()`/`ReplayInfluxBuffer()` — JSONL round-trip