Commit graph

72 commits

c1baeb9254 Merge pull request 'chore: Go 1.26 modernization' (#2) from chore/go-1.26-modernization into main
All checks were successful
2026-02-24 18:01:47 +00:00
Claude
eeec1fe372 chore: use slices.Sorted(maps.Keys()) for ordered iteration
All checks were successful
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 16:32:47 +00:00
Claude
85775dc082 chore: use range-over-integer (Go 1.22+)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 16:32:00 +00:00
Claude
694e78ca34 chore: sort.Slice → slices.SortFunc
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 16:29:48 +00:00
Claude
9f6dd9d4eb chore: fmt.Errorf(static) → errors.New
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 16:28:27 +00:00
Claude
421d0c42ff docs: archive completed backend-abstraction and batch-inference plans
All checks were successful
Move both plans to docs/plans/completed/ with summaries. MLX backend
implements shared interfaces and batch inference at 5K sentences/sec.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-24 13:51:21 +00:00
Snider
c0f07478c8 docs: document InspectAttention KV cache extraction in architecture guide
All checks were successful
Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-23 12:34:31 +00:00
Snider
f7553eb277 chore: bump go-inference to v0.0.3
All checks were successful
Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-23 06:59:59 +00:00
Snider
30eee2525f chore: add Go repo norms (badges, contributing, lint, taskfile, editorconfig)
Some checks failed: Test / Vet & Build
Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-23 06:45:42 +00:00
Snider
51ac442a09 fix: add deterministic GPU memory cleanup across inference paths
Some checks failed: Test / Vet & Build
- defer freeCaches() in Generate and InspectAttention
- Free orphaned arrays during KVCache growth and slice updates
- Free per-token scalar intermediates in samplers and ops
- Free intermediate arrays in applyRepeatPenalty

Found by 3-way review: Claude explorer, Codex (gpt-5.3), Gemini Ultra.
Gemini implemented the fixes.

Co-Authored-By: Gemini <noreply@google.com>
Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-23 05:08:02 +00:00
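The deterministic-cleanup pattern from the fix above can be sketched with deferred release of GPU-backed resources, so every exit path, including early error returns, frees memory. The `cache` type and `generate` function are hypothetical stand-ins for the real `freeCaches`/`Generate` in the metal package.

```go
package main

import (
	"errors"
	"fmt"
)

// cache stands in for a GPU-backed KV cache; Free is the
// deterministic release hook (hypothetical type).
type cache struct{ freed bool }

func (c *cache) Free() { c.freed = true }

func generate(caches []*cache) error {
	// defer ensures GPU memory is released on every exit path,
	// including early returns on inference errors
	defer func() {
		for _, c := range caches {
			c.Free()
		}
	}()
	return errors.New("simulated GPU error")
}

func main() {
	caches := []*cache{{}, {}}
	_ = generate(caches)
	fmt.Println(caches[0].freed, caches[1].freed) // true true
}
```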
Claude
208f76b067 ci: add Forgejo Actions test and security scan workflows
Some checks failed: Test / Vet & Build
Uses reusable workflows from core/go-devops for Go testing
(with race detector and coverage) and security scanning
(govulncheck, gitleaks, trivy).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-23 03:28:55 +00:00
Snider
c2177f754a feat: implement AttentionInspector via KV cache extraction after prefill
Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-23 00:37:29 +00:00
Claude
9a48774538 chore: sync workspace dependency versions
Run go work sync to align dependency versions across workspace.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-22 21:41:57 +00:00
Snider
5004ac258a refactor: apply go fix modernizers for Go 1.26
Automated fixes: interface{} → any, range-over-int, t.Context(),
wg.Go(), strings.SplitSeq, strings.Builder, slices.Contains,
maps helpers, min/max builtins.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-22 21:00:16 +00:00
Snider
fc27c2cd27 chore: go mod tidy for 1.26.0
Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-22 20:35:59 +00:00
Snider
60f9a5b740 chore: bump go directive to 1.26.0
Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-22 20:33:48 +00:00
Snider
2f0be488dd chore: pin forge deps to v0.0.1 tags for Go 1.26 compat
Go 1.26 rejects non-semver version strings (like 'main') in go.mod.
Tags v0.0.1 now exist on all forge repos — workspace still overrides
for local development.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-22 20:15:06 +00:00
Snider
094d82f535 chore: use workspace-resolved versions, drop replace directives
Forge module versions now use main branch resolution via ~/Code/go.work
workspace. Removes local replace directives — the central go.work handles
all cross-repo resolution during development.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-22 19:49:36 +00:00
Snider
2292557fd6 chore: vendor MLX C headers for Go module consumers
dist/include/ contains the MLX and MLX-C headers needed for CGo
compilation. Without these, go-mlx cannot be used as a module
dependency (headers not found in module cache).

Libraries (dylib/metallib) are still gitignored — users build
those locally via cmake.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-21 19:14:04 +00:00
Snider
1c2f13fd9d chore: resolve go-inference from forge, remove local replace
Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-21 19:08:51 +00:00
Snider
98acc11e23 docs: add README with quick start and docs links
Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-20 15:11:20 +00:00
Snider
1ea90b03b4 docs: graduate TODO/FINDINGS into production documentation
Replace internal task tracking with structured docs covering CGO/mlx-c
architecture, 4 model architectures, training pipeline, mlxlm backend,
development guide, and full project history across 5 phases.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-20 15:03:39 +00:00
Snider
015ebbc2ac chore: add gitignore entries for IDE/build artifacts
Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-20 09:15:25 +00:00
Snider
757a241f59 feat(mlxlm): Phase 5.5 — subprocess backend using Python mlx-lm
Implements inference.Backend via a Python subprocess communicating
over JSON Lines (stdin/stdout). No CGO required — pure Go + os/exec.

- bridge.py: embedded Python script wrapping mlx_lm.load() and
  mlx_lm.stream_generate() with load/generate/chat/info/cancel/quit
  commands. Flushes stdout after every JSON line for streaming.

- backend.go: Go subprocess manager. Extracts bridge.py from
  go:embed to temp file, spawns python3, pipes JSON requests.
  mlxlmModel implements full TextModel interface with mutex-
  serialised Generate/Chat, context cancellation with drain,
  and 2-second graceful Close with kill fallback.
  Auto-registers as "mlx_lm" via init(). Build tag: !nomlxlm.

- backend_test.go: 15 tests using mock_bridge.py (no mlx_lm needed):
  name, load, generate, cancel, chat, close, error propagation,
  invalid path, auto-register, concurrent serialisation, classify/
  batch unsupported, info, metrics, max_tokens limiting.

All tests pass with -race. go vet clean.

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-20 09:02:30 +00:00
Snider
887c221974 docs: write detailed mlxlm subprocess backend spec (Phase 5.5)
3 steps: Python bridge script (JSON Lines protocol), Go backend
wrapper (os/exec, iter.Seq streaming), tests with mock subprocess.
Uses stdlib only — no core/go dependency.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-20 08:58:19 +00:00
Snider
d1fb26d51e docs: expand package doc with workflow examples
Cover generate, chat, classify, batch generate, metrics, model info,
discovery, and Metal memory controls.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 23:44:07 +00:00
Snider
d7c8f176f0 docs: mark embed-friendly model loading complete in TODO
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 23:37:58 +00:00
Snider
dd49b4afb6 feat: add model discovery test and update TODO
Discover() finds 20 models across /Volumes/Data/lem — Gemma3 (1B/4B/
12B/27B), DeepSeek R1, Llama 3.1, GPT-OSS. Mark quantisation awareness
and inference metrics complete in TODO.

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 23:37:37 +00:00
Snider
ceb966b66b feat(metal): expose model metadata via Info()
Return architecture, vocab size, layer count, hidden dimension, and
quantisation config (bits + group size) for loaded models.

Gemma3-1B 4-bit: arch=gemma3, vocab=262144, layers=26, hidden=1152,
quant=4-bit/group64.

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 23:36:23 +00:00
Snider
a44e9f5789 feat(metal): add inference metrics (timing, throughput, memory)
Instrument Generate, Classify, and BatchGenerate with:
- Prefill/decode timing (separate phases)
- Token counts (prompt + generated)
- Throughput (tok/s for each phase)
- Peak and active GPU memory via Metal allocator

Wire through metalAdapter.Metrics() to go-inference interface.
Test validates all fields populated after generation.

Gemma3-1B 4-bit on M3 Ultra: prefill 246 tok/s, decode 82 tok/s,
peak 6.2 GB GPU memory.

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 23:34:40 +00:00
Snider
197326bd22 docs: mark batch inference API complete in TODO.md
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 23:28:40 +00:00
Snider
5644857034 feat(metal): implement batch inference (Classify, BatchGenerate)
- Add ForwardMasked to InternalModel, Gemma3 and Qwen3 architectures
- Thread attention mask through decoder layers and SDPA calls
- Use ScaledDotProductAttentionWithMask when explicit mask provided
- Create batch.go with padded batching, mask construction, Classify
  (prefill-only) and BatchGenerate (autoregressive) implementations
- Wire Classify/BatchGenerate through metalAdapter to go-inference
- Tests: mask unit tests (shape, values, multi-batch), Classify with
  4 prompts (152 prompts/s), WithLogits, BatchGenerate with 2 prompts

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 23:28:15 +00:00
Snider
ce1acef462 docs: batch inference API design (Phase 5)
Two new TextModel methods: Classify (prefill-only, fast path for
classification) and BatchGenerate (autoregressive, multi-prompt).
Adds attention masking for padded batches. Primary consumer: go-i18n
Phase 2a domain classification at ~5K sentences/sec.

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 23:18:38 +00:00
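The attention masking for padded batches mentioned above can be sketched as an additive mask: 0 where a token is real, a large negative value where it is padding, so softmax drives padded positions to near-zero weight. Shapes and the actual mask layout in go-mlx may differ; this shows only the idea.

```go
package main

import "fmt"

// buildPadMask returns an additive attention mask for a padded batch
// (hypothetical helper, not the repository's actual function).
func buildPadMask(lengths []int, maxLen int) [][]float32 {
	const negInf = float32(-1e9) // added to logits before softmax
	mask := make([][]float32, len(lengths))
	for i, n := range lengths {
		row := make([]float32, maxLen) // zeros for real tokens
		for j := n; j < maxLen; j++ {
			row[j] = negInf // padding positions get -inf
		}
		mask[i] = row
	}
	return mask
}

func main() {
	// batch of 2 prompts with lengths 2 and 3, padded to 3
	fmt.Println(buildPadMask([]int{2, 3}, 3))
}
```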
Snider
e3fbc221ce feat(metal): add mixed precision training via LoRAConfig.DType (Phase 3)
LoRA A/B matrices can now be created in BFloat16 or Float16 for mixed
precision training. DType field added to LoRAConfig, passed through
ApplyLoRA and NewLoRALinear. MLX auto-promotes for cross-dtype ops.
BFloat16 validated: loss 7.15→6.29, matches Float32 accuracy with
half param memory.

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 23:13:49 +00:00
Snider
fa08ed1e2a test(metal): validate gradient checkpointing with real model (Phase 3)
Checkpoint() wraps forward pass to recompute activations during
backward, trading compute for memory. Verified with Gemma3-1B LoRA
training: produces correct gradients (loss 7.15→7.08, matches
non-checkpointed initial loss). Unit test confirms gradient
correctness on simple function (sum(x^2), grad=[2,4,6]).

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 23:11:15 +00:00
Snider
fb0692baf3 test(metal): add LoRA end-to-end training pipeline test (Phase 3)
Validates full pipeline: load Gemma3-1B → apply LoRA (rank=8, 745K
params across 52 layers) → train 5 steps with cross-entropy loss
(7.15→6.31) → save adapter to safetensors → reload and verify all
weights match. Uses ValueAndGrad for autograd + AdamW optimiser.

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 23:09:16 +00:00
Snider
19c4823b04 feat(metal): add Llama 3 model support (Llama 3.1 8B validated)
Llama shares the Qwen3 loader (same decoder: pre-norm, SwiGLU, GQA).
Model type now detected from config.json model_type field instead of
weight-only heuristic. Llama 3 chat template and EOS token added.
Model tests now clear Metal GPU cache between runs.

Llama 3.1 8B Instruct 4-bit: 30 tok/s on M3 Ultra.

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 23:06:43 +00:00
Snider
535b04d5d6 feat(metal): add Qwen2 model support (DeepSeek R1 validated)
Qwen2 and Qwen3 share the same architecture — Qwen3 adds Q/K RMS
normalization which Qwen2 lacks. The loader auto-detects the variant
from weight presence and reports the correct ModelType().

- Add "qwen2" to architecture dispatch in model.go
- Make Q/K norm optional in attention forward (nil-safe check)
- Store detected model type on Qwen3Model struct
- Add "qwen2" to chat template routing
- DeepSeek R1 7B (4-bit): 27 tok/s on M3 Ultra
- 2 new tests: inference + chat

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 21:55:56 +00:00
Snider
a2493e0242 test(metal): add model loading robustness tests (Phase 2)
24 new tests covering error paths in model loading:
- Missing/invalid config.json, unsupported architecture
- Missing tokenizer.json for both Gemma3 and Qwen3
- Missing safetensors: was a nil-pointer panic in precomputeScaledWeights,
  fixed with early error return in both LoadGemma3 and LoadQwen3
- Config parsing: defaults, quantization, nested text_config
- isLayerSliding sliding window pattern logic
- resolveWeight with language_model. prefix fallback

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 21:49:07 +00:00
Snider
18e8dca9f8 feat(metal): validate Gemma3-1B inference end-to-end (Phase 2)
- Fix model_type "gemma3_text" not matched in architecture dispatch
- Fix GPT-2 BPE false detection on large SentencePiece vocabs (Gemma3
  262K vocab contains Ġ but uses ▁ for spaces — check "Ġthe" not bare "Ġ")
- Add TestGemma3_1B_Inference: greedy decode, 46 tok/s, coherent output
- Add TestGemma3_1B_Chat: validates chat template formatting
- Add TestGemma3_1B_ContextCancel: validates ctx.Done() stops generation

4-bit quantised Gemma3-1B loads in ~700ms, generates at 46 tok/s on M3 Ultra.

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 21:44:28 +00:00
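The BPE false-detection fix above comes down to a sharper vocabulary probe, which might look like this: treat a vocab as GPT-2 byte-level BPE only if it contains a real Ġ-prefixed word such as "Ġthe", since a large SentencePiece vocab (Gemma3's 262K) can contain a bare "Ġ" while using "▁" for spaces. The function name is hypothetical.

```go
package main

import "fmt"

// isGPT2BPE probes for a real Ġ-prefixed word, not a bare "Ġ" entry
// (hypothetical sketch of the heuristic the fix describes).
func isGPT2BPE(vocab map[string]int) bool {
	_, ok := vocab["Ġthe"]
	return ok
}

func main() {
	sentencePiece := map[string]int{"▁the": 1, "Ġ": 2} // ▁ marks spaces; bare Ġ is incidental
	gpt2 := map[string]int{"Ġthe": 1}
	fmt.Println(isGPT2BPE(sentencePiece), isGPT2BPE(gpt2)) // false true
}
```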
Snider
443347a2f8 fix(metal): address 4 minor code review items
- Rename New() → newArray() to signal internal-only intent (112 usages)
- Remove unused Collect() function and its test
- Fix discarded json.Unmarshal error in qwen3.go
- Document AsStrided stride formula in gemma3.go

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 21:36:40 +00:00
Snider
fb95cde30c fix(metal): address 5 important code review items
1. RepeatPenalty: implemented applyRepeatPenalty() — tracks generated
   token IDs, deduplicates, divides positive logits by penalty and
   multiplies negative logits by penalty. 2 new tests.

2. DefaultGPUStream/DefaultCPUStream: now cached with sync.Once,
   no more C stream allocation on every call.

3. CompileShapeless: removed dead C closure, callback, sync.Map,
   and nextID infrastructure. CompiledFunc is now a plain function
   wrapper with mutex. API unchanged.

4. Tokenizer BPE: implemented bpeMerge() — standard BPE algorithm
   using merge rank lookup. Both SentencePiece and GPT-2 Encode paths
   now apply merges instead of falling back to character-level lookup.
   3 new tests.

5. KV cache lifecycle: documented in Generate() godoc — fresh caches
   per call, ClearCache() between turns for prompt Metal reclaim.

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 21:31:45 +00:00
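Item 1 above can be sketched on a plain logit slice: for each distinct token already generated, divide a positive logit by the penalty and multiply a negative one by it, pushing repeats toward lower probability either way. The real version operates on MLX arrays, so this is the algorithm only.

```go
package main

import "fmt"

// applyRepeatPenalty penalises tokens that already appeared in the output
// (sketch of the scheme the commit describes; not the MLX-array version).
func applyRepeatPenalty(logits []float32, generated []int, penalty float32) {
	seen := make(map[int]bool)
	for _, id := range generated {
		if seen[id] { // deduplicate: penalise each token once
			continue
		}
		seen[id] = true
		if logits[id] > 0 {
			logits[id] /= penalty // shrink positive logits
		} else {
			logits[id] *= penalty // push negative logits further down
		}
	}
}

func main() {
	logits := []float32{2.0, -1.0, 0.5}
	applyRepeatPenalty(logits, []int{0, 1, 0}, 2.0)
	fmt.Println(logits) // [1 -2 0.5]
}
```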
Snider
c96f9bd006 fix(metal): address 3 critical code review items
1. Error handler thread safety: last_mlx_error now uses _Atomic(const char*)
   with atomic_store_explicit/atomic_exchange_explicit (release/acquire).

2. macOS version minimum: -mmacosx-version-min changed from 26.0 to 13.3
   (MLX's own minimum), no longer locks out macOS 14/15 users.

3. LoadOption applied in metalBackend.LoadModel(): calls ApplyLoadOpts(),
   passes ContextLen through to Model which replaces unbounded KVCache
   with RotatingKVCache when set. GPULayers=0 logs a warning.

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 21:24:10 +00:00
Snider
bd5668967c docs: Virgil code review feedback after Phase 4 completion
3 critical (error handler thread safety, macOS version min, LoadOption ignored),
5 important (KV cache leak, RepeatPenalty dead, stream leak, tokenizer BPE,
dead compile code), and 4 minor items. Plus 3 design questions.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 21:17:43 +00:00
Snider
f13a8c9289 feat(metal): deterministic Close() and Array.Iter()
Model.Close() now walks the full model tree (Gemma3/Qwen3) and
explicitly frees all weight arrays. Handles tied output weights,
nil safety, idempotent double-close. Helpers: freeLinear,
freeEmbedding, freeRMSNorm, freeCaches, closeGemma, closeQwen3.

Array.Iter() returns iter.Seq[float32] for range-over-func iteration.
Handles non-contiguous arrays and supports early break.

192 tests passing (12 new: 8 close, 4 iter).

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 21:09:39 +00:00
Snider
f2ca7fe188 docs(cpp): add on-demand research tasks for CLion Claude
Version bump validation, batch evaluation patterns, GPU profiling,
quantised matmul variants, and async/streaming patterns — all
activate when the Go side needs them.

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 21:01:45 +00:00
Snider
754d6e2f93 fix(metal): error handling audit — propagate MLX errors instead of swallowing
Replace checkError() log+swallow with lastError() that returns real MLX
error messages. Add Eval/EvalAsync as error-returning variants of
Materialize. Generate loop now propagates GPU errors via model.Err().
LoadAllSafetensors returns (map, error). Model loaders check lastError()
after safetensors load. 180 tests passing.

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 20:59:37 +00:00
Snider
ff01175a62 bench(metal): add 29 benchmarks baselined on M3 Ultra
MatMul (128² to 4096², token projection), Softmax, element-wise
ops, fused Metal kernels (RMSNorm, LayerNorm, RoPE, SDPA), Linear,
Embedding, reductions, and full sampler chain. CGO floor ~170μs.

Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-19 20:47:25 +00:00
Snider
7435648f66 docs: update findings and TODO with CLion Claude research results
Document contiguous fix, TopP/MinP implementations, new bindings.
Update memory management TODO — CLion Claude confirmed safe patterns.
176 tests passing.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 20:40:35 +00:00
Snider
f507620e85 docs(cpp): complete mlx-c API research — all tasks done
CLion Claude session 1 completed all orientation and priority tasks:
- Full API surface map (180 ops, Go binds ~40)
- mlx_contiguous found (fixes Floats() bug)
- mlx_cumsum found (unblocks TopP sampling)
- Memory management fully documented (refcounted, safe to free)
- Error model documented (free-form strings only)
- Device info API documented (mlx_metal_device_info)

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-19 20:40:00 +00:00