docs: document AttentionSnapshot and AttentionInspector in architecture guide

Co-Authored-By: Virgil <virgil@lethean.io>
This commit is contained in:
Snider 2026-02-23 12:34:30 +00:00
parent 63c3d6aa53
commit 66d9d63017


@@ -131,6 +131,50 @@ type ModelInfo struct {
Static metadata about a loaded model. `QuantBits` is zero for unquantised (FP16/BF16) models.
### AttentionSnapshot
```go
type AttentionSnapshot struct {
	NumLayers    int
	NumHeads     int // num_kv_heads (may differ from query heads in GQA)
	SeqLen       int // number of tokens in the prompt
	HeadDim      int
	Keys         [][][]float32 // [layer][head] → flat float32 of len seq_len*head_dim
	Architecture string
}
```
Post-RoPE K vectors extracted from the KV cache after a single prefill pass. The `Keys` tensor is indexed `[layer][head][position*head_dim]` — each head's K vectors are flattened into a single slice of length `SeqLen * HeadDim`.
This type is consumed by LEM's Q/K Bone Orientation analysis engine, which computes coherence, cross-layer alignment, head entropy, phase-lock, and joint collapse metrics from the raw K tensors. The analysis is pure Go CPU math — no GPU dependencies.
For GQA models (e.g. Gemma3 where `num_kv_heads < num_query_heads`), `NumHeads` reflects the KV head count. Single-head layers use position-wise differentiation rather than pairwise head comparison.
## Optional Interfaces
### AttentionInspector
```go
type AttentionInspector interface {
	InspectAttention(ctx context.Context, prompt string, opts ...GenerateOption) (*AttentionSnapshot, error)
}
```
Backends may implement `AttentionInspector` to expose attention-level data for Q/K Bone Orientation analysis. This is an optional interface — consumers discover it via type assertion:
```go
if inspector, ok := model.(AttentionInspector); ok {
	snap, err := inspector.InspectAttention(ctx, prompt)
	// analyse snap.Keys
}
```
This follows rule 3 of the stability contract: new capabilities are expressed as separate interfaces rather than by extending `TextModel`. Backends that don't support attention inspection (HTTP, llama.cpp subprocess) are unaffected.
**Implementations:**
- `go-mlx` — Extracts post-RoPE K vectors from Metal KV cache after prefill (native GPU memory read)
- `go-ml` — `InferenceAdapter.InspectAttention()` delegates via type assertion to the underlying `TextModel`
## TextModel Interface
```go