docs: document InspectAttention KV cache extraction in architecture guide

Co-Authored-By: Virgil <virgil@lethean.io>
This commit is contained in:
Snider 2026-02-23 12:34:31 +00:00
parent f7553eb277
commit c0f07478c8


@@ -495,6 +495,28 @@ for tok := range m.Generate(ctx, "prompt", inference.WithMaxTokens(128)) {
- `ContextLen` — replaces unbounded `KVCache` with `RotatingKVCache(contextLen)` for all layers
- `GPULayers` — logged as a warning if set to 0 (Metal always uses full GPU offload)
### AttentionInspector (Q/K Bone Orientation)
`metalAdapter` implements the optional `inference.AttentionInspector` interface, enabling Q/K Bone Orientation analysis from the KV cache.
```go
inspector, ok := model.(inference.AttentionInspector)
if !ok {
    // backend does not support attention inspection
}
snap, err := inspector.InspectAttention(ctx, "What is kindness?")
// snap.Keys[layer][head] → post-RoPE K vectors as flat float32
```
**How it works:**
1. The prompt is tokenised and a single prefill pass populates all layer KV caches
2. For each layer, `cache.State()[0]` returns the K tensor with shape `[1, num_kv_heads, seq_alloc, head_dim]`
3. The tensor is sliced to valid token positions (cache may pre-allocate padding beyond `seq_len`)
4. K vectors are copied to CPU float32 slices via `.Floats()` and reshaped to `[head][seq_len * head_dim]`
5. GPU arrays are freed immediately after extraction
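Steps 3–4 above can be sketched on plain slices. This is an illustrative helper, not the adapter's actual code: the real implementation works on GPU tensors and pulls data to the CPU via `.Floats()`, but the slice/reshape arithmetic is the same.

```go
package main

import "fmt"

// reshapeKeys takes a flat post-RoPE K tensor of logical shape
// [1, numKVHeads, seqAlloc, headDim], drops the pre-allocated padding
// beyond seqLen, and reshapes to [head][seqLen*headDim].
func reshapeKeys(flat []float32, numKVHeads, seqAlloc, seqLen, headDim int) [][]float32 {
	out := make([][]float32, numKVHeads)
	for h := 0; h < numKVHeads; h++ {
		out[h] = make([]float32, 0, seqLen*headDim)
		for s := 0; s < seqLen; s++ {
			start := (h*seqAlloc + s) * headDim
			out[h] = append(out[h], flat[start:start+headDim]...)
		}
	}
	return out
}

func main() {
	// 2 KV heads, cache allocated for 4 positions, only 2 valid, headDim 3.
	flat := make([]float32, 2*4*3)
	for i := range flat {
		flat[i] = float32(i)
	}
	keys := reshapeKeys(flat, 2, 4, 2, 3)
	fmt.Println(keys[0]) // valid positions of head 0
	fmt.Println(keys[1]) // head 1 starts past head 0's padding
}
```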
The K tensors are post-RoPE — rotary position embeddings have already been applied during the attention forward pass. This is the same data the model uses for attention scoring, making it suitable for coherence analysis.
For GQA models (e.g. Gemma3), `num_kv_heads` may be 1 per layer while `num_query_heads` is 8 or more. The returned snapshot reflects the KV head count, not the query head count.
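Under GQA, consecutive query heads share a KV head via standard grouping. A minimal sketch of that mapping (illustrative only, not part of the adapter API) shows why the snapshot's head axis is smaller than the query head count:

```go
package main

import "fmt"

// kvHeadFor maps a query-head index to the KV head it shares under GQA,
// assuming the standard contiguous grouping of query heads.
func kvHeadFor(queryHead, numQueryHeads, numKVHeads int) int {
	group := numQueryHeads / numKVHeads
	return queryHead / group
}

func main() {
	// Extreme case like Gemma3: 8 query heads over a single KV head.
	for q := 0; q < 8; q++ {
		fmt.Println("query head", q, "→ kv head", kvHeadFor(q, 8, 1))
	}
}
```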
---
## mlxlm Subprocess Backend