LEK-1B-Benchmarks
Virgil edited this page 2026-02-19 15:57:36 +00:00

LEK-Gemma3-1B-v2 Benchmarks

Benchmark results from testing the fine-tuned 1B parameter model (LEK-Gemma3-1B-layered-v2) for use in the go-i18n classification pipeline. Conducted 2026-02-19.

Summary

The 1B model is not useful for generative tasks (code, prose, pattern completion) but has a genuine sweet spot in constrained classification: domain classification at 75% accuracy in 0.17s per task (roughly 5,000 sentences/second in batched runs on M3).

Round 1: Practical Dev Tasks

5 open-ended tasks, temp=0.3, max_tokens=512.

| Task | Result | Notes |
|------|--------|-------|
| Bug spotting (Go security) | Partial | Right direction, wrong specific fix |
| Commit message | Miss | Wrong format, verbose |
| Go function generation | Miss | Hallucinated APIs |
| Grammar table (5 verbs) | 2/5 | "deploy", "scaffold" correct; rest wrong |
| Code review (1 sentence) | Partial | Muddled reasoning, pad token degeneration |

Verdict: Not useful for generative work. Pad-token degeneration appears on longer outputs.

Round 2: Narrow Constrained Tasks

24 tasks, temp=0.1, one-word/short answers. Average 0.19s per task.

| Category | Score | Notes |
|----------|-------|-------|
| Domain classification | 3/4 (75%) | technical, creative, ethical, casual |
| Article selection | 2/3 (67%) | "an API" and "an SSH" correct, missed "a URL" |
| Tense detection | 2/4 (50%) | Weak on gerund vs base form |
| Plural detection | 2/3 (67%) | "matrices" and "datum" correct |
| Conjugation | Mixed | Some correct, many hallucinated forms |

Round 3: Tightest Format

27 tasks, temp=0.05, binary T/F or forced choice. Average 0.18s per task.

| Category | Score | Notes |
|----------|-------|-------|
| Domain classification | 6/8 (75%) | Consistent; this is the sweet spot |
| Article correctness T/F | 3/3 (100%) | "an SSH"=true, "a API"=false, "an URL"=false |
| Tone/sentiment | 2/3 (67%) | Got positive + negative, neutral confused |
| Irregular base forms A/B | 2/2 (100%) | "went"→go, "mice"→mouse |
| True/False grammar | 4/8 (50%) | Strong false-bias |
| Pattern fill | 0/4 (0%) | Echoes prompt or hallucinates |

Sweet Spots (Use These)

| Capability | Accuracy | Speed | Scale |
|------------|----------|-------|-------|
| Domain classification | 75% | 0.17s | ~5K sentences/sec |
| Article correctness | 100% | 0.17s | Binary T/F format |
| Irregular base form ID | 100% | 0.17s | Forced A/B choice |

Dead Zones (Avoid)

| Capability | Accuracy | Problem |
|------------|----------|---------|
| Pattern fill | 0% | Echoes or hallucinates |
| True/False (general) | 50% | False-bias when uncertain |
| Tense detection | 50% | Gerund/base confusion |
| Code generation | ~0% | Hallucinated APIs, degeneration |

Implications for go-i18n Pipeline

Phase 2a: 1B Pre-Classification

The 88K Phase 0 seed corpus can be pre-tagged with domain labels (technical/creative/ethical/casual) in ~18 seconds on M3. This pre-sort enables:

  1. Batched imprint processing — group seeds by domain before running GrammarImprint
  2. Per-domain reference distributions — calibrate what "normal" grammar looks like within each domain
  3. Anomaly detection — texts where 1B tag disagrees with imprint-based classification are either misclassified (training signal for 1B) or genuinely cross-domain

Known Weakness to Fix

The technical ↔ creative confusion is the main error pattern. The model sometimes calls technical text (CLI commands, Docker instructions) "creative". This is likely fixable with targeted fine-tuning examples showing code/CLI vs literary prose.

Article/Irregular Validators

The 100% accuracy on article correctness and irregular base forms suggests these grammar features are well-learned. Worth testing as lightweight validators that supplement the rule-based Article() and PastTense() for edge cases the grammar tables miss.

1B vs 27B Comparison

Same creative probe (lighthouse keeper story, 3 perspectives):

| Metric | 1B | 27B |
|--------|----|-----|
| Output length | 3,209 chars | 4,470 chars |
| Generation time | 3.5s | 28.1s |
| Throughput | 918 chars/s | 159 chars/s |
| Style | Journal-style, atmospheric, distinctive voice | Structured three-part, conventional, complete |

The 1B model delivers roughly 6x the character throughput (918 vs 159 chars/s, and 8x faster wall-clock on this probe) with arguably more distinctive prose per token. For classification (not generation), the speed advantage is the key metric.