# LEK-Gemma3-1B-v2 Benchmarks
Benchmark results from testing the fine-tuned 1B parameter model (LEK-Gemma3-1B-layered-v2) for use in the go-i18n classification pipeline. Conducted 2026-02-19.
## Summary
The 1B model is not useful for generative tasks (code, prose, pattern completion) but has a genuine sweet spot in constrained classification — specifically domain classification at 75% accuracy in 0.17s per task (~5,000 sentences/second on M3).
## Round 1: Practical Dev Tasks
5 open-ended tasks, temp=0.3, max_tokens=512.
| Task | Result | Notes |
|---|---|---|
| Bug spotting (Go security) | Partial | Right direction, wrong specific fix |
| Commit message | Miss | Wrong format, verbose |
| Go function generation | Miss | Hallucinated APIs |
| Grammar table (5 verbs) | 2/5 | "deploy", "scaffold" correct; rest wrong |
| Code review (1 sentence) | Partial | Muddled reasoning, pad token degeneration |
Verdict: not useful for generative work; pad-token degeneration sets in on longer outputs.
## Round 2: Narrow Constrained Tasks
24 tasks, temp=0.1, one-word/short answers. Average 0.19s per task.
| Category | Score | Notes |
|---|---|---|
| Domain classification | 3/4 (75%) | technical, creative, ethical, casual |
| Article selection | 2/3 (67%) | "an API" and "an SSH" correct, missed "a URL" |
| Tense detection | 2/4 (50%) | Weak on gerund vs base form |
| Plural detection | 2/3 (67%) | "matrices" and "datum" correct |
| Conjugation | Mixed | Some correct, many hallucinated forms |
## Round 3: Tightest Format
27 tasks, temp=0.05, binary T/F or forced choice. Average 0.18s per task.
| Category | Score | Notes |
|---|---|---|
| Domain classification | 6/8 (75%) | Consistent — this is the sweet spot |
| Article correctness T/F | 3/3 (100%) | "an SSH"=true, "a API"=false, "an URL"=false |
| Tone/sentiment | 2/3 (67%) | positive and negative correct; confused on neutral |
| Irregular base forms A/B | 2/2 (100%) | "went"→go, "mice"→mouse |
| True/False grammar | 4/8 (50%) | Strong false-bias |
| Pattern fill | 0/4 (0%) | Echoes prompt or hallucinates |
## Sweet Spots (Use These)
| Capability | Accuracy | Latency | Notes |
|---|---|---|---|
| Domain classification | 75% | 0.17s/task | ~5K sentences/sec |
| Article correctness | 100% | 0.17s/task | binary T/F format |
| Irregular base form ID | 100% | 0.17s/task | forced A/B choice |
## Dead Zones (Avoid)
| Capability | Accuracy | Problem |
|---|---|---|
| Pattern fill | 0% | Echoes or hallucinates |
| True/False (general) | 50% | False-bias when uncertain |
| Tense detection | 50% | Gerund/base confusion |
| Code generation | ~0% | Hallucinated APIs, degeneration |
## Implications for go-i18n Pipeline
### Phase 2a: 1B Pre-Classification
The 88K Phase 0 seed corpus can be pre-tagged with domain labels (technical/creative/ethical/casual) in ~18 seconds on M3. This pre-sort enables:
- Batched imprint processing — group seeds by domain before running GrammarImprint
- Per-domain reference distributions — calibrate what "normal" grammar looks like within each domain
- Anomaly detection — texts where 1B tag disagrees with imprint-based classification are either misclassified (training signal for 1B) or genuinely cross-domain
### Known Weakness to Fix
The technical ↔ creative confusion is the main error pattern. The model sometimes calls technical text (CLI commands, Docker instructions) "creative". This is likely fixable with targeted fine-tuning examples showing code/CLI vs literary prose.
### Article/Irregular Validators
The 100% accuracy on article correctness and irregular base forms (albeit on small samples: 3/3 and 2/2) suggests these grammar features are well learned. Worth testing the model as a lightweight validator that supplements the rule-based Article() and PastTense() for edge cases the grammar tables miss.
## 1B vs 27B Comparison
Same creative probe (lighthouse keeper story, 3 perspectives):
| Metric | 1B | 27B |
|---|---|---|
| Output length | 3,209 chars | 4,470 chars |
| Generation time | 3.5s | 28.1s |
| Throughput | 918 chars/s | 159 chars/s |
| Style | Journal-style, atmospheric, distinctive voice | Structured three-part, conventional, complete |
The 1B model delivers roughly 6x the throughput (918 vs 159 chars/s), with arguably more distinctive prose per token. For classification (not generation), though, the speed advantage is the key metric.