# LEK-Gemma3-1B-v2 Benchmarks
Benchmark results from testing the fine-tuned 1B parameter model (LEK-Gemma3-1B-layered-v2) for use in the go-i18n classification pipeline. Conducted 2026-02-19.
## Summary
The 1B model is not useful for generative tasks (code, prose, pattern completion) but has a genuine sweet spot in constrained classification — specifically domain classification at 75% accuracy in 0.17s per task (~5,000 sentences/second on M3).
## Round 1: Practical Dev Tasks
5 open-ended tasks, temp=0.3, max_tokens=512.
| Task | Result | Notes |
|---|---|---|
| Bug spotting (Go security) | Partial | Right direction, wrong specific fix |
| Commit message | Miss | Wrong format, verbose |
| Go function generation | Miss | Hallucinated APIs |
| Grammar table (5 verbs) | 2/5 | "deploy", "scaffold" correct; rest wrong |
| Code review (1 sentence) | Partial | Muddled reasoning, pad token degeneration |
Verdict: not useful for generative work; pad-token degeneration sets in on longer outputs.
## Round 2: Narrow Constrained Tasks
24 tasks, temp=0.1, one-word/short answers. Average 0.19s per task.
| Category | Score | Notes |
|---|---|---|
| Domain classification | 3/4 (75%) | technical, creative, ethical, casual |
| Article selection | 2/3 (67%) | "an API" and "an SSH" correct, missed "a URL" |
| Tense detection | 2/4 (50%) | Weak on gerund vs base form |
| Plural detection | 2/3 (67%) | "matrices" and "datum" correct |
| Conjugation | Mixed | Some correct, many hallucinated forms |
## Round 3: Tightest Format
27 tasks, temp=0.05, binary T/F or forced choice. Average 0.18s per task.
| Category | Score | Notes |
|---|---|---|
| Domain classification | 6/8 (75%) | Consistent — this is the sweet spot |
| Article correctness T/F | 3/3 (100%) | "an SSH"=true, "a API"=false, "an URL"=false |
| Tone/sentiment | 2/3 (67%) | positive and negative correct; confused on neutral |
| Irregular base forms A/B | 2/2 (100%) | "went"→go, "mice"→mouse |
| True/False grammar | 4/8 (50%) | Strong false-bias |
| Pattern fill | 0/4 (0%) | Echoes prompt or hallucinates |
## Sweet Spots (Use These)
| Capability | Accuracy | Latency | Notes |
|---|---|---|---|
| Domain classification | 75% | 0.17s/task | ~5K sentences/sec |
| Article correctness | 100% | 0.17s/task | binary T/F format |
| Irregular base form ID | 100% | 0.17s/task | forced A/B choice |
## Dead Zones (Avoid)
| Capability | Accuracy | Problem |
|---|---|---|
| Pattern fill | 0% | Echoes or hallucinates |
| True/False (general) | 50% | False-bias when uncertain |
| Tense detection | 50% | Gerund/base confusion |
| Code generation | ~0% | Hallucinated APIs, degeneration |
## Implications for go-i18n Pipeline
### Phase 2a: 1B Pre-Classification
The 88K Phase 0 seed corpus can be pre-tagged with domain labels (technical/creative/ethical/casual) in ~18 seconds on M3. This pre-sort enables:
- Batched imprint processing — group seeds by domain before running GrammarImprint
- Per-domain reference distributions — calibrate what "normal" grammar looks like within each domain
- Anomaly detection — texts where 1B tag disagrees with imprint-based classification are either misclassified (training signal for 1B) or genuinely cross-domain
### Known Weakness to Fix
The technical ↔ creative confusion is the main error pattern. The model sometimes calls technical text (CLI commands, Docker instructions) "creative". This is likely fixable with targeted fine-tuning examples showing code/CLI vs literary prose.
### Article/Irregular Validators
The 100% accuracy on article correctness and irregular base forms (albeit on small samples: 3/3 and 2/2) suggests these grammar features are well learned. Worth testing the model as a lightweight validator that supplements the rule-based Article() and PastTense() for edge cases the grammar tables miss.
## 1B vs 27B Comparison
Same creative probe (lighthouse keeper story, 3 perspectives):
| Metric | 1B | 27B |
|---|---|---|
| Output length | 3,209 chars | 4,470 chars |
| Generation time | 3.5s | 28.1s |
| Throughput | 918 chars/s | 159 chars/s |
| Style | Journal-style, atmospheric, distinctive voice | Structured three-part, conventional, complete |
The 1B model delivers roughly 6x the throughput (918 vs 159 chars/s), with arguably more distinctive prose per token. For classification (not generation), though, the speed advantage is the key metric.