docs: flesh out English stemming task spec for Phase 1 completion

Detail Porter-style stemmer algorithm, tokenize() integration, search scoring (scoreStemWord = 0.7), and comprehensive test matrix including regression verification.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-20 07:56:33 +00:00 · 2cca8d5656
commit 2cca8d5656
parent 23cef8592a
1 changed file with 10 additions and 1 deletion
--- a/TODO.md
+++ b/TODO.md
@@ -16,7 +16,16 @@ Dispatched from core/go orchestration. Pick up tasks in order.
 ## Phase 1: Search Improvements

 - [x] **Fuzzy matching** -- Levenshtein distance with max edit distance of 2. Words under 3 chars skip fuzzy. Score: +0.3 per fuzzy match (below prefix +0.5 and exact +1.0).
- [ ] Add stemming support for English search terms
+- [ ] **English stemming** — Add a lightweight Porter-style stemmer for English search terms. Pure Go, no external deps.
+  - **Create `stemmer.go`** — Implement `stem(word string) string` covering the most impactful English suffix rules:
+    - Step 1: Plurals and -ed/-ing forms (`-sses` → `-ss`, `-ies` → `-i`, `-s` → `""`, `-eed` → `-ee`, `-ed` → `""`, `-ing` → `""`)
+    - Step 2: Derivational suffixes (`-ational` → `-ate`, `-tional` → `-tion`, `-fulness` → `-ful`, `-ness` → `""`, `-ment` → `""`, `-ation` → `-ate`, `-ously` → `-ous`, `-ively` → `-ive`, `-ably` → `-able`, `-ally` → `-al`, `-izer` → `-ize`, `-ingly` → `-ing`)
+    - Guard: words under 4 chars are returned unchanged; result must be at least 2 chars
+    - Use a simple suffix-stripping approach (not the full Porter algorithm — we don't need morphological analysis for a help catalog)
+  - **Modify `tokenize()`** — Add stemmed variants: for each word, compute `stem(word)`. If the stem differs from the word, return BOTH the original word AND the stem. This ensures exact matches still work while adding stemmed coverage.
+  - **Modify `Search()`** — Stem query words before matching. Add `scoreStemWord = 0.7` constant (between exact 1.0 and prefix 0.5) for stem-only matches.
+  - **Integration**: When indexing (`Add`), `tokenize()` already produces stemmed variants, so the index naturally contains stems. When searching, stem the query words and match both raw and stemmed forms against the index.
+  - **Tests** — (a) `stem()` unit tests for all suffix rules (15+ cases), (b) short words unchanged, (c) search "running" matches topic containing "run", (d) search "configurations" matches "configure", (e) plural "servers" matches "server", (f) existing tests still pass (no regression), (g) benchmark `BenchmarkStem` with 100 words
 - [x] **Phrase search** -- Quoted multi-word queries via `extractPhrases()`. Phrase boost: +8.0. Searches title, content, and section content.
 - [x] **Improved scoring weights** -- Named constants: title +10, section +5, tag +3, phrase +8, all-words bonus +2, exact +1.0, prefix +0.5, fuzzy +0.3.
 - [x] **Tag boost** -- Query words matching tags add +3.0 per matching tag.
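
The stemming task added in this diff can be sketched roughly as follows. This is a minimal illustration of the listed suffix rules and guards, not the project's implementation. Several points are assumptions that the spec does not state: each step applies only the first matching rule, input is already lowercased ASCII, a rule that would leave fewer than 2 characters is skipped, and `-ss` endings are left alone (as in Porter's step 1a) so that the `-ness` and `-fulness` rules in Step 2 can still fire. The helper names `applyFirstMatch` and `withStems` are hypothetical; `withStems` stands in for the change described for `tokenize()`.

```go
package main

import (
	"fmt"
	"strings"
)

// suffixRule rewrites a trailing suffix to a replacement.
type suffixRule struct {
	suffix      string
	replacement string
}

// step1Rules: plurals and -ed/-ing forms from the task spec.
// Assumption: the {"ss", "ss"} entry is not in the spec. It keeps words such
// as "kindness" intact in Step 1 so the Step 2 "-ness" rule can match.
var step1Rules = []suffixRule{
	{"sses", "ss"}, {"ies", "i"}, {"ss", "ss"}, {"s", ""},
	{"eed", "ee"}, {"ed", ""}, {"ing", ""},
}

// step2Rules: derivational suffixes, in the order listed in the spec.
var step2Rules = []suffixRule{
	{"ational", "ate"}, {"tional", "tion"}, {"fulness", "ful"}, {"ness", ""},
	{"ment", ""}, {"ation", "ate"}, {"ously", "ous"}, {"ively", "ive"},
	{"ably", "able"}, {"ally", "al"}, {"izer", "ize"}, {"ingly", "ing"},
}

const (
	minWordLen = 4 // words shorter than this are returned unchanged
	minStemLen = 2 // a rewrite must leave at least this many characters
)

// applyFirstMatch applies the first rule whose suffix matches.
// If the rewrite would be shorter than minStemLen, the word is kept as-is.
func applyFirstMatch(word string, rules []suffixRule) string {
	for _, r := range rules {
		if !strings.HasSuffix(word, r.suffix) {
			continue
		}
		candidate := strings.TrimSuffix(word, r.suffix) + r.replacement
		if len(candidate) < minStemLen {
			return word
		}
		return candidate
	}
	return word
}

// stem runs Step 1, then Step 2, on a lowercased English word.
func stem(word string) string {
	if len(word) < minWordLen {
		return word
	}
	word = applyFirstMatch(word, step1Rules)
	return applyFirstMatch(word, step2Rules)
}

// withStems returns each word followed by its stem when the two differ,
// mirroring the "return BOTH the original word AND the stem" behaviour.
func withStems(words []string) []string {
	out := make([]string, 0, len(words))
	for _, w := range words {
		out = append(out, w)
		if s := stem(w); s != w {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	for _, w := range []string{"servers", "classes", "agreed", "kindness", "relational", "run"} {
		fmt.Printf("%s -> %s\n", w, stem(w))
	}
	fmt.Println(withStems([]string{"servers", "run"}))
}
```

In this sketch, the rules as listed reduce "running" to "runn" rather than "run", because no double-consonant rule is specified. Test case (c) would therefore have to pass through prefix or fuzzy scoring, or the stemmer would need an extra undoubling rule.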