docs: flesh out English stemming task spec for Phase 1 completion

Detail Porter-style stemmer algorithm, tokenize() integration, search
scoring (scoreStemWord = 0.7), and comprehensive test matrix including
regression verification.

Co-Authored-By: Virgil <virgil@lethean.io>
Snider 2026-02-20 07:56:33 +00:00
parent 23cef8592a
commit 2cca8d5656

TODO.md

@@ -16,7 +16,16 @@ Dispatched from core/go orchestration. Pick up tasks in order.
## Phase 1: Search Improvements
- [x] **Fuzzy matching** -- Levenshtein distance with max edit distance of 2. Words under 3 chars skip fuzzy. Score: +0.3 per fuzzy match (below prefix +0.5 and exact +1.0).
- [ ] Add stemming support for English search terms
- [ ] **English stemming** -- Add a lightweight Porter-style stemmer for English search terms. Pure Go, no external deps.
  - **Create `stemmer.go`** -- Implement `stem(word string) string` covering the most impactful English suffix rules:
    - Step 1: Plurals and -ed/-ing forms (`-sses` → `-ss`, `-ies` → `-i`, `-s` → `""`, `-eed` → `-ee`, `-ed` → `""`, `-ing` → `""`)
    - Step 2: Derivational suffixes (`-ational` → `-ate`, `-tional` → `-tion`, `-fulness` → `-ful`, `-ness` → `""`, `-ment` → `""`, `-ation` → `-ate`, `-ously` → `-ous`, `-ively` → `-ive`, `-ably` → `-able`, `-ally` → `-al`, `-izer` → `-ize`, `-ingly` → `-ing`)
    - Guard: words under 4 chars are returned unchanged; result must be at least 2 chars
    - Use a simple suffix-stripping approach (not the full Porter algorithm; we don't need morphological analysis for a help catalog)
  - **Modify `tokenize()`** -- Add stemmed variants: for each word, compute `stem(word)`. If the stem differs from the word, return BOTH the original word AND the stem. This ensures exact matches still work while adding stemmed coverage.
  - **Modify `Search()`** -- Stem query words before matching. Add `scoreStemWord = 0.7` constant (between exact 1.0 and prefix 0.5) for stem-only matches.
  - **Integration**: When indexing (`Add`), `tokenize()` already produces stemmed variants, so the index naturally contains stems. When searching, stem the query words and match both raw and stemmed forms against the index.
  - **Tests** -- (a) `stem()` unit tests for all suffix rules (15+ cases), (b) short words unchanged, (c) search "running" matches topic containing "run", (d) search "configurations" matches "configure", (e) plural "servers" matches "server", (f) existing tests still pass (no regression), (g) benchmark `BenchmarkStem` with 100 words
- [x] **Phrase search** -- Quoted multi-word queries via `extractPhrases()`. Phrase boost: +8.0. Searches title, content, and section content.
- [x] **Improved scoring weights** -- Named constants: title +10, section +5, tag +3, phrase +8, all-words bonus +2, exact +1.0, prefix +0.5, fuzzy +0.3.
- [x] **Tag boost** -- Query words matching tags add +3.0 per matching tag.