diff --git a/TODO.md b/TODO.md
index e8f88c9..5412a00 100644
--- a/TODO.md
+++ b/TODO.md
@@ -16,7 +16,16 @@ Dispatched from core/go orchestration. Pick up tasks in order.
 ## Phase 1: Search Improvements
 
 - [x] **Fuzzy matching** -- Levenshtein distance with max edit distance of 2. Words under 3 chars skip fuzzy. Score: +0.3 per fuzzy match (below prefix +0.5 and exact +1.0).
-- [ ] Add stemming support for English search terms
+- [ ] **English stemming** -- Add a lightweight Porter-style stemmer for English search terms. Pure Go, no external deps.
+  - **Create `stemmer.go`** -- Implement `stem(word string) string` covering the most impactful English suffix rules:
+    - Step 1: Plurals and -ed/-ing forms (`-sses` → `-ss`, `-ies` → `-i`, `-s` → `""`, `-eed` → `-ee`, `-ed` → `""`, `-ing` → `""`)
+    - Step 2: Derivational suffixes (`-ational` → `-ate`, `-tional` → `-tion`, `-fulness` → `-ful`, `-ness` → `""`, `-ment` → `""`, `-ation` → `-ate`, `-ously` → `-ous`, `-ively` → `-ive`, `-ably` → `-able`, `-ally` → `-al`, `-izer` → `-ize`, `-ingly` → `-ing`)
+    - Guard: words under 4 chars are returned unchanged; result must be at least 2 chars
+    - Use a simple suffix-stripping approach (not the full Porter algorithm -- we don't need morphological analysis for a help catalog)
+  - **Modify `tokenize()`** -- Add stemmed variants: for each word, compute `stem(word)`. If the stem differs from the word, return BOTH the original word AND the stem. This ensures exact matches still work while adding stemmed coverage.
+  - **Modify `Search()`** -- Stem query words before matching. Add `scoreStemWord = 0.7` constant (between exact 1.0 and prefix 0.5) for stem-only matches.
+  - **Integration** -- When indexing (`Add`), `tokenize()` already produces stemmed variants, so the index naturally contains stems. When searching, stem the query words and match both raw and stemmed forms against the index.
+  - **Tests** -- (a) `stem()` unit tests for all suffix rules (15+ cases), (b) short words unchanged, (c) search "running" matches topic containing "run", (d) search "configurations" matches "configure", (e) plural "servers" matches "server", (f) existing tests still pass (no regression), (g) benchmark `BenchmarkStem` with 100 words
 - [x] **Phrase search** -- Quoted multi-word queries via `extractPhrases()`. Phrase boost: +8.0. Searches title, content, and section content.
 - [x] **Improved scoring weights** -- Named constants: title +10, section +5, tag +3, phrase +8, all-words bonus +2, exact +1.0, prefix +0.5, fuzzy +0.3.
 - [x] **Tag boost** -- Query words matching tags add +3.0 per matching tag.
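
The suffix rules and guards listed for `stemmer.go` could be sketched roughly as below. This is a minimal illustration, not the implementation this task will produce. Several details are assumptions not stated in the task: rules are tried longest-suffix-first, at most one rule fires per step, a trailing `ss` is left alone by the `-s` rule, and input is lowercase ASCII. The names `suffixRule`, `applyRules`, and `withStems` are hypothetical; `withStems` only illustrates the "return both the original word and the stem" behaviour described for `tokenize()`.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	minWordLen = 4 // words shorter than this are returned unchanged
	minStemLen = 2 // a rule is skipped if it would leave a shorter stem
)

// suffixRule rewrites a trailing suffix; an empty replacement strips it.
type suffixRule struct {
	suffix, replacement string
}

// Step 1: plurals and -ed/-ing forms, longest suffix first (assumed order).
var step1Rules = []suffixRule{
	{"sses", "ss"}, {"ies", "i"}, {"eed", "ee"},
	{"ing", ""}, {"ed", ""}, {"s", ""},
}

// Step 2: derivational suffixes, longest suffix first (assumed order).
var step2Rules = []suffixRule{
	{"ational", "ate"}, {"fulness", "ful"}, {"tional", "tion"},
	{"ation", "ate"}, {"ingly", "ing"}, {"ously", "ous"},
	{"ively", "ive"}, {"ably", "able"}, {"ally", "al"},
	{"izer", "ize"}, {"ment", ""}, {"ness", ""},
}

// applyRules applies the first rule whose suffix matches, or returns the
// word unchanged when no rule matches or the result would be too short.
func applyRules(word string, rules []suffixRule) string {
	for _, r := range rules {
		if !strings.HasSuffix(word, r.suffix) {
			continue
		}
		// Assumption: a final "ss" is kept, so "class" stays "class".
		if r.suffix == "s" && strings.HasSuffix(word, "ss") {
			return word
		}
		candidate := strings.TrimSuffix(word, r.suffix) + r.replacement
		if len(candidate) < minStemLen {
			return word
		}
		return candidate
	}
	return word
}

// stem runs step 1 and then step 2 on words of at least minWordLen bytes.
func stem(word string) string {
	if len(word) < minWordLen {
		return word
	}
	return applyRules(applyRules(word, step1Rules), step2Rules)
}

// withStems (hypothetical helper) keeps every word and appends its stem
// when the stem differs, mirroring the planned tokenize() change.
func withStems(words []string) []string {
	out := make([]string, 0, len(words))
	for _, w := range words {
		out = append(out, w)
		if s := stem(w); s != w {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	for _, w := range []string{"servers", "relational", "kindness", "sing"} {
		fmt.Printf("%s -> %s\n", w, stem(w))
	}
	fmt.Println(withStems([]string{"restart", "servers"}))
}
```

Under these rules "configurations" stems to "configurate" and "running" to "runn", so test cases (c) and (d) would not pass on a stem match alone; they would have to rely on the existing fuzzy matching, or on extra rules such as consonant undoubling, which this sketch leaves out.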