docs: flesh out English stemming task spec for Phase 1 completion
Detail Porter-style stemmer algorithm, tokenize() integration, search scoring (scoreStemWord = 0.7), and comprehensive test matrix including regression verification.

Co-Authored-By: Virgil <virgil@lethean.io>
parent 23cef8592a
commit 2cca8d5656
1 changed file with 10 additions and 1 deletion
TODO.md
@@ -16,7 +16,16 @@ Dispatched from core/go orchestration. Pick up tasks in order.
## Phase 1: Search Improvements
- [x] **Fuzzy matching** -- Levenshtein distance with max edit distance of 2. Words under 3 chars skip fuzzy. Score: +0.3 per fuzzy match (below prefix +0.5 and exact +1.0).
- [ ] Add stemming support for English search terms
- [ ] **English stemming** — Add a lightweight Porter-style stemmer for English search terms. Pure Go, no external deps.
- **Create `stemmer.go`** — Implement `stem(word string) string` covering the most impactful English suffix rules:
- Step 1: Plurals and -ed/-ing forms (`-sses` → `-ss`, `-ies` → `-i`, `-s` → `""`, `-eed` → `-ee`, `-ed` → `""`, `-ing` → `""`)
- Step 2: Derivational suffixes (`-ational` → `-ate`, `-tional` → `-tion`, `-fulness` → `-ful`, `-ness` → `""`, `-ment` → `""`, `-ation` → `-ate`, `-ously` → `-ous`, `-ively` → `-ive`, `-ably` → `-able`, `-ally` → `-al`, `-izer` → `-ize`, `-ingly` → `-ing`)
- Guard: words under 4 chars are returned unchanged; result must be at least 2 chars
- Use a simple suffix-stripping approach (not the full Porter algorithm — we don't need morphological analysis for a help catalog)
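A minimal sketch of what `stem()` could look like under the rules above. The `-ss` keep case and the trailing double-consonant cleanup are assumed guards the task implies (so "running" can reach "run"); they are not confirmed implementation details.

```go
package main

import (
	"fmt"
	"strings"
)

// step2 maps derivational suffixes to replacements, longest patterns
// first so e.g. -ational wins over -al.
var step2 = [][2]string{
	{"ational", "ate"}, {"fulness", "ful"}, {"tional", "tion"},
	{"ation", "ate"}, {"ously", "ous"}, {"ively", "ive"},
	{"ingly", "ing"}, {"ably", "able"}, {"ally", "al"},
	{"izer", "ize"}, {"ness", ""}, {"ment", ""},
}

// undouble collapses a trailing doubled consonant left behind by
// -ed/-ing stripping ("runn" -> "run"); an assumed cleanup step.
func undouble(w string) string {
	n := len(w)
	if n >= 2 && w[n-1] == w[n-2] && !strings.ContainsRune("aeiou", rune(w[n-1])) {
		return w[:n-1]
	}
	return w
}

func stem(word string) string {
	if len(word) < 4 { // guard: short words unchanged
		return word
	}
	w := word
	// Step 1: plurals and -ed/-ing forms (first match wins).
	switch {
	case strings.HasSuffix(w, "sses"):
		w = strings.TrimSuffix(w, "es") // -sses -> -ss
	case strings.HasSuffix(w, "ies"):
		w = strings.TrimSuffix(w, "es") // -ies -> -i
	case strings.HasSuffix(w, "eed"):
		w = strings.TrimSuffix(w, "d") // -eed -> -ee
	case strings.HasSuffix(w, "ed"):
		w = undouble(strings.TrimSuffix(w, "ed"))
	case strings.HasSuffix(w, "ing"):
		w = undouble(strings.TrimSuffix(w, "ing"))
	case strings.HasSuffix(w, "ss"):
		// keep -ss so e.g. "happiness" is not mangled before Step 2
	case strings.HasSuffix(w, "s"):
		w = strings.TrimSuffix(w, "s")
	}
	// Step 2: derivational suffixes, first match wins.
	for _, r := range step2 {
		if strings.HasSuffix(w, r[0]) {
			w = strings.TrimSuffix(w, r[0]) + r[1]
			break
		}
	}
	if len(w) < 2 { // guard: result must be at least 2 chars
		return word
	}
	return w
}

func main() {
	fmt.Println(stem("running"))    // run
	fmt.Println(stem("servers"))    // server
	fmt.Println(stem("relational")) // relate
	fmt.Println(stem("the"))        // the (under 4 chars, unchanged)
}
```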
- **Modify `tokenize()`** — Add stemmed variants: for each word, compute `stem(word)`. If the stem differs from the word, return BOTH the original word AND the stem. This ensures exact matches still work while adding stemmed coverage.
- **Modify `Search()`** — Stem query words before matching. Add `scoreStemWord = 0.7` constant (between exact 1.0 and prefix 0.5) for stem-only matches.
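One way the scoring tiers could fit together; `scoreQueryWord` and the constant names other than `scoreStemWord` are illustrative, not the committed API.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	scoreExactWord = 1.0
	scoreStemWord  = 0.7 // stem-only match sits between exact and prefix
	scorePrefix    = 0.5
)

// scoreQueryWord compares a query word (and its precomputed stem)
// against one indexed token and returns the best applicable score.
func scoreQueryWord(query, stemmed, indexed string) float64 {
	switch {
	case query == indexed:
		return scoreExactWord
	case stemmed == indexed:
		return scoreStemWord
	case strings.HasPrefix(indexed, query):
		return scorePrefix
	default:
		return 0
	}
}

func main() {
	fmt.Println(scoreQueryWord("servers", "server", "servers")) // 1 (exact)
	fmt.Println(scoreQueryWord("servers", "server", "server"))  // 0.7 (stem-only)
}
```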
- **Integration**: When indexing (`Add`), `tokenize()` already produces stemmed variants, so the index naturally contains stems. When searching, stem the query words and match both raw and stemmed forms against the index.
- **Tests** — (a) `stem()` unit tests for all suffix rules (15+ cases), (b) short words unchanged, (c) search "running" matches topic containing "run", (d) search "configurations" matches "configure", (e) plural "servers" matches "server", (f) existing tests still pass (no regression), (g) benchmark `BenchmarkStem` with 100 words
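A runnable sketch of the table-driven shape tests (b), (c), and (e) might take; `stemStub` covers only the rules these cases exercise and stands in for the real `stem()` in `stemmer.go`.

```go
package main

import (
	"fmt"
	"strings"
)

// stemStub mirrors only the plural and -ing rules needed below.
func stemStub(w string) string {
	if len(w) < 4 { // short words unchanged
		return w
	}
	switch {
	case strings.HasSuffix(w, "ing"):
		w = strings.TrimSuffix(w, "ing")
		if n := len(w); n >= 2 && w[n-1] == w[n-2] {
			w = w[:n-1] // collapse doubled consonant: runn -> run
		}
	case strings.HasSuffix(w, "ss"):
		// keep
	case strings.HasSuffix(w, "s"):
		w = strings.TrimSuffix(w, "s")
	}
	return w
}

func main() {
	cases := []struct{ in, want string }{
		{"servers", "server"}, // plural, test item (e)
		{"running", "run"},    // -ing form, test item (c)
		{"the", "the"},        // short word unchanged, test item (b)
	}
	pass := true
	for _, c := range cases {
		if got := stemStub(c.in); got != c.want {
			fmt.Printf("stem(%q) = %q, want %q\n", c.in, got, c.want)
			pass = false
		}
	}
	fmt.Println("pass:", pass)
}
```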
- [x] **Phrase search** -- Quoted multi-word queries via `extractPhrases()`. Phrase boost: +8.0. Searches title, content, and section content.
- [x] **Improved scoring weights** -- Named constants: title +10, section +5, tag +3, phrase +8, all-words bonus +2, exact +1.0, prefix +0.5, fuzzy +0.3.
- [x] **Tag boost** -- Query words matching tags add +3.0 per matching tag.