# TODO.md -- go-help
Dispatched from core/go orchestration. Pick up tasks in order.
## Phase 0: Hardening & Test Coverage
- Expand parser tests -- Parser at 100%. Tests cover: empty input, frontmatter-only, malformed YAML (3 variants), frontmatter not at start, deeply nested headings (H4-H6 with content), Unicode (CJK, emoji, diacritics, mixed scripts), 10K+ line document, empty sections, headings without space, consecutive headings, GenerateID edge cases, path-derived IDs.
- Expand search tests -- Added tests for: empty query (4 variants), no results (3 variants), case sensitivity (4 variants), multi-word queries (4 variants), special characters (@, dots, underscores), overlapping matches, scoring boundary cases, nil-topic guard, snippet edge cases (headings-only, whitespace-only), duplicate topic IDs, catalog integration.
- Add catalog tests -- Created `catalog_test.go` covering: DefaultCatalog, Add, List, Search, Get (found/not-found), score tie-breaking.
- Benchmark search -- `search_bench_test.go` with 8 benchmarks: single word, multi-word, no results, partial match, 500-topic catalog, 1000-topic catalog, Add indexing, tokenize. Uses `b.Loop()` (Go 1.25+) and `b.ReportAllocs()`.
- `go vet ./...` clean -- No warnings.
- Coverage: 100% -- Up from 92.1%.
## Phase 1: Search Improvements
- Fuzzy matching -- Levenshtein distance with max edit distance of 2. Words under 3 chars skip fuzzy. Score: +0.3 per fuzzy match (below prefix +0.5 and exact +1.0).
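The fuzzy-matching rule above can be sketched as follows. This is a minimal illustration, not go-help's actual code: `levenshtein`, `min3`, and `maxEditDistance` mirror names mentioned in this file, but `fuzzyMatch` and its signature are assumptions.

```go
package main

import "fmt"

const maxEditDistance = 2

// levenshtein returns the edit distance between a and b,
// using the standard two-row dynamic-programming table.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			// insertion, deletion, substitution
			curr[j] = min3(curr[j-1]+1, prev[j]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

func min3(a, b, c int) int {
	m := a
	if b < m {
		m = b
	}
	if c < m {
		m = c
	}
	return m
}

// fuzzyMatch applies the plan's rules: query words under 3 runes
// skip fuzzy entirely; otherwise accept distance <= 2.
func fuzzyMatch(query, word string) bool {
	if len([]rune(query)) < 3 {
		return false
	}
	return levenshtein(query, word) <= maxEditDistance
}

func main() {
	// "serach" vs "search" is two substitutions: within the cap.
	fmt.Println(fuzzyMatch("serach", "search")) // true
}
```

A matched word would then contribute the +0.3 fuzzy score rather than the exact or prefix weights.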
- English stemming — Add a lightweight Porter-style stemmer for English search terms. Pure Go, no external deps.
- Create `stemmer.go` -- Implement `stem(word string) string` covering the most impactful English suffix rules:
  - Step 1: Plurals and -ed/-ing forms (`-sses` → `-ss`, `-ies` → `-i`, `-s` → `""`, `-eed` → `-ee`, `-ed` → `""`, `-ing` → `""`)
  - Step 2: Derivational suffixes (`-ational` → `-ate`, `-tional` → `-tion`, `-fulness` → `-ful`, `-ness` → `""`, `-ment` → `""`, `-ation` → `-ate`, `-ously` → `-ous`, `-ively` → `-ive`, `-ably` → `-able`, `-ally` → `-al`, `-izer` → `-ize`, `-ingly` → `-ing`)
  - Guard: words under 4 chars are returned unchanged; result must be at least 2 chars
  - Use a simple suffix-stripping approach (not the full Porter algorithm -- we don't need morphological analysis for a help catalog)
- Modify `tokenize()` -- Add stemmed variants: for each word, compute `stem(word)`. If the stem differs from the word, return BOTH the original word AND the stem. This ensures exact matches still work while adding stemmed coverage.
- Modify `Search()` -- Stem query words before matching. Add a `scoreStemWord = 0.7` constant (between exact 1.0 and prefix 0.5) for stem-only matches.
- Integration: When indexing (`Add`), `tokenize()` already produces stemmed variants, so the index naturally contains stems. When searching, stem the query words and match both raw and stemmed forms against the index.
- Tests -- (a) `stem()` unit tests for all suffix rules (15+ cases), (b) short words unchanged, (c) search "running" matches a topic containing "run", (d) search "configurations" matches "configure", (e) plural "servers" matches "server", (f) existing tests still pass (no regression), (g) benchmark `BenchmarkStem` with 100 words
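The suffix-stripping approach described for `stemmer.go` could look roughly like this. It is a sketch under the rules listed in this file only (plus a Porter-style guard against stripping a bare `-s` from `-ss` endings, which the rule list implies via `-sses` → `-ss`); the real implementation may order or extend the rules differently.

```go
package main

import (
	"fmt"
	"strings"
)

// stem applies the lightweight two-step suffix stripper from the
// task list. It is deliberately not the full Porter algorithm.
func stem(word string) string {
	if len(word) < 4 {
		return word // guard: short words are returned unchanged
	}
	out := word
	// Step 1: plurals and -ed/-ing forms (first match wins).
	switch {
	case strings.HasSuffix(out, "sses"):
		out = strings.TrimSuffix(out, "sses") + "ss"
	case strings.HasSuffix(out, "ies"):
		out = strings.TrimSuffix(out, "ies") + "i"
	case strings.HasSuffix(out, "eed"):
		out = strings.TrimSuffix(out, "eed") + "ee"
	case strings.HasSuffix(out, "ed"):
		out = strings.TrimSuffix(out, "ed")
	case strings.HasSuffix(out, "ing"):
		out = strings.TrimSuffix(out, "ing")
	case strings.HasSuffix(out, "s") && !strings.HasSuffix(out, "ss"):
		out = strings.TrimSuffix(out, "s")
	}
	// Step 2: derivational suffixes, longest/most specific first.
	rules := []struct{ from, to string }{
		{"ational", "ate"}, {"tional", "tion"}, {"fulness", "ful"},
		{"ation", "ate"}, {"ously", "ous"}, {"ively", "ive"},
		{"ingly", "ing"}, {"ably", "able"}, {"ally", "al"},
		{"izer", "ize"}, {"ness", ""}, {"ment", ""},
	}
	for _, r := range rules {
		if strings.HasSuffix(out, r.from) {
			out = strings.TrimSuffix(out, r.from) + r.to
			break
		}
	}
	if len(out) < 2 {
		return word // guard: result must be at least 2 chars
	}
	return out
}

func main() {
	fmt.Println(stem("servers"), stem("indexing"), stem("happiness"))
	// server index happi
}
```

Note that with only these rules, "configurations" stems to "configurate", not "configure", so test case (d) likely needs either an extra `-ate`-stripping rule or index-side stemming of "configure" to the same form.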
- Phrase search -- Quoted multi-word queries via `extractPhrases()`. Phrase boost: +8.0. Searches title, content, and section content.
- Improved scoring weights -- Named constants: title +10, section +5, tag +3, phrase +8, all-words bonus +2, exact +1.0, prefix +0.5, fuzzy +0.3.
- Tag boost -- Query words matching tags add +3.0 per matching tag.
- Multi-word bonus -- All query words present in topic adds +2.0.
- Tests for all new features -- Levenshtein, min3, extractPhrases, fuzzy search, phrase search, tag boost, multi-word bonus, scoring constants, phrase highlighting, section phrase matching.
## Phase 2: core.help Integration
- Feed CLI docs into help catalog (parse `core` subcommand help text)
- Serve catalog via HTTP for the `core.help` domain
- Add Markdown rendering for topic bodies in HTTP responses
- Generate static site from catalog for BunnyCDN deployment
## Phase 3: AI-Assisted Search
- Embed help topics into go-rag collections for semantic search
- Add vector similarity fallback when keyword search returns no results
- Support natural language queries ("how do I push all repos?")
## Workflow
- Virgil in core/go writes tasks here after research
- This repo's dedicated session picks up tasks in phase order
- Mark `[x]` when done, note commit hash