Snider 2cca8d5656 docs: flesh out English stemming task spec for Phase 1 completion
Detail Porter-style stemmer algorithm, tokenize() integration, search
scoring (scoreStemWord = 0.7), and comprehensive test matrix including
regression verification.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-20 07:56:33 +00:00


TODO.md -- go-help

Dispatched from core/go orchestration. Pick up tasks in order.


Phase 0: Hardening & Test Coverage

  • Expand parser tests -- Parser at 100%. Tests cover: empty input, frontmatter-only, malformed YAML (3 variants), frontmatter not at start, deeply nested headings (H4-H6 with content), Unicode (CJK, emoji, diacritics, mixed scripts), 10K+ line document, empty sections, headings without space, consecutive headings, GenerateID edge cases, path-derived IDs.
  • Expand search tests -- Added tests for: empty query (4 variants), no results (3 variants), case sensitivity (4 variants), multi-word queries (4 variants), special characters (@, dots, underscores), overlapping matches, scoring boundary cases, nil-topic guard, snippet edge cases (headings-only, whitespace-only), duplicate topic IDs, catalog integration.
  • Add catalog tests -- Created catalog_test.go covering: DefaultCatalog, Add, List, Search, Get (found/not-found), score tie-breaking.
  • Benchmark search -- search_bench_test.go with 8 benchmarks: single word, multi-word, no results, partial match, 500-topic catalog, 1000-topic catalog, Add indexing, tokenize. Uses b.Loop() (Go 1.25+) and b.ReportAllocs().
  • go vet ./... clean -- No warnings.
  • Coverage: 100% -- Up from 92.1%.

Phase 1: Search Improvements

  • Fuzzy matching -- Levenshtein distance with max edit distance of 2. Words under 3 chars skip fuzzy. Score: +0.3 per fuzzy match (below prefix +0.5 and exact +1.0).
  • English stemming -- Add a lightweight Porter-style stemmer for English search terms. Pure Go, no external deps.
    • Create stemmer.go -- Implement stem(word string) string covering the most impactful English suffix rules:
      • Step 1: Plurals and -ed/-ing forms (-sses → -ss, -ies → -i, -s → "", -eed → -ee, -ed → "", -ing → "")
      • Step 2: Derivational suffixes (-ational → -ate, -tional → -tion, -fulness → -ful, -ness → "", -ment → "", -ation → -ate, -ously → -ous, -ively → -ive, -ably → -able, -ally → -al, -izer → -ize, -ingly → -ing)
      • Guard: words under 4 chars are returned unchanged; a stemmed result must be at least 2 chars
      • Use a simple suffix-stripping approach (not the full Porter algorithm -- we don't need morphological analysis for a help catalog)
    • Modify tokenize() -- Add stemmed variants: for each word, compute stem(word). If the stem differs from the word, emit BOTH the original word AND the stem. This keeps exact matches working while adding stemmed coverage.
    • Modify Search() -- Stem query words before matching. Add a scoreStemWord = 0.7 constant (between exact 1.0 and prefix 0.5) for stem-only matches.
    • Integration -- When indexing (Add), tokenize() already produces stemmed variants, so the index naturally contains stems. When searching, stem the query words and match both raw and stemmed forms against the index.
    • Tests -- (a) stem() unit tests for all suffix rules (15+ cases), (b) short words unchanged, (c) search "running" matches a topic containing "run", (d) search "configurations" matches "configure", (e) plural "servers" matches "server", (f) existing tests still pass (no regression), (g) BenchmarkStem over 100 words
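The stemming spec above can be sketched as follows. This is a minimal illustration, not the repo's implementation: two details beyond the listed rules are assumptions, marked in comments -- the usual Porter-style guard that -s is not stripped from words ending in -ss, and an "undouble" pass after -ed/-ing removal (without which "running" stems to "runn" rather than "run" as the regression test expects).

```go
package main

import (
	"fmt"
	"strings"
)

// undouble trims a trailing double consonant left behind by -ed/-ing
// removal ("runn" -> "run"). Keeping l, s, z doubled follows the usual
// Porter convention -- an assumption, not part of the spec above.
func undouble(w string) string {
	n := len(w)
	if n >= 3 && w[n-1] == w[n-2] && !strings.ContainsAny(w[n-1:], "aeioulsz") {
		return w[:n-1]
	}
	return w
}

// keep enforces the 2-char floor: fall back to the original word
// if stripping left a result that is too short.
func keep(orig, cand string) string {
	if len(cand) < 2 {
		return orig
	}
	return cand
}

// step1 handles plurals and -ed/-ing forms.
func step1(w string) string {
	switch {
	case strings.HasSuffix(w, "sses"):
		return keep(w, w[:len(w)-2]) // -sses -> -ss
	case strings.HasSuffix(w, "ies"):
		return keep(w, w[:len(w)-2]) // -ies -> -i
	case strings.HasSuffix(w, "eed"):
		return keep(w, w[:len(w)-1]) // -eed -> -ee
	case strings.HasSuffix(w, "ing"):
		return keep(w, undouble(w[:len(w)-3])) // -ing -> ""
	case strings.HasSuffix(w, "ed"):
		return keep(w, undouble(w[:len(w)-2])) // -ed -> ""
	case strings.HasSuffix(w, "s") && !strings.HasSuffix(w, "ss"):
		return keep(w, w[:len(w)-1]) // -s -> "" (assumed -ss guard)
	}
	return w
}

// step2Rules map derivational suffixes to replacements, longest first
// so that -ational wins over -ation.
var step2Rules = []struct{ suffix, repl string }{
	{"ational", "ate"}, {"fulness", "ful"}, {"tional", "tion"},
	{"ously", "ous"}, {"ively", "ive"}, {"ingly", "ing"},
	{"ation", "ate"}, {"ably", "able"}, {"ally", "al"},
	{"izer", "ize"}, {"ness", ""}, {"ment", ""},
}

func step2(w string) string {
	for _, r := range step2Rules {
		if strings.HasSuffix(w, r.suffix) {
			return keep(w, w[:len(w)-len(r.suffix)]+r.repl)
		}
	}
	return w
}

// stem applies both steps; words under 4 chars pass through untouched.
func stem(word string) string {
	if len(word) < 4 {
		return word
	}
	return step2(step1(word))
}

func main() {
	for _, w := range []string{"running", "servers", "kindness", "agreed", "is"} {
		fmt.Println(w, "->", stem(w))
	}
}
```

Ordering matters in both steps: checking -sses before -s and -eed before -ed prevents the shorter rule from firing first.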
  • Phrase search -- Quoted multi-word queries via extractPhrases(). Phrase boost: +8.0. Searches title, content, and section content.
  • Improved scoring weights -- Named constants: title +10, section +5, tag +3, phrase +8, all-words bonus +2, exact +1.0, prefix +0.5, fuzzy +0.3.
  • Tag boost -- Query words matching tags add +3.0 per matching tag.
  • Multi-word bonus -- All query words present in topic adds +2.0.
  • Tests for all new features -- Levenshtein, min3, extractPhrases, fuzzy search, phrase search, tag boost, multi-word bonus, scoring constants, phrase highlighting, section phrase matching.
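The fuzzy-matching rules above (Levenshtein with a max edit distance of 2, short words skipping fuzzy) can be sketched like this. The function names mirror the helpers the test list mentions (levenshtein, min3); fuzzyMatch is a hypothetical wrapper showing how the gates combine, not the repo's actual API.

```go
package main

import "fmt"

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein computes edit distance using two rolling rows
// instead of a full matrix, so memory is O(len(b)).
func levenshtein(a, b string) int {
	prev := make([]int, len(b)+1)
	curr := make([]int, len(b)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(a); i++ {
		curr[0] = i
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			// deletion, insertion, substitution
			curr[j] = min3(curr[j-1]+1, prev[j]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(b)]
}

// fuzzyMatch applies the spec's gates: query words under 3 chars
// skip fuzzy entirely, and only distances up to 2 count as a match.
func fuzzyMatch(query, word string) bool {
	if len(query) < 3 {
		return false
	}
	return levenshtein(query, word) <= 2
}

func main() {
	fmt.Println(levenshtein("kitten", "sitting")) // classic example: 3
	fmt.Println(fuzzyMatch("serch", "search"))
	fmt.Println(fuzzyMatch("ab", "abc"))
}
```

A production version could also early-exit once every cell in a row exceeds 2, since only distances ≤ 2 ever contribute to the score.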

Phase 2: core.help Integration

  • Feed CLI docs into help catalog (parse core subcommand help text)
  • Serve catalog via HTTP for the core.help domain
  • Add Markdown rendering for topic bodies in HTTP responses
  • Generate static site from catalog for BunnyCDN deployment
  • Embed help topics into go-rag collections for semantic search
  • Add vector similarity fallback when keyword search returns no results
  • Support natural language queries ("how do I push all repos?")
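For the "serve catalog via HTTP" task, a minimal handler shape might look like the sketch below. Everything here is hypothetical -- the Topic struct, the topics map, and the query-parameter lookup are stand-ins for whatever catalog types and routes go-help actually exposes; the sketch only shows the found/not-found JSON pattern.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Topic is a hypothetical stand-in for the catalog's topic type.
type Topic struct {
	ID    string `json:"id"`
	Title string `json:"title"`
	Body  string `json:"body"`
}

// topics stands in for the real catalog lookup.
var topics = map[string]Topic{
	"push": {ID: "push", Title: "Pushing repos", Body: "How to push all repos."},
}

// topicHandler serves a single topic as JSON, or 404 if unknown.
func topicHandler(w http.ResponseWriter, r *http.Request) {
	t, ok := topics[r.URL.Query().Get("id")]
	if !ok {
		http.NotFound(w, r)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(t)
}

// lookupStatus exercises the handler in-process via httptest
// and returns the resulting HTTP status code.
func lookupStatus(id string) int {
	rec := httptest.NewRecorder()
	topicHandler(rec, httptest.NewRequest("GET", "/?id="+id, nil))
	return rec.Code
}

func main() {
	fmt.Println(lookupStatus("push"), lookupStatus("nope"))
}
```

The same handler could later negotiate Markdown-rendered HTML bodies via the Accept header, which would cover the rendering task without a second endpoint.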

Workflow

  1. Virgil in core/go writes tasks here after research
  2. This repo's dedicated session picks up tasks in phase order
  3. Mark [x] when done, note commit hash