go-help/FINDINGS.md
Claude 3e91510bcf
feat(search): add fuzzy matching, phrase search, and improved scoring
Phase 0: Push test coverage from 92.1% to 100% by adding catalog_test.go
and targeted tests for all uncovered branches in search.go. Add
BenchmarkSearch with 150 topics (~745us/op baseline).

Phase 1: Implement three search improvements:
- Levenshtein-based fuzzy matching (max distance 2, words >= 3 chars)
- Quoted phrase search via extractPhrases() with +8.0 boost
- Tag boost (+3.0) and multi-word bonus (+2.0) scoring
- Named scoring constants replacing magic numbers

All changes are backward-compatible; Search() signature unchanged.

Co-Authored-By: Charon <developers@lethean.io>
2026-02-20 01:21:35 +00:00


# FINDINGS.md -- go-help
## 2026-02-19: Split from core/go (Virgil)
### Origin
Extracted from `forge.lthn.ai/core/go` `pkg/help/` on 19 Feb 2026.
### Architecture
- `Topic` struct with title, body, tags, related topics
- `Section` groups topics under a heading
- `Frontmatter` for YAML metadata in topic files
- `Catalog` loads topics from YAML files on disk
- `Search` provides keyword search across topics with scoring:
  - Title match: +10
  - Section match: +5
  - Partial/body match: +0.5
### Dependencies
- Pure Go, only external dependency is `gopkg.in/yaml.v3`
### Tests
- 2 test files covering catalog loading and search behaviour
## 2026-02-20: Phase 0 + Phase 1 (Charon)
### Phase 0: Coverage 92.1% -> 100%
- Created `catalog_test.go`; `catalog.go` previously had no tests at all (0% coverage)
- Added targeted search tests for previously uncovered branches:
  - Nil topic guard in `Search()` (stale index references)
  - Alphabetical tie-breaking when scores are equal
  - Headings-only content in snippet extraction (no body text)
  - Whitespace-only content trimmed to empty in snippets
  - Empty regex slice in `highlight()`
  - Overlapping match extension in highlight merging
- Added `BenchmarkSearch` with 150 generated topics
  - Baseline: ~745us/op, ~392KB/op, 4114 allocs/op (Ryzen 9 9950X)
- `go vet ./...` clean
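
The alphabetical tie-breaking covered by the new tests can be sketched as below. This is an illustration only: the `result` type and `sortResults` function are hypothetical stand-ins, not the package's actual `SearchResult` or its sorting code.

```go
package main

import "sort"

// result is a hypothetical stand-in for a scored search hit.
type result struct {
	Title string
	Score float64
}

// sortResults orders hits by descending score and falls back to
// ascending title when two scores are equal, so output is deterministic.
func sortResults(results []result) {
	sort.Slice(results, func(i, j int) bool {
		if results[i].Score != results[j].Score {
			return results[i].Score > results[j].Score
		}
		return results[i].Title < results[j].Title
	})
}
```

Without the title fallback, the relative order of equally scored topics would depend on the sort implementation, which makes ranking tests flaky.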
### Phase 1: Search Improvements
#### Fuzzy Matching (Levenshtein distance)
- Implemented `levenshtein()` using two-row DP (memory-efficient)
- Integrated into `Search()` with max edit distance of 2
- Only applied to query words >= 3 characters (avoids noise from short words)
- Score: +0.3 per fuzzy match (lower than prefix +0.5 and exact +1.0)
- Skips words already matched as exact or prefix (no double-counting)
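
A minimal sketch of the two-row approach, assuming rune-wise comparison. The `isFuzzyMatch` helper and the `minFuzzyLen` constant are hypothetical names used here to illustrate the length and distance limits; the real integration inside `Search()` may differ, and the prefix check is left out.

```go
package main

const (
	fuzzyMaxDistance = 2 // max Levenshtein distance accepted as a fuzzy match
	minFuzzyLen      = 3 // hypothetical name for the ">= 3 characters" rule
)

// levenshtein returns the edit distance between a and b.
// Only two rows of the DP table are kept, so memory is O(len(b)).
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	if len(ra) == 0 {
		return len(rb)
	}
	if len(rb) == 0 {
		return len(ra)
	}
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // distance from the empty prefix of a
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev // reuse the old row on the next pass
	}
	return prev[len(rb)]
}

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// isFuzzyMatch reports whether indexWord is a near miss for queryWord.
// Distance 0 is excluded because exact matches are scored separately.
func isFuzzyMatch(queryWord, indexWord string) bool {
	if len([]rune(queryWord)) < minFuzzyLen {
		return false
	}
	d := levenshtein(queryWord, indexWord)
	return d > 0 && d <= fuzzyMaxDistance
}
```

Under these assumptions, `isFuzzyMatch("serach", "search")` is true (two substitutions), while a two-letter query such as `"go"` never matches fuzzily.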
#### Phrase Search
- `extractPhrases()` pulls `"quoted strings"` from the query
- Remaining text is tokenised normally for keyword search
- Phrase matching checks title + content + all section content (case-insensitive)
- Phrase boost: +8.0 per matching phrase
- Phrase terms are also compiled as regexes for snippet highlighting
- Empty quotes `""` are left in the query as-is, because the extraction regex requires at least one character between the quotes (`[^"]+`)
- Whitespace-only quotes are ignored
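
The extraction rules above can be sketched as follows. The two-value return signature and the `phraseRe` variable are assumptions made for illustration; only the function name and the `[^"]+` pattern come from these notes.

```go
package main

import (
	"regexp"
	"strings"
)

// phraseRe matches a double-quoted run of at least one non-quote character,
// so an empty pair of quotes never matches.
var phraseRe = regexp.MustCompile(`"([^"]+)"`)

// extractPhrases returns the quoted phrases found in query, plus the
// query text with every matched quoted span removed.
func extractPhrases(query string) (phrases []string, remaining string) {
	for _, m := range phraseRe.FindAllStringSubmatch(query, -1) {
		// Whitespace-only quotes are dropped rather than kept as phrases.
		if p := strings.TrimSpace(m[1]); p != "" {
			phrases = append(phrases, p)
		}
	}
	remaining = strings.TrimSpace(phraseRe.ReplaceAllString(query, " "))
	return phrases, remaining
}
```

In this sketch, the query `"rate limit" config` yields the phrase `rate limit` and leaves `config` for normal tokenisation.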
#### Improved Scoring Weights
- Replaced magic numbers with named constants for clarity:
  - `scoreExactWord = 1.0` -- exact word in index
  - `scorePrefixWord = 0.5` -- prefix/partial word match
  - `scoreFuzzyWord = 0.3` -- Levenshtein fuzzy match
  - `scoreTitleBoost = 10.0` -- query word in topic title
  - `scoreSectionBoost = 5.0` -- query word in section title
  - `scoreTagBoost = 3.0` -- query word matches a tag (NEW)
  - `scorePhraseBoost = 8.0` -- exact phrase match (NEW)
  - `scoreAllWords = 2.0` -- all query words present (NEW)
  - `fuzzyMaxDistance = 2` -- max Levenshtein distance
#### New Scoring Features
- **Tag boost** (+3.0): topics with tags matching query words rank higher
- **Multi-word bonus** (+2.0): topics containing ALL query words get a bonus
- Both are additive with existing boosts (title, section, exact/prefix)
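
To show how the additive weights combine, here is a worked sketch. The constant names and values are the ones listed above; the `matchCounts` type and `score` function are hypothetical, and the sketch assumes each boost applies once per matching query word (or phrase) and the all-words bonus once per topic, which may not mirror the real `Search()` logic exactly.

```go
package main

const (
	scoreExactWord    = 1.0
	scorePrefixWord   = 0.5
	scoreFuzzyWord    = 0.3
	scoreTitleBoost   = 10.0
	scoreSectionBoost = 5.0
	scoreTagBoost     = 3.0
	scorePhraseBoost  = 8.0
	scoreAllWords     = 2.0
	fuzzyMaxDistance  = 2
)

// matchCounts is a hypothetical summary of how a query matched one topic.
type matchCounts struct {
	Exact, Prefix, Fuzzy            int  // word-level matches in the index
	TitleHits, SectionHits, TagHits int  // query words found in title, section titles, tags
	Phrases                         int  // quoted phrases found in the topic
	AllWords                        bool // every query word is present
}

// score sums the weighted contributions; every term is additive.
func score(m matchCounts) float64 {
	s := float64(m.Exact)*scoreExactWord +
		float64(m.Prefix)*scorePrefixWord +
		float64(m.Fuzzy)*scoreFuzzyWord +
		float64(m.TitleHits)*scoreTitleBoost +
		float64(m.SectionHits)*scoreSectionBoost +
		float64(m.TagHits)*scoreTagBoost +
		float64(m.Phrases)*scorePhraseBoost
	if m.AllWords {
		s += scoreAllWords
	}
	return s
}
```

For a two-word query where both words match exactly, one appears in the title, and one matches a tag, this sketch gives 2×1.0 + 10.0 + 3.0 + 2.0 = 17.0.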
### API Compatibility
- `Search(query string) []*SearchResult` signature unchanged
- All existing behaviour preserved; new features are additive
- Existing tests pass without modification