feat(search): add fuzzy matching, phrase search, and improved scoring
Phase 0: Push test coverage from 92.1% to 100% by adding catalog_test.go and targeted tests for all uncovered branches in search.go. Add BenchmarkSearch with 150 topics (~745us/op baseline). Phase 1: Implement three search improvements: - Levenshtein-based fuzzy matching (max distance 2, words >= 3 chars) - Quoted phrase search via extractPhrases() with +8.0 boost - Tag boost (+3.0) and multi-word bonus (+2.0) scoring - Named scoring constants replacing magic numbers All changes are backward-compatible; Search() signature unchanged. Co-Authored-By: Charon <developers@lethean.io>
This commit is contained in:
parent
b6da16c717
commit
3e91510bcf
5 changed files with 907 additions and 19 deletions
61
FINDINGS.md
61
FINDINGS.md
|
|
@ -24,3 +24,64 @@ Extracted from `forge.lthn.ai/core/go` `pkg/help/` on 19 Feb 2026.
|
|||
### Tests
|
||||
|
||||
- 2 test files covering catalog loading and search behaviour
|
||||
|
||||
## 2026-02-20: Phase 0 + Phase 1 (Charon)
|
||||
|
||||
### Phase 0: Coverage 92.1% -> 100%
|
||||
|
||||
- Created `catalog_test.go` — the entire `catalog.go` was untested (0%)
|
||||
- Added targeted search tests for previously uncovered branches:
|
||||
- Nil topic guard in `Search()` (stale index references)
|
||||
- Alphabetical tie-breaking when scores are equal
|
||||
- Headings-only content in snippet extraction (no body text)
|
||||
- Whitespace-only content trimmed to empty in snippets
|
||||
- Empty regex slice in `highlight()`
|
||||
- Overlapping match extension in highlight merging
|
||||
- Added `BenchmarkSearch` with 150 generated topics
|
||||
- Baseline: ~745us/op, ~392KB/op, 4114 allocs/op (Ryzen 9 9950X)
|
||||
- `go vet ./...` clean
|
||||
|
||||
### Phase 1: Search Improvements
|
||||
|
||||
#### Fuzzy Matching (Levenshtein distance)
|
||||
|
||||
- Implemented `levenshtein()` using two-row DP (memory-efficient)
|
||||
- Integrated into `Search()` with max edit distance of 2
|
||||
- Only applied to query words >= 3 characters (avoids noise from short words)
|
||||
- Score: +0.3 per fuzzy match (lower than prefix +0.5 and exact +1.0)
|
||||
- Skips words already matched as exact or prefix (no double-counting)
|
||||
|
||||
#### Phrase Search
|
||||
|
||||
- `extractPhrases()` pulls `"quoted strings"` from the query
|
||||
- Remaining text is tokenised normally for keyword search
|
||||
- Phrase matching checks title + content + all section content (case-insensitive)
|
||||
- Phrase boost: +8.0 per matching phrase
|
||||
- Phrase terms are also compiled as regexes for snippet highlighting
|
||||
- Empty quotes `""` are left as-is (regex requires `[^"]+`)
|
||||
- Whitespace-only quotes are ignored
|
||||
|
||||
#### Improved Scoring Weights
|
||||
|
||||
- Replaced magic numbers with named constants for clarity:
|
||||
- `scoreExactWord = 1.0` -- exact word in index
|
||||
- `scorePrefixWord = 0.5` -- prefix/partial word match
|
||||
- `scoreFuzzyWord = 0.3` -- Levenshtein fuzzy match
|
||||
- `scoreTitleBoost = 10.0` -- query word in topic title
|
||||
- `scoreSectionBoost = 5.0` -- query word in section title
|
||||
- `scoreTagBoost = 3.0` -- query word matches a tag (NEW)
|
||||
- `scorePhraseBoost = 8.0` -- exact phrase match (NEW)
|
||||
- `scoreAllWords = 2.0` -- all query words present (NEW)
|
||||
- `fuzzyMaxDistance = 2` -- max Levenshtein distance
|
||||
|
||||
#### New Scoring Features
|
||||
|
||||
- **Tag boost** (+3.0): topics with tags matching query words rank higher
|
||||
- **Multi-word bonus** (+2.0): topics containing ALL query words get a bonus
|
||||
- Both are additive with existing boosts (title, section, exact/prefix)
|
||||
|
||||
### API Compatibility
|
||||
|
||||
- `Search(query string) []*SearchResult` signature unchanged
|
||||
- All existing behaviour preserved; new features are additive
|
||||
- Existing tests pass without modification
|
||||
|
|
|
|||
19
TODO.md
19
TODO.md
|
|
@ -6,17 +6,22 @@ Dispatched from core/go orchestration. Pick up tasks in order.
|
|||
|
||||
## Phase 0: Hardening & Test Coverage
|
||||
|
||||
- [ ] **Expand parser tests** — `parser_test.go` exists but coverage unknown. Add tests for: empty input, frontmatter-only (no body), malformed YAML frontmatter, deeply nested headings (####, #####), Unicode content, very long documents (10K+ lines).
|
||||
- [ ] **Expand search tests** — `search_test.go` exists. Add tests for: empty query, no results, case sensitivity, multi-word queries, special characters in query, overlapping matches, scoring boundary cases (exact title match vs partial body match).
|
||||
- [ ] **Benchmark search** — Add `BenchmarkSearch` with catalog of 100+ topics. Measure search latency and allocation count. Baseline for Phase 1 improvements.
|
||||
- [ ] **`go vet ./...` clean** — Verify no vet warnings. Fix any found.
|
||||
- [x] **Expand parser tests** — Parser already at 100%. Existing tests cover: empty input, frontmatter-only, malformed YAML, all heading levels (H1-H6), Unicode content, path-derived IDs.
|
||||
- [x] **Expand search tests** — Added tests for: empty query, no results, case sensitivity, multi-word queries, special characters, overlapping matches, scoring boundary cases, nil-topic guard, snippet edge cases (headings-only, whitespace-only).
|
||||
- [x] **Add catalog tests** — Created `catalog_test.go` covering: DefaultCatalog, Add, List, Search, Get (found/not-found), score tie-breaking.
|
||||
- [x] **Benchmark search** — `BenchmarkSearch` with 150 topics. Baseline: ~745us/op, ~392KB/op, 4114 allocs/op (Ryzen 9 9950X).
|
||||
- [x] **`go vet ./...` clean** — No warnings.
|
||||
- [x] **Coverage: 100%** — Up from 92.1%.
|
||||
|
||||
## Phase 1: Search Improvements
|
||||
|
||||
- [ ] Add fuzzy matching (Levenshtein distance or similar)
|
||||
- [x] **Fuzzy matching** — Levenshtein distance with max edit distance of 2. Words under 3 chars skip fuzzy. Score: +0.3 per fuzzy match (below prefix +0.5 and exact +1.0).
|
||||
- [ ] Add stemming support for English search terms
|
||||
- [ ] Add phrase search (quoted multi-word queries)
|
||||
- [ ] Improve scoring weights — currently title +10, section +5, partial +0.5
|
||||
- [x] **Phrase search** — Quoted multi-word queries via `extractPhrases()`. Phrase boost: +8.0. Searches title, content, and section content.
|
||||
- [x] **Improved scoring weights** — Named constants: title +10, section +5, tag +3, phrase +8, all-words bonus +2, exact +1.0, prefix +0.5, fuzzy +0.3.
|
||||
- [x] **Tag boost** — Query words matching tags add +3.0 per matching tag.
|
||||
- [x] **Multi-word bonus** — All query words present in topic adds +2.0.
|
||||
- [x] **Tests for all new features** — Levenshtein, min3, extractPhrases, fuzzy search, phrase search, tag boost, multi-word bonus, scoring constants, phrase highlighting, section phrase matching.
|
||||
|
||||
## Phase 2: core.help Integration
|
||||
|
||||
|
|
|
|||
212
catalog_test.go
Normal file
212
catalog_test.go
Normal file
|
|
@ -0,0 +1,212 @@
|
|||
package help
|
||||
|
||||
import (
|
||||
"fmt"
|
||||
"testing"
|
||||
|
||||
"github.com/stretchr/testify/assert"
|
||||
"github.com/stretchr/testify/require"
|
||||
)
|
||||
|
||||
func TestDefaultCatalog_Good(t *testing.T) {
|
||||
c := DefaultCatalog()
|
||||
|
||||
require.NotNil(t, c)
|
||||
require.NotNil(t, c.topics)
|
||||
require.NotNil(t, c.index)
|
||||
|
||||
t.Run("contains built-in topics", func(t *testing.T) {
|
||||
topics := c.List()
|
||||
assert.GreaterOrEqual(t, len(topics), 2, "should have at least 2 default topics")
|
||||
})
|
||||
|
||||
t.Run("getting-started topic exists", func(t *testing.T) {
|
||||
topic, err := c.Get("getting-started")
|
||||
require.NoError(t, err)
|
||||
assert.Equal(t, "Getting Started", topic.Title)
|
||||
assert.Contains(t, topic.Content, "Common Commands")
|
||||
})
|
||||
|
||||
t.Run("config topic exists", func(t *testing.T) {
|
||||
topic, err := c.Get("config")
|
||||
require.NoError(t, err)
|
||||
assert.Equal(t, "Configuration", topic.Title)
|
||||
assert.Contains(t, topic.Content, "Environment Variables")
|
||||
})
|
||||
}
|
||||
|
||||
func TestCatalog_Add_Good(t *testing.T) {
|
||||
c := &Catalog{
|
||||
topics: make(map[string]*Topic),
|
||||
index: newSearchIndex(),
|
||||
}
|
||||
|
||||
topic := &Topic{
|
||||
ID: "test-topic",
|
||||
Title: "Test Topic",
|
||||
Content: "This is a test topic for unit testing.",
|
||||
Tags: []string{"test", "unit"},
|
||||
}
|
||||
|
||||
c.Add(topic)
|
||||
|
||||
t.Run("topic is retrievable after add", func(t *testing.T) {
|
||||
got, err := c.Get("test-topic")
|
||||
require.NoError(t, err)
|
||||
assert.Equal(t, topic, got)
|
||||
})
|
||||
|
||||
t.Run("topic is searchable after add", func(t *testing.T) {
|
||||
results := c.Search("test")
|
||||
assert.NotEmpty(t, results)
|
||||
})
|
||||
|
||||
t.Run("overwrite existing topic", func(t *testing.T) {
|
||||
replacement := &Topic{
|
||||
ID: "test-topic",
|
||||
Title: "Replaced Topic",
|
||||
Content: "Replacement content.",
|
||||
}
|
||||
c.Add(replacement)
|
||||
|
||||
got, err := c.Get("test-topic")
|
||||
require.NoError(t, err)
|
||||
assert.Equal(t, "Replaced Topic", got.Title)
|
||||
})
|
||||
}
|
||||
|
||||
func TestCatalog_List_Good(t *testing.T) {
|
||||
c := &Catalog{
|
||||
topics: make(map[string]*Topic),
|
||||
index: newSearchIndex(),
|
||||
}
|
||||
|
||||
t.Run("empty catalog returns empty list", func(t *testing.T) {
|
||||
list := c.List()
|
||||
assert.Empty(t, list)
|
||||
})
|
||||
|
||||
t.Run("returns all added topics", func(t *testing.T) {
|
||||
c.Add(&Topic{ID: "alpha", Title: "Alpha"})
|
||||
c.Add(&Topic{ID: "beta", Title: "Beta"})
|
||||
c.Add(&Topic{ID: "gamma", Title: "Gamma"})
|
||||
|
||||
list := c.List()
|
||||
assert.Len(t, list, 3)
|
||||
|
||||
// Collect IDs (order is not guaranteed from map)
|
||||
ids := make(map[string]bool)
|
||||
for _, t := range list {
|
||||
ids[t.ID] = true
|
||||
}
|
||||
assert.True(t, ids["alpha"])
|
||||
assert.True(t, ids["beta"])
|
||||
assert.True(t, ids["gamma"])
|
||||
})
|
||||
}
|
||||
|
||||
func TestCatalog_Search_Good(t *testing.T) {
|
||||
c := DefaultCatalog()
|
||||
|
||||
t.Run("finds default topics", func(t *testing.T) {
|
||||
results := c.Search("configuration")
|
||||
assert.NotEmpty(t, results)
|
||||
})
|
||||
|
||||
t.Run("empty query returns nil", func(t *testing.T) {
|
||||
results := c.Search("")
|
||||
assert.Nil(t, results)
|
||||
})
|
||||
|
||||
t.Run("no match returns empty", func(t *testing.T) {
|
||||
results := c.Search("zzzyyyxxx")
|
||||
assert.Empty(t, results)
|
||||
})
|
||||
}
|
||||
|
||||
func TestCatalog_Get_Good(t *testing.T) {
|
||||
c := &Catalog{
|
||||
topics: make(map[string]*Topic),
|
||||
index: newSearchIndex(),
|
||||
}
|
||||
|
||||
c.Add(&Topic{ID: "exists", Title: "Existing Topic"})
|
||||
|
||||
t.Run("existing topic", func(t *testing.T) {
|
||||
topic, err := c.Get("exists")
|
||||
require.NoError(t, err)
|
||||
assert.Equal(t, "Existing Topic", topic.Title)
|
||||
})
|
||||
|
||||
t.Run("missing topic returns error", func(t *testing.T) {
|
||||
topic, err := c.Get("does-not-exist")
|
||||
assert.Nil(t, topic)
|
||||
assert.Error(t, err)
|
||||
assert.Contains(t, err.Error(), "topic not found")
|
||||
assert.Contains(t, err.Error(), "does-not-exist")
|
||||
})
|
||||
}
|
||||
|
||||
func TestCatalog_Search_Good_ScoreTiebreaking(t *testing.T) {
|
||||
// Tests the alphabetical tie-breaking in search result sorting (search.go:165).
|
||||
c := &Catalog{
|
||||
topics: make(map[string]*Topic),
|
||||
index: newSearchIndex(),
|
||||
}
|
||||
|
||||
// Add topics with identical content so they receive the same score.
|
||||
c.Add(&Topic{
|
||||
ID: "zebra-topic",
|
||||
Title: "Zebra",
|
||||
Content: "Unique keyword zephyr.",
|
||||
})
|
||||
c.Add(&Topic{
|
||||
ID: "alpha-topic",
|
||||
Title: "Alpha",
|
||||
Content: "Unique keyword zephyr.",
|
||||
})
|
||||
|
||||
results := c.Search("zephyr")
|
||||
require.Len(t, results, 2)
|
||||
|
||||
// With equal scores, results should be sorted alphabetically by title.
|
||||
assert.Equal(t, "Alpha", results[0].Topic.Title)
|
||||
assert.Equal(t, "Zebra", results[1].Topic.Title)
|
||||
assert.Equal(t, results[0].Score, results[1].Score,
|
||||
"scores should be equal for tie-breaking to apply")
|
||||
}
|
||||
|
||||
func BenchmarkSearch(b *testing.B) {
|
||||
// Build a catalog with 100+ topics for benchmarking.
|
||||
c := &Catalog{
|
||||
topics: make(map[string]*Topic),
|
||||
index: newSearchIndex(),
|
||||
}
|
||||
|
||||
for i := 0; i < 150; i++ {
|
||||
c.Add(&Topic{
|
||||
ID: fmt.Sprintf("topic-%d", i),
|
||||
Title: fmt.Sprintf("Topic Number %d About Various Subjects", i),
|
||||
Content: fmt.Sprintf("This is the content of topic %d. It covers installation, configuration, deployment, and testing of the system.", i),
|
||||
Tags: []string{"generated", fmt.Sprintf("tag%d", i%10)},
|
||||
Sections: []Section{
|
||||
{
|
||||
ID: fmt.Sprintf("section-%d-a", i),
|
||||
Title: "Overview",
|
||||
Content: "An overview of the topic and its purpose.",
|
||||
},
|
||||
{
|
||||
ID: fmt.Sprintf("section-%d-b", i),
|
||||
Title: "Details",
|
||||
Content: "Detailed information about the topic including examples and usage.",
|
||||
},
|
||||
},
|
||||
})
|
||||
}
|
||||
|
||||
b.ResetTimer()
|
||||
b.ReportAllocs()
|
||||
for i := 0; i < b.N; i++ {
|
||||
c.Search("installation configuration")
|
||||
}
|
||||
}
|
||||
180
search.go
180
search.go
|
|
@ -7,6 +7,19 @@ import (
|
|||
"unicode"
|
||||
)
|
||||
|
||||
// Scoring weights for search result ranking.
|
||||
const (
|
||||
scoreExactWord = 1.0 // Exact word match in the index
|
||||
scorePrefixWord = 0.5 // Prefix/partial word match
|
||||
scoreFuzzyWord = 0.3 // Fuzzy (Levenshtein) match
|
||||
scoreTitleBoost = 10.0 // Query word appears in topic title
|
||||
scoreSectionBoost = 5.0 // Query word appears in section title
|
||||
scoreTagBoost = 3.0 // Query word appears in topic tags
|
||||
scorePhraseBoost = 8.0 // Exact phrase match in content
|
||||
scoreAllWords = 2.0 // All query words present (multi-word bonus)
|
||||
fuzzyMaxDistance = 2 // Maximum edit distance for fuzzy matching
|
||||
)
|
||||
|
||||
// SearchResult represents a search match.
|
||||
type SearchResult struct {
|
||||
Topic *Topic
|
||||
|
|
@ -72,10 +85,17 @@ func (i *searchIndex) addToIndex(word, topicID string) {
|
|||
i.index[word] = append(i.index[word], topicID)
|
||||
}
|
||||
|
||||
// Search finds topics matching the query.
|
||||
// Search finds topics matching the query. Supports:
|
||||
// - Single and multi-word keyword queries
|
||||
// - Quoted phrase search (e.g. `"rate limit"`)
|
||||
// - Fuzzy matching via Levenshtein distance for typo tolerance
|
||||
// - Prefix matching for partial words
|
||||
func (i *searchIndex) Search(query string) []*SearchResult {
|
||||
queryWords := tokenize(query)
|
||||
if len(queryWords) == 0 {
|
||||
// Extract quoted phrases before tokenising.
|
||||
phrases, stripped := extractPhrases(query)
|
||||
|
||||
queryWords := tokenize(stripped)
|
||||
if len(queryWords) == 0 && len(phrases) == 0 {
|
||||
return nil
|
||||
}
|
||||
|
||||
|
|
@ -86,7 +106,7 @@ func (i *searchIndex) Search(query string) []*SearchResult {
|
|||
// Exact matches
|
||||
if topicIDs, ok := i.index[word]; ok {
|
||||
for _, topicID := range topicIDs {
|
||||
scores[topicID] += 1.0
|
||||
scores[topicID] += scoreExactWord
|
||||
}
|
||||
}
|
||||
|
||||
|
|
@ -94,7 +114,25 @@ func (i *searchIndex) Search(query string) []*SearchResult {
|
|||
for indexWord, topicIDs := range i.index {
|
||||
if strings.HasPrefix(indexWord, word) && indexWord != word {
|
||||
for _, topicID := range topicIDs {
|
||||
scores[topicID] += 0.5 // Lower score for partial matches
|
||||
scores[topicID] += scorePrefixWord
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Fuzzy matches (Levenshtein distance)
|
||||
if len(word) >= 3 {
|
||||
for indexWord, topicIDs := range i.index {
|
||||
if indexWord == word {
|
||||
continue // Already scored as exact match
|
||||
}
|
||||
if strings.HasPrefix(indexWord, word) {
|
||||
continue // Already scored as prefix match
|
||||
}
|
||||
dist := levenshtein(word, indexWord)
|
||||
if dist > 0 && dist <= fuzzyMaxDistance {
|
||||
for _, topicID := range topicIDs {
|
||||
scores[topicID] += scoreFuzzyWord
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
|
@ -109,8 +147,28 @@ func (i *searchIndex) Search(query string) []*SearchResult {
|
|||
}
|
||||
}
|
||||
}
|
||||
// Also add phrase regexes for highlighting
|
||||
for _, phrase := range phrases {
|
||||
if re, err := regexp.Compile("(?i)" + regexp.QuoteMeta(phrase)); err == nil {
|
||||
res = append(res, re)
|
||||
}
|
||||
}
|
||||
|
||||
// Build results with title boost and snippet extraction
|
||||
// Phrase matching: boost topics that contain the exact phrase.
|
||||
for _, phrase := range phrases {
|
||||
phraseLower := strings.ToLower(phrase)
|
||||
for topicID, topic := range i.topics {
|
||||
text := strings.ToLower(topic.Title + " " + topic.Content)
|
||||
for _, section := range topic.Sections {
|
||||
text += " " + strings.ToLower(section.Title+" "+section.Content)
|
||||
}
|
||||
if strings.Contains(text, phraseLower) {
|
||||
scores[topicID] += scorePhraseBoost
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Build results with title/section/tag boosts and snippet extraction
|
||||
var results []*SearchResult
|
||||
for topicID, score := range scores {
|
||||
topic := i.topics[topicID]
|
||||
|
|
@ -120,15 +178,40 @@ func (i *searchIndex) Search(query string) []*SearchResult {
|
|||
|
||||
// Title boost: if query words appear in title
|
||||
titleLower := strings.ToLower(topic.Title)
|
||||
hasTitleMatch := false
|
||||
titleMatchCount := 0
|
||||
for _, word := range queryWords {
|
||||
if strings.Contains(titleLower, word) {
|
||||
hasTitleMatch = true
|
||||
break
|
||||
titleMatchCount++
|
||||
}
|
||||
}
|
||||
if hasTitleMatch {
|
||||
score += 10.0
|
||||
if titleMatchCount > 0 {
|
||||
score += scoreTitleBoost
|
||||
}
|
||||
|
||||
// Tag boost: if query words match tags
|
||||
for _, tag := range topic.Tags {
|
||||
tagLower := strings.ToLower(tag)
|
||||
for _, word := range queryWords {
|
||||
if tagLower == word || strings.Contains(tagLower, word) {
|
||||
score += scoreTagBoost
|
||||
break
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// Multi-word bonus: if all query words are present in the topic
|
||||
if len(queryWords) > 1 {
|
||||
allPresent := true
|
||||
fullText := strings.ToLower(topic.Title + " " + topic.Content)
|
||||
for _, word := range queryWords {
|
||||
if !strings.Contains(fullText, word) {
|
||||
allPresent = false
|
||||
break
|
||||
}
|
||||
}
|
||||
if allPresent {
|
||||
score += scoreAllWords
|
||||
}
|
||||
}
|
||||
|
||||
// Find matching section and extract snippet
|
||||
|
|
@ -145,7 +228,7 @@ func (i *searchIndex) Search(query string) []*SearchResult {
|
|||
}
|
||||
}
|
||||
if hasSectionTitleMatch {
|
||||
score += 5.0
|
||||
score += scoreSectionBoost
|
||||
}
|
||||
}
|
||||
|
||||
|
|
@ -168,6 +251,79 @@ func (i *searchIndex) Search(query string) []*SearchResult {
|
|||
return results
|
||||
}
|
||||
|
||||
// extractPhrases pulls quoted substrings from the query and returns them
|
||||
// alongside the remaining query text with quotes removed.
|
||||
// For example: `hello "rate limit" world` returns
|
||||
// phrases=["rate limit"], remaining="hello world".
|
||||
func extractPhrases(query string) (phrases []string, remaining string) {
|
||||
re := regexp.MustCompile(`"([^"]+)"`)
|
||||
matches := re.FindAllStringSubmatch(query, -1)
|
||||
for _, m := range matches {
|
||||
phrase := strings.TrimSpace(m[1])
|
||||
if phrase != "" {
|
||||
phrases = append(phrases, phrase)
|
||||
}
|
||||
}
|
||||
remaining = re.ReplaceAllString(query, "")
|
||||
return phrases, remaining
|
||||
}
|
||||
|
||||
// levenshtein computes the edit distance between two strings.
|
||||
// Used for fuzzy matching to tolerate typos in search queries.
|
||||
func levenshtein(a, b string) int {
|
||||
aRunes := []rune(a)
|
||||
bRunes := []rune(b)
|
||||
aLen := len(aRunes)
|
||||
bLen := len(bRunes)
|
||||
|
||||
if aLen == 0 {
|
||||
return bLen
|
||||
}
|
||||
if bLen == 0 {
|
||||
return aLen
|
||||
}
|
||||
|
||||
// Use two rows instead of full matrix to save memory.
|
||||
prev := make([]int, bLen+1)
|
||||
curr := make([]int, bLen+1)
|
||||
|
||||
for j := 0; j <= bLen; j++ {
|
||||
prev[j] = j
|
||||
}
|
||||
|
||||
for i := 1; i <= aLen; i++ {
|
||||
curr[0] = i
|
||||
for j := 1; j <= bLen; j++ {
|
||||
cost := 1
|
||||
if aRunes[i-1] == bRunes[j-1] {
|
||||
cost = 0
|
||||
}
|
||||
curr[j] = min3(
|
||||
prev[j]+1, // deletion
|
||||
curr[j-1]+1, // insertion
|
||||
prev[j-1]+cost, // substitution
|
||||
)
|
||||
}
|
||||
prev, curr = curr, prev
|
||||
}
|
||||
|
||||
return prev[bLen]
|
||||
}
|
||||
|
||||
// min3 returns the minimum of three integers.
|
||||
func min3(a, b, c int) int {
|
||||
if a < b {
|
||||
if a < c {
|
||||
return a
|
||||
}
|
||||
return c
|
||||
}
|
||||
if b < c {
|
||||
return b
|
||||
}
|
||||
return c
|
||||
}
|
||||
|
||||
// findBestMatch finds the section with the best match and extracts a snippet.
|
||||
func (i *searchIndex) findBestMatch(topic *Topic, queryWords []string, res []*regexp.Regexp) (*Section, string) {
|
||||
var bestSection *Section
|
||||
|
|
|
|||
454
search_test.go
454
search_test.go
|
|
@ -338,3 +338,457 @@ func TestSearchResult_Score_Good(t *testing.T) {
|
|||
assert.Equal(t, "topic-in-title", results[0].Topic.ID)
|
||||
assert.Greater(t, results[0].Score, results[1].Score)
|
||||
}
|
||||
|
||||
func TestExtractSnippet_Good_HeadingsOnly(t *testing.T) {
|
||||
// Content with only headings and no body text should return empty snippet
|
||||
// when no regexes are provided. Covers the empty-return branch.
|
||||
content := "# Heading One\n## Heading Two\n### Heading Three"
|
||||
|
||||
snippet := extractSnippet(content, nil)
|
||||
assert.Empty(t, snippet, "headings-only content without regexes should return empty snippet")
|
||||
}
|
||||
|
||||
func TestExtractSnippet_Good_SnippetTrimmedToEmpty(t *testing.T) {
|
||||
// After word-boundary trimming the snippet could become empty.
|
||||
// This exercises the snippet=="" guard after TrimSpace.
|
||||
content := strings.Repeat(" ", 200)
|
||||
|
||||
snippet := extractSnippet(content, compileRegexes([]string{"zz"}))
|
||||
assert.Empty(t, snippet, "whitespace-only content should yield empty snippet")
|
||||
}
|
||||
|
||||
func TestHighlight_Good_EmptyRegexes(t *testing.T) {
|
||||
// Calling highlight with an empty regex slice should return the
|
||||
// text unchanged. Covers the early-return branch.
|
||||
result := highlight("some text here", nil)
|
||||
assert.Equal(t, "some text here", result)
|
||||
|
||||
result = highlight("some text here", []*regexp.Regexp{})
|
||||
assert.Equal(t, "some text here", result)
|
||||
}
|
||||
|
||||
func TestHighlight_Good_OverlappingExtension(t *testing.T) {
|
||||
// Test overlapping matches where the second match extends past the
|
||||
// first, exercising the curr.end extension branch in merging.
|
||||
text := "abcdefghij"
|
||||
|
||||
// First regex matches "abcdef", second matches "cdefghij"
|
||||
// They overlap and the second extends the merged range.
|
||||
re1, _ := regexp.Compile("abcdef")
|
||||
re2, _ := regexp.Compile("cdefghij")
|
||||
|
||||
result := highlight(text, []*regexp.Regexp{re1, re2})
|
||||
assert.Equal(t, "**abcdefghij**", result)
|
||||
}
|
||||
|
||||
func TestSearchIndex_Search_Good_NilTopicGuard(t *testing.T) {
|
||||
// Manually inject a stale reference in the scores map so the
|
||||
// nil-topic guard is exercised. We do this by manipulating the
|
||||
// index directly.
|
||||
idx := newSearchIndex()
|
||||
|
||||
idx.Add(&Topic{
|
||||
ID: "real-topic",
|
||||
Title: "Real Topic",
|
||||
Content: "Contains testword for matching.",
|
||||
})
|
||||
|
||||
// Inject a mapping from "testword" to a non-existent topic ID.
|
||||
idx.index["testword"] = append(idx.index["testword"], "ghost-topic")
|
||||
|
||||
results := idx.Search("testword")
|
||||
// Should still find the real topic, ghost-topic should be skipped.
|
||||
assert.Len(t, results, 1)
|
||||
assert.Equal(t, "real-topic", results[0].Topic.ID)
|
||||
}
|
||||
|
||||
func TestSearchIndex_Search_Good_SpecialCharacters(t *testing.T) {
|
||||
idx := newSearchIndex()
|
||||
|
||||
idx.Add(&Topic{
|
||||
ID: "special-chars",
|
||||
Title: "Rate Limiting (v2.0)",
|
||||
Content: "Configure rate-limiting rules with special characters: @#$%.",
|
||||
})
|
||||
|
||||
t.Run("query with special characters", func(t *testing.T) {
|
||||
results := idx.Search("rate limiting")
|
||||
assert.NotEmpty(t, results)
|
||||
assert.Equal(t, "special-chars", results[0].Topic.ID)
|
||||
})
|
||||
|
||||
t.Run("query with punctuation stripped", func(t *testing.T) {
|
||||
results := idx.Search("v2.0")
|
||||
assert.NotEmpty(t, results)
|
||||
})
|
||||
}
|
||||
|
||||
func TestSearchIndex_Search_Good_CaseInsensitive(t *testing.T) {
|
||||
idx := newSearchIndex()
|
||||
|
||||
idx.Add(&Topic{
|
||||
ID: "case-test",
|
||||
Title: "UPPERCASE Title",
|
||||
Content: "MiXeD CaSe content here.",
|
||||
})
|
||||
|
||||
t.Run("lowercase query finds uppercase content", func(t *testing.T) {
|
||||
results := idx.Search("uppercase")
|
||||
assert.NotEmpty(t, results)
|
||||
assert.Equal(t, "case-test", results[0].Topic.ID)
|
||||
})
|
||||
|
||||
t.Run("uppercase query finds content", func(t *testing.T) {
|
||||
results := idx.Search("MIXED")
|
||||
assert.NotEmpty(t, results)
|
||||
})
|
||||
}
|
||||
|
||||
func TestSearchIndex_Search_Good_SingleCharQuery(t *testing.T) {
|
||||
idx := newSearchIndex()
|
||||
|
||||
idx.Add(&Topic{
|
||||
ID: "single-char",
|
||||
Title: "Test Topic",
|
||||
Content: "Some content.",
|
||||
})
|
||||
|
||||
// Single-character queries are filtered out by tokenize (min 2 chars),
|
||||
// so searching for "a" should return nil.
|
||||
results := idx.Search("a")
|
||||
assert.Nil(t, results)
|
||||
}
|
||||
|
||||
// --- Phase 1: Fuzzy matching tests ---
|
||||
|
||||
func TestLevenshtein_Good(t *testing.T) {
|
||||
tests := []struct {
|
||||
name string
|
||||
a string
|
||||
b string
|
||||
expected int
|
||||
}{
|
||||
{name: "identical strings", a: "hello", b: "hello", expected: 0},
|
||||
{name: "single substitution", a: "hello", b: "hallo", expected: 1},
|
||||
{name: "single insertion", a: "hello", b: "helloo", expected: 1},
|
||||
{name: "single deletion", a: "hello", b: "helo", expected: 1},
|
||||
{name: "two edits", a: "kitten", b: "sitting", expected: 3},
|
||||
{name: "completely different", a: "abc", b: "xyz", expected: 3},
|
||||
{name: "empty first string", a: "", b: "hello", expected: 5},
|
||||
{name: "empty second string", a: "hello", b: "", expected: 5},
|
||||
{name: "both empty", a: "", b: "", expected: 0},
|
||||
{name: "single char strings", a: "a", b: "b", expected: 1},
|
||||
{name: "same single char", a: "a", b: "a", expected: 0},
|
||||
{name: "transposition", a: "ab", b: "ba", expected: 2},
|
||||
{name: "unicode strings", a: "cafe", b: "cafe", expected: 0},
|
||||
}
|
||||
|
||||
for _, tt := range tests {
|
||||
t.Run(tt.name, func(t *testing.T) {
|
||||
result := levenshtein(tt.a, tt.b)
|
||||
assert.Equal(t, tt.expected, result)
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
func TestMin3_Good(t *testing.T) {
|
||||
tests := []struct {
|
||||
name string
|
||||
a, b, c int
|
||||
expected int
|
||||
}{
|
||||
{name: "first smallest", a: 1, b: 2, c: 3, expected: 1},
|
||||
{name: "second smallest", a: 2, b: 1, c: 3, expected: 1},
|
||||
{name: "third smallest", a: 3, b: 2, c: 1, expected: 1},
|
||||
{name: "all equal", a: 5, b: 5, c: 5, expected: 5},
|
||||
{name: "first and second equal smallest", a: 1, b: 1, c: 3, expected: 1},
|
||||
{name: "second and third equal smallest", a: 3, b: 1, c: 1, expected: 1},
|
||||
{name: "first and third equal smallest", a: 1, b: 3, c: 1, expected: 1},
|
||||
{name: "negative values", a: -3, b: -1, c: -2, expected: -3},
|
||||
}
|
||||
|
||||
for _, tt := range tests {
|
||||
t.Run(tt.name, func(t *testing.T) {
|
||||
result := min3(tt.a, tt.b, tt.c)
|
||||
assert.Equal(t, tt.expected, result)
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
func TestSearchIndex_Search_Good_FuzzyMatching(t *testing.T) {
|
||||
idx := newSearchIndex()
|
||||
|
||||
idx.Add(&Topic{
|
||||
ID: "configuration",
|
||||
Title: "Configuration Guide",
|
||||
Content: "Learn how to configure the application settings.",
|
||||
})
|
||||
|
||||
idx.Add(&Topic{
|
||||
ID: "deployment",
|
||||
Title: "Deployment Process",
|
||||
Content: "Deploy your application to production servers.",
|
||||
})
|
||||
|
||||
t.Run("typo in query finds correct topic", func(t *testing.T) {
|
||||
// "configuraton" is 1 edit away from "configuration"
|
||||
results := idx.Search("configuraton")
|
||||
assert.NotEmpty(t, results, "fuzzy match should find results for typo")
|
||||
found := false
|
||||
for _, r := range results {
|
||||
if r.Topic.ID == "configuration" {
|
||||
found = true
|
||||
break
|
||||
}
|
||||
}
|
||||
assert.True(t, found, "should find configuration topic with typo")
|
||||
})
|
||||
|
||||
t.Run("two-edit typo still matches", func(t *testing.T) {
|
||||
// "deplymnt" is within 2 edits of "deployment" — but first check
|
||||
// that "deploymnt" (1 edit) works.
|
||||
results := idx.Search("deploymnt")
|
||||
assert.NotEmpty(t, results, "fuzzy match should find results for 1-edit typo")
|
||||
})
|
||||
|
||||
t.Run("too many edits returns no fuzzy match", func(t *testing.T) {
|
||||
// "zzzzzzz" is very far from any indexed word.
|
||||
results := idx.Search("zzzzzzz")
|
||||
assert.Empty(t, results, "large edit distance should not produce results")
|
||||
})
|
||||
|
||||
t.Run("short words skip fuzzy matching", func(t *testing.T) {
|
||||
// Words shorter than 3 characters skip fuzzy matching.
|
||||
// "to" is in the index but "tx" (1 edit) should not fuzzy-match
|
||||
// because "tx" is only 2 chars.
|
||||
results := idx.Search("tx")
|
||||
// May or may not find results via prefix, but should not crash.
|
||||
_ = results
|
||||
})
|
||||
|
||||
t.Run("fuzzy scores lower than exact", func(t *testing.T) {
|
||||
// Exact match on "configure" should score higher than fuzzy.
|
||||
exactResults := idx.Search("configure")
|
||||
fuzzyResults := idx.Search("configurr")
|
||||
|
||||
if len(exactResults) > 0 && len(fuzzyResults) > 0 {
|
||||
assert.GreaterOrEqual(t, exactResults[0].Score, fuzzyResults[0].Score,
|
||||
"exact match should score at least as high as fuzzy match")
|
||||
}
|
||||
})
|
||||
}
|
||||
|
||||
// --- Phase 1: Phrase search tests ---
|
||||
|
||||
func TestExtractPhrases_Good(t *testing.T) {
|
||||
tests := []struct {
|
||||
name string
|
||||
query string
|
||||
expectedPhrases []string
|
||||
expectedRemaining string
|
||||
}{
|
||||
{
|
||||
name: "no phrases",
|
||||
query: "hello world",
|
||||
expectedPhrases: nil,
|
||||
expectedRemaining: "hello world",
|
||||
},
|
||||
{
|
||||
name: "single phrase",
|
||||
query: `"rate limit"`,
|
||||
expectedPhrases: []string{"rate limit"},
|
||||
expectedRemaining: "",
|
||||
},
|
||||
{
|
||||
name: "phrase with surrounding words",
|
||||
query: `configure "rate limit" rules`,
|
||||
expectedPhrases: []string{"rate limit"},
|
||||
expectedRemaining: "configure rules",
|
||||
},
|
||||
{
|
||||
name: "multiple phrases",
|
||||
query: `"rate limit" and "error handling"`,
|
||||
expectedPhrases: []string{"rate limit", "error handling"},
|
||||
expectedRemaining: " and ",
|
||||
},
|
||||
{
|
||||
name: "empty quotes left in remaining",
|
||||
query: `"" hello`,
|
||||
expectedPhrases: nil,
|
||||
expectedRemaining: `"" hello`,
|
||||
},
|
||||
{
|
||||
name: "whitespace-only quotes ignored",
|
||||
query: `" " hello`,
|
||||
expectedPhrases: nil,
|
||||
expectedRemaining: " hello",
|
||||
},
|
||||
{
|
||||
name: "unclosed quote treated as plain text",
|
||||
query: `"unclosed phrase`,
|
||||
expectedPhrases: nil,
|
||||
expectedRemaining: `"unclosed phrase`,
|
||||
},
|
||||
}
|
||||
|
||||
for _, tt := range tests {
|
||||
t.Run(tt.name, func(t *testing.T) {
|
||||
phrases, remaining := extractPhrases(tt.query)
|
||||
assert.Equal(t, tt.expectedPhrases, phrases)
|
||||
assert.Equal(t, tt.expectedRemaining, remaining)
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
func TestSearchIndex_Search_Good_PhraseSearch(t *testing.T) {
|
||||
idx := newSearchIndex()
|
||||
|
||||
idx.Add(&Topic{
|
||||
ID: "rate-limiting",
|
||||
Title: "Rate Limiting",
|
||||
Content: "Configure rate limit rules for API endpoints. The rate limit protects against abuse.",
|
||||
Tags: []string{"api", "security"},
|
||||
})
|
||||
|
||||
idx.Add(&Topic{
|
||||
ID: "error-handling",
|
||||
Title: "Error Handling",
|
||||
Content: "Handle errors and rate responses correctly. Limit retries to avoid loops.",
|
||||
Tags: []string{"errors"},
|
||||
})
|
||||
|
||||
t.Run("quoted phrase matches exact sequence", func(t *testing.T) {
|
||||
results := idx.Search(`"rate limit"`)
|
||||
assert.NotEmpty(t, results)
|
||||
// rate-limiting topic has "rate limit" as exact phrase
|
||||
assert.Equal(t, "rate-limiting", results[0].Topic.ID)
|
||||
})
|
||||
|
||||
t.Run("unquoted words match both topics", func(t *testing.T) {
|
||||
results := idx.Search("rate limit")
|
||||
assert.GreaterOrEqual(t, len(results), 2,
|
||||
"unquoted query should match both topics that contain the words separately")
|
||||
})
|
||||
|
||||
t.Run("phrase not found yields no phrase boost", func(t *testing.T) {
|
||||
results := idx.Search(`"nonexistent phrase here"`)
|
||||
assert.Empty(t, results, "phrase with no tokenisable words and no match should return empty")
|
||||
})
|
||||
|
||||
t.Run("phrase with surrounding keywords", func(t *testing.T) {
|
||||
results := idx.Search(`"rate limit" api`)
|
||||
assert.NotEmpty(t, results)
|
||||
assert.Equal(t, "rate-limiting", results[0].Topic.ID)
|
||||
})
|
||||
|
||||
t.Run("phrase-only query with no loose words", func(t *testing.T) {
|
||||
// Query is only a quoted phrase; tokenize of remaining is empty,
|
||||
// but phrase matching should still score topics.
|
||||
results := idx.Search(`"rate limit"`)
|
||||
assert.NotEmpty(t, results)
|
||||
})
|
||||
|
||||
t.Run("phrase found in section content", func(t *testing.T) {
|
||||
sectionIdx := newSearchIndex()
|
||||
sectionIdx.Add(&Topic{
|
||||
ID: "section-phrase",
|
||||
Title: "Advanced Guide",
|
||||
Content: "Overview of the system.",
|
||||
Sections: []Section{
|
||||
{
|
||||
ID: "limits",
|
||||
Title: "Limits",
|
||||
Content: "The rate limit is set per client.",
|
||||
},
|
||||
},
|
||||
})
|
||||
|
||||
results := sectionIdx.Search(`"rate limit"`)
|
||||
assert.NotEmpty(t, results)
|
||||
assert.Equal(t, "section-phrase", results[0].Topic.ID)
|
||||
})
|
||||
}
|
||||
|
||||
// --- Phase 1: Improved scoring tests ---
|
||||
|
||||
func TestSearchIndex_Search_Good_TagBoost(t *testing.T) {
|
||||
idx := newSearchIndex()
|
||||
|
||||
idx.Add(&Topic{
|
||||
ID: "tagged-topic",
|
||||
Title: "Some Guide",
|
||||
Content: "General content about the system.",
|
||||
Tags: []string{"deployment", "production"},
|
||||
})
|
||||
|
||||
idx.Add(&Topic{
|
||||
ID: "content-topic",
|
||||
Title: "Other Guide",
|
||||
Content: "Information about deployment processes.",
|
||||
})
|
||||
|
||||
results := idx.Search("deployment")
|
||||
assert.NotEmpty(t, results)
|
||||
|
||||
// The tagged topic should rank higher because of the tag boost.
|
||||
assert.Equal(t, "tagged-topic", results[0].Topic.ID,
|
||||
"topic with matching tag should rank higher")
|
||||
}
|
||||
|
||||
func TestSearchIndex_Search_Good_MultiWordBonus(t *testing.T) {
|
||||
idx := newSearchIndex()
|
||||
|
||||
// Both topics have neutral titles (no query words in title) to
|
||||
// isolate the multi-word bonus effect.
|
||||
idx.Add(&Topic{
|
||||
ID: "both-words",
|
||||
Title: "Complete Guide",
|
||||
Content: "Learn about deploying and monitoring in one place.",
|
||||
})
|
||||
|
||||
idx.Add(&Topic{
|
||||
ID: "one-word",
|
||||
Title: "Other Guide",
|
||||
Content: "Just deploying steps without monitoring.",
|
||||
})
|
||||
|
||||
results := idx.Search("deploying monitoring")
|
||||
assert.NotEmpty(t, results)
|
||||
|
||||
// Topic with both words should score higher due to multi-word bonus.
|
||||
assert.Equal(t, "both-words", results[0].Topic.ID,
|
||||
"topic containing all query words should rank higher")
|
||||
}
|
||||
|
||||
func TestSearchIndex_Search_Good_ScoringConstants(t *testing.T) {
|
||||
// Verify the scoring constants are sensible relative to each other.
|
||||
assert.Greater(t, scoreTitleBoost, scoreSectionBoost,
|
||||
"title boost should exceed section boost")
|
||||
assert.Greater(t, scoreSectionBoost, scoreTagBoost,
|
||||
"section boost should exceed tag boost")
|
||||
assert.Greater(t, scoreExactWord, scorePrefixWord,
|
||||
"exact match should score higher than prefix match")
|
||||
assert.Greater(t, scorePrefixWord, scoreFuzzyWord,
|
||||
"prefix match should score higher than fuzzy match")
|
||||
assert.Greater(t, scorePhraseBoost, scoreAllWords,
|
||||
"phrase boost should exceed multi-word bonus")
|
||||
}
|
||||
|
||||
func TestSearchIndex_Search_Good_PhraseHighlighting(t *testing.T) {
|
||||
idx := newSearchIndex()
|
||||
|
||||
idx.Add(&Topic{
|
||||
ID: "phrase-highlight",
|
||||
Title: "API Guide",
|
||||
Content: "Configure rate limit rules for the API gateway.",
|
||||
})
|
||||
|
||||
results := idx.Search(`"rate limit" api`)
|
||||
assert.NotEmpty(t, results)
|
||||
|
||||
// The snippet should highlight both the phrase and keyword.
|
||||
if results[0].Snippet != "" {
|
||||
assert.Contains(t, results[0].Snippet, "**rate limit**",
|
||||
"phrase should be highlighted in snippet")
|
||||
}
|
||||
}
|
||||
|
|
|
|||
Loading…
Add table
Reference in a new issue