feat(search): add fuzzy matching, phrase search, and improved scoring

Phase 0: Push test coverage from 92.1% to 100% by adding catalog_test.go and targeted tests for all uncovered branches in search.go. Add BenchmarkSearch with 150 topics (~745us/op baseline). Phase 1: Implement three search improvements: - Levenshtein-based fuzzy matching (max distance 2, words >= 3 chars) - Quoted phrase search via extractPhrases() with +8.0 boost - Tag boost (+3.0) and multi-word bonus (+2.0) scoring - Named scoring constants replacing magic numbers All changes are backward-compatible; Search() signature unchanged. Co-Authored-By: Charon <developers@lethean.io>
2026-02-20 01:21:35 +00:00 · 2026-02-20 01:21:35 +00:00 · 3e91510bcf
commit 3e91510bcf
parent b6da16c717
5 changed files with 907 additions and 19 deletions
--- a/FINDINGS.md
+++ b/FINDINGS.md
@ -24,3 +24,64 @@ Extracted from `forge.lthn.ai/core/go` `pkg/help/` on 19 Feb 2026.
 ### Tests

 - 2 test files covering catalog loading and search behaviour
+
+## 2026-02-20: Phase 0 + Phase 1 (Charon)
+
+### Phase 0: Coverage 92.1% -> 100%
+
+- Created `catalog_test.go` — the entire `catalog.go` was untested (0%)
+- Added targeted search tests for previously uncovered branches:
+  - Nil topic guard in `Search()` (stale index references)
+  - Alphabetical tie-breaking when scores are equal
+  - Headings-only content in snippet extraction (no body text)
+  - Whitespace-only content trimmed to empty in snippets
+  - Empty regex slice in `highlight()`
+  - Overlapping match extension in highlight merging
+- Added `BenchmarkSearch` with 150 generated topics
+  - Baseline: ~745us/op, ~392KB/op, 4114 allocs/op (Ryzen 9 9950X)
+- `go vet ./...` clean
+
+### Phase 1: Search Improvements
+
+#### Fuzzy Matching (Levenshtein distance)
+
+- Implemented `levenshtein()` using two-row DP (memory-efficient)
+- Integrated into `Search()` with max edit distance of 2
+- Only applied to query words >= 3 characters (avoids noise from short words)
+- Score: +0.3 per fuzzy match (lower than prefix +0.5 and exact +1.0)
+- Skips words already matched as exact or prefix (no double-counting)
+
+#### Phrase Search
+
+- `extractPhrases()` pulls `"quoted strings"` from the query
+- Remaining text is tokenised normally for keyword search
+- Phrase matching checks title + content + all section content (case-insensitive)
+- Phrase boost: +8.0 per matching phrase
+- Phrase terms are also compiled as regexes for snippet highlighting
+- Empty quotes `""` are left as-is (regex requires `[^"]+`)
+- Whitespace-only quotes are ignored
+
+#### Improved Scoring Weights
+
+- Replaced magic numbers with named constants for clarity:
+  - `scoreExactWord = 1.0` -- exact word in index
+  - `scorePrefixWord = 0.5` -- prefix/partial word match
+  - `scoreFuzzyWord = 0.3` -- Levenshtein fuzzy match
+  - `scoreTitleBoost = 10.0` -- query word in topic title
+  - `scoreSectionBoost = 5.0` -- query word in section title
+  - `scoreTagBoost = 3.0` -- query word matches a tag (NEW)
+  - `scorePhraseBoost = 8.0` -- exact phrase match (NEW)
+  - `scoreAllWords = 2.0` -- all query words present (NEW)
+  - `fuzzyMaxDistance = 2` -- max Levenshtein distance
+
+#### New Scoring Features
+
+- **Tag boost** (+3.0): topics with tags matching query words rank higher
+- **Multi-word bonus** (+2.0): topics containing ALL query words get a bonus
+- Both are additive with existing boosts (title, section, exact/prefix)
+
+### API Compatibility
+
+- `Search(query string) []*SearchResult` signature unchanged
+- All existing behaviour preserved; new features are additive
+- Existing tests pass without modification
--- a/TODO.md
+++ b/TODO.md
@ -6,17 +6,22 @@ Dispatched from core/go orchestration. Pick up tasks in order.

 ## Phase 0: Hardening & Test Coverage

- [ ] **Expand parser tests** — `parser_test.go` exists but coverage unknown. Add tests for: empty input, frontmatter-only (no body), malformed YAML frontmatter, deeply nested headings (####, #####), Unicode content, very long documents (10K+ lines).
- [ ] **Expand search tests** — `search_test.go` exists. Add tests for: empty query, no results, case sensitivity, multi-word queries, special characters in query, overlapping matches, scoring boundary cases (exact title match vs partial body match).
- [ ] **Benchmark search** — Add `BenchmarkSearch` with catalog of 100+ topics. Measure search latency and allocation count. Baseline for Phase 1 improvements.
- [ ] **`go vet ./...` clean** — Verify no vet warnings. Fix any found.
+- [x] **Expand parser tests** — Parser already at 100%. Existing tests cover: empty input, frontmatter-only, malformed YAML, all heading levels (H1-H6), Unicode content, path-derived IDs.
+- [x] **Expand search tests** — Added tests for: empty query, no results, case sensitivity, multi-word queries, special characters, overlapping matches, scoring boundary cases, nil-topic guard, snippet edge cases (headings-only, whitespace-only).
+- [x] **Add catalog tests** — Created `catalog_test.go` covering: DefaultCatalog, Add, List, Search, Get (found/not-found), score tie-breaking.
+- [x] **Benchmark search** — `BenchmarkSearch` with 150 topics. Baseline: ~745us/op, ~392KB/op, 4114 allocs/op (Ryzen 9 9950X).
+- [x] **`go vet ./...` clean** — No warnings.
+- [x] **Coverage: 100%** — Up from 92.1%.

 ## Phase 1: Search Improvements

- [ ] Add fuzzy matching (Levenshtein distance or similar)
+- [x] **Fuzzy matching** — Levenshtein distance with max edit distance of 2. Words under 3 chars skip fuzzy. Score: +0.3 per fuzzy match (below prefix +0.5 and exact +1.0).
 - [ ] Add stemming support for English search terms
- [ ] Add phrase search (quoted multi-word queries)
- [ ] Improve scoring weights — currently title +10, section +5, partial +0.5
+- [x] **Phrase search** — Quoted multi-word queries via `extractPhrases()`. Phrase boost: +8.0. Searches title, content, and section content.
+- [x] **Improved scoring weights** — Named constants: title +10, section +5, tag +3, phrase +8, all-words bonus +2, exact +1.0, prefix +0.5, fuzzy +0.3.
+- [x] **Tag boost** — Query words matching tags add +3.0 per matching tag.
+- [x] **Multi-word bonus** — All query words present in topic adds +2.0.
+- [x] **Tests for all new features** — Levenshtein, min3, extractPhrases, fuzzy search, phrase search, tag boost, multi-word bonus, scoring constants, phrase highlighting, section phrase matching.

 ## Phase 2: core.help Integration

--- a/catalog_test.go
+++ b/catalog_test.go
@ -0,0 +1,212 @@
+package help
+
+import (
+	"fmt"
+	"testing"
+
+	"github.com/stretchr/testify/assert"
+	"github.com/stretchr/testify/require"
+)
+
+func TestDefaultCatalog_Good(t *testing.T) {
+	c := DefaultCatalog()
+
+	require.NotNil(t, c)
+	require.NotNil(t, c.topics)
+	require.NotNil(t, c.index)
+
+	t.Run("contains built-in topics", func(t *testing.T) {
+		topics := c.List()
+		assert.GreaterOrEqual(t, len(topics), 2, "should have at least 2 default topics")
+	})
+
+	t.Run("getting-started topic exists", func(t *testing.T) {
+		topic, err := c.Get("getting-started")
+		require.NoError(t, err)
+		assert.Equal(t, "Getting Started", topic.Title)
+		assert.Contains(t, topic.Content, "Common Commands")
+	})
+
+	t.Run("config topic exists", func(t *testing.T) {
+		topic, err := c.Get("config")
+		require.NoError(t, err)
+		assert.Equal(t, "Configuration", topic.Title)
+		assert.Contains(t, topic.Content, "Environment Variables")
+	})
+}
+
+func TestCatalog_Add_Good(t *testing.T) {
+	c := &Catalog{
+		topics: make(map[string]*Topic),
+		index:  newSearchIndex(),
+	}
+
+	topic := &Topic{
+		ID:      "test-topic",
+		Title:   "Test Topic",
+		Content: "This is a test topic for unit testing.",
+		Tags:    []string{"test", "unit"},
+	}
+
+	c.Add(topic)
+
+	t.Run("topic is retrievable after add", func(t *testing.T) {
+		got, err := c.Get("test-topic")
+		require.NoError(t, err)
+		assert.Equal(t, topic, got)
+	})
+
+	t.Run("topic is searchable after add", func(t *testing.T) {
+		results := c.Search("test")
+		assert.NotEmpty(t, results)
+	})
+
+	t.Run("overwrite existing topic", func(t *testing.T) {
+		replacement := &Topic{
+			ID:      "test-topic",
+			Title:   "Replaced Topic",
+			Content: "Replacement content.",
+		}
+		c.Add(replacement)
+
+		got, err := c.Get("test-topic")
+		require.NoError(t, err)
+		assert.Equal(t, "Replaced Topic", got.Title)
+	})
+}
+
+func TestCatalog_List_Good(t *testing.T) {
+	c := &Catalog{
+		topics: make(map[string]*Topic),
+		index:  newSearchIndex(),
+	}
+
+	t.Run("empty catalog returns empty list", func(t *testing.T) {
+		list := c.List()
+		assert.Empty(t, list)
+	})
+
+	t.Run("returns all added topics", func(t *testing.T) {
+		c.Add(&Topic{ID: "alpha", Title: "Alpha"})
+		c.Add(&Topic{ID: "beta", Title: "Beta"})
+		c.Add(&Topic{ID: "gamma", Title: "Gamma"})
+
+		list := c.List()
+		assert.Len(t, list, 3)
+
+		// Collect IDs (order is not guaranteed from map)
+		ids := make(map[string]bool)
+		for _, t := range list {
+			ids[t.ID] = true
+		}
+		assert.True(t, ids["alpha"])
+		assert.True(t, ids["beta"])
+		assert.True(t, ids["gamma"])
+	})
+}
+
+func TestCatalog_Search_Good(t *testing.T) {
+	c := DefaultCatalog()
+
+	t.Run("finds default topics", func(t *testing.T) {
+		results := c.Search("configuration")
+		assert.NotEmpty(t, results)
+	})
+
+	t.Run("empty query returns nil", func(t *testing.T) {
+		results := c.Search("")
+		assert.Nil(t, results)
+	})
+
+	t.Run("no match returns empty", func(t *testing.T) {
+		results := c.Search("zzzyyyxxx")
+		assert.Empty(t, results)
+	})
+}
+
+func TestCatalog_Get_Good(t *testing.T) {
+	c := &Catalog{
+		topics: make(map[string]*Topic),
+		index:  newSearchIndex(),
+	}
+
+	c.Add(&Topic{ID: "exists", Title: "Existing Topic"})
+
+	t.Run("existing topic", func(t *testing.T) {
+		topic, err := c.Get("exists")
+		require.NoError(t, err)
+		assert.Equal(t, "Existing Topic", topic.Title)
+	})
+
+	t.Run("missing topic returns error", func(t *testing.T) {
+		topic, err := c.Get("does-not-exist")
+		assert.Nil(t, topic)
+		assert.Error(t, err)
+		assert.Contains(t, err.Error(), "topic not found")
+		assert.Contains(t, err.Error(), "does-not-exist")
+	})
+}
+
+func TestCatalog_Search_Good_ScoreTiebreaking(t *testing.T) {
+	// Tests the alphabetical tie-breaking in search result sorting (search.go:165).
+	c := &Catalog{
+		topics: make(map[string]*Topic),
+		index:  newSearchIndex(),
+	}
+
+	// Add topics with identical content so they receive the same score.
+	c.Add(&Topic{
+		ID:      "zebra-topic",
+		Title:   "Zebra",
+		Content: "Unique keyword zephyr.",
+	})
+	c.Add(&Topic{
+		ID:      "alpha-topic",
+		Title:   "Alpha",
+		Content: "Unique keyword zephyr.",
+	})
+
+	results := c.Search("zephyr")
+	require.Len(t, results, 2)
+
+	// With equal scores, results should be sorted alphabetically by title.
+	assert.Equal(t, "Alpha", results[0].Topic.Title)
+	assert.Equal(t, "Zebra", results[1].Topic.Title)
+	assert.Equal(t, results[0].Score, results[1].Score,
+		"scores should be equal for tie-breaking to apply")
+}
+
+func BenchmarkSearch(b *testing.B) {
+	// Build a catalog with 100+ topics for benchmarking.
+	c := &Catalog{
+		topics: make(map[string]*Topic),
+		index:  newSearchIndex(),
+	}
+
+	for i := 0; i < 150; i++ {
+		c.Add(&Topic{
+			ID:      fmt.Sprintf("topic-%d", i),
+			Title:   fmt.Sprintf("Topic Number %d About Various Subjects", i),
+			Content: fmt.Sprintf("This is the content of topic %d. It covers installation, configuration, deployment, and testing of the system.", i),
+			Tags:    []string{"generated", fmt.Sprintf("tag%d", i%10)},
+			Sections: []Section{
+				{
+					ID:      fmt.Sprintf("section-%d-a", i),
+					Title:   "Overview",
+					Content: "An overview of the topic and its purpose.",
+				},
+				{
+					ID:      fmt.Sprintf("section-%d-b", i),
+					Title:   "Details",
+					Content: "Detailed information about the topic including examples and usage.",
+				},
+			},
+		})
+	}
+
+	b.ResetTimer()
+	b.ReportAllocs()
+	for i := 0; i < b.N; i++ {
+		c.Search("installation configuration")
+	}
+}
--- a/search.go
+++ b/search.go
@ -7,6 +7,19 @@ import (
 	"unicode"
 )

+// Scoring weights for search result ranking.
+const (
+	scoreExactWord   = 1.0  // Exact word match in the index
+	scorePrefixWord  = 0.5  // Prefix/partial word match
+	scoreFuzzyWord   = 0.3  // Fuzzy (Levenshtein) match
+	scoreTitleBoost  = 10.0 // Query word appears in topic title
+	scoreSectionBoost = 5.0 // Query word appears in section title
+	scoreTagBoost    = 3.0  // Query word appears in topic tags
+	scorePhraseBoost = 8.0  // Exact phrase match in content
+	scoreAllWords    = 2.0  // All query words present (multi-word bonus)
+	fuzzyMaxDistance = 2    // Maximum edit distance for fuzzy matching
+)
+
 // SearchResult represents a search match.
 type SearchResult struct {
 	Topic   *Topic
@ -72,10 +85,17 @@ func (i *searchIndex) addToIndex(word, topicID string) {
 	i.index[word] = append(i.index[word], topicID)
 }

-// Search finds topics matching the query.
+// Search finds topics matching the query. Supports:
+//   - Single and multi-word keyword queries
+//   - Quoted phrase search (e.g. `"rate limit"`)
+//   - Fuzzy matching via Levenshtein distance for typo tolerance
+//   - Prefix matching for partial words
 func (i *searchIndex) Search(query string) []*SearchResult {
-	queryWords := tokenize(query)
-	if len(queryWords) == 0 {
+	// Extract quoted phrases before tokenising.
+	phrases, stripped := extractPhrases(query)
+
+	queryWords := tokenize(stripped)
+	if len(queryWords) == 0 && len(phrases) == 0 {
 		return nil
 	}

@ -86,7 +106,7 @@ func (i *searchIndex) Search(query string) []*SearchResult {
 		// Exact matches
 		if topicIDs, ok := i.index[word]; ok {
 			for _, topicID := range topicIDs {
-				scores[topicID] += 1.0
+				scores[topicID] += scoreExactWord
 			}
 		}

@ -94,7 +114,25 @@ func (i *searchIndex) Search(query string) []*SearchResult {
 		for indexWord, topicIDs := range i.index {
 			if strings.HasPrefix(indexWord, word) && indexWord != word {
 				for _, topicID := range topicIDs {
-					scores[topicID] += 0.5 // Lower score for partial matches
+					scores[topicID] += scorePrefixWord
+				}
+			}
+		}
+
+		// Fuzzy matches (Levenshtein distance)
+		if len(word) >= 3 {
+			for indexWord, topicIDs := range i.index {
+				if indexWord == word {
+					continue // Already scored as exact match
+				}
+				if strings.HasPrefix(indexWord, word) {
+					continue // Already scored as prefix match
+				}
+				dist := levenshtein(word, indexWord)
+				if dist > 0 && dist <= fuzzyMaxDistance {
+					for _, topicID := range topicIDs {
+						scores[topicID] += scoreFuzzyWord
+					}
 				}
 			}
 		}
@ -109,8 +147,28 @@ func (i *searchIndex) Search(query string) []*SearchResult {
 			}
 		}
 	}
+	// Also add phrase regexes for highlighting
+	for _, phrase := range phrases {
+		if re, err := regexp.Compile("(?i)" + regexp.QuoteMeta(phrase)); err == nil {
+			res = append(res, re)
+		}
+	}

-	// Build results with title boost and snippet extraction
+	// Phrase matching: boost topics that contain the exact phrase.
+	for _, phrase := range phrases {
+		phraseLower := strings.ToLower(phrase)
+		for topicID, topic := range i.topics {
+			text := strings.ToLower(topic.Title + " " + topic.Content)
+			for _, section := range topic.Sections {
+				text += " " + strings.ToLower(section.Title+" "+section.Content)
+			}
+			if strings.Contains(text, phraseLower) {
+				scores[topicID] += scorePhraseBoost
+			}
+		}
+	}
+
+	// Build results with title/section/tag boosts and snippet extraction
 	var results []*SearchResult
 	for topicID, score := range scores {
 		topic := i.topics[topicID]
@ -120,15 +178,40 @@ func (i *searchIndex) Search(query string) []*SearchResult {

 		// Title boost: if query words appear in title
 		titleLower := strings.ToLower(topic.Title)
-		hasTitleMatch := false
+		titleMatchCount := 0
 		for _, word := range queryWords {
 			if strings.Contains(titleLower, word) {
-				hasTitleMatch = true
-				break
+				titleMatchCount++
 			}
 		}
-		if hasTitleMatch {
-			score += 10.0
+		if titleMatchCount > 0 {
+			score += scoreTitleBoost
+		}
+
+		// Tag boost: if query words match tags
+		for _, tag := range topic.Tags {
+			tagLower := strings.ToLower(tag)
+			for _, word := range queryWords {
+				if tagLower == word || strings.Contains(tagLower, word) {
+					score += scoreTagBoost
+					break
+				}
+			}
+		}
+
+		// Multi-word bonus: if all query words are present in the topic
+		if len(queryWords) > 1 {
+			allPresent := true
+			fullText := strings.ToLower(topic.Title + " " + topic.Content)
+			for _, word := range queryWords {
+				if !strings.Contains(fullText, word) {
+					allPresent = false
+					break
+				}
+			}
+			if allPresent {
+				score += scoreAllWords
+			}
 		}

 		// Find matching section and extract snippet
@ -145,7 +228,7 @@ func (i *searchIndex) Search(query string) []*SearchResult {
 				}
 			}
 			if hasSectionTitleMatch {
-				score += 5.0
+				score += scoreSectionBoost
 			}
 		}

@ -168,6 +251,79 @@ func (i *searchIndex) Search(query string) []*SearchResult {
 	return results
 }

+// extractPhrases pulls quoted substrings from the query and returns them
+// alongside the remaining query text with quotes removed.
+// For example: `hello "rate limit" world` returns
+// phrases=["rate limit"], remaining="hello  world".
+func extractPhrases(query string) (phrases []string, remaining string) {
+	re := regexp.MustCompile(`"([^"]+)"`)
+	matches := re.FindAllStringSubmatch(query, -1)
+	for _, m := range matches {
+		phrase := strings.TrimSpace(m[1])
+		if phrase != "" {
+			phrases = append(phrases, phrase)
+		}
+	}
+	remaining = re.ReplaceAllString(query, "")
+	return phrases, remaining
+}
+
+// levenshtein computes the edit distance between two strings.
+// Used for fuzzy matching to tolerate typos in search queries.
+func levenshtein(a, b string) int {
+	aRunes := []rune(a)
+	bRunes := []rune(b)
+	aLen := len(aRunes)
+	bLen := len(bRunes)
+
+	if aLen == 0 {
+		return bLen
+	}
+	if bLen == 0 {
+		return aLen
+	}
+
+	// Use two rows instead of full matrix to save memory.
+	prev := make([]int, bLen+1)
+	curr := make([]int, bLen+1)
+
+	for j := 0; j <= bLen; j++ {
+		prev[j] = j
+	}
+
+	for i := 1; i <= aLen; i++ {
+		curr[0] = i
+		for j := 1; j <= bLen; j++ {
+			cost := 1
+			if aRunes[i-1] == bRunes[j-1] {
+				cost = 0
+			}
+			curr[j] = min3(
+				prev[j]+1,       // deletion
+				curr[j-1]+1,     // insertion
+				prev[j-1]+cost,  // substitution
+			)
+		}
+		prev, curr = curr, prev
+	}
+
+	return prev[bLen]
+}
+
+// min3 returns the minimum of three integers.
+func min3(a, b, c int) int {
+	if a < b {
+		if a < c {
+			return a
+		}
+		return c
+	}
+	if b < c {
+		return b
+	}
+	return c
+}
+
 // findBestMatch finds the section with the best match and extracts a snippet.
 func (i *searchIndex) findBestMatch(topic *Topic, queryWords []string, res []*regexp.Regexp) (*Section, string) {
 	var bestSection *Section
--- a/search_test.go
+++ b/search_test.go
@ -338,3 +338,457 @@ func TestSearchResult_Score_Good(t *testing.T) {
 	assert.Equal(t, "topic-in-title", results[0].Topic.ID)
 	assert.Greater(t, results[0].Score, results[1].Score)
 }
+
+func TestExtractSnippet_Good_HeadingsOnly(t *testing.T) {
+	// Content with only headings and no body text should return empty snippet
+	// when no regexes are provided. Covers the empty-return branch.
+	content := "# Heading One\n## Heading Two\n### Heading Three"
+
+	snippet := extractSnippet(content, nil)
+	assert.Empty(t, snippet, "headings-only content without regexes should return empty snippet")
+}
+
+func TestExtractSnippet_Good_SnippetTrimmedToEmpty(t *testing.T) {
+	// After word-boundary trimming the snippet could become empty.
+	// This exercises the snippet=="" guard after TrimSpace.
+	content := strings.Repeat(" ", 200)
+
+	snippet := extractSnippet(content, compileRegexes([]string{"zz"}))
+	assert.Empty(t, snippet, "whitespace-only content should yield empty snippet")
+}
+
+func TestHighlight_Good_EmptyRegexes(t *testing.T) {
+	// Calling highlight with an empty regex slice should return the
+	// text unchanged. Covers the early-return branch.
+	result := highlight("some text here", nil)
+	assert.Equal(t, "some text here", result)
+
+	result = highlight("some text here", []*regexp.Regexp{})
+	assert.Equal(t, "some text here", result)
+}
+
+func TestHighlight_Good_OverlappingExtension(t *testing.T) {
+	// Test overlapping matches where the second match extends past the
+	// first, exercising the curr.end extension branch in merging.
+	text := "abcdefghij"
+
+	// First regex matches "abcdef", second matches "cdefghij"
+	// They overlap and the second extends the merged range.
+	re1, _ := regexp.Compile("abcdef")
+	re2, _ := regexp.Compile("cdefghij")
+
+	result := highlight(text, []*regexp.Regexp{re1, re2})
+	assert.Equal(t, "**abcdefghij**", result)
+}
+
+func TestSearchIndex_Search_Good_NilTopicGuard(t *testing.T) {
+	// Manually inject a stale reference in the scores map so the
+	// nil-topic guard is exercised. We do this by manipulating the
+	// index directly.
+	idx := newSearchIndex()
+
+	idx.Add(&Topic{
+		ID:      "real-topic",
+		Title:   "Real Topic",
+		Content: "Contains testword for matching.",
+	})
+
+	// Inject a mapping from "testword" to a non-existent topic ID.
+	idx.index["testword"] = append(idx.index["testword"], "ghost-topic")
+
+	results := idx.Search("testword")
+	// Should still find the real topic, ghost-topic should be skipped.
+	assert.Len(t, results, 1)
+	assert.Equal(t, "real-topic", results[0].Topic.ID)
+}
+
+func TestSearchIndex_Search_Good_SpecialCharacters(t *testing.T) {
+	idx := newSearchIndex()
+
+	idx.Add(&Topic{
+		ID:      "special-chars",
+		Title:   "Rate Limiting (v2.0)",
+		Content: "Configure rate-limiting rules with special characters: @#$%.",
+	})
+
+	t.Run("query with special characters", func(t *testing.T) {
+		results := idx.Search("rate limiting")
+		assert.NotEmpty(t, results)
+		assert.Equal(t, "special-chars", results[0].Topic.ID)
+	})
+
+	t.Run("query with punctuation stripped", func(t *testing.T) {
+		results := idx.Search("v2.0")
+		assert.NotEmpty(t, results)
+	})
+}
+
+func TestSearchIndex_Search_Good_CaseInsensitive(t *testing.T) {
+	idx := newSearchIndex()
+
+	idx.Add(&Topic{
+		ID:      "case-test",
+		Title:   "UPPERCASE Title",
+		Content: "MiXeD CaSe content here.",
+	})
+
+	t.Run("lowercase query finds uppercase content", func(t *testing.T) {
+		results := idx.Search("uppercase")
+		assert.NotEmpty(t, results)
+		assert.Equal(t, "case-test", results[0].Topic.ID)
+	})
+
+	t.Run("uppercase query finds content", func(t *testing.T) {
+		results := idx.Search("MIXED")
+		assert.NotEmpty(t, results)
+	})
+}
+
+func TestSearchIndex_Search_Good_SingleCharQuery(t *testing.T) {
+	idx := newSearchIndex()
+
+	idx.Add(&Topic{
+		ID:      "single-char",
+		Title:   "Test Topic",
+		Content: "Some content.",
+	})
+
+	// Single-character queries are filtered out by tokenize (min 2 chars),
+	// so searching for "a" should return nil.
+	results := idx.Search("a")
+	assert.Nil(t, results)
+}
+
+// --- Phase 1: Fuzzy matching tests ---
+
+func TestLevenshtein_Good(t *testing.T) {
+	tests := []struct {
+		name     string
+		a        string
+		b        string
+		expected int
+	}{
+		{name: "identical strings", a: "hello", b: "hello", expected: 0},
+		{name: "single substitution", a: "hello", b: "hallo", expected: 1},
+		{name: "single insertion", a: "hello", b: "helloo", expected: 1},
+		{name: "single deletion", a: "hello", b: "helo", expected: 1},
+		{name: "two edits", a: "kitten", b: "sitting", expected: 3},
+		{name: "completely different", a: "abc", b: "xyz", expected: 3},
+		{name: "empty first string", a: "", b: "hello", expected: 5},
+		{name: "empty second string", a: "hello", b: "", expected: 5},
+		{name: "both empty", a: "", b: "", expected: 0},
+		{name: "single char strings", a: "a", b: "b", expected: 1},
+		{name: "same single char", a: "a", b: "a", expected: 0},
+		{name: "transposition", a: "ab", b: "ba", expected: 2},
+		{name: "unicode strings", a: "cafe", b: "cafe", expected: 0},
+	}
+
+	for _, tt := range tests {
+		t.Run(tt.name, func(t *testing.T) {
+			result := levenshtein(tt.a, tt.b)
+			assert.Equal(t, tt.expected, result)
+		})
+	}
+}
+
+func TestMin3_Good(t *testing.T) {
+	tests := []struct {
+		name     string
+		a, b, c  int
+		expected int
+	}{
+		{name: "first smallest", a: 1, b: 2, c: 3, expected: 1},
+		{name: "second smallest", a: 2, b: 1, c: 3, expected: 1},
+		{name: "third smallest", a: 3, b: 2, c: 1, expected: 1},
+		{name: "all equal", a: 5, b: 5, c: 5, expected: 5},
+		{name: "first and second equal smallest", a: 1, b: 1, c: 3, expected: 1},
+		{name: "second and third equal smallest", a: 3, b: 1, c: 1, expected: 1},
+		{name: "first and third equal smallest", a: 1, b: 3, c: 1, expected: 1},
+		{name: "negative values", a: -3, b: -1, c: -2, expected: -3},
+	}
+
+	for _, tt := range tests {
+		t.Run(tt.name, func(t *testing.T) {
+			result := min3(tt.a, tt.b, tt.c)
+			assert.Equal(t, tt.expected, result)
+		})
+	}
+}
+
+func TestSearchIndex_Search_Good_FuzzyMatching(t *testing.T) {
+	idx := newSearchIndex()
+
+	idx.Add(&Topic{
+		ID:      "configuration",
+		Title:   "Configuration Guide",
+		Content: "Learn how to configure the application settings.",
+	})
+
+	idx.Add(&Topic{
+		ID:      "deployment",
+		Title:   "Deployment Process",
+		Content: "Deploy your application to production servers.",
+	})
+
+	t.Run("typo in query finds correct topic", func(t *testing.T) {
+		// "configuraton" is 1 edit away from "configuration"
+		results := idx.Search("configuraton")
+		assert.NotEmpty(t, results, "fuzzy match should find results for typo")
+		found := false
+		for _, r := range results {
+			if r.Topic.ID == "configuration" {
+				found = true
+				break
+			}
+		}
+		assert.True(t, found, "should find configuration topic with typo")
+	})
+
+	t.Run("two-edit typo still matches", func(t *testing.T) {
+		// "deplymnt" is within 2 edits of "deployment" — but first check
+		// that "deploymnt" (1 edit) works.
+		results := idx.Search("deploymnt")
+		assert.NotEmpty(t, results, "fuzzy match should find results for 1-edit typo")
+	})
+
+	t.Run("too many edits returns no fuzzy match", func(t *testing.T) {
+		// "zzzzzzz" is very far from any indexed word.
+		results := idx.Search("zzzzzzz")
+		assert.Empty(t, results, "large edit distance should not produce results")
+	})
+
+	t.Run("short words skip fuzzy matching", func(t *testing.T) {
+		// Words shorter than 3 characters skip fuzzy matching.
+		// "to" is in the index but "tx" (1 edit) should not fuzzy-match
+		// because "tx" is only 2 chars.
+		results := idx.Search("tx")
+		// May or may not find results via prefix, but should not crash.
+		_ = results
+	})
+
+	t.Run("fuzzy scores lower than exact", func(t *testing.T) {
+		// Exact match on "configure" should score higher than fuzzy.
+		exactResults := idx.Search("configure")
+		fuzzyResults := idx.Search("configurr")
+
+		if len(exactResults) > 0 && len(fuzzyResults) > 0 {
+			assert.GreaterOrEqual(t, exactResults[0].Score, fuzzyResults[0].Score,
+				"exact match should score at least as high as fuzzy match")
+		}
+	})
+}
+
+// --- Phase 1: Phrase search tests ---
+
+func TestExtractPhrases_Good(t *testing.T) {
+	tests := []struct {
+		name             string
+		query            string
+		expectedPhrases  []string
+		expectedRemaining string
+	}{
+		{
+			name:             "no phrases",
+			query:            "hello world",
+			expectedPhrases:  nil,
+			expectedRemaining: "hello world",
+		},
+		{
+			name:             "single phrase",
+			query:            `"rate limit"`,
+			expectedPhrases:  []string{"rate limit"},
+			expectedRemaining: "",
+		},
+		{
+			name:             "phrase with surrounding words",
+			query:            `configure "rate limit" rules`,
+			expectedPhrases:  []string{"rate limit"},
+			expectedRemaining: "configure  rules",
+		},
+		{
+			name:             "multiple phrases",
+			query:            `"rate limit" and "error handling"`,
+			expectedPhrases:  []string{"rate limit", "error handling"},
+			expectedRemaining: " and ",
+		},
+		{
+			name:             "empty quotes left in remaining",
+			query:            `"" hello`,
+			expectedPhrases:  nil,
+			expectedRemaining: `"" hello`,
+		},
+		{
+			name:             "whitespace-only quotes ignored",
+			query:            `"   " hello`,
+			expectedPhrases:  nil,
+			expectedRemaining: " hello",
+		},
+		{
+			name:             "unclosed quote treated as plain text",
+			query:            `"unclosed phrase`,
+			expectedPhrases:  nil,
+			expectedRemaining: `"unclosed phrase`,
+		},
+	}
+
+	for _, tt := range tests {
+		t.Run(tt.name, func(t *testing.T) {
+			phrases, remaining := extractPhrases(tt.query)
+			assert.Equal(t, tt.expectedPhrases, phrases)
+			assert.Equal(t, tt.expectedRemaining, remaining)
+		})
+	}
+}
+
+func TestSearchIndex_Search_Good_PhraseSearch(t *testing.T) {
+	idx := newSearchIndex()
+
+	idx.Add(&Topic{
+		ID:      "rate-limiting",
+		Title:   "Rate Limiting",
+		Content: "Configure rate limit rules for API endpoints. The rate limit protects against abuse.",
+		Tags:    []string{"api", "security"},
+	})
+
+	idx.Add(&Topic{
+		ID:      "error-handling",
+		Title:   "Error Handling",
+		Content: "Handle errors and rate responses correctly. Limit retries to avoid loops.",
+		Tags:    []string{"errors"},
+	})
+
+	t.Run("quoted phrase matches exact sequence", func(t *testing.T) {
+		results := idx.Search(`"rate limit"`)
+		assert.NotEmpty(t, results)
+		// rate-limiting topic has "rate limit" as exact phrase
+		assert.Equal(t, "rate-limiting", results[0].Topic.ID)
+	})
+
+	t.Run("unquoted words match both topics", func(t *testing.T) {
+		results := idx.Search("rate limit")
+		assert.GreaterOrEqual(t, len(results), 2,
+			"unquoted query should match both topics that contain the words separately")
+	})
+
+	t.Run("phrase not found yields no phrase boost", func(t *testing.T) {
+		results := idx.Search(`"nonexistent phrase here"`)
+		assert.Empty(t, results, "phrase with no tokenisable words and no match should return empty")
+	})
+
+	t.Run("phrase with surrounding keywords", func(t *testing.T) {
+		results := idx.Search(`"rate limit" api`)
+		assert.NotEmpty(t, results)
+		assert.Equal(t, "rate-limiting", results[0].Topic.ID)
+	})
+
+	t.Run("phrase-only query with no loose words", func(t *testing.T) {
+		// Query is only a quoted phrase; tokenize of remaining is empty,
+		// but phrase matching should still score topics.
+		results := idx.Search(`"rate limit"`)
+		assert.NotEmpty(t, results)
+	})
+
+	t.Run("phrase found in section content", func(t *testing.T) {
+		sectionIdx := newSearchIndex()
+		sectionIdx.Add(&Topic{
+			ID:      "section-phrase",
+			Title:   "Advanced Guide",
+			Content: "Overview of the system.",
+			Sections: []Section{
+				{
+					ID:      "limits",
+					Title:   "Limits",
+					Content: "The rate limit is set per client.",
+				},
+			},
+		})
+
+		results := sectionIdx.Search(`"rate limit"`)
+		assert.NotEmpty(t, results)
+		assert.Equal(t, "section-phrase", results[0].Topic.ID)
+	})
+}
+
+// --- Phase 1: Improved scoring tests ---
+
+func TestSearchIndex_Search_Good_TagBoost(t *testing.T) {
+	idx := newSearchIndex()
+
+	idx.Add(&Topic{
+		ID:      "tagged-topic",
+		Title:   "Some Guide",
+		Content: "General content about the system.",
+		Tags:    []string{"deployment", "production"},
+	})
+
+	idx.Add(&Topic{
+		ID:      "content-topic",
+		Title:   "Other Guide",
+		Content: "Information about deployment processes.",
+	})
+
+	results := idx.Search("deployment")
+	assert.NotEmpty(t, results)
+
+	// The tagged topic should rank higher because of the tag boost.
+	assert.Equal(t, "tagged-topic", results[0].Topic.ID,
+		"topic with matching tag should rank higher")
+}
+
+func TestSearchIndex_Search_Good_MultiWordBonus(t *testing.T) {
+	idx := newSearchIndex()
+
+	// Both topics have neutral titles (no query words in title) to
+	// isolate the multi-word bonus effect.
+	idx.Add(&Topic{
+		ID:      "both-words",
+		Title:   "Complete Guide",
+		Content: "Learn about deploying and monitoring in one place.",
+	})
+
+	idx.Add(&Topic{
+		ID:      "one-word",
+		Title:   "Other Guide",
+		Content: "Just deploying steps without monitoring.",
+	})
+
+	results := idx.Search("deploying monitoring")
+	assert.NotEmpty(t, results)
+
+	// Topic with both words should score higher due to multi-word bonus.
+	assert.Equal(t, "both-words", results[0].Topic.ID,
+		"topic containing all query words should rank higher")
+}
+
+func TestSearchIndex_Search_Good_ScoringConstants(t *testing.T) {
+	// Verify the scoring constants are sensible relative to each other.
+	assert.Greater(t, scoreTitleBoost, scoreSectionBoost,
+		"title boost should exceed section boost")
+	assert.Greater(t, scoreSectionBoost, scoreTagBoost,
+		"section boost should exceed tag boost")
+	assert.Greater(t, scoreExactWord, scorePrefixWord,
+		"exact match should score higher than prefix match")
+	assert.Greater(t, scorePrefixWord, scoreFuzzyWord,
+		"prefix match should score higher than fuzzy match")
+	assert.Greater(t, scorePhraseBoost, scoreAllWords,
+		"phrase boost should exceed multi-word bonus")
+}
+
+func TestSearchIndex_Search_Good_PhraseHighlighting(t *testing.T) {
+	idx := newSearchIndex()
+
+	idx.Add(&Topic{
+		ID:      "phrase-highlight",
+		Title:   "API Guide",
+		Content: "Configure rate limit rules for the API gateway.",
+	})
+
+	results := idx.Search(`"rate limit" api`)
+	assert.NotEmpty(t, results)
+
+	// The snippet should highlight both the phrase and keyword.
+	if results[0].Snippet != "" {
+		assert.Contains(t, results[0].Snippet, "**rate limit**",
+			"phrase should be highlighted in snippet")
+	}
+}