diff --git a/Search-Engine.-.md b/Search-Engine.-.md
deleted file mode 100644
index 5fa6b9b..0000000
--- a/Search-Engine.-.md
+++ /dev/null
@@ -1,114 +0,0 @@
-# Search Engine
-
-The help package includes a full-text search engine built on an inverted index with scoring, prefix matching, and snippet extraction.
-
-## SearchResult
-
-```go
-type SearchResult struct {
-	Topic   *Topic
-	Section *Section // nil if topic-level match
-	Score   float64
-	Snippet string // Context around match with highlighting
-}
-```
-
-Results are sorted by score (highest first). Ties are broken alphabetically by topic title.
-
-## How Search Works
-
-```go
-results := catalog.Search("installation guide")
-```
-
-1. **Tokenise** the query into lowercase words (min 2 characters).
-2. **Look up** each word in the inverted index for exact matches (score +1.0 per match).
-3. **Prefix match**: words in the index that start with a query word score +0.5 (e.g. query `"config"` matches indexed word `"configuration"`).
-4. **Title boost**: if any query word appears in the topic title, add +10.0.
-5. **Section title boost**: if the best-matching section title contains a query word, add +5.0.
-6. **Extract snippet** from the best-matching section content.
-7. **Sort** results by score descending.
-
-## Tokenisation
-
-```go
-func tokenize(text string) []string
-```
-
-Splits text into lowercase words. Rules:
-
-- Letters and digits are kept together as words.
-- Punctuation and whitespace are word separators.
-- Single-character words are discarded (minimum length is 2).
-
-| Input | Tokens |
-|-------|--------|
-| `"Hello World"` | `["hello", "world"]` |
-| `"v2.0 and config123"` | `["v2", "and", "config123"]` |
-| `"a b c hello d"` | `["hello"]` |
-
-## Inverted Index
-
-The search index maps each unique word to a list of topic IDs:
-
-```
-"installation" -> ["getting-started", "installation-guide"]
-"config" -> ["configuration"]
-"cli" -> ["getting-started", "configuration"]
-```
-
-When a topic is added via `Catalog.Add`, the index is populated from:
-- Topic title words
-- Topic content words
-- Section title words
-- Section content words
-- Tag words
-
-Duplicate word-to-topic mappings are prevented.
-
-## Scoring
-
-| Match Type | Points |
-|------------|--------|
-| Exact word match in index | +1.0 per query word |
-| Prefix match in index | +0.5 per query word |
-| Query word in topic title | +10.0 (once) |
-| Query word in best section title | +5.0 (once) |
-
-Within `findBestMatch`, section title matches count double when selecting the best section.
-
-## Snippet Extraction
-
-```go
-func extractSnippet(content string, regexes []*regexp.Regexp) string
-```
-
-Extracts up to 150 characters of context around the first match:
-
-1. Finds the earliest match position across all query regexes.
-2. Centres a 150-character window around that position (50 chars before the match).
-3. Trims to word boundaries and adds `"..."` prefix/suffix when truncated.
-4. Applies bold highlighting to all matches.
-
-If no regexes are provided, returns the first non-empty, non-heading line of content (truncated to 150 characters).
-
-### Highlighting
-
-Matches are wrapped in `**bold**` markup. The highlighting engine:
-
-- Uses case-insensitive regex matching.
-- Sorts matches by position and merges overlapping/adjacent spans to avoid nested markers (e.g. `"**Searching**"` not `"**Search****ing**"`).
-- Applies replacements back-to-front to avoid position shifts.
-
-Examples:
-
-| Content | Query | Highlighted |
-|---------|-------|-------------|
-| `"The configuration is complete."` | `"config"` | `"The **config**uration is complete."` |
-| `"Searching for something."` | `"search", "searching"` | `"**Searching** for something."` |
-
-## UTF-8 Safety
-
-All snippet extraction operates on `[]rune` slices rather than byte offsets, ensuring multi-byte characters (e.g. Japanese, accented Latin) are never split mid-character. The byte position of a regex match is converted to a rune position before windowing.
-
-See [[Topics-and-Catalog]] for topic and catalog API. Back to [[Home]].
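The highlighting steps described in the removed page (case-insensitive regex matching, sorting and merging spans, back-to-front replacement) could be sketched in Go roughly as follows. This is a minimal sketch under stated assumptions, not the package's actual implementation: `highlight`, `compileQuery`, and the `span` type are hypothetical names, and building one regex per query word with `regexp.QuoteMeta` is an assumption about how the query regexes are produced.

```go
package main

import (
	"regexp"
	"sort"
)

// span is a half-open byte range [start, end) of one match in the content.
// Hypothetical type, introduced only for this sketch.
type span struct{ start, end int }

// compileQuery builds one case-insensitive regex per query word.
// Hypothetical helper; the original page does not show how regexes are built.
func compileQuery(words ...string) []*regexp.Regexp {
	regexes := make([]*regexp.Regexp, 0, len(words))
	for _, w := range words {
		regexes = append(regexes, regexp.MustCompile("(?i)"+regexp.QuoteMeta(w)))
	}
	return regexes
}

// highlight wraps every match of the given regexes in ** markers.
func highlight(content string, regexes []*regexp.Regexp) string {
	// Collect all non-empty match spans across all regexes.
	var spans []span
	for _, re := range regexes {
		for _, loc := range re.FindAllStringIndex(content, -1) {
			if loc[1] > loc[0] {
				spans = append(spans, span{loc[0], loc[1]})
			}
		}
	}
	if len(spans) == 0 {
		return content
	}

	// Sort by start position, then by end position.
	sort.Slice(spans, func(i, j int) bool {
		if spans[i].start != spans[j].start {
			return spans[i].start < spans[j].start
		}
		return spans[i].end < spans[j].end
	})

	// Merge overlapping or adjacent spans so markers never nest.
	merged := []span{spans[0]}
	for _, s := range spans[1:] {
		last := &merged[len(merged)-1]
		if s.start <= last.end {
			if s.end > last.end {
				last.end = s.end
			}
			continue
		}
		merged = append(merged, s)
	}

	// Insert markers from the last span to the first, so earlier
	// offsets stay valid while the string grows.
	for i := len(merged) - 1; i >= 0; i-- {
		s := merged[i]
		content = content[:s.start] + "**" + content[s.start:s.end] + "**" + content[s.end:]
	}
	return content
}
```

With these helpers, `highlight("Searching for something.", compileQuery("search", "searching"))` yields `**Searching** for something.`, matching the second example row: the two overlapping spans are merged into one before any markers are inserted. The sketch works on byte offsets, which is safe here because regex matches on valid UTF-8 begin and end on character boundaries.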