diff --git a/Search-Engine.-.md b/Search-Engine.-.md
new file mode 100644
index 0000000..a071be0
--- /dev/null
+++ b/Search-Engine.-.md
@@ -0,0 +1,118 @@
+# Search Engine
+
+go-help includes a full-text search engine built on an inverted index. It supports word tokenisation, prefix matching, title and section boosting, and snippet extraction with match highlighting.
+
+## Architecture
+
+The search index is an inverted map from words to topic IDs:
+
+```go
+type searchIndex struct {
+	topics map[string]*Topic   // topicID -> Topic
+	index  map[string][]string // word -> []topicID
+}
+```
+
+When a topic is added via `Catalog.Add`, the index processes:
+
+1. **Title words** (indexed for title-boost scoring)
+2. **Content words** (full body text)
+3. **Section titles and content** (each heading and its body)
+4. **Tags** (all tag values)
+
+## Tokenisation
+
+The `tokenize` function splits text into lowercase words:
+
+- Characters: letters and digits are kept
+- Separators: everything else acts as a word boundary
+- Minimum length: single-character words are discarded
+
+```
+"Getting Started with Go" -> ["getting", "started", "with", "go"]
+"API Reference (v2)" -> ["api", "reference", "v2"]
+```
+
+This produces a flat list suitable for both indexing and query processing.
+
+## Scoring Algorithm
+
+When `Search(query)` is called, the query is tokenised and scored against the index:
+
+### Base Score
+
+For each query word:
+- **Exact match**: +1.0 per topic found in the index for that word
+- **Prefix match**: +0.5 per topic where an indexed word starts with the query word (but is not an exact match)
+
+### Boosts
+
+After base scoring, additional boosts are applied:
+
+| Condition | Boost |
+|-----------|-------|
+| Query word appears in topic title | +10.0 |
+| Query word appears in matching section title | +5.0 |
+
+### Sorting
+
+Results are sorted by descending score. Ties are broken alphabetically by topic title.
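The tokenisation and scoring rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not go-help's actual implementation: the `score` function, the `scored` type, the `titles` map, and the hand-built index are all hypothetical. It assumes the prefix bonus is added once per matching indexed word and the title boost once per query word that equals a title token. The section-title boost is left out for brevity.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode"
)

// tokenize lowercases text, splits on anything that is not a letter or digit,
// and drops single-character words.
func tokenize(text string) []string {
	fields := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	words := make([]string, 0, len(fields))
	for _, f := range fields {
		if len([]rune(f)) > 1 {
			words = append(words, f)
		}
	}
	return words
}

// scored is a hypothetical result type used only in this sketch.
type scored struct {
	ID    string
	Title string
	Score float64
}

// score applies the documented weights: +1.0 for an exact match, +0.5 for a
// prefix match, and +10.0 when a query word appears in the topic title.
// Assumption: titles maps topicID -> title; the real engine reads *Topic.
func score(query string, index map[string][]string, titles map[string]string) []scored {
	queryWords := tokenize(query)
	totals := map[string]float64{}
	for _, q := range queryWords {
		for word, ids := range index {
			weight := 0.0
			if word == q {
				weight = 1.0
			} else if strings.HasPrefix(word, q) {
				weight = 0.5
			}
			if weight == 0 {
				continue
			}
			for _, id := range ids {
				totals[id] += weight
			}
		}
	}

	results := make([]scored, 0, len(totals))
	for id, total := range totals {
		titleWords := tokenize(titles[id])
		for _, q := range queryWords {
			for _, tw := range titleWords {
				if tw == q {
					total += 10.0 // title boost, at most once per query word
					break
				}
			}
		}
		results = append(results, scored{ID: id, Title: titles[id], Score: total})
	}

	// Descending score; ties broken alphabetically by title.
	sort.Slice(results, func(i, j int) bool {
		if results[i].Score != results[j].Score {
			return results[i].Score > results[j].Score
		}
		return results[i].Title < results[j].Title
	})
	return results
}

func main() {
	// Hypothetical index contents, built by hand for the example.
	index := map[string][]string{
		"deploy":     {"deploy-guide", "docker-basics"},
		"deployment": {"ci-pipeline"},
		"docker":     {"docker-basics"},
	}
	titles := map[string]string{
		"deploy-guide":  "Deploy Guide",
		"docker-basics": "Docker Basics",
		"ci-pipeline":   "CI Pipeline",
	}
	for _, r := range score("deploy docker", index, titles) {
		fmt.Printf("[%.1f] %s\n", r.Score, r.Title)
	}
}
```

With this sample data, "Docker Basics" ranks first (two exact matches plus a title boost), "Deploy Guide" second, and "CI Pipeline" last with only a prefix match on `deployment`.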
+
+## Search Results
+
+```go
+type SearchResult struct {
+	Topic   *Topic   // The matched topic
+	Section *Section // Best matching section (nil if topic-level)
+	Score   float64  // Relevance score
+	Snippet string   // Context around match with highlighting
+}
+```
+
+### Example
+
+```go
+results := catalog.Search("deploy docker")
+for _, r := range results {
+	section := ""
+	if r.Section != nil {
+		section = " > " + r.Section.Title
+	}
+	fmt.Printf("[%.1f] %s%s\n %s\n", r.Score, r.Topic.Title, section, r.Snippet)
+}
+```
+
+## Snippet Extraction
+
+The search engine extracts a ~150-character snippet around the first match in the best-matching section:
+
+1. **Find** the position of the first regex match in the content
+2. **Extract** a window of 150 runes centred on the match
+3. **Trim** to word boundaries (adding `...` prefix/suffix as needed)
+4. **Highlight** all matches by wrapping them in `**bold**` markers
+
+If no regex matches are found, the snippet falls back to the first non-empty, non-heading line of the section content.
+
+## Highlighting
+
+The `highlight` function wraps matched text in `**` markers:
+
+```
+Input:  "How to deploy with docker compose"
+Query:  "deploy docker"
+Output: "How to **deploy** with **docker** compose"
+```
+
+Overlapping or adjacent matches are merged before highlighting to avoid nested markers.
+
+## Best Match Selection
+
+For each topic in the results, the engine selects the single best-matching section:
+
+1. Score each section by counting query word matches in its title (weighted 2x) and content
+2. The section with the highest combined score is selected
+3. The snippet is extracted from that section's content
+
+If no section matches, the snippet is extracted from the topic-level content.
+
+See also: [[Home]] | [[Topics-and-Catalog]]
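As a supplement to the Highlighting section, the merge-then-wrap behaviour it describes might look like the sketch below. This is an illustration under assumptions rather than go-help's code: the signature `highlight(text string, words []string)` is hypothetical, and matching is assumed to be a case-insensitive substring search for each query word.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// highlight wraps every case-insensitive occurrence of a query word in "**",
// merging overlapping or adjacent match ranges first so markers never nest.
// Assumption: the signature and substring matching are illustrative only.
func highlight(text string, words []string) string {
	var spans [][]int
	for _, w := range words {
		if w == "" {
			continue
		}
		re := regexp.MustCompile("(?i)" + regexp.QuoteMeta(w))
		spans = append(spans, re.FindAllStringIndex(text, -1)...)
	}
	if len(spans) == 0 {
		return text
	}

	// Sort by start offset, then fold each span into the previous one
	// whenever it starts at or before the previous span's end.
	sort.Slice(spans, func(i, j int) bool { return spans[i][0] < spans[j][0] })
	merged := [][]int{spans[0]}
	for _, s := range spans[1:] {
		last := merged[len(merged)-1]
		if s[0] <= last[1] {
			if s[1] > last[1] {
				last[1] = s[1]
			}
			continue
		}
		merged = append(merged, s)
	}

	var b strings.Builder
	pos := 0
	for _, m := range merged {
		b.WriteString(text[pos:m[0]])
		b.WriteString("**")
		b.WriteString(text[m[0]:m[1]])
		b.WriteString("**")
		pos = m[1]
	}
	b.WriteString(text[pos:])
	return b.String()
}

func main() {
	fmt.Println(highlight("How to deploy with docker compose", []string{"deploy", "docker"}))
	// Adjacent matches ("docker" + "file") collapse into one marked span.
	fmt.Println(highlight("dockerfile", []string{"docker", "file"}))
}
```

Merging before wrapping is what keeps the output free of sequences like `**docker****file**`: the two adjacent ranges become a single range and are wrapped once.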