Delete page "Search-Engine.-"

Virgil 2026-02-19 16:57:06 +00:00
parent fdcf0f2866
commit 46fd1262ab

@@ -1,114 +0,0 @@
# Search Engine
The help package includes a full-text search engine built on an inverted index with scoring, prefix matching, and snippet extraction.
## SearchResult
```go
type SearchResult struct {
	Topic   *Topic
	Section *Section // nil if topic-level match
	Score   float64
	Snippet string // Context around match with highlighting
}
```
Results are sorted by score (highest first). Ties are broken alphabetically by topic title.
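
The ordering rule can be sketched with a stable sort. This is a minimal illustration rather than the package's own code: the `Topic` and `SearchResult` types are trimmed-down stand-ins, the `Title` field is assumed, and `sortResults` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"sort"
)

// Topic and SearchResult are simplified stand-ins for the package's types.
// The Title field is an assumption made for this sketch.
type Topic struct {
	Title string
}

type SearchResult struct {
	Topic *Topic
	Score float64
}

// sortResults (hypothetical name) orders results by score, highest first,
// and breaks ties alphabetically by topic title.
func sortResults(results []SearchResult) {
	sort.SliceStable(results, func(i, j int) bool {
		if results[i].Score != results[j].Score {
			return results[i].Score > results[j].Score
		}
		return results[i].Topic.Title < results[j].Topic.Title
	})
}

func main() {
	results := []SearchResult{
		{Topic: &Topic{Title: "Installation"}, Score: 11.0},
		{Topic: &Topic{Title: "Getting Started"}, Score: 11.0},
		{Topic: &Topic{Title: "Configuration"}, Score: 12.5},
	}
	sortResults(results)
	for _, r := range results {
		fmt.Printf("%.1f %s\n", r.Score, r.Topic.Title)
	}
}
```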
## How Search Works
```go
results := catalog.Search("installation guide")
```
1. **Tokenise** the query into lowercase words (min 2 characters).
2. **Look up** each word in the inverted index for exact matches (score +1.0 per match).
3. **Prefix match** — words in the index that start with a query word score +0.5 (e.g. query `"config"` matches indexed word `"configuration"`).
4. **Title boost** — if any query word appears in the topic title, add +10.0.
5. **Section title boost** — if the best-matching section title contains a query word, add +5.0.
6. **Extract snippet** from the best-matching section content.
7. **Sort** results by score descending.
## Tokenisation
```go
func tokenize(text string) []string
```
Splits text into lowercase words. Rules:
- Letters and digits are kept together as words.
- Punctuation and whitespace are word separators.
- Single-character words are discarded (minimum length is 2).
| Input | Tokens |
|-------|--------|
| `"Hello World"` | `["hello", "world"]` |
| `"v2.0 and config123"` | `["v2", "and", "config123"]` |
| `"a b c hello d"` | `["hello"]` |
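
One way to implement these rules is sketched below. It is not necessarily how the package's `tokenize` is written; it only reproduces the documented behaviour and assumes that "letters and digits" means Unicode letters and digits.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// tokenize splits text into lowercase words of at least two characters.
// Any rune that is not a letter or digit acts as a separator.
func tokenize(text string) []string {
	var tokens []string
	var word []rune
	flush := func() {
		if len(word) >= 2 { // discard single-character words
			tokens = append(tokens, string(word))
		}
		word = word[:0]
	}
	for _, r := range strings.ToLower(text) {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			word = append(word, r)
		} else {
			flush()
		}
	}
	flush() // emit the final word, if any
	return tokens
}

func main() {
	fmt.Printf("%q\n", tokenize("Hello World"))
	fmt.Printf("%q\n", tokenize("v2.0 and config123"))
	fmt.Printf("%q\n", tokenize("a b c hello d"))
}
```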
## Inverted Index
The search index maps each unique word to a list of topic IDs:
```
"installation" -> ["getting-started", "installation-guide"]
"config" -> ["configuration"]
"cli" -> ["getting-started", "configuration"]
```
When a topic is added via `Catalog.Add`, the index is populated from:
- Topic title words
- Topic content words
- Section title words
- Section content words
- Tag words
Duplicate word-to-topic mappings are prevented.
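
The structure can be pictured as a map from word to topic IDs. The sketch below is an assumption-laden illustration, not the `Catalog.Add` implementation: the `index` type, its `add` method, and the `contains` helper are hypothetical, and word splitting is approximated with `strings.FieldsFunc`.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// index (hypothetical type) maps a lowercase word to the IDs of the topics
// that contain it.
type index map[string][]string

// add indexes every word of the given texts (titles, content, tags) under
// topicID, skipping single-character words and duplicate mappings.
func (ix index) add(topicID string, texts ...string) {
	for _, text := range texts {
		words := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
			return !unicode.IsLetter(r) && !unicode.IsDigit(r)
		})
		for _, w := range words {
			if len([]rune(w)) < 2 || contains(ix[w], topicID) {
				continue
			}
			ix[w] = append(ix[w], topicID)
		}
	}
}

// contains reports whether id is already present in ids.
func contains(ids []string, id string) bool {
	for _, existing := range ids {
		if existing == id {
			return true
		}
	}
	return false
}

func main() {
	ix := index{}
	ix.add("getting-started", "Getting Started", "Installation and the CLI")
	ix.add("installation-guide", "Installation Guide", "Installation steps")
	fmt.Println(ix["installation"])
	fmt.Println(ix["cli"])
}
```

A linear scan is enough for the duplicate check here; an index with many topics per word might track membership in a set instead.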
## Scoring
| Match Type | Points |
|------------|--------|
| Exact word match in index | +1.0 per query word |
| Prefix match in index | +0.5 per query word |
| Query word in topic title | +10.0 (once) |
| Query word in best section title | +5.0 (once) |
Within `findBestMatch`, section title matches count double when selecting the best section.
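
As a worked example of how the points combine, the sketch below scores a single topic. It rests on several assumptions the page does not confirm: an exact match does not also earn the prefix score, a query word earns at most one prefix award, and title matching is a case-insensitive substring test. `scoreTopic` and `containsAny` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	exactScore   = 1.0
	prefixScore  = 0.5
	titleBoost   = 10.0
	sectionBoost = 5.0
)

// scoreTopic (hypothetical) applies the scoring table to one topic.
// indexedWords is the set of words indexed for that topic.
func scoreTopic(queryWords []string, indexedWords map[string]bool, title, bestSectionTitle string) float64 {
	var score float64
	for _, q := range queryWords {
		if indexedWords[q] {
			score += exactScore
			continue // assumption: exact matches skip the prefix score
		}
		for w := range indexedWords {
			if strings.HasPrefix(w, q) {
				score += prefixScore
				break // assumption: one prefix award per query word
			}
		}
	}
	if containsAny(strings.ToLower(title), queryWords) {
		score += titleBoost // applied once, however many words match
	}
	if containsAny(strings.ToLower(bestSectionTitle), queryWords) {
		score += sectionBoost
	}
	return score
}

// containsAny reports whether text contains any of the words.
func containsAny(text string, words []string) bool {
	for _, w := range words {
		if strings.Contains(text, w) {
			return true
		}
	}
	return false
}

func main() {
	indexed := map[string]bool{"configuration": true, "cli": true, "file": true}
	score := scoreTopic([]string{"config", "cli"}, indexed, "Configuration", "Config file")
	fmt.Println(score) // 1.0 (cli) + 0.5 (config prefix) + 10.0 + 5.0 = 16.5
}
```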
## Snippet Extraction
```go
func extractSnippet(content string, regexes []*regexp.Regexp) string
```
Extracts up to 150 characters of context around the first match:
1. Finds the earliest match position across all query regexes.
2. Positions a 150-character window around that position, starting 50 characters before the match.
3. Trims to word boundaries and adds `"..."` prefix/suffix when truncated.
4. Applies bold highlighting to all matches.
If no regexes are provided, returns the first non-empty, non-heading line of content (truncated to 150 characters).
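
The windowing steps might look roughly like the sketch below. It covers only locating the match, windowing, and trimming; highlighting and the no-regex fallback are left out. The names `snippetWindow` and `compileTerms` are hypothetical, word boundaries are assumed to be whitespace, and when nothing matches the sketch simply windows from the start of the content.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
	"unicode"
	"unicode/utf8"
)

const (
	snippetLen = 150 // maximum window size in characters
	leadLen    = 50  // characters of context kept before the match
)

// compileTerms (hypothetical helper) builds case-insensitive literal regexes.
func compileTerms(terms ...string) []*regexp.Regexp {
	var out []*regexp.Regexp
	for _, t := range terms {
		out = append(out, regexp.MustCompile("(?i)"+regexp.QuoteMeta(t)))
	}
	return out
}

// snippetWindow (hypothetical) returns up to snippetLen characters of
// context around the earliest match, trimmed to whitespace boundaries.
func snippetWindow(content string, regexes []*regexp.Regexp) string {
	first := -1 // byte offset of the earliest match
	for _, re := range regexes {
		if loc := re.FindStringIndex(content); loc != nil && (first == -1 || loc[0] < first) {
			first = loc[0]
		}
	}
	if first == -1 {
		first = 0 // assumption: no match means window from the start
	}
	runes := []rune(content)
	pos := utf8.RuneCountInString(content[:first]) // byte offset -> rune offset
	start := pos - leadLen
	if start < 0 {
		start = 0
	}
	end := start + snippetLen
	if end > len(runes) {
		end = len(runes)
	}
	// Move start forward to the beginning of a word.
	if start > 0 {
		for start < pos && !unicode.IsSpace(runes[start-1]) {
			start++
		}
	}
	// Move end back to the end of a word.
	if end < len(runes) {
		for end > pos && !unicode.IsSpace(runes[end]) {
			end--
		}
	}
	snippet := strings.TrimSpace(string(runes[start:end]))
	if start > 0 {
		snippet = "..." + snippet
	}
	if end < len(runes) {
		snippet += "..."
	}
	return snippet
}

func main() {
	content := strings.Repeat("word ", 40) + "needle " + strings.Repeat("tail ", 40)
	fmt.Println(snippetWindow(content, compileTerms("needle")))
}
```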
### Highlighting
Matches are wrapped in `**bold**` markup. The highlighting engine:
- Uses case-insensitive regex matching.
- Sorts matches by position and merges overlapping/adjacent spans to avoid nested markers (e.g. `"**Searching**"` not `"**Search****ing**"`).
- Applies replacements back-to-front to avoid position shifts.
Examples:
| Content | Query | Highlighted |
|---------|-------|-------------|
| `"The configuration is complete."` | `"config"` | `"The **config**uration is complete."` |
| `"Searching for something."` | `"search", "searching"` | `"**Searching** for something."` |
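
The merge-then-replace approach can be sketched as follows. This is an illustration of the technique described above, not the package's code; `highlight` and `compileTerms` are hypothetical names, and the sketch works on byte offsets, which is safe here because regex matches on valid UTF-8 always fall on character boundaries.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

// compileTerms (hypothetical helper) builds case-insensitive literal regexes.
func compileTerms(terms ...string) []*regexp.Regexp {
	var out []*regexp.Regexp
	for _, t := range terms {
		out = append(out, regexp.MustCompile("(?i)"+regexp.QuoteMeta(t)))
	}
	return out
}

// highlight (hypothetical) wraps every match in ** markers, merging
// overlapping or adjacent spans so markers never nest.
func highlight(text string, regexes []*regexp.Regexp) string {
	var spans [][2]int
	for _, re := range regexes {
		for _, loc := range re.FindAllStringIndex(text, -1) {
			spans = append(spans, [2]int{loc[0], loc[1]})
		}
	}
	if len(spans) == 0 {
		return text
	}
	sort.Slice(spans, func(i, j int) bool { return spans[i][0] < spans[j][0] })

	// Merge spans that overlap or touch the previous one.
	merged := [][2]int{spans[0]}
	for _, s := range spans[1:] {
		last := &merged[len(merged)-1]
		if s[0] <= last[1] {
			if s[1] > last[1] {
				last[1] = s[1]
			}
		} else {
			merged = append(merged, s)
		}
	}

	// Apply back-to-front so earlier offsets stay valid.
	for i := len(merged) - 1; i >= 0; i-- {
		s := merged[i]
		text = text[:s[0]] + "**" + text[s[0]:s[1]] + "**" + text[s[1]:]
	}
	return text
}

func main() {
	fmt.Println(highlight("The configuration is complete.", compileTerms("config")))
	fmt.Println(highlight("Searching for something.", compileTerms("search", "searching")))
}
```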
## UTF-8 Safety
All snippet extraction operates on `[]rune` slices rather than byte offsets, ensuring multi-byte characters (e.g. Japanese, accented Latin) are never split mid-character. The byte position of a regex match is converted to a rune position before windowing.
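
The byte-to-rune conversion is small enough to show in isolation. In this sketch `byteToRuneIndex` is a hypothetical helper name; the standard-library call doing the work is `utf8.RuneCountInString`.

```go
package main

import (
	"fmt"
	"regexp"
	"unicode/utf8"
)

// byteToRuneIndex (hypothetical) converts a byte offset into s to the
// number of characters that precede it.
func byteToRuneIndex(s string, byteOffset int) int {
	return utf8.RuneCountInString(s[:byteOffset])
}

func main() {
	s := "日本語の設定 configuration"
	loc := regexp.MustCompile("config").FindStringIndex(s)

	runeIdx := byteToRuneIndex(s, loc[0])
	fmt.Println(loc[0], runeIdx) // 19 7: six 3-byte characters plus a space

	// Slicing the []rune form at runeIdx never splits a character.
	fmt.Println(string([]rune(s)[runeIdx:]))
}
```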
See [[Topics-and-Catalog]] for topic and catalog API. Back to [[Home]].