Delete page "Search-Engine.-"
parent fdcf0f2866
commit 46fd1262ab
1 changed files with 0 additions and 114 deletions
@ -1,114 +0,0 @@
# Search Engine

The help package includes a full-text search engine built on an inverted index with scoring, prefix matching, and snippet extraction.

## SearchResult

```go
type SearchResult struct {
	Topic   *Topic
	Section *Section // nil if topic-level match
	Score   float64
	Snippet string   // Context around match with highlighting
}
```

Results are sorted by score (highest first). Ties are broken alphabetically by topic title.
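
The ordering rule can be sketched with a stable sort. This is an illustration only: `sortResults` is a hypothetical helper, and `Topic` and `SearchResult` are cut down to the fields the comparison needs.

```go
package main

import (
	"fmt"
	"sort"
)

// Topic and SearchResult are pared-down stand-ins for the package's types.
type Topic struct{ Title string }

type SearchResult struct {
	Topic *Topic
	Score float64
}

// sortResults (hypothetical name) orders by score descending,
// then by topic title ascending to break ties.
func sortResults(results []SearchResult) {
	sort.SliceStable(results, func(i, j int) bool {
		if results[i].Score != results[j].Score {
			return results[i].Score > results[j].Score
		}
		return results[i].Topic.Title < results[j].Topic.Title
	})
}

func main() {
	results := []SearchResult{
		{Topic: &Topic{Title: "Logging"}, Score: 11.0},
		{Topic: &Topic{Title: "Configuration"}, Score: 11.0},
		{Topic: &Topic{Title: "Installation"}, Score: 12.5},
	}
	sortResults(results)
	for _, r := range results {
		fmt.Printf("%.1f %s\n", r.Score, r.Topic.Title)
	}
}
```

Here "Installation" comes first on score, and the two tied topics follow in alphabetical order.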

## How Search Works

```go
results := catalog.Search("installation guide")
```

1. **Tokenise** the query into lowercase words (min 2 characters).
2. **Look up** each word in the inverted index for exact matches (score +1.0 per match).
3. **Prefix match**: words in the index that start with a query word score +0.5 (e.g. query `"config"` matches indexed word `"configuration"`).
4. **Title boost**: if any query word appears in the topic title, add +10.0.
5. **Section title boost**: if the best-matching section title contains a query word, add +5.0.
6. **Extract snippet** from the best-matching section content.
7. **Sort** results by score descending.

## Tokenisation

```go
func tokenize(text string) []string
```

Splits text into lowercase words. Rules:

- Letters and digits are kept together as words.
- Punctuation and whitespace are word separators.
- Single-character words are discarded (minimum length is 2).

| Input | Tokens |
|-------|--------|
| `"Hello World"` | `["hello", "world"]` |
| `"v2.0 and config123"` | `["v2", "and", "config123"]` |
| `"a b c hello d"` | `["hello"]` |

## Inverted Index

The search index maps each unique word to a list of topic IDs:

```
"installation" -> ["getting-started", "installation-guide"]
"config" -> ["configuration"]
"cli" -> ["getting-started", "configuration"]
```

When a topic is added via `Catalog.Add`, the index is populated from:
- Topic title words
- Topic content words
- Section title words
- Section content words
- Tag words

Duplicate word-to-topic mappings are prevented.

## Scoring

| Match Type | Points |
|------------|--------|
| Exact word match in index | +1.0 per query word |
| Prefix match in index | +0.5 per query word |
| Query word in topic title | +10.0 (once) |
| Query word in best section title | +5.0 (once) |

Within `findBestMatch`, section title matches count double when selecting the best section.

## Snippet Extraction

```go
func extractSnippet(content string, regexes []*regexp.Regexp) string
```

Extracts up to 150 characters of context around the first match:

1. Finds the earliest match position across all query regexes.
2. Places a 150-character window around that position, starting 50 characters before the match.
3. Trims to word boundaries and adds `"..."` prefix/suffix when truncated.
4. Applies bold highlighting to all matches.

If no regexes are provided, returns the first non-empty, non-heading line of content (truncated to 150 characters).

### Highlighting

Matches are wrapped in `**bold**` markup. The highlighting engine:

- Uses case-insensitive regex matching.
- Sorts matches by position and merges overlapping/adjacent spans to avoid nested markers (e.g. `"**Searching**"` not `"**Search****ing**"`).
- Applies replacements back-to-front to avoid position shifts.

Examples:

| Content | Query | Highlighted |
|---------|-------|-------------|
| `"The configuration is complete."` | `"config"` | `"The **config**uration is complete."` |
| `"Searching for something."` | `"search", "searching"` | `"**Searching** for something."` |

## UTF-8 Safety

All snippet extraction operates on `[]rune` slices rather than byte offsets, ensuring multi-byte characters (e.g. Japanese, accented Latin) are never split mid-character. The byte position of a regex match is converted to a rune position before windowing.

See [[Topics-and-Catalog]] for topic and catalog API. Back to [[Home]].