Add "Search-Engine"
parent
81cefa9f31
commit
d0b53acff0
1 changed files with 118 additions and 0 deletions
118
Search-Engine.-.md
Normal file
@ -0,0 +1,118 @@
# Search Engine

go-help includes a full-text search engine built on an inverted index. It supports word tokenisation, prefix matching, title and section boosting, and snippet extraction with match highlighting.

## Architecture

The search index is an inverted map from words to topic IDs:

```go
type searchIndex struct {
	topics map[string]*Topic   // topicID -> Topic
	index  map[string][]string // word -> []topicID
}
```

When a topic is added via `Catalog.Add`, the index processes:

1. **Title words** (indexed for title-boost scoring)
2. **Content words** (full body text)
3. **Section titles and content** (each heading and its body)
4. **Tags** (all tag values)

## Tokenisation

The `tokenize` function splits text into lowercase words:

- Characters: letters and digits are kept
- Separators: everything else acts as a word boundary
- Minimum length: single-character words are discarded

```
"Getting Started with Go" -> ["getting", "started", "with", "go"]
"API Reference (v2)" -> ["api", "reference", "v2"]
```

This produces a flat list suitable for both indexing and query processing.

## Scoring Algorithm

When `Search(query)` is called, the query is tokenised and scored against the index:

### Base Score

For each query word:

- **Exact match**: +1.0 per topic found in the index for that word
- **Prefix match**: +0.5 per topic where an indexed word starts with the query word (but is not an exact match)
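
The two rules can be illustrated with a short sketch. The function name `baseScores` and its signature are hypothetical, and it reads the prefix rule as "+0.5 for every indexed word that has the query word as a proper prefix", which is one plausible interpretation rather than confirmed behaviour.

```go
package main

import (
	"fmt"
	"strings"
)

// baseScores accumulates a score per topic ID:
// +1.0 for each query word found verbatim in the index, and
// +0.5 for each indexed word that merely starts with the query word.
func baseScores(index map[string][]string, queryWords []string) map[string]float64 {
	scores := make(map[string]float64)
	for _, q := range queryWords {
		// Exact match: direct lookup in the inverted index.
		for _, id := range index[q] {
			scores[id] += 1.0
		}
		// Prefix match: scan the vocabulary, skipping the exact word.
		for word, ids := range index {
			if word != q && strings.HasPrefix(word, q) {
				for _, id := range ids {
					scores[id] += 0.5
				}
			}
		}
	}
	return scores
}

func main() {
	index := map[string][]string{
		"deploy":     {"a"},
		"deployment": {"a", "b"},
		"docker":     {"b"},
	}
	scores := baseScores(index, []string{"deploy"})
	fmt.Println(scores["a"], scores["b"]) // 1.5 0.5
}
```

The prefix pass walks the whole vocabulary for each query word, which is simple but linear in the number of indexed words; a sorted word list or trie would be the usual alternative for larger indexes.
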

### Boosts

After base scoring, additional boosts are applied:

| Condition | Boost |
|-----------|-------|
| Query word appears in topic title | +10.0 |
| Query word appears in matching section title | +5.0 |
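
As a rough sketch of how the table might translate into code, the function below adds each boost once per query word. That per-word reading, along with the names `boost`, `titleBoost`, and `sectionBoost`, is an assumption; the table does not say whether the boosts are applied per word or once per topic.

```go
package main

import "fmt"

const (
	titleBoost   = 10.0 // query word appears in topic title
	sectionBoost = 5.0  // query word appears in matching section title
)

// contains reports whether w is one of words.
func contains(words []string, w string) bool {
	for _, x := range words {
		if x == w {
			return true
		}
	}
	return false
}

// boost returns the extra score for one topic, given the tokenised
// query, topic title, and title of the matching section.
// Each boost is applied once per query word (an assumption).
func boost(queryWords, titleWords, sectionTitleWords []string) float64 {
	var b float64
	for _, q := range queryWords {
		if contains(titleWords, q) {
			b += titleBoost
		}
		if contains(sectionTitleWords, q) {
			b += sectionBoost
		}
	}
	return b
}

func main() {
	query := []string{"deploy", "docker"}
	fmt.Println(boost(query, []string{"deploy", "guide"}, []string{"docker", "setup"})) // 15
}
```
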

### Sorting

Results are sorted by descending score. Ties are broken alphabetically by topic title.
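
This ordering maps directly onto a two-key comparison. The sketch below uses a simplified `result` type (a hypothetical stand-in for `SearchResult`) and the standard library's `sort.SliceStable`; it assumes a plain byte-wise string comparison for the alphabetical tie-break.

```go
package main

import (
	"fmt"
	"sort"
)

// result is a simplified stand-in for SearchResult.
type result struct {
	Title string
	Score float64
}

// sortResults orders by descending score, then ascending title.
func sortResults(rs []result) {
	sort.SliceStable(rs, func(i, j int) bool {
		if rs[i].Score != rs[j].Score {
			return rs[i].Score > rs[j].Score
		}
		return rs[i].Title < rs[j].Title
	})
}

func main() {
	rs := []result{{"Zeta", 2}, {"Alpha", 2}, {"Beta", 11.5}}
	sortResults(rs)
	for _, r := range rs {
		fmt.Println(r.Title, r.Score)
	}
}
```
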

## Search Results

```go
type SearchResult struct {
	Topic   *Topic   // The matched topic
	Section *Section // Best matching section (nil if topic-level)
	Score   float64  // Relevance score
	Snippet string   // Context around match with highlighting
}
```

### Example

```go
results := catalog.Search("deploy docker")
for _, r := range results {
	section := ""
	if r.Section != nil {
		section = " > " + r.Section.Title
	}
	fmt.Printf("[%.1f] %s%s\n %s\n", r.Score, r.Topic.Title, section, r.Snippet)
}
```

## Snippet Extraction

The search engine extracts a ~150-character snippet around the first match in the best-matching section:

1. **Find** the position of the first regex match in the content
2. **Extract** a window of 150 runes centred on the match
3. **Trim** to word boundaries (adding `...` prefix/suffix as needed)
4. **Highlight** all matches by wrapping them in `**bold**` markers

If no regex matches are found, the snippet falls back to the first non-empty, non-heading line of the section content.

## Highlighting

The `highlight` function wraps matched text in `**` markers:

```
Input: "How to deploy with docker compose"
Query: "deploy docker"
Output: "How to **deploy** with **docker** compose"
```

Overlapping or adjacent matches are merged before highlighting to avoid nested markers.
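
The merge-then-wrap approach can be sketched like this. Only the name `highlight` comes from this page; the signature, the case-insensitive substring matching, and the sort-and-merge strategy are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// highlight wraps every occurrence of a query word in ** markers.
// Matches that overlap or touch are merged first so markers never nest.
func highlight(text string, queryWords []string) string {
	var spans [][2]int // byte ranges [start, end) of each match
	for _, w := range queryWords {
		re := regexp.MustCompile("(?i)" + regexp.QuoteMeta(w))
		for _, loc := range re.FindAllStringIndex(text, -1) {
			spans = append(spans, [2]int{loc[0], loc[1]})
		}
	}
	if len(spans) == 0 {
		return text
	}
	sort.Slice(spans, func(i, j int) bool { return spans[i][0] < spans[j][0] })

	// Merge spans whose start falls inside or right at the end of the previous one.
	merged := [][2]int{spans[0]}
	for _, s := range spans[1:] {
		last := &merged[len(merged)-1]
		if s[0] <= last[1] {
			if s[1] > last[1] {
				last[1] = s[1]
			}
			continue
		}
		merged = append(merged, s)
	}

	var b strings.Builder
	pos := 0
	for _, s := range merged {
		b.WriteString(text[pos:s[0]])
		b.WriteString("**" + text[s[0]:s[1]] + "**")
		pos = s[1]
	}
	b.WriteString(text[pos:])
	return b.String()
}

func main() {
	fmt.Println(highlight("How to deploy with docker compose", []string{"deploy", "docker"}))
	// How to **deploy** with **docker** compose
}
```

Without the merge step, adjacent matches such as `docker` and `file` inside `dockerfile` would produce `**docker****file**`, which Markdown renderers handle poorly.
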

## Best Match Selection

For each topic in the results, the engine selects the single best-matching section:

1. Score each section by counting query word matches in its title (weighted 2x) and content
2. The section with the highest combined score is selected
3. The snippet is extracted from that section's content

If no section matches, the snippet is extracted from the topic-level content.
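
A minimal sketch of this selection, under stated assumptions: `Section` is reduced to two fields, `bestSection` and `countMatches` are hypothetical names, matches are counted as case-insensitive substring occurrences, and ties go to the earlier section. A `nil` return corresponds to the topic-level fallback.

```go
package main

import (
	"fmt"
	"strings"
)

// Section is a pared-down stand-in for the real type.
type Section struct {
	Title   string
	Content string
}

// countMatches counts occurrences of each (non-empty) query word in text.
func countMatches(text string, queryWords []string) int {
	lower := strings.ToLower(text)
	n := 0
	for _, w := range queryWords {
		n += strings.Count(lower, w)
	}
	return n
}

// bestSection returns the section with the highest score, where title
// matches count double. It returns nil when no section matches at all.
func bestSection(sections []*Section, queryWords []string) *Section {
	var best *Section
	bestScore := 0
	for _, s := range sections {
		score := 2*countMatches(s.Title, queryWords) + countMatches(s.Content, queryWords)
		if score > bestScore {
			best, bestScore = s, score
		}
	}
	return best
}

func main() {
	sections := []*Section{
		{Title: "Intro", Content: "Welcome"},
		{Title: "Docker setup", Content: "Install docker first"},
		{Title: "Notes", Content: "docker docker"},
	}
	if s := bestSection(sections, []string{"docker"}); s != nil {
		fmt.Println(s.Title) // Docker setup
	}
}
```
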
See also: [[Home]] | [[Topics-and-Catalog]]