
# Project History — go-help

## Origin

Extracted from `forge.lthn.ai/core/go` `pkg/help/` on 19 February 2026 by Virgil.

Commit: `ad5e70937b30829aa63747e89e259aeb5cada575`
Message: `feat: extract go-help from core/go pkg/help`

The initial extraction included:
- `Topic`, `Section`, `Frontmatter` type definitions
- `Catalog` with `Add`, `List`, `Get`, `Search`
- `Parser` with frontmatter extraction, section parsing, and ID generation
- Basic search with title (+10), section (+5), and partial/body (+0.5) scoring
- Two test files

The only external dependency at extraction was `gopkg.in/yaml.v3`.

---

## Phase 0 — Hardening and Test Coverage

Completed: 20 February 2026 (Charon)
Commit: `23cef8592a825ffcc7e551079fa561092514d47c`
Message: `test: complete Phase 0 hardening — 124 tests, 94% coverage, 8 benchmarks`

Coverage rose from 92.1% to 94%, the figure recorded in the commit message. `catalog.go` had no test coverage at all at extraction.

Work completed:

- `catalog_test.go` created from scratch covering `DefaultCatalog`, `Add`, `List`, `Search`, `Get` (found and not-found), and score tie-breaking.
- Targeted search tests for previously uncovered branches: nil topic guard in `Search()` (stale index references after hypothetical removal), alphabetical tie-breaking when scores are equal, headings-only content in snippet extraction, whitespace-only content trimmed to empty in snippets, empty regex slice in `highlight()`, overlapping match extension in highlight merging.
- Parser tests expanded to cover: empty input, frontmatter-only documents, three malformed YAML variants, frontmatter not at start of file, deeply nested headings (H4–H6 with body text), Unicode (CJK, emoji, diacritics, mixed scripts), 10,000-line documents, empty sections, headings without the mandatory space after `#`, consecutive headings, and `GenerateID` edge cases.
- `search_bench_test.go` with eight benchmarks using `b.Loop()` (Go 1.25) and `b.ReportAllocs()`. Benchmarks cover: single word, multi-word, no results, partial match, 500-topic catalog, 1,000-topic catalog, `Add` indexing throughput, and `tokenize` throughput.
- Baseline benchmark (Ryzen 9 9950X): ~745 µs/op, ~392 KB/op, 4,114 allocs/op for a 150-topic corpus.
- `go vet ./...` clean.

---

## Phase 1 — Search Improvements

Completed: 20 February 2026 (Charon)
Commits:
- `3e91510bcf1c3553849a46440253bb572a809e9e` — `feat(search): add fuzzy matching, phrase search, and improved scoring`
- `fc758a832b89b98a19083fcfb7c3e3d4c31a314c` — `feat(search): add English stemmer for improved search recall`

### Fuzzy matching

`levenshtein(a, b string) int` implemented using two-row dynamic programming. Applied to query words of three or more characters. Maximum edit distance of 2. Fuzzy matches score at `scoreFuzzyWord = 0.3`, below prefix (`0.5`) and exact (`1.0`). Words already matched as exact or prefix are not double-counted.
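
The two-row approach can be illustrated as follows. This is a minimal sketch, not the package's actual code: `fuzzyMatch` is a hypothetical helper, and measuring the three-character minimum in runes and treating the distance bound as inclusive are both assumptions.

```go
package main

import "fmt"

// levenshtein returns the edit distance between a and b.
// Only two rows of the dynamic-programming table are held in memory.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // turning "" into rb[:j] takes j insertions
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i // turning ra[:i] into "" takes i deletions
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			best := prev[j] + 1 // deletion
			if v := curr[j-1] + 1; v < best {
				best = v // insertion
			}
			if v := prev[j-1] + cost; v < best {
				best = v // substitution or match
			}
			curr[j] = best
		}
		prev, curr = curr, prev // the row just filled becomes the previous row
	}
	return prev[len(rb)]
}

// fuzzyMatch is a hypothetical helper: it accepts query words of three or
// more runes whose distance to word is at most 2 (both bounds assumed).
func fuzzyMatch(query, word string) bool {
	if len([]rune(query)) < 3 {
		return false
	}
	return levenshtein(query, word) <= 2
}

func main() {
	fmt.Println(levenshtein("kitten", "sitting")) // 3
	fmt.Println(fuzzyMatch("serach", "search"))
	fmt.Println(fuzzyMatch("go", "do"))
}
```

Swapping the two slices after each row keeps memory proportional to the length of the second word rather than to the product of both lengths.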

### Phrase search

`extractPhrases(query)` extracts `"quoted strings"` from the query before tokenisation. Phrase matching scans the concatenated title, content, and all section content of every candidate topic. Matched phrases boost by `scorePhraseBoost = 8.0`. Phrase terms are also compiled as regexes for snippet highlighting. Empty quotes and whitespace-only quoted strings are ignored.
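
One way to picture the extraction step is sketched below. The return shape (phrases plus the remaining query), the lowercasing, and the `phrasePattern` helper are assumptions made for illustration; the source only states that quoted strings are pulled out before tokenisation and that empty or whitespace-only quotes are ignored.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// quoted matches a double-quoted span and captures its contents.
var quoted = regexp.MustCompile(`"([^"]*)"`)

// extractPhrases returns the non-empty quoted phrases in query, and the
// query with every quoted span removed (return shape is an assumption).
func extractPhrases(query string) (phrases []string, rest string) {
	for _, m := range quoted.FindAllStringSubmatch(query, -1) {
		if p := strings.TrimSpace(m[1]); p != "" {
			phrases = append(phrases, strings.ToLower(p))
		}
	}
	rest = strings.TrimSpace(quoted.ReplaceAllString(query, " "))
	return phrases, rest
}

// phrasePattern is a hypothetical helper that compiles a phrase into a
// case-insensitive literal regex suitable for snippet highlighting.
func phrasePattern(phrase string) *regexp.Regexp {
	return regexp.MustCompile("(?i)" + regexp.QuoteMeta(phrase))
}

func main() {
	phrases, rest := extractPhrases(`"rate limit" config ""`)
	fmt.Printf("%q %q\n", phrases, rest)
	fmt.Println(phrasePattern("rate limit").MatchString("Rate Limit settings"))
}
```

Escaping the phrase with `regexp.QuoteMeta` keeps characters such as `.` or `(` in a user query from being interpreted as regex syntax.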

### English stemmer

`stemmer.go` added. Lightweight Porter-style suffix stripping in two passes (inflectional then derivational). Words under four characters are unchanged; results are always at least two characters. The stemmer is intentionally not the full Porter algorithm — only the highest-impact rules for a help catalog are included.

`tokenize()` modified to emit both the raw word and its stem (if they differ). The search index therefore contains both forms. When searching, query words are also stemmed. Stem-only matches score at `scoreStemWord = 0.7`.
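
The dual-emission behaviour can be sketched as below. `toyStem` is a hypothetical stand-in with a handful of suffixes, not the rule set in `stemmer.go`; it only preserves the two constraints stated above (words under four characters unchanged, stems at least two characters). The word-splitting rule is also an assumption.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

// toyStem is a hypothetical, much-reduced stemmer used only for this sketch.
func toyStem(w string) string {
	if utf8.RuneCountInString(w) < 4 {
		return w // short words are left unchanged
	}
	for _, suf := range []string{"ing", "ed", "es", "s"} {
		stem := strings.TrimSuffix(w, suf)
		if stem != w && utf8.RuneCountInString(stem) >= 2 {
			return stem
		}
	}
	return w
}

// tokenize lowercases text, splits on non-letter/non-digit runes, and emits
// each word followed by its stem when the stem differs from the word.
func tokenize(text string) []string {
	words := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	var out []string
	for _, w := range words {
		out = append(out, w)
		if s := toyStem(w); s != w {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	fmt.Println(tokenize("Pushing repos"))
}
```

Because both forms are indexed, an exact match on the raw word can still be distinguished from a stem-only match and scored differently.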

### Improved scoring weights

All magic numbers replaced with named constants. New features added:

- `scoreTagBoost = 3.0` — topics with tags matching query words rank higher.
- `scoreAllWords = 2.0` — topics containing all query words receive a multi-word bonus.
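
As a rough illustration of how the named constants might combine, the sketch below computes the tag and all-words bonuses for one topic. The values come from this document; the names `scoreExactWord` and `scorePrefixWord`, the `bonus` function, case-insensitive tag equality, and restricting the all-words bonus to multi-word queries are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	scoreExactWord   = 1.0 // name assumed; value from the text
	scoreStemWord    = 0.7
	scorePrefixWord  = 0.5 // name assumed; value from the text
	scoreFuzzyWord   = 0.3
	scoreTagBoost    = 3.0
	scoreAllWords    = 2.0
	scorePhraseBoost = 8.0
)

// bonus is a hypothetical helper. matched records which query words were
// found in the topic by any of the word-level matching routes.
func bonus(queryWords, tags []string, matched map[string]bool) float64 {
	var total float64
	// One boost per tag that equals some query word (matching rule assumed).
	for _, tag := range tags {
		for _, w := range queryWords {
			if strings.EqualFold(tag, w) {
				total += scoreTagBoost
				break
			}
		}
	}
	// Multi-word bonus when every query word was found in the topic.
	if len(queryWords) > 1 {
		all := true
		for _, w := range queryWords {
			if !matched[w] {
				all = false
				break
			}
		}
		if all {
			total += scoreAllWords
		}
	}
	return total
}

func main() {
	words := []string{"deploy", "cli"}
	found := map[string]bool{"deploy": true, "cli": true}
	fmt.Println(bonus(words, []string{"cli", "deploy"}, found)) // 8
}
```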

### API compatibility

`Search(query string) []*SearchResult` signature unchanged. All existing tests passed without modification.

---

## Phase 2 — HTTP Server, Rendering, Static Site, and CLI Ingestion

Completed: 20 February 2026 (Charon)
Commit: `944cad006b4920c3406d170fe42d006533483d59`
Message: `feat(help): Phase 2 — HTTP server, rendering, static site generator, CLI ingestion`

### Markdown rendering (`render.go`)

`RenderMarkdown` added using `github.com/yuin/goldmark` with GFM, Typographer, and `html.WithUnsafe()`. Returns an HTML fragment.

### HTTP server (`server.go`)

Six-route HTTP server. Implements `http.Handler`. Routes registered at construction. HTML and JSON routes share the same catalog; JSON routes return raw `Topic` structs without pre-rendered HTML.

### Templates (`templates.go`, `templates/`)

Five embedded templates with dark theme CSS. Template functions: `renderMarkdown`, `truncate`, `pluralise`, `multiply`, `sub`. Topics grouped by first tag on the index page, sorted by Order then Title within each group.

### Static site generator (`generate.go`)

`Generate` writes five output artefacts. Client-side search uses inline JavaScript that loads `search-index.json` (content truncated to 500 runes per topic) and scores results with title (+10), content (+1), and tag (+3) weights. All DOM insertion uses safe methods (`textContent`, `createElement`) — no `innerHTML` with user-supplied strings.
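
The rune-based truncation matters for non-ASCII content, since slicing a Go string by bytes can cut a multi-byte character in half. A minimal sketch of building one index entry follows; the `indexEntry` type and its JSON field names are hypothetical, and only the 500-rune limit comes from the text above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

const maxIndexRunes = 500

// indexEntry is a hypothetical shape for one record in search-index.json.
type indexEntry struct {
	ID      string   `json:"id"`
	Title   string   `json:"title"`
	Tags    []string `json:"tags"`
	Content string   `json:"content"`
}

// truncateRunes returns at most n runes of s without splitting a character.
func truncateRunes(s string, n int) string {
	r := []rune(s)
	if len(r) <= n {
		return s
	}
	return string(r[:n])
}

func main() {
	e := indexEntry{
		ID:      "getting-started",
		Title:   "Getting Started",
		Tags:    []string{"intro"},
		Content: truncateRunes("Install the tool, then run it.", maxIndexRunes),
	}
	out, err := json.Marshal(e)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```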

### CLI ingestion (`ingest.go`)

`ParseHelpText` converts Go flag-style and Cobra-style help output to a `Topic`. `IngestCLIHelp` batch-ingests a map of command names to help texts.

---

## Known Limitations

### Search

- Phrase search does not apply title or section boosts; it uses a flat `+8.0` regardless of where the phrase appears.
- Stemmer covers English only. Non-English catalog content will not stem correctly but will still match via exact and prefix routes.
- The stemmer does not implement the full Porter algorithm. Some irregular forms (e.g. `ran` → `run`) are not handled.
- Tag boost is applied once per matching tag, and a single query word can match several tags on the same topic, so the tag score accumulates with no cap.
- The inverted index has no removal operation. Topics added via `Add` cannot be retracted without rebuilding the catalog.

### HTTP server

- No request timeout is configured in `ListenAndServe`. Callers embedding the server in a larger application should wrap it in an `http.Server` with appropriate `ReadTimeout` and `WriteTimeout` values.
- Template parsing occurs on every request (`parseTemplates` is called per handler invocation). For high-throughput deployments, templates should be pre-parsed and cached.
- No access logging is provided.
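
The timeout workaround can be sketched as below. `newHTTPServer` is a hypothetical helper and the durations are placeholder values rather than recommendations from this project; `http.NotFoundHandler()` stands in for the library's handler, which implements `http.Handler`.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// newHTTPServer wraps any http.Handler in an http.Server with explicit
// timeouts. The durations are illustrative placeholders.
func newHTTPServer(addr string, h http.Handler) *http.Server {
	return &http.Server{
		Addr:              addr,
		Handler:           h,
		ReadHeaderTimeout: 5 * time.Second,
		ReadTimeout:       10 * time.Second,
		WriteTimeout:      15 * time.Second,
		IdleTimeout:       60 * time.Second,
	}
}

func main() {
	// Stand-in handler; a real caller would pass the help server here.
	srv := newHTTPServer("127.0.0.1:8080", http.NotFoundHandler())
	fmt.Println(srv.ReadTimeout, srv.WriteTimeout)
	// To actually serve requests:
	// log.Fatal(srv.ListenAndServe())
}
```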

### Static site generator

- Client-side search is a simple substring match. It does not implement stemming, fuzzy matching, or phrase search — only the server-side search has those capabilities.
- The `search-index.json` content field is truncated to 500 runes per topic, which may omit relevant content in long topics.

### General

- The `Catalog` is not safe for concurrent writes. Multiple goroutines calling `Add` concurrently will cause a data race. Reads via `List`, `Get`, and `Search` are safe once the catalog is fully populated, provided no concurrent `Add` calls occur.

---

## Future Considerations

### Phase 3 — AI-assisted search

The outstanding phase from the original roadmap:

- Embed help topics into go-rag collections for semantic search.
- Add vector similarity as a fallback when keyword search returns no results.
- Support natural language queries such as "how do I push all repos?".

This requires a dependency on `forge.lthn.ai/core/go-rag` and an Ollama or compatible embedding backend.

### Concurrency safety

Add a `sync.RWMutex` to `Catalog` to allow safe concurrent reads alongside occasional writes. This is a straightforward change with no API surface impact.
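
A minimal sketch of the idea, using simplified stand-in types: the real `Topic` and `Catalog` have more fields, and the `(*Topic, bool)` return of `Get` and the `Len` method are assumptions made for this example.

```go
package main

import (
	"fmt"
	"sync"
)

// Topic and Catalog are simplified stand-ins for the package's types.
type Topic struct {
	ID    string
	Title string
}

type Catalog struct {
	mu     sync.RWMutex
	topics map[string]*Topic
}

func NewCatalog() *Catalog {
	return &Catalog{topics: make(map[string]*Topic)}
}

// Add takes the write lock, excluding all readers and other writers.
func (c *Catalog) Add(t *Topic) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.topics[t.ID] = t
}

// Get takes the read lock, so any number of readers can proceed together.
func (c *Catalog) Get(id string) (*Topic, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	t, ok := c.topics[id]
	return t, ok
}

// Len reports the number of topics under the read lock.
func (c *Catalog) Len() int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return len(c.topics)
}

func main() {
	c := NewCatalog()
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			id := fmt.Sprintf("topic-%d", n)
			c.Add(&Topic{ID: id, Title: id})
			c.Get(id)
		}(i)
	}
	wg.Wait()
	fmt.Println(c.Len()) // 8
}
```

Since the mutex is an unexported field, the method signatures stay the same. In the real package the search index would need to be updated under the same write lock as the topic map.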

### Template caching

Pre-parse all templates once at `NewServer` construction time and cache the parsed `*template.Template` values. This removes per-request parsing overhead.
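
The parse-once pattern might look like the sketch below. `cachedServer`, `newCachedServer`, and the inline template are hypothetical; the real server uses embedded template files and custom template functions.

```go
package main

import (
	"bytes"
	"fmt"
	"html/template"
	"net/http"
)

// cachedServer is a hypothetical server that parses its template once.
type cachedServer struct {
	tmpl *template.Template
}

// newCachedServer parses the template at construction, so a syntax error
// surfaces at start-up instead of on the first request.
func newCachedServer() (*cachedServer, error) {
	tmpl, err := template.New("topic").Parse(`<h1>{{.Title}}</h1>`)
	if err != nil {
		return nil, err
	}
	return &cachedServer{tmpl: tmpl}, nil
}

// render executes the cached template into a buffer, so a failure cannot
// leave a half-written response.
func (s *cachedServer) render(title string) (string, error) {
	var buf bytes.Buffer
	err := s.tmpl.Execute(&buf, struct{ Title string }{Title: title})
	return buf.String(), err
}

func (s *cachedServer) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	out, err := s.render("Example topic")
	if err != nil {
		http.Error(w, "template error", http.StatusInternalServerError)
		return
	}
	w.Header().Set("Content-Type", "text/html; charset=utf-8")
	fmt.Fprint(w, out)
}

func main() {
	s, err := newCachedServer()
	if err != nil {
		panic(err)
	}
	out, err := s.render("Hello")
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```

A parsed `*template.Template` can be executed safely from multiple goroutines, so the cached value can be shared across requests without extra locking.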

### Pagination

The `/api/topics` endpoint returns all topics in a single response. For large catalogs (thousands of topics), cursor-based or offset pagination should be added.
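
Offset pagination, the simpler of the two options, could be sketched as follows. The `offset` and `limit` query parameters, the default and maximum page sizes, and the response shape are all assumptions; topics are represented as plain strings to keep the example short.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"strconv"
)

const (
	defaultLimit = 50  // assumed default page size
	maxLimit     = 200 // assumed upper bound on page size
)

// paginate returns the window of items starting at offset, clamped to the
// slice bounds. It never returns nil, so JSON output is [] rather than null.
func paginate(items []string, offset, limit int) []string {
	if offset < 0 {
		offset = 0
	}
	if limit <= 0 {
		limit = defaultLimit
	}
	if limit > maxLimit {
		limit = maxLimit
	}
	if offset >= len(items) {
		return []string{}
	}
	end := offset + limit
	if end > len(items) {
		end = len(items)
	}
	return items[offset:end]
}

// page is a hypothetical response envelope.
type page struct {
	Total  int      `json:"total"`
	Offset int      `json:"offset"`
	Items  []string `json:"items"`
}

// topicsHandler reads ?offset= and ?limit=; unparsable values fall back to 0.
func topicsHandler(all []string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		offset, _ := strconv.Atoi(r.URL.Query().Get("offset"))
		limit, _ := strconv.Atoi(r.URL.Query().Get("limit"))
		if offset < 0 {
			offset = 0
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(page{
			Total:  len(all),
			Offset: offset,
			Items:  paginate(all, offset, limit),
		})
	}
}

func main() {
	topics := []string{"a", "b", "c", "d", "e"}
	fmt.Println(paginate(topics, 1, 2)) // [b c]
	_ = topicsHandler(topics)           // would be registered on the mux
}
```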

### Markdown file loading

The current `Catalog` has no `LoadDir` or `LoadFile` function; topics are added programmatically via `Add` or via `ParseTopic`. A convenience function to load all `*.md` files from a directory would simplify catalog population for file-backed deployments.
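
Such a convenience function might look like the sketch below. `loadDir` and `demo` are hypothetical, and because the exact signature of `ParseTopic` is not given here, the sketch takes a callback that a caller could implement by parsing the content and calling `Add`.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
)

// loadDir reads every *.md file directly inside dir (no recursion) and hands
// its path and contents to add. It returns the number of files added.
func loadDir(dir string, add func(path string, content []byte) error) (int, error) {
	paths, err := filepath.Glob(filepath.Join(dir, "*.md"))
	if err != nil {
		return 0, err
	}
	sort.Strings(paths) // deterministic load order
	n := 0
	for _, p := range paths {
		content, err := os.ReadFile(p)
		if err != nil {
			return n, fmt.Errorf("read %s: %w", p, err)
		}
		if err := add(p, content); err != nil {
			return n, fmt.Errorf("add %s: %w", p, err)
		}
		n++
	}
	return n, nil
}

// demo writes two Markdown files and one text file into a temporary
// directory, then loads the directory with a no-op add callback.
func demo() (int, error) {
	dir, err := os.MkdirTemp("", "help-topics")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)
	files := map[string]string{"a.md": "# A\n", "b.md": "# B\n", "notes.txt": "ignored"}
	for name, body := range files {
		if err := os.WriteFile(filepath.Join(dir, name), []byte(body), 0o644); err != nil {
			return 0, err
		}
	}
	return loadDir(dir, func(path string, content []byte) error { return nil })
}

func main() {
	n, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println(n) // 2
}
```

Note that `filepath.Glob` treats `dir` as part of the pattern, so a directory name containing characters such as `*` or `[` would need escaping or a different traversal such as `os.ReadDir`.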