# Project History — go-help

## Origin

Extracted from `forge.lthn.ai/core/go` `pkg/help/` on 19 February 2026 by Virgil.

Commit: `ad5e70937b30829aa63747e89e259aeb5cada575`

Message: `feat: extract go-help from core/go pkg/help`

The initial extraction included:

- `Topic`, `Section`, `Frontmatter` type definitions
- `Catalog` with `Add`, `List`, `Get`, `Search`
- `Parser` with frontmatter extraction, section parsing, and ID generation
- Basic search with title (+10), section (+5), and partial/body (+0.5) scoring
- Two test files

The only external dependency at extraction was `gopkg.in/yaml.v3`.

---

## Phase 0 — Hardening and Test Coverage

Completed: 20 February 2026 (Charon)

Commit: `23cef8592a825ffcc7e551079fa561092514d47c`

Message: `test: complete Phase 0 hardening — 124 tests, 94% coverage, 8 benchmarks`

Coverage rose from 92.1% to 100%. The entire `catalog.go` had zero test coverage at extraction.

Work completed:

- `catalog_test.go` created from scratch covering `DefaultCatalog`, `Add`, `List`, `Search`, `Get` (found and not-found), and score tie-breaking.
- Targeted search tests for previously uncovered branches: nil topic guard in `Search()` (stale index references after hypothetical removal), alphabetical tie-breaking when scores are equal, headings-only content in snippet extraction, whitespace-only content trimmed to empty in snippets, empty regex slice in `highlight()`, overlapping match extension in highlight merging.
- Parser tests expanded to cover: empty input, frontmatter-only documents, three malformed YAML variants, frontmatter not at start of file, deeply nested headings (H4–H6 with body text), Unicode (CJK, emoji, diacritics, mixed scripts), 10,000-line documents, empty sections, headings without the mandatory space after `#`, consecutive headings, and `GenerateID` edge cases.
- `search_bench_test.go` with eight benchmarks using `b.Loop()` (Go 1.25) and `b.ReportAllocs()`.
  Benchmarks cover: single word, multi-word, no results, partial match, 500-topic catalog, 1,000-topic catalog, `Add` indexing throughput, and `tokenize` throughput.
- Baseline benchmark (Ryzen 9 9950X): ~745 µs/op, ~392 KB/op, 4,114 allocs/op for a 150-topic corpus.
- `go vet ./...` clean.

---

## Phase 1 — Search Improvements

Completed: 20 February 2026 (Charon)

Commits:

- `3e91510bcf1c3553849a46440253bb572a809e9e` — `feat(search): add fuzzy matching, phrase search, and improved scoring`
- `fc758a832b89b98a19083fcfb7c3e3d4c31a314c` — `feat(search): add English stemmer for improved search recall`

### Fuzzy matching

`levenshtein(a, b string) int` implemented using two-row dynamic programming. Applied to query words of three or more characters. Maximum edit distance of 2. Fuzzy matches score at `scoreFuzzyWord = 0.3`, below prefix (`0.5`) and exact (`1.0`). Words already matched as exact or prefix are not double-counted.

### Phrase search

`extractPhrases(query)` extracts `"quoted strings"` from the query before tokenisation. Phrase matching scans the concatenated title, content, and all section content of every candidate topic. Matched phrases boost by `scorePhraseBoost = 8.0`. Phrase terms are also compiled as regexes for snippet highlighting. Empty quotes and whitespace-only quoted strings are ignored.

### English stemmer

`stemmer.go` added. Lightweight Porter-style suffix stripping in two passes (inflectional then derivational). Words under four characters are unchanged; results are always at least two characters. The stemmer is intentionally not the full Porter algorithm — only the highest-impact rules for a help catalog are included.

`tokenize()` modified to emit both the raw word and its stem (if they differ). The search index therefore contains both forms. When searching, query words are also stemmed. Stem-only matches score at `scoreStemWord = 0.7`.

### Improved scoring weights

All magic numbers replaced with named constants.
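The fuzzy tier described under "Fuzzy matching" can be illustrated with a minimal sketch. It is not the package's actual code: `scoreExactWord`, `scorePrefixWord`, `fuzzyMinLen`, `fuzzyMaxDist`, and the `wordScore` helper are hypothetical names chosen for illustration (the text gives only the values), and the stem tier is left out for brevity.

```go
package main

import (
	"fmt"
	"strings"
)

// Weights quoted in the text. Only scoreFuzzyWord is a name taken from the
// document; the other identifiers are hypothetical.
const (
	scoreExactWord  = 1.0
	scorePrefixWord = 0.5
	scoreFuzzyWord  = 0.3

	fuzzyMinLen  = 3 // fuzzy matching applies to query words of 3+ characters
	fuzzyMaxDist = 2 // maximum edit distance accepted as a fuzzy match
)

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein returns the edit distance between a and b, keeping only two
// rows of the dynamic-programming table in memory.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // distance from the empty prefix of a
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev // the row just filled becomes prev
	}
	return prev[len(rb)]
}

// wordScore returns the single best tier for one query word against one
// indexed word, so a word is never counted in more than one tier.
func wordScore(query, indexed string) float64 {
	switch {
	case query == indexed:
		return scoreExactWord
	case strings.HasPrefix(indexed, query):
		return scorePrefixWord
	case len([]rune(query)) >= fuzzyMinLen && levenshtein(query, indexed) <= fuzzyMaxDist:
		return scoreFuzzyWord
	}
	return 0
}

func main() {
	fmt.Println(levenshtein("kitten", "sitting")) // 3
	fmt.Println(wordScore("conf", "config"))      // 0.5
	fmt.Println(wordScore("serach", "search"))    // 0.3
}
```

Returning one tier per word is one simple way to honour the rule that exact and prefix matches are not double-counted as fuzzy matches.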
New features added:

- `scoreTagBoost = 3.0` — topics with tags matching query words rank higher.
- `scoreAllWords = 2.0` — topics containing all query words receive a multi-word bonus.

### API compatibility

`Search(query string) []*SearchResult` signature unchanged. All existing tests passed without modification.

---

## Phase 2 — HTTP Server, Rendering, Static Site, and CLI Ingestion

Completed: 20 February 2026 (Charon)

Commit: `944cad006b4920c3406d170fe42d006533483d59`

Message: `feat(help): Phase 2 — HTTP server, rendering, static site generator, CLI ingestion`

### Markdown rendering (`render.go`)

`RenderMarkdown` added using `github.com/yuin/goldmark` with GFM, Typographer, and `html.WithUnsafe()`. Returns an HTML fragment.

### HTTP server (`server.go`)

Six-route HTTP server. Implements `http.Handler`. Routes registered at construction. HTML and JSON routes share the same catalog; JSON routes return raw `Topic` structs without pre-rendered HTML.

### Templates (`templates.go`, `templates/`)

Five embedded templates with dark theme CSS. Template functions: `renderMarkdown`, `truncate`, `pluralise`, `multiply`, `sub`. Topics grouped by first tag on the index page, sorted by Order then Title within each group.

### Static site generator (`generate.go`)

`Generate` writes five output artefacts. Client-side search uses inline JavaScript that loads `search-index.json` (content truncated to 500 runes per topic) and scores results with title (+10), content (+1), and tag (+3) weights. All DOM insertion uses safe methods (`textContent`, `createElement`) — no `innerHTML` with user-supplied strings.

### CLI ingestion (`ingest.go`)

`ParseHelpText` converts Go flag-style and Cobra-style help output to a `Topic`. `IngestCLIHelp` batch-ingests a map of command names to help texts.

---

## Known Limitations

### Search

- Phrase search does not apply title or section boosts; it uses a flat `+8.0` regardless of where the phrase appears.
- Stemmer covers English only.
  Non-English catalog content will not stem correctly but will still match via exact and prefix routes.
- The stemmer does not implement the full Porter algorithm. Some irregular forms (e.g. `ran` → `run`) are not handled.
- Tag boost applies per tag that matches, but a single query word can match multiple tags on the same topic, causing unbounded tag score accumulation.
- The inverted index has no removal operation. Topics added via `Add` cannot be retracted without rebuilding the catalog.

### HTTP server

- No request timeout is configured in `ListenAndServe`. Callers embedding the server in a larger application should wrap it in an `http.Server` with appropriate `ReadTimeout` and `WriteTimeout` values.
- Template parsing occurs on every request (`parseTemplates` is called per handler invocation). For high-throughput deployments, templates should be pre-parsed and cached.
- No access logging is provided.

### Static site generator

- Client-side search is a simple substring match. It does not implement stemming, fuzzy matching, or phrase search — only the server-side search has those capabilities.
- The `search-index.json` content field is truncated to 500 runes per topic, which may omit relevant content in long topics.

### General

- The `Catalog` is not safe for concurrent writes. Multiple goroutines calling `Add` concurrently will cause a data race. Reads via `List`, `Get`, and `Search` are safe once the catalog is fully populated, provided no concurrent `Add` calls occur.

---

## Future Considerations

### Phase 3 — AI-assisted search

The outstanding phase from the original roadmap:

- Embed help topics into go-rag collections for semantic search.
- Add vector similarity as a fallback when keyword search returns no results.
- Support natural language queries such as "how do I push all repos?".

This requires a dependency on `forge.lthn.ai/core/go-rag` and an Ollama or compatible embedding backend.
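The fallback behaviour in the second bullet above might look roughly like the sketch below. Everything in it is an assumption made for illustration: `embeddedTopic`, `cosine`, and `searchWithFallback` are hypothetical names, the vectors are hand-written rather than produced by an embedding backend, and nothing here reflects the go-rag API.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// embeddedTopic is a hypothetical pairing of a topic ID with its embedding.
type embeddedTopic struct {
	ID     string
	Vector []float64
}

// cosine returns the cosine similarity of a and b, or 0 when the vectors
// are empty, differ in length, or either one is all zeros.
func cosine(a, b []float64) float64 {
	if len(a) == 0 || len(a) != len(b) {
		return 0
	}
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// searchWithFallback returns the keyword hits unchanged when there are any.
// Otherwise it ranks topics by similarity to queryVec and returns the IDs
// of the top k.
func searchWithFallback(keywordHits []string, queryVec []float64, topics []embeddedTopic, k int) []string {
	if len(keywordHits) > 0 {
		return keywordHits
	}
	ranked := make([]embeddedTopic, len(topics))
	copy(ranked, topics) // leave the caller's slice order untouched
	sort.SliceStable(ranked, func(i, j int) bool {
		return cosine(queryVec, ranked[i].Vector) > cosine(queryVec, ranked[j].Vector)
	})
	if k < 0 {
		k = 0
	}
	if k > len(ranked) {
		k = len(ranked)
	}
	ids := make([]string, 0, k)
	for _, t := range ranked[:k] {
		ids = append(ids, t.ID)
	}
	return ids
}

func main() {
	topics := []embeddedTopic{
		{ID: "pull", Vector: []float64{0, 1}},
		{ID: "push", Vector: []float64{1, 0}},
		{ID: "sync", Vector: []float64{0.7, 0.7}},
	}
	// No keyword hits, so the vector ranking is used.
	fmt.Println(searchWithFallback(nil, []float64{0.9, 0.1}, topics, 2))
}
```

Keeping keyword results authoritative and consulting vectors only on an empty result set matches the "fallback" wording; blending the two scores would be a different design.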
### Concurrency safety

Add a `sync.RWMutex` to `Catalog` to allow safe concurrent reads alongside occasional writes. This is a straightforward change with no API surface impact.

### Template caching

Pre-parse all templates once at `NewServer` construction time and cache the parsed `*template.Template` values. This removes per-request parsing overhead.

### Pagination

The `/api/topics` endpoint returns all topics in a single response. For large catalogs (thousands of topics), cursor-based or offset pagination should be added.

### YAML file loading

The current `Catalog` has no `LoadDir` or `LoadFile` function; topics are added programmatically via `Add` or via `ParseTopic`. A convenience function to load all `*.md` files from a directory would simplify catalog population for file-backed deployments.