Project History — go-help
Origin
Extracted from forge.lthn.ai/core/go pkg/help/ on 19 February 2026 by Virgil.
Commit: ad5e70937b30829aa63747e89e259aeb5cada575
Message: feat: extract go-help from core/go pkg/help
The initial extraction included:
- Topic, Section, Frontmatter type definitions
- Catalog with Add, List, Get, Search
- Parser with frontmatter extraction, section parsing, and ID generation
- Basic search with title (+10), section (+5), and partial/body (+0.5) scoring
- Two test files
The only external dependency at extraction was gopkg.in/yaml.v3.
Phase 0 — Hardening and Test Coverage
Completed: 20 February 2026 (Charon)
Commit: 23cef8592a825ffcc7e551079fa561092514d47c
Message: test: complete Phase 0 hardening — 124 tests, 94% coverage, 8 benchmarks
Coverage rose from 92.1% to 94%, matching the figure in the commit message. catalog.go had no test coverage at all at extraction.
Work completed:
- catalog_test.go created from scratch, covering DefaultCatalog, Add, List, Search, Get (found and not-found), and score tie-breaking.
- Targeted search tests for previously uncovered branches: nil topic guard in Search() (stale index references after hypothetical removal), alphabetical tie-breaking when scores are equal, headings-only content in snippet extraction, whitespace-only content trimmed to empty in snippets, empty regex slice in highlight(), overlapping match extension in highlight merging.
- Parser tests expanded to cover: empty input, frontmatter-only documents, three malformed YAML variants, frontmatter not at start of file, deeply nested headings (H4–H6 with body text), Unicode (CJK, emoji, diacritics, mixed scripts), 10,000-line documents, empty sections, headings without the mandatory space after #, consecutive headings, and GenerateID edge cases.
- search_bench_test.go with eight benchmarks using b.Loop() (Go 1.25) and b.ReportAllocs(). Benchmarks cover: single word, multi-word, no results, partial match, 500-topic catalog, 1,000-topic catalog, Add indexing throughput, and tokenize throughput.
- Baseline benchmark (Ryzen 9 9950X): ~745 µs/op, ~392 KB/op, 4,114 allocs/op for a 150-topic corpus.
- go vet ./... clean.
Phase 1 — Search Improvements
Completed: 20 February 2026 (Charon)
Commits:
- 3e91510bcf1c3553849a46440253bb572a809e9e: feat(search): add fuzzy matching, phrase search, and improved scoring
- fc758a832b89b98a19083fcfb7c3e3d4c31a314c: feat(search): add English stemmer for improved search recall
Fuzzy matching
levenshtein(a, b string) int implemented using two-row dynamic programming. Applied to query words of three or more characters. Maximum edit distance of 2. Fuzzy matches score at scoreFuzzyWord = 0.3, below prefix (0.5) and exact (1.0). Words already matched as exact or prefix are not double-counted.
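The two-row approach can be sketched as follows. This is an illustration rather than the code in the repository: the constant names minFuzzyLen and maxFuzzyDist and the fuzzyMatch helper are hypothetical, and only the signature levenshtein(a, b string) int and the thresholds (three characters, distance 2) come from the description above.

```go
package main

import "fmt"

// Thresholds taken from the text; the constant names are hypothetical.
const (
	minFuzzyLen  = 3 // query words shorter than this are never fuzzy-matched
	maxFuzzyDist = 2 // maximum accepted edit distance
)

// levenshtein returns the edit distance between a and b, keeping only two
// rows of the dynamic-programming table in memory.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // distance from the empty prefix of a
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			best := prev[j-1] + cost // substitution or match
			if d := prev[j] + 1; d < best {
				best = d // deletion
			}
			if d := curr[j-1] + 1; d < best {
				best = d // insertion
			}
			curr[j] = best
		}
		prev, curr = curr, prev // the finished row becomes the previous row
	}
	return prev[len(rb)]
}

// fuzzyMatch (hypothetical helper) applies the length and distance limits.
func fuzzyMatch(queryWord, indexWord string) bool {
	if len([]rune(queryWord)) < minFuzzyLen {
		return false
	}
	return levenshtein(queryWord, indexWord) <= maxFuzzyDist
}

func main() {
	fmt.Println(levenshtein("kitten", "sitting"))
	fmt.Println(fuzzyMatch("serach", "search"))
	fmt.Println(fuzzyMatch("ab", "abc"))
}
```

Swapping the two slices after each row keeps memory proportional to the length of b instead of the full table. Note that plain Levenshtein counts a transposition such as "serach" for "search" as two edits, which still falls within the maximum distance of 2.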
Phrase search
extractPhrases(query) extracts "quoted strings" from the query before tokenisation. Phrase matching scans the concatenated title, content, and all section content of every candidate topic. Matched phrases boost by scorePhraseBoost = 8.0. Phrase terms are also compiled as regexes for snippet highlighting. Empty quotes and whitespace-only quoted strings are ignored.
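A minimal sketch of phrase extraction, assuming a regular-expression approach. The return shape (phrases plus the remaining query) is an assumption made for illustration; the text above only gives the name extractPhrases(query) and the rule that empty or whitespace-only quotes are ignored.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// quoted matches a double-quoted run that contains no further quote character.
var quoted = regexp.MustCompile(`"([^"]*)"`)

// extractPhrases returns the quoted phrases found in query and the query with
// those quoted runs removed. The two-value return is hypothetical.
func extractPhrases(query string) (phrases []string, rest string) {
	for _, m := range quoted.FindAllStringSubmatch(query, -1) {
		// Empty and whitespace-only quoted strings are ignored.
		if p := strings.TrimSpace(m[1]); p != "" {
			phrases = append(phrases, p)
		}
	}
	// Replace each quoted run with a space so neighbouring words stay separate.
	rest = strings.TrimSpace(quoted.ReplaceAllString(query, " "))
	return phrases, rest
}

func main() {
	phrases, rest := extractPhrases(`"push all" repos "" "  "`)
	fmt.Printf("%q %q\n", phrases, rest)
}
```

The remainder can then go through normal tokenisation, while each phrase is matched as a whole against the topic text.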
English stemmer
stemmer.go added. Lightweight Porter-style suffix stripping in two passes (inflectional then derivational). Words under four characters are unchanged; results are always at least two characters. The stemmer is intentionally not the full Porter algorithm — only the highest-impact rules for a help catalog are included.
tokenize() modified to emit both the raw word and its stem (if they differ). The search index therefore contains both forms. When searching, query words are also stemmed. Stem-only matches score at scoreStemWord = 0.7.
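The dual-emission behaviour might look like the sketch below. The stem function here is a deliberately tiny stand-in that strips four common suffixes; it is not the rule set in stemmer.go, and the word-splitting rule is also an assumption. Only the guards (words under four characters unchanged, results at least two characters) and the "emit raw word plus stem when they differ" behaviour come from the text.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stem is a placeholder stemmer, NOT the project's rules. It assumes
// lower-case ASCII input, so byte length equals character count.
func stem(w string) string {
	if len(w) < 4 {
		return w // words under four characters are unchanged
	}
	for _, suf := range []string{"ing", "ed", "es", "s"} {
		if strings.HasSuffix(w, suf) && len(w)-len(suf) >= 2 {
			return strings.TrimSuffix(w, suf) // result keeps at least two characters
		}
	}
	return w
}

// tokenize lower-cases text, splits on non-alphanumeric runes, and emits each
// word followed by its stem when the stem differs from the word.
func tokenize(text string) []string {
	words := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsDigit(r)
	})
	var out []string
	for _, w := range words {
		out = append(out, w)
		if s := stem(w); s != w {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	fmt.Println(tokenize("Pushing repos"))
}
```

Because both forms are indexed, an exact query still finds the raw word, while a differently inflected query can meet it at the shared stem.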
Improved scoring weights
All magic numbers were replaced with named constants. Two new scoring boosts were added:
- scoreTagBoost = 3.0: topics with tags matching query words rank higher.
- scoreAllWords = 2.0: topics containing all query words receive a multi-word bonus.
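To show how the weights relate to one another, here is a hedged sketch that collects the values named in this document into one constant block. The names scoreExactWord and scorePrefixWord, the wordMatch type, and the way topicScore combines the parts are all assumptions made for illustration; the document gives the values but not the combining logic.

```go
package main

import "fmt"

const (
	scoreExactWord  = 1.0 // hypothetical name; value from the text
	scorePrefixWord = 0.5 // hypothetical name; value from the text
	scoreStemWord   = 0.7
	scoreFuzzyWord  = 0.3
	scoreTagBoost   = 3.0
	scoreAllWords   = 2.0
)

// wordMatch records the best way a single query word matched a topic.
type wordMatch int

const (
	matchNone wordMatch = iota
	matchFuzzy
	matchPrefix
	matchStem
	matchExact
)

// wordScore maps a match kind to its weight, so a word is counted once.
func wordScore(m wordMatch) float64 {
	switch m {
	case matchExact:
		return scoreExactWord
	case matchStem:
		return scoreStemWord
	case matchPrefix:
		return scorePrefixWord
	case matchFuzzy:
		return scoreFuzzyWord
	}
	return 0
}

// topicScore is an assumed combination: per-word weights, one tag boost per
// matching tag, and the all-words bonus for multi-word queries fully matched.
func topicScore(matches []wordMatch, matchingTags int) float64 {
	total := 0.0
	allMatched := len(matches) > 1
	for _, m := range matches {
		if m == matchNone {
			allMatched = false
		}
		total += wordScore(m)
	}
	total += float64(matchingTags) * scoreTagBoost
	if allMatched {
		total += scoreAllWords
	}
	return total
}

func main() {
	fmt.Println(topicScore([]wordMatch{matchExact, matchPrefix}, 1))
	fmt.Println(topicScore([]wordMatch{matchExact, matchNone}, 0))
}
```

Keeping the weights in one block makes the ordering exact > stem > prefix > fuzzy visible at a glance and easy to tune.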
API compatibility
Search(query string) []*SearchResult signature unchanged. All existing tests passed without modification.
Phase 2 — HTTP Server, Rendering, Static Site, and CLI Ingestion
Completed: 20 February 2026 (Charon)
Commit: 944cad006b4920c3406d170fe42d006533483d59
Message: feat(help): Phase 2 — HTTP server, rendering, static site generator, CLI ingestion
Markdown rendering (render.go)
RenderMarkdown added using github.com/yuin/goldmark with GFM, Typographer, and html.WithUnsafe(). Returns an HTML fragment.
HTTP server (server.go)
Six-route HTTP server. Implements http.Handler. Routes registered at construction. HTML and JSON routes share the same catalog; JSON routes return raw Topic structs without pre-rendered HTML.
Templates (templates.go, templates/)
Five embedded templates with dark theme CSS. Template functions: renderMarkdown, truncate, pluralise, multiply, sub. Topics grouped by first tag on the index page, sorted by Order then Title within each group.
Static site generator (generate.go)
Generate writes five output artefacts. Client-side search uses inline JavaScript that loads search-index.json (content truncated to 500 runes per topic) and scores results with title (+10), content (+1), and tag (+3) weights. All DOM insertion uses safe methods (textContent, createElement) — no innerHTML with user-supplied strings.
CLI ingestion (ingest.go)
ParseHelpText converts Go flag-style and Cobra-style help output to a Topic. IngestCLIHelp batch-ingests a map of command names to help texts.
Known Limitations
Search
- Phrase search does not apply title or section boosts; it uses a flat +8.0 regardless of where the phrase appears.
- Stemmer covers English only. Non-English catalog content will not stem correctly but will still match via exact and prefix routes.
- The stemmer does not implement the full Porter algorithm. Some irregular forms (e.g. ran → run) are not handled.
- Tag boost applies per tag that matches, but a single query word can match multiple tags on the same topic, causing unbounded tag score accumulation.
- The inverted index has no removal operation. Topics added via Add cannot be retracted without rebuilding the catalog.
HTTP server
- No request timeout is configured in ListenAndServe. Callers embedding the server in a larger application should wrap it in an http.Server with appropriate ReadTimeout and WriteTimeout values.
- Template parsing occurs on every request (parseTemplates is called per handler invocation). For high-throughput deployments, templates should be pre-parsed and cached.
- No access logging is provided.
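For the timeout limitation, a caller could wrap the handler along these lines. This is a sketch: newTimedServer is a hypothetical helper, the timeout values are arbitrary examples, and http.NotFoundHandler() stands in for the help server, which is only assumed here to be usable wherever an http.Handler is accepted.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// newTimedServer (hypothetical helper) wraps any http.Handler in an
// http.Server with explicit timeouts. The durations are example values.
func newTimedServer(addr string, h http.Handler) *http.Server {
	return &http.Server{
		Addr:              addr,
		Handler:           h,
		ReadHeaderTimeout: 5 * time.Second,
		ReadTimeout:       10 * time.Second,
		WriteTimeout:      15 * time.Second,
		IdleTimeout:       60 * time.Second,
	}
}

func main() {
	// Replace the placeholder handler with the help server instance.
	srv := newTimedServer("127.0.0.1:8080", http.NotFoundHandler())
	fmt.Println(srv.ReadTimeout, srv.WriteTimeout)
	// To serve requests, call: log.Fatal(srv.ListenAndServe())
}
```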
Static site generator
- Client-side search is a simple substring match. It does not implement stemming, fuzzy matching, or phrase search — only the server-side search has those capabilities.
- The search-index.json content field is truncated to 500 runes per topic, which may omit relevant content in long topics.
General
- The Catalog is not safe for concurrent writes. Multiple goroutines calling Add concurrently will cause a data race. Reads via List, Get, and Search are safe once the catalog is fully populated, provided no concurrent Add calls occur.
Future Considerations
Phase 3 — AI-assisted search
The outstanding phase from the original roadmap:
- Embed help topics into go-rag collections for semantic search.
- Add vector similarity as a fallback when keyword search returns no results.
- Support natural language queries such as "how do I push all repos?".
This requires a dependency on forge.lthn.ai/core/go-rag and an Ollama or compatible embedding backend.
Concurrency safety
Add a sync.RWMutex to Catalog to allow safe concurrent reads alongside occasional writes. This is a straightforward change with no API surface impact.
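A reduced sketch of the idea, assuming a map-backed catalog. The Topic and Catalog types below are stand-ins with far fewer fields than the real ones, and the Get signature, Len, NewCatalog, and addConcurrently are hypothetical; only the use of sync.RWMutex reflects the proposal.

```go
package main

import (
	"fmt"
	"sync"
)

// Topic and Catalog are simplified stand-ins for the package's types.
type Topic struct{ ID, Title string }

type Catalog struct {
	mu     sync.RWMutex
	topics map[string]*Topic
}

func NewCatalog() *Catalog {
	return &Catalog{topics: make(map[string]*Topic)}
}

// Add takes the write lock, so concurrent writers are serialised.
func (c *Catalog) Add(t *Topic) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.topics[t.ID] = t
}

// Get takes the read lock, so many readers can proceed in parallel.
func (c *Catalog) Get(id string) (*Topic, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	t, ok := c.topics[id]
	return t, ok
}

// Len reports the number of topics under the read lock.
func (c *Catalog) Len() int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return len(c.topics)
}

// addConcurrently adds n topics from n goroutines to exercise the lock.
func addConcurrently(c *Catalog, n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			c.Add(&Topic{ID: fmt.Sprintf("topic-%d", i)})
		}(i)
	}
	wg.Wait()
}

func main() {
	c := NewCatalog()
	addConcurrently(c, 100)
	fmt.Println(c.Len())
}
```

Running a test like this under go test -race is a quick way to confirm that the lock covers every access path, including the search index.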
Template caching
Pre-parse all templates once at NewServer construction time and cache the parsed *template.Template values. This removes per-request parsing overhead.
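The caching change could take roughly this shape. The Server struct, the NewServer signature, the inline template text, and the renderIndex helper are all hypothetical; the real server uses embedded template files and its constructor arguments are not described here.

```go
package main

import (
	"bytes"
	"fmt"
	"html/template"
	"net/http"
	"net/http/httptest"
)

// Server is a stand-in that holds a template parsed once at construction.
type Server struct {
	tmpl *template.Template
}

// NewServer parses the template a single time; parse errors surface at
// start-up instead of on the first request.
func NewServer() (*Server, error) {
	t, err := template.New("index").Parse(`<h1>{{.Title}}</h1>`)
	if err != nil {
		return nil, err
	}
	return &Server{tmpl: t}, nil
}

// ServeHTTP executes the cached template into a buffer first, so a failure
// can still produce a clean 500 response.
func (s *Server) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	var buf bytes.Buffer
	if err := s.tmpl.Execute(&buf, struct{ Title string }{"Help"}); err != nil {
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}
	w.Header().Set("Content-Type", "text/html; charset=utf-8")
	_, _ = buf.WriteTo(w)
}

// renderIndex issues an in-memory request and returns the response body.
func renderIndex(s *Server) string {
	rec := httptest.NewRecorder()
	s.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/", nil))
	return rec.Body.String()
}

func main() {
	s, err := NewServer()
	if err != nil {
		panic(err)
	}
	fmt.Println(renderIndex(s))
}
```

A parsed *template.Template is safe for concurrent execution, so a single cached value can serve all requests.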
Pagination
The /api/topics endpoint returns all topics in a single response. For large catalogs (thousands of topics), cursor-based or offset pagination should be added.
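As one possible starting point, offset pagination reduces to a slice helper like the one below. The paginate function, the default and maximum limits, and the convention of returning -1 when no further page exists are all assumptions; the endpoint's eventual query parameters are not specified in this document.

```go
package main

import "fmt"

// Topic is a reduced stand-in for the package's Topic type.
type Topic struct{ ID string }

// Example limits; the values are hypothetical.
const (
	defaultLimit = 50
	maxLimit     = 200
)

// paginate returns the page starting at offset and the offset of the next
// page, or -1 when there are no further items.
func paginate(topics []Topic, offset, limit int) (page []Topic, next int) {
	if offset < 0 {
		offset = 0
	}
	if limit <= 0 {
		limit = defaultLimit
	}
	if limit > maxLimit {
		limit = maxLimit // also keeps offset+limit well away from overflow
	}
	if offset >= len(topics) {
		return nil, -1
	}
	end := offset + limit
	if end >= len(topics) {
		return topics[offset:], -1
	}
	return topics[offset:end], end
}

func main() {
	topics := []Topic{{"a"}, {"b"}, {"c"}, {"d"}, {"e"}}
	page, next := paginate(topics, 0, 2)
	fmt.Println(page, next)
	page, next = paginate(topics, 4, 2)
	fmt.Println(page, next)
}
```

Offset pagination is simple but can skip or repeat items if the catalog changes between requests; a cursor based on a stable sort key avoids that at the cost of more bookkeeping.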
YAML file loading
The current Catalog has no LoadDir or LoadFile function; topics are added programmatically via Add or via ParseTopic. A convenience function to load all *.md files from a directory would simplify catalog population for file-backed deployments.
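Such a convenience function might be sketched as follows. Everything here is hypothetical: LoadDir does not exist in the package, the Topic type is reduced to two fields, and parseTopic is a placeholder that derives an ID from the file name instead of calling the real parser, whose signature is not given in this document.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// Topic is a reduced stand-in for the package's Topic type.
type Topic struct{ ID, Content string }

// parseTopic is a placeholder for the real parser; it only derives an ID
// from the file name and keeps the raw content.
func parseTopic(path string, data []byte) (*Topic, error) {
	id := strings.TrimSuffix(filepath.Base(path), ".md")
	return &Topic{ID: id, Content: string(data)}, nil
}

// LoadDir reads every *.md file directly inside dir, parses it, and passes
// the topic to add. It returns the number of topics loaded.
func LoadDir(dir string, add func(*Topic)) (int, error) {
	paths, err := filepath.Glob(filepath.Join(dir, "*.md"))
	if err != nil {
		return 0, err
	}
	n := 0
	for _, p := range paths {
		data, err := os.ReadFile(p)
		if err != nil {
			return n, fmt.Errorf("read %s: %w", p, err)
		}
		t, err := parseTopic(p, data)
		if err != nil {
			return n, fmt.Errorf("parse %s: %w", p, err)
		}
		add(t)
		n++
	}
	return n, nil
}

// loadDemo writes two Markdown files and one other file to a temporary
// directory, then loads the directory.
func loadDemo() (int, error) {
	dir, err := os.MkdirTemp("", "help-topics")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)
	files := map[string]string{"a.md": "# A", "b.md": "# B", "notes.txt": "ignored"}
	for name, body := range files {
		if err := os.WriteFile(filepath.Join(dir, name), []byte(body), 0o644); err != nil {
			return 0, err
		}
	}
	var loaded []*Topic
	return LoadDir(dir, func(t *Topic) { loaded = append(loaded, t) })
}

func main() {
	n, err := loadDemo()
	fmt.Println(n, err)
}
```

Taking an add callback keeps the sketch independent of the catalog type; in the package itself the function would more naturally be a method that calls Add directly.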