Snider 142567a8f5 docs: graduate TODO/FINDINGS into production documentation

Replace internal task tracking (TODO.md, FINDINGS.md) with structured
documentation in docs/. Trim CLAUDE.md to agent instructions only.

Co-Authored-By: Virgil <virgil@lethean.io>

2026-02-20 15:01:55 +00:00


Project History — go-help

Origin

Extracted from forge.lthn.ai/core/go pkg/help/ on 19 February 2026 by Virgil.

Commit: ad5e70937b30829aa63747e89e259aeb5cada575
Message: feat: extract go-help from core/go pkg/help

The initial extraction included:

Topic, Section, Frontmatter type definitions
Catalog with Add, List, Get, Search
Parser with frontmatter extraction, section parsing, and ID generation
Basic search with title (+10), section (+5), and partial/body (+0.5) scoring
Two test files

The only external dependency at extraction was gopkg.in/yaml.v3.

Phase 0 — Hardening and Test Coverage

Completed: 20 February 2026 (Charon)
Commit: 23cef8592a825ffcc7e551079fa561092514d47c
Message: test: complete Phase 0 hardening — 124 tests, 94% coverage, 8 benchmarks

Coverage rose from 92.1% to 100%. The entire catalog.go had zero test coverage at extraction.

Work completed:

catalog_test.go created from scratch covering DefaultCatalog, Add, List, Search, Get (found and not-found), and score tie-breaking.
Targeted search tests for previously uncovered branches: nil topic guard in Search() (stale index references after hypothetical removal), alphabetical tie-breaking when scores are equal, headings-only content in snippet extraction, whitespace-only content trimmed to empty in snippets, empty regex slice in highlight(), overlapping match extension in highlight merging.
Parser tests expanded to cover: empty input, frontmatter-only documents, three malformed YAML variants, frontmatter not at start of file, deeply nested headings (H4–H6 with body text), Unicode (CJK, emoji, diacritics, mixed scripts), 10,000-line documents, empty sections, headings without the mandatory space after #, consecutive headings, and GenerateID edge cases.
search_bench_test.go with eight benchmarks using b.Loop() (Go 1.25) and b.ReportAllocs(). Benchmarks cover: single word, multi-word, no results, partial match, 500-topic catalog, 1,000-topic catalog, Add indexing throughput, and tokenize throughput.
Baseline benchmark (Ryzen 9 9950X): ~745 µs/op, ~392 KB/op, 4,114 allocs/op for a 150-topic corpus.
go vet ./... clean.

Phase 1 — Search Improvements

Completed: 20 February 2026 (Charon)
Commits:

3e91510bcf1c3553849a46440253bb572a809e9e — feat(search): add fuzzy matching, phrase search, and improved scoring
fc758a832b89b98a19083fcfb7c3e3d4c31a314c — feat(search): add English stemmer for improved search recall

Fuzzy matching

levenshtein(a, b string) int implemented using two-row dynamic programming. Applied to query words of three or more characters. Maximum edit distance of 2. Fuzzy matches score at scoreFuzzyWord = 0.3, below prefix (0.5) and exact (1.0). Words already matched as exact or prefix are not double-counted.
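
The two-row approach described above can be sketched as follows. This is a minimal illustration, not the package's actual source: the helper fuzzyMatch and the constant names minFuzzyLen and maxFuzzyDistance are hypothetical, and only the thresholds (three characters, distance 2) come from the text.

```go
package main

import "fmt"

const (
	minFuzzyLen      = 3 // hypothetical name: shorter query words are never fuzzy-matched
	maxFuzzyDistance = 2 // hypothetical name: maximum edit distance accepted
)

func min3(a, b, c int) int {
	if b < a {
		a = b
	}
	if c < a {
		a = c
	}
	return a
}

// levenshtein returns the edit distance between a and b. It keeps only two
// rows of the dynamic-programming table (previous and current) instead of
// the full matrix.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // distance from the empty prefix of a
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev // the row just filled becomes "previous"
	}
	return prev[len(rb)]
}

// fuzzyMatch is a hypothetical helper applying the thresholds from the text.
func fuzzyMatch(queryWord, indexed string) bool {
	if len([]rune(queryWord)) < minFuzzyLen {
		return false
	}
	return levenshtein(queryWord, indexed) <= maxFuzzyDistance
}

func main() {
	fmt.Println(levenshtein("kitten", "sitting")) // 3
	fmt.Println(fuzzyMatch("serch", "search"))    // true
	fmt.Println(fuzzyMatch("go", "do"))           // false: query word too short
}
```

Two rows keep memory proportional to the length of the second word, which matters when every query word is compared against many indexed words.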

Phrase search

extractPhrases(query) extracts "quoted strings" from the query before tokenisation. Phrase matching scans the concatenated title, content, and all section content of every candidate topic. Matched phrases boost by scorePhraseBoost = 8.0. Phrase terms are also compiled as regexes for snippet highlighting. Empty quotes and whitespace-only quoted strings are ignored.
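
A rough sketch of the phrase-extraction step follows. The return shape (phrases plus the remaining query), the lower-casing, and the helper phrasePattern are assumptions made for illustration; the history only states that quoted strings are pulled out before tokenisation, that empty or whitespace-only quotes are ignored, and that phrases are compiled as regexes for highlighting.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var quoted = regexp.MustCompile(`"([^"]*)"`)

// extractPhrases returns the non-empty quoted phrases in query and the
// query text with all quoted spans removed (assumed return shape).
func extractPhrases(query string) (phrases []string, rest string) {
	for _, m := range quoted.FindAllStringSubmatch(query, -1) {
		if p := strings.TrimSpace(m[1]); p != "" {
			phrases = append(phrases, strings.ToLower(p))
		}
	}
	stripped := quoted.ReplaceAllString(query, " ")
	return phrases, strings.Join(strings.Fields(stripped), " ")
}

// phrasePattern is a hypothetical helper that builds a case-insensitive
// literal pattern for snippet highlighting.
func phrasePattern(phrase string) *regexp.Regexp {
	return regexp.MustCompile("(?i)" + regexp.QuoteMeta(phrase))
}

func main() {
	phrases, rest := extractPhrases(`install "rate limit" "" "  " guide`)
	fmt.Printf("%q %q\n", phrases, rest)
	fmt.Println(phrasePattern("rate limit").MatchString("The Rate Limit applies"))
}
```

Quoting the phrase with regexp.QuoteMeta keeps characters such as `.` or `(` in a user query from being interpreted as regex syntax.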

English stemmer

stemmer.go added. Lightweight Porter-style suffix stripping in two passes (inflectional then derivational). Words under four characters are unchanged; results are always at least two characters. The stemmer is intentionally not the full Porter algorithm — only the highest-impact rules for a help catalog are included.

tokenize() modified to emit both the raw word and its stem (if they differ). The search index therefore contains both forms. When searching, query words are also stemmed. Stem-only matches score at scoreStemWord = 0.7.

Improved scoring weights

All magic numbers replaced with named constants. New features added:

scoreTagBoost = 3.0 — topics with tags matching query words rank higher.
scoreAllWords = 2.0 — topics containing all query words receive a multi-word bonus.
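
To show how the named weights could combine, here is a hedged sketch. The constants scoreFuzzyWord, scoreStemWord, scorePhraseBoost, scoreTagBoost, and scoreAllWords and all values are taken from this document; the names scoreExactWord and scorePrefixWord, the matchKind type, and the way the terms are summed are assumptions. Title and section boosts are left out for brevity.

```go
package main

import "fmt"

const (
	scoreExactWord   = 1.0 // hypothetical name; value from the text
	scoreStemWord    = 0.7
	scorePrefixWord  = 0.5 // hypothetical name; value from the text
	scoreFuzzyWord   = 0.3
	scoreAllWords    = 2.0
	scoreTagBoost    = 3.0
	scorePhraseBoost = 8.0
)

// matchKind records the best way a query word matched a topic (assumed model).
type matchKind int

const (
	matchNone matchKind = iota
	matchFuzzy
	matchPrefix
	matchStem
	matchExact
)

func wordScore(k matchKind) float64 {
	switch k {
	case matchExact:
		return scoreExactWord
	case matchStem:
		return scoreStemWord
	case matchPrefix:
		return scorePrefixWord
	case matchFuzzy:
		return scoreFuzzyWord
	default:
		return 0
	}
}

// scoreTopic sums one score per query word, adds the all-words bonus when a
// multi-word query matched completely, then adds tag and phrase boosts.
func scoreTopic(words []matchKind, matchingTags, matchedPhrases int) float64 {
	total := 0.0
	matched := 0
	for _, k := range words {
		if k != matchNone {
			matched++
		}
		total += wordScore(k)
	}
	if len(words) > 1 && matched == len(words) {
		total += scoreAllWords
	}
	total += float64(matchingTags) * scoreTagBoost
	total += float64(matchedPhrases) * scorePhraseBoost
	return total
}

func main() {
	// One exact word, one stem-only word, one matching tag: 1.0 + 0.7 + 2.0 + 3.0
	fmt.Printf("%.1f\n", scoreTopic([]matchKind{matchExact, matchStem}, 1, 0)) // 6.7
}
```

Counting each word once by its best match kind is one way to honour the rule that exact or prefix matches are not double-counted as fuzzy matches.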

API compatibility

Search(query string) []*SearchResult signature unchanged. All existing tests passed without modification.

Phase 2 — HTTP Server, Rendering, Static Site, and CLI Ingestion

Completed: 20 February 2026 (Charon)
Commit: 944cad006b4920c3406d170fe42d006533483d59
Message: feat(help): Phase 2 — HTTP server, rendering, static site generator, CLI ingestion

Markdown rendering (`render.go`)

RenderMarkdown added using github.com/yuin/goldmark with GFM, Typographer, and html.WithUnsafe(). Returns an HTML fragment.

HTTP server (`server.go`)

Six-route HTTP server. Implements http.Handler. Routes registered at construction. HTML and JSON routes share the same catalog; JSON routes return raw Topic structs without pre-rendered HTML.

Templates (`templates.go`, `templates/`)

Five embedded templates with dark theme CSS. Template functions: renderMarkdown, truncate, pluralise, multiply, sub. Topics grouped by first tag on the index page, sorted by Order then Title within each group.

Static site generator (`generate.go`)

Generate writes five output artefacts. Client-side search uses inline JavaScript that loads search-index.json (content truncated to 500 runes per topic) and scores results with title (+10), content (+1), and tag (+3) weights. All DOM insertion uses safe methods (textContent, createElement) — no innerHTML with user-supplied strings.

CLI ingestion (`ingest.go`)

ParseHelpText converts Go flag-style and Cobra-style help output to a Topic. IngestCLIHelp batch-ingests a map of command names to help texts.

Known Limitations

Search

Phrase search does not apply title or section boosts; it uses a flat +8.0 regardless of where the phrase appears.
Stemmer covers English only. Non-English catalog content will not stem correctly but will still match via exact and prefix routes.
The stemmer does not implement the full Porter algorithm. Some irregular forms (e.g. ran → run) are not handled.
The tag boost is applied once per matching tag, so a single query word that matches several tags on the same topic accumulates the boost with no upper bound.
The inverted index has no removal operation. Topics added via Add cannot be retracted without rebuilding the catalog.

HTTP server

No request timeout is configured in ListenAndServe. Callers embedding the server in a larger application should wrap it in an http.Server with appropriate ReadTimeout and WriteTimeout values.
Template parsing occurs on every request (parseTemplates is called per handler invocation). For high-throughput deployments, templates should be pre-parsed and cached.
No access logging is provided.

Static site generator

Client-side search is a simple substring match. It does not implement stemming, fuzzy matching, or phrase search — only the server-side search has those capabilities.
The search-index.json content field is truncated to 500 runes per topic, which may omit relevant content in long topics.

General

The Catalog is not safe for concurrent writes: multiple goroutines calling Add at the same time will cause a data race. Concurrent reads via List, Get, and Search are safe only while no Add call can run, for example after the catalog has been fully populated.

Future Considerations

Phase 3 — AI-assisted search

The outstanding phase from the original roadmap:

Embed help topics into go-rag collections for semantic search.
Add vector similarity as a fallback when keyword search returns no results.
Support natural language queries such as "how do I push all repos?".

This requires a dependency on forge.lthn.ai/core/go-rag and an Ollama or compatible embedding backend.

Concurrency safety

Add a sync.RWMutex to Catalog to allow safe concurrent reads alongside occasional writes. This is a straightforward change with no API surface impact.

Template caching

Pre-parse all templates once at NewServer construction time and cache the parsed *template.Template values. This removes per-request parsing overhead.

Pagination

The /api/topics endpoint returns all topics in a single response. For large catalogs (thousands of topics), cursor-based or offset pagination should be added.
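
Offset pagination could be as small as the helper below. The function name paginate, the default limit of 50, and the convention of returning -1 when no further page exists are assumptions for illustration; nothing here reflects an existing go-help API.

```go
package main

import "fmt"

const defaultLimit = 50 // hypothetical default page size

// paginate returns the slice of items for the requested window and the
// offset of the next page, or -1 when there is no further page.
func paginate[T any](items []T, offset, limit int) (page []T, next int) {
	if offset < 0 {
		offset = 0
	}
	if limit <= 0 {
		limit = defaultLimit
	}
	if offset >= len(items) {
		return nil, -1
	}
	end := offset + limit
	if end >= len(items) {
		return items[offset:], -1
	}
	return items[offset:end], end
}

func main() {
	ids := []string{"a", "b", "c", "d", "e"}
	page, next := paginate(ids, 0, 2)
	fmt.Println(page, next) // [a b] 2
	page, next = paginate(ids, 4, 2)
	fmt.Println(page, next) // [e] -1
}
```

Offset pagination is simple but can skip or repeat items if topics are added between requests; a cursor based on a stable sort key avoids that at the cost of more bookkeeping.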

File loading

The current Catalog has no LoadDir or LoadFile function; topics are added programmatically via Add or via ParseTopic. A convenience function to load all *.md files from a directory would simplify catalog population for file-backed deployments.
