go-help/docs/history.md
Snider 142567a8f5 docs: graduate TODO/FINDINGS into production documentation
Replace internal task tracking (TODO.md, FINDINGS.md) with structured
documentation in docs/. Trim CLAUDE.md to agent instructions only.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-20 15:01:55 +00:00

# Project History — go-help
## Origin
Extracted from `forge.lthn.ai/core/go` `pkg/help/` on 19 February 2026 by Virgil.
Commit: `ad5e70937b30829aa63747e89e259aeb5cada575`
Message: `feat: extract go-help from core/go pkg/help`
The initial extraction included:
- `Topic`, `Section`, `Frontmatter` type definitions
- `Catalog` with `Add`, `List`, `Get`, `Search`
- `Parser` with frontmatter extraction, section parsing, and ID generation
- Basic search with title (+10), section (+5), and partial/body (+0.5) scoring
- Two test files
The only external dependency at extraction was `gopkg.in/yaml.v3`.
---
## Phase 0 — Hardening and Test Coverage
Completed: 20 February 2026 (Charon)
Commit: `23cef8592a825ffcc7e551079fa561092514d47c`
Message: `test: complete Phase 0 hardening — 124 tests, 94% coverage, 8 benchmarks`
Coverage rose from 92.1% to 94%, matching the figure in the commit message. `catalog.go` had no test coverage at all at extraction.
Work completed:
- `catalog_test.go` created from scratch covering `DefaultCatalog`, `Add`, `List`, `Search`, `Get` (found and not-found), and score tie-breaking.
- Targeted search tests for previously uncovered branches: nil topic guard in `Search()` (stale index references after hypothetical removal), alphabetical tie-breaking when scores are equal, headings-only content in snippet extraction, whitespace-only content trimmed to empty in snippets, empty regex slice in `highlight()`, overlapping match extension in highlight merging.
- Parser tests expanded to cover: empty input, frontmatter-only documents, three malformed YAML variants, frontmatter not at start of file, deeply nested headings (H4 to H6 with body text), Unicode (CJK, emoji, diacritics, mixed scripts), 10,000-line documents, empty sections, headings without the mandatory space after `#`, consecutive headings, and `GenerateID` edge cases.
- `search_bench_test.go` with eight benchmarks using `b.Loop()` (Go 1.25) and `b.ReportAllocs()`. Benchmarks cover: single word, multi-word, no results, partial match, 500-topic catalog, 1,000-topic catalog, `Add` indexing throughput, and `tokenize` throughput.
- Baseline benchmark (Ryzen 9 9950X): ~745 µs/op, ~392 KB/op, 4,114 allocs/op for a 150-topic corpus.
- `go vet ./...` clean.
---
## Phase 1 — Search Improvements
Completed: 20 February 2026 (Charon)
Commits:
- `3e91510bcf1c3553849a46440253bb572a809e9e`: `feat(search): add fuzzy matching, phrase search, and improved scoring`
- `fc758a832b89b98a19083fcfb7c3e3d4c31a314c`: `feat(search): add English stemmer for improved search recall`
### Fuzzy matching
`levenshtein(a, b string) int` implemented using two-row dynamic programming. Applied to query words of three or more characters. Maximum edit distance of 2. Fuzzy matches score at `scoreFuzzyWord = 0.3`, below prefix (`0.5`) and exact (`1.0`). Words already matched as exact or prefix are not double-counted.
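
The two-row approach keeps only the previous and current rows of the edit-distance table, so memory grows with the length of one word rather than the product of both. The following is a minimal sketch of that technique, not the package's actual code: comparing by rune rather than by byte is an assumption, and `fuzzyMatch` is a hypothetical helper that applies the length and distance thresholds described above.

```go
package main

import "fmt"

// levenshtein returns the edit distance between a and b. It keeps two rows
// of the dynamic-programming table instead of the full matrix.
// Assumption: comparison is per rune; the real implementation may use bytes.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	if len(ra) == 0 {
		return len(rb)
	}
	if len(rb) == 0 {
		return len(ra)
	}
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // distance from the empty prefix of a
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			// deletion, insertion, substitution
			curr[j] = min(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

// fuzzyMatch is a hypothetical helper: query words shorter than three
// characters are skipped, and only distances of 1 or 2 count as fuzzy
// (distance 0 is an exact match and is scored elsewhere).
func fuzzyMatch(queryWord, indexedWord string) bool {
	if len([]rune(queryWord)) < 3 {
		return false
	}
	d := levenshtein(queryWord, indexedWord)
	return d > 0 && d <= 2
}

func main() {
	fmt.Println(levenshtein("kitten", "sitting")) // 3
	fmt.Println(fuzzyMatch("serch", "search"))    // true
}
```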
### Phrase search
`extractPhrases(query)` extracts `"quoted strings"` from the query before tokenisation. Phrase matching scans the concatenated title, content, and all section content of every candidate topic. Matched phrases boost by `scorePhraseBoost = 8.0`. Phrase terms are also compiled as regexes for snippet highlighting. Empty quotes and whitespace-only quoted strings are ignored.
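
As an illustration of the extraction step, the sketch below pulls quoted phrases out of a query and returns the remaining text for normal tokenisation. The two-value return and the regular expression are assumptions; the source only states that `extractPhrases(query)` exists and that empty or whitespace-only quotes are ignored.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// quoted matches a double-quoted run with no embedded quotes.
var quoted = regexp.MustCompile(`"([^"]*)"`)

// extractPhrases returns the non-empty quoted phrases in query and the query
// text with all quoted spans removed. The signature is hypothetical.
func extractPhrases(query string) (phrases []string, rest string) {
	for _, m := range quoted.FindAllStringSubmatch(query, -1) {
		// Skip "" and whitespace-only phrases.
		if p := strings.TrimSpace(m[1]); p != "" {
			phrases = append(phrases, p)
		}
	}
	rest = strings.TrimSpace(quoted.ReplaceAllString(query, " "))
	return phrases, rest
}

func main() {
	phrases, rest := extractPhrases(`"rate limit" config ""`)
	fmt.Printf("%q %q\n", phrases, rest) // ["rate limit"] "config"
}
```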
### English stemmer
`stemmer.go` added. Lightweight Porter-style suffix stripping in two passes (inflectional then derivational). Words under four characters are unchanged; results are always at least two characters. The stemmer is intentionally not the full Porter algorithm — only the highest-impact rules for a help catalog are included.
`tokenize()` modified to emit both the raw word and its stem (if they differ). The search index therefore contains both forms. When searching, query words are also stemmed. Stem-only matches score at `scoreStemWord = 0.7`.
### Improved scoring weights
All magic numbers replaced with named constants. New features added:
- `scoreTagBoost = 3.0` — topics with tags matching query words rank higher.
- `scoreAllWords = 2.0` — topics containing all query words receive a multi-word bonus.
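
For orientation, the weights named in this phase can be collected in one constant block. The sketch below also shows one plausible reading of "not double-counted": each query word contributes only its strongest match tier. `scoreExactWord`, `scorePrefixWord`, `wordMatch`, and `wordScore` are hypothetical names, and the tier-selection logic is an assumption rather than the package's actual scoring code.

```go
package main

import "fmt"

const (
	scoreExactWord   = 1.0 // hypothetical name; value from this document
	scoreStemWord    = 0.7
	scorePrefixWord  = 0.5 // hypothetical name; value from this document
	scoreFuzzyWord   = 0.3
	scorePhraseBoost = 8.0
	scoreTagBoost    = 3.0
	scoreAllWords    = 2.0
)

// wordMatch records which kinds of match were found for one query word.
type wordMatch struct {
	exact, stem, prefix, fuzzy bool
}

// wordScore returns the weight of the strongest tier only, so a word that
// matches both exactly and fuzzily is counted once.
func wordScore(m wordMatch) float64 {
	switch {
	case m.exact:
		return scoreExactWord
	case m.stem:
		return scoreStemWord
	case m.prefix:
		return scorePrefixWord
	case m.fuzzy:
		return scoreFuzzyWord
	}
	return 0
}

func main() {
	fmt.Println(wordScore(wordMatch{prefix: true, fuzzy: true})) // 0.5
}
```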
### API compatibility
`Search(query string) []*SearchResult` signature unchanged. All existing tests passed without modification.
---
## Phase 2 — HTTP Server, Rendering, Static Site, and CLI Ingestion
Completed: 20 February 2026 (Charon)
Commit: `944cad006b4920c3406d170fe42d006533483d59`
Message: `feat(help): Phase 2 — HTTP server, rendering, static site generator, CLI ingestion`
### Markdown rendering (`render.go`)
`RenderMarkdown` added using `github.com/yuin/goldmark` with GFM, Typographer, and `html.WithUnsafe()`. Returns an HTML fragment.
### HTTP server (`server.go`)
Six-route HTTP server. Implements `http.Handler`. Routes registered at construction. HTML and JSON routes share the same catalog; JSON routes return raw `Topic` structs without pre-rendered HTML.
### Templates (`templates.go`, `templates/`)
Five embedded templates with dark theme CSS. Template functions: `renderMarkdown`, `truncate`, `pluralise`, `multiply`, `sub`. Topics grouped by first tag on the index page, sorted by Order then Title within each group.
### Static site generator (`generate.go`)
`Generate` writes five output artefacts. Client-side search uses inline JavaScript that loads `search-index.json` (content truncated to 500 runes per topic) and scores results with title (+10), content (+1), and tag (+3) weights. All DOM insertion uses safe methods (`textContent`, `createElement`) — no `innerHTML` with user-supplied strings.
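
Truncating by runes rather than bytes matters because slicing a Go string at a byte offset can split a multi-byte character. A minimal sketch of rune-safe truncation follows; `truncateRunes` is a hypothetical helper name, not necessarily the one used in `generate.go`.

```go
package main

import "fmt"

// truncateRunes returns at most limit runes of s without splitting a
// multi-byte character. Ranging over a string yields the byte offset of
// each rune, so the cut always lands on a rune boundary.
func truncateRunes(s string, limit int) string {
	if limit <= 0 {
		return ""
	}
	n := 0
	for i := range s {
		if n == limit {
			return s[:i]
		}
		n++
	}
	return s // s has limit runes or fewer
}

func main() {
	fmt.Println(truncateRunes("日本語のヘルプ", 3)) // 日本語
}
```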
### CLI ingestion (`ingest.go`)
`ParseHelpText` converts Go flag-style and Cobra-style help output to a `Topic`. `IngestCLIHelp` batch-ingests a map of command names to help texts.
---
## Known Limitations
### Search
- Phrase search does not apply title or section boosts; it uses a flat `+8.0` regardless of where the phrase appears.
- Stemmer covers English only. Non-English catalog content will not stem correctly but will still match via exact and prefix routes.
- The stemmer does not implement the full Porter algorithm. Some irregular forms (e.g. `ran` → `run`) are not handled.
- Tag boost applies per tag that matches, but a single query word can match multiple tags on the same topic, causing unbounded tag score accumulation.
- The inverted index has no removal operation. Topics added via `Add` cannot be retracted without rebuilding the catalog.
### HTTP server
- No request timeout is configured in `ListenAndServe`. Callers embedding the server in a larger application should wrap it in an `http.Server` with appropriate `ReadTimeout` and `WriteTimeout` values.
- Template parsing occurs on every request (`parseTemplates` is called per handler invocation). For high-throughput deployments, templates should be pre-parsed and cached.
- No access logging is provided.
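
The timeout limitation can be worked around by the caller. Because the help server implements `http.Handler`, it can be passed to a standard `http.Server` with explicit timeouts. In the sketch below, `newHelpServer` is a hypothetical helper, the timeout values are illustrative only, and `http.NotFoundHandler()` stands in for the real help handler.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// newHelpServer wraps any http.Handler in an http.Server with explicit
// timeouts. The values are illustrative and should be tuned per deployment.
func newHelpServer(addr string, h http.Handler) *http.Server {
	return &http.Server{
		Addr:              addr,
		Handler:           h,
		ReadHeaderTimeout: 5 * time.Second,
		ReadTimeout:       10 * time.Second,
		WriteTimeout:      30 * time.Second,
		IdleTimeout:       60 * time.Second,
	}
}

func main() {
	// Stand-in handler; a real program would pass the help server here.
	srv := newHelpServer("127.0.0.1:8080", http.NotFoundHandler())
	fmt.Println(srv.Addr, srv.ReadTimeout, srv.WriteTimeout)
	// A real program would then call srv.ListenAndServe().
}
```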
### Static site generator
- Client-side search is a simple substring match. It does not implement stemming, fuzzy matching, or phrase search — only the server-side search has those capabilities.
- The `search-index.json` content field is truncated to 500 runes per topic, which may omit relevant content in long topics.
### General
- The `Catalog` is not safe for concurrent writes. Multiple goroutines calling `Add` concurrently will cause a data race. Reads via `List`, `Get`, and `Search` are safe once the catalog is fully populated, provided no concurrent `Add` calls occur.
---
## Future Considerations
### Phase 3 — AI-assisted search
The outstanding phase from the original roadmap:
- Embed help topics into go-rag collections for semantic search.
- Add vector similarity as a fallback when keyword search returns no results.
- Support natural language queries such as "how do I push all repos?".
This requires a dependency on `forge.lthn.ai/core/go-rag` and an Ollama or compatible embedding backend.
### Concurrency safety
Add a `sync.RWMutex` to `Catalog` to allow safe concurrent reads alongside occasional writes. This is a straightforward change with no API surface impact.
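
A rough sketch of the idea, using a simplified stand-in for the real types: `safeCatalog`, its fields, and the `(value, ok)` return of `Get` are assumptions made for illustration. Writers take the exclusive lock; readers share the read lock.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Topic is a simplified stand-in for the package's Topic type.
type Topic struct {
	ID, Title string
}

// safeCatalog is a hypothetical catalog guarded by a sync.RWMutex.
type safeCatalog struct {
	mu     sync.RWMutex
	topics map[string]*Topic
}

func newSafeCatalog() *safeCatalog {
	return &safeCatalog{topics: make(map[string]*Topic)}
}

// Add takes the write lock, so it excludes readers and other writers.
func (c *safeCatalog) Add(t *Topic) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.topics[t.ID] = t
}

// Get takes the read lock; any number of readers may hold it at once.
func (c *safeCatalog) Get(id string) (*Topic, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	t, ok := c.topics[id]
	return t, ok
}

// List returns a snapshot sorted by ID so callers never see the live map.
func (c *safeCatalog) List() []*Topic {
	c.mu.RLock()
	defer c.mu.RUnlock()
	out := make([]*Topic, 0, len(c.topics))
	for _, t := range c.topics {
		out = append(out, t)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].ID < out[j].ID })
	return out
}

func main() {
	c := newSafeCatalog()
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			c.Add(&Topic{ID: fmt.Sprintf("topic-%02d", n), Title: "Demo"})
		}(i)
	}
	wg.Wait()
	fmt.Println(len(c.List())) // 50
}
```

In the real package the inverted search index would need to be updated under the same write lock as the topic map.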
### Template caching
Pre-parse all templates once at `NewServer` construction time and cache the parsed `*template.Template` values. This removes per-request parsing overhead.
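
One way this could look, sketched with an inline template instead of the embedded `templates/` files: `cachedRenderer`, `newCachedRenderer`, and `renderString` are hypothetical names. A parsed `html/template` set can be executed from multiple goroutines, so a single cached value can serve all requests.

```go
package main

import (
	"fmt"
	"html/template"
	"io"
	"strings"
)

// cachedRenderer holds templates parsed once at construction time.
type cachedRenderer struct {
	tmpl *template.Template
}

// newCachedRenderer parses the template set once and reports parse errors
// up front. The inline template is a stand-in for the embedded files.
func newCachedRenderer() (*cachedRenderer, error) {
	t, err := template.New("topic").Parse(`<h1>{{.Title}}</h1>`)
	if err != nil {
		return nil, fmt.Errorf("parse templates: %w", err)
	}
	return &cachedRenderer{tmpl: t}, nil
}

// render executes a cached template by name; nothing is parsed per request.
func (r *cachedRenderer) render(w io.Writer, name string, data any) error {
	return r.tmpl.ExecuteTemplate(w, name, data)
}

// renderString is a convenience wrapper used by the demo below.
func (r *cachedRenderer) renderString(name string, data any) (string, error) {
	var sb strings.Builder
	err := r.render(&sb, name, data)
	return sb.String(), err
}

func main() {
	r, err := newCachedRenderer()
	if err != nil {
		panic(err)
	}
	out, err := r.renderString("topic", map[string]string{"Title": "Getting started"})
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // <h1>Getting started</h1>
}
```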
### Pagination
The `/api/topics` endpoint returns all topics in a single response. For large catalogs (thousands of topics), cursor-based or offset pagination should be added.
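
As a sketch of the simpler of the two options, offset pagination over an in-memory slice might look like the following. `paginate`, `defaultLimit`, and the `-1` sentinel for "no further page" are all assumptions made for illustration; how the values would map to query parameters is left open.

```go
package main

import "fmt"

// defaultLimit is a hypothetical page size used when the caller passes a
// non-positive limit.
const defaultLimit = 50

// paginate returns the items in [offset, offset+limit) and the offset of
// the next page, or -1 when there is no further page.
func paginate[T any](items []T, offset, limit int) ([]T, int) {
	if offset < 0 {
		offset = 0
	}
	if limit <= 0 {
		limit = defaultLimit
	}
	if offset >= len(items) {
		return nil, -1
	}
	end := min(offset+limit, len(items))
	next := -1
	if end < len(items) {
		next = end
	}
	return items[offset:end], next
}

func main() {
	ids := []string{"a", "b", "c", "d", "e"}
	page, next := paginate(ids, 0, 2)
	fmt.Println(page, next) // [a b] 2
}
```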
### YAML file loading
The current `Catalog` has no `LoadDir` or `LoadFile` function; topics are added programmatically via `Add` or via `ParseTopic`. A convenience function to load all `*.md` files from a directory would simplify catalog population for file-backed deployments.
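
A possible shape for such a helper is sketched below. It reads from an `fs.FS` so it works with `os.DirFS(dir)` as well as embedded or in-memory file systems. `loadMarkdown` and `demoFS` are hypothetical names, and the `add` callback stands in for whatever combination of `ParseTopic` and `Add` the real function would call, since their exact signatures are not given here.

```go
package main

import (
	"fmt"
	"io/fs"
	"testing/fstest"
)

// loadMarkdown reads every top-level *.md file in fsys and hands its
// contents to add. It returns the number of files loaded successfully.
func loadMarkdown(fsys fs.FS, add func(name string, data []byte) error) (int, error) {
	names, err := fs.Glob(fsys, "*.md")
	if err != nil {
		return 0, err
	}
	loaded := 0
	for _, name := range names {
		data, err := fs.ReadFile(fsys, name)
		if err != nil {
			return loaded, fmt.Errorf("read %s: %w", name, err)
		}
		if err := add(name, data); err != nil {
			return loaded, fmt.Errorf("add %s: %w", name, err)
		}
		loaded++
	}
	return loaded, nil
}

// demoFS builds a small in-memory file system for the example.
func demoFS() fs.FS {
	return fstest.MapFS{
		"intro.md":  {Data: []byte("# Intro\n")},
		"push.md":   {Data: []byte("# Push\n")},
		"notes.txt": {Data: []byte("not a topic")},
	}
}

func main() {
	n, err := loadMarkdown(demoFS(), func(name string, data []byte) error {
		fmt.Printf("loaded %s (%d bytes)\n", name, len(data))
		return nil
	})
	fmt.Println(n, err)
}
```

For a directory on disk, the same function could be called with `os.DirFS("docs/help")`, where the path is only an example.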