# Project History — go-help

## Origin

Extracted from `forge.lthn.ai/core/go` `pkg/help/` on 19 February 2026 by Virgil.

Commit: `ad5e70937b30829aa63747e89e259aeb5cada575`
Message: `feat: extract go-help from core/go pkg/help`

The initial extraction included:
- `Topic`, `Section`, `Frontmatter` type definitions
- `Catalog` with `Add`, `List`, `Get`, `Search`
- `Parser` with frontmatter extraction, section parsing, and ID generation
- Basic search with title (+10), section (+5), and partial/body (+0.5) scoring
- Two test files

The only external dependency at extraction was `gopkg.in/yaml.v3`.

---

## Phase 0 — Hardening and Test Coverage

Completed: 20 February 2026 (Charon)
Commit: `23cef8592a825ffcc7e551079fa561092514d47c`
Message: `test: complete Phase 0 hardening — 124 tests, 94% coverage, 8 benchmarks`

Coverage rose from 92.1% to 94%, the figure recorded in the commit message. `catalog.go` had no test coverage at all at extraction.

Work completed:

- `catalog_test.go` created from scratch covering `DefaultCatalog`, `Add`, `List`, `Search`, `Get` (found and not-found), and score tie-breaking.
- Targeted search tests for previously uncovered branches: nil topic guard in `Search()` (stale index references after hypothetical removal), alphabetical tie-breaking when scores are equal, headings-only content in snippet extraction, whitespace-only content trimmed to empty in snippets, empty regex slice in `highlight()`, overlapping match extension in highlight merging.
- Parser tests expanded to cover: empty input, frontmatter-only documents, three malformed YAML variants, frontmatter not at start of file, deeply nested headings (H4–H6 with body text), Unicode (CJK, emoji, diacritics, mixed scripts), 10,000-line documents, empty sections, headings without the mandatory space after `#`, consecutive headings, and `GenerateID` edge cases.
- `search_bench_test.go` with eight benchmarks using `b.Loop()` (Go 1.25) and `b.ReportAllocs()`. Benchmarks cover: single word, multi-word, no results, partial match, 500-topic catalog, 1,000-topic catalog, `Add` indexing throughput, and `tokenize` throughput.
- Baseline benchmark (Ryzen 9 9950X): ~745 µs/op, ~392 KB/op, 4,114 allocs/op for a 150-topic corpus.
- `go vet ./...` clean.

---

## Phase 1 — Search Improvements

Completed: 20 February 2026 (Charon)
Commits:
- `3e91510bcf1c3553849a46440253bb572a809e9e` — `feat(search): add fuzzy matching, phrase search, and improved scoring`
- `fc758a832b89b98a19083fcfb7c3e3d4c31a314c` — `feat(search): add English stemmer for improved search recall`

### Fuzzy matching

`levenshtein(a, b string) int` implemented using two-row dynamic programming. Applied to query words of three or more characters. Maximum edit distance of 2. Fuzzy matches score at `scoreFuzzyWord = 0.3`, below prefix (`0.5`) and exact (`1.0`). Words already matched as exact or prefix are not double-counted.

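A two-row Levenshtein distance can be sketched roughly as follows. This is an illustrative version rather than the code in the repository: the helper `fuzzyMatch` and the constants `minFuzzyLen` and `maxFuzzyDistance` are hypothetical names, and counting "characters" as runes is an assumption.

```go
package main

import "fmt"

// Thresholds taken from the description above; the names are hypothetical.
const (
	minFuzzyLen      = 3 // shorter query words are never fuzzy-matched
	maxFuzzyDistance = 2 // largest edit distance that still counts
)

// levenshtein returns the edit distance between a and b. Only two rows of
// the dynamic-programming table are kept: the previous one and the current one.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j // distance from the empty prefix of a
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			best := prev[j-1] + cost // substitution (or match)
			if d := prev[j] + 1; d < best {
				best = d // deletion
			}
			if d := curr[j-1] + 1; d < best {
				best = d // insertion
			}
			curr[j] = best
		}
		prev, curr = curr, prev // the current row becomes the previous row
	}
	return prev[len(rb)]
}

// fuzzyMatch reports whether an indexed word is within the allowed edit
// distance of a query word. Callers are assumed to have tried exact and
// prefix matching first, so that a word is not counted twice.
func fuzzyMatch(queryWord, indexedWord string) bool {
	if len([]rune(queryWord)) < minFuzzyLen {
		return false
	}
	return levenshtein(queryWord, indexedWord) <= maxFuzzyDistance
}

func main() {
	fmt.Println(levenshtein("kitten", "sitting")) // 3
	fmt.Println(fuzzyMatch("serach", "search"))   // true
	fmt.Println(fuzzyMatch("ab", "abc"))          // false: query word too short
}
```

Keeping two rows instead of the full matrix reduces memory from O(len(a)·len(b)) to O(len(b)), which matters when the function runs against every indexed word.
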
### Phrase search

`extractPhrases(query)` extracts `"quoted strings"` from the query before tokenisation. Phrase matching scans the concatenated title, content, and all section content of every candidate topic. Matched phrases boost by `scorePhraseBoost = 8.0`. Phrase terms are also compiled as regexes for snippet highlighting. Empty quotes and whitespace-only quoted strings are ignored.

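The phrase-extraction step might look something like the sketch below. The source only names `extractPhrases(query)`; the two return values, the regular expression, and the handling of the remaining text are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// quoted matches a double-quoted run with no embedded quote characters.
var quoted = regexp.MustCompile(`"([^"]*)"`)

// extractPhrases pulls "quoted strings" out of query and returns them along
// with the remaining unquoted text. Empty and whitespace-only quotes are
// dropped. The return shape is an assumption, not the project's signature.
func extractPhrases(query string) (phrases []string, rest string) {
	for _, m := range quoted.FindAllStringSubmatch(query, -1) {
		if p := strings.TrimSpace(m[1]); p != "" {
			phrases = append(phrases, p)
		}
	}
	// Remove the quoted parts and collapse the leftover whitespace.
	rest = strings.Join(strings.Fields(quoted.ReplaceAllString(query, " ")), " ")
	return phrases, rest
}

func main() {
	phrases, rest := extractPhrases(`push "all repos" "" "  " now`)
	fmt.Printf("%q %q\n", phrases, rest) // ["all repos"] "push now"
}
```

An unbalanced quote simply fails to match, so the query falls through to ordinary tokenisation unchanged.
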
### English stemmer

`stemmer.go` added. Lightweight Porter-style suffix stripping in two passes (inflectional then derivational). Words under four characters are unchanged; results are always at least two characters. The stemmer is intentionally not the full Porter algorithm — only the highest-impact rules for a help catalog are included.

`tokenize()` modified to emit both the raw word and its stem (if they differ). The search index therefore contains both forms. When searching, query words are also stemmed. Stem-only matches score at `scoreStemWord = 0.7`.

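As a rough illustration of the "raw word plus stem" indexing described above, consider the sketch below. The suffix lists are placeholders chosen for the example and are not the rule set in `stemmer.go`; the helper `stripFirst` is hypothetical, and the code assumes lower-case ASCII English words.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stripFirst removes the first matching suffix, provided at least two
// characters remain. Hypothetical helper for this sketch only.
func stripFirst(word string, suffixes ...string) string {
	for _, suf := range suffixes {
		if strings.HasSuffix(word, suf) && len(word)-len(suf) >= 2 {
			return strings.TrimSuffix(word, suf)
		}
	}
	return word
}

// stem applies two passes of suffix stripping: inflectional, then
// derivational. The suffixes are illustrative, not the project's rules.
func stem(word string) string {
	if len(word) < 4 {
		return word // words under four characters are left unchanged
	}
	s := stripFirst(word, "ing", "ed", "es", "s")    // inflectional pass
	s = stripFirst(s, "ation", "ment", "ness", "ly") // derivational pass
	return s
}

// tokenize lower-cases text, splits on non-alphanumeric characters, and
// emits each word followed by its stem when the two differ.
func tokenize(text string) []string {
	words := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r) && !unicode.IsNumber(r)
	})
	var out []string
	for _, w := range words {
		out = append(out, w)
		if s := stem(w); s != w {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	fmt.Println(tokenize("Pushing repos")) // [pushing push repos repo]
}
```

Because both forms are indexed, a query for `push` can reach a topic that only mentions `pushing`, while an exact hit on `pushing` can still be scored higher than the stem-only match.
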
### Improved scoring weights

All magic numbers replaced with named constants. New features added:

- `scoreTagBoost = 3.0` — topics with tags matching query words rank higher.
- `scoreAllWords = 2.0` — topics containing all query words receive a multi-word bonus.

### API compatibility

`Search(query string) []*SearchResult` signature unchanged. All existing tests passed without modification.

---

## Phase 2 — HTTP Server, Rendering, Static Site, and CLI Ingestion

Completed: 20 February 2026 (Charon)
Commit: `944cad006b4920c3406d170fe42d006533483d59`
Message: `feat(help): Phase 2 — HTTP server, rendering, static site generator, CLI ingestion`

### Markdown rendering (`render.go`)

`RenderMarkdown` added using `github.com/yuin/goldmark` with GFM, Typographer, and `html.WithUnsafe()`. Returns an HTML fragment.

### HTTP server (`server.go`)

Six-route HTTP server. Implements `http.Handler`. Routes registered at construction. HTML and JSON routes share the same catalog; JSON routes return raw `Topic` structs without pre-rendered HTML.

### Templates (`templates.go`, `templates/`)

Five embedded templates with dark theme CSS. Template functions: `renderMarkdown`, `truncate`, `pluralise`, `multiply`, `sub`. Topics grouped by first tag on the index page, sorted by Order then Title within each group.

### Static site generator (`generate.go`)

`Generate` writes five output artefacts. Client-side search uses inline JavaScript that loads `search-index.json` (content truncated to 500 runes per topic) and scores results with title (+10), content (+1), and tag (+3) weights. All DOM insertion uses safe methods (`textContent`, `createElement`) — no `innerHTML` with user-supplied strings.

### CLI ingestion (`ingest.go`)

`ParseHelpText` converts Go flag-style and Cobra-style help output to a `Topic`. `IngestCLIHelp` batch-ingests a map of command names to help texts.

---

## Known Limitations

### Search

- Phrase search does not apply title or section boosts; it uses a flat `+8.0` regardless of where the phrase appears.
- Stemmer covers English only. Non-English catalog content will not stem correctly but will still match via exact and prefix routes.
- The stemmer does not implement the full Porter algorithm. Some irregular forms (e.g. `ran` → `run`) are not handled.
- Tag boost applies once per matching tag, and a single query word can match several tags on the same topic. The tag score therefore grows with the number of matching tags and is not capped.
- The inverted index has no removal operation. Topics added via `Add` cannot be retracted without rebuilding the catalog.

### HTTP server

- No request timeout is configured in `ListenAndServe`. Callers embedding the server in a larger application should wrap it in an `http.Server` with appropriate `ReadTimeout` and `WriteTimeout` values.
- Template parsing occurs on every request (`parseTemplates` is called per handler invocation). For high-throughput deployments, templates should be pre-parsed and cached.
- No access logging is provided.

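For the timeout limitation above, a caller could wrap any `http.Handler` in an `http.Server` along these lines. The function name `newHelpHTTPServer` and the timeout values are illustrative assumptions and should be tuned per deployment.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// newHelpHTTPServer wraps handler in an http.Server with explicit timeouts.
// The name and the durations are assumptions made for this sketch.
func newHelpHTTPServer(addr string, handler http.Handler) *http.Server {
	return &http.Server{
		Addr:              addr,
		Handler:           handler,
		ReadHeaderTimeout: 5 * time.Second,  // limits slow header delivery
		ReadTimeout:       10 * time.Second, // whole request, including body
		WriteTimeout:      30 * time.Second, // time allowed to write the response
		IdleTimeout:       60 * time.Second, // keep-alive connections
	}
}

func main() {
	// http.NotFoundHandler stands in for the help server's handler here.
	srv := newHelpHTTPServer("127.0.0.1:8080", http.NotFoundHandler())
	fmt.Println(srv.Addr, srv.ReadTimeout, srv.WriteTimeout)
	// In a real program: log.Fatal(srv.ListenAndServe())
}
```

Since the help server implements `http.Handler`, it can be passed directly as the `handler` argument.
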
### Static site generator

- Client-side search is a simple substring match. It does not implement stemming, fuzzy matching, or phrase search — only the server-side search has those capabilities.
- The `search-index.json` content field is truncated to 500 runes per topic, which may omit relevant content in long topics.

### General

- The `Catalog` is not safe for concurrent writes. Multiple goroutines calling `Add` concurrently will cause a data race. Reads via `List`, `Get`, and `Search` are safe once the catalog is fully populated, provided no concurrent `Add` calls occur.

---

## Future Considerations

### Phase 3 — AI-assisted search

The outstanding phase from the original roadmap:

- Embed help topics into go-rag collections for semantic search.
- Add vector similarity as a fallback when keyword search returns no results.
- Support natural language queries such as "how do I push all repos?".

This requires a dependency on `forge.lthn.ai/core/go-rag` and an Ollama or compatible embedding backend.

### Concurrency safety

Add a `sync.RWMutex` to `Catalog` to allow safe concurrent reads alongside occasional writes. This is a straightforward change with no API surface impact.

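A minimal sketch of the locking pattern, using simplified stand-in types: the real `Topic` and `Catalog` carry more fields (including the search index), and the method signatures here are assumptions rather than the package's API. `addConcurrently` is a hypothetical helper for the demonstration.

```go
package main

import (
	"fmt"
	"sync"
)

// Topic and Catalog are simplified stand-ins for the real types.
type Topic struct {
	ID    string
	Title string
}

type Catalog struct {
	mu     sync.RWMutex
	topics map[string]*Topic
}

func NewCatalog() *Catalog {
	return &Catalog{topics: make(map[string]*Topic)}
}

// Add takes the write lock, so concurrent writers are serialised.
func (c *Catalog) Add(t *Topic) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.topics[t.ID] = t
}

// Get takes the read lock, so many readers can proceed in parallel.
func (c *Catalog) Get(id string) (*Topic, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	t, ok := c.topics[id]
	return t, ok
}

// List copies the topics under the read lock so callers never see the map.
func (c *Catalog) List() []*Topic {
	c.mu.RLock()
	defer c.mu.RUnlock()
	out := make([]*Topic, 0, len(c.topics))
	for _, t := range c.topics {
		out = append(out, t)
	}
	return out
}

// addConcurrently adds n topics from n goroutines (demo helper).
func addConcurrently(c *Catalog, n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			c.Add(&Topic{ID: fmt.Sprintf("topic-%d", i), Title: "Demo"})
		}(i)
	}
	wg.Wait()
}

func main() {
	c := NewCatalog()
	addConcurrently(c, 100)
	fmt.Println(len(c.List())) // 100
}
```

In the real catalog the write lock would also need to cover the inverted-index update performed by `Add`, and `Search` would hold the read lock for the duration of the query.
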
### Template caching

Pre-parse all templates once at `NewServer` construction time and cache the parsed `*template.Template` values. This removes per-request parsing overhead.

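The idea can be sketched as below. The `Server` type, the zero-argument `NewServer`, the inline `indexTmpl` string, and the `render` helper are all stand-ins invented for the example; the real server parses embedded template files and registers template functions.

```go
package main

import (
	"fmt"
	"html/template"
	"net/http"
	"net/http/httptest"
)

// indexTmpl stands in for the embedded template files.
const indexTmpl = `{{define "index"}}<h1>{{.Title}}</h1>{{end}}`

// Server holds the templates parsed once at construction time.
type Server struct {
	tmpl *template.Template
}

// NewServer parses every template up front, so a malformed template is
// reported at start-up instead of on the first request.
func NewServer() (*Server, error) {
	t, err := template.New("help").Parse(indexTmpl)
	if err != nil {
		return nil, fmt.Errorf("parse templates: %w", err)
	}
	return &Server{tmpl: t}, nil
}

// ServeHTTP only executes the cached template; nothing is parsed here.
func (s *Server) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/html; charset=utf-8")
	data := struct{ Title string }{Title: "Help"}
	if err := s.tmpl.ExecuteTemplate(w, "index", data); err != nil {
		http.Error(w, "template error", http.StatusInternalServerError)
	}
}

// render issues one in-memory request and returns the response body.
func render(s *Server) string {
	rec := httptest.NewRecorder()
	s.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/", nil))
	return rec.Body.String()
}

func main() {
	s, err := NewServer()
	if err != nil {
		panic(err)
	}
	fmt.Println(render(s)) // <h1>Help</h1>
}
```

A parsed `*template.Template` is safe to execute from multiple goroutines, so the cached value can be shared across requests without extra locking.
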
### Pagination

The `/api/topics` endpoint returns all topics in a single response. For large catalogs (thousands of topics), cursor-based or offset pagination should be added.

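Offset pagination could be sketched as follows. The `Page` type, the default and maximum limits, and the use of plain strings in place of `Topic` values are assumptions for illustration; nothing here reflects an existing response format.

```go
package main

import "fmt"

// Limits are illustrative assumptions.
const (
	defaultLimit = 50
	maxLimit     = 200
)

// Page is a hypothetical response envelope for a paginated topic list.
type Page struct {
	Items  []string `json:"items"`
	Total  int      `json:"total"`
	Offset int      `json:"offset"`
	Limit  int      `json:"limit"`
}

// paginate returns the slice of all that starts at offset and holds at
// most limit items, clamping out-of-range arguments instead of failing.
func paginate(all []string, offset, limit int) Page {
	if limit <= 0 {
		limit = defaultLimit
	}
	if limit > maxLimit {
		limit = maxLimit
	}
	if offset < 0 {
		offset = 0
	}
	if offset > len(all) {
		offset = len(all)
	}
	end := offset + limit
	if end > len(all) {
		end = len(all)
	}
	return Page{Items: all[offset:end], Total: len(all), Offset: offset, Limit: limit}
}

func main() {
	ids := []string{"a", "b", "c", "d", "e"}
	fmt.Printf("%+v\n", paginate(ids, 3, 2))
}
```

Offset pagination is the simpler option while the catalog is append-only. A cursor keyed on topic ID would stay stable if topics could later be inserted or removed between requests.
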
### YAML file loading

The current `Catalog` has no `LoadDir` or `LoadFile` function; topics are added programmatically via `Add` or via `ParseTopic`. A convenience function to load all `*.md` files from a directory would simplify catalog population for file-backed deployments.

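One possible shape for such a helper is sketched below. Because the exact signatures of `ParseTopic` and `Add` are not shown here, the sketch takes a callback instead of calling them directly; `loadDir` and `demoDir` are hypothetical names.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
)

// loadDir reads every *.md file directly inside dir (no recursion) and hands
// the file name and contents to add. It returns how many files were loaded.
// In a real catalog, add would parse the document and call Catalog.Add.
func loadDir(dir string, add func(name string, content []byte) error) (int, error) {
	paths, err := filepath.Glob(filepath.Join(dir, "*.md"))
	if err != nil {
		return 0, err
	}
	sort.Strings(paths) // deterministic load order
	n := 0
	for _, p := range paths {
		data, err := os.ReadFile(p)
		if err != nil {
			return n, fmt.Errorf("read %s: %w", p, err)
		}
		if err := add(filepath.Base(p), data); err != nil {
			return n, fmt.Errorf("add %s: %w", p, err)
		}
		n++
	}
	return n, nil
}

// demoDir creates a temporary directory with two Markdown files and one
// unrelated file, and returns a cleanup function (demo helper).
func demoDir() (string, func(), error) {
	dir, err := os.MkdirTemp("", "go-help-demo")
	if err != nil {
		return "", func() {}, err
	}
	cleanup := func() { os.RemoveAll(dir) }
	files := map[string]string{
		"b.md":      "# B\n",
		"a.md":      "# A\n",
		"notes.txt": "ignored\n",
	}
	for name, body := range files {
		if err := os.WriteFile(filepath.Join(dir, name), []byte(body), 0o644); err != nil {
			cleanup()
			return "", func() {}, err
		}
	}
	return dir, cleanup, nil
}

func main() {
	dir, cleanup, err := demoDir()
	if err != nil {
		panic(err)
	}
	defer cleanup()
	n, err := loadDir(dir, func(name string, content []byte) error {
		fmt.Printf("%s (%d bytes)\n", name, len(content))
		return nil
	})
	fmt.Println(n, err)
}
```

`filepath.Glob` treats characters such as `*`, `?`, and `[` in `dir` as pattern syntax, so a production version might prefer `os.ReadDir` with an explicit extension check.
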