go-rag/docs/architecture.md
Snider ce4e311b54 docs: graduate TODO/FINDINGS into production documentation
Replace internal task tracking (TODO.md, FINDINGS.md) with structured
documentation in docs/. Trim CLAUDE.md to agent instructions only.

Co-Authored-By: Virgil <virgil@lethean.io>
2026-02-20 15:01:55 +00:00

go-rag Architecture

Module: forge.lthn.ai/core/go-rag

Overview

go-rag is a Retrieval-Augmented Generation library for Go. It provides document chunking, embedding generation via Ollama, vector storage and search via Qdrant, and formatted context retrieval suitable for injection into LLM prompts. The library is designed around two core interfaces — Embedder and VectorStore — that decouple business logic from external service implementations.

Package Layout

| File | Purpose |
|------|---------|
| embedder.go | Embedder interface definition |
| vectorstore.go | VectorStore interface + CollectionInfo struct |
| chunk.go | Markdown chunking — sections, paragraphs, sentences, overlap |
| ollama.go | OllamaClient — implements Embedder via Ollama HTTP API |
| qdrant.go | QdrantClient — implements VectorStore via Qdrant gRPC |
| ingest.go | Ingestion pipeline — read files, chunk, embed, batch upsert |
| query.go | Query pipeline — embed query, search, threshold filter, format results |
| keyword.go | Keyword boosting post-filter for re-ranking search results |
| collections.go | Package-level collection management helpers |
| helpers.go | Convenience wrappers — *With variants and default-client functions |

Core Interfaces

Embedder

type Embedder interface {
    Embed(ctx context.Context, text string) ([]float32, error)
    EmbedBatch(ctx context.Context, texts []string) ([][]float32, error)
    EmbedDimension() uint64
}

OllamaClient satisfies this interface. The interface enables mock-based testing without a live Ollama instance.

VectorStore

type VectorStore interface {
    CreateCollection(ctx context.Context, name string, vectorSize uint64) error
    CollectionExists(ctx context.Context, name string) (bool, error)
    DeleteCollection(ctx context.Context, name string) error
    ListCollections(ctx context.Context) ([]string, error)
    CollectionInfo(ctx context.Context, name string) (*CollectionInfo, error)
    UpsertPoints(ctx context.Context, collection string, points []Point) error
    Search(ctx context.Context, collection string, vector []float32, limit uint64, filter map[string]string) ([]SearchResult, error)
}

QdrantClient satisfies this interface. CollectionInfo is backend-agnostic (name, point count, vector size, status string).

Qdrant Client

QdrantClient wraps the official github.com/qdrant/go-client gRPC library.

Connection: gRPC on port 6334 (default). Supports TLS and API key authentication.

Collection creation: Uses cosine distance metric (qdrant.Distance_Cosine). Vector dimensionality is derived from the configured embedding model via Embedder.EmbedDimension().

Point IDs: Qdrant requires valid UUIDs. Point IDs are generated by ChunkID() using MD5 of "path:index:text_prefix", producing 32-character hex strings that Qdrant accepts as UUIDs.
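The ID scheme can be sketched as follows. chunkID here is an illustrative reimplementation of the described behaviour, not the library's exact code:

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// chunkID sketches the scheme described above: MD5 of
// "path:index:text_prefix" (first 100 runes of the chunk text),
// yielding a 32-character hex string that Qdrant accepts as a UUID.
func chunkID(path string, index int, text string) string {
	prefix := []rune(text)
	if len(prefix) > 100 {
		prefix = prefix[:100]
	}
	sum := md5.Sum([]byte(fmt.Sprintf("%s:%d:%s", path, index, string(prefix))))
	return hex.EncodeToString(sum[:])
}

func main() {
	id := chunkID("docs/readme.md", 0, "go-rag is a RAG library")
	fmt.Println(len(id)) // 32
}
```

Because the hash input is fully determined by path, index, and text, re-ingesting unchanged files produces the same point IDs, which is what makes upserts idempotent.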

Search: Uses Qdrant's QueryPoints API. Payload filters are expressed as Must conditions (logical AND). Results include the similarity score and full payload.

Payload conversion: Qdrant payloads are protobuf Value types. The valueToGo function converts these to native Go types (string, int64, float64, bool, []any, map[string]any).

Version mismatch: The client library (v1.16.2) logs a benign warning when connecting to Qdrant v1.16.3. All operations function correctly.

Ollama Embedding Client

OllamaClient wraps the github.com/ollama/ollama/api HTTP client.

Connection: HTTP on port 11434 (default), 30-second timeout.

Embedding: Calls /api/embed. The Ollama API returns float64 values; these are converted to float32 for Qdrant compatibility.
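The float64-to-float32 narrowing is a simple element-wise conversion; a minimal sketch (toFloat32 is an illustrative name, not necessarily the library's):

```go
package main

import "fmt"

// toFloat32 converts the float64 values returned by Ollama's /api/embed
// into the float32 slice Qdrant expects.
func toFloat32(in []float64) []float32 {
	out := make([]float32, len(in))
	for i, v := range in {
		out[i] = float32(v)
	}
	return out
}

func main() {
	fmt.Println(toFloat32([]float64{0.25, -1.5})) // [0.25 -1.5]
}
```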

Batch embedding: EmbedBatch is sequential — it calls Embed in a loop. Ollama has no native batch API endpoint. Batch throughput equals single-embed throughput.

Supported models and dimensions:

| Model | Dimensions |
|-------|------------|
| nomic-embed-text (default) | 768 |
| mxbai-embed-large | 1024 |
| all-minilm | 384 |
| (unknown) | 768 (fallback) |

Determinism: nomic-embed-text produces bit-identical float32 vectors for identical input text, which makes ingest operations idempotent.

Empty strings: Ollama accepts empty string input and returns a valid zero-padded vector without error. This behaviour is Ollama-specific.

Markdown Chunking

ChunkMarkdown(text string, cfg ChunkConfig) []Chunk is the primary chunking function.

ChunkConfig:

type ChunkConfig struct {
    Size    int  // Target characters per chunk (default 500)
    Overlap int  // Overlap in runes between adjacent chunks (default 50)
}

Three-level splitting strategy:

  1. Section split — Text is first split at ## header boundaries. The header line is preserved with its section content.

  2. Paragraph split — Sections larger than Size are split at double-newline paragraph boundaries. Multiple consecutive newlines are normalised to double-newlines.

  3. Sentence split — Paragraphs that individually exceed Size are split at sentence boundaries (".", "?", "!"). Sentence splitting is applied before paragraph accumulation to avoid oversized chunks. When no sentence boundaries exist, the oversized paragraph is added as-is.

Overlap: When a chunk boundary is crossed, the new chunk begins with the trailing Overlap runes of the previous chunk. The overlap start point is aligned to the nearest word boundary (first space within the overlap slice) to avoid splitting mid-word. Overlap is rune-safe; UTF-8 multi-byte characters are handled correctly.
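The overlap rule can be sketched as below; overlapTail is an illustrative helper name (assumed, not the library's), showing the trailing-runes slice and the word-boundary alignment:

```go
package main

import "fmt"

// overlapTail sketches the overlap rule described above: take the
// trailing n runes of the previous chunk, then advance past the first
// space so the next chunk does not begin mid-word. Operating on runes
// keeps UTF-8 multi-byte characters intact.
func overlapTail(prev string, n int) string {
	r := []rune(prev)
	if len(r) <= n {
		return prev
	}
	tail := r[len(r)-n:]
	// Align the overlap start to the first space inside the slice.
	for i, c := range tail {
		if c == ' ' {
			return string(tail[i+1:])
		}
	}
	return string(tail)
}

func main() {
	fmt.Println(overlapTail("the quick brown fox jumps", 10)) // fox jumps
}
```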

Chunk identity: Each Chunk struct carries Text, Section (the ## header title), and Index (zero-based global counter across all sections in the document).

ChunkID: Deterministic MD5 hash of "path:index:text_prefix" (first 100 runes of text). Used as the Qdrant point ID.

Category detection: Category(path string) string classifies files by path keywords into categories: ui-component, brand, product-brief, help-doc, task, architecture, documentation. Used as a payload field to enable category-scoped queries.

Accepted file types: .md, .markdown, .txt (checked by ShouldProcess).
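The extension check amounts to a small switch; a sketch (shouldProcess is an illustrative stand-in for the library's ShouldProcess):

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// shouldProcess reports whether a file has one of the accepted
// extensions: .md, .markdown, or .txt.
func shouldProcess(path string) bool {
	switch strings.ToLower(filepath.Ext(path)) {
	case ".md", ".markdown", ".txt":
		return true
	}
	return false
}

func main() {
	fmt.Println(shouldProcess("docs/architecture.md")) // true
	fmt.Println(shouldProcess("main.go"))              // false
}
```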

Ingestion Pipeline

Ingest and IngestFile accept VectorStore and Embedder interfaces.

Directory ingestion (Ingest):

  1. Resolve and validate the source directory.
  2. Check whether the target collection exists. If Recreate is set and the collection exists, delete it first.
  3. Create the collection if it does not exist, using embedder.EmbedDimension() for vector size.
  4. Walk the directory recursively, collecting files matching ShouldProcess.
  5. For each file: read content, call ChunkMarkdown, embed each chunk, build Point structs with payload fields (text, source, section, category, chunk_index).
  6. Batch-upsert points to the vector store in slices of BatchSize (default 100).
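The batching in step 6 can be sketched as follows (upsertInBatches and the minimal Point stand-in are illustrative, not the library's code):

```go
package main

import "fmt"

// Point is a minimal stand-in for the library's Point type.
type Point struct{ ID string }

// upsertInBatches slices the point list into batches of at most
// batchSize and hands each slice to the upsert callback in turn.
func upsertInBatches(points []Point, batchSize int, upsert func([]Point) error) error {
	for start := 0; start < len(points); start += batchSize {
		end := start + batchSize
		if end > len(points) {
			end = len(points)
		}
		if err := upsert(points[start:end]); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	points := make([]Point, 250)
	var sizes []int
	_ = upsertInBatches(points, 100, func(batch []Point) error {
		sizes = append(sizes, len(batch))
		return nil
	})
	fmt.Println(sizes) // [100 100 50]
}
```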

Point payload schema:

| Field | Type | Description |
|-------|------|-------------|
| text | string | Raw chunk text |
| source | string | Relative file path from the ingestion directory root |
| section | string | Markdown section header (may be empty) |
| category | string | Category from Category() path detection |
| chunk_index | int | Chunk position within the document |

Query Pipeline

Query(ctx, store, embedder, query string, cfg QueryConfig) ([]QueryResult, error):

  1. Embed the query text using embedder.Embed.
  2. Construct a payload filter from cfg.Category if set.
  3. Call store.Search with the query vector, limit, and filter.
  4. Discard results below cfg.Threshold (default 0.5).
  5. Deserialise payload fields into QueryResult structs. chunk_index handles int64, float64, and int types to accommodate JSON unmarshalling differences.
  6. Optionally apply keyword boosting when cfg.Keywords is true.
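The type tolerance in step 5 boils down to a type switch; a sketch (asInt is an illustrative helper name, not necessarily the library's):

```go
package main

import "fmt"

// asInt sketches the chunk_index handling described above: the payload
// value may arrive as int64, float64, or int depending on how the
// payload was serialised and decoded.
func asInt(v any) int {
	switch n := v.(type) {
	case int64:
		return int(n)
	case float64:
		return int(n)
	case int:
		return n
	}
	return 0
}

func main() {
	fmt.Println(asInt(int64(3)), asInt(4.0), asInt(5)) // 3 4 5
}
```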

QueryConfig:

type QueryConfig struct {
    Collection string
    Limit      uint64   // Default 5
    Threshold  float32  // Default 0.5
    Category   string   // Payload filter; empty means no filter
    Keywords   bool     // Enable keyword boosting post-filter
}

Keyword Boosting

KeywordFilter(results []QueryResult, keywords []string) []QueryResult re-ranks results after vector search.

Algorithm: For each result, count how many keywords appear (case-insensitive substring match) in the chunk text. Apply a 10% score boost per matching keyword: score *= 1.0 + 0.1 * matchCount. Re-sort by boosted score descending.
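The per-result boost rule can be sketched as below (boostScore is an illustrative helper, not the library's exact function):

```go
package main

import (
	"fmt"
	"strings"
)

// boostScore applies the rule described above: count case-insensitive
// substring matches and boost the score 10% per matching keyword.
func boostScore(score float32, text string, keywords []string) float32 {
	matches := 0
	lower := strings.ToLower(text)
	for _, kw := range keywords {
		if strings.Contains(lower, strings.ToLower(kw)) {
			matches++
		}
	}
	return score * (1.0 + 0.1*float32(matches))
}

func main() {
	// Two of three keywords match, so the score is multiplied by 1.2.
	fmt.Println(boostScore(0.8, "Go vector search library", []string{"vector", "search", "python"}))
}
```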

Keyword extraction: extractKeywords splits the query on whitespace and discards words shorter than 3 characters.

When enabled: Query calls extractKeywords on the query string and passes the result to KeywordFilter after the threshold filter has been applied.

Result Formatting

Three output formats are available:

| Function | Format | Use case |
|----------|--------|----------|
| FormatResultsText | Plain text with score/source headers | Human-readable display |
| FormatResultsContext | XML <retrieved_context> with <document> elements | LLM prompt injection |
| FormatResultsJSON | Hand-crafted JSON array | Structured consumption |

FormatResultsContext applies html.EscapeString to all attribute values and text content to produce well-formed XML safe for embedding in prompts.

Collection Management

Package-level helpers in collections.go delegate to VectorStore:

ListCollections(ctx, store VectorStore) ([]string, error)
DeleteCollection(ctx, store VectorStore, name string) error
CollectionStats(ctx, store VectorStore, name string) (*CollectionInfo, error)

Convenience Helpers

Two tiers of helpers are provided in helpers.go:

Interface-accepting (*With variants) — accept pre-constructed VectorStore and Embedder. Suitable for testing, long-lived processes, and high-throughput use:

QueryWith(ctx, store, embedder, question, collectionName string, topK int) ([]QueryResult, error)
QueryContextWith(ctx, store, embedder, question, collectionName string, topK int) (string, error)
IngestDirWith(ctx, store, embedder, directory, collectionName string, recreate bool) error
IngestFileWith(ctx, store, embedder, filePath, collectionName string) (int, error)

Default-client wrappers — construct new QdrantClient and OllamaClient on each call using DefaultQdrantConfig and DefaultOllamaConfig. Each call opens a new gRPC connection. Suitable for CLI commands and infrequent operations:

QueryDocs(ctx, question, collectionName string, topK int) ([]QueryResult, error)
QueryDocsContext(ctx, question, collectionName string, topK int) (string, error)
IngestDirectory(ctx, directory, collectionName string, recreate bool) error
IngestSingleFile(ctx, filePath, collectionName string) (int, error)

IngestDirectory and IngestSingleFile additionally run HealthCheck on Qdrant and VerifyModel on Ollama before proceeding.

Performance Characteristics

Measured on AMD Ryzen 9 9950X + RX 7800 XT with ROCm, nomic-embed-text (F16):

| Operation | Latency | Throughput |
|-----------|---------|------------|
| Single embed | 10.3 ms | 97/sec |
| Batch embed (10 texts) | 102 ms | 98/sec effective |
| Qdrant search (100 points) | 111 µs | 9,042 QPS |
| Qdrant search (200 points) | 152 µs | 6,580 QPS |
| Chunk 50 sections | 11.2 µs | 89K/sec |
| Chunk 1000 paragraphs | 107 µs | 9.4K/sec |

The embedding step dominates pipeline latency. In a full ingest+query cycle for 5 documents, approximately 95% of elapsed time is in embedding calls. Text length (50 to 2000 characters) has negligible effect on embedding latency because tokenisation and HTTP overhead dominate the GPU compute time (~2ms).

Dependency Graph

go-rag
├── forge.lthn.ai/core/go          (logging — pkg/log)
├── github.com/ollama/ollama        (embedding HTTP client)
├── github.com/qdrant/go-client     (vector DB gRPC client)
└── github.com/stretchr/testify     (test assertions)

Transitive: google.golang.org/grpc, google.golang.org/protobuf, github.com/google/uuid.