# go-rag Architecture
Module: `forge.lthn.ai/core/go-rag`
## Overview
go-rag is a Retrieval-Augmented Generation library for Go. It provides document chunking, embedding generation via Ollama, vector storage and search via Qdrant, and formatted context retrieval suitable for injection into LLM prompts. The library is designed around two core interfaces — `Embedder` and `VectorStore` — that decouple business logic from external service implementations.
## Package Layout
| File | Purpose |
|------|---------|
| `embedder.go` | `Embedder` interface definition |
| `vectorstore.go` | `VectorStore` interface + `CollectionInfo` struct |
| `chunk.go` | Markdown chunking — sections, paragraphs, sentences, overlap |
| `ollama.go` | `OllamaClient` — implements `Embedder` via Ollama HTTP API |
| `qdrant.go` | `QdrantClient` — implements `VectorStore` via Qdrant gRPC |
| `ingest.go` | Ingestion pipeline — read files, chunk, embed, batch upsert |
| `query.go` | Query pipeline — embed query, search, threshold filter, format results |
| `keyword.go` | Keyword boosting post-filter for re-ranking search results |
| `collections.go` | Package-level collection management helpers |
| `helpers.go` | Convenience wrappers — `*With` variants and default-client functions |
## Core Interfaces
### Embedder
```go
type Embedder interface {
	Embed(ctx context.Context, text string) ([]float32, error)
	EmbedBatch(ctx context.Context, texts []string) ([][]float32, error)
	EmbedDimension() uint64
}
```
`OllamaClient` satisfies this interface. The interface enables mock-based testing without a live Ollama instance.
### VectorStore
```go
type VectorStore interface {
	CreateCollection(ctx context.Context, name string, vectorSize uint64) error
	CollectionExists(ctx context.Context, name string) (bool, error)
	DeleteCollection(ctx context.Context, name string) error
	ListCollections(ctx context.Context) ([]string, error)
	CollectionInfo(ctx context.Context, name string) (*CollectionInfo, error)
	UpsertPoints(ctx context.Context, collection string, points []Point) error
	Search(ctx context.Context, collection string, vector []float32, limit uint64, filter map[string]string) ([]SearchResult, error)
}
```
`QdrantClient` satisfies this interface. `CollectionInfo` is backend-agnostic (name, point count, vector size, status string).
## Qdrant Client
`QdrantClient` wraps the official `github.com/qdrant/go-client` gRPC library.
**Connection**: gRPC on port 6334 (default). Supports TLS and API key authentication.
**Collection creation**: Uses cosine distance metric (`qdrant.Distance_Cosine`). Vector dimensionality is derived from the configured embedding model via `Embedder.EmbedDimension()`.
**Point IDs**: Qdrant accepts only unsigned integers or UUIDs as point IDs. Point IDs are generated by `ChunkID()` using MD5 of `"path:index:text_prefix"`, producing 32-character hex strings that Qdrant accepts as UUIDs.
**Search**: Uses Qdrant's `QueryPoints` API. Payload filters are expressed as `Must` conditions (logical AND). Results include the similarity score and full payload.
**Payload conversion**: Qdrant payloads are protobuf `Value` types. The `valueToGo` function converts these to native Go types (`string`, `int64`, `float64`, `bool`, `[]any`, `map[string]any`).
**Version mismatch**: The client library (v1.16.2) logs a benign warning when connecting to Qdrant v1.16.3. All operations function correctly.
## Ollama Embedding Client
`OllamaClient` wraps the `github.com/ollama/ollama/api` HTTP client.
**Connection**: HTTP on port 11434 (default), 30-second timeout.
**Embedding**: Calls `/api/embed`. The Ollama API returns `float64` values; these are converted to `float32` for Qdrant compatibility.
**Batch embedding**: `EmbedBatch` is sequential — it calls `Embed` in a loop. Ollama has no native batch API endpoint. Batch throughput equals single-embed throughput.
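Because the Ollama API returns `float64` values, each embedding must be narrowed before upserting to Qdrant. A sketch of the per-element conversion — `toFloat32` is a hypothetical helper; the library performs the equivalent conversion internally:

```go
package main

import "fmt"

// toFloat32 narrows one float64 embedding returned by Ollama's
// /api/embed into the float32 form Qdrant expects.
func toFloat32(v []float64) []float32 {
	out := make([]float32, len(v))
	for i, x := range v {
		out[i] = float32(x)
	}
	return out
}

func main() {
	fmt.Println(toFloat32([]float64{0.25, -1.5})) // [0.25 -1.5]
}
```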
**Supported models and dimensions**:
| Model | Dimensions |
|-------|-----------|
| `nomic-embed-text` (default) | 768 |
| `mxbai-embed-large` | 1024 |
| `all-minilm` | 384 |
| (unknown) | 768 (fallback) |
**Determinism**: `nomic-embed-text` produces bit-identical `float32` vectors for identical input text, which makes ingest operations idempotent.
**Empty strings**: Ollama accepts empty string input and returns a valid zero-padded vector without error. This behaviour is Ollama-specific.
## Markdown Chunking
`ChunkMarkdown(text string, cfg ChunkConfig) []Chunk` is the primary chunking function.
**ChunkConfig**:
```go
type ChunkConfig struct {
	Size    int // Target characters per chunk (default 500)
	Overlap int // Overlap in runes between adjacent chunks (default 50)
}
```
**Three-level splitting strategy**:
1. **Section split** — Text is first split at `## ` header boundaries. The header line is preserved with its section content.
2. **Paragraph split** — Sections larger than `Size` are split at double-newline paragraph boundaries. Multiple consecutive newlines are normalised to double-newlines.
3. **Sentence split** — Paragraphs that individually exceed `Size` are split at sentence boundaries (`. `, `? `, `! `). Sentence splitting is applied before paragraph accumulation to avoid oversized chunks. When no sentence boundaries exist, the oversized paragraph is added as-is.
**Overlap**: When a chunk boundary is crossed, the new chunk begins with the trailing `Overlap` runes of the previous chunk. The overlap start point is aligned to the nearest word boundary (first space within the overlap slice) to avoid splitting mid-word. Overlap is rune-safe; UTF-8 multi-byte characters are handled correctly.
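The word-boundary alignment can be sketched as follows — `overlapTail` is a hypothetical helper approximating the behaviour described above, not the library's actual function:

```go
package main

import (
	"fmt"
	"strings"
)

// overlapTail returns the trailing n runes of prev, advanced past the
// first space so the overlap starts on a word boundary. Converting to
// []rune first keeps the slice safe for multi-byte UTF-8 characters.
func overlapTail(prev string, n int) string {
	r := []rune(prev)
	if len(r) > n {
		r = r[len(r)-n:]
	}
	s := string(r)
	if i := strings.IndexByte(s, ' '); i >= 0 {
		return strings.TrimLeft(s[i:], " ")
	}
	return s
}

func main() {
	fmt.Println(overlapTail("the quick brown fox", 9)) // "fox"
}
```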
**Chunk identity**: Each `Chunk` struct carries `Text`, `Section` (the `## ` header title), and `Index` (zero-based global counter across all sections in the document).
**ChunkID**: Deterministic MD5 hash of `"path:index:text_prefix"` (first 100 runes of text). Used as the Qdrant point ID.
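The ID scheme can be sketched as below — the `chunkID` signature here is illustrative; see `ChunkID` in `chunk.go` for the real one:

```go
package main

import (
	"crypto/md5"
	"fmt"
)

// chunkID hashes "path:index:prefix" with MD5, where prefix is the
// first 100 runes of the chunk text, yielding a deterministic
// 32-character hex string used as the Qdrant point ID.
func chunkID(path string, index int, text string) string {
	r := []rune(text)
	if len(r) > 100 {
		r = r[:100]
	}
	sum := md5.Sum([]byte(fmt.Sprintf("%s:%d:%s", path, index, string(r))))
	return fmt.Sprintf("%x", sum)
}

func main() {
	fmt.Println(len(chunkID("docs/a.md", 0, "hello world"))) // 32
}
```

Determinism here is what makes re-ingesting the same file idempotent: unchanged chunks map to the same point IDs and are overwritten in place.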
**Category detection**: `Category(path string) string` classifies files by path keywords into categories: `ui-component`, `brand`, `product-brief`, `help-doc`, `task`, `architecture`, `documentation`. Used as a payload field to enable category-scoped queries.
**Accepted file types**: `.md`, `.markdown`, `.txt` (checked by `ShouldProcess`).
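The extension check amounts to a small switch — illustrative only; the library's `ShouldProcess` may differ in detail:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// shouldProcess reports whether a path has one of the accepted
// extensions: .md, .markdown, or .txt (case-insensitive).
func shouldProcess(path string) bool {
	switch strings.ToLower(filepath.Ext(path)) {
	case ".md", ".markdown", ".txt":
		return true
	}
	return false
}

func main() {
	fmt.Println(shouldProcess("README.md"), shouldProcess("main.go")) // true false
}
```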
## Ingestion Pipeline
`Ingest` and `IngestFile` accept `VectorStore` and `Embedder` interfaces.
**Directory ingestion** (`Ingest`):
1. Resolve and validate the source directory.
2. Check whether the target collection exists. If `Recreate` is set and the collection exists, delete it first.
3. Create the collection if it does not exist, using `embedder.EmbedDimension()` for vector size.
4. Walk the directory recursively, collecting files matching `ShouldProcess`.
5. For each file: read content, call `ChunkMarkdown`, embed each chunk, build `Point` structs with payload fields (`text`, `source`, `section`, `category`, `chunk_index`).
6. Batch-upsert points to the vector store in slices of `BatchSize` (default 100).
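The batching in step 6 can be sketched as plain slice arithmetic. In this hypothetical sketch, `int` stands in for the library's `Point` type and `upsert` for `VectorStore.UpsertPoints`:

```go
package main

import "fmt"

// upsertInBatches slices points into groups of at most batchSize and
// hands each slice to the upsert function, stopping on the first error.
func upsertInBatches(points []int, batchSize int, upsert func(batch []int) error) error {
	for start := 0; start < len(points); start += batchSize {
		end := start + batchSize
		if end > len(points) {
			end = len(points)
		}
		if err := upsert(points[start:end]); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	var sizes []int
	_ = upsertInBatches(make([]int, 250), 100, func(b []int) error {
		sizes = append(sizes, len(b))
		return nil
	})
	fmt.Println(sizes) // [100 100 50]
}
```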
**Point payload schema**:
| Field | Type | Description |
|-------|------|-------------|
| `text` | string | Raw chunk text |
| `source` | string | Relative file path from the ingestion directory root |
| `section` | string | Markdown section header (may be empty) |
| `category` | string | Category from `Category()` path detection |
| `chunk_index` | int | Chunk position within the document |
## Query Pipeline
`Query(ctx, store, embedder, query string, cfg QueryConfig) ([]QueryResult, error)`:
1. Embed the query text using `embedder.Embed`.
2. Construct a payload filter from `cfg.Category` if set.
3. Call `store.Search` with the query vector, limit, and filter.
4. Discard results below `cfg.Threshold` (default 0.5).
5. Deserialise payload fields into `QueryResult` structs. `chunk_index` handles `int64`, `float64`, and `int` types to accommodate JSON unmarshalling differences.
6. Optionally apply keyword boosting when `cfg.Keywords` is true.
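The type juggling in step 5 comes down to a small switch — `asInt` is a hypothetical helper mirroring the behaviour described, not a library export:

```go
package main

import "fmt"

// asInt normalises a chunk_index payload value, which may arrive as
// int64 (protobuf), float64 (JSON unmarshalling), or int.
func asInt(v any) int {
	switch n := v.(type) {
	case int64:
		return int(n)
	case float64:
		return int(n)
	case int:
		return n
	}
	return 0
}

func main() {
	fmt.Println(asInt(int64(3)), asInt(4.0), asInt(5)) // 3 4 5
}
```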
**QueryConfig**:
```go
type QueryConfig struct {
	Collection string
	Limit      uint64  // Default 5
	Threshold  float32 // Default 0.5
	Category   string  // Payload filter; empty means no filter
	Keywords   bool    // Enable keyword boosting post-filter
}
```
## Keyword Boosting
`KeywordFilter(results []QueryResult, keywords []string) []QueryResult` re-ranks results after vector search.
**Algorithm**: For each result, count how many keywords appear (case-insensitive substring match) in the chunk text. Apply a 10% score boost per matching keyword: `score *= 1.0 + 0.1 * matchCount`. Re-sort by boosted score descending.
**Keyword extraction**: `extractKeywords` splits the query on whitespace and discards words shorter than 3 characters.
**When enabled**: `Query` calls `extractKeywords` on the query string and passes the result to `KeywordFilter` after the threshold filter has been applied.
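The per-result scoring rule can be sketched as follows — `boost` is an illustrative helper; the real `KeywordFilter` also re-sorts the full result slice:

```go
package main

import (
	"fmt"
	"strings"
)

// boost applies the re-ranking rule: a 10% score boost per keyword
// found in the chunk text (case-insensitive substring match).
func boost(score float32, text string, keywords []string) float32 {
	lower := strings.ToLower(text)
	matches := 0
	for _, kw := range keywords {
		if strings.Contains(lower, strings.ToLower(kw)) {
			matches++
		}
	}
	return score * (1.0 + 0.1*float32(matches))
}

func main() {
	// Two of three keywords match, so the score is boosted by ~20%.
	fmt.Println(boost(0.8, "Qdrant vector search", []string{"vector", "search", "ollama"}))
}
```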
## Result Formatting
Three output formats are available:
| Function | Format | Use case |
|----------|--------|----------|
| `FormatResultsText` | Plain text with score/source headers | Human-readable display |
| `FormatResultsContext` | XML `<retrieved_context>` with `<document>` elements | LLM prompt injection |
| `FormatResultsJSON` | Hand-crafted JSON array | Structured consumption |
`FormatResultsContext` applies `html.EscapeString` to all attribute values and text content to produce well-formed XML safe for embedding in prompts.
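A sketch of one escaped `<document>` element — the attribute names here are illustrative, and the real `FormatResultsContext` emits the full `<retrieved_context>` wrapper:

```go
package main

import (
	"fmt"
	"html"
)

// formatDoc renders one <document> element with html.EscapeString
// applied to attribute values and body text, so <, >, &, and quotes
// in chunk content cannot break the surrounding XML.
func formatDoc(source, section, text string) string {
	return fmt.Sprintf(`<document source=%q section=%q>%s</document>`,
		html.EscapeString(source), html.EscapeString(section), html.EscapeString(text))
}

func main() {
	fmt.Println(formatDoc("a.md", "Intro", `1 < 2 & "ok"`))
}
```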
## Collection Management
Package-level helpers in `collections.go` delegate to `VectorStore`:
```go
ListCollections(ctx, store VectorStore) ([]string, error)
DeleteCollection(ctx, store VectorStore, name string) error
CollectionStats(ctx, store VectorStore, name string) (*CollectionInfo, error)
```
## Convenience Helpers
Two tiers of helpers are provided in `helpers.go`:
**Interface-accepting (`*With` variants)** — accept pre-constructed `VectorStore` and `Embedder`. Suitable for testing, long-lived processes, and high-throughput use:
```go
QueryWith(ctx, store, embedder, question, collectionName string, topK int) ([]QueryResult, error)
QueryContextWith(ctx, store, embedder, question, collectionName string, topK int) (string, error)
IngestDirWith(ctx, store, embedder, directory, collectionName string, recreate bool) error
IngestFileWith(ctx, store, embedder, filePath, collectionName string) (int, error)
```
**Default-client wrappers** — construct new `QdrantClient` and `OllamaClient` on each call using `DefaultQdrantConfig` and `DefaultOllamaConfig`. Each call opens a new gRPC connection. Suitable for CLI commands and infrequent operations:
```go
QueryDocs(ctx, question, collectionName string, topK int) ([]QueryResult, error)
QueryDocsContext(ctx, question, collectionName string, topK int) (string, error)
IngestDirectory(ctx, directory, collectionName string, recreate bool) error
IngestSingleFile(ctx, filePath, collectionName string) (int, error)
```
`IngestDirectory` and `IngestSingleFile` additionally run `HealthCheck` on Qdrant and `VerifyModel` on Ollama before proceeding.
## Performance Characteristics
Measured on AMD Ryzen 9 9950X + RX 7800 XT with ROCm, `nomic-embed-text` (F16):
| Operation | Latency | Throughput |
|-----------|---------|------------|
| Single embed | 10.3ms | 97/sec |
| Batch embed (10 texts) | 102ms | 98/sec effective |
| Qdrant search (100 points) | 111µs | 9,042 QPS |
| Qdrant search (200 points) | 152µs | 6,580 QPS |
| Chunk 50 sections | 11.2µs | 89K/sec |
| Chunk 1000 paragraphs | 107µs | 9.4K/sec |
The embedding step dominates pipeline latency. In a full ingest+query cycle for 5 documents, approximately 95% of elapsed time is in embedding calls. Text length (50 to 2000 characters) has negligible effect on embedding latency because tokenisation and HTTP overhead dominate the GPU compute time (~2ms).
## Dependency Graph
```
go-rag
├── forge.lthn.ai/core/go (logging — pkg/log)
├── github.com/ollama/ollama (embedding HTTP client)
├── github.com/qdrant/go-client (vector DB gRPC client)
└── github.com/stretchr/testify (test assertions)
```
Transitive: `google.golang.org/grpc`, `google.golang.org/protobuf`, `github.com/google/uuid`.