Chunking

Overview

chunk.go splits documents into semantic chunks suitable for vector embedding. Chunks preserve context while staying within embedding model token limits.

Algorithm

Split by headers: Markdown headers (#, ##, etc.) create natural chunk boundaries
Split by paragraphs: Within each section, every paragraph becomes a separate chunk
Size enforcement: Chunks exceeding the token limit are split at sentence boundaries
Overlap: Configurable overlap between chunks preserves cross-boundary context
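
The four steps above can be sketched roughly as follows. This is a minimal illustration, not the code in chunk.go: the function names (splitSections, splitParagraphs, enforceSize, addOverlap, chunkDocument) are hypothetical, tokens are approximated by whitespace-separated words, sentence boundaries are detected naively as a period followed by a space, and overlap is measured in words.

```go
package main

import (
	"fmt"
	"strings"
)

// countTokens approximates tokens as whitespace-separated words (assumption).
func countTokens(s string) int { return len(strings.Fields(s)) }

// splitSections starts a new section at every line beginning with '#'.
// Assumption: no '#' lines appear inside code blocks.
func splitSections(doc string) []string {
	var sections, cur []string
	for _, line := range strings.Split(doc, "\n") {
		if strings.HasPrefix(line, "#") && len(cur) > 0 {
			sections = append(sections, strings.Join(cur, "\n"))
			cur = nil
		}
		cur = append(cur, line)
	}
	if len(cur) > 0 {
		sections = append(sections, strings.Join(cur, "\n"))
	}
	return sections
}

// splitParagraphs splits a section on blank lines. A header followed by a
// blank line therefore becomes its own short chunk in this sketch.
func splitParagraphs(section string) []string {
	var out []string
	for _, p := range strings.Split(section, "\n\n") {
		if p = strings.TrimSpace(p); p != "" {
			out = append(out, p)
		}
	}
	return out
}

// enforceSize packs sentences greedily into pieces of at most maxTokens.
// A single sentence longer than the limit is left oversized.
func enforceSize(p string, maxTokens int) []string {
	var out []string
	cur := ""
	for _, s := range strings.SplitAfter(strings.TrimSpace(p), ". ") {
		s = strings.TrimSpace(s)
		if cur != "" && countTokens(cur)+countTokens(s) > maxTokens {
			out = append(out, cur)
			cur = ""
		}
		if cur != "" {
			cur += " "
		}
		cur += s
	}
	if cur != "" {
		out = append(out, cur)
	}
	return out
}

// addOverlap prefixes each chunk with the last n words of its predecessor.
// The prefix is not counted against the token limit here.
func addOverlap(chunks []string, n int) []string {
	out := make([]string, len(chunks))
	for i, c := range chunks {
		out[i] = c
		if i == 0 || n <= 0 {
			continue
		}
		prev := strings.Fields(chunks[i-1])
		if len(prev) > n {
			prev = prev[len(prev)-n:]
		}
		out[i] = strings.Join(prev, " ") + " " + c
	}
	return out
}

// chunkDocument runs the four steps in order.
func chunkDocument(doc string, maxTokens, overlap int) []string {
	var chunks []string
	for _, sec := range splitSections(doc) {
		for _, p := range splitParagraphs(sec) {
			chunks = append(chunks, enforceSize(p, maxTokens)...)
		}
	}
	return addOverlap(chunks, overlap)
}

func main() {
	doc := "# Intro\n\nFirst paragraph here.\n\n## Details\n\nOne two three. Four five six. Seven eight nine."
	for i, c := range chunkDocument(doc, 6, 2) {
		fmt.Printf("%d: %q\n", i, c)
	}
}
```

Applying overlap as a final pass over the finished chunks keeps the splitting steps independent of it, at the cost of letting a chunk grow slightly past the limit. A real implementation would likely use the embedding model's own tokenizer and budget for the overlap up front.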

Category Detection

Each chunk is categorised for filtering during search:

code: Fenced code blocks, inline code
documentation: Prose, explanations, descriptions
configuration: YAML, JSON, TOML blocks
api: Function signatures, endpoint definitions
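
One way such a classifier might look is sketched below. The detection rules, their order, and the name detectCategory are assumptions for illustration; the source does not say how chunk.go decides between categories, and the api markers in particular (a Go func keyword, HTTP verbs followed by a path) are only examples.

```go
package main

import (
	"fmt"
	"strings"
)

// fence is the Markdown code-fence marker (three backticks).
var fence = strings.Repeat("`", 3)

// detectCategory guesses a category from simple textual cues.
// The rules and their precedence are hypothetical.
func detectCategory(content string) string {
	trimmed := strings.TrimSpace(content)

	if strings.HasPrefix(trimmed, fence) {
		// The text after the opening fence names the block language.
		firstLine := strings.SplitN(trimmed, "\n", 2)[0]
		lang := strings.ToLower(strings.TrimSpace(strings.TrimPrefix(firstLine, fence)))
		switch lang {
		case "yaml", "yml", "json", "toml":
			return "configuration"
		}
		return "code"
	}

	// Example markers for function signatures and endpoint definitions.
	for _, marker := range []string{"func ", "GET /", "POST /", "PUT /", "DELETE /"} {
		if strings.Contains(trimmed, marker) {
			return "api"
		}
	}

	// Inline code spans count as code, per the category list above.
	if strings.Contains(trimmed, "`") {
		return "code"
	}
	return "documentation"
}

func main() {
	samples := []string{
		fence + "yaml\nport: 8080\n" + fence,
		fence + "go\nx := 1\n" + fence,
		"GET /users returns all users",
		"Run `make build` first",
		"Chunks preserve context.",
	}
	for _, s := range samples {
		fmt.Printf("%-13s <- %q\n", detectCategory(s), s)
	}
}
```

Checking fenced blocks first keeps a YAML or JSON block from being labelled code merely because it is fenced.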

Chunk Structure

type Chunk struct {
    Content  string            // Chunk text
    Source   string            // Origin file path
    Category string            // code, documentation, configuration, or api
    Index    int               // Position in source document
    Metadata map[string]string // Additional key/value attributes
}
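
As an illustration of how the Category field could support filtering during search, here is a small sketch. The helper filterByCategory, the sample values, and the "format" metadata key are hypothetical and not taken from chunk.go.

```go
package main

import "fmt"

// Chunk mirrors the struct shown above.
type Chunk struct {
	Content  string
	Source   string
	Category string
	Index    int
	Metadata map[string]string
}

// filterByCategory returns the chunks whose Category matches, in order.
// Hypothetical helper for illustration.
func filterByCategory(chunks []Chunk, category string) []Chunk {
	var out []Chunk
	for _, c := range chunks {
		if c.Category == category {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	chunks := []Chunk{
		{Content: "Install with go get.", Source: "README.md", Category: "documentation", Index: 0},
		{
			Content:  "port: 8080",
			Source:   "README.md",
			Category: "configuration",
			Index:    1,
			Metadata: map[string]string{"format": "yaml"}, // hypothetical key
		},
	}
	for _, c := range filterByCategory(chunks, "configuration") {
		fmt.Printf("%s[%d]: %s\n", c.Source, c.Index, c.Content)
	}
}
```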

Testing

chunk_test.go covers:

Markdown header splitting
Paragraph boundary detection
Category classification
Edge cases (empty input, single-line documents)