Virgil edited this page 2026-02-19 18:37:31 +00:00

Chunking

Overview

chunk.go splits documents into semantic chunks suitable for vector embedding. Chunks preserve context while staying within the embedding model's token limit.

Algorithm

  1. Split by headers: Markdown headers (#, ##, etc.) create natural chunk boundaries
  2. Split by paragraphs: Within each section, every paragraph becomes a separate chunk
  3. Size enforcement: Chunks exceeding the token limit are split at sentence boundaries
  4. Overlap: Configurable overlap between chunks preserves cross-boundary context
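
The four steps above can be sketched as follows. This is a minimal illustration, not the code in chunk.go: every function name here (splitSections, splitParagraphs, enforceLimit, addOverlap, chunkText) is hypothetical, tokens are approximated as whitespace-separated words, and overlap is assumed to mean "repeat the last N words of the previous chunk". The sketch also treats any line starting with # as a header, so a # comment inside a fenced code block would be misread as one.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var headerRe = regexp.MustCompile(`^#{1,6}\s`)      // Markdown header line, e.g. "## Usage"
var blankRe = regexp.MustCompile(`\n\s*\n`)         // paragraph break (blank line)
var sentenceEndRe = regexp.MustCompile(`[.!?]+\s+`) // sentence end followed by whitespace

// countTokens approximates tokens as whitespace-separated words (an assumption).
func countTokens(s string) int { return len(strings.Fields(s)) }

// splitSections starts a new section at each header line (step 1).
func splitSections(doc string) []string {
	var sections, cur []string
	for _, line := range strings.Split(doc, "\n") {
		if headerRe.MatchString(line) && len(cur) > 0 {
			sections = append(sections, strings.Join(cur, "\n"))
			cur = nil
		}
		cur = append(cur, line)
	}
	return append(sections, strings.Join(cur, "\n"))
}

// splitParagraphs splits a section on blank lines and drops empty pieces (step 2).
func splitParagraphs(section string) []string {
	var out []string
	for _, p := range blankRe.Split(section, -1) {
		if p = strings.TrimSpace(p); p != "" {
			out = append(out, p)
		}
	}
	return out
}

// splitSentences cuts text after each run of sentence-ending punctuation.
func splitSentences(text string) []string {
	var out []string
	start := 0
	for _, loc := range sentenceEndRe.FindAllStringIndex(text, -1) {
		out = append(out, strings.TrimSpace(text[start:loc[1]]))
		start = loc[1]
	}
	if rest := strings.TrimSpace(text[start:]); rest != "" {
		out = append(out, rest)
	}
	return out
}

// enforceLimit packs whole sentences into pieces of at most maxTokens (step 3);
// a single sentence longer than the limit is kept whole in this sketch.
func enforceLimit(text string, maxTokens int) []string {
	if countTokens(text) <= maxTokens {
		return []string{text}
	}
	var out, cur []string
	n := 0
	for _, s := range splitSentences(text) {
		t := countTokens(s)
		if n > 0 && n+t > maxTokens {
			out = append(out, strings.Join(cur, " "))
			cur, n = nil, 0
		}
		cur, n = append(cur, s), n+t
	}
	return append(out, strings.Join(cur, " "))
}

// addOverlap prefixes each chunk with the last words of the previous one (step 4).
func addOverlap(chunks []string, overlap int) []string {
	out := append([]string(nil), chunks...)
	for i := 1; i < len(chunks) && overlap > 0; i++ {
		prev := strings.Fields(chunks[i-1])
		if len(prev) > overlap {
			prev = prev[len(prev)-overlap:]
		}
		out[i] = strings.Join(prev, " ") + " " + chunks[i]
	}
	return out
}

// chunkText runs all four steps, reserving part of the budget for the overlap.
func chunkText(doc string, maxTokens, overlap int) []string {
	var chunks []string
	for _, sec := range splitSections(doc) {
		for _, p := range splitParagraphs(sec) {
			chunks = append(chunks, enforceLimit(p, maxTokens-overlap)...)
		}
	}
	return addOverlap(chunks, overlap)
}

func main() {
	fmt.Printf("%q\n", chunkText("# Intro\n\nFirst paragraph. It has two sentences.", 5, 0))
}
```

With a limit of 5 words and no overlap, the sample input yields three chunks: the header line, then one chunk per sentence, because the six-word paragraph exceeds the limit. Passing maxTokens-overlap to the size step is one way to keep the overlapped chunk within the limit; a real tokenizer would replace countTokens.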

Category Detection

Each chunk is categorised for filtering during search:

  • code: Fenced code blocks, inline code
  • documentation: Prose, explanations, descriptions
  • configuration: YAML, JSON, TOML blocks
  • api: Function signatures, endpoint definitions
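
A rule-based classifier for these four categories might look like the sketch below. The function name detectCategory, the regular expressions, and the precedence (configuration fences first, then API patterns, then any code, otherwise documentation) are all assumptions made for illustration; the page does not say how chunk.go resolves a chunk that matches more than one category.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// fence is the Markdown code-fence marker (three backticks).
var fence = strings.Repeat("`", 3)

var (
	// fenceRe captures the optional language tag after an opening fence.
	fenceRe = regexp.MustCompile("(?m)^" + fence + `[ \t]*(\w*)`)
	// apiRe matches a Go-style function signature or an HTTP method followed by a path.
	apiRe = regexp.MustCompile(`(?m)^\s*(func\s+(\([^)]*\)\s*)?\w+\s*\(|(GET|POST|PUT|PATCH|DELETE)\s+/)`)
	// inlineRe matches an inline code span on a single line.
	inlineRe = regexp.MustCompile("`[^`\n]+`")
)

// detectCategory returns one of: configuration, api, code, documentation.
func detectCategory(content string) string {
	// A fenced block tagged with a config format is treated as configuration.
	if m := fenceRe.FindStringSubmatch(content); m != nil {
		switch strings.ToLower(m[1]) {
		case "yaml", "yml", "json", "toml":
			return "configuration"
		}
	}
	if apiRe.MatchString(content) {
		return "api"
	}
	if fenceRe.MatchString(content) || inlineRe.MatchString(content) {
		return "code"
	}
	return "documentation"
}

func main() {
	samples := []string{
		fence + "yaml\nname: demo\n" + fence,
		"POST /v1/chunks creates a chunk",
		"Run `go test` to verify.",
		"Chunks preserve context.",
	}
	for _, s := range samples {
		fmt.Printf("%-13s %q\n", detectCategory(s), s)
	}
}
```

Because the checks run in a fixed order, a fenced Go function is labelled api rather than code in this sketch; swapping the second and third checks would reverse that choice.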

Chunk Structure

type Chunk struct {
    Content  string            // Chunk text
    Source   string            // Origin file path
    Category string            // code, documentation, configuration, or api
    Index    int               // Position in source document
    Metadata map[string]string // Additional key/value attributes
}
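
As an illustration of how the Category field supports filtering during search, the sketch below builds two chunks and selects one category. The struct is copied from above; filterByCategory, the sample file path, and the "format" metadata key are hypothetical and not taken from chunk.go.

```go
package main

import "fmt"

// Chunk mirrors the structure documented on this page.
type Chunk struct {
	Content  string
	Source   string
	Category string
	Index    int
	Metadata map[string]string
}

// filterByCategory returns the chunks whose Category equals the given value,
// preserving their original order. (Hypothetical helper.)
func filterByCategory(chunks []Chunk, category string) []Chunk {
	var out []Chunk
	for _, c := range chunks {
		if c.Category == category {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	chunks := []Chunk{
		{Content: "Install with go get.", Source: "docs/install.md", Category: "documentation", Index: 0},
		{
			Content:  "timeout: 30s",
			Source:   "docs/install.md",
			Category: "configuration",
			Index:    1,
			Metadata: map[string]string{"format": "yaml"}, // illustrative key
		},
	}
	for _, c := range filterByCategory(chunks, "configuration") {
		fmt.Printf("%s[%d] %s: %q\n", c.Source, c.Index, c.Category, c.Content)
	}
}
```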

Testing

chunk_test.go covers:

  • Markdown header splitting
  • Paragraph boundary detection
  • Category classification
  • Edge cases (empty input, single-line documents)
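
The edge cases listed above lend themselves to a table of named inputs and expected outputs. The sketch below shows that shape as a standalone program; splitParagraphs is a hypothetical stand-in for the function under test, and the cases are illustrative rather than copied from chunk_test.go. In a real test file the same table would typically be looped over inside a func TestXxx(t *testing.T), with t.Run per case.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// splitParagraphs is a hypothetical stand-in for the function under test:
// it splits on blank lines and drops empty pieces.
func splitParagraphs(doc string) []string {
	var out []string
	for _, p := range strings.Split(doc, "\n\n") {
		if p = strings.TrimSpace(p); p != "" {
			out = append(out, p)
		}
	}
	return out
}

// testCase pairs an input document with the chunks it should produce.
type testCase struct {
	name  string
	input string
	want  []string
}

var cases = []testCase{
	{"empty input", "", nil},
	{"single line", "just one line", []string{"just one line"}},
	{"two paragraphs", "first\n\nsecond", []string{"first", "second"}},
}

// runCases returns one message per failing case; an empty result means all passed.
func runCases() []string {
	var failures []string
	for _, tc := range cases {
		if got := splitParagraphs(tc.input); !reflect.DeepEqual(got, tc.want) {
			failures = append(failures, fmt.Sprintf("%s: got %q, want %q", tc.name, got, tc.want))
		}
	}
	return failures
}

func main() {
	if failures := runCases(); len(failures) > 0 {
		fmt.Println(strings.Join(failures, "\n"))
		return
	}
	fmt.Println("all cases passed")
}
```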