Chunking
Virgil edited this page 2026-02-19 18:37:31 +00:00
Overview
chunk.go splits documents into semantic chunks suitable for vector embedding. Chunks preserve context while staying within embedding model token limits.
Algorithm
- Split by headers: Markdown headers (#, ##, etc.) create natural chunk boundaries
- Split by paragraphs: Within sections, paragraphs are separate chunks
- Size enforcement: Chunks exceeding the token limit are split at sentence boundaries
- Overlap: Configurable overlap between chunks preserves cross-boundary context
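The four steps above can be sketched roughly as follows. This is a minimal illustration, not the code in chunk.go: the function names (splitSections, splitParagraphs, splitSentences, enforceSize, chunkDocument) are hypothetical, a whitespace-separated word count stands in for the real token count, and overlap is measured in sentences, which is only one possible interpretation of "configurable overlap".

```go
package main

import "strings"

// splitSections cuts a Markdown document at header lines.
// Assumption: any line starting with '#' is a header; fenced code
// blocks are not treated specially in this sketch.
func splitSections(doc string) []string {
	var sections, cur []string
	flush := func() {
		if s := strings.TrimSpace(strings.Join(cur, "\n")); s != "" {
			sections = append(sections, s)
		}
		cur = nil
	}
	for _, line := range strings.Split(doc, "\n") {
		if strings.HasPrefix(line, "#") {
			flush()
		}
		cur = append(cur, line)
	}
	flush()
	return sections
}

// splitParagraphs separates a section at blank lines ("\n\n").
func splitParagraphs(section string) []string {
	var paras []string
	for _, p := range strings.Split(section, "\n\n") {
		if p = strings.TrimSpace(p); p != "" {
			paras = append(paras, p)
		}
	}
	return paras
}

// splitSentences is deliberately naive: a sentence ends at '.', '!' or '?'
// followed by a space.
func splitSentences(text string) []string {
	var out []string
	start := 0
	for i := 0; i < len(text)-1; i++ {
		if strings.ContainsRune(".!?", rune(text[i])) && text[i+1] == ' ' {
			out = append(out, strings.TrimSpace(text[start:i+1]))
			start = i + 1
		}
	}
	if rest := strings.TrimSpace(text[start:]); rest != "" {
		out = append(out, rest)
	}
	return out
}

// enforceSize returns text unchanged if it has at most maxWords words.
// Otherwise it packs whole sentences into pieces, starting each new piece
// with the last `overlap` sentences of the previous one. The limit applies
// to the piece before overlap is carried over, so a piece may exceed
// maxWords slightly; a single over-long sentence is kept whole.
func enforceSize(text string, maxWords, overlap int) []string {
	if len(strings.Fields(text)) <= maxWords {
		return []string{text}
	}
	var pieces, cur []string
	words := 0
	for _, s := range splitSentences(text) {
		n := len(strings.Fields(s))
		if len(cur) > 0 && words+n > maxWords {
			pieces = append(pieces, strings.Join(cur, " "))
			keep := overlap
			if keep > len(cur) {
				keep = len(cur)
			}
			// Copy the tail so the next piece repeats it as context.
			cur = append([]string(nil), cur[len(cur)-keep:]...)
			words = len(strings.Fields(strings.Join(cur, " ")))
		}
		cur = append(cur, s)
		words += n
	}
	if len(cur) > 0 {
		pieces = append(pieces, strings.Join(cur, " "))
	}
	return pieces
}

// chunkDocument chains the steps: headers, then paragraphs, then size limit.
func chunkDocument(doc string, maxWords, overlap int) []string {
	var chunks []string
	for _, sec := range splitSections(doc) {
		for _, p := range splitParagraphs(sec) {
			chunks = append(chunks, enforceSize(p, maxWords, overlap)...)
		}
	}
	return chunks
}
```

With these assumptions, enforceSize("One two three. Four five six. Seven eight nine.", 6, 1) yields two pieces, and the sentence "Four five six." appears in both, which is the cross-boundary context the overlap step is meant to preserve.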
Category Detection
Each chunk is categorised for filtering during search:
- code: Fenced code blocks, inline code
- documentation: Prose, explanations, descriptions
- configuration: YAML, JSON, TOML blocks
- api: Function signatures, endpoint definitions
Chunk Structure
type Chunk struct {
	Content  string
	Source   string // Origin file path
	Category string // code, documentation, etc.
	Index    int    // Position in source document
	Metadata map[string]string
}
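As a usage illustration, the sketch below repeats the struct and adds a hypothetical helper, filterByCategory, that selects chunks by their Category field, as a search layer might do before ranking. The helper is not part of the documented API; it only shows how the fields could be used.

```go
package main

// Chunk mirrors the struct documented above.
type Chunk struct {
	Content  string
	Source   string // Origin file path
	Category string // code, documentation, etc.
	Index    int    // Position in source document
	Metadata map[string]string
}

// filterByCategory is a hypothetical helper: it returns the chunks whose
// Category equals the requested one, preserving their original order.
func filterByCategory(chunks []Chunk, category string) []Chunk {
	var out []Chunk
	for _, c := range chunks {
		if c.Category == category {
			out = append(out, c)
		}
	}
	return out
}
```

For example, given chunks from one file tagged "documentation" and "code", filterByCategory(chunks, "code") would return only the code chunks, with Index still pointing at each chunk's position in the source document.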
Testing
chunk_test.go covers:
- Markdown header splitting
- Paragraph boundary detection
- Category classification
- Edge cases (empty input, single-line documents)