* feat(cli): wire release command and add installer scripts
- Wire up `core build release` subcommand (was orphaned)
- Wire up `core monitor` command (missing import in full variant)
- Add installer scripts for Unix (.sh) and Windows (.bat)
  - setup: Interactive with variant selection
  - ci: Minimal for CI/CD environments
  - dev: Full development variant
  - go/php/agent: Targeted development variants
- All scripts include security hardening:
  - Secure temp directories (mktemp -d)
  - Architecture validation
  - Version validation after GitHub API call
  - Proper cleanup on exit
  - PowerShell PATH updates on Windows (avoids setx truncation)
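The hardening pattern above can be sketched in Python terms (purely illustrative; the real installers are the shell and batch scripts listed here, and the architecture allow-list below is an assumption, not the scripts' actual list):

```python
import platform
import tempfile

# Assumed allow-list; the authoritative list lives in the installer scripts.
SUPPORTED_ARCHES = {"x86_64", "amd64", "arm64", "aarch64"}

def validate_arch() -> str:
    """Refuse to proceed on an architecture we ship no binaries for."""
    arch = platform.machine().lower()
    if arch not in SUPPORTED_ARCHES:
        raise RuntimeError(f"unsupported architecture: {arch}")
    return arch

def run_install(steps) -> None:
    """Run install steps inside a private temp dir (the mktemp -d analogue).

    TemporaryDirectory removes the directory on exit even if a step raises,
    mirroring the scripts' cleanup-on-exit trap.
    """
    with tempfile.TemporaryDirectory(prefix="core-install-") as work_dir:
        steps(work_dir)
```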
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* feat(build): add tar.xz support and unified installer scripts
- Add tar.xz archive support using Borg's compress package
  - ArchiveXZ() and ArchiveWithFormat() for configurable compression
  - Better compression ratio than gzip for release artifacts
- Consolidate 12 installer scripts into 2 unified scripts
  - install.sh and install.bat with BunnyCDN edge variable support
  - Subdomains: setup.core.help, ci.core.help, dev.core.help, etc.
  - MODE and VARIANT transformed at edge based on subdomain
- Installers prefer tar.xz with automatic fallback to tar.gz
- Fix CodeRabbit issues: HTTP status patterns, tar error handling,
  verify_install params, VARIANT validation, CI PATH persistence
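The prefer-xz-with-fallback behaviour can be sketched as follows (a Python stand-in for the shell logic; `extract_release` and the artifact naming are illustrative):

```python
import tarfile
from pathlib import Path

def extract_release(artifact_dir: Path, name: str, dest: Path) -> Path:
    """Prefer the smaller .tar.xz artifact; fall back to .tar.gz."""
    for suffix, mode in ((".tar.xz", "r:xz"), (".tar.gz", "r:gz")):
        archive = artifact_dir / f"{name}{suffix}"
        if archive.exists():
            with tarfile.open(archive, mode) as tar:
                tar.extractall(dest)
            return archive
    raise FileNotFoundError(f"no {name}.tar.xz or {name}.tar.gz in {artifact_dir}")
```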
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* chore: add build and release config files
- .core/build.yaml - cross-platform build configuration
- .core/release.yaml - release workflow configuration
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* chore: move plans from docs/ to tasks/
Consolidate planning documents in tasks/plans/ directory.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* fix(install): address CodeRabbit review feedback
- Add curl timeout (--max-time) to prevent hanging on slow networks
- Rename TMPDIR to WORK_DIR to avoid clobbering system env var
- Add chmod +x to ensure binary has execute permissions
- Add error propagation after subroutine calls in batch file
- Remove System32 install attempt in CI mode (use consistent INSTALL_DIR)
- Fix HTTP status regex for HTTP/2 compatibility
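The HTTP/2 issue is that status lines like `HTTP/2 200` carry no minor version, so a pattern anchored on `HTTP/1.1` never matches them. A version-agnostic pattern, sketched in Python (the installers use the shell equivalent):

```python
import re

# Matches "HTTP/1.0", "HTTP/1.1", "HTTP/2", "HTTP/3" status lines
# and captures the three-digit status code.
STATUS_RE = re.compile(r"^HTTP/[0-9.]+ +([0-9]{3})")

def status_code(status_line: str) -> int:
    """Extract the status code from any HTTP-version status line."""
    m = STATUS_RE.match(status_line)
    if not m:
        raise ValueError(f"not a status line: {status_line!r}")
    return int(m.group(1))
```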
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* feat(rag): add Go RAG implementation with Qdrant + Ollama
Add RAG (Retrieval Augmented Generation) tools for storing documentation
in Qdrant vector database and querying with semantic search. This replaces
the Python tools/rag implementation with a native Go solution.
New commands:
- core rag ingest [directory] - Ingest markdown files into Qdrant
- core rag query [question] - Query vector database with semantic search
- core rag collections - List and manage Qdrant collections
Features:
- Markdown chunking by sections and paragraphs with overlap
- UTF-8 safe text handling for international content
- Automatic category detection from file paths
- Multiple output formats: text, JSON, LLM context injection
- Environment variable support for host configuration
Dependencies:
- github.com/qdrant/go-client (gRPC client)
- github.com/ollama/ollama/api (embeddings API)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* feat(deploy): add pure-Go Ansible executor and Coolify API integration
Implement infrastructure deployment system with:
- pkg/ansible: Pure Go Ansible executor
  - Playbook/inventory parsing (types.go, parser.go)
  - Full execution engine with variable templating, loops, blocks,
    conditionals, handlers, and fact gathering (executor.go)
  - SSH client with key/password auth and privilege escalation (ssh.go)
  - 35+ module implementations: shell, command, copy, template, file,
    apt, service, systemd, user, group, git, docker_compose, etc. (modules.go)
- pkg/deploy/coolify: Coolify API client wrapping Python swagger client
  - List/get servers, projects, applications, databases, services
  - Generic Call() for any OpenAPI operation
- pkg/deploy/python: Embedded Python runtime for swagger client integration
- internal/cmd/deploy: CLI commands
  - core deploy servers/projects/apps/databases/services/team
  - core deploy call <operation> [params-json]
This enables Docker-free infrastructure deployment with Ansible-compatible
playbooks executed natively in Go.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* fix(deploy): address linter warnings and build errors
- Fix fmt.Sprintf format verb error in ssh.go (remove unused stat command)
- Fix errcheck warnings by explicitly ignoring best-effort operations
- Fix ineffassign warning in cmd_ansible.go
All golangci-lint checks now pass for deploy packages.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* style(deploy): fix gofmt formatting
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* fix(deploy): use known_hosts for SSH host key verification
Address CodeQL security alert by using the user's known_hosts file
for SSH host key verification when available. Falls back to accepting
any key only when known_hosts doesn't exist (common in containerized
or ephemeral environments).
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* feat(ai,security,ide): add agentic MVP, security jobs, and Core IDE desktop app
Wire up AI infrastructure with unified pkg/ai package (metrics JSONL,
RAG integration), move RAG under `core ai rag`, add `core ai metrics`
command, and enrich task context with Qdrant documentation.
Add `--target` flag to all security commands for external repo scanning,
`core security jobs` for distributing findings as GitHub Issues, and
consistent error logging across scan/deps/alerts/secrets commands.
Add Core IDE Wails v3 desktop app with Angular 20 frontend, MCP bridge
(loopback-only HTTP server), WebSocket hub, and Claude Code bridge.
Production-ready with Lethean CIC branding, macOS code signing support,
and security hardening (origin validation, body size limits, URL scheme
checks, memory leak prevention, XSS mitigation).
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* fix: address PR review comments from CodeRabbit, Copilot, and Gemini
Fixes across 25 files addressing 46+ review comments:
- pkg/ai/metrics.go: handle error from Close() on writable file handle
- pkg/ansible: restore loop vars after loop, restore become settings,
  fix Upload with become=true and no password (use sudo -n), honour
  SSH timeout config, use E() helper for contextual errors, quote git
  refs in checkout commands
- pkg/rag: validate chunk config, guard negative-to-uint64 conversion,
  use E() helper for errors, add context timeout to Ollama HTTP calls
- pkg/deploy/python: fix exec.ExitError type assertion (was os.PathError),
  handle os.UserHomeDir() error
- pkg/build/buildcmd: use cmd.Context() instead of context.Background()
  for proper Ctrl+C cancellation
- install.bat: add curl timeouts, CRLF line endings, use --connect-timeout
  for archive downloads
- install.sh: use absolute path for version check in CI mode
- tools/rag: fix broken ingest.py function def, escape HTML in query.py,
  pin qdrant-client version, add markdown code block languages
- internal/cmd/rag: add chunk size validation, env override handling
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* fix(build): make release dry-run by default and remove darwin/amd64 target
Replace --dry-run (default false) with --we-are-go-for-launch (default
false) so `core build release` is safe by default. Remove darwin/amd64
from default build targets (arm64 only for macOS). Fix cmd_project.go
to use command context instead of context.Background().
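The flag flip can be sketched with argparse: instead of an opt-in --dry-run, the release is a dry run unless --we-are-go-for-launch is passed (the flag name is from the commit above; the parser wiring is illustrative):

```python
import argparse

def parse_release_args(argv: list[str]) -> argparse.Namespace:
    """`core build release` is a dry run unless explicitly launched."""
    parser = argparse.ArgumentParser(prog="core build release")
    parser.add_argument(
        "--we-are-go-for-launch",
        action="store_true",
        help="actually publish the release (default: dry run)",
    )
    args = parser.parse_args(argv)
    # Safe by default: publishing requires the explicit opt-in flag.
    args.dry_run = not args.we_are_go_for_launch
    return args
```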
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
ingest.py (Python, 254 lines, 8.3 KiB):
#!/usr/bin/env python3
"""
RAG Ingestion Pipeline for Host UK Documentation

Chunks markdown files, generates embeddings via Ollama, stores in Qdrant.

Usage:
    python ingest.py /path/to/docs --collection hostuk-docs
    python ingest.py /path/to/flux-ui --collection flux-ui-docs

Requirements:
    pip install qdrant-client ollama markdown
"""

import argparse
import hashlib
import json
import os
import re
import sys
from pathlib import Path
from typing import Generator

try:
    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, VectorParams, PointStruct
    import ollama
except ImportError:
    print("Install dependencies: pip install qdrant-client ollama")
    sys.exit(1)


# Configuration
QDRANT_HOST = os.getenv("QDRANT_HOST", "localhost")
QDRANT_PORT = int(os.getenv("QDRANT_PORT", "6333"))
EMBEDDING_MODEL = os.getenv("EMBEDDING_MODEL", "nomic-embed-text")
CHUNK_SIZE = int(os.getenv("CHUNK_SIZE", "500"))  # chars
CHUNK_OVERLAP = int(os.getenv("CHUNK_OVERLAP", "50"))  # chars
VECTOR_DIM = 768  # nomic-embed-text dimension


def chunk_markdown(text: str, chunk_size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> Generator[dict, None, None]:
    """
    Chunk markdown by sections (## headers), then by paragraphs if too long.
    Preserves context with overlap.
    """
    # Split by ## headers first
    sections = re.split(r'\n(?=## )', text)

    for section in sections:
        if not section.strip():
            continue

        # Extract section title
        lines = section.strip().split('\n')
        title = lines[0].lstrip('#').strip() if lines[0].startswith('#') else ""

        # If section is small enough, yield as-is
        if len(section) <= chunk_size:
            yield {
                "text": section.strip(),
                "section": title,
            }
            continue

        # Otherwise, chunk by paragraphs
        paragraphs = re.split(r'\n\n+', section)
        current_chunk = ""

        for para in paragraphs:
            if len(current_chunk) + len(para) <= chunk_size:
                current_chunk += "\n\n" + para if current_chunk else para
            else:
                if current_chunk:
                    yield {
                        "text": current_chunk.strip(),
                        "section": title,
                    }
                # Start new chunk with overlap from previous
                if overlap and current_chunk:
                    overlap_text = current_chunk[-overlap:]
                    current_chunk = overlap_text + "\n\n" + para
                else:
                    current_chunk = para

        # Don't forget the last chunk
        if current_chunk.strip():
            yield {
                "text": current_chunk.strip(),
                "section": title,
            }


def generate_embedding(text: str) -> list[float]:
    """Generate embedding using Ollama."""
    response = ollama.embeddings(model=EMBEDDING_MODEL, prompt=text)
    return response["embedding"]


def get_file_category(path: str) -> str:
    """Determine category from file path."""
    path_lower = path.lower()

    if "flux" in path_lower or "ui/component" in path_lower:
        return "ui-component"
    elif "brand" in path_lower or "mascot" in path_lower:
        return "brand"
    elif "brief" in path_lower:
        return "product-brief"
    elif "help" in path_lower or "draft" in path_lower:
        return "help-doc"
    elif "task" in path_lower or "plan" in path_lower:
        return "task"
    elif "architecture" in path_lower or "migration" in path_lower:
        return "architecture"
    else:
        return "documentation"


def ingest_directory(
    directory: Path,
    client: QdrantClient,
    collection: str,
    verbose: bool = False
) -> dict:
    """Ingest all markdown files from directory into Qdrant."""

    stats = {"files": 0, "chunks": 0, "errors": 0}
    points = []

    # Find all markdown files
    md_files = list(directory.rglob("*.md"))
    print(f"Found {len(md_files)} markdown files")

    for file_path in md_files:
        try:
            rel_path = str(file_path.relative_to(directory))

            with open(file_path, "r", encoding="utf-8", errors="ignore") as f:
                content = f.read()

            if not content.strip():
                continue

            # Extract metadata
            category = get_file_category(rel_path)

            # Chunk the content
            for i, chunk in enumerate(chunk_markdown(content)):
                chunk_id = hashlib.md5(
                    f"{rel_path}:{i}:{chunk['text'][:100]}".encode()
                ).hexdigest()

                # Generate embedding
                embedding = generate_embedding(chunk["text"])

                # Create point
                point = PointStruct(
                    id=chunk_id,
                    vector=embedding,
                    payload={
                        "text": chunk["text"],
                        "source": rel_path,
                        "section": chunk["section"],
                        "category": category,
                        "chunk_index": i,
                    }
                )
                points.append(point)
                stats["chunks"] += 1

                if verbose:
                    print(f"  [{category}] {rel_path} chunk {i}: {len(chunk['text'])} chars")

            stats["files"] += 1
            if not verbose:
                print(f"  Processed: {rel_path} ({stats['chunks']} chunks total)")

        except Exception as e:
            print(f"  Error processing {file_path}: {e}")
            stats["errors"] += 1

    # Batch upsert to Qdrant
    if points:
        print(f"\nUpserting {len(points)} vectors to Qdrant...")

        # Upsert in batches of 100
        batch_size = 100
        for i in range(0, len(points), batch_size):
            batch = points[i:i + batch_size]
            client.upsert(collection_name=collection, points=batch)
            print(f"  Uploaded batch {i // batch_size + 1}/{(len(points) - 1) // batch_size + 1}")

    return stats


def main():
    parser = argparse.ArgumentParser(description="Ingest markdown docs into Qdrant")
    parser.add_argument("directory", type=Path, help="Directory containing markdown files")
    parser.add_argument("--collection", default="hostuk-docs", help="Qdrant collection name")
    parser.add_argument("--recreate", action="store_true", help="Delete and recreate collection")
    parser.add_argument("--verbose", "-v", action="store_true", help="Verbose output")
    parser.add_argument("--qdrant-host", default=QDRANT_HOST, help="Qdrant host")
    parser.add_argument("--qdrant-port", type=int, default=QDRANT_PORT, help="Qdrant port")

    args = parser.parse_args()

    if not args.directory.exists():
        print(f"Error: Directory not found: {args.directory}")
        sys.exit(1)

    # Connect to Qdrant
    print(f"Connecting to Qdrant at {args.qdrant_host}:{args.qdrant_port}...")
    client = QdrantClient(host=args.qdrant_host, port=args.qdrant_port)

    # Create or recreate collection
    collections = [c.name for c in client.get_collections().collections]

    if args.recreate and args.collection in collections:
        print(f"Deleting existing collection: {args.collection}")
        client.delete_collection(args.collection)
        collections.remove(args.collection)

    if args.collection not in collections:
        print(f"Creating collection: {args.collection}")
        client.create_collection(
            collection_name=args.collection,
            vectors_config=VectorParams(size=VECTOR_DIM, distance=Distance.COSINE)
        )

    # Verify Ollama model is available
    print(f"Using embedding model: {EMBEDDING_MODEL}")
    try:
        ollama.embeddings(model=EMBEDDING_MODEL, prompt="test")
    except Exception:
        print(f"Error: Embedding model not available. Run: ollama pull {EMBEDDING_MODEL}")
        sys.exit(1)

    # Ingest files
    print(f"\nIngesting from: {args.directory}")
    stats = ingest_directory(args.directory, client, args.collection, args.verbose)

    # Summary
    print(f"\n{'=' * 50}")
    print("Ingestion complete!")
    print(f"  Files processed: {stats['files']}")
    print(f"  Chunks created: {stats['chunks']}")
    print(f"  Errors: {stats['errors']}")
    print(f"  Collection: {args.collection}")
    print(f"{'=' * 50}")


if __name__ == "__main__":
    main()