Data Collection (collect/)
The collect package provides a multi-source data collection subsystem for gathering information from GitHub, BitcoinTalk, CoinGecko, and academic paper repositories. It supports rate limiting, incremental state tracking, event-driven progress reporting, and HTML/JSON-to-Markdown conversion.
Import path: forge.lthn.ai/core/go-scm/collect
Collector Interface
Every collection source implements the Collector interface:
type Collector interface {
Name() string
Collect(ctx context.Context, cfg *Config) (*Result, error)
}
The Config struct carries shared configuration:
type Config struct {
Output io.Medium // Storage medium for writing collected data
OutputDir string // Base directory for all collected data
Limiter *RateLimiter // Per-source rate limiting
State *State // Tracks progress for incremental runs
Dispatcher *Dispatcher // Event dispatch for progress reporting
Verbose bool // Enable detailed logging
DryRun bool // Simulate collection without writing
}
Create a Config with sensible defaults:
cfg := collect.NewConfig("/path/to/output")
// Or with a specific storage medium:
cfg := collect.NewConfigWithMedium(myMedium, "/path/to/output")
Result carries the outcome of each collection run:
type Result struct {
Source string // Which collector produced this
Items int // Successfully collected items
Errors int // Errors encountered
Skipped int // Items skipped (e.g. already collected)
Files []string // Paths of all files written
}
Use MergeResults() to aggregate results from multiple collectors.
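To illustrate the aggregation MergeResults() performs, here is a minimal self-contained sketch; the Result struct mirrors the one above, and mergeResults is a hypothetical stand-in for the package function (the real implementation may differ in detail):

```go
package main

import "fmt"

// Result mirrors the collect.Result struct shown above.
type Result struct {
	Source  string
	Items   int
	Errors  int
	Skipped int
	Files   []string
}

// mergeResults is a hypothetical stand-in for collect.MergeResults:
// it sums the counters and concatenates the file lists of each run.
func mergeResults(results ...*Result) *Result {
	merged := &Result{Source: "merged"}
	for _, r := range results {
		merged.Items += r.Items
		merged.Errors += r.Errors
		merged.Skipped += r.Skipped
		merged.Files = append(merged.Files, r.Files...)
	}
	return merged
}

func main() {
	a := &Result{Source: "github", Items: 10, Files: []string{"1.md"}}
	b := &Result{Source: "market", Items: 2, Errors: 1, Files: []string{"summary.md"}}
	m := mergeResults(a, b)
	fmt.Println(m.Items, m.Errors, len(m.Files)) // 12 1 2
}
```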
Built-in Collectors
GitHubCollector
Collects issues and pull requests from GitHub repositories using the gh CLI:
collector := &collect.GitHubCollector{
Org: "host-uk",
Repo: "core", // Empty = all repos in the org
IssuesOnly: false,
PRsOnly: false,
}
result, err := collector.Collect(ctx, cfg)
Output is written as Markdown files to {OutputDir}/github/{org}/{repo}/issues/{number}.md and .../pulls/{number}.md. Each file includes title, state, author, created date, labels, URL, and body.
BitcoinTalkCollector
Scrapes forum posts from BitcoinTalk topics via HTML parsing:
collector := &collect.BitcoinTalkCollector{
TopicID: "5432871",
Pages: 5, // 0 = all pages
}
Posts are paginated (20 per page) and written to {OutputDir}/bitcointalk/{topicID}/posts/{num}.md. The collector extracts author, date, and content from the div.post DOM structure. A custom User-Agent header (CoreCollector/1.0) is sent with each request.
MarketCollector
Fetches market data from the CoinGecko API:
collector := &collect.MarketCollector{
CoinID: "lethean",
Historical: true,
FromDate: "2025-01-01", // Optional; defaults to 365 days
}
Writes two files for current data: {OutputDir}/market/{coinID}/current.json (raw API response) and summary.md (formatted Markdown with price, market cap, volume, supply). When Historical is enabled, writes historical.json containing price, market cap, and volume time series.
Uses a generic fetchJSON[T]() helper for type-safe HTTP fetching.
PapersCollector
Collects academic papers from IACR ePrint and arXiv:
collector := &collect.PapersCollector{
Source: collect.PaperSourceAll, // "iacr", "arxiv", or "all"
Query: "privacy preserving",
Category: "cs.CR", // arXiv category (optional)
}
- IACR: Scrapes search results from eprint.iacr.org, extracting title, authors, date, abstract, and URL from div.paperentry elements.
- arXiv: Parses the Atom XML feed from the arXiv API (export.arxiv.org), supporting category filtering and up to 50 results.
Papers are written to {OutputDir}/papers/{source}/{id}.md.
Excavator (Orchestrator)
The Excavator runs multiple collectors sequentially with coordinated rate limiting, state tracking for resume support, and aggregated results:
excavator := &collect.Excavator{
Collectors: []collect.Collector{
&collect.GitHubCollector{Org: "host-uk"},
&collect.MarketCollector{CoinID: "lethean"},
&collect.PapersCollector{Source: "all", Query: "vpn privacy"},
},
Resume: true, // Skip already-completed collectors
ScanOnly: false, // true = report what would run without collecting
}
result, err := excavator.Run(ctx, cfg)
When Resume is enabled, the Excavator loads saved state and skips collectors that have previously completed successfully. After all collectors run, state is saved to disk. Context cancellation is respected between collector runs.
Rate Limiter
The RateLimiter enforces per-source delays to avoid overwhelming APIs:
limiter := collect.NewRateLimiter()
Default delays per source:
| Source | Delay |
|---|---|
| github | 500ms |
| bitcointalk | 2s |
| coingecko | 1.5s |
| iacr | 1s |
| arxiv | 1s |
The Wait(ctx, source) method blocks until the rate limit allows the next request, respecting context cancellation. Unknown sources default to 500ms.
CheckGitHubRateLimit() queries the GitHub API via gh api rate_limit and auto-increases the GitHub delay to 5 seconds when usage reaches 75%.
State Tracking
The State struct persists collection progress to disk for incremental runs:
state := collect.NewState(medium, "/path/to/.collect-state.json")
state.Load() // Reads from disk (no error if file missing)
entry, ok := state.Get("github:host-uk")
state.Set("github:host-uk", &collect.StateEntry{
Source: "github:host-uk",
LastRun: time.Now(),
Items: 42,
Cursor: "page=3",
})
state.Save() // Writes to disk
Each StateEntry tracks source name, last run timestamp, last processed item ID, total items collected, and an opaque pagination cursor.
Event System
The Dispatcher provides synchronous event dispatch for progress reporting:
dispatcher := collect.NewDispatcher()
dispatcher.On(collect.EventProgress, func(e collect.Event) {
fmt.Printf("[%s] %s: %s\n", e.Type, e.Source, e.Message)
})
Event types: start, progress, item, error, complete.
Convenience methods: EmitStart(), EmitProgress(), EmitItem(), EmitError(), EmitComplete(). Each Event carries Type, Source, Message, optional Data, and Time.
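As an illustration of the synchronous dispatch model, here is a minimal self-contained sketch; EventType and Event mirror the documented shape, while the dispatcher type and lowercase method names are stand-ins for the package's own:

```go
package main

import "fmt"

// EventType and Event mirror the documented event shape.
type EventType string

const EventProgress EventType = "progress"

type Event struct {
	Type    EventType
	Source  string
	Message string
}

// dispatcher is a sketch of synchronous, in-order event dispatch.
type dispatcher struct {
	handlers map[EventType][]func(Event)
}

func newDispatcher() *dispatcher {
	return &dispatcher{handlers: map[EventType][]func(Event){}}
}

// on registers a handler for an event type.
func (d *dispatcher) on(t EventType, h func(Event)) {
	d.handlers[t] = append(d.handlers[t], h)
}

// emit invokes all handlers for the event's type synchronously,
// in registration order, before returning.
func (d *dispatcher) emit(e Event) {
	for _, h := range d.handlers[e.Type] {
		h(e)
	}
}

func main() {
	d := newDispatcher()
	d.on(EventProgress, func(e Event) {
		fmt.Printf("[%s] %s: %s\n", e.Type, e.Source, e.Message)
	})
	d.emit(Event{Type: EventProgress, Source: "github", Message: "page 3 of 10"})
	// prints: [progress] github: page 3 of 10
}
```

Because dispatch is synchronous, handlers should return quickly to avoid slowing the collectors that emit events.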
Processor
The Processor converts collected HTML and JSON files to clean Markdown:
processor := &collect.Processor{
Source: "bitcointalk",
Dir: "/path/to/raw/files",
}
result, err := processor.Process(ctx, cfg)
Supported conversions:
- HTML to Markdown: Recursive DOM walk handling headings (h1-h6), paragraphs, bold/italic/code, links, lists (ordered and unordered), blockquotes, pre blocks, and horizontal rules. Script, style, and head elements are stripped.
- JSON to Markdown: Recursively formats objects and arrays with sorted keys and nested bullet lists.
- Markdown passthrough: .md files are trimmed and copied as-is.
Output is written to {OutputDir}/processed/{source}/{name}.md.
Exported helpers for testing: HTMLToMarkdown(), JSONToMarkdown().
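To make the JSON-to-Markdown behaviour concrete (objects as bullet lists with sorted keys, arrays as nested bullets), here is a self-contained sketch; jsonToMarkdown is an illustrative stand-in, and the real JSONToMarkdown() may format its output differently:

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// jsonToMarkdown sketches the documented JSON-to-Markdown pass:
// objects become bullet lists with sorted keys, arrays become
// nested bullets.
func jsonToMarkdown(raw []byte) (string, error) {
	var v any
	if err := json.Unmarshal(raw, &v); err != nil {
		return "", err
	}
	var b strings.Builder
	writeValue(&b, v, 0)
	return b.String(), nil
}

func writeValue(b *strings.Builder, v any, depth int) {
	indent := strings.Repeat("  ", depth)
	switch val := v.(type) {
	case map[string]any:
		keys := make([]string, 0, len(val))
		for k := range val {
			keys = append(keys, k)
		}
		sort.Strings(keys) // deterministic, sorted key order
		for _, k := range keys {
			switch val[k].(type) {
			case map[string]any, []any:
				fmt.Fprintf(b, "%s- **%s**:\n", indent, k)
				writeValue(b, val[k], depth+1)
			default:
				fmt.Fprintf(b, "%s- **%s**: %v\n", indent, k, val[k])
			}
		}
	case []any:
		for _, item := range val {
			switch item.(type) {
			case map[string]any, []any:
				fmt.Fprintf(b, "%s-\n", indent)
				writeValue(b, item, depth+1)
			default:
				fmt.Fprintf(b, "%s- %v\n", indent, item)
			}
		}
	default:
		fmt.Fprintf(b, "%s%v\n", indent, val)
	}
}

func main() {
	md, _ := jsonToMarkdown([]byte(`{"name":"lethean","tags":["vpn","privacy"]}`))
	fmt.Print(md)
	// - **name**: lethean
	// - **tags**:
	//   - vpn
	//   - privacy
}
```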
Licence
EUPL-1.2