3 Data-Collection
Virgil edited this page 2026-02-19 17:07:21 +00:00

Data Collection (collect/)

The collect package provides a multi-source data collection subsystem for gathering information from GitHub, BitcoinTalk, CoinGecko, and academic paper repositories. It supports rate limiting, incremental state tracking, event-driven progress reporting, and HTML/JSON-to-Markdown conversion.

Import path: forge.lthn.ai/core/go-scm/collect

See also: Home


Collector Interface

Every collection source implements the Collector interface:

type Collector interface {
    Name() string
    Collect(ctx context.Context, cfg *Config) (*Result, error)
}

The Config struct carries shared configuration:

type Config struct {
    Output     io.Medium       // Storage medium for writing collected data
    OutputDir  string          // Base directory for all collected data
    Limiter    *RateLimiter    // Per-source rate limiting
    State      *State          // Tracks progress for incremental runs
    Dispatcher *Dispatcher     // Event dispatch for progress reporting
    Verbose    bool            // Enable detailed logging
    DryRun     bool            // Simulate collection without writing
}

Create a Config with sensible defaults:

cfg := collect.NewConfig("/path/to/output")

// Or with a specific storage medium:
cfg := collect.NewConfigWithMedium(myMedium, "/path/to/output")

Result carries the outcome of each collection run:

type Result struct {
    Source  string   // Which collector produced this
    Items   int      // Successfully collected items
    Errors  int      // Errors encountered
    Skipped int      // Items skipped (e.g. already collected)
    Files   []string // Paths of all files written
}

Use MergeResults() to aggregate results from multiple collectors.


Built-in Collectors

GitHubCollector

Collects issues and pull requests from GitHub repositories using the gh CLI:

collector := &collect.GitHubCollector{
    Org:        "host-uk",
    Repo:       "core",       // Empty = all repos in the org
    IssuesOnly: false,
    PRsOnly:    false,
}

result, err := collector.Collect(ctx, cfg)

Output is written as Markdown files to {OutputDir}/github/{org}/{repo}/issues/{number}.md and .../pulls/{number}.md. Each file includes title, state, author, created date, labels, URL, and body.

BitcoinTalkCollector

Scrapes forum posts from BitcoinTalk topics via HTML parsing:

collector := &collect.BitcoinTalkCollector{
    TopicID: "5432871",
    Pages:   5,           // 0 = all pages
}

Posts are paginated (20 per page) and written to {OutputDir}/bitcointalk/{topicID}/posts/{num}.md. The collector extracts author, date, and content from the div.post DOM structure. A custom User-Agent header (CoreCollector/1.0) is sent with each request.
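Pagination at 20 posts per page means each page maps to a post offset. A small sketch of the URL construction, assuming the forum's usual topic=ID.offset scheme (the collector's actual URL-building code is not shown here):

```go
package main

import "fmt"

// topicPageURL builds a BitcoinTalk topic URL for a given 1-based page,
// assuming the topic=ID.offset convention at 20 posts per page.
func topicPageURL(topicID string, page int) string {
	offset := (page - 1) * 20
	return fmt.Sprintf("https://bitcointalk.org/index.php?topic=%s.%d", topicID, offset)
}

func main() {
	fmt.Println(topicPageURL("5432871", 1)) // first page, offset 0
	fmt.Println(topicPageURL("5432871", 3)) // third page, offset 40
}
```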

MarketCollector

Fetches market data from the CoinGecko API:

collector := &collect.MarketCollector{
    CoinID:     "lethean",
    Historical: true,
    FromDate:   "2025-01-01",  // Optional; defaults to the last 365 days
}

Writes two files for current data: {OutputDir}/market/{coinID}/current.json (raw API response) and summary.md (formatted Markdown with price, market cap, volume, supply). When Historical is enabled, writes historical.json containing price, market cap, and volume time series.

Uses a generic fetchJSON[T]() helper for type-safe HTTP fetching.

PapersCollector

Collects academic papers from IACR ePrint and arXiv:

collector := &collect.PapersCollector{
    Source:   collect.PaperSourceAll,  // "iacr", "arxiv", or "all"
    Query:    "privacy preserving",
    Category: "cs.CR",                // arXiv category (optional)
}
  • IACR: Scrapes search results from eprint.iacr.org, extracting title, authors, date, abstract, and URL from div.paperentry elements.
  • arXiv: Parses the Atom XML feed from the arXiv API (export.arxiv.org), supporting category filtering and up to 50 results.

Papers are written to {OutputDir}/papers/{source}/{id}.md.


Excavator (Orchestrator)

The Excavator runs multiple collectors sequentially with coordinated rate limiting, state tracking for resume support, and aggregated results:

excavator := &collect.Excavator{
    Collectors: []collect.Collector{
        &collect.GitHubCollector{Org: "host-uk"},
        &collect.MarketCollector{CoinID: "lethean"},
        &collect.PapersCollector{Source: "all", Query: "vpn privacy"},
    },
    Resume:   true,   // Skip already-completed collectors
    ScanOnly: false,  // true = report what would run without collecting
}

result, err := excavator.Run(ctx, cfg)

When Resume is enabled, the Excavator loads saved state and skips collectors that have previously completed successfully. After all collectors run, state is saved to disk. Context cancellation is respected between collector runs.


Rate Limiter

The RateLimiter enforces per-source delays to avoid overwhelming APIs:

limiter := collect.NewRateLimiter()

Default delays per source:

Source        Delay
github        500ms
bitcointalk   2s
coingecko     1.5s
iacr          1s
arxiv         1s

The Wait(ctx, source) method blocks until the rate limit allows the next request, respecting context cancellation. Unknown sources default to 500ms.

CheckGitHubRateLimit() queries the GitHub API via gh api rate_limit and auto-increases the GitHub delay to 5 seconds when usage reaches 75%.


State Tracking

The State struct persists collection progress to disk for incremental runs:

state := collect.NewState(medium, "/path/to/.collect-state.json")
state.Load()   // Reads from disk (no error if file missing)

entry, ok := state.Get("github:host-uk")
state.Set("github:host-uk", &collect.StateEntry{
    Source:  "github:host-uk",
    LastRun: time.Now(),
    Items:   42,
    Cursor:  "page=3",
})

state.Save()   // Writes to disk

Each StateEntry tracks source name, last run timestamp, last processed item ID, total items collected, and an opaque pagination cursor.


Event System

The Dispatcher provides synchronous event dispatch for progress reporting:

dispatcher := collect.NewDispatcher()

dispatcher.On(collect.EventProgress, func(e collect.Event) {
    fmt.Printf("[%s] %s: %s\n", e.Type, e.Source, e.Message)
})

Event types: start, progress, item, error, complete.

Convenience methods: EmitStart(), EmitProgress(), EmitItem(), EmitError(), EmitComplete(). Each Event carries Type, Source, Message, optional Data, and Time.
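The synchronous dispatch model can be sketched in a few lines. These are stand-in types mirroring the documented Event fields, not the package's implementation:

```go
package main

import (
	"fmt"
	"time"
)

// Event mirrors the documented fields: Type, Source, Message, Data, Time.
type Event struct {
	Type    string
	Source  string
	Message string
	Data    any
	Time    time.Time
}

// dispatcher calls every handler registered for an event type, in order,
// on the emitting goroutine (synchronous dispatch).
type dispatcher struct {
	handlers map[string][]func(Event)
}

func newDispatcher() *dispatcher {
	return &dispatcher{handlers: map[string][]func(Event){}}
}

func (d *dispatcher) On(typ string, fn func(Event)) {
	d.handlers[typ] = append(d.handlers[typ], fn)
}

func (d *dispatcher) Emit(e Event) {
	e.Time = time.Now()
	for _, fn := range d.handlers[e.Type] {
		fn(e) // runs before Emit returns
	}
}

func main() {
	d := newDispatcher()
	d.On("progress", func(e Event) {
		fmt.Printf("[%s] %s: %s\n", e.Type, e.Source, e.Message)
	})
	d.Emit(Event{Type: "progress", Source: "github", Message: "page 3"})
}
```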


Processor

The Processor converts collected HTML and JSON files to clean Markdown:

processor := &collect.Processor{
    Source: "bitcointalk",
    Dir:    "/path/to/raw/files",
}

result, err := processor.Process(ctx, cfg)

Supported conversions:

  • HTML to Markdown: Recursive DOM walk handling headings (h1-h6), paragraphs, bold/italic/code, links, lists (ordered and unordered), blockquotes, pre blocks, and horizontal rules. Script, style, and head elements are stripped.
  • JSON to Markdown: Recursively formats objects and arrays with sorted keys and nested bullet lists.
  • Markdown passthrough: .md files are trimmed and copied as-is.

Output is written to {OutputDir}/processed/{source}/{name}.md.

Exported helpers for testing: HTMLToMarkdown(), JSONToMarkdown().
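The JSON-to-Markdown rule (sorted keys, nested bullet lists) can be illustrated with a standalone sketch; the package's own JSONToMarkdown may differ in detail.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// jsonToMarkdown renders a decoded JSON value as nested bullet lists,
// with object keys sorted for stable output.
func jsonToMarkdown(v any, indent int) string {
	pad := strings.Repeat("  ", indent)
	var b strings.Builder
	switch t := v.(type) {
	case map[string]any:
		keys := make([]string, 0, len(t))
		for k := range t {
			keys = append(keys, k)
		}
		sort.Strings(keys)
		for _, k := range keys {
			switch t[k].(type) {
			case map[string]any, []any:
				// Nested structure: key on its own bullet, children indented.
				b.WriteString(fmt.Sprintf("%s- %s:\n%s", pad, k, jsonToMarkdown(t[k], indent+1)))
			default:
				b.WriteString(fmt.Sprintf("%s- %s: %v\n", pad, k, t[k]))
			}
		}
	case []any:
		for _, item := range t {
			switch item.(type) {
			case map[string]any, []any:
				b.WriteString(fmt.Sprintf("%s-\n%s", pad, jsonToMarkdown(item, indent+1)))
			default:
				b.WriteString(fmt.Sprintf("%s- %v\n", pad, item))
			}
		}
	default:
		b.WriteString(fmt.Sprintf("%s%v\n", pad, t))
	}
	return b.String()
}

func main() {
	data := map[string]any{"name": "lethean", "market": map[string]any{"usd": 0.0042}}
	fmt.Print(jsonToMarkdown(data, 0))
}
```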


Licence

EUPL-1.2