agent/claude/issues/010-collect-process.md
Snider beb24f71d2 docs: add feature request issues for core CLI migration
12 issue files documenting features needed to replace shell scripts:

Claude Code hooks:
- 001: core ai session (state management)
- 002: core ai context (fact capture)
- 003: core ai hook (command validation)
- 004: core qa debug (debug statement detection)

Data collection:
- 005: core collect github (issues/PRs archive)
- 006: core collect bitcointalk (forum threads)
- 007: core collect market (CMC/CoinGecko)
- 008: core collect papers (whitepapers)
- 009: core collect excavate (project archaeology)
- 010: core collect process (HTML→MD)
- 011: core collect dispatch (event hooks)

000: Overview tracking issue

These will be submitted to host-uk/core when rate limit resets.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-01 18:49:35 +00:00

feat(collect): Add collected data processing

Summary

Add a core collect process command that converts collected HTML, JSON, and RSS/XML files into clean markdown.

Required Commands

core collect process <source> <downloads-dir>  # Process downloaded files
core collect process bitcointalk ./downloads   # BitcoinTalk HTML → MD
core collect process reddit ./downloads        # Reddit JSON → MD
core collect process wayback ./downloads       # Wayback HTML → MD
core collect process medium ./downloads        # Medium RSS → MD
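
The argument shape above can be sketched as follows. This is only an illustration in Python of the `<source> <downloads-dir>` contract; the function name `parse_process_args` is hypothetical, and the real core CLI may parse its arguments quite differently.

```python
import argparse

# Source names from the command list above; "btt" is the alias given
# under Supported Sources.
SOURCES = ["bitcointalk", "btt", "reddit", "wayback", "medium"]


def parse_process_args(argv):
    """Parse `<source> <downloads-dir>` (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="core collect process")
    parser.add_argument("source", choices=SOURCES,
                        help="which collector produced the files")
    parser.add_argument("downloads_dir",
                        help="directory containing the downloaded files")
    args = parser.parse_args(argv)
    # Normalise the alias so later steps only see canonical names.
    if args.source == "btt":
        args.source = "bitcointalk"
    return args
```

An unknown source name makes argparse exit with a usage error, which is one plausible way to reject unsupported sources early.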

Current Shell Script Being Replaced

  • claude/skills/job-collector/process.sh - 243 lines of bash + embedded Python

Supported Sources

  1. bitcointalk / btt

    • Input: HTML pages
    • Extract: posts, authors, dates
    • Output: POST-NNNN.md files
  2. reddit

    • Input: JSON from Reddit API
    • Extract: posts, comments, scores
    • Output: REDDIT-NNNN.md files
  3. wayback

    • Input: HTML from Wayback Machine
    • Extract: title, body text
    • Output: {basename}.md files
  4. medium

    • Input: RSS/XML feed
    • Extract: title, author, date, content
    • Output: MEDIUM-NNNN.md files
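
The output naming rules above could be expressed roughly like this. It is a sketch under two assumptions that the issue does not state: NNNN is a four-digit, zero-padded counter (consistent with POST-0001.md in the Output Structure tree), and {basename} means the input file name without its extension. The helper `output_name` is hypothetical.

```python
from pathlib import Path

# Prefixes for the numbered outputs, as listed per source above.
NUMBERED_PREFIX = {
    "bitcointalk": "POST",
    "reddit": "REDDIT",
    "medium": "MEDIUM",
}


def output_name(source, index=None, input_path=None):
    """Return the markdown file name for one extracted item."""
    if source == "wayback":
        # Assumption: {basename} is the input name minus its extension.
        return f"{Path(input_path).stem}.md"
    # Assumption: NNNN is a zero-padded four-digit counter.
    return f"{NUMBERED_PREFIX[source]}-{index:04d}.md"
```

For example, `output_name("reddit", 42)` yields `REDDIT-0042.md`, and `output_name("wayback", input_path="downloads/page.html")` yields `page.md`.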

Output Structure

processed/
├── INDEX.md
└── posts/
    ├── POST-0001.md
    ├── POST-0002.md
    └── ...
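
A minimal sketch of producing this layout for the bitcointalk case, assuming the post bodies have already been converted to markdown strings. The function `write_posts` is hypothetical, and numbering from 1 is inferred from the tree above.

```python
from pathlib import Path


def write_posts(out_root, posts):
    """Write each post to <out_root>/posts/POST-NNNN.md; return the paths."""
    posts_dir = Path(out_root) / "posts"
    posts_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # Numbering starts at 1, matching POST-0001.md in the tree.
    for number, body in enumerate(posts, start=1):
        path = posts_dir / f"POST-{number:04d}.md"
        path.write_text(body, encoding="utf-8")
        written.append(path)
    return written
```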

Index Generation

Auto-generates INDEX.md with:

  • Source metadata
  • Post count
  • Links to all posts
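
One way the index could be assembled is sketched below. The issue does not define the exact INDEX.md layout or what "source metadata" includes, so the headings here and the choice to record only the source name are assumptions, and `build_index` is a hypothetical helper.

```python
def build_index(source, post_files):
    """Return INDEX.md text linking to every file under posts/."""
    lines = [
        f"# Index: {source}",          # assumed heading style
        "",
        f"- Source: {source}",         # assumed minimal source metadata
        f"- Posts: {len(post_files)}",
        "",
        "## Posts",
        "",
    ]
    # Links are relative to INDEX.md, which sits beside posts/.
    for name in post_files:
        lines.append(f"- [{name}](posts/{name})")
    return "\n".join(lines) + "\n"
```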

Output Format

The command reports a JSON summary of the run:

{
  "source": "bitcointalk",
  "input_files": 15,
  "posts_extracted": 347,
  "output": "processed/"
}
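
As a sketch, the summary above could be built from the run's counters like this. The field names come from the example; the helper `build_summary` and its default output directory are hypothetical.

```python
import json


def build_summary(source, input_files, posts_extracted, output_dir="processed/"):
    """Serialise the run summary using the field names shown above."""
    summary = {
        "source": source,
        "input_files": input_files,
        "posts_extracted": posts_extracted,
        "output": output_dir,
    }
    return json.dumps(summary, indent=2)
```

Calling `build_summary("bitcointalk", 15, 347)` reproduces the example object with the same key order.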