agent/claude/issues/010-collect-process.md

# feat(collect): Add collected data processing

## Summary

Add `core collect process` command to convert collected HTML/JSON files into clean markdown.

## Required Commands

```bash
core collect process <source> <downloads-dir>  # Process downloaded files
core collect process bitcointalk ./downloads   # BitcoinTalk HTML → MD
core collect process reddit ./downloads        # Reddit JSON → MD
core collect process wayback ./downloads       # Wayback HTML → MD
core collect process medium ./downloads        # Medium RSS → MD
```

## Current Shell Script Being Replaced

- `claude/skills/job-collector/process.sh` - 243 lines of bash + embedded Python

## Supported Sources

1. **bitcointalk** / **btt**
   - Input: HTML pages
   - Extract: posts, authors, dates
   - Output: POST-NNNN.md files

2. **reddit**
   - Input: JSON from Reddit API
   - Extract: posts, comments, scores
   - Output: REDDIT-NNNN.md files

3. **wayback**
   - Input: HTML from Wayback Machine
   - Extract: title, body text
   - Output: {basename}.md files

4. **medium**
   - Input: RSS/XML feed
   - Extract: title, author, date, content
   - Output: MEDIUM-NNNN.md files

## Output Structure

```
processed/
├── INDEX.md
└── posts/
    ├── POST-0001.md
    ├── POST-0002.md
    └── ...
```

## Index Generation

Auto-generates INDEX.md with:
- Source metadata
- Post count
- Links to all posts

## Output Format

```json
{
  "source": "bitcointalk",
  "input_files": 15,
  "posts_extracted": 347,
  "output": "processed/"
}
```
docs: add feature request issues for core CLI migration 12 issue files documenting features needed to replace shell scripts: Claude Code hooks: - 001: core ai session (state management) - 002: core ai context (fact capture) - 003: core ai hook (command validation) - 004: core qa debug (debug statement detection) Data collection: - 005: core collect github (issues/PRs archive) - 006: core collect bitcointalk (forum threads) - 007: core collect market (CMC/CoinGecko) - 008: core collect papers (whitepapers) - 009: core collect excavate (project archaeology) - 010: core collect process (HTML→MD) - 011: core collect dispatch (event hooks) 000: Overview tracking issue These will be submitted to host-uk/core when rate limit resets. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> 2026-02-01 18:49:35 +00:00			`# feat(collect): Add collected data processing`

			`## Summary`

			Add `core collect process` command to convert collected HTML/JSON files into clean markdown.

			`## Required Commands`

			```bash
			`core collect process <source> <downloads-dir> # Process downloaded files`
			`core collect process bitcointalk ./downloads # BitcoinTalk HTML → MD`
			`core collect process reddit ./downloads # Reddit JSON → MD`
			`core collect process wayback ./downloads # Wayback HTML → MD`
			`core collect process medium ./downloads # Medium RSS → MD`
			```

			`## Current Shell Script Being Replaced`

			- `claude/skills/job-collector/process.sh` - 243 lines of bash + embedded Python

			`## Supported Sources`

			`1. bitcointalk / btt`
			`- Input: HTML pages`
			`- Extract: posts, authors, dates`
			`- Output: POST-NNNN.md files`

			`2. reddit`
			`- Input: JSON from Reddit API`
			`- Extract: posts, comments, scores`
			`- Output: REDDIT-NNNN.md files`

			`3. wayback`
			`- Input: HTML from Wayback Machine`
			`- Extract: title, body text`
			`- Output: {basename}.md files`

			`4. medium`
			`- Input: RSS/XML feed`
			`- Extract: title, author, date, content`
			`- Output: MEDIUM-NNNN.md files`

			`## Output Structure`

			```
			`processed/`
			`├── INDEX.md`
			`└── posts/`
			`├── POST-0001.md`
			`├── POST-0002.md`
			`└── ...`
			```

			`## Index Generation`

			`Auto-generates INDEX.md with:`
			`- Source metadata`
			`- Post count`
			`- Links to all posts`

			`## Output Format`

			```json
			`{`
			`"source": "bitcointalk",`
			`"input_files": 15,`
			`"posts_extracted": 347,`
			`"output": "processed/"`
			`}`
			```