This commit introduces parallel collection capabilities to the `borg` CLI, significantly improving the performance of large-scale data collection.

Key features and changes include:

- **Parallel Downloads:** A `--parallel` flag has been added to the `collect github repos` and `collect website` commands, allowing users to specify the number of concurrent workers for downloading and processing.
- **Rate Limiting:** A `--rate-limit` flag has been added to the `collect website` command to control the maximum number of requests per second to a single domain, preventing the crawler from overwhelming servers.
- **Graceful Shutdown:** The worker pools now respect context cancellation, allowing for a graceful shutdown on interrupt (e.g., Ctrl+C). This improves the user experience for long-running collection tasks.
- **Refactored Downloaders:** The `github` and `website` downloaders have been refactored to use a robust worker pool pattern, with proper synchronization primitives to ensure thread safety.

Co-authored-by: Snider <631881+Snider@users.noreply.github.com>

| Directory |
|---|
| compress |
| console |
| datanode |
| github |
| logger |
| mocks |
| player |
| pwa |
| smsg |
| stmf |
| tarfs |
| tim |
| trix |
| ui |
| vcs |
| wasm/stmf |
| website |
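
The worker pool, rate limiting, and graceful shutdown described in the commit message can be sketched roughly as follows. This is a minimal illustration using only the Go standard library, not the actual `borg` implementation: the names `collect` and `collectDemo`, the `fetch` callback, and the example URLs are all hypothetical. It also assumes a single global rate limit, whereas the commit describes limiting per domain.

```go
package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"sync"
	"sync/atomic"
	"time"
)

// collect (hypothetical) runs fetch over items using `parallel` workers.
// If perSecond > 0, at most perSecond jobs are dispatched each second.
// It returns ctx.Err() if the context is cancelled before every item
// has been handed to a worker; jobs already started are allowed to finish.
func collect(ctx context.Context, items []string, parallel, perSecond int,
	fetch func(context.Context, string) error) error {
	if parallel < 1 {
		parallel = 1
	}
	var tick <-chan time.Time
	if perSecond > 0 {
		interval := time.Second / time.Duration(perSecond)
		if interval <= 0 {
			interval = time.Nanosecond
		}
		t := time.NewTicker(interval)
		defer t.Stop()
		tick = t.C
	}

	jobs := make(chan string)
	var wg sync.WaitGroup
	for i := 0; i < parallel; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for item := range jobs {
				_ = fetch(ctx, item) // a real collector would record this error
			}
		}()
	}

	var err error
dispatch:
	for _, item := range items {
		if tick != nil {
			// Wait for the next rate-limit slot, or stop on cancellation.
			select {
			case <-tick:
			case <-ctx.Done():
				err = ctx.Err()
				break dispatch
			}
		}
		select {
		case jobs <- item:
		case <-ctx.Done():
			err = ctx.Err()
			break dispatch
		}
	}
	close(jobs) // lets workers drain and exit
	wg.Wait()
	return err
}

// collectDemo (hypothetical) processes n dummy items without a rate limit
// and returns how many were handled.
func collectDemo(n, parallel int) int {
	items := make([]string, n)
	for i := range items {
		items[i] = fmt.Sprintf("item-%d", i)
	}
	var done int64
	_ = collect(context.Background(), items, parallel, 0,
		func(context.Context, string) error {
			atomic.AddInt64(&done, 1)
			return nil
		})
	return int(atomic.LoadInt64(&done))
}

func main() {
	// Cancel the context on Ctrl+C so the pool stops dispatching new work.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt)
	defer stop()

	urls := []string{"https://example.com/a", "https://example.com/b"}
	err := collect(ctx, urls, 2, 5, func(ctx context.Context, u string) error {
		fmt.Println("fetching", u) // placeholder for a real download
		return nil
	})
	if err != nil {
		fmt.Println("interrupted:", err)
	}
}
```

In this sketch the dispatcher, not the workers, waits on the ticker, so the rate limit holds regardless of the worker count. A per-domain limit, as the commit describes, would need one limiter per host, for example a map keyed by hostname.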