Adds support for parsing and respecting robots.txt during website collection.
This change introduces the following features:
- Fetches and parses /robots.txt before crawling a website.
- Respects `Disallow` patterns to avoid crawling restricted areas.
- Honors the `Crawl-delay` directive to prevent hammering sites.
- Adds command-line flags to configure the behavior:
- `--ignore-robots`: Ignores robots.txt rules.
- `--user-agent`: Sets a custom user-agent string.
- `--min-delay`: Overrides the crawl-delay with a minimum value.
The implementation includes a new `robots` package for parsing robots.txt files and integrates it into the existing website downloader. Tests have been added to verify the new functionality.
Co-authored-by: Snider <631881+Snider@users.noreply.github.com>
Refactored the existing tests to use the `_Good`, `_Bad`, and `_Ugly`
testing convention. This provides a more structured approach to testing
and ensures that a wider range of scenarios are covered, including
valid inputs, invalid inputs, and edge cases.
In addition to refactoring the tests, this change also includes several
bug fixes that were uncovered by the new tests. These fixes improve the
robustness and reliability of the codebase.
The following packages and commands were affected:
- `pkg/datanode`
- `pkg/compress`
- `pkg/github`
- `pkg/matrix`
- `pkg/pwa`
- `pkg/vcs`
- `pkg/website`
- `cmd/all`
- `cmd/collect`
- `cmd/collect_github_repo`
- `cmd/collect_website`
- `cmd/compile`
- `cmd/root`
- `cmd/run`
- `cmd/serve`
This commit improves the user experience of the application by adding progress bars to long-running operations.
The following commands now display a progress bar:
- `collect github repo`
- `collect website`
- `collect pwa`
The underlying packages (`pkg/vcs`, `pkg/website`, and `pkg/pwa`) have been updated to support progress reporting.
This commit introduces a new `collect website` command that recursively downloads a website to a specified depth.
- A new `pkg/website` package contains the logic for the recursive download.
- A new `pkg/ui` package provides a progress bar for long-running operations, which is used by the website downloader.
- The `collect pwa` subcommand has been restored to be PWA-specific.