Adds support for parsing and respecting robots.txt during website collection.
This change introduces the following features:
- Fetches and parses `/robots.txt` before crawling a website.
- Respects `Disallow` patterns to avoid crawling restricted areas.
- Honors the `Crawl-delay` directive to prevent hammering sites.
- Adds command-line flags to configure the behavior (see the sketch after this list):
  - `--ignore-robots`: Ignores robots.txt rules entirely.
  - `--user-agent`: Sets a custom user-agent string.
  - `--min-delay`: Enforces a minimum delay between requests, overriding a shorter `Crawl-delay`.
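A minimal sketch of how these flags might combine with a parsed `Crawl-delay`, using Go's standard `flag` package. The flag names come from this PR, but the types, defaults, and the treat-`--min-delay`-as-a-floor interpretation are assumptions rather than the actual implementation:

```go
// Hypothetical flag wiring; only the flag names are taken from this PR.
package main

import (
	"flag"
	"fmt"
	"time"
)

func main() {
	ignoreRobots := flag.Bool("ignore-robots", false, "ignore robots.txt rules")
	userAgent := flag.String("user-agent", "example-crawler", "user-agent string (placeholder default)")
	minDelay := flag.Duration("min-delay", 0, "minimum delay between requests")
	flag.Parse()

	// Pretend the site's robots.txt requested a 2s Crawl-delay.
	siteDelay := 2 * time.Second

	delay := *minDelay
	if !*ignoreRobots && siteDelay > delay {
		delay = siteDelay // respect the site's request when it is stricter
	}
	fmt.Printf("user-agent=%q ignore-robots=%v effective delay=%v\n",
		*userAgent, *ignoreRobots, delay)
}
```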
The implementation includes a new `robots` package for parsing robots.txt files and integrates it into the existing website downloader. Tests have been added to verify the new functionality.
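For illustration, a self-contained sketch of the kind of parsing the `robots` package performs. The `Rules`, `Parse`, and `Allowed` names are hypothetical and may not match the package's real API:

```go
// Hypothetical parsing sketch; names and behavior are illustrative only.
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
	"time"
)

// Rules holds the directives that apply to our user-agent.
type Rules struct {
	Disallow   []string      // path prefixes we must not crawl
	CrawlDelay time.Duration // delay requested by the site, if any
}

// Parse reads a robots.txt body and keeps the rules for the given agent,
// falling back to the wildcard "*" group.
func Parse(body, agent string) Rules {
	var r Rules
	applies := false
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if i := strings.Index(line, "#"); i >= 0 {
			line = strings.TrimSpace(line[:i]) // strip comments
		}
		key, val, ok := strings.Cut(line, ":")
		if !ok {
			continue
		}
		key, val = strings.ToLower(strings.TrimSpace(key)), strings.TrimSpace(val)
		switch key {
		case "user-agent":
			applies = val == "*" || strings.EqualFold(val, agent)
		case "disallow":
			if applies && val != "" {
				r.Disallow = append(r.Disallow, val)
			}
		case "crawl-delay":
			if applies {
				if secs, err := strconv.ParseFloat(val, 64); err == nil {
					r.CrawlDelay = time.Duration(secs * float64(time.Second))
				}
			}
		}
	}
	return r
}

// Allowed reports whether the path may be crawled under these rules.
func (r Rules) Allowed(path string) bool {
	for _, p := range r.Disallow {
		if strings.HasPrefix(path, p) {
			return false
		}
	}
	return true
}

func main() {
	robots := "User-agent: *\nDisallow: /private/\nCrawl-delay: 2\n"
	rules := Parse(robots, "mycrawler")
	fmt.Println(rules.Allowed("/private/data")) // false
	fmt.Println(rules.Allowed("/index.html"))   // true
	fmt.Println(rules.CrawlDelay)               // 2s
}
```

Plain prefix matching on `Disallow` keeps the check simple; supporting wildcard patterns (`*`, `$`) would need a small matcher on top of this.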
Co-authored-by: Snider <631881+Snider@users.noreply.github.com>