Borg/pkg/website
google-labs-jules[bot] 1d8ff02f5c feat: add robots.txt support to website collector
Adds support for parsing and respecting robots.txt during website collection.

This change introduces the following features:
- Fetches and parses /robots.txt before crawling a website.
- Respects `Disallow` patterns to avoid crawling restricted areas.
- Honors the `Crawl-delay` directive to prevent hammering sites.
- Adds command-line flags to configure the behavior (see the sketch after this list):
  - `--ignore-robots`: Ignores robots.txt rules.
  - `--user-agent`: Sets a custom user-agent string.
  - `--min-delay`: Sets a minimum delay between requests, overriding a shorter Crawl-delay.
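
A minimal sketch of how these flags could be wired with Go's standard `flag` package. The package name, defaults, and the "larger delay wins" interpretation of `--min-delay` are assumptions for illustration; the actual collector may register its flags through a different CLI framework.

```go
// Package collector is a hypothetical stand-in for the website collector's
// CLI wiring; flag names mirror the list above, defaults are assumptions.
package collector

import (
	"flag"
	"time"
)

var (
	ignoreRobots = flag.Bool("ignore-robots", false, "ignore robots.txt rules")
	userAgent    = flag.String("user-agent", "borg-website-collector", "custom user-agent string")
	minDelay     = flag.Duration("min-delay", 0, "minimum delay between requests")
)

// effectiveDelay assumes --min-delay acts as a floor: the crawler waits for
// whichever is larger, the site's Crawl-delay or the operator's --min-delay.
func effectiveDelay(crawlDelay time.Duration) time.Duration {
	if *minDelay > crawlDelay {
		return *minDelay
	}
	return crawlDelay
}
```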

The implementation includes a new `robots` package for parsing robots.txt files and integrates it into the existing website downloader. Tests have been added to verify the new functionality.
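
The commit message does not show the parser itself; below is a minimal sketch of what such a `robots` package could look like. The names (`Rules`, `Parse`, `Allowed`) and the simplified grammar (prefix-only Disallow matching, no Allow lines, no path wildcards, no grouping of consecutive User-agent lines) are assumptions for illustration, not the actual Borg API.

```go
// Package robots is a hypothetical sketch, not the actual Borg package.
package robots

import (
	"bufio"
	"strconv"
	"strings"
	"time"
)

// Rules holds the directives that apply to one user-agent group.
type Rules struct {
	Disallow   []string      // path prefixes that must not be crawled
	CrawlDelay time.Duration // minimum pause between requests, zero if unset
}

// Parse scans a robots.txt body and returns the rules for the given
// user-agent, falling back to the "*" group when no specific group exists.
func Parse(body, userAgent string) Rules {
	var matched, wildcard Rules
	var inMatched, inWildcard, sawMatched bool

	scanner := bufio.NewScanner(strings.NewReader(body))
	for scanner.Scan() {
		line := scanner.Text()
		if i := strings.Index(line, "#"); i >= 0 {
			line = line[:i] // strip trailing comments
		}
		key, value, ok := strings.Cut(strings.TrimSpace(line), ":")
		if !ok {
			continue
		}
		key = strings.ToLower(strings.TrimSpace(key))
		value = strings.TrimSpace(value)

		switch key {
		case "user-agent":
			inMatched = strings.EqualFold(value, userAgent)
			inWildcard = value == "*"
			sawMatched = sawMatched || inMatched
		case "disallow":
			if value == "" {
				continue // an empty Disallow allows everything
			}
			if inMatched {
				matched.Disallow = append(matched.Disallow, value)
			}
			if inWildcard {
				wildcard.Disallow = append(wildcard.Disallow, value)
			}
		case "crawl-delay":
			if secs, err := strconv.ParseFloat(value, 64); err == nil {
				d := time.Duration(secs * float64(time.Second))
				if inMatched {
					matched.CrawlDelay = d
				}
				if inWildcard {
					wildcard.CrawlDelay = d
				}
			}
		}
	}
	if sawMatched {
		return matched
	}
	return wildcard
}

// Allowed reports whether the given URL path may be fetched under r,
// using simple prefix matching against the Disallow entries.
func (r Rules) Allowed(path string) bool {
	for _, prefix := range r.Disallow {
		if strings.HasPrefix(path, prefix) {
			return false
		}
	}
	return true
}
```

Under these assumptions, a downloader would call `Parse` once on the fetched /robots.txt body, check `Allowed` before each request, and sleep for `CrawlDelay` (or the larger `--min-delay`) between fetches.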

Co-authored-by: Snider <631881+Snider@users.noreply.github.com>
2026-02-02 00:42:20 +00:00
website.go feat: add robots.txt support to website collector 2026-02-02 00:42:20 +00:00
website_test.go feat: add robots.txt support to website collector 2026-02-02 00:42:20 +00:00