Commit graph

4 commits

Author SHA1 Message Date
google-labs-jules[bot]
1d8ff02f5c feat: add robots.txt support to website collector
Adds support for parsing and respecting robots.txt during website collection.

This change introduces the following features:
- Fetches and parses /robots.txt before crawling a website.
- Respects `Disallow` patterns to avoid crawling restricted areas.
- Honors the `Crawl-delay` directive to prevent hammering sites.
- Adds command-line flags to configure the behavior:
  - `--ignore-robots`: Ignores robots.txt rules.
  - `--user-agent`: Sets a custom user-agent string.
  - `--min-delay`: Overrides the crawl-delay with a minimum value.

The implementation includes a new `robots` package for parsing robots.txt files and integrates it into the existing website downloader. Tests have been added to verify the new functionality.

Co-authored-by: Snider <631881+Snider@users.noreply.github.com>
2026-02-02 00:42:20 +00:00
google-labs-jules[bot]
8ba0deab91 feat: Add _Good, _Bad, and _Ugly tests
Refactored the existing tests to use the `_Good`, `_Bad`, and `_Ugly`
testing convention. This provides a more structured approach to testing
and ensures that a wider range of scenarios are covered, including
valid inputs, invalid inputs, and edge cases.

In addition to refactoring the tests, this change also includes several
bug fixes that were uncovered by the new tests. These fixes improve the
robustness and reliability of the codebase.

The following packages and commands were affected:
- `pkg/datanode`
- `pkg/compress`
- `pkg/github`
- `pkg/matrix`
- `pkg/pwa`
- `pkg/vcs`
- `pkg/website`
- `cmd/all`
- `cmd/collect`
- `cmd/collect_github_repo`
- `cmd/collect_website`
- `cmd/compile`
- `cmd/root`
- `cmd/run`
- `cmd/serve`
2025-11-14 10:36:35 +00:00
google-labs-jules[bot]
88502deb41 feat: Add progress bars to long-running operations
This commit improves the user experience of the application by adding progress bars to long-running operations.

The following commands now display a progress bar:
- `collect github repo`
- `collect website`
- `collect pwa`

The underlying packages (`pkg/vcs`, `pkg/website`, and `pkg/pwa`) have been updated to support progress reporting.
2025-11-02 01:26:52 +00:00
google-labs-jules[bot]
8e82bada06 feat: Add recursive website downloader and progress bar
This commit introduces a new `collect website` command that recursively downloads a website to a specified depth.

- A new `pkg/website` package contains the logic for the recursive download.
- A new `pkg/ui` package provides a progress bar for long-running operations, which is used by the website downloader.
- The `collect pwa` subcommand has been restored to be PWA-specific.
2025-10-31 21:35:53 +00:00