docs: add human-friendly documentation

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Snider 2026-03-11 13:02:40 +00:00
parent 5626d99a17
commit 42024ef476
3 changed files with 534 additions and 0 deletions

docs/architecture.md Normal file

@ -0,0 +1,252 @@
---
title: Architecture
description: Internal design of go-infra -- shared HTTP client, provider clients, configuration model, and CLI command structure.
---
# Architecture
go-infra is organised into four layers: a shared HTTP client, provider-specific API clients, a declarative configuration parser, and CLI commands that tie them together.
```
cmd/prod/      CLI commands (setup, status, dns, lb, ssh)
cmd/monitor/   CLI commands (security finding aggregation)
     |
     v
config.go      YAML config parser (infra.yaml)
hetzner.go     Hetzner Cloud + Robot API clients
cloudns.go     CloudNS DNS API client
     |
     v
client.go      Shared APIClient (retry, backoff, rate-limit)
     |
     v
net/http       Go standard library
```
## Shared HTTP Client (`client.go`)
All provider-specific clients delegate HTTP requests to `APIClient`, which provides:
- **Exponential backoff with jitter** -- retries on 5xx errors and network failures
- **Rate-limit compliance** -- honours `Retry-After` headers on 429 responses
- **Configurable authentication** -- each provider injects its own auth function
- **Context-aware cancellation** -- all waits respect `context.Context` deadlines
### Key Types
```go
type APIClient struct {
	client       *http.Client
	retry        RetryConfig
	authFn       func(req *http.Request)
	prefix       string // error message prefix, e.g. "hcloud API"
	mu           sync.Mutex
	blockedUntil time.Time // rate-limit backoff window
}

type RetryConfig struct {
	MaxRetries     int           // 0 = no retries
	InitialBackoff time.Duration // delay before first retry
	MaxBackoff     time.Duration // upper bound on backoff duration
}
```
### Configuration via Options
`APIClient` uses the functional options pattern:
```go
client := infra.NewAPIClient(
	infra.WithHTTPClient(customHTTPClient),
	infra.WithAuth(func(req *http.Request) {
		req.Header.Set("Authorization", "Bearer "+token)
	}),
	infra.WithRetry(infra.RetryConfig{
		MaxRetries:     5,
		InitialBackoff: 200 * time.Millisecond,
		MaxBackoff:     10 * time.Second,
	}),
	infra.WithPrefix("my-api"),
)
```
Default configuration (from `DefaultRetryConfig()`): 3 retries, 100ms initial backoff, 5s maximum backoff.
### Request Flow
The `Do(req, result)` and `DoRaw(req)` methods follow this flow for each attempt:
1. **Rate-limit check** -- if a previous 429 response set `blockedUntil`, wait until that time passes (or the context is cancelled).
2. **Apply authentication** -- call `authFn(req)` to inject credentials.
3. **Execute request** -- send via the underlying `http.Client`.
4. **Handle response**:
   - **429 Too Many Requests** -- parse `Retry-After` header, set `blockedUntil`, and retry.
   - **5xx Server Error** -- retryable; sleep with exponential backoff + jitter.
   - **4xx Client Error** (except 429) -- not retried; return error immediately.
   - **2xx Success** -- if `result` is non-nil, JSON-decode the body into it.
5. If all attempts are exhausted, return the last error.
The backoff calculation uses `base = initialBackoff * 2^attempt`, capped at `maxBackoff`, with jitter applied as a random factor between 50% and 100% of the calculated value.
### Do vs DoRaw
- `Do(req, result)` -- decodes the response body as JSON into `result`. Pass `nil` for fire-and-forget requests (e.g. DELETE).
- `DoRaw(req)` -- returns the raw `[]byte` response body. Used by CloudNS, whose responses need manual parsing due to inconsistent JSON shapes.
## Hetzner Clients (`hetzner.go`)
Two separate clients cover Hetzner's two distinct APIs.
### HCloudClient (Hetzner Cloud API)
Manages cloud servers, load balancers, and snapshots via `https://api.hetzner.cloud/v1`. Uses bearer token authentication.
```go
hc := infra.NewHCloudClient("your-token")
```
**Operations:**
| Method | Description |
|--------|-------------|
| `ListServers(ctx)` | List all cloud servers |
| `ListLoadBalancers(ctx)` | List all load balancers |
| `GetLoadBalancer(ctx, id)` | Get a load balancer by ID |
| `CreateLoadBalancer(ctx, req)` | Create a load balancer from a typed request struct |
| `DeleteLoadBalancer(ctx, id)` | Delete a load balancer by ID |
| `CreateSnapshot(ctx, serverID, description)` | Create a server snapshot |
**Data model hierarchy:**
```
HCloudServer
  +-- HCloudPublicNet --> HCloudIPv4
  +-- []HCloudPrivateNet
  +-- HCloudServerType (name, cores, memory, disk)
  +-- HCloudDatacenter
HCloudLoadBalancer
  +-- HCloudLBPublicNet --> HCloudIPv4
  +-- HCloudLBAlgorithm
  +-- []HCloudLBService
  |     +-- HCloudLBHTTP (optional)
  |     +-- HCloudLBHealthCheck --> HCloudLBHCHTTP (optional)
  +-- []HCloudLBTarget
        +-- HCloudLBTargetIP (optional)
        +-- HCloudLBTargetServer (optional)
        +-- []HCloudLBHealthStatus
```
### HRobotClient (Hetzner Robot API)
Manages dedicated (bare-metal) servers via `https://robot-ws.your-server.de`. Uses HTTP Basic authentication.
```go
hr := infra.NewHRobotClient("user", "password")
```
**Operations:**
| Method | Description |
|--------|-------------|
| `ListServers(ctx)` | List all dedicated servers |
| `GetServer(ctx, ip)` | Get a server by IP address |
The Robot API wraps each server object in a `{"server": {...}}` envelope. `HRobotClient` unwraps this automatically.
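As an illustration of the unwrapping, the sketch below decodes a list of `{"server": {...}}` envelopes into a flat slice. The type names and the two JSON fields are assumptions made for this example; the real `HRobotClient` types carry more fields and may be structured differently.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// robotServer holds a small illustrative subset of fields (assumed names).
type robotServer struct {
	ServerIP   string `json:"server_ip"`
	ServerName string `json:"server_name"`
}

// robotEnvelope mirrors the {"server": {...}} wrapper around each entry.
type robotEnvelope struct {
	Server robotServer `json:"server"`
}

// unwrapServers decodes a JSON array of envelopes and returns the inner
// server objects.
func unwrapServers(body []byte) ([]robotServer, error) {
	var envelopes []robotEnvelope
	if err := json.Unmarshal(body, &envelopes); err != nil {
		return nil, fmt.Errorf("decode robot servers: %w", err)
	}
	servers := make([]robotServer, 0, len(envelopes))
	for _, e := range envelopes {
		servers = append(servers, e.Server)
	}
	return servers, nil
}

func main() {
	body := []byte(`[{"server":{"server_ip":"192.0.2.10","server_name":"noc"}}]`)
	servers, err := unwrapServers(body)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, s := range servers {
		fmt.Println(s.ServerName, s.ServerIP)
	}
}
```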
## CloudNS Client (`cloudns.go`)
Manages DNS zones and records via `https://api.cloudns.net`. Uses query-parameter authentication (`auth-id` + `auth-password`).
```go
dns := infra.NewCloudNSClient("12345", "password")
```
**Operations:**
| Method | Description |
|--------|-------------|
| `ListZones(ctx)` | List all DNS zones |
| `ListRecords(ctx, domain)` | List all records in a zone (returns `map[id]CloudNSRecord`) |
| `CreateRecord(ctx, domain, host, type, value, ttl)` | Create a record; returns the new record ID |
| `UpdateRecord(ctx, domain, id, host, type, value, ttl)` | Update an existing record |
| `DeleteRecord(ctx, domain, id)` | Delete a record by ID |
| `EnsureRecord(ctx, domain, host, type, value, ttl)` | Idempotent create-or-update; returns whether a change was made |
| `SetACMEChallenge(ctx, domain, value)` | Create a `_acme-challenge` TXT record with 60s TTL |
| `ClearACMEChallenge(ctx, domain)` | Delete all `_acme-challenge` TXT records in a zone |
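
The create-or-update decision behind `EnsureRecord` can be sketched as a pure function over the map that `ListRecords` returns. This is a simplified illustration, not the package's implementation: the `record` fields, the `planRecord` name, and the choice to compare TTL as well as value are all assumptions, and it presumes at most one record per host and type.

```go
package main

import "fmt"

// record is a cut-down, hypothetical stand-in for CloudNSRecord.
type record struct {
	Host  string
	Type  string
	Value string
	TTL   int
}

type action int

const (
	noChange action = iota // record already matches
	update                 // record exists with a different value or TTL
	create                 // no record for this host and type
)

// planRecord decides what an idempotent ensure would need to do.
// It returns the action and, for noChange/update, the existing record ID.
func planRecord(existing map[string]record, host, typ, value string, ttl int) (action, string) {
	for id, r := range existing {
		if r.Host == host && r.Type == typ {
			if r.Value == value && r.TTL == ttl {
				return noChange, id
			}
			return update, id
		}
	}
	return create, ""
}

func main() {
	existing := map[string]record{
		"101": {Host: "www", Type: "A", Value: "1.2.3.4", TTL: 300},
	}
	act, id := planRecord(existing, "www", "A", "5.6.7.8", 300)
	fmt.Println(act == update, id)
}
```

The three outcomes correspond to the "already correct", "needs update", and "needs create" cases; only the last two would report that a change was made.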
**CloudNS quirks handled internally:**
- Empty zone lists come back as `{}` (an object) instead of `[]` (an array). `ListZones` accepts both shapes and treats `{}` as an empty list rather than a decode error.
- All mutations use POST with query parameters (not request bodies).
- Response status is checked via a `"status": "Success"` field in the JSON body, not HTTP status codes alone.
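The first and third quirks can be handled together when decoding. The sketch below is an assumption-laden illustration rather than the code in `cloudns.go`: the `zone` fields and the `status`/`statusDescription` keys reflect the shapes described above, and `decodeZones` is a hypothetical name.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// zone is a minimal, assumed subset of a CloudNS zone entry.
type zone struct {
	Name string `json:"name"`
	Type string `json:"type"`
}

// decodeZones accepts either a JSON array of zones or a JSON object.
// An object is either an error payload or the "{}" empty-list quirk.
func decodeZones(body []byte) ([]zone, error) {
	var zones []zone
	if err := json.Unmarshal(body, &zones); err == nil {
		return zones, nil
	}
	var obj struct {
		Status            string `json:"status"`
		StatusDescription string `json:"statusDescription"`
	}
	if err := json.Unmarshal(body, &obj); err != nil {
		return nil, fmt.Errorf("decode zones: %w", err)
	}
	if obj.Status == "Failed" {
		return nil, fmt.Errorf("cloudns API: %s", obj.StatusDescription)
	}
	return nil, nil // "{}" means there are no zones
}

func main() {
	for _, body := range []string{
		`[{"name":"example.com","type":"master"}]`,
		`{}`,
		`{"status":"Failed","statusDescription":"Invalid authentication"}`,
	} {
		zones, err := decodeZones([]byte(body))
		fmt.Println(len(zones), err)
	}
}
```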
## Configuration Model (`config.go`)
The `Config` struct represents the full infrastructure topology, parsed from an `infra.yaml` file. It covers:
```
Config
  +-- Hosts (map[string]*Host)            Servers with SSH details, role, and services
  +-- LoadBalancer                        Hetzner managed LB (name, type, backends, listeners, health)
  +-- Network                             Private network CIDR
  +-- DNS                                 Provider config + zone records
  +-- SSL                                 Wildcard certificate settings
  +-- Database                            Galera/MariaDB cluster nodes + backup config
  +-- Cache                               Redis/Dragonfly cluster nodes
  +-- Containers (map[string]*Container)  Container deployments (image, replicas, depends_on)
  +-- S3                                  Object storage endpoint + buckets
  +-- CDN                                 CDN provider and zones
  +-- CICD                                CI/CD provider, runner, registry
  +-- Monitoring                          Health endpoints and alert thresholds
  +-- Backups                             Daily and weekly backup jobs
```
### Loading
Two functions load configuration:
- `Load(path)` -- reads and parses a specific file. Expands `~` in SSH key paths and defaults SSH port to 22.
- `Discover(startDir)` -- walks up from `startDir` looking for `infra.yaml`, then calls `Load`. Returns the config, the path found, and any error.
### Host Queries
```go
// Get all hosts with a specific role
appServers := cfg.HostsByRole("app")

// Shorthand for role="app" (plain assignment: appServers is already declared)
appServers = cfg.AppServers()
```
## CLI Commands
### `core prod` (`cmd/prod/`)
The production command group reads `infra.yaml` (auto-discovered or specified via `--config`) and provides:
| Subcommand | Description |
|------------|-------------|
| `status` | Parallel SSH health check of all hosts. Checks Docker, Galera cluster size, Redis, Traefik, Coolify, Forgejo runner. Also queries Hetzner Cloud for load balancer health if `HCLOUD_TOKEN` is set. |
| `setup` | Runs a three-step foundation pipeline: **discover** (enumerate Hetzner Cloud + Robot servers), **lb** (create load balancer from config), **dns** (ensure DNS records via CloudNS). Supports `--dry-run` and `--step` for partial runs. |
| `dns list [zone]` | List DNS records for a zone (defaults to `host.uk.com`). |
| `dns set <host> <type> <value>` | Idempotent create-or-update of a DNS record. |
| `lb status` | Display load balancer details and per-target health status. |
| `lb create` | Create the load balancer defined in `infra.yaml`. |
| `ssh <host>` | Look up a host by name in `infra.yaml` and `exec` into an SSH session. |
The `status` command uses `go-ansible`'s `SSHClient` to connect to each host in parallel, then runs shell commands to probe service state (Docker containers, MariaDB cluster, Redis ping, etc.).
### `core monitor` (`cmd/monitor/`)
Aggregates security findings from GitHub's Security tab using the `gh` CLI:
- **Code scanning alerts** -- from Semgrep, Trivy, Gitleaks, CodeQL, etc.
- **Dependabot alerts** -- dependency vulnerability alerts.
- **Secret scanning alerts** -- exposed secrets/credentials (always classified as critical).
Findings are normalised to a common `Finding` struct, sorted by severity (critical first), and output as either a formatted table or JSON.
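A sketch of the severity ordering, assuming a four-level scale and a much smaller struct than the real `Finding`. The field names and the ranking table are illustrative only.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// finding is a reduced, hypothetical version of the Finding struct.
type finding struct {
	Source   string
	Severity string
	Title    string
}

// severityRank orders severities from most to least urgent (assumed scale).
var severityRank = map[string]int{"critical": 0, "high": 1, "medium": 2, "low": 3}

// rank maps a severity label to its sort position; unknown labels sort last.
func rank(severity string) int {
	if r, ok := severityRank[strings.ToLower(severity)]; ok {
		return r
	}
	return len(severityRank)
}

// sortBySeverity orders findings critical-first, keeping the original
// order among findings of equal severity.
func sortBySeverity(findings []finding) {
	sort.SliceStable(findings, func(i, j int) bool {
		return rank(findings[i].Severity) < rank(findings[j].Severity)
	})
}

func main() {
	findings := []finding{
		{"dependabot", "medium", "outdated yaml library"},
		{"secret-scanning", "critical", "exposed API token"},
		{"code-scanning", "high", "SQL injection"},
	}
	sortBySeverity(findings)
	for _, f := range findings {
		fmt.Printf("%-8s %-16s %s\n", f.Severity, f.Source, f.Title)
	}
}
```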
## Licence
EUPL-1.2

docs/development.md Normal file

@ -0,0 +1,160 @@
---
title: Development
description: How to build, test, and contribute to go-infra.
---
# Development
## Prerequisites
- **Go 1.26+**
- **Go workspace** -- this module is part of the workspace at `~/Code/go.work`. After cloning, run `go work sync` if module resolution fails.
- **`gh` CLI** (optional) -- required only for `core monitor` commands.
## Building
The library package (`infra`) has no binary output. The CLI commands in `cmd/prod/` and `cmd/monitor/` are compiled into the `core` binary via the `forge.lthn.ai/core/cli` module -- they are not standalone binaries.
To verify the package compiles:
```bash
cd /Users/snider/Code/core/go-infra
go build ./...
```
## Running Tests
```bash
# All tests
go test ./...
# With race detector
go test -race ./...
# A specific test
go test -run TestAPIClient_Do_Good_Success
# Verbose output
go test -v ./...
```
If the `core` CLI is available:
```bash
core go test
core go test --run TestAPIClient_Do_Good_Success
```
### Test Organisation
Tests follow the `_Good`, `_Bad`, `_Ugly` suffix convention:
| Suffix | Purpose | Example |
|--------|---------|---------|
| `_Good` | Happy path -- expected successful behaviour | `TestAPIClient_Do_Good_Success` |
| `_Bad` | Expected error conditions -- invalid input, auth failures, exhausted retries | `TestAPIClient_Do_Bad_ClientError` |
| `_Ugly` | Edge cases -- context cancellation, malformed data, panics | `TestAPIClient_Do_Ugly_ContextCancelled` |
### Test Approach
All API client tests use `net/http/httptest.Server` to mock HTTP responses. No real API calls are made during tests. The test servers simulate:
- Successful JSON responses
- HTTP error codes (400, 401, 403, 404, 500, 502, 503)
- Rate limiting (429 with `Retry-After` header)
- Transient failures that succeed after retries
- Authentication verification (bearer tokens, basic auth, query parameters)
The config tests use `Discover()` to find a real `infra.yaml` in parent directories (skipped if not present) and also test error paths with nonexistent and malformed files.
### Test Coverage by File
| File | Tests | Coverage Focus |
|------|-------|----------------|
| `client_test.go` | 20 tests | Constructor defaults/options, `Do` JSON decoding, `DoRaw` raw responses, retry on 5xx, no retry on 4xx, rate-limit handling, context cancellation, `parseRetryAfter`, integration with HCloud/CloudNS clients |
| `hetzner_test.go` | 10 tests | HCloud/HRobot constructors, `ListServers`, JSON deserialisation of servers/load balancers/Robot servers, auth header verification, error responses |
| `cloudns_test.go` | 16 tests | Constructor, auth params, raw HTTP calls, zone/record JSON parsing, CRUD round-trips, ACME challenge helpers, `EnsureRecord` logic (already correct / needs update / needs create), edge cases (empty body, empty map) |
| `config_test.go` | 4 tests | `Load` with real config, missing file, invalid YAML, `expandPath` with tilde/absolute/relative paths |
## Code Style
- **UK English** in all documentation, comments, and user-facing strings (colour, organisation, centre, serialisation).
- **Strict typing** -- all function parameters and return values have explicit types.
- **Error wrapping** -- use `fmt.Errorf("context: %w", err)` to preserve error chains.
- **Formatting** -- standard `gofmt`. Run `go fmt ./...` or `core go fmt` before committing.
## Adding a New Provider Client
To add support for a new infrastructure provider:
1. Create a new file (e.g. `vultr.go`) in the package root.
2. Define a client struct that embeds or holds an `*APIClient`:
```go
type VultrClient struct {
	apiKey  string
	baseURL string
	api     *APIClient
}

func NewVultrClient(apiKey string) *VultrClient {
	c := &VultrClient{
		apiKey:  apiKey,
		baseURL: "https://api.vultr.com/v2",
	}
	c.api = NewAPIClient(
		WithAuth(func(req *http.Request) {
			req.Header.Set("Authorization", "Bearer "+c.apiKey)
		}),
		WithPrefix("vultr API"),
	)
	return c
}
```
3. Add internal helper methods (`get`, `post`, `delete`) that delegate to `c.api.Do(req, result)`.
4. Write tests using `httptest.NewServer` -- never call real APIs in tests.
5. Follow the `_Good`/`_Bad`/`_Ugly` test naming convention.
## Adding CLI Commands
CLI commands live in subdirectories of `cmd/`. Each command package:
1. Calls `cli.RegisterCommands(AddXyzCommands)` in an `init()` function (see `cmd/prod/cmd_commands.go`).
2. Defines a root `*cli.Command` with subcommands.
3. Uses `loadConfig()` to auto-discover `infra.yaml` when needed.
The `core` binary picks up these commands via blank imports in its main package.
## Project Structure
```
go-infra/
  client.go              Shared APIClient
  client_test.go         APIClient tests (20 tests)
  config.go              YAML config types + parser
  config_test.go         Config tests (4 tests)
  hetzner.go             HCloudClient + HRobotClient
  hetzner_test.go        Hetzner tests (10 tests)
  cloudns.go             CloudNSClient
  cloudns_test.go        CloudNS tests (16 tests)
  cmd/
    prod/
      cmd_commands.go    Command registration
      cmd_prod.go        Root 'prod' command + flags
      cmd_status.go      Parallel host health checks
      cmd_setup.go       Foundation setup pipeline (discover, lb, dns)
      cmd_dns.go         DNS record management
      cmd_lb.go          Load balancer management
      cmd_ssh.go         SSH into production hosts
    monitor/
      cmd_commands.go    Command registration
      cmd_monitor.go     Security finding aggregation
  go.mod
  go.sum
  CLAUDE.md
```
## Licence
EUPL-1.2

docs/index.md Normal file

@ -0,0 +1,122 @@
---
title: go-infra
description: Infrastructure provider API clients and YAML-based configuration for managing production environments.
---
# go-infra
`forge.lthn.ai/core/go-infra` provides typed Go clients for infrastructure provider APIs (Hetzner Cloud, Hetzner Robot, CloudNS) and a declarative YAML configuration layer for describing production topology. It also ships CLI commands for production management (`core prod`) and security monitoring (`core monitor`).
The library has no framework dependencies beyond the Go standard library, YAML parsing, and testify for tests. All HTTP communication goes through a shared `APIClient` that handles retries, exponential backoff, and rate-limit compliance automatically.
## Module Path
```
forge.lthn.ai/core/go-infra
```
Requires **Go 1.26+**.
## Quick Start
### Using the API Clients Directly
```go
import (
	"os"

	"forge.lthn.ai/core/go-infra"
)

// ctx is a context.Context supplied by the caller; error checks are omitted for brevity.

// Hetzner Cloud -- list all servers
hc := infra.NewHCloudClient(os.Getenv("HCLOUD_TOKEN"))
servers, err := hc.ListServers(ctx)

// Hetzner Robot -- list dedicated servers
hr := infra.NewHRobotClient(user, password)
dedicated, err := hr.ListServers(ctx)

// CloudNS -- ensure a DNS record exists
dns := infra.NewCloudNSClient(authID, authPassword)
changed, err := dns.EnsureRecord(ctx, "example.com", "www", "A", "1.2.3.4", 300)
```
### Loading Infrastructure Configuration
```go
import (
	"fmt"

	"forge.lthn.ai/core/go-infra"
)

// Auto-discover infra.yaml by walking up from the current directory
cfg, path, err := infra.Discover(".")

// Or load a specific file (plain assignment: cfg and err are already declared)
cfg, err = infra.Load("/path/to/infra.yaml")

// Query the configuration
appServers := cfg.AppServers()
for name, host := range appServers {
	fmt.Printf("%s: %s (%s)\n", name, host.IP, host.Role)
}
```
### CLI Commands
When registered with the `core` CLI binary, go-infra provides two command groups:
```bash
# Production infrastructure management
core prod status # Health check all hosts, services, and load balancer
core prod setup # Phase 1 foundation: discover topology, create LB, configure DNS
core prod setup --dry-run # Preview what setup would do
core prod setup --step=dns # Run a single setup step
core prod dns list # List DNS records for a zone
core prod dns set www A 1.2.3.4 # Create or update a DNS record
core prod lb status # Show load balancer status and target health
core prod lb create # Create load balancer from infra.yaml
core prod ssh noc # SSH into a named host
# Security monitoring (aggregates GitHub Security findings)
core monitor # Scan current repo
core monitor --all # Scan all repos in registry
core monitor --repo core-php # Scan a specific repo
core monitor --severity high # Filter by severity
core monitor --json # JSON output
```
## Package Layout
| Path | Description |
|------|-------------|
| `client.go` | Shared HTTP API client with retry, exponential backoff, and rate-limit handling |
| `config.go` | YAML infrastructure configuration parser and typed config structs |
| `hetzner.go` | Hetzner Cloud API (servers, load balancers, snapshots) and Hetzner Robot API (dedicated servers) |
| `cloudns.go` | CloudNS DNS API (zones, records, ACME challenge helpers) |
| `cmd/prod/` | CLI commands for production infrastructure management (`core prod`) |
| `cmd/monitor/` | CLI commands for security finding aggregation (`core monitor`) |
## Dependencies
### Direct
| Module | Purpose |
|--------|---------|
| `forge.lthn.ai/core/cli` | CLI framework (cobra-based command registration) |
| `forge.lthn.ai/core/go-ansible` | SSH client used by `core prod status` for host health checks |
| `forge.lthn.ai/core/go-i18n` | Internationalisation strings for monitor command |
| `forge.lthn.ai/core/go-io` | Filesystem abstraction used by monitor's registry lookup |
| `forge.lthn.ai/core/go-log` | Structured error logging |
| `forge.lthn.ai/core/go-scm` | Repository registry for multi-repo monitoring |
| `gopkg.in/yaml.v3` | YAML parsing for `infra.yaml` |
| `github.com/stretchr/testify` | Test assertions |
The core library types (`config.go`, `client.go`, `hetzner.go`, `cloudns.go`) only depend on the standard library and `gopkg.in/yaml.v3`. The heavier dependencies (`cli`, `go-ansible`, `go-scm`, etc.) are confined to the `cmd/` packages.
## Environment Variables
| Variable | Used by | Description |
|----------|---------|-------------|
| `HCLOUD_TOKEN` | `prod setup`, `prod status`, `prod lb` | Hetzner Cloud API bearer token |
| `HETZNER_ROBOT_USER` | `prod setup` | Hetzner Robot API username |
| `HETZNER_ROBOT_PASS` | `prod setup` | Hetzner Robot API password |
| `CLOUDNS_AUTH_ID` | `prod setup`, `prod dns` | CloudNS sub-auth user ID |
| `CLOUDNS_AUTH_PASSWORD` | `prod setup`, `prod dns` | CloudNS auth password |
## Licence
EUPL-1.2