Snider 42024ef476 docs: add human-friendly documentation
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-11 13:02:40 +00:00

title: Architecture
description: Internal design of go-infra -- shared HTTP client, provider clients, configuration model, and CLI command structure.

Architecture

go-infra is organised into four layers: a shared HTTP client, provider-specific API clients, a declarative configuration parser, and CLI commands that tie them together.

cmd/prod/        CLI commands (setup, status, dns, lb, ssh)
cmd/monitor/     CLI commands (security finding aggregation)
    |
    v
config.go        YAML config parser (infra.yaml)
hetzner.go       Hetzner Cloud + Robot API clients
cloudns.go       CloudNS DNS API client
    |
    v
client.go        Shared APIClient (retry, backoff, rate-limit)
    |
    v
net/http          Go standard library

Shared HTTP Client (client.go)

All provider-specific clients delegate HTTP requests to APIClient, which provides:

  • Exponential backoff with jitter -- retries on 5xx errors and network failures
  • Rate-limit compliance -- honours Retry-After headers on 429 responses
  • Configurable authentication -- each provider injects its own auth function
  • Context-aware cancellation -- all waits respect context.Context deadlines

Key Types

type APIClient struct {
    client       *http.Client
    retry        RetryConfig
    authFn       func(req *http.Request)
    prefix       string          // error message prefix, e.g. "hcloud API"
    mu           sync.Mutex
    blockedUntil time.Time       // rate-limit backoff window
}

type RetryConfig struct {
    MaxRetries     int           // 0 = no retries
    InitialBackoff time.Duration // delay before first retry
    MaxBackoff     time.Duration // upper bound on backoff duration
}

Configuration via Options

APIClient uses the functional options pattern:

client := infra.NewAPIClient(
    infra.WithHTTPClient(customHTTPClient),
    infra.WithAuth(func(req *http.Request) {
        req.Header.Set("Authorization", "Bearer "+token)
    }),
    infra.WithRetry(infra.RetryConfig{
        MaxRetries:     5,
        InitialBackoff: 200 * time.Millisecond,
        MaxBackoff:     10 * time.Second,
    }),
    infra.WithPrefix("my-api"),
)

Default configuration (from DefaultRetryConfig()): 3 retries, 100ms initial backoff, 5s maximum backoff.
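
The option functions above can be pictured with the following minimal sketch of the functional options pattern. It is not the go-infra source: the Option type name, the default prefix "api", and the use of http.DefaultClient as the fallback client are assumptions made for illustration. Only the option names and the default retry values come from this document.

```go
package main

import (
    "fmt"
    "net/http"
    "time"
)

// RetryConfig mirrors the fields documented under Key Types.
type RetryConfig struct {
    MaxRetries     int
    InitialBackoff time.Duration
    MaxBackoff     time.Duration
}

// APIClient is a trimmed-down stand-in for the real type.
type APIClient struct {
    client *http.Client
    retry  RetryConfig
    authFn func(req *http.Request)
    prefix string
}

// Option is a hypothetical name for the functional option type.
type Option func(*APIClient)

// WithRetry replaces the retry settings.
func WithRetry(cfg RetryConfig) Option {
    return func(c *APIClient) { c.retry = cfg }
}

// WithPrefix sets the error message prefix.
func WithPrefix(p string) Option {
    return func(c *APIClient) { c.prefix = p }
}

// DefaultRetryConfig returns the defaults stated in this document.
func DefaultRetryConfig() RetryConfig {
    return RetryConfig{
        MaxRetries:     3,
        InitialBackoff: 100 * time.Millisecond,
        MaxBackoff:     5 * time.Second,
    }
}

// NewAPIClient starts from defaults, then applies each option in order.
// The default prefix and HTTP client here are assumptions.
func NewAPIClient(opts ...Option) *APIClient {
    c := &APIClient{
        client: http.DefaultClient,
        retry:  DefaultRetryConfig(),
        prefix: "api",
    }
    for _, opt := range opts {
        opt(c)
    }
    return c
}

func main() {
    c := NewAPIClient(WithPrefix("my-api"))
    fmt.Println(c.prefix, c.retry.MaxRetries, c.retry.MaxBackoff)
}
```

Because options are applied in order, a later option overrides an earlier one that touches the same field, and any field without an option keeps its default.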

Request Flow

The Do(req, result) and DoRaw(req) methods follow this flow for each attempt:

  1. Rate-limit check -- if a previous 429 response set blockedUntil, wait until that time passes (or the context is cancelled).
  2. Apply authentication -- call authFn(req) to inject credentials.
  3. Execute request -- send via the underlying http.Client.
  4. Handle response:
    • 429 Too Many Requests -- parse Retry-After header, set blockedUntil, and retry.
    • 5xx Server Error -- retryable; sleep with exponential backoff + jitter.
    • 4xx Client Error (except 429) -- not retried; return error immediately.
    • 2xx Success -- if result is non-nil, JSON-decode the body into it.
  5. If all attempts are exhausted, return the last error.
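
The response handling in step 4 amounts to a small decision table. The sketch below shows one way to express it; the outcome names, the classify and retryAfter helpers, and the one-second fallback are hypothetical and not taken from client.go. It also assumes Retry-After carries a number of seconds; the header may instead hold an HTTP date, which this sketch does not parse.

```go
package main

import (
    "fmt"
    "strconv"
    "time"
)

// outcome describes what the retry loop should do with a response.
type outcome int

const (
    outcomeSuccess     outcome = iota // 2xx: decode and return
    outcomeRateLimited                // 429: wait for Retry-After, then retry
    outcomeRetry                      // 5xx: back off with jitter, then retry
    outcomeFail                       // anything else: return an error immediately
)

// classify maps an HTTP status code to an outcome.
func classify(status int) outcome {
    switch {
    case status == 429:
        return outcomeRateLimited
    case status >= 500:
        return outcomeRetry
    case status >= 200 && status < 300:
        return outcomeSuccess
    default:
        return outcomeFail
    }
}

// retryAfter parses the delta-seconds form of a Retry-After header.
// The one-second fallback for missing or invalid values is an assumption.
func retryAfter(header string) time.Duration {
    secs, err := strconv.Atoi(header)
    if err != nil || secs <= 0 {
        return time.Second
    }
    return time.Duration(secs) * time.Second
}

func main() {
    for _, status := range []int{200, 404, 429, 503} {
        fmt.Println(status, classify(status))
    }
    blockedUntil := time.Now().Add(retryAfter("2"))
    fmt.Println("blocked for", time.Until(blockedUntil).Round(time.Second))
}
```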

The backoff for an attempt is base = min(initialBackoff * 2^attempt, maxBackoff). Jitter then scales base by a random factor between 0.5 and 1.0, so the actual sleep is between 50% and 100% of base.
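
As a rough illustration of that calculation, the function below takes the random value as a parameter so the result is easy to check by hand. The function name and signature are hypothetical, and it assumes attempt counts from 0 for the first retry, which this document does not state.

```go
package main

import (
    "fmt"
    "math/rand"
    "time"
)

// backoff returns the delay for the given attempt.
// r is a random value in [0, 1); it is a parameter so the sketch is deterministic.
func backoff(attempt int, initial, max time.Duration, r float64) time.Duration {
    d := initial
    // Double once per attempt, stopping early at the cap to avoid overflow.
    for i := 0; i < attempt && d < max; i++ {
        d *= 2
    }
    if d > max {
        d = max
    }
    // Scale by a factor between 0.5 and 1.0.
    return time.Duration(float64(d) * (0.5 + 0.5*r))
}

func main() {
    for attempt := 0; attempt < 8; attempt++ {
        d := backoff(attempt, 100*time.Millisecond, 5*time.Second, rand.Float64())
        fmt.Printf("attempt %d: sleep %v\n", attempt, d)
    }
}
```

With the default configuration (100ms initial, 5s maximum), the un-jittered base reaches the cap at attempt 6, since 100ms * 2^6 = 6.4s exceeds 5s.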

Do vs DoRaw

  • Do(req, result) -- decodes the response body as JSON into result. Pass nil for fire-and-forget requests (e.g. DELETE).
  • DoRaw(req) -- returns the raw []byte response body. Used by CloudNS, whose responses need manual parsing due to inconsistent JSON shapes.

Hetzner Clients (hetzner.go)

Two separate clients cover Hetzner's two distinct APIs.

HCloudClient (Hetzner Cloud API)

Manages cloud servers, load balancers, and snapshots via https://api.hetzner.cloud/v1. Uses bearer token authentication.

hc := infra.NewHCloudClient("your-token")

Operations:

| Method | Description |
| --- | --- |
| ListServers(ctx) | List all cloud servers |
| ListLoadBalancers(ctx) | List all load balancers |
| GetLoadBalancer(ctx, id) | Get a load balancer by ID |
| CreateLoadBalancer(ctx, req) | Create a load balancer from a typed request struct |
| DeleteLoadBalancer(ctx, id) | Delete a load balancer by ID |
| CreateSnapshot(ctx, serverID, description) | Create a server snapshot |

Data model hierarchy:

HCloudServer
  +-- HCloudPublicNet --> HCloudIPv4
  +-- []HCloudPrivateNet
  +-- HCloudServerType (name, cores, memory, disk)
  +-- HCloudDatacenter

HCloudLoadBalancer
  +-- HCloudLBPublicNet --> HCloudIPv4
  +-- HCloudLBAlgorithm
  +-- []HCloudLBService
  |     +-- HCloudLBHTTP (optional)
  |     +-- HCloudLBHealthCheck --> HCloudLBHCHTTP (optional)
  +-- []HCloudLBTarget
        +-- HCloudLBTargetIP (optional)
        +-- HCloudLBTargetServer (optional)
        +-- []HCloudLBHealthStatus

HRobotClient (Hetzner Robot API)

Manages dedicated (bare-metal) servers via https://robot-ws.your-server.de. Uses HTTP Basic authentication.

hr := infra.NewHRobotClient("user", "password")

Operations:

| Method | Description |
| --- | --- |
| ListServers(ctx) | List all dedicated servers |
| GetServer(ctx, ip) | Get a server by IP address |

The Robot API wraps each server object in a {"server": {...}} envelope. HRobotClient unwraps this automatically.
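
The unwrapping step can be sketched as below. The helper name unwrapServers and the two struct fields shown are illustrative assumptions, not the actual HRobotClient code; the real server type carries more fields than this.

```go
package main

import (
    "encoding/json"
    "fmt"
)

// HRobotServer holds a small, illustrative subset of fields.
type HRobotServer struct {
    ServerIP   string `json:"server_ip"`
    ServerName string `json:"server_name"`
}

// unwrapServers decodes a list of {"server": {...}} envelopes
// and returns the inner server objects.
func unwrapServers(body []byte) ([]HRobotServer, error) {
    var envelopes []struct {
        Server HRobotServer `json:"server"`
    }
    if err := json.Unmarshal(body, &envelopes); err != nil {
        return nil, fmt.Errorf("decode robot servers: %w", err)
    }
    servers := make([]HRobotServer, 0, len(envelopes))
    for _, e := range envelopes {
        servers = append(servers, e.Server)
    }
    return servers, nil
}

func main() {
    body := []byte(`[{"server":{"server_ip":"203.0.113.10","server_name":"db-1"}}]`)
    servers, err := unwrapServers(body)
    fmt.Println(servers, err)
}
```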

CloudNS Client (cloudns.go)

Manages DNS zones and records via https://api.cloudns.net. Uses query-parameter authentication (auth-id + auth-password).

dns := infra.NewCloudNSClient("12345", "password")

Operations:

| Method | Description |
| --- | --- |
| ListZones(ctx) | List all DNS zones |
| ListRecords(ctx, domain) | List all records in a zone (returns map[id]CloudNSRecord) |
| CreateRecord(ctx, domain, host, type, value, ttl) | Create a record; returns the new record ID |
| UpdateRecord(ctx, domain, id, host, type, value, ttl) | Update an existing record |
| DeleteRecord(ctx, domain, id) | Delete a record by ID |
| EnsureRecord(ctx, domain, host, type, value, ttl) | Idempotent create-or-update; returns whether a change was made |
| SetACMEChallenge(ctx, domain, value) | Create a _acme-challenge TXT record with 60s TTL |
| ClearACMEChallenge(ctx, domain) | Delete all _acme-challenge TXT records in a zone |
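
The idempotent behaviour of EnsureRecord can be thought of as a decision over the records already in the zone. The sketch below shows that decision in isolation, without any API calls. The Record type, the planEnsure helper, and the matching rule (same host and type identifies the record, same value means no change) are assumptions; this document does not say whether the real implementation also compares TTL, and the sketch assumes at most one existing record per host and type.

```go
package main

import "fmt"

// Record is a simplified stand-in for a DNS record.
type Record struct {
    Host, Type, Value string
    TTL               int
}

// planEnsure decides what an idempotent create-or-update would do.
// It returns "none", "update", or "create", plus the ID of the matched record if any.
func planEnsure(existing map[string]Record, want Record) (action, id string) {
    for rid, rec := range existing {
        if rec.Host == want.Host && rec.Type == want.Type {
            if rec.Value == want.Value {
                return "none", rid
            }
            return "update", rid
        }
    }
    return "create", ""
}

func main() {
    zone := map[string]Record{
        "101": {Host: "www", Type: "A", Value: "203.0.113.10", TTL: 300},
    }
    fmt.Println(planEnsure(zone, Record{Host: "www", Type: "A", Value: "203.0.113.10", TTL: 300}))
    fmt.Println(planEnsure(zone, Record{Host: "www", Type: "A", Value: "203.0.113.20", TTL: 300}))
    fmt.Println(planEnsure(zone, Record{Host: "api", Type: "A", Value: "203.0.113.30", TTL: 300}))
}
```

Running the same call twice with identical arguments yields "none" the second time, which is what makes the operation safe to repeat.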

CloudNS quirks handled internally:

  • Empty zone lists come back as {} (an object) instead of [] (an array). ListZones treats this as an empty list rather than a decode error.
  • All mutations use POST with query parameters (not request bodies).
  • Response status is checked via a "status": "Success" field in the JSON body, not HTTP status codes alone.
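
The first quirk is the kind of thing a plain json.Unmarshal into a slice would reject. A minimal sketch of a tolerant decoder follows; the parseZones helper and the two struct fields are assumptions for illustration, not the code in cloudns.go.

```go
package main

import (
    "bytes"
    "encoding/json"
    "fmt"
)

// CloudNSZone holds an illustrative subset of zone fields.
type CloudNSZone struct {
    Name string `json:"name"`
    Type string `json:"type"`
}

// parseZones accepts either a JSON array of zones or an empty JSON object.
func parseZones(body []byte) ([]CloudNSZone, error) {
    trimmed := bytes.TrimSpace(body)
    if len(trimmed) > 0 && trimmed[0] == '{' {
        // An object is only acceptable when it is empty, meaning "no zones".
        var obj map[string]json.RawMessage
        if err := json.Unmarshal(trimmed, &obj); err != nil {
            return nil, fmt.Errorf("decode zones: %w", err)
        }
        if len(obj) == 0 {
            return nil, nil
        }
        return nil, fmt.Errorf("decode zones: unexpected object response")
    }
    var zones []CloudNSZone
    if err := json.Unmarshal(trimmed, &zones); err != nil {
        return nil, fmt.Errorf("decode zones: %w", err)
    }
    return zones, nil
}

func main() {
    empty, err := parseZones([]byte(`{}`))
    fmt.Println(len(empty), err)
    zones, err := parseZones([]byte(`[{"name":"example.com","type":"master"}]`))
    fmt.Println(zones, err)
}
```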

Configuration Model (config.go)

The Config struct represents the full infrastructure topology, parsed from an infra.yaml file. It covers:

Config
  +-- Hosts (map[string]*Host)          Servers with SSH details, role, and services
  +-- LoadBalancer                       Hetzner managed LB (name, type, backends, listeners, health)
  +-- Network                           Private network CIDR
  +-- DNS                               Provider config + zone records
  +-- SSL                               Wildcard certificate settings
  +-- Database                          Galera/MariaDB cluster nodes + backup config
  +-- Cache                             Redis/Dragonfly cluster nodes
  +-- Containers (map[string]*Container) Container deployments (image, replicas, depends_on)
  +-- S3                                Object storage endpoint + buckets
  +-- CDN                               CDN provider and zones
  +-- CICD                              CI/CD provider, runner, registry
  +-- Monitoring                        Health endpoints and alert thresholds
  +-- Backups                           Daily and weekly backup jobs

Loading

Two functions load configuration:

  • Load(path) -- reads and parses a specific file. Expands ~ in SSH key paths and defaults SSH port to 22.
  • Discover(startDir) -- walks up from startDir looking for infra.yaml, then calls Load. Returns the config, the path found, and any error.
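
The upward walk performed by Discover can be sketched as follows. This is a simplified stand-in that returns only the path and does not parse the file; the lowercase discover name and the error text are hypothetical.

```go
package main

import (
    "fmt"
    "os"
    "path/filepath"
)

// discover walks from startDir towards the filesystem root
// and returns the first infra.yaml it finds.
func discover(startDir string) (string, error) {
    dir, err := filepath.Abs(startDir)
    if err != nil {
        return "", err
    }
    for {
        candidate := filepath.Join(dir, "infra.yaml")
        if info, err := os.Stat(candidate); err == nil && !info.IsDir() {
            return candidate, nil
        }
        parent := filepath.Dir(dir)
        if parent == dir {
            // Reached the root without finding the file.
            return "", fmt.Errorf("infra.yaml not found in %s or any parent directory", startDir)
        }
        dir = parent
    }
}

func main() {
    // Build a throwaway tree: root/infra.yaml and root/a/b.
    root, err := os.MkdirTemp("", "infra-demo")
    if err != nil {
        panic(err)
    }
    defer os.RemoveAll(root)

    nested := filepath.Join(root, "a", "b")
    if err := os.MkdirAll(nested, 0o755); err != nil {
        panic(err)
    }
    if err := os.WriteFile(filepath.Join(root, "infra.yaml"), []byte("# placeholder\n"), 0o644); err != nil {
        panic(err)
    }

    found, err := discover(nested)
    fmt.Println(found, err)
}
```

Starting the search from a nested directory lets commands run from anywhere inside a project tree without an explicit --config flag.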

Host Queries

// Get all hosts with a specific role
appServers := cfg.HostsByRole("app")

// Shorthand for role="app" (equivalent to the call above)
appServers = cfg.AppServers()
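
A role query of this kind is a simple filter over the Hosts map. The sketch below assumes the methods return a map keyed by host name and that Host has Role and FQDN fields; the actual return type and field set are not specified in this document.

```go
package main

import "fmt"

// Host is a reduced stand-in; the FQDN and Role fields are assumed.
type Host struct {
    FQDN string
    Role string
}

// Config carries only the Hosts map for this sketch.
type Config struct {
    Hosts map[string]*Host
}

// HostsByRole returns every host whose role matches, keyed by host name.
func (c *Config) HostsByRole(role string) map[string]*Host {
    out := make(map[string]*Host)
    for name, h := range c.Hosts {
        if h.Role == role {
            out[name] = h
        }
    }
    return out
}

// AppServers is shorthand for HostsByRole("app").
func (c *Config) AppServers() map[string]*Host {
    return c.HostsByRole("app")
}

func main() {
    cfg := &Config{Hosts: map[string]*Host{
        "web-1": {FQDN: "web-1.example.com", Role: "app"},
        "web-2": {FQDN: "web-2.example.com", Role: "app"},
        "db-1":  {FQDN: "db-1.example.com", Role: "db"},
    }}
    fmt.Println(len(cfg.AppServers()), len(cfg.HostsByRole("db")))
}
```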

CLI Commands

core prod (cmd/prod/)

The production command group reads infra.yaml (auto-discovered or specified via --config) and provides:

| Subcommand | Description |
| --- | --- |
| status | Parallel SSH health check of all hosts. Checks Docker, Galera cluster size, Redis, Traefik, Coolify, Forgejo runner. Also queries Hetzner Cloud for load balancer health if HCLOUD_TOKEN is set. |
| setup | Runs a three-step foundation pipeline: discover (enumerate Hetzner Cloud + Robot servers), lb (create load balancer from config), dns (ensure DNS records via CloudNS). Supports --dry-run and --step for partial runs. |
| dns list [zone] | List DNS records for a zone (defaults to host.uk.com). |
| dns set <host> <type> <value> | Idempotent create-or-update of a DNS record. |
| lb status | Display load balancer details and per-target health status. |
| lb create | Create the load balancer defined in infra.yaml. |
| ssh <host> | Look up a host by name in infra.yaml and exec into an SSH session. |

The status command uses go-ansible's SSHClient to connect to each host in parallel, then runs shell commands to probe service state (Docker containers, MariaDB cluster, Redis ping, etc.).
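
The parallel fan-out can be illustrated without SSH by substituting a probe function for the real connection. Everything in this sketch (the result type, checkAll, errDown, and the probe signature) is hypothetical; the actual command uses go-ansible's SSHClient and runs shell commands on each host.

```go
package main

import (
    "errors"
    "fmt"
    "sync"
)

// errDown is a placeholder error for an unreachable host.
var errDown = errors.New("host unreachable")

// result records the outcome of probing one host.
type result struct {
    Host string
    OK   bool
    Err  error
}

// checkAll runs probe against every host concurrently
// and returns results in the same order as hosts.
func checkAll(hosts []string, probe func(host string) error) []result {
    results := make([]result, len(hosts))
    var wg sync.WaitGroup
    for i, h := range hosts {
        wg.Add(1)
        go func(i int, h string) {
            defer wg.Done()
            err := probe(h)
            // Each goroutine writes only its own slot, so no mutex is needed.
            results[i] = result{Host: h, OK: err == nil, Err: err}
        }(i, h)
    }
    wg.Wait()
    return results
}

func main() {
    probe := func(host string) error {
        if host == "db-1" {
            return errDown
        }
        return nil
    }
    for _, r := range checkAll([]string{"web-1", "web-2", "db-1"}, probe) {
        fmt.Printf("%-6s ok=%v err=%v\n", r.Host, r.OK, r.Err)
    }
}
```

Writing results into a pre-sized slice by index keeps the output order stable regardless of which probe finishes first.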

core monitor (cmd/monitor/)

Aggregates security findings from GitHub's Security tab using the gh CLI:

  • Code scanning alerts -- from Semgrep, Trivy, Gitleaks, CodeQL, etc.
  • Dependabot alerts -- dependency vulnerability alerts.
  • Secret scanning alerts -- exposed secrets/credentials (always classified as critical).

Findings are normalised to a common Finding struct, sorted by severity (critical first), and output as either a formatted table or JSON.
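
The severity ordering can be sketched with a rank table and a stable sort. The Finding fields shown, the rank values, and the treatment of unknown severities (sorted last) are assumptions; the real struct in cmd/monitor/ may differ.

```go
package main

import (
    "fmt"
    "sort"
    "strings"
)

// Finding is a reduced stand-in for the common finding type.
type Finding struct {
    Source   string
    Severity string
    Title    string
}

// severityRank orders severities from most to least urgent.
var severityRank = map[string]int{
    "critical": 0,
    "high":     1,
    "medium":   2,
    "low":      3,
}

// rank returns the sort position for a severity; unknown values sort last.
func rank(severity string) int {
    if r, ok := severityRank[strings.ToLower(severity)]; ok {
        return r
    }
    return len(severityRank)
}

// sortFindings orders findings in place, critical first.
// The sort is stable, so findings of equal severity keep their input order.
func sortFindings(findings []Finding) {
    sort.SliceStable(findings, func(i, j int) bool {
        return rank(findings[i].Severity) < rank(findings[j].Severity)
    })
}

func main() {
    findings := []Finding{
        {Source: "dependabot", Severity: "low", Title: "outdated dependency"},
        {Source: "secret-scanning", Severity: "critical", Title: "exposed token"},
        {Source: "code-scanning", Severity: "high", Title: "SQL injection"},
    }
    sortFindings(findings)
    for _, f := range findings {
        fmt.Printf("%-9s %-16s %s\n", f.Severity, f.Source, f.Title)
    }
}
```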

Licence

EUPL-1.2