From 630f1d5d6bc75a9a41ed8a73bb4576bc30ff2fd3 Mon Sep 17 00:00:00 2001
From: Snider <snider@host.uk.com>
Date: Wed, 25 Mar 2026 12:52:42 +0000
Subject: [PATCH] =?UTF-8?q?feat(rfc):=20Pass=20Seven=20=E2=80=94=20failure?=
 =?UTF-8?q?=20modes,=20no=20recovery=20on=20most=20paths?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

CRITICAL findings:
P7-1: New() returns half-built Core on option failure (no error return)
P7-2: ServiceStartup fails → os.Exit(1) → no cleanup → resource leak
P7-3: ACTION handlers have NO panic recovery (PerformAsync does)
P7-4: Run() has no defer — panic skips ServiceShutdown
P7-5: os.Exit(1) bypasses ALL defers — even if we add them

Additional:
P7-6: Shutdown context timeout stops remaining service cleanup
P7-7: SafeGo exists but nobody uses it — the safety primitive is unwired
P7-8: No circuit breaker — broken handlers called forever

The error path through Core is: log and crash. No rollback, no cleanup,
no recovery. Every failure mode ends in resource leaks or silent state
corruption. The fix is: defer shutdown always, wrap handlers in recover,
stop calling os.Exit from inside Core.

Seven passes, 56 findings, 2,760+ lines.

Co-Authored-By: Virgil <virgil@lethean.io>
---
 docs/RFC.md | 199 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 199 insertions(+)

diff --git a/docs/RFC.md b/docs/RFC.md
index e4bb360..f9b9262 100644
--- a/docs/RFC.md
+++ b/docs/RFC.md
@@ -2521,6 +2521,205 @@ go-process's `OnShutdown` sends SIGTERM to all running processes. But `ServiceSh
 
 ---
 
+## Pass Seven — Failure Modes and Recovery
+
+> Seventh review. What happens when things go wrong? Panics, partial failures,
+> resource leaks, unrecoverable states.
+
+### P7-1. New() Returns Half-Built Core on Option Failure
+
+```go
+func New(opts ...CoreOption) *Core {
+    c := &Core{...}
+    for _, opt := range opts {
+        if r := opt(c); !r.OK {
+            Error("core.New failed", "err", r.Value)
+            break  // stops applying options, but returns c anyway
+        }
+    }
+    c.LockApply()
+    return c  // half-built: some services registered, some not
+}
+```
+
+If `WithService` #3 fails, services #1 and #2 are registered but #3-5 are not. The caller gets a `*Core` with no indication it's incomplete. No error return. Just a log message.
+
+**Resolution:** `New()` should either:
+- Return `(*Core, error)` — breaking change but honest
+- Store the error on Core: `c.Error().HasFailed()` — queryable
+- Panic on construction failure — it's unrecoverable anyway
+
+### P7-2. ServiceStartup Fails — No Rollback, No Cleanup
+
+If service 3 of 5 fails `OnStartup`:
+- Services 1+2 are running (started successfully)
+- Service 3 failed
+- Services 4+5 never started
+- `Run()` calls `os.Exit(1)`
+- Services 1+2 never get `OnShutdown` called
+- Open files, goroutines, connections leak
+
+```go
+func (c *Core) Run() {
+    r := c.ServiceStartup(c.context, nil)
+    if !r.OK {
+        Error(err.Error())
+        os.Exit(1)  // no cleanup — started services leak
+    }
+```
+
+**Resolution:** `Run()` should call `ServiceShutdown` even on startup failure — to clean up services that DID start:
+
+```go
+r := c.ServiceStartup(c.context, nil)
+if !r.OK {
+    c.ServiceShutdown(context.Background())  // cleanup started services
+    os.Exit(1)
+}
+```
+
+### P7-3. ACTION Handlers Have No Panic Recovery
+
+```go
+func (c *Core) Action(msg Message) Result {
+    for _, h := range handlers {
+        if r := h(c, msg); !r.OK {  // if h panics → unrecovered
+            return r
+        }
+    }
+}
+```
+
+`PerformAsync` has `defer recover()`. ACTION handlers do not. A single panicking handler crashes the entire application. Given that handlers run user-supplied code (service event handlers), this is high risk.
+
+**Resolution:** Wrap each handler call in panic recovery:
+
+```go
+for _, h := range handlers {
+    func() {
+        defer func() {
+            if r := recover(); r != nil {
+                Error("ACTION handler panicked", "err", r, "msg", msg)
+            }
+        }()
+        h(c, msg)
+    }()
+}
+```
+
+This matches PerformAsync's pattern. A panicking handler is logged and skipped, not fatal.
+
+### P7-4. Run() Has No Panic Recovery — Shutdown Skipped
+
+```go
+func (c *Core) Run() {
+    // no defer recover()
+    c.ServiceStartup(...)
+    cli.Run()               // if this panics...
+    c.ServiceShutdown(...)  // ...this never runs
+}
+```
+
+If `Cli.Run()` panics (user command handler panics), `ServiceShutdown` is never called. All services leak. PID files aren't cleaned. Health endpoints keep responding.
+
+**Resolution:** `Run()` needs `defer c.ServiceShutdown(context.Background())`:
+
+```go
+func (c *Core) Run() {
+    defer c.ServiceShutdown(context.Background())
+    // ... rest of Run
+}
+```
+
+### P7-5. os.Exit(1) Called Directly — Bypasses defer
+
+`Run()` calls `os.Exit(1)` twice. `os.Exit` bypasses ALL defer statements. Even if we add `defer ServiceShutdown`, `os.Exit(1)` skips it.
+
+```go
+r := c.ServiceStartup(...)
+if !r.OK {
+    os.Exit(1)  // skips all defers
+}
+```
+
+**Resolution:** Don't call `os.Exit` inside `Run()`. Return an error and let `main()` handle the exit:
+
+```go
+func (c *Core) Run() error {
+    defer c.ServiceShutdown(context.Background())
+    // ...
+    if !r.OK { return r.Value.(error) }
+    return nil
+}
+
+// In main():
+if err := c.Run(); err != nil {
+    os.Exit(1)
+}
+```
+
+Or use `c.Error().Recover()` + panic instead of os.Exit — panics respect defer.
+
+### P7-6. ServiceShutdown Continues After First Error But May Miss Services
+
+```go
+var firstErr error
+for _, s := range stoppables {
+    r := s.OnStop()
+    if !r.OK && firstErr == nil {
+        firstErr = e  // records first error
+    }
+}
+```
+
+Good — it continues calling OnStop even after failures. But it checks `ctx.Err()` in the loop:
+
+```go
+if err := ctx.Err(); err != nil {
+    return Result{err, false}  // stops remaining services
+}
+```
+
+If the shutdown context has a timeout and it expires, remaining services never get OnStop. Combined with P7-4 (no panic recovery), a slow-to-stop service can prevent later services from stopping.
+
+**Resolution:** Each service gets its own shutdown timeout, not a shared context:
+
+```go
+for _, s := range stoppables {
+    svcCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
+    s.OnStop(svcCtx)
+    cancel()
+}
+```
+
+### P7-7. ErrorPanic.SafeGo Exists but Nobody Uses It
+
+```go
+func (h *ErrorPanic) SafeGo(fn func()) {
+    go func() {
+        defer h.Recover()
+        fn()
+    }()
+}
+```
+
+A protected goroutine launcher exists. But `PerformAsync` doesn't use it — it has its own inline `defer recover()`. `Run()` doesn't use it. `ACTION` dispatch doesn't use it. The safety primitive exists but isn't wired into the system that needs it most.
+
+**Resolution:** `SafeGo` should be the standard way to spawn goroutines in Core. `PerformAsync`, `Run()`, and ACTION handlers should all use `c.Error().SafeGo()` or its equivalent. One pattern for protected goroutine spawning.
+
+### P7-8. No Circuit Breaker on IPC Handlers
+
+If an ACTION handler consistently fails (returns !OK or panics), it's still called for every message forever. There's no:
+- Tracking of handler failure rate
+- Automatic disabling of broken handlers
+- Alert that a handler is failing
+
+Combined with P6-1 (cascade), a consistently failing handler in the middle of the pipeline blocks everything downstream on every event.
+
+**Resolution:** The Action system (Section 18) should track failure rate per action. After N consecutive failures, the action is suspended with an alert. This is the same rate-limiting pattern that core/agent uses for agent dispatch (`trackFailureRate` in dispatch.go).
+
+---
+
 ## Versioning
 
 ### Release Model