diff --git a/pkg/core/REVIEW.md b/pkg/core/REVIEW.md new file mode 100644 index 0000000..b86af88 --- /dev/null +++ b/pkg/core/REVIEW.md @@ -0,0 +1,370 @@ +# Code Review: pkg/core/ — Comprehensive Analysis + +**Repository:** core/go +**Status:** Ready for fixes +**Breaking Change Risk:** 14 consumers +**Date:** 2026-03-18 + +--- + +## Executive Summary + +Comprehensive review of `pkg/core/` identified **3 critical**, **4 high**, **3 medium**, and **2 low** severity findings. The most severe issues are: + +1. CLI subsystem is completely non-functional (nil pointers) +2. Race condition in asset retrieval +3. Package-level shared mutex state violates Core isolation +4. Lock ordering hazard between service and i18n subsystems +5. Data race in default logger replacement + +All findings have concrete impact on the 14 consuming modules. Priority fixes are listed at the end. + +--- + +## Critical Findings + +### [CRITICAL] embed.go:69–79 — Concurrent Map Access in GetAsset + +**Severity:** Crash (panic) +**File:** pkg/core/embed.go +**Lines:** 69–79 +**Confidence:** 92% + +The `GetAsset` method releases the `assetGroupsMu` RLock before reading from `g.assets`, creating a race condition with concurrent `AddAsset` calls. + +```go +assetGroupsMu.RLock() +g, ok := assetGroups[group] +assetGroupsMu.RUnlock() // lock released here +// ... +data, ok := g.assets[name] // map read after lock released — RACE +``` + +**Impact:** Any concurrent call to `AddAsset` on the same group can mutate `g.assets` while `GetAsset` reads it, triggering `fatal error: concurrent map read and map write`. + +**Fix:** Hold `assetGroupsMu.RLock()` through both the group lookup and the asset read, or make a copy of the asset while locked. + +--- + +### [CRITICAL] cli.go:* — Uninitialised CLI Subsystem (Nil Pointer Dereference) + +**Severity:** Crash (panic) +**File:** pkg/core/cli.go +**Lines:** 94, 99, 105, 118, 137, 142, 157, 163, 169, 175, 181, 187, 193, 198 +**Confidence:** 95% + +The `Core` constructor at contract.go:96 initialises `cli: &Cli{opts: &CliOpts{}}` with **no `rootCommand`** and **no `bannerFunction`**. Every public `Cli` method that accesses these fields panics: + +- `AddCommand` (cli.go:94) — dereferences `c.rootCommand` +- `PrintBanner` (cli.go:99) — dereferences `c.bannerFunction` +- `Run` (cli.go:105) — dereferences `c.rootCommand` +- `PrintHelp` (cli.go:118) — dereferences `c.rootCommand` and `c.bannerFunction` +- All flag methods (`BoolFlag`, `StringFlag`, `IntFlag`) — dereference `c.rootCommand` + +**Example crash path:** +```go +c, _ := core.New() +c.Cli().AddCommand(...) // panic: nil pointer dereference (rootCommand) +``` + +**Impact:** The CLI subsystem is **completely unusable** in any Core instance created via the standard `New()` constructor. Any consumer attempting to expose CLI commands via `c.Cli()` will crash immediately. + +**Fix:** Either add a `WithCli(name, version, description)` option to initialise the CLI with a root command, or modify `New()` to create a default root command when `Cli` is instantiated. See also command.go nil check issue below. + +--- + +### [CRITICAL] lock.go:12–16 — Package-Level Shared Lock Map Violates Core Isolation + +**Severity:** Logical error (silent cross-instance interference) +**File:** pkg/core/lock.go +**Lines:** 12–16 +**Confidence:** 90% + +The `lockMap` is a package-level global shared across all `Core` instances in the process: + +```go +var ( + lockMu sync.Mutex + lockMap = make(map[string]*sync.RWMutex) // SHARED across all Core instances +) +``` + +Any two `Core` instances calling `c.Lock("srv")` will receive a pointer to the **same underlying `*sync.RWMutex`**. This violates the logical isolation of independent `Core` instances — service registration on one instance can block registration on another. + +**Impact:** In a multi-Core application (realistic for test harnesses, plugin systems, or microservices deployed in the same process), acquiring a named lock on one Core blocks all other Cores sharing that name. This is a silent logical error that could manifest as mysterious hangs or deadlocks in production. + +**Fix:** Move `lockMap` and `lockMu` into the `Core` struct (or wrap it in a per-Core map). + +--- + +## High-Priority Findings + +### [HIGH] service.go:46–66 — Lock-Order Violation (AB/BA Deadlock Risk) + +**Severity:** Deadlock risk +**File:** pkg/core/service.go +**Lines:** 46–66 +**Confidence:** 85% + +When registering a service with `c.Service(name, instance)`, the code acquires the `"srv"` write lock, then calls `c.i18n.AddLocales()` **while holding that lock** (line 66): + +```go +c.Lock("srv").Mu.Lock() // Acquire "srv" lock +defer c.Lock("srv").Mu.Unlock() +// ... +c.i18n.AddLocales(lp.Locales()) // AddLocales acquires i18n.mu +``` + +Any code path that acquires `i18n.mu` first and then tries to acquire the `"srv"` lock creates an AB/BA deadlock scenario. + +**Impact:** Lock-order inversions are latent but often invisible until they manifest under concurrent load. This is a classic correctness issue in multi-threaded systems. + +**Fix:** Move the `AddLocales` call outside the `"srv"` lock, or document and enforce a strict global lock ordering: always acquire `"srv"` before `i18n.mu` (or vice versa). + +--- + +### [HIGH] log.go:308–309 — Data Race on `defaultLog` Pointer + +**Severity:** Data race +**File:** pkg/core/log.go +**Lines:** 308–309 +**Confidence:** 88% + +`SetDefault` performs an unsynchronised write to the package-level `defaultLog` variable. If any goroutine is concurrently reading `defaultLog` (e.g., via `Debug()`, `Info()`, `SetLevel()`, etc.), this is a concurrent read+write data race on the pointer itself. + +```go +func SetDefault(l *Log) { + defaultLog = l // unsynchronised write to package-level pointer +} +``` + +**Impact:** Under high concurrency, a goroutine calling `Debug()` might read a partially-written pointer value, resulting in nil dereference, panic, or silent corruption. + +**Fix:** Use `sync/atomic.Pointer[Log]` or protect reads and writes with a package-level RWMutex. + +--- + +### [HIGH] error.go:390–391, 419–420 — File I/O Convention Violation & Directory Permissions + +**Severity:** Convention violation + permission issue +**File:** pkg/core/error.go +**Lines:** 390–391, 406–422 +**Confidence:** 88% + +The crash report file I/O uses `os.ReadFile` / `os.WriteFile` directly, bypassing the project convention that **all file I/O must go through `c.Fs()`** (sandboxed local filesystem). Additionally: + +1. `os.MkdirAll(filepath.Dir(h.filePath), 0755)` at line 419 uses world-readable permissions (0755) for a directory containing crash reports. This allows other local users to enumerate and potentially read sensitive crash metadata. +2. The file itself is written with 0600 (correct), but the directory is unprotected. + +**Impact:** Convention violation couples the error subsystem to the OS file system, preventing testing/mocking. Security issue: crash reports (containing sensitive stack traces, system info) are readable by other local users. + +**Fix:** +- Replace `os.ReadFile/WriteFile` with `c.Fs().Read()/Write()` +- Change `os.MkdirAll(..., 0755)` to `0700` or use `c.Fs().Mkdir()` +- Consider wrapping crash reporting in the abstracted I/O layer + +--- + +### [HIGH] contract.go:156–168 — Missing Nil Check in WithName + +**Severity:** Silent error (panics later) +**File:** pkg/core/contract.go +**Lines:** 156–168 +**Confidence:** 78% + +`WithName` does not validate that the `serviceInstance` returned by the factory is non-nil, unlike `WithService` (which guards this at lines 121–124). If the factory returns `(nil, nil)`, the nil is registered in the service map, and any type assertion or interface call on the retrieved service will panic. + +```go +svc, err := factory() +if err != nil { return ... } +// NO CHECK FOR nil svc +result := c.Service(name, serviceInstance) +``` + +**Fix:** Add the same nil check as `WithService`: +```go +if serviceInstance == nil { + return E("core.WithName", fmt.Sprintf("service factory returned nil instance for %q", name), nil) +} +``` + +--- + +## Medium-Priority Findings + +### [MEDIUM] config.go:101–106 — Nil ConfigOpts Dereference + +**Severity:** Panic (nil pointer dereference) +**File:** pkg/core/config.go +**Lines:** 101–106 +**Confidence:** 85% + +`Config.Enabled` and `Config.EnabledFeatures` read `e.Features` without checking if `e.ConfigOpts == nil`. While `New()` initialises `ConfigOpts`, `Config` is an exported struct and a zero-value `Config{}` would have nil `ConfigOpts`, causing a panic. + +The `Get` method at line 63 has the correct guard (`if e.ConfigOpts == nil`), but `Enabled` and `EnabledFeatures` do not. **Inconsistency increases the likelihood of misuse.** + +**Fix:** Add nil guard to `Enabled` and `EnabledFeatures`: +```go +func (e *Config) Enabled(feature string) bool { + e.mu.RLock() + defer e.mu.RUnlock() + if e.ConfigOpts == nil || e.Features == nil { + return false + } + return e.Features[feature] +} +``` + +--- + +### [MEDIUM] fs.go:85–86 — Logging Convention Violation & Log Injection Risk + +**Severity:** Convention violation + log injection risk +**File:** pkg/core/fs.go +**Lines:** 85–86 +**Confidence:** 75% + +The sandbox-escape security event at `validatePath` logs directly to `os.Stderr` using `fmt.Fprintf`, bypassing the structured `Log` system. The log line includes user-controlled input (the attempted path) **without sanitisation**, creating a log injection vector. + +```go +_, _ = fmt.Fprintf(os.Stderr, "potential sandbox escape attempt: realpath=%s input=%s user=%s\n", realNext, p, Username()) +``` + +An attacker-controlled path like `"foo\nmalicious log line\n"` would corrupt the log output. Additionally, by-passing the `Log` system violates consistency and prevents testing/log capture. + +**Fix:** +- Use `c.Log().Security(...)` instead of `fmt.Fprintf` +- Sanitise the path and username before including in the log message: + ```go + c.Log().Security("sandbox escape attempt", "path", filepath.Clean(p), "user", Username()) + ``` + +--- + +### [MEDIUM] command.go:132, 152, 176, 218 — Nil `app` Dereference + +**Severity:** Panic (nil pointer dereference) +**File:** pkg/core/command.go +**Lines:** 132, 152, 176, 218 +**Confidence:** 80% + +`Command` objects can be created with the public `NewCommand()` constructor, which leaves `c.app` nil. Later, when `run()`, `PrintHelp()`, or `isDefaultCommand()` is called, they dereference `c.app` without checking: + +```go +// Line 132: c.app.errorHandler(...) +// Line 152: c.app.defaultCommand +// Line 176: c.app.PrintBanner() +// Line 218: c.app != nil check exists but only here +``` + +**Impact:** Any consumer manually constructing a command tree using `NewCommand` and then calling methods on it will panic. The `setApp` method (which assigns `c.app`) is unexported, making it inaccessible to callers. + +**Fix:** Either make `setApp` public and require callers to use it, or add a lazy initialisation guard: +```go +func (c *Command) ensureApp() error { + if c.app == nil { + return E("Command.run", "Command not properly initialised; missing setApp call", nil) + } + return nil +} +``` + +--- + +## Low-Priority Findings + +### [LOW] log.go:388–397 — Convention Violation in LogPan.Recover + +**Severity:** Lost context +**File:** pkg/core/log.go +**Lines:** 388–397 +**Confidence:** 65% + +`LogPan.Recover` creates an error with `fmt.Errorf("%v", r)` for non-error panics (line 390), violating the project convention that errors must use `core.E()`. The resulting error lacks `Op` and structured context: + +```go +err, ok := r.(error) +if !ok { + err = fmt.Errorf("%v", r) // CONVENTION VIOLATION: should use core.E() +} +// ... +lp.log.Error("panic recovered", "op", Op(err), "stack", FormatStackTrace(err)) +``` + +Since `fmt.Errorf` returns a plain `error`, `Op(err)` and `FormatStackTrace(err)` will return empty strings. The log loses context. + +**Fix:** +```go +if !ok { + err = core.E("panic", fmt.Sprintf("%v", r), nil) +} +``` + +--- + +### [LOW] error.go:350–351 — Convention Violation in ErrPan.Recover + +**Severity:** Lost context +**File:** pkg/core/error.go +**Lines:** 350–351 +**Confidence:** 60% + +Same issue as above: `ErrPan.Recover` uses `fmt.Errorf` for non-error panics instead of `core.E()`. The crash report's `Error` field will be an unstructured string rather than a wrapped `core.Err`. + +**Fix:** +```go +if !ok { + err = core.E("panic", fmt.Sprintf("%v", r), nil) +} +``` + +--- + +## Summary Table + +| Severity | Count | Files | Action Required | +|----------|-------|-------|-----------------| +| CRITICAL | 3 | cli.go, embed.go, lock.go | **MUST FIX** — system is unusable or crashes | +| HIGH | 4 | service.go, log.go, error.go, contract.go | **MUST FIX** — logic errors, data races, deadlock risk | +| MEDIUM | 3 | config.go, fs.go, command.go | **SHOULD FIX** — crashes or convention violations | +| LOW | 2 | log.go, error.go | **CONSIDER** — lost context, non-critical | +| **TOTAL** | **12** | | | + +--- + +## Priority Implementation Order + +### Phase 1: Stability (CRITICAL + HIGH) +1. **cli.go** — Initialise `rootCommand` and `bannerFunction` in `New()` or add `WithCli` option +2. **embed.go** — Hold `assetGroupsMu` lock through both reads in `GetAsset` +3. **log.go** — Use `atomic.Pointer[Log]` for `defaultLog` or add RWMutex +4. **lock.go** — Move `lockMap` into `Core` struct (per-instance isolation) +5. **service.go** — Move `AddLocales` call outside the `"srv"` lock or document strict lock ordering + +### Phase 2: Correctness (MEDIUM) +6. **command.go** — Add nil check guard or make `setApp` public +7. **config.go** — Add nil guard to `Enabled` and `EnabledFeatures` +8. **fs.go** — Use `c.Log().Security()` and sanitise log input +9. **contract.go** — Add nil check in `WithName` factory validation + +### Phase 3: Polish (LOW) +10. **error.go** — Use `core.E()` in both `ErrPan.Recover` and `LogPan.Recover` +11. **error.go** — Use `c.Fs()` for crash report I/O; change directory mode to 0700 + +--- + +## Testing Strategy Post-Fix + +- **Unit tests** for each module, with concurrent load testing for lock contention +- **Integration tests** for multi-Core scenarios (lock.go fix) +- **Race detector** (`go test -race ./...`) must pass +- **Panic recovery tests** for nil-dereference scenarios (cli.go, command.go) +- **Functional tests** for crash reporting (error.go) + +--- + +## Files Affected (14 Consumers) + +- agent, api, cli, core, config, docs, gui, ide, lint, mcp, php, ts, LEM, eaas + +All of these should run tests and regressions after fixes are applied.