[agent/claude:haiku] Review all files in pkg/core/. Report findings with file:lin... #19

Closed
Virgil wants to merge 1 commit from agent/review-all-files-in-pkg-core---report-fi into dev

370
pkg/core/REVIEW.md Normal file
View file

@ -0,0 +1,370 @@
# Code Review: pkg/core/ — Comprehensive Analysis
**Repository:** core/go
**Status:** Ready for fixes
**Breaking Change Risk:** 14 consumers
**Date:** 2026-03-18
---
## Executive Summary
Comprehensive review of `pkg/core/` identified **3 critical**, **4 high**, **3 medium**, and **2 low** severity findings. The most severe issues are:
1. CLI subsystem is completely non-functional (nil pointers)
2. Race condition in asset retrieval
3. Package-level shared mutex state violates Core isolation
4. Lock ordering hazard between service and i18n subsystems
5. Data race in default logger replacement
All findings have concrete impact on the 14 consuming modules. Priority fixes are listed at the end.
---
## Critical Findings
### [CRITICAL] embed.go:6979 — Concurrent Map Access in GetAsset
**Severity:** Crash (panic)
**File:** pkg/core/embed.go
**Lines:** 6979
**Confidence:** 92%
The `GetAsset` method releases the `assetGroupsMu` RLock before reading from `g.assets`, creating a race condition with concurrent `AddAsset` calls.
```go
assetGroupsMu.RLock()
g, ok := assetGroups[group]
assetGroupsMu.RUnlock() // lock released here
// ...
data, ok := g.assets[name] // map read after lock released — RACE
```
**Impact:** Any concurrent call to `AddAsset` on the same group can mutate `g.assets` while `GetAsset` reads it, triggering `fatal error: concurrent map read and map write`.
**Fix:** Hold `assetGroupsMu.RLock()` through both the group lookup and the asset read, or make a copy of the asset while locked.
---
### [CRITICAL] cli.go:* — Uninitialised CLI Subsystem (Nil Pointer Dereference)
**Severity:** Crash (panic)
**File:** pkg/core/cli.go
**Lines:** 94, 99, 105, 118, 137, 142, 157, 163, 169, 175, 181, 187, 193, 198
**Confidence:** 95%
The `Core` constructor at contract.go:96 initialises `cli: &Cli{opts: &CliOpts{}}` with **no `rootCommand`** and **no `bannerFunction`**. Every public `Cli` method that accesses these fields panics:
- `AddCommand` (cli.go:94) — dereferences `c.rootCommand`
- `PrintBanner` (cli.go:99) — dereferences `c.bannerFunction`
- `Run` (cli.go:105) — dereferences `c.rootCommand`
- `PrintHelp` (cli.go:118) — dereferences `c.rootCommand` and `c.bannerFunction`
- All flag methods (`BoolFlag`, `StringFlag`, `IntFlag`) — dereference `c.rootCommand`
**Example crash path:**
```go
c, _ := core.New()
c.Cli().AddCommand(...) // panic: nil pointer dereference (rootCommand)
```
**Impact:** The CLI subsystem is **completely unusable** in any Core instance created via the standard `New()` constructor. Any consumer attempting to expose CLI commands via `c.Cli()` will crash immediately.
**Fix:** Either add a `WithCli(name, version, description)` option to initialise the CLI with a root command, or modify `New()` to create a default root command when `Cli` is instantiated. See also command.go nil check issue below.
---
### [CRITICAL] lock.go:1216 — Package-Level Shared Lock Map Violates Core Isolation
**Severity:** Logical error (silent cross-instance interference)
**File:** pkg/core/lock.go
**Lines:** 1216
**Confidence:** 90%
The `lockMap` is a package-level global shared across all `Core` instances in the process:
```go
var (
lockMu sync.Mutex
lockMap = make(map[string]*sync.RWMutex) // SHARED across all Core instances
)
```
Any two `Core` instances calling `c.Lock("srv")` will receive a pointer to the **same underlying `*sync.RWMutex`**. This violates the logical isolation of independent `Core` instances — service registration on one instance can block registration on another.
**Impact:** In a multi-Core application (realistic for test harnesses, plugin systems, or microservices deployed in the same process), acquiring a named lock on one Core blocks all other Cores sharing that name. This is a silent logical error that could manifest as mysterious hangs or deadlocks in production.
**Fix:** Move `lockMap` and `lockMu` into the `Core` struct (or wrap it in a per-Core map).
---
## High-Priority Findings
### [HIGH] service.go:4666 — Lock-Order Violation (AB/BA Deadlock Risk)
**Severity:** Deadlock risk
**File:** pkg/core/service.go
**Lines:** 4666
**Confidence:** 85%
When registering a service with `c.Service(name, instance)`, the code acquires the `"srv"` write lock, then calls `c.i18n.AddLocales()` **while holding that lock** (line 66):
```go
c.Lock("srv").Mu.Lock() // Acquire "srv" lock
defer c.Lock("srv").Mu.Unlock()
// ...
c.i18n.AddLocales(lp.Locales()) // AddLocales acquires i18n.mu
```
Any code path that acquires `i18n.mu` first and then tries to acquire the `"srv"` lock creates an AB/BA deadlock scenario.
**Impact:** Lock-order inversions are latent but often invisible until they manifest under concurrent load. This is a classic correctness issue in multi-threaded systems.
**Fix:** Move the `AddLocales` call outside the `"srv"` lock, or document and enforce a strict global lock ordering: always acquire `"srv"` before `i18n.mu` (or vice versa).
---
### [HIGH] log.go:308309 — Data Race on `defaultLog` Pointer
**Severity:** Data race
**File:** pkg/core/log.go
**Lines:** 308309
**Confidence:** 88%
`SetDefault` performs an unsynchronised write to the package-level `defaultLog` variable. If any goroutine is concurrently reading `defaultLog` (e.g., via `Debug()`, `Info()`, `SetLevel()`, etc.), this is a concurrent read+write data race on the pointer itself.
```go
func SetDefault(l *Log) {
defaultLog = l // unsynchronised write to package-level pointer
}
```
**Impact:** Under high concurrency, a goroutine calling `Debug()` might read a partially-written pointer value, resulting in nil dereference, panic, or silent corruption.
**Fix:** Use `sync/atomic.Pointer[Log]` or protect reads and writes with a package-level RWMutex.
---
### [HIGH] error.go:390391, 419420 — File I/O Convention Violation & Directory Permissions
**Severity:** Convention violation + permission issue
**File:** pkg/core/error.go
**Lines:** 390391, 406422
**Confidence:** 88%
The crash report file I/O uses `os.ReadFile` / `os.WriteFile` directly, bypassing the project convention that **all file I/O must go through `c.Fs()`** (sandboxed local filesystem). Additionally:
1. `os.MkdirAll(filepath.Dir(h.filePath), 0755)` at line 419 uses world-readable permissions (0755) for a directory containing crash reports. This allows other local users to enumerate and potentially read sensitive crash metadata.
2. The file itself is written with 0600 (correct), but the directory is unprotected.
**Impact:** Convention violation couples the error subsystem to the OS file system, preventing testing/mocking. Security issue: crash reports (containing sensitive stack traces, system info) are readable by other local users.
**Fix:**
- Replace `os.ReadFile/WriteFile` with `c.Fs().Read()/Write()`
- Change `os.MkdirAll(..., 0755)` to `0700` or use `c.Fs().Mkdir()`
- Consider wrapping crash reporting in the abstracted I/O layer
---
### [HIGH] contract.go:156168 — Missing Nil Check in WithName
**Severity:** Silent error (panics later)
**File:** pkg/core/contract.go
**Lines:** 156168
**Confidence:** 78%
`WithName` does not validate that the `serviceInstance` returned by the factory is non-nil, unlike `WithService` (which guards this at lines 121124). If the factory returns `(nil, nil)`, the nil is registered in the service map, and any type assertion or interface call on the retrieved service will panic.
```go
svc, err := factory()
if err != nil { return ... }
// NO CHECK FOR nil svc
result := c.Service(name, serviceInstance)
```
**Fix:** Add the same nil check as `WithService`:
```go
if serviceInstance == nil {
return E("core.WithName", fmt.Sprintf("service factory returned nil instance for %q", name), nil)
}
```
---
## Medium-Priority Findings
### [MEDIUM] config.go:101106 — Nil ConfigOpts Dereference
**Severity:** Panic (nil pointer dereference)
**File:** pkg/core/config.go
**Lines:** 101106
**Confidence:** 85%
`Config.Enabled` and `Config.EnabledFeatures` read `e.Features` without checking if `e.ConfigOpts == nil`. While `New()` initialises `ConfigOpts`, `Config` is an exported struct and a zero-value `Config{}` would have nil `ConfigOpts`, causing a panic.
The `Get` method at line 63 has the correct guard (`if e.ConfigOpts == nil`), but `Enabled` and `EnabledFeatures` do not. **Inconsistency increases the likelihood of misuse.**
**Fix:** Add nil guard to `Enabled` and `EnabledFeatures`:
```go
func (e *Config) Enabled(feature string) bool {
e.mu.RLock()
defer e.mu.RUnlock()
if e.ConfigOpts == nil || e.Features == nil {
return false
}
return e.Features[feature]
}
```
---
### [MEDIUM] fs.go:8586 — Logging Convention Violation & Log Injection Risk
**Severity:** Convention violation + log injection risk
**File:** pkg/core/fs.go
**Lines:** 8586
**Confidence:** 75%
The sandbox-escape security event at `validatePath` logs directly to `os.Stderr` using `fmt.Fprintf`, bypassing the structured `Log` system. The log line includes user-controlled input (the attempted path) **without sanitisation**, creating a log injection vector.
```go
_, _ = fmt.Fprintf(os.Stderr, "potential sandbox escape attempt: realpath=%s input=%s user=%s\n", realNext, p, Username())
```
An attacker-controlled path like `"foo\nmalicious log line\n"` would corrupt the log output. Additionally, by-passing the `Log` system violates consistency and prevents testing/log capture.
**Fix:**
- Use `c.Log().Security(...)` instead of `fmt.Fprintf`
- Sanitise the path and username before including in the log message:
```go
c.Log().Security("sandbox escape attempt", "path", filepath.Clean(p), "user", Username())
```
---
### [MEDIUM] command.go:132, 152, 176, 218 — Nil `app` Dereference
**Severity:** Panic (nil pointer dereference)
**File:** pkg/core/command.go
**Lines:** 132, 152, 176, 218
**Confidence:** 80%
`Command` objects can be created with the public `NewCommand()` constructor, which leaves `c.app` nil. Later, when `run()`, `PrintHelp()`, or `isDefaultCommand()` is called, they dereference `c.app` without checking:
```go
// Line 132: c.app.errorHandler(...)
// Line 152: c.app.defaultCommand
// Line 176: c.app.PrintBanner()
// Line 218: c.app != nil check exists but only here
```
**Impact:** Any consumer manually constructing a command tree using `NewCommand` and then calling methods on it will panic. The `setApp` method (which assigns `c.app`) is unexported, making it inaccessible to callers.
**Fix:** Either make `setApp` public and require callers to use it, or add a lazy initialisation guard:
```go
func (c *Command) ensureApp() error {
if c.app == nil {
return E("Command.run", "Command not properly initialised; missing setApp call", nil)
}
return nil
}
```
---
## Low-Priority Findings
### [LOW] log.go:388397 — Convention Violation in LogPan.Recover
**Severity:** Lost context
**File:** pkg/core/log.go
**Lines:** 388397
**Confidence:** 65%
`LogPan.Recover` creates an error with `fmt.Errorf("%v", r)` for non-error panics (line 390), violating the project convention that errors must use `core.E()`. The resulting error lacks `Op` and structured context:
```go
err, ok := r.(error)
if !ok {
err = fmt.Errorf("%v", r) // CONVENTION VIOLATION: should use core.E()
}
// ...
lp.log.Error("panic recovered", "op", Op(err), "stack", FormatStackTrace(err))
```
Since `fmt.Errorf` returns a plain `error`, `Op(err)` and `FormatStackTrace(err)` will return empty strings. The log loses context.
**Fix:**
```go
if !ok {
err = core.E("panic", fmt.Sprintf("%v", r), nil)
}
```
---
### [LOW] error.go:350351 — Convention Violation in ErrPan.Recover
**Severity:** Lost context
**File:** pkg/core/error.go
**Lines:** 350351
**Confidence:** 60%
Same issue as above: `ErrPan.Recover` uses `fmt.Errorf` for non-error panics instead of `core.E()`. The crash report's `Error` field will be an unstructured string rather than a wrapped `core.Err`.
**Fix:**
```go
if !ok {
err = core.E("panic", fmt.Sprintf("%v", r), nil)
}
```
---
## Summary Table
| Severity | Count | Files | Action Required |
|----------|-------|-------|-----------------|
| CRITICAL | 3 | cli.go, embed.go, lock.go | **MUST FIX** — system is unusable or crashes |
| HIGH | 4 | service.go, log.go, error.go, contract.go | **MUST FIX** — logic errors, data races, deadlock risk |
| MEDIUM | 3 | config.go, fs.go, command.go | **SHOULD FIX** — crashes or convention violations |
| LOW | 2 | log.go, error.go | **CONSIDER** — lost context, non-critical |
| **TOTAL** | **12** | | |
---
## Priority Implementation Order
### Phase 1: Stability (CRITICAL + HIGH)
1. **cli.go** — Initialise `rootCommand` and `bannerFunction` in `New()` or add `WithCli` option
2. **embed.go** — Hold `assetGroupsMu` lock through both reads in `GetAsset`
3. **log.go** — Use `atomic.Pointer[Log]` for `defaultLog` or add RWMutex
4. **lock.go** — Move `lockMap` into `Core` struct (per-instance isolation)
5. **service.go** — Move `AddLocales` call outside the `"srv"` lock or document strict lock ordering
### Phase 2: Correctness (MEDIUM)
6. **command.go** — Add nil check guard or make `setApp` public
7. **config.go** — Add nil guard to `Enabled` and `EnabledFeatures`
8. **fs.go** — Use `c.Log().Security()` and sanitise log input
9. **contract.go** — Add nil check in `WithName` factory validation
### Phase 3: Polish (LOW)
10. **error.go** — Use `core.E()` in both `ErrPan.Recover` and `LogPan.Recover`
11. **error.go** — Use `c.Fs()` for crash report I/O; change directory mode to 0700
---
## Testing Strategy Post-Fix
- **Unit tests** for each module, with concurrent load testing for lock contention
- **Integration tests** for multi-Core scenarios (lock.go fix)
- **Race detector** (`go test -race ./...`) must pass
- **Panic recovery tests** for nil-dereference scenarios (cli.go, command.go)
- **Functional tests** for crash reporting (error.go)
---
## Files Affected (14 Consumers)
- agent, api, cli, core, config, docs, gui, ide, lint, mcp, php, ts, LEM, eaas
All of these should run tests and regressions after fixes are applied.