docs: B-2 data integrity and concurrency audit
Comprehensive audit of registration token races, command queue concurrency, rapid mode risks, agent staleness, transaction safety, and deadlock risks.

Key findings:
- F-B2-1 HIGH: Registration flow not transactional (4 separate ops)
- F-B2-8 HIGH: Same as F-B2-1 (crash leaves orphaned agent)
- F-B2-2 MEDIUM: Duplicate command delivery on concurrent requests
- F-B2-4 MEDIUM: No cap on concurrent rapid-mode agents
- F-B2-7 MEDIUM: No staggered reconnection after server restart
- F-B2-9 MEDIUM: Token renewal not transactional (self-healing)

10 findings total. See docs/B2_Data_Integrity_Audit.md for details.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# B-2 Data Integrity & Concurrency Audit

**Date:** 2026-03-29
**Branch:** culurien
**Scope:** Registration token races, command queue concurrency, rapid mode risks, agent staleness, transaction safety, deadlocks

---
## 1. REGISTRATION TOKEN LIFECYCLE

### Token Lookup Flow

`RegisterAgent` (agents.go ~line 80):
1. Extract token from Authorization header or request body
2. `ValidateRegistrationToken(token)` — SELECT with WHERE `status='active' AND expires_at > NOW() AND seats_used < max_seats`
3. Check machine ID uniqueness
4. `CreateAgent(agent)` — INSERT into agents
5. `MarkTokenUsed(token, agentID)` — calls PostgreSQL function `mark_registration_token_used()`
6. Generate JWT + refresh token
### TOCTOU Race Condition — F-B2-1 HIGH

**Steps 2 and 5 are NOT atomic.** Between ValidateRegistrationToken (step 2) and MarkTokenUsed (step 5), two concurrent requests can interleave:
1. Request A validates token → seats_used=0, max_seats=1 → valid
2. Request B validates token → seats_used=0, max_seats=1 → valid (same snapshot)
3. Request A creates agent, marks token used → seats_used=1
4. Request B creates agent, attempts to mark token used → **`mark_registration_token_used` fails** because `seats_used < max_seats` is no longer true

The PostgreSQL function `mark_registration_token_used` (migration 012) uses `WHERE seats_used < max_seats`, so step 4 returns `false`. Because MarkTokenUsed is called BEFORE JWT generation, the handler deletes the agent (lines ~160-164) and returns an error; no JWT is ever issued for the losing request. The seat double-spend itself is prevented by the stored procedure, which performs the increment+check atomically.

**Actual risk:** Two requests CAN both validate the token (step 2) and both create agents (step 4), but only ONE can successfully mark the token (step 5). The second gets an error, and the handler deletes its agent. **This is a manual rollback, NOT a transaction.** If the server crashes between CreateAgent and MarkTokenUsed, the agent exists without the token being consumed.
### F-B2-1 Finding: Registration is not transactional

The registration flow uses 4 separate DB operations (validate, create agent, mark token, create refresh token) without a wrapping transaction. If the server crashes mid-flow:
- After CreateAgent but before MarkTokenUsed: orphaned agent, token still valid
- After MarkTokenUsed but before CreateRefreshToken: agent exists but can't authenticate

**Location:** agents.go RegisterAgent handler (~line 80-200)
### Multi-seat Token Atomicity — ACCEPTABLE

The `mark_registration_token_used` PostgreSQL function (migration 012) does the increment inside a single SQL function call. The `WHERE seats_used < max_seats` check and the `UPDATE SET seats_used = seats_used + 1` run in the same statement, so the row lock taken by the UPDATE serializes concurrent callers and the check cannot be bypassed. This prevents double-spend of seats.
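
The seat-accounting guarantee can be demonstrated in miniature. The sketch below is an in-memory analogue of the stored procedure (type and method names are illustrative, not from the codebase): the availability check and the increment happen under one lock, so two concurrent claims against a single-seat token can never both succeed.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// tokenStore mimics the semantics of mark_registration_token_used():
// the seat check and the increment happen atomically under one lock,
// mirroring the single-statement atomicity of the PostgreSQL function.
type tokenStore struct {
	mu        sync.Mutex
	seatsUsed int
	maxSeats  int
}

// ClaimSeat succeeds at most maxSeats times, no matter how many
// goroutines race on it.
func (t *tokenStore) ClaimSeat() error {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.seatsUsed >= t.maxSeats {
		return errors.New("no seats left")
	}
	t.seatsUsed++
	return nil
}

func main() {
	store := &tokenStore{maxSeats: 1}
	var wg sync.WaitGroup
	var mu sync.Mutex
	ok, failed := 0, 0
	for i := 0; i < 2; i++ { // two racing registration requests
		wg.Add(1)
		go func() {
			defer wg.Done()
			err := store.ClaimSeat()
			mu.Lock()
			defer mu.Unlock()
			if err != nil {
				failed++
			} else {
				ok++
			}
		}()
	}
	wg.Wait()
	fmt.Println(ok, failed) // 1 1 — exactly one request wins the seat
}
```

Running it always prints `1 1`: exactly one request wins the seat, which is why only one of the racing registrations can mark the token used.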
### Token Expiry — ACCEPTABLE

Expired tokens are checked in the WHERE clause: `expires_at > NOW()`. Cleanup is manual via admin endpoint `POST /admin/registration-tokens/cleanup`.

---
## 2. COMMAND QUEUE CONCURRENCY

### Command Status State Machine

```
pending → sent → completed
               → failed
               → timed_out
pending → cancelled
sent    → cancelled
failed    → (retry) → new pending command
timed_out → (retry) → new pending command
cancelled → (retry) → new pending command
```

Additional: `archived_failed` (from bulk cleanup)
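
The same machine can be written down as a transition table. This is a sketch for reference, not code from the repository; the real handlers enforce these rules in individual UPDATE queries rather than through a central map:

```go
package main

import "fmt"

// validNext encodes the state machine above. Terminal statuses have no
// outgoing edges because a retry creates a NEW pending command rather
// than mutating the old row.
var validNext = map[string][]string{
	"pending": {"sent", "cancelled"},
	"sent":    {"completed", "failed", "timed_out", "cancelled"},
}

func canTransition(from, to string) bool {
	for _, next := range validNext[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition("pending", "sent"))      // true
	fmt.Println(canTransition("completed", "pending")) // false — terminal
}
```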
### Duplicate Command Delivery — F-B2-2 MEDIUM

`GetCommands` (agents.go ~line 428) does:
1. `GetPendingCommands(agentID)` — SELECT with status='pending'
2. `GetStuckCommands(agentID, 5*time.Minute)` — re-delivers stuck commands
3. For each command: `MarkCommandSent(cmd.ID)` — UPDATE status='sent'

Steps 1-3 are NOT in a transaction. If two concurrent requests from the same agent arrive:
- Request A gets commands [C1, C2], starts marking them as sent
- Request B gets commands [C1, C2] (still pending — A hasn't committed yet)
- Both return C1 and C2 to the agent
- Agent processes C1 and C2 twice

**Mitigation in A-2:** The agent-side `executedIDs` dedup map prevents double execution. But the commands are still delivered twice, wasting bandwidth.

**Location:** agents.go GetCommands handler (~line 428-470)
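
The agent-side mitigation amounts to a guarded set of executed IDs. A minimal sketch of the idea (names are illustrative, not the codebase's; a server-side fix would instead select pending rows inside a transaction with something like PostgreSQL's `FOR UPDATE SKIP LOCKED`):

```go
package main

import (
	"fmt"
	"sync"
)

// dedup mirrors the agent-side executedIDs idea: remember which command
// IDs already ran so a duplicate delivery becomes a no-op.
type dedup struct {
	mu   sync.Mutex
	seen map[string]bool
}

func newDedup() *dedup { return &dedup{seen: make(map[string]bool)} }

// shouldExecute returns true exactly once per command ID.
func (d *dedup) shouldExecute(id string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.seen[id] {
		return false
	}
	d.seen[id] = true
	return true
}

func main() {
	d := newDedup()
	fmt.Println(d.shouldExecute("C1")) // true — first delivery executes
	fmt.Println(d.shouldExecute("C1")) // false — duplicate skipped
}
```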
### Crash Recovery — ACCEPTABLE

When an agent crashes after fetching a command:
- Command is in 'sent' status
- `GetStuckCommands` re-delivers commands stuck in 'sent' for >5 minutes
- TimeoutService marks commands as 'timed_out' after 2 hours
- No maximum retry count — a command CAN loop between sent→stuck→re-sent if the agent keeps crashing. Each re-delivery goes through the dedup check, so it won't execute twice per agent lifecycle. But across restarts (dedup map lost), it could execute again.
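
A bounded re-delivery counter would close that loop. This is a hypothetical sketch, not current behavior: the count would live on the command row and be incremented on each `GetStuckCommands` re-delivery.

```go
package main

import "fmt"

// maxRedeliveries caps the sent→stuck→re-sent loop. Once exceeded, the
// server would park the command as 'failed' instead of re-delivering it
// forever. The constant and function are illustrative.
const maxRedeliveries = 5

func nextAction(deliveryCount int) string {
	if deliveryCount >= maxRedeliveries {
		return "mark_failed"
	}
	return "redeliver"
}

func main() {
	fmt.Println(nextAction(1)) // redeliver
	fmt.Println(nextAction(5)) // mark_failed
}
```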
### Scheduler Queue — THREAD-SAFE

The scheduler uses an in-memory `PriorityQueue` with `sync.Mutex` protection (queue.go). All methods (`Push`, `Pop`, `PopBefore`) acquire the lock. The scheduler creates commands via `commandQueries.CreateCommand` which goes through `signAndCreateCommand`. The unique index `idx_agent_pending_subsystem` prevents duplicate pending commands for the same agent+subsystem.

---
## 3. RAPID MODE (5-SECOND POLLING)

### Trigger Mechanism

**Server-side:** `SetRapidPollingMode` (agents.go ~line 1221) sets `rapid_polling_enabled` and `rapid_polling_until` in agent metadata. Max duration: 60 minutes (validated: `max=60`).

**Agent-side:** `getCurrentPollingInterval` (main.go) checks `cfg.RapidPollingEnabled` and `cfg.RapidPollingUntil`. Returns 5 seconds if active, standard interval otherwise.

### Server-Side Timeout — F-B2-3 LOW

Rapid mode is stored in agent metadata (JSONB). There is no server-side expiry enforcement — the agent self-expires by checking `time.Now().Before(cfg.RapidPollingUntil)`. If the agent crashes and never restarts, the metadata flag persists forever in the DB. It has no operational impact (no server resources consumed), but it's stale data.
### Load Under 50 Concurrent Rapid-Mode Agents — F-B2-4 MEDIUM

Each rapid-mode check-in executes:
1. `UpdateAgentLastSeen` (1 UPDATE)
2. `GetPendingCommands` (1 SELECT)
3. `GetStuckCommands` (1 SELECT)
4. Various metadata UPDATEs (~2-3 UPDATEs)

Per cycle: ~5 queries. At 5-second intervals with 50 agents: **50 agents x 5 queries x 12 cycles/minute = 3,000 queries/minute**. This is well within PostgreSQL capacity for simple indexed queries, but there's no server-side cap on how many agents can enter rapid mode simultaneously.

Rate limiting exists at the router level (`agent_reports` rate limiter on the `SetRapidPollingMode` endpoint), but the check-in endpoint (`GetCommands`) has no rapid-mode-specific throttling. The `GetCommands` handler is not rate-limited at all (it's in the agent group without per-route limiting).
### Backpressure — ABSENT (F-B2-5)

No debouncing or backpressure mechanism exists for rapid-mode check-ins. The 30-second jitter at the start of each loop iteration (main.go ~line 703) provides some natural spread, but it's applied to ALL polling intervals including rapid mode. A 30-second jitter on a 5-second interval means check-ins effectively happen every 5-35 seconds, not every 5 seconds.

---
## 4. AGENT STATUS STALENESS

### Timeout Values

`timeout.go:28-29`:
- `sentTimeout = 2 * time.Hour` — for commands in 'sent' status
- `pendingTimeout = 30 * time.Minute` — for commands stuck in pending
- Check frequency: every 5 minutes (`time.NewTicker(5 * time.Minute)`)

`main.go ~line 429`:
- Offline threshold: `10 * time.Minute` (hardcoded)
- Check frequency: every 2 minutes (hardcoded)
### F-B2-6 LOW: Non-configurable timeout values

All timeout values are hardcoded. The `TODO: Make these timeout durations user-adjustable` at timeout.go:30 has not been implemented. The offline threshold (10 minutes) and check frequency (2 minutes) in main.go are also hardcoded.
### Thundering Herd on Restart — F-B2-7 MEDIUM

When the server restarts after 2 hours of downtime:
1. Offline check runs immediately (`MarkOfflineAgents(10 * time.Minute)`)
2. ALL agents with `last_seen < NOW() - 10 minutes` are marked offline in one UPDATE
3. This is a single atomic UPDATE — not a thundering herd on the DB side
4. But when agents start checking in after the server comes back, they all hit the server simultaneously

**Agent-side mitigation:** The 30-second jitter (`time.Duration(rand.Intn(30)) * time.Second`) at main.go ~line 703 provides some staggering. But agents that were waiting for the server will all retry within their backoff window simultaneously.
### In-Flight Commands on Offline — ACCEPTABLE

When an agent goes offline, its commands remain in their current status. Pending commands are not cancelled. The timeout service eventually marks sent commands as timed_out after 2 hours. Pending commands time out after 30 minutes. No data loss occurs.

---
## 5. TRANSACTION SAFETY AUDIT

| Operation | Transactional? | Details |
|-----------|---------------|---------|
| Agent registration | **NO** | 4 separate operations: validate token, create agent, mark token, create refresh token. Manual rollback (delete agent) on token failure. |
| Command approval | **NO** | `ApproveUpdate` creates a command via `signAndCreateCommand`, a single INSERT, but the approval status update and command creation are separate. |
| Command retry | **NO** | `RetryCommand` builds a new command and calls `signAndCreateCommand`. No transaction wrapping the original status check + new command creation. |
| Token renewal | **NO** | `ValidateRefreshToken` (SELECT), then `UpdateExpiration` (UPDATE), then `GenerateAgentToken` (memory op). Not transactional. |
| Agent deletion | **YES** | `DeleteAgent` uses a transaction (agents.go:211-224). Also relies on CASCADE for child records. |
| Bulk update approval | **YES** | `BulkApproveUpdates` uses a transaction (updates.go:159-178). |
| Update package status | **YES** | `UpdatePackageStatus` uses a transaction (updates.go:532-580). |
### F-B2-8 HIGH: Agent registration not transactional

The most critical non-transactional operation. A crash between any of the 4 steps leaves the system in an inconsistent state. The manual rollback (delete agent on token failure) is a best-effort mitigation, not an atomic guarantee.
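
What the fix would look like, reduced to its skeleton: run all of the registration writes inside one transaction scope so a failure (or crash) commits nothing. In the real handler `tx` would be an `*sql.Tx` from `db.BeginTx` with `defer tx.Rollback()`; everything below (the `txn` interface, the operation names, `fakeTx`) is an illustrative stand-in, not codebase API.

```go
package main

import (
	"errors"
	"fmt"
)

// txn is the minimal transaction surface the sketch needs.
type txn interface {
	Exec(op string) error
}

// registerAgentTx runs every registration write in one transaction
// scope: any failure aborts before the commit point, so a crash or
// error can never leave an orphaned agent or a half-consumed token.
func registerAgentTx(tx txn) error {
	for _, op := range []string{"claim_token_seat", "insert_agent", "insert_refresh_token"} {
		if err := tx.Exec(op); err != nil {
			return fmt.Errorf("%s failed, transaction rolled back: %w", op, err)
		}
	}
	return nil // commit point (tx.Commit in the real handler)
}

// fakeTx stands in for the database and fails on a chosen step.
type fakeTx struct {
	failOn  string
	applied []string
}

func (f *fakeTx) Exec(op string) error {
	if op == f.failOn {
		return errors.New("seats exhausted")
	}
	f.applied = append(f.applied, op)
	return nil
}

func main() {
	tx := &fakeTx{failOn: "claim_token_seat"}
	err := registerAgentTx(tx)
	fmt.Println(err != nil, len(tx.applied)) // true 0 — nothing written
}
```

Claiming the token seat first also shrinks the window: if the seat claim fails, no agent row is ever written.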
### F-B2-9 MEDIUM: Token renewal not transactional

If the server crashes between validating the refresh token and updating its expiry, the token is consumed (validated) but the new JWT is never issued. The agent would retry and succeed (token is still valid), so this is self-healing. Low practical risk.

---
## 6. DEADLOCK RISK AUDIT

### DB Lock Ordering — LOW RISK

- No handler acquires locks on both `agents` and `agent_commands` in the same transaction
- Agent deletion uses CASCADE (single DELETE triggers cascading deletes — PostgreSQL handles lock ordering internally)
- The scheduler creates commands via `CreateCommand` which does a single INSERT (no multi-table locks)
- The timeout service does single-row UPDATEs (no multi-row locking within a transaction)
### Go Mutex / DB Lock Interaction — LOW RISK

- `CommandHandler.executedIDs` uses `sync.Mutex` (agent-side, no DB interaction under the lock)
- `CommandHandler.keyCache` uses `sync.RWMutex` (agent-side, may do network I/O under lock for key fetch, but no DB)
- Scheduler `PriorityQueue` uses `sync.Mutex` (server-side, DB operations happen AFTER the lock is released)
- No handler holds a Go mutex while also holding a DB transaction

### Rapid Mode + Timeout Service — NO DEADLOCK

The rapid mode handler updates agent metadata via `UpdateAgent` (single-row UPDATE). The timeout service updates command status via `UpdateCommandStatus` (different table). No overlap.

---
## 7. ETHOS CROSS-CHECK

### ETHOS #3 — Assume Failure

**Violations:**
- Registration assumes CreateAgent succeeds before attempting MarkTokenUsed (F-B2-8)
- Token renewal assumes ValidateRefreshToken + UpdateExpiration succeed atomically (F-B2-9)
- GetCommands assumes MarkCommandSent succeeds (but does handle failure with logging)

### ETHOS #4 — Idempotency

**Command delivery:** Not idempotent (same command can be delivered twice in concurrent requests). Mitigated by agent-side dedup.

**Registration:** Not idempotent (same token can validate twice before one is consumed). Mitigated by stored procedure atomicity on the mark-as-used step.

**Token renewal:** Idempotent (renewing twice just extends the expiry twice — harmless).

---
## FINDINGS SUMMARY

| ID | Severity | Finding | Location |
|----|----------|---------|----------|
| F-B2-1 | HIGH | Registration flow uses 4 separate DB operations without a transaction. Crash between CreateAgent and MarkTokenUsed leaves orphaned agent. | agents.go RegisterAgent (~line 80-200) |
| F-B2-2 | MEDIUM | GetCommands + MarkCommandSent not in a transaction. Concurrent requests from same agent can receive duplicate commands. | agents.go GetCommands (~line 428-470) |
| F-B2-3 | LOW | Rapid mode metadata persists in DB after agent crash (stale, no operational impact). | agents.go SetRapidPollingMode (~line 1221) |
| F-B2-4 | MEDIUM | No server-side cap on concurrent rapid-mode agents. 50 agents at 5s intervals = 3,000 queries/min. | main.go, agents.go |
| F-B2-5 | LOW | 30-second jitter on 5-second rapid interval effectively negates rapid mode (check-ins at 5-35s). | agent main.go ~line 703 |
| F-B2-6 | LOW | All timeout values hardcoded (2h sent, 30m pending, 10m offline). TODO exists but not implemented. | timeout.go:28-30, main.go:429 |
| F-B2-7 | MEDIUM | No staggered reconnection after server restart. All agents retry simultaneously within backoff window. | agent main.go ~line 703 |
| F-B2-8 | HIGH | Agent registration not transactional. 4 separate DB operations with manual rollback on failure. | agents.go RegisterAgent |
| F-B2-9 | MEDIUM | Token renewal not transactional (validate + update expiry + issue JWT). Self-healing on retry. | agents.go RenewToken (~line 995) |
| F-B2-10 | LOW | No maximum retry count for stuck commands. A persistently failing command can loop indefinitely between sent→stuck→re-sent. | timeout.go, agents.go GetCommands |