# B-2 Data Integrity & Concurrency Audit

Date: 2026-03-29
Branch: culurien
Scope: Registration token races, command queue concurrency, rapid mode risks, agent staleness, transaction safety, deadlocks
## 1. REGISTRATION TOKEN LIFECYCLE

### Token Lookup Flow

`RegisterAgent` (agents.go ~line 80):

1. Extract token from `Authorization` header or request body
2. `ValidateRegistrationToken(token)` — SELECT with `WHERE status='active' AND expires_at > NOW() AND seats_used < max_seats`
3. Check machine ID uniqueness
4. `CreateAgent(agent)` — INSERT into `agents`
5. `MarkTokenUsed(token, agentID)` — calls PostgreSQL function `mark_registration_token_used()`
6. Generate JWT + refresh token
### TOCTOU Race Condition — F-B2-1 HIGH

Steps 2 and 5 are NOT atomic. Between `ValidateRegistrationToken` (step 2) and `MarkTokenUsed` (step 5), another request can:

1. Request A validates token → `seats_used=0`, `max_seats=1` → valid
2. Request B validates token → `seats_used=0`, `max_seats=1` → valid (same snapshot)
3. Request A creates agent, marks token used → `seats_used=1`
4. Request B creates agent, marks token used → `mark_registration_token_used` function fails because `seats_used < max_seats` is no longer true
The PostgreSQL function `mark_registration_token_used` (migration 012) uses `WHERE seats_used < max_seats`, so the second call returns false. Re-reading the handler: `MarkTokenUsed` is called BEFORE JWT generation, so when it fails the handler deletes the just-created agent (lines ~160-164) and returns an error — no JWT is issued to the losing request. The seat check itself is safe because the stored procedure performs the increment+check atomically.
Actual risk: Two agents CAN both validate the token (step 2) simultaneously, both create agents (step 4), but only ONE can successfully mark the token (step 5). The second gets an error, and the handler deletes the agent. This is a manual rollback, NOT a transaction. If the server crashes between `CreateAgent` and `MarkTokenUsed`, the agent exists without the token being consumed.
### F-B2-1 Finding: Registration is not transactional

The registration flow uses 4 separate DB operations (validate, create agent, mark token, create refresh token) without a wrapping transaction. If the server crashes mid-flow:

- After `CreateAgent` but before `MarkTokenUsed`: orphaned agent, token still valid
- After `MarkTokenUsed` but before `CreateRefreshToken`: agent exists but can't authenticate
Location: agents.go RegisterAgent handler (~line 80-200)
### Multi-seat Token Atomicity — ACCEPTABLE

The `mark_registration_token_used` PostgreSQL function (migration 012) does the increment inside a single SQL function call. The `WHERE seats_used < max_seats` check and the `UPDATE ... SET seats_used = seats_used + 1` are in the same function execution. PostgreSQL function execution is atomic within a single statement. This prevents double-spend of seats.
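The guarantee this atomicity provides can be illustrated with an in-memory sketch (hypothetical Go, not the project's code): the check and the increment happen under one lock, so concurrent claimers can never both win the last seat.

```go
package main

import (
	"fmt"
	"sync"
)

// tokenSeats mimics the atomic check-and-increment that
// mark_registration_token_used performs inside PostgreSQL.
type tokenSeats struct {
	mu       sync.Mutex
	used     int
	maxSeats int
}

// claim returns true if a seat was consumed.
func (t *tokenSeats) claim() bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.used >= t.maxSeats { // WHERE seats_used < max_seats
		return false
	}
	t.used++ // SET seats_used = seats_used + 1
	return true
}

// raceClaims launches n concurrent registration attempts against a
// token with maxSeats seats and reports how many succeeded.
func raceClaims(n, maxSeats int) int {
	tok := &tokenSeats{maxSeats: maxSeats}
	var wg sync.WaitGroup
	var mu sync.Mutex
	wins := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if tok.claim() {
				mu.Lock()
				wins++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return wins
}

func main() {
	fmt.Println(raceClaims(10, 1)) // exactly 1 winner, never 2
}
```

The separate `ValidateRegistrationToken` SELECT has no such protection, which is why both requests can pass step 2 but only one can pass step 5.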
### Token Expiry — ACCEPTABLE

Expired tokens are checked in the WHERE clause: `expires_at > NOW()`. Cleanup is manual via admin endpoint `POST /admin/registration-tokens/cleanup`.
## 2. COMMAND QUEUE CONCURRENCY

### Command Status State Machine

```
pending   → sent → completed
                 → failed
                 → timed_out
pending   → cancelled
sent      → cancelled
failed    → (retry) → new pending command
timed_out → (retry) → new pending command
cancelled → (retry) → new pending command
```

Additional: `archived_failed` (from bulk cleanup)
### Duplicate Command Delivery — F-B2-2 MEDIUM

`GetCommands` (agents.go ~line 428) does:

1. `GetPendingCommands(agentID)` — SELECT with `status='pending'`
2. `GetStuckCommands(agentID, 5*time.Minute)` — re-delivers stuck commands
3. For each command: `MarkCommandSent(cmd.ID)` — UPDATE `status='sent'`
Steps 1-3 are NOT in a transaction. If two concurrent requests from the same agent arrive:
- Request A gets commands [C1, C2], starts marking them as sent
- Request B gets commands [C1, C2] (still pending — A hasn't committed yet)
- Both return C1 and C2 to the agent
- Agent processes C1 and C2 twice
Mitigation in A-2: The agent-side `executedIDs` dedup map prevents double execution. But the commands are still delivered twice, wasting bandwidth.
Location: agents.go GetCommands handler (~line 428-470)
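The race disappears if fetch and mark are one atomic step — on the server side, e.g. a single `UPDATE ... SET status='sent' WHERE status='pending' RETURNING *` (a hypothetical fix, not the current query). An in-memory simulation of that claim semantics, with all names invented for illustration:

```go
package main

import (
	"fmt"
	"sync"
)

// commandStore simulates making "fetch pending" and "mark sent" one
// atomic operation, closing the F-B2-2 window.
type commandStore struct {
	mu     sync.Mutex
	status map[string]string // command ID -> status
}

// claimPending atomically flips every pending command to 'sent' and
// returns the claimed IDs. Two concurrent callers can never both
// receive the same command.
func (s *commandStore) claimPending() []string {
	s.mu.Lock()
	defer s.mu.Unlock()
	var claimed []string
	for id, st := range s.status {
		if st == "pending" {
			s.status[id] = "sent"
			claimed = append(claimed, id)
		}
	}
	return claimed
}

func main() {
	s := &commandStore{status: map[string]string{"C1": "pending", "C2": "pending"}}
	var wg sync.WaitGroup
	counts := make(chan int, 2)
	for i := 0; i < 2; i++ { // two concurrent GetCommands requests
		wg.Add(1)
		go func() {
			defer wg.Done()
			counts <- len(s.claimPending())
		}()
	}
	wg.Wait()
	close(counts)
	total := 0
	for n := range counts {
		total += n
	}
	fmt.Println(total) // 2: each command delivered exactly once across both requests
}
```

Unlike the current SELECT-then-UPDATE sequence, concurrent requests here partition the commands instead of both receiving the full set.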
### Crash Recovery — ACCEPTABLE

When an agent crashes after fetching a command:

- Command is in 'sent' status
- `GetStuckCommands` re-delivers commands stuck in 'sent' for >5 minutes
- TimeoutService marks commands as 'timed_out' after 2 hours
- No maximum retry count — a command CAN loop between sent→stuck→re-sent if the agent keeps crashing. Each re-delivery goes through the dedup check, so it won't execute twice per agent lifecycle. But across restarts (dedup map lost), it could execute again.
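A retry cap (see F-B2-10) would bound that loop. A minimal sketch of the decision logic, assuming a hypothetical `delivery_count` column bumped on each re-delivery — neither the cap value nor the column exists in the project today:

```go
package main

import "fmt"

// maxDeliveries caps how many times a stuck command is re-sent.
// The value 5 is an illustrative assumption.
const maxDeliveries = 5

// redeliver decides whether a stuck command should go out again, or
// be marked permanently failed once the cap is hit, ending the
// sent→stuck→re-sent loop.
func redeliver(deliveryCount int) (resend bool, finalStatus string) {
	if deliveryCount >= maxDeliveries {
		return false, "failed"
	}
	return true, "sent"
}

func main() {
	for _, n := range []int{0, 4, 5} {
		ok, st := redeliver(n)
		fmt.Println(n, ok, st)
	}
	// Output:
	// 0 true sent
	// 4 true sent
	// 5 false failed
}
```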
### Scheduler Queue — THREAD-SAFE

The scheduler uses an in-memory PriorityQueue with `sync.Mutex` protection (queue.go). All methods (`Push`, `Pop`, `PopBefore`) acquire the lock. The scheduler creates commands via `commandQueries.CreateCommand`, which goes through `signAndCreateCommand`. The unique index `idx_agent_pending_subsystem` prevents duplicate pending commands for the same agent+subsystem.
## 3. RAPID MODE (5-SECOND POLLING)

### Trigger Mechanism

Server-side: `SetRapidPollingMode` (agents.go ~line 1221) sets `rapid_polling_enabled` and `rapid_polling_until` in agent metadata. Max duration: 60 minutes (validated: max=60).

Agent-side: `getCurrentPollingInterval` (main.go) checks `cfg.RapidPollingEnabled` and `cfg.RapidPollingUntil`. Returns 5 seconds if active, standard interval otherwise.
### Server-Side Timeout — F-B2-3 LOW

Rapid mode is stored in agent metadata (JSONB). There is no server-side expiry enforcement — the agent self-expires by checking `time.Now().Before(cfg.RapidPollingUntil)`. If the agent crashes and never restarts, the metadata flag persists forever in the DB. It has no operational impact (no server resources consumed), but it's stale data.
### Load Under 50 Concurrent Rapid-Mode Agents — F-B2-4 MEDIUM

Each rapid-mode check-in executes:

- `UpdateAgentLastSeen` (1 UPDATE)
- `GetPendingCommands` (1 SELECT)
- `GetStuckCommands` (1 SELECT)
- Various metadata UPDATEs (~2-3 UPDATEs)

Per cycle: ~5 queries. At 5-second intervals with 50 agents: 50 agents × 5 queries × 12 cycles/minute = 3,000 queries/minute. This is well within PostgreSQL capacity for simple indexed queries, but there's no server-side cap on how many agents can enter rapid mode simultaneously.
Rate limiting exists at the router level (agent_reports rate limiter on the SetRapidPollingMode endpoint), but the check-in endpoint (GetCommands) has no rapid-mode-specific throttling. The GetCommands handler is not rate-limited at all (it's in the agent group without per-route limiting).
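One way to close F-B2-4 would be a server-side admission gate that `SetRapidPollingMode` consults before enabling rapid mode. A minimal sketch — the gate, its limit, and the method names are all hypothetical, not the project's code:

```go
package main

import (
	"fmt"
	"sync"
)

// rapidGate caps how many agents may be in rapid mode at once.
type rapidGate struct {
	mu     sync.Mutex
	active int
	limit  int
}

// tryEnter admits an agent into rapid mode if a slot is free.
func (g *rapidGate) tryEnter() bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.active >= g.limit {
		return false // reject: too many agents already in rapid mode
	}
	g.active++
	return true
}

// leave frees a slot when an agent's rapid-mode window expires.
func (g *rapidGate) leave() {
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.active > 0 {
		g.active--
	}
}

func main() {
	g := &rapidGate{limit: 2}
	fmt.Println(g.tryEnter(), g.tryEnter(), g.tryEnter()) // true true false
	g.leave()
	fmt.Println(g.tryEnter()) // true: a slot freed up
}
```

Because rapid mode is stored in agent metadata, a real implementation would also need to release slots when `rapid_polling_until` passes, not only on explicit disable.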
### Backpressure — ABSENT (F-B2-5)

No debouncing or backpressure mechanism exists for rapid-mode check-ins. The 30-second jitter at the start of each loop iteration (main.go ~line 703) provides some natural spread, but it's applied to ALL polling intervals, including rapid mode. A 30-second jitter on a 5-second interval means check-ins effectively happen every 5-35 seconds, not every 5 seconds.
## 4. AGENT STATUS STALENESS

### Timeout Values

timeout.go:28-29:

- `sentTimeout = 2 * time.Hour` — for commands in 'sent' status
- `pendingTimeout = 30 * time.Minute` — for commands stuck in 'pending'
- Check frequency: every 5 minutes (`time.NewTicker(5 * time.Minute)`)

main.go ~line 429:

- Offline threshold: `10 * time.Minute` (hardcoded)
- Check frequency: every 2 minutes (hardcoded)
### F-B2-6 LOW: Non-configurable timeout values

All timeout values are hardcoded. The `TODO: Make these timeout durations user-adjustable` at timeout.go:30 has not been implemented. The offline threshold (10 minutes) and check frequency (2 minutes) in main.go are also hardcoded.
### Thundering Herd on Restart — F-B2-7 MEDIUM

When the server restarts after 2 hours of downtime:

- Offline check runs immediately (`MarkOfflineAgents(10 * time.Minute)`)
- ALL agents with `last_seen < NOW() - 10 minutes` are marked offline in one UPDATE
- This is a single atomic UPDATE — not a thundering herd on the DB side
- But when agents start checking in after the server comes back, they all hit the server simultaneously
Agent-side mitigation: The 30-second jitter (`time.Duration(rand.Intn(30)) * time.Second`) at main.go ~line 703 provides some staggering. But agents that were waiting for the server will all retry within their backoff window simultaneously.
### In-Flight Commands on Offline — ACCEPTABLE
When an agent goes offline, its commands remain in their current status. Pending commands are not cancelled. The timeout service eventually marks sent commands as timed_out after 2 hours. Pending commands time out after 30 minutes. No data loss occurs.
## 5. TRANSACTION SAFETY AUDIT
| Operation | Transactional? | Details |
|---|---|---|
| Agent registration | NO | 4 separate operations: validate token, create agent, mark token, create refresh token. Manual rollback (delete agent) on token failure. |
| Command approval | NO | ApproveUpdate creates a command via signAndCreateCommand which is a single INSERT. But the approval status update and command creation are separate. |
| Command retry | NO | RetryCommand builds new command and calls signAndCreateCommand. No transaction wrapping the original status check + new command creation. |
| Token renewal | NO | ValidateRefreshToken (SELECT) then UpdateExpiration (UPDATE) then GenerateAgentToken (memory op). Not transactional. |
| Agent deletion | YES | DeleteAgent uses a transaction (agents.go:211-224). Also relies on CASCADE for child records. |
| Bulk update approval | YES | BulkApproveUpdates uses a transaction (updates.go:159-178). |
| Update package status | YES | UpdatePackageStatus uses a transaction (updates.go:532-580). |
### F-B2-8 HIGH: Agent registration not transactional
The most critical non-transactional operation. A crash between any of the 4 steps leaves the system in an inconsistent state. The manual rollback (delete agent on token failure) is a best-effort mitigation, not an atomic guarantee.
### F-B2-9 MEDIUM: Token renewal not transactional
If the server crashes between validating the refresh token and updating its expiry, the token is consumed (validated) but the new JWT is never issued. The agent would retry and succeed (token is still valid), so this is self-healing. Low practical risk.
## 6. DEADLOCK RISK AUDIT

### DB Lock Ordering — LOW RISK

- No handler acquires locks on both `agents` and `agent_commands` in the same transaction
- Agent deletion uses CASCADE (single DELETE triggers cascading deletes — PostgreSQL handles lock ordering internally)
- The scheduler creates commands via `CreateCommand`, which does a single INSERT (no multi-table locks)
- The timeout service does single-row UPDATEs (no multi-row locking within a transaction)
### Go Mutex / DB Lock Interaction — LOW RISK

- `CommandHandler.executedIDs` uses `sync.Mutex` (agent-side, no DB interaction under the lock)
- `CommandHandler.keyCache` uses `sync.RWMutex` (agent-side, may do network I/O under the lock for key fetch, but no DB)
- Scheduler `PriorityQueue` uses `sync.Mutex` (server-side, DB operations happen AFTER the lock is released)
- No handler holds a Go mutex while also holding a DB transaction
### Rapid Mode + Timeout Service — NO DEADLOCK

The rapid mode handler updates agent metadata via `UpdateAgent` (single-row UPDATE). The timeout service updates command status via `UpdateCommandStatus` (different table). No overlap.
## 7. ETHOS CROSS-CHECK

### ETHOS #3 — Assume Failure
Violations:
- Registration assumes CreateAgent succeeds before attempting MarkTokenUsed (F-B2-8)
- Token renewal assumes ValidateRefreshToken + UpdateExpiration succeed atomically (F-B2-9)
- GetCommands assumes MarkCommandSent succeeds (but does handle failure with logging)
### ETHOS #4 — Idempotency

- Command delivery: Not idempotent (same command can be delivered twice in concurrent requests). Mitigated by agent-side dedup.
- Registration: Not idempotent (same token can validate twice before one is consumed). Mitigated by stored procedure atomicity on the mark-as-used step.
- Token renewal: Idempotent (renewing twice just extends the expiry twice — harmless).
## FINDINGS SUMMARY
| ID | Severity | Finding | Location |
|---|---|---|---|
| F-B2-1 | HIGH | Registration flow uses 4 separate DB operations without a transaction. Crash between CreateAgent and MarkTokenUsed leaves orphaned agent. | agents.go RegisterAgent (~line 80-200) |
| F-B2-2 | MEDIUM | GetCommands + MarkCommandSent not in a transaction. Concurrent requests from same agent can receive duplicate commands. | agents.go GetCommands (~line 428-470) |
| F-B2-3 | LOW | Rapid mode metadata persists in DB after agent crash (stale, no operational impact). | agents.go SetRapidPollingMode (~line 1221) |
| F-B2-4 | MEDIUM | No server-side cap on concurrent rapid-mode agents. 50 agents at 5s intervals = 3,000 queries/min. | main.go, agents.go |
| F-B2-5 | LOW | 30-second jitter on 5-second rapid interval effectively negates rapid mode (check-ins at 5-35s). | agent main.go ~line 703 |
| F-B2-6 | LOW | All timeout values hardcoded (2h sent, 30m pending, 10m offline). TODO exists but not implemented. | timeout.go:28-30, main.go:429 |
| F-B2-7 | MEDIUM | No staggered reconnection after server restart. All agents retry simultaneously within backoff window. | agent main.go ~line 703 |
| F-B2-8 | HIGH | Agent registration not transactional. 4 separate DB operations with manual rollback on failure. | agents.go RegisterAgent |
| F-B2-9 | MEDIUM | Token renewal not transactional (validate + update expiry + issue JWT). Self-healing on retry. | agents.go RenewToken (~line 995) |
| F-B2-10 | LOW | No maximum retry count for stuck commands. A persistently failing command can loop indefinitely between sent→stuck→re-sent. | timeout.go, agents.go GetCommands |