Redflag/docs/B2_Data_Integrity_Audit.md
jpetree331 2fd0fd27fa docs: B-2 data integrity and concurrency audit
Comprehensive audit of registration token races, command queue
concurrency, rapid mode risks, agent staleness, transaction
safety, and deadlock risks.

Key findings:
- F-B2-1 HIGH: Registration flow not transactional (4 separate ops)
- F-B2-8 HIGH: Same as F-B2-1 (crash leaves orphaned agent)
- F-B2-2 MEDIUM: Duplicate command delivery on concurrent requests
- F-B2-4 MEDIUM: No cap on concurrent rapid-mode agents
- F-B2-7 MEDIUM: No staggered reconnection after server restart
- F-B2-9 MEDIUM: Token renewal not transactional (self-healing)

10 findings total. See docs/B2_Data_Integrity_Audit.md for details.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 07:26:02 -04:00


B-2 Data Integrity & Concurrency Audit

Date: 2026-03-29
Branch: culurien
Scope: Registration token races, command queue concurrency, rapid mode risks, agent staleness, transaction safety, deadlocks


1. REGISTRATION TOKEN LIFECYCLE

Token Lookup Flow

RegisterAgent (agents.go ~line 80):

  1. Extract token from Authorization header or request body
  2. ValidateRegistrationToken(token) — SELECT with WHERE status='active' AND expires_at > NOW() AND seats_used < max_seats
  3. Check machine ID uniqueness
  4. CreateAgent(agent) — INSERT into agents
  5. MarkTokenUsed(token, agentID) — calls PostgreSQL function mark_registration_token_used()
  6. Generate JWT + refresh token

TOCTOU Race Condition — F-B2-1 HIGH

Steps 2 and 5 are NOT atomic. Between ValidateRegistrationToken (step 2) and MarkTokenUsed (step 5), two concurrent requests can interleave:

  1. Request A validates token → seats_used=0, max_seats=1 → valid
  2. Request B validates token → seats_used=0, max_seats=1 → valid (same snapshot)
  3. Request A creates agent, marks token used → seats_used=1
  4. Request B creates agent, marks token used → mark_registration_token_used function fails because seats_used < max_seats is no longer true

The PostgreSQL function mark_registration_token_used (migration 012) uses WHERE seats_used < max_seats, so the losing request's step 5 returns false. Because MarkTokenUsed is called BEFORE JWT generation, the handler then deletes the just-created agent (lines ~160-164) and returns an error, so no JWT is ever issued for the loser. The seat double-spend itself is prevented by the stored procedure, which performs the check and the increment atomically.

Actual risk: two agents CAN both validate the token (step 2) simultaneously and both create agents (step 4), but only ONE can successfully mark the token (step 5). The second gets an error and the handler deletes its agent. This is a manual rollback, NOT a transaction. If the server crashes between CreateAgent and MarkTokenUsed, the agent exists without the token being consumed.

F-B2-1 Finding: Registration is not transactional

The registration flow uses 4 separate DB operations (validate, create agent, mark token, create refresh token) without a wrapping transaction. If the server crashes mid-flow:

  • After CreateAgent but before MarkTokenUsed: orphaned agent, token still valid
  • After MarkTokenUsed but before CreateRefreshToken: agent exists but can't authenticate

Location: agents.go RegisterAgent handler (~line 80-200)

Multi-seat Token Atomicity — ACCEPTABLE

The mark_registration_token_used PostgreSQL function (migration 012) does the increment inside a single SQL function call. The WHERE seats_used < max_seats check and the UPDATE SET seats_used = seats_used + 1 are in the same function execution. PostgreSQL function execution is atomic within a single statement. This prevents double-spend of seats.
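The check-and-increment semantics can be sketched in Go with a single lock standing in for PostgreSQL's single-statement atomicity; the tokenSeats type and field names are illustrative, not the server's schema:

```go
package main

import (
	"fmt"
	"sync"
)

// tokenSeats mimics mark_registration_token_used: the
// seats_used < max_seats check and the increment happen under one
// lock, so no two callers can both consume the last seat.
type tokenSeats struct {
	mu        sync.Mutex
	seatsUsed int
	maxSeats  int
}

// claim returns true only if this caller actually consumed a seat.
func (t *tokenSeats) claim() bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	if t.seatsUsed >= t.maxSeats {
		return false
	}
	t.seatsUsed++
	return true
}

func main() {
	tok := &tokenSeats{maxSeats: 1}
	wins := make(chan bool, 10)
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ { // ten concurrent registrations, one seat
		wg.Add(1)
		go func() { defer wg.Done(); wins <- tok.claim() }()
	}
	wg.Wait()
	close(wins)
	n := 0
	for w := range wins {
		if w {
			n++
		}
	}
	fmt.Println(n) // 1: exactly one registration wins the seat
}
```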

Token Expiry — ACCEPTABLE

Expired tokens are checked in the WHERE clause: expires_at > NOW(). Cleanup is manual via admin endpoint POST /admin/registration-tokens/cleanup.


2. COMMAND QUEUE CONCURRENCY

Command Status State Machine

pending → sent → completed
                → failed
                → timed_out
pending → cancelled
sent    → cancelled
failed  → (retry) → new pending command
timed_out → (retry) → new pending command
cancelled → (retry) → new pending command

Additional: archived_failed (from bulk cleanup)
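The legal transitions above reduce to a small lookup; note that retries are deliberately absent because a retry creates a new pending command rather than resurrecting the old one. A sketch (names hypothetical):

```go
package main

import "fmt"

// transitions encodes the command state machine above. Terminal states
// (completed, failed, timed_out, cancelled) have no outgoing edges:
// a retry is a brand-new pending command, not a transition.
var transitions = map[string][]string{
	"pending": {"sent", "cancelled"},
	"sent":    {"completed", "failed", "timed_out", "cancelled"},
}

func canTransition(from, to string) bool {
	for _, next := range transitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition("pending", "sent"))      // true
	fmt.Println(canTransition("completed", "pending")) // false: retry = new command
}
```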

Duplicate Command Delivery — F-B2-2 MEDIUM

GetCommands (agents.go ~line 428) does:

  1. GetPendingCommands(agentID) — SELECT with status='pending'
  2. GetStuckCommands(agentID, 5*time.Minute) — re-delivers stuck commands
  3. For each command: MarkCommandSent(cmd.ID) — UPDATE status='sent'

Steps 1-3 are NOT in a transaction. If two concurrent requests from the same agent arrive:

  • Request A gets commands [C1, C2], starts marking them as sent
  • Request B gets commands [C1, C2] (still pending — A hasn't committed yet)
  • Both return C1 and C2 to the agent
  • Agent processes C1 and C2 twice

Mitigation in A-2: The agent-side executedIDs dedup map prevents double execution. But the commands are still delivered twice, wasting bandwidth.

Location: agents.go GetCommands handler (~line 428-470)

Crash Recovery — ACCEPTABLE

When an agent crashes after fetching a command:

  • Command is in 'sent' status
  • GetStuckCommands re-delivers commands stuck in 'sent' for >5 minutes
  • TimeoutService marks commands as 'timed_out' after 2 hours
  • No maximum retry count — a command CAN loop between sent→stuck→re-sent if the agent keeps crashing. Each re-delivery goes through the dedup check, so it won't execute twice per agent lifecycle. But across restarts (dedup map lost), it could execute again.

Scheduler Queue — THREAD-SAFE

The scheduler uses an in-memory PriorityQueue with sync.Mutex protection (queue.go). All methods (Push, Pop, PopBefore) acquire the lock. The scheduler creates commands via commandQueries.CreateCommand which goes through signAndCreateCommand. The unique index idx_agent_pending_subsystem prevents duplicate pending commands for the same agent+subsystem.
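The locking discipline described above can be sketched with container/heap; this illustrates the pattern (lock around every queue method, DB work only after release), not the project's actual queue.go:

```go
package main

import (
	"container/heap"
	"fmt"
	"sync"
)

// item is a scheduled command keyed by its run time (an int for brevity).
type item struct {
	runAt int
	id    string
}

type byRunAt []item

func (h byRunAt) Len() int           { return len(h) }
func (h byRunAt) Less(i, j int) bool { return h[i].runAt < h[j].runAt }
func (h byRunAt) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *byRunAt) Push(x any)        { *h = append(*h, x.(item)) }
func (h *byRunAt) Pop() any {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

// queue guards the heap with one mutex; as in the audited code,
// any DB work happens only after the lock is released.
type queue struct {
	mu sync.Mutex
	h  byRunAt
}

func (q *queue) Add(it item) {
	q.mu.Lock()
	defer q.mu.Unlock()
	heap.Push(&q.h, it)
}

// PopBefore drains every item due at or before cutoff, in order.
func (q *queue) PopBefore(cutoff int) []item {
	q.mu.Lock()
	defer q.mu.Unlock()
	var due []item
	for len(q.h) > 0 && q.h[0].runAt <= cutoff {
		due = append(due, heap.Pop(&q.h).(item))
	}
	return due
}

func main() {
	q := &queue{}
	q.Add(item{30, "c3"})
	q.Add(item{10, "c1"})
	q.Add(item{20, "c2"})
	for _, it := range q.PopBefore(20) {
		fmt.Println(it.id) // c1, then c2
	}
}
```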


3. RAPID MODE (5-SECOND POLLING)

Trigger Mechanism

Server-side: SetRapidPollingMode (agents.go ~line 1221) sets rapid_polling_enabled and rapid_polling_until in agent metadata. Max duration: 60 minutes (validated: max=60).

Agent-side: getCurrentPollingInterval (main.go) checks cfg.RapidPollingEnabled and cfg.RapidPollingUntil. Returns 5 seconds if active, standard interval otherwise.

Server-Side Timeout — F-B2-3 LOW

Rapid mode is stored in agent metadata (JSONB). There is no server-side expiry enforcement — the agent self-expires by checking time.Now().Before(cfg.RapidPollingUntil). If the agent crashes and never restarts, the metadata flag persists forever in the DB. It has no operational impact (no server resources consumed), but it's stale data.

Load Under 50 Concurrent Rapid-Mode Agents — F-B2-4 MEDIUM

Each rapid-mode check-in executes:

  1. UpdateAgentLastSeen (1 UPDATE)
  2. GetPendingCommands (1 SELECT)
  3. GetStuckCommands (1 SELECT)
  4. Various metadata UPDATEs (~2-3 UPDATEs)

Per cycle: ~5 queries. At 5-second intervals with 50 agents: 50 agents x 5 queries x 12 cycles/minute = 3,000 queries/minute. This is well within PostgreSQL capacity for simple indexed queries, but there's no server-side cap on how many agents can enter rapid mode simultaneously.

Rate limiting exists at the router level (agent_reports rate limiter on the SetRapidPollingMode endpoint), but the check-in endpoint (GetCommands) has no rapid-mode-specific throttling. The GetCommands handler is not rate-limited at all (it's in the agent group without per-route limiting).

Backpressure — ABSENT (F-B2-5)

No debouncing or backpressure mechanism exists for rapid-mode check-ins. The 30-second jitter at the start of each loop iteration (main.go ~line 703) provides some natural spread, but it's applied to ALL polling intervals including rapid mode. A 30-second jitter on a 5-second interval means check-ins effectively happen every 5-35 seconds, not every 5 seconds.


4. AGENT STATUS STALENESS

Timeout Values

timeout.go:28-29:

  • sentTimeout = 2 * time.Hour — for commands in 'sent' status
  • pendingTimeout = 30 * time.Minute — for commands stuck in pending
  • Check frequency: every 5 minutes (time.NewTicker(5 * time.Minute))

main.go ~line 429:

  • Offline threshold: 10 * time.Minute (hardcoded)
  • Check frequency: every 2 minutes (hardcoded)

F-B2-6 LOW: Non-configurable timeout values

All timeout values are hardcoded. The TODO: Make these timeout durations user-adjustable at timeout.go:30 has not been implemented. The offline threshold (10 minutes) and check frequency (2 minutes) in main.go are also hardcoded.

Thundering Herd on Restart — F-B2-7 MEDIUM

When the server restarts after 2 hours of downtime:

  1. Offline check runs immediately (MarkOfflineAgents(10 * time.Minute))
  2. ALL agents with last_seen < NOW() - 10 minutes are marked offline in one UPDATE
  3. This is a single atomic UPDATE — not a thundering herd on the DB side
  4. But when agents start checking in after the server comes back, they all hit the server simultaneously

Agent-side mitigation: The 30-second jitter (time.Duration(rand.Intn(30)) * time.Second) at main.go ~line 703 provides some staggering. But agents that were waiting for the server will all retry within their backoff window simultaneously.

In-Flight Commands on Offline — ACCEPTABLE

When an agent goes offline, its commands remain in their current status. Pending commands are not cancelled. The timeout service eventually marks sent commands as timed_out after 2 hours. Pending commands time out after 30 minutes. No data loss occurs.


5. TRANSACTION SAFETY AUDIT

| Operation | Transactional? | Details |
|---|---|---|
| Agent registration | NO | 4 separate operations: validate token, create agent, mark token, create refresh token. Manual rollback (delete agent) on token failure. |
| Command approval | NO | ApproveUpdate creates a command via signAndCreateCommand, which is a single INSERT, but the approval status update and command creation are separate. |
| Command retry | NO | RetryCommand builds a new command and calls signAndCreateCommand. No transaction wrapping the original status check + new command creation. |
| Token renewal | NO | ValidateRefreshToken (SELECT), then UpdateExpiration (UPDATE), then GenerateAgentToken (memory op). Not transactional. |
| Agent deletion | YES | DeleteAgent uses a transaction (agents.go:211-224). Also relies on CASCADE for child records. |
| Bulk update approval | YES | BulkApproveUpdates uses a transaction (updates.go:159-178). |
| Update package status | YES | UpdatePackageStatus uses a transaction (updates.go:532-580). |

F-B2-8 HIGH: Agent registration not transactional

The most critical non-transactional operation. A crash between any of the 4 steps leaves the system in an inconsistent state. The manual rollback (delete agent on token failure) is a best-effort mitigation, not an atomic guarantee.

F-B2-9 MEDIUM: Token renewal not transactional

If the server crashes between validating the refresh token and updating its expiry, the token is consumed (validated) but the new JWT is never issued. The agent would retry and succeed (token is still valid), so this is self-healing. Low practical risk.


6. DEADLOCK RISK AUDIT

DB Lock Ordering — LOW RISK

  • No handler acquires locks on both agents and agent_commands in the same transaction
  • Agent deletion uses CASCADE (single DELETE triggers cascading deletes — PostgreSQL handles lock ordering internally)
  • The scheduler creates commands via CreateCommand which does a single INSERT (no multi-table locks)
  • The timeout service does single-row UPDATEs (no multi-row locking within a transaction)

Go Mutex / DB Lock Interaction — LOW RISK

  • CommandHandler.executedIDs uses sync.Mutex (agent-side, no DB interaction under the lock)
  • CommandHandler.keyCache uses sync.RWMutex (agent-side, may do network I/O under lock for key fetch, but no DB)
  • Scheduler PriorityQueue uses sync.Mutex (server-side, DB operations happen AFTER the lock is released)
  • No handler holds a Go mutex while also holding a DB transaction

Rapid Mode + Timeout Service — NO DEADLOCK

The rapid mode handler updates agent metadata via UpdateAgent (single-row UPDATE). The timeout service updates command status via UpdateCommandStatus (different table). No overlap.


7. ETHOS CROSS-CHECK

ETHOS #3 — Assume Failure

Violations:

  • Registration assumes CreateAgent succeeds before attempting MarkTokenUsed (F-B2-8)
  • Token renewal assumes ValidateRefreshToken + UpdateExpiration succeed atomically (F-B2-9)
  • GetCommands assumes MarkCommandSent succeeds (but does handle failure with logging)

ETHOS #4 — Idempotency

  • Command delivery: Not idempotent (the same command can be delivered twice by concurrent requests). Mitigated by agent-side dedup.
  • Registration: Not idempotent (the same token can validate twice before one is consumed). Mitigated by stored-procedure atomicity on the mark-as-used step.
  • Token renewal: Idempotent (renewing twice just extends the expiry twice — harmless).
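The agent-side dedup referenced above can be sketched as a guarded set; executedIDs is the name the audit uses, the rest is illustrative:

```go
package main

import (
	"fmt"
	"sync"
)

// executedIDs makes command execution idempotent per agent lifecycle:
// the first caller for an ID wins, repeats are ignored. The map lives
// only in memory, which is why dedup does not survive an agent restart
// (the cross-restart re-execution risk noted in section 2).
type executedIDs struct {
	mu   sync.Mutex
	seen map[string]struct{}
}

func newExecutedIDs() *executedIDs {
	return &executedIDs{seen: make(map[string]struct{})}
}

// FirstDelivery reports whether this command ID has not been seen yet,
// recording it as seen in the same locked step.
func (e *executedIDs) FirstDelivery(id string) bool {
	e.mu.Lock()
	defer e.mu.Unlock()
	if _, ok := e.seen[id]; ok {
		return false
	}
	e.seen[id] = struct{}{}
	return true
}

func main() {
	e := newExecutedIDs()
	fmt.Println(e.FirstDelivery("cmd-1")) // true: execute
	fmt.Println(e.FirstDelivery("cmd-1")) // false: duplicate delivery, skip
}
```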


FINDINGS SUMMARY

| ID | Severity | Finding | Location |
|---|---|---|---|
| F-B2-1 | HIGH | Registration flow uses 4 separate DB operations without a transaction. Crash between CreateAgent and MarkTokenUsed leaves an orphaned agent. | agents.go RegisterAgent (~line 80-200) |
| F-B2-2 | MEDIUM | GetCommands + MarkCommandSent not in a transaction. Concurrent requests from the same agent can receive duplicate commands. | agents.go GetCommands (~line 428-470) |
| F-B2-3 | LOW | Rapid mode metadata persists in DB after agent crash (stale, no operational impact). | agents.go SetRapidPollingMode (~line 1221) |
| F-B2-4 | MEDIUM | No server-side cap on concurrent rapid-mode agents. 50 agents at 5s intervals = 3,000 queries/min. | main.go, agents.go |
| F-B2-5 | LOW | 30-second jitter on a 5-second rapid interval effectively negates rapid mode (check-ins at 5-35s). | agent main.go ~line 703 |
| F-B2-6 | LOW | All timeout values hardcoded (2h sent, 30m pending, 10m offline). TODO exists but not implemented. | timeout.go:28-30, main.go:429 |
| F-B2-7 | MEDIUM | No staggered reconnection after server restart. All agents retry simultaneously within the backoff window. | agent main.go ~line 703 |
| F-B2-8 | HIGH | Agent registration not transactional. 4 separate DB operations with manual rollback on failure. | agents.go RegisterAgent |
| F-B2-9 | MEDIUM | Token renewal not transactional (validate + update expiry + issue JWT). Self-healing on retry. | agents.go RenewToken (~line 995) |
| F-B2-10 | LOW | No maximum retry count for stuck commands. A persistently failing command can loop indefinitely between sent→stuck→re-sent. | timeout.go, agents.go GetCommands |