Redflag/docs/B2_Fix_Implementation.md
jpetree331 3ca42d50f4 fix(concurrency): B-2 data integrity and race condition fixes
- Wrap agent registration in DB transaction (F-B2-1/F-B2-8)
  All 4 ops atomic, manual DeleteAgent rollback removed
- Use SELECT FOR UPDATE SKIP LOCKED for atomic command delivery (F-B2-2)
  Concurrent requests get different commands, no duplicates
- Wrap token renewal in DB transaction (F-B2-9)
  Validate + update expiry atomic
- Add rate limit to GET /agents/:id/commands (F-B2-4)
  agent_checkin rate limiter applied
- Add retry_count column, cap stuck command retries at 5 (F-B2-10)
  Migration 029, GetStuckCommands filters retry_count < 5
- Cap polling jitter at current interval (fixes rapid mode) (F-B2-5)
  maxJitter = min(pollingInterval/2, 30s)
- Add exponential backoff with full jitter on reconnection (F-B2-7)
  calculateBackoff: base=10s, cap=5min, reset on success

All tests pass. No regressions from A-series or B-1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 08:00:36 -04:00


B-2 Data Integrity & Concurrency Fix Implementation

Date: 2026-03-29
Branch: culurien


Files Changed

Server

| File | Change |
| --- | --- |
| handlers/agents.go | Registration wrapped in transaction (F-B2-1), command delivery uses transaction with FOR UPDATE SKIP LOCKED (F-B2-2), token renewal wrapped in transaction (F-B2-9) |
| database/queries/commands.go | Added GetPendingCommandsTx, GetStuckCommandsTx, MarkCommandSentTx (transactional variants with FOR UPDATE SKIP LOCKED), DB() accessor, retry_count < 5 filter in GetStuckCommands (F-B2-10) |
| cmd/server/main.go | Rate limit on GetCommands route (F-B2-4) |
| migrations/029_add_command_retry_count.up.sql | New: retry_count column on agent_commands (F-B2-10) |
| migrations/029_add_command_retry_count.down.sql | New: rollback |

Agent

| File | Change |
| --- | --- |
| cmd/agent/main.go | Proportional jitter (F-B2-5), exponential backoff with calculateBackoff() (F-B2-7), consecutiveFailures counter |

Transaction Strategy

Registration (F-B2-1): h.agentQueries.DB.Beginx() starts a transaction. CreateAgent, MarkTokenUsed, and CreateRefreshToken all execute on tx. The JWT is generated only after tx.Commit() succeeds, so no token is issued for a registration that was rolled back. A deferred tx.Rollback() cleans up on any error before the commit.
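The commit-then-issue ordering described for registration can be sketched as follows. This is a minimal illustration, not the handler's actual code: txn, fakeTx, errTxDone, and registerInTx are hypothetical names introduced here, and the steps stand in for CreateAgent, MarkTokenUsed, and CreateRefreshToken.

```go
package main

import "errors"

// txn is the subset of a transaction handle this sketch relies on.
// *sql.Tx (and sqlx's *sqlx.Tx, which embeds it) provides both methods.
type txn interface {
	Commit() error
	Rollback() error
}

// errTxDone is a hypothetical sentinel used by the fake below.
var errTxDone = errors.New("transaction already finished")

// fakeTx is a stand-in used only to exercise the sketch without a database.
type fakeTx struct {
	committed  bool
	rolledBack bool
}

func (f *fakeTx) Commit() error {
	f.committed = true
	return nil
}

// Rollback mimics database/sql: after a Commit it changes nothing and
// returns an error, which is why the deferred call below is harmless.
func (f *fakeTx) Rollback() error {
	if f.committed {
		return errTxDone
	}
	f.rolledBack = true
	return nil
}

// registerInTx runs every step on one transaction and issues the token
// only after the commit has succeeded.
func registerInTx(tx txn, steps []func() error, issueToken func() (string, error)) (string, error) {
	defer tx.Rollback() // undoes partial work on any early return

	for _, step := range steps {
		if err := step(); err != nil {
			return "", err
		}
	}
	if err := tx.Commit(); err != nil {
		return "", err
	}
	return issueToken()
}
```

If any step fails, the deferred rollback runs and issueToken is never called. The token renewal path (F-B2-9) follows the same shape with its own two steps.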

Command Delivery (F-B2-2): h.commandQueries.DB().Beginx() starts a transaction. GetPendingCommandsTx and GetStuckCommandsTx use SELECT ... FOR UPDATE SKIP LOCKED, and MarkCommandSentTx updates the selected rows within the same transaction. A concurrent request skips rows that another transaction has already locked, so two requests never receive the same command.
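A rough sketch of the select-lock-mark sequence, written against the standard database/sql package rather than the project's query layer. Only the agent_commands table name comes from this document; the column names (id, agent_id, status, created_at), the status values, the PostgreSQL-style placeholders, and the function claimPendingCommands are assumptions made for illustration. It needs a live database and driver to run.

```go
package main

import (
	"context"
	"database/sql"
)

// Hypothetical schema: agent_commands(id, agent_id, status, created_at).
// SKIP LOCKED makes a concurrent transaction pass over rows locked here.
const selectPendingSQL = `
SELECT id FROM agent_commands
WHERE agent_id = $1 AND status = 'pending'
ORDER BY created_at
FOR UPDATE SKIP LOCKED`

const markSentSQL = `UPDATE agent_commands SET status = 'sent' WHERE id = $1`

// claimPendingCommands locks an agent's pending rows, marks them sent, and
// commits, all in one transaction.
func claimPendingCommands(ctx context.Context, db *sql.DB, agentID string) ([]string, error) {
	tx, err := db.BeginTx(ctx, nil)
	if err != nil {
		return nil, err
	}
	defer tx.Rollback() // returns sql.ErrTxDone after a successful Commit

	rows, err := tx.QueryContext(ctx, selectPendingSQL, agentID)
	if err != nil {
		return nil, err
	}
	var ids []string
	for rows.Next() {
		var id string
		if err := rows.Scan(&id); err != nil {
			rows.Close()
			return nil, err
		}
		ids = append(ids, id)
	}
	if err := rows.Err(); err != nil {
		return nil, err
	}

	// The row locks are held until Commit, so the update cannot race
	// with another request.
	for _, id := range ids {
		if _, err := tx.ExecContext(ctx, markSentSQL, id); err != nil {
			return nil, err
		}
	}
	if err := tx.Commit(); err != nil {
		return nil, err
	}
	return ids, nil
}
```

Without SKIP LOCKED, a second request would block on the first request's row locks until its commit; with it, the second request simply returns whatever unlocked rows remain.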

Token Renewal (F-B2-9): ValidateRefreshToken and UpdateExpiration run on the same transaction, so validation and the expiry update are atomic. The JWT is generated after the commit.

Retry Count (F-B2-10)

Migration 029 adds retry_count INTEGER NOT NULL DEFAULT 0 to agent_commands. GetStuckCommands adds the condition AND retry_count < 5 to its query, so a stuck command is re-delivered at most 5 times.
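The effect of the filter can be illustrated in memory. This sketch covers only the retry_count < 5 condition; the command struct, maxRetries, and stuckCommandsToRetry are hypothetical names, and where retry_count is incremented is not covered here.

```go
package main

// maxRetries mirrors the cap of 5 applied by the query filter.
const maxRetries = 5

// command is a hypothetical in-memory view of an agent_commands row.
type command struct {
	ID         string
	RetryCount int
}

// stuckCommandsToRetry keeps only commands that are still under the cap,
// which is the same test as `retry_count < 5` in SQL.
func stuckCommandsToRetry(stuck []command) []command {
	var eligible []command
	for _, c := range stuck {
		if c.RetryCount < maxRetries {
			eligible = append(eligible, c)
		}
	}
	return eligible
}
```

A command whose count has reached 5 no longer appears among the stuck commands selected for re-delivery.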

Jitter Cap (F-B2-5)

maxJitter = min(pollingInterval/2, 30s). With the 5s rapid-mode interval this gives 0-2s of jitter; with the standard 300s interval it gives 0-30s.
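A minimal sketch of the cap, assuming the polling interval is held in whole seconds (which would explain why a 5s interval yields at most 2s rather than 2.5s). The function names and signatures are illustrative, not taken from cmd/agent/main.go.

```go
package main

import (
	"math/rand"
	"time"
)

// maxJitter returns min(pollingInterval/2, 30) in whole seconds.
// Integer division truncates, so 5/2 gives 2.
func maxJitter(pollingIntervalSeconds int) int {
	half := pollingIntervalSeconds / 2
	if half < 0 {
		return 0 // guard against a negative interval
	}
	if half > 30 {
		return 30
	}
	return half
}

// jitter picks a random whole-second delay in [0, maxJitter].
func jitter(pollingIntervalSeconds int) time.Duration {
	limit := maxJitter(pollingIntervalSeconds)
	return time.Duration(rand.Intn(limit+1)) * time.Second
}
```

Because the bound scales with the interval, the jitter can never exceed the interval itself, which is the rapid-mode problem named in F-B2-5.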

Exponential Backoff (F-B2-7)

calculateBackoff(attempt) uses base = 10s and cap = 5min, and returns delay = rand(base, min(cap, base*2^attempt)). The attempt count resets to 0 on success.
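The formula can be sketched as below. The constants come from the text; the helper backoffCeiling, the inclusive upper bound, and the doubling loop (used instead of a shift so that large attempt values cannot overflow) are assumptions of this sketch rather than the agent's actual code.

```go
package main

import (
	"math/rand"
	"time"
)

const (
	backoffBase = 10 * time.Second
	backoffCap  = 5 * time.Minute
)

// backoffCeiling returns min(cap, base*2^attempt). It stops doubling once
// the cap is reached, so a large attempt count cannot overflow.
func backoffCeiling(attempt int) time.Duration {
	ceiling := backoffBase
	for i := 0; i < attempt && ceiling < backoffCap; i++ {
		ceiling *= 2
	}
	if ceiling > backoffCap {
		ceiling = backoffCap
	}
	return ceiling
}

// calculateBackoff returns a random delay in [base, ceiling].
func calculateBackoff(attempt int) time.Duration {
	ceiling := backoffCeiling(attempt)
	span := int64(ceiling - backoffBase)
	return backoffBase + time.Duration(rand.Int63n(span+1))
}
```

Under these assumptions the ceiling grows 10s, 20s, 40s, 80s, 160s and then stays at the 5min cap from the fifth consecutive failure onward; the caller would pass 0 again after a successful connection.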

Final Migration Sequence

001 → ... → 028 → 029. No duplicates.