fix(concurrency): B-2 data integrity and race condition fixes

- Wrap agent registration in DB transaction (F-B2-1/F-B2-8)
  All 4 ops atomic, manual DeleteAgent rollback removed
- Use SELECT FOR UPDATE SKIP LOCKED for atomic command delivery (F-B2-2)
  Concurrent requests get different commands, no duplicates
- Wrap token renewal in DB transaction (F-B2-9)
  Validate + update expiry atomic
- Add rate limit to GET /agents/:id/commands (F-B2-4)
  agent_checkin rate limiter applied
- Add retry_count column, cap stuck command retries at 5 (F-B2-10)
  Migration 029, GetStuckCommands filters retry_count < 5
- Cap polling jitter at current interval (fixes rapid mode) (F-B2-5)
  maxJitter = min(pollingInterval/2, 30s)
- Add exponential backoff with full jitter on reconnection (F-B2-7)
  calculateBackoff: base=10s, cap=5min, reset on success

All tests pass. No regressions from A-series or B-1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-03-29 08:00:36 -04:00
parent 59ab7cbd5f
commit 3ca42d50f4
14 changed files with 425 additions and 501 deletions

View File

@@ -0,0 +1,50 @@
# B-2 Data Integrity & Concurrency Fix Implementation
**Date:** 2026-03-29
**Branch:** culurien
---
## Files Changed
### Server
| File | Change |
|------|--------|
| `handlers/agents.go` | Registration wrapped in transaction (F-B2-1), command delivery uses transaction with FOR UPDATE SKIP LOCKED (F-B2-2), token renewal wrapped in transaction (F-B2-9) |
| `database/queries/commands.go` | Added GetPendingCommandsTx, GetStuckCommandsTx, MarkCommandSentTx (transactional variants with FOR UPDATE SKIP LOCKED), DB() accessor, retry_count < 5 filter in GetStuckCommands (F-B2-10) |
| `cmd/server/main.go` | Rate limit on GetCommands route (F-B2-4) |
| `migrations/029_add_command_retry_count.up.sql` | New: retry_count column on agent_commands (F-B2-10) |
| `migrations/029_add_command_retry_count.down.sql` | New: rollback |
### Agent
| File | Change |
|------|--------|
| `cmd/agent/main.go` | Proportional jitter (F-B2-5), exponential backoff with calculateBackoff() (F-B2-7), consecutiveFailures counter |
---
## Transaction Strategy
**Registration (F-B2-1):** `h.agentQueries.DB.Beginx()` starts a transaction. CreateAgent, MarkTokenUsed, and CreateRefreshToken all execute on `tx`. JWT is generated AFTER `tx.Commit()`. `defer tx.Rollback()` ensures cleanup on any error.
**Command Delivery (F-B2-2):** `h.commandQueries.DB().Beginx()` starts a transaction. GetPendingCommandsTx and GetStuckCommandsTx use `SELECT ... FOR UPDATE SKIP LOCKED`. MarkCommandSentTx updates within the same transaction. Concurrent requests skip locked rows (get different commands).
**Token Renewal (F-B2-9):** ValidateRefreshToken and UpdateExpiration run on the same transaction. JWT generated after commit.
## Retry Count (F-B2-10)
Migration 029 adds `retry_count INTEGER NOT NULL DEFAULT 0`. GetStuckCommands filters `AND retry_count < 5`. Max 5 re-deliveries per command.
## Jitter Cap (F-B2-5)
`maxJitter = min(pollingInterval/2, 30s)`. Rapid mode (5s) gets 0-2s jitter. Standard (300s) gets 0-30s.
## Exponential Backoff (F-B2-7)
`calculateBackoff(attempt)`: base=10s, cap=5min, delay=rand(base, min(cap, base*2^attempt)). Reset to 0 on success.
## Final Migration Sequence
001 → ... → 028 → 029. No duplicates.