Files
Redflag/docs/A2_Fix_Implementation.md
jpetree331 f97d4845af feat(security): A-1 Ed25519 key rotation + A-2 replay attack fixes
Complete RedFlag codebase with two major security audit implementations.

== A-1: Ed25519 Key Rotation Support ==

Server:
- SignCommand sets SignedAt timestamp and KeyID on every signature
- signing_keys database table (migration 020) for multi-key rotation
- InitializePrimaryKey registers active key at startup
- /api/v1/public-keys endpoint for rotation-aware agents
- SigningKeyQueries for key lifecycle management

Agent:
- Key-ID-aware verification via CheckKeyRotation
- FetchAndCacheAllActiveKeys for rotation pre-caching
- Cache metadata with TTL and staleness fallback
- SecurityLogger events for key rotation and command signing

== A-2: Replay Attack Fixes (F-1 through F-7) ==

F-5 CRITICAL - RetryCommand now signs via signAndCreateCommand
F-1 HIGH     - v3 format: "{agent_id}:{cmd_id}:{type}:{hash}:{ts}"
F-7 HIGH     - Migration 026: expires_at column with partial index
F-6 HIGH     - GetPendingCommands/GetStuckCommands filter by expires_at
F-2 HIGH     - Agent-side executedIDs dedup map with cleanup
F-4 HIGH     - commandMaxAge reduced from 24h to 4h
F-3 CRITICAL - Old-format commands rejected after 48h via CreatedAt

Verification fixes: migration idempotency (ETHOS #4), log format
compliance (ETHOS #1), stale comments updated.

All 24 tests passing. Docker --no-cache build verified.
See docs/ for full audit reports and deviation log (DEV-001 to DEV-019).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 21:25:47 -04:00

171 lines
7.8 KiB
Markdown

# A2 — Replay Attack Fix Implementation Report
**Date:** 2026-03-28
**Branch:** unstabledeveloper
**Audit Reference:** docs/A2_Replay_Attack_Audit.md
---
## Summary
This document covers the implementation of fixes for 7 audit findings (F-1 through F-7) identified in the replay attack surface audit. All fixes maintain backward compatibility with pre-A1 agents and servers.
---
## Files Changed
### Server Side
| File | Change |
|------|--------|
| `aggregator-server/internal/services/signing.go` | v3 signed message format includes agent_id (F-1) |
| `aggregator-server/internal/models/command.go` | Added `ExpiresAt`, `AgentID`, `CreatedAt` to structs (F-7, F-1, F-3) |
| `aggregator-server/internal/database/queries/commands.go` | TTL filter in GetPendingCommands/GetStuckCommands, expires_at in CreateCommand (F-6, F-7) |
| `aggregator-server/internal/api/handlers/updates.go` | RetryCommand refactored to sign via signAndCreateCommand (F-5) |
| `aggregator-server/internal/api/handlers/agents.go` | GetCommands passes AgentID and CreatedAt to CommandItem (F-1, F-3) |
| `aggregator-server/internal/database/queries/docker.go` | Fix pre-existing fmt.Sprintf build error (unrelated) |
| `aggregator-server/internal/database/migrations/026_add_expires_at.up.sql` | New migration: expires_at column + index + backfill (F-7) |
| `aggregator-server/internal/database/migrations/026_add_expires_at.down.sql` | Rollback migration (F-7) |
### Agent Side
| File | Change |
|------|--------|
| `aggregator-agent/internal/crypto/verification.go` | v3 message format, field-count detection, old-format 48h expiry (F-1, F-3) |
| `aggregator-agent/internal/orchestrator/command_handler.go` | Dedup set, commandMaxAge=4h, CleanupExecutedIDs (F-2, F-4) |
| `aggregator-agent/internal/client/client.go` | Added AgentID and CreatedAt to Command struct (F-1, F-3) |
| `aggregator-agent/cmd/agent/main.go` | Wired CleanupExecutedIDs into key refresh cycle (F-2) |
### Test Files (Updated)
| File | Tests Updated |
|------|---------------|
| `aggregator-server/internal/services/signing_replay_test.go` | TestRetryCommandIsUnsigned, TestRetryCommandMustBeSigned, TestSignedCommandNotBoundToAgent, TestOldFormatCommandHasNoExpiry |
| `aggregator-server/internal/database/queries/commands_ttl_test.go` | TestGetPendingCommandsHasNoTTLFilter, TestGetPendingCommandsMustHaveTTLFilter |
| `aggregator-server/internal/api/handlers/retry_signing_test.go` | simulateRetryCommand, TestRetryCommandEndpointProducesUnsignedCommand, TestRetryCommandEndpointMustProduceSignedCommand |
| `aggregator-agent/internal/crypto/replay_test.go` | TestOldFormatReplayIsUnbounded, TestNewFormatCommandCanBeReplayedWithin24Hours, TestSameCommandCanBeVerifiedTwice, TestCrossAgentSignatureVerifies + new: TestOldFormatRecentCommandStillPasses, TestCommandBeyond4HoursIsRejected |
| `aggregator-agent/internal/crypto/verification_test.go` | All tests updated for v3 format (AgentID), signCommand helper updated, signCommandV2 added |
---
## Signed Message Format (v3)
### New Format
```
"{agent_id}:{cmd_id}:{command_type}:{sha256(params)}:{unix_timestamp}"
```
5 colon-separated fields.
### Previous Formats (backward compat)
- **v2 (4 fields):** `"{cmd_id}:{command_type}:{sha256(params)}:{unix_timestamp}"` — has signed_at, no agent_id
- **v1 (3 fields):** `"{cmd_id}:{command_type}:{sha256(params)}"` — no timestamp, no agent_id
### Backward Compatibility Detection
The agent's `VerifyCommandWithTimestamp` detects the format:
1. If `cmd.AgentID != ""` → try v3 first. If v3 fails, fall back to v2 with warning.
2. If `cmd.AgentID == ""` and `cmd.SignedAt != nil` → v2 format with warning.
3. If `cmd.SignedAt == nil` → v1 format (oldest) with warning + 48h created_at check.
Warnings are logged at the `[crypto]` level to alert operators to upgrade.
---
## Deduplication Window
- **Implementation:** In-memory `executedIDs map[string]time.Time` in `CommandHandler`
- **Window:** Entries are kept for `commandMaxAge` (4 hours)
- **Cleanup:** Runs every 6 hours when `ShouldRefreshKey()` fires
- **Restart Limitation:** The map is lost on agent restart. Commands issued within `commandMaxAge` can be replayed if the agent restarts. A TODO comment documents the future disk persistence path.
---
## Two-Phase Plan for Retiring Old-Format Commands
### Phase 1 (Implemented Now)
- Old-format commands (no `signed_at`) with `created_at > 48h` are rejected by `VerifyCommand`
- Old-format commands within 48h still pass (backward compat for recent commands)
- The `created_at` field is now included in the `CommandItem` API response
### Phase 2 (Future Work — 90 Days After Migration 025 Deployment)
- Remove the old-format fallback in `VerifyCommandWithTimestamp` entirely
- Enforce `signed_at` as required on all commands
- Remove `VerifyCommand()` from the public API
- This ensures all commands use timestamped, agent-bound signatures
---
## Docker Build + Test Output
### Server Build
```
docker-compose build server
# ... builds successfully
Service server Built
```
### Server Tests
```
=== RUN TestRetryCommandIsUnsigned
--- PASS: TestRetryCommandIsUnsigned (0.00s)
=== RUN TestRetryCommandMustBeSigned
--- PASS: TestRetryCommandMustBeSigned (0.00s)
=== RUN TestSignedCommandNotBoundToAgent
--- PASS: TestSignedCommandNotBoundToAgent (0.00s)
=== RUN TestOldFormatCommandHasNoExpiry
--- PASS: TestOldFormatCommandHasNoExpiry (0.00s)
ok github.com/Fimeg/RedFlag/aggregator-server/internal/services
=== RUN TestGetPendingCommandsHasNoTTLFilter
--- PASS: TestGetPendingCommandsHasNoTTLFilter (0.00s)
=== RUN TestGetPendingCommandsMustHaveTTLFilter
--- PASS: TestGetPendingCommandsMustHaveTTLFilter (0.00s)
=== RUN TestRetryCommandQueryDoesNotCopySignature
--- PASS: TestRetryCommandQueryDoesNotCopySignature (0.00s)
ok github.com/Fimeg/RedFlag/aggregator-server/internal/database/queries
=== RUN TestRetryCommandEndpointProducesUnsignedCommand
--- PASS: TestRetryCommandEndpointProducesUnsignedCommand (0.00s)
=== RUN TestRetryCommandEndpointMustProduceSignedCommand
--- PASS: TestRetryCommandEndpointMustProduceSignedCommand (0.00s)
=== RUN TestRetryCommandHTTPHandlerProducesUnsignedCommand_Integration
--- SKIP: TestRetryCommandHTTPHandlerProducesUnsignedCommand_Integration (0.00s)
ok github.com/Fimeg/RedFlag/aggregator-server/internal/api/handlers
```
### Agent Tests
```
=== RUN TestCacheMetadataIsExpired
--- PASS: TestCacheMetadataIsExpired (0.00s)
=== RUN TestOldFormatReplayIsUnbounded
--- PASS: TestOldFormatReplayIsUnbounded (0.00s)
=== RUN TestOldFormatRecentCommandStillPasses
--- PASS: TestOldFormatRecentCommandStillPasses (0.00s)
=== RUN TestNewFormatCommandCanBeReplayedWithin24Hours
--- PASS: TestNewFormatCommandCanBeReplayedWithin24Hours (0.00s)
=== RUN TestCommandBeyond4HoursIsRejected
--- PASS: TestCommandBeyond4HoursIsRejected (0.00s)
=== RUN TestSameCommandCanBeVerifiedTwice
--- PASS: TestSameCommandCanBeVerifiedTwice (0.00s)
=== RUN TestCrossAgentSignatureVerifies
--- PASS: TestCrossAgentSignatureVerifies (0.00s)
=== RUN TestVerifyCommandWithTimestamp_ValidRecent
--- PASS: TestVerifyCommandWithTimestamp_ValidRecent (0.00s)
=== RUN TestVerifyCommandWithTimestamp_TooOld
--- PASS: TestVerifyCommandWithTimestamp_TooOld (0.00s)
=== RUN TestVerifyCommandWithTimestamp_FutureBeyondSkew
--- PASS: TestVerifyCommandWithTimestamp_FutureBeyondSkew (0.00s)
=== RUN TestVerifyCommandWithTimestamp_FutureWithinSkew
--- PASS: TestVerifyCommandWithTimestamp_FutureWithinSkew (0.00s)
=== RUN TestVerifyCommandWithTimestamp_BackwardCompatNoTimestamp
--- PASS: TestVerifyCommandWithTimestamp_BackwardCompatNoTimestamp (0.00s)
=== RUN TestVerifyCommandWithTimestamp_WrongKey
--- PASS: TestVerifyCommandWithTimestamp_WrongKey (0.00s)
=== RUN TestVerifyCommand_BackwardCompat
--- PASS: TestVerifyCommand_BackwardCompat (0.00s)
ok github.com/Fimeg/RedFlag/aggregator-agent/internal/crypto
```
All tests pass. No regressions detected.