feat(security): A-1 Ed25519 key rotation + A-2 replay attack fixes

Complete RedFlag codebase with two major security audit implementations.

== A-1: Ed25519 Key Rotation Support ==

Server:
- SignCommand sets SignedAt timestamp and KeyID on every signature
- signing_keys database table (migration 020) for multi-key rotation
- InitializePrimaryKey registers active key at startup
- /api/v1/public-keys endpoint for rotation-aware agents
- SigningKeyQueries for key lifecycle management

Agent:
- Key-ID-aware verification via CheckKeyRotation
- FetchAndCacheAllActiveKeys for rotation pre-caching
- Cache metadata with TTL and staleness fallback
- SecurityLogger events for key rotation and command signing

== A-2: Replay Attack Fixes (F-1 through F-7) ==

F-5 CRITICAL - RetryCommand now signs via signAndCreateCommand
F-1 HIGH - v3 format: "{agent_id}:{cmd_id}:{type}:{hash}:{ts}"
F-7 HIGH - Migration 026: expires_at column with partial index
F-6 HIGH - GetPendingCommands/GetStuckCommands filter by expires_at
F-2 HIGH - Agent-side executedIDs dedup map with cleanup
F-4 HIGH - commandMaxAge reduced from 24h to 4h
F-3 CRITICAL - Old-format commands rejected after 48h via CreatedAt

Verification fixes: migration idempotency (ETHOS #4), log format
compliance (ETHOS #1), stale comments updated.

All 24 tests passing. Docker --no-cache build verified.
See docs/ for full audit reports and deviation log (DEV-001 to DEV-019).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 21:25:47 -04:00
commit f97d4845af
340 changed files with 75403 additions and 0 deletions
--- a/docs/A2_PreFix_Tests.md
+++ b/docs/A2_PreFix_Tests.md
@@ -0,0 +1,291 @@
+# A-2 Pre-Fix Test Suite
+**Date**: 2026-03-28
+**Branch**: unstabledeveloper
+**Purpose**: Document replay attack bugs BEFORE fixes are applied.
+
+These tests prove that the bugs exist today and will prove the fixes work
+when applied. Do NOT modify these tests before the fix is ready — they are
+the regression baseline.
+
+---
+
+## Test Files Created
+
+| File | Package | Bugs Documented |
+|------|---------|-----------------|
+| `aggregator-server/internal/services/signing_replay_test.go` | `services_test` | F-5, F-1, F-3 |
+| `aggregator-agent/internal/crypto/replay_test.go` | `crypto` | F-3, F-4, F-2, F-1 |
+| `aggregator-server/internal/database/queries/commands_ttl_test.go` | `queries_test` | F-6, F-7, F-5 |
+| `aggregator-server/internal/api/handlers/retry_signing_test.go` | `handlers_test` | F-5 |
+
+---
+
+## How to Run
+
+```bash
+# Run from the repository root. Each command runs in a subshell so the
+# working directory is restored before the next line.
+
+# Server-side tests (all pre-fix tests)
+(cd aggregator-server && go test ./internal/services/... -v -run "TestRetry|TestSigned|TestOld")
+(cd aggregator-server && go test ./internal/database/queries/... -v -run "TestGetPending|TestRetryCommandQuery")
+(cd aggregator-server && go test ./internal/api/handlers/... -v -run TestRetryCommand)
+
+# Agent-side tests (all pre-fix tests)
+(cd aggregator-agent && go test ./internal/crypto/... -v -run "TestOld|TestNew|TestSame|TestCross")
+
+# Run everything with verbose output, filtered to result lines
+(cd aggregator-server && go test ./... -v 2>&1 | grep -E "(PASS|FAIL|BUG|---)")
+(cd aggregator-agent  && go test ./... -v 2>&1 | grep -E "(PASS|FAIL|BUG|---)")
+```
+
+---
+
+## Test Inventory
+
+### Behaviour Categories
+
+**PASS-NOW / FAIL-AFTER-FIX** — Asserts the CURRENT (buggy) behaviour.
+The test passes because the bug exists. When the fix is applied, the behaviour
+changes and this test fails — signalling that the test itself needs to be
+updated to assert the new correct state.
+
+**FAIL-NOW / PASS-AFTER-FIX** — Asserts the CORRECT post-fix behaviour.
+The test fails because the bug exists. When the fix is applied, the assertion
+becomes true and the test passes — proving the fix works.
+
+---
+
+### File 1: `aggregator-server/internal/services/signing_replay_test.go`
+
+#### `TestRetryCommandIsUnsigned`
+- **Bug**: F-5 — RetryCommand creates unsigned commands
+- **What it asserts**: `retried.Signature == ""`, `retried.SignedAt == nil`, `retried.KeyID == ""`
+- **Category**: PASS-NOW / FAIL-AFTER-FIX
+- **Why it currently passes**: `queries.RetryCommand` (commands.go:189) builds a
+  new `AgentCommand` struct without calling `signAndCreateCommand`. All three
+  signature fields are zero values.
+- **What changes after fix**: `RetryCommand` will call `signAndCreateCommand`, so
+  `retried.Signature` will be non-empty — the assertions flip to failures.
+- **Operator impact**: Until fixed, every "Retry" click in the dashboard creates an
+  unsigned command. In strict enforcement mode the agent rejects it and records the
+  rejection only in its own log:
+  `"command verification failed: strict enforcement requires signed commands"`. The
+  server still returns HTTP 200, so the operator sees no error.
+
+#### `TestRetryCommandMustBeSigned`
+- **Bug**: F-5 — RetryCommand creates unsigned commands
+- **What it asserts**: `retried.Signature != ""`, `retried.SignedAt != nil`, `retried.KeyID != ""`
+- **Category**: FAIL-NOW / PASS-AFTER-FIX
+- **Why it currently fails**: The retry command is unsigned (bug F-5 exists).
+- **What changes after fix**: All three fields will be populated; test passes.
+
+#### `TestSignedCommandNotBoundToAgent`
+- **Bug**: F-1 — `agent_id` absent from signed payload
+- **What it asserts**: `agentA.String()` is NOT in the signed message, and
+  `ed25519.Verify` returns `true` for the command regardless of which agent receives it.
+- **Category**: PASS-NOW / FAIL-AFTER-FIX
+- **Why it currently passes**: Signed message is `{id}:{type}:{sha256(params)}:{ts}`.
+  No `agent_id` component. `ed25519.Verify` ignores anything outside the signed message.
+- **What changes after fix**: When `agent_id` is added to the signed message, the message
+  reconstructed in the test (without `agent_id`) will not match the signature — `ed25519.Verify`
+  returns `false` and the test fails.
+- **Attack scenario**: An attacker with DB write access can copy a signed command from
+  agent A into agent B's `agent_commands` queue. The signature passes verification on agent B.
+
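The F-1 attack scenario can be reproduced in isolation with the standard library. The following is a minimal sketch, not the project's signing code: the `command` struct, the helper names, and the Unix-seconds timestamp encoding are assumptions made for illustration. Only the two message layouts come from the text above and the commit message.

```go
package main

import (
	"crypto/ed25519"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// command is a hypothetical, simplified stand-in for the real command struct.
type command struct {
	AgentID  string
	ID       string
	Type     string
	Params   string
	SignedAt int64 // Unix seconds; the real timestamp encoding is an assumption
}

// paramsHash returns the hex-encoded SHA-256 of the raw parameters.
func paramsHash(params string) string {
	sum := sha256.Sum256([]byte(params))
	return hex.EncodeToString(sum[:])
}

// unboundMessage follows the pre-fix layout: {id}:{type}:{sha256(params)}:{ts}.
func unboundMessage(c command) []byte {
	return []byte(fmt.Sprintf("%s:%s:%s:%d", c.ID, c.Type, paramsHash(c.Params), c.SignedAt))
}

// boundMessage follows the v3 layout: {agent_id}:{cmd_id}:{type}:{hash}:{ts}.
func boundMessage(c command) []byte {
	return []byte(fmt.Sprintf("%s:%s:%s:%s:%d", c.AgentID, c.ID, c.Type, paramsHash(c.Params), c.SignedAt))
}

// replayOutcome signs one command for agent A in both layouts, then verifies
// a copy delivered to agent B. It reports whether each replay verifies.
func replayOutcome() (unboundVerifies, boundVerifies bool) {
	pub, priv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand
	if err != nil {
		panic(err)
	}
	forA := command{AgentID: "agent-a", ID: "cmd-1", Type: "scan", Params: `{"x":1}`, SignedAt: 1700000000}
	oldSig := ed25519.Sign(priv, unboundMessage(forA))
	v3Sig := ed25519.Sign(priv, boundMessage(forA))

	// The attacker copies the row into agent B's queue. Agent B rebuilds the
	// message with its own ID, which only matters in the bound layout.
	onB := forA
	onB.AgentID = "agent-b"

	return ed25519.Verify(pub, unboundMessage(onB), oldSig),
		ed25519.Verify(pub, boundMessage(onB), v3Sig)
}

func main() {
	unbound, bound := replayOutcome()
	fmt.Println("pre-fix layout verifies on agent B:", unbound)
	fmt.Println("v3 layout verifies on agent B:", bound)
}
```

The pre-fix signature verifies on agent B because nothing in the signed bytes changes when the row is copied. In the bound layout the receiving agent's own ID is part of the reconstructed message, so the same copy fails.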
+#### `TestOldFormatCommandHasNoExpiry`
+- **Bug**: F-3 — Old-format commands (no `signed_at`) valid forever
+- **What it asserts**: `ed25519.Verify` returns `true` for an old-format signature
+  (no timestamp in the message) regardless of when verification occurs.
+- **Category**: PASS-NOW / FAIL-AFTER-FIX
+- **Why it currently passes**: `ed25519.Verify` is a pure cryptographic check — it has no
+  time component. The old format `{id}:{type}:{sha256(params)}` contains no timestamp, so
+  there is nothing to expire.
+- **What changes after fix**: Either `VerifyCommand` is updated to reject old-format
+  commands outright (requiring `signed_at`), or a `created_at` check is added — the test
+  would then need to be updated to expect rejection.
+
+---
+
+### File 2: `aggregator-agent/internal/crypto/replay_test.go`
+
+Uses helpers `generateKeyPair`, `signCommand`, `signCommandOld` from `verification_test.go`
+(same package).
+
+#### `TestOldFormatReplayIsUnbounded`
+- **Bug**: F-3 — `VerifyCommand` has no time check
+- **What it asserts**: `v.VerifyCommand(cmd, pub)` returns `nil` for a command with
+  `SignedAt == nil` (old format), regardless of age.
+- **Category**: PASS-NOW / FAIL-AFTER-FIX
+- **Why it currently passes**: `VerifyCommand` (verification.go:25) performs only an
+  Ed25519 signature check. No `created_at` or `SignedAt` field is examined.
+- **What changes after fix**: After adding an expiry check, `VerifyCommand` will return
+  an error for old-format commands beyond a defined age, and this test will fail.
+
+#### `TestNewFormatCommandCanBeReplayedWithin24Hours`
+- **Bug**: F-4 — 24-hour replay window (large but intentional)
+- **What it asserts**: `VerifyCommandWithTimestamp` returns `nil` for a command signed
+  23h59m ago (within the 24h `commandMaxAge`).
+- **Category**: PASS-NOW / WILL-REMAIN-PASSING until `commandMaxAge` is reduced
+- **Why it currently passes**: By design — the 24h window is intentional to accommodate
+  polling intervals and network delays.
+- **What changes after fix**: If `commandMaxAge` is reduced (e.g. to 4h per the A-2 audit
+  recommendation), this test will FAIL because its 23h59m-old command exceeds the new
+  limit. Update the `time.Duration` in the test to sit just inside the new window when
+  `commandMaxAge` is changed.
+
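The two time-based rules behind F-3 and F-4 can be sketched together. This is an illustration under stated assumptions, not the agent's `VerifyCommand` code: the `command` struct, `checkAge`, `ageAccepted`, and the `oldFormatMaxAge` constant name are hypothetical. The 4h and 48h limits are the values given in the commit message.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const (
	commandMaxAge   = 4 * time.Hour  // post-fix window for timestamped commands (F-4)
	oldFormatMaxAge = 48 * time.Hour // hypothetical name; grace period for old-format commands (F-3)
)

// command is a hypothetical, simplified stand-in for the agent's command type.
type command struct {
	CreatedAt time.Time
	SignedAt  *time.Time // nil marks the old signature format with no timestamp
}

var errExpired = errors.New("command too old")

// checkAge applies the age rule that matches the command's format.
func checkAge(c command, now time.Time) error {
	if c.SignedAt == nil {
		// Old format: no signed timestamp exists, so fall back to CreatedAt.
		if now.Sub(c.CreatedAt) > oldFormatMaxAge {
			return fmt.Errorf("old-format command: %w", errExpired)
		}
		return nil
	}
	if now.Sub(*c.SignedAt) > commandMaxAge {
		return fmt.Errorf("signed command: %w", errExpired)
	}
	return nil
}

// ageAccepted reports whether a command of the given age passes checkAge.
// signed selects the timestamped format.
func ageAccepted(ageMinutes int, signed bool) bool {
	now := time.Date(2026, 3, 28, 12, 0, 0, 0, time.UTC)
	at := now.Add(-time.Duration(ageMinutes) * time.Minute)
	c := command{CreatedAt: at}
	if signed {
		c.SignedAt = &at
	}
	return checkAge(c, now) == nil
}

func main() {
	fmt.Println("3h59m signed:", ageAccepted(3*60+59, true))
	fmt.Println("23h59m signed:", ageAccepted(23*60+59, true))
	fmt.Println("47h old-format:", ageAccepted(47*60, false))
	fmt.Println("49h old-format:", ageAccepted(49*60, false))
}
```

The 23h59m case is the one the pre-fix test relies on. It is accepted under a 24h window and rejected under a 4h window, which is why that test needs a new duration once the limit changes.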
+#### `TestSameCommandCanBeVerifiedTwice`
+- **Bug**: F-2 — No nonce; same command verifies any number of times
+- **What it asserts**: `VerifyCommandWithTimestamp` returns `nil` on the second and
+  third call with identical inputs.
+- **Category**: PASS-NOW / FAIL-AFTER-FIX
+- **Why it currently passes**: `VerifyCommandWithTimestamp` is a stateless pure function.
+  No nonce, no executed-command set, no single-use guarantee.
+- **What changes after fix**: After agent-side deduplication (executed-command ID set) is
+  added, the second call for a previously-seen command UUID will return an error.
+
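Agent-side deduplication for F-2 can be sketched as a small set of seen command IDs with periodic cleanup. The type and method names below are hypothetical; the source states only that an `executedIDs` map with cleanup exists, and the 4h retention used in the demo is an assumption chosen to match the post-fix `commandMaxAge`.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// executedSet remembers command IDs that have already run.
// Hypothetical type for illustration only.
type executedSet struct {
	mu  sync.Mutex
	ids map[string]time.Time // command ID -> time first seen
	ttl time.Duration
}

func newExecutedSet(ttl time.Duration) *executedSet {
	return &executedSet{ids: make(map[string]time.Time), ttl: ttl}
}

// markIfNew records id and returns true the first time it is seen.
// It returns false for any repeat, which the caller treats as a replay.
func (s *executedSet) markIfNew(id string, now time.Time) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, seen := s.ids[id]; seen {
		return false
	}
	s.ids[id] = now
	return true
}

// cleanup drops entries older than ttl and returns how many were removed.
func (s *executedSet) cleanup(now time.Time) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	removed := 0
	for id, seenAt := range s.ids {
		if now.Sub(seenAt) > s.ttl {
			delete(s.ids, id)
			removed++
		}
	}
	return removed
}

// demo runs a first delivery, a replay one minute later, and a cleanup pass
// five hours after the first delivery.
func demo() (first, replay bool, removed int) {
	start := time.Date(2026, 3, 28, 12, 0, 0, 0, time.UTC)
	set := newExecutedSet(4 * time.Hour)
	first = set.markIfNew("cmd-1", start)
	replay = set.markIfNew("cmd-1", start.Add(time.Minute))
	removed = set.cleanup(start.Add(5 * time.Hour))
	return first, replay, removed
}

func main() {
	fmt.Println(demo())
}
```

The retention period should be at least as long as the signature age limit. If an ID is forgotten while its signature still passes the age check, the same command could be replayed in the gap.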
+#### `TestCrossAgentSignatureVerifies`
+- **Bug**: F-1 — Signed message has no agent binding
+- **What it asserts**: The signed message components are
+  `[cmd_id, cmd_type, sha256(params), timestamp]`, with no `agent_id`.
+  `VerifyCommandWithTimestamp` passes for a copy of the command representing delivery
+  to a different agent.
+- **Category**: PASS-NOW / FAIL-AFTER-FIX
+- **Why it currently passes**: `client.Command` has no `agent_id` field, and
+  `reconstructMessageWithTimestamp` does not include one.
+- **What changes after fix**: After `agent_id` is added to the signed message (and
+  correspondingly to `client.Command`), the reconstructed message in the verifier will
+  include `agent_id`, and a command signed for agent A will fail verification on agent B.
+
+---
+
+### File 3: `aggregator-server/internal/database/queries/commands_ttl_test.go`
+
+These tests operate on a copied query string constant. When the fix adds a TTL clause to
+`GetPendingCommands`, update `getPendingCommandsQuery` in this file to match.
+
+#### `TestGetPendingCommandsHasNoTTLFilter`
+- **Bug**: F-6 + F-7 — `GetPendingCommands` has no TTL filter; no `expires_at` column
+- **What it asserts**: The query string does NOT contain `"INTERVAL"` or `"expires_at"`.
+- **Category**: PASS-NOW / FAIL-AFTER-FIX
+- **Why it currently passes**: The production query (commands.go:52) is:
+  ```sql
+  SELECT * FROM agent_commands
+  WHERE agent_id = $1 AND status = 'pending'
+  ORDER BY created_at ASC
+  LIMIT 100
+  ```
+  Neither `INTERVAL` nor `expires_at` appears.
+- **What changes after fix**: Update `getPendingCommandsQuery` to the new query containing
+  the TTL clause. The absence-assertions will then fail (indicator found) — update them.
+
+#### `TestGetPendingCommandsMustHaveTTLFilter`
+- **Bug**: F-6 + F-7 — same
+- **What it asserts**: The query DOES contain a TTL indicator (`"INTERVAL"` or `"expires_at"`).
+- **Category**: FAIL-NOW / PASS-AFTER-FIX
+- **Why it currently fails**: No TTL clause exists in the current query.
+- **What changes after fix**: Update `getPendingCommandsQuery`; the indicator will be found
+  and the test passes.
+
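The string-level check used by the two TTL tests can be sketched as follows. `preFixQuery` is the production query quoted above. `postFixQuery` is an assumed shape for the fixed query, since this document does not show the real clause, and `hasTTLFilter` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strings"
)

// preFixQuery is the production query quoted in the test description.
const preFixQuery = `
SELECT * FROM agent_commands
WHERE agent_id = $1 AND status = 'pending'
ORDER BY created_at ASC
LIMIT 100`

// postFixQuery is an ASSUMED shape for the fixed query; the real TTL clause
// may differ.
const postFixQuery = `
SELECT * FROM agent_commands
WHERE agent_id = $1 AND status = 'pending'
  AND (expires_at IS NULL OR expires_at > NOW())
ORDER BY created_at ASC
LIMIT 100`

// hasTTLFilter reports whether the query text contains either TTL indicator.
func hasTTLFilter(query string) bool {
	return strings.Contains(query, "INTERVAL") || strings.Contains(query, "expires_at")
}

func main() {
	fmt.Println("pre-fix has TTL filter:", hasTTLFilter(preFixQuery))
	fmt.Println("post-fix has TTL filter:", hasTTLFilter(postFixQuery))
}
```

A check like this confirms only that the indicator text is present in the copied constant. It does not prove that the database filters expired rows, and it can drift from the production query if the copy is not kept in sync.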
+#### `TestRetryCommandQueryDoesNotCopySignature`
+- **Bug**: F-5 (query-layer confirmation)
+- **What it asserts**: Documentary — logs that `RetryCommand` omits `signature`, `key_id`,
+  `signed_at` from the new command struct.
+- **Category**: Always passes (documentation test). Update the logged field lists when
+  the fix is applied.
+
+---
+
+### File 4: `aggregator-server/internal/api/handlers/retry_signing_test.go`
+
+#### `TestRetryCommandEndpointProducesUnsignedCommand`
+- **Bug**: F-5 — Handler returns 200 but creates an unsigned command
+- **What it asserts**: `retried.Signature == ""`, `retried.SignedAt == nil`, `retried.KeyID == ""`
+  using `simulateRetryCommand` which replicates the exact struct construction in
+  `queries.RetryCommand`.
+- **Category**: PASS-NOW / FAIL-AFTER-FIX
+- **Why it currently passes**: `simulateRetryCommand` exactly mirrors the current production
+  code (commands.go:202) — no signing call.
+- **What changes after fix**: `simulateRetryCommand` must be updated to include the signing
+  call, or the test must be rewritten against the fixed implementation.
+
+#### `TestRetryCommandEndpointMustProduceSignedCommand`
+- **Bug**: F-5
+- **What it asserts**: `retried.Signature != ""`, `retried.SignedAt != nil`, `retried.KeyID != ""`
+- **Category**: FAIL-NOW / PASS-AFTER-FIX
+- **Why it currently fails**: `simulateRetryCommand` produces an unsigned command (bug exists).
+- **What changes after fix**: The production code will produce a signed command; update
+  `simulateRetryCommand` to call the signing service and the assertions will pass.
+
+#### `TestRetryCommandHTTPHandlerProducesUnsignedCommand_Integration`
+- **Bug**: F-5
+- **Status**: Skipped — requires live DB or interface extraction (see TODO in file).
+- **How to enable**: Extract `CommandQueriesInterface` from `CommandQueries` and update
+  handlers to accept the interface, then replace `simulateRetryCommand` with a real
+  handler invocation via `httptest`.
+
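The interface extraction described for the skipped integration test might look like the sketch below. It uses plain `net/http` to stay self-contained; the project's actual router, route path, method signature, and the trimmed `AgentCommand` fields are all assumptions. Only the `CommandQueriesInterface` name and the use of `httptest` come from the text above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// AgentCommand is a trimmed, hypothetical version of the real model.
type AgentCommand struct {
	ID        string `json:"id"`
	Signature string `json:"signature"`
	KeyID     string `json:"key_id"`
}

// CommandQueriesInterface is the extracted interface named above. The single
// method and its signature are assumptions for this sketch.
type CommandQueriesInterface interface {
	RetryCommand(originalID string) (*AgentCommand, error)
}

// retryHandler depends on the interface, so a test can inject a fake.
func retryHandler(q CommandQueriesInterface) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		cmd, err := q.RetryCommand(r.URL.Query().Get("id"))
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(cmd) // encode error ignored in this sketch
	}
}

// fakeQueries returns a canned command without touching a database.
type fakeQueries struct{ result AgentCommand }

func (f fakeQueries) RetryCommand(string) (*AgentCommand, error) {
	out := f.result
	return &out, nil
}

// retryViaHTTP drives the handler through httptest and decodes the response.
func retryViaHTTP(q CommandQueriesInterface) (int, AgentCommand) {
	req := httptest.NewRequest(http.MethodPost, "/retry?id=cmd-1", nil)
	rec := httptest.NewRecorder()
	retryHandler(q).ServeHTTP(rec, req)

	var got AgentCommand
	if err := json.NewDecoder(rec.Body).Decode(&got); err != nil {
		panic(err)
	}
	return rec.Code, got
}

// demo returns the HTTP status and whether the returned command is signed.
func demo() (int, bool) {
	fake := fakeQueries{result: AgentCommand{ID: "cmd-2", Signature: "sig", KeyID: "key-1"}}
	code, got := retryViaHTTP(fake)
	return code, got.Signature != "" && got.KeyID != ""
}

func main() {
	fmt.Println(demo())
}
```

With the handler written against the interface, the integration test can assert on the HTTP response itself instead of on `simulateRetryCommand`, and a fake that returns an unsigned command would reproduce the original F-5 symptom of a 200 response carrying an empty signature.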
+---
+
+## State-Change Summary
+
+| Test | Current State | After A-2 Fix |
+|------|--------------|---------------|
+| TestRetryCommandIsUnsigned | PASS | FAIL (flip expected) |
+| TestRetryCommandMustBeSigned | **FAIL** | PASS |
+| TestSignedCommandNotBoundToAgent | PASS | FAIL (flip expected) |
+| TestOldFormatCommandHasNoExpiry | PASS | FAIL (flip expected) |
+| TestOldFormatReplayIsUnbounded | PASS | FAIL (flip expected) |
+| TestNewFormatCommandCanBeReplayedWithin24Hours | PASS | PASS (FAIL if `commandMaxAge` is reduced) |
+| TestSameCommandCanBeVerifiedTwice | PASS | FAIL (flip expected) |
+| TestCrossAgentSignatureVerifies | PASS | FAIL (flip expected) |
+| TestGetPendingCommandsHasNoTTLFilter | PASS | FAIL (flip expected) |
+| TestGetPendingCommandsMustHaveTTLFilter | **FAIL** | PASS |
+| TestRetryCommandQueryDoesNotCopySignature | PASS | PASS (documentary; update manually) |
+| TestRetryCommandEndpointProducesUnsignedCommand | PASS | FAIL (flip expected) |
+| TestRetryCommandEndpointMustProduceSignedCommand | **FAIL** | PASS |
+
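+Rows whose current state is **FAIL** are the "tests written to fail with current code"
+that satisfy the TDD requirement directly. All other tests currently PASS, documenting
+the bug-as-behavior, and flip to FAIL when the fix changes the behavior they assert. The
+exceptions are `TestNewFormatCommandCanBeReplayedWithin24Hours`, which flips only if
+`commandMaxAge` is reduced, and the documentary `TestRetryCommandQueryDoesNotCopySignature`,
+which always passes.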
+
+---
+
+## Maintenance Notes
+
+1. **When applying the fix for F-5**: Update `simulateRetryCommand` in
+   `retry_signing_test.go` to reflect the new signed-command production. Update the
+   assertions in `TestRetryCommandIsUnsigned` and `TestRetryCommandEndpointProducesUnsignedCommand`
+   to assert the correct post-fix state.
+
+2. **When applying the fix for F-6/F-7**: Update `getPendingCommandsQuery` in
+   `commands_ttl_test.go` to the new query text. Invert the assertions in
+   `TestGetPendingCommandsHasNoTTLFilter` to assert presence (not absence) of TTL.
+
+3. **When applying the fix for F-3**: Update `TestOldFormatCommandHasNoExpiry` and
+   `TestOldFormatReplayIsUnbounded` to assert that old-format commands ARE rejected,
+   or that the backward-compat path has a defined expiry.
+
+4. **When applying the fix for F-1**: Update `TestSignedCommandNotBoundToAgent` and
+   `TestCrossAgentSignatureVerifies` to pass an `agent_id` into the signed message and
+   assert that a cross-agent replay fails verification.
+
+5. **When applying the fix for F-2**: Update `TestSameCommandCanBeVerifiedTwice` to
+   assert that the second call returns an error (deduplication firing).
+
+---
+
+## Post-Fix Status (2026-03-28)
+
+All fixes have been applied. Test status:
+
+| Test | Pre-Fix | Post-Fix | Status |
+|------|---------|----------|--------|
+| TestRetryCommandIsUnsigned | PASS | UPDATED — now asserts signed | VERIFIED PASSING |
+| TestRetryCommandMustBeSigned | FAIL | UPDATED — now passes | VERIFIED PASSING |
+| TestSignedCommandNotBoundToAgent | PASS | UPDATED — asserts agent_id binding | VERIFIED PASSING |
+| TestOldFormatCommandHasNoExpiry | PASS | UPDATED — documents crypto vs app-layer | VERIFIED PASSING |
+| TestOldFormatReplayIsUnbounded | PASS | UPDATED — asserts 48h rejection | VERIFIED PASSING |
+| TestOldFormatRecentCommandStillPasses | N/A | NEW — backward compat for recent old-format | VERIFIED PASSING |
+| TestNewFormatCommandCanBeReplayedWithin24Hours | PASS | UPDATED — uses 4h window (3h59m) | VERIFIED PASSING |
+| TestCommandBeyond4HoursIsRejected | N/A | NEW — asserts 4h rejection | VERIFIED PASSING |
+| TestSameCommandCanBeVerifiedTwice | PASS | UPDATED — documents verifier purity, dedup at ProcessCommand | VERIFIED PASSING |
+| TestCrossAgentSignatureVerifies | PASS | UPDATED — asserts cross-agent failure | VERIFIED PASSING |
+| TestGetPendingCommandsHasNoTTLFilter | PASS | UPDATED — asserts TTL presence | VERIFIED PASSING |
+| TestGetPendingCommandsMustHaveTTLFilter | FAIL | UPDATED — now passes | VERIFIED PASSING |
+| TestRetryCommandQueryDoesNotCopySignature | PASS | Unchanged (documentary) | VERIFIED PASSING |
+| TestRetryCommandEndpointProducesUnsignedCommand | PASS | UPDATED — asserts signed | VERIFIED PASSING |
+| TestRetryCommandEndpointMustProduceSignedCommand | FAIL | UPDATED — now passes | VERIFIED PASSING |