feat(security): A-1 Ed25519 key rotation + A-2 replay attack fixes

Complete RedFlag codebase with two major security audit implementations.

== A-1: Ed25519 Key Rotation Support ==

Server:
- SignCommand sets SignedAt timestamp and KeyID on every signature
- signing_keys database table (migration 020) for multi-key rotation
- InitializePrimaryKey registers active key at startup
- /api/v1/public-keys endpoint for rotation-aware agents
- SigningKeyQueries for key lifecycle management

Agent:
- Key-ID-aware verification via CheckKeyRotation
- FetchAndCacheAllActiveKeys for rotation pre-caching
- Cache metadata with TTL and staleness fallback
- SecurityLogger events for key rotation and command signing

== A-2: Replay Attack Fixes (F-1 through F-7) ==

F-5 CRITICAL - RetryCommand now signs via signAndCreateCommand
F-1 HIGH     - v3 format: "{agent_id}:{cmd_id}:{type}:{hash}:{ts}"
F-7 HIGH     - Migration 026: expires_at column with partial index
F-6 HIGH     - GetPendingCommands/GetStuckCommands filter by expires_at
F-2 HIGH     - Agent-side executedIDs dedup map with cleanup
F-4 HIGH     - commandMaxAge reduced from 24h to 4h
F-3 CRITICAL - Old-format commands rejected after 48h via CreatedAt

Verification fixes: migration idempotency (ETHOS #4), log format
compliance (ETHOS #1), stale comments updated.

All 24 tests passing. Docker --no-cache build verified.
See docs/ for full audit reports and deviation log (DEV-001 to DEV-019).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 21:25:47 -04:00
commit f97d4845af
340 changed files with 75403 additions and 0 deletions

docs/A2_PreFix_Tests.md Normal file

@@ -0,0 +1,291 @@
# A-2 Pre-Fix Test Suite
**Date**: 2026-03-28

**Branch**: unstabledeveloper

**Purpose**: Document replay attack bugs BEFORE fixes are applied.
These tests prove that the bugs exist today and will prove the fixes work
when applied. Do NOT modify these tests before the fix is ready; they are
the regression baseline.

---
## Test Files Created
| File | Package | Bugs Documented |
|------|---------|-----------------|
| `aggregator-server/internal/services/signing_replay_test.go` | `services_test` | F-5, F-1, F-3 |
| `aggregator-agent/internal/crypto/replay_test.go` | `crypto` | F-3, F-4, F-2, F-1 |
| `aggregator-server/internal/database/queries/commands_ttl_test.go` | `queries_test` | F-6, F-7, F-5 |
| `aggregator-server/internal/api/handlers/retry_signing_test.go` | `handlers_test` | F-5 |
---
## How to Run
```bash
# Server-side tests (all pre-fix tests)
cd aggregator-server && go test ./internal/services/... -v -run "TestRetry|TestSigned|TestOld"
cd aggregator-server && go test ./internal/database/queries/... -v -run TestGetPending
cd aggregator-server && go test ./internal/api/handlers/... -v -run TestRetryCommand
# Agent-side tests (all pre-fix tests)
cd aggregator-agent && go test ./internal/crypto/... -v -run "TestOld|TestNew|TestSame|TestCross"
# Run everything with verbose output
cd aggregator-server && go test ./... -v 2>&1 | grep -E "(PASS|FAIL|BUG|---)"
cd aggregator-agent && go test ./... -v 2>&1 | grep -E "(PASS|FAIL|BUG|---)"
```
---
## Test Inventory
### Behaviour Categories
**PASS-NOW / FAIL-AFTER-FIX**: Asserts the CURRENT (buggy) behaviour.
The test passes because the bug exists. When the fix is applied, the behaviour
changes and this test fails, signalling that the test itself needs to be
updated to assert the new correct state.

**FAIL-NOW / PASS-AFTER-FIX**: Asserts the CORRECT post-fix behaviour.
The test fails because the bug exists. When the fix is applied, the assertion
becomes true and the test passes, proving the fix works.

---
### File 1: `aggregator-server/internal/services/signing_replay_test.go`
#### `TestRetryCommandIsUnsigned`
- **Bug**: F-5 — RetryCommand creates unsigned commands
- **What it asserts**: `retried.Signature == ""`, `retried.SignedAt == nil`, `retried.KeyID == ""`
- **Category**: PASS-NOW / FAIL-AFTER-FIX
- **Why it currently passes**: `queries.RetryCommand` (commands.go:189) builds a
new `AgentCommand` struct without calling `signAndCreateCommand`. All three
signature fields are zero values.
- **What changes after fix**: `RetryCommand` will call `signAndCreateCommand`, so
`retried.Signature` will be non-empty — the assertions flip to failures.
- **Operator impact**: Until fixed, every "Retry" click in the dashboard creates an
unsigned command. In strict enforcement mode the agent rejects it silently, logging
`"command verification failed: strict enforcement requires signed commands"`. The
server returns HTTP 200 so the operator sees no error.
#### `TestRetryCommandMustBeSigned`
- **Bug**: F-5 — RetryCommand creates unsigned commands
- **What it asserts**: `retried.Signature != ""`, `retried.SignedAt != nil`, `retried.KeyID != ""`
- **Category**: FAIL-NOW / PASS-AFTER-FIX
- **Why it currently fails**: The retry command is unsigned (bug F-5 exists).
- **What changes after fix**: All three fields will be populated; test passes.
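
The two F-5 tests above check the same three fields from opposite directions. A minimal sketch of the pattern follows, under stated assumptions: `AgentCommand` here is a reduced stand-in with only the fields the tests inspect, and `retryUnsigned`, `retrySigned`, `demoKey`, and the key ID `key-1` are hypothetical names for illustration, not the project's `RetryCommand` or `signAndCreateCommand`. The signed message is simplified to the command ID.

```go
package main

import (
	"crypto/ed25519"
	"encoding/hex"
	"fmt"
	"time"
)

// AgentCommand is a reduced stand-in carrying only the fields the tests inspect.
type AgentCommand struct {
	ID        string
	Signature string
	SignedAt  *time.Time
	KeyID     string
}

// retryUnsigned mirrors the buggy pattern: a fresh struct is built and the
// three signature fields are left at their zero values.
func retryUnsigned(newID string) AgentCommand {
	return AgentCommand{ID: newID}
}

// retrySigned shows the intended pattern: the new command is signed before
// it is stored, so all three signature fields are populated.
func retrySigned(newID string, priv ed25519.PrivateKey, keyID string) AgentCommand {
	now := time.Now().UTC()
	sig := ed25519.Sign(priv, []byte(newID)) // simplified message, illustration only
	return AgentCommand{
		ID:        newID,
		Signature: hex.EncodeToString(sig),
		SignedAt:  &now,
		KeyID:     keyID,
	}
}

// demoKey generates a throwaway private key for the example.
func demoKey() ed25519.PrivateKey {
	_, priv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand.Reader
	if err != nil {
		panic(err)
	}
	return priv
}

func main() {
	unsigned := retryUnsigned("cmd-2")
	signed := retrySigned("cmd-2", demoKey(), "key-1")
	fmt.Println("unsigned retry has signature:", unsigned.Signature != "")
	fmt.Println("signed retry has signature:  ", signed.Signature != "")
}
```

The PASS-NOW test corresponds to asserting on the output of `retryUnsigned`; the FAIL-NOW test corresponds to asserting on the output of `retrySigned`.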
#### `TestSignedCommandNotBoundToAgent`
- **Bug**: F-1 — `agent_id` absent from signed payload
- **What it asserts**: `agentA.String()` is NOT in the signed message, and
`ed25519.Verify` returns `true` for the command regardless of which agent receives it.
- **Category**: PASS-NOW / FAIL-AFTER-FIX
- **Why it currently passes**: Signed message is `{id}:{type}:{sha256(params)}:{ts}`.
No `agent_id` component. `ed25519.Verify` ignores anything outside the signed message.
- **What changes after fix**: When `agent_id` is added to the signed message, the message
reconstructed in the test (without `agent_id`) will not match the signature — `ed25519.Verify`
returns `false` and the test fails.
- **Attack scenario**: An attacker with DB write access can copy a signed command from
agent A into agent B's `agent_commands` queue. The signature passes verification on agent B.
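
The attack scenario above can be reproduced with the standard library alone. The sketch below is illustrative and makes several assumptions: the params hash is hex-encoded, the timestamp is Unix seconds, and the helper names (`unboundMessage`, `boundMessage`, `replayVerifies`) and the sample IDs are hypothetical. The bound layout follows the v3 format named in the commit message, `{agent_id}:{cmd_id}:{type}:{hash}:{ts}`.

```go
package main

import (
	"crypto/ed25519"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashParams returns the hex-encoded SHA-256 of the raw params.
// Hex encoding is an assumption of this sketch.
func hashParams(params []byte) string {
	sum := sha256.Sum256(params)
	return hex.EncodeToString(sum[:])
}

// unboundMessage builds the pre-fix message {id}:{type}:{sha256(params)}:{ts}.
func unboundMessage(cmdID, cmdType string, params []byte, ts int64) []byte {
	return []byte(fmt.Sprintf("%s:%s:%s:%d", cmdID, cmdType, hashParams(params), ts))
}

// boundMessage builds the v3 message {agent_id}:{cmd_id}:{type}:{hash}:{ts}.
func boundMessage(agentID, cmdID, cmdType string, params []byte, ts int64) []byte {
	return []byte(fmt.Sprintf("%s:%s", agentID, unboundMessage(cmdID, cmdType, params, ts)))
}

// replayVerifies signs a command intended for agent-a and reports whether the
// signature still verifies when agent-b reconstructs the message for itself.
func replayVerifies(bindAgent bool) bool {
	pub, priv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand.Reader
	if err != nil {
		panic(err)
	}
	params := []byte(`{"package":"curl"}`)
	const ts = int64(1774747547)

	if !bindAgent {
		sig := ed25519.Sign(priv, unboundMessage("cmd-1", "install", params, ts))
		// Agent B rebuilds an identical message: nothing in it names an agent.
		return ed25519.Verify(pub, unboundMessage("cmd-1", "install", params, ts), sig)
	}
	sig := ed25519.Sign(priv, boundMessage("agent-a", "cmd-1", "install", params, ts))
	// Agent B rebuilds the message with its own ID, so the bytes differ.
	return ed25519.Verify(pub, boundMessage("agent-b", "cmd-1", "install", params, ts), sig)
}

func main() {
	fmt.Println("unbound replay verifies:", replayVerifies(false))
	fmt.Println("bound replay verifies:  ", replayVerifies(true))
}
```

The first call reports `true` (the bug) and the second `false` (the fix), because Ed25519 verification depends only on the exact bytes that were signed.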
#### `TestOldFormatCommandHasNoExpiry`
- **Bug**: F-3 — Old-format commands (no `signed_at`) valid forever
- **What it asserts**: `ed25519.Verify` returns `true` for an old-format signature
(no timestamp in the message) regardless of when verification occurs.
- **Category**: PASS-NOW / FAIL-AFTER-FIX
- **Why it currently passes**: `ed25519.Verify` is a pure cryptographic check — it has no
time component. The old format `{id}:{type}:{sha256(params)}` contains no timestamp, so
there is nothing to expire.
- **What changes after fix**: Either `VerifyCommand` is updated to reject old-format
commands outright (requiring `signed_at`), or a `created_at` check is added — the test
would then need to be updated to expect rejection.
---
### File 2: `aggregator-agent/internal/crypto/replay_test.go`
Uses helpers `generateKeyPair`, `signCommand`, `signCommandOld` from `verification_test.go`
(same package).
#### `TestOldFormatReplayIsUnbounded`
- **Bug**: F-3 — `VerifyCommand` has no time check
- **What it asserts**: `v.VerifyCommand(cmd, pub)` returns `nil` for a command with
`SignedAt == nil` (old format), regardless of age.
- **Category**: PASS-NOW / FAIL-AFTER-FIX
- **Why it currently passes**: `VerifyCommand` (verification.go:25) performs only an
Ed25519 signature check. No `created_at` or `SignedAt` field is examined.
- **What changes after fix**: After adding an expiry check, `VerifyCommand` will return
an error for old-format commands beyond a defined age, and this test will fail.
#### `TestNewFormatCommandCanBeReplayedWithin24Hours`
- **Bug**: F-4 — 24-hour replay window (large but intentional)
- **What it asserts**: `VerifyCommandWithTimestamp` returns `nil` for a command signed
23h59m ago (within the 24h `commandMaxAge`).
- **Category**: PASS-NOW / WILL-REMAIN-PASSING until `commandMaxAge` is reduced
- **Why it currently passes**: By design — the 24h window is intentional to accommodate
polling intervals and network delays.
- **What changes after fix**: If `commandMaxAge` is reduced (e.g. to 4h per the A-2 audit
recommendation), this test will FAIL, because the command it signs 23h59m ago is then older
than the new limit. Update the age used in the test so that it sits just inside the new
`commandMaxAge`.
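
The two time-based findings (F-3 and F-4) reduce to an age check that picks its window by command format. The following is a rough sketch only: the 4h and 48h values are the post-fix figures from the commit message, while the `command` type, `checkAge`, `accepted`, and `hours` are hypothetical names, and clock-skew handling for future timestamps is deliberately left out.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const (
	commandMaxAge   = 4 * time.Hour  // post-fix window for timestamped commands
	oldFormatMaxAge = 48 * time.Hour // grace period for commands without SignedAt
)

var errExpired = errors.New("command too old")

// command carries only the two timestamps the age check needs.
type command struct {
	SignedAt  *time.Time // nil for old-format commands
	CreatedAt time.Time
}

// checkAge applies the window that matches the command's format.
func checkAge(cmd command, now time.Time) error {
	if cmd.SignedAt != nil {
		if now.Sub(*cmd.SignedAt) > commandMaxAge {
			return errExpired
		}
		return nil
	}
	// Old format: no signed timestamp, so fall back to CreatedAt.
	if now.Sub(cmd.CreatedAt) > oldFormatMaxAge {
		return errExpired
	}
	return nil
}

// accepted builds a command of the given age and format and runs checkAge.
func accepted(newFormat bool, age time.Duration) bool {
	now := time.Now()
	then := now.Add(-age)
	cmd := command{CreatedAt: then}
	if newFormat {
		cmd.SignedAt = &then
	}
	return checkAge(cmd, now) == nil
}

// hours converts a whole number of hours to a time.Duration.
func hours(h int) time.Duration { return time.Duration(h) * time.Hour }

func main() {
	fmt.Println("new format, 3h59m: ", accepted(true, 3*time.Hour+59*time.Minute))
	fmt.Println("new format, 23h59m:", accepted(true, 23*time.Hour+59*time.Minute))
	fmt.Println("old format, 47h:   ", accepted(false, hours(47)))
	fmt.Println("old format, 49h:   ", accepted(false, hours(49)))
}
```

Passing `now` into `checkAge` keeps the check deterministic, which makes boundary cases such as 3h59m straightforward to test.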
#### `TestSameCommandCanBeVerifiedTwice`
- **Bug**: F-2 — No nonce; same command verifies any number of times
- **What it asserts**: `VerifyCommandWithTimestamp` returns `nil` on the second and
third call with identical inputs.
- **Category**: PASS-NOW / FAIL-AFTER-FIX
- **Why it currently passes**: `VerifyCommandWithTimestamp` is a stateless pure function.
No nonce, no executed-command set, no single-use guarantee.
- **What changes after fix**: After agent-side deduplication (executed-command ID set) is
added, the second call for a previously-seen command UUID will return an error.
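
One way to picture the agent-side deduplication is a mutex-guarded set of executed command IDs with periodic cleanup. This is a sketch under assumptions, not the project's implementation: `executedSet`, `markExecuted`, `cleanup`, and the 4h `dedupTTL` are hypothetical. The TTL is chosen to match `commandMaxAge` on the reasoning that an older command would already be rejected by the timestamp check, so its ID no longer needs to be remembered.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// dedupTTL is an assumed retention period for executed command IDs.
const dedupTTL = 4 * time.Hour

// executedSet remembers which command IDs have already been executed.
type executedSet struct {
	mu   sync.Mutex
	seen map[string]time.Time
	ttl  time.Duration
}

func newExecutedSet(ttl time.Duration) *executedSet {
	return &executedSet{seen: make(map[string]time.Time), ttl: ttl}
}

// markExecuted records id and reports true only the first time it is seen.
func (s *executedSet) markExecuted(id string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, dup := s.seen[id]; dup {
		return false
	}
	s.seen[id] = time.Now()
	return true
}

// cleanup drops entries whose age has reached ttl and returns how many
// were removed, keeping the map from growing without bound.
func (s *executedSet) cleanup() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	now := time.Now()
	removed := 0
	for id, at := range s.seen {
		if now.Sub(at) >= s.ttl {
			delete(s.seen, id)
			removed++
		}
	}
	return removed
}

func main() {
	set := newExecutedSet(dedupTTL)
	fmt.Println("first delivery accepted:", set.markExecuted("cmd-1"))
	fmt.Println("replay accepted:        ", set.markExecuted("cmd-1"))
	fmt.Println("entries cleaned up:     ", set.cleanup())
}
```

Keeping the set outside the verifier leaves signature verification a pure function and places the single-use guarantee in the code path that actually executes commands.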
#### `TestCrossAgentSignatureVerifies`
- **Bug**: F-1 — Signed message has no agent binding
- **What it asserts**: The signed message components are `[cmd_id, cmd_type, sha256(params),
timestamp]` — no `agent_id`. `VerifyCommandWithTimestamp` passes for a copy of the command
representing delivery to a different agent.
- **Category**: PASS-NOW / FAIL-AFTER-FIX
- **Why it currently passes**: `client.Command` has no `agent_id` field, and
`reconstructMessageWithTimestamp` does not include one.
- **What changes after fix**: After `agent_id` is added to the signed message (and
correspondingly to `client.Command`), the reconstructed message in the verifier will
include `agent_id`, and a command signed for agent A will fail verification on agent B.
---
### File 3: `aggregator-server/internal/database/queries/commands_ttl_test.go`
These tests assert against `getPendingCommandsQuery`, a string constant in this file that is
copied from the production query, not against the production code itself. When the fix adds a
TTL clause to `GetPendingCommands`, update the constant in this file to match.
#### `TestGetPendingCommandsHasNoTTLFilter`
- **Bug**: F-6 + F-7 — `GetPendingCommands` has no TTL filter; no `expires_at` column
- **What it asserts**: The query string does NOT contain `"INTERVAL"` or `"expires_at"`.
- **Category**: PASS-NOW / FAIL-AFTER-FIX
- **Why it currently passes**: The production query (commands.go:52) is:
```sql
SELECT * FROM agent_commands
WHERE agent_id = $1 AND status = 'pending'
ORDER BY created_at ASC
LIMIT 100
```
Neither `INTERVAL` nor `expires_at` appears.
- **What changes after fix**: Update `getPendingCommandsQuery` to the new query containing
the TTL clause. The absence-assertions will then fail (indicator found) — update them.
#### `TestGetPendingCommandsMustHaveTTLFilter`
- **Bug**: F-6 + F-7 — same
- **What it asserts**: The query DOES contain a TTL indicator (`"INTERVAL"` or `"expires_at"`).
- **Category**: FAIL-NOW / PASS-AFTER-FIX
- **Why it currently fails**: No TTL clause exists in the current query.
- **What changes after fix**: Update `getPendingCommandsQuery`; the indicator will be found
and the test passes.
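
The pair of TTL tests amounts to one string predicate evaluated against the copied query constant. A minimal sketch, assuming a hypothetical helper `hasTTLFilter`: `preFixQuery` reproduces the production query quoted above, while `postFixQuery` is only a guess at the shape of the fixed query, since the exact TTL clause is not given here.

```go
package main

import (
	"fmt"
	"strings"
)

// preFixQuery reproduces the query quoted in this document.
const preFixQuery = `
SELECT * FROM agent_commands
WHERE agent_id = $1 AND status = 'pending'
ORDER BY created_at ASC
LIMIT 100`

// postFixQuery is a hypothetical post-fix shape; the real clause may differ.
const postFixQuery = `
SELECT * FROM agent_commands
WHERE agent_id = $1 AND status = 'pending'
  AND (expires_at IS NULL OR expires_at > NOW())
ORDER BY created_at ASC
LIMIT 100`

// hasTTLFilter reports whether the query text contains either TTL indicator
// the tests look for.
func hasTTLFilter(query string) bool {
	return strings.Contains(query, "INTERVAL") || strings.Contains(query, "expires_at")
}

func main() {
	fmt.Println("pre-fix query has TTL filter: ", hasTTLFilter(preFixQuery))
	fmt.Println("post-fix query has TTL filter:", hasTTLFilter(postFixQuery))
}
```

A substring check of this kind only confirms that a TTL indicator appears in the text. It does not prove the clause is correct, so it complements rather than replaces a test against a live database.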
#### `TestRetryCommandQueryDoesNotCopySignature`
- **Bug**: F-5 (query-layer confirmation)
- **What it asserts**: Documentary — logs that `RetryCommand` omits `signature`, `key_id`,
`signed_at` from the new command struct.
- **Category**: Always passes (documentation test). Update the logged field lists when the
fix is applied.
---
### File 4: `aggregator-server/internal/api/handlers/retry_signing_test.go`
#### `TestRetryCommandEndpointProducesUnsignedCommand`
- **Bug**: F-5 — Handler returns 200 but creates an unsigned command
- **What it asserts**: `retried.Signature == ""`, `retried.SignedAt == nil`, `retried.KeyID == ""`
using `simulateRetryCommand` which replicates the exact struct construction in
`queries.RetryCommand`.
- **Category**: PASS-NOW / FAIL-AFTER-FIX
- **Why it currently passes**: `simulateRetryCommand` exactly mirrors the current production
code (commands.go:202) — no signing call.
- **What changes after fix**: `simulateRetryCommand` must be updated to include the signing
call, or the test must be rewritten against the fixed implementation.
#### `TestRetryCommandEndpointMustProduceSignedCommand`
- **Bug**: F-5
- **What it asserts**: `retried.Signature != ""`, `retried.SignedAt != nil`, `retried.KeyID != ""`
- **Category**: FAIL-NOW / PASS-AFTER-FIX
- **Why it currently fails**: `simulateRetryCommand` produces an unsigned command (bug exists).
- **What changes after fix**: The production code will produce a signed command; update
`simulateRetryCommand` to call the signing service and the assertions will pass.
#### `TestRetryCommandHTTPHandlerProducesUnsignedCommand_Integration`
- **Bug**: F-5
- **Status**: Skipped — requires live DB or interface extraction (see TODO in file).
- **How to enable**: Extract `CommandQueriesInterface` from `CommandQueries` and update
handlers to accept the interface, then replace `simulateRetryCommand` with a real
handler invocation via `httptest`.
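
The interface extraction described above could look roughly like the following. This sketch uses only `net/http` and `net/http/httptest`; the project's actual router, handler signature, route, and the method set of `CommandQueriesInterface` are not shown here, so the single-method interface, `retryHandler`, `fakeQueries`, `invokeRetry`, and the `/retry?id=` URL are all assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// AgentCommand is a reduced stand-in for the real command type.
type AgentCommand struct {
	ID        string
	Signature string
	KeyID     string
}

// CommandQueriesInterface is the seam that lets a test swap out the database.
// The single method and its signature are assumptions of this sketch.
type CommandQueriesInterface interface {
	RetryCommand(originalID string) (*AgentCommand, error)
}

// fakeQueries is an in-memory implementation used in place of a live DB.
type fakeQueries struct{ sign bool }

func (f fakeQueries) RetryCommand(originalID string) (*AgentCommand, error) {
	cmd := &AgentCommand{ID: originalID + "-retry"}
	if f.sign {
		cmd.Signature = "stub-signature"
		cmd.KeyID = "key-1"
	}
	return cmd, nil
}

// retryHandler depends on the interface rather than a concrete query type.
func retryHandler(q CommandQueriesInterface) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		cmd, err := q.RetryCommand(r.URL.Query().Get("id"))
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(cmd)
	}
}

// invokeRetry drives the handler through httptest and decodes the response.
func invokeRetry(q CommandQueriesInterface) (int, AgentCommand) {
	req := httptest.NewRequest(http.MethodPost, "/retry?id=cmd-1", nil)
	rec := httptest.NewRecorder()
	retryHandler(q).ServeHTTP(rec, req)
	var out AgentCommand
	_ = json.NewDecoder(rec.Body).Decode(&out)
	return rec.Code, out
}

func main() {
	code, cmd := invokeRetry(fakeQueries{sign: false})
	fmt.Println(code, "signature present:", cmd.Signature != "")
}
```

With the unsigned fake, the handler still answers 200, which is the silent failure F-5 describes: the status code alone tells the operator nothing about whether the retried command was signed.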
---
## State-Change Summary
| Test | Current State | After A-2 Fix |
|------|--------------|---------------|
| TestRetryCommandIsUnsigned | PASS | FAIL (flip expected) |
| TestRetryCommandMustBeSigned | **FAIL** | PASS |
| TestSignedCommandNotBoundToAgent | PASS | FAIL (flip expected) |
| TestOldFormatCommandHasNoExpiry | PASS | FAIL (flip expected) |
| TestOldFormatReplayIsUnbounded | PASS | FAIL (flip expected) |
| TestNewFormatCommandCanBeReplayedWithin24Hours | PASS | PASS (or FAIL if maxAge reduced) |
| TestSameCommandCanBeVerifiedTwice | PASS | FAIL (flip expected) |
| TestCrossAgentSignatureVerifies | PASS | FAIL (flip expected) |
| TestGetPendingCommandsHasNoTTLFilter | PASS | FAIL (flip expected) |
| TestGetPendingCommandsMustHaveTTLFilter | **FAIL** | PASS |
| TestRetryCommandQueryDoesNotCopySignature | PASS | documentary (update manually) |
| TestRetryCommandEndpointProducesUnsignedCommand | PASS | FAIL (flip expected) |
| TestRetryCommandEndpointMustProduceSignedCommand | **FAIL** | PASS |
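Rows whose current state is shown as **FAIL** in bold are the tests written to fail against the
current code; they satisfy the TDD requirement directly. The remaining tests currently PASS,
documenting the bug-as-behaviour, and flip to FAIL when the fix changes the behaviour they assert.
Two rows are exceptions: `TestRetryCommandQueryDoesNotCopySignature` is documentary and always
passes, and `TestNewFormatCommandCanBeReplayedWithin24Hours` fails only if `commandMaxAge` is reduced.

---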
## Maintenance Notes
1. **When applying the fix for F-5**: Update `simulateRetryCommand` in
`retry_signing_test.go` to reflect the new signed-command production. Update the
assertions in `TestRetryCommandIsUnsigned` and `TestRetryCommandEndpointProducesUnsignedCommand`
to assert the correct post-fix state.
2. **When applying the fix for F-6/F-7**: Update `getPendingCommandsQuery` in
`commands_ttl_test.go` to the new query text. Invert the assertions in
`TestGetPendingCommandsHasNoTTLFilter` to assert presence (not absence) of TTL.
3. **When applying the fix for F-3**: Update `TestOldFormatCommandHasNoExpiry` and
`TestOldFormatReplayIsUnbounded` to assert that old-format commands ARE rejected,
or that the backward-compat path has a defined expiry.
4. **When applying the fix for F-1**: Update `TestSignedCommandNotBoundToAgent` and
`TestCrossAgentSignatureVerifies` to pass an `agent_id` into the signed message and
assert that a cross-agent replay fails verification.
5. **When applying the fix for F-2**: Update `TestSameCommandCanBeVerifiedTwice` to
assert that the second call returns an error (deduplication firing).
---
## Post-Fix Status (2026-03-28)
All fixes have been applied. Test status:
| Test | Pre-Fix | Post-Fix | Status |
|------|---------|----------|--------|
| TestRetryCommandIsUnsigned | PASS | UPDATED — now asserts signed | VERIFIED PASSING |
| TestRetryCommandMustBeSigned | FAIL | UPDATED — now passes | VERIFIED PASSING |
| TestSignedCommandNotBoundToAgent | PASS | UPDATED — asserts agent_id binding | VERIFIED PASSING |
| TestOldFormatCommandHasNoExpiry | PASS | UPDATED — documents crypto vs app-layer | VERIFIED PASSING |
| TestOldFormatReplayIsUnbounded | PASS | UPDATED — asserts 48h rejection | VERIFIED PASSING |
| TestOldFormatRecentCommandStillPasses | N/A | NEW — backward compat for recent old-format | VERIFIED PASSING |
| TestNewFormatCommandCanBeReplayedWithin24Hours | PASS | UPDATED — uses 4h window (3h59m) | VERIFIED PASSING |
| TestCommandBeyond4HoursIsRejected | N/A | NEW — asserts 4h rejection | VERIFIED PASSING |
| TestSameCommandCanBeVerifiedTwice | PASS | UPDATED — documents verifier purity, dedup at ProcessCommand | VERIFIED PASSING |
| TestCrossAgentSignatureVerifies | PASS | UPDATED — asserts cross-agent failure | VERIFIED PASSING |
| TestGetPendingCommandsHasNoTTLFilter | PASS | UPDATED — asserts TTL presence | VERIFIED PASSING |
| TestGetPendingCommandsMustHaveTTLFilter | FAIL | UPDATED — now passes | VERIFIED PASSING |
| TestRetryCommandQueryDoesNotCopySignature | PASS | Unchanged (documentary) | VERIFIED PASSING |
| TestRetryCommandEndpointProducesUnsignedCommand | PASS | UPDATED — asserts signed | VERIFIED PASSING |
| TestRetryCommandEndpointMustProduceSignedCommand | FAIL | UPDATED — now passes | VERIFIED PASSING |