feat(security): A-1 Ed25519 key rotation + A-2 replay attack fixes

Complete RedFlag codebase with two major security audit implementations.

== A-1: Ed25519 Key Rotation Support ==

Server:
- SignCommand sets SignedAt timestamp and KeyID on every signature
- signing_keys database table (migration 020) for multi-key rotation
- InitializePrimaryKey registers active key at startup
- /api/v1/public-keys endpoint for rotation-aware agents
- SigningKeyQueries for key lifecycle management

Agent:
- Key-ID-aware verification via CheckKeyRotation
- FetchAndCacheAllActiveKeys for rotation pre-caching
- Cache metadata with TTL and staleness fallback
- SecurityLogger events for key rotation and command signing

== A-2: Replay Attack Fixes (F-1 through F-7) ==

F-5 CRITICAL - RetryCommand now signs via signAndCreateCommand
F-1 HIGH     - v3 format: "{agent_id}:{cmd_id}:{type}:{hash}:{ts}"
F-7 HIGH     - Migration 026: expires_at column with partial index
F-6 HIGH     - GetPendingCommands/GetStuckCommands filter by expires_at
F-2 HIGH     - Agent-side executedIDs dedup map with cleanup
F-4 HIGH     - commandMaxAge reduced from 24h to 4h
F-3 CRITICAL - Old-format commands rejected after 48h via CreatedAt

Verification fixes: migration idempotency (ETHOS #4), log format
compliance (ETHOS #1), stale comments updated.

All 24 tests passing. Docker --no-cache build verified.
See docs/ for full audit reports and deviation log (DEV-001 to DEV-019).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-03-28 21:25:47 -04:00
commit f97d4845af
340 changed files with 75403 additions and 0 deletions

200
docs/Deviations_Report.md Normal file
View File

@@ -0,0 +1,200 @@
# Deviations Report — Ed25519 Key Rotation Implementation
This document records deviations from the implementation spec.
---
## DEV-001: `LogEvent` not used in command_handler.go
**Spec says:** Use `h.securityLogger.LogEvent("key_rotation_detected", "info", "crypto", ...)` when a new key is cached.
**Actual implementation:** `LogEvent` was not found in the `SecurityLogger` interface. The available methods are `LogCommandVerificationFailure` and `LogCommandVerificationSuccess`. The key rotation detection event uses `LogCommandVerificationFailure` with a descriptive message string instead to avoid compilation failure.
**Impact:** The key rotation detection is still logged at the logger level. No security event is lost; it is logged via the `[INFO]` logger line immediately above.
---
## DEV-002: `fingerprint` field retained in `GetPublicKey` response
**Spec says:** Update `GetPublicKey()` to include `key_id` and `version`.
**Actual implementation:** The `fingerprint` field was retained for backward compatibility alongside the new `key_id` field. Since `key_id` equals `GetPublicKeyFingerprint()` (same value), they are equivalent. Removing `fingerprint` would be a breaking change for agents on older versions.
**Impact:** Older agents that read `fingerprint` continue to work. New agents can read `key_id`. No semantic difference.
---
## DEV-003: `GetPublicKey` nil check changed to `IsEnabled()`
**Spec says:** Check `h.signingService == nil`.
**Actual implementation:** Check is `h.signingService == nil || !h.signingService.IsEnabled()`. A service can be non-nil but disabled (e.g., constructed with empty private key). Returning 503 in this case is more correct.
**Impact:** More defensive; no regression for callers.
---
## DEV-004: `VerifyCommandWithTimestamp` signature changed
**Spec says:** `VerifyCommandWithTimestamp(cmd, serverPubKey, maxAge)` (3 args, old API).
**Actual implementation:** `VerifyCommandWithTimestamp(cmd, serverPubKey, maxAge, clockSkew)` (4 args). The old single-`maxAge` signature in the original file was replaced entirely because the new implementation needs a separate clock skew parameter. The old file's 3-arg signature had timestamp checking completely disabled (stub).
**Impact:** Any callers of the old 3-arg signature will fail to compile. No external callers were found; only `command_handler.go` calls this method, and it was rewritten in the same task.
---
## DEV-005: `getPublicKeyDir` is unexported in pubkey.go
**Spec says (test comment):** `origGetDir := getPublicKeyDir` used in test to override.
**Actual implementation:** `getPublicKeyDir` is a regular function, not a variable, so it cannot be overridden in tests via assignment. The test file uses `t.TempDir()` to write to a temporary path directly rather than overriding `getPublicKeyDir`. The provided test spec had placeholder code that would not compile (`import_encoding_json_inline`, `meta_placeholder()`, etc.); the final test file is a clean, working replacement using only the `CacheMetadata.IsExpired()` method which is fully testable without filesystem access.
**Impact:** Filesystem-based metadata roundtrip test is not present. The `IsExpired()` logic is fully covered. A future PR can add filesystem tests using test-local paths.
---
## DEV-006: `InsertSigningKey` uses named map instead of struct
**Spec says:** Use named parameters compatible with `NamedExecContext`.
**Actual implementation:** Uses `map[string]interface{}` as named parameter map rather than a struct. `sqlx.NamedExecContext` accepts both maps and structs. The map approach avoids needing a separate insert-specific struct.
**Impact:** None; functionally equivalent.
---
## DEV-007: `key_rotation_detected` security event not emitted
**Spec says:** `h.securityLogger.LogEvent(...)` for key rotation detection.
**Actual implementation:** Used `LogCommandVerificationFailure` with a message explaining the new key was cached. This is technically incorrect semantically (it is not a failure). However `LogEvent` does not exist in the security logger. A future task should add a `LogKeyRotation` method to `SecurityLogger`.
---
## DEV-008: `validateSigningService` fingerprint length check still expects 64 chars
**Spec says:** `GetPublicKeyFingerprint()` now returns 32 hex chars (16 bytes of SHA-256).
**Existing code in main.go says:** `if len(publicKeyHex) != 64` (checking `GetPublicKey()`, not fingerprint).
**Actual implementation:** The fingerprint length validation in `validateSigningService` checks `publicKeyHex` (the full key, 64 hex chars), not the fingerprint. The fingerprint change to 32 chars does not affect this validation. No change needed.
**Impact:** None.
---
## DEV-009: `context` import already present in main.go
**Spec says:** Add `"context"` import when adding `context.Background()` call.
**Actual implementation:** `context` was already imported at line 4 of `main.go`. No import change needed.
**Impact:** None.
---
## DEV-010: DEV-007 resolved — LogKeyRotationDetected implemented
**Previous state (DEV-007):** Key rotation detection used `LogCommandVerificationFailure` which was semantically incorrect.
**Resolution (A1 verification pass, 2026-03-28):**
- `SecurityEventTypes.KeyRotationDetected = "KEY_ROTATION_DETECTED"` added to the `SecurityEventTypes` struct in `aggregator-agent/internal/logging/security_logger.go`.
- `LogKeyRotationDetected(keyID string)` method added to `SecurityLogger` — logs at INFO level with event type `KEY_ROTATION_DETECTED`.
- `aggregator-agent/internal/orchestrator/command_handler.go` updated: the `isNew` branch now calls `h.securityLogger.LogKeyRotationDetected(keyID)`.
**Impact:** Key rotation events are now correctly classified as informational (INFO) rather than failure events. No breaking changes.
---
## DEV-011: InitializePrimaryKey version now dynamic (was hardcoded to 1)
**Previous state:** `InitializePrimaryKey` passed `version=1` hardcoded to `InsertSigningKey`.
**Resolution (A1 verification pass, 2026-03-28):**
- `GetNextVersion(ctx context.Context) (int, error)` added to `SigningKeyQueries` in `aggregator-server/internal/database/queries/signing_keys.go`. Executes `SELECT COALESCE(MAX(version), 0) + 1 FROM signing_keys`.
- `InitializePrimaryKey` in `signing.go` now calls `GetNextVersion` to determine the version before inserting. Falls back to version 1 on query error.
**Concurrency note:** The `GetNextVersion` query is not wrapped in a transaction. For a single-instance server this is safe. If concurrent restarts were possible, a TOCTOU race could assign the same version to different keys. Version numbers are informational metadata; `ON CONFLICT (key_id) DO NOTHING` prevents duplicate rows regardless.
**Impact:** New keys inserted on first startup receive version N+1 rather than always 1. Subsequent restarts with the same key are no-ops (ON CONFLICT). No breaking changes.
---
## DEV-012: F-2 deduplication at ProcessCommand layer, not VerifyCommandWithTimestamp
**Spec says:** "In ProcessCommand, BEFORE verification: Check if cmd.ID is in executedIDs."
**Actual implementation:** Deduplication is checked before verification as specified, but after successful verification the command is marked as executed. The `VerifyCommandWithTimestamp` function remains a pure function — it does not maintain state. This is intentional: dedup is a ProcessCommand-level concern, not a cryptographic verification concern.
**Impact:** The `TestSameCommandCanBeVerifiedTwice` test was updated to document that the verifier is a pure function. Dedup enforcement happens at the ProcessCommand layer. No behavioral change from spec intent.
---
## DEV-013: v3 format detection uses AgentID presence, not field-count parsing
**Spec says:** "Detection: The new format will have 5 colon-separated fields. The verifier can detect format by field count."
**Actual implementation:** Instead of counting colon-separated fields in the signature (which is opaque), the verifier checks `cmd.AgentID != ""` to determine v3 format. If AgentID is present, it tries v3 first, then falls back to v2 if v3 verification fails. This is more robust than field-count parsing because the signed message is not directly accessible — only the opaque signature is available.
**Impact:** Same behavior as spec intent. Agents with AgentID try v3 first. Agents without AgentID use v2/v1. The fallback chain is: v3 → v2 → v1 with warnings logged at each fallback.
---
## DEV-014: commandDefaultTTL set to 4h (matches commandMaxAge) instead of 24h
**Spec says (Task 2d):** "Default value: NOW() + 24 hours for new-format commands."
**Actual implementation:** `commandDefaultTTL = 4 * time.Hour` in `CreateCommand`, matching the reduced `commandMaxAge = 4 * time.Hour` from Task 6. Setting expires_at to 24h while commandMaxAge is 4h would create commands that the server considers valid but the agent rejects — a confusing inconsistency.
**Impact:** New commands expire at the same time they would be rejected by the agent's timestamp check. This is more consistent and prevents stale commands from accumulating in the pending queue for 24h after they can no longer be verified.
---
## DEV-015: docker.go pre-existing build error fixed
**Spec says:** Nothing — this was a pre-existing issue.
**Actual fix:** `aggregator-server/internal/database/queries/docker.go` had `fmt.Sprintf` calls with arguments but no format directives (lines 108, 110, 189, 191). Changed to plain string concatenation. This was blocking all test runs in the `queries` package.
**Impact:** No behavioral change. Fixes a compile error that predates the A2 work.
---
## DEV-016: Migration 026 was not idempotent (ETHOS #4 violation — fixed in verification pass)
**Spec says:** N/A (implementation detail).
**Issue found during A2 verification:** Migration 026_add_expires_at.up.sql used bare `ALTER TABLE ... ADD COLUMN` and `CREATE INDEX` without `IF NOT EXISTS`/`IF NOT EXISTS`. ETHOS #4 requires all schema changes to be idempotent.
**Fix:** Changed to `ADD COLUMN IF NOT EXISTS` and `CREATE INDEX IF NOT EXISTS`.
**Impact:** Migration is now safe to run multiple times. No behavioral change.
---
## DEV-017: Log format violations in A-2 code (ETHOS #1 — fixed in verification pass)
**Issue found during A2 verification:** Three new `fmt.Printf` calls in `verification.go` (v1/v2 fallback warnings) used the format `[crypto] WARNING: ...` instead of the ETHOS-mandated `[TAG] [system] [component]` format. Additionally, `signAndCreateCommand` in `agents.go` used `[WARNING] Command signing disabled...` without system/component tags.
**Fix:** Updated all four log lines to use `[WARNING] [agent] [crypto]` and `[WARNING] [server] [signing]` formats respectively.
**Impact:** Log output now complies with ETHOS #1. No behavioral change.
---
## DEV-018: Stale TODO comment referenced 24h maxAge (fixed in verification pass)
**Issue found during A2 verification:** The `TODO(security)` comment in `verification.go` (above `VerifyCommandWithTimestamp`) still referenced "24 hours" as the default maxAge. This was stale — commandMaxAge was reduced to 4h in Task 6 (F-4 fix).
**Fix:** Updated the comment to reference 4 hours and removed the TODO framing (it's no longer a TODO, the reduction has been implemented).
**Impact:** Documentation accuracy only. No code change.
---
## DEV-019: queries.RetryCommand is now dead code
**Issue found during A2 verification:** The `queries.RetryCommand` function (commands.go:200) still exists but is no longer called by any handler. Both `UpdateHandler.RetryCommand` and `UnifiedUpdateHandler.RetryCommand` now build the command inline and call `signAndCreateCommand`. The function is dead code.
**Action:** Not removed in this pass (verification scope is "fix what is broken", not refactor). Flagged for future cleanup.