Redflag/docs/A1_KeyRotation_Implementation.md
jpetree331 f97d4845af feat(security): A-1 Ed25519 key rotation + A-2 replay attack fixes
Complete RedFlag codebase with two major security audit implementations.

== A-1: Ed25519 Key Rotation Support ==

Server:
- SignCommand sets SignedAt timestamp and KeyID on every signature
- signing_keys database table (migration 020) for multi-key rotation
- InitializePrimaryKey registers active key at startup
- /api/v1/public-keys endpoint for rotation-aware agents
- SigningKeyQueries for key lifecycle management

Agent:
- Key-ID-aware verification via CheckKeyRotation
- FetchAndCacheAllActiveKeys for rotation pre-caching
- Cache metadata with TTL and staleness fallback
- SecurityLogger events for key rotation and command signing

== A-2: Replay Attack Fixes (F-1 through F-7) ==

F-5 CRITICAL - RetryCommand now signs via signAndCreateCommand
F-1 HIGH     - v3 format: "{agent_id}:{cmd_id}:{type}:{hash}:{ts}"
F-7 HIGH     - Migration 026: expires_at column with partial index
F-6 HIGH     - GetPendingCommands/GetStuckCommands filter by expires_at
F-2 HIGH     - Agent-side executedIDs dedup map with cleanup
F-4 HIGH     - commandMaxAge reduced from 24h to 4h
F-3 CRITICAL - Old-format commands rejected after 48h via CreatedAt

Verification fixes: migration idempotency (ETHOS #4), log format
compliance (ETHOS #1), stale comments updated.

All 24 tests passing. Docker --no-cache build verified.
See docs/ for full audit reports and deviation log (DEV-001 to DEV-019).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 21:25:47 -04:00


Ed25519 Key Rotation Implementation

Overview

This document describes the Ed25519 key rotation support added to the RedFlag project. The implementation adds full lifecycle management for signing keys: database registration, TTL-aware agent caching, per-command key identity, timestamped signatures, and lazy key fetch on rotation events.

The design is backward compatible: agents and servers running the old single-key code continue to work. Agents transparently upgrade to the new verification path when a command carries a key_id or signed_at field.


Files Changed

Server

aggregator-server/internal/database/migrations/025_add_key_id_signed_at.up.sql (NEW)

Adds key_id VARCHAR(64) and signed_at TIMESTAMP columns to agent_commands, plus an index on key_id. These columns allow post-hoc auditing of which key signed which command, and they give the agent the signing time it needs for replay-attack detection.

aggregator-server/internal/database/migrations/025_add_key_id_signed_at.down.sql (NEW)

Drops the index and columns added by the up migration.

aggregator-server/internal/models/signing_key.go (NEW)

Go struct SigningKey mirroring the signing_keys table (from migration 020). Fields: ID, KeyID, PublicKey, Algorithm, IsActive, IsPrimary, CreatedAt, DeprecatedAt, Version.

aggregator-server/internal/database/queries/signing_keys.go (NEW)

Database access layer for signing_keys:

  • GetPrimarySigningKey(ctx) — fetch the current primary active key
  • GetActiveSigningKeys(ctx) — fetch all active keys (for rotation window)
  • InsertSigningKey(ctx, keyID, publicKeyHex, version) — idempotent upsert via ON CONFLICT (key_id) DO NOTHING
  • SetPrimaryKey(ctx, keyID) — atomic swap: unset all primaries, set new one
  • DeprecateKey(ctx, keyID) — mark a key inactive with timestamp
  • GetKeyByID(ctx, keyID) — lookup by key_id string

aggregator-server/internal/models/command.go (MODIFIED)

Added KeyID string and SignedAt *time.Time to both AgentCommand and CommandItem structs.

aggregator-server/internal/database/queries/commands.go (MODIFIED)

CreateCommand() INSERT statements now include the key_id and signed_at columns, in both the variant that sets idempotency_key and the variant that does not.

aggregator-server/internal/services/signing.go (MODIFIED — significant rewrite)

Key changes:

  • Added signingKeyQueries *queries.SigningKeyQueries field and SetSigningKeyQueries() setter.
  • GetPublicKeyFingerprint() now uses SHA-256 of the full public key truncated to 16 bytes (32 hex chars) instead of the first 8 raw bytes of the key. This produces a stable, collision-resistant identifier.
  • Added GetCurrentKeyID() — alias for GetPublicKeyFingerprint() with semantic clarity.
  • Added GetPublicKeyHex() — alias for GetPublicKey() with semantic clarity.
  • Added InitializePrimaryKey(ctx) — registers the active key in the DB at startup.
  • Added GetAllActivePublicKeys(ctx) — returns DB list of active keys; falls back to in-memory single-entry list.
  • SignCommand() now mutates cmd.SignedAt and cmd.KeyID before signing, and uses the new message format {id}:{command_type}:{sha256(params)}:{unix_timestamp}.

aggregator-server/internal/api/handlers/system.go (MODIFIED — significant rewrite)

  • SystemHandler now holds *queries.SigningKeyQueries.
  • NewSystemHandler signature changed to accept both *services.SigningService and *queries.SigningKeyQueries.
  • GetPublicKey() response now includes key_id and version fields.
  • Added GetActivePublicKeys() — public endpoint returning JSON array of all active keys.

aggregator-server/internal/api/handlers/agents.go (MODIFIED)

GetCommands() now includes KeyID and SignedAt when building CommandItem responses.

aggregator-server/cmd/server/main.go (MODIFIED)

  • Creates signingKeyQueries after other query objects.
  • After signing service validation, calls SetSigningKeyQueries and InitializePrimaryKey.
  • Updates NewSystemHandler call to pass signingKeyQueries.
  • Adds route GET /api/v1/public-keys mapped to systemHandler.GetActivePublicKeys.

Agent

aggregator-agent/internal/client/client.go (MODIFIED)

  • Command struct gains KeyID string and SignedAt *time.Time fields.
  • Added ActivePublicKeyEntry struct and GetActivePublicKeys(serverURL) method.

aggregator-agent/internal/crypto/pubkey.go (REWRITTEN)

Complete rewrite with:

  • CacheMetadata struct with KeyID, Version, CachedAt, TTLHours and IsExpired() method.
  • getPublicKeyDir(), getPrimaryKeyPath(), getKeyPathByID(keyID), getPrimaryMetaPath() helpers.
  • FetchAndCacheServerPublicKey() now checks the TTL and key_id metadata before using the cache; if the network fetch fails, it falls back to the stale cache rather than failing.
  • FetchAndCacheAllActiveKeys() — fetches /api/v1/public-keys and caches each by key_id.
  • LoadCachedPublicKeyByID(keyID) — load by key_id with primary fallback.
  • IsKeyIDCached(keyID) — existence check.
  • CachePublicKeyByID(keyID, key) — write to key_id-specific file.

aggregator-agent/internal/crypto/verification.go (REWRITTEN)

Complete rewrite with:

  • VerifyCommand() — old format backward compat (id:type:sha256(params)).
  • VerifyCommandWithTimestamp() — new format with timestamp check; falls back to old format if SignedAt == nil.
  • reconstructMessage() and reconstructMessageWithTimestamp() internal helpers.
  • CheckKeyRotation(keyID, serverURL) — returns correct key for a given key_id, fetching from server if not cached.

aggregator-agent/internal/orchestrator/command_handler.go (REWRITTEN)

Complete rewrite with:

  • Replaced single ServerPublicKey field with keyCache map[string]ed25519.PublicKey protected by sync.RWMutex.
  • getKeyForCommand() — in-memory then disk then server lookup.
  • ProcessCommand() — selects VerifyCommandWithTimestamp vs VerifyCommand based on cmd.SignedAt.
  • RefreshPrimaryKey(serverURL) — proactive refresh.
  • ShouldRefreshKey() — returns true if keyRefreshInterval (6h) has elapsed.
  • UpdateServerPublicKey() — backward compat alias for RefreshPrimaryKey.

aggregator-agent/cmd/agent/main.go (MODIFIED)

Main polling loop now calls commandHandler.ShouldRefreshKey() / commandHandler.RefreshPrimaryKey() before each server check-in to proactively detect key rotations.

Tests

aggregator-agent/internal/crypto/pubkey_test.go (NEW)

Tests for CacheMetadata.IsExpired() covering: fresh cache, expired cache, zero TTL defaulting to 24h (both expired and fresh), and exactly-at-boundary case.

aggregator-agent/internal/crypto/verification_test.go (NEW)

Tests for VerifyCommandWithTimestamp() and VerifyCommand() covering: valid recent command, too-old command, future beyond clock skew, future within clock skew, backward compat with no timestamp, wrong key rejection, and old-format backward compat.


Key Rotation Operational Procedure

Step 1: Generate a new Ed25519 key pair

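# Generate a new Ed25519 private key. OpenSSL emits a 48-byte PKCS#8 DER blob whose
# last 32 bytes are the seed; this is not the 64-byte seed + public key form.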
openssl genpkey -algorithm ed25519 -outform DER | xxd -p -c 256
# Or use the RedFlag key generation endpoint:
curl -X POST http://server/api/setup/generate-keys

Step 2: Add the new key to the database

The new key must be inserted into signing_keys with is_active = true and is_primary = false. This can be done directly or via a future admin API:

INSERT INTO signing_keys (id, key_id, public_key, algorithm, is_active, is_primary, version, created_at)
VALUES (gen_random_uuid(), '<new_key_id>', '<new_public_key_hex>', 'ed25519', true, false, 2, NOW());

Step 3: Wait for agents to cache the new key

The GET /api/v1/public-keys endpoint returns all active keys. Agents will cache every active key via FetchAndCacheAllActiveKeys() triggered on first encounter of an unknown key_id. The TTL is 24 hours by default.

For proactive distribution, wait at least 24 hours or until all agents have checked in at least once since the new key was added to signing_keys.

Step 4: Swap the primary key

-- Atomic primary swap
BEGIN;
UPDATE signing_keys SET is_primary = false WHERE is_primary = true;
UPDATE signing_keys SET is_primary = true WHERE key_id = '<new_key_id>';
COMMIT;

Or use the SetPrimaryKey query method from the admin tooling.

Step 5: Update the server's REDFLAG_SIGNING_PRIVATE_KEY

Set the new private key in the environment and restart the server. On startup, the server calls InitializePrimaryKey(), which calls InsertSigningKey (idempotent) and SetPrimaryKey.

Step 6: Deprecate the old key after transition window

Once all agents have received at least one command signed with the new key and the transition window has closed:

UPDATE signing_keys
SET is_active = false, is_primary = false, deprecated_at = NOW()
WHERE key_id = '<old_key_id>';

Or use DeprecateKey().


Transition Window Behavior

During the transition window, both old and new keys are active in signing_keys. The agent behavior is:

  1. Agent receives command with new key_id: Key not in memory or disk cache → calls CheckKeyRotation() → calls FetchAndCacheAllActiveKeys() → both keys cached → verifies with new key.
  2. Agent receives command with old key_id: Key already in memory cache → verifies immediately.
  3. Agent receives command with no key_id: Uses primary cached key (backward compat path).
  4. Proactive refresh (every 6h): Agent re-fetches primary key from GET /api/v1/public-key, updates in-memory and disk cache.

Known Remaining Limitations

  1. No admin API for key rotation. The rotation procedure requires direct database access or a future admin endpoint. A /api/v1/admin/keys endpoint for listing, promoting, and deprecating keys should be added in a future sprint.

  2. InsertSigningKey uses version=1 hardcoded. The InitializePrimaryKey() call in main.go always passes version=1. True version tracking requires deriving version from the current max version in the DB.

  3. No agent-side key expiry notification. If a key is deprecated server-side while an agent still has commands in-flight using that key, those commands will fail verification. A grace period should be enforced server-side.

  4. Timestamp checking uses a 24-hour maxAge. This is intentionally generous for the initial deployment to avoid rejecting commands from clocks with significant drift. It should be tightened to 10-15 minutes once clock synchronization is verified across all agents.

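  5. signing_keys table ON CONFLICT (key_id) requires a unique constraint. Migration 020 must have created a unique index or constraint on key_id. If it did not, InsertSigningKey will fail on every call rather than silently ignoring duplicates, because PostgreSQL rejects an ON CONFLICT target that has no matching unique or exclusion constraint.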

  6. No key revocation mechanism. There is no way to emergency-revoke a key before its deprecation. A revocation list or CRL-style endpoint should be considered for high-security deployments.