Complete RedFlag codebase with two major security audit implementations.
== A-1: Ed25519 Key Rotation Support ==
Server:
- SignCommand sets SignedAt timestamp and KeyID on every signature
- signing_keys database table (migration 020) for multi-key rotation
- InitializePrimaryKey registers active key at startup
- /api/v1/public-keys endpoint for rotation-aware agents
- SigningKeyQueries for key lifecycle management
Agent:
- Key-ID-aware verification via CheckKeyRotation
- FetchAndCacheAllActiveKeys for rotation pre-caching
- Cache metadata with TTL and staleness fallback
- SecurityLogger events for key rotation and command signing
== A-2: Replay Attack Fixes (F-1 through F-7) ==
F-5 CRITICAL - RetryCommand now signs via signAndCreateCommand
F-1 HIGH - v3 format: "{agent_id}:{cmd_id}:{type}:{hash}:{ts}"
F-7 HIGH - Migration 026: expires_at column with partial index
F-6 HIGH - GetPendingCommands/GetStuckCommands filter by expires_at
F-2 HIGH - Agent-side executedIDs dedup map with cleanup
F-4 HIGH - commandMaxAge reduced from 24h to 4h
F-3 CRITICAL - Old-format commands rejected after 48h via CreatedAt
Verification fixes: migration idempotency (ETHOS #4), log format
compliance (ETHOS #1), stale comments updated.
All 24 tests passing. Docker --no-cache build verified.
See docs/ for full audit reports and deviation log (DEV-001 to DEV-019).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
# Ed25519 Key Rotation Implementation

## Overview

This document describes the Ed25519 key rotation support added to the RedFlag project. The implementation adds full lifecycle management for signing keys: database registration, TTL-aware agent caching, per-command key identity, timestamped signatures, and lazy key fetch on rotation events.

The design is backward compatible: agents and servers running the old single-key code continue to work. Agents transparently upgrade to the new verification path when a command carries a `key_id` or `signed_at` field.

---
## Files Changed

### Server

#### `aggregator-server/internal/database/migrations/025_add_key_id_signed_at.up.sql` (NEW)
Adds `key_id VARCHAR(64)` and `signed_at TIMESTAMP` columns to `agent_commands`, plus an index on `key_id`. These columns allow post-hoc auditing of which key signed which command and let the agent detect replayed commands.

#### `aggregator-server/internal/database/migrations/025_add_key_id_signed_at.down.sql` (NEW)
Drops the index and columns added by the up migration.
#### `aggregator-server/internal/models/signing_key.go` (NEW)
Go struct `SigningKey` mirroring the `signing_keys` table (from migration 020). Fields: `ID`, `KeyID`, `PublicKey`, `Algorithm`, `IsActive`, `IsPrimary`, `CreatedAt`, `DeprecatedAt`, `Version`.

#### `aggregator-server/internal/database/queries/signing_keys.go` (NEW)
Database access layer for `signing_keys`:
- `GetPrimarySigningKey(ctx)`: fetch the current primary active key
- `GetActiveSigningKeys(ctx)`: fetch all active keys (for the rotation window)
- `InsertSigningKey(ctx, keyID, publicKeyHex, version)`: idempotent upsert via `ON CONFLICT (key_id) DO NOTHING`
- `SetPrimaryKey(ctx, keyID)`: atomic swap that unsets all primaries, then sets the new one
- `DeprecateKey(ctx, keyID)`: mark a key inactive with a timestamp
- `GetKeyByID(ctx, keyID)`: look up a key by its `key_id` string
#### `aggregator-server/internal/models/command.go` (MODIFIED)
Added `KeyID string` and `SignedAt *time.Time` to both `AgentCommand` and `CommandItem` structs.

#### `aggregator-server/internal/database/queries/commands.go` (MODIFIED)
`CreateCommand()` INSERT statements now include the `key_id` and `signed_at` columns, in both the variant with `idempotency_key` and the variant without it.

#### `aggregator-server/internal/services/signing.go` (MODIFIED, significant rewrite)
Key changes:
- Added `signingKeyQueries *queries.SigningKeyQueries` field and `SetSigningKeyQueries()` setter.
- `GetPublicKeyFingerprint()` now uses SHA-256 of the full public key truncated to 16 bytes (32 hex chars) instead of the first 8 raw bytes of the key. This produces a stable, collision-resistant identifier.
- Added `GetCurrentKeyID()`: an alias for `GetPublicKeyFingerprint()` with semantic clarity.
- Added `GetPublicKeyHex()`: an alias for `GetPublicKey()` with semantic clarity.
- Added `InitializePrimaryKey(ctx)`: registers the active key in the DB at startup.
- Added `GetAllActivePublicKeys(ctx)`: returns the DB list of active keys; falls back to an in-memory single-entry list.
- `SignCommand()` now sets `cmd.SignedAt` and `cmd.KeyID` before signing, and uses the new message format `{id}:{command_type}:{sha256(params)}:{unix_timestamp}`.
#### `aggregator-server/internal/api/handlers/system.go` (MODIFIED, significant rewrite)
- `SystemHandler` now holds `*queries.SigningKeyQueries`.
- `NewSystemHandler` signature changed to accept both `*services.SigningService` and `*queries.SigningKeyQueries`.
- `GetPublicKey()` response now includes `key_id` and `version` fields.
- Added `GetActivePublicKeys()`: a public endpoint returning a JSON array of all active keys.

#### `aggregator-server/internal/api/handlers/agents.go` (MODIFIED)
`GetCommands()` now includes `KeyID` and `SignedAt` when building `CommandItem` responses.

#### `aggregator-server/cmd/server/main.go` (MODIFIED)
- Creates `signingKeyQueries` after other query objects.
- After signing service validation, calls `SetSigningKeyQueries` and `InitializePrimaryKey`.
- Updates `NewSystemHandler` call to pass `signingKeyQueries`.
- Adds route `GET /api/v1/public-keys` mapped to `systemHandler.GetActivePublicKeys`.

### Agent

#### `aggregator-agent/internal/client/client.go` (MODIFIED)
- `Command` struct gains `KeyID string` and `SignedAt *time.Time` fields.
- Added `ActivePublicKeyEntry` struct and `GetActivePublicKeys(serverURL)` method.
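
A minimal sketch of what the agent-side fetch might look like is shown below, run against a stub server so it works without a RedFlag instance. The JSON field names on `ActivePublicKeyEntry`, the lowercase function name `getActivePublicKeys`, the 10-second timeout, and the stub data are all assumptions; only the endpoint path and the fact that it returns a JSON array come from this document.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// ActivePublicKeyEntry models one element of the /api/v1/public-keys array.
// The JSON field names are assumed for illustration.
type ActivePublicKeyEntry struct {
	KeyID     string `json:"key_id"`
	PublicKey string `json:"public_key"` // hex-encoded key (assumed encoding)
	IsPrimary bool   `json:"is_primary"`
	Version   int    `json:"version"`
}

// getActivePublicKeys fetches and decodes the active key list from serverURL.
func getActivePublicKeys(serverURL string) ([]ActivePublicKeyEntry, error) {
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get(serverURL + "/api/v1/public-keys")
	if err != nil {
		return nil, fmt.Errorf("fetch active keys: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("fetch active keys: unexpected status %d", resp.StatusCode)
	}
	var keys []ActivePublicKeyEntry
	if err := json.NewDecoder(resp.Body).Decode(&keys); err != nil {
		return nil, fmt.Errorf("decode active keys: %w", err)
	}
	return keys, nil
}

// newStubServer serves a fixed two-key rotation window for the demo.
func newStubServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != "/api/v1/public-keys" {
			http.NotFound(w, r)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode([]ActivePublicKeyEntry{
			{KeyID: "old-key", PublicKey: "aa", IsPrimary: true, Version: 1},
			{KeyID: "new-key", PublicKey: "bb", IsPrimary: false, Version: 2},
		})
	}))
}

func main() {
	srv := newStubServer()
	defer srv.Close()

	keys, err := getActivePublicKeys(srv.URL)
	if err != nil {
		panic(err)
	}
	for _, k := range keys {
		fmt.Printf("%s primary=%v version=%d\n", k.KeyID, k.IsPrimary, k.Version)
	}
}
```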
#### `aggregator-agent/internal/crypto/pubkey.go` (REWRITTEN)
Complete rewrite with:
- `CacheMetadata` struct with `KeyID`, `Version`, `CachedAt`, `TTLHours` and `IsExpired()` method.
- `getPublicKeyDir()`, `getPrimaryKeyPath()`, `getKeyPathByID(keyID)`, `getPrimaryMetaPath()` helpers.
- `FetchAndCacheServerPublicKey()` now checks the TTL and `key_id` metadata before using the cache; if the network fetch fails, it falls back to the stale cache rather than failing.
- `FetchAndCacheAllActiveKeys()`: fetches `/api/v1/public-keys` and caches each key by `key_id`.
- `LoadCachedPublicKeyByID(keyID)`: loads by `key_id`, with fallback to the primary key.
- `IsKeyIDCached(keyID)`: existence check.
- `CachePublicKeyByID(keyID, key)`: writes to a `key_id`-specific file.
#### `aggregator-agent/internal/crypto/verification.go` (REWRITTEN)
Complete rewrite with:
- `VerifyCommand()`: old format for backward compatibility (`id:type:sha256(params)`).
- `VerifyCommandWithTimestamp()`: new format with timestamp check; falls back to the old format if `SignedAt == nil`.
- `reconstructMessage()` and `reconstructMessageWithTimestamp()` internal helpers.
- `CheckKeyRotation(keyID, serverURL)`: returns the correct key for a given `key_id`, fetching from the server if it is not cached.
#### `aggregator-agent/internal/orchestrator/command_handler.go` (REWRITTEN)
Complete rewrite with:
- Replaced single `ServerPublicKey` field with `keyCache map[string]ed25519.PublicKey` protected by `sync.RWMutex`.
- `getKeyForCommand()`: in-memory, then disk, then server lookup.
- `ProcessCommand()`: selects `VerifyCommandWithTimestamp` vs `VerifyCommand` based on `cmd.SignedAt`.
- `RefreshPrimaryKey(serverURL)`: proactive refresh.
- `ShouldRefreshKey()`: returns true if `keyRefreshInterval` (6h) has elapsed.
- `UpdateServerPublicKey()`: backward-compatible alias for `RefreshPrimaryKey`.
#### `aggregator-agent/cmd/agent/main.go` (MODIFIED)
Main polling loop now calls `commandHandler.ShouldRefreshKey()` / `commandHandler.RefreshPrimaryKey()` before each server check-in to proactively detect key rotations.

### Tests

#### `aggregator-agent/internal/crypto/pubkey_test.go` (NEW)
Tests for `CacheMetadata.IsExpired()` covering: fresh cache, expired cache, zero TTL defaulting to 24h (both expired and fresh), and exactly-at-boundary case.

#### `aggregator-agent/internal/crypto/verification_test.go` (NEW)
Tests for `VerifyCommandWithTimestamp()` and `VerifyCommand()` covering: valid recent command, too-old command, future beyond clock skew, future within clock skew, backward compat with no timestamp, wrong key rejection, and old-format backward compat.

---
## Key Rotation Operational Procedure

### Step 1: Generate a new Ed25519 key pair

```bash
# Generate a new Ed25519 private key.
# The output is a hex-encoded PKCS#8 DER structure (48 bytes) that wraps the
# 32-byte seed; it is not the raw 64-byte seed + public key form.
openssl genpkey -algorithm ed25519 -outform DER | xxd -p -c 256
# Or use the RedFlag key generation endpoint:
curl -X POST http://server/api/setup/generate-keys
```
### Step 2: Add the new key to the database

The new key must be inserted into `signing_keys` with `is_active = true` and `is_primary = false`. This can be done directly or via a future admin API:

```sql
INSERT INTO signing_keys (id, key_id, public_key, algorithm, is_active, is_primary, version, created_at)
VALUES (gen_random_uuid(), '<new_key_id>', '<new_public_key_hex>', 'ed25519', true, false, 2, NOW());
```

### Step 3: Wait for agents to cache the new key

The `GET /api/v1/public-keys` endpoint returns all active keys. Agents cache every active key via `FetchAndCacheAllActiveKeys()`, which is triggered the first time they encounter an unknown `key_id`. The cache TTL defaults to 24 hours.

For proactive distribution, wait at least 24 hours or until all agents have checked in at least once since the new key was added to `signing_keys`.
### Step 4: Swap the primary key

```sql
-- Atomic primary swap
BEGIN;
UPDATE signing_keys SET is_primary = false WHERE is_primary = true;
UPDATE signing_keys SET is_primary = true WHERE key_id = '<new_key_id>';
COMMIT;
```

Or use the `SetPrimaryKey` query method from the admin tooling.
### Step 5: Update the server's REDFLAG_SIGNING_PRIVATE_KEY

Set the new private key in the environment and restart the server. On startup the server calls `InitializePrimaryKey()`, which in turn calls `InsertSigningKey` (idempotent) and `SetPrimaryKey`.

### Step 6: Deprecate the old key after transition window

Once all agents have received at least one command signed with the new key and the transition window has closed:

```sql
UPDATE signing_keys
SET is_active = false, is_primary = false, deprecated_at = NOW()
WHERE key_id = '<old_key_id>';
```

Or use `DeprecateKey()`.

---
## Transition Window Behavior

During the transition window, both old and new keys are active in `signing_keys`. The agent behavior is:

1. **Agent receives command with new `key_id`:** Key not in memory or disk cache → calls `CheckKeyRotation()` → calls `FetchAndCacheAllActiveKeys()` → both keys cached → verifies with new key.
2. **Agent receives command with old `key_id`:** Key already in memory cache → verifies immediately.
3. **Agent receives command with no `key_id`:** Uses primary cached key (backward compat path).
4. **Proactive refresh (every 6h):** Agent re-fetches primary key from `GET /api/v1/public-key`, updates in-memory and disk cache.

---
## Known Remaining Limitations

1. **No admin API for key rotation.** The rotation procedure requires direct database access or a future admin endpoint. A `/api/v1/admin/keys` endpoint for listing, promoting, and deprecating keys should be added in a future sprint.

2. **`InsertSigningKey` uses version=1 hardcoded.** The `InitializePrimaryKey()` call in `main.go` always passes version=1. True version tracking requires deriving the version from the current max version in the DB.

3. **No agent-side key expiry notification.** If a key is deprecated server-side while an agent still has commands in-flight using that key, those commands will fail verification. A grace period should be enforced server-side.

4. **Timestamp checking uses a 24-hour maxAge.** This is intentionally generous for the initial deployment to avoid rejecting commands from clocks with significant drift. It should be tightened to 10-15 minutes once clock synchronization is verified across all agents.

5. **`signing_keys` table `ON CONFLICT (key_id)` requires a unique constraint.** Migration 020 must have created a unique index on `key_id`. If it did not, `InsertSigningKey` will fail with an error rather than silently ignoring the conflict, because PostgreSQL rejects an `ON CONFLICT` target that has no matching unique index.

6. **No key revocation mechanism.** There is no way to emergency-revoke a key before its deprecation. A revocation list or CRL-style endpoint should be considered for high-security deployments.