# Ed25519 Key Rotation Implementation

## Overview

This document describes the Ed25519 key rotation support added to the RedFlag project. The implementation adds full lifecycle management for signing keys: database registration, TTL-aware agent caching, per-command key identity, timestamped signatures, and lazy key fetch on rotation events.

The design is backward compatible: agents and servers running the old single-key code continue to work. Agents transparently upgrade to the new verification path when a command carries a `key_id` or `signed_at` field.

---

## Files Changed

### Server

#### `aggregator-server/internal/database/migrations/025_add_key_id_signed_at.up.sql` (NEW)

Adds `key_id VARCHAR(64)` and `signed_at TIMESTAMP` columns to `agent_commands`, plus an index on `key_id`. These columns allow post-hoc audit of which key signed which command and enable the agent to do replay-attack detection.

#### `aggregator-server/internal/database/migrations/025_add_key_id_signed_at.down.sql` (NEW)

Drops the index and columns added by the up migration.

#### `aggregator-server/internal/models/signing_key.go` (NEW)

Go struct `SigningKey` mirroring the `signing_keys` table (from migration 020). Fields: `ID`, `KeyID`, `PublicKey`, `Algorithm`, `IsActive`, `IsPrimary`, `CreatedAt`, `DeprecatedAt`, `Version`.

#### `aggregator-server/internal/database/queries/signing_keys.go` (NEW)

Database access layer for `signing_keys`:

- `GetPrimarySigningKey(ctx)`: fetch the current primary active key
- `GetActiveSigningKeys(ctx)`: fetch all active keys (for rotation window)
- `InsertSigningKey(ctx, keyID, publicKeyHex, version)`: idempotent upsert via `ON CONFLICT (key_id) DO NOTHING`
- `SetPrimaryKey(ctx, keyID)`: atomic swap: unset all primaries, set new one
- `DeprecateKey(ctx, keyID)`: mark a key inactive with timestamp
- `GetKeyByID(ctx, keyID)`: lookup by key_id string

#### `aggregator-server/internal/models/command.go` (MODIFIED)

Added `KeyID string` and `SignedAt *time.Time` to both `AgentCommand` and `CommandItem` structs.

#### `aggregator-server/internal/database/queries/commands.go` (MODIFIED)

`CreateCommand()` INSERT statements now include `key_id` and `signed_at` columns (both with and without `idempotency_key` variant).

#### `aggregator-server/internal/services/signing.go` (MODIFIED, significant rewrite)

Key changes:

- Added `signingKeyQueries *queries.SigningKeyQueries` field and `SetSigningKeyQueries()` setter.
- `GetPublicKeyFingerprint()` now uses SHA-256 of the full public key truncated to 16 bytes (32 hex chars) instead of the first 8 raw bytes of the key. This produces a stable, collision-resistant identifier.
- Added `GetCurrentKeyID()`: alias for `GetPublicKeyFingerprint()` with semantic clarity.
- Added `GetPublicKeyHex()`: alias for `GetPublicKey()` with semantic clarity.
- Added `InitializePrimaryKey(ctx)`: registers the active key in the DB at startup.
- Added `GetAllActivePublicKeys(ctx)`: returns DB list of active keys; falls back to in-memory single-entry list.
- `SignCommand()` now mutates `cmd.SignedAt` and `cmd.KeyID` before signing, and uses the new message format `{id}:{command_type}:{sha256(params)}:{unix_timestamp}`.

#### `aggregator-server/internal/api/handlers/system.go` (MODIFIED, significant rewrite)

- `SystemHandler` now holds `*queries.SigningKeyQueries`.
- `NewSystemHandler` signature changed to accept both `*services.SigningService` and `*queries.SigningKeyQueries`.
- `GetPublicKey()` response now includes `key_id` and `version` fields.
- Added `GetActivePublicKeys()`: public endpoint returning JSON array of all active keys.

#### `aggregator-server/internal/api/handlers/agents.go` (MODIFIED)

`GetCommands()` now includes `KeyID` and `SignedAt` when building `CommandItem` responses.

#### `aggregator-server/cmd/server/main.go` (MODIFIED)

- Creates `signingKeyQueries` after other query objects.
- After signing service validation, calls `SetSigningKeyQueries` and `InitializePrimaryKey`.
- Updates `NewSystemHandler` call to pass `signingKeyQueries`.
- Adds route `GET /api/v1/public-keys` mapped to `systemHandler.GetActivePublicKeys`.

### Agent

#### `aggregator-agent/internal/client/client.go` (MODIFIED)

- `Command` struct gains `KeyID string` and `SignedAt *time.Time` fields.
- Added `ActivePublicKeyEntry` struct and `GetActivePublicKeys(serverURL)` method.

#### `aggregator-agent/internal/crypto/pubkey.go` (REWRITTEN)

Complete rewrite with:

- `CacheMetadata` struct with `KeyID`, `Version`, `CachedAt`, `TTLHours` and `IsExpired()` method.
- `getPublicKeyDir()`, `getPrimaryKeyPath()`, `getKeyPathByID(keyID)`, `getPrimaryMetaPath()` helpers.
- `FetchAndCacheServerPublicKey()` now checks the TTL and `key_id` metadata before using the cache; if the network fetch fails, it falls back to the stale cache rather than failing.
- `FetchAndCacheAllActiveKeys()`: fetches `/api/v1/public-keys` and caches each by key_id.
- `LoadCachedPublicKeyByID(keyID)`: load by key_id with primary fallback.
- `IsKeyIDCached(keyID)`: existence check.
- `CachePublicKeyByID(keyID, key)`: write to key_id-specific file.

#### `aggregator-agent/internal/crypto/verification.go` (REWRITTEN)

Complete rewrite with:

- `VerifyCommand()`: old format backward compat (`id:type:sha256(params)`).
- `VerifyCommandWithTimestamp()`: new format with timestamp check; falls back to old format if `SignedAt == nil`.
- `reconstructMessage()` and `reconstructMessageWithTimestamp()` internal helpers.
- `CheckKeyRotation(keyID, serverURL)`: returns correct key for a given key_id, fetching from server if not cached.

#### `aggregator-agent/internal/orchestrator/command_handler.go` (REWRITTEN)

Complete rewrite with:

- Replaced single `ServerPublicKey` field with `keyCache map[string]ed25519.PublicKey` protected by `sync.RWMutex`.
- `getKeyForCommand()`: in-memory then disk then server lookup.
- `ProcessCommand()`: selects `VerifyCommandWithTimestamp` vs `VerifyCommand` based on `cmd.SignedAt`.
- `RefreshPrimaryKey(serverURL)`: proactive refresh.
- `ShouldRefreshKey()`: returns true if `keyRefreshInterval` (6h) has elapsed.
- `UpdateServerPublicKey()`: backward compat alias for `RefreshPrimaryKey`.

#### `aggregator-agent/cmd/agent/main.go` (MODIFIED)

Main polling loop now calls `commandHandler.ShouldRefreshKey()` / `commandHandler.RefreshPrimaryKey()` before each server check-in to proactively detect key rotations.

### Tests

#### `aggregator-agent/internal/crypto/pubkey_test.go` (NEW)

Tests for `CacheMetadata.IsExpired()` covering: fresh cache, expired cache, zero TTL defaulting to 24h (both expired and fresh), and exactly-at-boundary case.

#### `aggregator-agent/internal/crypto/verification_test.go` (NEW)

Tests for `VerifyCommandWithTimestamp()` and `VerifyCommand()` covering: valid recent command, too-old command, future beyond clock skew, future within clock skew, backward compat with no timestamp, wrong key rejection, and old-format backward compat.


---

## Key Rotation Operational Procedure

### Step 1: Generate a new Ed25519 key pair

```bash
# Generate a new Ed25519 private key and print it as hex.
# Note: openssl emits a PKCS#8 DER structure (48 bytes) whose last 32 bytes are the seed,
# not the raw 64-byte seed + public key form.
openssl genpkey -algorithm ed25519 -outform DER | xxd -p -c 256

# Or use the RedFlag key generation endpoint:
curl -X POST http://server/api/setup/generate-keys
```

### Step 2: Add the new key to the database

The new key must be inserted into `signing_keys` with `is_active = true` and `is_primary = false`. This can be done directly or via a future admin API:

```sql
-- Replace the placeholders with the new key's key_id and hex-encoded public key.
INSERT INTO signing_keys (id, key_id, public_key, algorithm, is_active, is_primary, version, created_at)
VALUES (gen_random_uuid(), '<new_key_id>', '<new_public_key_hex>', 'ed25519', true, false, 2, NOW());
```

### Step 3: Wait for agents to cache the new key

The `GET /api/v1/public-keys` endpoint returns all active keys. Agents will cache every active key via `FetchAndCacheAllActiveKeys()`, triggered on first encounter of an unknown `key_id`. The TTL is 24 hours by default. For proactive distribution, wait at least 24 hours or until all agents have checked in at least once since the new key was added to `signing_keys`.

### Step 4: Swap the primary key

```sql
-- Atomic primary swap
BEGIN;
UPDATE signing_keys SET is_primary = false WHERE is_primary = true;
UPDATE signing_keys SET is_primary = true WHERE key_id = '<new_key_id>';
COMMIT;
```

Or use the `SetPrimaryKey` query method from the admin tooling.

### Step 5: Update the server's REDFLAG_SIGNING_PRIVATE_KEY

Set the new private key in the environment and restart the server. On startup the server calls `InitializePrimaryKey()`, which calls `InsertSigningKey` (idempotent) and `SetPrimaryKey`.

### Step 6: Deprecate the old key after transition window

Once all agents have received at least one command signed with the new key and the transition window has closed:

```sql
UPDATE signing_keys
SET is_active = false, is_primary = false, deprecated_at = NOW()
WHERE key_id = '<old_key_id>';
```

Or use `DeprecateKey()`.


---

## Transition Window Behavior

During the transition window, both old and new keys are active in `signing_keys`. The agent behavior is:

1. **Agent receives command with new `key_id`:** Key not in memory or disk cache → calls `CheckKeyRotation()` → calls `FetchAndCacheAllActiveKeys()` → both keys cached → verifies with new key.
2. **Agent receives command with old `key_id`:** Key already in memory cache → verifies immediately.
3. **Agent receives command with no `key_id`:** Uses primary cached key (backward compat path).
4. **Proactive refresh (every 6h):** Agent re-fetches primary key from `GET /api/v1/public-key`, updates in-memory and disk cache.

---

## Known Remaining Limitations

1. **No admin API for key rotation.** The rotation procedure requires direct database access or a future admin endpoint. A `/api/v1/admin/keys` endpoint for listing, promoting, and deprecating keys should be added in a future sprint.
2. **`InsertSigningKey` uses version=1 hardcoded.** The `InitializePrimaryKey()` call in `main.go` always passes version=1. True version tracking requires deriving version from the current max version in the DB.
3. **No agent-side key expiry notification.** If a key is deprecated server-side while an agent still has commands in-flight using that key, those commands will fail verification. A grace period should be enforced server-side.
4. **Timestamp checking uses a 24-hour maxAge.** This is intentionally generous for the initial deployment to avoid rejecting commands from clocks with significant drift. It should be tightened to 10-15 minutes once clock synchronization is verified across all agents.
5. **`signing_keys` table `ON CONFLICT (key_id)` requires a unique constraint.** Migration 020 must have created this unique index. If not, `InsertSigningKey` will error on conflict rather than silently ignoring it.
6. **No key revocation mechanism.** There is no way to emergency-revoke a key before its deprecation. A revocation list or CRL-style endpoint should be considered for high-security deployments.