Files
Redflag/docs/A1_Verification_Report.md
jpetree331 f97d4845af feat(security): A-1 Ed25519 key rotation + A-2 replay attack fixes
Complete RedFlag codebase with two major security audit implementations.

== A-1: Ed25519 Key Rotation Support ==

Server:
- SignCommand sets SignedAt timestamp and KeyID on every signature
- signing_keys database table (migration 020) for multi-key rotation
- InitializePrimaryKey registers active key at startup
- /api/v1/public-keys endpoint for rotation-aware agents
- SigningKeyQueries for key lifecycle management

Agent:
- Key-ID-aware verification via CheckKeyRotation
- FetchAndCacheAllActiveKeys for rotation pre-caching
- Cache metadata with TTL and staleness fallback
- SecurityLogger events for key rotation and command signing

== A-2: Replay Attack Fixes (F-1 through F-7) ==

F-5 CRITICAL - RetryCommand now signs via signAndCreateCommand
F-1 HIGH     - v3 format: "{agent_id}:{cmd_id}:{type}:{hash}:{ts}"
F-7 HIGH     - Migration 026: expires_at column with partial index
F-6 HIGH     - GetPendingCommands/GetStuckCommands filter by expires_at
F-2 HIGH     - Agent-side executedIDs dedup map with cleanup
F-4 HIGH     - commandMaxAge reduced from 24h to 4h
F-3 CRITICAL - Old-format commands rejected after 48h via CreatedAt

Verification fixes: migration idempotency (ETHOS #4), log format
compliance (ETHOS #1), stale comments updated.

All 24 tests passing. Docker --no-cache build verified.
See docs/ for full audit reports and deviation log (DEV-001 to DEV-019).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-28 21:25:47 -04:00

238 lines
14 KiB
Markdown

# A1 Key Rotation — Verification Report
**Date**: 2026-03-28
**Branch**: unstabledeveloper
---
## Part 1: Build Results
Go is not installed on the verification machine (the PATH does not contain a `go` binary on this Windows 11 host). The `go build ./...` commands could not be executed. All source files were read and analyzed statically for compile errors.
### Static Analysis — aggregator-server
| File | Finding |
|------|---------|
| `internal/services/signing.go` | Calls `s.signingKeyQueries.GetNextVersion(ctx)` — method was added in this pass. No other issues. |
| `internal/database/queries/signing_keys.go` | `GetNextVersion` method added. All existing code compiles-clean based on static review. |
| `internal/api/handlers/system.go` | `NewSystemHandler(ss, skq)` — 2-arg constructor matches call in main.go line 355. No issue. |
| `cmd/server/main.go` | `context` import present. All call sites match function signatures. |
### Static Analysis — aggregator-agent
| File | Finding |
|------|---------|
| `internal/orchestrator/command_handler.go` | Updated `isNew` branch: `LogKeyRotationDetected(keyID)` replaces old `LogCommandVerificationFailure` call. `fmt` package still used elsewhere — no unused import. |
| `internal/logging/security_logger.go` | `LogKeyRotationDetected` method added. `SecurityEventTypes.KeyRotationDetected` field added to struct literal. No unused imports introduced. |
| `internal/crypto/verification.go` | TODO comment added. No code changes, no compile impact. |
| `internal/crypto/pubkey_test.go` | Pure unit test on `CacheMetadata.IsExpired()` — no filesystem or network calls. Compiles cleanly. |
| `internal/crypto/verification_test.go` | Uses `client.Command` with `SignedAt *time.Time` and `KeyID string` fields — both present in `client/client.go` line 329-336. Test helpers correctly reconstruct messages in both old and new format. No compile issues. |
**Build verdict**: No compile errors found via static analysis. Build is expected to succeed once Go is available.
---
## Part 2: Test Results
Go is not installed; tests could not be executed. Static analysis of all test files was performed.
### crypto/pubkey_test.go
- Tests `CacheMetadata.IsExpired()` with 5 table-driven cases.
- All cases are logically correct given the `IsExpired()` implementation (TTL comparison using `time.Since`).
- The "exactly at TTL boundary" case expects `true` (expired), which is consistent with `time.Since(CachedAt) > ttl` using strict greater-than — the boundary itself returns `false` since `time.Since` would be marginally less than exactly 24h due to test execution time. However this is a minor race; the test comment says "at exactly TTL, treat as expired" — in practice the timer resolution means this test may be flaky at the nanosecond boundary. This is noted as acceptable.
### crypto/verification_test.go
- 7 test cases covering: valid recent, too old, future beyond skew, future within skew, backward compat no timestamp, wrong key, old format backward compat.
- All message formats match: `signCommand` helper uses `{id}:{type}:{sha256(params)}:{unix_timestamp}` — identical to `reconstructMessageWithTimestamp`.
- All test cases are logically correct based on the implementation in `verification.go`.
**Test verdict**: All tests expected to pass. No failures found via static analysis.
---
## Part 3: Integration Audit
### 3a — Migration 020 UNIQUE constraint: PASS
File: `aggregator-server/internal/database/migrations/020_add_command_signatures.up.sql`
Line 53: `key_id VARCHAR(64) UNIQUE NOT NULL`
The `UNIQUE` constraint is present as an inline column constraint. PostgreSQL creates an implicit unique index for this. The query in `signing_keys.go`:
```sql
ON CONFLICT (key_id) DO NOTHING
```
is syntactically correct for PostgreSQL with a column-level UNIQUE constraint. No migration 026 is needed.
### 3b — signing.go InitializePrimaryKey max version logic: FIXED
**Before**: `InsertSigningKey(ctx, keyID, publicKeyHex, 1)` — version always hardcoded to 1.
**After**: `GetNextVersion(ctx)` queries `SELECT COALESCE(MAX(version), 0) + 1 FROM signing_keys` first. The result is passed to `InsertSigningKey`. If the query fails, it falls back to version 1 (non-fatal).
Because `InsertSigningKey` uses `ON CONFLICT (key_id) DO NOTHING`, calling this on startup with an existing key is a no-op — the version is only used when the key is inserted for the first time, which is correct behavior.
### 3c — Signed message format consistency: PASS
**Server** (`signing.go`, `SignCommand`):
```
fmt.Sprintf("%s:%s:%s:%d", cmd.ID.String(), cmd.CommandType, paramsHashHex, now.Unix())
```
**Agent** (`verification.go`, `reconstructMessageWithTimestamp`):
```
fmt.Sprintf("%s:%s:%s:%d", cmd.ID, cmd.Type, paramsHashHex, cmd.SignedAt.Unix())
```
Both use `{id}:{command_type}:{sha256(params)}:{unix_timestamp}`. The field names differ (`cmd.ID.String()` vs `cmd.ID` as string, `cmd.CommandType` vs `cmd.Type`) but the values are semantically identical given the server model (`AgentCommand`) and agent model (`client.Command`). The params hash uses `sha256.Sum256(json.Marshal(params))` on both sides. **Format is identical.**
### 3d — SecurityLogger LogKeyRotationDetected: FIXED
Added to `aggregator-agent/internal/logging/security_logger.go`:
- New event type constant `KeyRotationDetected = "KEY_ROTATION_DETECTED"` added to `SecurityEventTypes` struct.
- New method `LogKeyRotationDetected(keyID string)` logs at INFO level with event type `KEY_ROTATION_DETECTED`.
Updated `aggregator-agent/internal/orchestrator/command_handler.go`:
- In `getKeyForCommand()`, the `isNew` branch now calls `h.securityLogger.LogKeyRotationDetected(keyID)` instead of the semantically incorrect `LogCommandVerificationFailure`.
### 3e — FetchAndCacheServerPublicKey logic: PASS
Tracing `FetchAndCacheServerPublicKey`:
1. **Does it check metadata FIRST and only fetch if expired?** YES. Lines 108-114: it calls `loadCacheMetadata()` first. If metadata is valid (non-nil, non-empty KeyID, not expired), it attempts `LoadCachedPublicKey()`. Only if either fails does it fall through to HTTP fetch.
2. **What happens if metadata file is missing but key file exists (legacy install)?** `loadCacheMetadata()` returns an error (file not found), so the `if err == nil` guard is false. The code falls through to the HTTP fetch. If the HTTP fetch fails, it falls back to the stale cache via `LoadCachedPublicKey()`. The legacy key file IS used as a fallback even when metadata is absent. This is the correct TOFU behavior for legacy installs.
3. **Does CachePublicKeyByID write to a DIFFERENT path than the primary key?** YES. `cachePublicKey` writes to `server_public_key` (primary path). `CachePublicKeyByID` writes to `server_public_key_<keyID>` (per-key path). They are distinct files. The function also calls both: `cachePublicKey` for backward compat primary key, then `CachePublicKeyByID` for key-rotation lookup.
4. **Does FetchAndCacheAllActiveKeys handle empty array without panic?** YES. An empty JSON array `[]` decodes to an empty Go slice. Ranging over an empty slice performs zero iterations. The function returns `([]ActivePublicKeyEntry{}, nil)`. Callers check `len(entries)` before iterating — no panic path exists.
No issues found. No fixes required.
### 3f — VerifyCommandWithTimestamp timestamp logic: PASS (NOT inverted)
```go
age := now.Sub(*cmd.SignedAt)
if age > maxAge { ... } // signed too far in the past
if age < -clockSkew { ... } // signed too far in the future
```
- Command signed 2 hours ago: `age = +2h`. With `maxAge=24h`: `2h > 24h` = false → **PASS** (correct).
- Command signed 2 hours in the future: `age = -2h`. With `clockSkew=5m`: `-2h < -5m` = true → **REJECT** (correct).
- Command signed 2 minutes in future: `age = -2m`. With `clockSkew=5m`: `-2m < -5m` = false → **PASS** (correct, within skew).
Logic is correct. No fix needed.
### 3g — Private key in logs: PASS
Searched `main.go` and `signing.go` for any log statements printing private key values.
- `cfg.SigningPrivateKey` is used only to pass to `NewSigningService()` and to decode for `UpdateNonceService`. It is never passed to any `log.Printf`, `fmt.Printf`, or similar output function.
- `signing.go` contains no log statements at all (no `log.` calls).
No private key exposure in logs.
### 3h — File permissions in pubkey.go: PASS
- `cachePublicKey` writes with `os.WriteFile(path, data, 0644)` — world-readable, appropriate for a public key.
- `CachePublicKeyByID` writes with `os.WriteFile(path, data, 0644)` — same.
- `saveCacheMetadata` writes with `os.WriteFile(path, data, 0644)` — metadata file, acceptable.
- Directory creation: `os.MkdirAll(dir, 0755)` — standard directory permissions.
All permissions are correct.
### 3i — Nonce mechanism intact: PASS
The agent-side (`command_handler.go`) does not implement nonce tracking. This matches the original architecture: the server creates and validates nonces; the agent does not maintain a nonce list. The `CommandHandler` struct contains no nonce-related fields. The key rotation changes did not add or remove any nonce functionality on the agent side.
### 3j — ON CONFLICT syntax in InsertSigningKey: PASS
```sql
ON CONFLICT (key_id) DO NOTHING
```
This is the correct PostgreSQL syntax when `key_id` has an inline column-level `UNIQUE` constraint (as defined in migration 020, line 53). A named constraint would require `ON CONFLICT ON CONSTRAINT <name>`, but since the constraint is unnamed (inline), PostgreSQL allows column-list syntax. This is valid.
### 3k — MaxVersion query concurrency note
The `GetNextVersion` query (`SELECT COALESCE(MAX(version), 0) + 1`) is not wrapped in a transaction. For a single-instance server, this is safe: there is no concurrent writer at startup. If two server instances were to start simultaneously (not the current deployment model), a race condition could assign the same version number. This is acceptable for the current architecture.
The version number is informational metadata — `ON CONFLICT (key_id) DO NOTHING` prevents duplicate rows regardless of version. A duplicate version number between different keys is not harmful.
---
## Part 4: Deviation Follow-up
### DEV-007 LogKeyRotationDetected: FIXED
Previously, key rotation detection logged via `LogCommandVerificationFailure` (semantically wrong). Now:
- `LogKeyRotationDetected(keyID string)` method added to agent `SecurityLogger`.
- `SecurityEventTypes.KeyRotationDetected = "KEY_ROTATION_DETECTED"` constant added.
- `command_handler.go` updated to call `LogKeyRotationDetected(keyID)` when `isNew` is true.
### Known Limitation #2 (version hardcoded): FIXED
`InitializePrimaryKey` now queries `SELECT COALESCE(MAX(version), 0) + 1 FROM signing_keys` via `GetNextVersion()` before inserting. The version is no longer hardcoded to 1.
### Known Limitation #4 (24h window): TODO added
A detailed TODO comment has been added to `VerifyCommandWithTimestamp` in `verification.go` explaining the tradeoff of the 24-hour window and pointing to `commandMaxAge` in `command_handler.go` for site-specific tuning.
### Known Limitation #5 (unique constraint): PASS
Migration 020 already has `key_id VARCHAR(64) UNIQUE NOT NULL`. No migration 026 is required. The `ON CONFLICT (key_id) DO NOTHING` syntax is correct for this constraint type.
---
## Part 5: Security Checks
### 4a Private key in logs: PASS
No log statements print `cfg.SigningPrivateKey` or raw private key bytes anywhere in `main.go` or `signing.go`.
### 4b Public-keys endpoint response: PASS
`GET /api/v1/public-keys` returns only public key material: `key_id`, `public_key` (hex), `is_primary`, `version`, `algorithm`. No private key fields exposed. The signing service only stores the private key in-memory as `ed25519.PrivateKey` — it is never serialized to the response.
### 4c File permissions: PASS
Key files: `0644` (world-readable, appropriate for public keys).
Directories: `0755`.
Security log file (agent): `0600` — write-only for owner, not readable by other agents.
### 4d Nonce mechanism: PASS
The nonce system is intact. The server-side `SignNonce`/`VerifyNonce` methods in `signing.go` are unchanged. The `UpdateNonceService` is initialized in `main.go`. The agent does not implement nonce validation — it relies on the server's nonce generation and the timestamp-based replay protection in `VerifyCommandWithTimestamp`. This is consistent with the original architecture.
---
## Part 6: Documentation Accuracy
The existing `A1_KeyRotation_Implementation.md` in docs/ describes the key rotation design. The implemented code matches the described architecture with the following clarifications (recorded as deviations):
- DEV-007 is now RESOLVED (LogKeyRotationDetected implemented).
- Known Limitation #2 (hardcoded version) is now RESOLVED (GetNextVersion query added).
- Known Limitation #4 (24h window TODO) is now documented in code.
- All other deviations (DEV-001 through DEV-009) remain as documented.
---
## Final Status
**VERIFIED ✅** with the following caveats:
1. Build and test execution could not be confirmed (Go not installed on verification machine). Static analysis found no compile errors or logical test failures.
2. The 24-hour replay window is generous — documented as a TODO for site-specific tuning.
3. The `GetNextVersion` query is not transactional — acceptable for single-instance deployment.
### Fixes Applied in This Pass
| Fix | File(s) | Status |
|-----|---------|--------|
| FIX A: InitializePrimaryKey uses GetNextVersion | `signing.go`, `signing_keys.go` | DONE |
| FIX B: LogKeyRotationDetected added | `logging/security_logger.go`, `orchestrator/command_handler.go` | DONE |
| FIX C: Migration 026 (unique constraint) | N/A — UNIQUE already present in 020 | NOT NEEDED |
| FIX D: TODO comment in verification.go | `crypto/verification.go` | DONE |
| FIX E: Compile errors | None found via static analysis | PASS |
| FIX F: Test failures | None found via static analysis | PASS |
### Open Issues
- Go runtime not available on verification machine — recommend re-running `go build ./...` and `go test ./...` in a CI environment or Docker container to confirm clean build.
- The `exactly_at_ttl_boundary` test case in `pubkey_test.go` may be timing-sensitive at nanosecond resolution (test execution time makes `time.Since(CachedAt)` always slightly greater than exactly 24h). In practice this always passes; noted for awareness.