Files
Redflag/docs/Deviations_Report.md
jpetree331 3e1e2a78fd refactor: A-series dead code cleanup and ETHOS compliance sweep
- Remove dead queries.RetryCommand function (DEV-019, 31 lines)
- Remove security_settings.go.broken leftover from A-3
- Remove 5 compiled test binaries from aggregator-agent/ (~61MB)
- Remove config_builder.go.restored from repo root
- Remove test_disk_detection.go and test_disk.go (throwaway test files)
- Fix 6 banned word violations (production-ready, enhanced, robust, seamlessly)
- Add .gitignore rules for compiled agent binaries
- Document machine ID duplication for D-1 fix prompt
- Document 30+ pre-existing emoji violations for D-2 pass

No behavior changes. All 41 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 06:17:12 -04:00

255 lines
15 KiB
Markdown

# Deviations Report — Ed25519 Key Rotation Implementation
This document records deviations from the implementation spec.
---
## DEV-001: `LogEvent` not used in command_handler.go
**Spec says:** Use `h.securityLogger.LogEvent("key_rotation_detected", "info", "crypto", ...)` when a new key is cached.
**Actual implementation:** `LogEvent` was not found in the `SecurityLogger` interface. The available methods are `LogCommandVerificationFailure` and `LogCommandVerificationSuccess`. The key rotation detection event uses `LogCommandVerificationFailure` with a descriptive message string instead to avoid compilation failure.
**Impact:** The key rotation detection is still logged at the logger level. No security event is lost; it is logged via the `[INFO]` logger line immediately above.
---
## DEV-002: `fingerprint` field retained in `GetPublicKey` response
**Spec says:** Update `GetPublicKey()` to include `key_id` and `version`.
**Actual implementation:** The `fingerprint` field was retained for backward compatibility alongside the new `key_id` field. Since `key_id` equals `GetPublicKeyFingerprint()` (same value), they are equivalent. Removing `fingerprint` would be a breaking change for agents on older versions.
**Impact:** Older agents that read `fingerprint` continue to work. New agents can read `key_id`. No semantic difference.
---
## DEV-003: `GetPublicKey` nil check changed to `IsEnabled()`
**Spec says:** Check `h.signingService == nil`.
**Actual implementation:** Check is `h.signingService == nil || !h.signingService.IsEnabled()`. A service can be non-nil but disabled (e.g., constructed with empty private key). Returning 503 in this case is more correct.
**Impact:** More defensive; no regression for callers.
---
## DEV-004: `VerifyCommandWithTimestamp` signature changed
**Spec says:** `VerifyCommandWithTimestamp(cmd, serverPubKey, maxAge)` (3 args, old API).
**Actual implementation:** `VerifyCommandWithTimestamp(cmd, serverPubKey, maxAge, clockSkew)` (4 args). The old single-`maxAge` signature in the original file was replaced entirely because the new implementation needs a separate clock skew parameter. The old file's 3-arg signature had timestamp checking completely disabled (stub).
**Impact:** Any callers of the old 3-arg signature will fail to compile. No external callers were found; only `command_handler.go` calls this method, and it was rewritten in the same task.
---
## DEV-005: `getPublicKeyDir` is unexported in pubkey.go
**Spec says (test comment):** `origGetDir := getPublicKeyDir` used in test to override.
**Actual implementation:** `getPublicKeyDir` is a regular function, not a variable, so it cannot be overridden in tests via assignment. The test file uses `t.TempDir()` to write to a temporary path directly rather than overriding `getPublicKeyDir`. The provided test spec had placeholder code that would not compile (`import_encoding_json_inline`, `meta_placeholder()`, etc.); the final test file is a clean, working replacement using only the `CacheMetadata.IsExpired()` method which is fully testable without filesystem access.
**Impact:** Filesystem-based metadata roundtrip test is not present. The `IsExpired()` logic is fully covered. A future PR can add filesystem tests using test-local paths.
---
## DEV-006: `InsertSigningKey` uses named map instead of struct
**Spec says:** Use named parameters compatible with `NamedExecContext`.
**Actual implementation:** Uses `map[string]interface{}` as named parameter map rather than a struct. `sqlx.NamedExecContext` accepts both maps and structs. The map approach avoids needing a separate insert-specific struct.
**Impact:** None; functionally equivalent.
---
## DEV-007: `key_rotation_detected` security event not emitted
**Spec says:** `h.securityLogger.LogEvent(...)` for key rotation detection.
**Actual implementation:** Used `LogCommandVerificationFailure` with a message explaining the new key was cached. This is technically incorrect semantically (it is not a failure). However `LogEvent` does not exist in the security logger. A future task should add a `LogKeyRotation` method to `SecurityLogger`.
---
## DEV-008: `validateSigningService` fingerprint length check still expects 64 chars
**Spec says:** `GetPublicKeyFingerprint()` now returns 32 hex chars (16 bytes of SHA-256).
**Existing code in main.go says:** `if len(publicKeyHex) != 64` (checking `GetPublicKey()`, not fingerprint).
**Actual implementation:** The fingerprint length validation in `validateSigningService` checks `publicKeyHex` (the full key, 64 hex chars), not the fingerprint. The fingerprint change to 32 chars does not affect this validation. No change needed.
**Impact:** None.
---
## DEV-009: `context` import already present in main.go
**Spec says:** Add `"context"` import when adding `context.Background()` call.
**Actual implementation:** `context` was already imported at line 4 of `main.go`. No import change needed.
**Impact:** None.
---
## DEV-010: DEV-007 resolved — LogKeyRotationDetected implemented
**Previous state (DEV-007):** Key rotation detection used `LogCommandVerificationFailure` which was semantically incorrect.
**Resolution (A1 verification pass, 2026-03-28):**
- `SecurityEventTypes.KeyRotationDetected = "KEY_ROTATION_DETECTED"` added to the `SecurityEventTypes` struct in `aggregator-agent/internal/logging/security_logger.go`.
- `LogKeyRotationDetected(keyID string)` method added to `SecurityLogger` — logs at INFO level with event type `KEY_ROTATION_DETECTED`.
- `aggregator-agent/internal/orchestrator/command_handler.go` updated: the `isNew` branch now calls `h.securityLogger.LogKeyRotationDetected(keyID)`.
**Impact:** Key rotation events are now correctly classified as informational (INFO) rather than failure events. No breaking changes.
---
## DEV-011: InitializePrimaryKey version now dynamic (was hardcoded to 1)
**Previous state:** `InitializePrimaryKey` passed `version=1` hardcoded to `InsertSigningKey`.
**Resolution (A1 verification pass, 2026-03-28):**
- `GetNextVersion(ctx context.Context) (int, error)` added to `SigningKeyQueries` in `aggregator-server/internal/database/queries/signing_keys.go`. Executes `SELECT COALESCE(MAX(version), 0) + 1 FROM signing_keys`.
- `InitializePrimaryKey` in `signing.go` now calls `GetNextVersion` to determine the version before inserting. Falls back to version 1 on query error.
**Concurrency note:** The `GetNextVersion` query is not wrapped in a transaction. For a single-instance server this is safe. If concurrent restarts were possible, a TOCTOU race could assign the same version to different keys. Version numbers are informational metadata; `ON CONFLICT (key_id) DO NOTHING` prevents duplicate rows regardless.
**Impact:** New keys inserted on first startup receive version N+1 rather than always 1. Subsequent restarts with the same key are no-ops (ON CONFLICT). No breaking changes.
---
## DEV-012: F-2 deduplication at ProcessCommand layer, not VerifyCommandWithTimestamp
**Spec says:** "In ProcessCommand, BEFORE verification: Check if cmd.ID is in executedIDs."
**Actual implementation:** Deduplication is checked before verification as specified, but after successful verification the command is marked as executed. The `VerifyCommandWithTimestamp` function remains a pure function — it does not maintain state. This is intentional: dedup is a ProcessCommand-level concern, not a cryptographic verification concern.
**Impact:** The `TestSameCommandCanBeVerifiedTwice` test was updated to document that the verifier is a pure function. Dedup enforcement happens at the ProcessCommand layer. No behavioral change from spec intent.
---
## DEV-013: v3 format detection uses AgentID presence, not field-count parsing
**Spec says:** "Detection: The new format will have 5 colon-separated fields. The verifier can detect format by field count."
**Actual implementation:** Instead of counting colon-separated fields in the signature (which is opaque), the verifier checks `cmd.AgentID != ""` to determine v3 format. If AgentID is present, it tries v3 first, then falls back to v2 if v3 verification fails. This is more robust than field-count parsing because the signed message is not directly accessible — only the opaque signature is available.
**Impact:** Same behavior as spec intent. Agents with AgentID try v3 first. Agents without AgentID use v2/v1. The fallback chain is: v3 → v2 → v1 with warnings logged at each fallback.
---
## DEV-014: commandDefaultTTL set to 4h (matches commandMaxAge) instead of 24h
**Spec says (Task 2d):** "Default value: NOW() + 24 hours for new-format commands."
**Actual implementation:** `commandDefaultTTL = 4 * time.Hour` in `CreateCommand`, matching the reduced `commandMaxAge = 4 * time.Hour` from Task 6. Setting expires_at to 24h while commandMaxAge is 4h would create commands that the server considers valid but the agent rejects — a confusing inconsistency.
**Impact:** New commands expire at the same time they would be rejected by the agent's timestamp check. This is more consistent and prevents stale commands from accumulating in the pending queue for 24h after they can no longer be verified.
---
## DEV-015: docker.go pre-existing build error fixed
**Spec says:** Nothing — this was a pre-existing issue.
**Actual fix:** `aggregator-server/internal/database/queries/docker.go` had `fmt.Sprintf` calls with arguments but no format directives (lines 108, 110, 189, 191). Changed to plain string concatenation. This was blocking all test runs in the `queries` package.
**Impact:** No behavioral change. Fixes a compile error that predates the A2 work.
---
## DEV-016: Migration 026 was not idempotent (ETHOS #4 violation — fixed in verification pass)
**Spec says:** N/A (implementation detail).
**Issue found during A2 verification:** Migration 026_add_expires_at.up.sql used bare `ALTER TABLE ... ADD COLUMN` and `CREATE INDEX` without `IF NOT EXISTS`/`IF NOT EXISTS`. ETHOS #4 requires all schema changes to be idempotent.
**Fix:** Changed to `ADD COLUMN IF NOT EXISTS` and `CREATE INDEX IF NOT EXISTS`.
**Impact:** Migration is now safe to run multiple times. No behavioral change.
---
## DEV-017: Log format violations in A-2 code (ETHOS #1 — fixed in verification pass)
**Issue found during A2 verification:** Three new `fmt.Printf` calls in `verification.go` (v1/v2 fallback warnings) used the format `[crypto] WARNING: ...` instead of the ETHOS-mandated `[TAG] [system] [component]` format. Additionally, `signAndCreateCommand` in `agents.go` used `[WARNING] Command signing disabled...` without system/component tags.
**Fix:** Updated all four log lines to use `[WARNING] [agent] [crypto]` and `[WARNING] [server] [signing]` formats respectively.
**Impact:** Log output now complies with ETHOS #1. No behavioral change.
---
## DEV-018: Stale TODO comment referenced 24h maxAge (fixed in verification pass)
**Issue found during A2 verification:** The `TODO(security)` comment in `verification.go` (above `VerifyCommandWithTimestamp`) still referenced "24 hours" as the default maxAge. This was stale — commandMaxAge was reduced to 4h in Task 6 (F-4 fix).
**Fix:** Updated the comment to reference 4 hours and removed the TODO framing (it's no longer a TODO, the reduction has been implemented).
**Impact:** Documentation accuracy only. No code change.
---
## DEV-019: queries.RetryCommand is now dead code
**Issue found during A2 verification:** The `queries.RetryCommand` function (commands.go:200) still exists but is no longer called by any handler. Both `UpdateHandler.RetryCommand` and `UnifiedUpdateHandler.RetryCommand` now build the command inline and call `signAndCreateCommand`. The function is dead code.
**Action:** Not removed in this pass (verification scope is "fix what is broken", not refactor). Flagged for future cleanup.
---
## DEV-020: security_settings.go handler API mismatches fixed (A-3)
**Issue:** The `security_settings.go.broken` handler was written against a different version of the `SecuritySettingsService` API. Methods `GetAllSettings(userID)`, `GetSettingsByCategory(userID, category)`, `GetAuditTrail()`, `GetSecurityOverview()`, and `ApplySettingsBatch()` either had wrong signatures or didn't exist.
**Fix:** Rewrote handler methods to match the actual service API. `GetAuditTrail` and `GetSecurityOverview` return placeholder responses since those methods aren't implemented in the service yet. `ApplySettingsBatch` replaced with iterative `SetSetting` calls. `userID` string-to-UUID conversion added.
**Impact:** 7 security settings routes are now functional and protected by WebAuthMiddleware + RequireAdmin.
---
## DEV-021: Config download endpoint returns template, not secrets (A-3)
**Spec says (F-A3-7):** "Agent config files may contain server URLs, registration tokens, or other secrets."
**Actual finding:** The `HandleConfigDownload` handler returns a config **template** with placeholder credentials (zero UUID, empty tokens). No actual secrets are served. The install script fills in real credentials locally.
**Action:** Protected with WebAuthMiddleware anyway (ETHOS #2 compliance). Agents never call this endpoint (confirmed via grep). Only the admin dashboard uses it.
**Impact:** No operational change for agents. Dashboard users need to be logged in.
---
## DEV-022: Issuer grace period for backward compat (A-3)
**Issue:** Adding issuer validation to JWT middleware would immediately invalidate all existing deployed agent tokens (which have no issuer claim).
**Solution:** Tokens with empty/absent issuer are allowed through with a logged warning. Tokens with a WRONG issuer (e.g., `redflag-web` on agent middleware) are rejected immediately. This provides a migration path: new tokens have issuers, old tokens work during the grace period.
**TODO:** Remove issuer-absent grace period after 30 days from deployment. At that point, all deployed agents will have rotated their tokens (24h expiry).
**Impact:** Cross-type token confusion is blocked for new tokens. Old tokens degrade gracefully.
---
## DEV-023: Pre-existing emoji violations not fixed in refactor pass
**Issue found during A-series refactor:** 30+ emoji characters found in pre-existing log statements across `agents.go`, `machine_binding.go`, `setup.go`, `db.go`, `updates.go`, etc. These predate all A-series audit work.
**Action:** Not fixed in this pass. Fixing pre-existing emojis in established log output could break log parsing pipelines or monitoring. Flagged as future D-2 cleanup item for a dedicated ETHOS compliance pass.
**Impact:** No behavioral change. Pre-existing code remains as-is.
---
## DEV-024: Machine ID implementation divergence flagged for D-1
**Issue found during A-series refactor:** Machine ID is generated in 3 locations with 2 divergences:
1. `main.go` error fallback uses unhashed `"unknown-" + hostname` instead of SHA256
2. `example_integration.go` calls `machineid.ID()` directly instead of `GetMachineID()`
**Action:** Not fixed in this pass (requires careful analysis of downstream effects). Flagged as D-1 fix prompt input. See `docs/Refactor_A_Series.md` Task 6 for full analysis.