retry_count column and filter existed but counter was never incremented. Stuck commands always had retry_count=0 and always passed the WHERE retry_count < 5 filter, making the cap ineffective. Fix: Added RedeliverStuckCommandTx that sets retry_count = retry_count + 1 on stuck->sent re-delivery. GetCommands handler now uses MarkCommandSentTx for new commands (retry_count stays 0) and RedeliverStuckCommandTx for stuck re-delivery (retry_count increments). All 77 tests pass. DEV-029 resolved. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
18 KiB
Deviations Report — Ed25519 Key Rotation Implementation
This document records deviations from the implementation spec.
DEV-001: LogEvent not used in command_handler.go
Spec says: Use h.securityLogger.LogEvent("key_rotation_detected", "info", "crypto", ...) when a new key is cached.
Actual implementation: LogEvent was not found in the SecurityLogger interface. The available methods are LogCommandVerificationFailure and LogCommandVerificationSuccess. The key rotation detection event uses LogCommandVerificationFailure with a descriptive message string instead to avoid compilation failure.
Impact: The key rotation detection is still logged at the logger level. No security event is lost; it is logged via the [INFO] logger line immediately above.
DEV-002: fingerprint field retained in GetPublicKey response
Spec says: Update GetPublicKey() to include key_id and version.
Actual implementation: The fingerprint field was retained for backward compatibility alongside the new key_id field. Since key_id equals GetPublicKeyFingerprint() (same value), they are equivalent. Removing fingerprint would be a breaking change for agents on older versions.
Impact: Older agents that read fingerprint continue to work. New agents can read key_id. No semantic difference.
DEV-003: GetPublicKey nil check changed to IsEnabled()
Spec says: Check h.signingService == nil.
Actual implementation: Check is h.signingService == nil || !h.signingService.IsEnabled(). A service can be non-nil but disabled (e.g., constructed with empty private key). Returning 503 in this case is more correct.
Impact: More defensive; no regression for callers.
DEV-004: VerifyCommandWithTimestamp signature changed
Spec says: VerifyCommandWithTimestamp(cmd, serverPubKey, maxAge) (3 args, old API).
Actual implementation: VerifyCommandWithTimestamp(cmd, serverPubKey, maxAge, clockSkew) (4 args). The old single-maxAge signature in the original file was replaced entirely because the new implementation needs a separate clock skew parameter. The old file's 3-arg signature had timestamp checking completely disabled (stub).
Impact: Any callers of the old 3-arg signature will fail to compile. No external callers were found; only command_handler.go calls this method, and it was rewritten in the same task.
DEV-005: getPublicKeyDir is unexported in pubkey.go
Spec says (test comment): origGetDir := getPublicKeyDir used in test to override.
Actual implementation: getPublicKeyDir is a regular function, not a variable, so it cannot be overridden in tests via assignment. The test file uses t.TempDir() to write to a temporary path directly rather than overriding getPublicKeyDir. The provided test spec had placeholder code that would not compile (import_encoding_json_inline, meta_placeholder(), etc.); the final test file is a clean, working replacement using only the CacheMetadata.IsExpired() method which is fully testable without filesystem access.
Impact: Filesystem-based metadata roundtrip test is not present. The IsExpired() logic is fully covered. A future PR can add filesystem tests using test-local paths.
DEV-006: InsertSigningKey uses named map instead of struct
Spec says: Use named parameters compatible with NamedExecContext.
Actual implementation: Uses map[string]interface{} as named parameter map rather than a struct. sqlx.NamedExecContext accepts both maps and structs. The map approach avoids needing a separate insert-specific struct.
Impact: None; functionally equivalent.
DEV-007: key_rotation_detected security event not emitted
Spec says: h.securityLogger.LogEvent(...) for key rotation detection.
Actual implementation: Used LogCommandVerificationFailure with a message explaining the new key was cached. This is technically incorrect semantically (it is not a failure). However LogEvent does not exist in the security logger. A future task should add a LogKeyRotation method to SecurityLogger.
DEV-008: validateSigningService fingerprint length check still expects 64 chars
Spec says: GetPublicKeyFingerprint() now returns 32 hex chars (16 bytes of SHA-256).
Existing code in main.go says: if len(publicKeyHex) != 64 (checking GetPublicKey(), not fingerprint).
Actual implementation: The fingerprint length validation in validateSigningService checks publicKeyHex (the full key, 64 hex chars), not the fingerprint. The fingerprint change to 32 chars does not affect this validation. No change needed.
Impact: None.
DEV-009: context import already present in main.go
Spec says: Add "context" import when adding context.Background() call.
Actual implementation: context was already imported at line 4 of main.go. No import change needed.
Impact: None.
DEV-010: DEV-007 resolved — LogKeyRotationDetected implemented
Previous state (DEV-007): Key rotation detection used LogCommandVerificationFailure which was semantically incorrect.
Resolution (A1 verification pass, 2026-03-28):
SecurityEventTypes.KeyRotationDetected = "KEY_ROTATION_DETECTED"added to theSecurityEventTypesstruct inaggregator-agent/internal/logging/security_logger.go.LogKeyRotationDetected(keyID string)method added toSecurityLogger— logs at INFO level with event typeKEY_ROTATION_DETECTED.aggregator-agent/internal/orchestrator/command_handler.goupdated: theisNewbranch now callsh.securityLogger.LogKeyRotationDetected(keyID).
Impact: Key rotation events are now correctly classified as informational (INFO) rather than failure events. No breaking changes.
DEV-011: InitializePrimaryKey version now dynamic (was hardcoded to 1)
Previous state: InitializePrimaryKey passed version=1 hardcoded to InsertSigningKey.
Resolution (A1 verification pass, 2026-03-28):
GetNextVersion(ctx context.Context) (int, error)added toSigningKeyQueriesinaggregator-server/internal/database/queries/signing_keys.go. ExecutesSELECT COALESCE(MAX(version), 0) + 1 FROM signing_keys.InitializePrimaryKeyinsigning.gonow callsGetNextVersionto determine the version before inserting. Falls back to version 1 on query error.
Concurrency note: The GetNextVersion query is not wrapped in a transaction. For a single-instance server this is safe. If concurrent restarts were possible, a TOCTOU race could assign the same version to different keys. Version numbers are informational metadata; ON CONFLICT (key_id) DO NOTHING prevents duplicate rows regardless.
Impact: New keys inserted on first startup receive version N+1 rather than always 1. Subsequent restarts with the same key are no-ops (ON CONFLICT). No breaking changes.
DEV-012: F-2 deduplication at ProcessCommand layer, not VerifyCommandWithTimestamp
Spec says: "In ProcessCommand, BEFORE verification: Check if cmd.ID is in executedIDs."
Actual implementation: Deduplication is checked before verification as specified, but after successful verification the command is marked as executed. The VerifyCommandWithTimestamp function remains a pure function — it does not maintain state. This is intentional: dedup is a ProcessCommand-level concern, not a cryptographic verification concern.
Impact: The TestSameCommandCanBeVerifiedTwice test was updated to document that the verifier is a pure function. Dedup enforcement happens at the ProcessCommand layer. No behavioral change from spec intent.
DEV-013: v3 format detection uses AgentID presence, not field-count parsing
Spec says: "Detection: The new format will have 5 colon-separated fields. The verifier can detect format by field count."
Actual implementation: Instead of counting colon-separated fields in the signature (which is opaque), the verifier checks cmd.AgentID != "" to determine v3 format. If AgentID is present, it tries v3 first, then falls back to v2 if v3 verification fails. This is more robust than field-count parsing because the signed message is not directly accessible — only the opaque signature is available.
Impact: Same behavior as spec intent. Agents with AgentID try v3 first. Agents without AgentID use v2/v1. The fallback chain is: v3 → v2 → v1 with warnings logged at each fallback.
DEV-014: commandDefaultTTL set to 4h (matches commandMaxAge) instead of 24h
Spec says (Task 2d): "Default value: NOW() + 24 hours for new-format commands."
Actual implementation: commandDefaultTTL = 4 * time.Hour in CreateCommand, matching the reduced commandMaxAge = 4 * time.Hour from Task 6. Setting expires_at to 24h while commandMaxAge is 4h would create commands that the server considers valid but the agent rejects — a confusing inconsistency.
Impact: New commands expire at the same time they would be rejected by the agent's timestamp check. This is more consistent and prevents stale commands from accumulating in the pending queue for 24h after they can no longer be verified.
DEV-015: docker.go pre-existing build error fixed
Spec says: Nothing — this was a pre-existing issue.
Actual fix: aggregator-server/internal/database/queries/docker.go had fmt.Sprintf calls with arguments but no format directives (lines 108, 110, 189, 191). Changed to plain string concatenation. This was blocking all test runs in the queries package.
Impact: No behavioral change. Fixes a compile error that predates the A2 work.
DEV-016: Migration 026 was not idempotent (ETHOS #4 violation — fixed in verification pass)
Spec says: N/A (implementation detail).
Issue found during A2 verification: Migration 026_add_expires_at.up.sql used bare ALTER TABLE ... ADD COLUMN and CREATE INDEX without IF NOT EXISTS/IF NOT EXISTS. ETHOS #4 requires all schema changes to be idempotent.
Fix: Changed to ADD COLUMN IF NOT EXISTS and CREATE INDEX IF NOT EXISTS.
Impact: Migration is now safe to run multiple times. No behavioral change.
DEV-017: Log format violations in A-2 code (ETHOS #1 — fixed in verification pass)
Issue found during A2 verification: Three new fmt.Printf calls in verification.go (v1/v2 fallback warnings) used the format [crypto] WARNING: ... instead of the ETHOS-mandated [TAG] [system] [component] format. Additionally, signAndCreateCommand in agents.go used [WARNING] Command signing disabled... without system/component tags.
Fix: Updated all four log lines to use [WARNING] [agent] [crypto] and [WARNING] [server] [signing] formats respectively.
Impact: Log output now complies with ETHOS #1. No behavioral change.
DEV-018: Stale TODO comment referenced 24h maxAge (fixed in verification pass)
Issue found during A2 verification: The TODO(security) comment in verification.go (above VerifyCommandWithTimestamp) still referenced "24 hours" as the default maxAge. This was stale — commandMaxAge was reduced to 4h in Task 6 (F-4 fix).
Fix: Updated the comment to reference 4 hours and removed the TODO framing (it's no longer a TODO, the reduction has been implemented).
Impact: Documentation accuracy only. No code change.
DEV-019: queries.RetryCommand is now dead code
Issue found during A2 verification: The queries.RetryCommand function (commands.go:200) still exists but is no longer called by any handler. Both UpdateHandler.RetryCommand and UnifiedUpdateHandler.RetryCommand now build the command inline and call signAndCreateCommand. The function is dead code.
Action: Not removed in this pass (verification scope is "fix what is broken", not refactor). Flagged for future cleanup.
DEV-020: security_settings.go handler API mismatches fixed (A-3)
Issue: The security_settings.go.broken handler was written against a different version of the SecuritySettingsService API. Methods GetAllSettings(userID), GetSettingsByCategory(userID, category), GetAuditTrail(), GetSecurityOverview(), and ApplySettingsBatch() either had wrong signatures or didn't exist.
Fix: Rewrote handler methods to match the actual service API. GetAuditTrail and GetSecurityOverview return placeholder responses since those methods aren't implemented in the service yet. ApplySettingsBatch replaced with iterative SetSetting calls. userID string-to-UUID conversion added.
Impact: 7 security settings routes are now functional and protected by WebAuthMiddleware + RequireAdmin.
DEV-021: Config download endpoint returns template, not secrets (A-3)
Spec says (F-A3-7): "Agent config files may contain server URLs, registration tokens, or other secrets."
Actual finding: The HandleConfigDownload handler returns a config template with placeholder credentials (zero UUID, empty tokens). No actual secrets are served. The install script fills in real credentials locally.
Action: Protected with WebAuthMiddleware anyway (ETHOS #2 compliance). Agents never call this endpoint (confirmed via grep). Only the admin dashboard uses it.
Impact: No operational change for agents. Dashboard users need to be logged in.
DEV-022: Issuer grace period for backward compat (A-3)
Issue: Adding issuer validation to JWT middleware would immediately invalidate all existing deployed agent tokens (which have no issuer claim).
Solution: Tokens with empty/absent issuer are allowed through with a logged warning. Tokens with a WRONG issuer (e.g., redflag-web on agent middleware) are rejected immediately. This provides a migration path: new tokens have issuers, old tokens work during the grace period.
TODO: Remove issuer-absent grace period after 30 days from deployment. At that point, all deployed agents will have rotated their tokens (24h expiry).
Impact: Cross-type token confusion is blocked for new tokens. Old tokens degrade gracefully.
DEV-023: Pre-existing emoji violations not fixed in refactor pass
Issue found during A-series refactor: 30+ emoji characters found in pre-existing log statements across agents.go, machine_binding.go, setup.go, db.go, updates.go, etc. These predate all A-series audit work.
Action: Not fixed in this pass. Fixing pre-existing emojis in established log output could break log parsing pipelines or monitoring. Flagged as future D-2 cleanup item for a dedicated ETHOS compliance pass.
Impact: No behavioral change. Pre-existing code remains as-is.
DEV-024: Machine ID implementation divergence flagged for D-1
Issue found during A-series refactor: Machine ID is generated in 3 locations with 2 divergences:
main.goerror fallback uses unhashed"unknown-" + hostnameinstead of SHA256example_integration.gocallsmachineid.ID()directly instead ofGetMachineID()
Action: Not fixed in this pass (requires careful analysis of downstream effects). Flagged as D-1 fix prompt input. See docs/Refactor_A_Series.md Task 6 for full analysis.
DEV-025: Background token cleanup without context cancellation (B-1)
Issue: The background token cleanup goroutine in main.go uses a simple time.NewTicker loop without context.Context cancellation. main.go doesn't have a server-level context, so clean shutdown is handled by the deferred shutdown block and process signals.
Impact: On server shutdown, the goroutine is killed by the OS. No data loss risk — CleanupExpiredTokens is a DELETE query that's atomic. The ticker approach is consistent with the existing offline-agent-check goroutine (same pattern, same file).
DEV-026: Migration 024 fix uses Option B — existing columns (B-1)
Issue: Migration 024 referenced a deprecated column that doesn't exist. Two options: (A) add the column in a new migration, (B) use existing enabled/auto_run columns.
Decision: Option B. The migration's intent is to disable the legacy updates subsystem. SET enabled = false, auto_run = false achieves this using columns already in the schema (migration 015). Adding an unused deprecated column would be unnecessary complexity.
DEV-027: Duplicate migration numbers resolved with suffix letters (B-1)
Issue: Migrations 009 and 012 each had two files with the same numeric prefix.
Decision: Renamed second files to 009b and 012b. The runner sorts lexically (sort.Strings), so 009b_add_retry_tracking.up.sql sorts after 009_add_agent_version_tracking.up.sql correctly. This preserves the original execution order and doesn't require a full renumber.
DEV-028: Migration 018 renumbered to 027 (B-1)
Issue: 018_create_scanner_config_table.sql had wrong file suffix (.sql not .up.sql) AND shared number 018 with 018_create_metrics_and_docker_tables.up.sql.
Decision: Renumbered to 027. The scanner_config table has never been created by the runner (it was skipped due to wrong suffix), so existing databases don't have it. Number 027 is after all existing migrations, ensuring it runs last in the sequence.
DEV-029: retry_count column exists but is never incremented (B-2 verification)
Issue found during B-2 verification: Migration 029 adds retry_count to agent_commands and GetStuckCommands filters retry_count < 5. However, MarkCommandSentTx does not increment retry_count when re-delivering stuck commands. The column exists and the filter is in the query, but the counter stays at 0 forever, making the filter ineffective.
Impact: LOW — stuck commands are not capped at 5 retries as intended. They continue to be re-delivered indefinitely (pre-fix behavior). The fix is structurally correct (column, index, filter all in place) but the increment step was missed.
Resolution: Added RedeliverStuckCommandTx function that sets retry_count = retry_count + 1 on re-delivery. The GetCommands handler now uses MarkCommandSentTx for new pending commands (retry_count stays 0) and RedeliverStuckCommandTx for stuck command re-delivery (retry_count increments). The retry_count < 5 filter is now effective.