# Vision vs Reality: RedFlag Deviation Report

**Date:** 2026-03-29
**Branch:** culurien (post-integration verification)
**Baseline:** v0.1.27 (last Fimeg commit before culurien work)
**Author:** Claude (automated analysis based on complete codebase and historical documentation)

---

## Section 1: Executive Summary

RedFlag began as "Aggregator" — Fimeg's vision for a self-hosted ConnectWise alternative that would give homelabbers centralized update management with "single pane of glass" visibility across Windows, Linux, and macOS. The Starting Prompt described an ambitious platform with AI-assisted scheduling, maintenance windows, natural language queries, and cross-platform agents covering APT, DNF, AUR, Snap, Flatpak, Winget, Windows Update, Docker, and Homebrew.

What exists today is a narrower but more deeply engineered system than the original breadth implied. The core architecture — pull-based agents, Ed25519 signed commands, machine ID binding, three-tier token authentication — is not just present but has been hardened well beyond the original specification. The culurien branch added transaction safety, replay attack protection, key rotation, configurable timeouts, path traversal defense, semver-aware version comparison, and grew the test suite from approximately 3 test files to 170 passing tests across 18 packages.

The honest gap: approximately 40% of the originally envisioned feature surface was never built. macOS support, AI features, maintenance windows, scheduled rollouts, structured JSON logging, LDAP/SSO, compliance reporting, the CLI tool — none exist. The features that do exist (package scanning, command execution, agent self-upgrade, installer infrastructure) work correctly and are production-quality for the homelab use case.

Is it production-ready for Fimeg's homelab? Yes, with caveats. The system can install agents on Linux and Windows, scan for package updates, approve and execute updates from the dashboard, and self-upgrade agents, all with cryptographic verification. The caveats are: the setup flow has known friction and has not been verified end-to-end (P0-005); the `/api/v1/info` endpoint returns build-time values that default to "dev" unless they are injected via ldflags; one P0 item (P0-009, the storage scanner) is still open; and several P0 items from the original backlog have been fixed in code but not yet verified end-to-end in a live deployment (see Section 6).

The "scare ConnectWise" ambition requires honest framing. RedFlag has three architectural advantages ConnectWise cannot replicate: hardware-bound machine IDs, mandatory self-hosting, and code transparency. But ConnectWise has remote control, 100+ integrations, SOC2 certification, and a decade of enterprise polish. RedFlag is competitive for the self-hosted update management niche — the subset of ConnectWise functionality that homelabbers and small IT teams actually use. It is not, and should not try to become, a full RMM platform. The strategic path is: own the update management vertical completely, then expand.

---

## Section 2: What Was Built As Planned

### 2.1 Pull-Based Agent Architecture

**Planned (Starting Prompt):** Agent polls server every 5 minutes via `GET /agents/{id}/commands`. Server never initiates connections to agents.

**Built:** Exactly as planned. Agent polls via `GET /agents/:id/commands` with configurable interval (default 300 seconds). Proportional jitter added (B-2 fix: `maxJitter = pollingInterval/2`). Server has no outbound connection capability.

**Location:** `aggregator-agent/cmd/agent/main.go:960-1070` (poll loop), `aggregator-server/internal/api/handlers/agents.go` (GetCommands handler)

**Status:** Working. Enhanced beyond spec with jitter and exponential backoff on failures.
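
As a rough illustration of the proportional jitter described above, the sketch below derives the next poll delay from the configured interval. Only the bound `maxJitter = pollingInterval/2` comes from this report; the function name `nextPollDelay` is hypothetical, and adding the jitter on top of the interval (rather than spreading it around it) is an assumption.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// nextPollDelay returns the base interval plus a random jitter drawn from
// [0, interval/2). Adding the jitter on top of the interval is an assumption;
// only the bound maxJitter = pollingInterval/2 is taken from the report.
func nextPollDelay(interval time.Duration) time.Duration {
	maxJitter := interval / 2
	if maxJitter <= 0 {
		return interval // rand.Int63n panics when its argument is <= 0
	}
	return interval + time.Duration(rand.Int63n(int64(maxJitter)))
}

func main() {
	interval := 300 * time.Second // default polling interval from the report
	for i := 0; i < 3; i++ {
		fmt.Println(nextPollDelay(interval))
	}
}
```

Because the bound scales with the interval, agents configured with longer intervals spread their check-ins over a proportionally wider window.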

---

### 2.2 Ed25519 Command Signing

**Planned (Security.md):** All commands in DB must be Ed25519-signed before being sent to agents. `signAndCreateCommand()` implemented in handlers.

**Built:** Exactly as planned, then enhanced. v3 signed message format includes `agent_id:cmd_id:type:sha256(params):timestamp`. Key ID and SignedAt tracked per command (migration 025). 29+ call sites use `signAndCreateCommand()`.

**Location:** `aggregator-server/internal/services/signing.go`, `aggregator-server/internal/api/handlers/agents.go:49-77`

**Status:** Working. Exceeds original spec.
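
To make the v3 message layout concrete, here is a minimal sketch of building and signing it with Go's standard `crypto/ed25519`. The field order follows the format quoted above; hex-encoding the parameter hash, using Unix seconds for the timestamp, and the helper names `buildV3Message` and `signedForAccepts` are assumptions, not the contents of `signing.go`.

```go
package main

import (
	"crypto/ed25519"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
	"strings"
)

// buildV3Message assembles agent_id:cmd_id:type:sha256(params):timestamp.
// Hex encoding of the hash and a Unix-seconds timestamp are assumptions.
func buildV3Message(agentID, cmdID, cmdType string, params []byte, unixTS int64) []byte {
	sum := sha256.Sum256(params)
	fields := []string{agentID, cmdID, cmdType, hex.EncodeToString(sum[:]), strconv.FormatInt(unixTS, 10)}
	return []byte(strings.Join(fields, ":"))
}

// signedForAccepts is a hypothetical helper: it signs a command for signAgent
// with a throwaway key and reports whether verifyAgent, rebuilding the message
// with its own ID, accepts that signature.
func signedForAccepts(signAgent, verifyAgent string) bool {
	pub, priv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand.Reader
	if err != nil {
		return false
	}
	params := []byte(`{"package":"openssl"}`)
	sig := ed25519.Sign(priv, buildV3Message(signAgent, "cmd-1", "install_update", params, 1700000000))
	return ed25519.Verify(pub, buildV3Message(verifyAgent, "cmd-1", "install_update", params, 1700000000), sig)
}

func main() {
	fmt.Println(signedForAccepts("agent-a", "agent-a")) // true: intended recipient
	fmt.Println(signedForAccepts("agent-a", "agent-b")) // false: forwarded to another agent
}
```

Since the agent ID is part of the signed bytes, a signature issued for one agent does not verify when another agent reconstructs the message with its own ID.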

---

### 2.3 Machine ID Binding

**Planned (Security.md section 3.1):** `MachineBindingMiddleware` validates `X-Machine-ID` header against `agents.machine_id`. Mismatch = 403.

**Built:** Implemented. Machine ID is SHA256 hash of hardware identifiers (D-1 fix made this canonical across all platforms). Middleware validates on authenticated agent routes. Admin rebind endpoint added for recovery.

**Location:** `aggregator-server/internal/api/middleware/machine_binding.go`, `aggregator-agent/internal/system/machine_id.go`

**Status:** Working. Enhanced with canonical hash format and rebind capability.

---

### 2.4 Three-Tier Token Authentication

**Planned (Security.md section 2):** Registration tokens (one-time/multi-seat) -> JWT access tokens (24h) -> Refresh tokens (90-day sliding window, stored as SHA-256 hash).

**Built:** Exactly as planned. Registration tokens consumed at `/agents/register` with seat limits. JWT issued with `issuer=redflag-agent` (A-3 fix). Refresh tokens stored hashed, renewed via `/agents/renew`. Token renewal wrapped in database transaction (B-2 fix).

**Location:** `aggregator-server/internal/api/handlers/auth.go`, `aggregator-server/internal/api/middleware/auth.go`, `aggregator-server/internal/database/queries/refresh_tokens.go`

**Status:** Working. Transaction safety added beyond original spec.

---

### 2.5 Replay Attack Protection (Nonce System)

**Planned (Security.md section 3.3):** Unique, time-limited, Ed25519-signed nonce for every sensitive command. Agent validates signature and timestamp.

**Built:** Implemented via `UpdateNonceService` with `Generate()` and `Validate()`. Nonce format: `uuid:unix_timestamp`, signed with Ed25519. Agent validates freshness within configurable window.

**Location:** `aggregator-server/internal/services/nonce_service.go`, `aggregator-agent/cmd/agent/subsystem_handlers.go:1074` (validateNonce)

**Status:** Working. Timeout is 10 minutes (not 5 as specified in Overview.md — see VD-006).
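
The agent-side check can be pictured roughly as follows: verify the Ed25519 signature over the `uuid:unix_timestamp` string, then compare the embedded timestamp against the freshness window. This is a sketch under assumptions, not the code in `subsystem_handlers.go`: the function `validateNonce` here, its error values, the `demo` harness, and the rejection of future-dated timestamps are all illustrative.

```go
package main

import (
	"crypto/ed25519"
	"errors"
	"fmt"
	"strconv"
	"strings"
	"time"
)

var (
	errBadSignature = errors.New("nonce signature invalid")
	errMalformed    = errors.New("nonce malformed")
	errExpired      = errors.New("nonce outside freshness window")
)

// validateNonce checks the signature first, then the age of the timestamp
// that follows the last ':' in the nonce.
func validateNonce(pub ed25519.PublicKey, nonce string, sig []byte, now time.Time, maxAge time.Duration) error {
	if !ed25519.Verify(pub, []byte(nonce), sig) {
		return errBadSignature
	}
	i := strings.LastIndex(nonce, ":")
	if i < 0 {
		return errMalformed
	}
	ts, err := strconv.ParseInt(nonce[i+1:], 10, 64)
	if err != nil {
		return errMalformed
	}
	age := now.Sub(time.Unix(ts, 0))
	if age < 0 || age > maxAge { // rejecting future timestamps is an assumption
		return errExpired
	}
	return nil
}

// demo is a hypothetical harness: it issues a nonce that is ageSeconds old,
// optionally swaps the UUID after signing, and validates with a 10-minute window.
func demo(ageSeconds int64, tamper bool) error {
	pub, priv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand.Reader
	if err != nil {
		return err
	}
	now := time.Now()
	nonce := "123e4567-e89b-12d3-a456-426614174000:" + strconv.FormatInt(now.Unix()-ageSeconds, 10)
	sig := ed25519.Sign(priv, []byte(nonce))
	if tamper {
		nonce = "00000000-0000-0000-0000-000000000000" + nonce[36:]
	}
	return validateNonce(pub, nonce, sig, now, 10*time.Minute)
}

func main() {
	fmt.Println(demo(60, false))  // fresh and untouched
	fmt.Println(demo(700, false)) // older than the window
	fmt.Println(demo(60, true))   // modified after signing
}
```

Checking the signature before parsing means the timestamp is only trusted once the whole nonce string is known to come from the server key.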

---

### 2.6 Agent Self-Registration Flow

**Planned (Starting Prompt):** Agent POSTs to `/agents/register` with hostname, OS info, version. Server returns agent_id, token, config.

**Built:** Exactly as planned plus enhancements. Registration includes machine_id (SHA256 hash) and public_key_fingerprint. Registration wrapped in database transaction (B-2 fix). Agent stores agent_id, token, refresh_token, check_in_interval to config file.

**Location:** `aggregator-agent/cmd/agent/main.go:371-483` (registerAgent), `aggregator-server/internal/api/handlers/agents.go` (RegisterAgent)

**Status:** Working.

---

### 2.7 Package Scanner Architecture

**Planned (Starting Prompt):** APT, DNF/YUM, AUR, Winget, Windows Update, Docker, Snap, Flatpak, Homebrew.

**Built:** APT, DNF, Winget, Windows Update, Docker, Storage metrics. Each scanner has circuit breaker protection and configurable timeouts.

**Not built:** AUR, Snap, Flatpak, Homebrew. See Section 5A.

**Location:** `aggregator-agent/internal/scanner/` (apt.go, dnf.go, winget.go, docker.go), `aggregator-agent/pkg/windowsupdate/`

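**Status:** Working for implemented platforms. 5 of the 9 planned package scanners are built (APT, DNF, Winget, Windows Update, Docker); the storage metrics scanner is an addition that was not in the original list.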

---

### 2.8 Circuit Breaker Pattern

**Planned (ETHOS.md principle 3, Overview.md):** Circuit Breaker on fragile scanners (Windows Update, DNF).

**Built:** Full implementation with Closed/Open/HalfOpen states, configurable failure threshold, failure window, open duration, and half-open attempts. Applied to all scanners via subsystem config.

**Location:** `aggregator-agent/internal/circuitbreaker/circuit_breaker.go`, config per subsystem in `config.json`

**Status:** Working.
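
For readers unfamiliar with the pattern, the three states can be sketched as below. This is a deliberately reduced model, not the contents of `circuit_breaker.go`: it keeps only a failure threshold and an open duration, omits the failure window and the half-open attempt count, and is not safe for concurrent use. All type and field names are illustrative.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

var (
	ErrOpen = errors.New("circuit open: call rejected")
	errScan = errors.New("scanner failed") // stand-in failure for the demo
)

// Breaker is a single-goroutine sketch; a real one would guard its fields with a mutex.
type Breaker struct {
	FailureThreshold int
	OpenDuration     time.Duration

	state    State
	failures int
	openedAt time.Time
}

func (b *Breaker) State() State { return b.state }

// Call runs fn unless the breaker is open and still cooling down.
func (b *Breaker) Call(fn func() error) error {
	now := time.Now()
	if b.state == Open {
		if now.Sub(b.openedAt) < b.OpenDuration {
			return ErrOpen
		}
		b.state = HalfOpen // cool-down elapsed: let one probe through
	}
	if err := fn(); err != nil {
		b.failures++
		if b.state == HalfOpen || b.failures >= b.FailureThreshold {
			b.state, b.openedAt = Open, now
		}
		return err
	}
	b.state, b.failures = Closed, 0
	return nil
}

func main() {
	b := &Breaker{FailureThreshold: 2, OpenDuration: time.Minute}
	fail := func() error { return errScan }
	fmt.Println(b.Call(fail), b.State() == Closed)   // first failure, still closed
	fmt.Println(b.Call(fail), b.State() == Open)     // threshold reached, now open
	fmt.Println(b.Call(func() error { return nil })) // rejected without running
}
```

The practical effect for a fragile scanner is that repeated failures stop consuming a full timeout on every cycle until the open duration has passed.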

---

### 2.9 Command Acknowledgment System

**Planned (Overview.md):** `pending_acks.json` for at-least-once delivery guarantee.

**Built:** `acknowledgment.Tracker` package with persistent pending acks, retry with `IncrementRetry()`, and state file persistence.

**Location:** `aggregator-agent/internal/acknowledgment/`

**Status:** Working.

---

### 2.10 Agent Service Management

**Planned (Overview.md):** systemd on Linux, Windows Services (SCM) on Windows.

**Built:** Both. Linux systemd unit generated inline in installer template with security hardening (ProtectSystem=strict, ProtectHome=true, PrivateTmp=true). Windows SCM registration via `InstallService()` with auto-start and recovery actions.

**Location:** `aggregator-agent/internal/service/windows.go:438-516`, installer templates in `aggregator-server/internal/services/templates/install/scripts/`

**Status:** Working.

---

### 2.11 Web Dashboard

**Planned (Starting Prompt):** React 18 + TypeScript, TailwindCSS, TanStack Query, Recharts, TanStack Table, WebSocket.

**Built:** React + TypeScript + TailwindCSS + TanStack Query. Dashboard with agent list, update management, command history, security settings. TypeScript strict compliance (0 errors after E-1b). No Recharts charts, no TanStack Table, limited WebSocket.

**Location:** `aggregator-web/src/`

**Status:** Working. Feature-complete for core use case but missing visualization features (charts, trend analysis).

---

### 2.12 PostgreSQL with Migration Runner

**Planned (Starting Prompt, ETHOS.md):** PostgreSQL database with idempotent migrations.

**Built:** PostgreSQL 16-alpine, custom migration runner in `database/db.go` with `schema_migrations` tracking table. 30 migrations (001-030) all using IF NOT EXISTS / ON CONFLICT DO NOTHING. Migration runner aborts fatally on failure (B-1 fix).

**Location:** `aggregator-server/internal/database/db.go`, `aggregator-server/internal/database/migrations/`

**Status:** Working.

---

### 2.13 Docker Compose Deployment

**Planned (Starting Prompt):** `docker-compose.yml` for quick start.

**Built:** Three-service compose: postgres (16-alpine), server (multi-stage Go build), web (nginx). Health checks, volume persistence, env_file support.

**Location:** `docker-compose.yml`, `config/.env.example`

**Status:** Working.

---

### 2.14 Installer Scripts

**Planned (various docs):** Install script served from `/install/:platform` endpoint.

**Built:** Linux bash and Windows PowerShell templates served dynamically. Linux installer creates system user, directories, sudoers (per-package-manager), systemd service with security hardening, registers agent, starts service. Windows installer creates directories, downloads binary, writes config, registers service. Both have arch auto-detection and checksum verification (Installer Fix 2).

**Location:** `aggregator-server/internal/services/templates/install/scripts/linux.sh.tmpl`, `windows.ps1.tmpl`

**Status:** Working. Idempotent.

---

### 2.15 Agent Self-Upgrade Pipeline

**Planned (Overview.md, P2-003):** 7-step pipeline: download, checksum verify, Ed25519 binary signature verify, backup, atomic install, service restart, watchdog confirmation.

**Built:** Exactly as specified. All 7 steps implemented with deferred rollback on any failure including watchdog timeout (5 minutes). Download has 5-minute timeout and 500MB size limit (U-8 fix).

**Location:** `aggregator-agent/cmd/agent/subsystem_handlers.go:575-762`

**Status:** Working. The v0.1.27 INVENTORY doc noted this was "FULLY IMPLEMENTED" even though the backlog still listed it as a placeholder.

---

## Section 3: What Was Built Better Than Planned

### 3.1 Ed25519 Key Rotation with TTL

**Original:** Security.md noted key rotation was "TODO" with no implementation. SETUP-SECURITY.md described a POST `/security/keys/rotate` API with 30-day grace period.

**Built:** TTL-based key caching with automatic refresh. Server registers primary key in `signing_keys` table (migration 025). Agent caches server public key at registration (TOFU), with configurable TTL refresh. Key ID tracked per command for audit trail.

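**Why better:** Automatic TTL refresh does not depend on an operator remembering to trigger a rotation. The manual rotation API has not been needed so far because agents pick up a changed key once their cached copy expires. VD-007 covers what this approach gives up (no explicit rotation endpoint and no dual-key grace period).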

---

### 3.2 Command Signing v3 Format

**Original:** Commands signed with `cmd_id:type:sha256(params)`.

**Built:** v3 format includes `agent_id:cmd_id:type:sha256(params):timestamp`. Agent ID binding prevents command replay to different agents. Timestamp enables time-based expiry independent of DB state.

**Why better:** Prevents a class of relay attacks where a compromised agent could forward signed commands to other agents.

---

### 3.3 Machine ID Canonical SHA256 Hash

**Original:** Machine ID was inconsistent — registration fallback used `"unknown-" + hostname` (unhashed) while runtime used SHA256.

**Built (D-1):** All paths now use `GetMachineID()` which always returns a 64-character hex SHA256 hash. Registration aborts with `log.Fatalf` if machine ID cannot be obtained — no unhashed fallback.

**Why better:** Eliminates format mismatch between registration and runtime that would cause 403 errors after restart.

---

### 3.4 Transaction Safety (B-Series)

**Original:** Not specified. Registration, command delivery, and token renewal were separate DB operations without transaction wrapping.

**Built (B-2):** Registration is wrapped in a database transaction (`Beginx()` with `defer tx.Rollback()`). Command delivery uses `SELECT ... FOR UPDATE SKIP LOCKED` so that each pending command is claimed atomically by a single request. Token renewal is wrapped in a transaction. The JWT is generated after commit, not before.

**Why better:** Prevents partial registration state, command double-delivery race conditions, and orphaned tokens.

---

### 3.5 Configurable Operational Timeouts

**Original:** 6 hardcoded timeout values in main.go and timeout.go.

**Built (E-1c):** All 6 values stored in `security_settings` table under `operational` category (migration 030). Read from DB at startup with hardcoded fallback. Zero-value protection prevents zero-duration tickers.

**Why better:** Administrators can tune timeouts via API without code changes or redeployment.
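
The zero-value protection matters because Go's `time.NewTicker` panics when given a non-positive duration. A minimal sketch of the read-with-fallback step is shown below; the helper name `timeoutOrFallback`, the setting keys, and storing values as whole seconds are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// timeoutOrFallback converts a stored value in seconds to a Duration and
// returns the hardcoded fallback when the key is absent or not positive.
func timeoutOrFallback(settings map[string]int64, key string, fallback time.Duration) time.Duration {
	secs, ok := settings[key]
	if !ok || secs <= 0 {
		return fallback
	}
	return time.Duration(secs) * time.Second
}

func main() {
	// Hypothetical rows from the operational category.
	settings := map[string]int64{"sent_command_timeout": 7200, "check_interval": 0}

	interval := timeoutOrFallback(settings, "check_interval", 5*time.Minute)
	ticker := time.NewTicker(interval) // would panic if interval were 0
	defer ticker.Stop()

	fmt.Println(interval, timeoutOrFallback(settings, "sent_command_timeout", time.Hour)) // 5m0s 2h0m0s
}
```

Treating zero and negative values the same as a missing row keeps a bad API write from taking down the timeout service at the next restart.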

---

### 3.6 Binary Path Traversal Protection

**Original:** Not specified — `c.File(pkg.BinaryPath)` served DB-sourced paths without validation.

**Built (E-1c + Integration Verification):** Both `DownloadUpdatePackage` and `DownloadAgent` resolve paths via `filepath.Abs()` and validate against `REDFLAG_BINARY_STORAGE_PATH` using prefix check. Traversal attempts logged and return 403.

**Why better:** Defense in depth against DB compromise scenarios.
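
A containment check of this kind might look like the sketch below. Note one deliberate difference from the description above: instead of a raw string prefix comparison it uses `filepath.Rel`, because a plain prefix test on `/app/binaries` would also accept a sibling such as `/app/binaries-old`. Whether the real handlers are exposed to that case is not something this report establishes. The function name `resolveWithin` and the example paths are hypothetical, and symlinks are not resolved here.

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
	"strings"
)

var errOutsideBase = errors.New("path escapes binary storage directory")

// resolveWithin returns the absolute form of candidate if it lies inside
// baseDir, and an error otherwise. Symlinks are not resolved.
func resolveWithin(baseDir, candidate string) (string, error) {
	base, err := filepath.Abs(baseDir)
	if err != nil {
		return "", err
	}
	target, err := filepath.Abs(candidate) // also cleans ".." segments
	if err != nil {
		return "", err
	}
	rel, err := filepath.Rel(base, target)
	if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", errOutsideBase
	}
	return target, nil
}

func main() {
	base := "/app/binaries" // stand-in for REDFLAG_BINARY_STORAGE_PATH
	for _, p := range []string{
		"/app/binaries/linux-amd64/redflag-agent",
		"/app/binaries/../secrets/key",
		"/app/binaries-old/redflag-agent",
	} {
		_, err := resolveWithin(base, p)
		fmt.Println(p, err == nil)
	}
}
```

A handler would call this before serving the file and return 403 on error, matching the behavior described above.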

---

### 3.7 TypeScript Strict Compliance

**Original:** 217 TypeScript errors in `aggregator-web/src/`.

**Built (E-1b):** All 217 errors fixed. Zero `@ts-ignore` or `as any` suppressions added. Type interfaces verified against actual server JSON responses. TanStack Query v5 `isLoading` -> `isPending` migration for mutations.

**Why better:** Catches type mismatches at compile time instead of runtime.

---

### 3.8 Semver-Aware Version Comparison

**Original:** `versions.go:72` used lexicographic string comparison (`agentVersion < current.MinAgentVersion`), under which `"0.1.9"` compares as greater than `"0.1.22"` because `'9'` sorts after `'2'`.

**Built (Upgrade Fix):** `CompareVersions()` with octet-by-octet numeric parsing. Handles `"dev"` as always-older, `"v"` prefix stripping, mismatched octet counts.

**Why better:** Version gates now work correctly for all version numbers.
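
The behaviors listed above (numeric octets, `"dev"` as always-older, `v` prefix stripping, mismatched octet counts) can be sketched as follows. This is an independent illustration, not the project's `CompareVersions()`: the lowercase names are hypothetical, and treating an unparseable version like `"dev"` is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion strips a leading "v" and splits on dots. The second return
// value is true for "dev", empty, or unparseable input (an assumption).
func parseVersion(v string) ([]int, bool) {
	v = strings.TrimPrefix(strings.TrimSpace(v), "v")
	if v == "" || v == "dev" {
		return nil, true
	}
	parts := strings.Split(v, ".")
	out := make([]int, len(parts))
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return nil, true
		}
		out[i] = n
	}
	return out, false
}

// compareVersions returns -1, 0, or 1. Missing trailing octets count as 0,
// so "0.1.26" equals "0.1.26.0".
func compareVersions(a, b string) int {
	pa, aDev := parseVersion(a)
	pb, bDev := parseVersion(b)
	switch {
	case aDev && bDev:
		return 0
	case aDev:
		return -1
	case bDev:
		return 1
	}
	n := len(pa)
	if len(pb) > n {
		n = len(pb)
	}
	for i := 0; i < n; i++ {
		var x, y int
		if i < len(pa) {
			x = pa[i]
		}
		if i < len(pb) {
			y = pb[i]
		}
		if x < y {
			return -1
		}
		if x > y {
			return 1
		}
	}
	return 0
}

func main() {
	fmt.Println(compareVersions("0.1.9", "0.1.22"))    // -1
	fmt.Println(compareVersions("v0.1.4", "0.1.4"))    // 0
	fmt.Println(compareVersions("dev", "0.0.1"))       // -1
	fmt.Println(compareVersions("0.1.26", "0.1.26.0")) // 0
}
```

The first call is the case the lexicographic comparison got wrong: compared as integers, 9 is less than 22.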

---

### 3.9 Test Suite Growth

**Original (Code Review):** "Only 3 test files across the entire codebase" — `circuitbreaker_test.go`, `test_disk.go`, `test_disk_detection.go`, plus scheduler tests.

**Built:** 170 tests across 18 packages covering: Ed25519 signing and replay protection, JWT issuer validation, registration transactions, command delivery races, machine ID format, ETHOS compliance, path traversal, version comparison, checksum computation, timeout configuration, and more.

**Why better:** Regression detection for all fix series (A through Upgrade).

---

### 3.10 ETHOS Logging Compliance

**Original:** Mixed `fmt.Printf`, emoji in logs, inconsistent format.

**Built (D-2 + Integration Verification):** All production log statements use `log.Printf("[TAG] [system] [component] message key=value")`. Emoji removed from daemon log.Printf calls. `fmt.Printf` DEBUG statements removed from handlers. Terminal/CLI output emoji explicitly exempted (DEV-039).
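
For orientation, a log line in this format could be produced as in the sketch below. The helper `ethosLine` and the specific tag, system, and component values are hypothetical; only the bracketed three-field prefix followed by a message and `key=value` pairs comes from the description above.

```go
package main

import (
	"fmt"
	"log"
	"strings"
)

// ethosLine formats "[TAG] [system] [component] message key=value ...".
// Callers pass already-formatted key=value strings.
func ethosLine(tag, system, component, msg string, kv ...string) string {
	line := fmt.Sprintf("[%s] [%s] [%s] %s", tag, system, component, msg)
	if len(kv) > 0 {
		line += " " + strings.Join(kv, " ")
	}
	return line
}

func main() {
	// log.Print adds the standard timestamp prefix in front of the line.
	log.Print(ethosLine("INFO", "agent", "scanner", "scan complete", "packages=12", "duration_ms=840"))
}
```

Keeping the three bracketed fields in a fixed order is what makes the output easy to filter with `grep`, for example by component.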

---

### 3.11 Installer Architecture Detection

**Original:** Architecture hardcoded to `amd64` in `generateInstallScript`.

**Built (Installer Fix 2):** Runtime detection via `uname -m` (Linux) and `$env:PROCESSOR_ARCHITECTURE` (Windows). Server accepts optional `?arch=` query param. Download endpoint already supported `linux-arm64` and `windows-arm64`.

**Why better:** ARM64 homelabbers (Raspberry Pi, Apple Silicon VMs) can now install without manual binary download.

---

### 3.12 Binary Checksum Verification

**Original:** No verification of downloaded binary integrity.

**Built (Installer Fix 2):** Server computes SHA-256 and serves `X-Content-SHA256` header. Linux installer verifies with `sha256sum`. Windows installer verifies with `Get-FileHash`. Missing header = warn but continue (backward compatible).
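
The installers implement this check in bash and PowerShell; the sketch below expresses the same three-way policy in Go for clarity. It assumes the `X-Content-SHA256` header carries a hex digest, and the names `verifyChecksum` and `checksumResult` are illustrative.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

type checksumResult int

const (
	checksumOK       checksumResult = iota
	checksumMissing                 // header absent: warn but continue
	checksumMismatch                // digest differs: abort the install
)

// verifyChecksum compares the SHA-256 of body with the header value,
// assuming the header is a hex digest (case-insensitive).
func verifyChecksum(body []byte, header string) checksumResult {
	header = strings.ToLower(strings.TrimSpace(header))
	if header == "" {
		return checksumMissing
	}
	sum := sha256.Sum256(body)
	if hex.EncodeToString(sum[:]) != header {
		return checksumMismatch
	}
	return checksumOK
}

func main() {
	body := []byte("abc") // stand-in for the downloaded binary
	digest := "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
	fmt.Println(verifyChecksum(body, digest) == checksumOK)                // true
	fmt.Println(verifyChecksum(body, "") == checksumMissing)               // true
	fmt.Println(verifyChecksum([]byte("abd"), digest) == checksumMismatch) // true
}
```

Separating "missing" from "mismatch" is what allows older servers without the header to keep working while a corrupted or altered download still fails.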

---

### 3.13 Machine ID Rebind Endpoint

**Original:** Not specified. If machine ID changed (hardware replacement, VM migration), agent was permanently locked out.

**Built (D-1):** Admin endpoint `POST /admin/agents/:id/rebind-machine-id` allows re-binding an agent to new hardware. Requires admin authentication.

---

## Section 4: What Was Built Differently (Deviations)

### VD-001: Logging Format

**Original (P3-006):** JSON structured logs with correlation IDs via logrus or similar library. `StructuredLogger` implementation, `CorrelationIDMiddleware`, buffered async writes, P95/P99 latency tracking, `system_logs` database table.

**Actual:** ETHOS `[TAG] [system] [component]` plain text format via `log.Printf`. No correlation IDs, no structured JSON, no centralized aggregation, no log database table.

**Rationale:** ETHOS principle #5 (no marketing fluff) and principle #1 (errors are history) were prioritized over the P3-006 spec. Plain text logging with consistent tags is grep-friendly and sufficient for the homelab use case. JSON structured logging adds complexity without proportional benefit at the current scale.

**Verdict:** Acceptable for homelab. Would need to be revisited for fleet-scale deployments (100+ agents).

---

### VD-002: Authentication Architecture

**Original (P0-006 + Starting Prompt):** Multi-user system with `users` table, admin/user/readonly roles, email fields, `EnsureAdminUser()`. The Starting Prompt shows Settings page with "users" section.

**Actual:** Single-admin via `.env` credentials. The `users` table exists in migrations (for compatibility) but is not used for authentication. Web auth validates against `REDFLAG_ADMIN_USER`/`REDFLAG_ADMIN_PASSWORD` from environment.

**Rationale:** P0-006 recommended "Option 1: Complete Removal" — recognizing that multi-user scaffolding increased attack surface without benefit for a homelab tool. The current implementation follows this recommendation.

**Verdict:** Correct for homelab. Multi-user would be needed for MSP/enterprise use case.

---

### VD-003: Build Orchestrator

**Original (architecture docs):** Dynamic agent compilation per request. Build Orchestrator would cross-compile agent binaries on demand.

**Actual:** Pre-built binaries placed at container build time (via Dockerfile multi-stage build). `BuildAndSignAgent` signs existing binaries but never compiles. `AgentBuilder` generates config JSON only. `build_orchestrator.go` services layer marked `// Deprecated`.

**Rationale:** Cross-compilation on every request is impractical for a homelab server. The Dockerfile multi-stage build compiles once; the server serves pre-built binaries. This mirrors how release binaries are commonly distributed (e.g., GitHub Releases).

**Verdict:** Correct pragmatic simplification. Dynamic compilation would add complexity without benefit.

---

### VD-004: Upgrade Trigger Path

**Original:** `POST /build/upgrade/:agentID` was meant to orchestrate full upgrades.

**Actual:** The real upgrade path is `POST /agents/{id}/update` (in `agent_updates.go`), which validates the agent, generates nonces, creates signed `update_agent` commands, and tracks delivery. The `/build/upgrade` endpoint generates config JSON with manual instructions — it's an admin utility, not the upgrade orchestrator.

**Rationale:** The `/agents/{id}/update` path already existed and was more complete (nonce generation, command signing, delivery tracking). Wiring a parallel path would have created confusion.

**Verdict:** Acceptable. The working path is better designed.

---

### VD-005: Security Settings UI

**Original (SECURITY-SETTINGS.md):** Full security settings configurable from dashboard including machine binding mode, version enforcement, nonce timeout, signature algorithm, log level, alert thresholds.

**Actual:** Security settings backend works (API CRUD for `security_settings` table). Dashboard displays settings for `command_signing`, `update_signing`, `nonce_validation`, `machine_binding`, `signature_verification`. The `operational` category (E-1c timeouts) is accessible via API but not visible in the UI. Validation rules are not enforced for operational settings.

**Verdict:** Partially implemented. Backend complete, frontend shows security-category settings. Operational settings need UI exposure.

---

### VD-006: Nonce Timeout

**Original (Overview.md, Security.md):** Nonce lifetime "< 5 minutes".

**Actual (SETUP-SECURITY.md, code):** `REDFLAG_SECURITY_NONCE_TIMEOUT=600` (10 minutes). The code uses a 10-minute default.

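**Rationale:** The original docs contradict each other: Overview.md says "< 5 min" while SETUP-SECURITY.md says 600 seconds. The 10-minute value appears to be a deliberate choice to accommodate polling delay and slow network conditions. A nonce is created when the update command is queued, but an agent on the default 300-second interval (plus jitter) may not fetch that command until more than 5 minutes later, by which point a 5-minute nonce would already have expired.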

**Verdict:** The 10-minute value is more practical. The 5-minute spec was likely aspirational. Document this as intentional.
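
The arithmetic behind this verdict can be checked with a few lines. The sketch assumes the jitter from section 2.1 (up to half the polling interval) is added on top of the interval and ignores network and processing time; the function names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// worstCasePickup is the longest gap between two polls, assuming jitter of up
// to interval/2 is added on top of the interval.
func worstCasePickup(pollInterval time.Duration) time.Duration {
	return pollInterval + pollInterval/2
}

// nonceSurvives reports whether a nonce issued just after one poll is still
// fresh when the agent polls again in the worst case.
func nonceSurvives(pollInterval, nonceLifetime time.Duration) bool {
	return worstCasePickup(pollInterval) <= nonceLifetime
}

func main() {
	poll := 300 * time.Second
	fmt.Println(worstCasePickup(poll))               // 7m30s
	fmt.Println(nonceSurvives(poll, 5*time.Minute))  // false
	fmt.Println(nonceSurvives(poll, 10*time.Minute)) // true
}
```

Under these assumptions a 5-minute nonce can expire before the agent ever sees the command, while 10 minutes leaves about 2.5 minutes of slack.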

---

### VD-007: Key Rotation API

**Original (SETUP-SECURITY.md):** `POST /api/v1/security/keys/rotate` with `grace_period_days` (default 30). During grace period both old and new keys valid. Keys stored in `/app/keys/` directory.

**Actual:** Key rotation is TTL-based via the signing key registry in `signing_keys` table. No explicit `/keys/rotate` API endpoint. No dual-key grace period — agents refresh their cached public key via TTL (24h default). Key stored in environment variable, not in `/app/keys/` directory.

**Rationale:** Different implementation approach that achieves the same goal (agents can handle key changes) without the complexity of dual-key acceptance windows.

**Verdict:** Functional but different. A manual key rotation API would be a nice-to-have for planned rotations.

---

### VD-008: Version Format

**Original (SECURITY-SETTINGS.md):** "Semantic version string (X.Y.Z), integers only, no v prefix."

**Actual:** Four-octet format `X.Y.Z.W` where W is the config version (e.g., `0.1.26.0`). The `v` prefix is tolerated and stripped during comparison.

**Rationale:** The fourth octet was added to embed config schema version alongside the agent version, avoiding a separate version field.

**Verdict:** Acceptable extension of spec. `CompareVersions()` handles both 3-octet and 4-octet formats.

---

### VD-009: Windows Service Key Rotation

**Original:** Not explicitly specified, but key rotation logic exists in `main.go` polling loop.

**Actual (DEV-030):** The Windows service polling loop in `windows.go` does not call `ShouldRefreshKey`. The comment at lines 164-168 acknowledges this as a TODO. Agents running as Windows services rely on the 24h TTL key cache and will not proactively detect key rotation.

**Verdict:** Known gap. Low risk — the 24h TTL cache means Windows agents will naturally pick up new keys within a day.

---

### VD-010: Watchdog Version Comparison

**Original:** Not explicitly specified.

**Actual:** The upgrade watchdog in `subsystem_handlers.go:943` uses string equality (`agent.CurrentVersion == expectedVersion`) instead of `CompareVersions()`. Two strings that denote the same version but differ only in formatting (e.g., `"v0.1.4"` vs `"0.1.4"`) would fail the check and trigger a false rollback.

**Verdict:** Low risk — both sides use the same version string from the same source. Would need fixing if version normalization is ever introduced.

---

## Section 5: What Was Never Built

### 5A. Platform Support

| Feature | Original Priority | Effort Estimate | Notes |
|---------|------------------|-----------------|-------|
| macOS agent / launchd | Starting Prompt | 2-3 days | No launchd plist, no macOS-specific code |
| Homebrew scanner | Starting Prompt | 1-2 days | Would follow APT/DNF pattern |
| AUR scanner (Arch) | Starting Prompt | 1-2 days | Would follow APT/DNF pattern |
| Snap scanner | Starting Prompt | 1 day | Low demand |
| Flatpak scanner | Starting Prompt | 1 day | Low demand |
| aggregator-cli (Go CLI) | Starting Prompt | 3-5 days | Power-user tool, not essential for homelab |

### 5B. AI Features

| Feature | Original Priority | Effort Estimate | Notes |
|---------|------------------|-----------------|-------|
| AI Chat Sidebar (Ollama/OpenAI) | Starting Prompt | 2-3 weeks | No AI code exists anywhere |
| Natural language queries (`POST /ai/query`) | Starting Prompt | 1-2 weeks | Requires AI sidebar first |
| AI-assisted scheduling (`POST /ai/schedule`) | Starting Prompt | 1 week | Requires maintenance windows first |
| AI decision audit trail (`GET /ai/decisions`) | Starting Prompt | 3-5 days | Requires AI features first |

**Honest assessment:** AI features were aspirational in the Starting Prompt ("Future Phase"). They add significant complexity and operational overhead (Ollama requires GPU resources or external API costs). For a homelab tool, manual approval is more appropriate than AI-assisted scheduling. Recommend deferring indefinitely unless Fimeg has a specific use case.

### 5C. Scheduling & Automation

| Feature | Original Priority | Effort Estimate | Notes |
|---------|------------------|-----------------|-------|
| Maintenance Windows (RRULE recurrence) | Starting Prompt, P3 | 2-3 weeks | Full RRULE parser + calendar UI + auto-approve logic |
| Auto-approve by severity during windows | Starting Prompt | 1 week | Requires maintenance windows |
| Scheduled update execution | Starting Prompt | 1 week | Requires maintenance windows |
| Staggered rollout (5%/25%/100%) | P2-003, Strategic Roadmap | 1-2 weeks | Server-side group selection + phased command queuing |
| Auto-upgrade trigger (version-based) | Upgrade Audit | 1 week | Server detects old version on check-in, queues update_agent |

**Honest assessment:** Maintenance windows are the highest-value unbuilt feature for production use. Auto-approve by severity during defined windows would significantly reduce manual work.

### 5D. Observability

| Feature | Original Priority | Effort Estimate | Notes |
|---------|------------------|-----------------|-------|
| Structured JSON logging (P3-006) | P3 | 3-4 days | logrus + correlation IDs |
| Correlation ID propagation | P3-006 | 2-3 days | Middleware + header propagation |
| Update Metrics Dashboard (P3-003) | P3 | 2-3 days | Success/failure rates, trend charts |
| Server Health Dashboard (P3-005) | P3 | 2-3 days | CPU, memory, DB connections |
| Prometheus metrics endpoint | Strategic Roadmap | 2-3 days | /metrics endpoint with Go prometheus client |
| Real-time WebSocket updates | Starting Prompt | 1-2 weeks | Partial: security events WebSocket exists |

### 5E. Integration & Ecosystem

| Feature | Original Priority | Effort Estimate | Notes |
|---------|------------------|-----------------|-------|
| LDAP/Active Directory | Strategic Roadmap | 2-3 weeks | Auth integration |
| SAML/OIDC for SSO | Strategic Roadmap | 2-3 weeks | Requires multi-user first |
| Slack/Teams/PagerDuty webhooks | Strategic Roadmap | 1-2 weeks | Event notification hooks |
| Compliance reporting (SOX, HIPAA) | Strategic Roadmap | 4-6 weeks | Report generation framework |
| Kubernetes deployment | Strategic Roadmap | 1-2 weeks | Helm chart + StatefulSet |
| Ansible/Terraform integrations | Strategic Roadmap | 2-3 weeks | Module/provider development |

**Honest assessment:** Webhooks (Slack/Teams) are the highest-value integration for homelab use. LDAP/SSO only matters if Fimeg plans to support multi-user deployments.

### 5F. UI Features

| Feature | Original Priority | Effort Estimate | Notes |
|---------|------------------|-----------------|-------|
| Security Status Dashboard Indicators (P3-002) | P3 | 2-3 days | Color-coded security health scores |
| Token Management UI Enhancement (P3-004) | P3 | 1-2 days | Delete tokens, bulk operations |
| Server Health Dashboard (P3-005) | P3 | 2-3 days | System status monitoring |
| Operational settings in UI | E-1c carry-over | 1 day | Add 'operational' category to SecuritySettings.tsx |
| Update metrics and trend charts | P3-003 | 2-3 days | Recharts integration |
| Calendar view for maintenance windows | Starting Prompt | 1-2 weeks | Requires maintenance windows backend |

### 5G. Security Features

| Feature | Original Priority | Effort Estimate | Notes |
|---------|------------------|-----------------|-------|
| Multi-factor authentication | Strategic Roadmap | 1-2 weeks | TOTP integration |
| API key rotation via UI | SETUP-SECURITY.md | 2-3 days | Manual rotation endpoint |
| Key rotation with grace period | SETUP-SECURITY.md | 1 week | Dual-key acceptance window |
| TLS hardening (remove bypass flag) | Code Review | 1 hour | Remove `--insecure-tls` flag |
| JWT secret minimum strength | Code Review | 30 min | Validation in config loading |

---

## Section 6: Backlog Status Table

| ID | Title | Original Priority | Current Status | Where Fixed/Notes |
|----|-------|------------------|----------------|-------------------|
| P0-001 | Rate Limit First Request Bug | P0 | FIXED | v0.1.26 per v0.1.27 Inventory; rate limiter namespaced by type in A-3 fixes |
| P0-002 | Session Loop Bug | P0 | PARTIALLY FIXED | SetupCompletionChecker modified in E-1b (removed isSetupMode state); may need live verification |
| P0-003 | Agent No Retry Logic | P0 | FIXED | v0.1.27 per Inventory + culurien B-2 (exponential backoff with full jitter, proportional polling jitter) |
| P0-004 | Database Constraint Violation | P0 | FIXED | v0.1.27 per Inventory; timeout service now uses 'failed' result status (check constraint compatible) |
| P0-005 | Setup Flow Broken | P0 | NOT VERIFIED | Setup handler exists but end-to-end flow not tested in culurien branch. May still have issues |
| P0-006 | Single-Admin Architecture | P0 | ACCEPTED | Decision made: single-admin via .env. Users table exists for compatibility but not used for auth |
| P0-007 | Install Script Path Variables | P0 | FIXED | 2025-12-17 per backlog + verified in Installer Fix 1 (config path consistency) |
| P0-008 | Migration Runs on Fresh Install | P0 | FIXED | 2025-12-17 per backlog; early return in detection.go for empty agent_id |
| P0-009 | Storage Scanner Wrong Table | P0 | NOT DONE | Storage scanner still on legacy interface. Dedicated storage_metrics table exists but scanner reports to update_packages |
| P1-001 | Agent Install ID Parsing | P1 | PARTIALLY FIXED | extractOrGenerateAgentID() in install_template_service.go validates UUID format; but query param handling may still have edge cases |
| P1-002 | Agent Timeout Handling | P1 | PARTIALLY FIXED | E-1c made timeouts configurable from DB; per-scanner timeouts exist in config but generic 45s timeout may still apply in some paths |
| P2-001 | Binary URL Architecture Mismatch | P2 | FIXED | Installer Fix 2 added arch detection; templates override download URL with detected architecture |
| P2-002 | Migration Error Reporting | P2 | NOT DONE | Migration errors still only logged locally; no server-side visibility |
| P2-003 | Agent Auto-Update System | P2 | FIXED | Fully implemented (was incorrectly marked placeholder in backlog); verified in Upgrade Audit |
| P3-001 | Duplicate Command Prevention | P3 | FIXED | v0.1.27; unique index on (agent_id, command_type, status) WHERE status = 'pending' |
| P3-002 | Security Status Dashboard | P3 | PARTIALLY DONE | Security overview endpoints exist; no color-coded health scores or per-agent security badges |
| P3-003 | Update Metrics Dashboard | P3 | NOT DONE | No metrics dashboard, no trend charts |
| P3-004 | Token Management UI Enhancement | P3 | PARTIALLY DONE | Token list with copy-install-command exists; no delete, no bulk operations, no status filtering |
| P3-005 | Server Health Dashboard | P3 | NOT DONE | No health dashboard |
| P3-006 | Structured Logging System | P3 | NOT DONE (alternative) | ETHOS [TAG] format used instead of JSON structured logging. See VD-001 |
| P4-001 | Agent Retry Logic Resilience | P4 | FIXED | v0.1.27 per Inventory + culurien B-2 (exponential backoff, circuit breakers) |
| P4-002 | Scanner Timeout Optimization | P4 | PARTIALLY DONE | Configurable per-subsystem timeouts in config; E-1c made server-side timeouts configurable from DB |
| P4-003 | Agent File Management Migration | P4 | PARTIALLY DONE | MigrationExecutor exists with old-path detection; constants/paths.go standardized; validation/pathutils packages have compile errors (dead code) |
| P4-004 | Directory Path Standardization | P4 | FIXED | constants/paths.go provides canonical paths; windows.go fixed to use constants.GetAgentConfigPath() (Installer Fix 1); installer templates use standard paths |
| P4-005 | Testing Infrastructure Gaps | P4 | SIGNIFICANTLY IMPROVED | From ~3 test files to 170 tests across 18 packages; no CI/CD yet |
| P4-006 | Architecture Documentation Gaps | P4 | PARTIALLY DONE | 30+ docs in culurien docs/ folder; no formal architecture diagrams or ADRs |
| P5-001 | Security Audit Documentation | P5 | NOT DONE | No security audit checklist, IR procedures, or compliance mapping |
| P5-002 | Development Workflow Documentation | P5 | PARTIALLY DONE | .env.example created; no PR template, debugging guide, or release process |

**Summary:** 10 FIXED, 9 PARTIALLY FIXED/DONE, 5 NOT DONE, 1 NOT DONE (alternative approach), 1 ACCEPTED (design decision), 1 NOT VERIFIED, 1 SIGNIFICANTLY IMPROVED. Total: 28 backlog items.
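
P0-003 and P4-001 both cite exponential backoff with full jitter. The following is a minimal sketch of that technique, not the agent's actual code: `baseDelay`, `maxDelay`, and both function names are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// Hypothetical tuning values; the agent's real settings are not stated in this report.
const (
	baseDelay = 1 * time.Second
	maxDelay  = 5 * time.Minute
)

// backoffCeiling returns min(maxDelay, baseDelay * 2^attempt).
// The early return also prevents overflow for very large attempt counts.
func backoffCeiling(attempt int) time.Duration {
	d := baseDelay
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= maxDelay {
			return maxDelay
		}
	}
	return d
}

// fullJitterDelay picks a uniformly random delay in [0, ceiling].
// "Full" jitter means the whole interval is randomized, not just a fraction of it.
func fullJitterDelay(attempt int) time.Duration {
	ceiling := backoffCeiling(attempt)
	return time.Duration(rand.Int63n(int64(ceiling) + 1))
}

func main() {
	for attempt := 0; attempt < 5; attempt++ {
		fmt.Printf("attempt %d: sleep %v (ceiling %v)\n",
			attempt, fullJitterDelay(attempt), backoffCeiling(attempt))
	}
}
```

Randomizing across the whole interval spreads reconnect attempts out when many agents lose the server at the same moment, which is the usual reason to prefer full jitter over a fixed doubling schedule.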

---

## Section 7: Architecture Health Assessment

### 7A. Authentication Stack

| Component | Spec | Actual | Match |
|-----------|------|--------|-------|
| Registration tokens | One-time or multi-seat | Implemented with seat limits | YES |
| JWT 24h expiry | Short-lived JWT | Implemented with issuer-based validation (A-3) | YES |
| Refresh tokens 90-day | Sliding window, SHA-256 hash | Implemented, renewal in transaction (B-2) | YES |
| Machine ID binding | `X-Machine-ID` header, 403 on mismatch | Implemented with canonical SHA256 hash (D-1) | YES |
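
The machine ID binding row can be illustrated with a small sketch. It uses only the Go standard library rather than whatever router the server actually uses, and the normalization inside `canonicalMachineID` (trim, lowercase) is an assumption; the report says only that a canonical SHA-256 hash is used. All function names here are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// canonicalMachineID hashes a raw identifier after trimming and lowercasing it.
// These normalization steps are assumed, not RedFlag's confirmed rule.
func canonicalMachineID(raw string) string {
	sum := sha256.Sum256([]byte(strings.ToLower(strings.TrimSpace(raw))))
	return hex.EncodeToString(sum[:])
}

// requireMachineID rejects any request whose X-Machine-ID header differs
// from the value bound to the agent, answering 403 as the spec describes.
func requireMachineID(bound string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		presented := r.Header.Get("X-Machine-ID")
		// Constant-time comparison avoids leaking how many characters matched.
		if subtle.ConstantTimeCompare([]byte(bound), []byte(presented)) != 1 {
			http.Error(w, "machine ID mismatch", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// statusFor runs one in-memory request through the middleware and
// returns the resulting HTTP status code.
func statusFor(bound, presented string) int {
	ok := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	req := httptest.NewRequest(http.MethodGet, "/", nil)
	req.Header.Set("X-Machine-ID", presented)
	rec := httptest.NewRecorder()
	requireMachineID(bound, ok).ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	bound := canonicalMachineID("  ABC-123 ")
	fmt.Println(statusFor(bound, canonicalMachineID("abc-123")))    // 200
	fmt.Println(statusFor(bound, canonicalMachineID("other-host"))) // 403
}
```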

### 7B. Command Flow

| Component | Spec | Actual | Match |
|-----------|------|--------|-------|
| Pull-only | Agents always initiate | Confirmed — server has no outbound capability | YES |
| 5-minute check-in | Configurable interval | Default 300s, configurable via config.json | YES |
| Command types | scan_updates, collect_specs, install_updates, rollback_update, update_agent | All present in models/command.go plus enable/disable_heartbeat, reboot, dry_run_update, confirm_dependencies | YES+ |
| Acknowledgment | pending_acks.json | acknowledgment.Tracker with persistence and retry | YES |

### 7C. Security Stack

| Component | Spec | Actual | Match |
|-----------|------|--------|-------|
| Ed25519 signing | Binary + command signing | Both implemented; v3 format exceeds spec | YES+ |
| Nonce validation | < 5 min lifetime, anti-replay | 10-minute default (VD-006), otherwise matches | CLOSE |
| TOFU key caching | Fetch once at registration | Implemented with TTL refresh | YES+ |
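
To make the nonce row concrete, here is a hedged sketch of lifetime-bounded replay protection using the 10-minute default noted above. The `nonceCache` type and its methods are hypothetical and do not reflect the agent's real data structures. A real deployment would probably also tolerate a little clock skew, which this sketch does not.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// nonceLifetime mirrors the 10-minute default recorded in VD-006.
const nonceLifetime = 10 * time.Minute

// nonceCache remembers accepted nonces until they can no longer be valid.
type nonceCache struct {
	mu   sync.Mutex
	seen map[string]time.Time // nonce -> expiry time
}

func newNonceCache() *nonceCache {
	return &nonceCache{seen: make(map[string]time.Time)}
}

// accept reports whether a command with this nonce and issue time may run at `now`.
// It rejects future-dated, expired, and previously seen nonces.
func (c *nonceCache) accept(nonce string, issuedAt, now time.Time) bool {
	c.mu.Lock()
	defer c.mu.Unlock()

	// Evict entries whose lifetime has passed; the age check below
	// already rejects any replay of them, so forgetting them is safe.
	for n, expiry := range c.seen {
		if !now.Before(expiry) {
			delete(c.seen, n)
		}
	}

	age := now.Sub(issuedAt)
	if age < 0 || age >= nonceLifetime {
		return false
	}
	if _, replayed := c.seen[nonce]; replayed {
		return false
	}
	c.seen[nonce] = issuedAt.Add(nonceLifetime)
	return true
}

// replayDemo exercises the three interesting cases: first use, replay, stale nonce.
func replayDemo() (first, replay, stale bool) {
	c := newNonceCache()
	now := time.Now()
	first = c.accept("n-1", now.Add(-time.Minute), now)
	replay = c.accept("n-1", now.Add(-time.Minute), now)
	stale = c.accept("n-2", now.Add(-11*time.Minute), now)
	return first, replay, stale
}

func main() {
	first, replay, stale := replayDemo()
	fmt.Println(first, replay, stale) // true false false
}
```

A longer lifetime, such as the 10-minute deviation from the 5-minute spec, widens the window in which a nonce must be remembered. It does not by itself permit replays, because every accepted nonce stays in the cache for its full lifetime.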

### 7D. Agent Paths

| Platform | Spec Path | Actual | Match |
|----------|-----------|--------|-------|
| Linux config | `/etc/redflag/config.json` | `/etc/redflag/agent/config.json` (constants.GetAgentConfigPath) | CLOSE — subdir added |
| Linux state | `/var/lib/redflag/` | `/var/lib/redflag/agent/` | CLOSE — subdir added |
| Linux binary | `/usr/local/bin/redflag-agent` | `/usr/local/bin/redflag-agent` | YES |
| Windows config | `C:\ProgramData\RedFlag\config.json` | `C:\ProgramData\RedFlag\agent\config.json` (fixed in Installer Fix 1) | CLOSE — subdir added |

The `agent` subdirectory was added to support future multi-component deployments (agent + server on same machine). This is a reasonable structural enhancement.

### 7E. Migration System

| Component | Spec | Actual | Match |
|-----------|------|--------|-------|
| MigrationExecutor | Present | Implemented in `aggregator-agent/internal/migration/` | YES |
| Old path migration | `/etc/aggregator/` -> `/etc/redflag/` | Detection and backup implemented in installer templates and migration executor | YES |

---

## Section 8: The Honest Roadmap

### HIGH VALUE, LOW EFFORT (Quick Wins)

1. **JWT secret minimum strength** (30 min) — Add `len(secret) < 32` check in config loading. Addresses Code Review finding.
2. **TLS bypass flag removal** (1 hour) — Remove `--insecure-tls` flag from agent. Forces TLS in production.
3. **Operational settings in UI** (1 day) — Add `operational` category to SecuritySettings.tsx component. Makes timeout tuning accessible from dashboard.
4. **Token delete button** (1-2 days) — P3-004. DELETE endpoint + confirmation dialog. Removing a token currently requires manual database cleanup.
5. **`/api/v1/info` ldflags injection** (1 hour) — Ensure Dockerfile passes `-ldflags` with actual version strings. Currently defaults to "dev".
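
The first quick win is small enough to sketch in full. This is one possible shape for the check, assuming the secret is available as a string during config loading; `validateJWTSecret`, `minJWTSecretLen`, and `errWeakJWTSecret` are hypothetical names, not existing RedFlag identifiers.

```go
package main

import (
	"errors"
	"fmt"
)

// minJWTSecretLen is the 32-byte floor suggested by the roadmap item.
const minJWTSecretLen = 32

var errWeakJWTSecret = errors.New("JWT secret must be at least 32 bytes")

// validateJWTSecret fails fast at startup when the configured secret is too short.
// len() counts bytes, so this is a length floor, not an entropy measurement.
func validateJWTSecret(secret string) error {
	if len(secret) < minJWTSecretLen {
		return errWeakJWTSecret
	}
	return nil
}

func main() {
	fmt.Println(validateJWTSecret("too-short"))
	fmt.Println(validateJWTSecret("0123456789abcdef0123456789abcdef"))
}
```

Returning an error from config loading, rather than logging a warning, keeps a server with a weak secret from starting at all.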

### HIGH VALUE, HIGH EFFORT (Strategic Investments)

1. **Maintenance Windows** (2-3 weeks) — The single most impactful unbuilt feature. Enables scheduled patching during safe hours. Without this, every update requires manual approval at execution time.
2. **Webhook notifications** (1-2 weeks) — Slack/Teams alerts on critical update availability, failed installations, agent offline. Low integration overhead, high operational value.
3. **Staggered rollout** (1-2 weeks) — Deploy updates to 5% canary, monitor, then 25%, then 100%. Essential for fleets > 10 agents.
4. **macOS agent** (2-3 days) — launchd plist template + Homebrew scanner. Completes the "cross-platform" promise for homelabbers with Macs.
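
The staggered rollout item (5%, then 25%, then 100%) comes down to a cohort-sizing calculation. The sketch below is illustrative only, since the feature is unbuilt; `rolloutStages`, `cohortSize`, and `planRollout` are hypothetical names. It rounds up so that a small fleet still gets a one-agent canary.

```go
package main

import "fmt"

// rolloutStages are the cumulative percentages named in the roadmap item.
var rolloutStages = []int{5, 25, 100}

// cohortSize returns how many agents a cumulative percentage covers,
// rounding up so any non-empty fleet yields at least one canary.
func cohortSize(total, percent int) int {
	if total <= 0 || percent <= 0 {
		return 0
	}
	n := (total*percent + 99) / 100 // integer ceiling division
	if n > total {
		n = total
	}
	return n
}

// planRollout splits agents into waves. Each wave holds only the agents newly
// added at that stage; stages that add nobody are skipped.
func planRollout(agents []string) [][]string {
	waves := make([][]string, 0, len(rolloutStages))
	done := 0
	for _, pct := range rolloutStages {
		upTo := cohortSize(len(agents), pct)
		if upTo > done {
			waves = append(waves, agents[done:upTo])
			done = upTo
		}
	}
	return waves
}

func main() {
	agents := make([]string, 20)
	for i := range agents {
		agents[i] = fmt.Sprintf("agent-%02d", i+1)
	}
	for i, wave := range planRollout(agents) {
		fmt.Printf("wave %d: %d agents\n", i+1, len(wave))
	}
}
```

For a 20-agent fleet this yields waves of 1, 4, and 15 agents. For very small fleets the 5% and 25% stages collapse into a single canary wave, which is consistent with the note that staged rollout mainly matters above roughly 10 agents.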

### LOW VALUE (Defer or Drop)

1. **AI features** — Drop entirely for foreseeable future. Adds operational complexity (GPU/API costs) without proportional benefit for homelab use case. Manual approval is more appropriate.
2. **aggregator-cli** — Defer. The web dashboard covers all use cases. CLI would be nice-to-have for scripting but is not essential.
3. **AUR/Snap/Flatpak scanners** — Defer. Very small user base for each. APT and DNF cover 95%+ of Linux homelabbers.
4. **LDAP/SSO** — Defer until multi-user is needed. Single-admin is correct for homelab.
5. **Compliance reporting** — Drop. SOX/HIPAA requirements don't apply to homelabs.
6. **Kubernetes deployment** — Defer. Docker Compose is the right deployment model for the target audience.

---

## Section 9: Summary Table

| Feature Area | Planned | Built | Status | Gap Rating |
|-------------|---------|-------|--------|------------|
| Core Architecture (pull model, agents, server) | Full | Full | Working | 0 (complete) |
| Ed25519 Signing (commands, binaries) | Full | Full + enhancements | Working | 0 (exceeds spec) |
| Authentication (tokens, JWT, refresh) | Full | Full + transactions | Working | 0 (exceeds spec) |
| Machine ID Binding | Full | Full + canonical hash | Working | 0 (exceeds spec) |
| Replay Protection (nonces) | Full | Full (10min vs 5min) | Working | 1 (timeout deviation) |
| Package Scanning (6 of 9 scanners) | 9 scanners | 6 scanners | Working | 3 (AUR, Snap, Flatpak, Homebrew missing) |
| Agent Self-Upgrade | Full | Full 7-step pipeline | Working | 0 (complete) |
| Installer (Linux + Windows) | Full | Full + arch + checksum | Working | 1 (macOS missing) |
| Web Dashboard | Full | Core features | Working | 3 (missing charts, health, metrics) |
| Database + Migrations | Full | Full + hardened | Working | 0 (exceeds spec) |
| Docker Deployment | Full | Full | Working | 0 (complete) |
| Testing | Minimal | 170 tests | Working | 2 (no CI/CD, no integration tests against real DB) |
| Maintenance Windows | Full | None | Not built | 10 (completely absent) |
| AI Features | Full | None | Not built | 10 (deliberately deferred) |
| Scheduling & Automation | Full | None | Not built | 8 (no maintenance windows, no staggered rollout) |
| LDAP/SSO | Planned | None | Not built | 5 (not needed for homelab) |
| Structured Logging | Planned (P3-006) | ETHOS alternative | Working differently | 3 (functional but not JSON/correlation IDs) |
| Compliance / Reporting | Planned | None | Not built | 2 (not applicable to homelab) |
| CLI Tool | Planned | None | Not built | 2 (dashboard covers use cases) |
| macOS Support | Planned | None | Not built | 4 (matters for homelabbers with Macs) |

**Overall: Core infrastructure is 9/10. Feature breadth is 5/10. Production readiness for homelab is 7/10.**

The gap is almost entirely in features that were always labeled "future" or "Phase 2" in the original docs. The core architecture — the hard engineering work — is built, tested, and hardened beyond the original specification.