# Vision vs Reality: RedFlag Deviation Report

**Date:** 2026-03-29
**Branch:** culurien (post-integration verification)
**Baseline:** v0.1.27 (last Fimeg commit before culurien work)
**Author:** Claude (automated analysis based on complete codebase and historical documentation)

---

## Section 1: Executive Summary

RedFlag began as "Aggregator", Fimeg's vision for a self-hosted ConnectWise alternative that would give homelabbers centralized update management with "single pane of glass" visibility across Windows, Linux, and macOS. The Starting Prompt described an ambitious platform with AI-assisted scheduling, maintenance windows, natural language queries, and cross-platform agents covering APT, DNF, AUR, Snap, Flatpak, Winget, Windows Update, Docker, and Homebrew.

What exists today is a narrower but more deeply engineered system than the original breadth implied. The core architecture (pull-based agents, Ed25519 signed commands, machine ID binding, three-tier token authentication) is not just present but has been hardened well beyond the original specification. The culurien branch added transaction safety, replay attack protection, key rotation, configurable timeouts, path traversal defense, and semver-aware version comparison, and it grew the test suite from approximately 3 test files to 170 passing tests across 18 packages.

The honest gap: approximately 40% of the originally envisioned feature surface was never built. macOS support, AI features, maintenance windows, scheduled rollouts, structured JSON logging, LDAP/SSO, compliance reporting, the CLI tool: none exist. The features that do exist (package scanning, command execution, agent self-upgrade, installer infrastructure) work correctly and are production-quality for the homelab use case.

Is it production-ready for Fimeg's homelab? Yes, with caveats.
The system can install agents on Linux and Windows, scan for package updates, approve and execute updates from the dashboard, and self-upgrade agents, all with cryptographic verification. The caveats are: the setup flow has known friction (P0-005), the `/api/v1/info` endpoint returns build-time values that default to "dev" without proper ldflags injection, and several P0 items from the original backlog have been fixed, but not all have been verified end-to-end in a live deployment.

The "scare ConnectWise" ambition requires honest framing. RedFlag has three architectural advantages ConnectWise cannot replicate: hardware-bound machine IDs, mandatory self-hosting, and code transparency. But ConnectWise has remote control, 100+ integrations, SOC2 certification, and a decade of enterprise polish. RedFlag is competitive for the self-hosted update management niche: the subset of ConnectWise functionality that homelabbers and small IT teams actually use. It is not, and should not try to become, a full RMM platform. The strategic path is: own the update management vertical completely, then expand.

---

## Section 2: What Was Built As Planned

### 2.1 Pull-Based Agent Architecture

**Planned (Starting Prompt):** Agent polls server every 5 minutes via `GET /agents/{id}/commands`. Server never initiates connections to agents.

**Built:** Exactly as planned. Agent polls via `GET /agents/:id/commands` with configurable interval (default 300 seconds). Proportional jitter added (B-2 fix: `maxJitter = pollingInterval/2`). Server has no outbound connection capability.

**Location:** `aggregator-agent/cmd/agent/main.go:960-1070` (poll loop), `aggregator-server/internal/api/handlers/agents.go` (GetCommands handler)

**Status:** Working. Enhanced beyond spec with jitter and exponential backoff on failures.

---

### 2.2 Ed25519 Command Signing

**Planned (Security.md):** All commands in DB must be Ed25519-signed before being sent to agents.
`signAndCreateCommand()` implemented in handlers.

**Built:** Exactly as planned, then enhanced. v3 signed message format includes `agent_id:cmd_id:type:sha256(params):timestamp`. Key ID and SignedAt tracked per command (migration 025). 29+ call sites use `signAndCreateCommand()`.

**Location:** `aggregator-server/internal/services/signing.go`, `aggregator-server/internal/api/handlers/agents.go:49-77`

**Status:** Working. Exceeds original spec.

---

### 2.3 Machine ID Binding

**Planned (Security.md section 3.1):** `MachineBindingMiddleware` validates `X-Machine-ID` header against `agents.machine_id`. Mismatch = 403.

**Built:** Implemented. Machine ID is SHA256 hash of hardware identifiers (D-1 fix made this canonical across all platforms). Middleware validates on authenticated agent routes. Admin rebind endpoint added for recovery.

**Location:** `aggregator-server/internal/api/middleware/machine_binding.go`, `aggregator-agent/internal/system/machine_id.go`

**Status:** Working. Enhanced with canonical hash format and rebind capability.

---

### 2.4 Three-Tier Token Authentication

**Planned (Security.md section 2):** Registration tokens (one-time/multi-seat) -> JWT access tokens (24h) -> Refresh tokens (90-day sliding window, stored as SHA-256 hash).

**Built:** Exactly as planned. Registration tokens consumed at `/agents/register` with seat limits. JWT issued with `issuer=redflag-agent` (A-3 fix). Refresh tokens stored hashed, renewed via `/agents/renew`. Token renewal wrapped in database transaction (B-2 fix).

**Location:** `aggregator-server/internal/api/handlers/auth.go`, `aggregator-server/internal/api/middleware/auth.go`, `aggregator-server/internal/database/queries/refresh_tokens.go`

**Status:** Working. Transaction safety added beyond original spec.

---

### 2.5 Replay Attack Protection (Nonce System)

**Planned (Security.md section 3.3):** Unique, time-limited, Ed25519-signed nonce for every sensitive command. Agent validates signature and timestamp.

**Built:** Implemented via `UpdateNonceService` with `Generate()` and `Validate()`. Nonce format: `uuid:unix_timestamp`, signed with Ed25519. Agent validates freshness within configurable window.

**Location:** `aggregator-server/internal/services/nonce_service.go`, `aggregator-agent/cmd/agent/subsystem_handlers.go:1074` (validateNonce)

**Status:** Working. Timeout is 10 minutes (not 5 as specified in Overview.md; see VD-006).

---

### 2.6 Agent Self-Registration Flow

**Planned (Starting Prompt):** Agent POSTs to `/agents/register` with hostname, OS info, version. Server returns agent_id, token, config.

**Built:** Exactly as planned plus enhancements. Registration includes machine_id (SHA256 hash) and public_key_fingerprint. Registration wrapped in database transaction (B-2 fix). Agent stores agent_id, token, refresh_token, check_in_interval to config file.

**Location:** `aggregator-agent/cmd/agent/main.go:371-483` (registerAgent), `aggregator-server/internal/api/handlers/agents.go` (RegisterAgent)

**Status:** Working.

---

### 2.7 Package Scanner Architecture

**Planned (Starting Prompt):** APT, DNF/YUM, AUR, Winget, Windows Update, Docker, Snap, Flatpak, Homebrew.

**Built:** APT, DNF, Winget, Windows Update, Docker, Storage metrics. Each scanner has circuit breaker protection and configurable timeouts.

**Not built:** AUR, Snap, Flatpak, Homebrew. See Section 5A.

**Location:** `aggregator-agent/internal/scanner/` (apt.go, dnf.go, winget.go, docker.go), `aggregator-agent/pkg/windowsupdate/`

**Status:** Working for implemented platforms. 5 of 9 planned scanners built (APT, DNF, Winget, Windows Update, Docker); the storage metrics scanner is an addition that was not in the planned list.

---

### 2.8 Circuit Breaker Pattern

**Planned (ETHOS.md principle 3, Overview.md):** Circuit Breaker on fragile scanners (Windows Update, DNF).

**Built:** Full implementation with Closed/Open/HalfOpen states, configurable failure threshold, failure window, open duration, and half-open attempts. Applied to all scanners via subsystem config.

**Location:** `aggregator-agent/internal/circuitbreaker/circuit_breaker.go`, config per subsystem in `config.json`

**Status:** Working.

---

### 2.9 Command Acknowledgment System

**Planned (Overview.md):** `pending_acks.json` for at-least-once delivery guarantee.

**Built:** `acknowledgment.Tracker` package with persistent pending acks, retry with `IncrementRetry()`, and state file persistence.

**Location:** `aggregator-agent/internal/acknowledgment/`

**Status:** Working.

---

### 2.10 Agent Service Management

**Planned (Overview.md):** systemd on Linux, Windows Services (SCM) on Windows.

**Built:** Both. Linux systemd unit generated inline in installer template with security hardening (ProtectSystem=strict, ProtectHome=true, PrivateTmp=true). Windows SCM registration via `InstallService()` with auto-start and recovery actions.

**Location:** `aggregator-agent/internal/service/windows.go:438-516`, installer templates in `aggregator-server/internal/services/templates/install/scripts/`

**Status:** Working.

---

### 2.11 Web Dashboard

**Planned (Starting Prompt):** React 18 + TypeScript, TailwindCSS, TanStack Query, Recharts, TanStack Table, WebSocket.

**Built:** React + TypeScript + TailwindCSS + TanStack Query. Dashboard with agent list, update management, command history, security settings. TypeScript strict compliance (0 errors after E-1b). No Recharts charts, no TanStack Table, limited WebSocket.

**Location:** `aggregator-web/src/`

**Status:** Working. Feature-complete for core use case but missing visualization features (charts, trend analysis).

---

### 2.12 PostgreSQL with Migration Runner

**Planned (Starting Prompt, ETHOS.md):** PostgreSQL database with idempotent migrations.

**Built:** PostgreSQL 16-alpine, custom migration runner in `database/db.go` with `schema_migrations` tracking table. 30 migrations (001-030) all using IF NOT EXISTS / ON CONFLICT DO NOTHING. Migration runner aborts fatally on failure (B-1 fix).

**Location:** `aggregator-server/internal/database/db.go`, `aggregator-server/internal/database/migrations/`

**Status:** Working.

---

### 2.13 Docker Compose Deployment

**Planned (Starting Prompt):** `docker-compose.yml` for quick start.

**Built:** Three-service compose: postgres (16-alpine), server (multi-stage Go build), web (nginx). Health checks, volume persistence, env_file support.

**Location:** `docker-compose.yml`, `config/.env.example`

**Status:** Working.

---

### 2.14 Installer Scripts

**Planned (various docs):** Install script served from `/install/:platform` endpoint.

**Built:** Linux bash and Windows PowerShell templates served dynamically. Linux installer creates system user, directories, sudoers (per-package-manager), systemd service with security hardening, registers agent, starts service. Windows installer creates directories, downloads binary, writes config, registers service. Both have arch auto-detection and checksum verification (Installer Fix 2).

**Location:** `aggregator-server/internal/services/templates/install/scripts/linux.sh.tmpl`, `windows.ps1.tmpl`

**Status:** Working. Idempotent.

---

### 2.15 Agent Self-Upgrade Pipeline

**Planned (Overview.md, P2-003):** 7-step pipeline: download, checksum verify, Ed25519 binary signature verify, backup, atomic install, service restart, watchdog confirmation.

**Built:** Exactly as specified. All 7 steps implemented with deferred rollback on any failure including watchdog timeout (5 minutes). Download has 5-minute timeout and 500MB size limit (U-8 fix).

**Location:** `aggregator-agent/cmd/agent/subsystem_handlers.go:575-762`

**Status:** Working. The v0.1.27 INVENTORY doc noted this was "FULLY IMPLEMENTED" despite the backlog claiming it was placeholder.

---

## Section 3: What Was Built Better Than Planned

### 3.1 Ed25519 Key Rotation with TTL

**Original:** Security.md noted key rotation was "TODO" with no implementation.
SETUP-SECURITY.md described a POST `/security/keys/rotate` API with 30-day grace period.

**Built:** TTL-based key caching with automatic refresh. Server registers primary key in `signing_keys` table (migration 025). Agent caches server public key at registration (TOFU), with configurable TTL refresh. Key ID tracked per command for audit trail.

**Why better:** Automatic TTL refresh is more reliable than manual API-triggered rotation. The manual rotation API was never needed because the system handles staleness automatically.

---

### 3.2 Command Signing v3 Format

**Original:** Commands signed with `cmd_id:type:sha256(params)`.

**Built:** v3 format includes `agent_id:cmd_id:type:sha256(params):timestamp`. Agent ID binding prevents command replay to different agents. Timestamp enables time-based expiry independent of DB state.

**Why better:** Prevents a class of relay attacks where a compromised agent could forward signed commands to other agents.

---

### 3.3 Machine ID Canonical SHA256 Hash

**Original:** Machine ID was inconsistent: registration fallback used `"unknown-" + hostname` (unhashed) while runtime used SHA256.

**Built (D-1):** All paths now use `GetMachineID()` which always returns a 64-character hex SHA256 hash. Registration aborts with `log.Fatalf` if machine ID cannot be obtained, with no unhashed fallback.

**Why better:** Eliminates format mismatch between registration and runtime that would cause 403 errors after restart.

---

### 3.4 Transaction Safety (B-Series)

**Original:** Not specified. Registration, command delivery, and token renewal were separate DB operations without transaction wrapping.

**Built (B-2):** Registration wrapped in `tx.Beginx()` with `defer tx.Rollback()`. Command delivery uses `SELECT FOR UPDATE SKIP LOCKED` (atomic claim). Token renewal wrapped in transaction. JWT generated after commit, not before.

**Why better:** Prevents partial registration state, command double-delivery race conditions, and orphaned tokens.
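
The ordering described in 3.4 (deferred rollback, commit, then token issuance) can be sketched without a database. This is a minimal illustration under stated assumptions, not the server's code: the `Tx` interface, `fakeTx`, `registerAgent`, and the statement names are all hypothetical stand-ins for whatever the real handlers use.

```go
package main

import (
	"errors"
	"fmt"
)

// Tx is a hypothetical stand-in for a database transaction handle.
type Tx interface {
	Exec(stmt string) error
	Commit() error
	Rollback() error
}

// fakeTx records calls so the ordering can be inspected without a database.
type fakeTx struct {
	failOn    string   // statement that should fail, if any
	log       []string // calls in the order they happened
	committed bool
}

func (t *fakeTx) Exec(stmt string) error {
	t.log = append(t.log, "exec:"+stmt)
	if stmt == t.failOn {
		return errors.New("simulated failure on " + stmt)
	}
	return nil
}

func (t *fakeTx) Commit() error {
	t.log = append(t.log, "commit")
	t.committed = true
	return nil
}

// Rollback has no effect after a successful commit, which is what makes
// the unconditional defer below safe.
func (t *fakeTx) Rollback() error {
	if !t.committed {
		t.log = append(t.log, "rollback")
	}
	return nil
}

// registerAgent runs every registration statement in one transaction and
// only issues a token once the commit has succeeded.
func registerAgent(tx Tx, issueToken func() string) (string, error) {
	defer tx.Rollback() // undoes partial work on any early return

	steps := []string{"insert_agent", "consume_registration_seat", "insert_refresh_token"}
	for _, stmt := range steps {
		if err := tx.Exec(stmt); err != nil {
			return "", fmt.Errorf("registration aborted: %w", err)
		}
	}
	if err := tx.Commit(); err != nil {
		return "", fmt.Errorf("commit failed: %w", err)
	}
	return issueToken(), nil // token exists only for committed registrations
}

func main() {
	good := &fakeTx{}
	token, err := registerAgent(good, func() string { return "jwt-placeholder" })
	fmt.Println(token, err, good.log)

	bad := &fakeTx{failOn: "consume_registration_seat"}
	token, err = registerAgent(bad, func() string { return "jwt-placeholder" })
	fmt.Println(token == "", err != nil, bad.log)
}
```

Issuing the token after the commit means a failed registration can never leave a client holding credentials for an agent row that was rolled back.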

---

### 3.5 Configurable Operational Timeouts

**Original:** 6 hardcoded timeout values in main.go and timeout.go.

**Built (E-1c):** All 6 values stored in `security_settings` table under `operational` category (migration 030). Read from DB at startup with hardcoded fallback. Zero-value protection prevents zero-duration tickers.

**Why better:** Administrators can tune timeouts via API without code changes or redeployment.

---

### 3.6 Binary Path Traversal Protection

**Original:** Not specified. `c.File(pkg.BinaryPath)` served DB-sourced paths without validation.

**Built (E-1c + Integration Verification):** Both `DownloadUpdatePackage` and `DownloadAgent` resolve paths via `filepath.Abs()` and validate against `REDFLAG_BINARY_STORAGE_PATH` using prefix check. Traversal attempts logged and return 403.

**Why better:** Defense in depth against DB compromise scenarios.

---

### 3.7 TypeScript Strict Compliance

**Original:** 217 TypeScript errors in `aggregator-web/src/`.

**Built (E-1b):** All 217 errors fixed. Zero `@ts-ignore` or `as any` suppressions added. Type interfaces verified against actual server JSON responses. TanStack Query v5 `isLoading` -> `isPending` migration for mutations.

**Why better:** Catches type mismatches at compile time instead of runtime.

---

### 3.8 Semver-Aware Version Comparison

**Original:** `versions.go:72` used lexicographic comparison (`agentVersion < current.MinAgentVersion`), making `"0.1.9" > "0.1.22"`.

**Built (Upgrade Fix):** `CompareVersions()` with octet-by-octet numeric parsing. Handles `"dev"` as always-older, `"v"` prefix stripping, mismatched octet counts.

**Why better:** Version gates now work correctly for all version numbers.

---

### 3.9 Test Suite Growth

**Original (Code Review):** "Only 3 test files across the entire codebase": `circuitbreaker_test.go`, `test_disk.go`, `test_disk_detection.go`, plus scheduler tests.

**Built:** 170 tests across 18 packages covering: Ed25519 signing and replay protection, JWT issuer validation, registration transactions, command delivery races, machine ID format, ETHOS compliance, path traversal, version comparison, checksum computation, timeout configuration, and more.

**Why better:** Regression detection for all fix series (A through Upgrade).

---

### 3.10 ETHOS Logging Compliance

**Original:** Mixed `fmt.Printf`, emoji in logs, inconsistent format.

**Built (D-2 + Integration Verification):** All production log statements use `log.Printf("[TAG] [system] [component] message key=value")`. Emoji removed from daemon log.Printf calls. `fmt.Printf` DEBUG statements removed from handlers. Terminal/CLI output emoji explicitly exempted (DEV-039).

---

### 3.11 Installer Architecture Detection

**Original:** Architecture hardcoded to `amd64` in `generateInstallScript`.

**Built (Installer Fix 2):** Runtime detection via `uname -m` (Linux) and `$env:PROCESSOR_ARCHITECTURE` (Windows). Server accepts optional `?arch=` query param. Download endpoint already supported `linux-arm64` and `windows-arm64`.

**Why better:** ARM64 homelabbers (Raspberry Pi, Apple Silicon VMs) can now install without manual binary download.

---

### 3.12 Binary Checksum Verification

**Original:** No verification of downloaded binary integrity.

**Built (Installer Fix 2):** Server computes SHA-256 and serves `X-Content-SHA256` header. Linux installer verifies with `sha256sum`. Windows installer verifies with `Get-FileHash`. Missing header = warn but continue (backward compatible).

---

### 3.13 Machine ID Rebind Endpoint

**Original:** Not specified. If machine ID changed (hardware replacement, VM migration), agent was permanently locked out.

**Built (D-1):** Admin endpoint `POST /admin/agents/:id/rebind-machine-id` allows re-binding an agent to new hardware. Requires admin authentication.
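
The checksum rule from Section 3.12 (verify when the `X-Content-SHA256` header is present, warn but continue when it is absent) is implemented in the shell and PowerShell installers. The same decision logic can be sketched in Go for illustration; `verifyChecksum`, `sha256Hex`, and `errChecksumMismatch` are hypothetical names, and treating the header value as a hex digest is an assumption.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

var errChecksumMismatch = errors.New("checksum mismatch")

// sha256Hex returns the lowercase hex SHA-256 digest of body.
func sha256Hex(body []byte) string {
	sum := sha256.Sum256(body)
	return hex.EncodeToString(sum[:])
}

// verifyChecksum compares the digest of body with the header value.
// It returns (false, nil) when the header is empty so the caller can
// warn and continue, matching the backward-compatible behavior.
func verifyChecksum(body []byte, headerValue string) (bool, error) {
	expected := strings.ToLower(strings.TrimSpace(headerValue))
	if expected == "" {
		return false, nil // older server without the header
	}
	actual := sha256Hex(body)
	if actual != expected {
		return false, fmt.Errorf("%w: expected %s, got %s", errChecksumMismatch, expected, actual)
	}
	return true, nil
}

func main() {
	binary := []byte("pretend agent binary")
	header := sha256Hex(binary) // what a server would send in X-Content-SHA256

	verified, err := verifyChecksum(binary, header)
	fmt.Println("matching digest:", verified, err)

	verified, err = verifyChecksum(binary, "")
	fmt.Println("missing header:", verified, err)

	_, err = verifyChecksum([]byte("tampered"), header)
	fmt.Println("tampered body rejected:", errors.Is(err, errChecksumMismatch))
}
```

Returning a separate "verified" flag keeps the missing-header case distinguishable from a successful check, so the caller can log the warning the report describes.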

---

## Section 4: What Was Built Differently (Deviations)

### VD-001: Logging Format

**Original (P3-006):** JSON structured logs with correlation IDs via logrus or similar library. `StructuredLogger` implementation, `CorrelationIDMiddleware`, buffered async writes, P95/P99 latency tracking, `system_logs` database table.

**Actual:** ETHOS `[TAG] [system] [component]` plain text format via `log.Printf`. No correlation IDs, no structured JSON, no centralized aggregation, no log database table.

**Rationale:** ETHOS principle #5 (no marketing fluff) and principle #1 (errors are history) were prioritized over the P3-006 spec. Plain text logging with consistent tags is grep-friendly and sufficient for the homelab use case. JSON structured logging adds complexity without proportional benefit at the current scale.

**Verdict:** Acceptable for homelab. Would need to be revisited for fleet-scale deployments (100+ agents).

---

### VD-002: Authentication Architecture

**Original (P0-006 + Starting Prompt):** Multi-user system with `users` table, admin/user/readonly roles, email fields, `EnsureAdminUser()`. The Starting Prompt shows Settings page with "users" section.

**Actual:** Single-admin via `.env` credentials. The `users` table exists in migrations (for compatibility) but is not used for authentication. Web auth validates against `REDFLAG_ADMIN_USER`/`REDFLAG_ADMIN_PASSWORD` from environment.

**Rationale:** P0-006 recommended "Option 1: Complete Removal", recognizing that multi-user scaffolding increased attack surface without benefit for a homelab tool. The current implementation follows this recommendation.

**Verdict:** Correct for homelab. Multi-user would be needed for MSP/enterprise use case.

---

### VD-003: Build Orchestrator

**Original (architecture docs):** Dynamic agent compilation per request. Build Orchestrator would cross-compile agent binaries on demand.

**Actual:** Pre-built binaries placed at container build time (via Dockerfile multi-stage build).
`BuildAndSignAgent` signs existing binaries but never compiles. `AgentBuilder` generates config JSON only. `build_orchestrator.go` services layer marked `// Deprecated`.

**Rationale:** Cross-compilation on every request is impractical for a homelab server. The Dockerfile multi-stage build compiles once; the server serves pre-built binaries. This is how production package distribution works (e.g., GitHub Releases).

**Verdict:** Correct pragmatic simplification. Dynamic compilation would add complexity without benefit.

---

### VD-004: Upgrade Trigger Path

**Original:** `POST /build/upgrade/:agentID` was meant to orchestrate full upgrades.

**Actual:** The real upgrade path is `POST /agents/{id}/update` (in `agent_updates.go`), which validates the agent, generates nonces, creates signed `update_agent` commands, and tracks delivery. The `/build/upgrade` endpoint generates config JSON with manual instructions; it is an admin utility, not the upgrade orchestrator.

**Rationale:** The `/agents/{id}/update` path already existed and was more complete (nonce generation, command signing, delivery tracking). Wiring a parallel path would have created confusion.

**Verdict:** Acceptable. The working path is better designed.

---

### VD-005: Security Settings UI

**Original (SECURITY-SETTINGS.md):** Full security settings configurable from dashboard including machine binding mode, version enforcement, nonce timeout, signature algorithm, log level, alert thresholds.

**Actual:** Security settings backend works (API CRUD for `security_settings` table). Dashboard displays settings for `command_signing`, `update_signing`, `nonce_validation`, `machine_binding`, `signature_verification`. The `operational` category (E-1c timeouts) is accessible via API but not visible in the UI. No validation rules enforcement for operational settings.

**Verdict:** Partially implemented. Backend complete, frontend shows security-category settings. Operational settings need UI exposure.
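
For the `operational` settings mentioned above, the zero-value protection with hardcoded fallback described in Section 3.5 might look like the following sketch. The function name, the setting keys, and the fallback value are hypothetical; only the idea (never hand a non-positive duration to a ticker, since `time.NewTicker` panics on one) comes from the report and the Go standard library.

```go
package main

import (
	"fmt"
	"time"
)

// operationalTimeout converts a stored seconds value into a duration and
// falls back to a hardcoded default when the value is missing or invalid.
func operationalTimeout(storedSeconds int, fallback time.Duration) time.Duration {
	if storedSeconds <= 0 {
		return fallback // zero or negative values would break tickers
	}
	return time.Duration(storedSeconds) * time.Second
}

func main() {
	// Hypothetical values as they might be read from the settings table.
	stored := map[string]int{"offline_check_interval": 0, "command_timeout": 7200}
	fallback := 2 * time.Minute

	for _, key := range []string{"offline_check_interval", "command_timeout"} {
		d := operationalTimeout(stored[key], fallback)
		fmt.Printf("%s=%s\n", key, d)
	}

	// Safe even when the stored value was zero.
	ticker := time.NewTicker(operationalTimeout(stored["offline_check_interval"], fallback))
	ticker.Stop()
}
```

A guard like this only protects against unusable values; range validation for operational settings, which the report lists as missing, would be a separate check.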

---

### VD-006: Nonce Timeout

**Original (Overview.md, Security.md):** Nonce lifetime "< 5 minutes".

**Actual (SETUP-SECURITY.md, code):** `REDFLAG_SECURITY_NONCE_TIMEOUT=600` (10 minutes). The code uses a 10-minute default.

**Rationale:** The original docs contradict each other: Overview.md says "< 5 min" while SETUP-SECURITY.md says 600 seconds. The 10-minute value appears to be a deliberate choice to accommodate slow network conditions (agents polling every 5 minutes may not receive the command within a 5-minute nonce window).

**Verdict:** The 10-minute value is more practical. The 5-minute spec was likely aspirational. Document this as intentional.

---

### VD-007: Key Rotation API

**Original (SETUP-SECURITY.md):** `POST /api/v1/security/keys/rotate` with `grace_period_days` (default 30). During grace period both old and new keys valid. Keys stored in `/app/keys/` directory.

**Actual:** Key rotation is TTL-based via the signing key registry in `signing_keys` table. No explicit `/keys/rotate` API endpoint. No dual-key grace period; agents refresh their cached public key via TTL (24h default). Key stored in environment variable, not in `/app/keys/` directory.

**Rationale:** Different implementation approach that achieves the same goal (agents can handle key changes) without the complexity of dual-key acceptance windows.

**Verdict:** Functional but different. A manual key rotation API would be a nice-to-have for planned rotations.

---

### VD-008: Version Format

**Original (SECURITY-SETTINGS.md):** "Semantic version string (X.Y.Z), integers only, no v prefix."

**Actual:** Four-octet format `X.Y.Z.W` where W is the config version (e.g., `0.1.26.0`). The `v` prefix is tolerated and stripped during comparison.

**Rationale:** The fourth octet was added to embed config schema version alongside the agent version, avoiding a separate version field.

**Verdict:** Acceptable extension of spec. `CompareVersions()` handles both 3-octet and 4-octet formats.
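
An octet-by-octet comparison that tolerates the `v` prefix, treats `"dev"` as always older, and accepts both 3-octet and 4-octet versions can be sketched as below. This is not the project's `CompareVersions()`; the lowercase names are hypothetical, and two behaviors are assumptions made for the sketch: missing octets count as 0 and non-numeric octets parse as 0.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseOctets strips an optional "v" prefix and parses dot-separated integers.
// Assumption: a non-numeric octet is treated as 0.
func parseOctets(v string) []int {
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	out := make([]int, len(parts))
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			n = 0
		}
		out[i] = n
	}
	return out
}

// compareVersions returns -1, 0, or 1 when a is older than, equal to,
// or newer than b.
func compareVersions(a, b string) int {
	// "dev" builds sort before every numbered release.
	if a == "dev" || b == "dev" {
		switch {
		case a == b:
			return 0
		case a == "dev":
			return -1
		default:
			return 1
		}
	}
	x, y := parseOctets(a), parseOctets(b)
	n := len(x)
	if len(y) > n {
		n = len(y)
	}
	for i := 0; i < n; i++ {
		xi, yi := 0, 0 // assumption: a missing octet counts as 0
		if i < len(x) {
			xi = x[i]
		}
		if i < len(y) {
			yi = y[i]
		}
		if xi < yi {
			return -1
		}
		if xi > yi {
			return 1
		}
	}
	return 0
}

func main() {
	fmt.Println(compareVersions("0.1.9", "0.1.22"))     // -1: numeric, not lexicographic
	fmt.Println(compareVersions("v0.1.26", "0.1.26.0")) // 0: prefix stripped, missing octet is 0
	fmt.Println(compareVersions("dev", "0.0.1"))        // -1: dev is always older
}
```

Under these assumptions `0.1.26` and `0.1.26.0` compare as equal, which is one reasonable reading of "handles both 3-octet and 4-octet formats"; the real function may choose differently.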

---

### VD-009: Windows Service Key Rotation

**Original:** Not explicitly specified, but key rotation logic exists in `main.go` polling loop.

**Actual (DEV-030):** The Windows service polling loop in `windows.go` does not call `ShouldRefreshKey`. The comment at lines 164-168 acknowledges this as a TODO. Agents running as Windows services rely on the 24h TTL key cache and will not proactively detect key rotation.

**Verdict:** Known gap. Low risk: the 24h TTL cache means Windows agents will naturally pick up new keys within a day.

---

### VD-010: Watchdog Version Comparison

**Original:** Not explicitly specified.

**Actual:** The upgrade watchdog in `subsystem_handlers.go:943` uses string equality (`agent.CurrentVersion == expectedVersion`) instead of `CompareVersions()`. Two strings that denote the same version but differ in form (e.g., `"v0.1.4"` vs `"0.1.4"`) would trigger a false rollback.

**Verdict:** Low risk: both sides use the same version string from the same source. Would need fixing if version normalization is ever introduced.

---

## Section 5: What Was Never Built

### 5A. Platform Support

| Feature | Original Priority | Effort Estimate | Notes |
|---------|------------------|-----------------|-------|
| macOS agent / launchd | Starting Prompt | 2-3 days | No launchd plist, no macOS-specific code |
| Homebrew scanner | Starting Prompt | 1-2 days | Would follow APT/DNF pattern |
| AUR scanner (Arch) | Starting Prompt | 1-2 days | Would follow APT/DNF pattern |
| Snap scanner | Starting Prompt | 1 day | Low demand |
| Flatpak scanner | Starting Prompt | 1 day | Low demand |
| aggregator-cli (Go CLI) | Starting Prompt | 3-5 days | Power-user tool, not essential for homelab |

### 5B. AI Features

| Feature | Original Priority | Effort Estimate | Notes |
|---------|------------------|-----------------|-------|
| AI Chat Sidebar (Ollama/OpenAI) | Starting Prompt | 2-3 weeks | No AI code exists anywhere |
| Natural language queries (`POST /ai/query`) | Starting Prompt | 1-2 weeks | Requires AI sidebar first |
| AI-assisted scheduling (`POST /ai/schedule`) | Starting Prompt | 1 week | Requires maintenance windows first |
| AI decision audit trail (`GET /ai/decisions`) | Starting Prompt | 3-5 days | Requires AI features first |

**Honest assessment:** AI features were aspirational in the Starting Prompt ("Future Phase"). They add significant complexity and operational overhead (Ollama requires GPU resources or external API costs). For a homelab tool, manual approval is more appropriate than AI-assisted scheduling. Recommend deferring indefinitely unless Fimeg has a specific use case.

### 5C. Scheduling & Automation

| Feature | Original Priority | Effort Estimate | Notes |
|---------|------------------|-----------------|-------|
| Maintenance Windows (RRULE recurrence) | Starting Prompt, P3 | 2-3 weeks | Full RRULE parser + calendar UI + auto-approve logic |
| Auto-approve by severity during windows | Starting Prompt | 1 week | Requires maintenance windows |
| Scheduled update execution | Starting Prompt | 1 week | Requires maintenance windows |
| Staggered rollout (5%/25%/100%) | P2-003, Strategic Roadmap | 1-2 weeks | Server-side group selection + phased command queuing |
| Auto-upgrade trigger (version-based) | Upgrade Audit | 1 week | Server detects old version on check-in, queues update_agent |

**Honest assessment:** Maintenance windows are the highest-value unbuilt feature for production use. Auto-approve by severity during defined windows would significantly reduce manual work.

### 5D. Observability

| Feature | Original Priority | Effort Estimate | Notes |
|---------|------------------|-----------------|-------|
| Structured JSON logging (P3-006) | P3 | 3-4 days | logrus + correlation IDs |
| Correlation ID propagation | P3-006 | 2-3 days | Middleware + header propagation |
| Update Metrics Dashboard (P3-003) | P3 | 2-3 days | Success/failure rates, trend charts |
| Server Health Dashboard (P3-005) | P3 | 2-3 days | CPU, memory, DB connections |
| Prometheus metrics endpoint | Strategic Roadmap | 2-3 days | /metrics endpoint with Go prometheus client |
| Real-time WebSocket updates | Starting Prompt | 1-2 weeks | Partial: security events WebSocket exists |

### 5E. Integration & Ecosystem

| Feature | Original Priority | Effort Estimate | Notes |
|---------|------------------|-----------------|-------|
| LDAP/Active Directory | Strategic Roadmap | 2-3 weeks | Auth integration |
| SAML/OIDC for SSO | Strategic Roadmap | 2-3 weeks | Requires multi-user first |
| Slack/Teams/PagerDuty webhooks | Strategic Roadmap | 1-2 weeks | Event notification hooks |
| Compliance reporting (SOX, HIPAA) | Strategic Roadmap | 4-6 weeks | Report generation framework |
| Kubernetes deployment | Strategic Roadmap | 1-2 weeks | Helm chart + StatefulSet |
| Ansible/Terraform integrations | Strategic Roadmap | 2-3 weeks | Module/provider development |

**Honest assessment:** Webhooks (Slack/Teams) are the highest-value integration for homelab use. LDAP/SSO only matters if Fimeg plans to support multi-user deployments.

### 5F. UI Features

| Feature | Original Priority | Effort Estimate | Notes |
|---------|------------------|-----------------|-------|
| Security Status Dashboard Indicators (P3-002) | P3 | 2-3 days | Color-coded security health scores |
| Token Management UI Enhancement (P3-004) | P3 | 1-2 days | Delete tokens, bulk operations |
| Server Health Dashboard (P3-005) | P3 | 2-3 days | System status monitoring |
| Operational settings in UI | E-1c carry-over | 1 day | Add 'operational' category to SecuritySettings.tsx |
| Update metrics and trend charts | P3-003 | 2-3 days | Recharts integration |
| Calendar view for maintenance windows | Starting Prompt | 1-2 weeks | Requires maintenance windows backend |

### 5G. Security Features

| Feature | Original Priority | Effort Estimate | Notes |
|---------|------------------|-----------------|-------|
| Multi-factor authentication | Strategic Roadmap | 1-2 weeks | TOTP integration |
| API key rotation via UI | SETUP-SECURITY.md | 2-3 days | Manual rotation endpoint |
| Key rotation with grace period | SETUP-SECURITY.md | 1 week | Dual-key acceptance window |
| TLS hardening (remove bypass flag) | Code Review | 1 hour | Remove `--insecure-tls` flag |
| JWT secret minimum strength | Code Review | 30 min | Validation in config loading |

---

## Section 6: Backlog Status Table

| ID | Title | Original Priority | Current Status | Where Fixed/Notes |
|----|-------|------------------|----------------|-------------------|
| P0-001 | Rate Limit First Request Bug | P0 | FIXED | v0.1.26 per v0.1.27 Inventory; rate limiter namespaced by type in A-3 fixes |
| P0-002 | Session Loop Bug | P0 | PARTIALLY FIXED | SetupCompletionChecker modified in E-1b (removed isSetupMode state); may need live verification |
| P0-003 | Agent No Retry Logic | P0 | FIXED | v0.1.27 per Inventory + culurien B-2 (exponential backoff with full jitter, proportional polling jitter) |
| P0-004 | Database Constraint Violation | P0 | FIXED | v0.1.27 per Inventory; timeout service now uses 'failed' result status (check constraint compatible) |
| P0-005 | Setup Flow Broken | P0 | NOT VERIFIED | Setup handler exists but end-to-end flow not tested in culurien branch. May still have issues |
| P0-006 | Single-Admin Architecture | P0 | ACCEPTED | Decision made: single-admin via .env. Users table exists for compatibility but not used for auth |
| P0-007 | Install Script Path Variables | P0 | FIXED | 2025-12-17 per backlog + verified in Installer Fix 1 (config path consistency) |
| P0-008 | Migration Runs on Fresh Install | P0 | FIXED | 2025-12-17 per backlog; early return in detection.go for empty agent_id |
| P0-009 | Storage Scanner Wrong Table | P0 | NOT DONE | Storage scanner still on legacy interface. Dedicated storage_metrics table exists but scanner reports to update_packages |
| P1-001 | Agent Install ID Parsing | P1 | PARTIALLY FIXED | extractOrGenerateAgentID() in install_template_service.go validates UUID format; but query param handling may still have edge cases |
| P1-002 | Agent Timeout Handling | P1 | PARTIALLY FIXED | E-1c made timeouts configurable from DB; per-scanner timeouts exist in config but generic 45s timeout may still apply in some paths |
| P2-001 | Binary URL Architecture Mismatch | P2 | FIXED | Installer Fix 2 added arch detection; templates override download URL with detected architecture |
| P2-002 | Migration Error Reporting | P2 | NOT DONE | Migration errors still only logged locally; no server-side visibility |
| P2-003 | Agent Auto-Update System | P2 | FIXED | Fully implemented (was incorrectly marked placeholder in backlog); verified in Upgrade Audit |
| P3-001 | Duplicate Command Prevention | P3 | FIXED | v0.1.27; unique index on (agent_id, command_type, status) WHERE status = 'pending' |
| P3-002 | Security Status Dashboard | P3 | PARTIALLY DONE | Security overview endpoints exist; no color-coded health scores or per-agent security badges |
| P3-003 | Update Metrics Dashboard | P3 | NOT DONE | No metrics dashboard, no trend charts |
| P3-004 | Token Management UI Enhancement | P3 | PARTIALLY DONE | Token list with copy-install-command exists; no delete, no bulk operations, no status filtering |
| P3-005 | Server Health Dashboard | P3 | NOT DONE | No health dashboard |
| P3-006 | Structured Logging System | P3 | NOT DONE (alternative) | ETHOS [TAG] format used instead of JSON structured logging. See VD-001 |
| P4-001 | Agent Retry Logic Resilience | P4 | FIXED | v0.1.27 per Inventory + culurien B-2 (exponential backoff, circuit breakers) |
| P4-002 | Scanner Timeout Optimization | P4 | PARTIALLY DONE | Configurable per-subsystem timeouts in config; E-1c made server-side timeouts configurable from DB |
| P4-003 | Agent File Management Migration | P4 | PARTIALLY DONE | MigrationExecutor exists with old-path detection; constants/paths.go standardized; validation/pathutils packages have compile errors (dead code) |
| P4-004 | Directory Path Standardization | P4 | FIXED | constants/paths.go provides canonical paths; windows.go fixed to use constants.GetAgentConfigPath() (Installer Fix 1); installer templates use standard paths |
| P4-005 | Testing Infrastructure Gaps | P4 | SIGNIFICANTLY IMPROVED | From ~3 test files to 170 tests across 18 packages; no CI/CD yet |
| P4-006 | Architecture Documentation Gaps | P4 | PARTIALLY DONE | 30+ docs in culurien docs/ folder; no formal architecture diagrams or ADRs |
| P5-001 | Security Audit Documentation | P5 | NOT DONE | No security audit checklist, IR procedures, or compliance mapping |
| P5-002 | Development Workflow Documentation | P5 | PARTIALLY DONE | .env.example created; no PR template, debugging guide, or release process |

**Summary (28 items):** 10 FIXED, 9 PARTIALLY FIXED or PARTIALLY DONE, 5 NOT DONE, 1 NOT DONE (alternative approach), 1 SIGNIFICANTLY IMPROVED, 1 ACCEPTED (design decision), 1 NOT VERIFIED.

---

## Section 7: Architecture Health Assessment

### 7A. Authentication Stack

| Component | Spec | Actual | Match |
|-----------|------|--------|-------|
| Registration tokens | One-time or multi-seat | Implemented with seat limits | YES |
| JWT 24h expiry | Short-lived JWT | Implemented with issuer-based validation (A-3) | YES |
| Refresh tokens 90-day | Sliding window, SHA-256 hash | Implemented, renewal in transaction (B-2) | YES |
| Machine ID binding | `X-Machine-ID` header, 403 on mismatch | Implemented with canonical SHA256 hash (D-1) | YES |

### 7B. Command Flow

| Component | Spec | Actual | Match |
|-----------|------|--------|-------|
| Pull-only | Agents always initiate | Confirmed; server has no outbound capability | YES |
| 5-minute check-in | Configurable interval | Default 300s, configurable via config.json | YES |
| Command types | scan_updates, collect_specs, install_updates, rollback_update, update_agent | All present in models/command.go plus enable/disable_heartbeat, reboot, dry_run_update, confirm_dependencies | YES+ |
| Acknowledgment | pending_acks.json | acknowledgment.Tracker with persistence and retry | YES |

### 7C. Security Stack

| Component | Spec | Actual | Match |
|-----------|------|--------|-------|
| Ed25519 signing | Binary + command signing | Both implemented; v3 format exceeds spec | YES+ |
| Nonce validation | < 5 min lifetime, anti-replay | 10-minute default (VD-006), otherwise matches | CLOSE |
| TOFU key caching | Fetch once at registration | Implemented with TTL refresh | YES+ |

### 7D.
Agent Paths | Platform | Spec Path | Actual | Match | |----------|-----------|--------|-------| | Linux config | `/etc/redflag/config.json` | `/etc/redflag/agent/config.json` (constants.GetAgentConfigPath) | CLOSE — subdir added | | Linux state | `/var/lib/redflag/` | `/var/lib/redflag/agent/` | CLOSE — subdir added | | Linux binary | `/usr/local/bin/redflag-agent` | `/usr/local/bin/redflag-agent` | YES | | Windows config | `C:\ProgramData\RedFlag\config.json` | `C:\ProgramData\RedFlag\agent\config.json` (fixed in Installer Fix 1) | CLOSE — subdir added | The `agent` subdirectory was added to support future multi-component deployments (agent + server on same machine). This is a reasonable structural enhancement. ### 7E. Migration System | Component | Spec | Actual | Match | |-----------|------|--------|-------| | MigrationExecutor | Present | Implemented in `aggregator-agent/internal/migration/` | YES | | Old path migration | `/etc/aggregator/` -> `/etc/redflag/` | Detection and backup implemented in installer templates and migration executor | YES | --- ## Section 8: The Honest Roadmap ### HIGH VALUE, LOW EFFORT (Quick Wins) 1. **JWT secret minimum strength** (30 min) — Add `len(secret) < 32` check in config loading. Addresses Code Review finding. 2. **TLS bypass flag removal** (1 hour) — Remove `--insecure-tls` flag from agent. Forces TLS in production. 3. **Operational settings in UI** (1 day) — Add `operational` category to SecuritySettings.tsx component. Makes timeout tuning accessible from dashboard. 4. **Token delete button** (1-2 days) — P3-004. DELETE endpoint + confirmation dialog. Currently requires DB manual cleanup. 5. **`/api/v1/info` ldflags injection** (1 hour) — Ensure Dockerfile passes `-ldflags` with actual version strings. Currently defaults to "dev". ### HIGH VALUE, HIGH EFFORT (Strategic Investments) 1. **Maintenance Windows** (2-3 weeks) — The single most impactful unbuilt feature. Enables scheduled patching during safe hours. 
Without this, every update requires manual approval at execution time. 2. **Webhook notifications** (1-2 weeks) — Slack/Teams alerts on critical update availability, failed installations, agent offline. Low integration overhead, high operational value. 3. **Staggered rollout** (1-2 weeks) — Deploy updates to 5% canary, monitor, then 25%, then 100%. Essential for fleets > 10 agents. 4. **macOS agent** (2-3 days) — launchd plist template + Homebrew scanner. Completes the "cross-platform" promise for homelabbers with Macs. ### LOW VALUE (Defer or Drop) 1. **AI features** — Drop entirely for foreseeable future. Adds operational complexity (GPU/API costs) without proportional benefit for homelab use case. Manual approval is more appropriate. 2. **aggregator-cli** — Defer. The web dashboard covers all use cases. CLI would be nice-to-have for scripting but is not essential. 3. **AUR/Snap/Flatpak scanners** — Defer. Very small user base for each. APT and DNF cover 95%+ of Linux homelabbers. 4. **LDAP/SSO** — Defer until multi-user is needed. Single-admin is correct for homelab. 5. **Compliance reporting** — Drop. SOX/HIPAA requirements don't apply to homelabs. 6. **Kubernetes deployment** — Defer. Docker Compose is the right deployment model for the target audience. 
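The first Quick Win in this section (a `len(secret) < 32` check in config loading) can be illustrated with a minimal Go sketch. The names `validateJWTSecret`, `minJWTSecretLen`, and `ErrWeakJWTSecret` are hypothetical and chosen only for illustration; they are not identifiers from the RedFlag codebase, and the real config loader may be structured differently.

```go
package main

import (
	"errors"
	"fmt"
)

// minJWTSecretLen is the 32-byte floor named in the Quick Wins list.
const minJWTSecretLen = 32

// ErrWeakJWTSecret is a hypothetical sentinel error for this sketch.
var ErrWeakJWTSecret = errors.New("jwt secret too short")

// validateJWTSecret is a hypothetical helper that a config loader could call
// at startup. It rejects secrets shorter than minJWTSecretLen bytes.
func validateJWTSecret(secret string) error {
	if len(secret) < minJWTSecretLen {
		return fmt.Errorf("%w: got %d bytes, need at least %d",
			ErrWeakJWTSecret, len(secret), minJWTSecretLen)
	}
	return nil
}

func main() {
	// One obviously weak secret and one 32-character secret.
	candidates := []string{"changeme", "0123456789abcdef0123456789abcdef"}
	for _, s := range candidates {
		if err := validateJWTSecret(s); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted")
	}
}
```

Failing fast at startup, rather than logging a warning, would keep a weak secret from ever signing a token. Note that `len` on a Go string counts bytes, so this checks length only and says nothing about entropy.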
---

## Section 9: Summary Table

| Feature Area | Planned | Built | Status | Gap Rating |
|-------------|---------|-------|--------|------------|
| Core Architecture (pull model, agents, server) | Full | Full | Working | 0 (complete) |
| Ed25519 Signing (commands, binaries) | Full | Full + enhancements | Working | 0 (exceeds spec) |
| Authentication (tokens, JWT, refresh) | Full | Full + transactions | Working | 0 (exceeds spec) |
| Machine ID Binding | Full | Full + canonical hash | Working | 0 (exceeds spec) |
| Replay Protection (nonces) | Full | Full (10min vs 5min) | Working | 1 (timeout deviation) |
| Package Scanning (6 of 9 scanners) | 9 scanners | 6 scanners | Working | 3 (AUR, Snap, Flatpak, Homebrew missing) |
| Agent Self-Upgrade | Full | Full 7-step pipeline | Working | 0 (complete) |
| Installer (Linux + Windows) | Full | Full + arch + checksum | Working | 1 (macOS missing) |
| Web Dashboard | Full | Core features | Working | 3 (missing charts, health, metrics) |
| Database + Migrations | Full | Full + hardened | Working | 0 (exceeds spec) |
| Docker Deployment | Full | Full | Working | 0 (complete) |
| Testing | Minimal | 170 tests | Working | 2 (no CI/CD, no integration tests against real DB) |
| Maintenance Windows | Full | None | Not built | 10 (completely absent) |
| AI Features | Full | None | Not built | 10 (deliberately deferred) |
| Scheduling & Automation | Full | None | Not built | 8 (no maintenance windows, no staggered rollout) |
| LDAP/SSO | Planned | None | Not built | 5 (not needed for homelab) |
| Structured Logging | Planned (P3-006) | ETHOS alternative | Working differently | 3 (functional but not JSON/correlation IDs) |
| Compliance / Reporting | Planned | None | Not built | 2 (not applicable to homelab) |
| CLI Tool | Planned | None | Not built | 2 (dashboard covers use cases) |
| macOS Support | Planned | None | Not built | 4 (matters for homelabbers with Macs) |

**Overall: Core infrastructure is 9/10. Feature breadth is 5/10. Production readiness for homelab is 7/10.**

The gap is almost entirely in features that were always labeled "future" or "Phase 2" in the original docs. The core architecture, which was the hard engineering work, is built, tested, and hardened beyond the original specification.