Redflag/docs/Vision_vs_Reality_Deviation_Report.md
jpetree331 3526962333 docs: Vision vs Reality comprehensive deviation report
Complete comparison of Fimeg's original vision against
current codebase state after culurien branch work.

- 9 sections covering architecture, features, backlog
- 10 deviations documented (VD-001 through VD-010)
- 27 backlog items tracked with current status
- Honest roadmap with prioritized next steps
- Executive summary for quick reference

Core: 9/10. Features: 5/10. Homelab ready: 7/10.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 19:04:49 -04:00

Vision vs Reality: RedFlag Deviation Report

Date: 2026-03-29
Branch: culurien (post-integration verification)
Baseline: v0.1.27 (last Fimeg commit before culurien work)
Author: Claude (automated analysis based on complete codebase and historical documentation)


Section 1: Executive Summary

RedFlag began as "Aggregator" — Fimeg's vision for a self-hosted ConnectWise alternative that would give homelabbers centralized update management with "single pane of glass" visibility across Windows, Linux, and macOS. The Starting Prompt described an ambitious platform with AI-assisted scheduling, maintenance windows, natural language queries, and cross-platform agents covering APT, DNF, AUR, Snap, Flatpak, Winget, Windows Update, Docker, and Homebrew.

What exists today is a narrower but more deeply engineered system than the original breadth implied. The core architecture — pull-based agents, Ed25519 signed commands, machine ID binding, three-tier token authentication — is not just present but has been hardened well beyond the original specification. The culurien branch added transaction safety, replay attack protection, key rotation, configurable timeouts, path traversal defense, semver-aware version comparison, and grew the test suite from approximately 3 test files to 170 passing tests across 18 packages.

The honest gap: approximately 40% of the originally envisioned feature surface was never built. macOS support, AI features, maintenance windows, scheduled rollouts, structured JSON logging, LDAP/SSO, compliance reporting, the CLI tool — none exist. The features that do exist (package scanning, command execution, agent self-upgrade, installer infrastructure) work correctly and are production-quality for the homelab use case.

Is it production-ready for Fimeg's homelab? Yes, with caveats. The system can install agents on Linux and Windows, scan for package updates, approve and execute updates from the dashboard, and self-upgrade agents — all with cryptographic verification. The caveats are: the setup flow has known friction (P0-005), the /api/v1/info endpoint returns build-time values that default to "dev" without proper ldflags injection, and several P0 backlog items from the original backlog have been fixed but not all have been verified end-to-end in a live deployment.

The "scare ConnectWise" ambition requires honest framing. RedFlag has three architectural advantages ConnectWise cannot replicate: hardware-bound machine IDs, mandatory self-hosting, and code transparency. But ConnectWise has remote control, 100+ integrations, SOC2 certification, and a decade of enterprise polish. RedFlag is competitive for the self-hosted update management niche — the subset of ConnectWise functionality that homelabbers and small IT teams actually use. It is not, and should not try to become, a full RMM platform. The strategic path is: own the update management vertical completely, then expand.


Section 2: What Was Built As Planned

2.1 Pull-Based Agent Architecture

Planned (Starting Prompt): Agent polls server every 5 minutes via GET /agents/{id}/commands. Server never initiates connections to agents.

Built: Exactly as planned. Agent polls via GET /agents/:id/commands with configurable interval (default 300 seconds). Proportional jitter added (B-2 fix: maxJitter = pollingInterval/2). Server has no outbound connection capability.

Location: aggregator-agent/cmd/agent/main.go:960-1070 (poll loop), aggregator-server/internal/api/handlers/agents.go (GetCommands handler)

Status: Working. Enhanced beyond spec with jitter and exponential backoff on failures.
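
The jitter and backoff behaviour described above can be sketched as follows. This is a minimal illustration, not the agent's actual code: the function names `nextPollDelay` and `backoffDelay` are hypothetical, the jitter range follows the stated rule maxJitter = pollingInterval/2, and the backoff shown is plain doubling with a cap (the real agent may add jitter to its retries as well).

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// nextPollDelay adds a random jitter in [0, interval/2] to the base interval.
// randInt63n is injected so the function can be tested deterministically;
// a caller would normally pass rand.Int63n.
func nextPollDelay(interval time.Duration, randInt63n func(int64) int64) time.Duration {
	maxJitter := interval / 2
	if maxJitter <= 0 {
		return interval
	}
	return interval + time.Duration(randInt63n(int64(maxJitter)+1))
}

// backoffDelay doubles base once per consecutive failure and caps the
// result at limit.
func backoffDelay(base, limit time.Duration, failures int) time.Duration {
	d := base
	for i := 0; i < failures && d < limit; i++ {
		d *= 2
	}
	if d > limit {
		return limit
	}
	return d
}

func main() {
	// Somewhere between 5m0s and 7m30s for the default 300-second interval.
	fmt.Println(nextPollDelay(300*time.Second, rand.Int63n))
	// Three consecutive failures: 5s -> 10s -> 20s -> 40s.
	fmt.Println(backoffDelay(5*time.Second, 5*time.Minute, 3))
}
```

Spreading check-ins over half an interval keeps a fleet that restarted at the same moment from polling the server in lockstep.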


2.2 Ed25519 Command Signing

Planned (Security.md): All commands in DB must be Ed25519-signed before being sent to agents. signAndCreateCommand() implemented in handlers.

Built: Exactly as planned, then enhanced. v3 signed message format includes agent_id:cmd_id:type:sha256(params):timestamp. Key ID and SignedAt tracked per command (migration 025). 29+ call sites use signAndCreateCommand().

Location: aggregator-server/internal/services/signing.go, aggregator-server/internal/api/handlers/agents.go:49-77

Status: Working. Exceeds original spec.


2.3 Machine ID Binding

Planned (Security.md section 3.1): MachineBindingMiddleware validates X-Machine-ID header against agents.machine_id. Mismatch = 403.

Built: Implemented. Machine ID is SHA256 hash of hardware identifiers (D-1 fix made this canonical across all platforms). Middleware validates on authenticated agent routes. Admin rebind endpoint added for recovery.

Location: aggregator-server/internal/api/middleware/machine_binding.go, aggregator-agent/internal/system/machine_id.go

Status: Working. Enhanced with canonical hash format and rebind capability.
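
A rough sketch of the binding check, assuming plain `net/http` rather than whatever router the server actually uses. `bindingStatus`, `machineBinding`, and the `lookup` callback are hypothetical names introduced for illustration; only the `X-Machine-ID` header and the 403-on-mismatch rule come from the description above.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
)

// bindingStatus returns 200 when the presented machine ID matches the stored
// one and 403 otherwise. An empty header never matches.
func bindingStatus(presented, stored string) int {
	if presented == "" ||
		subtle.ConstantTimeCompare([]byte(presented), []byte(stored)) != 1 {
		return http.StatusForbidden
	}
	return http.StatusOK
}

// machineBinding wraps a handler with the check. lookup is a hypothetical
// callback that returns the machine ID stored for the authenticated agent.
func machineBinding(lookup func(*http.Request) string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if bindingStatus(r.Header.Get("X-Machine-ID"), lookup(r)) != http.StatusOK {
			http.Error(w, "machine ID mismatch", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	fmt.Println(bindingStatus("abc", "abc"), bindingStatus("abc", "xyz")) // 200 403
}
```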


2.4 Three-Tier Token Authentication

Planned (Security.md section 2): Registration tokens (one-time/multi-seat) -> JWT access tokens (24h) -> Refresh tokens (90-day sliding window, stored as SHA-256 hash).

Built: Exactly as planned. Registration tokens consumed at /agents/register with seat limits. JWT issued with issuer=redflag-agent (A-3 fix). Refresh tokens stored hashed, renewed via /agents/renew. Token renewal wrapped in database transaction (B-2 fix).

Location: aggregator-server/internal/api/handlers/auth.go, aggregator-server/internal/api/middleware/auth.go, aggregator-server/internal/database/queries/refresh_tokens.go

Status: Working. Transaction safety added beyond original spec.
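
Storing only the SHA-256 hash of a refresh token means a database leak does not expose usable tokens. The sketch below shows the idea under stated assumptions: the token is 32 random bytes in hex, the hash is hex-encoded, and the function names are hypothetical rather than taken from refresh_tokens.go.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// newRefreshToken returns 32 random bytes as a 64-character hex string.
func newRefreshToken() (string, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(buf), nil
}

// hashToken is what would be written to the database instead of the token.
func hashToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:])
}

// tokenMatches hashes the presented token and compares it to the stored hash
// in constant time.
func tokenMatches(presented, storedHash string) bool {
	return subtle.ConstantTimeCompare([]byte(hashToken(presented)), []byte(storedHash)) == 1
}

func main() {
	tok, err := newRefreshToken()
	if err != nil {
		panic(err)
	}
	stored := hashToken(tok)
	fmt.Println(len(stored), tokenMatches(tok, stored)) // 64 true
}
```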


2.5 Replay Attack Protection (Nonce System)

Planned (Security.md section 3.3): Unique, time-limited, Ed25519-signed nonce for every sensitive command. Agent validates signature and timestamp.

Built: Implemented via UpdateNonceService with Generate() and Validate(). Nonce format: uuid:unix_timestamp, signed with Ed25519. Agent validates freshness within configurable window.

Location: aggregator-server/internal/services/nonce_service.go, aggregator-agent/cmd/agent/subsystem_handlers.go:1074 (validateNonce)

Status: Working. Timeout is 10 minutes (not 5 as specified in Overview.md — see VD-006).
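
The nonce round trip can be illustrated roughly as below. Only the `uuid:unix_timestamp` format, the Ed25519 signature, and the freshness window come from the description above; the function names, the UUID helper, and the decision to reject timestamps from the future are assumptions of this sketch and may differ from UpdateNonceService.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"errors"
	"fmt"
	"strconv"
	"strings"
	"time"
)

// newUUID builds a random (version 4) UUID without external dependencies.
func newUUID() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40
	b[8] = (b[8] & 0x3f) | 0x80
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

// generateNonce returns "uuid:unix_timestamp" and its Ed25519 signature.
func generateNonce(priv ed25519.PrivateKey, now time.Time) (string, []byte, error) {
	id, err := newUUID()
	if err != nil {
		return "", nil, err
	}
	nonce := id + ":" + strconv.FormatInt(now.Unix(), 10)
	return nonce, ed25519.Sign(priv, []byte(nonce)), nil
}

// validateNonce checks the signature first, then the age of the timestamp.
func validateNonce(pub ed25519.PublicKey, nonce string, sig []byte, now time.Time, maxAge time.Duration) error {
	if !ed25519.Verify(pub, []byte(nonce), sig) {
		return errors.New("invalid nonce signature")
	}
	i := strings.LastIndex(nonce, ":")
	if i < 0 {
		return errors.New("malformed nonce")
	}
	ts, err := strconv.ParseInt(nonce[i+1:], 10, 64)
	if err != nil {
		return errors.New("malformed nonce timestamp")
	}
	if age := now.Sub(time.Unix(ts, 0)); age < 0 || age > maxAge {
		return errors.New("nonce outside freshness window")
	}
	return nil
}

// checkAfter issues a nonce and validates it ageSec seconds later.
func checkAfter(ageSec, windowSec int) error {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		return err
	}
	issued := time.Unix(1700000000, 0)
	nonce, sig, err := generateNonce(priv, issued)
	if err != nil {
		return err
	}
	return validateNonce(pub, nonce, sig,
		issued.Add(time.Duration(ageSec)*time.Second),
		time.Duration(windowSec)*time.Second)
}

func main() {
	fmt.Println(checkAfter(300, 600)) // accepted inside a 10-minute window
	fmt.Println(checkAfter(601, 600)) // rejected once the window has passed
}
```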


2.6 Agent Self-Registration Flow

Planned (Starting Prompt): Agent POSTs to /agents/register with hostname, OS info, version. Server returns agent_id, token, config.

Built: Exactly as planned plus enhancements. Registration includes machine_id (SHA256 hash) and public_key_fingerprint. Registration wrapped in database transaction (B-2 fix). Agent stores agent_id, token, refresh_token, check_in_interval to config file.

Location: aggregator-agent/cmd/agent/main.go:371-483 (registerAgent), aggregator-server/internal/api/handlers/agents.go (RegisterAgent)

Status: Working.


2.7 Package Scanner Architecture

Planned (Starting Prompt): APT, DNF/YUM, AUR, Winget, Windows Update, Docker, Snap, Flatpak, Homebrew.

Built: APT, DNF, Winget, Windows Update, Docker, Storage metrics. Each scanner has circuit breaker protection and configurable timeouts.

Not built: AUR, Snap, Flatpak, Homebrew. See Section 5A.

Location: aggregator-agent/internal/scanner/ (apt.go, dnf.go, winget.go, docker.go), aggregator-agent/pkg/windowsupdate/

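Status: Working for implemented platforms. 5 of the 9 planned scanners are built (APT, DNF, Winget, Windows Update, Docker); the storage metrics scanner is an addition that was not in the original list.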


2.8 Circuit Breaker Pattern

Planned (ETHOS.md principle 3, Overview.md): Circuit Breaker on fragile scanners (Windows Update, DNF).

Built: Full implementation with Closed/Open/HalfOpen states, configurable failure threshold, failure window, open duration, and half-open attempts. Applied to all scanners via subsystem config.

Location: aggregator-agent/internal/circuitbreaker/circuit_breaker.go, config per subsystem in config.json

Status: Working.
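
For readers unfamiliar with the pattern, a stripped-down breaker with the three states might look like the sketch below. It is a simplification and not the code in circuit_breaker.go: the type and field names are hypothetical, the failure window is omitted, and a single successful half-open call closes the circuit.

```go
package main

import (
	"fmt"
	"time"
)

type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

// Breaker trips to Open after FailureThreshold consecutive failures and
// allows a probe call once OpenDuration has elapsed.
type Breaker struct {
	FailureThreshold int
	OpenDuration     time.Duration

	state    State
	failures int
	openedAt time.Time
}

// Allow reports whether a call may proceed at time now. An Open breaker
// moves to HalfOpen once OpenDuration has passed.
func (b *Breaker) Allow(now time.Time) bool {
	if b.state == Open {
		if now.Sub(b.openedAt) < b.OpenDuration {
			return false
		}
		b.state = HalfOpen
	}
	return true
}

// Record registers the outcome of a call that Allow permitted.
func (b *Breaker) Record(success bool, now time.Time) {
	if success {
		b.state, b.failures = Closed, 0
		return
	}
	b.failures++
	if b.state == HalfOpen || b.failures >= b.FailureThreshold {
		b.state, b.openedAt, b.failures = Open, now, 0
	}
}

// demo drives the breaker through a full cycle and returns each state seen.
func demo() []State {
	b := &Breaker{FailureThreshold: 2, OpenDuration: time.Minute}
	t0 := time.Unix(0, 0)
	var seen []State
	b.Record(false, t0)
	seen = append(seen, b.state) // one failure: still Closed
	b.Record(false, t0)
	seen = append(seen, b.state) // threshold reached: Open
	b.Allow(t0.Add(30 * time.Second))
	seen = append(seen, b.state) // too early: still Open
	b.Allow(t0.Add(2 * time.Minute))
	seen = append(seen, b.state) // probe allowed: HalfOpen
	b.Record(true, t0.Add(2*time.Minute))
	seen = append(seen, b.state) // probe succeeded: Closed
	return seen
}

func main() {
	fmt.Println(demo()) // [0 1 1 2 0]
}
```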


2.9 Command Acknowledgment System

Planned (Overview.md): pending_acks.json for at-least-once delivery guarantee.

Built: acknowledgment.Tracker package with persistent pending acks, retry with IncrementRetry(), and state file persistence.

Location: aggregator-agent/internal/acknowledgment/

Status: Working.


2.10 Agent Service Management

Planned (Overview.md): systemd on Linux, Windows Services (SCM) on Windows.

Built: Both. Linux systemd unit generated inline in installer template with security hardening (ProtectSystem=strict, ProtectHome=true, PrivateTmp=true). Windows SCM registration via InstallService() with auto-start and recovery actions.

Location: aggregator-agent/internal/service/windows.go:438-516, installer templates in aggregator-server/internal/services/templates/install/scripts/

Status: Working.


2.11 Web Dashboard

Planned (Starting Prompt): React 18 + TypeScript, TailwindCSS, TanStack Query, Recharts, TanStack Table, WebSocket.

Built: React + TypeScript + TailwindCSS + TanStack Query. Dashboard with agent list, update management, command history, security settings. TypeScript strict compliance (0 errors after E-1b). No Recharts charts, no TanStack Table, limited WebSocket.

Location: aggregator-web/src/

Status: Working. Feature-complete for core use case but missing visualization features (charts, trend analysis).


2.12 PostgreSQL with Migration Runner

Planned (Starting Prompt, ETHOS.md): PostgreSQL database with idempotent migrations.

Built: PostgreSQL 16-alpine, custom migration runner in database/db.go with schema_migrations tracking table. 30 migrations (001-030) all using IF NOT EXISTS / ON CONFLICT DO NOTHING. Migration runner aborts fatally on failure (B-1 fix).

Location: aggregator-server/internal/database/db.go, aggregator-server/internal/database/migrations/

Status: Working.


2.13 Docker Compose Deployment

Planned (Starting Prompt): docker-compose.yml for quick start.

Built: Three-service compose: postgres (16-alpine), server (multi-stage Go build), web (nginx). Health checks, volume persistence, env_file support.

Location: docker-compose.yml, config/.env.example

Status: Working.


2.14 Installer Scripts

Planned (various docs): Install script served from /install/:platform endpoint.

Built: Linux bash and Windows PowerShell templates served dynamically. Linux installer creates system user, directories, sudoers (per-package-manager), systemd service with security hardening, registers agent, starts service. Windows installer creates directories, downloads binary, writes config, registers service. Both have arch auto-detection and checksum verification (Installer Fix 2).

Location: aggregator-server/internal/services/templates/install/scripts/linux.sh.tmpl, windows.ps1.tmpl

Status: Working. Idempotent.


2.15 Agent Self-Upgrade Pipeline

Planned (Overview.md, P2-003): 7-step pipeline: download, checksum verify, Ed25519 binary signature verify, backup, atomic install, service restart, watchdog confirmation.

Built: Exactly as specified. All 7 steps implemented with deferred rollback on any failure including watchdog timeout (5 minutes). Download has 5-minute timeout and 500MB size limit (U-8 fix).

Location: aggregator-agent/cmd/agent/subsystem_handlers.go:575-762

Status: Working. The v0.1.27 INVENTORY doc noted this was "FULLY IMPLEMENTED" even though the backlog still listed it as a placeholder.


Section 3: What Was Built Better Than Planned

3.1 Ed25519 Key Rotation with TTL

Original: Security.md noted key rotation was "TODO" with no implementation. SETUP-SECURITY.md described a POST /security/keys/rotate API with 30-day grace period.

Built: TTL-based key caching with automatic refresh. Server registers primary key in signing_keys table (migration 025). Agent caches server public key at registration (TOFU), with configurable TTL refresh. Key ID tracked per command for audit trail.

Why better: Automatic TTL refresh is more reliable than manual API-triggered rotation. The manual rotation API was never needed because the system handles staleness automatically.


3.2 Command Signing v3 Format

Original: Commands signed with cmd_id:type:sha256(params).

Built: v3 format includes agent_id:cmd_id:type:sha256(params):timestamp. Agent ID binding prevents command replay to different agents. Timestamp enables time-based expiry independent of DB state.

Why better: Prevents a class of relay attacks where a compromised agent could forward signed commands to other agents.
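
To make the agent binding concrete, the sketch below builds the v3 message and shows that a signature issued for one agent fails verification for another. The field order follows the format given above; the hex encoding of the SHA-256 digest, the Unix-seconds timestamp, and all function names are assumptions made for illustration.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// signedMessage assembles agent_id:cmd_id:type:sha256(params):timestamp.
func signedMessage(agentID, cmdID, cmdType string, params []byte, ts int64) []byte {
	sum := sha256.Sum256(params)
	return []byte(fmt.Sprintf("%s:%s:%s:%s:%d",
		agentID, cmdID, cmdType, hex.EncodeToString(sum[:]), ts))
}

// demoBinding signs a command for agent-a and verifies it as agent-a and as
// agent-b. Only the first verification should succeed.
func demoBinding() (bool, bool, error) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		return false, false, err
	}
	params := []byte(`{"packages":["curl"]}`)
	const ts = 1774825489
	sig := ed25519.Sign(priv, signedMessage("agent-a", "cmd-1", "install_updates", params, ts))

	okTarget := ed25519.Verify(pub, signedMessage("agent-a", "cmd-1", "install_updates", params, ts), sig)
	okOther := ed25519.Verify(pub, signedMessage("agent-b", "cmd-1", "install_updates", params, ts), sig)
	return okTarget, okOther, nil
}

func main() {
	fmt.Println(demoBinding()) // true false <nil>
}
```

Because the agent ID is part of the signed bytes, an agent that reconstructs the message with its own ID rejects a command that was signed for a different agent.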


3.3 Machine ID Canonical SHA256 Hash

Original: Machine ID was inconsistent — registration fallback used "unknown-" + hostname (unhashed) while runtime used SHA256.

Built (D-1): All paths now use GetMachineID() which always returns a 64-character hex SHA256 hash. Registration aborts with log.Fatalf if machine ID cannot be obtained — no unhashed fallback.

Why better: Eliminates format mismatch between registration and runtime that would cause 403 errors after restart.


3.4 Transaction Safety (B-Series)

Original: Not specified. Registration, command delivery, and token renewal were separate DB operations without transaction wrapping.

Built (B-2): Registration wrapped in tx.Beginx() with defer tx.Rollback(). Command delivery uses SELECT FOR UPDATE SKIP LOCKED (atomic claim). Token renewal wrapped in transaction. JWT generated after commit, not before.

Why better: Prevents partial registration state, command double-delivery race conditions, and orphaned tokens.


3.5 Configurable Operational Timeouts

Original: 6 hardcoded timeout values in main.go and timeout.go.

Built (E-1c): All 6 values stored in security_settings table under operational category (migration 030). Read from DB at startup with hardcoded fallback. Zero-value protection prevents zero-duration tickers.

Why better: Administrators can tune timeouts via API without code changes or redeployment.
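
The fallback and zero-value protection can be sketched as a small lookup helper. The settings map stands in for rows read from the security_settings table, and both the key names and the function name are hypothetical. The guard matters because `time.NewTicker` panics when given a non-positive duration.

```go
package main

import (
	"fmt"
	"strconv"
	"time"
)

// timeoutFromSettings returns the configured value in seconds as a Duration,
// or fallback when the key is missing, unparseable, zero, or negative.
func timeoutFromSettings(settings map[string]string, key string, fallback time.Duration) time.Duration {
	raw, ok := settings[key]
	if !ok {
		return fallback
	}
	secs, err := strconv.Atoi(raw)
	if err != nil || secs <= 0 {
		return fallback
	}
	return time.Duration(secs) * time.Second
}

func main() {
	// Hypothetical keys; the real setting names are not shown in this report.
	settings := map[string]string{
		"command_timeout_seconds": "120",
		"offline_check_seconds":   "0",
	}
	fmt.Println(timeoutFromSettings(settings, "command_timeout_seconds", 45*time.Second)) // 2m0s
	fmt.Println(timeoutFromSettings(settings, "offline_check_seconds", 45*time.Second))   // 45s
}
```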


3.6 Binary Path Traversal Protection

Original: Not specified — c.File(pkg.BinaryPath) served DB-sourced paths without validation.

Built (E-1c + Integration Verification): Both DownloadUpdatePackage and DownloadAgent resolve paths via filepath.Abs() and validate against REDFLAG_BINARY_STORAGE_PATH using prefix check. Traversal attempts logged and return 403.

Why better: Defense in depth against DB compromise scenarios.
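
One way to express the containment check is shown below. Note that this sketch uses `filepath.Rel` instead of a raw string prefix comparison, because a plain prefix test on `/srv/binaries` would also accept `/srv/binaries-evil`; whether the real handlers guard against that case is not stated in this report. The function name and the example paths are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"path/filepath"
	"strings"
)

// resolveWithin returns the absolute form of requested if it lies inside
// root, and an error otherwise.
func resolveWithin(root, requested string) (string, error) {
	absRoot, err := filepath.Abs(root)
	if err != nil {
		return "", err
	}
	absPath, err := filepath.Abs(requested)
	if err != nil {
		return "", err
	}
	rel, err := filepath.Rel(absRoot, absPath)
	if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", errors.New("path escapes storage root")
	}
	return absPath, nil
}

func main() {
	for _, p := range []string{
		"/srv/binaries/linux-amd64/agent",
		"/srv/binaries/../etc/passwd",
		"/srv/binaries-evil/agent",
	} {
		_, err := resolveWithin("/srv/binaries", p)
		fmt.Println(p, "allowed:", err == nil)
	}
}
```

A handler would respond with 403 and log the attempt whenever `resolveWithin` returns an error.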


3.7 TypeScript Strict Compliance

Original: 217 TypeScript errors in aggregator-web/src/.

Built (E-1b): All 217 errors fixed. Zero @ts-ignore or as any suppressions added. Type interfaces verified against actual server JSON responses. TanStack Query v5 isLoading -> isPending migration for mutations.

Why better: Catches type mismatches at compile time instead of runtime.


3.8 Semver-Aware Version Comparison

Original: versions.go:72 used lexicographic comparison (agentVersion < current.MinAgentVersion), making "0.1.9" > "0.1.22".

Built (Upgrade Fix): CompareVersions() with octet-by-octet numeric parsing. Handles "dev" as always-older, "v" prefix stripping, mismatched octet counts.

Why better: Version gates now work correctly for all version numbers.
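
A comparison with the properties listed above could look like the following sketch. It is not the project's CompareVersions(); the lowercase name is deliberate, and treating any unparseable string the same as "dev" is an assumption of this example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion strips an optional "v" prefix and splits into numeric octets.
// It reports false for "dev", empty, or non-numeric input.
func parseVersion(v string) ([]int, bool) {
	v = strings.TrimPrefix(strings.TrimSpace(v), "v")
	if v == "" || v == "dev" {
		return nil, false
	}
	parts := strings.Split(v, ".")
	out := make([]int, len(parts))
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil {
			return nil, false
		}
		out[i] = n
	}
	return out, true
}

// compareVersions returns -1, 0, or 1. Unparseable versions sort before any
// numeric version; missing trailing octets count as zero.
func compareVersions(a, b string) int {
	pa, okA := parseVersion(a)
	pb, okB := parseVersion(b)
	switch {
	case !okA && !okB:
		return 0
	case !okA:
		return -1
	case !okB:
		return 1
	}
	for i := 0; i < len(pa) || i < len(pb); i++ {
		var x, y int
		if i < len(pa) {
			x = pa[i]
		}
		if i < len(pb) {
			y = pb[i]
		}
		if x < y {
			return -1
		}
		if x > y {
			return 1
		}
	}
	return 0
}

func main() {
	fmt.Println(compareVersions("0.1.9", "0.1.22"))    // -1 (numeric, not lexicographic)
	fmt.Println(compareVersions("v0.1.4", "0.1.4"))    // 0
	fmt.Println(compareVersions("0.1.26", "0.1.26.0")) // 0
	fmt.Println(compareVersions("dev", "0.0.1"))       // -1
}
```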


3.9 Test Suite Growth

Original (Code Review): "Only 3 test files across the entire codebase" — circuitbreaker_test.go, test_disk.go, test_disk_detection.go, plus scheduler tests.

Built: 170 tests across 18 packages covering: Ed25519 signing and replay protection, JWT issuer validation, registration transactions, command delivery races, machine ID format, ETHOS compliance, path traversal, version comparison, checksum computation, timeout configuration, and more.

Why better: Regression detection for all fix series (A through Upgrade).


3.10 ETHOS Logging Compliance

Original: Mixed fmt.Printf, emoji in logs, inconsistent format.

Built (D-2 + Integration Verification): All production log statements use log.Printf("[TAG] [system] [component] message key=value"). Emoji removed from daemon log.Printf calls. fmt.Printf DEBUG statements removed from handlers. Terminal/CLI output emoji explicitly exempted (DEV-039).


3.11 Installer Architecture Detection

Original: Architecture hardcoded to amd64 in generateInstallScript.

Built (Installer Fix 2): Runtime detection via uname -m (Linux) and $env:PROCESSOR_ARCHITECTURE (Windows). Server accepts optional ?arch= query param. Download endpoint already supported linux-arm64 and windows-arm64.

Why better: ARM64 homelabbers (Raspberry Pi, Apple Silicon VMs) can now install without manual binary download.
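
The installers do this detection in shell and PowerShell; the Go sketch below only illustrates the mapping from the raw values those tools report to the download platform names mentioned above (such as linux-arm64 and windows-arm64). The function names are hypothetical, and the set of accepted raw values is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeArch maps values such as those printed by `uname -m` or held in
// %PROCESSOR_ARCHITECTURE% to Go-style architecture names.
func normalizeArch(raw string) (string, bool) {
	switch strings.ToLower(strings.TrimSpace(raw)) {
	case "x86_64", "amd64":
		return "amd64", true
	case "aarch64", "arm64":
		return "arm64", true
	}
	return "", false
}

// downloadPlatform builds the "<os>-<arch>" name used to pick a binary.
func downloadPlatform(osName, rawArch string) (string, error) {
	arch, ok := normalizeArch(rawArch)
	if !ok {
		return "", fmt.Errorf("unsupported architecture %q", rawArch)
	}
	return osName + "-" + arch, nil
}

func main() {
	fmt.Println(downloadPlatform("linux", "aarch64")) // linux-arm64 <nil>
	fmt.Println(downloadPlatform("windows", "AMD64")) // windows-amd64 <nil>
}
```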


3.12 Binary Checksum Verification

Original: No verification of downloaded binary integrity.

Built (Installer Fix 2): Server computes SHA-256 and serves X-Content-SHA256 header. Linux installer verifies with sha256sum. Windows installer verifies with Get-FileHash. Missing header = warn but continue (backward compatible).
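
The verification rule, including the backward-compatible handling of a missing header, can be sketched in Go as below. The real installers use sha256sum and Get-FileHash; `verifyChecksum` is a hypothetical helper, and the assumption that the header carries a hex digest is not confirmed by this report.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// verifyChecksum compares the SHA-256 of body with the X-Content-SHA256
// header value. It returns (false, nil) when the header is absent so the
// caller can warn and continue, and an error when the digests differ.
func verifyChecksum(body []byte, header string) (bool, error) {
	want := strings.ToLower(strings.TrimSpace(header))
	if want == "" {
		return false, nil
	}
	sum := sha256.Sum256(body)
	got := hex.EncodeToString(sum[:])
	if subtle.ConstantTimeCompare([]byte(got), []byte(want)) != 1 {
		return false, errors.New("checksum mismatch")
	}
	return true, nil
}

func main() {
	body := []byte("abc")
	fmt.Println(verifyChecksum(body, "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad")) // true <nil>
	fmt.Println(verifyChecksum(body, ""))                                                                 // false <nil>
}
```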


3.13 Machine ID Rebind Endpoint

Original: Not specified. If machine ID changed (hardware replacement, VM migration), agent was permanently locked out.

Built (D-1): Admin endpoint POST /admin/agents/:id/rebind-machine-id allows re-binding an agent to new hardware. Requires admin authentication.


Section 4: What Was Built Differently (Deviations)

VD-001: Logging Format

Original (P3-006): JSON structured logs with correlation IDs via logrus or similar library. StructuredLogger implementation, CorrelationIDMiddleware, buffered async writes, P95/P99 latency tracking, system_logs database table.

Actual: ETHOS [TAG] [system] [component] plain text format via log.Printf. No correlation IDs, no structured JSON, no centralized aggregation, no log database table.

Rationale: ETHOS principle #5 (no marketing fluff) and principle #1 (errors are history) were prioritized over the P3-006 spec. Plain text logging with consistent tags is grep-friendly and sufficient for the homelab use case. JSON structured logging adds complexity without proportional benefit at the current scale.

Verdict: Acceptable for homelab. Would need to be revisited for fleet-scale deployments (100+ agents).


VD-002: Authentication Architecture

Original (P0-006 + Starting Prompt): Multi-user system with users table, admin/user/readonly roles, email fields, EnsureAdminUser(). The Starting Prompt shows Settings page with "users" section.

Actual: Single-admin via .env credentials. The users table exists in migrations (for compatibility) but is not used for authentication. Web auth validates against REDFLAG_ADMIN_USER/REDFLAG_ADMIN_PASSWORD from environment.

Rationale: P0-006 recommended "Option 1: Complete Removal" — recognizing that multi-user scaffolding increased attack surface without benefit for a homelab tool. The current implementation follows this recommendation.

Verdict: Correct for homelab. Multi-user would be needed for MSP/enterprise use case.


VD-003: Build Orchestrator

Original (architecture docs): Dynamic agent compilation per request. Build Orchestrator would cross-compile agent binaries on demand.

Actual: Pre-built binaries placed at container build time (via Dockerfile multi-stage build). BuildAndSignAgent signs existing binaries but never compiles. AgentBuilder generates config JSON only. build_orchestrator.go services layer marked // Deprecated.

Rationale: Cross-compilation on every request is impractical for a homelab server. The Dockerfile multi-stage build compiles once; the server serves pre-built binaries. This is how production package distribution works (e.g., GitHub Releases).

Verdict: Correct pragmatic simplification. Dynamic compilation would add complexity without benefit.


VD-004: Upgrade Trigger Path

Original: POST /build/upgrade/:agentID was meant to orchestrate full upgrades.

Actual: The real upgrade path is POST /agents/{id}/update (in agent_updates.go), which validates the agent, generates nonces, creates signed update_agent commands, and tracks delivery. The /build/upgrade endpoint generates config JSON with manual instructions — it's an admin utility, not the upgrade orchestrator.

Rationale: The /agents/{id}/update path already existed and was more complete (nonce generation, command signing, delivery tracking). Wiring a parallel path would have created confusion.

Verdict: Acceptable. The working path is better designed.


VD-005: Security Settings UI

Original (SECURITY-SETTINGS.md): Full security settings configurable from dashboard including machine binding mode, version enforcement, nonce timeout, signature algorithm, log level, alert thresholds.

Actual: Security settings backend works (API CRUD for security_settings table). Dashboard displays settings for command_signing, update_signing, nonce_validation, machine_binding, signature_verification. The operational category (E-1c timeouts) is accessible via API but not visible in the UI. No validation rules enforcement for operational settings.

Verdict: Partially implemented. Backend complete, frontend shows security-category settings. Operational settings need UI exposure.


VD-006: Nonce Timeout

Original (Overview.md, Security.md): Nonce lifetime "< 5 minutes".

Actual (SETUP-SECURITY.md, code): REDFLAG_SECURITY_NONCE_TIMEOUT=600 (10 minutes). The code uses a 10-minute default.

Rationale: The original docs contradict each other — Overview.md says "< 5 min" while SETUP-SECURITY.md says 600 seconds. The 10-minute value appears to be a deliberate choice to accommodate slow network conditions (agents polling every 5 minutes may not receive the command within a 5-minute nonce window).

Verdict: The 10-minute value is more practical. The 5-minute spec was likely aspirational. Document this as intentional.


VD-007: Key Rotation API

Original (SETUP-SECURITY.md): POST /api/v1/security/keys/rotate with grace_period_days (default 30). During grace period both old and new keys valid. Keys stored in /app/keys/ directory.

Actual: Key rotation is TTL-based via the signing key registry in signing_keys table. No explicit /keys/rotate API endpoint. No dual-key grace period — agents refresh their cached public key via TTL (24h default). Key stored in environment variable, not in /app/keys/ directory.

Rationale: Different implementation approach that achieves the same goal (agents can handle key changes) without the complexity of dual-key acceptance windows.

Verdict: Functional but different. A manual key rotation API would be a nice-to-have for planned rotations.


VD-008: Version Format

Original (SECURITY-SETTINGS.md): "Semantic version string (X.Y.Z), integers only, no v prefix."

Actual: Four-octet format X.Y.Z.W where W is the config version (e.g., 0.1.26.0). The v prefix is tolerated and stripped during comparison.

Rationale: The fourth octet was added to embed config schema version alongside the agent version, avoiding a separate version field.

Verdict: Acceptable extension of spec. CompareVersions() handles both 3-octet and 4-octet formats.


VD-009: Windows Service Key Rotation

Original: Not explicitly specified, but key rotation logic exists in main.go polling loop.

Actual (DEV-030): The Windows service polling loop in windows.go does not call ShouldRefreshKey. The comment at lines 164-168 acknowledges this as a TODO. Agents running as Windows services rely on the 24h TTL key cache and will not proactively detect key rotation.

Verdict: Known gap. Low risk — the 24h TTL cache means Windows agents will naturally pick up new keys within a day.


VD-010: Watchdog Version Comparison

Original: Not explicitly specified.

Actual: The upgrade watchdog in subsystem_handlers.go:943 uses string equality (agent.CurrentVersion == expectedVersion) instead of CompareVersions(). A normalized version string mismatch (e.g., "v0.1.4" vs "0.1.4") would trigger false rollback.

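Verdict: Low risk, since both sides currently take the version string from the same source. It needs fixing if the two strings can ever differ in format (for example, one side carrying a v prefix); switching the watchdog to CompareVersions() would cover that case.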


Section 5: What Was Never Built

5A. Platform Support

| Feature | Original Priority | Effort Estimate | Notes |
| --- | --- | --- | --- |
| macOS agent / launchd | Starting Prompt | 2-3 days | No launchd plist, no macOS-specific code |
| Homebrew scanner | Starting Prompt | 1-2 days | Would follow APT/DNF pattern |
| AUR scanner (Arch) | Starting Prompt | 1-2 days | Would follow APT/DNF pattern |
| Snap scanner | Starting Prompt | 1 day | Low demand |
| Flatpak scanner | Starting Prompt | 1 day | Low demand |
| aggregator-cli (Go CLI) | Starting Prompt | 3-5 days | Power-user tool, not essential for homelab |

5B. AI Features

| Feature | Original Priority | Effort Estimate | Notes |
| --- | --- | --- | --- |
| AI Chat Sidebar (Ollama/OpenAI) | Starting Prompt | 2-3 weeks | No AI code exists anywhere |
| Natural language queries (POST /ai/query) | Starting Prompt | 1-2 weeks | Requires AI sidebar first |
| AI-assisted scheduling (POST /ai/schedule) | Starting Prompt | 1 week | Requires maintenance windows first |
| AI decision audit trail (GET /ai/decisions) | Starting Prompt | 3-5 days | Requires AI features first |

Honest assessment: AI features were aspirational in the Starting Prompt ("Future Phase"). They add significant complexity and operational overhead (Ollama requires GPU resources or external API costs). For a homelab tool, manual approval is more appropriate than AI-assisted scheduling. Recommend deferring indefinitely unless Fimeg has a specific use case.

5C. Scheduling & Automation

| Feature | Original Priority | Effort Estimate | Notes |
| --- | --- | --- | --- |
| Maintenance Windows (RRULE recurrence) | Starting Prompt, P3 | 2-3 weeks | Full RRULE parser + calendar UI + auto-approve logic |
| Auto-approve by severity during windows | Starting Prompt | 1 week | Requires maintenance windows |
| Scheduled update execution | Starting Prompt | 1 week | Requires maintenance windows |
| Staggered rollout (5%/25%/100%) | P2-003, Strategic Roadmap | 1-2 weeks | Server-side group selection + phased command queuing |
| Auto-upgrade trigger (version-based) | Upgrade Audit | 1 week | Server detects old version on check-in, queues update_agent |

Honest assessment: Maintenance windows are the highest-value unbuilt feature for production use. Auto-approve by severity during defined windows would significantly reduce manual work.

5D. Observability

| Feature | Original Priority | Effort Estimate | Notes |
| --- | --- | --- | --- |
| Structured JSON logging (P3-006) | P3 | 3-4 days | logrus + correlation IDs |
| Correlation ID propagation | P3-006 | 2-3 days | Middleware + header propagation |
| Update Metrics Dashboard (P3-003) | P3 | 2-3 days | Success/failure rates, trend charts |
| Server Health Dashboard (P3-005) | P3 | 2-3 days | CPU, memory, DB connections |
| Prometheus metrics endpoint | Strategic Roadmap | 2-3 days | /metrics endpoint with Go prometheus client |
| Real-time WebSocket updates | Starting Prompt | 1-2 weeks | Partial: security events WebSocket exists |

5E. Integration & Ecosystem

| Feature | Original Priority | Effort Estimate | Notes |
| --- | --- | --- | --- |
| LDAP/Active Directory | Strategic Roadmap | 2-3 weeks | Auth integration |
| SAML/OIDC for SSO | Strategic Roadmap | 2-3 weeks | Requires multi-user first |
| Slack/Teams/PagerDuty webhooks | Strategic Roadmap | 1-2 weeks | Event notification hooks |
| Compliance reporting (SOX, HIPAA) | Strategic Roadmap | 4-6 weeks | Report generation framework |
| Kubernetes deployment | Strategic Roadmap | 1-2 weeks | Helm chart + StatefulSet |
| Ansible/Terraform integrations | Strategic Roadmap | 2-3 weeks | Module/provider development |

Honest assessment: Webhooks (Slack/Teams) are the highest-value integration for homelab use. LDAP/SSO only matters if Fimeg plans to support multi-user deployments.

5F. UI Features

| Feature | Original Priority | Effort Estimate | Notes |
| --- | --- | --- | --- |
| Security Status Dashboard Indicators (P3-002) | P3 | 2-3 days | Color-coded security health scores |
| Token Management UI Enhancement (P3-004) | P3 | 1-2 days | Delete tokens, bulk operations |
| Server Health Dashboard (P3-005) | P3 | 2-3 days | System status monitoring |
| Operational settings in UI | E-1c carry-over | 1 day | Add 'operational' category to SecuritySettings.tsx |
| Update metrics and trend charts | P3-003 | 2-3 days | Recharts integration |
| Calendar view for maintenance windows | Starting Prompt | 1-2 weeks | Requires maintenance windows backend |

5G. Security Features

| Feature | Original Priority | Effort Estimate | Notes |
| --- | --- | --- | --- |
| Multi-factor authentication | Strategic Roadmap | 1-2 weeks | TOTP integration |
| API key rotation via UI | SETUP-SECURITY.md | 2-3 days | Manual rotation endpoint |
| Key rotation with grace period | SETUP-SECURITY.md | 1 week | Dual-key acceptance window |
| TLS hardening (remove bypass flag) | Code Review | 1 hour | Remove --insecure-tls flag |
| JWT secret minimum strength | Code Review | 30 min | Validation in config loading |

Section 6: Backlog Status Table

| ID | Title | Original Priority | Current Status | Where Fixed/Notes |
| --- | --- | --- | --- | --- |
| P0-001 | Rate Limit First Request Bug | P0 | FIXED | v0.1.26 per v0.1.27 Inventory; rate limiter namespaced by type in A-3 fixes |
| P0-002 | Session Loop Bug | P0 | PARTIALLY FIXED | SetupCompletionChecker modified in E-1b (removed isSetupMode state); may need live verification |
| P0-003 | Agent No Retry Logic | P0 | FIXED | v0.1.27 per Inventory + culurien B-2 (exponential backoff with full jitter, proportional polling jitter) |
| P0-004 | Database Constraint Violation | P0 | FIXED | v0.1.27 per Inventory; timeout service now uses 'failed' result status (check constraint compatible) |
| P0-005 | Setup Flow Broken | P0 | NOT VERIFIED | Setup handler exists but end-to-end flow not tested in culurien branch. May still have issues |
| P0-006 | Single-Admin Architecture | P0 | ACCEPTED | Decision made: single-admin via .env. Users table exists for compatibility but not used for auth |
| P0-007 | Install Script Path Variables | P0 | FIXED | 2025-12-17 per backlog + verified in Installer Fix 1 (config path consistency) |
| P0-008 | Migration Runs on Fresh Install | P0 | FIXED | 2025-12-17 per backlog; early return in detection.go for empty agent_id |
| P0-009 | Storage Scanner Wrong Table | P0 | NOT DONE | Storage scanner still on legacy interface. Dedicated storage_metrics table exists but scanner reports to update_packages |
| P1-001 | Agent Install ID Parsing | P1 | PARTIALLY FIXED | extractOrGenerateAgentID() in install_template_service.go validates UUID format; but query param handling may still have edge cases |
| P1-002 | Agent Timeout Handling | P1 | PARTIALLY FIXED | E-1c made timeouts configurable from DB; per-scanner timeouts exist in config but generic 45s timeout may still apply in some paths |
| P2-001 | Binary URL Architecture Mismatch | P2 | FIXED | Installer Fix 2 added arch detection; templates override download URL with detected architecture |
| P2-002 | Migration Error Reporting | P2 | NOT DONE | Migration errors still only logged locally; no server-side visibility |
| P2-003 | Agent Auto-Update System | P2 | FIXED | Fully implemented (was incorrectly marked placeholder in backlog); verified in Upgrade Audit |
| P3-001 | Duplicate Command Prevention | P3 | FIXED | v0.1.27; unique index on (agent_id, command_type, status) WHERE status = 'pending' |
| P3-002 | Security Status Dashboard | P3 | PARTIALLY DONE | Security overview endpoints exist; no color-coded health scores or per-agent security badges |
| P3-003 | Update Metrics Dashboard | P3 | NOT DONE | No metrics dashboard, no trend charts |
| P3-004 | Token Management UI Enhancement | P3 | PARTIALLY DONE | Token list with copy-install-command exists; no delete, no bulk operations, no status filtering |
| P3-005 | Server Health Dashboard | P3 | NOT DONE | No health dashboard |
| P3-006 | Structured Logging System | P3 | NOT DONE (alternative) | ETHOS [TAG] format used instead of JSON structured logging. See VD-001 |
| P4-001 | Agent Retry Logic Resilience | P4 | FIXED | v0.1.27 per Inventory + culurien B-2 (exponential backoff, circuit breakers) |
| P4-002 | Scanner Timeout Optimization | P4 | PARTIALLY DONE | Configurable per-subsystem timeouts in config; E-1c made server-side timeouts configurable from DB |
| P4-003 | Agent File Management Migration | P4 | PARTIALLY DONE | MigrationExecutor exists with old-path detection; constants/paths.go standardized; validation/pathutils packages have compile errors (dead code) |
| P4-004 | Directory Path Standardization | P4 | FIXED | constants/paths.go provides canonical paths; windows.go fixed to use constants.GetAgentConfigPath() (Installer Fix 1); installer templates use standard paths |
| P4-005 | Testing Infrastructure Gaps | P4 | SIGNIFICANTLY IMPROVED | From ~3 test files to 170 tests across 18 packages; no CI/CD yet |
| P4-006 | Architecture Documentation Gaps | P4 | PARTIALLY DONE | 30+ docs in culurien docs/ folder; no formal architecture diagrams or ADRs |
| P5-001 | Security Audit Documentation | P5 | NOT DONE | No security audit checklist, IR procedures, or compliance mapping |
| P5-002 | Development Workflow Documentation | P5 | PARTIALLY DONE | .env.example created; no PR template, debugging guide, or release process |

Summary: 10 FIXED, 8 PARTIALLY DONE, 6 NOT DONE, 1 ACCEPTED (design decision), 1 NOT VERIFIED, 1 alternative approach.


Section 7: Architecture Health Assessment

7A. Authentication Stack

| Component | Spec | Actual | Match |
|---|---|---|---|
| Registration tokens | One-time or multi-seat | Implemented with seat limits | YES |
| JWT 24h expiry | Short-lived JWT | Implemented with issuer-based validation (A-3) | YES |
| Refresh tokens 90-day | Sliding window, SHA-256 hash | Implemented, renewal in transaction (B-2) | YES |
| Machine ID binding | X-Machine-ID header, 403 on mismatch | Implemented with canonical SHA256 hash (D-1) | YES |
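
The machine ID binding row can be illustrated with a minimal Go sketch. It is not the RedFlag implementation: `canonicalMachineID`, `machineIDMatches`, `requireMachineID`, and the `lookup` callback are hypothetical names, and the trim-and-lowercase normalization is an assumption about what "canonical" means. Only the `X-Machine-ID` header, the SHA-256 hash, and the 403 response come from the table.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"net/http"
	"strings"
)

// canonicalMachineID is a hypothetical helper. It normalizes a raw machine
// identifier (trim + lowercase, an assumed rule) and returns the SHA-256
// digest as 64 hex characters.
func canonicalMachineID(raw string) string {
	sum := sha256.Sum256([]byte(strings.ToLower(strings.TrimSpace(raw))))
	return hex.EncodeToString(sum[:])
}

// machineIDMatches compares the bound ID with the presented one in constant
// time. An empty presented value never matches.
func machineIDMatches(bound, presented string) bool {
	if presented == "" {
		return false
	}
	return subtle.ConstantTimeCompare([]byte(bound), []byte(presented)) == 1
}

// requireMachineID wraps a handler and answers 403 when the X-Machine-ID
// header does not match the ID bound to the agent. lookup is a hypothetical
// stand-in for whatever resolves the bound ID (for example a DB query).
func requireMachineID(lookup func(r *http.Request) (string, bool), next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		bound, ok := lookup(r)
		if !ok || !machineIDMatches(bound, r.Header.Get("X-Machine-ID")) {
			http.Error(w, "machine ID mismatch", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```

In this sketch the agent would send the already-hashed ID in the header, so the server only ever compares digests.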

7B. Command Flow

| Component | Spec | Actual | Match |
|---|---|---|---|
| Pull-only | Agents always initiate | Confirmed; server has no outbound capability | YES |
| 5-minute check-in | Configurable interval | Default 300s, configurable via config.json | YES |
| Command types | scan_updates, collect_specs, install_updates, rollback_update, update_agent | All present in models/command.go plus enable/disable_heartbeat, reboot, dry_run_update, confirm_dependencies | YES+ |
| Acknowledgment | pending_acks.json | acknowledgment.Tracker with persistence and retry | YES |

7C. Security Stack

| Component | Spec | Actual | Match |
|---|---|---|---|
| Ed25519 signing | Binary + command signing | Both implemented; v3 format exceeds spec | YES+ |
| Nonce validation | < 5 min lifetime, anti-replay | 10-minute default (VD-006), otherwise matches | CLOSE |
| TOFU key caching | Fetch once at registration | Implemented with TTL refresh | YES+ |

7D. Agent Paths

| Platform | Spec Path | Actual | Match |
|---|---|---|---|
| Linux config | /etc/redflag/config.json | /etc/redflag/agent/config.json (constants.GetAgentConfigPath) | CLOSE (subdir added) |
| Linux state | /var/lib/redflag/ | /var/lib/redflag/agent/ | CLOSE (subdir added) |
| Linux binary | /usr/local/bin/redflag-agent | /usr/local/bin/redflag-agent | YES |
| Windows config | C:\ProgramData\RedFlag\config.json | C:\ProgramData\RedFlag\agent\config.json (fixed in Installer Fix 1) | CLOSE (subdir added) |

The agent subdirectory was added to support future multi-component deployments (agent + server on same machine). This is a reasonable structural enhancement.

7E. Migration System

| Component | Spec | Actual | Match |
|---|---|---|---|
| MigrationExecutor | Present | Implemented in aggregator-agent/internal/migration/ | YES |
| Old path migration | /etc/aggregator/ -> /etc/redflag/ | Detection and backup implemented in installer templates and migration executor | YES |

Section 8: The Honest Roadmap

HIGH VALUE, LOW EFFORT (Quick Wins)

  1. JWT secret minimum strength (30 min) — Add len(secret) < 32 check in config loading. Addresses Code Review finding.
  2. TLS bypass flag removal (1 hour) — Remove --insecure-tls flag from agent. Forces TLS in production.
  3. Operational settings in UI (1 day) — Add operational category to SecuritySettings.tsx component. Makes timeout tuning accessible from dashboard.
  4. Token delete button (1-2 days) — P3-004. DELETE endpoint + confirmation dialog. Currently requires DB manual cleanup.
  5. /api/v1/info ldflags injection (1 hour) — Ensure Dockerfile passes -ldflags with actual version strings. Currently defaults to "dev".

HIGH VALUE, HIGH EFFORT (Strategic Investments)

  1. Maintenance Windows (2-3 weeks) — The single most impactful unbuilt feature. Enables scheduled patching during safe hours. Without this, every update requires manual approval at execution time.
  2. Webhook notifications (1-2 weeks) — Slack/Teams alerts on critical update availability, failed installations, agent offline. Low integration overhead, high operational value.
  3. Staggered rollout (1-2 weeks) — Deploy updates to 5% canary, monitor, then 25%, then 100%. Essential for fleets > 10 agents.
  4. macOS agent (2-3 days) — launchd plist template + Homebrew scanner. Completes the "cross-platform" promise for homelabbers with Macs.
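
The staggered rollout item can be sketched as a small wave planner. This is an illustration under stated assumptions, not a design from the codebase: `rolloutWaves` is a hypothetical function, agents are plain strings, and each wave boundary is rounded up so that even a small fleet gets a non-empty canary. The 5%, 25%, 100% steps come from the item above.

```go
package main

// rolloutWaves splits agents into consecutive waves. cumulativePct lists
// ascending cumulative percentages (for example 5, 25, 100); each wave
// holds the agents between the previous boundary and the new one.
func rolloutWaves(agents []string, cumulativePct []int) [][]string {
	waves := make([][]string, 0, len(cumulativePct))
	start := 0
	for _, pct := range cumulativePct {
		// Integer ceiling of pct% of the fleet avoids float rounding.
		end := (pct*len(agents) + 99) / 100
		if end > len(agents) {
			end = len(agents)
		}
		if end < start {
			end = start
		}
		waves = append(waves, agents[start:end])
		start = end
	}
	return waves
}
```

For a fleet of 20 agents, `rolloutWaves(agents, []int{5, 25, 100})` yields waves of 1, 4, and 15 agents. The monitoring step between waves is deliberately left out of the sketch.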

LOW VALUE (Defer or Drop)

  1. AI features — Drop entirely for foreseeable future. Adds operational complexity (GPU/API costs) without proportional benefit for homelab use case. Manual approval is more appropriate.
  2. aggregator-cli — Defer. The web dashboard covers all use cases. CLI would be nice-to-have for scripting but is not essential.
  3. AUR/Snap/Flatpak scanners — Defer. Very small user base for each. APT and DNF cover 95%+ of Linux homelabbers.
  4. LDAP/SSO — Defer until multi-user is needed. Single-admin is correct for homelab.
  5. Compliance reporting — Drop. SOX/HIPAA requirements don't apply to homelabs.
  6. Kubernetes deployment — Defer. Docker Compose is the right deployment model for the target audience.

Section 9: Summary Table

| Feature Area | Planned | Built | Status | Gap Rating |
|---|---|---|---|---|
| Core Architecture (pull model, agents, server) | Full | Full | Working | 0 (complete) |
| Ed25519 Signing (commands, binaries) | Full | Full + enhancements | Working | 0 (exceeds spec) |
| Authentication (tokens, JWT, refresh) | Full | Full + transactions | Working | 0 (exceeds spec) |
| Machine ID Binding | Full | Full + canonical hash | Working | 0 (exceeds spec) |
| Replay Protection (nonces) | Full | Full (10min vs 5min) | Working | 1 (timeout deviation) |
| Package Scanning (6 of 9 scanners) | 9 scanners | 6 scanners | Working | 3 (AUR, Snap, Flatpak, Homebrew missing) |
| Agent Self-Upgrade | Full | Full 7-step pipeline | Working | 0 (complete) |
| Installer (Linux + Windows) | Full | Full + arch + checksum | Working | 1 (macOS missing) |
| Web Dashboard | Full | Core features | Working | 3 (missing charts, health, metrics) |
| Database + Migrations | Full | Full + hardened | Working | 0 (exceeds spec) |
| Docker Deployment | Full | Full | Working | 0 (complete) |
| Testing | Minimal | 170 tests | Working | 2 (no CI/CD, no integration tests against real DB) |
| Maintenance Windows | Full | None | Not built | 10 (completely absent) |
| AI Features | Full | None | Not built | 10 (deliberately deferred) |
| Scheduling & Automation | Full | None | Not built | 8 (no maintenance windows, no staggered rollout) |
| LDAP/SSO | Planned | None | Not built | 5 (not needed for homelab) |
| Structured Logging | Planned (P3-006) | ETHOS alternative | Working differently | 3 (functional but not JSON/correlation IDs) |
| Compliance / Reporting | Planned | None | Not built | 2 (not applicable to homelab) |
| CLI Tool | Planned | None | Not built | 2 (dashboard covers use cases) |
| macOS Support | Planned | None | Not built | 4 (matters for homelabbers with Macs) |

Overall: Core infrastructure is 9/10. Feature breadth is 5/10. Production readiness for homelab is 7/10.

The gap is almost entirely in features that were always labeled "future" or "Phase 2" in the original docs. The core architecture — the hard engineering work — is built, tested, and hardened beyond the original specification.