jpetree331 3526962333 docs: Vision vs Reality comprehensive deviation report

Complete comparison of Fimeg's original vision against
current codebase state after culurien branch work.

- 9 sections covering architecture, features, backlog
- 10 deviations documented (VD-001 through VD-010)
- 27 backlog items tracked with current status
- Honest roadmap with prioritized next steps
- Executive summary for quick reference

Core: 9/10. Features: 5/10. Homelab ready: 7/10.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-29 19:04:49 -04:00

Vision vs Reality: RedFlag Deviation Report

Date: 2026-03-29
Branch: culurien (post-integration verification)
Baseline: v0.1.27 (last Fimeg commit before culurien work)
Author: Claude (automated analysis based on complete codebase and historical documentation)

Section 1: Executive Summary

RedFlag began as "Aggregator" — Fimeg's vision for a self-hosted ConnectWise alternative that would give homelabbers centralized update management with "single pane of glass" visibility across Windows, Linux, and macOS. The Starting Prompt described an ambitious platform with AI-assisted scheduling, maintenance windows, natural language queries, and cross-platform agents covering APT, DNF, AUR, Snap, Flatpak, Winget, Windows Update, Docker, and Homebrew.

What exists today is a narrower but more deeply engineered system than the original breadth implied. The core architecture — pull-based agents, Ed25519 signed commands, machine ID binding, three-tier token authentication — is not just present but has been hardened well beyond the original specification. The culurien branch added transaction safety, replay attack protection, key rotation, configurable timeouts, path traversal defense, semver-aware version comparison, and grew the test suite from approximately 3 test files to 170 passing tests across 18 packages.

The honest gap: approximately 40% of the originally envisioned feature surface was never built. macOS support, AI features, maintenance windows, scheduled rollouts, structured JSON logging, LDAP/SSO, compliance reporting, the CLI tool — none exist. The features that do exist (package scanning, command execution, agent self-upgrade, installer infrastructure) work correctly and are production-quality for the homelab use case.

Is it production-ready for Fimeg's homelab? Yes, with caveats. The system can install agents on Linux and Windows, scan for package updates, approve and execute updates from the dashboard, and self-upgrade agents, all with cryptographic verification. The caveats are: the setup flow has known friction (P0-005), the /api/v1/info endpoint returns build-time values that default to "dev" without proper ldflags injection, and several P0 backlog items have been fixed, but not all of those fixes have been verified end-to-end in a live deployment.

The "scare ConnectWise" ambition requires honest framing. RedFlag has three architectural advantages ConnectWise cannot replicate: hardware-bound machine IDs, mandatory self-hosting, and code transparency. But ConnectWise has remote control, 100+ integrations, SOC2 certification, and a decade of enterprise polish. RedFlag is competitive for the self-hosted update management niche — the subset of ConnectWise functionality that homelabbers and small IT teams actually use. It is not, and should not try to become, a full RMM platform. The strategic path is: own the update management vertical completely, then expand.

Section 2: What Was Built As Planned

2.1 Pull-Based Agent Architecture

Planned (Starting Prompt): Agent polls server every 5 minutes via GET /agents/{id}/commands. Server never initiates connections to agents.

Built: Exactly as planned. Agent polls via GET /agents/:id/commands with configurable interval (default 300 seconds). Proportional jitter added (B-2 fix: maxJitter = pollingInterval/2). Server has no outbound connection capability.

Location: aggregator-agent/cmd/agent/main.go:960-1070 (poll loop), aggregator-server/internal/api/handlers/agents.go (GetCommands handler)

Status: Working. Enhanced beyond spec with jitter and exponential backoff on failures.
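
A minimal sketch of the proportional jitter described above. The source only states maxJitter = pollingInterval/2; that the jitter is added on top of the base interval and drawn uniformly is an assumption, and nextPollDelay and sampleDelaySeconds are hypothetical names, not functions from the codebase.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// nextPollDelay returns the base interval plus a random jitter in
// [0, interval/2). Additive, uniform jitter is an assumption of this sketch.
func nextPollDelay(interval time.Duration, rng *rand.Rand) time.Duration {
	maxJitter := interval / 2
	if maxJitter <= 0 {
		return interval // nothing to jitter for zero or tiny intervals
	}
	return interval + time.Duration(rng.Int63n(int64(maxJitter)))
}

// sampleDelaySeconds draws one delay for the given base interval and seed.
func sampleDelaySeconds(intervalSeconds int, seed int64) float64 {
	rng := rand.New(rand.NewSource(seed))
	return nextPollDelay(time.Duration(intervalSeconds)*time.Second, rng).Seconds()
}

func main() {
	// With the default 300-second interval every delay lands in [300s, 450s).
	for seed := int64(1); seed <= 3; seed++ {
		fmt.Printf("next poll in %.1fs\n", sampleDelaySeconds(300, seed))
	}
}
```

Scaling the jitter with the interval keeps agents spread out even when an operator raises or lowers the polling interval.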

2.2 Ed25519 Command Signing

Planned (Security.md): All commands in DB must be Ed25519-signed before being sent to agents. signAndCreateCommand() implemented in handlers.

Built: Exactly as planned, then enhanced. v3 signed message format includes agent_id:cmd_id:type:sha256(params):timestamp. Key ID and SignedAt tracked per command (migration 025). 29+ call sites use signAndCreateCommand().

Location: aggregator-server/internal/services/signing.go, aggregator-server/internal/api/handlers/agents.go:49-77

Status: Working. Exceeds original spec.
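
The v3 message layout can be illustrated with the standard library alone. This is a sketch, not the code in signing.go: the hex encoding of the params digest, the Unix-seconds timestamp, and the names buildSignedMessage and roundTrip are assumptions made for the example.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// buildSignedMessage assembles agent_id:cmd_id:type:sha256(params):timestamp.
// Hex digest and Unix-seconds timestamp are assumptions of this sketch.
func buildSignedMessage(agentID, cmdID, cmdType string, params []byte, ts int64) []byte {
	digest := sha256.Sum256(params)
	return []byte(fmt.Sprintf("%s:%s:%s:%s:%d",
		agentID, cmdID, cmdType, hex.EncodeToString(digest[:]), ts))
}

// roundTrip signs a command addressed to signedFor and reports whether an
// agent that rebuilds the message with verifiedAs accepts the signature.
func roundTrip(signedFor, verifiedAs string) bool {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}
	params := []byte(`{"package":"openssl"}`)
	const ts = int64(1700000000)
	sig := ed25519.Sign(priv, buildSignedMessage(signedFor, "cmd-1", "install_updates", params, ts))
	return ed25519.Verify(pub, buildSignedMessage(verifiedAs, "cmd-1", "install_updates", params, ts), sig)
}

func main() {
	fmt.Println("same agent:     ", roundTrip("agent-a", "agent-a"))
	fmt.Println("different agent:", roundTrip("agent-a", "agent-b"))
}
```

Because the agent rebuilds the message with its own ID before verifying, a signature issued for one agent does not verify on another.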

2.3 Machine ID Binding

Planned (Security.md section 3.1): MachineBindingMiddleware validates X-Machine-ID header against agents.machine_id. Mismatch = 403.

Built: Implemented. Machine ID is SHA256 hash of hardware identifiers (D-1 fix made this canonical across all platforms). Middleware validates on authenticated agent routes. Admin rebind endpoint added for recovery.

Location: aggregator-server/internal/api/middleware/machine_binding.go, aggregator-agent/internal/system/machine_id.go

Status: Working. Enhanced with canonical hash format and rebind capability.

2.4 Three-Tier Token Authentication

Planned (Security.md section 2): Registration tokens (one-time/multi-seat) -> JWT access tokens (24h) -> Refresh tokens (90-day sliding window, stored as SHA-256 hash).

Built: Exactly as planned. Registration tokens consumed at /agents/register with seat limits. JWT issued with issuer=redflag-agent (A-3 fix). Refresh tokens stored hashed, renewed via /agents/renew. Token renewal wrapped in database transaction (B-2 fix).

Location: aggregator-server/internal/api/handlers/auth.go, aggregator-server/internal/api/middleware/auth.go, aggregator-server/internal/database/queries/refresh_tokens.go

Status: Working. Transaction safety added beyond original spec.

2.5 Replay Attack Protection (Nonce System)

Planned (Security.md section 3.3): Unique, time-limited, Ed25519-signed nonce for every sensitive command. Agent validates signature and timestamp.

Built: Implemented via UpdateNonceService with Generate() and Validate(). Nonce format: uuid:unix_timestamp, signed with Ed25519. Agent validates freshness within configurable window.

Location: aggregator-server/internal/services/nonce_service.go, aggregator-agent/cmd/agent/subsystem_handlers.go:1074 (validateNonce)

Status: Working. Timeout is 10 minutes (not 5 as specified in Overview.md — see VD-006).
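
A hedged sketch of the agent-side check for a uuid:unix_timestamp nonce. It is not the validateNonce in subsystem_handlers.go: rejecting timestamps from the future, the example UUID, and the helper nonceAccepted are assumptions, and tracking of already-used nonces is left out.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"fmt"
	"strconv"
	"strings"
	"time"
)

// validateNonce verifies the signature first, then the age of the timestamp
// that follows the last ':' in the nonce.
func validateNonce(pub ed25519.PublicKey, nonce string, sig []byte, now time.Time, maxAge time.Duration) error {
	if !ed25519.Verify(pub, []byte(nonce), sig) {
		return fmt.Errorf("nonce signature invalid")
	}
	idx := strings.LastIndex(nonce, ":")
	if idx < 0 {
		return fmt.Errorf("malformed nonce")
	}
	secs, err := strconv.ParseInt(nonce[idx+1:], 10, 64)
	if err != nil {
		return fmt.Errorf("malformed nonce timestamp: %w", err)
	}
	age := now.Sub(time.Unix(secs, 0))
	if age < 0 || age > maxAge { // future timestamps rejected (assumption)
		return fmt.Errorf("nonce outside freshness window (age %s)", age)
	}
	return nil
}

// nonceAccepted issues a nonce and validates it ageSeconds later against the
// 10-minute window used by the code default.
func nonceAccepted(ageSeconds int) bool {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}
	issued := time.Unix(1700000000, 0)
	nonce := fmt.Sprintf("%s:%d", "123e4567-e89b-12d3-a456-426614174000", issued.Unix())
	sig := ed25519.Sign(priv, []byte(nonce))
	now := issued.Add(time.Duration(ageSeconds) * time.Second)
	return validateNonce(pub, nonce, sig, now, 10*time.Minute) == nil
}

func main() {
	fmt.Println("after 5 min: ", nonceAccepted(300))
	fmt.Println("after 11 min:", nonceAccepted(660))
}
```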

2.6 Agent Self-Registration Flow

Planned (Starting Prompt): Agent POSTs to /agents/register with hostname, OS info, version. Server returns agent_id, token, config.

Built: Exactly as planned plus enhancements. Registration includes machine_id (SHA256 hash) and public_key_fingerprint. Registration wrapped in database transaction (B-2 fix). Agent stores agent_id, token, refresh_token, check_in_interval to config file.

Location: aggregator-agent/cmd/agent/main.go:371-483 (registerAgent), aggregator-server/internal/api/handlers/agents.go (RegisterAgent)

Status: Working.

2.7 Package Scanner Architecture

Planned (Starting Prompt): APT, DNF/YUM, AUR, Winget, Windows Update, Docker, Snap, Flatpak, Homebrew.

Built: APT, DNF, Winget, Windows Update, Docker, Storage metrics. Each scanner has circuit breaker protection and configurable timeouts.

Not built: AUR, Snap, Flatpak, Homebrew. See Section 5A.

Location: aggregator-agent/internal/scanner/ (apt.go, dnf.go, winget.go, docker.go), aggregator-agent/pkg/windowsupdate/

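Status: Working for implemented platforms. 5 of the 9 planned scanners are built (APT, DNF, Winget, Windows Update, Docker); the storage metrics scanner is an addition that was not in the original list.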

2.8 Circuit Breaker Pattern

Planned (ETHOS.md principle 3, Overview.md): Circuit Breaker on fragile scanners (Windows Update, DNF).

Built: Full implementation with Closed/Open/HalfOpen states, configurable failure threshold, failure window, open duration, and half-open attempts. Applied to all scanners via subsystem config.

Location: aggregator-agent/internal/circuitbreaker/circuit_breaker.go, config per subsystem in config.json

Status: Working.
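
The state machine described above can be sketched in a few dozen lines. This is an illustration under stated assumptions, not the contents of circuit_breaker.go: the field names, the rule that any failed trial call reopens the breaker, and the injected clock are choices made for the example, and the type is not safe for concurrent use.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// State names follow the report; the numeric values are arbitrary.
type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

var errOpen = errors.New("circuit open: call rejected")

// Breaker is a single-goroutine sketch; real code would guard it with a mutex.
type Breaker struct {
	Threshold   int           // failures inside Window that trip the breaker
	Window      time.Duration // how long a failure counts toward Threshold
	OpenFor     time.Duration // how long calls are rejected once Open
	HalfOpenMax int           // successes needed in HalfOpen to close again
	Now         func() time.Time

	state      State
	failures   []time.Time
	openedAt   time.Time
	halfOpenOK int
}

// Call runs fn unless the breaker is Open and OpenFor has not yet elapsed.
func (b *Breaker) Call(fn func() error) error {
	now := b.Now()
	if b.state == Open {
		if now.Sub(b.openedAt) < b.OpenFor {
			return errOpen
		}
		b.state, b.halfOpenOK = HalfOpen, 0 // allow trial calls
	}
	if err := fn(); err != nil {
		b.recordFailure(now)
		return err
	}
	if b.state == HalfOpen {
		b.halfOpenOK++
		if b.halfOpenOK >= b.HalfOpenMax {
			b.state, b.failures = Closed, nil
		}
	}
	return nil
}

func (b *Breaker) recordFailure(now time.Time) {
	if b.state == HalfOpen { // a failed trial reopens immediately
		b.state, b.openedAt = Open, now
		return
	}
	kept := b.failures[:0]
	for _, t := range b.failures {
		if now.Sub(t) <= b.Window { // drop failures older than the window
			kept = append(kept, t)
		}
	}
	b.failures = append(kept, now)
	if len(b.failures) >= b.Threshold {
		b.state, b.openedAt = Open, now
	}
}

// demoStates drives the breaker with a fake clock and records each state.
func demoStates() []State {
	clock := time.Unix(0, 0)
	b := &Breaker{Threshold: 2, Window: time.Minute, OpenFor: 30 * time.Second,
		HalfOpenMax: 1, Now: func() time.Time { return clock }}
	fail := func() error { return errors.New("scan failed") }
	ok := func() error { return nil }

	var seen []State
	for _, step := range []func() error{fail, fail, ok} {
		_ = b.Call(step)
		seen = append(seen, b.state)
	}
	clock = clock.Add(31 * time.Second) // OpenFor has elapsed
	_ = b.Call(ok)
	return append(seen, b.state)
}

func main() {
	fmt.Println(demoStates()) // [0 1 1 0]: Closed, Open, Open, Closed
}
```

The third call in the demo succeeds in principle but is rejected without running, which is the point of the pattern: a fragile scanner is left alone until the open duration has passed.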

2.9 Command Acknowledgment System

Planned (Overview.md): pending_acks.json for at-least-once delivery guarantee.

Built: acknowledgment.Tracker package with persistent pending acks, retry with IncrementRetry(), and state file persistence.

Location: aggregator-agent/internal/acknowledgment/

Status: Working.

2.10 Agent Service Management

Planned (Overview.md): systemd on Linux, Windows Services (SCM) on Windows.

Built: Both. Linux systemd unit generated inline in installer template with security hardening (ProtectSystem=strict, ProtectHome=true, PrivateTmp=true). Windows SCM registration via InstallService() with auto-start and recovery actions.

Location: aggregator-agent/internal/service/windows.go:438-516, installer templates in aggregator-server/internal/services/templates/install/scripts/

Status: Working.

2.11 Web Dashboard

Planned (Starting Prompt): React 18 + TypeScript, TailwindCSS, TanStack Query, Recharts, TanStack Table, WebSocket.

Built: React + TypeScript + TailwindCSS + TanStack Query. Dashboard with agent list, update management, command history, security settings. TypeScript strict compliance (0 errors after E-1b). No Recharts charts, no TanStack Table, limited WebSocket.

Location: aggregator-web/src/

Status: Working. Feature-complete for core use case but missing visualization features (charts, trend analysis).

2.12 PostgreSQL with Migration Runner

Planned (Starting Prompt, ETHOS.md): PostgreSQL database with idempotent migrations.

Built: PostgreSQL 16-alpine, custom migration runner in database/db.go with schema_migrations tracking table. 30 migrations (001-030) all using IF NOT EXISTS / ON CONFLICT DO NOTHING. Migration runner aborts fatally on failure (B-1 fix).

Location: aggregator-server/internal/database/db.go, aggregator-server/internal/database/migrations/

Status: Working.

2.13 Docker Compose Deployment

Planned (Starting Prompt): docker-compose.yml for quick start.

Built: Three-service compose: postgres (16-alpine), server (multi-stage Go build), web (nginx). Health checks, volume persistence, env_file support.

Location: docker-compose.yml, config/.env.example

Status: Working.

2.14 Installer Scripts

Planned (various docs): Install script served from /install/:platform endpoint.

Built: Linux bash and Windows PowerShell templates served dynamically. Linux installer creates system user, directories, sudoers (per-package-manager), systemd service with security hardening, registers agent, starts service. Windows installer creates directories, downloads binary, writes config, registers service. Both have arch auto-detection and checksum verification (Installer Fix 2).

Location: aggregator-server/internal/services/templates/install/scripts/linux.sh.tmpl, windows.ps1.tmpl

Status: Working. Idempotent.

2.15 Agent Self-Upgrade Pipeline

Planned (Overview.md, P2-003): 7-step pipeline: download, checksum verify, Ed25519 binary signature verify, backup, atomic install, service restart, watchdog confirmation.

Built: Exactly as specified. All 7 steps implemented with deferred rollback on any failure including watchdog timeout (5 minutes). Download has 5-minute timeout and 500MB size limit (U-8 fix).

Location: aggregator-agent/cmd/agent/subsystem_handlers.go:575-762

Status: Working. The v0.1.27 INVENTORY doc noted this was "FULLY IMPLEMENTED" despite the backlog claiming it was a placeholder.

Section 3: What Was Built Better Than Planned

3.1 Ed25519 Key Rotation with TTL

Original: Security.md noted key rotation was "TODO" with no implementation. SETUP-SECURITY.md described a POST /security/keys/rotate API with 30-day grace period.

Built: TTL-based key caching with automatic refresh. Server registers primary key in signing_keys table (migration 025). Agent caches server public key at registration (TOFU), with configurable TTL refresh. Key ID tracked per command for audit trail.

Why better: Automatic TTL refresh is more reliable than manual API-triggered rotation. Agents recover from stale keys without operator action, so the manual rotation API was not required (see VD-007 for the trade-offs).

3.2 Command Signing v3 Format

Original: Commands signed with cmd_id:type:sha256(params).

Built: v3 format includes agent_id:cmd_id:type:sha256(params):timestamp. Agent ID binding prevents command replay to different agents. Timestamp enables time-based expiry independent of DB state.

Why better: Prevents a class of relay attacks where a compromised agent could forward signed commands to other agents.

3.3 Machine ID Canonical SHA256 Hash

Original: Machine ID was inconsistent — registration fallback used "unknown-" + hostname (unhashed) while runtime used SHA256.

Built (D-1): All paths now use GetMachineID() which always returns a 64-character hex SHA256 hash. Registration aborts with log.Fatalf if machine ID cannot be obtained — no unhashed fallback.

Why better: Eliminates format mismatch between registration and runtime that would cause 403 errors after restart.

3.4 Transaction Safety (B-Series)

Original: Not specified. Registration, command delivery, and token renewal were separate DB operations without transaction wrapping.

Built (B-2): Registration wrapped in tx.Beginx() with defer tx.Rollback(). Command delivery uses SELECT FOR UPDATE SKIP LOCKED (atomic claim). Token renewal wrapped in transaction. JWT generated after commit, not before.

Why better: Prevents partial registration state, command double-delivery race conditions, and orphaned tokens.

3.5 Configurable Operational Timeouts

Original: 6 hardcoded timeout values in main.go and timeout.go.

Built (E-1c): All 6 values stored in security_settings table under operational category (migration 030). Read from DB at startup with hardcoded fallback. Zero-value protection prevents zero-duration tickers.

Why better: Administrators can tune timeouts via API without code changes or redeployment.
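
Zero-value protection matters because Go's time.NewTicker panics when given a non-positive duration. A minimal sketch of the fallback, assuming timeouts are stored as whole seconds; timeoutSeconds is a hypothetical helper, not a function from the codebase.

```go
package main

import (
	"fmt"
	"time"
)

// timeoutSeconds returns the stored value when it is positive and the
// hardcoded fallback otherwise, so a missing or zeroed row can never
// produce a zero-duration ticker.
func timeoutSeconds(stored, fallback int) int {
	if stored <= 0 {
		return fallback
	}
	return stored
}

func main() {
	secs := timeoutSeconds(0, 300) // 0 simulates a missing or zeroed DB value
	ticker := time.NewTicker(time.Duration(secs) * time.Second)
	defer ticker.Stop()
	fmt.Println("ticker interval (s):", secs)
}
```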

3.6 Binary Path Traversal Protection

Original: Not specified — c.File(pkg.BinaryPath) served DB-sourced paths without validation.

Built (E-1c + Integration Verification): Both DownloadUpdatePackage and DownloadAgent resolve paths via filepath.Abs() and validate against REDFLAG_BINARY_STORAGE_PATH using prefix check. Traversal attempts logged and return 403.

Why better: Defense in depth against DB compromise scenarios.
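
A sketch of a containment check of this kind. The report describes a prefix check against REDFLAG_BINARY_STORAGE_PATH; this example swaps in filepath.Rel instead, because a bare string prefix would also accept a sibling directory such as /app/binaries-old. The base path /app/binaries and the name isWithinBase are hypothetical, and symlinks are not resolved.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// isWithinBase reports whether candidate resolves to a path inside base.
// Both paths are made absolute (and thereby cleaned) before comparison.
func isWithinBase(base, candidate string) bool {
	absBase, err := filepath.Abs(base)
	if err != nil {
		return false
	}
	absCand, err := filepath.Abs(candidate)
	if err != nil {
		return false
	}
	rel, err := filepath.Rel(absBase, absCand)
	if err != nil {
		return false
	}
	// A relative path that climbs out of base starts with a ".." element.
	return rel != ".." && !strings.HasPrefix(rel, ".."+string(filepath.Separator))
}

func main() {
	base := "/app/binaries" // hypothetical storage root
	for _, p := range []string{
		"/app/binaries/linux-amd64/redflag-agent",
		"/app/binaries/../keys/private.key",
		"/app/binaries-old/redflag-agent",
	} {
		fmt.Printf("%-45s allowed=%v\n", p, isWithinBase(base, p))
	}
}
```

A handler would return 403 and log the attempt whenever the check fails, as the report describes.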

3.7 TypeScript Strict Compliance

Original: 217 TypeScript errors in aggregator-web/src/.

Built (E-1b): All 217 errors fixed. Zero @ts-ignore or as any suppressions added. Type interfaces verified against actual server JSON responses. TanStack Query v5 isLoading -> isPending migration for mutations.

Why better: Catches type mismatches at compile time instead of runtime.

3.8 Semver-Aware Version Comparison

Original: versions.go:72 used lexicographic comparison (agentVersion < current.MinAgentVersion), so "0.1.9" compared as newer than "0.1.22".

Built (Upgrade Fix): CompareVersions() with octet-by-octet numeric parsing. Handles "dev" as always-older, "v" prefix stripping, mismatched octet counts.

Why better: Version gates now work correctly for all version numbers.
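
The difference between the two comparisons is easy to demonstrate. The function below is a sketch of octet-by-octet comparison, not the project's CompareVersions(); treating missing or non-numeric octets as 0 is an assumption of the example.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// compareVersions returns -1, 0, or 1. "dev" sorts before every release,
// a leading "v" is ignored, and missing octets count as 0.
func compareVersions(a, b string) int {
	if a == "dev" || b == "dev" {
		switch {
		case a == b:
			return 0
		case a == "dev":
			return -1
		default:
			return 1
		}
	}
	as := strings.Split(strings.TrimPrefix(a, "v"), ".")
	bs := strings.Split(strings.TrimPrefix(b, "v"), ".")
	n := len(as)
	if len(bs) > n {
		n = len(bs)
	}
	for i := 0; i < n; i++ {
		x, y := octet(as, i), octet(bs, i)
		if x < y {
			return -1
		}
		if x > y {
			return 1
		}
	}
	return 0
}

// octet returns the numeric value at index i; missing or non-numeric
// octets are treated as 0 (an assumption of this sketch).
func octet(parts []string, i int) int {
	if i >= len(parts) {
		return 0
	}
	n, err := strconv.Atoi(parts[i])
	if err != nil {
		return 0
	}
	return n
}

func main() {
	fmt.Println("0.1.9" < "0.1.22")                 // false: string order puts 0.1.9 after 0.1.22
	fmt.Println(compareVersions("0.1.9", "0.1.22")) // -1: numerically older
	fmt.Println(compareVersions("v0.1.4", "0.1.4")) // 0: prefix ignored
}
```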

3.9 Test Suite Growth

Original (Code Review): "Only 3 test files across the entire codebase" — circuitbreaker_test.go, test_disk.go, test_disk_detection.go, plus scheduler tests.

Built: 170 tests across 18 packages covering: Ed25519 signing and replay protection, JWT issuer validation, registration transactions, command delivery races, machine ID format, ETHOS compliance, path traversal, version comparison, checksum computation, timeout configuration, and more.

Why better: Regression detection for all fix series (A through Upgrade).

3.10 ETHOS Logging Compliance

Original: Mixed fmt.Printf, emoji in logs, inconsistent format.

Built (D-2 + Integration Verification): All production log statements use log.Printf("[TAG] [system] [component] message key=value"). Emoji removed from daemon log.Printf calls. fmt.Printf DEBUG statements removed from handlers. Terminal/CLI output emoji explicitly exempted (DEV-039).

3.11 Installer Architecture Detection

Original: Architecture hardcoded to amd64 in generateInstallScript.

Built (Installer Fix 2): Runtime detection via uname -m (Linux) and $env:PROCESSOR_ARCHITECTURE (Windows). Server accepts optional ?arch= query param. Download endpoint already supported linux-arm64 and windows-arm64.

Why better: ARM64 homelabbers (Raspberry Pi, Apple Silicon VMs) can now install without manual binary download.

3.12 Binary Checksum Verification

Original: No verification of downloaded binary integrity.

Built (Installer Fix 2): Server computes SHA-256 and serves X-Content-SHA256 header. Linux installer verifies with sha256sum. Windows installer verifies with Get-FileHash. Missing header = warn but continue (backward compatible).
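
The installers perform this check in shell and PowerShell; the same decision logic is sketched here in Go for illustration. A hex-encoded header value is an assumption, as is the helper name checkDownload. The comparison is case-insensitive because Get-FileHash prints uppercase hex while sha256sum prints lowercase.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// checkDownload compares a payload against the X-Content-SHA256 header value.
// It returns "ok", "mismatch", or "unverified" when the header is absent
// (the warn-but-continue case for older servers).
func checkDownload(payload []byte, headerValue string) string {
	headerValue = strings.TrimSpace(headerValue)
	if headerValue == "" {
		return "unverified"
	}
	sum := sha256.Sum256(payload)
	if strings.EqualFold(hex.EncodeToString(sum[:]), headerValue) {
		return "ok"
	}
	return "mismatch"
}

func main() {
	payload := []byte("abc")
	good := "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
	fmt.Println(checkDownload(payload, good))       // ok
	fmt.Println(checkDownload(payload, ""))         // unverified
	fmt.Println(checkDownload(payload, "deadbeef")) // mismatch
}
```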

3.13 Machine ID Rebind Endpoint

Original: Not specified. If machine ID changed (hardware replacement, VM migration), agent was permanently locked out.

Built (D-1): Admin endpoint POST /admin/agents/:id/rebind-machine-id allows re-binding an agent to new hardware. Requires admin authentication.

Section 4: What Was Built Differently (Deviations)

VD-001: Logging Format

Original (P3-006): JSON structured logs with correlation IDs via logrus or similar library. StructuredLogger implementation, CorrelationIDMiddleware, buffered async writes, P95/P99 latency tracking, system_logs database table.

Actual: ETHOS [TAG] [system] [component] plain text format via log.Printf. No correlation IDs, no structured JSON, no centralized aggregation, no log database table.

Rationale: ETHOS principle #5 (no marketing fluff) and principle #1 (errors are history) were prioritized over the P3-006 spec. Plain text logging with consistent tags is grep-friendly and sufficient for the homelab use case. JSON structured logging adds complexity without proportional benefit at the current scale.

Verdict: Acceptable for homelab. Would need to be revisited for fleet-scale deployments (100+ agents).

VD-002: Authentication Architecture

Original (P0-006 + Starting Prompt): Multi-user system with users table, admin/user/readonly roles, email fields, EnsureAdminUser(). The Starting Prompt shows Settings page with "users" section.

Actual: Single-admin via .env credentials. The users table exists in migrations (for compatibility) but is not used for authentication. Web auth validates against REDFLAG_ADMIN_USER/REDFLAG_ADMIN_PASSWORD from environment.

Rationale: P0-006 recommended "Option 1: Complete Removal" — recognizing that multi-user scaffolding increased attack surface without benefit for a homelab tool. The current implementation follows this recommendation.

Verdict: Correct for homelab. Multi-user would be needed for MSP/enterprise use case.

VD-003: Build Orchestrator

Original (architecture docs): Dynamic agent compilation per request. Build Orchestrator would cross-compile agent binaries on demand.

Actual: Pre-built binaries placed at container build time (via Dockerfile multi-stage build). BuildAndSignAgent signs existing binaries but never compiles. AgentBuilder generates config JSON only. build_orchestrator.go services layer marked // Deprecated.

Rationale: Cross-compilation on every request is impractical for a homelab server. The Dockerfile multi-stage build compiles once; the server serves pre-built binaries. This is how production package distribution works (e.g., GitHub Releases).

Verdict: Correct pragmatic simplification. Dynamic compilation would add complexity without benefit.

VD-004: Upgrade Trigger Path

Original: POST /build/upgrade/:agentID was meant to orchestrate full upgrades.

Actual: The real upgrade path is POST /agents/{id}/update (in agent_updates.go), which validates the agent, generates nonces, creates signed update_agent commands, and tracks delivery. The /build/upgrade endpoint generates config JSON with manual instructions — it's an admin utility, not the upgrade orchestrator.

Rationale: The /agents/{id}/update path already existed and was more complete (nonce generation, command signing, delivery tracking). Wiring a parallel path would have created confusion.

Verdict: Acceptable. The working path is better designed.

VD-005: Security Settings UI

Original (SECURITY-SETTINGS.md): Full security settings configurable from dashboard including machine binding mode, version enforcement, nonce timeout, signature algorithm, log level, alert thresholds.

Actual: Security settings backend works (API CRUD for security_settings table). Dashboard displays settings for command_signing, update_signing, nonce_validation, machine_binding, signature_verification. The operational category (E-1c timeouts) is accessible via API but not visible in the UI. No validation rules enforcement for operational settings.

Verdict: Partially implemented. Backend complete, frontend shows security-category settings. Operational settings need UI exposure.

VD-006: Nonce Timeout

Original (Overview.md, Security.md): Nonce lifetime "< 5 minutes".

Actual (SETUP-SECURITY.md, code): REDFLAG_SECURITY_NONCE_TIMEOUT=600 (10 minutes). The code uses a 10-minute default.

Rationale: The original docs contradict each other — Overview.md says "< 5 min" while SETUP-SECURITY.md says 600 seconds. The 10-minute value appears to be a deliberate choice to accommodate slow network conditions (agents polling every 5 minutes may not receive the command within a 5-minute nonce window).

Verdict: The 10-minute value is more practical. The 5-minute spec was likely aspirational. Document this as intentional.

VD-007: Key Rotation API

Original (SETUP-SECURITY.md): POST /api/v1/security/keys/rotate with grace_period_days (default 30). During grace period both old and new keys valid. Keys stored in /app/keys/ directory.

Actual: Key rotation is TTL-based via the signing key registry in signing_keys table. No explicit /keys/rotate API endpoint. No dual-key grace period — agents refresh their cached public key via TTL (24h default). Key stored in environment variable, not in /app/keys/ directory.

Rationale: Different implementation approach that achieves the same goal (agents can handle key changes) without the complexity of dual-key acceptance windows.

Verdict: Functional but different. A manual key rotation API would be a nice-to-have for planned rotations.

VD-008: Version Format

Original (SECURITY-SETTINGS.md): "Semantic version string (X.Y.Z), integers only, no v prefix."

Actual: Four-octet format X.Y.Z.W where W is the config version (e.g., 0.1.26.0). The v prefix is tolerated and stripped during comparison.

Rationale: The fourth octet was added to embed config schema version alongside the agent version, avoiding a separate version field.

Verdict: Acceptable extension of spec. CompareVersions() handles both 3-octet and 4-octet formats.

VD-009: Windows Service Key Rotation

Original: Not explicitly specified, but key rotation logic exists in main.go polling loop.

Actual (DEV-030): The Windows service polling loop in windows.go does not call ShouldRefreshKey. The comment at lines 164-168 acknowledges this as a TODO. Agents running as Windows services rely on the 24h TTL key cache and will not proactively detect key rotation.

Verdict: Known gap. Low risk — the 24h TTL cache means Windows agents will naturally pick up new keys within a day.

VD-010: Watchdog Version Comparison

Original: Not explicitly specified.

Actual: The upgrade watchdog in subsystem_handlers.go:943 uses string equality (agent.CurrentVersion == expectedVersion) instead of CompareVersions(). Two strings that denote the same version but differ in form (e.g., "v0.1.4" vs "0.1.4") would trigger a false rollback.

Verdict: Low risk — both sides use the same version string from the same source. Would need fixing if version normalization is ever introduced.

Section 5: What Was Never Built

5A. Platform Support

Feature	Original Priority	Effort Estimate	Notes
macOS agent / launchd	Starting Prompt	2-3 days	No launchd plist, no macOS-specific code
Homebrew scanner	Starting Prompt	1-2 days	Would follow APT/DNF pattern
AUR scanner (Arch)	Starting Prompt	1-2 days	Would follow APT/DNF pattern
Snap scanner	Starting Prompt	1 day	Low demand
Flatpak scanner	Starting Prompt	1 day	Low demand
aggregator-cli (Go CLI)	Starting Prompt	3-5 days	Power-user tool, not essential for homelab

5B. AI Features

Feature	Original Priority	Effort Estimate	Notes
AI Chat Sidebar (Ollama/OpenAI)	Starting Prompt	2-3 weeks	No AI code exists anywhere
Natural language queries (`POST /ai/query`)	Starting Prompt	1-2 weeks	Requires AI sidebar first
AI-assisted scheduling (`POST /ai/schedule`)	Starting Prompt	1 week	Requires maintenance windows first
AI decision audit trail (`GET /ai/decisions`)	Starting Prompt	3-5 days	Requires AI features first

Honest assessment: AI features were aspirational in the Starting Prompt ("Future Phase"). They add significant complexity and operational overhead (Ollama requires GPU resources or external API costs). For a homelab tool, manual approval is more appropriate than AI-assisted scheduling. Recommend deferring indefinitely unless Fimeg has a specific use case.

5C. Scheduling & Automation

Feature	Original Priority	Effort Estimate	Notes
Maintenance Windows (RRULE recurrence)	Starting Prompt, P3	2-3 weeks	Full RRULE parser + calendar UI + auto-approve logic
Auto-approve by severity during windows	Starting Prompt	1 week	Requires maintenance windows
Scheduled update execution	Starting Prompt	1 week	Requires maintenance windows
Staggered rollout (5%/25%/100%)	P2-003, Strategic Roadmap	1-2 weeks	Server-side group selection + phased command queuing
Auto-upgrade trigger (version-based)	Upgrade Audit	1 week	Server detects old version on check-in, queues update_agent

Honest assessment: Maintenance windows are the highest-value unbuilt feature for production use. Auto-approve by severity during defined windows would significantly reduce manual work.

5D. Observability

Feature	Original Priority	Effort Estimate	Notes
Structured JSON logging (P3-006)	P3	3-4 days	logrus + correlation IDs
Correlation ID propagation	P3-006	2-3 days	Middleware + header propagation
Update Metrics Dashboard (P3-003)	P3	2-3 days	Success/failure rates, trend charts
Server Health Dashboard (P3-005)	P3	2-3 days	CPU, memory, DB connections
Prometheus metrics endpoint	Strategic Roadmap	2-3 days	/metrics endpoint with Go prometheus client
Real-time WebSocket updates	Starting Prompt	1-2 weeks	Partial: security events WebSocket exists

5E. Integration & Ecosystem

Feature	Original Priority	Effort Estimate	Notes
LDAP/Active Directory	Strategic Roadmap	2-3 weeks	Auth integration
SAML/OIDC for SSO	Strategic Roadmap	2-3 weeks	Requires multi-user first
Slack/Teams/PagerDuty webhooks	Strategic Roadmap	1-2 weeks	Event notification hooks
Compliance reporting (SOX, HIPAA)	Strategic Roadmap	4-6 weeks	Report generation framework
Kubernetes deployment	Strategic Roadmap	1-2 weeks	Helm chart + StatefulSet
Ansible/Terraform integrations	Strategic Roadmap	2-3 weeks	Module/provider development

Honest assessment: Webhooks (Slack/Teams) are the highest-value integration for homelab use. LDAP/SSO only matters if Fimeg plans to support multi-user deployments.

5F. UI Features

Feature	Original Priority	Effort Estimate	Notes
Security Status Dashboard Indicators (P3-002)	P3	2-3 days	Color-coded security health scores
Token Management UI Enhancement (P3-004)	P3	1-2 days	Delete tokens, bulk operations
Server Health Dashboard (P3-005)	P3	2-3 days	System status monitoring
Operational settings in UI	E-1c carry-over	1 day	Add 'operational' category to SecuritySettings.tsx
Update metrics and trend charts	P3-003	2-3 days	Recharts integration
Calendar view for maintenance windows	Starting Prompt	1-2 weeks	Requires maintenance windows backend

5G. Security Features

Feature	Original Priority	Effort Estimate	Notes
Multi-factor authentication	Strategic Roadmap	1-2 weeks	TOTP integration
API key rotation via UI	SETUP-SECURITY.md	2-3 days	Manual rotation endpoint
Key rotation with grace period	SETUP-SECURITY.md	1 week	Dual-key acceptance window
TLS hardening (remove bypass flag)	Code Review	1 hour	Remove `--insecure-tls` flag
JWT secret minimum strength	Code Review	30 min	Validation in config loading

Section 6: Backlog Status Table

ID	Title	Original Priority	Current Status	Where Fixed/Notes
P0-001	Rate Limit First Request Bug	P0	FIXED	v0.1.26 per v0.1.27 Inventory; rate limiter namespaced by type in A-3 fixes
P0-002	Session Loop Bug	P0	PARTIALLY FIXED	SetupCompletionChecker modified in E-1b (removed isSetupMode state); may need live verification
P0-003	Agent No Retry Logic	P0	FIXED	v0.1.27 per Inventory + culurien B-2 (exponential backoff with full jitter, proportional polling jitter)
P0-004	Database Constraint Violation	P0	FIXED	v0.1.27 per Inventory; timeout service now uses 'failed' result status (check constraint compatible)
P0-005	Setup Flow Broken	P0	NOT VERIFIED	Setup handler exists but end-to-end flow not tested in culurien branch. May still have issues
P0-006	Single-Admin Architecture	P0	ACCEPTED	Decision made: single-admin via .env. Users table exists for compatibility but not used for auth
P0-007	Install Script Path Variables	P0	FIXED	2025-12-17 per backlog + verified in Installer Fix 1 (config path consistency)
P0-008	Migration Runs on Fresh Install	P0	FIXED	2025-12-17 per backlog; early return in detection.go for empty agent_id
P0-009	Storage Scanner Wrong Table	P0	NOT DONE	Storage scanner still on legacy interface. Dedicated storage_metrics table exists but scanner reports to update_packages
P1-001	Agent Install ID Parsing	P1	PARTIALLY FIXED	extractOrGenerateAgentID() in install_template_service.go validates UUID format; but query param handling may still have edge cases
P1-002	Agent Timeout Handling	P1	PARTIALLY FIXED	E-1c made timeouts configurable from DB; per-scanner timeouts exist in config but generic 45s timeout may still apply in some paths
P2-001	Binary URL Architecture Mismatch	P2	FIXED	Installer Fix 2 added arch detection; templates override download URL with detected architecture
P2-002	Migration Error Reporting	P2	NOT DONE	Migration errors still only logged locally; no server-side visibility
P2-003	Agent Auto-Update System	P2	FIXED	Fully implemented (was incorrectly marked placeholder in backlog); verified in Upgrade Audit
P3-001	Duplicate Command Prevention	P3	FIXED	v0.1.27; unique index on (agent_id, command_type, status) WHERE status = 'pending'
P3-002	Security Status Dashboard	P3	PARTIALLY DONE	Security overview endpoints exist; no color-coded health scores or per-agent security badges
P3-003	Update Metrics Dashboard	P3	NOT DONE	No metrics dashboard, no trend charts
P3-004	Token Management UI Enhancement	P3	PARTIALLY DONE	Token list with copy-install-command exists; no delete, no bulk operations, no status filtering
P3-005	Server Health Dashboard	P3	NOT DONE	No health dashboard
P3-006	Structured Logging System	P3	NOT DONE (alternative)	ETHOS [TAG] format used instead of JSON structured logging. See VD-001
P4-001	Agent Retry Logic Resilience	P4	FIXED	v0.1.27 per Inventory + culurien B-2 (exponential backoff, circuit breakers)
P4-002	Scanner Timeout Optimization	P4	PARTIALLY DONE	Configurable per-subsystem timeouts in config; E-1c made server-side timeouts configurable from DB
P4-003	Agent File Management Migration	P4	PARTIALLY DONE	MigrationExecutor exists with old-path detection; constants/paths.go standardized; validation/pathutils packages have compile errors (dead code)
P4-004	Directory Path Standardization	P4	FIXED	constants/paths.go provides canonical paths; windows.go fixed to use constants.GetAgentConfigPath() (Installer Fix 1); installer templates use standard paths
P4-005	Testing Infrastructure Gaps	P4	SIGNIFICANTLY IMPROVED	From ~3 test files to 170 tests across 18 packages; no CI/CD yet
P4-006	Architecture Documentation Gaps	P4	PARTIALLY DONE	30+ docs in culurien docs/ folder; no formal architecture diagrams or ADRs
P5-001	Security Audit Documentation	P5	NOT DONE	No security audit checklist, IR procedures, or compliance mapping
P5-002	Development Workflow Documentation	P5	PARTIALLY DONE	.env.example created; no PR template, debugging guide, or release process

Summary: 10 FIXED, 8 PARTIALLY DONE, 6 NOT DONE, 1 ACCEPTED (design decision), 1 NOT VERIFIED, 1 alternative approach.

Section 7: Architecture Health Assessment

7A. Authentication Stack

Component	Spec	Actual	Match
Registration tokens	One-time or multi-seat	Implemented with seat limits	YES
JWT 24h expiry	Short-lived JWT	Implemented with issuer-based validation (A-3)	YES
Refresh tokens 90-day	Sliding window, SHA-256 hash	Implemented, renewal in transaction (B-2)	YES
Machine ID binding	`X-Machine-ID` header, 403 on mismatch	Implemented with canonical SHA256 hash (D-1)	YES
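
The machine ID binding row can be illustrated with a minimal sketch. The `X-Machine-ID` header and the 403 response come from the spec above; the normalization rules, the function names, and the way the stored value reaches the middleware are assumptions made for illustration, not the project's actual code.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
	"net/http"
	"strings"
)

// canonicalMachineID trims and lowercases a raw identifier, then returns its
// SHA-256 digest as lowercase hex. The normalization steps are assumed.
func canonicalMachineID(raw string) string {
	normalized := strings.ToLower(strings.TrimSpace(raw))
	sum := sha256.Sum256([]byte(normalized))
	return hex.EncodeToString(sum[:])
}

// machineIDMatches compares two digests in constant time.
func machineIDMatches(presented, stored string) bool {
	return subtle.ConstantTimeCompare([]byte(presented), []byte(stored)) == 1
}

// requireMachineID is a hypothetical middleware: it rejects a request with
// 403 when the X-Machine-ID header does not match the stored digest.
func requireMachineID(stored string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if !machineIDMatches(r.Header.Get("X-Machine-ID"), stored) {
			http.Error(w, "machine ID mismatch", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	stored := canonicalMachineID("  ABC-123 ")
	fmt.Println(machineIDMatches(canonicalMachineID("abc-123"), stored))
}
```

Hashing a normalized form means cosmetic differences in the raw identifier (case, surrounding whitespace) do not trigger a false mismatch.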

7B. Command Flow

Component	Spec	Actual	Match
Pull-only	Agents always initiate	Confirmed — server has no outbound capability	YES
5-minute check-in	Configurable interval	Default 300s, configurable via config.json	YES
Command types	scan_updates, collect_specs, install_updates, rollback_update, update_agent	All present in models/command.go plus enable/disable_heartbeat, reboot, dry_run_update, confirm_dependencies	YES+
Acknowledgment	pending_acks.json	acknowledgment.Tracker with persistence and retry	YES
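
As a rough illustration of the configurable check-in interval, the sketch below reads an interval from JSON and falls back to the 300-second default when none is set. The field name `check_in_interval_seconds` and the function name are hypothetical; the real `config.json` schema may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// defaultCheckIn mirrors the 300-second default from the table above.
const defaultCheckIn = 300 * time.Second

// agentConfig holds only the field needed here; the JSON key is assumed.
type agentConfig struct {
	CheckInIntervalSeconds int `json:"check_in_interval_seconds"`
}

// checkInInterval parses raw config JSON and returns the check-in interval,
// using the default when the field is missing or not positive.
func checkInInterval(raw []byte) (time.Duration, error) {
	var cfg agentConfig
	if err := json.Unmarshal(raw, &cfg); err != nil {
		return 0, err
	}
	if cfg.CheckInIntervalSeconds <= 0 {
		return defaultCheckIn, nil
	}
	return time.Duration(cfg.CheckInIntervalSeconds) * time.Second, nil
}

func main() {
	d, err := checkInInterval([]byte(`{"check_in_interval_seconds": 60}`))
	fmt.Println(d, err)
}
```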

7C. Security Stack

Component	Spec	Actual	Match
Ed25519 signing	Binary + command signing	Both implemented; v3 format exceeds spec	YES+
Nonce validation	< 5 min lifetime, anti-replay	10-minute default (VD-006), otherwise matches	CLOSE
TOFU key caching	Fetch once at registration	Implemented with TTL refresh	YES+
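
The nonce row combines two checks: a maximum lifetime and anti-replay. A minimal in-memory sketch of that combination follows. It uses the 10-minute default noted in VD-006; the type, its methods, and the in-memory map are assumptions, and the sketch presumes the issue timestamp is covered by the command signature.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// defaultNonceLifetime is the 10-minute default noted in VD-006.
const defaultNonceLifetime = 10 * time.Minute

var (
	errExpired  = errors.New("nonce expired")
	errReplayed = errors.New("nonce already used")
)

// nonceStore remembers nonces seen within the lifetime window.
type nonceStore struct {
	mu     sync.Mutex
	maxAge time.Duration
	now    func() time.Time     // clock, replaceable in tests
	seen   map[string]time.Time // nonce -> issue time
}

func newNonceStore(maxAge time.Duration) *nonceStore {
	return &nonceStore{maxAge: maxAge, now: time.Now, seen: make(map[string]time.Time)}
}

// check rejects nonces older than maxAge and nonces seen before.
// Entries older than maxAge are evicted first; this is safe because
// a nonce that old would be rejected as expired anyway.
func (s *nonceStore) check(nonce string, issuedAt time.Time) error {
	s.mu.Lock()
	defer s.mu.Unlock()

	now := s.now()
	for n, t := range s.seen {
		if now.Sub(t) > s.maxAge {
			delete(s.seen, n)
		}
	}
	if now.Sub(issuedAt) > s.maxAge {
		return errExpired
	}
	if _, ok := s.seen[nonce]; ok {
		return errReplayed
	}
	s.seen[nonce] = issuedAt
	return nil
}

func main() {
	s := newNonceStore(defaultNonceLifetime)
	issued := time.Now()
	fmt.Println(s.check("n1", issued)) // first use
	fmt.Println(s.check("n1", issued)) // replay of the same nonce
}
```

Because only nonces inside the lifetime window need to be remembered, the window length directly bounds the store's size. That is one practical consequence of choosing 10 minutes over 5.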

7D. Agent Paths

Platform	Spec Path	Actual	Match
Linux config	`/etc/redflag/config.json`	`/etc/redflag/agent/config.json` (constants.GetAgentConfigPath)	CLOSE — subdir added
Linux state	`/var/lib/redflag/`	`/var/lib/redflag/agent/`	CLOSE — subdir added
Linux binary	`/usr/local/bin/redflag-agent`	`/usr/local/bin/redflag-agent`	YES
Windows config	`C:\ProgramData\RedFlag\config.json`	`C:\ProgramData\RedFlag\agent\config.json` (fixed in Installer Fix 1)	CLOSE — subdir added

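The `agent` subdirectory was added to support future multi-component deployments (agent and server on the same machine). This is a reasonable structural enhancement, and it is applied consistently across the Linux config, Linux state, and Windows config paths.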
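
To show how a per-component subdirectory keeps paths from colliding, here is a small sketch that builds the config paths from the table. The function name and the `server` component are hypothetical; this is not the actual `constants.GetAgentConfigPath` implementation.

```go
package main

import (
	"fmt"
	"runtime"
)

// configPath returns the config file location for a component on the given
// OS. The base directories match the "Actual" column of the table above.
func configPath(goos, component string) string {
	if goos == "windows" {
		return `C:\ProgramData\RedFlag\` + component + `\config.json`
	}
	return "/etc/redflag/" + component + "/config.json"
}

func main() {
	fmt.Println(configPath(runtime.GOOS, "agent"))
	// A hypothetical second component gets its own directory.
	fmt.Println(configPath(runtime.GOOS, "server"))
}
```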

7E. Migration System

Component	Spec	Actual	Match
MigrationExecutor	Present	Implemented in `aggregator-agent/internal/migration/`	YES
Old path migration	`/etc/aggregator/` -> `/etc/redflag/`	Detection and backup implemented in installer templates and migration executor	YES

Section 8: The Honest Roadmap

HIGH VALUE, LOW EFFORT (Quick Wins)

JWT secret minimum strength (30 min) — Add len(secret) < 32 check in config loading. Addresses Code Review finding.
TLS bypass flag removal (1 hour) — Remove --insecure-tls flag from agent. Forces TLS in production.
Operational settings in UI (1 day) — Add operational category to SecuritySettings.tsx component. Makes timeout tuning accessible from dashboard.
Token delete button (1-2 days) — P3-004. DELETE endpoint + confirmation dialog. Currently requires DB manual cleanup.
/api/v1/info ldflags injection (1 hour) — Ensure Dockerfile passes -ldflags with actual version strings. Currently defaults to "dev".
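
The first quick win is small enough to sketch in full. The example below assumes the secret arrives as a string during config loading; the function and error names are hypothetical, and only the `len(secret) < 32` rule comes from the item above.

```go
package main

import (
	"errors"
	"fmt"
)

// minJWTSecretLen is the minimum secret length, in bytes, from the roadmap item.
const minJWTSecretLen = 32

var errWeakJWTSecret = errors.New("JWT secret must be at least 32 bytes")

// validateJWTSecret rejects secrets shorter than the minimum length.
// It checks length only; it does not measure entropy.
func validateJWTSecret(secret string) error {
	if len(secret) < minJWTSecretLen {
		return errWeakJWTSecret
	}
	return nil
}

func main() {
	fmt.Println(validateJWTSecret("too-short"))
	fmt.Println(validateJWTSecret("0123456789abcdef0123456789abcdef"))
}
```

Failing at startup, rather than logging a warning, would keep a weak secret from ever signing a token. Whether to hard-fail is a design choice the roadmap item leaves open.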

HIGH VALUE, HIGH EFFORT (Strategic Investments)

Maintenance Windows (2-3 weeks) — The single most impactful unbuilt feature. Enables scheduled patching during safe hours. Without this, every update requires manual approval at execution time.
Webhook notifications (1-2 weeks) — Slack/Teams alerts on critical update availability, failed installations, agent offline. Low integration overhead, high operational value.
Staggered rollout (1-2 weeks) — Deploy updates to 5% canary, monitor, then 25%, then 100%. Essential for fleets > 10 agents.
macOS agent (2-3 days) — launchd plist template + Homebrew scanner. Completes the "cross-platform" promise for homelabbers with Macs.
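
The staggered rollout item describes cumulative waves of 5%, 25%, and 100%. One way to partition a fleet along those lines is sketched below. The function name and the round-up rule are assumptions, and the sketch leaves out the monitoring step between waves.

```go
package main

import "fmt"

// rolloutWaves splits agents into consecutive waves. Each entry in
// cumulativePercents is the share of the fleet that should be covered once
// that wave completes (for example 5, 25, 100). Sizes round up so the canary
// wave always contains at least one agent; empty waves are skipped.
func rolloutWaves(agents []string, cumulativePercents []int) [][]string {
	var waves [][]string
	start := 0
	for _, pct := range cumulativePercents {
		end := (len(agents)*pct + 99) / 100 // integer ceiling of n*pct/100
		if end > len(agents) {
			end = len(agents)
		}
		if end <= start {
			continue
		}
		waves = append(waves, agents[start:end])
		start = end
	}
	return waves
}

func main() {
	agents := make([]string, 20)
	for i := range agents {
		agents[i] = fmt.Sprintf("agent-%02d", i+1)
	}
	for i, wave := range rolloutWaves(agents, []int{5, 25, 100}) {
		fmt.Printf("wave %d: %d agents\n", i+1, len(wave))
	}
}
```

For very small fleets, the middle wave collapses into the canary or the final wave. That fits the note that staggering mainly matters above roughly 10 agents.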

LOW VALUE (Defer or Drop)

AI features — Drop entirely for foreseeable future. Adds operational complexity (GPU/API costs) without proportional benefit for homelab use case. Manual approval is more appropriate.
aggregator-cli — Defer. The web dashboard covers all use cases. CLI would be nice-to-have for scripting but is not essential.
AUR/Snap/Flatpak scanners — Defer. Very small user base for each. APT and DNF cover 95%+ of Linux homelabbers.
LDAP/SSO — Defer until multi-user is needed. Single-admin is correct for homelab.
Compliance reporting — Drop. SOX/HIPAA requirements don't apply to homelabs.
Kubernetes deployment — Defer. Docker Compose is the right deployment model for the target audience.

Section 9: Summary Table

Feature Area	Planned	Built	Status	Gap Rating
Core Architecture (pull model, agents, server)	Full	Full	Working	0 (complete)
Ed25519 Signing (commands, binaries)	Full	Full + enhancements	Working	0 (exceeds spec)
Authentication (tokens, JWT, refresh)	Full	Full + transactions	Working	0 (exceeds spec)
Machine ID Binding	Full	Full + canonical hash	Working	0 (exceeds spec)
Replay Protection (nonces)	Full	Full (10min vs 5min)	Working	1 (timeout deviation)
Package Scanning (6 of 9 scanners)	9 scanners	6 scanners	Working	3 (AUR, Snap, Flatpak, Homebrew missing)
Agent Self-Upgrade	Full	Full 7-step pipeline	Working	0 (complete)
Installer (Linux + Windows)	Full	Full + arch + checksum	Working	1 (macOS missing)
Web Dashboard	Full	Core features	Working	3 (missing charts, health, metrics)
Database + Migrations	Full	Full + hardened	Working	0 (exceeds spec)
Docker Deployment	Full	Full	Working	0 (complete)
Testing	Minimal	170 tests	Working	2 (no CI/CD, no integration tests against real DB)
Maintenance Windows	Full	None	Not built	10 (completely absent)
AI Features	Full	None	Not built	10 (deliberately deferred)
Scheduling & Automation	Full	None	Not built	8 (no maintenance windows, no staggered rollout)
LDAP/SSO	Planned	None	Not built	5 (not needed for homelab)
Structured Logging	Planned (P3-006)	ETHOS alternative	Working differently	3 (functional but not JSON/correlation IDs)
Compliance / Reporting	Planned	None	Not built	2 (not applicable to homelab)
CLI Tool	Planned	None	Not built	2 (dashboard covers use cases)
macOS Support	Planned	None	Not built	4 (matters for homelabbers with Macs)

Overall: Core infrastructure is 9/10. Feature breadth is 5/10. Production readiness for homelab is 7/10.

The gap is almost entirely in features that were always labeled "future" or "Phase 2" in the original docs. The core architecture — the hard engineering work — is built, tested, and hardened beyond the original specification.
