Complete comparison of Fimeg's original vision against current codebase state after culurien branch work. - 9 sections covering architecture, features, backlog - 10 deviations documented (VD-001 through VD-010) - 27 backlog items tracked with current status - Honest roadmap with prioritized next steps - Executive summary for quick reference Core: 9/10. Features: 5/10. Homelab ready: 7/10. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Vision vs Reality: RedFlag Deviation Report
Date: 2026-03-29
Branch: culurien (post-integration verification)
Baseline: v0.1.27 (last Fimeg commit before culurien work)
Author: Claude (automated analysis based on complete codebase and historical documentation)
Section 1: Executive Summary
RedFlag began as "Aggregator" — Fimeg's vision for a self-hosted ConnectWise alternative that would give homelabbers centralized update management with "single pane of glass" visibility across Windows, Linux, and macOS. The Starting Prompt described an ambitious platform with AI-assisted scheduling, maintenance windows, natural language queries, and cross-platform agents covering APT, DNF, AUR, Snap, Flatpak, Winget, Windows Update, Docker, and Homebrew.
What exists today is a narrower but more deeply engineered system than the original breadth implied. The core architecture — pull-based agents, Ed25519 signed commands, machine ID binding, three-tier token authentication — is not just present but has been hardened well beyond the original specification. The culurien branch added transaction safety, replay attack protection, key rotation, configurable timeouts, path traversal defense, semver-aware version comparison, and grew the test suite from approximately 3 test files to 170 passing tests across 18 packages.
The honest gap: approximately 40% of the originally envisioned feature surface was never built. macOS support, AI features, maintenance windows, scheduled rollouts, structured JSON logging, LDAP/SSO, compliance reporting, the CLI tool — none exist. The features that do exist (package scanning, command execution, agent self-upgrade, installer infrastructure) work correctly and are production-quality for the homelab use case.
Is it production-ready for Fimeg's homelab? Yes, with caveats. The system can install agents on Linux and Windows, scan for package updates, approve and execute updates from the dashboard, and self-upgrade agents, all with cryptographic verification. The caveats are: the setup flow has known friction (P0-005); the /api/v1/info endpoint returns build-time values that default to "dev" unless they are injected via ldflags; and several P0 items from the original backlog have been fixed, but not all of those fixes have been verified end-to-end in a live deployment.
The "scare ConnectWise" ambition requires honest framing. RedFlag has three architectural advantages ConnectWise cannot replicate: hardware-bound machine IDs, mandatory self-hosting, and code transparency. But ConnectWise has remote control, 100+ integrations, SOC2 certification, and a decade of enterprise polish. RedFlag is competitive for the self-hosted update management niche — the subset of ConnectWise functionality that homelabbers and small IT teams actually use. It is not, and should not try to become, a full RMM platform. The strategic path is: own the update management vertical completely, then expand.
Section 2: What Was Built As Planned
2.1 Pull-Based Agent Architecture
Planned (Starting Prompt): Agent polls server every 5 minutes via GET /agents/{id}/commands. Server never initiates connections to agents.
Built: Exactly as planned. Agent polls via GET /agents/:id/commands with configurable interval (default 300 seconds). Proportional jitter added (B-2 fix: maxJitter = pollingInterval/2). Server has no outbound connection capability.
Location: aggregator-agent/cmd/agent/main.go:960-1070 (poll loop), aggregator-server/internal/api/handlers/agents.go (GetCommands handler)
Status: Working. Enhanced beyond spec with jitter and exponential backoff on failures.
2.2 Ed25519 Command Signing
Planned (Security.md): All commands in DB must be Ed25519-signed before being sent to agents. signAndCreateCommand() implemented in handlers.
Built: Exactly as planned, then enhanced. v3 signed message format includes agent_id:cmd_id:type:sha256(params):timestamp. Key ID and SignedAt tracked per command (migration 025). 29+ call sites use signAndCreateCommand().
Location: aggregator-server/internal/services/signing.go, aggregator-server/internal/api/handlers/agents.go:49-77
Status: Working. Exceeds original spec.
2.3 Machine ID Binding
Planned (Security.md section 3.1): MachineBindingMiddleware validates X-Machine-ID header against agents.machine_id. Mismatch = 403.
Built: Implemented. Machine ID is SHA256 hash of hardware identifiers (D-1 fix made this canonical across all platforms). Middleware validates on authenticated agent routes. Admin rebind endpoint added for recovery.
Location: aggregator-server/internal/api/middleware/machine_binding.go, aggregator-agent/internal/system/machine_id.go
Status: Working. Enhanced with canonical hash format and rebind capability.
2.4 Three-Tier Token Authentication
Planned (Security.md section 2): Registration tokens (one-time/multi-seat) -> JWT access tokens (24h) -> Refresh tokens (90-day sliding window, stored as SHA-256 hash).
Built: Exactly as planned. Registration tokens consumed at /agents/register with seat limits. JWT issued with issuer=redflag-agent (A-3 fix). Refresh tokens stored hashed, renewed via /agents/renew. Token renewal wrapped in database transaction (B-2 fix).
Location: aggregator-server/internal/api/handlers/auth.go, aggregator-server/internal/api/middleware/auth.go, aggregator-server/internal/database/queries/refresh_tokens.go
Status: Working. Transaction safety added beyond original spec.
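Storing refresh tokens as SHA-256 hashes means the database never holds a usable token. The following is a minimal sketch of that idea; the function names, the 32-byte token length, and the hex encoding are illustrative assumptions rather than the project's actual code.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// hashToken returns the hex SHA-256 digest that would be persisted.
func hashToken(token string) string {
	sum := sha256.Sum256([]byte(token))
	return hex.EncodeToString(sum[:])
}

// newRefreshToken returns the plaintext token (sent to the agent once)
// and the hash (the only value stored server-side).
func newRefreshToken() (token, storedHash string, err error) {
	buf := make([]byte, 32) // token length is an assumption
	if _, err = rand.Read(buf); err != nil {
		return "", "", err
	}
	token = hex.EncodeToString(buf)
	return token, hashToken(token), nil
}

// matches compares the hash of a presented token to the stored hash
// in constant time.
func matches(presented, storedHash string) bool {
	return subtle.ConstantTimeCompare([]byte(hashToken(presented)), []byte(storedHash)) == 1
}

func main() {
	token, stored, err := newRefreshToken()
	if err != nil {
		panic(err)
	}
	fmt.Println(len(stored), matches(token, stored), matches("wrong", stored))
}
```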
2.5 Replay Attack Protection (Nonce System)
Planned (Security.md section 3.3): Unique, time-limited, Ed25519-signed nonce for every sensitive command. Agent validates signature and timestamp.
Built: Implemented via UpdateNonceService with Generate() and Validate(). Nonce format: uuid:unix_timestamp, signed with Ed25519. Agent validates freshness within configurable window.
Location: aggregator-server/internal/services/nonce_service.go, aggregator-agent/cmd/agent/subsystem_handlers.go:1074 (validateNonce)
Status: Working. Timeout is 10 minutes (not 5 as specified in Overview.md — see VD-006).
2.6 Agent Self-Registration Flow
Planned (Starting Prompt): Agent POSTs to /agents/register with hostname, OS info, version. Server returns agent_id, token, config.
Built: Exactly as planned plus enhancements. Registration includes machine_id (SHA256 hash) and public_key_fingerprint. Registration wrapped in database transaction (B-2 fix). Agent stores agent_id, token, refresh_token, check_in_interval to config file.
Location: aggregator-agent/cmd/agent/main.go:371-483 (registerAgent), aggregator-server/internal/api/handlers/agents.go (RegisterAgent)
Status: Working.
2.7 Package Scanner Architecture
Planned (Starting Prompt): APT, DNF/YUM, AUR, Winget, Windows Update, Docker, Snap, Flatpak, Homebrew.
Built: APT, DNF, Winget, Windows Update, Docker, Storage metrics. Each scanner has circuit breaker protection and configurable timeouts.
Not built: AUR, Snap, Flatpak, Homebrew. See Section 5A.
Location: aggregator-agent/internal/scanner/ (apt.go, dnf.go, winget.go, docker.go), aggregator-agent/pkg/windowsupdate/
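Status: Working for implemented platforms. 5 of the 9 planned scanners built (APT, DNF, Winget, Windows Update, Docker), plus a storage metrics scanner that was not in the original list.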
2.8 Circuit Breaker Pattern
Planned (ETHOS.md principle 3, Overview.md): Circuit Breaker on fragile scanners (Windows Update, DNF).
Built: Full implementation with Closed/Open/HalfOpen states, configurable failure threshold, failure window, open duration, and half-open attempts. Applied to all scanners via subsystem config.
Location: aggregator-agent/internal/circuitbreaker/circuit_breaker.go, config per subsystem in config.json
Status: Working.
2.9 Command Acknowledgment System
Planned (Overview.md): pending_acks.json for at-least-once delivery guarantee.
Built: acknowledgment.Tracker package with persistent pending acks, retry with IncrementRetry(), and state file persistence.
Location: aggregator-agent/internal/acknowledgment/
Status: Working.
2.10 Agent Service Management
Planned (Overview.md): systemd on Linux, Windows Services (SCM) on Windows.
Built: Both. Linux systemd unit generated inline in installer template with security hardening (ProtectSystem=strict, ProtectHome=true, PrivateTmp=true). Windows SCM registration via InstallService() with auto-start and recovery actions.
Location: aggregator-agent/internal/service/windows.go:438-516, installer templates in aggregator-server/internal/services/templates/install/scripts/
Status: Working.
2.11 Web Dashboard
Planned (Starting Prompt): React 18 + TypeScript, TailwindCSS, TanStack Query, Recharts, TanStack Table, WebSocket.
Built: React + TypeScript + TailwindCSS + TanStack Query. Dashboard with agent list, update management, command history, security settings. TypeScript strict compliance (0 errors after E-1b). No Recharts charts, no TanStack Table, limited WebSocket.
Location: aggregator-web/src/
Status: Working. Feature-complete for core use case but missing visualization features (charts, trend analysis).
2.12 PostgreSQL with Migration Runner
Planned (Starting Prompt, ETHOS.md): PostgreSQL database with idempotent migrations.
Built: PostgreSQL 16-alpine, custom migration runner in database/db.go with schema_migrations tracking table. 30 migrations (001-030) all using IF NOT EXISTS / ON CONFLICT DO NOTHING. Migration runner aborts fatally on failure (B-1 fix).
Location: aggregator-server/internal/database/db.go, aggregator-server/internal/database/migrations/
Status: Working.
2.13 Docker Compose Deployment
Planned (Starting Prompt): docker-compose.yml for quick start.
Built: Three-service compose: postgres (16-alpine), server (multi-stage Go build), web (nginx). Health checks, volume persistence, env_file support.
Location: docker-compose.yml, config/.env.example
Status: Working.
2.14 Installer Scripts
Planned (various docs): Install script served from /install/:platform endpoint.
Built: Linux bash and Windows PowerShell templates served dynamically. Linux installer creates system user, directories, sudoers (per-package-manager), systemd service with security hardening, registers agent, starts service. Windows installer creates directories, downloads binary, writes config, registers service. Both have arch auto-detection and checksum verification (Installer Fix 2).
Location: aggregator-server/internal/services/templates/install/scripts/linux.sh.tmpl, windows.ps1.tmpl
Status: Working. Idempotent.
2.15 Agent Self-Upgrade Pipeline
Planned (Overview.md, P2-003): 7-step pipeline: download, checksum verify, Ed25519 binary signature verify, backup, atomic install, service restart, watchdog confirmation.
Built: Exactly as specified. All 7 steps implemented with deferred rollback on any failure including watchdog timeout (5 minutes). Download has 5-minute timeout and 500MB size limit (U-8 fix).
Location: aggregator-agent/cmd/agent/subsystem_handlers.go:575-762
Status: Working. The v0.1.27 INVENTORY doc noted this was "FULLY IMPLEMENTED" despite the backlog claiming it was placeholder.
Section 3: What Was Built Better Than Planned
3.1 Ed25519 Key Rotation with TTL
Original: Security.md noted key rotation was "TODO" with no implementation. SETUP-SECURITY.md described a POST /security/keys/rotate API with 30-day grace period.
Built: TTL-based key caching with automatic refresh. Server registers primary key in signing_keys table (migration 025). Agent caches server public key at registration (TOFU), with configurable TTL refresh. Key ID tracked per command for audit trail.
Why better: Automatic TTL refresh is more reliable than manual API-triggered rotation. The manual rotation API was never needed because the system handles staleness automatically.
3.2 Command Signing v3 Format
Original: Commands signed with cmd_id:type:sha256(params).
Built: v3 format includes agent_id:cmd_id:type:sha256(params):timestamp. Agent ID binding prevents command replay to different agents. Timestamp enables time-based expiry independent of DB state.
Why better: Prevents a class of relay attacks where a compromised agent could forward signed commands to other agents.
3.3 Machine ID Canonical SHA256 Hash
Original: Machine ID was inconsistent — registration fallback used "unknown-" + hostname (unhashed) while runtime used SHA256.
Built (D-1): All paths now use GetMachineID() which always returns a 64-character hex SHA256 hash. Registration aborts with log.Fatalf if machine ID cannot be obtained — no unhashed fallback.
Why better: Eliminates format mismatch between registration and runtime that would cause 403 errors after restart.
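The canonical format is easy to illustrate: whatever identifiers are collected, the result is always a 64-character hex digest, and the absence of identifiers is an error rather than a fallback. The function machineIDFrom and its "|" separator are hypothetical; the real GetMachineID() gathers platform-specific identifiers that this sketch does not attempt to reproduce.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// machineIDFrom hashes the supplied identifiers into a 64-character hex
// SHA-256 string. It returns an error instead of an unhashed fallback
// when no identifier is available.
func machineIDFrom(identifiers ...string) (string, error) {
	joined := strings.Join(identifiers, "|")
	if strings.Trim(joined, "|") == "" {
		return "", errors.New("no hardware identifiers available")
	}
	sum := sha256.Sum256([]byte(joined))
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	id, err := machineIDFrom("example-board-serial", "example-disk-uuid")
	if err != nil {
		panic(err)
	}
	fmt.Println(len(id), id)
}
```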
3.4 Transaction Safety (B-Series)
Original: Not specified. Registration, command delivery, and token renewal were separate DB operations without transaction wrapping.
Built (B-2): Registration wrapped in tx.Beginx() with defer tx.Rollback(). Command delivery uses SELECT FOR UPDATE SKIP LOCKED (atomic claim). Token renewal wrapped in transaction. JWT generated after commit, not before.
Why better: Prevents partial registration state, command double-delivery race conditions, and orphaned tokens.
3.5 Configurable Operational Timeouts
Original: 6 hardcoded timeout values in main.go and timeout.go.
Built (E-1c): All 6 values stored in security_settings table under operational category (migration 030). Read from DB at startup with hardcoded fallback. Zero-value protection prevents zero-duration tickers.
Why better: Administrators can tune timeouts via API without code changes or redeployment.
3.6 Binary Path Traversal Protection
Original: Not specified — c.File(pkg.BinaryPath) served DB-sourced paths without validation.
Built (E-1c + Integration Verification): Both DownloadUpdatePackage and DownloadAgent resolve paths via filepath.Abs() and validate against REDFLAG_BINARY_STORAGE_PATH using prefix check. Traversal attempts logged and return 403.
Why better: Defense in depth against DB compromise scenarios.
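The containment check can be sketched with path/filepath from the standard library. The helper isWithin is hypothetical, and appending a path separator before the prefix comparison is a choice made here so that a sibling directory such as /srv/binaries-evil does not pass; whether the handlers do the same is not stated in the source. The sketch does not resolve symlinks.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// isWithin reports whether candidate resolves to a location inside base.
// filepath.Abs also cleans the path, so ".." segments are collapsed
// before the prefix comparison.
func isWithin(base, candidate string) (bool, error) {
	absBase, err := filepath.Abs(base)
	if err != nil {
		return false, err
	}
	absPath, err := filepath.Abs(candidate)
	if err != nil {
		return false, err
	}
	if absPath == absBase {
		return true, nil
	}
	return strings.HasPrefix(absPath, absBase+string(filepath.Separator)), nil
}

func main() {
	base := "/srv/binaries" // stand-in for REDFLAG_BINARY_STORAGE_PATH
	for _, p := range []string{
		"/srv/binaries/linux-amd64/agent",
		"/srv/binaries/../etc/passwd",
		"/srv/binaries-evil/agent",
	} {
		ok, err := isWithin(base, p)
		fmt.Println(p, ok, err)
	}
}
```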
3.7 TypeScript Strict Compliance
Original: 217 TypeScript errors in aggregator-web/src/.
Built (E-1b): All 217 errors fixed. Zero @ts-ignore or as any suppressions added. Type interfaces verified against actual server JSON responses. TanStack Query v5 isLoading -> isPending migration for mutations.
Why better: Catches type mismatches at compile time instead of runtime.
3.8 Semver-Aware Version Comparison
Original: versions.go:72 used lexicographic string comparison (agentVersion < current.MinAgentVersion), so "0.1.9" compared as greater than "0.1.22".
Built (Upgrade Fix): CompareVersions() with octet-by-octet numeric parsing. Handles "dev" as always-older, "v" prefix stripping, mismatched octet counts.
Why better: Version gates now work correctly for all version numbers.
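An octet-by-octet comparison along these lines could look like the sketch below. It is named compareVersionsSketch to make clear it is not the project's CompareVersions(); treating missing octets and non-numeric octets as 0 are assumptions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion strips an optional "v" prefix and converts each octet to int.
// Non-numeric octets count as 0 (an assumption of this sketch).
func parseVersion(v string) []int {
	fields := strings.Split(strings.TrimPrefix(v, "v"), ".")
	out := make([]int, len(fields))
	for i, f := range fields {
		if n, err := strconv.Atoi(f); err == nil {
			out[i] = n
		}
	}
	return out
}

// compareVersionsSketch returns -1, 0, or 1. "dev" is older than any
// numbered version; missing octets are treated as 0.
func compareVersionsSketch(a, b string) int {
	switch {
	case a == "dev" && b == "dev":
		return 0
	case a == "dev":
		return -1
	case b == "dev":
		return 1
	}
	pa, pb := parseVersion(a), parseVersion(b)
	n := len(pa)
	if len(pb) > n {
		n = len(pb)
	}
	for i := 0; i < n; i++ {
		var x, y int
		if i < len(pa) {
			x = pa[i]
		}
		if i < len(pb) {
			y = pb[i]
		}
		if x < y {
			return -1
		}
		if x > y {
			return 1
		}
	}
	return 0
}

func main() {
	fmt.Println("lexicographic:", "0.1.9" < "0.1.22")
	fmt.Println("numeric:", compareVersionsSketch("0.1.9", "0.1.22"))
}
```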
3.9 Test Suite Growth
Original (Code Review): "Only 3 test files across the entire codebase" — circuitbreaker_test.go, test_disk.go, test_disk_detection.go, plus scheduler tests.
Built: 170 tests across 18 packages covering: Ed25519 signing and replay protection, JWT issuer validation, registration transactions, command delivery races, machine ID format, ETHOS compliance, path traversal, version comparison, checksum computation, timeout configuration, and more.
Why better: Regression detection for all fix series (A through Upgrade).
3.10 ETHOS Logging Compliance
Original: Mixed fmt.Printf, emoji in logs, inconsistent format.
Built (D-2 + Integration Verification): All production log statements use log.Printf("[TAG] [system] [component] message key=value"). Emoji removed from daemon log.Printf calls. fmt.Printf DEBUG statements removed from handlers. Terminal/CLI output emoji explicitly exempted (DEV-039).
3.11 Installer Architecture Detection
Original: Architecture hardcoded to amd64 in generateInstallScript.
Built (Installer Fix 2): Runtime detection via uname -m (Linux) and $env:PROCESSOR_ARCHITECTURE (Windows). Server accepts optional ?arch= query param. Download endpoint already supported linux-arm64 and windows-arm64.
Why better: ARM64 homelabbers (Raspberry Pi, Apple Silicon VMs) can now install without manual binary download.
3.12 Binary Checksum Verification
Original: No verification of downloaded binary integrity.
Built (Installer Fix 2): Server computes SHA-256 and serves X-Content-SHA256 header. Linux installer verifies with sha256sum. Windows installer verifies with Get-FileHash. Missing header = warn but continue (backward compatible).
3.13 Machine ID Rebind Endpoint
Original: Not specified. If machine ID changed (hardware replacement, VM migration), agent was permanently locked out.
Built (D-1): Admin endpoint POST /admin/agents/:id/rebind-machine-id allows re-binding an agent to new hardware. Requires admin authentication.
Section 4: What Was Built Differently (Deviations)
VD-001: Logging Format
Original (P3-006): JSON structured logs with correlation IDs via logrus or similar library. StructuredLogger implementation, CorrelationIDMiddleware, buffered async writes, P95/P99 latency tracking, system_logs database table.
Actual: ETHOS [TAG] [system] [component] plain text format via log.Printf. No correlation IDs, no structured JSON, no centralized aggregation, no log database table.
Rationale: ETHOS principle #5 (no marketing fluff) and principle #1 (errors are history) were prioritized over the P3-006 spec. Plain text logging with consistent tags is grep-friendly and sufficient for the homelab use case. JSON structured logging adds complexity without proportional benefit at the current scale.
Verdict: Acceptable for homelab. Would need to be revisited for fleet-scale deployments (100+ agents).
VD-002: Authentication Architecture
Original (P0-006 + Starting Prompt): Multi-user system with users table, admin/user/readonly roles, email fields, EnsureAdminUser(). The Starting Prompt shows Settings page with "users" section.
Actual: Single-admin via .env credentials. The users table exists in migrations (for compatibility) but is not used for authentication. Web auth validates against REDFLAG_ADMIN_USER/REDFLAG_ADMIN_PASSWORD from environment.
Rationale: P0-006 recommended "Option 1: Complete Removal" — recognizing that multi-user scaffolding increased attack surface without benefit for a homelab tool. The current implementation follows this recommendation.
Verdict: Correct for homelab. Multi-user would be needed for MSP/enterprise use case.
VD-003: Build Orchestrator
Original (architecture docs): Dynamic agent compilation per request. Build Orchestrator would cross-compile agent binaries on demand.
Actual: Pre-built binaries placed at container build time (via Dockerfile multi-stage build). BuildAndSignAgent signs existing binaries but never compiles. AgentBuilder generates config JSON only. build_orchestrator.go services layer marked // Deprecated.
Rationale: Cross-compilation on every request is impractical for a homelab server. The Dockerfile multi-stage build compiles once; the server serves pre-built binaries. This is how production package distribution works (e.g., GitHub Releases).
Verdict: Correct pragmatic simplification. Dynamic compilation would add complexity without benefit.
VD-004: Upgrade Trigger Path
Original: POST /build/upgrade/:agentID was meant to orchestrate full upgrades.
Actual: The real upgrade path is POST /agents/{id}/update (in agent_updates.go), which validates the agent, generates nonces, creates signed update_agent commands, and tracks delivery. The /build/upgrade endpoint generates config JSON with manual instructions — it's an admin utility, not the upgrade orchestrator.
Rationale: The /agents/{id}/update path already existed and was more complete (nonce generation, command signing, delivery tracking). Wiring a parallel path would have created confusion.
Verdict: Acceptable. The working path is better designed.
VD-005: Security Settings UI
Original (SECURITY-SETTINGS.md): Full security settings configurable from dashboard including machine binding mode, version enforcement, nonce timeout, signature algorithm, log level, alert thresholds.
Actual: Security settings backend works (API CRUD for security_settings table). Dashboard displays settings for command_signing, update_signing, nonce_validation, machine_binding, signature_verification. The operational category (E-1c timeouts) is accessible via API but not visible in the UI. No validation rules enforcement for operational settings.
Verdict: Partially implemented. Backend complete, frontend shows security-category settings. Operational settings need UI exposure.
VD-006: Nonce Timeout
Original (Overview.md, Security.md): Nonce lifetime "< 5 minutes".
Actual (SETUP-SECURITY.md, code): REDFLAG_SECURITY_NONCE_TIMEOUT=600 (10 minutes). The code uses a 10-minute default.
Rationale: The original docs contradict each other — Overview.md says "< 5 min" while SETUP-SECURITY.md says 600 seconds. The 10-minute value appears to be a deliberate choice to accommodate slow network conditions (agents polling every 5 minutes may not receive the command within a 5-minute nonce window).
Verdict: The 10-minute value is more practical. The 5-minute spec was likely aspirational. Document this as intentional.
VD-007: Key Rotation API
Original (SETUP-SECURITY.md): POST /api/v1/security/keys/rotate with grace_period_days (default 30). During grace period both old and new keys valid. Keys stored in /app/keys/ directory.
Actual: Key rotation is TTL-based via the signing key registry in signing_keys table. No explicit /keys/rotate API endpoint. No dual-key grace period — agents refresh their cached public key via TTL (24h default). Key stored in environment variable, not in /app/keys/ directory.
Rationale: Different implementation approach that achieves the same goal (agents can handle key changes) without the complexity of dual-key acceptance windows.
Verdict: Functional but different. A manual key rotation API would be a nice-to-have for planned rotations.
VD-008: Version Format
Original (SECURITY-SETTINGS.md): "Semantic version string (X.Y.Z), integers only, no v prefix."
Actual: Four-octet format X.Y.Z.W where W is the config version (e.g., 0.1.26.0). The v prefix is tolerated and stripped during comparison.
Rationale: The fourth octet was added to embed config schema version alongside the agent version, avoiding a separate version field.
Verdict: Acceptable extension of spec. CompareVersions() handles both 3-octet and 4-octet formats.
VD-009: Windows Service Key Rotation
Original: Not explicitly specified, but key rotation logic exists in main.go polling loop.
Actual (DEV-030): The Windows service polling loop in windows.go does not call ShouldRefreshKey. The comment at lines 164-168 acknowledges this as a TODO. Agents running as Windows services rely on the 24h TTL key cache and will not proactively detect key rotation.
Verdict: Known gap. Low risk — the 24h TTL cache means Windows agents will naturally pick up new keys within a day.
VD-010: Watchdog Version Comparison
Original: Not explicitly specified.
Actual: The upgrade watchdog in subsystem_handlers.go:943 uses string equality (agent.CurrentVersion == expectedVersion) instead of CompareVersions(). A normalized version string mismatch (e.g., "v0.1.4" vs "0.1.4") would trigger false rollback.
Verdict: Low risk — both sides use the same version string from the same source. Would need fixing if version normalization is ever introduced.
Section 5: What Was Never Built
5A. Platform Support
| Feature | Original Priority | Effort Estimate | Notes |
|---|---|---|---|
| macOS agent / launchd | Starting Prompt | 2-3 days | No launchd plist, no macOS-specific code |
| Homebrew scanner | Starting Prompt | 1-2 days | Would follow APT/DNF pattern |
| AUR scanner (Arch) | Starting Prompt | 1-2 days | Would follow APT/DNF pattern |
| Snap scanner | Starting Prompt | 1 day | Low demand |
| Flatpak scanner | Starting Prompt | 1 day | Low demand |
| aggregator-cli (Go CLI) | Starting Prompt | 3-5 days | Power-user tool, not essential for homelab |
5B. AI Features
| Feature | Original Priority | Effort Estimate | Notes |
|---|---|---|---|
| AI Chat Sidebar (Ollama/OpenAI) | Starting Prompt | 2-3 weeks | No AI code exists anywhere |
| Natural language queries (POST /ai/query) | Starting Prompt | 1-2 weeks | Requires AI sidebar first |
| AI-assisted scheduling (POST /ai/schedule) | Starting Prompt | 1 week | Requires maintenance windows first |
| AI decision audit trail (GET /ai/decisions) | Starting Prompt | 3-5 days | Requires AI features first |
Honest assessment: AI features were aspirational in the Starting Prompt ("Future Phase"). They add significant complexity and operational overhead (Ollama requires GPU resources or external API costs). For a homelab tool, manual approval is more appropriate than AI-assisted scheduling. Recommend deferring indefinitely unless Fimeg has a specific use case.
5C. Scheduling & Automation
| Feature | Original Priority | Effort Estimate | Notes |
|---|---|---|---|
| Maintenance Windows (RRULE recurrence) | Starting Prompt, P3 | 2-3 weeks | Full RRULE parser + calendar UI + auto-approve logic |
| Auto-approve by severity during windows | Starting Prompt | 1 week | Requires maintenance windows |
| Scheduled update execution | Starting Prompt | 1 week | Requires maintenance windows |
| Staggered rollout (5%/25%/100%) | P2-003, Strategic Roadmap | 1-2 weeks | Server-side group selection + phased command queuing |
| Auto-upgrade trigger (version-based) | Upgrade Audit | 1 week | Server detects old version on check-in, queues update_agent |
Honest assessment: Maintenance windows are the highest-value unbuilt feature for production use. Auto-approve by severity during defined windows would significantly reduce manual work.
5D. Observability
| Feature | Original Priority | Effort Estimate | Notes |
|---|---|---|---|
| Structured JSON logging (P3-006) | P3 | 3-4 days | logrus + correlation IDs |
| Correlation ID propagation | P3-006 | 2-3 days | Middleware + header propagation |
| Update Metrics Dashboard (P3-003) | P3 | 2-3 days | Success/failure rates, trend charts |
| Server Health Dashboard (P3-005) | P3 | 2-3 days | CPU, memory, DB connections |
| Prometheus metrics endpoint | Strategic Roadmap | 2-3 days | /metrics endpoint with Go prometheus client |
| Real-time WebSocket updates | Starting Prompt | 1-2 weeks | Partial: security events WebSocket exists |
5E. Integration & Ecosystem
| Feature | Original Priority | Effort Estimate | Notes |
|---|---|---|---|
| LDAP/Active Directory | Strategic Roadmap | 2-3 weeks | Auth integration |
| SAML/OIDC for SSO | Strategic Roadmap | 2-3 weeks | Requires multi-user first |
| Slack/Teams/PagerDuty webhooks | Strategic Roadmap | 1-2 weeks | Event notification hooks |
| Compliance reporting (SOX, HIPAA) | Strategic Roadmap | 4-6 weeks | Report generation framework |
| Kubernetes deployment | Strategic Roadmap | 1-2 weeks | Helm chart + StatefulSet |
| Ansible/Terraform integrations | Strategic Roadmap | 2-3 weeks | Module/provider development |
Honest assessment: Webhooks (Slack/Teams) are the highest-value integration for homelab use. LDAP/SSO only matters if Fimeg plans to support multi-user deployments.
5F. UI Features
| Feature | Original Priority | Effort Estimate | Notes |
|---|---|---|---|
| Security Status Dashboard Indicators (P3-002) | P3 | 2-3 days | Color-coded security health scores |
| Token Management UI Enhancement (P3-004) | P3 | 1-2 days | Delete tokens, bulk operations |
| Server Health Dashboard (P3-005) | P3 | 2-3 days | System status monitoring |
| Operational settings in UI | E-1c carry-over | 1 day | Add 'operational' category to SecuritySettings.tsx |
| Update metrics and trend charts | P3-003 | 2-3 days | Recharts integration |
| Calendar view for maintenance windows | Starting Prompt | 1-2 weeks | Requires maintenance windows backend |
5G. Security Features
| Feature | Original Priority | Effort Estimate | Notes |
|---|---|---|---|
| Multi-factor authentication | Strategic Roadmap | 1-2 weeks | TOTP integration |
| API key rotation via UI | SETUP-SECURITY.md | 2-3 days | Manual rotation endpoint |
| Key rotation with grace period | SETUP-SECURITY.md | 1 week | Dual-key acceptance window |
| TLS hardening (remove bypass flag) | Code Review | 1 hour | Remove --insecure-tls flag |
| JWT secret minimum strength | Code Review | 30 min | Validation in config loading |
Section 6: Backlog Status Table
| ID | Title | Original Priority | Current Status | Where Fixed/Notes |
|---|---|---|---|---|
| P0-001 | Rate Limit First Request Bug | P0 | FIXED | v0.1.26 per v0.1.27 Inventory; rate limiter namespaced by type in A-3 fixes |
| P0-002 | Session Loop Bug | P0 | PARTIALLY FIXED | SetupCompletionChecker modified in E-1b (removed isSetupMode state); may need live verification |
| P0-003 | Agent No Retry Logic | P0 | FIXED | v0.1.27 per Inventory + culurien B-2 (exponential backoff with full jitter, proportional polling jitter) |
| P0-004 | Database Constraint Violation | P0 | FIXED | v0.1.27 per Inventory; timeout service now uses 'failed' result status (check constraint compatible) |
| P0-005 | Setup Flow Broken | P0 | NOT VERIFIED | Setup handler exists but end-to-end flow not tested in culurien branch. May still have issues |
| P0-006 | Single-Admin Architecture | P0 | ACCEPTED | Decision made: single-admin via .env. Users table exists for compatibility but not used for auth |
| P0-007 | Install Script Path Variables | P0 | FIXED | 2025-12-17 per backlog + verified in Installer Fix 1 (config path consistency) |
| P0-008 | Migration Runs on Fresh Install | P0 | FIXED | 2025-12-17 per backlog; early return in detection.go for empty agent_id |
| P0-009 | Storage Scanner Wrong Table | P0 | NOT DONE | Storage scanner still on legacy interface. Dedicated storage_metrics table exists but scanner reports to update_packages |
| P1-001 | Agent Install ID Parsing | P1 | PARTIALLY FIXED | extractOrGenerateAgentID() in install_template_service.go validates UUID format; but query param handling may still have edge cases |
| P1-002 | Agent Timeout Handling | P1 | PARTIALLY FIXED | E-1c made timeouts configurable from DB; per-scanner timeouts exist in config but generic 45s timeout may still apply in some paths |
| P2-001 | Binary URL Architecture Mismatch | P2 | FIXED | Installer Fix 2 added arch detection; templates override download URL with detected architecture |
| P2-002 | Migration Error Reporting | P2 | NOT DONE | Migration errors still only logged locally; no server-side visibility |
| P2-003 | Agent Auto-Update System | P2 | FIXED | Fully implemented (was incorrectly marked placeholder in backlog); verified in Upgrade Audit |
| P3-001 | Duplicate Command Prevention | P3 | FIXED | v0.1.27; unique index on (agent_id, command_type, status) WHERE status = 'pending' |
| P3-002 | Security Status Dashboard | P3 | PARTIALLY DONE | Security overview endpoints exist; no color-coded health scores or per-agent security badges |
| P3-003 | Update Metrics Dashboard | P3 | NOT DONE | No metrics dashboard, no trend charts |
| P3-004 | Token Management UI Enhancement | P3 | PARTIALLY DONE | Token list with copy-install-command exists; no delete, no bulk operations, no status filtering |
| P3-005 | Server Health Dashboard | P3 | NOT DONE | No health dashboard |
| P3-006 | Structured Logging System | P3 | NOT DONE (alternative) | ETHOS [TAG] format used instead of JSON structured logging. See VD-001 |
| P4-001 | Agent Retry Logic Resilience | P4 | FIXED | v0.1.27 per Inventory + culurien B-2 (exponential backoff, circuit breakers) |
| P4-002 | Scanner Timeout Optimization | P4 | PARTIALLY DONE | Configurable per-subsystem timeouts in config; E-1c made server-side timeouts configurable from DB |
| P4-003 | Agent File Management Migration | P4 | PARTIALLY DONE | MigrationExecutor exists with old-path detection; constants/paths.go standardized; validation/pathutils packages have compile errors (dead code) |
| P4-004 | Directory Path Standardization | P4 | FIXED | constants/paths.go provides canonical paths; windows.go fixed to use constants.GetAgentConfigPath() (Installer Fix 1); installer templates use standard paths |
| P4-005 | Testing Infrastructure Gaps | P4 | SIGNIFICANTLY IMPROVED | From ~3 test files to 170 tests across 18 packages; no CI/CD yet |
| P4-006 | Architecture Documentation Gaps | P4 | PARTIALLY DONE | 30+ docs in culurien docs/ folder; no formal architecture diagrams or ADRs |
| P5-001 | Security Audit Documentation | P5 | NOT DONE | No security audit checklist, IR procedures, or compliance mapping |
| P5-002 | Development Workflow Documentation | P5 | PARTIALLY DONE | .env.example created; no PR template, debugging guide, or release process |
Summary: 10 FIXED, 8 PARTIALLY DONE, 6 NOT DONE, 1 ACCEPTED (design decision), 1 NOT VERIFIED, 1 alternative approach.
Section 7: Architecture Health Assessment
7A. Authentication Stack
| Component | Spec | Actual | Match |
|---|---|---|---|
| Registration tokens | One-time or multi-seat | Implemented with seat limits | YES |
| JWT 24h expiry | Short-lived JWT | Implemented with issuer-based validation (A-3) | YES |
| Refresh tokens 90-day | Sliding window, SHA-256 hash | Implemented, renewal in transaction (B-2) | YES |
| Machine ID binding | X-Machine-ID header, 403 on mismatch | Implemented with canonical SHA256 hash (D-1) | YES |
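The machine ID binding row can be illustrated with a small sketch. Only the `X-Machine-ID` header, the 403 response, and the use of SHA-256 come from this report. The normalization step (trim and lowercase), the treatment of a missing header as a mismatch, and the names `canonicalMachineID`, `machineIDStatus`, and `requireMachineID` are assumptions made for illustration.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
	"net/http"
	"strings"
)

// canonicalMachineID hashes a raw machine identifier after normalizing it,
// so the same machine always yields the same 64-character hex digest.
// The normalization rules here are an assumption, not the project's.
func canonicalMachineID(raw string) string {
	normalized := strings.ToLower(strings.TrimSpace(raw))
	sum := sha256.Sum256([]byte(normalized))
	return hex.EncodeToString(sum[:])
}

// machineIDStatus returns 200 when the presented ID matches the registered
// one and 403 otherwise. An empty header is treated as a mismatch.
func machineIDStatus(registered, presented string) int {
	if presented == "" ||
		subtle.ConstantTimeCompare([]byte(registered), []byte(presented)) != 1 {
		return http.StatusForbidden
	}
	return http.StatusOK
}

// requireMachineID wraps a handler and rejects requests whose X-Machine-ID
// header does not match the ID that lookup returns for the request.
func requireMachineID(lookup func(*http.Request) string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if machineIDStatus(lookup(r), r.Header.Get("X-Machine-ID")) != http.StatusOK {
			http.Error(w, "machine ID mismatch", http.StatusForbidden)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	id := canonicalMachineID("  ABC-123 \n")
	fmt.Println(len(id), machineIDStatus(id, id))
}
```

The comparison uses `subtle.ConstantTimeCompare` so that response timing does not reveal how many leading characters of a guessed ID were correct.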
7B. Command Flow
| Component | Spec | Actual | Match |
|---|---|---|---|
| Pull-only | Agents always initiate | Confirmed — server has no outbound capability | YES |
| 5-minute check-in | Configurable interval | Default 300s, configurable via config.json | YES |
| Command types | scan_updates, collect_specs, install_updates, rollback_update, update_agent | All present in models/command.go plus enable/disable_heartbeat, reboot, dry_run_update, confirm_dependencies | YES+ |
| Acknowledgment | pending_acks.json | acknowledgment.Tracker with persistence and retry | YES |
7C. Security Stack
| Component | Spec | Actual | Match |
|---|---|---|---|
| Ed25519 signing | Binary + command signing | Both implemented; v3 format exceeds spec | YES+ |
| Nonce validation | < 5 min lifetime, anti-replay | 10-minute default (VD-006), otherwise matches | CLOSE |
| TOFU key caching | Fetch once at registration | Implemented with TTL refresh | YES+ |
7D. Agent Paths
| Platform | Spec Path | Actual | Match |
|---|---|---|---|
| Linux config | /etc/redflag/config.json | /etc/redflag/agent/config.json (constants.GetAgentConfigPath) | CLOSE (subdir added) |
| Linux state | /var/lib/redflag/ | /var/lib/redflag/agent/ | CLOSE (subdir added) |
| Linux binary | /usr/local/bin/redflag-agent | /usr/local/bin/redflag-agent | YES |
| Windows config | C:\ProgramData\RedFlag\config.json | C:\ProgramData\RedFlag\agent\config.json (fixed in Installer Fix 1) | CLOSE (subdir added) |
The agent subdirectory was added to support future multi-component deployments (agent + server on same machine). This is a reasonable structural enhancement.
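The path table can be read as a small lookup plus a fallback rule. The sketch below is an illustration only: the paths come from the table, while `agentConfigPath`, `specConfigPath`, and `resolveConfigPath` are hypothetical names, and whether the agent actually falls back to the spec-era location is an assumption. Any non-Windows value is treated as Linux here because macOS is not supported.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
)

// agentConfigPath returns the current config location listed in the table.
func agentConfigPath(goos string) string {
	if goos == "windows" {
		return `C:\ProgramData\RedFlag\agent\config.json`
	}
	return "/etc/redflag/agent/config.json"
}

// specConfigPath returns the location named in the original spec,
// one directory level above the current one.
func specConfigPath(goos string) string {
	if goos == "windows" {
		return `C:\ProgramData\RedFlag\config.json`
	}
	return "/etc/redflag/config.json"
}

// resolveConfigPath prefers the current path and flags a migration when
// only the spec-era path exists. exists is injected so the logic can be
// exercised without touching the filesystem.
func resolveConfigPath(goos string, exists func(string) bool) (path string, needsMigration bool) {
	current, old := agentConfigPath(goos), specConfigPath(goos)
	if !exists(current) && exists(old) {
		return old, true
	}
	return current, false
}

func main() {
	onDisk := func(p string) bool {
		_, err := os.Stat(p)
		return err == nil
	}
	path, migrate := resolveConfigPath(runtime.GOOS, onDisk)
	fmt.Println(path, migrate)
}
```

Keeping the paths behind one function is the same idea the report credits to constants/paths.go: callers never assemble a path by hand, so a layout change touches a single place.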
7E. Migration System
| Component | Spec | Actual | Match |
|---|---|---|---|
| MigrationExecutor | Present | Implemented in aggregator-agent/internal/migration/ | YES |
| Old path migration | /etc/aggregator/ -> /etc/redflag/ | Detection and backup implemented in installer templates and migration executor | YES |
Section 8: The Honest Roadmap
HIGH VALUE, LOW EFFORT (Quick Wins)
- JWT secret minimum strength (30 min): Add `len(secret) < 32` check in config loading. Addresses Code Review finding.
- TLS bypass flag removal (1 hour): Remove `--insecure-tls` flag from agent. Forces TLS in production.
- Operational settings in UI (1 day): Add `operational` category to SecuritySettings.tsx component. Makes timeout tuning accessible from dashboard.
- Token delete button (1-2 days): P3-004. DELETE endpoint + confirmation dialog. Currently requires DB manual cleanup.
- `/api/v1/info` ldflags injection (1 hour): Ensure Dockerfile passes `-ldflags` with actual version strings. Currently defaults to "dev".
HIGH VALUE, HIGH EFFORT (Strategic Investments)
- Maintenance Windows (2-3 weeks) — The single most impactful unbuilt feature. Enables scheduled patching during safe hours. Without this, every update requires manual approval at execution time.
- Webhook notifications (1-2 weeks) — Slack/Teams alerts on critical update availability, failed installations, agent offline. Low integration overhead, high operational value.
- Staggered rollout (1-2 weeks) — Deploy updates to 5% canary, monitor, then 25%, then 100%. Essential for fleets > 10 agents.
- macOS agent (2-3 days) — launchd plist template + Homebrew scanner. Completes the "cross-platform" promise for homelabbers with Macs.
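Staggered rollout is not built, so the following is only a sketch of how the 5% / 25% / 100% progression described above might partition a fleet into waves. `rolloutWaves` and its cumulative-percentage parameter are hypothetical; health monitoring between waves is out of scope here.

```go
package main

import "fmt"

// rolloutWaves splits agents into successive waves. cumulativePct lists the
// share of the fleet that should have been reached after each wave, for
// example 5, 25, 100. Wave sizes round up, and stages that would add no
// new agents are skipped, so small fleets produce fewer waves.
func rolloutWaves(agents []string, cumulativePct []int) [][]string {
	var waves [][]string
	done := 0
	for _, pct := range cumulativePct {
		target := (len(agents)*pct + 99) / 100 // integer ceil(n * pct / 100)
		if target > len(agents) {
			target = len(agents)
		}
		if target <= done {
			continue
		}
		waves = append(waves, agents[done:target])
		done = target
	}
	return waves
}

func main() {
	fleet := make([]string, 20)
	for i := range fleet {
		fleet[i] = fmt.Sprintf("agent-%02d", i+1)
	}
	for i, wave := range rolloutWaves(fleet, []int{5, 25, 100}) {
		fmt.Printf("wave %d: %d agents\n", i+1, len(wave))
	}
}
```

Integer arithmetic is used for the rounding so that wave sizes do not depend on floating-point behavior. With fewer than about ten agents the canary stage and the second stage collapse into one, which is consistent with the note that staggering matters mainly for larger fleets.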
LOW VALUE (Defer or Drop)
- AI features — Drop entirely for foreseeable future. Adds operational complexity (GPU/API costs) without proportional benefit for homelab use case. Manual approval is more appropriate.
- aggregator-cli — Defer. The web dashboard covers all use cases. CLI would be nice-to-have for scripting but is not essential.
- AUR/Snap/Flatpak scanners — Defer. Very small user base for each. APT and DNF cover 95%+ of Linux homelabbers.
- LDAP/SSO — Defer until multi-user is needed. Single-admin is correct for homelab.
- Compliance reporting — Drop. SOX/HIPAA requirements don't apply to homelabs.
- Kubernetes deployment — Defer. Docker Compose is the right deployment model for the target audience.
Section 9: Summary Table
| Feature Area | Planned | Built | Status | Gap Rating |
|---|---|---|---|---|
| Core Architecture (pull model, agents, server) | Full | Full | Working | 0 (complete) |
| Ed25519 Signing (commands, binaries) | Full | Full + enhancements | Working | 0 (exceeds spec) |
| Authentication (tokens, JWT, refresh) | Full | Full + transactions | Working | 0 (exceeds spec) |
| Machine ID Binding | Full | Full + canonical hash | Working | 0 (exceeds spec) |
| Replay Protection (nonces) | Full | Full (10min vs 5min) | Working | 1 (timeout deviation) |
| Package Scanning (6 of 9 scanners) | 9 scanners | 6 scanners | Working | 3 (AUR, Snap, Flatpak, Homebrew missing) |
| Agent Self-Upgrade | Full | Full 7-step pipeline | Working | 0 (complete) |
| Installer (Linux + Windows) | Full | Full + arch + checksum | Working | 1 (macOS missing) |
| Web Dashboard | Full | Core features | Working | 3 (missing charts, health, metrics) |
| Database + Migrations | Full | Full + hardened | Working | 0 (exceeds spec) |
| Docker Deployment | Full | Full | Working | 0 (complete) |
| Testing | Minimal | 170 tests | Working | 2 (no CI/CD, no integration tests against real DB) |
| Maintenance Windows | Full | None | Not built | 10 (completely absent) |
| AI Features | Full | None | Not built | 10 (deliberately deferred) |
| Scheduling & Automation | Full | None | Not built | 8 (no maintenance windows, no staggered rollout) |
| LDAP/SSO | Planned | None | Not built | 5 (not needed for homelab) |
| Structured Logging | Planned (P3-006) | ETHOS alternative | Working differently | 3 (functional but not JSON/correlation IDs) |
| Compliance / Reporting | Planned | None | Not built | 2 (not applicable to homelab) |
| CLI Tool | Planned | None | Not built | 2 (dashboard covers use cases) |
| macOS Support | Planned | None | Not built | 4 (matters for homelabbers with Macs) |
Overall: Core infrastructure is 9/10. Feature breadth is 5/10. Production readiness for homelab is 7/10.
The gap is almost entirely in features that were always labeled "future" or "Phase 2" in the original docs. The core architecture — the hard engineering work — is built, tested, and hardened beyond the original specification.