Complete comparison of Fimeg's original vision against current codebase state after culurien branch work. - 9 sections covering architecture, features, backlog - 10 deviations documented (VD-001 through VD-010) - 27 backlog items tracked with current status - Honest roadmap with prioritized next steps - Executive summary for quick reference Core: 9/10. Features: 5/10. Homelab ready: 7/10. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
674 lines
41 KiB
Markdown
674 lines
41 KiB
Markdown
# Vision vs Reality: RedFlag Deviation Report
|
|
|
|
**Date:** 2026-03-29
|
|
**Branch:** culurien (post-integration verification)
|
|
**Baseline:** v0.1.27 (last Fimeg commit before culurien work)
|
|
**Author:** Claude (automated analysis based on complete codebase and historical documentation)
|
|
|
|
---
|
|
|
|
## Section 1: Executive Summary
|
|
|
|
RedFlag began as "Aggregator" — Fimeg's vision for a self-hosted ConnectWise alternative that would give homelabbers centralized update management with "single pane of glass" visibility across Windows, Linux, and macOS. The Starting Prompt described an ambitious platform with AI-assisted scheduling, maintenance windows, natural language queries, and cross-platform agents covering APT, DNF, AUR, Snap, Flatpak, Winget, Windows Update, Docker, and Homebrew.
|
|
|
|
What exists today is a narrower but more deeply engineered system than the original breadth implied. The core architecture — pull-based agents, Ed25519 signed commands, machine ID binding, three-tier token authentication — is not just present but has been hardened well beyond the original specification. The culurien branch added transaction safety, replay attack protection, key rotation, configurable timeouts, path traversal defense, semver-aware version comparison, and grew the test suite from approximately 3 test files to 170 passing tests across 18 packages.
|
|
|
|
The honest gap: approximately 40% of the originally envisioned feature surface was never built. macOS support, AI features, maintenance windows, scheduled rollouts, structured JSON logging, LDAP/SSO, compliance reporting, the CLI tool — none exist. The features that do exist (package scanning, command execution, agent self-upgrade, installer infrastructure) work correctly and are production-quality for the homelab use case.
|
|
|
|
Is it production-ready for Fimeg's homelab? Yes, with caveats. The system can install agents on Linux and Windows, scan for package updates, approve and execute updates from the dashboard, and self-upgrade agents — all with cryptographic verification. The caveats are: the setup flow has known friction (P0-005), the `/api/v1/info` endpoint returns build-time values that default to "dev" without proper ldflags injection, and several P0 backlog items from the original backlog have been fixed but not all have been verified end-to-end in a live deployment.
|
|
|
|
The "scare ConnectWise" ambition requires honest framing. RedFlag has three architectural advantages ConnectWise cannot replicate: hardware-bound machine IDs, mandatory self-hosting, and code transparency. But ConnectWise has remote control, 100+ integrations, SOC2 certification, and a decade of enterprise polish. RedFlag is competitive for the self-hosted update management niche — the subset of ConnectWise functionality that homelabbers and small IT teams actually use. It is not, and should not try to become, a full RMM platform. The strategic path is: own the update management vertical completely, then expand.
|
|
|
|
---
|
|
|
|
## Section 2: What Was Built As Planned
|
|
|
|
### 2.1 Pull-Based Agent Architecture
|
|
|
|
**Planned (Starting Prompt):** Agent polls server every 5 minutes via `GET /agents/{id}/commands`. Server never initiates connections to agents.
|
|
|
|
**Built:** Exactly as planned. Agent polls via `GET /agents/:id/commands` with configurable interval (default 300 seconds). Proportional jitter added (B-2 fix: `maxJitter = pollingInterval/2`). Server has no outbound connection capability.
|
|
|
|
**Location:** `aggregator-agent/cmd/agent/main.go:960-1070` (poll loop), `aggregator-server/internal/api/handlers/agents.go` (GetCommands handler)
|
|
|
|
**Status:** Working. Enhanced beyond spec with jitter and exponential backoff on failures.
|
|
|
|
---
|
|
|
|
### 2.2 Ed25519 Command Signing
|
|
|
|
**Planned (Security.md):** All commands in DB must be Ed25519-signed before being sent to agents. `signAndCreateCommand()` implemented in handlers.
|
|
|
|
**Built:** Exactly as planned, then enhanced. v3 signed message format includes `agent_id:cmd_id:type:sha256(params):timestamp`. Key ID and SignedAt tracked per command (migration 025). 29+ call sites use `signAndCreateCommand()`.
|
|
|
|
**Location:** `aggregator-server/internal/services/signing.go`, `aggregator-server/internal/api/handlers/agents.go:49-77`
|
|
|
|
**Status:** Working. Exceeds original spec.
|
|
|
|
---
|
|
|
|
### 2.3 Machine ID Binding
|
|
|
|
**Planned (Security.md section 3.1):** `MachineBindingMiddleware` validates `X-Machine-ID` header against `agents.machine_id`. Mismatch = 403.
|
|
|
|
**Built:** Implemented. Machine ID is SHA256 hash of hardware identifiers (D-1 fix made this canonical across all platforms). Middleware validates on authenticated agent routes. Admin rebind endpoint added for recovery.
|
|
|
|
**Location:** `aggregator-server/internal/api/middleware/machine_binding.go`, `aggregator-agent/internal/system/machine_id.go`
|
|
|
|
**Status:** Working. Enhanced with canonical hash format and rebind capability.
|
|
|
|
---
|
|
|
|
### 2.4 Three-Tier Token Authentication
|
|
|
|
**Planned (Security.md section 2):** Registration tokens (one-time/multi-seat) -> JWT access tokens (24h) -> Refresh tokens (90-day sliding window, stored as SHA-256 hash).
|
|
|
|
**Built:** Exactly as planned. Registration tokens consumed at `/agents/register` with seat limits. JWT issued with `issuer=redflag-agent` (A-3 fix). Refresh tokens stored hashed, renewed via `/agents/renew`. Token renewal wrapped in database transaction (B-2 fix).
|
|
|
|
**Location:** `aggregator-server/internal/api/handlers/auth.go`, `aggregator-server/internal/api/middleware/auth.go`, `aggregator-server/internal/database/queries/refresh_tokens.go`
|
|
|
|
**Status:** Working. Transaction safety added beyond original spec.
|
|
|
|
---
|
|
|
|
### 2.5 Replay Attack Protection (Nonce System)
|
|
|
|
**Planned (Security.md section 3.3):** Unique, time-limited, Ed25519-signed nonce for every sensitive command. Agent validates signature and timestamp.
|
|
|
|
**Built:** Implemented via `UpdateNonceService` with `Generate()` and `Validate()`. Nonce format: `uuid:unix_timestamp`, signed with Ed25519. Agent validates freshness within configurable window.
|
|
|
|
**Location:** `aggregator-server/internal/services/nonce_service.go`, `aggregator-agent/cmd/agent/subsystem_handlers.go:1074` (validateNonce)
|
|
|
|
**Status:** Working. Timeout is 10 minutes (not 5 as specified in Overview.md — see VD-006).
|
|
|
|
---
|
|
|
|
### 2.6 Agent Self-Registration Flow
|
|
|
|
**Planned (Starting Prompt):** Agent POSTs to `/agents/register` with hostname, OS info, version. Server returns agent_id, token, config.
|
|
|
|
**Built:** Exactly as planned plus enhancements. Registration includes machine_id (SHA256 hash) and public_key_fingerprint. Registration wrapped in database transaction (B-2 fix). Agent stores agent_id, token, refresh_token, check_in_interval to config file.
|
|
|
|
**Location:** `aggregator-agent/cmd/agent/main.go:371-483` (registerAgent), `aggregator-server/internal/api/handlers/agents.go` (RegisterAgent)
|
|
|
|
**Status:** Working.
|
|
|
|
---
|
|
|
|
### 2.7 Package Scanner Architecture
|
|
|
|
**Planned (Starting Prompt):** APT, DNF/YUM, AUR, Winget, Windows Update, Docker, Snap, Flatpak, Homebrew.
|
|
|
|
**Built:** APT, DNF, Winget, Windows Update, Docker, Storage metrics. Each scanner has circuit breaker protection and configurable timeouts.
|
|
|
|
**Not built:** AUR, Snap, Flatpak, Homebrew. See Section 5A.
|
|
|
|
**Location:** `aggregator-agent/internal/scanner/` (apt.go, dnf.go, winget.go, docker.go), `aggregator-agent/pkg/windowsupdate/`
|
|
|
|
**Status:** Working for implemented platforms. 6 of 9 planned scanners built.
|
|
|
|
---
|
|
|
|
### 2.8 Circuit Breaker Pattern
|
|
|
|
**Planned (ETHOS.md principle 3, Overview.md):** Circuit Breaker on fragile scanners (Windows Update, DNF).
|
|
|
|
**Built:** Full implementation with Closed/Open/HalfOpen states, configurable failure threshold, failure window, open duration, and half-open attempts. Applied to all scanners via subsystem config.
|
|
|
|
**Location:** `aggregator-agent/internal/circuitbreaker/circuit_breaker.go`, config per subsystem in `config.json`
|
|
|
|
**Status:** Working.
|
|
|
|
---
|
|
|
|
### 2.9 Command Acknowledgment System
|
|
|
|
**Planned (Overview.md):** `pending_acks.json` for at-least-once delivery guarantee.
|
|
|
|
**Built:** `acknowledgment.Tracker` package with persistent pending acks, retry with `IncrementRetry()`, and state file persistence.
|
|
|
|
**Location:** `aggregator-agent/internal/acknowledgment/`
|
|
|
|
**Status:** Working.
|
|
|
|
---
|
|
|
|
### 2.10 Agent Service Management
|
|
|
|
**Planned (Overview.md):** systemd on Linux, Windows Services (SCM) on Windows.
|
|
|
|
**Built:** Both. Linux systemd unit generated inline in installer template with security hardening (ProtectSystem=strict, ProtectHome=true, PrivateTmp=true). Windows SCM registration via `InstallService()` with auto-start and recovery actions.
|
|
|
|
**Location:** `aggregator-agent/internal/service/windows.go:438-516`, installer templates in `aggregator-server/internal/services/templates/install/scripts/`
|
|
|
|
**Status:** Working.
|
|
|
|
---
|
|
|
|
### 2.11 Web Dashboard
|
|
|
|
**Planned (Starting Prompt):** React 18 + TypeScript, TailwindCSS, TanStack Query, Recharts, TanStack Table, WebSocket.
|
|
|
|
**Built:** React + TypeScript + TailwindCSS + TanStack Query. Dashboard with agent list, update management, command history, security settings. TypeScript strict compliance (0 errors after E-1b). No Recharts charts, no TanStack Table, limited WebSocket.
|
|
|
|
**Location:** `aggregator-web/src/`
|
|
|
|
**Status:** Working. Feature-complete for core use case but missing visualization features (charts, trend analysis).
|
|
|
|
---
|
|
|
|
### 2.12 PostgreSQL with Migration Runner
|
|
|
|
**Planned (Starting Prompt, ETHOS.md):** PostgreSQL database with idempotent migrations.
|
|
|
|
**Built:** PostgreSQL 16-alpine, custom migration runner in `database/db.go` with `schema_migrations` tracking table. 30 migrations (001-030) all using IF NOT EXISTS / ON CONFLICT DO NOTHING. Migration runner aborts fatally on failure (B-1 fix).
|
|
|
|
**Location:** `aggregator-server/internal/database/db.go`, `aggregator-server/internal/database/migrations/`
|
|
|
|
**Status:** Working.
|
|
|
|
---
|
|
|
|
### 2.13 Docker Compose Deployment
|
|
|
|
**Planned (Starting Prompt):** `docker-compose.yml` for quick start.
|
|
|
|
**Built:** Three-service compose: postgres (16-alpine), server (multi-stage Go build), web (nginx). Health checks, volume persistence, env_file support.
|
|
|
|
**Location:** `docker-compose.yml`, `config/.env.example`
|
|
|
|
**Status:** Working.
|
|
|
|
---
|
|
|
|
### 2.14 Installer Scripts
|
|
|
|
**Planned (various docs):** Install script served from `/install/:platform` endpoint.
|
|
|
|
**Built:** Linux bash and Windows PowerShell templates served dynamically. Linux installer creates system user, directories, sudoers (per-package-manager), systemd service with security hardening, registers agent, starts service. Windows installer creates directories, downloads binary, writes config, registers service. Both have arch auto-detection and checksum verification (Installer Fix 2).
|
|
|
|
**Location:** `aggregator-server/internal/services/templates/install/scripts/linux.sh.tmpl`, `windows.ps1.tmpl`
|
|
|
|
**Status:** Working. Idempotent.
|
|
|
|
---
|
|
|
|
### 2.15 Agent Self-Upgrade Pipeline
|
|
|
|
**Planned (Overview.md, P2-003):** 7-step pipeline: download, checksum verify, Ed25519 binary signature verify, backup, atomic install, service restart, watchdog confirmation.
|
|
|
|
**Built:** Exactly as specified. All 7 steps implemented with deferred rollback on any failure including watchdog timeout (5 minutes). Download has 5-minute timeout and 500MB size limit (U-8 fix).
|
|
|
|
**Location:** `aggregator-agent/cmd/agent/subsystem_handlers.go:575-762`
|
|
|
|
**Status:** Working. The v0.1.27 INVENTORY doc noted this was "FULLY IMPLEMENTED" despite the backlog claiming it was placeholder.
|
|
|
|
---
|
|
|
|
## Section 3: What Was Built Better Than Planned
|
|
|
|
### 3.1 Ed25519 Key Rotation with TTL
|
|
|
|
**Original:** Security.md noted key rotation was "TODO" with no implementation. SETUP-SECURITY.md described a POST `/security/keys/rotate` API with 30-day grace period.
|
|
|
|
**Built:** TTL-based key caching with automatic refresh. Server registers primary key in `signing_keys` table (migration 025). Agent caches server public key at registration (TOFU), with configurable TTL refresh. Key ID tracked per command for audit trail.
|
|
|
|
**Why better:** Automatic TTL refresh is more reliable than manual API-triggered rotation. The manual rotation API was never needed because the system handles staleness automatically.
|
|
|
|
---
|
|
|
|
### 3.2 Command Signing v3 Format
|
|
|
|
**Original:** Commands signed with `cmd_id:type:sha256(params)`.
|
|
|
|
**Built:** v3 format includes `agent_id:cmd_id:type:sha256(params):timestamp`. Agent ID binding prevents command replay to different agents. Timestamp enables time-based expiry independent of DB state.
|
|
|
|
**Why better:** Prevents a class of relay attacks where a compromised agent could forward signed commands to other agents.
|
|
|
|
---
|
|
|
|
### 3.3 Machine ID Canonical SHA256 Hash
|
|
|
|
**Original:** Machine ID was inconsistent — registration fallback used `"unknown-" + hostname` (unhashed) while runtime used SHA256.
|
|
|
|
**Built (D-1):** All paths now use `GetMachineID()` which always returns a 64-character hex SHA256 hash. Registration aborts with `log.Fatalf` if machine ID cannot be obtained — no unhashed fallback.
|
|
|
|
**Why better:** Eliminates format mismatch between registration and runtime that would cause 403 errors after restart.
|
|
|
|
---
|
|
|
|
### 3.4 Transaction Safety (B-Series)
|
|
|
|
**Original:** Not specified. Registration, command delivery, and token renewal were separate DB operations without transaction wrapping.
|
|
|
|
**Built (B-2):** Registration wrapped in `tx.Beginx()` with `defer tx.Rollback()`. Command delivery uses `SELECT FOR UPDATE SKIP LOCKED` (atomic claim). Token renewal wrapped in transaction. JWT generated after commit, not before.
|
|
|
|
**Why better:** Prevents partial registration state, command double-delivery race conditions, and orphaned tokens.
|
|
|
|
---
|
|
|
|
### 3.5 Configurable Operational Timeouts
|
|
|
|
**Original:** 6 hardcoded timeout values in main.go and timeout.go.
|
|
|
|
**Built (E-1c):** All 6 values stored in `security_settings` table under `operational` category (migration 030). Read from DB at startup with hardcoded fallback. Zero-value protection prevents zero-duration tickers.
|
|
|
|
**Why better:** Administrators can tune timeouts via API without code changes or redeployment.
|
|
|
|
---
|
|
|
|
### 3.6 Binary Path Traversal Protection
|
|
|
|
**Original:** Not specified — `c.File(pkg.BinaryPath)` served DB-sourced paths without validation.
|
|
|
|
**Built (E-1c + Integration Verification):** Both `DownloadUpdatePackage` and `DownloadAgent` resolve paths via `filepath.Abs()` and validate against `REDFLAG_BINARY_STORAGE_PATH` using prefix check. Traversal attempts logged and return 403.
|
|
|
|
**Why better:** Defense in depth against DB compromise scenarios.
|
|
|
|
---
|
|
|
|
### 3.7 TypeScript Strict Compliance
|
|
|
|
**Original:** 217 TypeScript errors in `aggregator-web/src/`.
|
|
|
|
**Built (E-1b):** All 217 errors fixed. Zero `@ts-ignore` or `as any` suppressions added. Type interfaces verified against actual server JSON responses. TanStack Query v5 `isLoading` -> `isPending` migration for mutations.
|
|
|
|
**Why better:** Catches type mismatches at compile time instead of runtime.
|
|
|
|
---
|
|
|
|
### 3.8 Semver-Aware Version Comparison
|
|
|
|
**Original:** `versions.go:72` used lexicographic comparison (`agentVersion < current.MinAgentVersion`), making `"0.1.9" > "0.1.22"`.
|
|
|
|
**Built (Upgrade Fix):** `CompareVersions()` with octet-by-octet numeric parsing. Handles `"dev"` as always-older, `"v"` prefix stripping, mismatched octet counts.
|
|
|
|
**Why better:** Version gates now work correctly for all version numbers.
|
|
|
|
---
|
|
|
|
### 3.9 Test Suite Growth
|
|
|
|
**Original (Code Review):** "Only 3 test files across the entire codebase" — `circuitbreaker_test.go`, `test_disk.go`, `test_disk_detection.go`, plus scheduler tests.
|
|
|
|
**Built:** 170 tests across 18 packages covering: Ed25519 signing and replay protection, JWT issuer validation, registration transactions, command delivery races, machine ID format, ETHOS compliance, path traversal, version comparison, checksum computation, timeout configuration, and more.
|
|
|
|
**Why better:** Regression detection for all fix series (A through Upgrade).
|
|
|
|
---
|
|
|
|
### 3.10 ETHOS Logging Compliance
|
|
|
|
**Original:** Mixed `fmt.Printf`, emoji in logs, inconsistent format.
|
|
|
|
**Built (D-2 + Integration Verification):** All production log statements use `log.Printf("[TAG] [system] [component] message key=value")`. Emoji removed from daemon log.Printf calls. `fmt.Printf` DEBUG statements removed from handlers. Terminal/CLI output emoji explicitly exempted (DEV-039).
|
|
|
|
---
|
|
|
|
### 3.11 Installer Architecture Detection
|
|
|
|
**Original:** Architecture hardcoded to `amd64` in `generateInstallScript`.
|
|
|
|
**Built (Installer Fix 2):** Runtime detection via `uname -m` (Linux) and `$env:PROCESSOR_ARCHITECTURE` (Windows). Server accepts optional `?arch=` query param. Download endpoint already supported `linux-arm64` and `windows-arm64`.
|
|
|
|
**Why better:** ARM64 homelabbers (Raspberry Pi, Apple Silicon VMs) can now install without manual binary download.
|
|
|
|
---
|
|
|
|
### 3.12 Binary Checksum Verification
|
|
|
|
**Original:** No verification of downloaded binary integrity.
|
|
|
|
**Built (Installer Fix 2):** Server computes SHA-256 and serves `X-Content-SHA256` header. Linux installer verifies with `sha256sum`. Windows installer verifies with `Get-FileHash`. Missing header = warn but continue (backward compatible).
|
|
|
|
---
|
|
|
|
### 3.13 Machine ID Rebind Endpoint
|
|
|
|
**Original:** Not specified. If machine ID changed (hardware replacement, VM migration), agent was permanently locked out.
|
|
|
|
**Built (D-1):** Admin endpoint `POST /admin/agents/:id/rebind-machine-id` allows re-binding an agent to new hardware. Requires admin authentication.
|
|
|
|
---
|
|
|
|
## Section 4: What Was Built Differently (Deviations)
|
|
|
|
### VD-001: Logging Format
|
|
|
|
**Original (P3-006):** JSON structured logs with correlation IDs via logrus or similar library. `StructuredLogger` implementation, `CorrelationIDMiddleware`, buffered async writes, P95/P99 latency tracking, `system_logs` database table.
|
|
|
|
**Actual:** ETHOS `[TAG] [system] [component]` plain text format via `log.Printf`. No correlation IDs, no structured JSON, no centralized aggregation, no log database table.
|
|
|
|
**Rationale:** ETHOS principle #5 (no marketing fluff) and principle #1 (errors are history) were prioritized over the P3-006 spec. Plain text logging with consistent tags is grep-friendly and sufficient for the homelab use case. JSON structured logging adds complexity without proportional benefit at the current scale.
|
|
|
|
**Verdict:** Acceptable for homelab. Would need to be revisited for fleet-scale deployments (100+ agents).
|
|
|
|
---
|
|
|
|
### VD-002: Authentication Architecture
|
|
|
|
**Original (P0-006 + Starting Prompt):** Multi-user system with `users` table, admin/user/readonly roles, email fields, `EnsureAdminUser()`. The Starting Prompt shows Settings page with "users" section.
|
|
|
|
**Actual:** Single-admin via `.env` credentials. The `users` table exists in migrations (for compatibility) but is not used for authentication. Web auth validates against `REDFLAG_ADMIN_USER`/`REDFLAG_ADMIN_PASSWORD` from environment.
|
|
|
|
**Rationale:** P0-006 recommended "Option 1: Complete Removal" — recognizing that multi-user scaffolding increased attack surface without benefit for a homelab tool. The current implementation follows this recommendation.
|
|
|
|
**Verdict:** Correct for homelab. Multi-user would be needed for MSP/enterprise use case.
|
|
|
|
---
|
|
|
|
### VD-003: Build Orchestrator
|
|
|
|
**Original (architecture docs):** Dynamic agent compilation per request. Build Orchestrator would cross-compile agent binaries on demand.
|
|
|
|
**Actual:** Pre-built binaries placed at container build time (via Dockerfile multi-stage build). `BuildAndSignAgent` signs existing binaries but never compiles. `AgentBuilder` generates config JSON only. `build_orchestrator.go` services layer marked `// Deprecated`.
|
|
|
|
**Rationale:** Cross-compilation on every request is impractical for a homelab server. The Dockerfile multi-stage build compiles once; the server serves pre-built binaries. This is how production package distribution works (e.g., GitHub Releases).
|
|
|
|
**Verdict:** Correct pragmatic simplification. Dynamic compilation would add complexity without benefit.
|
|
|
|
---
|
|
|
|
### VD-004: Upgrade Trigger Path
|
|
|
|
**Original:** `POST /build/upgrade/:agentID` was meant to orchestrate full upgrades.
|
|
|
|
**Actual:** The real upgrade path is `POST /agents/{id}/update` (in `agent_updates.go`), which validates the agent, generates nonces, creates signed `update_agent` commands, and tracks delivery. The `/build/upgrade` endpoint generates config JSON with manual instructions — it's an admin utility, not the upgrade orchestrator.
|
|
|
|
**Rationale:** The `/agents/{id}/update` path already existed and was more complete (nonce generation, command signing, delivery tracking). Wiring a parallel path would have created confusion.
|
|
|
|
**Verdict:** Acceptable. The working path is better designed.
|
|
|
|
---
|
|
|
|
### VD-005: Security Settings UI
|
|
|
|
**Original (SECURITY-SETTINGS.md):** Full security settings configurable from dashboard including machine binding mode, version enforcement, nonce timeout, signature algorithm, log level, alert thresholds.
|
|
|
|
**Actual:** Security settings backend works (API CRUD for `security_settings` table). Dashboard displays settings for `command_signing`, `update_signing`, `nonce_validation`, `machine_binding`, `signature_verification`. The `operational` category (E-1c timeouts) is accessible via API but not visible in the UI. No validation rules enforcement for operational settings.
|
|
|
|
**Verdict:** Partially implemented. Backend complete, frontend shows security-category settings. Operational settings need UI exposure.
|
|
|
|
---
|
|
|
|
### VD-006: Nonce Timeout
|
|
|
|
**Original (Overview.md, Security.md):** Nonce lifetime "< 5 minutes".
|
|
|
|
**Actual (SETUP-SECURITY.md, code):** `REDFLAG_SECURITY_NONCE_TIMEOUT=600` (10 minutes). The code uses a 10-minute default.
|
|
|
|
**Rationale:** The original docs contradict each other — Overview.md says "< 5 min" while SETUP-SECURITY.md says 600 seconds. The 10-minute value appears to be a deliberate choice to accommodate slow network conditions (agents polling every 5 minutes may not receive the command within a 5-minute nonce window).
|
|
|
|
**Verdict:** The 10-minute value is more practical. The 5-minute spec was likely aspirational. Document this as intentional.
|
|
|
|
---
|
|
|
|
### VD-007: Key Rotation API
|
|
|
|
**Original (SETUP-SECURITY.md):** `POST /api/v1/security/keys/rotate` with `grace_period_days` (default 30). During grace period both old and new keys valid. Keys stored in `/app/keys/` directory.
|
|
|
|
**Actual:** Key rotation is TTL-based via the signing key registry in `signing_keys` table. No explicit `/keys/rotate` API endpoint. No dual-key grace period — agents refresh their cached public key via TTL (24h default). Key stored in environment variable, not in `/app/keys/` directory.
|
|
|
|
**Rationale:** Different implementation approach that achieves the same goal (agents can handle key changes) without the complexity of dual-key acceptance windows.
|
|
|
|
**Verdict:** Functional but different. A manual key rotation API would be a nice-to-have for planned rotations.
|
|
|
|
---
|
|
|
|
### VD-008: Version Format
|
|
|
|
**Original (SECURITY-SETTINGS.md):** "Semantic version string (X.Y.Z), integers only, no v prefix."
|
|
|
|
**Actual:** Four-octet format `X.Y.Z.W` where W is the config version (e.g., `0.1.26.0`). The `v` prefix is tolerated and stripped during comparison.
|
|
|
|
**Rationale:** The fourth octet was added to embed config schema version alongside the agent version, avoiding a separate version field.
|
|
|
|
**Verdict:** Acceptable extension of spec. `CompareVersions()` handles both 3-octet and 4-octet formats.
|
|
|
|
---
|
|
|
|
### VD-009: Windows Service Key Rotation
|
|
|
|
**Original:** Not explicitly specified, but key rotation logic exists in `main.go` polling loop.
|
|
|
|
**Actual (DEV-030):** The Windows service polling loop in `windows.go` does not call `ShouldRefreshKey`. The comment at line 164-168 acknowledges this as a TODO. Agents running as Windows services rely on the 24h TTL key cache and will not proactively detect key rotation.
|
|
|
|
**Verdict:** Known gap. Low risk — the 24h TTL cache means Windows agents will naturally pick up new keys within a day.
|
|
|
|
---
|
|
|
|
### VD-010: Watchdog Version Comparison
|
|
|
|
**Original:** Not explicitly specified.
|
|
|
|
**Actual:** The upgrade watchdog in `subsystem_handlers.go:943` uses string equality (`agent.CurrentVersion == expectedVersion`) instead of `CompareVersions()`. A normalized version string mismatch (e.g., `"v0.1.4"` vs `"0.1.4"`) would trigger false rollback.
|
|
|
|
**Verdict:** Low risk — both sides use the same version string from the same source. Would need fixing if version normalization is ever introduced.
|
|
|
|
---
|
|
|
|
## Section 5: What Was Never Built
|
|
|
|
### 5A. Platform Support
|
|
|
|
| Feature | Original Priority | Effort Estimate | Notes |
|
|
|---------|------------------|-----------------|-------|
|
|
| macOS agent / launchd | Starting Prompt | 2-3 days | No launchd plist, no macOS-specific code |
|
|
| Homebrew scanner | Starting Prompt | 1-2 days | Would follow APT/DNF pattern |
|
|
| AUR scanner (Arch) | Starting Prompt | 1-2 days | Would follow APT/DNF pattern |
|
|
| Snap scanner | Starting Prompt | 1 day | Low demand |
|
|
| Flatpak scanner | Starting Prompt | 1 day | Low demand |
|
|
| aggregator-cli (Go CLI) | Starting Prompt | 3-5 days | Power-user tool, not essential for homelab |
|
|
|
|
### 5B. AI Features
|
|
|
|
| Feature | Original Priority | Effort Estimate | Notes |
|
|
|---------|------------------|-----------------|-------|
|
|
| AI Chat Sidebar (Ollama/OpenAI) | Starting Prompt | 2-3 weeks | No AI code exists anywhere |
|
|
| Natural language queries (`POST /ai/query`) | Starting Prompt | 1-2 weeks | Requires AI sidebar first |
|
|
| AI-assisted scheduling (`POST /ai/schedule`) | Starting Prompt | 1 week | Requires maintenance windows first |
|
|
| AI decision audit trail (`GET /ai/decisions`) | Starting Prompt | 3-5 days | Requires AI features first |
|
|
|
|
**Honest assessment:** AI features were aspirational in the Starting Prompt ("Future Phase"). They add significant complexity and operational overhead (Ollama requires GPU resources or external API costs). For a homelab tool, manual approval is more appropriate than AI-assisted scheduling. Recommend deferring indefinitely unless Fimeg has a specific use case.
|
|
|
|
### 5C. Scheduling & Automation
|
|
|
|
| Feature | Original Priority | Effort Estimate | Notes |
|
|
|---------|------------------|-----------------|-------|
|
|
| Maintenance Windows (RRULE recurrence) | Starting Prompt, P3 | 2-3 weeks | Full RRULE parser + calendar UI + auto-approve logic |
|
|
| Auto-approve by severity during windows | Starting Prompt | 1 week | Requires maintenance windows |
|
|
| Scheduled update execution | Starting Prompt | 1 week | Requires maintenance windows |
|
|
| Staggered rollout (5%/25%/100%) | P2-003, Strategic Roadmap | 1-2 weeks | Server-side group selection + phased command queuing |
|
|
| Auto-upgrade trigger (version-based) | Upgrade Audit | 1 week | Server detects old version on check-in, queues update_agent |
|
|
|
|
**Honest assessment:** Maintenance windows are the highest-value unbuilt feature for production use. Auto-approve by severity during defined windows would significantly reduce manual work.
|
|
|
|
### 5D. Observability
|
|
|
|
| Feature | Original Priority | Effort Estimate | Notes |
|
|
|---------|------------------|-----------------|-------|
|
|
| Structured JSON logging (P3-006) | P3 | 3-4 days | logrus + correlation IDs |
|
|
| Correlation ID propagation | P3-006 | 2-3 days | Middleware + header propagation |
|
|
| Update Metrics Dashboard (P3-003) | P3 | 2-3 days | Success/failure rates, trend charts |
|
|
| Server Health Dashboard (P3-005) | P3 | 2-3 days | CPU, memory, DB connections |
|
|
| Prometheus metrics endpoint | Strategic Roadmap | 2-3 days | /metrics endpoint with Go prometheus client |
|
|
| Real-time WebSocket updates | Starting Prompt | 1-2 weeks | Partial: security events WebSocket exists |
|
|
|
|
### 5E. Integration & Ecosystem
|
|
|
|
| Feature | Original Priority | Effort Estimate | Notes |
|
|
|---------|------------------|-----------------|-------|
|
|
| LDAP/Active Directory | Strategic Roadmap | 2-3 weeks | Auth integration |
|
|
| SAML/OIDC for SSO | Strategic Roadmap | 2-3 weeks | Requires multi-user first |
|
|
| Slack/Teams/PagerDuty webhooks | Strategic Roadmap | 1-2 weeks | Event notification hooks |
|
|
| Compliance reporting (SOX, HIPAA) | Strategic Roadmap | 4-6 weeks | Report generation framework |
|
|
| Kubernetes deployment | Strategic Roadmap | 1-2 weeks | Helm chart + StatefulSet |
|
|
| Ansible/Terraform integrations | Strategic Roadmap | 2-3 weeks | Module/provider development |
|
|
|
|
**Honest assessment:** Webhooks (Slack/Teams) are the highest-value integration for homelab use. LDAP/SSO only matters if Fimeg plans to support multi-user deployments.
|
|
|
|
### 5F. UI Features
|
|
|
|
| Feature | Original Priority | Effort Estimate | Notes |
|
|
|---------|------------------|-----------------|-------|
|
|
| Security Status Dashboard Indicators (P3-002) | P3 | 2-3 days | Color-coded security health scores |
|
|
| Token Management UI Enhancement (P3-004) | P3 | 1-2 days | Delete tokens, bulk operations |
|
|
| Server Health Dashboard (P3-005) | P3 | 2-3 days | System status monitoring |
|
|
| Operational settings in UI | E-1c carry-over | 1 day | Add 'operational' category to SecuritySettings.tsx |
|
|
| Update metrics and trend charts | P3-003 | 2-3 days | Recharts integration |
|
|
| Calendar view for maintenance windows | Starting Prompt | 1-2 weeks | Requires maintenance windows backend |
|
|
|
|
### 5G. Security Features
|
|
|
|
| Feature | Original Priority | Effort Estimate | Notes |
|
|
|---------|------------------|-----------------|-------|
|
|
| Multi-factor authentication | Strategic Roadmap | 1-2 weeks | TOTP integration |
|
|
| API key rotation via UI | SETUP-SECURITY.md | 2-3 days | Manual rotation endpoint |
|
|
| Key rotation with grace period | SETUP-SECURITY.md | 1 week | Dual-key acceptance window |
|
|
| TLS hardening (remove bypass flag) | Code Review | 1 hour | Remove `--insecure-tls` flag |
|
|
| JWT secret minimum strength | Code Review | 30 min | Validation in config loading |
|
|
|
|
---
|
|
|
|
## Section 6: Backlog Status Table
|
|
|
|
| ID | Title | Original Priority | Current Status | Where Fixed/Notes |
|
|
|----|-------|------------------|----------------|-------------------|
|
|
| P0-001 | Rate Limit First Request Bug | P0 | FIXED | v0.1.26 per v0.1.27 Inventory; rate limiter namespaced by type in A-3 fixes |
|
|
| P0-002 | Session Loop Bug | P0 | PARTIALLY FIXED | SetupCompletionChecker modified in E-1b (removed isSetupMode state); may need live verification |
|
|
| P0-003 | Agent No Retry Logic | P0 | FIXED | v0.1.27 per Inventory + culurien B-2 (exponential backoff with full jitter, proportional polling jitter) |
|
|
| P0-004 | Database Constraint Violation | P0 | FIXED | v0.1.27 per Inventory; timeout service now uses 'failed' result status (check constraint compatible) |
|
|
| P0-005 | Setup Flow Broken | P0 | NOT VERIFIED | Setup handler exists but end-to-end flow not tested in culurien branch. May still have issues |
|
|
| P0-006 | Single-Admin Architecture | P0 | ACCEPTED | Decision made: single-admin via .env. Users table exists for compatibility but not used for auth |
|
|
| P0-007 | Install Script Path Variables | P0 | FIXED | 2025-12-17 per backlog + verified in Installer Fix 1 (config path consistency) |
|
|
| P0-008 | Migration Runs on Fresh Install | P0 | FIXED | 2025-12-17 per backlog; early return in detection.go for empty agent_id |
|
|
| P0-009 | Storage Scanner Wrong Table | P0 | NOT DONE | Storage scanner still on legacy interface. Dedicated storage_metrics table exists but scanner reports to update_packages |
|
|
| P1-001 | Agent Install ID Parsing | P1 | PARTIALLY FIXED | extractOrGenerateAgentID() in install_template_service.go validates UUID format; but query param handling may still have edge cases |
|
|
| P1-002 | Agent Timeout Handling | P1 | PARTIALLY FIXED | E-1c made timeouts configurable from DB; per-scanner timeouts exist in config but generic 45s timeout may still apply in some paths |
|
|
| P2-001 | Binary URL Architecture Mismatch | P2 | FIXED | Installer Fix 2 added arch detection; templates override download URL with detected architecture |
|
|
| P2-002 | Migration Error Reporting | P2 | NOT DONE | Migration errors still only logged locally; no server-side visibility |
|
|
| P2-003 | Agent Auto-Update System | P2 | FIXED | Fully implemented (was incorrectly marked placeholder in backlog); verified in Upgrade Audit |
|
|
| P3-001 | Duplicate Command Prevention | P3 | FIXED | v0.1.27; unique index on (agent_id, command_type, status) WHERE status = 'pending' |
|
|
| P3-002 | Security Status Dashboard | P3 | PARTIALLY DONE | Security overview endpoints exist; no color-coded health scores or per-agent security badges |
|
|
| P3-003 | Update Metrics Dashboard | P3 | NOT DONE | No metrics dashboard, no trend charts |
|
|
| P3-004 | Token Management UI Enhancement | P3 | PARTIALLY DONE | Token list with copy-install-command exists; no delete, no bulk operations, no status filtering |
|
|
| P3-005 | Server Health Dashboard | P3 | NOT DONE | No health dashboard |
|
|
| P3-006 | Structured Logging System | P3 | NOT DONE (alternative) | ETHOS [TAG] format used instead of JSON structured logging. See VD-001 |
|
|
| P4-001 | Agent Retry Logic Resilience | P4 | FIXED | v0.1.27 per Inventory + culurien B-2 (exponential backoff, circuit breakers) |
|
|
| P4-002 | Scanner Timeout Optimization | P4 | PARTIALLY DONE | Configurable per-subsystem timeouts in config; E-1c made server-side timeouts configurable from DB |
|
|
| P4-003 | Agent File Management Migration | P4 | PARTIALLY DONE | MigrationExecutor exists with old-path detection; constants/paths.go standardized; validation/pathutils packages have compile errors (dead code) |
|
|
| P4-004 | Directory Path Standardization | P4 | FIXED | constants/paths.go provides canonical paths; windows.go fixed to use constants.GetAgentConfigPath() (Installer Fix 1); installer templates use standard paths |
|
|
| P4-005 | Testing Infrastructure Gaps | P4 | SIGNIFICANTLY IMPROVED | From ~3 test files to 170 tests across 18 packages; no CI/CD yet |
|
|
| P4-006 | Architecture Documentation Gaps | P4 | PARTIALLY DONE | 30+ docs in culurien docs/ folder; no formal architecture diagrams or ADRs |
|
|
| P5-001 | Security Audit Documentation | P5 | NOT DONE | No security audit checklist, IR procedures, or compliance mapping |
|
|
| P5-002 | Development Workflow Documentation | P5 | PARTIALLY DONE | .env.example created; no PR template, debugging guide, or release process |
|
|
|
|
**Summary:** 10 FIXED, 8 PARTIALLY DONE, 6 NOT DONE, 1 ACCEPTED (design decision), 1 NOT VERIFIED, 1 alternative approach.
|
|
|
|
---
|
|
|
|
## Section 7: Architecture Health Assessment
|
|
|
|
### 7A. Authentication Stack
|
|
|
|
| Component | Spec | Actual | Match |
|
|
|-----------|------|--------|-------|
|
|
| Registration tokens | One-time or multi-seat | Implemented with seat limits | YES |
|
|
| JWT 24h expiry | Short-lived JWT | Implemented with issuer-based validation (A-3) | YES |
|
|
| Refresh tokens 90-day | Sliding window, SHA-256 hash | Implemented, renewal in transaction (B-2) | YES |
|
|
| Machine ID binding | `X-Machine-ID` header, 403 on mismatch | Implemented with canonical SHA256 hash (D-1) | YES |
|
|
|
|
### 7B. Command Flow
|
|
|
|
| Component | Spec | Actual | Match |
|
|
|-----------|------|--------|-------|
|
|
| Pull-only | Agents always initiate | Confirmed — server has no outbound capability | YES |
|
|
| 5-minute check-in | Configurable interval | Default 300s, configurable via config.json | YES |
|
|
| Command types | scan_updates, collect_specs, install_updates, rollback_update, update_agent | All present in models/command.go plus enable/disable_heartbeat, reboot, dry_run_update, confirm_dependencies | YES+ |
|
|
| Acknowledgment | pending_acks.json | acknowledgment.Tracker with persistence and retry | YES |
|
|
|
|
### 7C. Security Stack
|
|
|
|
| Component | Spec | Actual | Match |
|
|
|-----------|------|--------|-------|
|
|
| Ed25519 signing | Binary + command signing | Both implemented; v3 format exceeds spec | YES+ |
|
|
| Nonce validation | < 5 min lifetime, anti-replay | 10-minute default (VD-006), otherwise matches | CLOSE |
|
|
| TOFU key caching | Fetch once at registration | Implemented with TTL refresh | YES+ |
|
|
|
|
### 7D. Agent Paths
|
|
|
|
| Platform | Spec Path | Actual | Match |
|
|
|----------|-----------|--------|-------|
|
|
| Linux config | `/etc/redflag/config.json` | `/etc/redflag/agent/config.json` (constants.GetAgentConfigPath) | CLOSE — subdir added |
|
|
| Linux state | `/var/lib/redflag/` | `/var/lib/redflag/agent/` | CLOSE — subdir added |
|
|
| Linux binary | `/usr/local/bin/redflag-agent` | `/usr/local/bin/redflag-agent` | YES |
|
|
| Windows config | `C:\ProgramData\RedFlag\config.json` | `C:\ProgramData\RedFlag\agent\config.json` (fixed in Installer Fix 1) | CLOSE — subdir added |
|
|
|
|
The `agent` subdirectory was added to support future multi-component deployments (agent + server on same machine). This is a reasonable structural enhancement.
|
|
|
|
### 7E. Migration System
|
|
|
|
| Component | Spec | Actual | Match |
|
|
|-----------|------|--------|-------|
|
|
| MigrationExecutor | Present | Implemented in `aggregator-agent/internal/migration/` | YES |
|
|
| Old path migration | `/etc/aggregator/` -> `/etc/redflag/` | Detection and backup implemented in installer templates and migration executor | YES |
|
|
|
|
---
|
|
|
|
## Section 8: The Honest Roadmap
|
|
|
|
### HIGH VALUE, LOW EFFORT (Quick Wins)
|
|
|
|
1. **JWT secret minimum strength** (30 min) — Add `len(secret) < 32` check in config loading. Addresses Code Review finding.
|
|
2. **TLS bypass flag removal** (1 hour) — Remove `--insecure-tls` flag from agent. Forces TLS in production.
|
|
3. **Operational settings in UI** (1 day) — Add `operational` category to SecuritySettings.tsx component. Makes timeout tuning accessible from dashboard.
|
|
4. **Token delete button** (1-2 days) — P3-004. DELETE endpoint + confirmation dialog. Currently requires DB manual cleanup.
|
|
5. **`/api/v1/info` ldflags injection** (1 hour) — Ensure Dockerfile passes `-ldflags` with actual version strings. Currently defaults to "dev".
|
|
|
|
### HIGH VALUE, HIGH EFFORT (Strategic Investments)
|
|
|
|
1. **Maintenance Windows** (2-3 weeks) — The single most impactful unbuilt feature. Enables scheduled patching during safe hours. Without this, every update requires manual approval at execution time.
|
|
2. **Webhook notifications** (1-2 weeks) — Slack/Teams alerts on critical update availability, failed installations, agent offline. Low integration overhead, high operational value.
|
|
3. **Staggered rollout** (1-2 weeks) — Deploy updates to 5% canary, monitor, then 25%, then 100%. Essential for fleets > 10 agents.
|
|
4. **macOS agent** (2-3 days) — launchd plist template + Homebrew scanner. Completes the "cross-platform" promise for homelabbers with Macs.
|
|
|
|
### LOW VALUE (Defer or Drop)
|
|
|
|
1. **AI features** — Drop entirely for foreseeable future. Adds operational complexity (GPU/API costs) without proportional benefit for homelab use case. Manual approval is more appropriate.
|
|
2. **aggregator-cli** — Defer. The web dashboard covers all use cases. CLI would be nice-to-have for scripting but is not essential.
|
|
3. **AUR/Snap/Flatpak scanners** — Defer. Very small user base for each. APT and DNF cover 95%+ of Linux homelabbers.
|
|
4. **LDAP/SSO** — Defer until multi-user is needed. Single-admin is correct for homelab.
|
|
5. **Compliance reporting** — Drop. SOX/HIPAA requirements don't apply to homelabs.
|
|
6. **Kubernetes deployment** — Defer. Docker Compose is the right deployment model for the target audience.
|
|
|
|
---
|
|
|
|
## Section 9: Summary Table
|
|
|
|
| Feature Area | Planned | Built | Status | Gap Rating |
|
|
|-------------|---------|-------|--------|------------|
|
|
| Core Architecture (pull model, agents, server) | Full | Full | Working | 0 (complete) |
|
|
| Ed25519 Signing (commands, binaries) | Full | Full + enhancements | Working | 0 (exceeds spec) |
|
|
| Authentication (tokens, JWT, refresh) | Full | Full + transactions | Working | 0 (exceeds spec) |
|
|
| Machine ID Binding | Full | Full + canonical hash | Working | 0 (exceeds spec) |
|
|
| Replay Protection (nonces) | Full | Full (10min vs 5min) | Working | 1 (timeout deviation) |
|
|
| Package Scanning (6 of 9 scanners) | 9 scanners | 6 scanners | Working | 3 (AUR, Snap, Flatpak, Homebrew missing) |
|
|
| Agent Self-Upgrade | Full | Full 7-step pipeline | Working | 0 (complete) |
|
|
| Installer (Linux + Windows) | Full | Full + arch + checksum | Working | 1 (macOS missing) |
|
|
| Web Dashboard | Full | Core features | Working | 3 (missing charts, health, metrics) |
|
|
| Database + Migrations | Full | Full + hardened | Working | 0 (exceeds spec) |
|
|
| Docker Deployment | Full | Full | Working | 0 (complete) |
|
|
| Testing | Minimal | 170 tests | Working | 2 (no CI/CD, no integration tests against real DB) |
|
|
| Maintenance Windows | Full | None | Not built | 10 (completely absent) |
|
|
| AI Features | Full | None | Not built | 10 (deliberately deferred) |
|
|
| Scheduling & Automation | Full | None | Not built | 8 (no maintenance windows, no staggered rollout) |
|
|
| LDAP/SSO | Planned | None | Not built | 5 (not needed for homelab) |
|
|
| Structured Logging | Planned (P3-006) | ETHOS alternative | Working differently | 3 (functional but not JSON/correlation IDs) |
|
|
| Compliance / Reporting | Planned | None | Not built | 2 (not applicable to homelab) |
|
|
| CLI Tool | Planned | None | Not built | 2 (dashboard covers use cases) |
|
|
| macOS Support | Planned | None | Not built | 4 (matters for homelabbers with Macs) |
|
|
|
|
**Overall: Core infrastructure is 9/10. Feature breadth is 5/10. Production readiness for homelab is 7/10.**
|
|
|
|
The gap is almost entirely in features that were always labeled "future" or "Phase 2" in the original docs. The core architecture — the hard engineering work — is built, tested, and hardened beyond the original specification.
|