# Agent Upgrade System Audit

**Date:** 2026-03-29
**Branch:** culurien
**Status:** Audit only — no changes

---

## 1. WHAT ALREADY EXISTS

### 1a. POST /build/upgrade/:agentID Handler

**Route:** `cmd/server/main.go:422`
**Handler:** `handlers/build_orchestrator.go:95-191`

**Status: Partially functional — config generator, not an upgrade orchestrator.**

The handler generates a fresh config JSON and returns a download URL for a pre-built binary. It does NOT:

- Verify the agent exists in the DB
- Create any DB record for the upgrade event
- Queue a `CommandTypeUpdateAgent` command
- Push or deliver anything to the agent
- Implement `PreserveExisting` (lines 142-146 are a TODO stub)

The response contains manual `next_steps` instructions telling a human to stop the service, download, and restart.

### 1b. services/build_orchestrator.go — BuildAndSignAgent

**File:** `services/build_orchestrator.go:32-96`

`BuildAndSignAgent(version, platform, architecture)`:

1. Locates pre-built binary at `{agentDir}/binaries/{platform}/redflag-agent[.exe]`
2. Signs with Ed25519 via `signingService.SignFile()`
3. Stores in DB via `packageQueries.StoreSignedPackage()`
4. Returns `AgentUpdatePackage`

**Critical disconnect:** This service is NOT called by the HTTP upgrade handler. The handler uses `AgentBuilder.BuildAgentWithConfig` (config-only). `BuildAndSignAgent` is orphaned from the HTTP flow.

### 1c. agent_update_packages Table (Migration 016)

**File:** `migrations/016_agent_update_packages.up.sql`

| Column | Type | Notes |
|--------|------|-------|
| `id` | UUID PK | `gen_random_uuid()` |
| `version` | VARCHAR(50) | NOT NULL |
| `platform` | VARCHAR(50) | e.g. `linux-amd64` |
| `architecture` | VARCHAR(20) | NOT NULL |
| `binary_path` | VARCHAR(500) | NOT NULL |
| `signature` | VARCHAR(128) | Ed25519 hex |
| `checksum` | VARCHAR(64) | SHA-256 |
| `file_size` | BIGINT | NOT NULL |
| `created_at` | TIMESTAMP | default now |
| `created_by` | VARCHAR(100) | default `'system'` |
| `is_active` | BOOLEAN | default `true` |

Migration 016 also adds to `agents` table:

- `is_updating BOOLEAN DEFAULT false`
- `updating_to_version VARCHAR(50)`
- `update_initiated_at TIMESTAMP`

### 1d. NewAgentBuild vs UpgradeAgentBuild

| Aspect | NewAgentBuild | UpgradeAgentBuild |
|--------|--------------|-------------------|
| Registration token | Required | Not needed |
| consumes_seat | true | false |
| Agent ID source | Generated or from request | From URL param |
| PreserveExisting | N/A | TODO stub |
| DB interaction | None | None |
| Command queued | No | No |

Both are config generators that return download URLs. Neither triggers actual delivery.

### 1e. Agent-Side Upgrade Code

**A full self-update pipeline EXISTS in the agent.**

**Handler:** `cmd/agent/subsystem_handlers.go:575-762` (`handleUpdateAgent`)

**7-step pipeline:**

| Step | Line | What |
|------|------|------|
| 1 | 661 | `downloadUpdatePackage()` — HTTP GET to temp file |
| 2 | 669 | SHA-256 checksum verification against `params["checksum"]` |
| 3 | 681 | Ed25519 binary signature verification via cached server public key |
| 4 | 687 | Backup current binary to `.bak` |
| 5 | 719 | Atomic install: write `.new`, chmod, `os.Rename` |
| 6 | 724 | `restartAgentService()` — `systemctl restart` (Linux) or `sc stop/start` (Windows) |
| 7 | 731 | Watchdog: polls `GetAgent()` every 15s for 5 min, checks version |

**Rollback:** Deferred block (lines 700-715) restores from `.bak` if `updateSuccess == false`.

### 1f. Command Type for Self-Upgrade

**YES — `CommandTypeUpdateAgent = "update_agent"` exists.**

Defined in `models/command.go:103`.
Dispatched in `cmd/agent/main.go:1064`:

```go
case "update_agent":
    handleUpdateAgent(apiClient, cmd, cfg)
```

Full command type list:

- `collect_specs`, `install_updates`, `dry_run_update`, `confirm_dependencies`
- `rollback_update`, `update_agent`, `enable_heartbeat`, `disable_heartbeat`, `reboot`

---

## 2. AGENT SELF-REPLACEMENT MECHANISM

### 2a. Existing Binary Replacement Code — EXISTS

All steps exist in `subsystem_handlers.go`:

- Download to temp: `downloadUpdatePackage()` (line 661/774)
- Ed25519 verification: `verifyBinarySignature()` (line 681)
- Checksum verification: SHA-256 (line 669)
- Atomic replace: write `.new` + `os.Rename` (line 878)
- Service restart: `restartAgentService()` (line 724/888)

### 2b. Linux Restart — EXISTS

`restartAgentService()` at line 888:

1. Try `systemctl restart redflag-agent` (line 892)
2. Fallback: `service redflag-agent restart` (line 898)

The agent knows its service name as hardcoded `"redflag-agent"`.

### 2c. Windows Restart — EXISTS (with gap)

Lines 901-903: `sc stop RedFlagAgent` then `sc start RedFlagAgent` as separate commands.

**Gap:** No error check on `sc stop`; the result is discarded. The running `.exe` is replaced via `os.Rename`, which works on Windows if the service has stopped.

### 2d. Acknowledgment — EXISTS

`acknowledgment.Tracker` package is used:

- `reportLogWithAck(commandID)` called at upgrade start (line 651) and completion (line 751)
- The tracker persists pending acks and retries with `IncrementRetry()`

---

## 3. SERVER-SIDE UPGRADE ORCHESTRATION

### 3a. Command Types — EXISTS

Full list in `models/command.go:97-107`. Includes `"update_agent"`.

### 3b. update_agent Command Params

The agent handler at `subsystem_handlers.go:575` expects these params:

- `download_url` — URL to download the new binary
- `checksum` — SHA-256 hex string
- `signature` — Ed25519 hex signature of the binary
- `version` — Expected version string after upgrade
- `nonce` — Replay protection nonce (uuid:timestamp format)

### 3c. Agent Command Handling — EXISTS

Dispatched in `main.go:1064` to `handleUpdateAgent()`. Full pipeline as described in section 1e.

### 3d. Agent Version Tracking — EXISTS

- `agents` table has `current_version` column
- Agent reports version on every check-in via `AgentVersion: version.Version` in the heartbeat/check-in payload
- `is_updating`, `updating_to_version`, `update_initiated_at` columns exist for tracking in-progress upgrades

### 3e. Expected Agent Version — PARTIAL

- `config.LatestAgentVersion` field exists in Config struct
- `version.MinAgentVersion` is build-time injected
- **BUT:** The `/api/v1/info` endpoint returns hardcoded `"v0.1.21"` instead of using `version.GetCurrentVersions()`, so agents and the dashboard cannot reliably detect the current expected version.
- `version.ValidateAgentVersion()` uses lexicographic string comparison (bug: `"0.1.9" > "0.1.22"` is true in lex order).

---

## 4. VERSION COMPARISON

### 4a. Agent Reports Version — YES

Via `version.Version` (build-time injected, default `"dev"`). Sent on:

- Registration (line 384/443)
- Token renewal (line 506)
- System info collection (line 373)

### 4b. Version String Format

Production: `0.1.26.0` (four-part, semver-like). The 4th part = config version.
Dev: `"dev"`.

### 4c. Server Expected Version — PARTIAL

`config.LatestAgentVersion` and `version.MinAgentVersion` exist but are not reliably surfaced:

- `/api/v1/info` hardcodes `"v0.1.21"`
- No endpoint returns `latest_agent_version` dynamically

### 4d. /api/v1/info Response — BROKEN

`system.go:111-124` — Returns hardcoded JSON:

```json
{
  "version": "v0.1.21",
  "name": "RedFlag Aggregator",
  "features": [...]
}
```

Does NOT use `version.GetCurrentVersions()`. Does NOT include `latest_agent_version` or `min_agent_version`.

---

## 5. ROLLBACK MECHANISM

### 5a. Rollback — EXISTS

Deferred rollback in `subsystem_handlers.go:700-715`:

- Before install: backup to `.bak`
- On any failure (including watchdog timeout): `restoreFromBackup()` restores the `.bak` file
- On success: `.bak` file is removed

### 5b. Backup Logic — EXISTS

`createBackup()` copies current binary to `.bak` before replacement.

### 5c. Health Check — EXISTS

Watchdog (lines 919-940) polls `GetAgent()` every 15s for 5 min. Success = `agent.CurrentVersion == expectedVersion`. Failure = timeout → rollback.

---

## 6. DASHBOARD UPGRADE UI

### 6a. Upgrade Button — EXISTS

Multiple entry points in `Agents.tsx`:

- Version column "Update" badge (lines 1281-1294) when `agent.update_available === true`
- Per-row action button (lines 1338-1348)
- Bulk action bar for selected agents (lines 1112-1131)

These open `AgentUpdatesModal.tsx` which:

- Fetches available upgrade packages
- Single agent: generates nonce → calls `POST /agents/{id}/update`
- Multiple agents: calls `POST /agents/bulk-update`

### 6b. Target Version UI — PARTIAL

`AgentUpdatesModal.tsx` shows a package selection grid with version/platform filters. No global "set target version" control.

### 6c. Bulk Upgrade — EXISTS (with bugs)

Two bulk paths:

1. `AgentUpdatesModal` bulk path — no nonces generated (security gap)
2. `BulkAgentUpdate` in `RelayList.tsx` — **platform hardcoded to `linux-amd64`** for all agents (line 91). Mixed-OS fleets get wrong binaries.

---

## 7. COMPLETENESS MATRIX

| Component | Status | Notes |
|-----------|--------|-------|
| `update_agent` command type | EXISTS | `models/command.go:103` |
| Agent handles upgrade command | EXISTS | `subsystem_handlers.go:575-762`, full 7-step pipeline |
| Safe binary replacement (Linux) | EXISTS | Atomic rename + systemctl restart |
| Safe binary replacement (Windows) | EXISTS | Atomic rename + sc stop/start (no error check on stop) |
| Ed25519 signature verification | EXISTS | `verifyBinarySignature()` against cached server key |
| Checksum verification | EXISTS | SHA-256 in agent handler; server serves `X-Content-SHA256` header |
| Rollback on failure | EXISTS | Deferred `.bak` restore on any failure including watchdog timeout |
| Server triggers upgrade command | PARTIAL | `POST /agents/{id}/update` endpoint exists (called by UI), but the `/build/upgrade` endpoint is disconnected |
| Server tracks expected version | PARTIAL | DB columns exist; `/api/v1/info` version is hardcoded to `v0.1.21` |
| Dashboard upgrade UI | EXISTS | Single + bulk upgrade via `AgentUpdatesModal` |
| Bulk upgrade UI | EXISTS (buggy) | Platform hardcoded to `linux-amd64`; no nonces in modal bulk path |
| Acknowledgment/delivery tracking | EXISTS | `acknowledgment.Tracker` with retry |
| Version comparison | PARTIAL | Lexicographic comparison is buggy for multi-digit versions |

---

## 8. EFFORT ESTIMATE

### 8a. Exists and Just Needs Wiring

1. **`/api/v1/info` version fix** — Replace hardcoded `"v0.1.21"` with `version.GetCurrentVersions()`. Add `latest_agent_version` and `min_agent_version` to the response. (~10 lines)
2. **`BuildAndSignAgent` connection** — The signing/packaging service exists but isn't called by the upgrade HTTP handler. Wire it to create a signed package when an admin triggers an upgrade. (~20 lines)
3. **Bulk upgrade platform detection** — `RelayList.tsx` line 91 hardcodes `linux-amd64`. Fix to use each agent's actual `os_type + os_architecture`. (~5 lines)
4. **Bulk nonce generation** — `AgentUpdatesModal` bulk path skips nonces. Align with single-agent path. (~15 lines)

### 8b. Needs Building from Scratch

1. **Semver-aware version comparison** — Replace lexicographic comparison in `version.ValidateAgentVersion()` with proper semver parsing. (~30 lines)
2. **Auto-upgrade trigger** — Server-side logic: when agent checks in with version < `LatestAgentVersion`, automatically queue an `update_agent` command. Requires policy controls (opt-in/opt-out per agent, maintenance windows). (~100-200 lines)
3. **Staged rollout** — Upgrade N% of agents first, monitor for failures, then proceed. (~200-300 lines)

### 8c. Minimum Viable Upgrade System (already working)

The MVP already works end-to-end:

1. Admin clicks "Update" in dashboard → `POST /agents/{id}/update`
2. Server creates `update_agent` command with download URL, checksum, signature
3. Agent polls, receives the command, and downloads the new binary
4. Agent verifies checksum and signature, backs up the old binary, performs the atomic replace, and restarts
5. Watchdog confirms new version running, rollback if not

**The critical gap is `/api/v1/info` returning a stale version.** Everything else functions.

### 8d. Full Production Upgrade System Would Add

1. Auto-upgrade policy engine (version-based triggers)
2. Staged rollout with configurable percentages
3. Maintenance window scheduling
4. Cross-platform bulk upgrade fix (the `linux-amd64` hardcode)
5. Upgrade history dashboard (who upgraded when, rollbacks)
6. Semver comparison throughout
7. Download progress reporting (large binaries on slow links)

---

## FINDINGS TABLE

| ID | Platform | Severity | Finding | Location |
|----|----------|----------|---------|----------|
| U-1 | All | HIGH | `/api/v1/info` returns hardcoded `"v0.1.21"` — agents/dashboard cannot detect current expected version | `system.go:111-124` |
| U-2 | All | HIGH | `ValidateAgentVersion` uses lexicographic comparison — `"0.1.9" > "0.1.22"` incorrectly evaluates to true | `version/versions.go:72` |
| U-3 | Windows | MEDIUM | Bulk upgrade platform hardcoded to `linux-amd64` — Windows agents get wrong binary | `RelayList.tsx:91` |
| U-4 | All | MEDIUM | Bulk upgrade in `AgentUpdatesModal` skips nonce generation — weaker replay protection | `AgentUpdatesModal.tsx:93-99` |
| U-5 | All | MEDIUM | `BuildAndSignAgent` service is disconnected from HTTP upgrade handler | `build_orchestrator.go` |
| U-6 | All | MEDIUM | `POST /build/upgrade/:agentID` is a config generator, not an upgrade orchestrator | `handlers/build_orchestrator.go:95-191` |
| U-7 | Windows | LOW | `sc stop` result not checked in `restartAgentService()` | `subsystem_handlers.go:901` |
| U-8 | All | LOW | `downloadUpdatePackage` uses plain `http.Get` — no timeout, no size limit | `subsystem_handlers.go:774` |
| U-9 | All | LOW | `PreserveExisting` is a TODO stub in upgrade handler | `handlers/build_orchestrator.go:142-146` |
| U-10 | All | INFO | `ExtractConfigVersionFromAgent` is fragile — last-char extraction breaks at version x.y.z10+ | `version/versions.go:59-62` |
| U-11 | All | INFO | `AgentUpdate.tsx` component exists but is not imported by any page | `AgentUpdate.tsx` |
| U-12 | All | INFO | `build_orchestrator.go` services layer marked `// Deprecated` | `services/build_orchestrator.go` |

---

## RECOMMENDED BUILD ORDER

1. **Fix `/api/v1/info`** (U-1) — immediate, ~10 lines, unblocks version detection
2. **Fix bulk platform hardcode** (U-3) — immediate, ~5 lines, prevents wrong-platform delivery
3. **Fix semver comparison** (U-2) — immediate, ~30 lines, prevents version logic bugs
4. **Fix bulk nonce generation** (U-4) — quick, ~15 lines, security consistency
5. **Wire `BuildAndSignAgent` to upgrade flow** (U-5) — medium, connects existing code
6. **Auto-upgrade trigger** — larger feature, requires policy design
7. **Staged rollout** — future enhancement
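
### Appendix: sketch of a numeric version comparison (U-2)

The semver fix (U-2) in the build order could look roughly like the sketch below. It assumes versions are dot-separated non-negative integers with an optional `v` prefix (covering both `0.1.26.0` and `v0.1.21`), and that a missing trailing part counts as 0. `parseVersion` and `compareVersions` are hypothetical names for illustration, not functions that exist in the codebase. How `ValidateAgentVersion` should treat `dev` builds is a policy decision left open here; the sketch simply returns an error for them.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits a dotted version such as "0.1.26.0" or "v0.1.21"
// into integer parts. Hypothetical helper; non-numeric strings such as
// "dev" are rejected with an error rather than silently ordered.
func parseVersion(v string) ([]int, error) {
	v = strings.TrimPrefix(strings.TrimSpace(v), "v")
	if v == "" {
		return nil, fmt.Errorf("empty version")
	}
	fields := strings.Split(v, ".")
	parts := make([]int, len(fields))
	for i, f := range fields {
		n, err := strconv.Atoi(f)
		if err != nil || n < 0 {
			return nil, fmt.Errorf("invalid version component %q in %q", f, v)
		}
		parts[i] = n
	}
	return parts, nil
}

// compareVersions returns -1 if a < b, 0 if equal, and 1 if a > b,
// comparing part by part as integers. Missing trailing parts are
// treated as 0 (an assumption), so "0.1.21" equals "0.1.21.0".
func compareVersions(a, b string) (int, error) {
	pa, err := parseVersion(a)
	if err != nil {
		return 0, err
	}
	pb, err := parseVersion(b)
	if err != nil {
		return 0, err
	}
	n := len(pa)
	if len(pb) > n {
		n = len(pb)
	}
	for i := 0; i < n; i++ {
		var x, y int
		if i < len(pa) {
			x = pa[i]
		}
		if i < len(pb) {
			y = pb[i]
		}
		switch {
		case x < y:
			return -1, nil
		case x > y:
			return 1, nil
		}
	}
	return 0, nil
}
```

With this approach `compareVersions("0.1.9", "0.1.22")` yields -1, the opposite of the lexicographic result described in U-2. Returning an error for unparseable input, rather than a default ordering, forces the caller to decide explicitly what a `dev` agent means for upgrade eligibility.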