Redflag/docs/Upgrade_Audit.md
jpetree331 949aca0342 feat(upgrade): agent upgrade system fixes
- Fix /api/v1/info returning hardcoded v0.1.21 (U-1)
- Fix semver comparison (lexicographic -> octet-based) (U-2)
- Fix bulk upgrade platform hardcoded to linux-amd64 (U-3)
- Fix bulk upgrade missing nonce generation (U-4)
- Add error check for sc stop in Windows restart (U-7)
- Add timeout + size limit to binary download (U-8)
- Fix ExtractConfigVersionFromAgent last-char bug (U-10)

End-to-end upgrade pipeline now fully wired.
170 tests pass (110 server + 60 agent). No regressions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 18:27:21 -04:00

Agent Upgrade System Audit

Date: 2026-03-29
Branch: culurien
Status: Audit only, no changes


1. WHAT ALREADY EXISTS

1a. POST /build/upgrade/:agentID Handler

Route: cmd/server/main.go:422
Handler: handlers/build_orchestrator.go:95-191

Status: Partially functional — config generator, not an upgrade orchestrator.

The handler generates a fresh config JSON and returns a download URL for a pre-built binary. It does NOT:

  • Verify the agent exists in the DB
  • Create any DB record for the upgrade event
  • Queue a CommandTypeUpdateAgent command
  • Push or deliver anything to the agent
  • Implement PreserveExisting (lines 142-146 are a TODO stub)

The response contains manual next_steps instructions telling a human to stop the service, download, and restart.

1b. services/build_orchestrator.go — BuildAndSignAgent

File: services/build_orchestrator.go:32-96

BuildAndSignAgent(version, platform, architecture):

  1. Locates pre-built binary at {agentDir}/binaries/{platform}/redflag-agent[.exe]
  2. Signs with Ed25519 via signingService.SignFile()
  3. Stores in DB via packageQueries.StoreSignedPackage()
  4. Returns AgentUpdatePackage

Critical disconnect: This service is NOT called by the HTTP upgrade handler. The handler uses AgentBuilder.BuildAgentWithConfig (config-only). BuildAndSignAgent is orphaned from the HTTP flow.

1c. agent_update_packages Table (Migration 016)

File: migrations/016_agent_update_packages.up.sql

| Column | Type | Notes |
|---|---|---|
| id | UUID | PK gen_random_uuid() |
| version | VARCHAR(50) | NOT NULL |
| platform | VARCHAR(50) | e.g. linux-amd64 |
| architecture | VARCHAR(20) | NOT NULL |
| binary_path | VARCHAR(500) | NOT NULL |
| signature | VARCHAR(128) | Ed25519 hex |
| checksum | VARCHAR(64) | SHA-256 |
| file_size | BIGINT | NOT NULL |
| created_at | TIMESTAMP | default now |
| created_by | VARCHAR(100) | default 'system' |
| is_active | BOOLEAN | default true |

Migration 016 also adds to agents table:

  • is_updating BOOLEAN DEFAULT false
  • updating_to_version VARCHAR(50)
  • update_initiated_at TIMESTAMP

1d. NewAgentBuild vs UpgradeAgentBuild

| Aspect | NewAgentBuild | UpgradeAgentBuild |
|---|---|---|
| Registration token | Required | Not needed |
| consumes_seat | true | false |
| Agent ID source | Generated or from request | From URL param |
| PreserveExisting | N/A | TODO stub |
| DB interaction | None | None |
| Command queued | No | No |

Both are config generators that return download URLs. Neither triggers actual delivery.

1e. Agent-Side Upgrade Code

A full self-update pipeline EXISTS in the agent.

Handler: cmd/agent/subsystem_handlers.go:575-762 (handleUpdateAgent)

7-step pipeline:

| Step | Line | What |
|---|---|---|
| 1 | 661 | downloadUpdatePackage(): HTTP GET to temp file |
| 2 | 669 | SHA-256 checksum verification against params["checksum"] |
| 3 | 681 | Ed25519 binary signature verification via cached server public key |
| 4 | 687 | Backup current binary to <binary>.bak |
| 5 | 719 | Atomic install: write .new, chmod, os.Rename |
| 6 | 724 | restartAgentService(): systemctl restart (Linux) or sc stop/start (Windows) |
| 7 | 731 | Watchdog: polls GetAgent() every 15s for 5 min, checks version |

Rollback: Deferred block (lines 700-715) restores from .bak if updateSuccess == false.

1f. Command Type for Self-Upgrade

YES — CommandTypeUpdateAgent = "update_agent" exists.

Defined in models/command.go:103. Dispatched in cmd/agent/main.go:1064:

case "update_agent":
    handleUpdateAgent(apiClient, cmd, cfg)

Full command type list:

  • collect_specs, install_updates, dry_run_update, confirm_dependencies
  • rollback_update, update_agent, enable_heartbeat, disable_heartbeat, reboot

2. AGENT SELF-REPLACEMENT MECHANISM

2a. Existing Binary Replacement Code — EXISTS

All steps exist in subsystem_handlers.go:

  • Download to temp: downloadUpdatePackage() (line 661/774)
  • Ed25519 verification: verifyBinarySignature() (line 681)
  • Checksum verification: SHA-256 (line 669)
  • Atomic replace: write .new + os.Rename (line 878)
  • Service restart: restartAgentService() (line 724/888)

2b. Linux Restart — EXISTS

restartAgentService() at line 888:

  1. Try systemctl restart redflag-agent (line 892)
  2. Fallback: service redflag-agent restart (line 898)

The agent knows its service name as hardcoded "redflag-agent".

2c. Windows Restart — EXISTS (with gap)

Lines 901-903: sc stop RedFlagAgent then sc start RedFlagAgent as separate commands. Gap: No error check on sc stop — result is discarded. The running .exe is replaced via os.Rename which works on Windows if the service has stopped.

2d. Acknowledgment — EXISTS

acknowledgment.Tracker package is used:

  • reportLogWithAck(commandID) called at upgrade start (line 651) and completion (line 751)
  • The tracker persists pending acks and retries with IncrementRetry()

3. SERVER-SIDE UPGRADE ORCHESTRATION

3a. Command Types — EXISTS

Full list in models/command.go:97-107. Includes "update_agent".

3b. update_agent Command Params

The agent handler at subsystem_handlers.go:575 expects these params:

  • download_url — URL to download the new binary
  • checksum — SHA-256 hex string
  • signature — Ed25519 hex signature of the binary
  • version — Expected version string after upgrade
  • nonce — Replay protection nonce (uuid:timestamp format)

3c. Agent Command Handling — EXISTS

Dispatched in main.go:1064 to handleUpdateAgent(). Full pipeline as described in section 1e.

3d. Agent Version Tracking — EXISTS

  • agents table has current_version column
  • Agent reports version on every check-in via AgentVersion: version.Version in the heartbeat/check-in payload
  • is_updating, updating_to_version, update_initiated_at columns exist for tracking in-progress upgrades

3e. Expected Agent Version — PARTIAL

  • config.LatestAgentVersion field exists in Config struct
  • version.MinAgentVersion is build-time injected
  • BUT: The /api/v1/info endpoint returns hardcoded "v0.1.21" instead of using version.GetCurrentVersions() — agents and the dashboard cannot reliably detect the current expected version.
  • version.ValidateAgentVersion() uses lexicographic string comparison (bug: "0.1.9" > "0.1.22" is true in lex order).
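The lexicographic bug disappears once each dot-separated part is compared as a number. The sketch below shows one way to do that; parseVersion and compareVersions are hypothetical names, not the functions in version/versions.go, and treating missing trailing parts as 0 is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits "0.1.26.0" (optionally prefixed with "v") into numeric parts.
func parseVersion(v string) ([]int, error) {
	fields := strings.Split(strings.TrimPrefix(v, "v"), ".")
	parts := make([]int, len(fields))
	for i, f := range fields {
		n, err := strconv.Atoi(f)
		if err != nil || n < 0 {
			return nil, fmt.Errorf("invalid version %q", v)
		}
		parts[i] = n
	}
	return parts, nil
}

// compareVersions returns -1, 0, or 1 when a is lower than, equal to,
// or higher than b. Missing trailing parts count as 0 (an assumption).
func compareVersions(a, b string) (int, error) {
	pa, err := parseVersion(a)
	if err != nil {
		return 0, err
	}
	pb, err := parseVersion(b)
	if err != nil {
		return 0, err
	}
	for i := 0; i < len(pa) || i < len(pb); i++ {
		var x, y int
		if i < len(pa) {
			x = pa[i]
		}
		if i < len(pb) {
			y = pb[i]
		}
		if x < y {
			return -1, nil
		}
		if x > y {
			return 1, nil
		}
	}
	return 0, nil
}
```

With this, compareVersions("0.1.9", "0.1.22") returns -1, whereas the plain string comparison "0.1.9" > "0.1.22" is true. Non-numeric versions such as "dev" return an error, so the caller has to decide how development builds are handled.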

4. VERSION COMPARISON

4a. Agent Reports Version — YES

Via version.Version (build-time injected, default "dev"). Sent on:

  • Registration (line 384/443)
  • Token renewal (line 506)
  • System info collection (line 373)

4b. Version String Format

Production: 0.1.26.0 (four-octet, semver-like). The 4th octet is the config version.
Dev: "dev".
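Because the config version is the whole 4th octet, it is safer to read it by splitting on dots than by taking the last character. A minimal sketch follows, with configVersion as a hypothetical name rather than the existing ExtractConfigVersionFromAgent.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// configVersion returns the 4th octet of a four-octet agent version.
func configVersion(agentVersion string) (int, error) {
	parts := strings.Split(strings.TrimPrefix(agentVersion, "v"), ".")
	if len(parts) != 4 {
		return 0, fmt.Errorf("version %q does not have four octets", agentVersion)
	}
	n, err := strconv.Atoi(parts[3])
	if err != nil || n < 0 {
		return 0, fmt.Errorf("version %q has an invalid config octet", agentVersion)
	}
	return n, nil
}
```

For "0.1.26.12" this yields 12, where a last-character approach would see only "2".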

4c. Server Expected Version — PARTIAL

config.LatestAgentVersion and version.MinAgentVersion exist but are not reliably surfaced:

  • /api/v1/info hardcodes "v0.1.21"
  • No endpoint returns latest_agent_version dynamically

4d. /api/v1/info Response — BROKEN

system.go:111-124 — Returns hardcoded JSON:

{
  "version": "v0.1.21",
  "name": "RedFlag Aggregator",
  "features": [...]
}

Does NOT use version.GetCurrentVersions(). Does NOT include latest_agent_version or min_agent_version.


5. ROLLBACK MECHANISM

5a. Rollback — EXISTS

Deferred rollback in subsystem_handlers.go:700-715:

  • Before install: backup to <binary>.bak
  • On any failure (including watchdog timeout): restoreFromBackup() restores the .bak file
  • On success: .bak file is removed

5b. Backup Logic — EXISTS

createBackup() copies current binary to <path>.bak before replacement.

5c. Health Check — EXISTS

Watchdog (lines 919-940) polls GetAgent() every 15s for 5 min. Success = agent.CurrentVersion == expectedVersion. Failure = timeout → rollback.


6. DASHBOARD UPGRADE UI

6a. Upgrade Button — EXISTS

Multiple entry points in Agents.tsx:

  • Version column "Update" badge (line 1281-1294) when agent.update_available === true
  • Per-row action button (line 1338-1348)
  • Bulk action bar for selected agents (line 1112-1131)

These open AgentUpdatesModal.tsx which:

  • Fetches available upgrade packages
  • Single agent: generates nonce → calls POST /agents/{id}/update
  • Multiple agents: calls POST /agents/bulk-update

6b. Target Version UI — PARTIAL

AgentUpdatesModal.tsx shows a package selection grid with version/platform filters. No global "set target version" control.

6c. Bulk Upgrade — EXISTS (with bugs)

Two bulk paths:

  1. AgentUpdatesModal bulk path — no nonces generated (security gap)
  2. BulkAgentUpdate in RelayList.tsx: platform hardcoded to linux-amd64 for all agents (line 91). Mixed-OS fleets get the wrong binaries.

7. COMPLETENESS MATRIX

| Component | Status | Notes |
|---|---|---|
| update_agent command type | EXISTS | models/command.go:103 |
| Agent handles upgrade command | EXISTS | subsystem_handlers.go:575-762, full 7-step pipeline |
| Safe binary replacement (Linux) | EXISTS | Atomic rename + systemctl restart |
| Safe binary replacement (Windows) | EXISTS | Atomic rename + sc stop/start (no error check on stop) |
| Ed25519 signature verification | EXISTS | verifyBinarySignature() against cached server key |
| Checksum verification | EXISTS | SHA-256 in agent handler; server serves X-Content-SHA256 header |
| Rollback on failure | EXISTS | Deferred .bak restore on any failure including watchdog timeout |
| Server triggers upgrade command | PARTIAL | POST /agents/{id}/update endpoint exists (called by UI), but the /build/upgrade endpoint is disconnected |
| Server tracks expected version | PARTIAL | DB columns exist; /api/v1/info version is hardcoded to v0.1.21 |
| Dashboard upgrade UI | EXISTS | Single + bulk upgrade via AgentUpdatesModal |
| Bulk upgrade UI | EXISTS (buggy) | Platform hardcoded to linux-amd64; no nonces in modal bulk path |
| Acknowledgment/delivery tracking | EXISTS | acknowledgment.Tracker with retry |
| Version comparison | PARTIAL | Lexicographic comparison is buggy for multi-digit versions |

8. EFFORT ESTIMATE

8a. Exists and Just Needs Wiring

  1. /api/v1/info version fix — Replace hardcoded "v0.1.21" with version.GetCurrentVersions(). Add latest_agent_version and min_agent_version to the response. (~10 lines)

  2. BuildAndSignAgent connection — The signing/packaging service exists but isn't called by the upgrade HTTP handler. Wire it to create a signed package when an admin triggers an upgrade. (~20 lines)

  3. Bulk upgrade platform detection: RelayList.tsx line 91 hardcodes linux-amd64. Fix to use each agent's actual os_type + os_architecture. (~5 lines)

  4. Bulk nonce generation: AgentUpdatesModal bulk path skips nonces. Align with single-agent path. (~15 lines)

8b. Needs Building from Scratch

  1. Semver-aware version comparison — Replace lexicographic comparison in version.ValidateAgentVersion() with proper semver parsing. (~30 lines)

  2. Auto-upgrade trigger — Server-side logic: when agent checks in with version < LatestAgentVersion, automatically queue an update_agent command. Requires policy controls (opt-in/opt-out per agent, maintenance windows). (~100-200 lines)

  3. Staged rollout — Upgrade N% of agents first, monitor for failures, then proceed. (~200-300 lines)
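For the staged rollout item, one common technique is to hash each agent ID into a stable bucket from 0 to 99 and upgrade only agents whose bucket falls below the current percentage. This is a sketch of that technique under assumed names (rolloutBucket, inRollout), not a design taken from the codebase.

```go
package main

import "hash/fnv"

// rolloutBucket maps an agent ID to a stable bucket in [0, 100).
func rolloutBucket(agentID string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(agentID)) // Write on a hash.Hash never returns an error
	return h.Sum32() % 100
}

// inRollout reports whether the agent belongs to the first percent of the fleet.
func inRollout(agentID string, percent uint32) bool {
	return rolloutBucket(agentID) < percent
}
```

Because the bucket depends only on the ID, raising the percentage from 10 to 50 keeps every agent that was already included and adds new ones, which avoids re-evaluating agents that have already been upgraded.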

8c. Minimum Viable Upgrade System (already working)

The MVP already works end-to-end:

  1. Admin clicks "Update" in dashboard → POST /agents/{id}/update
  2. Server creates update_agent command with download URL, checksum, signature
  3. Agent polls, receives command, verifies signature+checksum
  4. Agent downloads new binary, backs up old, atomic replace, restarts
  5. Watchdog confirms new version running, rollback if not

The critical gap is /api/v1/info returning a stale, hardcoded version. Everything else functions.

8d. Full Production Upgrade System Would Add

  1. Auto-upgrade policy engine (version-based triggers)
  2. Staged rollout with configurable percentages
  3. Maintenance window scheduling
  4. Cross-platform bulk upgrade fix (the linux-amd64 hardcode)
  5. Upgrade history dashboard (who upgraded when, rollbacks)
  6. Semver comparison throughout
  7. Download progress reporting (large binaries on slow links)

FINDINGS TABLE

| ID | Platform | Severity | Finding | Location |
|---|---|---|---|---|
| U-1 | All | HIGH | /api/v1/info returns hardcoded "v0.1.21"; agents/dashboard cannot detect current expected version | system.go:111-124 |
| U-2 | All | HIGH | ValidateAgentVersion uses lexicographic comparison; "0.1.9" > "0.1.22" incorrectly | version/versions.go:72 |
| U-3 | Windows | MEDIUM | Bulk upgrade platform hardcoded to linux-amd64; Windows agents get wrong binary | RelayList.tsx:91 |
| U-4 | All | MEDIUM | Bulk upgrade in AgentUpdatesModal skips nonce generation; weaker replay protection | AgentUpdatesModal.tsx:93-99 |
| U-5 | All | MEDIUM | BuildAndSignAgent service is disconnected from HTTP upgrade handler | build_orchestrator.go |
| U-6 | All | MEDIUM | POST /build/upgrade/:agentID is a config generator, not an upgrade orchestrator | handlers/build_orchestrator.go:95-191 |
| U-7 | Windows | LOW | sc stop result not checked in restartAgentService() | subsystem_handlers.go:901 |
| U-8 | All | LOW | downloadUpdatePackage uses plain http.Get; no timeout, no size limit | subsystem_handlers.go:774 |
| U-9 | All | LOW | PreserveExisting is a TODO stub in upgrade handler | handlers/build_orchestrator.go:142-146 |
| U-10 | All | INFO | ExtractConfigVersionFromAgent is fragile; last-char extraction breaks at version x.y.z10+ | version/versions.go:59-62 |
| U-11 | All | INFO | AgentUpdate.tsx component exists but is not imported by any page | AgentUpdate.tsx |
| U-12 | All | INFO | build_orchestrator.go services layer marked // Deprecated | services/build_orchestrator.go |

  1. Fix /api/v1/info (U-1) — immediate, ~10 lines, unblocks version detection
  2. Fix bulk platform hardcode (U-3) — immediate, ~5 lines, prevents wrong-platform delivery
  3. Fix semver comparison (U-2) — immediate, ~30 lines, prevents version logic bugs
  4. Fix bulk nonce generation (U-4) — quick, ~15 lines, security consistency
  5. Wire BuildAndSignAgent to upgrade flow (U-5) — medium, connects existing code
  6. Auto-upgrade trigger — larger feature, requires policy design
  7. Staged rollout — future enhancement