Redflag/docs/Upgrade_Audit.md
jpetree331 949aca0342 feat(upgrade): agent upgrade system fixes
- Fix /api/v1/info returning hardcoded v0.1.21 (U-1)
- Fix semver comparison (lexicographic -> octet-based) (U-2)
- Fix bulk upgrade platform hardcoded to linux-amd64 (U-3)
- Fix bulk upgrade missing nonce generation (U-4)
- Add error check for sc stop in Windows restart (U-7)
- Add timeout + size limit to binary download (U-8)
- Fix ExtractConfigVersionFromAgent last-char bug (U-10)

End-to-end upgrade pipeline now fully wired.
170 tests pass (110 server + 60 agent). No regressions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 18:27:21 -04:00

# Agent Upgrade System Audit
**Date:** 2026-03-29
**Branch:** culurien
**Status:** Audit only — no changes
---
## 1. WHAT ALREADY EXISTS
### 1a. POST /build/upgrade/:agentID Handler
**Route:** `cmd/server/main.go:422`
**Handler:** `handlers/build_orchestrator.go:95-191`
**Status: Partially functional — config generator, not an upgrade orchestrator.**
The handler generates a fresh config JSON and returns a download URL for a pre-built binary. It does NOT:
- Verify the agent exists in the DB
- Create any DB record for the upgrade event
- Queue a `CommandTypeUpdateAgent` command
- Push or deliver anything to the agent
- Implement `PreserveExisting` (lines 142-146 are a TODO stub)
The response contains manual `next_steps` instructions telling a human to stop the service, download, and restart.
### 1b. services/build_orchestrator.go — BuildAndSignAgent
**File:** `services/build_orchestrator.go:32-96`
`BuildAndSignAgent(version, platform, architecture)`:
1. Locates pre-built binary at `{agentDir}/binaries/{platform}/redflag-agent[.exe]`
2. Signs with Ed25519 via `signingService.SignFile()`
3. Stores in DB via `packageQueries.StoreSignedPackage()`
4. Returns `AgentUpdatePackage`
**Critical disconnect:** This service is NOT called by the HTTP upgrade handler. The handler uses `AgentBuilder.BuildAgentWithConfig` (config-only). `BuildAndSignAgent` is orphaned from the HTTP flow.
### 1c. agent_update_packages Table (Migration 016)
**File:** `migrations/016_agent_update_packages.up.sql`
| Column | Type | Notes |
|--------|------|-------|
| `id` | UUID PK | `gen_random_uuid()` |
| `version` | VARCHAR(50) | NOT NULL |
| `platform` | VARCHAR(50) | e.g. `linux-amd64` |
| `architecture` | VARCHAR(20) | NOT NULL |
| `binary_path` | VARCHAR(500) | NOT NULL |
| `signature` | VARCHAR(128) | Ed25519 hex |
| `checksum` | VARCHAR(64) | SHA-256 |
| `file_size` | BIGINT | NOT NULL |
| `created_at` | TIMESTAMP | default now |
| `created_by` | VARCHAR(100) | default `'system'` |
| `is_active` | BOOLEAN | default `true` |
Migration 016 also adds to `agents` table:
- `is_updating BOOLEAN DEFAULT false`
- `updating_to_version VARCHAR(50)`
- `update_initiated_at TIMESTAMP`
### 1d. NewAgentBuild vs UpgradeAgentBuild
| Aspect | NewAgentBuild | UpgradeAgentBuild |
|--------|--------------|-------------------|
| Registration token | Required | Not needed |
| consumes_seat | true | false |
| Agent ID source | Generated or from request | From URL param |
| PreserveExisting | N/A | TODO stub |
| DB interaction | None | None |
| Command queued | No | No |
Both are config generators that return download URLs. Neither triggers actual delivery.
### 1e. Agent-Side Upgrade Code
**A full self-update pipeline EXISTS in the agent.**
**Handler:** `cmd/agent/subsystem_handlers.go:575-762` (`handleUpdateAgent`)
**7-step pipeline:**
| Step | Line | What |
|------|------|------|
| 1 | 661 | `downloadUpdatePackage()` — HTTP GET to temp file |
| 2 | 669 | SHA-256 checksum verification against `params["checksum"]` |
| 3 | 681 | Ed25519 binary signature verification via cached server public key |
| 4 | 687 | Backup current binary to `<binary>.bak` |
| 5 | 719 | Atomic install: write `.new`, chmod, `os.Rename` |
| 6 | 724 | `restartAgentService()`: `systemctl restart` (Linux) or `sc stop/start` (Windows) |
| 7 | 731 | Watchdog: polls `GetAgent()` every 15s for 5 min, checks version |
**Rollback:** Deferred block (lines 700-715) restores from `.bak` if `updateSuccess == false`.
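
For step 1 of the pipeline above, a bounded download might look like the following. This is a minimal sketch, not the agent's actual code: `downloadToTemp`, `errTooLarge`, the temp-file name pattern, and the limits used in `main` are all hypothetical, and only standard-library calls are used.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// errTooLarge is returned when the response body exceeds the configured cap.
var errTooLarge = errors.New("update package exceeds size limit")

// downloadToTemp fetches url into a temp file, bounded by timeout and maxBytes.
// It returns the temp file path; the caller removes the file when done.
func downloadToTemp(url string, maxBytes int64, timeout time.Duration) (string, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return "", fmt.Errorf("build request: %w", err)
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return "", fmt.Errorf("download: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("download: unexpected status %s", resp.Status)
	}

	tmp, err := os.CreateTemp("", "agent-update-*")
	if err != nil {
		return "", fmt.Errorf("create temp file: %w", err)
	}

	// Read at most maxBytes+1 so an oversized body is detectable.
	n, copyErr := io.Copy(tmp, io.LimitReader(resp.Body, maxBytes+1))
	closeErr := tmp.Close()
	if copyErr == nil && n > maxBytes {
		copyErr = errTooLarge
	}
	if copyErr == nil {
		copyErr = closeErr
	}
	if copyErr != nil {
		os.Remove(tmp.Name()) // best-effort cleanup of the partial file
		return "", copyErr
	}
	return tmp.Name(), nil
}

func main() {
	// "://bad" is not a valid URL, so this fails before any network I/O.
	_, err := downloadToTemp("://bad", 100<<20, 5*time.Minute)
	fmt.Println("error:", err)
}
```

The context deadline covers the whole transfer, including reading the body, which a bare `http.Get` does not bound.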
### 1f. Command Type for Self-Upgrade
**YES — `CommandTypeUpdateAgent = "update_agent"` exists.**
Defined in `models/command.go:103`. Dispatched in `cmd/agent/main.go:1064`:
```go
case "update_agent":
handleUpdateAgent(apiClient, cmd, cfg)
```
Full command type list:
- `collect_specs`, `install_updates`, `dry_run_update`, `confirm_dependencies`
- `rollback_update`, `update_agent`, `enable_heartbeat`, `disable_heartbeat`, `reboot`
---
## 2. AGENT SELF-REPLACEMENT MECHANISM
### 2a. Existing Binary Replacement Code — EXISTS
All steps exist in `subsystem_handlers.go`:
- Download to temp: `downloadUpdatePackage()` (line 661/774)
- Ed25519 verification: `verifyBinarySignature()` (line 681)
- Checksum verification: SHA-256 (line 669)
- Atomic replace: write `.new` + `os.Rename` (line 878)
- Service restart: `restartAgentService()` (line 724/888)
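
The two verification steps listed above can be sketched as follows. This is an illustration under assumptions: it assumes the Ed25519 signature covers the raw binary bytes (the audit does not say whether the file or its digest is signed), and `verifyPackage` and `signForDemo` are hypothetical names, not the agent's `verifyBinarySignature()`.

```go
package main

import (
	"crypto/ed25519"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"errors"
	"fmt"
)

// verifyPackage checks the SHA-256 checksum first, then the Ed25519 signature.
// Assumption: the signature is computed over the raw binary bytes.
func verifyPackage(data []byte, checksumHex, signatureHex string, pub ed25519.PublicKey) error {
	want, err := hex.DecodeString(checksumHex)
	if err != nil {
		return fmt.Errorf("decode checksum: %w", err)
	}
	got := sha256.Sum256(data)
	if subtle.ConstantTimeCompare(got[:], want) != 1 {
		return errors.New("checksum mismatch")
	}

	sig, err := hex.DecodeString(signatureHex)
	if err != nil {
		return fmt.Errorf("decode signature: %w", err)
	}
	// ed25519.Verify panics on a wrong-sized key, so check the size first.
	if len(pub) != ed25519.PublicKeySize {
		return errors.New("public key has wrong size")
	}
	if !ed25519.Verify(pub, data, sig) {
		return errors.New("signature verification failed")
	}
	return nil
}

// signForDemo generates a throwaway key pair and returns hex checksum,
// hex signature, and public key for data. Demo helper only.
func signForDemo(data []byte) (string, string, ed25519.PublicKey) {
	pub, priv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand
	if err != nil {
		panic(err)
	}
	sum := sha256.Sum256(data)
	sig := ed25519.Sign(priv, data)
	return hex.EncodeToString(sum[:]), hex.EncodeToString(sig), pub
}

func main() {
	data := []byte("agent binary bytes")
	sum, sig, pub := signForDemo(data)
	fmt.Println("valid package:", verifyPackage(data, sum, sig, pub))
	fmt.Println("tampered package:", verifyPackage([]byte("tampered"), sum, sig, pub))
}
```

A hex-encoded SHA-256 digest is 64 characters and a hex-encoded Ed25519 signature is 128, which matches the `checksum` and `signature` column widths in section 1c.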
### 2b. Linux Restart — EXISTS
`restartAgentService()` at line 888:
1. Try `systemctl restart redflag-agent` (line 892)
2. Fallback: `service redflag-agent restart` (line 898)
The agent knows its service name as hardcoded `"redflag-agent"`.
### 2c. Windows Restart — EXISTS (with gap)
Lines 901-903: `sc stop RedFlagAgent` then `sc start RedFlagAgent` as separate commands.
**Gap:** The result of `sc stop` is discarded, so a failed stop goes unnoticed. The running `.exe` is replaced via `os.Rename`, which works on Windows if the service has stopped.
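
One way to close this gap is to abort the restart when the stop command fails. The sketch below is hypothetical throughout: the `runner` type, `execRunner`, `recordingRunner`, and `restartWindowsService` are illustrative names, and command execution is injected so the logic can be exercised without `sc.exe`.

```go
package main

import (
	"fmt"
	"os/exec"
)

// runner abstracts command execution so the logic can run without sc.exe.
type runner func(name string, args ...string) error

// execRunner runs a real command and includes its output in any error.
func execRunner(name string, args ...string) error {
	out, err := exec.Command(name, args...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("%s %v: %w (output: %s)", name, args, err, out)
	}
	return nil
}

// restartWindowsService stops then starts the service, and does not attempt
// the start if the stop failed.
func restartWindowsService(run runner, service string) error {
	if err := run("sc", "stop", service); err != nil {
		return fmt.Errorf("stop %s: %w", service, err)
	}
	if err := run("sc", "start", service); err != nil {
		return fmt.Errorf("start %s: %w", service, err)
	}
	return nil
}

// recordingRunner records the sc verb of each call and fails on failVerb.
// Demo helper only.
func recordingRunner(failVerb string, calls *[]string) runner {
	return func(name string, args ...string) error {
		*calls = append(*calls, args[0])
		if args[0] == failVerb {
			return fmt.Errorf("simulated failure of %q", args[0])
		}
		return nil
	}
}

func main() {
	var calls []string
	err := restartWindowsService(recordingRunner("stop", &calls), "RedFlagAgent")
	fmt.Println("calls:", calls, "error:", err)
}
```

On a Windows host the agent would pass `execRunner`. Note that `sc stop` typically returns before the service has fully stopped, so a complete fix would likely also wait for the stopped state before starting the service again.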
### 2d. Acknowledgment — EXISTS
`acknowledgment.Tracker` package is used:
- `reportLogWithAck(commandID)` called at upgrade start (line 651) and completion (line 751)
- The tracker persists pending acks and retries with `IncrementRetry()`
---
## 3. SERVER-SIDE UPGRADE ORCHESTRATION
### 3a. Command Types — EXISTS
Full list in `models/command.go:97-107`. Includes `"update_agent"`.
### 3b. update_agent Command Params
The agent handler at `subsystem_handlers.go:575` expects these params:
- `download_url` — URL to download the new binary
- `checksum` — SHA-256 hex string
- `signature` — Ed25519 hex signature of the binary
- `version` — Expected version string after upgrade
- `nonce` — Replay protection nonce (uuid:timestamp format)
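
A typed view of these params, with basic validation before any download begins, could look like this. It is a sketch under assumptions: the `updateParams` struct, `parseUpdateParams`, and the `map[string]interface{}` input type are hypothetical, and the nonce is only checked for presence because its exact format is not specified here beyond `uuid:timestamp`.

```go
package main

import (
	"fmt"
	"strings"
)

// updateParams mirrors the five params listed above; the struct is hypothetical.
type updateParams struct {
	DownloadURL string
	Checksum    string
	Signature   string
	Version     string
	Nonce       string
}

// parseUpdateParams extracts the params and rejects missing or malformed values.
func parseUpdateParams(params map[string]interface{}) (updateParams, error) {
	var p updateParams
	fields := []struct {
		key string
		dst *string
	}{
		{"download_url", &p.DownloadURL},
		{"checksum", &p.Checksum},
		{"signature", &p.Signature},
		{"version", &p.Version},
		{"nonce", &p.Nonce},
	}
	for _, f := range fields {
		s, ok := params[f.key].(string)
		if !ok || strings.TrimSpace(s) == "" {
			return updateParams{}, fmt.Errorf("update_agent: missing or non-string param %q", f.key)
		}
		*f.dst = s
	}
	// A hex-encoded SHA-256 digest is 64 characters; an Ed25519 signature is 128.
	if len(p.Checksum) != 64 {
		return updateParams{}, fmt.Errorf("update_agent: checksum has length %d, want 64", len(p.Checksum))
	}
	if len(p.Signature) != 128 {
		return updateParams{}, fmt.Errorf("update_agent: signature has length %d, want 128", len(p.Signature))
	}
	return p, nil
}

func main() {
	// An incomplete command: every param except download_url is missing.
	_, err := parseUpdateParams(map[string]interface{}{
		"download_url": "https://example.invalid/agent",
	})
	fmt.Println("error:", err)
}
```

Failing fast on a missing `nonce` or `signature` keeps an incomplete command from reaching the download step at all.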
### 3c. Agent Command Handling — EXISTS
Dispatched in `main.go:1064` to `handleUpdateAgent()`. Full pipeline as described in section 1e.
### 3d. Agent Version Tracking — EXISTS
- `agents` table has `current_version` column
- Agent reports version on every check-in via `AgentVersion: version.Version` in the heartbeat/check-in payload
- `is_updating`, `updating_to_version`, `update_initiated_at` columns exist for tracking in-progress upgrades
### 3e. Expected Agent Version — PARTIAL
- `config.LatestAgentVersion` field exists in Config struct
- `version.MinAgentVersion` is build-time injected
- **BUT:** The `/api/v1/info` endpoint returns hardcoded `"v0.1.21"` instead of using `version.GetCurrentVersions()` — agents and the dashboard cannot reliably detect the current expected version.
- `version.ValidateAgentVersion()` uses lexicographic string comparison (bug: `"0.1.9" > "0.1.22"` is true in lex order).
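
A numeric, segment-by-segment comparison avoids the lexicographic bug. The following is a minimal sketch, not the contents of `version.ValidateAgentVersion()`: `parseVersion` and `compareVersions` are hypothetical names, a leading `v` is tolerated, missing trailing segments are treated as 0, and non-numeric versions such as `"dev"` are reported as errors so the caller decides how to handle them.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits a dotted version such as "v0.1.26.0" into integers.
func parseVersion(v string) ([]int, error) {
	v = strings.TrimPrefix(strings.TrimSpace(v), "v")
	if v == "" {
		return nil, fmt.Errorf("empty version")
	}
	fields := strings.Split(v, ".")
	parts := make([]int, len(fields))
	for i, f := range fields {
		n, err := strconv.Atoi(f)
		if err != nil || n < 0 {
			return nil, fmt.Errorf("invalid version segment %q in %q", f, v)
		}
		parts[i] = n
	}
	return parts, nil
}

// compareVersions returns -1, 0, or 1 when a is lower than, equal to, or
// higher than b. Missing trailing segments count as 0.
func compareVersions(a, b string) (int, error) {
	pa, err := parseVersion(a)
	if err != nil {
		return 0, err
	}
	pb, err := parseVersion(b)
	if err != nil {
		return 0, err
	}
	n := len(pa)
	if len(pb) > n {
		n = len(pb)
	}
	for i := 0; i < n; i++ {
		var x, y int
		if i < len(pa) {
			x = pa[i]
		}
		if i < len(pb) {
			y = pb[i]
		}
		if x < y {
			return -1, nil
		}
		if x > y {
			return 1, nil
		}
	}
	return 0, nil
}

func main() {
	c, _ := compareVersions("0.1.9", "0.1.22")
	fmt.Println("numeric compare:", c)                // -1: 0.1.9 is older
	fmt.Println("string compare:", "0.1.9" > "0.1.22") // true: the bug
}
```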
---
## 4. VERSION COMPARISON
### 4a. Agent Reports Version — YES
Via `version.Version` (build-time injected, default `"dev"`). Sent on:
- Registration (line 384/443)
- Token renewal (line 506)
- System info collection (line 373)
### 4b. Version String Format
Production: `0.1.26.0` (four-octet semver-like). The 4th octet = config version.
Dev: `"dev"`.
### 4c. Server Expected Version — PARTIAL
`config.LatestAgentVersion` and `version.MinAgentVersion` exist but are not reliably surfaced:
- `/api/v1/info` hardcodes `"v0.1.21"`
- No endpoint returns `latest_agent_version` dynamically
### 4d. /api/v1/info Response — BROKEN
`system.go:111-124` — Returns hardcoded JSON:
```json
{
"version": "v0.1.21",
"name": "RedFlag Aggregator",
"features": [...]
}
```
Does NOT use `version.GetCurrentVersions()`. Does NOT include `latest_agent_version` or `min_agent_version`.
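
A dynamic response could be assembled as sketched below. This is an assumption-heavy illustration: it uses the standard library's `net/http` rather than whatever router the server actually uses, `versionInfo` is a hypothetical stand-in for the result of `version.GetCurrentVersions()` (whose real return type is not shown in this audit), the version values in `main` are placeholders, and the elided `features` list is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// versionInfo is a hypothetical stand-in for the server's version source.
type versionInfo struct {
	Server      string
	LatestAgent string
	MinAgent    string
}

// infoResponse adds the two agent version fields missing from the current payload.
type infoResponse struct {
	Version            string `json:"version"`
	Name               string `json:"name"`
	LatestAgentVersion string `json:"latest_agent_version"`
	MinAgentVersion    string `json:"min_agent_version"`
}

// buildInfo maps the version source onto the response payload.
func buildInfo(v versionInfo) infoResponse {
	return infoResponse{
		Version:            v.Server,
		Name:               "RedFlag Aggregator",
		LatestAgentVersion: v.LatestAgent,
		MinAgentVersion:    v.MinAgent,
	}
}

// infoHandler reads versions at request time instead of hardcoding them.
func infoHandler(current func() versionInfo) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		body, err := json.Marshal(buildInfo(current()))
		if err != nil {
			http.Error(w, "failed to encode info", http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		w.Write(body)
	}
}

func main() {
	h := infoHandler(func() versionInfo {
		// Placeholder values for the demo only.
		return versionInfo{Server: "0.1.26", LatestAgent: "0.1.26.0", MinAgent: "0.1.22.0"}
	})
	rec := httptest.NewRecorder()
	h(rec, httptest.NewRequest(http.MethodGet, "/api/v1/info", nil))
	fmt.Println(rec.Body.String())
}
```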
---
## 5. ROLLBACK MECHANISM
### 5a. Rollback — EXISTS
Deferred rollback in `subsystem_handlers.go:700-715`:
- Before install: backup to `<binary>.bak`
- On any failure (including watchdog timeout): `restoreFromBackup()` restores the `.bak` file
- On success: `.bak` file is removed
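
The deferred-rollback pattern described above can be sketched as follows. The `steps` struct, `applyUpdate`, and `errWatchdogTimeout` are hypothetical; each step is injected as a function so the control flow is visible without touching the filesystem.

```go
package main

import (
	"errors"
	"fmt"
)

// errWatchdogTimeout stands in for a failed post-restart health check.
var errWatchdogTimeout = errors.New("watchdog timeout")

// steps bundles the upgrade phases as injectable functions (hypothetical shape).
type steps struct {
	backup       func() error // copy current binary to <binary>.bak
	install      func() error // write .new, chmod, rename, restart
	verify       func() error // watchdog: confirm the new version is running
	restore      func() error // copy .bak back over the binary
	removeBackup func() error // delete .bak after success
}

// applyUpdate restores from backup unless every step after the backup succeeds.
func applyUpdate(s steps) error {
	if err := s.backup(); err != nil {
		return fmt.Errorf("backup: %w", err)
	}

	updateSuccess := false
	defer func() {
		if updateSuccess {
			s.removeBackup()
			return
		}
		// Any early return below lands here and triggers the rollback.
		s.restore()
	}()

	if err := s.install(); err != nil {
		return fmt.Errorf("install: %w", err)
	}
	if err := s.verify(); err != nil {
		return fmt.Errorf("verify: %w", err)
	}
	updateSuccess = true
	return nil
}

func main() {
	ok := func() error { return nil }
	err := applyUpdate(steps{
		backup:       ok,
		install:      ok,
		verify:       func() error { return errWatchdogTimeout },
		restore:      func() error { fmt.Println("restoring from .bak"); return nil },
		removeBackup: ok,
	})
	fmt.Println("result:", err)
}
```

Registering the deferred block only after the backup succeeds means a failed backup never triggers a restore from a file that may not exist.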
### 5b. Backup Logic — EXISTS
`createBackup()` copies current binary to `<path>.bak` before replacement.
### 5c. Health Check — EXISTS
Watchdog (line 919-940) polls `GetAgent()` every 15s for 5 min. Success = `agent.CurrentVersion == expectedVersion`. Failure = timeout → rollback.
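
The polling loop can be sketched as below. This is a minimal illustration, not the agent's watchdog: `waitForVersion` and the injected `getVersion` function (standing in for `GetAgent()` plus a read of the reported version) are hypothetical, and the demo uses millisecond intervals instead of 15s and 5 min. Whether the real watchdog polls immediately or waits one interval first is not stated in this audit; the sketch polls immediately.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitForVersion polls getVersion every interval until it reports expected
// or timeout elapses. Poll errors are treated as "not ready yet".
func waitForVersion(getVersion func() (string, error), expected string, interval, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		if v, err := getVersion(); err == nil && v == expected {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("agent did not report version %s within %s", expected, timeout)
		case <-ticker.C:
			// Next poll.
		}
	}
}

func main() {
	calls := 0
	// Simulated agent: reports the old version twice, then the new one.
	get := func() (string, error) {
		calls++
		if calls < 3 {
			return "0.1.25.0", nil
		}
		return "0.1.26.0", nil
	}
	err := waitForVersion(get, "0.1.26.0", 10*time.Millisecond, time.Second)
	fmt.Println("polls:", calls, "error:", err)
}
```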
---
## 6. DASHBOARD UPGRADE UI
### 6a. Upgrade Button — EXISTS
Multiple entry points in `Agents.tsx`:
- Version column "Update" badge (line 1281-1294) when `agent.update_available === true`
- Per-row action button (line 1338-1348)
- Bulk action bar for selected agents (line 1112-1131)
These open `AgentUpdatesModal.tsx` which:
- Fetches available upgrade packages
- Single agent: generates nonce → calls `POST /agents/{id}/update`
- Multiple agents: calls `POST /agents/bulk-update`
### 6b. Target Version UI — PARTIAL
`AgentUpdatesModal.tsx` shows a package selection grid with version/platform filters. No global "set target version" control.
### 6c. Bulk Upgrade — EXISTS (with bugs)
Two bulk paths:
1. `AgentUpdatesModal` bulk path — no nonces generated (security gap)
2. `BulkAgentUpdate` in `RelayList.tsx`: **platform hardcoded to `linux-amd64`** for all agents (line 91). Mixed-OS fleets get wrong binaries.
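
The per-agent platform derivation that would replace the `linux-amd64` hardcode is shown here in Go for consistency with the rest of the codebase, although the affected file is TypeScript. Everything in it is an assumption: the `agent` struct, `platformKey`, `groupByPlatform`, the `os-arch` key format (inferred from `linux-amd64`), and the architecture aliases.

```go
package main

import (
	"fmt"
	"strings"
)

// agent holds only the fields needed here; the shape is hypothetical.
type agent struct {
	ID             string
	OSType         string
	OSArchitecture string
}

// platformKey builds a key such as "windows-amd64" from OS and architecture.
// The alias table is an assumption about how agents might report architecture.
func platformKey(osType, arch string) (string, error) {
	osName := strings.ToLower(strings.TrimSpace(osType))
	a := strings.ToLower(strings.TrimSpace(arch))
	switch a {
	case "x86_64":
		a = "amd64"
	case "aarch64":
		a = "arm64"
	}
	if osName == "" || a == "" {
		return "", fmt.Errorf("cannot derive platform from os=%q arch=%q", osType, arch)
	}
	return osName + "-" + a, nil
}

// groupByPlatform buckets agent IDs by platform so each bucket can be sent
// the package built for it.
func groupByPlatform(agents []agent) (map[string][]string, error) {
	groups := make(map[string][]string)
	for _, ag := range agents {
		key, err := platformKey(ag.OSType, ag.OSArchitecture)
		if err != nil {
			return nil, fmt.Errorf("agent %s: %w", ag.ID, err)
		}
		groups[key] = append(groups[key], ag.ID)
	}
	return groups, nil
}

func main() {
	groups, err := groupByPlatform([]agent{
		{ID: "a1", OSType: "linux", OSArchitecture: "amd64"},
		{ID: "a2", OSType: "windows", OSArchitecture: "amd64"},
		{ID: "a3", OSType: "Linux", OSArchitecture: "x86_64"},
	})
	fmt.Println(groups, err)
}
```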
---
## 7. COMPLETENESS MATRIX
| Component | Status | Notes |
|-----------|--------|-------|
| `update_agent` command type | EXISTS | `models/command.go:103` |
| Agent handles upgrade command | EXISTS | `subsystem_handlers.go:575-762`, full 7-step pipeline |
| Safe binary replacement (Linux) | EXISTS | Atomic rename + systemctl restart |
| Safe binary replacement (Windows) | EXISTS | Atomic rename + sc stop/start (no error check on stop) |
| Ed25519 signature verification | EXISTS | `verifyBinarySignature()` against cached server key |
| Checksum verification | EXISTS | SHA-256 in agent handler; server serves `X-Content-SHA256` header |
| Rollback on failure | EXISTS | Deferred `.bak` restore on any failure including watchdog timeout |
| Server triggers upgrade command | PARTIAL | `POST /agents/{id}/update` endpoint exists (called by UI), but the `/build/upgrade` endpoint is disconnected |
| Server tracks expected version | PARTIAL | DB columns exist; `/api/v1/info` version is hardcoded to `v0.1.21` |
| Dashboard upgrade UI | EXISTS | Single + bulk upgrade via `AgentUpdatesModal` |
| Bulk upgrade UI | EXISTS (buggy) | Platform hardcoded to `linux-amd64`; no nonces in modal bulk path |
| Acknowledgment/delivery tracking | EXISTS | `acknowledgment.Tracker` with retry |
| Version comparison | PARTIAL | Lexicographic comparison is buggy for multi-digit versions |
---
## 8. EFFORT ESTIMATE
### 8a. Exists and Just Needs Wiring
1. **`/api/v1/info` version fix** — Replace hardcoded `"v0.1.21"` with `version.GetCurrentVersions()`. Add `latest_agent_version` and `min_agent_version` to the response. (~10 lines)
2. **`BuildAndSignAgent` connection** — The signing/packaging service exists but isn't called by the upgrade HTTP handler. Wire it to create a signed package when an admin triggers an upgrade. (~20 lines)
3. **Bulk upgrade platform detection**: `RelayList.tsx` line 91 hardcodes `linux-amd64`. Fix to use each agent's actual `os_type + os_architecture`. (~5 lines)
4. **Bulk nonce generation**: `AgentUpdatesModal` bulk path skips nonces. Align with single-agent path. (~15 lines)
### 8b. Needs Building from Scratch
1. **Semver-aware version comparison** — Replace lexicographic comparison in `version.ValidateAgentVersion()` with proper semver parsing. (~30 lines)
2. **Auto-upgrade trigger** — Server-side logic: when agent checks in with version < `LatestAgentVersion`, automatically queue an `update_agent` command. Requires policy controls (opt-in/opt-out per agent, maintenance windows). (~100-200 lines)
3. **Staged rollout** — Upgrade N% of agents first, monitor for failures, then proceed. (~200-300 lines)
### 8c. Minimum Viable Upgrade System (already working)
The MVP already works end-to-end:
1. Admin clicks "Update" in dashboard → `POST /agents/{id}/update`
2. Server creates an `update_agent` command with download URL, checksum, signature, version, and nonce
3. Agent polls and receives the command
4. Agent downloads the new binary, verifies checksum and signature, backs up the old binary, performs the atomic replace, and restarts
5. Watchdog confirms new version running, rollback if not
**The critical gap is `/api/v1/info` returning a stale version.** Everything else functions.
### 8d. Full Production Upgrade System Would Add
1. Auto-upgrade policy engine (version-based triggers)
2. Staged rollout with configurable percentages
3. Maintenance window scheduling
4. Cross-platform bulk upgrade fix (the `linux-amd64` hardcode)
5. Upgrade history dashboard (who upgraded when, rollbacks)
6. Semver comparison throughout
7. Download progress reporting (large binaries on slow links)
---
## FINDINGS TABLE
| ID | Platform | Severity | Finding | Location |
|----|----------|----------|---------|----------|
| U-1 | All | HIGH | `/api/v1/info` returns hardcoded `"v0.1.21"` — agents/dashboard cannot detect current expected version | `system.go:111-124` |
| U-2 | All | HIGH | `ValidateAgentVersion` uses lexicographic comparison, so `"0.1.9"` incorrectly compares as greater than `"0.1.22"` | `version/versions.go:72` |
| U-3 | Windows | MEDIUM | Bulk upgrade platform hardcoded to `linux-amd64` — Windows agents get wrong binary | `RelayList.tsx:91` |
| U-4 | All | MEDIUM | Bulk upgrade in `AgentUpdatesModal` skips nonce generation — weaker replay protection | `AgentUpdatesModal.tsx:93-99` |
| U-5 | All | MEDIUM | `BuildAndSignAgent` service is disconnected from HTTP upgrade handler | `build_orchestrator.go` |
| U-6 | All | MEDIUM | `POST /build/upgrade/:agentID` is a config generator, not an upgrade orchestrator | `handlers/build_orchestrator.go:95-191` |
| U-7 | Windows | LOW | `sc stop` result not checked in `restartAgentService()` | `subsystem_handlers.go:901` |
| U-8 | All | LOW | `downloadUpdatePackage` uses plain `http.Get` — no timeout, no size limit | `subsystem_handlers.go:774` |
| U-9 | All | LOW | `PreserveExisting` is a TODO stub in upgrade handler | `handlers/build_orchestrator.go:142-146` |
| U-10 | All | INFO | `ExtractConfigVersionFromAgent` is fragile: last-character extraction breaks once the config version reaches two digits (10+) | `version/versions.go:59-62` |
| U-11 | All | INFO | `AgentUpdate.tsx` component exists but is not imported by any page | `AgentUpdate.tsx` |
| U-12 | All | INFO | `build_orchestrator.go` services layer marked `// Deprecated` | `services/build_orchestrator.go` |
---
## RECOMMENDED BUILD ORDER
1. **Fix `/api/v1/info`** (U-1) — immediate, ~10 lines, unblocks version detection
2. **Fix bulk platform hardcode** (U-3) — immediate, ~5 lines, prevents wrong-platform delivery
3. **Fix semver comparison** (U-2) — immediate, ~30 lines, prevents version logic bugs
4. **Fix bulk nonce generation** (U-4) — quick, ~15 lines, security consistency
5. **Wire `BuildAndSignAgent` to upgrade flow** (U-5) — medium, connects existing code
6. **Auto-upgrade trigger** — larger feature, requires policy design
7. **Staged rollout** — future enhancement