# Agent Upgrade System Audit

**Date:** 2026-03-29
**Branch:** culurien
**Status:** Audit only, no changes

---

## 1. WHAT ALREADY EXISTS

### 1a. POST /build/upgrade/:agentID Handler

**Route:** `cmd/server/main.go:422`
**Handler:** `handlers/build_orchestrator.go:95-191`

**Status: Partially functional: config generator, not an upgrade orchestrator.**

The handler generates a fresh config JSON and returns a download URL for a pre-built binary. It does NOT:
- Verify the agent exists in the DB
- Create any DB record for the upgrade event
- Queue a `CommandTypeUpdateAgent` command
- Push or deliver anything to the agent
- Implement `PreserveExisting` (lines 142-146 are a TODO stub)

The response contains manual `next_steps` instructions telling a human to stop the service, download, and restart.

### 1b. services/build_orchestrator.go: BuildAndSignAgent

**File:** `services/build_orchestrator.go:32-96`

`BuildAndSignAgent(version, platform, architecture)`:
1. Locates pre-built binary at `{agentDir}/binaries/{platform}/redflag-agent[.exe]`
2. Signs with Ed25519 via `signingService.SignFile()`
3. Stores in DB via `packageQueries.StoreSignedPackage()`
4. Returns `AgentUpdatePackage`

**Critical disconnect:** This service is NOT called by the HTTP upgrade handler. The handler uses `AgentBuilder.BuildAgentWithConfig` (config-only). `BuildAndSignAgent` is orphaned from the HTTP flow.
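
The sign-and-checksum step can be illustrated with the standard library alone. This is a minimal sketch, not the project's `SignFile()` implementation: `signedPackage`, `newKeyPair`, `signBinary`, and `verifyPackage` are hypothetical names, and it assumes the Ed25519 signature covers the raw binary bytes rather than a digest.

```go
package main

import (
	"crypto/ed25519"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// signedPackage is a hypothetical stand-in for AgentUpdatePackage.
type signedPackage struct {
	Version   string
	Platform  string
	Checksum  string // SHA-256 of the binary, hex (64 chars)
	Signature string // Ed25519 signature, hex (128 chars)
	FileSize  int64
}

// newKeyPair generates a throwaway Ed25519 key pair for the demo.
func newKeyPair() (ed25519.PublicKey, ed25519.PrivateKey) {
	pub, priv, err := ed25519.GenerateKey(nil) // nil selects crypto/rand
	if err != nil {
		panic(err)
	}
	return pub, priv
}

// signBinary hashes and signs the raw binary bytes.
// Assumption: the signature covers the whole file, not its hash.
func signBinary(priv ed25519.PrivateKey, version, platform string, binary []byte) signedPackage {
	sum := sha256.Sum256(binary)
	sig := ed25519.Sign(priv, binary)
	return signedPackage{
		Version:   version,
		Platform:  platform,
		Checksum:  hex.EncodeToString(sum[:]),
		Signature: hex.EncodeToString(sig),
		FileSize:  int64(len(binary)),
	}
}

// verifyPackage checks the checksum first, then the signature.
func verifyPackage(pub ed25519.PublicKey, binary []byte, pkg signedPackage) bool {
	sum := sha256.Sum256(binary)
	if hex.EncodeToString(sum[:]) != pkg.Checksum {
		return false
	}
	sig, err := hex.DecodeString(pkg.Signature)
	if err != nil {
		return false
	}
	return ed25519.Verify(pub, binary, sig)
}

func main() {
	pub, priv := newKeyPair()
	binary := []byte("pretend this is redflag-agent")
	pkg := signBinary(priv, "0.1.26.0", "linux-amd64", binary)
	fmt.Println("checksum chars:", len(pkg.Checksum), "signature chars:", len(pkg.Signature))
	fmt.Println("verifies:", verifyPackage(pub, binary, pkg))
}
```

The hex lengths produced here (64 for SHA-256, 128 for Ed25519) line up with the `checksum VARCHAR(64)` and `signature VARCHAR(128)` columns of `agent_update_packages`.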

### 1c. agent_update_packages Table (Migration 016)

**File:** `migrations/016_agent_update_packages.up.sql`

| Column | Type | Notes |
|--------|------|-------|
| `id` | UUID PK | `gen_random_uuid()` |
| `version` | VARCHAR(50) | NOT NULL |
| `platform` | VARCHAR(50) | e.g. `linux-amd64` |
| `architecture` | VARCHAR(20) | NOT NULL |
| `binary_path` | VARCHAR(500) | NOT NULL |
| `signature` | VARCHAR(128) | Ed25519 hex |
| `checksum` | VARCHAR(64) | SHA-256 |
| `file_size` | BIGINT | NOT NULL |
| `created_at` | TIMESTAMP | default now |
| `created_by` | VARCHAR(100) | default `'system'` |
| `is_active` | BOOLEAN | default `true` |

Migration 016 also adds to `agents` table:
- `is_updating BOOLEAN DEFAULT false`
- `updating_to_version VARCHAR(50)`
- `update_initiated_at TIMESTAMP`

### 1d. NewAgentBuild vs UpgradeAgentBuild

| Aspect | NewAgentBuild | UpgradeAgentBuild |
|--------|--------------|-------------------|
| Registration token | Required | Not needed |
| consumes_seat | true | false |
| Agent ID source | Generated or from request | From URL param |
| PreserveExisting | N/A | TODO stub |
| DB interaction | None | None |
| Command queued | No | No |

Both are config generators that return download URLs. Neither triggers actual delivery.

### 1e. Agent-Side Upgrade Code

**A full self-update pipeline EXISTS in the agent.**

**Handler:** `cmd/agent/subsystem_handlers.go:575-762` (`handleUpdateAgent`)

**7-step pipeline:**

| Step | Line | What |
|------|------|------|
| 1 | 661 | `downloadUpdatePackage()`: HTTP GET to temp file |
| 2 | 669 | SHA-256 checksum verification against `params["checksum"]` |
| 3 | 681 | Ed25519 binary signature verification via cached server public key |
| 4 | 687 | Backup current binary to `<binary>.bak` |
| 5 | 719 | Atomic install: write `.new`, chmod, `os.Rename` |
| 6 | 724 | `restartAgentService()`: `systemctl restart` (Linux) or `sc stop/start` (Windows) |
| 7 | 731 | Watchdog: polls `GetAgent()` every 15s for 5 min, checks version |

**Rollback:** Deferred block (lines 700-715) restores from `.bak` if `updateSuccess == false`.
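
The backup, atomic install, and deferred rollback (steps 4 to 7) follow a common Go pattern. The sketch below illustrates that pattern only and is not the agent's code: `installWithRollback` and `runDemo` are hypothetical names, and the `healthCheck` callback is an assumed stand-in for the restart and watchdog steps.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// installWithRollback replaces binPath with newData and undoes the change
// if the install or the health check fails. healthCheck is a hypothetical
// hook standing in for the restart-plus-watchdog steps.
func installWithRollback(binPath string, newData []byte, healthCheck func() error) (err error) {
	bakPath := binPath + ".bak"
	old, err := os.ReadFile(binPath)
	if err != nil {
		return fmt.Errorf("read current binary: %w", err)
	}
	if err = os.WriteFile(bakPath, old, 0o755); err != nil {
		return fmt.Errorf("create backup: %w", err)
	}

	updateSuccess := false
	defer func() {
		if updateSuccess {
			os.Remove(bakPath) // success: the backup is no longer needed
			return
		}
		// Failure: move the backup back into place.
		if rbErr := os.Rename(bakPath, binPath); rbErr != nil {
			err = errors.Join(err, fmt.Errorf("rollback: %w", rbErr))
		}
	}()

	// Write next to the target so the rename stays on one filesystem.
	newPath := binPath + ".new"
	if err = os.WriteFile(newPath, newData, 0o755); err != nil {
		return fmt.Errorf("write new binary: %w", err)
	}
	if err = os.Rename(newPath, binPath); err != nil {
		return fmt.Errorf("install: %w", err)
	}
	if err = healthCheck(); err != nil {
		return fmt.Errorf("health check: %w", err)
	}
	updateSuccess = true
	return nil
}

// runDemo installs "new" over "old" in a temp dir and returns the final content.
func runDemo(healthy bool) (string, error) {
	dir, err := os.MkdirTemp("", "upgrade-demo")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)

	bin := filepath.Join(dir, "redflag-agent")
	if err := os.WriteFile(bin, []byte("old"), 0o755); err != nil {
		return "", err
	}
	check := func() error {
		if healthy {
			return nil
		}
		return errors.New("watchdog timeout")
	}
	installErr := installWithRollback(bin, []byte("new"), check)
	if healthy != (installErr == nil) {
		return "", fmt.Errorf("unexpected install result: %v", installErr)
	}
	data, err := os.ReadFile(bin)
	return string(data), err
}

func main() {
	for _, healthy := range []bool{true, false} {
		content, err := runDemo(healthy)
		fmt.Printf("healthy=%v content=%q err=%v\n", healthy, content, err)
	}
}
```

Using a named return value lets the deferred block attach a rollback error to whatever failure triggered it, so neither error is lost.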

### 1f. Command Type for Self-Upgrade

**YES: `CommandTypeUpdateAgent = "update_agent"` exists.**

Defined in `models/command.go:103`. Dispatched in `cmd/agent/main.go:1064`:
```go
case "update_agent":
	handleUpdateAgent(apiClient, cmd, cfg)
```

Full command type list:
- `collect_specs`, `install_updates`, `dry_run_update`, `confirm_dependencies`
- `rollback_update`, `update_agent`, `enable_heartbeat`, `disable_heartbeat`, `reboot`

---

## 2. AGENT SELF-REPLACEMENT MECHANISM

### 2a. Existing Binary Replacement Code: EXISTS

All steps exist in `subsystem_handlers.go`:
- Download to temp: `downloadUpdatePackage()` (line 661/774)
- Ed25519 verification: `verifyBinarySignature()` (line 681)
- Checksum verification: SHA-256 (line 669)
- Atomic replace: write `.new` + `os.Rename` (line 878)
- Service restart: `restartAgentService()` (line 724/888)

### 2b. Linux Restart: EXISTS

`restartAgentService()` at line 888:
1. Try `systemctl restart redflag-agent` (line 892)
2. Fallback: `service redflag-agent restart` (line 898)

The agent knows its service name as hardcoded `"redflag-agent"`.

### 2c. Windows Restart: EXISTS (with gap)

Lines 901-903: `sc stop RedFlagAgent` then `sc start RedFlagAgent` as separate commands.
**Gap:** The result of `sc stop` is discarded, so a failed stop goes undetected before `sc start` runs. The running `.exe` is replaced via `os.Rename`, which works on Windows if the service has stopped.
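
Closing this gap amounts to checking the error from the stop command before issuing the start. The following is a minimal sketch under stated assumptions: `runFunc`, `execRun`, `restartWindowsService`, and `errFake` are hypothetical names, and the command runner is injected so the logic can be exercised on any OS.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// runFunc runs one external command; it is injectable so the logic can be
// exercised without a Windows service manager.
type runFunc func(name string, args ...string) error

// execRun is the real runner: it executes the command and returns its error.
func execRun(name string, args ...string) error {
	return exec.Command(name, args...).Run()
}

// errFake simulates a failing command in the demo.
var errFake = errors.New("simulated failure")

// restartWindowsService stops and then starts the service, aborting if the
// stop fails instead of discarding its result.
func restartWindowsService(run runFunc, service string) error {
	if err := run("sc", "stop", service); err != nil {
		return fmt.Errorf("sc stop %s: %w", service, err)
	}
	if err := run("sc", "start", service); err != nil {
		return fmt.Errorf("sc start %s: %w", service, err)
	}
	return nil
}

func main() {
	failingStop := func(name string, args ...string) error {
		if args[0] == "stop" {
			return errFake
		}
		return nil
	}
	fmt.Println(restartWindowsService(failingStop, "RedFlagAgent"))
	_ = execRun // on Windows: restartWindowsService(execRun, "RedFlagAgent")
}
```

A real implementation may also want to tolerate an already-stopped service and to wait until the service has fully stopped before starting it again; both are left out of this sketch.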

### 2d. Acknowledgment: EXISTS

`acknowledgment.Tracker` package is used:
- `reportLogWithAck(commandID)` called at upgrade start (line 651) and completion (line 751)
- The tracker persists pending acks and retries with `IncrementRetry()`

---

## 3. SERVER-SIDE UPGRADE ORCHESTRATION

### 3a. Command Types: EXISTS

Full list in `models/command.go:97-107`. Includes `"update_agent"`.

### 3b. update_agent Command Params

The agent handler at `subsystem_handlers.go:575` expects these params:
- `download_url`: URL to download the new binary
- `checksum`: SHA-256 hex string
- `signature`: Ed25519 hex signature of the binary
- `version`: Expected version string after upgrade
- `nonce`: Replay protection nonce (uuid:timestamp format)
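
As an illustration of how these five params might be validated before use, here is a minimal sketch. `updateParams`, `parseUpdateParams`, and `checkNonce` are hypothetical names; the sketch also assumes the nonce timestamp is Unix seconds and that nonces expire after a fixed window, neither of which the audit confirms.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// updateParams is a hypothetical typed view of the update_agent params.
type updateParams struct {
	DownloadURL string
	Checksum    string
	Signature   string
	Version     string
	Nonce       string
}

// parseUpdateParams requires every param to be a non-empty string.
func parseUpdateParams(params map[string]interface{}) (updateParams, error) {
	get := func(key string) (string, error) {
		s, ok := params[key].(string)
		if !ok || s == "" {
			return "", fmt.Errorf("update_agent: missing or non-string param %q", key)
		}
		return s, nil
	}
	var p updateParams
	var err error
	fields := []struct {
		key string
		dst *string
	}{
		{"download_url", &p.DownloadURL},
		{"checksum", &p.Checksum},
		{"signature", &p.Signature},
		{"version", &p.Version},
		{"nonce", &p.Nonce},
	}
	for _, f := range fields {
		if *f.dst, err = get(f.key); err != nil {
			return updateParams{}, err
		}
	}
	return p, nil
}

// checkNonce parses "uuid:timestamp" and rejects nonces outside the window.
// Assumption: the timestamp is Unix seconds.
func checkNonce(nonce string, nowUnix, maxAgeSeconds int64) error {
	id, ts, found := strings.Cut(nonce, ":")
	if !found || id == "" {
		return errors.New("nonce: expected uuid:timestamp")
	}
	issued, err := strconv.ParseInt(ts, 10, 64)
	if err != nil {
		return fmt.Errorf("nonce: bad timestamp: %w", err)
	}
	age := nowUnix - issued
	if age < 0 || age > maxAgeSeconds {
		return fmt.Errorf("nonce: age %ds outside 0..%ds", age, maxAgeSeconds)
	}
	return nil
}

func main() {
	p, err := parseUpdateParams(map[string]interface{}{
		"download_url": "https://example.invalid/agent", // placeholder URL
		"checksum":     "00",
		"signature":    "00",
		"version":      "0.1.26.0",
		"nonce":        "example-id:1000",
	})
	fmt.Println(p.Version, err)
	fmt.Println(checkNonce(p.Nonce, 1100, 300))
}
```

Parsing into a struct up front means a malformed command fails before any download starts.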

### 3c. Agent Command Handling: EXISTS

Dispatched in `main.go:1064` to `handleUpdateAgent()`. Full pipeline as described in section 1e.

### 3d. Agent Version Tracking: EXISTS

- `agents` table has `current_version` column
- Agent reports version on every check-in via `AgentVersion: version.Version` in the heartbeat/check-in payload
- `is_updating`, `updating_to_version`, `update_initiated_at` columns exist for tracking in-progress upgrades

### 3e. Expected Agent Version: PARTIAL

- `config.LatestAgentVersion` field exists in Config struct
- `version.MinAgentVersion` is build-time injected
- **BUT:** The `/api/v1/info` endpoint returns hardcoded `"v0.1.21"` instead of using `version.GetCurrentVersions()`, so agents and the dashboard cannot reliably detect the current expected version.
- `version.ValidateAgentVersion()` uses lexicographic string comparison. This is a bug: `"0.1.9" > "0.1.22"` is true in lexicographic order, although 0.1.9 is the older version.
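
One way to avoid the lexicographic pitfall is to compare the dotted version octet by octet as integers. The sketch below is an illustration, not the project's `ValidateAgentVersion()`: `parseVersion` and `compareVersions` are hypothetical names, and treating missing trailing octets as 0 and stripping a leading `v` are assumptions.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits "0.1.26.0" (optionally prefixed with "v") into integers.
func parseVersion(v string) ([]int, error) {
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	octets := make([]int, len(parts))
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return nil, fmt.Errorf("invalid version %q", v)
		}
		octets[i] = n
	}
	return octets, nil
}

// compareVersions returns -1, 0, or 1. Missing trailing octets count as 0,
// so "0.1.26" and "0.1.26.0" compare equal (an assumption of this sketch).
func compareVersions(a, b string) (int, error) {
	pa, err := parseVersion(a)
	if err != nil {
		return 0, err
	}
	pb, err := parseVersion(b)
	if err != nil {
		return 0, err
	}
	for i := 0; i < len(pa) || i < len(pb); i++ {
		var x, y int
		if i < len(pa) {
			x = pa[i]
		}
		if i < len(pb) {
			y = pb[i]
		}
		if x != y {
			if x < y {
				return -1, nil
			}
			return 1, nil
		}
	}
	return 0, nil
}

func main() {
	fmt.Println("lexicographic:", "0.1.9" > "0.1.22")
	c, err := compareVersions("0.1.9", "0.1.22")
	fmt.Println("octet-based:", c, err)
	_, err = compareVersions("dev", "0.1.22")
	fmt.Println("dev build:", err)
}
```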

---

## 4. VERSION COMPARISON

### 4a. Agent Reports Version: YES

Via `version.Version` (build-time injected, default `"dev"`). Sent on:
- Registration (line 384/443)
- Token renewal (line 506)
- System info collection (line 373)

### 4b. Version String Format

Production: `0.1.26.0` (four-octet semver-like). The 4th octet = config version.
Dev: `"dev"`.
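
Because the config version is the whole 4th octet, it is safer to parse that octet as a number than to read a single trailing character. A minimal sketch, where `configVersionOf` is a hypothetical helper and not the project's `ExtractConfigVersionFromAgent`:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// configVersionOf returns the 4th octet of a four-octet agent version.
// Hypothetical helper: it parses the whole octet, so two-digit config
// versions such as 12 are handled.
func configVersionOf(agentVersion string) (int, error) {
	parts := strings.Split(strings.TrimPrefix(agentVersion, "v"), ".")
	if len(parts) != 4 {
		return 0, fmt.Errorf("expected four octets, got %q", agentVersion)
	}
	n, err := strconv.Atoi(parts[3])
	if err != nil || n < 0 {
		return 0, fmt.Errorf("invalid config version in %q", agentVersion)
	}
	return n, nil
}

func main() {
	for _, v := range []string{"0.1.26.0", "0.1.26.12", "dev"} {
		n, err := configVersionOf(v)
		fmt.Println(v, n, err)
	}
}
```

Returning an error for `"dev"` leaves the caller to decide how development builds are treated.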

### 4c. Server Expected Version: PARTIAL

`config.LatestAgentVersion` and `version.MinAgentVersion` exist but are not reliably surfaced:
- `/api/v1/info` hardcodes `"v0.1.21"`
- No endpoint returns `latest_agent_version` dynamically

### 4d. /api/v1/info Response: BROKEN

`system.go:111-124` returns hardcoded JSON:
```json
{
  "version": "v0.1.21",
  "name": "RedFlag Aggregator",
  "features": [...]
}
```
Does NOT use `version.GetCurrentVersions()`. Does NOT include `latest_agent_version` or `min_agent_version`.
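
A handler that reports versions from runtime values rather than a literal could look roughly like the sketch below. It uses plain `net/http` and does not assume the server's actual web framework. `infoResponse`, `buildInfo`, `renderInfo`, and `infoHandler` are hypothetical names, the version strings in `main` are placeholders, and the signature of `version.GetCurrentVersions()` is not known here, so the values are passed in as plain strings.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// infoResponse is a hypothetical response shape for /api/v1/info.
type infoResponse struct {
	Version            string   `json:"version"`
	Name               string   `json:"name"`
	Features           []string `json:"features"`
	LatestAgentVersion string   `json:"latest_agent_version"`
	MinAgentVersion    string   `json:"min_agent_version"`
}

// buildInfo assembles the response from values supplied at runtime.
func buildInfo(serverVersion, latestAgent, minAgent string, features []string) infoResponse {
	if features == nil {
		features = []string{} // encode as [] rather than null
	}
	return infoResponse{
		Version:            serverVersion,
		Name:               "RedFlag Aggregator",
		Features:           features,
		LatestAgentVersion: latestAgent,
		MinAgentVersion:    minAgent,
	}
}

// renderInfo encodes the response as compact JSON.
func renderInfo(r infoResponse) (string, error) {
	b, err := json.Marshal(r)
	return string(b), err
}

// infoHandler calls current on every request, so the reported versions
// always reflect the running build and config.
func infoHandler(current func() infoResponse) http.HandlerFunc {
	return func(w http.ResponseWriter, _ *http.Request) {
		body, err := renderInfo(current())
		if err != nil {
			http.Error(w, "encoding failed", http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		fmt.Fprint(w, body)
	}
}

func main() {
	h := infoHandler(func() infoResponse {
		// Placeholder values; a real server would read these from its
		// version package and config.
		return buildInfo("0.1.26", "0.1.26.0", "0.1.22", nil)
	})
	rec := httptest.NewRecorder()
	h(rec, httptest.NewRequest(http.MethodGet, "/api/v1/info", nil))
	fmt.Println(rec.Code, rec.Body.String())
}
```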

---

## 5. ROLLBACK MECHANISM

### 5a. Rollback: EXISTS

Deferred rollback in `subsystem_handlers.go:700-715`:
- Before install: backup to `<binary>.bak`
- On any failure (including watchdog timeout): `restoreFromBackup()` restores the `.bak` file
- On success: `.bak` file is removed

### 5b. Backup Logic: EXISTS

`createBackup()` copies current binary to `<path>.bak` before replacement.

### 5c. Health Check: EXISTS

Watchdog (line 919-940) polls `GetAgent()` every 15s for 5 min. Success = `agent.CurrentVersion == expectedVersion`. Failure = timeout → rollback.
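
The poll-until-match-or-timeout behaviour can be sketched with a ticker and a deadline. This is an illustration only: `pollVersion` is a hypothetical name, the `get` callback is an assumed stand-in for `GetAgent()`, and treating errors from it as transient is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// pollVersion calls get every interval until it returns expected or timeout
// elapses. Errors from get are treated as transient, since the server may be
// briefly unreachable while the service restarts (an assumption).
func pollVersion(get func() (string, error), expected string, interval, timeout time.Duration) error {
	deadline := time.After(timeout)
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-deadline:
			return fmt.Errorf("watchdog: version %s not reported within %s", expected, timeout)
		case <-ticker.C:
			if v, err := get(); err == nil && v == expected {
				return nil
			}
		}
	}
}

func main() {
	polls := 0
	// fakeGet stands in for GetAgent(); it reports the new version on the third poll.
	fakeGet := func() (string, error) {
		polls++
		if polls < 3 {
			return "0.1.25.0", nil
		}
		return "0.1.26.0", nil
	}
	// The audit describes 15s and 5 min; short values keep the demo quick.
	err := pollVersion(fakeGet, "0.1.26.0", 10*time.Millisecond, time.Second)
	fmt.Println("polls:", polls, "err:", err)
}
```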

---

## 6. DASHBOARD UPGRADE UI

### 6a. Upgrade Button: EXISTS

Multiple entry points in `Agents.tsx`:
- Version column "Update" badge (line 1281-1294) when `agent.update_available === true`
- Per-row action button (line 1338-1348)
- Bulk action bar for selected agents (line 1112-1131)

These open `AgentUpdatesModal.tsx` which:
- Fetches available upgrade packages
- Single agent: generates nonce → calls `POST /agents/{id}/update`
- Multiple agents: calls `POST /agents/bulk-update`

### 6b. Target Version UI: PARTIAL

`AgentUpdatesModal.tsx` shows a package selection grid with version/platform filters. No global "set target version" control.

### 6c. Bulk Upgrade: EXISTS (with bugs)

Two bulk paths:
1. `AgentUpdatesModal` bulk path: no nonces generated (security gap)
2. `BulkAgentUpdate` in `RelayList.tsx`: **platform hardcoded to `linux-amd64`** for all agents (line 91). Mixed-OS fleets get wrong binaries.

---

## 7. COMPLETENESS MATRIX

| Component | Status | Notes |
|-----------|--------|-------|
| `update_agent` command type | EXISTS | `models/command.go:103` |
| Agent handles upgrade command | EXISTS | `subsystem_handlers.go:575-762`, full 7-step pipeline |
| Safe binary replacement (Linux) | EXISTS | Atomic rename + systemctl restart |
| Safe binary replacement (Windows) | EXISTS | Atomic rename + sc stop/start (no error check on stop) |
| Ed25519 signature verification | EXISTS | `verifyBinarySignature()` against cached server key |
| Checksum verification | EXISTS | SHA-256 in agent handler; server serves `X-Content-SHA256` header |
| Rollback on failure | EXISTS | Deferred `.bak` restore on any failure including watchdog timeout |
| Server triggers upgrade command | PARTIAL | `POST /agents/{id}/update` endpoint exists (called by UI), but the `/build/upgrade` endpoint is disconnected |
| Server tracks expected version | PARTIAL | DB columns exist; `/api/v1/info` version is hardcoded to `v0.1.21` |
| Dashboard upgrade UI | EXISTS | Single + bulk upgrade via `AgentUpdatesModal` |
| Bulk upgrade UI | EXISTS (buggy) | Platform hardcoded to `linux-amd64`; no nonces in modal bulk path |
| Acknowledgment/delivery tracking | EXISTS | `acknowledgment.Tracker` with retry |
| Version comparison | PARTIAL | Lexicographic comparison is buggy for multi-digit versions |

---

## 8. EFFORT ESTIMATE

### 8a. Exists and Just Needs Wiring

1. **`/api/v1/info` version fix**: Replace hardcoded `"v0.1.21"` with `version.GetCurrentVersions()`. Add `latest_agent_version` and `min_agent_version` to the response. (~10 lines)

2. **`BuildAndSignAgent` connection**: The signing/packaging service exists but isn't called by the upgrade HTTP handler. Wire it to create a signed package when an admin triggers an upgrade. (~20 lines)

3. **Bulk upgrade platform detection**: `RelayList.tsx` line 91 hardcodes `linux-amd64`. Fix to use each agent's actual `os_type + os_architecture`. (~5 lines)

4. **Bulk nonce generation**: `AgentUpdatesModal` bulk path skips nonces. Align with single-agent path. (~15 lines)

### 8b. Needs Building from Scratch

1. **Semver-aware version comparison**: Replace lexicographic comparison in `version.ValidateAgentVersion()` with proper semver parsing. (~30 lines)

2. **Auto-upgrade trigger**: Server-side logic: when agent checks in with version < `LatestAgentVersion`, automatically queue an `update_agent` command. Requires policy controls (opt-in/opt-out per agent, maintenance windows). (~100-200 lines)

3. **Staged rollout**: Upgrade N% of agents first, monitor for failures, then proceed. (~200-300 lines)

### 8c. Minimum Viable Upgrade System (already working)

The MVP already works end-to-end:
1. Admin clicks "Update" in dashboard → `POST /agents/{id}/update`
2. Server creates `update_agent` command with download URL, checksum, signature
3. Agent polls, receives command, downloads new binary
4. Agent verifies checksum+signature, backs up old, atomic replace, restarts
5. Watchdog confirms new version running, rollback if not

**The critical gap is `/api/v1/info` returning a stale version.** Everything else in this single-agent path functions.

### 8d. Full Production Upgrade System Would Add

1. Auto-upgrade policy engine (version-based triggers)
2. Staged rollout with configurable percentages
3. Maintenance window scheduling
4. Cross-platform bulk upgrade fix (the `linux-amd64` hardcode)
5. Upgrade history dashboard (who upgraded when, rollbacks)
6. Semver comparison throughout
7. Download progress reporting (large binaries on slow links)

---

## FINDINGS TABLE

| ID | Platform | Severity | Finding | Location |
|----|----------|----------|---------|----------|
| U-1 | All | HIGH | `/api/v1/info` returns hardcoded `"v0.1.21"`; agents/dashboard cannot detect current expected version | `system.go:111-124` |
| U-2 | All | HIGH | `ValidateAgentVersion` uses lexicographic comparison, so `"0.1.9" > "0.1.22"` incorrectly evaluates to true | `version/versions.go:72` |
| U-3 | Windows | MEDIUM | Bulk upgrade platform hardcoded to `linux-amd64`; Windows agents get wrong binary | `RelayList.tsx:91` |
| U-4 | All | MEDIUM | Bulk upgrade in `AgentUpdatesModal` skips nonce generation, which weakens replay protection | `AgentUpdatesModal.tsx:93-99` |
| U-5 | All | MEDIUM | `BuildAndSignAgent` service is disconnected from HTTP upgrade handler | `build_orchestrator.go` |
| U-6 | All | MEDIUM | `POST /build/upgrade/:agentID` is a config generator, not an upgrade orchestrator | `handlers/build_orchestrator.go:95-191` |
| U-7 | Windows | LOW | `sc stop` result not checked in `restartAgentService()` | `subsystem_handlers.go:901` |
| U-8 | All | LOW | `downloadUpdatePackage` uses plain `http.Get`, with no timeout and no size limit | `subsystem_handlers.go:774` |
| U-9 | All | LOW | `PreserveExisting` is a TODO stub in upgrade handler | `handlers/build_orchestrator.go:142-146` |
| U-10 | All | INFO | `ExtractConfigVersionFromAgent` is fragile: last-char extraction breaks once the config-version octet reaches two digits (x.y.z.10+) | `version/versions.go:59-62` |
| U-11 | All | INFO | `AgentUpdate.tsx` component exists but is not imported by any page | `AgentUpdate.tsx` |
| U-12 | All | INFO | `build_orchestrator.go` services layer marked `// Deprecated` | `services/build_orchestrator.go` |
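
For U-8, a bounded download typically combines a client-level timeout with a cap on bytes read. The sketch below shows that general technique and is not a patch for `downloadUpdatePackage`: `downloadLimited` and `demoDownload` are hypothetical names, and the 5-second timeout and byte limits are arbitrary demo values.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// downloadLimited copies the response body into dst and fails if the body is
// larger than maxBytes. On an oversize error dst has received maxBytes+1
// bytes and should be discarded by the caller.
func downloadLimited(client *http.Client, url string, dst io.Writer, maxBytes int64) (int64, error) {
	resp, err := client.Get(url)
	if err != nil {
		return 0, fmt.Errorf("download: %w", err)
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return 0, fmt.Errorf("download: unexpected status %s", resp.Status)
	}
	// Read one byte past the limit so an oversized body is detectable.
	n, err := io.Copy(dst, io.LimitReader(resp.Body, maxBytes+1))
	if err != nil {
		return n, fmt.Errorf("download: %w", err)
	}
	if n > maxBytes {
		return n, fmt.Errorf("download: body exceeds %d bytes", maxBytes)
	}
	return n, nil
}

// demoDownload serves bodySize bytes from a local test server and fetches them.
func demoDownload(bodySize int, maxBytes int64) (int64, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		io.WriteString(w, strings.Repeat("x", bodySize))
	}))
	defer srv.Close()
	// Client.Timeout bounds the whole request, including reading the body.
	client := &http.Client{Timeout: 5 * time.Second}
	return downloadLimited(client, srv.URL, io.Discard, maxBytes)
}

func main() {
	n, err := demoDownload(10, 100)
	fmt.Println("within limit:", n, err)
	n, err = demoDownload(200, 100)
	fmt.Println("over limit:", n, err)
}
```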

---

## RECOMMENDED BUILD ORDER

1. **Fix `/api/v1/info`** (U-1): immediate, ~10 lines, unblocks version detection
2. **Fix bulk platform hardcode** (U-3): immediate, ~5 lines, prevents wrong-platform delivery
3. **Fix semver comparison** (U-2): immediate, ~30 lines, prevents version logic bugs
4. **Fix bulk nonce generation** (U-4): quick, ~15 lines, security consistency
5. **Wire `BuildAndSignAgent` to upgrade flow** (U-5): medium, connects existing code
6. **Auto-upgrade trigger**: larger feature, requires policy design
7. **Staged rollout**: future enhancement