# Agent Auto-Update System

**Priority**: P2 (New Feature)
**Source Reference**: From needs.md line 121
**Status**: Designed, Ready for Implementation

## Problem Statement

Currently, agent updates require manual intervention (re-running installation scripts). There is no automated mechanism for agents to self-update when new versions are available, which creates operational overhead when managing large fleets of agents.

## Feature Description

Implement an automated agent update system that allows agents to detect available updates, download new binaries, verify signatures, and perform self-updates, with rollback capabilities and staggered rollout support.

## Acceptance Criteria

1. Agents can detect when new versions are available via the server API
2. Agents can download signed binaries and verify their cryptographic signatures
3. The self-update process handles service restarts gracefully
4. Updates roll back if health checks fail after installation
5. Staggered rollout is supported (canary → wave → full deployment)
6. Version pinning prevents unauthorized downgrades
7. Update progress and status are visible in the web interface
8. Update failures are logged and reported to the server

## Technical Approach

### 1. Agent-Side Self-Update Handler

**New Command Handler** (`aggregator-agent/internal/commands/`):

```go
func (h *CommandHandler) handleSelfUpdate(cmd Command) error {
	// 1. Check current version vs target version
	// 2. Download new binary to temporary location
	// 3. Verify cryptographic signature
	// 4. Stop current service gracefully
	// 5. Replace binary
	// 6. Start updated service
	// 7. Perform health checks
	// 8. Rollback if health checks fail
}
```

**Update Stages**:
- `update_download` - Download new binary
- `update_verify` - Verify signature and integrity
- `update_install` - Install and restart
- `update_healthcheck` - Verify functionality
- `update_rollback` - Revert if needed

### 2. Server-Side Update Management

**Binary Signing** (`aggregator-server/internal/services/`):
- Implement SHA-256 hashing for all binary builds
- Optional GPG signature generation
- Signature storage and serving infrastructure

**Update Orchestration**:
- `GET /api/v1/agents/:id/updates/available` - Check for updates
- `POST /api/v1/agents/:id/update` - Trigger update command
- Update queue management with priority handling
- Staggered rollout configuration

**Rollout Strategy**:
- Phase 1: 5% canary deployment
- Phase 2: 25% wave (if the canary phase succeeds)
- Phase 3: 100% full deployment

### 3. Update Verification System

**Signature Verification**:

```go
func verifyBinarySignature(binaryPath, signaturePath, publicKey string) error {
	// Verify SHA-256 hash matches expected
	// Verify GPG signature if available
	// Check binary integrity and authenticity
}
```

**Health Check Integration**:
- Post-update health verification
- Service functionality testing
- Verification of communication with the server
- Automatic rollback threshold detection

### 4. Frontend Update Management

**Batch Update UI** (`aggregator-web/src/pages/`):
- Select multiple agents for updates
- Configure rollout strategy (immediate, staggered, manual approval)
- Monitor update progress in real time
- View update history and success/failure rates
- Roll back failed deployments

## Definition of Done

- ✅ `self_update` command handler implemented in agent
- ✅ Binary signature verification working
- ✅ Automated service restart and health checking
- ✅ Rollback mechanism functional
- ✅ Staggered rollout system operational
- ✅ Web UI for batch update management
- ✅ Update progress monitoring and reporting
- ✅ Comprehensive testing of failure scenarios

## Test Plan

1. **Unit Tests**
   - Binary download and signature verification
   - Service lifecycle management during updates
   - Health check validation
   - Rollback trigger conditions
2. **Integration Tests**
   - End-to-end update flow from detection to completion
   - Staggered rollout simulation
   - Failed update rollback scenarios
   - Version pinning and downgrade prevention
3. **Security Tests**
   - Signature verification with invalid signatures
   - Tampered binary rejection
   - Unauthorized update attempts
4. **Manual Tests**
   - Test update from v0.2.0 to v0.2.1 on real agents
   - Test rollback scenarios
   - Test batch update operations
   - Test staggered rollout phases

## Files to Modify

- `aggregator-agent/internal/commands/update.go` - Add self_update handler
- `aggregator-agent/internal/security/` - Signature verification logic
- `aggregator-agent/cmd/agent/main.go` - Update command registration
- `aggregator-server/internal/services/binary_signing.go` - New service
- `aggregator-server/internal/api/handlers/updates.go` - Update management API
- `aggregator-server/internal/services/update_orchestrator.go` - New service
- `aggregator-web/src/pages/AgentManagement.tsx` - Batch update UI
- `aggregator-web/src/components/UpdateProgress.tsx` - Progress monitoring

## Update Flow

1. **Detection**: Agent polls for updates via the existing heartbeat mechanism
2. **Queuing**: Server creates an update command with priority and rollout phase
3. **Download**: Agent downloads the binary to a temporary location
4. **Verification**: Cryptographic signature and integrity verification
5. **Installation**: Service stop, binary replacement, service start
6. **Validation**: Health checks and functionality verification
7. **Reporting**: Status update to server (success/failure/rollback)
8. **Monitoring**: Continuous health monitoring after the update

## Security Considerations

- Binary signature verification is mandatory for all updates
- Version pinning prevents unauthorized downgrades
- Update authorization is tied to agent registration tokens
- Audit trail for all update operations
- Isolated temporary directories for downloads

## Estimated Effort

- **Development**: 24-32 hours
- **Testing**: 16-20 hours
- **Review**: 8-12 hours
- **Security Review**: 4-6 hours

## Dependencies

- Existing command queue system
- Agent service management infrastructure
- Binary distribution system
- Agent registration and authentication

## Risk Assessment

**Medium Risk**: This is a complex modification to a core system and requires extensive testing and security review. Reliable rollback is critical for safety, and the staged rollout limits the impact of a bad release.

## Rollback Strategy

1. **Automatic Rollback**: Triggered by health check failures
2. **Manual Rollback**: Admin-initiated via the web interface
3. **Binary Backup**: Keep the previous version for rollback
4. **Configuration Backup**: Preserve agent configuration during updates
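
One possible way to implement the 5% / 25% / 100% phases from the Rollout Strategy is to hash each agent ID into a stable bucket from 0 to 99. An agent is then included once its bucket falls below the current phase's percentage. This sketch assumes the following:

- Agent IDs are strings.
- FNV-1a is used only because it is in the standard library and stable across runs. It is not a security control.
- `rolloutPercent`, `bucket`, and `inRollout` are hypothetical names.

```go
package main

import "hash/fnv"

// rolloutPercent maps each phase to the cumulative share of agents included.
var rolloutPercent = map[int]uint32{1: 5, 2: 25, 3: 100}

// bucket maps an agent ID to a stable value in [0, 100).
func bucket(agentID string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(agentID)) // writes to a hash.Hash never return an error
	return h.Sum32() % 100
}

// inRollout reports whether agentID receives the update in the given phase.
// Unknown phases include no agents.
func inRollout(agentID string, phase int) bool {
	pct, ok := rolloutPercent[phase]
	if !ok {
		return false
	}
	return bucket(agentID) < pct
}
```

Because each agent's bucket never changes, every canary agent is automatically part of the later waves. A failed canary therefore never leaves a mix of phases that were chosen at random.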
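
The downgrade-prevention rule under Security Considerations can be enforced on the agent by refusing any target version that is not strictly newer than the running one. This is a minimal sketch. It assumes versions use the `vMAJOR.MINOR.PATCH` form seen in the manual test plan (for example, `v0.2.0` to `v0.2.1`). Pre-release suffixes are rejected rather than ordered, and `parseVersion` and `checkUpgrade` are hypothetical names.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion parses "vMAJOR.MINOR.PATCH" into three non-negative integers.
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("invalid version %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return out, fmt.Errorf("invalid version %q", v)
		}
		out[i] = n
	}
	return out, nil
}

// checkUpgrade returns nil only if target is strictly newer than current,
// comparing major, then minor, then patch numerically.
func checkUpgrade(current, target string) error {
	cur, err := parseVersion(current)
	if err != nil {
		return err
	}
	tgt, err := parseVersion(target)
	if err != nil {
		return err
	}
	for i := 0; i < 3; i++ {
		if tgt[i] > cur[i] {
			return nil
		}
		if tgt[i] < cur[i] {
			return fmt.Errorf("refusing downgrade from %s to %s", current, target)
		}
	}
	return fmt.Errorf("%s is already installed", target)
}
```

The numeric comparison matters: a plain string comparison would wrongly order `v0.10.0` before `v0.9.9`. An admin-initiated manual rollback would need an explicit, authorized override of this check.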