Agent Auto-Update System
Priority: P2 (New Feature)
Source Reference: From needs.md line 121
Status: Designed, Ready for Implementation
Problem Statement
Currently, agent updates require manual intervention (re-running installation scripts). There is no automated mechanism for agents to self-update when new versions are available, creating operational overhead for managing large fleets of agents.
Feature Description
Implement an automated agent update system that allows agents to detect available updates, download new binaries, verify signatures, and perform self-updates with proper rollback capabilities and staggered rollout support.
Acceptance Criteria
- Agents can detect when new versions are available via server API
- Agents can download signed binaries and verify cryptographic signatures
- Self-update process handles service restarts gracefully
- Rollback capability if health checks fail after update
- Staggered rollout support (canary → wave → full deployment)
- Version pinning to prevent unauthorized downgrades
- Update progress and status visible in web interface
- Update failures are properly logged and reported
Technical Approach
1. Agent-Side Self-Update Handler
New Command Handler (aggregator-agent/internal/commands/):
func (h *CommandHandler) handleSelfUpdate(cmd Command) error {
	// 1. Check current version vs target version
	// 2. Download new binary to temporary location
	// 3. Verify cryptographic signature
	// 4. Stop current service gracefully
	// 5. Replace binary
	// 6. Start updated service
	// 7. Perform health checks
	// 8. Rollback if health checks fail
	return nil // placeholder until the steps above are implemented
}
Update Stages:
- update_download - Download new binary
- update_verify - Verify signature and integrity
- update_install - Install and restart
- update_healthcheck - Verify functionality
- update_rollback - Revert if needed
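The stages above could be driven by a small runner. The following is a minimal sketch, not the agent's actual implementation: the `step` type, `runUpdate`, and `errSimulated` are hypothetical names, and it assumes rollback is only needed once the install stage has started.

```go
package main

import (
	"errors"
	"fmt"
)

// Stage names mirror the update stages listed above.
type Stage string

const (
	StageDownload    Stage = "update_download"
	StageVerify      Stage = "update_verify"
	StageInstall     Stage = "update_install"
	StageHealthcheck Stage = "update_healthcheck"
	StageRollback    Stage = "update_rollback"
)

// errSimulated is a stand-in failure used only for the demo.
var errSimulated = errors.New("simulated failure")

// step pairs a stage with the function that performs it (hypothetical type).
type step struct {
	stage Stage
	run   func() error
}

// runUpdate executes the steps in order and returns the last stage reached.
// A failure before the install stage aborts without rollback; a failure at or
// after the install stage calls rollback and reports StageRollback.
func runUpdate(steps []step, rollback func() error) (Stage, error) {
	var last Stage
	installStarted := false
	for _, s := range steps {
		last = s.stage
		if s.stage == StageInstall {
			installStarted = true
		}
		err := s.run()
		if err == nil {
			continue
		}
		if !installStarted {
			return last, fmt.Errorf("%s failed: %w", s.stage, err)
		}
		if rbErr := rollback(); rbErr != nil {
			return StageRollback, fmt.Errorf("%s failed: %v; rollback failed: %v", s.stage, err, rbErr)
		}
		return StageRollback, fmt.Errorf("%s failed, rolled back: %w", s.stage, err)
	}
	return last, nil
}

func main() {
	ok := func() error { return nil }
	unhealthy := func() error { return errSimulated }
	steps := []step{
		{StageDownload, ok},
		{StageVerify, ok},
		{StageInstall, ok},
		{StageHealthcheck, unhealthy},
	}
	stage, err := runUpdate(steps, ok)
	fmt.Println(stage, err)
}
```

Returning the last stage alongside the error gives the agent a single value to report back to the server as update status.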
2. Server-Side Update Management
Binary Signing (aggregator-server/internal/services/):
- Implement SHA-256 hashing for all binary builds
- Optional GPG signature generation
- Signature storage and serving infrastructure
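As an illustration of the hashing step, the sketch below streams a build artifact through SHA-256 so that large binaries are not loaded into memory. The function names `hashStream` and `hashFile` are assumptions for this example, not existing server code.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"strings"
)

// hashStream returns the hex-encoded SHA-256 digest of everything read from r.
func hashStream(r io.Reader) (string, error) {
	h := sha256.New()
	if _, err := io.Copy(h, r); err != nil {
		return "", fmt.Errorf("hash stream: %w", err)
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// hashFile opens the binary at path and hashes its contents.
func hashFile(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", fmt.Errorf("open binary: %w", err)
	}
	defer f.Close()
	return hashStream(f)
}

func main() {
	// Hash an in-memory stand-in for a binary.
	sum, err := hashStream(strings.NewReader("abc"))
	if err != nil {
		panic(err)
	}
	fmt.Println(sum)
}
```

The resulting hex digest is what the server would store and serve next to each binary for agents to compare against.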
Update Orchestration:
- GET /api/v1/agents/:id/updates/available - Check for updates
- POST /api/v1/agents/:id/update - Trigger update command
- Update queue management with priority handling
- Staggered rollout configuration
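One possible shape for the priority-aware update queue is a heap keyed on priority with first-in, first-out tie-breaking. This is a sketch using the standard `container/heap` package; `updateCommand`, `Queue`, and the convention that lower `Priority` values are served first are all assumptions made for illustration.

```go
package main

import (
	"container/heap"
	"fmt"
)

// updateCommand is a hypothetical queued update.
// Lower Priority values are dequeued first.
type updateCommand struct {
	AgentID  string
	Version  string
	Priority int
	seq      int // insertion order, used to break priority ties FIFO
}

// updateQueue implements heap.Interface.
type updateQueue []*updateCommand

func (q updateQueue) Len() int { return len(q) }
func (q updateQueue) Less(i, j int) bool {
	if q[i].Priority != q[j].Priority {
		return q[i].Priority < q[j].Priority
	}
	return q[i].seq < q[j].seq
}
func (q updateQueue) Swap(i, j int) { q[i], q[j] = q[j], q[i] }
func (q *updateQueue) Push(x any)   { *q = append(*q, x.(*updateCommand)) }
func (q *updateQueue) Pop() any {
	old := *q
	n := len(old)
	item := old[n-1]
	old[n-1] = nil // drop the reference so it can be collected
	*q = old[:n-1]
	return item
}

// Queue wraps the heap and assigns sequence numbers on insert.
type Queue struct {
	items updateQueue
	next  int
}

// Enqueue adds an update command for the given agent.
func (q *Queue) Enqueue(agentID, version string, priority int) {
	heap.Push(&q.items, &updateCommand{
		AgentID: agentID, Version: version, Priority: priority, seq: q.next,
	})
	q.next++
}

// Dequeue removes the highest-priority command, or reports false if empty.
func (q *Queue) Dequeue() (*updateCommand, bool) {
	if q.items.Len() == 0 {
		return nil, false
	}
	return heap.Pop(&q.items).(*updateCommand), true
}

func main() {
	q := &Queue{}
	q.Enqueue("agent-a", "v0.2.1", 2)
	q.Enqueue("agent-b", "v0.2.1", 1)
	for cmd, ok := q.Dequeue(); ok; cmd, ok = q.Dequeue() {
		fmt.Println(cmd.AgentID, cmd.Priority)
	}
}
```

A production queue would likely live in the existing command queue system and be persisted; the in-memory heap only shows the ordering rule.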
Rollout Strategy:
- Phase 1: 5% canary deployment
- Phase 2: 25% wave (if the canary succeeds)
- Phase 3: 100% full deployment
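One way to assign agents to these phases is to hash each agent ID into a stable bucket from 0 to 99, so the same agent always lands in the same phase without any stored state. The sketch below assumes the percentages are cumulative (the 25% wave includes the 5% canary); `bucket`, `rolloutPhase`, and `eligible` are hypothetical names.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bucket maps an agent ID to a stable value in [0, 100).
func bucket(agentID string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(agentID)) // Write on a hash.Hash never returns an error
	return h.Sum32() % 100
}

// rolloutPhase returns the first phase (1, 2, or 3) in which the agent
// becomes eligible, using cumulative 5% / 25% / 100% thresholds.
func rolloutPhase(agentID string) int {
	b := bucket(agentID)
	switch {
	case b < 5:
		return 1
	case b < 25:
		return 2
	default:
		return 3
	}
}

// eligible reports whether the agent may update while activePhase is running.
func eligible(agentID string, activePhase int) bool {
	return rolloutPhase(agentID) <= activePhase
}

func main() {
	for _, id := range []string{"agent-001", "agent-002", "agent-003"} {
		fmt.Printf("%s: phase %d, eligible in phase 2: %v\n",
			id, rolloutPhase(id), eligible(id, 2))
	}
}
```

Hash-based bucketing only approximates the target percentages on small fleets; an explicit canary list may be preferable when the fleet has few agents.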
3. Update Verification System
Signature Verification:
func verifyBinarySignature(binaryPath string, signaturePath string, publicKey string) error {
	// Verify SHA-256 hash matches expected
	// Verify GPG signature if available
	// Check binary integrity and authenticity
	return nil // placeholder until the checks above are implemented
}
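The following sketch illustrates the two checks on in-memory bytes. Note the substitution: the Go standard library has no OpenPGP support, so Ed25519 from `crypto/ed25519` stands in for the GPG signature here. `verifyBinary`, `signRelease`, and `newDemoKey` are hypothetical helpers, and the signature is assumed to cover the SHA-256 digest rather than the raw binary.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"errors"
	"fmt"
)

var (
	errHashMismatch = errors.New("binary hash does not match expected SHA-256")
	errBadSignature = errors.New("signature verification failed")
)

// verifyBinary checks the binary against an expected hex SHA-256 digest, then
// verifies a detached Ed25519 signature over that digest.
func verifyBinary(binary []byte, expectedHex string, sig []byte, pub ed25519.PublicKey) error {
	want, err := hex.DecodeString(expectedHex)
	if err != nil {
		return fmt.Errorf("decode expected hash: %w", err)
	}
	got := sha256.Sum256(binary)
	if subtle.ConstantTimeCompare(got[:], want) != 1 {
		return errHashMismatch
	}
	// ed25519.Verify panics on a wrong-sized key, so check the length first.
	if len(pub) != ed25519.PublicKeySize || !ed25519.Verify(pub, got[:], sig) {
		return errBadSignature
	}
	return nil
}

// signRelease is the hypothetical server-side counterpart: it returns the hex
// digest and a signature over the digest.
func signRelease(binary []byte, priv ed25519.PrivateKey) (string, []byte) {
	sum := sha256.Sum256(binary)
	return hex.EncodeToString(sum[:]), ed25519.Sign(priv, sum[:])
}

// newDemoKey generates a throwaway key pair for the example.
func newDemoKey() (ed25519.PublicKey, ed25519.PrivateKey) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}
	return pub, priv
}

func main() {
	pub, priv := newDemoKey()
	binary := []byte("agent binary v0.2.1")
	digest, sig := signRelease(binary, priv)

	fmt.Println("valid binary:", verifyBinary(binary, digest, sig, pub))
	fmt.Println("tampered binary:", verifyBinary([]byte("tampered"), digest, sig, pub))
}
```

In the agent, the public key would be shipped with the installed binary rather than generated at run time, so a compromised download server cannot swap it.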
Health Check Integration:
- Post-update health verification
- Service functionality testing
- Communication verification with server
- Automatic rollback threshold detection
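The rollback threshold could be expressed as a bounded retry loop around a health probe. In this sketch, `waitHealthy`, `errRollbackNeeded`, and `errUnhealthy` are hypothetical names, and the threshold is assumed to be a simple count of consecutive failed probes (at least 1).

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var (
	// errRollbackNeeded signals that the failure threshold was reached.
	errRollbackNeeded = errors.New("health check failure threshold reached")
	// errUnhealthy is a stand-in probe failure used only for the demo.
	errUnhealthy = errors.New("service not responding")
)

// waitHealthy calls check up to maxFailures times, sleeping interval between
// attempts. It returns nil on the first success and an error wrapping
// errRollbackNeeded once every attempt has failed.
func waitHealthy(check func() error, maxFailures int, interval time.Duration) error {
	var last error
	for i := 0; i < maxFailures; i++ {
		if last = check(); last == nil {
			return nil
		}
		if i < maxFailures-1 {
			time.Sleep(interval)
		}
	}
	return fmt.Errorf("%w after %d attempts: %v", errRollbackNeeded, maxFailures, last)
}

func main() {
	calls := 0
	// Simulated probe: fails twice while the service starts, then succeeds.
	probe := func() error {
		calls++
		if calls < 3 {
			return errUnhealthy
		}
		return nil
	}
	err := waitHealthy(probe, 5, 10*time.Millisecond)
	fmt.Println("attempts:", calls, "error:", err)
}
```

A caller can test the result with `errors.Is(err, errRollbackNeeded)` to decide whether to trigger the rollback path.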
4. Frontend Update Management
Batch Update UI (aggregator-web/src/pages/):
- Select multiple agents for updates
- Configure rollout strategy (immediate, staggered, manual approval)
- Monitor update progress in real-time
- View update history and success/failure rates
- Rollback capability for failed deployments
Definition of Done
- ✅ self_update command handler implemented in agent
- ✅ Binary signature verification working
- ✅ Automated service restart and health checking
- ✅ Rollback mechanism functional
- ✅ Staggered rollout system operational
- ✅ Web UI for batch update management
- ✅ Update progress monitoring and reporting
- ✅ Comprehensive testing of failure scenarios
Test Plan
Unit Tests:
- Binary download and signature verification
- Service lifecycle management during updates
- Health check validation
- Rollback trigger conditions
Integration Tests:
- End-to-end update flow from detection to completion
- Staggered rollout simulation
- Failed update rollback scenarios
- Version pinning and downgrade prevention
Security Tests:
- Signature verification with invalid signatures
- Tampered binary rejection
- Unauthorized update attempts
Manual Tests:
- Test update from v0.2.0 to v0.2.1 on real agents
- Test rollback scenarios
- Test batch update operations
- Test staggered rollout phases
Files to Modify
- aggregator-agent/internal/commands/update.go - Add self_update handler
- aggregator-agent/internal/security/ - Signature verification logic
- aggregator-agent/cmd/agent/main.go - Update command registration
- aggregator-server/internal/services/binary_signing.go - New service
- aggregator-server/internal/api/handlers/updates.go - Update management API
- aggregator-server/internal/services/update_orchestrator.go - New service
- aggregator-web/src/pages/AgentManagement.tsx - Batch update UI
- aggregator-web/src/components/UpdateProgress.tsx - Progress monitoring
Update Flow
- Detection: Agent polls for updates via existing heartbeat mechanism
- Queuing: Server creates update command with priority and rollout phase
- Download: Agent downloads binary to temporary location
- Verification: Cryptographic signature and integrity verification
- Installation: Service stop, binary replacement, service start
- Validation: Health checks and functionality verification
- Reporting: Status update to server (success/failure/rollback)
- Monitoring: Continuous health monitoring post-update
Security Considerations
- Binary signature verification mandatory for all updates
- Version pinning prevents unauthorized downgrades
- Update authorization tied to agent registration tokens
- Audit trail for all update operations
- Isolated temporary directories for downloads
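Downgrade prevention comes down to a numeric version comparison before an update is accepted. The sketch below assumes versions of the form vMAJOR.MINOR.PATCH (as in v0.2.0 and v0.2.1) and interprets pinning as a minimum allowed version; `parseVersion`, `compareVersions`, and `checkTarget` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion turns "v0.2.1" (or "0.2.1") into its three numeric components.
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("version %q: want MAJOR.MINOR.PATCH", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return out, fmt.Errorf("version %q: invalid component %q", v, p)
		}
		out[i] = n
	}
	return out, nil
}

// compareVersions returns -1, 0, or 1 if a is older than, equal to, or newer than b.
func compareVersions(a, b [3]int) int {
	for i := range a {
		if a[i] < b[i] {
			return -1
		}
		if a[i] > b[i] {
			return 1
		}
	}
	return 0
}

// checkTarget rejects targets that are not strictly newer than the current
// version or that fall below the pinned minimum.
func checkTarget(current, target, pinnedMin string) error {
	cur, err := parseVersion(current)
	if err != nil {
		return err
	}
	tgt, err := parseVersion(target)
	if err != nil {
		return err
	}
	floor, err := parseVersion(pinnedMin)
	if err != nil {
		return err
	}
	if compareVersions(tgt, cur) <= 0 {
		return fmt.Errorf("target %s is not newer than current %s", target, current)
	}
	if compareVersions(tgt, floor) < 0 {
		return fmt.Errorf("target %s is below pinned minimum %s", target, pinnedMin)
	}
	return nil
}

func main() {
	fmt.Println(checkTarget("v0.2.0", "v0.2.1", "v0.2.0"))
	fmt.Println(checkTarget("v0.2.1", "v0.2.0", "v0.1.0"))
}
```

Comparing parsed integers rather than strings matters: a lexical comparison would rank v0.10.0 below v0.2.0.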
Estimated Effort
- Development: 24-32 hours
- Testing: 16-20 hours
- Review: 8-12 hours
- Security Review: 4-6 hours
Dependencies
- Existing command queue system
- Agent service management infrastructure
- Binary distribution system
- Agent registration and authentication
Risk Assessment
Medium Risk - Core system modification with significant complexity. Requires extensive testing and security review. Rollback mechanisms are critical for safety. Staged rollout approach mitigates risk.
Rollback Strategy
- Automatic Rollback: Triggered by health check failures
- Manual Rollback: Admin-initiated via web interface
- Binary Backup: Keep previous version for rollback
- Configuration Backup: Preserve agent configuration during updates
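The binary backup and restore steps can be sketched with file renames. This example assumes a Unix-like system where the staged binary and the installed binary share a filesystem, so `os.Rename` replaces the target in one step; the `.bak` suffix and the names `installWithBackup`, `rollback`, and `demo` are assumptions for illustration.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// installWithBackup moves the current binary to <path>.bak, then moves the
// staged binary into place. If the second move fails, the backup is restored.
func installWithBackup(currentPath, stagedPath string) error {
	backup := currentPath + ".bak"
	if err := os.Rename(currentPath, backup); err != nil {
		return fmt.Errorf("back up current binary: %w", err)
	}
	if err := os.Rename(stagedPath, currentPath); err != nil {
		if rbErr := os.Rename(backup, currentPath); rbErr != nil {
			return fmt.Errorf("install failed: %v; restore failed: %v", err, rbErr)
		}
		return fmt.Errorf("install new binary: %w", err)
	}
	return nil
}

// rollback puts the backed-up binary back in place of the current one.
func rollback(currentPath string) error {
	if err := os.Rename(currentPath+".bak", currentPath); err != nil {
		return fmt.Errorf("restore previous binary: %w", err)
	}
	return nil
}

// demo runs an install followed by a rollback in a throwaway directory and
// returns the binary's contents after each step.
func demo() (string, string, error) {
	dir, err := os.MkdirTemp("", "agent-update-")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(dir)

	current := filepath.Join(dir, "agent")
	staged := filepath.Join(dir, "agent.new")
	if err := os.WriteFile(current, []byte("old"), 0o755); err != nil {
		return "", "", err
	}
	if err := os.WriteFile(staged, []byte("new"), 0o755); err != nil {
		return "", "", err
	}
	if err := installWithBackup(current, staged); err != nil {
		return "", "", err
	}
	installed, err := os.ReadFile(current)
	if err != nil {
		return "", "", err
	}
	if err := rollback(current); err != nil {
		return "", "", err
	}
	restored, err := os.ReadFile(current)
	if err != nil {
		return "", "", err
	}
	return string(installed), string(restored), nil
}

func main() {
	installed, restored, err := demo()
	fmt.Println(installed, restored, err)
}
```

Agent configuration is untouched by this swap because only the binary path is renamed; platforms that lock running executables would need a different replacement strategy.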