# Agent Auto-Update System

**Priority**: P2 (New Feature)
**Source Reference**: From needs.md line 121
**Status**: Designed, Ready for Implementation

## Problem Statement

Currently, agent updates require manual intervention (re-running installation scripts). There is no automated mechanism for agents to self-update when new versions are available, creating operational overhead for managing large fleets of agents.

## Feature Description

Implement an automated agent update system that allows agents to detect available updates, download new binaries, verify signatures, and perform self-updates with proper rollback capabilities and staggered rollout support.

## Acceptance Criteria

1. Agents can detect when new versions are available via server API
2. Agents can download signed binaries and verify cryptographic signatures
3. Self-update process handles service restarts gracefully
4. Rollback capability if health checks fail after update
5. Staggered rollout support (canary → wave → full deployment)
6. Version pinning to prevent unauthorized downgrades
7. Update progress and status visible in web interface
8. Update failures are properly logged and reported

## Technical Approach

### 1. Agent-Side Self-Update Handler

**New Command Handler** (`aggregator-agent/internal/commands/`):
```go
import "errors"

func (h *CommandHandler) handleSelfUpdate(cmd Command) error {
	// 1. Check current version vs target version
	// 2. Download new binary to temporary location
	// 3. Verify cryptographic signature
	// 4. Stop current service gracefully
	// 5. Replace binary
	// 6. Start updated service
	// 7. Perform health checks
	// 8. Rollback if health checks fail

	// Skeleton only: fail loudly until the steps above are implemented.
	return errors.New("self_update: not implemented")
}
```

**Update Stages**:
- `update_download` - Download new binary
- `update_verify` - Verify signature and integrity
- `update_install` - Install and restart
- `update_healthcheck` - Verify functionality
- `update_rollback` - Revert if needed

### 2. Server-Side Update Management

**Binary Signing** (`aggregator-server/internal/services/`):
- Implement SHA-256 hashing for all binary builds
- Optional GPG signature generation
- Signature storage and serving infrastructure

**Update Orchestration**:
- `GET /api/v1/agents/:id/updates/available` - Check for updates
- `POST /api/v1/agents/:id/update` - Trigger update command
- Update queue management with priority handling
- Staggered rollout configuration

**Rollout Strategy**:
- Phase 1: 5% canary deployment
- Phase 2: 25% second wave (if the canary phase succeeds)
- Phase 3: 100% full deployment
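
One possible way to decide which agents belong to each phase is to hash the agent ID into a stable bucket and compare it with the current rollout percentage. The sketch below assumes the percentages are cumulative (the 25% wave includes the 5% canary group) and that agents have string IDs; `bucketFor` and `eligible` are hypothetical names. Because the buckets come from a hash, the selected share is only approximately 5% or 25% of the fleet.

```go
package main

import "hash/fnv"

// bucketFor maps an agent ID to a stable bucket in [0, 100).
// The same ID always lands in the same bucket, so an agent selected for
// the canary phase stays selected in every later phase.
func bucketFor(agentID string) int {
	h := fnv.New32a()
	h.Write([]byte(agentID)) // Write on a hash.Hash never returns an error
	return int(h.Sum32() % 100)
}

// eligible reports whether the agent falls inside the current rollout
// percentage: 5 for the canary phase, 25 for the second wave, 100 for
// full deployment.
func eligible(agentID string, rolloutPercent int) bool {
	return bucketFor(agentID) < rolloutPercent
}
```

Deterministic bucketing avoids storing per-agent phase assignments: the server only needs to persist the current percentage for each release.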

### 3. Update Verification System

**Signature Verification**:
```go
import "errors"

func verifyBinarySignature(binaryPath string, signaturePath string, publicKey string) error {
	// Verify SHA-256 hash matches expected
	// Verify GPG signature if available
	// Check binary integrity and authenticity

	// Skeleton only: fail closed until the checks above are implemented.
	return errors.New("verifyBinarySignature: not implemented")
}
```

**Health Check Integration**:
- Post-update health verification
- Service functionality testing
- Communication verification with server
- Automatic rollback threshold detection

### 4. Frontend Update Management

**Batch Update UI** (`aggregator-web/src/pages/`):
- Select multiple agents for updates
- Configure rollout strategy (immediate, staggered, manual approval)
- Monitor update progress in real-time
- View update history and success/failure rates
- Rollback capability for failed deployments

## Definition of Done

- ✅ `self_update` command handler implemented in agent
- ✅ Binary signature verification working
- ✅ Automated service restart and health checking
- ✅ Rollback mechanism functional
- ✅ Staggered rollout system operational
- ✅ Web UI for batch update management
- ✅ Update progress monitoring and reporting
- ✅ Comprehensive testing of failure scenarios

## Test Plan

1. **Unit Tests**
   - Binary download and signature verification
   - Service lifecycle management during updates
   - Health check validation
   - Rollback trigger conditions

2. **Integration Tests**
   - End-to-end update flow from detection to completion
   - Staggered rollout simulation
   - Failed update rollback scenarios
   - Version pinning and downgrade prevention

3. **Security Tests**
   - Signature verification with invalid signatures
   - Tampered binary rejection
   - Unauthorized update attempts

4. **Manual Tests**
   - Test update from v0.2.0 to v0.2.1 on real agents
   - Test rollback scenarios
   - Test batch update operations
   - Test staggered rollout phases

## Files to Modify

- `aggregator-agent/internal/commands/update.go` - Add self_update handler
- `aggregator-agent/internal/security/` - Signature verification logic
- `aggregator-agent/cmd/agent/main.go` - Update command registration
- `aggregator-server/internal/services/binary_signing.go` - New service
- `aggregator-server/internal/api/handlers/updates.go` - Update management API
- `aggregator-server/internal/services/update_orchestrator.go` - New service
- `aggregator-web/src/pages/AgentManagement.tsx` - Batch update UI
- `aggregator-web/src/components/UpdateProgress.tsx` - Progress monitoring

## Update Flow

1. **Detection**: Agent polls for updates via existing heartbeat mechanism
2. **Queuing**: Server creates update command with priority and rollout phase
3. **Download**: Agent downloads binary to temporary location
4. **Verification**: Cryptographic signature and integrity verification
5. **Installation**: Service stop, binary replacement, service start
6. **Validation**: Health checks and functionality verification
7. **Reporting**: Status update to server (success/failure/rollback)
8. **Monitoring**: Continuous health monitoring post-update

## Security Considerations

- Binary signature verification mandatory for all updates
- Version pinning prevents unauthorized downgrades
- Update authorization tied to agent registration tokens
- Audit trail for all update operations
- Isolated temporary directories for downloads
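
Downgrade prevention through version pinning can be reduced to a numeric comparison between the pinned minimum and the update target. The sketch below assumes three-part versions such as `v0.2.1` (the format used in the test plan) with no pre-release suffixes; `parseVersion` and `checkNotDowngrade` are hypothetical names.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion converts "v0.2.1" or "0.2.1" into its three numeric parts.
func parseVersion(v string) ([3]int, error) {
	var out [3]int
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	if len(parts) != 3 {
		return out, fmt.Errorf("invalid version %q", v)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return out, fmt.Errorf("invalid version %q", v)
		}
		out[i] = n
	}
	return out, nil
}

// checkNotDowngrade returns an error if target is lower than the pinned
// minimum version, or if either version string cannot be parsed.
func checkNotDowngrade(minimum, target string) error {
	floor, err := parseVersion(minimum)
	if err != nil {
		return err
	}
	tgt, err := parseVersion(target)
	if err != nil {
		return err
	}
	for i := range floor {
		if tgt[i] > floor[i] {
			return nil
		}
		if tgt[i] < floor[i] {
			return fmt.Errorf("refusing downgrade from %s to %s", minimum, target)
		}
	}
	return nil // versions are equal
}
```

Comparing parsed integers rather than strings matters here: a plain string comparison would rank `v0.10.0` below `v0.9.0`.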

## Estimated Effort

- **Development**: 24-32 hours
- **Testing**: 16-20 hours
- **Review**: 8-12 hours
- **Security Review**: 4-6 hours

## Dependencies

- Existing command queue system
- Agent service management infrastructure
- Binary distribution system
- Agent registration and authentication

## Risk Assessment

**Medium Risk** - Core system modification with significant complexity. Requires extensive testing and security review. Rollback mechanisms are critical for safety. Staged rollout approach mitigates risk.

## Rollback Strategy

1. **Automatic Rollback**: Triggered by health check failures
2. **Manual Rollback**: Admin-initiated via web interface
3. **Binary Backup**: Keep previous version for rollback
4. **Configuration Backup**: Preserve agent configuration during updates