Add docs and project files - force for Culurien
docs/3_BACKLOG/P2-003_Agent-Auto-Update-System.md (new file, 184 lines)

# Agent Auto-Update System

**Priority**: P2 (New Feature)
**Source Reference**: From needs.md line 121
**Status**: Designed, Ready for Implementation

## Problem Statement

Agent updates currently require manual intervention: an operator must re-run the installation scripts. Agents have no automated way to update themselves when a new version is available, which creates operational overhead when managing large fleets of agents.

## Feature Description

Implement an automated agent update system that lets agents detect available updates, download new binaries, verify their signatures, and update themselves, with rollback on failure and support for staggered rollouts.

## Acceptance Criteria

1. Agents can detect when new versions are available via the server API
2. Agents can download signed binaries and verify their cryptographic signatures
3. The self-update process handles service restarts gracefully
4. Agents roll back to the previous version if health checks fail after an update
5. Staggered rollouts are supported (canary → wave → full deployment)
6. Version pinning prevents unauthorized downgrades
7. Update progress and status are visible in the web interface
8. Update failures are properly logged and reported

## Technical Approach

### 1. Agent-Side Self-Update Handler

**New Command Handler** (`aggregator-agent/internal/commands/`):
```go
import "errors"

// handleSelfUpdate outlines the self-update sequence; the steps are not implemented yet.
func (h *CommandHandler) handleSelfUpdate(cmd Command) error {
	// 1. Check current version vs target version
	// 2. Download new binary to temporary location
	// 3. Verify cryptographic signature
	// 4. Stop current service gracefully
	// 5. Replace binary
	// 6. Start updated service
	// 7. Perform health checks
	// 8. Rollback if health checks fail
	return errors.New("self_update: not implemented")
}
```

**Update Stages**:
- `update_download` - Download new binary
- `update_verify` - Verify signature and integrity
- `update_install` - Install and restart
- `update_healthcheck` - Verify functionality
- `update_rollback` - Revert if needed

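
The stages above can be read as a small state machine. The following is a minimal sketch under one assumption that the design does not state: a failure before installation simply aborts (nothing has been replaced yet), while a failure during or after installation leads to `update_rollback`. The constant names and the `nextStage` helper are hypothetical.

```go
package main

// Stage names as listed in the design.
const (
	StageDownload    = "update_download"
	StageVerify      = "update_verify"
	StageInstall     = "update_install"
	StageHealthcheck = "update_healthcheck"
	StageRollback    = "update_rollback"
)

// nextStage returns the stage that follows current, or "" when the update
// is finished (successfully or not) and only reporting remains.
func nextStage(current string, succeeded bool) string {
	if !succeeded {
		switch current {
		case StageInstall, StageHealthcheck:
			// Assumption: the old binary was already replaced, so revert.
			return StageRollback
		default:
			// Download or verify failed (or rollback itself failed): stop.
			return ""
		}
	}
	switch current {
	case StageDownload:
		return StageVerify
	case StageVerify:
		return StageInstall
	case StageInstall:
		return StageHealthcheck
	default:
		// Health check passed or rollback completed: nothing left to run.
		return ""
	}
}
```

Keeping the transitions in one function makes it straightforward to unit-test the rollback trigger conditions without touching the service itself.
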
### 2. Server-Side Update Management

**Binary Signing** (`aggregator-server/internal/services/`):
- Implement SHA-256 hashing for all binary builds
- Optional GPG signature generation
- Signature storage and serving infrastructure

**Update Orchestration**:
- `GET /api/v1/agents/:id/updates/available` - Check for updates
- `POST /api/v1/agents/:id/update` - Trigger update command
- Update queue management with priority handling
- Staggered rollout configuration

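
The "update queue management with priority handling" item could be backed by a small priority queue. The sketch below uses Go's standard `container/heap`; the `updateJob` and `UpdateQueue` types and their fields are hypothetical, and it assumes an in-memory queue, whereas a real server would presumably persist queued commands.

```go
package main

import "container/heap"

// updateJob is a hypothetical queue entry; the field names are illustrative.
type updateJob struct {
	AgentID  string
	Priority int // higher values are dispatched first
	seq      int // insertion order, keeps equal priorities first-in, first-out
}

// jobHeap implements heap.Interface, ordered by priority and then by seq.
type jobHeap []updateJob

func (h jobHeap) Len() int { return len(h) }
func (h jobHeap) Less(i, j int) bool {
	if h[i].Priority != h[j].Priority {
		return h[i].Priority > h[j].Priority
	}
	return h[i].seq < h[j].seq
}
func (h jobHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *jobHeap) Push(x interface{}) { *h = append(*h, x.(updateJob)) }
func (h *jobHeap) Pop() interface{} {
	old := *h
	n := len(old)
	job := old[n-1]
	*h = old[:n-1]
	return job
}

// UpdateQueue hands out queued update jobs in priority order.
type UpdateQueue struct {
	jobs jobHeap
	next int
}

// Enqueue adds an update job for the given agent.
func (q *UpdateQueue) Enqueue(agentID string, priority int) {
	heap.Push(&q.jobs, updateJob{AgentID: agentID, Priority: priority, seq: q.next})
	q.next++
}

// Dequeue removes and returns the most urgent job, or false if the queue is empty.
func (q *UpdateQueue) Dequeue() (updateJob, bool) {
	if q.jobs.Len() == 0 {
		return updateJob{}, false
	}
	return heap.Pop(&q.jobs).(updateJob), true
}
```

The `seq` tie-breaker is a design choice: without it, agents queued at the same priority could be dispatched in arbitrary order.
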
**Rollout Strategy**:
- Phase 1: 5% canary deployment
- Phase 2: 25% second wave (if the canary succeeds)
- Phase 3: 100% full deployment

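
One way to assign agents to these phases without storing extra state is to hash each agent ID into a fixed bucket. The sketch below is an illustration only: it assumes the percentages are cumulative (the 25% wave includes the 5% canary group), and `rolloutPhase` and `eligible` are hypothetical helper names.

```go
package main

import "hash/fnv"

// rolloutPhase maps an agent ID to the first rollout phase that includes it.
// Assumption: phases are cumulative (5% canary, 25% second wave, 100% full).
func rolloutPhase(agentID string) int {
	h := fnv.New32a()
	h.Write([]byte(agentID)) // hash.Hash.Write never returns an error
	bucket := h.Sum32() % 100
	switch {
	case bucket < 5:
		return 1
	case bucket < 25:
		return 2
	default:
		return 3
	}
}

// eligible reports whether an agent may update while the rollout is at currentPhase.
func eligible(agentID string, currentPhase int) bool {
	return rolloutPhase(agentID) <= currentPhase
}
```

Because the bucket depends only on the agent ID, the same agents act as canaries for every release. If that is undesirable, the release version could be mixed into the hash input.
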
### 3. Update Verification System

**Signature Verification**:
```go
import "errors"

// verifyBinarySignature checks the binary at binaryPath against its signature file.
func verifyBinarySignature(binaryPath string, signaturePath string, publicKey string) error {
	// Verify SHA-256 hash matches expected
	// Verify GPG signature if available
	// Check binary integrity and authenticity
	return errors.New("verifyBinarySignature: not implemented") // fail closed until implemented
}
```

**Health Check Integration**:
- Post-update health verification
- Service functionality testing
- Communication verification with server
- Automatic rollback threshold detection

### 4. Frontend Update Management

**Batch Update UI** (`aggregator-web/src/pages/`):
- Select multiple agents for updates
- Configure rollout strategy (immediate, staggered, manual approval)
- Monitor update progress in real-time
- View update history and success/failure rates
- Rollback capability for failed deployments

## Definition of Done

- ✅ `self_update` command handler implemented in agent
- ✅ Binary signature verification working
- ✅ Automated service restart and health checking
- ✅ Rollback mechanism functional
- ✅ Staggered rollout system operational
- ✅ Web UI for batch update management
- ✅ Update progress monitoring and reporting
- ✅ Comprehensive testing of failure scenarios

## Test Plan

1. **Unit Tests**
   - Binary download and signature verification
   - Service lifecycle management during updates
   - Health check validation
   - Rollback trigger conditions

2. **Integration Tests**
   - End-to-end update flow from detection to completion
   - Staggered rollout simulation
   - Failed update rollback scenarios
   - Version pinning and downgrade prevention

3. **Security Tests**
   - Signature verification with invalid signatures
   - Tampered binary rejection
   - Unauthorized update attempts

4. **Manual Tests**
   - Test update from v0.2.0 to v0.2.1 on real agents
   - Test rollback scenarios
   - Test batch update operations
   - Test staggered rollout phases

## Files to Modify

- `aggregator-agent/internal/commands/update.go` - Add self_update handler
- `aggregator-agent/internal/security/` - Signature verification logic
- `aggregator-agent/cmd/agent/main.go` - Update command registration
- `aggregator-server/internal/services/binary_signing.go` - New service
- `aggregator-server/internal/api/handlers/updates.go` - Update management API
- `aggregator-server/internal/services/update_orchestrator.go` - New service
- `aggregator-web/src/pages/AgentManagement.tsx` - Batch update UI
- `aggregator-web/src/components/UpdateProgress.tsx` - Progress monitoring

## Update Flow

1. **Detection**: Agent polls for updates via existing heartbeat mechanism
2. **Queuing**: Server creates update command with priority and rollout phase
3. **Download**: Agent downloads binary to temporary location
4. **Verification**: Cryptographic signature and integrity verification
5. **Installation**: Service stop, binary replacement, service start
6. **Validation**: Health checks and functionality verification
7. **Reporting**: Status update to server (success/failure/rollback)
8. **Monitoring**: Continuous health monitoring post-update

## Security Considerations

- Binary signature verification is mandatory for all updates
- Version pinning prevents unauthorized downgrades
- Update authorization is tied to agent registration tokens
- Audit trail for all update operations
- Isolated temporary directories for downloads

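
Downgrade prevention comes down to comparing the running version with the target before anything is downloaded. The sketch below assumes versions follow the `vMAJOR.MINOR.PATCH` form used elsewhere in this document (for example `v0.2.1`) and rejects any target that is not strictly newer; authorized downgrades, such as an admin-initiated rollback, would need a separate path. All function names are hypothetical.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion parses "vMAJOR.MINOR.PATCH" (the leading "v" is optional).
func parseVersion(s string) ([3]int, error) {
	var v [3]int
	parts := strings.Split(strings.TrimPrefix(s, "v"), ".")
	if len(parts) != 3 {
		return v, fmt.Errorf("invalid version %q", s)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return v, fmt.Errorf("invalid version %q", s)
		}
		v[i] = n
	}
	return v, nil
}

// compareVersions returns -1, 0, or 1 when a is older than, equal to, or newer than b.
func compareVersions(a, b [3]int) int {
	for i := range a {
		if a[i] < b[i] {
			return -1
		}
		if a[i] > b[i] {
			return 1
		}
	}
	return 0
}

// allowUpdate returns an error unless target is strictly newer than current.
func allowUpdate(current, target string) error {
	cur, err := parseVersion(current)
	if err != nil {
		return err
	}
	tgt, err := parseVersion(target)
	if err != nil {
		return err
	}
	if compareVersions(tgt, cur) <= 0 {
		return fmt.Errorf("refusing update from %s to %s: target is not newer", current, target)
	}
	return nil
}
```

Comparing the components as integers matters: a plain string comparison would rank `v0.9.0` above `v0.10.0`.
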
## Estimated Effort

- **Development**: 24-32 hours
- **Testing**: 16-20 hours
- **Review**: 8-12 hours
- **Security Review**: 4-6 hours

## Dependencies

- Existing command queue system
- Agent service management infrastructure
- Binary distribution system
- Agent registration and authentication

## Risk Assessment

**Medium Risk** - Core system modification with significant complexity. Requires extensive testing and security review. Rollback mechanisms are critical for safety. Staged rollout approach mitigates risk.

## Rollback Strategy

1. **Automatic Rollback**: Triggered by health check failures
2. **Manual Rollback**: Admin-initiated via web interface
3. **Binary Backup**: Keep previous version for rollback
4. **Configuration Backup**: Preserve agent configuration during updates