Add docs and project files - force for Culurien

Fimeg
2026-03-28 20:46:24 -04:00
parent dc61797423
commit 484a7f77ce
343 changed files with 119530 additions and 0 deletions

@@ -0,0 +1,184 @@
# Agent Auto-Update System
**Priority**: P2 (New Feature)
**Source Reference**: From needs.md line 121
**Status**: Designed, Ready for Implementation
## Problem Statement
Currently, agent updates require manual intervention (re-running installation scripts). There is no automated mechanism for agents to self-update when new versions are available, creating operational overhead for managing large fleets of agents.
## Feature Description
Implement an automated agent update system that allows agents to detect available updates, download new binaries, verify signatures, and perform self-updates with proper rollback capabilities and staggered rollout support.
## Acceptance Criteria
1. Agents can detect when new versions are available via server API
2. Agents can download signed binaries and verify cryptographic signatures
3. Self-update process handles service restarts gracefully
4. Rollback capability if health checks fail after update
5. Staggered rollout support (canary → wave → full deployment)
6. Version pinning to prevent unauthorized downgrades
7. Update progress and status visible in web interface
8. Update failures are properly logged and reported
## Technical Approach
### 1. Agent-Side Self-Update Handler
**New Command Handler** (`aggregator-agent/internal/commands/`):
```go
func (h *CommandHandler) handleSelfUpdate(cmd Command) error {
	// 1. Check current version vs target version
	// 2. Download new binary to temporary location
	// 3. Verify cryptographic signature
	// 4. Stop current service gracefully
	// 5. Replace binary
	// 6. Start updated service
	// 7. Perform health checks
	// 8. Roll back if health checks fail
	return errors.New("self_update: not implemented") // placeholder; needs the "errors" import
}
```
**Update Stages**:
- `update_download` - Download new binary
- `update_verify` - Verify signature and integrity
- `update_install` - Install and restart
- `update_healthcheck` - Verify functionality
- `update_rollback` - Revert if needed
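
The stage sequence above can be sketched as a small pipeline. This is a minimal illustration, not the agent's actual implementation: `StageFunc`, `runUpdate`, and `simulate` are hypothetical names, and the rule that only failures in `update_install` or `update_healthcheck` trigger `update_rollback` is an assumption (before install, the old binary is still in place).

```go
package main

import (
	"errors"
	"fmt"
)

// Stage names mirror the update stages listed in this document.
type Stage string

const (
	StageDownload    Stage = "update_download"
	StageVerify      Stage = "update_verify"
	StageInstall     Stage = "update_install"
	StageHealthcheck Stage = "update_healthcheck"
	StageRollback    Stage = "update_rollback"
)

// StageFunc is a hypothetical hook; the real work of each stage would live behind one.
type StageFunc func() error

// runUpdate runs the forward stages in order and returns the stages it entered.
// Assumption: failures before install abort without rollback, while failures
// during or after install run the rollback stage.
func runUpdate(steps map[Stage]StageFunc) ([]Stage, error) {
	forward := []Stage{StageDownload, StageVerify, StageInstall, StageHealthcheck}
	var visited []Stage
	for _, s := range forward {
		run, ok := steps[s]
		if !ok {
			return visited, fmt.Errorf("no handler registered for %s", s)
		}
		visited = append(visited, s)
		err := run()
		if err == nil {
			continue
		}
		if s == StageInstall || s == StageHealthcheck {
			rb, ok := steps[StageRollback]
			if !ok {
				return visited, fmt.Errorf("%s failed (%v) and no rollback handler is registered", s, err)
			}
			visited = append(visited, StageRollback)
			if rbErr := rb(); rbErr != nil {
				return visited, fmt.Errorf("%s failed (%v) and rollback failed: %w", s, err, rbErr)
			}
		}
		return visited, fmt.Errorf("%s failed: %w", s, err)
	}
	return visited, nil
}

// simulate wires up no-op handlers and makes the stage named by failAt fail.
func simulate(failAt Stage) ([]Stage, error) {
	steps := map[Stage]StageFunc{}
	all := []Stage{StageDownload, StageVerify, StageInstall, StageHealthcheck, StageRollback}
	for _, s := range all {
		s := s // capture the per-iteration value for the closure
		steps[s] = func() error {
			if s == failAt {
				return errors.New("simulated failure")
			}
			return nil
		}
	}
	return runUpdate(steps)
}

func main() {
	visited, err := simulate(StageHealthcheck)
	fmt.Println(visited, err)
}
```

Returning the visited stages makes it straightforward to report per-stage progress to the server, which the web interface would need for progress monitoring.
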
### 2. Server-Side Update Management
**Binary Signing** (`aggregator-server/internal/services/`):
- Compute a SHA-256 hash for every binary build (integrity check)
- Optional GPG signature generation (authenticity check)
- Signature storage and serving infrastructure
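
As a rough sketch of the hashing step, the build pipeline could stream each binary through SHA-256 and publish the hex digest next to it. The function names `hashBinary` and `hashBytes` are illustrative only and do not refer to existing code in `aggregator-server`.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
)

// hashBinary streams r through SHA-256 and returns the lowercase hex digest.
// Streaming avoids loading a large binary into memory at once.
func hashBinary(r io.Reader) (string, error) {
	h := sha256.New()
	if _, err := io.Copy(h, r); err != nil {
		return "", fmt.Errorf("hash binary: %w", err)
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

// hashBytes is a convenience wrapper for in-memory data.
func hashBytes(b []byte) (string, error) {
	return hashBinary(bytes.NewReader(b))
}

func main() {
	digest, err := hashBytes([]byte("abc"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(digest)
}
```

In a real build step the reader would typically be an `*os.File` opened on the compiled agent binary.
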
**Update Orchestration**:
- `GET /api/v1/agents/:id/updates/available` - Check for updates
- `POST /api/v1/agents/:id/update` - Trigger update command
- Update queue management with priority handling
- Staggered rollout configuration
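
One possible shape for the priority-handling part of the update queue is a heap keyed on priority, with insertion order as the tie-breaker. This is a sketch under assumptions: `UpdateCommand`, `UpdateQueue`, and their fields are hypothetical, and the existing command queue system may already provide equivalent behavior.

```go
package main

import (
	"container/heap"
	"fmt"
)

// UpdateCommand is a hypothetical queue entry; the field names are assumptions.
type UpdateCommand struct {
	AgentID  string
	Priority int // higher values are dispatched first
	seq      int // insertion order, used to keep equal priorities FIFO
}

// commandHeap implements heap.Interface over UpdateCommand values.
type commandHeap []UpdateCommand

func (h commandHeap) Len() int { return len(h) }
func (h commandHeap) Less(i, j int) bool {
	if h[i].Priority != h[j].Priority {
		return h[i].Priority > h[j].Priority
	}
	return h[i].seq < h[j].seq
}
func (h commandHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h *commandHeap) Push(x any)   { *h = append(*h, x.(UpdateCommand)) }
func (h *commandHeap) Pop() any {
	old := *h
	n := len(old)
	item := old[n-1]
	*h = old[:n-1]
	return item
}

// UpdateQueue wraps the heap behind a small, typed API.
type UpdateQueue struct {
	items commandHeap
	next  int
}

// Enqueue adds an update command for the given agent.
func (q *UpdateQueue) Enqueue(agentID string, priority int) {
	heap.Push(&q.items, UpdateCommand{AgentID: agentID, Priority: priority, seq: q.next})
	q.next++
}

// Dequeue removes and returns the highest-priority command, if any.
func (q *UpdateQueue) Dequeue() (UpdateCommand, bool) {
	if q.items.Len() == 0 {
		return UpdateCommand{}, false
	}
	return heap.Pop(&q.items).(UpdateCommand), true
}

func main() {
	q := &UpdateQueue{}
	q.Enqueue("agent-a", 1)
	q.Enqueue("agent-b", 5)
	q.Enqueue("agent-c", 5)
	for {
		cmd, ok := q.Dequeue()
		if !ok {
			break
		}
		fmt.Println(cmd.AgentID, cmd.Priority)
	}
}
```

The sketch is not safe for concurrent use; a server-side queue shared between request handlers would need a mutex or a database-backed queue.
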
**Rollout Strategy**:
- Phase 1: 5% canary deployment
- Phase 2: 25% wave (if the canary succeeds)
- Phase 3: 100% full deployment
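
A simple way to assign agents to these phases is to hash each agent ID into a stable bucket from 0 to 99 and compare it with the phase's percentage. The sketch below assumes the percentages are cumulative (the 25% wave includes the 5% canary group); `bucket` and `eligible` are hypothetical helpers, and FNV-1a is just one convenient non-cryptographic hash.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// phaseLimits holds the cumulative percentage of the fleet eligible in each phase.
// Assumption: the values 5, 25, 100 from the rollout strategy are cumulative.
var phaseLimits = []uint32{5, 25, 100}

// bucket maps an agent ID to a stable value in [0, 100).
func bucket(agentID string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(agentID)) // Write on a hash.Hash never returns an error
	return h.Sum32() % 100
}

// eligible reports whether the agent may update during the given phase (1-based).
func eligible(agentID string, phase int) bool {
	if phase < 1 || phase > len(phaseLimits) {
		return false
	}
	return bucket(agentID) < phaseLimits[phase-1]
}

func main() {
	for _, id := range []string{"agent-1", "agent-2", "agent-3"} {
		fmt.Println(id, bucket(id), eligible(id, 1), eligible(id, 2), eligible(id, 3))
	}
}
```

Because the bucket depends only on the agent ID, an agent selected as a canary for one release is a canary for every release. Mixing the target version into the hashed input would rotate the canary group if that is preferred.
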
### 3. Update Verification System
**Signature Verification**:
```go
func verifyBinarySignature(binaryPath, signaturePath, publicKey string) error {
	// Verify SHA-256 hash matches expected
	// Verify GPG signature if available
	// Check binary integrity and authenticity
	return errors.New("verifyBinarySignature: not implemented") // placeholder; needs the "errors" import
}
```
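
To make the two checks concrete, here is a self-contained sketch of digest plus signature verification. It deliberately swaps GPG for Ed25519, because the Go standard library has no OpenPGP implementation; the function names, the in-memory `[]byte` inputs, and the `signRelease` helper are all assumptions for illustration, not the planned API.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"errors"
	"fmt"
)

var (
	ErrHashMismatch = errors.New("sha-256 digest does not match expected value")
	ErrBadSignature = errors.New("signature verification failed")
)

// verifyBinary checks integrity (SHA-256 digest) and then authenticity
// (Ed25519 signature over the binary, standing in for GPG in this sketch).
func verifyBinary(binary []byte, expectedHex string, sig []byte, pub ed25519.PublicKey) error {
	expected, err := hex.DecodeString(expectedHex)
	if err != nil {
		return fmt.Errorf("decode expected digest: %w", err)
	}
	sum := sha256.Sum256(binary)
	if subtle.ConstantTimeCompare(sum[:], expected) != 1 {
		return ErrHashMismatch
	}
	// ed25519.Verify panics on a wrong-sized key, so check the length first.
	if len(pub) != ed25519.PublicKeySize || !ed25519.Verify(pub, binary, sig) {
		return ErrBadSignature
	}
	return nil
}

// signRelease plays the server's role for the demo: it generates a throwaway
// key pair, then returns the digest, signature, and public key for the binary.
func signRelease(binary []byte) (string, []byte, ed25519.PublicKey, error) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		return "", nil, nil, err
	}
	sum := sha256.Sum256(binary)
	return hex.EncodeToString(sum[:]), ed25519.Sign(priv, binary), pub, nil
}

func main() {
	binary := []byte("agent-binary-v0.2.1")
	digest, sig, pub, err := signRelease(binary)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("untampered:", verifyBinary(binary, digest, sig, pub))
	binary[0] ^= 0xff // simulate tampering
	fmt.Println("tampered:", verifyBinary(binary, digest, sig, pub))
}
```

In practice the public key would be shipped with the agent rather than generated alongside the signature, and the digest would come from the server's update metadata.
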
**Health Check Integration**:
- Post-update health verification
- Service functionality testing
- Communication verification with server
- Automatic rollback threshold detection
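
The rollback threshold could be expressed as "roll back if the agent is still unhealthy after N probes". The following sketch assumes exactly that policy; `Probe`, `awaitHealthy`, and `ErrProbeFailed` are hypothetical names, and a real probe would check service status and server communication.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrProbeFailed is a generic failure a probe can return in this sketch.
var ErrProbeFailed = errors.New("health probe failed")

// Probe is a hypothetical health probe; nil means healthy.
type Probe func() error

// awaitHealthy runs probe up to attempts times, sleeping interval between
// tries. It returns nil on the first success and an error if every attempt
// fails, which the caller would treat as the signal to roll back.
func awaitHealthy(probe Probe, attempts int, interval time.Duration) error {
	if attempts < 1 {
		return errors.New("attempts must be at least 1")
	}
	var last error
	for i := 0; i < attempts; i++ {
		if last = probe(); last == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(interval)
		}
	}
	return fmt.Errorf("unhealthy after %d attempts: %w", attempts, last)
}

func main() {
	calls := 0
	flaky := func() error {
		calls++
		if calls < 3 {
			return ErrProbeFailed // the first two probes fail
		}
		return nil
	}
	err := awaitHealthy(flaky, 5, 10*time.Millisecond)
	fmt.Println("calls:", calls, "err:", err)
}
```

Allowing a few failed probes before rolling back gives the restarted service time to come up, at the cost of a longer window in which a broken agent stays deployed.
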
### 4. Frontend Update Management
**Batch Update UI** (`aggregator-web/src/pages/`):
- Select multiple agents for updates
- Configure rollout strategy (immediate, staggered, manual approval)
- Monitor update progress in real-time
- View update history and success/failure rates
- Rollback capability for failed deployments
## Definition of Done
- ✅ `self_update` command handler implemented in agent
- ✅ Binary signature verification working
- ✅ Automated service restart and health checking
- ✅ Rollback mechanism functional
- ✅ Staggered rollout system operational
- ✅ Web UI for batch update management
- ✅ Update progress monitoring and reporting
- ✅ Comprehensive testing of failure scenarios
## Test Plan
1. **Unit Tests**
   - Binary download and signature verification
   - Service lifecycle management during updates
   - Health check validation
   - Rollback trigger conditions
2. **Integration Tests**
   - End-to-end update flow from detection to completion
   - Staggered rollout simulation
   - Failed update rollback scenarios
   - Version pinning and downgrade prevention
3. **Security Tests**
   - Signature verification with invalid signatures
   - Tampered binary rejection
   - Unauthorized update attempts
4. **Manual Tests**
   - Test update from v0.2.0 to v0.2.1 on real agents
   - Test rollback scenarios
   - Test batch update operations
   - Test staggered rollout phases
## Files to Modify
- `aggregator-agent/internal/commands/update.go` - Add self_update handler
- `aggregator-agent/internal/security/` - Signature verification logic
- `aggregator-agent/cmd/agent/main.go` - Update command registration
- `aggregator-server/internal/services/binary_signing.go` - New service
- `aggregator-server/internal/api/handlers/updates.go` - Update management API
- `aggregator-server/internal/services/update_orchestrator.go` - New service
- `aggregator-web/src/pages/AgentManagement.tsx` - Batch update UI
- `aggregator-web/src/components/UpdateProgress.tsx` - Progress monitoring
## Update Flow
1. **Detection**: Agent polls for updates via existing heartbeat mechanism
2. **Queuing**: Server creates update command with priority and rollout phase
3. **Download**: Agent downloads binary to temporary location
4. **Verification**: Cryptographic signature and integrity verification
5. **Installation**: Service stop, binary replacement, service start
6. **Validation**: Health checks and functionality verification
7. **Reporting**: Status update to server (success/failure/rollback)
8. **Monitoring**: Continuous health monitoring post-update
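
For the reporting step, the agent needs some payload that carries one of the three outcomes named above (success, failure, rollback). The struct below is purely illustrative: the type name, field names, and JSON keys are assumptions, not an existing API contract.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Outcome is one of the three result states named in the update flow.
type Outcome string

const (
	OutcomeSuccess  Outcome = "success"
	OutcomeFailure  Outcome = "failure"
	OutcomeRollback Outcome = "rollback"
)

// UpdateReport is a hypothetical status payload; all field names are assumptions.
type UpdateReport struct {
	AgentID     string  `json:"agent_id"`
	FromVersion string  `json:"from_version"`
	ToVersion   string  `json:"to_version"`
	Outcome     Outcome `json:"outcome"`
	Detail      string  `json:"detail,omitempty"`
}

// Encode validates the outcome and serializes the report as JSON.
func (r UpdateReport) Encode() ([]byte, error) {
	switch r.Outcome {
	case OutcomeSuccess, OutcomeFailure, OutcomeRollback:
		return json.Marshal(r)
	default:
		return nil, fmt.Errorf("unknown outcome %q", r.Outcome)
	}
}

func main() {
	report := UpdateReport{
		AgentID:     "agent-1",
		FromVersion: "v0.2.0",
		ToVersion:   "v0.2.1",
		Outcome:     OutcomeRollback,
		Detail:      "health check failed",
	}
	body, err := report.Encode()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(body))
}
```
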
## Security Considerations
- Binary signature verification mandatory for all updates
- Version pinning prevents unauthorized downgrades
- Update authorization tied to agent registration tokens
- Audit trail for all update operations
- Isolated temporary directories for downloads
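
Downgrade prevention comes down to a version comparison before any download starts. The sketch below assumes plain `vMAJOR.MINOR.PATCH` version strings (matching the `v0.2.0` to `v0.2.1` example in the test plan) and treats any target that is not strictly newer as rejected; `parseVersion` and `allowUpdate` are hypothetical helpers.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion parses "vMAJOR.MINOR.PATCH" (the leading "v" is optional).
func parseVersion(s string) ([3]int, error) {
	var v [3]int
	parts := strings.Split(strings.TrimPrefix(s, "v"), ".")
	if len(parts) != 3 {
		return v, fmt.Errorf("invalid version %q: want MAJOR.MINOR.PATCH", s)
	}
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return v, fmt.Errorf("invalid version %q: bad component %q", s, p)
		}
		v[i] = n
	}
	return v, nil
}

// compareVersions returns -1, 0, or 1 when a is older than, equal to, or newer than b.
func compareVersions(a, b [3]int) int {
	for i := range a {
		if a[i] < b[i] {
			return -1
		}
		if a[i] > b[i] {
			return 1
		}
	}
	return 0
}

// allowUpdate rejects any target that is not strictly newer than current.
func allowUpdate(current, target string) error {
	cur, err := parseVersion(current)
	if err != nil {
		return err
	}
	tgt, err := parseVersion(target)
	if err != nil {
		return err
	}
	if compareVersions(tgt, cur) <= 0 {
		return fmt.Errorf("refusing update from %s to %s: target is not newer", current, target)
	}
	return nil
}

func main() {
	fmt.Println(allowUpdate("v0.2.0", "v0.2.1"))
	fmt.Println(allowUpdate("v0.2.1", "v0.2.0"))
}
```

Comparing components numerically matters: a string comparison would rank `v0.9.0` above `v0.10.0`.
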
## Estimated Effort
- **Development**: 24-32 hours
- **Testing**: 16-20 hours
- **Review**: 8-12 hours
- **Security Review**: 4-6 hours
## Dependencies
- Existing command queue system
- Agent service management infrastructure
- Binary distribution system
- Agent registration and authentication
## Risk Assessment
**Medium Risk** - Core system modification with significant complexity. Requires extensive testing and security review. Rollback mechanisms are critical for safety. Staged rollout approach mitigates risk.
## Rollback Strategy
1. **Automatic Rollback**: Triggered by health check failures
2. **Manual Rollback**: Admin-initiated via web interface
3. **Binary Backup**: Keep previous version for rollback
4. **Configuration Backup**: Preserve agent configuration during updates
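
The binary backup step can be sketched with two renames: move the running binary aside, then move the verified download into place. This assumes both files live on the same filesystem (renames across filesystems fail) and uses a hypothetical `.bak` suffix; `installWithBackup`, `restoreBackup`, and `demoSwap` are illustrative names.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// installWithBackup moves the current binary to <path>.bak and then moves the
// new binary into its place. If the second step fails, the backup is restored.
func installWithBackup(currentPath, newPath string) error {
	backup := currentPath + ".bak"
	if err := os.Rename(currentPath, backup); err != nil {
		return fmt.Errorf("back up current binary: %w", err)
	}
	if err := os.Rename(newPath, currentPath); err != nil {
		if rbErr := os.Rename(backup, currentPath); rbErr != nil {
			return fmt.Errorf("install failed (%v) and restore failed: %w", err, rbErr)
		}
		return fmt.Errorf("install new binary: %w", err)
	}
	return nil
}

// restoreBackup puts the previous binary back, replacing the failed update.
func restoreBackup(currentPath string) error {
	return os.Rename(currentPath+".bak", currentPath)
}

// demoSwap exercises install and rollback in a temporary directory and
// returns the file contents observed after each step.
func demoSwap() (string, string, error) {
	dir, err := os.MkdirTemp("", "agent-update")
	if err != nil {
		return "", "", err
	}
	defer os.RemoveAll(dir)

	current := filepath.Join(dir, "agent")
	staged := filepath.Join(dir, "agent.new")
	if err := os.WriteFile(current, []byte("old"), 0o755); err != nil {
		return "", "", err
	}
	if err := os.WriteFile(staged, []byte("new"), 0o755); err != nil {
		return "", "", err
	}
	if err := installWithBackup(current, staged); err != nil {
		return "", "", err
	}
	installed, err := os.ReadFile(current)
	if err != nil {
		return "", "", err
	}
	if err := restoreBackup(current); err != nil {
		return "", "", err
	}
	restored, err := os.ReadFile(current)
	if err != nil {
		return "", "", err
	}
	return string(installed), string(restored), nil
}

func main() {
	installed, restored, err := demoSwap()
	fmt.Println(installed, restored, err)
}
```

Keeping the backup next to the live binary means a manual rollback from the web interface only has to trigger `restoreBackup` and a service restart, provided the backup is not deleted until the post-update monitoring window has passed.
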