RedFlag Security System Implementation Plan v0.2.0
Date: 2025-11-10 Version: 0.1.23.4 → 0.2.0 Status: Implementation Ready Owner: Claude 4.5 (@Fimeg)
Executive Summary
Goal: Implement the RedFlag security architecture as designed - TOFU-first, Ed25519-signed binaries, machine ID binding, and command acknowledgment system.
Critical Discovery: Build orchestrator generates Docker configs while install script expects signed native binaries. This is blocking the entire update workflow (404 errors on agent updates).
Decision: Implement Option 2 (Per-Version/Platform Signing) from Decision.md - sign generic binaries once per version/platform, serve to all agents, keep config.json separate.
Scope: This plan covers the complete implementation from current state (v0.1.23.4) to production-ready v0.2.0.
Phase 1: Build Orchestrator Alignment (CRITICAL - Week 1)
Priority: 🔴 CRITICAL - Blocking all agent updates Estimated Time: 3-4 days Testing Required: Integration test with live agent upgrade
1.1 Replace Docker Config Generation with Signed Binary Flow
Current State:
// agent_builder.go:171-245
generateDockerCompose() → docker-compose.yml
generateDockerfile() → Dockerfile
generateEmbeddedConfig() → Go source with config
Result: Docker deployment configs (WRONG)
Target State:
// New flow:
1. Take generic binary from /app/binaries/{platform}/
2. Sign binary with Ed25519 private key
3. Store package metadata in agent_update_packages table
4. Generate config.json separately
5. Return download URLs for both
Implementation Steps:
a) Modify agent_builder.go
// REMOVE these functions:
- generateDockerCompose() → Delete
- generateDockerfile() → Delete
- BuildAgentWithConfig() → Rewrite completely
// NEW signature:
func (ab *AgentBuilder) BuildAgentArtifacts(config *AgentConfiguration) (*BuildArtifacts, error) {
// Step 1: Generate agent-specific config.json (not embedded in binary)
configJSON, err := generateConfigJSON(config)
if err != nil {
return nil, fmt.Errorf("config generation failed: %w", err)
}
// Step 2: Copy generic binary to temp location (don't modify binary)
// Generic binary built by Docker multi-stage build in /app/binaries/
genericBinaryPath := fmt.Sprintf("/app/binaries/%s/redflag-agent", config.Platform)
tempBinaryPath := fmt.Sprintf("/tmp/agent-build-%s/redflag-agent", config.AgentID)
if err := copyFile(genericBinaryPath, tempBinaryPath); err != nil {
return nil, fmt.Errorf("binary copy failed: %w", err)
}
// Step 3: Sign the binary (do not embed config)
signingService := services.NewSigningService()
signature, err := signingService.SignFile(tempBinaryPath)
if err != nil {
return nil, fmt.Errorf("signing failed: %w", err)
}
// Step 4: Return artifact locations (don't return Docker configs)
return &BuildArtifacts{
AgentID: config.AgentID,
ConfigJSON: configJSON,
BinaryPath: tempBinaryPath,
Signature: signature,
Platform: config.Platform,
Version: config.Version,
}, nil
}
Obvious Things That Might Be Missed:
- ☐ The `BuildAgentWithConfig` function is called from multiple places - update all callers
- ☐ The `BuildResult` struct has fields for Docker artifacts - remove them or they'll be dead code
- ☐ Check `build_orchestrator.go` handlers - they construct responses expecting Docker files
- ☐ Update API documentation if it references the Docker build process
- ☐ The install script expects `docker-compose.yml` instructions - it will break if we just remove without updating
b) Update build_orchestrator.go handlers
// In NewAgentBuild and UpgradeAgentBuild handlers:
// REMOVE: Response fields for Docker
"compose_file": buildResult.ComposeFile,
"dockerfile": buildResult.Dockerfile,
"next_steps": []string{
"1. docker build -t " + buildResult.ImageTag + " .",
"2. docker compose up -d",
}
// ADD: Response fields for native binary
"config_url": "/api/v1/config/" + config.AgentID,
"binary_url": "/api/v1/downloads/" + config.Platform,
"signature": signature,
"next_steps": []string{
"1. Download binary from: " + binaryURL,
"2. Download config from: " + configURL,
"3. Place redflag-agent in /usr/local/bin/",
"4. Place config.json in /etc/redflag/",
"5. Run: systemctl enable --now redflag-agent",
}
Obvious Things That Might Be Missed:
- ☐ The `AgentSetupRequest` struct might have Docker-specific fields - clean those up
- ☐ The installer script (`downloads.go:537-831`) parses this response - keep it compatible
- ☐ Web UI shows build instructions - update to show systemctl commands, not Docker
- ☐ API response structure changes break backward compatibility with older installers
c) Update database schema for signed packages
-- In migration file (create new migration 019):
-- agent_update_packages table exists but may need tweaks
ALTER TABLE agent_update_packages
ADD COLUMN IF NOT EXISTS config_path VARCHAR(255),
ADD COLUMN IF NOT EXISTS platform VARCHAR(20) NOT NULL DEFAULT 'linux-amd64',
ADD COLUMN IF NOT EXISTS version VARCHAR(20) NOT NULL DEFAULT '0.0.0';
-- NOT NULL columns need a DEFAULT: the ALTER fails if the table has existing rows
-- Add indexes for performance
CREATE INDEX IF NOT EXISTS idx_agent_updates_version_platform
ON agent_update_packages(version, platform)
WHERE agent_id IS NULL;
-- For per-agent packages (the index above covers generic per-version packages)
CREATE INDEX IF NOT EXISTS idx_agent_updates_per_agent
ON agent_update_packages(agent_id, version, platform)
WHERE agent_id IS NOT NULL;
Obvious Things That Might Be Missed:
- ☐ Migration needs to handle existing empty table gracefully
- ☐ Need both per-agent and generic package indexes
- ☐ Consider adding expires_at for automatic cleanup
1.2 Integrate Signing Service
Current State: Signing service exists but isn't called from build pipeline
Implementation:
// In BuildAgentArtifacts (from 1.1a):
signingService := services.NewSigningService()
signature, err := signingService.SignFile(tempBinaryPath)
if err != nil {
return nil, fmt.Errorf("signing failed: %w", err)
}
// Store in database
packageID := uuid.New()
query := `
INSERT INTO agent_update_packages (id, version, platform, binary_path, signature, created_at)
VALUES ($1, $2, $3, $4, $5, NOW())
`
_, err = db.Exec(query, packageID, config.Version, config.Platform, tempBinaryPath, signature)
Obvious Things That Might Be Missed:
- ☐ Signing service needs Ed25519 private key from env/config
- ☐ Missing key should fail fast with clear error message
- ☐ Signature format must match what agent expects (base64? hex?)
- ☐ Need to store signing key fingerprint for verification
1.3 Update Download Handler
Current State: Serves generic unsigned binaries from /app/binaries/
Target State: Serve signed versions first, fallback to unsigned
Implementation:
// In downloads.go:175,244 - modify downloadHandler
func (h *DownloadHandler) DownloadAgent(c *gin.Context) {
platform := c.Param("platform")
version := c.Query("version") // Optional version parameter
// Try to get signed package first
if version != "" {
signedPackage, err := h.packageQueries.GetSignedPackage(version, platform)
if err == nil && fileExists(signedPackage.BinaryPath) {
// Serve signed version
log.Printf("Serving signed binary v%s for platform %s", version, platform)
c.File(signedPackage.BinaryPath)
return
}
}
// Fallback to unsigned generic binary
genericPath := fmt.Sprintf("/app/binaries/%s/redflag-agent", platform)
if fileExists(genericPath) {
log.Printf("Serving unsigned generic binary for platform %s", platform)
c.File(genericPath)
return
}
c.JSON(http.StatusNotFound, gin.H{
"error": "no binary available",
"platform": platform,
"version": version,
})
}
Obvious Things That Might Be Missed:
- ☐ Add `version` query parameter support to API route
- ☐ Update install script to request specific version
- ☐ Handle 404 gracefully in installer (current installer assumes binary exists)
- ☐ Add signature verification step in install script
- ☐ Need `fileExists()` helper or use `os.Stat()` with error handling
1.4 Verify Subsystem Registration in Install Flow
Critical: Build orchestrator must ensure subsystems are properly configured
Current Issue: Installer may create agent without subsystems enabled
Implementation:
// After agent registration, ensure subsystems are created:
func ensureDefaultSubsystems(agentID uuid.UUID) error {
defaultSubsystems := []string{"updates", "storage", "system", "docker"}
for _, subsystem := range defaultSubsystems {
// Check if already exists
exists, err := subsystemQueries.Exists(agentID, subsystem)
if err != nil {
return err
}
if !exists {
// Create with defaults
err := subsystemQueries.Create(agentID, subsystem, models.SubsystemConfig{
Enabled: true,
AutoRun: true,
Interval: getDefaultInterval(subsystem),
CircuitBreaker: models.DefaultCircuitBreaker(),
})
if err != nil {
return err
}
}
}
return nil
}
Obvious Things That Might Be Missed:
- ☐ Subsystem creation must happen BEFORE scheduler loads jobs
- ☐ Need to prevent duplicate subsystem entries
- ☐ Update agent builder to call this function
- ☐ Migration: Existing agents without subsystems need backfill
1.5 Testing & Verification
Integration Test Steps:
# Setup:
1. Start fresh server with empty agent_update_packages table
2. Create registration token (1 seat)
3. Install agent on test machine
# Test 1: First agent update
4. Admin clicks "Update Agent" in UI
5. Verify: POST /api/v1/build/upgrade returns 200
6. Verify: agent_update_packages has 1 row (version, platform, signature)
7. Verify: Binary file exists at /app/binaries/{platform}/
8. Agent checks for update → receives signed binary
9. Verify: Agent download succeeds
10. Verify: Agent verifies signature → installs update
Expected: Agent running new version, config preserved
# Test 2: Second agent (same version)
11. Register second agent with same token
12. Click "Update Agent"
13. Verify: No new binary built (reuses existing signed package)
14. Verify: Second agent downloads same binary successfully
# Test 3: Version upgrade
15. Server upgraded to v0.1.24
16. First agent checks in → 426 Upgrade Required
17. Admin triggers update → new signed binary for v0.1.24
18. Agent downloads v0.1.24 binary
19. Verify: Agent version updated in database
20. Verify: Agent continues checking in normally
Verification Checklist:
- ☐ API returns proper download URLs (not Docker commands)
- ☐ Binary signature verifies on agent side
- ☐ Config.json preserved across updates (not overwritten)
- ☐ Agent restarts successfully after update
- ☐ Subsystems continue working after update
- ☐ Machine ID binding remains enforced
- ☐ Token refresh continues working
Phase 2: Middleware Version Upgrade Fix (HIGH - Week 2)
Priority: 🟠 HIGH - Prevents agents from getting bricked Estimated Time: 2-3 days Testing Required: Version upgrade scenario test
2.1 Middleware Enhancement
Problem: Machine binding middleware blocks version upgrades (returns 426), creating catch-22 where agent can't upgrade database version.
Solution: Make middleware "update-aware" - detect upgrading agents and allow them through with nonce validation.
Implementation:
a) Add update fields to agents table
-- Migration 020:
ALTER TABLE agents
ADD COLUMN IF NOT EXISTS is_updating BOOLEAN DEFAULT FALSE,
ADD COLUMN IF NOT EXISTS update_nonce VARCHAR(64),
ADD COLUMN IF NOT EXISTS update_nonce_expires_at TIMESTAMPTZ;
Obvious Things That Might Be Missed:
- ☐ Set default FALSE (not null)
- ☐ Add index on is_updating for query performance
- ☐ Consider cleanup job for stale update nonces
b) Enhance machine_binding.go middleware
// In MachineBindingMiddleware, add update detection logic:
func MachineBindingMiddleware() gin.HandlerFunc {
return func(c *gin.Context) {
// ... existing machine ID validation ...
// Check if agent is reporting upgrade completion
reportedVersion := c.GetHeader("X-Agent-Version")
updateNonce := c.GetHeader("X-Update-Nonce")
if agent.IsUpdating != nil && *agent.IsUpdating {
// Validate upgrade (not downgrade)
if !utils.IsNewerOrEqualVersion(reportedVersion, agent.CurrentVersion) {
log.Printf("DOWNGRADE ATTEMPT: Agent %s reporting version %s < current %s",
agentID, reportedVersion, agent.CurrentVersion)
c.JSON(http.StatusForbidden, gin.H{"error": "downgrade not allowed"})
c.Abort()
return
}
// Validate nonce (proves server authorized update)
if err := validateUpdateNonce(updateNonce, agent.ServerPublicKey); err != nil {
log.Printf("Invalid update nonce for agent %s: %v", agentID, err)
c.JSON(http.StatusForbidden, gin.H{"error": "invalid update nonce"})
c.Abort()
return
}
// Valid upgrade - complete it
log.Printf("Completing update for agent %s: %s → %s",
agentID, agent.CurrentVersion, reportedVersion)
go agentQueries.CompleteAgentUpdate(agentID, reportedVersion)
// Allow request through
c.Next()
return
}
// Normal version check (not in update)
// ... existing code ...
}
}
func validateUpdateNonce(nonce string, serverPublicKey string) error {
	// Parse nonce: format is "uuid:timestamp:signature"
	parts := strings.SplitN(nonce, ":", 3)
	if len(parts) != 3 {
		return fmt.Errorf("invalid nonce format")
	}
	// Verify timestamp first (cheap check; reject if > 5 minutes old)
	timestamp, err := strconv.ParseInt(parts[1], 10, 64)
	if err != nil {
		return fmt.Errorf("invalid timestamp")
	}
	if time.Now().Unix()-timestamp > 300 { // 5 minutes
		return fmt.Errorf("nonce expired")
	}
	// Verify Ed25519 signature - ed25519.Verify takes raw bytes,
	// so the base64-encoded key and signature must be decoded first
	message := parts[0] + ":" + parts[1] // uuid:timestamp
	pubKey, err := base64.StdEncoding.DecodeString(serverPublicKey)
	if err != nil || len(pubKey) != ed25519.PublicKeySize {
		return fmt.Errorf("invalid server public key")
	}
	signature, err := base64.StdEncoding.DecodeString(parts[2])
	if err != nil {
		return fmt.Errorf("invalid signature encoding")
	}
	if !ed25519.Verify(ed25519.PublicKey(pubKey), []byte(message), signature) {
		return fmt.Errorf("invalid nonce signature")
	}
	return nil
}
Obvious Things That Might Be Missed:
- ☐ Agent must send X-Agent-Version header during check-in when updating
- ☐ Agent must send X-Update-Nonce header with server's signed nonce
- ☐ Server must have agent.ServerPublicKey available (from TOFU)
- ☐ Update nonce must be generated by server when update triggered
- ☐ Nonce must be stored temporarily (redis or database with TTL)
- ☐ Clean up expired nonces (cron job or TTL index)
c) Update agent to send headers
// In aggregator-agent/cmd/agent/main.go:checkInAgent()
func checkInAgent(cfg *config.Config) error {
req, err := http.NewRequest("POST", cfg.ServerURL+"/api/v1/agents/metrics", body)
// Always send machine ID
machineID, _ := system.GetMachineID()
req.Header.Set("X-Machine-ID", machineID)
// Send agent version
req.Header.Set("X-Agent-Version", cfg.AgentVersion)
// If updating, include nonce
if cfg.UpdateInProgress {
req.Header.Set("X-Update-Nonce", cfg.UpdateNonce)
}
// ... rest of request ...
}
Obvious Things That Might Be Missed:
- ☐ Agent must persist update_nonce across restarts (STATE_DIR file)
- ☐ Agent must clear update flag after successful update
- ☐ Need to handle case where agent crashes mid-update
2.2 Testing Version Upgrade Scenario
Test Steps:
# Setup:
1. Have agent v0.1.17 in database
2. Have agent binary v0.1.23 on machine
3. Agent checks in → expecting 426 currently
# After fix:
4. Admin triggers update for agent
5. Server sets is_updating=true, generates nonce, stores nonce with agent_id
6. Agent checks in with X-Update-Nonce header
7. Middleware validates nonce, allows through
8. Agent reports v0.1.23
9. Server updates agent.current_version → completes update
10. Server clears is_updating flag
# Verify:
11. Agent no longer gets 426
12. Agent shows v0.1.23 in dashboard
13. Agent receives commands normally
Verification Checklist:
- ☐ Agent can upgrade from v0.1.17 → v0.1.23
- ☐ No manual intervention required (agent auto-updates)
- ☐ Middleware allows upgrade with valid nonce
- ☐ Middleware rejects downgrade attempts
- ☐ Invalid nonce causes rejection (security test)
- ☐ Expired nonce causes rejection (security test)
- ☐ Machine ID binding remains enforced during update
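The downgrade checks above hinge on `utils.IsNewerOrEqualVersion`, which the plan never defines; a sketch comparing dotted numeric versions like `0.1.23.4` (the utils package signature is an assumption):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// IsNewerOrEqualVersion reports whether a >= b for dotted numeric
// versions like "0.1.23.4"; missing segments count as zero.
func IsNewerOrEqualVersion(a, b string) bool {
	as := strings.Split(strings.TrimPrefix(a, "v"), ".")
	bs := strings.Split(strings.TrimPrefix(b, "v"), ".")
	n := len(as)
	if len(bs) > n {
		n = len(bs)
	}
	for i := 0; i < n; i++ {
		av, bv := 0, 0
		if i < len(as) {
			av, _ = strconv.Atoi(as[i])
		}
		if i < len(bs) {
			bv, _ = strconv.Atoi(bs[i])
		}
		if av != bv {
			return av > bv
		}
	}
	return true // versions are equal
}

func main() {
	fmt.Println(IsNewerOrEqualVersion("0.1.23", "0.1.17"))   // prints "true"
	fmt.Println(IsNewerOrEqualVersion("0.1.17", "0.1.23.4")) // prints "false"
}
```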
Phase 3: Security Hardening (MEDIUM - Week 3)
Priority: 🟡 MEDIUM - Improvements, not blockers Estimated Time: 2-3 days Testing Required: Security test scenarios
3.1 Remove JWT Secret Logging
Current Issue: JWT secret logged in debug when generated
Implementation:
// aggregator-server/cmd/server/main.go:105-108
if cfg.Admin.JWTSecret == "" {
cfg.Admin.JWTSecret = GenerateSecureToken()
// REMOVE: log.Printf("Generated JWT secret: %s", cfg.Admin.JWTSecret)
// Instead: log.Printf("JWT secret generated (not logged for security)")
}
Obvious Things That Might Be Missed:
- ☐ Add environment variable for debug mode: `REDFLAG_DEBUG=true`
- ☐ Only log secret when debug is explicitly enabled
- ☐ Warn in production if JWT_SECRET not set (don't generate randomly)
3.2 Implement Per-Server JWT Secrets
Current Issue: All servers share same JWT secret if not explicitly set
Implementation:
// In GenerateAgentToken():
// Generate unique secret per server on first boot, store in database
func ensureServerJWTSecret(db *sqlx.DB) (string, error) {
// Check if secret exists in settings
var secret string
query := `SELECT value FROM settings WHERE key = 'jwt_secret'`
err := db.Get(&secret, query)
if err == sql.ErrNoRows {
// Generate new secret
secret = GenerateSecureToken()
// Store in database
insert := `INSERT INTO settings (key, value) VALUES ('jwt_secret', $1)`
_, err = db.Exec(insert, secret)
if err != nil {
return "", err
}
log.Printf("Generated new JWT secret for this server")
}
return secret, nil
}
Migration:
-- Create settings table if doesn't exist
CREATE TABLE IF NOT EXISTS settings (
key VARCHAR(100) PRIMARY KEY,
value TEXT NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
-- For existing installations, migrate the current secret.
-- Note: Postgres cannot read the REDFLAG_JWT_SECRET environment variable
-- directly (current_setting only reads Postgres GUCs), so pass it in via
-- psql -v jwt_secret="$REDFLAG_JWT_SECRET":
INSERT INTO settings (key, value)
SELECT 'jwt_secret', :'jwt_secret'
WHERE NOT EXISTS (SELECT 1 FROM settings WHERE key = 'jwt_secret');
Obvious Things That Might Be Missed:
- ☐ Existing agents with old tokens will be invalidated (migration window needed)
- ☐ Need to support multiple valid secrets during rotation period
- ☐ Document secret rotation procedure for admins
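Supporting multiple valid secrets during rotation (second checklist item) can be sketched with stdlib HMAC, trying each accepted secret in turn. HS256 and the function names are assumptions - real code would go through the project's JWT library rather than hand-rolling verification:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
	"strings"
)

// verifyHS256 checks a JWT's HS256 signature against any of the
// currently accepted secrets (new secret first, old one during rotation).
func verifyHS256(token string, secrets []string) bool {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return false
	}
	signed := parts[0] + "." + parts[1]
	sig, err := base64.RawURLEncoding.DecodeString(parts[2])
	if err != nil {
		return false
	}
	for _, s := range secrets {
		mac := hmac.New(sha256.New, []byte(s))
		mac.Write([]byte(signed))
		if hmac.Equal(sig, mac.Sum(nil)) {
			return true
		}
	}
	return false
}

// signHS256 builds a minimal token for the demo (no claims validation).
func signHS256(header, payload, secret string) string {
	h := base64.RawURLEncoding.EncodeToString([]byte(header))
	p := base64.RawURLEncoding.EncodeToString([]byte(payload))
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write([]byte(h + "." + p))
	return h + "." + p + "." + base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
}

func main() {
	tok := signHS256(`{"alg":"HS256"}`, `{"sub":"agent"}`, "old-secret")
	// During rotation both secrets validate; after rotation only the new one.
	fmt.Println(verifyHS256(tok, []string{"new-secret", "old-secret"})) // prints "true"
	fmt.Println(verifyHS256(tok, []string{"new-secret"}))              // prints "false"
}
```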
3.3 Clean Dead Code
Files to clean:
a) Remove TLSConfig struct (not used)
// aggregator-agent/internal/config/config.go:23-29
// REMOVE:
type TLSConfig struct {
InsecureSkipVerify bool `json:"insecure_skip_verify"`
CertFile string `json:"cert_file,omitempty"`
KeyFile string `json:"key_file,omitempty"`
CAFile string `json:"ca_file,omitempty"`
}
// From Config struct, remove:
TLS TLSConfig `json:"tls,omitempty"`
// REMOVE from Load() function any TLS config loading
Rationale: Client certificate authentication was planned but rejected in favor of machine ID binding. This is dead code that will confuse future developers.
b) Remove Docker-specific build artifacts from structs
// aggregator-server/internal/services/agent_builder.go:53-60
// REMOVE from BuildResult:
ComposeFile string `json:"compose_file"`
Dockerfile string `json:"dockerfile"`
// Update all references to BuildResult throughout codebase
c) Update go.mod to remove unused dependencies
# After removing Docker code:
go mod tidy
# Verify no Docker-related imports remain
Obvious Things That Might Be Missed:
- ☐ Check test files for references to removed fields
- ☐ Check Web UI for references to Docker fields in API responses
- ☐ Update API documentation to remove Docker endpoints
- ☐ Search for TODO comments referencing Docker implementation
- ☐ Check for mocked Docker functions in test files
Phase 4: Documentation & Handover (Week 4)
4.1 Update Decision.md
Add findings from implementation:
- Final architecture decisions
- Performance metrics observed
- Security boundaries validated
- Any deviations from original plan
4.2 Create CHANGELOG.md Entry
## v0.2.0 - Security Hardening & Build Orchestrator
### Breaking Changes
- Build orchestrator now generates native binaries instead of Docker configs
- API response format changed for /api/v1/build/* endpoints
- Agent update flow now requires version parameter
### New Features
- Ed25519 signed agent binaries with automatic verification
- Machine ID binding enforced on all endpoints
- Command acknowledgment system (at-least-once delivery)
- Version upgrade middleware (fixes catch-22)
### Security Improvements
- Per-server JWT secrets (not shared)
- Token refresh with nonce validation
- Config protected by file permissions (0600)
### Removed
- Docker container deployment option (native binaries only)
- TLSConfig from agent config (dead code)
4.3 Create Migration Guide
For existing installations (v0.1.17-v0.1.23.4):
# 1. Backup database
pg_dump redflag > backup-pre-v020.sql
# 2. Apply migrations 018-020
migrate -path migrations -database postgres://... up
# 3. Set JWT secret if not already set
export REDFLAG_JWT_SECRET=$(LC_ALL=C tr -dc 'a-zA-Z0-9' < /dev/urandom | head -c 64)
# 4. Restart server (generates signing key if missing)
systemctl restart redflag-server
# 5. Verify signing key configured
curl http://localhost:8080/api/v1/security/signing
# 6. Trigger agent updates (all agents will update to signed binaries)
# Admin UI → Agents → Select All → Update Agent
4.4 Update README.md
Key sections to update:
- Architecture diagram (remove Docker, add signing flow)
- Security model (document machine ID binding)
- Installation instructions (systemctl, not Docker)
- Configuration reference (remove TLSConfig)
- API documentation (update build endpoints)
Testing & Quality Assurance
Unit Tests
// Test packages needed:
1. Test signing service (SignFile, VerifyFile)
2. Test build orchestrator (BuildAgentArtifacts)
3. Test machine binding middleware (with update scenario)
4. Test token renewal with nonce validation
5. Test download handler (signed vs unsigned fallback)
// Run tests with coverage:
go test ./... -cover -v
// Target: >80% coverage on security-critical packages
Integration Tests
#!/bin/bash
# integration_test.sh
echo "Starting RedFlag v0.2.0 integration test..."
# Setup test environment
docker-compose up -d postgres
echo "Waiting for database..."
sleep 5
# Start server
cd aggregator-server
go run cmd/server/main.go &
SERVER_PID=$!
echo "Waiting for server..."
sleep 10
# Run test scenarios:
./tests/test_registration.sh
./tests/test_machine_binding.sh
./tests/test_build_orchestrator.sh
./tests/test_signed_updates.sh
./tests/test_token_renewal.sh
./tests/test_command_acknowledgment.sh
# Cleanup
kill $SERVER_PID
docker-compose down
echo "All tests passed!"
Test Scenarios:
- Registration: New agent registers, gets tokens, machine ID stored
- Machine Binding: Attempt from different machine → 403 Forbidden
- Build Orchestrator: Build signed binary → verify signature → download
- Signed Updates: Agent updates → signature verification → successful install
- Token Renewal: With nonce → successful renewal → version updated
- Command Acknowledgment: Agent sends ack → server processes → queue cleared
Security Testing
# Penetration test checklist:
□ Attempt registration with stolen token (should fail if seats full)
□ Copy config.json to different machine (should fail machine binding)
□ Modify binary and attempt update (signature verification should fail)
□ Replay old nonce (timestamp check should fail)
□ Use expired JWT (should be rejected)
□ Attempt downgrade attack (middleware should reject)
□ Try to access agent data from wrong agent ID (auth should block)
□ Test token renewal without nonce (should fail)
Performance Benchmarks
Target Metrics:
| Operation | Target Time | Notes |
|---|---|---|
| Sign binary (per version) | < 50ms | Ed25519 is fast |
| Build artifacts generation | < 500ms | Mostly file I/O |
| Token renewal with nonce | < 100ms | Includes DB write |
| Machine ID validation | < 10ms | Database lookup |
| Download signed binary | < 5s | Depends on network |
| Agent update process | < 30s | Including verification & restart |
Scalability Targets:
- 1,000 agents: Update all in < 5 minutes (CDN caching)
- 10,000 agents: Update all in < 1 hour (CDN caching)
- Token renewal: 1,000 req/sec (stateless JWT validation)
- Database: < 10% CPU at 1k agents
Deployment Checklist
Pre-Deployment
- All unit tests passing (coverage >80%)
- All integration tests passing
- Security tests passing
- Performance benchmarks met
- Documentation updated (Decision.md, CHANGELOG.md, README.md)
- Migration scripts tested
- Backup procedure documented
- Rollback plan documented
Deployment Steps
- Announce maintenance window (4 hours recommended)
- Create database backup
- Stop agent schedulers (prevent command generation)
- Stop server
- Apply migrations 018-020
- Set required environment variables:
  - `REDFLAG_JWT_SECRET` (min 32 chars)
  - `REDFLAG_SIGNING_PRIVATE_KEY` (if not using keygen)
- Start server
- Verify server starts without errors
- Verify health endpoint: GET /api/v1/health
- Verify signing endpoint: GET /api/v1/security/signing
- Start agent schedulers
- Trigger agent updates (optional, can be gradual)
- Monitor logs for errors
- Verify agent connectivity
Post-Deployment
- Monitor error rates for 24 hours
- Verify agent update success rate >95%
- Check database for anomalies (duplicate subsystems, etc.)
- Review logs for security violations (machine ID mismatches)
- Performance metrics within targets
- Update documentation with any deviations
Rollback Plan
If critical issues found:
- Stop server
- Restore database backup
- Revert to v0.1.23.4 code
- Restart server
- Notify users of rollback
- Document issue for v0.2.1 fix
Known Risks & Mitigations
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Build orchestrator produces invalid signature | Medium | High | Unit tests + manual verification |
| Token renewal fails during update | Low | High | Graceful fallback to re-registration |
| Machine ID collision (rare) | Low | Medium | Hardware fingerprint + agent_id composite |
| JWT secret exposed in logs | Medium | Medium | Remove logging + use per-server secrets |
| Subsystems not attached after update | Low | Medium | EnsureDefaultSubsystems() called |
| Dead code causes confusion | High | Low | Clean TLSConfig, BuildResult fields |
| CDN caches unsigned binary | Low | High | Use version-specific URLs |
Success Criteria
Functional:
- [ ] Agent can successfully update from v0.1.17 → v0.1.23 → v0.2.0
- [ ] Signed binary verification passes
- [ ] Machine ID binding prevents cross-machine impersonation
- [ ] Token renewal with nonce validation works
- [ ] Command acknowledgment system operational
- [ ] Subsystems properly attached after update
Performance:
- [ ] Update completes in < 30 seconds
- [ ] Server handles 1000 concurrent agents
- [ ] Token renewal < 100ms
- [ ] No database deadlocks under load
Security:
- [ ] Ed25519 signatures verified on agent
- [ ] JWT secret not logged in production
- [ ] Per-server JWT secrets implemented
- [ ] Machine ID mismatch logs security alert
- [ ] Token theft from decommissioned agent mitigated by machine binding
Handover Notes for Claude 4.5
@Fimeg, this plan is your implementation guide. Key points:
- Focus on Phase 1 first - Build orchestrator alignment is critical and blocks everything else
- Test as you go - Don't wait until end, integration testing is crucial
- Clean up dead code - TLSConfig and the Docker fields in structs need to be removed
- Verify subsystems - Make sure they're attached during agent registration/update
- Machine binding is THE security boundary - Token rotation is less important
- Ask questions - If anything is unclear, we have logs of all discussions
Time budget: Expect 3-4 weeks for full implementation. Phase 1 is most complex. Phases 2-4 are straightforward.
Resources:
- Decision.md - Architecture decisions
- Status.md - Current state
- todayupdate.md - Historical context
- answer.md - Token system analysis
- SECURITY_AUDIT.md - Security boundaries
When stuck: Review the "Obvious Things That Might Be Missed" sections - they're based on actual issues we identified.
Good luck! 🚀
Document Version: 1.0 Created: 2025-11-10 Last Updated: 2025-11-10 Prepared by: @Kimi + @Fimeg + @Grok Ready for Implementation: ✅ Yes