RedFlag Security System Implementation Plan v0.2.0

Date: 2025-11-10 | Version: 0.1.23.4 → 0.2.0 | Status: Implementation Ready | Owner: Claude 4.5 (@Fimeg)


Executive Summary

Goal: Implement the RedFlag security architecture as designed - TOFU-first, Ed25519-signed binaries, machine ID binding, and command acknowledgment system.

Critical Discovery: Build orchestrator generates Docker configs while install script expects signed native binaries. This is blocking the entire update workflow (404 errors on agent updates).

Decision: Implement Option 2 (Per-Version/Platform Signing) from Decision.md - sign generic binaries once per version/platform, serve to all agents, keep config.json separate.

Scope: This plan covers the complete implementation from current state (v0.1.23.4) to production-ready v0.2.0.


Phase 1: Build Orchestrator Alignment (CRITICAL - Week 1)

Priority: 🔴 CRITICAL - Blocking all agent updates | Estimated Time: 3-4 days | Testing Required: Integration test with live agent upgrade

1.1 Replace Docker Config Generation with Signed Binary Flow

Current State:

// agent_builder.go:171-245
generateDockerCompose()  → docker-compose.yml
generateDockerfile()     → Dockerfile
generateEmbeddedConfig() → Go source with config
Result: Docker deployment configs (WRONG)

Target State:

// New flow:
1. Take generic binary from /app/binaries/{platform}/
2. Sign binary with Ed25519 private key
3. Store package metadata in agent_update_packages table
4. Generate config.json separately
5. Return download URLs for both

Implementation Steps:

a) Modify agent_builder.go

// REMOVE these functions:
- generateDockerCompose()  → Delete
- generateDockerfile()     → Delete
- BuildAgentWithConfig()   → Rewrite completely

// NEW signature:
func (ab *AgentBuilder) BuildAgentArtifacts(config *AgentConfiguration) (*BuildArtifacts, error) {
    // Step 1: Generate agent-specific config.json (not embedded in binary)
    configJSON, err := generateConfigJSON(config)
    if err != nil {
        return nil, fmt.Errorf("config generation failed: %w", err)
    }

    // Step 2: Copy generic binary to temp location (don't modify binary)
    // Generic binary built by Docker multi-stage build in /app/binaries/
    genericBinaryPath := fmt.Sprintf("/app/binaries/%s/redflag-agent", config.Platform)
    tempBinaryPath := fmt.Sprintf("/tmp/agent-build-%s/redflag-agent", config.AgentID)

    if err := copyFile(genericBinaryPath, tempBinaryPath); err != nil {
        return nil, fmt.Errorf("binary copy failed: %w", err)
    }

    // Step 3: Sign the binary (do not embed config)
    signingService := services.NewSigningService()
    signature, err := signingService.SignFile(tempBinaryPath)
    if err != nil {
        return nil, fmt.Errorf("signing failed: %w", err)
    }

    // Step 4: Return artifact locations (don't return Docker configs)
    return &BuildArtifacts{
        AgentID:          config.AgentID,
        ConfigJSON:       configJSON,
        BinaryPath:       tempBinaryPath,
        Signature:        signature,
        Platform:         config.Platform,
        Version:          config.Version,
    }, nil
}

Obvious Things That Might Be Missed:

  • ☐ The BuildAgentWithConfig function is called from multiple places - update all callers
  • ☐ The BuildResult struct has fields for Docker artifacts - remove them or they'll be dead code
  • ☐ Check build_orchestrator.go handlers - they construct responses expecting Docker files
  • ☐ Update API documentation if it references Docker build process
  • ☐ The install script expects docker-compose.yml instructions - it will break if we just remove without updating

b) Update build_orchestrator.go handlers

// In NewAgentBuild and UpgradeAgentBuild handlers:

// REMOVE: Response fields for Docker
"compose_file": buildResult.ComposeFile,
"dockerfile":   buildResult.Dockerfile,
"next_steps": []string{
    "1. docker build -t " + buildResult.ImageTag + " .",
    "2. docker compose up -d",
}

// ADD: Response fields for native binary
"config_url": "/api/v1/config/" + config.AgentID,
"binary_url": "/api/v1/downloads/" + config.Platform,
"signature":  signature,
"next_steps": []string{
    "1. Download binary from: " + binaryURL,
    "2. Download config from: " + configURL,
    "3. Place redflag-agent in /usr/local/bin/",
    "4. Place config.json in /etc/redflag/",
    "5. Run: systemctl enable --now redflag-agent",
}

Obvious Things That Might Be Missed:

  • ☐ The AgentSetupRequest struct might have Docker-specific fields - clean those up
  • ☐ The installer script (downloads.go:537-831) parses this response - keep it compatible
  • ☐ Web UI shows build instructions - update to show systemctl commands not Docker
  • ☐ API response structure changes break backward compatibility with older installers

c) Update database schema for signed packages

-- In migration file (create new migration 019):

-- agent_update_packages table exists but may need tweaks.
-- DEFAULT '' keeps the NOT NULL additions safe if rows already exist.
ALTER TABLE agent_update_packages
ADD COLUMN IF NOT EXISTS config_path VARCHAR(255),
ADD COLUMN IF NOT EXISTS platform VARCHAR(20) NOT NULL DEFAULT '',
ADD COLUMN IF NOT EXISTS version VARCHAR(20) NOT NULL DEFAULT '';

-- Add indexes for performance
-- Per-agent packages
CREATE INDEX IF NOT EXISTS idx_agent_updates_agent
ON agent_update_packages(agent_id, version, platform)
WHERE agent_id IS NOT NULL;

-- For per-version (generic) packages, not per-agent
CREATE INDEX IF NOT EXISTS idx_agent_updates_generic
ON agent_update_packages(version, platform)
WHERE agent_id IS NULL;

Obvious Things That Might Be Missed:

  • ☐ Migration needs to handle existing empty table gracefully
  • ☐ Need both per-agent and generic package indexes
  • ☐ Consider adding expires_at for automatic cleanup

1.2 Integrate Signing Service

Current State: Signing service exists but isn't called from build pipeline

Implementation:

// In BuildAgentArtifacts (from 1.1a):
signingService := services.NewSigningService()
signature, err := signingService.SignFile(tempBinaryPath)
if err != nil {
    return nil, fmt.Errorf("signing failed: %w", err)
}

// Store in database
packageID := uuid.New()
query := `
    INSERT INTO agent_update_packages (id, version, platform, binary_path, signature, created_at)
    VALUES ($1, $2, $3, $4, $5, NOW())
`
_, err = db.Exec(query, packageID, config.Version, config.Platform, tempBinaryPath, signature)

Obvious Things That Might Be Missed:

  • ☐ Signing service needs Ed25519 private key from env/config
  • ☐ Missing key should fail fast with clear error message
  • ☐ Signature format must match what agent expects (base64? hex?)
  • ☐ Need to store signing key fingerprint for verification

1.3 Update Download Handler

Current State: Serves generic unsigned binaries from /app/binaries/

Target State: Serve signed versions first, fallback to unsigned

Implementation:

// In downloads.go:175,244 - modify downloadHandler

func (h *DownloadHandler) DownloadAgent(c *gin.Context) {
    platform := c.Param("platform")
    version := c.Query("version") // Optional version parameter

    // Try to get signed package first
    if version != "" {
        signedPackage, err := h.packageQueries.GetSignedPackage(version, platform)
        if err == nil && fileExists(signedPackage.BinaryPath) {
            // Serve signed version
            log.Printf("Serving signed binary v%s for platform %s", version, platform)
            c.File(signedPackage.BinaryPath)
            return
        }
    }

    // Fallback to unsigned generic binary
    genericPath := fmt.Sprintf("/app/binaries/%s/redflag-agent", platform)
    if fileExists(genericPath) {
        log.Printf("Serving unsigned generic binary for platform %s", platform)
        c.File(genericPath)
        return
    }

    c.JSON(http.StatusNotFound, gin.H{
        "error": "no binary available",
        "platform": platform,
        "version": version,
    })
}

Obvious Things That Might Be Missed:

  • ☐ Add version query parameter support to API route
  • ☐ Update install script to request specific version
  • ☐ Handle 404 gracefully in installer (current installer assumes binary exists)
  • ☐ Add signature verification step in install script
  • ☐ Need fileExists() helper or use os.Stat() with error handling

1.4 Verify Subsystem Registration in Install Flow

Critical: Build orchestrator must ensure subsystems are properly configured

Current Issue: Installer may create agent without subsystems enabled

Implementation:

// After agent registration, ensure subsystems are created:
func ensureDefaultSubsystems(agentID uuid.UUID) error {
    defaultSubsystems := []string{"updates", "storage", "system", "docker"}

    for _, subsystem := range defaultSubsystems {
        // Check if already exists
        exists, err := subsystemQueries.Exists(agentID, subsystem)
        if err != nil {
            return err
        }

        if !exists {
            // Create with defaults
            err := subsystemQueries.Create(agentID, subsystem, models.SubsystemConfig{
                Enabled:     true,
                AutoRun:     true,
                Interval:    getDefaultInterval(subsystem),
                CircuitBreaker: models.DefaultCircuitBreaker(),
            })
            if err != nil {
                return err
            }
        }
    }
    return nil
}

Obvious Things That Might Be Missed:

  • ☐ Subsystem creation must happen BEFORE scheduler loads jobs
  • ☐ Need to prevent duplicate subsystem entries
  • ☐ Update agent builder to call this function
  • ☐ Migration: Existing agents without subsystems need backfill

1.5 Testing & Verification

Integration Test Steps:

# Setup:
1. Start fresh server with empty agent_update_packages table
2. Create registration token (1 seat)
3. Install agent on test machine

# Test 1: First agent update
4. Admin clicks "Update Agent" in UI
5. Verify: POST /api/v1/build/upgrade returns 200
6. Verify: agent_update_packages has 1 row (version, platform, signature)
7. Verify: Binary file exists at /app/binaries/{platform}/
8. Agent checks for update → receives signed binary
9. Verify: Agent download succeeds
10. Verify: Agent verifies signature → installs update
Expected: Agent running new version, config preserved

# Test 2: Second agent (same version)
11. Register second agent with same token
12. Click "Update Agent"
13. Verify: No new binary built (reuses existing signed package)
14. Verify: Second agent downloads same binary successfully

# Test 3: Version upgrade
15. Server upgraded to v0.1.24
16. First agent checks in → 426 Upgrade Required
17. Admin triggers update → new signed binary for v0.1.24
18. Agent downloads v0.1.24 binary
19. Verify: Agent version updated in database
20. Verify: Agent continues checking in normally

Verification Checklist:

  • ☐ API returns proper download URLs (not Docker commands)
  • ☐ Binary signature verifies on agent side
  • ☐ Config.json preserved across updates (not overwritten)
  • ☐ Agent restarts successfully after update
  • ☐ Subsystems continue working after update
  • ☐ Machine ID binding remains enforced
  • ☐ Token refresh continues working

Phase 2: Middleware Version Upgrade Fix (HIGH - Week 2)

Priority: 🟠 HIGH - Prevents agents from getting bricked | Estimated Time: 2-3 days | Testing Required: Version upgrade scenario test

2.1 Middleware Enhancement

Problem: Machine binding middleware blocks version upgrades (returns 426), creating catch-22 where agent can't upgrade database version.

Solution: Make middleware "update-aware" - detect upgrading agents and allow them through with nonce validation.

Implementation:

a) Add update fields to agents table

-- Migration 020:
ALTER TABLE agents
ADD COLUMN IF NOT EXISTS is_updating BOOLEAN DEFAULT FALSE,
ADD COLUMN IF NOT EXISTS update_nonce VARCHAR(64),
ADD COLUMN IF NOT EXISTS update_nonce_expires_at TIMESTAMPTZ;

Obvious Things That Might Be Missed:

  • ☐ Set default FALSE (not null)
  • ☐ Add index on is_updating for query performance
  • ☐ Consider cleanup job for stale update nonces

b) Enhance machine_binding.go middleware

// In MachineBindingMiddleware, add update detection logic:

func MachineBindingMiddleware() gin.HandlerFunc {
    return func(c *gin.Context) {
        // ... existing machine ID validation ...

        // Check if agent is reporting upgrade completion
        reportedVersion := c.GetHeader("X-Agent-Version")
        updateNonce := c.GetHeader("X-Update-Nonce")

        if agent.IsUpdating != nil && *agent.IsUpdating {
            // Validate upgrade (not downgrade)
            if !utils.IsNewerOrEqualVersion(reportedVersion, agent.CurrentVersion) {
                log.Printf("DOWNGRADE ATTEMPT: Agent %s reporting version %s < current %s",
                    agentID, reportedVersion, agent.CurrentVersion)
                c.JSON(http.StatusForbidden, gin.H{"error": "downgrade not allowed"})
                c.Abort()
                return
            }

            // Validate nonce (proves server authorized update)
            if err := validateUpdateNonce(updateNonce, agent.ServerPublicKey); err != nil {
                log.Printf("Invalid update nonce for agent %s: %v", agentID, err)
                c.JSON(http.StatusForbidden, gin.H{"error": "invalid update nonce"})
                c.Abort()
                return
            }

            // Valid upgrade - complete it
            log.Printf("Completing update for agent %s: %s → %s",
                agentID, agent.CurrentVersion, reportedVersion)

            go agentQueries.CompleteAgentUpdate(agentID, reportedVersion)

            // Allow request through
            c.Next()
            return
        }

        // Normal version check (not in update)
        // ... existing code ...
    }
}

func validateUpdateNonce(nonce string, serverPublicKeyB64 string) error {
    // Decode the server's Ed25519 public key (stored base64-encoded)
    pubKeyBytes, err := base64.StdEncoding.DecodeString(serverPublicKeyB64)
    if err != nil || len(pubKeyBytes) != ed25519.PublicKeySize {
        return fmt.Errorf("invalid server public key")
    }

    // Parse nonce: format is "uuid:timestamp:signature" (signature base64-encoded)
    parts := strings.Split(nonce, ":")
    if len(parts) != 3 {
        return fmt.Errorf("invalid nonce format")
    }

    // Verify Ed25519 signature over "uuid:timestamp"
    message := parts[0] + ":" + parts[1]
    signature, err := base64.StdEncoding.DecodeString(parts[2])
    if err != nil {
        return fmt.Errorf("invalid signature encoding")
    }
    if !ed25519.Verify(ed25519.PublicKey(pubKeyBytes), []byte(message), signature) {
        return fmt.Errorf("invalid nonce signature")
    }

    // Verify timestamp (< 5 minutes old)
    timestamp, err := strconv.ParseInt(parts[1], 10, 64)
    if err != nil {
        return fmt.Errorf("invalid timestamp")
    }
    if time.Now().Unix()-timestamp > 300 { // 5 minutes
        return fmt.Errorf("nonce expired")
    }

    return nil
}
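
The middleware also relies on utils.IsNewerOrEqualVersion. A minimal stand-in comparing dotted numeric versions like "0.1.23.4" — the real utility's behavior is assumed, not confirmed:

```go
package main

import (
	"strconv"
	"strings"
)

// isNewerOrEqualVersion reports whether candidate >= current, comparing
// dot-separated numeric components left to right. Missing components are
// treated as zero, so "0.2" >= "0.2.0". Purely numeric components are assumed.
func isNewerOrEqualVersion(candidate, current string) bool {
	a := strings.Split(candidate, ".")
	b := strings.Split(current, ".")
	for i := 0; i < len(a) || i < len(b); i++ {
		var x, y int
		if i < len(a) {
			x, _ = strconv.Atoi(a[i])
		}
		if i < len(b) {
			y, _ = strconv.Atoi(b[i])
		}
		if x != y {
			return x > y
		}
	}
	return true // equal versions count as "newer or equal"
}
```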

Obvious Things That Might Be Missed:

  • ☐ Agent must send X-Agent-Version header during check-in when updating
  • ☐ Agent must send X-Update-Nonce header with server's signed nonce
  • ☐ Server must have agent.ServerPublicKey available (from TOFU)
  • ☐ Update nonce must be generated by server when update triggered
  • ☐ Nonce must be stored temporarily (redis or database with TTL)
  • ☐ Clean up expired nonces (cron job or TTL index)

c) Update agent to send headers

// In aggregator-agent/cmd/agent/main.go:checkInAgent()

func checkInAgent(cfg *config.Config) error {
    req, err := http.NewRequest("POST", cfg.ServerURL+"/api/v1/agents/metrics", body)
    if err != nil {
        return err
    }

    // Always send machine ID
    machineID, _ := system.GetMachineID()
    req.Header.Set("X-Machine-ID", machineID)

    // Send agent version
    req.Header.Set("X-Agent-Version", cfg.AgentVersion)

    // If updating, include nonce
    if cfg.UpdateInProgress {
        req.Header.Set("X-Update-Nonce", cfg.UpdateNonce)
    }

    // ... rest of request ...
}

Obvious Things That Might Be Missed:

  • ☐ Agent must persist update_nonce across restarts (STATE_DIR file)
  • ☐ Agent must clear update flag after successful update
  • ☐ Need to handle case where agent crashes mid-update

2.2 Testing Version Upgrade Scenario

Test Steps:

# Setup:
1. Have agent v0.1.17 in database
2. Have agent binary v0.1.23 on machine
3. Agent checks in → expecting 426 currently

# After fix:
4. Admin triggers update for agent
5. Server sets is_updating=true, generates nonce, stores nonce with agent_id
6. Agent checks in with X-Update-Nonce header
7. Middleware validates nonce, allows through
8. Agent reports v0.1.23
9. Server updates agent.current_version → completes update
10. Server clears is_updating flag

# Verify:
11. Agent no longer gets 426
12. Agent shows v0.1.23 in dashboard
13. Agent receives commands normally

Verification Checklist:

  • ☐ Agent can upgrade from v0.1.17 → v0.1.23
  • ☐ No manual intervention required (agent auto-updates)
  • ☐ Middleware allows upgrade with valid nonce
  • ☐ Middleware rejects downgrade attempts
  • ☐ Invalid nonce causes rejection (security test)
  • ☐ Expired nonce causes rejection (security test)
  • ☐ Machine ID binding remains enforced during update

Phase 3: Security Hardening (MEDIUM - Week 3)

Priority: 🟡 MEDIUM - Improvements, not blockers | Estimated Time: 2-3 days | Testing Required: Security test scenarios

3.1 Remove JWT Secret Logging

Current Issue: JWT secret logged in debug when generated

Implementation:

// aggregator-server/cmd/server/main.go:105-108

if cfg.Admin.JWTSecret == "" {
    cfg.Admin.JWTSecret = GenerateSecureToken()
    // REMOVE: log.Printf("Generated JWT secret: %s", cfg.Admin.JWTSecret)
    // Instead: log.Printf("JWT secret generated (not logged for security)")
}

Obvious Things That Might Be Missed:

  • ☐ Add environment variable for debug mode: REDFLAG_DEBUG=true
  • ☐ Only log secret when debug is explicitly enabled
  • ☐ Warn in production if JWT_SECRET not set (don't generate randomly)
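
For reference, a typical GenerateSecureToken draws from crypto/rand; this is a sketch, and the real function may use a different length or encoding:

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
)

// GenerateSecureToken returns a 256-bit random token, URL-safe base64
// encoded. Length and encoding are assumptions about the real helper.
func GenerateSecureToken() string {
	b := make([]byte, 32)
	if _, err := rand.Read(b); err != nil {
		// crypto/rand failure is unrecoverable; fail fast.
		panic(err)
	}
	return base64.RawURLEncoding.EncodeToString(b)
}
```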

3.2 Implement Per-Server JWT Secrets

Current Issue: All servers share same JWT secret if not explicitly set

Implementation:

// In GenerateAgentToken():
// Generate unique secret per server on first boot, store in database

func ensureServerJWTSecret(db *sqlx.DB) (string, error) {
    // Check if secret exists in settings
    var secret string
    query := `SELECT value FROM settings WHERE key = 'jwt_secret'`
    err := db.Get(&secret, query)

    if err == sql.ErrNoRows {
        // Generate new secret
        secret = GenerateSecureToken()

        // Store in database
        insert := `INSERT INTO settings (key, value) VALUES ('jwt_secret', $1)`
        _, err = db.Exec(insert, secret)
        if err != nil {
            return "", err
        }

        log.Printf("Generated new JWT secret for this server")
    }

    return secret, nil
}

Migration:

-- Create settings table if doesn't exist
CREATE TABLE IF NOT EXISTS settings (
    key VARCHAR(100) PRIMARY KEY,
    value TEXT NOT NULL,
    created_at TIMESTAMPTZ DEFAULT NOW(),
    updated_at TIMESTAMPTZ DEFAULT NOW()
);

-- For existing installations, seed the current secret via a psql variable;
-- plain SQL cannot read environment variables, so run the migration as:
--   psql -v jwt_secret="$REDFLAG_JWT_SECRET" -f <migration file>
INSERT INTO settings (key, value)
SELECT 'jwt_secret', :'jwt_secret'
WHERE NOT EXISTS (SELECT 1 FROM settings WHERE key = 'jwt_secret');

Obvious Things That Might Be Missed:

  • ☐ Existing agents with old tokens will be invalidated (migration window needed)
  • ☐ Need to support multiple valid secrets during rotation period
  • ☐ Document secret rotation procedure for admins

3.3 Clean Dead Code

Files to clean:

a) Remove TLSConfig struct (not used)

// aggregator-agent/internal/config/config.go:23-29

// REMOVE:
type TLSConfig struct {
    InsecureSkipVerify bool   `json:"insecure_skip_verify"`
    CertFile           string `json:"cert_file,omitempty"`
    KeyFile            string `json:"key_file,omitempty"`
    CAFile             string `json:"ca_file,omitempty"`
}

// From Config struct, remove:
TLS TLSConfig `json:"tls,omitempty"`

// REMOVE from Load() function any TLS config loading

Rationale: Client certificate authentication was planned but rejected in favor of machine ID binding. This is dead code that will confuse future developers.

b) Remove Docker-specific build artifacts from structs

// aggregator-server/internal/services/agent_builder.go:53-60

// REMOVE from BuildResult:
ComposeFile string    `json:"compose_file"`
Dockerfile  string    `json:"dockerfile"`

// Update all references to BuildResult throughout codebase

c) Update go.mod to remove unused dependencies

# After removing Docker code:
go mod tidy
# Verify no Docker-related imports remain

Obvious Things That Might Be Missed:

  • ☐ Check test files for references to removed fields
  • ☐ Check Web UI for references to Docker fields in API responses
  • ☐ Update API documentation to remove Docker endpoints
  • ☐ Search for TODO comments referencing Docker implementation
  • ☐ Check for mocked Docker functions in test files

Phase 4: Documentation & Handover (Week 4)

4.1 Update Decision.md

Add findings from implementation:

  • Final architecture decisions
  • Performance metrics observed
  • Security boundaries validated
  • Any deviations from original plan

4.2 Create CHANGELOG.md Entry

## v0.2.0 - Security Hardening & Build Orchestrator

### Breaking Changes
- Build orchestrator now generates native binaries instead of Docker configs
- API response format changed for /api/v1/build/* endpoints
- Agent update flow now requires version parameter

### New Features
- Ed25519 signed agent binaries with automatic verification
- Machine ID binding enforced on all endpoints
- Command acknowledgment system (at-least-once delivery)
- Version upgrade middleware (fixes catch-22)

### Security Improvements
- Per-server JWT secrets (not shared)
- Token refresh with nonce validation
- Config protected by file permissions (0600)

### Removed
- Docker container deployment option (native binaries only)
- TLSConfig from agent config (dead code)

4.3 Create Migration Guide

For existing installations (v0.1.17-v0.1.23.4):

# 1. Backup database
pg_dump redflag > backup-pre-v020.sql

# 2. Apply migrations 018-020
migrate -path migrations -database postgres://... up

# 3. Set JWT secret if not already set
export REDFLAG_JWT_SECRET=$(tr -dc 'a-zA-Z0-9' </dev/urandom | head -c 64)

# 4. Restart server (generates signing key if missing)
systemctl restart redflag-server

# 5. Verify signing key configured
curl http://localhost:8080/api/v1/security/signing

# 6. Trigger agent updates (all agents will update to signed binaries)
# Admin UI → Agents → Select All → Update Agent

4.4 Update README.md

Key sections to update:

  1. Architecture diagram (remove Docker, add signing flow)
  2. Security model (document machine ID binding)
  3. Installation instructions (systemctl, not Docker)
  4. Configuration reference (remove TLSConfig)
  5. API documentation (update build endpoints)

Testing & Quality Assurance

Unit Tests

// Test packages needed:
1. Test signing service (SignFile, VerifyFile)
2. Test build orchestrator (BuildAgentArtifacts)
3. Test machine binding middleware (with update scenario)
4. Test token renewal with nonce validation
5. Test download handler (signed vs unsigned fallback)

// Run tests with coverage:
go test ./... -cover -v
// Target: >80% coverage on security-critical packages

Integration Tests

#!/bin/bash
# integration_test.sh

echo "Starting RedFlag v0.2.0 integration test..."

# Setup test environment
docker-compose up -d postgres
echo "Waiting for database..."
sleep 5

# Start server
cd aggregator-server
go run cmd/server/main.go &
SERVER_PID=$!
echo "Waiting for server..."
sleep 10

# Run test scenarios:
./tests/test_registration.sh
./tests/test_machine_binding.sh
./tests/test_build_orchestrator.sh
./tests/test_signed_updates.sh
./tests/test_token_renewal.sh
./tests/test_command_acknowledgment.sh

# Cleanup
kill $SERVER_PID
docker-compose down

echo "All tests passed!"

Test Scenarios:

  1. Registration: New agent registers, gets tokens, machine ID stored
  2. Machine Binding: Attempt from different machine → 403 Forbidden
  3. Build Orchestrator: Build signed binary → verify signature → download
  4. Signed Updates: Agent updates → signature verification → successful install
  5. Token Renewal: With nonce → successful renewal → version updated
  6. Command Acknowledgment: Agent sends ack → server processes → queue cleared

Security Testing

# Penetration test checklist:
□ Attempt registration with stolen token (should fail if seats full)
□ Copy config.json to different machine (should fail machine binding)
□ Modify binary and attempt update (signature verification should fail)
□ Replay old nonce (timestamp check should fail)
□ Use expired JWT (should be rejected)
□ Attempt downgrade attack (middleware should reject)
□ Try to access agent data from wrong agent ID (auth should block)
□ Test token renewal without nonce (should fail)

Performance Benchmarks

Target Metrics:

| Operation                  | Target Time | Notes                            |
|----------------------------|-------------|----------------------------------|
| Sign binary (per version)  | < 50ms      | Ed25519 is fast                  |
| Build artifacts generation | < 500ms     | Mostly file I/O                  |
| Token renewal with nonce   | < 100ms     | Includes DB write                |
| Machine ID validation      | < 10ms      | Database lookup                  |
| Download signed binary     | < 5s        | Depends on network               |
| Agent update process       | < 30s       | Including verification & restart |

Scalability Targets:

  • 1,000 agents: Update all in < 5 minutes (CDN caching)
  • 10,000 agents: Update all in < 1 hour (CDN caching)
  • Token renewal: 1,000 req/sec (stateless JWT validation)
  • Database: < 10% CPU at 1k agents


Deployment Checklist

Pre-Deployment

  • All unit tests passing (coverage >80%)
  • All integration tests passing
  • Security tests passing
  • Performance benchmarks met
  • Documentation updated (Decision.md, CHANGELOG.md, README.md)
  • Migration scripts tested
  • Backup procedure documented
  • Rollback plan documented

Deployment Steps

  1. Announce maintenance window (4 hours recommended)
  2. Create database backup
  3. Stop agent schedulers (prevent command generation)
  4. Stop server
  5. Apply migrations 018-020
  6. Set required environment variables:
    • REDFLAG_JWT_SECRET (min 32 chars)
    • REDFLAG_SIGNING_PRIVATE_KEY (if not using keygen)
  7. Start server
  8. Verify server starts without errors
  9. Verify health endpoint: GET /api/v1/health
  10. Verify signing endpoint: GET /api/v1/security/signing
  11. Start agent schedulers
  12. Trigger agent updates (optional, can be gradual)
  13. Monitor logs for errors
  14. Verify agent connectivity

Post-Deployment

  • Monitor error rates for 24 hours
  • Verify agent update success rate >95%
  • Check database for anomalies (duplicate subsystems, etc.)
  • Review logs for security violations (machine ID mismatches)
  • Performance metrics within targets
  • Update documentation with any deviations

Rollback Plan

If critical issues found:

  1. Stop server
  2. Restore database backup
  3. Revert to v0.1.23.4 code
  4. Restart server
  5. Notify users of rollback
  6. Document issue for v0.2.1 fix

Known Risks & Mitigations

| Risk                                          | Probability | Impact | Mitigation                               |
|-----------------------------------------------|-------------|--------|------------------------------------------|
| Build orchestrator produces invalid signature | Medium      | High   | Unit tests + manual verification         |
| Token renewal fails during update             | Low         | High   | Graceful fallback to re-registration     |
| Machine ID collision (rare)                   | Low         | Medium | Hardware fingerprint + agent_id composite |
| JWT secret exposed in logs                    | Medium      | Medium | Remove logging + use per-server secrets  |
| Subsystems not attached after update          | Low         | Medium | EnsureDefaultSubsystems() called         |
| Dead code causes confusion                    | High        | Low    | Clean TLSConfig, BuildResult fields      |
| CDN caches unsigned binary                    | Low         | High   | Use version-specific URLs                |

Success Criteria

Functional:
- [ ] Agent can successfully update from v0.1.17 → v0.1.23 → v0.2.0
- [ ] Signed binary verification passes
- [ ] Machine ID binding prevents cross-machine impersonation
- [ ] Token renewal with nonce validation works
- [ ] Command acknowledgment system operational
- [ ] Subsystems properly attached after update

Performance:
- [ ] Update completes in < 30 seconds
- [ ] Server handles 1000 concurrent agents
- [ ] Token renewal < 100ms
- [ ] No database deadlocks under load

Security:
- [ ] Ed25519 signatures verified on agent
- [ ] JWT secret not logged in production
- [ ] Per-server JWT secrets implemented
- [ ] Machine ID mismatch logs security alert
- [ ] Token theft from decommissioned agent mitigated by machine binding


Handover Notes for Claude 4.5

@Fimeg, this plan is your implementation guide. Key points:

  1. Focus on Phase 1 first - Build orchestrator alignment is critical and blocks everything else
  2. Test as you go - Don't wait until end, integration testing is crucial
  3. Clean up dead code - TLSConfig and the Docker fields in structs need to be removed
  4. Verify subsystems - Make sure they're attached during agent registration/update
  5. Machine binding is THE security boundary - Token rotation is less important
  6. Ask questions - If anything is unclear, we have logs of all discussions

Time budget: Expect 3-4 weeks for full implementation. Phase 1 is most complex. Phases 2-4 are straightforward.

Resources:

  • Decision.md - Architecture decisions
  • Status.md - Current state
  • todayupdate.md - Historical context
  • answer.md - Token system analysis
  • SECURITY_AUDIT.md - Security boundaries

When stuck: Review the "Obvious Things That Might Be Missed" sections - they're based on actual issues we identified.

Good luck! 🚀


Document Version: 1.0 | Created: 2025-11-10 | Last Updated: 2025-11-10 | Prepared by: @Kimi + @Fimeg + @Grok | Ready for Implementation: Yes