RedFlag Security System Implementation Plan v0.2.0
Date: 2025-11-10 Version: 0.1.23.4 → 0.2.0 Status: Implementation Ready Owner: Claude 4.5 (@Fimeg)
Executive Summary
Goal: Implement the RedFlag security architecture as designed - TOFU-first, Ed25519-signed binaries, machine ID binding, and command acknowledgment system.
Critical Discovery: Build orchestrator generates Docker configs while install script expects signed native binaries. This is blocking the entire update workflow (404 errors on agent updates).
Decision: Implement Option 2 (Per-Version/Platform Signing) from Decision.md - sign generic binaries once per version/platform, serve to all agents, keep config.json separate.
Scope: This plan covers the complete implementation from current state (v0.1.23.4) to production-ready v0.2.0.
Phase 1: Build Orchestrator Alignment (CRITICAL - Week 1)
Priority: 🔴 CRITICAL - Blocking all agent updates Estimated Time: 3-4 days Testing Required: Integration test with live agent upgrade
1.1 Replace Docker Config Generation with Signed Binary Flow
Current State:
// agent_builder.go:171-245
generateDockerCompose() → docker-compose.yml
generateDockerfile() → Dockerfile
generateEmbeddedConfig() → Go source with config
Result: Docker deployment configs (WRONG)
Target State:
// New flow:
1. Take generic binary from /app/binaries/{platform}/
2. Sign binary with Ed25519 private key
3. Store package metadata in agent_update_packages table
4. Generate config.json separately
5. Return download URLs for both
Implementation Steps:
a) Modify agent_builder.go
// REMOVE these functions:
- generateDockerCompose() → Delete
- generateDockerfile() → Delete
- BuildAgentWithConfig() → Rewrite completely
// NEW signature:
func (ab *AgentBuilder) BuildAgentArtifacts(config *AgentConfiguration) (*BuildArtifacts, error) {
// Step 1: Generate agent-specific config.json (not embedded in binary)
configJSON, err := generateConfigJSON(config)
if err != nil {
return nil, fmt.Errorf("config generation failed: %w", err)
}
// Step 2: Copy generic binary to temp location (don't modify binary)
// Generic binary built by Docker multi-stage build in /app/binaries/
genericBinaryPath := fmt.Sprintf("/app/binaries/%s/redflag-agent", config.Platform)
tempBinaryPath := fmt.Sprintf("/tmp/agent-build-%s/redflag-agent", config.AgentID)
if err := copyFile(genericBinaryPath, tempBinaryPath); err != nil {
return nil, fmt.Errorf("binary copy failed: %w", err)
}
// Step 3: Sign the binary (do not embed config)
signingService := services.NewSigningService()
signature, err := signingService.SignFile(tempBinaryPath)
if err != nil {
return nil, fmt.Errorf("signing failed: %w", err)
}
// Step 4: Return artifact locations (don't return Docker configs)
return &BuildArtifacts{
AgentID: config.AgentID,
ConfigJSON: configJSON,
BinaryPath: tempBinaryPath,
Signature: signature,
Platform: config.Platform,
Version: config.Version,
}, nil
}
Obvious Things That Might Be Missed:
- ☐ The `BuildAgentWithConfig` function is called from multiple places - update all callers
- ☐ The `BuildResult` struct has fields for Docker artifacts - remove them or they'll be dead code
- ☐ Check `build_orchestrator.go` handlers - they construct responses expecting Docker files
- ☐ Update API documentation if it references the Docker build process
- ☐ The install script expects `docker-compose.yml` instructions - it will break if we just remove without updating
b) Update build_orchestrator.go handlers
// In NewAgentBuild and UpgradeAgentBuild handlers:
// REMOVE: Response fields for Docker
"compose_file": buildResult.ComposeFile,
"dockerfile": buildResult.Dockerfile,
"next_steps": []string{
"1. docker build -t " + buildResult.ImageTag + " .",
"2. docker compose up -d",
}
// ADD: Response fields for native binary
"config_url": "/api/v1/config/" + config.AgentID,
"binary_url": "/api/v1/downloads/" + config.Platform,
"signature": signature,
"next_steps": []string{
"1. Download binary from: " + binaryURL,
"2. Download config from: " + configURL,
"3. Place redflag-agent in /usr/local/bin/",
"4. Place config.json in /etc/redflag/",
"5. Run: systemctl enable --now redflag-agent",
}
Obvious Things That Might Be Missed:
- ☐ The `AgentSetupRequest` struct might have Docker-specific fields - clean those up
- ☐ The installer script (`downloads.go:537-831`) parses this response - keep it compatible
- ☐ Web UI shows build instructions - update to show systemctl commands, not Docker
- ☐ API response structure changes break backward compatibility with older installers
c) Update database schema for signed packages
-- In migration file (create new migration 019):
-- agent_update_packages table exists but may need tweaks
ALTER TABLE agent_update_packages
ADD COLUMN IF NOT EXISTS config_path VARCHAR(255),
ADD COLUMN IF NOT EXISTS platform VARCHAR(20) NOT NULL DEFAULT 'linux-amd64',
ADD COLUMN IF NOT EXISTS version VARCHAR(20) NOT NULL DEFAULT '0.0.0';
-- NOT NULL columns need a DEFAULT: the ALTER fails if the table has existing rows
-- Add indexes for performance
CREATE INDEX IF NOT EXISTS idx_agent_updates_version_platform
ON agent_update_packages(version, platform)
WHERE agent_id IS NULL;
-- For per-agent packages (the index above covers generic per-version packages)
CREATE INDEX IF NOT EXISTS idx_agent_updates_per_agent
ON agent_update_packages(agent_id, version, platform)
WHERE agent_id IS NOT NULL;
Obvious Things That Might Be Missed:
- ☐ Migration needs to handle existing empty table gracefully
- ☐ Need both per-agent and generic package indexes
- ☐ Consider adding expires_at for automatic cleanup
1.2 Integrate Signing Service
Current State: Signing service exists but isn't called from build pipeline
Implementation:
// In BuildAgentArtifacts (from 1.1a):
signingService := services.NewSigningService()
signature, err := signingService.SignFile(tempBinaryPath)
if err != nil {
return nil, fmt.Errorf("signing failed: %w", err)
}
// Store in database
packageID := uuid.New()
query := `
INSERT INTO agent_update_packages (id, version, platform, binary_path, signature, created_at)
VALUES ($1, $2, $3, $4, $5, NOW())
`
_, err = db.Exec(query, packageID, config.Version, config.Platform, tempBinaryPath, signature)
Obvious Things That Might Be Missed:
- ☐ Signing service needs Ed25519 private key from env/config
- ☐ Missing key should fail fast with clear error message
- ☐ Signature format must match what agent expects (base64? hex?)
- ☐ Need to store signing key fingerprint for verification
1.3 Update Download Handler
Current State: Serves generic unsigned binaries from /app/binaries/
Target State: Serve signed versions first, fallback to unsigned
Implementation:
// In downloads.go:175,244 - modify downloadHandler
func (h *DownloadHandler) DownloadAgent(c *gin.Context) {
platform := c.Param("platform")
version := c.Query("version") // Optional version parameter
// Try to get signed package first
if version != "" {
signedPackage, err := h.packageQueries.GetSignedPackage(version, platform)
if err == nil && fileExists(signedPackage.BinaryPath) {
// Serve signed version
log.Printf("Serving signed binary v%s for platform %s", version, platform)
c.File(signedPackage.BinaryPath)
return
}
}
// Fallback to unsigned generic binary
genericPath := fmt.Sprintf("/app/binaries/%s/redflag-agent", platform)
if fileExists(genericPath) {
log.Printf("Serving unsigned generic binary for platform %s", platform)
c.File(genericPath)
return
}
c.JSON(http.StatusNotFound, gin.H{
"error": "no binary available",
"platform": platform,
"version": version,
})
}
Obvious Things That Might Be Missed:
- ☐ Add `version` query parameter support to API route
- ☐ Update install script to request specific version
- ☐ Handle 404 gracefully in installer (current installer assumes binary exists)
- ☐ Add signature verification step in install script
- ☐ Need `fileExists()` helper or use `os.Stat()` with error handling
1.4 Verify Subsystem Registration in Install Flow
Critical: Build orchestrator must ensure subsystems are properly configured
Current Issue: Installer may create agent without subsystems enabled
Implementation:
// After agent registration, ensure subsystems are created:
func ensureDefaultSubsystems(agentID uuid.UUID) error {
defaultSubsystems := []string{"updates", "storage", "system", "docker"}
for _, subsystem := range defaultSubsystems {
// Check if already exists
exists, err := subsystemQueries.Exists(agentID, subsystem)
if err != nil {
return err
}
if !exists {
// Create with defaults
err := subsystemQueries.Create(agentID, subsystem, models.SubsystemConfig{
Enabled: true,
AutoRun: true,
Interval: getDefaultInterval(subsystem),
CircuitBreaker: models.DefaultCircuitBreaker(),
})
if err != nil {
return err
}
}
}
return nil
}
Obvious Things That Might Be Missed:
- ☐ Subsystem creation must happen BEFORE scheduler loads jobs
- ☐ Need to prevent duplicate subsystem entries
- ☐ Update agent builder to call this function
- ☐ Migration: Existing agents without subsystems need backfill
1.5 Testing & Verification
Integration Test Steps:
# Setup:
1. Start fresh server with empty agent_update_packages table
2. Create registration token (1 seat)
3. Install agent on test machine
# Test 1: First agent update
4. Admin clicks "Update Agent" in UI
5. Verify: POST /api/v1/build/upgrade returns 200
6. Verify: agent_update_packages has 1 row (version, platform, signature)
7. Verify: Binary file exists at /app/binaries/{platform}/
8. Agent checks for update → receives signed binary
9. Verify: Agent download succeeds
10. Verify: Agent verifies signature → installs update
Expected: Agent running new version, config preserved
# Test 2: Second agent (same version)
11. Register second agent with same token
12. Click "Update Agent"
13. Verify: No new binary built (reuses existing signed package)
14. Verify: Second agent downloads same binary successfully
# Test 3: Version upgrade
15. Server upgraded to v0.1.24
16. First agent checks in → 426 Upgrade Required
17. Admin triggers update → new signed binary for v0.1.24
18. Agent downloads v0.1.24 binary
19. Verify: Agent version updated in database
20. Verify: Agent continues checking in normally
Verification Checklist:
- ☐ API returns proper download URLs (not Docker commands)
- ☐ Binary signature verifies on agent side
- ☐ Config.json preserved across updates (not overwritten)
- ☐ Agent restarts successfully after update
- ☐ Subsystems continue working after update
- ☐ Machine ID binding remains enforced
- ☐ Token refresh continues working
Phase 2: Middleware Version Upgrade Fix (HIGH - Week 2)
Priority: 🟠 HIGH - Prevents agents from getting bricked Estimated Time: 2-3 days Testing Required: Version upgrade scenario test
2.1 Middleware Enhancement
Problem: Machine binding middleware blocks version upgrades (returns 426), creating catch-22 where agent can't upgrade database version.
Solution: Make middleware "update-aware" - detect upgrading agents and allow them through with nonce validation.
Implementation:
a) Add update fields to agents table
-- Migration 020:
ALTER TABLE agents
ADD COLUMN IF NOT EXISTS is_updating BOOLEAN DEFAULT FALSE,
ADD COLUMN IF NOT EXISTS update_nonce VARCHAR(64),
ADD COLUMN IF NOT EXISTS update_nonce_expires_at TIMESTAMPTZ;
Obvious Things That Might Be Missed:
- ☐ Set default FALSE (not null)
- ☐ Add index on is_updating for query performance
- ☐ Consider cleanup job for stale update nonces
b) Enhance machine_binding.go middleware
// In MachineBindingMiddleware, add update detection logic:
func MachineBindingMiddleware() gin.HandlerFunc {
return func(c *gin.Context) {
// ... existing machine ID validation ...
// Check if agent is reporting upgrade completion
reportedVersion := c.GetHeader("X-Agent-Version")
updateNonce := c.GetHeader("X-Update-Nonce")
if agent.IsUpdating != nil && *agent.IsUpdating {
// Validate upgrade (not downgrade)
if !utils.IsNewerOrEqualVersion(reportedVersion, agent.CurrentVersion) {
log.Printf("DOWNGRADE ATTEMPT: Agent %s reporting version %s < current %s",
agentID, reportedVersion, agent.CurrentVersion)
c.JSON(http.StatusForbidden, gin.H{"error": "downgrade not allowed"})
c.Abort()
return
}
// Validate nonce (proves server authorized update)
if err := validateUpdateNonce(updateNonce, agent.ServerPublicKey); err != nil {
log.Printf("Invalid update nonce for agent %s: %v", agentID, err)
c.JSON(http.StatusForbidden, gin.H{"error": "invalid update nonce"})
c.Abort()
return
}
// Valid upgrade - complete it
log.Printf("Completing update for agent %s: %s → %s",
agentID, agent.CurrentVersion, reportedVersion)
go agentQueries.CompleteAgentUpdate(agentID, reportedVersion)
// Allow request through
c.Next()
return
}
// Normal version check (not in update)
// ... existing code ...
}
}
func validateUpdateNonce(nonce string, serverPublicKey string) error {
	// Parse nonce: format is "uuid:timestamp:signature"
	parts := strings.SplitN(nonce, ":", 3)
	if len(parts) != 3 {
		return fmt.Errorf("invalid nonce format")
	}
	// Verify timestamp first (cheap check; reject if > 5 minutes old)
	timestamp, err := strconv.ParseInt(parts[1], 10, 64)
	if err != nil {
		return fmt.Errorf("invalid timestamp")
	}
	if time.Now().Unix()-timestamp > 300 { // 5 minutes
		return fmt.Errorf("nonce expired")
	}
	// Verify Ed25519 signature - ed25519.Verify takes raw bytes,
	// so the base64-encoded key and signature must be decoded first
	message := parts[0] + ":" + parts[1] // uuid:timestamp
	pubKey, err := base64.StdEncoding.DecodeString(serverPublicKey)
	if err != nil || len(pubKey) != ed25519.PublicKeySize {
		return fmt.Errorf("invalid server public key")
	}
	signature, err := base64.StdEncoding.DecodeString(parts[2])
	if err != nil {
		return fmt.Errorf("invalid signature encoding")
	}
	if !ed25519.Verify(ed25519.PublicKey(pubKey), []byte(message), signature) {
		return fmt.Errorf("invalid nonce signature")
	}
	return nil
}
Obvious Things That Might Be Missed:
- ☐ Agent must send X-Agent-Version header during check-in when updating
- ☐ Agent must send X-Update-Nonce header with server's signed nonce
- ☐ Server must have agent.ServerPublicKey available (from TOFU)
- ☐ Update nonce must be generated by server when update triggered
- ☐ Nonce must be stored temporarily (redis or database with TTL)
- ☐ Clean up expired nonces (cron job or TTL index)
c) Update agent to send headers
// In aggregator-agent/cmd/agent/main.go:checkInAgent()
func checkInAgent(cfg *config.Config) error {
req, err := http.NewRequest("POST", cfg.ServerURL+"/api/v1/agents/metrics", body)
// Always send machine ID
machineID, _ := system.GetMachineID()
req.Header.Set("X-Machine-ID", machineID)
// Send agent version
req.Header.Set("X-Agent-Version", cfg.AgentVersion)
// If updating, include nonce
if cfg.UpdateInProgress {
req.Header.Set("X-Update-Nonce", cfg.UpdateNonce)
}
// ... rest of request ...
}
Obvious Things That Might Be Missed:
- ☐ Agent must persist update_nonce across restarts (STATE_DIR file)
- ☐ Agent must clear update flag after successful update
- ☐ Need to handle case where agent crashes mid-update
2.2 Testing Version Upgrade Scenario
Test Steps:
# Setup:
1. Have agent v0.1.17 in database
2. Have agent binary v0.1.23 on machine
3. Agent checks in → expecting 426 currently
# After fix:
4. Admin triggers update for agent
5. Server sets is_updating=true, generates nonce, stores nonce with agent_id
6. Agent checks in with X-Update-Nonce header
7. Middleware validates nonce, allows through
8. Agent reports v0.1.23
9. Server updates agent.current_version → completes update
10. Server clears is_updating flag
# Verify:
11. Agent no longer gets 426
12. Agent shows v0.1.23 in dashboard
13. Agent receives commands normally
Verification Checklist:
- ☐ Agent can upgrade from v0.1.17 → v0.1.23
- ☐ No manual intervention required (agent auto-updates)
- ☐ Middleware allows upgrade with valid nonce
- ☐ Middleware rejects downgrade attempts
- ☐ Invalid nonce causes rejection (security test)
- ☐ Expired nonce causes rejection (security test)
- ☐ Machine ID binding remains enforced during update
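The downgrade checks above hinge on `utils.IsNewerOrEqualVersion`, which the plan never defines; a sketch comparing dotted numeric versions like `0.1.23.4` (the utils package signature is an assumption):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// IsNewerOrEqualVersion reports whether a >= b for dotted numeric
// versions like "0.1.23.4"; missing segments count as zero.
func IsNewerOrEqualVersion(a, b string) bool {
	as := strings.Split(strings.TrimPrefix(a, "v"), ".")
	bs := strings.Split(strings.TrimPrefix(b, "v"), ".")
	n := len(as)
	if len(bs) > n {
		n = len(bs)
	}
	for i := 0; i < n; i++ {
		av, bv := 0, 0
		if i < len(as) {
			av, _ = strconv.Atoi(as[i])
		}
		if i < len(bs) {
			bv, _ = strconv.Atoi(bs[i])
		}
		if av != bv {
			return av > bv
		}
	}
	return true // versions are equal
}

func main() {
	fmt.Println(IsNewerOrEqualVersion("0.1.23", "0.1.17"))   // prints "true"
	fmt.Println(IsNewerOrEqualVersion("0.1.17", "0.1.23.4")) // prints "false"
}
```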
Phase 3: Security Hardening (MEDIUM - Week 3)
Priority: 🟡 MEDIUM - Improvements, not blockers Estimated Time: 2-3 days Testing Required: Security test scenarios
3.1 Remove JWT Secret Logging
Current Issue: JWT secret logged in debug when generated
Implementation:
// aggregator-server/cmd/server/main.go:105-108
if cfg.Admin.JWTSecret == "" {
cfg.Admin.JWTSecret = GenerateSecureToken()
// REMOVE: log.Printf("Generated JWT secret: %s", cfg.Admin.JWTSecret)
// Instead: log.Printf("JWT secret generated (not logged for security)")
}
Obvious Things That Might Be Missed:
- ☐ Add environment variable for debug mode: `REDFLAG_DEBUG=true`
- ☐ Only log secret when debug is explicitly enabled
- ☐ Warn in production if JWT_SECRET not set (don't generate randomly)
3.2 Implement Per-Server JWT Secrets
Current Issue: All servers share same JWT secret if not explicitly set
Implementation:
// In GenerateAgentToken():
// Generate unique secret per server on first boot, store in database
func ensureServerJWTSecret(db *sqlx.DB) (string, error) {
// Check if secret exists in settings
var secret string
query := `SELECT value FROM settings WHERE key = 'jwt_secret'`
err := db.Get(&secret, query)
if err == sql.ErrNoRows {
// Generate new secret
secret = GenerateSecureToken()
// Store in database
insert := `INSERT INTO settings (key, value) VALUES ('jwt_secret', $1)`
_, err = db.Exec(insert, secret)
if err != nil {
return "", err
}
log.Printf("Generated new JWT secret for this server")
}
return secret, nil
}
Migration:
-- Create settings table if doesn't exist
CREATE TABLE IF NOT EXISTS settings (
key VARCHAR(100) PRIMARY KEY,
value TEXT NOT NULL,
created_at TIMESTAMPTZ DEFAULT NOW(),
updated_at TIMESTAMPTZ DEFAULT NOW()
);
-- For existing installations, migrate the current secret.
-- Note: Postgres cannot read the REDFLAG_JWT_SECRET environment variable
-- directly (current_setting only reads Postgres GUCs), so pass it in via
-- psql -v jwt_secret="$REDFLAG_JWT_SECRET":
INSERT INTO settings (key, value)
SELECT 'jwt_secret', :'jwt_secret'
WHERE NOT EXISTS (SELECT 1 FROM settings WHERE key = 'jwt_secret');
Obvious Things That Might Be Missed:
- ☐ Existing agents with old tokens will be invalidated (migration window needed)
- ☐ Need to support multiple valid secrets during rotation period
- ☐ Document secret rotation procedure for admins
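Supporting multiple valid secrets during rotation (second checklist item) can be sketched with stdlib HMAC, trying each accepted secret in turn. HS256 and the function names are assumptions - real code would go through the project's JWT library rather than hand-rolling verification:

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
	"strings"
)

// verifyHS256 checks a JWT's HS256 signature against any of the
// currently accepted secrets (new secret first, old one during rotation).
func verifyHS256(token string, secrets []string) bool {
	parts := strings.Split(token, ".")
	if len(parts) != 3 {
		return false
	}
	signed := parts[0] + "." + parts[1]
	sig, err := base64.RawURLEncoding.DecodeString(parts[2])
	if err != nil {
		return false
	}
	for _, s := range secrets {
		mac := hmac.New(sha256.New, []byte(s))
		mac.Write([]byte(signed))
		if hmac.Equal(sig, mac.Sum(nil)) {
			return true
		}
	}
	return false
}

// signHS256 builds a minimal token for the demo (no claims validation).
func signHS256(header, payload, secret string) string {
	h := base64.RawURLEncoding.EncodeToString([]byte(header))
	p := base64.RawURLEncoding.EncodeToString([]byte(payload))
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write([]byte(h + "." + p))
	return h + "." + p + "." + base64.RawURLEncoding.EncodeToString(mac.Sum(nil))
}

func main() {
	tok := signHS256(`{"alg":"HS256"}`, `{"sub":"agent"}`, "old-secret")
	// During rotation both secrets validate; after rotation only the new one.
	fmt.Println(verifyHS256(tok, []string{"new-secret", "old-secret"})) // prints "true"
	fmt.Println(verifyHS256(tok, []string{"new-secret"}))              // prints "false"
}
```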
3.3 Clean Dead Code
Files to clean:
a) Remove TLSConfig struct (not used)
// aggregator-agent/internal/config/config.go:23-29
// REMOVE:
type TLSConfig struct {
InsecureSkipVerify bool `json:"insecure_skip_verify"`
CertFile string `json:"cert_file,omitempty"`
KeyFile string `json:"key_file,omitempty"`
CAFile string `json:"ca_file,omitempty"`
}
// From Config struct, remove:
TLS TLSConfig `json:"tls,omitempty"`
// REMOVE from Load() function any TLS config loading
Rationale: Client certificate authentication was planned but rejected in favor of machine ID binding. This is dead code that will confuse future developers.
b) Remove Docker-specific build artifacts from structs
// aggregator-server/internal/services/agent_builder.go:53-60
// REMOVE from BuildResult:
ComposeFile string `json:"compose_file"`
Dockerfile string `json:"dockerfile"`
// Update all references to BuildResult throughout codebase
c) Update go.mod to remove unused dependencies
# After removing Docker code:
go mod tidy
# Verify no Docker-related imports remain
Obvious Things That Might Be Missed:
- ☐ Check test files for references to removed fields
- ☐ Check Web UI for references to Docker fields in API responses
- ☐ Update API documentation to remove Docker endpoints
- ☐ Search for TODO comments referencing Docker implementation
- ☐ Check for mocked Docker functions in test files
Phase 4: Documentation & Handover (Week 4)
4.1 Update Decision.md
Add findings from implementation:
- Final architecture decisions
- Performance metrics observed
- Security boundaries validated
- Any deviations from original plan
4.2 Create CHANGELOG.md Entry
## v0.2.0 - Security Hardening & Build Orchestrator
### Breaking Changes
- Build orchestrator now generates native binaries instead of Docker configs
- API response format changed for /api/v1/build/* endpoints
- Agent update flow now requires version parameter
### New Features
- Ed25519 signed agent binaries with automatic verification
- Machine ID binding enforced on all endpoints
- Command acknowledgment system (at-least-once delivery)
- Version upgrade middleware (fixes catch-22)
### Security Improvements
- Per-server JWT secrets (not shared)
- Token refresh with nonce validation
- Config protected by file permissions (0600)
### Removed
- Docker container deployment option (native binaries only)
- TLSConfig from agent config (dead code)
4.3 Create Migration Guide
For existing installations (v0.1.17-v0.1.23.4):
# 1. Backup database
pg_dump redflag > backup-pre-v020.sql
# 2. Apply migrations 018-020
migrate -path migrations -database postgres://... up
# 3. Set JWT secret if not already set
export REDFLAG_JWT_SECRET=$(LC_ALL=C tr -dc 'a-zA-Z0-9' < /dev/urandom | head -c 64)
# 4. Restart server (generates signing key if missing)
systemctl restart redflag-server
# 5. Verify signing key configured
curl http://localhost:8080/api/v1/security/signing
# 6. Trigger agent updates (all agents will update to signed binaries)
# Admin UI → Agents → Select All → Update Agent
4.4 Update README.md
Key sections to update:
- Architecture diagram (remove Docker, add signing flow)
- Security model (document machine ID binding)
- Installation instructions (systemctl, not Docker)
- Configuration reference (remove TLSConfig)
- API documentation (update build endpoints)
Testing & Quality Assurance
Unit Tests
// Test packages needed:
1. Test signing service (SignFile, VerifyFile)
2. Test build orchestrator (BuildAgentArtifacts)
3. Test machine binding middleware (with update scenario)
4. Test token renewal with nonce validation
5. Test download handler (signed vs unsigned fallback)
// Run tests with coverage:
go test ./... -cover -v
// Target: >80% coverage on security-critical packages
Integration Tests
#!/bin/bash
# integration_test.sh
echo "Starting RedFlag v0.2.0 integration test..."
# Setup test environment
docker-compose up -d postgres
echo "Waiting for database..."
sleep 5
# Start server
cd aggregator-server
go run cmd/server/main.go &
SERVER_PID=$!
echo "Waiting for server..."
sleep 10
# Run test scenarios:
./tests/test_registration.sh
./tests/test_machine_binding.sh
./tests/test_build_orchestrator.sh
./tests/test_signed_updates.sh
./tests/test_token_renewal.sh
./tests/test_command_acknowledgment.sh
# Cleanup
kill $SERVER_PID
docker-compose down
echo "All tests passed!"
Test Scenarios:
- Registration: New agent registers, gets tokens, machine ID stored
- Machine Binding: Attempt from different machine → 403 Forbidden
- Build Orchestrator: Build signed binary → verify signature → download
- Signed Updates: Agent updates → signature verification → successful install
- Token Renewal: With nonce → successful renewal → version updated
- Command Acknowledgment: Agent sends ack → server processes → queue cleared
Security Testing
# Penetration test checklist:
□ Attempt registration with stolen token (should fail if seats full)
□ Copy config.json to different machine (should fail machine binding)
□ Modify binary and attempt update (signature verification should fail)
□ Replay old nonce (timestamp check should fail)
□ Use expired JWT (should be rejected)
□ Attempt downgrade attack (middleware should reject)
□ Try to access agent data from wrong agent ID (auth should block)
□ Test token renewal without nonce (should fail)
Performance Benchmarks
Target Metrics:
| Operation | Target Time | Notes |
|---|---|---|
| Sign binary (per version) | < 50ms | Ed25519 is fast |
| Build artifacts generation | < 500ms | Mostly file I/O |
| Token renewal with nonce | < 100ms | Includes DB write |
| Machine ID validation | < 10ms | Database lookup |
| Download signed binary | < 5s | Depends on network |
| Agent update process | < 30s | Including verification & restart |
Scalability Targets:
- 1,000 agents: Update all in < 5 minutes (CDN caching)
- 10,000 agents: Update all in < 1 hour (CDN caching)
- Token renewal: 1,000 req/sec (stateless JWT validation)
- Database: < 10% CPU at 1k agents
Deployment Checklist
Pre-Deployment
- All unit tests passing (coverage >80%)
- All integration tests passing
- Security tests passing
- Performance benchmarks met
- Documentation updated (Decision.md, CHANGELOG.md, README.md)
- Migration scripts tested
- Backup procedure documented
- Rollback plan documented
Deployment Steps
- Announce maintenance window (4 hours recommended)
- Create database backup
- Stop agent schedulers (prevent command generation)
- Stop server
- Apply migrations 018-020
- Set required environment variables:
  - `REDFLAG_JWT_SECRET` (min 32 chars)
  - `REDFLAG_SIGNING_PRIVATE_KEY` (if not using keygen)
- Start server
- Verify server starts without errors
- Verify health endpoint: GET /api/v1/health
- Verify signing endpoint: GET /api/v1/security/signing
- Start agent schedulers
- Trigger agent updates (optional, can be gradual)
- Monitor logs for errors
- Verify agent connectivity
Post-Deployment
- Monitor error rates for 24 hours
- Verify agent update success rate >95%
- Check database for anomalies (duplicate subsystems, etc.)
- Review logs for security violations (machine ID mismatches)
- Performance metrics within targets
- Update documentation with any deviations
Rollback Plan
If critical issues found:
- Stop server
- Restore database backup
- Revert to v0.1.23.4 code
- Restart server
- Notify users of rollback
- Document issue for v0.2.1 fix
Known Risks & Mitigations
| Risk | Probability | Impact | Mitigation |
|---|---|---|---|
| Build orchestrator produces invalid signature | Medium | High | Unit tests + manual verification |
| Token renewal fails during update | Low | High | Graceful fallback to re-registration |
| Machine ID collision (rare) | Low | Medium | Hardware fingerprint + agent_id composite |
| JWT secret exposed in logs | Medium | Medium | Remove logging + use per-server secrets |
| Subsystems not attached after update | Low | Medium | EnsureDefaultSubsystems() called |
| Dead code causes confusion | High | Low | Clean TLSConfig, BuildResult fields |
| CDN caches unsigned binary | Low | High | Use version-specific URLs |
Success Criteria
Functional:
- [ ] Agent can successfully update from v0.1.17 → v0.1.23 → v0.2.0
- [ ] Signed binary verification passes
- [ ] Machine ID binding prevents cross-machine impersonation
- [ ] Token renewal with nonce validation works
- [ ] Command acknowledgment system operational
- [ ] Subsystems properly attached after update
Performance:
- [ ] Update completes in < 30 seconds
- [ ] Server handles 1000 concurrent agents
- [ ] Token renewal < 100ms
- [ ] No database deadlocks under load
Security:
- [ ] Ed25519 signatures verified on agent
- [ ] JWT secret not logged in production
- [ ] Per-server JWT secrets implemented
- [ ] Machine ID mismatch logs security alert
- [ ] Token theft from decommissioned agent mitigated by machine binding
Handover Notes for Claude 4.5
@Fimeg, this plan is your implementation guide. Key points:
- Focus on Phase 1 first - Build orchestrator alignment is critical and blocks everything else
- Test as you go - Don't wait until end, integration testing is crucial
- Clean up dead code - TLSConfig and the Docker fields in structs need to be removed
- Verify subsystems - Make sure they're attached during agent registration/update
- Machine binding is THE security boundary - Token rotation is less important
- Ask questions - If anything is unclear, we have logs of all discussions
Time budget: Expect 3-4 weeks for full implementation. Phase 1 is most complex. Phases 2-4 are straightforward.
Resources:
- Decision.md - Architecture decisions
- Status.md - Current state
- todayupdate.md - Historical context
- answer.md - Token system analysis
- SECURITY_AUDIT.md - Security boundaries
When stuck: Review the "Obvious Things That Might Be Missed" sections - they're based on actual issues we identified.
Good luck! 🚀
Document Version: 1.0 Created: 2025-11-10 Last Updated: 2025-11-10 Prepared by: @Kimi + @Fimeg + @Grok Ready for Implementation: ✅ Yes