# RedFlag Security Architecture Session
**Date:** 2025-01-07 (Security Audit) | 2025-11-10 (Build Orchestrator Analysis)
**Version:** 0.1.23
**Focus:** Security audit and build orchestrator alignment

---

## Executive Summary

Initial assessment: RedFlag claims comprehensive security (Ed25519 signatures, nonce protection, machine ID binding, TOFU). Deep dive revealed **critical gaps** in implementation.

## Key Findings

### 1. Security Claims vs Reality

**Claimed Security:**
- ✅ Ed25519 digital signatures for agent updates
- ✅ Nonce-based replay protection (5-minute window)
- ✅ Machine ID binding (anti-impersonation)
- ✅ Trust-On-First-Use (TOFU) public key distribution
- ✅ Command acknowledgment system

**Actual State:**
- ✅ All security primitives correctly implemented in code
- ❌ **Agent update signing workflow connected to wrong build system**
- ❌ Build orchestrator generates Docker configs, not signed native binaries
- ❌ Zero signed packages in database
- ❌ Updates fail with 404 (no packages to download)
- ❌ Hardcoded signing key reused across test servers

### 2. The Update Flow Problem

**What Should Happen:**
```
Admin clicks "Update Agent" → Server finds signed package → Agent downloads → Verifies signature → Updates
```

**What Actually Happens:**
```
Admin clicks "Update Agent" → Server looks for signed package → Database is empty → 404 error → Update fails
```

**Evidence:**
```sql
redflag=# SELECT COUNT(*) FROM agent_update_packages;
 count
-------
     0
```

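The "Verifies signature" step in the intended flow can be sketched with Go's standard `crypto/ed25519` package. This is a minimal illustration, not RedFlag's actual `verifyBinarySignature()`: the helper names `verifyPackage` and `signAndVerifyDemo` are hypothetical, and the key pair is generated on the spot purely for the demo.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"errors"
	"fmt"
)

// verifyPackage is a hypothetical helper (not RedFlag's verifyBinarySignature).
// It returns nil only if sig is a valid Ed25519 signature over binary.
func verifyPackage(pub ed25519.PublicKey, binary, sig []byte) error {
	if len(pub) != ed25519.PublicKeySize {
		return errors.New("invalid public key size")
	}
	if !ed25519.Verify(pub, binary, sig) {
		return errors.New("signature verification failed; refusing to update")
	}
	return nil
}

// signAndVerifyDemo signs a fake binary with a throwaway key, then verifies
// both the original bytes and a tampered copy.
func signAndVerifyDemo() (intact, tampered error) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		return err, err
	}
	binary := []byte("pretend agent binary")
	sig := ed25519.Sign(priv, binary)

	modified := append([]byte(nil), binary...)
	modified[0] ^= 0xff // flip bits to simulate tampering in transit

	return verifyPackage(pub, binary, sig), verifyPackage(pub, modified, sig)
}

func main() {
	intact, tampered := signAndVerifyDemo()
	fmt.Println("intact binary:", intact)
	fmt.Println("tampered binary:", tampered)
}
```

The point of the sketch is the ordering: the agent should replace its binary only after verification returns nil. None of this can run today, because the server has no signed package to hand out.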
### 3. Build Orchestrator Misalignment

**Discovery Date:** 2025-11-10

**Expected Goal:** Server generates signed native binaries with embedded configuration

**What Build Orchestrator Actually Does:**
- Generates `docker-compose.yml` (Docker container deployment) ❌
- Generates `Dockerfile` (multi-stage builds) ❌
- Generates Go source with embedded JSON config ❌
- **Does NOT produce signed native binaries for download** ❌

**Root Cause:** Build orchestrator designed for Docker-first deployment, but actual production uses native binaries with systemd/Windows services.

**Discovery Location:**
- `aggregator-server/internal/services/agent_builder.go:171-320` (docker-compose.yml generation)
- `aggregator-server/internal/api/handlers/build_orchestrator.go:77-84` (instructions for Docker build)
- `aggregator-server/Dockerfile:11-28` (generic binary build - CORRECT)
- `aggregator-server/cmd/main.go:175,244` (downloadHandler serves native binaries from `/app/binaries/`)

**The Core Flow:**
```
Docker Build (during compose up) → Generic Binaries in /app/binaries/ →
downloadHandler serves them → Install Script downloads and deploys natively
```

**What's Missing in the Middle:**
```
Generic Binary → Copy → Embed Config → Sign → Store → Serve via Downloads Endpoint

Inputs per step:
  Generic Binary  ← /app/binaries
  Embed Config    ← agent_id, server_url, token
  Sign            ← Ed25519
  Store           ← agent_update_packages table
```

**Install Script Paradox:**
- ✅ Install script correctly downloads native binaries from `/api/v1/downloads/{platform}`
- ✅ Install script correctly deploys via systemd/Windows services
- ❌ But it downloads **generic unsigned binaries** instead of **signed custom binaries**
- ❌ Build orchestrator gives Docker instructions, not signed binary paths

### 4. Hardcoded Signing Key Issue

**Location:** `config/.env:24`
```
REDFLAG_SIGNING_PRIVATE_KEY=1104a7fd7fb1a12b99e31d043fc7f4ef00bee6df19daff11ae4244606dac5bf9792d68d1c31f6c6a7820033720fb80d54bf22a8aab0382efd5deacc5122a5947
```

**Public Key Fingerprint:** `792d68d1c31f6c6a`

**Problem:** Same fingerprint appearing across multiple test servers indicates key reuse.

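The fingerprint can be checked against the key by hand. The value in `.env` is 128 hex characters (64 bytes), which matches Go's `ed25519.PrivateKey` layout of a 32-byte seed followed by the 32-byte public key, and the fingerprint equals the first 8 bytes of that public half. The sketch below assumes that convention (it is inferred from the values above, not confirmed from the RedFlag source) and shows how a per-server key could be generated instead of reusing one. The function names and server labels are hypothetical.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"encoding/hex"
	"errors"
	"fmt"
)

// fingerprintFromPrivateHex derives a 16-hex-character fingerprint (the first
// 8 bytes of the public key) from a hex-encoded 64-byte Ed25519 private key.
// The 8-byte fingerprint convention is an assumption inferred from the notes.
func fingerprintFromPrivateHex(privHex string) (string, error) {
	raw, err := hex.DecodeString(privHex)
	if err != nil {
		return "", err
	}
	if len(raw) != ed25519.PrivateKeySize {
		return "", errors.New("expected a 64-byte Ed25519 private key")
	}
	// Public() copies the last 32 bytes of the private key.
	pub := ed25519.PrivateKey(raw).Public().(ed25519.PublicKey)
	return hex.EncodeToString(pub[:8]), nil
}

// newSigningKeyHex generates a fresh key so each server gets its own.
func newSigningKeyHex() (string, error) {
	_, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		return "", err
	}
	return hex.EncodeToString(priv), nil
}

func main() {
	// Hypothetical server labels; each one should print a different fingerprint.
	for _, server := range []string{"test-bench-one", "test-bench-two"} {
		keyHex, err := newSigningKeyHex()
		if err != nil {
			panic(err)
		}
		fp, err := fingerprintFromPrivateHex(keyHex)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s fingerprint: %s\n", server, fp)
	}
}
```

Comparing fingerprints this way is a cheap check for key reuse: two servers that print the same value share a signing key.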
### 5. Version Check Bug Discovered

**Real-world scenario on test bench two:**
- Agent binary: `0.1.23` ✅
- Database record: `0.1.17` ❌
- Machine binding middleware rejects agent: `426 Upgrade Required`
- Agent cannot check in to update its database version
- **Catch-22: Agent stuck because middleware blocks version updates**

**Log evidence:**
```
Checking in with server... (Agent v0.1.23)
Check-in unsuccessful: failed to get commands: 426 Upgrade Required - {"current_version":"0.1.17"}
```

## Components Analysis

### ✅ What Works (Fully Operational)

1. **Machine ID Binding** (`machine_binding.go`)
   - Validates X-Machine-ID header against database
   - Returns HTTP 403 on mismatch
   - Enforces minimum version 0.1.22+

2. **Nonce Replay Protection** (`agent_updates.go:92`, `subsystem_handlers.go:397`)
   - Generates UUID + timestamp + signature
   - Validates < 5 minute window
   - Prevents command replay attacks

3. **Command Acknowledgment System**
   - At-least-once delivery guarantee
   - Automatic retry with persistence
   - Cleanup after success/expiration

4. **Ed25519 Infrastructure** (code level)
   - `SignFile()` implementation correct
   - `verifyBinarySignature()` implementation correct
   - Nonce validation implemented correctly

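The nonce scheme in item 2 (identifier + timestamp + signature, valid for under 5 minutes) can be sketched as follows. This is an illustration under stated assumptions, not the code in `agent_updates.go`: the `UpdateNonce` type, the `id|unix-timestamp` message format, and the in-memory `seen` map for single-use tracking are all hypothetical, and a random 16-byte hex ID stands in for the UUID because the Go standard library has no UUID package.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"encoding/hex"
	"errors"
	"fmt"
	"time"
)

const nonceMaxAge = 5 * time.Minute

// UpdateNonce is a hypothetical wire format: random ID, issue time, signature.
type UpdateNonce struct {
	ID        string
	IssuedAt  time.Time
	Signature []byte
}

// nonceMessage is the byte string that gets signed (assumed format).
func nonceMessage(id string, issuedAt time.Time) []byte {
	return []byte(fmt.Sprintf("%s|%d", id, issuedAt.Unix()))
}

// issueNonce creates a signed nonce on the server side.
func issueNonce(priv ed25519.PrivateKey, now time.Time) (UpdateNonce, error) {
	buf := make([]byte, 16)
	if _, err := rand.Read(buf); err != nil {
		return UpdateNonce{}, err
	}
	id := hex.EncodeToString(buf)
	sig := ed25519.Sign(priv, nonceMessage(id, now))
	return UpdateNonce{ID: id, IssuedAt: now, Signature: sig}, nil
}

// validateNonce checks signature, age, and single use, in that order.
func validateNonce(pub ed25519.PublicKey, n UpdateNonce, now time.Time, seen map[string]bool) error {
	if !ed25519.Verify(pub, nonceMessage(n.ID, n.IssuedAt), n.Signature) {
		return errors.New("nonce signature invalid")
	}
	age := now.Sub(n.IssuedAt)
	if age < 0 || age >= nonceMaxAge {
		return errors.New("nonce outside the 5-minute window")
	}
	if seen[n.ID] {
		return errors.New("nonce already used")
	}
	seen[n.ID] = true
	return nil
}

// nonceDemo validates a fresh nonce, replays it, and then tries a stale one.
func nonceDemo() (fresh, replayed, stale error) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		return err, err, err
	}
	now := time.Now()
	seen := map[string]bool{}

	n, err := issueNonce(priv, now)
	if err != nil {
		return err, err, err
	}
	fresh = validateNonce(pub, n, now.Add(time.Minute), seen)
	replayed = validateNonce(pub, n, now.Add(2*time.Minute), seen)

	old, err := issueNonce(priv, now.Add(-6*time.Minute))
	if err != nil {
		return fresh, replayed, err
	}
	stale = validateNonce(pub, old, now, seen)
	return fresh, replayed, stale
}

func main() {
	fresh, replayed, stale := nonceDemo()
	fmt.Println("fresh:", fresh)
	fmt.Println("replayed:", replayed)
	fmt.Println("stale:", stale)
}
```

The `seen` map is an addition for illustration: a time window alone still allows a replay inside the window, so some form of single-use tracking is usually paired with it.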
### ❌ What's Broken

1. **Build Orchestrator Paradigm Mismatch** (NEW - Critical Discovery)
   - Generic binary build pipeline **WORKS** ✅ (Dockerfile:11-28)
   - Native binary download endpoints **WORK** ✅ (main.go:244)
   - Install script deployment **WORKS** ✅ (downloads.go:537-544)
   - Build orchestrator generates **wrong artifacts** ❌ (Docker configs, not signed binaries)
   - Missing: Signing service integration with build pipeline ❌
   - Missing: Custom config injection into binaries ❌

2. **Update Signing Workflow**
   - Binaries built during `docker compose build` ✅
   - Binaries never signed ❌
   - No signed packages in database ❌
   - No UI for signing ❌
   - No automation ❌

3. **Public Key TOFU** (partial failure)
   - Fetch on registration ✅
   - **Non-blocking failure** ❌ (agent registers even if key fetch fails)
   - **No fingerprint logging** ❌ (admin can't verify correct server)
   - **No key rotation support** ❌

4. **Version Update Flow**
   - Middleware blocks old versions ✅
   - **No path for version upgrades** ❌ (catch-22 scenario)
   - **Database can become stale** ❌

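Two of the TOFU gaps in item 3 (non-blocking key fetch, no fingerprint logging) can be illustrated with a small sketch. It is one possible shape for a fix, not the planned implementation: `KeyPin`, `register`, and `errFetch` are hypothetical names, the key is held in memory only, and the 8-byte fingerprint matches the convention seen in section 4.

```go
package main

import (
	"bytes"
	"crypto/ed25519"
	"encoding/hex"
	"errors"
	"fmt"
)

// errFetch stands in for a network failure while fetching the server key.
var errFetch = errors.New("public key fetch failed")

// KeyPin holds the server key pinned on first use (hypothetical type).
type KeyPin struct {
	pinned []byte
}

// Trust pins the first key it sees and rejects any different key afterwards.
// It returns the fingerprint of the pinned key.
func (p *KeyPin) Trust(pub []byte) (string, error) {
	if len(pub) != ed25519.PublicKeySize {
		return "", errors.New("unexpected public key size")
	}
	if p.pinned == nil {
		p.pinned = append([]byte(nil), pub...) // first use: pin a private copy
	} else if !bytes.Equal(p.pinned, pub) {
		return "", errors.New("server key differs from pinned key")
	}
	return hex.EncodeToString(p.pinned[:8]), nil
}

// register aborts when the key fetch fails instead of registering without a
// key, and logs the fingerprint so an admin can compare it with the server's.
func register(pin *KeyPin, fetchKey func() ([]byte, error)) error {
	pub, err := fetchKey()
	if err != nil {
		return fmt.Errorf("registration aborted: %w", err)
	}
	fp, err := pin.Trust(pub)
	if err != nil {
		return fmt.Errorf("registration aborted: %w", err)
	}
	fmt.Println("pinned server key fingerprint:", fp)
	return nil
}

func main() {
	pin := &KeyPin{}
	key := make([]byte, ed25519.PublicKeySize) // placeholder key bytes for the demo

	fmt.Println(register(pin, func() ([]byte, error) { return nil, errFetch }))
	fmt.Println(register(pin, func() ([]byte, error) { return key, nil }))
}
```

Key rotation is deliberately left out: a pinned key that can never change is exactly the third gap in the list, and solving it needs a design decision (for example, a rotation message signed by the old key) that these notes do not make.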
## Implementation Work Done

### 1. Security Audit Documentation

Created `SECURITY_AUDIT.md` with comprehensive analysis:
- Detailed component-by-component review
- Specific code locations and line numbers
- Risk assessment matrix
- Implementation gaps identification
- Recommended remediation steps

### 2. Version Upgrade Solution Design

**Problem Identified:** Machine binding middleware treats version enforcement as a hard security boundary, preventing legitimate agent updates.

**Solution Designed:** Middleware becomes "update-aware". It:
- Detects agents in update process (`is_updating` flag)
- Validates upgrade authorization via nonce
- Prevents downgrade attacks
- Maintains audit trail

**Implementation Plan:**
1. **Middleware updates** - Allow version upgrades with nonce validation
2. **Agent updates** - Send version and nonce headers during check-in
3. **Database helpers** - Complete agent update process
4. **Storage mechanisms** - Persist update nonce across restarts

### 3. Started Implementation

**Current Status:**
- ✅ Security audit complete
- ✅ Solution architecture designed
- 🔄 Middleware implementation in progress
- ⏳ Remaining: nonce validation, agent headers, database helpers

## Critical Issues Summary

| Issue | Severity | Status | Impact |
|-------|----------|--------|---------|
| Update signing workflow non-functional | Critical | Identified | Agent updates completely broken |
| Hardcoded signing key reuse | High | Identified | Cross-contamination risk |
| Version update catch-22 | High | In Progress | Agents can get stuck |
| Public key fetch non-blocking | Medium | Identified | Updates fail silently |
| No fingerprint verification | Medium | Identified | MITM risk in TOFU |

## Next Steps

### Immediate (In Progress)
1. Complete middleware implementation for version upgrade handling
2. Add nonce validation for update authorization
3. Update agent to send version/nonce headers

### Short Term (Next Session)
1. Add database helpers for update completion
2. Implement agent-side nonce storage
3. Test version upgrade flow end-to-end

### Medium Term
1. Complete update signing workflow implementation
2. Add UI for package management
3. Add integration tests for security features

## Technical Details Added

### Machine Binding Middleware Enhancement
```go
// Fragment from inside the machine binding middleware; c is assumed to be
// a *gin.Context. Checks whether the agent is in an update process and is
// reporting completion.
if agent.IsUpdating != nil && *agent.IsUpdating {
	reportedVersion := c.GetHeader("X-Agent-Version")
	updateNonce := c.GetHeader("X-Update-Nonce")

	// Validate upgrade (not downgrade)
	if !utils.IsNewerOrEqualVersion(reportedVersion, agent.CurrentVersion) {
		// Security log and reject. The 403 status is an assumption; without
		// an abort and return, the request would fall through to c.Next().
		c.AbortWithStatus(http.StatusForbidden)
		return
	}

	// Validate nonce (proves server authorization)
	if err := validateUpdateNonce(updateNonce); err != nil {
		// Security log and reject (same assumption as above)
		c.AbortWithStatus(http.StatusForbidden)
		return
	}

	// Complete update and allow through. This runs asynchronously, so any
	// error from CompleteAgentUpdate is not surfaced to this request.
	go agentQueries.CompleteAgentUpdate(agentID, reportedVersion)
	c.Next()
	return
}
```

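The downgrade check depends on comparing versions numerically, since a plain string comparison would rank `0.1.9` above `0.1.10`. The sketch below shows one way a helper like `utils.IsNewerOrEqualVersion` could behave. It is not the actual implementation: the function names are local stand-ins, and it assumes dotted numeric versions such as `0.1.23` with no pre-release suffixes.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits "0.1.23" into [0 1 23]; it rejects non-numeric parts.
func parseVersion(v string) ([]int, error) {
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	nums := make([]int, len(parts))
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return nil, fmt.Errorf("invalid version %q", v)
		}
		nums[i] = n
	}
	return nums, nil
}

// isNewerOrEqualVersion reports whether reported >= current, comparing
// numerically component by component. Missing components count as 0, and
// unparseable input is treated as a downgrade (returns false).
func isNewerOrEqualVersion(reported, current string) bool {
	r, errR := parseVersion(reported)
	c, errC := parseVersion(current)
	if errR != nil || errC != nil {
		return false
	}
	for i := 0; i < len(r) || i < len(c); i++ {
		var a, b int
		if i < len(r) {
			a = r[i]
		}
		if i < len(c) {
			b = c[i]
		}
		if a != b {
			return a > b
		}
	}
	return true
}

func main() {
	fmt.Println(isNewerOrEqualVersion("0.1.23", "0.1.17")) // true
	fmt.Println(isNewerOrEqualVersion("0.1.17", "0.1.23")) // false
	fmt.Println(isNewerOrEqualVersion("0.1.9", "0.1.10"))  // false
}
```

Failing closed on malformed input matters here: a missing or garbled `X-Agent-Version` header should never be read as a valid upgrade.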
### Security Model
- **No downgrade attacks** - middleware rejects version < current
- **Nonce proves server authorization** - agent can't fake updates
- **Target version validation** - must match server's expectation
- **Machine binding remains enforced** - prevents impersonation

## Root Cause Analysis

The security system was designed with correct cryptographic primitives but:
1. **Workflow incomplete** - signing never connected to update delivery
2. **Edge cases unhandled** - version updates can get stuck
3. **Operational gaps** - no UI/automation for critical functions

This is a classic "secure design, insecure implementation" scenario.

## Lessons Learned

1. **Security is not just about algorithms** - the workflow matters
2. **Edge cases kill security** - version update catch-22
3. **Automation is required** - manual steps don't happen
4. **Visibility is critical** - need logs, alerts, UI feedback
5. **Testing must include failure modes** - what happens when things go wrong

## Files Modified/Created

- `SECURITY_AUDIT.md` - Comprehensive security analysis
- `today.md` - This session log
- `aggregator-server/internal/api/middleware/machine_binding.go` - Enhancement in progress

## Session Conclusion

RedFlag has excellent security architecture but critical implementation gaps prevent it from being production-ready. The version upgrade bug is the most immediate user-facing issue, while the missing update signing workflow is the biggest architectural gap.

The solution approach focuses on making existing security components work together seamlessly while maintaining strong security guarantees.

---

**Status:** Session paused mid-implementation, ready to continue with middleware enhancement.

---

## Build Orchestrator Analysis (2025-11-10)

### Discovery Summary

**Problem:** Build orchestrator and install script were speaking different languages

**What Was Happening:**
- Build orchestrator → Generated Docker configs (docker-compose.yml, Dockerfile)
- Install script → Expected native binary + config.json (no Docker)
- Result: Install script ignored build orchestrator, downloaded generic unsigned binaries

**Why This Happened:**
During development, both approaches were explored:
1. Docker container agents (early prototype)
2. Native binary agents (production decision)

Build orchestrator was implemented for approach #1 while install script was built for approach #2. Neither was updated when the architectural decision was made to go native-only.

### Architecture Validation

**What Actually Works PERFECTLY:**
```
┌─────────────────────────────────────────────────────────────┐
│  Dockerfile Multi-Stage Build (aggregator-server/Dockerfile)│
│  Stage 1: Build generic agent binaries for all platforms   │
│  Stage 2: Copy to /app/binaries/ in final server image      │
└────────────────────────┬────────────────────────────────────┘
                         │
                         │ Server runs...
                         ▼
            ┌──────────────────────────────────────────┐
            │  downloadHandler serves from /app/       │
            │  Endpoint: /api/v1/downloads/{platform}  │
            └────────────┬─────────────────────────────┘
                         │
                         │ Install script downloads with curl...
                         ▼
            ┌──────────────────────────────────────────┐
            │  Install Script (downloads.go:537-830)   │
            │  - Deploys via systemd (Linux)           │
            │  - Deploys via Windows services          │
            │  - No Docker involved                    │
            └──────────────────────────────────────────┘
```

**What's Missing (The Gap):**
```
When admin clicks "Update Agent" in UI:
1. Take generic binary from /app/binaries/{platform}/redflag-agent
2. Embed: agent_id, server_url, registration_token into config
3. Sign with Ed25519 (using signingService.SignFile())
4. Store in agent_update_packages table
5. Serve signed version via downloads endpoint
```

### Corrected Architecture

**Goal:** Make build orchestrator generate **signed native binaries**, not Docker configs

**New Build Orchestrator Flow:**
```go
// 1. Receive build request via POST /api/v1/build/new or /api/v1/build/upgrade
// 2. Load generic binary from /app/binaries/{platform}/
// 3. Generate agent-specific config.json (not docker-compose.yml)
// 4. Sign binary with Ed25519 key (using existing signingService)
// 5. Store signature in agent_update_packages table
// 6. Return download URL for signed binary
```

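Steps 2 to 5 of the flow above can be sketched in a few lines. This is a rough illustration under stated assumptions, not the orchestrator's real code: `AgentConfig`, `SignedPackage`, and `buildSignedPackage` are hypothetical names, the binary is passed in as bytes rather than read from `/app/binaries/`, the signature covers the binary only (the config is kept as a separate `config.json` payload), and the returned struct stands in for a row in `agent_update_packages`, whose real schema is not shown in these notes. The existing `signingService.SignFile()` would replace the direct `ed25519.Sign` call.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"errors"
	"fmt"
)

// AgentConfig is a hypothetical shape for the agent-specific config.json.
type AgentConfig struct {
	AgentID           string `json:"agent_id"`
	ServerURL         string `json:"server_url"`
	RegistrationToken string `json:"registration_token"`
}

// SignedPackage is a hypothetical stand-in for an agent_update_packages row.
type SignedPackage struct {
	Platform   string
	ConfigJSON []byte
	Checksum   string // hex SHA-256 of the binary
	Signature  string // hex Ed25519 signature over the binary
}

// buildSignedPackage generates the config, then checksums and signs the binary.
func buildSignedPackage(priv ed25519.PrivateKey, platform string, binary []byte, cfg AgentConfig) (SignedPackage, error) {
	if len(priv) != ed25519.PrivateKeySize {
		return SignedPackage{}, errors.New("invalid signing key size")
	}
	configJSON, err := json.Marshal(cfg)
	if err != nil {
		return SignedPackage{}, err
	}
	sum := sha256.Sum256(binary)
	return SignedPackage{
		Platform:   platform,
		ConfigJSON: configJSON,
		Checksum:   hex.EncodeToString(sum[:]),
		Signature:  hex.EncodeToString(ed25519.Sign(priv, binary)),
	}, nil
}

// buildDemo builds a package with a throwaway key and placeholder values,
// then confirms that the stored signature verifies against the binary.
func buildDemo() (SignedPackage, bool, error) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		return SignedPackage{}, false, err
	}
	binary := []byte("generic agent binary bytes")
	pkg, err := buildSignedPackage(priv, "linux-amd64", binary, AgentConfig{
		AgentID:           "agent-123",
		ServerURL:         "https://redflag.example.com",
		RegistrationToken: "example-token",
	})
	if err != nil {
		return SignedPackage{}, false, err
	}
	sig, err := hex.DecodeString(pkg.Signature)
	if err != nil {
		return SignedPackage{}, false, err
	}
	return pkg, ed25519.Verify(pub, binary, sig), nil
}

func main() {
	pkg, verified, err := buildDemo()
	if err != nil {
		panic(err)
	}
	fmt.Println("platform:", pkg.Platform)
	fmt.Println("config:", string(pkg.ConfigJSON))
	fmt.Println("signature verifies:", verified)
}
```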
**Install Script Stays EXACTLY THE SAME**
- Continues to download from `/api/v1/downloads/{platform}`
- Continues systemd/Windows service deployment
- Just gets **signed binaries** instead of generic ones

### Implementation Roadmap (Updated)

#### Immediate (Build Orchestrator Fix)
1. Replace docker-compose.yml generation with config.json generation
2. Add Ed25519 signing step using signingService.SignFile()
3. Store signed binary info in agent_update_packages table
4. Update downloadHandler to serve signed versions when available

#### Short Term (Agent Updates)
1. Complete middleware implementation for version upgrade handling
2. Add nonce validation for update authorization
3. Update agent to send version/nonce headers
4. Test end-to-end agent update flow

#### Medium Term (Security Polish)
1. Add UI for package management and signing status
2. Add fingerprint logging for TOFU verification
3. Implement key rotation support
4. Add integration tests for signing workflow

### Corrected Understanding

**Original Misconception:** Build orchestrator was "wrong" or "broken"

**Actual Reality:** Build orchestrator was generating artifacts for a Docker-based deployment architecture that was **explored but not chosen**. The native binary architecture is **already correct and working** - we just need to connect the signing workflow to it.

**The Fix:** Don't throw out the build orchestrator - **redirect it** to generate the right artifacts for the native binary architecture.

---

**Final Status:** Architecture validated, root cause identified, path forward clear. Ready to implement signed binary generation.