# RedFlag Security Architecture Session
**Date:** 2025-01-07 (Security Audit) | 2025-11-10 (Build Orchestrator Analysis)
**Version:** 0.1.23
**Focus:** Security audit and build orchestrator alignment

---
## Executive Summary
Initial assessment: RedFlag claims comprehensive security (Ed25519 signatures, nonce protection, machine ID binding, TOFU). A deep dive found that the primitives exist in code, but the update signing workflow was never wired up, leaving **critical gaps** in implementation.
## Key Findings
### 1. Security Claims vs Reality
**Claimed Security:**
- ✅ Ed25519 digital signatures for agent updates
- ✅ Nonce-based replay protection (5-minute window)
- ✅ Machine ID binding (anti-impersonation)
- ✅ Trust-On-First-Use (TOFU) public key distribution
- ✅ Command acknowledgment system
**Actual State:**
- ✅ All security primitives correctly implemented in code
- **Agent update signing workflow connected to wrong build system**
- ❌ Build orchestrator generates Docker configs, not signed native binaries
- ❌ Zero signed packages in database
- ❌ Updates fail with 404 (no packages to download)
- ❌ Hardcoded signing key reused across test servers
### 2. The Update Flow Problem
**What Should Happen:**
```
Admin clicks "Update Agent" → Server finds signed package → Agent downloads → Verifies signature → Updates
```
**What Actually Happens:**
```
Admin clicks "Update Agent" → Server looks for signed package → Database is empty → 404 error → Update fails
```
**Evidence:**
```sql
redflag=# SELECT COUNT(*) FROM agent_update_packages;
 count
-------
     0
```
### 3. Build Orchestrator Misalignment
**Discovery Date:** 2025-11-10
**Expected Goal:** Server generates signed native binaries with embedded configuration
**What Build Orchestrator Actually Does:**
- Generates `docker-compose.yml` (Docker container deployment) ❌
- Generates `Dockerfile` (multi-stage builds) ❌
- Generates Go source with embedded JSON config ❌
- **Does NOT produce signed native binaries for download** ❌
**Root Cause:** Build orchestrator designed for Docker-first deployment, but actual production uses native binaries with systemd/Windows services.
**Discovery Location:**
- `aggregator-server/internal/services/agent_builder.go:171-320` (docker-compose.yml generation)
- `aggregator-server/internal/api/handlers/build_orchestrator.go:77-84` (instructions for Docker build)
- `aggregator-server/Dockerfile:11-28` (generic binary build - CORRECT)
- `aggregator-server/cmd/main.go:175,244` (downloadHandler serves native binaries from `/app/binaries/`)
**The Core Flow:**
```
Docker Build (during compose up) → Generic Binaries in /app/binaries/ →
downloadHandler serves them → Install Script downloads and deploys natively
```
**What's Missing in the Middle:**
```
Generic Binary → Copy → Embed Config → Sign → Store → Serve via Downloads Endpoint

Generic Binary: /app/binaries
Embed Config:   agent_id, server_url, token
Sign:           Ed25519
Store:          agent_update_packages table
```
**Install Script Paradox:**
- ✅ Install script correctly downloads native binaries from `/api/v1/downloads/{platform}`
- ✅ Install script correctly deploys via systemd/Windows services
- ❌ But it downloads **generic unsigned binaries** instead of **signed custom binaries**
- ❌ Build orchestrator gives Docker instructions, not signed binary paths
### 4. Hardcoded Signing Key Issue
**Location:** `config/.env:24`
```
REDFLAG_SIGNING_PRIVATE_KEY=1104a7fd7fb1a12b99e31d043fc7f4ef00bee6df19daff11ae4244606dac5bf9792d68d1c31f6c6a7820033720fb80d54bf22a8aab0382efd5deacc5122a5947
```
**Public Key Fingerprint:** `792d68d1c31f6c6a`
**Problem:** Same fingerprint appearing across multiple test servers indicates key reuse.
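
The fingerprint above matches the first 16 hex characters of the second half of the key string, which is consistent with Go's `crypto/ed25519` layout (a 64-byte private key is the 32-byte seed followed by the 32-byte public key). A minimal sketch of per-server key generation and fingerprinting follows. It assumes the fingerprint is the first 8 bytes of the public key in hex; the function names are hypothetical and not taken from the RedFlag codebase.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"encoding/hex"
	"errors"
)

// generateSigningKey creates a fresh Ed25519 key pair so that each server
// gets its own key instead of sharing one value committed to config/.env.
func generateSigningKey() (ed25519.PublicKey, ed25519.PrivateKey, error) {
	return ed25519.GenerateKey(rand.Reader)
}

// keyFingerprint returns the first 8 bytes of the public key as hex.
// Assumption: this is how the 16-character fingerprint in the log is derived.
func keyFingerprint(pub ed25519.PublicKey) string {
	return hex.EncodeToString(pub[:8])
}

// fingerprintFromHex decodes a hex-encoded public key and fingerprints it.
func fingerprintFromHex(hexPub string) (string, error) {
	raw, err := hex.DecodeString(hexPub)
	if err != nil {
		return "", err
	}
	if len(raw) != ed25519.PublicKeySize {
		return "", errors.New("unexpected public key length")
	}
	return keyFingerprint(ed25519.PublicKey(raw)), nil
}
```

Calling `fingerprintFromHex` on the public half of the key shown above yields `792d68d1c31f6c6a`, while two calls to `generateSigningKey` produce different fingerprints, which is what distinct test servers should show.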
### 5. Version Check Bug Discovered
**Real-world scenario on test bench two:**
- Agent binary: `0.1.23`
- Database record: `0.1.17`
- Machine binding middleware rejects agent: `426 Upgrade Required`
- Agent cannot check in to update its database version
- **Catch-22: Agent stuck because middleware blocks version updates**
**Log evidence:**
```
Checking in with server... (Agent v0.1.23)
Check-in unsuccessful: failed to get commands: 426 Upgrade Required - {"current_version":"0.1.17"}
```
## Components Analysis
### ✅ What Works (Fully Operational)
1. **Machine ID Binding** (`machine_binding.go`)
   - Validates X-Machine-ID header against database
   - Returns HTTP 403 on mismatch
   - Enforces minimum version 0.1.22+
2. **Nonce Replay Protection** (`agent_updates.go:92`, `subsystem_handlers.go:397`)
   - Generates UUID + timestamp + signature
   - Validates < 5 minute window
   - Prevents command replay attacks
3. **Command Acknowledgment System**
   - At-least-once delivery guarantee
   - Automatic retry with persistence
   - Cleanup after success/expiration
4. **Ed25519 Infrastructure** (code level)
   - `SignFile()` implementation correct
   - `verifyBinarySignature()` implementation correct
   - Nonce validation implemented correctly
### ❌ What's Broken
1. **Build Orchestrator Paradigm Mismatch** (NEW - Critical Discovery)
   - Generic binary build pipeline **WORKS** ✅ (Dockerfile:11-28)
   - Native binary download endpoints **WORK** ✅ (main.go:244)
   - Install script deployment **WORKS** ✅ (downloads.go:537-544)
   - Build orchestrator generates **wrong artifacts** ❌ (Docker configs, not signed binaries)
   - Missing: Signing service integration with build pipeline ❌
   - Missing: Custom config injection into binaries ❌
2. **Update Signing Workflow**
   - Binaries built during `docker compose build`
   - Binaries never signed ❌
   - No signed packages in database ❌
   - No UI for signing ❌
   - No automation ❌
3. **Public Key TOFU** (partial failure)
   - Fetch on registration ✅
   - **Non-blocking failure** ❌ (agent registers even if key fetch fails)
   - **No fingerprint logging** ❌ (admin can't verify correct server)
   - **No key rotation support** ❌
4. **Version Update Flow**
   - Middleware blocks old versions ✅
   - **No path for version upgrades** ❌ (catch-22 scenario)
   - **Database can become stale** ❌
## Implementation Work Done
### 1. Security Audit Documentation
Created `SECURITY_AUDIT.md` with comprehensive analysis:
- Detailed component-by-component review
- Specific code locations and line numbers
- Risk assessment matrix
- Implementation gaps identification
- Recommended remediation steps
### 2. Version Upgrade Solution Design
**Problem Identified:** Machine binding middleware treats version enforcement as hard security boundary, preventing legitimate agent updates.
**Solution Designed:** Middleware becomes "update-aware" with:
- Detects agents in update process (`is_updating` flag)
- Validates upgrade authorization via nonce
- Prevents downgrade attacks
- Maintains audit trail
**Implementation Plan:**
1. **Middleware updates** - Allow version upgrades with nonce validation
2. **Agent updates** - Send version and nonce headers during check-in
3. **Database helpers** - Complete agent update process
4. **Storage mechanisms** - Persist update nonce across restarts
### 3. Started Implementation
**Current Status:**
- ✅ Security audit complete
- ✅ Solution architecture designed
- 🔄 Middleware implementation in progress
- ⏳ Remaining: nonce validation, agent headers, database helpers
## Critical Issues Summary
| Issue | Severity | Status | Impact |
|-------|----------|--------|---------|
| Update signing workflow non-functional | Critical | Identified | Agent updates completely broken |
| Hardcoded signing key reuse | High | Identified | Cross-contamination risk |
| Version update catch-22 | High | In Progress | Agents can get stuck |
| Public key fetch non-blocking | Medium | Identified | Updates fail silently |
| No fingerprint verification | Medium | Identified | MITM risk in TOFU |
## Next Steps
### Immediate (In Progress)
1. Complete middleware implementation for version upgrade handling
2. Add nonce validation for update authorization
3. Update agent to send version/nonce headers
### Short Term (Next Session)
1. Add database helpers for update completion
2. Implement agent-side nonce storage
3. Test version upgrade flow end-to-end
### Medium Term
1. Complete update signing workflow implementation
2. Add UI for package management
3. Add integration tests for security features
## Technical Details Added
### Machine Binding Middleware Enhancement
```go
// Check if agent is in update process and reporting completion
if agent.IsUpdating != nil && *agent.IsUpdating {
	reportedVersion := c.GetHeader("X-Agent-Version")
	updateNonce := c.GetHeader("X-Update-Nonce")

	// Validate upgrade (not downgrade)
	if !utils.IsNewerOrEqualVersion(reportedVersion, agent.CurrentVersion) {
		// Security log and reject; the request must be aborted, not just returned from
		c.AbortWithStatus(http.StatusForbidden) // status code is an assumption
		return
	}

	// Validate nonce (proves server authorization)
	if err := validateUpdateNonce(updateNonce); err != nil {
		// Security log and reject
		c.AbortWithStatus(http.StatusForbidden) // status code is an assumption
		return
	}

	// Complete update and allow through.
	// Runs in a goroutine, so an error from CompleteAgentUpdate is not checked here.
	go agentQueries.CompleteAgentUpdate(agentID, reportedVersion)
	c.Next()
	return
}
```
### Security Model
- **No downgrade attacks** - middleware rejects version < current
- **Nonce proves server authorization** - agent can't fake updates
- **Target version validation** - must match server's expectation
- **Machine binding remains enforced** - prevents impersonation
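
The downgrade check depends on comparing versions numerically rather than as strings (as strings, `0.1.9` sorts after `0.1.17`). The following is a minimal sketch of such a comparison; it is not the actual `utils.IsNewerOrEqualVersion`, and it assumes plain dotted numeric versions such as `0.1.23`.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseVersion splits a dotted version such as "0.1.23" into [0 1 23].
// An optional leading "v" is tolerated.
func parseVersion(v string) ([]int, error) {
	parts := strings.Split(strings.TrimPrefix(v, "v"), ".")
	nums := make([]int, len(parts))
	for i, p := range parts {
		n, err := strconv.Atoi(p)
		if err != nil || n < 0 {
			return nil, fmt.Errorf("invalid version %q", v)
		}
		nums[i] = n
	}
	return nums, nil
}

// isNewerOrEqual reports whether reported >= current, comparing each
// component as a number. Unparseable input fails closed (returns false).
func isNewerOrEqual(reported, current string) bool {
	r, err := parseVersion(reported)
	if err != nil {
		return false
	}
	c, err := parseVersion(current)
	if err != nil {
		return false
	}
	for i := 0; i < len(r) || i < len(c); i++ {
		var a, b int // missing components count as 0
		if i < len(r) {
			a = r[i]
		}
		if i < len(c) {
			b = c[i]
		}
		if a != b {
			return a > b
		}
	}
	return true
}
```

With the versions from the test bench, `isNewerOrEqual("0.1.23", "0.1.17")` is true, so the stuck agent would count as a legitimate upgrade, while `isNewerOrEqual("0.1.9", "0.1.17")` is false.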
## Root Cause Analysis
The security system was designed with correct cryptographic primitives but:
1. **Workflow incomplete** - signing never connected to update delivery
2. **Edge cases unhandled** - version updates can get stuck
3. **Operational gaps** - no UI/automation for critical functions
This is a classic "secure design, incomplete implementation" scenario: the primitives are sound, but the workflow that should use them was never finished.
## Lessons Learned
1. **Security is not just about algorithms** - the workflow matters
2. **Edge cases kill security** - version update catch-22
3. **Automation is required** - manual steps don't happen
4. **Visibility is critical** - need logs, alerts, UI feedback
5. **Testing must include failure modes** - what happens when things go wrong
## Files Modified/Created
- `SECURITY_AUDIT.md` - Comprehensive security analysis
- `today.md` - This session log
- `aggregator-server/internal/api/middleware/machine_binding.go` - Enhancement in progress
## Session Conclusion
RedFlag has excellent security architecture but critical implementation gaps prevent it from being production-ready. The version upgrade bug is the most immediate user-facing issue, while the missing update signing workflow is the biggest architectural gap.
The solution approach focuses on making existing security components work together seamlessly while maintaining strong security guarantees.

---

**Status:** Session paused mid-implementation, ready to continue with middleware enhancement.

---
## Build Orchestrator Analysis (2025-11-10)
### Discovery Summary
**Problem:** Build orchestrator and install script were speaking different languages
**What Was Happening:**
- Build orchestrator → Generated Docker configs (docker-compose.yml, Dockerfile)
- Install script → Expected native binary + config.json (no Docker)
- Result: Install script ignored build orchestrator, downloaded generic unsigned binaries
**Why This Happened:**
During development, both approaches were explored:
1. Docker container agents (early prototype)
2. Native binary agents (production decision)
Build orchestrator was implemented for approach #1 while install script was built for approach #2. Neither was updated when the architectural decision was made to go native-only.
### Architecture Validation
**What Actually Works PERFECTLY:**
```
┌─────────────────────────────────────────────────────────────┐
│ Dockerfile Multi-Stage Build (aggregator-server/Dockerfile)│
│ Stage 1: Build generic agent binaries for all platforms │
│ Stage 2: Copy to /app/binaries/ in final server image │
└────────────────────────┬────────────────────────────────────┘
│ Server runs...
┌──────────────────────────────────────────┐
│ downloadHandler serves from /app/ │
│ Endpoint: /api/v1/downloads/{platform} │
└────────────┬─────────────────────────────┘
│ Install script downloads with curl...
┌──────────────────────────────────────────┐
│ Install Script (downloads.go:537-830) │
│ - Deploys via systemd (Linux) │
│ - Deploys via Windows services │
│ - No Docker involved │
└──────────────────────────────────────────┘
```
**What's Missing (The Gap):**
```
When admin clicks "Update Agent" in UI:
1. Take generic binary from /app/binaries/{platform}/redflag-agent
2. Embed: agent_id, server_url, registration_token into config
3. Sign with Ed25519 (using signingService.SignFile())
4. Store in agent_update_packages table
5. Serve signed version via downloads endpoint
```
### Corrected Architecture
**Goal:** Make build orchestrator generate **signed native binaries** not Docker configs
**New Build Orchestrator Flow:**
```go
// 1. Receive build request via POST /api/v1/build/new or /api/v1/build/upgrade
// 2. Load generic binary from /app/binaries/{platform}/
// 3. Generate agent-specific config.json (not docker-compose.yml)
// 4. Sign binary with Ed25519 key (using existing signingService)
// 5. Store signature in agent_update_packages table
// 6. Return download URL for signed binary
```
**Install Script Stays EXACTLY THE SAME**
- Continues to download from `/api/v1/downloads/{platform}`
- Continues systemd/Windows service deployment
- Just gets **signed binaries** instead of generic ones
### Implementation Roadmap (Updated)
#### Immediate (Build Orchestrator Fix)
1. Replace docker-compose.yml generation with config.json generation
2. Add Ed25519 signing step using signingService.SignFile()
3. Store signed binary info in agent_update_packages table
4. Update downloadHandler to serve signed versions when available
#### Short Term (Agent Updates)
1. Complete middleware implementation for version upgrade handling
2. Add nonce validation for update authorization
3. Update agent to send version/nonce headers
4. Test end-to-end agent update flow
#### Medium Term (Security Polish)
1. Add UI for package management and signing status
2. Add fingerprint logging for TOFU verification
3. Implement key rotation support
4. Add integration tests for signing workflow
### Corrected Understanding
**Original Misconception:** Build orchestrator was "wrong" or "broken"
**Actual Reality:** Build orchestrator was generating artifacts for a Docker-based deployment architecture that was **explored but not chosen**. The native binary architecture is **already correct and working** - we just need to connect the signing workflow to it.
**The Fix:** Don't throw out the build orchestrator - **redirect it** to generate the right artifacts for the native binary architecture.

---
**Final Status:** Architecture validated, root cause identified, path forward clear. Ready to implement signed binary generation.