Files

jpetree331 f97d4845af feat(security): A-1 Ed25519 key rotation + A-2 replay attack fixes

Complete RedFlag codebase with two major security audit implementations.

== A-1: Ed25519 Key Rotation Support ==

Server:
- SignCommand sets SignedAt timestamp and KeyID on every signature
- signing_keys database table (migration 020) for multi-key rotation
- InitializePrimaryKey registers active key at startup
- /api/v1/public-keys endpoint for rotation-aware agents
- SigningKeyQueries for key lifecycle management

Agent:
- Key-ID-aware verification via CheckKeyRotation
- FetchAndCacheAllActiveKeys for rotation pre-caching
- Cache metadata with TTL and staleness fallback
- SecurityLogger events for key rotation and command signing

== A-2: Replay Attack Fixes (F-1 through F-7) ==

F-5 CRITICAL - RetryCommand now signs via signAndCreateCommand
F-1 HIGH     - v3 format: "{agent_id}:{cmd_id}:{type}:{hash}:{ts}"
F-7 HIGH     - Migration 026: expires_at column with partial index
F-6 HIGH     - GetPendingCommands/GetStuckCommands filter by expires_at
F-2 HIGH     - Agent-side executedIDs dedup map with cleanup
F-4 HIGH     - commandMaxAge reduced from 24h to 4h
F-3 CRITICAL - Old-format commands rejected after 48h via CreatedAt

Verification fixes: migration idempotency (ETHOS #4), log format
compliance (ETHOS #1), stale comments updated.

All 24 tests passing. Docker --no-cache build verified.
See docs/ for full audit reports and deviation log (DEV-001 to DEV-019).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-03-28 21:25:47 -04:00

8.0 KiB

Raw Blame History

RedFlag Development Ethos

Philosophy: We are building honest, autonomous software for a community that values digital sovereignty. This isn't enterprise-fluff; it's a "less is more" set of non-negotiable principles forged from experience. We ship bugs, but we are honest about them, and we log the failures.

The Core Ethos (Non-Negotiable Principles)

These are the rules we've learned not to compromise on. They are the foundation of our development contract.

1. Errors are History, Not /dev/null

Principle: NEVER silence errors.

Rationale: A "laid back" admin is one who can sleep at night, knowing any failure will be in the logs. We don't use 2>/dev/null. We fix the root cause, not the symptom.

Implementation Contract:

All errors, from a script exit 1 to an API 500, MUST be captured and logged with context (what failed, why, what was attempted)
All logs MUST follow the [TAG] [system] [component] format (e.g., [ERROR] [agent] [installer] Download failed...)
The final destination for all auditable events (errors and state changes) is the history table

2. Security is Non-Negotiable

Principle: NEVER add unauthenticated endpoints.

Rationale: "Temporary" is permanent. Every single route MUST be protected by the established, multi-subsystem security architecture.

Security Stack:

User Auth (WebUI): All admin dashboard routes MUST be protected by WebAuthMiddleware()
Agent Registration: Agents can only be created using valid registration_token via /api/v1/agents/register
Agent Check-in: All agent-to-server communication MUST be protected by AuthMiddleware() validating JWT access tokens
Agent Token Renewal: Agents MUST only renew tokens using their long-lived refresh_token via /api/v1/agents/renew
Hardware Verification: All authenticated agent routes MUST be protected by MachineBindingMiddleware to validate X-Machine-ID header
Update Security: Sensitive commands MUST be protected by signed Ed25519 Nonce to prevent replay attacks
Binary Security: Agents MUST verify Ed25519 signatures of downloaded binaries against cached server public key (TOFU model)

3. Assume Failure; Build for Resilience

Principle: NEVER assume an operation will succeed.

Rationale: Networks fail. Servers restart. Agents crash. The system must recover without manual intervention.

Resilience Contract:

Agent Network: Agent check-ins MUST use retry logic with exponential backoff to survive server 502s and transient failures
Scanner Reliability: Long-running or fragile scanners (Windows Update, DNF) MUST be wrapped in Circuit Breaker to prevent subsystem blocking
Data Delivery: Command results MUST use Command Acknowledgment System (pending_acks.json) for at-least-once delivery guarantees

4. Idempotency is a Requirement

Principle: NEVER forget idempotency.

Rationale: We (and our agents) will inevitably run the same command twice. The system must not break or create duplicate state.

Idempotency Contract:

Install Scripts: Must be idempotent, checking if agent/service is already installed before attempting installation
Command Design: All commands should be designed for idempotency to prevent duplicate state issues
Database Migrations: All schema changes MUST be idempotent (CREATE TABLE IF NOT EXISTS, ADD COLUMN IF NOT EXISTS, etc.)

5. No Marketing Fluff (The "No BS" Rule)

Principle: NEVER use banned words or emojis in logs or code.

Rationale: We are building an "honest" tool for technical users, not pitching a product. Fluff hides meaning and creates enterprise BS.

Clarity Contract:

Banned Words: enhanced, enterprise-ready, seamless, robust, production-ready, revolutionary, etc.
Banned Emojis: Emojis like ⚠️, ✅, ❌ are for UI/communications, not for logs
Logging Format: All logs MUST use the [TAG] [system] [component] format for clarity and consistency

Critical Build Practices (Non-Negotiable)

Docker Cache Invalidation During Testing

Principle: ALWAYS use --no-cache when testing fixes.

Rationale: Docker layer caching will use the broken state unless explicitly invalidated. A fix that appears to fail may simply be using cached layers.

Build Contract:

Testing Fixes: docker-compose build --no-cache or docker build --no-cache
Never Assume: Cache will not pick up source code changes automatically
Verification: If a fix doesn't work, rebuild without cache before debugging further

Development Workflow Principles

Session-Based Development

Development sessions follow a structured pattern to maintain quality and documentation:

Before Starting:

Review current project status and priorities
Read previous session documentation for context
Set clear, specific goals for the session
Create todo list to track progress

During Development:

Implement code following established patterns
Document progress as you work (don't wait until end)
Update todo list continuously
Test functionality as you build

After Session Completion:

Create session documentation with complete technical details
Update status files with new capabilities and technical debt
Clean up todo list and plan next session priorities
Verify all quality checkpoints are met

Quality Standards

Code Quality:

Follow language best practices (Go, TypeScript, React)
Include proper error handling for all failure scenarios
Add meaningful comments for complex logic
Maintain consistent formatting and style

Documentation Quality:

Be accurate and specific with technical details
Include file paths, line numbers, and code snippets
Document the "why" behind technical decisions
Focus on outcomes and user impact

Testing Quality:

Test core functionality and error scenarios
Verify integration points work correctly
Validate user workflows end-to-end
Document test results and known issues

The Pre-Integration Checklist

Do not merge or consider work complete until you can check these boxes:

All errors are logged (not silenced with /dev/null)
No new unauthenticated endpoints exist (all use proper middleware)
Backup/restore/fallback paths exist for critical operations
Idempotency verified (can run 3x safely)
History table logging added for all state changes
Security review completed (respects the established stack)
Testing includes error scenarios (not just happy path)
Documentation is updated with current implementation details
Technical debt is identified and tracked

Sustainable Development Practices

Technical Debt Management

Every session must identify and document:

New Technical Debt: What shortcuts were taken and why
Deferred Features: What was postponed and the justification
Known Issues: Problems discovered but not fixed
Architecture Decisions: Technical choices needing future review

Self-Enforcement Mechanisms

Pattern Discipline:

Use TodoWrite tool for session progress tracking
Create session documentation for ALL development work
Update status files to reflect current reality
Maintain context across development sessions

Anti-Patterns to Avoid: ❌ "I'll document it later" - Details will be lost ❌ "This session was too small to document" - All sessions matter ❌ "The technical debt isn't important enough to track" - It will become critical ❌ "I'll remember this decision" - You won't, document it

Positive Patterns to Follow: ✅ Document as you go - Take notes during implementation ✅ End each session with documentation - Make it part of completion criteria ✅ Track all decisions - Even small choices have future impact ✅ Maintain technical debt visibility - Hidden debt becomes project risk

This ethos ensures consistent, high-quality development while building a maintainable system that serves both current users and future development needs. The principles only work when consistently followed.

8.0 KiB Raw Blame History