Future Enhancements & Considerations

Critical Testing Issues

Windows Agent Update Persistence Bug

Status: Needs Investigation

Problem: Microsoft Defender updates reappear after installation

  • Updates are marked as installed but show up again in scan results
  • Possible Windows Update state caching issue
  • May be related to Windows Update Agent refresh timing

Investigation Needed:

  • Verify update installation actually completes on Windows side
  • Check Windows Update API state after installation
  • Compare package state in database vs Windows registry
  • Test with different update types (Defender vs other updates)
  • May need to force WUA refresh after installation
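
A first pass at the database-versus-Windows comparison could be a plain set intersection. This is a minimal sketch, assuming updates are identified by a string ID; the function name and the sample IDs are hypothetical.

```python
def find_reappearing_updates(installed_in_db, reported_by_scan):
    """Return update IDs the database marks installed but a later scan still offers."""
    return sorted(set(installed_in_db) & set(reported_by_scan))


# Hypothetical sample data: IDs recorded as installed vs. IDs in the next scan.
db_installed = {"KB0000001", "KB0000002"}
next_scan = ["KB0000001", "KB0000003"]

reappearing = find_reappearing_updates(db_installed, next_scan)
print(reappearing)
```

If Defender definition updates turn out to reuse one identifier across versions, which is worth verifying, matching on that identifier alone would flag every new definition release. Keying on an identifier plus revision would avoid that.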

Priority: High - affects Windows agent reliability


Immediate Priority - Real-Time Operations

Intelligent Heartbeat System Enhancement

Current State:

  • Manual heartbeat toggle (pink icon when active)
  • User-initiated only
  • Fixed duration options

Proposed Enhancement:

  • Auto-trigger heartbeat on operations: Any command sent to agent triggers heartbeat automatically
  • Color coding:
    • Blue: System-initiated heartbeat (scan, install, etc)
    • Pink: User-initiated manual heartbeat
  • Lifecycle management: Heartbeat auto-ends when operation completes
  • Smart detection: Don't spam heartbeat commands if already active

Implementation Strategy:

  • Phase 1: Scan operations auto-trigger heartbeat
  • Phase 2: Install/approve operations auto-trigger heartbeat
  • Phase 3: Any agent command auto-triggers appropriate heartbeat duration
  • Phase 4: Heartbeat duration scales with operation type (30s scan vs 10m install)
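
The auto-trigger, smart-detection, and duration-scaling ideas could fit together roughly as follows. This is a sketch under assumptions, not the server's actual code: the class name, the `source` values, and the 60-second default are hypothetical, while the 30s and 10m durations come from Phase 4 above.

```python
import time

# Durations per operation type, in seconds (30s scan vs 10m install).
HEARTBEAT_SECONDS = {"scan": 30, "install": 600}
DEFAULT_SECONDS = 60  # assumption for command types not listed above


class HeartbeatTracker:
    """Tracks which agents have an active heartbeat and when it expires."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._active = {}  # agent_id -> (expires_at, source)

    def on_command(self, agent_id, command_type, source="system"):
        """Return True if a heartbeat command should be sent to the agent."""
        now = self._clock()
        current = self._active.get(agent_id)
        if current and current[0] > now:
            return False  # already active: do not send another heartbeat
        duration = HEARTBEAT_SECONDS.get(command_type, DEFAULT_SECONDS)
        self._active[agent_id] = (now + duration, source)
        return True

    def on_complete(self, agent_id):
        """End a system-initiated heartbeat when its operation finishes."""
        current = self._active.get(agent_id)
        if current and current[1] == "system":
            del self._active[agent_id]

    def is_active(self, agent_id):
        current = self._active.get(agent_id)
        return bool(current) and current[0] > self._clock()
```

Manual (pink) heartbeats are left alone by `on_complete`, so the user keeps control of their duration.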

User Experience:

  • User clicks "Scan Now" → blue heartbeat activates → scan completes → heartbeat stops
  • User clicks "Install" → blue heartbeat activates → install completes → heartbeat stops
  • User manually triggers heartbeat → pink icon → user controls duration

Priority: High - improves responsiveness without manual intervention

Dashboard Visualization Enhancement:

  • Live Commands Dashboard Widget: Aggregate view of all active operations
  • Color coding extends to commands:
    • Pink badges: User-initiated commands (manual scan, manual install, etc)
    • Blue badges: System-orchestrated commands (auto-scan, auto-heartbeat, approved workflows)
  • Fleet monitoring at a glance:
    • Visual breakdown: "X agents with blue (system) operations | Y agents with pink (manual) operations"
    • Quick filtering: "Show only system-orchestrated operations" vs "Show only user-initiated"
    • Live count: "Active system operations triggering heartbeats: 3"
  • Agent list integration:
    • Small blue/pink indicator dots next to agent names
    • Sort/filter by active heartbeat status and source
    • Dashboard stats showing heartbeat distribution across fleet
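
The fleet breakdown could be computed from the existing `source` field on commands. The sketch below assumes that field holds either `"system"` or `"manual"` and that each command carries an `agent_id`; both the values and the function names are hypothetical.

```python
def fleet_breakdown(active_commands):
    """Count distinct agents that have active system vs manual commands."""
    agents = {"system": set(), "manual": set()}
    for command in active_commands:
        agents.setdefault(command["source"], set()).add(command["agent_id"])
    return {source: len(ids) for source, ids in agents.items()}


def summary_line(counts):
    """Format the at-a-glance breakdown shown on the dashboard."""
    return (f"{counts['system']} agents with blue (system) operations | "
            f"{counts['manual']} agents with pink (manual) operations")
```

Counting distinct agents rather than commands keeps the numbers stable when one agent runs several operations at once.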

Use Case: MSP/homelab fleet monitoring - differentiate between automated orchestration (blue) and manual intervention (pink) at a glance. Helps identify which systems need attention vs which are running autonomously.

Note: Backend tracking is complete (source field in commands, metadata storage). Frontend visualization is deferred until after V1.0.


Strategic Architecture Decisions

Update Management Philosophy - Pre-V1.0 Discussion Needed

Core Questions:

  1. Are we a mirror? Do we cache/store update packages locally?
  2. Are we a gatekeeper? Do we proxy updates through our server?
  3. Are we an orchestrator? Do we just coordinate direct agent→repo downloads?

Current Implementation: Orchestrator model

  • Agents download directly from upstream repos
  • Server coordinates approval/installation
  • No package caching or storage

Alternative Models to Consider:

Model A: Package Proxy/Cache

  • Server downloads and caches approved updates
  • Agents pull from local server instead of internet
  • Pros: Bandwidth savings, offline capability, version pinning
  • Cons: Storage requirements, security responsibility, repo sync complexity

Model B: Approval Database

  • Server stores approval decisions without packages
  • Agents check "is package X approved?" before installing from upstream
  • Pros: Lightweight, flexible, audit trail
  • Cons: No offline capability, no bandwidth savings
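
To make Model B concrete, here is a minimal in-memory sketch of an approval store with an audit trail. The class, the three decision states, and the (package, version) key are assumptions for illustration; a real version would sit on the database.

```python
from enum import Enum


class Decision(Enum):
    APPROVED = "approved"
    REJECTED = "rejected"
    PENDING = "pending"


class ApprovalStore:
    """Stores approval decisions only; packages still come from upstream."""

    def __init__(self):
        self._decisions = {}  # (package, version) -> Decision
        self.audit_log = []   # append-only record of who decided what

    def record(self, package, version, decision, decided_by):
        self._decisions[(package, version)] = decision
        self.audit_log.append((package, version, decision.value, decided_by))

    def is_approved(self, package, version):
        """Answer the agent's 'is package X approved?' question."""
        decision = self._decisions.get((package, version), Decision.PENDING)
        return decision is Decision.APPROVED
```

Unknown packages default to pending, so an agent never installs something nobody has looked at.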

Model C: Hybrid Approach

  • Critical updates: Cache locally (security patches)
  • Regular updates: Direct from upstream
  • User-configurable per update category

Windows Enforcement Challenge:

  • Linux: Can control APT/DNF sources easily
  • Windows: Windows Update has limited local control
  • Winget: Can control sources
  • Need unified approach that works cross-platform

Questions for V1.0:

  • Do users want local update caching?
  • Is bandwidth savings worth storage complexity?
  • Should "disapprove" mean "block installation" or just "don't auto-install"?
  • How do we handle Windows Update's limited control surface?

Decision Timeline: Before V1.0 - this affects database schema, agent architecture, storage requirements


High Priority - Security & Authentication

Cryptographically Signed Agent Binaries

Problem: Agents can currently be copied between servers, duplicated, or spoofed. Rate limiting is IP-based, which does not prevent abuse at the agent level.

Proposed Solution:

  • Server generates unique cryptographic signature when building/distributing agent binaries
  • Each agent binary is bound to the specific server instance via:
    • SSH keys or X.509 certificates
    • Server's public/private key pair
    • Unique server identifier embedded in binary at build time
  • Agent presents cryptographic proof of authenticity during registration and check-ins
  • Server validates signature before accepting any agent communication

Benefits:

  1. Better Rate Limiting: Track and limit per-agent-binary instead of per-IP

    • Prevents multiple agents from same host sharing rate limit bucket
    • Each unique agent has its own quota
    • Detect and block duplicated/copied agents
  2. Prevents Cross-Server Agent Migration:

    • Agent built for Server A cannot register with Server B
    • Stops unauthorized agent redistribution
    • Ensures agents only communicate with their originating server
  3. Audit Trail:

    • Track which specific binary version is running where
    • Identify compromised or rogue agent binaries
    • Revoke specific agent signatures if needed
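
Per-agent rate limiting, as opposed to per-IP, might look like the fixed-window counter below. It is a sketch only: the class name, the fixed-window strategy, and the limits are assumptions, and it presumes the agent ID has already been authenticated.

```python
import time


class PerAgentRateLimiter:
    """Fixed-window request counter keyed by agent ID rather than client IP."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self._limit = limit
        self._window = window_seconds
        self._clock = clock
        self._windows = {}  # agent_id -> (window_start, request_count)

    def allow(self, agent_id):
        """Return True if this agent still has quota in the current window."""
        now = self._clock()
        start, count = self._windows.get(agent_id, (now, 0))
        if now - start >= self._window:
            start, count = now, 0  # window elapsed: start a fresh one
        if count >= self._limit:
            self._windows[agent_id] = (start, count)
            return False
        self._windows[agent_id] = (start, count + 1)
        return True
```

Two agents behind the same NAT address each get their own quota here, which is the behavior described in benefit 1.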

Implementation Considerations:

  • Use Ed25519 (fast, compact keys and signatures) or RSA for signing
  • Embed server public key in agent binary at build time
  • Store server private key securely (not in env file)
  • Agent includes signature in Authorization header alongside token
  • Server validates: signature + token + agent_id combo
  • Migration path for existing unsigned agents
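
The "signature + token + agent_id" check could be structured as below. Python's standard library has no Ed25519 or RSA signing, so this sketch substitutes an HMAC-SHA256 tag as a stand-in for the signature. That substitution works only because the server is both issuer and verifier here; it is not the asymmetric scheme proposed above. All function and parameter names are hypothetical.

```python
import hashlib
import hmac


def issue_binding(server_key: bytes, server_id: str, agent_id: str) -> str:
    """Compute the tag embedded in an agent build (HMAC stand-in for a signature)."""
    message = f"{server_id}:{agent_id}".encode()
    return hmac.new(server_key, message, hashlib.sha256).hexdigest()


def validate_checkin(server_key, server_id, agent_id, token, binding, known_tokens):
    """Accept a check-in only if both the token and the binding verify."""
    expected = issue_binding(server_key, server_id, agent_id)
    # compare_digest avoids leaking match position through timing.
    token_ok = hmac.compare_digest(known_tokens.get(agent_id, ""), token)
    binding_ok = hmac.compare_digest(expected, binding)
    return token_ok and binding_ok
```

A binding issued by Server A fails validation on Server B because the two servers hold different keys, which is the cross-server migration property the proposal is after.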

Timeline: Sooner than initially thought - foundational security improvement


Medium Priority - UI/UX Improvements

Rate Limit Settings UI

Current State: API endpoints exist, UI skeleton present but non-functional

Needed:

  • Display current rate limit values for all endpoint types
  • Live editing of limits with validation
  • Show current usage/remaining per limit type
  • Reset to defaults button
  • Preview impact before applying changes
  • Warning when setting limits too low

Location: Settings page → Rate Limits section

Server Status/Splash During Operations

Current State: Dashboard shows "Failed to load" during server restarts/maintenance

Needed:

  • Detect when server is unreachable vs actual error
  • Show friendly "Server restarting..." splash instead of error
  • Optionally show an animated spinner or progress indicator
  • Different states:
    • Server starting up
    • Server restarting (config change)
    • Server maintenance
    • Actual error (needs user action)

Possible Implementation:

  • SetupCompletionChecker could handle this (already polling /health)
  • Add status overlay component
  • Detect specific error types (network vs 500 vs 401)
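
Distinguishing "unreachable" from "actual error" could reduce to a small mapping from the health probe result to a display state. The state names are hypothetical, and treating 503 as maintenance is an assumption about how the server would signal it.

```python
def classify_server_state(status_code):
    """Map a /health probe result to a dashboard state.

    status_code is the HTTP status, or None when the request failed at the
    network level (connection refused, timeout).
    """
    if status_code is None:
        return "restarting"      # unreachable: show the splash, keep polling
    if status_code == 503:
        return "maintenance"     # assumption: server answers 503 in maintenance
    if status_code == 401:
        return "auth_required"   # session expired: user must log in again
    if status_code >= 400:
        return "error"           # genuine failure that needs user action
    return "ok"
```

Telling "starting up" apart from "restarting" would need extra information from the server, so this sketch folds both into one state.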

Dashboard Statistics Loading State

Current: Hard error when stats unavailable

Better:

  • Skeleton loaders for stat cards
  • Graceful degradation if some stats fail
  • Retry button for failed stat fetches
  • Cache last-known-good values briefly
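
The "last-known-good" idea can be sketched as a wrapper that falls back to a recent cached value when a fetch fails. The class name and the tuple return shape are assumptions; the dashboard itself would implement this in its own frontend code.

```python
import time


class LastKnownGood:
    """Serves a recent cached value when the live fetch fails."""

    def __init__(self, max_age_seconds, clock=time.monotonic):
        self._max_age = max_age_seconds
        self._clock = clock
        self._cache = {}  # key -> (stored_at, value)

    def fetch(self, key, loader):
        """Return (value, is_stale); re-raise if no fresh-enough fallback exists."""
        try:
            value = loader()
        except Exception:
            cached = self._cache.get(key)
            if cached and self._clock() - cached[0] <= self._max_age:
                return cached[1], True
            raise
        self._cache[key] = (self._clock(), value)
        return value, False
```

The `is_stale` flag lets a stat card show the old number with a visual hint instead of a hard error.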

Lower Priority - Feature Enhancements

Agent Auto-Update System

Currently agents must be manually updated. Need:

  • Server-initiated agent updates
  • Rollback capability
  • Staged rollouts (canary deployments)
  • Version compatibility checks
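
For staged rollouts, one common approach is to hash each agent ID into a stable bucket and update only agents below the current percentage. This is a sketch of that general technique, not a committed design; the function name and the choice of SHA-256 are assumptions.

```python
import hashlib


def in_rollout(agent_id: str, target_version: str, percent: int) -> bool:
    """Decide deterministically whether an agent is in the current rollout stage."""
    # Mixing in the version gives each release a different canary group.
    digest = hashlib.sha256(f"{target_version}:{agent_id}".encode()).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent
```

Because the bucket is fixed per agent and version, raising the percentage only adds agents and never removes ones already updated.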

Proxmox Integration

Planned feature for managing VMs/containers:

  • Detect Proxmox hosts
  • List VMs and containers
  • Trigger updates at VM/container level
  • Separate update categories for host vs guests

Mobile-Responsive Dashboard

The dashboard works on mobile but is not optimized. Needed:

  • Better mobile nav (hamburger menu)
  • Touch-friendly buttons
  • Responsive tables (card view on mobile)
  • PWA support for installing as app

Notification System

  • Email alerts for failed updates
  • Webhook integration (Discord, Slack, etc)
  • Configurable notification rules
  • Quiet hours / alert throttling

Scheduled Update Windows

  • Define maintenance windows per agent
  • Auto-approve updates during windows
  • Block updates outside windows
  • Timezone-aware scheduling
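
A timezone-aware window check might look like this. It is a minimal sketch assuming each agent stores a daily start and end time plus its own timezone; the function and parameter names are hypothetical.

```python
from datetime import datetime, time, timedelta, timezone


def in_maintenance_window(now_utc, start, end, agent_tz):
    """Check whether now_utc falls inside the agent's local daily window."""
    local = now_utc.astimezone(agent_tz).time()
    if start <= end:
        return start <= local < end
    # Window crosses midnight, e.g. 22:00 to 02:00.
    return local >= start or local < end


# Example: an agent at UTC-5 with a 02:00 to 04:00 local window.
agent_tz = timezone(timedelta(hours=-5))
now = datetime(2025, 10, 31, 7, 30, tzinfo=timezone.utc)  # 02:30 local
allowed = in_maintenance_window(now, time(2, 0), time(4, 0), agent_tz)
```

Any `tzinfo` object works for `agent_tz`, so named zones could be swapped in for the fixed offset used in the example.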

Technical Debt

Configuration Management

Current: Settings scattered between database, .env file, and hardcoded defaults

Better:

  • Unified settings table in database
  • Web UI for all configuration
  • Import/export settings
  • Settings version history
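
Until settings are unified, a single resolver with an explicit precedence order would at least make the scattering predictable. The order used here (database, then environment, then default), the `REDFLAG_` prefix, and the 300-second timeout are assumptions; the 60-second shutdown delay matches the 1 minute noted under Code Organization.

```python
import os

# Hardcoded fallbacks; values are strings to match what env and DB would hold.
DEFAULTS = {"agent_timeout_seconds": "300", "shutdown_delay_seconds": "60"}


def resolve_setting(name, db_settings, env=os.environ, defaults=DEFAULTS):
    """Return (value, origin), preferring database over env over default."""
    if name in db_settings:
        return db_settings[name], "database"
    env_key = "REDFLAG_" + name.upper()  # hypothetical naming convention
    if env_key in env:
        return env[env_key], "env"
    return defaults[name], "default"
```

Returning the origin alongside the value would let a settings page show where each value currently comes from.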

Testing Coverage

  • Add integration tests for rate limiter
  • Test agent registration flow end-to-end
  • UI component tests for critical paths
  • Load testing for concurrent agents

Documentation

  • API reference needs expansion
  • Agent installation guide for edge cases
  • Troubleshooting guide
  • Architecture diagrams

Code Organization

  • Rate limiter settings should be database-backed (currently in-memory only)
  • Agent timeout values are hardcoded (should be configurable)
  • Shutdown delay is hardcoded at 1 minute (should be user-adjustable)

Notes & Philosophy

  • Less is more: No enterprise BS, keep it simple
  • FOSS mentality: All software has bugs, best effort approach
  • Homelab-first: Build for real use cases, not investor pitches
  • Honest about limitations: Document what doesn't work
  • Community-driven: Users know their needs best

Implementation Priority Order

  1. Cryptographic agent signing - Security foundation, enables better rate limiting
  2. Rate limit UI completion - Already have API, just need frontend
  3. Server status splash - UX improvement, quick win
  4. Settings management refactor - Enables other features
  5. Auto-update system - Major feature, needs careful design
  6. Everything else - As time permits

Last updated: 2025-10-31