RedFlag Heartbeat System Documentation

Version: v0.1.14 (Architecture Separation) COMPLETED
Status: Fully functional with automatic UI updates
Last Updated: 2025-10-28

Overview

The RedFlag Heartbeat System lets an agent switch temporarily from normal polling (every 5 minutes) to rapid polling (every 10 seconds). It is an on-demand mechanism for live operations, updates, and other time-sensitive tasks where immediate agent responsiveness is required, and it provides near real-time feedback for commands and operations while it is active.

Architecture (v0.1.14+)

Separation of Concerns

Core Design Principle: Heartbeat state changes quickly, while general agent metadata changes slowly. The two are served separately, each with a caching strategy suited to its rate of change.

Data Flow

User clicks heartbeat button
    ↓
Heartbeat command created in database
    ↓
Agent processes command
    ↓
Agent sends immediate check-in with heartbeat metadata
    ↓
Server processes heartbeat metadata → Updates database
    ↓
UI gets heartbeat data via dedicated endpoint (5s cache)
    ↓
Buttons update automatically

New Architecture Components

1. Server-side Endpoints

GET /api/v1/agents/{id}/heartbeat (NEW - v0.1.14)

{
  "enabled": boolean,        // Heartbeat enabled by user
  "until": "timestamp",      // When heartbeat expires
  "active": boolean,        // Currently active (not expired)
  "duration_minutes": number // Configured duration
}
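
The response shape above could be typed and validated on the client roughly as follows. This is a minimal sketch, not the project's actual code: `parseHeartbeatStatus` is a hypothetical helper, and it assumes `until` is always delivered as a string timestamp (the source does not say what the field holds when heartbeat is disabled).

```typescript
// Shape of the GET /api/v1/agents/{id}/heartbeat response, as documented above.
interface HeartbeatStatus {
  enabled: boolean;         // heartbeat enabled by user
  until: string;            // when heartbeat expires (assumed to be a string timestamp)
  active: boolean;          // currently active (not expired)
  duration_minutes: number; // configured duration
}

// Hypothetical helper: check an untyped JSON payload before trusting it.
function parseHeartbeatStatus(raw: unknown): HeartbeatStatus {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("heartbeat response is not an object");
  }
  const { enabled, until, active, duration_minutes } = raw as Record<string, unknown>;
  if (
    typeof enabled !== "boolean" ||
    typeof until !== "string" ||
    typeof active !== "boolean" ||
    typeof duration_minutes !== "number"
  ) {
    throw new Error("heartbeat response has an unexpected shape");
  }
  return { enabled, until, active, duration_minutes };
}
```

Validating at the boundary means UI components can rely on `active` being a real boolean instead of guarding against `undefined` in every consumer.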

POST /api/v1/agents/{id}/heartbeat (Existing)

{
  "enabled": true,
  "duration_minutes": 10
}
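
As an illustration of how a client might assemble this request, the sketch below builds the URL, headers, and body without sending anything. `buildHeartbeatRequest`, `PreparedHeartbeatRequest`, and the base URL used in the usage note are assumptions for the example; only the path and the two body fields come from the documentation above. The payload for disabling heartbeat is not documented here, so the sketch covers enabling only.

```typescript
// Request body documented for POST /api/v1/agents/{id}/heartbeat.
interface HeartbeatRequestBody {
  enabled: boolean;
  duration_minutes: number;
}

// Hypothetical container for a request that is ready to hand to an HTTP client.
interface PreparedHeartbeatRequest {
  method: "POST";
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Hypothetical helper: build (but do not send) an "enable heartbeat" request.
function buildHeartbeatRequest(
  baseUrl: string,
  agentId: string,
  durationMinutes: number
): PreparedHeartbeatRequest {
  // Reject NaN, infinities, fractions, and non-positive durations.
  if (
    !isFinite(durationMinutes) ||
    durationMinutes <= 0 ||
    Math.floor(durationMinutes) !== durationMinutes
  ) {
    throw new Error("durationMinutes must be a positive whole number");
  }
  const payload: HeartbeatRequestBody = {
    enabled: true,
    duration_minutes: durationMinutes,
  };
  return {
    method: "POST",
    url: baseUrl + "/api/v1/agents/" + encodeURIComponent(agentId) + "/heartbeat",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  };
}
```

For example, `buildHeartbeatRequest("http://localhost:8080", "agent-42", 10)` yields a request whose body matches the JSON shown above; the host and agent ID are placeholders.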

2. Client-side Architecture

useHeartbeatStatus(agentId) Hook (NEW - v0.1.14)

  • Smart Polling: Only polls when heartbeat is active
  • 5-second cache: Appropriate for real-time data
  • Auto-stops: Stops polling when heartbeat expires
  • No rate limiting: Minimal server impact

Data Sources:

  • Heartbeat UI: Uses dedicated endpoint (/agents/{id}/heartbeat)
  • General Agent UI: Uses existing endpoint (/agents/{id})
  • System Information: Uses existing endpoint with 2-5 minute cache
  • History: Uses existing endpoint with 5-minute cache

Smart Polling Logic

refetchInterval: (query) => {
  const data = query.state.data as HeartbeatStatus;

  // Only poll when heartbeat is enabled and still active
  if (data?.enabled && data?.active) {
    return 5000; // 5 seconds
  }

  return false; // No polling when inactive
}

Legacy Systems Removed (v0.1.14)

Removed Components

  1. Circular Sync Logic (agent/main.go lines 353-365)

    • Problem: Config ↔ Client bidirectional sync causing inconsistent state
    • Removed in v0.1.13
  2. Startup Config→Client Sync (agent/main.go lines 289-291)

    • Problem: Unnecessary sync that could override heartbeat state
    • Removed in v0.1.13
  3. Server-driven Heartbeat (EnableRapidPollingMode())

    • Problem: Bypassed command system, created inconsistency
    • Replaced with command-based approach in v0.1.13
  4. Mixed Data Sources (v0.1.14)

    • Problem: Heartbeat state mixed with general agent metadata
    • Separated into dedicated endpoint in v0.1.14

Retained Components

  1. Command-based Architecture (v0.1.12+)

    • Heartbeat commands go through same system as other commands
    • Full audit trail in history
    • Proper error handling and retry logic
  2. Config Persistence (v0.1.13+)

    • cfg.Save() calls ensure heartbeat settings survive restarts
    • Agent remembers heartbeat state across reboots
  3. Stale Heartbeat Detection (v0.1.13+)

    • Server detects when agent restarts without heartbeat
    • Creates audit command: "Heartbeat cleared - agent restarted without active heartbeat mode"
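
The stale heartbeat check can be pictured as a comparison between what the server has on record and what the agent reports at check-in. The sketch below is an illustration of that idea only: it is written in TypeScript for consistency with the client examples, whereas the real check lives in the server code, and the type and field names (`ServerHeartbeatRecord`, `AgentCheckIn`, `heartbeatEnabled`) are hypothetical.

```typescript
// Hypothetical view of the heartbeat state the server has stored for an agent.
interface ServerHeartbeatRecord {
  enabled: boolean;
  until: Date; // when the server expects heartbeat mode to end
}

// Hypothetical view of the heartbeat metadata an agent sends at check-in.
interface AgentCheckIn {
  heartbeatEnabled: boolean;
}

// Stale means: the server still expects an active heartbeat,
// but the agent is not reporting one (it has likely restarted).
function isStaleHeartbeat(
  server: ServerHeartbeatRecord,
  checkIn: AgentCheckIn,
  now: Date
): boolean {
  const serverExpectsActive =
    server.enabled && server.until.getTime() > now.getTime();
  return serverExpectsActive && !checkIn.heartbeatEnabled;
}
```

When the check returns true, the documented behavior is to clear the stored state and record the audit command quoted above.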

Cache Strategy

| Data Type | Endpoint | Cache Time | Polling Interval | Rationale |
|---|---|---|---|---|
| Heartbeat Status | /agents/{id}/heartbeat | 5 seconds | 5 seconds (when active) | Real-time feedback needed |
| Agent Status | /agents/{id} | 2-5 minutes | None | Slow-changing data |
| System Information | /agents/{id} | 2-5 minutes | None | Static most of the time |
| History Data | /agents/{id}/commands | 5 minutes | None | Historical data |
| Active Commands | /commands/active | 0 | 5 seconds | Command tracking |
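
One way to keep these values in a single place on the client is a small policy table. The sketch below is an assumption about how that could look, not the project's configuration: the names `cachePolicies`, `CachePolicy`, and `effectivePollMs` are hypothetical, and the "2-5 minutes" range is pinned to its lower bound of 2 minutes purely for illustration.

```typescript
const SECOND_MS = 1000;
const MINUTE_MS = 60 * SECOND_MS;

interface CachePolicy {
  endpoint: string;
  cacheMs: number;        // how long a response is treated as fresh
  pollMs: number | false; // refetch interval, or false for no polling
}

// Values mirror the cache strategy listed above.
const cachePolicies: Record<string, CachePolicy> = {
  heartbeatStatus: { endpoint: "/agents/{id}/heartbeat", cacheMs: 5 * SECOND_MS, pollMs: 5 * SECOND_MS },
  // "2-5 minutes" in the source; 2 minutes is an arbitrary pick within that range.
  agentStatus: { endpoint: "/agents/{id}", cacheMs: 2 * MINUTE_MS, pollMs: false },
  systemInformation: { endpoint: "/agents/{id}", cacheMs: 2 * MINUTE_MS, pollMs: false },
  historyData: { endpoint: "/agents/{id}/commands", cacheMs: 5 * MINUTE_MS, pollMs: false },
  activeCommands: { endpoint: "/commands/active", cacheMs: 0, pollMs: 5 * SECOND_MS },
};

// Heartbeat status is polled only while heartbeat is active;
// every other policy uses its fixed interval.
function effectivePollMs(policyName: string, heartbeatActive: boolean): number | false {
  const policy = cachePolicies[policyName];
  if (policy === undefined) {
    throw new Error("unknown cache policy: " + policyName);
  }
  if (policyName === "heartbeatStatus" && !heartbeatActive) {
    return false;
  }
  return policy.pollMs;
}
```

Keeping the fast and slow policies as separate entries reflects the separation of concerns described earlier: changing the heartbeat interval does not touch the slow-changing data.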

Usage Patterns

1. Manual Heartbeat Activation

User clicks "Enable Heartbeat" → 10-minute default → Agent checks in every 10 seconds (UI refreshes heartbeat status every 5 seconds) → Auto-disable after 10 minutes

2. Duration Selection

Quick Actions dropdown: 10min, 30min, 1hr, Permanent → Configured duration applies → Auto-disable when expires
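
For the timed options, the expiry and the `active` flag follow from simple date arithmetic. The sketch below shows that calculation under stated assumptions: `heartbeatExpiry` and `isHeartbeatActive` are hypothetical names, and the "Permanent" option is left out because this document does not say how it is encoded.

```typescript
// Expiry time for a timed heartbeat: start time plus the configured duration.
function heartbeatExpiry(startedAt: Date, durationMinutes: number): Date {
  return new Date(startedAt.getTime() + durationMinutes * 60000);
}

// A heartbeat counts as active only if it is enabled and has not yet expired.
function isHeartbeatActive(enabled: boolean, until: Date, now: Date): boolean {
  return enabled && now.getTime() < until.getTime();
}
```

With a 10-minute duration starting at 12:00:00, the heartbeat is active at 12:09:59 and inactive from 12:10:00 onward, which is the point at which polling is expected to stop.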

3. Command-triggered Heartbeat

Update/Install commands → Heartbeat enabled automatically (10min) → Command completes → Auto-disable after 10min

4. Stale State Detection

Agent restarts with heartbeat active → Server detects mismatch → Creates audit command → Clears stale state

Performance Impact

Minimal Server Load

  • Smart Polling: Only polls when heartbeat is active
  • Dedicated Endpoint: Small JSON response (heartbeat data only)
  • 5-second Cache: Prevents excessive API calls
  • Auto-stop: Polling stops when heartbeat expires

Network Efficiency

  • Separate Caches: Fast data updates without affecting slow data
  • No Global Refresh: Only heartbeat components update frequently
  • Conditional Polling: No polling when heartbeat is inactive

Debugging and Monitoring

Server Logs

[Heartbeat] Agent <id> heartbeat status: enabled=<bool>, until=<timestamp>, active=<bool>
[Heartbeat] Stale heartbeat detected for agent <id> - server expected active until <timestamp>, but agent not reporting heartbeat (likely restarted)
[Heartbeat] Cleared stale heartbeat state for agent <id>
[Heartbeat] Created audit trail for stale heartbeat cleanup (agent <id>)
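
When scanning server logs, the status line can be picked apart with a regular expression. The sketch below is a debugging aid under assumptions, not part of RedFlag: `parseHeartbeatStatusLine` is a hypothetical helper, it assumes booleans are printed as `true`/`false`, and it tolerates any logger prefix before `[Heartbeat]`.

```typescript
interface HeartbeatLogEntry {
  agentId: string;
  enabled: boolean;
  until: string; // kept as raw text; the timestamp format is not specified here
  active: boolean;
}

// Matches the "heartbeat status" line format listed above.
const HEARTBEAT_STATUS_LINE =
  /\[Heartbeat\] Agent (\S+) heartbeat status: enabled=(true|false), until=(.+), active=(true|false)$/;

// Returns the parsed fields, or null for any other log line.
function parseHeartbeatStatusLine(line: string): HeartbeatLogEntry | null {
  const match = HEARTBEAT_STATUS_LINE.exec(line.trim());
  if (match === null) {
    return null;
  }
  return {
    agentId: match[1],
    enabled: match[2] === "true",
    until: match[3],
    active: match[4] === "true",
  };
}
```

Lines such as the stale heartbeat messages return `null`, so the helper can be used to filter a log down to status reports for one agent.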

Client Console Logs

[Heartbeat UI] Tracking command <command-id> for completion
[Heartbeat UI] Command <command-id> completed with status: <status>
[Heartbeat UI] Monitoring for completion of command <command-id>

Common Issues

  1. Buttons Not Updating: Check that the component uses the dedicated useHeartbeatStatus() hook
  2. Constant Polling: Verify that the active property in the heartbeat response becomes false once the heartbeat expires
  3. Stale State: Look for "Stale heartbeat detected" entries in the server logs
  4. Missing Data: Ensure the /agents/{id}/heartbeat endpoint is registered

Migration Notes

From v0.1.13 to v0.1.14

  • No Breaking Changes: Existing endpoints preserved
  • Improved UX: Real-time heartbeat button updates
  • Better Performance: Smart polling reduces server load
  • Clean Architecture: Separated fast/slow data concerns

Data Compatibility

  • Existing agent metadata format preserved
  • New heartbeat endpoint extracts from existing metadata
  • Backward compatibility maintained for legacy clients

Future Enhancements

Potential Improvements

  1. WebSocket Support: Push updates instead of polling (v0.1.15+)
  2. Batch Heartbeat: Multiple agents in single operation
  3. Global Heartbeat: Enable/disable for all agents
  4. Scheduled Heartbeat: Time-based activation
  5. Performance Metrics: Track heartbeat efficiency

Deprecation Timeline

  • v0.1.13: Command-based heartbeat (current)
  • v0.1.14: Architecture separation (current)
  • v0.1.15: WebSocket consideration
  • v0.1.16: Legacy metadata deprecation consideration

Testing

Functional Tests

  1. Manual Activation: Click enable/disable buttons
  2. Duration Selection: Test 10min/30min/1hr/permanent
  3. Auto-expiration: Verify heartbeat stops when time expires
  4. Command Integration: Confirm heartbeat auto-enables before updates
  5. Stale Detection: Test agent restart scenarios

Performance Tests

  1. Polling Behavior: Verify smart polling (only when active)
  2. Cache Efficiency: Confirm 5-second cache prevents excessive calls
  3. Multiple Agents: Test concurrent heartbeat sessions
  4. Server Load: Monitor during heavy heartbeat usage

Related Files:

  • aggregator-server/internal/api/handlers/agents.go: New GetHeartbeatStatus() function
  • aggregator-web/src/hooks/useHeartbeat.ts: Smart polling hook
  • aggregator-web/src/pages/Agents.tsx: Updated UI components
  • aggregator-web/src/lib/api.ts: New getHeartbeatStatus() function