Add docs and project files - force for Culurien
docs/4_LOG/2025-10/Status-Updates/heartbeat.md
@@ -0,0 +1,233 @@
# RedFlag Heartbeat System Documentation

**Version**: v0.1.14 (Architecture Separation) ✅ **COMPLETED**
**Status**: Fully functional with automatic UI updates
**Last Updated**: 2025-10-28

## Overview

The RedFlag Heartbeat System is a **temporary, on-demand rapid polling mechanism**: it switches an agent from normal polling (5-minute intervals) to rapid polling (10-second intervals) for the duration of an active operation. It is intended for live operations, updates, and other time-sensitive tasks that need immediate agent responsiveness.

Because the agent checks in every 10 seconds while the heartbeat is on, commands and operations get near real-time feedback.
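
The interval switch described above can be sketched as a pure function. This is only an illustration of the rule, not the agent's actual code: the names `nextCheckInDelayMs`, `NORMAL_INTERVAL_MS`, and `RAPID_INTERVAL_MS` are hypothetical, and only the two interval values come from this document.

```typescript
// Interval values come from the overview; the constant names are illustrative.
const NORMAL_INTERVAL_MS = 5 * 60 * 1000; // normal polling: 5 minutes
const RAPID_INTERVAL_MS = 10 * 1000; // heartbeat polling: 10 seconds

// Returns how long to wait before the next check-in.
// `heartbeatUntil` is null when no heartbeat has been requested (assumption).
function nextCheckInDelayMs(heartbeatUntil: Date | null, now: Date): number {
  const rapid =
    heartbeatUntil !== null && now.getTime() < heartbeatUntil.getTime();
  return rapid ? RAPID_INTERVAL_MS : NORMAL_INTERVAL_MS;
}
```

Passing `now` explicitly, instead of reading the clock inside the function, keeps the rule easy to test around the expiry boundary.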

## Architecture (v0.1.14+)

### Separation of Concerns

**Core Design Principle**: Heartbeat state is fast-changing data; general agent metadata is slow-changing. Each is served separately, with a caching strategy suited to its rate of change.

### Data Flow

```
User clicks heartbeat button
    ↓
Heartbeat command created in database
    ↓
Agent processes command
    ↓
Agent sends immediate check-in with heartbeat metadata
    ↓
Server processes heartbeat metadata → Updates database
    ↓
UI gets heartbeat data via dedicated endpoint (5s cache)
    ↓
Buttons update automatically
```

### New Architecture Components

#### 1. Server-side Endpoints

**GET `/api/v1/agents/{id}/heartbeat`** (NEW - v0.1.14)
```json
{
  "enabled": boolean,        // Heartbeat enabled by user
  "until": "timestamp",      // When heartbeat expires
  "active": boolean,         // Currently active (not expired)
  "duration_minutes": number // Configured duration
}
```
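
On the client, the response shape above could be typed as follows. This is a sketch under stated assumptions: the `until` timestamp is assumed to be an ISO 8601 string, and `isHeartbeatActive` and `buildStatus` are hypothetical helpers showing one plausible way `active` relates to `enabled` and `until`; the server's actual derivation is not shown in this document.

```typescript
// Mirrors the four fields of the GET response.
interface HeartbeatStatus {
  enabled: boolean;
  until: string; // expiry timestamp; ISO 8601 is an assumption
  active: boolean;
  duration_minutes: number;
}

// Assumed rule: active means enabled and not yet expired.
function isHeartbeatActive(enabled: boolean, until: string, now: Date): boolean {
  const expiresAt = Date.parse(until);
  if (Number.isNaN(expiresAt)) return false; // unparseable timestamp counts as inactive
  return enabled && now.getTime() < expiresAt;
}

// Hypothetical helper that assembles a status object with a consistent `active` flag.
function buildStatus(
  enabled: boolean,
  until: string,
  durationMinutes: number,
  now: Date,
): HeartbeatStatus {
  return {
    enabled,
    until,
    active: isHeartbeatActive(enabled, until, now),
    duration_minutes: durationMinutes,
  };
}
```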

**POST `/api/v1/agents/{id}/heartbeat`** (Existing)
```json
{
  "enabled": true,
  "duration_minutes": 10
}
```
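
A client might assemble this POST as shown in the sketch below. The path and body fields come from the endpoint description above; `prepareHeartbeatRequest` and `PreparedRequest` are hypothetical names, and authentication headers are omitted because this document does not describe them.

```typescript
interface HeartbeatRequestBody {
  enabled: boolean;
  duration_minutes: number;
}

// Plain description of the request, kept separate from the actual network call.
interface PreparedRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function prepareHeartbeatRequest(
  agentId: string,
  body: HeartbeatRequestBody,
): PreparedRequest {
  return {
    // Encode the id so unusual characters cannot alter the path.
    url: `/api/v1/agents/${encodeURIComponent(agentId)}/heartbeat`,
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  };
}
```

The result can be handed to `fetch(req.url, { method: req.method, headers: req.headers, body: req.body })`. Building the request as data first makes it possible to check the URL and payload without a running server.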

#### 2. Client-side Architecture

**`useHeartbeatStatus(agentId)` Hook (NEW - v0.1.14)**
- **Smart Polling**: Only polls when heartbeat is active
- **5-second cache**: Appropriate for real-time data
- **Auto-stops**: Stops polling when heartbeat expires
- **No rate limiting**: Minimal server impact

**Data Sources**:
- **Heartbeat UI**: Uses dedicated endpoint (`/agents/{id}/heartbeat`)
- **General Agent UI**: Uses existing endpoint (`/agents/{id}`)
- **System Information**: Uses existing endpoint with 2-5 minute cache
- **History**: Uses existing endpoint with 5-minute cache

### Smart Polling Logic

```typescript
refetchInterval: (query) => {
  const data = query.state.data as HeartbeatStatus;

  // Only poll when heartbeat is enabled and still active
  if (data?.enabled && data?.active) {
    return 5000; // 5 seconds
  }

  return false; // No polling when inactive
}
```
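
The fragment above is an option passed to the query hook, so it cannot run on its own. One way to make the decision testable is to extract it into a standalone function, as in this sketch; `heartbeatRefetchInterval`, `PollingInput`, and `ACTIVE_POLL_MS` are hypothetical names, not identifiers from the codebase.

```typescript
// Only the two fields that drive the polling decision; undefined before the first fetch.
type PollingInput = { enabled: boolean; active: boolean } | undefined;

const ACTIVE_POLL_MS = 5000; // 5 seconds, as in the fragment above

// Returns the polling interval in milliseconds, or false to stop polling.
function heartbeatRefetchInterval(data: PollingInput): number | false {
  return data?.enabled && data?.active ? ACTIVE_POLL_MS : false;
}
```

The callback then reduces to passing `query.state.data` through this function, and the rule can be covered by plain unit tests.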

## Legacy Systems Removed (v0.1.14)

### ❌ Removed Components

1. **Circular Sync Logic** (agent/main.go lines 353-365)
   - Problem: Config ↔ Client bidirectional sync causing inconsistent state
   - Removed in v0.1.13

2. **Startup Config→Client Sync** (agent/main.go lines 289-291)
   - Problem: Unnecessary sync that could override heartbeat state
   - Removed in v0.1.13

3. **Server-driven Heartbeat** (`EnableRapidPollingMode()`)
   - Problem: Bypassed command system, created inconsistency
   - Replaced with command-based approach in v0.1.13

4. **Mixed Data Sources** (v0.1.14)
   - Problem: Heartbeat state mixed with general agent metadata
   - Separated into dedicated endpoint in v0.1.14

### ✅ Retained Components

1. **Command-based Architecture** (v0.1.12+)
   - Heartbeat commands go through same system as other commands
   - Full audit trail in history
   - Proper error handling and retry logic

2. **Config Persistence** (v0.1.13+)
   - `cfg.Save()` calls ensure heartbeat settings survive restarts
   - Agent remembers heartbeat state across reboots

3. **Stale Heartbeat Detection** (v0.1.13+)
   - Server detects when agent restarts without heartbeat
   - Creates audit command: "Heartbeat cleared - agent restarted without active heartbeat mode"
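
The stale heartbeat check can be illustrated with a small sketch. The server side of RedFlag is not written in TypeScript and its real data structures are not shown here, so `ServerHeartbeatState`, `StaleCheck`, and `checkStaleHeartbeat` are hypothetical; only the audit message text is taken from this document.

```typescript
// What the server last recorded for an agent (illustrative shape).
interface ServerHeartbeatState {
  enabled: boolean;
  until: Date | null; // null is assumed to mean "no expiry recorded"
}

type StaleCheck = { stale: false } | { stale: true; auditMessage: string };

const STALE_AUDIT_MESSAGE =
  "Heartbeat cleared - agent restarted without active heartbeat mode";

// Stale means: the server still expects an active heartbeat,
// but the agent's check-in no longer reports one.
function checkStaleHeartbeat(
  server: ServerHeartbeatState,
  agentReportsHeartbeat: boolean,
  now: Date,
): StaleCheck {
  const serverExpectsActive =
    server.enabled &&
    (server.until === null || now.getTime() < server.until.getTime());
  if (serverExpectsActive && !agentReportsHeartbeat) {
    return { stale: true, auditMessage: STALE_AUDIT_MESSAGE };
  }
  return { stale: false };
}
```

An expired heartbeat is deliberately not reported as stale in this sketch: expiry is the normal end of a session, whereas a mismatch before expiry suggests a restart.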

## Cache Strategy

| Data Type | Endpoint | Cache Time | Polling Interval | Rationale |
|-----------|----------|------------|------------------|-----------|
| **Heartbeat Status** | `/agents/{id}/heartbeat` | 5 seconds | 5 seconds (when active) | Real-time feedback needed |
| **Agent Status** | `/agents/{id}` | 2-5 minutes | None | Slow-changing data |
| **System Information** | `/agents/{id}` | 2-5 minutes | None | Static most of the time |
| **History Data** | `/agents/{id}/commands` | 5 minutes | None | Historical data |
| **Active Commands** | `/commands/active` | 0 seconds (no cache) | 5 seconds | Command tracking |
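
The table can also be expressed as a typed lookup, which keeps cache settings in one place. This is a sketch rather than the project's configuration: the key names, `CachePolicy`, and `isFresh` are hypothetical, and where the table gives a range (2-5 minutes) the lower bound is used as an arbitrary choice.

```typescript
type DataKind =
  | "heartbeatStatus"
  | "agentStatus"
  | "systemInformation"
  | "historyData"
  | "activeCommands";

interface CachePolicy {
  staleTimeMs: number; // how long a cached response counts as fresh
  pollIntervalMs: number | null; // null: no background polling
}

const SECOND = 1000;
const MINUTE = 60 * SECOND;

const CACHE_POLICIES: Record<DataKind, CachePolicy> = {
  // Polling applies only while the heartbeat is active.
  heartbeatStatus: { staleTimeMs: 5 * SECOND, pollIntervalMs: 5 * SECOND },
  agentStatus: { staleTimeMs: 2 * MINUTE, pollIntervalMs: null },
  systemInformation: { staleTimeMs: 2 * MINUTE, pollIntervalMs: null },
  historyData: { staleTimeMs: 5 * MINUTE, pollIntervalMs: null },
  activeCommands: { staleTimeMs: 0, pollIntervalMs: 5 * SECOND },
};

// True while a response fetched at `fetchedAtMs` can still be served from cache.
function isFresh(kind: DataKind, fetchedAtMs: number, nowMs: number): boolean {
  return nowMs - fetchedAtMs < CACHE_POLICIES[kind].staleTimeMs;
}
```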

## Usage Patterns

### 1. Manual Heartbeat Activation
User clicks "Enable Heartbeat" → 10-minute default → Agent checks in every 10 seconds (UI polls heartbeat status every 5 seconds) → Auto-disable after 10 minutes

### 2. Duration Selection
Quick Actions dropdown: 10min, 30min, 1hr, Permanent → Configured duration applies → Auto-disable when the duration expires
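
Mapping a dropdown choice to an expiry time might look like the sketch below. The four options come from the list above; the names `DurationChoice` and `expiryFor` are hypothetical, and representing "Permanent" as `null` (no expiry) is an assumption, since this document does not say how the permanent option is encoded.

```typescript
type DurationChoice = "10min" | "30min" | "1hr" | "permanent";

// Minutes for every choice that actually expires.
const DURATION_MINUTES: Record<Exclude<DurationChoice, "permanent">, number> = {
  "10min": 10,
  "30min": 30,
  "1hr": 60,
};

// Returns the moment the heartbeat should auto-disable, or null for no expiry.
function expiryFor(choice: DurationChoice, start: Date): Date | null {
  if (choice === "permanent") return null; // assumed encoding of "Permanent"
  return new Date(start.getTime() + DURATION_MINUTES[choice] * 60 * 1000);
}
```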

### 3. Command-triggered Heartbeat
Update/Install commands → Heartbeat enabled automatically (10min) → Command completes → Auto-disable after 10min

### 4. Stale State Detection
Agent restarts with heartbeat active → Server detects mismatch → Creates audit command → Clears stale state

## Performance Impact

### Minimal Server Load
- **Smart Polling**: Only polls when heartbeat is active
- **Dedicated Endpoint**: Small JSON response (heartbeat data only)
- **5-second Cache**: Prevents excessive API calls
- **Auto-stop**: Polling stops when heartbeat expires

### Network Efficiency
- **Separate Caches**: Fast data updates without affecting slow data
- **No Global Refresh**: Only heartbeat components update frequently
- **Conditional Polling**: No polling when heartbeat is inactive
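
To put rough numbers on the load, the request count for one heartbeat session follows directly from the intervals given in this document. The helper `requestsInWindow` is a hypothetical name used only for this back-of-the-envelope estimate; it ignores jitter, retries, and the extra check-in sent when the heartbeat starts.

```typescript
// Number of whole polling intervals that fit into a time window.
function requestsInWindow(windowMs: number, intervalMs: number): number {
  if (intervalMs <= 0) throw new RangeError("intervalMs must be positive");
  return Math.floor(windowMs / intervalMs);
}

const TEN_MINUTES_MS = 10 * 60 * 1000;

// Default 10-minute session, per agent:
const uiStatusPolls = requestsInWindow(TEN_MINUTES_MS, 5 * 1000); // UI polling at 5 s
const agentCheckIns = requestsInWindow(TEN_MINUTES_MS, 10 * 1000); // agent at 10 s
const normalCheckIns = requestsInWindow(TEN_MINUTES_MS, 5 * 60 * 1000); // agent at 5 min
```

For the default session this gives 120 UI status polls and 60 agent check-ins, compared with 2 check-ins under normal polling, all of which stop once the heartbeat expires.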

## Debugging and Monitoring

### Server Logs
```text
[Heartbeat] Agent <id> heartbeat status: enabled=<bool>, until=<timestamp>, active=<bool>
[Heartbeat] Stale heartbeat detected for agent <id> - server expected active until <timestamp>, but agent not reporting heartbeat (likely restarted)
[Heartbeat] Cleared stale heartbeat state for agent <id>
[Heartbeat] Created audit trail for stale heartbeat cleanup (agent <id>)
```
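
When scanning server logs, the status line can be picked apart with a regular expression, as in this sketch. The pattern follows the first log format shown above; `parseHeartbeatStatusLine` is a hypothetical helper, and it assumes booleans are printed as `true`/`false` and that the line carries no timestamp prefix, neither of which this document confirms.

```typescript
interface ParsedStatusLine {
  agentId: string;
  enabled: boolean;
  until: string; // kept as raw text; the timestamp format is not specified
  active: boolean;
}

const STATUS_LINE =
  /^\[Heartbeat\] Agent (\S+) heartbeat status: enabled=(true|false), until=(.+), active=(true|false)$/;

// Returns the parsed fields, or null if the line is not a heartbeat status line.
function parseHeartbeatStatusLine(line: string): ParsedStatusLine | null {
  const match = STATUS_LINE.exec(line);
  if (match === null) return null;
  const [, agentId, enabled, until, active] = match;
  if (
    agentId === undefined ||
    enabled === undefined ||
    until === undefined ||
    active === undefined
  ) {
    return null;
  }
  return { agentId, enabled: enabled === "true", until, active: active === "true" };
}
```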

### Client Console Logs
```text
[Heartbeat UI] Tracking command <command-id> for completion
[Heartbeat UI] Command <command-id> completed with status: <status>
[Heartbeat UI] Monitoring for completion of command <command-id>
```

### Common Issues

1. **Buttons Not Updating**: Check if using dedicated `useHeartbeatStatus()` hook
2. **Constant Polling**: Verify `active` property in heartbeat response
3. **Stale State**: Look for "stale heartbeat detected" logs
4. **Missing Data**: Ensure `/agents/{id}/heartbeat` endpoint is registered

## Migration Notes

### From v0.1.13 to v0.1.14
- ✅ **No Breaking Changes**: Existing endpoints preserved
- ✅ **Improved UX**: Real-time heartbeat button updates
- ✅ **Better Performance**: Smart polling reduces server load
- ✅ **Clean Architecture**: Separated fast/slow data concerns

### Data Compatibility
- Existing agent metadata format preserved
- New heartbeat endpoint extracts from existing metadata
- Backward compatibility maintained for legacy clients

## Future Enhancements

### Potential Improvements
1. **WebSocket Support**: Push updates instead of polling (v0.1.15+)
2. **Batch Heartbeat**: Multiple agents in single operation
3. **Global Heartbeat**: Enable/disable for all agents
4. **Scheduled Heartbeat**: Time-based activation
5. **Performance Metrics**: Track heartbeat efficiency

### Deprecation Timeline
- **v0.1.13**: Command-based heartbeat (current)
- **v0.1.14**: Architecture separation (current)
- **v0.1.15**: WebSocket consideration
- **v0.1.16**: Legacy metadata deprecation consideration

## Testing

### Functional Tests
1. **Manual Activation**: Click enable/disable buttons
2. **Duration Selection**: Test 10min/30min/1hr/permanent
3. **Auto-expiration**: Verify heartbeat stops when time expires
4. **Command Integration**: Confirm heartbeat auto-enables before updates
5. **Stale Detection**: Test agent restart scenarios

### Performance Tests
1. **Polling Behavior**: Verify smart polling (only when active)
2. **Cache Efficiency**: Confirm 5-second cache prevents excessive calls
3. **Multiple Agents**: Test concurrent heartbeat sessions
4. **Server Load**: Monitor during heavy heartbeat usage

---

**Related Files**:
- `aggregator-server/internal/api/handlers/agents.go`: New `GetHeartbeatStatus()` function
- `aggregator-web/src/hooks/useHeartbeat.ts`: Smart polling hook
- `aggregator-web/src/pages/Agents.tsx`: Updated UI components
- `aggregator-web/src/lib/api.ts`: New `getHeartbeatStatus()` function