RedFlag v0.1.23.5 → v0.2.0 Implementation Plan
Priority: CRITICAL P0 Error Logging MUST be completed before v0.2.0 release
Architecture: PULL ONLY (No WebSockets/Push mechanisms)
Timeline: 2-3 development sessions (15-17 hours)
Current State Assessment (v0.1.23.5)
✅ What's Working
- Agent v0.1.23.5 running and checking in successfully
- Server config sync working (all subsystems configured with auto_run=true)
- Migration detection working properly (install.log shows proper behavior)
- Token preservation working (agent's built-in migration system)
- Install script idempotency implemented
- HistoryLog build failure fixed (system_events table created)
- Registration token expiration fixed (UI now shows correct status)
- Heartbeat implementation verified correct (with minor bug fixed)
❌ Critical Gaps (P0 - Must Fix Before v0.2.0)
- Agent startup failures invisible to server (log.Fatal before server communication)
- Registration failures not logged (invalid tokens, machine ID conflicts)
- Token renewal failures cause silent agent death
- Migration failures only visible in local logs
- Subsystem scanner failures invisible (circuit breakers, timeouts)
- No event buffering for offline agents
Implementation Strategy: Phase-Based Approach
Phase 1: Foundation & Verification (2-3 hours)
Goal: Ensure infrastructure is ready before adding error logging
1.1 Verify System Events Table
- Run migration: cd aggregator-server && go run cmd/server/main.go migrate
- Verify system_events table created in database
- Test CreateSystemEvent() query method
- Confirm indexes are working properly
1.2 Verify Subsystem Configuration
- Check agent_subsystems table has data for existing agents
- Verify GetSubsystems() query returns correct data
- Confirm heartbeat metadata storage working (rapid_polling_enabled, rapid_polling_until)
1.3 Update Documentation
- Add ERROR_FLOW_AUDIT_CRITICAL_P0_PLAN.md to project roadmap
- Update DEVELOPMENT_ETHOS.md with event logging requirements
- Create EVENT_CLASSIFICATIONS.md for reference
Deliverable: Verified infrastructure ready for P0 error logging
Phase 2: Agent-Side Event Buffering (3-4 hours)
Goal: Create local event buffering system for offline resilience
2.1 Create Event Buffer Package
File: aggregator-agent/internal/event/buffer.go (NEW)
Implementation:
// Event buffer with configurable path
type EventBuffer struct {
    filePath string
    maxSize  int
    mu       sync.Mutex
}

// Initialize with config-driven path
func NewEventBuffer(configPath string) *EventBuffer {
    return &EventBuffer{
        filePath: configPath,
        maxSize:  1000,
    }
}

// BufferEvent saves event to local file
func (b *EventBuffer) BufferEvent(event *models.SystemEvent) error

// GetBufferedEvents retrieves and clears buffer
func (b *EventBuffer) GetBufferedEvents() ([]*models.SystemEvent, error)
Key Features:
- ✅ Configurable buffer path (not hardcoded)
- ✅ Thread-safe (sync.Mutex)
- ✅ Circular buffer (max 1000 events)
- ✅ JSON serialization
- ✅ Automatic directory creation
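One way the two method signatures above could be filled in is sketched below. It is a minimal sketch under stated assumptions, not the agent's actual implementation: the local SystemEvent struct stands in for models.SystemEvent (its fields are assumed), and the load and save helpers are hypothetical. The whole buffer is stored as one JSON array, the oldest entries are dropped past maxSize, and writes go through a temp file plus rename so a crash mid-write does not leave a truncated buffer.

```go
package main

import (
    "encoding/json"
    "errors"
    "fmt"
    "os"
    "path/filepath"
    "sync"
)

// SystemEvent is a local stand-in for models.SystemEvent; the fields are assumed.
type SystemEvent struct {
    EventType string `json:"event_type"`
    Severity  string `json:"severity"`
    Message   string `json:"message"`
}

type EventBuffer struct {
    filePath string
    maxSize  int
    mu       sync.Mutex
}

func NewEventBuffer(path string) *EventBuffer {
    return &EventBuffer{filePath: path, maxSize: 1000}
}

// load reads all buffered events; a missing file counts as an empty buffer.
func (b *EventBuffer) load() ([]*SystemEvent, error) {
    data, err := os.ReadFile(b.filePath)
    if errors.Is(err, os.ErrNotExist) {
        return nil, nil
    }
    if err != nil {
        return nil, err
    }
    var events []*SystemEvent
    err = json.Unmarshal(data, &events)
    return events, err
}

// save creates the parent directory, writes a temp file, then renames it into place.
func (b *EventBuffer) save(events []*SystemEvent) error {
    if err := os.MkdirAll(filepath.Dir(b.filePath), 0o755); err != nil {
        return err
    }
    data, err := json.Marshal(events)
    if err != nil {
        return err
    }
    tmp := b.filePath + ".tmp"
    if err := os.WriteFile(tmp, data, 0o600); err != nil {
        return err
    }
    return os.Rename(tmp, b.filePath)
}

// BufferEvent appends an event and drops the oldest entries beyond maxSize.
func (b *EventBuffer) BufferEvent(event *SystemEvent) error {
    b.mu.Lock()
    defer b.mu.Unlock()
    events, err := b.load()
    if err != nil {
        return err
    }
    events = append(events, event)
    if len(events) > b.maxSize {
        events = events[len(events)-b.maxSize:]
    }
    return b.save(events)
}

// GetBufferedEvents returns everything buffered so far and empties the file.
func (b *EventBuffer) GetBufferedEvents() ([]*SystemEvent, error) {
    b.mu.Lock()
    defer b.mu.Unlock()
    events, err := b.load()
    if err != nil {
        return nil, err
    }
    return events, b.save(nil)
}

func main() {
    buf := NewEventBuffer(filepath.Join(os.TempDir(), "redflag-demo", "events_buffer.json"))
    if err := buf.BufferEvent(&SystemEvent{EventType: "agent_startup", Severity: "critical"}); err != nil {
        fmt.Println("buffer failed:", err)
        return
    }
    events, err := buf.GetBufferedEvents()
    fmt.Println(len(events), err)
}
```

The mutex only serializes goroutines inside one agent process. If several processes could write the same file, some form of file locking would be needed on top of this.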
2.2 Integrate Buffer into Agent Config
File: aggregator-agent/internal/config/config.go
Add to Config struct:
type Config struct {
    // ... existing fields ...
    EventBufferPath string `json:"event_buffer_path"`
}
Default value: /var/lib/redflag/events_buffer.json
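The default could be applied at load time so that existing config files without the new key keep working. The sketch below assumes a hypothetical parseConfig helper and a defaultEventBufferPath constant; the real config.Load takes different arguments and is not reproduced here.

```go
package main

import (
    "encoding/json"
    "fmt"
)

// defaultEventBufferPath is the default named in this plan.
const defaultEventBufferPath = "/var/lib/redflag/events_buffer.json"

// Config shows only the new field; the existing fields are omitted.
type Config struct {
    EventBufferPath string `json:"event_buffer_path"`
}

// parseConfig (hypothetical) decodes JSON and fills in the default buffer
// path when the key is absent or empty.
func parseConfig(data []byte) (*Config, error) {
    cfg := &Config{}
    if err := json.Unmarshal(data, cfg); err != nil {
        return nil, err
    }
    if cfg.EventBufferPath == "" {
        cfg.EventBufferPath = defaultEventBufferPath
    }
    return cfg, nil
}

func main() {
    cfg, err := parseConfig([]byte(`{}`))
    if err != nil {
        fmt.Println("parse failed:", err)
        return
    }
    fmt.Println(cfg.EventBufferPath)
}
```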
2.3 Test Buffering System
- Unit tests for buffer operations
- Test concurrent writes
- Test buffer overflow (circular behavior)
- Test file permissions and directory creation
Deliverable: Working event buffer that survives agent restarts
Phase 3: Critical Error Logging Integration (6-7 hours)
Goal: Add P0 error logging to all critical failure points
3.1 Agent Startup Failures (1 hour)
File: aggregator-agent/cmd/agent/main.go
Locations:
- Line 259-262: Config load failure
- Line 305-307: Registration failure
- Line 360-362: Runtime failure
Implementation pattern:
cfg, err := config.Load(configPath, cliFlags)
if err != nil {
    event := &models.SystemEvent{
        EventType:    "agent_startup",
        EventSubtype: "failed",
        Severity:     "critical",
        Component:    "agent",
        Message:      fmt.Sprintf("Configuration load failed: %v", err),
        Metadata: map[string]interface{}{
            "error_type":    "config_load_failed",
            "error_details": err.Error(),
            "config_path":   configPath,
        },
    }
    // Buffer before the fatal exit; a buffering failure must not mask the original error
    if bufErr := eventBuffer.BufferEvent(event); bufErr != nil {
        log.Printf("Warning: Failed to buffer startup event: %v", bufErr)
    }
    log.Fatalf("Failed to load configuration: %v", err)
}
3.2 Registration & Token Failures (1.5 hours)
File: aggregator-agent/internal/client/client.go
Locations:
- Line 121-125: Registration API failure
- Line 172-175: Token renewal failure
- Line 263-266: Command fetch failure
Implementation pattern:
if resp.StatusCode != http.StatusOK {
    bodyBytes, _ := io.ReadAll(resp.Body)
    event := &models.SystemEvent{
        EventType:    "agent_registration",
        EventSubtype: "failed",
        Severity:     "error",
        Component:    "agent",
        Message:      fmt.Sprintf("Registration failed: %s", resp.Status),
        Metadata: map[string]interface{}{
            "error_type":    "registration_failed",
            "http_status":   resp.StatusCode,
            "response_body": string(bodyBytes),
        },
    }
    eventBuffer.BufferEvent(event)
    return nil, fmt.Errorf("registration failed: %s - %s", resp.Status, string(bodyBytes))
}
3.3 Migration Failures (1 hour)
File: aggregator-agent/internal/migration/executor.go
Locations:
- Line 60-62: Backup creation failure
- Line 67-69: Directory migration failure
- Line 75-77: Configuration migration failure
- Line 96-98: Validation failure
Implementation pattern:
if err := e.createBackups(); err != nil {
    event := &models.SystemEvent{
        EventType:    "agent_migration",
        EventSubtype: "failed",
        Severity:     "error",
        Component:    "migration",
        Message:      fmt.Sprintf("Backup creation failed: %v", err),
        Metadata: map[string]interface{}{
            "error_type":     "backup_creation_failed",
            "migration_from": e.fromVersion,
            "migration_to":   e.toVersion,
        },
    }
    eventBuffer.BufferEvent(event)
    return e.completeMigration(false, fmt.Errorf("backup creation failed: %w", err))
}
3.4 Subsystem Scanner Failures (2 hours)
Files: aggregator-agent/internal/orchestrator/*.go
Circuit breaker activations:
// When circuit breaker opens
event := &models.SystemEvent{
    EventType:    "agent_scan",
    EventSubtype: "failed",
    Severity:     "warning",
    Component:    "scanner",
    Message:      fmt.Sprintf("Circuit breaker opened for %s scanner", scannerType),
    Metadata: map[string]interface{}{
        "scanner_type": scannerType,
        "error_type":   "circuit_breaker_activated",
        "failures":     failureCount,
    },
}
eventBuffer.BufferEvent(event)
Scanner timeouts:
// When scanner times out
event := &models.SystemEvent{
    EventType:    "agent_scan",
    EventSubtype: "failed",
    Severity:     "error",
    Component:    "scanner",
    Message:      fmt.Sprintf("%s scanner timed out after %v", scannerType, timeout),
    Metadata: map[string]interface{}{
        "scanner_type": scannerType,
        "error_type":   "timeout",
        "duration_ms":  duration.Milliseconds(),
    },
}
eventBuffer.BufferEvent(event)
3.5 Server-Side Auth Failures (0.5 hours)
File: aggregator-server/internal/api/handlers/agents.go
Locations:
- Line 64-67: Missing registration token
- Line 72-74: Invalid/expired token
- Line 81-84: Machine ID conflict
Implementation pattern:
if registrationToken == "" {
    event := &models.SystemEvent{
        EventType:    "server_auth",
        EventSubtype: "failed",
        Severity:     "warning",
        Component:    "security",
        Message:      "Registration attempt without token",
        Metadata: map[string]interface{}{
            "error_type": "missing_token",
            "client_ip":  c.ClientIP(),
        },
    }
    // Log a storage failure, but don't fail the request because of it
    if err := h.agentQueries.CreateSystemEvent(event); err != nil {
        log.Printf("Warning: Failed to store auth failure event: %v", err)
    }
    c.JSON(http.StatusUnauthorized, gin.H{"error": "registration token required"})
    return
}
Deliverable: All P0 errors captured. Agent-side events are buffered and reported at the next check-in; server-side auth failures are written directly to system_events.
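The agent-side patterns in 3.1 through 3.4 repeat the same construct-then-buffer steps, so a small helper could keep the call sites short. The sketch below is an illustration only: recordFailure, eventSink, memorySink, and brokenSink are hypothetical names, and the local SystemEvent struct stands in for models.SystemEvent. Buffering errors are logged rather than returned, which matches the goal of never blocking the original failure path.

```go
package main

import (
    "errors"
    "fmt"
    "log"
)

// SystemEvent mirrors the fields used in this plan; the real type lives in models.
type SystemEvent struct {
    EventType    string
    EventSubtype string
    Severity     string
    Component    string
    Message      string
    Metadata     map[string]interface{}
}

// eventSink is a hypothetical interface that *EventBuffer would satisfy.
type eventSink interface {
    BufferEvent(*SystemEvent) error
}

// recordFailure builds a "failed" event and buffers it. Buffering errors are
// logged, not returned, so the caller's own error handling is never blocked.
func recordFailure(sink eventSink, eventType, severity, component, message string, meta map[string]interface{}) *SystemEvent {
    event := &SystemEvent{
        EventType:    eventType,
        EventSubtype: "failed",
        Severity:     severity,
        Component:    component,
        Message:      message,
        Metadata:     meta,
    }
    if err := sink.BufferEvent(event); err != nil {
        log.Printf("Warning: failed to buffer %s event: %v", eventType, err)
    }
    return event
}

// memorySink keeps events in memory; it exists only for this demonstration.
type memorySink struct{ events []*SystemEvent }

func (m *memorySink) BufferEvent(e *SystemEvent) error {
    m.events = append(m.events, e)
    return nil
}

// brokenSink always fails, standing in for an unwritable buffer file.
type brokenSink struct{}

func (brokenSink) BufferEvent(*SystemEvent) error { return errors.New("disk full") }

func main() {
    loadErr := errors.New("no such file")
    sink := &memorySink{}
    recordFailure(sink, "agent_startup", "critical", "agent",
        fmt.Sprintf("Configuration load failed: %v", loadErr),
        map[string]interface{}{"error_type": "config_load_failed"})
    fmt.Println(len(sink.events), sink.events[0].Message)

    // A failing sink only produces a log line; the caller continues.
    recordFailure(brokenSink{}, "agent_scan", "warning", "scanner", "Circuit breaker opened", nil)
}
```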
Phase 4: Event Reporting Integration (2-3 hours)
Goal: Report buffered events during agent check-in
4.1 Modify Agent Check-In
File: aggregator-agent/internal/client/client.go
In CheckIn() method:
func (c *Client) CheckIn(agentID string, metrics map[string]interface{}) (*CheckInResponse, error) {
    // ... existing code ...
    // Add buffered events to request.
    // GetBufferedEvents clears the buffer, so re-buffer these events if the check-in request fails.
    bufferedEvents, err := eventBuffer.GetBufferedEvents()
    if err != nil {
        log.Printf("Warning: Failed to get buffered events: %v", err)
    }
    if len(bufferedEvents) > 0 {
        metrics["buffered_events"] = bufferedEvents
        log.Printf("Reporting %d buffered events to server", len(bufferedEvents))
    }
    // ... rest of check-in code ...
}
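Because GetBufferedEvents retrieves and clears the buffer, events drained just before a failed check-in would be lost unless they are put back. The sketch below shows one possible drain, send, and requeue flow. It is an assumption about how this could be handled, not part of the existing client: reportBuffered, eventStore, memStore, and errOffline are hypothetical names, and the send callback stands in for the real HTTP request.

```go
package main

import (
    "errors"
    "fmt"
)

// SystemEvent is a trimmed stand-in for models.SystemEvent.
type SystemEvent struct {
    EventType string
    Message   string
}

// eventStore is a hypothetical interface matching the two EventBuffer methods.
type eventStore interface {
    GetBufferedEvents() ([]*SystemEvent, error)
    BufferEvent(*SystemEvent) error
}

var errOffline = errors.New("server unreachable")

// reportBuffered drains the store and hands the events to send. If send fails,
// every drained event is buffered again so the next check-in can retry.
// It returns the number of events successfully reported.
func reportBuffered(store eventStore, send func([]*SystemEvent) error) (int, error) {
    events, err := store.GetBufferedEvents()
    if err != nil {
        return 0, fmt.Errorf("read buffer: %w", err)
    }
    if len(events) == 0 {
        return 0, nil
    }
    if sendErr := send(events); sendErr != nil {
        for _, e := range events {
            if err := store.BufferEvent(e); err != nil {
                return 0, fmt.Errorf("send failed (%v) and requeue failed: %w", sendErr, err)
            }
        }
        return 0, sendErr
    }
    return len(events), nil
}

// memStore is an in-memory store used only for this demonstration.
type memStore struct{ events []*SystemEvent }

func (m *memStore) GetBufferedEvents() ([]*SystemEvent, error) {
    out := m.events
    m.events = nil
    return out, nil
}

func (m *memStore) BufferEvent(e *SystemEvent) error {
    m.events = append(m.events, e)
    return nil
}

func main() {
    store := &memStore{}
    _ = store.BufferEvent(&SystemEvent{EventType: "agent_startup", Message: "config load failed"})

    // First attempt: server is unreachable, so the event stays buffered.
    n, err := reportBuffered(store, func([]*SystemEvent) error { return errOffline })
    fmt.Println(n, err, len(store.events))

    // Second attempt succeeds and the buffer is now empty.
    n, err = reportBuffered(store, func([]*SystemEvent) error { return nil })
    fmt.Println(n, err, len(store.events))
}
```

A requeued event lands behind anything buffered in the meantime, so the server should order events by their creation time rather than by arrival order.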
4.2 Modify Server GetCommands
File: aggregator-server/internal/api/handlers/agents.go
In GetCommands() method:
// Process buffered events from agent
if bufferedEvents, exists := metrics.Metadata["buffered_events"]; exists {
    if events, ok := bufferedEvents.([]interface{}); ok && len(events) > 0 {
        stored := 0
        for _, e := range events {
            if eventMap, ok := e.(map[string]interface{}); ok {
                event := &models.SystemEvent{
                    AgentID:      &agentID,
                    EventType:    getString(eventMap, "event_type"),
                    EventSubtype: getString(eventMap, "event_subtype"),
                    Severity:     getString(eventMap, "severity"),
                    Component:    getString(eventMap, "component"),
                    Message:      getString(eventMap, "message"),
                    Metadata:     getJSONB(eventMap, "metadata"),
                    CreatedAt:    getTime(eventMap, "created_at"),
                }
                if event.EventType != "" && event.EventSubtype != "" && event.Severity != "" {
                    if err := h.agentQueries.CreateSystemEvent(event); err != nil {
                        log.Printf("Warning: Failed to store buffered event: %v", err)
                    } else {
                        stored++
                    }
                }
            }
        }
        if stored > 0 {
            log.Printf("Stored %d buffered events from agent %s", stored, agentID)
        }
    }
}
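The snippet above relies on getString, getJSONB, and getTime, which this plan does not define. A minimal sketch of what they might look like follows. The signatures are assumptions: getJSONB returns a plain map here because the real JSONB type is not shown, and getTime assumes the agent serializes created_at as an RFC 3339 string, which is what encoding/json produces for time.Time.

```go
package main

import (
    "fmt"
    "time"
)

// getString returns the value at key if it is a string, otherwise "".
func getString(m map[string]interface{}, key string) string {
    s, _ := m[key].(string)
    return s
}

// getJSONB returns the nested object at key, or an empty map if it is
// missing or not an object. The return type is an assumption.
func getJSONB(m map[string]interface{}, key string) map[string]interface{} {
    if v, ok := m[key].(map[string]interface{}); ok {
        return v
    }
    return map[string]interface{}{}
}

// getTime parses an RFC 3339 timestamp at key. If the value is missing or
// malformed it falls back to the current time so the event is still stored.
func getTime(m map[string]interface{}, key string) time.Time {
    if s, ok := m[key].(string); ok {
        if t, err := time.Parse(time.RFC3339Nano, s); err == nil {
            return t
        }
    }
    return time.Now().UTC()
}

func main() {
    eventMap := map[string]interface{}{
        "event_type": "agent_startup",
        "severity":   "critical",
        "created_at": "2024-01-02T03:04:05Z",
        "metadata":   map[string]interface{}{"error_type": "config_load_failed"},
    }
    fmt.Println(getString(eventMap, "event_type"))
    fmt.Println(getJSONB(eventMap, "metadata")["error_type"])
    fmt.Println(getTime(eventMap, "created_at").Year())
}
```

Returning zero values instead of errors keeps the ingestion loop simple, and the existing check for empty EventType, EventSubtype, and Severity then rejects malformed events.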
4.3 Test End-to-End Flow
- Simulate agent startup failure → Verify event buffered
- Start agent → Verify event reported in next check-in
- Check server database → Verify event stored in system_events table
- Test offline scenario → Verify events survive agent restart
- Test multiple failures → Verify all events reported
Deliverable: Complete PULL ONLY event reporting pipeline
Phase 5: UI Integration (2-3 hours) - Optional for v0.2.0
Goal: Display critical errors in UI (can be v0.2.1)
5.1 Create Event History API Endpoint
File: aggregator-server/internal/api/handlers/events.go
// GetAgentEvents handles GET /api/v1/agents/:id/events
func (h *EventHandler) GetAgentEvents(c *gin.Context) {
    agentID := c.Param("id")
    // Query parameters
    limit := 50
    if l := c.Query("limit"); l != "" {
        if parsed, err := strconv.Atoi(l); err == nil && parsed > 0 && parsed <= 1000 {
            limit = parsed
        }
    }
    severity := c.Query("severity") // "error,critical" filter
    events, err := h.agentQueries.GetSystemEvents(agentID, severity, limit)
    if err != nil {
        c.JSON(http.StatusInternalServerError, gin.H{"error": "failed to retrieve events"})
        return
    }
    c.JSON(http.StatusOK, gin.H{
        "events": events,
        "total":  len(events),
    })
}
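The severity parameter arrives as a comma-separated string such as "error,critical", so it needs to be split and validated before it reaches the query. A possible sketch follows. parseSeverityFilter is a hypothetical helper, and the set of accepted severities is assumed from the values used elsewhere in this plan (warning, error, critical).

```go
package main

import (
    "fmt"
    "strings"
)

// knownSeverities lists the severity values that appear in this plan.
// The real set may include others; treat it as an assumption.
var knownSeverities = map[string]bool{
    "warning":  true,
    "error":    true,
    "critical": true,
}

// parseSeverityFilter splits a comma-separated filter, normalizes case and
// whitespace, removes duplicates, and rejects unknown values.
// An empty input means "no filter" and returns a nil slice.
func parseSeverityFilter(raw string) ([]string, error) {
    seen := map[string]bool{}
    var out []string
    for _, part := range strings.Split(raw, ",") {
        s := strings.ToLower(strings.TrimSpace(part))
        if s == "" {
            continue
        }
        if !knownSeverities[s] {
            return nil, fmt.Errorf("unknown severity %q", s)
        }
        if !seen[s] {
            seen[s] = true
            out = append(out, s)
        }
    }
    return out, nil
}

func main() {
    fmt.Println(parseSeverityFilter("error,critical"))
    fmt.Println(parseSeverityFilter("bogus"))
}
```

With a validated slice, the handler can answer an unknown severity with a 400 response, and the query layer can bind the values as parameters instead of interpolating the raw string.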
5.2 Add UI Polling
File: aggregator-web/src/hooks/useAgentEvents.ts
import { useEffect, useState } from 'react';

// Poll for agent events every 30 seconds
export const useAgentEvents = (agentId: string) => {
  const [events, setEvents] = useState<SystemEvent[]>([]);
  useEffect(() => {
    const fetchEvents = async () => {
      try {
        const data = await api.get(`/agents/${agentId}/events?severity=error,critical`);
        setEvents(data.events);
      } catch (err) {
        // Keep the last known events; the next poll retries
        console.error('Failed to fetch agent events', err);
      }
    };
    // Initial fetch
    fetchEvents();
    // Poll every 30 seconds
    const interval = setInterval(fetchEvents, 30000);
    return () => clearInterval(interval);
  }, [agentId]);
  return events;
};
Testing Checklist
Unit Tests
- Event buffer concurrent writes
- Buffer overflow behavior (circular)
- Event serialization/deserialization
- GetBufferedEvents clears buffer
Integration Tests
- Startup failure → event buffered → event reported → event stored
- Registration failure → event appears in UI within 60 seconds
- Token renewal failure → event logged → admin notified
- Offline scenario → events survive restart → all reported when online
- Multiple subsystem failures → all events captured with correct context
Manual Tests
- Kill agent process mid-scan → verify event appears in UI
- Use expired registration token → verify security event logged
- Disconnect network during token renewal → verify event buffered
- Trigger migration failure → verify event reported
Success Criteria
Must Have for v0.2.0
- All 4 P0 error types logged (startup, registration, token, migration)
- Events survive agent restart (buffered to disk)
- Events reported within 1-2 check-in cycles (30-60 seconds)
- PULL ONLY architecture (no WebSockets)
- Server-side auth failures logged
- Subsystem context captured in event metadata
Should Have for v0.2.0
- Subsystem scanner failures logged
- Basic UI displays critical errors
- Event buffer path configurable (not hardcoded)
Can Wait for v0.2.1
- Full event history UI with filtering
- Success events logged
- Event analytics and metrics
Risk Mitigation
| Risk | Mitigation |
|---|---|
| Agent can't write buffer file | Log the failure to stdout and continue; don't block startup |
| Buffer file grows too large | Circular buffer (max 1000 events), old events dropped |
| Server overwhelmed with events | Rate limiting in event ingestion, backpressure handling |
| Sensitive data in metadata | Sanitize before buffering, exclude secrets/tokens |
| Events lost during crash | Write buffer before fatal exit, fsync if possible |
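The "sanitize before buffering" mitigation could be implemented as a key-based redaction pass over event metadata. The sketch below is one possible approach under stated assumptions: sanitizeMetadata, isSensitiveKey, and the list of sensitive key fragments are hypothetical, and the pass only inspects keys, so secrets embedded in free-text values such as response_body would need separate handling.

```go
package main

import (
    "fmt"
    "strings"
)

// sensitiveKeyParts is an assumed list of substrings that mark a key as secret.
var sensitiveKeyParts = []string{"token", "secret", "password", "authorization"}

// isSensitiveKey reports whether key contains any sensitive fragment,
// ignoring case.
func isSensitiveKey(key string) bool {
    lower := strings.ToLower(key)
    for _, part := range sensitiveKeyParts {
        if strings.Contains(lower, part) {
            return true
        }
    }
    return false
}

// sanitizeMetadata returns a copy of meta with sensitive values replaced by
// "[REDACTED]". Nested maps are sanitized recursively; the input is not modified.
func sanitizeMetadata(meta map[string]interface{}) map[string]interface{} {
    clean := make(map[string]interface{}, len(meta))
    for k, v := range meta {
        switch {
        case isSensitiveKey(k):
            clean[k] = "[REDACTED]"
        default:
            if nested, ok := v.(map[string]interface{}); ok {
                clean[k] = sanitizeMetadata(nested)
            } else {
                clean[k] = v
            }
        }
    }
    return clean
}

func main() {
    meta := map[string]interface{}{
        "error_type":         "registration_failed",
        "http_status":        401,
        "registration_token": "abc123",
    }
    clean := sanitizeMetadata(meta)
    fmt.Println(clean["error_type"], clean["http_status"], clean["registration_token"])
}
```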
Timeline Estimate
Total: 15-17 hours over 2-3 sessions
Session 1 (5-6 hours):
- Phase 1: Foundation verification (2 hours)
- Phase 2: Event buffering system (3-4 hours)
Session 2 (6-7 hours):
- Phase 3: Critical error integration (6-7 hours)
Session 3 (4-5 hours):
- Phase 4: Event reporting integration (2-3 hours)
- Testing and polish (2 hours)
Next Steps
- Verify current state (Run migration, check subsystem table)
- Implement event buffering (Create buffer.go package)
- Add error logging to critical failure points
- Test end-to-end flow
- Document and ship v0.2.0
Decision Point: Do we want to include subsystem scanner failures in v0.2.0 P0 scope, or push to v0.2.1? (Adds ~3 hours)