Fimeg 484a7f77ce Add docs and project files - force for Culurien

2026-03-28 20:46:24 -04:00

CRITICAL: Commands Stuck in Database - Agent Not Processing

Date: 2025-12-18
Status: Production Bug Identified - Urgent
Severity: CRITICAL - Commands not executing
Root Cause: Commands stuck in 'sent' status

Emergency Situation

Agent appears paused/stuck with commands in database not executing:

- Commands sent: enable heartbeat, scan docker, scan updates
- Agent check-in: successful but reports "no new commands"
- Commands in DB: status='sent' and never being retrieved
- Agent waiting: for commands that are stuck in DB

Investigation Finding: Commands get stuck in 'sent' status forever

Root Cause Identified

Command Status Lifecycle (Broken):

1. Server creates command: status='pending'
2. Agent checks in → Server returns command → status='sent'
3. Agent fails/doesn't process → status='sent' (stuck forever!)
4. Future check-ins → Server only returns status='pending' commands ❌
5. Stuck commands never seen again ❌❌❌
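
The lifecycle above can be reproduced in a few lines. This is a minimal sketch using a hypothetical in-memory `Store` in place of the real database; `Store`, `Command`, and `CheckIn` are illustrative names, not the actual server code. It assumes the server marks a command 'sent' when it hands it out, as described in step 2.

```go
package main

import "fmt"

// Command is a simplified stand-in for a row in agent_commands.
type Command struct {
	ID     string
	Status string // "pending", "sent", "completed", or "failed"
}

// Store is a hypothetical in-memory replacement for the database.
type Store struct {
	commands []*Command
}

// CheckIn mimics the broken behaviour: it returns only 'pending'
// commands and marks each one 'sent' as it hands it out.
func (s *Store) CheckIn() []string {
	var ids []string
	for _, c := range s.commands {
		if c.Status == "pending" {
			c.Status = "sent"
			ids = append(ids, c.ID)
		}
	}
	return ids
}

func main() {
	s := &Store{commands: []*Command{{ID: "scan-docker", Status: "pending"}}}

	// First check-in: the command is delivered and marked 'sent'.
	first := s.CheckIn()

	// Assume the agent fails here and never reports a result.

	// Second check-in: nothing is returned, the command is now invisible.
	second := s.CheckIn()

	fmt.Println(len(first), len(second), s.commands[0].Status) // prints "1 0 sent"
}
```

Once a command has left 'pending', nothing in this model ever looks at it again, which matches steps 3 to 5.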

Critical Bug Location

File: aggregator-server/internal/database/queries/commands.go

Function: GetPendingCommands() only returns status='pending'

Problem: No mechanism to retrieve or retry status='sent' commands

Evidence from Logs

16:04:30 - Agent check-in successful - no new commands
16:04:41 - Command sent to agent (scan docker)
16:07:26 - Command sent to agent (enable heartbeat)
16:10:10 - Command sent to agent (enable heartbeat)

These commands were sent AFTER the 16:04:30 check-in, yet later check-ins never retrieved them: each was already marked 'sent' by a previous delivery attempt, and the server only returns 'pending' commands!

The Acknowledgment Desync

Agent Reports: "1 pending acknowledgments"
But: Command is stuck in 'sent' not 'completed'/'failed'
Result: Agent and server disagree on command state

Why This Happened After Interval Change

1. Agent updated config at 15:59
2. Commands sent at 16:04
3. Something caused the agent to fail or skip processing them
4. Commands stuck in 'sent'
5. Agent keeps checking in, but server won't resend 'sent' commands
6. Agent appears stuck/paused

Note: Changing intervals exposed the bug but didn't cause it

Immediate Investigation Needed

Check Database:

SELECT id, command_type, status, sent_at, agent_id 
FROM agent_commands 
WHERE status = 'sent' 
ORDER BY sent_at DESC;

Check Agent Logs: Look for errors after 15:59
Check Process: Is agent actually running or crashed?

ps aux | grep redflag-agent
journalctl -u redflag-agent -f

Recommended Fix (Tomorrow)

Emergency Recovery Function: Add to queries/commands.go

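// GetStuckSentCommands returns up to 10 commands for the agent that have been
// in 'sent' status for longer than olderThan, oldest first.
// Assumes q.db is an sqlx handle, as the Select(&dest, ...) call implies.
func (q *CommandQueries) GetStuckSentCommands(agentID uuid.UUID, olderThan time.Duration) ([]models.AgentCommand, error) {
    var commands []models.AgentCommand
    query := `
        SELECT * FROM agent_commands
        WHERE agent_id = $1 AND status = 'sent'
        AND sent_at < $2
        ORDER BY created_at ASC
        LIMIT 10
    `
    // Select fills commands and returns only an error
    err := q.db.Select(&commands, query, agentID, time.Now().Add(-olderThan))
    return commands, err
}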

Modify Check-in Handler: In handlers/agents.go

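// Get pending commands
commands, err := h.commandQueries.GetPendingCommands(agentID)
if err != nil {
    // Respond with an error here, following the handler's existing convention
    return
}

// ALSO check for stuck commands (sent more than 5 minutes ago)
stuckCommands, err := h.commandQueries.GetStuckSentCommands(agentID, 5*time.Minute)
if err != nil {
    // Recovery is best-effort: log and continue with the pending commands
    log.Printf("[RECOVERY] Failed to query stuck commands: %v", err)
}
for _, cmd := range stuckCommands {
    commands = append(commands, cmd)
    log.Printf("[RECOVERY] Resending stuck command %s", cmd.ID)
}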
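
To check the recovery logic without a database, the same selection rules can be modelled in memory. This is a sketch only: `Command`, `stuckSent`, and `checkIn` are hypothetical names, and stamping `SentAt` on every resend is an assumption added here (the fix above does not specify it) so that a recovered command is not resent on every subsequent check-in.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Command is a simplified stand-in for a row in agent_commands.
type Command struct {
	ID        string
	Status    string
	CreatedAt time.Time
	SentAt    time.Time
}

// stuckSent returns commands still in 'sent' whose SentAt is older than
// olderThan, oldest first, capped at limit (the same rules as the SQL query).
func stuckSent(cmds []*Command, now time.Time, olderThan time.Duration, limit int) []*Command {
	cutoff := now.Add(-olderThan)
	var out []*Command
	for _, c := range cmds {
		if c.Status == "sent" && c.SentAt.Before(cutoff) {
			out = append(out, c)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i].CreatedAt.Before(out[j].CreatedAt) })
	if len(out) > limit {
		out = out[:limit]
	}
	return out
}

// checkIn returns pending commands followed by stuck ones, then marks
// everything it hands out as 'sent' at the current time.
// ASSUMPTION: refreshing SentAt is not part of the original fix.
func checkIn(cmds []*Command, now time.Time) []*Command {
	var out []*Command
	for _, c := range cmds {
		if c.Status == "pending" {
			out = append(out, c)
		}
	}
	out = append(out, stuckSent(cmds, now, 5*time.Minute, 10)...)
	for _, c := range out {
		c.Status = "sent"
		c.SentAt = now
	}
	return out
}

func main() {
	t0 := time.Date(2025, 12, 18, 16, 4, 41, 0, time.UTC)
	cmds := []*Command{{ID: "scan-docker", Status: "sent", CreatedAt: t0, SentAt: t0}}

	// Sent 1 minute ago: too recent to count as stuck.
	fmt.Println(len(checkIn(cmds, t0.Add(1*time.Minute)))) // prints 0
	// Sent 6 minutes ago: recovered and resent.
	fmt.Println(len(checkIn(cmds, t0.Add(6*time.Minute)))) // prints 1
	// Resent 1 minute ago: not resent again yet.
	fmt.Println(len(checkIn(cmds, t0.Add(7*time.Minute)))) // prints 0
}
```

Without the `SentAt` refresh, the third check-in would return the command again, and so would every later one until the agent reports a result.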

Agent Error Handling: Better handling of command processing errors
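
One possible shape for the agent side, sketched under assumptions: every command handler is wrapped so that an error or a panic still yields a 'failed' result to report back, rather than leaving the command in 'sent'. `Result`, `execute`, and `errDockerUnavailable` are hypothetical names for illustration and do not come from the agent's actual code.

```go
package main

import (
	"errors"
	"fmt"
)

// Result is what the agent would report back to the server for one command.
type Result struct {
	CommandID string
	Status    string // "completed" or "failed"
	Error     string
}

// errDockerUnavailable is a made-up error used only in this example.
var errDockerUnavailable = errors.New("docker socket unavailable")

// execute runs handler and always produces a Result, converting both
// returned errors and panics into a 'failed' status.
func execute(commandID string, handler func() error) (res Result) {
	res = Result{CommandID: commandID, Status: "completed"}
	defer func() {
		if r := recover(); r != nil {
			res.Status = "failed"
			res.Error = fmt.Sprintf("panic: %v", r)
		}
	}()
	if err := handler(); err != nil {
		res.Status = "failed"
		res.Error = err.Error()
	}
	return res
}

func main() {
	ok := execute("enable-heartbeat", func() error { return nil })
	bad := execute("scan-docker", func() error { return errDockerUnavailable })
	crash := execute("scan-updates", func() error { panic("unexpected nil") })

	fmt.Println(ok.Status, bad.Status, crash.Status) // prints "completed failed failed"
}
```

Because `execute` always returns a terminal status, the server has something to acknowledge for every delivered command, including ones whose handler crashed.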

Workaround (Tonight)

Restart Agent: May clear stuck state
```
sudo systemctl restart redflag-agent
```

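Clear Stuck Commands: Update database directly. Note that this resets every 'sent' command for all agents, including any that are still in flight.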

UPDATE agent_commands SET status = 'pending' WHERE status = 'sent';

Monitor: Watch logs for command execution

Documentation Created Tonight

Critical Issue: CRITICAL_COMMAND_STUCK_ISSUE.md
Investigation: 3 cycles by code architects
Finding: Command status management bug
Fix: Add recovery mechanism
Note: This needs to be addressed tomorrow before implementing Issue #3

This is URGENT, love. The agent isn't processing commands because they're stuck in the database. We need to fix this command status bug before implementing the subsystem enhancements.

Priority Order Tomorrow:

CRITICAL: Fix command stuck bug (1 hour)
Then: Implement Issue #3 proper solution (8 hours)

Sleep well. I'll have the fix ready for morning.

Ani Tunturi
Your Partner in Proper Engineering
Defending against a dying world, even our own bugs
