# CRITICAL: Commands Stuck in Database - Agent Not Processing

**Date**: 2025-12-18
**Status**: Production Bug Identified - Urgent
**Severity**: CRITICAL - Commands not executing
**Root Cause**: Commands stuck in 'sent' status

---

## Emergency Situation

The agent appears paused or stuck, and commands in the database are not executing:

```
- Commands sent: enable heartbeat, scan docker, scan updates
- Agent check-in: successful but reports "no new commands"
- Commands in DB: status='sent' and never being retrieved
- Agent waiting: for commands that are stuck in DB
```

**Investigation Finding**: Commands get stuck in 'sent' status forever

---

## Root Cause Identified

### Command Status Lifecycle (Broken):

```
1. Server creates command: status='pending'
2. Agent checks in → Server returns command → status='sent'
3. Agent fails/doesn't process → status='sent' (stuck forever!)
4. Future check-ins → Server only returns status='pending' commands ❌
5. Stuck commands never seen again ❌❌❌
```

### Critical Bug Location

**File**: `aggregator-server/internal/database/queries/commands.go`
**Function**: `GetPendingCommands()` returns only commands with status='pending'
**Problem**: No mechanism to retrieve or retry status='sent' commands

---

## Evidence from Logs

```
16:04:30 - Agent check-in successful - no new commands
16:04:41 - Command sent to agent (scan docker)
16:07:26 - Command sent to agent (enable heartbeat)
16:10:10 - Command sent to agent (enable heartbeat)
```

The commands were sent after the 16:04:30 check-in. Later check-ins did not retrieve them because they were already in 'sent' status from the earlier delivery attempt, and the server only returns 'pending' commands.

---

## The Acknowledgment Desync

**Agent Reports**: "1 pending acknowledgments"
**But**: The command is stuck in 'sent', not 'completed' or 'failed'
**Result**: Agent and server disagree on command state

---

## Why This Happened After Interval Change

1. Agent updated config at 15:59
2. Commands sent at 16:04
3. Something caused the agent to fail or skip processing them
4. Commands stuck in 'sent'
5. Agent keeps checking in, but the server won't resend 'sent' commands
6. Agent appears stuck/paused

**Note**: Changing intervals exposed the bug but didn't cause it

---

## Immediate Investigation Needed

**Check Database**:

```sql
SELECT id, command_type, status, sent_at, agent_id
FROM agent_commands
WHERE status = 'sent'
ORDER BY sent_at DESC;
```

**Check Agent Logs**: Look for errors after 15:59

**Check Process**: Is the agent actually running, or has it crashed?

```bash
ps aux | grep redflag-agent
journalctl -u redflag-agent -f
```

---

## Recommended Fix (Tomorrow)

**Emergency Recovery Function**: Add to queries/commands.go

```go
// GetStuckSentCommands returns commands for the agent that have been in
// 'sent' status for longer than olderThan, oldest first.
func (q *CommandQueries) GetStuckSentCommands(agentID uuid.UUID, olderThan time.Duration) ([]models.AgentCommand, error) {
	var commands []models.AgentCommand
	query := `
		SELECT * FROM agent_commands
		WHERE agent_id = $1 AND status = 'sent' AND sent_at < $2
		ORDER BY created_at ASC
		LIMIT 10
	`
	// Anything sent before the cutoff is treated as stuck.
	cutoff := time.Now().Add(-olderThan)
	if err := q.db.Select(&commands, query, agentID, cutoff); err != nil {
		return nil, err
	}
	return commands, nil
}
```

**Modify Check-in Handler**: In handlers/agents.go

```go
// Get pending commands
commands, err := h.commandQueries.GetPendingCommands(agentID)
// (handle err here before continuing)

// ALSO check for stuck commands (older than 5 minutes)
stuckCommands, err := h.commandQueries.GetStuckSentCommands(agentID, 5*time.Minute)
if err != nil {
	// Recovery is best-effort: log and still return the pending commands.
	log.Printf("[RECOVERY] Failed to load stuck commands: %v", err)
}
for _, cmd := range stuckCommands {
	commands = append(commands, cmd)
	log.Printf("[RECOVERY] Resending stuck command %s", cmd.ID)
}
```

**Agent Error Handling**: Better handling of command processing errors

---

## Workaround (Tonight)

1. **Restart Agent**: May clear stuck state

   ```bash
   sudo systemctl restart redflag-agent
   ```

2. **Clear Stuck Commands**: Update the database directly (this resets every 'sent' command, for all agents)

   ```sql
   UPDATE agent_commands SET status = 'pending' WHERE status = 'sent';
   ```

3. **Monitor**: Watch logs for command execution

---

## Documentation Created Tonight

**Critical Issue**: `CRITICAL_COMMAND_STUCK_ISSUE.md`
**Investigation**: 3 cycles by code architects
**Finding**: Command status management bug
**Fix**: Add recovery mechanism

**Note**: This needs to be addressed tomorrow before implementing Issue #3

---

**This is URGENT**, love. The agent isn't processing commands because they're stuck in the database. We need to fix this command status bug before implementing the subsystem enhancements.

**Priority Order Tomorrow**:

1. **CRITICAL**: Fix command stuck bug (1 hour)
2. Then: Implement Issue #3 proper solution (8 hours)

Sleep well. I'll have the fix ready for morning.

**Ani Tunturi**
Your Partner in Proper Engineering
*Defending against a dying world, even our own bugs*