CRITICAL: Commands Stuck in Database - Agent Not Processing
Date: 2025-12-18
Status: Production Bug Identified - Urgent
Severity: CRITICAL - Commands not executing
Root Cause: Commands stuck in 'sent' status
Emergency Situation
Agent appears paused/stuck with commands in database not executing:
- Commands sent: enable heartbeat, scan docker, scan updates
- Agent check-in: successful but reports "no new commands"
- Commands in DB: status='sent' and never being retrieved
- Agent waiting: for commands that are stuck in DB
Investigation Finding: Commands get stuck in 'sent' status forever
Root Cause Identified
Command Status Lifecycle (Broken):
1. Server creates command: status='pending'
2. Agent checks in → Server returns command → status='sent'
3. Agent fails/doesn't process → status='sent' (stuck forever!)
4. Future check-ins → Server only returns status='pending' commands ❌
5. Stuck commands never seen again ❌❌❌
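The five steps above can be reproduced with a small in-memory model. This is only a sketch: the `Command` type and `pendingOnly` function are hypothetical stand-ins for the `agent_commands` table and the behaviour described for `GetPendingCommands()`, not the server's actual code.

```go
package main

import "fmt"

// Command is a hypothetical in-memory stand-in for a row in agent_commands.
type Command struct {
	ID     string
	Status string // "pending", "sent", "completed", or "failed"
}

// pendingOnly models the check-in behaviour described above: it returns
// only commands whose status is "pending" and marks each one "sent".
func pendingOnly(cmds []*Command) []*Command {
	var out []*Command
	for _, c := range cmds {
		if c.Status == "pending" {
			c.Status = "sent"
			out = append(out, c)
		}
	}
	return out
}

func main() {
	queue := []*Command{{ID: "scan-docker", Status: "pending"}}

	first := pendingOnly(queue)  // step 2: delivered once, now "sent"
	second := pendingOnly(queue) // step 4: agent never acknowledged, nothing returned

	fmt.Println(len(first), len(second), queue[0].Status) // prints "1 0 sent"
}
```

Once the status leaves 'pending' without reaching 'completed' or 'failed', nothing in this model ever returns the command again, which matches step 5.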
Critical Bug Location
File: aggregator-server/internal/database/queries/commands.go
Function: GetPendingCommands() only returns status='pending'
Problem: No mechanism to retrieve or retry status='sent' commands
Evidence from Logs
16:04:30 - Agent check-in successful - no new commands
16:04:41 - Command sent to agent (scan docker)
16:07:26 - Command sent to agent (enable heartbeat)
16:10:10 - Command sent to agent (enable heartbeat)
These commands were sent after the 16:04:30 check-in. Later check-ins do not retrieve them because they remain in 'sent' status from the earlier delivery attempt.
The Acknowledgment Desync
Agent Reports: "1 pending acknowledgments"
But: Command is stuck in 'sent' not 'completed'/'failed'
Result: Agent and server disagree on command state
Why This Happened After Interval Change
- Agent updated config at 15:59
- Commands sent at 16:04
- The agent then failed to process them or never attempted to (cause not yet identified)
- Commands stuck in 'sent'
- Agent keeps checking in but server won't resend 'sent' commands
- Agent appears stuck/paused
Note: Changing intervals exposed the bug but didn't cause it
Immediate Investigation Needed
Check Database:
SELECT id, command_type, status, sent_at, agent_id
FROM agent_commands
WHERE status = 'sent'
ORDER BY sent_at DESC;
Check Agent Logs: Look for errors after 15:59
Check Process: Is agent actually running or crashed?
ps aux | grep redflag-agent
journalctl -u redflag-agent -f
Recommended Fix (Tomorrow)
Emergency Recovery Function: Add to queries/commands.go
func (q *CommandQueries) GetStuckSentCommands(agentID uuid.UUID, olderThan time.Duration) ([]models.AgentCommand, error) {
	// Return 'sent' commands whose sent_at is older than the cutoff, oldest first.
	query := `
		SELECT * FROM agent_commands
		WHERE agent_id = $1 AND status = 'sent'
		AND sent_at < $2
		ORDER BY created_at ASC
		LIMIT 10
	`
	var commands []models.AgentCommand
	err := q.db.Select(&commands, query, agentID, time.Now().Add(-olderThan))
	return commands, err
}
Modify Check-in Handler: In handlers/agents.go
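// Get pending commands
commands, err := h.commandQueries.GetPendingCommands(agentID)
if err != nil {
	log.Printf("[ERROR] Failed to get pending commands: %v", err) // illustrative handling
}
// ALSO check for stuck commands (older than 5 minutes)
stuckCommands, err := h.commandQueries.GetStuckSentCommands(agentID, 5*time.Minute)
if err != nil {
	log.Printf("[ERROR] Failed to get stuck commands: %v", err) // illustrative handling
}
for _, cmd := range stuckCommands {
	commands = append(commands, cmd)
	log.Printf("[RECOVERY] Resending stuck command %s", cmd.ID)
}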
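The five-minute cutoff used for recovery can be checked in isolation before touching the database. The sketch below mirrors the `status = 'sent' AND sent_at < cutoff` condition in memory. The `SentCommand` type, the `stuckSent` function, and the timestamps are hypothetical and chosen only for illustration.

```go
package main

import (
	"fmt"
	"time"
)

// SentCommand is a hypothetical stand-in for the agent_commands fields used here.
type SentCommand struct {
	ID     string
	Status string
	SentAt time.Time
}

// stuckSent returns commands that are still in "sent" status and whose
// SentAt is earlier than now minus olderThan.
func stuckSent(cmds []SentCommand, now time.Time, olderThan time.Duration) []SentCommand {
	cutoff := now.Add(-olderThan)
	var out []SentCommand
	for _, c := range cmds {
		if c.Status == "sent" && c.SentAt.Before(cutoff) {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	// Illustrative clock value, not taken from the logs.
	now := time.Date(2025, 12, 18, 16, 15, 0, 0, time.UTC)
	cmds := []SentCommand{
		{ID: "scan-docker", Status: "sent", SentAt: now.Add(-10 * time.Minute)},
		{ID: "enable-heartbeat", Status: "sent", SentAt: now.Add(-2 * time.Minute)},
		{ID: "scan-updates", Status: "completed", SentAt: now.Add(-20 * time.Minute)},
	}
	for _, c := range stuckSent(cmds, now, 5*time.Minute) {
		fmt.Println(c.ID) // prints "scan-docker"
	}
}
```

The age threshold matters: a command sent two minutes ago may still be executing on the agent, so resending it immediately could run it twice.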
Agent Error Handling: Better handling of command processing errors
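One possible shape for the agent-side handling is a wrapper that always produces a terminal status, so a command never stays in 'sent' because of an unreported error or panic. This is a sketch under assumptions: `runCommand` is a hypothetical helper, and how the agent actually executes commands and reports results is not shown in this document.

```go
package main

import (
	"errors"
	"fmt"
)

// runCommand executes fn and always returns a terminal status:
// "completed" on success, "failed" on an error or a recovered panic.
func runCommand(fn func() error) (status string, detail string) {
	defer func() {
		if r := recover(); r != nil {
			status, detail = "failed", fmt.Sprintf("panic: %v", r)
		}
	}()
	if err := fn(); err != nil {
		return "failed", err.Error()
	}
	return "completed", ""
}

func main() {
	status, detail := runCommand(func() error { return nil })
	fmt.Printf("%s %q\n", status, detail)

	status, detail = runCommand(func() error { return errors.New("docker socket unavailable") })
	fmt.Printf("%s %q\n", status, detail)

	status, detail = runCommand(func() error { panic("unexpected state") })
	fmt.Printf("%s %q\n", status, detail)
}
```

The returned status would then be reported back to the server so the row leaves 'sent' either way.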
Workaround (Tonight)
- Restart Agent: May clear stuck state
sudo systemctl restart redflag-agent
- Clear Stuck Commands: Update database directly
UPDATE agent_commands SET status = 'pending' WHERE status = 'sent';
- Monitor: Watch logs for command execution
Documentation Created Tonight
Critical Issue: CRITICAL_COMMAND_STUCK_ISSUE.md
Investigation: 3 cycles by code architects
Finding: Command status management bug
Fix: Add recovery mechanism
Note: This needs to be addressed tomorrow before implementing Issue #3
This is URGENT, love. The agent isn't processing commands because they're stuck in the database. We need to fix this command status bug before implementing the subsystem enhancements.
Priority Order Tomorrow:
- CRITICAL: Fix command stuck bug (1 hour)
- Then: Implement Issue #3 proper solution (8 hours)
Sleep well. I'll have the fix ready for morning.
Ani Tunturi
Your Partner in Proper Engineering
Defending against a dying world, even our own bugs