# CRITICAL: Commands Stuck in Database - Agent Not Processing
**Date**: 2025-12-18
**Status**: Production Bug Identified - Urgent
**Severity**: CRITICAL - Commands not executing
**Root Cause**: Commands stuck in 'sent' status

---
## Emergency Situation
The agent appears paused or stuck, and commands in the database are not executing:
```
- Commands sent: enable heartbeat, scan docker, scan updates
- Agent check-in: successful but reports "no new commands"
- Commands in DB: status='sent' and never being retrieved
- Agent waiting: for commands that are stuck in DB
```
**Investigation Finding**: Commands get stuck in 'sent' status forever

---
## Root Cause Identified
### Command Status Lifecycle (Broken):
```
1. Server creates command: status='pending'
2. Agent checks in → Server returns command → status='sent'
3. Agent fails/doesn't process → status='sent' (stuck forever!)
4. Future check-ins → Server only returns status='pending' commands ❌
5. Stuck commands never seen again ❌❌❌
```
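The broken lifecycle above can be reproduced in a few lines. This is a minimal in-memory sketch, not the server's actual code: `Command`, `Store`, and `CheckIn` are hypothetical stand-ins, and the assumption is that a check-in delivers only 'pending' commands and marks each one 'sent'.

```go
package main

import "fmt"

// Command is a simplified, hypothetical stand-in for a row in agent_commands.
type Command struct {
	ID     string
	Status string // "pending", "sent", "completed", or "failed"
}

// Store is an in-memory stand-in for the commands table.
type Store struct {
	commands []*Command
}

// CheckIn models the behaviour described above (an assumption about the
// server): only status='pending' commands are returned, and each returned
// command is marked 'sent'.
func (s *Store) CheckIn() []*Command {
	var out []*Command
	for _, c := range s.commands {
		if c.Status == "pending" {
			c.Status = "sent"
			out = append(out, c)
		}
	}
	return out
}

func main() {
	store := &Store{commands: []*Command{{ID: "scan-docker", Status: "pending"}}}

	first := store.CheckIn() // command delivered, now 'sent'
	// Assume the agent fails here and never reports a result.
	second := store.CheckIn() // nothing returned: the command is no longer 'pending'

	// Prints: 1 0 sent
	fmt.Println(len(first), len(second), store.commands[0].Status)
}
```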
### Critical Bug Location
**File**: `aggregator-server/internal/database/queries/commands.go`
**Function**: `GetPendingCommands()` only returns status='pending'
**Problem**: No mechanism to retrieve or retry status='sent' commands

---
## Evidence from Logs
```
16:04:30 - Agent check-in successful - no new commands
16:04:41 - Command sent to agent (scan docker)
16:07:26 - Command sent to agent (enable heartbeat)
16:10:10 - Command sent to agent (enable heartbeat)
```
These commands were sent after the 16:04:30 check-in. They are not retrieved on later check-ins because they remain in 'sent' status from the previous delivery attempt.

---
## The Acknowledgment Desync
**Agent Reports**: "1 pending acknowledgments"
**But**: Command is stuck in 'sent' not 'completed'/'failed'
**Result**: Agent and server disagree on command state

---
## Why This Happened After Interval Change
1. Agent updated config at 15:59
2. Commands sent at 16:04
3. Something caused the agent to fail or to skip processing the commands
4. Commands stuck in 'sent'
5. Agent keeps checking in but server won't resend 'sent' commands
6. Agent appears stuck/paused

**Note**: Changing intervals exposed the bug but didn't cause it

---
## Immediate Investigation Needed
**Check Database**:
```sql
SELECT id, command_type, status, sent_at, agent_id
FROM agent_commands
WHERE status = 'sent'
ORDER BY sent_at DESC;
```
**Check Agent Logs**: Look for errors after 15:59
**Check Process**: Is agent actually running or crashed?
```bash
ps aux | grep redflag-agent
journalctl -u redflag-agent -f
```
---
## Recommended Fix (Tomorrow)
**Emergency Recovery Function**: Add to queries/commands.go
```go
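// GetStuckSentCommands returns up to 10 commands for agentID that have been
// in 'sent' status for longer than olderThan, oldest first.
func (q *CommandQueries) GetStuckSentCommands(agentID uuid.UUID, olderThan time.Duration) ([]models.AgentCommand, error) {
	var commands []models.AgentCommand
	query := `
		SELECT * FROM agent_commands
		WHERE agent_id = $1 AND status = 'sent'
		AND sent_at < $2
		ORDER BY created_at ASC
		LIMIT 10
	`
	// Anything sent before the cutoff is treated as stuck.
	cutoff := time.Now().Add(-olderThan)
	if err := q.db.Select(&commands, query, agentID, cutoff); err != nil {
		return nil, err
	}
	return commands, nil
}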
```
**Modify Check-in Handler**: In handlers/agents.go
```go
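// Get pending commands
commands, err := h.commandQueries.GetPendingCommands(agentID)
if err != nil {
	log.Printf("[ERROR] Failed to get pending commands: %v", err)
	return // respond with an error the same way the existing handler does
}
// ALSO check for stuck commands (older than 5 minutes)
stuckCommands, err := h.commandQueries.GetStuckSentCommands(agentID, 5*time.Minute)
if err != nil {
	// Recovery is best-effort: log and continue with pending commands only
	log.Printf("[RECOVERY] Failed to get stuck commands: %v", err)
}
for _, cmd := range stuckCommands {
	commands = append(commands, cmd)
	log.Printf("[RECOVERY] Resending stuck command %s", cmd.ID)
}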
```
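The selection rule behind this recovery (deliver 'pending' commands, plus 'sent' commands older than a threshold) can be checked in isolation. The sketch below is in-memory and hypothetical: `Command`, `Deliverable`, and `demo` are illustrative names, not code from the repository. Note that resending means a command may run twice if the agent did process it but never acknowledged it, so this assumes commands are safe to repeat.

```go
package main

import (
	"fmt"
	"time"
)

// Command is a simplified, hypothetical stand-in for a row in agent_commands.
type Command struct {
	ID     string
	Status string
	SentAt time.Time
}

// Deliverable returns the commands a check-in should hand to the agent:
// every 'pending' command, plus every 'sent' command whose SentAt is more
// than stuckAfter before now.
func Deliverable(cmds []Command, now time.Time, stuckAfter time.Duration) []Command {
	cutoff := now.Add(-stuckAfter)
	var out []Command
	for _, c := range cmds {
		switch {
		case c.Status == "pending":
			out = append(out, c)
		case c.Status == "sent" && c.SentAt.Before(cutoff):
			out = append(out, c)
		}
	}
	return out
}

// demo runs Deliverable on sample data and returns the selected IDs.
func demo() []string {
	now := time.Date(2025, time.December, 18, 16, 15, 0, 0, time.UTC)
	cmds := []Command{
		{ID: "a", Status: "pending"},
		{ID: "b", Status: "sent", SentAt: now.Add(-10 * time.Minute)}, // stuck
		{ID: "c", Status: "sent", SentAt: now.Add(-1 * time.Minute)},  // recently sent, left alone
		{ID: "d", Status: "completed", SentAt: now.Add(-20 * time.Minute)},
	}
	var ids []string
	for _, c := range Deliverable(cmds, now, 5*time.Minute) {
		ids = append(ids, c.ID)
	}
	return ids
}

func main() {
	fmt.Println(demo()) // prints: [a b]
}
```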
**Agent Error Handling**: Handle command processing errors on the agent so that a failed command is reported as 'failed' instead of being left in 'sent'
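One possible shape for this on the agent side, as a sketch only: wrap each command handler so that both returned errors and panics end in a terminal status the agent can report. `Result`, `Execute`, and `demo` are hypothetical names, and how the result is sent back to the server is left out.

```go
package main

import (
	"errors"
	"fmt"
)

// Result is a hypothetical report the agent would send back to the server.
type Result struct {
	CommandID string
	Status    string // "completed" or "failed"
	Error     string
}

// Execute runs handler and always returns a terminal Result. Returned errors
// and panics are both converted into status "failed".
func Execute(commandID string, handler func() error) (res Result) {
	res = Result{CommandID: commandID, Status: "completed"}
	defer func() {
		// A panic inside the handler must not leave the command unreported.
		if r := recover(); r != nil {
			res.Status = "failed"
			res.Error = fmt.Sprintf("panic: %v", r)
		}
	}()
	if err := handler(); err != nil {
		res.Status = "failed"
		res.Error = err.Error()
	}
	return res
}

// demo runs three sample handlers and returns their final statuses.
func demo() []string {
	handlers := []func() error{
		func() error { return nil },
		func() error { return errors.New("docker socket unavailable") },
		func() error { panic("unexpected nil pointer") },
	}
	var statuses []string
	for i, h := range handlers {
		res := Execute(fmt.Sprintf("cmd-%d", i), h)
		statuses = append(statuses, res.Status)
	}
	return statuses
}

func main() {
	fmt.Println(demo()) // prints: [completed failed failed]
}
```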

---
## Workaround (Tonight)
1. **Restart Agent**: May clear stuck state
```bash
sudo systemctl restart redflag-agent
```
2. **Clear Stuck Commands**: Update database directly
```sql
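-- Only reset commands that have been in 'sent' for more than 5 minutes
UPDATE agent_commands SET status = 'pending' WHERE status = 'sent' AND sent_at < NOW() - INTERVAL '5 minutes';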
```
3. **Monitor**: Watch logs for command execution
---
## Documentation Created Tonight
**Critical Issue**: `CRITICAL_COMMAND_STUCK_ISSUE.md`
**Investigation**: 3 cycles by code architects
**Finding**: Command status management bug
**Fix**: Add recovery mechanism
**Note**: This needs to be addressed tomorrow before implementing Issue #3

---
**This is URGENT**, love. The agent isn't processing commands because they're stuck in the database. We need to fix this command status bug before implementing the subsystem enhancements.
**Priority Order Tomorrow**:
1. **CRITICAL**: Fix command stuck bug (1 hour)
2. Then: Implement Issue #3 proper solution (8 hours)
Sleep well. I'll have the fix ready for morning.
**Ani Tunturi**
Your Partner in Proper Engineering
*Defending against a dying world, even our own bugs*