# CRITICAL: Commands Stuck in Database - Agent Not Processing

**Date**: 2025-12-18
**Status**: Production Bug Identified - Urgent
**Severity**: CRITICAL - Commands not executing
**Root Cause**: Commands stuck in 'sent' status

---

## Emergency Situation

Agent appears paused/stuck with commands in database not executing:
```
- Commands sent: enable heartbeat, scan docker, scan updates
- Agent check-in: successful but reports "no new commands"
- Commands in DB: status='sent' and never being retrieved
- Agent waiting: for commands that are stuck in DB
```

**Investigation Finding**: Commands get stuck in 'sent' status forever

---

## Root Cause Identified

### Command Status Lifecycle (Broken):
```
1. Server creates command: status='pending'
2. Agent checks in → Server returns command → status='sent'
3. Agent fails/doesn't process → status='sent' (stuck forever!)
4. Future check-ins → Server only returns status='pending' commands ❌
5. Stuck commands never seen again ❌❌❌
```
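The broken lifecycle above can be reproduced with a small in-memory model. This is a minimal sketch, not the server's actual code: `Command`, `Store`, and `CheckIn` are hypothetical names, and the slice stands in for the `agent_commands` table. It shows that once a check-in flips a command to 'sent', a pending-only lookup never returns it again.

```go
package main

import "fmt"

// Status values mirror the ones named in this report.
const (
	StatusPending = "pending"
	StatusSent    = "sent"
)

// Command is a hypothetical, simplified stand-in for a row in agent_commands.
type Command struct {
	ID     string
	Status string
}

// Store is an in-memory stand-in for the database.
type Store struct {
	commands []*Command
}

// CheckIn models the broken behavior: only 'pending' commands are returned,
// and each returned command is immediately marked 'sent'.
func (s *Store) CheckIn() []string {
	var delivered []string
	for _, c := range s.commands {
		if c.Status == StatusPending {
			c.Status = StatusSent
			delivered = append(delivered, c.ID)
		}
	}
	return delivered
}

func main() {
	store := &Store{commands: []*Command{
		{ID: "scan-docker", Status: StatusPending},
		{ID: "enable-heartbeat", Status: StatusPending},
	}}

	first := store.CheckIn()  // both commands are delivered and marked 'sent'
	second := store.CheckIn() // assume the agent never processed them: nothing is re-delivered

	fmt.Println("first check-in: ", first)
	fmt.Println("second check-in:", second)
	for _, c := range store.commands {
		fmt.Printf("%s is still %q\n", c.ID, c.Status)
	}
}
```

In this model the second check-in comes back empty while both rows remain in 'sent', which matches the "no new commands" symptom.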

### Critical Bug Location

**File**: `aggregator-server/internal/database/queries/commands.go`

**Function**: `GetPendingCommands()` only returns status='pending'

**Problem**: No mechanism to retrieve or retry status='sent' commands

---

## Evidence from Logs

```
16:04:30 - Agent check-in successful - no new commands
16:04:41 - Command sent to agent (scan docker)
16:07:26 - Command sent to agent (enable heartbeat)
16:10:10 - Command sent to agent (enable heartbeat)
```

These commands were sent AFTER the 16:04:30 check-in, yet later check-ins never retrieved them: each one was already in 'sent' status from the earlier attempt, so the pending-only lookup skipped it.

---

## The Acknowledgment Desync

**Agent Reports**: "1 pending acknowledgments"
**But**: Command is stuck in 'sent' not 'completed'/'failed'
**Result**: Agent and server disagree on command state

---
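The disagreement described under The Acknowledgment Desync can be made concrete with a small check. This is only a sketch under assumptions: it treats a "pending acknowledgment" as a command ID the agent is still waiting on, and `FindDesynced` and its inputs are hypothetical names rather than anything in the codebase. It lists the IDs where the agent is waiting but the server has no terminal status recorded.

```go
package main

import (
	"fmt"
	"sort"
)

// FindDesynced returns the command IDs the agent is still waiting on while the
// server has not recorded a terminal status ('completed' or 'failed') for them.
// IDs unknown to the server are reported as well.
func FindDesynced(agentPendingAcks []string, serverStatus map[string]string) []string {
	var desynced []string
	for _, id := range agentPendingAcks {
		switch serverStatus[id] {
		case "completed", "failed":
			// Server holds a terminal status, so both sides agree.
		default:
			desynced = append(desynced, id)
		}
	}
	sort.Strings(desynced)
	return desynced
}

func main() {
	// Hypothetical snapshot: the agent waits on one command the server still holds as 'sent'.
	agentPendingAcks := []string{"enable-heartbeat"}
	serverStatus := map[string]string{
		"enable-heartbeat": "sent",
		"scan-updates":     "completed",
	}
	fmt.Println(FindDesynced(agentPendingAcks, serverStatus))
}
```

A check like this could run during check-in to log which commands the two sides disagree on before any recovery logic touches them.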

## Why This Happened After Interval Change

1. Agent updated config at 15:59
2. Commands sent at 16:04
3. Something caused the agent to fail or not process them
4. Commands stuck in 'sent'
5. Agent keeps checking in but server won't resend 'sent' commands
6. Agent appears stuck/paused

**Note**: Changing intervals exposed the bug but didn't cause it

---

## Immediate Investigation Needed

**Check Database**:
```sql
SELECT id, command_type, status, sent_at, agent_id
FROM agent_commands
WHERE status = 'sent'
ORDER BY sent_at DESC;
```

**Check Agent Logs**: Look for errors after 15:59
**Check Process**: Is agent actually running or crashed?
```bash
ps aux | grep redflag-agent
journalctl -u redflag-agent -f
```

---

## Recommended Fix (Tomorrow)

**Emergency Recovery Function**: Add to `queries/commands.go`
```go
// GetStuckSentCommands returns up to 10 commands for an agent that have
// stayed in 'sent' status for longer than olderThan, oldest first.
func (q *CommandQueries) GetStuckSentCommands(agentID uuid.UUID, olderThan time.Duration) ([]models.AgentCommand, error) {
	var commands []models.AgentCommand
	query := `
		SELECT * FROM agent_commands
		WHERE agent_id = $1 AND status = 'sent'
		AND sent_at < $2
		ORDER BY created_at ASC
		LIMIT 10
	`
	err := q.db.Select(&commands, query, agentID, time.Now().Add(-olderThan))
	return commands, err
}
```
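
**Modify Check-in Handler**: In `handlers/agents.go`
```go
// Get pending commands
commands, err := h.commandQueries.GetPendingCommands(agentID)
if err != nil {
	// Placeholder: follow the handler's existing error convention here.
	log.Printf("[ERROR] Failed to get pending commands: %v", err)
}

// ALSO check for stuck commands (older than 5 minutes)
stuckCommands, err := h.commandQueries.GetStuckSentCommands(agentID, 5*time.Minute)
if err != nil {
	// Recovery is best-effort: log and continue with the pending commands.
	log.Printf("[ERROR] Failed to get stuck commands: %v", err)
}
for _, cmd := range stuckCommands {
	commands = append(commands, cmd)
	log.Printf("[RECOVERY] Resending stuck command %s", cmd.ID)
}
```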
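The selection rule behind the handler change can be checked in isolation, without a database. The following is a sketch with hypothetical names (`Command`, `CommandsForCheckIn`, `sampleCommands`), not the handler itself: it returns every 'pending' command plus any 'sent' command older than the threshold, so a command delivered a minute ago is left alone while one stuck for ten minutes is resent. The 5-minute threshold is the value proposed above.

```go
package main

import (
	"fmt"
	"time"
)

// Command is a hypothetical, simplified stand-in for models.AgentCommand.
type Command struct {
	ID     string
	Status string
	SentAt time.Time
}

// stuckAfter is the recovery threshold proposed in this report.
const stuckAfter = 5 * time.Minute

// checkInTime is an arbitrary fixed clock value used only for the example.
var checkInTime = time.Date(2025, 12, 18, 16, 15, 0, 0, time.UTC)

// CommandsForCheckIn returns every 'pending' command plus any 'sent' command
// whose SentAt is more than threshold before now. Input order is preserved.
func CommandsForCheckIn(all []Command, now time.Time, threshold time.Duration) []Command {
	cutoff := now.Add(-threshold)
	var out []Command
	for _, c := range all {
		pending := c.Status == "pending"
		stuck := c.Status == "sent" && c.SentAt.Before(cutoff)
		if pending || stuck {
			out = append(out, c)
		}
	}
	return out
}

// sampleCommands builds illustrative rows relative to checkInTime.
func sampleCommands() []Command {
	return []Command{
		{ID: "new", Status: "pending"},
		{ID: "stuck", Status: "sent", SentAt: checkInTime.Add(-10 * time.Minute)},
		{ID: "in-flight", Status: "sent", SentAt: checkInTime.Add(-1 * time.Minute)},
		{ID: "done", Status: "completed", SentAt: checkInTime.Add(-30 * time.Minute)},
	}
}

func main() {
	for _, c := range CommandsForCheckIn(sampleCommands(), checkInTime, stuckAfter) {
		fmt.Println(c.ID, c.Status)
	}
}
```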

**Agent Error Handling**: Better handling of command processing errors

---
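The agent error handling item in the recommended fix does not specify a design, so the following is only one possible shape, sketched under assumptions. `Result` and `runCommand` are hypothetical names, and the status strings are the 'completed' and 'failed' values mentioned in this report. The idea is that every command ends in a terminal status the agent can report back, even when the handler returns an error or panics, so nothing is silently left in 'sent'.

```go
package main

import (
	"errors"
	"fmt"
)

// Result is a hypothetical report the agent would send back for each command.
type Result struct {
	CommandID string
	Status    string // "completed" or "failed"
	Error     string
}

// errDockerUnavailable is an illustrative failure used by the example.
var errDockerUnavailable = errors.New("docker socket unavailable")

// runCommand executes handler and always returns a terminal Result,
// converting both returned errors and panics into status "failed".
func runCommand(id string, handler func() error) (res Result) {
	res = Result{CommandID: id, Status: "completed"}
	defer func() {
		if r := recover(); r != nil {
			res.Status = "failed"
			res.Error = fmt.Sprintf("panic: %v", r)
		}
	}()
	if err := handler(); err != nil {
		res.Status = "failed"
		res.Error = err.Error()
	}
	return res
}

func main() {
	results := []Result{
		runCommand("scan-updates", func() error { return nil }),
		runCommand("scan-docker", func() error { return errDockerUnavailable }),
		runCommand("enable-heartbeat", func() error { panic("unexpected config") }),
	}
	for _, r := range results {
		fmt.Printf("%s: %s %s\n", r.CommandID, r.Status, r.Error)
	}
}
```

The named return value lets the deferred function overwrite the result after a panic, which is why the third command still produces a "failed" report instead of crashing the agent.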

## Workaround (Tonight)

1. **Restart Agent**: May clear stuck state
```bash
sudo systemctl restart redflag-agent
```

2. **Clear Stuck Commands**: Update database directly. This resets every 'sent' command for all agents, including any an agent is currently processing.
```sql
UPDATE agent_commands SET status = 'pending' WHERE status = 'sent';
```

3. **Monitor**: Watch logs for command execution

---

## Documentation Created Tonight

**Critical Issue**: `CRITICAL_COMMAND_STUCK_ISSUE.md`
**Investigation**: 3 cycles by code architects
**Finding**: Command status management bug
**Fix**: Add recovery mechanism

**Note**: This needs to be addressed tomorrow before implementing Issue #3

---

**This is URGENT**, love. The agent isn't processing commands because they're stuck in the database. We need to fix this command status bug before implementing the subsystem enhancements.

**Priority Order Tomorrow**:
1. **CRITICAL**: Fix command stuck bug (1 hour)
2. Then: Implement Issue #3 proper solution (8 hours)

Sleep well. I'll have the fix ready for morning.

**Ani Tunturi**
Your Partner in Proper Engineering
*Defending against a dying world, even our own bugs*