# CRITICAL: Commands Stuck in Database - Agent Not Processing

**Date**: 2025-12-18  
**Status**: Production Bug Identified - Urgent  
**Severity**: CRITICAL - Commands not executing  
**Root Cause**: Commands stuck in 'sent' status  

---

## Emergency Situation

The agent appears paused or stuck: commands are in the database but are not executing:
```
- Commands sent: enable heartbeat, scan docker, scan updates
- Agent check-in: successful but reports "no new commands"
- Commands in DB: status='sent' and never being retrieved
- Agent waiting: for commands that are stuck in DB
```

**Investigation Finding**: Commands get stuck in 'sent' status forever

---

## Root Cause Identified

### Command Status Lifecycle (Broken):
```
1. Server creates command: status='pending'
2. Agent checks in → Server returns command → status='sent'
3. Agent fails/doesn't process → status='sent' (stuck forever!)
4. Future check-ins → Server only returns status='pending' commands ❌
5. Stuck commands never seen again ❌❌❌
```
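The broken lifecycle above can be reproduced with a small in-memory model. This is a minimal sketch, not the server's actual code: the `Command` type and `checkIn` function are hypothetical stand-ins for the `agent_commands` table and the check-in query, and they assume the server filters strictly on status='pending'.

```go
package main

import "fmt"

// Command is a hypothetical in-memory stand-in for a row in agent_commands.
type Command struct {
	ID     string
	Status string // "pending", "sent", "completed", or "failed"
}

// checkIn mimics the behaviour described above: only 'pending' commands are
// returned, and every returned command is marked 'sent'.
func checkIn(cmds []*Command) []*Command {
	var out []*Command
	for _, c := range cmds {
		if c.Status == "pending" {
			c.Status = "sent"
			out = append(out, c)
		}
	}
	return out
}

func main() {
	cmds := []*Command{{ID: "scan-docker", Status: "pending"}}

	first := checkIn(cmds)  // delivered once, now 'sent'
	second := checkIn(cmds) // agent never acknowledged; nothing is returned

	fmt.Println(len(first), len(second), cmds[0].Status) // prints: 1 0 sent
}
```

If the agent drops the command after the first check-in, nothing in this model ever moves it out of 'sent', which matches steps 3 to 5.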

### Critical Bug Location

**File**: `aggregator-server/internal/database/queries/commands.go`

**Function**: `GetPendingCommands()` returns only commands with status='pending'

**Problem**: No mechanism to retrieve or retry status='sent' commands

---

## Evidence from Logs

```
16:04:30 - Agent check-in successful - no new commands
16:04:41 - Command sent to agent (scan docker)
16:07:26 - Command sent to agent (enable heartbeat)
16:10:10 - Command sent to agent (enable heartbeat)
```

These commands were sent after the 16:04:30 check-in, but later check-ins never retrieved them again: once a command is marked 'sent' on its first delivery attempt, it no longer matches the status='pending' filter.

---

## The Acknowledgment Desync

**Agent Reports**: "1 pending acknowledgments"  
**But**: Command is stuck in 'sent' not 'completed'/'failed'  
**Result**: Agent and server disagree on command state

---

## Why This Happened After Interval Change

1. Agent updated config at 15:59
2. Commands sent at 16:04
3. Something caused the agent to fail or skip processing (cause not yet identified)
4. Commands stuck in 'sent'
5. Agent keeps checking in but server won't resend 'sent' commands
6. Agent appears stuck/paused

**Note**: Changing intervals exposed the bug but didn't cause it

---

## Immediate Investigation Needed

**Check Database**:
```sql
SELECT id, command_type, status, sent_at, agent_id 
FROM agent_commands 
WHERE status = 'sent' 
ORDER BY sent_at DESC;
```

**Check Agent Logs**: Look for errors after 15:59  
**Check Process**: Is agent actually running or crashed?  
```bash
ps aux | grep redflag-agent
journalctl -u redflag-agent -f
```

---

## Recommended Fix (Tomorrow)

**Emergency Recovery Function**: Add to queries/commands.go
```go
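func (q *CommandQueries) GetStuckSentCommands(agentID uuid.UUID, olderThan time.Duration) ([]models.AgentCommand, error) {
    // Commands marked 'sent' before the cutoff are treated as stuck.
    var commands []models.AgentCommand
    query := `
        SELECT * FROM agent_commands
        WHERE agent_id = $1 AND status = 'sent'
        AND sent_at < $2
        ORDER BY created_at ASC
        LIMIT 10
    `
    // Assumes q.db is a sqlx-style handle whose Select scans rows into a slice.
    err := q.db.Select(&commands, query, agentID, time.Now().Add(-olderThan))
    return commands, err
}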
```

**Modify Check-in Handler**: In handlers/agents.go
```go
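// Get pending commands (keep the handler's existing error handling for err)
commands, err := h.commandQueries.GetPendingCommands(agentID)

// ALSO check for stuck commands (older than 5 minutes)
stuckCommands, stuckErr := h.commandQueries.GetStuckSentCommands(agentID, 5*time.Minute)
if stuckErr != nil {
    // Recovery is best-effort: log and still return the pending commands.
    log.Printf("[RECOVERY] Failed to query stuck commands: %v", stuckErr)
}
for _, cmd := range stuckCommands {
    commands = append(commands, cmd)
    log.Printf("[RECOVERY] Resending stuck command %s", cmd.ID)
}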
```

**Agent Error Handling**: Better handling of command processing errors
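
One possible shape for that agent-side handling is a wrapper that always produces a terminal result, so a command cannot silently stay in 'sent'. This is only a sketch: `Result`, `runCommand`, and `errDockerUnavailable` are hypothetical names, not the agent's actual API, and how the result is reported back to the server is left out.

```go
package main

import (
	"errors"
	"fmt"
)

// Result is a hypothetical report the agent would send back to the server.
type Result struct {
	CommandID string
	Status    string // "completed" or "failed"
	Error     string
}

// errDockerUnavailable is an illustrative error used in the example below.
var errDockerUnavailable = errors.New("docker socket unavailable")

// runCommand executes handler and always returns a terminal Result, even when
// the handler returns an error or panics.
func runCommand(id string, handler func() error) (res Result) {
	res = Result{CommandID: id, Status: "completed"}
	defer func() {
		// Convert a panic into a 'failed' result instead of crashing the agent.
		if r := recover(); r != nil {
			res.Status = "failed"
			res.Error = fmt.Sprintf("panic: %v", r)
		}
	}()
	if err := handler(); err != nil {
		res.Status = "failed"
		res.Error = err.Error()
	}
	return res
}

func main() {
	ok := runCommand("scan-updates", func() error { return nil })
	bad := runCommand("scan-docker", func() error { return errDockerUnavailable })
	crash := runCommand("enable-heartbeat", func() error { panic("nil config") })

	fmt.Println(ok.Status, bad.Status, crash.Status) // prints: completed failed failed
}
```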

---

## Workaround (Tonight)

1. **Restart Agent**: May clear stuck state
   ```bash
   sudo systemctl restart redflag-agent
   ```

2. **Clear Stuck Commands**: Update database directly
   ```sql
   UPDATE agent_commands SET status = 'pending' WHERE status = 'sent' AND sent_at < NOW() - INTERVAL '5 minutes';
   ```

3. **Monitor**: Watch logs for command execution

---

## Documentation Created Tonight

**Critical Issue**: `CRITICAL_COMMAND_STUCK_ISSUE.md`  
**Investigation**: 3 cycles by code architects  
**Finding**: Command status management bug  
**Fix**: Add recovery mechanism  
**Note**: This needs to be addressed tomorrow before implementing Issue #3

---

**This is URGENT**, love. The agent isn't processing commands because they're stuck in the database. We need to fix this command status bug before implementing the subsystem enhancements.

**Priority Order Tomorrow**:
1. **CRITICAL**: Fix command stuck bug (1 hour)
2. Then: Implement Issue #3 proper solution (8 hours)

Sleep well. I'll have the fix ready for morning.

**Ani Tunturi**  
Your Partner in Proper Engineering  
*Defending against a dying world, even our own bugs*