Files
Redflag/docs/3_BACKLOG/P0-004_Database-Constraint-Violation.md

238 lines
8.3 KiB
Markdown

# P0-004: Database Constraint Violation in Timeout Log Creation
**Priority:** P0 (Critical)
**Source Reference:** From needsfixingbeforepush.md line 313
**Date Identified:** 2025-11-12
## Problem Description
Timeout service successfully marks commands as timed_out but fails to create audit log entries in the `update_logs` table due to a database constraint violation. The error "pq: new row for relation "update_logs" violates check constraint "update_logs_result_check"" prevents proper audit trail creation for timeout events.
## Current Behavior
- Timeout service runs every 5 minutes correctly
- Successfully identifies timed out commands (both pending >30min and sent >2h)
- Successfully updates command status to 'timed_out' in `agent_commands` table
- **FAILS** to create audit log entry in `update_logs` table
- Constraint violation suggests 'timed_out' is not a valid value for the `result` field
### Error Message
```
Warning: failed to create timeout log entry: pq: new row for relation "update_logs" violates check constraint "update_logs_result_check"
```
## Root Cause Analysis
The `update_logs` table has a CHECK constraint on the `result` field that doesn't include 'timed_out' as a valid value. The timeout service is trying to insert 'timed_out' as the result, but the database schema only accepts other values like 'success', 'failed', 'error', etc.
### Likely Database Schema Issue
```sql
-- Current constraint (hypothetical)
ALTER TABLE update_logs ADD CONSTRAINT update_logs_result_check
CHECK (result IN ('success', 'failed', 'error', 'pending'));
-- Missing: 'timed_out' in the allowed values list
```
## Proposed Solution
### Option 1: Add 'timed_out' to Database Constraint (Recommended)
```sql
-- Update the check constraint to include 'timed_out'
ALTER TABLE update_logs DROP CONSTRAINT update_logs_result_check;
ALTER TABLE update_logs ADD CONSTRAINT update_logs_result_check
CHECK (result IN ('success', 'failed', 'error', 'pending', 'timed_out'));
```
### Option 2: Use 'failed' with Timeout Metadata
```go
// In timeout service, use 'failed' instead of 'timed_out'
logEntry := &UpdateLog{
CommandID: command.ID,
AgentID: command.AgentID,
Result: "failed", // Instead of "timed_out"
Message: "Command timed out after 2 hours",
Metadata: map[string]interface{}{
"timeout_duration": "2h",
"timeout_reason": "no_response",
"sent_at": command.SentAt,
},
}
```
### Option 3: Separate Timeout Status Field
```sql
-- Add dedicated timeout tracking
ALTER TABLE update_logs ADD COLUMN is_timed_out BOOLEAN DEFAULT FALSE;
ALTER TABLE update_logs ADD COLUMN timeout_duration INTERVAL;
-- Keep result as 'failed' but mark as timeout
UPDATE update_logs SET
result = 'failed',
is_timed_out = TRUE,
timeout_duration = '2 hours'
WHERE command_id = '...';
```
## Definition of Done
- [ ] Timeout service can create audit log entries without constraint violations
- [ ] Audit trail properly records timeout events with timestamps and details
- [ ] Timeout events are visible in command history and audit reports
- [ ] Database constraint allows all valid command result states
- [ ] Error logs no longer show constraint violation warnings
- [ ] Compliance requirements for audit trail are met
## Test Plan
### 1. Manual Timeout Creation Test
```bash
# Create a command and mark it as sent
docker exec -it redflag-postgres psql -U aggregator -d aggregator -c "
INSERT INTO agent_commands (id, agent_id, command_type, status, created_at, sent_at)
VALUES ('test-timeout-123', 'agent-uuid', 'scan_updates', 'sent', NOW(), NOW() - INTERVAL '3 hours');
"
# Run timeout service manually or wait for next run (5 minutes)
# Check that no constraint violation occurs
docker logs redflag-server | grep -i "constraint\|timeout"
# Verify audit log was created
docker exec -it redflag-postgres psql -U aggregator -d aggregator -c "
SELECT * FROM update_logs WHERE command_id = 'test-timeout-123';
"
```
### 2. Database Constraint Test
```bash
# Test all valid result values
docker exec -it redflag-postgres psql -U aggregator -d aggregator -c "
INSERT INTO update_logs (command_id, agent_id, result, message)
VALUES
('test-success', 'agent-uuid', 'success', 'Test success'),
('test-failed', 'agent-uuid', 'failed', 'Test failed'),
('test-error', 'agent-uuid', 'error', 'Test error'),
('test-pending', 'agent-uuid', 'pending', 'Test pending'),
('test-timeout', 'agent-uuid', 'timed_out', 'Test timeout');
"
# All should succeed without constraint violations
```
### 3. Full Timeout Service Test
```bash
# Set up old commands that should timeout
docker exec -it redflag-postgres psql -U aggregator -d aggregator -c "
UPDATE agent_commands
SET status = 'sent', sent_at = NOW() - INTERVAL '3 hours'
WHERE created_at < NOW() - INTERVAL '1 hour';
"
# Trigger timeout service
curl -X POST http://localhost:8080/api/v1/admin/timeout-service/run \
-H "Authorization: Bearer $ADMIN_TOKEN"
# Verify no constraint violations in logs
# Verify audit logs are created for timed out commands
```
### 4. Audit Trail Verification
```bash
# Check that timeout events appear in command history
curl -H "Authorization: Bearer $TOKEN" \
"http://localhost:8080/api/v1/commands/history?include_timeout=true"
# Should show timeout events with proper metadata
```
## Files to Modify
- **Database Migration:** `aggregator-server/internal/database/migrations/XXX_add_timed_out_constraint.up.sql`
- **Timeout Service:** `aggregator-server/internal/services/timeout.go`
- **Database Schema:** Update `update_logs` table constraints
- **API Handlers:** Ensure timeout events are returned in history queries
## Database Migration Example
```sql
-- File: 020_add_timed_out_to_result_constraint.up.sql
-- Add 'timed_out' as valid result value for update_logs
-- First, drop existing constraint
ALTER TABLE update_logs DROP CONSTRAINT IF EXISTS update_logs_result_check;
-- Add updated constraint with 'timed_out' included
ALTER TABLE update_logs ADD CONSTRAINT update_logs_result_check
CHECK (result IN ('success', 'failed', 'error', 'pending', 'timed_out'));
-- Add comment explaining the change
COMMENT ON CONSTRAINT update_logs_result_check ON update_logs IS
'Valid result values for command execution, including timeout status';
```
## Impact
- **Audit Compliance:** Enables complete audit trail for timeout events
- **Troubleshooting:** Timeout events visible in command history and logs
- **Compliance:** Meets regulatory requirements for complete audit trail
- **Debugging:** Clear visibility into timeout patterns and system health
- **Monitoring:** Enables metrics on timeout rates and patterns
## Security and Compliance Considerations
### Audit Trail Requirements
- **Complete Records:** All command state changes must be logged
- **Immutable History:** Timeout events should not be deletable
- **Timestamp Accuracy:** Precise timing of timeout detection
- **User Attribution:** Which system/service detected the timeout
### Data Privacy
- **Command Details:** What command timed out (but not sensitive data)
- **Agent Information:** Which agent had the timeout
- **Timing Data:** How long the command was stuck
- **System Metadata:** Service version, detection method
## Monitoring and Alerting
### Metrics to Track
- Timeout rate by command type
- Average timeout duration
- Timeout service execution success rate
- Audit log creation success rate
- Database constraint violations (should be 0)
### Alert Examples
```bash
# Alert if timeout service fails
if timeout_service_failures > 3 in 5m:
alert("Timeout service experiencing failures")
# Alert if constraint violations occur
if database_constraint_violations > 0:
critical("Database constraint violation detected!")
```
## Verification Commands
After fix implementation:
```bash
# Test timeout service execution
curl -X POST http://localhost:8080/api/v1/admin/timeout-service/run \
-H "Authorization: Bearer $ADMIN_TOKEN"
# Check for constraint violations
docker logs redflag-server | grep -i "constraint" # Should be empty
# Verify audit log creation
docker exec -it redflag-postgres psql -U aggregator -d aggregator -c "
SELECT COUNT(*) FROM update_logs
WHERE result = 'timed_out'
AND created_at > NOW() - INTERVAL '1 hour';
"
# Should be >0 after timeout service runs
# Verify no constraint errors
docker logs redflag-server 2>&1 | grep -c "violates check constraint"
# Should return 0
```