Files
Redflag/docs/3_BACKLOG/P0-004_Database-Constraint-Violation.md

8.3 KiB

P0-004: Database Constraint Violation in Timeout Log Creation

Priority: P0 (Critical) Source Reference: From needsfixingbeforepush.md line 313 Date Identified: 2025-11-12

Problem Description

Timeout service successfully marks commands as timed_out but fails to create audit log entries in the update_logs table due to a database constraint violation. The error "pq: new row for relation "update_logs" violates check constraint "update_logs_result_check"" prevents proper audit trail creation for timeout events.

Current Behavior

  • Timeout service runs every 5 minutes correctly
  • Successfully identifies timed out commands (both pending >30min and sent >2h)
  • Successfully updates command status to 'timed_out' in agent_commands table
  • FAILS to create audit log entry in update_logs table
  • Constraint violation suggests 'timed_out' is not a valid value for the result field

Error Message

Warning: failed to create timeout log entry: pq: new row for relation "update_logs" violates check constraint "update_logs_result_check"

Root Cause Analysis

The update_logs table has a CHECK constraint on the result field that doesn't include 'timed_out' as a valid value. The timeout service is trying to insert 'timed_out' as the result, but the database schema only accepts other values like 'success', 'failed', 'error', etc.

Likely Database Schema Issue

-- Current constraint (hypothetical)
ALTER TABLE update_logs ADD CONSTRAINT update_logs_result_check
CHECK (result IN ('success', 'failed', 'error', 'pending'));

-- Missing: 'timed_out' in the allowed values list

Proposed Solution

-- Update the check constraint to include 'timed_out'
ALTER TABLE update_logs DROP CONSTRAINT update_logs_result_check;
ALTER TABLE update_logs ADD CONSTRAINT update_logs_result_check
CHECK (result IN ('success', 'failed', 'error', 'pending', 'timed_out'));

Option 2: Use 'failed' with Timeout Metadata

// In timeout service, use 'failed' instead of 'timed_out'
logEntry := &UpdateLog{
    CommandID: command.ID,
    AgentID:   command.AgentID,
    Result:    "failed",  // Instead of "timed_out"
    Message:   "Command timed out after 2 hours",
    Metadata: map[string]interface{}{
        "timeout_duration": "2h",
        "timeout_reason":   "no_response",
        "sent_at":         command.SentAt,
    },
}

Option 3: Separate Timeout Status Field

-- Add dedicated timeout tracking
ALTER TABLE update_logs ADD COLUMN is_timed_out BOOLEAN DEFAULT FALSE;
ALTER TABLE update_logs ADD COLUMN timeout_duration INTERVAL;

-- Keep result as 'failed' but mark as timeout
UPDATE update_logs SET
    result = 'failed',
    is_timed_out = TRUE,
    timeout_duration = '2 hours'
WHERE command_id = '...';

Definition of Done

  • Timeout service can create audit log entries without constraint violations
  • Audit trail properly records timeout events with timestamps and details
  • Timeout events are visible in command history and audit reports
  • Database constraint allows all valid command result states
  • Error logs no longer show constraint violation warnings
  • Compliance requirements for audit trail are met

Test Plan

1. Manual Timeout Creation Test

# Create a command and mark it as sent
docker exec -it redflag-postgres psql -U aggregator -d aggregator -c "
INSERT INTO agent_commands (id, agent_id, command_type, status, created_at, sent_at)
VALUES ('test-timeout-123', 'agent-uuid', 'scan_updates', 'sent', NOW(), NOW() - INTERVAL '3 hours');
"

# Run timeout service manually or wait for next run (5 minutes)
# Check that no constraint violation occurs
docker logs redflag-server | grep -i "constraint\|timeout"

# Verify audit log was created
docker exec -it redflag-postgres psql -U aggregator -d aggregator -c "
SELECT * FROM update_logs WHERE command_id = 'test-timeout-123';
"

2. Database Constraint Test

# Test all valid result values
docker exec -it redflag-postgres psql -U aggregator -d aggregator -c "
INSERT INTO update_logs (command_id, agent_id, result, message)
VALUES
  ('test-success', 'agent-uuid', 'success', 'Test success'),
  ('test-failed', 'agent-uuid', 'failed', 'Test failed'),
  ('test-error', 'agent-uuid', 'error', 'Test error'),
  ('test-pending', 'agent-uuid', 'pending', 'Test pending'),
  ('test-timeout', 'agent-uuid', 'timed_out', 'Test timeout');
"

# All should succeed without constraint violations

3. Full Timeout Service Test

# Set up old commands that should timeout
docker exec -it redflag-postgres psql -U aggregator -d aggregator -c "
UPDATE agent_commands
SET status = 'sent', sent_at = NOW() - INTERVAL '3 hours'
WHERE created_at < NOW() - INTERVAL '1 hour';
"

# Trigger timeout service
curl -X POST http://localhost:8080/api/v1/admin/timeout-service/run \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# Verify no constraint violations in logs
# Verify audit logs are created for timed out commands

4. Audit Trail Verification

# Check that timeout events appear in command history
curl -H "Authorization: Bearer $TOKEN" \
  "http://localhost:8080/api/v1/commands/history?include_timeout=true"

# Should show timeout events with proper metadata

Files to Modify

  • Database Migration: aggregator-server/internal/database/migrations/XXX_add_timed_out_constraint.up.sql
  • Timeout Service: aggregator-server/internal/services/timeout.go
  • Database Schema: Update update_logs table constraints
  • API Handlers: Ensure timeout events are returned in history queries

Database Migration Example

-- File: 020_add_timed_out_to_result_constraint.up.sql
-- Add 'timed_out' as valid result value for update_logs

-- First, drop existing constraint
ALTER TABLE update_logs DROP CONSTRAINT IF EXISTS update_logs_result_check;

-- Add updated constraint with 'timed_out' included
ALTER TABLE update_logs ADD CONSTRAINT update_logs_result_check
CHECK (result IN ('success', 'failed', 'error', 'pending', 'timed_out'));

-- Add comment explaining the change
COMMENT ON CONSTRAINT update_logs_result_check ON update_logs IS
  'Valid result values for command execution, including timeout status';

Impact

  • Audit Compliance: Enables complete audit trail for timeout events
  • Troubleshooting: Timeout events visible in command history and logs
  • Compliance: Meets regulatory requirements for complete audit trail
  • Debugging: Clear visibility into timeout patterns and system health
  • Monitoring: Enables metrics on timeout rates and patterns

Security and Compliance Considerations

Audit Trail Requirements

  • Complete Records: All command state changes must be logged
  • Immutable History: Timeout events should not be deletable
  • Timestamp Accuracy: Precise timing of timeout detection
  • User Attribution: Which system/service detected the timeout

Data Privacy

  • Command Details: What command timed out (but not sensitive data)
  • Agent Information: Which agent had the timeout
  • Timing Data: How long the command was stuck
  • System Metadata: Service version, detection method

Monitoring and Alerting

Metrics to Track

  • Timeout rate by command type
  • Average timeout duration
  • Timeout service execution success rate
  • Audit log creation success rate
  • Database constraint violations (should be 0)

Alert Examples

# Alert if timeout service fails
if timeout_service_failures > 3 in 5m:
    alert("Timeout service experiencing failures")

# Alert if constraint violations occur
if database_constraint_violations > 0:
    critical("Database constraint violation detected!")

Verification Commands

After fix implementation:

# Test timeout service execution
curl -X POST http://localhost:8080/api/v1/admin/timeout-service/run \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# Check for constraint violations
docker logs redflag-server | grep -i "constraint"  # Should be empty

# Verify audit log creation
docker exec -it redflag-postgres psql -U aggregator -d aggregator -c "
SELECT COUNT(*) FROM update_logs
WHERE result = 'timed_out'
AND created_at > NOW() - INTERVAL '1 hour';
"
# Should be >0 after timeout service runs

# Verify no constraint errors
docker logs redflag-server 2>&1 | grep -c "violates check constraint"
# Should return 0