# P0-004: Database Constraint Violation in Timeout Log Creation

**Priority:** P0 (Critical)
**Source Reference:** From needsfixingbeforepush.md line 313
**Date Identified:** 2025-11-12

## Problem Description

The timeout service successfully marks commands as `timed_out` but fails to create audit log entries in the `update_logs` table because of a database constraint violation. The error `pq: new row for relation "update_logs" violates check constraint "update_logs_result_check"` prevents audit trail creation for timeout events.

## Current Behavior

- Timeout service runs every 5 minutes correctly
- Successfully identifies timed out commands (both pending >30min and sent >2h)
- Successfully updates command status to 'timed_out' in `agent_commands` table
- **FAILS** to create audit log entry in `update_logs` table
- Constraint violation suggests 'timed_out' is not a valid value for the `result` field

### Error Message
```
Warning: failed to create timeout log entry: pq: new row for relation "update_logs" violates check constraint "update_logs_result_check"
```

## Root Cause Analysis

The `update_logs` table has a CHECK constraint, `update_logs_result_check`, on the `result` field that does not include 'timed_out' as a valid value. The timeout service tries to insert 'timed_out' as the result, but the database schema accepts only values such as 'success', 'failed', and 'error', so PostgreSQL rejects the row.

### Likely Database Schema Issue
```sql
-- Current constraint (hypothetical)
ALTER TABLE update_logs ADD CONSTRAINT update_logs_result_check
    CHECK (result IN ('success', 'failed', 'error', 'pending'));

-- Missing: 'timed_out' in the allowed values list
```

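One way to surface this kind of mismatch earlier is to mirror the constraint in application code, so an invalid result fails with a clear message before the INSERT is attempted. The sketch below assumes the hypothetical value list shown above; `allowedResults` and `ValidateResult` are illustrative names, and the real list has to be read from the actual schema.

```go
package main

import "fmt"

// allowedResults mirrors the hypothetical CHECK constraint from this ticket.
// Assumption: the real list must be taken from the deployed schema.
var allowedResults = map[string]bool{
	"success": true,
	"failed":  true,
	"error":   true,
	"pending": true,
}

// ValidateResult returns an error if result would violate the constraint.
func ValidateResult(result string) error {
	if !allowedResults[result] {
		return fmt.Errorf("result %q is not permitted by update_logs_result_check", result)
	}
	return nil
}

func main() {
	for _, r := range []string{"success", "timed_out"} {
		if err := ValidateResult(r); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted:", r)
	}
}
```

A guard like this duplicates the schema, so it has to be updated together with any migration that changes the constraint.
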
## Proposed Solution

### Option 1: Add 'timed_out' to Database Constraint (Recommended)
```sql
-- Update the check constraint to include 'timed_out'
ALTER TABLE update_logs DROP CONSTRAINT update_logs_result_check;
ALTER TABLE update_logs ADD CONSTRAINT update_logs_result_check
    CHECK (result IN ('success', 'failed', 'error', 'pending', 'timed_out'));
```

### Option 2: Use 'failed' with Timeout Metadata
```go
// In timeout service, use 'failed' instead of 'timed_out'
logEntry := &UpdateLog{
	CommandID: command.ID,
	AgentID:   command.AgentID,
	Result:    "failed", // Instead of "timed_out"
	Message:   "Command timed out after 2 hours",
	Metadata: map[string]interface{}{
		"timeout_duration": "2h",
		"timeout_reason":   "no_response",
		"sent_at":          command.SentAt,
	},
}
```

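A consequence of Option 2 is that readers of `update_logs` can no longer tell timeouts from other failures by `result` alone; they have to inspect the metadata. The sketch below shows one way that check might look. The `UpdateLog` struct here is a hypothetical, trimmed-down stand-in, and treating the presence of the `timeout_reason` key as the marker is an assumption based on the snippet above.

```go
package main

import "fmt"

// UpdateLog is a hypothetical, trimmed-down version of the struct used above.
type UpdateLog struct {
	Result   string
	Message  string
	Metadata map[string]interface{}
}

// IsTimeout reports whether a 'failed' entry was written for a timeout,
// using the presence of the timeout_reason metadata key as the marker.
func IsTimeout(l UpdateLog) bool {
	if l.Result != "failed" {
		return false
	}
	_, ok := l.Metadata["timeout_reason"] // reading a nil map is safe in Go
	return ok
}

func main() {
	timeout := UpdateLog{
		Result:   "failed",
		Message:  "Command timed out after 2 hours",
		Metadata: map[string]interface{}{"timeout_reason": "no_response"},
	}
	plain := UpdateLog{Result: "failed", Message: "Package install failed"}

	fmt.Println("timeout entry:", IsTimeout(timeout))
	fmt.Println("plain failure:", IsTimeout(plain))
}
```

Every history query and report that should show timeouts would need the same check, which is part of why Option 1 keeps the data model simpler.
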
### Option 3: Separate Timeout Status Field
```sql
-- Add dedicated timeout tracking
ALTER TABLE update_logs ADD COLUMN is_timed_out BOOLEAN DEFAULT FALSE;
ALTER TABLE update_logs ADD COLUMN timeout_duration INTERVAL;

-- Keep result as 'failed' but mark as timeout
UPDATE update_logs SET
    result = 'failed',
    is_timed_out = TRUE,
    timeout_duration = '2 hours'
WHERE command_id = '...';
```

## Definition of Done

- [ ] Timeout service can create audit log entries without constraint violations
- [ ] Audit trail properly records timeout events with timestamps and details
- [ ] Timeout events are visible in command history and audit reports
- [ ] Database constraint allows all valid command result states
- [ ] Error logs no longer show constraint violation warnings
- [ ] Compliance requirements for audit trail are met

## Test Plan

### 1. Manual Timeout Creation Test
```bash
# Create a command and mark it as sent
docker exec -it redflag-postgres psql -U aggregator -d aggregator -c "
  INSERT INTO agent_commands (id, agent_id, command_type, status, created_at, sent_at)
  VALUES ('test-timeout-123', 'agent-uuid', 'scan_updates', 'sent', NOW(), NOW() - INTERVAL '3 hours');
"

# Run timeout service manually or wait for next run (5 minutes)
# Check that no constraint violation occurs (2>&1 so stderr output is searched too)
docker logs redflag-server 2>&1 | grep -i "constraint\|timeout"

# Verify audit log was created
docker exec -it redflag-postgres psql -U aggregator -d aggregator -c "
  SELECT * FROM update_logs WHERE command_id = 'test-timeout-123';
"
```

### 2. Database Constraint Test
```bash
# Test all valid result values
docker exec -it redflag-postgres psql -U aggregator -d aggregator -c "
  INSERT INTO update_logs (command_id, agent_id, result, message)
  VALUES
    ('test-success', 'agent-uuid', 'success', 'Test success'),
    ('test-failed', 'agent-uuid', 'failed', 'Test failed'),
    ('test-error', 'agent-uuid', 'error', 'Test error'),
    ('test-pending', 'agent-uuid', 'pending', 'Test pending'),
    ('test-timeout', 'agent-uuid', 'timed_out', 'Test timeout');
"

# All should succeed without constraint violations
```

### 3. Full Timeout Service Test
```bash
# Set up old commands that should time out
# Caution: this rewrites every command older than 1 hour; run it against a test database only
docker exec -it redflag-postgres psql -U aggregator -d aggregator -c "
  UPDATE agent_commands
  SET status = 'sent', sent_at = NOW() - INTERVAL '3 hours'
  WHERE created_at < NOW() - INTERVAL '1 hour';
"

# Trigger timeout service
curl -X POST http://localhost:8080/api/v1/admin/timeout-service/run \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# Verify no constraint violations in logs
# Verify audit logs are created for timed out commands
```

### 4. Audit Trail Verification
```bash
# Check that timeout events appear in command history
curl -H "Authorization: Bearer $TOKEN" \
  "http://localhost:8080/api/v1/commands/history?include_timeout=true"

# Should show timeout events with proper metadata
```

## Files to Modify

- **Database Migration:** `aggregator-server/internal/database/migrations/XXX_add_timed_out_constraint.up.sql`
- **Timeout Service:** `aggregator-server/internal/services/timeout.go`
- **Database Schema:** Update `update_logs` table constraints
- **API Handlers:** Ensure timeout events are returned in history queries

## Database Migration Example

```sql
-- File: 020_add_timed_out_to_result_constraint.up.sql
-- Add 'timed_out' as valid result value for update_logs

-- First, drop existing constraint
ALTER TABLE update_logs DROP CONSTRAINT IF EXISTS update_logs_result_check;

-- Add updated constraint with 'timed_out' included
ALTER TABLE update_logs ADD CONSTRAINT update_logs_result_check
    CHECK (result IN ('success', 'failed', 'error', 'pending', 'timed_out'));

-- Add comment explaining the change
COMMENT ON CONSTRAINT update_logs_result_check ON update_logs IS
    'Valid result values for command execution, including timeout status';
```

## Impact

- **Audit Compliance:** Enables complete audit trail for timeout events
- **Troubleshooting:** Timeout events visible in command history and logs
- **Compliance:** Meets regulatory requirements for complete audit trail
- **Debugging:** Clear visibility into timeout patterns and system health
- **Monitoring:** Enables metrics on timeout rates and patterns

## Security and Compliance Considerations

### Audit Trail Requirements
- **Complete Records:** All command state changes must be logged
- **Immutable History:** Timeout events should not be deletable
- **Timestamp Accuracy:** Precise timing of timeout detection
- **User Attribution:** Which system/service detected the timeout

### Data Privacy
- **Command Details:** What command timed out (but not sensitive data)
- **Agent Information:** Which agent had the timeout
- **Timing Data:** How long the command was stuck
- **System Metadata:** Service version, detection method

## Monitoring and Alerting

### Metrics to Track
- Timeout rate by command type
- Average timeout duration
- Timeout service execution success rate
- Audit log creation success rate
- Database constraint violations (should be 0)

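As an illustration of the first metric, the timeout rate by command type could be computed along the following lines. This is a sketch under assumptions: `CommandOutcome` and `TimeoutRateByType` are hypothetical names, the `install_updates` command type is invented for the example, and a real deployment would more likely aggregate in SQL or in a metrics system.

```go
package main

import "fmt"

// CommandOutcome is a hypothetical record of one finished command.
type CommandOutcome struct {
	CommandType string
	TimedOut    bool
}

// TimeoutRateByType returns, per command type, the fraction of commands
// that ended in a timeout (0.0 to 1.0).
func TimeoutRateByType(outcomes []CommandOutcome) map[string]float64 {
	total := make(map[string]int)
	timedOut := make(map[string]int)
	for _, o := range outcomes {
		total[o.CommandType]++
		if o.TimedOut {
			timedOut[o.CommandType]++
		}
	}
	rates := make(map[string]float64, len(total))
	for cmdType, n := range total {
		rates[cmdType] = float64(timedOut[cmdType]) / float64(n)
	}
	return rates
}

func main() {
	outcomes := []CommandOutcome{
		{CommandType: "scan_updates", TimedOut: true},
		{CommandType: "scan_updates", TimedOut: false},
		{CommandType: "install_updates", TimedOut: false}, // hypothetical command type
	}
	for cmdType, rate := range TimeoutRateByType(outcomes) {
		fmt.Printf("%s: %.0f%% timed out\n", cmdType, rate*100)
	}
}
```
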
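### Alert Examples
```bash
# Alert if the timeout service logs more than 3 failures in 5 minutes
# (assumes failures appear as the "failed to create timeout log entry" warning shown above)
failures=$(docker logs --since 5m redflag-server 2>&1 | grep -c "failed to create timeout log entry")
if [ "$failures" -gt 3 ]; then
    echo "ALERT: Timeout service experiencing failures"
fi

# Alert if constraint violations occur
violations=$(docker logs --since 5m redflag-server 2>&1 | grep -c "violates check constraint")
if [ "$violations" -gt 0 ]; then
    echo "CRITICAL: Database constraint violation detected!"
fi
```
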
## Verification Commands

After fix implementation:

```bash
# Test timeout service execution
curl -X POST http://localhost:8080/api/v1/admin/timeout-service/run \
  -H "Authorization: Bearer $ADMIN_TOKEN"

# Check for constraint violations
docker logs redflag-server 2>&1 | grep -i "constraint"  # Should be empty

# Verify audit log creation
docker exec -it redflag-postgres psql -U aggregator -d aggregator -c "
  SELECT COUNT(*) FROM update_logs
  WHERE result = 'timed_out'
  AND created_at > NOW() - INTERVAL '1 hour';
"
# Should be >0 after timeout service runs

# Verify no constraint errors
docker logs redflag-server 2>&1 | grep -c "violates check constraint"
# Should return 0
```