Add docs and project files - force for Culurien
This commit is contained in:
238
docs/3_BACKLOG/P0-004_Database-Constraint-Violation.md
Normal file
238
docs/3_BACKLOG/P0-004_Database-Constraint-Violation.md
Normal file
@@ -0,0 +1,238 @@
|
||||
# P0-004: Database Constraint Violation in Timeout Log Creation
|
||||
|
||||
**Priority:** P0 (Critical)
|
||||
**Source Reference:** From needsfixingbeforepush.md line 313
|
||||
**Date Identified:** 2025-11-12
|
||||
|
||||
## Problem Description
|
||||
|
||||
Timeout service successfully marks commands as timed_out but fails to create audit log entries in the `update_logs` table due to a database constraint violation. The error "pq: new row for relation "update_logs" violates check constraint "update_logs_result_check"" prevents proper audit trail creation for timeout events.
|
||||
|
||||
## Current Behavior
|
||||
|
||||
- Timeout service runs every 5 minutes correctly
|
||||
- Successfully identifies timed out commands (both pending >30min and sent >2h)
|
||||
- Successfully updates command status to 'timed_out' in `agent_commands` table
|
||||
- **FAILS** to create audit log entry in `update_logs` table
|
||||
- Constraint violation suggests 'timed_out' is not a valid value for the `result` field
|
||||
|
||||
### Error Message
|
||||
```
|
||||
Warning: failed to create timeout log entry: pq: new row for relation "update_logs" violates check constraint "update_logs_result_check"
|
||||
```
|
||||
|
||||
## Root Cause Analysis
|
||||
|
||||
The `update_logs` table has a CHECK constraint on the `result` field that doesn't include 'timed_out' as a valid value. The timeout service is trying to insert 'timed_out' as the result, but the database schema only accepts other values like 'success', 'failed', 'error', etc.
|
||||
|
||||
### Likely Database Schema Issue
|
||||
```sql
|
||||
-- Current constraint (hypothetical)
|
||||
ALTER TABLE update_logs ADD CONSTRAINT update_logs_result_check
|
||||
CHECK (result IN ('success', 'failed', 'error', 'pending'));
|
||||
|
||||
-- Missing: 'timed_out' in the allowed values list
|
||||
```
|
||||
|
||||
## Proposed Solution
|
||||
|
||||
### Option 1: Add 'timed_out' to Database Constraint (Recommended)
|
||||
```sql
|
||||
-- Update the check constraint to include 'timed_out'
|
||||
ALTER TABLE update_logs DROP CONSTRAINT update_logs_result_check;
|
||||
ALTER TABLE update_logs ADD CONSTRAINT update_logs_result_check
|
||||
CHECK (result IN ('success', 'failed', 'error', 'pending', 'timed_out'));
|
||||
```
|
||||
|
||||
### Option 2: Use 'failed' with Timeout Metadata
|
||||
```go
|
||||
// In timeout service, use 'failed' instead of 'timed_out'
|
||||
logEntry := &UpdateLog{
|
||||
CommandID: command.ID,
|
||||
AgentID: command.AgentID,
|
||||
Result: "failed", // Instead of "timed_out"
|
||||
Message: "Command timed out after 2 hours",
|
||||
Metadata: map[string]interface{}{
|
||||
"timeout_duration": "2h",
|
||||
"timeout_reason": "no_response",
|
||||
"sent_at": command.SentAt,
|
||||
},
|
||||
}
|
||||
```
|
||||
|
||||
### Option 3: Separate Timeout Status Field
|
||||
```sql
|
||||
-- Add dedicated timeout tracking
|
||||
ALTER TABLE update_logs ADD COLUMN is_timed_out BOOLEAN DEFAULT FALSE;
|
||||
ALTER TABLE update_logs ADD COLUMN timeout_duration INTERVAL;
|
||||
|
||||
-- Keep result as 'failed' but mark as timeout
|
||||
UPDATE update_logs SET
|
||||
result = 'failed',
|
||||
is_timed_out = TRUE,
|
||||
timeout_duration = '2 hours'
|
||||
WHERE command_id = '...';
|
||||
```
|
||||
|
||||
## Definition of Done
|
||||
|
||||
- [ ] Timeout service can create audit log entries without constraint violations
|
||||
- [ ] Audit trail properly records timeout events with timestamps and details
|
||||
- [ ] Timeout events are visible in command history and audit reports
|
||||
- [ ] Database constraint allows all valid command result states
|
||||
- [ ] Error logs no longer show constraint violation warnings
|
||||
- [ ] Compliance requirements for audit trail are met
|
||||
|
||||
## Test Plan
|
||||
|
||||
### 1. Manual Timeout Creation Test
|
||||
```bash
|
||||
# Create a command and mark it as sent
|
||||
docker exec -it redflag-postgres psql -U aggregator -d aggregator -c "
|
||||
INSERT INTO agent_commands (id, agent_id, command_type, status, created_at, sent_at)
|
||||
VALUES ('test-timeout-123', 'agent-uuid', 'scan_updates', 'sent', NOW(), NOW() - INTERVAL '3 hours');
|
||||
"
|
||||
|
||||
# Run timeout service manually or wait for next run (5 minutes)
|
||||
# Check that no constraint violation occurs
|
||||
docker logs redflag-server | grep -i "constraint\|timeout"
|
||||
|
||||
# Verify audit log was created
|
||||
docker exec -it redflag-postgres psql -U aggregator -d aggregator -c "
|
||||
SELECT * FROM update_logs WHERE command_id = 'test-timeout-123';
|
||||
"
|
||||
```
|
||||
|
||||
### 2. Database Constraint Test
|
||||
```bash
|
||||
# Test all valid result values
|
||||
docker exec -it redflag-postgres psql -U aggregator -d aggregator -c "
|
||||
INSERT INTO update_logs (command_id, agent_id, result, message)
|
||||
VALUES
|
||||
('test-success', 'agent-uuid', 'success', 'Test success'),
|
||||
('test-failed', 'agent-uuid', 'failed', 'Test failed'),
|
||||
('test-error', 'agent-uuid', 'error', 'Test error'),
|
||||
('test-pending', 'agent-uuid', 'pending', 'Test pending'),
|
||||
('test-timeout', 'agent-uuid', 'timed_out', 'Test timeout');
|
||||
"
|
||||
|
||||
# All should succeed without constraint violations
|
||||
```
|
||||
|
||||
### 3. Full Timeout Service Test
|
||||
```bash
|
||||
# Set up old commands that should timeout
|
||||
docker exec -it redflag-postgres psql -U aggregator -d aggregator -c "
|
||||
UPDATE agent_commands
|
||||
SET status = 'sent', sent_at = NOW() - INTERVAL '3 hours'
|
||||
WHERE created_at < NOW() - INTERVAL '1 hour';
|
||||
"
|
||||
|
||||
# Trigger timeout service
|
||||
curl -X POST http://localhost:8080/api/v1/admin/timeout-service/run \
|
||||
-H "Authorization: Bearer $ADMIN_TOKEN"
|
||||
|
||||
# Verify no constraint violations in logs
|
||||
# Verify audit logs are created for timed out commands
|
||||
```
|
||||
|
||||
### 4. Audit Trail Verification
|
||||
```bash
|
||||
# Check that timeout events appear in command history
|
||||
curl -H "Authorization: Bearer $TOKEN" \
|
||||
"http://localhost:8080/api/v1/commands/history?include_timeout=true"
|
||||
|
||||
# Should show timeout events with proper metadata
|
||||
```
|
||||
|
||||
## Files to Modify
|
||||
|
||||
- **Database Migration:** `aggregator-server/internal/database/migrations/XXX_add_timed_out_constraint.up.sql`
|
||||
- **Timeout Service:** `aggregator-server/internal/services/timeout.go`
|
||||
- **Database Schema:** Update `update_logs` table constraints
|
||||
- **API Handlers:** Ensure timeout events are returned in history queries
|
||||
|
||||
## Database Migration Example
|
||||
|
||||
```sql
|
||||
-- File: 020_add_timed_out_to_result_constraint.up.sql
|
||||
-- Add 'timed_out' as valid result value for update_logs
|
||||
|
||||
-- First, drop existing constraint
|
||||
ALTER TABLE update_logs DROP CONSTRAINT IF EXISTS update_logs_result_check;
|
||||
|
||||
-- Add updated constraint with 'timed_out' included
|
||||
ALTER TABLE update_logs ADD CONSTRAINT update_logs_result_check
|
||||
CHECK (result IN ('success', 'failed', 'error', 'pending', 'timed_out'));
|
||||
|
||||
-- Add comment explaining the change
|
||||
COMMENT ON CONSTRAINT update_logs_result_check ON update_logs IS
|
||||
'Valid result values for command execution, including timeout status';
|
||||
```
|
||||
|
||||
## Impact
|
||||
|
||||
- **Audit Compliance:** Enables complete audit trail for timeout events
|
||||
- **Troubleshooting:** Timeout events visible in command history and logs
|
||||
- **Compliance:** Meets regulatory requirements for complete audit trail
|
||||
- **Debugging:** Clear visibility into timeout patterns and system health
|
||||
- **Monitoring:** Enables metrics on timeout rates and patterns
|
||||
|
||||
## Security and Compliance Considerations
|
||||
|
||||
### Audit Trail Requirements
|
||||
- **Complete Records:** All command state changes must be logged
|
||||
- **Immutable History:** Timeout events should not be deletable
|
||||
- **Timestamp Accuracy:** Precise timing of timeout detection
|
||||
- **User Attribution:** Which system/service detected the timeout
|
||||
|
||||
### Data Privacy
|
||||
- **Command Details:** What command timed out (but not sensitive data)
|
||||
- **Agent Information:** Which agent had the timeout
|
||||
- **Timing Data:** How long the command was stuck
|
||||
- **System Metadata:** Service version, detection method
|
||||
|
||||
## Monitoring and Alerting
|
||||
|
||||
### Metrics to Track
|
||||
- Timeout rate by command type
|
||||
- Average timeout duration
|
||||
- Timeout service execution success rate
|
||||
- Audit log creation success rate
|
||||
- Database constraint violations (should be 0)
|
||||
|
||||
### Alert Examples
|
||||
```bash
|
||||
# Alert if timeout service fails
|
||||
if timeout_service_failures > 3 in 5m:
|
||||
alert("Timeout service experiencing failures")
|
||||
|
||||
# Alert if constraint violations occur
|
||||
if database_constraint_violations > 0:
|
||||
critical("Database constraint violation detected!")
|
||||
```
|
||||
|
||||
## Verification Commands
|
||||
|
||||
After fix implementation:
|
||||
|
||||
```bash
|
||||
# Test timeout service execution
|
||||
curl -X POST http://localhost:8080/api/v1/admin/timeout-service/run \
|
||||
-H "Authorization: Bearer $ADMIN_TOKEN"
|
||||
|
||||
# Check for constraint violations
|
||||
docker logs redflag-server | grep -i "constraint" # Should be empty
|
||||
|
||||
# Verify audit log creation
|
||||
docker exec -it redflag-postgres psql -U aggregator -d aggregator -c "
|
||||
SELECT COUNT(*) FROM update_logs
|
||||
WHERE result = 'timed_out'
|
||||
AND created_at > NOW() - INTERVAL '1 hour';
|
||||
"
|
||||
# Should be >0 after timeout service runs
|
||||
|
||||
# Verify no constraint errors
|
||||
docker logs redflag-server 2>&1 | grep -c "violates check constraint"
|
||||
# Should return 0
|
||||
```
|
||||
Reference in New Issue
Block a user