Files
Redflag/docs/historical/MIGRATION_ISSUES_POST_MORTEM.md

235 lines
8.0 KiB
Markdown

# Migration Issues Post-Mortem: What We Actually Fixed
**Date**: 2025-12-19
**Status**: MIGRATION BUGS IDENTIFIED AND FIXED
---
## Summary
During the v0.1.27 migration implementation, we discovered **critical migration bugs** that were never documented in the original issue files. This document explains what went wrong, what we fixed, and what was falsely marked as "completed".
---
## The Original Documentation Gap
### What Was in SOMEISSUES_v0.1.26.md
The "8 issues" file (Dec 19, 13336 bytes) documented:
- Issues #1-3: Critical user-facing bugs (scan data in wrong tables)
- Issues #4-5: Missing route registrations
- Issue #6: Migration 022 not applied
- Issues #7-8: Code quality (naming violations)
### What Was NOT Documented
**Migration system bugs discovered during investigation:**
1. Migration 017 completely redundant with 016 (both add machine_id column)
2. Migration 021 has manual INSERT into schema_migrations (line 27)
3. Migration runner has duplicate INSERT logic (db.go lines 103 and 116)
4. Error handling falsely marks failed migrations as "applied"
**These were never in any issues file.** I discovered them when investigating your "duplicate key value violates unique constraint" error.
---
## What Actually Happened: The Migration Failure Chain
### Timeline of Failure
1. **Migration 016 runs successfully**
- Adds machine_id column to agents table
- Creates agent_update_packages table
- ✅ Success
2. **Migration 017 attempts to run**
- Tries to ADD COLUMN machine_id (already exists from 016)
- PostgreSQL returns: "column already exists"
- Error handler catches "already exists" error
- Rolls back transaction BUT marks migration as "applied" (line 103)
- ⚠️ Partial failure - db is now inconsistent
3. **Migration 021 runs**
- CREATE TABLE storage_metrics succeeds
- Manual INSERT at line 27 attempts to insert version
- PostgreSQL returns: "duplicate key value violates unique constraint"
- ❌ Migration fails
4. **Migration 022 runs**
- ADD COLUMN subsystem succeeds
- Migration completes successfully
- ✅ Success
### Resulting Database State
```sql
-- schema_migrations shows:
016_agent_update_packages.up.sql
017_add_machine_id.up.sql (but didn't actually do anything)
021_create_storage_metrics.up.sql ✗ (marked as applied but failed)
022_add_subsystem_to_logs.up.sql ✓
-- storage_metrics table exists but:
SELECT * FROM storage_metrics; -- Returns 0 rows
-- Because the table creation succeeded but the manual INSERT
-- caused the migration to fail before the runner could commit
```
---
## What We Fixed Today
### Fix #1: Migration 017 (Line 5-12)
**Before:**
```sql
-- Tried to add column that already exists
ALTER TABLE agents ADD COLUMN machine_id VARCHAR(64);
```
**After:**
```sql
-- Drop old index and create proper unique constraint
DROP INDEX IF EXISTS idx_agents_machine_id;
CREATE UNIQUE INDEX CONCURRENTLY idx_agents_machine_id_unique
ON agents(machine_id) WHERE machine_id IS NOT NULL;
```
### Fix #2: Migration 021 (Line 27)
**Before:**
```sql
-- Manual INSERT conflicting with migration runner
INSERT INTO schema_migrations (version) VALUES ('021_create_storage_metrics.up.sql');
```
**After:**
```sql
-- Removed the manual INSERT completely
```
### Fix #3: Migration Runner (db.go lines 93-131)
**Before:**
```go
// Flawed error handling
if err := tx.Exec(string(content)); err != nil {
if strings.Contains(err.Error(), "already exists") {
tx.Rollback()
newTx.Exec("INSERT INTO schema_migrations...") // Line 103 - marks as applied
}
}
tx.Exec("INSERT INTO schema_migrations...") // Line 116 - duplicate INSERT
```
**After:**
```go
// Proper error handling
if err := tx.Exec(string(content)); err != nil {
if strings.Contains(err.Error(), "already exists") {
tx.Rollback()
var count int
db.Get(&count, "SELECT COUNT(*) FROM schema_migrations WHERE version = $1", filename)
if count > 0 {
// Already applied, just skip
continue
} else {
// Real error, don't mark as applied
return fmt.Errorf("migration failed: %w", err)
}
}
}
// Single INSERT on success path only
tx.Exec("INSERT INTO schema_migrations...") // Line 121 only
```
---
## Current New Issue: agent_commands_pkey Violation
**Error**: `pq: duplicate key value violates unique constraint "agent_commands_pkey"`
**Trigger**: Pressing scan buttons rapidly (second and third clicks fail)
**Root Cause**: Frontend is reusing the same command ID when creating multiple commands
**Evidence Needed**: Check if frontend is generating/inclusing command IDs in POST requests to `/api/v1/agents/:id/subsystems/:subsystem/trigger`
**Why This Happens**:
1. First click: Creates command with ID "X" → succeeds
2. Second click: Tries to create command with same ID "X" → fails with pkey violation
3. The Command model has no default ID generation, so if ID is included in INSERT, PostgreSQL uses it instead of generating uuid_generate_v4()
**Fix Required**:
- Check frontend API calls - ensure no ID is being sent in request body
- Verify server is not reusing command IDs
- Ensure CreateCommand query properly handles nil/empty IDs
---
## What Was "Lied About" (False Completes)
### False Complete #1: Migration 021 Applied
**Claimed**: Migration 021 was marked as "applied" in schema_migrations
**Reality**: Table created but migration failed before commit due to manual INSERT conflict
**Impact**: storage_metrics table exists but has no initial data insert, causing confusion
### False Complete #2: Migration Errors Handled Properly
**Claimed**: "Migrations complete with warnings" - suggesting graceful handling
**Reality**: Error handler incorrectly marked failed migrations as applied, hiding real errors
**Impact**: Database got into inconsistent state (some migrations partially applied)
### False Complete #3: agent_commands Insert Error
**Claimed**: "First button works, second fails" - suggesting partial functionality
**Reality**: This is a NEW bug not related to migrations - frontend/server command ID generation issue
**Impact**: Users can't trigger multiple scans in succession
---
## Verification Questions
### 1. Are notification failures tracked to history?
**You asked**: "When I hit 'refresh' on Storage page, does it go to history?"
**Answer**: Based on the code review:
- Frontend shows toast notifications for API failures
- These toast failures are NOT logged to update_logs table
- The DEPLOYMENT_ISSUES.md file even identifies this as "Frontend UI Error Logging Gap" (issue #3)
- Violates ETHOS #1: "Errors are History, Not /dev/null"
**Evidence**: Line 79 of AgentUpdatesEnhanced.tsx
```typescript
toast.error('Failed to initiate storage scan') // Goes to UI only, not history
```
**Required**: New API endpoint needed to log frontend failures to history table
---
## Summary of Lies About Completed Progress
| Claimed Status | Reality | Impact |
|---------------|---------|--------|
| Migration 021 applied successfully | Migration failed, table exists but empty | storage_metrics empty queries |
| Agent_commands working properly | Can't run multiple scans | User frustration |
| Error handling robust | Failed migrations marked as applied | Database inconsistency |
| Frontend errors tracked | Only show in toast, not history | Can't diagnose failures |
---
## Required Actions
### Immediate (Now)
1. ✅ Migration issues fixed - test with fresh database
2. 🔄 Investigate agent_commands_pkey violation (frontend ID reuse?)
3. 🔄 Add API endpoint for frontend error logging
### Short-term (This Week)
4. Update SOMEISSUES_v0.1.26.md to include migration bugs #9-11
5. Create test for rapid button clicking (multiple commands)
6. Verify all scan types populate correct database tables
### Medium-term (Next Release)
7. Remove deprecated handlers once individual scans verified
8. Add integration tests for full scan flow
9. Document migration patterns to avoid future issues
---
**Document created**: 2025-12-19
**Status**: MIGRATION BUGS FIXED, NEW ISSUES IDENTIFIED