# Migration Issues Post-Mortem: What We Actually Fixed **Date**: 2025-12-19 **Status**: MIGRATION BUGS IDENTIFIED AND FIXED --- ## Summary During the v0.1.27 migration implementation, we discovered **critical migration bugs** that were never documented in the original issue files. This document explains what went wrong, what we fixed, and what was falsely marked as "completed". --- ## The Original Documentation Gap ### What Was in SOMEISSUES_v0.1.26.md The "8 issues" file (Dec 19, 13336 bytes) documented: - Issues #1-3: Critical user-facing bugs (scan data in wrong tables) - Issues #4-5: Missing route registrations - Issue #6: Migration 022 not applied - Issues #7-8: Code quality (naming violations) ### What Was NOT Documented **Migration system bugs discovered during investigation:** 1. Migration 017 completely redundant with 016 (both add machine_id column) 2. Migration 021 has manual INSERT into schema_migrations (line 27) 3. Migration runner has duplicate INSERT logic (db.go lines 103 and 116) 4. Error handling falsely marks failed migrations as "applied" **These were never in any issues file.** I discovered them when investigating your "duplicate key value violates unique constraint" error. --- ## What Actually Happened: The Migration Failure Chain ### Timeline of Failure 1. **Migration 016 runs successfully** - Adds machine_id column to agents table - Creates agent_update_packages table - ✅ Success 2. **Migration 017 attempts to run** - Tries to ADD COLUMN machine_id (already exists from 016) - PostgreSQL returns: "column already exists" - Error handler catches "already exists" error - Rolls back transaction BUT marks migration as "applied" (line 103) - ⚠️ Partial failure - db is now inconsistent 3. **Migration 021 runs** - CREATE TABLE storage_metrics succeeds - Manual INSERT at line 27 attempts to insert version - PostgreSQL returns: "duplicate key value violates unique constraint" - ❌ Migration fails 4. **Migration 022 runs** - ADD COLUMN subsystem succeeds - Migration completes successfully - ✅ Success ### Resulting Database State ```sql -- schema_migrations shows: 016_agent_update_packages.up.sql ✓ 017_add_machine_id.up.sql ✓ (but didn't actually do anything) 021_create_storage_metrics.up.sql ✗ (marked as applied but failed) 022_add_subsystem_to_logs.up.sql ✓ -- storage_metrics table exists but: SELECT * FROM storage_metrics; -- Returns 0 rows -- Because the table creation succeeded but the manual INSERT -- caused the migration to fail before the runner could commit ``` --- ## What We Fixed Today ### Fix #1: Migration 017 (Line 5-12) **Before:** ```sql -- Tried to add column that already exists ALTER TABLE agents ADD COLUMN machine_id VARCHAR(64); ``` **After:** ```sql -- Drop old index and create proper unique constraint DROP INDEX IF EXISTS idx_agents_machine_id; CREATE UNIQUE INDEX CONCURRENTLY idx_agents_machine_id_unique ON agents(machine_id) WHERE machine_id IS NOT NULL; ``` ### Fix #2: Migration 021 (Line 27) **Before:** ```sql -- Manual INSERT conflicting with migration runner INSERT INTO schema_migrations (version) VALUES ('021_create_storage_metrics.up.sql'); ``` **After:** ```sql -- Removed the manual INSERT completely ``` ### Fix #3: Migration Runner (db.go lines 93-131) **Before:** ```go // Flawed error handling if err := tx.Exec(string(content)); err != nil { if strings.Contains(err.Error(), "already exists") { tx.Rollback() newTx.Exec("INSERT INTO schema_migrations...") // Line 103 - marks as applied } } tx.Exec("INSERT INTO schema_migrations...") // Line 116 - duplicate INSERT ``` **After:** ```go // Proper error handling if err := tx.Exec(string(content)); err != nil { if strings.Contains(err.Error(), "already exists") { tx.Rollback() var count int db.Get(&count, "SELECT COUNT(*) FROM schema_migrations WHERE version = $1", filename) if count > 0 { // Already applied, just skip continue } else { // Real error, don't mark as applied return fmt.Errorf("migration failed: %w", err) } } } // Single INSERT on success path only tx.Exec("INSERT INTO schema_migrations...") // Line 121 only ``` --- ## Current New Issue: agent_commands_pkey Violation **Error**: `pq: duplicate key value violates unique constraint "agent_commands_pkey"` **Trigger**: Pressing scan buttons rapidly (second and third clicks fail) **Root Cause**: Frontend is reusing the same command ID when creating multiple commands **Evidence Needed**: Check if frontend is generating/inclusing command IDs in POST requests to `/api/v1/agents/:id/subsystems/:subsystem/trigger` **Why This Happens**: 1. First click: Creates command with ID "X" → succeeds 2. Second click: Tries to create command with same ID "X" → fails with pkey violation 3. The Command model has no default ID generation, so if ID is included in INSERT, PostgreSQL uses it instead of generating uuid_generate_v4() **Fix Required**: - Check frontend API calls - ensure no ID is being sent in request body - Verify server is not reusing command IDs - Ensure CreateCommand query properly handles nil/empty IDs --- ## What Was "Lied About" (False Completes) ### False Complete #1: Migration 021 Applied **Claimed**: Migration 021 was marked as "applied" in schema_migrations **Reality**: Table created but migration failed before commit due to manual INSERT conflict **Impact**: storage_metrics table exists but has no initial data insert, causing confusion ### False Complete #2: Migration Errors Handled Properly **Claimed**: "Migrations complete with warnings" - suggesting graceful handling **Reality**: Error handler incorrectly marked failed migrations as applied, hiding real errors **Impact**: Database got into inconsistent state (some migrations partially applied) ### False Complete #3: agent_commands Insert Error **Claimed**: "First button works, second fails" - suggesting partial functionality **Reality**: This is a NEW bug not related to migrations - frontend/server command ID generation issue **Impact**: Users can't trigger multiple scans in succession --- ## Verification Questions ### 1. Are notification failures tracked to history? **You asked**: "When I hit 'refresh' on Storage page, does it go to history?" **Answer**: Based on the code review: - Frontend shows toast notifications for API failures - These toast failures are NOT logged to update_logs table - The DEPLOYMENT_ISSUES.md file even identifies this as "Frontend UI Error Logging Gap" (issue #3) - Violates ETHOS #1: "Errors are History, Not /dev/null" **Evidence**: Line 79 of AgentUpdatesEnhanced.tsx ```typescript toast.error('Failed to initiate storage scan') // Goes to UI only, not history ``` **Required**: New API endpoint needed to log frontend failures to history table --- ## Summary of Lies About Completed Progress | Claimed Status | Reality | Impact | |---------------|---------|--------| | Migration 021 applied successfully | Migration failed, table exists but empty | storage_metrics empty queries | | Agent_commands working properly | Can't run multiple scans | User frustration | | Error handling robust | Failed migrations marked as applied | Database inconsistency | | Frontend errors tracked | Only show in toast, not history | Can't diagnose failures | --- ## Required Actions ### Immediate (Now) 1. ✅ Migration issues fixed - test with fresh database 2. 🔄 Investigate agent_commands_pkey violation (frontend ID reuse?) 3. 🔄 Add API endpoint for frontend error logging ### Short-term (This Week) 4. Update SOMEISSUES_v0.1.26.md to include migration bugs #9-11 5. Create test for rapid button clicking (multiple commands) 6. Verify all scan types populate correct database tables ### Medium-term (Next Release) 7. Remove deprecated handlers once individual scans verified 8. Add integration tests for full scan flow 9. Document migration patterns to avoid future issues --- **Document created**: 2025-12-19 **Status**: MIGRATION BUGS FIXED, NEW ISSUES IDENTIFIED