8.0 KiB
Migration Issues Post-Mortem: What We Actually Fixed
Date: 2025-12-19 Status: MIGRATION BUGS IDENTIFIED AND FIXED
Summary
During the v0.1.27 migration implementation, we discovered critical migration bugs that were never documented in the original issue files. This document explains what went wrong, what we fixed, and what was falsely marked as "completed".
The Original Documentation Gap
What Was in SOMEISSUES_v0.1.26.md
The "8 issues" file (Dec 19, 13336 bytes) documented:
- Issues #1-3: Critical user-facing bugs (scan data in wrong tables)
- Issues #4-5: Missing route registrations
- Issue #6: Migration 022 not applied
- Issues #7-8: Code quality (naming violations)
What Was NOT Documented
Migration system bugs discovered during investigation:
- Migration 017 completely redundant with 016 (both add machine_id column)
- Migration 021 has manual INSERT into schema_migrations (line 27)
- Migration runner has duplicate INSERT logic (db.go lines 103 and 116)
- Error handling falsely marks failed migrations as "applied"
These were never in any issues file. I discovered them when investigating your "duplicate key value violates unique constraint" error.
What Actually Happened: The Migration Failure Chain
Timeline of Failure
-
Migration 016 runs successfully
- Adds machine_id column to agents table
- Creates agent_update_packages table
- ✅ Success
-
Migration 017 attempts to run
- Tries to ADD COLUMN machine_id (already exists from 016)
- PostgreSQL returns: "column already exists"
- Error handler catches "already exists" error
- Rolls back transaction BUT marks migration as "applied" (line 103)
- ⚠️ Partial failure - db is now inconsistent
-
Migration 021 runs
- CREATE TABLE storage_metrics succeeds
- Manual INSERT at line 27 attempts to insert version
- PostgreSQL returns: "duplicate key value violates unique constraint"
- ❌ Migration fails
-
Migration 022 runs
- ADD COLUMN subsystem succeeds
- Migration completes successfully
- ✅ Success
Resulting Database State
-- schema_migrations shows:
016_agent_update_packages.up.sql ✓
017_add_machine_id.up.sql ✓ (but didn't actually do anything)
021_create_storage_metrics.up.sql ✗ (marked as applied but failed)
022_add_subsystem_to_logs.up.sql ✓
-- storage_metrics table exists but:
SELECT * FROM storage_metrics; -- Returns 0 rows
-- Because the table creation succeeded but the manual INSERT
-- caused the migration to fail before the runner could commit
What We Fixed Today
Fix #1: Migration 017 (Line 5-12)
Before:
-- Tried to add column that already exists
ALTER TABLE agents ADD COLUMN machine_id VARCHAR(64);
After:
-- Drop old index and create proper unique constraint
DROP INDEX IF EXISTS idx_agents_machine_id;
CREATE UNIQUE INDEX CONCURRENTLY idx_agents_machine_id_unique
ON agents(machine_id) WHERE machine_id IS NOT NULL;
Fix #2: Migration 021 (Line 27)
Before:
-- Manual INSERT conflicting with migration runner
INSERT INTO schema_migrations (version) VALUES ('021_create_storage_metrics.up.sql');
After:
-- Removed the manual INSERT completely
Fix #3: Migration Runner (db.go lines 93-131)
Before:
// Flawed error handling
if err := tx.Exec(string(content)); err != nil {
if strings.Contains(err.Error(), "already exists") {
tx.Rollback()
newTx.Exec("INSERT INTO schema_migrations...") // Line 103 - marks as applied
}
}
tx.Exec("INSERT INTO schema_migrations...") // Line 116 - duplicate INSERT
After:
// Proper error handling
if err := tx.Exec(string(content)); err != nil {
if strings.Contains(err.Error(), "already exists") {
tx.Rollback()
var count int
db.Get(&count, "SELECT COUNT(*) FROM schema_migrations WHERE version = $1", filename)
if count > 0 {
// Already applied, just skip
continue
} else {
// Real error, don't mark as applied
return fmt.Errorf("migration failed: %w", err)
}
}
}
// Single INSERT on success path only
tx.Exec("INSERT INTO schema_migrations...") // Line 121 only
Current New Issue: agent_commands_pkey Violation
Error: pq: duplicate key value violates unique constraint "agent_commands_pkey"
Trigger: Pressing scan buttons rapidly (second and third clicks fail)
Root Cause: Frontend is reusing the same command ID when creating multiple commands
Evidence Needed: Check if frontend is generating/inclusing command IDs in POST requests to /api/v1/agents/:id/subsystems/:subsystem/trigger
Why This Happens:
- First click: Creates command with ID "X" → succeeds
- Second click: Tries to create command with same ID "X" → fails with pkey violation
- The Command model has no default ID generation, so if ID is included in INSERT, PostgreSQL uses it instead of generating uuid_generate_v4()
Fix Required:
- Check frontend API calls - ensure no ID is being sent in request body
- Verify server is not reusing command IDs
- Ensure CreateCommand query properly handles nil/empty IDs
What Was "Lied About" (False Completes)
False Complete #1: Migration 021 Applied
Claimed: Migration 021 was marked as "applied" in schema_migrations Reality: Table created but migration failed before commit due to manual INSERT conflict Impact: storage_metrics table exists but has no initial data insert, causing confusion
False Complete #2: Migration Errors Handled Properly
Claimed: "Migrations complete with warnings" - suggesting graceful handling Reality: Error handler incorrectly marked failed migrations as applied, hiding real errors Impact: Database got into inconsistent state (some migrations partially applied)
False Complete #3: agent_commands Insert Error
Claimed: "First button works, second fails" - suggesting partial functionality Reality: This is a NEW bug not related to migrations - frontend/server command ID generation issue Impact: Users can't trigger multiple scans in succession
Verification Questions
1. Are notification failures tracked to history?
You asked: "When I hit 'refresh' on Storage page, does it go to history?"
Answer: Based on the code review:
- Frontend shows toast notifications for API failures
- These toast failures are NOT logged to update_logs table
- The DEPLOYMENT_ISSUES.md file even identifies this as "Frontend UI Error Logging Gap" (issue #3)
- Violates ETHOS #1: "Errors are History, Not /dev/null"
Evidence: Line 79 of AgentUpdatesEnhanced.tsx
toast.error('Failed to initiate storage scan') // Goes to UI only, not history
Required: New API endpoint needed to log frontend failures to history table
Summary of Lies About Completed Progress
| Claimed Status | Reality | Impact |
|---|---|---|
| Migration 021 applied successfully | Migration failed, table exists but empty | storage_metrics empty queries |
| Agent_commands working properly | Can't run multiple scans | User frustration |
| Error handling robust | Failed migrations marked as applied | Database inconsistency |
| Frontend errors tracked | Only show in toast, not history | Can't diagnose failures |
Required Actions
Immediate (Now)
- ✅ Migration issues fixed - test with fresh database
- 🔄 Investigate agent_commands_pkey violation (frontend ID reuse?)
- 🔄 Add API endpoint for frontend error logging
Short-term (This Week)
- Update SOMEISSUES_v0.1.26.md to include migration bugs #9-11
- Create test for rapid button clicking (multiple commands)
- Verify all scan types populate correct database tables
Medium-term (Next Release)
- Remove deprecated handlers once individual scans verified
- Add integration tests for full scan flow
- Document migration patterns to avoid future issues
Document created: 2025-12-19 Status: MIGRATION BUGS FIXED, NEW ISSUES IDENTIFIED