Files
Redflag/docs/historical/MIGRATION_ISSUES_POST_MORTEM.md

8.0 KiB

Migration Issues Post-Mortem: What We Actually Fixed

Date: 2025-12-19 Status: MIGRATION BUGS IDENTIFIED AND FIXED


Summary

During the v0.1.27 migration implementation, we discovered critical migration bugs that were never documented in the original issue files. This document explains what went wrong, what we fixed, and what was falsely marked as "completed".


The Original Documentation Gap

What Was in SOMEISSUES_v0.1.26.md

The "8 issues" file (Dec 19, 13336 bytes) documented:

  • Issues #1-3: Critical user-facing bugs (scan data in wrong tables)
  • Issues #4-5: Missing route registrations
  • Issue #6: Migration 022 not applied
  • Issues #7-8: Code quality (naming violations)

What Was NOT Documented

Migration system bugs discovered during investigation:

  1. Migration 017 completely redundant with 016 (both add machine_id column)
  2. Migration 021 has manual INSERT into schema_migrations (line 27)
  3. Migration runner has duplicate INSERT logic (db.go lines 103 and 116)
  4. Error handling falsely marks failed migrations as "applied"

These were never in any issues file. I discovered them when investigating your "duplicate key value violates unique constraint" error.


What Actually Happened: The Migration Failure Chain

Timeline of Failure

  1. Migration 016 runs successfully

    • Adds machine_id column to agents table
    • Creates agent_update_packages table
    • Success
  2. Migration 017 attempts to run

    • Tries to ADD COLUMN machine_id (already exists from 016)
    • PostgreSQL returns: "column already exists"
    • Error handler catches "already exists" error
    • Rolls back transaction BUT marks migration as "applied" (line 103)
    • ⚠️ Partial failure - db is now inconsistent
  3. Migration 021 runs

    • CREATE TABLE storage_metrics succeeds
    • Manual INSERT at line 27 attempts to insert version
    • PostgreSQL returns: "duplicate key value violates unique constraint"
    • Migration fails
  4. Migration 022 runs

    • ADD COLUMN subsystem succeeds
    • Migration completes successfully
    • Success

Resulting Database State

-- schema_migrations shows:
016_agent_update_packages.up.sql 
017_add_machine_id.up.sql  (but didn't actually do anything)
021_create_storage_metrics.up.sql ✗ (marked as applied but failed)
022_add_subsystem_to_logs.up.sql ✓

-- storage_metrics table exists but:
SELECT * FROM storage_metrics; -- Returns 0 rows
-- Because the table creation succeeded but the manual INSERT
-- caused the migration to fail before the runner could commit

What We Fixed Today

Fix #1: Migration 017 (Line 5-12)

Before:

-- Tried to add column that already exists
ALTER TABLE agents ADD COLUMN machine_id VARCHAR(64);

After:

-- Drop old index and create proper unique constraint
DROP INDEX IF EXISTS idx_agents_machine_id;
CREATE UNIQUE INDEX CONCURRENTLY idx_agents_machine_id_unique 
ON agents(machine_id) WHERE machine_id IS NOT NULL;

Fix #2: Migration 021 (Line 27)

Before:

-- Manual INSERT conflicting with migration runner
INSERT INTO schema_migrations (version) VALUES ('021_create_storage_metrics.up.sql');

After:

-- Removed the manual INSERT completely

Fix #3: Migration Runner (db.go lines 93-131)

Before:

// Flawed error handling
if err := tx.Exec(string(content)); err != nil {
    if strings.Contains(err.Error(), "already exists") {
        tx.Rollback()
        newTx.Exec("INSERT INTO schema_migrations...") // Line 103 - marks as applied
    }
}
tx.Exec("INSERT INTO schema_migrations...") // Line 116 - duplicate INSERT

After:

// Proper error handling
if err := tx.Exec(string(content)); err != nil {
    if strings.Contains(err.Error(), "already exists") {
        tx.Rollback()
        var count int
        db.Get(&count, "SELECT COUNT(*) FROM schema_migrations WHERE version = $1", filename)
        if count > 0 {
            // Already applied, just skip
            continue
        } else {
            // Real error, don't mark as applied
            return fmt.Errorf("migration failed: %w", err)
        }
    }
}
// Single INSERT on success path only
tx.Exec("INSERT INTO schema_migrations...") // Line 121 only

Current New Issue: agent_commands_pkey Violation

Error: pq: duplicate key value violates unique constraint "agent_commands_pkey"

Trigger: Pressing scan buttons rapidly (second and third clicks fail)

Root Cause: Frontend is reusing the same command ID when creating multiple commands

Evidence Needed: Check if frontend is generating/inclusing command IDs in POST requests to /api/v1/agents/:id/subsystems/:subsystem/trigger

Why This Happens:

  1. First click: Creates command with ID "X" → succeeds
  2. Second click: Tries to create command with same ID "X" → fails with pkey violation
  3. The Command model has no default ID generation, so if ID is included in INSERT, PostgreSQL uses it instead of generating uuid_generate_v4()

Fix Required:

  • Check frontend API calls - ensure no ID is being sent in request body
  • Verify server is not reusing command IDs
  • Ensure CreateCommand query properly handles nil/empty IDs

What Was "Lied About" (False Completes)

False Complete #1: Migration 021 Applied

Claimed: Migration 021 was marked as "applied" in schema_migrations Reality: Table created but migration failed before commit due to manual INSERT conflict Impact: storage_metrics table exists but has no initial data insert, causing confusion

False Complete #2: Migration Errors Handled Properly

Claimed: "Migrations complete with warnings" - suggesting graceful handling Reality: Error handler incorrectly marked failed migrations as applied, hiding real errors Impact: Database got into inconsistent state (some migrations partially applied)

False Complete #3: agent_commands Insert Error

Claimed: "First button works, second fails" - suggesting partial functionality Reality: This is a NEW bug not related to migrations - frontend/server command ID generation issue Impact: Users can't trigger multiple scans in succession


Verification Questions

1. Are notification failures tracked to history?

You asked: "When I hit 'refresh' on Storage page, does it go to history?"

Answer: Based on the code review:

  • Frontend shows toast notifications for API failures
  • These toast failures are NOT logged to update_logs table
  • The DEPLOYMENT_ISSUES.md file even identifies this as "Frontend UI Error Logging Gap" (issue #3)
  • Violates ETHOS #1: "Errors are History, Not /dev/null"

Evidence: Line 79 of AgentUpdatesEnhanced.tsx

toast.error('Failed to initiate storage scan') // Goes to UI only, not history

Required: New API endpoint needed to log frontend failures to history table


Summary of Lies About Completed Progress

Claimed Status Reality Impact
Migration 021 applied successfully Migration failed, table exists but empty storage_metrics empty queries
Agent_commands working properly Can't run multiple scans User frustration
Error handling robust Failed migrations marked as applied Database inconsistency
Frontend errors tracked Only show in toast, not history Can't diagnose failures

Required Actions

Immediate (Now)

  1. Migration issues fixed - test with fresh database
  2. 🔄 Investigate agent_commands_pkey violation (frontend ID reuse?)
  3. 🔄 Add API endpoint for frontend error logging

Short-term (This Week)

  1. Update SOMEISSUES_v0.1.26.md to include migration bugs #9-11
  2. Create test for rapid button clicking (multiple commands)
  3. Verify all scan types populate correct database tables

Medium-term (Next Release)

  1. Remove deprecated handlers once individual scans verified
  2. Add integration tests for full scan flow
  3. Document migration patterns to avoid future issues

Document created: 2025-12-19 Status: MIGRATION BUGS FIXED, NEW ISSUES IDENTIFIED