235 lines
8.0 KiB
Markdown
235 lines
8.0 KiB
Markdown
# Migration Issues Post-Mortem: What We Actually Fixed
|
|
|
|
**Date**: 2025-12-19
|
|
**Status**: MIGRATION BUGS IDENTIFIED AND FIXED
|
|
|
|
---
|
|
|
|
## Summary
|
|
|
|
During the v0.1.27 migration implementation, we discovered **critical migration bugs** that were never documented in the original issue files. This document explains what went wrong, what we fixed, and what was falsely marked as "completed".
|
|
|
|
---
|
|
|
|
## The Original Documentation Gap
|
|
|
|
### What Was in SOMEISSUES_v0.1.26.md
|
|
The "8 issues" file (Dec 19, 13336 bytes) documented:
|
|
- Issues #1-3: Critical user-facing bugs (scan data in wrong tables)
|
|
- Issues #4-5: Missing route registrations
|
|
- Issue #6: Migration 022 not applied
|
|
- Issues #7-8: Code quality (naming violations)
|
|
|
|
### What Was NOT Documented
|
|
**Migration system bugs discovered during investigation:**
|
|
1. Migration 017 completely redundant with 016 (both add machine_id column)
|
|
2. Migration 021 has manual INSERT into schema_migrations (line 27)
|
|
3. Migration runner has duplicate INSERT logic (db.go lines 103 and 116)
|
|
4. Error handling falsely marks failed migrations as "applied"
|
|
|
|
**These were never in any issues file.** I discovered them when investigating your "duplicate key value violates unique constraint" error.
|
|
|
|
---
|
|
|
|
## What Actually Happened: The Migration Failure Chain
|
|
|
|
### Timeline of Failure
|
|
|
|
1. **Migration 016 runs successfully**
|
|
- Adds machine_id column to agents table
|
|
- Creates agent_update_packages table
|
|
- ✅ Success
|
|
|
|
2. **Migration 017 attempts to run**
|
|
- Tries to ADD COLUMN machine_id (already exists from 016)
|
|
- PostgreSQL returns: "column already exists"
|
|
- Error handler catches "already exists" error
|
|
- Rolls back transaction BUT marks migration as "applied" (line 103)
|
|
- ⚠️ Partial failure - db is now inconsistent
|
|
|
|
3. **Migration 021 runs**
|
|
- CREATE TABLE storage_metrics succeeds
|
|
- Manual INSERT at line 27 attempts to insert version
|
|
- PostgreSQL returns: "duplicate key value violates unique constraint"
|
|
- ❌ Migration fails
|
|
|
|
4. **Migration 022 runs**
|
|
- ADD COLUMN subsystem succeeds
|
|
- Migration completes successfully
|
|
- ✅ Success
|
|
|
|
### Resulting Database State
|
|
```sql
|
|
-- schema_migrations shows:
|
|
016_agent_update_packages.up.sql ✓
|
|
017_add_machine_id.up.sql ✓ (but didn't actually do anything)
|
|
021_create_storage_metrics.up.sql ✗ (marked as applied but failed)
|
|
022_add_subsystem_to_logs.up.sql ✓
|
|
|
|
-- storage_metrics table exists but:
|
|
SELECT * FROM storage_metrics; -- Returns 0 rows
|
|
-- Because the table creation succeeded but the manual INSERT
|
|
-- caused the migration to fail before the runner could commit
|
|
```
|
|
|
|
---
|
|
|
|
## What We Fixed Today
|
|
|
|
### Fix #1: Migration 017 (Line 5-12)
|
|
**Before:**
|
|
```sql
|
|
-- Tried to add column that already exists
|
|
ALTER TABLE agents ADD COLUMN machine_id VARCHAR(64);
|
|
```
|
|
|
|
**After:**
|
|
```sql
|
|
-- Drop old index and create proper unique constraint
|
|
DROP INDEX IF EXISTS idx_agents_machine_id;
|
|
CREATE UNIQUE INDEX CONCURRENTLY idx_agents_machine_id_unique
|
|
ON agents(machine_id) WHERE machine_id IS NOT NULL;
|
|
```
|
|
|
|
### Fix #2: Migration 021 (Line 27)
|
|
**Before:**
|
|
```sql
|
|
-- Manual INSERT conflicting with migration runner
|
|
INSERT INTO schema_migrations (version) VALUES ('021_create_storage_metrics.up.sql');
|
|
```
|
|
|
|
**After:**
|
|
```sql
|
|
-- Removed the manual INSERT completely
|
|
```
|
|
|
|
### Fix #3: Migration Runner (db.go lines 93-131)
|
|
**Before:**
|
|
```go
|
|
// Flawed error handling
|
|
if err := tx.Exec(string(content)); err != nil {
|
|
if strings.Contains(err.Error(), "already exists") {
|
|
tx.Rollback()
|
|
newTx.Exec("INSERT INTO schema_migrations...") // Line 103 - marks as applied
|
|
}
|
|
}
|
|
tx.Exec("INSERT INTO schema_migrations...") // Line 116 - duplicate INSERT
|
|
```
|
|
|
|
**After:**
|
|
```go
|
|
// Proper error handling
|
|
if err := tx.Exec(string(content)); err != nil {
|
|
if strings.Contains(err.Error(), "already exists") {
|
|
tx.Rollback()
|
|
var count int
|
|
db.Get(&count, "SELECT COUNT(*) FROM schema_migrations WHERE version = $1", filename)
|
|
if count > 0 {
|
|
// Already applied, just skip
|
|
continue
|
|
} else {
|
|
// Real error, don't mark as applied
|
|
return fmt.Errorf("migration failed: %w", err)
|
|
}
|
|
}
|
|
}
|
|
// Single INSERT on success path only
|
|
tx.Exec("INSERT INTO schema_migrations...") // Line 121 only
|
|
```
|
|
|
|
---
|
|
|
|
## Current New Issue: agent_commands_pkey Violation
|
|
|
|
**Error**: `pq: duplicate key value violates unique constraint "agent_commands_pkey"`
|
|
|
|
**Trigger**: Pressing scan buttons rapidly (second and third clicks fail)
|
|
|
|
**Root Cause**: Frontend is reusing the same command ID when creating multiple commands
|
|
|
|
**Evidence Needed**: Check if frontend is generating/inclusing command IDs in POST requests to `/api/v1/agents/:id/subsystems/:subsystem/trigger`
|
|
|
|
**Why This Happens**:
|
|
1. First click: Creates command with ID "X" → succeeds
|
|
2. Second click: Tries to create command with same ID "X" → fails with pkey violation
|
|
3. The Command model has no default ID generation, so if ID is included in INSERT, PostgreSQL uses it instead of generating uuid_generate_v4()
|
|
|
|
**Fix Required**:
|
|
- Check frontend API calls - ensure no ID is being sent in request body
|
|
- Verify server is not reusing command IDs
|
|
- Ensure CreateCommand query properly handles nil/empty IDs
|
|
|
|
---
|
|
|
|
## What Was "Lied About" (False Completes)
|
|
|
|
### False Complete #1: Migration 021 Applied
|
|
**Claimed**: Migration 021 was marked as "applied" in schema_migrations
|
|
**Reality**: Table created but migration failed before commit due to manual INSERT conflict
|
|
**Impact**: storage_metrics table exists but has no initial data insert, causing confusion
|
|
|
|
### False Complete #2: Migration Errors Handled Properly
|
|
**Claimed**: "Migrations complete with warnings" - suggesting graceful handling
|
|
**Reality**: Error handler incorrectly marked failed migrations as applied, hiding real errors
|
|
**Impact**: Database got into inconsistent state (some migrations partially applied)
|
|
|
|
### False Complete #3: agent_commands Insert Error
|
|
**Claimed**: "First button works, second fails" - suggesting partial functionality
|
|
**Reality**: This is a NEW bug not related to migrations - frontend/server command ID generation issue
|
|
**Impact**: Users can't trigger multiple scans in succession
|
|
|
|
---
|
|
|
|
## Verification Questions
|
|
|
|
### 1. Are notification failures tracked to history?
|
|
**You asked**: "When I hit 'refresh' on Storage page, does it go to history?"
|
|
|
|
**Answer**: Based on the code review:
|
|
- Frontend shows toast notifications for API failures
|
|
- These toast failures are NOT logged to update_logs table
|
|
- The DEPLOYMENT_ISSUES.md file even identifies this as "Frontend UI Error Logging Gap" (issue #3)
|
|
- Violates ETHOS #1: "Errors are History, Not /dev/null"
|
|
|
|
**Evidence**: Line 79 of AgentUpdatesEnhanced.tsx
|
|
```typescript
|
|
toast.error('Failed to initiate storage scan') // Goes to UI only, not history
|
|
```
|
|
|
|
**Required**: New API endpoint needed to log frontend failures to history table
|
|
|
|
---
|
|
|
|
## Summary of Lies About Completed Progress
|
|
|
|
| Claimed Status | Reality | Impact |
|
|
|---------------|---------|--------|
|
|
| Migration 021 applied successfully | Migration failed, table exists but empty | storage_metrics empty queries |
|
|
| Agent_commands working properly | Can't run multiple scans | User frustration |
|
|
| Error handling robust | Failed migrations marked as applied | Database inconsistency |
|
|
| Frontend errors tracked | Only show in toast, not history | Can't diagnose failures |
|
|
|
|
---
|
|
|
|
## Required Actions
|
|
|
|
### Immediate (Now)
|
|
1. ✅ Migration issues fixed - test with fresh database
|
|
2. 🔄 Investigate agent_commands_pkey violation (frontend ID reuse?)
|
|
3. 🔄 Add API endpoint for frontend error logging
|
|
|
|
### Short-term (This Week)
|
|
4. Update SOMEISSUES_v0.1.26.md to include migration bugs #9-11
|
|
5. Create test for rapid button clicking (multiple commands)
|
|
6. Verify all scan types populate correct database tables
|
|
|
|
### Medium-term (Next Release)
|
|
7. Remove deprecated handlers once individual scans verified
|
|
8. Add integration tests for full scan flow
|
|
9. Document migration patterns to avoid future issues
|
|
|
|
---
|
|
|
|
**Document created**: 2025-12-19
|
|
**Status**: MIGRATION BUGS FIXED, NEW ISSUES IDENTIFIED
|