# B-1 Database Migration & Schema Integrity Audit **Date:** 2026-03-29 **Branch:** culurien **Scope:** Migration runner, schema integrity, query patterns, foreign keys --- ## 1. MIGRATION RUNNER ANALYSIS **File:** `aggregator-server/internal/database/db.go` **Approach:** Custom runner (not golang-migrate, not goose) **Tracking:** `schema_migrations` table with `version VARCHAR(255) PRIMARY KEY` ### Transaction Logic (lines 86-131) The runner wraps each migration SQL in a transaction (`db.Beginx()` at line 87). The INSERT into schema_migrations happens INSIDE the same transaction (line 121). On successful commit (line 127), both the SQL changes and the tracking record are committed atomically. **"Already exists" handling (lines 93-113):** When migration SQL fails with "already exists" or "duplicate key": 1. Transaction is rolled back (line 100) 2. Runner checks if this migration was already recorded in schema_migrations (line 104) 3. If recorded: skip silently (line 107) 4. If NOT recorded: **return fatal error** (line 110) This logic is correct for the case where a prior run successfully applied the migration. However, it creates a problem when combined with main.go's error swallowing. ### Server Startup Behavior (main.go:191-195) ```go if err := db.Migrate(migrationsPath); err != nil { fmt.Printf("Warning: Migration failed (tables may already exist): %v\n", err) } fmt.Println("[OK] Database migrations completed") ``` **The server logs a warning and continues.** The `[OK]` message prints even when migrations failed. This is the P0 blocker. --- ## 2. MIGRATION FILE AUDIT ### Numbering Conflicts | Issue | Files | |-------|-------| | Duplicate 009 | `009_add_agent_version_tracking.up.sql`, `009_add_retry_tracking.up.sql` | | Duplicate 012 | `012_add_token_seats.up.sql`, `012_create_admin_user.up.sql` | | Duplicate 018 | `018_create_metrics_and_docker_tables.up.sql`, `018_create_scanner_config_table.sql` | ### Missing DOWN Migrations (16 of 28) | Migration | Has DOWN? | |-----------|-----------| | 001 | YES | | 003 | NO | | 004 | NO | | 005 | NO | | 006 | NO | | 007 | NO | | 008 | NO | | 009 (both) | NO | | 010 | NO | | 011 | NO | | 012 (both) | NO | | 013 | NO | | 014 | NO | | 015 | YES | | 016 | YES | | 017 | YES | | 018 (metrics) | YES | | 019 | NO | | 020 | YES | | 021 | NO | | 022 | YES | | 023 | YES | | 023a | YES | | 024 | YES | | 025 | YES | | 026 | YES | ### Idempotency Issues (ETHOS #4 Violations) | Migration | Issue | |-----------|-------| | 001 | No IF NOT EXISTS on CREATE TABLE/INDEX | | 009 (both) | ALTER TABLE ADD COLUMN without IF NOT EXISTS | | 013 | ALTER TABLE ADD COLUMN without IF NOT EXISTS | | 014 | ALTER TABLE ADD COLUMN without IF NOT EXISTS — also ADD CONSTRAINT without IF NOT EXISTS | | 016 | ALTER TABLE ADD COLUMN without IF NOT EXISTS | | 019 | CREATE INDEX without IF NOT EXISTS | | 021 | CREATE TABLE/INDEX without IF NOT EXISTS | ### Critical Migration Bugs **F-B1-1 CRITICAL: Migration 024 self-inserts into schema_migrations** `024_disable_updates_subsystem.up.sql:18-19`: ```sql INSERT INTO schema_migrations (version) VALUES ('024_disable_updates_subsystem.up.sql'); ``` This INSERT runs inside the migration transaction. When the runner also inserts at db.go:121, a duplicate key violation occurs. The transaction rolls back, undoing the migration SQL. The server then catches the error at main.go:194 and continues with a warning. Migration 024 is never applied. **F-B1-2 CRITICAL: Migration 024 references non-existent column** `024_disable_updates_subsystem.up.sql:10`: ```sql SET ... deprecated = true ... ``` The `deprecated` column does not exist on `agent_subsystems`. Migration 015 creates the table without it, and no subsequent migration adds it. This migration will always fail with "column deprecated does not exist." **F-B1-3 HIGH: Migration 018 scanner_config never runs** `018_create_scanner_config_table.sql` has no `.up.sql` suffix. The migration runner only processes files ending in `.up.sql` (db.go:59). The scanner_config table is only created if this file happens to match the filter, which it doesn't. Any handler referencing scanner_config will get "relation does not exist" errors at runtime. **F-B1-4 HIGH: GRANT to non-existent role** `018_create_scanner_config_table.sql:34`: ```sql GRANT SELECT, INSERT, UPDATE, DELETE ON scanner_config TO redflag_user; ``` The default DB user is `redflag`, not `redflag_user`. This would fail even if the file were executed. --- ## 3. MISSING INDEXES AUDIT | Table | Query Pattern | Index Exists? | Finding | |-------|---------------|---------------|---------| | agents | WHERE machine_id = $1 | YES (unique, migration 017) | OK | | agents | WHERE status = $1 | YES (migration 001) | OK | | agents | WHERE last_seen < $1 | YES (migration 001) | OK | | agent_commands | WHERE agent_id = $1 AND status = 'pending' | YES (composite, migration 001) | OK | | agent_commands | WHERE status IN ('pending','sent') AND sent_at < $1 | PARTIAL — composite on (agent_id, status) exists but not on (status, sent_at) | F-B1-5 | | agent_commands | WHERE expires_at > NOW() | YES (partial, migration 026) | OK | | agent_commands | WHERE key_id = $1 | YES (migration 025) | OK | | refresh_tokens | WHERE token_hash = $1 | YES (unique + partial, migration 008) | OK | | registration_tokens | WHERE token = $1 | YES (unique, migration 011) | OK | | signing_keys | WHERE key_id = $1 | YES (unique, migration 020) | OK | | signing_keys | WHERE is_active AND is_primary | YES (composite, migration 020) | OK | **F-B1-5 MEDIUM: Missing composite index on agent_commands(status, sent_at)** `GetStuckCommands` queries `WHERE status IN ('pending', 'sent') AND (sent_at < $2 OR created_at < $2)`. The existing index is `(agent_id, status)` which helps when filtering by agent but not for the timeout service that scans across all agents. --- ## 4. N+1 QUERY AUDIT **F-B1-6 HIGH: GetDashboardStats N+1 loop** `aggregator-server/internal/api/handlers/stats.go:55-77`: ```go agents, err := h.agentQueries.ListAgents("", "") for _, agent := range agents { agentStats, err := h.updateQueries.GetUpdateStatsFromState(agent.ID) // aggregate... } ``` One query per agent on every dashboard load. With 100 agents = 101 queries. **Fix:** Replace with single `LEFT JOIN current_package_state ON agent_id` with aggregate functions. **F-B1-7 MEDIUM: BulkApproveUpdates loop UPDATE** `aggregator-server/internal/database/queries/updates.go:153-179`: ```go for _, id := range updateIDs { _, err := tx.Exec(query, id) } ``` One UPDATE per item. Should use `WHERE id = ANY($1)`. **F-B1-8 LOW: Correlated subqueries in ListAgentsWithLastScan** `aggregator-server/internal/database/queries/agents.go`: Uses `(SELECT MAX(created_at) FROM update_events WHERE agent_id = a.id)` as scalar subquery. Could be a LEFT JOIN. --- ## 5. FOREIGN KEY & CASCADE AUDIT | Parent | Child | ON DELETE | Risk | |--------|-------|-----------|------| | agents | agent_commands | CASCADE | Agent delete removes all commands | | agents | agent_specs | CASCADE | Agent delete removes specs | | agents | update_packages | CASCADE | Agent delete removes packages | | agents | update_logs | CASCADE | Agent delete removes logs | | agents | refresh_tokens | CASCADE | Agent delete revokes all tokens | | agents | agent_subsystems | CASCADE | Agent delete removes subsystems | | agents | storage_metrics | CASCADE | Agent delete removes metrics | | agents | metrics | CASCADE | Agent delete removes metrics | | agents | docker_images | CASCADE | Agent delete removes images | | agents | system_events | CASCADE | Agent delete removes events | | agents | registration_tokens | SET NULL (used_by_agent_id) | Safe | | agents | client_errors | SET NULL | Safe | | agent_commands | agent_commands (retried_from_id) | SET NULL | Safe — retry chain preserved | | users | security_settings | **No explicit policy** | F-B1-9 | | users | security_settings_audit | **No explicit policy** | F-B1-9 | | users | security_incidents | **No explicit policy** | F-B1-9 | | signing_keys | (none) | N/A | Deprecated keys remain in DB | **F-B1-9 LOW: No ON DELETE policy for user-referenced tables** `security_settings.updated_by`, `security_settings_audit.changed_by`, and `security_incidents.resolved_by` reference `users(id)` without an explicit ON DELETE policy. PostgreSQL defaults to RESTRICT, meaning a user cannot be deleted if they have associated security records. This is probably correct behavior (don't delete admins with audit trails) but should be explicitly documented. **Orphaned records:** Under normal operations, orphans are unlikely because all agent-related FKs use CASCADE. The only risk is if a transaction partially commits (which PostgreSQL prevents). --- ## 6. REFRESH TOKEN SLIDING WINDOW **Schema:** Migration 008 — `refresh_tokens` table with `expires_at`, `last_used_at`, `revoked` **Sliding window:** YES — `agents.go:1031-1036`: ```go newExpiry := time.Now().Add(90 * 24 * time.Hour) h.refreshTokenQueries.UpdateExpiration(refreshToken.ID, newExpiry) ``` On every renewal, expiry is reset to NOW+90 days. **Background cleanup:** NO background job. `CleanupExpiredTokens()` exists but is only callable via admin endpoint `POST /admin/registration-tokens/cleanup`. Expired refresh tokens accumulate until manually cleaned. **F-B1-10 MEDIUM: No automatic refresh token cleanup** The refresh_tokens table grows unbounded. Each agent renewal creates/extends tokens but old revoked/expired tokens are never automatically removed. **Maximum token age:** 90 days from last use (sliding window). **Indefinite renewal:** YES — an active agent can renew indefinitely. **Race condition on simultaneous renewal:** Two concurrent renewal requests for the same token could both succeed. `ValidateRefreshToken` is not inside a transaction with `UpdateExpiration`. Both requests would read the same token as valid, both would generate new JWTs, and both would update the expiration. The second update overwrites the first. No data corruption, but two valid JWTs are issued. This is acceptable for the use case. --- ## 7. SCHEMA MIGRATION STATE CORRUPTION (P0) ### Q1: Is the INSERT inside the transaction? **YES.** `tx.Exec("INSERT INTO schema_migrations...")` at db.go:121 is called on `tx` (the transaction begun at line 87). Both the migration SQL and the tracking INSERT are inside the same transaction. This is correct. ### Q2: What happens on migration SQL failure? 1. `tx.Exec(string(content))` at line 93 fails 2. If "already exists" error: `tx.Rollback()` at line 100, then check schema_migrations outside the tx (line 104) 3. If already recorded: skip (correct) 4. If NOT recorded: return fatal error (correct) 5. For any other error: `tx.Rollback()` at line 116, return fatal error The migration is **NOT** marked as applied when it fails. The transaction ensures atomicity. ### Q3: Is there a duplicate INSERT? **NO.** The prior analysis identified a potential duplicate INSERT, but the current code (after review) has only ONE INSERT into schema_migrations — at line 121, inside the transaction. Migration 024 has its own INSERT (line 18-19 of the SQL file), but that's the migration's bug, not the runner's. ### Q4: Server startup behavior on migration failure? **main.go:191-195:** ```go if err := db.Migrate(migrationsPath); err != nil { fmt.Printf("Warning: Migration failed (tables may already exist): %v\n", err) } fmt.Println("[OK] Database migrations completed") ``` **(b) Log a warning and continue.** The error is caught, printed as a warning, and the server starts anyway. The `[OK]` message prints regardless. This is the P0 — the server runs with an incomplete schema. ### Q5: What happens on next restart if migration was not applied? The runner checks `schema_migrations` — the migration is NOT recorded — it tries to run again. If the migration SQL creates objects that DO exist (from a partial prior run), it may hit "already exists" errors and enter the error handling at lines 94-113. If the objects don't exist in schema_migrations, the runner returns a fatal error (line 110), which main.go swallows. **F-B1-11 P0/CRITICAL: Server starts with incomplete schema** The combination of: 1. Migration failures being swallowed by main.go 2. `[OK]` message printing after failure 3. No retry or abort mechanism ...means the server can run indefinitely with missing tables, columns, or indexes, causing runtime errors (500s, nil pointer dereferences) that are difficult to diagnose. --- ## 8. A-SERIES MIGRATION REVIEW ### Migration 025 (A-1: key_id + signed_at) - **Idempotent:** YES — uses `ADD COLUMN IF NOT EXISTS` and `CREATE INDEX IF NOT EXISTS` - **DOWN migration:** YES — drops columns and index - **NULL handling:** Both columns are nullable, no constraint violations for existing data ### Migration 026 (A-2: expires_at) - **Idempotent:** YES — uses `ADD COLUMN IF NOT EXISTS` and `CREATE INDEX IF NOT EXISTS` - **DOWN migration:** YES — drops column and index - **Backfill:** `UPDATE ... SET expires_at = created_at + INTERVAL '24 hours' WHERE expires_at IS NULL AND status = 'pending'` — correct, only touches pending rows, uses 24h for conservative backfill - **NULL handling:** Column is nullable, IS NULL guard in queries handles un-backfilled rows Both A-series migrations are well-constructed. --- ## 9. ETHOS COMPLIANCE | Principle | Violation | Location | |-----------|-----------|----------| | ETHOS #1 | Migration failure logged with emoji, no [TAG] format | main.go:194, db.go:107,131 | | ETHOS #1 | `[OK]` printed after migration failure | main.go:196 | | ETHOS #3 | Server assumes migrations succeed and continues | main.go:191-195 | | ETHOS #4 | 7+ migrations lack IF NOT EXISTS on schema changes | See idempotency table above | | ETHOS #4 | Migration 024 self-inserts into schema_migrations | 024:18-19 | --- ## FINDINGS SUMMARY | ID | Severity | Finding | Location | |----|----------|---------|----------| | F-B1-1 | CRITICAL | Migration 024 self-inserts into schema_migrations, causing duplicate key on runner INSERT | 024_disable_updates_subsystem.up.sql:18-19 | | F-B1-2 | CRITICAL | Migration 024 references non-existent `deprecated` column on agent_subsystems | 024_disable_updates_subsystem.up.sql:10 | | F-B1-3 | HIGH | Migration 018 scanner_config file has no .up.sql suffix — never executed | 018_create_scanner_config_table.sql | | F-B1-4 | HIGH | GRANT to non-existent role `redflag_user` in scanner_config migration | 018_create_scanner_config_table.sql:34 | | F-B1-5 | MEDIUM | Missing composite index on agent_commands(status, sent_at) for timeout service | GetStuckCommands query pattern | | F-B1-6 | HIGH | N+1 in GetDashboardStats: one query per agent on every dashboard load | stats.go:55-77 | | F-B1-7 | MEDIUM | BulkApproveUpdates: loop UPDATE instead of batch WHERE id = ANY($1) | updates.go:153-179 | | F-B1-8 | LOW | Correlated subqueries in ListAgentsWithLastScan | agents.go ListAgentsWithLastScan | | F-B1-9 | LOW | No ON DELETE policy for user-referenced security tables (defaults to RESTRICT) | 020_add_command_signatures.up.sql | | F-B1-10 | MEDIUM | No automatic refresh token cleanup — table grows unbounded | refresh_tokens.go, main.go (no background job) | | F-B1-11 | P0/CRITICAL | Server starts with incomplete schema after migration failure, prints `[OK]` | main.go:191-196 | | F-B1-12 | MEDIUM | 16 migrations have no DOWN/rollback file | See table above | | F-B1-13 | MEDIUM | Duplicate migration numbers (009, 012) — fragile ordering | Migration filenames | | F-B1-14 | LOW | Inconsistent UUID generation (uuid_generate_v4 vs gen_random_uuid) | Various migrations | | F-B1-15 | LOW | 7+ migrations not idempotent (missing IF NOT EXISTS) | See idempotency table |