Files
Redflag/docs/B1_Database_Audit.md
jpetree331 3de7577802 docs: B-1 database migration and schema integrity audit
Comprehensive audit of the custom migration runner, all 26 migrations,
query patterns, foreign keys, and schema state management.

Critical findings:
- F-B1-11 P0: Server starts with incomplete schema after migration
  failure, prints [OK] — main.go swallows migration errors
- F-B1-1: Migration 024 self-inserts into schema_migrations
- F-B1-2: Migration 024 references non-existent deprecated column
- F-B1-3: Migration 018 scanner_config has wrong file extension
- F-B1-6: N+1 query in GetDashboardStats (1 query per agent)

15 findings total across P0/CRITICAL/HIGH/MEDIUM/LOW.
See docs/B1_Database_Audit.md for full analysis.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-03-29 06:28:59 -04:00

16 KiB

B-1 Database Migration & Schema Integrity Audit

Date: 2026-03-29 Branch: culurien Scope: Migration runner, schema integrity, query patterns, foreign keys


1. MIGRATION RUNNER ANALYSIS

File: aggregator-server/internal/database/db.go Approach: Custom runner (not golang-migrate, not goose) Tracking: schema_migrations table with version VARCHAR(255) PRIMARY KEY

Transaction Logic (lines 86-131)

The runner wraps each migration SQL in a transaction (db.Beginx() at line 87). The INSERT into schema_migrations happens INSIDE the same transaction (line 121). On successful commit (line 127), both the SQL changes and the tracking record are committed atomically.

"Already exists" handling (lines 93-113): When migration SQL fails with "already exists" or "duplicate key":

  1. Transaction is rolled back (line 100)
  2. Runner checks if this migration was already recorded in schema_migrations (line 104)
  3. If recorded: skip silently (line 107)
  4. If NOT recorded: return fatal error (line 110)

This logic is correct for the case where a prior run successfully applied the migration. However, it creates a problem when combined with main.go's error swallowing.

Server Startup Behavior (main.go:191-195)

if err := db.Migrate(migrationsPath); err != nil {
    fmt.Printf("Warning: Migration failed (tables may already exist): %v\n", err)
}
fmt.Println("[OK] Database migrations completed")

The server logs a warning and continues. The [OK] message prints even when migrations failed. This is the P0 blocker.


2. MIGRATION FILE AUDIT

Numbering Conflicts

Issue Files
Duplicate 009 009_add_agent_version_tracking.up.sql, 009_add_retry_tracking.up.sql
Duplicate 012 012_add_token_seats.up.sql, 012_create_admin_user.up.sql
Duplicate 018 018_create_metrics_and_docker_tables.up.sql, 018_create_scanner_config_table.sql

Missing DOWN Migrations (16 of 28)

Migration Has DOWN?
001 YES
003 NO
004 NO
005 NO
006 NO
007 NO
008 NO
009 (both) NO
010 NO
011 NO
012 (both) NO
013 NO
014 NO
015 YES
016 YES
017 YES
018 (metrics) YES
019 NO
020 YES
021 NO
022 YES
023 YES
023a YES
024 YES
025 YES
026 YES

Idempotency Issues (ETHOS #4 Violations)

Migration Issue
001 No IF NOT EXISTS on CREATE TABLE/INDEX
009 (both) ALTER TABLE ADD COLUMN without IF NOT EXISTS
013 ALTER TABLE ADD COLUMN without IF NOT EXISTS
014 ALTER TABLE ADD COLUMN without IF NOT EXISTS — also ADD CONSTRAINT without IF NOT EXISTS
016 ALTER TABLE ADD COLUMN without IF NOT EXISTS
019 CREATE INDEX without IF NOT EXISTS
021 CREATE TABLE/INDEX without IF NOT EXISTS

Critical Migration Bugs

F-B1-1 CRITICAL: Migration 024 self-inserts into schema_migrations

024_disable_updates_subsystem.up.sql:18-19:

INSERT INTO schema_migrations (version) VALUES
('024_disable_updates_subsystem.up.sql');

This INSERT runs inside the migration transaction. When the runner also inserts at db.go:121, a duplicate key violation occurs. The transaction rolls back, undoing the migration SQL. The server then catches the error at main.go:194 and continues with a warning. Migration 024 is never applied.

F-B1-2 CRITICAL: Migration 024 references non-existent column

024_disable_updates_subsystem.up.sql:10:

SET ... deprecated = true ...

The deprecated column does not exist on agent_subsystems. Migration 015 creates the table without it, and no subsequent migration adds it. This migration will always fail with "column deprecated does not exist."

F-B1-3 HIGH: Migration 018 scanner_config never runs

018_create_scanner_config_table.sql has no .up.sql suffix. The migration runner only processes files ending in .up.sql (db.go:59). The scanner_config table is only created if this file happens to match the filter, which it doesn't. Any handler referencing scanner_config will get "relation does not exist" errors at runtime.

F-B1-4 HIGH: GRANT to non-existent role

018_create_scanner_config_table.sql:34:

GRANT SELECT, INSERT, UPDATE, DELETE ON scanner_config TO redflag_user;

The default DB user is redflag, not redflag_user. This would fail even if the file were executed.


3. MISSING INDEXES AUDIT

Table Query Pattern Index Exists? Finding
agents WHERE machine_id = $1 YES (unique, migration 017) OK
agents WHERE status = $1 YES (migration 001) OK
agents WHERE last_seen < $1 YES (migration 001) OK
agent_commands WHERE agent_id = $1 AND status = 'pending' YES (composite, migration 001) OK
agent_commands WHERE status IN ('pending','sent') AND sent_at < $1 PARTIAL — composite on (agent_id, status) exists but not on (status, sent_at) F-B1-5
agent_commands WHERE expires_at > NOW() YES (partial, migration 026) OK
agent_commands WHERE key_id = $1 YES (migration 025) OK
refresh_tokens WHERE token_hash = $1 YES (unique + partial, migration 008) OK
registration_tokens WHERE token = $1 YES (unique, migration 011) OK
signing_keys WHERE key_id = $1 YES (unique, migration 020) OK
signing_keys WHERE is_active AND is_primary YES (composite, migration 020) OK

F-B1-5 MEDIUM: Missing composite index on agent_commands(status, sent_at)

GetStuckCommands queries WHERE status IN ('pending', 'sent') AND (sent_at < $2 OR created_at < $2). The existing index is (agent_id, status) which helps when filtering by agent but not for the timeout service that scans across all agents.


4. N+1 QUERY AUDIT

F-B1-6 HIGH: GetDashboardStats N+1 loop

aggregator-server/internal/api/handlers/stats.go:55-77:

agents, err := h.agentQueries.ListAgents("", "")
for _, agent := range agents {
    agentStats, err := h.updateQueries.GetUpdateStatsFromState(agent.ID)
    // aggregate...
}

One query per agent on every dashboard load. With 100 agents = 101 queries.

Fix: Replace with single LEFT JOIN current_package_state ON agent_id with aggregate functions.

F-B1-7 MEDIUM: BulkApproveUpdates loop UPDATE

aggregator-server/internal/database/queries/updates.go:153-179:

for _, id := range updateIDs {
    _, err := tx.Exec(query, id)
}

One UPDATE per item. Should use WHERE id = ANY($1).

F-B1-8 LOW: Correlated subqueries in ListAgentsWithLastScan

aggregator-server/internal/database/queries/agents.go: Uses (SELECT MAX(created_at) FROM update_events WHERE agent_id = a.id) as scalar subquery. Could be a LEFT JOIN.


5. FOREIGN KEY & CASCADE AUDIT

Parent Child ON DELETE Risk
agents agent_commands CASCADE Agent delete removes all commands
agents agent_specs CASCADE Agent delete removes specs
agents update_packages CASCADE Agent delete removes packages
agents update_logs CASCADE Agent delete removes logs
agents refresh_tokens CASCADE Agent delete revokes all tokens
agents agent_subsystems CASCADE Agent delete removes subsystems
agents storage_metrics CASCADE Agent delete removes metrics
agents metrics CASCADE Agent delete removes metrics
agents docker_images CASCADE Agent delete removes images
agents system_events CASCADE Agent delete removes events
agents registration_tokens SET NULL (used_by_agent_id) Safe
agents client_errors SET NULL Safe
agent_commands agent_commands (retried_from_id) SET NULL Safe — retry chain preserved
users security_settings No explicit policy F-B1-9
users security_settings_audit No explicit policy F-B1-9
users security_incidents No explicit policy F-B1-9
signing_keys (none) N/A Deprecated keys remain in DB

F-B1-9 LOW: No ON DELETE policy for user-referenced tables

security_settings.updated_by, security_settings_audit.changed_by, and security_incidents.resolved_by reference users(id) without an explicit ON DELETE policy. PostgreSQL defaults to RESTRICT, meaning a user cannot be deleted if they have associated security records. This is probably correct behavior (don't delete admins with audit trails) but should be explicitly documented.

Orphaned records: Under normal operations, orphans are unlikely because all agent-related FKs use CASCADE. The only risk is if a transaction partially commits (which PostgreSQL prevents).


6. REFRESH TOKEN SLIDING WINDOW

Schema: Migration 008 — refresh_tokens table with expires_at, last_used_at, revoked

Sliding window: YES — agents.go:1031-1036:

newExpiry := time.Now().Add(90 * 24 * time.Hour)
h.refreshTokenQueries.UpdateExpiration(refreshToken.ID, newExpiry)

On every renewal, expiry is reset to NOW+90 days.

Background cleanup: NO background job. CleanupExpiredTokens() exists but is only callable via admin endpoint POST /admin/registration-tokens/cleanup. Expired refresh tokens accumulate until manually cleaned.

F-B1-10 MEDIUM: No automatic refresh token cleanup

The refresh_tokens table grows unbounded. Each agent renewal creates/extends tokens but old revoked/expired tokens are never automatically removed.

Maximum token age: 90 days from last use (sliding window).

Indefinite renewal: YES — an active agent can renew indefinitely.

Race condition on simultaneous renewal: Two concurrent renewal requests for the same token could both succeed. ValidateRefreshToken is not inside a transaction with UpdateExpiration. Both requests would read the same token as valid, both would generate new JWTs, and both would update the expiration. The second update overwrites the first. No data corruption, but two valid JWTs are issued. This is acceptable for the use case.


7. SCHEMA MIGRATION STATE CORRUPTION (P0)

Q1: Is the INSERT inside the transaction?

YES. tx.Exec("INSERT INTO schema_migrations...") at db.go:121 is called on tx (the transaction begun at line 87). Both the migration SQL and the tracking INSERT are inside the same transaction. This is correct.

Q2: What happens on migration SQL failure?

  1. tx.Exec(string(content)) at line 93 fails
  2. If "already exists" error: tx.Rollback() at line 100, then check schema_migrations outside the tx (line 104)
  3. If already recorded: skip (correct)
  4. If NOT recorded: return fatal error (correct)
  5. For any other error: tx.Rollback() at line 116, return fatal error

The migration is NOT marked as applied when it fails. The transaction ensures atomicity.

Q3: Is there a duplicate INSERT?

NO. The prior analysis identified a potential duplicate INSERT, but the current code (after review) has only ONE INSERT into schema_migrations — at line 121, inside the transaction. Migration 024 has its own INSERT (line 18-19 of the SQL file), but that's the migration's bug, not the runner's.

Q4: Server startup behavior on migration failure?

main.go:191-195:

if err := db.Migrate(migrationsPath); err != nil {
    fmt.Printf("Warning: Migration failed (tables may already exist): %v\n", err)
}
fmt.Println("[OK] Database migrations completed")

(b) Log a warning and continue. The error is caught, printed as a warning, and the server starts anyway. The [OK] message prints regardless. This is the P0 — the server runs with an incomplete schema.

Q5: What happens on next restart if migration was not applied?

The runner checks schema_migrations — the migration is NOT recorded — it tries to run again. If the migration SQL creates objects that DO exist (from a partial prior run), it may hit "already exists" errors and enter the error handling at lines 94-113. If the objects don't exist in schema_migrations, the runner returns a fatal error (line 110), which main.go swallows.

F-B1-11 P0/CRITICAL: Server starts with incomplete schema

The combination of:

  1. Migration failures being swallowed by main.go
  2. [OK] message printing after failure
  3. No retry or abort mechanism

...means the server can run indefinitely with missing tables, columns, or indexes, causing runtime errors (500s, nil pointer dereferences) that are difficult to diagnose.


8. A-SERIES MIGRATION REVIEW

Migration 025 (A-1: key_id + signed_at)

  • Idempotent: YES — uses ADD COLUMN IF NOT EXISTS and CREATE INDEX IF NOT EXISTS
  • DOWN migration: YES — drops columns and index
  • NULL handling: Both columns are nullable, no constraint violations for existing data

Migration 026 (A-2: expires_at)

  • Idempotent: YES — uses ADD COLUMN IF NOT EXISTS and CREATE INDEX IF NOT EXISTS
  • DOWN migration: YES — drops column and index
  • Backfill: UPDATE ... SET expires_at = created_at + INTERVAL '24 hours' WHERE expires_at IS NULL AND status = 'pending' — correct, only touches pending rows, uses 24h for conservative backfill
  • NULL handling: Column is nullable, IS NULL guard in queries handles un-backfilled rows

Both A-series migrations are well-constructed.


9. ETHOS COMPLIANCE

Principle Violation Location
ETHOS #1 Migration failure logged with emoji, no [TAG] format main.go:194, db.go:107,131
ETHOS #1 [OK] printed after migration failure main.go:196
ETHOS #3 Server assumes migrations succeed and continues main.go:191-195
ETHOS #4 7+ migrations lack IF NOT EXISTS on schema changes See idempotency table above
ETHOS #4 Migration 024 self-inserts into schema_migrations 024:18-19

FINDINGS SUMMARY

ID Severity Finding Location
F-B1-1 CRITICAL Migration 024 self-inserts into schema_migrations, causing duplicate key on runner INSERT 024_disable_updates_subsystem.up.sql:18-19
F-B1-2 CRITICAL Migration 024 references non-existent deprecated column on agent_subsystems 024_disable_updates_subsystem.up.sql:10
F-B1-3 HIGH Migration 018 scanner_config file has no .up.sql suffix — never executed 018_create_scanner_config_table.sql
F-B1-4 HIGH GRANT to non-existent role redflag_user in scanner_config migration 018_create_scanner_config_table.sql:34
F-B1-5 MEDIUM Missing composite index on agent_commands(status, sent_at) for timeout service GetStuckCommands query pattern
F-B1-6 HIGH N+1 in GetDashboardStats: one query per agent on every dashboard load stats.go:55-77
F-B1-7 MEDIUM BulkApproveUpdates: loop UPDATE instead of batch WHERE id = ANY($1) updates.go:153-179
F-B1-8 LOW Correlated subqueries in ListAgentsWithLastScan agents.go ListAgentsWithLastScan
F-B1-9 LOW No ON DELETE policy for user-referenced security tables (defaults to RESTRICT) 020_add_command_signatures.up.sql
F-B1-10 MEDIUM No automatic refresh token cleanup — table grows unbounded refresh_tokens.go, main.go (no background job)
F-B1-11 P0/CRITICAL Server starts with incomplete schema after migration failure, prints [OK] main.go:191-196
F-B1-12 MEDIUM 16 migrations have no DOWN/rollback file See table above
F-B1-13 MEDIUM Duplicate migration numbers (009, 012) — fragile ordering Migration filenames
F-B1-14 LOW Inconsistent UUID generation (uuid_generate_v4 vs gen_random_uuid) Various migrations
F-B1-15 LOW 7+ migrations not idempotent (missing IF NOT EXISTS) See idempotency table