fix(database): B-1 schema integrity and migration fixes

- Fix migration 024 self-insert and bad column reference (F-B1-1, F-B1-2)
  Uses existing enabled/auto_run columns instead of non-existent deprecated
- Abort server on migration failure instead of warning (F-B1-11)
  main.go now calls log.Fatalf, prints [INFO] only on success
- Fix migration 018 scanner_config filename suffix (F-B1-3)
  Renumbered to 027 with .up.sql suffix
- Remove GRANT to non-existent role in scanner_config (F-B1-4)
- Resolve duplicate migration numbers 009 and 012 (F-B1-13)
  Renamed to 009b and 012b for unique lexical sorting
- Add IF NOT EXISTS to all non-idempotent migrations (F-B1-15)
  Fixed: 011, 012, 017, 023, 023a
- Replace N+1 dashboard stats loop with GetAllUpdateStats (F-B1-6)
  Single aggregate query replaces per-agent loop
- Add composite index on agent_commands(status, sent_at) (F-B1-5)
  New migration 028 with partial index for timeout service
- Add background refresh token cleanup goroutine (F-B1-10)
  24-hour ticker calls CleanupExpiredTokens
- ETHOS log format in migration runner (no emojis)

All 55 tests pass (41 server + 14 agent). No regressions.
See docs/B1_Fix_Implementation.md and DEV-025 through DEV-028.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-03-29 07:03:35 -04:00
parent ab676c3b83
commit ec0d880036
33 changed files with 420 additions and 537 deletions

View File

@@ -0,0 +1,54 @@
# B-1 Database & Schema Integrity Fix Implementation
**Date:** 2026-03-29
**Branch:** culurien
---
## Files Changed
| File | Change |
|------|--------|
| `migrations/024_disable_updates_subsystem.up.sql` | Removed self-insert + bad column reference (F-B1-1, F-B1-2) |
| `migrations/024_disable_updates_subsystem.down.sql` | Updated to match fixed up migration |
| `cmd/server/main.go` | Migration failure now calls log.Fatalf (F-B1-11); background token cleanup added (F-B1-10) |
| `internal/database/db.go` | ETHOS log format in migration runner (no emojis) |
| `migrations/018_create_scanner_config_table.sql` | DELETED (F-B1-3) |
| `migrations/027_create_scanner_config_table.up.sql` | NEW — renumbered from 018, fixed suffix and removed GRANT (F-B1-3, F-B1-4) |
| `migrations/027_create_scanner_config_table.down.sql` | NEW — down migration |
| `migrations/009_add_retry_tracking.up.sql` | RENAMED to 009b (F-B1-13) |
| `migrations/012_create_admin_user.up.sql` | RENAMED to 012b (F-B1-13) |
| `migrations/011_create_registration_tokens_table.up.sql` | Added IF NOT EXISTS (F-B1-15) |
| `migrations/012_add_token_seats.up.sql` | Added IF NOT EXISTS (F-B1-15) |
| `migrations/017_add_machine_id.up.sql` | Added IF NOT EXISTS (F-B1-15) |
| `migrations/023_client_error_logging.up.sql` | Added IF NOT EXISTS (F-B1-15) |
| `migrations/023a_command_deduplication.up.sql` | Added IF NOT EXISTS (F-B1-15) |
| `migrations/028_add_stuck_commands_index.up.sql` | NEW — partial index for GetStuckCommands (F-B1-5) |
| `migrations/028_add_stuck_commands_index.down.sql` | NEW — down migration |
| `internal/api/handlers/stats.go` | Replaced N+1 loop with GetAllUpdateStats() (F-B1-6) |
## Migration 024 Fix (Option B)
Used existing `enabled` and `auto_run` columns instead of the non-existent `deprecated` column. The intent of the migration was to disable the legacy updates subsystem — `SET enabled = false, auto_run = false` achieves this using columns that exist in the schema since migration 015.
## Final Migration Sequence
```
001 → 003 → 004 → 005 → 006 → 007 → 008 → 009 → 009b → 010 →
011 → 012 → 012b → 013 → 014 → 015 → 016 → 017 → 018 → 019 →
020 → 021 → 022 → 023 → 023a → 024 → 025 → 026 → 027 → 028
```
No duplicate numbers. Monotonically increasing. All have .up.sql suffix.
## N+1 Fix
Replaced per-agent `GetUpdateStatsFromState(agent.ID)` loop (stats.go:64) with single call to `GetAllUpdateStats()` which aggregates across all agents in one query.
## Background Cleanup
24-hour ticker goroutine calls `refreshTokenQueries.CleanupExpiredTokens()`. Logs success with count and failure with error. No context cancellation (main.go doesn't use a server context — documented as DEV-025).
## Test Results
55 tests pass (41 server + 14 agent). Zero regressions.

View File

@@ -252,3 +252,35 @@ This document records deviations from the implementation spec.
2. `example_integration.go` calls `machineid.ID()` directly instead of `GetMachineID()`
**Action:** Not fixed in this pass (requires careful analysis of downstream effects). Flagged as D-1 fix prompt input. See `docs/Refactor_A_Series.md` Task 6 for full analysis.
---
## DEV-025: Background token cleanup without context cancellation (B-1)
**Issue:** The background token cleanup goroutine in main.go uses a simple `time.NewTicker` loop without `context.Context` cancellation. main.go doesn't have a server-level context, so clean shutdown is handled by the deferred shutdown block and process signals.
**Impact:** On server shutdown, the goroutine is killed by the OS. No data loss risk — `CleanupExpiredTokens` is a DELETE query that's atomic. The ticker approach is consistent with the existing offline-agent-check goroutine (same pattern, same file).
---
## DEV-026: Migration 024 fix uses Option B — existing columns (B-1)
**Issue:** Migration 024 referenced a `deprecated` column that doesn't exist. Two options: (A) add the column in a new migration, (B) use existing `enabled`/`auto_run` columns.
**Decision:** Option B. The migration's intent is to disable the legacy updates subsystem. `SET enabled = false, auto_run = false` achieves this using columns already in the schema (migration 015). Adding an unused `deprecated` column would be unnecessary complexity.
---
## DEV-027: Duplicate migration numbers resolved with suffix letters (B-1)
**Issue:** Migrations 009 and 012 each had two files with the same numeric prefix.
**Decision:** Renamed second files to `009b` and `012b`. The runner sorts lexically (`sort.Strings`), so `009b_add_retry_tracking.up.sql` sorts after `009_add_agent_version_tracking.up.sql` correctly. This preserves the original execution order and doesn't require a full renumber.
---
## DEV-028: Migration 018 renumbered to 027 (B-1)
**Issue:** `018_create_scanner_config_table.sql` had wrong file suffix (.sql not .up.sql) AND shared number 018 with `018_create_metrics_and_docker_tables.up.sql`.
**Decision:** Renumbered to 027. The scanner_config table has never been created by the runner (it was skipped due to wrong suffix), so existing databases don't have it. Number 027 is after all existing migrations, ensuring it runs last in the sequence.