Add docs and project files - force for Culurien

Fimeg
2026-03-28 20:46:24 -04:00
parent dc61797423
commit 484a7f77ce
343 changed files with 119530 additions and 0 deletions

934
ChristmasTodos.md Normal file

@@ -0,0 +1,934 @@
# Christmas Todos
Generated from investigation of RedFlag system architecture, December 2025.
---
## ⚠️ IMMEDIATE ISSUE: updates Subsystem Inconsistency - **RESOLVED**
### Problem
The `updates` subsystem was causing confusion across multiple layers.
### Solution Applied (Dec 23, 2025)
**Migration 025: Platform-Specific Subsystems**
- Created `025_platform_scanner_subsystems.up.sql` - Backfills `apt`, `dnf` for Linux agents, `windows`, `winget` for Windows agents
- Updated database trigger to create platform-specific subsystems for NEW agent registrations
**Scheduler Fix**
- Removed `"updates": 15` from `aggregator-server/internal/scheduler/scheduler.go:196`
**README.md Security Language Fix**
- Changed "All subsequent communications verified via Ed25519 signatures"
- To: "Commands and updates are verified via Ed25519 signatures"
**Orchestrator EventBuffer Integration**
- Changed `main.go:747` to use `NewOrchestratorWithEvents(apiClient.eventBuffer)`
### Result (No Remaining Blockers)
- New agent registrations will now get platform-specific subsystems automatically
- No more "cannot find subsystem" errors for package scanners
---
## History/Timeline System Integration
### Current State
- Chat timeline shows only `agent_commands` + `update_logs` tables
- `system_events` table EXISTS but is NOT integrated into timeline
- `security_events` table EXISTS but is NOT integrated into timeline
- Frontend uses `/api/v1/logs` which queries `GetAllUnifiedHistory` in `updates.go`
### Missing Events
| Category | Missing Events |
|----------|----------------|
| **Agent Lifecycle** | Registration, startup, shutdown, check-in, offline events |
| **Security** | Machine ID mismatch, Ed25519 verification failures, nonce validation failures, unauthorized access attempts |
| **Acknowledgment** | Receipt, success, failure events |
| **Command Verification** | Success/failure logging to timeline (currently only to security log file) |
| **Configuration** | Config fetch attempts, token validation issues |
### Future Design Notes
- Timeline should be filterable by agent
- Server's primary history section (when not filtered by agent) should filter by event types/severity
- Keep options open - don't hardcode narrow assumptions about filtering
### Key Files
- `/home/casey/Projects/RedFlag/aggregator-server/internal/database/queries/updates.go` - `GetAllUnifiedHistory` query
- `/home/casey/Projects/RedFlag/aggregator-server/internal/database/migrations/019_create_system_events_table.up.sql`
- `/home/casey/Projects/RedFlag/aggregator-server/internal/api/handlers/agents.go` - Agent registration/status
- `/home/casey/Projects/RedFlag/aggregator-server/internal/api/middleware/machine_binding.go` - Machine ID checks
- `/home/casey/Projects/RedFlag/aggregator-web/src/components/HistoryTimeline.tsx`
- `/home/casey/Projects/RedFlag/aggregator-web/src/components/ChatTimeline.tsx`
---
## Agent Lifecycle & Scheduler Robustness
### Current State
- Agent CONTINUES checking in on most errors (logs and continues to next iteration)
- Subsystem timeouts configured per type (10s system, 30s APT, 15m DNF, 60s Docker, etc.)
- Circuit breaker implementation exists with configurable thresholds
- Architecture: Simple sleep-based polling (5 min default, 5s rapid mode)
### Risks
| Issue | Risk Level | Details | File |
|-------|------------|---------|------|
| **No panic recovery** | HIGH | Main loop has no deferred function that calls `recover()`; if it panics, the agent crashes | `cmd/agent/main.go:1040`, `internal/service/windows.go:171` |
| **Blocking scans** | MEDIUM | Server-commanded scans block main loop (mitigated by timeouts) | `cmd/agent/subsystem_handlers.go` |
| **No goroutine pool** | MEDIUM | Background goroutines fire-and-forget, no centralized control | Various `go func()` calls |
| **No watchdog** | HIGH | No separate process monitors agent health | None |
| **No separate heartbeat** | MEDIUM | "Heartbeat" is just the check-in cycle | None |
### Mitigations Already In Place
- Per-subsystem timeouts via `context.WithTimeout()`
- Circuit breaker: Can disable subsystems after repeated failures
- OS-level service managers: systemd on Linux, Windows Service Manager
- Watchdog for agent self-updates only (5-minute timeout with rollback)
### Design Note
- Heartbeat should be separate goroutine that continues even if main loop is processing
- Consider errgroup for managing concurrent operations with proper cancellation
- Per-agent configuration for polling intervals, timeouts, etc.
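
The separate-heartbeat idea above can be sketched as follows. This is a minimal illustration, not the agent's actual code: `runHeartbeat` and `countBeatsDuring` are hypothetical names, the blocking scan is simulated with a sleep, and the standard library's `sync.WaitGroup` stands in for errgroup to keep the example dependency-free.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// runHeartbeat (hypothetical helper) calls beat on every tick until ctx is
// cancelled, independently of whatever the main loop is doing.
func runHeartbeat(ctx context.Context, interval time.Duration, beat func()) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			beat()
		}
	}
}

// countBeatsDuring runs a heartbeat alongside a simulated blocking scan and
// reports how many beats fired while the scan was in progress.
func countBeatsDuring(scanMs, intervalMs int) int64 {
	ctx, cancel := context.WithCancel(context.Background())
	var beats atomic.Int64
	var wg sync.WaitGroup

	wg.Add(1)
	go func() {
		defer wg.Done()
		runHeartbeat(ctx, time.Duration(intervalMs)*time.Millisecond, func() {
			beats.Add(1) // a real agent would send a lightweight ping here
		})
	}()

	// Stands in for a long server-commanded scan that blocks the main loop.
	time.Sleep(time.Duration(scanMs) * time.Millisecond)

	cancel()  // stop the heartbeat
	wg.Wait() // wait for the goroutine to exit before reading the final count
	return beats.Load()
}

func main() {
	n := countBeatsDuring(200, 20)
	fmt.Printf("heartbeats sent while the scan was blocking: %d\n", n)
}
```

The point of the sketch is that the heartbeat keeps ticking while the main loop is busy, and that cancellation plus `Wait` gives the centralized shutdown control the fire-and-forget goroutines currently lack.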
---
## Configurable Settings (Hardcoded vs Configurable)
### Fully HARDCODED (Critical - Need Configuration)
| Setting | Current Value | Location | Priority |
|---------|---------------|----------|----------|
| **Ack maxAge** | 24 hours | `agent/internal/acknowledgment/tracker.go:24` | HIGH |
| **Ack maxRetries** | 10 | `agent/internal/acknowledgment/tracker.go:25` | HIGH |
| **Timeout sentTimeout** | 2 hours | `server/internal/services/timeout.go:28` | HIGH |
| **Timeout pendingTimeout** | 30 minutes | `server/internal/services/timeout.go:29` | HIGH |
| **Update nonce maxAge** | 10 minutes | `server/internal/services/update_nonce.go:26` | MEDIUM |
| **Nonce max age (security handler)** | 300 seconds | `server/internal/api/handlers/security.go:356` | MEDIUM |
| **Machine ID nonce expiry** | 600 seconds | `server/middleware/machine_binding.go:188` | MEDIUM |
| **Min check interval** | 60 sec | `server/internal/command/validator.go:22` | MEDIUM |
| **Max check interval** | 3600 sec | `server/internal/command/validator.go:23` | MEDIUM |
| **Min scanner interval** | 1 min | `server/internal/command/validator.go:24` | MEDIUM |
| **Max scanner interval** | 1440 min | `server/internal/command/validator.go:25` | MEDIUM |
| **Agent HTTP timeout** | 30 seconds | `agent/internal/client/client.go:48` | LOW |
### Already User-Configurable
| Category | Settings | How Configured |
|----------|----------|----------------|
| **Command Signing** | enabled, enforcement_mode (strict/warning/disabled), algorithm | DB + ENV |
| **Nonce Validation** | timeout_seconds (60-3600), reject_expired, log_expired_attempts | DB + ENV |
| **Machine Binding** | enabled, enforcement_mode, strict_action | DB + ENV |
| **Rate Limiting** | 6 limit types (requests, window, enabled) | API endpoints |
| **Network (Agent)** | timeout, retry_count (0-10), retry_delay, max_idle_conn | JSON config |
| **Circuit Breaker** | failure_threshold, failure_window, open_duration, half_open_attempts | JSON config |
| **Subsystem Timeouts** | 7 subsystems (timeout, interval_minutes) | JSON config |
| **Security Logging** | enabled, level, log_successes, file_path, retention, etc. | ENV |
### Per-Agent Configuration Goal
- All timeouts and retry settings should eventually be per-agent configurable
- Server-side overrides possible (e.g., increase timeouts for slow connections)
- Agent should pull overrides during config sync
---
## Implementation Considerations
### History/Timeline Integration Approaches
1. Expand `GetAllUnifiedHistory` to include `system_events` and `security_events`
2. Log critical events directly to `update_logs` with new action types
3. Hybrid: Use `system_events` for telemetry, sync to `update_logs` for timeline visibility
### Configuration Strategy
1. Use existing `SecuritySettingsService` for server-wide defaults
2. Add per-agent overrides in `agents` table (JSONB metadata column)
3. Agent pulls overrides during config sync (already implemented via `syncServerConfigWithRetry`)
4. Add validation ranges to prevent unreasonable values
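
A minimal sketch of steps 2 and 4, assuming overrides are optional fields layered over server-wide defaults. The `Settings` and `Overrides` types and the `Merge` function are hypothetical; only the 60-3600 second check-interval bounds come from the hardcoded values listed earlier.

```go
package main

import "fmt"

// Bounds taken from the hardcoded check-interval limits (60s to 3600s).
const (
	minCheckIntervalSec = 60
	maxCheckIntervalSec = 3600
)

// Settings is a hypothetical subset of the effective agent configuration.
type Settings struct {
	CheckIntervalSec int
	HTTPTimeoutSec   int
}

// Overrides is a hypothetical per-agent override record (for example decoded
// from a JSONB column). A nil field means "use the global default".
type Overrides struct {
	CheckIntervalSec *int
	HTTPTimeoutSec   *int
}

func clamp(v, lo, hi int) int {
	if v < lo {
		return lo
	}
	if v > hi {
		return hi
	}
	return v
}

// Merge layers per-agent overrides on top of global defaults and clamps
// values into their allowed range so an override cannot be unreasonable.
func Merge(global Settings, o Overrides) Settings {
	out := global
	if o.CheckIntervalSec != nil {
		out.CheckIntervalSec = clamp(*o.CheckIntervalSec, minCheckIntervalSec, maxCheckIntervalSec)
	}
	if o.HTTPTimeoutSec != nil && *o.HTTPTimeoutSec > 0 {
		out.HTTPTimeoutSec = *o.HTTPTimeoutSec
	}
	return out
}

func intPtr(v int) *int { return &v }

func main() {
	global := Settings{CheckIntervalSec: 300, HTTPTimeoutSec: 30}
	slowLink := Overrides{HTTPTimeoutSec: intPtr(120)}
	tooEager := Overrides{CheckIntervalSec: intPtr(5)}

	fmt.Printf("slow link: %+v\n", Merge(global, slowLink))
	fmt.Printf("too eager: %+v\n", Merge(global, tooEager))
}
```

Using pointer fields keeps "not overridden" distinct from a zero value, which matters once overrides are stored as sparse JSON.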
### Robustness Strategy
1. Add a deferred function that calls `recover()` in main agent loops (Linux: `main.go`, Windows: `windows.go`); a bare `defer recover()` statement does not stop a panic
2. Consider separate heartbeat goroutine with independent tick
3. Use errgroup for managed concurrent operations
4. Add health-check endpoint for external monitoring
---
## Related Documentation
- ETHOS principles in `/home/casey/Projects/RedFlag/docs/1_ETHOS/ETHOS.md`
- README at `/home/casey/Projects/RedFlag/README.md`
---
## Status
Created: December 22, 2025
Last Updated: December 22, 2025
---
## FEATURE DEVELOPMENT ARCHITECTURE (Designed Dec 22, 2025)
### Summary
This section records an exhaustive code exploration and an architecture design for comprehensive security, error transparency, and reliability improvements. **These are NOT actual blockers for alpha release.**
### Critical Assessment: Are These Blockers? NO.
The system as currently implemented is **functionally sufficient for alpha release**:
| README Claim | Actual Reality | Blocker? |
|-------------|---------------|----------|
| "Ed25519 signing" | Commands ARE signed ✅ | **No** |
| "All updates cryptographically signed" | Updates ARE signed ✅ | **No** |
| "All subsequent communications verified" | Only commands/updates signed; rest uses TLS+JWT | **No** - TLS+JWT is adequate |
| "Error transparency" | Security logger writes to file ✅ | **No** |
| "Hardware binding" | EXISTS ✅ | **No** |
| "Rate limiting" | EXISTS ✅ | **No** |
| "Circuit breakers" | EXISTS ✅ | **No** |
| "Agent auto-update" | EXISTS ✅ | **No** |
**Conclusion:** These enhancements are quality-of-life improvements, not release blockers. The README's "All subsequent communications" was aspirational language, not a description of implemented behavior.
---
## Phase 0: Panic Recovery & Critical Security
### Design Decisions (User Approved)
| Question | Decision | Rationale |
|----------|----------|-----------|
| Q1 Panic Recovery | B) Hard Recovery - Log panic, send event, exit | Service managers (systemd/Windows Service) already handle restarts |
| Q2 Startup Event | Full - Include all system info | `GetSystemInfo()` already collects all required fields |
| Q3 Build Scope | A) Verify only - Add verification to existing signing | Signing service designed for existing files |
### Architecture
```
┌─────────────────────────────────────────────────────────────────────┐
│ PANIC RECOVERY COMPONENT │
│ NEW: internal/recovery/panic.go │
│ - NewPanicRecovery(eventBuffer, agentID, version, component) │
│ - HandlePanic() - deferred; calls recover(), buffers event, exit(1) │
│ - Wrap(fn) - Helper to wrap any function with recovery │
│ MODIFIED: cmd/agent/main.go │
│ - Wrap runAgent() with panic recovery │
│ MODIFIED: internal/service/windows.go │
│ - Wrap runAgent() with panic recovery (service mode) │
│ Event Flow: │
│ Panic → recover() → SystemEvent → event.Buffer → os.Exit(1) │
│ ↓ │
│ Service Manager Restarts Agent │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ STARTUP EVENT COMPONENT │
│ NEW: internal/startup/event.go │
│ - NewStartupEvent(apiClient, agentID, version) │
│ - Report() - Get system info, send via ReportSystemInfo() │
│ Event Flow: │
│ Agent Start → GetSystemInfo() → ReportSystemInfo() │
│ ↓ │
│ Server: POST /api/v1/agents/:id/system-info │
│ ↓ │
│ Database: CreateSystemEvent() (event_type="agent_startup") │
│ Metadata includes: hostname, os_type, os_version, architecture, │
│ uptime, memory_total, cpu_cores, etc. │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ BUILD VERIFICATION COMPONENT │
│ MODIFIED: services/build_orchestrator.go │
│ - VerifyBinarySignature(binaryPath) - NEW METHOD │
│ - SignBinaryWithVerification(path, version, platform, arch, │
│ verifyExisting) - Enhanced with verify flag │
│ Verification Flow: │
│ Binary Path → Checksum Calculation → Lookup DB Package │
│ ↓ │
│ Verify Checksum → Verify Signature → Return Package Info │
└─────────────────────────────────────────────────────────────────────┘
```
### Implementation Checklists
**Phase 0.1: Panic Recovery (~30 minutes)**
- [ ] Create `internal/recovery/panic.go`
- [ ] Import in `cmd/agent/main.go` and `internal/service/windows.go`
- [ ] Wrap main loops with panic recovery
- [ ] Test panic scenario and verify event buffer
**Phase 0.2: Startup Event (~30 minutes)**
- [ ] Create `internal/startup/event.go`
- [ ] Call startup events in both main.go and windows.go
- [ ] Verify database entries in system_events table
**Phase 0.3: Build Verification (~20 minutes)**
- [ ] Add `VerifyBinarySignature()` to build_orchestrator.go
- [ ] Add verification mode flag handling
- [ ] Test verification flow
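
The "hard recovery" decision (log the panic, buffer an event, exit) could look roughly like the sketch below. It is an assumption-laden illustration: `PanicEvent`, the `Buffer` callback, and the injectable `Exit` field are hypothetical stand-ins for the real event buffer and `os.Exit`, and the constructor signature is simplified from the one in the architecture diagram.

```go
package main

import (
	"fmt"
	"os"
	"runtime/debug"
)

// PanicEvent is a hypothetical stand-in for the agent's SystemEvent type.
type PanicEvent struct {
	Component string
	Message   string
	Stack     string
}

// PanicRecovery holds what the handler needs. Buffer and Exit are injected so
// the behavior can be exercised without terminating the process.
type PanicRecovery struct {
	Component string
	Buffer    func(PanicEvent) // stand-in for writing to the event buffer
	Exit      func(code int)   // os.Exit in production
}

func NewPanicRecovery(component string, buffer func(PanicEvent)) *PanicRecovery {
	return &PanicRecovery{Component: component, Buffer: buffer, Exit: os.Exit}
}

// HandlePanic must itself be the deferred call (defer r.HandlePanic()):
// recover only stops a panic when the deferred function calls it directly.
func (r *PanicRecovery) HandlePanic() {
	v := recover()
	if v == nil {
		return // normal return, nothing to do
	}
	r.Buffer(PanicEvent{
		Component: r.Component,
		Message:   fmt.Sprint(v),
		Stack:     string(debug.Stack()),
	})
	r.Exit(1) // hard recovery: the service manager restarts the agent
}

// Wrap runs fn with panic recovery installed.
func (r *PanicRecovery) Wrap(fn func()) {
	defer r.HandlePanic()
	fn()
}

func main() {
	r := NewPanicRecovery("agent-main", func(e PanicEvent) {
		fmt.Printf("buffered panic event from %s: %s\n", e.Component, e.Message)
	})
	// For the demo, report the exit code instead of terminating.
	r.Exit = func(code int) { fmt.Println("would exit with code", code) }

	r.Wrap(func() { panic("simulated scanner crash") })
}
```

In the real agent the buffered event would need to be flushed to disk before exiting, since the process does not get another chance to send it.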
---
## Phase 1: Error Transparency
### Design Decisions (User Approved)
| Question | Decision | Rationale |
|----------|----------|-----------|
| Q4 Event Batching | A) Bundle in check-in | Server ALREADY processes buffered_events from metadata |
| Q5 Event Persistence | B) Persisted + exponential backoff retry | events_buffer.json exists, retry pattern from syncServerConfigWithRetry() |
| Q6 Scan Error Granularity | A) One event per scan | Prevents event flood, matches UI expectations |
### Key Finding
**The server ALREADY accepts buffered events:**
`aggregator-server/internal/api/handlers/agents.go:228-264` processes `metadata["buffered_events"]` and calls `CreateSystemEvent()` for each.
**The gap:** Agent's `GetBufferedEvents()` is NEVER called in main.go.
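
Closing that gap amounts to draining the buffer when the check-in payload is built. The sketch below assumes an in-memory buffer for brevity (the real one persists to `events_buffer.json`); the `Event` fields, `Drain`, and `buildCheckInMetadata` are hypothetical names, and only the `buffered_events` metadata key comes from the server code described above.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a hypothetical, simplified event shape.
type Event struct {
	EventType string
	Severity  string
	Message   string
}

// Buffer is an in-memory stand-in for the agent's persisted event buffer.
type Buffer struct {
	mu     sync.Mutex
	events []Event
}

// Add appends an event to the buffer.
func (b *Buffer) Add(e Event) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.events = append(b.events, e)
}

// Drain returns all buffered events and clears the buffer.
func (b *Buffer) Drain() []Event {
	b.mu.Lock()
	defer b.mu.Unlock()
	out := b.events
	b.events = nil
	return out
}

// buildCheckInMetadata attaches any buffered events under the key the server
// already reads. The key is omitted when there is nothing to report.
func buildCheckInMetadata(b *Buffer) map[string]any {
	meta := map[string]any{}
	if evs := b.Drain(); len(evs) > 0 {
		meta["buffered_events"] = evs
	}
	return meta
}

func main() {
	buf := &Buffer{}
	buf.Add(Event{EventType: "agent_scan", Severity: "error", Message: "apt scan timed out"})

	meta := buildCheckInMetadata(buf)
	fmt.Printf("check-in metadata: %v\n", meta)
	fmt.Printf("second check-in metadata: %v\n", buildCheckInMetadata(buf))
}
```

One caveat with this shape: events are removed before the check-in is known to have succeeded, so a production version would re-queue them on failure or clear the buffer only after a successful response.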
### Architecture
```
┌─────────────────────────────────────────────────────────────────────┐
│ EVENT CREATION HELPERS │
│ NEW: internal/event/events.go │
│ - NewScanFailureEvent(scannerName, err, duration) │
│ - NewScanSuccessEvent(scannerName, updateCount, duration) │
│ - NewAgentLifecycleEvent(eventType, subtype, severity, message) │
│ - NewConfigSyncEvent(success, details, attempt) │
│ - NewOfflineEvent(reason) │
│ - NewReconnectionEvent() │
│ Event Types Defined: │
│ EventTypeAgentStartup, EventTypeAgentCheckIn, EventTypeAgentShutdown│
│ EventTypeAgentScan, EventTypeAgentConfig, EventTypeOffline │
│ SubtypeSuccess, SubtypeFailed, SubtypeSkipped, SubtypeTimeout │
│ SeverityInfo, SeverityWarning, SeverityError, SeverityCritical │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ RETRY LOGIC COMPONENT │
│ NEW: internal/event/retry.go │
│ - RetryConfig struct (maxRetries, initialDelay, maxDelay, etc.) │
│ - RetryWithBackoff(fn, config) - Generic exponential backoff │
│ Backoff Pattern: 1s → 2s → 4s → 8s (max 4 retries) │
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ SCAN HANDLER MODIFICATIONS │
│ MODIFIED: internal/handlers/scan.go │
│ - HandleScanAPT - Add bufferScanFailureEvent on error │
│ - HandleScanDNF - Add bufferScanFailureEvent on error │
│ - HandleScanDocker - Add bufferScanFailureEvent on error │
│ - HandleScanWindows - Add bufferScanFailureEvent on error │
│ - HandleScanWinget - Add bufferScanFailureEvent on error │
│ - HandleScanStorage - Add bufferScanFailureEvent on error │
│ - HandleScanSystem - Add bufferScanFailureEvent on error │
│ Pattern: On scan OR orchestrator.ScanSingle() failure, buffer event│
└─────────────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────────────┐
│ MAIN LOOP INTEGRATION │
│ MODIFIED: cmd/agent/main.go │
│ - Initialize event.Buffer in runAgent() │
│ - Generate and buffer agent_startup event │
│ - Before check-in: SendBufferedEventsWithRetry(agentID, 4) │
│ - Add check-in event to metadata (online, not buffered) │
│ - On check-in failure: Buffer offline event │
│ - On reconnection: Buffer reconnection event │
│ Event Flow: │
│ Scan Error → BufferEvent() → events_buffer.json │
│ ↓ │
│ Check-in → GetBufferedEvents() → clear buffer │
│ ↓ │
│ Build metrics with metadata["buffered_events"] array │
│ ↓ │
│ POST /api/v1/agents/:id/commands │
│ ↓ │
│ Server: CreateSystemEvent() for each buffered event │
│ ↓ │
│ system_events table ← Future: Timeline UI integration │
└─────────────────────────────────────────────────────────────────────┘
```
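
The retry component's 1s → 2s → 4s → 8s pattern can be sketched as below. The struct fields, the injectable `Sleep`, and the helper names are assumptions made for illustration; the planned `RetryWithBackoff(fn, config)` may end up with a different shape.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrTemporary is a hypothetical transient error used by the demo.
var ErrTemporary = errors.New("temporary failure")

// RetryConfig is a hypothetical configuration. Sleep is injectable so callers
// (and tests) can avoid real waiting.
type RetryConfig struct {
	MaxRetries   int
	InitialDelay time.Duration
	MaxDelay     time.Duration
	Sleep        func(time.Duration)
}

// DefaultRetryConfig matches the documented pattern: 1s, 2s, 4s, 8s.
func DefaultRetryConfig() RetryConfig {
	return RetryConfig{
		MaxRetries:   4,
		InitialDelay: time.Second,
		MaxDelay:     8 * time.Second,
		Sleep:        time.Sleep,
	}
}

// NoSleep is a no-op Sleep replacement.
func NoSleep(time.Duration) {}

// BackoffSchedule returns the delay before each retry, doubling each time
// and capping at MaxDelay.
func BackoffSchedule(cfg RetryConfig) []time.Duration {
	delays := make([]time.Duration, 0, cfg.MaxRetries)
	d := cfg.InitialDelay
	for i := 0; i < cfg.MaxRetries; i++ {
		if d > cfg.MaxDelay {
			d = cfg.MaxDelay
		}
		delays = append(delays, d)
		d *= 2
	}
	return delays
}

// RetryWithBackoff calls fn once, then retries up to MaxRetries more times,
// sleeping between attempts. It returns the last error, or nil on success.
func RetryWithBackoff(fn func() error, cfg RetryConfig) error {
	err := fn()
	for _, d := range BackoffSchedule(cfg) {
		if err == nil {
			return nil
		}
		cfg.Sleep(d)
		err = fn()
	}
	return err
}

func main() {
	cfg := DefaultRetryConfig()
	// Print the delay instead of sleeping so the demo finishes immediately.
	cfg.Sleep = func(d time.Duration) { fmt.Println("would wait", d) }

	attempts := 0
	err := RetryWithBackoff(func() error {
		attempts++
		if attempts < 3 {
			return ErrTemporary
		}
		return nil
	}, cfg)
	fmt.Printf("attempts=%d err=%v\n", attempts, err)
}
```

With "max 4 retries" read as four retries after the initial attempt, a persistently failing call is tried five times in total; if the intent is four attempts overall, `MaxRetries` would be 3.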
### Implementation Checklists
**Phase 1.1: Event Buffer Integration (~30 minutes)**
- [ ] Add `GetEventBufferPath()` to `constants/paths.go`
- [ ] Enhance client with buffer integration
- [ ] Add `bufferEventFromStruct()` helper
**Phase 1.2: Event Creation Library (~30 minutes)**
- [ ] Create `internal/event/events.go` with all event helpers
- [ ] Create `internal/event/retry.go` for generic retry
- [ ] Add tests for event creation
**Phase 1.3: Scan Failure Events (~45 minutes)**
- [ ] Modify all 7 scan handlers (APT, DNF, Docker, Windows, Winget, Storage, System)
- [ ] Add both failure and success event buffering
- [ ] Test scan failure → buffer → delivery flow
**Phase 1.4: Lifecycle Events (~30 minutes)**
- [ ] Add startup event generation
- [ ] Add check-in event (immediate, not buffered)
- [ ] Add config sync event generation
- [ ] Add shutdown event generation
**Phase 1.5: Buffered Event Reporting (~45 minutes)**
- [ ] Implement `SendBufferedEventsWithRetry()` in client
- [ ] Modify main loop to use buffered event reporting
- [ ] Add offline/reconnection event generation
- [ ] Test offline scenario → buffer → reconnect → delivery
**Phase 1.6: Server Enhancements (~20 minutes)**
- [ ] Add enhanced logging for buffered events
- [ ] Add metrics for event processing
- [ ] Limit events per request (100 max) to prevent DoS
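
The per-request cap in Phase 1.6 is small enough to sketch directly. The function name and the generic event shape are hypothetical; the limit of 100 is the one given in the checklist.

```go
package main

import "fmt"

// maxEventsPerRequest is the cap proposed in the Phase 1.6 checklist.
const maxEventsPerRequest = 100

// limitEvents (hypothetical helper) returns at most maxEventsPerRequest
// events plus the number that were dropped, so the caller can log the excess.
func limitEvents(events []map[string]any) (accepted []map[string]any, dropped int) {
	if len(events) <= maxEventsPerRequest {
		return events, 0
	}
	return events[:maxEventsPerRequest], len(events) - maxEventsPerRequest
}

func main() {
	incoming := make([]map[string]any, 250)
	accepted, dropped := limitEvents(incoming)
	fmt.Printf("accepted=%d dropped=%d\n", len(accepted), dropped)
}
```

Returning the dropped count rather than silently truncating keeps the cap visible in logs, which fits the error-transparency goal of this phase.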
---
## Combined Phase 0+1 Summary
### File Changes
| Type | Path | Status |
|------|------|--------|
| **NEW** | `internal/recovery/panic.go` | To create |
| **NEW** | `internal/startup/event.go` | To create |
| **NEW** | `internal/event/events.go` | To create |
| **NEW** | `internal/event/retry.go` | To create |
| **MODIFY** | `cmd/agent/main.go` | Add panic wrapper + events + retry |
| **MODIFY** | `internal/service/windows.go` | Add panic wrapper + events + retry |
| **MODIFY** | `internal/client/client.go` | Event retry integration |
| **MODIFY** | `internal/handlers/scan.go` | Scan failure events |
| **MODIFY** | `services/build_orchestrator.go` | Verification mode |
### Totals
- **New files:** 4
- **Modified files:** 5
- **Lines of code:** ~830
- **Estimated time:** ~5-6 hours
- **No database migrations required**
- **No new API endpoints required**
---
## Future Phases (Designed but not Proceeding)
### Phase 2: UI Componentization
- Extract shared StatusCard from ChatTimeline.tsx (51KB monolith)
- Create TimelineEventCard component
- ModuleFactory for agent overview
- Estimated: 9-10 files, ~1700 LOC
### Phase 3: Factory/Unified Logic
- ScannerFactory for all scanners
- HandlerFactory for command handlers
- Unified event models to eliminate duplication
- Estimated: 8 files, ~1000 LOC
### Phase 4: Scheduler Event Awareness
- Event subscription system in scheduler
- Per-agent error tracking (1h + 24h + 7d windows)
- Adaptive backpressure based on error rates
- Estimated: 5 files, ~800 LOC
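
As a rough illustration of the 1h + 24h + 7d windows, the sketch below counts recorded errors per window and uses the one-hour count as a backpressure signal. Everything here is hypothetical (type, method names, threshold), and timestamps are plain Unix seconds to keep the example small; one tracker instance would exist per agent and subsystem.

```go
package main

import "fmt"

// Window lengths in seconds.
const (
	hour = 3600
	day  = 24 * hour
	week = 7 * day
)

// ErrorTracker (hypothetical) stores error timestamps as Unix seconds.
type ErrorTracker struct {
	stamps []int64
}

// Record notes that an error occurred at ts.
func (t *ErrorTracker) Record(ts int64) {
	t.stamps = append(t.stamps, ts)
}

// Counts returns the number of errors in the last hour, 24 hours, and 7 days
// relative to now, and prunes entries older than 7 days.
func (t *ErrorTracker) Counts(now int64) (h1, h24, d7 int) {
	kept := t.stamps[:0] // filter in place
	for _, ts := range t.stamps {
		age := now - ts
		if age >= week {
			continue // too old for every window: drop it
		}
		kept = append(kept, ts)
		d7++
		if age < day {
			h24++
		}
		if age < hour {
			h1++
		}
	}
	t.stamps = kept
	return h1, h24, d7
}

// shouldSkipSubsystem is a hypothetical backpressure rule: skip only this
// subsystem when its recent error count reaches the threshold.
func shouldSkipSubsystem(errorsLastHour, threshold int) bool {
	return errorsLastHour >= threshold
}

func main() {
	now := int64(1766400000)
	tr := &ErrorTracker{}
	tr.Record(now - 10)      // seconds ago
	tr.Record(now - 2*hour)  // within 24h
	tr.Record(now - 3*day)   // within 7d
	tr.Record(now - 8*day)   // outside every window

	h1, h24, d7 := tr.Counts(now)
	fmt.Printf("1h=%d 24h=%d 7d=%d skip=%v\n", h1, h24, d7, shouldSkipSubsystem(h1, 3))
}
```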
### Phase 5: Full Ed25519 Communications
- Sign all agent-to-server POST requests
- Sign server responses
- Response verification middleware
- Estimated: 10 files, ~1400 LOC, HIGH RISK
### Phase 6: Per-Agent Settings
- agent_settings JSONB or extend agent_subsystems table
- Settings API endpoints
- Per-agent configurable intervals, timeouts
- Estimated: 6 files, ~700 LOC
---
## Release Guidance
### For v0.1.28 (Current Alpha)
**Release as-is.** The implemented security model (TLS+JWT+hardware binding+Ed25519 command signing) is sufficient for homelab use.
### For v0.1.29 (Next Release)
**Panic Recovery** - Actual reliability improvement, not just nice-to-have.
### For v0.1.30+ (Future)
**Error Transparency** - Audit trail for operations.
### README Wording Suggestion
Change `"All subsequent communications verified via Ed25519 signatures"` to:
- `"Commands and updates are verified via Ed25519 signatures"`
Or
- `"Server-to-agent communications are verified via Ed25519 signatures"`
---
## Design Questions & Resolutions
| Q | Decision | Rationale |
|---|----------|-----------|
| Q1 Panic Recovery | B) Hard Recovery | Service managers handle restarts |
| Q2 Startup Event | Full | GetSystemInfo() already has all fields |
| Q3 Build Scope | A) Verify only | Signing service for pre-built binaries |
| Q4 Event Batching | A) Bundle in check-in | Server already processes buffered_events |
| Q5 Event Persistence | B) Persisted + backoff | events_buffer.json + syncServerConfigWithRetry pattern |
| Q6 Scan Error Granularity | A) One event per scan | Prevents flood, matches UI |
| Q7 Timeline Refactor | B) Split into multiple files | 51KB monolith needs splitting |
| Q8 Status Card API | Layered progressive API | Simple → Extended → System-level |
| Q9 Scanner Factory | D) Unify creation only | Follows InstallerFactory pattern |
| Q10 Handler Pattern | C) Switch + registration | Go idiom, extensible via registration |
| Q11 Error Window | D) Multiple windows (1h + 24h + 7d) | Comprehensive short/mid/long term view |
| Q12 Backpressure | B) Skip only that subsystem | ETHOS "Assume Failure" - isolate failures |
| Q13 Agent Key Generation | B) Reuse JWT | JWT + Machine ID binding sufficient |
| Q14 Signature Format | C) path:body_hash:timestamp:nonce | Prevents replay attacks |
| Q15 Rollout | A) Dual-mode transition | Follow MachineBindingMiddleware pattern |
| Q16 Settings Store | B) with `agent_subsystems` extension | That table already handles subsystem settings |
| Q17 Override Priority | B) Per-agent merges with global | Follows existing config merge pattern |
| Q18 Order | B) Phases 0-1 first | Database/migrations foundational |
| Q19 Testing | B) Integration tests only | No E2E infrastructure exists |
| Q20 Breaking Changes | Acceptable with planning | README acknowledges breaking changes, proven rollout pattern |
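
To make the Q14 format concrete, here is a minimal sketch of signing and verifying a `path:body_hash:timestamp:nonce` message with Go's standard `crypto/ed25519`. The helper names, the choice of SHA-256 with hex encoding for the body hash, and the example path are assumptions; the table only fixes the field order.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
	"strings"
)

// canonicalMessage builds "path:body_hash:timestamp:nonce". The body hash is
// assumed here to be hex-encoded SHA-256.
func canonicalMessage(path string, body []byte, unixTS int64, nonce string) []byte {
	sum := sha256.Sum256(body)
	parts := []string{
		path,
		hex.EncodeToString(sum[:]),
		strconv.FormatInt(unixTS, 10),
		nonce,
	}
	return []byte(strings.Join(parts, ":"))
}

// signRequest signs the canonical message with an Ed25519 private key.
func signRequest(priv ed25519.PrivateKey, path string, body []byte, unixTS int64, nonce string) []byte {
	return ed25519.Sign(priv, canonicalMessage(path, body, unixTS, nonce))
}

// verifyRequest checks the signature over the same canonical message.
// Timestamp freshness and nonce reuse must still be checked separately.
func verifyRequest(pub ed25519.PublicKey, path string, body []byte, unixTS int64, nonce string, sig []byte) bool {
	return ed25519.Verify(pub, canonicalMessage(path, body, unixTS, nonce), sig)
}

// newKeyPair generates a throwaway key pair for the demo.
func newKeyPair() (ed25519.PublicKey, ed25519.PrivateKey) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}
	return pub, priv
}

func main() {
	pub, priv := newKeyPair()
	path := "/api/v1/agents/example-id/commands" // illustrative path
	body := []byte(`{"status":"ok"}`)
	ts := int64(1766400000)

	sig := signRequest(priv, path, body, ts, "nonce-1")
	fmt.Println("valid signature:", verifyRequest(pub, path, body, ts, "nonce-1", sig))
	fmt.Println("replayed with new nonce:", verifyRequest(pub, path, body, ts, "nonce-2", sig))
}
```

Binding the timestamp and nonce into the signed message is what lets the verifier reject replays. Because `:` is the separator, a path containing a colon would make the message ambiguous, so a real implementation may want length-prefixed or otherwise unambiguous encoding.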
---
## Related Documentation
- ETHOS principles in `/home/casey/Projects/RedFlag/docs/1_ETHOS/ETHOS.md`
- README at `/home/casey/Projects/RedFlag/README.md`
- ChristmasTodos created: December 22, 2025
---
## LEGACY .MD FILES - ISSUE INVESTIGATION (Checked Dec 22, 2025)
### Investigation Results from .md Files in Root Directory
Subagents investigated `SOMEISSUES_v0.1.26.md`, `DEPLOYMENT_ISSUES_v0.1.26.md`, `MIGRATION_ISSUES_POST_MORTEM.md`, and `TODO_FIXES_SUMMARY.md`.
### Category: Scan ReportLog Issues (from SOMEISSUES_v0.1.26.md)
| Issue | Status | Evidence |
|-------|--------|----------|
| #1 Storage scans appearing on Updates | **FIXED** | `subsystem_handlers.go:119-123`: ReportLog removed, comment says "[REMOVED logReport after ReportLog removal - unused]" |
| #2 System scans appearing on Updates | **STILL PRESENT** | `subsystem_handlers.go:187-201`: Still has logReport with `Action: "scan_system"` and calls `reportLogWithAck()` |
| #3 Duplicate "Scan All" entries | **FIXED** | `handleScanUpdatesV2` function no longer exists in codebase |
### Category: Route Registration Issues
| Issue | Status | Evidence |
|-------|--------|----------|
| #4 Storage metrics routes | **FIXED** | Routes registered at `main.go:473` (POST) and `:483` (GET) |
| #5 Metrics routes | **FIXED** | Route registered at `main.go:469` for POST /:id/metrics |
### Category: Migration Bugs (from MIGRATION_ISSUES_POST_MORTEM.md)
| Issue | Status | Evidence |
|-------|--------|----------|
| #1 Migration 017 duplicate column | **FIXED** | Now creates unique constraint, no ADD COLUMN |
| #2 Migration 021 manual INSERT | **FIXED** | No INSERT INTO schema_migrations present |
| #3 Duplicate INSERT in migration runner | **FIXED** | Only one INSERT at db.go:121 (success path) |
| #4 agent_commands_pkey violation | **STILL PRESENT** | Frontend reuses command ID for rapid scans; no fix implemented |
### Category: Frontend Code Quality
| Issue | Status | Evidence |
|-------|--------|----------|
| #7 Duplicate frontend files | **STILL PRESENT** | Both `AgentUpdates.tsx` and `AgentUpdatesEnhanced.tsx` still exist |
| #8 V2 naming pattern | **FIXED** | No `handleScanUpdatesV2` found - function renamed |
### Summary: Still Present Issues
| Category | Count | Issues |
|----------|-------|--------|
| **STILL PRESENT** | 3 | System scan ReportLog, agent_commands_pkey, duplicate frontend files |
| **FIXED** | 8 | Storage ReportLog, duplicate scan entries, storage/metrics routes (2), migration bugs (3), V2 naming |
| **TOTAL** | 11 | - |
### Are Any of These Blockers?
**NO.** None of the 3 remaining issues are blocking a release:
1. **System scan ReportLog** - Data goes to update_logs table instead of dedicated metrics table, but functionality works
2. **agent_commands_pkey** - Only occurs on rapid button clicking, first click works fine
3. **Duplicate frontend files** - Code quality issue, doesn't affect functionality
These are minor data-location or code quality issues that can be addressed in a follow-up commit.
---
## PROGRESS TRACKING - Dec 23, 2025 Session
### Completed This Session
| Task | Status | Notes |
|------|--------|-------|
| **Migration 025** | ✅ COMPLETE | Platform-specific subsystems (apt, dnf, windows, winget) |
| **Scheduler Fix** | ✅ COMPLETE | Removed "updates" from getDefaultInterval() |
| **README Language Fix** | ✅ COMPLETE | Changed security language to be accurate |
| **EventBuffer Integration** | ✅ COMPLETE | main.go:747 now uses NewOrchestratorWithEvents() |
| **TimeContext Implementation** | ✅ COMPLETE | Created TimeContext + updated 13 frontend files for smooth UX |
### Files Created/Modified This Session
**New Files:**
- `aggregator-server/internal/database/migrations/025_platform_scanner_subsystems.up.sql`
- `aggregator-server/internal/database/migrations/025_platform_scanner_subsystems.down.sql`
- `aggregator-web/src/contexts/TimeContext.tsx`
**Modified Files:**
- `aggregator-server/internal/scheduler/scheduler.go` - Removed "updates" interval
- `aggregator-server/internal/database/queries/subsystems.go` - Removed "updates" from CreateDefaultSubsystems
- `README.md` - Fixed security language
- `aggregator-agent/cmd/agent/main.go` - Use NewOrchestratorWithEvents
- `aggregator-agent/internal/handlers/scan.go` - Removed redundant bufferScanFailure (orchestrator handles it)
- `aggregator-web/src/App.tsx` - Added TimeProvider wrapper
- `aggregator-web/src/pages/Agents.tsx` - Use TimeContext
- `aggregator-web/src/components/AgentHealth.tsx` - Use TimeContext
- `aggregator-web/src/components/AgentStorage.tsx` - Use TimeContext
- `aggregator-web/src/components/AgentUpdatesEnhanced.tsx` - Use TimeContext
- `aggregator-web/src/components/HistoryTimeline.tsx` - Use TimeContext
- `aggregator-web/src/components/Layout.tsx` - Use TimeContext
- `aggregator-web/src/components/NotificationCenter.tsx` - Use TimeContext
- `aggregator-web/src/pages/TokenManagement.tsx` - Use TimeContext
- `aggregator-web/src/pages/Docker.tsx` - Use TimeContext
- `aggregator-web/src/pages/LiveOperations.tsx` - Use TimeContext
- `aggregator-web/src/pages/Settings.tsx` - Use TimeContext
- `aggregator-web/src/pages/Updates.tsx` - Use TimeContext
### Pre-Existing Bugs (NOT Fixed This Session)
**TypeScript Build Errors** - These were already present before our changes:
- `src/components/AgentHealth.tsx` - metrics.checks type errors
- `src/components/AgentUpdatesEnhanced.tsx` - installUpdate, getCommandLogs, setIsLoadingLogs errors
- `src/pages/Updates.tsx` - isLoading property errors
- `src/pages/SecuritySettings.tsx` - type errors
- Unused imports in Settings.tsx, TokenManagement.tsx
### Remaining from ChristmasTodos
**Phase 0: Panic Recovery (~3 hours)**
- [ ] Create `internal/recovery/panic.go`
- [ ] Create `internal/startup/event.go`
- [ ] Wrap main.go and windows.go with panic recovery
- [ ] Build verification
**Phase 1: Error Transparency (~5.5 hours)**
- [ ] Update Phase 0.3: Verify binary signatures
- [ ] Scan handler events: Note - Orchestrator ALREADY handles event buffering internally
- [ ] Check-in/config sync/offline events
**Cleanup (~30 min)**
- [ ] Remove unused files from DEC20_CLEANUP_PLAN.md
- [ ] Build verification of all components
**Legacy Issues** (from ChristmasTodos lines 538-573)
- [ ] System scan ReportLog cleanup
- [ ] agent_commands_pkey violation fix
- [ ] Duplicate frontend files (`AgentUpdates.tsx` vs `AgentUpdatesEnhanced.tsx`)
### Next Session Priorities
1. **Immediate**: Fix pre-existing TypeScript errors (AgentHealth, AgentUpdatesEnhanced, etc.)
2. **Cleanup**: Move outdated MD files to docs root directory
3. **Phase 0**: Implement panic recovery for reliability
4. **Phase 1**: Complete error transparency system
---
## COMPREHENSIVE STATUS VERIFICATION - Dec 24, 2025
### Verification Methodology
Code-reviewer agent verified ALL items marked as "COMPLETE" by reading actual source code files and confirming implementation against ChristmasTodos specifications.
### VERIFIED COMPLETE Items (5/5)
| # | Item | Verification | Evidence |
|---|------|--------------|----------|
| 1 | Migration 025 (Platform Scanners) | ✅ | `025_platform_scanner_subsystems.up/.down.sql` exist and are correct |
| 2 | Scheduler Fix (remove 'updates') | ✅ | No "updates" found in scheduler.go (grep confirms) |
| 3 | README Security Language | ✅ | Line 51: "Commands and updates are verified via Ed25519 signatures" |
| 4 | Orchestrator EventBuffer | ✅ | main.go:745 uses `NewOrchestratorWithEvents(apiClient.EventBuffer)` |
| 5 | TimeContext Implementation | ✅ | TimeContext.tsx exists + 13 frontend files verified using `useTime` hook |
### PHASE 0: Panic Recovery - ❌ NOT STARTED (0%)
| Item | Expected | Actual | Status |
|------|----------|---------|--------|
| Create `internal/recovery/panic.go` | New file | **Directory doesn't exist** | ❌ NOT DONE |
| Create `internal/startup/event.go` | New file | **Directory doesn't exist** | ❌ NOT DONE |
| Wrap main.go/windows.go | Add panic wrappers | **Not wrapped** | ❌ NOT DONE |
| Build verification | VerifyBinarySignature() | **Not verified present** | ❌ NOT DONE |
### PHASE 1: Error Transparency - ~25% PARTIAL
| Subtask | Status | Evidence |
|---------|--------|----------|
| Event helpers (internal/event/helpers.go) | ⚠️ PARTIAL | Helpers exist, retry.go missing |
| Scan handler events | ⚠️ PARTIAL | Orchestrator handles internally |
| Lifecycle events | ❌ NOT DONE | Integration not wired |
| Buffered event reporting | ❌ NOT DONE | SendBufferedEventsWithRetry not implemented |
| Server enhancements (100 limit) | ❌ NOT DONE | No metrics logging |
### OVERALL IMPLEMENTATION STATUS
| Category | Total | ✅ Complete | ❌ Not Done | ⚠️ Partial | % Done |
|----------|-------|-------------|-------------|------------|--------|
| Explicit "COMPLETE" items | 5 | 5 | 0 | 0 | 100% |
| Phase 0 items | 3 | 0 | 3 | 0 | 0% |
| Phase 1 items | 6 | 1.5 | 3.5 | 1 | ~25% |
| **Phase 0+1 TOTAL** | 9 | 1.5 | 6.5 | 1 | **~17%** |
---
## BLOCKER ASSESSMENT FOR v0.1.28 ALPHA
### 🚨 TRUE BLOCKERS (Must Fix Before Release)
**NONE** - Release guidance explicitly states v0.1.28 can "Release as-is" (line 468) and confirms system is "functionally sufficient for alpha release" (line 176).
### ⚠️ HIGH PRIORITY (Should Fix - Affects UX/Reliability)
| Priority | Item | Impact | Effort | Notes |
|----------|------|--------|--------|-------|
| **P0** | TypeScript Build Errors | Build blocking | **Unknown** | **VERIFY BUILD NOW** - if `npm run build` fails, fix before release |
| **P1** | agent_commands_pkey | UX annoyance (rapid clicks) | Medium | First click always works, retryable |
| **P2** | Duplicate frontend files | Code quality/maintenance | Low | AgentUpdates.tsx vs AgentUpdatesEnhanced.tsx |
### 💚 NICE TO HAVE (Quality Improvements - Not Blocking)
| Priority | Item | Target Release |
|----------|------|----------------|
| **P3** | Phase 0: Panic Recovery | v0.1.29 (per ChristmasTodos line 471) |
| **P4** | Phase 1: Error Transparency | v0.1.30+ (per ChristmasTodos line 474) |
| **P5** | System scan ReportLog cleanup | When convenient |
| **P6** | General cleanup (unused files) | Low priority |
### 🎯 RELEASE RECOMMENDATION: PROCEED WITH v0.1.28 ALPHA
**Rationale:**
1. Explicit guidance says "Release as-is"
2. Core security features exist and work (Ed25519, hardware binding, rate limiting)
3. No functional blockers - all remaining are quality-of-life improvements
4. Homelab/alpha users accept rough edges
5. Serviceable workarounds exist for known issues
**Immediate Actions Before Release:**
- Verify `npm run build` passes (if it fails, fix the TypeScript errors before tagging)
- Run integration tests on Go components
- Update changelog with known issues
- Tag and release v0.1.28
**Post-Release Priorities:**
1. **v0.1.29**: Panic Recovery (line 471 - "Actual reliability improvement")
2. **v0.1.30+**: Error Transparency system (line 474)
3. Throughout: Fix the `agent_commands_pkey` violation and do general cleanup as time permits
---
## main.go REFACTORING ANALYSIS - Dec 24, 2025
### Assessment: YES - main.go needs refactoring
**Current Issues:**
- **Size:** 1,995 lines
- **God Function:** `runAgent()` is 1,119 lines, a textbook violation of the Single Responsibility Principle
- **ETHOS Violation:** "Modular Components" principle not followed
- **Testability:** Near-zero unit test coverage for core agent logic
### ETHOS Alignment Analysis
| ETHOS Principle | Status | Issue |
|----------------|--------|-------|
| "Errors are History" | ✅ FOLLOWED | Events buffered with full context |
| "Security is Non-Negotiable" | ✅ FOLLOWED | Ed25519 verification implemented |
| "Modular Components" | ❌ VIOLATED | 1,995-line file contains all concerns |
| "Assume Failure; Build for Resilience" | ⚠️ PARTIAL | Panic recovery exists but only at top level |
### Major Code Blocks Identified
```
1. CLI Flag Parsing & Command Routing (lines 98-355) - 258 lines
2. Registration Flow (lines 357-468) - 111 lines
3. Service Lifecycle Management (Windows) - 35 lines embedded
4. Agent Initialization (lines 673-802) - 129 lines
5. Main Polling Loop (lines 834-1155) - 321 lines ← GOD FUNCTION
6. Command Processing Switch (lines 1060-1150) - 90 lines
7. Command Handlers (lines 1358-1994) - 636 lines across 10 functions
```
### Proposed File Structure After Refactoring
```
aggregator-agent/
├── cmd/
│ └── agent/
│ ├── main.go # 40-60 lines: entry point only
│ └── cli.go # CLI parsing & command routing
├── internal/
│ ├── agent/
│ │ ├── loop.go # Main polling/orchestration loop
│ │ ├── connection.go # Connection state & resilience
│ │ └── metrics.go # System metrics collection
│ ├── command/
│ │ ├── dispatcher.go # Command routing/dispatch
│ │ └── processor.go # Command execution framework
│ ├── handlers/
│ │ ├── install.go # install_updates handler
│ │ ├── dryrun.go # dry_run_update handler
│ │ ├── heartbeat.go # enable/disable_heartbeat
│ │ ├── reboot.go # reboot handler
│ │ └── systeminfo.go # System info reporting
│ ├── registration/
│ │ └── service.go # Agent registration logic
│   ├── service/
│   │   └── cli.go             # Windows service CLI commands
│   └── startup/
│       └── initializer.go     # Agent initialization (Phase 3)
```
### Refactoring Complexity: MODERATE-HIGH (5-7/10)
- **Raises complexity: high coupling** between components (`ackTracker`, `apiClient`, and `cfg` are passed everywhere)
- **Raises complexity: implicit dependencies** through package-level imports
- **Lowers risk: clear functional boundaries** and existing test points
- **Lowers risk: good internal structure**, which makes this refactor less risky than is typical for a file this size
**Effort Estimate:** 3-5 days for an experienced Go developer
### Benefits of Refactoring
#### 1. ETHOS Alignment
- **Modular Components:** Clear separation allows isolated testing/development
- **Assume Failure:** Smaller functions enable better panic recovery wrapping
- **Error Transparency:** Easier to maintain error context with single responsibilities
#### 2. Maintainability
- **Testability:** Each component can be unit tested independently
- **Code Review:** Smaller files (~100-300 lines) are easier to review
- **Onboarding:** New developers understand one component at a time
- **Debugging:** Stack traces show precise function names instead of `main.runAgent`
#### 3. Panic Recovery Improvement
**Current (Limited):**
```go
panicRecovery.Wrap(func() error {
    return runAgent(cfg) // If a scanner panics, the whole agent exits
})
```
**After (Granular):**
```go
panicRecovery.Wrap("main_loop", func() error {
    return agent.RunLoop(cfg) // Loop-level protection
})

// Inside agent/loop.go - per-scan protection
panicRecovery.Wrap("apt_scan", func() error {
    return scanner.Scan()
})
```
#### 4. Extensibility
- Adding new commands: Implement handler interface and register in dispatcher
- New scanner types: No changes to main loop required
- Platform-specific features: Isolated in platform-specific files
### Phased Refactoring Plan
**Phase 1 (Immediate):** Extract CLI and service commands
- Move lines 98-355 to `cli.go`
- Extract Windows service commands to `service/cli.go`
- **Risk:** Low - pure code movement
- **Time:** 2-3 hours
**Phase 2 (Short-term):** Extract command handlers
- Create `internal/handlers/` package
- Move each command handler to separate file
- **Risk:** Low - handlers already isolated
- **Time:** 1 day
**Phase 3 (Medium-term):** Break up runAgent() god function
- Extract initialization to `startup/initializer.go`
- Extract main loop orchestration to `agent/loop.go`
- Extract connection state logic to `agent/connection.go`
- **Risk:** Medium - requires careful dependency management
- **Time:** 2-3 days
**Phase 4 (Long-term):** Implement command dispatcher pattern
- Create `command/dispatcher.go` to replace switch statement
- Implement handler registration pattern
- **Risk:** Low-Medium
- **Time:** 1 day
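
One way the Phase 4 dispatcher could be shaped, as a rough sketch only. The `Command`, `Handler`, `HandlerFunc`, and `Dispatcher` types below are hypothetical and do not reflect the agent's existing types; only the command names (`reboot`, `install_updates`) come from the handler list above.

```go
package main

import "fmt"

// Command is a hypothetical stand-in for the agent's command type.
type Command struct {
	Type   string
	Params map[string]string
}

// Handler processes a single command type.
type Handler interface {
	Handle(cmd Command) error
}

// HandlerFunc adapts a plain function to the Handler interface.
type HandlerFunc func(cmd Command) error

// Handle calls the underlying function.
func (f HandlerFunc) Handle(cmd Command) error { return f(cmd) }

// Dispatcher maps command types to handlers, taking the place of a
// large switch statement in the polling loop.
type Dispatcher struct {
	handlers map[string]Handler
}

// NewDispatcher returns a dispatcher with no handlers registered.
func NewDispatcher() *Dispatcher {
	return &Dispatcher{handlers: make(map[string]Handler)}
}

// Register adds a handler; registering the same type twice is an error.
func (d *Dispatcher) Register(cmdType string, h Handler) error {
	if _, exists := d.handlers[cmdType]; exists {
		return fmt.Errorf("handler already registered for %q", cmdType)
	}
	d.handlers[cmdType] = h
	return nil
}

// Dispatch routes cmd to its handler, or reports an unknown command type.
func (d *Dispatcher) Dispatch(cmd Command) error {
	h, ok := d.handlers[cmd.Type]
	if !ok {
		return fmt.Errorf("unknown command type %q", cmd.Type)
	}
	return h.Handle(cmd)
}

func main() {
	d := NewDispatcher()
	_ = d.Register("reboot", HandlerFunc(func(Command) error {
		fmt.Println("reboot requested")
		return nil
	}))
	// No handler is registered for install_updates, so this reports an error.
	if err := d.Dispatch(Command{Type: "install_updates"}); err != nil {
		fmt.Println("dispatch error:", err)
	}
}
```

With this shape, adding a command means writing one handler and one `Register` call, and each handler can be unit tested without running the polling loop.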
### Final Verdict: REFACTORING RECOMMENDED
The 1,995-line main.go violates core software engineering principles and ETHOS guidelines. The presence of a 1,119-line `runAgent()` god function creates significant maintainability and reliability risks.
**Investment:** 3-5 days
**Returns:**
- Testability (currently near-zero)
- Error handling (granular panic recovery per ETHOS)
- Developer velocity (smaller, focused components)
- Production stability (better fault isolation)
The code is well-structured internally (clear sections, good logging, consistent patterns) which makes refactoring lower risk than typical for files this size.
---
## NEXT SESSION NOTES (Dec 24, 2025)
### User Intent
Work pausing for Christmas break. Will proceed with ALL pending items soon.
### FULL REFACTOR - ALL BEFORE v0.2.0
1. **main.go Full Refactor** - 1,995-line file broken down (3-5 days)
- Extract CLI commands, handlers, main loop to separate files
- Enables granular panic recovery per ETHOS
2. **Phase 0: Panic Recovery** (internal/recovery/panic.go, internal/startup/event.go)
- Wrap main.go and windows.go with panic recovery
- Build verification (VerifyBinarySignature)
3. **Phase 1: Error Transparency** (completion)
- Event helpers, retry logic
- Scan handler events
- Lifecycle events
- Buffered event reporting
- Server enhancements
4. **Cleanup**
- Remove unused files
- Fix agent_commands_pkey violation
- Consolidate duplicate frontend files
- System scan ReportLog cleanup
**Then v0.2.0 Release**
### Current State Summary
- v0.1.28 ALPHA: Ready for release after TypeScript build verification
- Phase 0+1: ~10% complete (5/5 items marked "COMPLETE", but actual Phase 0/1 work not done)
- main.go: 1,995 lines, needs refactoring
- TypeScript: 100+ errors remaining (mostly unused variables)
---
## Status
Created: December 22, 2025
Last Updated: December 24, 2025 (Verification + Blocker Assessment + main.go Analysis + Next Session Notes)