Issue #1 Proper Solution Design
Problem Root Cause
Agent check-in interval was being incorrectly overridden by scanner subsystem intervals.
Current State After Kimi's "Fast Fix"
- Line that overrode check-in interval was removed
- Scanner intervals are logged but not applied to agent polling
- Separation exists but without validation or protection
What's Missing (Why It's Not "Proper" Yet)
1. No Validation
- No bounds checking on interval values
- Could accept negative intervals
- Could accept intervals that are too short (causing server overload)
- Could accept intervals that are too long (causing agent to appear dead)
2. No Idempotency Verification
- Not tested that operations can be run multiple times safely
- Config updates might not be idempotent
- No verification that rapid mode toggling is safe
3. No Protection Against Regressions
- No guardrails to prevent future developer from re-introducing the bug
- No comments explaining WHY separation is critical
- No architectural documentation
- No tests that would catch if someone re-introduces the override
4. Insufficient Error Handling
- syncServerConfig runs in a goroutine with no error recovery
- No retry logic if server temporarily unavailable
- No degraded mode operation
- No circuit breaker pattern
5. No Comprehensive Logging
- No context about WHAT changed in interval
- No history of interval changes over time
- No error context for debugging
Proper Solution Design
Component 1: Validation Layer
```go
type IntervalValidator struct {
	minCheckInSeconds int // 60 seconds (1 minute)
	maxCheckInSeconds int // 3600 seconds (1 hour)
	minScannerMinutes int // 1 minute
	maxScannerMinutes int // 1440 minutes (24 hours)
}

func (v *IntervalValidator) ValidateCheckInInterval(seconds int) error {
	if seconds < v.minCheckInSeconds {
		return fmt.Errorf("check-in interval %d seconds below minimum %d",
			seconds, v.minCheckInSeconds)
	}
	if seconds > v.maxCheckInSeconds {
		return fmt.Errorf("check-in interval %d seconds above maximum %d",
			seconds, v.maxCheckInSeconds)
	}
	return nil
}

func (v *IntervalValidator) ValidateScannerInterval(minutes int) error {
	if minutes < v.minScannerMinutes {
		return fmt.Errorf("scanner interval %d minutes below minimum %d",
			minutes, v.minScannerMinutes)
	}
	if minutes > v.maxScannerMinutes {
		return fmt.Errorf("scanner interval %d minutes above maximum %d",
			minutes, v.maxScannerMinutes)
	}
	return nil
}
```
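Wiring the validator into the sync path might look like the sketch below. The constructor NewIntervalValidator and its default bounds are assumptions here, taken from the comments on the struct fields, not an existing API:

```go
package main

import (
	"fmt"
	"log"
)

// IntervalValidator mirrors the struct above; only the check-in path is
// exercised in this sketch.
type IntervalValidator struct {
	minCheckInSeconds int
	maxCheckInSeconds int
	minScannerMinutes int
	maxScannerMinutes int
}

// NewIntervalValidator is a hypothetical constructor applying the
// documented defaults (60s-3600s check-in, 1-1440min scanner).
func NewIntervalValidator() *IntervalValidator {
	return &IntervalValidator{
		minCheckInSeconds: 60,
		maxCheckInSeconds: 3600,
		minScannerMinutes: 1,
		maxScannerMinutes: 1440,
	}
}

func (v *IntervalValidator) ValidateCheckInInterval(seconds int) error {
	if seconds < v.minCheckInSeconds || seconds > v.maxCheckInSeconds {
		return fmt.Errorf("check-in interval %d seconds outside [%d, %d]",
			seconds, v.minCheckInSeconds, v.maxCheckInSeconds)
	}
	return nil
}

func main() {
	v := NewIntervalValidator()
	// An out-of-bounds value from the server is rejected, never applied.
	if err := v.ValidateCheckInInterval(5); err != nil {
		log.Printf("[ERROR] [agent] [config] rejected interval: %v", err)
	}
	// An in-bounds value passes through.
	if err := v.ValidateCheckInInterval(300); err == nil {
		fmt.Println("interval 300s accepted")
	}
}
```

The key design point is that validation happens before assignment, so an invalid server value can never reach the agent's polling loop.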
Component 2: Regression Guard (Override Detection)
```go
type IntervalGuardian struct {
	lastValidatedInterval int // last interval value that passed validation
	violationCount        int // how many override attempts have been caught
}

// CheckForOverrideAttempt flags any proposed interval that differs from the
// current validated one, so a re-introduced scanner override fails loudly.
func (g *IntervalGuardian) CheckForOverrideAttempt(currentInterval, proposedInterval int) error {
	if currentInterval != proposedInterval {
		g.violationCount++
		return fmt.Errorf("INTERVAL_OVERRIDE_DETECTED: current=%d, proposed=%d, violations=%d",
			currentInterval, proposedInterval, g.violationCount)
	}
	return nil
}

func (g *IntervalGuardian) GetViolationCount() int {
	return g.violationCount
}
```
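A minimal sketch of the guardian catching the original bug class: a scanner-derived value being pushed onto the check-in interval. The 300s/60s values are illustrative:

```go
package main

import "fmt"

// IntervalGuardian as defined above, reproduced so this sketch is standalone.
type IntervalGuardian struct {
	lastValidatedInterval int
	violationCount        int
}

func (g *IntervalGuardian) CheckForOverrideAttempt(currentInterval, proposedInterval int) error {
	if currentInterval != proposedInterval {
		g.violationCount++
		return fmt.Errorf("INTERVAL_OVERRIDE_DETECTED: current=%d, proposed=%d, violations=%d",
			currentInterval, proposedInterval, g.violationCount)
	}
	return nil
}

func (g *IntervalGuardian) GetViolationCount() int {
	return g.violationCount
}

func main() {
	g := &IntervalGuardian{}
	// Hypothetical regression: scanner subsystem tries to push its 60s
	// interval onto the agent's 300s check-in interval.
	if err := g.CheckForOverrideAttempt(300, 60); err != nil {
		fmt.Println(err)
	}
	fmt.Println("violations:", g.GetViolationCount())
}
```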
Component 3: Error Recovery with Retry
```go
func syncServerConfigWithRetry(apiClient *client.Client, cfg *config.Config, maxRetries int) error {
	var lastErr error
	for attempt := 1; attempt <= maxRetries; attempt++ {
		if err := syncServerConfigProper(apiClient, cfg); err != nil {
			lastErr = err
			log.Printf("[ERROR] [agent] [config] sync attempt %d/%d failed: %v",
				attempt, maxRetries, err)
			if attempt < maxRetries {
				// Exponential backoff: 1s, 2s, 4s, 8s...
				backoff := time.Duration(1<<uint(attempt-1)) * time.Second
				time.Sleep(backoff)
			}
			continue
		}
		log.Printf("[SUCCESS] [agent] [config] synced after %d attempts", attempt)
		return nil
	}
	// After maxRetries, enter degraded mode
	cfg.SetDegradedMode(true)
	log.Printf("[WARNING] [agent] [config] entering degraded mode after %d failed attempts: %v",
		maxRetries, lastErr)
	return lastErr
}
```
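The retry path calls cfg.SetDegradedMode, which does not exist yet (Phase 3 adds it). A sketch of what that addition to the config struct might look like; the field and method names are assumptions, and the mutex guards the flag against the goroutine the sync runs in:

```go
package main

import (
	"fmt"
	"sync"
)

// Config is a stand-in for the real config.Config, showing only the
// hypothetical degraded-mode additions.
type Config struct {
	mu              sync.RWMutex
	CheckInInterval int
	degradedMode    bool
}

// SetDegradedMode flips the flag under lock, since syncServerConfigWithRetry
// runs in its own goroutine while the polling loop reads the config.
func (c *Config) SetDegradedMode(on bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.degradedMode = on
}

func (c *Config) DegradedMode() bool {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return c.degradedMode
}

func main() {
	cfg := &Config{CheckInInterval: 300}
	cfg.SetDegradedMode(true) // entered after maxRetries failures
	fmt.Println("degraded:", cfg.DegradedMode())
}
```

In degraded mode the agent would keep polling on its last known-good interval rather than dying, which is what "degraded mode operation" in the gap list means.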
Component 4: Comprehensive Logging
```go
func logIntervalChange(subsystem string, oldInterval, newInterval int, reason string) {
	log.Printf("[HISTORY] [agent] [interval] subsystem=%s old=%d new=%d reason=%s timestamp=%s",
		subsystem, oldInterval, newInterval, reason, time.Now().Format(time.RFC3339))
	// Also save to a history table for auditing
	// (e.g. history.sqlite3 or a similar append-only store)
}
```
Implementation Plan
Phase 1: Validation Layer (30 minutes)
- Create interval_validator.go with bounds checking
- Add validation calls in syncServerConfig
- Add validation error logging
Phase 2: Guardians (30 minutes)
- Create interval_guardian.go with override detection
- Add guardian check after server config deserialization
- Add violation counter to state.json
Phase 3: Error Recovery (45 minutes)
- Create retry_sync.go with exponential backoff
- Add degraded mode to config struct
- Test retry logic manually
Phase 4: Comprehensive Logging (30 minutes)
- Add history logging function
- Log all interval changes in structured format
- Create history table schema if not exists
Phase 5: Tests (90 minutes)
- Unit tests for validation bounds
- Unit tests for guardian violations
- Unit tests for retry logic
- Integration test for full sync flow
- Integration test for error scenarios
- Integration test for rapid polling edge cases
Phase 6: Documentation & Cleanup (30 minutes)
- Update inline comments explaining WHY separation is critical
- Add architecture documentation
- Update session documentation
- Verify all ETHOS checklist items
Total Estimated Time: ~4.25 hours (255 minutes) for the proper fix
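Putting Components 1-4 together, the validated sync path might look like the sketch below. The serverConfig and agentConfig types are stand-ins for the real deserialized payload and config.Config, and the bounds are inlined for brevity:

```go
package main

import (
	"fmt"
	"log"
)

// serverConfig stands in for the payload deserialized from the server.
type serverConfig struct {
	CheckInSeconds int
	ScannerMinutes int
}

// agentConfig stands in for config.Config.
type agentConfig struct {
	CheckInInterval int
	ScannerInterval int
}

// syncServerConfigProper sketches the validated sync path: validate, log the
// change, then apply each value to its own field - scanner and check-in
// intervals never cross, which is the separation the whole fix protects.
func syncServerConfigProper(cfg *agentConfig, srv serverConfig) error {
	if srv.CheckInSeconds < 60 || srv.CheckInSeconds > 3600 {
		return fmt.Errorf("rejecting check-in interval %d", srv.CheckInSeconds)
	}
	if srv.ScannerMinutes < 1 || srv.ScannerMinutes > 1440 {
		return fmt.Errorf("rejecting scanner interval %d", srv.ScannerMinutes)
	}
	if cfg.CheckInInterval != srv.CheckInSeconds {
		log.Printf("[HISTORY] [agent] [interval] subsystem=checkin old=%d new=%d",
			cfg.CheckInInterval, srv.CheckInSeconds)
	}
	cfg.CheckInInterval = srv.CheckInSeconds // only the check-in value touches check-in
	cfg.ScannerInterval = srv.ScannerMinutes // scanner value stays in its own field
	return nil
}

func main() {
	cfg := &agentConfig{CheckInInterval: 300}
	if err := syncServerConfigProper(cfg, serverConfig{CheckInSeconds: 600, ScannerMinutes: 30}); err != nil {
		log.Fatal(err)
	}
	fmt.Println("check-in:", cfg.CheckInInterval, "scanner:", cfg.ScannerInterval)
}
```

Because validation fails before any assignment, a rejected payload leaves the config exactly as it was, which is also what makes the sync idempotent under repeated calls.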
Idempotency Verification
Test that operations can be run 3x safely:
```go
func TestSyncServerConfigIdempotency(t *testing.T) {
	cfg := createTestConfig()
	apiClient := createTestAPIClient()
	// Run sync 3 times
	for i := 0; i < 3; i++ {
		if err := syncServerConfigProper(apiClient, cfg); err != nil {
			t.Fatalf("Sync %d failed: %v", i+1, err)
		}
	}
	// Verify the check-in interval is unchanged (300s is the default that
	// createTestConfig is expected to set)
	if cfg.CheckInInterval != 300 {
		t.Fatalf("Check-in interval changed after 3 runs: %d", cfg.CheckInInterval)
	}
}
```
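The Phase 5 validation-bounds unit tests can be table-driven. A minimal standalone sketch, with the assumed 60s-3600s bounds inlined into a helper so the table runs outside the real package:

```go
package main

import "fmt"

// validateCheckIn inlines the documented check-in bounds (60s-3600s) so the
// bounds table below is self-contained.
func validateCheckIn(seconds int) error {
	const min, max = 60, 3600
	if seconds < min || seconds > max {
		return fmt.Errorf("check-in interval %d outside [%d, %d]", seconds, min, max)
	}
	return nil
}

func main() {
	// Exercise both edges, just inside, just outside, and degenerate values.
	cases := []struct {
		seconds int
		wantErr bool
	}{
		{-1, true}, {0, true}, {59, true},
		{60, false}, {300, false}, {3600, false},
		{3601, true},
	}
	for _, c := range cases {
		err := validateCheckIn(c.seconds)
		if (err != nil) != c.wantErr {
			fmt.Printf("FAIL: seconds=%d gotErr=%v wantErr=%v\n", c.seconds, err != nil, c.wantErr)
			return
		}
	}
	fmt.Println("all bounds cases pass")
}
```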
Why This Is "Proper"
- Validation: Prevents invalid states from ever being accepted
- Protection: Guardians catch regressions immediately
- Recovery: Robust error handling for transient failures
- Observability: Complete history of all changes
- Tested: Comprehensive automated tests
- Documented: Clear comments and architecture documentation
- Idempotent: Verified by tests that operations can be repeated
- Follows ETHOS: No shortcuts, honest logging, proper error handling
Technical Debt Resolution
Debt from Kimi's fast fix: NONE (Kimi actually got this one mostly right)
New debt introduced: NONE (assuming these phases are implemented properly)
The only "debt" is that all the phases still need to be implemented, but that's not debt - that's proper engineering.