Add docs and project files - force for Culurien

2026-03-28 20:46:24 -04:00
parent dc61797423
commit 484a7f77ce
343 changed files with 119530 additions and 0 deletions
--- a/CRITICAL-ADD-SMART-MONITORING.md
+++ b/CRITICAL-ADD-SMART-MONITORING.md
@@ -0,0 +1,177 @@
+# CRITICAL: Add SMART Disk Monitoring to RedFlag
+
+## Why This Is Urgent
+
+After tonight's USB drive transfer incident (107GB music collection, system crash due to I/O overload), it's painfully clear that **RedFlag needs disk health monitoring**.
+
+**What happened:**
+- First rsync attempt maxed out disk I/O (no bandwidth limit)
+- System became unresponsive, required hard reboot
+- NTFS filesystem on 10TB drive corrupted from unclean shutdown
+- exFAT USB drive also had unmount corruption
+- Lost ~4 hours to troubleshooting and recovery
+
+**The realization:** We have NO early warning system for disk health issues in RedFlag. We're flying blind on hardware failure until it's too late.
+
+## Why RedFlag Needs SMART Monitoring
+
+**Current gaps:**
+- ❌ No early warning of impending drive failure
+- ❌ No automatic disk health checks
+- ❌ No alerts for reallocated sectors, high temps, or pre-fail indicators
+- ❌ No monitoring of I/O saturation that could cause crashes
+- ❌ No proactive maintenance recommendations
+
+**What SMART monitoring gives us:**
+- ✅ Early warning of drive failure (days/weeks before total failure)
+- ✅ Temperature monitoring (prevent thermal throttling/damage)
+- ✅ Reallocated sector tracking (silent data corruption indicator)
+- ✅ I/O error rate monitoring (predicts filesystem corruption)
+- ✅ Proactive replacement recommendations (maintenance windows, not emergencies)
+- ✅ Correlation between update operations and disk health (did that update cause issues?)
+
+## Proposed Implementation
+
+### Disk Health Scanner Module
+
+```go
+// New scanner module: agent/pkg/scanners/disk_health.go
+// Package name assumed from the directory in the path above.
+package scanners
+
+import "time"
+
+// DiskHealthStatus is the per-device SMART summary reported by the agent.
+type DiskHealthStatus struct {
+    Device              string    `json:"device"`
+    SMARTStatus         string    `json:"smart_status"` // PASSED/FAILED
+    Temperature         int       `json:"temperature_c"`
+    ReallocatedSectors  int       `json:"reallocated_sectors"`
+    PendingSectors      int       `json:"pending_sectors"`
+    UncorrectableErrors int       `json:"uncorrectable_errors"`
+    PowerOnHours        int       `json:"power_on_hours"`
+    LastTestDate        time.Time `json:"last_self_test"`
+    HealthScore         int       `json:"health_score"` // 0-100
+    CriticalAttributes  []string  `json:"critical_attributes,omitempty"`
+}
+```
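The proposal calls for a 0-100 health score but does not define how it is computed. As a minimal sketch, one option is to start at 100 and subtract weighted penalties for the sector counters. The `smartCounters` type, the `healthScore` function, and all three weights below are hypothetical placeholders for illustration, not values taken from RedFlag or from any SMART standard.

```go
package main

// smartCounters is a hypothetical subset of DiskHealthStatus used for scoring.
type smartCounters struct {
	SMARTPassed         bool
	ReallocatedSectors  int
	PendingSectors      int
	UncorrectableErrors int
}

// healthScore maps the counters to a 0-100 score.
// The penalty weights (2, 5, 10) are illustrative assumptions only.
func healthScore(c smartCounters) int {
	if !c.SMARTPassed {
		return 0 // an overall SMART failure overrides everything else
	}
	score := 100
	score -= 2 * c.ReallocatedSectors
	score -= 5 * c.PendingSectors
	score -= 10 * c.UncorrectableErrors
	if score < 0 {
		score = 0 // clamp to the bottom of the scale
	}
	return score
}
```

Real weights would need tuning against observed drive failures; a linear penalty is simply the easiest starting point to reason about.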
+
+### Agent-Side Features
+
+1. **Scheduled SMART Checks**
+   - Run `smartctl -a` every 6 hours
+   - Parse critical attributes: 5 (Reallocated_Sector_Ct), 196 (Reallocated_Event_Count), 197 (Current_Pending_Sector), 198 (Offline_Uncorrectable)
+   - Calculate health score (0-100 scale)
+
+2. **Self-Test Scheduling**
+   - Short self-test: Weekly
+   - Long self-test: Monthly
+   - Log results to agent's local DB
+
+3. **I/O Monitoring**
+   - Track disk utilization %
+   - Monitor I/O wait times
+   - Alert on sustained >80% utilization (prevents crash scenarios)
+
+4. **Temperature Alerts**
+   - Warning at 45°C
+   - Critical at 50°C
+   - Log thermal throttling events
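The attribute parsing step in the scheduled checks above could look roughly like the following. This is a sketch that assumes the classic ATA attribute table printed by `smartctl -A` (ten whitespace-separated columns, with the raw value in the last one); NVMe drives print a different layout, and the function and variable names here are hypothetical.

```go
package main

import (
	"bufio"
	"strconv"
	"strings"
)

// criticalIDs are the attribute IDs named in the plan: 5, 196, 197, 198.
var criticalIDs = map[int]bool{5: true, 196: true, 197: true, 198: true}

// parseCriticalAttrs scans the text of an ATA attribute table and returns
// the raw value of each critical attribute found, keyed by attribute ID.
// Lines that do not look like table rows are skipped.
func parseCriticalAttrs(output string) map[int]int {
	raw := make(map[int]int)
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		f := strings.Fields(sc.Text())
		if len(f) < 10 {
			continue // blank line or non-table text
		}
		id, err := strconv.Atoi(f[0])
		if err != nil || !criticalIDs[id] {
			continue // header row ("ID#") or an attribute we do not track
		}
		v, err := strconv.Atoi(f[9])
		if err != nil {
			continue // raw value is not a plain integer
		}
		raw[id] = v
	}
	return raw
}
```

Scraping the text table is fragile across drive vendors. Recent smartmontools releases can also emit JSON, which would likely be the sturdier input if the minimum supported version allows it.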
+
+### Server-Side Features
+
+1. **Disk Health Dashboard**
+   - Show all drives across all agents
+   - Color-coded health status (green/yellow/red)
+   - Temperature graphs over time
+   - Predicted failure timeline
+
+2. **Alert System**
+   - Email when health score drops below 70
+   - Critical alert when below 50
+   - Immediate alert on SMART failure
+   - Temperature spike notifications
+
+3. **Maintenance Recommendations**
+   - "Drive /dev/sdb showing 15 reallocated sectors - recommend replacement within 30 days"
+   - "Temperature consistently above 45°C - check cooling"
+   - "Drive has 45,000 hours - consider proactive replacement"
+
+4. **Correlation with Updates**
+   - "System update initiated while disk I/O at 92% - potential correlation?"
+   - Track if updates cause disk health degradation
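The alert thresholds listed above (email below 70, critical below 50, immediate on SMART failure) reduce to a small decision function. The sketch below is one way to express them; the `AlertLevel` type, its constant names, and `alertLevel` are hypothetical, and the `"FAILED"` string follows the `smart_status` comment in the proposed struct.

```go
package main

// AlertLevel is a hypothetical severity scale for disk health alerts.
type AlertLevel int

const (
	AlertNone      AlertLevel = iota // health score is 70 or above
	AlertEmail                       // health score below 70
	AlertCritical                    // health score below 50
	AlertImmediate                   // drive reports overall SMART failure
)

// alertLevel applies the thresholds from the plan, most severe first.
func alertLevel(smartStatus string, healthScore int) AlertLevel {
	switch {
	case smartStatus == "FAILED":
		return AlertImmediate
	case healthScore < 50:
		return AlertCritical
	case healthScore < 70:
		return AlertEmail
	default:
		return AlertNone
	}
}
```

Checking the SMART status before the score means a drive that reports failure always escalates, even if its computed score still looks acceptable.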
+
+## Why This Can't Wait
+
+**The $600k/year ConnectWise can't do this:**
+- Their agents don't have hardware-level access
+- Cloud model prevents local disk monitoring
+- Proprietary code prevents community additions
+
+**RedFlag's advantage:**
+- Self-hosted agents have full system access
+- Open source - community can contribute disk monitoring
+- Hardware binding already in place - perfect foundation
+- Error transparency means we see disk issues immediately
+
+**Business case:**
+- One prevented data loss incident = justification
+- Proactive replacement vs emergency outage = measurable ROI
+- MSPs can offer disk health monitoring as premium service
+- Homelabbers catch failing drives before losing family photos
+
+## Technical Considerations
+
+**Dependencies:**
+- `smartmontools` package on agents (most distros have it)
+- Agent needs sudo access for `smartctl` (document in install)
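+- SMART data is read from the drive itself, independent of the filesystem, so NTFS volumes need no extra packages for SMART (`ntfs-3g` only matters for mounting them)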
+- Windows agents need different implementation (WMI)
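Because the agent depends on an external `smartctl` binary and on elevated access to the device, it helps to interpret the process exit status rather than only the printed output. The smartctl man page documents the exit status as a bitmask; the sketch below decodes it. The constant and function names are hypothetical, and the messages are paraphrased from the man page.

```go
package main

// Exit-status bits for smartctl, per the EXIT STATUS section of its man page.
const (
	bitCmdLineError = 1 << iota // bit 0: command line did not parse
	bitOpenFailed               // bit 1: device open failed (may indicate missing privileges)
	bitCommandFailed            // bit 2: a SMART or ATA command to the disk failed
	bitDiskFailing              // bit 3: SMART status check returned "DISK FAILING"
	bitPrefailNow               // bit 4: prefail attributes at or below threshold
	bitBelowThreshPast          // bit 5: attributes were at or below threshold in the past
	bitErrorLog                 // bit 6: device error log contains errors
	bitSelfTestLog              // bit 7: self-test log contains errors
)

// describeExit returns one message for each bit set in the exit status,
// ordered from bit 0 to bit 7. A status of 0 yields an empty slice.
func describeExit(status int) []string {
	flags := []struct {
		bit int
		msg string
	}{
		{bitCmdLineError, "command line did not parse"},
		{bitOpenFailed, "device open failed"},
		{bitCommandFailed, "SMART or ATA command failed"},
		{bitDiskFailing, "SMART status: disk failing"},
		{bitPrefailNow, "prefail attribute at or below threshold"},
		{bitBelowThreshPast, "attribute was at or below threshold in the past"},
		{bitErrorLog, "device error log contains errors"},
		{bitSelfTestLog, "self-test log contains errors"},
	}
	var out []string
	for _, f := range flags {
		if status&f.bit != 0 {
			out = append(out, f.msg)
		}
	}
	return out
}
```

In an agent, the status would typically come from the `ExitCode()` of the `*exec.ExitError` returned when the command exits non-zero. A set bit 1 is a useful hint that the sudo setup from the install docs is missing.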
+
+**Security:**
+- Limited to read-only SMART data
+- No disk modification commands
+- Agent already runs as a limited user; restrict its sudo access to `smartctl` only so the new check adds no broader privileges
+
+**Cross-platform:**
+- Linux: `smartctl` (easy)
+- Windows: WMI or the native Windows build of `smartctl` (needs research)
+- Docker: containers need the host's disk devices passed through to read SMART data
+
+## Next Steps
+
+1. **Immediate**: Add `smartmontools` to agent install scripts
+2. **This week**: Create PoC disk health scanner
+3. **Next sprint**: Integrate with agent heartbeat
+4. **v0.2.0**: Full disk health dashboard + alerts
+
+**Estimates:**
+- Linux scanner: 2-3 days
+- Windows scanner: 3-5 days (research needed)
+- Server dashboard: 3-4 days
+- Alert system: 2-3 days
+- Testing: 2-3 days
+
+**Total**: 12-18 working days (~2.5-3.5 weeks) to production-ready disk health monitoring
+
+## The Bottom Line
+
+Tonight's incident cost us:
+- 4 hours of troubleshooting
+- 107GB music collection at risk
+- 2 unclean shutdowns
+- Corrupted filesystems (NTFS + exFAT)
+- A lot of frustration
+
+**SMART and I/O monitoring could have:**
+- Warned about the USB drive issues before the copy
+- Alerted on I/O saturation before crash
+- Given us early warning on the 10TB drive health
+- Provided data to prevent the crash
+
+**This is infrastructure 101. We need this yesterday.**
+
+---
+
+**Priority**: CRITICAL
+**Effort**: Medium (12-18 working days)
+**Impact**: High (prevents data loss, adds competitive advantage)
+**User Requested**: YES (after tonight's incident)
+**ConnectWise Can't Match**: Hardware-level monitoring is their blind spot
+
+**Status**: Ready for implementation planning