# CRITICAL: Add SMART Disk Monitoring to RedFlag

## Why This Is Urgent

After tonight's USB drive transfer incident (107GB music collection, system crash due to I/O overload), it's painfully clear that **RedFlag needs disk health monitoring**.

**What happened:**
- First rsync attempt maxed out disk I/O (no bandwidth limit)
- System became unresponsive, required hard reboot
- NTFS filesystem on 10TB drive corrupted from unclean shutdown
- exFAT USB drive also had unmount corruption
- Lost ~4 hours to troubleshooting and recovery

**The realization:** We have NO early warning system for disk health issues in RedFlag. We're flying blind on hardware failure until it's too late.

## Why RedFlag Needs SMART Monitoring

**Current gaps:**
- ❌ No early warning of impending drive failure
- ❌ No automatic disk health checks
- ❌ No alerts for reallocated sectors, high temps, or pre-fail indicators
- ❌ No monitoring of I/O saturation that could cause crashes
- ❌ No proactive maintenance recommendations

**What SMART monitoring gives us:**
- ✅ Early warning of drive failure (days/weeks before total failure)
- ✅ Temperature monitoring (prevent thermal throttling/damage)
- ✅ Reallocated sector tracking (silent data corruption indicator)
- ✅ I/O error rate monitoring (predicts filesystem corruption)
- ✅ Proactive replacement recommendations (maintenance windows, not emergencies)
- ✅ Correlation between update operations and disk health (did that update cause issues?)

## Proposed Implementation

### Disk Health Scanner Module

```go
// New scanner module: agent/pkg/scanners/disk_health.go

import "time"

// DiskHealthStatus is the per-drive health report sent by the agent.
type DiskHealthStatus struct {
	Device              string    `json:"device"`
	SMARTStatus         string    `json:"smart_status"` // PASSED/FAILED
	Temperature         int       `json:"temperature_c"`
	ReallocatedSectors  int       `json:"reallocated_sectors"`
	PendingSectors      int       `json:"pending_sectors"`
	UncorrectableErrors int       `json:"uncorrectable_errors"`
	PowerOnHours        int       `json:"power_on_hours"`
	LastTestDate        time.Time `json:"last_self_test"`
	HealthScore         int       `json:"health_score"` // 0-100
	CriticalAttributes  []string  `json:"critical_attributes,omitempty"`
}
```

### Agent-Side Features

1. **Scheduled SMART Checks**
   - Run `smartctl -a` every 6 hours
   - Parse critical attributes (5 reallocated sectors, 196 reallocation events, 197 current pending sectors, 198 offline uncorrectable)
   - Calculate health score (0-100 scale)

2. **Self-Test Scheduling**
   - Short self-test: Weekly
   - Long self-test: Monthly
   - Log results to agent's local DB

3. **I/O Monitoring**
   - Track disk utilization %
   - Monitor I/O wait times
   - Alert on sustained >80% utilization (prevents crash scenarios)

4. **Temperature Alerts**
   - Warning at 45°C
   - Critical at 50°C
   - Log thermal throttling events

### Server-Side Features

1. **Disk Health Dashboard**
   - Show all drives across all agents
   - Color-coded health status (green/yellow/red)
   - Temperature graphs over time
   - Predicted failure timeline

2. **Alert System**
   - Email when health score drops below 70
   - Critical alert when below 50
   - Immediate alert on SMART failure
   - Temperature spike notifications

3. **Maintenance Recommendations**
   - "Drive /dev/sdb showing 15 reallocated sectors - recommend replacement within 30 days"
   - "Temperature consistently above 45°C - check cooling"
   - "Drive has 45,000 hours - consider proactive replacement"

4. **Correlation with Updates**
   - "System update initiated while disk I/O at 92% - potential correlation?"
   - Track if updates cause disk health degradation

## Why This Can't Wait

**The $600k/year ConnectWise can't do this:**
- Their agents don't have hardware-level access
- Cloud model prevents local disk monitoring
- Proprietary code prevents community additions

**RedFlag's advantage:**
- Self-hosted agents have full system access
- Open source - community can contribute disk monitoring
- Hardware binding already in place - perfect foundation
- Error transparency means we see disk issues immediately

**Business case:**
- One prevented data loss incident = justification
- Proactive replacement vs emergency outage = measurable ROI
- MSPs can offer disk health monitoring as premium service
- Homelabbers catch failing drives before losing family photos

## Technical Considerations

**Dependencies:**
- `smartmontools` package on agents (most distros have it)
- Agent needs sudo access for `smartctl` (document in install)
- SMART data comes from the drive itself, independent of the filesystem; NTFS volumes need `ntfs-3g` only for mounting, not for SMART
- Windows agents need different implementation (WMI)

**Security:**
- Limited to read-only SMART data
- No disk modification commands
- Agent already runs as limited user - the only elevation is the `smartctl` sudo access listed under Dependencies, so keep it scoped to that one binary

**Cross-platform:**
- Linux: `smartctl` (easy)
- Windows: WMI or `smartctl` via Cygwin (need research)
- Docker: need to pass host device access into the container

## Next Steps

1. **Immediate**: Add `smartmontools` to agent install scripts
2. **This week**: Create PoC disk health scanner
3. **Next sprint**: Integrate with agent heartbeat
4. **v0.2.0**: Full disk health dashboard + alerts

**Estimates:**
- Linux scanner: 2-3 days
- Windows scanner: 3-5 days (research needed)
- Server dashboard: 3-4 days
- Alert system: 2-3 days
- Testing: 2-3 days

**Total**: ~2 weeks to production-ready disk health monitoring

## The Bottom Line

Tonight's incident cost us:
- 4 hours of troubleshooting
- 107GB music collection at risk
- 2 unclean shutdowns
- Corrupted filesystems (NTFS + exFAT)
- A lot of frustration

**SMART and I/O monitoring could have:**
- Warned about any USB drive health issues before the copy (where the USB bridge passes SMART data through)
- Alerted on I/O saturation before the crash
- Given us early warning on the 10TB drive health
- Provided data to prevent the crash

**This is infrastructure 101. We need this yesterday.**

---

**Priority**: CRITICAL
**Effort**: Medium (2 weeks)
**Impact**: High (prevents data loss, adds competitive advantage)
**User Requested**: YES (after tonight's incident)
**ConnectWise Can't Match**: Hardware-level monitoring is their blind spot
**Status**: Ready for implementation planning
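
## Appendix: Health Score Sketch

The proposal defines a 0-100 health score and alert thresholds at 70 and 50, but not how the score is calculated. Below is a minimal sketch of one possible calculation for the PoC scanner. `DiskHealth` is a trimmed, hypothetical stand-in for `DiskHealthStatus`, and every weight and cap in `HealthScore` is an assumption chosen for illustration, not a validated failure model. Only the 45°C/50°C temperature thresholds and the 70/50 alert tiers come from the proposal itself.

```go
package main

import "fmt"

// DiskHealth is a hypothetical, trimmed version of DiskHealthStatus
// holding only the fields this scoring sketch reads.
type DiskHealth struct {
	Device              string
	SMARTPassed         bool
	TemperatureC        int
	ReallocatedSectors  int
	PendingSectors      int
	UncorrectableErrors int
}

// penalty deducts perUnit points per counted event, capped at max so a
// single attribute cannot consume the whole score on its own.
func penalty(count, perUnit, max int) int {
	if count <= 0 {
		return 0
	}
	if p := count * perUnit; p < max {
		return p
	}
	return max
}

// HealthScore returns a value in [0, 100]. All weights are illustrative
// assumptions; a failed SMART overall status always scores 0.
func HealthScore(d DiskHealth) int {
	if !d.SMARTPassed {
		return 0
	}
	score := 100
	score -= penalty(d.ReallocatedSectors, 2, 40)
	score -= penalty(d.PendingSectors, 3, 30)
	score -= penalty(d.UncorrectableErrors, 5, 30)
	switch {
	case d.TemperatureC >= 50: // critical threshold from the proposal
		score -= 20
	case d.TemperatureC >= 45: // warning threshold from the proposal
		score -= 10
	}
	if score < 0 {
		score = 0
	}
	return score
}

// AlertLevel maps a score to the tiers in the Alert System section:
// below 50 is critical, below 70 is a warning.
func AlertLevel(score int) string {
	switch {
	case score < 50:
		return "critical"
	case score < 70:
		return "warning"
	default:
		return "ok"
	}
}

func main() {
	d := DiskHealth{Device: "/dev/sdb", SMARTPassed: true, TemperatureC: 46, ReallocatedSectors: 15}
	s := HealthScore(d)
	fmt.Printf("%s score=%d level=%s\n", d.Device, s, AlertLevel(s))
}
```

With these illustrative weights, a drive with 15 reallocated sectors running at 46°C scores 60, which lands in the warning tier. The caps keep one noisy attribute from driving the score to zero; whether that is the right trade-off would need tuning against real drive data.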