CRITICAL: Add SMART Disk Monitoring to RedFlag
Why This Is Urgent
After tonight's USB drive transfer incident (107GB music collection, system crash due to I/O overload), it's painfully clear that RedFlag needs disk health monitoring.
What happened:
- First rsync attempt maxed out disk I/O (no bandwidth limit)
- System became unresponsive, required hard reboot
- NTFS filesystem on 10TB drive corrupted from unclean shutdown
- exFAT USB drive also had unmount corruption
- Lost ~4 hours to troubleshooting and recovery
The realization: We have NO early warning system for disk health issues in RedFlag. We're flying blind on hardware failure until it's too late.
Why RedFlag Needs SMART Monitoring
Current gaps:
- ❌ No early warning of impending drive failure
- ❌ No automatic disk health checks
- ❌ No alerts for reallocated sectors, high temps, or pre-fail indicators
- ❌ No monitoring of I/O saturation that could cause crashes
- ❌ No proactive maintenance recommendations
What SMART monitoring gives us:
- ✅ Early warning of drive failure (days/weeks before total failure)
- ✅ Temperature monitoring (prevent thermal throttling/damage)
- ✅ Reallocated sector tracking (silent data corruption indicator)
- ✅ I/O error rate monitoring (predicts filesystem corruption)
- ✅ Proactive replacement recommendations (maintenance windows, not emergencies)
- ✅ Correlation between update operations and disk health (did that update cause issues?)
Proposed Implementation
Disk Health Scanner Module
// New scanner module: agent/pkg/scanners/disk_health.go
package scanners // package name assumed from the path above

import "time"

// DiskHealthStatus is one SMART health snapshot for a single drive.
type DiskHealthStatus struct {
	Device              string    `json:"device"`
	SMARTStatus         string    `json:"smart_status"` // PASSED/FAILED
	Temperature         int       `json:"temperature_c"`
	ReallocatedSectors  int       `json:"reallocated_sectors"`
	PendingSectors      int       `json:"pending_sectors"`
	UncorrectableErrors int       `json:"uncorrectable_errors"`
	PowerOnHours        int       `json:"power_on_hours"`
	LastTestDate        time.Time `json:"last_self_test"`
	HealthScore         int       `json:"health_score"` // 0-100
	CriticalAttributes  []string  `json:"critical_attributes,omitempty"`
}
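A minimal sketch of how the agent could fill this struct from smartctl's JSON output. It assumes smartmontools 7.0 or newer (for --json) and ATA drives only; smartctlReport, diskSummary, and parseSmartctlJSON are hypothetical names, and diskSummary is a cut-down stand-in for DiskHealthStatus. Attribute 196 is skipped here because the struct has no field for it.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// smartctlReport mirrors only the parts of `smartctl --json -a` output used
// here. NVMe drives report health under different keys and are not handled.
type smartctlReport struct {
	Device struct {
		Name string `json:"name"`
	} `json:"device"`
	SmartStatus struct {
		Passed bool `json:"passed"`
	} `json:"smart_status"`
	Temperature struct {
		Current int `json:"current"`
	} `json:"temperature"`
	PowerOnTime struct {
		Hours int `json:"hours"`
	} `json:"power_on_time"`
	ATASmartAttributes struct {
		Table []struct {
			ID  int `json:"id"`
			Raw struct {
				Value int64 `json:"value"`
			} `json:"raw"`
		} `json:"table"`
	} `json:"ata_smart_attributes"`
}

// diskSummary is a hypothetical, trimmed-down stand-in for DiskHealthStatus.
type diskSummary struct {
	Device              string
	SMARTStatus         string
	Temperature         int
	ReallocatedSectors  int
	PendingSectors      int
	UncorrectableErrors int
	PowerOnHours        int
}

// parseSmartctlJSON decodes smartctl JSON and picks out the raw values of
// attributes 5, 197, and 198. A missing smart_status is treated as FAILED.
func parseSmartctlJSON(data []byte) (diskSummary, error) {
	var r smartctlReport
	if err := json.Unmarshal(data, &r); err != nil {
		return diskSummary{}, fmt.Errorf("decode smartctl output: %w", err)
	}
	s := diskSummary{
		Device:       r.Device.Name,
		SMARTStatus:  "FAILED",
		Temperature:  r.Temperature.Current,
		PowerOnHours: r.PowerOnTime.Hours,
	}
	if r.SmartStatus.Passed {
		s.SMARTStatus = "PASSED"
	}
	for _, a := range r.ATASmartAttributes.Table {
		switch a.ID {
		case 5: // Reallocated_Sector_Ct
			s.ReallocatedSectors = int(a.Raw.Value)
		case 197: // Current_Pending_Sector
			s.PendingSectors = int(a.Raw.Value)
		case 198: // Offline_Uncorrectable
			s.UncorrectableErrors = int(a.Raw.Value)
		}
	}
	return s, nil
}

func main() {
	// Hand-written sample shaped like smartctl output, not from a real drive.
	sample := []byte(`{
	  "device": {"name": "/dev/sdb"},
	  "smart_status": {"passed": true},
	  "temperature": {"current": 41},
	  "power_on_time": {"hours": 45000},
	  "ata_smart_attributes": {"table": [
	    {"id": 5, "raw": {"value": 15}},
	    {"id": 197, "raw": {"value": 0}}
	  ]}
	}`)
	s, err := parseSmartctlJSON(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%+v\n", s)
}
```

Decoding JSON rather than scraping the text table keeps the parser independent of column widths and locale, at the cost of requiring a reasonably recent smartmontools.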
Agent-Side Features
- Scheduled SMART Checks
  - Run smartctl -a every 6 hours
  - Parse critical attributes (5, 196, 197, 198)
  - Calculate health score (0-100 scale)
- Self-Test Scheduling
  - Short self-test: Weekly
  - Long self-test: Monthly
  - Log results to agent's local DB
- I/O Monitoring
  - Track disk utilization %
  - Monitor I/O wait times
  - Alert on sustained >80% utilization (prevents crash scenarios)
- Temperature Alerts
  - Warning at 45°C
  - Critical at 50°C
  - Log thermal throttling events
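To make the scoring and I/O rules above concrete, here is a rough sketch. The 45°C / 50°C and 80% thresholds come from the list above; the penalty weights in healthScore are placeholder assumptions (not a validated failure model), and healthScore, capped, and sustainedAbove are hypothetical names.

```go
package main

import "fmt"

// Thresholds taken from the feature list.
const (
	tempWarningC  = 45
	tempCriticalC = 50
	utilAlertPct  = 80.0
)

// capped returns v limited to at most limit.
func capped(v, limit int) int {
	if v > limit {
		return limit
	}
	return v
}

// healthScore returns a 0-100 score. The weights are illustrative
// assumptions: each penalty is capped so no single counter dominates.
func healthScore(smartPassed bool, reallocated, pending, uncorrectable, tempC int) int {
	if !smartPassed {
		return 0 // the drive's own self-assessment failed
	}
	score := 100
	score -= capped(reallocated*2, 40)
	score -= capped(pending*5, 30)
	score -= capped(uncorrectable*5, 30)
	switch {
	case tempC >= tempCriticalC:
		score -= 20
	case tempC >= tempWarningC:
		score -= 10
	}
	if score < 0 {
		score = 0
	}
	return score
}

// sustainedAbove reports whether the most recent `window` utilization
// samples are all above threshold, so a single spike does not alert.
func sustainedAbove(samples []float64, threshold float64, window int) bool {
	if window <= 0 || len(samples) < window {
		return false
	}
	for _, v := range samples[len(samples)-window:] {
		if v <= threshold {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println("score:", healthScore(true, 15, 0, 0, 46))
	fmt.Println("I/O alert:", sustainedAbove([]float64{50, 85, 92, 90}, utilAlertPct, 3))
}
```

How many samples count as "sustained" depends on the polling interval, which is still open; the window of 3 in main is only for the demo.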
Server-Side Features
- Disk Health Dashboard
  - Show all drives across all agents
  - Color-coded health status (green/yellow/red)
  - Temperature graphs over time
  - Predicted failure timeline
- Alert System
  - Email when health score drops below 70
  - Critical alert when below 50
  - Immediate alert on SMART failure
  - Temperature spike notifications
- Maintenance Recommendations
  - "Drive /dev/sdb showing 15 reallocated sectors - recommend replacement within 30 days"
  - "Temperature consistently above 45°C - check cooling"
  - "Drive has 45,000 hours - consider proactive replacement"
- Correlation with Updates
  - "System update initiated while disk I/O at 92% - potential correlation?"
  - Track if updates cause disk health degradation
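The alert thresholds above could map to severity levels roughly like this. The 70 and 50 cut-offs and the SMART-failure rule are from the list; the alertLevel type, its values, and classifyAlert are hypothetical names for illustration.

```go
package main

import "fmt"

// alertLevel is a hypothetical severity type for the server's alert system.
type alertLevel string

const (
	alertNone      alertLevel = "none"
	alertWarning   alertLevel = "warning"   // health score below 70: email
	alertCritical  alertLevel = "critical"  // health score below 50
	alertImmediate alertLevel = "immediate" // SMART status reported FAILED
)

// classifyAlert applies the rules in priority order: a SMART failure
// outranks any score, then the lower score threshold, then the higher one.
func classifyAlert(smartStatus string, healthScore int) alertLevel {
	switch {
	case smartStatus == "FAILED":
		return alertImmediate
	case healthScore < 50:
		return alertCritical
	case healthScore < 70:
		return alertWarning
	default:
		return alertNone
	}
}

func main() {
	drives := []struct {
		device string
		status string
		score  int
	}{
		{"/dev/sda", "PASSED", 92},
		{"/dev/sdb", "PASSED", 60},
		{"/dev/sdc", "FAILED", 80},
	}
	for _, d := range drives {
		fmt.Println(d.device, classifyAlert(d.status, d.score))
	}
}
```

Checking the SMART status first means a drive that reports FAILED always escalates, even if its computed score still looks acceptable.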
Why This Can't Wait
The $600k/year ConnectWise can't do this:
- Their agents don't have hardware-level access
- Cloud model prevents local disk monitoring
- Proprietary code prevents community additions
RedFlag's advantage:
- Self-hosted agents have full system access
- Open source - community can contribute disk monitoring
- Hardware binding already in place - perfect foundation
- Error transparency means we see disk issues immediately
Business case:
- One prevented data loss incident = justification
- Proactive replacement vs emergency outage = measurable ROI
- MSPs can offer disk health monitoring as premium service
- Homelabbers catch failing drives before losing family photos
Technical Considerations
Dependencies:
- smartmontools package on agents (most distros have it)
- Agent needs sudo access for smartctl (document in install)
- NTFS drives need ntfs-3g to be mounted; SMART data itself comes from the drive and does not depend on the filesystem
- Windows agents need different implementation (WMI)
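Since smartmontools may be missing on some agents, the scanner could check for the binary at startup and disable itself cleanly instead of failing on every run. A small sketch, assuming a hypothetical findSmartctl helper and errSmartctlMissing sentinel; the lookup function is injectable so it can be tested without smartctl installed.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// errSmartctlMissing is a hypothetical sentinel the scanner can match on.
var errSmartctlMissing = errors.New("smartctl not found; install smartmontools")

// findSmartctl resolves the smartctl binary via lookPath.
// Pass exec.LookPath in production code.
func findSmartctl(lookPath func(string) (string, error)) (string, error) {
	path, err := lookPath("smartctl")
	if err != nil {
		return "", fmt.Errorf("%w: %v", errSmartctlMissing, err)
	}
	return path, nil
}

func main() {
	path, err := findSmartctl(exec.LookPath)
	if errors.Is(err, errSmartctlMissing) {
		fmt.Println("disk health scanner disabled:", err)
		return
	}
	fmt.Println("using", path)
}
```

This only confirms the binary is on PATH; whether the agent's user may actually run it (the sudo rule above) still has to be verified separately.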
Security:
- Limited to read-only SMART data
- No disk modification commands
- Agent already runs as a limited user; the only elevation is the sudo rule for smartctl noted under Dependencies
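One way to enforce the read-only rule in code is to build every smartctl invocation from an allowlist, so options like -s (which toggles SMART on the drive) can never be passed. This is a sketch: smartctlArgs, readOnlyFlags, and the device-name pattern are assumptions, and the pattern is deliberately strict (it rejects /dev/disk/by-id paths). Self-test flags (-t) are left out on purpose; the scheduled self-tests would need their own decision.

```go
package main

import (
	"fmt"
	"regexp"
)

// readOnlyFlags lists the smartctl options this sketch permits.
// All of them only read and report drive data.
var readOnlyFlags = map[string]bool{
	"-a":     true, // all SMART information
	"-A":     true, // vendor attributes
	"-H":     true, // overall health self-assessment
	"-i":     true, // device identity
	"--json": true, // machine-readable output
}

// devicePattern is an assumed, deliberately narrow device-name rule,
// matching names such as /dev/sdb or /dev/nvme0n1.
var devicePattern = regexp.MustCompile(`^/dev/[A-Za-z0-9]+$`)

// smartctlArgs returns the argument list for a read-only smartctl call,
// or an error if the device or any flag falls outside the allowlist.
func smartctlArgs(device string, flags ...string) ([]string, error) {
	if !devicePattern.MatchString(device) {
		return nil, fmt.Errorf("refusing device path %q", device)
	}
	args := make([]string, 0, len(flags)+1)
	for _, f := range flags {
		if !readOnlyFlags[f] {
			return nil, fmt.Errorf("flag %q is not in the read-only allowlist", f)
		}
		args = append(args, f)
	}
	return append(args, device), nil
}

func main() {
	args, err := smartctlArgs("/dev/sdb", "--json", "-a")
	fmt.Println(args, err)

	_, err = smartctlArgs("/dev/sdb", "-s")
	fmt.Println(err)
}
```

The returned slice is meant to be handed to exec.Command directly, which avoids a shell and keeps the device name from being parsed as an option.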
Cross-platform:
- Linux: smartctl (easy)
- Windows: WMI or smartctl via Cygwin (need research)
- Docker: Need to pass host device access through to the container
Next Steps
- Immediate: Add smartmontools to agent install scripts
- This week: Create PoC disk health scanner
- Next sprint: Integrate with agent heartbeat
- v0.2.0: Full disk health dashboard + alerts
Estimates:
- Linux scanner: 2-3 days
- Windows scanner: 3-5 days (research needed)
- Server dashboard: 3-4 days
- Alert system: 2-3 days
- Testing: 2-3 days
Total: 12-18 days of work, so ~2 weeks to production-ready disk health monitoring only if some of it runs in parallel
The Bottom Line
Tonight's incident cost us:
- 4 hours of troubleshooting
- 107GB music collection at risk
- 2 unclean shutdowns
- Corrupted filesystems (NTFS + exFAT)
- A lot of frustration
Disk health monitoring (SMART plus the I/O tracking proposed above) would have:
- Warned about the USB drive issues before the copy
- Alerted on I/O saturation before crash
- Given us early warning on the 10TB drive health
- Provided data to prevent the crash
This is infrastructure 101. We need this yesterday.
Priority: CRITICAL
Effort: Medium (2 weeks)
Impact: High (prevents data loss, adds competitive advantage)
User Requested: YES (after tonight's incident)
ConnectWise Can't Match: Hardware-level monitoring is their blind spot
Status: Ready for implementation planning