# CRITICAL: Add SMART Disk Monitoring to RedFlag
## Why This Is Urgent
After tonight's USB drive transfer incident (107GB music collection, system crash due to I/O overload), it's painfully clear that **RedFlag needs disk health monitoring**.
**What happened:**
- First rsync attempt maxed out disk I/O (no bandwidth limit)
- System became unresponsive, required hard reboot
- NTFS filesystem on 10TB drive corrupted from unclean shutdown
- exFAT USB drive also had unmount corruption
- Lost ~4 hours to troubleshooting and recovery
**The realization:** We have NO early warning system for disk health issues in RedFlag. We're flying blind on hardware failure until it's too late.
## Why RedFlag Needs SMART Monitoring
**Current gaps:**
- ❌ No early warning of impending drive failure
- ❌ No automatic disk health checks
- ❌ No alerts for reallocated sectors, high temps, or pre-fail indicators
- ❌ No monitoring of I/O saturation that could cause crashes
- ❌ No proactive maintenance recommendations
**What SMART monitoring gives us:**
- ✅ Early warning of drive failure (days/weeks before total failure)
- ✅ Temperature monitoring (prevent thermal throttling/damage)
- ✅ Reallocated sector tracking (silent data corruption indicator)
- ✅ I/O error rate monitoring (predicts filesystem corruption)
- ✅ Proactive replacement recommendations (maintenance windows, not emergencies)
- ✅ Correlation between update operations and disk health (did that update cause issues?)
## Proposed Implementation
### Disk Health Scanner Module
```go
// New scanner module: agent/pkg/scanners/disk_health.go
package scanners

import "time"

// DiskHealthStatus is one SMART health snapshot for a single drive.
type DiskHealthStatus struct {
	Device              string    `json:"device"`
	SMARTStatus         string    `json:"smart_status"` // PASSED/FAILED
	Temperature         int       `json:"temperature_c"`
	ReallocatedSectors  int       `json:"reallocated_sectors"`  // SMART attribute 5
	PendingSectors      int       `json:"pending_sectors"`      // SMART attribute 197
	UncorrectableErrors int       `json:"uncorrectable_errors"` // SMART attribute 198
	PowerOnHours        int       `json:"power_on_hours"`
	LastTestDate        time.Time `json:"last_self_test"`
	HealthScore         int       `json:"health_score"` // 0-100
	CriticalAttributes  []string  `json:"critical_attributes,omitempty"`
}
```
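A minimal sketch of how the counters in that struct might be filled and turned into a 0-100 score. It parses the plain-text attribute table printed by `smartctl -A`. The function names, the sample output, and especially the penalty weights are assumptions made up for illustration; nothing here is an agreed RedFlag formula.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// sampleOutput imitates the attribute table from `smartctl -A` (values invented).
const sampleOutput = `ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       15
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       3
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0`

// parseSMARTAttributes maps attribute ID to the leading integer of RAW_VALUE.
// Rows that are not attribute rows (header, free text) are skipped.
func parseSMARTAttributes(output string) map[int]int64 {
	attrs := make(map[int]int64)
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 10 {
			continue // too short to be an attribute row
		}
		id, err := strconv.Atoi(fields[0])
		if err != nil {
			continue // header line ("ID#") or other text
		}
		raw, err := strconv.ParseInt(fields[9], 10, 64)
		if err != nil {
			continue // vendor-specific raw format; ignored in this sketch
		}
		attrs[id] = raw
	}
	return attrs
}

// penalty charges perUnit points per counted event, capped at limit.
func penalty(count int64, perUnit, limit int) int {
	if count <= 0 {
		return 0
	}
	if count >= int64(limit) {
		return limit // perUnit >= 1, so the cap is already reached
	}
	if p := int(count) * perUnit; p < limit {
		return p
	}
	return limit
}

// healthScore uses HYPOTHETICAL weights: start at 100, subtract capped penalties.
func healthScore(attrs map[int]int64) int {
	score := 100
	score -= penalty(attrs[5], 2, 40)   // reallocated sectors
	score -= penalty(attrs[196], 1, 10) // reallocation events
	score -= penalty(attrs[197], 5, 30) // pending sectors
	score -= penalty(attrs[198], 5, 30) // offline uncorrectable
	if score < 0 {
		score = 0
	}
	return score
}

func main() {
	attrs := parseSMARTAttributes(sampleOutput)
	fmt.Println("reallocated:", attrs[5], "pending:", attrs[197])
	fmt.Println("health score:", healthScore(attrs))
}
```

Newer smartmontools releases can also emit JSON (`smartctl --json`), which would likely be a sturdier input than the text table if the minimum supported version allows it.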
### Agent-Side Features
1. **Scheduled SMART Checks**
   - Run `smartctl -a` every 6 hours
   - Parse critical attributes: 5 (reallocated sectors), 196 (reallocation events), 197 (pending sectors), 198 (offline uncorrectable)
   - Calculate health score (0-100 scale)
2. **Self-Test Scheduling**
   - Short self-test: Weekly
   - Long self-test: Monthly
   - Log results to agent's local DB
3. **I/O Monitoring**
   - Track disk utilization %
   - Monitor I/O wait times
   - Alert on sustained >80% utilization (the condition behind tonight's crash)
4. **Temperature Alerts**
   - Warning at 45°C
   - Critical at 50°C
   - Log thermal throttling events
### Server-Side Features
1. **Disk Health Dashboard**
   - Show all drives across all agents
   - Color-coded health status (green/yellow/red)
   - Temperature graphs over time
   - Predicted failure timeline
2. **Alert System**
   - Email when health score drops below 70
   - Critical alert when below 50
   - Immediate alert on SMART failure
   - Temperature spike notifications
3. **Maintenance Recommendations**
   - "Drive /dev/sdb showing 15 reallocated sectors - recommend replacement within 30 days"
   - "Temperature consistently above 45°C - check cooling"
   - "Drive has 45,000 hours - consider proactive replacement"
4. **Correlation with Updates**
   - "System update initiated while disk I/O at 92% - potential correlation?"
   - Track if updates cause disk health degradation
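The alert thresholds above (score below 70, score below 50, SMART failure, 45°C / 50°C) could map onto a small classifier like the sketch below. The type and function names are hypothetical, and so is the cutoff of 10 reallocated sectors for a replacement recommendation; the proposal only gives an example with 15.

```go
package main

import "fmt"

// AlertLevel is a hypothetical severity type for this sketch.
type AlertLevel string

const (
	AlertNone     AlertLevel = "none"
	AlertWarning  AlertLevel = "warning"
	AlertCritical AlertLevel = "critical"
)

// classify applies the thresholds listed in this proposal.
func classify(smartPassed bool, healthScore, tempC int) AlertLevel {
	switch {
	case !smartPassed, healthScore < 50, tempC >= 50:
		return AlertCritical
	case healthScore < 70, tempC >= 45:
		return AlertWarning
	default:
		return AlertNone
	}
}

const (
	reallocatedCutoff = 10    // ASSUMED cutoff, not from the proposal
	agingHours        = 45000 // taken from the example message above
)

// recommendations builds maintenance messages in the style shown above.
func recommendations(device string, reallocated, tempC, powerOnHours int) []string {
	var recs []string
	if reallocated >= reallocatedCutoff {
		recs = append(recs, fmt.Sprintf(
			"Drive %s showing %d reallocated sectors - recommend replacement within 30 days",
			device, reallocated))
	}
	if tempC > 45 {
		recs = append(recs, fmt.Sprintf("Drive %s at %d°C - check cooling", device, tempC))
	}
	if powerOnHours >= agingHours {
		recs = append(recs, fmt.Sprintf(
			"Drive %s has %d hours - consider proactive replacement", device, powerOnHours))
	}
	return recs
}

func main() {
	fmt.Println(classify(true, 65, 41))
	for _, r := range recommendations("/dev/sdb", 15, 46, 45000) {
		fmt.Println(r)
	}
}
```

Keeping classification separate from message building would let the server reuse the same levels for dashboard colors and email routing.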
## Why This Can't Wait
**The $600k/year ConnectWise can't do this:**
- Their agents don't have hardware-level access
- Cloud model prevents local disk monitoring
- Proprietary code prevents community additions
**RedFlag's advantage:**
- Self-hosted agents have full system access
- Open source - community can contribute disk monitoring
- Hardware binding already in place - perfect foundation
- Error transparency means we see disk issues immediately
**Business case:**
- One prevented data loss incident = justification
- Proactive replacement vs emergency outage = measurable ROI
- MSPs can offer disk health monitoring as premium service
- Homelabbers catch failing drives before losing family photos
## Technical Considerations
**Dependencies:**
- `smartmontools` package on agents (most distros have it)
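- `smartctl` needs root (raw device access), so the agent needs a narrowly scoped sudo rule for it (document in install)
- SMART data comes from the drive itself, so the filesystem (NTFS, exFAT, ext4) doesn't matter; `ntfs-3g` is only needed to mount or repair NTFS volumes
- USB drives only report SMART if the USB bridge passes it through, so some external drives won't be readable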
- Windows agents need different implementation (WMI)
**Security:**
- Limited to read-only SMART data
- No disk modification commands
- Agent keeps running as a limited user; elevated rights are needed only for the read-only `smartctl` call
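One way the agent might keep its `smartctl` use read-only is to build the command from a fixed flag list and an allow-listed device path, as in this sketch. `-H` (overall health) and `-A` (attributes) are real `smartctl` flags that only read data; the function name and the device pattern are assumptions, and the pattern would need widening for other device naming schemes.

```go
package main

import (
	"fmt"
	"os/exec"
	"regexp"
)

// devicePattern is an ASSUMED allow-list: whole SATA/SCSI disks (/dev/sda)
// and NVMe namespaces (/dev/nvme0n1). Partitions and other paths are refused.
var devicePattern = regexp.MustCompile(`^/dev/(sd[a-z]+|nvme[0-9]+n[0-9]+)$`)

// smartctlCommand returns a read-only smartctl invocation for one device.
// No shell is involved and the flags are fixed, so callers cannot request
// self-tests or setting changes through this function.
func smartctlCommand(device string) (*exec.Cmd, error) {
	if !devicePattern.MatchString(device) {
		return nil, fmt.Errorf("refusing unexpected device path %q", device)
	}
	return exec.Command("smartctl", "-H", "-A", device), nil
}

func main() {
	cmd, err := smartctlCommand("/dev/sda")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Built but not run here: running needs root and smartmontools installed.
	fmt.Println(cmd.Args)

	if _, err := smartctlCommand("/dev/sda1; reboot"); err != nil {
		fmt.Println("rejected:", err)
	}
}
```

A matching sudo rule could then be limited to exactly this command shape instead of granting the agent general root access.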
**Cross-platform:**
- Linux: `smartctl` (easy)
- Windows: WMI or the native Windows build of `smartctl` from smartmontools (need research)
- Docker: container needs host device access passed through
## Next Steps
1. **Immediate**: Add `smartmontools` to agent install scripts
2. **This week**: Create PoC disk health scanner
3. **Next sprint**: Integrate with agent heartbeat
4. **v0.2.0**: Full disk health dashboard + alerts
**Estimates:**
- Linux scanner: 2-3 days
- Windows scanner: 3-5 days (research needed)
- Server dashboard: 3-4 days
- Alert system: 2-3 days
- Testing: 2-3 days
**Total**: 12-18 working days of effort; ~2 weeks to production-ready disk health monitoring only if agent and server work run in parallel
## The Bottom Line
Tonight's incident cost us:
- 4 hours of troubleshooting
- 107GB music collection at risk
- 2 unclean shutdowns
- Corrupted filesystems (NTFS + exFAT)
- A lot of frustration
**Disk health and I/O monitoring could have:**
- Flagged any existing USB drive problems before the copy
- Alerted on I/O saturation before the crash
- Given us early warning on the 10TB drive's health
- Provided data to help prevent the crash
**This is infrastructure 101. We need this yesterday.**
---
**Priority**: CRITICAL
**Effort**: Medium (2 weeks)
**Impact**: High (prevents data loss, adds competitive advantage)
**User Requested**: YES (after tonight's incident)
**ConnectWise Can't Match**: Hardware-level monitoring is their blind spot
**Status**: Ready for implementation planning