# CRITICAL: Add SMART Disk Monitoring to RedFlag

## Why This Is Urgent

After tonight's USB drive transfer incident (107GB music collection, system crash due to I/O overload), it's painfully clear that **RedFlag needs disk health monitoring**.

**What happened:**
- First rsync attempt maxed out disk I/O (no bandwidth limit)
- System became unresponsive, required hard reboot
- NTFS filesystem on 10TB drive corrupted from unclean shutdown
- exFAT USB drive also had unmount corruption
- Lost ~4 hours to troubleshooting and recovery

**The realization:** We have NO early warning system for disk health issues in RedFlag. We're flying blind on hardware failure until it's too late.

## Why RedFlag Needs SMART Monitoring

**Current gaps:**
- ❌ No early warning of impending drive failure
- ❌ No automatic disk health checks
- ❌ No alerts for reallocated sectors, high temps, or pre-fail indicators
- ❌ No monitoring of I/O saturation that could cause crashes
- ❌ No proactive maintenance recommendations

**What SMART monitoring gives us:**
- ✅ Early warning of drive failure (days/weeks before total failure)
- ✅ Temperature monitoring (prevent thermal throttling/damage)
- ✅ Reallocated sector tracking (silent data corruption indicator)
- ✅ I/O error rate monitoring (predicts filesystem corruption)
- ✅ Proactive replacement recommendations (maintenance windows, not emergencies)
- ✅ Correlation between update operations and disk health (did that update cause issues?)

## Proposed Implementation

### Disk Health Scanner Module

```go
// New scanner module: agent/pkg/scanners/disk_health.go
package scanners

import "time"

// DiskHealthStatus is the per-drive SMART snapshot reported by the agent.
type DiskHealthStatus struct {
	Device              string    `json:"device"`
	SMARTStatus         string    `json:"smart_status"` // PASSED/FAILED
	Temperature         int       `json:"temperature_c"`
	ReallocatedSectors  int       `json:"reallocated_sectors"`
	PendingSectors      int       `json:"pending_sectors"`
	UncorrectableErrors int       `json:"uncorrectable_errors"`
	PowerOnHours        int       `json:"power_on_hours"`
	LastTestDate        time.Time `json:"last_self_test"`
	HealthScore         int       `json:"health_score"` // 0-100
	CriticalAttributes  []string  `json:"critical_attributes,omitempty"`
}
```
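
As a rough illustration of how the counter fields and `HealthScore` might be filled in, the sketch below parses the attribute table that `smartctl -A` prints and applies a simple penalty-based score. The `diskCounters` type, the function names, and above all the penalty weights are assumptions made for this example; nothing here is an agreed RedFlag formula, and the sample table is hand-written rather than captured from a real drive.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// diskCounters is a hypothetical, trimmed-down subset of DiskHealthStatus.
type diskCounters struct {
	ReallocatedSectors  int
	PendingSectors      int
	UncorrectableErrors int
	HealthScore         int
}

// sample is a hand-written excerpt in the layout of `smartctl -A` output.
const sample = `ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       15
194 Temperature_Celsius     0x0022   036   053   000    Old_age   Always       -       36 (Min/Max 20/53)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
`

// parseAttributes returns raw values keyed by SMART attribute ID.
// Lines that are not attribute rows (headers, banners) are skipped.
func parseAttributes(output string) map[int]int {
	raw := make(map[int]int)
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		f := strings.Fields(sc.Text())
		if len(f) < 10 {
			continue
		}
		id, err := strconv.Atoi(f[0])
		if err != nil {
			continue // header row or free text
		}
		v, err := strconv.Atoi(f[9]) // RAW_VALUE column, first token only
		if err != nil {
			continue // vendor-specific raw formats are ignored in this sketch
		}
		raw[id] = v
	}
	return raw
}

// capped limits a penalty so one counter cannot consume the whole score.
func capped(v, limit int) int {
	if v > limit {
		return limit
	}
	return v
}

// scoreFrom maps counters to 0-100. The weights are illustrative assumptions.
func scoreFrom(c diskCounters) int {
	score := 100
	score -= capped(c.ReallocatedSectors*2, 40)
	score -= capped(c.PendingSectors*5, 30)
	score -= capped(c.UncorrectableErrors*10, 30)
	return score
}

func main() {
	raw := parseAttributes(sample)
	c := diskCounters{
		ReallocatedSectors:  raw[5],
		PendingSectors:      raw[197],
		UncorrectableErrors: raw[198],
	}
	c.HealthScore = scoreFrom(c)
	fmt.Printf("%+v\n", c)
}
```

Because the three penalty caps add up to 100, the score cannot go negative. A real implementation would probably prefer structured output over text scraping where the installed smartmontools version supports it.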

### Agent-Side Features

1. **Scheduled SMART Checks**
   - Run `smartctl -a` every 6 hours
   - Parse critical attributes (5 Reallocated_Sector_Ct, 196 Reallocated_Event_Count, 197 Current_Pending_Sector, 198 Offline_Uncorrectable)
   - Calculate health score (0-100 scale)

2. **Self-Test Scheduling**
   - Short self-test: Weekly
   - Long self-test: Monthly
   - Log results to agent's local DB

3. **I/O Monitoring**
   - Track disk utilization %
   - Monitor I/O wait times
   - Alert on sustained >80% utilization (prevents crash scenarios)

4. **Temperature Alerts**
   - Warning at 45°C
   - Critical at 50°C
   - Log thermal throttling events

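
The I/O monitoring item could be approached on Linux by sampling `/proc/diskstats`, where the tenth per-device statistic (the 13th whitespace-separated column) counts milliseconds spent doing I/O. The sketch below derives a utilization percentage from two samples and checks whether the 80% threshold has been exceeded for several samples in a row. The function names, the sample lines, and the choice of three consecutive samples as "sustained" are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Two hand-written snapshots in /proc/diskstats layout, taken 5000 ms apart.
const before = "   8       0 sda 1000 0 8000 500 2000 0 16000 900 0 120000 1400"
const after = "   8       0 sda 1400 0 9600 700 5200 0 41600 4100 3 124600 6000"

// ioTicks returns the "milliseconds spent doing I/O" counter for a device.
func ioTicks(diskstats, device string) (uint64, bool) {
	for _, line := range strings.Split(diskstats, "\n") {
		f := strings.Fields(line)
		if len(f) < 13 || f[2] != device {
			continue
		}
		v, err := strconv.ParseUint(f[12], 10, 64)
		return v, err == nil
	}
	return 0, false
}

// utilization is the share of the sampling interval the device was busy.
func utilization(prevTicks, curTicks, intervalMs uint64) float64 {
	if intervalMs == 0 || curTicks < prevTicks {
		return 0 // no interval, or counter reset between samples
	}
	pct := float64(curTicks-prevTicks) * 100 / float64(intervalMs)
	if pct > 100 {
		pct = 100
	}
	return pct
}

// sustained reports whether the most recent n samples all exceed threshold.
func sustained(samples []float64, threshold float64, n int) bool {
	if n <= 0 || len(samples) < n {
		return false
	}
	for _, s := range samples[len(samples)-n:] {
		if s <= threshold {
			return false
		}
	}
	return true
}

func main() {
	prev, _ := ioTicks(before, "sda")
	cur, _ := ioTicks(after, "sda")
	u := utilization(prev, cur, 5000)
	fmt.Printf("sda utilization: %.0f%%\n", u)

	history := []float64{85, 88, u}
	fmt.Println("sustained above 80%:", sustained(history, 80, 3))
}
```

Requiring several consecutive samples keeps a single burst, such as a short backup job, from raising an alert, at the cost of reacting a few intervals later.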

### Server-Side Features

1. **Disk Health Dashboard**
   - Show all drives across all agents
   - Color-coded health status (green/yellow/red)
   - Temperature graphs over time
   - Predicted failure timeline

2. **Alert System**
   - Email when health score drops below 70
   - Critical alert when below 50
   - Immediate alert on SMART failure
   - Temperature spike notifications

3. **Maintenance Recommendations**
   - "Drive /dev/sdb showing 15 reallocated sectors - recommend replacement within 30 days"
   - "Temperature consistently above 45°C - check cooling"
   - "Drive has 45,000 hours - consider proactive replacement"

4. **Correlation with Updates**
   - "System update initiated while disk I/O at 92% - potential correlation?"
   - Track if updates cause disk health degradation

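
The alert thresholds listed under the alert system can be expressed as a small classification function. This is only a sketch: the `Severity` type, the `report` struct, and the ordering (a SMART failure outranks any score) are assumptions, while the 70 and 50 cut-offs come from the list above.

```go
package main

import "fmt"

// Severity is a hypothetical alert level used by the server.
type Severity string

const (
	SeverityOK        Severity = "ok"
	SeverityWarning   Severity = "warning"   // email notification
	SeverityCritical  Severity = "critical"  // critical alert
	SeverityImmediate Severity = "immediate" // SMART overall status failed
)

// report is an assumed minimal view of what an agent sends per drive.
type report struct {
	Device      string
	SMARTPassed bool
	HealthScore int // 0-100
}

// classify applies the thresholds in priority order.
func classify(r report) Severity {
	switch {
	case !r.SMARTPassed:
		return SeverityImmediate
	case r.HealthScore < 50:
		return SeverityCritical
	case r.HealthScore < 70:
		return SeverityWarning
	default:
		return SeverityOK
	}
}

func main() {
	reports := []report{
		{Device: "/dev/sda", SMARTPassed: true, HealthScore: 95},
		{Device: "/dev/sdb", SMARTPassed: true, HealthScore: 64},
		{Device: "/dev/sdc", SMARTPassed: true, HealthScore: 41},
		{Device: "/dev/sdd", SMARTPassed: false, HealthScore: 80},
	}
	for _, r := range reports {
		fmt.Printf("%s -> %s\n", r.Device, classify(r))
	}
}
```

Checking the SMART status first means a drive that reports failure is escalated even if its computed score still looks acceptable.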

## Why This Can't Wait

**The $600k/year ConnectWise can't do this:**
- Their agents don't have hardware-level access
- Cloud model prevents local disk monitoring
- Proprietary code prevents community additions

**RedFlag's advantage:**
- Self-hosted agents have full system access
- Open source - community can contribute disk monitoring
- Hardware binding already in place - perfect foundation
- Error transparency means we see disk issues immediately

**Business case:**
- One prevented data loss incident = justification
- Proactive replacement vs emergency outage = measurable ROI
- MSPs can offer disk health monitoring as premium service
- Homelabbers catch failing drives before losing family photos

## Technical Considerations

**Dependencies:**
- `smartmontools` package on agents (most distros have it)
- Agent needs root privileges for `smartctl`, for example a sudoers rule scoped to that one binary (document in install)
- SMART data is read from the drive itself, not the filesystem, so NTFS and exFAT volumes need no extra package; USB-attached drives may need `smartctl -d sat` and an enclosure that supports passthrough
- Windows agents need different implementation (WMI)

**Security:**
- Limited to read-only SMART data
- No disk modification commands
- Agent already runs as limited user - keep elevation restricted to `smartctl` rather than granting broader root access

**Cross-platform:**
- Linux: `smartctl` (easy)
- Windows: WMI or the native Windows build of `smartctl` (need research)
- Docker: need to pass host device access into the container (typically `--device` plus the `SYS_RAWIO` capability)
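
One way to keep the agent limited to read-only SMART queries is to build the `smartctl` argument list from a fixed set of flags and to validate the device path before anything is executed. The sketch below is an assumption about how that could look: the function name, the device-name pattern, and the choice of `-H` (overall health) and `-A` (vendor attributes) are illustrative, and the pattern would need extending for other device naming schemes.

```go
package main

import (
	"fmt"
	"regexp"
)

// devicePattern accepts only plain SATA/SAS and NVMe namespace device nodes.
// This allowlist is an assumption for the sketch, not an exhaustive list.
var devicePattern = regexp.MustCompile(`^/dev/(sd[a-z]+|nvme[0-9]+n[0-9]+)$`)

// readOnlyArgs returns the fixed, query-only argument list for one device,
// or an error if the path does not look like an expected block device.
func readOnlyArgs(device string) ([]string, error) {
	if !devicePattern.MatchString(device) {
		return nil, fmt.Errorf("refusing unexpected device path %q", device)
	}
	// -H reports overall health, -A prints the attribute table.
	// Neither flag starts a self-test or changes drive settings.
	return []string{"-H", "-A", device}, nil
}

func main() {
	for _, dev := range []string{"/dev/sda", "/dev/nvme0n1", "/dev/sda; reboot"} {
		args, err := readOnlyArgs(dev)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("smartctl", args)
	}
}
```

The returned slice can be passed to `exec.Command("smartctl", args...)` from `os/exec`, which does not go through a shell, so the validation here is a second layer rather than the only protection.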

## Next Steps

1. **Immediate**: Add `smartmontools` to agent install scripts
2. **This week**: Create PoC disk health scanner
3. **Next sprint**: Integrate with agent heartbeat
4. **v0.2.0**: Full disk health dashboard + alerts

**Estimates:**
- Linux scanner: 2-3 days
- Windows scanner: 3-5 days (research needed)
- Server dashboard: 3-4 days
- Alert system: 2-3 days
- Testing: 2-3 days

**Total**: ~2 weeks to production-ready disk health monitoring if work overlaps; the line items sum to 12-18 working days

## The Bottom Line

Tonight's incident cost us:
- 4 hours of troubleshooting
- 107GB music collection at risk
- 2 unclean shutdowns
- Corrupted filesystems (NTFS + exFAT)
- A lot of frustration

**SMART and I/O monitoring would have:**
- Warned about the USB drive issues before the copy
- Alerted on I/O saturation before crash
- Given us early warning on the 10TB drive health
- Provided data to prevent the crash

**This is infrastructure 101. We need this yesterday.**

---

**Priority**: CRITICAL
**Effort**: Medium (2 weeks)
**Impact**: High (prevents data loss, adds competitive advantage)
**User Requested**: YES (after tonight's incident)
**ConnectWise Can't Match**: Hardware-level monitoring is their blind spot

**Status**: Ready for implementation planning