# CRITICAL: Add SMART Disk Monitoring to RedFlag

## Why This Is Urgent

After tonight's USB drive transfer incident (107GB music collection, system crash due to I/O overload), it's painfully clear that **RedFlag needs disk health monitoring**.

**What happened:**
- First rsync attempt maxed out disk I/O (no bandwidth limit)
- System became unresponsive, required hard reboot
- NTFS filesystem on 10TB drive corrupted from unclean shutdown
- exFAT USB drive also had unmount corruption
- Lost ~4 hours to troubleshooting and recovery

**The realization:** We have NO early warning system for disk health issues in RedFlag. We're flying blind on hardware failure until it's too late.

## Why RedFlag Needs SMART Monitoring

**Current gaps:**
- ❌ No early warning of impending drive failure
- ❌ No automatic disk health checks
- ❌ No alerts for reallocated sectors, high temps, or pre-fail indicators
- ❌ No monitoring of I/O saturation that could cause crashes
- ❌ No proactive maintenance recommendations

**What SMART monitoring gives us:**
- ✅ Early warning of drive failure (often days or weeks ahead, though not every failure is preceded by SMART warnings)
- ✅ Temperature monitoring (prevent thermal throttling/damage)
- ✅ Reallocated sector tracking (sign of media degradation and elevated failure risk)
- ✅ I/O error rate monitoring (catch read/write errors before they turn into filesystem damage)
- ✅ Proactive replacement recommendations (maintenance windows, not emergencies)
- ✅ Correlation between update operations and disk health (did that update cause issues?)

## Proposed Implementation

### Disk Health Scanner Module

```go
// New scanner module: agent/pkg/scanners/disk_health.go
package scanners // package name assumed from the path above

import "time"

// DiskHealthStatus is one SMART health snapshot for a single drive.
type DiskHealthStatus struct {
	Device              string    `json:"device"`
	SMARTStatus         string    `json:"smart_status"` // PASSED/FAILED
	Temperature         int       `json:"temperature_c"`
	ReallocatedSectors  int       `json:"reallocated_sectors"`
	PendingSectors      int       `json:"pending_sectors"`
	UncorrectableErrors int       `json:"uncorrectable_errors"`
	PowerOnHours        int       `json:"power_on_hours"`
	LastTestDate        time.Time `json:"last_self_test"`
	HealthScore         int       `json:"health_score"` // 0-100
	CriticalAttributes  []string  `json:"critical_attributes,omitempty"`
}
```
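As a rough sketch of how this struct could be filled in, the snippet below parses `smartctl --json` output instead of scraping the text form. It assumes smartmontools 7.x and an ATA/SATA drive. The JSON field names (`smart_status.passed`, `temperature.current`, `power_on_time.hours`, `ata_smart_attributes.table`) reflect my understanding of that output and should be checked against a real drive. `parseSmartctlJSON`, `smartctlReport`, and the `UNKNOWN` status are hypothetical names for illustration, and the struct is trimmed to the fields the sketch sets.

```go
package main

import "encoding/json"

// DiskHealthStatus is a trimmed copy of the proposed struct.
type DiskHealthStatus struct {
	Device              string
	SMARTStatus         string // PASSED/FAILED, or UNKNOWN if not reported
	Temperature         int
	ReallocatedSectors  int
	PendingSectors      int
	UncorrectableErrors int
	PowerOnHours        int
}

// smartctlReport models only the JSON fields this sketch reads.
// Field names are an assumption about smartctl 7.x output.
type smartctlReport struct {
	Device struct {
		Name string `json:"name"`
	} `json:"device"`
	SmartStatus *struct {
		Passed bool `json:"passed"`
	} `json:"smart_status"`
	Temperature struct {
		Current int `json:"current"`
	} `json:"temperature"`
	PowerOnTime struct {
		Hours int `json:"hours"`
	} `json:"power_on_time"`
	ATAAttributes struct {
		Table []struct {
			ID  int `json:"id"`
			Raw struct {
				Value int64 `json:"value"`
			} `json:"raw"`
		} `json:"table"`
	} `json:"ata_smart_attributes"`
}

// parseSmartctlJSON converts one smartctl JSON report into a status snapshot.
func parseSmartctlJSON(data []byte) (DiskHealthStatus, error) {
	var r smartctlReport
	if err := json.Unmarshal(data, &r); err != nil {
		return DiskHealthStatus{}, err
	}
	st := DiskHealthStatus{
		Device:       r.Device.Name,
		SMARTStatus:  "UNKNOWN",
		Temperature:  r.Temperature.Current,
		PowerOnHours: r.PowerOnTime.Hours,
	}
	if r.SmartStatus != nil {
		if r.SmartStatus.Passed {
			st.SMARTStatus = "PASSED"
		} else {
			st.SMARTStatus = "FAILED"
		}
	}
	// Raw values of these attributes are plain sector counts on most drives.
	for _, a := range r.ATAAttributes.Table {
		switch a.ID {
		case 5:
			st.ReallocatedSectors = int(a.Raw.Value)
		case 197:
			st.PendingSectors = int(a.Raw.Value)
		case 198:
			st.UncorrectableErrors = int(a.Raw.Value)
		}
	}
	return st, nil
}
```

NVMe drives report health through a different JSON section, so a real scanner would need a second code path for them.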

### Agent-Side Features

1. **Scheduled SMART Checks**
   - Run `smartctl -a` every 6 hours
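   - Parse critical attributes: 5 (reallocated sectors), 196 (reallocation events), 197 (pending sectors), 198 (offline uncorrectable), plus 194 (temperature) and 9 (power-on hours) for the struct fields above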
   - Calculate health score (0-100 scale)

2. **Self-Test Scheduling**
   - Short self-test: Weekly
   - Long self-test: Monthly
   - Log results to agent's local DB

3. **I/O Monitoring**
   - Track disk utilization %
   - Monitor I/O wait times
   - Alert on sustained >80% utilization (prevents crash scenarios)

4. **Temperature Alerts**
   - Warning at 45°C
   - Critical at 50°C
   - Log thermal throttling events
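The scoring and I/O checks above could look something like the sketch below. The temperature and utilization thresholds come from the list; the penalty weights in `healthScore` are purely hypothetical placeholders to be tuned, since no formula has been decided yet. `utilizationPercent` assumes the agent samples the per-device "time spent doing I/Os (ms)" counter from `/proc/diskstats` on Linux, the same counter `iostat` uses for `%util`. All function names are illustrative.

```go
package main

// Thresholds taken from the agent-side feature list.
const (
	tempWarningC  = 45   // warning at 45°C
	tempCriticalC = 50   // critical at 50°C
	utilAlertPct  = 80.0 // alert on sustained utilization above 80%
)

// capAt limits a penalty so one attribute cannot dominate the score.
func capAt(v, limit int) int {
	if v > limit {
		return limit
	}
	return v
}

// healthScore maps SMART counters to a 0-100 score.
// All weights and caps here are hypothetical; counts are assumed non-negative.
func healthScore(smartPassed bool, reallocated, pending, uncorrectable, tempC int) int {
	if !smartPassed {
		return 0 // the drive's own verdict overrides everything else
	}
	score := 100
	score -= capAt(reallocated*2, 40)
	score -= capAt(pending*5, 30)
	score -= capAt(uncorrectable*5, 30)
	switch {
	case tempC >= tempCriticalC:
		score -= 20
	case tempC >= tempWarningC:
		score -= 10
	}
	if score < 0 {
		score = 0
	}
	return score
}

// utilizationPercent computes disk busy time between two samples of the
// cumulative busy-milliseconds counter, as a percentage of elapsed time.
func utilizationPercent(prevBusyMs, curBusyMs, elapsedMs uint64) float64 {
	if elapsedMs == 0 || curBusyMs < prevBusyMs {
		return 0 // no interval, or the counter was reset
	}
	pct := float64(curBusyMs-prevBusyMs) * 100 / float64(elapsedMs)
	if pct > 100 {
		pct = 100
	}
	return pct
}

// sustainedAbove reports whether the last n samples all exceed utilAlertPct,
// so a single short burst does not trigger an alert.
func sustainedAbove(samples []float64, n int) bool {
	if n <= 0 || len(samples) < n {
		return false
	}
	for _, s := range samples[len(samples)-n:] {
		if s <= utilAlertPct {
			return false
		}
	}
	return true
}
```

With these placeholder weights, a drive that passes SMART with 15 reallocated sectors at 38°C scores 70, which sits right on the email threshold proposed for the server side.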

### Server-Side Features

1. **Disk Health Dashboard**
   - Show all drives across all agents
   - Color-coded health status (green/yellow/red)
   - Temperature graphs over time
   - Predicted failure timeline

2. **Alert System**
   - Email when health score drops below 70
   - Critical alert when below 50
   - Immediate alert on SMART failure
   - Temperature spike notifications

3. **Maintenance Recommendations**
   - "Drive /dev/sdb showing 15 reallocated sectors - recommend replacement within 30 days"
   - "Temperature consistently above 45°C - check cooling"
   - "Drive has 45,000 hours - consider proactive replacement"

4. **Correlation with Updates**
   - "System update initiated while disk I/O at 92% - potential correlation?"
   - Track if updates cause disk health degradation

## Why This Can't Wait

**The $600k/year ConnectWise can't do this:**
- Their agents don't have hardware-level access
- Cloud model prevents local disk monitoring
- Proprietary code prevents community additions

**RedFlag's advantage:**
- Self-hosted agents have full system access
- Open source - community can contribute disk monitoring
- Hardware binding already in place - perfect foundation
- Error transparency means we see disk issues immediately

**Business case:**
- One prevented data loss incident = justification
- Proactive replacement vs emergency outage = measurable ROI
- MSPs can offer disk health monitoring as premium service
- Homelabbers catch failing drives before losing family photos

## Technical Considerations

**Dependencies:**
- `smartmontools` package on agents (most distros have it)
- Agent needs sudo access for `smartctl` (document in install)
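- SMART data comes from the drive itself, independent of the filesystem, so NTFS and exFAT volumes need no extra packages (USB enclosures may need a device-type option such as `smartctl -d sat`)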
- Windows agents need different implementation (WMI)

**Security:**
- Limited to read-only SMART data
- No disk modification commands
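- `smartctl` needs root-level device access, so the limited agent user should get a narrowly scoped sudo rule covering only the read-only `smartctl` invocation, not general root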
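One way to keep the agent to read-only SMART queries is to build the `smartctl` argument list in a single place and refuse anything that does not look like a whole-disk device node. The sketch below is an assumption about how that could be done, not existing RedFlag code: `devicePattern` and `smartctlArgs` are hypothetical names, and the pattern covers only common Linux SATA/SCSI and NVMe names.

```go
package main

import (
	"fmt"
	"regexp"
)

// devicePattern accepts whole-disk nodes such as /dev/sda or /dev/nvme0n1.
// Hypothetical allow-list; extend it for other device naming schemes.
var devicePattern = regexp.MustCompile(`^/dev/(sd[a-z]+|nvme[0-9]+n[0-9]+)$`)

// smartctlArgs returns a fixed, read-only argument list for one device.
// Rejecting unexpected paths keeps stray options (anything starting
// with "-") out of the command line.
func smartctlArgs(device string) ([]string, error) {
	if !devicePattern.MatchString(device) {
		return nil, fmt.Errorf("refusing device path %q", device)
	}
	// --all prints SMART information, --json selects machine-readable output.
	// Neither option starts a self-test or changes drive settings.
	return []string{"--all", "--json", device}, nil
}
```

The result would be passed to something like `exec.Command("sudo", append([]string{"smartctl"}, args...)...)`, with the sudo rule limited to that exact command shape.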

**Cross-platform:**
- Linux: `smartctl` (easy)
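- Windows: WMI or the native Windows build of `smartctl` from smartmontools (need research)
- Docker: need to pass host device access into the container (e.g. `--device`, likely plus extra capabilities)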

## Next Steps

1. **Immediate**: Add `smartmontools` to agent install scripts
2. **This week**: Create PoC disk health scanner
3. **Next sprint**: Integrate with agent heartbeat
4. **v0.2.0**: Full disk health dashboard + alerts

**Estimates:**
- Linux scanner: 2-3 days
- Windows scanner: 3-5 days (research needed)
- Server dashboard: 3-4 days
- Alert system: 2-3 days
- Testing: 2-3 days

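**Total**: 12-18 working days if everything runs back to back; ~2 weeks (9-13 days) to production-ready disk health monitoring on Linux, without the Windows scanner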

## The Bottom Line

Tonight's incident cost us:
- 4 hours of troubleshooting
- 107GB music collection at risk
- 2 unclean shutdowns
- Corrupted filesystems (NTFS + exFAT)
- A lot of frustration

**Disk health and I/O monitoring could have:**
- Flagged any existing SMART problems on the USB drive before the copy (if its USB bridge passes SMART through)
- Alerted on I/O saturation before the crash
- Given us early warning on the 10TB drive's health
- Provided data to diagnose, and possibly prevent, the crash

**This is infrastructure 101. We need this yesterday.**

---

**Priority**: CRITICAL
**Effort**: Medium (2 weeks)
**Impact**: High (prevents data loss, adds competitive advantage)
**User Requested**: YES (after tonight's incident)
**ConnectWise Can't Match**: Hardware-level monitoring is their blind spot

**Status**: Ready for implementation planning