Fimeg 484a7f77ce Add docs and project files - force for Culurien

2026-03-28 20:46:24 -04:00

CRITICAL: Add SMART Disk Monitoring to RedFlag

Why This Is Urgent

After tonight's USB drive transfer incident (107GB music collection, system crash due to I/O overload), it's painfully clear that RedFlag needs disk health monitoring.

What happened:

First rsync attempt maxed out disk I/O (no bandwidth limit)
System became unresponsive, required hard reboot
NTFS filesystem on 10TB drive corrupted from unclean shutdown
exFAT USB drive also had unmount corruption
Lost ~4 hours to troubleshooting and recovery

The realization: We have NO early warning system for disk health issues in RedFlag. We're flying blind on hardware failure until it's too late.

Why RedFlag Needs SMART Monitoring

Current gaps:

❌ No early warning of impending drive failure
❌ No automatic disk health checks
❌ No alerts for reallocated sectors, high temps, or pre-fail indicators
❌ No monitoring of I/O saturation that could cause crashes
❌ No proactive maintenance recommendations

What SMART monitoring gives us:

✅ Early warning of drive failure (days/weeks before total failure)
✅ Temperature monitoring (prevent thermal throttling/damage)
✅ Reallocated sector tracking (early sign of failing media and possible data loss)
✅ I/O error rate monitoring (flags conditions that often precede filesystem corruption)
✅ Proactive replacement recommendations (maintenance windows, not emergencies)
✅ Correlation between update operations and disk health (did that update cause issues?)

Proposed Implementation

Disk Health Scanner Module

// New scanner module: agent/pkg/scanners/disk_health.go
package scanners // package name assumed from the path above

import "time"

// DiskHealthStatus is one SMART health snapshot for a single block device.
type DiskHealthStatus struct {
    Device              string    `json:"device"`               // e.g. /dev/sdb
    SMARTStatus         string    `json:"smart_status"`         // PASSED/FAILED
    Temperature         int       `json:"temperature_c"`        // degrees Celsius
    ReallocatedSectors  int       `json:"reallocated_sectors"`  // SMART attribute 5
    PendingSectors      int       `json:"pending_sectors"`      // SMART attribute 197
    UncorrectableErrors int       `json:"uncorrectable_errors"` // SMART attribute 198
    PowerOnHours        int       `json:"power_on_hours"`       // SMART attribute 9
    LastTestDate        time.Time `json:"last_self_test"`
    HealthScore         int       `json:"health_score"` // 0-100
    CriticalAttributes  []string  `json:"critical_attributes,omitempty"`
}
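To populate a struct like this, the agent has to pull raw values out of the attribute table that `smartctl -A` prints for ATA drives. The following is a minimal sketch under stated assumptions, not the RedFlag implementation: `smartAttrs`, `leadingInt`, and `parseAttributes` are hypothetical names, the sample text is made up for illustration, and NVMe drives (which report health in a different format) are not handled. Attribute 196 is skipped only because the struct above has no field for it.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// smartAttrs is a hypothetical holder for the raw values this sketch extracts.
type smartAttrs struct {
	ReallocatedSectors  int // attribute 5
	PendingSectors      int // attribute 197
	UncorrectableErrors int // attribute 198
	PowerOnHours        int // attribute 9
	TemperatureC        int // attribute 194
}

// leadingInt parses the digits at the start of s, so raw values such as
// "36" (followed by "(Min/Max 20/52)") or "22345h+12m" still yield a number.
func leadingInt(s string) (int, bool) {
	end := 0
	for end < len(s) && s[end] >= '0' && s[end] <= '9' {
		end++
	}
	if end == 0 {
		return 0, false
	}
	n, err := strconv.Atoi(s[:end])
	return n, err == nil
}

// parseAttributes scans attribute-table text line by line. Lines that do not
// start with a numeric attribute ID (headers, blanks) are skipped.
func parseAttributes(output string) smartAttrs {
	var a smartAttrs
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		f := strings.Fields(sc.Text())
		if len(f) < 10 {
			continue
		}
		id, err := strconv.Atoi(f[0])
		if err != nil {
			continue
		}
		raw, ok := leadingInt(f[9]) // column 10 is RAW_VALUE
		if !ok {
			continue
		}
		switch id {
		case 5:
			a.ReallocatedSectors = raw
		case 9:
			a.PowerOnHours = raw
		case 194:
			a.TemperatureC = raw
		case 197:
			a.PendingSectors = raw
		case 198:
			a.UncorrectableErrors = raw
		}
	}
	return a
}

// sample is illustrative text in the attribute-table layout, not real drive data.
const sample = `ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       15
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       45012
194 Temperature_Celsius     0x0022   036   052   000    Old_age   Always       -       36 (Min/Max 20/52)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
`

func main() {
	fmt.Printf("%+v\n", parseAttributes(sample))
}
```

Raw-value formats vary by vendor, so a production parser would likely prefer smartctl's JSON output where the installed version supports it.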

Agent-Side Features

Scheduled SMART Checks
- Run smartctl -a every 6 hours
- Parse critical attributes (5 Reallocated Sectors, 196 Reallocation Events, 197 Current Pending Sectors, 198 Offline Uncorrectable)
- Calculate health score (0-100 scale)
Self-Test Scheduling
- Short self-test: Weekly
- Long self-test: Monthly
- Log results to agent's local DB
I/O Monitoring
- Track disk utilization %
- Monitor I/O wait times
- Alert on sustained >80% utilization (prevents crash scenarios)
Temperature Alerts
- Warning at 45°C
- Critical at 50°C
- Log thermal throttling events
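The agent-side checks above can be sketched as three small functions. This is an illustration under stated assumptions: the 45°C and 50°C thresholds and the 80% utilization limit come from the list, but the penalty weights in `healthScore` are hypothetical placeholders, as are all function names. The proposal does not define how the 0-100 score is computed.

```go
package main

import "fmt"

// capAt limits a penalty so one attribute cannot consume the whole score.
func capAt(v, limit int) int {
	if v > limit {
		return limit
	}
	return v
}

// healthScore returns a 0-100 score. The weights are hypothetical and would
// need tuning against real drive data.
func healthScore(smartPassed bool, reallocated, pending, uncorrectable, tempC int) int {
	if !smartPassed {
		return 0 // a failed SMART self-assessment overrides everything else
	}
	score := 100
	score -= capAt(reallocated*2, 40)
	score -= capAt(pending*5, 30)
	score -= capAt(uncorrectable*10, 30)
	switch {
	case tempC >= 50:
		score -= 15
	case tempC >= 45:
		score -= 5
	}
	if score < 0 {
		score = 0
	}
	return score
}

// tempLevel applies the warning (45°C) and critical (50°C) thresholds.
func tempLevel(tempC int) string {
	switch {
	case tempC >= 50:
		return "critical"
	case tempC >= 45:
		return "warning"
	default:
		return "ok"
	}
}

// sustainedAbove reports whether the last n utilization samples all exceed
// the threshold, so a single spike does not trigger an alert.
func sustainedAbove(samples []float64, threshold float64, n int) bool {
	if n <= 0 || len(samples) < n {
		return false
	}
	for _, s := range samples[len(samples)-n:] {
		if s <= threshold {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(healthScore(true, 15, 0, 0, 46))
	fmt.Println(tempLevel(46))
	fmt.Println(sustainedAbove([]float64{50, 85, 90, 92}, 80, 3))
}
```

Requiring several consecutive samples above 80% is one way to read "sustained"; the window length would depend on how often the agent samples disk utilization.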

Server-Side Features

Disk Health Dashboard
- Show all drives across all agents
- Color-coded health status (green/yellow/red)
- Temperature graphs over time
- Predicted failure timeline
Alert System
- Email when health score drops below 70
- Critical alert when below 50
- Immediate alert on SMART failure
- Temperature spike notifications
Maintenance Recommendations
- "Drive /dev/sdb showing 15 reallocated sectors - recommend replacement within 30 days"
- "Temperature consistently above 45°C - check cooling"
- "Drive has 45,000 hours - consider proactive replacement"
Correlation with Updates
- "System update initiated while disk I/O at 92% - potential correlation?"
- Track if updates cause disk health degradation
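On the server side, the alert rules and recommendation messages above could map onto something like the sketch below. The score thresholds (70 and 50) and the message wording follow the lists; the function names, the level strings, and the trigger conditions for each recommendation (any reallocated sector, more than 45°C, 45,000 hours or more) are assumptions made for illustration.

```go
package main

import "fmt"

// alertLevel maps a drive's SMART status and health score to an alert level.
// The level names are hypothetical.
func alertLevel(smartStatus string, score int) string {
	switch {
	case smartStatus == "FAILED":
		return "immediate" // SMART failure alerts regardless of score
	case score < 50:
		return "critical"
	case score < 70:
		return "warning" // the proposal sends an email at this level
	default:
		return "none"
	}
}

// recommendations builds maintenance messages. The trigger conditions are
// assumptions; the proposal only gives example messages.
func recommendations(device string, reallocated, tempC, powerOnHours int) []string {
	var out []string
	if reallocated > 0 {
		out = append(out, fmt.Sprintf(
			"Drive %s showing %d reallocated sectors - recommend replacement within 30 days",
			device, reallocated))
	}
	if tempC > 45 {
		out = append(out, "Temperature consistently above 45°C - check cooling")
	}
	if powerOnHours >= 45000 {
		out = append(out, fmt.Sprintf(
			"Drive has %d hours - consider proactive replacement", powerOnHours))
	}
	return out
}

func main() {
	fmt.Println(alertLevel("PASSED", 65))
	for _, msg := range recommendations("/dev/sdb", 15, 46, 45000) {
		fmt.Println(msg)
	}
}
```

A real "consistently above 45°C" check would look at a history of readings rather than a single value; the single-reading comparison here keeps the sketch short.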

Why This Can't Wait

The $600k/year ConnectWise can't do this:

Their agents don't have hardware-level access
Cloud model prevents local disk monitoring
Proprietary code prevents community additions

RedFlag's advantage:

Self-hosted agents have full system access
Open source - community can contribute disk monitoring
Hardware binding already in place - perfect foundation
Error transparency means we see disk issues immediately

Business case:

One prevented data loss incident = justification
Proactive replacement vs emergency outage = measurable ROI
MSPs can offer disk health monitoring as premium service
Homelabbers catch failing drives before losing family photos

Technical Considerations

Dependencies:

smartmontools package on agents (most distros have it)
Agent needs sudo access for smartctl (document in install)
SMART works at the device level and does not depend on the filesystem; ntfs-3g is only needed to mount NTFS volumes, not to read SMART data
Windows agents need different implementation (WMI)

Security:

Limited to read-only SMART data
No disk modification commands
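Agent already runs as a limited user; since smartctl needs root, grant it through a narrowly scoped sudo rule for smartctl only rather than broader privilege escalation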
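One way to enforce the read-only constraint in code is to build every smartctl invocation through an allowlist. The sketch below is an assumption about how that could look, not existing RedFlag code: `readOnlyFlags`, `devicePattern`, and `smartctlCommand` are hypothetical names, and the device pattern is deliberately strict. It only constructs the command and does not run it, and it leaves out the sudo wrapper that the install notes would cover.

```go
package main

import (
	"fmt"
	"os/exec"
	"regexp"
)

// readOnlyFlags is a hypothetical allowlist of smartctl options that only
// read information from the drive.
var readOnlyFlags = map[string]bool{
	"-H": true, // overall health assessment
	"-A": true, // vendor attribute table
	"-i": true, // device identity information
	"-a": true, // all SMART information
}

// devicePattern accepts simple device nodes such as /dev/sda or /dev/nvme0n1.
var devicePattern = regexp.MustCompile(`^/dev/[A-Za-z0-9]+$`)

// smartctlCommand returns a command for an allowlisted flag and a plausible
// device path, and rejects everything else.
func smartctlCommand(device, flag string) (*exec.Cmd, error) {
	if !readOnlyFlags[flag] {
		return nil, fmt.Errorf("smartctl flag %q is not on the read-only allowlist", flag)
	}
	if !devicePattern.MatchString(device) {
		return nil, fmt.Errorf("unexpected device path %q", device)
	}
	// Arguments are passed as a list, so no shell interprets them.
	return exec.Command("smartctl", flag, device), nil
}

func main() {
	cmd, err := smartctlCommand("/dev/sda", "-A")
	if err != nil {
		fmt.Println("rejected:", err)
		return
	}
	fmt.Println("would run:", cmd.Args)

	// -s toggles SMART on the device, so it is not on the allowlist.
	if _, err := smartctlCommand("/dev/sda", "-s"); err != nil {
		fmt.Println("rejected:", err)
	}
}
```

The self-test scheduling described earlier would need `-t` added as a separate, explicitly reviewed case, since it starts an operation on the drive rather than only reading data.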

Cross-platform:

Linux: smartctl (easy)
Windows: WMI, or smartctl via the Windows build of smartmontools or Cygwin (needs research)
Docker: need to pass host device access through to the container

Next Steps

Immediate: Add smartmontools to agent install scripts
This week: Create PoC disk health scanner
Next sprint: Integrate with agent heartbeat
v0.2.0: Full disk health dashboard + alerts

Estimates:

Linux scanner: 2-3 days
Windows scanner: 3-5 days (research needed)
Server dashboard: 3-4 days
Alert system: 2-3 days
Testing: 2-3 days

Total: 12-18 working days if done sequentially (roughly 2.5 to 3.5 weeks) to production-ready disk health monitoring

The Bottom Line

Tonight's incident cost us:

4 hours of troubleshooting
107GB music collection at risk
2 unclean shutdowns
Corrupted filesystems (NTFS + exFAT)
A lot of frustration

Disk health monitoring (SMART checks plus I/O tracking) could have:

Warned about the USB drive issues before the copy
Alerted on I/O saturation before crash
Given us early warning on the 10TB drive health
Provided data to prevent the crash

This is infrastructure 101. We need this yesterday.

Priority: CRITICAL
Effort: Medium (roughly 2.5 to 3.5 weeks)
Impact: High (prevents data loss, adds competitive advantage)
User Requested: YES (after tonight's incident)
ConnectWise Can't Match: Hardware-level monitoring is their blind spot

Status: Ready for implementation planning
