CRITICAL: Add SMART Disk Monitoring to RedFlag

Why This Is Urgent

After tonight's USB drive transfer incident (107GB music collection, system crash due to I/O overload), it's painfully clear that RedFlag needs disk health monitoring.

What happened:

  • First rsync attempt maxed out disk I/O (no bandwidth limit)
  • System became unresponsive, required hard reboot
  • NTFS filesystem on 10TB drive corrupted from unclean shutdown
  • exFAT USB drive also had unmount corruption
  • Lost ~4 hours to troubleshooting and recovery

The realization: We have NO early warning system for disk health issues in RedFlag. We're flying blind on hardware failure until it's too late.

Why RedFlag Needs SMART Monitoring

Current gaps:

  • No early warning of impending drive failure
  • No automatic disk health checks
  • No alerts for reallocated sectors, high temps, or pre-fail indicators
  • No monitoring of I/O saturation that could cause crashes
  • No proactive maintenance recommendations

What SMART monitoring gives us:

  • Early warning of drive failure (days/weeks before total failure)
  • Temperature monitoring (prevent thermal throttling/damage)
  • Reallocated sector tracking (a rising count indicates the media is degrading)
  • I/O error rate monitoring (predicts filesystem corruption)
  • Proactive replacement recommendations (maintenance windows, not emergencies)
  • Correlation between update operations and disk health (did that update cause issues?)

Proposed Implementation

Disk Health Scanner Module

// New scanner module: agent/pkg/scanners/disk_health.go
package scanners

import "time"

// DiskHealthStatus is the per-device health report produced by the scanner.
type DiskHealthStatus struct {
    Device              string    `json:"device"`
    SMARTStatus         string    `json:"smart_status"` // PASSED/FAILED
    Temperature         int       `json:"temperature_c"`
    ReallocatedSectors  int       `json:"reallocated_sectors"`
    PendingSectors      int       `json:"pending_sectors"`
    UncorrectableErrors int       `json:"uncorrectable_errors"`
    PowerOnHours        int       `json:"power_on_hours"`
    LastTestDate        time.Time `json:"last_self_test"`
    HealthScore         int       `json:"health_score"` // 0-100
    CriticalAttributes  []string  `json:"critical_attributes,omitempty"`
}
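
One way to fill a struct like this is to decode the JSON report that smartctl 7.0 and later emit with --json. The sketch below is a minimal illustration, not the planned implementation: DiskSummary is a hypothetical, trimmed stand-in for DiskHealthStatus, parseSmartctlJSON is an invented helper name, and sampleReport is hand-written test data rather than output from a real drive. It assumes an ATA/SATA drive; NVMe drives report health in a different section of the JSON and would need their own mapping.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// smartctlReport mirrors only the fields of `smartctl --json -a` output
// that this sketch reads; everything else in the report is ignored.
type smartctlReport struct {
	Device struct {
		Name string `json:"name"`
	} `json:"device"`
	// Pointer, so a report with no SMART status is not mistaken for a failure.
	SmartStatus *struct {
		Passed bool `json:"passed"`
	} `json:"smart_status"`
	Temperature struct {
		Current int `json:"current"`
	} `json:"temperature"`
	PowerOnTime struct {
		Hours int `json:"hours"`
	} `json:"power_on_time"`
	ATASmartAttributes struct {
		Table []struct {
			ID  int `json:"id"`
			Raw struct {
				Value int64 `json:"value"`
			} `json:"raw"`
		} `json:"table"`
	} `json:"ata_smart_attributes"`
}

// DiskSummary is a hypothetical, trimmed-down stand-in for DiskHealthStatus.
type DiskSummary struct {
	Device              string
	SMARTStatus         string // PASSED, FAILED, or UNKNOWN
	Temperature         int
	ReallocatedSectors  int64
	PendingSectors      int64
	UncorrectableErrors int64
	PowerOnHours        int
}

// parseSmartctlJSON decodes one smartctl JSON report into a DiskSummary.
func parseSmartctlJSON(data []byte) (DiskSummary, error) {
	var r smartctlReport
	if err := json.Unmarshal(data, &r); err != nil {
		return DiskSummary{}, fmt.Errorf("decode smartctl output: %w", err)
	}
	s := DiskSummary{
		Device:       r.Device.Name,
		SMARTStatus:  "UNKNOWN",
		Temperature:  r.Temperature.Current,
		PowerOnHours: r.PowerOnTime.Hours,
	}
	if r.SmartStatus != nil {
		if r.SmartStatus.Passed {
			s.SMARTStatus = "PASSED"
		} else {
			s.SMARTStatus = "FAILED"
		}
	}
	// Map the raw values of the attributes the proposal cares about.
	// Attribute 196 has no matching field in the struct and is skipped here.
	for _, a := range r.ATASmartAttributes.Table {
		switch a.ID {
		case 5:
			s.ReallocatedSectors = a.Raw.Value
		case 197:
			s.PendingSectors = a.Raw.Value
		case 198:
			s.UncorrectableErrors = a.Raw.Value
		}
	}
	return s, nil
}

// sampleReport is hand-written test data, not captured from a real drive.
const sampleReport = `{
  "device": {"name": "/dev/sdb"},
  "smart_status": {"passed": true},
  "temperature": {"current": 41},
  "power_on_time": {"hours": 45000},
  "ata_smart_attributes": {"table": [
    {"id": 5, "raw": {"value": 15}},
    {"id": 197, "raw": {"value": 2}},
    {"id": 198, "raw": {"value": 0}}
  ]}
}`

func main() {
	s, err := parseSmartctlJSON([]byte(sampleReport))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", s)
}
```

Decoding the JSON report avoids scraping the human-readable table, whose column layout differs between drive types.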

Agent-Side Features

  1. Scheduled SMART Checks

    • Run smartctl -a every 6 hours
    • Parse critical attributes: 5 (Reallocated_Sector_Ct), 196 (Reallocated_Event_Count), 197 (Current_Pending_Sector), 198 (Offline_Uncorrectable)
    • Calculate health score (0-100 scale)
  2. Self-Test Scheduling

    • Short self-test: Weekly
    • Long self-test: Monthly
    • Log results to agent's local DB
  3. I/O Monitoring

    • Track disk utilization %
    • Monitor I/O wait times
    • Alert on sustained >80% utilization (early warning before the system becomes unresponsive)
  4. Temperature Alerts

    • Warning at 45°C
    • Critical at 50°C
    • Log thermal throttling events

Server-Side Features

  1. Disk Health Dashboard

    • Show all drives across all agents
    • Color-coded health status (green/yellow/red)
    • Temperature graphs over time
    • Predicted failure timeline
  2. Alert System

    • Email when health score drops below 70
    • Critical alert when below 50
    • Immediate alert on SMART failure
    • Temperature spike notifications
  3. Maintenance Recommendations

    • "Drive /dev/sdb showing 15 reallocated sectors - recommend replacement within 30 days"
    • "Temperature consistently above 45°C - check cooling"
    • "Drive has 45,000 hours - consider proactive replacement"
  4. Correlation with Updates

    • "System update initiated while disk I/O at 92% - potential correlation?"
    • Track if updates cause disk health degradation
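
The alert thresholds listed above can be expressed as a small classification function. This is a sketch only: AlertLevel and classify are hypothetical names, the score and temperature cut-offs are the ones proposed in this document, and treating a SMART failure as critical is an assumption about what "immediate alert" should mean.

```go
package main

import "fmt"

// AlertLevel is a hypothetical severity type for disk health alerts.
type AlertLevel int

const (
	AlertNone AlertLevel = iota
	AlertWarning
	AlertCritical
)

func (l AlertLevel) String() string {
	switch l {
	case AlertWarning:
		return "WARNING"
	case AlertCritical:
		return "CRITICAL"
	default:
		return "OK"
	}
}

// classify applies the proposed thresholds:
//   - SMART status FAILED: critical (assumed meaning of "immediate alert")
//   - health score below 50, or temperature at or above 50 C: critical
//   - health score below 70, or temperature at or above 45 C: warning
func classify(smartStatus string, healthScore, temperatureC int) AlertLevel {
	switch {
	case smartStatus == "FAILED", healthScore < 50, temperatureC >= 50:
		return AlertCritical
	case healthScore < 70, temperatureC >= 45:
		return AlertWarning
	default:
		return AlertNone
	}
}

func main() {
	fmt.Println(classify("PASSED", 85, 38)) // healthy drive
	fmt.Println(classify("PASSED", 65, 38)) // score below 70
	fmt.Println(classify("FAILED", 95, 30)) // SMART failure overrides the score
}
```

Keeping the thresholds in one function makes it easy to turn them into per-agent configuration later.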

Why This Can't Wait

The $600k/year ConnectWise can't do this:

  • Their agents don't have hardware-level access
  • Cloud model prevents local disk monitoring
  • Proprietary code prevents community additions

RedFlag's advantage:

  • Self-hosted agents have full system access
  • Open source - community can contribute disk monitoring
  • Hardware binding already in place - perfect foundation
  • Error transparency means we see disk issues immediately

Business case:

  • One prevented data loss incident = justification
  • Proactive replacement vs emergency outage = measurable ROI
  • MSPs can offer disk health monitoring as premium service
  • Homelabbers catch failing drives before losing family photos

Technical Considerations

Dependencies:

  • smartmontools package on agents (most distros have it)
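  • Agent needs root privileges for smartctl, e.g. a sudoers rule limited to smartctl (document in install)
  • SMART data comes from the drive itself, so it is independent of the filesystem (NTFS, exFAT, ext4); ntfs-3g is not needed for it. USB enclosures may need a smartctl -d option (e.g. -d sat) or may not pass SMART through at all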
  • Windows agents need different implementation (WMI)

Security:

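  • Limited to reading SMART data and starting drive self-tests
  • No disk modification commands
  • Agent runs as limited user; smartctl needs root, so grant it through a sudoers rule scoped to smartctl only instead of running the whole agent as root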
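
To keep the agent restricted to read-only SMART queries, the command line can be built from a fixed argument list and a validated device path, so nothing from outside can add smartctl options. A minimal sketch, assuming a sudoers rule that lets the agent user run smartctl without a password; readOnlySmartctlArgs, readSMART, and the device pattern are hypothetical and would need adjusting for other device naming schemes.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
	"regexp"
)

// devicePattern is an assumed allowlist: whole SATA/SCSI disks and NVMe namespaces.
var devicePattern = regexp.MustCompile(`^/dev/(sd[a-z]+|nvme[0-9]+n[0-9]+)$`)

// readOnlySmartctlArgs returns the fixed, query-only argument list for a
// device, or an error if the device path does not match the allowlist.
func readOnlySmartctlArgs(device string) ([]string, error) {
	if !devicePattern.MatchString(device) {
		return nil, fmt.Errorf("refusing unexpected device path %q", device)
	}
	return []string{"--json", "--all", device}, nil
}

// readSMART runs smartctl through non-interactive sudo and returns its JSON output.
func readSMART(device string) ([]byte, error) {
	args, err := readOnlySmartctlArgs(device)
	if err != nil {
		return nil, err
	}
	cmd := exec.Command("sudo", append([]string{"-n", "smartctl"}, args...)...)
	out, err := cmd.Output()
	// smartctl's exit status is a bitmask of findings, so a non-zero exit
	// can still come with a usable report on stdout.
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) && len(out) > 0 {
		return out, nil
	}
	return out, err
}

func main() {
	// Only the argument builder is exercised here; readSMART needs smartctl and sudo.
	for _, dev := range []string{"/dev/sdb", "--smart=off"} {
		args, err := readOnlySmartctlArgs(dev)
		fmt.Println(args, err)
	}
}
```

The second input shows why validation matters: without it, a value like --smart=off would reach smartctl as an option rather than a device.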

Cross-platform:

  • Linux: smartctl (easy)
  • Windows: WMI or the native Windows build of smartmontools (need research)
  • Docker: need to pass host device access into the container (e.g. --device plus the SYS_RAWIO capability)

Next Steps

  1. Immediate: Add smartmontools to agent install scripts
  2. This week: Create PoC disk health scanner
  3. Next sprint: Integrate with agent heartbeat
  4. v0.2.0: Full disk health dashboard + alerts

Estimates:

  • Linux scanner: 2-3 days
  • Windows scanner: 3-5 days (research needed)
  • Server dashboard: 3-4 days
  • Alert system: 2-3 days
  • Testing: 2-3 days

Total: 12-18 working days (about 2.5 to 3.5 weeks) to production-ready disk health monitoring

The Bottom Line

Tonight's incident cost us:

  • 4 hours of troubleshooting
  • 107GB music collection at risk
  • 2 unclean shutdowns
  • Corrupted filesystems (NTFS + exFAT)
  • A lot of frustration

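Disk health monitoring (SMART plus I/O metrics) could have:

  • Warned about USB drive issues before the copy, provided the enclosure passes SMART through
  • Alerted on I/O saturation before the crash (from the I/O monitoring, not SMART itself)
  • Given us early warning on the 10TB drive health
  • Provided data to help prevent the crash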

This is infrastructure 101. We need this yesterday.


Priority: CRITICAL
Effort: Medium (12-18 working days)
Impact: High (prevents data loss, adds competitive advantage)
User Requested: YES (after tonight's incident)
ConnectWise Can't Match: Hardware-level monitoring is their blind spot

Status: Ready for implementation planning