Server Health Dashboard Component
Priority: P3 (Enhancement)
Source Reference: From todos.md line 6
Status: Ready for Implementation
Problem Statement
Administrators lack visibility into server health status, coordination components, and overall system performance. There is no centralized dashboard showing server agent/coordinator selection mechanisms, version verification, config validation, or health check integration.
Feature Description
Create a Server Health Dashboard that provides real-time monitoring of server status, health indicators, coordination components, and performance metrics to help administrators understand system health and troubleshoot issues.
Acceptance Criteria
- Real-time server status monitoring dashboard
- Health check integration with settings page
- Server agent/coordinator selection mechanism visibility
- Version verification and config validation status
- Performance metrics display (CPU, memory, database connections)
- Alert thresholds for critical server health issues
- Historical health data tracking
- System status indicators (database, API, scheduler)
Technical Approach
1. Server Health Service
Health Monitoring Service (aggregator-server/internal/services/health_service.go):
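// ServerHealth is the top-level payload returned by the health endpoint.
type ServerHealth struct {
	ServerID        string          `json:"server_id"`
	Status          string          `json:"status"` // "healthy", "degraded", "unhealthy"
	Uptime          time.Duration   `json:"uptime"` // encoded in JSON as integer nanoseconds
	Version         string          `json:"version"`
	DatabaseStatus  DatabaseHealth  `json:"database_status"`
	SchedulerStatus SchedulerHealth `json:"scheduler_status"`
	APIServerStatus APIServerHealth `json:"api_server_status"`
	SystemMetrics   SystemMetrics   `json:"system_metrics"`
	LastHealthCheck time.Time       `json:"last_health_check"`
	HealthIssues    []HealthIssue   `json:"health_issues"`
}

type DatabaseHealth struct {
	Status         string    `json:"status"`
	ConnectionPool int       `json:"connection_pool"`
	ResponseTime   float64   `json:"response_time"`
	LastChecked    time.Time `json:"last_checked"`
}

type SchedulerHealth struct {
	Status           string    `json:"status"`
	RunningJobs      int       `json:"running_jobs"`
	QueueLength      int       `json:"queue_length"`
	LastJobExecution time.Time `json:"last_job_execution"`
}

// SystemMetrics mirrors the system_metrics object consumed by the dashboard.
type SystemMetrics struct {
	CPUUsage    float64 `json:"cpu_usage"`
	MemoryUsage float64 `json:"memory_usage"`
	DiskUsage   float64 `json:"disk_usage"`
}

// HealthIssue mirrors the health_issues entries consumed by the dashboard.
type HealthIssue struct {
	Type     string `json:"type"`
	Message  string `json:"message"`
	Severity string `json:"severity"` // "info", "warning", "critical"
}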
Health Check Implementation:
func (s *HealthService) CheckServerHealth() (*ServerHealth, error) {
	health := &ServerHealth{
		ServerID:        s.serverID,
		Status:          "healthy",
		LastHealthCheck: time.Now(),
	}

	// Database health check
	dbHealth, err := s.checkDatabaseHealth()
	if err != nil {
		health.HealthIssues = append(health.HealthIssues, HealthIssue{
			Type:     "database",
			Message:  fmt.Sprintf("Database health check failed: %v", err),
			Severity: "critical",
		})
		health.Status = "unhealthy"
		// dbHealth is nil on error, so record a failed status instead of dereferencing it
		health.DatabaseStatus = DatabaseHealth{Status: "unhealthy", LastChecked: time.Now()}
	} else {
		health.DatabaseStatus = *dbHealth
	}

	// Scheduler health check
	schedulerHealth := s.checkSchedulerHealth()
	health.SchedulerStatus = *schedulerHealth

	// System metrics
	systemMetrics := s.getSystemMetrics()
	health.SystemMetrics = *systemMetrics

	// Overall status determination
	health.determineOverallStatus()

	return health, nil
}
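The check above calls determineOverallStatus, which this document does not define. One possible reading, sketched below as a standalone function, derives the status from the severities of the recorded issues. The mapping (any critical issue gives "unhealthy", any warning gives "degraded") and the name overallStatus are assumptions for illustration, not the project's actual logic.

```go
package main

import "fmt"

// HealthIssue mirrors the issue shape used in this document.
type HealthIssue struct {
	Type     string
	Message  string
	Severity string // "info", "warning", or "critical"
}

// overallStatus is a hypothetical stand-in for determineOverallStatus.
// Assumed mapping: any critical issue -> "unhealthy",
// any warning -> "degraded", otherwise "healthy".
func overallStatus(issues []HealthIssue) string {
	status := "healthy"
	for _, issue := range issues {
		switch issue.Severity {
		case "critical":
			// Nothing can outrank a critical issue, so stop early.
			return "unhealthy"
		case "warning":
			status = "degraded"
		}
	}
	return status
}

func main() {
	issues := []HealthIssue{
		{Type: "database", Message: "slow queries", Severity: "warning"},
	}
	fmt.Println(overallStatus(issues)) // prints: degraded
}
```

Deriving the status from the issue list in one place keeps the individual checks free of status bookkeeping; each check only has to append an issue with the right severity.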
2. Database Health Monitoring
Database Connection Health:
func (s *HealthService) checkDatabaseHealth() (*DatabaseHealth, error) {
	start := time.Now()

	// Test database connection
	var result int
	err := s.db.QueryRow("SELECT 1").Scan(&result)
	if err != nil {
		return nil, fmt.Errorf("database connection failed: %w", err)
	}

	// Report milliseconds, since the dashboard renders this value with an "ms" suffix
	responseTime := time.Since(start).Seconds() * 1000

	// Get connection pool stats
	stats := s.db.Stats()

	return &DatabaseHealth{
		Status:         "healthy",
		ConnectionPool: stats.OpenConnections,
		ResponseTime:   responseTime,
		LastChecked:    time.Now(),
	}, nil
}
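A health probe that can hang is itself a health risk, so it may be worth bounding the database check with a deadline. The sketch below shows the idea with a hypothetical probeFunc type standing in for a real call such as db.PingContext(ctx) from database/sql. The names timedProbe, okProbe, hangingProbe and the 50 ms demoTimeout are illustrative assumptions.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// probeFunc is a hypothetical stand-in for a database call that honours
// a context, for example db.PingContext(ctx).
type probeFunc func(ctx context.Context) error

// demoTimeout is an illustrative value, not a recommended setting.
const demoTimeout = 50 * time.Millisecond

// timedProbe runs probe under a deadline and returns the elapsed time
// in milliseconds together with any (wrapped) error.
func timedProbe(probe probeFunc, timeout time.Duration) (float64, error) {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	start := time.Now()
	err := probe(ctx)
	elapsedMs := time.Since(start).Seconds() * 1000
	if err != nil {
		return elapsedMs, fmt.Errorf("database probe failed: %w", err)
	}
	return elapsedMs, nil
}

// okProbe is a fake probe that succeeds immediately.
func okProbe(ctx context.Context) error { return nil }

// hangingProbe is a fake probe that blocks until the context is cancelled.
func hangingProbe(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

func main() {
	_, err := timedProbe(hangingProbe, demoTimeout)
	// The deadline error survives wrapping, so callers can detect timeouts.
	fmt.Println(errors.Is(err, context.DeadlineExceeded)) // prints: true
}
```

With a real *sql.DB, the same pattern would pass the context to QueryRowContext or PingContext so the driver can abandon the call when the deadline expires.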
3. API Endpoint
Health API Handler (aggregator-server/internal/api/handlers/health.go):
// GET /api/v1/health
func (h *HealthHandler) GetServerHealth(c *gin.Context) {
	health, err := h.healthService.CheckServerHealth()
	if err != nil {
		c.JSON(http.StatusInternalServerError, gin.H{
			"status": "unhealthy",
			"error":  err.Error(),
		})
		return
	}
	c.JSON(http.StatusOK, health)
}

// GET /api/v1/health/history
func (h *HealthHandler) GetHealthHistory(c *gin.Context) {
	// Return historical health data for charts
}
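The history endpoint is left as a stub, and this document does not say where historical samples are stored. As one possible starting point, the sketch below keeps the most recent samples in a fixed-size in-memory ring buffer. The HealthSample and History types are hypothetical, and a real implementation might persist samples to the database instead.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// HealthSample is a hypothetical, trimmed-down history record.
type HealthSample struct {
	At             time.Time
	Status         string
	ResponseTimeMs float64
}

// History is a fixed-capacity ring buffer holding the newest samples.
type History struct {
	mu      sync.Mutex
	samples []HealthSample
	next    int  // index where the next sample is written
	full    bool // true once the buffer has wrapped at least once
}

// NewHistory creates a buffer that retains at most capacity samples.
func NewHistory(capacity int) *History {
	if capacity < 1 {
		capacity = 1
	}
	return &History{samples: make([]HealthSample, capacity)}
}

// Add stores a sample, overwriting the oldest one when the buffer is full.
func (h *History) Add(s HealthSample) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.samples[h.next] = s
	h.next = (h.next + 1) % len(h.samples)
	if h.next == 0 {
		h.full = true
	}
}

// Snapshot returns a copy of the stored samples, oldest first.
func (h *History) Snapshot() []HealthSample {
	h.mu.Lock()
	defer h.mu.Unlock()
	if !h.full {
		return append([]HealthSample(nil), h.samples[:h.next]...)
	}
	out := make([]HealthSample, 0, len(h.samples))
	out = append(out, h.samples[h.next:]...)
	out = append(out, h.samples[:h.next]...)
	return out
}

func main() {
	h := NewHistory(3)
	for i := 1; i <= 5; i++ {
		h.Add(HealthSample{At: time.Now(), Status: "healthy", ResponseTimeMs: float64(i)})
	}
	// Only the three newest samples remain: response times 3, 4 and 5.
	for _, s := range h.Snapshot() {
		fmt.Println(s.ResponseTimeMs)
	}
}
```

A handler could serialize Snapshot() directly for the charts. The fixed capacity bounds memory use, at the cost of losing history on restart.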
4. Frontend Dashboard Component
Server Health Dashboard (aggregator-web/src/pages/ServerHealth.tsx):
interface ServerHealth {
  server_id: string;
  status: 'healthy' | 'degraded' | 'unhealthy';
  uptime: number;
  version: string;
  database_status: {
    status: string;
    connection_pool: number;
    response_time: number;
  };
  scheduler_status: {
    status: string;
    running_jobs: number;
    queue_length: number;
  };
  system_metrics: {
    cpu_usage: number;
    memory_usage: number;
    disk_usage: number;
  };
  health_issues: Array<{
    type: string;
    message: string;
    severity: 'info' | 'warning' | 'critical';
  }>;
}
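import React, { useCallback, useEffect, useState } from 'react';

// Assumed refresh interval; the spec does not fix a value.
const REFRESH_INTERVAL_MS = 30000;

const ServerHealthDashboard: React.FC = () => {
  const [health, setHealth] = useState<ServerHealth | null>(null);
  const [autoRefresh, setAutoRefresh] = useState(true);
  // Placeholder until GET /api/v1/health/history is implemented; this payload shape is an assumption.
  const [historicalData] = useState<{ responseTime: number[]; systemLoad: number[] }>({
    responseTime: [],
    systemLoad: [],
  });

  // Load the current health snapshot (could be routed through src/lib/api.ts).
  const fetchHealthData = useCallback(async () => {
    try {
      const response = await fetch('/api/v1/health');
      if (!response.ok) {
        throw new Error(`Health request failed with status ${response.status}`);
      }
      setHealth(await response.json());
    } catch (error) {
      console.error('Failed to load server health', error);
    }
  }, []);

  // Fetch once on mount, then poll while auto-refresh is enabled.
  useEffect(() => {
    fetchHealthData();
    if (!autoRefresh) return;
    const timer = setInterval(fetchHealthData, REFRESH_INTERVAL_MS);
    return () => clearInterval(timer);
  }, [autoRefresh, fetchHealthData]);

  return (
    <div className="server-health-dashboard">
      <div className="health-header">
        <h2>Server Health</h2>
        <div className="health-controls">
          <RefreshToggle enabled={autoRefresh} onChange={setAutoRefresh} />
          <RefreshButton onClick={() => fetchHealthData()} />
        </div>
      </div>

      {/* Overall Status */}
      <div className="overall-status">
        <StatusIndicator
          status={health?.status || 'unknown'}
          message={`Server ${health?.status || 'unknown'}`}
        />
        <div className="uptime">
          Uptime: {formatDuration(health?.uptime || 0)}
        </div>
      </div>

      {/* Health Metrics Grid */}
      <div className="health-metrics-grid">
        <HealthCard
          title="Database"
          status={health?.database_status.status ?? 'unknown'}
          metrics={[
            { label: "Connections", value: health?.database_status.connection_pool ?? '-' },
            { label: "Response Time", value: health ? `${health.database_status.response_time.toFixed(2)}ms` : '-' }
          ]}
        />
        <HealthCard
          title="Scheduler"
          status={health?.scheduler_status.status ?? 'unknown'}
          metrics={[
            { label: "Running Jobs", value: health?.scheduler_status.running_jobs ?? '-' },
            { label: "Queue Length", value: health?.scheduler_status.queue_length ?? '-' }
          ]}
        />
        <HealthCard
          title="System Resources"
          status="healthy"
          metrics={[
            { label: "CPU", value: `${health?.system_metrics.cpu_usage ?? '-'}%` },
            { label: "Memory", value: `${health?.system_metrics.memory_usage ?? '-'}%` },
            { label: "Disk", value: `${health?.system_metrics.disk_usage ?? '-'}%` }
          ]}
        />
      </div>

      {/* Health Issues */}
      {health?.health_issues && health.health_issues.length > 0 && (
        <div className="health-issues">
          <h3>Health Issues</h3>
          {health.health_issues.map((issue, index) => (
            <HealthIssueAlert key={index} issue={issue} />
          ))}
        </div>
      )}

      {/* Historical Charts */}
      <div className="health-charts">
        <h3>Historical Health Data</h3>
        <div className="charts-grid">
          <HealthChart
            title="Response Time"
            data={historicalData.responseTime}
            unit="ms"
          />
          <HealthChart
            title="System Load"
            data={historicalData.systemLoad}
            unit="%"
          />
        </div>
      </div>
    </div>
  );
};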
5. Health Monitoring Components
Status Indicator Component:
const StatusIndicator: React.FC<{ status: string; message: string }> = ({ status, message }) => {
  const getStatusColor = (status: string) => {
    switch (status) {
      case 'healthy': return 'green';
      case 'degraded': return 'yellow';
      case 'unhealthy': return 'red';
      default: return 'gray';
    }
  };

  return (
    <div className={`status-indicator ${getStatusColor(status)}`}>
      <div className="status-dot"></div>
      <span className="status-message">{message}</span>
    </div>
  );
};
Health Card Component:
interface HealthCardProps {
  title: string;
  status: string;
  metrics: Array<{ label: string; value: string | number }>;
}

const HealthCard: React.FC<HealthCardProps> = ({ title, status, metrics }) => {
  return (
    <div className={`health-card status-${status}`}>
      <div className="card-header">
        <h3>{title}</h3>
        <StatusBadge status={status} />
      </div>
      <div className="card-metrics">
        {metrics.map((metric, index) => (
          <div key={index} className="metric">
            <span className="metric-label">{metric.label}:</span>
            <span className="metric-value">{metric.value}</span>
          </div>
        ))}
      </div>
    </div>
  );
};
Definition of Done
- ✅ Server health monitoring service implemented
- ✅ Database, scheduler, and system resource health checks
- ✅ Real-time health dashboard with status indicators
- ✅ Historical health data tracking and visualization
- ✅ Alert system for critical health issues
- ✅ Auto-refresh functionality
- ✅ Mobile-responsive design
- ✅ Integration with existing settings page
Test Plan
Unit Tests
- Health check calculations
- Status determination logic
- Error handling scenarios
Integration Tests
- Database health check under load
- Scheduler monitoring accuracy
- System metrics collection
Stress Tests
- Dashboard performance under high load
- Health check impact on system resources
- Concurrent health monitoring
Scenario Tests
- Database connection failures
- High system load conditions
- Scheduler queue overflow scenarios
Files to Modify
- aggregator-server/internal/services/health_service.go - New service
- aggregator-server/internal/api/handlers/health.go - New handlers
- aggregator-web/src/pages/ServerHealth.tsx - New dashboard
- aggregator-web/src/components/StatusIndicator.tsx - Status components
- aggregator-web/src/components/HealthCard.tsx - Health card component
- aggregator-web/src/lib/api.ts - API integration
Health Check Categories
1. System Health
- CPU usage percentage
- Memory usage percentage
- Disk space availability
- Network connectivity
2. Application Health
- Database connectivity and response time
- API server responsiveness
- Scheduler operation status
- Background service status
3. Business Logic Health
- Agent registration flow
- Command queue processing
- Update distribution
- Token management
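The three categories above could share one small interface so that each check is registered once and reported under its category. The sketch below is only an illustration of that idea: the Checker interface, funcChecker adapter, runChecks function, and the demo check names are all hypothetical and do not come from the existing codebase.

```go
package main

import (
	"errors"
	"fmt"
)

// Checker is a hypothetical interface; one implementation per health check.
type Checker interface {
	Name() string
	Category() string // "system", "application", or "business"
	Check() error     // nil means the check passed
}

// funcChecker adapts a plain function to the Checker interface.
type funcChecker struct {
	name     string
	category string
	fn       func() error
}

func (f funcChecker) Name() string     { return f.name }
func (f funcChecker) Category() string { return f.category }
func (f funcChecker) Check() error     { return f.fn() }

// runChecks executes every checker and groups failure messages by category.
func runChecks(checkers []Checker) map[string][]string {
	failures := make(map[string][]string)
	for _, c := range checkers {
		if err := c.Check(); err != nil {
			msg := fmt.Sprintf("%s: %v", c.Name(), err)
			failures[c.Category()] = append(failures[c.Category()], msg)
		}
	}
	return failures
}

// demoCheckers returns fake checks, one per category, for illustration.
func demoCheckers() []Checker {
	return []Checker{
		funcChecker{name: "disk_space", category: "system", fn: func() error { return nil }},
		funcChecker{name: "database", category: "application", fn: func() error {
			return errors.New("connection refused")
		}},
		funcChecker{name: "command_queue", category: "business", fn: func() error { return nil }},
	}
}

func main() {
	for category, msgs := range runChecks(demoCheckers()) {
		fmt.Println(category, msgs)
	}
}
```

Grouping failures by category lets the dashboard show one card per category while individual checks stay small and independently testable.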
Alert Thresholds
Critical Alerts
- Database connection failures
- CPU usage > 90% for > 5 minutes
- Memory usage > 95%
- Scheduler queue length > 1000
Warning Alerts
- Database response time > 1 second
- CPU usage > 80% for > 10 minutes
- Memory usage > 85%
- Queue length > 500
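Several thresholds above are instantaneous (memory, queue length), while the CPU rules only fire after the value stays high for a period. A rough sketch of both kinds follows. The type and function names (sustainedCheck, memorySeverity, queueSeverity, firstAlertMinute) are hypothetical, and the one-sample-per-minute replay is an assumption made purely for the demonstration.

```go
package main

import (
	"fmt"
	"time"
)

// sustainedCheck fires once a value has stayed above limit for longer than hold.
type sustainedCheck struct {
	limit float64
	hold  time.Duration
	since time.Time // zero while the value is at or below limit
}

// observe records one sample and reports whether the alert condition holds.
func (c *sustainedCheck) observe(value float64, now time.Time) bool {
	if value <= c.limit {
		c.since = time.Time{} // a dip below the limit resets the timer
		return false
	}
	if c.since.IsZero() {
		c.since = now
	}
	return now.Sub(c.since) > c.hold
}

// Thresholds taken from the lists above.
func newCPUCritical() *sustainedCheck { return &sustainedCheck{limit: 90, hold: 5 * time.Minute} }
func newCPUWarning() *sustainedCheck  { return &sustainedCheck{limit: 80, hold: 10 * time.Minute} }

// memorySeverity classifies a memory usage percentage; "" means no alert.
func memorySeverity(pct float64) string {
	switch {
	case pct > 95:
		return "critical"
	case pct > 85:
		return "warning"
	}
	return ""
}

// queueSeverity classifies a scheduler queue length; "" means no alert.
func queueSeverity(length int) string {
	switch {
	case length > 1000:
		return "critical"
	case length > 500:
		return "warning"
	}
	return ""
}

// firstAlertMinute replays one sample per minute and returns the index of
// the first sample that triggers the check, or -1 if none does.
func firstAlertMinute(c *sustainedCheck, perMinute []float64) int {
	start := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	for i, v := range perMinute {
		if c.observe(v, start.Add(time.Duration(i)*time.Minute)) {
			return i
		}
	}
	return -1
}

func main() {
	cpu := []float64{92, 92, 92, 92, 92, 92, 92, 92}
	// The critical rule needs more than 5 minutes above 90%, so it fires at minute 6.
	fmt.Println(firstAlertMinute(newCPUCritical(), cpu)) // prints: 6
	fmt.Println(memorySeverity(88), queueSeverity(1200)) // prints: warning critical
}
```

Tracking the time of the first breach, rather than averaging, means a single dip below the limit restarts the window. Whether that is the intended behaviour for the CPU rules is a design decision this document leaves open.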
Estimated Effort
- Development: 16-20 hours
- Testing: 8-12 hours
- Review: 4-6 hours
- Design/UX: 6-8 hours
Dependencies
- Existing monitoring infrastructure
- System metrics collection
- Database connection pooling
- Background job processing
Risk Assessment
Low Risk - Enhancement that adds monitoring capabilities without affecting core functionality. Health checks are read-only operations with minimal system impact.
Implementation Phases
Phase 1: Core Health Service
- Implement health check service
- Create health monitoring endpoints
- Basic status determination logic
Phase 2: Dashboard UI
- Create health dashboard layout
- Implement status indicators and metrics
- Add real-time updates
Phase 3: Advanced Features
- Historical data tracking
- Alert system integration
- Performance optimization
Future Enhancements
- Multi-Server Monitoring: Support for clustered deployments
- Predictive Health: ML-based health prediction
- Automated Remediation: Self-healing capabilities
- Integration with External Monitoring: Prometheus, Grafana
- Custom Health Checks: Pluggable health check system