Update Metrics Dashboard
Priority: P3 (Enhancement)
Source Reference: From todos.md line 60
Status: Ready for Implementation
Problem Statement
Administrators lack visibility into update operations across their agent fleet. There is no centralized dashboard showing update success/failure rates, agent update readiness, or performance analytics for update operations.
Feature Description
Create a comprehensive Update Metrics Dashboard that provides real-time visibility into update operations, including success/failure rates, agent readiness tracking, performance analytics, and historical trend analysis for update management.
Acceptance Criteria
- Dashboard showing real-time update metrics across all agents
- Update success/failure rates with trend analysis
- Agent update readiness status and categorization
- Performance analytics for update operations
- Historical update operation tracking
- Filterable views by agent groups, time ranges, and update types
- Export capabilities for reporting
- Alert thresholds for update failure rates
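The last criterion, alert thresholds for failure rates, could be evaluated with a small pure function. The following is only a sketch: the `AlertLevel` type, the `FailureThresholds` fields, and the minimum-sample guard are assumptions for illustration and are not defined anywhere in this spec.

```go
package main

import "fmt"

// AlertLevel is a hypothetical severity scale used only in this sketch.
type AlertLevel int

const (
	AlertNone AlertLevel = iota
	AlertWarning
	AlertCritical
)

// FailureThresholds holds failure-rate limits in percent (0-100).
// The field names and the minimum-sample guard are assumptions, not part of the spec.
type FailureThresholds struct {
	WarningPercent  float64
	CriticalPercent float64
	MinSampleSize   int64 // skip alerting when too few updates have finished
}

// EvaluateFailureRate compares the failure rate of finished updates
// (successful + failed) against the configured thresholds.
func EvaluateFailureRate(successful, failed int64, t FailureThresholds) (AlertLevel, string) {
	finished := successful + failed
	if finished == 0 || finished < t.MinSampleSize {
		return AlertNone, "not enough finished updates to evaluate"
	}
	rate := float64(failed) / float64(finished) * 100
	switch {
	case rate >= t.CriticalPercent:
		return AlertCritical, fmt.Sprintf("failure rate %.1f%% reached the critical threshold of %.1f%%", rate, t.CriticalPercent)
	case rate >= t.WarningPercent:
		return AlertWarning, fmt.Sprintf("failure rate %.1f%% reached the warning threshold of %.1f%%", rate, t.WarningPercent)
	default:
		return AlertNone, fmt.Sprintf("failure rate %.1f%% is within limits", rate)
	}
}
```

Pending updates are left out of the denominator here so that a large queue does not mask a high failure rate; whether that matches the intended definition is an open design choice.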
Technical Approach
1. Backend Metrics Collection
Update Metrics Service (aggregator-server/internal/services/update_metrics.go):
type UpdateMetrics struct {
	TotalUpdates      int64     `json:"total_updates"`
	SuccessfulUpdates int64     `json:"successful_updates"`
	FailedUpdates     int64     `json:"failed_updates"`
	PendingUpdates    int64     `json:"pending_updates"`
	AverageUpdateTime float64   `json:"average_update_time"` // seconds, from avg_update_time
	UpdateSuccessRate float64   `json:"update_success_rate"` // percent, 0-100
	ReadyForUpdate    int64     `json:"ready_for_update"`
	NotReadyForUpdate int64     `json:"not_ready_for_update"`
	LastUpdated       time.Time `json:"last_updated"`
}

type UpdateMetricsTimeSeries struct {
	Timestamp   time.Time `json:"timestamp"`
	SuccessRate float64   `json:"success_rate"` // percent, 0-100
	UpdateCount int64     `json:"update_count"`
	FailureRate float64   `json:"failure_rate"` // percent, 0-100
}
Metrics Calculation:
func (s *UpdateMetricsService) CalculateUpdateMetrics(timeRange time.Duration) (*UpdateMetrics, error) {
	metrics := &UpdateMetrics{LastUpdated: time.Now()}

	// Get update statistics for events created within the time range
	stats, err := s.queries.GetUpdateStats(time.Now().Add(-timeRange))
	if err != nil {
		return nil, err
	}
	metrics.TotalUpdates = stats.TotalUpdates
	metrics.SuccessfulUpdates = stats.SuccessfulUpdates
	metrics.FailedUpdates = stats.FailedUpdates
	metrics.PendingUpdates = stats.PendingUpdates
	// Assumes stats exposes the avg_update_time column (seconds) as AverageUpdateTime
	metrics.AverageUpdateTime = stats.AverageUpdateTime

	// Pending updates are part of the denominator, so the rate rises as they complete
	if metrics.TotalUpdates > 0 {
		metrics.UpdateSuccessRate = float64(metrics.SuccessfulUpdates) / float64(metrics.TotalUpdates) * 100
	}

	// Calculate agent readiness; return the error instead of silently reporting zero counts
	readiness, err := s.queries.GetAgentReadinessStats()
	if err != nil {
		return nil, err
	}
	metrics.ReadyForUpdate = readiness.ReadyCount
	metrics.NotReadyForUpdate = readiness.NotReadyCount

	return metrics, nil
}
2. Database Queries
Update Statistics (aggregator-server/internal/database/queries/updates.go):
-- Update success/failure statistics
-- $1 is the cutoff timestamp passed by GetUpdateStats (time.Now().Add(-timeRange))
SELECT
    COUNT(*) AS total_updates,
    COUNT(CASE WHEN status = 'completed' THEN 1 END) AS successful_updates,
    COUNT(CASE WHEN status = 'failed' THEN 1 END) AS failed_updates,
    COUNT(CASE WHEN status IN ('pending', 'sent') THEN 1 END) AS pending_updates,
    -- seconds; rows with a NULL completed_at are ignored by AVG
    AVG(EXTRACT(EPOCH FROM (completed_at - created_at))) AS avg_update_time
FROM update_events
WHERE created_at > $1;

-- Agent readiness statistics
SELECT
    COUNT(CASE WHEN has_available_updates = true AND last_seen > NOW() - INTERVAL '1 hour' THEN 1 END) AS ready_count,
    COUNT(CASE WHEN has_available_updates = false OR last_seen <= NOW() - INTERVAL '1 hour' THEN 1 END) AS not_ready_count
FROM agents;
3. API Endpoints
Metrics API (aggregator-server/internal/api/handlers/metrics.go):
// GET /api/v1/metrics/updates
func (h *MetricsHandler) GetUpdateMetrics(c *gin.Context) {
	timeRange := c.DefaultQuery("timeRange", "24h")
	// time.ParseDuration accepts units up to hours ("24h", "90m"), not days ("7d")
	duration, err := time.ParseDuration(timeRange)
	if err != nil || duration <= 0 {
		c.JSON(http.StatusBadRequest, gin.H{"error": "Invalid time range"})
		return
	}

	metrics, err := h.metricsService.CalculateUpdateMetrics(duration)
	if err != nil {
		c.JSON(http.StatusInternalServerError, gin.H{"error": err.Error()})
		return
	}

	c.JSON(http.StatusOK, metrics)
}

// GET /api/v1/metrics/updates/timeseries
func (h *MetricsHandler) GetUpdateTimeSeries(c *gin.Context) {
	// Return time series data for charts
}
4. Frontend Dashboard Components
Update Metrics Dashboard (aggregator-web/src/pages/UpdateMetrics.tsx):
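// The API responds with snake_case keys (total_updates, update_success_rate, ...),
// so responses must be mapped to these camelCase fields before use.
interface UpdateMetrics {
  totalUpdates: number;
  successfulUpdates: number;
  failedUpdates: number;
  pendingUpdates: number;
  updateSuccessRate: number;
  readyForUpdate: number;
  notReadyForUpdate: number;
}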
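import React, { useEffect, useState } from "react";

const UpdateMetricsDashboard: React.FC = () => {
  const [metrics, setMetrics] = useState<UpdateMetrics | null>(null);
  const [timeRange, setTimeRange] = useState<string>("24h");

  // Reload metrics whenever the selected time range changes.
  // The request could equally live in aggregator-web/src/lib/api.ts.
  useEffect(() => {
    let cancelled = false;
    fetch(`/api/v1/metrics/updates?timeRange=${encodeURIComponent(timeRange)}`)
      .then((res) => (res.ok ? res.json() : Promise.reject(new Error(`HTTP ${res.status}`))))
      .then((raw) => {
        if (cancelled) return;
        // The Go handler serializes snake_case keys; map them to the camelCase interface.
        setMetrics({
          totalUpdates: raw.total_updates,
          successfulUpdates: raw.successful_updates,
          failedUpdates: raw.failed_updates,
          pendingUpdates: raw.pending_updates,
          updateSuccessRate: raw.update_success_rate,
          readyForUpdate: raw.ready_for_update,
          notReadyForUpdate: raw.not_ready_for_update,
        });
      })
      .catch((err) => console.error("Failed to load update metrics", err));
    return () => {
      cancelled = true;
    };
  }, [timeRange]);

  // getSuccessRateTrend and getFailureTrend are helpers that still need to be written.
  return (
    <div className="update-metrics-dashboard">
      <div className="metrics-header">
        <h2>Update Operations Dashboard</h2>
        <TimeRangeSelector value={timeRange} onChange={setTimeRange} />
      </div>
      <div className="metrics-grid">
        <MetricCard
          title="Success Rate"
          value={metrics?.updateSuccessRate ?? 0}
          unit="%"
          trend={getSuccessRateTrend()}
        />
        <MetricCard
          title="Ready for Updates"
          value={metrics?.readyForUpdate ?? 0}
          unit="agents"
        />
        <MetricCard
          title="Failed Updates"
          value={metrics?.failedUpdates ?? 0}
          trend={getFailureTrend()}
        />
        <MetricCard
          title="Pending Updates"
          value={metrics?.pendingUpdates ?? 0}
        />
      </div>
      <div className="charts-section">
        <UpdateSuccessRateChart timeRange={timeRange} />
        <UpdateVolumeChart timeRange={timeRange} />
        <AgentReadinessChart timeRange={timeRange} />
      </div>
    </div>
  );
};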
Chart Components:
- UpdateSuccessRateChart: Line chart showing success rate over time
- UpdateVolumeChart: Bar chart showing update volume trends
- AgentReadinessChart: Pie chart showing ready vs not-ready agents
- FailureReasonChart: Breakdown of update failure reasons
5. Real-time Updates
WebSocket Integration (optional):
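// Real-time metrics updates
useEffect(() => {
  // WebSocket URLs need a ws:// or wss:// scheme; this assumes API_BASE is an http(s) URL.
  const ws = new WebSocket(`${API_BASE.replace(/^http/, "ws")}/ws/metrics/updates`);
  ws.onmessage = (event) => {
    try {
      // The payload must match the UpdateMetrics interface; snake_case keys
      // from the server need mapping to camelCase first.
      const updatedMetrics = JSON.parse(event.data);
      setMetrics(updatedMetrics);
    } catch (err) {
      console.error("Ignoring malformed metrics message", err);
    }
  };
  return () => ws.close();
}, [timeRange]);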
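On the server side, whatever transport ends up serving `/ws/metrics/updates` needs a way to push the latest metrics to every connected client. The sketch below shows one transport-agnostic fan-out pattern using only the standard library. `Broadcaster` and `MetricsSnapshot` are hypothetical names; the spec does not name a WebSocket library, so the connection handling itself is not shown.

```go
package main

import "sync"

// MetricsSnapshot is a hypothetical, trimmed-down payload for this sketch.
type MetricsSnapshot struct {
	TotalUpdates  int64 `json:"total_updates"`
	FailedUpdates int64 `json:"failed_updates"`
}

// Broadcaster fans out the latest snapshot to every subscriber.
type Broadcaster struct {
	mu   sync.Mutex
	subs map[chan MetricsSnapshot]struct{}
}

// NewBroadcaster returns a Broadcaster with no subscribers.
func NewBroadcaster() *Broadcaster {
	return &Broadcaster{subs: make(map[chan MetricsSnapshot]struct{})}
}

// Subscribe registers a listener and returns its channel plus an unsubscribe function.
func (b *Broadcaster) Subscribe() (<-chan MetricsSnapshot, func()) {
	ch := make(chan MetricsSnapshot, 1) // buffer of one: only the newest snapshot matters
	b.mu.Lock()
	b.subs[ch] = struct{}{}
	b.mu.Unlock()
	return ch, func() {
		b.mu.Lock()
		delete(b.subs, ch)
		b.mu.Unlock()
	}
}

// Publish delivers snap to all subscribers without blocking on slow ones:
// an unread, stale snapshot is dropped and replaced by the new one.
func (b *Broadcaster) Publish(snap MetricsSnapshot) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for ch := range b.subs {
		select {
		case <-ch: // discard the unread snapshot, if any
		default:
		}
		ch <- snap // cannot block: the buffer is empty and only Publish sends
	}
}
```

Each connection handler would call `Subscribe`, write every received snapshot to its client, and call the unsubscribe function when the connection closes. Dropping stale snapshots suits a dashboard, where only the current state matters.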
Definition of Done
- ✅ Update metrics calculation service implemented
- ✅ RESTful API endpoints for metrics data
- ✅ Comprehensive dashboard with key metrics
- ✅ Interactive charts showing trends and analytics
- ✅ Real-time or near real-time updates
- ✅ Filtering by time range, agent groups, update types
- ✅ Export functionality for reports
- ✅ Mobile-responsive design
- ✅ Performance optimization for large datasets
Test Plan
Unit Tests
- Metrics calculation accuracy
- Time series data generation
- API response formatting
Integration Tests
- End-to-end metrics flow
- Database query performance
- Real-time update functionality
Performance Tests
- Dashboard load times with large datasets
- API response times under load
- Chart rendering performance
User Acceptance Tests
- Administrators can easily identify update issues
- Dashboard provides actionable insights
- Interface is intuitive and responsive
Files to Modify
- aggregator-server/internal/services/update_metrics.go - New service
- aggregator-server/internal/database/queries/metrics.go - New queries
- aggregator-server/internal/api/handlers/metrics.go - New handlers
- aggregator-web/src/pages/UpdateMetrics.tsx - New dashboard page
- aggregator-web/src/components/MetricCard.tsx - Metric display component
- aggregator-web/src/components/charts/ - Chart components
- aggregator-web/src/lib/api.ts - API integration
Metrics Categories
1. Success Metrics
- Update success rate percentage
- Successful update count
- Average update completion time
- Agent readiness percentage
2. Failure Metrics
- Failed update count
- Failure rate percentage
- Common failure reasons
- Rollback frequency
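The "common failure reasons" metric amounts to a frequency count over failed events. As a sketch, and assuming each failed update carries a reason string (the spec does not say where failure reasons are stored), the breakdown could be computed as follows. `ReasonCount` and `TopFailureReasons` are hypothetical names.

```go
package main

import (
	"sort"
	"strings"
)

// ReasonCount pairs a failure reason with how often it occurred.
type ReasonCount struct {
	Reason string
	Count  int64
}

// TopFailureReasons counts the given reasons and returns them most frequent
// first, breaking ties alphabetically. Blank reasons are grouped under
// "unknown". A limit of zero or less returns every reason.
func TopFailureReasons(reasons []string, limit int) []ReasonCount {
	counts := make(map[string]int64)
	for _, r := range reasons {
		r = strings.TrimSpace(r)
		if r == "" {
			r = "unknown"
		}
		counts[r]++
	}
	out := make([]ReasonCount, 0, len(counts))
	for reason, n := range counts {
		out = append(out, ReasonCount{Reason: reason, Count: n})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Count != out[j].Count {
			return out[i].Count > out[j].Count
		}
		return out[i].Reason < out[j].Reason
	})
	if limit > 0 && len(out) > limit {
		out = out[:limit]
	}
	return out
}
```

The deterministic tie-break keeps the chart order stable between refreshes.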
3. Performance Metrics
- Update queue length
- Average processing time
- Agent response time
- Server load during updates
4. Agent Metrics
- Agents ready for updates
- Agents with available updates
- Agents requiring manual intervention
- Update distribution by agent version
Estimated Effort
- Development: 20-24 hours
- Testing: 12-16 hours
- Review: 6-8 hours
- Design/UX: 8-10 hours
Dependencies
- Existing update events database
- Agent status tracking system
- Chart library (Chart.js, D3.js, etc.)
- WebSocket infrastructure (for real-time updates)
Risk Assessment
Low-Medium Risk - Enhancement that creates new functionality without affecting existing systems. Performance considerations for large datasets need attention.
Implementation Phases
Phase 1: Core Metrics API
- Implement metrics calculation service
- Create database queries for statistics
- Build REST API endpoints
Phase 2: Dashboard UI
- Create basic dashboard layout
- Implement metric cards and charts
- Add time range filtering
Phase 3: Advanced Features
- Real-time updates
- Export functionality
- Alert thresholds
- Advanced filtering and search
Future Enhancements
- Predictive Analytics: Predict update success based on agent patterns
- Automated Recommendations: Suggest optimal update timing
- Integration with APM: Correlate update performance with system metrics
- Custom Dashboards: User-configurable metric views
- SLA Monitoring: Track update performance against service level agreements