Redflag/docs/3_BACKLOG/P3-006_Structured-Logging-System.md

# Structured Logging System
**Priority**: P3 (Enhancement)
**Source Reference**: From todos.md line 54
**Status**: Ready for Implementation
## Problem Statement
The current logging system lacks structure, correlation IDs, and centralized aggregation capabilities. This makes it difficult to trace operations across the distributed system, debug issues, and perform effective log analysis for monitoring and troubleshooting.
## Feature Description
Implement a comprehensive structured logging system with JSON format logs, correlation IDs for request tracing, centralized log aggregation, and performance metrics collection to improve observability and debugging capabilities.
## Acceptance Criteria
1. JSON-formatted structured logs throughout the application
2. Correlation IDs for tracing requests across agent-server communication
3. Centralized log aggregation and storage
4. Performance metrics collection and reporting
5. Log levels (debug, info, warn, error, fatal) with appropriate filtering
6. Structured logging for both server and agent components
7. Log retention policies and rotation
8. Integration with external logging systems (optional)
## Technical Approach
### 1. Structured Logging Library
**Custom Logger Implementation** (`aggregator-common/internal/logging/`):
```go
package logging

import (
	"context"
	"time"

	"github.com/sirupsen/logrus"
)

type CorrelationID string

type StructuredLogger struct {
	logger    *logrus.Logger
	component string
}

type LogFields struct {
	CorrelationID string                 `json:"correlation_id"`
	Component     string                 `json:"component"`
	Operation     string                 `json:"operation"`
	UserID        string                 `json:"user_id,omitempty"`
	AgentID       string                 `json:"agent_id,omitempty"`
	Duration      time.Duration          `json:"duration,omitempty"`
	Error         string                 `json:"error,omitempty"`
	RequestID     string                 `json:"request_id,omitempty"`
	Metadata      map[string]interface{} `json:"metadata,omitempty"`
}

func NewStructuredLogger(component string) *StructuredLogger {
	logger := logrus.New()
	logger.SetFormatter(&logrus.JSONFormatter{
		TimestampFormat: time.RFC3339,
	})
	return &StructuredLogger{logger: logger, component: component}
}

func (l *StructuredLogger) WithContext(ctx context.Context) *logrus.Entry {
	fields := logrus.Fields{
		"component": l.component,
	}
	// The zero CorrelationID value serves as the context key; the same
	// key value must be used when the ID is stored.
	if correlationID := ctx.Value(CorrelationID("")); correlationID != nil {
		fields["correlation_id"] = correlationID
	}
	return l.logger.WithFields(fields)
}

func (l *StructuredLogger) Info(ctx context.Context, operation string, fields LogFields) {
	entry := l.WithContext(ctx)
	entry = entry.WithField("operation", operation)
	// convertFields maps non-empty LogFields members onto logrus.Fields
	// (implementation omitted here).
	entry = entry.WithFields(l.convertFields(fields))
	entry.Info(operation)
}
```
### 2. Correlation ID Management
**Middleware for HTTP Requests** (`aggregator-server/internal/middleware/`):
```go
func CorrelationIDMiddleware() gin.HandlerFunc {
	return func(c *gin.Context) {
		correlationID := c.GetHeader("X-Correlation-ID")
		if correlationID == "" {
			correlationID = generateCorrelationID()
		}
		c.Header("X-Correlation-ID", correlationID)
		c.Set("correlation_id", correlationID)
		// context.WithValue takes a key and a value; the zero
		// CorrelationID is the key the logging package reads.
		ctx := context.WithValue(c.Request.Context(), logging.CorrelationID(""), correlationID)
		c.Request = c.Request.WithContext(ctx)
		c.Next()
	}
}

func generateCorrelationID() string {
	return uuid.New().String()
}
```
**Agent-Side Correlation** (`aggregator-agent/internal/communication/`):
```go
func (c *Client) makeRequest(method, endpoint string, body interface{}) (*http.Response, error) {
	correlationID := c.generateCorrelationID()
	ctx := context.WithValue(context.Background(), logging.CorrelationID(""), correlationID)

	// Log request start. LogFields has no dedicated method/endpoint
	// members, so request details travel in Metadata.
	c.logger.Info(ctx, "api_request_start", logging.LogFields{
		AgentID: c.agentID,
		Metadata: map[string]interface{}{
			"method":   method,
			"endpoint": endpoint,
		},
	})

	req, err := http.NewRequest(method, endpoint, nil) // TODO: encode body when non-nil
	if err != nil {
		return nil, err
	}
	req.Header.Set("X-Correlation-ID", correlationID)
	req.Header.Set("Authorization", "Bearer "+c.token)
	// ... rest of request handling
}
```
### 3. Centralized Log Storage
**Log Aggregation Service** (`aggregator-server/internal/services/log_aggregation.go`):
```go
type LogEntry struct {
	ID            uuid.UUID              `json:"id" db:"id"`
	Timestamp     time.Time              `json:"timestamp" db:"timestamp"`
	Level         string                 `json:"level" db:"level"`
	Component     string                 `json:"component" db:"component"`
	CorrelationID string                 `json:"correlation_id" db:"correlation_id"`
	Message       string                 `json:"message" db:"message"`
	AgentID       *uuid.UUID             `json:"agent_id,omitempty" db:"agent_id"`
	UserID        *uuid.UUID             `json:"user_id,omitempty" db:"user_id"`
	Operation     string                 `json:"operation" db:"operation"`
	Duration      *int                   `json:"duration,omitempty" db:"duration"`
	Error         *string                `json:"error,omitempty" db:"error"`
	Metadata      map[string]interface{} `json:"metadata" db:"metadata"`
}

type LogAggregationService struct {
	db        *sql.DB
	logBuffer chan LogEntry
	batchSize int
}

func (s *LogAggregationService) ProcessLogs(ctx context.Context) {
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()
	batch := make([]LogEntry, 0, s.batchSize)
	for {
		select {
		case entry := <-s.logBuffer:
			batch = append(batch, entry)
			if len(batch) >= s.batchSize {
				s.flushBatch(ctx, batch)
				batch = batch[:0]
			}
		case <-ticker.C:
			if len(batch) > 0 {
				s.flushBatch(ctx, batch)
				batch = batch[:0]
			}
		case <-ctx.Done():
			// Flush remaining logs before exit
			if len(batch) > 0 {
				s.flushBatch(ctx, batch)
			}
			return
		}
	}
}
```
### 4. Performance Metrics Collection
**Metrics Service** (`aggregator-server/internal/services/metrics.go`):
```go
type PerformanceMetrics struct {
	RequestCount      int64         `json:"request_count"`
	ErrorCount        int64         `json:"error_count"`
	AverageLatency    time.Duration `json:"average_latency"`
	P95Latency        time.Duration `json:"p95_latency"`
	P99Latency        time.Duration `json:"p99_latency"`
	ActiveConnections int64         `json:"active_connections"`
	DatabaseQueries   int64         `json:"database_queries"`
}

type MetricsService struct {
	requestLatencies []time.Duration
	startTime        time.Time
	mutex            sync.Mutex
}

func (m *MetricsService) RecordRequest(duration time.Duration) {
	m.mutex.Lock()
	defer m.mutex.Unlock()
	m.requestLatencies = append(m.requestLatencies, duration)
	// Keep only the last 10000 measurements
	if len(m.requestLatencies) > 10000 {
		m.requestLatencies = m.requestLatencies[1:]
	}
}

func (m *MetricsService) GetMetrics() PerformanceMetrics {
	m.mutex.Lock()
	defer m.mutex.Unlock()
	metrics := PerformanceMetrics{
		RequestCount: int64(len(m.requestLatencies)),
	}
	if len(m.requestLatencies) > 0 {
		// Sort a copy so the sliding window keeps insertion order.
		latencies := append([]time.Duration(nil), m.requestLatencies...)
		sort.Slice(latencies, func(i, j int) bool {
			return latencies[i] < latencies[j]
		})
		var total time.Duration
		for _, d := range latencies {
			total += d
		}
		metrics.AverageLatency = total / time.Duration(len(latencies))
		metrics.P95Latency = latencies[int(float64(len(latencies))*0.95)]
		metrics.P99Latency = latencies[int(float64(len(latencies))*0.99)]
	}
	return metrics
}
```
### 5. Logging Integration Points
**HTTP Request Logging** (`aggregator-server/internal/middleware/`):
```go
func RequestLoggingMiddleware(logger *logging.StructuredLogger) gin.HandlerFunc {
	return func(c *gin.Context) {
		start := time.Now()
		c.Next()
		duration := time.Since(start)
		// Request details with no dedicated LogFields member go in Metadata.
		fields := logging.LogFields{
			CorrelationID: c.GetString("correlation_id"),
			Duration:      duration,
			Metadata: map[string]interface{}{
				"method":     c.Request.Method,
				"path":       c.Request.URL.Path,
				"status":     c.Writer.Status(),
				"client_ip":  c.ClientIP(),
				"user_agent": c.Request.UserAgent(),
			},
		}
		if c.Writer.Status() >= 400 {
			fields.Error = c.Errors.String()
			logger.Error(c.Request.Context(), "http_request_error", fields)
		} else {
			logger.Info(c.Request.Context(), "http_request", fields)
		}
	}
}
```
**Database Query Logging**:
```go
func (q *Queries) LogQuery(query string, args []interface{}, duration time.Duration, err error) {
	fields := logging.LogFields{
		Operation: "database_query",
		Duration:  duration,
		// Query text and arguments ride in Metadata; LogFields has no
		// dedicated members for them.
		Metadata: map[string]interface{}{
			"query": query,
			"args":  args,
		},
	}
	if err != nil {
		fields.Error = err.Error()
		q.logger.Error(context.Background(), "database_query_error", fields)
	} else {
		q.logger.Debug(context.Background(), "database_query", fields)
	}
}
```
### 6. Log Retention and Rotation
**Log Management Service**:
```go
type LogRetentionConfig struct {
	RetentionDays   int    `json:"retention_days"`
	MaxLogSize      int64  `json:"max_log_size_bytes"`
	Compression     bool   `json:"compression_enabled"`
	ArchiveLocation string `json:"archive_location"`
}

func (s *LogService) CleanupOldLogs() error {
	cutoff := time.Now().AddDate(0, 0, -s.config.RetentionDays)
	_, err := s.db.Exec(`
		DELETE FROM system_logs
		WHERE timestamp < $1
	`, cutoff)
	return err
}
```
## Definition of Done
- ✅ Structured JSON logging implemented across all components
- ✅ Correlation ID propagation for end-to-end request tracing
- ✅ Centralized log storage with efficient buffering
- ✅ Performance metrics collection and reporting
- ✅ Log level filtering and configuration
- ✅ Log retention and rotation policies
- ✅ Integration with existing HTTP, database, and agent communication layers
- ✅ Dashboard or interface for log viewing and searching
## Test Plan
1. **Unit Tests**
- Structured log format validation
- Correlation ID propagation accuracy
- Log filtering and routing
2. **Integration Tests**
- End-to-end request tracing
- Agent-server communication logging
- Database query logging accuracy
3. **Performance Tests**
- Logging overhead under load
- Log aggregation throughput
- Buffer management efficiency
4. **Retention Tests**
- Log rotation functionality
- Archive creation and compression
- Cleanup policy enforcement
## Files to Modify
- `aggregator-common/internal/logging/` - New logging package
- `aggregator-server/internal/middleware/` - Add correlation and request logging
- `aggregator-server/internal/services/` - Add metrics and log aggregation
- `aggregator-agent/internal/` - Add structured logging to agent
- Database schema - Add system_logs table
- Configuration - Add logging settings
## Log Schema
```sql
CREATE TABLE system_logs (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    timestamp TIMESTAMP WITH TIME ZONE NOT NULL DEFAULT NOW(),
    level VARCHAR(20) NOT NULL,
    component VARCHAR(100) NOT NULL,
    correlation_id VARCHAR(100),
    message TEXT NOT NULL,
    agent_id UUID REFERENCES agents(id),
    user_id UUID REFERENCES users(id),
    operation VARCHAR(100),
    duration INTEGER, -- milliseconds
    error TEXT,
    metadata JSONB
);

-- Inline INDEX clauses are MySQL syntax; PostgreSQL uses separate statements.
CREATE INDEX idx_timestamp ON system_logs (timestamp);
CREATE INDEX idx_correlation_id ON system_logs (correlation_id);
CREATE INDEX idx_component ON system_logs (component);
CREATE INDEX idx_level ON system_logs (level);
```
## Estimated Effort
- **Development**: 24-30 hours
- **Testing**: 16-20 hours
- **Review**: 8-10 hours
- **Infrastructure Setup**: 4-6 hours
## Dependencies
- Logrus or similar structured logging library
- Database storage for log aggregation
- Configuration management for log settings
## Risk Assessment
**Low-Medium Risk** - Enhancement that adds comprehensive logging. Main considerations are performance impact and log volume management. Proper buffering and async processing will mitigate risks.
## Implementation Phases
### Phase 1: Core Logging Infrastructure
1. Implement structured logger
2. Add correlation ID middleware
3. Integrate with HTTP layer
### Phase 2: Agent Logging
1. Add structured logging to agent
2. Implement correlation ID propagation
3. Add communication layer logging
### Phase 3: Log Aggregation
1. Implement log buffering and storage
2. Add performance metrics collection
3. Create log retention system
### Phase 4: Dashboard and Monitoring
1. Log viewing interface
2. Search and filtering capabilities
3. Metrics dashboard integration
## Future Enhancements
1. **External Log Integration**: Elasticsearch, Splunk, etc.
2. **Real-time Log Streaming**: WebSocket-based live log viewing
3. **Log Analysis**: Automated log analysis and anomaly detection
4. **Compliance Reporting**: SOX, GDPR compliance reporting
5. **Distributed Tracing**: Integration with OpenTelemetry