feat: add resilience and reliability features for agent subsystems
Added circuit breakers with configurable timeouts for all subsystems (APT, DNF, Docker, Windows, Winget, Storage). Replaces cron-based scheduler with priority queue that should scale beyond 1000+ agents if your homelab is that big. Command acknowledgment system ensures results aren't lost on network failures or restarts. Agent tracks pending acknowledgments with persistent state and automatic retry. - Circuit breakers: 3 failures in 1min opens circuit, 30s cooldown - Per-subsystem timeouts: 30s-10min depending on scanner - Priority queue scheduler: O(log n), worker pool, jitter, backpressure - Acknowledgments: at-least-once delivery, max 10 retries over 24h - All tests passing (26/26)
This commit is contained in:
@@ -23,8 +23,9 @@ type AgentCommand struct {
|
||||
|
||||
// CommandsResponse is returned when an agent checks in for commands
|
||||
type CommandsResponse struct {
|
||||
Commands []CommandItem `json:"commands"`
|
||||
RapidPolling *RapidPollingConfig `json:"rapid_polling,omitempty"`
|
||||
Commands []CommandItem `json:"commands"`
|
||||
RapidPolling *RapidPollingConfig `json:"rapid_polling,omitempty"`
|
||||
AcknowledgedIDs []string `json:"acknowledged_ids,omitempty"` // IDs server has received
|
||||
}
|
||||
|
||||
// RapidPollingConfig contains rapid polling configuration for the agent
|
||||
|
||||
Reference in New Issue
Block a user