Redflag/docs/A2_Replay_Attack_Audit.md

# A-2 Command Replay Attack Audit
**Date**: 2026-03-28
**Branch**: unstabledeveloper
**Scope**: Audit-only — no implementation changes

---

## 1. Signed Command Payload Analysis

### What fields are included in the signed message

**New format** (when `cmd.SignedAt != nil`):
```
{cmd.ID}:{cmd.CommandType}:{sha256(json(cmd.Params))}:{cmd.SignedAt.Unix()}
```
Source: `aggregator-server/internal/services/signing.go:361`, `aggregator-agent/internal/crypto/verification.go:71`

**Old format** (backward compat, when `cmd.SignedAt == nil`):
```
{cmd.ID}:{cmd.CommandType}:{sha256(json(cmd.Params))}
```
Source: `aggregator-agent/internal/crypto/verification.go:55`
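
To make the two formats concrete, here is a minimal sketch of how such a message could be assembled. The `Command` type, `unixPtr`, and `buildSignedMessage` are hypothetical names for illustration, and hex encoding of the digest is an assumption; the real code lives in `signing.go` and `verification.go`.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"time"
)

// Command is a hypothetical stand-in holding only the fields that feed the signature.
type Command struct {
	ID          string
	CommandType string
	Params      map[string]interface{}
	SignedAt    *time.Time
}

// unixPtr is a small helper for building SignedAt values in examples.
func unixPtr(sec int64) *time.Time {
	t := time.Unix(sec, 0)
	return &t
}

// buildSignedMessage assembles the string to be signed. The timestamp suffix
// is appended only when SignedAt is set, matching the two formats above.
func buildSignedMessage(cmd Command) (string, error) {
	raw, err := json.Marshal(cmd.Params) // encoding/json sorts map keys
	if err != nil {
		return "", fmt.Errorf("marshal params: %w", err)
	}
	sum := sha256.Sum256(raw)
	// Assumption: the digest is hex-encoded before being embedded.
	msg := fmt.Sprintf("%s:%s:%s", cmd.ID, cmd.CommandType, hex.EncodeToString(sum[:]))
	if cmd.SignedAt != nil {
		msg = fmt.Sprintf("%s:%d", msg, cmd.SignedAt.Unix())
	}
	return msg, nil
}

func main() {
	cmd := Command{ID: "cmd-1", CommandType: "reboot", Params: map[string]interface{}{}}
	oldMsg, _ := buildSignedMessage(cmd)
	cmd.SignedAt = unixPtr(1774656000)
	newMsg, _ := buildSignedMessage(cmd)
	fmt.Println(oldMsg)
	fmt.Println(newMsg)
}
```

The new-format message is the old-format message with `:{unix_seconds}` appended, so the two are never interchangeable under the same signature.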

### What is and is NOT in the signed payload

| Field | In signed payload? | Notes |
|---|---|---|
| `cmd.ID` (UUID) | YES | Unique per-command identifier |
| `cmd.CommandType` | YES | e.g. `install_updates`, `reboot` |
| `sha256(params)` | YES | Hash of full params JSON |
| `signed_at` timestamp | YES (new format only) | Unix seconds |
| `cmd.AgentID` | **NO** | Absent from signature |
| `cmd.Source` | **NO** | Absent from signature |
| `cmd.Status` | **NO** | Absent from signature |
| Nonce | **NO** | Not used in command signing |

**FINDING F-1 (HIGH)**: `agent_id` is not included in the signed payload. A valid signed command is not cryptographically bound to a specific agent. The only uniqueness guarantee is the command UUID — if an attacker could inject a captured command into a different agent's command queue, the signature would verify correctly.
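
The effect of F-1 can be illustrated with a short sketch that signs one command and checks it from the point of view of a second agent. `currentMessage` mirrors the new-format layout; `boundMessage` is a hypothetical layout that prefixes the agent ID and is shown only as one possible way to bind a command to its recipient. Key handling and all identifiers here are assumptions, not the project's code.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"fmt"
)

// currentMessage mirrors the new-format layout; the params hash is passed in
// as an opaque string to keep the sketch short.
func currentMessage(id, cmdType, paramsHash string, signedAtUnix int64) []byte {
	return []byte(fmt.Sprintf("%s:%s:%s:%d", id, cmdType, paramsHash, signedAtUnix))
}

// boundMessage is a hypothetical layout that prefixes the receiving agent's ID.
func boundMessage(agentID, id, cmdType, paramsHash string, signedAtUnix int64) []byte {
	return []byte(fmt.Sprintf("%s:%s", agentID, currentMessage(id, cmdType, paramsHash, signedAtUnix)))
}

// crossAgentReplay signs one command intended for agent A and reports whether
// agent B would accept it under each layout.
func crossAgentReplay() (acceptedCurrent, acceptedBound bool, err error) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		return false, false, err
	}
	id, typ, hash, ts := "cmd-1", "reboot", "deadbeef", int64(1774656000)

	// Current layout: agent B rebuilds the identical message, so the signature verifies.
	sig := ed25519.Sign(priv, currentMessage(id, typ, hash, ts))
	acceptedCurrent = ed25519.Verify(pub, currentMessage(id, typ, hash, ts), sig)

	// Bound layout: agent B rebuilds the message with its own ID, so verification fails.
	sig = ed25519.Sign(priv, boundMessage("agent-A", id, typ, hash, ts))
	acceptedBound = ed25519.Verify(pub, boundMessage("agent-B", id, typ, hash, ts), sig)
	return acceptedCurrent, acceptedBound, nil
}

func main() {
	cur, bound, err := crossAgentReplay()
	if err != nil {
		panic(err)
	}
	fmt.Printf("current layout accepted on agent B: %v\n", cur)
	fmt.Printf("agent-bound layout accepted on agent B: %v\n", bound)
}
```

Any change to the signed layout would need to be versioned so that agents and servers agree on which format a given signature covers.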

---

## 2. Nonce Mechanism

### What the nonce looks like

The `SigningService` in `aggregator-server/internal/services/signing.go` has two nonce methods:

```go
func (s *SigningService) SignNonce(nonceUUID uuid.UUID, timestamp time.Time) (string, error)
func (s *SigningService) VerifyNonce(nonceUUID uuid.UUID, timestamp time.Time, signatureHex string, maxAge time.Duration) (bool, error)
```

Nonce format: `"{uuid}:{unix_timestamp}"` — signed with Ed25519.

### Where nonces are used

**Nonces are NOT used in command signing or command verification.**

The `SignNonce`/`VerifyNonce` methods exist exclusively for the agent update package flow (preventing replay of update download requests). They are completely disconnected from the command replay protection path.

The agent's `ProcessCommand` function (`command_handler.go:101`) calls `VerifyCommandWithTimestamp` or `VerifyCommand`. Neither of these checks any nonce. There is no nonce storage, no nonce tracking map, and no nonce field in `AgentCommand` or `CommandItem`.

**FINDING F-2 (CRITICAL)**: There is no nonce in the command signing path. The original issue comment ("nonce-only replay protection") overstates the protection: commands carry no nonce at all, and commands signed with the old format have no replay protection of any kind.

---

## 3. Verification Function Behaviour

### `VerifyCommand` (old format, no timestamp)

Source: `aggregator-agent/internal/crypto/verification.go:25`

Checks:
1. Signature field is non-empty
2. Signature is valid hex, correct length (64 bytes)
3. Ed25519 signature over `{id}:{type}:{sha256(params)}` verifies against public key

Returns: `error` (nil = pass). **No time check. No nonce check.**

**FINDING F-3 (CRITICAL)**: Commands signed with the old format (no `signed_at`) are valid indefinitely. A captured signature can be replayed at any time in the future — there is no expiry mechanism for old-format commands.
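
One way to bound the exposure in F-3 is a dated cutoff after which commands without `signed_at` are refused. The sketch below is illustrative only: `chooseFormat`, `legacyCutoff`, and the cutoff date are hypothetical and do not exist in the codebase.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errLegacyRejected = errors.New("command has no signed_at and the legacy grace period has ended")

// legacyCutoff is a hypothetical date after which old-format commands are refused.
var legacyCutoff = time.Date(2026, 6, 1, 0, 0, 0, 0, time.UTC)

// chooseFormat decides which verification path a command may take.
// It returns "timestamped" or "legacy", or an error once the grace period is over.
func chooseFormat(signedAt *time.Time, now, cutoff time.Time) (string, error) {
	if signedAt != nil {
		return "timestamped", nil
	}
	if now.Before(cutoff) {
		return "legacy", nil
	}
	return "", errLegacyRejected
}

func main() {
	before := legacyCutoff.Add(-time.Hour)
	after := legacyCutoff.Add(time.Hour)
	for _, now := range []time.Time{before, after} {
		format, err := chooseFormat(nil, now, legacyCutoff)
		fmt.Printf("now=%s format=%q err=%v\n", now.Format(time.RFC3339), format, err)
	}
}
```

A cutoff based on the agent's clock only limits how long the fallback stays enabled; until it passes, old-format signatures remain replayable.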

### `VerifyCommandWithTimestamp` (new format)

Source: `aggregator-agent/internal/crypto/verification.go:85`

Checks:
1. If `cmd.SignedAt == nil` → falls back to `VerifyCommand()` (see F-3)
2. `age = now.Sub(*cmd.SignedAt)` must satisfy: `age <= 24h` AND `age >= -5min`
3. Signature valid over `{id}:{type}:{sha256(params)}:{unix_timestamp}`

**FINDING F-4 (HIGH)**: 24-hour replay window. A captured signed command can be replayed for up to 24 hours after it was signed. The window is the default `commandMaxAge = 24 * time.Hour`, defined in `command_handler.go:21`.
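
The age check in step 2 can be sketched as a small pure function. `checkSignedAt`, `clockSkew`, `referenceNow`, and the error values are hypothetical names; only the 24-hour and 5-minute bounds come from the audit above.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const (
	commandMaxAge = 24 * time.Hour  // maximum accepted age, as reported above
	clockSkew     = 5 * time.Minute // tolerated future offset, as reported above
)

var (
	errTooOld     = errors.New("command signature too old")
	errFromFuture = errors.New("command signed too far in the future")
)

// referenceNow fixes "now" so the example is deterministic.
var referenceNow = time.Date(2026, 3, 28, 12, 0, 0, 0, time.UTC)

// checkSignedAt applies the documented window: -skew <= age <= maxAge.
func checkSignedAt(signedAt, now time.Time, maxAge, skew time.Duration) error {
	age := now.Sub(signedAt)
	switch {
	case age > maxAge:
		return errTooOld
	case age < -skew:
		return errFromFuture
	}
	return nil
}

func main() {
	offsets := []time.Duration{-25 * time.Hour, -23 * time.Hour, 4 * time.Minute, 6 * time.Minute}
	for _, offset := range offsets {
		err := checkSignedAt(referenceNow.Add(offset), referenceNow, commandMaxAge, clockSkew)
		fmt.Printf("signed at now%+v: err=%v\n", offset, err)
	}
}
```

Every command that passes this check is replayable until its age exceeds `maxAge`, which is why the size of the window matters.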

---

## 4. Command Creation Flow

### Full path: Dashboard approves install → command signed → stored

```
POST /updates/:id/install
  → UpdateHandler.InstallUpdate()                    [handlers/updates.go:459]
      → models.AgentCommand{...}                     [no signing yet]
      → h.agentHandler.signAndCreateCommand(cmd)     [agents.go:49]
          → signingService.SignCommand(cmd)           [services/signing.go:345]
              → cmd.SignedAt = &now                  [side-effect]
              → cmd.KeyID = GetCurrentKeyID()        [side-effect]
              → message = "{id}:{type}:{hash}:{ts}"
              → ed25519.Sign(privateKey, message)
              → returns hex signature
          → cmd.Signature = signature
          → commandQueries.CreateCommand(cmd)        [queries/commands.go:22]
              → INSERT INTO agent_commands (... key_id, signed_at ...)
```

The `ConfirmDependencies` and `ReportDependencies` (auto-install) handlers follow identical paths through `signAndCreateCommand`.

### RetryCommand path (DOES NOT RE-SIGN)

```
POST /commands/:id/retry
  → UpdateHandler.RetryCommand()                     [handlers/updates.go:779]
      → commandQueries.RetryCommand(id)              [queries/commands.go:189]
          → newCommand = AgentCommand{               [copies Params, new UUID]
                Signature: "",                       [EMPTY — not re-signed]
                SignedAt: nil,                       [nil — no timestamp]
                KeyID: "",                           [empty — no key reference]
            }
          → q.CreateCommand(newCommand)              [stored unsigned]
```

**FINDING F-5 (CRITICAL)**: `RetryCommand` creates a new command without calling `signAndCreateCommand`. The retried command has `Signature = ""`, `SignedAt = nil`, `KeyID = ""`. In strict enforcement mode, the agent rejects any command with an empty signature. This means **the retry feature is entirely broken when command signing is enabled in strict mode**. The HTTP handler in `updates.go:779` returns 200 OK and the command is stored in the DB, but the agent will reject it every time it polls.
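
A retry that goes through signing could look roughly like the sketch below. The `AgentCommand` struct is trimmed, and `signer`, `retryCommand`, and `demoRetry` are hypothetical stand-ins; in the real server the equivalent step would be routing the retried command through `signAndCreateCommand`.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"time"
)

// AgentCommand is a trimmed, hypothetical version of the server model.
type AgentCommand struct {
	ID, CommandType, ParamsHash string
	Signature, KeyID            string
	SignedAt                    *time.Time
}

// signer stands in for the server's signing service.
type signer struct {
	keyID string
	priv  ed25519.PrivateKey
}

// sign fills in SignedAt, KeyID and Signature using the new message format.
func (s signer) sign(cmd *AgentCommand, now time.Time) {
	cmd.SignedAt = &now
	cmd.KeyID = s.keyID
	msg := fmt.Sprintf("%s:%s:%s:%d", cmd.ID, cmd.CommandType, cmd.ParamsHash, now.Unix())
	cmd.Signature = hex.EncodeToString(ed25519.Sign(s.priv, []byte(msg)))
}

// retryCommand copies the original under a new ID and signs the copy. The old
// signature cannot be reused because the ID is part of the signed message.
func retryCommand(s signer, original AgentCommand, newID string, now time.Time) AgentCommand {
	retry := AgentCommand{ID: newID, CommandType: original.CommandType, ParamsHash: original.ParamsHash}
	s.sign(&retry, now)
	return retry
}

// demoRetry signs a command, retries it an hour later, and returns both.
func demoRetry() (original, retry AgentCommand, err error) {
	_, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		return original, retry, err
	}
	s := signer{keyID: "key-1", priv: priv}
	created := time.Unix(1774656000, 0)
	original = AgentCommand{ID: "cmd-1", CommandType: "install_updates", ParamsHash: "deadbeef"}
	s.sign(&original, created)
	retry = retryCommand(s, original, "cmd-2", created.Add(time.Hour))
	return original, retry, nil
}

func main() {
	original, retry, err := demoRetry()
	if err != nil {
		panic(err)
	}
	fmt.Println("signatures differ:", original.Signature != retry.Signature)
	fmt.Println("retry has signed_at and key_id:", retry.SignedAt != nil && retry.KeyID != "")
}
```

Signing the retry with a fresh timestamp also restarts its validity window, so a retried command is never rejected for age on account of the original's `signed_at`.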

---

## 5. Agent Command Fetch and Execution Flow

### Full path: Agent polls → receives commands → verifies → executes

```
GET /api/v1/agents/{id}/commands
  → AgentHandler.GetCommands()                       [handlers/agents.go:204]
      → commandQueries.GetPendingCommands(agentID)   [status = 'pending' only]
      → commandQueries.GetStuckCommands(agentID, 5m) [sent > 5 min, not completed]
      → allCommands = pending + stuck
      → for each cmd: MarkCommandSent(cmd.ID)        [transitions pending → sent]
      → returns CommandItem{ID, Type, Params, Signature, KeyID, SignedAt}
```

Agent-side:
```
main.go:875: apiClient.GetCommands(cfg.AgentID, metrics)
main.go:928: for _, cmd := range commands {
main.go:932:     commandHandler.ProcessCommand(cmd, cfg, cfg.AgentID)
main.go:954:     switch cmd.Type { ... execute ... }
```

### What `GetPendingCommands` returns

```sql
SELECT * FROM agent_commands
WHERE agent_id = $1 AND status = 'pending'
ORDER BY created_at ASC
LIMIT 100
```

The query has no age predicate such as `AND created_at > NOW() - INTERVAL '24 hours'`. A command created 30 days ago that is still `pending` (for example, because it was never successfully sent) would be returned. If it carries an old-format signature (no `signed_at`), the agent would execute it with no time check.

**FINDING F-6 (HIGH)**: The server-side command queue has no TTL filter. Old pending commands are delivered indefinitely. Combined with old-format signing (F-3), this means commands can persist in the queue and be executed arbitrarily long after creation.
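
A server-side age filter could be sketched as follows. `pendingWithTTLQuery` is the query above with a hypothetical age predicate added, and `deliverable` applies the same rule in memory so the behaviour can be checked without a database. `queueTTL`, `queuedCommand`, and `referenceNow` are assumptions made for this example.

```go
package main

import (
	"fmt"
	"time"
)

// queueTTL is a hypothetical maximum age for undelivered commands.
const queueTTL = 24 * time.Hour

// pendingWithTTLQuery is the existing query with an age predicate added.
const pendingWithTTLQuery = `
SELECT * FROM agent_commands
WHERE agent_id = $1
  AND status = 'pending'
  AND created_at > NOW() - INTERVAL '24 hours'
ORDER BY created_at ASC
LIMIT 100`

// referenceNow fixes "now" so the example is deterministic.
var referenceNow = time.Date(2026, 3, 28, 12, 0, 0, 0, time.UTC)

type queuedCommand struct {
	ID        string
	Status    string
	CreatedAt time.Time
}

// deliverable applies the same predicate in memory: pending and younger than ttl.
func deliverable(cmds []queuedCommand, now time.Time, ttl time.Duration) []queuedCommand {
	cutoff := now.Add(-ttl)
	var out []queuedCommand
	for _, c := range cmds {
		if c.Status == "pending" && c.CreatedAt.After(cutoff) {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	cmds := []queuedCommand{
		{ID: "fresh", Status: "pending", CreatedAt: referenceNow.Add(-time.Hour)},
		{ID: "stale", Status: "pending", CreatedAt: referenceNow.Add(-30 * 24 * time.Hour)},
	}
	for _, c := range deliverable(cmds, referenceNow, queueTTL) {
		fmt.Println("deliver:", c.ID)
	}
}
```

A read-side filter alone leaves stale rows in `pending` status, so it hides old commands from agents without ever resolving them.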

---

## 6. Database Schema — TTL and Command Expiry

### agent_commands table (from migration 001 + amendments)

```sql
CREATE TABLE agent_commands (
    id          UUID PRIMARY KEY,
    agent_id    UUID REFERENCES agents(id),
    command_type VARCHAR(50),
    params      JSONB,
    status      VARCHAR(20) DEFAULT 'pending',
    created_at  TIMESTAMP DEFAULT NOW(),
    sent_at     TIMESTAMP,
    completed_at TIMESTAMP,
    result      JSONB,
    signature   VARCHAR(128),          -- migration 020
    key_id      VARCHAR(64),           -- migration 025
    signed_at   TIMESTAMP,             -- migration 025
    idempotency_key VARCHAR(64) UNIQUE -- migration 023a
);
```

**FINDING F-7 (HIGH)**: No `expires_at` column exists. No TTL constraint exists. No scheduled cleanup job for old pending commands exists in the codebase. The only cleanup mechanisms are:

- Manual `ClearOldFailedCommands(days)` — applies to `failed`/`timed_out` only, not `pending`
- Manual `CancelCommand(id)` — single-command manual cancellation
- The deduplication index from migration 023a prevents duplicate pending commands per `(agent_id, command_type)`, but this only prevents new duplicates — it doesn't expire old ones
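
A scheduled sweep over an `expires_at` value might look like the sketch below. The `storedCommand` type, `defaultTTL`, and the `expired` status are all hypothetical; nothing equivalent exists in the schema or codebase today.

```go
package main

import (
	"fmt"
	"time"
)

// defaultTTL is a hypothetical lifetime assigned when a command is created.
const defaultTTL = 24 * time.Hour

// referenceNow fixes "now" so the example is deterministic.
var referenceNow = time.Date(2026, 3, 28, 12, 0, 0, 0, time.UTC)

type storedCommand struct {
	ID        string
	Status    string
	ExpiresAt time.Time // would map to a proposed expires_at column
}

// newStoredCommand sets ExpiresAt once, at creation time.
func newStoredCommand(id string, createdAt time.Time) storedCommand {
	return storedCommand{ID: id, Status: "pending", ExpiresAt: createdAt.Add(defaultTTL)}
}

// expireStale marks pending commands whose ExpiresAt has passed and returns
// how many were changed. The "expired" status is an assumption.
func expireStale(cmds []storedCommand, now time.Time) int {
	n := 0
	for i := range cmds {
		if cmds[i].Status == "pending" && !cmds[i].ExpiresAt.After(now) {
			cmds[i].Status = "expired"
			n++
		}
	}
	return n
}

func main() {
	cmds := []storedCommand{
		newStoredCommand("old", referenceNow.Add(-48*time.Hour)),
		newStoredCommand("new", referenceNow),
	}
	fmt.Println("expired:", expireStale(cmds, referenceNow))
	for _, c := range cmds {
		fmt.Println(c.ID, c.Status)
	}
}
```

Storing the expiry per row, rather than deriving it from `created_at` at query time, would allow different command types to carry different lifetimes.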

---

## 7. Attack Surface Assessment

### Can a captured signed command be replayed indefinitely?

**New format (with `signed_at`)**: Replayable for 24 hours from signing time. After that, `VerifyCommandWithTimestamp` rejects it as too old.

**Old format (no `signed_at`)**: **YES — replayable indefinitely.** `VerifyCommand` has no time check. Any command signed before the A-1 implementation was deployed (before `signed_at` was added) is permanently replayable.

The backward-compatibility fallback in `VerifyCommandWithTimestamp` (`if cmd.SignedAt == nil → VerifyCommand`) means that commands from old servers that do not set `signed_at`, and commands in the DB pre-dating migration 025, all fall into the unlimited-replay category.

### Replay attack scenarios

**Scenario A — Network MITM (24h window)**
An attacker positioned between server and agent captures a valid `install_updates` command with `signed_at` set. Within 24 hours, they can re-present this command to the agent. If the agent's command handler receives it (via MITM on the polling response), it passes `VerifyCommandWithTimestamp` and is executed — potentially installing the same update a second time, or more dangerously triggering a `reboot` or `update_agent` command twice.

**Scenario B — Old-format signature captured forever**
Any command signed before `signed_at` support was deployed (old server version or commands created before migration 025 ran) has no timestamp. A captured signature is valid forever. The only remaining constraint is that the signature covers the command UUID, so a replay must reuse the original UUID; an attacker who can inject a command with that UUID into the DB passes verification.

**Scenario C — Retry creates unsigned commands (strict mode)**
An operator clicks "Retry" on a failed `install_updates` command. The server creates a new unsigned command. In strict mode, the agent rejects it: the rejection is logged on the agent and reported to the server as `failed`, but the operator is not told why. The operator may not understand why the retry keeps failing, and may downgrade the enforcement mode to `warning` as a workaround, which is exactly the wrong response.

**Scenario D — `agent_id` not in signature (cross-agent injection)**
If an attacker can write to the `agent_commands` table directly (e.g., via SQL injection elsewhere, or compromised server credentials), they can copy a signed command for agent A into agent B's queue. The Ed25519 signature will verify correctly on agent B because `agent_id` is not in the signed content.

**Scenario E — Stuck command re-execution**
The `GetStuckCommands` query re-delivers commands that have been in `sent` status for more than 5 minutes. If a command was genuinely stuck (network failure, agent restart), it may be re-executed when the agent comes back online. If the command is `reboot` or `install_updates`, this can cause unintended repeated execution. There is no duplicate-execution guard on the agent side (no "already executed command ID" tracking).
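
An agent-side guard against repeated execution could be as small as a set of recently executed command IDs. The sketch below is a minimal in-memory version; `executedSet`, `markIfNew`, and `referenceNow` are hypothetical names, and an in-memory set is lost on agent restart, so commands such as `reboot` would need the set persisted to disk.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// commandMaxAge matches the replay window discussed in this audit.
const commandMaxAge = 24 * time.Hour

// referenceNow fixes "now" so the example is deterministic.
var referenceNow = time.Date(2026, 3, 28, 12, 0, 0, 0, time.UTC)

// executedSet remembers command IDs for a retention period.
type executedSet struct {
	mu        sync.Mutex
	seen      map[string]time.Time
	retention time.Duration
}

func newExecutedSet(retention time.Duration) *executedSet {
	return &executedSet{seen: make(map[string]time.Time), retention: retention}
}

// markIfNew records id and returns true the first time it is seen within the
// retention period. Entries older than the retention period are evicted first.
func (s *executedSet) markIfNew(id string, now time.Time) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	for k, t := range s.seen {
		if now.Sub(t) > s.retention {
			delete(s.seen, k)
		}
	}
	if _, ok := s.seen[id]; ok {
		return false
	}
	s.seen[id] = now
	return true
}

func main() {
	set := newExecutedSet(commandMaxAge)
	for _, id := range []string{"cmd-1", "cmd-1", "cmd-2"} {
		if set.markIfNew(id, referenceNow) {
			fmt.Println("execute", id)
		} else {
			fmt.Println("skip duplicate", id)
		}
	}
}
```

The retention period should be at least as long as the signature validity window plus the allowed clock skew, so that any ID still able to pass the timestamp check is still remembered.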

---

## 8. Summary Table

| Finding | Severity | Description |
|---------|----------|-------------|
| **F-1** | HIGH | `agent_id` absent from signed payload — commands not cryptographically bound to a specific agent |
| **F-2** | CRITICAL | No nonce in command signing path — no single-use guarantee for command signatures |
| **F-3** | CRITICAL | Old-format commands (no `signed_at`) have zero time-based replay protection — valid forever |
| **F-4** | HIGH | 24-hour replay window for new-format commands; bounded, but generous for an attacker holding a captured command |
| **F-5** | CRITICAL | `RetryCommand` creates unsigned commands — entire retry feature broken in strict enforcement mode |
| **F-6** | HIGH | Server `GetPendingCommands` has no TTL filter — stale pending commands delivered indefinitely |
| **F-7** | HIGH | No `expires_at` column in `agent_commands` — no schema-enforced command TTL |

### Severity definitions used
- **CRITICAL**: Exploitable by an attacker with no special access, or breaks core security feature silently
- **HIGH**: Requires attacker to have partial access (MITM position, DB access) or silently degrades security posture

---

## 9. Out of Scope / Confirmed Clean

- The Ed25519 signing algorithm itself is correctly implemented (A-1 verified).
- The key rotation implementation (A-1) correctly identifies and uses the right public key per command.
- The timestamp arithmetic in `VerifyCommandWithTimestamp` is not inverted (verified in A-1).
- The JWT authentication on `GET /agents/:id/commands` is enforced by middleware — an unauthenticated attacker cannot directly call the command endpoint to inject commands through the server API.
- The deduplication index (migration 023a) prevents duplicate `pending` commands of the same type per agent.

---

## 10. Recommended Fixes (Prioritised, Not Yet Implemented)

| Priority | Fix | Addresses |
|----------|-----|-----------|
| 1 | Re-sign commands in `RetryCommand` — call `signAndCreateCommand` instead of `commandQueries.CreateCommand` directly | F-5 |
| 2 | Add `agent_id` to the signed message payload | F-1 |
| 3 | Add server-side command TTL: `expires_at` column + filter in `GetPendingCommands` | F-6, F-7 |
| 4 | Add agent-side executed-command deduplication: an in-memory or on-disk set of recently executed command UUIDs | F-2 (partial), F-4 |
| 5 | Remove old-format (no-timestamp) backward compat after a defined migration period — enforce `signed_at` as required | F-3 |
| 6 | Reduce `commandMaxAge` from 24h to a tighter window (4h) once retry infrastructure is fixed | F-4 |