Skip to main content

// Section 14.10 · Operate

Failure Modes and How the Daemon Responds

1 min14.10Operate
// 8 of 11 · node cli

Failure modes. CLI response. Remediation.

// 14.8 · failure modes · six categories with daemon behaviour and operator remediation

Six failure modes the daemon catches in production. For each: how the CLI surfaces the failure, what state the daemon enters, and the remediation the operator runs to recover.

Failure mode table

// failure · cli response · remediation

// FAILURE// CLI RESPONSE// REMEDIATION
Coordinator unreachable
Daemon enters OFFLINE_RETRY state, backs off exponentially.
Check network connectivity. Run `parallelix-node verify` for the full coordinator-reachability check.
Transient coordinator 403 (RPC blip)
Daemon tolerates unexplained 403s for ~5 minutes, backing off and retrying to ride out RPC blips. After ~5 minutes of consecutive unexplained 403s the daemon stops with a hint. not_staked and in_cooldown exit cleanly (exit 0) immediately.
Check the coordinator is reachable. Verify `--node-id` matches the registered node and that this machine's nodeKeyHash was the one registered.
GPU driver missing or wedged
Daemon refuses to accept requests tagged for that tier.
Reinstall or reset the GPU driver. Rerun `verify` to confirm the driver responds.
Request deadline exceeded
Daemon returns `node.deadline_exceeded` to the coordinator; coordinator re-dispatches to another node.
Investigate local CPU/GPU load. Result posts are retried up to 3 times before the coordinator requeues.
Signature verification failed by coordinator
Request rejected; recorded as failed PoE in §7.1 taxonomy. The result is dropped and the node is flagged.
Check clock skew with NTP. Rotate the node key via `parallelix-node init` if persistent.
Hardware tier mismatch
Daemon refuses to attach or run; uptime for the node would zero against a mismatched tier.
Re-run `parallelix-node setup` to re-detect hardware, or declare the tier `probe` reports when re-staking.
Heartbeat gap (machine offline)
Uptime for the window drops; the node earns 0 for that period and its reward share falls. No penalty against principal.
Restore connectivity. Uptime is the reward weight; principal is always returned on unstake.

Severity classification

// what's transient · what needs an operator fix · what costs reward share

// TRANSIENT

Resolves with backoff

Coordinator unreachable, unexplained 403s, request deadline exceeded. Daemon backs off and continues running. Unexplained 403s tolerated for ~5 minutes before the daemon stops. not_staked and in_cooldown exit cleanly immediately.

// CONFIG

Operator-side fix required

GPU driver missing/wedged, signature mismatch, hardware tier mismatch. Daemon enters reduced-function state or refuses to run; operator action required.

// UPTIME

Lost reward share, no penalty to principal

Heartbeat gap or failed PoE. Uptime drops, so the node earns 0 for that period and its reward weight falls. There is no slashing; principal is always returned on unstake.