Withdrawals stalled for 14 minutes on March 18
1. Summary
On 2026-03-18 between 14:08:22 UTC and 14:22:11 UTC, withdrawals from one of our Ethereum hot wallets stopped processing. 1,847 withdrawal requests queued for up to 14 minutes; all completed normally once the wallet recovered. No customer funds were at risk. No customer was charged twice. We're publishing this because we owe the explanation.
2. Timeline
- 14:08:22 — Hot wallet A submits withdrawal tx with nonce 412,387.
- 14:08:34 — Ethereum block 22,107,402 produced; tx confirmed.
- 14:09:11 — Block 22,107,402 reorged out (3-block reorg from competing validator proposals).
- 14:09:18 — Tx returns to mempool; our wallet daemon thinks it's still confirmed (cache miss on the reorg notification).
- 14:09:30 — Daemon submits next tx with nonce 412,388. Mempool rejects it: nonce 412,387 still pending.
- 14:09:30 → 14:22:00 — Daemon retries every 30s, all rejected. No alert fires because the daemon classifies "nonce too low" as user error, not infrastructure error.
- 14:22:00 — On-call engineer notices Sankey chart on the trading dashboard showing zero ETH withdrawals for 12+ minutes; pages.
- 14:22:11 — Manually re-broadcast nonce 412,387, drain queue.
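The entries from 14:08:34 to 14:09:18 describe a cached confirmation going stale. The sketch below illustrates that failure mode and a check that would catch it, assuming a hypothetical in-memory view of the chain. `CachedReceipt`, `is_still_confirmed`, and the `canonical` mapping are illustrative names, not our daemon's code, and the block hashes are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedReceipt:
    """What a wallet daemon might remember about a confirmed tx (hypothetical)."""
    nonce: int
    block_number: int
    block_hash: str


def is_still_confirmed(receipt: CachedReceipt, canonical: dict[int, str]) -> bool:
    """Return True only if the block that included the tx is still canonical.

    `canonical` maps block number -> hash of the block currently on the
    canonical chain at that height. After a reorg the hash at the same
    height changes, so a receipt cached before the reorg no longer matches.
    """
    return canonical.get(receipt.block_number) == receipt.block_hash


# Before the reorg: block 22,107,402 carries the tx with nonce 412,387.
canonical = {22_107_402: "0xaaa"}  # placeholder hash
receipt = CachedReceipt(nonce=412_387, block_number=22_107_402, block_hash="0xaaa")
before = is_still_confirmed(receipt, canonical)

# After the reorg: a different block occupies the same height.
canonical[22_107_402] = "0xbbb"  # placeholder hash
after = is_still_confirmed(receipt, canonical)

print(before, after)  # True False
```

Comparing block hashes rather than block numbers is the point of the sketch: the height stays the same across a reorg, so a number-only check would keep reporting the tx as confirmed.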
3. Root cause
Two contributing factors:
- Our reorg-watcher service was running on the same node as the wallet daemon. When the daemon started spinning on rejected submissions, the watcher was starved of CPU and missed the reorg notification for ~15 seconds.
- Our alerting rule for "wallet daemon stuck" was based on submission count, not effective throughput. Submissions kept happening (and kept being rejected), so the count looked healthy.
4. What we changed
- Reorg-watcher moved to its own dedicated host with its own JSON-RPC connection. No shared CPU with the daemon.
- Alerts switched from submissions/min to effective_throughput / submissions. If submissions are happening but none are confirming, that ratio drops below 0.1 and the alert fires within 60 seconds.
- Withdrawal queue dashboard added a "stuck nonce" widget visible to the on-call team alongside the trading dashboards.
- Updated runbook to make manual nonce re-broadcast a 90-second action with two-engineer signoff, instead of a 4-minute ad-hoc procedure.
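The alerting change can be sketched as a comparison of the two rules over one evaluation window. This is a minimal illustration, not our production rule. `WindowStats`, `old_rule_fires`, `new_rule_fires`, and the floor of one submission per minute for the old rule are assumptions, and the sketch takes effective throughput to mean confirmations in the window. Only the 0.1 threshold and the 30-second retry interval come from the text above; the healthy-window numbers are made up.

```python
from dataclasses import dataclass

RATIO_THRESHOLD = 0.1        # from the post: alert when the ratio drops below 0.1
MIN_SUBMISSIONS_PER_MIN = 1  # assumed floor for the old count-based rule


@dataclass(frozen=True)
class WindowStats:
    """Counters for one 60-second evaluation window (hypothetical shape)."""
    submissions: int
    confirmations: int


def old_rule_fires(w: WindowStats) -> bool:
    """Old rule: alert only when the daemon stops submitting."""
    return w.submissions < MIN_SUBMISSIONS_PER_MIN


def new_rule_fires(w: WindowStats) -> bool:
    """New rule: alert when too few submissions turn into confirmations."""
    if w.submissions == 0:
        # The ratio is undefined with no submissions; this sketch leaves
        # the idle case to a separate rule (an assumption).
        return False
    return w.confirmations / w.submissions < RATIO_THRESHOLD


# During the incident: a retry every 30s is 2 submissions/min, none confirming.
incident = WindowStats(submissions=2, confirmations=0)
# An illustrative healthy window (numbers are invented for the example).
healthy = WindowStats(submissions=40, confirmations=38)

print(old_rule_fires(incident), new_rule_fires(incident))  # False True
print(old_rule_fires(healthy), new_rule_fires(healthy))    # False False
```

The incident window shows why the old rule stayed quiet: retries kept the submission count above any reasonable floor, while the ratio sat at zero the whole time.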
5. Impact and goodwill
1,847 withdrawals were delayed by an average of 9 minutes. We refunded the network fee on every stuck transaction (totalling about $4,200 across the affected accounts) and credited each impacted customer with a one-time fee waiver on their next deposit. Affected customers received this email within 90 minutes of resolution.
6. What we're keeping
The instinct to publish post-mortems publicly. We could have kept this quiet: only 1,847 customers were directly affected, the chain reorg gives us plausible deniability for the delay, and our SOC 2 framework doesn't require disclosure here. Publishing instead means the trading desk reads it, market makers read it, and the next on-call engineer reads it before they hit the same trap.
Past post-mortems are kept on the blog index. Related engineering reading: how we built the matching engine. Subscribe to incident notifications via account settings or the status page.