Withdrawals stalled for 14 minutes on March 18
1. Summary
On 2026-03-18, between 14:08:22 UTC and 14:22:11 UTC, one of our Ethereum hot wallets stopped processing withdrawals. A total of 1,847 withdrawal requests queued during those 14 minutes; all completed normally once the wallet recovered. No customer funds were at risk. No customer was charged twice. We're publishing this because we owe the explanation.
2. Timeline
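- 14:08:22: Hot wallet A submits a withdrawal tx with nonce 412,387.
- 14:08:34: Ethereum block 22,107,402 is produced; the transaction confirms.
- 14:09:11: Block 22,107,402 is reorged out (a 3-block reorg caused by competing validator proposals).
- 14:09:18: The transaction returns to the mempool; our wallet daemon still believes it is confirmed (cache miss on the reorg notification).
- 14:09:30: The daemon submits the next transaction, with nonce 412,388. The mempool rejects it: nonce 412,387 is still outstanding.
- 14:09:30 → 14:22:00: The daemon retries every 30s; every attempt is rejected. No alert fires because the daemon classifies "nonce too low" as user error, not infrastructure error.
- 14:22:00: The on-call engineer notices that the Sankey diagram on the trading dashboard has shown zero ETH withdrawals for more than 12 minutes, and raises an alert.
- 14:22:11: A manual re-broadcast of nonce 412,387 clears the queue.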
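The 14:09:18 step is where a stale "confirmed" status did the damage. As an illustration only, here is a minimal sketch of one common safeguard: remember the hash of the block that included a transaction, and re-check it against the canonical chain before trusting the confirmation. The names `ConfirmedTx`, `tx_status`, and `MIN_CONFIRMATIONS`, the depth of 6, and the placeholder block hashes are all assumptions made for this example; this is not our daemon's code.

```python
from dataclasses import dataclass
from typing import Dict

# Assumption: number of blocks (including the inclusion block) required
# before a transaction is treated as final. 6 is an illustrative value.
MIN_CONFIRMATIONS = 6


@dataclass(frozen=True)
class ConfirmedTx:
    nonce: int
    block_number: int
    block_hash: str  # hash of the block that included the tx when first seen


def tx_status(tx: ConfirmedTx, canonical: Dict[int, str], head: int) -> str:
    """Classify a previously confirmed tx against the current canonical chain.

    `canonical` is a hypothetical view mapping block height to the block hash
    currently considered canonical; `head` is the current chain height.
    """
    if canonical.get(tx.block_number) != tx.block_hash:
        # The inclusion block is gone or replaced: the tx is pending again
        # and its nonce must not be treated as consumed.
        return "reorged"
    confirmations = head - tx.block_number + 1
    if confirmations < MIN_CONFIRMATIONS:
        return "awaiting_depth"
    return "final"


# Heights and nonce from the timeline; the hashes are placeholders.
tx = ConfirmedTx(nonce=412_387, block_number=22_107_402, block_hash="0xaaa")
status_before = tx_status(tx, {22_107_402: "0xaaa"}, head=22_107_402)
status_after = tx_status(tx, {22_107_402: "0xbbb"}, head=22_107_404)
print(status_before, status_after)  # awaiting_depth reorged
```

In this sketch a hash mismatch is the signal to re-broadcast the original transaction rather than move on to the next nonce.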
3. Root cause
Two causes:
- Our reorg-watcher service was running on the same node as the wallet daemon. When the daemon started spinning on rejected submissions, the watcher's CPU starved and missed the reorg notification for ~15 seconds.
- Our alerting rule for "wallet daemon stuck" was based on submission count, not effective throughput. Submissions kept happening (and kept being rejected), so the count looked healthy.
4. What we changed
- Reorg-watcher moved to its own dedicated host with its own JSON-RPC connection. No shared CPU with the daemon.
- Alerting changed from submissions/min to effective_throughput / submissions. If submissions are happening but none are confirming, that ratio drops below 0.1 and the alert fires within 60 seconds.
- Withdrawal queue dashboard added a "stuck nonce" widget visible to the on-call team alongside the trading dashboards.
- Updated the runbook to make manual nonce re-broadcast a 90-second action with two-engineer signoff, instead of a 4-minute ad-hoc procedure.
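To make the ratio-based alert concrete, here is a minimal sketch under stated assumptions: "effective throughput" is taken to mean the number of submissions that confirmed, and the rule is evaluated over a sliding 60-second window. The class name `ThroughputRatioAlert` and its methods are hypothetical; the 0.1 threshold is the one from the rule above.

```python
from collections import deque
from typing import Deque, Tuple

RATIO_THRESHOLD = 0.1  # threshold from the alert rule
WINDOW_SECONDS = 60    # assumption: window sized to the 60-second firing target


class ThroughputRatioAlert:
    """Fires when confirmed / submitted falls below a threshold in a window."""

    def __init__(self, threshold: float = RATIO_THRESHOLD,
                 window: float = WINDOW_SECONDS) -> None:
        self.threshold = threshold
        self.window = window
        # Each event is (timestamp in seconds, whether the submission confirmed).
        self._events: Deque[Tuple[float, bool]] = deque()

    def record(self, timestamp: float, confirmed: bool) -> None:
        self._events.append((timestamp, confirmed))

    def should_fire(self, now: float) -> bool:
        # Drop events that have aged out of the window.
        while self._events and self._events[0][0] < now - self.window:
            self._events.popleft()
        submissions = len(self._events)
        if submissions == 0:
            # An idle daemon is a separate condition this sketch does not cover.
            return False
        confirmed = sum(1 for _, ok in self._events if ok)
        return confirmed / submissions < self.threshold


# The incident pattern: a retry every 30 seconds, every one rejected.
stuck = ThroughputRatioAlert()
stuck.record(0, confirmed=False)
stuck.record(30, confirmed=False)
print(stuck.should_fire(now=60))  # True
```

A count-based rule would see two submissions per minute here and stay quiet; the ratio is 0 / 2, which is below 0.1, so this version fires.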
5. Impact and goodwill compensation
1,847 withdrawals were delayed by an average of 9 minutes. We refunded the network fee on every stuck transaction (totalling about $4,200 across the affected accounts) and credited each impacted customer with a one-time fee waiver on their next deposit. Affected customers received this email within 90 minutes of resolution.
6. What we're keeping
The instinct to publish post-mortems publicly. We could have kept this quiet: only 1,847 customers were directly affected, the chain reorg gives plausible deniability for the delay, and our SOC 2 framework doesn't require disclosure here. Publishing instead means the trading desk reads it, market makers read it, and the next on-call engineer reads it before they hit the same trap.
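Past incident post-mortems are available in the blog index. Related technical article: How we built our matching engine. Available via account settings or the status page.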