Sub-millisecond order matching, explained
1. Where we started
The matching engine we shipped in 2022 was a textbook implementation: a single-threaded event loop processing orders against a price-time-priority book represented as nested BTreeMaps. Median latency from order receipt to execution report was around 4.2 ms, p99 around 11 ms. Fine for retail flow, embarrassing for anyone trying to market-make.
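For anyone who hasn't stared at one recently, the textbook structure looks roughly like the sketch below. This is an illustrative Rust fragment with simplified types and made-up field names, not our production code; cancels, partial-fill reporting, and the bid side of matching are omitted.

```rust
use std::collections::BTreeMap;

type Price = u64; // integer ticks
type Qty = u64;   // integer lots
type Seq = u64;   // monotonic arrival sequence; gives time priority within a level

struct Order {
    id: u64,
    qty: Qty,
}

#[derive(Default)]
struct Book {
    // Outer map: price level. Inner map: arrival sequence -> resting order.
    // Ascending iteration over asks visits the best price first, and ascending
    // iteration over the inner map visits the oldest order first: price-time priority.
    asks: BTreeMap<Price, BTreeMap<Seq, Order>>,
    bids: BTreeMap<Price, BTreeMap<Seq, Order>>,
}

impl Book {
    // Match an incoming buy limit order against resting asks; returns filled quantity.
    fn match_buy(&mut self, limit: Price, mut qty: Qty) -> Qty {
        let mut filled = 0;
        let mut emptied_levels = Vec::new();
        for (&px, level) in self.asks.range_mut(..=limit) {
            let mut done = Vec::new();
            for (&seq, resting) in level.iter_mut() {
                if qty == 0 {
                    break;
                }
                let take = qty.min(resting.qty);
                resting.qty -= take;
                qty -= take;
                filled += take;
                if resting.qty == 0 {
                    done.push(seq); // fully filled resting orders leave the book
                }
            }
            for seq in done {
                level.remove(&seq);
            }
            if level.is_empty() {
                emptied_levels.push(px);
            }
            if qty == 0 {
                break;
            }
        }
        for px in emptied_levels {
            self.asks.remove(&px); // drop consumed price levels
        }
        filled
    }
}
```

The matching step itself really is that small, which is why the rest of this post is about everything around it.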
2. The latency budget
The matching itself is cheap: 50–200 nanoseconds for a typical order against a warm book. The other ~4 ms went to the kernel network stack, JSON parsing, validation, accounting writes, replication, and reporting. We measured each segment under load; the most expensive pieces were validation (~1.4 ms, 30+ allowlist lookups) and accounting writes (~1.7 ms, hitting Postgres).
3. What we changed
Validation moved to a write-once snapshot
Allowlists, fee tiers, and per-account limits used to be looked up against Postgres on every order, and most of those lookups missed the cache. We now snapshot them into a memory-mapped page on every change, with a 16-byte version header. The matching loop reads them lock-free; updates flip the snapshot atomically. Validation dropped to ~80 µs.
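The hot-path side of that change is easiest to see in code. The sketch below is a simplified in-process model, not the mmapped layout itself: the field names are invented, the real page carries the 16-byte version header, and old snapshots are reclaimed rather than leaked.

```rust
use std::sync::atomic::{AtomicPtr, Ordering};

// Invented fields standing in for allowlists, fee tiers, and per-account limits.
struct ValidationSnapshot {
    version: u64,
    max_order_qty: u64,
    fee_tier_bps: u32,
}

struct SnapshotHandle {
    current: AtomicPtr<ValidationSnapshot>,
}

impl SnapshotHandle {
    fn new(initial: ValidationSnapshot) -> Self {
        Self { current: AtomicPtr::new(Box::into_raw(Box::new(initial))) }
    }

    // Hot path: one atomic load, no locks, no Postgres.
    fn load(&self) -> &ValidationSnapshot {
        unsafe { &*self.current.load(Ordering::Acquire) }
    }

    // Control path: build the new snapshot off the hot path, then flip the pointer.
    // Reclaiming the old snapshot (epoch/quiesce) is deliberately elided here.
    fn publish(&self, next: ValidationSnapshot) {
        let fresh = Box::into_raw(Box::new(next));
        let _old = self.current.swap(fresh, Ordering::Release);
        // _old is leaked in this sketch; the real system retires it safely.
    }
}
```

Validators only ever see a fully written snapshot or the previous one, never a half-applied update, which is what lets the matching loop skip locks entirely.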
Accounting moved to a journal
Instead of writing fills directly to Postgres on the hot path, we append an event to a per-shard ring buffer and ack the order. A separate journaler persists those events durably within 5 ms, which is enough to satisfy our consistency story without blocking matching. Accounting on the hot path dropped to ~30 µs.
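A rough shape of that split, with a bounded std channel standing in for the real per-shard ring buffer and a stub in place of the actual persistence call; field names and the drain cadence are illustrative:

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};
use std::thread;
use std::time::Duration;

// Illustrative fill event; the real journal entry carries more than this.
struct FillEvent {
    order_id: u64,
    price: u64,
    qty: u64,
}

// Hot path: append to the per-shard buffer and return immediately.
fn record_fill(journal: &SyncSender<FillEvent>, fill: FillEvent) -> Result<(), TrySendError<FillEvent>> {
    journal.try_send(fill) // never blocks matching; a full buffer surfaces as an error
}

// Off the hot path: drain whatever has accumulated and persist it in batches,
// fast enough to stay inside the 5 ms durability window.
fn run_journaler(events: Receiver<FillEvent>) {
    loop {
        let mut batch = Vec::new();
        while let Ok(ev) = events.try_recv() {
            batch.push(ev);
        }
        if !batch.is_empty() {
            persist_batch(&batch);
        }
        thread::sleep(Duration::from_millis(1));
    }
}

fn persist_batch(batch: &[FillEvent]) {
    // Placeholder: the real journaler appends and fsyncs a journal segment.
    eprintln!("persisted {} fills", batch.len());
}

fn main() {
    let (tx, rx) = sync_channel(65_536); // per-shard capacity, chosen arbitrarily here
    thread::spawn(move || run_journaler(rx));
    let _ = record_fill(&tx, FillEvent { order_id: 1, price: 101_250, qty: 3 });
    thread::sleep(Duration::from_millis(10)); // give the journaler a chance to flush
}
```

The important property is that the matching loop only ever touches record_fill; everything durable happens on the other side of the buffer.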
Network stack: io_uring + zero-copy
We migrated order ingress from an epoll-based read loop to io_uring with batched submissions; that alone cut syscall overhead by ~400 µs. Further down the stack, an XDP program steers packets for the hot trading pairs into the matching engine's queue without traversing the rest of the kernel network stack, saving another ~300 µs. Together: ~700 µs off the round-trip.
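For the io_uring half, the batched-submission pattern looks roughly like this, using the Rust io-uring crate. The fd is assumed to be an already-connected socket, buffer management and error handling are heavily simplified, and the XDP steering program is not shown.

```rust
use io_uring::{opcode, types, IoUring};
use std::os::unix::io::RawFd;

const BATCH: usize = 32;

fn drain_socket(fd: RawFd) -> std::io::Result<()> {
    let mut ring = IoUring::new(256)?;
    let mut bufs = vec![[0u8; 2048]; BATCH];

    // Queue a whole batch of recvs, then submit them with a single syscall,
    // instead of paying one syscall per order as the old read loop did.
    for (i, buf) in bufs.iter_mut().enumerate() {
        let recv = opcode::Recv::new(types::Fd(fd), buf.as_mut_ptr(), buf.len() as u32)
            .build()
            .user_data(i as u64);
        unsafe { ring.submission().push(&recv).expect("submission queue full") };
    }
    ring.submit_and_wait(1)?; // one syscall for all queued SQEs

    // Harvest completions; user_data tells us which buffer each recv landed in.
    for cqe in ring.completion() {
        let n = cqe.result();
        if n > 0 {
            let idx = cqe.user_data() as usize;
            handle_frame(&bufs[idx][..n as usize]);
        }
    }
    Ok(())
}

fn handle_frame(frame: &[u8]) {
    // Placeholder: real code decodes the order and hands it to the matching shard.
    let _ = frame;
}
```

The XDP steering lives in a separate eBPF program attached at the driver level, which is why it does not appear in this sketch.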
4. What we did NOT change
The order book data structure stayed the same. Price-time-priority semantics stayed the same. The execution-report format stayed the same. Every change was at the edge — validation snapshot, async journaler, network stack. Nothing in the actual matching code moved. This let us roll the change out shard-by-shard with bit-for-bit replay diffing.
5. Results
- Median latency: 4.2 ms → 580 µs (7×)
- p99 latency: 11 ms → 1.4 ms (8×)
- Throughput per shard: 18,000 ops/s → 92,000 ops/s
- Accounting durability: synchronous → asynchronous within 5 ms (we publish the SLA explicitly)
Market makers noticed within a week. Spreads on majors tightened by 6–18 bps as their inventory turn rate caught up with the new fill rate.
6. What we're working on next
The next bottleneck is replication — we run two-phase commit across three regions, and the cross-region acknowledgement is now the longest hop in the system. We're prototyping a Raft variant where the matching shard can complete the order on a single-region quorum and replicate asynchronously, with explicit invariants about what gets preserved on failover. Expect a follow-up post when that's live.
Engineering posts on bitexasia are written by the engineers who shipped the change. Related: a recent withdrawal-rail post-mortem — same precision-first approach applied to a failure mode rather than an optimisation. Tag your API tickets with engine if you spot a regression — they go to this team's board directly.