**What changed:** `HashMap<i64, VecDeque<Order>>` replaced with an arena-based object pool plus intrusive doubly linked lists. Cancel complexity: O(n) → O(1).
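A minimal sketch of the idea (type and field names are illustrative, not the repo's actual layout): each arena slot stores the order inline plus `prev`/`next` indices, so a cancel becomes a pointer splice instead of a `VecDeque` scan, and the free list reuses the `next` field so neither insert nor cancel ever touches the allocator.

```rust
const NIL: u32 = u32::MAX; // sentinel: "no neighbor" / end of free list

#[derive(Default, Clone, Copy)]
struct Order {
    id: u64,
    qty: u64,
}

#[derive(Clone, Copy)]
struct Slot {
    order: Order,
    prev: u32, // previous order at the same price level, or NIL
    next: u32, // next order at the level, or NIL; doubles as the free-list link
}

struct Arena {
    slots: Vec<Slot>,
    free_head: u32,
}

impl Arena {
    fn with_capacity(n: u32) -> Self {
        // Pre-allocate every slot up front and thread a free list through `next`.
        let slots = (0..n)
            .map(|i| Slot {
                order: Order::default(),
                prev: NIL,
                next: if i + 1 < n { i + 1 } else { NIL },
            })
            .collect();
        Arena { slots, free_head: if n == 0 { NIL } else { 0 } }
    }

    /// Insert path: pop a pre-allocated slot in O(1), no malloc.
    fn alloc(&mut self, order: Order) -> u32 {
        let idx = self.free_head;
        assert_ne!(idx, NIL, "arena exhausted");
        self.free_head = self.slots[idx as usize].next;
        self.slots[idx as usize] = Slot { order, prev: NIL, next: NIL };
        idx
    }

    /// Cancel path: O(1) splice out of the price level's list
    /// (`head`/`tail` are that level's endpoints), then recycle the slot.
    fn unlink(&mut self, idx: u32, head: &mut u32, tail: &mut u32) {
        let Slot { prev, next, .. } = self.slots[idx as usize];
        if prev == NIL { *head = next; } else { self.slots[prev as usize].next = next; }
        if next == NIL { *tail = prev; } else { self.slots[next as usize].prev = prev; }
        self.slots[idx as usize].next = self.free_head;
        self.free_head = idx;
    }
}
```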
### Order Cancel

| Benchmark | Phase 1 | Phase 2 | Change |
|---|---|---|---|
| `cancel_all/100` | 6.26 µs | 4.46 µs | -29% |
| `cancel_all/1000` | 55.42 µs | 39.79 µs | -28% |
| `cancel_all/5000` | 313.87 µs | 208.34 µs | -34% |
| `cancel_middle_of_1k` | 2.16 µs | 0.91 µs | -58% |
### Matching

| Benchmark | Phase 1 | Phase 2 | Change |
|---|---|---|---|
| `match/full_fill_1k` | 96.20 µs | 96.87 µs | ~0% |
| `match/multi_level_sweep` | 42.07 µs | 50.61 µs | +20% |
### Insert

| Benchmark | Phase 1 | Phase 2 | Change |
|---|---|---|---|
| `insert/non_crossing/100` | 5.15 µs | 5.57 µs | ~0% |
| `insert/non_crossing/1000` | 47.91 µs | 49.34 µs | ~0% |
| `insert/non_crossing/10000` | 529.28 µs | 537.06 µs | ~0% |
### Mixed Workload

| Benchmark | Phase 1 | Phase 2 | Change |
|---|---|---|---|
| `mixed_workload_10k` | 1.19 ms | 0.69 ms | -42% |
### Memory

| Metric | Phase 1 | Phase 2 |
|---|---|---|
| Order storage | Heap-allocated per order (`VecDeque<Order>`) | Pre-allocated arena (1M slots, 64 MB) |
| Order size | 48 bytes (unaligned) | 64 bytes (`#[repr(C, align(64))]`, 1 cache line) |
| Cancel lookup | O(1) index lookup + O(n) linear scan | O(1) index lookup + O(1) unlink |
| Hot-path malloc | Every insert/cancel | Zero (arena pre-allocated) |
| `unsafe` blocks | 0 | 0 |
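The "1 cache line" row can be pinned down at compile time. A sketch, assuming a field set like the one below (the repo's actual struct may differ):

```rust
// #[repr(C, align(64))] fixes the field order, aligns the struct to 64 bytes,
// and pads its size up to 64, so one order occupies exactly one cache line
// and arena slots never straddle two lines.
#[repr(C, align(64))]
struct Order {
    id: u64,
    price: i64,
    qty: u64,
    timestamp: u64,
    prev: u32, // intrusive-list links from Phase 2
    next: u32,
    side: u8,
    // 23 bytes of tail padding are added automatically to reach 64
}

// Compile-time guard: the build fails if the layout ever drifts.
const _: () = assert!(std::mem::size_of::<Order>() == 64);
```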
### Test Suite

| Metric | Phase 1 | Phase 2 |
|---|---|---|
| Test count | 37 | 53 |
| Test time | ~0.06s | ~0.13s |
Note: the Phase 1 test time was measured before the arena was introduced. Phase 2 tests use `with_capacity(1024)` to keep the arena small (64 KB instead of the 64 MB default).
## Phase 2 → Phase 3: Sorted Price Levels
**What changed:** `HashMap<i64, PriceLevel>` replaced with `BTreeMap<i64, PriceLevel>` for the bid and ask sides. Best-price recomputation after a level removal drops from an O(n) linear scan to an O(log n) `BTreeMap` min/max lookup.
Note: Phase 2 and Phase 3 numbers below are from the same benchmark session for an apples-to-apples comparison.
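A minimal sketch of the best-price lookup this change enables (`Side` and `PriceLevel` are stand-ins, not the repo's actual types):

```rust
use std::collections::BTreeMap;

struct PriceLevel {
    total_qty: u64, // stand-in; the real level holds its order list
}

struct Side {
    levels: BTreeMap<i64, PriceLevel>, // keys stay sorted by price
}

impl Side {
    /// Best bid = highest price: an O(log n) walk to the max key,
    /// instead of an O(n) scan over a HashMap's keys.
    fn best_bid(&self) -> Option<i64> {
        self.levels.last_key_value().map(|(price, _)| *price)
    }

    /// Best ask = lowest price: an O(log n) walk to the min key.
    fn best_ask(&self) -> Option<i64> {
        self.levels.first_key_value().map(|(price, _)| *price)
    }
}
```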
### Matching (Phase 2→3)

| Benchmark | Phase 2 | Phase 3 | Change |
|---|---|---|---|
| `match/full_fill_1k` | 87.47 µs | 61.49 µs | -30% |
| `match/multi_level_sweep` | 45.14 µs | 14.73 µs | -67% |
`multi_level_sweep` sweeps 100 ask levels, emptying each and recomputing `best_ask`. With `HashMap` that recomputation was O(n) per level removal; with `BTreeMap` it is O(log n).
### Order Cancel (Phase 2→3)

| Benchmark | Phase 2 | Phase 3 | Change |
|---|---|---|---|
| `cancel_all/100` | 4.42 µs | 2.05 µs | -54% |
| `cancel_all/1000` | 39.56 µs | 20.79 µs | -47% |
| `cancel_all/5000` | 206.19 µs | 110.74 µs | -46% |
| `cancel_best_level/100` | 35.62 µs | 5.29 µs | -85% |
| `cancel_best_level/500` | 254.49 µs | 40.12 µs | -84% |
| `cancel_best_level/1000` | 696.94 µs | 124.07 µs | -82% |
| `cancel_middle_of_1k` | 1.02 µs | 0.85 µs | -17% |
`cancel_best_level` cancels N orders across N distinct prices, emptying a level on every cancel (the worst case for best-price recomputation). Total cost falls from O(n²) with `HashMap` to O(n log n) with `BTreeMap`.
### Insert (Phase 2→3)

| Benchmark | Phase 2 | Phase 3 | Change |
|---|---|---|---|
| `insert/non_crossing/100` | 8.18 µs | 6.40 µs | -22% |
| `insert/non_crossing/1000` | 77.46 µs | 64.23 µs | -17% |
| `insert/non_crossing/10000` | 780.92 µs | 608.91 µs | -22% |
A surprising improvement: with only two distinct prices in this benchmark, `BTreeMap` avoids both `HashMap`'s hashing overhead and the cache pollution from its pre-allocated 4096-entry bucket array. A regression would be expected only with many distinct price levels.
### Mixed Workload (Phase 2→3)

| Benchmark | Phase 2 | Phase 3 | Change |
|---|---|---|---|
| `mixed_workload_10k` | 825.83 µs | 887.23 µs | +7% |
The slight regression comes from `BTreeMap` node allocation whenever a new price level is created; `BTreeMap` has no `with_capacity` equivalent to pre-allocate those nodes.
### Test Suite (Phase 2→3)

| Metric | Phase 2 | Phase 3 |
|---|---|---|
| Test count | 53 | 53 |
| Test time | ~0.13s | ~0.11s |
## Phase 4: Lock-Free SPSC Ring Buffer
**What changed:** new lock-free single-producer single-consumer ring buffer (`src/ring.rs`) using the Disruptor pattern: free-running cursors with bitmask indexing, `CachePadded` atomics, and local cursor caching.
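The core cursor arithmetic, as a minimal sketch with a `u64` payload (the real `src/ring.rs` is generic over the item type and adds the `CachePadded` and local cursor-caching optimizations omitted here):

```rust
use std::sync::atomic::{AtomicU64, Ordering};

struct SpscRing {
    slots: Vec<AtomicU64>,
    mask: u64,       // capacity - 1; capacity must be a power of two
    head: AtomicU64, // consumer cursor: free-running, never wraps
    tail: AtomicU64, // producer cursor: free-running, never wraps
}

impl SpscRing {
    fn with_capacity(cap: u64) -> Self {
        assert!(cap.is_power_of_two());
        SpscRing {
            slots: (0..cap).map(|_| AtomicU64::new(0)).collect(),
            mask: cap - 1,
            head: AtomicU64::new(0),
            tail: AtomicU64::new(0),
        }
    }

    /// Producer side: publish one item, or fail if the ring is full.
    fn try_push(&self, v: u64) -> bool {
        let tail = self.tail.load(Ordering::Relaxed);
        let head = self.head.load(Ordering::Acquire);
        if tail - head > self.mask {
            return false; // full: producer is a whole lap ahead
        }
        // Bitmask indexing: `cursor & mask` replaces a modulo.
        self.slots[(tail & self.mask) as usize].store(v, Ordering::Relaxed);
        self.tail.store(tail + 1, Ordering::Release); // publish
        true
    }

    /// Consumer side: take one item, or fail if the ring is empty.
    fn try_pop(&self) -> Option<u64> {
        let head = self.head.load(Ordering::Relaxed);
        let tail = self.tail.load(Ordering::Acquire);
        if head == tail {
            return None; // empty
        }
        let v = self.slots[(head & self.mask) as usize].load(Ordering::Relaxed);
        self.head.store(head + 1, Ordering::Release);
        Some(v)
    }
}
```

Because the cursors run freely and only the index is masked, a full ring (`tail - head == capacity`) and an empty one (`tail == head`) are never ambiguous, so no slot has to be sacrificed to tell them apart.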
### Throughput (1M items, 2 threads)

| Benchmark | SPSC Ring | mpsc (recv) | mpsc (try_recv spin) | Speedup vs best mpsc |
|---|---|---|---|---|
| `u64` / 1M | 1.85 ms | 16.35 ms | 20.04 ms | 8.8x |
| `Order` (48B) / 1M | 5.51 ms | 20.03 ms | 24.10 ms | 3.6x |
- SPSC ring buffer: ~540M `u64` ops/sec, ~182M `Order` ops/sec.
- `std::sync::mpsc` (blocking `recv`): ~61M `u64` ops/sec, ~50M `Order` ops/sec.
- `std::sync::mpsc` (`try_recv` spin): ~50M `u64` ops/sec, ~41M `Order` ops/sec. The spin variant is slower than blocking, likely due to `try_recv` lock overhead under contention.
Note: `std::sync::mpsc` is a multi-producer channel, so it carries extra synchronization overhead compared with our single-producer design. Both benchmarks include thread spawn/join in the measured iteration.
## Phase 5 → Phase 6: Persistence (WAL + Snapshots)

| Metric | Value |
|---|---|
| Recovery | Load latest snapshot → replay WAL from snapshot position |
| Hot-path allocation | Zero (pre-allocated encode buffer in WAL) |
| `unsafe` blocks | 2 (both `MmapMut::map_mut()` with SAFETY comments) |
| New dependencies | memmap2, crc32fast, serde, bincode |
| New dev-dependencies | tempfile |
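A hedged sketch of the recovery flow the table's first row describes; `Book`, `WalEntry`, and both helper functions are illustrative stand-ins, not the repo's API:

```rust
struct Book; // stand-in for the order book state
enum WalEntry { Add, Cancel } // stand-in for logged operations

impl Book {
    fn apply(&mut self, _entry: WalEntry) {
        // re-execute the logged operation against the book
    }
}

/// Deserialize the newest snapshot; it records the WAL offset it was taken at.
fn load_latest_snapshot() -> (Book, u64) {
    (Book, 0)
}

/// Read WAL records starting at `pos`, validating each record's CRC32 so
/// replay stops cleanly at a torn tail write.
fn read_wal_from(pos: u64) -> Vec<WalEntry> {
    let _ = pos;
    Vec::new()
}

fn recover() -> Book {
    let (mut book, wal_pos) = load_latest_snapshot();
    for entry in read_wal_from(wal_pos) {
        book.apply(entry);
    }
    book
}
```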
### Test Suite (Phase 5→6)

| Metric | Phase 5 | Phase 6 |
|---|---|---|
| Test count | 91 | 134 |
| Test time | ~0.13s | ~0.16s |
## Phase 7: Benchmarking Suite + Observability
**What changed:** eliminated the hot-path fills allocation (`AddOrderResult` now returns `&[Fill]` instead of `Vec<Fill>`). Added HdrHistogram latency recording, a tracking allocator, end-to-end criterion benchmarks, and a load generator that proves the kill metric.
### Fills Allocation Fix

`AddOrderResult.fills` changed from `Vec<Fill>` (heap-allocated, moved out via `std::mem::take`) to `&'a [Fill]` (borrowed from a pre-allocated buffer). This was the last remaining hot-path allocation in the matching engine.
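The shape of the change, sketched with simplified types (field names are illustrative):

```rust
struct Fill {
    maker_id: u64,
    qty: u64,
}

/// Fills are borrowed from the engine, not moved out of it.
struct AddOrderResult<'a> {
    fills: &'a [Fill],
}

struct Engine {
    fill_buf: Vec<Fill>, // pre-allocated once, reused every call
}

impl Engine {
    fn add_order(&mut self) -> AddOrderResult<'_> {
        self.fill_buf.clear(); // keeps the capacity, frees nothing
        // ... matching pushes fills into self.fill_buf here ...
        AddOrderResult { fills: &self.fill_buf }
    }
}
```

The borrow means a caller must consume the fills before the next `add_order` call, which matches the single-threaded pop → match → encode loop measured below.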
Throughput: ~4.7M orders/sec (single-threaded matching, ring pop → match → encode).
### End-to-End Criterion Benchmarks

| Benchmark | Time | Per-order |
|---|---|---|
| `e2e/crossing_orders/1k` | 34 µs | 34 ns/order |
| `e2e/crossing_orders/10k` | 405 µs | 40 ns/order |
| `e2e/mixed_100k` | 6.4 ms | 64 ns/op |
| `e2e/avg_crossing_100k` | 5.6 ms | 56 ns/order |
### Allocation Tracking

With `--features alloc-track`, the load generator reports heap allocations during the measured phase. The 6 allocations detected come from `BTreeMap` internal node management (price level creation/removal), the known trade-off documented in Phase 3. The matching engine's own data structures (arena, fills buffer) produce zero allocations.
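For context, a counting global allocator along the lines of what an `alloc-track` feature could enable (a sketch, not the repo's implementation):

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicU64, Ordering};

static ALLOCS: AtomicU64 = AtomicU64::new(0);

struct TrackingAlloc;

// Wrap the system allocator and count every heap allocation.
unsafe impl GlobalAlloc for TrackingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static GLOBAL: TrackingAlloc = TrackingAlloc;

/// Snapshot the counter before and after the measured phase; the
/// difference is the number of hot-path allocations.
fn allocations() -> u64 {
    ALLOCS.load(Ordering::Relaxed)
}
```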