An MBO research framework for serious signal work

Most retail-facing microstructure work stops at top-of-book aggregated data. That’s fine for understanding concepts, but it hides exactly the things that separate a toy signal from one that survives live trading: queue position, cancellation patterns, and who is actually generating flow.

Market-by-order (MBO) data fixes that — at the cost of being substantially harder to work with. This page is the research framework we use when we want to take an idea seriously.

What MBO actually is

Think of MBO as a stream of individual order messages rather than an aggregated snapshot of the book. Typical message types:

Add — a new resting order arrives at a price.
Modify — an order changes size or price.
Cancel — an order is removed.
Execute (fill) — an order or part of it trades.

From this stream you can reconstruct the book at any point in time — but more importantly, you can reason about queue position, order lifetime distributions, and the microstructure of aggression itself.
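To make the message types concrete, here is a minimal sketch of how one such message might be represented. The field names (`seq`, `ts_ns`, `kind`, and so on), the "B"/"S" side codes, and integer tick prices are assumptions for illustration; real venue schemas differ.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class MsgKind(Enum):
    ADD = "add"
    MODIFY = "modify"
    CANCEL = "cancel"
    EXECUTE = "execute"


@dataclass(frozen=True)
class MboMessage:
    seq: int                      # venue sequence number (assumed contiguous)
    ts_ns: int                    # event timestamp in nanoseconds
    kind: MsgKind
    order_id: int
    side: Optional[str] = None    # "B" or "S"; assumed present only on ADD
    price: Optional[int] = None   # integer ticks, to avoid float comparisons
    size: Optional[int] = None    # ADD: initial size, MODIFY: new size, EXECUTE: filled size


# A three-message life of one order: it arrives, partially fills, then is cancelled.
stream = [
    MboMessage(1, 1_000, MsgKind.ADD, order_id=7, side="B", price=10_000, size=5),
    MboMessage(2, 1_500, MsgKind.EXECUTE, order_id=7, size=2),
    MboMessage(3, 2_000, MsgKind.CANCEL, order_id=7),
]
lifetime_ns = stream[-1].ts_ns - stream[0].ts_ns
```

Note that only the add carries side and price here; every later message is resolved through the order ID, which is what makes per-order tracking possible.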

The framework, in phases

Phase 1 — Reconstruction

Build a book reconstructor first. Nothing else matters until this is right.

Non-negotiables:

Deterministic replay: given the same message stream, the book at time t must be byte-identical every run.
Per-order ID tracking: you must be able to answer “where in the queue is this specific order?” at any time.
Gap handling: any gap in sequence numbers or timestamps must fail loudly rather than be silently interpolated.

This is also where you write your first set of invariants: total size at each level equals the sum of its individual orders; every fill references an order that exists; spread is never negative. These invariants should be checked both in tests and in production.
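The requirements above can be sketched as a small reconstructor. This is an illustration under stated assumptions, not a production design: the `Book` class, its method names, and the string message kinds are hypothetical, prices are integer ticks, and the modify rule (a size decrease keeps priority, anything else re-queues at the back) is a common convention that varies by venue.

```python
from collections import OrderedDict


class GapError(RuntimeError):
    """Raised when sequence numbers are not contiguous."""


class Book:
    def __init__(self):
        self.where = {}        # order_id -> (side, price)
        self.queues = {}       # (side, price) -> OrderedDict {order_id: size}, FIFO
        self.level_size = {}   # (side, price) -> cached total, checked by invariants
        self.last_seq = None

    def _enqueue(self, order_id, side, price, size):
        key = (side, price)
        self.where[order_id] = key
        self.queues.setdefault(key, OrderedDict())[order_id] = size
        self.level_size[key] = self.level_size.get(key, 0) + size

    def _reduce(self, order_id, qty):
        key = self.where[order_id]   # KeyError if the message references an unknown order
        queue = self.queues[key]
        if qty > queue[order_id]:
            raise ValueError(f"order {order_id}: {qty} exceeds resting size")
        queue[order_id] -= qty
        self.level_size[key] -= qty
        if queue[order_id] == 0:
            del queue[order_id], self.where[order_id]
            if not queue:
                del self.queues[key], self.level_size[key]

    def apply(self, seq, kind, order_id, side=None, price=None, size=None):
        # Gap handling: refuse to continue rather than interpolate.
        if self.last_seq is not None and seq != self.last_seq + 1:
            raise GapError(f"expected seq {self.last_seq + 1}, got {seq}")
        self.last_seq = seq
        if kind == "add":
            if order_id in self.where:
                raise ValueError(f"duplicate add for order {order_id}")
            self._enqueue(order_id, side, price, size)
        elif kind == "execute":
            self._reduce(order_id, size)
        elif kind == "cancel":
            self._reduce(order_id, self.queues[self.where[order_id]][order_id])
        elif kind == "modify":
            o_side, o_price = self.where[order_id]
            old = self.queues[(o_side, o_price)][order_id]
            new_price = o_price if price is None else price
            if new_price == o_price and size <= old:
                self._reduce(order_id, old - size)   # assumed: keeps queue priority
            else:
                self._reduce(order_id, old)          # assumed: re-queues at the back
                self._enqueue(order_id, o_side, new_price, size)
        else:
            raise ValueError(f"unknown message kind {kind!r}")

    def size_ahead(self, order_id):
        """Total size queued in front of order_id at its price level."""
        ahead = 0
        for oid, size in self.queues[self.where[order_id]].items():
            if oid == order_id:
                return ahead
            ahead += size

    def check_invariants(self):
        for key, queue in self.queues.items():
            assert self.level_size[key] == sum(queue.values()), f"size mismatch at {key}"
            assert all(self.where[oid] == key for oid in queue), f"stray order at {key}"
        bids = [p for s, p in self.queues if s == "B"]
        asks = [p for s, p in self.queues if s == "S"]
        assert not (bids and asks) or max(bids) <= min(asks), "negative spread"


book = Book()
book.apply(1, "add", 101, side="B", price=100, size=5)
book.apply(2, "add", 102, side="B", price=100, size=3)
book.apply(3, "add", 201, side="S", price=101, size=4)
book.apply(4, "execute", 101, size=2)
queue_before = book.size_ahead(102)      # order 101 still has 3 ahead of 102
book.apply(5, "modify", 101, size=10)    # size increase: 101 goes to the back
queue_after = book.size_ahead(102)
book.check_invariants()
try:
    book.apply(7, "cancel", 102)         # seq 6 is missing
except GapError as exc:
    gap_message = str(exc)
```

Keeping a cached `level_size` alongside the per-order queues is deliberate: it gives the level-total invariant something real to check instead of comparing a sum against itself.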

Phase 2 — Feature design

Once you can reconstruct, you can compute features that are impossible from aggregated data:

True queue position: not the assumption that you join at the back, but the real queue as it stood at the instant of your arrival.
Order-age distributions — how long before a given level’s orders get filled or cancelled, conditional on the book state.
Aggressor identity patterns — order-size clustering, repeat IDs across a session, burstiness.
Multi-level net flow — depth-weighted OFI that knows which orders moved and which didn’t.
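As one example of a feature that needs per-order data, order lifetimes can be sketched as below. The tuple layout `(ts_ns, kind, order_id, size)` and the function name are assumptions for illustration, and the sketch leaves out the conditioning on book state that a real study would add.

```python
def order_lifetimes(messages):
    """Return (order_id, age_ns, outcome) for every order that left the book.

    messages: iterable of (ts_ns, kind, order_id, size) tuples in stream order.
    """
    live = {}      # order_id -> [arrival timestamp, remaining size]
    finished = []
    for ts, kind, oid, size in messages:
        if kind == "add":
            live[oid] = [ts, size]
        elif kind == "cancel":
            t0, _ = live.pop(oid)
            finished.append((oid, ts - t0, "cancelled"))
        elif kind == "execute":
            live[oid][1] -= size
            if live[oid][1] == 0:          # only a complete fill ends the order's life
                t0, _ = live.pop(oid)
                finished.append((oid, ts - t0, "filled"))
    return finished


msgs = [
    (0, "add", 1, 5),
    (10, "add", 2, 3),
    (40, "execute", 1, 5),
    (70, "cancel", 2, None),
]
ages = order_lifetimes(msgs)
```

From here, the distribution of `age_ns` split by outcome is a few lines of grouping, and the same loop can record the book state at arrival if you want the conditional version.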

A good feature is one that (a) has a clear causal story, (b) is not already priced in at your execution latency, and (c) is measurable with the data you actually have in production.

Phase 3 — Honest evaluation

This is where most research fails. A short checklist:

Time-respecting splits. Shuffle-splits leak information. Use walk-forward.
Realistic fill model. At minimum: queue-aware passive fills, adverse-selection penalty on fills, explicit market-order slippage, fees & rebates.
Cost sensitivity. Sweep your fee and spread assumptions. If the signal only works at zero cost, it doesn’t work.
Regime partitions. Report results by volatility bucket, session, and market state. A single blended Sharpe averages over things that shouldn’t be averaged.
Capacity estimate. At what size does expected impact cancel expected edge? A number, not a shrug.
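The first item on the checklist, walk-forward splitting, can be sketched in a few lines. The function name and parameters are hypothetical, and the `embargo` gap between train and test is an addition of ours to limit leakage from overlapping labels, not something the checklist itself requires.

```python
def walk_forward_splits(n_samples, train_size, test_size, embargo=0):
    """Yield (train, test) index ranges over time-ordered samples.

    Each test window lies strictly after its training window, separated by
    `embargo` samples, and the window rolls forward by one test block.
    """
    start = 0
    while start + train_size + embargo + test_size <= n_samples:
        train = range(start, start + train_size)
        test_start = start + train_size + embargo
        test = range(test_start, test_start + test_size)
        yield train, test
        start += test_size


splits = list(walk_forward_splits(n_samples=10, train_size=4, test_size=2, embargo=1))
```

Because the samples are indexed in time order, every training index is smaller than every test index in the same split, which is the property a shuffle-split breaks.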

Phase 4 — Execution modeling

Signal and execution are a product, not a sum. The framework:

Choose a target horizon matched to the signal’s predictive window.
Define a participation cap (e.g., maximum fraction of visible depth you’ll consume).
Model child order placement (passive, peg, aggressive) with an explicit state machine.
Model cancellation logic when the signal decays.
Measure implementation shortfall against the signal’s arrival price, not against midpoint evolution.

If you can’t describe your execution state machine in one page of plain English, it’s too complicated.
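A state machine of that size might look like the following sketch. The states, the thresholds, and the helper names are all hypothetical, and the peg state is omitted to keep the example short; the point is only that the whole transition rule fits on one screen.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()         # no child orders resting
    PASSIVE = auto()      # resting at the near touch
    AGGRESSIVE = auto()   # crossing the spread
    DONE = auto()


def next_state(state, signal, remaining, elapsed, horizon, threshold=0.5, urgency=0.8):
    """Pick the next execution state from the current signal and schedule."""
    if state is State.DONE or remaining <= 0 or elapsed >= horizon:
        return State.DONE
    if abs(signal) < threshold:
        return State.IDLE            # signal decayed: cancel resting child orders
    if elapsed >= urgency * horizon:
        return State.AGGRESSIVE      # running out of horizon: pay the spread
    return State.PASSIVE


def child_size(remaining, visible_depth, cap):
    """Participation cap: never take more than `cap` of visible depth."""
    return min(remaining, int(cap * visible_depth))


early = next_state(State.IDLE, signal=0.9, remaining=100, elapsed=1.0, horizon=10.0)
late = next_state(State.PASSIVE, signal=0.9, remaining=100, elapsed=9.0, horizon=10.0)
decayed = next_state(State.PASSIVE, signal=0.1, remaining=100, elapsed=1.0, horizon=10.0)
size = child_size(remaining=100, visible_depth=200, cap=0.25)
```

Writing the transition as a pure function of its inputs makes it easy to test every branch and to replay it against recorded data.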

Phase 5 — Paper-to-live reconciliation

Even with a careful backtest, paper ≠ live. Bridge the gap explicitly:

Run the strategy in shadow mode against live data with simulated fills, then compare those simulated fills against actual top-of-book activity and investigate any deviations.
Track fill rate, adverse selection at fill, and participation rate as first-class metrics.
Expect the first live P&L to be worse than paper. If it isn’t, you are probably failing to measure something.
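The three metrics can be computed the same way for paper and live runs so the gap between them is directly comparable. This is a minimal sketch: the function names, the `(side, fill_price, mid_later)` tuple layout, and the use of a later midpoint as the markout reference are assumptions, and all numbers are made up for illustration.

```python
def fill_rate(n_sent, n_filled):
    return n_filled / n_sent if n_sent else 0.0


def participation_rate(our_volume, market_volume):
    return our_volume / market_volume if market_volume else 0.0


def mean_markout(fills):
    """Average signed markout; negative values indicate adverse selection.

    fills: list of (side, fill_price, mid_later), with side +1 for buys, -1 for sells.
    """
    if not fills:
        return 0.0
    return sum(side * (mid_later - px) for side, px, mid_later in fills) / len(fills)


# Made-up numbers: the paper run fills more often and suffers less adverse selection.
paper = {"fill_rate": fill_rate(200, 120),
         "markout": mean_markout([(1, 100, 101), (-1, 102, 101)])}
live = {"fill_rate": fill_rate(200, 90),
        "markout": mean_markout([(1, 100, 99), (-1, 102, 103), (1, 100, 101), (1, 100, 99)])}
gap = {name: live[name] - paper[name] for name in paper}
```

A negative entry in `gap` means live is worse than paper on that metric, which is the direction you should expect.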

What to avoid

Optimizing on microstructure features without a queue model. You’re fitting to a world that doesn’t exist.
Snapshot-based features on MBO data. If you’re throwing away the message stream, you’re wasting the data.
Strategies with no closed-form capacity estimate. “It looks great at size X” means nothing without a theory of why it stops working at size 10X.
Any backtest that ignores cancellations. Cancellation patterns are signal.
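To illustrate what a closed-form capacity estimate can look like, the sketch below assumes a square-root impact model, in which average impact in basis points grows as daily volatility times the square root of size over average daily volume. That model, the coefficient, and the function names are assumptions chosen for the example; substitute whatever impact model you can actually defend.

```python
import math


def impact_bps(size, adv, daily_vol_bps, coef=1.0):
    """Assumed square-root impact: coef * volatility * sqrt(size / ADV)."""
    return coef * daily_vol_bps * math.sqrt(size / adv)


def breakeven_size(edge_bps, adv, daily_vol_bps, coef=1.0):
    """Size at which the assumed impact equals the expected edge.

    Solving edge = coef * vol * sqrt(Q / ADV) for Q gives
    Q = ADV * (edge / (coef * vol)) ** 2.
    """
    return adv * (edge_bps / (coef * daily_vol_bps)) ** 2


# Illustrative inputs: 2 bps of edge, 100 bps daily volatility, ADV of 1,000,000 shares.
cap = breakeven_size(edge_bps=2.0, adv=1_000_000, daily_vol_bps=100.0)
check = impact_bps(cap, adv=1_000_000, daily_vol_bps=100.0)
```

Under these inputs the breakeven is roughly 400 shares, and the quadratic dependence on edge is the useful part: halving the edge cuts the estimated capacity by a factor of four.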

Tools

You don’t strictly need a commercial vendor — public MBO-style data exists for crypto venues and some equity exchanges. But whatever you use:

Store raw messages, not reconstructions. Reconstructions are derivable.
Version your feature definitions in code. Notebooks hide bugs.
Keep at least one regression test per invariant, run on every change.
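One inexpensive regression test is to replay the stored raw messages and compare a digest of the resulting state across runs. The sketch below uses a toy `replay` function as a stand-in for a real reconstructor and a hypothetical `(kind, order_id, size)` tuple layout; only `hashlib` and `json` from the standard library are real dependencies.

```python
import hashlib
import json


def replay(messages):
    """Toy stand-in for a reconstructor: returns {order_id: remaining size}."""
    book = {}
    for kind, oid, size in messages:
        if kind == "add":
            book[oid] = size
        elif kind == "cancel":
            del book[oid]
        elif kind == "execute":
            book[oid] -= size
            if book[oid] == 0:
                del book[oid]
    return book


def state_digest(state):
    """Hash a canonical serialization so equal states give equal bytes."""
    payload = json.dumps(state, sort_keys=True, separators=(",", ":")).encode()
    return hashlib.sha256(payload).hexdigest()


raw = [("add", 1, 5), ("add", 2, 3), ("execute", 1, 2), ("cancel", 2, None)]
first = state_digest(replay(raw))
second = state_digest(replay(raw))
changed = state_digest(replay(raw[:3]))   # dropping a message must change the digest
```

Pinning `first` in a test file turns any unintended change to reconstruction logic into a failing test rather than a quiet shift in backtest results.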

Closing thought

MBO data is a force multiplier for serious research, but it’s also a force multiplier for self-deception. The discipline isn’t in the dataset — it’s in the framework you apply to it. If you’re disciplined about reconstruction, feature design, evaluation, execution, and reconciliation, MBO lets you see the market more clearly than aggregated data ever will.

If you’re not, it just lets you overfit faster.