Why your backtest lies to you (and how to make it stop)

A good backtest is an argument, not a result. Somewhere on a shared drive right now there’s a strategy showing a Sharpe of 4.2 that will go sideways the day it goes live. Not because the author lied — but because a backtest can be both technically correct and fundamentally dishonest at the same time.

Here’s the checklist I keep taped next to my monitor.

1. Look-ahead bias (the obvious one that keeps happening)

You computed a feature at time t using data from time t + Δ. Classic examples:

Using a rolling mean that includes the current bar without the shift(1).
Normalizing a whole series with StandardScaler().fit_transform(X), which uses full-sample statistics and therefore observations from the future.
Joining a close-of-day signal to same-day returns that already include the close.
Any merge on timestamps without explicit “as-of” semantics.
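For the last item, here is a minimal sketch of what explicit as-of semantics can look like with pandas.merge_asof. The column names (ts, signal, ret) and the toy timestamps are my own assumptions for illustration.

```python
import pandas as pd

# Hypothetical close-of-day signals, stamped at the 16:00 close.
signals = pd.DataFrame({
    "ts": pd.to_datetime(["2024-01-02 16:00", "2024-01-03 16:00"]),
    "signal": [1.0, -1.0],
})
# Hypothetical return bars, stamped at the end of each bar.
bars = pd.DataFrame({
    "ts": pd.to_datetime([
        "2024-01-02 15:59", "2024-01-02 16:00",
        "2024-01-03 09:31", "2024-01-03 16:00",
    ]),
    "ret": [0.001, 0.002, -0.001, 0.003],
})

# For each bar, attach the most recent signal stamped strictly BEFORE the bar.
# allow_exact_matches=False stops a 16:00 signal from pairing with the 16:00 bar,
# whose return already contains the close the signal was computed from.
joined = pd.merge_asof(
    bars, signals, on="ts", direction="backward", allow_exact_matches=False
)
print(joined)
```

Both frames have to be sorted by the key. The first two bars get no signal at all, which is the honest answer: nothing was known yet.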

Catch: write a brutal test that asserts every feature used at time t depends only on data with timestamps < t. Put it in CI. Let it fail loudly.
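One way to build that test, sketched under assumptions: scramble every observation at or after a position and check that the feature value at that position does not move. The function names (leaky_feature, safe_feature, has_lookahead) are hypothetical, and the random-probe approach is a choice of mine, not the only way to do it.

```python
import numpy as np
import pandas as pd

def leaky_feature(prices: pd.Series, window: int = 5) -> pd.Series:
    # Rolling mean that includes the current bar: the value at t uses the price at t.
    return prices.rolling(window).mean()

def safe_feature(prices: pd.Series, window: int = 5) -> pd.Series:
    # Same rolling mean, shifted one bar so the value at t uses only bars before t.
    return prices.rolling(window).mean().shift(1)

def has_lookahead(feature_fn, prices: pd.Series, n_checks: int = 20, seed: int = 0) -> bool:
    """Return True if changing data at or after position i changes the feature at i."""
    rng = np.random.default_rng(seed)
    baseline = feature_fn(prices)
    for i in rng.integers(1, len(prices), size=n_checks):
        perturbed = prices.copy()
        # Scramble the present and the future; the past stays untouched.
        noise = rng.normal(0, 10, size=len(prices) - i)
        perturbed.iloc[i:] = perturbed.iloc[i:] + noise
        a, b = baseline.iloc[i], feature_fn(perturbed).iloc[i]
        both_nan = np.isnan(a) and np.isnan(b)
        if not (both_nan or np.isclose(a, b)):
            return True
    return False

# Synthetic random-walk prices, only there to exercise the check.
prices = pd.Series(100 + np.random.default_rng(1).normal(0, 1, 200).cumsum())
print("leaky:", has_lookahead(leaky_feature, prices))
print("safe: ", has_lookahead(safe_feature, prices))
```

In CI this becomes one assertion per feature function. It only probes a sample of positions, so treat a pass as evidence, not proof.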

2. Survivorship bias

You’re running a strategy on the current S&P 500 constituents back to 2010. Except a large share of those weren’t in the index in 2010. The names that were in it then and aren’t anymore (the bankruptcies, the delistings, the acquired) are silently excluded.

Same problem in crypto if you backtest on today’s top 50 tokens: the ones that collapsed along the way never make the list. Survivorship bias overstates long-biased strategies dramatically. The only fix is to use a point-in-time universe.
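A point-in-time universe can be as simple as a membership table with start and end dates that you query for each rebalance date. The sketch below assumes such a table exists; the tickers (AAA, BBB, CCC), the dates, and the universe_as_of helper are all made up for illustration.

```python
import pandas as pd

# Hypothetical membership table: one row per stretch a ticker spent in the index.
# An open-ended membership has end = NaT.
membership = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC"],
    "start": pd.to_datetime(["2005-01-03", "2008-06-02", "2015-03-16"]),
    "end": pd.to_datetime(["2012-09-28", None, None]),
})

def universe_as_of(membership: pd.DataFrame, date) -> list:
    """Tickers that were index members on `date`."""
    date = pd.Timestamp(date)
    started = membership["start"] <= date
    not_ended = membership["end"].isna() | (membership["end"] > date)
    return sorted(membership.loc[started & not_ended, "ticker"])

print(universe_as_of(membership, "2010-01-04"))  # includes AAA, which later dropped out
print(universe_as_of(membership, "2020-01-02"))  # AAA is gone, CCC has joined
```

The backtest then calls universe_as_of with each rebalance date, never with today’s date.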

3. Slippage that doesn’t cross the spread

If your backtest fills market orders at the mid, or passive orders at the best price with no queue model, it’s wrong in the direction that flatters your P&L. Every time.

Minimum realistic model:

Market orders cross the spread and walk the book proportional to size.
Passive orders have a fill probability that depends on queue position and rate of aggression on that side.
Apply adverse selection to passive fills: if you got filled, the book tends to have moved against you.

If your strategy only works at zero slippage, you don’t have a strategy.
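As a rough sketch of the first item in that minimal model, here is a market buy that crosses the spread and consumes ask levels in order. The Level class, the book snapshot, and the order size are invented for illustration. It ignores queue dynamics, hidden liquidity, and book replenishment.

```python
from dataclasses import dataclass

@dataclass
class Level:
    price: float
    size: float

def market_buy_fill(asks: list, qty: float) -> float:
    """Average fill price for a market buy that walks the ask side in price order."""
    remaining, cost = qty, 0.0
    for level in asks:
        take = min(remaining, level.size)
        cost += take * level.price
        remaining -= take
        if remaining <= 0:
            break
    if remaining > 0:
        raise ValueError("order is larger than the displayed depth")
    return cost / qty

# Hypothetical snapshot: best bid 100.00, three ask levels.
bid = 100.00
asks = [Level(100.02, 300), Level(100.03, 500), Level(100.05, 1000)]
mid = (bid + asks[0].price) / 2

avg = market_buy_fill(asks, 600)
slippage_bps = (avg - mid) / mid * 1e4
print(f"avg fill {avg:.4f}, slippage vs mid {slippage_bps:.2f} bps")
```

A mid-price fill would have reported zero slippage for the same order, which is exactly the flattering error described above.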

4. Fees and rebates on the wrong side of the ledger

Small-looking fees add up. A 2 bp round-trip cost on a strategy that trades 200 times a day is 400 bps per day, measured against the notional of a single trade. Not a typo. Many positive-Sharpe strategies in research flip to negative Sharpe once real fees are applied. A few flip the other way with rebates, which is also a trap, because your fill rate at maker venues is often worse than you modeled.

Catch: sweep fee assumptions across a realistic range. Report Sharpe at the low, mid, and high end. If the strategy only works at the low, it’s not robust.
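A fee sweep can be a few lines. This sketch uses synthetic gross returns and assumes each trade is 1% of capital, so 200 round trips a day at 2 bp costs 4 bps of capital per day. The function names, the fee grid, and every number here are assumptions for illustration.

```python
import numpy as np

def annualized_sharpe(daily_returns, periods=252):
    return float(np.mean(daily_returns) / np.std(daily_returns, ddof=1) * np.sqrt(periods))

def sharpe_by_fee(gross_daily, round_trips_per_day, notional_fraction, fee_grid_bps):
    """Net Sharpe for each round-trip fee assumption (bps of traded notional)."""
    out = {}
    for fee_bps in fee_grid_bps:
        # Daily cost as a fraction of capital:
        # number of round trips x size of each trade x fee per round trip.
        daily_cost = round_trips_per_day * notional_fraction * fee_bps * 1e-4
        out[fee_bps] = annualized_sharpe(gross_daily - daily_cost)
    return out

rng = np.random.default_rng(42)
# Synthetic gross daily returns as a fraction of capital, before any costs.
gross = rng.normal(loc=0.0010, scale=0.01, size=252 * 4)

sweep = sharpe_by_fee(
    gross, round_trips_per_day=200, notional_fraction=0.01, fee_grid_bps=(1.0, 2.0, 5.0)
)
for fee_bps, s in sweep.items():
    print(f"round-trip fee {fee_bps:.1f} bp -> net Sharpe {s:.2f}")
```

The fee enters as a constant daily drag here. A real ledger would apply it per fill, with rebates as negative fees only on the fills you actually got.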

5. Regime averaging

A single Sharpe is a claim about an average over the test period. Your live P&L doesn’t see the average — it sees the current regime. Always report:

Sharpe by volatility bucket.
Sharpe by session / time of day.
Sharpe by market state (trending vs. mean-reverting, rising vs. falling volume).

A strategy that’s Sharpe 3 in one regime and -1 in another is not a Sharpe 1 strategy. It’s a regime-conditional strategy with a detection problem.
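Here is a small sketch of the first breakdown, Sharpe by volatility bucket. The data is synthetic, and the bucket definition (terciles of trailing 20-day realized volatility) is my assumption, not a standard. The buckets use full-sample quantiles, which is fine for a report but would be look-ahead if used to trade.

```python
import numpy as np
import pandas as pd

def sharpe(r: pd.Series, periods: int = 252) -> float:
    return float(r.mean() / r.std(ddof=1) * np.sqrt(periods))

def sharpe_by_vol_bucket(strategy_ret, market_ret, window=20, n_buckets=3):
    # Trailing realized vol, shifted one bar so the label at t is known before t.
    vol = market_ret.rolling(window).std().shift(1)
    labels = [f"vol_q{i + 1}" for i in range(n_buckets)]
    # Quantile buckets over the whole sample: reporting only, not a tradable signal.
    buckets = pd.qcut(vol, n_buckets, labels=labels)
    # Rows without a bucket (the warm-up period) are dropped by groupby.
    return strategy_ret.groupby(buckets, observed=True).apply(sharpe)

rng = np.random.default_rng(3)
n = 1500
# Synthetic market: a calm first half and a turbulent second half.
market = pd.Series(
    np.concatenate([rng.normal(0, 0.005, n // 2), rng.normal(0, 0.02, n - n // 2)])
)
strategy = pd.Series(rng.normal(0.0005, 0.01, n))

report = sharpe_by_vol_bucket(strategy, market)
print(report)
```

The same groupby pattern works for session or market-state labels: build the label series from information available before each bar, then group the strategy returns by it.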

6. Hyperparameter leakage

You picked the lookback window by scanning values and choosing the best. You picked the threshold the same way. You tuned the cost model. Every one of these is a degree of freedom that eats out-of-sample performance.

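Catch: count your degrees of freedom. Report the information ratio after penalizing for them, for example with a haircut that grows with the number of configurations you tried. Or better: commit your hyperparameters before touching the holdout, and report both the in-sample and the holdout numbers.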
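A sketch of the commit-then-report workflow, on pure noise so that any in-sample edge is selection by construction. The toy momentum rule, the candidate grid, and the train/holdout split point are my assumptions for illustration.

```python
import numpy as np
import pandas as pd

def strategy_returns(returns: pd.Series, lookback: int) -> pd.Series:
    # Toy rule: follow the sign of the trailing mean return.
    # shift(1) makes the position at t depend only on data before t.
    position = np.sign(returns.rolling(lookback).mean()).shift(1)
    return (position * returns).dropna()

def sharpe(r, periods=252):
    return float(r.mean() / r.std(ddof=1) * np.sqrt(periods))

rng = np.random.default_rng(7)
returns = pd.Series(rng.normal(0, 0.01, 2000))  # pure noise: no real edge exists
train, holdout = returns.iloc[:1500], returns.iloc[1500:]

# Scan lookbacks on the training span only.
candidates = list(range(5, 105, 5))
in_sample = {lb: sharpe(strategy_returns(train, lb)) for lb in candidates}

# Commit the choice here, before the holdout is ever evaluated.
best_lb = max(in_sample, key=in_sample.get)

# One holdout evaluation, reported next to the in-sample number and the scan size.
report = {
    "lookback": best_lb,
    "n_candidates_tried": len(candidates),
    "in_sample_sharpe": in_sample[best_lb],
    "holdout_sharpe": sharpe(strategy_returns(holdout, best_lb)),
}
print(report)
```

The n_candidates_tried field is the degree-of-freedom count. It belongs in the report even when the holdout number looks fine.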

7. Implementation shortfall isn’t the same as backtest return

Even with honest slippage, there’s the gap between the price at signal arrival and the price at execution completion. For a short-horizon signal, that gap can be most of the expected return.

Report two numbers:

Return assuming instant execution at signal time.
Return including actual child-order execution.

The second is the number that matters. The gap between the two tells you how much execution is costing you, and breaking that gap down shows where you’re losing the most: slippage, signal decay, or participation caps.
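A worked example of the two numbers for one parent order. The prices, fills, and function names are invented. The sketch assumes the order fills completely and leaves out fees.

```python
def vwap(fills):
    """Volume-weighted average price of (price, quantity) child fills."""
    total_qty = sum(q for _, q in fills)
    return sum(p * q for p, q in fills) / total_qty

def execution_report_bps(side, arrival_price, fills, exit_price):
    """side is +1 for a buy and -1 for a sell. All outputs are bps of arrival price."""
    avg_fill = vwap(fills)
    # Number 1: return if the whole order had filled instantly at the arrival price.
    paper = side * (exit_price - arrival_price) / arrival_price * 1e4
    # Number 2: return at the prices the child orders actually got.
    realized = side * (exit_price - avg_fill) / arrival_price * 1e4
    return {"paper_bps": paper, "realized_bps": realized, "shortfall_bps": paper - realized}

# Hypothetical buy: signal arrives at 100.00, two child fills, exit at 100.20.
report = execution_report_bps(
    side=+1,
    arrival_price=100.00,
    fills=[(100.02, 400), (100.05, 600)],
    exit_price=100.20,
)
print(report)
```

In this toy case the signal was worth 20 bps on paper and 16.2 bps after execution, so about a fifth of the edge went to the fills alone.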

8. Survivor strategies

This is the subtle one. You ran 200 strategies. You kept the top 20. You ran a “robustness check” on those 20. This is not robustness. It’s double-selection. The top 20 out of 200 will look good in-sample even if all 200 are pure noise, and re-testing them on the same data mostly confirms the luck that selected them.

The honest approach is preregistration: define the hypothesis, the test, and the success criteria before you look at the data. Everything else is noise-surfing.
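To see how strong the selection effect is, here is a toy simulation: 200 strategies with zero true edge, ranked by in-sample Sharpe, then checked on fresh data. The two-year sample length and the daily volatility are arbitrary assumptions. It is not a model of any real strategy set.

```python
import numpy as np

def sharpe(returns, periods=252):
    # Column-wise annualized Sharpe for a (days x strategies) array.
    return returns.mean(axis=0) / returns.std(axis=0, ddof=1) * np.sqrt(periods)

rng = np.random.default_rng(0)
n_strategies, n_days = 200, 504

# Every "strategy" is pure noise: the true Sharpe of each one is zero.
in_sample = rng.normal(0, 0.01, size=(n_days, n_strategies))
fresh = rng.normal(0, 0.01, size=(n_days, n_strategies))

is_sharpe = sharpe(in_sample)
oos_sharpe = sharpe(fresh)

# Keep the 20 best by in-sample Sharpe, as in the workflow described above.
top20 = np.argsort(is_sharpe)[-20:]

print(f"all 200, mean in-sample Sharpe: {is_sharpe.mean():.2f}")
print(f"top 20,  mean in-sample Sharpe: {is_sharpe[top20].mean():.2f}")
print(f"top 20,  mean Sharpe on fresh data: {oos_sharpe[top20].mean():.2f}")
```

Any check run on the in_sample array will keep flattering the top 20, because that array is what picked them. Only the fresh data is an independent test.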

The checklist

If you want something short:

Feature at t uses only data < t.
Universe is point-in-time.
Slippage model crosses the spread and walks the book.
Passive fills have a queue-aware probability.
Fees, rebates, and funding are in the ledger.
Sharpe is broken down by regime.
Hyperparameters committed before holdout.
Implementation shortfall reported separately.
No double-selection across strategies.

The backtest that passes this is still only an argument. But it’s one you can defend.

Final thought

The most successful quants I know treat backtests with deep suspicion. They assume every number is wrong and spend most of their time figuring out how it’s wrong. That instinct is learned — usually the hard way. The goal of a checklist like this is to make sure you don’t have to learn every lesson in production.