Wang Lin Mikhelson 2020

Matthew Wang, Yi-Hong Lin and Ilya Mikhelson’s “Regime-Switching Factor Investing with Hidden Markov Models” appeared in the Journal of Risk and Financial Management (MDPI, vol. 13(12), article 311, 2020). It uses a three-state Gaussian HMM, trained on S&P 500 ETF daily returns and volatility, to label the US market as bull, bear or sideways. It then rotates among six style-factor portfolios (Fama-French three-factor, a modified Fama-French, Carhart four-factor, a Value model, an AQR-style quality/value/momentum model, and the S&P 500 ETF itself), picking whichever factor model historically performed best in the detected regime. It is the vault’s example of Hidden Markov Model Regime Detection used as a Style Factor Rotation overlay rather than as a direct entry/exit signal.
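The rotation step described above can be sketched in a few lines of Python. This is an illustration only, not the authors' code: the function names, the toy numbers and the choice of mean return as the "performed best" criterion are all assumptions, and the regime labels are taken as given rather than produced by an HMM.

```python
from collections import defaultdict


def best_model_per_regime(regimes, model_returns):
    """Map each regime label to the model with the highest mean return in it.

    regimes: list of regime labels, one per period (hypothetical input).
    model_returns: dict of model name -> list of per-period returns.
    Mean return is an assumed ranking criterion, not the paper's stated one.
    """
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for t, regime in enumerate(regimes):
        counts[regime] += 1
        for name, rets in model_returns.items():
            sums[regime][name] += rets[t]
    return {
        regime: max(sums[regime], key=lambda name: sums[regime][name] / counts[regime])
        for regime in counts
    }


def rotate(detected_regimes, regime_to_model):
    """Return the model held in each period, given the detected regime."""
    return [regime_to_model[r] for r in detected_regimes]


# Toy in-sample history (made-up numbers, not taken from the paper).
history_regimes = ["bull", "bull", "bear", "bear", "sideways", "sideways"]
history_returns = {
    "value_2x":   [0.020, 0.030, -0.040, -0.050, 0.001, 0.002],
    "ff_neutral": [0.005, 0.004, 0.010, 0.008, 0.000, 0.001],
    "spy":        [0.010, 0.012, -0.020, -0.030, 0.003, 0.004],
}

mapping = best_model_per_regime(history_regimes, history_returns)
holdings = rotate(["bull", "bear", "sideways"], mapping)
print(mapping)
print(holdings)
```

The sketch makes the overlay structure visible: the regime detector never trades directly, it only selects which pre-built portfolio to hold.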

The headline numbers are dramatic. Over the out-of-sample window (September 2017-April 2020) the HMM rotation reports a cumulative return of 2.4491 — roughly 245% — against 53% for the best single factor (the AQR-style model) and essentially flat for the S&P 500. The reported out-of-sample Sharpe ratio is 2.017 versus -0.174 for the index, the information ratio is 1.64, and maximum drawdown is 12.8% versus 34.1% for the S&P 500. A factor regression of HMM returns on MKT/SMB/HML/MOM/QMJ leaves a statistically significant alpha (coefficient 0.17, t = 3.31), which the authors describe as “approximately 2% annually.”
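For orientation, the headline statistics can be reproduced from a daily return series with standard formulas. The sketch below is a generic implementation, not the paper's: the 252-day annualisation factor, the zero risk-free rate and the 2.6-year window length (September 2017 to April 2020) are assumptions.

```python
import math
import statistics


def annualized_return(cumulative_return, years):
    """Convert a cumulative return (2.4491 means +244.91%) to a yearly rate."""
    return (1.0 + cumulative_return) ** (1.0 / years) - 1.0


def sharpe_ratio(daily_returns, daily_risk_free=0.0, periods_per_year=252):
    """Annualised Sharpe ratio; 252 periods and zero risk-free are assumptions."""
    excess = [r - daily_risk_free for r in daily_returns]
    return statistics.mean(excess) / statistics.stdev(excess) * math.sqrt(periods_per_year)


def max_drawdown(daily_returns):
    """Largest peak-to-trough loss of the compounded equity curve, as a fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in daily_returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst


# Reported cumulative return over an assumed window of about 2.6 years.
ann = annualized_return(2.4491, years=2.6)
print(f"annualised return: {ann:.1%}")
```

Under that window assumption the reported cumulative figure corresponds to an annualised return on the order of 60% a year, which gives a sense of how extreme the claim is.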

Those numbers do not survive scrutiny as a profitability claim, and the reasons are structural rather than incidental. First, no transaction costs, slippage, holding costs or turnover are charged anywhere in the paper, which is striking for a strategy that retrains its HMM every day on a sliding 2707-day window and may switch its entire factor portfolio on a daily confidence rule. Second, the constituent factor models are leveraged: the paper describes leveraged long/short Fama-French and AQR implementations and a twice-leveraged long-only Value model, so a 2.0 Sharpe partly reflects gearing, not pure timing skill. Third, the out-of-sample window is only about 2.5 years, and the equity-curve discussion makes clear the outperformance is concentrated in exactly two episodes, December 2018 and March 2020, when the HMM switched out of the leveraged Value model into a market-neutral Fama-French model and avoided the drawdowns. With two regime calls effectively producing the result, the effective sample size for the profitability claim is tiny. Fourth, the factor models were tuned in-sample (Jan 2007-Sep 2017) on the very regimes the HMM then exploits, and the authors themselves admit they “didn’t dive as deeply into model parameter optimization in order to prevent overfitting.” Fifth, the paper’s own Treynor-Mazuy test for market-timing skill is not statistically significant for any model, including the HMM, so the timing-skill claim rests only on the in-sample-tuned abnormal-return regression.
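The missing-cost objection is easy to make concrete. The sketch below charges a flat cost each time the held portfolio changes; the 10 bp per-switch cost, the function names and the toy series are hypothetical, and a full switch between leveraged portfolios would plausibly cost more than this.

```python
def count_switches(holdings):
    """Number of periods in which the held model differs from the previous one."""
    return sum(a != b for a, b in zip(holdings, holdings[1:]))


def net_returns(gross_returns, holdings, cost_per_switch=0.001):
    """Subtract a flat cost (assumed 10 bp) in every period where the holding changes."""
    net = []
    previous = holdings[0]
    for r, held in zip(gross_returns, holdings):
        cost = cost_per_switch if held != previous else 0.0
        net.append(r - cost)
        previous = held
    return net


def cumulative(returns):
    """Compound a list of per-period returns into one cumulative return."""
    equity = 1.0
    for r in returns:
        equity *= 1.0 + r
    return equity - 1.0


# Toy example (made-up numbers): frequent switching halves a small daily edge.
gross = [0.002] * 6
held = ["value", "ff", "value", "ff", "ff", "value"]
net = net_returns(gross, held)
print(count_switches(held), cumulative(gross), cumulative(net))
```

Because the rule can fire daily, even a small per-switch charge compounds against the strategy, which is why a gross-only backtest says little about realised returns.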

This places Wang/Lin/Mikhelson in sharp contrast with Bulla et al. 2010, which tests a comparable HMM-timing idea but charges explicit costs, uses a four-decade sample across five markets, applies a median filter to control turnover, and consequently reports honest, modest after-cost excess returns. The gap between Bulla’s 18.5-201.6bp annual excess return and Wang/Lin/Mikhelson’s 245% cumulative gain is itself diagnostic of how much cost-free, leveraged, short-window backtesting can inflate apparent performance. The result is also directly vulnerable to the Dacco and Satchell 1999 critique: the strategy’s edge is concentrated in two real-time regime calls, and the paper does no robustness testing of how sensitive that edge is to misclassification timing. It is best read as a methodological demonstration that an HMM can rotate factor exposures coherently — which the authors honestly frame as the contribution — not as evidence of a tradeable edge.
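The turnover control credited to Bulla et al. can be illustrated with a median filter over the decoded state sequence. This is a minimal sketch, not their specification: the trailing (causal) window, the window length of 3 and the integer state coding are assumptions, chosen so that the filter uses no future data.

```python
import statistics


def trailing_median_filter(states, window=3):
    """Smooth an ordered integer state sequence with a trailing median.

    Only current and past states are used, so there is no look-ahead.
    Early periods use a shorter window; median_low keeps outputs valid states.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    for i in range(len(states)):
        start = max(0, i - window + 1)
        smoothed.append(statistics.median_low(states[start:i + 1]))
    return smoothed


def count_switches(states):
    """Number of state changes between consecutive periods."""
    return sum(a != b for a, b in zip(states, states[1:]))


# Toy decoded states (0 = calm, 1 = turbulent); made-up, not from either paper.
raw = [0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1]
filtered = trailing_median_filter(raw, window=3)
print(count_switches(raw), count_switches(filtered))
```

The filter trades responsiveness for stability: isolated one-period flips are suppressed, at the price of reacting to a genuine regime change a period or two late.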

Profitability grade — weak. The study reports out-of-sample numbers and a benchmark, which lifts it above pure in-sample work, but it omits transaction costs and slippage entirely, relies on leveraged portfolios, draws its result from a ~2.5-year crash-dominated window, tunes its components in-sample, runs no robustness battery, and offers no code or data release for replication. The reported “outperformance” is a gross, cost-free, leveraged backtest artefact rather than substantiated profitability.

Wang Lin Mikhelson 2020 [tests-strategy] Style Factor Rotation
Wang Lin Mikhelson 2020 [contradicts] Bulla et al. 2010
Style Factor Rotation [part-of] Factor Timing
Wang Lin Mikhelson 2020 [supports] Overfitting in Quantitative Trading

Connections

Sources