Transformer Forecast Collapse: Why More Powerful Models Make Worse Financial Predictions

A mathematician just proved that the metric every financial AI model optimizes makes predictions strictly worse as models get more powerful. Under MSE loss, the optimal predictor for financial returns is identically zero. PatchTST, one of the most capable time-series transformers available, lost to a one-parameter linear model on 92% of 233 test windows across 1,160 days of EUR/USD data, with an average error 1.71x the linear baseline's. More parameters, worse predictions.
I spent 13 years in financial services building models that used MSE or a close variant. Credit underwriting at a $30MM commercial portfolio. Predictive models at a fintech lender. Fraud detection systems at a credit union. Every single one optimized some version of squared error. Reading Andreoletti's proof felt like finding out the road you've been driving on for a decade has a structural crack running its entire length. The road still works. But now you can see exactly where it fails.
The Proof in Three Steps
The argument is clean and builds on classical results, applied to the specific structure of financial time series.
Step 1: MSE optimizes for the conditional mean. Under squared trajectory loss, the unique minimizer of population risk is the conditional expectation of the future given the observed past. This is textbook statistical learning theory. The optimal predictor is f*(X) = E[Y | X]. Nothing controversial here.
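For readers who want the derivation, the standard decomposition makes Step 1 a one-liner (my notation, not the paper's). For any predictor f,

$$E[(Y - f(X))^2] = E[(Y - E[Y \mid X])^2] + E[(E[Y \mid X] - f(X))^2]$$

The first term doesn't depend on f at all; the second is zero exactly when f(X) = E[Y | X]. Minimizing MSE and estimating the conditional mean are the same problem.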
Step 2: In financial data, the conditional mean is trivial. Under standard martingale assumptions, the Bayes-optimal predictor for prices is to repeat the last observed price. For returns, the optimal predictor is identically zero. The conditional mean is degenerate. Any predictable structure in returns is weak, unstable, and dominated by noise. Decades of empirical finance confirm this. Goyal and Welch (2008) showed that most return predictors fail out of sample. The efficient market hypothesis is, mathematically, the shape of the conditional mean.
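In symbols, again in my notation: the martingale assumption says

$$E[P_{t+1} \mid \mathcal{F}_t] = P_t, \qquad\text{hence}\qquad E[r_{t+1} \mid \mathcal{F}_t] = E\!\left[\frac{P_{t+1} - P_t}{P_t} \,\middle|\, \mathcal{F}_t\right] = 0$$

The conditional mean that MSE chases is the last price for prices and exactly zero for returns.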
Step 3: More expressivity adds variance without reducing bias. This is the core result. Compare two model classes: a simple one-parameter linear model that scales the last price, and an interpolating model (like a transformer) capable of exactly reproducing training outputs.
The linear model's expected error is H * sigma^2 + O(H * sigma^2 / n), where H is the forecast horizon, sigma^2 the per-step noise variance, and n the training-set size. That's the irreducible noise floor plus a term that vanishes with more data.
The interpolating model's expected error is bounded below by 2H * sigma^2. Double the noise floor. The extra sigma^2 comes from noise reuse: the model memorizes noise from training examples and replays it at test time. More data helps the linear model. More data does not help the interpolating model. The 2x factor is structural.
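A toy simulation makes the gap concrete. The sketch below is mine, not the paper's code, and it simplifies to a one-step horizon (H = 1): returns are pure noise, so the true conditional mean is zero; a one-parameter linear model is fit by closed-form least squares; and a 1-nearest-neighbor lookup stands in for an interpolating model that memorizes its training data.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, window, n = 1.0, 5, 1000

def make_xy(m):
    # Pure-noise returns: the true conditional mean E[y | x] is zero.
    r = rng.normal(0.0, sigma, size=m + window)
    X = np.lib.stride_tricks.sliding_window_view(r[:-1], window)
    return X, r[window:]

X_tr, y_tr = make_xy(n)
X_te, y_te = make_xy(n)

# One-parameter linear model: yhat = alpha * last observed return,
# with alpha fit in closed form (it lands near zero, as it should).
x = X_tr[:, -1]
alpha = (x @ y_tr) / (x @ x)
mse_linear = np.mean((y_te - alpha * X_te[:, -1]) ** 2)

# Interpolating model: replay the label (noise included) of the
# nearest remembered training input.
d2 = (X_te**2).sum(1)[:, None] - 2 * X_te @ X_tr.T + (X_tr**2).sum(1)[None, :]
mse_interp = np.mean((y_te - y_tr[d2.argmin(axis=1)]) ** 2)

print(f"linear MSE / sigma^2: {mse_linear / sigma**2:.2f}")  # ~1.0, the noise floor
print(f"1-NN   MSE / sigma^2: {mse_interp / sigma**2:.2f}")  # ~2.0, the floor doubled
```

Raising n tightens the linear model toward the noise floor and moves the 1-NN predictor not at all: the replayed label is always independent noise.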
Why Transformers Memorize Noise
Think of it like a card counter in a casino who memorizes every hand they've ever seen. When a similar-looking hand appears, they replay their memory of what happened last time. But each remembered outcome includes the random luck of that specific hand. The counter is replaying the noise of the past and calling it prediction.
Transformers do the same thing at scale. When an interpolating predictor encounters a new input, it finds the nearest training example and returns that example's output, noise included. Test noise and training noise are independent, each with variance H * sigma^2. Combined: 2H * sigma^2. The approximation error from the nearest-neighbor mismatch only adds more.
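The arithmetic behind that sentence, in my notation: if the test target is Y = f(x) + ε_test and the replayed training label is Y' = f(x') + ε_train, with independent zero-mean noise terms of variance H * sigma^2 each, then

$$E[(Y - Y')^2] = E[(f(x) - f(x'))^2] + 2H\sigma^2 \;\ge\; 2H\sigma^2$$

The mismatch term E[(f(x) - f(x'))^2] is the nearest-neighbor approximation error; the 2H * sigma^2 is the floor no amount of capacity removes.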
Adding parameters doesn't fix this. It makes the memorization more precise, which makes the noise replay more confident. The model reproduces randomness with higher fidelity and calls it learning.
PatchTST vs. a Spreadsheet
The empirical results confirm the theory. Andreoletti tested PatchTST against a one-parameter linear model on 1,160 days of high-frequency EUR/USD data at 30-second intervals (December 2020 through July 2025).
Results:
- PatchTST produced larger errors on 92% of 233 test windows
- Average trajectory error ratio: 1.71x the linear model's error
- Increasing model capacity widened the gap
This wasn't a fluke sample. Zeng et al. (AAAI 2023) had already shown that a single-layer linear model (LTSF-Linear) outperformed Informer, Autoformer, FEDformer, and PatchTST on 9 benchmarks, often by a large margin. Their explanation focused on architecture: transformers lose temporal information through permutation-invariant attention. Andreoletti's proof goes deeper: when the loss function meets data whose conditional mean carries no information, collapse follows regardless of architecture.
Zhou et al. (2025) proved a related result: Linear Self-Attention models cannot achieve lower expected MSE than classical linear models for in-context forecasting. Andreoletti extends this beyond attention mechanisms to any sufficiently expressive model class.
The 2025 Quant Winter
If the math sounds abstract, the market provided concrete evidence. 2025 was a brutal year for quantitative strategies.
- Q1: Millennium lost approximately $900 million on index rebalancing strategies, triggering deleveraging cascades across the industry
- April: Trump's tariff announcements caused regime shifts that broke models trained on historical correlations
- June-July: A retail-driven rally in heavily shorted stocks squeezed quant equity managers. Goldman Sachs' prime services unit estimated 4.2% average losses for equity quant managers from June through July, the worst run for systematic long-short strategies since late 2023
- October: Renaissance's RIEF lost 14.4% in a single month, its worst monthly loss in over a decade, worse than anything during COVID. RIEF and RIDA collectively manage over $20 billion
The Financial Times dubbed it "rolling thunder": multiple waves of quant failures driven by crowded factor models, regime mismatch, and AI systems trained on historical data that couldn't adapt to sentiment-driven markets.
The forecast collapse proof explains the baseline failure mode. Even without regime changes, even in calm markets, the most expressive models do not produce better predictions. They produce worse ones. The 2025 quant winter layered regime risk on top of a foundation that was already cracked.
Goodhart's Law Gets a Proof
This is Goodhart's Law in its strong form, applied to prediction.
Goodhart (1975): "Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes."
Sohl-Dickstein's formalization: when a measure becomes a target and is effectively optimized, the thing it measures grows worse.
Applied to financial prediction:
- MSE is the measure (proxy for prediction quality)
- The model optimizes MSE aggressively
- MSE's optimal solution is the conditional mean
- In financial data, the conditional mean is flat or zero
- The model converges to trivial predictions while achieving optimal MSE
- The metric is satisfied. The actual goal is destroyed.
I've watched this pattern in less mathematical contexts throughout my career. In fraud detection, we'd optimize for detection rate. The metric improved. But we'd inadvertently train the model to flag the easy cases while missing the sophisticated ones. The measure became the target, and the thing it measured got worse. The forecast collapse proof formalizes what practitioners have felt for years: optimizing the metric harder doesn't mean you're solving the problem better.
Why This Doesn't Kill All Financial AI
The proof is specific. It says: under MSE loss, for time-series forecasting where the conditional mean is degenerate, more expressivity hurts. It does not say financial AI is useless. It maps a precise boundary.
Volatility forecasting works. The conditional variance, unlike the conditional mean, is non-trivial and learnable. Volatility clustering, GARCH effects, and mean reversion in variance provide real structure. Temporal fusion transformers show strong results here because there's actual signal in the conditional second moment.
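You can see the asymmetry in a few lines. The sketch below is my own illustration with textbook GARCH(1,1) parameters, not a fitted model: returns come out serially uncorrelated (flat conditional mean), while squared returns are strongly autocorrelated (learnable conditional variance).

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000
omega, a, b = 0.05, 0.10, 0.85   # textbook GARCH(1,1) parameters

r = np.zeros(n)
var = omega / (1 - a - b)        # start at the unconditional variance
for t in range(1, n):
    var = omega + a * r[t - 1] ** 2 + b * var
    r[t] = np.sqrt(var) * rng.normal()

def autocorr(x, lag=1):
    x = x - x.mean()
    return (x[:-lag] @ x[lag:]) / (x @ x)

print(f"lag-1 autocorr, returns:         {autocorr(r):+.3f}")     # ~0.00: mean is flat
print(f"lag-1 autocorr, squared returns: {autocorr(r**2):+.3f}")  # ~+0.2: variance clusters
```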
Cross-sectional prediction works. Gu et al. (2020) found that machine learning gains in finance come primarily from cross-sectional prediction: ranking stocks relative to each other rather than forecasting individual returns. Cross-sectional rankings can have meaningful conditional structure even when individual return means are degenerate.
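Cross-sectional skill is also scored differently: not trajectory MSE but the rank information coefficient, the Spearman correlation between predicted and realized returns across assets on each date. A hypothetical sketch with placeholder data (the signal array stands in for whatever your model emits):

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(2)
n_days, n_assets = 250, 100

# Placeholder data: a weak cross-sectional signal buried in noise.
signal = rng.normal(size=(n_days, n_assets))
returns = 0.05 * signal + rng.normal(size=(n_days, n_assets))

# Rank IC: one Spearman correlation across assets per day.
daily_ic = np.array([spearmanr(signal[t], returns[t])[0] for t in range(n_days)])

# A mean IC near 0.05 is economically meaningful even though every
# individual asset's conditional mean return is approximately zero.
t_stat = daily_ic.mean() / daily_ic.std(ddof=1) * np.sqrt(n_days)
print(f"mean rank IC: {daily_ic.mean():.3f}  (t-stat {t_stat:.1f})")
```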
Alternative loss functions help. Quantile loss targets specific distribution percentiles instead of the mean. Directional loss functions optimize for getting the sign of the move right. Recent research shows 3.4 to 6.1 percentage point gains in directional accuracy over MSE. Andreoletti suggests diffusion-based models and probabilistic forecasting as escape routes: methods that learn the full conditional distribution rather than collapsing to the mean.
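For reference, both alternative losses fit in a few lines. A minimal numpy sketch (the directional form is one common smooth surrogate, not the specific loss any cited paper uses):

```python
import numpy as np

def pinball_loss(y, y_hat, q=0.9):
    # Quantile (pinball) loss: minimized by the q-th conditional
    # quantile rather than the conditional mean.
    err = y - y_hat
    return np.mean(np.maximum(q * err, (q - 1) * err))

def directional_loss(y, y_hat, k=100.0):
    # Logistic loss on the sign-agreement margin y * y_hat: rewards
    # getting the direction right, ignores squared distance.
    return np.mean(np.logaddexp(0.0, -k * y * y_hat))

# Under pinball loss at q=0.9 the all-zeros forecast is no longer optimal:
y = np.random.default_rng(4).normal(0.0, 0.01, size=10_000)
print(pinball_loss(y, np.zeros_like(y)))                      # zero forecast
print(pinball_loss(y, np.full_like(y, np.quantile(y, 0.9))))  # q-th quantile: lower
```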
Regime-aware architectures adapt. Models that explicitly detect and adapt to market regimes maintain performance during transitions. One framework achieved 0.59% MAPE for one-day predictions with 72% directional accuracy, maintaining stability during high-volatility periods.
The lesson: the default configuration (MSE loss plus maximal model capacity), the one every tutorial teaches, the one every Kaggle notebook uses, provably makes predictions worse when applied to financial time series.
The AI Washing Problem Gets Sharper
The SEC has already fined investment advisers for overstating AI capabilities. Delphia (USA) Inc. paid $225,000 for claiming AI used client data in ways it didn't. Global Predictions Inc. paid $175,000 for unsubstantiated performance claims.
The forecast collapse proof raises the bar. If the most expressive models provably converge to trivial predictions under standard loss functions, then marketing claims about transformer-based alpha generation require extraordinary evidence. The CFA Institute's 2025 report on AI washing addresses exactly this gap between what firms claim and what the math supports.
As a CFA charterholder, this hits a specific nerve. The CFA Institute's consistent position has been AI + HI: artificial intelligence plus human intelligence. The forecast collapse proof validates that position from first principles. The model cannot choose its own loss function. The loss function determines whether the model learns something useful or converges to nothing. That choice is irreducibly human.
What to Do About It
If you're building financial prediction systems, here's where the proof leads.
Stop defaulting to MSE. Squared error is the loss function equivalent of selecting "default" on every dropdown menu. For financial time series where the conditional mean is degenerate, MSE guarantees that more model capacity produces worse predictions. Choose a loss function that aligns with what you actually care about: direction, quantiles, tail risk.
Match expressivity to signal. If your data has strong conditional structure (weather, electricity demand, traffic), use expressive models. If the conditional mean is approximately flat (financial returns at most horizons), use the simplest model that captures what's there. A one-parameter linear model beat PatchTST 92% of the time. Respect that.
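For scale, here is the whole winning model class, as I reconstruct it (the paper's exact parameterization may differ): one scalar, fit in closed form, iterated forward.

```python
import numpy as np

def fit_last_value_scaler(prices: np.ndarray) -> float:
    # One parameter: next_price ~ alpha * last_price, by least squares.
    # For near-martingale prices, alpha lands very close to 1.
    x, y = prices[:-1], prices[1:]
    return float((x @ y) / (x @ x))

def forecast(prices: np.ndarray, alpha: float, horizon: int) -> np.ndarray:
    # Iterate the one-parameter map forward from the last observed price.
    return prices[-1] * alpha ** np.arange(1, horizon + 1)

# Example on a simulated random walk standing in for real price data:
rng = np.random.default_rng(5)
prices = 1.10 + rng.normal(0, 1e-4, size=5_000).cumsum()
alpha = fit_last_value_scaler(prices)
print(f"alpha = {alpha:.6f}")      # ~1.0
print(forecast(prices, alpha, 4))  # near-flat continuation of the last price
```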
Audit where your transformers actually help. Volatility forecasting, cross-sectional ranking, regime detection, anomaly identification. These are domains where the conditional structure is non-trivial. Focus model complexity where the conditional structure gives it something to learn.
Demand proof, not demos. When a vendor shows a transformer-based financial prediction system, ask one question: what is the conditional mean of your target variable? If they can't answer, they haven't read the math.
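That question has a cheap empirical check. A sketch of the standard out-of-sample R^2 test against the zero forecast (my construction, in the spirit of Goyal and Welch):

```python
import numpy as np

def conditional_mean_check(returns: np.ndarray, lags: int = 5) -> float:
    # Out-of-sample R^2 of a linear conditional-mean model versus
    # always predicting zero. <= 0 means the mean is effectively flat.
    X = np.lib.stride_tricks.sliding_window_view(returns[:-1], lags)
    y = returns[lags:]
    n = len(y) // 2
    beta, *_ = np.linalg.lstsq(X[:n], y[:n], rcond=None)  # fit on first half
    resid = y[n:] - X[n:] @ beta                          # score on second half
    return 1.0 - (resid @ resid) / (y[n:] @ y[n:])        # benchmark: zero forecast

rng = np.random.default_rng(3)
print(f"OOS R^2 on pure noise: {conditional_mean_check(rng.normal(size=5000)):+.4f}")
```

On pure noise this comes out at or slightly below zero, which is the point: if a vendor's target variable scores the same way, their transformer is optimizing toward a flat function.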
The Bigger Picture
I wrote about the structural limits of AI in finance two weeks ago. Reflexivity, confidence-failure paradox, governance gaps. The forecast collapse proof adds a fourth limit that sits underneath all of them. Before reflexivity degrades patterns, before confidence misleads, the model is already converging to trivial predictions because MSE plus a degenerate conditional mean equals flatness.
The macro view: AI in finance fails at the systems level because markets are reflexive, models are overconfident, and governance hasn't caught up. The micro view: AI in finance fails at the model level because the standard loss function is mathematically mismatched to the domain.
Together, they form a complete argument. And the micro view is, in some ways, more damning. You can fix governance. You can hedge against reflexivity. You cannot argue with a proof.
Thirteen years of building models, and the metric I used in every one of them has a formal proof showing it makes predictions worse when models get powerful enough to fully optimize it. That's not a reason to stop building. It's a reason to stop building on autopilot.