Where Time-Series Foundation Models Fit Beside Classical Forecasters

Google's TimesFM has been climbing GitHub for a few months now, and it has dragged a familiar question back into the open: when a pretrained, zero-shot foundation model exists for your problem domain, what happens to the stack you already have? Forecasting teams now face the same fork that NLP teams faced around 2019 (train from scratch, fine-tune a base model, or run a foundation model zero-shot), except the incumbents here are not transformers. They are ARIMA, Prophet, exponential smoothing, and a wall of LightGBM models that someone's data science team spent two years tuning.

We have been running forecasting workloads in production long enough to have opinions about where TimesFM-class models actually belong. They are not a replacement for the classical stack. They are not a toy either. The honest answer is more interesting than either of those positions.

What a time-series foundation model actually is

TimesFM, Moirai, Chronos, Lag-Llama: the cohort is now broad enough to talk about as a category. These are transformers (decoder-only for TimesFM and Lag-Llama, encoder-decoder for Chronos, a masked encoder for Moirai), pretrained on corpora that range up to roughly a hundred billion time points drawn from sources such as Google Trends, Wikipedia pageviews, M4, electricity load, weather, retail, and synthetic mixes. Most tokenize sequences into patches (Chronos instead scales values and quantizes them into a fixed vocabulary, and Lag-Llama feeds lagged values in as features), learn a generic representation of temporal structure, and produce probabilistic forecasts given a context window, typically 512 to 2048 points, without any fitting on your data.

The zero-shot claim is the headline. On many public benchmarks they match or beat statistical baselines without ever seeing the target series. That is genuinely new. The old transfer-learning story for time series was weak; series were too heterogeneous, scales too different, seasonalities too domain-specific. Patch tokenization plus scale normalization plus enough pretraining data appears to have broken that ceiling.
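To make "patch tokenization plus scale normalization" concrete, here is a minimal sketch of the idea in plain Python. It is not the preprocessing code of TimesFM or any other model: the function names, the per-window standardization, the zero left-padding, and the `patch_len=32` default are all our own illustrative assumptions. Real implementations differ in details such as how padding is masked and where the scaling statistics are computed.

```python
from statistics import fmean, pstdev


def scale_and_patch(series, patch_len=32):
    """Standardize a context window, then cut it into fixed-length patches.

    Illustrative only: the scaling scheme and padding rule are assumptions,
    not the behaviour of any specific foundation model.
    """
    mu = fmean(series)
    sigma = pstdev(series) or 1.0  # constant series: avoid dividing by zero
    scaled = [(x - mu) / sigma for x in series]

    # Left-pad with zeros so the length is a multiple of patch_len.
    # (A real model would also carry a mask marking the padded positions.)
    pad = (-len(scaled)) % patch_len
    padded = [0.0] * pad + scaled

    patches = [padded[i:i + patch_len] for i in range(0, len(padded), patch_len)]
    return patches, (mu, sigma)


def unscale(values, stats):
    """Map model-space values back to the original units of the series."""
    mu, sigma = stats
    return [v * sigma + mu for v in values]
```

The point of the scaling step is that a series measured in thousands of units and one measured in fractions look the same to the model once standardized, which is what lets a single set of weights transfer across domains. Forecasts come out in scaled space and are mapped back with the saved statistics.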

What they are not: they are not causal models, they do not natively ingest exogenous regressors in most current implementations (Moirai is the exception worth watching), and they do not give you interpretable components the way Prophet does. For the most part they are pattern completers for univariate sequences, with quantile heads bolted on.

Where classical methods still win cleanly

If your series is short — fewer than a few hundred observations — ARIMA and ETS will usually beat a foundation model, and they will do it on a laptop. The pretraining advantage shows up most clearly when the context window has enough history to expose seasonality and regime structure. On a sparse weekly series with 80 points, the foundation model is guessing from very little, and a well-specified state-space model with the right seasonal period encoded by hand will beat it.

If interpretability is part of the deliverable — and for anything touching finance, capacity planning, or regulated reporting, it usually is — Prophet's additive decomposition or a structural time-series model is not optional. "The transformer said so" is not an explanation a CFO accepts. You can extract attention weights from TimesFM, but they do not map to trend, seasonality, holiday, and residual the way stakeholders want.

If exogenous features carry most of the signal — price, promotion, weather, marketing spend — gradient boosting on lagged features remains the practical winner. LightGBM with proper feature engineering still beats every foundation model we have tested on retail demand problems where the calendar and the price grid drive the series. Foundation models are improving on covariate handling, but they are behind where a competent feature-engineered GBM sits today.
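For readers who have not built one of these, "gradient boosting on lagged features" means turning the series into a table: each row holds a handful of past values plus the covariates known for the step being predicted. The sketch below shows only that table-building step, in plain Python. The function name, the lag set `(1, 7, 14)`, and the `price` covariate in the usage example are hypothetical choices for illustration, and the rows would still need to be handed to a learner such as LightGBM.

```python
def make_lag_rows(y, exog, lags=(1, 7, 14)):
    """Build (features, target) rows for a tabular learner such as a GBM.

    y    : list of target values, oldest first
    exog : list of dicts of covariates, one per time step; each must be
           known ahead of time for that step (planned price, calendar flags)
    lags : which past values of y to expose as features (illustrative default)
    """
    if len(y) != len(exog):
        raise ValueError("y and exog must have the same length")

    max_lag = max(lags)
    rows = []
    # Start at max_lag so every requested lag exists for every row.
    for t in range(max_lag, len(y)):
        features = {f"lag_{k}": y[t - k] for k in lags}
        features.update(exog[t])  # covariates for the step being predicted
        rows.append((features, y[t]))
    return rows


# Usage with made-up numbers: 20 days of demand and a daily price.
demand = list(range(20))
covariates = [{"price": 10.0 + i} for i in range(20)]
rows = make_lag_rows(demand, covariates)
```

The important constraint is in the docstring: a covariate only belongs in the row if it is genuinely known at forecast time. Planned prices and promotions qualify; realized weather for a future date does not.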

If you need cheap inference at scale — say, ten million SKU-store forecasts refreshed nightly — a foundation model is the wrong tool. TimesFM-1.0-200M needs a GPU to be remotely practical, and even then you are looking at orders of magnitude more compute per series than an ARIMA fit. The economics do not work for high-cardinality, low-value-per-series workloads.

Where foundation models earn their place

The sweet spot is what we have started calling the cold-start middle: series that are long enough to have structure but new enough that you have no model for them, and numerous enough that hand-tuning each one is uneconomic.

Concretely: a new product line with six months of daily data and twenty correlated SKUs. A monitoring metric that just got instrumented and needs anomaly bounds tomorrow. A demand forecast for a market the business entered last quarter. In all of these, the choice has historically been a weak ARIMA or a borrowed model from a similar series. A zero-shot foundation model now gives you a credible probabilistic forecast immediately, with no fitting and no per-series tuning. That is a real workflow change.

The second sweet spot is exploratory forecasting — the work analysts do when they are trying to understand whether a series is forecastable at all, before committing to a modelling investment. Running TimesFM over a portfolio of candidate series and looking at the prediction-interval widths is a fast way to triage. Series where the foundation model produces tight, plausible intervals are worth modelling properly. Series where the intervals blow up are telling you something about the data, not the model.
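A rough sketch of that triage step follows. It assumes you already have lower and upper quantile forecasts for each series from whatever model you ran; nothing here calls a real foundation-model API. The function names, the choice to scale interval width by the mean absolute level of the history, and the `max_width=0.5` cutoff are all hypothetical and would need tuning for a real portfolio.

```python
from statistics import fmean


def relative_interval_width(lower, upper, history):
    """Mean prediction-interval width, scaled by the series' typical level."""
    level = fmean([abs(v) for v in history]) or 1.0  # guard all-zero history
    widths = [u - l for l, u in zip(lower, upper)]
    return fmean(widths) / level


def triage(portfolio, max_width=0.5):
    """Split series into 'worth modelling' and 'look at the data first'.

    portfolio maps a series name to (history, lower_quantiles, upper_quantiles).
    max_width is an illustrative cutoff, not a recommended value.
    """
    keep, inspect = [], []
    for name, (history, lower, upper) in portfolio.items():
        width = relative_interval_width(lower, upper, history)
        (keep if width <= max_width else inspect).append(name)
    return keep, inspect
```

Scaling by the level of the history is one way to make widths comparable across series of very different magnitudes. It behaves badly for series that hover around zero, where a scale based on the spread of the history would be a safer choice.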

The third is ensembling. We have seen consistent small gains from blending a foundation model forecast with a classical baseline, weighted by recent residual performance. It is not glamorous, but on operational dashboards where every percentage point of MAPE matters, it works.
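One simple way to weight by recent residual performance is inverse-error weighting: each model's weight is proportional to one over its mean absolute error on a recent holdout window. The sketch below assumes that scheme; it is one reasonable choice among several, and the function names and the model labels in the usage example are hypothetical.

```python
def inverse_error_weights(recent_errors, eps=1e-9):
    """Turn each model's recent residuals into a weight proportional to 1 / MAE.

    recent_errors maps a model name to a list of residuals (actual - forecast)
    over a recent evaluation window. eps guards against a zero MAE.
    """
    inverse = {}
    for model, errors in recent_errors.items():
        mae = sum(abs(e) for e in errors) / len(errors)
        inverse[model] = 1.0 / (mae + eps)
    total = sum(inverse.values())
    return {model: w / total for model, w in inverse.items()}


def blend(forecasts, weights):
    """Weighted average of per-model point forecasts over a shared horizon."""
    horizon = len(next(iter(forecasts.values())))
    return [
        sum(weights[m] * forecasts[m][h] for m in forecasts)
        for h in range(horizon)
    ]


# Usage with made-up numbers: the classical model has been twice as accurate
# recently, so it receives roughly two thirds of the weight.
weights = inverse_error_weights({"foundation": [2.0, -2.0], "ets": [1.0, -1.0]})
combined = blend({"foundation": [10.0, 20.0], "ets": [13.0, 23.0]}, weights)
```

Recomputing the weights on a rolling window lets the blend drift toward whichever model is currently tracking the series better, without anyone having to pick a winner by hand.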

The inference cost question

This is the part most blog posts skip. TimesFM-1.0-200M runs at roughly 50ms per forecast on an A10G for a 512-point context, batched. Scale that to a portfolio of 100,000 series refreshed hourly and each refresh needs about 5,000 GPU-seconds, or roughly 83 minutes, which is more than a single card can finish inside the hour. That means two or more GPUs running around the clock and meaningful GPU spend: call it a few thousand dollars a month at current rates, plus the engineering overhead of a serving layer. An equivalent ARIMA fleet runs on a single CPU box for tens of dollars.
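The arithmetic is worth parameterizing so you can plug in your own portfolio. The sketch below is a back-of-envelope calculator, not a pricing tool. The function name is ours, and the `gpu_hourly_rate` of 1.50 in the usage example is a placeholder to be replaced with your provider's actual price. It also assumes the per-forecast latency stays constant at scale and ignores serving overhead.

```python
import math


def monthly_gpu_cost(n_series, seconds_per_forecast, refreshes_per_day,
                     gpu_hourly_rate, days=30):
    """Back-of-envelope GPU budget for a forecast fleet.

    Assumes latency per forecast is constant and ignores serving overhead.
    """
    gpu_seconds_per_refresh = n_series * seconds_per_forecast
    window_seconds = 86_400 / refreshes_per_day

    # GPUs needed so one refresh finishes before the next is due.
    gpus_needed = math.ceil(gpu_seconds_per_refresh / window_seconds)

    # Two billing models: pay only for busy time, or keep the cards running.
    busy_gpu_hours = gpu_seconds_per_refresh * refreshes_per_day * days / 3600
    always_on_gpu_hours = gpus_needed * 24 * days

    return {
        "gpu_seconds_per_refresh": gpu_seconds_per_refresh,
        "gpus_needed": gpus_needed,
        "busy_gpu_hours": busy_gpu_hours,
        "busy_cost": busy_gpu_hours * gpu_hourly_rate,
        "always_on_cost": always_on_gpu_hours * gpu_hourly_rate,
    }


# 100,000 series, 50 ms each, refreshed hourly; the hourly rate is a placeholder.
estimate = monthly_gpu_cost(100_000, 0.05, 24, gpu_hourly_rate=1.50)
```

The gap between the busy cost and the always-on cost is the price of idle capacity. Whether you can close it depends on whether your serving layer can scale GPUs down between refreshes.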

The right mental model is that a foundation model costs roughly what a small language model costs to serve, and you should budget for it the same way. For a high-value forecast — supply chain, treasury, capacity — that is fine. For a low-value forecast at scale, it is not.

Quantized variants and the smaller Chronos models change this math somewhat, but not by an order of magnitude. The architectural floor is the floor.

A position to hold

Our working position, which we apply on engagements: classical methods remain the default for established, well-understood series with reasonable history and known drivers. Foundation models replace the bottom of the classical toolkit, meaning the weak, untuned ARIMA fits on new series that have some history but no model yet, and they replace the analyst's first-pass exploratory model. They do not replace the gradient-boosted feature-engineered model that drives the core business forecast, and they do not replace the structural model that has to be explained.
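Written down as a routing rule, the position looks something like the sketch below. This is an illustration of the argument, not a tool we are prescribing: the function name, the flag names, the order of the checks, and especially the `min_context=100` threshold are hypothetical and should be set from your own backtests.

```python
def choose_forecaster(n_obs, needs_explanation=False, covariate_driven=False,
                      has_tuned_model=False, low_value_high_volume=False,
                      min_context=100):
    """Pick a model family for one series, following the position above.

    All thresholds and flags are illustrative assumptions.
    """
    if needs_explanation:
        return "structural"   # Prophet or a structural time-series model
    if covariate_driven:
        return "gbm"          # gradient boosting on lagged features
    if has_tuned_model or low_value_high_volume:
        return "classical"    # keep the existing ARIMA/ETS fleet
    if n_obs < min_context:
        return "classical"    # too little history for the context window
    return "foundation"       # cold-start middle: zero-shot forecast


# A new metric with six months of daily data and no existing model.
family = choose_forecaster(n_obs=180)
```

The ordering matters more than the thresholds. Explanation and covariate requirements are checked first because they rule out a zero-shot model regardless of how much history the series has.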

The interesting architectural move over the next year is not whether foundation models displace classical forecasters wholesale — they will not — but whether covariate-aware variants like Moirai close the gap with GBMs on demand forecasting. That is the workload with the most money attached, and the one where the foundation-model camp has the most to prove. If a pretrained model with exogenous regressor support starts beating well-tuned LightGBM on M5-class problems, the conversation changes materially.

Until then, the right architecture is plural. Keep the ARIMA. Keep the Prophet for the stakeholder-facing components. Keep the GBM for the covariate-heavy workloads. Add a foundation model for cold-start, exploration, and ensembling. Pay the GPU bill where it earns its keep, and not anywhere else.