The current wave of datacenter construction is staggering by any measure. Microsoft, Google, and Amazon are collectively committing hundreds of billions of dollars to GPU clusters, custom silicon, and the power infrastructure to feed them. OpenAI is breaking ground in Texas. Meta is stringing fibre across continents. The numbers are large enough that they feel structural — like we are watching the permanent shape of computing being poured into concrete.
We think that framing deserves scrutiny.
Not because the investment is irrational. It isn't. The compute requirements for training frontier models are real, the latency and throughput demands of inference at scale are real, and the engineering complexity of running these systems reliably is genuinely difficult. Hyperscalers are solving hard problems well. But "this infrastructure is necessary right now" and "this infrastructure defines the enduring architecture of AI" are two different claims, and the second one has a complicated history.
The Pendulum Has Swung Before
Computing has oscillated between centralised and distributed models roughly every fifteen to twenty years, and the oscillations are not random. They track three variables: where capability lives, where cost gravity sits, and where latency pressure is highest.
In the 1970s and early 1980s, compute was centralised by necessity — mainframes and minicomputers were the only machines that could do useful work. The personal computer swung the pendulum hard toward the edge. Then the web swung it back toward servers. Then mobile pushed capability back to the device. Then cloud ML centralised training and inference again.
Each swing looked permanent to people living through it. Each one wasn't.
The pattern is not that centralisation loses. It is that centralisation holds until the edge catches up, and then the load redistributes. The question worth asking right now is whether the edge is catching up again — and the honest answer is yes, faster than the hyperscaler narrative tends to acknowledge.
What On-Device Models Actually Look Like Today
Apple's A17 Pro runs a 3-billion-parameter model locally with acceptable latency for many real-time tasks. The M-series chips have a unified memory architecture that is genuinely well-suited to transformer inference — the memory bandwidth numbers are competitive with discrete GPUs from two years ago. Qualcomm's Snapdragon X Elite ships with a 45 TOPS NPU. MediaTek, Samsung, and Google's Tensor chips are all moving in the same direction.
Meta's Llama 3 8B runs at usable speeds on a MacBook Pro with no network call. Mistral 7B fits in 8GB of RAM with 4-bit quantisation and produces outputs that are genuinely useful for a wide range of tasks. Microsoft's Phi-3 Mini was explicitly designed for edge deployment and benchmarks surprisingly well against models three times its size on reasoning tasks.
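The claim that a 7B-parameter model fits in 8GB at 4-bit precision is straightforward arithmetic, and it is worth making explicit because it drives everything else in this section. A rough sketch (the ~10% overhead figure for activations and KV cache is an illustrative assumption, not a measured benchmark):

```python
# Back-of-envelope memory footprint for transformer weights at
# different quantisation levels. The 10% overhead for activations
# and KV cache is an illustrative assumption, not a benchmark.

def weight_memory_gb(params_billion: float, bits_per_weight: float,
                     overhead: float = 0.10) -> float:
    """Approximate RAM needed to hold a model's weights in memory."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return weight_bytes * (1 + overhead) / 1e9

for bits in (16, 8, 4):
    print(f"7B model @ {bits}-bit: {weight_memory_gb(7, bits):.1f} GB")
```

At 16-bit the weights alone need a discrete GPU; at 4-bit they fit comfortably inside a mid-range laptop's RAM. That factor-of-four compression, with modest quality loss, is the mechanism behind every "runs on a MacBook" claim above.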
None of this means on-device models are ready to replace GPT-4 class inference today. They aren't. But the trajectory is steep, and the gap is closing faster than the infrastructure investment cycle can respond. The hyperscalers are locking in multi-year capacity for the compute profile that the best open models required eighteen months ago.
The Lock-In Architecture Is Visible If You Look
The datacenter buildout is not just infrastructure — it is also a moat construction project, and the two goals are entangled in ways that matter for how we think about the endgame.
Consider the dependency stack that cloud AI creates: proprietary APIs, token-based pricing that scales with usage, latency characteristics that require persistent connectivity, and fine-tuning pipelines that strongly prefer keeping your data inside the provider's ecosystem. Each layer is individually reasonable. Together they create switching costs that compound.
This is not a conspiracy. It is rational business strategy. But it is worth naming clearly, because the "we need this infrastructure to do AI" argument and the "we benefit from you needing this infrastructure" argument produce the same datacenter, and conflating them leads to muddled thinking about what the buildout actually represents.
The open model ecosystem is quietly dismantling pieces of this stack. Ollama makes local model deployment genuinely simple. LM Studio handles quantised models across consumer hardware with a reasonable UI. The llama.cpp project has driven extraordinary optimisation work that makes edge inference faster on every release cycle. Hugging Face has become a distribution layer that operates largely outside hyperscaler control. The tooling is not yet enterprise-grade across the board, but the direction is clear.
Where the Load Will Actually Sit
Our working thesis is that the compute stack will bifurcate more sharply than most current commentary suggests.
Frontier training will remain centralised, probably forever. Training a GPT-4 class successor requires synchronised compute at a scale that is physically impractical to distribute in the way that inference can be. The hyperscalers will own this tier, and the economics of it will remain brutal enough to keep the field narrow. This is genuinely load-bearing infrastructure, and the investment in it is not going to look foolish.
Frontier inference is where the story gets more interesting. Right now, running GPT-4 class inference requires the cloud. In two years, it probably requires a high-end workstation. In four years, it probably runs on a flagship phone. That is not a prediction pulled from optimism — it is an extrapolation from the quantisation and distillation research that is already published and already working. The question is not whether this happens but how the business models adapt when it does.
Routine inference — the long tail of summarisation, classification, extraction, code completion, and conversational tasks that makes up the majority of actual AI usage — will migrate to the edge faster than the hyperscalers would prefer to discuss publicly. The economics are straightforward: for a task where a 7B-parameter model is sufficient, running it locally trades an ongoing per-token fee for a one-time hardware cost amortised over the device's life. At scale, that arithmetic is decisive.
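The arithmetic can be made concrete. All prices below are hypothetical placeholders, not quotes from any provider, and the sketch deliberately ignores electricity and maintenance:

```python
# Break-even between a one-time local hardware purchase and
# per-token API pricing. Prices are hypothetical placeholders;
# electricity and maintenance are deliberately omitted.

def breakeven_tokens(hardware_cost_usd: float,
                     api_price_per_mtok_usd: float) -> float:
    """Tokens after which a one-time hardware cost beats API fees."""
    return hardware_cost_usd / api_price_per_mtok_usd * 1_000_000

# e.g. a $1,500 machine vs. a (hypothetical) $0.50 per million tokens:
tokens = breakeven_tokens(1500, 0.50)
print(f"break-even at {tokens / 1e9:.1f}B tokens")  # → 3.0B tokens
```

A few billion tokens sounds like a lot until you multiply a modest per-seat daily volume across a fleet of employee machines running for years — which is exactly the scale at which the decision gets made.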
The Buildout Is Not a Mistake, But It Is Not the Whole Story
We do not think the hyperscaler investment is a blunder. The infrastructure being built today is solving real problems for the capability frontier, and the capability frontier matters — it defines what is possible and pulls the rest of the ecosystem forward. There is genuine public good in having organisations capable of training the next generation of models, even if the business model around inference eventually looks different.
What we are more sceptical of is the implicit argument that the current architecture is durable — that the relationship between cloud providers and AI consumers is settling into a stable configuration analogous to how cloud computing consolidated web infrastructure in the 2010s. That analogy is seductive and probably wrong.
Web workloads centralised because the network effect of shared infrastructure was genuinely valuable and the alternative (running your own servers) was operationally painful. AI inference workloads will decentralise partly because the network effect is weaker (your summarisation task does not benefit from being colocated with someone else's) and partly because the operational tooling for local deployment is improving at a remarkable rate.
The 2010s cloud consolidation also happened against a backdrop of relatively static hardware. The AI inference story is happening while edge silicon is advancing rapidly and while the open model ecosystem is specifically optimising for deployment outside the cloud. Those are different conditions.
What This Means for How We Build
For teams making architectural decisions today, the practical implication is to avoid building lock-in deeper than your actual requirements justify. Using a proprietary API because it is the best tool for a specific task is sensible. Architecting your data pipeline around a single provider's ecosystem because it is convenient is a decision worth examining carefully.
Abstraction layers that separate your application logic from specific model providers are worth the engineering overhead. Evaluating whether your inference workload actually requires frontier-class capability — or whether a well-prompted open model handles it adequately — is worth the testing time. The open models are not always sufficient, but they are sufficient more often than most production systems currently use them.
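One minimal shape for that abstraction layer is a narrow interface that application code depends on, with vendors behind it. The names and both backends below are illustrative stubs, not any real SDK:

```python
# A minimal provider-abstraction sketch: application code depends on
# the Completer protocol, never on a specific vendor SDK. Both
# backends are illustrative stubs — swap in real clients behind the
# same interface.
from typing import Protocol

class Completer(Protocol):
    def complete(self, prompt: str) -> str: ...

class CloudCompleter:
    """Stub standing in for a hosted frontier-model API client."""
    def complete(self, prompt: str) -> str:
        return f"[cloud] {prompt}"

class LocalCompleter:
    """Stub standing in for an on-device open model (e.g. llama.cpp)."""
    def complete(self, prompt: str) -> str:
        return f"[local] {prompt}"

def summarise(text: str, backend: Completer) -> str:
    # Application logic sees only the protocol, so routing a routine
    # task to a local model is a one-line change at the call site.
    return backend.complete(f"Summarise: {text}")
```

The point of the seam is that the routing decision — frontier capability to the cloud, routine work to the edge — lives in one place and can change as the hardware and model landscape shifts, without touching application logic.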
The pendulum is mid-swing. The hyperscalers are building real things that solve real problems, and the current moment of centralised AI capability is not artificial or illusory. But the structural forces that have redistributed compute toward the edge in every previous cycle are present and accelerating. The datacenter buildout is load-bearing for the capability frontier. Whether it is load-bearing for the broader AI stack is a question the next few hardware generations will answer — and the early evidence suggests the answer is no.