Gartner recently put a number on something most operators already suspected: only 28% of AI use cases in infrastructure and operations fully deliver on their expected value. The rest partially deliver, stall, or quietly get absorbed into the general cost of experimentation. It's a useful number because it cuts against the dominant narrative — that the gap between AI promise and AI payoff is a capability gap waiting on a better model.
It isn't. The gap is almost entirely about integration discipline and governance alignment. That's an unglamorous finding, but it's the one that matches what we see inside real environments.
The capability ceiling moved; the integration ceiling didn't
Frontier models have become commodity-adjacent. The delta between GPT-4-class, Claude Sonnet-class, and open-weight models in the Llama or Qwen families is small enough that for most I&O tasks — log triage, change-risk scoring, runbook generation, anomaly summarisation, ticket routing — the choice of model is a rounding error on outcome quality. What isn't a rounding error: whether the model has clean access to the right telemetry, whether its outputs are trusted enough to trigger action, and whether the organisation has a place to put those outputs inside an existing workflow.
This is the part that quietly determines ROI. A flawless incident-summarisation model attached to a ticket system no one reads produces zero value. A mediocre model wired into the on-call path, with a feedback loop that measures MTTR delta, produces real, measurable savings within a quarter. The model is not the bottleneck. The wiring is.
What the 28% actually have in common
When we look at the I&O AI deployments that work — ours and others' — a short list of properties keeps appearing:
They attach to a system of record, not a system of suggestion. The use cases that return value write into ServiceNow, Jira, PagerDuty, Terraform plans, or the CMDB. They don't produce a separate dashboard that someone has to remember to check. If the AI output doesn't land somewhere an engineer or an automated process already looks, it effectively doesn't exist.
They have a defined human-in-the-loop boundary. Not a vague commitment to oversight — an actual line. "The model drafts the change request; a human approves it." "The model auto-closes tickets below confidence 0.9; everything else routes to tier 2." When that boundary is explicit, trust accumulates. When it's fuzzy, trust erodes on the first bad output and never recovers.
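A boundary that explicit is small enough to express directly in code, which is part of why it builds trust: there is no ambiguity about what the model may do on its own. A minimal sketch, with hypothetical field names and the 0.9 threshold from the example above:

```python
from dataclasses import dataclass

AUTO_CLOSE_THRESHOLD = 0.9  # illustrative cut-off; tune per environment


@dataclass
class ModelVerdict:
    """Hypothetical shape of one model output on a ticket."""
    ticket_id: str
    suggested_action: str  # e.g. "close", "escalate"
    confidence: float


def route(verdict: ModelVerdict) -> str:
    """Apply the written boundary: auto-close above threshold, else a human."""
    if verdict.suggested_action == "close" and verdict.confidence >= AUTO_CLOSE_THRESHOLD:
        return "auto_close"        # the model acts autonomously
    return "route_to_tier_2"       # everything else gets a human
```

The point is not the routing logic, which is trivial, but that the boundary lives in a reviewable artifact rather than in someone's recollection of a meeting.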
They measure a pre-existing operational KPI. MTTR, change failure rate, ticket deflection, alert noise ratio, cost per workload. Not "AI adoption" or "queries per week." If the success metric was invented for the AI project, the project is almost certainly in the 72%.
They were scoped to a narrow, high-frequency task. The winning use cases are boring: classifying the 40,000 low-priority alerts per month, summarising the same eight categories of incident, drafting the same kinds of postmortems. High frequency plus narrow scope plus clear ground truth is where the economics work. Ambitious, cross-domain "AI copilots for operations" are where budgets go to retire.
Governance is not a brake; it's the thing that lets you ship
There's a persistent framing that governance slows AI delivery. In I&O specifically, the opposite is true. The projects that stall usually stall because no one answered questions that governance would have forced early: Who owns the model's outputs? What data can it see? What's the rollback procedure when it's wrong? Which change-management class do AI-initiated actions fall under? Is a model-generated runbook subject to the same review as a human-authored one?
Without answers, every deployment hits the same wall at the same place — usually the first time the model does something weird in production and an SRE has to decide whether to trust it. Teams that pre-answered those questions push through. Teams that didn't answer them spend six months negotiating after the fact, and by the time they finish, the sponsor has moved on.
Governance alignment, in practice, means three concrete things: the AI system's scope of action is written down, its failure modes have defined escalation paths, and its outputs are auditable in the same way a human operator's would be. None of this requires a committee. It requires someone to sit down for a day and write it.
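One workable form of "writing it down" is a policy expressed as data in the same repository as the integration code, so scope changes go through the same review as any other change. A sketch under those assumptions — every action name here is illustrative, not a real product schema:

```python
# A written scope-of-action policy, expressed as reviewable data.
# All action and escalation names are illustrative assumptions.
AI_OPERATIONS_POLICY = {
    "allowed_actions": {
        "draft_change_request",      # model drafts, human approves
        "summarise_incident",
        "close_low_priority_ticket",
    },
    "forbidden_actions": {
        "modify_production_config",
        "approve_change_request",
    },
    "escalation": {
        "low_confidence": "route_to_tier_2",
        "action_outside_scope": "page_on_call_sre",
    },
    "audit": {"log_every_output": True, "retain_days": 365},
}


def is_permitted(action: str) -> bool:
    """Check an AI-initiated action against the written scope.

    Unknown actions are denied by default: anything not explicitly
    allowed is out of scope.
    """
    return (action in AI_OPERATIONS_POLICY["allowed_actions"]
            and action not in AI_OPERATIONS_POLICY["forbidden_actions"])
```

Deny-by-default matters: the failure mode to avoid is a model discovering an action no one thought to forbid.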
The integration work is the work
Most of the effort in a successful I&O AI deployment isn't prompt engineering or model selection. It's the connective tissue: getting the model read access to Splunk or Datadog without punching a hole in the security model, getting write access to the ticketing system with appropriate scoping, building the evaluation harness that tells you when a model update has regressed on your specific alert taxonomy, and establishing the feedback loop where engineer corrections flow back into future prompts or fine-tuning data.
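The evaluation harness is the least optional piece of that connective tissue, and its core is small: a frozen, labelled sample of your own alerts, and a gate that blocks promotion when a candidate model regresses against it. A minimal sketch, assuming a hypothetical `classify()` callable standing in for whatever model call the team actually uses, and a deliberately tiny ground-truth set:

```python
from typing import Callable

# Tiny illustrative ground-truth set; a real one would be hundreds of
# labelled alerts drawn from the team's own taxonomy.
LABELLED_ALERTS = [
    ("disk usage 91% on db-03", "capacity"),
    ("TLS cert expires in 6 days", "certificate"),
    ("p99 latency above SLO for checkout", "performance"),
    ("host unreachable: cache-07", "availability"),
]


def accuracy(classify: Callable[[str], str]) -> float:
    """Fraction of frozen alerts a classifier labels correctly."""
    hits = sum(classify(text) == label for text, label in LABELLED_ALERTS)
    return hits / len(LABELLED_ALERTS)


def safe_to_promote(candidate: Callable[[str], str],
                    baseline: Callable[[str], str],
                    max_regression: float = 0.02) -> bool:
    """Gate a model update: block rollout if it regresses beyond tolerance."""
    return accuracy(candidate) >= accuracy(baseline) - max_regression
```

This is the mechanism that turns "the vendor shipped a new model version" from a silent production risk into a routine deployment-pipeline check.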
This is unglamorous. It looks like platform engineering because it is platform engineering. The organisations getting value from AI in operations are the ones that treat AI systems as another production service — with SLOs, on-call, deployment pipelines, and versioning — rather than as a demo that graduated.
This is also why frontier-model sophistication is a distraction. Upgrading from one capable model to a slightly more capable one yields marginal gains. Upgrading from "our model can't see the CMDB" to "our model can see the CMDB" yields step-function gains. The leverage is almost entirely on the integration side, and it will stay there until integration becomes commoditised — which it is not, and won't be for a while.
A useful test
Before starting an I&O AI project, a short diagnostic sorts likely winners from likely members of the 72%:
- Can we name the exact operational metric this will move, and by how much?
- Does the output land inside a system an engineer already uses?
- Is there a written boundary for what the model can do autonomously versus what requires human approval?
- Do we have a way to measure quality that doesn't depend on the model's own confidence scores?
- If the model is wrong in the worst plausible way, what happens?
Four or five yeses and the project is probably worth doing. Two or three and it's probably a pilot that won't graduate. Zero or one and the budget would do more good elsewhere.
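For teams that want the diagnostic to be mechanical rather than a hallway conversation, it reduces to a few lines. The question keys paraphrase the five bullets, and the thresholds mirror the verdicts above:

```python
# The five diagnostic questions, paraphrased as checklist keys.
QUESTIONS = [
    "names an exact operational metric and a target delta",
    "output lands in a system engineers already use",
    "written autonomy boundary exists",
    "quality measure independent of the model's own confidence",
    "worst-plausible-case failure has a defined answer",
]


def verdict(answers: list[bool]) -> str:
    """Map yes-counts to the three outcomes described in the text."""
    yeses = sum(answers)
    if yeses >= 4:
        return "probably worth doing"
    if yeses >= 2:
        return "probably a pilot that won't graduate"
    return "spend the budget elsewhere"
```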
Where this leaves us
The 28% figure isn't a ceiling. It's a description of the current maturity curve. The organisations above it aren't using better models — they're doing the integration and governance work that lets ordinary models produce extraordinary operational outcomes. That's genuinely good news, because integration discipline is a learnable skill with a known shape. Frontier-model access is not a moat. Operational wiring is.
The practices that work are the practices that have always worked in platform engineering: narrow scope, real metrics, clear ownership, tight feedback loops, and writing things down. AI doesn't change the shape of that work. It just raises the payoff for doing it well.