Cloudflare Radar's bot traffic charts have been doing the rounds again, and the numbers are no longer a curiosity. Across many categories of public content, automated requests now match or exceed human ones. On some properties — documentation, reference data, structured catalogues — the ratio is not even close. Add agentic browsers (Comet, Dia, Operator-style flows), AI search modes that summarise before a user ever clicks, and the steady background hum of training and retrieval crawlers, and a question that used to be rhetorical becomes a measurable input to product strategy: who is the page actually for?
We think most teams are still answering that question with assumptions from 2018. The defaults — block bots aggressively, design exclusively for human reading patterns, monetise through human attention — were correct when bots were exceptions. They are increasingly wrong when bots are the median visitor.
The composition of traffic has shifted, and the response hasn't
There is a tendency to lump all non-human traffic into one bucket labelled "bots" and then argue about whether they are good or bad. That framing is not useful anymore. The traffic mix on a typical content site now contains at least four distinct populations: classic search crawlers building indexes, training crawlers ingesting for model weights, retrieval crawlers fetching on-demand to answer a specific user query, and agent traffic acting on behalf of a named human in real time. Each has different economics, different latency requirements, and a different relationship to your business.
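A first pass at separating those four populations can be sketched from request logs. This is a minimal illustration rather than a detection system: the user-agent substrings below are examples to check against each vendor's current documentation, user-agent strings are trivially spoofed, and the `signed_agent` flag is our assumption, standing in for whatever signature verification you have in place.

```python
from collections import Counter
from enum import Enum


class Population(Enum):
    SEARCH = "search crawler"
    TRAINING = "training crawler"
    RETRIEVAL = "retrieval crawler"
    AGENT = "agent"
    OTHER = "human or unknown"


# Illustrative substrings only; verify against each vendor's published docs.
UA_HINTS = [
    ("ChatGPT-User", Population.RETRIEVAL),
    ("Perplexity-User", Population.RETRIEVAL),
    ("Googlebot", Population.SEARCH),
    ("bingbot", Population.SEARCH),
    ("GPTBot", Population.TRAINING),
    ("ClaudeBot", Population.TRAINING),
    ("CCBot", Population.TRAINING),
]


def classify(user_agent: str, signed_agent: bool = False) -> Population:
    """Map one request to a population.

    signed_agent is a hypothetical flag: True when an upstream check has
    verified that the request was signed on behalf of a user.
    """
    if signed_agent:
        return Population.AGENT
    ua = user_agent.lower()
    for hint, population in UA_HINTS:
        if hint.lower() in ua:
            return population
    return Population.OTHER


def segment(requests):
    """Count requests per population from (user_agent, signed_agent) pairs."""
    return Counter(classify(ua, signed) for ua, signed in requests)
```

Feeding a day of log lines through `segment` gives the per-population counts that the rest of the argument depends on; anything landing in `OTHER` is a mix of humans and unlabelled automation and needs separate treatment.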
A retrieval crawler fetching a page because a user just asked an AI assistant a question is, functionally, a customer visit with a translation layer in front of it. Blocking it is closer to blocking a referral from Google than to blocking a scraper. A training crawler is more like a one-time licensing event with no follow-up. An agent acting on behalf of a logged-in user is just your user, holding the mouse differently. Treating these the same — with one robots.txt rule and one rate limit — leaves obvious value on the table and creates obvious risk.
Rate limits need to become policy, not plumbing
Most rate limiting today is reactive: a WAF rule, a per-IP bucket, a Cloudflare managed challenge. That worked when the goal was keeping the site up. It does not work when the goal is differentiating between traffic populations whose business value spans several orders of magnitude.
What we'd build instead is a tiered policy with explicit verbs. Identified, signed agents (from Anthropic, OpenAI, Perplexity, Google, and whoever else publishes verifiable identity) get a generous budget and structured responses. Unidentified high-volume traffic gets aggressive throttling and a 402 or 429 with a machine-readable explanation of how to get more. Human traffic continues to get the existing experience. The middle tier, unsigned but plausibly legitimate, is where the interesting product work lives, because that is where pricing, attribution, and trust get negotiated.
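To make the tiers concrete, here is a rough sketch of the decision for automated traffic only; human traffic would bypass it. The tier names, budgets, JSON field names, and the upgrade URL are all invented for illustration, and how `verified_identity` gets established is out of scope here.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    requests_per_minute: int


# Hypothetical budgets; real numbers would come from measurement.
SIGNED = Tier("signed-agent", 600)
UNSIGNED = Tier("unsigned-plausible", 60)
UNIDENTIFIED = Tier("unidentified", 10)


def pick_tier(verified_identity: bool, declares_bot: bool) -> Tier:
    """Choose a tier from two assumed signals about the client."""
    if verified_identity:
        return SIGNED
    if declares_bot:
        return UNSIGNED
    return UNIDENTIFIED


def decide(tier: Tier, requests_this_minute: int):
    """Return (status, headers, body) for one request."""
    if requests_this_minute < tier.requests_per_minute:
        return 200, {}, ""
    # Over budget: explain the limit in a form a client can parse.
    body = json.dumps({
        "error": "rate_limited",
        "tier": tier.name,
        "limit_per_minute": tier.requests_per_minute,
        "upgrade": "https://example.com/agents/access",  # hypothetical URL
    })
    headers = {"Content-Type": "application/json", "Retry-After": "60"}
    return 429, headers, body
```

Swapping the 429 for a 402 on the unidentified tier is a policy choice rather than a technical one; the point is that the refusal carries enough structure for a well-behaved client to act on it.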
This is not hypothetical infrastructure. Cloudflare's pay-per-crawl, the various "AI bot" toggles, and emerging signed-agent proposals are early versions. They are not yet good enough to build a strategy around, but they are good enough to start measuring against.
Content design for a reader that doesn't scroll
A model doesn't scan a hero image or skip a sidebar. It reads the markup. Pages that bury the actual answer under three paragraphs of throat-clearing convert poorly for humans and worse for machines, because the model has to either guess at what's important or ingest the whole thing and pay for the tokens. The pages that work well for both audiences are the ones with a clear lede, structured facts, stable headings, and explicit semantics: schema.org, Open Graph, sensible HTML, and a sitemap that tells the truth.
The interesting consequence is that good machine-facing design and good human-facing design have converged more than they have diverged. The era of SEO content engineered to game keyword density was already over; the era of content engineered to be cited cleanly by an LLM rewards roughly the same things a thoughtful technical writer would do anyway. Be correct, be specific, attribute claims, structure the document.
Where they diverge is in length and redundancy. Human readers tolerate — sometimes want — preamble, context, voice. Models want the answer in a form they can lift. The pragmatic answer is to write for humans and expose structured extracts for machines: JSON-LD blocks, a /llms.txt if you believe in it, an API endpoint that returns the same facts as the page. Stop pretending the HTML is the only surface.
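As one example of a structured extract, a JSON-LD block can be generated from the same facts that drive the page. The sketch below uses the schema.org Article type with a handful of its properties; the function names and the choice of fields are ours, and a real page would likely carry more.

```python
import json


def article_json_ld(headline: str, author: str, date_published: str, url: str) -> dict:
    """Build a minimal schema.org Article description as a plain dict."""
    return {
        "@context": "https://schema.org",
        "@type": "Article",
        "headline": headline,
        "author": {"@type": "Person", "name": author},
        "datePublished": date_published,  # ISO 8601 date string
        "url": url,
    }


def as_script_tag(data: dict) -> str:
    """Serialise the dict for embedding in the page's HTML."""
    # Escape "</" so the payload cannot close the script element early;
    # "\/" is a valid JSON escape for "/".
    payload = json.dumps(data, ensure_ascii=False).replace("</", "<\\/")
    return f'<script type="application/ld+json">{payload}</script>'
```

Because `article_json_ld` returns a plain dict, the same object can back an API endpoint, which keeps the HTML and the machine-facing surface from drifting apart.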
Monetisation needs a machine-shaped option
Ad-supported content assumes a human eye on a rendered page. That assumption breaks the moment a model summarises the page and the user never visits. Arguing that this is theft is a comfortable position but not a useful one — the behaviour is not going to stop because publishers are upset about it.
The options that actually exist: per-crawl pricing (Cloudflare's bet), licensing deals with named model vendors (the Reddit/NYT/AP path), API access with metered billing (the long-standing answer that publishers mostly didn't take seriously), and bundled access through aggregators. None of these is fully formed. All of them assume the publisher can identify and price machine traffic separately from human traffic, which is exactly the capability most sites don't have yet.
Our read: the publishers who win the next five years are the ones who treat their content as a two-sided product — a free, ad-supported, human-facing site and a paid, structured, machine-facing feed — and who instrument the boundary between them carefully enough to know what each side is worth. The publishers who lose are the ones who try to wall everything off and discover that their traffic was 70% machine-mediated and the humans were following the machines.
Machine-facing signalling is a product surface now
robots.txt was always a polite fiction, and it is now a thoroughly inadequate one. Sites need a real signalling layer: which agents are welcome, at what rate, for what purpose, under what terms, with what attribution expectations. Some of this will standardise (the IETF's AI Preferences working group, signed-agent proposals). Most of it will be ad-hoc for a while.
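For a sense of the gap, it helps to see how little robots.txt can say. The sketch below parses an example file with Python's standard `urllib.robotparser` and checks a few agents against it. The rules and the `example.com` paths are invented for illustration, and the user-agent tokens should be checked against each vendor's documentation.

```python
from urllib.robotparser import RobotFileParser

# Example policy: refuse one training crawler, welcome one retrieval
# fetcher, and keep everyone else out of /drafts/.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Allow: /

User-agent: *
Disallow: /drafts/
"""


def load_policy(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


policy = load_policy(ROBOTS_TXT)
for agent in ("GPTBot", "ChatGPT-User", "SomeOtherBot"):
    for path in ("/docs/intro", "/drafts/wip"):
        allowed = policy.can_fetch(agent, "https://example.com" + path)
        print(f"{agent:14} {path:12} {'allow' if allowed else 'deny'}")
```

Everything the file expresses reduces to allow or deny per self-declared name and path. Purpose, terms, and attribution have no place in it, rate hints such as Crawl-delay are nonstandard, and nothing verifies that the name is honest.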
The practical move is to treat your machine-facing surface — robots.txt, sitemap, structured data, any /.well-known/ endpoints, API documentation, rate-limit responses — as a product with an owner, a roadmap, and analytics. Not as ops hygiene. The teams doing this well already have dashboards showing which crawlers are hitting which routes, which queries surface their content in AI search, and which agents are converting (or not) into downstream value.
Where we'd start
If we were advising a content-heavy product team this quarter, the order would be: measure first (segment traffic into the four populations above and put numbers on each), then signal (clean up robots.txt, ship structured data, publish an explicit policy for AI agents), then differentiate (tier rate limits by identity, return machine-readable errors), then monetise (pick one of pay-per-crawl, licensing, or metered API and run the experiment).
The mistake to avoid is treating this as a defensive posture. The web's audience is shifting, and the teams pretending otherwise are optimising for a reader that is steadily becoming a minority. The opportunity is to design deliberately for the audience you actually have — which now includes a lot of machines, some of them carrying real users behind them, and worth taking seriously on those terms.