Industry Commentary

What Automated ML Engineer Agents Actually Do in Practice

By John Jansen · 7 min read


Hugging Face recently shipped ml-intern, an agent positioned as an end-to-end ML engineer: hand it a paper or a task description, and it will read the literature, write training code, run experiments, and produce a model artifact. It is one of the more honest attempts we have seen at the autonomous-engineer pitch, partly because the team behind it actually trains models for a living and knows what the workflow looks like from the inside.

We have been watching this category closely — Devin, SWE-agent, OpenHands, the various Cursor-derived agents — and ML engineering is a particularly interesting test case. Unlike general software work, ML has a tight, measurable feedback loop: loss curves, eval metrics, validation accuracy. An agent either makes the number go up or it does not. There is less room to hide behind plausible-looking code.

So it is worth being specific about what these agents do well, and where the limits show up in practice.

What the workflow actually looks like

The surface pitch for ml-intern and its peers is "reads papers, trains models, ships outputs." In practice the loop is more granular than that. The agent is doing roughly six things in sequence, often interleaved:

  1. Parsing a paper or spec into a list of implementation requirements.
  2. Searching for reference implementations, datasets, and pretrained checkpoints.
  3. Writing training and evaluation scripts.
  4. Launching jobs, monitoring them, and reading logs.
  5. Diagnosing failures — OOMs, NaN losses, dataloader stalls, distributed training hangs.
  6. Iterating on hyperparameters and reporting results.
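
To make the interleaving concrete, here is a structural sketch of that outer loop. Every name in it is illustrative rather than ml-intern's actual interface; the point is only that steps 1–3 run roughly once, while steps 4–6 repeat until the metric is acceptable or the budget runs out.

```python
def run_ml_task(spec, tools, budget):
    """Structural sketch only: `tools` is a hypothetical bundle of callables
    standing in for the agent's capabilities, not any real agent's API."""
    # Steps 1-3: run roughly once up front.
    requirements = tools.parse_spec(spec)                  # 1. paper/spec -> requirements
    references = tools.search_references(requirements)     # 2. repos, datasets, checkpoints
    scripts = tools.write_code(requirements, references)   # 3. training + eval scripts

    results = None
    # Steps 4-6: interleave until the result is good enough or the budget is gone.
    while budget.remaining() and not tools.good_enough(results):
        job = tools.launch(scripts)                        # 4. submit the job
        logs = tools.monitor(job)                          # 4. tail logs, watch metrics
        if job.failed():
            fix = tools.diagnose(logs)                     # 5. OOM, NaN loss, hang, ...
            scripts = tools.apply_fix(scripts, fix)
        else:
            results = tools.evaluate(job)                  # 6. metrics feed the next iteration
            scripts = tools.adjust_hyperparameters(scripts, results)
    return tools.report(results)
```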

The first three steps are where current agents are genuinely useful. Reading a paper and producing a defensible first-pass implementation is a task LLMs are well-suited to. The reference architecture is usually in the training data, the loss functions are standard, and the agent can scaffold a project that compiles and runs in minutes rather than days. For routine reproductions — a known architecture on a known dataset — the productivity gain is real.

It is the last three steps where the cracks appear, and they appear in predictable places.

Where the limits show up

The failure modes are not the ones marketing decks worry about. The agent does not hallucinate a non-existent PyTorch API and stop. It does something more subtle: it produces a training run that looks fine and is silently wrong.

A few patterns we keep seeing:

Eval contamination and metric drift. Agents are good at writing eval code that runs. They are less reliable at noticing that the eval set leaks into training, that the metric definition does not match the paper, or that the reported number is computed on a different split than the one the baseline uses. These are the kinds of mistakes a careful human reviewer catches in five minutes and an agent will defend for an hour.
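
The leakage variant, at least, is cheap to check mechanically. A minimal sketch of the kind of check a reviewer might run, assuming text examples and exact-duplicate leakage (near-duplicates need something fuzzier than this):

```python
import hashlib

def fingerprint(example: str) -> str:
    # Normalize whitespace and case before hashing so trivial formatting
    # differences do not hide an exact duplicate.
    normalized = " ".join(example.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()

def leakage_report(train_texts, eval_texts):
    train_hashes = {fingerprint(t) for t in train_texts}
    leaked = [t for t in eval_texts if fingerprint(t) in train_hashes]
    print(f"{len(leaked)} of {len(eval_texts)} eval examples also appear in training data")
    return leaked
```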

Configuration archaeology. Real ML work involves dozens of hyperparameters whose interactions are not documented anywhere. An agent will pick reasonable defaults, but "reasonable" and "what the paper actually used" diverge constantly. Learning rate schedules, warmup steps, weight decay on bias terms, gradient clipping thresholds — the agent has no strong prior on which of these matter for a given setup, and so it tunes the wrong knobs.
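
A concrete instance of that gap: many papers exclude bias and normalization parameters from weight decay, while a naive optimizer setup decays everything. Neither choice is wrong in the abstract, but they do not reproduce each other. A hedged PyTorch sketch of the convention, where the name-matching heuristic is an assumption about how the model names its parameters:

```python
import torch

def build_optimizer(model, lr=3e-4, weight_decay=0.1):
    decay, no_decay = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        # Common convention: no weight decay on biases and normalization weights.
        if name.endswith("bias") or "norm" in name.lower():
            no_decay.append(param)
        else:
            decay.append(param)
    return torch.optim.AdamW(
        [
            {"params": decay, "weight_decay": weight_decay},
            {"params": no_decay, "weight_decay": 0.0},
        ],
        lr=lr,
    )
```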

Distributed training failure modes. Single-GPU code generation is largely solved. Multi-node, multi-GPU code with FSDP or DeepSpeed is a different problem. The error messages are bad, the failure modes are stochastic, and debugging requires knowing which of NCCL, the scheduler, the storage layer, or the code is at fault. Agents tend to thrash here — they will try seven things, none of which are the actual fix, and burn through compute budget doing it.
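
What a human usually does first is shrink the hang rather than guess at the cause: turn up NCCL and torch.distributed logging, and shorten the collective timeout so a stuck job fails in minutes instead of holding nodes for the default half hour. A sketch, assuming the script runs under torchrun so rank and world size are already in the environment; the variable names are the standard NCCL/PyTorch ones, though exact behavior varies by version:

```python
import os
from datetime import timedelta

import torch.distributed as dist

# Must be set before the first NCCL call to take effect.
os.environ.setdefault("NCCL_DEBUG", "INFO")                 # per-rank NCCL logging
os.environ.setdefault("TORCH_DISTRIBUTED_DEBUG", "DETAIL")  # extra collective checks

# Shorter timeout: a hung collective surfaces as an error quickly
# instead of idling every GPU in the job for the default 30 minutes.
dist.init_process_group(backend="nccl", timeout=timedelta(minutes=10))
```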

Cost awareness. This is the one that is most underrated. A human ML engineer has a strong, mostly-implicit sense of what an experiment should cost. They will not launch a 7B finetune to test whether the dataloader works. Agents do not yet have this calibration. They will gladly spin up the full job to verify a one-line change, and the bill arrives later.
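
The fix is not sophisticated; it is the discipline of running a capped smoke test before committing real money. A hypothetical sketch of such a guard, where `train_step` is whatever callable runs one optimization step and returns the loss:

```python
import math

def smoke_test(train_step, dataloader, max_steps=20):
    """Run a handful of steps on real batches and fail loudly on obvious
    breakage before anyone launches the full job. Names are illustrative."""
    last_loss = None
    for step, batch in enumerate(dataloader):
        if step >= max_steps:
            break
        last_loss = train_step(batch)
        if not math.isfinite(last_loss):
            raise RuntimeError(f"non-finite loss at step {step}: {last_loss}")
    if last_loss is None:
        raise RuntimeError("dataloader yielded no batches")
    print(f"smoke test passed: last loss {last_loss:.4f}")
```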

What this tells us about the broader category

The interesting thing about ml-intern is not whether it is better or worse than a human ML engineer — that framing is mostly noise. The interesting question is what shape of work it changes.

Our read: agents in this category are shifting the bottleneck from implementation to judgment. The work of writing the training script, wiring up the eval, and producing the first run is collapsing toward zero. The work of deciding whether the run is trustworthy, whether the eval is honest, and whether the result is worth scaling up — that work is not collapsing. If anything, it is becoming more important, because there is now more output to vet.

This matches what we see in adjacent agent categories. SWE-bench scores keep climbing, but the value of a senior engineer who can review an agent's PR and know in thirty seconds whether it is load-bearing or theatre has gone up, not down. The same pattern is showing up in ML.

There is also a structural point about evaluation. Coding agents are easy to benchmark — tests pass or they do not. ML agents are harder, because "success" is a contested object. An agent that reproduces a paper's reported number to within 0.5% is impressive. An agent that does so by accidentally training on the test set is a liability. The benchmarks the field uses to score these systems mostly cannot tell the difference, which means published scores are running ahead of real-world reliability.

Where we would actually deploy one

Having used several of these agents on real work, the pattern that has emerged for us is narrow but useful. They earn their keep on:

  • Reproductions of well-known architectures where the ground truth is public and a human can sanity-check the eval in an afternoon.
  • Hyperparameter sweeps where the search space is bounded and the agent is doing orchestration rather than judgment.
  • Boilerplate generation — dataloaders, training loops, logging, checkpoint management — where the agent's output goes through normal code review.
  • First-pass exploration of a new dataset or task, where being approximately right quickly is more valuable than being precisely right slowly.

They are not yet reliable for novel research, for production training pipelines without human-in-the-loop review, or for any setting where eval integrity is the main thing you are paying for.

The honest read

ml-intern is a real product doing real work, and it is a meaningful step beyond the previous generation of coding-only agents. The team has clearly thought hard about the loop between code, compute, and metrics. That said, the gap between "the agent shipped a model" and "the agent shipped a model you should trust" is still wide, and closing it is not just a matter of more parameters or better tool use. It requires the agent to develop the kind of metric skepticism that ML engineers spend years acquiring — knowing when a number is too good, when an eval is rigged by accident, when a loss curve is lying.

The agents that matter in eighteen months will not be the ones that produce the most code. They will be the ones that produce the least code while being the most honest about what they did and did not verify. On that axis, the field has further to go than the headline benchmarks suggest — and the practitioners shipping these systems mostly know it.

Want to discuss this?

We write about what we're actually working on. If this is relevant to something you're building, we'd love to hear about it.