
AI Coding Agents in 2026: What Actually Ships

June 2026 · 8 min read

A practitioner's honest ledger of agentic coding - where it's a genuine force multiplier, where the demos lie, and the discipline that separates the two.



There are two stories about AI coding agents, and they're both being sold too hard.

Story one: agents are replacing engineers, a junior dev with Claude or Cursor now outputs like a senior, and if you're not all-in you're already behind. Story two: it's a bubble, the code is garbage, and anyone shipping AI-written software is building a house of cards.

I use these tools every day for real work. The truth is more interesting than either pitch, and the data backing it up is genuinely messy in a way that should make you suspicious of anyone who's certain.

The data does not agree with the hype - in either direction

Start with the study that should have humbled everyone. In July 2025, METR published a proper randomized controlled trial: experienced open-source developers, working on their own mature codebases, randomly allowed or not allowed to use AI per task. The result was the opposite of the marketing. With AI allowed, tasks took 19% longer.

The part that should stay with you: those same developers believed AI had made them about 20% faster. They were wrong about their own productivity by nearly forty points. That perception gap is, I think, the single most important fact in this entire conversation. The tools feel fast. Feeling fast and being fast are not the same thing, and our intuition cannot tell them apart.

Now the counterweight, because intellectual honesty demands it. In February 2026 METR published an update - and partly walked the result back. With a larger cohort (57 developers, 143 repositories, 800-plus tasks) the point estimates moved sharply toward zero, but with confidence intervals so wide they straddle both real slowdown and real speedup. METR's own framing is heavily hedged: they call it "very weak evidence" because of selection bias - developers kept declining to work without AI - while concluding it's "likely that developers are more sped up from AI tools now - in early 2026 - compared to our estimates from early 2025." Translation: the tools and the harnesses improved fast, the brutal 2025 number probably no longer describes current tools, and we still can't measure the real effect cleanly. The gap narrowed. The uncertainty didn't.

Zoom out to the industry and the picture sharpens into a paradox. Google's 2025 DORA report puts AI adoption among developers at 90%, with a median of two hours a day spent working with it and over 80% reporting it makes them more productive. Meanwhile the 2025 Stack Overflow Developer Survey has 84% of developers using or planning to use AI tools - while trust is sliding the other way: positive sentiment fell from over 70% to 60% in a single year, and more developers now actively distrust the accuracy of AI output (about 46%) than trust it (about 33%). The most experienced engineers are the most skeptical of all.

So: near-universal adoption, daily use, widespread belief that it helps - paired with falling trust and a controlled trial that struggled to measure the gains everyone feels. Both things are true. Hold them at once.

AI amplifies what's already there

The most useful framing in that whole pile of research is DORA's: AI is an amplifier, not a fixer. It doesn't make a struggling team good. It makes a good team faster and a chaotic team more chaotic, faster.

The same report found AI adoption correlated with higher delivery throughput but a negative relationship with delivery stability. Read that twice. Teams are shipping more, and breaking more. The acceleration is real, and so is the wreckage it exposes downstream. If your testing, review, and deployment discipline was shaky before, an agent doesn't paper over that - it pours more code through the same cracks, faster.

This matches everything I've seen. The engineers getting enormous value from these tools already knew how to decompose a problem, write a clear spec, and review a diff with a critical eye. The ones drowning expected the agent to supply judgment they didn't have. The tool amplified the gap between them.

Where agents actually ship

Stripped of hype, here's where I reliably get real leverage:

Well-scoped, well-specified tasks. "Add a rate limiter to this endpoint using the existing middleware pattern, 100 requests per minute per key, return 429 with a Retry-After header." Concrete, bounded, verifiable. Agents are excellent at this.
Greenfield and boilerplate. New files, scaffolding, glue code, config, the tenth CRUD endpoint that looks like the previous nine. Low context required, low stakes, fast to verify.
Tests and throwaway tooling. Generating test cases, one-off scripts, data migrations - work where the cost of a mistake is low and the feedback loop is immediate.
Translation and mechanical refactors. Porting between languages, applying a consistent rename, mechanical pattern changes across many files. Tedious for a human, fast and accurate for an agent.
Exploration. Throwing three rough approaches at a problem to see which feels right before I commit. The code is disposable; the thinking it provokes isn't.

The common thread: bounded scope, low required context, and a fast way to verify the output. When all three hold, agents are a genuine multiplier and I'd be slower without them.
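To make "concrete, bounded, verifiable" tangible, here is a minimal sketch of what the rate-limiter brief above could produce. It is an illustration under stated assumptions, not the output of any particular agent: the article's "existing middleware pattern" is unknown, so this version is framework-agnostic, and the names `SlidingWindowRateLimiter` and `handle_request` are hypothetical. It assumes a single process with in-memory state and uses a sliding window, which is one of several reasonable readings of "100 requests per minute per key".

```python
import math
import time
from collections import defaultdict, deque


class SlidingWindowRateLimiter:
    """Allow at most `limit` requests per `window_seconds` for each key.

    Assumption: single process, in-memory state. A real deployment would
    likely need shared storage and locking.
    """

    def __init__(self, limit=100, window_seconds=60.0, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so tests can control time
        self._hits = defaultdict(deque)  # key -> timestamps of accepted requests

    def check(self, key):
        """Return (allowed, retry_after_seconds) for one request from `key`."""
        now = self.clock()
        hits = self._hits[key]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) < self.limit:
            hits.append(now)
            return True, 0
        # The oldest accepted request determines when the next slot frees up.
        retry_after = math.ceil(hits[0] + self.window_seconds - now)
        return False, max(1, retry_after)


def handle_request(limiter, api_key):
    """Hypothetical handler: returns (status_code, headers) for one request."""
    allowed, retry_after = limiter.check(api_key)
    if not allowed:
        # Retry-After carries a whole number of seconds.
        return 429, {"Retry-After": str(retry_after)}
    return 200, {}
```

The injectable `clock` is the part that matters for the argument: it lets a reviewer verify the limit, the 429, and the header in a few lines of test code without waiting a real minute, which is exactly the fast verification loop this kind of task depends on.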

Where the demos lie

And here's where the polished launch videos quietly mislead:

Large, mature, interconnected codebases. This is exactly where METR's experienced developers slowed down. The agent doesn't hold the whole system in its head, doesn't know the three undocumented reasons that module is weird, and confidently produces code that's locally plausible and globally wrong.
Novel logic with no precedent to pattern-match. Agents are extraordinary pattern-matchers and mediocre inventors. The genuinely new algorithm, the subtle concurrency problem, the thing that isn't in the training data - that's still yours.
The "almost right" tax. The biggest frustration in the Stack Overflow survey, cited by 66% of developers, was AI output that's almost right but not quite - which leads straight to the second-biggest: debugging AI-generated code taking more time, not less. An answer that's 90% correct can be slower than no answer, because you have to find and fix the hidden 10% you didn't write and don't fully understand.
Anything where you can't quickly tell if it's correct. If verifying the output is as hard as producing it, the agent hasn't saved you the work - it's just moved it, and added a layer of unfamiliarity on top.

The discipline that separates the two

The difference between developers who win with these tools and developers who get burned isn't prompt-craft. It's old-fashioned engineering discipline, which the tools reward more than ever:

Decompose ruthlessly. Agents are good at small, clear tasks and bad at big, vague ones. Your job is turning a fuzzy goal into a sequence of bounded steps. This was always a senior skill. Now it's the whole game.

Specify like you mean it. A good prompt is just a good ticket. If you can write a brief a human contractor could execute, you can drive an agent. If you can't, no model will rescue you.

Review every line like it came from a stranger. Because it did. The agent has no stake in your system and no memory of why things are the way they are. Code you don't understand is a liability whether a human or a model wrote it - but at least the human could explain their reasoning.

Know when to put it down. For something you already know how to do in five minutes, the round-trip of prompting, reading, and verifying can take ten. The skill is pattern recognition about which tasks benefit and which don't, and it only comes from paying attention to where the tool actually helped versus where it just felt like it did.
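The five-minutes-versus-ten comparison can be written down as a back-of-the-envelope model. This is a sketch, not a measured result: the function names are hypothetical, and it assumes the agent route always costs prompting plus review time, with some probability of an extra debugging pass when the output turns out to be almost right. The numbers in the usage note are made up for illustration.

```python
def agent_expected_minutes(prompt_min, review_min, p_wrong, debug_min):
    """Expected minutes for the agent route.

    Prompting and review are always paid; debugging is paid with
    probability `p_wrong` (the "almost right" case).
    """
    return prompt_min + review_min + p_wrong * debug_min


def worth_delegating(manual_min, prompt_min, review_min, p_wrong, debug_min):
    """True when the agent route is expected to beat doing the task by hand."""
    expected = agent_expected_minutes(prompt_min, review_min, p_wrong, debug_min)
    return expected < manual_min
```

With illustrative inputs, a task you can do by hand in 5 minutes loses to the agent route if prompting takes 3 minutes, review takes 4, and there is a 30% chance of a 10-minute debugging pass: `worth_delegating(5, 3, 4, 0.3, 10)` is `False`. The same overhead against a 60-minute task comes out the other way. The model is crude, but writing the estimate down before the task and comparing it with the actual time afterwards is one way to check whether the tool helped or only felt like it did.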

The bottom line

AI coding agents in 2026 are a real, significant productivity tool for people who already know what they're doing - and a generator of confident, plausible, subtly wrong code for people who don't. The technology improved fast enough in roughly a year to turn a measured slowdown into something closer to a wash, possibly a slight gain, and the trajectory is still up.

But the gains aren't free and they aren't automatic. They go to the engineers with the judgment to aim the tool, the discipline to verify it, and the honesty to notice when it's making them feel productive instead of being productive. The fundamentals didn't get less important. They got more valuable - because now they're the thing standing between a useful agent and a fast way to ship bugs.

