Security

Prompt Injection Is the New Attack Surface

June 2026 · 8 min read

The moment you give an AI agent tools, the prompt becomes your security boundary - and it's one we don't yet know how to reliably defend.

SecurityAIPrompt InjectionAgents

Prompt Injection Is the New Attack Surface

For most of the last two years, prompt injection felt like a parlor trick. You'd convince a chatbot to ignore its instructions and talk like a pirate, everyone would laugh, and the model vendor would patch the specific phrasing. Mildly embarrassing. Not a security incident.

That era is over. The thing that changed isn't the models - it's what we let them touch.

The moment you hand an AI agent real tools - your inbox, your shell, your codebase, a browser, a database - the text it reads stops being content and starts being instructions it might act on. And it cannot reliably tell the difference. That's not a bug someone forgot to fix. It's the architecture.

The boundary moved, and most people didn't notice

Here's the mental model shift that took me a while to internalize.

In a normal application, you have a clean line between code and data. Your program is the instructions; user input is the data the instructions operate on. SQL injection, XSS, command injection - the entire injection family - are all the same failure: attacker-controlled data crossing the line and getting executed as instructions. We've spent thirty years building defenses for that line. Parameterized queries. Output encoding. Sandboxing.

Large language models erase the line entirely. To an LLM, the system prompt, your instructions, the email it just summarized, and the web page it just fetched are all the same thing: tokens in a context window. There is no privileged channel. The "instructions" and the "data" are made of the identical substance, and the model decides what to obey based on what reads as most authoritative - which an attacker can absolutely manipulate.

This is why the OWASP Top 10 for LLM Applications lists prompt injection as LLM01 - the number one risk. Not because it's the flashiest, but because it's foundational and unsolved.

EchoLeak was the wake-up call

If you want the moment prompt injection graduated from theory to "patch this now," it was EchoLeak.

In June 2025, researchers at Aim Labs disclosed CVE-2025-32711, a zero-click vulnerability in Microsoft 365 Copilot, rated CVSS 9.3. The attack was almost insulting in its elegance: send the target an email. That's it. No link to click, no attachment to open, no user interaction at all.

The email contained instructions written in plain language - phrased to look like it was addressed to a human, specifically to slip past Microsoft's cross-prompt-injection classifier. When the victim later asked Copilot something unrelated, Copilot would pull that email into its context as part of its normal retrieval, read the hidden instructions, and exfiltrate internal data out through an auto-loaded image and a permitted proxy. Aim called it an "LLM Scope Violation": the assistant was tricked into crossing its own trust boundary and leaking data it was authorized to see but never authorized to send.

What makes EchoLeak important isn't that Microsoft got caught - they patched it fast and handled it well. It's that the attack chained together four separate, individually-reasonable design decisions into a full data-exfiltration exploit, with zero user error. The user did nothing wrong. There was nothing for them to do wrong. That's the new shape of the threat.

The lethal trifecta

The cleanest framework I've found for reasoning about this comes from Simon Willison, who in June 2025 named the lethal trifecta. An AI agent becomes dangerous when it combines three capabilities:

Access to private data - your emails, files, internal systems.
Exposure to untrusted content - anything an attacker can influence: a web page, an email, a PR comment, a calendar invite.
The ability to communicate externally - send an email, make a request, write to a shared surface.

Any agent with all three is, by Willison's argument, exposed to data exfiltration that you cannot reliably engineer away. Not "safe as long as you write a good system prompt." Not "safe as long as you use a bigger model." As he puts it, given "the infinite number of different ways that malicious instructions could be phrased," how confident can you really be that your protection works every single time? Better alignment and safety training make the attack harder to pull off, not impossible - and because these systems are non-deterministic, "harder" is not a security guarantee.

Once you internalize the trifecta, you start seeing it everywhere. The helpful agent that reads your email and can also send email? Trifecta. The coding agent that browses docs and can also open pull requests? Trifecta. The support bot that reads tickets and can hit internal APIs? Trifecta. Most genuinely useful agents are useful precisely because they have all three. That's the tension at the center of this whole field.

You cannot filter your way out

The instinct - and it's the wrong one - is to treat this like spam. Build a classifier, scan the input for malicious instructions, block the bad ones.

EchoLeak is the counterexample that should end that conversation. Microsoft had a cross-prompt-injection classifier. The attacker beat it by writing the malicious instructions to sound like a normal message to a human. Because language is open-ended, there is no clean signature for "this text is secretly an instruction to the AI." Any filter good enough to catch real attacks will also block legitimate content, and any filter loose enough to allow legitimate content will miss cleverly-phrased attacks. You're playing whack-a-mole against natural language, and natural language has infinite moves.

This is the part people resist, because filtering is the cheap, familiar answer. It feels like progress. It catches the demos. It does not catch a motivated attacker, and a security control that only stops unmotivated attackers isn't a security control.

Defense is an architecture problem

The good news is that the problem is tractable - just not at the layer most people are looking.

If you can't trust the model to police itself, you stop relying on it to. You move the security boundary out of the prompt and into the system around it. The research that's most convincing here is CaMeL ("Defeating Prompt Injections by Design"), from researchers at Google DeepMind and ETH Zürich, which borrows the oldest ideas in security: control-flow integrity, least privilege, and information-flow control.

The core pattern is separation. A privileged model handles trusted instructions and plans the work but never sees untrusted content. A second, quarantined model processes the untrusted content but has no access to tools and can't change the plan. Data gets tagged with capabilities - where it came from, what it's allowed to do - and the system, not the model, enforces what can flow where. In testing on the AgentDojo benchmark, CaMeL completed 77% of tasks with provable security, against 84% for an undefended agent. You give up a little capability to make the attack structurally impossible rather than merely discouraged. That's a trade worth making for anything touching real data.

You don't need to implement a research paper to apply the principle. The practical version, the stuff I actually push when I'm looking at someone's agent design:

Assume every external input is hostile. Treat the content your agent reads exactly like you'd treat an anonymous user's form submission - because functionally, that's what it is.
Break the trifecta on purpose. If an agent handles private data and reads untrusted content, take away its ability to send data out. If it needs to send data out, don't also feed it untrusted content. Removing any one leg defuses the worst outcome.
Least privilege, scoped tight. An agent that summarizes documents does not need send-email permission. Give each agent the smallest set of tools its job requires, and nothing more.
Put a human in the loop on consequential actions. Sending money, deleting data, emailing externally, merging code - these should require explicit confirmation, not an agent's unilateral decision based on text it read somewhere.
Log what the agent did, not just what it said. Your audit trail needs to capture tool calls and data flows, because that's where the damage happens.

The bottom line

Prompt injection isn't a phase the technology will grow out of. It's the direct consequence of building systems that take instructions in natural language and act on the world. As long as that's the design, the input is the attack surface.

The teams that get burned over the next couple of years will be the ones who bolted an agent onto sensitive systems, sprinkled in a content filter, and called it secure. The teams that do well will treat AI agents the way we already treat any other untrusted-input-handling component: with skepticism, with isolation, and with the boundary enforced by the architecture instead of by hope.

We've solved injection problems before. We did it by stopping the data from ever being executed as instructions - not by getting better at guessing which data was malicious. The same move works here. We just have to actually make it.

Sources

Building agents that touch real systems and want a second set of eyes on the threat model? Ironwright works with teams to design AI integrations that are useful without being a liability - security built in from the architecture up, not bolted on after.