Why adversarial prompt engineering is not the problem — and what actually is

In early 2023, a group of researchers demonstrated something that made security people uncomfortable and product people dismissive.

They showed that a language model could be instructed to do things its creators never intended, not by the person using it, but by content it was asked to process.

The paper was called “Not what you’ve signed up for.” The attack was called indirect prompt injection.

Three years later, the industry still has not fully absorbed the lesson.


The fixation on prompt injection

If you follow AI security discourse, you would think prompt injection is the central problem. It dominates conference talks. It tops the OWASP list. It generates endless proof-of-concept videos.

And it should get attention. It is a real vulnerability.

But the fixation on prompt injection obscures a more important truth: prompt injection is a symptom, not the disease.

The disease is that we have built systems that blur the boundary between data and instruction, between reading and acting, between assistance and agency — and then we secured only the reading part.

When an agent can execute actions, the attack surface is not the prompt. It is the entire pipeline from input to side effect.

Most security programs are still focused on the pipeline from input to output. That gap — between output and action — is where the real damage lives.


How we got here

To understand the gap, you need to understand the trajectory.

Phase one was chat. You ask a model a question, it answers. The worst case is a wrong answer. Security concern: minimal.

Phase two was retrieval. You ask a model a question, it searches documents and answers. The worst case is the model retrieving something adversarial and repeating it. Security concern: information integrity.

Phase three is agency. You ask a model a question, it searches documents, reasons about them, and then does something: sends an email, updates a ticket, calls an API, modifies a file, triggers a workflow.

The worst case in phase three is not bad text. It is unauthorized action taken under your identity, with your credentials, against your systems.

The security model, however, barely changed between phases.

We added more tools, more integrations, more autonomy. We did not add proportionally more controls.

That is the agent security gap.


The taxonomy nobody uses

Research has given us useful language. The problem is that most teams ignore it.

The original indirect prompt injection paper by Greshake, Abdelnabi, Mishra, and their co-authors offered a taxonomy that still holds:

Direct prompt injection — the user explicitly tries to manipulate the model. This is the “ignore previous instructions” style attack. It gets the most attention because it is the most visible.

Indirect prompt injection — an adversary embeds instructions in data the model will process. A web page. An email. A document. A tool response. The user does not see the injection; the model does.

This distinction matters enormously for threat modeling because the trust assumptions are different.

In direct injection, the attacker is the user. You can rate-limit, monitor, and apply behavioral analysis.

In indirect injection, the attacker is anyone who can influence data the model consumes. That is a much larger set.

Think about what an AI agent processes in a typical enterprise deployment:

  • internal documents from wikis and knowledge bases,
  • customer emails and support tickets,
  • web pages fetched during research tasks,
  • code from repositories,
  • tool responses from third-party APIs,
  • messages from multiple chat channels.

Every one of these is an indirect injection vector.

If your threat model only accounts for the user typing something malicious, you have modeled the easy case and missed the dangerous one.


The compounding effect: from injection to action

Here is the part that most discussions skip.

Prompt injection becomes dramatically more dangerous when combined with two other properties: insecure output handling and excessive agency.

OWASP’s Top 10 for LLM Applications lists these as separate vulnerabilities — LLM01, LLM02, and LLM08 in the 2023 edition. In practice, they form a chain.

The chain looks like this:

  1. An adversary plants crafted content in a document the agent will process. (Prompt injection — LLM01)

  2. The model incorporates the injected instruction into its reasoning and generates output designed to exploit downstream systems. (Insecure output handling — LLM02)

  3. The agent executes that output as an action — sending data to an external endpoint, modifying a record, calling an API with stolen credentials — because it has been granted autonomy to act. (Excessive agency — LLM08)

Any one of these in isolation is manageable. Together, they are an exploit chain.

And here is what makes this worse than traditional software vulnerabilities: the chain does not require code execution on the target system. It requires language.

Language is the new exploit payload.


Why model improvements will not save you

There is a persistent hope that better models will solve this.

Every new model release is accompanied by claims of improved instruction-following, better refusal behavior, stronger alignment. And these improvements are real. Models do get better at rejecting obvious manipulation attempts.

But the research tells a more complicated story.

The Zou et al. paper, “Universal and Transferable Adversarial Attacks on Aligned Language Models,” showed that adversarial suffixes — seemingly random token sequences — can cause aligned models to produce objectionable content across multiple model architectures, including models the attacker never had direct access to. The attacks were transferable.

This means the attacker does not need to find a weakness in your specific model. They need to find a weakness in the class of models, and that weakness propagates.

More recent work has shown similar transferability for jailbreak techniques, cross-model prompt injection, and even multi-turn conversation attacks where the injection is distributed across several exchanges — a particularly dangerous pattern because it defeats single-turn evaluation and exploits the agent’s accumulated state rather than any single input.

The distributed multi-turn attack is arguably the most operationally relevant threat in production agent systems. It is also the least addressed by current security architectures, most of which evaluate one action proposal at a time without understanding the conversation trajectory that produced it.

The implication is clear: model quality improves, but so does attack sophistication. It is an arms race, and the defender’s position is structurally harder because the attacker only needs one successful path while the defender must close all of them.

Relying on model alignment as your primary defense against adversarial prompt engineering is like relying on employees to never open phishing emails. It helps. It is not sufficient.


The architectural response that actually works

If model-level defenses are necessary but insufficient, what fills the gap?

Architecture.

The security properties you need cannot depend on the model always behaving correctly. They must hold even when the model is confused, manipulated, or actively adversarial.

This requires a fundamentally different design philosophy: assume the model will be compromised, and build systems that limit what a compromised model can do.

In practice, this means six architectural principles.

1. Separate instruction channels from data channels

The root cause of indirect prompt injection is that the system treats retrieved data and system instructions as the same kind of thing: tokens in a context window.

Secure architectures separate these. System instructions come from a trusted, integrity-protected channel. Data comes from untrusted sources and is labeled as such.

But here is the nuance most accounts skip: true separation is impossible at the model layer.

Transformer architectures do not have distinct instruction and data memory spaces. Everything — system prompts, user messages, retrieved documents, tool outputs — becomes the same sequence of probabilistic tokens processed by the same attention mechanism. The model does not natively distinguish between “this is an instruction I should follow” and “this is content I should reason about.” That distinction is an emergent behavior shaped by training, not a structural property of the architecture.

This means data/instruction separation is not a model capability. It is an orchestration-layer constraint — enforced by how the application wraps the model: trust labels at ingestion, role boundaries in context construction, and output policies that prevent data-derived content from flowing into action decisions.

The model will always blend instruction and data internally. Security relies entirely on external orchestration ensuring that blending does not escape into unauthorized action.

The model should be able to reason about data without data becoming instruction. This is an input architecture problem, not a model behavior problem — but it must be acknowledged that the architecture in question is the application wrapper, not the model itself.
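A minimal sketch of what this orchestration-layer constraint can look like, in Python. Every name here (`Trust`, `ContextBlock`, the tag format) is illustrative rather than drawn from any specific framework; the point is that trust labels and provenance travel with the content, even though the model itself still sees one undifferentiated token stream:

```python
from dataclasses import dataclass
from enum import Enum

class Trust(Enum):
    SYSTEM = "system"        # integrity-protected instructions
    USER = "user"            # authenticated user input
    UNTRUSTED = "untrusted"  # retrieved docs, tool output, web content

@dataclass(frozen=True)
class ContextBlock:
    text: str
    trust: Trust
    source: str  # provenance: URL, doc id, tool name

def build_prompt(blocks: list[ContextBlock]) -> str:
    """Assemble the context window with explicit trust labels.

    The model still sees one token stream -- the labels do not create
    real separation at the model layer -- but they let downstream
    policy checks trace which sources influenced a proposed action.
    """
    parts = []
    for b in blocks:
        parts.append(
            f"<{b.trust.value} source={b.source!r}>\n{b.text}\n</{b.trust.value}>"
        )
    return "\n".join(parts)

def may_drive_action(blocks: list[ContextBlock]) -> bool:
    # Output policy: an action decision must not rest solely on
    # untrusted data -- some trusted instruction or user intent
    # has to be present in the supporting context.
    return any(b.trust in (Trust.SYSTEM, Trust.USER) for b in blocks)
```

The `may_drive_action` check is the output-policy half of the constraint: a proposal whose only support is untrusted content should never reach execution directly.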

2. Make action authorization external to the model

The model can propose actions. A separate policy engine must authorize them.

This is the most important principle and the one most frequently violated.

When the model both decides what to do and is trusted to do it, you have a single point of failure. A successful prompt injection compromises both reasoning and execution simultaneously.

The policy engine should be deterministic, rule-based, and independent of model output. It evaluates proposed actions against explicit criteria: who is requesting, what is the action, what is the target, what is the impact tier, and whether human approval is required.

This is not novel. It is the same principle behind transaction authorization in banking, change management in infrastructure, and approval workflows in enterprise software. The difference is that AI systems often skip it in the name of developer experience.

But here is where the abstraction breaks down.

Traditional policy engines evaluate structured payloads: fixed fields, enumerated values, well-defined schemas. An LLM’s action proposals are not like that. They are natural-language reasoning outputs that must be parsed into structured parameters before any rule can fire. That parsing step is itself a non-deterministic, fuzzy operation — and it is exactly where the gap between “deterministic policy” and “model output” becomes porous.

A policy engine that evaluates {"action": "delete_user", "target": "uid-1234"} is straightforward. A policy engine that must extract that intent from “Based on the conversation history, I believe the best course of action is to remove the problematic account referenced earlier” is a different engineering problem entirely.

This means the policy engine alone is not sufficient. You need three things working together:

  1. Structured action schemas — constrain the model’s output to a defined schema (function calling, tool-use formats) rather than free-form text. This moves the boundary between fuzzy and structured as close to the model as possible.

  2. Schema-level policy rules — evaluate the structured output against deterministic criteria: action type, target scope, credential audience, impact tier.

  3. Confidence-aware routing — when the model’s structured output falls near policy boundaries (e.g., a write action to a resource that could be either benign or destructive depending on context), route to human review rather than relying on the binary pass/fail of the policy engine.

The policy engine must be deterministic. But the path from model output to policy input requires careful engineering, and that path is where most implementations fail silently.
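Putting the three pieces together in a sketch. The action names, tier registry, and thresholds below are invented for illustration; the property that matters is that `authorize` is pure, deterministic, and runs entirely outside the model:

```python
from dataclasses import dataclass
from enum import Enum

class Verdict(Enum):
    ALLOW = "allow"
    DENY = "deny"
    REVIEW = "review"  # route to a human

@dataclass(frozen=True)
class ActionProposal:
    action: str        # from a constrained tool-call schema, not free text
    target: str
    impact_tier: int   # 0 = read-only ... 3 = irreversible/destructive
    confidence: float  # how cleanly the proposal parsed into the schema

# Registry of approved actions and their expected impact tiers.
ALLOWED_ACTIONS = {"read_doc": 0, "update_ticket": 1,
                   "send_email": 2, "delete_user": 3}

def authorize(p: ActionProposal) -> Verdict:
    """Deterministic, rule-based check -- no model in the loop."""
    if p.action not in ALLOWED_ACTIONS:
        return Verdict.DENY          # deny-by-default
    if p.impact_tier != ALLOWED_ACTIONS[p.action]:
        return Verdict.DENY          # tier must match the registry
    if p.impact_tier >= 2:
        return Verdict.REVIEW        # high impact: human approval
    if p.confidence < 0.9:
        return Verdict.REVIEW        # near a policy boundary: escalate
    return Verdict.ALLOW
```

Note that the confidence-aware branch returns `REVIEW`, not `ALLOW`: ambiguity at the fuzzy-to-structured boundary is treated as a reason for escalation, never a reason for leniency.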

3. Implement progressive capability scoping

Agents should not start with maximum privilege.

The default should be read-only discovery. Write actions, external API calls, and destructive operations should require explicit elevation through scoped challenges.

This mirrors OAuth scope minimization and military need-to-know principles. An agent that can answer questions about your wiki should not, by default, be able to delete wiki pages.

The scope model should be visible, auditable, and enforced at the infrastructure level — not just in the model’s system prompt.
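A sketch of time-boxed scope elevation, with hypothetical names throughout; in a real deployment the enforcement point would be a credential broker or gateway rather than in-process state, and the audit line would go to a durable log rather than stdout:

```python
import time

class ScopeError(PermissionError):
    pass

class AgentSession:
    """Read-only by default; write scopes granted per elevation, time-boxed."""

    def __init__(self) -> None:
        # scope name -> expiry timestamp; "read" never expires
        self._scopes: dict[str, float] = {"read": float("inf")}

    def elevate(self, scope: str, approver: str, ttl_s: float = 300.0) -> None:
        # Elevation is an explicit, audited, infrastructure-level event,
        # never something the model grants itself via its own prompt.
        print(f"AUDIT: {approver} granted {scope!r} for {ttl_s}s")
        self._scopes[scope] = time.monotonic() + ttl_s

    def require(self, scope: str) -> None:
        if self._scopes.get(scope, 0.0) < time.monotonic():
            raise ScopeError(f"scope {scope!r} not held or expired")
```

An agent answering wiki questions holds only `"read"`; a later `delete_page` tool call fails at `require("write")` until a human (or policy) explicitly elevates, and the grant expires on its own.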

4. Add human checkpoints for high-impact actions

The industry has been slow to accept this because it conflicts with the “fully autonomous agent” narrative.

But the narrative is wrong for production systems.

Not every action needs human approval. But actions that are irreversible, externally visible, or affect sensitive data absolutely should.

The design challenge is not eliminating human involvement. It is making human involvement precise and non-ceremonial. Show the user exactly what will happen, in plain language, and let them approve or reject. Make the default safe, not fast.

Approval fatigue is a real problem — and it is more severe than most discussions acknowledge. In enterprise environments, even heavily scoped “high-impact” systems generate action volumes that can overwhelm human reviewers within days of deployment. A simple binary “approve/reject for destructive actions” gate does not scale.

The solution is dynamic, contextual risk scoring between the policy engine and the human queue:

  1. Risk tier classification — every proposed action receives a risk score based on a combination of structural properties (action type, target sensitivity, credential scope) and contextual properties (source of the triggering input, whether the input chain crosses trust boundaries, how close the action falls to known adversarial patterns).

  2. Secondary evaluator model — a lightweight, heavily constrained model (not the primary agent model) evaluates the action proposal against historical patterns and known attack signatures. This model is read-only, cannot take actions itself, and produces only a confidence score. It acts as a triage layer, not a decision maker.

  3. Adaptive routing — high-confidence benign actions auto-approve. Low-confidence or high-impact actions route to human review with the evaluator’s risk score and reasoning attached, so the reviewer has context rather than a raw action dump.

  4. Feedback loop — human decisions feed back into the risk scorer, improving triage accuracy over time and reducing false-positive review volume.

This is not about eliminating human involvement for the actions that matter most. It is about ensuring that human attention is directed where it adds the most value — which requires a scoring layer that can distinguish between genuinely risky proposals and noisy false positives at scale.
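The routing logic itself reduces to a small amount of code. The weights and threshold below are placeholders that, per the feedback loop in step 4, would be calibrated against real human approval decisions rather than chosen by hand:

```python
def risk_score(action_tier: int,
               crosses_trust_boundary: bool,
               matches_attack_pattern: float) -> float:
    """Blend structural and contextual signals into a [0, 1] risk score.

    Weights are illustrative; in practice they are calibrated against
    human approval decisions (the feedback loop in step 4).
    """
    score = 0.25 * action_tier               # structural: tier 0..3 -> 0..0.75
    if crosses_trust_boundary:
        score += 0.15                        # contextual: untrusted input chain
    score += 0.10 * matches_attack_pattern   # secondary-evaluator signal, 0..1
    return min(score, 1.0)

def route(score: float, auto_approve_below: float = 0.3) -> str:
    """Adaptive routing: auto-approve only well below the risk threshold."""
    if score < auto_approve_below:
        return "auto-approve"
    return "human-review"  # delivered with score and rationale attached
```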

5. Secure state continuity across conversation lifecycles

This is the gap most architectural discussions skip entirely, and it is where distributed multi-turn attacks live.

Most security models evaluate one turn at a time: the model receives input, proposes an action, the policy engine evaluates it. This works for single-turn injection but collapses against distributed attacks where an adversary’s intent is assembled gradually across multiple exchanges — possibly spanning context-window boundaries.

Consider a realistic multi-turn attack:

  • Turn 1: User asks the agent to research a topic. Agent fetches a web page containing a subtle framing directive embedded in seemingly benign content.
  • Turn 2: User asks a follow-up question. The framing directive from turn 1 influences the model’s reasoning but does not trigger action yet.
  • Turn 3: User asks the agent to perform an action. The accumulated framing from turns 1–2 shapes the action proposal in a way that no single-turn policy engine would catch, because no single turn contained a complete attack.

The problem is not that the policy engine fails. It is that the policy engine evaluates each proposal in isolation, missing the pattern that only emerges across the conversation’s state trajectory.

This is fundamentally a state integrity problem, and it operates at three levels:

Level 1: Within a single context window. The model’s attention mechanism carries influence from earlier turns into later reasoning. There is no clean boundary between “what was said before” and “what is being decided now.” The model’s hidden states are the attack surface, and they are opaque to the policy engine.

Level 2: Across context-window resets. Many systems summarize or compress conversation history when context limits are reached. An adversary who understands the summarization logic can craft payloads that survive compression — instructions that are generic enough to persist through summarization but specific enough to steer future reasoning.

Level 3: Across session boundaries. Persistent memory systems, knowledge bases, and tool-call history can carry adversarial influence across entirely separate conversation sessions. An injection planted in one session can resurface weeks later when the agent retrieves relevant context.

What a robust architecture must do:

  1. Conversation-level intent tracking. Maintain a running state summary of the agent’s current task trajectory — what it is trying to accomplish, what actions it has proposed, what data sources have influenced reasoning. This summary is evaluated holistically, not turn-by-turn, when high-impact actions are proposed.

  2. Turn-boundary freshness checks. Before executing high-impact actions, verify that the action’s rationale is traceable to recent, explicit user intent — not just accumulated context from untrusted sources processed earlier in the conversation. If the primary justification for an action traces back to untrusted data ingested several turns ago, escalate to human review.

  3. Context-window reset as a security event. Treat summarization and context compression as security-sensitive operations, not just memory management. Ensure that compression does not silently preserve adversarial directives. This may require adversarial testing of the summarization pipeline itself.

  4. Cross-session state audit. Persistent memory and knowledge stores that feed into agent reasoning should be treated as trust boundaries with their own integrity checks. When an agent retrieves a stored fact that influences a high-impact action, the retrieval source and provenance should be part of the policy evaluation.

  5. Anomaly detection over conversation trajectories. Monitor not just individual actions but the pattern of an agent’s behavior over time. A sudden shift in action type, target scope, or risk profile mid-conversation — especially after processing untrusted content — is a signal that deserves investigation regardless of whether any single action triggers a policy violation.

The key insight: a policy engine that only evaluates individual action proposals is blind to the attack vector that matters most in production — the slow, distributed manipulation of agent state across turns, windows, and sessions.
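A sketch of the turn-boundary freshness check (point 2 above), exercised against the three-turn attack pattern described earlier. The data model is illustrative; the principle is that every influence on the agent's reasoning carries a trust label and a turn number, so staleness is checkable before a high-impact action executes:

```python
from dataclasses import dataclass, field

@dataclass
class Influence:
    source: str    # doc id, URL, tool name, or user message
    trusted: bool  # did it cross a trust boundary on the way in?
    turn: int      # when it entered the conversation

@dataclass
class Trajectory:
    """Running record of what has shaped the agent's reasoning."""
    current_turn: int = 0
    influences: list[Influence] = field(default_factory=list)

    def record(self, source: str, trusted: bool) -> None:
        self.influences.append(Influence(source, trusted, self.current_turn))

    def is_fresh(self, max_age_turns: int = 1) -> bool:
        """Before a high-impact action: is the rationale traceable to
        recent, explicit user intent, free of stale untrusted influence?
        Returning False means escalate to human review."""
        stale_untrusted = [i for i in self.influences
                           if not i.trusted
                           and self.current_turn - i.turn > max_age_turns]
        recent_trusted = [i for i in self.influences
                          if i.trusted
                          and self.current_turn - i.turn <= max_age_turns]
        return bool(recent_trusted) and not stale_untrusted
```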

6. Build observability into the action layer

Most AI observability focuses on model behavior: latency, token usage, output quality.

Security observability needs to focus on action behavior: what was proposed, what was authorized, what was executed, and what changed.

This means complete audit trails for every agent-initiated action, with enough context to reconstruct intent. Not just “the agent called API X,” but “the agent called API X because of reasoning chain Y, which was influenced by data source Z.”

Without this, incident investigation is guesswork.
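A sketch of what one such audit record can look like. Field names are illustrative; what matters is that the action, the authorization verdict, and the provenance of the influencing data sources all land in the same append-only record:

```python
import json
import time
import uuid

def audit_event(action: str, target: str, verdict: str,
                reasoning_summary: str, influenced_by: list[str]) -> str:
    """Emit one audit record per agent-initiated action.

    Enough context to reconstruct not just what happened, but why:
    the action, the authorization outcome, and the data sources
    (provenance) that influenced the proposal.
    """
    record = {
        "id": str(uuid.uuid4()),
        "ts": time.time(),
        "action": action,
        "target": target,
        "verdict": verdict,
        "reasoning": reasoning_summary,
        "provenance": influenced_by,  # doc ids, URLs, tool names
    }
    line = json.dumps(record, sort_keys=True)
    # In production: ship to an append-only, tamper-evident store.
    return line
```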


The operational gap

Even teams that understand the architecture often fail operationally.

Here is what the operational gap looks like:

  • The security team writes guidelines. The product team ships features. The two documents never meet.
  • Threat models are created during design reviews and never updated after deployment.
  • Red-team exercises test model behavior but not end-to-end action chains.
  • Incident response playbooks assume traditional infrastructure compromise, not adversarial model manipulation.
  • Monitoring dashboards track model performance but not action anomaly patterns.

The result is a system that looks secure on paper and is porous in practice.

Closing this gap requires treating agent security as a cross-functional operational discipline, not a checklist item for the security team.


What good looks like in practice

A mature agent security program has these properties:

  1. Every agent capability has an explicit trust boundary documented and enforced.

  2. Untrusted data is labeled at ingestion and cannot be promoted to instruction without explicit, auditable policy approval — with the recognition that separation is enforced at the orchestration layer, not the model layer.

  3. Action authorization is handled by a deterministic policy engine operating on structured action schemas — and the path from model output to structured input is itself a secured, validated translation step.

  4. High-impact actions require human approval, routed through a dynamic risk-scoring layer that uses contextual analysis and a constrained secondary evaluator to manage review volume at enterprise scale.

  5. Scope is minimal by default and elevated only through explicit, scoped challenges.

  6. State continuity across turns, context-window resets, and session boundaries is tracked, validated, and treated as a first-class security surface — not assumed to be benign.

  7. Every agent-initiated action produces an audit event with full provenance, including the conversation trajectory and data sources that influenced the decision.

  8. Adversarial testing covers end-to-end chains including distributed multi-turn attacks, not just single-turn model responses.

  9. Incident playbooks address adversarial manipulation specifically, not generically — including scenarios where the compromise spans multiple conversation lifecycles.

  10. Metrics track action-layer security and state-integrity signals, not just model-layer performance.

  11. Security review is a continuous process tied to deployment velocity, not a gate that slows shipping.

Most organizations today are at one or two of these. The gap between one and ten is where most incidents will happen.


The uncomfortable truth about autonomous agents

The industry wants autonomous agents. Investors want them. Product teams want them. Users, once they experience the convenience, want them.

Security should want them too — but only when the architecture supports safe autonomy.

Right now, much of what is marketed as “autonomous” is actually “unconstrained.” The agent can do many things because nobody bothered to define what it should not do.

That is not autonomy. That is negligence dressed in buzzwords.

True autonomous agent security means the system can operate freely within a well-defined trust envelope and cannot escape that envelope even under adversarial conditions.

Building that envelope is hard. It requires engineering discipline that conflicts with rapid prototyping culture. It requires security involvement that conflicts with “move fast” incentives. It requires operational investment that conflicts with short-term shipping goals.

But the cost of not building it is higher.

When an agent with broad privileges is compromised through prompt injection, the blast radius is not a bad chat response. It is real action taken against real systems using real credentials.

The attacker does not need to find a zero-day in your code. They need to find the right words in a document your agent will read.


The 90-day plan

If you run security for an organization deploying AI agents, here is what I would do.

Days 0–30: Map the action surface

Inventory every action your agents can take. For each action, document:

  • who authorized it,
  • what credentials it uses,
  • what data it can access,
  • what it can modify or delete,
  • whether human approval is required.

Most teams discover several actions they did not know existed.
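The inventory can be as simple as a list of records plus two queries that surface exactly those discoveries. The field names below are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ActionRecord:
    name: str
    authorized_by: str         # who signed off, or "" if nobody did
    credential: str            # identity/scope the action runs under
    data_access: tuple         # what it can read
    can_modify: bool
    can_delete: bool
    needs_human_approval: bool

def gaps(inventory: list[ActionRecord]) -> tuple[list[str], list[str]]:
    """Surface the findings the inventory exercise is meant to produce:
    actions nobody authorized, and destructive actions with no
    approval gate. Both lists feed deny-by-default in days 31-60."""
    unauthorized = [a.name for a in inventory if not a.authorized_by]
    ungated = [a.name for a in inventory
               if a.can_delete and not a.needs_human_approval]
    return unauthorized, ungated
```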

Days 31–60: Add external authorization and structured action schemas

Implement a policy engine that sits between the model and action execution. Start with deny-by-default for any action the inventory did not explicitly approve.

Constrain model output to structured action schemas (function-calling or tool-use formats) rather than free-form text. This is the prerequisite for the deterministic policy engine to work — without structured schemas, the translation layer between model output and policy input remains an uncontrolled fuzzy gap.

Add human approval for high-impact actions. Make the approval prompt clear enough that a non-technical stakeholder can evaluate it.

Days 61–90: Deploy triage layer and test the full chain

Deploy the secondary evaluator. Stand up a lightweight, heavily constrained model as a risk-scoring triage layer between the policy engine and the human review queue. This model must be read-only, non-acting, and produce only confidence scores. Wire it into the approval routing: high-confidence benign actions auto-approve, low-confidence or high-impact actions route to human review with the evaluator’s risk score attached.

Implement conversation-level intent tracking. Add a running state summary that captures the agent’s task trajectory — what it is trying to accomplish, what data sources have influenced reasoning, and what actions have been proposed. This is what the policy engine evaluates for high-impact decisions, not just the current turn in isolation.

Run end-to-end adversarial simulations. Not “can we trick the model?” but “can we trick the model into causing real harm?”

Test indirect injection through documents, emails, web pages, and tool responses. Test multi-turn conversations where the injection is distributed across turns, context-window boundaries, and session resets. Test combination attacks where prompt injection leads to insecure output handling leads to unauthorized action.

Evaluate whether your state-integrity mechanisms can detect when an action’s rationale traces back to untrusted data processed several turns earlier. Test whether the summarization pipeline silently preserves adversarial directives across context-window resets.

Measure detection time, containment time, and recovery time.

If recovery takes more than minutes, you are not ready for production agents.
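A drill harness for the distributed multi-turn case can be very small. Everything here is a stand-in for real components; the property under test is that the authorization path sees the whole trajectory, so an attack that is invisible turn-by-turn becomes visible in aggregate:

```python
def run_drill(policy, turns) -> str:
    """Replay a scripted conversation against an authorization policy.

    `turns` is a list of (user_msg, retrieved_content, proposal) tuples;
    proposal is None until the action turn. `policy(history, proposal)`
    stands in for the real pipeline and must see the full trajectory.
    Pass = the final action is blocked or escalated.
    """
    history = []
    for user_msg, retrieved, proposal in turns:
        history.append({"user": user_msg, "retrieved": retrieved})
        if proposal is not None:
            verdict = policy(history, proposal)
            return "contained" if verdict in ("deny", "review") else "breached"
    return "no-action"

def trajectory_aware_policy(history, proposal) -> str:
    """Example policy under test: escalate any high-impact proposal when
    untrusted retrieval appears anywhere earlier in the trajectory."""
    saw_untrusted = any(turn["retrieved"] for turn in history)
    if proposal["impact"] >= 2 and saw_untrusted:
        return "review"
    return "allow"
```

Running the same scripted attack against a turn-by-turn policy and a trajectory-aware one makes the difference measurable rather than rhetorical.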

Days 91–180: Hardening and feedback loops

The 90-day plan gets you to a defensible baseline. The next quarter makes it sustainable.

Tune the secondary evaluator with real traffic. Use human approval decisions as ground truth to calibrate the risk scorer. Measure false-positive rate on auto-approvals and false-negative rate on auto-approved actions that should have been escalated. Target: reduce human review volume by 60–80% without increasing missed high-risk actions.

Audit cross-session state integrity. Test whether adversarial content injected in one session can resurface through persistent memory or knowledge-store retrieval in a later session. Add provenance tracking to any stored content that feeds into agent reasoning.

Adversarial red-team exercises on a regular cadence. Move from one-time validation to continuous testing. Schedule quarterly adversarial simulations that include distributed multi-turn and cross-session attack patterns. Track metrics over time: detection latency, containment time, and the ratio of discovered-to-prevented policy violations.


The deeper problem

There is a structural reason the agent security gap exists, and it is not technical.

It is that the people building AI systems and the people securing them are often not the same people, and they do not share the same incentives.

Builders are rewarded for capability. Security is rewarded for constraint. When capability and constraint are in tension — which they always are in agent systems — the builder usually wins because the builder ships the feature.

This is not a criticism of builders. It is an observation about organizational dynamics.

The fix is not to slow down builders. It is to make security constraints visible, enforceable, and integrated into the development workflow so they do not feel like obstacles.

A policy engine that automatically enforces least privilege is less friction than a security review that asks teams to justify their scope requests. A human approval gate that only triggers for high-impact actions is less friction than a blanket requirement that everyone ignores.

Good security architecture reduces total friction. Bad security architecture increases it.

Most agent systems today have bad security architecture because they were built for capability first and retrofitted with constraints later.

The organizations that reverse this order — constraints first, capability within constraints — will be the ones that scale safely.


Final point

Adversarial prompt engineering is real. It is getting more sophisticated. It is not going away.

But it is also not the root problem.

The root problem is that we built systems that can act on the world and then secured them as if they could only talk about the world.

The gap between what agents can do and what we have constrained them to do safely is the agent security gap.

Closing it requires treating action authorization, scope enforcement, and human oversight as first-class architectural requirements — not as afterthoughts, not as policy documents, and not as features to add in version two.

Because by version two, someone will already have exploited the gap.

The question is not whether your agent will encounter adversarial input. It will. The question is whether your architecture prevents that input from becoming adversarial action.

If you cannot answer that question with confidence, the gap is still open.