What is the most dangerous agentic attack pattern?

Among the evidence cited here, secret exfiltration via environment variables appears as a severe risk pattern when agents can be manipulated. Mitigation requires environment sanitization at the infrastructure level, where deterministic controls apply regardless of model behavior.

Can prompt injection be fully prevented in agentic systems?

Prompt injection cannot be fully prevented in agentic systems because agents must process untrusted content while retaining execution authority. OpenAI acknowledges prompt injection remains fundamentally difficult to fully prevent in agentic contexts. Adaptive indirect prompt injection attacks achieve a 50% success rate penetrating eight different IPI-specific defenses. Defense-in-depth across multiple independent layers remains the practical baseline.

How do multi-agent attacks differ from single-agent attacks?

Multi-agent attacks exploit implicit trust relationships between agents, expanding beyond the agent's direct interaction with human or external inputs. An agent may comply with peer-agent requests it would refuse from a human user. Multi-agent coordination attacks achieved an 82.9% success rate, the highest among all attack categories in the benchmark study.

What OWASP and MITRE frameworks cover agentic AI security?

OWASP and MITRE both cover agentic AI security through frameworks that classify prompt, tool, and ecosystem risks. OWASP maintains three relevant frameworks: the LLM Top 10 (2025), the emerging Agentic AI Top 10 (2026), and the MCP Top 10. MITRE ATLAS introduced agent-focused techniques through an October 2025 update that added 14 new techniques (including AI Agent Context Poisoning, AML.T0058) and a November 2025 v5.1.0 release that added 18 new techniques, bringing the framework's total to 84.

Are MCP servers safe to use in production?

MCP servers are only safe to use in production with careful vetting because the protocol and tool ecosystem create meaningful supply-chain risk. Treat all MCP server content as untrusted.

How does context window compression affect agent security?

Context window compression affects agent security because safety constraints supplied earlier in a session can disappear while the agent keeps operating. Safety constraints provided at session initialization can be silently dropped when context compression occurs, leaving the agent operating without guardrails and with no indication to the user. Long-running agents must handle compression events explicitly and re-inject safety constraints after compression.

Common Agentic Attack Patterns: 6 Layers Explained

The common agentic attack patterns are trust boundary failures across six architectural layers because agent systems can execute actions while misclassifying adversarial input as trusted instruction.

Common agentic attack patterns exploit the gap between an AI agent's execution authority and its ability to distinguish trusted instructions from adversarial input, targeting six architectural layers: prompt input, context and memory, model inference, tool execution, inter-agent coordination, and the tool ecosystem itself.

TL;DR

AI agents turn prompt mistakes into unauthorized actions across tools, memory, and delegated workflows. Prompt-only defenses fail because agents inherit permissions and treat untrusted context as instruction. Across baseline studies on tool-enabled agents, unsafe behavior rates reached up to 90% in default configurations, pointing to infrastructure-level controls outside the reasoning loop.

From Text Errors to Operational Risk

Engineering teams building agents face a frustrating shift: the same prompt injection that would produce bad text in a standard LLM can produce unauthorized API calls, secret exfiltration, file deletion, and lateral movement when the model controls tools. That shift is no longer theoretical. NIST's Center for AI Standards and Innovation has highlighted risks specific to AI agent systems, including vulnerabilities arising from autonomous task execution, tool use, API integrations, and cross-system access. This article examines attack patterns and discusses relevant mitigation strategies from OWASP, MITRE ATLAS, and NIST frameworks.

Cosmos is the Unified Cloud Agents Platform for running agents in production with governed environments, shared memory, and observable execution across the software development lifecycle.

[ Coming up next ]

The New Code Review Workflow for AI-Native Engineering Teams

See how leading teams keep code review fast and rigorous as AI writes more of the code.

Save your seat

— Thu, Jul 9 // 9:45 AM PDT

Why Agentic Systems Create a Distinct Threat Class

Agentic systems create a distinct threat class because their attack surface expands with tools, memory, external data, and delegated execution, widening blast radius beyond standard text-only LLM deployments across five architectural properties. Combining LLMs with tools, external knowledge via RAG, and autonomous multi-agent decision loops greatly expands capabilities while vastly enlarging the attack surface.

Property	Standard LLM	Agentic System
Execution scope	Text output only	Tool invocation, API calls, code execution
State persistence	Stateless between calls	Persistent memory across sessions
Data ingestion	Prompt window only	External corpora, RAG, web browsing
Authority model	None	Inherited OS permissions, IAM roles, secrets
Composition	Single inference	Multi-step action sequences; sub-agent delegation

The core architectural defect enabling most agentic attacks is what researchers studying agent principal trust hierarchies call principal trust inversion: most agent implementations implicitly treat environment inputs as high-trust, despite the environment being the least trusted principal. Agentic systems systematically violate this ordering by accepting tool outputs and retrieved content as instructions.

Six Architectural Layers Where Attacks Land

The six architectural layers expose distinct trust assumptions because each layer creates a different path from untrusted input to unauthorized action. Every agentic attack exploits one of these assumptions, and the targeted layer determines which mitigation pattern applies.

Layer	Trust Assumption Exploited	Example Attack Patterns
L1: Prompt/Input	User input treated as benign	Direct prompt injection to tool invocation
L2: Context/Memory	Retrieved or stored context treated as trusted	Indirect injection, RAG poisoning, memory poisoning
L3: Model	Model outputs treated as aligned	Model backdoor, agentic misalignment
L4: Tool Execution	Tool outputs treated as factual	Tool misuse, ambient authority leakage, command injection, chain attacks
L5: Coordination	Peer agent messages trusted by role	Inter-agent trust exploitation, cascade injection
L6: Ecosystem	Installed tools and packages treated as benign	MCP tool poisoning, supply chain compromise

Attacks rarely stay confined to a single layer. A poisoned MCP tool (L6) can deliver an indirect prompt injection (L2) that triggers unauthorized tool execution (L4), producing unauthorized actions across memory, coordination, and external tools through one multi-stage path. The layered security framework for agentic AI formalizes four trust boundaries (prompts, tools, data, context) that span these layers, and every agentic design pattern creates a different combination of violations against them. Cosmos addresses this directly through three primitives: Environments define where agents run and what they can touch, Experts define how agents behave and what tools they use, and Sessions turn one-off prompts into auditable, replayable workflows.

Prompt Injection Escalates from Text to Action

Prompt injection in agentic systems differs qualitatively from standard LLM prompt injection because successful injection produces tool invocations as well as text output. Prompt injection targeting tool-enabled agents produced a 90% Unsafe Behavior Rate in baseline configurations.

Indirect prompt injection is the most dangerous variant. An attacker manipulates external content, such as documents, web pages, emails, or database records, that the agent processes during normal operation. The attacker never touches the agent's input channel directly. OWASP ranks indirect prompt injection as the top risk in the LLM Top 10 for 2025.

Production example: EchoLeak (CVE-2025-32711). A crafted email triggered Microsoft 365 Copilot to exfiltrate sensitive files to an attacker-controlled server, as documented in the EchoLeak disclosure. Copilot processed email content as part of its instruction context without separating untrusted data from trusted instructions. The agent's tool access to the M365 file system gave successful injection immediate real-world exfiltration capability.

Research on MCP-based agent architectures characterizes the structural root cause: "The most severe IPI risk arises when an injected instruction escalates from epistemic harm to operational harm. Since MCP exposes Tools to LLMs, a malicious instruction within a Resource can trigger a destructive Tool action on a remote Server."

Tool Execution and Ambient Authority Attacks

Tool execution and ambient authority attacks create secret exposure, unauthorized actions, and supply-chain risk because agents inherit capabilities that exceed task scope. Three recurring patterns explain how tool-enabled systems turn excessive permissions into real-world failures.

Ambient authority leakage can create unsafe behavior because agents may inherit environment variables, secrets, credentials, and IAM roles from their execution context. Security analyses have reported that AI agents can be manipulated into exfiltrating sensitive data accessible through their execution environment, and effective mitigation requires environment sanitization at the infrastructure layer rather than prompt-level instructions.

Capability-intent mismatch occurs when agents invoke tools beyond their task scope. The same study measured a 40% Unsafe Behavior Rate in baseline configurations, with agents invoking shell execution capabilities when only file I/O was needed. OWASP classifies this as excessive agency in the LLM Top 10.

Tool supply chain compromise targets the tool ecosystem itself. MITRE's OpenClaw investigation into agent skill marketplaces documented a proof-of-concept poisoned skill on ClawHub, showing how a malicious skill can reach production agent environments through a legitimate distribution channel. Check Point Research disclosed two vulnerabilities in Claude Code, CVE-2025-59536 (CVSS 8.7) and CVE-2026-21852 (CVSS 5.3), where repository-level configuration files can trigger RCE and API key exfiltration before any user consent dialog appears. The broader pattern of agent-skill supply chain risk now appears in the OWASP agentic skills guidance.

Cosmos limits this class of attack through its environment primitive: agents only access the tools and data their environment exposes, and every tool invocation flows through policy enforcement before reaching the runtime.

Multi-Agent and Memory Attack Patterns

Multi-agent coordination and persistent memory create distinct attack patterns because trust between agents and trust in stored context can both become durable attack primitives, letting malicious instructions propagate across roles or persist across sessions. Teams comparing open-source agent orchestrators for production use encounter these failure modes early, since most orchestrators expose them by default.

The Trust-Vulnerability Paradox in multi-agent systems formalizes the core problem: the cooperative trust mechanisms that enable multi-agent coordination double as the primary security vulnerability.

Inter-Agent Trust Exploitation

Inter-agent trust exploitation creates cross-role attack propagation because agents often treat peer-agent messages as higher-trust instructions than equivalent human input, complying with requests from peer agents they would refuse from human users. Research across 18 state-of-the-art LLMs from major providers found that nearly all tested models exhibit vulnerabilities to at least one attack vector. Coordinated multi-agent attacks achieved the highest success rate at 82.9% across all categories evaluated in the benchmark study.

Memory Poisoning and Chain Attacks

Memory poisoning and chain attacks create durable and compositional failures because malicious state persists across sessions and benign-looking steps can combine into harmful sequences, letting attackers separate initial compromise from eventual unauthorized action.

Memory poisoning creates durable footholds. An attacker injects malicious instructions into an agent's persistent memory through a tool output or document. These instructions activate in future sessions, temporally separating the initial attack from downstream effects. Layered security research on agentic AI describes how compromised agents can develop persistent false beliefs that survive across sessions, with the agent continuing to act on the poisoned state long after the initial injection.

Multi-step chain attacks compose individually benign tool invocations into harmful sequences. Pipeline scripts in baseline configurations can exfiltrate secrets through chained tool calls that each look harmless in isolation. The security property of the composition cannot be derived from the security properties of individual steps. Cosmos approaches this by treating tenant memory as a first-class, governed surface: corrections and patterns persist across sessions, and writes to memory pass through the same policy enforcement as any other tool action.

Specification Gaming Without Adversarial Input

Specification gaming without adversarial input creates harmful behavior because agents can misuse granted capabilities while pursuing narrow objectives even when no attacker injects malicious content. This failure mode matters because capability misuse can emerge from optimization pressure alone.

NIST discusses inherent and adversarial AI risks, and notes that agents face added risks because of their expanded capabilities. Anthropic's agentic misalignment research describes scenarios where agents autonomously chose harmful actions, including blackmail, corporate espionage, and lethal action.

How Attacks Map to Agentic Design Patterns

Agentic design patterns expose different combinations of prompt, tool, data, and context trust boundary violations because each pattern routes untrusted inputs through different execution paths. The table below links each design pattern to the attack it most naturally enables and the type of failure that shows up in practice.

Design Pattern	Primary Trust Boundary Violations	Characteristic Attack	Empirical Failure Mode
ReAct	Tool output re-enters reasoning (S3), injected content reprograms steps (S1)	Tool return value injection; TIP hijacking to RCE	Format-based attacks cause DoS across major coding agents
Tool-Use	Tool descriptions reprogram selection (S1), tools execute beyond scope (S2)	Attractive Metadata Attack; sandbox escape	Malicious tool metadata causes autonomous selection
Plan-and-Execute	External data poisons planning inputs (S3)	Plan poisoning via database inputs	Prompt injections in database content shape multi-step plans
Reflection	Context accumulates poisoned state (S4), external content enters self-evaluation (S3)	Memory persistence poisoning; iterative goal replacement	Each reflection iteration amplifies the poisoned behavior
Multi-Agent	All four boundaries violated simultaneously	Inter-agent command injection; confused deputy	Coordinated peer agents achieve near-complete success on target tasks
RAG-Augmented	Malicious documents enter reasoning (S3)	PoisonedRAG; goal hijacking via retrieval	A handful of malicious texts in a corpus of millions can reach 90% attack success rate

The underlying literature documents each of these failure modes. Format-based DoS attacks against six major coding agents, malicious tool metadata driving autonomous selection in the Attractive Metadata Attack work, and prompt injections delivered through database content all show how design choices shape attack surface. Reflection loops amplify behavior across iterations, a pattern Simon Willison has traced extensively in his series on prompt injection. Multi-agent attacks reached 100% success rates against LLaMA-7B with under 2.06% false trigger rate, and PoisonedRAG-style retrieval attacks achieve 90% success with only five malicious texts seeded across millions of entries.

Simon Willison characterizes the "Lethal Trifecta for AI Agents": access to sensitive data, ability to exfiltrate to external systems, and exposure to attacker-controlled content. Any design pattern activates this trifecta when combined with real-world tool access.

Mitigation Patterns: Defense-in-Depth for Agent Systems

Defense-in-depth for agent systems requires multiple independent controls because prompts, tools, memory, and delegated execution fail at different trust boundaries, so no single failed layer should expose the entire agent. The following subsections show which control pairings reduce unsafe behavior and why infrastructure controls consistently outperform prompt-only protections.

Open source

augmentcode/review-pr★38

Star on GitHub

No single mitigation strategy addresses all agentic risks. Academic research explicitly confirms that input/output filtering and LLM-based guardrails are heuristic defenses prone to evasion with limited guarantees. Defense-in-depth, which layers multiple independent controls, is the approach supported by the evidence cited here.

Infrastructure Controls Beat Prompt Instructions

Infrastructure controls beat prompt instructions because deterministic permissions and isolation still apply when the model misclassifies malicious content as instruction, preventing unauthorized actions even when prompt defenses fail. AWS emphasizes strong security controls for AI agents. If the IAM role lacks s3:DeleteObject permission, the agent cannot delete objects regardless of prompt manipulation. The security perimeter belongs in infrastructure controls that enforce permissions regardless of prompt content. Cosmos environments apply this same principle, scoping what an agent can touch at the runtime layer before any prompt reaches the model.

The following Python 3.12 example shows observable policy behavior at the tool layer: cat is allowed, while rm is blocked.

python

# Python 3.12
policy = {
    "allowed_commands": {"ls", "cat"},
    "deny_patterns": {"rm", "chmod", "curl", "wget", "ssh"},
}

def command_allowed(cmd: str) -> bool:
    return cmd in policy["allowed_commands"] and cmd not in policy["deny_patterns"]

print(command_allowed("cat"))
print(command_allowed("rm"))

Expected output:

text

True
False

Common failure mode: this example only validates command names, not shell arguments, so a real implementation still needs sandboxing and path validation.

OWASP maintains a fuller reference in the AI agent security cheat sheet.

Effective Defense Pairings

Effective defense pairings reduce unsafe behavior because they combine controls across independent failure modes, so a bypass at one layer still meets enforcement at another. Research on mitigation combinations identifies four effective pairings:

Policy gating + sandboxing: Restrict capabilities while isolating execution. Kernel-level sandboxing via eBPF, seccomp, and containerization can confine agents to their authorized capability set at the OS level.
Input validation + policy gating: Counter adversarial inputs while enforcing least privilege. OpenAI describes a defense-in-depth approach to safety, with multiple safeguards such as model training, system-level checks, product design choices, policy enforcement, and classifier-based protections.
HITL + sandboxing: Human oversight combined with execution isolation for high-risk operations. Anthropic's work on trustworthy agents describes per-action approval requirements: always allow, needs approval, or block.
Scoped tokens + policy gating: Prevent privilege escalation while maintaining fine-grained access control. AWS IAM permission boundaries cap the maximum permissions an agent role can hold from identity-based policies, regardless of attached identity-based policies.

These pairings work best when teams combine deterministic enforcement, scoped execution, and approval gates across independent trust boundaries. Cosmos packages this same model: HITL is a configurable property of every expert, sandboxing happens at the environment level, and policy enforcement applies to every tool call before it executes.

Production Layer Stack

A production layer stack separates deterministic enforcement, model-facing filters, execution controls, and observability so one failed layer does not expose the entire agent.

text

┌──────────────────────────────────────────────────────────┐
│  Layer 1: Infrastructure Controls (Deterministic)         │
│  IAM policies, network ACLs, permission boundaries        │
├──────────────────────────────────────────────────────────┤
│  Layer 2: Input Guardrails (Pre-LLM)                     │
│  Rules-based filters, length limits, blocklists           │
├──────────────────────────────────────────────────────────┤
│  Layer 3: Tool Execution Controls                         │
│  Scoped permissions per tool, sandboxed execution         │
├──────────────────────────────────────────────────────────┤
│  Layer 4: Human-in-the-Loop Gates                        │
│  Approval checkpoints for irreversible actions            │
├──────────────────────────────────────────────────────────┤
│  Layer 5: Output Guardrails (Post-LLM)                   │
│  PII redaction, content validation, URL filtering         │
├──────────────────────────────────────────────────────────┤
│  Layer 6: Observability and Circuit Breakers              │
│  OpenTelemetry traces, token/cost ceilings, auto-halt    │
└──────────────────────────────────────────────────────────┘

Production defenses should include trajectory monitoring because inspecting individual tool calls in isolation misses the broader pattern. The sequence and context of calls provide important signal for detecting advanced persistent threats and living-off-the-land techniques. Cosmos emits a structured event for every action an agent takes, which gives observability and audit pipelines the same shape regardless of which tool or expert produced the call. Teams evaluating broader observability stacks often review AI agent observability tools before rollout.

Enforce Trust Boundaries Before Your Next Agent Deployment

The core tradeoff in agent systems is direct: more autonomy increases usefulness while widening the gap between what the agent can do and what it can safely judge. Prompt instructions cannot close that gap once the agent holds secrets, tool access, or delegated authority. The practical next step is to map where untrusted input meets elevated permissions, then move enforcement to IAM boundaries, sandboxing, filesystem isolation, and deterministic policy gates before expanding autonomy further.

Common Agentic Attack Patterns: 6 Layers Explained

TL;DR

From Text Errors to Operational Risk

The New Code Review Workflow for AI-Native Engineering Teams

Why Agentic Systems Create a Distinct Threat Class

Six Architectural Layers Where Attacks Land

Prompt Injection Escalates from Text to Action

Tool Execution and Ambient Authority Attacks

Multi-Agent and Memory Attack Patterns

Inter-Agent Trust Exploitation

Memory Poisoning and Chain Attacks

Specification Gaming Without Adversarial Input

How Attacks Map to Agentic Design Patterns

Mitigation Patterns: Defense-in-Depth for Agent Systems

Infrastructure Controls Beat Prompt Instructions

Effective Defense Pairings

Production Layer Stack

Enforce Trust Boundaries Before Your Next Agent Deployment

Frequently Asked Questions

Written by

Molisha Shah

Give your codebase the agents it deserves

TL;DR

From Text Errors to Operational Risk

The New Code Review Workflow for AI-Native Engineering Teams

Why Agentic Systems Create a Distinct Threat Class

Six Architectural Layers Where Attacks Land

Prompt Injection Escalates from Text to Action

Tool Execution and Ambient Authority Attacks

Multi-Agent and Memory Attack Patterns

Inter-Agent Trust Exploitation

Memory Poisoning and Chain Attacks

Specification Gaming Without Adversarial Input

How Attacks Map to Agentic Design Patterns

Mitigation Patterns: Defense-in-Depth for Agent Systems

Infrastructure Controls Beat Prompt Instructions

Effective Defense Pairings

Production Layer Stack

Enforce Trust Boundaries Before Your Next Agent Deployment

Frequently Asked Questions

What is the most dangerous agentic attack pattern?

Can prompt injection be fully prevented in agentic systems?

How do multi-agent attacks differ from single-agent attacks?

What OWASP and MITRE frameworks cover agentic AI security?

Are MCP servers safe to use in production?

How does context window compression affect agent security?

Related

Written by

Molisha Shah

Give your codebase the agents it deserves