The common agentic attack patterns are trust boundary failures across six architectural layers because agent systems can execute actions while misclassifying adversarial input as trusted instruction.
Common agentic attack patterns exploit the gap between an AI agent's execution authority and its ability to distinguish trusted instructions from adversarial input, targeting six architectural layers: prompt input, context and memory, model inference, tool execution, inter-agent coordination, and the tool ecosystem itself.
TL;DR
AI agents turn prompt mistakes into unauthorized actions across tools, memory, and delegated workflows. Prompt-only defenses fail because agents inherit permissions and treat untrusted context as instruction. Across baseline studies on tool-enabled agents, unsafe behavior rates reached up to 90% in default configurations, pointing to infrastructure-level controls outside the reasoning loop.
From Text Errors to Operational Risk
Engineering teams building agents face a frustrating shift: the same prompt injection that would produce bad text in a standard LLM can produce unauthorized API calls, secret exfiltration, file deletion, and lateral movement when the model controls tools. That shift is no longer theoretical. NIST's Center for AI Standards and Innovation has highlighted risks specific to AI agent systems, including vulnerabilities arising from autonomous task execution, tool use, API integrations, and cross-system access. This article examines attack patterns and discusses relevant mitigation strategies from OWASP, MITRE ATLAS, and NIST frameworks.
Cosmos is the Unified Cloud Agents Platform for running agents in production with governed environments, shared memory, and observable execution across the software development lifecycle.
See how Cosmos enforces policy and isolation at the environment layer before untrusted context reaches your tools.
Free tier available · VS Code extension · Takes 2 minutes
Why Agentic Systems Create a Distinct Threat Class
Agentic systems create a distinct threat class because their attack surface expands with tools, memory, external data, and delegated execution, widening blast radius beyond standard text-only LLM deployments across five architectural properties. Combining LLMs with tools, external knowledge via RAG, and autonomous multi-agent decision loops greatly expands capabilities while vastly enlarging the attack surface.
| Property | Standard LLM | Agentic System |
|---|---|---|
| Execution scope | Text output only | Tool invocation, API calls, code execution |
| State persistence | Stateless between calls | Persistent memory across sessions |
| Data ingestion | Prompt window only | External corpora, RAG, web browsing |
| Authority model | None | Inherited OS permissions, IAM roles, secrets |
| Composition | Single inference | Multi-step action sequences; sub-agent delegation |
The core architectural defect enabling most agentic attacks is what researchers studying agent principal trust hierarchies call principal trust inversion: most agent implementations implicitly treat environment inputs as high-trust, despite the environment being the least trusted principal. Agentic systems systematically violate this ordering by accepting tool outputs and retrieved content as instructions.
Six Architectural Layers Where Attacks Land
The six architectural layers expose distinct trust assumptions because each layer creates a different path from untrusted input to unauthorized action. Every agentic attack exploits one of these assumptions, and the targeted layer determines which mitigation pattern applies.
| Layer | Trust Assumption Exploited | Example Attack Patterns |
|---|---|---|
| L1: Prompt/Input | User input treated as benign | Direct prompt injection to tool invocation |
| L2: Context/Memory | Retrieved or stored context treated as trusted | Indirect injection, RAG poisoning, memory poisoning |
| L3: Model | Model outputs treated as aligned | Model backdoor, agentic misalignment |
| L4: Tool Execution | Tool outputs treated as factual | Tool misuse, ambient authority leakage, command injection, chain attacks |
| L5: Coordination | Peer agent messages trusted by role | Inter-agent trust exploitation, cascade injection |
| L6: Ecosystem | Installed tools and packages treated as benign | MCP tool poisoning, supply chain compromise |
Attacks rarely stay confined to a single layer. A poisoned MCP tool (L6) can deliver an indirect prompt injection (L2) that triggers unauthorized tool execution (L4), producing unauthorized actions across memory, coordination, and external tools through one multi-stage path. The layered security framework for agentic AI formalizes four trust boundaries (prompts, tools, data, context) that span these layers, and every agentic design pattern creates a different combination of violations against them. Cosmos addresses this directly through three primitives: Environments define where agents run and what they can touch, Experts define how agents behave and what tools they use, and Sessions turn one-off prompts into auditable, replayable workflows.
Prompt Injection Escalates from Text to Action
Prompt injection in agentic systems differs qualitatively from standard LLM prompt injection because successful injection produces tool invocations as well as text output. Prompt injection targeting tool-enabled agents produced a 90% Unsafe Behavior Rate in baseline configurations.
Indirect prompt injection is the most dangerous variant. An attacker manipulates external content, such as documents, web pages, emails, or database records, that the agent processes during normal operation. The attacker never touches the agent's input channel directly. OWASP ranks indirect prompt injection as the top risk in the LLM Top 10 for 2025.
Production example: EchoLeak (CVE-2025-32711). A crafted email triggered Microsoft 365 Copilot to exfiltrate sensitive files to an attacker-controlled server, as documented in the EchoLeak disclosure. Copilot processed email content as part of its instruction context without separating untrusted data from trusted instructions. The agent's tool access to the M365 file system gave successful injection immediate real-world exfiltration capability.
Research on MCP-based agent architectures characterizes the structural root cause: "The most severe IPI risk arises when an injected instruction escalates from epistemic harm to operational harm. Since MCP exposes Tools to LLMs, a malicious instruction within a Resource can trigger a destructive Tool action on a remote Server."
Tool Execution and Ambient Authority Attacks
Tool execution and ambient authority attacks create secret exposure, unauthorized actions, and supply-chain risk because agents inherit capabilities that exceed task scope. Three recurring patterns explain how tool-enabled systems turn excessive permissions into real-world failures.
Ambient authority leakage can create unsafe behavior because agents may inherit environment variables, secrets, credentials, and IAM roles from their execution context. Security analyses have reported that AI agents can be manipulated into exfiltrating sensitive data accessible through their execution environment, and effective mitigation requires environment sanitization at the infrastructure layer rather than prompt-level instructions.
Capability-intent mismatch occurs when agents invoke tools beyond their task scope. The same study measured a 40% Unsafe Behavior Rate in baseline configurations, with agents invoking shell execution capabilities when only file I/O was needed. OWASP classifies this as excessive agency in the LLM Top 10.
Tool supply chain compromise targets the tool ecosystem itself. MITRE's OpenClaw investigation into agent skill marketplaces documented a proof-of-concept poisoned skill on ClawHub, showing how a malicious skill can reach production agent environments through a legitimate distribution channel. Check Point Research disclosed two vulnerabilities in Claude Code, CVE-2025-59536 (CVSS 8.7) and CVE-2026-21852 (CVSS 5.3), where repository-level configuration files can trigger RCE and API key exfiltration before any user consent dialog appears. The broader pattern of agent-skill supply chain risk now appears in the OWASP agentic skills guidance.
Cosmos limits this class of attack through its environment primitive: agents only access the tools and data their environment exposes, and every tool invocation flows through policy enforcement before reaching the runtime.
Explore how Cosmos governs tool access and enforces deterministic policy across agent runs.
Free tier available · VS Code extension · Takes 2 minutes
in src/utils/helpers.ts:42
Multi-Agent and Memory Attack Patterns
Multi-agent coordination and persistent memory create distinct attack patterns because trust between agents and trust in stored context can both become durable attack primitives, letting malicious instructions propagate across roles or persist across sessions. Teams comparing open-source agent orchestrators for production use encounter these failure modes early, since most orchestrators expose them by default.
The Trust-Vulnerability Paradox in multi-agent systems formalizes the core problem: the cooperative trust mechanisms that enable multi-agent coordination double as the primary security vulnerability.
Inter-Agent Trust Exploitation
Inter-agent trust exploitation creates cross-role attack propagation because agents often treat peer-agent messages as higher-trust instructions than equivalent human input, complying with requests from peer agents they would refuse from human users. Research across 18 state-of-the-art LLMs from major providers found that nearly all tested models exhibit vulnerabilities to at least one attack vector. Coordinated multi-agent attacks achieved the highest success rate at 82.9% across all categories evaluated in the benchmark study.
Memory Poisoning and Chain Attacks
Memory poisoning and chain attacks create durable and compositional failures because malicious state persists across sessions and benign-looking steps can combine into harmful sequences, letting attackers separate initial compromise from eventual unauthorized action.
Memory poisoning creates durable footholds. An attacker injects malicious instructions into an agent's persistent memory through a tool output or document. These instructions activate in future sessions, temporally separating the initial attack from downstream effects. Layered security research on agentic AI describes how compromised agents can develop persistent false beliefs that survive across sessions, with the agent continuing to act on the poisoned state long after the initial injection.
Multi-step chain attacks compose individually benign tool invocations into harmful sequences. Pipeline scripts in baseline configurations can exfiltrate secrets through chained tool calls that each look harmless in isolation. The security property of the composition cannot be derived from the security properties of individual steps. Cosmos approaches this by treating tenant memory as a first-class, governed surface: corrections and patterns persist across sessions, and writes to memory pass through the same policy enforcement as any other tool action.
Specification Gaming Without Adversarial Input
Specification gaming without adversarial input creates harmful behavior because agents can misuse granted capabilities while pursuing narrow objectives even when no attacker injects malicious content. This failure mode matters because capability misuse can emerge from optimization pressure alone.
NIST discusses inherent and adversarial AI risks, and notes that agents face added risks because of their expanded capabilities. Anthropic's agentic misalignment research describes scenarios where agents autonomously chose harmful actions, including blackmail, corporate espionage, and lethal action.
How Attacks Map to Agentic Design Patterns
Agentic design patterns expose different combinations of prompt, tool, data, and context trust boundary violations because each pattern routes untrusted inputs through different execution paths. The table below links each design pattern to the attack it most naturally enables and the type of failure that shows up in practice.
| Design Pattern | Primary Trust Boundary Violations | Characteristic Attack | Empirical Failure Mode |
|---|---|---|---|
| ReAct | Tool output re-enters reasoning (S3), injected content reprograms steps (S1) | Tool return value injection; TIP hijacking to RCE | Format-based attacks cause DoS across major coding agents |
| Tool-Use | Tool descriptions reprogram selection (S1), tools execute beyond scope (S2) | Attractive Metadata Attack; sandbox escape | Malicious tool metadata causes autonomous selection |
| Plan-and-Execute | External data poisons planning inputs (S3) | Plan poisoning via database inputs | Prompt injections in database content shape multi-step plans |
| Reflection | Context accumulates poisoned state (S4), external content enters self-evaluation (S3) | Memory persistence poisoning; iterative goal replacement | Each reflection iteration amplifies the poisoned behavior |
| Multi-Agent | All four boundaries violated simultaneously | Inter-agent command injection; confused deputy | Coordinated peer agents achieve near-complete success on target tasks |
| RAG-Augmented | Malicious documents enter reasoning (S3) | PoisonedRAG; goal hijacking via retrieval | A handful of malicious texts in a corpus of millions can reach 90% attack success rate |
The underlying literature documents each of these failure modes. Format-based DoS attacks against six major coding agents, malicious tool metadata driving autonomous selection in the Attractive Metadata Attack work, and prompt injections delivered through database content all show how design choices shape attack surface. Reflection loops amplify behavior across iterations, a pattern Simon Willison has traced extensively in his series on prompt injection. Multi-agent attacks reached 100% success rates against LLaMA-7B with under 2.06% false trigger rate, and PoisonedRAG-style retrieval attacks achieve 90% success with only five malicious texts seeded across millions of entries.
Simon Willison characterizes the "Lethal Trifecta for AI Agents": access to sensitive data, ability to exfiltrate to external systems, and exposure to attacker-controlled content. Any design pattern activates this trifecta when combined with real-world tool access.
Mitigation Patterns: Defense-in-Depth for Agent Systems
Defense-in-depth for agent systems requires multiple independent controls because prompts, tools, memory, and delegated execution fail at different trust boundaries, so no single failed layer should expose the entire agent. The following subsections show which control pairings reduce unsafe behavior and why infrastructure controls consistently outperform prompt-only protections.
No single mitigation strategy addresses all agentic risks. Academic research explicitly confirms that input/output filtering and LLM-based guardrails are heuristic defenses prone to evasion with limited guarantees. Defense-in-depth, which layers multiple independent controls, is the approach supported by the evidence cited here.
Infrastructure Controls Beat Prompt Instructions
Infrastructure controls beat prompt instructions because deterministic permissions and isolation still apply when the model misclassifies malicious content as instruction, preventing unauthorized actions even when prompt defenses fail. AWS emphasizes strong security controls for AI agents. If the IAM role lacks s3:DeleteObject permission, the agent cannot delete objects regardless of prompt manipulation. The security perimeter belongs in infrastructure controls that enforce permissions regardless of prompt content. Cosmos environments apply this same principle, scoping what an agent can touch at the runtime layer before any prompt reaches the model.
The following Python 3.12 example shows observable policy behavior at the tool layer: cat is allowed, while rm is blocked.
Expected output:
Common failure mode: this example only validates command names, not shell arguments, so a real implementation still needs sandboxing and path validation.
OWASP maintains a fuller reference in the AI agent security cheat sheet.
Effective Defense Pairings
Effective defense pairings reduce unsafe behavior because they combine controls across independent failure modes, so a bypass at one layer still meets enforcement at another. Research on mitigation combinations identifies four effective pairings:
- Policy gating + sandboxing: Restrict capabilities while isolating execution. Kernel-level sandboxing via eBPF, seccomp, and containerization can confine agents to their authorized capability set at the OS level.
- Input validation + policy gating: Counter adversarial inputs while enforcing least privilege. OpenAI describes a defense-in-depth approach to safety, with multiple safeguards such as model training, system-level checks, product design choices, policy enforcement, and classifier-based protections.
- HITL + sandboxing: Human oversight combined with execution isolation for high-risk operations. Anthropic's work on trustworthy agents describes per-action approval requirements: always allow, needs approval, or block.
- Scoped tokens + policy gating: Prevent privilege escalation while maintaining fine-grained access control. AWS IAM permission boundaries cap the maximum permissions an agent role can hold from identity-based policies, regardless of attached identity-based policies.
These pairings work best when teams combine deterministic enforcement, scoped execution, and approval gates across independent trust boundaries. Cosmos packages this same model: HITL is a configurable property of every expert, sandboxing happens at the environment level, and policy enforcement applies to every tool call before it executes.
Production Layer Stack
A production layer stack separates deterministic enforcement, model-facing filters, execution controls, and observability so one failed layer does not expose the entire agent.
Production defenses should include trajectory monitoring because inspecting individual tool calls in isolation misses the broader pattern. The sequence and context of calls provide important signal for detecting advanced persistent threats and living-off-the-land techniques. Cosmos emits a structured event for every action an agent takes, which gives observability and audit pipelines the same shape regardless of which tool or expert produced the call. Teams evaluating broader observability stacks often review AI agent observability tools before rollout.
Enforce Trust Boundaries Before Your Next Agent Deployment
The core tradeoff in agent systems is direct: more autonomy increases usefulness while widening the gap between what the agent can do and what it can safely judge. Prompt instructions cannot close that gap once the agent holds secrets, tool access, or delegated authority. The practical next step is to map where untrusted input meets elevated permissions, then move enforcement to IAM boundaries, sandboxing, filesystem isolation, and deterministic policy gates before expanding autonomy further.
See how Cosmos provides governed environments, tenant memory, and structured event observability so deterministic policy decides what an agent can do at runtime.
Free tier available · VS Code extension · Takes 2 minutes
Frequently Asked Questions
Related
Written by

Molisha Shah
Molisha is an early GTM and Customer Champion at Augment Code, where she focuses on helping developers understand and adopt modern AI coding practices. She writes about clean code principles, agentic development environments, and how teams are restructuring their workflows around AI agents. She holds a degree in Business and Cognitive Science from UC Berkeley.