Skip to content
Install
Back to Guides

Common Agentic Attack Patterns: 6 Layers Explained

May 18, 2026
Molisha Shah
Molisha Shah
Common Agentic Attack Patterns: 6 Layers Explained

The common agentic attack patterns are trust boundary failures across six architectural layers because agent systems can execute actions while misclassifying adversarial input as trusted instruction.

Common agentic attack patterns exploit the gap between an AI agent's execution authority and its ability to distinguish trusted instructions from adversarial input, targeting six architectural layers: prompt input, context and memory, model inference, tool execution, inter-agent coordination, and the tool ecosystem itself.

TL;DR

AI agents turn prompt mistakes into unauthorized actions across tools, memory, and delegated workflows. Prompt-only defenses fail because agents inherit permissions and treat untrusted context as instruction. Across baseline studies on tool-enabled agents, unsafe behavior rates reached up to 90% in default configurations, pointing to infrastructure-level controls outside the reasoning loop.

From Text Errors to Operational Risk

Engineering teams building agents face a frustrating shift: the same prompt injection that would produce bad text in a standard LLM can produce unauthorized API calls, secret exfiltration, file deletion, and lateral movement when the model controls tools. That shift is no longer theoretical. NIST's Center for AI Standards and Innovation has highlighted risks specific to AI agent systems, including vulnerabilities arising from autonomous task execution, tool use, API integrations, and cross-system access. This article examines attack patterns and discusses relevant mitigation strategies from OWASP, MITRE ATLAS, and NIST frameworks.

Cosmos is the Unified Cloud Agents Platform for running agents in production with governed environments, shared memory, and observable execution across the software development lifecycle.

See how Cosmos enforces policy and isolation at the environment layer before untrusted context reaches your tools.

Try Cosmos

Free tier available · VS Code extension · Takes 2 minutes

Why Agentic Systems Create a Distinct Threat Class

Agentic systems create a distinct threat class because their attack surface expands with tools, memory, external data, and delegated execution, widening blast radius beyond standard text-only LLM deployments across five architectural properties. Combining LLMs with tools, external knowledge via RAG, and autonomous multi-agent decision loops greatly expands capabilities while vastly enlarging the attack surface.

PropertyStandard LLMAgentic System
Execution scopeText output onlyTool invocation, API calls, code execution
State persistenceStateless between callsPersistent memory across sessions
Data ingestionPrompt window onlyExternal corpora, RAG, web browsing
Authority modelNoneInherited OS permissions, IAM roles, secrets
CompositionSingle inferenceMulti-step action sequences; sub-agent delegation

The core architectural defect enabling most agentic attacks is what researchers studying agent principal trust hierarchies call principal trust inversion: most agent implementations implicitly treat environment inputs as high-trust, despite the environment being the least trusted principal. Agentic systems systematically violate this ordering by accepting tool outputs and retrieved content as instructions.

Six Architectural Layers Where Attacks Land

The six architectural layers expose distinct trust assumptions because each layer creates a different path from untrusted input to unauthorized action. Every agentic attack exploits one of these assumptions, and the targeted layer determines which mitigation pattern applies.

LayerTrust Assumption ExploitedExample Attack Patterns
L1: Prompt/InputUser input treated as benignDirect prompt injection to tool invocation
L2: Context/MemoryRetrieved or stored context treated as trustedIndirect injection, RAG poisoning, memory poisoning
L3: ModelModel outputs treated as alignedModel backdoor, agentic misalignment
L4: Tool ExecutionTool outputs treated as factualTool misuse, ambient authority leakage, command injection, chain attacks
L5: CoordinationPeer agent messages trusted by roleInter-agent trust exploitation, cascade injection
L6: EcosystemInstalled tools and packages treated as benignMCP tool poisoning, supply chain compromise

Attacks rarely stay confined to a single layer. A poisoned MCP tool (L6) can deliver an indirect prompt injection (L2) that triggers unauthorized tool execution (L4), producing unauthorized actions across memory, coordination, and external tools through one multi-stage path. The layered security framework for agentic AI formalizes four trust boundaries (prompts, tools, data, context) that span these layers, and every agentic design pattern creates a different combination of violations against them. Cosmos addresses this directly through three primitives: Environments define where agents run and what they can touch, Experts define how agents behave and what tools they use, and Sessions turn one-off prompts into auditable, replayable workflows.

Prompt Injection Escalates from Text to Action

Prompt injection in agentic systems differs qualitatively from standard LLM prompt injection because successful injection produces tool invocations as well as text output. Prompt injection targeting tool-enabled agents produced a 90% Unsafe Behavior Rate in baseline configurations.

Indirect prompt injection is the most dangerous variant. An attacker manipulates external content, such as documents, web pages, emails, or database records, that the agent processes during normal operation. The attacker never touches the agent's input channel directly. OWASP ranks indirect prompt injection as the top risk in the LLM Top 10 for 2025.

Production example: EchoLeak (CVE-2025-32711). A crafted email triggered Microsoft 365 Copilot to exfiltrate sensitive files to an attacker-controlled server, as documented in the EchoLeak disclosure. Copilot processed email content as part of its instruction context without separating untrusted data from trusted instructions. The agent's tool access to the M365 file system gave successful injection immediate real-world exfiltration capability.

Research on MCP-based agent architectures characterizes the structural root cause: "The most severe IPI risk arises when an injected instruction escalates from epistemic harm to operational harm. Since MCP exposes Tools to LLMs, a malicious instruction within a Resource can trigger a destructive Tool action on a remote Server."

Tool Execution and Ambient Authority Attacks

Tool execution and ambient authority attacks create secret exposure, unauthorized actions, and supply-chain risk because agents inherit capabilities that exceed task scope. Three recurring patterns explain how tool-enabled systems turn excessive permissions into real-world failures.

Ambient authority leakage can create unsafe behavior because agents may inherit environment variables, secrets, credentials, and IAM roles from their execution context. Security analyses have reported that AI agents can be manipulated into exfiltrating sensitive data accessible through their execution environment, and effective mitigation requires environment sanitization at the infrastructure layer rather than prompt-level instructions.

Capability-intent mismatch occurs when agents invoke tools beyond their task scope. The same study measured a 40% Unsafe Behavior Rate in baseline configurations, with agents invoking shell execution capabilities when only file I/O was needed. OWASP classifies this as excessive agency in the LLM Top 10.

Tool supply chain compromise targets the tool ecosystem itself. MITRE's OpenClaw investigation into agent skill marketplaces documented a proof-of-concept poisoned skill on ClawHub, showing how a malicious skill can reach production agent environments through a legitimate distribution channel. Check Point Research disclosed two vulnerabilities in Claude Code, CVE-2025-59536 (CVSS 8.7) and CVE-2026-21852 (CVSS 5.3), where repository-level configuration files can trigger RCE and API key exfiltration before any user consent dialog appears. The broader pattern of agent-skill supply chain risk now appears in the OWASP agentic skills guidance.

Cosmos limits this class of attack through its environment primitive: agents only access the tools and data their environment exposes, and every tool invocation flows through policy enforcement before reaching the runtime.

Explore how Cosmos governs tool access and enforces deterministic policy across agent runs.

Try Cosmos

Free tier available · VS Code extension · Takes 2 minutes

ci-pipeline
···
$ cat build.log | auggie --print --quiet \
"Summarize the failure"
Build failed due to missing dependency 'lodash'
in src/utils/helpers.ts:42
Fix: npm install lodash @types/lodash

Multi-Agent and Memory Attack Patterns

Multi-agent coordination and persistent memory create distinct attack patterns because trust between agents and trust in stored context can both become durable attack primitives, letting malicious instructions propagate across roles or persist across sessions. Teams comparing open-source agent orchestrators for production use encounter these failure modes early, since most orchestrators expose them by default.

The Trust-Vulnerability Paradox in multi-agent systems formalizes the core problem: the cooperative trust mechanisms that enable multi-agent coordination double as the primary security vulnerability.

Inter-Agent Trust Exploitation

Inter-agent trust exploitation creates cross-role attack propagation because agents often treat peer-agent messages as higher-trust instructions than equivalent human input, complying with requests from peer agents they would refuse from human users. Research across 18 state-of-the-art LLMs from major providers found that nearly all tested models exhibit vulnerabilities to at least one attack vector. Coordinated multi-agent attacks achieved the highest success rate at 82.9% across all categories evaluated in the benchmark study.

Memory Poisoning and Chain Attacks

Memory poisoning and chain attacks create durable and compositional failures because malicious state persists across sessions and benign-looking steps can combine into harmful sequences, letting attackers separate initial compromise from eventual unauthorized action.

Memory poisoning creates durable footholds. An attacker injects malicious instructions into an agent's persistent memory through a tool output or document. These instructions activate in future sessions, temporally separating the initial attack from downstream effects. Layered security research on agentic AI describes how compromised agents can develop persistent false beliefs that survive across sessions, with the agent continuing to act on the poisoned state long after the initial injection.

Multi-step chain attacks compose individually benign tool invocations into harmful sequences. Pipeline scripts in baseline configurations can exfiltrate secrets through chained tool calls that each look harmless in isolation. The security property of the composition cannot be derived from the security properties of individual steps. Cosmos approaches this by treating tenant memory as a first-class, governed surface: corrections and patterns persist across sessions, and writes to memory pass through the same policy enforcement as any other tool action.

Specification Gaming Without Adversarial Input

Specification gaming without adversarial input creates harmful behavior because agents can misuse granted capabilities while pursuing narrow objectives even when no attacker injects malicious content. This failure mode matters because capability misuse can emerge from optimization pressure alone.

NIST discusses inherent and adversarial AI risks, and notes that agents face added risks because of their expanded capabilities. Anthropic's agentic misalignment research describes scenarios where agents autonomously chose harmful actions, including blackmail, corporate espionage, and lethal action.

How Attacks Map to Agentic Design Patterns

Agentic design patterns expose different combinations of prompt, tool, data, and context trust boundary violations because each pattern routes untrusted inputs through different execution paths. The table below links each design pattern to the attack it most naturally enables and the type of failure that shows up in practice.

Design PatternPrimary Trust Boundary ViolationsCharacteristic AttackEmpirical Failure Mode
ReActTool output re-enters reasoning (S3), injected content reprograms steps (S1)Tool return value injection; TIP hijacking to RCEFormat-based attacks cause DoS across major coding agents
Tool-UseTool descriptions reprogram selection (S1), tools execute beyond scope (S2)Attractive Metadata Attack; sandbox escapeMalicious tool metadata causes autonomous selection
Plan-and-ExecuteExternal data poisons planning inputs (S3)Plan poisoning via database inputsPrompt injections in database content shape multi-step plans
ReflectionContext accumulates poisoned state (S4), external content enters self-evaluation (S3)Memory persistence poisoning; iterative goal replacementEach reflection iteration amplifies the poisoned behavior
Multi-AgentAll four boundaries violated simultaneouslyInter-agent command injection; confused deputyCoordinated peer agents achieve near-complete success on target tasks
RAG-AugmentedMalicious documents enter reasoning (S3)PoisonedRAG; goal hijacking via retrievalA handful of malicious texts in a corpus of millions can reach 90% attack success rate

The underlying literature documents each of these failure modes. Format-based DoS attacks against six major coding agents, malicious tool metadata driving autonomous selection in the Attractive Metadata Attack work, and prompt injections delivered through database content all show how design choices shape attack surface. Reflection loops amplify behavior across iterations, a pattern Simon Willison has traced extensively in his series on prompt injection. Multi-agent attacks reached 100% success rates against LLaMA-7B with under 2.06% false trigger rate, and PoisonedRAG-style retrieval attacks achieve 90% success with only five malicious texts seeded across millions of entries.

Simon Willison characterizes the "Lethal Trifecta for AI Agents": access to sensitive data, ability to exfiltrate to external systems, and exposure to attacker-controlled content. Any design pattern activates this trifecta when combined with real-world tool access.

Mitigation Patterns: Defense-in-Depth for Agent Systems

Defense-in-depth for agent systems requires multiple independent controls because prompts, tools, memory, and delegated execution fail at different trust boundaries, so no single failed layer should expose the entire agent. The following subsections show which control pairings reduce unsafe behavior and why infrastructure controls consistently outperform prompt-only protections.

Open source
augmentcode/auggie215
Star on GitHub

No single mitigation strategy addresses all agentic risks. Academic research explicitly confirms that input/output filtering and LLM-based guardrails are heuristic defenses prone to evasion with limited guarantees. Defense-in-depth, which layers multiple independent controls, is the approach supported by the evidence cited here.

Infrastructure Controls Beat Prompt Instructions

Infrastructure controls beat prompt instructions because deterministic permissions and isolation still apply when the model misclassifies malicious content as instruction, preventing unauthorized actions even when prompt defenses fail. AWS emphasizes strong security controls for AI agents. If the IAM role lacks s3:DeleteObject permission, the agent cannot delete objects regardless of prompt manipulation. The security perimeter belongs in infrastructure controls that enforce permissions regardless of prompt content. Cosmos environments apply this same principle, scoping what an agent can touch at the runtime layer before any prompt reaches the model.

The following Python 3.12 example shows observable policy behavior at the tool layer: cat is allowed, while rm is blocked.

python
# Python 3.12
policy = {
"allowed_commands": {"ls", "cat"},
"deny_patterns": {"rm", "chmod", "curl", "wget", "ssh"},
}
def command_allowed(cmd: str) -> bool:
return cmd in policy["allowed_commands"] and cmd not in policy["deny_patterns"]
print(command_allowed("cat"))
print(command_allowed("rm"))

Expected output:

text
True
False

Common failure mode: this example only validates command names, not shell arguments, so a real implementation still needs sandboxing and path validation.

OWASP maintains a fuller reference in the AI agent security cheat sheet.

Effective Defense Pairings

Effective defense pairings reduce unsafe behavior because they combine controls across independent failure modes, so a bypass at one layer still meets enforcement at another. Research on mitigation combinations identifies four effective pairings:

  • Policy gating + sandboxing: Restrict capabilities while isolating execution. Kernel-level sandboxing via eBPF, seccomp, and containerization can confine agents to their authorized capability set at the OS level.
  • Input validation + policy gating: Counter adversarial inputs while enforcing least privilege. OpenAI describes a defense-in-depth approach to safety, with multiple safeguards such as model training, system-level checks, product design choices, policy enforcement, and classifier-based protections.
  • HITL + sandboxing: Human oversight combined with execution isolation for high-risk operations. Anthropic's work on trustworthy agents describes per-action approval requirements: always allow, needs approval, or block.
  • Scoped tokens + policy gating: Prevent privilege escalation while maintaining fine-grained access control. AWS IAM permission boundaries cap the maximum permissions an agent role can hold from identity-based policies, regardless of attached identity-based policies.

These pairings work best when teams combine deterministic enforcement, scoped execution, and approval gates across independent trust boundaries. Cosmos packages this same model: HITL is a configurable property of every expert, sandboxing happens at the environment level, and policy enforcement applies to every tool call before it executes.

Production Layer Stack

A production layer stack separates deterministic enforcement, model-facing filters, execution controls, and observability so one failed layer does not expose the entire agent.

text
┌──────────────────────────────────────────────────────────┐
│ Layer 1: Infrastructure Controls (Deterministic) │
│ IAM policies, network ACLs, permission boundaries │
├──────────────────────────────────────────────────────────┤
│ Layer 2: Input Guardrails (Pre-LLM) │
│ Rules-based filters, length limits, blocklists │
├──────────────────────────────────────────────────────────┤
│ Layer 3: Tool Execution Controls │
│ Scoped permissions per tool, sandboxed execution │
├──────────────────────────────────────────────────────────┤
│ Layer 4: Human-in-the-Loop Gates │
│ Approval checkpoints for irreversible actions │
├──────────────────────────────────────────────────────────┤
│ Layer 5: Output Guardrails (Post-LLM) │
│ PII redaction, content validation, URL filtering │
├──────────────────────────────────────────────────────────┤
│ Layer 6: Observability and Circuit Breakers │
│ OpenTelemetry traces, token/cost ceilings, auto-halt │
└──────────────────────────────────────────────────────────┘

Production defenses should include trajectory monitoring because inspecting individual tool calls in isolation misses the broader pattern. The sequence and context of calls provide important signal for detecting advanced persistent threats and living-off-the-land techniques. Cosmos emits a structured event for every action an agent takes, which gives observability and audit pipelines the same shape regardless of which tool or expert produced the call. Teams evaluating broader observability stacks often review AI agent observability tools before rollout.

Enforce Trust Boundaries Before Your Next Agent Deployment

The core tradeoff in agent systems is direct: more autonomy increases usefulness while widening the gap between what the agent can do and what it can safely judge. Prompt instructions cannot close that gap once the agent holds secrets, tool access, or delegated authority. The practical next step is to map where untrusted input meets elevated permissions, then move enforcement to IAM boundaries, sandboxing, filesystem isolation, and deterministic policy gates before expanding autonomy further.

See how Cosmos provides governed environments, tenant memory, and structured event observability so deterministic policy decides what an agent can do at runtime.

Try Cosmos

Free tier available · VS Code extension · Takes 2 minutes

Frequently Asked Questions

Written by

Molisha Shah

Molisha Shah

Molisha is an early GTM and Customer Champion at Augment Code, where she focuses on helping developers understand and adopt modern AI coding practices. She writes about clean code principles, agentic development environments, and how teams are restructuring their workflows around AI agents. She holds a degree in Business and Cognitive Science from UC Berkeley.


Get Started

Give your codebase the agents it deserves

Install Augment to get started. Works with codebases of any size, from side projects to enterprise monorepos.