How does an async AI agent workflow differ from a standard background job queue?

Standard task queues (Celery, SQS, BullMQ) handle single-shot LLM inference but lack built-in checkpointing for multi-step agents: a worker crash mid-execution can redeliver the job and re-run prior LLM calls, duplicating side effects unless steps are idempotent. Durable execution frameworks persist state at each step boundary and resume from the last successful checkpoint.

Which checkpointing strategy minimizes LLM token costs on recovery?

Event history replay in Temporal avoids re-issuing LLM API calls on recovery because it recreates workflow state from persisted history and replays completed activity results from stored values. Azure bills Durable Functions orchestrator replay as function execution on the Consumption plan, and retried activities that call an LLM invoke the provider again, incurring another charge.

What happens when agent code changes while workflows are in flight?

LangGraph supports adding and removing state keys for existing threads, but renamed keys lose their saved state and incompatible type changes can break in-flight threads. Temporal requires version gates (GetVersion in the Go SDK, workflow.patched() in the Python SDK) to avoid NondeterminismError on in-flight workflows. Check each framework's versioning behavior before deployment.

How should approval gate timeouts be handled?

Treat a timeout as a denial, and encode escalation policies (SLA timers, reminder intervals, escalation contacts) directly in the workflow definition. Temporal supports this natively via workflow.wait_condition() with a timeout parameter; LangGraph offers node-level timeout controls but may need external scheduling for workflow-level policies.

Can per-step checkpointing guarantee exactly-once tool execution?

LangGraph's per-superstep snapshot approach does not guarantee exactly-once execution for tool calls with external side effects, so workflows must handle re-execution safely with idempotency keys, upserts, or read-before-write checks. Temporal, Flink Agents, and Restate provide durable, replayable execution semantics for tool calls.

How Async AI Agent Workflows Survive Failures

Async AI agent workflows use durable long-running execution. Persistent state checkpointing decouples task submission from completion, so work can survive timeouts, crashes, approvals, and restarts.

TL;DR

Agent workflows break synchronous HTTP architectures because API Gateway defaults to a 29-second timeout, Azure Functions HTTP triggers stop at 230 seconds, and Lambda ends at 900 seconds. Production async systems recover by persisting checkpoints, pausing durably for approvals, and keeping memory across sessions instead of depending on one live process.

Why Agent Workflows Need Durable Execution

A 12-step agent workflow crashes at step 8. Steps 1 through 7 already consumed model calls, wrote database records, and called external APIs. Synchronous execution offers one recovery option: start over from step 1, re-running every prior model call and re-executing every side effect.

This failure mode is not hypothetical. Anthropic's multi-agent research documentation describes solving the problem of resuming from failure points by building state management that reconstructs a failed subagent's exact state and continues from where it left off.

Engineering teams building agent systems face four recurring problems: infrastructure timeouts kill connections before agents finish, process crashes discard intermediate progress, human approvals block execution indefinitely, and active state grows before tasks complete. Those failures compound when standard retry logic resubmits a task after a client timeout, so side effects execute once per instance rather than once per intent.

Handling those failures reliably requires durable, long-running execution.

Augment Cosmos, the unified cloud agents platform, runs agents across the software development lifecycle with shared context and memory that persists between runs. Its Context Engine analyzes entire codebases through semantic dependency analysis and supports codebase understanding across 400,000+ files, while Cosmos Sessions keep shared context, history, prior outputs, and feedback available across tasks.

Why Synchronous Request-Response Breaks for Real Agent Tasks

Synchronous request-response breaks for real agent tasks because infrastructure timeouts expire long before multi-step agent work completes. Once that happens, the system has no durable way to return output, preserve context, or safely recover side effects.

Independent timeout limits terminate different parts of the stack at different times, so even after a workflow clears one layer, the next can still kill it. None of those layers preserve enough execution state to resume safely after a crash.

Infrastructure Layer	Timeout
API Gateway default integration timeout	29 seconds
Azure Functions HTTP trigger hard ceiling	230 seconds
Lambda maximum execution timeout	900 seconds (15 min)

API Gateway's 29-second timeout is a raisable default: Regional and private REST APIs can extend it through a service quota increase that trades against account-level throttling. Azure's 230 seconds and Lambda's 900 seconds stay fixed.
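To make the layering concrete, here is a minimal sketch that checks a workload's total duration against the ceilings in the table above. The function name and the 12-step, 25-seconds-per-step workload are illustrative assumptions, not measurements.

```python
# Timeout ceilings from the table above, in seconds.
TIMEOUT_CEILINGS = {
    "API Gateway default integration timeout": 29,
    "Azure Functions HTTP trigger": 230,
    "Lambda maximum execution": 900,
}


def layers_exceeded(steps: int, seconds_per_step: float) -> list[str]:
    """Return the layers whose ceiling a synchronous run of this length would exceed."""
    total = steps * seconds_per_step
    return [name for name, limit in TIMEOUT_CEILINGS.items() if total > limit]


# Hypothetical workload: 12 steps at about 25 seconds each (300 seconds total).
exceeded = layers_exceeded(12, 25)
print(exceeded)
```

Under these assumed numbers the run clears Lambda's ceiling but is still killed by the two layers in front of it, which is why clearing one limit does not make a synchronous design safe.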

Raising timeout thresholds turns a clean error into a silent one. When the client times out before the agent completes, the agent keeps running with no way to return output. The client receives an error, retries, and two agent instances execute the same task, so side effects like sent emails, written records, and external API calls can fire more than once.
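One common way to stop a client retry from spawning a second agent instance is to key submissions on the caller's intent rather than on the request. The sketch below is an assumption-laden illustration: `JobRegistry` and the intent key format are hypothetical, and the in-memory dict stands in for a durable table.

```python
import uuid


class JobRegistry:
    """In-memory stand-in for a durable job table (a real system would use a database)."""

    def __init__(self) -> None:
        self._jobs_by_intent: dict[str, str] = {}

    def submit(self, intent_key: str) -> tuple[str, bool]:
        """Return (job_id, created). A retried submit reuses the existing job."""
        existing = self._jobs_by_intent.get(intent_key)
        if existing is not None:
            return existing, False
        job_id = str(uuid.uuid4())
        self._jobs_by_intent[intent_key] = job_id
        return job_id, True


registry = JobRegistry()
# The client times out and retries with the same intent key.
first_id, first_created = registry.submit("user-42:refund-order-1001")
retry_id, retry_created = registry.submit("user-42:refund-order-1001")
```

The retry receives the original job ID instead of starting a second run, so the client can poll for the result of the first instance.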

Anthropic documents additional failure modes from its own production systems. Claude Sonnet 4.5 begins wrapping up work prematurely as it approaches its perceived context limit, a behavior Anthropic calls context anxiety. In another pattern, an agent attempts to complete an entire task in a single pass, runs out of context mid-implementation, and leaves the next session to guess what work was already done.

Synchronous architectures provide no mechanism to checkpoint state, hand off to a fresh execution, or resume from a known-good point with intermediate results intact. Teams often weigh these failure modes against workflow orchestration platforms and broader agent workflow implementation approaches before deciding what to build.

State Checkpointing Strategies for Async AI Agent Workflows

State checkpointing captures agent execution progress to durable storage at each step boundary, so a crash at step 8 resumes from the checkpoint written after step 7 rather than restarting from step 1. Chat history alone is insufficient: on Terminal-Bench workloads, the Crab paper reports chat-only recovery achieves only 8 to 13% correctness versus 100% for semantics-aware checkpoint/restore, because chat-only approaches miss OS-level side effects like filesystem changes and process state.

Checkpoint design determines recovery fidelity: persisted state controls recovery scope, repeated work, and whether tool outputs remain usable after a restart. Checkpoints must preserve the categories of state that survive failures and handoffs.

Three categories of state require persistence at each checkpoint:

Working memory: current conversation, active task state, recent tool results
Active execution state: current task, immediately preceding step output, structured schema being filled
Long-term memory: cross-session knowledge, episodic memory, patterns across tasks
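The three categories above can be sketched as a single serializable record. This is one possible layout under stated assumptions: the `Checkpoint` class and its field names are hypothetical, and long-term memory is stored as references rather than inline because it usually lives in a separate store.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Checkpoint:
    run_id: str
    step_index: int
    # Working memory: conversation, active task state, recent tool results.
    working_memory: dict = field(default_factory=dict)
    # Active execution state: current task, previous step output, partial schema.
    execution_state: dict = field(default_factory=dict)
    # Long-term memory: keys pointing into a cross-session store (assumption).
    long_term_refs: list = field(default_factory=list)

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "Checkpoint":
        return cls(**json.loads(raw))


saved = Checkpoint(
    run_id="run-1",
    step_index=7,
    working_memory={"recent_tool_results": ["3 rows updated"]},
    execution_state={"current_task": "migrate schema", "previous_output": "ok"},
    long_term_refs=["memories/user-42/deploy-policy"],
)
restored = Checkpoint.from_json(saved.to_json())
```

Round-tripping through JSON is a cheap check that everything in the checkpoint can actually be written to durable storage and read back by a different process.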

Per-Step Snapshot vs. Event History Replay

Per-step snapshot checkpointing and event history replay shape async AI recovery differently, and that choice changes storage cost, exactly-once behavior, and recovery precision.

Per-step snapshot checkpointing in LangGraph serializes a snapshot of the graph state at every super-step, such as after each node in a sequential graph. Microsoft Agent Framework's documentation mentions checkpointing only generally, without specifying full-graph-state serialization per node. LangGraph's BaseCheckpointSaver writes a StateSnapshot containing the complete state values, which node executes next, metadata about what each node wrote, and a parent_config field that creates a checkpoint chain traversable backward through parent references. Recovery granularity reaches the exact failed node with no work lost within the current thread.
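The snapshot approach can be illustrated without any framework. The sketch below is not LangGraph's implementation; `SnapshotStore` and `run` are hypothetical names, and the store is an in-memory stand-in for a database-backed checkpointer.

```python
import copy


class SnapshotStore:
    """Keeps a chain of full-state snapshots per thread (stand-in for a database)."""

    def __init__(self) -> None:
        self._snapshots: dict[str, list[dict]] = {}

    def save(self, thread_id: str, next_step: int, state: dict) -> None:
        snapshot = {"next_step": next_step, "state": copy.deepcopy(state)}
        self._snapshots.setdefault(thread_id, []).append(snapshot)

    def latest(self, thread_id: str):
        chain = self._snapshots.get(thread_id)
        return chain[-1] if chain else None


def run(thread_id: str, steps: list, store: SnapshotStore, fail_at=None) -> dict:
    """Run steps in order, snapshotting after each one and resuming from the last snapshot."""
    snapshot = store.latest(thread_id)
    start = snapshot["next_step"] if snapshot else 0
    state = copy.deepcopy(snapshot["state"]) if snapshot else {"log": []}
    for index in range(start, len(steps)):
        if fail_at == index:
            raise RuntimeError(f"simulated crash before step {index}")
        steps[index](state)
        store.save(thread_id, index + 1, state)  # full state, every step
    return state


executed = []


def make_step(name: str):
    def step(state: dict) -> None:
        executed.append(name)
        state["log"].append(name)
    return step


steps = [make_step(n) for n in ("fetch", "plan", "write", "verify")]
store = SnapshotStore()
try:
    run("thread-1", steps, store, fail_at=2)
except RuntimeError:
    pass
final_state = run("thread-1", steps, store)  # resumes at "write"
```

In this simulation the crash happens between steps, so nothing runs twice. A crash in the middle of a step would re-run that step on resume, which is why side-effecting tools still need idempotency.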

Event history with deterministic replay in Temporal records every state transition as an immutable event in an append-only log. On recovery, it replays workflows from that history to rebuild exact prior state, and does not re-execute LLM API calls modeled as Activities because prior Activity results are returned from stored history. A hard history limit applies; workflows that exceed it use Continue-As-New to start fresh with carried-forward state.
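The replay idea can be sketched in a few lines. This is a simplified illustration, not Temporal's API: `ReplayingWorkflow`, `execute_activity`, and `fake_llm` are hypothetical names, and the history is a plain list rather than a server-side event log.

```python
class ReplayingWorkflow:
    """Returns recorded activity results during replay; executes only unseen activities."""

    def __init__(self, history=None) -> None:
        self.history = list(history or [])  # append-only record of activity results
        self._cursor = 0

    def execute_activity(self, name: str, fn, *args):
        if self._cursor < len(self.history):
            event = self.history[self._cursor]
            if event["activity"] != name:
                raise RuntimeError(f"non-deterministic replay: expected {event['activity']}, got {name}")
            self._cursor += 1
            return event["result"]  # stored value, no new call
        result = fn(*args)
        self.history.append({"activity": name, "result": result})
        self._cursor += 1
        return result


llm_calls = []


def fake_llm(prompt: str) -> str:
    """Stand-in for a billed LLM API call."""
    llm_calls.append(prompt)
    return f"answer to: {prompt}"


def workflow(ctx: ReplayingWorkflow) -> str:
    plan = ctx.execute_activity("plan", fake_llm, "plan the task")
    return ctx.execute_activity("summarize", fake_llm, plan)


first_run = ReplayingWorkflow()
result = workflow(first_run)

# Simulate recovery after a crash that happened once "plan" had been recorded.
recovered = ReplayingWorkflow(history=first_run.history[:1])
replayed_result = workflow(recovered)
```

On recovery the "plan" result comes from history, so only "summarize" triggers a new call. The name check on replay is a crude version of the determinism requirement: workflow code must request the same activities in the same order.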

Dimension	Per-Step Snapshot (LangGraph)	Event History Replay (Temporal)
What's persisted	Full state at every super-step	Event log plus Activity results
Exactly-once tool calls	Developer responsibility	Completed Activity results reused on replay; retried Activities still need idempotency
Storage cost	High (full snapshot per step)	Medium (events plus results)
History limit	No hard limit documented	51,200 events / 50 MB
Operational complexity	Low (library)	High (cluster required)

Failure Recovery for Long-Running Agents

Failure recovery for long-running agents needs controls that observe workflow state and side effects directly, since transport-level success signals miss them. This matters most for duplicate external actions and reasoning loops that keep consuming budget while health checks still look normal.

Standard circuit breakers, retry policies, and health checks operate on HTTP status codes and latency signals. These mechanisms are structurally blind to the failure modes that non-deterministic LLM reasoning introduces. Microsoft's SRE guidance for autonomous AI agents describes cases where every transport signal looked healthy (HTTP 200, normal latency, zero errors) while the agent approved an unauthorized transaction, wrote to the wrong table, or burned roughly $800 of model budget in a reasoning loop before anyone noticed.

Tool Execution Failures with Side Effects

Tool execution failures with side effects are the highest-severity failure mode in write-heavy workflows: unless recovery logic records prior execution, a retry can re-run and duplicate a partially completed mutation.

Tool Category	Risk	Recovery Strategy
Data access (read-only)	Low	Retry aggressively; fall back to cached response
Computation (process and transform)	Medium	Retry with idempotency consideration
Mutation (write operations)	High	Require idempotency keys; never naively retry

Idempotency key patterns for agents key on (run_id, step_index, action_type) to deduplicate write operations at the storage layer regardless of LLM non-determinism.
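A minimal sketch of that key, assuming a SQL store with a composite primary key. The table name, column layout, and `write_once` helper are hypothetical; SQLite is used only so the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE tool_writes (
        run_id      TEXT    NOT NULL,
        step_index  INTEGER NOT NULL,
        action_type TEXT    NOT NULL,
        payload     TEXT    NOT NULL,
        PRIMARY KEY (run_id, step_index, action_type)
    )
    """
)


def write_once(run_id: str, step_index: int, action_type: str, payload: str) -> bool:
    """Insert the write unless the same (run, step, action) was already recorded."""
    cursor = conn.execute(
        "INSERT OR IGNORE INTO tool_writes VALUES (?, ?, ?, ?)",
        (run_id, step_index, action_type, payload),
    )
    conn.commit()
    return cursor.rowcount == 1  # 0 means the key already existed


first = write_once("run-1", 8, "send_invoice", '{"amount": 120}')
# A retry after a crash: the LLM may phrase the payload differently,
# but the key is derived from the workflow position, not the model output.
retry = write_once("run-1", 8, "send_invoice", '{"amount": 120.0}')
row_count = conn.execute("SELECT COUNT(*) FROM tool_writes").fetchone()[0]
```

Because the key comes from workflow position rather than generated content, the deduplication holds even when a retried step produces slightly different arguments.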

Reasoning Loops and Budget Exhaustion

Reasoning loops and budget exhaustion require workflow-level limits because agents keep re-planning after tool failures in ways infrastructure monitors cannot see, so unchecked internal retries consume unbounded budget.

An agent receiving a tool timeout may re-plan internally, issue a modified request to the same rate-limited endpoint, receive another 429, re-plan again, and iterate. This loop is invisible to infrastructure-layer circuit breakers because the retries occur inside reasoning, where the calling code that those breakers monitor never registers them.

LangGraph addresses this with a configurable recursion limit, a cap on the number of super-steps per run, which acts as a circuit breaker at the workflow level rather than the infrastructure level. Soft limits with human-in-the-loop escalation preserve long-running workflows at the cost of requiring interrupt infrastructure.
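A workflow-level limit can be sketched independently of any framework. The names `run_agent_loop` and `BudgetExceeded`, the iteration cap, and the per-iteration cost are illustrative assumptions.

```python
class BudgetExceeded(Exception):
    """Raised when the loop hits its iteration or spend ceiling."""


def run_agent_loop(plan_step, max_iterations: int = 5, max_cost_usd: float = 1.00):
    """Call plan_step until it reports done, enforcing iteration and cost limits.

    plan_step(iteration) must return (done, cost_usd) for that iteration.
    """
    spent = 0.0
    for iteration in range(1, max_iterations + 1):
        done, cost = plan_step(iteration)
        spent += cost
        if done:
            return iteration, spent
        if spent >= max_cost_usd:
            raise BudgetExceeded(f"spent ${spent:.2f} after {iteration} iterations")
    raise BudgetExceeded(f"no result after {max_iterations} iterations")


attempts = []


def always_rate_limited(iteration: int):
    """Simulates an agent that re-plans after every 429 and never finishes."""
    attempts.append(iteration)
    return False, 0.10


try:
    run_agent_loop(always_rate_limited, max_iterations=5, max_cost_usd=1.00)
    outcome = "completed"
except BudgetExceeded as exc:
    outcome = f"escalate to human: {exc}"
```

The limit lives in the loop that drives the agent, which is the only place the re-planning is visible. Catching the exception is where a soft limit would hand off to a human instead of failing the run.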

Graceful Degradation Ladder

Graceful degradation in async AI agent workflows applies recovery controls in layers for transient failures, rate limits, provider outages, and unrecoverable states.

Production systems commonly use this recovery sequence:

Retry with exponential backoff and jitter for transient errors (5xx, network timeouts)
Honor Retry-After and pause for rate limits (429)
Circuit breaker open plus provider failover for sustained primary provider unavailability
Cached response for read-only operations when all providers are down
Simplified reasoning mode using a reduced-capability model or abbreviated tool set
Human-in-the-loop escalation via interrupt mechanism for decisions requiring authorization
Dead letter queue for unrecoverable state after retry exhaustion
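The first two rungs of the sequence above can be sketched as a single delay policy. This is an illustration under assumptions: `next_delay` is a hypothetical helper, the base and cap values are arbitrary, and the jitter uses the "full jitter" variant (a random delay between zero and the exponential ceiling).

```python
import random


def next_delay(attempt: int, status: int, retry_after=None,
               base: float = 0.5, cap: float = 30.0, rng=random.random):
    """Return seconds to wait before retrying, or None if the error is not retryable."""
    ceiling = min(cap, base * 2 ** attempt)
    if status == 429:
        # Rate limited: honor the server's Retry-After when present.
        return retry_after if retry_after is not None else ceiling
    if 500 <= status < 600:
        # Transient server error: exponential backoff with full jitter.
        return rng() * ceiling
    return None  # 4xx other than 429: retrying will not help


rate_limited_delay = next_delay(attempt=0, status=429, retry_after=7.0)
not_retryable = next_delay(attempt=0, status=400)
```

The later rungs (failover, cached responses, reduced-capability mode, escalation, dead letter queue) sit above this policy and take over once the retry budget is spent.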

AWS Step Functions provides error handling with configurable backoff and jitter for retries, including LLM-integrated workflows. Moving retry orchestration into the state machine definition makes retries durable across process restarts.

Augment Cosmos Environments isolate execution, while Sessions preserve memory and auditable, replayable run history for long-running recovery paths.

Human-in-the-Loop Async Approval Patterns

Human-in-the-loop async approval patterns persist workflow state before an external decision event, then resume after the decision arrives, holding no live process while a person decides.

Every production pattern shares one requirement: persist agent state to external storage before the human wait begins, so a later invocation can rehydrate it. In LangGraph, the checkpointer saves the interrupted thread so it can be resumed later under the same thread identity.

Checkpoint, Interrupt, Resume

Checkpoint, interrupt, and resume patterns in LangGraph pause graph execution at a persisted checkpoint, surface the approval request externally, and later rehydrate the same thread when a human decision arrives.

Teams implementing this directly should pin explicit versions of the language runtime, LangGraph, the Postgres checkpoint library, and PostgreSQL so resumed threads stay compatible.

In practice, the first invocation pauses at the interrupt and returns an approval request; a later invocation with the same thread identity resumes from the persisted checkpoint instead of restarting. Common failure modes are missing persistence setup, missing dependencies, or resuming with a different thread configuration than before the interrupt.
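The two-invocation flow can be sketched without the framework. The code below is not LangGraph's API: `invoke`, `STORE`, and the state fields are hypothetical, and the dict of JSON strings stands in for a Postgres-backed checkpointer.

```python
import json

STORE: dict[str, str] = {}  # thread_id -> JSON checkpoint (stand-in for Postgres)


def invoke(thread_id: str, task=None, decision=None) -> dict:
    """First call pauses at the approval point; a later call with a decision resumes."""
    raw = STORE.get(thread_id)
    if raw is None:
        if task is None:
            raise ValueError(f"no checkpoint for thread {thread_id!r} and no task given")
        state = {"task": task, "draft": f"plan for {task}", "status": "awaiting_approval"}
        STORE[thread_id] = json.dumps(state)  # persist before waiting on a human
        return {"interrupt": {"question": "Approve this plan?", "draft": state["draft"]}}

    state = json.loads(raw)
    if state["status"] == "awaiting_approval" and decision is not None:
        state["status"] = "executed" if decision == "approve" else "rejected"
        STORE[thread_id] = json.dumps(state)
    return state


paused = invoke("thread-7", task="drop stale index")
resumed = invoke("thread-7", decision="approve")

# Resuming under a different thread identity finds no checkpoint.
try:
    invoke("thread-8", decision="approve")
    wrong_thread_error = None
except ValueError as exc:
    wrong_thread_error = str(exc)
```

Nothing stays in memory between the two calls; the second invocation could run in a different process days later, as long as it presents the same thread identity.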

LangChain's middleware supports four decision types: approve executes the tool call as-is, edit modifies arguments before executing, reject cancels with feedback, and respond returns a human reply directly as a ToolMessage.

Durable Wait with Temporal Signals

Durable wait with Temporal signals pauses workflow progress on the server through persisted event history, keeping timeout and escalation logic inside the workflow definition instead of in external schedulers.

Temporal workflows use workflow.wait_condition() to pause at a checkpoint, with escalation policies, SLA timers, and reminder intervals configured in the workflow itself. The workflow handles approval waits as part of its logic; status shows Running during the wait while the server persists the full event history.
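The timeout-as-denial policy can be illustrated with plain asyncio. This sketch is an in-process analogy only: it is not durable and does not use the Temporal SDK, and `ApprovalGate` is a hypothetical class. In a Temporal workflow the wait would be expressed with workflow.wait_condition() and its timeout parameter instead.

```python
import asyncio


class ApprovalGate:
    """Waits for a human decision and treats a timeout as a denial."""

    def __init__(self) -> None:
        self._event = asyncio.Event()
        self.decision = None

    def signal(self, decision: str) -> None:
        """Called when the approver responds (the analogue of a workflow signal)."""
        self.decision = decision
        self._event.set()

    async def wait(self, timeout_seconds: float) -> str:
        try:
            await asyncio.wait_for(self._event.wait(), timeout=timeout_seconds)
        except asyncio.TimeoutError:
            return "denied"  # no answer within the SLA counts as a denial
        return self.decision


async def demo() -> tuple:
    unanswered = ApprovalGate()
    answered = ApprovalGate()
    # Simulate an approver responding to the second gate after 10 ms.
    asyncio.get_running_loop().call_later(0.01, answered.signal, "approved")
    return await unanswered.wait(0.05), await answered.wait(1.0)


results = asyncio.run(demo())
```

Keeping the timeout in the same code path as the wait means the denial branch is part of the workflow logic, so reminder and escalation steps can be added next to it rather than in a separate scheduler.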

Pattern	Resource Usage During Wait	Timeout Handling	Framework
Checkpoint + Interrupt	Zero (process exits)	External cron or re-invoke	LangGraph
Temporal Signal/Wait	Zero (workflow suspended)	Built-in timeout= parameter	Temporal
Approval Queue + Polling	Zero (agent terminated)	Escalation policy / approval deadline handling	HumanLayer, custom

Operations that should trigger approval gates include production database mutations (DELETE, DROP, ALTER), external API calls with side effects (payments, emails), system configuration changes, and compliance-sensitive operations (GDPR, HIPAA, SOC2). For broader operating models around approvals and handoffs, readers often pair this topic with AI-human collaboration and agent handoff patterns.
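A gate policy along these lines can start as a small predicate in front of tool execution. The tool names and the `requires_approval` function are hypothetical, and a regex check on SQL is a deliberately crude first pass rather than a complete classifier.

```python
import re

# Statements named above as requiring approval.
DESTRUCTIVE_SQL = re.compile(r"^\s*(DELETE|DROP|ALTER)\b", re.IGNORECASE)

# Hypothetical tool names for side-effecting external calls and config changes.
SIDE_EFFECT_TOOLS = {"send_email", "create_payment", "update_system_config"}


def requires_approval(tool_name: str, arguments: dict) -> bool:
    """Return True when a tool call should pause for human approval."""
    if tool_name in SIDE_EFFECT_TOOLS:
        return True
    if tool_name == "run_sql":
        return bool(DESTRUCTIVE_SQL.match(arguments.get("query", "")))
    return False


needs_gate = requires_approval("run_sql", {"query": "drop table users"})
read_only = requires_approval("run_sql", {"query": "SELECT * FROM users"})
```

Compliance-sensitive operations usually need more context than a tool name, so a production policy would likely consult data classification or resource tags as well.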

Memory Across Sessions

Memory across sessions lets long-running agents continue work through preserved state. Short-term memory tracks the ongoing conversation within a session, while long-term memory persists facts, preferences, and behavioral patterns across sessions.

Keeping active conversation state separate from reusable knowledge lets an agent hand off work across restarts and separate sessions without losing useful state.

Short-Term: Conversation Buffer with Summarization

Short-term conversation memory with summarization controls unbounded message growth by replacing older exchanges with persisted summaries, keeping useful context available after restarts.

Unbounded conversation history eventually exceeds practical limits. LangGraph documents four common strategies for managing short-term memory: trim messages, delete messages, summarize messages, and custom strategies.

A critical distinction separates transient context for a single call from persistent context saved in state across turns: summarization must permanently replace old messages with a summary in state, or it only trims the LLM call context without solving unbounded growth.
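The distinction is easier to see in code. In this sketch the `compact` function and state shape are assumptions, and the default summarizer is a placeholder for an LLM call; the point is that the returned state, not just the prompt, no longer contains the old messages.

```python
def compact(state: dict, keep_last: int = 4, summarize=None) -> dict:
    """Replace all but the most recent messages with a summary stored in state."""
    if summarize is None:
        # Placeholder summarizer; a real one would call a model.
        def summarize(messages, prior):
            prefix = f"{prior} " if prior else ""
            return f"{prefix}[{len(messages)} earlier messages summarized]"

    messages = state["messages"]
    if len(messages) <= keep_last:
        return state
    old, recent = messages[:-keep_last], messages[-keep_last:]
    return {
        "summary": summarize(old, state.get("summary", "")),
        "messages": recent,  # old messages are gone from persisted state
    }


state = {"summary": "", "messages": [f"message {i}" for i in range(10)]}
compacted = compact(state)
unchanged = compact(compacted)
```

Persisting `compacted` in place of `state` is what bounds growth. Building the same summary only for a single model call would leave all ten messages in storage.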

Long-Term: Cross-Session Semantic Memory

Long-term cross-session semantic memory stores reusable facts and patterns outside the active thread, so future sessions retrieve prior knowledge instead of reconstructing it.

LangGraph treats the checkpointer (thread-scoped state) and the store (cross-thread memory) as separate components. In that model, the messages table remains append-only and application-owned as the UI source of truth, the LangGraph Checkpointer (PostgresSaver) holds thread-scoped state under LangGraph ownership, and the LangGraph Store (PostgresStore) keeps cross-thread semantic memory as namespaced documents. Together, those layers separate active execution state from reusable memory.
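The split between thread-scoped state and cross-thread memory can be sketched with two plain containers. This is not the PostgresSaver or PostgresStore API: `NamespacedStore`, `run_turn`, and the "remember:" convention are hypothetical and exist only to show the scoping.

```python
class NamespacedStore:
    """Cross-thread memory: documents grouped under a namespace tuple."""

    def __init__(self) -> None:
        self._docs: dict[tuple, dict] = {}

    def put(self, namespace: tuple, key: str, value: dict) -> None:
        self._docs.setdefault(namespace, {})[key] = value

    def search(self, namespace: tuple) -> list:
        return list(self._docs.get(namespace, {}).values())


thread_checkpoints: dict[str, dict] = {}  # thread-scoped state, keyed by thread_id
store = NamespacedStore()                 # shared across threads


def run_turn(thread_id: str, user_id: str, message: str) -> dict:
    """Append to this thread's state and read or write the user's shared memories."""
    state = thread_checkpoints.setdefault(thread_id, {"messages": []})
    state["messages"].append(message)
    namespace = ("memories", user_id)
    if message.startswith("remember:"):
        fact = message[len("remember:"):].strip()
        store.put(namespace, f"fact-{len(store.search(namespace))}", {"fact": fact})
    return {"messages": list(state["messages"]), "memories": store.search(namespace)}


turn_a = run_turn("thread-a", "user-1", "remember: deploys freeze on Fridays")
turn_b = run_turn("thread-b", "user-1", "plan the release")
```

The second thread starts with an empty message history but still sees the fact saved in the first, which is the behavior the checkpointer and store split is meant to provide.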

Async workflow memory usually spans episodic (past events and interactions), semantic (reusable domain knowledge), and procedural (how to perform routines and tasks) types.

The MemGPT paper introduces a tiered model inspired by virtual memory. In that model, agents actively manage their own memory through function calls and decide what to retain, summarize, or archive across main context, recall storage, and archival storage.

Augment Cosmos keeps persistent context across sessions, so cross-session recovery workflows maintain codebase understanding instead of re-explaining context after each run.

How Cosmos Manages Persistent Agent Workflows

Cosmos manages persistent agent workflows through isolated execution environments, reusable agent behavior definitions, and durable sessions. Those primitives keep long-running work intact across restarts instead of disappearing with a process crash.

Augment Cosmos combines Environments, Experts, and Sessions for production cloud agent workflows:

Environments define where agents run and what they can touch, with isolation across sessions
Experts define how agents behave, what tools they use, and what events they subscribe to
Sessions turn one-off prompts into auditable, replayable workflows that persist across days and weeks

Cosmos limits manual coordination to defined review checkpoints while agents handle the work between them across long-running and parallel sessions.

With BYOK across Anthropic, OpenAI, Bedrock, Vertex, and open-source models, Augment Cosmos supports model-agnostic execution because the platform does not lock one workflow to a single provider roadmap.

Build Async Agent Workflows That Survive Production

The core tradeoff is whether to build checkpointing, idempotency, approval handling, and memory infrastructure in-house or adopt a platform that includes those controls. External writes and human review widen the recovery scope further, moving the problem beyond a single request-response cycle.

Start by mapping one real agent workflow against timeout ceilings, duplicate writes, approval waits, and cross-session memory loss. Then trace where the current stack already has durable controls and where it still depends on process uptime, client retries, or manual reconstruction of state. That gap analysis makes the build-versus-buy decision concrete, tying architecture choices to one workflow's recovery cost rather than platform claims.

Written by

Paula Hingel
