Multi-Agent Frameworks: Five Bets, Three Categories, One Decision
Anthropic Managed Agents, LangGraph, CrewAI, OpenAI Agents SDK, and Flue solve the same surface problem with five very different bets. Three categories: hosted runtime, library/orchestrator, harness primitive. The same workflow, spiked across all five (cd6 code review, 890 LOC of working spike code), shows the LOC tax for each framework's distinctive value layer — and where each one actually earns it. Side-by-side matrix, programming-model shapes, cost crossover analysis, and the question your team is actually answering.
TL;DR
Five frameworks, three categories. The bet each one is making is more important than its feature list:
| Category | Members | The bet |
|---|---|---|
| Hosted runtime | Anthropic Managed Agents | “Bring a YAML, leave with a running agent.” Loop, sandbox, state, billing — all server-side. |
| Library / orchestrator | LangGraph, CrewAI, OpenAI Agents SDK | “Pip install. You own the runtime.” Different opinions on programming model. |
| Harness primitive | Flue | “Claude Code as a library.” Sandbox + skills + agents-as-files. Multi-agent emerges, not declared. |
The apples-to-apples comparator for Anthropic Managed Agents is not bare OpenAI Agents SDK — it’s AgentKit + Agents SDK (OpenAI’s hosted runtime). Get this framing right or every build/buy comparison drifts.
For “learn the space,” pick two anchors to deeply understand — LangGraph (graph-based, mature, OSS) and Anthropic Managed Agents (hosted, opinionated, vendor-locked). The other three become understandable as variations on these two axes.
The five frameworks were spiked end-to-end on the same workflow (cd6 code review crew, 890 lines of working spike code). The LOC ratios tell a sharper story than the feature matrices: CMA’s distinctive lever (shared filesystem) is cheap to add (+28 LOC). OpenAI’s two patterns are roughly equal weight. LangGraph’s value layer (StateGraph + Command + checkpointer + interrupt) costs ~3× the shallow shape — but it’s the only spike where routing is deterministic and the workflow survives a process restart.
Side-by-Side Matrix
| Dimension | Anthropic Managed Agents | LangGraph | CrewAI | OpenAI Agents SDK | Flue |
|---|---|---|---|---|---|
| License | Proprietary (managed service) | MIT (lib) / commercial (Platform) | MIT (lib) / commercial (AMP) | MIT | Apache-2.0 |
| Status | Public beta; multi-agent in research preview | v1.1.10 stable, 1.2 alpha | v1.14.4 stable | v0.16.0 (pre-1.0) | v0.0.x experimental |
| Age / maturity | New (Apr 2026) | 1.0 GA late 2025; mature | Mature, ~50k stars | Production-ready successor to Swarm | 3 months old |
| Primary pattern | Coordinator + roster (depth=1) | StateGraph + Command + interrupt | Crew + Process + Flow | Agents + Handoffs + Guardrails | Agent-as-file, subagent via task() |
| Multi-agent shape | Supervisor-only | Supervisor / swarm / hierarchical / router | Sequential or hierarchical | Handoff chain | Emerges from task() + shared sandbox |
| State model | Per-thread isolation + shared FS in container | Typed State + checkpoints + BaseStore | Consolidated Memory (LanceDB) + Knowledge | Sessions (9 backends) | In-memory or Durable Objects |
| Time travel / HITL | user.interrupt; no checkpoint rewind documented | Yes — checkpoints + interrupt/Command(resume=) | Limited | Approval/interrupt mechanisms | Not first-class |
| Observability | First-class event stream (SSE) | LangSmith (paid) or OTel/Langfuse | Plugin-only (Langfuse, AgentOps, MLflow…) | Default-on tracing → OpenAI dashboard; 25+ processors | Not described |
| Provider lock-in | Claude only | None (BYO LLM) | None (BYO LLM) | OpenAI-leaning; LiteLLM possible w/ caveats | None (multi-model strings) |
| Hosting | Anthropic-managed only | BYO (lib) or self-host Server | BYO (lib) or AMP | BYO; AgentKit for hosted | BYO (Node / CF / CI) |
| SDK languages | Python, TS, Java, Go, C#, Ruby, PHP + CLI | Python, TS | Python | Python, TS | TS only |
| Pricing | $0.08/session-hour + token rates | Lib free; Platform commercial | Lib free; AMP commercial | Lib free; OpenAI tokens; AgentKit separate | Free (BYO compute) |
| Sandbox / code-exec | Built-in container | Not provided (BYO) | Not provided (BYO) | Sandbox Agents (Apr 2026) | Pluggable (just-bash / Daytona / CF) |
| Production users (named) | Anthropic-internal early access | LinkedIn, Uber, Klarna, Replit, Elastic, AppFolio | PwC, IBM, NVIDIA, PepsiCo, J&J, DocuSign, US DoD (vendor-stated) | Klarna, Canva, Clay, OpenAI internal | None public |
| Independent post-mortems | None yet | Few; mostly LangChain-published | Few; mostly vendor-stated | Few; mostly OpenAI-stated | None |
Stop Asking “Which Is Best.” Ask Which Axis Matters.
There is no winner across all teams. There are five real trade-off axes. Decide which one is load-bearing for your situation, then the framework choice mostly falls out.
1. Control plane — yours or theirs?
| Yours (lib) | Theirs (managed) |
|---|---|
| LangGraph, CrewAI, OpenAI Agents SDK, Flue | Anthropic Managed Agents, OpenAI AgentKit, CrewAI AMP, LangGraph Platform |
Theirs: faster to first running agent. Pay per session-hour. Lock-in surface = the API shape + the vendor’s infra + (sometimes) the model.
Yours: you own retries, scaling, observability, persistence. Pay your AWS bill. No lock-in beyond the OSS API. Slower to start, cheaper at scale, mandatory if compliance forbids data leaving your environment.
2. Programming model — graph, persona, or handoff?
- Graph (LangGraph): explicit nodes/edges/state. Steepest curve. Most expressive. Best for stateful, durable, branching workflows.
- Persona (CrewAI): role/goal/backstory + Process enum. Fastest to a 50-line demo. Token-heavy at scale; debugging is “what prompt actually got sent?”.
- Handoff (OpenAI Agents SDK, Anthropic CMA, Flue): agent calls agent as a tool. Closest to how engineers think. Semantic surprises (“why did agent A keep talking?”) are the failure mode.
These aren’t equivalent — they encode different bets about what the developer should think about.
3. State strategy — message thread or filesystem?
- Message thread (LangGraph state, OpenAI Sessions, CrewAI Memory): pass state as data through messages or typed state object. Token cost grows with state.
- Shared filesystem (Anthropic CMA, Flue): agents read/write files in a shared sandbox. Cheaper for large artifacts. New problem: race conditions, no transactions.
CMA’s shared-FS-as-side-channel is genuinely novel and worth understanding even if you don’t adopt CMA. It changes the cost model.
4. Observability — built-in or BYO?
| Strongest built-in | Plugin-first | Hosted-only |
|---|---|---|
| OpenAI Agents SDK (default-on tracing), Anthropic CMA (SSE event stream) | CrewAI (Langfuse/MLflow/etc.), LangGraph (LangSmith or OTel) | (none of the five) |
Compliance flag: OpenAI Agents SDK ships with tracing on by default, sending data to OpenAI’s dashboard. Disable before first production run if your data policy requires it. Non-negotiable for healthcare/finance/EU-data workloads.
5. Vendor / model coupling
- Single-vendor: CMA (Claude only), AgentKit (OpenAI-leaning).
- Multi-vendor BYO: LangGraph, CrewAI, Flue (model strings).
- Multi-vendor with caveats: OpenAI Agents SDK (LiteLLM works, but hosted tools — web_search, code_interpreter — assume OpenAI models).
If “no platform lock-in” is org policy, that knocks out CMA entirely and complicates Agents SDK adoption.
What Each Framework Actually Is
Five paragraphs. Read the one your team is closest to.
Anthropic Managed Agents — the hosted runtime
Claude Managed Agents (CMA) is Anthropic’s hosted agent harness. It bundles model + tools + sandbox container + event stream + persistence behind a small set of REST endpoints. Multi-agent is one feature within CMA, not a standalone framework. From the docs: “Pre-built, configurable agent harness that runs in managed infrastructure. Best for long-running tasks and asynchronous work.”
Four core concepts: Agent (versioned config), Environment (container template), Session (a running instance, stateful, persistent FS), Events (SSE-based stream).
Multi-agent extends this: a coordinator agent declares a roster of other agents it may delegate to. Each delegated agent gets its own session thread but shares the same container/filesystem.
# client: an anthropic.Anthropic() instance; COORDINATOR_PROMPT and the three
# child agents (architect/builder/reviewer) are created earlier in the spike.
coordinator = client.beta.agents.create(
    name="Engineering Lead",
    model="claude-opus-4-7",
    system=COORDINATOR_PROMPT,
    tools=[{"type": "agent_toolset_20260401"}],
    multiagent={
        "type": "coordinator",
        "agents": [
            {"type": "agent", "id": architect.id},
            {"type": "agent", "id": builder.id},
            {"type": "agent", "id": reviewer.id},
        ],
    },
)
Hard constraints from the docs: “The coordinator can only delegate to one level of agents; depth > 1 is ignored.” Max 20 unique agents in multiagent.agents. Concurrent threads per session: 25. Models: Claude 4.5 and later only.
Pricing: standard Claude tokens plus $0.08 per session-hour, billed to the millisecond, charged while any thread is running. Idle = free. Multi-agent cost note: a coordinator firing 5 parallel children = 5× token spend, and the session-hour clock ticks while any thread runs.
The distinctive primitive is the shared filesystem as state channel. Most OSS frameworks pass state via explicit message-passing (LangGraph state object, CrewAI task outputs). CMA leans on the FS as side-channel state. Agents pass artifact pointers (/workspace/design.md) instead of stuffing 5KB into messages. This is the version of CMA that earns the session-hour fee.
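The difference in code shape is small but the cost shape isn't — a sketch with hypothetical prompts, using the spike's /workspace convention:

```python
# Inline state channel: the artifact itself rides in every child prompt,
# so prompt tokens scale with artifact size x children x turns.
design = open("design.md").read()                            # ~2.5 KB
inline_prompt = f"Review this design:\n\n{design}"

# FS state channel: only a pointer rides in the prompt; the child reads
# the artifact inside the shared sandbox with its own tools.
pointer_prompt = "Review the design at /workspace/design.md"  # ~45 chars
```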
LangGraph — the graph machine
LangGraph is a library for building stateful, graph-based agent workflows from LangChain. License MIT. ~31,400 GitHub stars. Most stable line: 1.1.x (latest 1.1.10, Apr 2026); 1.2.0 in alpha. Hit 1.0 GA in late 2025, so past the early-churn phase but still iterating.
LangGraph sits at the orchestration layer. It is not a model SDK. You bring your own LLM client (Anthropic, OpenAI, local). It owns the state machine, persistence, streaming, human-in-the-loop, and (optionally) deployment.
The core abstraction is StateGraph — a directed graph where nodes are functions, edges define transitions, and a typed State (TypedDict, Pydantic, or dataclass) flows through them with reducer-style updates. Key primitives: add_node, add_edge, add_conditional_edges, Command (state update + node transition in one — how handoffs are implemented), interrupt (HITL), Send (fan-out), subgraphs.
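A minimal sketch of that shape (a hypothetical two-node review graph with node logic stubbed; not the spike code itself):

```python
from typing import Literal, TypedDict

from langgraph.graph import END, START, StateGraph
from langgraph.types import Command


class ReviewState(TypedDict):
    draft: str
    verdict: str


def builder(state: ReviewState) -> Command[Literal["reviewer"]]:
    # Command = state update + node transition in one return value
    return Command(goto="reviewer", update={"draft": "stub implementation"})


def reviewer(state: ReviewState) -> dict:
    return {"verdict": "approve"}


g = StateGraph(ReviewState)
g.add_node("builder", builder)
g.add_node("reviewer", reviewer)
g.add_edge(START, "builder")
g.add_edge("reviewer", END)
graph = g.compile()
```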
Two state layers, intentionally separated:
- Short-term (thread): in-graph State, persisted as checkpoints at each super-step. Time travel via get_state_history and update_state. Exact resumption after crash.
- Long-term (cross-thread): BaseStore interface. Namespaced K/V with optional vector search.
Persistence backends: in-memory (dev), SQLite (single-process), Postgres (production default), Redis (community).
Time travel = real. You can rewind to any prior checkpoint, edit state, fork. This is the killer feature vs CrewAI/Swarm-style frameworks.
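In code, reusing the g builder from the sketch above (in-memory checkpointer for illustration; Postgres is the production default):

```python
from langgraph.checkpoint.memory import MemorySaver

graph = g.compile(checkpointer=MemorySaver())
config = {"configurable": {"thread_id": "review-42"}}
graph.invoke({"draft": "", "verdict": ""}, config)

# One StateSnapshot per super-step, newest first
history = list(graph.get_state_history(config))

# Rewind to the oldest checkpoint, edit state there, fork the thread
oldest = history[-1]
graph.update_state(oldest.config, {"draft": "rewritten"})
```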
Production evidence (LangChain’s own list — case-study quality, not third-party post-mortems): LinkedIn (AI recruiter, hierarchical), Uber (code migration + tests), Klarna (~85M users, claimed 80% resolution-time reduction), Replit (multi-agent coding copilot), Elastic (security threat detection), AppFolio (property-manager copilot). Replit and Uber have spoken at conferences, lending weight there.
Honest pitfalls from the issue tracker: checkpoint serialization bloat (open issue: 85% storage / 37.8% token overhead, no opt-out path), Postgres SSL errors recurring across versions, run-cancellation drops streamed-but-uncheckpointed state, prebuilt-package version drift, vendor lock-in toward LangSmith for the best DX.
CrewAI — the persona crew
Python multi-agent framework, completely independent of LangChain (built from scratch). License MIT. Latest 1.14.4 (Apr 2026). ~50.8k GitHub stars. Two product surfaces: open-source library and CrewAI Enterprise / AMP Suite (commercial control plane).
Marketing claim: “5.76x faster than LangGraph in certain cases.” Treat with skepticism — vendor benchmark, no independent reproduction surfaced.
Two layered abstractions: Flow (event-driven, stateful workflow backbone) and Crew (team of role-playing agents collaborating on a delegated task). Each Agent has role, goal, backstory, tools, LLM. Tasks have description + expected output + agent assignment + optional context. Process is a sequential or hierarchical enum.
In hierarchical mode, a manager LLM (or manager_agent) dynamically allocates tasks to agents based on capabilities, reviews outputs, validates completion. You pass manager_llm="gpt-4o" or a custom manager agent.
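A minimal sketch of the hierarchical shape (hypothetical roles and task, not the spike's personas):

```python
from crewai import Agent, Crew, Process, Task

researcher = Agent(
    role="Senior Researcher",
    goal="Find prior art on agent frameworks",
    backstory="10 years in developer tooling.",  # persona text, resent per turn
)
writer = Agent(
    role="Tech Writer",
    goal="Turn findings into a one-page brief",
    backstory="Edits engineering RFCs.",
)

brief = Task(
    description="Summarize multi-agent framework trade-offs.",
    expected_output="One page of Markdown.",  # no agent= — the manager assigns
)

crew = Crew(
    agents=[researcher, writer],
    tasks=[brief],
    process=Process.hierarchical,
    manager_llm="gpt-4o",  # allocates tasks, reviews outputs, validates completion
)
result = crew.kickoff()
```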
Pattern is persona-heavy — agents prompted as personas (“You are a Senior Researcher with 10 years…”). Critics argue this inflates token usage and adds variance vs cleaner functional decomposition.
Memory is consolidated under one Memory class (older docs reference short/long/entity types separately — terminology shifted). Default storage: LanceDB under ./.crewai/memory. Default embeddings: OpenAI.
Observability is deferred to ecosystem. Production teams pick a vendor (Langfuse most common in OSS) or pay for AMP. The open-core pattern.
Vendor-stated production users: PwC, IBM, NVIDIA, PepsiCo, J&J, DocuSign, US DoD (~150 enterprise customers within first 6 months of launch, ~2 billion agentic executions in trailing 12 months). Treat as logo-deck signal, not architectural validation — independent post-mortems naming companies are rare.
Honest pitfalls: token bloat from role prompts (persona system prompts repeated per turn balloon costs in hierarchical mode), manager-LLM latency tax (1-2 extra LLM calls per task), deployment footprint (crewai[tools] venv approaches 1 GB), version churn (Memory API consolidation, Flow vs Crew best-practice shift).
OpenAI Agents SDK — the handoff orchestra
Open-source Python (and TypeScript) framework for multi-agent workflows. Production-ready successor to Swarm (which OpenAI explicitly retired as “educational”). Provider-agnostic via LiteLLM and Any-LLM adapters. MIT-licensed. Latest v0.16.0 (May 2026). ~26k stars. Pre-1.0 (still 0.x), but OpenAI calls it “production-ready.”
Three primitives carried forward from Swarm, expanded:
- Agents — LLM + instructions + tools + guardrails + handoffs.
- Handoffs — agents delegate to other agents (originally Swarm’s signature pattern). Modeled as a special tool call; see the sketch after this list.
- Guardrails — input/output validation that can short-circuit a run.
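A minimal sketch of the handoff shape (hypothetical triage/billing agents; assumes the pip-installable openai-agents package):

```python
from agents import Agent, Runner

billing = Agent(name="billing", instructions="Resolve billing questions.")
refunds = Agent(name="refunds", instructions="Process refund requests.")

triage = Agent(
    name="triage",
    instructions="Route the user to the right specialist.",
    handoffs=[billing, refunds],  # exposed to the LLM as special tool calls
)

# Control transfers to whichever specialist the triage agent hands off to
result = Runner.run_sync(triage, "I was double-charged last month.")
print(result.final_output)
```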
Newer additions: Sandbox Agents (container-based runtime for long-horizon / code-execution tasks; Apr 2026 update — OpenAI’s response to “uncontrolled tool access” failures observed in Q1 2026 production), MCP servers, hosted tools (web search, file search, code interpreter, computer use), Realtime Agents (voice), human-in-the-loop approval/interrupt.
Design principle from the docs: “Python-first… use built-in language features to orchestrate and chain agents, rather than needing to learn new abstractions.” No DSL, no graph compiler — contrasts sharply with LangGraph.
State via Sessions — built-in conversation memory across Runner.run calls. Replaces manual .to_input_list() plumbing. Backends shipped or via extras: SQLite (dev), Redis (low-latency distributed), SQLAlchemy, MongoDB, Dapr, OpenAIConversationsSession (server-managed), EncryptedSession (wraps another, adds TTL), AdvancedSQLiteSession (branching/analytics) — 9 backends total.
Critical caveat: “Sessions cannot be combined with conversation_id, previous_response_id, or auto_previous_response_id in the same run.” Pick one source of conversation truth.
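In code, Sessions look like this (a sketch with a hypothetical session id; SQLite backend for dev):

```python
from agents import Agent, Runner, SQLiteSession

agent = Agent(name="assistant", instructions="Be concise.")
session = SQLiteSession("conversation-42", "conversations.db")

# Both runs share one stored thread; no manual .to_input_list() plumbing
Runner.run_sync(agent, "My name is Ada.", session=session)
result = Runner.run_sync(agent, "What's my name?", session=session)
```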
Tracing on by default, sinking to OpenAI’s dashboard via BatchTraceProcessor. Disable: env OPENAI_AGENTS_DISABLE_TRACING=1 or set_tracing_disabled(True) or per-run RunConfig.tracing_disabled. 25+ third-party processor integrations: Logfire, AgentOps, Braintrust, LangSmith, Langfuse, Langtrace, Arize-Phoenix, MLflow, W&B, Datadog, Keywords AI, Scorecard. Strongest observability story among OSS contenders out of the box — and the strongest compliance footgun.
Apples-to-apples vs Anthropic Managed Agents is AgentKit + Agents SDK, not Agents SDK alone. AgentKit (separate OpenAI product, launched Oct 2025) wraps the SDK with a hosted ChatKit UI, agent builder, and managed runtime.
Flue — the harness primitive
Flue (withastro/flue) is a TypeScript runtime for headless, programmable agents that feel like Claude Code but with no TUI, no human-in-the-loop assumption, and a pluggable sandbox. Runtime-agnostic (Node.js, Cloudflare Workers, GitHub Actions / GitLab CI). Apache-2.0. Maintainer: Fred K. Schott (Astro co-founder).
Not primarily a multi-agent orchestration framework in the LangGraph / CrewAI sense — it is a framework for building one Claude-Code-style agent per invocation, with subagent (task) and role (role) primitives layered on top.
Unit of composition is an “agent” (a TypeScript file under .flue/agents/) that gets a sandbox, a session, tools, and skills — same mental model as Claude Code itself. Skills + AGENTS.md are first-class. Most agent “logic” lives in Markdown (skills, role definitions, context files); the runtime auto-discovers them from a workspace dir.
The sandbox is the point. Default is a virtual sandbox powered by Vercel Labs’ just-bash — bash-like execution without a real container. Pluggable to local (host FS), Daytona (real Linux container), Cloudflare R2-mounted, etc.
Multi-agent emerges via task() and multiple init() calls sharing a sandbox — not via an explicit graph. Closer to Claude Code’s subagent pattern than to LangGraph’s state machines.
Maturity flags are honest: README explicitly says “Experimental — APIs may change.” v0.0.x tags. ~2,683 stars. @flue/connectors already deprecated in favor of flue add codegen — churn evidence. Zero unit tests. No production case studies, no benchmarks, no real-world cost data. Single org / single primary maintainer — bus factor risk.
For a deeper Flue dive, see flue-agent-harness-framework.
Five Spike Variants, Same Workflow
To get past feature matrices into actual code shape, the same workflow (cd6 code-review crew: Architect → Builder → Reviewer, coordinated by Engineering Lead) was implemented across five spike variants. Same shared personas. Same fixture (BankAccount module with planted bugs: SQLi, mutable default, non-atomic transfer). Code-only — no live runs.
Headline numbers
| Metric | CMA inline | CMA FS | OpenAI as-tools | OpenAI handoff | LangGraph shallow | LangGraph deep |
|---|---|---|---|---|---|---|
| Code lines (no comments/blanks) | 140 | 168 | 100 | 148 | 87 | 247 |
| Primitives used | 5 | 5 + bash/write | 3 | 4 | 3 | 7 |
| Async required? | No | No | Yes | Yes | No | No |
| Multi-agent declaration | Server (multiagent) | Server | Client (as_tool) | Client (handoffs=[…] + handoff()) | Client (@tool dispatch) | Graph code (Command(goto=…)) |
| State channel | Inline messages | Container FS | Run context | History auto-passed (filterable) | Implicit messages | Typed State + checkpointer |
| Routing decided by | LLM coordinator | LLM coordinator | LLM coordinator | LLM hands off | LLM coordinator | Graph code (deterministic) |
| HITL / interrupt | user.interrupt | user.interrupt | manual | manual | manual | interrupt() + Command(resume=…) |
| Persistence | Session is durable | Session is durable | None (Sessions opt-in) | None | None | Checkpointer (InMem here, Postgres in prod) |
| Default observability | SSE event stream | SSE event stream | Default-on tracing → OpenAI | Default-on tracing → OpenAI | None (LangSmith opt-in) | None (LangSmith opt-in) |
What the LOC ratios actually mean
| Comparison | Δ LOC | What you buy |
|---|---|---|
| CMA inline → CMA FS | +28 (~20%) | FS-as-state-channel cost lever (projected ~5–10× fewer prompt tokens) |
| OpenAI as-tools → handoff | +48 (~50%) | Control-transfer chain + typed Pydantic feedback between agents |
| LangGraph shallow → deep | +160 (~3×) | Graph-encoded routing + typed state + checkpoints + HITL interrupt + bounded revision loop |
The ratios tell the story.
CMA’s distinctive value (FS) is cheap to add. OpenAI’s two patterns are roughly equal weight. LangGraph’s value layer is expensive in LOC but qualitatively different — it’s the only spike where routing is deterministic and the workflow survives a process restart.
Programming-model shape (side-by-side)
CMA — declarative server-side coordinator
coordinator = client.beta.agents.create(
name="Engineering Lead",
model="claude-opus-4-7",
system=COORDINATOR_PROMPT,
tools=[{"type": "agent_toolset_20260401"}],
multiagent={
"type": "coordinator",
"agents": [
{"type": "agent", "id": architect.id},
{"type": "agent", "id": builder.id},
{"type": "agent", "id": reviewer.id},
],
},
)
multiagent is a server-side object. Anthropic enforces depth=1, the 20-agent roster cap, and the 25-thread concurrency cap. Nothing on the client side encodes those constraints.
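Nothing stops you from mirroring them client-side — a hypothetical preflight check using the documented limits:

```python
# Hypothetical guard; CMA enforces these server-side regardless
MAX_ROSTER = 20    # unique agents allowed in multiagent.agents
MAX_THREADS = 25   # concurrent threads per session

roster = [architect, builder, reviewer]
assert len({a.id for a in roster}) <= MAX_ROSTER, "CMA roster cap exceeded"
```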
OpenAI Agents SDK — agents-as-tools
from agents import Agent  # openai-agents package

# architect, builder, reviewer are plain Agent instances defined earlier in the spike
coordinator = Agent(
    name="engineering_lead",
    instructions=COORDINATOR_PROMPT,
    tools=[
        architect.as_tool(tool_name="ask_architect", tool_description=...),
        builder.as_tool(tool_name="ask_builder", tool_description=...),
        reviewer.as_tool(tool_name="ask_reviewer", tool_description=...),
    ],
)
as_tool reuses the regular tools mechanism. No special construct. The coordinator decides at runtime which tool to call when. No depth limit (tools can wrap tools).
LangGraph — single dispatch tool (shallow)
from langchain_core.tools import tool
# create_agent: the prebuilt agent constructor; import path varies by
# LangChain/LangGraph version (langchain.agents in the v1 line)

SUBAGENTS = {"architect": ..., "builder": ..., "reviewer": ...}  # prebuilt subagent graphs


@tool
def task(agent_name: str, description: str) -> str:
    """Launch an ephemeral specialist..."""
    return SUBAGENTS[agent_name].invoke(...)["messages"][-1].content


coordinator = create_agent(model=MODEL, tools=[task], prompt=COORDINATOR_PROMPT)
Single tool name, registry lookup. Idiomatic LangGraph. The full StateGraph isn’t needed for a workflow this linear; it earns its place when you need typed state, checkpoints, deterministic routing, or HITL — and it costs 3× the LOC.
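The HITL piece of that deep shape, sketched (hypothetical gate node; interrupt requires a checkpointer to be attached):

```python
from typing import TypedDict

from langgraph.types import Command, interrupt


class ReviewState(TypedDict):
    draft: str
    verdict: str


def approval_gate(state: ReviewState) -> dict:
    # Pauses the run at a checkpoint; the payload surfaces to the caller
    decision = interrupt({"question": "Ship it?", "draft": state["draft"]})
    return {"verdict": decision}


# Later, from any process holding the thread_id:
# graph.invoke(Command(resume="approve"), config)
```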
The Surprising Findings
CMA’s SSE stream is genuinely best-in-class for multi-agent
The event stream emits agent.thread_message_sent and agent.thread_message_received between coordinator and child without any client-side instrumentation. You see the handoff trail live.
OpenAI gets equivalent info from result.new_items only after the run completes. LangGraph requires you to slice through messages[*].tool_calls post-hoc.
For debugging multi-agent decisions in production, CMA’s stream model is a real advantage. Held up under spike-shape scrutiny.
OpenAI’s default-on tracing is a real compliance footgun
Runner.run() traces by default and posts to the OpenAI dashboard. Easy to miss in a quickstart, hard to retrofit in a regulated environment.
Disable on first run if your org has any data-residency policy. Non-negotiable for healthcare / finance / EU-data workloads.
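The three documented switches, in code (the env var must be set before the process starts):

```python
from agents import Agent, RunConfig, Runner, set_tracing_disabled

set_tracing_disabled(True)  # process-wide kill switch
# or in the shell, before launch:
#   export OPENAI_AGENTS_DISABLE_TRACING=1

agent = Agent(name="assistant", instructions="Be brief.")
result = Runner.run_sync(
    agent, "hello",
    run_config=RunConfig(tracing_disabled=True),  # per-run override
)
```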
LangGraph’s deep value layer is the only deterministic-routing spike
Every other spike — including CMA — relies on the LLM coordinator to follow the workflow. The LangGraph StateGraph is the only one where the workflow is encoded in code, not in a prompt the LLM might ignore. For high-stakes pipelines, this matters.
CMA’s shared filesystem is the version that earns the fee
Inline-CMA (passing content via messages) looks like an expensive variant of OpenAI as-tools — same coordinator-stitches shape, but with Anthropic-infra lock-in and a session-hour meter ticking.
FS-CMA (passing artifact pointers, agents reading/writing /workspace/*) is qualitatively different — and the only one that scales gracefully when artifacts are large. Per-child prompt drops from ~1.5–2.5KB (inline target/design) to ~150–300 chars (pointer only). Across 3 children, projected ~5–10× fewer prompt tokens for context-passing. Output tokens unchanged. Live numbers needed to confirm.
If you wire up the inline variant, you’re paying CMA’s premium for a workflow the OSS frameworks run at compute cost alone.
Pattern convergence (and the one that breaks it)
For coordinator-stitches workflows, OpenAI’s as_tool() and LangGraph’s tool-dispatch idiom land in the same shape (~100 LOC).
For handoff-style workflows, OpenAI’s handoffs= and LangGraph’s Command(goto=…) land in the same conceptual shape — though LangGraph trades extra LOC for stronger guarantees (typed state, deterministic routing, checkpoints).
The shape that breaks convergence is LangGraph deep. Nothing else in the comparator gives you graph-encoded routing + checkpoints + interrupt + bounded revision loop in one primitive set.
Cost Shape
No live numbers — projections only. Live runs are the next budget-justified step (~$2–10 across all 5 variants depending on token mix).
| Framework | Variable cost | Fixed cost | Cost shape |
|---|---|---|---|
| CMA | Tokens × 4 agents | $0.08/session-hour while any thread runs | Wall-clock matters. Slow tools = meter ticks. |
| OpenAI Agents SDK | Tokens × N tool calls | None (lib free; tracing free at low volume) | Pure token cost; coordinator overhead = 1 LLM turn per round-trip |
| LangGraph | Tokens × N tool calls | None (lib free); LangSmith paid tier above free quota | Same shape as OpenAI; messages list grows with state |
For a 3-step workflow that completes in under 2 minutes, OpenAI/LangGraph are likely 3–10× cheaper than CMA. For a long-horizon workflow (hours, with sandboxed code execution), CMA’s bundled compute may flip the equation.
This is the cost crossover question every team needs to model on its own workload before committing.
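A back-of-envelope model worth adapting (all rates are hypothetical placeholders except the $0.08/session-hour from the CMA docs):

```python
def cma_cost(mtok: float, wall_hours: float, usd_per_mtok: float) -> float:
    # Tokens plus the session-hour meter, which ticks while any thread runs
    return mtok * usd_per_mtok + wall_hours * 0.08


def oss_cost(mtok: float, wall_hours: float, usd_per_mtok: float,
             infra_per_hour: float) -> float:
    # Tokens plus whatever your own sandbox/orchestration compute costs
    return mtok * usd_per_mtok + wall_hours * infra_per_hour


# 3-step review: ~50k tokens, ~2 minutes of wall clock
print(cma_cost(0.05, 2 / 60, 15.0))        # meter adds fractions of a cent
print(oss_cost(0.05, 2 / 60, 15.0, 0.50))  # hypothetical $0.50/hr infra

# Long-horizon run: ~2M tokens over 6 sandboxed hours — now the meter and
# your own compute line item are both real, and the crossover depends on
# which is cheaper for your workload.
print(cma_cost(2.0, 6, 15.0))
print(oss_cost(2.0, 6, 15.0, 0.50))
```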
Decision Guidance
Not the answer for every team — directional map. Verify against real workload before committing.
| If your situation is… | Look at first |
|---|---|
| “We need durable, branching, human-in-the-loop workflows” | LangGraph (deep). Confirmed at 247-LOC floor. The shallow create_agent shape (87 LOC) competes with OpenAI on simple workflows but doesn’t differentiate. Reach for the deep shape only if you need ≥2 of: graph-encoded routing, durability, HITL, bounded loops. Otherwise the LOC tax isn’t worth it. |
| “We’re a Claude shop, want fastest path to running agents, OK with vendor lock-in” | Anthropic Managed Agents — but only if your workflow has large artifacts (code, docs, datasets) where the shared-FS state channel earns the session-hour fee. Inline-only CMA = paying premium for what OSS does compute-only. |
| “We’re an OpenAI shop, conversational handoffs (support → billing → refund)” | OpenAI Agents SDK (handoff). Clean primitive. Typed Pydantic feedback gives structured inter-agent contracts without ceremony. Watch default-on tracing. |
| “We’re an OpenAI shop, pipeline with stitching” | OpenAI Agents SDK (as-tools). Lowest LOC for coordinator-stitches workflows after raw LangGraph shallow. |
| “We need a working demo for stakeholders by Friday” | CrewAI (caveat: rewrite for production) — or OpenAI as-tools at 100 LOC, competitive. |
| “We already author Claude Code skills, want them deployable” | Flue (track + spike, don’t standardize yet) |
| “We can’t pick — too many requirements” | LangGraph as default; revisit in 3 months |
When the LOC tax buys something real
The three “expensive” spike variants each buy something specific. Use this table to decide whether you’re paying for something you actually need:
| Variant | LOC tax over baseline | Buys |
|---|---|---|
| CMA FS over inline | +28 (~20%) | Cost lever — prompt tokens drop ~5–10× when artifacts are large |
| OpenAI handoff over as-tools | +48 (~50%) | Control transfer + typed Pydantic feedback contracts |
| LangGraph deep over shallow | +160 (~3×) | Deterministic routing + crash durability + HITL + bounded revision loops |
If you don’t need what’s in the right column, don’t pay the cost on the left.
Cross-Framework Gold Seams
Topics worth deep-diving regardless of which framework “wins”:
- Handoff semantics. Modeled as tool calls in CMA, OpenAI Agents SDK, and Flue. Modeled as Command + state update in LangGraph. Modeled as Process.hierarchical + manager LLM in CrewAI. The bug surface is the same in all five — when does control return. Worth a dedicated mental-model post for the team.
- State durability. LangGraph’s checkpoint + time-travel is the most rigorous. CMA’s shared-FS is the most pragmatic. CrewAI Flow has it but underdocumented. OpenAI Sessions has 9 backends but the durability semantics across handoffs are messier. The framework’s durability story IS its production story.
- Observability cost. Free OSS frameworks all lean on a paid SaaS observability layer (LangSmith, AgentOps, Langfuse-Cloud, Datadog) at scale. The honest TCO comparison includes that layer.
- Pricing model — per-session-hour vs per-compute. CMA’s $0.08/session-hour is novel and unforgiving — 5 parallel children = 5× tokens AND meter still ticks while any thread runs. OSS = your AWS/GCP bill, predictable but more ops work. Cost crossover point worth modeling for any real workload.
- Sandbox / code-exec. Three different stories — CMA bundled, OpenAI Sandbox Agents (new Apr 2026, fixing real Q1 production failures), Flue pluggable, LangGraph/CrewAI = roll your own. Sandbox is becoming table-stakes; track this.
Pitfalls That Apply Broadly
| Pitfall | Frameworks affected |
|---|---|
| Vendor benchmarks unreproduced (“5.76× faster”) | CrewAI especially; all to some degree |
| Logo-deck production claims, no architecture detail | All five (LangGraph slightly better) |
| Pre-1.0 / beta API churn | OpenAI Agents SDK, CMA (beta), Flue (v0.0.x) |
| Multi-agent depth limited or undocumented | CMA (depth=1 hard), CrewAI (manager-LLM tax), Flue (no graph primitives) |
| Default observability ships data to vendor | OpenAI Agents SDK (default-on tracing → OpenAI); LangGraph (best-DX pull toward LangSmith) |
| State serialization bloat | LangGraph (open issue: 85% storage / 37.8% token overhead) |
| Single-maintainer / org bus factor | Flue (Schott + small team) |
| Token bloat from persona prompts | CrewAI |
Open Questions (Live Runs Required)
Code-only mode is exhausted. The remaining unknowns require execution:
- Live cost run — all 5 variants, same prompt, capture token counts + wall-clock + dollars. Without this, every cost claim is projection.
- CMA shared-FS prompt-token saving — confirm projected ~5–10× by inspecting agent.thread_message_sent payload sizes during a live run.
- OpenAI tracing-disabled verification — confirm the env var actually disables data flow to the OpenAI dashboard (some practitioner posts allege partial leakage). Requires network inspection during a live run.
- Coordinator LLM-rewrites-children footgun — across 5 spikes, does any LLM coordinator observably violate the “thin orchestrator” rule despite the persona forbidding it? Live runs needed.
- Postgres checkpointer at scale — open issue: 85% storage bloat, 37.8% token overhead. Real cost only visible at production volume.
- Multi-framework composition — can LangGraph orchestrate Anthropic Managed Agents as nodes? Both want to own the loop — pick one, but for what reasons?
- Compliance & data residency mapping — CMA = Anthropic infra. OpenAI tracing = OpenAI dashboard. Self-hosted LangGraph or CrewAI = your residency. Map this against the user data class your team handles.
- Skills/Agents.md interop — Anthropic Skills format, Claude Code skills, Flue skills — same shape, different runtimes. Is there a portable skill spec emerging?
What This Research Is and Is Not
Is: Quick-scan landscape map. T-shape broad layer + 5 code-only spike variants (890 lines). Tier-1 docs read for each framework. Honest pitfalls surfaced from issue trackers and critical blog posts. Decision guidance grounded in spike code shape, not only feature lists.
Is not: A live run. No working examples were executed. No benchmarks were measured. No cost model was computed. Production claims were not independently verified.
Next phase if team commits to a framework: 1-day code spike with real workload, real cost model, real failure-mode test — per the Phase 4 template in the research skill. Estimated cost across all 5 variants for live runs: ~$2–10 depending on token volume and Claude vs GPT mix.
Sources & Provenance
Verifiable sources. Dates matter. Credibility assessed.