Multi-Agent Frameworks: Five Bets, Three Categories, One Decision
Anthropic Managed Agents, LangGraph, CrewAI, OpenAI Agents SDK, and Flue solve the same surface problem with five very different bets. Three categories: hosted runtime, library/orchestrator, harness primitive. The same workflow, spiked across all five (cd6 code review, 890 LOC of working spike code), shows the LOC tax for each framework's distinctive value layer — and where each one actually earns it. Side-by-side matrix, programming-model shapes, cost crossover analysis, and the question your team is actually answering.
TL;DR
Five frameworks, three categories. The bet each one is making is more important than its feature list:
| Category | Members | The bet |
|---|---|---|
| Hosted runtime | Anthropic Managed Agents | “Bring a YAML, leave with a running agent.” Loop, sandbox, state, billing — all server-side. |
| Library / orchestrator | LangGraph, CrewAI, OpenAI Agents SDK | “Pip install. You own the runtime.” Different opinions on programming model. |
| Harness primitive | Flue | “Claude Code as a library.” Sandbox + skills + agents-as-files. Multi-agent emerges, not declared. |
The apples-to-apples comparator for Anthropic Managed Agents is not bare OpenAI Agents SDK — it’s AgentKit + Agents SDK (OpenAI’s hosted runtime). Get this framing right or every build/buy comparison drifts.
For “learn the space,” pick two anchors to deeply understand — LangGraph (graph-based, mature, OSS) and Anthropic Managed Agents (hosted, opinionated, vendor-locked). The other three become understandable as variations on these two axes.
The five frameworks were spiked end-to-end on the same workflow (cd6 code review crew, 890 lines of working spike code). The LOC ratios tell a sharper story than the feature matrices: CMA’s distinctive lever (shared filesystem) is cheap to add (+28 LOC). OpenAI’s two patterns are roughly equal weight. LangGraph’s value layer (StateGraph + Command + checkpointer + interrupt) costs ~3× the shallow shape — but it’s the only spike where routing is deterministic and the workflow survives a process restart.
Side-by-Side Matrix
| Dimension | Anthropic Managed Agents | LangGraph | CrewAI | OpenAI Agents SDK | Flue |
|---|---|---|---|---|---|
| License | Proprietary (managed service) | MIT (lib) / commercial (Platform) | MIT (lib) / commercial (AMP) | MIT | Apache-2.0 |
| Status | Public beta; multi-agent in research preview | v1.1.10 stable, 1.2 alpha | v1.14.4 stable | v0.16.0 (pre-1.0) | v0.0.x experimental |
| Age / maturity | New (Apr 2026) | 1.0 GA late 2025; mature | Mature, ~50k stars | Production-ready successor to Swarm | 3 months old |
| Primary pattern | Coordinator + roster (depth=1) | StateGraph + Command + interrupt | Crew + Process + Flow | Agents + Handoffs + Guardrails | Agent-as-file, subagent via task() |
| Multi-agent shape | Supervisor-only | Supervisor / swarm / hierarchical / router | Sequential or hierarchical | Handoff chain | Emerges from task() + shared sandbox |
| State model | Per-thread isolation + shared FS in container | Typed State + checkpoints + BaseStore | Consolidated Memory (LanceDB) + Knowledge | Sessions (9 backends) | In-memory or Durable Objects |
| Time travel / HITL | user.interrupt; no checkpoint rewind documented | Yes — checkpoints + interrupt/Command(resume=) | Limited | Approval/interrupt mechanisms | Not first-class |
| Observability | First-class event stream (SSE) | LangSmith (paid) or OTel/Langfuse | Plugin-only (Langfuse, AgentOps, MLflow…) | Default-on tracing → OpenAI dashboard; 25+ processors | Not described |
| Provider lock-in | Claude only | None (BYO LLM) | None (BYO LLM) | OpenAI-leaning; LiteLLM possible w/ caveats | None (multi-model strings) |
| Hosting | Anthropic-managed only | BYO (lib) or self-host Server | BYO (lib) or AMP | BYO; AgentKit for hosted | BYO (Node / CF / CI) |
| SDK languages | Python, TS, Java, Go, C#, Ruby, PHP + CLI | Python, TS | Python | Python, TS | TS only |
| Pricing | $0.08/session-hour + token rates | Lib free; Platform commercial | Lib free; AMP commercial | Lib free; OpenAI tokens; AgentKit separate | Free (BYO compute) |
| Sandbox / code-exec | Built-in container | Not provided (BYO) | Not provided (BYO) | Sandbox Agents (Apr 2026) | Pluggable (just-bash / Daytona / CF) |
| Production users (named) | Anthropic-internal early access | LinkedIn, Uber, Klarna, Replit, Elastic, AppFolio | PwC, IBM, NVIDIA, PepsiCo, J&J, DocuSign, US DoD (vendor-stated) | Klarna, Canva, Clay, OpenAI internal | None public |
| Independent post-mortems | None yet | Few; mostly LangChain-published | Few; mostly vendor-stated | Few; mostly OpenAI-stated | None |
Stop Asking “Which Is Best.” Ask Which Axis Matters.
There is no winner across all teams. There are five real trade-off axes. Decide which one is load-bearing for your situation, then the framework choice mostly falls out.
1. Control plane — yours or theirs?
| Yours (lib) | Theirs (managed) |
|---|---|
| LangGraph, CrewAI, OpenAI Agents SDK, Flue | Anthropic Managed Agents, OpenAI AgentKit, CrewAI AMP, LangGraph Platform |
Theirs: faster to first running agent. Pay per session-hour. Lock-in surface = the API shape + the vendor’s infra + (sometimes) the model.
Yours: you own retries, scaling, observability, persistence. Pay your AWS bill. No lock-in beyond the OSS API. Slower to start, cheaper at scale, mandatory if compliance forbids data leaving your environment.
2. Programming model — graph, persona, or handoff?
- Graph (LangGraph): explicit nodes/edges/state. Steepest curve. Most expressive. Best for stateful, durable, branching workflows.
- Persona (CrewAI): role/goal/backstory + Process enum. Fastest to a 50-line demo. Token-heavy at scale; debugging is “what prompt actually got sent?”.
- Handoff (OpenAI Agents SDK, Anthropic CMA, Flue): agent calls agent as a tool. Closest to how engineers think. Semantic surprises (“why did agent A keep talking?”) are the failure mode.
These aren’t equivalent — they encode different bets about what the developer should think about.
3. State strategy — message thread or filesystem?
- Message thread (LangGraph state, OpenAI Sessions, CrewAI Memory): pass state as data through messages or typed state object. Token cost grows with state.
- Shared filesystem (Anthropic CMA, Flue): agents read/write files in a shared sandbox. Cheaper for large artifacts. New problem: race conditions, no transactions.
CMA’s shared-FS-as-side-channel is genuinely novel and worth understanding even if you don’t adopt CMA. It changes the cost model.
4. Observability — built-in or BYO?
| Strongest built-in | Plugin-first | Hosted-only |
|---|---|---|
| OpenAI Agents SDK (default-on tracing), Anthropic CMA (SSE event stream) | CrewAI (Langfuse/MLflow/etc.), LangGraph (LangSmith or OTel) | (none of the five) |
Compliance flag: OpenAI Agents SDK ships with tracing on by default, sending data to OpenAI’s dashboard. Disable before first production run if your data policy requires it. Non-negotiable for healthcare/finance/EU-data workloads.
5. Vendor / model coupling
- Single-vendor: CMA (Claude only), AgentKit (OpenAI-leaning).
- Multi-vendor BYO: LangGraph, CrewAI, Flue (model strings).
- Multi-vendor with caveats: OpenAI Agents SDK (LiteLLM works, but hosted tools — web_search, code_interpreter — assume OpenAI models).
If “no platform lock-in” is org policy, that knocks out CMA entirely and complicates Agents SDK adoption.
What Each Framework Actually Is
Five paragraphs. Read the one your team is closest to.
Anthropic Managed Agents — the hosted runtime
Claude Managed Agents (CMA) is Anthropic’s hosted agent harness. It bundles model + tools + sandbox container + event stream + persistence behind a small set of REST endpoints. Multi-agent is one feature within CMA, not a standalone framework. From the docs: “Pre-built, configurable agent harness that runs in managed infrastructure. Best for long-running tasks and asynchronous work.”
Four core concepts: Agent (versioned config), Environment (container template), Session (a running instance, stateful, persistent FS), Events (SSE-based stream).
Multi-agent extends this: a coordinator agent declares a roster of other agents it may delegate to. Each delegated agent gets its own session thread but shares the same container/filesystem.
# client: an anthropic.Anthropic() instance; COORDINATOR_PROMPT and the three
# child agents (architect/builder/reviewer) are created earlier in the spike.
coordinator = client.beta.agents.create(
    name="Engineering Lead",
    model="claude-opus-4-7",
    system=COORDINATOR_PROMPT,
    tools=[{"type": "agent_toolset_20260401"}],
    multiagent={
        "type": "coordinator",
        "agents": [
            {"type": "agent", "id": architect.id},
            {"type": "agent", "id": builder.id},
            {"type": "agent", "id": reviewer.id},
        ],
    },
)
Hard constraints from the docs: “The coordinator can only delegate to one level of agents; depth > 1 is ignored.” Max 20 unique agents in multiagent.agents. Concurrent threads per session: 25. Models: Claude 4.5 and later only.
Pricing: standard Claude tokens plus $0.08 per session-hour, billed to the millisecond, charged while any thread is running. Idle = free. Multi-agent cost note: a coordinator firing 5 parallel children = 5× token spend, and the session-hour clock ticks while any thread runs.
The distinctive primitive is the shared filesystem as state channel. Most OSS frameworks pass state via explicit message-passing (LangGraph state object, CrewAI task outputs). CMA leans on the FS as side-channel state. Agents pass artifact pointers (/workspace/design.md) instead of stuffing 5KB into messages. This is the version of CMA that earns the session-hour fee.
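The difference in code shape is small but the cost shape isn't — a sketch with hypothetical prompts, using the spike's /workspace convention:

```python
# Inline state channel: the artifact itself rides in every child prompt,
# so prompt tokens scale with artifact size x children x turns.
design = open("design.md").read()                            # ~2.5 KB
inline_prompt = f"Review this design:\n\n{design}"

# FS state channel: only a pointer rides in the prompt; the child reads
# the artifact inside the shared sandbox with its own tools.
pointer_prompt = "Review the design at /workspace/design.md"  # ~45 chars
```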
LangGraph — the graph machine
LangGraph is a library for building stateful, graph-based agent workflows from LangChain. License MIT. ~31,400 GitHub stars. Most stable line: 1.1.x (latest 1.1.10, Apr 2026); 1.2.0 in alpha. Hit 1.0 GA in late 2025, so past the early-churn phase but still iterating.
LangGraph sits at the orchestration layer. It is not a model SDK. You bring your own LLM client (Anthropic, OpenAI, local). It owns the state machine, persistence, streaming, human-in-the-loop, and (optionally) deployment.
The core abstraction is StateGraph — a directed graph where nodes are functions, edges define transitions, and a typed State (TypedDict, Pydantic, or dataclass) flows through them with reducer-style updates. Key primitives: add_node, add_edge, add_conditional_edges, Command (state update + node transition in one — how handoffs are implemented), interrupt (HITL), Send (fan-out), subgraphs.
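A minimal sketch of that shape (a hypothetical two-node review graph with node logic stubbed; not the spike code itself):

```python
from typing import Literal, TypedDict

from langgraph.graph import END, START, StateGraph
from langgraph.types import Command


class ReviewState(TypedDict):
    draft: str
    verdict: str


def builder(state: ReviewState) -> Command[Literal["reviewer"]]:
    # Command = state update + node transition in one return value
    return Command(goto="reviewer", update={"draft": "stub implementation"})


def reviewer(state: ReviewState) -> dict:
    return {"verdict": "approve"}


g = StateGraph(ReviewState)
g.add_node("builder", builder)
g.add_node("reviewer", reviewer)
g.add_edge(START, "builder")
g.add_edge("reviewer", END)
graph = g.compile()
```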
Two state layers, intentionally separated:
- Short-term (thread): in-graph State, persisted as checkpoints at each super-step. Time travel via get_state_history and update_state. Exact resumption after crash.
- Long-term (cross-thread): BaseStore interface. Namespaced K/V with optional vector search.
Persistence backends: in-memory (dev), SQLite (single-process), Postgres (production default), Redis (community).
Time travel = real. You can rewind to any prior checkpoint, edit state, fork. This is the killer feature vs CrewAI/Swarm-style frameworks.
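In code, reusing the g builder from the sketch above (in-memory checkpointer for illustration; Postgres is the production default):

```python
from langgraph.checkpoint.memory import MemorySaver

graph = g.compile(checkpointer=MemorySaver())
config = {"configurable": {"thread_id": "review-42"}}
graph.invoke({"draft": "", "verdict": ""}, config)

# One StateSnapshot per super-step, newest first
history = list(graph.get_state_history(config))

# Rewind to the oldest checkpoint, edit state there, fork the thread
oldest = history[-1]
graph.update_state(oldest.config, {"draft": "rewritten"})
```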
Production evidence (LangChain’s own list — case-study quality, not third-party post-mortems): LinkedIn (AI recruiter, hierarchical), Uber (code migration + tests), Klarna (~85M users, claimed 80% resolution-time reduction), Replit (multi-agent coding copilot), Elastic (security threat detection), AppFolio (property-manager copilot). Replit and Uber have spoken at conferences, lending weight there.
Honest pitfalls from the issue tracker: checkpoint serialization bloat (open issue: 85% storage / 37.8% token overhead, no opt-out path), Postgres SSL errors recurring across versions, run-cancellation drops streamed-but-uncheckpointed state, prebuilt-package version drift, vendor lock-in toward LangSmith for the best DX.
CrewAI — the persona crew
Python multi-agent framework, completely independent of LangChain (built from scratch). License MIT. Latest 1.14.4 (Apr 2026). ~50.8k GitHub stars. Two product surfaces: open-source library and CrewAI Enterprise / AMP Suite (commercial control plane).
Marketing claim: “5.76x faster than LangGraph in certain cases.” Treat with skepticism — vendor benchmark, no independent reproduction surfaced.
Two layered abstractions: Flow (event-driven, stateful workflow backbone) and Crew (team of role-playing agents collaborating on a delegated task). Each Agent has role, goal, backstory, tools, LLM. Tasks have description + expected output + agent assignment + optional context. Process is a sequential or hierarchical enum.
In hierarchical mode, a manager LLM (or manager_agent) dynamically allocates tasks to agents based on capabilities, reviews outputs, validates completion. You pass manager_llm="gpt-4o" or a custom manager agent.
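A minimal sketch of the hierarchical shape (hypothetical roles and task, not the spike's personas):

```python
from crewai import Agent, Crew, Process, Task

researcher = Agent(
    role="Senior Researcher",
    goal="Find prior art on agent frameworks",
    backstory="10 years in developer tooling.",  # persona text, resent per turn
)
writer = Agent(
    role="Tech Writer",
    goal="Turn findings into a one-page brief",
    backstory="Edits engineering RFCs.",
)

brief = Task(
    description="Summarize multi-agent framework trade-offs.",
    expected_output="One page of Markdown.",  # no agent= — the manager assigns
)

crew = Crew(
    agents=[researcher, writer],
    tasks=[brief],
    process=Process.hierarchical,
    manager_llm="gpt-4o",  # allocates tasks, reviews outputs, validates completion
)
result = crew.kickoff()
```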
Pattern is persona-heavy — agents prompted as personas (“You are a Senior Researcher with 10 years…”). Critics argue this inflates token usage and adds variance vs cleaner functional decomposition.
Memory is consolidated under one Memory class (older docs reference short/long/entity types separately — terminology shifted). Default storage: LanceDB under ./.crewai/memory. Default embeddings: OpenAI.
Observability is deferred to ecosystem. Production teams pick a vendor (Langfuse most common in OSS) or pay for AMP. The open-core pattern.
Vendor-stated production users: PwC, IBM, NVIDIA, PepsiCo, J&J, DocuSign, US DoD (~150 enterprise customers within first 6 months of launch, ~2 billion agentic executions in trailing 12 months). Treat as logo-deck signal, not architectural validation — independent post-mortems naming companies are rare.
Honest pitfalls: token bloat from role prompts (persona system prompts repeated per turn balloon costs in hierarchical mode), manager-LLM latency tax (1-2 extra LLM calls per task), deployment footprint (crewai[tools] venv approaches 1 GB), version churn (Memory API consolidation, Flow vs Crew best-practice shift).
OpenAI Agents SDK — the handoff orchestra
Open-source Python (and TypeScript) framework for multi-agent workflows. Production-ready successor to Swarm (which OpenAI explicitly retired as “educational”). Provider-agnostic via LiteLLM and Any-LLM adapters. MIT-licensed. Latest v0.16.0 (May 2026). ~26k stars. Pre-1.0 (still 0.x), but OpenAI calls it “production-ready.”
Three primitives carried forward from Swarm, expanded:
- Agents — LLM + instructions + tools + guardrails + handoffs.
- Handoffs — agents delegate to other agents (originally Swarm’s signature pattern). Modeled as a special tool call; see the sketch after this list.
- Guardrails — input/output validation that can short-circuit a run.
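A minimal sketch of the handoff shape (hypothetical triage/billing agents; assumes the pip-installable openai-agents package):

```python
from agents import Agent, Runner

billing = Agent(name="billing", instructions="Resolve billing questions.")
refunds = Agent(name="refunds", instructions="Process refund requests.")

triage = Agent(
    name="triage",
    instructions="Route the user to the right specialist.",
    handoffs=[billing, refunds],  # exposed to the LLM as special tool calls
)

# Control transfers to whichever specialist the triage agent hands off to
result = Runner.run_sync(triage, "I was double-charged last month.")
print(result.final_output)
```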
Newer additions: Sandbox Agents (container-based runtime for long-horizon / code-execution tasks; Apr 2026 update — OpenAI’s response to “uncontrolled tool access” failures observed in Q1 2026 production), MCP servers, hosted tools (web search, file search, code interpreter, computer use), Realtime Agents (voice), human-in-the-loop approval/interrupt.
Design principle from the docs: “Python-first… use built-in language features to orchestrate and chain agents, rather than needing to learn new abstractions.” No DSL, no graph compiler — contrasts sharply with LangGraph.
State via Sessions — built-in conversation memory across Runner.run calls. Replaces manual .to_input_list() plumbing. Backends shipped or via extras: SQLite (dev), Redis (low-latency distributed), SQLAlchemy, MongoDB, Dapr, OpenAIConversationsSession (server-managed), EncryptedSession (wraps another, adds TTL), AdvancedSQLiteSession (branching/analytics) — 9 backends total.
Critical caveat: “Sessions cannot be combined with conversation_id, previous_response_id, or auto_previous_response_id in the same run.” Pick one source of conversation truth.
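In code, Sessions look like this (a sketch with a hypothetical session id; SQLite backend for dev):

```python
from agents import Agent, Runner, SQLiteSession

agent = Agent(name="assistant", instructions="Be concise.")
session = SQLiteSession("conversation-42", "conversations.db")

# Both runs share one stored thread; no manual .to_input_list() plumbing
Runner.run_sync(agent, "My name is Ada.", session=session)
result = Runner.run_sync(agent, "What's my name?", session=session)
```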
Tracing on by default, sinking to OpenAI’s dashboard via BatchTraceProcessor. Disable: env OPENAI_AGENTS_DISABLE_TRACING=1 or set_tracing_disabled(True) or per-run RunConfig.tracing_disabled. 25+ third-party processor integrations: Logfire, AgentOps, Braintrust, LangSmith, Langfuse, Langtrace, Arize-Phoenix, MLflow, W&B, Datadog, Keywords AI, Scorecard. Strongest observability story among OSS contenders out of the box — and the strongest compliance footgun.
Apples-to-apples vs Anthropic Managed Agents is AgentKit + Agents SDK, not Agents SDK alone. AgentKit (separate OpenAI product, launched Oct 2025) wraps the SDK with a hosted ChatKit UI, agent builder, and managed runtime.
Flue — the harness primitive
Flue (withastro/flue) is a TypeScript runtime for headless, programmable agents that feel like Claude Code but with no TUI, no human-in-the-loop assumption, and a pluggable sandbox. Runtime-agnostic (Node.js, Cloudflare Workers, GitHub Actions / GitLab CI). Apache-2.0. Maintainer: Fred K. Schott (Astro co-founder).
Not primarily a multi-agent orchestration framework in the LangGraph / CrewAI sense — it is a framework for building one Claude-Code-style agent per invocation, with subagent (task) and role (role) primitives layered on top.
Unit of composition is an “agent” (a TypeScript file under .flue/agents/) that gets a sandbox, a session, tools, and skills — same mental model as Claude Code itself. Skills + AGENTS.md are first-class. Most agent “logic” lives in Markdown (skills, role definitions, context files); the runtime auto-discovers them from a workspace dir.
The sandbox is the point. Default is a virtual sandbox powered by Vercel Labs’ just-bash — bash-like execution without a real container. Pluggable to local (host FS), Daytona (real Linux container), Cloudflare R2-mounted, etc.
Multi-agent emerges via task() and multiple init() calls sharing a sandbox — not via an explicit graph. Closer to Claude Code’s subagent pattern than to LangGraph’s state machines.
Maturity flags are honest: README explicitly says “Experimental — APIs may change.” v0.0.x tags. ~2,683 stars. @flue/connectors already deprecated in favor of flue add codegen — churn evidence. Zero unit tests. No production case studies, no benchmarks, no real-world cost data. Single org / single primary maintainer — bus factor risk.
For a deeper Flue dive, see flue-agent-harness-framework.
Five Spike Variants, Same Workflow
To get past feature matrices into actual code shape, the same workflow (cd6 code-review crew: Architect → Builder → Reviewer, coordinated by Engineering Lead) was implemented across five spike variants. Same shared personas. Same fixture (BankAccount module with planted bugs: SQLi, mutable default, non-atomic transfer). Code-only — no live runs.
Headline numbers
| Metric | CMA inline | CMA FS | OpenAI as-tools | OpenAI handoff | LangGraph shallow | LangGraph deep |
|---|---|---|---|---|---|---|
| Code lines (no comments/blanks) | 140 | 168 | 100 | 148 | 87 | 247 |
| Primitives used | 5 | 5 + bash/write | 3 | 4 | 3 | 7 |
| Async required? | No | No | Yes | Yes | No | No |
| Multi-agent declaration | Server (multiagent) | Server | Client (as_tool) | Client (handoffs=[…] + handoff()) | Client (@tool dispatch) | Graph code (Command(goto=…)) |
| State channel | Inline messages | Container FS | Run context | History auto-passed (filterable) | Implicit messages | Typed State + checkpointer |
| Routing decided by | LLM coordinator | LLM coordinator | LLM coordinator | LLM hands off | LLM coordinator | Graph code (deterministic) |
| HITL / interrupt | user.interrupt | user.interrupt | manual | manual | manual | interrupt() + Command(resume=…) |
| Persistence | Session is durable | Session is durable | None (Sessions opt-in) | None | None | Checkpointer (InMem here, Postgres in prod) |
| Default observability | SSE event stream | SSE event stream | Default-on tracing → OpenAI | Default-on tracing → OpenAI | None (LangSmith opt-in) | None (LangSmith opt-in) |
What the LOC ratios actually mean
| Comparison | Δ LOC | What you buy |
|---|---|---|
| CMA inline → CMA FS | +28 (~20%) | FS-as-state-channel cost lever (projected ~5–10× fewer prompt tokens) |
| OpenAI as-tools → handoff | +48 (~50%) | Control-transfer chain + typed Pydantic feedback between agents |
| LangGraph shallow → deep | +160 (~3×) | Graph-encoded routing + typed state + checkpoints + HITL interrupt + bounded revision loop |
The ratios tell the story.
CMA’s distinctive value (FS) is cheap to add. OpenAI’s two patterns are roughly equal weight. LangGraph’s value layer is expensive in LOC but qualitatively different — it’s the only spike where routing is deterministic and the workflow survives a process restart.
Programming-model shape (side-by-side)
CMA — declarative server-side coordinator
coordinator = client.beta.agents.create(
name="Engineering Lead",
model="claude-opus-4-7",
system=COORDINATOR_PROMPT,
tools=[{"type": "agent_toolset_20260401"}],
multiagent={
"type": "coordinator",
"agents": [
{"type": "agent", "id": architect.id},
{"type": "agent", "id": builder.id},
{"type": "agent", "id": reviewer.id},
],
},
)
multiagent is a server-side object. Anthropic enforces depth=1, the 20-agent roster cap, and the 25-thread concurrency cap. Nothing on the client side encodes those constraints.
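Nothing stops you from mirroring them client-side — a hypothetical preflight check using the documented limits:

```python
# Hypothetical guard; CMA enforces these server-side regardless
MAX_ROSTER = 20    # unique agents allowed in multiagent.agents
MAX_THREADS = 25   # concurrent threads per session

roster = [architect, builder, reviewer]
assert len({a.id for a in roster}) <= MAX_ROSTER, "CMA roster cap exceeded"
```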
OpenAI Agents SDK — agents-as-tools
from agents import Agent  # openai-agents package

# architect, builder, reviewer are plain Agent instances defined earlier in the spike
coordinator = Agent(
    name="engineering_lead",
    instructions=COORDINATOR_PROMPT,
    tools=[
        architect.as_tool(tool_name="ask_architect", tool_description=...),
        builder.as_tool(tool_name="ask_builder", tool_description=...),
        reviewer.as_tool(tool_name="ask_reviewer", tool_description=...),
    ],
)
as_tool reuses the regular tools mechanism. No special construct. The coordinator decides at runtime which tool to call when. No depth limit (tools can wrap tools).
LangGraph — single dispatch tool (shallow)
from langchain_core.tools import tool
# create_agent: the prebuilt agent constructor; import path varies by
# LangChain/LangGraph version (langchain.agents in the v1 line)

SUBAGENTS = {"architect": ..., "builder": ..., "reviewer": ...}  # prebuilt subagent graphs


@tool
def task(agent_name: str, description: str) -> str:
    """Launch an ephemeral specialist..."""
    return SUBAGENTS[agent_name].invoke(...)["messages"][-1].content


coordinator = create_agent(model=MODEL, tools=[task], prompt=COORDINATOR_PROMPT)
Single tool name, registry lookup. Idiomatic LangGraph. The full StateGraph isn’t needed for a workflow this linear; it earns its place when you need typed state, checkpoints, deterministic routing, or HITL — and it costs 3× the LOC.
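The HITL piece of that deep shape, sketched (hypothetical gate node; interrupt requires a checkpointer to be attached):

```python
from typing import TypedDict

from langgraph.types import Command, interrupt


class ReviewState(TypedDict):
    draft: str
    verdict: str


def approval_gate(state: ReviewState) -> dict:
    # Pauses the run at a checkpoint; the payload surfaces to the caller
    decision = interrupt({"question": "Ship it?", "draft": state["draft"]})
    return {"verdict": decision}


# Later, from any process holding the thread_id:
# graph.invoke(Command(resume="approve"), config)
```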
The Surprising Findings
CMA’s SSE stream is genuinely best-in-class for multi-agent
The event stream emits agent.thread_message_sent and agent.thread_message_received between coordinator and child without any client-side instrumentation. You see the handoff trail live.
OpenAI gets equivalent info from result.new_items only after the run completes. LangGraph requires you to slice through messages[*].tool_calls post-hoc.
For debugging multi-agent decisions in production, CMA’s stream model is a real advantage. Held up under spike-shape scrutiny.
OpenAI’s default-on tracing is a real compliance footgun
Runner.run() traces by default and posts to the OpenAI dashboard. Easy to miss in a quickstart, hard to retrofit in a regulated environment.
Disable on first run if your org has any data-residency policy. Non-negotiable for healthcare / finance / EU-data workloads.
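The three documented switches, in code (the env var must be set before the process starts):

```python
from agents import Agent, RunConfig, Runner, set_tracing_disabled

set_tracing_disabled(True)  # process-wide kill switch
# or in the shell, before launch:
#   export OPENAI_AGENTS_DISABLE_TRACING=1

agent = Agent(name="assistant", instructions="Be brief.")
result = Runner.run_sync(
    agent, "hello",
    run_config=RunConfig(tracing_disabled=True),  # per-run override
)
```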
LangGraph’s deep value layer is the only deterministic-routing spike
Every other spike — including CMA — relies on the LLM coordinator to follow the workflow. The LangGraph StateGraph is the only one where the workflow is encoded in code, not in a prompt the LLM might ignore. For high-stakes pipelines, this matters.
CMA’s shared filesystem is the version that earns the fee
Inline-CMA (passing content via messages) looks like an expensive variant of OpenAI as-tools — same coordinator-stitches shape, but with Anthropic-infra lock-in and a session-hour meter ticking.
FS-CMA (passing artifact pointers, agents reading/writing /workspace/*) is qualitatively different — and the only one that scales gracefully when artifacts are large. Per-child prompt drops from ~1.5–2.5KB (inline target/design) to ~150–300 chars (pointer only). Across 3 children, projected ~5–10× fewer prompt tokens for context-passing. Output tokens unchanged. Live numbers needed to confirm.
If you wire up the inline variant, you’re paying CMA’s premium for a workflow the OSS frameworks run at compute cost alone.
Pattern convergence (and the one that breaks it)
For coordinator-stitches workflows, OpenAI’s as_tool() and LangGraph’s tool-dispatch idiom land in the same shape (~100 LOC).
For handoff-style workflows, OpenAI’s handoffs= and LangGraph’s Command(goto=…) land in the same conceptual shape — though LangGraph trades extra LOC for stronger guarantees (typed state, deterministic routing, checkpoints).
The shape that breaks convergence is LangGraph deep. Nothing else in the comparator gives you graph-encoded routing + checkpoints + interrupt + bounded revision loop in one primitive set.
Cost Shape
No live numbers — projections only. Live runs are the next budget-justified step (~$2–10 across all 5 variants depending on token mix).
| Framework | Variable cost | Fixed cost | Cost shape |
|---|---|---|---|
| CMA | Tokens × 4 agents | $0.08/session-hour while any thread runs | Wall-clock matters. Slow tools = meter ticks. |
| OpenAI Agents SDK | Tokens × N tool calls | None (lib free; tracing free at low volume) | Pure token cost; coordinator overhead = 1 LLM turn per round-trip |
| LangGraph | Tokens × N tool calls | None (lib free); LangSmith paid tier above free quota | Same shape as OpenAI; messages list grows with state |
For a 3-step workflow that completes in under 2 minutes, OpenAI/LangGraph are likely 3–10× cheaper than CMA. For a long-horizon workflow (hours, with sandboxed code execution), CMA’s bundled compute may flip the equation.
This is the cost crossover question every team needs to model on its own workload before committing.
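A back-of-envelope model worth adapting (all rates are hypothetical placeholders except the $0.08/session-hour from the CMA docs):

```python
def cma_cost(mtok: float, wall_hours: float, usd_per_mtok: float) -> float:
    # Tokens plus the session-hour meter, which ticks while any thread runs
    return mtok * usd_per_mtok + wall_hours * 0.08


def oss_cost(mtok: float, wall_hours: float, usd_per_mtok: float,
             infra_per_hour: float) -> float:
    # Tokens plus whatever your own sandbox/orchestration compute costs
    return mtok * usd_per_mtok + wall_hours * infra_per_hour


# 3-step review: ~50k tokens, ~2 minutes of wall clock
print(cma_cost(0.05, 2 / 60, 15.0))        # meter adds fractions of a cent
print(oss_cost(0.05, 2 / 60, 15.0, 0.50))  # hypothetical $0.50/hr infra

# Long-horizon run: ~2M tokens over 6 sandboxed hours — now the meter and
# your own compute line item are both real, and the crossover depends on
# which is cheaper for your workload.
print(cma_cost(2.0, 6, 15.0))
print(oss_cost(2.0, 6, 15.0, 0.50))
```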
Decision Guidance
Not the answer for every team — directional map. Verify against real workload before committing.
| If your situation is… | Look at first |
|---|---|
| “We need durable, branching, human-in-the-loop workflows” | LangGraph (deep). Confirmed at 247-LOC floor. The shallow create_agent shape (87 LOC) competes with OpenAI on simple workflows but doesn’t differentiate. Reach for the deep shape only if you need ≥2 of: graph-encoded routing, durability, HITL, bounded loops. Otherwise the LOC tax isn’t worth it. |
| “We’re a Claude shop, want fastest path to running agents, OK with vendor lock-in” | Anthropic Managed Agents — but only if your workflow has large artifacts (code, docs, datasets) where the shared-FS state channel earns the session-hour fee. Inline-only CMA = paying premium for what OSS does compute-only. |
| “We’re an OpenAI shop, conversational handoffs (support → billing → refund)” | OpenAI Agents SDK (handoff). Clean primitive. Typed Pydantic feedback gives structured inter-agent contracts without ceremony. Watch default-on tracing. |
| “We’re an OpenAI shop, pipeline with stitching” | OpenAI Agents SDK (as-tools). Lowest LOC for coordinator-stitches workflows after raw LangGraph shallow. |
| “We need a working demo for stakeholders by Friday” | CrewAI (caveat: rewrite for production) — or OpenAI as-tools at 100 LOC, competitive. |
| “We already author Claude Code skills, want them deployable” | Flue (track + spike, don’t standardize yet) |
| “We can’t pick — too many requirements” | LangGraph as default; revisit in 3 months |
When the LOC tax buys something real
The three “expensive” spike variants each buy something specific. Use this table to decide whether you’re paying for something you actually need:
| Variant | LOC tax over baseline | Buys |
|---|---|---|
| CMA FS over inline | +28 (~20%) | Cost lever — prompt tokens drop ~5–10× when artifacts are large |
| OpenAI handoff over as-tools | +48 (~50%) | Control transfer + typed Pydantic feedback contracts |
| LangGraph deep over shallow | +160 (~3×) | Deterministic routing + crash durability + HITL + bounded revision loops |
If you don’t need what’s in the right column, don’t pay the cost on the left.
Cross-Framework Gold Seams
Topics worth deep-diving regardless of which framework “wins”:
- Handoff semantics. Modeled as tool calls in CMA, OpenAI Agents SDK, and Flue. Modeled as Command + state update in LangGraph. Modeled as Process.hierarchical + manager LLM in CrewAI. The bug surface is the same in all five — when does control return. Worth a dedicated mental-model post for the team.
- State durability. LangGraph’s checkpoint + time-travel is the most rigorous. CMA’s shared-FS is the most pragmatic. CrewAI Flow has it but underdocumented. OpenAI Sessions has 9 backends but the durability semantics across handoffs are messier. The framework’s durability story IS its production story.
- Observability cost. Free OSS frameworks all lean on a paid SaaS observability layer (LangSmith, AgentOps, Langfuse-Cloud, Datadog) at scale. The honest TCO comparison includes that layer.
- Pricing model — per-session-hour vs per-compute. CMA’s $0.08/session-hour is novel and unforgiving — 5 parallel children = 5× tokens AND meter still ticks while any thread runs. OSS = your AWS/GCP bill, predictable but more ops work. Cost crossover point worth modeling for any real workload.
- Sandbox / code-exec. Three different stories — CMA bundled, OpenAI Sandbox Agents (new Apr 2026, fixing real Q1 production failures), Flue pluggable, LangGraph/CrewAI = roll your own. Sandbox is becoming table-stakes; track this.
Pitfalls That Apply Broadly
| Pitfall | Frameworks affected |
|---|---|
| Vendor benchmarks unreproduced (“5.76× faster”) | CrewAI especially; all to some degree |
| Logo-deck production claims, no architecture detail | All five (LangGraph slightly better) |
| Pre-1.0 / beta API churn | OpenAI Agents SDK, CMA (beta), Flue (v0.0.x) |
| Multi-agent depth limited or undocumented | CMA (depth=1 hard), CrewAI (manager-LLM tax), Flue (no graph primitives) |
| Default observability ships data to vendor | OpenAI Agents SDK (default-on tracing → OpenAI); LangGraph (best-DX pull toward LangSmith) |
| State serialization bloat | LangGraph (open issue: 85% storage / 37.8% token overhead) |
| Single-maintainer / org bus factor | Flue (Schott + small team) |
| Token bloat from persona prompts | CrewAI |
Open Questions (Live Runs Required)
Code-only mode is exhausted. The remaining unknowns require execution:
- Live cost run — all 5 variants, same prompt, capture token counts + wall-clock + dollars. Without this, every cost claim is projection.
- CMA shared-FS prompt-token saving — confirm projected ~5–10× by inspecting agent.thread_message_sent payload sizes during a live run.
- OpenAI tracing-disabled verification — confirm the env var actually disables data flow to the OpenAI dashboard (some practitioner posts allege partial leakage). Requires network inspection during a live run.
- Coordinator LLM-rewrites-children footgun — across 5 spikes, does any LLM coordinator observably violate the “thin orchestrator” rule despite the persona forbidding it? Live runs needed.
- Postgres checkpointer at scale — open issue: 85% storage bloat, 37.8% token overhead. Real cost only visible at production volume.
- Multi-framework composition — can LangGraph orchestrate Anthropic Managed Agents as nodes? Both want to own the loop — pick one, but for what reasons?
- Compliance & data residency mapping — CMA = Anthropic infra. OpenAI tracing = OpenAI dashboard. Self-hosted LangGraph or CrewAI = your residency. Map this against the user data class your team handles.
- Skills/Agents.md interop — Anthropic Skills format, Claude Code skills, Flue skills — same shape, different runtimes. Is there a portable skill spec emerging?
What This Research Is and Is Not
Is: Quick-scan landscape map. T-shape broad layer + 5 code-only spike variants (890 lines). Tier-1 docs read for each framework. Honest pitfalls surfaced from issue trackers and critical blog posts. Decision guidance grounded in spike code shape, not only feature lists.
Is not: A live run. No working examples were executed. No benchmarks were measured. No cost model was computed. Production claims were not independently verified.
Next phase if team commits to a framework: 1-day code spike with real workload, real cost model, real failure-mode test — per the Phase 4 template in the research skill. Estimated cost across all 5 variants for live runs: ~$2–10 depending on token volume and Claude vs GPT mix.
Sources & Provenance
Verifiable sources. Dates matter. Credibility assessed.