Harness Engineering & Deep Agents: The Architecture Layer Above Context Engineering
LangChain's Deep Agents SDK codifies four primitives (planning, subagents, filesystem, detailed prompts) observed in Claude Code, Manus, and Deep Research. OpenAI coined 'harness engineering' — the complete system wrapping an agent. Here's the full landscape, the evidence, and what it means for how agents are built in 2026.
TL;DR
2025 was the year of agents. 2026 is the year of agent harnesses — the systems around the model that determine whether agents actually work. LangChain’s Harrison Chase studied Claude Code, Manus, and OpenAI Deep Research, extracted four shared primitives (detailed prompts, planning tools, subagents, filesystem), and shipped an open-source SDK called Deep Agents. OpenAI independently coined “harness engineering” after their Codex team built 1M lines of code with agents. The hierarchy that emerged: Prompt Engineering ⊂ Context Engineering ⊂ Harness Engineering. The harness is the operating system; the model is just the CPU.
The Hierarchy
HARNESS ENGINEERING ← constraints, verification, lifecycle
↳ CONTEXT ENGINEERING ← what the model sees, when, in what format
↳ PROMPT ENGINEERING ← how you phrase individual requests
| Layer | Scope | Analogy |
|---|---|---|
| Prompt Engineering | Single text string | Writing a SQL query |
| Context Engineering | All information the model sees across turns | Database schema + indexes |
| Harness Engineering | Entire system: context + constraints + verification + observability + lifecycle | The operating system |
Philipp Schmid’s framing: Model = CPU. Context Window = RAM. Agent Harness = Operating System. Agent = Application.
Why This Matters Now
Three convergent signals in Q1 2026:
- LangChain shipped Deep Agents — an open-source SDK that codifies the four primitives observed in Claude Code, Manus, and OpenAI Deep Research. Their harness-only changes took a coding agent from outside the Top 30 to the Top 5 on Terminal Bench 2.0. Same model, different harness, a 14-percentage-point improvement.
- OpenAI coined “harness engineering” — their Codex team built ~1M lines of production code using agents in 5 months (~1/10th the manual time). The lesson: the harness matters more than the model. 3-7 engineers, 1,500 merged PRs, 3.5 PRs per engineer per day.
- LangChain Skills hit 29% → 95% — Claude Code’s pass rate on LangChain ecosystem tasks jumped from 29% to 95% just by loading the right skill files. Not a model upgrade. Not fine-tuning. Markdown files loaded at the right time.
The Four Primitives of Deep Agents
Harrison Chase studied Claude Code, Manus, and OpenAI Deep Research and found they all share four architectural primitives. These are not LangChain inventions — they’re patterns extracted from what already works.
1. Detailed System Prompt
Long, complex prompts with specific tool-usage instructions and few-shot examples. Not a single sentence — hundreds to thousands of lines.
“Without these system prompts, the agents would not be nearly as deep. Prompting matters still!” — Harrison Chase
Claude Code’s system prompt is ~2,000+ lines. It specifies when to use each tool, how to handle edge cases, and behavioral guidelines for dozens of scenarios.
2. Planning Tool
A tool that lets the agent create and track a plan. The key insight: Claude Code’s “Todo list” tool is functionally a no-op — it does nothing except serve as context engineering. The act of writing a plan forces structured thinking.
“Planning (even if done via a no-op tool call) is a big component of that.” — Harrison Chase
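The no-op planning tool is easy to sketch in plain Python. This is an illustrative assumption, not Claude Code's actual API (names like write_todos and the Todo shape are invented here): the tool stores the plan and echoes it back, and its only real effect is putting the plan into the agent's context.

```python
# Hypothetical sketch of a no-op planning tool: it performs no external
# action; writing the plan into the transcript IS the context engineering.
from dataclasses import dataclass, field

@dataclass
class Todo:
    task: str
    status: str = "pending"  # pending | in_progress | done

@dataclass
class PlanningTool:
    todos: list = field(default_factory=list)

    def write_todos(self, tasks):
        """Replace the current plan. No side effects beyond storing the list."""
        self.todos = [Todo(t) for t in tasks]
        # The tool "result" is just the plan rendered back into context,
        # where it steers every subsequent model call.
        return "\n".join(f"[{t.status}] {t.task}" for t in self.todos)

planner = PlanningTool()
print(planner.write_todos(["read failing test", "patch parser", "re-run tests"]))
```

The payoff is entirely in the returned string: the model sees its own structured plan on every turn, which is why a functional no-op still changes behavior.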
3. Sub-Agents
Isolated agent instances with their own context windows. The main agent delegates, receives condensed results, keeps its own context clean.
Four use cases:
- Context preservation — multistep tasks that would clutter the main context
- Specialization — domain-specific instructions and tools per subagent
- Multi-model — cheaper/faster models for simpler subtasks
- Parallelization — simultaneous execution to reduce latency
“If the subagent is doing a lot of exploratory work before coming with its final answer, the main agent still only gets the final result, not the 20 tool calls that produced it.”
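The context-isolation boundary can be sketched in a few lines. This is a simplified assumption of how delegation works, not any framework's real API: the subagent accumulates its own transcript, and only the condensed final answer crosses back to the main agent.

```python
# Minimal sketch of subagent context isolation (all names are illustrative):
# intermediate tool calls stay inside the subagent; the parent sees one line.
def run_subagent(task, tools):
    transcript = []  # lives only inside the subagent's context window
    for name, call in tools:
        transcript.append(f"{name}: {call(task)}")
    # Only the condensed result crosses the boundary back to the main agent.
    return f"result for {task!r} ({len(transcript)} tool calls hidden)"

main_context = []
answer = run_subagent("find flaky test", [("grep", lambda t: "3 matches"),
                                          ("read", lambda t: "test_io.py")])
main_context.append(answer)  # one summary line, not the whole exploration
print(main_context)
```

However noisy the subagent's exploration gets, the main context grows by exactly one entry per delegation.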
4. Filesystem
Acts as shared workspace and external memory. Agents can write notes, save intermediate results, and maintain state across long-running tasks. This is essential for managing accumulated context.
Manus uses the filesystem heavily for memory management. Claude Code uses git worktrees to give each subagent an isolated copy of the repository.
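The filesystem-as-memory pattern reduces to a small read/write discipline. The helper names and JSON layout below are assumptions for illustration: state written to disk survives context compression, so a long-running task can resume after the conversation itself has been summarized away.

```python
# Sketch of filesystem-as-external-memory (paths and helpers are assumed):
# persist intermediate state outside the context window, read it back later.
import json, tempfile
from pathlib import Path

workspace = Path(tempfile.mkdtemp())  # stand-in for the agent's workspace

def save_note(name, payload):
    (workspace / f"{name}.json").write_text(json.dumps(payload))

def load_note(name):
    return json.loads((workspace / f"{name}.json").read_text())

save_note("migration_state", {"files_done": 9, "files_total": 23})
# ...context gets compacted, conversation summarized, tokens reclaimed...
state = load_note("migration_state")
print(f"resume at file {state['files_done'] + 1} of {state['files_total']}")
```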
LangChain’s Deep Agents SDK
What It Is
An open-source Python package (pip install deepagents) built on LangGraph. Provides the four primitives as composable building blocks, plus middleware for cross-cutting concerns.
Skills: Progressive Disclosure
The newest primitive. Skills are markdown files loaded dynamically — only their descriptions sit in context until the agent decides one is relevant, then the full instructions load.
.deepagents/skills/
  deploy/SKILL.md
  review-pr/SKILL.md
This solves the documented problem that giving too many tools to an agent degrades its performance. Skills are lazy-loaded capabilities.
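Progressive disclosure can be sketched as a two-level lookup. The skill names and field layout below are assumptions, not the SDK's actual loader: only one-line descriptions sit in context permanently; the full instructions load when the agent selects a skill.

```python
# Sketch of progressive disclosure (skill contents are placeholders):
# the index is cheap and always visible; bodies are lazy-loaded on demand.
skills = {
    "deploy":    {"description": "Ship a service to staging or production",
                  "body": "...hundreds of lines of deploy instructions..."},
    "review-pr": {"description": "Review a pull request for style and bugs",
                  "body": "...full review checklist..."},
}

def skill_index():
    """What the model always sees: names plus one-line descriptions only."""
    return [f"{name}: {s['description']}" for name, s in skills.items()]

def load_skill(name):
    """Called only once the agent decides the skill is relevant."""
    return skills[name]["body"]

print(skill_index())         # a few dozen tokens, always in context
print(load_skill("deploy"))  # the expensive part, loaded on demand
```

The tool-overload problem disappears because the always-on cost is the index, not the instructions.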
Context Management
Three-tiered compression system:
- Tool result offloading — Responses >20K tokens get written to filesystem, replaced with a file path + 10-line preview
- Tool input truncation — At 85% capacity, older tool calls are truncated and replaced with file pointers
- Summarization — LLM generates structured summary (intent, artifacts, next steps), full conversation saved to filesystem as canonical record
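Tier one, tool-result offloading, can be sketched directly. The 20K-token threshold and 10-line preview come from the description above; the token counting and file paths are simplified assumptions:

```python
# Sketch of tool-result offloading: large results are written to disk and
# replaced in context by a file pointer plus a short preview.
import tempfile
from pathlib import Path

TOKEN_LIMIT = 20_000
workdir = Path(tempfile.mkdtemp())

def offload_if_large(tool_name, result, approx_tokens):
    if approx_tokens <= TOKEN_LIMIT:
        return result  # small results stay inline
    path = workdir / f"{tool_name}.out"
    path.write_text(result)
    preview = "\n".join(result.splitlines()[:10])  # 10-line preview
    return f"[offloaded to {path}]\n{preview}"

big = "\n".join(f"line {i}" for i in range(50_000))
replacement = offload_if_large("grep", big, approx_tokens=50_000)
print(replacement.splitlines()[0])  # the file pointer, not 50K lines
```

The agent can still read the full result back from the path if it needs a detail the preview dropped, which is what makes offloading lossless in a way summarization is not.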
Middleware Architecture
The harness engineering layer. Middleware intercepts agent behavior at key points:
| Middleware | What It Does |
|---|---|
| PreCompletionChecklistMiddleware | Forces verification pass before agent can exit |
| LocalContextMiddleware | Maps directory structure and tool availability on start |
| LoopDetectionMiddleware | Tracks per-file edit counts, nudges agent after N edits to same file |
| Time budgeting | Injects warnings to shift focus toward verification as time runs low |
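A loop-detection middleware is small enough to sketch end to end. The class and hook names below are assumptions, not the Deep Agents middleware API: count edits per file and inject a nudge into context once the count crosses a threshold.

```python
# Sketch of loop-detection middleware (hook names are illustrative):
# the harness calls after_tool_call after every tool invocation, and any
# returned string is injected into the agent's context as a nudge.
from collections import Counter

class LoopDetection:
    def __init__(self, threshold=3):
        self.edits = Counter()
        self.threshold = threshold

    def after_tool_call(self, tool, args):
        if tool != "edit_file":
            return None
        path = args["path"]
        self.edits[path] += 1
        if self.edits[path] >= self.threshold:
            return (f"You have edited {path} {self.edits[path]} times. "
                    "Consider a different approach.")
        return None

mw = LoopDetection(threshold=3)
nudge = None
for _ in range(3):
    nudge = mw.after_tool_call("edit_file", {"path": "parser.py"})
print(nudge)  # fires on the third edit to the same file
```

The key design point is that the middleware never blocks the agent; it only changes what the agent sees, which is usually enough to break a doom loop.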
The Terminal Bench 2.0 Results
LangChain’s coding agent went from 52.8% (outside Top 30) to 66.5% (Top 5) on Terminal Bench 2.0 — 89 tasks across ML, debugging, and biology domains.
The model was held constant (gpt-5.2-codex). Only the harness changed.
What Moved the Needle
| Change | Impact |
|---|---|
| Build & Self-Verify Loop | 4-phase: Plan → Build → Verify → Fix. Forces test execution before completion. |
| Environment Context Delivery | Maps working directory, identifies tools on startup. Reduces context discovery failures. |
| Loop Detection | Tracks file edit counts, intervenes after repeated edits to same file (“doom loops” of 10+ iterations). |
| Reasoning Sandwich | xhigh reasoning for planning, high for implementation, xhigh for verification. Lifted the all-xhigh baseline’s 53.9% to 66.5%. |
| Time Budgeting | Nudges agent toward verification as deadline approaches. |
The “Reasoning Sandwich” result is particularly telling: xhigh reasoning everywhere scored 53.9% (timeout issues). High reasoning everywhere scored 63.6%. The sandwich scored 66.5%. More thinking isn’t always better — strategically allocated thinking is.
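The sandwich reduces to a per-phase configuration. The phase names follow the Plan → Build → Verify loop above; the effort-selection helper is an assumption about how a harness might wire this up, not LangChain's implementation:

```python
# Sketch of the "reasoning sandwich": maximum reasoning where mistakes are
# costly (planning, verification), standard reasoning for the mechanical
# middle (implementation). Values mirror the article's xhigh/high/xhigh.
REASONING = {"plan": "xhigh", "build": "high", "verify": "xhigh"}

def effort_for(phase):
    """Pick the reasoning budget for the current phase of the loop."""
    return REASONING[phase]

for phase in ("plan", "build", "verify"):
    print(phase, "->", effort_for(phase))
```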
LangChain Skills: 29% → 95%
Released March 4, 2026. The headline number: Claude Code’s pass rate on LangChain ecosystem tasks jumped from 29% to 95% by loading skill files.
11 skills across three categories:
- LangChain — core library patterns
- LangGraph — stateful agent workflows
- Deep Agents — primitives, middleware, filesystem
Installation:
npx skills add langchain-ai/langchain-skills --agent claude-code --skill '*' --yes --global
Skills are markdown files loaded via progressive disclosure. The npx skills CLI is maintained by Vercel Labs.
The Competitive Landscape
Claude Code
The reference implementation that Deep Agents was modeled after. Key architecture:
| Primitive | Claude Code Implementation |
|---|---|
| Detailed prompt | ~2,000+ line system prompt |
| Planning | TodoWrite tool (functional no-op for structured thinking) |
| Subagents | Isolated instances with independent context, auto-cleanup |
| Filesystem | Git worktrees per subagent, CLAUDE.md for persistent memory |
| Skills | .claude/skills/*.md with progressive disclosure |
| Context window | 200K+ tokens, intelligent on-demand file reading |
Verdict: Excels at complex, multi-file architectural work. Successfully completed a 23-file JWT auth migration that Cursor and Windsurf couldn’t handle.
OpenAI Agents SDK
Different category — a framework for building custom agents, not a ready-made harness.
| Strength | Weakness |
|---|---|
| Fastest path to working agent on OpenAI models | OpenAI models only — no Claude, Gemini, open-source |
| Built-in tracing and observability | No state persistence (no checkpointing) |
| Minimal code footprint (under 100 lines) | No native MCP support |
| 2-3 day learning curve | No native human-in-the-loop |
Cursor
Evolving from AI-enhanced IDE toward proactive agent platform:
- Agent Mode: Autonomous plan-execute-verify loop
- Automations (March 2026): Event-driven agents triggered by codebase changes, Slack messages, timers. Runs in cloud sandboxes with memory.
- Rules system: .cursor/rules/*.mdc — project, user, team, and agent rules
- Strongest autocomplete experience, but struggles with 40+ file architectural changes (~60-80K token effective limit)
LangGraph
The production leader for complex stateful agent workflows. Model-agnostic, built-in checkpointing, 1-2 week learning curve. Deep Agents is built on top of LangGraph.
Harness Engineering: The Full Definition
OpenAI coined the term based on their Codex experiment. Birgitta Bockeler (Thoughtworks) expanded it on Martin Fowler’s site.
Harness engineering = building systems around a model to optimize goals like task performance, token efficiency, and latency. It encompasses:
- System prompt design
- Tool choice and configuration
- Execution flow and lifecycle
- Middleware and hooks
- Memory systems
- Skills and progressive disclosure
- Verification and self-check loops
- Observability and tracing
- Loop detection and recovery
- Reasoning budget allocation
“The goal of a harness is to mold the inherently spiky intelligence of a model for tasks we care about.” — LangChain
“When the agent struggles, we treat it as a signal to improve the harness.” — Birgitta Bockeler, Thoughtworks
OpenAI’s Five Principles of Harness Engineering
- Intent-Driven Development — Engineers specify intent declaratively, not code directly
- Autonomous Agent Iteration — Agents open PRs, evaluate changes, iterate until criteria met
- Observability Integration — Agents use telemetry to monitor and debug their own work
- Architectural Constraint Enforcement — Mechanical rules maintain structural integrity
- Documentation as Machine-Readable Artifacts — Structured docs for agent consumption, not just humans
Context Rot and the “Dumb Zone”
Two critical research findings underpin this entire space:
Context rot (Chroma research): As token count increases, the model’s ability to accurately recall information decreases. Not a cliff — a gradient. Effective capacity is ~65% of claimed max.
The “dumb zone” (HumanLayer): In high-context regimes, there’s a measurable performance drop where models struggle to complete tasks. Subagents and context compression exist specifically to avoid entering this zone.
“Effective context management becomes critical to prevent context rot.” — Chester Curme & Mason Daugherty, LangChain
What This Means in Practice
The Pattern Consensus
Five independent groups (Anthropic, OpenAI, LangChain, Manus, Cursor) converged on remarkably similar architectural patterns:
| Pattern | Claude Code | Deep Agents | OpenAI Codex | Manus | Cursor |
|---|---|---|---|---|---|
| Detailed system prompt | Yes | Yes | Yes | Yes | Rules system |
| Planning tool | TodoWrite | Planning tool | Intent specs | Yes | Agent Mode |
| Subagents | Yes | Yes | Multi-agent | Yes | Automations |
| Filesystem memory | CLAUDE.md + worktrees | FileSystem | Structured docs | Heavy use | .cursor/rules |
| Skills / progressive disclosure | .claude/skills/ | SKILL.md | — | — | .cursor/rules/*.mdc |
| Verification loop | Yes | PreCompletionChecklist | PR iteration | — | Agent Mode |
| Context compression | Compaction at 95% | 3-tier (offload, truncate, summarize) | — | Context engineering | — |
For Agent Builders
- Start with the harness, not the model. LangChain’s 14-point improvement came from harness changes alone. The model was constant.
- Implement verification loops. The single biggest harness improvement is forcing agents to test their own work before declaring done.
- Use reasoning strategically. The “reasoning sandwich” (high reasoning for planning and verification, standard for implementation) outperforms maximum reasoning everywhere.
- Detect doom loops. Track per-file edit counts. After N edits to the same file, inject “consider a different approach.”
- Load skills lazily. Progressive disclosure — descriptions in context, full instructions on demand — outperforms loading everything upfront.
For Teams Using AI Coding Agents
- Write CLAUDE.md / rules files. This is the highest-leverage harness investment. Project knowledge that loads automatically.
- Use subagents for complex work. Context isolation is not optional for long-running tasks.
- Install domain skills. The 29% → 95% result on LangChain tasks shows skills aren’t marginal — they’re transformative for domain-specific work.
- Set compaction earlier. 70-80% capacity, not 95%. By 95%, quality has already degraded.
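The earlier-compaction advice is a one-line policy change. The 75% threshold below sits in the recommended 70-80% band; the trigger function is an assumption sketched for illustration, with summarization and filesystem offload left as described above:

```python
# Sketch of an earlier compaction trigger: summarize well before the
# context window is exhausted, not at the 95% default.
def maybe_compact(used_tokens, max_tokens, threshold=0.75):
    """Return the action to take given current context usage."""
    if used_tokens / max_tokens >= threshold:
        return "compact"   # summarize + offload full history to filesystem
    return "continue"

print(maybe_compact(150_000, 200_000))  # 75% of a 200K window -> "compact"
print(maybe_compact(100_000, 200_000))  # 50% -> "continue"
```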
The Tacit Angle
Harness engineering makes session memory even more critical than context engineering alone. Every middleware intervention, every subagent delegation, every context compression — these are decisions that reshape what the agent knows. The reasons behind those decisions live in sessions.
| Harness Component | Without Session Memory | With Session Memory |
|---|---|---|
| Verification loops | Pass/fail, no learning | Why it failed, what was tried |
| Doom loop detection | Reset and retry | Pattern of what caused the loop |
| Context compression | Information permanently lost | Full history searchable |
| Skill loading | Stateless each time | Which skills helped, which didn’t |
| Subagent delegation | Results without process | Full reasoning chain preserved |
The more sophisticated the harness, the more valuable it becomes to persist what happens inside it.
Open Questions
- Harness portability: Can a harness tuned for one model transfer to another? LangChain says “harnesses require model-specific tuning.”
- Skill ecosystems: Will skills become a shared ecosystem (like npm for agents) or remain project-specific?
- Verification completeness: How do you verify that the verifier is correct? Turtles all the way down.
- Reasoning allocation: The “reasoning sandwich” works for coding — does it generalize to other domains?
- Harness complexity: At what point does harness engineering become over-engineering? Where’s the elbow?
Confidence Assessment
| Claim | Confidence |
|---|---|
| Deep agents share four common primitives | High — multi-source convergence across Claude Code, Manus, Deep Research |
| Harness changes matter more than model changes | High — Terminal Bench 2.0 evidence (same model, 14pt improvement) |
| The hierarchy (prompt ⊂ context ⊂ harness) is real | High — independent convergence from Anthropic, OpenAI, Thoughtworks |
| Skills with progressive disclosure improve performance | High — 29% → 95% measured result |
| Verification loops are the highest-leverage harness improvement | High — multiple sources cite this |
| The “reasoning sandwich” generalizes beyond coding | Medium — only tested on Terminal Bench 2.0 |
| Skills will become a shared ecosystem | Medium — early signals (Vercel Labs CLI) but unproven |
| Deep Agents SDK will see widespread adoption | Low — LangChain has history of adoption then criticism |
Sources & Provenance
Verifiable sources. Dates matter. Credibility assessed.
Deep Agents ↗
Harrison Chase · LangChain Blog
"Foundational post identifying four primitives shared by Claude Code, Manus, and Deep Research: detailed prompts, planning tools, subagents, and filesystem. 'Using an LLM to call tools in a loop is the simplest form of an agent' — deep agents enrich this with architectural primitives."
Improving Deep Agents with Harness Engineering ↗
LangChain · LangChain Blog
"Coding agent went from 52.8% to 66.5% on Terminal Bench 2.0 with harness-only changes. Key techniques: build-and-verify loop, environment context delivery, loop detection, reasoning sandwich (xhigh-high-xhigh), time budgeting."
LangChain Skills ↗
LangChain · LangChain Blog
"11 skills across LangChain, LangGraph, and Deep Agents. Claude Code pass rate on LangChain tasks: 29% → 95% with skills loaded. Skills use progressive disclosure — descriptions in context, full instructions loaded on demand."
Building Multi-Agent Applications with Deep Agents ↗
Sydney Runkle and Vivek Trivedy · LangChain Blog
"Two first-class primitives: subagents (context isolation) and skills (progressive disclosure). Decision matrix for when to use each. 'Multi-agent patterns don't have to be complicated.'"
Context Management for Deep Agents ↗
Chester Curme and Mason Daugherty · LangChain Blog
"Three-tiered context compression: tool result offloading (>20K tokens), tool input truncation (at 85% capacity), and summarization with filesystem backup. 'Effective context management becomes critical to prevent context rot.'"
Harness Engineering ↗
Birgitta Bockeler · Martin Fowler (Thoughtworks)
"Harness engineering is broader than prompt and context engineering — it encompasses constraints, verification mechanisms, and iterative feedback loops. 'When the agent struggles, we treat it as a signal to improve the harness.'"
Effective Context Engineering for AI Agents ↗
Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, Jeremy Hadfield · Anthropic Engineering
"Four context strategies: Write, Select, Compress, Isolate. 'Most agent failures are not model failures anymore — they are context failures.' Production evidence from Claude Code."
The Importance of Agent Harness in 2026 ↗
Philipp Schmid · Personal Blog
"Model = CPU, Context Window = RAM, Agent Harness = Operating System, Agent = Application. The harness is where competitive differentiation lives, not the model."
OpenAI: Harness Engineering with Codex ↗
InfoQ · InfoQ
"OpenAI Codex team: 3-7 engineers, ~1M lines of code, 5 months, 1,500 merged PRs. Five principles: intent-driven development, autonomous iteration, observability integration, architectural constraint enforcement, documentation as machine-readable artifacts."
Context Engineering for AI Agents: Lessons from Building Manus ↗
Manus Team · Manus Blog
"Production lessons from building a deep agent: filesystem as primary memory layer, context compression strategies, and the importance of structured external state."
Context Rot Research ↗
Chroma · Chroma Research
"As token count increases, model accuracy for information recall decreases — a gradient, not a cliff. Effective capacity is roughly 65% of claimed maximum context window."
Context-Efficient Backpressure (The 'Dumb Zone') ↗
HumanLayer · HumanLayer Blog
"Documented measurable performance drop in high-context regimes — the 'dumb zone' where models struggle to complete tasks despite having relevant information in context."
Cursor vs Windsurf vs Claude Code: The Honest Comparison ↗
DEV Community · DEV Community
"Claude Code: 200K+ token context, excels at multi-file architecture. Cursor: ~60-80K effective tokens, best autocomplete. Claude Code successfully completed 23-file JWT migration that others couldn't."
LangGraph vs CrewAI vs OpenAI Agents SDK ↗
Particula · Particula Blog
"LangGraph: production leader for complex stateful workflows. OpenAI SDK: fastest to prototype but OpenAI-only. CrewAI: multi-agent orchestration. Deep Agents builds on LangGraph."
Terminal Bench 2.0 Leaderboard ↗
Terminal Bench · tbench.ai
"89-task benchmark across ML, debugging, and biology domains. LangChain's harness-engineered agent scored 66.5% (Top 5) with gpt-5.2-codex."
The Emerging Harness Engineering Playbook ↗
Ignorance.ai · Ignorance.ai
"Framework for thinking about harness engineering as a discipline: constraints, feedback loops, documentation, linters, lifecycle management, verification, and iteration cycles."
Cursor Automations ↗
TechCrunch · TechCrunch
"Cursor's new Automations system: event-driven agents triggered by codebase changes, Slack messages, or timers. Runs in cloud sandboxes with persistent memory. Shift from reactive IDE to proactive agent platform."
2025 Was Agents, 2026 Is Agent Harnesses ↗
Aakash Gupta · Medium
"Industry trend analysis: the shift from building agents to engineering the systems that make agents reliable. Harness engineering as the key differentiator."
What Are DeepAgents in LangChain? — A Comprehensive Guide ↗
QualityPoint Technologies · QualityPoint Blog
"Tutorial-level overview of Deep Agents primitives. Summarizes the four-primitive architecture for beginners."
Debugging Deep Agents with LangSmith ↗
LangChain · LangChain Blog
"Observability patterns for deep agent debugging. LangSmith integration for traces, latency analysis, and token cost tracking."
Claude Code Architecture (Reverse Engineered) ↗
Substack · Substack
"Reverse-engineered analysis of Claude Code's architecture: subagents, worktrees, context handling, and system prompt structure."
A Mental Model for Claude Code ↗
Level Up Coding · Level Up Coding
"Conceptual framework for understanding Claude Code's skills, subagents, and plugin architecture."
Agent Skills Specification ↗
Agent Skills · agentskills.io
"Emerging specification for portable agent skills across frameworks."
Context Engineering is the New AI Moat ↗
StartupHub.ai · StartupHub.ai
"Video coverage of Harrison Chase's Sequoia Capital appearance discussing long-horizon agents and context engineering as competitive moat."