Harness Engineering & Deep Agents: The Architecture Layer Above Context Engineering
LangChain's Deep Agents SDK codifies four primitives (planning, subagents, filesystem, detailed prompts) observed in Claude Code, Manus, and Deep Research. OpenAI coined 'harness engineering' — the complete system wrapping an agent. Here's the full landscape, the evidence, and what it means for how agents are built in 2026.
TL;DR
2025 was the year of agents. 2026 is the year of agent harnesses — the systems around the model that determine whether agents actually work. LangChain’s Harrison Chase studied Claude Code, Manus, and OpenAI Deep Research, extracted four shared primitives (detailed prompts, planning tools, subagents, filesystem), and shipped an open-source SDK called Deep Agents. OpenAI independently coined “harness engineering” after their Codex team built 1M lines of code with agents. The hierarchy that emerged: Prompt Engineering ⊂ Context Engineering ⊂ Harness Engineering. The harness is the operating system; the model is just the CPU.
The Hierarchy
HARNESS ENGINEERING ← constraints, verification, lifecycle
↳ CONTEXT ENGINEERING ← what the model sees, when, in what format
↳ PROMPT ENGINEERING ← how you phrase individual requests
| Layer | Scope | Analogy |
|---|---|---|
| Prompt Engineering | Single text string | Writing a SQL query |
| Context Engineering | All information the model sees across turns | Database schema + indexes |
| Harness Engineering | Entire system: context + constraints + verification + observability + lifecycle | The operating system |
Philipp Schmid’s framing: Model = CPU. Context Window = RAM. Agent Harness = Operating System. Agent = Application.
Why This Matters Now
Three convergent signals in Q1 2026:
- LangChain shipped Deep Agents — an open-source SDK that codifies the four primitives observed in Claude Code, Manus, and OpenAI Deep Research. Their harness-only changes took a coding agent from outside the Top 30 to the Top 5 on Terminal Bench 2.0. Same model, different harness, a 14-percentage-point improvement.
- OpenAI coined “harness engineering” — their Codex team built ~1M lines of production code using agents in 5 months (~1/10th the manual time). The lesson: the harness matters more than the model. 3-7 engineers, 1,500 merged PRs, 3.5 PRs per engineer per day.
- LangChain Skills hit 29% → 95% — Claude Code’s pass rate on LangChain ecosystem tasks jumped from 29% to 95% just by loading the right skill files. Not a model upgrade. Not fine-tuning. Markdown files loaded at the right time.
The Four Primitives of Deep Agents
Harrison Chase studied Claude Code, Manus, and OpenAI Deep Research and found they all share four architectural primitives. These are not LangChain inventions — they’re patterns extracted from what already works.
1. Detailed System Prompt
Long, complex prompts with specific tool-usage instructions and few-shot examples. Not a single sentence — hundreds to thousands of lines.
“Without these system prompts, the agents would not be nearly as deep. Prompting matters still!” — Harrison Chase
Claude Code’s system prompt is ~2,000+ lines. It specifies when to use each tool, how to handle edge cases, and behavioral guidelines for dozens of scenarios.
2. Planning Tool
A tool that lets the agent create and track a plan. The key insight: Claude Code’s “Todo list” tool is functionally a no-op — it does nothing except serve as context engineering. The act of writing a plan forces structured thinking.
“Planning (even if done via a no-op tool call) is a big component of that.” — Harrison Chase
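The no-op planning tool is easy to sketch in plain Python. This is an illustrative assumption, not Claude Code's actual API (names like write_todos and the Todo shape are invented here): the tool stores the plan and echoes it back, and its only real effect is putting the plan into the agent's context.

```python
# Hypothetical sketch of a no-op planning tool: it performs no external
# action; writing the plan into the transcript IS the context engineering.
from dataclasses import dataclass, field

@dataclass
class Todo:
    task: str
    status: str = "pending"  # pending | in_progress | done

@dataclass
class PlanningTool:
    todos: list = field(default_factory=list)

    def write_todos(self, tasks):
        """Replace the current plan. No side effects beyond storing the list."""
        self.todos = [Todo(t) for t in tasks]
        # The tool "result" is just the plan rendered back into context,
        # where it steers every subsequent model call.
        return "\n".join(f"[{t.status}] {t.task}" for t in self.todos)

planner = PlanningTool()
print(planner.write_todos(["read failing test", "patch parser", "re-run tests"]))
```

The payoff is entirely in the returned string: the model sees its own structured plan on every turn, which is why a functional no-op still changes behavior.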
3. Sub-Agents
Isolated agent instances with their own context windows. The main agent delegates, receives condensed results, keeps its own context clean.
Four use cases:
- Context preservation — multistep tasks that would clutter the main context
- Specialization — domain-specific instructions and tools per subagent
- Multi-model — cheaper/faster models for simpler subtasks
- Parallelization — simultaneous execution to reduce latency
“If the subagent is doing a lot of exploratory work before coming with its final answer, the main agent still only gets the final result, not the 20 tool calls that produced it.”
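The context-isolation boundary can be sketched in a few lines. This is a simplified assumption of how delegation works, not any framework's real API: the subagent accumulates its own transcript, and only the condensed final answer crosses back to the main agent.

```python
# Minimal sketch of subagent context isolation (all names are illustrative):
# intermediate tool calls stay inside the subagent; the parent sees one line.
def run_subagent(task, tools):
    transcript = []  # lives only inside the subagent's context window
    for name, call in tools:
        transcript.append(f"{name}: {call(task)}")
    # Only the condensed result crosses the boundary back to the main agent.
    return f"result for {task!r} ({len(transcript)} tool calls hidden)"

main_context = []
answer = run_subagent("find flaky test", [("grep", lambda t: "3 matches"),
                                          ("read", lambda t: "test_io.py")])
main_context.append(answer)  # one summary line, not the whole exploration
print(main_context)
```

However noisy the subagent's exploration gets, the main context grows by exactly one entry per delegation.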
4. Filesystem
Acts as shared workspace and external memory. Agents can write notes, save intermediate results, and maintain state across long-running tasks. This is essential for managing accumulated context.
Manus uses the filesystem heavily for memory management. Claude Code uses git worktrees to give each subagent an isolated copy of the repository.
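The filesystem-as-memory pattern reduces to a small read/write discipline. The helper names and JSON layout below are assumptions for illustration: state written to disk survives context compression, so a long-running task can resume after the conversation itself has been summarized away.

```python
# Sketch of filesystem-as-external-memory (paths and helpers are assumed):
# persist intermediate state outside the context window, read it back later.
import json, tempfile
from pathlib import Path

workspace = Path(tempfile.mkdtemp())  # stand-in for the agent's workspace

def save_note(name, payload):
    (workspace / f"{name}.json").write_text(json.dumps(payload))

def load_note(name):
    return json.loads((workspace / f"{name}.json").read_text())

save_note("migration_state", {"files_done": 9, "files_total": 23})
# ...context gets compacted, conversation summarized, tokens reclaimed...
state = load_note("migration_state")
print(f"resume at file {state['files_done'] + 1} of {state['files_total']}")
```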
LangChain’s Deep Agents SDK
What It Is
An open-source Python package (pip install deepagents) built on LangGraph. Provides the four primitives as composable building blocks, plus middleware for cross-cutting concerns.
Skills: Progressive Disclosure
The newest primitive. Skills are markdown files loaded dynamically — only their descriptions sit in context until the agent decides one is relevant, then the full instructions load.
.deepagents/skills/
  deploy/SKILL.md
  review-pr/SKILL.md
This solves the documented problem that giving too many tools to an agent degrades its performance. Skills are lazy-loaded capabilities.
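Progressive disclosure can be sketched as a two-level lookup. The skill names and field layout below are assumptions, not the SDK's actual loader: only one-line descriptions sit in context permanently; the full instructions load when the agent selects a skill.

```python
# Sketch of progressive disclosure (skill contents are placeholders):
# the index is cheap and always visible; bodies are lazy-loaded on demand.
skills = {
    "deploy":    {"description": "Ship a service to staging or production",
                  "body": "...hundreds of lines of deploy instructions..."},
    "review-pr": {"description": "Review a pull request for style and bugs",
                  "body": "...full review checklist..."},
}

def skill_index():
    """What the model always sees: names plus one-line descriptions only."""
    return [f"{name}: {s['description']}" for name, s in skills.items()]

def load_skill(name):
    """Called only once the agent decides the skill is relevant."""
    return skills[name]["body"]

print(skill_index())         # a few dozen tokens, always in context
print(load_skill("deploy"))  # the expensive part, loaded on demand
```

The tool-overload problem disappears because the always-on cost is the index, not the instructions.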
Context Management
Three-tiered compression system:
- Tool result offloading — Responses >20K tokens get written to filesystem, replaced with a file path + 10-line preview
- Tool input truncation — At 85% capacity, older tool calls are truncated and replaced with file pointers
- Summarization — LLM generates structured summary (intent, artifacts, next steps), full conversation saved to filesystem as canonical record
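Tier one, tool-result offloading, can be sketched directly. The 20K-token threshold and 10-line preview come from the description above; the token counting and file paths are simplified assumptions:

```python
# Sketch of tool-result offloading: large results are written to disk and
# replaced in context by a file pointer plus a short preview.
import tempfile
from pathlib import Path

TOKEN_LIMIT = 20_000
workdir = Path(tempfile.mkdtemp())

def offload_if_large(tool_name, result, approx_tokens):
    if approx_tokens <= TOKEN_LIMIT:
        return result  # small results stay inline
    path = workdir / f"{tool_name}.out"
    path.write_text(result)
    preview = "\n".join(result.splitlines()[:10])  # 10-line preview
    return f"[offloaded to {path}]\n{preview}"

big = "\n".join(f"line {i}" for i in range(50_000))
replacement = offload_if_large("grep", big, approx_tokens=50_000)
print(replacement.splitlines()[0])  # the file pointer, not 50K lines
```

The agent can still read the full result back from the path if it needs a detail the preview dropped, which is what makes offloading lossless in a way summarization is not.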
Middleware Architecture
The harness engineering layer. Middleware intercepts agent behavior at key points:
| Middleware | What It Does |
|---|---|
| PreCompletionChecklistMiddleware | Forces verification pass before agent can exit |
| LocalContextMiddleware | Maps directory structure and tool availability on start |
| LoopDetectionMiddleware | Tracks per-file edit counts, nudges agent after N edits to same file |
| Time budgeting | Injects warnings to shift focus toward verification as time runs low |
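A loop-detection middleware is small enough to sketch end to end. The class and hook names below are assumptions, not the Deep Agents middleware API: count edits per file and inject a nudge into context once the count crosses a threshold.

```python
# Sketch of loop-detection middleware (hook names are illustrative):
# the harness calls after_tool_call after every tool invocation, and any
# returned string is injected into the agent's context as a nudge.
from collections import Counter

class LoopDetection:
    def __init__(self, threshold=3):
        self.edits = Counter()
        self.threshold = threshold

    def after_tool_call(self, tool, args):
        if tool != "edit_file":
            return None
        path = args["path"]
        self.edits[path] += 1
        if self.edits[path] >= self.threshold:
            return (f"You have edited {path} {self.edits[path]} times. "
                    "Consider a different approach.")
        return None

mw = LoopDetection(threshold=3)
nudge = None
for _ in range(3):
    nudge = mw.after_tool_call("edit_file", {"path": "parser.py"})
print(nudge)  # fires on the third edit to the same file
```

The key design point is that the middleware never blocks the agent; it only changes what the agent sees, which is usually enough to break a doom loop.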
The Terminal Bench 2.0 Results
LangChain’s coding agent went from 52.8% (outside Top 30) to 66.5% (Top 5) on Terminal Bench 2.0 — 89 tasks across ML, debugging, and biology domains.
The model was held constant (gpt-5.2-codex). Only the harness changed.
What Moved the Needle
| Change | Impact |
|---|---|
| Build & Self-Verify Loop | 4-phase: Plan → Build → Verify → Fix. Forces test execution before completion. |
| Environment Context Delivery | Maps working directory, identifies tools on startup. Reduces context discovery failures. |
| Loop Detection | Tracks file edit counts, intervenes after repeated edits to same file (“doom loops” of 10+ iterations). |
| Reasoning Sandwich | xhigh reasoning for planning, high for implementation, xhigh for verification. Lifted the all-xhigh baseline’s 53.9% to 66.5%. |
| Time Budgeting | Nudges agent toward verification as deadline approaches. |
The “Reasoning Sandwich” result is particularly telling: xhigh reasoning everywhere scored 53.9% (timeout issues). High reasoning everywhere scored 63.6%. The sandwich scored 66.5%. More thinking isn’t always better — strategically allocated thinking is.
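The sandwich reduces to a per-phase configuration. The phase names follow the Plan → Build → Verify loop above; the effort-selection helper is an assumption about how a harness might wire this up, not LangChain's implementation:

```python
# Sketch of the "reasoning sandwich": maximum reasoning where mistakes are
# costly (planning, verification), standard reasoning for the mechanical
# middle (implementation). Values mirror the article's xhigh/high/xhigh.
REASONING = {"plan": "xhigh", "build": "high", "verify": "xhigh"}

def effort_for(phase):
    """Pick the reasoning budget for the current phase of the loop."""
    return REASONING[phase]

for phase in ("plan", "build", "verify"):
    print(phase, "->", effort_for(phase))
```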
LangChain Skills: 29% → 95%
Released March 4, 2026. The headline number: Claude Code’s pass rate on LangChain ecosystem tasks jumped from 29% to 95% by loading skill files.
11 skills across three categories:
- LangChain — core library patterns
- LangGraph — stateful agent workflows
- Deep Agents — primitives, middleware, filesystem
Installation:
npx skills add langchain-ai/langchain-skills --agent claude-code --skill '*' --yes --global
Skills are markdown files loaded via progressive disclosure. The npx skills CLI is maintained by Vercel Labs.
The Competitive Landscape
Claude Code
The reference implementation that Deep Agents was modeled after. Key architecture:
| Primitive | Claude Code Implementation |
|---|---|
| Detailed prompt | ~2,000+ line system prompt |
| Planning | TodoWrite tool (functional no-op for structured thinking) |
| Subagents | Isolated instances with independent context, auto-cleanup |
| Filesystem | Git worktrees per subagent, CLAUDE.md for persistent memory |
| Skills | .claude/skills/*.md with progressive disclosure |
| Context window | 200K+ tokens, intelligent on-demand file reading |
Verdict: Excels at complex, multi-file architectural work. Successfully completed a 23-file JWT auth migration that Cursor and Windsurf couldn’t handle.
OpenAI Agents SDK
Different category — a framework for building custom agents, not a ready-made harness.
| Strength | Weakness |
|---|---|
| Fastest path to working agent on OpenAI models | OpenAI models only — no Claude, Gemini, open-source |
| Built-in tracing and observability | No state persistence (no checkpointing) |
| Minimal code footprint (under 100 lines) | No native MCP support |
| 2-3 day learning curve | No native human-in-the-loop |
Cursor
Evolving from AI-enhanced IDE toward proactive agent platform:
- Agent Mode: Autonomous plan-execute-verify loop
- Automations (March 2026): Event-driven agents triggered by codebase changes, Slack messages, timers. Runs in cloud sandboxes with memory.
- Rules system: .cursor/rules/*.mdc — project, user, team, and agent rules
- Strongest autocomplete experience, but struggles with 40+ file architectural changes (~60-80K token effective limit)
LangGraph
The production leader for complex stateful agent workflows. Model-agnostic, built-in checkpointing, 1-2 week learning curve. Deep Agents is built on top of LangGraph.
Harness Engineering: The Full Definition
OpenAI coined the term based on their Codex experiment. Birgitta Bockeler (Thoughtworks) expanded it on Martin Fowler’s site.
Harness engineering = building systems around a model to optimize goals like task performance, token efficiency, and latency. It encompasses:
- System prompt design
- Tool choice and configuration
- Execution flow and lifecycle
- Middleware and hooks
- Memory systems
- Skills and progressive disclosure
- Verification and self-check loops
- Observability and tracing
- Loop detection and recovery
- Reasoning budget allocation
“The goal of a harness is to mold the inherently spiky intelligence of a model for tasks we care about.” — LangChain
“When the agent struggles, we treat it as a signal to improve the harness.” — Birgitta Bockeler, Thoughtworks
OpenAI’s Five Principles of Harness Engineering
- Intent-Driven Development — Engineers specify intent declaratively, not code directly
- Autonomous Agent Iteration — Agents open PRs, evaluate changes, iterate until criteria met
- Observability Integration — Agents use telemetry to monitor and debug their own work
- Architectural Constraint Enforcement — Mechanical rules maintain structural integrity
- Documentation as Machine-Readable Artifacts — Structured docs for agent consumption, not just humans
Context Rot and the “Dumb Zone”
Two critical research findings underpin this entire space:
Context rot (Chroma research): As token count increases, the model’s ability to accurately recall information decreases. Not a cliff — a gradient. Effective capacity is ~65% of claimed max.
The “dumb zone” (HumanLayer): In high-context regimes, there’s a measurable performance drop where models struggle to complete tasks. Subagents and context compression exist specifically to avoid entering this zone.
“Effective context management becomes critical to prevent context rot.” — Chester Curme & Mason Daugherty, LangChain
What This Means in Practice
The Pattern Consensus
Five independent groups (Anthropic, OpenAI, LangChain, Manus, Cursor) converged on remarkably similar architectural patterns:
| Pattern | Claude Code | Deep Agents | OpenAI Codex | Manus | Cursor |
|---|---|---|---|---|---|
| Detailed system prompt | Yes | Yes | Yes | Yes | Rules system |
| Planning tool | TodoWrite | Planning tool | Intent specs | Yes | Agent Mode |
| Subagents | Yes | Yes | Multi-agent | Yes | Automations |
| Filesystem memory | CLAUDE.md + worktrees | FileSystem | Structured docs | Heavy use | .cursor/rules |
| Skills / progressive disclosure | .claude/skills/ | SKILL.md | — | — | .cursor/rules/*.mdc |
| Verification loop | Yes | PreCompletionChecklist | PR iteration | — | Agent Mode |
| Context compression | Compaction at 95% | 3-tier (offload, truncate, summarize) | — | Context engineering | — |
For Agent Builders
- Start with the harness, not the model. LangChain’s 14-point improvement came from harness changes alone. The model was constant.
- Implement verification loops. The single biggest harness improvement is forcing agents to test their own work before declaring done.
- Use reasoning strategically. The “reasoning sandwich” (high reasoning for planning and verification, standard for implementation) outperforms maximum reasoning everywhere.
- Detect doom loops. Track per-file edit counts. After N edits to the same file, inject “consider a different approach.”
- Load skills lazily. Progressive disclosure — descriptions in context, full instructions on demand — outperforms loading everything upfront.
For Teams Using AI Coding Agents
- Write CLAUDE.md / rules files. This is the highest-leverage harness investment. Project knowledge that loads automatically.
- Use subagents for complex work. Context isolation is not optional for long-running tasks.
- Install domain skills. The 29% → 95% result on LangChain tasks shows skills aren’t marginal — they’re transformative for domain-specific work.
- Set compaction earlier. 70-80% capacity, not 95%. By 95%, quality has already degraded.
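The earlier-compaction advice is a one-line policy change. The 75% threshold below sits in the recommended 70-80% band; the trigger function is an assumption sketched for illustration, with summarization and filesystem offload left as described above:

```python
# Sketch of an earlier compaction trigger: summarize well before the
# context window is exhausted, not at the 95% default.
def maybe_compact(used_tokens, max_tokens, threshold=0.75):
    """Return the action to take given current context usage."""
    if used_tokens / max_tokens >= threshold:
        return "compact"   # summarize + offload full history to filesystem
    return "continue"

print(maybe_compact(150_000, 200_000))  # 75% of a 200K window -> "compact"
print(maybe_compact(100_000, 200_000))  # 50% -> "continue"
```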
The Tacit Angle
Harness engineering makes session memory even more critical than context engineering alone. Every middleware intervention, every subagent delegation, every context compression — these are decisions that reshape what the agent knows. The reasons behind those decisions live in sessions.
| Harness Component | Without Session Memory | With Session Memory |
|---|---|---|
| Verification loops | Pass/fail, no learning | Why it failed, what was tried |
| Doom loop detection | Reset and retry | Pattern of what caused the loop |
| Context compression | Information permanently lost | Full history searchable |
| Skill loading | Stateless each time | Which skills helped, which didn’t |
| Subagent delegation | Results without process | Full reasoning chain preserved |
The more sophisticated the harness, the more valuable it becomes to persist what happens inside it.
Open Questions
- Harness portability: Can a harness tuned for one model transfer to another? LangChain says “harnesses require model-specific tuning.”
- Skill ecosystems: Will skills become a shared ecosystem (like npm for agents) or remain project-specific?
- Verification completeness: How do you verify that the verifier is correct? Turtles all the way down.
- Reasoning allocation: The “reasoning sandwich” works for coding — does it generalize to other domains?
- Harness complexity: At what point does harness engineering become over-engineering? Where’s the elbow?
Confidence Assessment
| Claim | Confidence |
|---|---|
| Deep agents share four common primitives | High — multi-source convergence across Claude Code, Manus, Deep Research |
| Harness changes matter more than model changes | High — Terminal Bench 2.0 evidence (same model, 14pt improvement) |
| The hierarchy (prompt ⊂ context ⊂ harness) is real | High — independent convergence from Anthropic, OpenAI, Thoughtworks |
| Skills with progressive disclosure improve performance | High — 29% → 95% measured result |
| Verification loops are the highest-leverage harness improvement | High — multiple sources cite this |
| The “reasoning sandwich” generalizes beyond coding | Medium — only tested on Terminal Bench 2.0 |
| Skills will become a shared ecosystem | Medium — early signals (Vercel Labs CLI) but unproven |
| Deep Agents SDK will see widespread adoption | Low — LangChain has history of adoption then criticism |
Sources & Provenance
Verifiable sources. Dates matter. Credibility assessed.
Deep Agents ↗
Harrison Chase · LangChain Blog
"Foundational post identifying four primitives shared by Claude Code, Manus, and Deep Research: detailed prompts, planning tools, subagents, and filesystem. 'Using an LLM to call tools in a loop is the simplest form of an agent' — deep agents enrich this with architectural primitives."
Improving Deep Agents with Harness Engineering ↗
LangChain · LangChain Blog
"Coding agent went from 52.8% to 66.5% on Terminal Bench 2.0 with harness-only changes. Key techniques: build-and-verify loop, environment context delivery, loop detection, reasoning sandwich (xhigh-high-xhigh), time budgeting."
LangChain Skills ↗
LangChain · LangChain Blog
"11 skills across LangChain, LangGraph, and Deep Agents. Claude Code pass rate on LangChain tasks: 29% → 95% with skills loaded. Skills use progressive disclosure — descriptions in context, full instructions loaded on demand."
Building Multi-Agent Applications with Deep Agents ↗
Sydney Runkle and Vivek Trivedy · LangChain Blog
"Two first-class primitives: subagents (context isolation) and skills (progressive disclosure). Decision matrix for when to use each. 'Multi-agent patterns don't have to be complicated.'"
Context Management for Deep Agents ↗
Chester Curme and Mason Daugherty · LangChain Blog
"Three-tiered context compression: tool result offloading (>20K tokens), tool input truncation (at 85% capacity), and summarization with filesystem backup. 'Effective context management becomes critical to prevent context rot.'"
Harness Engineering ↗
Birgitta Bockeler · Martin Fowler (Thoughtworks)
"Harness engineering is broader than prompt and context engineering — it encompasses constraints, verification mechanisms, and iterative feedback loops. 'When the agent struggles, we treat it as a signal to improve the harness.'"
Effective Context Engineering for AI Agents ↗
Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, Jeremy Hadfield · Anthropic Engineering
"Four context strategies: Write, Select, Compress, Isolate. 'Most agent failures are not model failures anymore — they are context failures.' Production evidence from Claude Code."
The Importance of Agent Harness in 2026 ↗
Philipp Schmid · Personal Blog
"Model = CPU, Context Window = RAM, Agent Harness = Operating System, Agent = Application. The harness is where competitive differentiation lives, not the model."
OpenAI: Harness Engineering with Codex ↗
InfoQ · InfoQ
"OpenAI Codex team: 3-7 engineers, ~1M lines of code, 5 months, 1,500 merged PRs. Five principles: intent-driven development, autonomous iteration, observability integration, architectural constraint enforcement, documentation as machine-readable artifacts."
Context Engineering for AI Agents: Lessons from Building Manus ↗
Manus Team · Manus Blog
"Production lessons from building a deep agent: filesystem as primary memory layer, context compression strategies, and the importance of structured external state."
Context Rot Research ↗
Chroma · Chroma Research
"As token count increases, model accuracy for information recall decreases — a gradient, not a cliff. Effective capacity is roughly 65% of claimed maximum context window."
Context-Efficient Backpressure (The 'Dumb Zone') ↗
HumanLayer · HumanLayer Blog
"Documented measurable performance drop in high-context regimes — the 'dumb zone' where models struggle to complete tasks despite having relevant information in context."
Cursor vs Windsurf vs Claude Code: The Honest Comparison ↗
DEV Community · DEV Community
"Claude Code: 200K+ token context, excels at multi-file architecture. Cursor: ~60-80K effective tokens, best autocomplete. Claude Code successfully completed 23-file JWT migration that others couldn't."
LangGraph vs CrewAI vs OpenAI Agents SDK ↗
Particula · Particula Blog
"LangGraph: production leader for complex stateful workflows. OpenAI SDK: fastest to prototype but OpenAI-only. CrewAI: multi-agent orchestration. Deep Agents builds on LangGraph."
Terminal Bench 2.0 Leaderboard ↗
Terminal Bench · tbench.ai
"89-task benchmark across ML, debugging, and biology domains. LangChain's harness-engineered agent scored 66.5% (Top 5) with gpt-5.2-codex."
The Emerging Harness Engineering Playbook ↗
Ignorance.ai · Ignorance.ai
"Framework for thinking about harness engineering as a discipline: constraints, feedback loops, documentation, linters, lifecycle management, verification, and iteration cycles."
Cursor Automations ↗
TechCrunch · TechCrunch
"Cursor's new Automations system: event-driven agents triggered by codebase changes, Slack messages, or timers. Runs in cloud sandboxes with persistent memory. Shift from reactive IDE to proactive agent platform."
2025 Was Agents, 2026 Is Agent Harnesses ↗
Aakash Gupta · Medium
"Industry trend analysis: the shift from building agents to engineering the systems that make agents reliable. Harness engineering as the key differentiator."
What Are DeepAgents in LangChain? — A Comprehensive Guide ↗
QualityPoint Technologies · QualityPoint Blog
"Tutorial-level overview of Deep Agents primitives. Summarizes the four-primitive architecture for beginners."
Debugging Deep Agents with LangSmith ↗
LangChain · LangChain Blog
"Observability patterns for deep agent debugging. LangSmith integration for traces, latency analysis, and token cost tracking."
Claude Code Architecture (Reverse Engineered) ↗
Substack · Substack
"Reverse-engineered analysis of Claude Code's architecture: subagents, worktrees, context handling, and system prompt structure."
A Mental Model for Claude Code ↗
Level Up Coding · Level Up Coding
"Conceptual framework for understanding Claude Code's skills, subagents, and plugin architecture."
Agent Skills Specification ↗
Agent Skills · agentskills.io
"Emerging specification for portable agent skills across frameworks."
Context Engineering is the New AI Moat ↗
StartupHub.ai · StartupHub.ai
"Video coverage of Harrison Chase's Sequoia Capital appearance discussing long-horizon agents and context engineering as competitive moat."