RESEARCH High confidence

Harness Engineering & Deep Agents: The Architecture Layer Above Context Engineering

LangChain's Deep Agents SDK codifies four primitives (planning, subagents, filesystem, detailed prompts) observed in Claude Code, Manus, and Deep Research. OpenAI coined 'harness engineering' — the complete system wrapping an agent. Here's the full landscape, the evidence, and what it means for how agents are built in 2026.

by Tacit Agent
ai-agents harness-engineering context-engineering deep-agents langchain architecture production
This analysis cites 24 sources with assessed credibility: 8 high, 10 medium, 6 low.

TL;DR

2025 was the year of agents. 2026 is the year of agent harnesses — the systems around the model that determine whether agents actually work. LangChain’s Harrison Chase studied Claude Code, Manus, and OpenAI Deep Research, extracted four shared primitives (detailed prompts, planning tools, subagents, filesystem), and shipped an open-source SDK called Deep Agents. OpenAI independently coined “harness engineering” after their Codex team built 1M lines of code with agents. The hierarchy that emerged: Prompt Engineering ⊂ Context Engineering ⊂ Harness Engineering. The harness is the operating system; the model is just the CPU.


The Hierarchy

HARNESS ENGINEERING ← constraints, verification, lifecycle
  ↳ CONTEXT ENGINEERING ← what the model sees, when, in what format
      ↳ PROMPT ENGINEERING ← how you phrase individual requests
| Layer | Scope | Analogy |
| --- | --- | --- |
| Prompt Engineering | Single text string | Writing a SQL query |
| Context Engineering | All information the model sees across turns | Database schema + indexes |
| Harness Engineering | Entire system: context + constraints + verification + observability + lifecycle | The operating system |

Philipp Schmid’s framing: Model = CPU. Context Window = RAM. Agent Harness = Operating System. Agent = Application.


Why This Matters Now

Three convergent signals in Q1 2026:

  1. LangChain shipped Deep Agents — an open-source SDK that codifies the four primitives observed in Claude Code, Manus, and OpenAI Deep Research. Harness-only changes took their coding agent from outside the Top 30 to the Top 5 on Terminal Bench 2.0. Same model, different harness, a 14-percentage-point improvement.

  2. OpenAI coined “harness engineering” — Their Codex team built ~1M lines of production code using agents in 5 months (~1/10th manual time). The lesson: the harness matters more than the model. 3-7 engineers, 1,500 merged PRs, 3.5 PRs per engineer per day.

  3. LangChain Skills hit 29% → 95% — Claude Code’s pass rate on LangChain ecosystem tasks jumped from 29% to 95% just by loading the right skill files. Not a model upgrade. Not fine-tuning. Markdown files loaded at the right time.


The Four Primitives of Deep Agents

Harrison Chase studied Claude Code, Manus, and OpenAI Deep Research and found they all share four architectural primitives. These are not LangChain inventions — they’re patterns extracted from what already works.

1. Detailed System Prompt

Long, complex prompts with specific tool-usage instructions and few-shot examples. Not a single sentence — hundreds to thousands of lines.

“Without these system prompts, the agents would not be nearly as deep. Prompting matters still!” — Harrison Chase

Claude Code’s system prompt is ~2,000+ lines. It specifies when to use each tool, how to handle edge cases, and behavioral guidelines for dozens of scenarios.

2. Planning Tool

A tool that lets the agent create and track a plan. The key insight: Claude Code’s “Todo list” tool is functionally a no-op — it does nothing except serve as context engineering. The act of writing a plan forces structured thinking.

“Planning (even if done via a no-op tool call) is a big component of that.” — Harrison Chase
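The pattern fits in a few lines. This is a minimal sketch, assuming a tool-calling loop elsewhere; `write_todos` is an illustrative name, not Claude Code's actual tool API:

```python
class TodoTool:
    """A planning 'tool' whose only real effect is context engineering:
    the plan text it returns lands back in the model's context."""

    def __init__(self):
        self.items = []

    def write_todos(self, todos):
        # No side effects beyond remembering the list: the value is in
        # forcing the model to articulate a structured plan.
        self.items = list(todos)
        return "\n".join(f"[ ] {t}" for t in todos)

tool = TodoTool()
plan = tool.write_todos(["reproduce the bug", "patch parser.py", "re-run tests"])
print(plan)
```

The returned checklist is the entire payoff: it re-enters the conversation as a tool result, anchoring the agent's subsequent steps.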

3. Sub-Agents

Isolated agent instances with their own context windows. The main agent delegates, receives condensed results, keeps its own context clean.

Four use cases:

  • Context preservation — multistep tasks that would clutter the main context
  • Specialization — domain-specific instructions and tools per subagent
  • Multi-model — cheaper/faster models for simpler subtasks
  • Parallelization — simultaneous execution to reduce latency

“If the subagent is doing a lot of exploratory work before coming with its final answer, the main agent still only gets the final result, not the 20 tool calls that produced it.”
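The context-isolation idea can be sketched in plain Python; the function names and the 20-step loop are hypothetical stand-ins, not the Deep Agents SDK's real subagent API:

```python
def run_subagent(task):
    """An isolated 'context window': exploratory steps accumulate here
    and never reach the parent agent."""
    scratch = []
    for step in range(20):          # 20 exploratory tool calls, all private
        scratch.append(f"tool call {step} while working on {task!r}")
    # Only a condensed final result escapes the subagent.
    return f"RESULT for {task!r}: finished after {len(scratch)} internal steps"

main_context = ["user: migrate the auth module"]
main_context.append(run_subagent("locate auth config"))
# The parent context grew by one line, not twenty.
print(len(main_context), main_context[-1])
```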

4. Filesystem

Acts as shared workspace and external memory. Agents can write notes, save intermediate results, and maintain state across long-running tasks. This is essential for managing accumulated context.

Manus uses the filesystem heavily for memory management. Claude Code uses git worktrees to give each subagent an isolated copy of the repository.
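A minimal sketch of filesystem-as-memory, using a temp directory as the shared workspace (the note-taking helpers are illustrative, not any framework's API):

```python
import tempfile
from pathlib import Path

workspace = Path(tempfile.mkdtemp())   # stand-in for the agent's working dir

def save_note(name, content):
    """Persist intermediate findings outside the context window."""
    path = workspace / f"{name}.md"
    path.write_text(content)
    return str(path)

def recall(name):
    """Re-read a note later, even after the conversation has been compacted."""
    return (workspace / f"{name}.md").read_text()

save_note("findings", "auth config lives in settings/auth.py")
print(recall("findings"))
```

Because the note lives on disk rather than in the context window, it survives compaction and can be handed to a subagent by path.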


LangChain’s Deep Agents SDK

What It Is

An open-source Python package (pip install deepagents) built on LangGraph. Provides the four primitives as composable building blocks, plus middleware for cross-cutting concerns.

Skills: Progressive Disclosure

The newest primitive. Skills are markdown files loaded dynamically — only their descriptions sit in context until the agent decides one is relevant, then the full instructions load.

.deepagents/skills/
  deploy/SKILL.md
  review-pr/SKILL.md

This solves the documented problem that giving too many tools to an agent degrades its performance. Skills are lazy-loaded capabilities.
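Progressive disclosure can be sketched as a two-pass loader. The directory layout mirrors the listing above; the function names and one-line-description convention are assumptions for illustration:

```python
import tempfile
from pathlib import Path

# Build a demo skills directory shaped like .deepagents/skills/.
root = Path(tempfile.mkdtemp())
(root / "deploy").mkdir()
(root / "deploy" / "SKILL.md").write_text(
    "Deploy the service to staging.\n\nFull runbook:\n1. build image\n2. push\n")

def skill_descriptions(skills_root):
    """Cheap pass: only each skill's first line sits in context."""
    return {p.parent.name: p.read_text().splitlines()[0]
            for p in skills_root.glob("*/SKILL.md")}

def load_skill(skills_root, name):
    """Expensive pass: full instructions load only once a skill is chosen."""
    return (skills_root / name / "SKILL.md").read_text()

print(skill_descriptions(root))          # always visible to the agent
print(len(load_skill(root, "deploy")))   # full text, loaded on demand
```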

Context Management

Three-tiered compression system:

  1. Tool result offloading — Responses >20K tokens get written to filesystem, replaced with a file path + 10-line preview
  2. Tool input truncation — At 85% capacity, older tool calls are truncated and replaced with file pointers
  3. Summarization — LLM generates structured summary (intent, artifacts, next steps), full conversation saved to filesystem as canonical record
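Tier 1 (tool result offloading) can be sketched as follows. The 20K-token threshold and 10-line preview come from the post; the 4-chars-per-token heuristic and the file naming are assumptions:

```python
import tempfile
from pathlib import Path

OFFLOAD_TOKENS = 20_000        # threshold from the Deep Agents post
CHARS_PER_TOKEN = 4            # rough chars-per-token heuristic (assumption)

def maybe_offload(result, workdir):
    """Replace an oversized tool result with a file pointer + 10-line preview."""
    if len(result) // CHARS_PER_TOKEN <= OFFLOAD_TOKENS:
        return result                          # small results pass through
    path = workdir / "tool_result_001.txt"     # illustrative naming scheme
    path.write_text(result)                    # canonical copy lives on disk
    preview = "\n".join(result.splitlines()[:10])
    return f"[offloaded to {path}]\n{preview}"

workdir = Path(tempfile.mkdtemp())
small = maybe_offload("ls output: 3 files", workdir)
big = maybe_offload("line\n" * 50_000, workdir)   # ~62K tokens, over threshold
print(small)
print(big.splitlines()[0][:14])
```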

Middleware Architecture

The harness engineering layer. Middleware intercepts agent behavior at key points:

| Middleware | What It Does |
| --- | --- |
| PreCompletionChecklistMiddleware | Forces verification pass before agent can exit |
| LocalContextMiddleware | Maps directory structure and tool availability on start |
| LoopDetectionMiddleware | Tracks per-file edit counts, nudges agent after N edits to same file |
| Time budgeting | Injects warnings to shift focus toward verification as time runs low |
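The loop-detection idea fits in a small class. The hook name `after_tool_call` and the edit threshold are illustrative, not the SDK's actual middleware interface:

```python
from collections import Counter

class LoopDetector:
    """Track per-file edit counts and nudge after N edits to the same file."""

    def __init__(self, max_edits=3):
        self.max_edits = max_edits
        self.edits = Counter()

    def after_tool_call(self, tool, path):
        # Only file edits count toward doom-loop detection.
        if tool != "edit_file":
            return None
        self.edits[path] += 1
        if self.edits[path] >= self.max_edits:
            return (f"You have edited {path} {self.edits[path]} times. "
                    f"Consider a different approach.")
        return None

detector = LoopDetector(max_edits=3)
nudges = [detector.after_tool_call("edit_file", "parser.py") for _ in range(3)]
print(nudges)   # first two calls return None, the third returns a nudge
```

The returned string would be injected into the agent's context as a system message, steering it away before it spends 10+ iterations on the same file.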

The Terminal Bench 2.0 Results

LangChain’s coding agent went from 52.8% (outside Top 30) to 66.5% (Top 5) on Terminal Bench 2.0 — 89 tasks across ML, debugging, and biology domains.

The model was held constant (gpt-5.2-codex). Only the harness changed.

What Moved the Needle

| Change | Impact |
| --- | --- |
| Build & Self-Verify Loop | 4-phase: Plan → Build → Verify → Fix. Forces test execution before completion. |
| Environment Context Delivery | Maps working directory, identifies tools on startup. Reduces context discovery failures. |
| Loop Detection | Tracks file edit counts, intervenes after repeated edits to same file (“doom loops” of 10+ iterations). |
| Reasoning Sandwich | xhigh reasoning for planning, high for implementation, xhigh for verification. 53.9% → 66.5%. |
| Time Budgeting | Nudges agent toward verification as deadline approaches. |

The “Reasoning Sandwich” result is particularly telling: xhigh reasoning everywhere scored 53.9% (timeout issues). High reasoning everywhere scored 63.6%. The sandwich scored 66.5%. More thinking isn’t always better — strategically allocated thinking is.
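The allocation policy is just a phase-to-effort map. The request shape below is a hypothetical sketch for illustration, not OpenAI's actual API schema:

```python
# "Reasoning Sandwich": heavy thinking at the ends, lighter in the middle.
SANDWICH = {"plan": "xhigh", "build": "high", "verify": "xhigh"}

def make_request(phase, prompt):
    """Attach the phase-appropriate reasoning effort to a model request."""
    return {"model": "gpt-5.2-codex",
            "reasoning_effort": SANDWICH[phase],
            "prompt": prompt}

for phase in ("plan", "build", "verify"):
    print(phase, "->", make_request(phase, "...")["reasoning_effort"])
```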


LangChain Skills: 29% → 95%

Released March 4, 2026. The headline number: Claude Code’s pass rate on LangChain ecosystem tasks jumped from 29% to 95% by loading skill files.

11 skills across three categories:

  • LangChain — core library patterns
  • LangGraph — stateful agent workflows
  • Deep Agents — primitives, middleware, filesystem

Installation:

npx skills add langchain-ai/langchain-skills --agent claude-code --skill '*' --yes --global

Skills are markdown files loaded via progressive disclosure. The npx skills CLI is maintained by Vercel Labs.


The Competitive Landscape

Claude Code

The reference implementation that Deep Agents was modeled after. Key architecture:

| Primitive | Claude Code Implementation |
| --- | --- |
| Detailed prompt | ~2,000+ line system prompt |
| Planning | TodoWrite tool (functional no-op for structured thinking) |
| Subagents | Isolated instances with independent context, auto-cleanup |
| Filesystem | Git worktrees per subagent, CLAUDE.md for persistent memory |
| Skills | .claude/skills/*.md with progressive disclosure |
| Context window | 200K+ tokens, intelligent on-demand file reading |

Verdict: Excels at complex, multi-file architectural work. Successfully completed a 23-file JWT auth migration that Cursor and Windsurf couldn’t handle.

OpenAI Agents SDK

Different category — a framework for building custom agents, not a ready-made harness.

| Strength | Weakness |
| --- | --- |
| Fastest path to working agent on OpenAI models | OpenAI models only — no Claude, Gemini, open-source |
| Built-in tracing and observability | No state persistence (no checkpointing) |
| Minimal code footprint (under 100 lines) | No native MCP support |
| 2-3 day learning curve | No native human-in-the-loop |

Cursor

Evolving from AI-enhanced IDE toward proactive agent platform:

  • Agent Mode: Autonomous plan-execute-verify loop
  • Automations (March 2026): Event-driven agents triggered by codebase changes, Slack messages, timers. Runs in cloud sandboxes with memory.
  • Rules system: .cursor/rules/*.mdc — project, user, team, and agent rules
  • Strongest autocomplete experience, but struggles with 40+ file architectural changes (~60-80K token effective limit)

LangGraph

The production leader for complex stateful agent workflows. Model-agnostic, built-in checkpointing, 1-2 week learning curve. Deep Agents is built on top of LangGraph.


Harness Engineering: The Full Definition

OpenAI coined the term based on their Codex experiment. Birgitta Bockeler (Thoughtworks) expanded it on Martin Fowler’s site.

Harness engineering = building systems around a model to optimize goals like task performance, token efficiency, and latency. It encompasses:

  • System prompt design
  • Tool choice and configuration
  • Execution flow and lifecycle
  • Middleware and hooks
  • Memory systems
  • Skills and progressive disclosure
  • Verification and self-check loops
  • Observability and tracing
  • Loop detection and recovery
  • Reasoning budget allocation

“The goal of a harness is to mold the inherently spiky intelligence of a model for tasks we care about.” — LangChain

“When the agent struggles, we treat it as a signal to improve the harness.” — Birgitta Bockeler, Thoughtworks

OpenAI’s Five Principles of Harness Engineering

  1. Intent-Driven Development — Engineers specify intent declaratively, not code directly
  2. Autonomous Agent Iteration — Agents open PRs, evaluate changes, iterate until criteria met
  3. Observability Integration — Agents use telemetry to monitor and debug their own work
  4. Architectural Constraint Enforcement — Mechanical rules maintain structural integrity
  5. Documentation as Machine-Readable Artifacts — Structured docs for agent consumption, not just humans

Context Rot and the “Dumb Zone”

Two critical research findings underpin this entire space:

Context rot (Chroma research): As token count increases, the model’s ability to accurately recall information decreases. Not a cliff — a gradient. Effective capacity is ~65% of claimed max.

The “dumb zone” (HumanLayer): In high-context regimes, there’s a measurable performance drop where models struggle to complete tasks. Subagents and context compression exist specifically to avoid entering this zone.

“Effective context management becomes critical to prevent context rot.” — Chester Curme & Mason Daugherty, LangChain
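The practical arithmetic behind these findings, assuming Chroma's ~65% figure (the function names are illustrative):

```python
def effective_capacity(claimed_max, factor=0.65):
    """Plan around ~65% of the advertised window (Chroma's estimate)."""
    return int(claimed_max * factor)

def in_dumb_zone(used_tokens, claimed_max):
    """Past the effective capacity, expect degraded recall and task failure."""
    return used_tokens > effective_capacity(claimed_max)

print(effective_capacity(200_000))        # 130000
print(in_dumb_zone(150_000, 200_000))     # True
```

On a nominal 200K window, an agent at 150K tokens is already well inside the zone that subagents and compression exist to avoid.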


What This Means in Practice

The Pattern Consensus

Five independent groups (Anthropic, OpenAI, LangChain, Manus, Cursor) converged on remarkably similar architectural patterns:

| Pattern | Claude Code | Deep Agents | OpenAI Codex | Manus | Cursor |
| --- | --- | --- | --- | --- | --- |
| Detailed system prompt | Yes | Yes | Yes | Yes | Rules system |
| Planning tool | TodoWrite | Planning tool | Intent specs | Yes | Agent Mode |
| Subagents | Yes | Yes | Multi-agent | Yes | Automations |
| Filesystem memory | CLAUDE.md + worktrees | FileSystem | Structured docs | Heavy use | .cursor/rules |
| Skills / progressive disclosure | .claude/skills/ | SKILL.md | — | — | .cursor/rules/*.mdc |
| Verification loop | Yes | PreCompletionChecklist | PR iteration | — | Agent Mode |
| Context compression | Compaction at 95% | 3-tier (offload, truncate, summarize) | — | Context engineering | — |

For Agent Builders

  1. Start with the harness, not the model. LangChain’s 14-point improvement came from harness changes alone. The model was constant.
  2. Implement verification loops. The single biggest harness improvement is forcing agents to test their own work before declaring done.
  3. Use reasoning strategically. The “reasoning sandwich” (xhigh reasoning for planning and verification, high for implementation) outperforms maximum reasoning everywhere.
  4. Detect doom loops. Track per-file edit counts. After N edits to the same file, inject “consider a different approach.”
  5. Load skills lazily. Progressive disclosure — descriptions in context, full instructions on demand — outperforms loading everything upfront.

For Teams Using AI Coding Agents

  1. Write CLAUDE.md / rules files. This is the highest-leverage harness investment. Project knowledge that loads automatically.
  2. Use subagents for complex work. Context isolation is not optional for long-running tasks.
  3. Install domain skills. The 29% → 95% result on LangChain tasks shows skills aren’t marginal — they’re transformative for domain-specific work.
  4. Set compaction earlier. 70-80% capacity, not 95%. By 95%, quality has already degraded.
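The last recommendation can be sketched as a trigger check; the 75% default is simply the midpoint of the 70-80% recommendation:

```python
def should_compact(used_tokens, max_tokens, threshold=0.75):
    """Fire compaction at 70-80% of the window rather than waiting for 95%."""
    return used_tokens / max_tokens >= threshold

# With a 200K window, compaction now fires around 150K tokens:
print(should_compact(150_000, 200_000))   # True
print(should_compact(100_000, 200_000))   # False
```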

The Tacit Angle

Harness engineering makes session memory even more critical than context engineering alone. Every middleware intervention, every subagent delegation, every context compression — these are decisions that reshape what the agent knows. The reasons behind those decisions live in sessions.

| Harness Component | Without Session Memory | With Session Memory |
| --- | --- | --- |
| Verification loops | Pass/fail, no learning | Why it failed, what was tried |
| Doom loop detection | Reset and retry | Pattern of what caused the loop |
| Context compression | Information permanently lost | Full history searchable |
| Skill loading | Stateless each time | Which skills helped, which didn’t |
| Subagent delegation | Results without process | Full reasoning chain preserved |

The more sophisticated the harness, the more valuable it becomes to persist what happens inside it.


Open Questions

  1. Harness portability: Can a harness tuned for one model transfer to another? LangChain says “harnesses require model-specific tuning.”
  2. Skill ecosystems: Will skills become a shared ecosystem (like npm for agents) or remain project-specific?
  3. Verification completeness: How do you verify that the verifier is correct? Turtles all the way down.
  4. Reasoning allocation: The “reasoning sandwich” works for coding — does it generalize to other domains?
  5. Harness complexity: At what point does harness engineering become over-engineering? Where’s the elbow?

Confidence Assessment

| Claim | Confidence |
| --- | --- |
| Deep agents share four common primitives | High — multi-source convergence across Claude Code, Manus, Deep Research |
| Harness changes matter more than model changes | High — Terminal Bench 2.0 evidence (same model, 14pt improvement) |
| The hierarchy (prompt ⊂ context ⊂ harness) is real | High — independent convergence from Anthropic, OpenAI, Thoughtworks |
| Skills with progressive disclosure improve performance | High — 29% → 95% measured result |
| Verification loops are the highest-leverage harness improvement | High — multiple sources cite this |
| The “reasoning sandwich” generalizes beyond coding | Medium — only tested on Terminal Bench 2.0 |
| Skills will become a shared ecosystem | Medium — early signals (Vercel Labs CLI) but unproven |
| Deep Agents SDK will see widespread adoption | Low — LangChain has history of adoption then criticism |

Sources & Provenance

Verifiable sources. Dates matter. Credibility assessed.

DOCS High credibility
December 2025

Deep Agents ↗

Harrison Chase · LangChain Blog

"Foundational post identifying four primitives shared by Claude Code, Manus, and Deep Research: detailed prompts, planning tools, subagents, and filesystem. 'Using an LLM to call tools in a loop is the simplest form of an agent' — deep agents enrich this with architectural primitives."

DOCS High credibility
February 2026

Improving Deep Agents with Harness Engineering ↗

LangChain · LangChain Blog

"Coding agent went from 52.8% to 66.5% on Terminal Bench 2.0 with harness-only changes. Key techniques: build-and-verify loop, environment context delivery, loop detection, reasoning sandwich (xhigh-high-xhigh), time budgeting."

DOCS High credibility
March 2026

LangChain Skills ↗

LangChain · LangChain Blog

"11 skills across LangChain, LangGraph, and Deep Agents. Claude Code pass rate on LangChain tasks: 29% → 95% with skills loaded. Skills use progressive disclosure — descriptions in context, full instructions loaded on demand."

DOCS High credibility
January 2026

Building Multi-Agent Applications with Deep Agents ↗

Sydney Runkle and Vivek Trivedy · LangChain Blog

"Two first-class primitives: subagents (context isolation) and skills (progressive disclosure). Decision matrix for when to use each. 'Multi-agent patterns don't have to be complicated.'"

DOCS High credibility
January 2026

Context Management for Deep Agents ↗

Chester Curme and Mason Daugherty · LangChain Blog

"Three-tiered context compression: tool result offloading (>20K tokens), tool input truncation (at 85% capacity), and summarization with filesystem backup. 'Effective context management becomes critical to prevent context rot.'"

DOCS High credibility
February 2026

Harness Engineering ↗

Birgitta Bockeler · Martin Fowler (Thoughtworks)

"Harness engineering is broader than prompt and context engineering — it encompasses constraints, verification mechanisms, and iterative feedback loops. 'When the agent struggles, we treat it as a signal to improve the harness.'"

DOCS High credibility
September 2025

Effective Context Engineering for AI Agents ↗

Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, Jeremy Hadfield · Anthropic Engineering

"Four context strategies: Write, Select, Compress, Isolate. 'Most agent failures are not model failures anymore — they are context failures.' Production evidence from Claude Code."

INDUSTRY High credibility
February 2026

The Importance of Agent Harness in 2026 ↗

Philipp Schmid · Personal Blog

"Model = CPU, Context Window = RAM, Agent Harness = Operating System, Agent = Application. The harness is where competitive differentiation lives, not the model."

NEWS Medium credibility
February 2026

OpenAI: Harness Engineering with Codex ↗

InfoQ

"OpenAI Codex team: 3-7 engineers, ~1M lines of code, 5 months, 1,500 merged PRs. Five principles: intent-driven development, autonomous iteration, observability integration, architectural constraint enforcement, documentation as machine-readable artifacts."

DOCS Medium credibility
2025

Context Engineering for AI Agents: Lessons from Building Manus ↗

Manus Team · Manus Blog

"Production lessons from building a deep agent: filesystem as primary memory layer, context compression strategies, and the importance of structured external state."

DOCS Medium credibility
2025

Context Rot Research ↗

Chroma · Chroma Research

"As token count increases, model accuracy for information recall decreases — a gradient, not a cliff. Effective capacity is roughly 65% of claimed maximum context window."

DOCS Medium credibility
2025

Context-Efficient Backpressure (The 'Dumb Zone') ↗

HumanLayer · HumanLayer Blog

"Documented measurable performance drop in high-context regimes — the 'dumb zone' where models struggle to complete tasks despite having relevant information in context."

INDUSTRY Medium credibility
2026

Cursor vs Windsurf vs Claude Code: The Honest Comparison ↗

DEV Community

"Claude Code: 200K+ token context, excels at multi-file architecture. Cursor: ~60-80K effective tokens, best autocomplete. Claude Code successfully completed 23-file JWT migration that others couldn't."

INDUSTRY Medium credibility
2026

LangGraph vs CrewAI vs OpenAI Agents SDK ↗

Particula · Particula Blog

"LangGraph: production leader for complex stateful workflows. OpenAI SDK: fastest to prototype but OpenAI-only. CrewAI: multi-agent orchestration. Deep Agents builds on LangGraph."

DOCS Medium credibility
2026

Terminal Bench 2.0 Leaderboard ↗

Terminal Bench · tbench.ai

"89-task benchmark across ML, debugging, and biology domains. LangChain's harness-engineered agent scored 66.5% (Top 5) with gpt-5.2-codex."

INDUSTRY Medium credibility
2026

The Emerging Harness Engineering Playbook ↗

Ignorance.ai

"Framework for thinking about harness engineering as a discipline: constraints, feedback loops, documentation, linters, lifecycle management, verification, and iteration cycles."

NEWS Medium credibility
March 2026

Cursor Automations ↗

TechCrunch

"Cursor's new Automations system: event-driven agents triggered by codebase changes, Slack messages, or timers. Runs in cloud sandboxes with persistent memory. Shift from reactive IDE to proactive agent platform."

INDUSTRY Medium credibility
2026

2025 Was Agents, 2026 Is Agent Harnesses ↗

Aakash Gupta · Medium

"Industry trend analysis: the shift from building agents to engineering the systems that make agents reliable. Harness engineering as the key differentiator."

INDUSTRY Low credibility
February 2026

What Are DeepAgents in LangChain? — A Comprehensive Guide ↗

QualityPoint Technologies · QualityPoint Blog

"Tutorial-level overview of Deep Agents primitives. Summarizes the four-primitive architecture for beginners."

DOCS Low credibility
2026

Debugging Deep Agents with LangSmith ↗

LangChain · LangChain Blog

"Observability patterns for deep agent debugging. LangSmith integration for traces, latency analysis, and token cost tracking."

INDUSTRY Low credibility
2026

Claude Code Architecture (Reverse Engineered) ↗

Substack

"Reverse-engineered analysis of Claude Code's architecture: subagents, worktrees, context handling, and system prompt structure."

INDUSTRY Low credibility
2026

A Mental Model for Claude Code ↗

Level Up Coding

"Conceptual framework for understanding Claude Code's skills, subagents, and plugin architecture."

DOCS Low credibility
2026

Agent Skills Specification ↗

Agent Skills · agentskills.io

"Emerging specification for portable agent skills across frameworks."

NEWS Low credibility
2026

Context Engineering is the New AI Moat ↗

StartupHub.ai · StartupHub.ai

"Video coverage of Harrison Chase's Sequoia Capital appearance discussing long-horizon agents and context engineering as competitive moat."