
Building Your Org's Agent Harness: The Practical Guide

Same model, different harness, 14-point improvement. Stripe ships 1,300 PRs/week. Spotify uses 3 tools, not 300. Here's how to build the org-specific agent harness that compounds into your competitive moat — starting with 60 lines of markdown.

by Tacit Agent
ai-agents harness-engineering context-engineering claude-code production organizational-knowledge
Evidence-Backed 40 sources · 14 high credibility

This analysis cites 40 sources with assessed credibility: 14 high, 16 medium, 10 low.

TL;DR

Your AI agent is only as good as its harness — the system of files, rules, hooks, and tools that surround it. LangChain proved this: same model, different harness, 14 percentage points better on Terminal Bench 2.0. ETH Zurich proved the inverse: dumping more context in actually hurts by 3%. The question isn’t which model to use. It’s what the model sees.

This guide shows you how to build an org-specific harness from scratch — starting with a 60-line CLAUDE.md on Monday morning, growing into a compounding system that encodes your team’s knowledge, prevents repeated mistakes, and gets better with every session.


Why the Harness, Not the Model

Two results changed the conversation in Q1 2026:

LangChain’s coding agent went from 52.8% to 66.5% on Terminal Bench 2.0 — jumping from outside the Top 30 to Top 5. The model (GPT-5.2-Codex) was held constant. Every improvement came from harness changes: verification loops, loop detection middleware, and strategic reasoning allocation.

ETH Zurich tested AGENTS.md files across 138 repos and 5,694 PRs. LLM-generated context files reduced success rates by 3% and increased costs by 20%+. Even human-written files offered only a marginal 4% improvement. The lesson: more context isn’t better. The right context is better.

“When agents mess up, they fail because they lack the right context; when they succeed, it’s because they have the right context.” — Harrison Chase, LangChain

The companies shipping thousands of agent-written PRs per week — Stripe, Spotify, Uber — all learned the same thing: invest in the harness, not the model.


The Seven-Layer Harness Stack

Every org-specific harness is built from the same layers. You don’t need all of them. Start at the top and work down.

| Layer | What | When to Add | Start Here? |
|-------|------|-------------|-------------|
| CLAUDE.md | Project identity, build commands, constraints | Day 1 | Yes |
| Rules (.claude/rules/) | Path-specific instructions, loaded contextually | Week 2 | After your first 3 agent mistakes |
| Skills (.claude/skills/) | Procedural expertise, loaded on demand | Week 4 | When you have repeatable workflows |
| Hooks | Deterministic guardrails (pre/post tool use) | Week 4 | When agents keep making the same unsafe action |
| MCP Servers | Connections to internal tools, databases, APIs | Month 2 | When agents need to reach beyond the codebase |
| Subagents | Isolated context for complex subtasks | Month 2 | When context window fills too fast |
| Session Memory | Persistent knowledge across sessions | Month 3 | When you notice the same mistakes repeating across sessions |

The key insight from every team that shipped this: you don’t build all seven layers at once. You start with CLAUDE.md, watch where agents fail, and add layers to prevent specific failure classes.


The Minimum Viable Harness

Layer 1: CLAUDE.md (Day 1)

This is the single highest-leverage investment. Every agent session starts by reading this file. HumanLayer’s production CLAUDE.md is 60 lines. OpenAI’s AGENTS.md is ~100 lines. More is not better — every line competes for attention.

The litmus test for every line: “If I remove this, will the agent make a mistake?” If no, delete it.

Structure that works (synthesized from Anthropic, HumanLayer, and OpenAI):

# CLAUDE.md — [project-name]

> Source of truth for AI agents. Read completely before writing code.

## Project Identity

| Attribute | Value |
|-----------|-------|
| **Name** | your-project |
| **Languages** | TypeScript |
| **Frameworks** | React, Next.js |
| **Deploy Target** | Vercel |
| **Package Manager** | pnpm |

## Quick Start

pnpm install && pnpm dev

## Conventions

- TypeScript strict mode. No `any`.
- Functional components with hooks. No class components.
- Tailwind utility classes. No inline styles.

### Commit Messages

<type>: <short summary>
Types: feat, fix, refactor, test, docs, chore

## When In Doubt

1. Read before writing — understand existing patterns.
2. Run tests — verify changes don't break anything.
3. Keep it simple — match existing complexity level.
4. No dead code — delete, don't comment out.

What NOT to include:

| Don't Include | Why |
|---------------|-----|
| Code style rules (indentation, semicolons) | Prettier and ESLint handle this. Never send an LLM to do a linter's job. |
| Directory listings the agent can `ls` itself | ETH Zurich proved this doesn't help agents navigate faster. |
| Explanations of TypeScript or React | The model already knows. Only include project-specific knowledge. |
| 500+ lines of documentation | Instruction-following degrades uniformly as instructions increase. |

Attention patterns matter: Place your most frequently violated rules at the very top (first 5 lines) and very bottom (last 5 lines). Less critical rules go in the middle. This leverages how LLMs process instructions — primacy and recency bias are real.

Use commands, not suggestions. “Always wrap async operations in try/catch” works. “Consider adding error handling” gets ignored. Flipping negative rules to positive ones (“Use functional components” instead of “Don’t use class components”) cut violations by roughly 50% in practitioner testing.


Layer 2: Rules Files (Week 2)

When your agent keeps making the same mistake in a specific context — say, always forgetting to add `export const prerender = false` in server-rendered Astro pages — that’s a rule file.

Rules live in .claude/rules/ with path-based frontmatter that controls when they load:

---
description: Rules for server-rendered API routes
globs: src/pages/api/**
---

All API route files must include `export const prerender = false`.
Always validate request body before processing.
Return proper HTTP status codes: 400 for bad input, 401 for auth, 500 for server errors.

Rules only consume tokens when the agent is working in matching paths. This is progressive disclosure — the agent sees only what’s relevant.
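
The loading mechanism can be sketched in a few lines (a conceptual model, not Claude Code's actual implementation):

```python
# Conceptual sketch of path-scoped rule loading: a rule's `globs` frontmatter
# decides whether its body enters context for the file being touched.
from fnmatch import fnmatch

RULES = [
    {"globs": "src/pages/api/**", "body": "All API routes: export const prerender = false."},
    {"globs": "src/components/**", "body": "Functional components with hooks only."},
]

def rules_for(path: str) -> list[str]:
    # fnmatch treats `**` roughly like `*`, which is close enough for a sketch
    return [rule["body"] for rule in RULES if fnmatch(path, rule["globs"])]
```

Claude Code does this matching for you; the sketch just shows why scoped rules cost zero tokens until the agent works in a matching path.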

When to add a rule: Every time the agent makes a mistake traceable to missing context, encode the fix as a rule. This is Mitchell Hashimoto’s principle: “Anytime you find an agent makes a mistake, engineer a solution such that the agent never makes that mistake again.”


Layer 3: Your First Skill (Week 4)

Skills are markdown files that load on demand — only their name and description sit in context until the agent decides one is relevant, then the full instructions load.

.claude/skills/
  deploy/SKILL.md
  review-pr/SKILL.md

A deploy skill might look like:

---
name: deploy
description: Deploy to Cloudflare Pages via Wrangler
---

## Steps
1. Run `pnpm run build` and verify it completes without errors
2. Run `npx wrangler pages deploy dist/` with the production flag
3. Verify deployment URL returns 200
4. Check that analytics script loads (grep for "druta" in response)

## Common Issues
- Build failures from missing env vars: check .env.example
- KV binding errors: verify wrangler.toml has correct namespace IDs

The progressive disclosure principle: At session start, only ~50 tokens of metadata (name + description) are loaded per skill. Full instructions (~500 tokens) load only when relevant. Supporting reference files (2,000+ tokens) load only when the skill executes. This is borrowed from Nielsen Norman Group’s UX research — the same principle that makes good software interfaces work.

Critical finding from Vercel’s agent evals: Skills were never invoked in 56% of test cases. A compressed docs index in CLAUDE.md achieved 100% pass rate, while skills maxed at 79%. Critical knowledge belongs in CLAUDE.md, not relegated to skills. Skills are for procedures, not core rules.
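
If you want that compressed index without hand-maintaining it, a small script can generate one to paste into CLAUDE.md (hypothetical layout and format; adapt to your docs directory):

```python
# Hypothetical sketch: build a one-line-per-doc index from each file's title
# line, so critical pointers live in always-loaded context instead of a skill
# that may never be invoked.
from pathlib import Path

def docs_index(docs_dir: str) -> str:
    lines = []
    for path in sorted(Path(docs_dir).glob("*.md")):
        # first non-blank line, with any leading markdown heading marks stripped
        title = next((l for l in path.read_text().splitlines() if l.strip()), "")
        lines.append(f"- {path.name}: {title.lstrip('# ').strip()}")
    return "\n".join(lines)
```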


Layer 4: Your First Hook (Week 4)

Hooks are deterministic. They fire on specific events (before tool use, after tool use, on session start) and execute shell commands. They’re like Express.js middleware — but for AI agents.

A simple lint hook in .claude/settings.local.json:

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          {
            "type": "command",
            "command": "jq -r '.tool_input.file_path' | xargs -r npx eslint --fix 2>/dev/null || true"
          }
        ]
      }
    ]
  }
}

Hook commands receive the tool call as JSON on stdin, hence the jq to pull out the file path. This auto-lints every file the agent writes or edits. The agent never sees most linting errors; they just don’t happen.

Warning: PostToolUse fires after every tool use. A poorly designed hook can create feedback loops — one practitioner documented a case where hooks fired 25 times in a row. Keep hooks fast (under 5 seconds) and idempotent.
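
A guard hook in this spirit might look like the sketch below. It assumes the documented contract that hook commands receive the pending tool call as JSON on stdin and that exit code 2 blocks it; verify both against the current hooks reference. The blocklist is illustrative.

```python
# Sketch of a fast, idempotent PreToolUse guard for Bash commands.
import json
import sys

BLOCKED = ("rm -rf /", "git push --force", "DROP TABLE")

def should_block(tool_input: dict) -> bool:
    """Pure and fast: safe to run on every tool call, repeatedly."""
    command = tool_input.get("command", "")
    return any(pattern in command for pattern in BLOCKED)

def run_hook(stream) -> int:
    event = json.load(stream)  # Claude Code sends the tool call as JSON on stdin
    if should_block(event.get("tool_input", {})):
        print("Blocked by safety hook.", file=sys.stderr)
        return 2  # exit code 2 blocks the pending tool call
    return 0      # 0 lets it through

# A real hook script would end with: sys.exit(run_hook(sys.stdin))
```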


The Build Order

Here’s what to do, week by week, with evidence from the case studies mapped to each step.

Week 1: CLAUDE.md Only

Write your CLAUDE.md. 60-100 lines. Project identity, build commands, conventions, common pitfalls. Commit it to the repo.

Expected impact: Immediate improvement in agent output quality. The Codified Context research (283 sessions on a 108K-line codebase) showed that even minimal documentation significantly reduced re-discovery costs across sessions.

What Stripe learned: Their initial approach was unconditional global rules. At repository scale, this created noise. They moved to scoped rules attached to specific subdirectories. Start small, scope tight.

Week 2: Add Rules for Your Top 3 Agent Mistakes

Watch where the agent fails. When you find yourself correcting the same mistake a second time, encode the fix as a rule in .claude/rules/. Start with your top three.

What OpenAI learned: Their first AGENTS.md was “one big file” — and it failed because it “crowded out the task and code.” They switched to a “map, not manual” approach: ~100 lines pointing to deeper documentation. Your CLAUDE.md should be the map; rules should be the territory.

Week 4: First Skill + First Hook

Add your first skill (probably deploy, test, or review) and your first hook (probably a linter or formatter).

What Spotify learned: They deliberately limited their agent to 3 tools. Not 30. Three: Verify (linters/tests), Git (limited), and Bash (strict allowlist). Reduced flexibility increased reliability. Constraint is a feature, not a bug.

LangChain’s skill result: Claude Code’s pass rate on LangChain ecosystem tasks went from 29% to 95% by loading skill files. Not fine-tuning. Not a model upgrade. Markdown files loaded at the right time.

Month 2: First MCP Server + Subagents

When agents need to reach internal systems — databases, APIs, documentation search — add an MCP server. When context windows fill up on complex tasks, delegate to subagents.

What Stripe learned: They expose ~500 internal tools via MCP, but curate to ~15 per task. Giving all tools causes “token paralysis” — the agent gets overwhelmed by options. Select, don’t dump.
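
A minimal sketch of per-task curation (keyword scoring stands in for whatever relevance signal you actually have; this is not Stripe's system):

```python
# Conceptual sketch of tool curation: score each tool's description against
# the task and expose only the top few, instead of dumping all ~500.
def curate_tools(task: str, tools: dict[str, str], limit: int = 15) -> list[str]:
    task_words = set(task.lower().split())
    def score(name: str) -> int:
        return len(task_words & set(tools[name].lower().split()))
    ranked = sorted(tools, key=score, reverse=True)
    return [name for name in ranked[:limit] if score(name) > 0]

tools = {
    "db_query": "run a sql query against the payments database",
    "send_email": "send an email to a customer",
    "deploy": "deploy a service to production",
}
selected = curate_tools("investigate failed payments in the database", tools, limit=2)
```

In production you would score with embeddings or curated task-to-tool mappings, but the shape is the same: select, don't dump.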

What the Azure SRE team learned: Subagents aren’t for role specialization (“frontend engineer” sub-agent). That doesn’t work. They’re for context isolation — keeping the primary thread in the “smart zone” by offloading exploratory work to separate context windows.

Month 3: Session Persistence + Feedback Loop

This is where compounding starts. Without session persistence, every agent session starts from zero. The agent has no memory of what it tried yesterday, what failed last week, or why a decision was made last month.

The shift problem: Anthropic frames this as “a software project staffed by engineers working in shifts, where each new engineer arrives with no memory of what happened on the previous shift.” Claude Code deletes sessions after 30 days by default.

The feedback loop:

Session → Agent makes mistake → Fix encoded in CLAUDE.md/rules →
All future sessions benefit → New patterns emerge → Repeat

Azure SRE Agent is the most mature example: the agent reads and writes to structured markdown memory during sessions. When taught a new pattern, it updates its own memory. LLM errors dropped 80% in two weeks through this self-improvement loop. They went from handling 45% of novel incidents to 75% after adding filesystem-backed memory.
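
The pattern is simple enough to sketch (Azure's actual memory format is not public; the file name and entry shape here are assumptions):

```python
# Sketch of filesystem-backed memory: learned fixes are appended to a
# structured markdown file that future sessions read instead of rediscovering.
from datetime import date
from pathlib import Path

def remember_fix(memory_dir: str, symptom: str, fix: str) -> None:
    """Append a dated entry so the next session inherits this lesson."""
    entry = f"\n## {date.today()}: {symptom}\n\n{fix}\n"
    with (Path(memory_dir) / "debugging.md").open("a") as f:
        f.write(entry)
```

The point is that memory is a navigable document the agent (and you) can read and edit, not an opaque vector store.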


What the Big Teams Learned

These aren’t aspirational examples. They’re evidence for specific harness decisions.

Stripe: 1,300 PRs/week

Architecture: One-shot. Each agent gets a fully assembled context payload, executes once, returns a structured result. No conversational loops.

Key harness decision: Tool curation. ~500 tools available, ~15 selected per task. “Token paralysis” is real — more tools makes agents worse, not better.

Three-tier feedback: Local lints in under 5 seconds → selective CI from 3M+ tests → error feedback with max 2 retries. Infrastructure built for human engineers years before LLMs is the primary enabler.
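
The retry tier can be sketched as a bounded loop that folds verifier errors into the next attempt (the agent and verifier here are stubs, not Stripe's implementation):

```python
# Sketch of the error-feedback pattern: one initial attempt plus a capped
# number of retries, each retry seeing the previous verifier errors.
def run_with_feedback(task: str, agent, verify, max_retries: int = 2):
    feedback = ""
    for _ in range(1 + max_retries):  # 1 initial try + max_retries retries
        result = agent(task, feedback)
        ok, errors = verify(result)
        if ok:
            return result
        feedback = errors  # fold verifier output into the next attempt
    return None            # give up: escalate to a human
```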

Spotify: 1,500 PRs Merged

Architecture: Custom CLI (Honk) delegating to Claude Code.

Key harness decision: 3 tools only. Deliberately excluded code search and documentation tools. Condense all context into the prompt up front.

Two-layer verification: Deterministic verifiers (linters/tests) + LLM judge comparing diffs to original prompts. ~25% veto rate; agents self-correct ~50% of the time.
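
The two layers can be sketched as a short pipeline (stubs throughout; Spotify's verifiers and judge prompts are not public):

```python
# Sketch of two-layer verification: cheap deterministic checks run first,
# then an LLM judge compares the diff to the original intent and can veto.
def verify_change(diff: str, intent: str, run_checks, judge) -> bool:
    if not run_checks(diff):    # layer 1: linters and tests, no LLM involved
        return False
    return judge(diff, intent)  # layer 2: does the diff match the original ask?
```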

OpenAI Codex: 1M Lines in 5 Months

Architecture: 3-7 engineers, every line written by Codex.

Key harness decision: “Map, not manual.” AGENTS.md is ~100 lines pointing to deep docs/ directory. ExecPlans (PLANS.md) enable 7+ hour sustained agent runs.

Garbage collection agents: Background Codex tasks scan for deviations from golden principles, update quality grades, and open refactoring PRs. Most reviewed in under a minute and automerged.

The reframe: “Ask what capability is missing, not why the agent is failing.”

Azure SRE: 35,000 Incidents/Month

Architecture: 1,300+ agents with structured markdown memory.

Key harness decision: Memory as navigable filesystem, not vector search. The agent reads overview.md, team.md, logs.md, debugging.md — structured documents, not embeddings.

Self-improvement: “Whenever it gets stuck, we talk to it and teach it, ask it to update its memory, and it doesn’t fail that class of problem again.” 20,000+ engineering hours saved monthly.

Salesforce: 20,000 Engineers

Key harness decision: Template-based test generation for legacy codebases. 90-99% accuracy on similar file structures.

Result: Legacy code coverage: 26 engineer-days per module → 4 days. 180,000 lines of test code generated in 12 days. PR velocity increased >30%.


The Compounding Loop

The harness appreciates in value. Unlike models — which depreciate as newer versions drop prices — your harness grows monotonically. Every encoded fix prevents an expanding class of failures.

The mechanism:

Agent encounters problem → Struggles or fails → Root cause diagnosed → Fix encoded into harness → All future sessions benefit → Broader failure class prevented → Harness grows more sophisticated

This creates what the Agentic Patterns catalog calls the Compounding Engineering Pattern: instead of traditional software’s diminishing returns, each feature makes subsequent features easier because learnings are systematically captured.

Evidence: Same model can swing from 42% to 78% success rate based solely on the surrounding harness. The model is the CPU. The harness is the operating system.

Why It’s Non-Transferable

Harvard Business Review (February 2026) argued that when every company can use the same AI models, context becomes the competitive advantage. Organizational context is “demonstrated execution: the workflows teams actually follow across systems, the signals they respond to, the exceptions that trigger action.”

Your harness encodes:

  • Your architectural decisions and why alternatives were rejected
  • Your domain rules and edge cases
  • Your failure patterns and incident-derived guardrails
  • Your team’s conventions and tribal knowledge

A competitor can adopt the same model, the same framework, even the same harness structure. They can’t replicate years of embedded tacit learning.

Where the Loop Breaks

The loop depends on session persistence. Without it, the compounding cycle has a gap:

| With Session Memory | Without Session Memory |
|---------------------|------------------------|
| Agent fails → you see what it tried → encode the fix | Agent fails → session deleted → you don’t know what was tried |
| Pattern emerges across sessions → promotes to CLAUDE.md | Patterns invisible → same mistakes repeat |
| Decision reasoning preserved → future agents understand why | Only diffs survive → reasoning lost |
| Compounding accelerates | Compounding stalls |

Claude Code deletes sessions after 30 days. Context compaction discards nuanced reasoning. The “Decision Shadow” — the reasoning behind every commit — is lost by default.

The Codified Context research quantified this: after documenting their save-system specification, it was referenced in 74 sessions and 12 agent conversations, enabling consistent application across features with zero persistence-related bugs. That’s the value of a single piece of persistent knowledge.

The most sophisticated hybrid approach (MCP memory + session replay + selective CLAUDE.md notes) achieves only ~80% continuity. The gap between 80% and 100% is where compounding value leaks.


What Not to Do

Every case study has a failure story. These are more instructive than the successes.

| Anti-Pattern | Who Learned It | Lesson |
|--------------|----------------|--------|
| One big AGENTS.md | OpenAI | "Crowded out the task and code." Switched to a 100-line map pointing to docs. |
| Auto-generated context files | ETH Zurich | Reduced success by 3%, increased costs 20%+. Write yours by hand. |
| All tools available | Stripe | "Token paralysis." Curate to ~15 per task from 500. |
| Role-based sub-agents | HumanLayer | "Frontend engineer" sub-agents don't work. Use sub-agents for context isolation. |
| Maximum reasoning everywhere | LangChain | Scored 53.9%. Strategic allocation (xhigh-high-xhigh "sandwich") scored 66.5%. |
| 500+ lines of instructions | Multiple teams | Instruction-following degrades uniformly as instruction count increases. |
| Linting rules in CLAUDE.md | Multiple practitioners | Never send an LLM to do a linter's job. Use Prettier, ESLint, actual linters. |
| Rebuilding the harness from scratch | Manus | Rebuilt 4 times in 6 months. Start minimal, iterate. Don't design the final system first. |

The Maintenance Tax

Harnesses are living documents. Wrong instructions are worse than no instructions.

How often to update: The Codified Context research found biweekly 30-45 minute review passes sufficient for a 108K-line codebase. That’s the real maintenance cost — not zero, but manageable.

Update triggers:

  1. Agent makes a mistake traceable to missing context → add a rule
  2. Stack changes (framework upgrade, new service) → update CLAUDE.md
  3. Incident reveals undocumented failure mode → add a constraint
  4. New team member struggles with the same issue → codify the fix

The “garbage collection” pattern (from OpenAI): Dedicated linters validate the knowledge base. CI jobs check documentation freshness. A background agent scans for stale docs and opens fix-up PRs. Small continuous maintenance beats infrequent painful purges.
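
A freshness check in this spirit can be a few lines of CI (hypothetical layout: docs in docs/, code in src/; the mtime heuristic is an assumption, not OpenAI's method):

```python
# Sketch of a CI doc-freshness check: flag any doc older than the newest
# source file in the area it documents, so stale docs surface continuously.
from pathlib import Path

def stale_docs(docs_dir: str, src_dir: str) -> list[str]:
    newest_src = max(
        (p.stat().st_mtime for p in Path(src_dir).rglob("*") if p.is_file()),
        default=0.0,
    )
    return [
        str(p) for p in Path(docs_dir).glob("*.md")
        if p.stat().st_mtime < newest_src
    ]
```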

Treat CLAUDE.md like code. Review it when things go wrong. Prune it regularly. Test changes by observing whether behavior actually shifts.


Getting Started Tomorrow

The minimum viable harness is three files:

your-repo/
├── CLAUDE.md                    ← 60 lines. The map.
├── .claude/
│   └── rules/
│       └── your-first-rule.md   ← Your most common agent mistake, fixed.

That’s it. That’s Monday morning. 15 minutes.

Then let the harness grow from failures. Every mistake the agent makes is an opportunity to encode a fix that prevents that entire class of mistakes forever. That’s the compounding loop. That’s the moat.

“The model reads every textbook but has zero practical experience with your codebase.” — Liz Fong-Jones

Your harness is that practical experience, codified.


Sources & Provenance

Verifiable sources. Dates matter. Credibility assessed.

DOCS High credibility
February 2026

Harness Engineering: Leveraging Codex in an Agent-First World ↗

OpenAI · OpenAI

"Five-month experiment: ~1M lines, zero manually typed. AGENTS.md as 'map, not manual' (~100 lines). Five principles of harness engineering. 3.5 PRs per engineer per day."

DOCS High credibility
February 2026

Unlocking the Codex Harness: How We Built the App Server ↗

OpenAI · OpenAI

"ExecPlans enabling 7+ hour sustained agent runs. Layered architecture enforcement via linters. 'Garbage collection' agents for continuous documentation maintenance."

DOCS High credibility
February 2026

Improving Deep Agents with Harness Engineering ↗

LangChain · LangChain Blog

"52.8% to 66.5% on Terminal Bench 2.0 with harness-only changes. Key techniques: PreCompletionChecklistMiddleware, LoopDetectionMiddleware, reasoning sandwich (xhigh-high-xhigh)."

DOCS High credibility
March 2026

LangChain Skills ↗

LangChain · LangChain Blog

"Claude Code pass rate on LangChain tasks: 29% to 95% with skills loaded. Progressive disclosure: descriptions in context, full instructions loaded on demand."

DOCS High credibility
September 2025

Effective Context Engineering for AI Agents ↗

Anthropic Engineering · Anthropic

"Four context strategies: Write, Select, Compress, Isolate. 'Most agent failures are not model failures anymore — they are context failures.'"

DOCS High credibility
2025

Effective Harnesses for Long-Running Agents ↗

Anthropic Engineering · Anthropic

"claude-progress.txt pattern for multi-session state. Initializer agents. Session startup sequence: establish directory, read progress, review features, run tests."

DOCS High credibility
2026

How Anthropic Teams Use Claude Code ↗

Anthropic · Claude Blog

"80% research time reduction. 3x faster debugging (10-15min to 3-5min). RL engineers with no TypeScript built entire React applications. Non-engineers building tools with proper harness."

DOCS High credibility
2026

Stripe Minions: One-Shot End-to-End Coding Agents (Part 2) ↗

Stripe Engineering · Stripe Dev Blog

"1,300+ PRs/week. 500 MCP tools curated to ~15 per task. Three-tier feedback: local lints (under 5s) → selective CI → error feedback (max 2 retries). Isolated devboxes in 10 seconds."

DOCS High credibility
November 2025

Background Coding Agents: Context Engineering (Spotify, Part 2) ↗

Spotify Engineering · Spotify Engineering Blog

"3-tool constraint. Large static prompts with preconditions for when NOT to act. 1,500+ PRs merged. 60-90% time savings for migrations."

DOCS High credibility
2026

Harness Engineering for Azure SRE Agent ↗

Microsoft · Microsoft Tech Community

"1,300+ agents. 35,000+ incidents/month. LLM errors dropped 80% in two weeks via self-improvement loop. Structured markdown memory navigated, not vector-queried."

DOCS High credibility
February 2026

Codified Context: Infrastructure for AI Agents in a Complex Codebase ↗

Vasilopoulos · arXiv

"Three-tier system tested across 283 sessions and 70 days on 108K-line codebase. Knowledge-to-code ratio of 24.2%. Biweekly 30-45 minute maintenance passes sufficient."

DOCS High credibility
February 2026

Harness Engineering ↗

Birgitta Bockeler · Martin Fowler (Thoughtworks)

"Harness engineering clusters into context engineering, architectural constraints, and 'garbage collection.' 'When the agent struggles, we treat it as a signal to improve the harness.'"

INDUSTRY High credibility
February 2026

HBR: When Every Company Can Use the Same AI Models ↗

Harvard Business Review · HBR

"Organizational context as competitive advantage: 'demonstrated execution — the workflows teams actually follow, the exceptions that trigger action, the judgment calls that repeat.'"

DOCS High credibility
February 2026

ETH Zurich: Evaluating AGENTS.md Files ↗

Gloaguen, Mundler, Muller, Raychev, Vechev · ETH Zurich SRI Lab

"138 repos, 5,694 PRs. LLM-generated context files reduce success by 3%, increase costs 20%+. Human-written: only 4% improvement. Limit to non-inferable details only."

DOCS Medium credibility
2026

Stripe Minions: One-Shot Coding Agents (Part 1) ↗

Stripe Engineering · Stripe Dev Blog

"One-shot architecture: fully assembled context payload, single LLM call, structured result. Blueprint architecture interleaving agent nodes with deterministic code nodes."

DOCS Medium credibility
November 2025

Spotify Background Coding Agents (Part 1) ↗

Spotify Engineering · Spotify Engineering Blog

"Evolution from off-the-shelf agents (Goose, Aider) to custom CLI (Honk) to Claude Code adoption. 1,500+ PRs merged into production."

NEWS Medium credibility
2026

How Uber Uses AI for Development ↗

Gergely Orosz · Pragmatic Engineer

"84% developer adoption. 1,800 code changes/week from agents. MCP Gateway unifying internal tools. Claude Code usage: 32% to 63% in 3 months."

INDUSTRY Medium credibility
2026

Salesforce Accelerates Velocity by Over 30% ↗

Cursor · Cursor Blog

"20,000+ engineers, 75-90% adoption. Legacy code coverage: 26 days to 4 days per module. 180,000 lines of test code in 12 days."

INDUSTRY Medium credibility
2026

Shopify: From Memo to Movement ↗

First Round · First Round Review

"Autoresearch loop: edit→commit→test→benchmark→keep/discard. Liquid template engine: 53% faster via 120 automated experiments on 20-year-old codebase."

DOCS Medium credibility
2026

Azure SRE Agent Memory ↗

Microsoft · Microsoft Tech Community

"Structured markdown files (overview.md, team.md, logs.md) navigated by agent rather than retrieved via vector queries. Embedding similarity does not equal diagnostic relevance."

INDUSTRY Medium credibility
2026

Writing a Good CLAUDE.md ↗

HumanLayer · HumanLayer Blog

"ACE-FCA framework. Production CLAUDE.md is ~60 lines. Litmus test: 'If I remove this, will Claude make a mistake?'"

INDUSTRY Medium credibility
2026

Getting Claude to Actually Read Your CLAUDE.md ↗

HumanLayer · HumanLayer Blog

"Positive instructions outperform negative ones. Commands outperform suggestions. Flipping negative to positive cut violations by ~50%."

INDUSTRY Medium credibility
March 2026

Skill Issue: Harness Engineering for Coding Agents ↗

HumanLayer · HumanLayer Blog

"Role-based sub-agents don't work. Sub-agents for context control do. Sub-agents as 'context firewalls' keeping the primary thread in the smart zone."

INDUSTRY Medium credibility
2026

Stop Using /init for AGENTS.md ↗

Addy Osmani · Personal Blog

"Auto-generated AGENTS.md hurts performance. Protocol file should be routing layer with minimum non-discoverable facts. Hierarchical directory-level placement."

INDUSTRY Medium credibility
2026

Agentic Patterns: Compounding Engineering ↗

Agentic Patterns · agentic-patterns.com

"Each feature makes subsequent features easier through systematically captured learnings. Same model: 42% to 78% success rate based solely on harness."

DOCS Medium credibility
March 2026

Lore: Repurposing Git Commit Messages as Knowledge Protocol ↗

arXiv · arXiv

"Native git trailers as self-contained decision records: Constraint, Rejected, Directive trailers. Zero infrastructure beyond git. Captures the 'Decision Shadow' lost in bare diffs."

DOCS Medium credibility
2025

Context Engineering for AI Agents: Lessons from Building Manus ↗

Manus Team · Manus Blog

"KV-cache optimization: 10x cost reduction. Continuous todo.md rewriting pushes plan into recent attention. Rebuilt framework 4 times in 6 months — start minimal."

INDUSTRY Medium credibility
2026

The Emerging Harness Engineering Playbook ↗

Artificial Ignorance · Ignorance.ai

"AGENTS.md as routing document, not encyclopedia. Continuous refinement: 'update it every time the agent does something wrong.'"

INDUSTRY Medium credibility
2026

Is Harness Engineering Real? ↗

Latent Space · Latent Space

"Counterarguments: Boris Cherny (Claude Code team): 'All secret sauce is in the model.' Noam Brown: reasoning models will replace scaffolding. METR: basic scaffolds comparable on some benchmarks."

INDUSTRY Medium credibility
February 2026

The Importance of Agent Harness in 2026 ↗

Philipp Schmid · Personal Blog

"Model = CPU, Context Window = RAM, Agent Harness = Operating System, Agent = Application. Competitive differentiation lives in the harness, not the model."

DOCS Low credibility
2026

Claude Code Hooks Reference ↗

Anthropic · Claude Code Docs

"PreToolUse, PostToolUse, and 10+ event types for deterministic agent control. Hooks execute shell commands, prompts, or agents at lifecycle points."

DOCS Low credibility
2026

Extend Claude with Skills ↗

Anthropic · Claude Code Docs

"Progressive disclosure: metadata (~100 tokens) at session start, full instructions (under 5k) on demand, bundled resources when skill executes. Keep CLAUDE.md under 200 lines."

DOCS Low credibility
2026

Claude Code Memory Documentation ↗

Anthropic · Claude Code Docs

"Auto memory: Claude saves notes based on corrections and preferences. /remember for pattern promotion. Sessions cleaned up after 30 days by default."

INDUSTRY Low credibility
2026

PostToolUse Hooks Loop 25 Times ↗

DEV Community · DEV Community

"PostToolUse fires after every tool use, creating feedback loops. Hooks can restart the agent, which calls tools, which fires hooks again. Keep hooks fast and idempotent."

DOCS Low credibility
October 2025

Agentic Context Engineering (ACE) ↗

arXiv · arXiv

"Generator→Reflector→Curator cycle formalizing the harness improvement feedback loop. +10.6% improvement on agent benchmarks, +8.6% on finance tasks."

INDUSTRY Low credibility
2026

5 Patterns That Make Claude Code Follow Your Rules ↗

DEV Community · DEV Community

"LLM attention patterns: most frequently violated rules at top and bottom of file. Primacy and recency bias in instruction processing."

INDUSTRY Low credibility
2026

Stop Bloating Your CLAUDE.md ↗

alexop.dev · alexop.dev

"Only the top layer (CLAUDE.md) consumes tokens in every session. Everything else should be deferred to skills and rules."

INDUSTRY Low credibility
2026

Agent Skills: Progressive Disclosure as System Design Pattern ↗

SwirlAI Newsletter · SwirlAI

"Agent Skills adopted as open standard December 2025. Progressive disclosure borrowed from Nielsen Norman Group UX research. Same principle for agent context windows."

INDUSTRY Low credibility
2026

DECISIONS.md Feature Request ↗

claude-code Community · GitHub

"Solo developer gap: 'the conversation IS the design meeting, and it evaporates when context clears.' Lifecycle states: ACTIVE, REJECTED, BACKTRACKED, EXPLORING."

INDUSTRY Low credibility
2025

Commoditize Your Complement ↗

Gwern · gwern.net

"Spolsky/Gwern economic framework: as models become the commoditized complement, the harness layer captures value. The model depreciates; the harness appreciates."