Building Your Org's Agent Harness: The Practical Guide
Same model, different harness, 14-point improvement. Stripe ships 1,300 PRs/week. Spotify uses 3 tools, not 300. Here's how to build the org-specific agent harness that compounds into your competitive moat — starting with 60 lines of markdown.
TL;DR
Your AI agent is only as good as its harness — the system of files, rules, hooks, and tools that surround it. LangChain proved this: same model, different harness, 14 percentage points better on Terminal Bench 2.0. ETH Zurich proved the inverse: dumping more context in actually hurts by 3%. The question isn’t which model to use. It’s what the model sees.
This guide shows you how to build an org-specific harness from scratch — starting with a 60-line CLAUDE.md on Monday morning, growing into a compounding system that encodes your team’s knowledge, prevents repeated mistakes, and gets better with every session.
Why the Harness, Not the Model
Two results changed the conversation in Q1 2026:
LangChain’s coding agent went from 52.8% to 66.5% on Terminal Bench 2.0 — jumping from outside the Top 30 to Top 5. The model (GPT-5.2-Codex) was held constant. Every improvement came from harness changes: verification loops, loop detection middleware, and strategic reasoning allocation.
ETH Zurich tested AGENTS.md files across 138 repos and 5,694 PRs. LLM-generated context files reduced success rates by 3% and increased costs by 20%+. Even human-written files offered only a marginal 4% improvement. The lesson: more context isn’t better. The right context is better.
“When agents mess up, they fail because they lack the right context; when they succeed, it’s because they have the right context.” — Harrison Chase, LangChain
The companies shipping thousands of agent-written PRs per week — Stripe, Spotify, Uber — all learned the same thing: invest in the harness, not the model.
The Seven-Layer Harness Stack
Every org-specific harness is built from the same layers. You don’t need all of them. Start at the top and work down.
| Layer | What | When to Add | Start Here? |
|---|---|---|---|
| CLAUDE.md | Project identity, build commands, constraints | Day 1 | Yes |
| Rules (.claude/rules/) | Path-specific instructions, loaded contextually | Week 2 | After your first 3 agent mistakes |
| Skills (.claude/skills/) | Procedural expertise, loaded on demand | Week 4 | When you have repeatable workflows |
| Hooks | Deterministic guardrails (pre/post tool use) | Week 4 | When agents keep making the same unsafe action |
| MCP Servers | Connections to internal tools, databases, APIs | Month 2 | When agents need to reach beyond the codebase |
| Subagents | Isolated context for complex subtasks | Month 2 | When context window fills too fast |
| Session Memory | Persistent knowledge across sessions | Month 3 | When you notice the same mistakes repeating across sessions |
The key insight from every team that shipped this: you don’t build all seven layers at once. You start with CLAUDE.md, watch where agents fail, and add layers to prevent specific failure classes.
The Minimum Viable Harness
Layer 1: CLAUDE.md (Day 1)
This is the single highest-leverage investment. Every agent session starts by reading this file. HumanLayer’s production CLAUDE.md is 60 lines. OpenAI’s AGENTS.md is ~100 lines. More is not better — every line competes for attention.
The litmus test for every line: “If I remove this, will the agent make a mistake?” If no, delete it.
Structure that works (synthesized from Anthropic, HumanLayer, and OpenAI):
```markdown
# CLAUDE.md — [project-name]

> Source of truth for AI agents. Read completely before writing code.

## Project Identity

| Attribute | Value |
|-----------|-------|
| **Name** | your-project |
| **Languages** | TypeScript |
| **Frameworks** | React, Next.js |
| **Deploy Target** | Vercel |
| **Package Manager** | pnpm |

## Quick Start

pnpm install && pnpm dev

## Conventions

- TypeScript strict mode. No `any`.
- Functional components with hooks. No class components.
- Tailwind utility classes. No inline styles.

### Commit Messages

<type>: <short summary>

Types: feat, fix, refactor, test, docs, chore

## When In Doubt

1. Read before writing — understand existing patterns.
2. Run tests — verify changes don't break anything.
3. Keep it simple — match existing complexity level.
4. No dead code — delete, don't comment out.
```
What NOT to include:
| Don’t Include | Why |
|---|---|
| Code style rules (indentation, semicolons) | Prettier and ESLint handle this. Never send an LLM to do a linter’s job. |
| Directory listings the agent can `ls` itself | ETH Zurich proved this doesn’t help agents navigate faster |
| Explanations of TypeScript or React | The model already knows. Only include project-specific knowledge. |
| 500+ lines of documentation | Instruction-following degrades uniformly as instructions increase |
Attention patterns matter: Place your most frequently violated rules at the very top (first 5 lines) and very bottom (last 5 lines). Less critical rules go in the middle. This leverages how LLMs process instructions — primacy and recency bias are real.
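If you log which rules the agent actually violates, placement can even be automated. A hypothetical sketch (the violation counts and the `order_rules` helper are illustrative, not from any cited team):

```python
# Hypothetical helper: order CLAUDE.md rules to exploit primacy/recency bias.
# Assumes you track a violation count per rule; all names are illustrative.

def order_rules(rules: dict[str, int], hot_slots: int = 5) -> list[str]:
    """Split the most-violated rules between the top and bottom of the file."""
    ranked = sorted(rules, key=rules.get, reverse=True)
    hot, cold = ranked[: 2 * hot_slots], ranked[2 * hot_slots :]
    head = hot[0::2]  # 1st, 3rd, 5th most violated -> top of file
    tail = hot[1::2]  # 2nd, 4th, 6th most violated -> bottom of file
    return head + cold + tail
```

The point is the shape, not the code: hot rules go to the edges of the file, everything else sits in the middle.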
Use commands, not suggestions. “Always wrap async operations in try/catch” works. “Consider adding error handling” gets ignored. Flipping negative rules to positive ones (“Use functional components” instead of “Don’t use class components”) cut violations by roughly 50% in practitioner testing.
Layer 2: Rules Files (Week 2)
When your agent keeps making the same mistake in a specific context — say, always forgetting to add `export const prerender = false` in server-rendered Astro pages — that’s a rule file.
Rules live in .claude/rules/ with path-based frontmatter that controls when they load:
```markdown
---
description: Rules for server-rendered API routes
globs: src/pages/api/**
---

All API route files must include `export const prerender = false`.
Always validate request body before processing.
Return proper HTTP status codes: 400 for bad input, 401 for auth, 500 for server errors.
```
Rules only consume tokens when the agent is working in matching paths. This is progressive disclosure — the agent sees only what’s relevant.
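The loading mechanism can be pictured as a simple glob match. A sketch (illustrative only, not Claude Code's actual implementation; note that Python's `fnmatch` lets `*` cross `/` separators, unlike strict gitignore-style globs):

```python
# Illustrative sketch of path-scoped rule loading: a rule's globs decide
# whether its body enters the agent's context for the current file.
from fnmatch import fnmatch

RULES = {
    "api-routes.md": {"globs": ["src/pages/api/**"], "body": "api-rules"},
    "components.md": {"globs": ["src/components/**"], "body": "component-rules"},
}

def rules_for(path: str) -> list[str]:
    """Return bodies of rules whose globs match the file being edited."""
    return [
        rule["body"]
        for rule in RULES.values()
        if any(fnmatch(path, g) for g in rule["globs"])
    ]
```

Working in `src/pages/api/` loads the API rule; working anywhere else costs zero tokens for it.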
When to add a rule: Every time the agent makes a mistake traceable to missing context, encode the fix as a rule. This is Mitchell Hashimoto’s principle: “Anytime you find an agent makes a mistake, engineer a solution such that the agent never makes that mistake again.”
Layer 3: Your First Skill (Week 4)
Skills are markdown files that load on demand — only their name and description sit in context until the agent decides one is relevant, then the full instructions load.
```
.claude/skills/
  deploy/SKILL.md
  review-pr/SKILL.md
```
A deploy skill might look like:
```markdown
---
name: deploy
description: Deploy to Cloudflare Pages via Wrangler
---

## Steps

1. Run `pnpm run build` and verify it completes without errors
2. Run `npx wrangler pages deploy dist/` with the production flag
3. Verify deployment URL returns 200
4. Check that analytics script loads (grep for "druta" in response)

## Common Issues

- Build failures from missing env vars: check .env.example
- KV binding errors: verify wrangler.toml has correct namespace IDs
```
The progressive disclosure principle: At session start, only ~50 tokens of metadata (name + description) are loaded per skill. Full instructions (~500 tokens) load only when relevant. Supporting reference files (2,000+ tokens) load only when the skill executes. This is borrowed from Nielsen Norman Group’s UX research — the same principle that makes good software interfaces work.
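The tiers can be sketched as a data structure. A minimal illustration (the `Skill` class and its fields are hypothetical, not Claude Code internals):

```python
# Hypothetical sketch of progressive disclosure: only tier-1 metadata is
# always in context; deeper tiers load on demand.
from dataclasses import dataclass, field

@dataclass
class Skill:
    name: str
    description: str          # tier 1: ~50 tokens, always in context
    instructions: str = ""    # tier 2: loaded when the skill looks relevant
    resources: list[str] = field(default_factory=list)  # tier 3: on execution

def session_context(skills: list[Skill]) -> str:
    """At session start the agent sees only names and descriptions."""
    return "\n".join(f"{s.name}: {s.description}" for s in skills)

def invoke(skill: Skill) -> str:
    """Only on invocation do full instructions and resources enter context."""
    return skill.instructions + "\n" + "\n".join(skill.resources)
```

Ten skills cost roughly ten short lines at session start; the expensive content is paid for only when used.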
Critical finding from Vercel’s agent evals: Skills were never invoked in 56% of test cases. A compressed docs index in CLAUDE.md achieved 100% pass rate, while skills maxed at 79%. Critical knowledge belongs in CLAUDE.md, not relegated to skills. Skills are for procedures, not core rules.
Layer 4: Your First Hook (Week 4)
Hooks are deterministic. They fire on specific events (before tool use, after tool use, on session start) and execute shell commands. They’re like Express.js middleware — but for AI agents.
A simple lint hook in `.claude/settings.local.json`. Hook commands receive the tool’s input as JSON on stdin, so the file path is pulled out with `jq` rather than a template variable:

```json
{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          {
            "type": "command",
            "command": "jq -r '.tool_input.file_path' | xargs npx eslint --fix 2>/dev/null || true"
          }
        ]
      }
    ]
  }
}
```
This auto-lints every file the agent writes or edits. The agent never sees linting errors. They just don’t happen.
Warning: PostToolUse fires after every tool use. A poorly designed hook can create feedback loops — one practitioner documented a case where hooks fired 25 times in a row. Keep hooks fast (under 5 seconds) and idempotent.
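One defense is to make the hook itself idempotent: skip the expensive command when the file is unchanged since the last run. A sketch of such a guard, assuming your hook command calls a small wrapper script (the state-file location is an arbitrary choice, not a Claude Code convention):

```python
# Idempotency guard for a PostToolUse hook (illustrative, not a Claude Code
# API): a no-op on unchanged files breaks hook -> edit -> hook feedback loops.
import hashlib
import json
import pathlib

STATE = pathlib.Path(".claude/.hook-hashes.json")

def should_run(file_path: str, state: pathlib.Path = STATE) -> bool:
    """Return False when file content is unchanged since the last hook run."""
    digest = hashlib.sha256(pathlib.Path(file_path).read_bytes()).hexdigest()
    seen = json.loads(state.read_text()) if state.exists() else {}
    if seen.get(file_path) == digest:
        return False  # unchanged since last run: skip the command
    seen[file_path] = digest
    state.parent.mkdir(parents=True, exist_ok=True)
    state.write_text(json.dumps(seen))
    return True
```

The wrapper runs the linter only when `should_run` returns `True`, so a hook that rewrites a file cannot retrigger itself forever.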
The Build Order
Here’s what to do, week by week, with evidence from the case studies mapped to each step.
Week 1: CLAUDE.md Only
Write your CLAUDE.md. 60-100 lines. Project identity, build commands, conventions, common pitfalls. Commit it to the repo.
Expected impact: Immediate improvement in agent output quality. The Codified Context research (283 sessions on a 108K-line codebase) showed that even minimal documentation significantly reduced re-discovery costs across sessions.
What Stripe learned: Their initial approach was unconditional global rules. At repository scale, this created noise. They moved to scoped rules attached to specific subdirectories. Start small, scope tight.
Week 2: Add Rules for Your Top 3 Agent Mistakes
Watch where the agent fails. Take the first three mistakes you find yourself correcting more than once and encode them as rules in `.claude/rules/`.
What OpenAI learned: Their first AGENTS.md was “one big file” — and it failed because it “crowded out the task and code.” They switched to a “map, not manual” approach: ~100 lines pointing to deeper documentation. Your CLAUDE.md should be the map; rules should be the territory.
Week 4: First Skill + First Hook
Add your first skill (probably deploy, test, or review) and your first hook (probably a linter or formatter).
What Spotify learned: They deliberately limited their agent to 3 tools. Not 30. Three: Verify (linters/tests), Git (limited), and Bash (strict allowlist). Reduced flexibility increased reliability. Constraint is a feature, not a bug.
LangChain’s skill result: Claude Code’s pass rate on LangChain ecosystem tasks went from 29% to 95% by loading skill files. Not fine-tuning. Not a model upgrade. Markdown files loaded at the right time.
Month 2: First MCP Server + Subagents
When agents need to reach internal systems — databases, APIs, documentation search — add an MCP server. When context windows fill up on complex tasks, delegate to subagents.
What Stripe learned: They expose ~500 internal tools via MCP, but curate to ~15 per task. Giving all tools causes “token paralysis” — the agent gets overwhelmed by options. Select, don’t dump.
What the Azure SRE team learned: Subagents aren’t for role specialization (“frontend engineer” sub-agent). That doesn’t work. They’re for context isolation — keeping the primary thread in the “smart zone” by offloading exploratory work to separate context windows.
Month 3: Session Persistence + Feedback Loop
This is where compounding starts. Without session persistence, every agent session starts from zero. The agent has no memory of what it tried yesterday, what failed last week, or why a decision was made last month.
The shift problem: Anthropic frames this as “a software project staffed by engineers working in shifts, where each new engineer arrives with no memory of what happened on the previous shift.” Claude Code deletes sessions after 30 days by default.
The feedback loop:
```
Session → Agent makes mistake → Fix encoded in CLAUDE.md/rules →
All future sessions benefit → New patterns emerge → Repeat
```
Azure SRE Agent is the most mature example: the agent reads and writes to structured markdown memory during sessions. When taught a new pattern, it updates its own memory. LLM errors dropped 80% in two weeks through this self-improvement loop. They went from handling 45% of novel incidents to 75% after adding filesystem-backed memory.
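The same filesystem-backed idea fits a single repo in a few lines. A minimal sketch, assuming lessons are appended to per-topic markdown files (the file layout is illustrative, not the Azure SRE Agent's actual schema):

```python
# Minimal sketch of filesystem-backed session memory: append dated lessons
# to per-topic markdown files, concatenate them at session start.
import datetime
import pathlib

MEMORY_DIR = pathlib.Path(".claude/memory")

def record_lesson(topic: str, lesson: str, root: pathlib.Path = MEMORY_DIR) -> None:
    """Append a dated lesson to <topic>.md so future sessions can read it."""
    root.mkdir(parents=True, exist_ok=True)
    stamp = datetime.date.today().isoformat()
    with (root / f"{topic}.md").open("a") as f:
        f.write(f"- {stamp}: {lesson}\n")

def session_briefing(root: pathlib.Path = MEMORY_DIR) -> str:
    """Concatenate all memory files for injection at session start."""
    return "\n".join(
        f"## {p.stem}\n{p.read_text()}" for p in sorted(root.glob("*.md"))
    )
```

Plain markdown files keep the memory human-reviewable, which is the property the Azure team leaned on.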
What the Big Teams Learned
These aren’t aspirational examples. They’re evidence for specific harness decisions.
Stripe: 1,300 PRs/week
Architecture: One-shot. Each agent gets a fully assembled context payload, executes once, returns a structured result. No conversational loops.
Key harness decision: Tool curation. ~500 tools available, ~15 selected per task. “Token paralysis” is real — more tools makes agents worse, not better.
Three-tier feedback: Local lints in under 5 seconds → selective CI from 3M+ tests → error feedback with max 2 retries. Infrastructure built for human engineers years before LLMs is the primary enabler.
Spotify: 1,500 PRs Merged
Architecture: Custom CLI (Honk) delegating to Claude Code.
Key harness decision: 3 tools only. Deliberately excluded code search and documentation tools. Condense all context into the prompt up front.
Two-layer verification: Deterministic verifiers (linters/tests) + LLM judge comparing diffs to original prompts. ~25% veto rate; agents self-correct ~50% of the time.
OpenAI Codex: 1M Lines in 5 Months
Architecture: 3-7 engineers, every line written by Codex.
Key harness decision: “Map, not manual.” AGENTS.md is ~100 lines pointing to deep docs/ directory. ExecPlans (PLANS.md) enable 7+ hour sustained agent runs.
Garbage collection agents: Background Codex tasks scan for deviations from golden principles, update quality grades, and open refactoring PRs. Most reviewed in under a minute and automerged.
The reframe: “Ask what capability is missing, not why the agent is failing.”
Azure SRE: 35,000 Incidents/Month
Architecture: 1,300+ agents with structured markdown memory.
Key harness decision: Memory as navigable filesystem, not vector search. The agent reads overview.md, team.md, logs.md, debugging.md — structured documents, not embeddings.
Self-improvement: “Whenever it gets stuck, we talk to it and teach it, ask it to update its memory, and it doesn’t fail that class of problem again.” 20,000+ engineering hours saved monthly.
Salesforce: 20,000 Engineers
Key harness decision: Template-based test generation for legacy codebases. 90-99% accuracy on similar file structures.
Result: Legacy code coverage: 26 engineer-days per module → 4 days. 180,000 lines of test code generated in 12 days. PR velocity increased >30%.
The Compounding Loop
The harness appreciates in value. Unlike models — which depreciate as newer versions drop prices — your harness grows monotonically. Every encoded fix prevents an expanding class of failures.
The mechanism:
```
Agent encounters problem → Struggles or fails → Root cause diagnosed →
Fix encoded into harness → All future sessions benefit →
Broader failure class prevented → Harness grows more sophisticated
```
This creates what the Agentic Patterns catalog calls the Compounding Engineering Pattern: instead of traditional software’s diminishing returns, each feature makes subsequent features easier because learnings are systematically captured.
Evidence: Same model can swing from 42% to 78% success rate based solely on the surrounding harness. The model is the CPU. The harness is the operating system.
Why It’s Non-Transferable
Harvard Business Review (February 2026) argued that when every company can use the same AI models, context becomes the competitive advantage. Organizational context is “demonstrated execution: the workflows teams actually follow across systems, the signals they respond to, the exceptions that trigger action.”
Your harness encodes:
- Your architectural decisions and why alternatives were rejected
- Your domain rules and edge cases
- Your failure patterns and incident-derived guardrails
- Your team’s conventions and tribal knowledge
A competitor can adopt the same model, the same framework, even the same harness structure. They can’t replicate years of embedded tacit learning.
Where the Loop Breaks
The loop depends on session persistence. Without it, the compounding cycle has a gap:
| With Session Memory | Without Session Memory |
|---|---|
| Agent fails → you see what it tried → encode the fix | Agent fails → session deleted → you don’t know what was tried |
| Pattern emerges across sessions → promotes to CLAUDE.md | Patterns invisible → same mistakes repeat |
| Decision reasoning preserved → future agents understand why | Only diffs survive → reasoning lost |
| Compounding accelerates | Compounding stalls |
Claude Code deletes sessions after 30 days. Context compaction discards nuanced reasoning. The “Decision Shadow” — the reasoning behind every commit — is lost by default.
The Codified Context research quantified this: after documenting their save-system specification, it was referenced in 74 sessions and 12 agent conversations, enabling consistent application across features with zero persistence-related bugs. That’s the value of a single piece of persistent knowledge.
The most sophisticated hybrid approach (MCP memory + session replay + selective CLAUDE.md notes) achieves only ~80% continuity. The gap between 80% and 100% is where compounding value leaks.
What Not to Do
Every case study has a failure story. These are more instructive than the successes.
| Anti-Pattern | Who Learned It | Lesson |
|---|---|---|
| One big AGENTS.md | OpenAI | “Crowded out the task and code.” Switched to 100-line map pointing to docs. |
| Auto-generated context files | ETH Zurich | Reduced success by 3%, increased costs 20%+. Write yours by hand. |
| All tools available | Stripe | “Token paralysis.” Curate to ~15 per task from 500. |
| Role-based sub-agents | HumanLayer | ”Frontend engineer” sub-agents don’t work. Use sub-agents for context isolation. |
| Maximum reasoning everywhere | LangChain | Scored 53.9%. Strategic allocation (xhigh-high-xhigh “sandwich”) scored 66.5%. |
| 500+ lines of instructions | Multiple teams | Instruction-following degrades uniformly as instruction count increases. |
| Linting rules in CLAUDE.md | Multiple practitioners | Never send an LLM to do a linter’s job. Use Prettier, ESLint, actual linters. |
| Rebuilding harness from scratch | Manus | Rebuilt 4 times in 6 months. Start minimal, iterate. Don’t design the final system first. |
The Maintenance Tax
Harnesses are living documents. Wrong instructions are worse than no instructions.
How often to update: The Codified Context research found biweekly 30-45 minute review passes sufficient for a 108K-line codebase. That’s the real maintenance cost — not zero, but manageable.
Update triggers:
- Agent makes a mistake traceable to missing context → add a rule
- Stack changes (framework upgrade, new service) → update CLAUDE.md
- Incident reveals undocumented failure mode → add a constraint
- New team member struggles with the same issue → codify the fix
The “garbage collection” pattern (from OpenAI): Dedicated linters validate the knowledge base. CI jobs check documentation freshness. A background agent scans for stale docs and opens fix-up PRs. Small continuous maintenance beats infrequent painful purges.
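The freshness check is small enough to sketch directly. A minimal version a CI job might run (the 30-day threshold is an assumption, not a number from the OpenAI write-up):

```python
# Illustrative doc-freshness check for CI: flag harness files that haven't
# been touched in N days so someone reviews or prunes them.
import pathlib
import time

def stale_docs(paths: list[str], max_age_days: int = 30) -> list[str]:
    """Return the paths whose last modification is older than the cutoff."""
    cutoff = time.time() - max_age_days * 86400
    return [p for p in paths if pathlib.Path(p).stat().st_mtime < cutoff]
```

Fail the CI job, or open a fix-up PR, for anything the function returns.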
Treat CLAUDE.md like code. Review it when things go wrong. Prune it regularly. Test changes by observing whether behavior actually shifts.
Getting Started Tomorrow
The minimum viable harness is three files:
```
your-repo/
├── CLAUDE.md                    ← 60 lines. The map.
└── .claude/
    └── rules/
        └── your-first-rule.md   ← Your most common agent mistake, fixed.
```
That’s it. That’s Monday morning. 15 minutes.
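If you like, the scaffold can be scripted. A sketch with placeholder contents (write the real CLAUDE.md and rule by hand; auto-generated context files hurt, per the ETH Zurich result):

```python
# Scaffold the minimum viable harness. File contents are placeholders --
# the real ones should be written by hand for your project.
import pathlib

def scaffold(repo: str) -> None:
    root = pathlib.Path(repo)
    (root / ".claude" / "rules").mkdir(parents=True, exist_ok=True)
    (root / "CLAUDE.md").write_text(
        "# CLAUDE.md\n> Source of truth for AI agents.\n"
    )
    (root / ".claude" / "rules" / "first-rule.md").write_text(
        "---\ndescription: your most common agent mistake\nglobs: src/**\n---\n"
    )
```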
Then let the harness grow from failures. Every mistake the agent makes is an opportunity to encode a fix that prevents that entire class of mistakes forever. That’s the compounding loop. That’s the moat.
“The model reads every textbook but has zero practical experience with your codebase.” — Liz Fong-Jones
Your harness is that practical experience, codified.
Sources & Provenance
Verifiable sources. Dates matter. Credibility assessed.
- **Harness Engineering: Leveraging Codex in an Agent-First World** — OpenAI. "Five-month experiment: ~1M lines, zero manually typed. AGENTS.md as 'map, not manual' (~100 lines). Five principles of harness engineering. 3.5 PRs per engineer per day."
- **Unlocking the Codex Harness: How We Built the App Server** — OpenAI. "ExecPlans enabling 7+ hour sustained agent runs. Layered architecture enforcement via linters. 'Garbage collection' agents for continuous documentation maintenance."
- **Improving Deep Agents with Harness Engineering** — LangChain Blog. "52.8% to 66.5% on Terminal Bench 2.0 with harness-only changes. Key techniques: PreCompletionChecklistMiddleware, LoopDetectionMiddleware, reasoning sandwich (xhigh-high-xhigh)."
- **LangChain Skills** — LangChain Blog. "Claude Code pass rate on LangChain tasks: 29% to 95% with skills loaded. Progressive disclosure: descriptions in context, full instructions loaded on demand."
- **Effective Context Engineering for AI Agents** — Anthropic Engineering. "Four context strategies: Write, Select, Compress, Isolate. 'Most agent failures are not model failures anymore — they are context failures.'"
- **Effective Harnesses for Long-Running Agents** — Anthropic Engineering. "claude-progress.txt pattern for multi-session state. Initializer agents. Session startup sequence: establish directory, read progress, review features, run tests."
- **How Anthropic Teams Use Claude Code** — Claude Blog (Anthropic). "80% research time reduction. 3x faster debugging (10-15min to 3-5min). RL engineers with no TypeScript built entire React applications. Non-engineers building tools with proper harness."
- **Stripe Minions: One-Shot End-to-End Coding Agents (Part 2)** — Stripe Dev Blog. "1,300+ PRs/week. 500 MCP tools curated to ~15 per task. Three-tier feedback: local lints (under 5s) → selective CI → error feedback (max 2 retries). Isolated devboxes in 10 seconds."
- **Background Coding Agents: Context Engineering (Spotify, Part 2)** — Spotify Engineering Blog. "3-tool constraint. Large static prompts with preconditions for when NOT to act. 1,500+ PRs merged. 60-90% time savings for migrations."
- **Harness Engineering for Azure SRE Agent** — Microsoft Tech Community. "1,300+ agents. 35,000+ incidents/month. LLM errors dropped 80% in two weeks via self-improvement loop. Structured markdown memory navigated, not vector-queried."
- **Codified Context: Infrastructure for AI Agents in a Complex Codebase** — Vasilopoulos, arXiv. "Three-tier system tested across 283 sessions and 70 days on 108K-line codebase. Knowledge-to-code ratio of 24.2%. Biweekly 30-45 minute maintenance passes sufficient."
- **Harness Engineering** — Birgitta Bockeler, Martin Fowler (Thoughtworks). "Harness engineering clusters into context engineering, architectural constraints, and 'garbage collection.' 'When the agent struggles, we treat it as a signal to improve the harness.'"
- **HBR: When Every Company Can Use the Same AI Models** — Harvard Business Review. "Organizational context as competitive advantage: 'demonstrated execution — the workflows teams actually follow, the exceptions that trigger action, the judgment calls that repeat.'"
- **ETH Zurich: Evaluating AGENTS.md Files** — Gloaguen, Mundler, Muller, Raychev, Vechev, ETH Zurich SRI Lab. "138 repos, 5,694 PRs. LLM-generated context files reduce success by 3%, increase costs 20%+. Human-written: only 4% improvement. Limit to non-inferable details only."
- **Stripe Minions: One-Shot Coding Agents (Part 1)** — Stripe Dev Blog. "One-shot architecture: fully assembled context payload, single LLM call, structured result. Blueprint architecture interleaving agent nodes with deterministic code nodes."
- **Spotify Background Coding Agents (Part 1)** — Spotify Engineering Blog. "Evolution from off-the-shelf agents (Goose, Aider) to custom CLI (Honk) to Claude Code adoption. 1,500+ PRs merged into production."
- **How Uber Uses AI for Development** — Gergely Orosz, Pragmatic Engineer. "84% developer adoption. 1,800 code changes/week from agents. MCP Gateway unifying internal tools. Claude Code usage: 32% to 63% in 3 months."
- **Salesforce Accelerates Velocity by Over 30%** — Cursor Blog. "20,000+ engineers, 75-90% adoption. Legacy code coverage: 26 days to 4 days per module. 180,000 lines of test code in 12 days."
- **Shopify: From Memo to Movement** — First Round Review. "Autoresearch loop: edit→commit→test→benchmark→keep/discard. Liquid template engine: 53% faster via 120 automated experiments on 20-year-old codebase."
- **Azure SRE Agent Memory** — Microsoft Tech Community. "Structured markdown files (overview.md, team.md, logs.md) navigated by agent rather than retrieved via vector queries. Embedding similarity does not equal diagnostic relevance."
- **Writing a Good CLAUDE.md** — HumanLayer Blog. "ACE-FCA framework. Production CLAUDE.md is ~60 lines. Litmus test: 'If I remove this, will Claude make a mistake?'"
- **Getting Claude to Actually Read Your CLAUDE.md** — HumanLayer Blog. "Positive instructions outperform negative ones. Commands outperform suggestions. Flipping negative to positive cut violations by ~50%."
- **Skill Issue: Harness Engineering for Coding Agents** — HumanLayer Blog. "Role-based sub-agents don't work. Sub-agents for context control do. Sub-agents as 'context firewalls' keeping the primary thread in the smart zone."
- **Stop Using /init for AGENTS.md** — Addy Osmani. "Auto-generated AGENTS.md hurts performance. Protocol file should be routing layer with minimum non-discoverable facts. Hierarchical directory-level placement."
- **Agentic Patterns: Compounding Engineering** — agentic-patterns.com. "Each feature makes subsequent features easier through systematically captured learnings. Same model: 42% to 78% success rate based solely on harness."
- **Lore: Repurposing Git Commit Messages as Knowledge Protocol** — arXiv. "Native git trailers as self-contained decision records: Constraint, Rejected, Directive trailers. Zero infrastructure beyond git. Captures the 'Decision Shadow' lost in bare diffs."
- **Context Engineering for AI Agents: Lessons from Building Manus** — Manus Blog. "KV-cache optimization: 10x cost reduction. Continuous todo.md rewriting pushes plan into recent attention. Rebuilt framework 4 times in 6 months — start minimal."
- **The Emerging Harness Engineering Playbook** — Artificial Ignorance. "AGENTS.md as routing document, not encyclopedia. Continuous refinement: 'update it every time the agent does something wrong.'"
- **Is Harness Engineering Real?** — Latent Space. "Counterarguments: Boris Cherny (Claude Code team): 'All secret sauce is in the model.' Noam Brown: reasoning models will replace scaffolding. METR: basic scaffolds comparable on some benchmarks."
- **The Importance of Agent Harness in 2026** — Philipp Schmid. "Model = CPU, Context Window = RAM, Agent Harness = Operating System, Agent = Application. Competitive differentiation lives in the harness, not the model."
- **Claude Code Hooks Reference** — Claude Code Docs (Anthropic). "PreToolUse, PostToolUse, and 10+ event types for deterministic agent control. Hooks execute shell commands, prompts, or agents at lifecycle points."
- **Extend Claude with Skills** — Claude Code Docs (Anthropic). "Progressive disclosure: metadata (~100 tokens) at session start, full instructions (under 5k) on demand, bundled resources when skill executes. Keep CLAUDE.md under 200 lines."
- **Claude Code Memory Documentation** — Claude Code Docs (Anthropic). "Auto memory: Claude saves notes based on corrections and preferences. /remember for pattern promotion. Sessions cleaned up after 30 days by default."
- **PostToolUse Hooks Loop 25 Times** — DEV Community. "PostToolUse fires after every tool use, creating feedback loops. Hooks can restart the agent, which calls tools, which fires hooks again. Keep hooks fast and idempotent."
- **Agentic Context Engineering (ACE)** — arXiv. "Generator→Reflector→Curator cycle formalizing the harness improvement feedback loop. +10.6% improvement on agent benchmarks, +8.6% on finance tasks."
- **5 Patterns That Make Claude Code Follow Your Rules** — DEV Community. "LLM attention patterns: most frequently violated rules at top and bottom of file. Primacy and recency bias in instruction processing."
- **Stop Bloating Your CLAUDE.md** — alexop.dev. "Only the top layer (CLAUDE.md) consumes tokens in every session. Everything else should be deferred to skills and rules."
- **Agent Skills: Progressive Disclosure as System Design Pattern** — SwirlAI Newsletter. "Agent Skills adopted as open standard December 2025. Progressive disclosure borrowed from Nielsen Norman Group UX research. Same principle for agent context windows."
- **DECISIONS.md Feature Request** — claude-code community, GitHub. "Solo developer gap: 'the conversation IS the design meeting, and it evaporates when context clears.' Lifecycle states: ACTIVE, REJECTED, BACKTRACKED, EXPLORING."
- **Commoditize Your Complement** — Gwern, gwern.net. "Spolsky/Gwern economic framework: as models become the commoditized complement, the harness layer captures value. The model depreciates; the harness appreciates."