Context Engineering: Why It's Replacing Prompt Engineering
Gartner says context engineering is replacing prompt engineering for enterprise AI. Anthropic, LangChain, and practitioners agree: most agent failures are context failures, not model failures. Here's what it actually means, what the evidence says, and what to do about it.
TL;DR
Prompt engineering optimizes how you ask. Context engineering optimizes what the model knows when it answers. Gartner, Anthropic, LangChain, and Shopify’s CEO all land on the same finding: most agent failures are context failures, not model failures. The fix isn’t better prompts — it’s better architecture around the context window. Think of it like SQL: still essential, but the discipline that matters now is the system around it.
Why This Matters Now
Three forces:
- Agent failures are context failures. Harrison Chase (LangChain CEO): “Most agent failures are not model failures anymore — they are context failures.” Context engineering is now “effectively the #1 job” for engineers building AI agents.
- Enterprise AI is failing at scale. 42% of companies abandoned most AI initiatives in 2025, up from 17% in 2024. 57% say their internal data is not AI-ready (Gartner, 2025). Better prompts won’t fix broken data architecture.
- Agents changed the game. Single-turn prompts worked for summarization and translation. Modern agents run in loops — accumulating tool outputs, documents, conversation history, and reasoning. The context window became a scarce resource requiring engineering, not wordsmithing.
The Definition Convergence
Multiple independent sources arrived at remarkably similar definitions:
| Who | Definition |
|---|---|
| Andrej Karpathy | “The delicate art and science of filling the context window with just the right information for the next step” |
| Tobi Lutke (Shopify CEO) | “The art of providing all the context for the task to be plausibly solvable by the LLM” |
| Anthropic | “Optimizing the utility of tokens against the inherent constraints of LLMs to consistently achieve a desired outcome” |
| LangChain | “Building dynamic systems that deliver the right information and tools in the right format so the LLM can plausibly accomplish the task” |
| Gartner | “Designing and structuring the relevant data, workflows and environment so AI systems can understand intent, make better decisions and deliver contextual, enterprise-aligned outcomes” |
Five independent sources — a researcher, a CEO, a model provider, a framework builder, and an analyst firm — all landed on the same core idea: it’s about what information the model has access to, not how you phrase the question. Note how the practitioner definitions are concrete and the Gartner definition is abstract. That gap is itself informative.
The Attention Budget Problem
Why can’t you just stuff everything into the context window?
Anthropic’s research on context rot shows that as token count increases, the model’s ability to accurately recall information decreases. This stems from transformer architecture constraints:
- Attention computes n² pairwise relationships between tokens
- Training data biases toward shorter sequences
- Position encoding creates performance gradients, not hard cliffs
The principle: treat context as a precious, finite resource with diminishing marginal returns.
This connects directly to the “Lost in the Middle” finding (Liu et al., 2023)—LLMs retrieve information from the beginning and end of context with high accuracy but struggle with middle-positioned content.
The Attention Budget
────────────────────
A 200K window ≠ 200K effective tokens
Effective capacity: ~65% of claimed max
Middle retrieval: Significantly degraded
Cost: Linear with token count
Latency: Sometimes superlinear
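The budget figures above can be sketched as simple arithmetic. This is a minimal illustration, not an API: the ~65% effective-capacity figure comes from this article, and the helper names and overhead numbers are assumptions.

```python
# A minimal sketch of the attention-budget math above. The ~65%
# effective-capacity figure is from this article; helper names and the
# example overheads are illustrative assumptions.

def effective_budget(claimed_window: int, efficiency: float = 0.65) -> int:
    """Tokens you can realistically rely on, not the advertised maximum."""
    return int(claimed_window * efficiency)

def remaining_budget(claimed_window: int, system_prompt: int,
                     tool_defs: int, efficiency: float = 0.65) -> int:
    """What's left for documents, history, and tool outputs."""
    return effective_budget(claimed_window, efficiency) - system_prompt - tool_defs

print(effective_budget(200_000))                # 130000 usable of a 200K window
print(remaining_budget(200_000, 3_000, 5_000))  # 122000 after fixed overheads
```

The point of the arithmetic: a 200K window with a few thousand tokens of fixed overhead leaves far less usable space than the advertised number, which is why every strategy below treats tokens as scarce.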
The Four Strategies
Anthropic’s framework groups context engineering into four patterns:
1. Write — Persist Context Outside the Window
Save information externally so it survives context limits.
- CLAUDE.md files: Project knowledge that loads at session start
- Scratchpads / NOTES.md: Agent writes progress notes retrieved later
- Memory tools: Build knowledge bases across sessions
What this looks like in practice:
# CLAUDE.md (loaded automatically at session start)
## Project Rules
- Never use `any` type — use proper TypeScript types
- Run `pnpm test` before committing
- API responses must include `requestId` for tracing
- Use kebab-case for file names
## Architecture Decisions
- Auth: JWT with refresh tokens, not sessions
- Database: PostgreSQL with Drizzle ORM, not Prisma
- State: Server-side only, no Redux
## Known Gotchas
- The payments webhook retries 3x — handlers must be idempotent
- UserService.findById returns null for soft-deleted users
Real-world example: Claude playing Pokemon maintained precise tallies across thousands of game steps by writing notes externally—tracking progress like “for the last 1,234 steps training my Pokemon in Route 1, Pikachu gained 8 levels toward target of 10.”
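The scratchpad pattern above can be sketched in a few lines. This is a hypothetical helper, assuming a NOTES.md file in the working directory; real memory tools differ in detail but follow the same write-then-reload shape.

```python
# A minimal scratchpad sketch (the NOTES.md file name and helper names
# are illustrative assumptions, not a specific tool's API).
from pathlib import Path

NOTES = Path("NOTES.md")

def write_note(note: str) -> None:
    """Append a progress note so it survives context-window limits."""
    with NOTES.open("a") as f:
        f.write(f"- {note}\n")

def load_notes() -> str:
    """Re-read notes at the start of a new session or after compaction."""
    return NOTES.read_text() if NOTES.exists() else ""

write_note("Route 1 training: step 1,234, Pikachu level 8 of target 10")
print(load_notes())  # the tally persists outside the context window
```

Because the note lives on disk rather than in the window, it survives compaction, session restarts, and sub-agent handoffs.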
2. Select — Pull Relevant Context at Runtime
Don’t load everything upfront. Maintain lightweight identifiers and retrieve data dynamically.
- RAG: Retrieve from vector stores based on semantic similarity
- Tool-based exploration: glob, grep, database queries at runtime
- Hybrid approach: Cache static content (tool definitions, docs) + explore dynamically
What this looks like in practice:
# BAD: Load the entire codebase into context upfront
system_prompt = open("entire_repo.txt").read()  # 500K tokens, most irrelevant

# GOOD: Give the agent tools to explore on demand
tools = [
    {"name": "search_code", "description": "Search codebase by pattern"},
    {"name": "read_file",   "description": "Read a specific file"},
    {"name": "list_files",  "description": "List files matching a glob"},
]
# Agent decides what to load based on the task — typically 5-10K tokens
The hybrid approach (used by Claude Code) loads CLAUDE.md upfront for speed, then provides glob/grep primitives for runtime exploration. Best of both: fast start, deep access.
3. Compress — Keep Only High-Signal Tokens
Proactively manage context size before hitting limits.
- Compaction: Summarize conversation history, preserve decisions and key context
- Tool output pruning: Remove raw results after processing
- Structured summaries: Replace verbose content with structured notes
What this looks like in practice:
# Compaction prompt (derived from Codex CLI implementation)
Summarize this conversation for continuation:
KEEP:
1. Completed work — what was accomplished, final file states
2. In-progress tasks — current state, blockers
3. Key decisions — user constraints, architectural choices
4. File paths, variable names, function signatures
DROP:
- Verbose tool outputs already processed
- Exploratory dead-ends
- Redundant explanations
Key insight from production: Claude Code compacts at 95% capacity—but practitioners report 70-80% works better. By 95%, quality has already degraded.
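The 70-80% trigger can be sketched as a small guard around the history. This is an illustration under assumptions: `summarize` stands in for a model call using the KEEP/DROP prompt above, and the keep-last-5 policy is arbitrary.

```python
# A sketch of a proactive compaction trigger. The 75% threshold follows
# the 70-80% guidance above; `summarize` is a stand-in for an LLM call
# and the keep-last-5 policy is an illustrative assumption.

def summarize(history: list[str]) -> str:
    # A real implementation would send the KEEP/DROP compaction prompt
    # to the model and return a structured summary.
    return f"[summary of {len(history)} earlier messages]"

def maybe_compact(history: list[str], tokens_used: int,
                  window: int, trigger: float = 0.75) -> list[str]:
    """Replace older history with a summary once usage passes the trigger."""
    if tokens_used < window * trigger:
        return history
    keep_recent = history[-5:]  # most recent turns stay verbatim
    return [summarize(history[:-5])] + keep_recent

history = [f"msg {i}" for i in range(40)]
compacted = maybe_compact(history, tokens_used=160_000, window=200_000)
print(len(compacted))  # 6: one summary plus the 5 most recent messages
```

Triggering at 75% rather than 95% means the summarizing model still has a healthy context to work with, which is exactly the practitioner complaint about late compaction.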
4. Isolate — Separate State Management
Use sub-agents with fresh context windows for focused tasks.
- Each sub-agent explores extensively (tens of thousands of tokens)
- Returns condensed summaries (1,000-2,000 tokens)
- Main agent coordinates via high-level planning
- Clean separation of concerns
What this looks like in practice:
# Main agent (clean context, high-level coordination)
"Implement user authentication for the API"
├→ Sub-agent 1: "Research existing auth patterns in this codebase"
│ Explores 30K tokens → returns 1.5K summary
│
├→ Sub-agent 2: "Write unit tests for the auth middleware"
│ Explores 25K tokens → returns 2K summary
│
└→ Sub-agent 3: "Review auth implementation for security issues"
Explores 20K tokens → returns 1K summary
# Main agent receives 4.5K tokens instead of 75K
# Each sub-agent got a fresh, focused context window
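The diagram above reduces to a simple coordination loop. This is a structural sketch only: `run_subagent` stands in for spawning a model with a fresh context, and the function names are assumptions.

```python
# A sketch of the isolate pattern: each sub-task runs in a fresh context
# and returns only a condensed summary. `run_subagent` is a stand-in for
# a real model call; names are illustrative assumptions.

def run_subagent(task: str) -> str:
    # A real sub-agent would explore tens of thousands of tokens here,
    # then compress its findings into a 1-2K token summary.
    return f"summary({task})"

def coordinate(goal: str, subtasks: list[str]) -> list[str]:
    """The main agent sees only summaries, never the raw exploration."""
    return [run_subagent(t) for t in subtasks]

summaries = coordinate(
    "Implement user authentication for the API",
    ["research auth patterns", "write middleware tests", "security review"],
)
print(len(summaries))  # 3 condensed summaries instead of ~75K raw tokens
```

The design choice is the return type: sub-agents hand back a summary string, never their full transcript, so exploration cost never leaks into the coordinator's window.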
System Prompt Design: Finding the Right Altitude
Anthropic identifies two failure modes in system prompts:
| Extreme | Problem |
|---|---|
| Too Low | Brittle if-else logic, maintenance nightmare, fragile to edge cases |
| Too High | Vague guidance that assumes shared context, fails to provide concrete signals |
The sweet spot: Specific enough to guide behavior, flexible enough to serve as heuristics.
Recommended structure:
<background_information> → What the agent needs to know
<instructions> → What to do and how
## Tool guidance → When to use which tool
## Output description → Expected format
Start minimal. Test on the best available model. Add instructions based on observed failure modes, not anticipated ones.
Tool Design Principles
Tools are context too. Bad tool design wastes the attention budget.
| Principle | Why |
|---|---|
| Clear contracts | Agent needs unambiguous tool selection |
| Token-efficient returns | Bloated responses waste context |
| No functional overlap | Ambiguity about which tool to use degrades performance |
| Self-contained | Robust to error, clear about intended use |
| Fewer is better | Research shows 19 tools outperform 46 tools for accuracy |
The Model Context Protocol (MCP) is emerging as a standard—described as “USB-C for AI.” It reduces tool integration from M×N (each app needs custom code for each tool) to M+N.
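Two of the principles above, clear contracts and token-efficient returns, can be sketched in one tool wrapper. The tool name, the character cap, and the truncation format are illustrative assumptions, not any library's API.

```python
# A sketch of a token-efficient tool return (names and the cap are
# illustrative assumptions): cap what flows back into the context window
# instead of dumping raw results.

MAX_RESULT_CHARS = 2_000  # budget for this tool's output

def search_code(pattern: str, raw_results: list[str]) -> str:
    """Return matches as compact lines, truncated past the budget."""
    out = "\n".join(raw_results)
    if len(out) <= MAX_RESULT_CHARS:
        return out
    return (out[:MAX_RESULT_CHARS]
            + f"\n... ({len(raw_results)} matches; output truncated)")

hits = [f"src/auth/handler.py:{i}: token check" for i in range(200)]
result = search_code("token", hits)
print("truncated" in result)  # True: 200 matches exceed the budget
```

A raw grep over a large repo can return tens of thousands of characters; capping the return preserves the attention budget for reasoning rather than scrolling.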
The Enterprise Gap
Gartner’s framing adds a dimension the practitioner sources don’t emphasize: governance and organizational readiness.
| Finding | Source |
|---|---|
| 57% of organizations estimate their data is not AI-ready | Gartner 2025 |
| 42% abandoned most AI initiatives in 2025 (up from 17% in 2024) | Gartner 2025 |
| Context engineering moves from differentiator to infrastructure in 12-18 months | CIO.com / R Systems |
Gartner recommends:
- Appoint a context engineering lead — integrate with AI engineering and TRiSM governance teams
- Invest in context-aware architectures — integrate data and signals from across the business
- Develop context governance roadmap — spanning data sources, knowledge graphs, policy frameworks, and dynamic memory management
Not “write better prompts.” An architectural and organizational call.
Context Failure Modes
| Failure | What Happens | Mitigation |
|---|---|---|
| Context Poisoning | Incorrect info enters and compounds through reuse | Structured summaries, user validation |
| Context Distraction | Too much history overwhelms current reasoning | Aggressive pruning, relevance filtering |
| Context Confusion | Irrelevant tools or docs crowd the workspace | Fewer tools, clear descriptions |
| Context Clash | Contradictory information misleads decisions | Deduplication, conflict resolution |
| Context Rot | Quality degrades as window fills | Proactive compaction at 70-80% |
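Two mitigations from the table, deduplication for clash and a recency rule for conflicts, can be sketched directly. The record shapes and the newest-wins policy are illustrative assumptions; production systems may prefer source-authority rules instead.

```python
# A sketch of two mitigations from the table above: deduplication and a
# recency-wins conflict rule. Record shapes and policy are illustrative.

def dedupe(snippets: list[str]) -> list[str]:
    """Drop verbatim duplicate context snippets, preserving order."""
    seen: set[str] = set()
    out = []
    for s in snippets:
        if s not in seen:
            seen.add(s)
            out.append(s)
    return out

def resolve_conflicts(facts: list[tuple[str, str, int]]) -> dict[str, str]:
    """For contradictory (key, value, timestamp) facts, keep the newest."""
    resolved: dict[str, tuple[str, int]] = {}
    for key, value, ts in facts:
        if key not in resolved or ts > resolved[key][1]:
            resolved[key] = (value, ts)
    return {k: v for k, (v, _) in resolved.items()}

facts = [("db", "Prisma", 1), ("db", "Drizzle", 5), ("auth", "JWT", 3)]
print(resolve_conflicts(facts))  # {'db': 'Drizzle', 'auth': 'JWT'}
```

Running these before context assembly prevents contradictory or repeated snippets from consuming the window, the "clash" and "confusion" rows above.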
Cheap Demo vs. Production Agent
Same model, same user message, completely different outcome. The only variable is context.
CHEAP DEMO AGENT
────────────────
Context window contains:
• System prompt: "You are a helpful assistant"
• User message: "Schedule a meeting with Sarah tomorrow"
Result: Generic response. Guesses at calendar app.
Doesn't know who Sarah is. Doesn't know your timezone.
"I'd be happy to help! Please provide..."
PRODUCTION AGENT (Context Engineered)
─────────────────────────────────────
Context window contains:
• System prompt with role, constraints, output format
• User preferences (from long-term memory):
- Timezone: PST
- Prefers 30-min meetings
- Uses Google Calendar
• Retrieved context (from tools):
- Sarah Chen: sarah.chen@company.com, Engineering Lead
- Your calendar: Tomorrow 9-10am, 2-3pm open
- Sarah's calendar: Tomorrow 9-11am open
• Tool definitions: create_event, send_invite, check_availability
• Conversation history: Last week you discussed Q3 planning with Sarah
Result: "I've scheduled a 30-minute meeting with Sarah Chen
tomorrow at 9:00 AM PST. Invite sent to sarah.chen@company.com.
Topic: Q3 Planning follow-up."
The difference isn’t the prompt. It’s everything the model knew before it started thinking.
What This Means in Practice
For Individual Developers
You’re already doing context engineering if you:
- Write CLAUDE.md files that accumulate project rules
- Use `/compact` before context degrades
- Spawn sub-agents for focused tasks
- Structure tool outputs for downstream use
The shift: stop optimizing how you ask, start designing what your AI tools actually know.
For Teams
| Investment | Impact | Difficulty |
|---|---|---|
| Shared CLAUDE.md with institutional knowledge | High | Low |
| Context handoff protocols between agents/sessions | High | Medium |
| Tool output formatting standards | Medium | Low |
| Compaction triggers at 70-80% (not 95%) | High | Low |
| Sub-agent architectures for complex tasks | High | Medium |
For Enterprise
Context engineering is becoming infrastructure, not a project. Gartner’s recommendation to “appoint a context engineering lead” signals this is an organizational capability, not a skill set that lives in individual developers.
The 42% abandonment rate for AI initiatives isn’t a model problem—it’s a context problem. Organizations that treat context as infrastructure will build AI that scales. Those treating it as prompt optimization will keep failing.
Context Audit Checklist
Use this to evaluate any AI agent or workflow you’re building:
CONTEXT AUDIT — Run this Monday morning
────────────────────────────────────────
WHAT DOES THE MODEL KNOW?
[ ] System prompt defines role, constraints, output format
[ ] Relevant domain knowledge is retrievable (not assumed)
[ ] User preferences/history accessible when needed
[ ] Current state (what's been done, what's pending) is tracked
WHAT CAN THE MODEL DO?
[ ] Tools have clear, non-overlapping descriptions
[ ] Tool count is minimal (< 20 for most agents)
[ ] Tool outputs are token-efficient (not raw JSON dumps)
[ ] Error handling returns useful context, not stack traces
HOW IS CONTEXT MANAGED?
[ ] Static content is cached (system prompt, tool defs, docs)
[ ] Dynamic content is retrieved on demand (not pre-loaded)
[ ] Compaction triggers before 80% capacity
[ ] Old tool outputs are pruned after processing
WHAT SURVIVES ACROSS SESSIONS?
[ ] Key decisions persist (CLAUDE.md, memory, notes)
[ ] Handoff protocols exist for agent-to-agent transfer
[ ] Context loss from compaction is acceptable
[ ] Long-running tasks have structured checkpoints
WHAT CAN GO WRONG?
[ ] Contradictory context sources identified and resolved
[ ] Stale information has expiry or refresh mechanism
[ ] Hallucinated summaries don't become "facts" in memory
[ ] Agent can signal when context is insufficient
Prompt Engineering Is Not Dead
Context engineering doesn’t replace prompt engineering — it subsumes it. Prompt engineering remains the “how you ask” layer. But it’s now one component of a larger system.
| Layer | Discipline |
|---|---|
| What the model knows | Context engineering |
| How you ask | Prompt engineering |
| What it can do | Tool design |
| What it remembers | Memory architecture |
| How it coordinates | Agent orchestration |
Prompt engineering is necessary but insufficient. Like SQL — still essential, but no one calls themselves a “SQL engineer” anymore. The job title moved up a level of abstraction. Context engineering is the same shift.
Open Questions
- Governance at scale: How do enterprises audit which tokens shaped each AI response?
- Context compression limits: Where’s the elbow on the compression-vs-accuracy curve?
- Cross-agent context: How do multi-agent systems share context without poisoning each other?
- Measurement: What metrics define “good context engineering”? No standard exists yet.
- Automation: Can context engineering itself be automated? Early signs with ACE (Agentic Context Engineering) frameworks.
The Tacit Angle
Context engineering makes session memory more valuable, not less. Every compaction loses information. Every sub-agent handoff is context that disappears. Every CLAUDE.md rule has a reason — and that reason lives in a session.
| Practice | Without Session Memory | With Session Memory |
|---|---|---|
| Context compaction | Permanent information loss | Searchable full history |
| Sub-agent delegation | Context scattered across agents | Unified cross-session view |
| CLAUDE.md evolution | Rules without rationale | Rules linked to sessions that created them |
| Enterprise context governance | Audit trail gaps | Complete decision provenance |
The more aggressively you engineer context—compressing, isolating, pruning—the more valuable it becomes to persist what was removed.
Confidence Assessment
| Claim | Confidence |
|---|---|
| Context engineering is a real, distinct discipline | High — multi-source convergence |
| Most agent failures are context failures | High — Anthropic, LangChain, practitioners agree |
| Enterprise AI failure rates are alarming | High — Gartner data |
| The 4-strategy framework (Write/Select/Compress/Isolate) works | High — production evidence from Claude Code |
| Prompt engineering is dead | Low — it’s subsumed, not dead |
| Context engineering will be a named org function | Medium — Gartner recommends it, adoption TBD |
| 12-18 month timeline to infrastructure status | Medium — one practitioner estimate |
Sources & Provenance
Verifiable sources. Dates matter. Credibility assessed.
Effective Context Engineering for AI Agents ↗
Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, Jeremy Hadfield · Anthropic Engineering
"Canonical technical definition of context engineering. Four strategies: Write, Select, Compress, Isolate. Production evidence from Claude Code. 'Treat context as a precious, finite resource with diminishing marginal returns.'"
Context Engineering: Why It's Replacing Prompt Engineering for Enterprise AI Success ↗
Gartner · Gartner Articles
"Enterprise framing: 57% of organizations say data is not AI-ready. 42% abandoned AI initiatives in 2025. Recommends appointing context engineering leads and building context governance roadmaps."
The Rise of 'Context Engineering' ↗
Harrison Chase · LangChain Blog
"'Most agent failures are not model failures anymore—they are context failures.' Context engineering is 'effectively the #1 job' for engineers building AI agents."
Andrej Karpathy on Context Engineering ↗
Andrej Karpathy · X (Twitter)
"Foundational definition: 'The delicate art and science of filling the context window with just the right information for the next step.' Framing adopted widely."
Tobi Lutke on Context Engineering ↗
Tobi Lutke · X (Twitter)
"Shopify CEO advocates for context engineering over prompt engineering: 'The art of providing all the context for the task to be plausibly solvable by the LLM.'"
Context Engineering: LLM Memory and Retrieval for AI Agents ↗
Weaviate Team · Weaviate Blog
"Six pillars framework: Agents, Query Augmentation, Retrieval, Prompting, Memory, Tools. Context failure modes: poisoning, distraction, confusion, clash. MCP as 'USB-C for AI.'"
Context Engineering Guide: Techniques for AI Agents ↗
Tuana Celik and Logan Markewich · LlamaIndex Blog
"Eight context components identified. Workflow engineering as core technique. 'Every AI builder is ultimately building specialized workflows—whether they realize it or not.'"
The New Skill in AI is Not Prompting, It's Context Engineering ↗
Philipp Schmid · Personal Blog
"Seven contextual layers. Distinguishes 'cheap demo' (poor context) from 'magical agent' (rich context). Four characteristics: system-based, dynamic, information-complete, format-conscious."
Context Engineering: Improving AI by Moving Beyond the Prompt ↗
Various IT Leaders · CIO.com
"Enterprise adoption patterns: context engineering moves from differentiator to infrastructure in 12-18 months. 'Treat context as infrastructure'—standardize pipelines, not ad-hoc files."
Context Engineering: Structured Output, RAG & More Components ↗
Elasticsearch Labs · Elastic Blog
"Five core components: RAG, Prompt Engineering, Memory Management, Structured Outputs, Tools. Key finding: 19 tools outperform 46 tools for model accuracy."
Context Engineering Guide ↗
Prompt Engineering Guide · promptingguide.ai
"Tutorial-level synthesis of context engineering components. Identifies emerging areas: context compression, stale info detection, automation, measurement frameworks."
Why AI Teams Are Moving From Prompt Engineering to Context Engineering ↗
Neo4j · Neo4j Blog
"Knowledge graph perspective on context engineering. 'Prompts shape how the model thinks. Context shapes what the model actually knows.' Reliable AI comes from architecture, not clever phrasing."