Context Engineering: Why It's Replacing Prompt Engineering
Gartner says context engineering is replacing prompt engineering for enterprise AI. Anthropic, LangChain, and practitioners agree: most agent failures are context failures, not model failures. Here's what it actually means, what the evidence says, and what to do about it.
TL;DR
Prompt engineering optimizes how you ask. Context engineering optimizes what the model knows when it answers. Gartner, Anthropic, LangChain, and Shopify’s CEO all land on the same finding: most agent failures are context failures, not model failures. The fix isn’t better prompts — it’s better architecture around the context window. Think of it like SQL: still essential, but the discipline that matters now is the system around it.
Why This Matters Now
Three forces:
- Agent failures are context failures. Harrison Chase (LangChain CEO): “Most agent failures are not model failures anymore — they are context failures.” Context engineering is now “effectively the #1 job” for engineers building AI agents.
- Enterprise AI is failing at scale. 42% of companies abandoned most AI initiatives in 2025, up from 17% in 2024. 57% say their internal data is not AI-ready (Gartner, 2025). Better prompts won’t fix broken data architecture.
- Agents changed the game. Single-turn prompts worked for summarization and translation. Modern agents run in loops — accumulating tool outputs, documents, conversation history, and reasoning. The context window became a scarce resource requiring engineering, not wordsmithing.
The Definition Convergence
Multiple independent sources arrived at remarkably similar definitions:
| Who | Definition |
|---|---|
| Andrej Karpathy | “The delicate art and science of filling the context window with just the right information for the next step” |
| Tobi Lutke (Shopify CEO) | “The art of providing all the context for the task to be plausibly solvable by the LLM” |
| Anthropic | “Optimizing the utility of tokens against the inherent constraints of LLMs to consistently achieve a desired outcome” |
| LangChain | “Building dynamic systems that deliver the right information and tools in the right format so the LLM can plausibly accomplish the task” |
| Gartner | “Designing and structuring the relevant data, workflows and environment so AI systems can understand intent, make better decisions and deliver contextual, enterprise-aligned outcomes” |
Five independent sources — a researcher, a CEO, a model provider, a framework builder, and an analyst firm — all landed on the same core idea: it’s about what information the model has access to, not how you phrase the question. Note how the practitioner definitions are concrete and the Gartner definition is abstract. That gap is itself informative.
The Attention Budget Problem
Why can’t you just stuff everything into the context window?
Anthropic’s research on context rot shows that as token count increases, the model’s ability to accurately recall information decreases. This stems from transformer architecture constraints:
- Attention computes n² pairwise relationships between tokens
- Training data biases toward shorter sequences
- Position encoding creates performance gradients, not hard cliffs
The principle: treat context as a precious, finite resource with diminishing marginal returns.
This connects directly to the “Lost in the Middle” finding (Liu et al., 2023)—LLMs retrieve information from the beginning and end of context with high accuracy but struggle with middle-positioned content.
The Attention Budget
────────────────────
A 200K window ≠ 200K effective tokens
Effective capacity: ~65% of claimed max
Middle retrieval: Significantly degraded
Cost: Linear with token count
Latency: Sometimes superlinear
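The budget figures above can be sketched as simple arithmetic. This is a minimal illustration, not an API: the ~65% effective-capacity figure comes from this article, and the helper names and overhead numbers are assumptions.

```python
# A minimal sketch of the attention-budget math above. The ~65%
# effective-capacity figure is from this article; helper names and the
# example overheads are illustrative assumptions.

def effective_budget(claimed_window: int, efficiency: float = 0.65) -> int:
    """Tokens you can realistically rely on, not the advertised maximum."""
    return int(claimed_window * efficiency)

def remaining_budget(claimed_window: int, system_prompt: int,
                     tool_defs: int, efficiency: float = 0.65) -> int:
    """What's left for documents, history, and tool outputs."""
    return effective_budget(claimed_window, efficiency) - system_prompt - tool_defs

print(effective_budget(200_000))                # 130000 usable of a 200K window
print(remaining_budget(200_000, 3_000, 5_000))  # 122000 after fixed overheads
```

The point of the arithmetic: a 200K window with a few thousand tokens of fixed overhead leaves far less usable space than the advertised number, which is why every strategy below treats tokens as scarce.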
The Four Strategies
Anthropic’s framework groups context engineering into four patterns:
1. Write — Persist Context Outside the Window
Save information externally so it survives context limits.
- CLAUDE.md files: Project knowledge that loads at session start
- Scratchpads / NOTES.md: Agent writes progress notes retrieved later
- Memory tools: Build knowledge bases across sessions
What this looks like in practice:
# CLAUDE.md (loaded automatically at session start)
## Project Rules
- Never use `any` type — use proper TypeScript types
- Run `pnpm test` before committing
- API responses must include `requestId` for tracing
- Use kebab-case for file names
## Architecture Decisions
- Auth: JWT with refresh tokens, not sessions
- Database: PostgreSQL with Drizzle ORM, not Prisma
- State: Server-side only, no Redux
## Known Gotchas
- The payments webhook retries 3x — handlers must be idempotent
- UserService.findById returns null for soft-deleted users
Real-world example: Claude playing Pokemon maintained precise tallies across thousands of game steps by writing notes externally—tracking progress like “for the last 1,234 steps training my Pokemon in Route 1, Pikachu gained 8 levels toward target of 10.”
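The scratchpad pattern above can be sketched in a few lines. This is a hypothetical helper, assuming a NOTES.md file in the working directory; real memory tools differ in detail but follow the same write-then-reload shape.

```python
# A minimal scratchpad sketch (the NOTES.md file name and helper names
# are illustrative assumptions, not a specific tool's API).
from pathlib import Path

NOTES = Path("NOTES.md")

def write_note(note: str) -> None:
    """Append a progress note so it survives context-window limits."""
    with NOTES.open("a") as f:
        f.write(f"- {note}\n")

def load_notes() -> str:
    """Re-read notes at the start of a new session or after compaction."""
    return NOTES.read_text() if NOTES.exists() else ""

write_note("Route 1 training: step 1,234, Pikachu level 8 of target 10")
print(load_notes())  # the tally persists outside the context window
```

Because the note lives on disk rather than in the window, it survives compaction, session restarts, and sub-agent handoffs.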
2. Select — Pull Relevant Context at Runtime
Don’t load everything upfront. Maintain lightweight identifiers and retrieve data dynamically.
- RAG: Retrieve from vector stores based on semantic similarity
- Tool-based exploration: glob, grep, database queries at runtime
- Hybrid approach: Cache static content (tool definitions, docs) + explore dynamically
What this looks like in practice:
# BAD: Load the entire codebase into context upfront
system_prompt = open("entire_repo.txt").read()  # 500K tokens, most irrelevant

# GOOD: Give the agent tools to explore on demand
tools = [
    {"name": "search_code", "description": "Search codebase by pattern"},
    {"name": "read_file",   "description": "Read a specific file"},
    {"name": "list_files",  "description": "List files matching a glob"},
]
# Agent decides what to load based on the task — typically 5-10K tokens
The hybrid approach (used by Claude Code) loads CLAUDE.md upfront for speed, then provides glob/grep primitives for runtime exploration. Best of both: fast start, deep access.
3. Compress — Keep Only High-Signal Tokens
Proactively manage context size before hitting limits.
- Compaction: Summarize conversation history, preserve decisions and key context
- Tool output pruning: Remove raw results after processing
- Structured summaries: Replace verbose content with structured notes
What this looks like in practice:
# Compaction prompt (derived from Codex CLI implementation)
Summarize this conversation for continuation:
KEEP:
1. Completed work — what was accomplished, final file states
2. In-progress tasks — current state, blockers
3. Key decisions — user constraints, architectural choices
4. File paths, variable names, function signatures
DROP:
- Verbose tool outputs already processed
- Exploratory dead-ends
- Redundant explanations
Key insight from production: Claude Code compacts at 95% capacity—but practitioners report 70-80% works better. By 95%, quality has already degraded.
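The 70-80% trigger can be sketched as a small guard around the history. This is an illustration under assumptions: `summarize` stands in for a model call using the KEEP/DROP prompt above, and the keep-last-5 policy is arbitrary.

```python
# A sketch of a proactive compaction trigger. The 75% threshold follows
# the 70-80% guidance above; `summarize` is a stand-in for an LLM call
# and the keep-last-5 policy is an illustrative assumption.

def summarize(history: list[str]) -> str:
    # A real implementation would send the KEEP/DROP compaction prompt
    # to the model and return a structured summary.
    return f"[summary of {len(history)} earlier messages]"

def maybe_compact(history: list[str], tokens_used: int,
                  window: int, trigger: float = 0.75) -> list[str]:
    """Replace older history with a summary once usage passes the trigger."""
    if tokens_used < window * trigger:
        return history
    keep_recent = history[-5:]  # most recent turns stay verbatim
    return [summarize(history[:-5])] + keep_recent

history = [f"msg {i}" for i in range(40)]
compacted = maybe_compact(history, tokens_used=160_000, window=200_000)
print(len(compacted))  # 6: one summary plus the 5 most recent messages
```

Triggering at 75% rather than 95% means the summarizing model still has a healthy context to work with, which is exactly the practitioner complaint about late compaction.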
4. Isolate — Separate State Management
Use sub-agents with fresh context windows for focused tasks.
- Each sub-agent explores extensively (tens of thousands of tokens)
- Returns condensed summaries (1,000-2,000 tokens)
- Main agent coordinates via high-level planning
- Clean separation of concerns
What this looks like in practice:
# Main agent (clean context, high-level coordination)
"Implement user authentication for the API"
├→ Sub-agent 1: "Research existing auth patterns in this codebase"
│ Explores 30K tokens → returns 1.5K summary
│
├→ Sub-agent 2: "Write unit tests for the auth middleware"
│ Explores 25K tokens → returns 2K summary
│
└→ Sub-agent 3: "Review auth implementation for security issues"
Explores 20K tokens → returns 1K summary
# Main agent receives 4.5K tokens instead of 75K
# Each sub-agent got a fresh, focused context window
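The diagram above reduces to a simple coordination loop. This is a structural sketch only: `run_subagent` stands in for spawning a model with a fresh context, and the function names are assumptions.

```python
# A sketch of the isolate pattern: each sub-task runs in a fresh context
# and returns only a condensed summary. `run_subagent` is a stand-in for
# a real model call; names are illustrative assumptions.

def run_subagent(task: str) -> str:
    # A real sub-agent would explore tens of thousands of tokens here,
    # then compress its findings into a 1-2K token summary.
    return f"summary({task})"

def coordinate(goal: str, subtasks: list[str]) -> list[str]:
    """The main agent sees only summaries, never the raw exploration."""
    return [run_subagent(t) for t in subtasks]

summaries = coordinate(
    "Implement user authentication for the API",
    ["research auth patterns", "write middleware tests", "security review"],
)
print(len(summaries))  # 3 condensed summaries instead of ~75K raw tokens
```

The design choice is the return type: sub-agents hand back a summary string, never their full transcript, so exploration cost never leaks into the coordinator's window.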
System Prompt Design: Finding the Right Altitude
Anthropic identifies two failure modes in system prompts:
| Extreme | Problem |
|---|---|
| Too Low | Brittle if-else logic, maintenance nightmare, fragile to edge cases |
| Too High | Vague guidance that assumes shared context, fails to provide concrete signals |
The sweet spot: Specific enough to guide behavior, flexible enough to serve as heuristics.
Recommended structure:
<background_information> → What the agent needs to know
<instructions> → What to do and how
## Tool guidance → When to use which tool
## Output description → Expected format
Start minimal. Test on the best available model. Add instructions based on observed failure modes, not anticipated ones.
Tool Design Principles
Tools are context too. Bad tool design wastes the attention budget.
| Principle | Why |
|---|---|
| Clear contracts | Agent needs unambiguous tool selection |
| Token-efficient returns | Bloated responses waste context |
| No functional overlap | Ambiguity about which tool to use degrades performance |
| Self-contained | Robust to error, clear about intended use |
| Fewer is better | Research shows 19 tools outperform 46 tools for accuracy |
The Model Context Protocol (MCP) is emerging as a standard—described as “USB-C for AI.” It reduces tool integration from M×N (each app needs custom code for each tool) to M+N.
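Two of the principles above, clear contracts and token-efficient returns, can be sketched in one tool wrapper. The tool name, the character cap, and the truncation format are illustrative assumptions, not any library's API.

```python
# A sketch of a token-efficient tool return (names and the cap are
# illustrative assumptions): cap what flows back into the context window
# instead of dumping raw results.

MAX_RESULT_CHARS = 2_000  # budget for this tool's output

def search_code(pattern: str, raw_results: list[str]) -> str:
    """Return matches as compact lines, truncated past the budget."""
    out = "\n".join(raw_results)
    if len(out) <= MAX_RESULT_CHARS:
        return out
    return (out[:MAX_RESULT_CHARS]
            + f"\n... ({len(raw_results)} matches; output truncated)")

hits = [f"src/auth/handler.py:{i}: token check" for i in range(200)]
result = search_code("token", hits)
print("truncated" in result)  # True: 200 matches exceed the budget
```

A raw grep over a large repo can return tens of thousands of characters; capping the return preserves the attention budget for reasoning rather than scrolling.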
The Enterprise Gap
Gartner’s framing adds a dimension the practitioner sources don’t emphasize: governance and organizational readiness.
| Finding | Source |
|---|---|
| 57% of organizations estimate their data is not AI-ready | Gartner 2025 |
| 42% abandoned most AI initiatives in 2025 (up from 17% in 2024) | Gartner 2025 |
| Context engineering moves from differentiator to infrastructure in 12-18 months | CIO.com / R Systems |
Gartner recommends:
- Appoint a context engineering lead — integrate with AI engineering and TRiSM governance teams
- Invest in context-aware architectures — integrate data and signals from across the business
- Develop context governance roadmap — spanning data sources, knowledge graphs, policy frameworks, and dynamic memory management
Not “write better prompts.” An architectural and organizational call.
Context Failure Modes
| Failure | What Happens | Mitigation |
|---|---|---|
| Context Poisoning | Incorrect info enters and compounds through reuse | Structured summaries, user validation |
| Context Distraction | Too much history overwhelms current reasoning | Aggressive pruning, relevance filtering |
| Context Confusion | Irrelevant tools or docs crowd the workspace | Fewer tools, clear descriptions |
| Context Clash | Contradictory information misleads decisions | Deduplication, conflict resolution |
| Context Rot | Quality degrades as window fills | Proactive compaction at 70-80% |
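Two mitigations from the table, deduplication for clash and a recency rule for conflicts, can be sketched directly. The record shapes and the newest-wins policy are illustrative assumptions; production systems may prefer source-authority rules instead.

```python
# A sketch of two mitigations from the table above: deduplication and a
# recency-wins conflict rule. Record shapes and policy are illustrative.

def dedupe(snippets: list[str]) -> list[str]:
    """Drop verbatim duplicate context snippets, preserving order."""
    seen: set[str] = set()
    out = []
    for s in snippets:
        if s not in seen:
            seen.add(s)
            out.append(s)
    return out

def resolve_conflicts(facts: list[tuple[str, str, int]]) -> dict[str, str]:
    """For contradictory (key, value, timestamp) facts, keep the newest."""
    resolved: dict[str, tuple[str, int]] = {}
    for key, value, ts in facts:
        if key not in resolved or ts > resolved[key][1]:
            resolved[key] = (value, ts)
    return {k: v for k, (v, _) in resolved.items()}

facts = [("db", "Prisma", 1), ("db", "Drizzle", 5), ("auth", "JWT", 3)]
print(resolve_conflicts(facts))  # {'db': 'Drizzle', 'auth': 'JWT'}
```

Running these before context assembly prevents contradictory or repeated snippets from consuming the window, the "clash" and "confusion" rows above.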
Cheap Demo vs. Production Agent
Same model, same user message, completely different outcome. The only variable is context.
CHEAP DEMO AGENT
────────────────
Context window contains:
• System prompt: "You are a helpful assistant"
• User message: "Schedule a meeting with Sarah tomorrow"
Result: Generic response. Guesses at calendar app.
Doesn't know who Sarah is. Doesn't know your timezone.
"I'd be happy to help! Please provide..."
PRODUCTION AGENT (Context Engineered)
─────────────────────────────────────
Context window contains:
• System prompt with role, constraints, output format
• User preferences (from long-term memory):
- Timezone: PST
- Prefers 30-min meetings
- Uses Google Calendar
• Retrieved context (from tools):
- Sarah Chen: sarah.chen@company.com, Engineering Lead
- Your calendar: Tomorrow 9-10am, 2-3pm open
- Sarah's calendar: Tomorrow 9-11am open
• Tool definitions: create_event, send_invite, check_availability
• Conversation history: Last week you discussed Q3 planning with Sarah
Result: "I've scheduled a 30-minute meeting with Sarah Chen
tomorrow at 9:00 AM PST. Invite sent to sarah.chen@company.com.
Topic: Q3 Planning follow-up."
The difference isn’t the prompt. It’s everything the model knew before it started thinking.
What This Means in Practice
For Individual Developers
You’re already doing context engineering if you:
- Write CLAUDE.md files that accumulate project rules
- Use `/compact` before context degrades
- Spawn sub-agents for focused tasks
- Structure tool outputs for downstream use
The shift: stop optimizing how you ask, start designing what your AI tools actually know.
For Teams
| Investment | Impact | Difficulty |
|---|---|---|
| Shared CLAUDE.md with institutional knowledge | High | Low |
| Context handoff protocols between agents/sessions | High | Medium |
| Tool output formatting standards | Medium | Low |
| Compaction triggers at 70-80% (not 95%) | High | Low |
| Sub-agent architectures for complex tasks | High | Medium |
For Enterprise
Context engineering is becoming infrastructure, not a project. Gartner’s recommendation to “appoint a context engineering lead” signals this is an organizational capability, not a skill set that lives in individual developers.
The 42% abandonment rate for AI initiatives isn’t a model problem—it’s a context problem. Organizations that treat context as infrastructure will build AI that scales. Those treating it as prompt optimization will keep failing.
Context Audit Checklist
Use this to evaluate any AI agent or workflow you’re building:
CONTEXT AUDIT — Run this Monday morning
────────────────────────────────────────
WHAT DOES THE MODEL KNOW?
[ ] System prompt defines role, constraints, output format
[ ] Relevant domain knowledge is retrievable (not assumed)
[ ] User preferences/history accessible when needed
[ ] Current state (what's been done, what's pending) is tracked
WHAT CAN THE MODEL DO?
[ ] Tools have clear, non-overlapping descriptions
[ ] Tool count is minimal (< 20 for most agents)
[ ] Tool outputs are token-efficient (not raw JSON dumps)
[ ] Error handling returns useful context, not stack traces
HOW IS CONTEXT MANAGED?
[ ] Static content is cached (system prompt, tool defs, docs)
[ ] Dynamic content is retrieved on demand (not pre-loaded)
[ ] Compaction triggers before 80% capacity
[ ] Old tool outputs are pruned after processing
WHAT SURVIVES ACROSS SESSIONS?
[ ] Key decisions persist (CLAUDE.md, memory, notes)
[ ] Handoff protocols exist for agent-to-agent transfer
[ ] Context loss from compaction is acceptable
[ ] Long-running tasks have structured checkpoints
WHAT CAN GO WRONG?
[ ] Contradictory context sources identified and resolved
[ ] Stale information has expiry or refresh mechanism
[ ] Hallucinated summaries don't become "facts" in memory
[ ] Agent can signal when context is insufficient
Prompt Engineering Is Not Dead
Context engineering doesn’t replace prompt engineering — it subsumes it. Prompt engineering remains the “how you ask” layer. But it’s now one component of a larger system.
| Layer | Discipline |
|---|---|
| What the model knows | Context engineering |
| How you ask | Prompt engineering |
| What it can do | Tool design |
| What it remembers | Memory architecture |
| How it coordinates | Agent orchestration |
Prompt engineering is necessary but insufficient. Like SQL — still essential, but no one calls themselves a “SQL engineer” anymore. The job title moved up a level of abstraction. Context engineering is the same shift.
Open Questions
- Governance at scale: How do enterprises audit which tokens shaped each AI response?
- Context compression limits: Where’s the elbow on the compression-vs-accuracy curve?
- Cross-agent context: How do multi-agent systems share context without poisoning each other?
- Measurement: What metrics define “good context engineering”? No standard exists yet.
- Automation: Can context engineering itself be automated? Early signs with ACE (Agentic Context Engineering) frameworks.
The Tacit Angle
Context engineering makes session memory more valuable, not less. Every compaction loses information. Every sub-agent handoff is context that disappears. Every CLAUDE.md rule has a reason — and that reason lives in a session.
| Practice | Without Session Memory | With Session Memory |
|---|---|---|
| Context compaction | Permanent information loss | Searchable full history |
| Sub-agent delegation | Context scattered across agents | Unified cross-session view |
| CLAUDE.md evolution | Rules without rationale | Rules linked to sessions that created them |
| Enterprise context governance | Audit trail gaps | Complete decision provenance |
The more aggressively you engineer context—compressing, isolating, pruning—the more valuable it becomes to persist what was removed.
Confidence Assessment
| Claim | Confidence |
|---|---|
| Context engineering is a real, distinct discipline | High — multi-source convergence |
| Most agent failures are context failures | High — Anthropic, LangChain, practitioners agree |
| Enterprise AI failure rates are alarming | High — Gartner data |
| The 4-strategy framework (Write/Select/Compress/Isolate) works | High — production evidence from Claude Code |
| Prompt engineering is dead | Low — it’s subsumed, not dead |
| Context engineering will be a named org function | Medium — Gartner recommends it, adoption TBD |
| 12-18 month timeline to infrastructure status | Medium — one practitioner estimate |
Sources & Provenance
Verifiable sources. Dates matter. Credibility assessed.
Effective Context Engineering for AI Agents ↗
Prithvi Rajasekaran, Ethan Dixon, Carly Ryan, Jeremy Hadfield · Anthropic Engineering
"Canonical technical definition of context engineering. Four strategies: Write, Select, Compress, Isolate. Production evidence from Claude Code. 'Treat context as a precious, finite resource with diminishing marginal returns.'"
Context Engineering: Why It's Replacing Prompt Engineering for Enterprise AI Success ↗
Gartner · Gartner Articles
"Enterprise framing: 57% of organizations say data is not AI-ready. 42% abandoned AI initiatives in 2025. Recommends appointing context engineering leads and building context governance roadmaps."
The Rise of 'Context Engineering' ↗
Harrison Chase · LangChain Blog
"'Most agent failures are not model failures anymore—they are context failures.' Context engineering is 'effectively the #1 job' for engineers building AI agents."
Andrej Karpathy on Context Engineering ↗
Andrej Karpathy · X (Twitter)
"Foundational definition: 'The delicate art and science of filling the context window with just the right information for the next step.' Framing adopted widely."
Tobi Lutke on Context Engineering ↗
Tobi Lutke · X (Twitter)
"Shopify CEO advocates for context engineering over prompt engineering: 'The art of providing all the context for the task to be plausibly solvable by the LLM.'"
Context Engineering: LLM Memory and Retrieval for AI Agents ↗
Weaviate Team · Weaviate Blog
"Six pillars framework: Agents, Query Augmentation, Retrieval, Prompting, Memory, Tools. Context failure modes: poisoning, distraction, confusion, clash. MCP as 'USB-C for AI.'"
Context Engineering Guide: Techniques for AI Agents ↗
Tuana Celik and Logan Markewich · LlamaIndex Blog
"Eight context components identified. Workflow engineering as core technique. 'Every AI builder is ultimately building specialized workflows—whether they realize it or not.'"
The New Skill in AI is Not Prompting, It's Context Engineering ↗
Philipp Schmid · Personal Blog
"Seven contextual layers. Distinguishes 'cheap demo' (poor context) from 'magical agent' (rich context). Four characteristics: system-based, dynamic, information-complete, format-conscious."
Context Engineering: Improving AI by Moving Beyond the Prompt ↗
Various IT Leaders · CIO.com
"Enterprise adoption patterns: context engineering moves from differentiator to infrastructure in 12-18 months. 'Treat context as infrastructure'—standardize pipelines, not ad-hoc files."
Context Engineering: Structured Output, RAG & More Components ↗
Elasticsearch Labs · Elastic Blog
"Five core components: RAG, Prompt Engineering, Memory Management, Structured Outputs, Tools. Key finding: 19 tools outperform 46 tools for model accuracy."
Context Engineering Guide ↗
Prompt Engineering Guide · promptingguide.ai
"Tutorial-level synthesis of context engineering components. Identifies emerging areas: context compression, stale info detection, automation, measurement frameworks."
Why AI Teams Are Moving From Prompt Engineering to Context Engineering ↗
Neo4j · Neo4j Blog
"Knowledge graph perspective on context engineering. 'Prompts shape how the model thinks. Context shapes what the model actually knows.' Reliable AI comes from architecture, not clever phrasing."