LLM Context Optimization: What Actually Works
A 200K context window doesn't mean 200K effective tokens. Research across academic papers, production systems (Claude Code, Codex CLI, Amp), and benchmarks reveals when to trim, summarize, cache, or delegate—and the pitfalls that break real agents.
TL;DR
Context window marketing overstates effective capacity. A 200K window typically degrades around 130K. Information in the middle of context is poorly retrieved—the “Lost in the Middle” phenomenon. Production systems like Claude Code, Codex CLI, and Amp use compaction at 85-95% capacity, but practitioner feedback suggests 70-80% works better. The strategy hierarchy: don’t fill it → position matters → cache the static → compact early → prune tool outputs → delegate to sub-agents.
Quick Reference
```
CONTEXT BUDGET
──────────────
System prompt:   Pin, never trim      (<10%)
Cached content:  Static docs, tools   (20-40%)
Conversation:    Recent turns         (20-30%)
Tool outputs:    Prune aggressively   (5-10%)
Reserve:         Response buffer      (15-20%)

THRESHOLDS
──────────
70%  Consider proactive compaction
80%  Warn user, prepare compaction
85%  Auto-compact (non-critical)
95%  Emergency (Claude Code default; too late)

POSITION RULES
──────────────
Start/End:  Critical info (high retrieval)
Middle:     Supporting detail only (low retrieval)
```
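The threshold ladder can be sketched as a simple utilization check. A minimal sketch; the function name is illustrative, and the cutoffs should be tuned per deployment:

```python
def compaction_action(tokens_used: int, max_tokens: int) -> str:
    """Map context utilization onto the threshold ladder above."""
    pct = tokens_used / max_tokens
    if pct >= 0.95:
        return "emergency_compact"   # too late: quality has likely degraded
    if pct >= 0.85:
        return "auto_compact"        # non-critical work: compact now
    if pct >= 0.80:
        return "warn_and_prepare"    # tell the user, stage a summary
    if pct >= 0.70:
        return "consider_compact"    # proactive compaction window
    return "ok"
```

For example, `compaction_action(150_000, 200_000)` lands at 75% utilization, inside the proactive window.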
Why This Matters
Every LLM agent eventually hits context limits. The naive solution—“just use a bigger window”—fails for three reasons:
- Latency scales with context (sometimes superlinearly)
- Cost scales linearly (every token costs money)
- Quality degrades (“lost in the middle” + context rot)
This research synthesizes what actually works in production, from academic foundations to reverse-engineered implementations.
The Core Problem: Lost in the Middle
The foundational finding (Liu et al., 2023): LLMs retrieve information from the beginning and end of context with high accuracy but struggle with middle-positioned information.
Accuracy vs. Position in Context
────────────────────────────────
High │ ███                ███ │
     │  ███              ███  │
     │   ███            ███   │
     │     ███        ███     │
Low  │       ██████████       │
     └────────────────────────┘
       Start    Middle     End
Root causes:
- Rotary Position Embedding (RoPE) introduces long-term decay
- Attention sinks create “attractors” at boundaries
- Training data bias (relevant info usually at start/end)
Implication: You can’t just stuff more context. Placement matters.
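One way to act on this when assembling a prompt is to place critical items at the boundaries and bury supporting detail in the middle. A minimal sketch; the split-the-critical-items strategy is an assumption, not something the paper prescribes:

```python
def position_aware_order(critical: list[str], supporting: list[str]) -> list[str]:
    """Order prompt segments so critical items land at the start and end,
    where retrieval accuracy is highest; supporting detail goes in the
    middle, where misses are cheapest."""
    half = len(critical) // 2
    # Front-load half the critical items, close with the rest.
    return critical[:half] + supporting + critical[half:]
```

So `position_aware_order(["sys", "query"], ["doc1", "doc2"])` yields `["sys", "doc1", "doc2", "query"]`.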
Production Compaction Systems Compared
How do real agent systems handle context limits?
| System | Trigger | Threshold | Key Mechanism |
|---|---|---|---|
| Claude Code | Manual /compact or auto | ~95% | Summary replaces history |
| Codex CLI | Manual or auto | 180K-244K | Summary + last 20K preserved |
| OpenCode | Manual or auto | Overflow check | Separate tool output pruning |
| Amp | Manual “Handoff” only | N/A | Short focused conversations |
The key insight from practitioner feedback: 95% is too late. By then, quality has already degraded. 70-80% produces better results.
Compaction Prompt Pattern
Derived from Codex CLI’s implementation:
```
Summarize this conversation for a follow-up session:

REQUIRED SECTIONS:
1. Completed Work: What was accomplished, final file states
2. In Progress: Current task state, blockers
3. Key Decisions: User constraints, architectural choices
4. Critical Context: Information essential for continuation

Keep: Technical specifics, file paths, variable names
Drop: Verbose tool outputs, exploratory dead-ends
```
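Wired into an agent loop, compaction replaces older turns with a summary produced from this template while keeping recent turns verbatim. A sketch, where `llm` stands in for any completion call and the message format is assumed:

```python
COMPACTION_PROMPT = """Summarize this conversation for a follow-up session:

REQUIRED SECTIONS:
1. Completed Work: What was accomplished, final file states
2. In Progress: Current task state, blockers
3. Key Decisions: User constraints, architectural choices
4. Critical Context: Information essential for continuation

Keep: Technical specifics, file paths, variable names
Drop: Verbose tool outputs, exploratory dead-ends"""

def compact(history: list[dict], llm, keep_recent: int = 5) -> list[dict]:
    """Replace old turns with a structured summary; keep recent turns intact."""
    old, recent = history[:-keep_recent], history[-keep_recent:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = llm(COMPACTION_PROMPT + "\n\n" + transcript)
    return [{"role": "user", "content": f"[Compacted history]\n{summary}"}] + recent
```

The summary re-enters the context as a single message, so the model treats it as established fact: this is exactly where context poisoning (see the pitfalls table) creeps in if the summary hallucinates.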
The Pitfalls
| Pitfall | How Teams Hit It | Mitigation |
|---|---|---|
| Context Poisoning | Hallucination during summarization becomes “fact” | Structured summaries, user validation |
| Premature Auto-Compact | Interrupts active work at 95% | Compact at 70-80%, warn user |
| Lost Continuity | Critical context trimmed (variable definitions) | Pin important context, role tagging |
| Over-trust in Window Size | Assume 200K means 200K effective | Test actual performance, use ~65% |
| Tool Output Bloat | Tool results dominate context | Separate tool output pruning |
Prompt Caching: 90% Cost Reduction
Prompt caching stores computed KV states for static prefixes. Both Anthropic and OpenAI offer this, but implementations differ significantly.
| Aspect | Anthropic | OpenAI |
|---|---|---|
| Control | Explicit cache_control markers | Automatic |
| Cache Hit Rate | 100% (when correct) | ~50% |
| Write Cost | 25% premium | No premium |
| Break-Even | 2-3 reuses | 1 reuse |
| Minimum Tokens | 1024-2048 | 1024 |
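The break-even row follows directly from the multipliers. A sketch of the relative cost of caching a prefix versus resending it, using Anthropic's published figures (25% write premium, 90% read savings); `reuses` counts total calls that include the prefix:

```python
def cache_roi(reuses: int, write_premium: float = 0.25,
              read_discount: float = 0.90) -> float:
    """Cost of a cached prefix relative to sending it uncached every call.
    Values below 1.0 mean caching is cheaper."""
    uncached = float(reuses)                                  # full price per call
    cached = (1 + write_premium) + (reuses - 1) * (1 - read_discount)
    return cached / uncached
```

A single use costs 1.25x (the write premium with no reads to amortize it); by the second use the ratio drops to 0.675, matching the 2-3 reuse break-even in the table.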
Optimal structure:
```
┌─────────────────────────────────┐
│ STATIC LAYER (Cached)           │
│ • System instructions           │
│ • Tool definitions              │
│ • Documentation/examples        │
│ [cache_control: ephemeral]      │
├─────────────────────────────────┤
│ DYNAMIC LAYER (Not cached)      │
│ • Conversation history          │
│ • Current query                 │
│ • Retrieved context             │
└─────────────────────────────────┘
```
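With Anthropic's Messages API, the boundary is an explicit `cache_control` marker on the last static block: everything up to and including that block is cached, later content stays dynamic. A minimal request body following this layering; the model ID and content strings are placeholders:

```python
# Request body with an explicit cache breakpoint after the static layer.
request = {
    "model": "claude-sonnet-4-20250514",  # placeholder model ID
    "max_tokens": 1024,
    "system": [
        {"type": "text", "text": "You are a coding agent..."},    # static instructions
        {
            "type": "text",
            "text": "<tool definitions and documentation here>",  # static docs/tools
            "cache_control": {"type": "ephemeral"},               # cache boundary
        },
    ],
    "messages": [
        {"role": "user", "content": "Current query goes here"},   # dynamic layer
    ],
}
```

Anything above the marker must be byte-identical across calls, or the cache misses; that is why conversation history and retrieved context belong below it.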
Multi-Agent Context Delegation
When context limits loom, delegate to sub-agents with fresh windows. But handoff quality is critical—most “agent failures” are actually context transfer failures.
Anti-pattern: Free-text handoff
"Please look at the code and fix the bug we discussed"
Pattern: Structured handoff
```json
{
  "task": {
    "objective": "Fix null pointer in UserService.java:142",
    "constraints": ["Don't modify public API"],
    "successCriteria": ["All tests pass"]
  },
  "context": {
    "relevantFiles": ["src/UserService.java"],
    "decisions": {"errorHandling": "Return Optional"},
    "currentState": "Bug identified, approach agreed"
  }
}
```
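A structured handoff only helps if it is validated before the sub-agent starts. A stdlib-only sketch matching the JSON above; a Pydantic model would serve the same role with stricter type checking:

```python
from dataclasses import dataclass, field

@dataclass
class HandoffTask:
    objective: str
    constraints: list[str] = field(default_factory=list)
    successCriteria: list[str] = field(default_factory=list)

@dataclass
class HandoffContext:
    relevantFiles: list[str] = field(default_factory=list)
    decisions: dict[str, str] = field(default_factory=dict)
    currentState: str = ""

def parse_handoff(payload: dict) -> tuple[HandoffTask, HandoffContext]:
    """Fail loudly on a malformed handoff instead of letting the
    sub-agent start with missing context."""
    task = HandoffTask(**payload["task"])
    if not task.objective:
        raise ValueError("Handoff must name a concrete objective")
    return task, HandoffContext(**payload.get("context", {}))
```

Rejecting the handoff at parse time converts a silent context-transfer failure into an explicit error the orchestrator can retry.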
The Decision Matrix
When to use which technique:
| Situation | Primary Technique | Avoid |
|---|---|---|
| Stateless chatbot | Sliding window trim | Summarization (overkill) |
| Long coding session | Compaction at 70-80% | Waiting until 95% |
| High-volume production | Prompt caching | Long context stuffing |
| Complex multi-step task | Multi-agent delegation | Single agent marathon |
| Knowledge-intensive QA | RAG | Pure long context |
The Numbers That Matter
| Metric | Target |
|---|---|
| Effective context | ~65% of claimed max |
| Compaction trigger | 70-80% utilization |
| Cache hit rate | >80% for stable prompts |
| Tool output budget | <30% of context |
| Cache break-even | 2-3 reuses (Anthropic) |
Observability Metrics
Track these to catch degradation before failure:
| Metric | Definition | Alert Threshold |
|---|---|---|
| Context Utilization | tokens_used / max_tokens | >85% |
| Context Efficiency | successful_actions / tokens_used | Decreasing trend |
| Compaction Frequency | compactions / session | >2 per session |
| Tool Output Ratio | tool_tokens / total_tokens | >50% |
Degradation signals:
- Agent re-asks clarified information (context loss)
- Agent undoes previous work (lost continuity)
- Increasing TTFT latency (context too large)
- Declining task success rate (position effects)
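The table's alert conditions reduce to a few ratios per session. A sketch using the thresholds above; the trend-based metrics (efficiency, declining success rate) need history and are omitted:

```python
def context_health(tokens_used: int, max_tokens: int,
                   tool_tokens: int, compactions: int) -> list[str]:
    """Return the alerts that fire, per the observability table."""
    alerts = []
    if tokens_used / max_tokens > 0.85:
        alerts.append("context_utilization")
    if tool_tokens / max(tokens_used, 1) > 0.50:
        alerts.append("tool_output_ratio")
    if compactions > 2:
        alerts.append("compaction_frequency")
    return alerts
```

A session at 180K of 200K tokens, with 100K of tool output and three compactions, trips all three alerts at once: a strong signal the agent should have delegated or compacted far earlier.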
Implementation Checklist
Building a context-managed agent:
| Area | Requirements |
|---|---|
| Token Accounting | Accurate counting (official tokenizers), breakdown by category (system, conversation, tools), alerts at 70%, 80%, 85% |
| Compaction System | Manual /compact command, auto-compact at 80% (configurable), user warning before compaction, structured prompt with required sections |
| Tool Output Management | Prune old outputs (keep last N), replace with summaries after processing, tag outputs for selective pruning |
| Caching Strategy | Identify static content, place cache boundary correctly, monitor hit rates, calculate ROI (break-even 2-3 uses) |
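The tool output row of the checklist can be sketched as keep-last-N pruning; the `"tool"` role name and placeholder string are assumptions about the message format:

```python
def prune_tool_outputs(history: list[dict], keep_last: int = 3,
                       placeholder: str = "[tool output pruned]") -> list[dict]:
    """Keep the last N tool results verbatim; replace older ones with a
    short placeholder so the conversation's shape is preserved."""
    tool_idx = [i for i, m in enumerate(history) if m["role"] == "tool"]
    prune = set(tool_idx[:-keep_last]) if keep_last else set(tool_idx)
    return [
        {**m, "content": placeholder} if i in prune else m
        for i, m in enumerate(history)
    ]
```

Replacing rather than deleting keeps the tool-call/tool-result pairing intact, which some APIs require, while reclaiming the bulk of the tokens.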
Recipe: Context Handoff Slash Command
For Claude Code users, add this to .claude/commands/handoff.md:
---
name: handoff
description: Create structured handoff for sub-agent or session continuation
arguments:
- name: target
description: "subagent, session, or teammate"
required: true
---
Create a structured handoff document for context transfer.
## Handoff Type: $ARGUMENTS.target
Analyze the current session and produce a handoff with these sections:
### 1. Mission State
- **Original objective**: What were we trying to accomplish?
- **Current status**: Where are we now? (not started / in progress / blocked / complete)
- **Remaining work**: What's left to do?
### 2. Key Decisions Made
List decisions with rationale:
| Decision | Choice | Why |
|----------|--------|-----|
| ... | ... | ... |
### 3. Critical Context
Information the recipient MUST know:
- File paths modified or relevant
- Variable names / function signatures introduced
- Constraints or requirements discovered
- Gotchas encountered
### 4. What NOT to Do
Anti-patterns or approaches we tried and rejected:
- ...
### 5. Suggested Next Action
The single most important next step.
---
Format as JSON if target is "subagent":
```json
{
"schemaVersion": "1.0.0",
"task": {
"objective": "...",
"constraints": [...],
"successCriteria": [...]
},
"context": {
"relevantFiles": [...],
"decisions": {...},
"currentState": "..."
}
}
```

**Usage:**
```bash
/handoff subagent # JSON for spawning sub-agent
/handoff session # Markdown for /compact or new session
/handoff teammate # Markdown for human handoff
```
Open Questions
- Optimal compaction prompts: What summary structure preserves most signal?
- Position-aware architectures: Do newer models actually fix “lost in middle”?
- Compression vs. accuracy curve: Where’s the elbow?
- Multi-agent overhead: When does delegation cost exceed benefit?
The Tacit Angle
Session memory becomes critical when context management is aggressive: every compaction loses information, and every sub-agent handoff produces context that could otherwise be searched later.
| Practice | Without Session Memory | With Session Memory |
|---|---|---|
| Aggressive compaction | Permanent loss | Searchable history |
| Sub-agent delegation | Scattered context | Unified view |
| Multi-worktree sessions | Isolated silos | Cross-session search |
The more aggressively you optimize context, the more valuable session persistence becomes.
Sources & Provenance
Verifiable sources. Dates matter. Credibility assessed.
Lost in the Middle: How Language Models Use Long Contexts ↗
Nelson F. Liu et al. · TACL 2024
"LLMs exhibit U-shaped accuracy: high retrieval at start/end, poor in middle. Occurs even for explicitly long-context models. Foundational finding for context management."
RULER: What's the Real Context Size of Your Long-Context Language Models? ↗
NVIDIA Research · COLM 2024
"Despite perfect needle-in-haystack scores, models fail on complex RULER tasks as length increases. Claimed context sizes overstate effective capacity."
Context Compaction Research: Claude Code, Codex CLI, OpenCode, Amp ↗
Mario Zechner (badlogic) · GitHub Gist
"Reverse-engineered compaction implementations. Claude Code at 95%, Codex CLI at 180K-244K, OpenCode with separate tool pruning, Amp preferring manual handoffs."
Prompt Caching Comparison: OpenAI vs Anthropic ↗
Will McGinnis · Personal Blog
"OpenAI automatic caching achieves ~50% hit rate. Anthropic explicit caching achieves 100% when correctly structured. Anthropic has 25% write premium but 90% read savings."
RAG vs Long-Context LLMs: A Comprehensive Study ↗
Various · arXiv
"Neither RAG nor long context is universally better. LC outperforms on average when resourced, but RAG's cost advantage is significant. Hybrid approaches emerging."
FlashAttention: Fast and Memory-Efficient Exact Attention ↗
Tri Dao et al. · NeurIPS 2022
"IO-aware attention algorithm achieving 2-4x speedups. Foundation for practical long-context processing. FlashAttention-3 adds 1.5-2x over v2."
LangChain ConversationSummaryBufferMemory ↗
LangChain · Official Docs
"Hybrid approach: recent turns at full fidelity, older turns summarized. Token-based threshold for when to summarize. Production-tested pattern."
Best Practices for Multi-Agent Orchestration and Reliable Handoffs ↗
Skywork AI · Company Blog
"Most agent failures are handoff failures, not model failures. Structured schemas with semver versioning. Validate with Pydantic. Free-text handoffs are anti-pattern."
Context Optimization for LLM Agents (Part 1) ↗
Vinay · Personal Blog
"'Reshape and Fit' hybrid approach: recent buffer (5 turns full), compaction (turns 6-20 stripped), summarization (20+), offloading to sub-agents."
LLM Observability: Monitoring Large Language Models ↗
Splunk · Company Blog
"Five pillars: response monitoring, automated evaluations, advanced filtering, application tracing, human-in-the-loop. Context utilization as key metric."
Prompt Caching: 10x Cheaper LLM Tokens ↗
ngrok · Company Blog
"Detailed explanation of KV cache mechanics. Cache boundaries, TTL management, minimum token thresholds. Practical implementation guidance."
OpenAI Agents SDK - Multi-Agent Orchestration ↗
OpenAI · Official Docs
"Agent = LLM + instructions + tools + handoffs. Autonomous planning with tool use. Structured handoff patterns for context transfer."