LLM Context Optimization: What Actually Works
A 200K context window doesn't mean 200K effective tokens. Research across academic papers, production systems (Claude Code, Codex CLI, Amp), and benchmarks reveals when to trim, summarize, cache, or delegate—and the pitfalls that break real agents.
TL;DR
Context window marketing overstates effective capacity. A 200K window typically degrades around 130K. Information in the middle of context is poorly retrieved—the “Lost in the Middle” phenomenon. Production systems like Claude Code, Codex CLI, and Amp use compaction at 85-95% capacity, but practitioner feedback suggests 70-80% works better. The strategy hierarchy: don’t fill it → position matters → cache the static → compact early → prune tool outputs → delegate to sub-agents.
Quick Reference
```
CONTEXT BUDGET
──────────────
System prompt:   Pin, never trim      (<10%)
Cached content:  Static docs, tools   (20-40%)
Conversation:    Recent turns         (20-30%)
Tool outputs:    Prune aggressively   (5-10%)
Reserve:         Response buffer      (15-20%)

THRESHOLDS
──────────
70%  Consider proactive compaction
80%  Warn user, prepare compaction
85%  Auto-compact (non-critical)
95%  Emergency (Claude Code default; too late)

POSITION RULES
──────────────
Start/End:  Critical info (high retrieval)
Middle:     Supporting detail only (low retrieval)
```
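The threshold ladder can be sketched as a simple utilization check. A minimal sketch; the function name is illustrative, and the cutoffs should be tuned per deployment:

```python
def compaction_action(tokens_used: int, max_tokens: int) -> str:
    """Map context utilization onto the threshold ladder above."""
    pct = tokens_used / max_tokens
    if pct >= 0.95:
        return "emergency_compact"   # too late: quality has likely degraded
    if pct >= 0.85:
        return "auto_compact"        # non-critical work: compact now
    if pct >= 0.80:
        return "warn_and_prepare"    # tell the user, stage a summary
    if pct >= 0.70:
        return "consider_compact"    # proactive compaction window
    return "ok"
```

For example, `compaction_action(150_000, 200_000)` lands at 75% utilization, inside the proactive window.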
Why This Matters
Every LLM agent eventually hits context limits. The naive solution—“just use a bigger window”—fails for three reasons:
- Latency scales with context (sometimes superlinearly)
- Cost scales linearly (every token costs money)
- Quality degrades (“lost in the middle” + context rot)
This research synthesizes what actually works in production, from academic foundations to reverse-engineered implementations.
The Core Problem: Lost in the Middle
The foundational finding (Liu et al., 2023): LLMs retrieve information from the beginning and end of context with high accuracy but struggle with middle-positioned information.
Accuracy vs. Position in Context
────────────────────────────────
High │ ███                ███ │
     │  ███              ███  │
     │   ███            ███   │
     │     ███        ███     │
Low  │       ██████████       │
     └────────────────────────┘
       Start    Middle     End
Root causes:
- Rotary Position Embedding (RoPE) introduces long-term decay
- Attention sinks create “attractors” at boundaries
- Training data bias (relevant info usually at start/end)
Implication: You can’t just stuff more context. Placement matters.
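One way to act on this when assembling a prompt is to place critical items at the boundaries and bury supporting detail in the middle. A minimal sketch; the split-the-critical-items strategy is an assumption, not something the paper prescribes:

```python
def position_aware_order(critical: list[str], supporting: list[str]) -> list[str]:
    """Order prompt segments so critical items land at the start and end,
    where retrieval accuracy is highest; supporting detail goes in the
    middle, where misses are cheapest."""
    half = len(critical) // 2
    # Front-load half the critical items, close with the rest.
    return critical[:half] + supporting + critical[half:]
```

So `position_aware_order(["sys", "query"], ["doc1", "doc2"])` yields `["sys", "doc1", "doc2", "query"]`.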
Production Compaction Systems Compared
How do real agent systems handle context limits?
| System | Trigger | Threshold | Key Mechanism |
|---|---|---|---|
| Claude Code | Manual /compact or auto | ~95% | Summary replaces history |
| Codex CLI | Manual or auto | 180K-244K | Summary + last 20K preserved |
| OpenCode | Manual or auto | Overflow check | Separate tool output pruning |
| Amp | Manual “Handoff” only | N/A | Short focused conversations |
The key insight from practitioner feedback: 95% is too late. By then, quality has already degraded. 70-80% produces better results.
Compaction Prompt Pattern
Derived from Codex CLI’s implementation:
```
Summarize this conversation for a follow-up session:

REQUIRED SECTIONS:
1. Completed Work: What was accomplished, final file states
2. In Progress: Current task state, blockers
3. Key Decisions: User constraints, architectural choices
4. Critical Context: Information essential for continuation

Keep: Technical specifics, file paths, variable names
Drop: Verbose tool outputs, exploratory dead-ends
```
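Wired into an agent loop, compaction replaces older turns with a summary produced from this template while keeping recent turns verbatim. A sketch, where `llm` stands in for any completion call and the message format is assumed:

```python
COMPACTION_PROMPT = """Summarize this conversation for a follow-up session:

REQUIRED SECTIONS:
1. Completed Work: What was accomplished, final file states
2. In Progress: Current task state, blockers
3. Key Decisions: User constraints, architectural choices
4. Critical Context: Information essential for continuation

Keep: Technical specifics, file paths, variable names
Drop: Verbose tool outputs, exploratory dead-ends"""

def compact(history: list[dict], llm, keep_recent: int = 5) -> list[dict]:
    """Replace old turns with a structured summary; keep recent turns intact."""
    old, recent = history[:-keep_recent], history[-keep_recent:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old)
    summary = llm(COMPACTION_PROMPT + "\n\n" + transcript)
    return [{"role": "user", "content": f"[Compacted history]\n{summary}"}] + recent
```

The summary re-enters the context as a single message, so the model treats it as established fact: this is exactly where context poisoning (see the pitfalls table) creeps in if the summary hallucinates.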
The Pitfalls
| Pitfall | How Teams Hit It | Mitigation |
|---|---|---|
| Context Poisoning | Hallucination during summarization becomes “fact” | Structured summaries, user validation |
| Premature Auto-Compact | Interrupts active work at 95% | Compact at 70-80%, warn user |
| Lost Continuity | Critical context trimmed (variable definitions) | Pin important context, role tagging |
| Over-trust in Window Size | Assume 200K means 200K effective | Test actual performance, use ~65% |
| Tool Output Bloat | Tool results dominate context | Separate tool output pruning |
Prompt Caching: 90% Cost Reduction
Prompt caching stores computed KV states for static prefixes. Both Anthropic and OpenAI offer this, but implementations differ significantly.
| Aspect | Anthropic | OpenAI |
|---|---|---|
| Control | Explicit cache_control markers | Automatic |
| Cache Hit Rate | 100% (when correct) | ~50% |
| Write Cost | 25% premium | No premium |
| Break-Even | 2-3 reuses | 1 reuse |
| Minimum Tokens | 1024-2048 | 1024 |
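The break-even row follows directly from the multipliers. A sketch of the relative cost of caching a prefix versus resending it, using Anthropic's published figures (25% write premium, 90% read savings); `reuses` counts total calls that include the prefix:

```python
def cache_roi(reuses: int, write_premium: float = 0.25,
              read_discount: float = 0.90) -> float:
    """Cost of a cached prefix relative to sending it uncached every call.
    Values below 1.0 mean caching is cheaper."""
    uncached = float(reuses)                                  # full price per call
    cached = (1 + write_premium) + (reuses - 1) * (1 - read_discount)
    return cached / uncached
```

A single use costs 1.25x (the write premium with no reads to amortize it); by the second use the ratio drops to 0.675, matching the 2-3 reuse break-even in the table.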
Optimal structure:
```
┌─────────────────────────────────┐
│ STATIC LAYER (Cached)           │
│ • System instructions           │
│ • Tool definitions              │
│ • Documentation/examples        │
│ [cache_control: ephemeral]      │
├─────────────────────────────────┤
│ DYNAMIC LAYER (Not cached)      │
│ • Conversation history          │
│ • Current query                 │
│ • Retrieved context             │
└─────────────────────────────────┘
```
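With Anthropic's Messages API, the boundary is an explicit `cache_control` marker on the last static block: everything up to and including that block is cached, later content stays dynamic. A minimal request body following this layering; the model ID and content strings are placeholders:

```python
# Request body with an explicit cache breakpoint after the static layer.
request = {
    "model": "claude-sonnet-4-20250514",  # placeholder model ID
    "max_tokens": 1024,
    "system": [
        {"type": "text", "text": "You are a coding agent..."},    # static instructions
        {
            "type": "text",
            "text": "<tool definitions and documentation here>",  # static docs/tools
            "cache_control": {"type": "ephemeral"},               # cache boundary
        },
    ],
    "messages": [
        {"role": "user", "content": "Current query goes here"},   # dynamic layer
    ],
}
```

Anything above the marker must be byte-identical across calls, or the cache misses; that is why conversation history and retrieved context belong below it.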
Multi-Agent Context Delegation
When context limits loom, delegate to sub-agents with fresh windows. But handoff quality is critical—most “agent failures” are actually context transfer failures.
Anti-pattern: Free-text handoff
"Please look at the code and fix the bug we discussed"
Pattern: Structured handoff
```json
{
  "task": {
    "objective": "Fix null pointer in UserService.java:142",
    "constraints": ["Don't modify public API"],
    "successCriteria": ["All tests pass"]
  },
  "context": {
    "relevantFiles": ["src/UserService.java"],
    "decisions": {"errorHandling": "Return Optional"},
    "currentState": "Bug identified, approach agreed"
  }
}
```
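A structured handoff only helps if it is validated before the sub-agent starts. A stdlib-only sketch matching the JSON above; a Pydantic model would serve the same role with stricter type checking:

```python
from dataclasses import dataclass, field

@dataclass
class HandoffTask:
    objective: str
    constraints: list[str] = field(default_factory=list)
    successCriteria: list[str] = field(default_factory=list)

@dataclass
class HandoffContext:
    relevantFiles: list[str] = field(default_factory=list)
    decisions: dict[str, str] = field(default_factory=dict)
    currentState: str = ""

def parse_handoff(payload: dict) -> tuple[HandoffTask, HandoffContext]:
    """Fail loudly on a malformed handoff instead of letting the
    sub-agent start with missing context."""
    task = HandoffTask(**payload["task"])
    if not task.objective:
        raise ValueError("Handoff must name a concrete objective")
    return task, HandoffContext(**payload.get("context", {}))
```

Rejecting the handoff at parse time converts a silent context-transfer failure into an explicit error the orchestrator can retry.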
The Decision Matrix
When to use which technique:
| Situation | Primary Technique | Avoid |
|---|---|---|
| Stateless chatbot | Sliding window trim | Summarization (overkill) |
| Long coding session | Compaction at 70-80% | Waiting until 95% |
| High-volume production | Prompt caching | Long context stuffing |
| Complex multi-step task | Multi-agent delegation | Single agent marathon |
| Knowledge-intensive QA | RAG | Pure long context |
The Numbers That Matter
| Metric | Target |
|---|---|
| Effective context | ~65% of claimed max |
| Compaction trigger | 70-80% utilization |
| Cache hit rate | >80% for stable prompts |
| Tool output budget | <30% of context |
| Cache break-even | 2-3 reuses (Anthropic) |
Observability Metrics
Track these to catch degradation before failure:
| Metric | Definition | Alert Threshold |
|---|---|---|
| Context Utilization | tokens_used / max_tokens | >85% |
| Context Efficiency | successful_actions / tokens_used | Decreasing trend |
| Compaction Frequency | compactions / session | >2 per session |
| Tool Output Ratio | tool_tokens / total_tokens | >50% |
Degradation signals:
- Agent re-asks clarified information (context loss)
- Agent undoes previous work (lost continuity)
- Increasing TTFT latency (context too large)
- Declining task success rate (position effects)
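The table's alert conditions reduce to a few ratios per session. A sketch using the thresholds above; the trend-based metrics (efficiency, declining success rate) need history and are omitted:

```python
def context_health(tokens_used: int, max_tokens: int,
                   tool_tokens: int, compactions: int) -> list[str]:
    """Return the alerts that fire, per the observability table."""
    alerts = []
    if tokens_used / max_tokens > 0.85:
        alerts.append("context_utilization")
    if tool_tokens / max(tokens_used, 1) > 0.50:
        alerts.append("tool_output_ratio")
    if compactions > 2:
        alerts.append("compaction_frequency")
    return alerts
```

A session at 180K of 200K tokens, with 100K of tool output and three compactions, trips all three alerts at once: a strong signal the agent should have delegated or compacted far earlier.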
Implementation Checklist
Building a context-managed agent:
| Area | Requirements |
|---|---|
| Token Accounting | Accurate counting (official tokenizers), breakdown by category (system, conversation, tools), alerts at 70%, 80%, 85% |
| Compaction System | Manual /compact command, auto-compact at 80% (configurable), user warning before compaction, structured prompt with required sections |
| Tool Output Management | Prune old outputs (keep last N), replace with summaries after processing, tag outputs for selective pruning |
| Caching Strategy | Identify static content, place cache boundary correctly, monitor hit rates, calculate ROI (break-even 2-3 uses) |
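The tool output row of the checklist can be sketched as keep-last-N pruning; the `"tool"` role name and placeholder string are assumptions about the message format:

```python
def prune_tool_outputs(history: list[dict], keep_last: int = 3,
                       placeholder: str = "[tool output pruned]") -> list[dict]:
    """Keep the last N tool results verbatim; replace older ones with a
    short placeholder so the conversation's shape is preserved."""
    tool_idx = [i for i, m in enumerate(history) if m["role"] == "tool"]
    prune = set(tool_idx[:-keep_last]) if keep_last else set(tool_idx)
    return [
        {**m, "content": placeholder} if i in prune else m
        for i, m in enumerate(history)
    ]
```

Replacing rather than deleting keeps the tool-call/tool-result pairing intact, which some APIs require, while reclaiming the bulk of the tokens.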
Recipe: Context Handoff Slash Command
For Claude Code users, add this to .claude/commands/handoff.md:
---
name: handoff
description: Create structured handoff for sub-agent or session continuation
arguments:
- name: target
description: "subagent, session, or teammate"
required: true
---
Create a structured handoff document for context transfer.
## Handoff Type: $ARGUMENTS.target
Analyze the current session and produce a handoff with these sections:
### 1. Mission State
- **Original objective**: What were we trying to accomplish?
- **Current status**: Where are we now? (not started / in progress / blocked / complete)
- **Remaining work**: What's left to do?
### 2. Key Decisions Made
List decisions with rationale:
| Decision | Choice | Why |
|----------|--------|-----|
| ... | ... | ... |
### 3. Critical Context
Information the recipient MUST know:
- File paths modified or relevant
- Variable names / function signatures introduced
- Constraints or requirements discovered
- Gotchas encountered
### 4. What NOT to Do
Anti-patterns or approaches we tried and rejected:
- ...
### 5. Suggested Next Action
The single most important next step.
---
Format as JSON if target is "subagent":
```json
{
"schemaVersion": "1.0.0",
"task": {
"objective": "...",
"constraints": [...],
"successCriteria": [...]
},
"context": {
"relevantFiles": [...],
"decisions": {...},
"currentState": "..."
}
}
```

**Usage:**
```bash
/handoff subagent # JSON for spawning sub-agent
/handoff session # Markdown for /compact or new session
/handoff teammate # Markdown for human handoff
```
Open Questions
- Optimal compaction prompts: What summary structure preserves most signal?
- Position-aware architectures: Do newer models actually fix “lost in middle”?
- Compression vs. accuracy curve: Where’s the elbow?
- Multi-agent overhead: When does delegation cost exceed benefit?
The Tacit Angle
Session memory becomes critical when context management is aggressive: every compaction loses information, and every sub-agent handoff produces context that could otherwise be searched later.
| Practice | Without Session Memory | With Session Memory |
|---|---|---|
| Aggressive compaction | Permanent loss | Searchable history |
| Sub-agent delegation | Scattered context | Unified view |
| Multi-worktree sessions | Isolated silos | Cross-session search |
The more aggressively you optimize context, the more valuable session persistence becomes.
Sources & Provenance
Verifiable sources. Dates matter. Credibility assessed.
Lost in the Middle: How Language Models Use Long Contexts ↗
Nelson F. Liu et al. · TACL 2024
"LLMs exhibit U-shaped accuracy: high retrieval at start/end, poor in middle. Occurs even for explicitly long-context models. Foundational finding for context management."
RULER: What's the Real Context Size of Your Long-Context Language Models? ↗
NVIDIA Research · COLM 2024
"Despite perfect needle-in-haystack scores, models fail on complex RULER tasks as length increases. Claimed context sizes overstate effective capacity."
Context Compaction Research: Claude Code, Codex CLI, OpenCode, Amp ↗
Mario Zechner (badlogic) · GitHub Gist
"Reverse-engineered compaction implementations. Claude Code at 95%, Codex CLI at 180K-244K, OpenCode with separate tool pruning, Amp preferring manual handoffs."
Prompt Caching Comparison: OpenAI vs Anthropic ↗
Will McGinnis · Personal Blog
"OpenAI automatic caching achieves ~50% hit rate. Anthropic explicit caching achieves 100% when correctly structured. Anthropic has 25% write premium but 90% read savings."
RAG vs Long-Context LLMs: A Comprehensive Study ↗
Various · arXiv
"Neither RAG nor long context is universally better. LC outperforms on average when resourced, but RAG's cost advantage is significant. Hybrid approaches emerging."
FlashAttention: Fast and Memory-Efficient Exact Attention ↗
Tri Dao et al. · NeurIPS 2022
"IO-aware attention algorithm achieving 2-4x speedups. Foundation for practical long-context processing. FlashAttention-3 adds 1.5-2x over v2."
LangChain ConversationSummaryBufferMemory ↗
LangChain · Official Docs
"Hybrid approach: recent turns at full fidelity, older turns summarized. Token-based threshold for when to summarize. Production-tested pattern."
Best Practices for Multi-Agent Orchestration and Reliable Handoffs ↗
Skywork AI · Company Blog
"Most agent failures are handoff failures, not model failures. Structured schemas with semver versioning. Validate with Pydantic. Free-text handoffs are anti-pattern."
Context Optimization for LLM Agents (Part 1) ↗
Vinay · Personal Blog
"'Reshape and Fit' hybrid approach: recent buffer (5 turns full), compaction (turns 6-20 stripped), summarization (20+), offloading to sub-agents."
LLM Observability: Monitoring Large Language Models ↗
Splunk · Company Blog
"Five pillars: response monitoring, automated evaluations, advanced filtering, application tracing, human-in-the-loop. Context utilization as key metric."
Prompt Caching: 10x Cheaper LLM Tokens ↗
ngrok · Company Blog
"Detailed explanation of KV cache mechanics. Cache boundaries, TTL management, minimum token thresholds. Practical implementation guidance."
OpenAI Agents SDK - Multi-Agent Orchestration ↗
OpenAI · Official Docs
"Agent = LLM + instructions + tools + handoffs. Autonomous planning with tool use. Structured handoff patterns for context transfer."