Programmatic Tool Calling: How AI Agents Learned to Use Your Computer
From autocomplete to autonomous agents. The evolution of AI tool calling — from Copilot's inline suggestions to Claude Code's bash execution, sub-agents, and MCP integration. What changed, what it means for developers, and where the evidence actually points.
TL;DR
AI coding agents went from suggesting lines (2021) to running your terminal (2025). PR volume up 98%, review time up 91%. Developers report feeling 20% faster — a randomized controlled trial found them 19% slower. Adoption is near-universal (84%), trust is low (33%). The shift is real. The productivity math is not settled.
The Evolution
Four phases, five years:
Phase 1: AUTOCOMPLETE (2021-2023)
─────────────────────────────────
Copilot, ChatGPT/Claude chat
→ AI suggests next line. Human accepts/rejects.
→ No tool calling. Human is the tool.
Phase 2: LIMITED TOOL CALLING (2023-2024)
─────────────────────────────────────────
Cursor, Aider, Windsurf
→ AI reads files, edits files, limited commands.
→ IDE-native. Single-file or repo-map context.
Phase 3: FULL TOOL CALLING (2025)
─────────────────────────────────
Claude Code, Codex CLI, Copilot Agent Mode
→ Read, write, bash, search, web, sub-agents.
→ Terminal-native. MCP for extensibility.
Phase 4: MULTI-AGENT (2026-emerging, speculative)
──────────────────────────────────────────────────
Swarm orchestration, autonomous PRs
→ Agents assigned issues, create PRs, review each other.
→ Developer as orchestrator, not implementer.
→ Early signals only. Not yet proven at scale.
The naming evolved with the phases. Karpathy coined “vibe coding” in February 2025 — “fully giving in to the vibes, embracing exponentials, forgetting the code exists.” By late 2025, he had moved on to “agentic engineering”: the developer orchestrates agents that write the code and acts as oversight.
What Programmatic Tool Calling Actually Is
Two meanings in circulation:
Broad sense (the industry shift)
AI coding agents that autonomously call tools — read files, write code, run commands, search codebases — during a coding session. The shift from human-initiated to agent-initiated tool use. This is what defines the “agentic coding” era.
Narrow/technical sense (Anthropic’s API feature)
A specific API capability (beta, November 2025) where Claude writes Python code to orchestrate multiple tool calls in a sandboxed container. Tool results are processed programmatically — intermediate results never enter the context window.
Why it matters technically:
TRADITIONAL TOOL CALLING
─────────────────────────
User prompt → Model reasons → Tool call #1 → Result in context
→ Model reasons → Tool call #2 → Result in context
→ Model reasons → Tool call #3 → Result in context
→ Model reasons → Final answer
Each step = full inference pass. All results accumulate in context.
PROGRAMMATIC TOOL CALLING
─────────────────────────
User prompt → Model writes Python code → Code runs in sandbox
├→ Tool call #1 → processed in code (not in context)
├→ Tool call #2 → processed in code (not in context)
└→ Tool call #3 → processed in code (not in context)
→ Only final output enters context → Model answers
One inference pass. Context stays clean.
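The sandbox-side orchestration is ordinary code. A minimal sketch of the idea, with hypothetical tool bindings (`search_logs`, `fetch_metrics`) and a `FakeTools` stub standing in for the sandbox — none of these names come from Anthropic's actual API:

```python
# Illustrative sketch of the kind of orchestration code the model writes
# under programmatic tool calling. Tool names and the FakeTools stub are
# hypothetical; in the real feature the sandbox wires tools in for you.

class FakeTools:
    """Stand-in for the sandboxed tool bindings."""
    def search_logs(self, query, last_hours):
        return [{"msg": "timeout"}] * 3   # pretend: 3 matching log lines

    def fetch_metrics(self):
        return [{"service": "checkout", "error_rate": 0.08},
                {"service": "search", "error_rate": 0.01}]

def orchestrate(tools):
    # Each call runs inside the sandbox; raw results never enter the
    # model's context window.
    errors = tools.search_logs(query="timeout", last_hours=24)
    metrics = tools.fetch_metrics()

    # Intermediate filtering happens in code, not via extra inference passes.
    spikes = [m for m in metrics if m["error_rate"] > 0.05]

    # Only this condensed summary returns to the model's context.
    return {"timeout_count": len(errors),
            "spiking_services": [m["service"] for m in spikes]}

print(orchestrate(FakeTools()))
# {'timeout_count': 3, 'spiking_services': ['checkout']}
```

The point of the shape: the loop over raw results lives in the sandbox, so the model pays one inference pass and sees only the final dict.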
Performance data from Anthropic:
- 37% token reduction on complex research tasks
- Eliminated 19+ inference passes in multi-step workflows
- Accuracy improved: Opus 4 from 49% to 74%; Opus 4.5 from 79.5% to 88.1%
Note: As of early 2026, the narrow PTC API feature is not yet available inside Claude Code itself (GitHub issue #12836). Claude Code uses the broad sense — full tool calling through its built-in tool suite.
What Full Tool Calling Looks Like in Practice
A Phase 3 agent (Claude Code, Codex CLI) has: file read/write, bash execution, regex search, glob matching, web access, and sub-agent spawning. The key architectural detail is sub-agents with fresh context windows — the main agent sends a focused task to a sub-agent (Explore, Plan, or General-Purpose), and receives a condensed summary (1-2K tokens) instead of raw exploration results (20-50K tokens). This is the Isolate strategy from context engineering applied at the tool level.
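The Isolate pattern fits in a few lines. Everything below — the `explore_subagent` function, the toy in-memory codebase — is hypothetical illustration, not Claude Code internals:

```python
# Minimal sketch of the Isolate pattern: the sub-agent burns its own
# fresh context on raw exploration and hands back only a condensed
# digest. All names here are invented for illustration.

def explore_subagent(task, files):
    """Runs with a fresh context: reads everything, returns a digest."""
    raw_findings = []
    for path, text in files.items():   # stand-in for Read/Grep tool calls
        if task in text:
            raw_findings.append(f"{path}: mentions '{task}'")
    # The 20-50K tokens of raw exploration stay inside this function;
    # the main agent sees only the few lines below.
    return {"task": task, "hits": len(raw_findings),
            "summary": raw_findings[:3]}

codebase = {
    "auth/session.py": "token refresh and session timeout handling",
    "api/routes.py": "request routing",
    "auth/tokens.py": "token refresh logic lives here",
}
digest = explore_subagent("token refresh", codebase)
print(digest["hits"])   # 2 — the main agent gets this digest, not the files
```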
MCP (Model Context Protocol) extends this further — databases, APIs, monitoring, Slack, GitHub all become callable tools. The integration model shifts from M×N custom integrations to M+N standardized connections.
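The M×N versus M+N arithmetic, with illustrative counts (the numbers are made up to show the shape of the saving):

```python
# M x N -> M + N: why a shared protocol collapses integration count.
clients = 5    # hypothetical: agents/IDEs that speak MCP
servers = 8    # hypothetical: databases, GitHub, Slack, monitoring, ...

custom_integrations = clients * servers  # every pair hand-wired
mcp_connections = clients + servers      # each side implements MCP once

print(custom_integrations, mcp_connections)  # 40 13
```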
The Permission Model
The core tension: more autonomy = more productivity, but also more risk.
How Claude Code handles it
Four modes: Normal (asks before destructive actions), Auto-accept (approves most), Plan (read-only), Bypass (full autonomy). Granular rules in settings.json:
Allow: Read, Bash(git *), Bash(npm *)
Deny: Read(.env*), Bash(rm *), Bash(sudo *)
Evaluation order: Deny > Ask > Allow. Deny always wins.
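The precedence rule can be sketched as a toy evaluator. The glob matching below uses Python's `fnmatch` as a stand-in, and the rule set is illustrative — Claude Code's real matcher syntax and semantics may differ:

```python
# Toy sketch of the Deny > Ask > Allow evaluation order.
# fnmatch globbing is a simplification of the real rule matcher.
from fnmatch import fnmatch

RULES = {
    "deny":  ["Read(.env*)", "Bash(rm *)", "Bash(sudo *)"],
    "ask":   ["Bash(git push*)"],
    "allow": ["Read(*)", "Bash(git *)", "Bash(npm *)"],
}

def evaluate(action):
    # Deny is checked first, so it always wins over Ask and Allow.
    for verdict in ("deny", "ask", "allow"):
        if any(fnmatch(action, pattern) for pattern in RULES[verdict]):
            return verdict
    return "ask"   # unmatched actions fall back to asking the human

print(evaluate("Bash(git status)"))   # allow
print(evaluate("Bash(rm -rf /tmp)"))  # deny
print(evaluate("Read(.env.local)"))   # deny — even though Read(*) would allow
```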
Trust calibration data
From Anthropic’s research on agent autonomy (February 2026):
| Metric | Value |
|---|---|
| New users: auto-approve rate | ~20% |
| Experienced users (750+ sessions) | >40% |
| Human interventions per session | Decreased from 5.4 to 3.3 |
| Claude proactively stops vs human interrupts | 2x more often |
| Irreversible actions among all tool calls | 0.8% |
The agent self-regulates. It stops to ask for clarification more often than humans interrupt it. Common stops: presenting approach choices (35%), gathering diagnostic info (21%), requesting credentials (12%).
The trust learning curve
A UC San Diego / Cornell study found professional developers do not blindly trust agents. All 13 observed developers controlled software design themselves or thoroughly revised agent plans. The study estimates ~2,000 hours (a full year) to develop calibrated trust — trust based on understanding the AI’s strengths and failure modes through direct experience.
The Numbers: What’s Actually Happening
Adoption
| Metric | Value | Source |
|---|---|---|
| Developers using AI tools weekly | 65% | Stack Overflow 2025 |
| Using or planning to use AI tools | 84% | Stack Overflow 2025 |
| Running 3+ AI tools in parallel | 59% | Greptile 2025 |
| Repos with CLAUDE.md files | 67% | Greptile 2025 |
| Claude Code user growth since Claude 4 | +300% | Anthropic 2026 |
The productivity paradox
| Metric | Value | Source |
|---|---|---|
| PR volume increase | +98% | Faros AI |
| PR review time increase | +91% | Faros AI |
| Lines per developer increase | +76% | Greptile 2025 |
| PR size increase | +33% | Greptile 2025 |
| Developer self-estimated speed gain | +20% | METR RCT |
| Actual measured speed | -19% (slower) | METR RCT |
The METR study is the most rigorous data point: 16 experienced open-source developers, their own repositories (22K+ stars, 1M+ lines), randomly assigned to use or not use AI tools. Developers using AI were 19% slower — while estimating they were 20% faster. A nearly 40-percentage-point perception gap.
Contributing factors: developers accepted less than 44% of AI generations. The overhead of reviewing, testing, and rejecting suggestions consumed more time than it saved on large, complex, mature codebases.
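The perception gap is simple arithmetic on the study's own numbers:

```python
# The METR numbers as arithmetic: self-estimate vs measured outcome.
perceived_speedup = +0.20   # devs estimated they were 20% faster
measured_speedup  = -0.19   # the trial measured them 19% slower

gap_points = (perceived_speedup - measured_speedup) * 100
print(f"perception gap: {gap_points:.0f} percentage points")  # 39

# A task that took 10 hours without AI took roughly 10 * 1.19 hours with it.
with_ai_hours = 10 * (1 - measured_speedup)
print(round(with_ai_hours, 1))  # 11.9
```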
The quality tax
| Metric | Value | Source |
|---|---|---|
| Issues per AI PR vs human PR | 10.83 vs 6.45 | CodeRabbit |
| AI code: critical issue multiplier | 1.4x more | CodeRabbit |
| XSS vulnerabilities in AI code | 2.74x more likely | CodeRabbit |
| AI code failing security tests | 45% | Veracode |
| New security findings from AI code (Jun 2025) | 10,000+/month (10x spike) | Apiiro |
Trust
| Metric | Value | Source |
|---|---|---|
| Developers who trust AI accuracy | 33% | Stack Overflow 2025 |
| Developers who distrust AI accuracy | 46% | Stack Overflow 2025 |
| "Highly trusting" | 3% | Stack Overflow 2025 |
| Positive sentiment for AI tools (2023-24 → 2025) | 70%+ → 60% | Stack Overflow 2025 |
What Changes for Developers
The role inversion
Developer goes from writer to reviewer and architect.
“Treat AI as an over-confident junior developer.” — Addy Osmani
What changes in practice:
- Planning time increases: 15-minute spec sessions before implementation become standard
- Micro-task decomposition: Break work into focused chunks — agents do best with targeted prompts
- Version control as safety net: Granular commits enable quick rollbacks
- Multi-model arbitrage: When one model gets stuck, try another
- Agent configuration files: CLAUDE.md, .cursorrules — a new developer artifact
The review bottleneck
Code generation got dramatically faster. Code review didn’t:
- Senior engineers spend 4.3 minutes reviewing AI code vs 1.2 minutes for human code — 3.6x per review unit
- AI generates 6.4x more code than humans for the same requirements
- The nature of review changed from “does this work?” to “do we need all of this?”
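Those two ratios compound. A back-of-envelope sketch, assuming (as a simplification) that per-unit review cost and code volume multiply:

```python
# Back-of-envelope combination of the LogRocket numbers quoted above.
# Treating the two ratios as multiplicative is a simplification.
human_review_min = 1.2
ai_review_min = 4.3
per_unit_ratio = ai_review_min / human_review_min   # ~3.6x per review unit
code_volume_ratio = 6.4                             # AI emits 6.4x more code

total_review_multiplier = per_unit_ratio * code_volume_ratio
print(round(per_unit_ratio, 1))        # 3.6
print(round(total_review_multiplier))  # ~23x total review load
```

Even if the multiplicative assumption overstates it, the direction is clear: review load grows far faster than either ratio alone suggests.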
Cursor’s $290M+ acquisition of Graphite (December 2025) confirmed this is a real bottleneck, not a theoretical one. Their CEO: “code review is taking a growing share of developer time as writing code keeps shrinking.”
The CLAUDE.md pattern
67% of repos now contain CLAUDE.md files. This is a new category of developer artifact: instructions written by humans for consumption by AI, checked into version control alongside the code. It sits at the intersection of documentation, configuration, and prompt engineering.
The fragmentation problem: CLAUDE.md for Claude Code, .cursor/rules/*.mdc for Cursor, .github/copilot-instructions.md for Copilot. AGENTS.md emerged in July 2025 as an open standard to solve this — one file for any agent.
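For concreteness, a minimal, entirely hypothetical example of the artifact category — the commands, paths, and conventions below are invented:

```markdown
# CLAUDE.md — project conventions for coding agents (illustrative)

## Build & test
- `npm run build` to compile; `npm test` before every commit

## Conventions
- TypeScript strict mode; no `any` without a justifying comment
- Never touch files under `migrations/` without asking

## Context
- Auth logic lives in `src/auth/`; start there for session bugs
```

The same content would work as an AGENTS.md under the open standard; the file is documentation, configuration, and prompt in one.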
Beyond Developers
At Epic (healthcare technology), over half of Claude Code usage is by non-developer roles — support and implementation staff doing tasks that previously required engineering. AI now writes ~30% of Microsoft’s code and 25%+ of Google’s (MIT Technology Review). At Anthropic, ~90% of Claude Code’s own codebase is written by Claude Code.
Recommendations
Based on the evidence:
- Restructure review before scaling generation. Teams adding AI coding tools without changing how they review will get slower, not faster (Faros AI, LogRocket). Budget 3-4x review time per AI-generated PR.
- Start with permission controls, loosen with experience. New users auto-approve ~20% of actions; experienced users exceed 40% (Anthropic). Don’t skip to full autonomy. The UC San Diego study found all 13 professional developers maintained manual oversight.
- Treat CLAUDE.md as a first-class artifact. 67% of repos already have one (Greptile). Make it part of your onboarding, not an afterthought. Document conventions explicitly — the agent reads what you write.
- Expect the quality tax. AI PRs carry 1.4x more critical issues (CodeRabbit), 2.74x more XSS vulnerabilities, and 45% fail security tests (Veracode). Add automated quality gates before the human review step.
- Track what the agent does. The more autonomous the agent, the more valuable the decision trail. Session logs, tool call histories, and permission patterns are the new audit artifacts.
Open Questions
- Is the METR result generalizable? 16 developers, specific repos. Does it hold across team sizes and project types?
- When does the quality tax break even? At what point does AI speed offset review overhead?
- Will review automation work? Current AI review catches 44-82% of issues. Is that enough to close the bottleneck?
- How does PTC (narrow sense) change the game when it lands in Claude Code? Context preservation + multi-tool orchestration could meaningfully shift the autonomy curve.
- Enterprise governance: Only 6% have advanced AI security strategies while 40% of apps embed agents. The EU AI Act deadline is August 2026.
The Tacit Angle
Programmatic tool calling generates orders of magnitude more session data than manual coding. A single Claude Code session can produce 50-200+ tool calls — each one a decision the agent made about your codebase. That’s a fundamentally different volume of context than a chat thread.
| What Happens Today | What Session Memory Enables |
|---|---|
| Agent explores codebase, finds the fix, session ends. Next bug: starts from zero. | "Last time this module broke, the agent traced it to X — start there." |
| You calibrate permissions manually each session (20% → 40% over months) | Permission patterns persist: “this agent is safe with test files, not with configs” |
| Sub-agent findings scattered — main agent gets summary, details lost | Full sub-agent traces searchable: what it tried, what it rejected, why |
| METR’s 19% slowdown partly from re-doing work AI already explored | Prior exploration reusable — skip re-generating the 56%+ of suggestions that get rejected |
The METR study found developers accepted fewer than 44% of AI generations — more than half of AI output is wasted exploration. Session memory turns rejected approaches into “don’t try this again” signals.
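What mining that session data might look like, sketched against an invented log schema (real agent logs will differ):

```python
# Hypothetical sketch of mining a session's tool-call log for reusable
# signals. The log schema here is invented for illustration.
session_log = [
    {"tool": "Grep", "args": "timeout", "outcome": "found auth/session.py"},
    {"tool": "Edit", "args": "auth/session.py", "outcome": "rejected"},
    {"tool": "Edit", "args": "auth/tokens.py", "outcome": "accepted"},
    {"tool": "Bash", "args": "npm test", "outcome": "passed"},
]

def extract_signals(log):
    """Turn one session into 'don't retry' / 'start here' memory entries."""
    dont_retry = [e["args"] for e in log
                  if e["tool"] == "Edit" and e["outcome"] == "rejected"]
    accepted = [e["args"] for e in log
                if e["tool"] == "Edit" and e["outcome"] == "accepted"]
    return {"avoid": dont_retry, "start_here": accepted}

print(extract_signals(session_log))
# {'avoid': ['auth/session.py'], 'start_here': ['auth/tokens.py']}
```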
Confidence Assessment
| Claim | Confidence |
|---|---|
| Tool calling is the defining shift in AI coding | High — universal adoption, clear phase transition |
| PR volume is up ~98% | High — Faros AI data, 10K+ developers |
| Experienced devs may be slower with AI (METR) | High — randomized controlled trial |
| Developers overestimate AI speed gains | High — 40-point perception gap in METR |
| AI code has more quality issues | High — CodeRabbit, Veracode, Apiiro converge |
| Review is the new bottleneck | High — data + $290M acquisition confirms |
| CLAUDE.md is standard practice | Medium — 67% adoption, but fragmented |
| Multi-agent is the next phase | Medium — emerging, not yet proven at scale |
Sources & Provenance
Verifiable sources. Dates matter. Credibility assessed.
Introducing Advanced Tool Use ↗
Anthropic · Anthropic Engineering
"Programmatic tool calling reduces tokens 37%, eliminates 19+ inference passes, improves accuracy from 49% to 74% (Opus 4). Tool Search reduces tool definitions from 77K to 8.7K tokens."
Measuring AI Agent Autonomy in Practice ↗
Anthropic Research · Anthropic
"Human interventions per session decreased from 5.4 to 3.3. Claude proactively stops 2x more than humans interrupt. New users auto-approve 20%, experienced users 40%+. Only 0.8% of tool calls appear irreversible."
AI-Assisted Development: METR Randomized Controlled Trial ↗
METR · METR
"16 experienced devs, their own repos (22K+ stars). AI users were 19% slower while estimating 20% faster. Devs accepted < 44% of AI generations. The most rigorous productivity study to date."
Stack Overflow 2025 Developer Survey ↗
Stack Overflow · Stack Overflow
"84% use or plan to use AI tools. Only 33% trust output, 46% distrust. Positive sentiment dropped from 70%+ to 60%. 45% cite 'almost right but not quite' as top frustration."
Professional Software Developers Don't Vibe, They Control ↗
UC San Diego / Cornell · arXiv
"All 13 observed developers controlled design themselves. 9 of 13 carefully reviewed every code change. Estimates ~2,000 hours to develop calibrated trust in AI agents."
State of AI vs. Human Code Generation Report ↗
CodeRabbit · CodeRabbit
"AI PRs have 10.83 issues vs 6.45 for human PRs. AI code has 1.4x more critical issues, 1.7x more major issues, 2.74x more XSS vulnerabilities."
AI Software Engineering: Impact on Developer Productivity ↗
Faros AI · Faros AI
"10,000+ developers analyzed. PR volume up 98%, review time up 91%. The defining data point for the review bottleneck thesis."
State of AI Coding 2025 ↗
Greptile · Greptile
"67% of repos have CLAUDE.md. Lines per developer up 76%. PR size up 33%. 59% of developers run 3+ AI tools in parallel."
My LLM Coding Workflow Going into 2026 ↗
Addy Osmani · Personal Blog
"Treat AI as 'an over-confident junior developer.' Planning time increases. Micro-task decomposition becomes standard. At Anthropic, ~90% of Claude Code's codebase is written by Claude Code."
AI Coding Tools Shift Bottleneck to Review ↗
LogRocket · LogRocket
"Senior engineers spend 4.3 min reviewing AI code vs 1.2 min for human code. AI generates 6.4x more code for same task. Review shifted from 'does this work?' to 'do we need all of this?'"
Eight Trends Defining How Software Gets Built in 2026 ↗
Anthropic · Claude Blog
"Multi-agent coordination, AI-automated review, extending beyond engineering teams, security from the start. The four priorities for agentic coding in 2026."
Cursor Acquires Graphite ↗
TechCrunch · TechCrunch
"$290M+ acquisition. CEO: 'code review is taking a growing share of developer time as writing code keeps shrinking.' Stacked PRs as the solution."
AI Code Security: 10x Vulnerability Spike ↗
Apiiro · Apiiro
"New security findings from AI code spiked to 10,000+/month — a 10x increase in 6 months. Speed gains partially offset by downstream quality costs."
Generative Coding: 2026 Breakthrough Technology ↗
MIT Technology Review · MIT Technology Review
"AI writes ~30% of Microsoft's code, 25%+ of Google's. Named one of 10 Breakthrough Technologies of 2026."