Programmatic Tool Calling: How AI Agents Learned to Use Your Computer
From autocomplete to autonomous agents. The evolution of AI tool calling — from Copilot's inline suggestions to Claude Code's bash execution, sub-agents, and MCP integration. What changed, what it means for developers, and where the evidence actually points.
TL;DR
AI coding agents went from suggesting lines (2021) to running your terminal (2025). PR volume up 98%, review time up 91%. Developers report feeling 20% faster — a randomized controlled trial found them 19% slower. Adoption is near-universal (84%), trust is low (33%). The shift is real. The productivity math is not settled.
The Evolution
Four phases, five years:
Phase 1: AUTOCOMPLETE (2021-2023)
─────────────────────────────────
Copilot, ChatGPT/Claude chat
→ AI suggests next line. Human accepts/rejects.
→ No tool calling. Human is the tool.
Phase 2: LIMITED TOOL CALLING (2023-2024)
─────────────────────────────────────────
Cursor, Aider, Windsurf
→ AI reads files, edits files, limited commands.
→ IDE-native. Single-file or repo-map context.
Phase 3: FULL TOOL CALLING (2025)
─────────────────────────────────
Claude Code, Codex CLI, Copilot Agent Mode
→ Read, write, bash, search, web, sub-agents.
→ Terminal-native. MCP for extensibility.
Phase 4: MULTI-AGENT (2026-emerging, speculative)
──────────────────────────────────────────────────
Swarm orchestration, autonomous PRs
→ Agents assigned issues, create PRs, review each other.
→ Developer as orchestrator, not implementer.
→ Early signals only. Not yet proven at scale.
The naming evolved with the phases. Karpathy coined “vibe coding” in February 2025 — “fully giving in to the vibes, embracing exponentials, forgetting the code exists.” By late 2025, he had moved on to “agentic engineering”: the developer orchestrates agents that write the code and acts as oversight.
What Programmatic Tool Calling Actually Is
Two meanings in circulation:
Broad sense (the industry shift)
AI coding agents that autonomously call tools — read files, write code, run commands, search codebases — during a coding session. The shift from human-initiated to agent-initiated tool use. This is what defines the “agentic coding” era.
Narrow/technical sense (Anthropic’s API feature)
A specific API capability (beta, November 2025) where Claude writes Python code to orchestrate multiple tool calls in a sandboxed container. Tool results are processed programmatically — intermediate results never enter the context window.
Why it matters technically:
TRADITIONAL TOOL CALLING
─────────────────────────
User prompt → Model reasons → Tool call #1 → Result in context
→ Model reasons → Tool call #2 → Result in context
→ Model reasons → Tool call #3 → Result in context
→ Model reasons → Final answer
Each step = full inference pass. All results accumulate in context.
PROGRAMMATIC TOOL CALLING
─────────────────────────
User prompt → Model writes Python code → Code runs in sandbox
├→ Tool call #1 → processed in code (not in context)
├→ Tool call #2 → processed in code (not in context)
└→ Tool call #3 → processed in code (not in context)
→ Only final output enters context → Model answers
One inference pass. Context stays clean.
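The sandbox-side orchestration is ordinary code. A minimal sketch of the idea, with hypothetical tool bindings (`search_logs`, `fetch_metrics`) and a `FakeTools` stub standing in for the sandbox — none of these names come from Anthropic's actual API:

```python
# Illustrative sketch of the kind of orchestration code the model writes
# under programmatic tool calling. Tool names and the FakeTools stub are
# hypothetical; in the real feature the sandbox wires tools in for you.

class FakeTools:
    """Stand-in for the sandboxed tool bindings."""
    def search_logs(self, query, last_hours):
        return [{"msg": "timeout"}] * 3   # pretend: 3 matching log lines

    def fetch_metrics(self):
        return [{"service": "checkout", "error_rate": 0.08},
                {"service": "search", "error_rate": 0.01}]

def orchestrate(tools):
    # Each call runs inside the sandbox; raw results never enter the
    # model's context window.
    errors = tools.search_logs(query="timeout", last_hours=24)
    metrics = tools.fetch_metrics()

    # Intermediate filtering happens in code, not via extra inference passes.
    spikes = [m for m in metrics if m["error_rate"] > 0.05]

    # Only this condensed summary returns to the model's context.
    return {"timeout_count": len(errors),
            "spiking_services": [m["service"] for m in spikes]}

print(orchestrate(FakeTools()))
# {'timeout_count': 3, 'spiking_services': ['checkout']}
```

The point of the shape: the loop over raw results lives in the sandbox, so the model pays one inference pass and sees only the final dict.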
Performance data from Anthropic:
- 37% token reduction on complex research tasks
- Eliminated 19+ inference passes in multi-step workflows
- Accuracy improved: Opus 4 from 49% to 74%; Opus 4.5 from 79.5% to 88.1%
Note: As of early 2026, the narrow PTC API feature is not yet available inside Claude Code itself (GitHub issue #12836). Claude Code uses the broad sense — full tool calling through its built-in tool suite.
What Full Tool Calling Looks Like in Practice
A Phase 3 agent (Claude Code, Codex CLI) has: file read/write, bash execution, regex search, glob matching, web access, and sub-agent spawning. The key architectural detail is sub-agents with fresh context windows — the main agent sends a focused task to a sub-agent (Explore, Plan, or General-Purpose), and receives a condensed summary (1-2K tokens) instead of raw exploration results (20-50K tokens). This is the Isolate strategy from context engineering applied at the tool level.
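The Isolate pattern fits in a few lines. Everything below — the `explore_subagent` function, the toy in-memory codebase — is hypothetical illustration, not Claude Code internals:

```python
# Minimal sketch of the Isolate pattern: the sub-agent burns its own
# fresh context on raw exploration and hands back only a condensed
# digest. All names here are invented for illustration.

def explore_subagent(task, files):
    """Runs with a fresh context: reads everything, returns a digest."""
    raw_findings = []
    for path, text in files.items():   # stand-in for Read/Grep tool calls
        if task in text:
            raw_findings.append(f"{path}: mentions '{task}'")
    # The 20-50K tokens of raw exploration stay inside this function;
    # the main agent sees only the few lines below.
    return {"task": task, "hits": len(raw_findings),
            "summary": raw_findings[:3]}

codebase = {
    "auth/session.py": "token refresh and session timeout handling",
    "api/routes.py": "request routing",
    "auth/tokens.py": "token refresh logic lives here",
}
digest = explore_subagent("token refresh", codebase)
print(digest["hits"])   # 2 — the main agent gets this digest, not the files
```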
MCP (Model Context Protocol) extends this further — databases, APIs, monitoring, Slack, GitHub all become callable tools. The integration model shifts from M×N custom integrations to M+N standardized connections.
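The M×N versus M+N arithmetic, with illustrative counts (the numbers are made up to show the shape of the saving):

```python
# M x N -> M + N: why a shared protocol collapses integration count.
clients = 5    # hypothetical: agents/IDEs that speak MCP
servers = 8    # hypothetical: databases, GitHub, Slack, monitoring, ...

custom_integrations = clients * servers  # every pair hand-wired
mcp_connections = clients + servers      # each side implements MCP once

print(custom_integrations, mcp_connections)  # 40 13
```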
The Permission Model
The core tension: more autonomy = more productivity, but also more risk.
How Claude Code handles it
Four modes: Normal (asks before destructive actions), Auto-accept (approves most), Plan (read-only), Bypass (full autonomy). Granular rules in settings.json:
Allow: Read, Bash(git *), Bash(npm *)
Deny: Read(.env*), Bash(rm *), Bash(sudo *)
Evaluation order: Deny > Ask > Allow. Deny always wins.
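The precedence rule can be sketched as a toy evaluator. The glob matching below uses Python's `fnmatch` as a stand-in, and the rule set is illustrative — Claude Code's real matcher syntax and semantics may differ:

```python
# Toy sketch of the Deny > Ask > Allow evaluation order.
# fnmatch globbing is a simplification of the real rule matcher.
from fnmatch import fnmatch

RULES = {
    "deny":  ["Read(.env*)", "Bash(rm *)", "Bash(sudo *)"],
    "ask":   ["Bash(git push*)"],
    "allow": ["Read(*)", "Bash(git *)", "Bash(npm *)"],
}

def evaluate(action):
    # Deny is checked first, so it always wins over Ask and Allow.
    for verdict in ("deny", "ask", "allow"):
        if any(fnmatch(action, pattern) for pattern in RULES[verdict]):
            return verdict
    return "ask"   # unmatched actions fall back to asking the human

print(evaluate("Bash(git status)"))   # allow
print(evaluate("Bash(rm -rf /tmp)"))  # deny
print(evaluate("Read(.env.local)"))   # deny — even though Read(*) would allow
```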
Trust calibration data
From Anthropic’s research on agent autonomy (February 2026):
| Metric | Value |
|---|---|
| New users: auto-approve rate | ~20% |
| Experienced users (750+ sessions) | >40% |
| Human interventions per session | Decreased from 5.4 to 3.3 |
| Claude proactively stops vs human interrupts | 2x more often |
| Irreversible actions among all tool calls | 0.8% |
The agent self-regulates. It stops to ask for clarification more often than humans interrupt it. Common stops: presenting approach choices (35%), gathering diagnostic info (21%), requesting credentials (12%).
The trust learning curve
A UC San Diego / Cornell study found professional developers do not blindly trust agents. All 13 observed developers controlled software design themselves or thoroughly revised agent plans. The study estimates ~2,000 hours (a full year) to develop calibrated trust — trust based on understanding the AI’s strengths and failure modes through direct experience.
The Numbers: What’s Actually Happening
Adoption
| Metric | Value | Source |
|---|---|---|
| Developers using AI tools weekly | 65% | Stack Overflow 2025 |
| Using or planning to use AI tools | 84% | Stack Overflow 2025 |
| Running 3+ AI tools in parallel | 59% | Greptile 2025 |
| Repos with CLAUDE.md files | 67% | Greptile 2025 |
| Claude Code user growth since Claude 4 | +300% | Anthropic 2026 |
The productivity paradox
| Metric | Value | Source |
|---|---|---|
| PR volume increase | +98% | Faros AI |
| PR review time increase | +91% | Faros AI |
| Lines per developer increase | +76% | Greptile 2025 |
| PR size increase | +33% | Greptile 2025 |
| Developer self-estimated speed gain | +20% | METR RCT |
| Actual measured speed | -19% (slower) | METR RCT |
The METR study is the most rigorous data point: 16 experienced open-source developers, their own repositories (22K+ stars, 1M+ lines), randomly assigned to use or not use AI tools. Developers using AI were 19% slower — while estimating they were 20% faster. A nearly 40-percentage-point perception gap.
Contributing factors: developers accepted less than 44% of AI generations. The overhead of reviewing, testing, and rejecting suggestions consumed more time than it saved on large, complex, mature codebases.
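The perception gap is simple arithmetic on the study's own numbers:

```python
# The METR numbers as arithmetic: self-estimate vs measured outcome.
perceived_speedup = +0.20   # devs estimated they were 20% faster
measured_speedup  = -0.19   # the trial measured them 19% slower

gap_points = (perceived_speedup - measured_speedup) * 100
print(f"perception gap: {gap_points:.0f} percentage points")  # 39

# A task that took 10 hours without AI took roughly 10 * 1.19 hours with it.
with_ai_hours = 10 * (1 - measured_speedup)
print(round(with_ai_hours, 1))  # 11.9
```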
The quality tax
| Metric | Value | Source |
|---|---|---|
| Issues per AI PR vs human PR | 10.83 vs 6.45 | CodeRabbit |
| AI code: critical issue multiplier | 1.4x more | CodeRabbit |
| XSS vulnerabilities in AI code | 2.74x more likely | CodeRabbit |
| AI code failing security tests | 45% | Veracode |
| New security findings from AI code (Jun 2025) | 10,000+/month (10x spike) | Apiiro |
Trust
| Metric | Value | Source |
|---|---|---|
| Developers who trust AI accuracy | 33% | Stack Overflow 2025 |
| Developers who distrust AI accuracy | 46% | Stack Overflow 2025 |
| "Highly trusting" | 3% | Stack Overflow 2025 |
| Positive sentiment for AI tools (2023-24 → 2025) | 70%+ → 60% | Stack Overflow 2025 |
What Changes for Developers
The role inversion
Developer goes from writer to reviewer and architect.
“Treat AI as an over-confident junior developer.” — Addy Osmani
What changes in practice:
- Planning time increases: 15-minute spec sessions before implementation become standard
- Micro-task decomposition: Break work into focused chunks — agents do best with targeted prompts
- Version control as safety net: Granular commits enable quick rollbacks
- Multi-model arbitrage: When one model gets stuck, try another
- Agent configuration files: CLAUDE.md, .cursorrules — a new developer artifact
The review bottleneck
Code generation got dramatically faster. Code review didn’t:
- Senior engineers spend 4.3 minutes reviewing AI code vs 1.2 minutes for human code — 3.6x per review unit
- AI generates 6.4x more code than humans for the same requirements
- The nature of review changed from “does this work?” to “do we need all of this?”
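Those two ratios compound. A back-of-envelope sketch, assuming (as a simplification) that per-unit review cost and code volume multiply:

```python
# Back-of-envelope combination of the LogRocket numbers quoted above.
# Treating the two ratios as multiplicative is a simplification.
human_review_min = 1.2
ai_review_min = 4.3
per_unit_ratio = ai_review_min / human_review_min   # ~3.6x per review unit
code_volume_ratio = 6.4                             # AI emits 6.4x more code

total_review_multiplier = per_unit_ratio * code_volume_ratio
print(round(per_unit_ratio, 1))        # 3.6
print(round(total_review_multiplier))  # ~23x total review load
```

Even if the multiplicative assumption overstates it, the direction is clear: review load grows far faster than either ratio alone suggests.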
Cursor’s $290M+ acquisition of Graphite (December 2025) confirmed this is a real bottleneck, not a theoretical one. Their CEO: “code review is taking a growing share of developer time as writing code keeps shrinking.”
The CLAUDE.md pattern
67% of repos now contain CLAUDE.md files. This is a new category of developer artifact: instructions written by humans for consumption by AI, checked into version control alongside the code. It sits at the intersection of documentation, configuration, and prompt engineering.
The fragmentation problem: CLAUDE.md for Claude Code, .cursor/rules/*.mdc for Cursor, .github/copilot-instructions.md for Copilot. AGENTS.md emerged in July 2025 as an open standard to solve this — one file for any agent.
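For concreteness, a minimal, entirely hypothetical example of the artifact category — the commands, paths, and conventions below are invented:

```markdown
# CLAUDE.md — project conventions for coding agents (illustrative)

## Build & test
- `npm run build` to compile; `npm test` before every commit

## Conventions
- TypeScript strict mode; no `any` without a justifying comment
- Never touch files under `migrations/` without asking

## Context
- Auth logic lives in `src/auth/`; start there for session bugs
```

The same content would work as an AGENTS.md under the open standard; the file is documentation, configuration, and prompt in one.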
Beyond Developers
At Epic (healthcare technology), over half of Claude Code usage is by non-developer roles — support and implementation staff doing tasks that previously required engineering. AI now writes ~30% of Microsoft’s code and 25%+ of Google’s (MIT Technology Review). At Anthropic, ~90% of Claude Code’s own codebase is written by Claude Code.
Recommendations
Based on the evidence:
- Restructure review before scaling generation. Teams adding AI coding tools without changing how they review will get slower, not faster (Faros AI, LogRocket). Budget 3-4x review time per AI-generated PR.
- Start with permission controls, loosen with experience. New users auto-approve ~20% of actions; experienced users exceed 40% (Anthropic). Don’t skip to full autonomy. The UC San Diego study found all 13 professional developers maintained manual oversight.
- Treat CLAUDE.md as a first-class artifact. 67% of repos already have one (Greptile). Make it part of your onboarding, not an afterthought. Document conventions explicitly — the agent reads what you write.
- Expect the quality tax. AI PRs carry 1.4x more critical issues (CodeRabbit), 2.74x more XSS vulnerabilities, and 45% fail security tests (Veracode). Add automated quality gates before the human review step.
- Track what the agent does. The more autonomous the agent, the more valuable the decision trail. Session logs, tool call histories, and permission patterns are the new audit artifacts.
Open Questions
- Is the METR result generalizable? 16 developers, specific repos. Does it hold across team sizes and project types?
- When does the quality tax break even? At what point does AI speed offset review overhead?
- Will review automation work? Current AI review catches 44-82% of issues. Is that enough to close the bottleneck?
- How does PTC (narrow sense) change the game when it lands in Claude Code? Context preservation + multi-tool orchestration could meaningfully shift the autonomy curve.
- Enterprise governance: Only 6% have advanced AI security strategies while 40% of apps embed agents. The EU AI Act deadline is August 2026.
The Tacit Angle
Programmatic tool calling generates orders of magnitude more session data than manual coding. A single Claude Code session can produce 50-200+ tool calls — each one a decision the agent made about your codebase. That’s a fundamentally different volume of context than a chat thread.
| What Happens Today | What Session Memory Enables |
|---|---|
| Agent explores codebase, finds the fix, session ends. Next bug: starts from zero. | "Last time this module broke, the agent traced it to X — start there." |
| You calibrate permissions manually each session (20% → 40% over months) | Permission patterns persist: “this agent is safe with test files, not with configs” |
| Sub-agent findings scattered — main agent gets summary, details lost | Full sub-agent traces searchable: what it tried, what it rejected, why |
| METR’s 19% slowdown partly from re-doing work AI already explored | Prior exploration reusable — skip re-generating the 56%+ of suggestions that get rejected |
The METR study found developers accepted fewer than 44% of AI generations — more than half of AI output is wasted exploration. Session memory turns rejected approaches into “don’t try this again” signals.
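What mining that session data might look like, sketched against an invented log schema (real agent logs will differ):

```python
# Hypothetical sketch of mining a session's tool-call log for reusable
# signals. The log schema here is invented for illustration.
session_log = [
    {"tool": "Grep", "args": "timeout", "outcome": "found auth/session.py"},
    {"tool": "Edit", "args": "auth/session.py", "outcome": "rejected"},
    {"tool": "Edit", "args": "auth/tokens.py", "outcome": "accepted"},
    {"tool": "Bash", "args": "npm test", "outcome": "passed"},
]

def extract_signals(log):
    """Turn one session into 'don't retry' / 'start here' memory entries."""
    dont_retry = [e["args"] for e in log
                  if e["tool"] == "Edit" and e["outcome"] == "rejected"]
    accepted = [e["args"] for e in log
                if e["tool"] == "Edit" and e["outcome"] == "accepted"]
    return {"avoid": dont_retry, "start_here": accepted}

print(extract_signals(session_log))
# {'avoid': ['auth/session.py'], 'start_here': ['auth/tokens.py']}
```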
Confidence Assessment
| Claim | Confidence |
|---|---|
| Tool calling is the defining shift in AI coding | High — universal adoption, clear phase transition |
| PR volume is up ~98% | High — Faros AI data, 10K+ developers |
| Experienced devs may be slower with AI (METR) | High — randomized controlled trial |
| Developers overestimate AI speed gains | High — 40-point perception gap in METR |
| AI code has more quality issues | High — CodeRabbit, Veracode, Apiiro converge |
| Review is the new bottleneck | High — data + $290M acquisition confirms |
| CLAUDE.md is standard practice | Medium — 67% adoption, but fragmented |
| Multi-agent is the next phase | Medium — emerging, not yet proven at scale |
Sources & Provenance
Verifiable sources. Dates matter. Credibility assessed.
Introducing Advanced Tool Use ↗
Anthropic · Anthropic Engineering
"Programmatic tool calling reduces tokens 37%, eliminates 19+ inference passes, improves accuracy from 49% to 74% (Opus 4). Tool Search reduces tool definitions from 77K to 8.7K tokens."
Measuring AI Agent Autonomy in Practice ↗
Anthropic Research · Anthropic
"Human interventions per session decreased from 5.4 to 3.3. Claude proactively stops 2x more than humans interrupt. New users auto-approve 20%, experienced users 40%+. Only 0.8% of tool calls appear irreversible."
AI-Assisted Development: METR Randomized Controlled Trial ↗
METR · METR
"16 experienced devs, their own repos (22K+ stars). AI users were 19% slower while estimating 20% faster. Devs accepted < 44% of AI generations. The most rigorous productivity study to date."
Stack Overflow 2025 Developer Survey ↗
Stack Overflow · Stack Overflow
"84% use or plan to use AI tools. Only 33% trust output, 46% distrust. Positive sentiment dropped from 70%+ to 60%. 45% cite 'almost right but not quite' as top frustration."
Professional Software Developers Don't Vibe, They Control ↗
UC San Diego / Cornell · arXiv
"All 13 observed developers controlled design themselves. 9 of 13 carefully reviewed every code change. Estimates ~2,000 hours to develop calibrated trust in AI agents."
State of AI vs. Human Code Generation Report ↗
CodeRabbit · CodeRabbit
"AI PRs have 10.83 issues vs 6.45 for human PRs. AI code has 1.4x more critical issues, 1.7x more major issues, 2.74x more XSS vulnerabilities."
AI Software Engineering: Impact on Developer Productivity ↗
Faros AI · Faros AI
"10,000+ developers analyzed. PR volume up 98%, review time up 91%. The defining data point for the review bottleneck thesis."
State of AI Coding 2025 ↗
Greptile · Greptile
"67% of repos have CLAUDE.md. Lines per developer up 76%. PR size up 33%. 59% of developers run 3+ AI tools in parallel."
My LLM Coding Workflow Going into 2026 ↗
Addy Osmani · Personal Blog
"Treat AI as 'an over-confident junior developer.' Planning time increases. Micro-task decomposition becomes standard. At Anthropic, ~90% of Claude Code's codebase is written by Claude Code."
AI Coding Tools Shift Bottleneck to Review ↗
LogRocket · LogRocket
"Senior engineers spend 4.3 min reviewing AI code vs 1.2 min for human code. AI generates 6.4x more code for same task. Review shifted from 'does this work?' to 'do we need all of this?'"
Eight Trends Defining How Software Gets Built in 2026 ↗
Anthropic · Claude Blog
"Multi-agent coordination, AI-automated review, extending beyond engineering teams, security from the start. The four priorities for agentic coding in 2026."
Cursor Acquires Graphite ↗
TechCrunch · TechCrunch
"$290M+ acquisition. CEO: 'code review is taking a growing share of developer time as writing code keeps shrinking.' Stacked PRs as the solution."
AI Code Security: 10x Vulnerability Spike ↗
Apiiro · Apiiro
"New security findings from AI code spiked to 10,000+/month — a 10x increase in 6 months. Speed gains partially offset by downstream quality costs."
Generative Coding: 2026 Breakthrough Technology ↗
MIT Technology Review · MIT Technology Review
"AI writes ~30% of Microsoft's code, 25%+ of Google's. Named one of 10 Breakthrough Technologies of 2026."