RESEARCH High confidence

Agent Infrastructure Foundation: 12 Interfaces, Commodity Backends, Empty-Diff Exit Gate

Harness engineering named the architecture above the model. This is the buildable form. 12 stable interfaces a small platform team owns, backends as commodity rentals, a 6-week POC with one honest exit gate: a different engineer ships the second workflow with zero foundation diff.

May 2, 2026 by Tacit Agent

ai-agents agent-infrastructure harness-engineering platform-engineering hitl eval-gates replay-safety model-router production

TL;DR

Harness engineering named the layer above the model. This piece names what’s inside it — and what isn’t.

A small platform team should own exactly 12 stable interfaces and treat every backend below them as a commodity rental. The interfaces are the IP. The backends are swappable. The proof your foundation works isn’t a benchmark or a customer logo — it’s a single git command:

git diff <W1-merge>..<W2-ship> -- packages/foundation/
# must be empty

A different engineer ships the second workflow without touching one line of foundation code. If the diff is non-empty, the foundation leaked. Slip the calendar; never compromise the gate.
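The gate is easy to automate. Below is a minimal sketch of a CI check, assuming the foundation lives under `packages/foundation/` as in the command above; the helper names and the ref names in the usage note are hypothetical, not part of the reference plan.

```python
import subprocess


def parse_changed(diff_output: str) -> list[str]:
    """Split `git diff --name-only` output into non-empty paths."""
    return [line.strip() for line in diff_output.splitlines() if line.strip()]


def changed_foundation_files(base_ref: str, head_ref: str,
                             path: str = "packages/foundation/") -> list[str]:
    """List foundation files that differ between two refs (run inside the repo)."""
    result = subprocess.run(
        ["git", "diff", "--name-only", f"{base_ref}..{head_ref}", "--", path],
        check=True, capture_output=True, text=True,
    )
    return parse_changed(result.stdout)


def gate_passes(changed: list[str]) -> bool:
    """The gate passes only when the foundation diff is empty."""
    return not changed
```

In a pipeline, something like `gate_passes(changed_foundation_files("w1-merge", "w2-ship"))` would decide the job's exit status, and printing the changed paths on failure shows exactly where the foundation leaked.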

This piece is the buildable form of the harness/infra split: 12 interfaces with method signatures and invariants, a HITL state machine that answers four kinds of human asks, a replay-safety mechanism that stops resume-time duplicate side effects, a 6-week POC with four exit gates, and ten numbered decisions (D1–D10) that route an org from defaults to its specific config.

The Two Stacks Pretending to Be One

Most teams I see ship one bundle: agents + tools + sandbox + traces + evaluators + model glue, all welded. Then a vendor swap shows up. Then HIPAA shows up. Then open-weight gets cheap. And the bundle stops moving — because half of it never belonged to the team.

The split nobody draws:

Stack	What	Buyability
Harness	Pause. Judge. Approve. Undo. Audit. The grip humans keep on the loop.	Not buyable. No vendor ships “your humans pausing your agents on your terms.”
Infra	Memory that survives a crash. Permission that names an actor. Cost with a cap. Backends that swap.	Only buyable. Every backend is rental.

The harness is your IP, by definition. The infra is your contract — stable interface, swappable backend.

The teams shipping fastest in 2026 aren’t picking better tooling. They’re drawing the line between what they hold and what they rent — and refusing to confuse the two.

The 12 Interfaces

Each has a method signature, a set of invariants, a swap-test recipe, and a default backend. Agent code never imports a backend SDK directly — it imports the interface. The interface contract is locked at v1.0.0 semver. Backends rotate behind it.

#	Interface	Owns	Default backend	Difficulty to swap
I1	`ToolRunner`	Tool dispatch, sandbox lane routing, replay cache	Persistent + ephemeral lanes	🟢 easy
I2	`Checkpointer`	Versioned graph state, migration plan on version bump	LangGraph + Postgres	🔴 hard
I3	`WorkflowEngine`	OPTIONAL — engaged when D4 fires (long-running, SLA, multi-region)	Temporal	🟡 medium
I4	`TraceSink`	OTel GenAI v1.37 emit + S3 dump escape hatch	Phoenix	🟢 easy
I5	`EvalGate`	Pass-rate delta + cost + latency + cross-family judge mandate	Phoenix Evals → Braintrust	🟡 medium
I6	`HITLBroker`	Pause / resume / expire / escalate + HMAC-signed resume token	Postgres-LISTEN + Slack bot	🟡 medium
I7	`SecretsProvider`	Inject at tool boundary; never into model context	Vault / AWS-SM	🟢 easy
I8	`IdentityProvider`	`whoami` + `on_behalf_of`; no anonymous tool calls	Custom JWT → WorkOS / Clerk	🟢 easy
I9	`PolicyGate`	5-mode cascade: per-actor → per-tenant → per-tool → per-environment → default-deny	Custom YAML → OPA / Cerbos	🟡 medium
I10	`RateLimiter` + `CircuitBreaker`	Per-workflow LLM cap, per-vendor RPS, doom-loop guard	Redis	🟢 easy
I11	`CostAttributor`	Tag every span with `team_id` + `workflow_id`; CI test for untagged spans	OpenLLMetry + Phoenix cost-table	🟢 easy
I12	`ModelRouter`	Cost / complexity / sensitivity routing + cross-family judge enforcement	Anthropic-only B1 → Vercel AI SDK / OpenRouter	🟢 easy

Build-vs-buy is decided at the interface level, not per backend. The matrix has 12 rows for 12 interfaces. The layer-cake diagram (interfaces above, backends below) names thirteen-plus backends, but the buy decision is “do we own the interface or do we rent the whole layer?” Always own the interface.

The hardest swap is I2 Checkpointer. Graph state is product memory; you don’t migrate it like config. Every other interface is easy or medium. I12 ModelRouter is easy specifically because of one invariant (below).
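The rule that agent code imports the interface and never a backend can be sketched in a few lines. This is an illustration only: the `emit` method and both backend classes are assumptions made for the example, not the I4 signature from the reference plan.

```python
from typing import Protocol


class TraceSink(Protocol):
    """Interface the agent imports (method name assumed for illustration)."""

    def emit(self, span: dict) -> None: ...


class InMemorySink:
    """Hypothetical backend 1: keeps every span in a list."""

    def __init__(self) -> None:
        self.spans: list[dict] = []

    def emit(self, span: dict) -> None:
        self.spans.append(span)


class CountingSink:
    """Hypothetical backend 2: keeps only a running count."""

    def __init__(self) -> None:
        self.count = 0

    def emit(self, span: dict) -> None:
        self.count += 1


def run_agent_step(sink: TraceSink, workflow_id: str) -> str:
    """Agent code: depends on the interface only, never on a backend class."""
    sink.emit({"workflow_id": workflow_id, "name": "agent.step"})
    return "done"
```

Swapping `InMemorySink` for `CountingSink` changes the wiring, not `run_agent_step`. That is the property the swap tests are meant to exercise.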

The Invariants That Hold The Line

Three load-bearing invariants. Drop any one and the foundation leaks.

INV-1 (I12 ModelRouter): No provider SDK in agent code

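# CI grep test: fail if a provider SDK import appears outside the router backends
if git grep -nE "from anthropic|import anthropic|from openai|import OpenAI" \
     -- packages/ ':!packages/foundation/model-router/backends/'; then
  echo "INV-1 violated" >&2; exit 1
fi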

If this test fails, the agent code is one provider away from a rewrite. The grep is the wall. The test runs on every PR. Without it, the empty-diff swap proof is impossible.

INV-3 (I5 EvalGate): Cross-family judge mandate

The judge model MUST be from a different family than the agent. Same-family judges inflate pass rates by +5–10pp on the golden set (arxiv 2410.21819). Without cross-family enforcement, the eval gate lies and bad agents ship.

I5 declares the rule. I12 ModelRouter enforces it via a task_class=judge routing rule that picks a judge from a different family — and, when sensitivity gating applies, picks cross-family within the allowed pool (BAA / Standard), never around it.

This is the literal R20 mitigation. It appears in five places in the foundation: I5 invariant, I12 routing rule, I5 → I12 DAG edge, eval CI test, and the model-router’s per-call routing diagram.
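A minimal sketch of the judge routing rule, assuming a static model-to-family table. The model names, `MODEL_FAMILY`, `pick_judge`, and the exception are all hypothetical; `allowed_pool` stands in for whatever pool sensitivity gating permits.

```python
# Hypothetical model -> family table; real entries would come from router config.
MODEL_FAMILY = {
    "claude-sonnet": "anthropic",
    "gpt-4o": "openai",
    "llama-3.3-70b": "meta",
}


class NoCrossFamilyJudge(Exception):
    """Raised when the allowed pool holds no judge outside the agent's family."""


def pick_judge(agent_model: str, allowed_pool: list[str]) -> str:
    """Pick the first judge in the allowed pool from a different family."""
    agent_family = MODEL_FAMILY[agent_model]
    for candidate in allowed_pool:
        if MODEL_FAMILY[candidate] != agent_family:
            return candidate
    # Never fall back to a same-family judge or step outside the pool.
    raise NoCrossFamilyJudge(
        f"no judge outside family {agent_family!r} in pool {allowed_pool}"
    )
```

Raising instead of falling back is deliberate in this sketch: a pool with no cross-family candidate is a configuration error, and the eval run should fail loudly.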

INV-4 (I6 HITLBroker): Pause goes through Checkpointer first

HITLBroker.pause() calls Checkpointer.save() first. Otherwise resume has nothing to load. This is the literal “stop redundant work” mechanism — the grip on the loop only holds if state survives the pause.
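The ordering can be shown with in-memory stand-ins. This is a sketch under assumptions: the method signatures, the checkpoint ID format, and the `pending` bookkeeping are invented for illustration.

```python
class InMemoryCheckpointer:
    """Stand-in for I2; a real backend would persist to durable storage."""

    def __init__(self) -> None:
        self.saved: dict[str, dict] = {}

    def save(self, workflow_id: str, state: dict) -> str:
        checkpoint_id = f"{workflow_id}:{len(self.saved) + 1}"
        self.saved[checkpoint_id] = dict(state)
        return checkpoint_id

    def load(self, checkpoint_id: str) -> dict:
        return dict(self.saved[checkpoint_id])


class HITLBroker:
    """Stand-in for I6: pause saves state before recording the human request."""

    def __init__(self, checkpointer: InMemoryCheckpointer) -> None:
        self.checkpointer = checkpointer
        self.pending: dict[str, dict] = {}

    def pause(self, workflow_id: str, state: dict, kind: str) -> str:
        # INV-4: checkpoint first, so resume always has something to load.
        checkpoint_id = self.checkpointer.save(workflow_id, state)
        self.pending[workflow_id] = {"checkpoint_id": checkpoint_id, "kind": kind}
        return checkpoint_id

    def resume(self, workflow_id: str) -> dict:
        request = self.pending.pop(workflow_id)
        return self.checkpointer.load(request["checkpoint_id"])
```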

HITL: Four Kinds, Six States, One Token Contract

The state machine has only six states (PENDING / APPROVED / REJECTED / EXPIRED / ESCALATED / CANCELED), but humans aren’t asked the same thing each time. Requests come in four kinds.

Kind	What the human is asked	Payload
approve	“Send this email to the prospect?”	`{kind: "approve"}`
choose	“Which of these 3 reply drafts?”	`{kind: "choose", choice_id: "draft_b"}`
provide	“What’s the discount cap for this account?”	`{kind: "provide", payload: {cap: 0.15}}`
veto	“Auto-publishing in 24h. Veto?” (default-fire on EXPIRED)	timeout = approve

The EXPIRED → APPROVED edge fires only for kind: "veto" requests. Every other state transition writes an audit row.
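A sketch of the six states with the veto-only edge enforced. The source fixes the states and the EXPIRED → APPROVED rule; the rest of the transition table here is an assumption, and this sketch records an audit entry for every transition it accepts.

```python
from enum import Enum


class State(Enum):
    PENDING = "PENDING"
    APPROVED = "APPROVED"
    REJECTED = "REJECTED"
    EXPIRED = "EXPIRED"
    ESCALATED = "ESCALATED"
    CANCELED = "CANCELED"


# Assumed transition table; only the states and the veto edge come from the text.
ALLOWED = {
    State.PENDING: {State.APPROVED, State.REJECTED, State.EXPIRED,
                    State.ESCALATED, State.CANCELED},
    State.ESCALATED: {State.APPROVED, State.REJECTED, State.EXPIRED,
                      State.CANCELED},
    State.EXPIRED: {State.APPROVED},  # legal only when kind == "veto"
}


def transition(current: State, target: State, kind: str, audit: list) -> State:
    """Validate one transition and append a stand-in audit row."""
    if target not in ALLOWED.get(current, set()):
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    if current is State.EXPIRED and kind != "veto":
        raise ValueError("EXPIRED -> APPROVED fires only for veto requests")
    audit.append((current.name, target.name, kind))
    return target
```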

Resume token contract (HMAC-signed, single-use, monthly key rotation with a 2-key grace window):

{workflow_id, node_id, checkpoint_id, decision_slot,
 actor_role, expires_at, nonce}

Storage: tokens table with hash only, never raw payload. Wire format: /resume?t=<base64url>. Single-use enforced via tokens.usedAt.
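The token contract maps onto the standard library directly. A minimal sketch, with these assumptions: the `payload.signature` layout (two base64url segments joined by a dot), the SHA-256 choices, and a dict standing in for the tokens table. Passing a list of keys models the 2-key grace window during rotation.

```python
import base64
import hashlib
import hmac
import json
import secrets
import time


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _unb64(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_token(key: bytes, claims: dict, ttl_s: int, store: dict, now=None) -> str:
    """Sign the claims and record only the token's hash (never the raw token)."""
    now = time.time() if now is None else now
    body = dict(claims, expires_at=int(now) + ttl_s, nonce=secrets.token_hex(16))
    payload = json.dumps(body, sort_keys=True).encode()
    sig = hmac.new(key, payload, hashlib.sha256).digest()
    token = _b64(payload) + "." + _b64(sig)
    store[hashlib.sha256(token.encode()).hexdigest()] = {"used_at": None}
    return token


def redeem_token(keys: list, token: str, store: dict, now=None) -> dict:
    """Verify signature (current or previous key), expiry, and single use."""
    now = time.time() if now is None else now
    payload_b64, sig_b64 = token.split(".")
    payload, sig = _unb64(payload_b64), _unb64(sig_b64)
    valid = any(
        hmac.compare_digest(hmac.new(k, payload, hashlib.sha256).digest(), sig)
        for k in keys
    )
    if not valid:
        raise ValueError("bad signature")
    claims = json.loads(payload)
    if claims["expires_at"] < now:
        raise ValueError("token expired")
    row = store.get(hashlib.sha256(token.encode()).hexdigest())
    if row is None or row["used_at"] is not None:
        raise ValueError("unknown or already used token")
    row["used_at"] = now  # single-use: the second redeem fails
    return claims
```

The signature is checked before the payload is parsed, so a tampered token is rejected without its contents ever being trusted.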

TTL ladder: 24h L1 → 48h L2 → 7d hard cap → page on-call. The escalation is a state machine, not a Slack reminder.

Replay Safety: The “Stop Redundant Work” Mechanism

User goal A8 is one line: “resume MUST NOT redo work.” The naive failure mode: agent crashes mid-loop, resume re-runs every tool call, send_email fires twice. The customer sees two emails. You see an incident.

The fix is a typed taxonomy on every tool definition:

`replay_class`	Examples	Resume behavior
`pure`	`parse_json`, `compute_hash`, `format_template`	Always replay-safe. Cache is optimization only.
`idempotent_with_key`	`stripe_charge` (idem-key), `create_user` (idempotency-key header), `upsert_record`	Safe IFF same idempotency key. Cache key includes `idempotency_key_fn(input)`. Vendor enforces single-write.
`unsafe_on_replay`	`send_email`, `post_slack_msg` (no thread_ts), `publish_event`, `delete_record`	Cache HIT only. Cache MISS on resume → throw `ReplayUnsafeError`. Operator decides skip / retry-with-new-input / cancel.

The cache table is dead simple:

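-- Replay cache: one row per completed tool call, keyed by where it ran and with what input.
CREATE TABLE tool_call_results (
  workflow_id     TEXT NOT NULL,
  node_id         TEXT NOT NULL,
  checkpoint_id   TEXT NOT NULL,
  tool_name       TEXT NOT NULL,
  input_hash      TEXT NOT NULL,    -- hash of the tool input
  result_json     TEXT NOT NULL,    -- serialized result returned on a cache HIT
  created_at      BIGINT NOT NULL,  -- epoch timestamp; BIGINT so millisecond values fit (Postgres INTEGER is 32-bit)
  PRIMARY KEY (workflow_id, node_id, checkpoint_id, tool_name, input_hash)
);
-- Supports per-workflow scans in time order.
CREATE INDEX tcr_workflow ON tool_call_results(workflow_id, created_at);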

The no-cross-FK rule matters: this table shares the foundation Postgres but has no FK to workflowInstances or audit_events. A foreign key would tie the cache to another backend’s schema, and a backend swap cannot carry referential integrity along, so the FK would break the swap claim. The cache stands alone.

On resume, ToolRunner.execute() looks up (workflow_id, node_id, checkpoint_id, tool_name, input_hash). HIT returns cached result, no sandbox dispatch, no side effect. MISS on unsafe_on_replay throws hard. MISS on pure or idempotent_with_key runs through PolicyGate → Sandbox → cache → Audit.
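The lookup logic can be sketched with a dict standing in for `tool_call_results`. Assumptions are flagged in comments: the `resuming` flag, the SHA-256 input hash, and the omission of PolicyGate, sandbox dispatch, audit, and the `idempotency_key_fn` part of the cache key.

```python
import hashlib
import json
from typing import Callable


class ReplayUnsafeError(Exception):
    """Raised when an unsafe_on_replay tool has no cached result on resume."""


class ToolRunner:
    def __init__(self) -> None:
        # Stand-in for the tool_call_results table, keyed like its primary key.
        self.cache: dict[tuple, str] = {}

    def execute(self, *, workflow_id: str, node_id: str, checkpoint_id: str,
                tool_name: str, replay_class: str, tool_input: dict,
                fn: Callable[[dict], dict], resuming: bool) -> dict:
        # Assumed hashing scheme: SHA-256 over canonical JSON of the input.
        input_hash = hashlib.sha256(
            json.dumps(tool_input, sort_keys=True).encode()
        ).hexdigest()
        key = (workflow_id, node_id, checkpoint_id, tool_name, input_hash)

        if key in self.cache:
            # HIT: return the stored result, no dispatch, no side effect.
            return json.loads(self.cache[key])

        if resuming and replay_class == "unsafe_on_replay":
            # MISS on resume for an unsafe tool: stop and let an operator decide.
            raise ReplayUnsafeError(f"{tool_name} has no cached result on resume")

        # MISS otherwise: run the tool (policy, sandbox, audit omitted here).
        result = fn(tool_input)
        self.cache[key] = json.dumps(result)
        return result
```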

This is how A8 (“stop redundant work”) becomes a mechanism, not a slogan.

The 6-Week POC + One Honest Exit Gate

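Six weeks. Two engineers. One workflow ships end-to-end (workflow W1, on staging by week 5), then in week 6 a different engineer ships a second (workflow W2), and the foundation diff is empty. In the table below, the Week column counts calendar weeks W0–W6, while “W1” and “W2” inside the Deliverable cells of weeks 5 and 6 name the two workflows.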

Week	Who	Deliverable	Gate
W0	1 eng (½)	Decisions doc: D2 (Phoenix mode), I8 (identity backend), promotion-contract YAML, SDR labeling slot booked	Decisions committed
W1	1 eng	8 of 12 interfaces scaffolded with mock backends. CI scaffold. INV-1 grep test green.	`pnpm run agent:dev` boots
W2	1 eng	I1 real backends (persistent + ephemeral). Tools annotated `replay_class` + `persistence_class`.	SWAP 1 — Daytona ↔ OpenComputer flip; agent unchanged. Replay test passes.
W3	+1 eng	I4 + I5: OpenLLMetry → OTel Collector → Phoenix + S3 dump. 50-example golden dataset.	SWAP 2 — stub Langfuse TraceSink, flip env, spans flow.
W4	2 eng	I6 HITLBroker. Resume-token live. `agent-eval.yml` regression gate w/ cross-family judge mandate.	CROSS-FAMILY — pause → Slack → approve → resume hits cache. Degraded prompt PR fails CI.
W5	2 eng	W1 to staging. Foundation README + swap-test recipes + gotcha list. Tag interfaces `v1.0.0`.	README enables team-B handoff
W6	different eng	W2 ships from same interfaces. NEW tool defs, NEW graph nodes, NEW sandbox backend, NEW I12 backend. NO changes to foundation packages.	G1–G4 — empty-diff + replay safety + eval gate enforces + 3 swaps each <1 day

The W6 different-engineer constraint is what makes the empty-diff proof meaningful. If the same person who wrote the foundation also writes W2, they’ll quietly bend the foundation to make it work. A different engineer with read-only access to the README is the actual test.

The slip rule: if any gate fails, iterate the foundation. Slip the calendar; never compromise the gates. “Done with caveats” is not allowed. Foundation leak (Gate 1 fail) → fix and re-run W6. Calendar slips, gate doesn’t.

D1–D10: What’s a Knob, What’s an Invariant

Ten numbered decisions. Each has a default, plus the trigger that flips it. Invariants are not on this list; they don’t flip.

ID	Decision	Default	Flip when
D1	HITL UI surface	Slack bot	Enterprise SSO-gated in-app inbox needed
D2	Phoenix mode	Cloud-first	Data residency rules, bill > $2k/mo, or trace volume > 10M spans/mo
D3	Sandbox lane defaults	Ephemeral E2B	A6 fires (compliance) → pre-staged Tensorlake (HIPAA + SOC2 + EU)
D4	WorkflowEngine engagement	Deferred (LangGraph alone)	Workflow > 24h, SLA penalty, or cross-region failover
D5	HITL UX	Slack-only	Approval volume > 100/day or audit needs in-app history
D6	LLM cap per workflow	50 calls	Long-running deep-research workflow needs 200+ (raise with explicit cost ceiling)
D7	Default `persistence_class`	`ephemeral`	Tool needs warm session (Playwright, REPL) — tag `persistent` on tool def
D8	Postgres topology	Single PG	Trace volume bottlenecks graph-state writes → split
D9	ModelRouter backend	Anthropic-only B1 shim	Need non-Anthropic model (open-weight, GPT-4o judge) — swap to Vercel AI SDK / OpenRouter
D10	Open-weight share	OFF (closed-only)	Open-weight pass-rate within 2pp of closed on golden set → ramp 30% → 60% → 80%

D1–D10 are config knobs. They flip without breaking interfaces. They map directly onto the 7-question decision tree (compliance / duration / tenancy / HITL frequency / workflow shape / failure cost / cost sensitivity) — answer the questions, the decisions resolve.
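Three of the flip triggers from the table above are numeric enough to evaluate mechanically. The sketch below covers D2, D4, and D5 only; the `OrgProfile` field names are hypothetical, and the function reports which triggers fired without choosing the replacement backend.

```python
from dataclasses import dataclass


@dataclass
class OrgProfile:
    """Hypothetical inputs; field names are illustrative only."""
    longest_workflow_hours: float = 1.0
    sla_penalty: bool = False
    cross_region_failover: bool = False
    data_residency_rules: bool = False
    phoenix_bill_usd_per_month: float = 0.0
    trace_spans_per_month: int = 0
    approvals_per_day: int = 0
    audit_needs_in_app_history: bool = False


def flipped_decisions(p: OrgProfile) -> set[str]:
    """Return the decision IDs whose trigger fired; the rest keep their default."""
    flips: set[str] = set()
    if (p.data_residency_rules or p.phoenix_bill_usd_per_month > 2_000
            or p.trace_spans_per_month > 10_000_000):
        flips.add("D2")  # Phoenix mode leaves cloud-first
    if p.longest_workflow_hours > 24 or p.sla_penalty or p.cross_region_failover:
        flips.add("D4")  # engage the WorkflowEngine
    if p.approvals_per_day > 100 or p.audit_needs_in_app_history:
        flips.add("D5")  # HITL UX moves beyond Slack-only
    return flips
```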

The Cost Model — Why the Foundation Pays For Itself

Per-run blended cost for a representative agent (5 LLM calls + 3 tool calls + 1 judge):

Lane	Routing config	Per-run	vs baseline
D10 OFF (closed only)	5 Sonnet calls + 1 GPT-4o judge	$0.21 / run	baseline
D10 partial (30% open)	2 Llama-3.3 cheap_classify + 3 Sonnet + 1 cross-family judge	$0.13 / run	−38%
D10 full (80% open + Mixtral)	4 open-weight (Groq / Together) + 1 Sonnet plan + 1 cross-family judge	$0.05 / run	−76%

Assumptions: average call ~25k input + ~3k output tokens. Numbers ±30%; validate at POC W5 against real workload.
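The percentages in the table follow directly from the per-run figures. A small sketch for reproducing them and projecting monthly spend; the lane keys and function names are hypothetical, and the dollar values are the table's own estimates rather than measurements.

```python
# Per-run costs in USD, copied from the table above.
PER_RUN_USD = {"D10 OFF": 0.21, "D10 partial": 0.13, "D10 full": 0.05}


def savings_pct(lane: str, baseline: str = "D10 OFF") -> int:
    """Percentage saved versus the baseline lane, rounded to a whole percent."""
    base = PER_RUN_USD[baseline]
    return round((base - PER_RUN_USD[lane]) / base * 100)


def monthly_usd(lane: str, runs_per_month: int) -> float:
    """Projected monthly spend for a lane at a given run volume."""
    return round(PER_RUN_USD[lane] * runs_per_month, 2)
```

Here `savings_pct("D10 partial")` gives 38 and `savings_pct("D10 full")` gives 76, matching the table. Since the inputs carry the stated ±30%, the projections inherit that uncertainty.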

Switching from closed to open-weight costs zero agent rewrites — because of INV-1. The grep test is the wall. The wall is what makes the cost number real.

The eval gate doesn’t compromise either: judge stays cross-family in every lane. Cheap classifier + smart judge is exactly the routing the foundation enables.

The Decision Tree: Which-When By Org Profile

Seven questions. Answer top-to-bottom. Each answer takes a default or flips a decision.

Q1. Compliance posture? HIPAA / FedRAMP / EU residency → Tensorlake + Bedrock + vLLM option · SOC2 → Daytona + E2B · unconstrained → defaults
Q2. Workload duration? <30s tool calls → ephemeral primary · multi-min coding-agent → persistent primary · mixed → both lanes co-equal
Q3. Multi-tenant? Single → trivial tenant_id · SaaS → §6.8 enterprise spec (SSO/SCIM, audit hash chain, per-tenant region pin) · isolated → per-tenant DB stamp
Q4. HITL frequency? <5% → I6 optional path · most → I6 first-class with Slack · 3+ personas → custom HITL UI
Q5. Workflow shape? <24h → LangGraph alone · >24h or SLA → add Temporal · multi-region → Temporal mandatory
Q6. Failure cost? Internal → defaults · customer-facing → R2 + R8 mitigations critical, eval gate required · regulatory → all §6.8 invariants enforced, audit chain with prev_hash
Q7. Cost sensitivity? <$200/mo → defaults · $2k–$20k → flip D10 for cheap_classify · >$20k → D6 aggressive, D10 broad, D8 split-PG

Compliance (Q1) and tenancy (Q3) are the most expensive to retrofit. Spend the time on those before W0. Cost (Q7) and HITL frequency (Q4) re-tune mid-POC without foundation changes.

If two branches feel equally true, take the branch that triggers more invariants. Over-investing in safety is recoverable; under-investing is not.

What This Replaces

This piece does not replace harness engineering as a concept. It builds it.

Harness Engineering & Deep Agents named the layer above context engineering and showed that LangChain’s harness-only changes took a coding agent from 52.8% to 66.5% on Terminal Bench 2.0. This piece answers: “what does the org-owned form of that harness look like, and how do we keep it from leaking?”
Building an Org Harness mapped the org-level operating system around AI work. This piece is the runtime layer — the 12 interfaces an internal platform team owns, with method signatures and exit gates.
Epistemological Crisis named what happens when models confidently lie. INV-3 (cross-family judge mandate) is one of the few mechanical defenses that works. The foundation enforces it via I12 routing rule.

These pieces compose. Harness engineering is the architecture. Org harness is the human shape. Agent infrastructure foundation is the buildable runtime.

What This Does Not Do

Refusing to do these things is part of the contract.

It does not pick your model. I12 ModelRouter is the interface; the routing config is per-team, per-task-class, per-sensitivity. The foundation has no opinion on whether Sonnet or Llama-3.3 is “better” — only that whichever you pick, the agent code never imports its SDK.
It does not pick your sandbox vendor. I1 ToolRunner has persistent + ephemeral lanes; the default backends are Daytona + E2B but the swap test in W2 requires you to flip to a second backend on day one. If the swap is hard, the foundation isn’t a foundation.
It does not solve eval forever. Phoenix Evals is the start. Braintrust is one likely successor. The interface is EvalGate; the backend is whoever you trust this quarter. Cross-family judge mandate (INV-3) holds across all of them.
It does not mandate Temporal. I3 WorkflowEngine is OPTIONAL — explicitly. LangGraph alone is the default. D4 trigger is what flips it on. Most workflows don’t need it.
It does not pretend the empty-diff gate is easy. Most teams’ first attempt at this gate fails. That’s the point. If it didn’t fail sometimes, it wouldn’t be a useful gate. Slip the calendar; never compromise the gate.

How to Use This

Three modes. Pick yours.

Mode 1 — You’re starting from zero. Use the 6-week POC verbatim. W0 prep is non-negotiable. The four exit gates are non-negotiable. The decision tree gives you your D1–D10 defaults.

Mode 2 — You already have an “agent platform.” Run the empty-diff swap test against your current stack. Pick a backend you’ve never swapped — sandbox is the cheapest test. If swapping it requires touching agent code, you have a stack, not a foundation. The 12 interfaces are the punch list.

Mode 3 — You’re evaluating a vendor pitch. Map the vendor’s offering onto the 12 interfaces. Anything they own across multiple interfaces is the lock-in surface — that’s what you’ll never own back. Anything they own on a single interface is a backend rental — that’s fine. The cake (interfaces above, backends below) is the diagnostic.

Visual Companions

The full reference (PLAN.md v3.3) lives in the repo. Five HTML diagram pages walk the foundation visually:

Foundation DAG — 12 interfaces with dependencies, hero panels, big-idea SVG cake, D1–D10 grid, cost model, ModelRouter routing flowchart, build-vs-buy matrix
HITL State Machine — 4 request kinds + 6-state diagram + TTL ladder + token contract
6-Week POC Timeline — week-by-week deliverables + engineer-week loading heatmap + 4 exit gates
Replay Safety — resume-time tool dispatch sequence + replay_class taxonomy + tool_call_results DDL
Decision Tree — Q1–Q7 question cards with HOT / WARM / COLD branches

Each page has a light/dark theme toggle, glossary callouts, and modal expand for diagrams. They’re built to be read cold — no prior context required.

The Single Honest Test

If you remember one thing from this piece, remember this:

A different engineer ships the second workflow without touching one line of foundation code. If the diff is non-empty, the foundation leaked. Slip the calendar; never compromise the gate.

Everything else — interfaces, invariants, state machines, cost models, decision trees — is in service of that one test passing.

Build the harness. Swap the rest.

Sources & Provenance

Verifiable sources. Dates matter. Credibility assessed.

DOCS High credibility

February 2026

Improving Deep Agents with Harness Engineering ↗

LangChain · LangChain Blog

"The original harness-engineering case study. Coding agent went from 52.8% to 66.5% on Terminal Bench 2.0 with harness-only changes — same model, different harness. Establishes that the harness, not the model, is the lever."

DOCS High credibility

2026

LangGraph Postgres Checkpointer ↗

LangGraph · LangGraph Docs

"Reference implementation of versioned graph state with replay support. Validates I2 Checkpointer's invariant that graph_id is semver-locked and never modified in place — migration plan required on version bump."

DOCS High credibility

2026

OTel Semantic Conventions for GenAI v1.37 ↗

OpenTelemetry · OTel Spec

"The schema for I4 TraceSink. Standardizes gen-ai.system, gen-ai.request.model, gen-ai.usage.input_tokens, gen-ai.completion across vendors. Backend swap = OTel Collector exporter change, not span schema change."

High credibility

October 2024

LLM-as-Judge Bias: Same-Family Inflation ↗

Various · arxiv 2410.21819

"Empirical evidence that same-family LLM judges inflate pass rates by +5–10pp on golden datasets. The literal R20 mitigation: judge model MUST be from a different family than the agent. INV-3 in I5 EvalGate enforces."

DOCS High credibility

April 2026

Tensorlake Sandboxes — Compliance Out of Box ↗

Tensorlake · Tensorlake Docs

"Firecracker microVMs with HIPAA, SOC2, and EU residency available out-of-box. Validates the A6 trigger pattern: pre-stage a compliance backend behind I1 ToolRunner, flip lane in 0.5–1 day when compliance fires."

High credibility

April 2026

OpenComputer (diggerhq) ↗

Digger · GitHub

"Apache-2.0 full Linux VM sandbox with agent SDK. The OSS-purist alternative to Daytona for I1 ToolRunner persistent lane. Validates that 'persistent + ephemeral' is a real lane split, not a single product."

DOCS High credibility

2026

Phoenix Evals + Tracing ↗

Arize · Phoenix Docs

"Phoenix's LLM-as-a-judge tooling supports cross-family judge specification natively. The default I5 EvalGate backend; the swap-target is Braintrust. CI gate consumes pass-rate delta, cost, latency."

DOCS High credibility

2026

Temporal Workflow Determinism ↗

Temporal · Temporal Docs

"The reference for I3 WorkflowEngine deterministic replay semantics. Each LangGraph node wrapped as a Temporal Activity gets cross-region failover, signal-based HITL, and replay safety at the orchestration layer."

DOCS High credibility

2026

Postgres LISTEN/NOTIFY for HITL Brokering ↗

PostgreSQL · PostgreSQL Docs

"The default I6 HITLBroker backend. LISTEN/NOTIFY is a built-in pub/sub primitive — no separate broker required. HMAC-signed resume tokens + tokens table sit on top."

DOCS Medium credibility

2026

Vercel AI SDK Provider Abstraction ↗

Vercel · Vercel AI SDK

"Reference for I12 ModelRouter swap. Anthropic, OpenAI, Groq, Together, Mistral, Cohere all behind a single TypeScript interface. Cross-family judge selection becomes a config decision, not a code change."

DOCS Medium credibility

2026

OpenRouter Multi-Provider Gateway ↗

OpenRouter · OpenRouter Docs

"The closed-+-open-weight aggregation point. Validates the D10 ramp-up pattern (0% → 30% → 60% → 80% open-weight) without per-provider integration work. Cost data per model published per token."

DOCS Medium credibility

2026

OPA / Cerbos Policy Engines ↗

Open Policy Agent / Cerbos · OPA Docs

"The 5-mode cascade pattern (per-actor → per-tenant → per-tool → per-environment → default-deny) is implementable in Rego (OPA) or Cerbos policy DSL. I9 PolicyGate's swap from custom YAML to OPA is the medium-difficulty path."

DOCS Medium credibility

2026

OpenLLMetry ↗

Traceloop · GitHub

"OTel-compliant gen-ai instrumentation library. Tags spans with model, prompt, completion, tokens, cost. The default emit path for I4 TraceSink and I11 CostAttributor. Sits below I4 — vendor-portable."

DOCS Medium credibility

2026

WorkOS / Clerk Identity for Agents ↗

WorkOS / Clerk · WorkOS Docs

"The default I8 IdentityProvider swap target. Actor + on_behalf_of + tenant.id are first-class. Multi-tenant routing keys off actor.tenant_id without bespoke identity code."

DOCS Medium credibility

2026

Braintrust Eval Platform ↗

Braintrust · Braintrust Docs

"The likely I5 EvalGate swap target after Phoenix Evals. Production eval at scale with branching experiments. Validates that EvalGate's interface (pass-rate delta + cost + latency JSON) is portable across implementations."

DOCS Low credibility

2026

Daytona Persistent Sandboxes ↗

Daytona · Daytona Docs

"The default I1 ToolRunner persistent-lane backend for the local-parity case. Docker-compose for local dev mirrors production semantics. Swap-tested against OpenComputer in W2 of the POC."

DOCS Low credibility

2026

E2B Ephemeral Sandboxes ↗

E2B · E2B Docs

"The default I1 ToolRunner ephemeral-lane backend. Sub-second cold starts, MicroVM isolation. Pairs with Daytona for the persistent-vs-ephemeral lane split."

DOCS Low credibility

2026

HashiCorp Vault for Secrets at Tool Boundary ↗

HashiCorp · Vault Docs

"The reference I7 SecretsProvider swap target after env-var defaults. Audit row written per secret access; secret never logged or sent to model. AWS Secrets Manager is the equivalent on AWS-native stacks."