Agent Infrastructure Foundation: 12 Interfaces, Commodity Backends, Empty-Diff Exit Gate
Harness engineering named the architecture above the model. This is the buildable form. 12 stable interfaces a small platform team owns, backends as commodity rentals, a 6-week POC with one honest exit gate: a different engineer ships the second workflow with zero foundation diff.
TL;DR
Harness engineering named the layer above the model. This piece names what’s inside it — and what isn’t.
A small platform team should own exactly 12 stable interfaces and treat every backend below them as a commodity rental. The interfaces are the IP. The backends are swappable. The proof your foundation works isn’t a benchmark or a customer logo — it’s a single git command:
git diff <W1-merge>..<W2-ship> -- packages/foundation/
# must be empty
A different engineer ships the second workflow without touching one line of foundation code. If the diff is non-empty, the foundation leaked. Slip the calendar; never compromise the gate.
This piece is the buildable form of the harness/infra split: 12 interfaces with method signatures and invariants, a HITL state machine that answers four kinds of human asks, a replay-safety mechanism that stops resume-time duplicate side effects, a 6-week POC with four exit gates, and ten numbered decisions (D1–D10) that route an org from defaults to its specific config.
The Two Stacks Pretending to Be One
Most teams I see ship one bundle: agents + tools + sandbox + traces + evaluators + model glue, all welded. Then a vendor swap shows up. Then HIPAA shows up. Then open-weight gets cheap. And the bundle stops moving — because half of it never belonged to the team.
The split nobody draws:
| Stack | What | Buyability |
|---|---|---|
| Harness | Pause. Judge. Approve. Undo. Audit. The grip humans keep on the loop. | Not buyable. No vendor ships “your humans pausing your agents on your terms.” |
| Infra | Memory that survives a crash. Permission that names an actor. Cost with a cap. Backends that swap. | Only buyable. Every backend is rental. |
The harness is your IP, by definition. The infra is your contract — stable interface, swappable backend.
The teams shipping fastest in 2026 aren’t picking better tooling. They’re drawing the line between what they hold and what they rent — and refusing to confuse the two.
The 12 Interfaces
Each has a method signature, a set of invariants, a swap-test recipe, and a default backend. Agent code never imports a backend SDK directly — it imports the interface. The interface contract is locked at v1.0.0 semver. Backends rotate behind it.
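A minimal sketch of what "agent code imports the interface, not the backend" can look like, using I4 TraceSink as the example. The method name `emit`, the span shape, and both stand-in backends are assumptions for illustration; the source does not publish the actual signatures.

```python
from typing import Protocol


class TraceSink(Protocol):
    """Interface that agent code imports; the method name is an assumption."""

    def emit(self, span: dict) -> None: ...


class InMemorySink:
    """Stand-in backend for tests; keeps spans in a list."""

    def __init__(self) -> None:
        self.spans: list[dict] = []

    def emit(self, span: dict) -> None:
        self.spans.append(span)


class StdoutSink:
    """Second stand-in backend; prints spans instead of storing them."""

    def emit(self, span: dict) -> None:
        print(span)


def run_agent_step(sink: TraceSink, workflow_id: str) -> str:
    # Agent code depends only on the TraceSink interface, never on a backend SDK.
    sink.emit({"workflow_id": workflow_id, "name": "agent.step"})
    return "done"
```

Swapping `InMemorySink` for `StdoutSink` changes nothing in `run_agent_step`, which is the property the swap tests are meant to check.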
| # | Interface | Owns | Default backend | Difficulty to swap |
|---|---|---|---|---|
| I1 | ToolRunner | Tool dispatch, sandbox lane routing, replay cache | Persistent + ephemeral lanes | 🟢 easy |
| I2 | Checkpointer | Versioned graph state, migration plan on version bump | LangGraph + Postgres | 🔴 hard |
| I3 | WorkflowEngine | OPTIONAL — engaged when D4 fires (long-running, SLA, multi-region) | Temporal | 🟡 medium |
| I4 | TraceSink | OTel GenAI v1.37 emit + S3 dump escape hatch | Phoenix | 🟢 easy |
| I5 | EvalGate | Pass-rate delta + cost + latency + cross-family judge mandate | Phoenix Evals → Braintrust | 🟡 medium |
| I6 | HITLBroker | Pause / resume / expire / escalate + HMAC-signed resume token | Postgres-LISTEN + Slack bot | 🟡 medium |
| I7 | SecretsProvider | Inject at tool boundary; never into model context | Vault / AWS-SM | 🟢 easy |
| I8 | IdentityProvider | whoami + on_behalf_of; no anonymous tool calls | Custom JWT → WorkOS / Clerk | 🟢 easy |
| I9 | PolicyGate | 5-mode cascade: per-actor → per-tenant → per-tool → per-environment → default-deny | Custom YAML → OPA / Cerbos | 🟡 medium |
| I10 | RateLimiter + CircuitBreaker | Per-workflow LLM cap, per-vendor RPS, doom-loop guard | Redis | 🟢 easy |
| I11 | CostAttributor | Tag every span with team_id + workflow_id; CI test for untagged spans | OpenLLMetry + Phoenix cost-table | 🟢 easy |
| I12 | ModelRouter | Cost / complexity / sensitivity routing + cross-family judge enforcement | Anthropic-only B1 → Vercel AI SDK / OpenRouter | 🟢 easy |
Build-vs-buy is decided at the interface level, not per backend. The matrix has 12 rows for 12 interfaces. The cake diagram (interfaces above, backends below) mentions thirteen-plus backend names, but the buy decision is “do we own the interface or do we rent the whole layer?” Always own the interface.
The hardest swap is I2 Checkpointer. Graph state is product memory; you don’t migrate it like config. Every other interface is easy or medium. I12 ModelRouter is easy specifically because of one invariant (below).
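The I9 PolicyGate row in the table describes a 5-mode cascade ending in default-deny. One way to read that is "most specific scope first, first matching rule decides, deny if nothing matched". The sketch below assumes that reading; the rule and request shapes are hypothetical.

```python
from typing import Optional

# Most specific scope first; the first scope with a matching rule decides.
CASCADE = ("actor", "tenant", "tool", "environment")


def decide(rules: dict[str, dict[str, bool]], request: dict[str, str]) -> bool:
    """Return True to allow, False to deny.

    rules maps a scope name to {scope_value: allow?}; request maps a scope
    name to the value for this call. Both shapes are illustrative.
    """
    for scope in CASCADE:
        verdict: Optional[bool] = rules.get(scope, {}).get(request.get(scope, ""))
        if verdict is not None:
            return verdict
    return False  # fifth mode: default-deny when nothing matched
```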
The Invariants That Hold The Line
Three load-bearing invariants. Drop any one and the foundation leaks.
INV-1 (I12 ModelRouter): No provider SDK in agent code
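# CI grep test: any provider SDK import outside the router backends fails the build.
# grep's --exclude-dir matches directory base names, not paths, so filter by path prefix.
! grep -rnE "from anthropic|import anthropic|from openai|import OpenAI" packages/ \
  | grep -v "^packages/foundation/model-router/backends/"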
If this test fails, the agent code is one provider away from a rewrite. The grep is the wall. The test runs on every PR. Without it, the empty-diff swap proof is impossible.
INV-3 (I5 EvalGate): Cross-family judge mandate
The judge model MUST be from a different family than the agent. Same-family judges inflate pass rates by +5–10pp on the golden set (arxiv 2410.21819). Without cross-family enforcement, the eval gate lies and bad agents ship.
I5 declares the rule. I12 ModelRouter enforces it via a task_class=judge routing rule that picks a judge from a different family — and, when sensitivity gating applies, picks cross-family within the allowed pool (BAA / Standard), never around it.
This is the literal R20 mitigation. It appears in five places in the foundation: I5 invariant, I12 routing rule, I5 → I12 DAG edge, eval CI test, and the model-router’s per-call routing diagram.
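A minimal sketch of the `task_class=judge` routing rule, assuming a static model-to-family table and an ordered allowed pool. The model names, the `FAMILY` table, and `pick_judge` are hypothetical; the point is only that the judge is chosen inside the allowed pool and the call fails rather than going around it.

```python
# Hypothetical family table; a real router would load this from config.
FAMILY = {
    "claude-sonnet": "anthropic",
    "gpt-4o": "openai",
    "llama-3.3": "meta",
}


def pick_judge(agent_model: str, allowed_pool: list[str]) -> str:
    """Pick the first model in allowed_pool whose family differs from the agent's."""
    agent_family = FAMILY[agent_model]
    for candidate in allowed_pool:
        if FAMILY[candidate] != agent_family:
            return candidate
    # Never go around the sensitivity pool: fail instead of picking outside it.
    raise LookupError(f"no cross-family judge for {agent_model} in allowed pool")
```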
INV-4 (I6 HITLBroker): Pause goes through Checkpointer first
HITLBroker.pause() calls Checkpointer.save() first. Otherwise resume has nothing to load. This is the literal “stop redundant work” mechanism — the grip on the loop only holds if state survives the pause.
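The ordering in INV-4 can be sketched as follows. The in-memory checkpointer, the id formats, and the method signatures are assumptions made for the example; only the "save before registering the pause" order comes from the text.

```python
class InMemoryCheckpointer:
    """Stand-in for I2; save() returns a checkpoint id."""

    def __init__(self) -> None:
        self.saved: dict[str, dict] = {}

    def save(self, workflow_id: str, state: dict) -> str:
        checkpoint_id = f"{workflow_id}:{len(self.saved) + 1}"
        self.saved[checkpoint_id] = dict(state)
        return checkpoint_id

    def load(self, checkpoint_id: str) -> dict:
        return self.saved[checkpoint_id]


class HITLBroker:
    def __init__(self, checkpointer: InMemoryCheckpointer) -> None:
        self.checkpointer = checkpointer
        self.pending: dict[str, str] = {}  # request_id -> checkpoint_id

    def pause(self, workflow_id: str, node_id: str, state: dict) -> str:
        # Save first: if this raises, no request is registered and no human is asked.
        checkpoint_id = self.checkpointer.save(workflow_id, state)
        request_id = f"{workflow_id}/{node_id}"
        self.pending[request_id] = checkpoint_id
        return request_id

    def resume(self, request_id: str) -> dict:
        # Resume loads exactly the state that was saved at pause time.
        return self.checkpointer.load(self.pending.pop(request_id))
```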
HITL: Four Kinds, Six States, One Token Contract
The state machine collapses to six states (PENDING / APPROVED / REJECTED / EXPIRED / ESCALATED / CANCELED), but humans aren’t asked the same thing each time. Requests come in four kinds:
| Kind | What the human is asked | Payload |
|---|---|---|
| approve | ”Send this email to the prospect?” | {kind: "approve"} |
| choose | ”Which of these 3 reply drafts?” | {kind: "choose", choice_id: "draft_b"} |
| provide | ”What’s the discount cap for this account?” | {kind: "provide", payload: {cap: 0.15}} |
| veto | ”Auto-publishing in 24h. Veto?” — default-fire on EXPIRED | timeout = approve |
The EXPIRED → APPROVED edge fires only for kind: "veto" requests. Every other state transition writes an audit row.
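A sketch of the transition check, under stated assumptions. The source names the six states and the veto-only EXPIRED → APPROVED edge; the rest of the edge table below is a guess made so the example runs, and this version writes an audit row for every transition, including the veto default-fire.

```python
# Assumed edge table; states missing from the dict are treated as terminal.
ALLOWED = {
    "PENDING": {"APPROVED", "REJECTED", "EXPIRED", "ESCALATED", "CANCELED"},
    "ESCALATED": {"APPROVED", "REJECTED", "EXPIRED", "CANCELED"},
}


def transition(state: str, target: str, kind: str, audit: list[dict]) -> str:
    """Validate one state change and record it in the audit list."""
    allowed = set(ALLOWED.get(state, set()))
    if state == "EXPIRED" and kind == "veto":
        allowed.add("APPROVED")  # silence on a veto request means go ahead
    if target not in allowed:
        raise ValueError(f"illegal transition {state} -> {target} for kind={kind}")
    audit.append({"from": state, "to": target, "kind": kind})
    return target
```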
Resume token contract (HMAC-signed, single-use, monthly key rotation with a 2-key grace window):
{workflow_id, node_id, checkpoint_id, decision_slot,
actor_role, expires_at, nonce}
Storage: tokens table with hash only, never raw payload. Wire format: /resume?t=<base64url>. Single-use enforced via tokens.usedAt.
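A minimal sketch of the token contract using only the Python standard library. The `body.signature` layout, the function names, and the in-memory `used` dict (standing in for the tokens table and its `usedAt` column) are assumptions; the two-key list models the grace window during rotation.

```python
import base64
import hashlib
import hmac
import json


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _unb64(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue(claims: dict, key: bytes) -> str:
    """Serialize claims and append an HMAC-SHA256 signature."""
    body = json.dumps(claims, sort_keys=True).encode()
    sig = hmac.new(key, body, hashlib.sha256).digest()
    return _b64(body) + "." + _b64(sig)


def redeem(token: str, keys: list[bytes], used: dict[str, float], now: float) -> dict:
    """Verify against current and previous key, check expiry, enforce single use."""
    body_part, sig_part = token.split(".")
    body, sig = _unb64(body_part), _unb64(sig_part)
    expected = [hmac.new(k, body, hashlib.sha256).digest() for k in keys]
    if not any(hmac.compare_digest(sig, e) for e in expected):
        raise PermissionError("bad signature")
    claims = json.loads(body)
    if now >= claims["expires_at"]:
        raise PermissionError("token expired")
    token_hash = hashlib.sha256(token.encode()).hexdigest()  # store the hash only
    if token_hash in used:
        raise PermissionError("token already used")
    used[token_hash] = now  # plays the role of tokens.usedAt
    return claims
```

Passing `[new_key, old_key]` to `redeem` lets tokens signed just before a rotation still resume, while anything signed with an older key is rejected.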
TTL ladder: 24h L1 → 48h L2 → 7d hard cap → page on-call. The escalation is a state machine, not a Slack reminder.
Replay Safety: The “Stop Redundant Work” Mechanism
User goal A8 is one line: “resume MUST NOT redo work.” The naive failure mode: agent crashes mid-loop, resume re-runs every tool call, send_email fires twice. The customer sees two emails. You see an incident.
The fix is a typed taxonomy on every tool definition:
| replay_class | Examples | Resume behavior |
|---|---|---|
| pure | parse_json, compute_hash, format_template | Always replay-safe. Cache is optimization only. |
| idempotent_with_key | stripe_charge (idem-key), create_user (idempotency-key header), upsert_record | Safe IFF same idempotency key. Cache key includes idempotency_key_fn(input). Vendor enforces single-write. |
| unsafe_on_replay | send_email (no thread_ts), post_slack_msg, publish_event, delete_record | Cache HIT only. Cache MISS on resume → throw ReplayUnsafeError. Operator decides skip / retry-with-new-input / cancel. |
The cache table is dead simple:
CREATE TABLE tool_call_results (
workflow_id TEXT NOT NULL,
node_id TEXT NOT NULL,
checkpoint_id TEXT NOT NULL,
tool_name TEXT NOT NULL,
input_hash TEXT NOT NULL,
result_json TEXT NOT NULL,
created_at INTEGER NOT NULL,
PRIMARY KEY (workflow_id, node_id, checkpoint_id, tool_name, input_hash)
);
CREATE INDEX tcr_workflow ON tool_call_results(workflow_id, created_at);
The no-cross-FK rule matters: this table shares the foundation Postgres but has no FK to workflowInstances or audit_events. A backend swap doesn’t carry referential integrity along, so a foreign key here would break the swap claim. The cache stands alone.
On resume, ToolRunner.execute() looks up (workflow_id, node_id, checkpoint_id, tool_name, input_hash). HIT returns cached result, no sandbox dispatch, no side effect. MISS on unsafe_on_replay throws hard. MISS on pure or idempotent_with_key runs through PolicyGate → Sandbox → cache → Audit.
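The lookup described above can be sketched with a dict standing in for the tool_call_results table. The function signature, the `resuming` flag, and the JSON-based `input_hash` are assumptions; PolicyGate, sandbox dispatch, and audit are left out to keep the example short.

```python
import hashlib
import json
from typing import Callable


class ReplayUnsafeError(Exception):
    """Raised when an unsafe_on_replay tool has no cached result on resume."""


def input_hash(tool_input: dict) -> str:
    return hashlib.sha256(json.dumps(tool_input, sort_keys=True).encode()).hexdigest()


def execute(cache: dict, key_prefix: tuple, tool_name: str, replay_class: str,
            tool_fn: Callable[[dict], dict], tool_input: dict, resuming: bool) -> dict:
    """key_prefix is (workflow_id, node_id, checkpoint_id), as in the primary key."""
    key = key_prefix + (tool_name, input_hash(tool_input))
    if key in cache:
        return cache[key]  # HIT: no dispatch, no side effect
    if resuming and replay_class == "unsafe_on_replay":
        raise ReplayUnsafeError(f"{tool_name}: cache miss on resume; operator must decide")
    result = tool_fn(tool_input)  # PolicyGate, sandbox, and audit are omitted here
    cache[key] = result
    return result
```

With this shape, a resumed `send_email` call with the same input returns the cached result without sending again, and a resumed call with a new input stops for an operator instead of firing.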
This is how A8 (“stop redundant work”) becomes a mechanism, not a slogan.
The 6-Week POC + One Honest Exit Gate
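Six weeks. Two engineers. One workflow shipped end-to-end (W1), then in week 6 a different engineer ships a second (W2), and the foundation diff is empty. In the table below, W0–W6 in the Week column are calendar weeks; W1 and W2 in the Deliverable column name the first and second workflow.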
| Week | Who | Deliverable | Gate |
|---|---|---|---|
| W0 | 1 eng (½) | Decisions doc: D2 (Phoenix mode), I8 (identity backend), promotion-contract YAML, SDR labeling slot booked | Decisions committed |
| W1 | 1 eng | 8 of 12 interfaces scaffolded with mock backends. CI scaffold. INV-1 grep test green. | pnpm run agent:dev boots |
| W2 | 1 eng | I1 real backends (persistent + ephemeral). Tools annotated replay_class + persistence_class. | SWAP 1 — Daytona ↔ OpenComputer flip; agent unchanged. Replay test passes. |
| W3 | +1 eng | I4 + I5: OpenLLMetry → OTel Collector → Phoenix + S3 dump. 50-example golden dataset. | SWAP 2 — stub Langfuse TraceSink, flip env, spans flow. |
| W4 | 2 eng | I6 HITLBroker. Resume-token live. agent-eval.yml regression gate w/ cross-family judge mandate. | CROSS-FAMILY — pause → Slack → approve → resume hits cache. Degraded prompt PR fails CI. |
| W5 | 2 eng | W1 to staging. Foundation README + swap-test recipes + gotcha list. Tag interfaces v1.0.0. | README enables team-B handoff |
| W6 | different eng | W2 ships from same interfaces. NEW tool defs, NEW graph nodes, NEW sandbox backend, NEW I12 backend. NO changes to foundation packages. | G1–G4 — empty-diff + replay safety + eval gate enforces + 3 swaps each <1 day |
The W6 different-engineer constraint is what makes the empty-diff proof meaningful. If the same person who wrote the foundation also writes W2, they’ll quietly bend the foundation to make it work. A different engineer with read-only access to the README is the actual test.
The slip rule: if any gate fails, iterate the foundation. Slip the calendar; never compromise the gates. “Done with caveats” is not allowed. Foundation leak (Gate 1 fail) → fix and re-run W6. Calendar slips, gate doesn’t.
D1–D10: What’s a Knob, What’s an Invariant
Ten numbered decisions. Each has a default, plus the trigger that flips it. Invariants are not on this list because they don’t flip.
| ID | Decision | Default | Flip when |
|---|---|---|---|
| D1 | HITL UI surface | Slack bot | Enterprise SSO-gated in-app inbox needed |
| D2 | Phoenix mode | Cloud-first | Data residency rules, bill > $2k/mo, or trace volume > 10M spans/mo |
| D3 | Sandbox lane defaults | Ephemeral E2B | A6 fires (compliance) → pre-staged Tensorlake (HIPAA + SOC2 + EU) |
| D4 | WorkflowEngine engagement | Deferred (LangGraph alone) | Workflow > 24h, SLA penalty, or cross-region failover |
| D5 | HITL UX | Slack-only | Approval volume > 100/day or audit needs in-app history |
| D6 | LLM cap per workflow | 50 calls | Long-running deep-research workflow needs 200+ (raise with explicit cost ceiling) |
| D7 | Default persistence_class | ephemeral | Tool needs warm session (Playwright, REPL) — tag persistent on tool def |
| D8 | Postgres topology | Single PG | Trace volume bottlenecks graph-state writes → split |
| D9 | ModelRouter backend | Anthropic-only B1 shim | Need non-Anthropic model (open-weight, GPT-4o judge) — swap to Vercel AI SDK / OpenRouter |
| D10 | Open-weight share | OFF (closed-only) | Open-weight pass-rate within 2pp of closed on golden set → ramp 30% → 60% → 80% |
D1–D10 are config knobs. They flip without breaking interfaces. They map directly onto the 7-question decision tree (compliance / duration / tenancy / HITL frequency / workflow shape / failure cost / cost sensitivity) — answer the questions, the decisions resolve.
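A sketch of how a few of these knobs could be held as plain config and flipped by their triggers. Only D4, D6, D7, and D10 are modeled; the field names, the `resolve` function, and reading "within 2pp" as "gap of 2.0 or less" are assumptions.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Knobs:
    workflow_engine: str = "langgraph"    # D4 default: WorkflowEngine deferred
    llm_cap_per_workflow: int = 50        # D6 default
    persistence_class: str = "ephemeral"  # D7 default
    open_weight_share: float = 0.0        # D10 default: OFF


def resolve(duration_hours: float, has_sla_penalty: bool, cross_region: bool,
            open_vs_closed_gap_pp: float) -> Knobs:
    """Start from defaults and flip only the knobs whose trigger fired."""
    knobs = Knobs()
    # D4 flips on any of its three triggers.
    if duration_hours > 24 or has_sla_penalty or cross_region:
        knobs = replace(knobs, workflow_engine="temporal")
    # D10 flips when open-weight is within 2pp of closed; start at the first ramp step.
    if open_vs_closed_gap_pp <= 2.0:
        knobs = replace(knobs, open_weight_share=0.30)
    return knobs
```

Nothing in `resolve` touches an interface; it only changes values that the interfaces read, which is what makes these knobs rather than invariants.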
The Cost Model — Why the Foundation Pays For Itself
Per-run blended cost for a representative agent (5 LLM calls + 3 tool calls + 1 judge):
| Lane | Routing config | Per-run | vs baseline |
|---|---|---|---|
| D10 OFF (closed only) | 5 Sonnet calls + 1 GPT-4o judge | $0.21 / run | baseline |
| D10 partial (30% open) | 2 Llama-3.3 cheap_classify + 3 Sonnet + 1 cross-family judge | $0.13 / run | −38% |
| D10 full (80% open + Mixtral) | 4 open-weight (Groq / Together) + 1 Sonnet plan + 1 cross-family judge | $0.05 / run | −76% |
Assumptions: average call ~25k input + ~3k output tokens. Numbers ±30%; validate at POC W5 against real workload.
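The "vs baseline" column follows directly from the per-run figures in the table. A small check, with `savings_pct` as a hypothetical helper name:

```python
def savings_pct(baseline: float, lane: float) -> int:
    """Percent change vs baseline, rounded to the nearest whole percent."""
    return round((lane - baseline) / baseline * 100)


print(savings_pct(0.21, 0.13))  # -38
print(savings_pct(0.21, 0.05))  # -76
```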
Switching from closed to open-weight costs zero agent rewrites — because of INV-1. The grep test is the wall. The wall is what makes the cost number real.
The eval gate doesn’t compromise either: judge stays cross-family in every lane. Cheap classifier + smart judge is exactly the routing the foundation enables.
The Decision Tree: Which-When By Org Profile
Seven questions. Answer top-to-bottom. Each answer takes a default or flips a decision.
- Compliance posture? HIPAA / FedRAMP / EU residency → Tensorlake + Bedrock + vLLM option · SOC2 → Daytona + E2B · unconstrained → defaults
- Workload duration? <30s tool calls → ephemeral primary · multi-min coding-agent → persistent primary · mixed → both lanes co-equal
- Multi-tenant? Single → trivial tenant_id · SaaS → §6.8 enterprise spec (SSO/SCIM, audit hash chain, per-tenant region pin) · isolated → per-tenant DB stamp
- HITL frequency? <5% → I6 optional path · most → I6 first-class with Slack · 3+ personas → custom HITL UI
- Workflow shape? <24h → LangGraph alone · >24h or SLA → add Temporal · multi-region → Temporal mandatory
- Failure cost? Internal → defaults · customer-facing → R2 + R8 mitigations critical, eval gate required · regulatory → all §6.8 invariants enforced, audit chain with prev_hash
- Cost sensitivity? <$200/mo → defaults · $2k–$20k → flip D10 for cheap_classify · >$20k → D6 aggressive, D10 broad, D8 split-PG
Compliance (Q1) and tenancy (Q3) are the most expensive to retrofit. Spend the time on those before W0. Cost (Q7) and HITL frequency (Q4) re-tune mid-POC without foundation changes.
If two branches feel equally true, take the branch that triggers more invariants. Over-investing in safety is recoverable; under-investing is not.
What This Replaces
This piece does not replace harness engineering as a concept. It builds it.
- Harness Engineering & Deep Agents named the layer above context engineering and showed that LangChain’s harness-only changes took a coding agent from 52.8% to 66.5% on Terminal Bench 2.0. This piece answers: “what does the org-owned form of that harness look like, and how do we keep it from leaking?”
- Building an Org Harness mapped the org-level operating system around AI work. This piece is the runtime layer — the 12 interfaces an internal platform team owns, with method signatures and exit gates.
- Epistemological Crisis named what happens when models confidently lie. INV-3 (cross-family judge mandate) is one of the few mechanical defenses that works. The foundation enforces it via the I12 routing rule.
These pieces compose. Harness engineering is the architecture. Org harness is the human shape. Agent infrastructure foundation is the buildable runtime.
What This Does Not Do
Refusing to do these things is part of the contract.
- It does not pick your model. I12 ModelRouter is the interface; the routing config is per-team, per-task-class, per-sensitivity. The foundation has no opinion on whether Sonnet or Llama-3.3 is “better” — only that whichever you pick, the agent code never imports its SDK.
- It does not pick your sandbox vendor. I1 ToolRunner has persistent + ephemeral lanes; the default backends are Daytona + E2B but the swap test in W2 requires you to flip to a second backend on day one. If the swap is hard, the foundation isn’t a foundation.
- It does not solve eval forever. Phoenix Evals is the start. Braintrust is one likely successor. The interface is EvalGate; the backend is whoever you trust this quarter. Cross-family judge mandate (INV-3) holds across all of them.
- It does not mandate Temporal. I3 WorkflowEngine is OPTIONAL, explicitly. LangGraph alone is the default. D4 trigger is what flips it on. Most workflows don’t need it.
- It does not pretend the empty-diff gate is easy. Most teams’ first attempt at this gate fails. That’s the point. If it didn’t fail sometimes, it wouldn’t be a useful gate. Slip the calendar; never compromise the gate.
How to Use This
Three modes. Pick yours.
Mode 1 — You’re starting from zero. Use the 6-week POC verbatim. W0 prep is non-negotiable. The four exit gates are non-negotiable. The decision tree gives you your D1–D10 defaults.
Mode 2 — You already have an “agent platform.” Run the empty-diff swap test against your current stack. Pick a backend you’ve never swapped — sandbox is the cheapest test. If swapping it requires touching agent code, you have a stack, not a foundation. The 12 interfaces are the punch list.
Mode 3 — You’re evaluating a vendor pitch. Map the vendor’s offering onto the 12 interfaces. Anything they own across multiple interfaces is the lock-in surface — that’s what you’ll never own back. Anything they own on a single interface is a backend rental — that’s fine. The cake (interfaces above, backends below) is the diagnostic.
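The Mode 3 mapping can be done on paper, but a sketch makes the rule explicit: one interface per product is a rental, more than one is lock-in surface. The function name and the example pitch below are hypothetical.

```python
def lock_in_surface(vendor_products: dict[str, set[str]]) -> dict[str, str]:
    """Classify each vendor product by how many of the 12 interfaces it spans."""
    return {
        product: "backend rental" if len(interfaces) <= 1 else "lock-in surface"
        for product, interfaces in vendor_products.items()
    }


# Hypothetical pitch: a sandbox product and an all-in-one platform.
pitch = {"sandbox": {"I1"}, "platform": {"I2", "I4", "I5"}}
print(lock_in_surface(pitch))
```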
Visual Companions
The full reference (PLAN.md v3.3) lives in the repo. Five HTML diagram pages walk the foundation visually:
- Foundation DAG — 12 interfaces with dependencies, hero panels, big-idea SVG cake, D1–D10 grid, cost model, ModelRouter routing flowchart, build-vs-buy matrix
- HITL State Machine — 4 request kinds + 6-state diagram + TTL ladder + token contract
- 6-Week POC Timeline — week-by-week deliverables + engineer-week loading heatmap + 4 exit gates
- Replay Safety — resume-time tool dispatch sequence + replay_class taxonomy + tool_call_results DDL
- Decision Tree — Q1–Q7 question cards with HOT / WARM / COLD branches
Each page has a light/dark theme toggle, glossary callouts, and modal expand for diagrams. They’re built to be read cold — no prior context required.
The Single Honest Test
If you remember one thing from this piece, remember this:
A different engineer ships the second workflow without touching one line of foundation code. If the diff is non-empty, the foundation leaked. Slip the calendar; never compromise the gate.
Everything else — interfaces, invariants, state machines, cost models, decision trees — is in service of that one test passing.
Build the harness. Swap the rest.
Sources & Provenance
Verifiable sources. Dates matter. Credibility assessed.
Improving Deep Agents with Harness Engineering ↗
LangChain · LangChain Blog
"The original harness-engineering case study. Coding agent went from 52.8% to 66.5% on Terminal Bench 2.0 with harness-only changes — same model, different harness. Establishes that the harness, not the model, is the lever."
LangGraph Postgres Checkpointer ↗
LangGraph · LangGraph Docs
"Reference implementation of versioned graph state with replay support. Validates I2 Checkpointer's invariant that graph_id is semver-locked and never modified in place — migration plan required on version bump."
OTel Semantic Conventions for GenAI v1.37 ↗
OpenTelemetry · OTel Spec
"The schema for I4 TraceSink. Standardizes gen_ai.system, gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.completion across vendors. Backend swap = OTel Collector exporter change, not span schema change."
LLM-as-Judge Bias: Same-Family Inflation ↗
Various · arxiv 2410.21819
"Empirical evidence that same-family LLM judges inflate pass rates by +5–10pp on golden datasets. The literal R20 mitigation: judge model MUST be from a different family than the agent. INV-3 in I5 EvalGate enforces."
Tensorlake Sandboxes — Compliance Out of Box ↗
Tensorlake · Tensorlake Docs
"Firecracker microVMs with HIPAA, SOC2, and EU residency available out-of-box. Validates the A6 trigger pattern: pre-stage a compliance backend behind I1 ToolRunner, flip lane in 0.5–1 day when compliance fires."
OpenComputer (diggerhq) ↗
Digger · GitHub
"Apache-2.0 full Linux VM sandbox with agent SDK. The OSS-purist alternative to Daytona for I1 ToolRunner persistent lane. Validates that 'persistent + ephemeral' is a real lane split, not a single product."
Phoenix Evals + Tracing ↗
Arize · Phoenix Docs
"Phoenix's LLM-as-a-judge tooling supports cross-family judge specification natively. The default I5 EvalGate backend; the swap-target is Braintrust. CI gate consumes pass-rate delta, cost, latency."
Temporal Workflow Determinism ↗
Temporal · Temporal Docs
"The reference for I3 WorkflowEngine deterministic replay semantics. Each LangGraph node wrapped as a Temporal Activity gets cross-region failover, signal-based HITL, and replay safety at the orchestration layer."
Postgres LISTEN/NOTIFY for HITL Brokering ↗
PostgreSQL · PostgreSQL Docs
"The default I6 HITLBroker backend. LISTEN/NOTIFY is a built-in pub/sub primitive — no separate broker required. HMAC-signed resume tokens + tokens table sit on top."
Vercel AI SDK Provider Abstraction ↗
Vercel · Vercel AI SDK
"Reference for I12 ModelRouter swap. Anthropic, OpenAI, Groq, Together, Mistral, Cohere all behind a single TypeScript interface. Cross-family judge selection becomes a config decision, not a code change."
OpenRouter Multi-Provider Gateway ↗
OpenRouter · OpenRouter Docs
"The closed-+-open-weight aggregation point. Validates the D10 ramp-up pattern (0% → 30% → 60% → 80% open-weight) without per-provider integration work. Cost data per model published per token."
OPA / Cerbos Policy Engines ↗
Open Policy Agent / Cerbos · OPA Docs
"The 5-mode cascade pattern (per-actor → per-tenant → per-tool → per-environment → default-deny) is implementable in Rego (OPA) or Cerbos policy DSL. I9 PolicyGate's swap from custom YAML to OPA is the medium-difficulty path."
OpenLLMetry ↗
Traceloop · GitHub
"OTel-compliant gen-ai instrumentation library. Tags spans with model, prompt, completion, tokens, cost. The default emit path for I4 TraceSink and I11 CostAttributor. Sits below I4 — vendor-portable."
WorkOS / Clerk Identity for Agents ↗
WorkOS / Clerk · WorkOS Docs
"The default I8 IdentityProvider swap target. Actor + on_behalf_of + tenant.id are first-class. Multi-tenant routing keys off actor.tenant_id without bespoke identity code."
Braintrust Eval Platform ↗
Braintrust · Braintrust Docs
"The likely I5 EvalGate swap target after Phoenix Evals. Production eval at scale with branching experiments. Validates that EvalGate's interface (pass-rate delta + cost + latency JSON) is portable across implementations."
Daytona Persistent Sandboxes ↗
Daytona · Daytona Docs
"The default I1 ToolRunner persistent-lane backend for the local-parity case. Docker-compose for local dev mirrors production semantics. Swap-tested against OpenComputer in W2 of the POC."
E2B Ephemeral Sandboxes ↗
E2B · E2B Docs
"The default I1 ToolRunner ephemeral-lane backend. Sub-second cold starts, MicroVM isolation. Pairs with Daytona for the persistent-vs-ephemeral lane split."
HashiCorp Vault for Secrets at Tool Boundary ↗
HashiCorp · Vault Docs
"The reference I7 SecretsProvider swap target after env-var defaults. Audit row written per secret access; secret never logged or sent to model. AWS Secrets Manager is the equivalent on AWS-native stacks."