The Runtime Commoditized. The Eval Layer Is Still Your Problem.

Three hyperscalers shipped production-grade agent runtimes inside a sixty-day window. Nine days after the most recent one launched, Axios put a $500 million Claude bill on the front page of corporate America. The two stories sit on the same fact: the runtime layer just commoditized. The harness that turns a runtime into a cost-controlled production system did not.

The May 9 Pilot-to-Production Death March post closed on Uber's harness as the gap most enterprises had not crossed: central MCP gateway, tool registry, security scans, no-code Agent Builder. The thirty days since stress-tested that conclusion. Google, Microsoft, and AWS each shipped a production runtime. None shipped the harness. The buyer still owns eval, observability, governance, and cost control. Adobe's 2026 Digital Trends survey of 3,000 customer-experience executives found only 31% have a measurement framework for agentic AI. 47% have none or do not know. The runtimes do not close that gap; they lower the friction to build the thing that produces it.

What Shipped in Sixty Days

The honest framing is a window. Microsoft Agent Framework 1.0 GA'd on April 3, 2026. AWS Bedrock AgentCore Runtime went GA on October 13, 2025, with a steady cadence of additions since and an AgentCore harness public preview on April 22, 2026. Google Antigravity 2.0 launched at I/O on May 19, 2026. The May 2026 convergence is the moment all three are simultaneously in market with GA production surfaces. Antigravity 2.0 closed the window.

Google Antigravity 2.0 (developer blog, TechCrunch coverage) ships a standalone desktop app, CLI, SDK, Managed Agents API, and Enterprise Platform tier. The headline capability is parallel subagent orchestration with managed execution. That is the kind of thing teams used to wire up by hand on top of LangGraph or a homegrown orchestrator. It is a runtime: a place agents live and run.

Microsoft Agent Framework 1.0 (Microsoft devblog, Visual Studio Magazine) is the post-AutoGen synthesis. First-party connectors into Foundry, Azure OpenAI, OpenAI, Anthropic, Bedrock, Gemini, and Ollama. Type safety, middleware, session-based state, telemetry hooks. The framing is multi-vendor neutrality: Microsoft becomes the wiring layer and the model providers are interchangeable. A runtime with a connector library bolted on.

AWS Bedrock AgentCore Runtime (Ernest Chiang's GA writeup, AWS docs) takes the deepest isolation posture. Every session runs in a dedicated Firecracker microVM with isolated CPU, memory, and filesystem. The microVM terminates and memory sanitizes after the session ends. Up to 8-hour sessions, 15-minute idle timeout. The April 22 AgentCore harness preview is AWS starting to surface the eval and observability layer, but it remains in preview. The runtime is GA. The harness is not.
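To make the two session limits concrete, here is a toy Python model of when a session would end under an 8-hour ceiling and a 15-minute idle timeout. It is not AWS code and calls no AWS API. The function name `session_end` and the assumption that idleness is measured from the last observed activity are illustrative only.

```python
MAX_SESSION_S = 8 * 60 * 60    # 8-hour session ceiling, in seconds
IDLE_TIMEOUT_S = 15 * 60       # 15-minute idle timeout, in seconds


def session_end(activity_times):
    """Return the second (since session start) at which the session would end.

    activity_times: seconds since start at which the agent did something.
    Assumption: the idle clock restarts at each activity.
    """
    last = 0.0
    for t in sorted(activity_times):
        if t > MAX_SESSION_S or t - last > IDLE_TIMEOUT_S:
            break  # the session was already gone when this activity arrived
        last = t
    # The session ends at whichever limit is hit first.
    return min(last + IDLE_TIMEOUT_S, MAX_SESSION_S)


if __name__ == "__main__":
    # Activity at 0, 10, and 20 minutes, then silence.
    print(session_end([0, 600, 1200]))
```

The point of the model is planning: a long-running workflow that pauses for more than fifteen minutes between steps needs its own resume logic, whatever the 8-hour ceiling says.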

Read the three side by side and the pattern is the same. Each one is a substrate. Antigravity orchestrates the agents. MS Agent Framework wires the connectors. AgentCore isolates the session. None of them tell you when your agent is wrong. None of them sit on a budget and refuse to spend past it. None of them ship a golden-set test harness, a trace replay tool with semantic search, an attribution graph for which prompt triggered which tool call, or a regression suite that catches behavior drift between model versions.

The runtime is what gets your agent running. The harness is what keeps it from setting money on fire.

The Sticker Shock Story Is an Observability Story

Axios published "AI sticker shock hits corporate America" on May 28, 2026. The anecdotes that anchor the piece are the part worth reading carefully, and the part worth treating carefully.

The headline number is a $500-million Claude bill in a single month, racked up by an unnamed enterprise client that failed to set usage caps. The source is an AI consultant interviewed by Axios. The Decoder picked the same anecdote up. Treat it as a reported anecdote, not a verified case study. The company is unnamed, the bill is uncorroborated by primary documents, and the chain of attribution runs through a consultant with an obvious interest in the story landing. What the anecdote does establish, even if the number is off by a factor of five, is the shape of the failure: usage uncapped, observability absent, cost controls non-existent, until the invoice arrived.

Two more grounded data points from the same piece. Microsoft canceled the majority of its Claude Code licenses, partly due to cost. Uber burned its full 2026 AI budget by April. Both are first-party-ish: Microsoft's procurement decision is observable in the dev-tools market, and Uber's spend trajectory squares with the 60,000-executions-per-week production agent number from the May 9 post. CloudBees CEO Anuj Kapur's quote that workforce reduction is "the only lever they can pull" to offset AI bills is what a CFO says when no one owns the cost surface.

The common shape across all three: nobody was watching the meter. The agent ran. The token bill ran with it. There was no system in place to alert when burn rate diverged from projection, no per-workflow budget governor, no automated kill-switch on runaway loops. These are not LLM problems. These are not runtime problems. These are observability and governance problems, and they are exactly the layer the three new runtimes do not ship.
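The three missing controls named above (a burn-rate alert, a per-workflow budget, and a kill-switch on runaway loops) are small in code terms. The following is a minimal Python sketch, not any vendor's API: the class `CostGovernor`, its field names, and the linear spend projection are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when a call would push a session or workflow past a hard cap."""


@dataclass
class CostGovernor:
    workflow_cap_usd: float        # hard cap on aggregate spend for the workflow
    session_cap_usd: float         # hard cap on any single session
    max_steps_per_session: int     # kill-switch threshold for runaway loops
    alert_ratio: float = 1.5       # alert when spend exceeds projection by this factor
    workflow_spend: float = 0.0
    session_spend: dict = field(default_factory=dict)
    session_steps: dict = field(default_factory=dict)
    alerts: list = field(default_factory=list)

    def charge(self, session_id: str, cost_usd: float) -> None:
        """Record one model or tool call, refusing it if any limit would be breached."""
        spent = self.session_spend.get(session_id, 0.0)
        steps = self.session_steps.get(session_id, 0)
        if steps + 1 > self.max_steps_per_session:
            raise BudgetExceeded(f"{session_id}: step limit hit, possible runaway loop")
        if spent + cost_usd > self.session_cap_usd:
            raise BudgetExceeded(f"{session_id}: session cap would be exceeded")
        if self.workflow_spend + cost_usd > self.workflow_cap_usd:
            raise BudgetExceeded("workflow cap would be exceeded")
        self.session_spend[session_id] = spent + cost_usd
        self.session_steps[session_id] = steps + 1
        self.workflow_spend += cost_usd

    def check_burn_rate(self, elapsed_fraction: float) -> bool:
        """Return True and record an alert if spend runs ahead of a linear projection.

        elapsed_fraction: share of the budget period that has passed (0.0 to 1.0).
        """
        projected = self.workflow_cap_usd * elapsed_fraction
        if projected > 0 and self.workflow_spend > projected * self.alert_ratio:
            self.alerts.append(
                f"spend {self.workflow_spend:.2f} vs projected {projected:.2f}"
            )
            return True
        return False


if __name__ == "__main__":
    gov = CostGovernor(workflow_cap_usd=100.0, session_cap_usd=10.0,
                       max_steps_per_session=3)
    gov.charge("s1", 4.0)
    gov.charge("s1", 4.0)
    try:
        gov.charge("s1", 4.0)   # would take the session to 12.00, past its cap
    except BudgetExceeded as err:
        print("blocked:", err)
```

The design choice that matters is that `charge` refuses the call before the money is spent rather than reporting it afterward. The dollar figures are placeholders; only the buyer knows what a workflow should cost.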

The Buyer-Side Numbers

Adobe's 2026 Digital Trends survey drew on 3,000 executives and practitioners from a customer-experience-leaning respondent pool. 44% of organizations have a measurement framework for generative AI. Only 31% have a measurement framework for agentic AI specifically. 47% have neither or are unsure. The CX skew matters; these numbers are strongest for marketing and service agents, less directly applicable to deep-tech or coding-agent deployments. The directional claim holds across cuts of the data: roughly half of enterprises deploying agents have no defined way to measure whether the agents are working.

CMSWire's read on the same data frames it as ambition outrunning readiness. The agent rollouts are happening. The measurement infrastructure is not.

Gartner's 2026 Hype Cycle for Agentic AI sharpens the consequence. Agentic AI sits at the Peak of Inflated Expectations with a 2-5 year time to plateau. Gartner forecasts (a projection, not measured data) that 40%+ of agentic AI projects will be canceled by end of 2027 due to escalating costs, unclear business value, or inadequate risk controls. The three named cancellation drivers are the three things eval infrastructure exists to manage. Only 13% of IT application leaders strongly agreed they had the right governance structures in place. 74% view agents as a new attack vector. 17% of organizations have deployed AI agents today; 60%+ expect to within two years.

The MIT State of AI in Business 2025 finding that 95% of generative AI pilots fail to deliver measurable P&L impact sits in the same vicinity but should not carry the argument alone. The figure has been contested by other researchers on methodology and framing. Use it as one data point in a stack with Adobe and Gartner. The healthier read from the MIT data is the vendor-led versus internal-build split: specialized vendor-led deployments succeed around 67% of the time, internal builds around 33%. Ownership and methodology gate ROI more than model quality does.

Stack the four together. Adobe's 31% measurement-framework number. Gartner's 13% governance-readiness number. MIT's 33% internal-build success rate. The Axios anecdotes about uncapped spend. The runtime is no longer the constraint. The harness around it is.

What the Runtimes Actually Contain

Side by side, the contrast is sharper than the marketing surfaces let on.

| Component | Antigravity 2.0 | MS Agent Framework 1.0 | AWS AgentCore Runtime |
|---|---|---|---|
| Execution model | Managed agents, parallel subagents | .NET / Python framework with workflow primitives | Per-session Firecracker microVM |
| Isolation | Managed runtime | Process-level | Per-session microVM, sanitized on exit |
| Model providers | Google ecosystem + external via SDK | First-party multi-vendor connectors | Bedrock-hosted + bring-your-own |
| Session ceiling | Long-running supported | Application-defined | 8 hours, 15-min idle timeout |
| Observability | Telemetry hooks | OpenTelemetry integration | CloudWatch + AgentCore harness (preview) |
| Eval / golden-set testing | Not shipped | Not shipped | Harness in preview only |
| Cost governor | Not shipped | Not shipped | Not shipped |
| Audit trail / trace replay | Activity log | Telemetry pipeline | CloudTrail |
| Agent-scoped IAM | Standard auth | Standard auth | IAM integration, agent-scoped tokens |
| Tool registry | Managed Agents API | Connector library | Bedrock tool catalog |

Three areas have something in every column: isolation, observability hooks, and tool surfaces. Three are missing from every vendor in any GA form: eval and golden-set testing (AWS has a harness, but only in preview), cost governors, and the actual trace-replay-and-attribution loop that turns telemetry data into a debugging surface. An activity log, a telemetry pipeline, or CloudTrail fills the audit row without closing that loop. Telemetry hooks are not observability. CloudWatch metrics are not an eval suite. A connector library is not a registry with pull-request gating, security scans, and a no-code Agent Builder.
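The gap between raw telemetry and an attribution loop can be shown in a few lines of Python. This is a sketch under stated assumptions: the `Span` record, its `parent_id` link, and both function names are hypothetical, and the keyword match is a deliberate stand-in for the semantic search a real trace surface would need.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]   # the span that caused this one, if any
    kind: str                  # "prompt", "model", or "tool"
    content: str


def attribute_tool_call(spans, tool_span_id):
    """Walk parent links from a tool-call span up to the prompt that triggered it."""
    by_id = {s.span_id: s for s in spans}
    current = by_id[tool_span_id]
    seen = set()
    while current.kind != "prompt":
        if current.parent_id is None or current.span_id in seen:
            raise ValueError("no originating prompt found")
        seen.add(current.span_id)
        current = by_id[current.parent_id]
    return current


def search_failures(spans, needle):
    """Case-insensitive keyword search; a placeholder for semantic search."""
    return [s for s in spans if needle.lower() in s.content.lower()]


if __name__ == "__main__":
    trace = [
        Span("p1", None, "prompt", "Refund order 42"),
        Span("m1", "p1", "model", "calling refund tool"),
        Span("t1", "m1", "tool", "refund_api error: timeout"),
    ]
    for failed in search_failures(trace, "timeout"):
        origin = attribute_tool_call(trace, failed.span_id)
        print(failed.span_id, "was triggered by prompt:", origin.content)
```

Telemetry hooks give you the spans. The two functions, and everything a production version of them would require, are the part the buyer still writes.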

This is the line Uber crossed on their own. Their central MCP gateway, tool registry with pre-deployment security scans, continuous production monitoring, and no-code Agent Builder were the platform investment that preceded the 1,500 production agents. None of the three hyperscaler runtimes ship that platform layer. The runtimes assume the buyer has it or is willing to build it.
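As a rough illustration of what a gated registry does, here is a minimal Python sketch. It is not Uber's implementation, whose internals are not described here; `ToolEntry`, `ToolRegistry`, and the two boolean gates are hypothetical names chosen to mirror the review and security-scan steps mentioned above.

```python
from dataclasses import dataclass, field


@dataclass
class ToolEntry:
    name: str
    review_approved: bool = False   # pull-request review signed off
    scan_passed: bool = False       # pre-deployment security scan result


@dataclass
class ToolRegistry:
    tools: dict = field(default_factory=dict)    # tool name -> ToolEntry
    grants: dict = field(default_factory=dict)   # agent id -> set of tool names

    def register(self, entry: ToolEntry) -> None:
        """Admit a tool only if it cleared both review and the security scan."""
        if not (entry.review_approved and entry.scan_passed):
            raise PermissionError(
                f"{entry.name}: needs review approval and a passing scan"
            )
        self.tools[entry.name] = entry

    def grant(self, agent_id: str, tool_name: str) -> None:
        """Allow one agent to call one registered tool."""
        if tool_name not in self.tools:
            raise KeyError(f"{tool_name} is not in the registry")
        self.grants.setdefault(agent_id, set()).add(tool_name)

    def can_call(self, agent_id: str, tool_name: str) -> bool:
        """Check a call at the gateway: tool must be registered and granted."""
        return tool_name in self.tools and tool_name in self.grants.get(agent_id, set())

    def catalog(self) -> list:
        """Discoverable list of approved tools."""
        return sorted(self.tools)


if __name__ == "__main__":
    registry = ToolRegistry()
    registry.register(ToolEntry("refund_api", review_approved=True, scan_passed=True))
    registry.grant("support-agent", "refund_api")
    print(registry.can_call("support-agent", "refund_api"))   # granted agent
    print(registry.can_call("research-agent", "refund_api"))  # no grant on file
```

A connector library answers "what can be called". A registry of this shape answers "what is this agent allowed to call, and who approved it".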

Bain's 2026 enterprise AI work treats this layer as a first-class requirement: governance gets promoted from compliance checkbox to strategic enabler, and eval gets shared services for trace capture, golden-set testing, and multistep behavior measurement. Their Code Red piece puts the sequencing in plain English: governance and trust precede orchestration and scale. The vendors who shipped runtimes shipped orchestration without the prerequisite. The buyer-side data says only about a third of enterprises have built the prerequisite themselves.

Why the Runtimes Don't Close the Gap

There are good reasons the hyperscalers stopped where they did. Eval is workload-specific in a way runtimes cannot be. A coding agent's golden set looks nothing like a customer-support agent's. A fraud-investigation agent's success criteria look nothing like a research-summarization agent's. Cost governors require knowing what a workflow is supposed to cost, which only the buyer knows. Trace replay with semantic attribution requires a model of what a correct trajectory looks like, which is again a domain question.

So the runtimes punt, sensibly given the abstraction level. The problem is that buyers read "GA production runtime" and translate it as "GA production system," and those are different things. The runtime gets you a place agents can run. The system requires:

  • An evaluation suite that tests the agent against a curated set of cases representing the production distribution, run before every deployment, with regression alerts on quality drops. Anthropic's demystifying evals piece is the cleanest practical writeup: 20-50 evals from real failures, grade what the agent produced, integrate into CI/CD.
  • A cost governor that knows what each workflow is supposed to spend, alerts on divergence, and hard-caps individual sessions and aggregate burn. Inference is up to 85% of agent operational spend and retries are the dominant driver. The difference between a $50K month and a $500K month is whether anyone wrote the cap.
  • A trace replay surface that lets a human operator pull any production session, see the full prompt-tool-response chain, search across sessions by semantic content of the failure, and reproduce locally. CloudTrail does not do this. OpenTelemetry hooks are the substrate this is built on.
  • A tool registry that gates which tools any given agent can access, requires pull-request review for new tool exposure, runs security scans before deployment, and exposes the catalog as a discoverable surface for new agents to use. Uber's MCP gateway is the reference implementation.
  • Agent-scoped IAM that distinguishes a user invoking an agent from an agent invoking a tool and applies the right permission scope. Most enterprise auth was built for humans clicking through UIs. The agent case needs tokens scoped roughly an order of magnitude more finely than the credentials humans get.

Every item on that list is buyer-side work. The three new runtimes make the first step, getting an agent running, much cheaper. They do nothing about the five items that follow it.
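The first item on the list, a golden-set suite with a regression gate, can be sketched in a few lines of Python. Everything here is illustrative: the function names, the exact-match grader, the toy agent, and the two-point tolerance are assumptions, and a real suite would use graders suited to the workload.

```python
def run_golden_set(agent, cases, grader):
    """Run the agent over every golden case and return the pass rate (0.0 to 1.0)."""
    if not cases:
        raise ValueError("golden set is empty")
    passed = sum(1 for case in cases if grader(agent(case["input"]), case["expected"]))
    return passed / len(cases)


def deployment_gate(pass_rate, baseline, max_drop=0.02):
    """Allow a deploy only if quality has not dropped more than max_drop below baseline."""
    return pass_rate >= baseline - max_drop


def exact_match(output, expected):
    """Simplest possible grader; real graders are workload-specific."""
    return output == expected


if __name__ == "__main__":
    # Toy agent that normalizes an intent label. The last case is built to fail.
    def toy_agent(text):
        return text.strip().lower()

    golden_cases = [
        {"input": " Refund ", "expected": "refund"},
        {"input": "CANCEL", "expected": "cancel"},
        {"input": "upgrade", "expected": "upgrade"},
        {"input": "re-fund", "expected": "refund"},
    ]
    rate = run_golden_set(toy_agent, golden_cases, exact_match)
    print("pass rate:", rate)
    print("deploy allowed:", deployment_gate(rate, baseline=0.80))
```

Wired into CI/CD, the gate's boolean becomes the build status: a model-version bump that drops the pass rate below the tolerance fails the pipeline instead of reaching production.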

What This Means for Buyers and Builders

If you are buying agent infrastructure in mid-2026, the calculus has shifted. The runtime is a commodity. The differentiation lives entirely in the harness, and the harness is either built in-house or assembled from a stack of vertical tools (LangSmith, LangFuse, Braintrust, Patronus, Arize Phoenix) sitting on top of whichever runtime you picked. Choose the runtime on isolation and connector requirements. Budget the harness as the actual project.

If you are building agent products, the implication is the procurement inversion logic extended one layer down. Enterprise customers will keep picking runtimes by checklist: compliance posture, identity integration, regional availability. They will spend the next twelve months realizing they bought a substrate and looking for someone to sell them the harness. The 31% of enterprises with an agentic measurement framework today and the 60%+ who plan to deploy agents within two years sit on opposite sides of the same gap. That gap is the product surface for the next wave.

The MIT 33%/67% split between internal builds and vendor-led deployments is the cleanest leading indicator. Internal builds fail twice as often because the buyer underestimates the harness investment. Vendor-led deployments succeed more because the vendor builds the eval and observability layer once and amortizes it across customers. That is the leverage motion this blog cares about. Pick a workload, build the harness shared services, sell them as a layer over whichever hyperscaler runtime the customer chose.

The CloudBees CEO's "only lever is workforce reduction" framing is the failure mode at scale. It is what happens when nobody owned the cost surface, the eval surface, or the governance surface, and the invoice arrived before the org chart did. The fix is not a better model. The fix is not a faster runtime. The fix is the boring middle layer the three hyperscalers shipped around: the registry, the eval suite, the cost governor, the trace surface, the IAM scope. Six months from now the agentic AI story will be told in those numbers.

The runtimes commoditized. The eval layer is still your problem. The buyer who treats those two sentences as one project is the one whose 2027 budget does not get burned by April.