The Pilot-to-Production Death March

Five weeks ago the agent-engineering picture was abstract. Eighty-eight percent of pilots never reaching production, observability outpacing evaluation, frameworks converging on graph orchestration. Sharp, but theoretical. Then the practitioners showed up.
Between April 1 and May 9, 2026, six independent Reddit threads across five disjoint communities converged on the same claim: the LLM is no longer the binding constraint on agent reliability. The constraint is the second stack around it. Tool registries, agent-scoped permissions, context architecture, observability with audit trails, and the machine-readable surface that any SaaS now needs to be callable by a non-human caller. None of these posts reference each other. The communities don't overlap. The diagnosis is identical.
Uber published the leading indicator. 1,500+ monthly active agents, 60,000+ executions per week, 90% of 5,000+ engineers using AI agents monthly, 95% adoption of the internal coding agent. They got there by building a central MCP gateway, a tool registry, pre-deployment security scans, and a no-code Agent Builder before any of it scaled. Most enterprises did the opposite: shipped agents into pilot, then watched the harness gap surface as the failure mode that killed production rollout.
ARR per engineer is the cleanest forward-looking filter for which side of the gap a company will land on. It can't be gamed by adding more agents, and the denominator gets worse with every engineer you add. The metric is structurally hostile to agent-washing, which Gartner estimates accounts for roughly 95% of what's currently being marketed as agentic.
The May 2026 Convergence
Within the span of five days, six posts across five subreddits described the same problem in different words.
r/sysadmin, May 4, a reality check from the Microsoft AI Tour in Zurich; the thread scored around 670. Microsoft "scrubbed 'LLM' and 'GenAI' from all the slides and replaced them with 'Agents' sprinkled on top of absolutely everything." The author's gloss: "These 'agents' are mostly just the same old LLMs wrapped in fancy scripts and system prompts. They inherit the exact same issues with context, hallucinations, and AI fatigue. The only difference is that now, instead of catching this AI BS in a Word document, we are going to have to debug it in broken business processes." Their internal corporate-chatbot adoption pattern: spike in month one, 70-80% drop-off thereafter.
r/AI_Agents, May 7, Antoneose, "Hot take: most AI agent teams are secretly just 'context engineering' teams": "The more I work on AI agents, the more I feel like the actual problem isn't the LLM. It's the infrastructure mess around it." The post enumerates the stack every serious team converges on: LLM plus vector DB plus cache plus retrieval pipeline plus connectors plus permissions plus memory layer plus observability plus audit logs plus orchestration glue. Then the questions teams actually spend months answering: what does the agent know right now, why did it retrieve this, is the memory fresh, can this be audited, why is latency suddenly terrible. Diagnosis: "They're building distributed context engineering systems."
r/SaaS, May 4, CrewPale9061, the most operationally specific post of the window: "Was CPO at a SaaS. Customers kept asking us to give their AI agents access. Scoping it honestly was depressing enough that I quit." The diagnosis: "Agents do not navigate UIs. They do not read your docs. They do not know which endpoint to call for what. They need a discoverable surface, machine-readable schemas, agent-scoped auth (your existing API token approach does not quite fit), and rate limits that make sense when one customer's agent fires a thousand calls in a minute. None of that fit our existing stack. It was basically a whole second stack sitting on top of the one we already had." The CPO quit and started a company building it.
r/artificial, May 6, jradoff, two days at the AI Agents Conference NYC: "Almost every talk and every booth at the AI Agents Conference was selling a fix for something that broke this year when agents hit production. Observability, governance, supervisor agents, data substrates, 'someone's gotta babysit the bots.'" Most relevant line of the entire window: "One speaker (a VC) said his number for evaluating AI-native startups is ARR per engineer, and that the number ought to be going up... You can vibe-code much of what those booths were selling in a few days or weeks if you have the domain knowledge."
r/AgentsOfAI, May 5, Such_Grace: "Every major platform is pushing 'agentic AI' right now but I can't find a single compelling real-world story that makes me think 'oh, that's the problem it solves.'" Salesforce Agentforce as the case study: "Their product names told you exactly what you were buying: Sales Cloud, Field Service, Health Cloud. The name was the problem statement. Then Agentforce drops and suddenly it's all 'headless AI agents' and 'trust layer' — technical capabilities focused on how the thing works, not what business problem you're solving." The one real deployment the author heard about firsthand: case deflection, ~20-30% ticket reduction. "That's not a paradigm shift, that's a chatbot."
r/AI_Agents, May 8, RepublicMotor905: "Our AI agent worked fine in the pilot, but now that it's chewing on real production data, things are falling apart fast... It makes one slightly off tool call, and by step four it's hallucinating a solution or stuck in a loop. Also caught it trying to reach for tools it shouldn't even have access to for the task it's running." The closer: "feel like I'm missing some basic engineering principle here and just throwing prompts at the problem."
Six posts. Five communities. One diagnosis. The model behaves the same in pilot and production. The infrastructure around it does not.
What Uber Actually Had to Build
The most concrete production-scale data point is Uber's. As of late April 2026, they run 1,500+ monthly active agents internally with 60,000+ executions per week. 90% of their 5,000+ engineers use AI agents monthly, and the internal coding agent Minions ships ~1,800 code changes weekly with 95% engineer adoption. Sources: ShiftMag and Pragmatic Engineer.
Three problems forced the platform investment.
Tool sprawl across 10,000+ internal services. Dozens of teams built MCP servers and custom integrations independently. No shared standards. No reuse. No central oversight. Meghana Somasundara, Agentic AI Lead: "If you can't manage the development lifecycle, you just can't trust it in production."
Blast radius. Agents could call endpoints they shouldn't, expose sensitive data, trigger operations nobody intended. Somasundara again: "With agents, it's a lot faster, a lot quicker, and the blast radius is a lot higher."
Discovery. Rush Tehrani, Senior Engineering Manager on the Agentic AI Platform: "How does an agent actually find the right one?"
The fix was a central control plane. Every Uber endpoint becomes an MCP tool. Service owners decide what gets exposed and how it's defined. Every change flows through pull requests. Security scans run before deployment. Continuous monitoring runs in production. A central registry kills duplication and flags third-party MCPs. They also shipped a no-code Agent Builder where engineers can lock parameters and pre-select tools so the agent has fewer runtime decisions to make.
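To make the shape concrete, here is a minimal sketch of a registry-gated tool definition with a pre-deployment check pass. The field names, checks, and the ToolRegistration and pre_deployment_checks helpers are illustrative assumptions for this article, not Uber's actual schema or tooling.

```python
from dataclasses import dataclass, field

# Illustrative only: field names and checks are assumptions, not Uber's actual schema.
@dataclass
class ToolRegistration:
    name: str                      # unique within the central registry
    service: str                   # owning internal service
    owner_team: str                # who approves changes via pull request
    schema: dict                   # machine-readable input/output schema
    scopes: list[str] = field(default_factory=list)  # permissions the tool may exercise
    third_party: bool = False      # external MCP servers get flagged for extra scrutiny

def pre_deployment_checks(tool: ToolRegistration,
                          registry: dict[str, ToolRegistration]) -> list[str]:
    """Gate a registration the way PR review plus a security scan might."""
    problems = []
    if tool.name in registry:
        problems.append(f"duplicate tool name: {tool.name}")   # kill duplication centrally
    if not tool.scopes:
        problems.append("no scopes declared; default-deny means this tool can do nothing")
    if tool.third_party:
        problems.append("third-party MCP server: route to manual security review")
    return problems
```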
Read this against the r/SaaS CPO post and the picture sharpens. Uber's MCP gateway, registry, and Agent Builder map almost exactly onto the four components the CPO listed: discoverable surface, machine-readable schemas, agent-scoped auth, and rate limits that survive a 1,000-call-per-minute spike. The CPO was describing what an enterprise of 5,000 engineers needs. The catch is that very few enterprises have a 5,000-engineer budget for it.
ARR Per Engineer: The Filter That Survives Agent-Washing
The framing from the AI Agents Conference NYC is the cleanest articulation of an emerging VC posture. From the r/artificial thread directly: "One speaker (a VC) said his number for evaluating AI-native startups is ARR per engineer, and that the number ought to be going up."
Two reasons it works.
First, it can't be gamed by adding more agents. Adding more agents to the product doesn't change ARR. Adding more engineers to the company makes the denominator worse. The metric is structurally hostile to agent-washing, the practice of bolting agentic branding onto existing automation without any deeper architectural change. Gartner estimates fewer than 5% of vendors marketing agentic capabilities deliver genuine autonomous functionality. ARR per engineer punishes the other 95% directly.
Second, broader VC commentary has converged on the same metric independently. Revenue per employee has emerged as the efficiency signal: AI-native startups run several times more revenue per head than traditional SaaS, and failing to show that efficiency now reads as a weakness against any AI-comparable peer set. (Crescendo VC summary, 2026)
The conference's takeaway list confirms the surrounding shift: SaaS pricing moving from flat ARR to usage-based, "ARP" (Agentic Resource Planning) replacing static ERP framing, ~25% of retail transactions projected as agentic by 2030, and a "Wild West" security analogy invoking 1990s file-sharing. (Mainfactor recap) The undertone across all twenty takeaways is the same: capability is no longer the differentiator, scaffolding is.
The Framework Consolidation Map
Every major vendor shipped or matured an agent SDK in or just before the April-May window. The cumulative effect is a stack-by-stack admission that the harness, not the model, is what teams need help with.
Microsoft retired AutoGen on October 1, 2025 and unified it with Semantic Kernel into Microsoft Agent Framework, hitting 1.0 on April 3, 2026. Agent Framework explicitly merges AutoGen's multi-agent orchestration with Semantic Kernel's session-based state management, type safety, middleware, and telemetry. It ships the audit-trail and governance layer that was missing from research-grade AutoGen. (VentureBeat, Microsoft Learn)
OpenAI released the Responses API and Agents SDK in March 2026. The Responses API consolidates Chat Completions plus the Assistants API tool-use surface (Assistants will be deprecated in mid-2026). It ships built-in tools: web search, file search, computer use. The Agents SDK exposes a small surface (Agents, handoffs as agents-as-tools, Guardrails) and now extends to TypeScript with sandbox-agent support and an open-source harness. (OpenAI announcement)
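The handoff pattern is small enough to show. Below is a minimal sketch using the Python openai-agents package, leaning only on its documented basics (Agent, handoffs, Runner); the agent names and instructions are made up for illustration.

```python
from agents import Agent, Runner  # pip install openai-agents

# A specialist the triage agent can hand off to; handoffs are modeled as agents-as-tools.
billing = Agent(
    name="Billing",
    instructions="Answer billing questions. Escalate refunds over $500 to a human.",
)

triage = Agent(
    name="Triage",
    instructions="Route the user to the right specialist; do not answer directly.",
    handoffs=[billing],
)

result = Runner.run_sync(triage, "Why was I charged twice this month?")
print(result.final_output)
```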
Google brought ADK (Agent Development Kit) to 1.0 GA across Python, Go, Java, and TypeScript at Cloud Next April 2026. The release adds an app/plugin architecture, advanced context engineering primitives, human-in-the-loop workflows, and named tools (GoogleMapsTool, UrlContextTool, ContainerCodeExecutor, VertexAICodeExecutor, ComputerUseTool). It also introduces Event Compaction, a sliding window of recent events with summarized older interactions, claiming up to a 38% reduction in token usage and an 18% latency improvement. (Google Developers blog, InfoQ)
Anthropic shipped two distinct products in Q1-Q2 2026: the Claude Agent SDK (a programmatic harness for self-hosted agents, built around the gather-context → take-action → verify-results loop) and Managed Agents (hosted infrastructure with sandboxing, built-in tools, server-sent event streaming, memory in public beta), launched April 8, 2026. (Claude Code docs)
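The loop itself is simple to sketch. The version below is a generic rendering of the gather-context → take-action → verify-results pattern, not the Claude Agent SDK's actual API; the gather, act, and verify callables are stand-ins for whatever retrieval, tool execution, and checking you already have.

```python
from typing import Any, Callable

def agent_loop(
    task: str,
    gather: Callable[[str], str],        # gather-context: return only high-signal context for the task
    act: Callable[[str, str], Any],      # take-action: (task, context) -> a tool result or candidate answer
    verify: Callable[[str, Any], bool],  # verify-results: did the action actually move the task forward?
    max_steps: int = 10,
) -> Any:
    """Generic gather -> act -> verify loop; illustrative, not an SDK API."""
    for _ in range(max_steps):
        context = gather(task)
        result = act(task, context)
        if verify(task, result):
            return result
        # Feed the failure back into the next gather pass instead of retrying blindly.
        task = f"{task}\n[previous attempt failed verification]"
    raise RuntimeError(f"no verified result within {max_steps} steps")
```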
AWS Bedrock AgentCore Runtime supports both real-time and long-running agent workloads up to 8 hours per session. Idle timeout 15 minutes. Sessions terminate on inactivity, max lifetime, or being deemed unhealthy. (AWS docs)
LangChain shipped Deep Agents v0.5 on April 7, 2026, introducing asynchronous subagents that delegate long-running tasks without blocking the primary execution loop. Their April 2026 State of Agent Engineering update kept the headline numbers consistent: 57% of organizations have agents in production, 30.4% are in active development, quality remains the #1 barrier (cited by one-third of respondents), latency is now #2 at 20%.
The pattern across all five framework moves is identical. Each vendor is shipping the layers practitioners flagged as missing. Compaction, structured note-taking, sub-agent architectures, registries, audit trails, durable execution windows. None of this is model improvement. All of it is harness investment.
Industry Analyst Data
The cross-source consistency carries the weight here. Five different methodologies, five different respondent pools, one finding.
McKinsey, 2026 State of AI in Trust: 62% of organizations are experimenting with or piloting AI agents. Scaling caps at "no more than 10% of respondents" in any given business function. Security and risk concerns are the #1 barrier, cited by nearly two-thirds of respondents, ahead of regulatory uncertainty and technical limitations. McKinsey's framing: "Organizations can no longer concern themselves only with AI systems saying the wrong thing; they must also contend with systems doing the wrong thing." Average responsible-AI maturity score climbed from 2.0 in 2025 to 2.3 in 2026, with only ~one-third reporting maturity ≥3 in strategy, governance, and agentic AI governance.
Gartner, 2026 Hype Cycle for Agentic AI: Agentic AI sits at the Peak of Inflated Expectations. Only 17% of organizations have deployed AI agents. More than 60% expect to do so within the next two years, the most aggressive adoption curve among emerging technologies measured. Gartner reaffirms: 40%+ of agentic AI projects will be canceled by end-2027 due to escalating costs, unclear business value, or inadequate risk controls.
Deloitte, State of AI in the Enterprise 2026 (n=3,235 leaders, August-September 2025): 85% of companies expect to customize agents to fit business needs. Only 1 in 5 has a mature governance model for agentic AI.
BCG, "The $200B AI Opportunity": Agentic AI drives 17% of total AI value today, projected to nearly double by 2028. 5% of companies qualify as "future-built" for AI. Three in four employees believe AI agents will matter for future success, but only 13% say their companies have broadly integrated them into workflows.
Bain, "From Roadmap to Reality" recommends a three-phase approach. Phase 1 (foundation): data governance frameworks, centralized policy enforcement, runtime guardrails, identity management for non-human principals, prompt-level protections, memory governance, observability. Phase 2 (orchestration): multistep workflow engines, MCP-based tool abstractions, agent-to-agent communication. Phase 3 (scaling): federated discovery and routing across business units. Bain's framing principle: "Governance and trust must precede orchestration and scale."
The share of organizations piloting lands at 60-96%. The share scaling lands at 8-17%. The dominant barrier across every report is governance, security, and scaffolding rather than LLM capability. Same story, told five times by people who didn't coordinate.
The Failure Modes, Named
Practitioner posts and analyst reports point at the same six failure modes. Each now has a quoted example.
1. Tool registry sprawl. Uber's version: dozens of teams building MCP servers across 10,000+ services with no shared standards, no reuse, agents calling endpoints they shouldn't. Fix: central registry plus pull-request gating plus pre-deployment security scans.
2. Permission scoping. From the r/SaaS CPO post: "agent-scoped auth (your existing API token approach does not quite fit)." In customer deployments more broadly, agents have been observed receiving roughly ten times more permissions than needed when mapped against real user privileges and entitlements. Most SaaS auth was built for humans clicking through UIs; agents need machine-readable schemas, scoped tokens, and rate limits sized for non-human callers.
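In practice, "agent-scoped auth" tends to mean a credential that carries an explicit tool allowlist and a rate limit sized for a non-human caller, rather than the human user's token inherited wholesale. The AgentToken shape below is a hypothetical sketch, not any vendor's API.

```python
import time
from dataclasses import dataclass, field

@dataclass
class AgentToken:
    """Hypothetical agent-scoped credential: allowlist + rate limit, not the user's full entitlements."""
    agent_id: str
    acting_for: str                       # the human or tenant the agent acts on behalf of
    allowed_tools: frozenset[str]         # explicit allowlist, declared at issuance
    calls_per_minute: int = 60
    _window: list[float] = field(default_factory=list)

    def authorize(self, tool: str) -> bool:
        now = time.time()
        self._window = [t for t in self._window if now - t < 60]
        if tool not in self.allowed_tools:
            return False                  # deny anything outside the declared scope
        if len(self._window) >= self.calls_per_minute:
            return False                  # absorb the thousand-calls-a-minute burst at the edge
        self._window.append(now)
        return True
```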
3. Pilot-to-production drift. From the r/AI_Agents post: "Our AI agent worked fine in the pilot, but now that it's chewing on real production data, things are falling apart fast... It makes one slightly off tool call, and by step four it's hallucinating a solution or stuck in a loop." Industry reads this as agent drift: the model is unchanged but the derived context the agent reads at decision time has gone stale, or production data shape doesn't match the pilot. (Tacnode write-up)
4. Context engineering as the actual job. From Antoneose: "They're building distributed context engineering systems." Anthropic's piece defines context engineering as "the smallest possible set of high-signal tokens that maximize the likelihood of some desired outcome" and recommends three techniques: compaction, structured note-taking (NOTES.md, persistent external memory), and sub-agent architectures with clean context windows. Google ADK Event Compaction (38% token reduction, 18% latency improvement) is the same idea shipped as a framework primitive.
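The compaction technique reduces to a few lines: keep recent events verbatim, replace everything older with one summary entry. Below is a stripped-down sketch in the spirit of ADK's Event Compaction; the summarize callable (a cheap model call or a heuristic) is an assumption.

```python
from typing import Callable

def compact_events(events: list[str], keep_recent: int,
                   summarize: Callable[[list[str]], str]) -> list[str]:
    """Keep the most recent events verbatim; collapse everything older into one summary entry."""
    if len(events) <= keep_recent:
        return events
    older, recent = events[:-keep_recent], events[-keep_recent:]
    return [f"[summary of {len(older)} earlier events] {summarize(older)}"] + recent

# Usage: hand compact_events(history, keep_recent=20, summarize=call_cheap_model) to the
# prompt builder instead of the raw history; savings depend entirely on the summarizer.
```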
5. Retry / timeout cascades. Every failed step that triggers a retry burns tokens without producing value. Inference costs are estimated at up to 85% of agent operational spend. AWS Bedrock AgentCore's 8-hour session ceiling is the new upper bound on long-running agentic work, useful but not high enough for genuinely autonomous workflows that need persistence across days.
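A hedged sketch of the countermeasure: charge every attempt, including retries, against a hard per-session budget so a retry cascade fails fast instead of burning spend. The pricing constant and the attempt interface are placeholders.

```python
class CostBudget:
    """Hard per-session spend ceiling; numbers and accounting granularity are illustrative."""
    def __init__(self, max_usd: float, usd_per_1k_tokens: float = 0.01):
        self.max_usd = max_usd
        self.spent = 0.0
        self.usd_per_1k = usd_per_1k_tokens

    def charge(self, tokens: int) -> None:
        self.spent += tokens / 1000 * self.usd_per_1k
        if self.spent > self.max_usd:
            raise RuntimeError(f"session budget exceeded: ${self.spent:.2f} > ${self.max_usd:.2f}")

def with_retries(attempt, budget: CostBudget, max_attempts: int = 3):
    """attempt() -> (tokens_used, result, ok). Retries burn tokens, so every attempt is charged."""
    for n in range(1, max_attempts + 1):
        tokens, result, ok = attempt()
        budget.charge(tokens)
        if ok:
            return result
    raise RuntimeError(f"gave up after {max_attempts} attempts; spent ${budget.spent:.2f}")
```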
6. Audit and memory consistency. Antoneose's enumeration: "What exactly does the agent know right now? Why did it retrieve this? Is the memory fresh? Can this be audited?" McKinsey's "doing the wrong thing" framing maps to the same problem from the governance side. Every framework that shipped in April-May 2026 added telemetry, audit trails, or memory state primitives.
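One way to make those four questions answerable after the fact is to log a structured record per retrieval; the field names below are illustrative, not a standard.

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class RetrievalAuditRecord:
    """One field per question in the audit checklist; names are illustrative."""
    agent_id: str
    query: str                # why did it retrieve this: the query that triggered retrieval
    document_id: str          # what the agent knows right now
    retrieved_at: float       # is the memory fresh: compare against the source's last update
    source_updated_at: float
    score: float              # retrieval score, so ranking decisions can be replayed later

    def to_log_line(self) -> str:
        return json.dumps(asdict(self))   # append-only JSONL is enough for "can this be audited"

record = RetrievalAuditRecord("billing-agent", "refund policy for EU customers",
                              "kb/refunds-eu.md", time.time(), time.time() - 86_400, 0.82)
print(record.to_log_line())
```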
The Microsoft AI Tour author nailed the consequence in one line: "The 'Editing Tax' for AI BS ends up taking more time and energy than just writing the damn thing from scratch." Their internal corporate-chatbot adoption pattern (month-one spike followed by 70-80% drop-off) is what these failure modes feel like to the user once they pile up.
Counterarguments
Long-context models do reduce context-engineering pressure. Gemini's multi-million-token context, GPT-5 long-context tiers, and Claude Opus 4.7's xhigh effort level all push out the point at which compaction becomes mandatory. Anthropic's own piece concedes reasoning starts degrading around 3,000 tokens. Model gains keep pushing that ceiling upward. The argument here is not that the LLM is irrelevant. It's that LLM gains alone don't close the production gap practitioners are reporting.
Vertical agents short-circuit the registry-and-permissions wall. The CPO's complaint is about exposing arbitrary SaaS surfaces to arbitrary agents. A narrow vertical agent (defined ICP, defined integration set, owned by the same vendor) bypasses most of the scoping problem because the vendor controls both the agent and the permission surface. This is part of why customer-service deployments still show the strongest production economics. The thesis applies most strongly to general-purpose, cross-product agentic workflows, exactly the use case the AI Agents Conference booths were trying to sell.
The LLM is still partly the constraint at the frontier. SWE-bench Verified keeps moving up, 40% to 80%+ in a year, and Claude Opus 4.7 vs GPT-5.5 vs Gemini 3.1 Pro differences on long-horizon coding tasks remain measurable. The frontier moves the floor of what the harness has to clean up but doesn't eliminate it. The practitioner thesis is that the binding constraint has moved, not that the model has stopped mattering.
Selection bias. Reddit posts about agent failure are louder than posts about agents quietly succeeding. The CPO who quit and the sysadmin who flew home from Zurich annoyed self-select into the convergent-failure narrative; unshipped successes don't post. Counter-evidence: Uber's 95% engineer adoption of Minions, 90% monthly agent usage across 5,000+ engineers, and the customer-service vertical's documented economics. The convergent diagnosis is best read as a description of what specifically breaks at the median deployment, not a claim that agents don't work.
What Builders Should Actually Invest In
If the binding constraint is harness rather than model, the implied builder priority list inverts.
Invest in, in this order:
- Agent-scoped IAM. The CPO's second stack: discoverable surface, machine-readable schemas, agent-scoped tokens, rate limits sized for 1,000 calls/minute bursts. If you build SaaS, this is now table stakes for being callable by your customers' agents. If you build agents, this is the layer your vendors don't have yet, and the gap your product can fill.
- Tool registry + governance layer. Uber's MCP gateway is the reference architecture. Pull-request gating, pre-deployment security scans, central registry, third-party MCP scrutiny. Microsoft Agent Framework, Google ADK 1.0, and the Bain three-phase framework all converge here.
- Eval-driven development. The 89% observability / 52% evaluation gap from the April State of Agent Engineering data remains the single biggest operational risk. Anthropic's recommendation (20-50 evals from real failures, grade what the agent produced, integrate into CI/CD) applies as much in May as it did in April; a minimal sketch follows this list. Eval coverage is also the upstream fix for the diligence wall enterprise buyers ran into.
- Context architecture. Compaction, structured note-taking, sub-agent decomposition. Pick a framework primitive (Google ADK Event Compaction, Anthropic Managed Agents memory, LangGraph state checkpoints) and build to it. "Build distributed context engineering systems" is now table stakes, not advanced practice.
- Cost-budget governors. Inference is up to 85% of operational spend. Retries are the dominant driver. Hard caps per session, per workflow, per customer. AgentCore's 8-hour cap is a forcing function, not a limitation.
- Narrow vertical scope before horizontal generality. Customer-service deployments have the cleanest ROI numbers. "General-purpose agent" still does not. The CPO's diagnosis applies hardest at the seam between many tools and many agents; the more controlled the vertical, the smaller the seam.
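The eval recommendation above, sketched: a handful of cases harvested from real production failures, graded on what the agent actually did rather than how fluent the answer sounded, cheap enough to run in CI. The case shape, tool names, and the run_agent entry point are assumptions.

```python
import json

# A minimal eval harness in the 20-to-50-case range; each case came from a production failure.
EVAL_CASES = [
    {"input": "Cancel order 4312 and refund it", "must_call": "refunds.create", "must_not_call": "orders.delete"},
    {"input": "What's the refund policy?", "must_call": "kb.search", "must_not_call": "refunds.create"},
]

def grade(case: dict, trace: list[str]) -> bool:
    """Grade the agent's tool-call trace, not the fluency of its answer."""
    return case["must_call"] in trace and case["must_not_call"] not in trace

def run_evals(run_agent) -> float:
    """run_agent(input) -> list of tool names called. Wire this into CI so regressions block merges."""
    passed = sum(grade(c, run_agent(c["input"])) for c in EVAL_CASES)
    score = passed / len(EVAL_CASES)
    print(json.dumps({"passed": passed, "total": len(EVAL_CASES), "score": score}))
    return score
```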
Don't invest in (yet):
- Agent supervisor / babysitter products. The conference data suggests the categories most densely packed with booths this year are exactly the products that get vibe-coded out of existence in the next 12-18 months as framework primitives subsume them.
- Yet another LLM benchmark wrapper. Models are converging, harness is diverging. Durable differentiation lives on the harness side.
- Agent marketplaces without a distribution thesis. The buildinpublic data point (67K open-source agents tracked, 99% creator failure rate) confirms shipping is no longer the hard part. Getting anyone to install your agent is.
This last point connects to a broader pattern in the reaction-to-AI cycle. The practitioner backlash thread on AI deskilling was about juniors who couldn't debug because they'd never had to. The agent equivalent is teams who can't operate production agents because they treated the model as the product and the harness as out-of-scope. Same shape, different layer.
What the May Convergence Tells Us
The April 2026 read on agent engineering was that it was real but pre-paradigmatic. Production deployments and convergent architecture, but missing the testing infrastructure and security models that define mature engineering. Five weeks later the convergent diagnosis from independent practitioner posts, framework releases, and analyst reports points to a more specific claim: the field has identified what's missing, the major vendors are shipping it, and the binding constraint on the next wave of production deployments is not capability but adoption of the harness layer.
The Uber data is the leading indicator. The 1,500 agents in production matter because Uber built a control plane, registry, security scanner, and no-code Agent Builder before scale. Most enterprises did the opposite.
ARR per engineer is the cleanest forward filter. If a startup's ARR/engineer is rising, they have built something agents can't replace and that scales without proportional headcount. If it's flat or declining, the agentic branding is decoration on top of a SaaS unit-economics problem. The metric punishes both flavors of agent-washing at the same time: SaaS with agents bolted on, and agents with no distribution.
The Microsoft AI Tour author's closing observation is the right register to end on. "Microsoft confidently claims from the stage that their agents are ready to replace humans. But on the ground, these 'agents' are mostly just the same old LLMs wrapped in fancy scripts and system prompts." What closes the gap is the unglamorous infrastructure the CPO walked into and decided to leave a SaaS to build full-time. The second stack.