Experiment: Pilot Purgatory
Nearly two-thirds of companies are running AI pilots. Most pilots are built to demo, not to ship, and a pilot without a production path is just an expensive way to postpone a decision.
Every company stuck in the Experiment phase has the same artifact: a demo that kills in meetings. It summarizes the contract, drafts the response, triages the ticket. Executives see it and say some version of "this is going to change everything." And then it changes nothing, because eleven months later it is still a demo, polished by an innovation team, admired quarterly, touching zero production workflows.
Welcome to pilot purgatory. Per the numbers in my series opener, the research finds 62% of organizations at least experimenting with AI agents, and the overwhelming majority of those experiments will never ship. MIT's GenAI Divide study put a number on it: roughly 95% of enterprise GenAI pilots deliver no measurable P&L impact. Not because the technology fails. Because the pilot was never designed to do anything else.
Pilots are built to demo, not to ship
The defining property of a purgatory pilot is what's missing. Look under the hood of the typical one and you find: no evaluation harness, no error budget, no data pipeline that survives contact with production permissions, no owner in the line organization, and, most damning, no definition of done. It runs on a curated sample of inputs, the happy path, demonstrated by the person who built it.
None of that is an accident. The pilot was commissioned to answer the question "is this possible?", and demos answer that question beautifully. But "is this possible?" stopped being the interesting question two years ago. With modern models, almost everything in the demo tier is possible. The questions that matter now are: is it reliable at the 95th percentile? What does it cost per case at real volume? Who owns it when it's wrong? A demo answers none of those, and a pilot designed as a demo can't be upgraded into one that does; it has to be rebuilt.
There's a structural reason this keeps happening: pilots live in innovation teams, and innovation teams are graded on demonstrations, not operations. The line organization, which owns the actual workflow, was never asked whether it wants this thing, never gave up budget for it, and quietly regards it as a threat or a toy. So the pilot has no landing zone. It orbits the org chart indefinitely, fully funded and going nowhere. The MIT research backs this up from the data: the pilots that cross the divide are overwhelmingly the ones driven by line managers who own the workflow, not by a central AI lab.
The purgatory economics
Purgatory has seductive economics. Each individual pilot is cheap: a couple of people, a few months, some API credits. Killing one is awkward, so nobody does, and continuing all of them costs less per quarter than the political fight of forcing one into production. The portfolio grows. The aggregate spend quietly becomes enormous, and the return is a slide titled "AI Initiatives: 23 Active Pilots" that the board mistakes for momentum.
Here's the reframe I push on executives: a pilot is not a project, it's a bet with a kill criterion. The language comes from the operating model I described in "When building gets cheap": shape the problem, set an appetite, and decide in advance what evidence would make you ship it and what evidence would make you kill it. A pilot allowed to run past its appetite without a verdict isn't research, it's a subscription to the feeling of progress.
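To make "a bet with a kill criterion" concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a prescribed tool: the class, the field names, and the single "measured" metric (for example, an eval pass rate) are all hypothetical. The point is only that the ship evidence, the kill evidence, and the end of the appetite are written down before the pilot starts.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class Verdict(Enum):
    SHIP = "ship"
    KILL = "kill"
    KEEP_RUNNING = "keep running"
    FORCE_DECISION = "appetite spent: decide now"


@dataclass(frozen=True)
class PilotBet:
    """A pilot framed as a bet. All names here are hypothetical."""
    name: str
    verdict_date: date        # the day the appetite runs out
    ship_if_at_least: float   # evidence that makes you ship
    kill_if_below: float      # evidence that makes you kill

    def __post_init__(self) -> None:
        if self.kill_if_below > self.ship_if_at_least:
            raise ValueError("kill threshold must not exceed ship threshold")

    def verdict(self, today: date, measured: Optional[float]) -> Verdict:
        # Clear evidence settles the bet at any time, even before the date.
        if measured is not None:
            if measured >= self.ship_if_at_least:
                return Verdict.SHIP
            if measured < self.kill_if_below:
                return Verdict.KILL
        # Inconclusive or missing evidence: orbiting is allowed only
        # until the verdict date, after which a human must decide.
        if today >= self.verdict_date:
            return Verdict.FORCE_DECISION
        return Verdict.KEEP_RUNNING


bet = PilotBet("contract-summaries", date(2025, 6, 30), 0.90, 0.70)
print(bet.verdict(date(2025, 5, 1), 0.80).value)  # keep running
print(bet.verdict(date(2025, 7, 1), 0.80).value)  # appetite spent: decide now
```

The design choice worth noticing is that an inconclusive result past the verdict date does not quietly return "keep running". It forces a decision, which is the whole difference between a bet and a subscription.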
Anatomy of a pilot that graduates
The pilots that escape purgatory look different from day one. After taking enough of them through, the pattern is mechanical:
- It starts in the line organization. The team that owns the workflow runs the pilot inside their real daily work, with the innovation function supporting, not owning.
- It's evaluated on production data from week one, including the ugly inputs, the edge cases, and the adversarial users — not a curated golden set.
- The eval harness is built before the prompt. You cannot improve, or even honestly describe, what you don't measure, and an eval suite is also the regression net you'll need for every model upgrade afterward.
- It has a number: the baseline cost or cycle time of the current process, and the threshold the pilot must beat. "People liked it" is not a number.
- It has an appetite and a verdict date. On that date it ships, it's killed, or in rare justified cases it gets one explicit extension. Orbiting is not an outcome.
- The production path is designed up front: where it runs, who's paged, what the human-escalation route is, what the rollback is.
Notice that a pilot built this way is barely a pilot at all. It's the first iteration of a production system, scoped small. That's the real exit from the Experiment phase: stop building disposable demonstrations and start building small, real things. The distance from "pilot" to "production" should be a deploy, not a rebuild.
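The checklist above can be sketched as a very small eval harness. This is a toy under loud assumptions, not a recommended implementation: the names (Case, run_eval, toy_triage), exact-match scoring, and the cost model (every failed case falls back to the existing human process at the baseline cost) are all mine, chosen for illustration. A real harness would use real production inputs and whatever scoring the workflow owner accepts.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Case:
    """One real production input and the outcome the workflow owner accepts."""
    input_text: str
    expected: str


@dataclass(frozen=True)
class EvalReport:
    pass_rate: float
    cost_per_case: float
    beats_baseline: bool


def run_eval(
    cases: Sequence[Case],
    system: Callable[[str], str],
    cost_per_call: float,
    baseline_cost_per_case: float,
    min_pass_rate: float,
) -> EvalReport:
    if not cases:
        raise ValueError("need at least one case")
    passed = sum(1 for c in cases if system(c.input_text).strip() == c.expected)
    pass_rate = passed / len(cases)
    # Assumed cost model: every case pays for the call, and each failure
    # is redone by the existing process at the baseline cost.
    cost_per_case = cost_per_call + (1 - pass_rate) * baseline_cost_per_case
    beats = pass_rate >= min_pass_rate and cost_per_case < baseline_cost_per_case
    return EvalReport(pass_rate, cost_per_case, beats)


def toy_triage(text: str) -> str:
    """Stand-in for the system under test (hypothetical keyword rule)."""
    return "urgent" if "outage" in text.lower() else "routine"


cases = [
    Case("Production outage in EU region", "urgent"),
    Case("Please update my billing address", "routine"),
    Case("OUTAGE?? nothing loads", "urgent"),
    Case("site is down for everyone", "urgent"),  # the ugly input it misses
]
report = run_eval(cases, toy_triage, cost_per_call=0.05,
                  baseline_cost_per_case=4.00, min_pass_rate=0.70)
print(report)
```

Because the harness takes the system as a plain function, the same cases can be rerun unchanged against every prompt revision or model upgrade, which is what makes it a regression net rather than a one-off demo script.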
Kill more, ship more
The counterintuitive metric of a healthy pilot portfolio is the kill rate. Purgatory companies kill almost nothing; everything stays "active." Healthy companies kill the majority of their pilots, quickly and without ceremony, because a fast, cheap, well-documented kill is a successful outcome: you bought certainty for the price of a few weeks. The teams that win fail small, fast, and often, inside boundaries that make every failure survivable. If your AI program has never killed a pilot, it isn't a program, it's a museum.
What this looks like when I do it with you
The Experiment phase is where my engagement shifts from audit to build and ship. The white-glove part is that I don't hand you a pilot framework and wish you luck. I take the two or three workflows the audit ranked highest and personally drive each through the anatomy above: eval harness first, production data from week one, and a kill-or-ship date that I hold everyone to, including the executive sponsor who would rather extend than decide.
I also do the quiet political work that determines whether any of this lands: moving the pilot out of the innovation silo and into the line team's hands, negotiating what the workflow owner gets out of it, and making sure the first shipped pilot is visible enough that the next one is pulled by the organization rather than pushed. The technology has almost never been the blocker in these engagements. The blocker is that production is a commitment, and organizations are built to defer commitments. My job is to make deferral more expensive than a decision.
Next in the series: "Operationalize: you built an agent, now make it an employee," on what happens after the pilot ships, when you discover that an agent in production is less like software and more like a new hire. And if you recognize your own pilot portfolio in this essay, let's talk.
