Operating Philosophy · Jun 10, 2026 · 10 min read

Product Maturity Model: Is Your Product Ready to Scale?

Feature flags, a real testing environment, an experimentation subsystem. The unglamorous infrastructure that decides whether your product can absorb more building, or just ship chaos faster.

Oshri Cohen

Chief Product & Technology Officer

Maturity · Scale is earned, not declared

Every scaling conversation I get pulled into starts the same way: "We want to ship faster. Should we hire more engineers, or roll out AI tooling, or both?" And almost every time, the honest answer is one nobody booked the meeting to hear: your product can't absorb the output you already have.

Deploys happen twice a month and everyone holds their breath. There's a staging environment, technically, but it drifted from production two years ago and the test data in it is a folk legend. Releases are all-or-nothing: the feature goes out to every customer at once, and if it breaks, the rollback is a war room. Nobody can tell you whether last quarter's big launch actually moved a number, because nothing measures that.

Pour more building into that system and you don't get more innovation. You get more incidents, and a team that has learned to be afraid of its own deploys, because every bad release teaches the organization to bolt on another approval step and slow down. The constraint was never how much you could build. It's how much change your product can safely take in.

So I've started making the argument explicitly, as a maturity model: a product has to reach a certain operational maturity before scaling development is anything other than scaling risk. Maturity isn't about how old the product is or how clean the code is. It's about whether three specific capabilities exist: you can change the product safely, you can verify changes before customers see them, and you can learn from what you ship. Feature flags, a real testing environment, an experimentation subsystem. Miss any of them and there's a ceiling on your velocity that no amount of hiring will raise.

The ladder

The model has four levels, and it's deliberately blunt. At each one the question is the same: what happens when you double the volume of change?

Level 0: Fragile. Deploys are events. Testing happens in production, mostly by accident. Releases are coupled to deploys, so every push is a customer-facing gamble. Doubling change volume here doubles your incident count.
Level 1: Repeatable. CI runs a real test suite, deploys are boring and frequent, and there's an environment that genuinely resembles production where changes can be verified first. Doubling change volume is survivable but still scary.
Level 2: Safe. Deploy and release are decoupled by feature flags. New code ships dark, turns on for 2% of users, and turns off with a switch instead of a rollback. Doubling change volume is fine, because each change has a controlled blast radius.
Level 3: Learning. An experimentation subsystem (GrowthBook, or something like it) sits on top of the flags. Every meaningful release is a question with a measured answer. Doubling change volume doubles how fast you learn.

The rule that falls out of this: your safe rate of change is set by your maturity level, not your headcount. Hiring and AI tooling raise the rate at which you produce change. Only maturity raises the rate at which you can absorb it. When production outruns absorption, the surplus turns into risk, and the organization responds the only way fear knows how: with process. It freezes releases, stands up a change advisory board, adds another signature to the sign-off chain. You hired to go faster and ended up slower.
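The rule can be made concrete with a toy model. This is a sketch, not a measurement: the per-level capacity numbers are invented for illustration, and `simulate_week` and `WeekOutcome` are hypothetical names. Level 3 is given the same capacity as Level 2 on the assumption that experimentation adds learning rather than extra absorption.

```python
from dataclasses import dataclass

# Hypothetical weekly absorption capacity per maturity level.
# These numbers are made up for illustration; they are not benchmarks.
ABSORPTION_CAPACITY = {0: 2, 1: 10, 2: 40, 3: 40}


@dataclass(frozen=True)
class WeekOutcome:
    absorbed: int  # changes the product took in safely
    surplus: int   # changes produced beyond capacity, which show up as risk


def simulate_week(changes_produced: int, maturity_level: int) -> WeekOutcome:
    """Split produced change into what is absorbed and what becomes risk."""
    capacity = ABSORPTION_CAPACITY[maturity_level]
    absorbed = min(changes_produced, capacity)
    return WeekOutcome(absorbed=absorbed, surplus=changes_produced - absorbed)
```

Under these assumed numbers, doubling output from 10 to 20 changes a week at Level 0 leaves absorption at 2 and raises the surplus from 8 to 18. At Level 2 the same doubling is absorbed with no surplus.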

Level 1: an environment you can trust

The unglamorous foundation is a testing environment that tells the truth. Most companies have something they call staging. Far fewer have one where a green result actually predicts production behavior: same infrastructure shape, same configuration path, data that's realistic in volume and weirdness, and integrations that behave like the real ones instead of returning a hard-coded 200.

A staging environment that lies is worse than none, because it launders risk. Teams do the responsible thing, verify in staging, ship, and get burned anyway, and after a few rounds of that they stop believing in verification entirely. Then the real testing environment becomes production; you just don't call it that.

What I push for: staging built from the same infrastructure-as-code as production so it can't silently drift, a seeded dataset that's regenerated on demand and includes the ugly edge cases from real usage, and ephemeral preview environments per change so engineers aren't queuing for the one shared environment and quietly overwriting each other's state. None of this is exotic. All of it is the difference between "the tests passed" meaning something and meaning nothing.
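As a rough sketch of the "regenerated on demand" part, here is what a seeded dataset builder could look like. Everything in it is an assumption for illustration: the `Customer` shape, the `build_seed_dataset` name, and the edge-case list, which in practice would be harvested and anonymized from real usage.

```python
import random
from dataclasses import dataclass

# Hypothetical edge cases of the kind real usage produces.
EDGE_CASE_NAMES = ["", "O'Brien", "名前", "a" * 255, " leading-space"]


@dataclass(frozen=True)
class Customer:
    customer_id: int
    name: str
    seats: int


def build_seed_dataset(size: int, seed: int = 42) -> list[Customer]:
    """Regenerate the same staging dataset on demand from a fixed seed."""
    rng = random.Random(seed)  # local generator, so no hidden global state
    customers = []
    for customer_id in range(size):
        # The first rows carry the edge cases (all of them, provided size
        # is at least len(EDGE_CASE_NAMES)); the rest are ordinary rows.
        if customer_id < len(EDGE_CASE_NAMES):
            name = EDGE_CASE_NAMES[customer_id]
        else:
            name = f"customer-{rng.randrange(10**6):06d}"
        seats = rng.choice([0, 1, 5, 5000])  # include awkward sizes on purpose
        customers.append(Customer(customer_id, name, seats))
    return customers
```

Because the generator is seeded, two engineers who rebuild the dataset get identical rows, which is what makes a failing test in one preview environment reproducible in another.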

Level 2: deploy is not release

Feature flags are the single highest-leverage piece of product infrastructure I know, and they're routinely dismissed as a nice-to-have. What a flag actually does is separate two decisions that have no business being coupled: the engineering decision to ship code, and the product decision to expose behavior. Once those are separate, everything about scaling gets easier.

Engineers merge small and merge often, because unfinished work can ship dark behind a flag instead of rotting on a long-lived branch. Releases stop being cliff edges: a feature goes to internal users, then 2%, then 25%, then everyone, and the moment a metric twitches you turn it off. No rollback to coordinate, no war room, nobody shipping a fix at 2 a.m. The blast radius of any mistake stops being "the customer base" and becomes "the cohort we chose." That containment is what makes a high rate of change survivable, and it's why flags are the gate between Level 1 and everything above it.
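The staged rollout described above can be sketched in a few lines. This is a minimal illustration under my own assumptions, not how any particular flag product implements it: `Flag`, `bucket`, and `is_enabled` are hypothetical names, and hashing the flag key with the user ID is one common way to get stable cohorts.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Flag:
    key: str
    enabled: bool = True          # kill switch: False turns it off for everyone
    rollout_percent: float = 0.0  # 0 to 100
    internal_users: set[str] = field(default_factory=set)


def bucket(flag_key: str, user_id: str) -> float:
    """Map (flag, user) to a stable value in [0, 100)."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode()).hexdigest()
    return int(digest[:8], 16) % 10_000 / 100


def is_enabled(flag: Flag, user_id: str) -> bool:
    if not flag.enabled:
        return False  # the switch wins over every other rule
    if user_id in flag.internal_users:
        return True   # internal users see the feature first
    # A user's bucket never changes, so raising the percentage only adds users.
    return bucket(flag.key, user_id) < flag.rollout_percent
```

The design choice worth noticing is that the bucket depends only on the flag and the user. Anyone in the 2% cohort stays in it at 25%, so nobody sees the feature flicker on and off as the rollout widens, and setting `enabled` to `False` takes effect without a deploy.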

Two honest caveats. Flags need observability next to them: the switch is only useful if a dashboard tells you when to flip it, so error rates and key product metrics have to be visible per-flag, not just globally. And flags rot: an expired flag is dead code with a control panel. Mature teams treat flag cleanup as part of the definition of done, not as someday-debt.
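Both caveats can be sketched together. The names here (`FlagStats`, `should_turn_off`, `expired_flags`) and the thresholds are assumptions for illustration; a real setup would read these counts from your metrics system rather than keep them in memory.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class FlagStats:
    flag_key: str
    expires_on: date       # cleanup deadline set when the flag was created
    on_requests: int = 0   # requests served with the flag on
    on_errors: int = 0
    off_requests: int = 0  # requests served with the flag off
    off_errors: int = 0

    def record(self, flag_on: bool, is_error: bool) -> None:
        """Count one request against the cohort that served it."""
        if flag_on:
            self.on_requests += 1
            self.on_errors += int(is_error)
        else:
            self.off_requests += 1
            self.off_errors += int(is_error)


def should_turn_off(stats: FlagStats, max_increase: float = 0.01,
                    min_requests: int = 100) -> bool:
    """True when the flag-on error rate exceeds flag-off by more than max_increase."""
    if min(stats.on_requests, stats.off_requests) < min_requests:
        return False  # not enough traffic on one side to compare
    on_rate = stats.on_errors / stats.on_requests
    off_rate = stats.off_errors / stats.off_requests
    return on_rate - off_rate > max_increase


def expired_flags(all_stats: list[FlagStats], today: date) -> list[str]:
    """Flags past their cleanup deadline: candidates for removal from the code."""
    return [s.flag_key for s in all_stats if s.expires_on < today]
```

Splitting the counts by cohort is the point: a global error rate can look flat while the 2% of users behind the flag are having a bad day.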

Level 3: shipping becomes asking

Here's the part that turns operational plumbing into an innovation engine. Once every feature already ships behind a flag to a chosen percentage of users, you are one step away from experimentation: assign the cohorts randomly, attach metrics, and let the math decide. That's all an A/B test is: a feature flag with a hypothesis and a scoreboard.
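To show how small that step is, here is a sketch of "assign the cohorts, attach metrics, let the math decide." It uses a textbook two-proportion z-test, which I'm choosing for brevity; it is not a description of how GrowthBook or any other platform computes results. `assign_variant` and `two_proportion_z_test` are hypothetical names, and the hash-based split stands in for random assignment.

```python
import hashlib
import math


def assign_variant(experiment_key: str, user_id: str) -> str:
    """Stable 50/50 split: the same user always lands in the same cohort."""
    digest = hashlib.sha256(f"{experiment_key}:{user_id}".encode()).digest()
    return "treatment" if digest[0] % 2 else "control"


def two_proportion_z_test(conv_a: int, n_a: int,
                          conv_b: int, n_b: int) -> tuple[float, float]:
    """Return (z, two-sided p-value) for conversion rate B versus A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 0.0, 1.0  # no variation at all, so nothing to detect
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value
```

With made-up numbers, `two_proportion_z_test(100, 1000, 150, 1000)` compares a 10% control conversion rate against 15% in treatment and returns a z-score above 3, which is the "scoreboard" half of the sentence above.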

This is why I tell clients to stop treating experimentation as a distant someday and deploy a real subsystem for it. GrowthBook is my usual recommendation, since it's open source, self-hostable, and runs flags and experiments in one place. Once the flag infrastructure exists, the marginal cost of running real experiments collapses. And the cultural effect is bigger than the technical one: roadmap arguments that used to be settled by seniority get settled by data. "I think users want X" becomes "we ran X for two weeks against a holdout and it moved activation 4%." The loudest voice in the room loses its monopoly.

A Level 3 product changes what "shipping" means. At Level 0, shipping is a risk you take. At Level 2, it's a routine you trust. At Level 3, it's a question you ask, and every release comes back with an answer. That's the actual innovation flywheel: not more output, but more answered questions per quarter.

AI just made this urgent

This model used to be a five-year conversation. AI compressed it into a now conversation, because building got cheap, and the volume of change a small team can produce went up by an order of magnitude. Every one of those changes still has to cross the same bridge: verified somewhere honest, released with a contained blast radius, measured against a real metric.

An immature product hit with AI-accelerated output doesn't innovate faster. It floods. The deploy queue backs up, staging becomes a rumor, releases get batched into bigger and scarier bundles, exactly the wrong direction. I've watched teams adopt agentic coding tools on a Level 0 product and conclude, three incidents later, that "AI writes bad code." The code was fine. The product had no way to absorb it. The same applies to product engineers who own the loop from problem to outcome: that loop only closes if the product gives them flags to release with and experiments to learn from. You can't own an outcome you can't measure.

Maturity work is product work

The reason most companies are stuck at Level 1 isn't ignorance; it's framing. Flags and staging and experimentation all get filed under "tech debt" or "platform," which means they lose every prioritization fight against a feature with a customer's name on it. That filing is wrong. This is product work: it determines how fast every future feature ships and whether you ever find out if it worked, and it deserves a place at the table on those terms.

The pitch I make to boards is one sentence: this investment raises the safe rate of change for everything we build afterward. Concretely, the sequencing I push is: honest staging and boring deploys first, flags with per-flag observability second, experimentation third, and only then pour in the headcount and the AI tooling. Each step typically pays for itself within a quarter or two, and the DORA metrics will show it: deployment frequency up, change-failure rate down, recovery measured in minutes because recovery is a flag flip.
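If you want to watch those three numbers move, they can be computed from a plain deployment log. This is a sketch under stated assumptions: the `Deployment` record is hypothetical, and time to restore is measured from the deploy itself rather than from when the failure was first detected, which is a simplification.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass(frozen=True)
class Deployment:
    deployed_at: datetime
    caused_failure: bool = False
    restored_at: Optional[datetime] = None  # set once a failed deploy is recovered


def deployments_per_week(deploys: list[Deployment], period_days: int) -> float:
    """Deployment frequency, normalized to a seven-day week."""
    return len(deploys) * 7 / period_days


def change_failure_rate(deploys: list[Deployment]) -> float:
    """Share of deployments that caused a failure in production."""
    if not deploys:
        return 0.0
    return sum(d.caused_failure for d in deploys) / len(deploys)


def median_time_to_restore(deploys: list[Deployment]) -> Optional[timedelta]:
    """Median recovery time over failed deployments that were restored."""
    seconds = [
        (d.restored_at - d.deployed_at).total_seconds()
        for d in deploys
        if d.caused_failure and d.restored_at is not None
    ]
    return timedelta(seconds=median(seconds)) if seconds else None
```

When recovery is a flag flip, `restored_at` lands minutes after `deployed_at`, and the median says so without anyone having to argue the point.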

So before you approve the hiring plan or the AI rollout, ask the three questions: Can we change the product safely? Can we verify before customers see it? Do we learn from what we ship? If any answer is no, that's the work, and it comes first. If you want help finding where your product sits on this ladder, and what the shortest path up looks like, let's talk.