Metrics & DORA · Jun 4, 2026 · 11 min read · Updated Jul 6, 2026

Tokenmaxxing: Why Token Usage Is a Bad Productivity Metric

Counting tokens tells you nothing about whether work got done. But rationing them to save money is the more expensive mistake. The real skill is context hygiene, and the real win is letting people experiment.

Oshri Cohen

Chief Product & Technology Officer

AI-Native

Tokens aren't output

A few months into every AI rollout, someone in finance discovers the usage dashboard. Then I get the call. "Engineer A burned ten times the tokens of Engineer B last month. Should we be worried?" The honest answer is: worried about what, exactly? Because that number, on its own, tells you almost nothing about either of them.

I've started calling the instinct behind that question tokenmaxxing. Tokenmaxxing is the belief that tokens consumed measure work: that a high count means productivity, or that a capped count means savings. Both readings are wrong, because a usage number says nothing about the value of what it produced. It shows up in two opposite-looking forms, and each one follows from the same bad premise.

One form treats high token usage as a sign of productivity: the engineer who runs the model all day must be getting more done. The other treats it as a cost to be stamped out: cap the budgets, downgrade everyone to the cheapest model, ration access until the line goes down. They look like opposites. They're the same mistake wearing different clothes: both have confused a usage number with value.

We have run this exact play before

If "a usage number that has nothing to do with value" sounds familiar, it should. We spent decades counting lines of code and learned, painfully, that the most productive engineer on the team is often the one who deleted ten thousand lines and shipped a smaller, simpler system. Tokens are lines of code with a fresh coat of paint. A high token count can mean deep, sustained work on a genuinely hard problem. It can also mean someone flailing in circles, re-prompting the same broken request twenty times because they never learned to set it up properly.

The number doesn't distinguish between those. Neither does it distinguish the engineer who shipped a quarter's worth of value in an afternoon from the one who generated a quarter's worth of plausible nonsense nobody will ever merge. Goodhart's law isn't a theory in software; it's a Tuesday. Make tokens the metric and you'll get more tokens. You will not get more value, and you may well get less, because the people gaming the number are optimizing for the number instead of the work.

Token usage is a bad productivity metric for the same reason lines of code was: it counts motion, not progress.

The rationing mistake is the expensive one

Of the two failure modes, premature rationing is the one that actually costs you money, which is the irony, because saving money is its entire justification. The reasoning sounds responsible: tokens cost money, the better models cost more, so we'll cap usage and default everyone to the cheapest tier until they prove they need more.

Run the arithmetic on that decision and it falls apart. A senior engineer's fully loaded hour costs you somewhere in the range of one to two hundred dollars. The difference between the cheap model and the capable one, across a heavy day of real work, is measured in single-digit dollars. When you force a skilled, expensive person onto a weaker model to save the price of a sandwich, and that model sends them down two extra dead ends before lunch, you didn't save anything. You spent an expensive hour to avoid a trivial cost. Penny-wise, hundred-dollar-foolish.
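Here is that arithmetic as a minimal sketch. The two constants are my assumptions for illustration: the hourly figure is the midpoint of the range above, and the daily premium is just one value inside "single-digit dollars". Swap in your own numbers.

```python
# Assumed figures for illustration only; replace them with your own.
ENGINEER_HOURLY_COST = 150.0  # midpoint of the $100-200 fully loaded range
DAILY_MODEL_PREMIUM = 8.0     # assumed extra daily cost of the capable model


def break_even_minutes(daily_premium: float, hourly_cost: float) -> float:
    """Minutes of engineer time per day that cost as much as the model premium."""
    return daily_premium / hourly_cost * 60


def net_daily_saving(daily_premium: float, hourly_cost: float, minutes_lost: float) -> float:
    """Token saving minus the cost of engineer time lost to the weaker model."""
    return daily_premium - hourly_cost * minutes_lost / 60


minutes = break_even_minutes(DAILY_MODEL_PREMIUM, ENGINEER_HOURLY_COST)
print(f"Break-even: {minutes:.1f} minutes of lost time per day")

# One lost hour (two dead ends before lunch, say) against the token saving.
saving = net_daily_saving(DAILY_MODEL_PREMIUM, ENGINEER_HOURLY_COST, minutes_lost=60)
print(f"Net daily saving after one lost hour: ${saving:.2f}")
```

With these assumed figures the cheaper model stops paying for itself after about three minutes of lost engineer time per day, and a single lost hour puts the "saving" at minus $142.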

And the damage isn't only on the spreadsheet. There is a real morale cost to handing a craftsperson a deliberately worse tool. People notice when the org's message is "we'd rather you struggle than spend three dollars." It reads as distrust, and it lands hardest on exactly the people you most want to keep: the ones who can tell the difference between a sharp tool and a dull one and resent being made to work with the dull one. The friction you imposed to save money on tokens, you pay back with interest in frustration and attrition.

Forcing an expensive person onto a cheaper model to save a few dollars in tokens is the most expensive saving on the books.

Account honestly, then stop counting the wrong thing

None of this means spend is irrelevant. It means you have to account for it like an adult instead of fixating on the rawest, least informative version of the number. Token cost is real and worth understanding, the same way cloud spend is. We don't praise the team with the biggest AWS bill or punish the one with the smallest. We ask what the spend bought.

So attribute cost to outcomes, not to headcount. The useful question is never "how many tokens did this person use?" It's "what did this spend produce, and was it worth it?" A feature that shipped a week early, a migration that didn't need a second engineer, a class of support tickets that stopped arriving: those are the units that matter. Cost per outcome is a real metric. Tokens per head is a vanity metric dressed as governance.

Measure the system, not the seat. Total spend against value delivered tells you whether the investment is paying off. Per-person token leaderboards just teach people to game the leaderboard.
Treat a spike as a question, not a verdict. A 10× month might be your best month of the quarter or someone stuck in a loop. The number is a prompt to go look, never a conclusion on its own.
Budget at the team level, generously. Give teams a real envelope and let them spend inside it on judgment, the way you'd trust them with a cloud budget, not a per-keystroke allowance.
Compare against the alternative cost. The benchmark for any token bill is what the same outcome would have cost in engineer-hours, contractors, or simply not happening.
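A rough sketch of what "compare against the alternative cost" can look like in practice. Everything here is hypothetical: the `Outcome` record, its field names, and the example figures are mine, and the hours-avoided number is an estimate someone on the team has to supply, not something a dashboard will give you.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    """A delivered result with the model spend attributed to it (hypothetical schema)."""
    name: str
    token_spend: float             # dollars of model spend attributed to this outcome
    engineer_hours_avoided: float  # estimated hours the same result would otherwise take


def alternative_cost(outcome: Outcome, hourly_cost: float) -> float:
    """What the same outcome would have cost in engineer time alone."""
    return outcome.engineer_hours_avoided * hourly_cost


def spend_ratio(outcome: Outcome, hourly_cost: float) -> float:
    """Token spend as a fraction of the alternative cost; lower is better."""
    alt = alternative_cost(outcome, hourly_cost)
    if alt <= 0:
        raise ValueError("alternative cost must be positive to compare against")
    return outcome.token_spend / alt


# Illustrative numbers only: a migration that did not need a second engineer.
migration = Outcome("migration", token_spend=400.0, engineer_hours_avoided=80.0)
ratio = spend_ratio(migration, hourly_cost=150.0)
print(f"{migration.name}: token spend is {ratio:.1%} of the alternative cost")
```

The point of the ratio is the denominator. A $400 token bill looks alarming next to other token bills and trivial next to the engineer-hours it replaced.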

Context hygiene is the skill, not volume

What the token-counters miss is that the relationship between tokens and value is something you can train. The reason one engineer gets more done with fewer tokens isn't that they're stingy. It's that they've learned to manage context, and most people simply never have.

A trained developer keeps the model's working context clean. They clear it between unrelated tasks instead of dragging a mile-long, polluted conversation behind them. They compact deliberately when a thread gets long, so the model keeps the thread of the work without re-reading everything. They scope a request to the files and facts that matter rather than dumping the whole repository in and hoping. They write good project instructions, a solid CLAUDE.md or its equivalent, so the model starts every task already knowing the rules instead of rediscovering them at cost. They lean on narrow sub-agents with one job each instead of one overloaded conversation that has forgotten its own beginning.

Untrained, the same tools quietly waste enormous amounts of context. A bloated window doesn't just cost more; it produces worse output, because the signal the model needs is buried under everything irrelevant you forgot to clear. So the engineer who never learned context hygiene gets the double penalty: a bigger bill and weaker results. The fix for that is not a usage cap. A cap just rate-limits the bad habit. The fix is teaching the skill.
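To see why an uncleared conversation gets expensive, here is a simplified model of the cost side. It assumes the whole conversation so far is re-read as input on every request, and it ignores prompt caching, compaction, and output tokens. The workload (four unrelated tasks, ten turns each, 2,000 tokens added per turn) is invented for illustration.

```python
def cumulative_input_tokens(turn_sizes: list[int]) -> int:
    """Total input tokens read when every request resends the full history so far."""
    total = 0
    history = 0
    for size in turn_sizes:
        history += size   # the new turn joins the context
        total += history  # the model reads the whole context on this request
    return total


# Assumed workload: 4 unrelated tasks, 10 turns each, 2,000 tokens added per turn.
task = [2_000] * 10

one_long_thread = cumulative_input_tokens(task * 4)        # context never cleared
cleared_between_tasks = 4 * cumulative_input_tokens(task)  # fresh context per task

print(f"One long thread:       {one_long_thread:,} input tokens")
print(f"Cleared between tasks: {cleared_between_tasks:,} input tokens")
print(f"Ratio: {one_long_thread / cleared_between_tasks:.1f}x")
```

Under these assumptions the single dragged-along thread reads 1,640,000 input tokens against 440,000 for the same work with the context cleared between tasks. Real tools soften this with caching, but the shape holds: input cost grows with the square of the conversation length, and three quarters of that long context is irrelevant to the task at hand.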

An untrained operator pays twice: a bigger bill and worse output. The answer is training, not a usage cap.

This is the part that should change how you think about the spend. If usage looks high, the first move isn't to ration; it's to ask whether your people actually know how to drive these tools. Context-window maintenance is a learnable, teachable discipline, and it's the single highest-leverage thing you can invest in. It lowers cost and raises quality at the same time, which no usage cap has ever done.

The double-edged sword: you want them experimenting

And now the part that keeps this from collapsing into pure thrift, because there's a real tension here and pretending otherwise would be dishonest. Yes, train people to be efficient. But efficiency taken too far becomes its own trap, because the most valuable token usage in your whole organization often looks, on a dashboard, exactly like waste.

It's experimentation. It's the engineer trying the capable model on a problem nobody asked them to solve, the analyst seeing whether they can automate a report that's been done by hand for years, the support lead wiring up a prototype to test a hunch. Most of those attempts go nowhere, and they all burn tokens. If you've built a culture that scrutinizes every token, every one of those experiments looks like a line item to question. So people stop running them. And the experiments you just killed are where your next real advantage would have come from.

Killing those experiments narrows where innovation can come from. Once these tools are in everyone's hands, innovation stops being the exclusive property of leadership and a few key individuals. When the cost of trying something has collapsed, the best idea can come from anyone: the coordinator, the junior, the person three layers from the org chart's top who actually feels the problem every day. That broad, distributed experimentation is the entire prize of becoming AI-Native. A token budget enforced too tightly hands innovation back to the few, which is exactly the bottleneck you adopted these tools to escape.

The most valuable token usage in your company often looks, on a dashboard, exactly like waste. It's called experimentation.

So the discipline cuts both ways. You train people to manage context well, and you also make it genuinely safe to spend tokens on things that might not pan out. Those aren't in conflict. An experiment, like any good bet, has a bounded, affordable cost: a few dollars and an afternoon to learn something. The team that runs ten cheap experiments and kills nine of them is not wasteful. It's doing exactly what the tools are for, and the tenth experiment pays for all the rest many times over.
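The ten-experiments bet can be put in numbers. This is a sketch under assumptions I am making up for illustration: $5 of tokens per experiment, an afternoon counted as four hours, and the same $150 hourly figure as the midpoint of the range used earlier.

```python
def experiment_cost(token_dollars: float, hours: float, hourly_cost: float) -> float:
    """Full cost of one experiment: model spend plus the person's time."""
    return token_dollars + hours * hourly_cost


def break_even_win_value(n_experiments: int, cost_each: float, n_wins: int) -> float:
    """Value each winning experiment must deliver for the whole batch to pay off."""
    if n_wins <= 0:
        raise ValueError("at least one win is needed to break even")
    return n_experiments * cost_each / n_wins


# Assumed: $5 of tokens, a four-hour afternoon, $150 per fully loaded hour.
cost_each = experiment_cost(token_dollars=5.0, hours=4.0, hourly_cost=150.0)
threshold = break_even_win_value(n_experiments=10, cost_each=cost_each, n_wins=1)

print(f"Each experiment costs about ${cost_each:,.0f}")
print(f"One win in ten must be worth at least ${threshold:,.0f}")
```

Two things fall out of these assumed numbers. The one surviving experiment only has to be worth about $6,050 to cover all ten. And tokens are under one percent of each experiment's cost, so scrutinizing the token line is auditing the wrong part of the bet.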

What to actually do

Stop counting tokens as if the count meant something. It's the lines-of-code fallacy with a new unit, and it will mislead you in both directions, flattering the engineer who flails and punishing the one who experiments.

Account for spend honestly at the system level and tie it to outcomes, the way you already do with every other infrastructure cost. Invest hard in context hygiene, because a trained operator costs less and produces more, and no cap can claim that. And protect a budget for experimentation on purpose, because that apparent waste is where innovation comes from now, and it comes from everyone, not just the top.

The goal was never to minimize tokens. It was to maximize what your people can do, and to spread that capability as widely through the organization as it will go. Measure that instead. If you're trying to get the accounting and the culture right at the same time, that's the work I do, and I'd like to hear how you're approaching it.