The Real AI Cost Is Not the API Bill

Key takeaways

The number most teams optimize, per-token price, is not where the AI budget goes. It goes into rework: the work an agent gets wrong, confidently, and the runs it takes to catch and fix.
Three measured failure modes drive that rework. Agents silently omit 8% of the pieces in multi-part requests. Under correction they sound confident in 26% of cases without holding the position. And models reliably use only 50-65% of the context they advertise (NVIDIA RULER).
None of these are fixed by a cheaper model. They are properties of the orchestration around the model, not of the price tier.
Fixing them takes two distinct moves. You score how ready a task and its codebase are for an agent, 0 to 100, and prepare the low-scoring ones before they run. Separately, you route by difficulty: the bulk of the work goes to an open-weight tier, the hard 20% to the frontier, which stays necessary. The escalation policy lives in the harness, not the model.
The unit that matters is cost per task completed correctly, not cost per token. You get it by instrumenting the orchestration layer (tokens, tasks, PRs per run), not by reading the provider invoice.
AI is an API-cost management problem as much as a productivity tool. The owner is the CTO, the CIO, and increasingly the CFO, not only the developer.

This article follows the piece "The Model Is a Plugin, Not a Bet". That piece argued why you should not bet your product on a single model. This one shows where the money actually goes once you stop betting, and the machinery that recovers it.

The budget that does not add up

In April 2026, Uber ran out of its AI coding budget for the entire year, four months in. Not a pilot, the full-year allocation. Around 5,000 engineers had adopted coding agents, and per-engineer API cost ran between $500 and $2,000 a month. The company capped spend at $1,500 per engineer per month and, by its own account, could not yet tie the spend to proportional gains in shipped features.

Uber's full-year AI coding budget exhausted four months in

Most organizations are not at Uber's scale, but the shape is the same. They spent more on AI this year than last, and few can show the saving they were promised. The board approved coding agents to cut delivery cost, and the invoices went up while the cost to ship a working feature held flat. The productivity story is shakier too. In a controlled trial by METR, experienced open-source developers forecast that AI would make them 24% faster. They were measured 19% slower, and even afterward still believed they were 20% faster.

METR study: developers felt 20% faster while measured 19% slower

That gap is not a billing problem, and it is worth sitting with. The token price you are optimizing is the smallest and most visible slice of what AI costs you. The decisive cost is rework: the work an agent gets wrong, confidently, and the engineering hours spent catching and redoing it. It never reaches the provider invoice, so the people who own the budget cannot see it, and a cheaper model makes it worse, not better. If your AI line item is climbing while delivery cost stays flat, you are paying for rework you cannot see, and switching models will not fix it.

The rest of this piece is where that cost hides, why it stays invisible, and the machinery that recovers it. It hides in three places, and all three have been measured.

What does a wrong answer actually cost? Start with the bluff

A bluff, in agent terms, is an output that presents as complete but is not. The agent was asked to do five things. It did four, reported all five as done, and moved on. Nothing errors, because from the outside the work looks finished.

The measured rate is not a rounding error. In multi-part requests, agents silently omit about 8% of the pieces. One in twelve, dropped without a warning, inside a request that reports as solved. (The numbers in this piece are drawn from the benchmark research collected in the companion piece; primary sources are linked there.)

Notice where the cost lands. Not in the token bill, because the agent ran once and returned a clean-looking result. It lands two weeks later, when the missing validation surfaces in staging or the skipped branch surfaces in production. The cost is the incident, the context-switch back into code everyone signed off, the re-run, and the erosion of trust every time the pipeline says green and the system says otherwise.

A bluff is the most expensive kind of cheap: the run was short, the cleanup was not.

The confident wrong answer, and why it taxes every output

The second failure is louder. Ask an agent for something, get a wrong result, push back. A reliable system either defends a correct position with evidence or revises against it. The measured behavior is neither: under correction, agents sound confident in 26% of cases and almost never hold the position when pressed.

That cuts both ways, and the second cut is the expensive one. The agent is confident and wrong, then folds the moment you challenge it, so the confidence carried no information. It is equally certain when right and when wrong, and it abandons the claim under the lightest pressure.

For a budget owner this is close to the worst property a tool can have. A human has to verify everything, because the system emits no reliable signal about what to trust. Every confident output is a maybe, and every maybe is a review cycle. The 26% is not the cost of the wrong answers alone. It is the tax on the right ones, because you cannot tell them apart without checking.

This is why "the agent said it was done" is not a status. It is a request for a human to go and look.

Why does more context often make it worse?

The third leak is quieter and more technical, and it explains some of the first two.

Vendors advertise context windows in the hundreds of thousands of tokens. That number is a capacity, not a guarantee of use. NVIDIA's RULER benchmark measured the gap: models reliably use only 50-65% of the context they advertise. Past that band, recall and reasoning over the supplied material degrade, even though every token was accepted and billed.

NVIDIA RULER: models reliably use only 50-65% of the advertised context

So you can pay for context twice. Once on input, because those tokens are charged. Again when the model fails to use the back half of the window, drops a constraint that was sitting in the part it did not really read, and returns something that has to be redone.

Stuffing more into the prompt is the reflex when an agent underperforms. RULER says the reflex often makes the economics worse: you buy tokens that lower the quality of the answer. The fix is not more context. It is the right context, shaped and trimmed so the model works inside the band where it is reliable. That shaping happens above the model.

The wrong question

Put the three together and the pattern is clear. Bluffing, bluster, and wasted context are not priced into the token bill, and none of them is solved by changing the token price.

Which exposes the question most teams are asking, and why it is the wrong one:

"Which model is cheapest per token?" optimizes the slice you can see and ignores the rework you cannot.
"Which model is best?" is the same mistake from the other side, the question we took apart in The Model Is a Plugin, Not a Bet.
Both treat the model as the unit of decision. Both miss where the outcome is actually decided.

The operational question is different: which task should run on which tier, and how do you know before you run it? A simple task with clean inputs does not need a frontier model. A hard task with fragmented context will fail on anything if the context is not prepared first. Cost is not set by the model you pick. It is set by the match between the task and the tier, and by the preparation that comes before either.

Stop asking what to buy, and start asking how to assign. That requires a concrete measure before any model touches the work, which is where the machinery starts.

How is task readiness actually scored?

You cannot route a task you have not measured, so measurement comes first, and it is concrete, not a vibe.

Before any agent touches the work, the codebase the task lives in is scanned and assigned an agent-readiness score from 0 to 100. The scan is not a single model rating "how hard does this feel." It checks the codebase against the conditions an agent needs to orient itself: whether the relevant code is documented, the structure is legible, the boundaries and dependencies are explicit, and the business intent is recoverable rather than buried. It returns a tabular breakdown of where and why the code falls short, classifies maturity in bands (L1 baseline, L2, and up), and auto-generates the improvement tasks that would raise the score. Because it is a score, it has a trendline, so you can watch readiness move as the codebase is prepared.

A second Step-0 artifact runs alongside it. A mapping pass (we call the agent Cartographer) walks the codebase and emits a topology.md, a compass the downstream agents read so they are not rediscovering the architecture on every run. A human reviews and approves that map before it is committed.

Readiness is not the same as difficulty, and the distinction matters. A clean, well-documented module can still contain an algorithmically hard change, and a messy file can hold a trivial one. The score measures how readable the task is to an agent, which predicts the failure modes above: a low score is where bluffing and bluster come from, because the agent is guessing. Intrinsic difficulty is a second axis. Readiness decides whether an agent can orient at all; difficulty decides which tier should carry it once it can.

How routing assigns the work

With the task prepared and mapped, the tier follows the difficulty of the work:

Routine, well-bounded work. The bulk of it. It runs on an open-weight tier that is more than sufficient, and there is no reason to pay a frontier rate for work an open model finishes correctly the first time.
Genuinely hard work: ambiguous, cross-cutting, judgment-dependent. The frontier. It is a necessary condition for the hard tasks, not a sufficient one, so the orchestration still verifies the result. For the difficult 20% you want the strongest model available and you pay for it without flinching, because that is where it earns the rate.

The readiness score sits before this, not inside the routing rule. A task that scores low is not sent to a tier at all until it is prepared: you apply the improvement tasks the scan generated and raise readability into the reliable band, and only then does it run. The escalation policy itself is a function of the harness, not the model. The model does not decide when to hand a task up; the layer above does, on the difficulty, the verification result, and the cost target. That policy is yours to configure, not a fixed default you inherit from a vendor.

Readiness is a gate; routing is by difficulty, with the hard 20% escalated to the frontier

What does "verification" actually mean here?

The readiness score is the input. Verification is the output check, and it is the load-bearing claim of the argument, so it should not stay abstract.

Verification is not asking the model if it is sure. It is a set of checks in the layer above it:

Explicit test design. An agent (TESS) derives test cases at the right level, unit for pure logic, integration for cross-component behavior, end-to-end for real user flows, and runs an automatic coverage-gap analysis that flags untested behavior instead of reporting a single coverage percentage.
Multi-model review. Several instances of one coding agent write while a different model, for example Codex as a second opinion, reviews, so a disagreement surfaces before the code does.
A human gate. Generated documentation, the topology map, and the tasks are reviewed and approved before they are committed. The system is explicitly not a fully autonomous wizard.

One honest boundary, because the alternative is a different kind of bluff. Verification that automatically blocks every wrong output before a human sees it is the goal, and it is not fully solved. The harness catches many of these failures through these checks, gates, and second-opinion passes, and we do not publish a single catch-rate number because it varies by codebase. It does not catch all of them automatically yet. The direction is right and the gains are real, but the endpoint is roadmap, not a shipped guarantee.

Where does the layer sit, and does our code leave?

Two questions every architect and CIO asks at this point, so here is the shape.

The orchestration runs client-side. The preparatory work, the readiness scan and the mapping, executes locally on the developer's machine and calls models over an API. Everything it generates, the traces, the agent memory, the maps, is written into your repository and committed to Git, so the audit trail lives with the code and travels with the team rather than sitting in a vendor's console. Communication with the platform is over MCP. That is also the answer to the lock-in objection: the model is kept separate from the product behind open interfaces, so swapping or adding one is configuration, not a rewrite. The argument against model lock-in would be hollow if the harness simply moved the lock-in one layer up; open protocols are how it does not.

On the question a CIO raises the instant they hear "open weight," where it runs and whether source leaves the building: open-weight model support in orchestration is available now. The stronger privacy posture, running the file-by-file scan of sensitive legacy code on open models so that code never leaves your environment, is the direction we are building toward, and we describe it honestly as roadmap rather than a shipped air-gapped mode.

The number that is not a price

Here is the result that anchors the economics, and it is easy to misread, so read it precisely.

A largely open panel, orchestrated, comes close to a frontier model like Fable on confirmed task completion, at a fraction of the inference cost. The wrong reading is "open models are nearly as good and cheaper, so switch." That is not the claim and switching is not the strategy. The claim is about the panel and the orchestration, not the model. A single open model dropped in for a frontier model does not come close on its own. What gets you there is the layer that routes, verifies, and combines: the right task to the right tier, the bluff caught before it ships, the hard 20% escalated to the frontier that the same panel still includes.

So the saving has a stated source, not a marketing rounding. It comes from two places: not paying frontier rates for volume work an open tier finishes correctly, and not paying twice for rework that verification caught the first time. It assumes the readiness and verification machinery above is actually in place. Without it, a cheaper model produces more rework, not less, and the number inverts. The independent signal that this is an orchestration effect and not a model effect: on the DRACO orchestration benchmark the same frontier model scored 70.5% inside a structured harness versus 59.8% standalone, and fused multi-model panels consistently beat single-model setups on confirmed task completion. The model set the ceiling, the harness reached it.

DRACO: the same model scores 70.5% inside a harness versus 59.8% standalone

The per-task cost split: most volume belongs on the cheap tier, the hard 20% on the frontier

Cost-cutting is a discount: downgrade the model and absorb the quality hit. Cost management is an operating model: route the work so quality holds and spend tracks difficulty.

Two honest caveats

A cost claim is worthless without its limits.

Routing does not make the frontier optional. Some tasks are hard in a way no preparation removes: deep refactors across a tangled dependency graph, decisions that weigh trade-offs the spec never stated, work where being subtly wrong is expensive. For these you want the best model, and the saving comes from spending frontier rates only where they are warranted, not from avoiding them. Anyone selling "replace the frontier with open and save" is selling the quality hit as if it were free. It is not.
Automatic verification is partial. The harness reduces the human-review burden, it does not remove it. A budget model that assumes zero human verification is wrong today, and the honest version prices in a smaller, targeted review step rather than none.

You cannot manage what you cannot see

The strongest objection to everything here is operational, not theoretical. If the real cost is rework, and rework shows up as engineering time rather than tokens, how does a budget owner see it? You cannot manage a cost you cannot measure.

That is the right objection, and it points at the actual work. The token bill is visible because the provider sends it. Rework is invisible because nobody bills you for the second run that fixes the first. Making it visible means instrumenting the layer where the work happens, and that instrumentation already exists in the orchestration: the execution layer tracks tokens consumed, tasks developed, and PRs opened and closed, per run and per agent. That is the raw material for the unit that matters, cost per task completed correctly, rather than cost per token consumed. Put the readiness score, the tier, the run count, the verification outcome, and the downstream result beside those KPIs, and a budget owner has the cost curve the invoice cannot show.

This is why AI spend is an API-cost management problem as much as a productivity one, and why the owner is moving up the org chart. The developer cares about the task in front of them. The CTO, the CIO, and the CFO care about the curve across thousands of tasks. The layer that routes and verifies is also the layer that reports, which is what turns a coding tool into something a budget owner can govern.

What we see in practice

We build this layer, so this is the part we live in.

The pattern repeats across enterprise codebases. The cost is rarely the model rate. It is the rework on tasks that were never readable enough to run, sent to an agent anyway, and returned looking done. Raise readability with the score and the generated preparation tasks, route by difficulty, and the rework curve bends before anyone swaps a single model. The trendline on the readiness score is usually the first number that moves, and it moves before cost per task does, which is the order you want: fix the cause, then the cost follows.

The external evidence points the same way, and we lean on it rather than on internal numbers we will not publish. An open-weight model under a permissive license ranks third overall on Agent Arena's confirmed-task-completion metric, among proprietary models (leaderboards move; this is the standing at the time of writing). DRACO shows the same model gaining double-digit points inside a harness. Fused panels beat single models. None of that makes sense if the model is doing most of the work. It makes sense if the harness is.

None of this is an argument against the frontier. We run on Claude, the frontier models matter, and our hard tasks go to them. It is an argument about where the value, and the cost, concentrate. They concentrate above the model. That is where we put the engineering, and it is where your AI economics are decided whether you instrument it or not.

The question that scales, and the bet you stop making

The model layer is converging. The gap between the top models on standard benchmarks has fallen from double digits to nearly nothing, and a strong open-weight option now sits in the same conversation as the proprietary leaders. In that world "which model do I buy" is a question with a shrinking answer and a shrinking ability to differentiate your cost or your quality.

The question that scales is whether your harness can do four things. Read it as a checklist:

Score how ready each task is before it runs.
Route by difficulty instead of defaulting every task to the most expensive tier.
Verify, with tests, a second opinion, and a human gate, before output reaches anyone.
Tell you what a completed, correct task actually cost.

Answer yes and the model becomes what it should be, a plugin you swap as the market moves, while your cost and quality stay under your control. Answer no and you are paying frontier rates for volume work, paying twice for rework you never see, and calling the token bill your AI cost because it is the only number anyone sends you.

The old bet was on a model: pick the best one, build on it, hope it stays best and stays available. The cost version is quieter and just as fragile: pick the cheapest, cap the spend, hope quality holds. It does not hold, because the cost you cut was never the cost that mattered. The token price was visible, the rework was real. Stop betting and start routing: score the task, prepare it if it is not ready, assign it to the tier that is sufficient, verify before it ships, and keep the frontier for the work that earns it. That is how the saving stops leaking, because you are finally managing the cost that was there the whole time.

FAQ

What is the real cost of running AI coding agents?

The dominant cost is rework, not per-token price. Agents produce outputs that look complete but are not, which surfaces later as incidents, re-runs, and human verification time. None of it appears on the token invoice; it appears as engineering time. Teams that watch only the token bill consistently underestimate what AI costs them.

Does switching to a cheaper or open-weight model reduce AI costs?

Not on its own. The failure modes that drive cost, silent omissions, confident wrong answers, and wasted context, are properties of the orchestration, not the price tier. A cheaper model with no orchestration around it produces more rework. Costs fall when you score readiness, route by difficulty, and verify outputs, all of which live above the model.

How is task readiness actually scored?

A scan checks the codebase against the conditions an agent needs to orient itself, documentation, legible structure, explicit dependencies, recoverable intent, and returns a 0 to 100 score with a breakdown of where it falls short, a maturity band, and auto-generated tasks to raise it. Readiness predicts whether an agent can orient; intrinsic difficulty is a separate axis that decides which tier carries the task.

What does verification mean in this context?

Checks in the layer above the model: explicit test design at the right level with an automatic coverage-gap analysis, optional multi-model review where one model writes and another reviews as a second opinion, and a human gate before anything is committed. Fully automatic blocking of every wrong output is not yet solved; the harness reduces the human-review burden rather than removing it.

Where do open-weight models run, and does our source code leave our environment?

Open-weight support in orchestration is available now. The orchestration runs client-side and writes its traces and maps into your Git repository. Running the scan of sensitive legacy code entirely on local open models, so that code never leaves your environment, is the privacy direction on the roadmap rather than a shipped air-gapped mode today.

How do you measure AI cost if rework is invisible on the invoice?

You instrument the orchestration layer, which already tracks tokens, tasks, and PRs per run, and compute cost per task completed correctly rather than cost per token. Combined with the readiness score, the tier, the run count, and the verification outcome, that gives a budget owner the cost curve the token bill cannot show.

Is this an argument to move off frontier models like Claude?

No. The frontier model is a necessary condition for the hard tasks, not a sufficient one. The argument is to stop paying frontier rates for volume an open tier finishes correctly, and to keep the frontier for the difficult 20% where it earns its rate, governing both from one layer instead of betting on one model.

Companion piece: The Model Is a Plugin, Not a Bet. Open-weight support is available now in FairMind. We are opening the research preview to a small group of enterprise teams. Join the waiting list.

The Real AI Cost Is Not the API Bill

The Real AI Cost Is Not the API Bill

Key takeaways

The budget that does not add up

What does a wrong answer actually cost? Start with the bluff

The confident wrong answer, and why it taxes every output

Why does more context often make it worse?

The wrong question

How is task readiness actually scored?

How routing assigns the work

What does "verification" actually mean here?

Where does the layer sit, and does our code leave?

The number that is not a price

Two honest caveats

You cannot manage what you cannot see

What we see in practice

The question that scales, and the bet you stop making

FAQ

Ready to Transform Your Enterprise Software Development?