The Model Is a Plugin, Not a Bet

by Alexio Cassani, CEO

Every few weeks someone reopens the same debate: "Has anyone actually replaced Claude or GPT with a local model for day-to-day work?" The latest round ran on Hacker News and pulled in dozens of replies from people with serious rigs under their desks. The honest conclusion is no: almost nobody has replaced the cloud for heavy work.

But the question is the wrong one. And the reason it is wrong is exactly the reason why, at FairMind, we build what we build.

The number that reframes everything

Let's start with the figures, because most discussions never get past impressions.

On coding benchmarks the gap between open and frontier models has nearly closed. Over the past year the distance between the first and tenth model has fallen from almost twelve points to a little over five, and at the top the first two are separated by less than a point. The best open-weight model on SWE-bench Verified now runs north of 80%, in line with frontier proprietary models. On cost the collapse is even sharper: a model like DeepSeek V4 Flash runs at fractions of a cent per million tokens, with a permissive license and a million-token context window, and in general the cost per token of open models is ten to a hundred times lower than proprietary APIs.

If the story ended here, this would be another breathless piece about "the moment of open models." It doesn't, and the interesting part is precisely where the numbers stop telling the truth.

There is a specific mechanism behind this leveling, and it is not that the models have all become equally good. It is that the benchmark has become a target. When a metric like SWE-bench Verified becomes the public objective every lab is measured against, it stops measuring general capability and starts measuring how hard everyone has optimized toward it. It's Goodhart's law applied to AI: when a measure becomes a target, it ceases to be a good measure. The scores converge because everyone is sprinting toward the same finish line, not because the terrain underfoot has become uniform. A gap that closes at the top of the leaderboard says a great deal about the leaderboard and very little about real work.

SWE-bench Verified, the most cited benchmark, is a floor of capability, not a ceiling of performance. A high score says the model can solve well-bounded GitHub tasks. It says nothing about how it behaves in a session that runs for days, on a proprietary codebase with no tests, or in a refactor that touches twenty files. There is more: NVIDIA's RULER benchmark shows that models reliably use only 50-65% of the context they advertise. The million-token window on paper becomes a lot less in practice.

The distinction that matters here isn't capability versus incapability, it's median capability versus reliability on the tail. Two models can solve the same 80% of standard tasks and diverge completely on the hard 20%: the legacy file with no tests, the requirement that contradicts itself three folders over, the implicit dependency no documentation captures. Enterprise software development lives exactly on that tail. You don't pay a senior engineer for the average case, you pay them for the cases that blow up the average case. The same goes for the model: the frontier value isn't in the aggregate score, it's in what happens when the task runs off the rails, and that is precisely what no benchmark today knows how to measure.

The effective-context figure makes the picture worse exactly where it hurts most. If a model reliably uses only half the window it advertises, the problem is marginal on a two-hundred-line script and becomes structural on a brownfield codebase measured in millions of lines scattered across dozens of repositories. That's our daily terrain, and it's the terrain where the context window on paper and the real understanding of the project diverge the most. Stuffing the context full of tokens isn't the same as giving context: it's the work of selection, compression, and routing, that is, the work of the harness, that determines what the model actually understands.
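To make the selection step concrete, here is a minimal sketch of context budgeting. It assumes the harness trusts only a fraction of the advertised window and fills it by relevance. The `Chunk` type, the relevance scores, the file paths and the 0.5 fraction are illustrative assumptions, not our production logic.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    path: str
    tokens: int
    relevance: float  # hypothetical score in [0, 1], e.g. produced by a retriever


def effective_budget(advertised_tokens: int, usable_fraction: float = 0.5) -> int:
    """Tokens we assume the model uses reliably (an assumption, not a measurement)."""
    return int(advertised_tokens * usable_fraction)


def select_context(chunks: list[Chunk], budget: int) -> list[Chunk]:
    """Greedy selection: most relevant first, skipping anything that no longer fits."""
    selected, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c.relevance, reverse=True):
        if used + chunk.tokens <= budget:
            selected.append(chunk)
            used += chunk.tokens
    return selected


# Hypothetical repository slices competing for the same window.
chunks = [
    Chunk("billing/invoice.py", 40_000, 0.9),
    Chunk("legacy/utils.py", 70_000, 0.2),
    Chunk("billing/tax_rules.py", 30_000, 0.8),
    Chunk("docs/changelog.md", 50_000, 0.1),
]
budget = effective_budget(200_000, usable_fraction=0.5)
picked = select_context(chunks, budget)
print(budget, [c.path for c in picked])
```

With these invented numbers the two billing files fit inside the trusted half of the window and the low-relevance material stays out. A real harness would add compression and routing on top, but the principle is the same: the budget is set by what the model uses reliably, not by what the spec sheet advertises.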

What the community really says

The Hacker News thread is useful not for its conclusion, but for where the practitioners put their finger.

The top-voted comment is an argument about opportunity cost, not quality. Paraphrased: "I redo the analysis every month, and every month I conclude that the time it takes to make local work isn't worth it compared to a ready-made cloud agent." It's an economic observation, not a technical one.

That argument is worth taking seriously rather than dismissing, because in its own context it's correct. For a solo developer the cost of local is their own time, and their own time, set against a subscription of a few tens of euros a month, almost always loses. But that calculation doesn't transfer to the company, and here's the point the thread couldn't see. For an organization, three variables change at once: data sovereignty, which in Europe is not a preference but often a constraint; economies of scale, where the per-developer price of a cloud model multiplies across hundreds of seats while local infrastructure amortizes; and control, that is, the ability not to depend on a price list that, as we'll see, can move under your feet without warning. The "not worth it" conclusion is right for the individual and misleading for the enterprise.

Fig. 1 · Source: Hacker News thread — Who tried local for coding: how it went

Outcome of the 14 reported setups: 7 usable, 3 marginal, 4 failed

Each square is a setup described in the thread. Sample: 14 cases, self-selected. "Usable" means "the commenter uses it," not a benchmark, and often holds only for personal projects or well-defined tasks. Those who couldn't make it work often didn't leave a structured setup behind: the count is therefore optimistic.

The signal that really matters is a different one, and it surfaces repeatedly: the bottleneck isn't the models, it's the alternative harnesses. The tools that drive the local model are still immature on queue management, interruption, sub-agents, goal management. The models hold up. It's the scaffolding around them that doesn't. And the pattern among people doing serious work isn't "pure local" but hybrid: a frontier model for the hard tasks, an open or subscription model for volume and for boilerplate code.

It's worth being precise about what's missing, because "immature harness" sounds vague. What's missing is the stuff you don't see in a demo and feel in production: the queue that decides what the agent does first and what next, clean interruption when it goes off on a tangent, the sub-agents you can delegate a verifiable piece to, the project memory that survives a single session, the automatic verification that blocks a wrong output before it reaches the user. These are problems of systems engineering, not of model training, which is why they aren't solved by waiting for the next release. And the hybrid that emerges isn't a compromise downward: it's the mature architecture. Routing each task to the right model, frontier where reasoning is needed and open where volume is needed, is an orchestration decision. The model supplies the capability; the harness decides whether that capability becomes a result.
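The routing decision described above can be sketched in a few lines. The task kinds, the file threshold and the two tier names below are hypothetical policy choices made up for illustration; a real policy would depend on the domain and on the client's constraints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    description: str
    kind: str           # e.g. "boilerplate", "refactor", "architecture"
    files_touched: int


# Hypothetical policy: task kinds assumed to need frontier-level reasoning.
FRONTIER_KINDS = {"architecture", "ambiguous_requirement"}
# Hypothetical threshold: wide changes go to the frontier tier as well.
MAX_FILES_FOR_OPEN = 10


def route(task: Task) -> str:
    """Return the model tier for a task: 'frontier' for reasoning, 'open' for volume."""
    if task.kind in FRONTIER_KINDS or task.files_touched > MAX_FILES_FOR_OPEN:
        return "frontier"
    return "open"


tasks = [
    Task("Generate CRUD endpoints", "boilerplate", 3),
    Task("Choose between event sourcing and plain tables", "architecture", 0),
    Task("Rename a service across the monorepo", "refactor", 20),
]
for task in tasks:
    print(f"{route(task):8s} <- {task.description}")
```

The point of the sketch is where the decision lives: the rule sits in the harness, so changing the policy or swapping the model behind either tier does not touch the rest of the system.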

Fig. 2 · Synthesis of the thread (sosodev and others) — The hardware threshold: where the mid-tier becomes the sweet spot

Hardware scale: below 8B useless, 27-35B usable, frontier-like often not enough

Block height indicates the capability unlocked; color, the practicality on local hardware. The thread's message: the 27-35B tier is the real equilibrium point; running frontier "in-house" requires 128GB+ of memory and, in the words of those who tried, often still isn't enough.

In short: the people who tried didn't run into a limit of the model. They ran into a limit of orchestration, and into a wall of hardware.

Fig. 3 · Throughput reported in the thread — Real speed (tok/s), logarithmic scale

Throughput on a logarithmic scale: from 160 to 0.7 tokens per second

Only the cases where a number was reported; many comments only say "faster than the cloud" or "much slower." The spread between the multi-GPU rig and the CPU machine is about 230×. Speed isn't a detail: at 0.7 tok/s a multi-step agentic task becomes impractical.
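A quick back-of-the-envelope calculation shows why the bottom of that scale is impractical. The 50,000-token task size is an assumption chosen for illustration; only the two throughput figures come from the thread.

```python
def generation_time_seconds(total_tokens: int, tokens_per_second: float) -> float:
    """Pure generation time, ignoring prompt processing and tool latency."""
    return total_tokens / tokens_per_second


TASK_TOKENS = 50_000  # hypothetical output volume of a multi-step agentic task

for label, speed in [("multi-GPU rig", 160.0), ("CPU machine", 0.7)]:
    hours = generation_time_seconds(TASK_TOKENS, speed) / 3600
    print(f"{label}: {hours:.2f} h")

spread = 160.0 / 0.7  # ratio between the fastest and slowest reported setups
print(f"spread: about {spread:.0f}x")
```

Under that assumption the fast rig finishes in roughly five minutes, while the CPU machine needs almost twenty hours for the same output.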

The proof that doesn't come from a forum

A thread is anecdote. To get past anecdote you need something measured, and it just arrived.

In June 2026 Arena published the methodology behind Agent Arena, an agent evaluation built on millions of real interactions. The part that's relevant for us is conceptual before it's numerical. Arena treats the agent for what it is: not a model, but a system made of an orchestrator-model plus a harness with many subcomponents and tools. And it introduces a technique, causal tracing, that randomizes the choice of components and estimates the causal contribution of each one separately. In other words: it measures how much the model weighs and how much the scaffolding weighs, distinctly.

It's worth pausing on the method, because it's the difference between an opinion and a measurement. A normal leaderboard observes which agents do best and correlates, but an agent is model plus harness plus a thousand configuration choices, and correlation can't tell you who gets the credit. Causal tracing breaks this knot by randomizing the components, the same principle as a controlled experiment: if you change only the orchestrator and hold everything else fixed, the difference in outcome is attributable to that orchestrator and not to a coincidence of setup. It's the only honest way to answer the question "how much does the model really matter," and the answer that emerges is clear: it matters, but it's one factor among many, not the factor.
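The principle can be illustrated with a toy randomized experiment. This is not Arena's implementation: the component names, the simulated "true" effects and the scoring function are all invented here, purely to show why random assignment lets you attribute a difference to one component.

```python
import random
from statistics import mean

# Invented effects on a success score, used only to simulate sessions.
TRUE_EFFECT = {
    "orchestrator": {"A": 0.00, "B": 0.05},
    "harness": {"basic": 0.00, "verified": 0.15},
}


def run_session(rng: random.Random, orchestrator: str, harness: str) -> float:
    """Simulated session score: baseline plus component effects plus noise."""
    noise = rng.gauss(0, 0.1)
    return (0.5 + TRUE_EFFECT["orchestrator"][orchestrator]
            + TRUE_EFFECT["harness"][harness] + noise)


def randomized_trial(n_sessions: int, seed: int = 0) -> list[tuple[str, str, float]]:
    """Assign every component at random, independently, for each session."""
    rng = random.Random(seed)
    sessions = []
    for _ in range(n_sessions):
        orchestrator = rng.choice(["A", "B"])
        harness = rng.choice(["basic", "verified"])
        sessions.append((orchestrator, harness, run_session(rng, orchestrator, harness)))
    return sessions


def estimated_effect(sessions, index: int, treated: str, control: str) -> float:
    """Difference in mean score between two values of one component."""
    treated_scores = [s[2] for s in sessions if s[index] == treated]
    control_scores = [s[2] for s in sessions if s[index] == control]
    return mean(treated_scores) - mean(control_scores)


sessions = randomized_trial(4000)
orchestrator_effect = estimated_effect(sessions, 0, "B", "A")
harness_effect = estimated_effect(sessions, 1, "verified", "basic")
print(f"orchestrator B vs A: {orchestrator_effect:+.3f}")
print(f"harness verified vs basic: {harness_effect:+.3f}")
```

Because each component is assigned independently of the others, a simple difference in means recovers each simulated effect separately. On observational data, where the best models tend to ship with the best harnesses, the same subtraction would blend the two.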

And here the names matter, because Arena's agentic leaderboard doesn't measure how intelligent a model is in the abstract, but how well it orchestrates tools on real tasks. As of June 15, 2026 it ranks 27 models across nearly 752,000 sessions. At the top is the proprietary frontier, as you'd expect. But scrolling downward the open models aren't bunched at the bottom: they're mixed in among the proprietary ones.

Fig. 4 · Source: Agent Arena — An agentic signal, not the price list: user-confirmed tasks

Confirmed success score per model, open and proprietary

Agent Arena's "confirmed success" signal: how often the user explicitly confirms the task is solved, one of the causal signals in the methodology. In teal, the open weights; in blue, the proprietary ones. GLM 5.2 (Max), open, is third, ahead of nearly all the frontier. I chose a clear, directional signal instead of the aggregate "net improvement" score, which isn't linearly orderable and would make the chart misleading. Data: Agent Arena, Jun 15, 2026; our selection.

Three readings are worth more than the whole table. The first you can see in the chart: on one of Arena's agentic signals, how often the user confirms the task is genuinely solved, GLM 5.2 (Max), open weights under an MIT license, is third overall, ahead of nearly all the frontier. The second isn't visible in a single chart but is the most instructive: the overall leaderboard isn't a ladder of capability. Claude Opus 4.8 itself is second in its reasoning variant and eleventh in its non-reasoning one, and what sinks it is a signal of behavior, tool hallucination, which climbed from under 1% to nearly 16%. Same brain, different behavior inside the flow. The third: open models like DeepSeek V4 Flash, nearly free, and GLM 5.1 deliver orchestration close to the frontier at a fraction of the cost, while other capable ones like Gemma 4 31B pay not for poor intelligence but for the same fragility on tools.

If you were looking for a single proof of this article's thesis, this is it. When you measure agents on how they actually behave, and not on how brilliant they are in the abstract, the leaderboard reorganizes around properties that live in the way the model is driven: tool reliability, the ability to be corrected, recovery from errors. All things a better harness improves, and that a model alone doesn't guarantee.

The signals Arena measures, in fact, almost never speak to the "intelligence" of the model, but to behavior inside a workflow: the ability to take in a user correction (steerability), the number of attempts it takes to recover from a shell error (bash recovery), the tendency to invoke tools that don't exist (tool hallucination). These are all properties that live in the interaction between model and harness, not in the model alone.

The point not to let slip is that each of these signals is improvable from outside the model. Steerability depends on how the harness reinjects the correction into the context. Recovery from a shell error depends on which diagnostic tools and which guardrails you've put around it. Tool hallucination is knocked down with a validated tool registry and a check that rejects nonexistent calls before they become errors. Arena is measuring properties that a good orchestration layer can move without touching a single weight of the model. It's the operational demonstration of something uncomfortable for those who sell models and convenient for those who build systems: today the margin for improvement sits more in the scaffolding than in the brain.
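As an illustration of the tool-hallucination point, here is a minimal sketch of a validated tool registry. The `ToolRegistry` class and the tool names are hypothetical and not taken from any specific agent framework.

```python
from typing import Any, Callable


class ToolRegistry:
    """Only registered tools, called with declared parameters, are ever executed."""

    def __init__(self) -> None:
        self._tools: dict[str, tuple[Callable[..., Any], frozenset[str]]] = {}

    def register(self, name: str, func: Callable[..., Any], params: set[str]) -> None:
        self._tools[name] = (func, frozenset(params))

    def call(self, name: str, arguments: dict[str, Any]) -> dict[str, Any]:
        # Reject nonexistent tools before they turn into runtime errors.
        if name not in self._tools:
            return {"ok": False,
                    "error": f"unknown tool {name!r}; available: {sorted(self._tools)}"}
        func, params = self._tools[name]
        unexpected = set(arguments) - params
        if unexpected:
            return {"ok": False,
                    "error": f"unexpected arguments for {name!r}: {sorted(unexpected)}"}
        return {"ok": True, "result": func(**arguments)}


def word_count(text: str) -> int:
    return len(text.split())


registry = ToolRegistry()
registry.register("word_count", word_count, {"text"})

good = registry.call("word_count", {"text": "the model is a plugin"})
bad = registry.call("count_words_v2", {"text": "hallucinated tool"})
print(good)
print(bad["error"])
```

The rejection message lists the tools that do exist, so the harness can feed it back into the context and give the model a chance to correct itself instead of failing the step.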

More interesting still are two ways of failing that Arena has named. Bluster: the agent, put under pressure by a correction, sounds confident but almost never holds its position. And bluffing: faced with a request made of multiple parts, in a non-trivial share of cases it leaves a piece incomplete, and in a more insidious minority silently omits it while presenting the result as complete. Neither of these two problems is solved by changing the model. They're solved with verification, with automatic checks, with a harness that doesn't take the agent at its word.
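A harness that doesn't take the agent at its word can be sketched as a checklist verifier. The part names and the checks below are hypothetical; in practice each check would run tests, linters or diff inspection against real artifacts.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Part:
    name: str
    check: Callable[[dict], bool]  # inspects the artifacts the agent produced


def verify_deliverable(parts: list[Part], artifacts: dict, agent_claims_complete: bool) -> dict:
    """Accept the result only if every requested part passes its own check."""
    missing = [part.name for part in parts if not part.check(artifacts)]
    return {
        "accepted": not missing,
        "missing": missing,
        # The agent declared success while a part is absent: the bluffing case.
        "bluff": agent_claims_complete and bool(missing),
    }


# Hypothetical three-part request: endpoint, tests, documentation.
parts = [
    Part("endpoint", lambda a: "def create_invoice" in a.get("code", "")),
    Part("tests", lambda a: "def test_" in a.get("tests", "")),
    Part("docs", lambda a: bool(a.get("docs", "").strip())),
]
artifacts = {
    "code": "def create_invoice(order): ...",
    "tests": "def test_create_invoice(): ...",
    "docs": "",  # silently skipped by the agent
}

report = verify_deliverable(parts, artifacts, agent_claims_complete=True)
print(report)
```

The verifier never asks the agent whether it finished. It decomposes the request up front and checks each piece independently, which is why this class of failure is addressed in the harness rather than by a model swap.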

Fig. 5 · Source: Agent Arena — Bluffing and Bluster: two failures no model swap fixes

Bluffing and Bluster, ways agents fail

Two ways a capable agent under-delivers. On multi-part requests, 8% of the pieces are silently omitted and presented as complete (bluffing). Under correction, the agent "sounds confident" in 26% of cases but almost never holds its position (bluster). Neither is cured by changing the model: it's cured with verification inside the harness. Data: Agent Arena; our processing.

There's another Arena figure that closes the loop. When users delegate, in the majority of cases they don't ask for advice: they hand the agent an entire deliverable, and in a smaller share they let it work autonomously. Agents, then, already do real, unassisted work. But over the same period users take back control about two and a half times as often as they widen the delegation. Read correctly, this doesn't say "agents aren't ready." It says you need two things together: agents capable of taking on a whole task, and a control plane that lets the human grab the reins back at any moment. That control plane is, by definition, part of the harness.

And real work is heavy in ways the benchmark doesn't capture. In the observed window, agents wrote tens of millions of lines of code in a single week, with bash as the dominant tool, a substantial share of sessions running past twenty-five tool calls and a long tail reaching into the hundreds. A third of sessions close with more than 128k tokens of context. This isn't "completing a well-defined task." This is driving a long, noisy, multi-step system. It's harness terrain.

Fig. 6 · Source: Agent Arena — What agents really do: coding is the largest share

Distribution of real agent tasks by intent

Primary intent across 160,480 real tasks in 7 days (Agent Mode). Writing and debugging code together account for about 26%: coding is the largest share of real agentic work, and it's the most demanding terrain for the harness. Data: Agent Arena; our processing.

One last note not to miss, and one that ties back to something we've already covered on this blog. Arena computes real cost after the fact, not list cost, and finds models that are more expensive in practice than the per-token price would suggest, because an inefficient orchestrator takes more steps, more calls, and induces more user turns to arrive at the same result. The per-token price is an accounting illusion. What matters is the price per task solved, and that number isn't set by the model's price list: it's set by how well the harness drives the model to the result without wasting steps. A "cheap" model inside shoddy orchestration can cost more than an expensive model well driven.
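The difference between price per token and price per task solved is easy to see with arithmetic. All the numbers below are hypothetical, chosen only to show how call count and success rate can reverse a price-list comparison; they are not Arena's figures.

```python
def cost_per_solved_task(price_per_million_tokens: float,
                         tokens_per_call: int,
                         calls_per_task: int,
                         success_rate: float) -> float:
    """Expected spend to obtain one solved task, counting failed attempts."""
    cost_per_attempt = price_per_million_tokens * tokens_per_call * calls_per_task / 1_000_000
    return cost_per_attempt / success_rate


# Hypothetical: a cheap model inside wasteful orchestration...
cheap_badly_driven = cost_per_solved_task(0.50, 20_000, 60, success_rate=0.5)
# ...versus a model ten times pricier per token, driven efficiently.
expensive_well_driven = cost_per_solved_task(5.00, 20_000, 8, success_rate=0.9)

print(f"cheap, badly driven:   ${cheap_badly_driven:.2f} per solved task")
print(f"expensive, well driven: ${expensive_well_driven:.2f} per solved task")
```

In this invented scenario the model that costs ten times more per token ends up cheaper per solved task, because it needs fewer calls and fails less often.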

The best model is no longer a single model

The most solid confirmation that the value lives in the orchestration layer, and not in the model, comes not from a forum but from a paper. Perplexity, together with a Harvard researcher, published DRACO, a deep-research benchmark built on real tasks and with penalties for wrong answers, so you can't inflate the score by writing a lot and writing badly. The result that matters is one: their deep-research pipeline based on Opus 4.6 scores 70.5%, while Opus 4.6 itself on its own, even with web search and code execution, stops at 59.8%. Same base model, more than ten points of difference, produced entirely by the orchestration built around it. The authors put it plainly: the gap indicates the importance of agent orchestration beyond the base model. And when they discuss how hard it is to compare different systems, they talk precisely about harness heterogeneity, the tools and the retrieval stacks that surround the model.

The same intuition is already a product. OpenRouter launched Fusion: instead of choosing a model, it fuses them. It sends the same prompt to a panel of models in parallel, a judge compares their answers, marking agreements, contradictions, and blind spots, and a synthesizer writes a single one. On the same benchmark, with a different judge (so the numbers aren't comparable one-to-one with the paper), every fused panel beats the models taken alone: the best touches 69.0%, above Fable on its own at 65.3%.
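The panel, judge and synthesizer pattern can be sketched with stub models. This is not OpenRouter's implementation: the stub functions stand in for real model calls, answers are reduced to sets of claims, and the judging rule (a claim counts as agreed when at least two panelists make it) is an assumption made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


# Stub panelists: each returns the set of claims in its answer.
def model_a(prompt: str) -> set[str]:
    return {"use a queue", "add retries"}


def model_b(prompt: str) -> set[str]:
    return {"use a queue", "add retries", "cache results"}


def model_c(prompt: str) -> set[str]:
    return {"use a queue", "poll the database"}


def judge(answers: list[set[str]]) -> tuple[set[str], set[str]]:
    """Split claims into agreed (two or more panelists) and disputed (one only)."""
    counts = Counter(claim for answer in answers for claim in answer)
    agreed = {claim for claim, n in counts.items() if n >= 2}
    disputed = {claim for claim, n in counts.items() if n == 1}
    return agreed, disputed


def synthesize(agreed: set[str], disputed: set[str]) -> dict:
    """Write a single answer from the consensus and flag the rest for review."""
    return {"answer": sorted(agreed), "needs_review": sorted(disputed)}


def fuse(prompt: str, panel) -> dict:
    # Send the same prompt to every panelist in parallel.
    with ThreadPoolExecutor() as pool:
        answers = list(pool.map(lambda model: model(prompt), panel))
    return synthesize(*judge(answers))


fused = fuse("How should we process webhook events?", [model_a, model_b, model_c])
print(fused)
```

In a real system the judge and the synthesizer would themselves be model calls, and the choice of synthesizer matters a great deal.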

Fig. 7 · Source: OpenRouter / DRACO — Fusing models beats picking one, and the open models hold up

DRACO scores: fused panels and single models, open and proprietary

DRACO, OpenRouter's evaluation (judge Gemini 3.1 Pro, so not comparable one-to-one with the paper). Fusing beats every single model. For our theme, what counts is the bottom of the leaderboard: DeepSeek V4 Pro, open weights, on its own matches Opus 4.8 and GPT-5.5; a largely open panel comes within a point of Fable at half the cost, synthesized, however, by a frontier model. Axis truncated at 50%. Data: OpenRouter, Jun 12, 2026; our processing.

But for our theme, open models that become useful, the interesting figure is at the bottom of the leaderboard, not at the top. DeepSeek V4 Pro, open weights, on its own matches GPT-5.5 and beats Opus 4.8. And a panel made largely of open, low-cost models, DeepSeek and Kimi alongside a small Google model, beats both GPT-5.5 and Opus 4.8 taken alone and comes within a point of Fable, at half the cost. Orchestration doesn't just serve to squeeze the frontier: it serves to bring open, low-cost models to a level they wouldn't reach on their own.

Two honest caveats, because they matter precisely for anyone who wants to move toward open. The first: most of the gain comes from the synthesis step, not from the diversity of the models, and the limit case proves it: Opus 4.8 fused with a copy of itself, zero diversity, still rises by 6.7 points. The second, more uncomfortable: in that low-cost panel the synthesizer was still a frontier model, Opus 4.8. So no one has yet demonstrated an entirely open pipeline, panel and synthesizer alike. And that is exactly the question we're working on.

Here we come back to the point of the whole article. Fusion is deep research, not coding: OpenRouter itself says it doesn't replace a coding model, but becomes a tool the model calls when an architectural decision or a choice of approach is worth a few extra seconds. The principle, though, is transferable, and the question we ask ourselves at FairMind is sharp: does the same orchestration of open models, which in research brings low-cost models to frontier level, hold up on development tasks such as requirements gathering, analysis, documentation, and harness engineering on code? It's what we're experimenting with, orchestrating open models inside our harness instead of relying on a single frontier model. The difference from a generic Fusion is the point: our layer knows the codebase, verifies against the tests, and routes on policies that depend on the domain. Fusion demonstrates the principle in its barest form; the real value is in the harness that specializes it. And for now the signals are encouraging: in some workflows an open model, inside our harness, asks a frontier model for a second opinion on the most delicate implementation steps, and in the majority of cases the frontier model doesn't correct the open model's choice, it confirms it.

The most serious objection

At this point an attentive reader has an objection, and it's the strongest one you can raise against the whole argument. If it's true that on the hard 20%, the long and ambiguous orchestration, the frontier model remains irreplaceable, and if it's true that this 20% is precisely where the value concentrates, then the model isn't a plugin at all: it's still the lever that matters most, and the harness is garnish.

The objection is serious and must be granted halfway. Yes, on the hard tasks the best model wins, and no scaffolding turns a mediocre model into a frontier reasoner. But the conclusion doesn't follow, for two reasons. The first: the escalation policy is itself a function of the harness, not of the model. A system that knows how to route the hard 20% to the right model and the remaining 80% to an open model is extracting the frontier's value exactly where it's needed, and it does so thanks to orchestration. The plugin exists precisely because something, above it, decides how and when to swap it.

The second reason is more subtle. Even on the hard 20%, what makes a capable model fail is rarely a lack of raw intelligence. It's the wrong context, the missing verification, the unrecovered step, exactly the failures Arena measures and the harness corrects. Give the best model in the world the wrong context and you get a frontier-grade failure. The frontier model is a necessary condition for the hard tasks, not a sufficient one, and everything that makes it sufficient lives in the layer above. Granting that the frontier is irreplaceable on the 20% doesn't weaken the thesis: it completes it.

Why it didn't happen sooner

There's a second objection I hear often, and it's legitimate: if the harness really is the differentiating element, why didn't it happen sooner? Why are we only talking about it now?

The answer is that the dominant constraint has shifted. For years the bottleneck was the raw capability of the model: a model that couldn't follow a multi-step plan, call a tool reliably, or hold a long context together left nothing to orchestrate. Building a sophisticated harness on top of a model like that was like designing a refined logistics network for an empty warehouse. Only when models crossed a certain threshold of agentic capability, tool use, instruction adherence, coherence over long contexts, did the marginal return shift from the model layer to the orchestration layer. It's not that the harness suddenly became important: it became the binding constraint, because the other one loosened.

This also clarifies what the harness can and can't do, and it's the point to hold firm. The harness doesn't generate capability: you can't, from the outside, raise the accuracy with which a model writes Java. That is supplied by the model, and no scaffolding creates it from nothing. What the harness does is extract the maximum possible value from the capability that's there: it gives the right context, routes to the right model, verifies the output, recovers from errors. Capability sets the ceiling; the harness decides how much of that ceiling you actually reach. They're complementary, not in competition.

One honest question remains, and it's the most uncomfortable for anyone who, like us, bets on orchestration: if models keep getting more agentic, won't they end up absorbing the harness's functions on their own, self-verifying and self-routing? Partly yes, and you have to account for it. But two things stay out of reach for any single model, however good: the context and integration specific to a company, which no generalist model knows, and intelligent routing across a heterogeneous fleet of models, which by definition lives above the single model. Even in a world of near-perfect models, something has to decide which one to call, with which confidential data, and how to prove its result. That something is the harness.

Why value moves upward

There's a pattern that repeats every time a technological layer standardizes, and it's useful to recognize it because it tells you where value will go before the market gets there. When a component becomes abundant and interchangeable, differentiation doesn't disappear: it migrates upward, to the layer that orchestrates that component. It happened with hardware when compute became a commodity and value rose to software. It's happening now with models.

For years the model was the scarce, differentiating layer: having access to the best model was the advantage. That period is closing, not because models stop improving, but because they all improve together and the distance between the first and the tenth has shrunk to a margin that, on bounded tasks, you don't feel. When the component underneath flattens out, those who built their identity on "we use model X" are left with an advantage that evaporates with every release. Those who built on the layer above, the orchestration, the context, the verification, accumulate an advantage that every new model makes stronger, not weaker, because it hands them better material to orchestrate.

This isn't the partisan argument of someone selling harnesses. It's the natural prediction of how a maturing technology behaves, and it would hold even if someone else were making it. The strategic question for the buyer isn't "which model," it's "how much of my value depends on a choice I'll have to remake in six months."

What we see at FairMind

Here we stop quoting others and talk about what we measure ourselves, every day, on brownfield clients with legacy codebases.

The starting point is a methodological choice: we do the entire design phase, functional requirements gathering and technical analysis, on the platform with human validation. This is where ambiguity gets resolved, the part where a frontier model would remain irreplaceable. By the time the work reaches the coding agents, the perimeter and the context are already defined, and inside that clean perimeter the open models are enough to process and implement what the platform has specified. It's not the open model doing on its own what previously required the frontier: it's the harness that, upstream, clears the ambiguity out of its way.

We'll share the numbers in a separate article, but the direction is clear: over the past twelve months the autonomy of agents based on open models has grown in a way that "impressive" undersells. Today we effectively implement all the use cases planned on the platform with GLM 5.1, and for the last few days we've been trying GLM 5.2 and M3 with great satisfaction. Even in the most awkward case, generating documentation from existing code, a task that can run for hours, our agents stay on the rails with a very high level of accuracy and extremely contained costs.

And here is a caveat that matters, because it's the opposite of what many expect. On the platform we never escalate automatically to the frontier model. The client chooses whether to use an open model or a frontier one, and as of June 18 we've opened up, to a select group of clients, the option to choose the open models, which are then used across the board, always. The only point where we genuinely reach for a frontier model today is that optional second opinion inside the client-side coding agents: a check on the most complex steps, not the backbone of the system.

The right question

If the model has become an interchangeable component, the question "can I replace Claude with a local model?" is badly posed. The useful questions are different ones.

Is my harness mature enough to make the choice of model a plugin and not a bet? Is the task bounded enough to hold up with a mid-tier model, or is it ambiguous and long enough to require the frontier? What's my escalation policy, and who decides it, the engineer or the system? And finally, do the reasons to go local (data sovereignty, cost, control) justify the total cost, energy, hours of configuration, hardware, and not just the per-token price?

Those who stay stuck on the first question will keep redoing the same analysis every month and keep concluding the same thing every month. Those who move on to the other four start building something that lasts longer than a model release cycle.

The bet

At FairMind the bet is explicit: the model layer is a commodity in rapid convergence, and we treat it as such, a plugin to swap depending on the task and the client's constraints. We build the moat where the value doesn't evaporate with every new release: in the harness, in context engineering, in automatic verification, in the orchestration that turns a capable model into a reliable system.

The benchmark numbers will keep climbing and leveling off. The distance from one model to the next will keep tightening. And every time it does, those who invested in the right model will have to start over, while those who invested in the harness will simply have one more plugin to swap.


Sources

  • Hacker News: "Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding?", news.ycombinator.com
  • Agent Arena: "Causal Evaluation of Agents in the Real World" (methodology, Jun 2026), with the agentic leaderboard, arena.ai/blog
  • Perplexity & Harvard: "DRACO: a Cross-Domain Benchmark for Deep Research" (arXiv:2602.11685, Feb 2026), arxiv.org
  • OpenRouter: "Surpassing Frontier Performance with Fusion", on the DRACO benchmark (Jun 12, 2026), openrouter.ai/blog · Fusion
  • SWE-bench Verified: public leaderboard and comparative open-vs-frontier analyses (mid-2026), swebench.com
  • RULER (NVIDIA): benchmark on effective context, arxiv.org

The Fig. 1-7 diagrams are our own processing of the data: the Hacker News thread for Fig. 1-3, Agent Arena for Fig. 4-6, OpenRouter / DRACO for Fig. 7.
