The same AI model can score roughly 80% in one environment and roughly 40% in another. The difference isn’t intelligence — it’s the system surrounding it. There’s a benchmark floating around the internet right now that should make you uncomfortable. Someone — some group of researchers or developers with too much time and too much GPU budget — ran the same underlying AI model through two different environments and got radically different results.
In one environment, the model scored between 78 and 82% on a coding benchmark. In the other environment, the same model — same weights, same intelligence, same everything — scored 42%. Let that sit for a moment. The model didn’t change. The intelligence didn’t change. What changed was the scaffolding around it. The pipes and wires and decision-routing logic that nobody puts in the press release. What changed, in the industry’s new favorite piece of jargon, was the harness.
Why does the same AI produce different results?
Imagine you hired a brilliant contractor to renovate your kitchen. The contractor is genuinely exceptional — experienced, creative, fast. But then imagine that in one scenario, you gave them a truck full of tools, a blueprint, direct access to the house, and your phone number. In the other scenario, you gave them a description of the house read aloud through a wall, a blunt chisel, and a request to “figure it out.”
You’d get different kitchens. You’d get very different kitchens. The contractor didn’t become less brilliant in the second scenario. The system they were embedded in just made them act like it. This is more or less what’s happening with AI coding agents right now. And if you’ve been spending your time tracking which foundation model scored what on which leaderboard, you’ve been watching the wrong game.
Two competing visions for AI coding agents
Let me describe the players, because this is fundamentally a story about two competing philosophies that eventually harden into lock-in. On one side, you have Claude Code, Anthropic’s coding agent. Claude Code is built around the Model Context Protocol, or MCP, for those who enjoy acronyms.
The basic idea is that the AI agent has direct, structured access to your environment. Your codebase. Your project management tools. Your Linear tasks, your git history, your shell. The agent doesn’t live behind glass; it lives inside your workflow. When it needs to call a tool, it reaches out via MCP, and the tool responds. When it spins up subagents to handle parallel tasks, those agents share context through structured files. It’s deeply embedded, maybe uncomfortably, in your development environment. It knows things. It can do things. It is, in a sense, present.
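The embedded pattern described above can be sketched in miniature. This is not the real MCP SDK — the tool names, the `ToolRegistry` class, and the fake handlers are all invented for illustration — but it shows the shape of the idea: the agent sees a registry of named tools, calls them by name with structured arguments, and the harness (not the model) owns dispatch.

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical toy illustration of an embedded tool-calling harness.
# Not the real MCP SDK; every name here is made up for the example.

@dataclass
class Tool:
    name: str
    description: str
    handler: Callable[..., Any]

class ToolRegistry:
    """The harness side: holds tools and routes structured calls to them."""

    def __init__(self) -> None:
        self._tools: dict[str, Tool] = {}

    def register(self, tool: Tool) -> None:
        self._tools[tool.name] = tool

    def call(self, name: str, **kwargs: Any) -> Any:
        # Dispatch and error handling live in the harness, not the model.
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name].handler(**kwargs)

# Fake "environment" tools standing in for git, issue trackers, etc.
registry = ToolRegistry()
registry.register(Tool("git_log", "Most recent commit messages",
                       lambda n=3: ["fix: null check", "feat: add cache", "docs"][:n]))
registry.register(Tool("read_ticket", "Fetch a (fake) issue-tracker ticket",
                       lambda ticket_id: {"id": ticket_id, "status": "open"}))

print(registry.call("git_log", n=2))
print(registry.call("read_ticket", ticket_id="ENG-123"))
```

The point of the sketch is the architecture, not the code: the model only ever emits "call `git_log` with `n=2`", and everything else — what tools exist, how they run, what they return — belongs to the harness.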
On the other side, you have OpenAI’s Codex — a coding agent that takes a fundamentally different architectural stance. Codex lives in a sandbox. It doesn’t have native hooks into your calendar, project board, Slack, or IDE preferences. It has access to your codebase, yes, but that’s essentially all that exists for it. The world outside the code doesn’t register. It’s like a very smart developer who works only in a soundproof room with no phone, and has their assignments slid under the door.
Both approaches are coherent. Both have real advantages. The sandbox model has excellent isolation properties — the agent can’t accidentally delete something it shouldn’t, can’t get confused by the noise of the broader environment, and can operate cleanly on a well-defined task.
The embedded model, meanwhile, can do things the sandboxed model genuinely cannot: it can update a ticket in Linear after fixing the bug it describes, check your project context before proposing a deployment window, and coordinate across the full surface area of your professional life. Neither of these is obviously, inarguably better. They reflect different values — the sandboxed model values focus and safety; the embedded model values integration and capability. What’s important is that they are structurally different, and structural differences compound.

The lesson of 2010 and how it applies to AI tools today
Which brings us to 2010. Or more precisely, to the lesson of 2010, which most people learned only in retrospect. In 2010, the cloud computing wars were getting interesting. AWS had been around since 2006, Azure launched in 2010, and Google’s cloud offerings were beginning to coalesce. To most observers — to the CTO of a mid-sized company, say, or the business-page reporter writing about “computing trends” — these services all looked roughly similar. They provided compute. They provided storage. You now run your servers somewhere else. Great. Progress.
What people missed was that the interfaces were diverging. Not in dramatic, obvious ways, but in hundreds of small, sticky ways. AWS was building a specific model for how you’d build applications in their environment. Azure had a different model, shaped by Microsoft’s enterprise DNA, its Active Directory roots, and its relationship with Windows workloads. Google Cloud would eventually develop its own philosophy, centered on data pipelines and Kubernetes, and on how Google actually runs its own infrastructure.
The underlying resource — compute and storage — was functionally equivalent. But the harnesses for that compute were becoming ecosystems. And ecosystems create lock-in not through malice but through depth. You don’t leave AWS because you’re trapped. You don’t leave because you’ve built too many things that expect AWS to work the way AWS works.
By 2018, 2019, you’d talk to an engineering team, and they’d say “We’re an AWS shop” or “We’re Azure,” and that wasn’t just a vendor preference. That was an entire architectural worldview, a set of hiring patterns, a way of approaching problems. The cloud layer had become load-bearing within the organization’s culture.
Here we are again. Except that instead of 2010, it’s 2025; instead of cloud providers, it’s AI agent harnesses; and instead of taking eight years for the lock-in to become visible, my guess is it’ll take about three.
Architecture becomes identity
Consider what it means, practically, to be deeply embedded in the Claude Code ecosystem. You’ve built MCP servers for your internal tools. Your project management workflow is integrated. Your AI-generated specs feed directly into AI-executed code tasks. Your custom orchestration layer is tuned to expect a particular model of how agents communicate with each other, how they hand off context, and how they validate their own work before committing.
Now someone says, “Hey, let’s try Codex, I heard it’s good.” And you say: “Sure.” And then you spend a week trying to recreate the integration surface that took you three months to build in the Claude environment, because the tool-calling model is different, because there’s no MCP, because you’d need to build an adapter layer to bridge between the two systems, and at some point someone in the meeting says “why are we doing this again?” and the answer is “I don’t know, someone saw a benchmark.” The benchmark. Right. The 42% versus 78%. Which model are we evaluating, exactly?
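What that week of adapter-building looks like can be sketched abstractly. Both “protocols” below are invented for illustration — harness A passes a dict of `{"tool": ..., "args": ...}`, while harness B expects a name plus a JSON payload — but the shape is the real cost: every call convention, context format, and validation hook needs a translation shim like this one.

```python
import json

# Hypothetical sketch of an adapter layer bridging two tool-calling
# conventions. Both conventions are invented for illustration.

def harness_b_call(name: str, payload: str) -> str:
    # Stand-in for the other ecosystem's entry point, which takes a
    # tool name and a JSON string rather than a structured dict.
    args = json.loads(payload)
    return f"{name} ran with {sorted(args)}"

def adapt_a_to_b(request: dict) -> str:
    """Translate a harness-A style request into a harness-B call."""
    return harness_b_call(request["tool"], json.dumps(request["args"]))

print(adapt_a_to_b({"tool": "run_tests", "args": {"path": "src/", "verbose": True}}))
```

One shim is trivial. The problem is that a deeply integrated setup has dozens of them, each one a place where the two ecosystems’ assumptions quietly disagree.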
Here’s the uncomfortable truth that the benchmark reveals: we are no longer evaluating models. We are evaluating systems. We are evaluating the entire stack — the intelligence layer, the harness layer, the integration layer, the context management layer, and the way tasks are decomposed, delegated, and validated. When you run Claude in Claude’s harness, you get the full system. When you run the same model in a harness that wasn’t designed for it, you get less. Sometimes a lot less. The contractor with the blunt chisel.
This is not a criticism of any particular model or company. It’s just a description of how complex systems work. The intelligence of a system is not located in any single component. It’s distributed across the whole architecture. We knew this about organizations — a brilliant CEO in a dysfunctional company often produces less than a mediocre CEO in a well-designed one.
We knew it about software — a great algorithm running on a poorly designed database produces worse results than a decent algorithm on a well-optimized one. We just keep forgetting it about AI, where the narrative gravity of “which model is smartest” pulls our attention toward the wrong question.

When tools become culture
There’s a developer somewhere — let’s call him Tobias, because every cautionary tale needs a Tobias — who is right now building an extremely sophisticated tool-calling architecture around a single harness philosophy, without quite realizing that he’s not just building a product; he’s committing to an ecosystem. Tobias isn’t doing anything wrong.
Tobias is being rational. He’s using the best tools available, integrating them as deeply as makes sense, optimizing the system he has. But in three years, Tobias will think of himself as working in a particular way, expecting particular affordances, finding other approaches oddly limiting. Tobias will have become, culturally, an embedded-agent developer or a sandbox developer, and that identity will shape what he builds and how he builds it.
Meanwhile, a product manager named Ines is explaining to her team why their AI coding agent can do things that the competitor’s can’t, and the answer she keeps giving is “it’s smarter,” because “our harness was designed with direct MCP integration and native tool-calling” doesn’t fit on a slide. The harness is invisible in the conversation, doing enormous amounts of work, receiving none of the credit.
The system is the intelligence
This is the part where I’m supposed to tell you what to do about all this. Pick the right harness! Invest in the ecosystem early! Future-proof your architecture! But honestly, the more interesting action is just to notice it. Notice that the performance gap you’re seeing between AI tools isn’t primarily about the model’s raw intelligence — it’s about how well the whole system is designed to let that intelligence operate.
Notice that the integration depth you build today becomes tomorrow’s switching cost. Notice that when you hear someone say “we’re a Claude shop” or “we use Codex for everything,” they’re not just describing a tool preference. They’re describing an architecture and, behind that architecture, a set of bets about how AI-assisted development will work.
The 2010s cloud wars taught us that the compute layer gets commoditized. The integration layer, the ecosystem layer, the harness — that’s where value accretes, and that’s where lock-in lives. We are, unmistakably, early in that same story. The benchmarks will keep coming. The press releases will keep arriving.
The “which model is smartest” conversation will continue to dominate the discourse. But underneath it all, quietly, the harnesses are diverging. And that divergence is going to matter enormously — to Tobias and Ines and to the rest of us — for a very long time. The author is aware that this article was likely organized using an AI task-routing agent. The irony was not lost on him.