One of the headline lines on the ClickEye site is “same AI, different results.” It reads like marketing copy, but published industry data backs it up. With identical model weights, identical datasets, and identical questions, how you design the execution environment around the model can shift accuracy by 15 to 25 percentage points. This post traces, using primary sources only, how that “environment design” became the core asset of the AI industry in 2024-2025. ClickEye's tagline “Execution by Experience” is not an abstraction; this is what it points to.
1. Same AI, same question, different score
In June 2023 HuggingFace's evaluation engineers published a table. The same LLaMA-65B model, the same MMLU dataset (a standard U.S. exam-style benchmark), scored by three different evaluation libraries — 0.637, 0.636, and 0.488.[1] A 15-point gap. Same model, same questions. The only difference was how the answers were scored. One library compared the probabilities of the A/B/C/D choice letters only. Another used the first word the model actually generated. A third summed the likelihood of the full answer text. Three different graders, three different scores.
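To make the three grading styles concrete, here is a minimal sketch of how one model output can be marked right or wrong depending on the grader. Every number, the question, and the function names are invented for illustration; none of this is the code of the libraries HuggingFace compared.

```python
# Hypothetical data for ONE multiple-choice question whose correct answer is "B".
GOLD = "B"

# Log-probabilities the model assigns to each choice letter (invented numbers).
letter_logprobs = {"A": -1.2, "B": -0.9, "C": -1.5, "D": -2.0}

# What the same model writes when asked to answer freely (invented text).
generated_text = "C. Because the reaction is exothermic."

# Per-token log-probabilities of each full answer text (invented numbers).
answer_token_logprobs = {
    "A": [-0.2, -0.3],
    "B": [-0.9, -1.1, -0.8, -0.7],
    "C": [-0.5, -0.4, -0.6],
    "D": [-1.0, -1.2, -0.9],
}


def grade_by_letter(logprobs):
    """Grader 1: pick the choice letter with the highest probability."""
    return max(logprobs, key=logprobs.get)


def grade_by_first_token(text):
    """Grader 2: take the first character the model actually generated."""
    first = text.strip()[:1].upper()
    return first if first in {"A", "B", "C", "D"} else None


def grade_by_full_answer(token_logprobs):
    """Grader 3: sum the log-likelihood of each full answer text."""
    totals = {choice: sum(lps) for choice, lps in token_logprobs.items()}
    return max(totals, key=totals.get)


verdicts = {
    "letter": grade_by_letter(letter_logprobs) == GOLD,
    "first_token": grade_by_first_token(generated_text) == GOLD,
    "full_answer": grade_by_full_answer(answer_token_logprobs) == GOLD,
}
```

With these invented numbers only the letter-probability grader marks the answer correct. Note that summing log-likelihoods favors shorter answer texts, which is one of the implementation details that can move a score.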
HuggingFace's conclusion is one sentence:
“Evaluations are strongly tied to their implementations — down to minute details such as prompts and tokenization. The mere indication of ‘MMLU results’ gives you little to no information about how you can compare these numbers to others.”[1]
The implication is clear. Model selection is just a starting point; the real difference is made by the design of the environment around the model. The industry calls this environment the harness — the evaluation and execution tooling that wraps the model the way a harness wraps a horse.
2. Same weights — 49% becomes 74%
In November 2025 Anthropic took the point further in an engineering post titled Advanced Tool Use.[2] Without touching a single weight in Claude Opus 4, simply enabling a feature that optimizes how tool definitions are presented (Tool Search) lifted tool-use accuracy from 49% to 74%. For the newer Opus 4.5 it went from 79.5% to 88.1%. Same model, same task. Only the environment changed.
One more striking number from the same post: Anthropic reports tool definitions consuming 134,000 tokens before optimization, and an 85% reduction in token usage with Tool Search. Just tidying up how tools are described to the model changes the model's behavior. No training. Only environment design.
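The underlying idea, loading only the tool definitions relevant to the current request instead of all of them, can be sketched in a few lines. This is a toy keyword search and not Anthropic's Tool Search feature or its API; the tool names, descriptions, and the word-count token estimate are all assumptions made for illustration.

```python
# Hypothetical tool registry (names and descriptions are invented).
TOOLS = {
    "calendar_create_event": "Create a calendar event with attendees and a time slot",
    "calendar_list_events": "List calendar events in a date range",
    "crm_find_customer": "Find a customer record by name or email",
    "billing_issue_refund": "Issue a refund for a paid invoice",
}


def estimate_tokens(text):
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def search_tools(query, tools, limit=2):
    """Return up to `limit` tool names ranked by word overlap with the query."""
    words = set(query.lower().split())
    scored = []
    for name, description in tools.items():
        haystack = set((name.replace("_", " ") + " " + description).lower().split())
        overlap = len(words & haystack)
        if overlap:
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:limit]]


# Cost of putting every definition in the context versus only the matches.
all_tokens = sum(estimate_tokens(d) for d in TOOLS.values())
selected = search_tools("refund an invoice", TOOLS)
loaded_tokens = sum(estimate_tokens(TOOLS[name]) for name in selected)
```

Here only the refund tool is loaded, so the context carries a fraction of the definition text. A production system would use a real tokenizer and a better retrieval method than word overlap.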
3. The doctrine the AI industry established in 2024-2025
How this finding became standard practice is well documented. Four Anthropic posts from December 2024 through 2025 form the backbone of the doctrine.
① “Building Effective Agents” (December 2024) — the industry reference
This post became the industry's reference text.[3] The architectural distinction is in one sentence: workflows orchestrate LLMs and tools along predetermined code paths; agents let the LLM dynamically direct its own process and tool use. Then three principles: simplicity, transparency, and agent-computer interface (ACI) design. Tool documentation and testing are elevated to first-class engineering concerns, on par with the model itself. The discipline is stated in one line: “Start with simple prompts, optimize them with comprehensive evaluation, and add multi-step agentic systems only when simpler solutions fall short.”
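The workflow-versus-agent distinction is easier to see side by side. The sketch below uses two invented tools and a stub in place of a real LLM call, so `stub_model`, `search_docs`, and `summarize` are all hypothetical; only the control-flow difference is the point.

```python
from typing import Callable, Dict, List


# Hypothetical tools; real ones would call external systems.
def search_docs(query: str) -> str:
    return f"docs about {query}"


def summarize(text: str) -> str:
    return f"summary of [{text}]"


TOOLS: Dict[str, Callable[[str], str]] = {
    "search_docs": search_docs,
    "summarize": summarize,
}


def run_workflow(question: str) -> str:
    """Workflow: the code fixes the order of steps in advance."""
    return summarize(search_docs(question))


def stub_model(history: List[str]) -> str:
    """Stand-in for an LLM call that names the next tool or says 'done'."""
    if len(history) == 1:
        return "search_docs"
    if len(history) == 2:
        return "summarize"
    return "done"


def run_agent(question: str, model=stub_model, max_steps: int = 5) -> str:
    """Agent: the model decides which tool runs next, within a step budget."""
    history = [question]
    for _ in range(max_steps):
        choice = model(history)
        if choice == "done":
            break
        history.append(TOOLS[choice](history[-1]))
    return history[-1]
```

With this stub both paths produce the same result, which matches the post's advice: if a fixed code path is enough, the agent loop adds cost without adding capability.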
② “Effective Harnesses for Long-Running Agents” (2025) — scaffolding for long horizons
The 2025 series turns the doctrine into concrete patterns.[4] In the first session, an “initializer agent” creates an environment-setup script, a progress log file, and an initial git commit. Every subsequent session makes incremental progress and writes structured updates. Anthropic states it directly:
“Even a frontier coding model like Opus 4.5 running on the Claude Agent SDK in a loop across multiple context windows will fall short of building a production-quality web app if it's only given a high-level prompt... compaction isn't sufficient.”[4]
Even the strongest frontier model needs environment scaffolding for long-running work — admitted by the model vendor itself.
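A minimal sketch of that scaffolding, assuming a JSON progress file: one initializer run writes the state, and each later session reads it, does one unit of work, and writes a structured update. The file name, the state layout, and the function names are hypothetical, and the git commit and setup script described in the post are left out to keep the example self-contained.

```python
import json
import tempfile
from pathlib import Path

PROGRESS_FILE = "progress.json"  # hypothetical file name


def initialize(workdir: Path, features: list) -> None:
    """First session: record the plan so later sessions can resume from it."""
    state = {"todo": list(features), "done": [], "log": ["initialized"]}
    (workdir / PROGRESS_FILE).write_text(json.dumps(state, indent=2))


def run_session(workdir: Path) -> bool:
    """Later session: load state, finish one item, persist a structured update.

    Returns False when nothing is left to do.
    """
    path = workdir / PROGRESS_FILE
    state = json.loads(path.read_text())
    if not state["todo"]:
        return False
    feature = state["todo"].pop(0)
    # Placeholder for the real work an agent would do on this feature.
    state["done"].append(feature)
    state["log"].append(f"completed {feature}")
    path.write_text(json.dumps(state, indent=2))
    return True


def demo():
    """Run the initializer once, then sessions until the plan is exhausted."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        initialize(workdir, ["login", "search"])
        sessions = 0
        while run_session(workdir):
            sessions += 1
        final_state = json.loads((workdir / PROGRESS_FILE).read_text())
        return sessions, final_state
```

Because each session starts from the file rather than from memory, a fresh context window can pick up where the previous one stopped.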
③ “Writing Tools for Agents” (September 2025) — tools are contracts
This post treats tool design as an engineering discipline.[5] The key definition: “Tools are a new kind of software which reflects a contract between deterministic systems and non-deterministic agents.” The recommended pattern is concrete: instead of shipping separate list_users, list_events, and create_event tools, ship a single schedule_event. The stated goal is “a few thoughtful tools targeting specific high-impact workflows.” Claude Code's default ceiling on tool responses is also published explicitly: 25,000 tokens.
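Both ideas, one consolidated tool and a ceiling on response size, can be sketched together. The availability data, the function signatures, and the word-count approximation of tokens are assumptions for illustration; only the tool name `schedule_event` and the 25,000 figure come from the text above.

```python
MAX_RESPONSE_TOKENS = 25_000  # default ceiling cited above

# Hypothetical availability data: free slots per user.
AVAILABILITY = {
    "ana": ["09:00", "10:00"],
    "ben": ["10:00", "11:00"],
    "cho": ["13:00"],
}


def cap_response(text: str, limit: int = MAX_RESPONSE_TOKENS) -> str:
    """Truncate a tool response, approximating tokens by whitespace-split words."""
    words = text.split()
    if len(words) <= limit:
        return text
    return " ".join(words[:limit]) + " [truncated]"


def schedule_event(attendees: list, title: str) -> str:
    """One tool that looks up users, finds a shared slot, and books it.

    Replaces a list-users / list-events / create-event round trip.
    """
    common = set(AVAILABILITY[attendees[0]])
    for person in attendees[1:]:
        common &= set(AVAILABILITY[person])
    names = ", ".join(attendees)
    if not common:
        return cap_response(f"No common slot for {names}")
    slot = min(common)  # earliest shared slot; HH:MM strings sort correctly
    return cap_response(f"Booked '{title}' at {slot} for {names}")
```

The agent makes one call and receives one short, capped answer, instead of pulling three raw listings into its context and reconciling them itself.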
④ Agent Skills — progressive disclosure (2025)
Agent Skills uses three-level information exposure — at startup only metadata for every skill loads; on relevance the full body loads; deeper files are navigated only as needed.[6] One design choice is justified directly: “Many applications require the deterministic reliability that only code can provide.”
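The three levels can be sketched as a small data structure. The skill names, their contents, and the naive relevance check are invented for illustration and are not the Agent Skills file format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Skill:
    name: str
    description: str                 # level 1: always loaded at startup
    body: str                        # level 2: loaded when the skill is relevant
    files: Dict[str, str] = field(default_factory=dict)  # level 3: opened on demand


# Hypothetical skills.
SKILLS: List[Skill] = [
    Skill(
        name="pdf-forms",
        description="Fill in PDF forms",
        body="Read the form fields, map the inputs, write the filled file.",
        files={"fields.md": "Field naming rules for government forms."},
    ),
    Skill(
        name="sql-report",
        description="Build a report from SQL queries",
        body="Draft the query, run it read-only, format the rows.",
    ),
]


def startup_context(skills: List[Skill]) -> List[str]:
    """Level 1: only name and description enter the context."""
    return [f"{s.name}: {s.description}" for s in skills]


def load_body(skills: List[Skill], task: str) -> Optional[str]:
    """Level 2: load the full body of the first skill whose name matches the task."""
    lowered = task.lower()
    for skill in skills:
        if any(word in lowered for word in skill.name.split("-")):
            return skill.body
    return None


def open_file(skill: Skill, filename: str) -> str:
    """Level 3: read a deeper file only when the work calls for it."""
    return skill.files[filename]
```

At startup the context holds two short lines; the body and the extra file cost tokens only when a task actually reaches them.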
4. Harness as product — the Claude Agent SDK
On September 29, 2025, alongside Sonnet 4.5, Anthropic released the Claude Agent SDK.[7] One sentence on the launch page reveals what the release actually is:
“The Claude Agent SDK is the same infrastructure that powers Claude Code, but it shows impressive benefits for a very wide variety of tasks, not just coding.”[7]
The engineering post compresses the core into a four-step loop: gather context → take action → verify work → repeat. Design philosophy: “Give your agents a computer, allowing them to work like humans do.”[8] The rename from Claude Code SDK to Claude Agent SDK is the message itself — Anthropic shipped the environment infrastructure that runs its own production agent as a product for outside developers. The harness has become the product.
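The four-step loop can be written down as a skeleton. This sketch is not the Claude Agent SDK API; the three step functions are hypothetical placeholders, and the task (counting up to a target) is deliberately trivial so the control flow stays visible.

```python
from typing import Optional


def gather_context(state: dict) -> dict:
    """Step 1: collect what the agent needs to decide its next move."""
    return {"target": state["target"], "current": state["value"]}


def take_action(context: dict) -> int:
    """Step 2: do one unit of work (here: move one step closer)."""
    return context["current"] + 1


def verify_work(context: dict, result: int) -> bool:
    """Step 3: check the result against the goal instead of trusting it."""
    return result == context["target"]


def agent_loop(target: int, max_iterations: int = 10) -> Optional[int]:
    """Step 4: repeat until verification passes or the budget runs out.

    Returns the number of iterations used, or None if the budget was exhausted.
    """
    state = {"target": target, "value": 0}
    for iteration in range(1, max_iterations + 1):
        context = gather_context(state)
        state["value"] = take_action(context)
        if verify_work(context, state["value"]):
            return iteration
    return None
```

The explicit verification step and the iteration budget are what turn a model call into a harness: the loop decides when the work is done, not the model's own claim.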
5. From 1.96% to 82% in under two years
The output of this doctrine is most visible in coding-agent evaluation.
When the SWE-bench paper (a benchmark that asks language models to resolve real GitHub issues) was published in 2023, the best model, Claude 2, solved 1.96% of issues.[9] In August 2024 OpenAI and Princeton released SWE-bench Verified, a human-validated 500-instance subset.[10] In September 2025 Anthropic's Claude Sonnet 4.5 scored 77.2% on that verified set (10-run average), and 82.0% in high-compute mode.[7] Four months earlier, the same series' Sonnet 4 was at 72.7%. In under two years the same kind of evaluation went from 1.96% to 82%.
The launch page also markets directly: “We've observed it maintaining focus for more than 30 hours on complex, multi-step tasks.”[7] On OSWorld, a benchmark measuring real computer-use, Sonnet 4.5 reached 61.4% — up from the same series' 42.2% four months earlier.
Model weights clearly improved. But given that environment design alone moves 49% to 74% on identical weights, attributing that roughly 80-point jump to weights alone is hard to defend. A large share belongs to the environment: the harness.
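The gaps quoted throughout this post are plain differences in percentage points. A quick check, using only figures already cited above (the dictionary labels are ours):

```python
# (before, after) pairs in percent, as quoted in this post.
FIGURES = {
    "MMLU, LLaMA-65B, lowest vs highest grader": (48.8, 63.7),
    "Tool Search, Opus 4": (49.0, 74.0),
    "Tool Search, Opus 4.5": (79.5, 88.1),
    "SWE-bench 2023 vs Verified high-compute 2025": (1.96, 82.0),
}

# Difference in percentage points, rounded to hide floating-point noise.
POINT_GAPS = {
    label: round(after - before, 2) for label, (before, after) in FIGURES.items()
}
```

The first two gaps come out at 14.9 and 25.0 points on identical weights, which is where the 15-to-25-point range comes from; the last is 80.04 points across different models and harnesses.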
6. The UK government adopted it as a national standard
That this is more than model-vendor marketing is confirmed at the government level. Inspect is an open-source evaluation framework (MIT, released May 2024) co-developed by the UK government's AI Security Institute (AISI) and Meridian Labs. It ships with 200+ pre-built evaluations, ReAct and Deep Agent patterns, and integrations with external agents like Claude Code, Codex CLI, and Gemini CLI.[11] The UK AISI's official Autonomous Systems Evaluation Standard, dated October 31, 2024, states it in one line: “All evaluations must be built using Inspect.”[12] A national safety body mandates an evaluation framework for its own work and open-sources that exact tool. Harness design is now a matter of public standards, not only vendor practice.
7. How ClickEye brings the doctrine to Korea and Southeast Asia
Three of ClickEye's headline messages — “same AI, different results,” “AI drafts, experts verify,” and “reused verified workflows” — are this doctrine carried into the local market.
- Same AI, different results — the environment makes the outcome. Consolidate tool definitions (one scheduler instead of three), set ceilings on tool-response tokens, lay down progress-log and initialization scaffolding for long-running tasks from day one. The patterns Anthropic published become standard practice on the first commit.
- AI drafts, experts verify — human-in-the-loop isn't about reducing humans, it's about placing them precisely. Every code change passes through a code-specialist AI (OpenAI's Codex); domain work is delegated to the strongest model per domain (Anthropic's Claude Opus for architecture and databases); the final gate is held by an experienced human leader.
- Reused verified workflows — the harness itself is the asset. The promises (the 18 principles a team follows), the role definitions (who reviews what, when), and the automation flows are bound into one system that copies forward to the first commit of the next project.
For how this doctrine actually operates on a real build, we wrote a companion post on 60 days inside ClickEye's internal project “Hawkeye”: “Putting an AI in the project-manager seat.”
8. Closing
In December 2024 an AI lab declared the doctrine in a single post. In September 2025 it shipped that doctrine as a product, the Claude Agent SDK. In the same period a government adopted the framework as its national standard. On coding-agent evaluation, results moved from 1.96% to 82% in under two years. AI is no longer a demo. Environment design has become part of the model, the tool surface, and the evaluation.
ClickEye brings this doctrine to Korea and Southeast Asia as an execution partner. If you want AI not merely adopted but delivered with the environment designed in from day one, get in touch.
References
[1] Fourrier, C. et al. (June 2023). What's going on with the Open LLM Leaderboard? HuggingFace. huggingface.co/blog/open-llm-leaderboard-mmlu
[2] Anthropic (Nov 2025). Advanced Tool Use. Tool Search Opus 4 49→74%, Opus 4.5 79.5→88.1%, 134K tool-def tokens → 85% reduction. anthropic.com/engineering/advanced-tool-use
[3] Anthropic (Dec 2024). Building Effective Agents. Workflows-vs-agents; Simplicity / Transparency / ACI principles. anthropic.com/research/building-effective-agents
[4] Anthropic (2025). Effective Harnesses for Long-Running Agents. anthropic.com/engineering/effective-harnesses-for-long-running-agents
[5] Anthropic (Sept 11, 2025). Writing Tools for Agents. “Contract between deterministic systems and non-deterministic agents” + 25K token default cap. anthropic.com/engineering/writing-tools-for-agents
[6] Anthropic (2025). Equipping Agents for the Real World with Agent Skills. Three-level progressive disclosure. anthropic.com/engineering/equipping-agents-for-the-real-world-with-agent-skills
[7] Anthropic (Sept 29, 2025). Introducing Claude Sonnet 4.5. SWE-bench Verified 77.2% / high-compute 82.0%, OSWorld 61.4%, “30+ hours” autonomous coding, Claude Agent SDK launch. anthropic.com/news/claude-sonnet-4-5
[8] Anthropic (2025). Building Agents with the Claude Agent SDK. gather context → take action → verify work → repeat. anthropic.com/engineering/building-agents-with-the-claude-agent-sdk
[9] Jimenez, C. E. et al. (2023). SWE-bench: Can Language Models Resolve Real-World GitHub Issues? arXiv:2310.06770. Claude 2 launch resolve rate 1.96%. arxiv.org/abs/2310.06770
[10] OpenAI & Princeton (Aug 13, 2024). Introducing SWE-bench Verified. 500-instance human-validated subset, 93 contracted developers. openai.com/index/introducing-swe-bench-verified
[11] UK AI Security Institute & Meridian Labs. Inspect AI (MIT, May 2024–). inspect.aisi.org.uk
[12] UK AISI (Oct 31, 2024). Autonomous Systems Evaluation Standard. “All evaluations must be built using Inspect.” ukgovernmentbeis.github.io/as-evaluation-standard