On May 8, 2024, developer and writer Simon Willison pinned a single word to the industry conversation: AI slop. The definition is tight: content that is (1) artificially generated without careful review and (2) pushed onto an audience that did not ask for it.[1] A year and a half later, Merriam-Webster named “slop” its 2025 Word of the Year.[2] When an industry gives a phenomenon a name, that is itself a signal. This post lays out the data behind that signal and shows why ClickEye encodes a multi-stage verification structure as a directory from the first commit.
1. The definition — the line is not "did you use AI?" but "was there review?"
Willison's most important line is this:
“Sharing unreviewed content that has been artificially generated with other people is rude.”[1]
He states the corollary in the same post: “Not all AI-generated content is slop.”[1]
The dividing line is not whether AI was used. It is where review and accountability sit. AI output that passes through proper verification and human judgment is the output of a tool. The same output pushed forward unreviewed — into a code base, a search result, an operations dashboard — becomes slop. That line is not an aesthetic point; the 2024-2025 data shows it is a measurable industry cost.
2. The industry has begun to measure the cost of slop
Code — the ratio curl reported
Daniel Stenberg, maintainer of the open-source HTTP library curl, reported that as of 2025 around 20% of incoming security submissions are AI slop, while only ~5% are real vulnerabilities. Each false report consumes 30 minutes to several hours from three to four maintainers.[3] Unreviewed AI output has begun to eat into the cost base of trust infrastructure — open-source security disclosure channels in this case.
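To make that ratio concrete, here is a back-of-the-envelope sketch. The 20% share and the "three to four maintainers, 30 minutes to several hours" range come from the report above. The volume of 100 submissions, the reading of "several hours" as 3 hours, and the assumption that every maintainer spends the full time are illustrative assumptions, not curl data.

```python
def triage_hours(submissions, slop_share, maintainers, hours_each):
    """Maintainer-hours spent on slop reports under the stated assumptions."""
    slop_reports = submissions * slop_share
    return slop_reports * maintainers * hours_each

# Hypothetical volume of 100 submissions (not a curl figure).
low = triage_hours(100, 0.20, maintainers=3, hours_each=0.5)
high = triage_hours(100, 0.20, maintainers=4, hours_each=3.0)
print(f"{low:.0f} to {high:.0f} maintainer-hours per 100 submissions")
```

Under those assumptions the range is 30 to 240 maintainer-hours per 100 submissions, all of it spent on reports that contain no vulnerability.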
Stack Overflow saw the pattern back in December 2022
Within a week of ChatGPT's launch, Stack Overflow temporarily banned ChatGPT-generated answers. The announcement, verbatim:
“The average rate of getting correct answers from ChatGPT is too low... the primary problem is that while the answers which ChatGPT produces have a high rate of being incorrect, they typically look like they might be good.”[4]
“Looks plausible but is wrong at a high rate” — one sentence captures the common pattern under every slop case the industry has seen since. Without a verification layer, plausible-looking output passes straight through.
Package hallucination — academic measurement
A USENIX Security 2025 paper by Spracklen et al. quantifies the next failure mode. When LLMs generate code, they sometimes recommend package names that do not exist. The measured rates: commercial LLMs 5.2%, open-source LLMs 21.7%. Across 576,000 samples, 205,474 unique fake package names were extracted.[5] What makes this more than a statistic is that anyone can register those names.
Lasso Security researchers registered huggingface-cli, one of the most-hallucinated package names, on PyPI as a proof of concept. Within three months it received over 30,000 downloads and was referenced by multiple companies and projects, including Alibaba.[6] This opens a new attack category called slopsquatting: attackers pre-register the package names that AIs hallucinate, and the supply chain is poisoned. That unreviewed adoption can turn into a security incident is now documented, not hypothetical.
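One way to put a review step in front of AI-suggested dependencies is to compare them against a list that a person has already vetted. The sketch below is a minimal illustration, not a ClickEye tool: the allowlist contents and the function names are hypothetical. It deliberately checks against an allowlist rather than asking whether the name exists on the package index, because a squatted name does exist there.

```python
import re

# Hypothetical allowlist of packages a human has already vetted.
VETTED_PACKAGES = {"requests", "numpy", "huggingface-hub"}

def normalize(name: str) -> str:
    """Normalize a distribution name as described in PEP 503."""
    return re.sub(r"[-_.]+", "-", name).lower()

def unvetted(requirements: list[str], vetted: set[str]) -> list[str]:
    """Return requirement names that are not on the vetted allowlist."""
    allowed = {normalize(p) for p in vetted}
    flagged = []
    for line in requirements:
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        # Keep only the name before any version specifier, extra, or marker.
        name = re.split(r"[\s<>=!~;\[]", line, maxsplit=1)[0]
        if normalize(name) not in allowed:
            flagged.append(name)
    return flagged

ai_suggested = ["requests>=2.31", "huggingface-cli", "NumPy==1.26.4"]
print(unvetted(ai_suggested, VETTED_PACKAGES))  # ['huggingface-cli']
```

Anything the function flags goes to a person before it is installed, which is the review seat that slopsquatting relies on being empty.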
Copilot security
NYU researchers Pearce et al. evaluated 1,689 programs across 89 CWE scenarios and found roughly 40% of GitHub Copilot-generated code contained security vulnerabilities.[7] That is not a defect in Copilot itself — it is the baseline cost of accepting AI-generated code without security review.
The code base itself is changing — GitClear's 211M-line study
GitClear's 2025 analysis of 211 million lines of code shows a structural shift in the codebase that lines up temporally with AI assistant adoption. Refactoring rates collapsed from 25% in 2021 to under 10% in 2024, while copy-paste clone rates rose from 8.3% to 12.3% over the same window.[8] AI assistants generate quickly, but when the review step that cleans up afterward is skipped, the codebase slowly rots. The cost of speed gets paid later, in maintenance.
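For intuition about what a copy-paste clone rate measures, here is a simplified sketch. It is not GitClear's methodology: it just reports the share of non-blank lines that sit inside a block of consecutive lines appearing more than once. The window size of 3 and the function name are assumptions.

```python
from collections import Counter

def duplicated_line_share(lines, window=3):
    """Share of non-blank lines inside a `window`-line block that repeats."""
    normalized = [ln.strip() for ln in lines if ln.strip()]
    if len(normalized) < window:
        return 0.0
    windows = [
        tuple(normalized[i:i + window])
        for i in range(len(normalized) - window + 1)
    ]
    counts = Counter(windows)
    duplicated = set()
    for start, block in enumerate(windows):
        if counts[block] > 1:
            # Mark every line index covered by a repeated block.
            duplicated.update(range(start, start + window))
    return len(duplicated) / len(normalized)

sample = ["a = 1", "b = 2", "c = a + b", "print(c)",
          "a = 1", "b = 2", "c = a + b", "log(c)"]
print(duplicated_line_share(sample))  # 0.75
```

Tracking a number like this over time is one way to see whether the cleanup step after generation is actually happening.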
3. The dividing line is the review seat
All five data points point the same way. The problem isn't that AI was used. It is that AI output reached production, the codebase, or the operations floor without passing through a review-and-accountability layer. Willison's definition pins exactly that line: the unreviewed seat is where slop is made.
This conclusion is the industry evidence behind one of ClickEye's headline messages — “AI drafts, experts verify” (human-in-the-loop). It is why we encode a multi-stage verification structure as a directory from day one.
4. Where ClickEye's multi-stage verification actually sits
Review is split into layers rather than concentrated in one seat, and every layer except the final human gate is automated. Before any AI output reaches production, it has to pass through:
- The PM AI (Gemini) classifying every task by tier — every ticket gets an automatic difficulty tier (1, 2, 3). Security-sensitive and database work are policy-forced to Tier 3 and routed to the strongest model (Claude Opus extended). The areas where ‘looks-plausible-but-wrong’ output is most dangerous (security, DB, cross-domain) automatically receive the deepest review.
- Mandatory code-review AI (Codex) on every code change — every proposed code change goes through the code-specialist AI before merge. This is the first filter for the “plausible but wrong” output Stack Overflow warned about in 2022.
- Domain-expert AI (Claude Opus) audit — architecture, database, site reliability, and security are audited by domain-expert AIs. A single model's hallucination is cross-checked by a different model from a different vantage point. (In our Hawkeye case study, this is exactly the seat where a platform-expert audit found five gaps a week after design — see the companion post.)
- Design-first + design-and-code-bundled-together — no code goes in without a spec, and the spec lives in the same code change as the code itself. The drift between intent and implementation — fertile ground for hallucination — is structurally blocked.
- Human leader's final merge gate — even after Codex review and Opus audit, no merge happens without a human sanity check. The last place where “unreviewed content” could reach the outside is held by a person.
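The tier rule in the first layer can be sketched as a small policy function. This is an illustration under assumptions, not ClickEye's implementation: the `Ticket` type, the area labels, and the function name are hypothetical. The only rule taken from the text is that security-sensitive and database work is forced to Tier 3 regardless of what the classifier proposes.

```python
from dataclasses import dataclass

# Hypothetical area labels; the forced-Tier-3 rule itself comes from the text.
FORCED_TIER3_AREAS = frozenset({"security", "database"})

@dataclass(frozen=True)
class Ticket:
    title: str
    areas: frozenset[str]
    proposed_tier: int  # tier suggested by the classifier: 1, 2, or 3

def assign_tier(ticket: Ticket) -> int:
    """Apply the policy override on top of the classifier's proposal."""
    if ticket.proposed_tier not in (1, 2, 3):
        raise ValueError(f"invalid tier: {ticket.proposed_tier}")
    # Policy wins over the classifier for sensitive areas.
    if ticket.areas & FORCED_TIER3_AREAS:
        return 3
    return ticket.proposed_tier

rename = Ticket("Rename button label", frozenset({"frontend"}), 1)
migration = Ticket("Add index to orders", frozenset({"database"}), 1)
print(assign_tier(rename), assign_tier(migration))  # 1 3
```

Keeping the override outside the classifier means a misjudged difficulty score cannot lower the review depth for sensitive work.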
The single purpose of these five layers is to block every path by which AI output reaches production without passing through review and accountability. The finding that multi-stage review outperforms single-pass review is also one of the most stable conclusions in software defect-detection research, going back to the Fagan inspections of the 1970s.
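Taken together, the layers reduce to one merge decision. The sketch below assumes a hypothetical `ChangeSet` record and a hypothetical `docs/specs/` location for specs; neither is taken from ClickEye's repository. It shows the shape of the rule: a change merges only when the spec travels with the code and every review seat has signed off.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ChangeSet:
    files: tuple[str, ...]
    code_review_passed: bool   # code-review AI stage
    expert_audit_passed: bool  # domain-expert AI audit stage
    human_approved: bool       # human leader's final gate

def blocking_reasons(change: ChangeSet, spec_prefix: str = "docs/specs/") -> list[str]:
    """List every reason the change may not merge; empty means it may."""
    reasons = []
    # Design-first: the spec must be part of the same change as the code.
    if not any(path.startswith(spec_prefix) for path in change.files):
        reasons.append("spec missing from change")
    if not change.code_review_passed:
        reasons.append("code review not passed")
    if not change.expert_audit_passed:
        reasons.append("expert audit not passed")
    # Every check runs, so automated passes never stand in for the human sign-off.
    if not change.human_approved:
        reasons.append("human approval missing")
    return reasons

change = ChangeSet(
    files=("docs/specs/login.md", "src/login.py"),
    code_review_passed=True,
    expert_audit_passed=True,
    human_approved=False,
)
print(blocking_reasons(change))  # ['human approval missing']
```

Returning the full list of reasons, rather than a single boolean, keeps each seat's accountability visible when a merge is refused.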
5. What ClickEye is committing to
The comparison copy on the ClickEye site (“uncertain, inconsistent” vs “production-ready delivery guaranteed”) is backed by exactly this structure. AI speed is a starting point; it becomes real value only when the seats that hold accountability for the output are encoded as policy. ClickEye encodes those seats in the project's .claude/ directory (18 team promises, 14 explicit role definitions, and automated invocation flows) and copies them into the first commit of every new project.
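Copying the policy directory into a new project can be as small as the sketch below. It assumes a template project that already contains a .claude/ directory. The function name and the file name used in the demo are hypothetical, and the real contents of ClickEye's directory are not shown here.

```python
import shutil
import tempfile
from pathlib import Path

def seed_review_policy(template_root: Path, new_project: Path) -> Path:
    """Copy the template's .claude/ directory into a new project."""
    source = template_root / ".claude"
    if not source.is_dir():
        raise FileNotFoundError(f"no .claude/ directory under {template_root}")
    target = new_project / ".claude"
    # copytree refuses to overwrite an existing directory by default,
    # so an existing policy is never silently replaced.
    shutil.copytree(source, target)
    return target

# Demo in a throwaway directory; the file name below is made up.
with tempfile.TemporaryDirectory() as tmp:
    template = Path(tmp) / "template"
    (template / ".claude").mkdir(parents=True)
    (template / ".claude" / "roles.md").write_text("placeholder\n")
    project = Path(tmp) / "new-project"
    project.mkdir()
    seeded = seed_review_policy(template, project)
    copied = sorted(p.name for p in seeded.iterdir())
print(copied)  # ['roles.md']
```

Running a step like this before any feature code is written is what puts the review seats in the first commit rather than retrofitting them later.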
Companion reading on the doctrine and the real-world case:
- The environment makes the outcome — where real AI differentiation comes from (global industry doctrine: Anthropic's four engineering posts + UK government evaluation standard + coding evaluation moving from 1.96% to 82%)
- Putting an AI in the project-manager seat — the dev-culture shift behind ClickEye (three concrete events from ClickEye's in-house product Hawkeye, where this structure actually operated)
6. Closing
AI is fast. It is going to get faster. But the industry is now paying the cost of speed that runs without accountability — in the time of curl maintainers, the trust of supply chains, the future maintainability of code bases, and above all the trust of clients who put a system in our hands. ClickEye blocks that cost from the start by encoding multi-stage review as a directory. If you need AI not merely adopted but delivered with the review seats designed in from day one, get in touch.
References
- [1] Willison, S. (May 8, 2024). Slop is the new name for unwanted AI-generated content. “sharing unreviewed content that has been artificially generated with other people is rude”. simonwillison.net/2024/May/8/slop
- [2] Merriam-Webster (2025). Word of the Year 2025: “slop”. merriam-webster.com/wordplay/word-of-the-year
- [3] Stenberg, D. (2025). ~20% of curl security submissions are AI slop. Public reports by the curl maintainer (originals at daniel.haxx.se; covered by LWN, The Register, Hackster).
- [4] Stack Overflow (Dec 2022). Temporary policy: Generative AI (e.g., ChatGPT) is banned. “The average rate of getting correct answers from ChatGPT is too low”. meta.stackoverflow.com/questions/421831
- [5] Spracklen, J. et al. (USENIX Security 2025). We Have a Package for You! A Comprehensive Analysis of Package Hallucinations by Code Generating LLMs. Commercial LLMs 5.2%; open-source 21.7%; 205,474 unique fake package names. arxiv.org/abs/2406.10279
- [6] Lasso Security (2024). Diving Deeper into AI Package Hallucinations: Slopsquatting in the wild. huggingface-cli PoC, 30,000+ downloads. lasso.security/blog/ai-package-hallucinations
- [7] Pearce, H. et al. (2022, IEEE S&P). Asleep at the Keyboard? Assessing the Security of GitHub Copilot's Code Contributions. ~40% of 1,689 programs across 89 CWE scenarios contained security vulnerabilities. arxiv.org/abs/2108.09293
- [8] GitClear (2025). AI Copilot Code Quality: 2025 Look at Refactoring, Reuse, and Read-Time. Analysis of 211M lines: refactoring 25% → under 10%, copy-paste clones 8.3% → 12.3%. gitclear.com/ai_assistant_code_quality_2025_research