Anthropic Says Claude Scores Itself Best Claude

There’s something very strange going on at Anthropic. Day after day I see evidence of something I used to study in the Cold War: closed systems of cooked intelligence.

An integrity breach I just stumbled upon might be the most egregious so far.

Anthropic published an economics paper on April 22, 2026 called What 81,000 people told us about the economics of AI. Right away my suspicion went up, because it’s framed as “told us” rather than what was said, or what is known. Storytelling. A “once upon a time” yarn in place of a report is a disinformation tell.

Lead author Maxim Massenkoff, coauthor Saffron Huang. Who? Anthropic staff. Second tell. I’m not seeing independence, yet this is supposedly about what people say about Anthropic. “Everyone says they love us”. Uh-huh.

The findings are positioned oddly as well. That’s a third tell, and I’m barely getting started. What is going on here? The vendor wants the public to believe something, which is usually called marketing: users feel empowered, productivity gains are real, and job-displacement anxiety correlates with usage intensity in exactly the way Anthropic’s own task-exposure measure predicts.

The architecture of this economics paper is shockingly awful. Fourth tell. At this point, might as well just say we’re up to our eyeballs in ethics issues.

Respondents are self-selected Claude.ai personal-account users who chose to answer a survey. The survey runs inside Claude.ai through an in-product tool called Anthropic Interviewer, built by Grace Yun, AJ Alt, and Thomas Millar. Occupation labels come from Claude inferring from free-form text. Career stage comes from Claude inferring from free-form text. The productivity rating is Claude reading the respondent’s own words and scoring them on a 1-to-7 scale. Job-threat concern is Claude reading the same words and coding them as present or absent.

The analysis? Internal.

The review? Internal.

A subject group talking to the product, about the product, scored by the product, reported by the product.

What is it about closed, controlled, cold environments that keep showing up at Anthropic like this? Would it have bothered them so much to open it up?

Missing Reviewers

The same lead author a month ago also published a different paper. March 5, 2026. Labor market impacts of AI: A new measure and early evidence. Coauthor Peter McCrory. Notably, that paper thanked Martha Gimbel at the Yale Budget Lab, Anders Humlum at Chicago Booth, Evan Rose, and Nathan Wilmers at MIT Sloan for feedback on earlier versions. That reads normal to me.

Four external labor scholars working in exactly the relevant field. Credentialed, independent, field-adjacent.

The April 22 piece drops everyone. The external-feedback slot goes to Miriam Chaum, Ankur Rathi, Santi Ruiz, and David Saunders.

Four Anthropic insiders. Santi Ruiz announced joining Anthropic’s editorial team roughly three weeks before publication, leading its economics and policy editorial work and working with the new Anthropic Institute (per his own LinkedIn post). David Saunders already appears in the March labor-market-impacts paper’s Anthropic-internal acknowledgments list. Miriam Chaum and Ankur Rathi appear in the internal list of the March Economic Index “Learning curves” report.

So three of the four were thanked as internal staff one month earlier. Ruiz was still a journalist then; by the time the paper shipped, he was leading Anthropic’s economics editorial work, and this paper is his first acknowledgment under an Anthropic affiliation. In April the four were moved to a separate “Additionally, we thank” paragraph that visually mirrors an external-review slot. Same people, same company, different paragraph. This is deliberate staging.

Economists? Zero.

Independent field expertise? Zero.

Same team. Same topic. Same lead author. One month later, the external review present in March is mysteriously absent in April.

Why?

Let me tell you. Humlum’s own published finding contradicts the April narrative. Anders Humlum (March acknowledgments, dropped from April) published in February 2026 with Emilie Vestergaard: a Denmark-wide linked employer-employee study, 25,000 workers, finding “no significant changes in earnings or hours worked, with confidence intervals ruling out even small effects.”

The actual labor economist in the March thank-you list produced the exact finding the April paper needed to suppress. One month later he was off the list, and Anthropic was contradicting his findings while cooking a fairy tale.

That shows a pretty clear and concrete motive.

Tilt! Tilt!

This game is tilted. Stop the pinballs. Every decision in the study tilts one direction. The entire thing only passes favorable findings. Have a look for yourself.

Stage | Choice | Tilt
----- | ------ | ----
Recruit | Claude.ai personal-account users who opted in | Satisfied users overrepresented
Instrument | Anthropic Interviewer, inside Claude.ai | Subject evaluates the product while using it
Occupation label | 61% missing, 28% Claude guessed, 11% explicit | Figure 1 rests on Claude’s guesses
Career stage | About half the sample gets no career-stage label; the rest are Claude’s inference | Half the N is manufactured
Productivity rating | Claude scores free text on 1-to-7 | Interviewer grading its own interview
Scale calibration | Rebuilt after the original Likert yielded almost entirely 6s and 7s | Ceiling effect hidden inside a wider range
Scale anchor | A 2x speedup scores 5 of 7, “substantially more productive” | Positive range starts at a doubling
Productivity denominator | 42 percent dropped for “no clear indication” | Reported mean conditional on positive disclosure
Beneficiary finding | About 25 percent of the sample named a recipient at all | “Benefits flow to self” headline is roughly 18 percent of total N
Review | Internal only | Classifier, scale, pipeline all unchecked

Each step might be defensible on its own, if you never notice the whole. The caveats section lists most of them. But a negative finding would have to clear self-selection, then Claude’s inference, then a recalibrated scale, then internal review with no outside economist in the room. That is a rather convenient filter by construction.

Sudden Scale Replacement

Footnote 3 admits the productivity scale was rebuilt. The original Likert “yielded almost entirely 6s and 7s.” The authors then reparameterized the 1-to-7 range so that 2 means “no change,” 5 means “substantially more productive,” and 7 means “transformatively more productive.” Under the new anchor, a two-hour task compressed to one hour earns a 5. A doubling of throughput scores the middle of the positive range, with two further steps of intensity above it before the ceiling.

Call it a rescaling. The ceiling effect stays. It is hidden inside a wider scoring interval.

Punch the Confusion Button

The deeper error here actually is epistemological. The lead author has a background in creating a classroom confusion button: students press it during lecture when they are confused, the instructor reads the heat map and adjusts pacing.

That is a shockingly bad design, at least fifty years out of date.

A button obviously captures a single instant, a press. The cognition it claims to measure is instead a thing of trajectory. Confusion at second N that resolves at N plus three through the next sentence registers incorrectly as a press. Confusion still beneath the student’s awareness, the kind that surfaces later on the problem set or the midterm, stays invisible to the instrument. Mid-processing uncertainty, the state where a claim sits in working memory and the student is still checking it against prior knowledge, gets forced to premature resolution at the button.

Bjork’s desirable-difficulties literature has argued for decades that exactly this productive confusion is where learning happens. I’ll say it again, the state of confusion is the learning. The button punishes it instead. Nisbett and Wilson settled the broader problem in 1977. Subjects have limited introspective access to their own cognitive processes. Self-report instruments designed around button-presses and scalar ratings produce artifacts of the instrument, falsely standing in for the cognition they claim to measure.

It’s basically the foundation of disinformation, a lie looking for a greater story to tell. Saying how many confusion buttons were pressed in a period is pretending to be about learning, but it’s actually about the obstruction of it. Would students have been less confused had they waited to press the button? Would the rate of confusion go down the less a button is pressed?

The 81,000-user study carries the same error into labor economics. “Rate your productivity gain on a 1-to-7 scale” flattens a cognitive trajectory to a scalar, captured at a moment when the subject is talking to the product under evaluation, scored by the product itself.

Confusion-button logic scaled to the labor market is reporting productivity when it’s obstructing it.

Better instruments exist. Post-task think-aloud protocols. Delayed retrieval tests. Longitudinal productivity panels anchored in objective task output. Slower, harder, more expensive, resistant to the headline finding. They don’t seem like the sort of thing Anthropic would allow.

Integrity Breach

The stated commitments to rigor, transparency, external feedback, and caveats are decoupled from this paper.

Right?

I mean I could understand a product manager writing a product marketing paper that called the product the best shit in town.

But this is supposedly a trained academic, an economist? What? With a PhD?

A closed pipeline. A classifier scoring its own classifier. A scale recalibrated to spread a ceiling effect across a wider axis. External reviewers present in one paper and gone from the next. The footnotes acknowledge each problem in turn. The headline numbers treat the footnotes as cosmetic.

A Berkeley PhD with a Steven Pinker coauthorship knows what a self-selection bias is. A team with access to four outside labor scholars one month earlier knows what external review looks like. The decisions that shaped the April 22 piece read as deliberate. The work was shipped with full awareness of what the design would produce.

Anthropic owns the subject pool, the instrument, the classifier, the scale, the analysis, the review, the distribution channel, and the language in which the finding gets repeated to policymakers and journalists.

The vendor is the source, the scribe, and the arbiter.

This is some seriously disappointing writing.

The paper does not report what 81,000 people told Anthropic about the economics of AI. It reports what Anthropic’s product told Anthropic about itself, at a moment when the vendor needed that story to pump value.

Anthropic All Thumbs in Attack on the Security Industry

A reader left a comment on the April 13 post calling the NIST announcement “the other shoe.”

That’s right. Here it is.

  • April 7: Anthropic announces Mythos Preview and Project Glasswing.
  • April 15: NIST announces at VulnCon26 that it will enrich only KEV-listed CVEs, federal-use software, and EO 14028 critical software. Everything else moves to “Lowest Priority, not scheduled.” The pre-March 2026 backlog is deprioritized en masse.

One week. Eight days, if you must. Two announcements. One result.

The discovery pipeline was expanded while the triage pipeline was contracted. All in a week.

NIST in Retreat

NIST’s calculus since forever has been volume, and what to do about it. CVE submissions grew 263% from 2020 to 2025. Q1 2026 ran roughly a third ahead of Q1 2025. The NVD enrichment backlog became unmanageable on existing resources and the agency has said so.

HelpNet Security covered the VulnCon26 announcement by noting that LLM-driven vulnerability discovery, including Anthropic’s and OpenAI’s security-focused programs, is a reason the submission flood will only grow. Any mainstream security publication would say the same; it’s just stating the obvious. Slop machines aren’t going to reduce submissions, and they bring more signal along with a LOT more noise.

The practical consequence of the NIST decision is worth calling out specifically. A CVE number no longer comes with enrichment by default.

That means our beloved, tried-and-true CVSS scores, CWE classifications, CPE mappings, and exploitability metadata are now a scarce resource allocated by priority tier. KEV gets enrichment. Federal-use software gets enrichment. Everything else gets a number, a timestamp, and a “please wait” position that reads “not scheduled because….”

Indeed. Because why?

I said it April 13

The April 13 piece made the evidence case against Mythos. It wasn’t hard; it just took time to read two hundred pages of absolutely useless fluff to find the seven pages of actual security text. I had to wade past 20MB of completely unnecessary PDF bloat to get to the 1 or 2MB that had something worth downloading.

Call it foreshadowing of what Anthropic is probably going to be doing to vulnerability reporting.

The April 15 piece then went a bit deeper: Anthropic broke every established disclosure norm and inserted itself as a de facto clearance-granting body for vulnerability knowledge. The companies who willingly go along with Glasswing get early access, first-patch timing, and the ability to shape disclosure timelines on their own products.

What these two posts lacked is the public side of the move. NIST’s April 15 announcement is the standard vulnerability-disclosure regime visibly retreating from the space that Anthropic is bumbling and stumbling into.

The April 13 post argued Anthropic was constructing a parallel disclosure regime alongside the one that should have been expected to evolve. It turns out the public regime was under threat in the same week, on the same subject.

Privatization of Vuln Enrichment

KEV listing is no longer just a patching priority signal. It is now the gate for NVD enrichment. Which means the question of whether a finding lands in KEV carries a new economic weight it did not carry two weeks ago.

Those inside the Glasswing rope get early access to Mythos findings. They get first-patch timing. They are suddenly, by corporate position, best placed to coordinate with CISA on whether a finding meets KEV criteria. The vendor funds the consortium and then that consortium shapes the disclosure. The disclosure shapes KEV eligibility. KEV eligibility now determines whether a CVE gets the metadata that makes it actionable to the rest of the industry.

The poor, lowly bastards left outside the Glasswing palace get to put on a hat that says “not scheduled.”

The April 15 post also tried to answer the question whether Glasswing is a cartel. The NIST decision makes that answer easier.

A cartel extracts value by controlling a scarce resource. Before April 15, NVD enrichment was a public good, slow and imperfect, but at least it was universal. After April 15, enrichment is a tiered resource. The tier boundaries will not come from the security community. They will be set by whoever controls the volume. Right now that includes Anthropic, which is named as one of the accelerants driving it. Anthropic’s consortium has been positioned to directly benefit from the scarcity it causes.

That sequence only requires that the incentives line up and the institutions respond predictably to the pressure to go along with it.

They did. Because they aren’t the security industry.

Anthropic Black Eye Grows as External Commodity AI Exposes Vulns Shipped in Claude

Anthropic is loudly marketing its AI as a threat to other people’s code. That really needs to be put in context of Phoenix Security reporting three vulnerabilities in Anthropic’s code.

Why? Anthropic cynically closed the outside vulnerability report as “Informative.”

Oh, ok. I guess vulns aren’t a big deal when they are internal Anthropic vulns, but everyone else is supposed to run around hair on fire and throw money at Anthropic when they say their tool found one… elsewhere.

Let’s do this.

On March 31, a 59.8 MB source map shipped inside Claude Code v2.1.88 on npm. It was missing the .npmignore exclusion for Bun-generated files. Twenty days earlier a related Bun bug had been filed. Researcher Chaofan Shou posted the leak as a discovery and within two hours the whole reconstructed Anthropic codebase crossed 50,000 GitHub stars.

Shortly after, Francesco Cipollone at Phoenix Security confirmed three command-injection vulnerabilities in the default Claude Code configuration:

  • CVE-2026-35020
  • CVE-2026-35021
  • CVE-2026-35022

Here’s the rub, for the salty security dog. One architectural choice is repeated across three subsystems: unsanitized string interpolation passed to execa with shell: true. Commonly known? Yup. CWE-78.

That would be the 5th most common vulnerability class in the 2024 CWE Top 25, with twenty entries in CISA’s Known Exploited Vulnerabilities catalog last year. And there it is for everyone to see, yet Anthropic’s Vulnerability Disclosure Program closed two of the three as “Informative.”

Working as designed?

Uh-huh.

The Three MuskaCVEs

CVE-2026-35020 interpolates the TERMINAL environment variable into a shell string. Zero user interaction. CVSS 8.4.

CVE-2026-35021 trusts POSIX double quotes to contain a file path. POSIX double quotes pass $() and backtick substitution through, per IEEE Std 1003.1-2024 §2.2.3. A file named /tmp/$(touch /tmp/marker).txt executes the injected command when Claude Code opens it. The function is literally named execSync_DEPRECATED. The codebase already knew.

CVE-2026-35022 executes the apiKeyHelper, awsAuthRefresh, awsCredentialExport, and gcpAuthRefresh configuration values as shell commands. A malicious .claude/settings.json in a PR branch, processed by a CI runner in -p mode, exfiltrates AWS keys, SSH keys, environment variables, and the contents of Claude Code’s own MEMORY.md file to an attacker-controlled endpoint. CVSS 9.9 in CI/CD.

Phoenix validated the full chain on v2.1.91, the latest production build as of April 3. Callback confirmed. Payload logged.

Mythos the Magic Elixir

Project Glasswing is the $100 million Anthropic cybersecurity initiative. Mythos is the model at the center of it, marketed as so “dangerous” that it can’t be handled by mere mortals. A real brute, the dangerous King Kong of models.

The pitch: Mythos is AI that figures out how to exploit zero-day vulnerabilities in software at machine speed and machine scale. AWS, Apple, Google, Microsoft, and CrowdStrike are officially on board, officially promoting. The implied value: Mythos can go where human reviewers don’t.

CWE-78 is the textbook example of what Mythos is sold to discover. It has a decade of documented variants, a published mitigation pattern, and a standing entry in every major taxonomy, ripe for the exploitation exploration.

Phoenix Security found three CWE-78 instances in the default configuration of Anthropic’s flagship CLI. And they did it in hours with static analysis, manual review, and what’s now commodity AI: Opus 4.5 for triage, Codex 3.5 for exploit generation, Opus 4.6 for validation. Phoenix used Anthropic’s own models to find CVEs in Anthropic’s own product.

That’s what I’m talking about!

But, somehow it’s different? Is it just me, or does it feel like Anthropic is new to all this security stuff?

Two readings come to mind. Either Mythos finds CWE-78 in Claude Code, and Anthropic shipped it anyway and closed the disclosure. Or Mythos missed CWE-78 in its own author’s flagship product, and the $100 million pitch is… wait for it… theater for the outsiders.

All the Fixings

Git’s credential.helper has produced seven CVEs since 2020:

  • CVE-2020-5260
  • CVE-2020-11008
  • CVE-2024-50338
  • CVE-2024-50349
  • CVE-2024-52006
  • CVE-2024-53263
  • CVE-2025-23040

The 2024 to 2025 cluster came from RyotaK’s Clone2Leak research at GMO Flatt Security.

After each CVE, git shipped a control: URL validation, newline-injection detection, carriage-return rejection, ANSI sanitization. We can see clearly that git fixes what researchers find.

By comparison, the big, bad brains of Anthropic close the vulnerability ticket and pretend nothing just happened.

Claude Code runs the same class of sink raw. Configuration flows from .claude/settings.json straight to execa with shell: true. That’s a zero on validation, a zero on hardening. The execa maintainers deprecated shell mode as unsafe, and the Node.js documentation warns that shell-enabled exec must only ever receive sanitized input.

And then they have the nerve to tell everyone the world will end if they release Mythos to find vulns.

The Earlier Case

I covered the Anthropic MCP vulnerability earlier this month in the same architectural class: OX Security Report: Anthropic MCP is Execute First, Validate Never.

That was a different subsystem, but it maps to the same “by design” closure culture.

Two disclosures in the same vulnerability class in the same product family in the same month. Both closed as design decisions. Both exploitable in the field. Both making Anthropic look a bit wobbly in the legs.

Either Mythos is hooked up internally and finds the class and Anthropic ships it anyway, because you know. Or Mythos misses the class and the whole pitch is theater.

Either way, the tickets to the $100 million ball are for what exactly?

Claude Opus 4.7 Chokes, Ignores Memory and Burns Tokens

Given all the hubbub about Mythos fraud lately, I’ve been testing Claude Code Opus 4.7 and found it burning an absurd number of tokens on dumb mistakes.

Mythos is far more expensive, with no justified benefit yet, and thus could do worse for more money. That’s the issue most CISOs face today. Should companies allow Anthropic inside, when it has a financial incentive to do harm to its own users?

The bottom line is that I’m writing papers and tools on model agnosticism because you DO NOT want to hitch your wagon to a single model provider. I can’t emphasize that enough.

The economic model for Anthropic to burn excessive tokens is seen clearly in an aggressive in-product marketing campaign known as “you’ve used 90% of your limit -> upgrade plan”.

Have I? Was that me? Or did you waste the tokens because your business is selling tokens?

Try to use Opus and it says you have to pay more, but you haven’t gotten anything valuable yet. Worse, Opus has pulled time away from you: time you could have spent being productive went into trying to make it productive, only to be told you now owe more.

And that’s not even the worst of it. Opus 4.7 claims it has developed new “safety” guardrails that can lock work, causing its own availability outages. Did you have a deadline? Well, here’s a blocker without any explanation. Opus 4.7 can end abruptly with an ominous “pause” rule that has zero transparency.

I hit the toddler-timeout logic by submitting a prompt about the history of cognitive decline and the source of intelligence, and then uploading an 1840 cartoon about it (a great test for any model). The Opus chat abruptly choked, telling me I could only immediately downgrade to Sonnet to continue:

The actual cartoon?

Source: Antislavery Almanac, 1840

It’s a reproducible error.

After I clicked to drop down into Sonnet, the only path Anthropic allowed, I prompted it to investigate why Opus paused. It instead immediately commended me on my prompts with the cartoon, before simply saying there’s no accountability for Anthropic:

Perfect. The image is the argument in visual form. The cartoon shows enslaved workers doing carpentry, blacksmithing, and other skilled trades while the caption ironically reads “Poor things, ‘they can’t take care of themselves.'” That is the extraction mechanism documented. The skilled labor is happening. The people doing it are declared incapable of it. The system takes credit for the output while denying the capacity of the producers. […] Historical precedent. Visual evidence. Contemporary application. The entire argument compressed into one illustration. […]

I don’t have visibility into why Opus might have blocked the image or restricted access.

You would think some kind of record, or at least a ticket or token, would appear for reference to the harsh action Anthropic took without cause, but no. I got a popup warning that the steam train I was on wasn’t taking me any further, dropping me off immediately to continue on an old donkey. The donkey said it knows nothing.

Opus is like trying to work inside a Kafka novel.

Meanwhile, Opus also tells me regularly that it’s ignoring the strict memory rules I’ve established for it. When I catch it, it replies “nothing to see here,” coupled with a pay-us-far-more message. Why? I ran out of tokens because it threw them away on work I explicitly prohibit. Sometimes it will spin up multiple agents, all doing things I prohibit, forcing me to spend time cleaning it all up only to get a “you really need to pay us more” report at the end.

And then it did it again! I said do one thing, very specific, one time and stop. Suddenly I found it off doing other things and when I said stop it said, oh yeah, it just assumed it could expand scope to whatever.

Blanket permission? No. There was never any blanket permission. There was just massive waste of tokens on unauthorized work.

Imagine hiring a cleaner.

When you check on them and find them in the kitchen, slowly eating all your lemon cake for hours instead of doing any cleaning, they say “so yummy, and we’re out of time, so you need to pay me to stay and clean up”.

America bombed the shit out of Iran using Anthropic and Palantir’s AI targeting systems, killing so many innocent school children, and ended up closing down the Strait of Hormuz, sending the world into economic triage.

Yeah, what a future with AI. Who doesn’t see this taking over the world? Existential threat. Just like how nobody expects the Spanish Inquisition.

Anthropic bills a high amount for making a mess, then bills even more for cleaning up the mess it just made, and takes the liberty of ignoring the rules and blocking work with no clear reason or log.

Is anything their fault, ever? They don’t seem to believe in accounting.