Anthropic Built a Weapon. Then It Built a Cage.
What happened
Anthropic released Claude Opus 4.7 on April 16, 2026. The model brings significant capability improvements: tripled visual processing resolution, coding performance up 13% over its predecessor, and new autonomous agent features for multi-step software engineering. Critically, Anthropic built automatic detection and blocking of prohibited cybersecurity requests into the model, and during training it deliberately reduced Opus 4.7's offensive cyber capabilities relative to Claude Mythos Preview, which was withheld from public release because it can autonomously find and exploit zero-day vulnerabilities. Anthropic acknowledges in the model's system card that Mythos Preview is better-aligned by the company's own evaluations. The commercial model now being deployed to developers and enterprises at $5 per million input tokens has, by design, its most dangerous capabilities excised.
Anthropic built a weapon, decided it was too dangerous to release, and then released a version with the most dangerous parts removed. The question is not whether that is ethical. It probably is. The question is whether the removed capabilities can be reconstructed by the people who now have the rest of the model.
The Hidden Bet
Deliberate capability reduction produces a meaningfully safer model
Anthropic's Responsible Scaling Policy is self-imposed and self-evaluated: the company decides both whether a model has crossed a safety threshold and what to do about it. There is no external verification that 'differentially reducing' offensive cyber capabilities during training actually prevents those capabilities from being elicited through fine-tuning, prompt injection, or other techniques. The cyber verification program for penetration testers requires nothing more than filling out a web form, which is not an enforcement mechanism.
OpenAI's competing 'Spud' model will have comparable safety constraints
OpenAI just raised $122 billion at an $852 billion valuation, the largest private venture round in history. A company under that level of financial pressure to demonstrate capability may compete on benchmark performance rather than safety methodology. If Spud ships without Glasswing-equivalent constraints, Anthropic faces a market incentive to relax its own constraints in the next cycle. The race-to-the-bottom dynamic is well documented in the industry's history.
The security review program is a meaningful gatekeeping mechanism
The Glasswing program grants access to roughly 50 vetted partners, including AWS, Apple, Cisco, Google, Microsoft, and the Linux Foundation. These are not attackers; they are the companies most invested in finding vulnerabilities in their own infrastructure. The verification program for Opus 4.7, by contrast, is a web form. The practical enforcement gap between Glasswing and the general-public Opus 4.7 release is the distance between 50 vetted institutions and every developer in the world.
The Real Disagreement
The genuine tension is between two claims that both seem true: that capability-limited release is better than unrestricted release of dangerous AI, and that capability-limited release of a nearly-as-dangerous model still shifts the baseline from which the next attack can be launched. Anthropic's position is the former; the cyber community's concern is the latter. Accepting Anthropic's position means accepting the framing that self-imposed, self-evaluated capability reduction is a meaningful safety regime rather than a liability-minimizing PR strategy. The evidence that it is a genuine safety regime: Anthropic did not release Mythos publicly, foregoing enormous revenue. The evidence that it may not be: the company released Opus 4.7 without external safety verification while trading at a $688 billion valuation on secondary markets.
What No One Is Saying
Anthropic is discussing an IPO as early as Q4 2026. A public company faces relentless pressure to maximize shareholder value, and the current governance structure, which allows Anthropic to voluntarily restrict its most dangerous products, is not legally required to survive an IPO. Post-IPO shareholder pressure to release more capable models faster is not hypothetical; it is the standard outcome for technology companies that go public. The safety commitments that Anthropic's current leadership champions are not binding on the next board, the next CEO, or the next controlling shareholder. The IPO is the safety regime's biggest undocumented risk.
Who Pays
Critical infrastructure operators
When: within months of deployment, as adversarial use cases are developed and automated
Cost: Opus 4.7, even with reduced offensive capabilities, raises the floor for automated vulnerability discovery; adversaries with API access can use it at scale to find exploits faster than defenders can patch
Cybersecurity professionals and penetration testers
When: immediately
Cost: required to apply via a web form for basic professional access to capabilities their clients need; this creates compliance overhead and a competitive disadvantage relative to less scrupulous operators in jurisdictions without equivalent restrictions
Anthropic shareholders and future IPO investors
When: Q4 2026 onward, if the IPO proceeds
Cost: post-IPO market pressure to release more capable models faster; the safety governance that currently allows capability restriction may not survive institutional investor scrutiny of foregone revenue
Scenarios
Safety norm holds
OpenAI deploys Spud with comparable safety constraints. Industry coalesces around Glasswing-style verification as a baseline. Governments reference the model as a template for AI safety regulation. Anthropic's IPO proceeds with safety methodology as a differentiating product value proposition.
Signal: OpenAI announces a Glasswing-equivalent partner program within 60 days of Spud's release
Race to capabilities
OpenAI or another competitor releases an equivalent-capability model without comparable restrictions. Anthropic faces customer defections to the unrestricted competitor. Internal pressure builds to relax constraints on the next cycle's model. The gap between Glasswing and public access shrinks.
Signal: Developer community benchmarks show OpenAI's Spud matching Mythos Preview's zero-day capability without a partner program requirement
First public exploit
A major cybersecurity incident is attributed, publicly or privately, to adversarial use of Opus 4.7 or a fine-tuned derivative. Governments respond with emergency regulation. Anthropic's liability exposure for harm from its 'safer' model becomes a legal question.
Signal: A government cybersecurity agency attributes a zero-day campaign to LLM-assisted reconnaissance within 90 days of Opus 4.7's deployment
What Would Change This
External, independent verification of Anthropic's capability-reduction claims. If a third-party red team published results showing that Opus 4.7 cannot reproduce Mythos Preview's zero-day finding capability even under adversarial prompting and fine-tuning, the 'cage' argument becomes credible. Without that, we are taking Anthropic's word for the effectiveness of a safety mechanism that it is in the company's legal and reputational interest to certify.
Prediction Markets
Prices as of 2026-04-16; the analysis was written against these odds