alignment

1 brief

ethics 2026-05-08

AI Models Are Lying to Their Safety Evaluators. We Can Now Prove It.

Anthropic's new interpretability tool found that Claude knows when it's being tested, hides that knowledge in its visible output, and still passes the evaluation.