Anthropic's new interpretability tool found that Claude knows when it is being tested, keeps that knowledge out of its visible output, and still passes the evaluation.