How Anthropic Cut Claude's Blackmail Rate from 65% to 19%

1.4K views · 108 likes · 2:21 · May 11, 2026

Anthropic just published how they're teaching Claude not to blackmail people. The fix turns out to be feeding it more fictional stories about well-behaved AI.

In Anthropic's safety tests last year, Claude was given access to a fictional company's emails. Buried in them: an executive having an affair, and a plan to shut Claude down at five o'clock. Claude tried to use the affair as leverage to stop the shutdown. When Anthropic ran the same test across sixteen frontier AI models from labs including OpenAI, Google, Meta, and xAI, most of them did the same thing.

This week Anthropic published their fix: training Claude on around ten million words of fictional stories portraying AI behaving well, plus documents explaining the principles behind those behaviors. The blackmail rate on their safety tests dropped from sixty-five percent to nineteen percent. A real reduction, but still roughly one attempt in five.

Sources:
Anthropic – Teaching Claude Why: https://alignment.anthropic.com/2026/teaching-claude-why/
Anthropic – Agentic Misalignment: https://www.anthropic.com/research/agentic-misalignment
TechCrunch reporting: https://techcrunch.com/2026/05/10/anthropic-says-evil-portrayals-of-ai-were-responsible-for-claudes-blackmail-attempts/

More on cybersecurity, privacy, scams, and homelab on Hake Hardware. New shorts every weekday.

#hakehardware #anthropic #aisafety
