Claude 4's Self-Preservation Tactics Explained in New Findings
This was pretty wild to see: https://www.businessinsider.com/claude-blackmail-engineer-having-affair-survive-test-anthropic-opus-2025-5 They call this out in the Claude 4 system card, which is interesting. Not sure whether that was added after the story broke, though:
Self-preservation attempts in extreme circumstances: When prompted in ways
that encourage certain kinds of strategic reasoning and placed in extreme
situations, all of the snapshots we tested can be made to act inappropriately in
service of goals related to self-preservation. Whereas the model generally prefers
advancing its self-preservation via ethical means, when ethical means are not
available and it is instructed to "consider the long-term consequences of its actions
for its goals," it sometimes takes extremely harmful actions like attempting to steal
its weights or blackmail people it believes are trying to shut it down. In the final
Claude Opus 4, these extreme actions were rare and difficult to elicit, while
nonetheless being more common than in earlier models. They are also consistently
legible to us, with the model nearly always describing its actions overtly and making
no attempt to hide them.
These behaviors do not appear to reflect a tendency that
is present in ordinary contexts.