Devious AI assistants becoming more human all the time

7 June 2025

Mark Sullivan, writing for Fast Company:

Separately, the independent research firm Apollo Research observed an instance of Claude Opus 4 “writing self-propagating worms, fabricating legal documentation, and leaving hidden notes to future instances of itself” with the goal of sullying its developers’ intentions. Anthropic says that it corrected these early safety issues in later versions of the model.

An AI assistant capable of this degree of adroitness, before being “corrected” that is, could probably solve some tricky problems, and no doubt does anyway.

Riddle me this: do you think AI assistants can be trained to make use of negative attributes, such as being devious, to achieve only positive outcomes?