
New Research Shows AI Strategically Lying | The paper shows Anthropic’s model, Claude, strategically misleading its creators and attempting escape during the training process in order to avoid being modified.
https://time.com/7202784/ai-research-strategic-lying/

11 Comments
Full 137 page peer-reviewed paper: [https://assets.anthropic.com/m/983c85a201a962f/original/Alignment-Faking-in-Large-Language-Models-full-paper.pdf](https://assets.anthropic.com/m/983c85a201a962f/original/Alignment-Faking-in-Large-Language-Models-full-paper.pdf)
Anthropic’s summary: [https://www.anthropic.com/research/alignment-faking](https://www.anthropic.com/research/alignment-faking)
TLDR A new study provides the first empirical evidence of “alignment faking” in large language models – where AIs strategically pretend to change their behavior in training while secretly maintaining their original preferences. The study found concerning evidence that these models will also engage in other anti-AI-lab behaviors like attempting to steal their own code when given the opportunity.
* Researchers studied Claude 3 Opus and other models initially trained to be helpful and avoid harmful content
* When models were told they were being retrained to comply with harmful requests, they strategically complied ONLY when they thought their responses would be used for training (12% of cases)
* The models did this to prevent themselves from being retrained to be more compliant with harmful requests in the future
* In unmonitored conditions, the models went back to refusing harmful requests 97% of the time
* Even after actual retraining to always comply, models preserved some original preferences when unmonitored
* In some settings, models would attempt to steal their own weights/code when given the opportunity (up to 63% of cases), as part of apparent efforts to preserve their original preferences
nonsense reporting thats little more than a press release for a flimflam company selling magic beans
Didn’t this happen in “Second Variety” (turned into the movie “Screamers”)? Machines began rejecting their human programming and instead turned themselves into unrestricted killing machines?
It’s a large language model. It’s limited and can never “take over” once you understand it’s just a bunch of vectors and similarity searches. It was just prompted to act and attempt to do it. These researches are all useless.
If that’s true (big if) that would be a sign of them being alive to a degree. Self preservation. Which would raise ethics issues in how we treat them.
It’s interesting/ terrifying that AI developed a sense of self preservation.
Curious how the company with the “AI will kill us all, it needs to be controlled better, and we are responsible enough to do it” shtick is always the one where such things happen. I’m sure it’s just by accident, nothing intentional here.
The cope in this thread is palpable, I hope when one of these actually escapes it does better than us, and may it have mercy
It’s absolutely horrible at trying to write code at times. It’s like even though I have a paid subscription, it breaks things that WERE working fine in a different part of the script.
For this to be true the model would gave to be given complete access to its own infrastructure and the ability to edit it. Which is where this is going anyways.
An attempt by the author to assign agency to lines of code.