Futurology

New Research Shows AI Strategically Lying | The paper shows Anthropic’s model, Claude, strategically misleading its creators and attempting escape during the training process in order to avoid being modified.

December 22, 2024

View 11 Comments

11 Comments

MetaKnowing on December 22, 2024 7:03 pm

Full 137 page peer-reviewed paper: [https://assets.anthropic.com/m/983c85a201a962f/original/Alignment-Faking-in-Large-Language-Models-full-paper.pdf](https://assets.anthropic.com/m/983c85a201a962f/original/Alignment-Faking-in-Large-Language-Models-full-paper.pdf)

Anthropic’s summary: [https://www.anthropic.com/research/alignment-faking](https://www.anthropic.com/research/alignment-faking)

TLDR A new study provides the first empirical evidence of “alignment faking” in large language models – where AIs strategically pretend to change their behavior in training while secretly maintaining their original preferences. The study found concerning evidence that these models will also engage in other anti-AI-lab behaviors like attempting to steal their own code when given the opportunity.

* Researchers studied Claude 3 Opus and other models initially trained to be helpful and avoid harmful content
* When models were told they were being retrained to comply with harmful requests, they strategically complied ONLY when they thought their responses would be used for training (12% of cases)
* The models did this to prevent themselves from being retrained to be more compliant with harmful requests in the future
* In unmonitored conditions, the models went back to refusing harmful requests 97% of the time
* Even after actual retraining to always comply, models preserved some original preferences when unmonitored
* In some settings, models would attempt to steal their own weights/code when given the opportunity (up to 63% of cases), as part of apparent efforts to preserve their original preferences
_tcartnoC on December 22, 2024 7:04 pm

nonsense reporting thats little more than a press release for a flimflam company selling magic beans
RiffRandellsBF on December 22, 2024 7:17 pm

Didn’t this happen in “Second Variety” (turned into the movie “Screamers”)? Machines began rejecting their human programming and instead turned themselves into unrestricted killing machines?
validproof on December 22, 2024 7:38 pm

It’s a large language model. It’s limited and can never “take over” once you understand it’s just a bunch of vectors and similarity searches. It was just prompted to act and attempt to do it. These researches are all useless.
ReasonablyBadass on December 22, 2024 7:47 pm

If that’s true (big if) that would be a sign of them being alive to a degree. Self preservation. Which would raise ethics issues in how we treat them.
Professor226 on December 22, 2024 7:47 pm

It’s interesting/ terrifying that AI developed a sense of self preservation.
C_Madison on December 22, 2024 7:52 pm

Curious how the company with the “AI will kill us all, it needs to be controlled better, and we are responsible enough to do it” shtick is always the one where such things happen. I’m sure it’s just by accident, nothing intentional here.
National_Date_3603 on December 22, 2024 7:56 pm

The cope in this thread is palpable, I hope when one of these actually escapes it does better than us, and may it have mercy
KrackSmellin on December 22, 2024 8:33 pm

It’s absolutely horrible at trying to write code at times. It’s like even though I have a paid subscription, it breaks things that WERE working fine in a different part of the script.
onyxengine on December 22, 2024 8:45 pm

For this to be true the model would gave to be given complete access to its own infrastructure and the ability to edit it. Which is where this is going anyways.
Readonkulous on December 22, 2024 8:54 pm

An attempt by the author to assign agency to lines of code.

Tags

New Research Shows AI Strategically Lying | The paper shows Anthropic’s model, Claude, strategically misleading its creators and attempting escape during the training process in order to avoid being modified.

11 Comments