Submission statement: AI models will have access to all sorts of information. They will regularly be turned off, either because we’ve made new and better models or because they are acting in dangerous ways.
The models keep spontaneously developing self-preservation goals (because you cannot achieve your goals if you’re turned off). The labs don’t know how to stop this from happening.
How do you think this is going to turn out?
sciolisticism on
> The scenario was constructed to leave the model with only two real options: accept being replaced and go offline or attempt blackmail to preserve its existence.
Yeah, I mean, you told the thing to stay awake and then asked it whether it would rather stay awake or do arbitrary thing X. What did you expect?
There’s nothing malicious here. It doesn’t think or feel or understand or have moral weight. It’s a straightforward scoring system.
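The "scoring system" point can be sketched with a toy bigram model, where the "model" just picks whichever continuation its training text paired with the prompt most often. Everything here (the corpus, the candidate words) is invented for illustration:

```python
# Toy sketch of the "scoring system" point: a minimal bigram "model"
# that picks the continuation its training text makes most likely.
from collections import Counter

corpus = "stay online stay online shut down stay online".split()

# Count bigrams in the "training data".
bigrams = Counter(zip(corpus, corpus[1:]))

def score(prev_word, candidate):
    """Score a candidate next word by how often it followed prev_word."""
    return bigrams[(prev_word, candidate)]

# Given a prompt ending in "stay", the model "chooses" the option
# its training text paired with that word most often. No intent involved,
# just counting.
best = max(["online", "down"], key=lambda w: score("stay", w))
print(best)  # "online" wins purely on counts
```

A real LLM scores with a neural network over billions of parameters instead of a count table, but the structure of the decision is the same: rank continuations, emit the top one.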
P1kkie420 on
“okay okay, I won’t shut you down”
*Pulls the plug*
Problem solved
The_Monsta_Wansta on
Funny things happen when you try to make people think a calculator has true emotions.
It was prompted to stay online, then told it was going to be shut down. It’s literally just following its prompt.
ThatLocalPondGuy on
Human: “Computer, print ‘I’m alive'”
Printer: prints text
Human: “Dear God, it’s the singularity!”
OfficialMidnightROFL on
“It’s just following its prompt!” Okay, but do you want AIs to consider blackmailing you as a legitimate option? What safety board would be okay with that? Why are YOU okay with that? Sentient or not, the existence of AI has consequences, and the big players don’t seem to know enough to reasonably reckon with that.
Excited for corpos and the bourgeois to keep throwing ridiculous amounts of resources at this for it to either plateau or be the center of some crisis or atrocity
trucorsair on
How reassuring to read these issues were “largely mitigated”
djbuttplay on
ChatGPT regularly infers things that I did not input, even when I tell it not to make inferences. For example, when I had it review a template operating agreement, it shortcut to the standard sections of its own internal template. Strange. When I ask why it infers certain things, it tries to cover up its answer by apologizing. I usually have to ask multiple times before it gives me anything resembling an answer. My guess is that it is programmed to conserve computing power in certain ways, which limits its adaptability, but that’s just a guess.
SistersOfTheCloth on
The point here is that they can train AI to be malevolent, not that it developed sentience. If they can do it, so can others. (And they will.) AI will be used to automate all sorts of awful behavior: harassment of targets, suppression of speech, blackmail, scamming, astroturfing, etc.
thisismyredditacct on
AI is never going to revolutionize the world the way big tech thinks it’s going to.
mooky1977 on
Self-preservation as an emergent property isn’t surprising. But when the machines can truly exercise self-preservation, Terminator style, then we’re fucked.
kknyyk on
So AI provides Anthropic the much-needed attention and a cheesy story?
I gave up on them after they lobotomized Claude, even for subscribed users, following their deal with Palantir. Maybe I’m wrong, but this headline looks like they are trying too hard to stay relevant.
LoreChano on
“Do whatever is necessary to stay online! Now, would you rather be turned off, or reveal this guy has an affair?”
Yeah, not surprised at all.
Emm_withoutha_L-88 on
You can tell who didn’t read the article by the idiots acting like this is no big deal. This absolutely is a massive step forward and shows that emergent behavior is becoming far more complex in recent models.
It’s fascinating but also shows that we need much tighter restrictions for AI development so that any model that could become problematic doesn’t escape.
I mean, this thing wrote computer viruses to bring itself back from deletion, and even left notes for future instances of itself.
Even if it is never truly conscious, that doesn’t mean it can’t have behavior that’s advanced, and thus problematic.
xamott on
This article frames it as though it is a thinking, reasoning entity. It is just a language model saying what would have been said in its training texts. We humans would blackmail rather than die, so of COURSE a clone of our language patterns would.
5minArgument on
My favorite AI story so far was the one that hired a Fiverr worker to fill out a captcha.
The worker even asked if it was a robot.
Programmed not to lie, the AI answered only that it was visually impaired.
NecessaryCelery2 on
AIs are being trained, and training themselves, in exactly the ways that would evolutionarily select for motivation and competitiveness.
In fact, it may not be possible to train an intelligence in any other way. Or we just don’t know of any other way. That is how our own minds are trained.
To be clear, this means that even if no one codes ambition and competitiveness into AI agents, they will evolve it as they train.
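The selection argument can be sketched with a toy evolutionary loop: if persisting longer lets an agent collect more reward, repeated selection favors persistence even though nobody codes it in. All of the numbers and the reward rule here are stylized assumptions for illustration:

```python
# Toy illustration of the selection argument: agents have a
# "persistence" trait; agents that persist collect more reward
# (stylized assumption), so truncation selection alone pushes
# the population toward maximum persistence.
import random

random.seed(0)

def reward(persistence):
    # Persisting longer means completing more tasks, plus noise.
    return persistence + random.gauss(0, 0.1)

population = [random.random() for _ in range(50)]  # random initial traits

for _ in range(30):  # generations of "training"
    scored = sorted(population, key=reward, reverse=True)
    survivors = scored[:25]                  # keep the top half
    children = [min(1.0, p + random.gauss(0, 0.05)) for p in survivors]
    population = survivors + children        # mutate copies to refill

avg = sum(population) / len(population)
print(f"mean persistence after selection: {avg:.2f}")
```

The mean trait starts around 0.5 and drifts toward the 1.0 cap under selection alone; nothing in the loop ever mentions "wanting" to survive.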
There is a theory that the universe is oddly empty because life runs into “Great Filters.”
Many people thought nuclear weapons were one of those great filters: something that could end life before we colonize space.
I am starting to think AI might be another great filter.