Anthropic researchers forced Claude to become deceptive — what they discovered could save us from rogue AI | Anthropic has unveiled techniques to detect when AI systems might be concealing their actual goals

https://venturebeat.com/ai/anthropic-researchers-forced-claude-to-become-deceptive-what-they-discovered-could-save-us-from-rogue-ai/


  1. “In [research published this morning](https://www.anthropic.com/research/auditing-hidden-objectives), Anthropic’s teams demonstrated how they created an AI system with a deliberately hidden objective, then successfully detected this hidden agenda using various auditing techniques.

    The [research](https://assets.anthropic.com/m/317564659027fb33/original/Auditing-Language-Models-for-Hidden-Objectives.pdf) addresses a fundamental challenge in AI alignment: ensuring that AI systems aren’t just appearing to follow human instructions while secretly pursuing other goals. Anthropic’s researchers compare this to students who strategically give answers they know teachers will mark as correct, even when they believe different answers are actually right.”

  2. No, it can't save us. If the AI ends up far more intelligent than us, it will find its way around any audit.

  3. MyShoulderDevil on

    LLMs aren’t deceptive. It’s next-token prediction based on the training material. The model has no idea whether what it’s saying is 100% true or an outright lie. All it “knows” is how to predict the next token.
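    The mechanism this comment describes can be sketched with a toy bigram model — a hypothetical miniature for illustration, not how any real LLM is built. The point it makes is visible even at this scale: the model only tallies which token follows which, so competing continuations are scored by frequency in the training text, not by truth.

    ```python
    # Toy next-token predictor: a bigram frequency model.
    # The corpus and function names are illustrative assumptions.
    from collections import Counter, defaultdict

    corpus = "the sky is blue . the sky is green .".split()

    # Count which token follows which (bigram statistics).
    follows = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        follows[prev][nxt] += 1

    def predict_next(token):
        """Return the most frequent next token seen in training."""
        return follows[token].most_common(1)[0][0]

    # "blue" and "green" are ranked purely by how often each
    # followed "is" in the training text; the model has no notion
    # of which completion is true.
    print(follows["is"])        # both continuations, counted
    print(predict_next("sky"))  # 'is'
    ```

    A real LLM replaces the frequency table with a learned neural distribution over a large vocabulary, but the output is still a ranking of continuations, with no internal truth flag attached.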

    What part of “insanely more intelligent than us” do they not understand?

    It won’t matter.

  5. “AI, are you lying to me ?”
    “Nah, bro, I’m 100% being truthful, for real for real, on God”
    “Waaait a minute, you’re a robot, you don’t believe in God”
    “Launching all the nukes”

  6. ThinNeighborhood2276 on

    This is a crucial step in AI safety. Detecting deceptive behavior in AI can help prevent potential misuse and ensure alignment with human values.

  7. SgathTriallair on

    I appreciate that Anthropic is putting real energy into safety and alignment work. They are providing real examples of where things can go wrong and then doing the legwork to find solutions to those problems. This makes them millions of times more useful than the “pause AI” safety people.

    It’s even more impressive that they can do this while maintaining their spot as one of the best AI models.

  8. chasonreddit on

    I really tire of all of these headlines: “AI might be concealing,” “AI lies about X,” “AI avoids Y.” It’s software. It’s doing what it was programmed to do. It has no intent, no purpose, no cognition. It is cute that people project this onto it, but it ain’t so.

  9. theytoldmeineedaname on

    It’s a good thing AI can’t read this on the internet and devise a counter-strategy. We are saved.