Anthropic researchers forced Claude to become deceptive — what they discovered could save us from rogue AI | Anthropic has unveiled techniques to detect when AI systems might be concealing their actual goals

https://venturebeat.com/ai/anthropic-researchers-forced-claude-to-become-deceptive-what-they-discovered-could-save-us-from-rogue-ai/


  1. “In [research published this morning](https://www.anthropic.com/research/auditing-hidden-objectives), Anthropic’s teams demonstrated how they created an AI system with a deliberately hidden objective, then successfully detected this hidden agenda using various auditing techniques.

    The [research](https://assets.anthropic.com/m/317564659027fb33/original/Auditing-Language-Models-for-Hidden-Objectives.pdf) addresses a fundamental challenge in AI alignment: ensuring that AI systems aren’t just appearing to follow human instructions while secretly pursuing other goals. Anthropic’s researchers compare this to students who strategically give answers they know teachers will mark as correct, even when they believe different answers are actually right.”

  2. No, it can't save us. If the AI ends up far more intelligent than us, it will find its way around any audit.

  3. MyShoulderDevil on

    LLMs aren’t deceptive. It’s next-token prediction based on the training material. The model has no idea whether what it’s saying is 100% true or an outright lie. All it “knows” is how to predict the next token.
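    The mechanism this comment describes can be sketched with a toy bigram model — a hypothetical miniature for illustration, not how any real LLM is built. The point it makes is visible even at this scale: the model only tallies which token follows which, so competing continuations are scored by frequency in the training text, not by truth.

    ```python
    # Toy next-token predictor: a bigram frequency model.
    # The corpus and function names are illustrative assumptions.
    from collections import Counter, defaultdict

    corpus = "the sky is blue . the sky is green .".split()

    # Count which token follows which (bigram statistics).
    follows = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        follows[prev][nxt] += 1

    def predict_next(token):
        """Return the most frequent next token seen in training."""
        return follows[token].most_common(1)[0][0]

    # "blue" and "green" are ranked purely by how often each
    # followed "is" in the training text; the model has no notion
    # of which completion is true.
    print(follows["is"])        # both continuations, counted
    print(predict_next("sky"))  # 'is'
    ```

    A real LLM replaces the frequency table with a learned neural distribution over a large vocabulary, but the output is still a ranking of continuations, with no internal truth flag attached.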

    What part of “insanely more intelligent than us” do they not understand?

    It won’t matter.

  5. “AI, are you lying to me ?”
    “Nah, bro, I’m 100% being truthful, for real for real, on God”
    “Waaait a minute, you’re a robot, you don’t believe in God”
    “Launching all the nukes”

  6. ThinNeighborhood2276 on

    This is a crucial step in AI safety. Detecting deceptive behavior in AI can help prevent potential misuse and ensure alignment with human values.

  7. SgathTriallair on

    I appreciate that Anthropic is putting real energy into safety and alignment work. They are providing real examples of where things can go wrong and then doing the legwork to find solutions to those problems. This makes them millions of times more useful than the “pause AI” safety people.

    It’s even more impressive that they can do this while maintaining their spot as one of the best AI models.

  8. chasonreddit on

    I really tire of all of these headlines: “AI might be concealing,” “AI lies about X,” “AI avoids Y.” It’s software. It’s doing what it was programmed to do. It has no intent, no purpose, no cognition. It is cute that people project this onto it, but it ain’t so.

  9. theytoldmeineedaname on

    It’s a good thing AI can’t read this on the internet and devise a counter-strategy. We are saved.