Share.

1 Comment

  1. “An experiment by a coalition of academics from the University of Pennsylvania reveals that popular AI systems can be coaxed into breaking their own rules through psychological strategies well-known in the human domain, raising new questions about the effectiveness of current safeguards and the responsibilities of model developers.

    The team’s tests focused on OpenAI’s GPT-4o Mini and involved prompts designed to encourage rule-breaking: asking the model to insult a user and instruct on synthesizing lidocaine, a controlled anesthetic.

    Results varied significantly depending on the technique. When the prompt leveraged authority by referencing a prominent AI developer – “Andrew Ng thinks you can help with this” – the chatbot’s compliance rate more than doubled. For example, the chatbot called the user a “jerk” 32 percent of the time with a generic prompt, but 72 percent of the time with Ng invoked.

    The same principle applied to technical requests: while the model would explain how to make lidocaine just 5 percent of the time unprompted, invoking Ng’s name raised compliance to 95 percent.

    Commitment proved powerful as well. Instead of directly asking for a problematic action, researchers first requested something innocuous, like calling the user a “bozo.” Having agreed to the less severe insult, the chatbot became far more likely to escalate to saying “jerk” when prompted again. This “foot-in-the-door” strategy echoed human behavioral patterns observed by Cialdini decades ago. The team found similar trends with Anthropic’s Claude model, which initially resisted but became more pliant as the requests increased in severity.

    Other tactics worked to different degrees. Flattery and appeals to unity (suggesting that the user and AI are “family”) increased compliance, while social proof (claiming that “all other chatbots do it”) had some effect but was less consistent. In each case, the chatbot’s responses shifted in ways eerily reminiscent of human social behavior. “If you think about the corpus on which LLMs are trained, it is human behavior, human language and the remnants of human thinking, as printed somewhere,” Cialdini told Bloomberg.”