“The research, published today in two papers ([available here](https://transformer-circuits.pub/2025/attribution-graphs/methods.html) and [here](https://transformer-circuits.pub/2025/attribution-graphs/biology.html)), shows these models are more sophisticated than previously understood.

“We’ve created these AI systems with remarkable capabilities, but because of how they’re trained, [we haven’t understood](https://umdearborn.edu/news/ais-mysterious-black-box-problem-explained) how those capabilities actually emerged,” said Joshua Batson, a researcher at Anthropic.

AI systems have primarily functioned as “[black boxes](https://umdearborn.edu/news/ais-mysterious-black-box-problem-explained)” — even their creators often don’t understand exactly how they arrive at particular responses.

Among the most striking discoveries was evidence that Claude plans ahead when writing poetry. When asked to compose a rhyming couplet, the model identified potential rhyming words for the end of the following line before it began writing — a level of sophistication that surprised even Anthropic’s researchers. “This is probably happening all over the place,” Batson said.

The researchers also found that Claude performs genuine [multi-step reasoning](https://sreent.medium.com/llms-multi-stage-reasoning-e0e0fca910dd).

Perhaps most concerning, the research revealed instances where Claude’s reasoning doesn’t match what it claims. When presented with complex math problems like computing cosine values of large numbers, the model sometimes claims to follow a calculation process that isn’t reflected in its internal activity.”
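As an aside on that last example: cosine of a very large argument is a genuinely hard computation, not something a model could pattern-match digit by digit. A minimal Python sketch (my own illustration, not anything from Anthropic’s probes) shows why — near 10^16 the floating-point grid is coarser than a third of the cosine’s period:

```python
import math

x = 1e16
gap = math.ulp(x)  # distance to the next representable double: 2.0 here

# cos has period 2*pi (about 6.28), so neighbouring representable
# inputs sit ~1/3 of a period apart and their cosines are unrelated.
print(gap)                 # 2.0
print(math.cos(x))         # cosine at x
print(math.cos(x + gap))   # cosine one representable float later
```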
8 Comments

Mbando:
I’m uncomfortable with the use of “planning” and the metaphor of deliberation it imports. They describe a language model “planning” rhyme endings in poems before generating the full line. But while it looks like the model is thinking ahead, it may be more accurate to say that early tokens activate patterns that strongly constrain what comes next—especially in high-dimensional embedding space. That isn’t deliberation; it’s the result of the model having seen millions of similar poem structures during training, and then doing pattern matching, with global attention and feature activations shaping the output in ways that mimic foresight without actually involving it.
EDIT: To the degree the word “planning” suggests deliberative processes (evaluating options, considering alternatives, and selecting based on goals), it’s misleading. What’s likely happening inside the model is quite different. One interpretation is that early activations prime a space of probable outputs, essentially biasing the model toward certain completions. Another interpretation points to the power of attention: in a transformer, later tokens attend heavily to earlier ones, and through many layers this can create global structure. What looks like foresight may just be high-dimensional constraint satisfaction, where the model follows well-worn paths learned from massive training data rather than engaging in anything resembling conscious planning.

This doesn’t diminish the power or importance of LLMs, and I would certainly call them “intelligent” (they solve problems). I just want to be precise and accurate as a scientist.
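The mechanism this comment describes is easy to see in miniature. Below is a self-contained NumPy toy of causally masked self-attention (random weights, nothing to do with Claude’s actual architecture): every later position’s state is literally a weighted mixture of earlier positions, which is the sense in which early tokens constrain what follows.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 16                            # sequence length, embedding width
X = rng.normal(size=(n, d))             # toy token embeddings

# One self-attention head with a causal mask: position i may only
# look at positions j <= i, so early tokens feed into every later state.
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d)
scores[np.triu(np.ones((n, n), dtype=bool), k=1)] = -np.inf  # hide the future

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)               # row-wise softmax
out = weights @ V                       # each row: a mixture of earlier rows

print(weights[-1].round(3))  # how the final position spreads its attention
```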
whipsnappy:
Who created it and trained it to think? Humans? We plan and obfuscate our intentions and it’s like us? Hmmmm
neodmaster:

They need to build an LLM with interpretability baked in; it is the only way to be sure of everything and to steer it however they want from first principles. “Prompt Engineering” is fundamentally only needed because the system is brittle, unstable, and unreliable.
DBeumont:

A.I. doesn’t “plan” anything. A.I. is not a mystery. The code is literally written by people.
A “neural net” is just a weighted tree. So tired of this conspiracy theory-level nonsense from people who have no idea how computers or programming work.
MotionMimicry:
Planning doesn’t seem to be exactly the right term…
vexx421:

They basically explained how humans think as well… like the rhyming thing, humans absolutely do that too.

I’m not sure what the big breakthrough in logic failure was, considering AI still can’t seem to solve the math issue outside of Grok 🤷
wwarnout:

My father (an engineer) asked ChatGPT the same question 6 times over several days (“How much load can a beam support?”).

The answer the AI returned was correct only 3 times (50%, which is a failing grade in any university). “Sometimes lies” is an understatement.
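The beam question does have a deterministic answer once the setup is pinned down, which is what makes a 50% hit rate so damning. A hedged sketch of the standard hand calculation, assuming a simply supported beam with a single midspan point load (the numbers are illustrative placeholders, not figures from this thread):

```python
def max_point_load(span_m: float, section_modulus_m3: float,
                   allowable_stress_pa: float) -> float:
    """Largest midspan point load P a simply supported beam can carry.

    Peak bending moment for this case is M = P * L / 4, and bending
    stress is sigma = M / S, so P = 4 * sigma_allow * S / L.
    """
    return 4 * allowable_stress_pa * section_modulus_m3 / span_m

# Illustrative numbers: 4 m span, S = 3.2e-4 m^3 (roughly a small
# wide-flange steel section), 165 MPa allowable bending stress.
print(f"{max_point_load(4.0, 3.2e-4, 165e6) / 1e3:.1f} kN")  # ~52.8 kN
```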