AI giants score below 25% in UC Berkeley-led test of real-world application

In collaboration with more than 300 industry experts, UC Berkeley researchers have released a new benchmark testing AI capabilities in more than 50 industries. Of the models tested, OpenAI’s GPT-5.5 scored the highest, but with a pass rate of only 24%.

The benchmark, dubbed Agents’ Last Exam, or ALE, is led by the Berkeley Center for Responsible, Decentralized Intelligence. The exam assigns tasks spanning subjects from audio processing to theoretical physics.

A rival model, Anthropic’s Claude Fable 5, followed GPT-5.5 with a 22% overall pass rate, while Google Gemini, DeepSeek and Grok all scored below 16%. The pass rate measures the share of runs in which an AI agent gets a perfect score across all tasks.

The UC Berkeley center is co-directed by computer science professor Dawn Song and Haas School of Business professor Christine Parlour. The ALE project has 13 advisers from academia and industry, across multiple universities and companies.

“If you want to spread the impact (of AI) to (other) domains, (you) need to set up the correct evaluation system to track what’s important and what’s actually GDP relevant,” said Yiyou Sun, a UC Berkeley postdoc who leads the ALE project from Song’s group. “These tasks are actual jobs that experts have worked on.”

Compared to other benchmarks, ALE covers a wider variety of disciplines to test AI models from a number of mainstream labs such as OpenAI and Anthropic. The tasks assigned to agents are routinely updated in an effort to minimize contamination, which occurs when the data used to train models overlaps with the data used to evaluate them, inflating performance results.

University of Southern California materials science professor and ALE collaborator Zhenglu Li said the benchmark is unbiased because the contributors working on ALE aren’t affiliated with any particular AI company, meaning the tests aren’t designed for any specific model. He added that while companies can fine-tune models to perform well on model-specific benchmarks, those models might not perform as well on some general tasks.

The pass rates of these models aren’t high, which Li attributes to a lack of people from different disciplines currently working to train AI models.

However, the benchmark results still concern some scientists and academics.

“My bigger concern is not the pass rate but the way agents fail,” said Benjamin Liu, a Stanford University computer science Ph.D. student and test collaborator, in an email. “They often produce an answer that looks completely plausible but is subtly wrong, and in science a confident wrong answer is more dangerous than no answer, because someone might build on it.”

These results come during a high-stakes month for large AI companies OpenAI and Anthropic. The two competitors filed for initial public offerings earlier in June, and last Friday, Anthropic received a federal warning that caused it to shut down access to its latest models.

“I think having a benchmark where all the frontier leading models are sitting at 20% is a good incentive for these models to continue becoming better,” said Kunyang (Oliver) Sun, a project collaborator and postdoc studying computational chemistry at UC Berkeley. “(ALE) is really setting the standard … these are the tasks that are relevant to scientists.”
