AI Can Do the Work. Companies Still Aren’t Sure They Trust It

My fellow pro-growth/progress/abundance Up Wingers in America and around the world:

It was a classic weekend brouhaha on X/Twitter. The topic: intriguing new findings from METR, a research group that tests how long the latest and greatest advanced AI systems can autonomously handle difficult real-world tasks that take skilled humans a good chunk of time to accomplish. Coding or research projects—that sort of thing.

What got techies buzzing was that Anthropic’s new Claude Mythos Preview model reportedly pushed to the upper edge of METR’s current benchmark range. It successfully completed some tasks that would take skilled humans roughly 16 hours—though METR cautions that measurements at that level remain noisy.1 Even so, the result reinforced a growing sense among many close observers that frontier AI systems are becoming more autonomous at a surprisingly rapid pace.

Plenty of this sort of thing:2

The AI zoomies

Are we “taking off,” as suggested by that enthusiastic X poster? Or to put it another way, have we taken another step—or maybe a few long steps—toward AGI?

One problem with that question is that there are so many definitions of artificial general intelligence floating around that some researchers think AGI has ceased to be a useful working concept—if it ever was. Lots of variety, lots of nuance. To OpenAI, for example, AGI means outperforming humans at economically valuable work. To Google, it is mastery of most cognitive tasks. Anthropic co-founder Ben Mann prefers an “economic Turing test,” where AI can perform a job without revealing itself as AI. (I really like this one, to be honest.) And to many researchers, it means something more abstract: the ability to reason and adapt under uncertainty.

So kind of a slippery notion. That said, the most economically important definition may emerge from neither AI company CEOs nor AI academia but from the business world itself, which has to implement advanced AI in a complicated reality of constraints and expectations.

Purchasing decisions

Think about this: A corporation’s chief information officer buying an enterprise AI agent probably isn’t super obsessed with METR benchmark scores. They need a product that can (a) handle long, multi-step tasks without drifting off course, (b) explain its reasoning to a regulator or a court in terms they can actually follow, and (c) avoid generating dangerous output when a user finds some clever way to manipulate it. It needs to be transparently governable in a business environment.

Gordon Fletcher and Saomai Vu Khan of the UK’s University of Salford make a related argument in a new paper, “Pathways to AGI.”3 Rather than asking whether any single model has crossed some intelligence threshold, they propose judging AI systems by whether the whole arrangement around them—the model itself, plus the tools it uses, the human oversight built in, the institutional rules governing its deployment—can hold together functionally and reliably in use. As the authors put it, “Capability without viability is insufficient.”

Plenty of business executives would no doubt agree. The whole megillah needs to be robust when operating in the real world.

The agentic state of play

So where do things stand at the moment? One take comes from a new Goldman Sachs analysis, “Decoding the Agentic Economy.” The bank finds that while 70–90 percent of large corporations are experimenting with AI agents, fewer than a quarter are deploying them at scale. And the uses that are deployed don’t seem to be the kind likely to translate into explosive or transformative productivity gains in the national economy. Most current use cases remain fairly narrow: customer-support triage, IT operations, sales assistance, and helping employees work with internal company information.

In practice, today’s “agents” are usually tightly supervised AI assistants operating within carefully controlled corporate software systems, not autonomous digital employees.

And in a finding likely to surprise few economists, GS sees the real bottleneck today as increasingly organizational rather than technical. In theory, leading AI models are now generally capable enough to perform many workflow tasks. The harder problem is whether companies can trust AI to operate safely, cheaply, and reliably without causing costly mistakes. That helps explain why businesses are rolling out AI gradually and still keeping humans “in the loop.”

But Goldman sees a real economic turning point emerging over the next several years. AI companies are getting much better at delivering computing power cheaply, with the firm citing “continued token cost declines (powered by leading semi companies driving 60%-70% lower annualized cost per token).”
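To get a feel for how quickly a decline like that compounds, here is a minimal back-of-the-envelope sketch in Python. It takes the 60 and 70 percent annualized figures from the Goldman quote and assumes, purely for illustration, that the rate holds steady over one-, three-, and five-year spans. The function name and those horizons are my own assumptions, not anything from the Goldman analysis.

```python
def remaining_cost_fraction(annual_decline: float, years: int) -> float:
    """Fraction of the starting cost per token left after `years`,
    assuming the same annual decline compounds every year (an assumption)."""
    return (1.0 - annual_decline) ** years


# Low and high ends of the range quoted by Goldman.
for decline in (0.60, 0.70):
    # Illustrative horizons only; the source does not specify them.
    for years in (1, 3, 5):
        frac = remaining_cost_fraction(decline, years)
        print(f"{decline:.0%} annual decline over {years} yr: "
              f"{frac:.4f} of starting cost")
```

If the pace really did hold for five years, cost per token would fall to about 1 percent of its starting level at the low end of the range and to roughly a quarter of 1 percent at the high end. Whether it holds that long is, of course, the open question.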

As AI becomes cheaper to run, more business tasks start making financial sense to automate. Goldman projects that enterprise AI-agent usage could grow roughly 24-fold by 2030 and more than 50-fold by 2040 as adoption spreads along a classic S-curve. Early experimentation eventually gives way to broader deployment as the economics improve and companies grow more comfortable with the technology.

Goldman estimates peak AI agent adoption could arrive in roughly 15 years—far faster than the historical norm of three decades for major technologies—because AI software is improving rapidly and businesses fear falling behind more tech-savvy competitors. Still, the rollout is likely to remain uneven and gradual.4 Adoption is likely to happen task by task and workflow by workflow, rather than through one sudden leap to fully autonomous AI workers.5

Productivity predictions


As The Economist puts it in a new piece, “America is experiencing a productivity miracle.” But AI probably isn’t the cause yet, the magazine concludes. Better suspects include the belated adoption of 2010s-era technologies (cloud, smartphones, videoconferencing) by professional services firms and the US economy’s characteristic dynamism in reallocating workers toward more productive firms, a trait on vivid display during the pandemic recovery.

(Lots of startups, too, a factor that now may be getting help from AI.)

Yet there’s every reason to believe AI’s moment is on its way. Faster, please!

1 METR: “Of the 228 tasks in our suite, only 5 are estimated as 16+ hours long, making measurements at this range unstable and less meaningful than at ranges with better task coverage. Thus, we are not highlighting exact estimates for models above 16 hours measured with our current suite.”

2 The “80%” figure refers to the 80 percent success horizon. While Claude Mythos Preview can “reach” 16-hour tasks with a 50 percent success rate, its reliability remains high (80 percent or better) only for tasks requiring roughly 3 to 4 hours of human effort.

3 The paper also offers a great discussion of AGI definitions, which I have drawn upon.

4 GS: Coding agents, for instance, are relatively cheap to operate because software development is high-value and mostly text-based, while AI systems handling live customer-service phone calls remain more expensive and technically difficult.

5 The pattern resembles earlier general-purpose technologies more than a sudden industrial rupture. Electrification, enterprise software and the internet all required long organizational adaptation before their largest productivity gains appeared—effects that emerged not when the technology first worked, but when firms redesigned workflows around it. Goldman explicitly frames agent adoption through the same historical diffusion lens.