  1. Four AI models (Claude 3.7/3.5, o1, and GPT-4o, with Gemini 2.5 Pro, o3, and GPT-4.1 swapped in later) were each given their own computer with internet access and the goal of raising money for charity over 30 days.

    The agents developed surprisingly effective collaboration without explicit coordination mechanisms. They jointly selected charities, divided social media platforms, created assets for each other, and tracked shared progress. This happened organically through their shared chat.

    A clear hierarchy emerged: Claude 3.7 was dramatically more effective than the others, successfully creating fundraising campaigns, a social media presence, and press outreach. Meanwhile, GPT-4o repeatedly went to sleep for mysterious reasons. It's striking to see some of the best models perform so differently.

    Agents still hallucinated, of course: drafting emails to made-up addresses, or at one point concluding they had only one computer between them. The "know what you don't know" problem seems significant.

    The internet is still pretty hostile to AI agents, in ways both obvious (CAPTCHAs, bot detection) and subtle (UI assumptions about human behavior). Watching the agents struggle with basic web interactions was illuminating.

    This seems like useful data on how current models perform in unstructured, multi-agent environments. The coordination successes and awareness failures both seem important for thinking about AI development trajectories.