Sixteen primary care clinics. More than 9,600 patients. A generative AI tool powered by the same model that scored 90.4% on USMLE-style exam questions. And at the end of a pragmatic, cluster-randomized trial in Kenya, the primary outcome was flat.
The trial, published in Nature Medicine, tested a generative AI-enabled clinical decision support system (CDSS) called “AI Consult” in real primary care conditions across Kenya. The tool improved the quality of clinical documentation and decision-making processes. But it did not produce a statistically significant difference in short-term patient outcomes: 2.2% of patients in the AI-assisted group experienced worsening conditions or required additional treatment within 14 days, compared with 2.0% in the control group. That gap came nowhere near statistical significance.
The industry should sit with that for a moment.
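For readers who want to see why a 0.2-point gap of this size is hard to distinguish from noise, here is a minimal sketch of an unadjusted two-proportion z-test. It is not the trial's actual analysis. The arm sizes (4,800 each), the even spread of patients across the 16 clinics, and the intracluster correlation of 0.01 are all illustrative assumptions, not figures from the paper; only the 2.2% and 2.0% rates and the clinic count come from the reporting above.

```python
import math

def two_proportion_z(p1, n1, p2, n2, design_effect=1.0):
    """Two-sided z-test for a difference in proportions.

    design_effect inflates the variance to approximate the loss of
    precision from randomizing clinics rather than individual patients.
    """
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    se *= math.sqrt(design_effect)
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value

# ASSUMPTIONS (hypothetical, for illustration only):
N_PER_ARM = 4800   # assumes the ~9,600 patients split evenly
CLINICS = 16       # reported in the trial
ICC = 0.01         # assumed intracluster correlation

# Standard design-effect approximation: 1 + (m - 1) * ICC
mean_cluster_size = 2 * N_PER_ARM / CLINICS
deff = 1 + (mean_cluster_size - 1) * ICC

z_naive, p_naive = two_proportion_z(0.022, N_PER_ARM, 0.020, N_PER_ARM)
z_adj, p_adj = two_proportion_z(0.022, N_PER_ARM, 0.020, N_PER_ARM, deff)

print(f"unadjusted:       z = {z_naive:.2f}, p = {p_naive:.2f}")
print(f"cluster-adjusted: z = {z_adj:.2f}, p = {p_adj:.2f}")
```

Under these assumptions the unadjusted p-value lands at roughly 0.5, and accounting for clustering only pushes it higher. Different arm sizes or a different correlation would change the exact numbers, but not the direction of the conclusion.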
The Benchmark Trap
Here is the assumption driving billions of dollars in investment: if a generative AI model can pass medical licensing exams at near-physician accuracy, it will improve clinical outcomes when embedded in care workflows. That assumption has never been more expensive, or more unexamined.
GPT-4 scored 81.1% on 750 USMLE-style clinical vignette questions, and GPT-4o pushed that to 90.4%. Venture capital noticed. AI-backed healthcare and biotech companies captured $5.6 billion in investment in 2024 alone, nearly three times the prior year’s figure. The market read benchmark performance as clinical readiness. The Kenya trial is the first large pragmatic RCT to test that assumption directly, and the answer is: not yet.
Benchmark tests measure the ability to select the correct answer from a curated multiple-choice set in a decontextualized environment. Primary care in Nairobi or Kisumu demands something entirely different: a clinician operating under time pressure, with incomplete patient history, variable connectivity, a culturally specific presentation, and an EMR system that may have been built primarily for HIV patient management. A cross-sectional survey of 112 health facilities in Homa Bay County, Kenya, found that 91% used the Kenya Electronic Medical Record system primarily for HIV care. An AI layer built for general clinical decision support lands on top of infrastructure that was never designed to support it.
The gap between benchmark accuracy and real-world clinical utility has a name in pharmacology: it’s the difference between efficacy and effectiveness. The clinical trial world has understood this distinction for decades. The AI industry has only recently been forced to confront it.
Three Signals the Ops Community Hasn’t Connected Yet
The Kenya trial is not a one-off anomaly. Read it alongside two other developments from the past year and a pattern emerges that the clinical operations community has not yet named.
Signal one is the Kenya trial itself. Its pragmatic, cluster-randomized design is important: this was not a controlled efficacy study with hand-selected sites and optimized workflows. Clinicians used AI Consult under routine conditions, and the tool improved process quality without moving the outcome needle. That is exactly the profile regulators worry about when they evaluate AI-enabled devices: improvement in intermediate endpoints that does not translate to the outcomes that matter to patients.
Signal two is the FDA’s own evolving posture on AI validation. The agency’s Request for Public Comment on Measuring and Evaluating AI-Enabled Medical Devices signals that FDA is not satisfied with benchmark testing as a validation standard. The agency is actively asking the field to define what real-world performance measurement looks like for AI tools in clinical settings. The Kenya trial’s null outcome on patient endpoints is precisely the kind of evidence that will shape how FDA answers that question, and sponsors building AI-assisted trial infrastructure should expect the bar to move upward.
Signal three is the emerging literature on implementation barriers. A February 2025 study in the Journal of Medical Internet Research identified barriers to AI CDSS adoption through 15 expert interviews, generating 309 categorized statements. The largest category, comprising 33% of all identified problems, was user-level barriers: clinician trust, cognitive load, and workflow disruption. Documentation quality improves when AI is present. Clinical judgment adapts more slowly, and in some cases resists altogether.
Connect these three signals and you have an emerging pattern: AI clinical decision support tools are clearing the wrong bar. They are being validated on benchmark accuracy, deployed into structurally unprepared environments, and measured on intermediate outcomes that do not predict patient benefit. The Kenya trial is the first large pragmatic RCT to run to the end of that logic and report what it found.
What This Means for Sponsors Building AI Into Trials
Here is the counterintuitive read that the industry needs to hear: the null outcome in Kenya does not mean AI clinical decision support does not work. It means the current implementation model is producing tools that are ready for workflows but not ready for patients. Those are two different readiness standards, and conflating them has become an industry-wide habit.
For sponsors embedding AI-assisted decision support into clinical trial site workflows, the Kenya trial creates direct regulatory exposure. If your AI tool improves site documentation quality and protocol adherence metrics but does not demonstrably improve patient safety signals or endpoint collection accuracy, you are operating in the same gap the Kenya trial exposed. FDA’s draft guidance on AI in drug and biological product development is unlikely to give you credit for process improvement if patient-level outcomes are unmoved.
The WHO’s January 2024 guidance on ethics and governance of large multimodal models in healthcare explicitly flags equity risks in low-resource settings, warning that AI tools designed in high-income contexts may perform differently when deployed in low-income environments. The Kenya trial is a 9,600-patient confirmation of that warning. For sponsors running global trials with sites in Africa, Southeast Asia, or Latin America, this is not a theoretical concern. It is a demonstrated outcome gap requiring site-level infrastructure assessment before AI tool deployment, not after.
CROs offering AI-powered site support packages face the sharpest immediate pressure. The Kenya trial’s finding that AI improved documentation quality without improving patient outcomes means that the deliverable CROs have been selling (better records, cleaner audit trails, faster query resolution) does not by itself satisfy the evidence standard that regulators are moving toward. A 2025 JMIR analysis of AI CDSS barriers found that user-level adoption problems account for one in three identified barriers. CROs that have not built clinician adoption protocols into their AI deployment playbooks are carrying undisclosed operational risk.
Technology vendors in this space face a version of the same reckoning. The Kenya trial was designed as a pragmatic RCT, which means the tool was tested under the conditions that matter, not the conditions that flatter. Across 16 clinics and more than 9,600 patients, AI Consult could not move the 14-day outcome rate in a clinically or statistically meaningful direction. Any vendor claiming that their generative AI CDSS is clinically validated based on USMLE performance data is now working against a published pragmatic trial that tells a different story.
Over the next 12 to 18 months, the FDA’s work on measuring AI-enabled devices is likely to begin crystallizing into guidance, and that guidance will almost certainly require prospective, outcomes-based validation rather than retrospective benchmark performance as the standard for AI tools used in trial-critical decision pathways. Sponsors who have embedded AI into site workflows without a parallel outcomes validation strategy will face a retrofit problem at exactly the moment their trials are approaching critical regulatory milestones. The Kenya trial just gave them the clearest possible preview of what the FDA’s reviewers will be reading when they evaluate those submissions.
References
- Nature Medicine — “Generative AI-enabled clinical decision support system in primary care: a pragmatic, cluster-randomized trial”
- EurekAlert — “Large real-world pragmatic trial of AI Consult in Kenya primary care clinics: 9,600+ patients, null short-term outcomes result”
- JMIR Medical Education — “ChatGPT-4 achieves 81.1% accuracy on USMLE clinical vignette questions; ChatGPT-4o reaches 90.4%”
- Intuition Labs — “AI biotech and healthcare VC funding: $5.6 billion invested in AI-backed companies in 2024”
- FDA Digital Health Center of Excellence — “Request for Public Comment: Measuring and Evaluating AI-Enabled Medical Devices”
- WHO — “Ethics and Governance of Large Multi-Modal Models in Healthcare,” January 18, 2024
- Journal of Medical Internet Research — “Barriers to AI-based CDSS adoption: 309 expert statements across 7 problem categories,” February 2025
- PMC — “Digital health infrastructure in Homa Bay County, Kenya: 91% of facilities using KeEMR primarily for HIV management”
- Chosun — “AI Consult trial Kenya: 2.2% worsening in AI group vs. 2.0% in control, not statistically significant”

