A Rare RCT of Artificial Intelligence in Medicine

Research Corner
May 23, 2023

Soleil Shah, MSc, Research Reporter

Soleil Shah writes Tradeoffs’ Research Corner, a weekly newsletter bringing you original analysis, interviews with leading researchers and more to help you stay on top of the latest health policy research.

There’s an avalanche of news and research about the growing role of artificial intelligence (AI) in health care.

But what does it take to rigorously vet AI tools for potential harms?

That’s the focus of today’s newsletter and the podcast later this week, where we’ll be releasing a two-part series on the challenge of rooting out racial bias in AI in health care.

A Rare RCT of Artificial Intelligence in Medicine

There’s a clear gold standard for testing new ways of diagnosing and treating diseases. It’s called a randomized controlled trial (RCT) and it allows researchers to measure how much an intervention — like a drug or medical device — affects a group of people compared to another ‘control’ group that didn’t receive it.

Participants are randomly assigned to one of the two groups, which end up with similar characteristics — like age or racial mix — so there’s less chance that any observed effect has to do with anything other than the treatment itself.

Yet despite being the most robust way to test whether a medical innovation really works, just 28 RCTs were done to evaluate artificial intelligence tools in medicine between 2009 and 2020.

So I was pretty amped to see a rigorous RCT of an AI-powered tool for cardiologists published by Bryan He and colleagues in Nature last month.

AI tool saves cardiologists time and prevents errors

He and colleagues set out to test whether an AI tool developed at Stanford University could interpret a measure of heart function that’s critical to diagnosing heart disease.

This measure, called LVEF, comes from a type of scan known as an echocardiogram. It’s typically interpreted first by a sonographer and then confirmed by cardiologists. Not infrequently, cardiologists need to correct a sonographer’s initial interpretation — a process that can take time and creates room for errors.

Could AI save cardiologists time and improve the accuracy of this important reading? 

To answer this, the researchers randomly assigned half of a pool of echocardiograms to the AI tool and half to a group of sonographers. Each was tasked with calculating the LVEF.

Then, experienced cardiologists reviewed the scans and measured the LVEF themselves. If the cardiologist’s LVEF differed from the initial reading by more than five percentage points, the researchers considered the measure “substantially changed.” 
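That classification rule boils down to a simple threshold check. Here is a minimal sketch, assuming the threshold is an absolute difference of more than 5 LVEF points (the function and variable names are hypothetical, not the study’s actual code):

```python
def substantially_changed(initial_lvef: float, final_lvef: float,
                          threshold: float = 5.0) -> bool:
    """Flag a scan as 'substantially changed' when the cardiologist's
    final LVEF differs from the initial read (AI or sonographer) by
    more than the threshold. Threshold of 5 points is an assumption
    based on the study's description."""
    return abs(final_lvef - initial_lvef) > threshold

# Example: an initial LVEF of 55 revised to 62 by the cardiologist
print(substantially_changed(55.0, 62.0))  # True: |62 - 55| = 7 > 5
print(substantially_changed(55.0, 58.0))  # False: |58 - 55| = 3
```

Counting how often this flag fires in each arm — AI-read versus sonographer-read — gives the trial’s primary comparison.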

The researchers also used a technique called ‘blinding’ to reduce the risk of bias. The cardiologists did not know whether the echocardiogram was previously read by AI or by a sonographer.

The researchers found that: 

  • Cardiologists substantially changed 27% of sonographer-read scans, compared with just 16% of AI-read scans.

  • More than twice as many sonographer-read scans were changed to a “clinically significant degree,” meaning the revision could actually change the patient’s treatment — for example, getting a defibrillator implanted instead of just getting medications.

  • The AI tool also saved cardiologists’ time — the median time spent per scan was nearly 20% lower for AI-read scans than for those read by human sonographers.

One limitation is that the study was done on a narrow set of 3,495 echocardiograms from a single health system. In addition, only 43 percent of the scans came from female patients, who, on average, have higher LVEFs than male patients. Would the AI tool perform as well in the general population, which is 51 percent female?

Randomized controlled trials are hard, but needed to prove benefit of AI in medicine 

It feels like AI tools are cropping up in every corner of health care, from operating rooms to physical rehab to billing departments.

More RCTs are needed for clinicians and regulators to better understand whether these increasingly common AI tools really do work.

Robust trials could also help raise alarms around a given AI tool’s potential risks well before it becomes widely used in patient care.

Tradeoffs’ coverage of diagnostic excellence is funded in part by the Gordon and Betty Moore Foundation.