Proceed With Caution
By Beth W. Orenstein
Radiology Today
Vol. 25 No. 5 P. 10

Personalizing AI may be one key to its success.

Radiologists’ workload has risen considerably over the past few decades with no end in sight. At the same time, staffing shortages are hindering the profession’s ability to accommodate current demands, according to the American College of Radiology.

“Radiologists’ workload is getting to the point where it’s barely manageable,” says Paul Chang, MD, a professor of radiology at the University of Chicago. “Our existing legacy tools, our PACS systems, our EMRs are inadequate to help us get through the day. We’re going to need some help.”

Could AI/machine learning/neural networks be a savior? Chang thinks so—with a big caveat: if done right. “We are going to need optimized human-machine collaboration to get through this,” he says.

However, how radiologists can best benefit from AI is much debated. A study published in March in Nature Medicine found that the benefits of using AI tools to help interpret images vary from clinician to clinician. “Some are helped while others are hurt by it,” says co-senior author Pranav Rajpurkar, PhD, an assistant professor of biomedical informatics in the Blavatnik Institute at Harvard Medical School.

Details Matter
A number of studies in recent years have shown that AI can boost a radiologist’s diagnostic performance. However, these studies look at radiologists as a whole and don’t consider how such factors as the radiologists’ area of specialty, years of practice, and familiarity with AI tools can affect their AI collaboration, Rajpurkar says.

In their study, the researchers from Harvard, MIT, and Stanford examined how AI tools affected the performance of 140 radiologists on 15 X-ray diagnostic tasks. They analyzed radiology reports involving 324 patient cases with 15 abnormal conditions found on chest X-rays. The researchers used advanced computational methods to compare the radiologists’ performance when using an AI model (CheXpert) and when not using it. What they found was that the effect of AI assistance was inconsistent: AI appeared to improve the performance of some radiologists while worsening the performance of others.

Rajpurkar says the CheXpert AI model used in the study was previously shown to perform comparably to board-certified radiologists on five pathologies. “In our experiment, the AI model outperformed approximately two-thirds of the radiologists, on average, with performance measured using area under the receiver operating characteristic curve (AUROC),” he says.
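For readers unfamiliar with the metric, AUROC summarizes how well a set of scores separates positive from negative cases across all decision thresholds, with 0.5 equal to chance and 1.0 a perfect ranking. The following is a minimal sketch of that kind of comparison using scikit-learn; the labels and scores are entirely made up for illustration.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical ground-truth labels (1 = finding present) for eight cases.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]

# Hypothetical AI predicted probabilities and radiologist confidence ratings.
ai_scores = [0.9, 0.2, 0.7, 0.8, 0.4, 0.1, 0.6, 0.3]
rad_scores = [0.8, 0.3, 0.4, 0.9, 0.5, 0.2, 0.7, 0.6]

print("AI AUROC:         ", roc_auc_score(y_true, ai_scores))
print("Radiologist AUROC:", roc_auc_score(y_true, rad_scores))
```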

Co-first author Feiyang “Kathy” Yu, MS, who worked on the project while she was a member of the Rajpurkar Lab at Harvard Medical School, cited two main findings on the variability in radiologists’ performance when using AI. First, she says, the accuracy of the AI itself on a given case influenced radiologist performance; more accurate AI predictions gave radiologists a bigger boost. Second, surprisingly, experience-based characteristics such as years of practice, thoracic radiology subspecialty, and prior experience with AI tools did not reliably predict how much a radiologist’s performance changed with AI assistance.

Based on their findings, the researchers recommend that radiologists use high-quality, extensively validated AI models selectively and with caution. Rajpurkar and his colleagues believe a personalized assessment of how an individual radiologist’s accuracy changes with the specific AI model would be an ideal way to determine whether and when to provide AI assistance to that radiologist. In any case, the AI predictions should be treated as complementary to the radiologist’s assessment, and not as an independent opinion to simply average together, Rajpurkar says.
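In code, the kind of personalized assessment Rajpurkar describes might look like the sketch below, which compares a single reader’s AUROC with and without AI assistance on a held-out set of cases and enables assistance only if it helps. The function name, margin parameter, and data are hypothetical illustrations, not part of the study.

```python
from sklearn.metrics import roc_auc_score

def enable_ai_for(reader_id, y_true, unaided_scores, assisted_scores, margin=0.0):
    """Return True if AI assistance improves this reader's AUROC by more than margin."""
    unaided = roc_auc_score(y_true, unaided_scores)
    assisted = roc_auc_score(y_true, assisted_scores)
    print(f"{reader_id}: unaided={unaided:.3f}, assisted={assisted:.3f}")
    return assisted - unaided > margin

# Toy validation data for one reader: labels plus confidence scores from
# reads done without and with the AI model's predictions visible.
y_true = [1, 0, 1, 0, 1, 1, 0, 0]
unaided = [0.60, 0.40, 0.70, 0.50, 0.55, 0.80, 0.30, 0.45]
assisted = [0.70, 0.30, 0.80, 0.40, 0.65, 0.85, 0.20, 0.35]

print(enable_ai_for("reader_007", y_true, unaided, assisted))
```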

The researchers found that AI interference with the radiologist’s accuracy occurred primarily when the AI made inaccurate predictions. “In some cases, inaccurate AI predictions actively misled radiologists and worsened their performance compared to reading the case on their own,” Yu says. “This is a real concern that underscores the importance of rigorous AI model testing and setting appropriate confidence thresholds before deployment.”

Yu says radiologists also need support in identifying potentially unreliable AI predictions, “perhaps through explanation methods that go beyond a simple probability to visualize feature importance or generate verbal justifications.” Recognizing AI errors is a key skill that radiologists must cultivate, she says.
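One simple form of the explanation methods Yu mentions is an input-gradient saliency map, which highlights the pixels that most influenced a model’s score. The sketch below uses PyTorch with a stand-in model; it is illustrative only and not the tooling used in the study.

```python
import torch
import torch.nn as nn

# Stand-in classifier for a single chest X-ray finding (hypothetical).
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 1),
)
model.eval()

image = torch.rand(1, 1, 224, 224, requires_grad=True)  # placeholder X-ray
model(image).sum().backward()  # gradient of the score with respect to each pixel

# Pixels with large gradient magnitude had the most influence on the score.
saliency = image.grad.abs().squeeze()
print(saliency.shape)  # a 224 x 224 heat map to overlay on the image
```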

The researchers are strong proponents of testing and validating AI tool performance before clinical deployment. “Before deploying an AI assistant, we recommend rigorously testing its performance on a dataset representative of the target patient population and case mix,” Rajpurkar says. “Crucially, the diagnostic accuracy of the radiologists actually using that AI assistant also should be directly measured in a controlled experiment or retrospective study, not simply assumed to be better than the AI or radiologist alone.” Rajpurkar adds that monitoring should continue after deployment to detect potential negative impacts.

Monitoring Output
Radiologists with a keen interest in and knowledge of AI agree with the study findings that AI presents many potential ramifications for medical imaging. Raym Geis, MD, a radiologist in Fort Collins, Colorado, affiliated with National Jewish Health, says AI can be helpful if it has proven to be trustworthy and provides accurate information about a radiology exam. “If a radiology AI tool is definitively shown to be trustworthy and reliable in specific situations in one’s own clinical setting, with its own individual quirks about how exams are performed and the patient population it serves, then radiologists (and others) will start to trust it and begin to understand when and how to use it,” Geis says.

However, he adds, in many cases, “it is unclear how reliable and trustworthy any given AI tool is. This is a challenging and poorly understood issue. The entire field of people involved in radiology computer vision AI, including theoretical computer science folks, people building radiology AI tools, imaging informatics in hospitals and health systems, and physicians, doesn’t have a good handle on how to verify how well a particular AI tool works in each unique clinical setting or on different patient populations.”

Radiologists’ use of AI when reading exams is “a little bit” similar to having a second radiologist give an opinion about the exam, Geis says. “The primary radiologist reading the exam doesn’t know for sure how knowledgeable or accurate that second person opining on the exam is,” he says. The “second opinion” can introduce an “automation bias,” where the radiologist places too much trust in the AI output. “Automation bias has been shown when using AI to read mammograms, where sometimes radiologists rely too heavily on AI and change their reports incorrectly based on those results,” Geis says.

Geis does not agree with the researchers’ suggestion that to maximize AI benefits, AI needs to be personalized. “It is hard enough to monitor the quality and accuracy of an AI tool for an individual hospital system or health care enterprise,” he says. “AI tools are not like other software. Their accuracy will vary in each setting due to variations unique to each situation. Imaging protocols and patient population differences are the obvious differences, but there are many others. Every AI tool’s trustworthiness will decline over time. And often, their results will deteriorate and even fail silently. As a result, clinical AI must be monitored much more rigorously than we are used to monitoring, even at a group or enterprise level. Monitoring AI tools at an individual level would be very resource intensive and isn’t a feasible recommendation at this point.”
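The ongoing monitoring Geis describes can be as simple as tracking a rolling performance metric as confirmed outcomes arrive and alerting when it drifts. The sketch below is a hypothetical illustration of that idea; the window size and alert threshold are assumptions, not recommendations from Geis.

```python
from collections import deque
from sklearn.metrics import roc_auc_score

class RollingAUROCMonitor:
    """Track a rolling AUROC for a deployed AI tool and flag silent degradation."""

    def __init__(self, window=200, alert_threshold=0.80):
        self.labels = deque(maxlen=window)  # confirmed ground truth as it arrives
        self.scores = deque(maxlen=window)  # the AI's predicted probabilities
        self.alert_threshold = alert_threshold

    def add_case(self, label, score):
        self.labels.append(label)
        self.scores.append(score)

    def check(self):
        if len(set(self.labels)) < 2:  # AUROC needs both classes present
            return None
        auc = roc_auc_score(list(self.labels), list(self.scores))
        if auc < self.alert_threshold:
            print(f"ALERT: rolling AUROC {auc:.3f} fell below {self.alert_threshold}")
        return auc

monitor = RollingAUROCMonitor()
monitor.add_case(1, 0.9)
monitor.add_case(0, 0.4)
print(monitor.check())
```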

Setting Boundaries
Bradley J. Erickson, MD, PhD, a professor of radiology and director of the Mayo Clinic’s AI Laboratory, says AI, much like computer-aided detection (CAD) before it, almost always helps physicians achieve midlevel performance. “Therefore,” he says, “if the case is out of that physician’s area of expertise, AI will likely be helpful.” However, he notes, AI may confuse the expert.

Many radiologists are “generalists” who are expected to read all types of imaging, Erickson says. “And so, for them, AI is likely to be useful. For those that only do subspecialty reads, AI is not likely to be helpful.” Back in the CAD days, Erickson adds, “This was seen when the CAD tools for mammography helped nonspecialists but often degraded the performance of experts.”

Regardless of what the AI tool says, Erickson believes most radiologists are best served by going with their gut when reading exams. “Like when taking a test, your first impression is usually right, but if you keep thinking about it, you often switch to the wrong answer,” he says.

Erickson surmises that AI would be most helpful not only when a case is outside the reader’s area of expertise but also for those in training, such as residents. Would, as the Nature Medicine authors suggest, personalizing AI systems help? Yes, he says, but with boundaries.

“AI systems span a broad range of applications, so I think this answer needs some boundaries,” Erickson says. “I think AI tools that retrieve and summarize information probably should be personalized because different people and specialties focus on different pieces of information. In that case, it is probably both at the specialty level and individual level.”

A major challenge with today’s AI tools is that they lack a “certainty” output, Erickson says. “That is, the AI should produce a number indicating if it is certain or not. I think this could explain much of the variability seen in this study and may also help physicians better use AI tools.”

Like Geis, Erickson says “automation bias” plays a role. Humans often trust the machine no matter what—like the people who follow their GPS and drive into a lake. “If there was a certainty value, that would likely help to address this problem,” Erickson says.

In a paper published in August 2023 in Radiology, Erickson and colleagues wrote that while deep learning (DL) has shown impressive performance in radiologic image analysis, “for a DL model to be useful in a real-world setting, its confidence in a prediction must also be known.” The authors believe that users must know the trustworthiness (validity) of the AI tool they are employing and suggest methods for determining its “uncertainty quantification.” By implementing uncertainty quantification in medical DL models, the authors say, users can be alerted if a model does not have enough information for them to make a confident decision with it. If users reevaluate uncertain cases, they add, it could lead to gaining more trust when using that AI model.
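One widely used way to attach an uncertainty estimate to a deep learning prediction is Monte Carlo dropout: keep dropout active at inference, run several stochastic forward passes, and treat the spread of the outputs as the uncertainty. The sketch below is a generic illustration of that idea, not the specific methods proposed in the Radiology paper; the model, sample count, and threshold are assumptions.

```python
import torch
import torch.nn as nn

# Hypothetical single-finding classifier with a dropout layer.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(224 * 224, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(64, 1),
    nn.Sigmoid(),
)
model.train()  # keep dropout active during the Monte Carlo passes

image = torch.rand(1, 1, 224, 224)  # placeholder chest X-ray
with torch.no_grad():
    samples = torch.stack([model(image) for _ in range(30)])

mean_prob = samples.mean().item()     # averaged prediction
uncertainty = samples.std().item()    # spread across passes = uncertainty signal
print(f"probability={mean_prob:.2f}, uncertainty={uncertainty:.2f}")

if uncertainty > 0.15:  # illustrative threshold
    print("Low-confidence prediction: flag the case for closer human review.")
```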

Human-Machine Workflow
Chang sees AI as being an arbiter when used properly. Some radiologists, he says, are overcallers and some undercallers. “If I am an overcaller,” he says, “I want an AI model that says, ‘cool your jets a little bit.’” And radiologists who tend to be undercallers could benefit from AI models that push in the opposite direction.
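A toy way to picture what Chang describes is to give each reader a different operating threshold on the AI’s probability output, raising it for overcallers and lowering it for undercallers. The function and numbers below are hypothetical.

```python
def operating_threshold(reader_profile, base_threshold=0.5, shift=0.15):
    """Return a per-reader decision threshold for the AI's probability output."""
    if reader_profile == "overcaller":
        return base_threshold + shift   # demand stronger evidence before flagging
    if reader_profile == "undercaller":
        return base_threshold - shift   # surface more borderline findings
    return base_threshold

ai_probability = 0.58  # the AI's score for a hypothetical case
for profile in ("overcaller", "undercaller", "neutral"):
    flag = ai_probability >= operating_threshold(profile)
    print(f"{profile}: flag finding = {flag}")
```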

AI models are best when they are individually tuned to how radiologists interpret studies. When it comes to AI algorithms, “one size doesn’t fit all,” Chang says. Unfortunately, “our current integration of human and AI interpretation is primitive. We are not going to fully leverage AI until we fully embrace a true real-time human-machine optimized workflow.”

Chang also believes that AI will “take off” when it shows it can help improve radiology efficiency. The C-suite, he says, usually prefers to invest in products/systems that save money or make employees more efficient. Improving accuracy? As cynical as it sounds, that is not a priority for the C-suite since quality is “assumed,” he says. However, for AI to improve efficiency, he says, it must not be seen as a peripheral—as it is now—but optimally integrated into the radiologists’ workflow.

Rajpurkar says none of the radiologists’ personal characteristics that his team looked at in their study reliably predicted how much a radiologist’s performance changes with AI assistance. “Our results suggest it will be challenging to eliminate biases based on specialty, prior experience, and prior AI use, as these factors did not consistently predict performance changes with AI,” he says.

The most effective way to assess fitness for using an AI assistant may be personalized testing of each radiologist’s response to the specific AI model being deployed rather than reliance on general characteristics, Rajpurkar says. “Another potential strategy is targeted training to help radiologists recognize and self-correct these common biases in how they utilize AI predictions.”

Rajpurkar says training radiologists to identify AI errors is a crucial part of its deployment. “This requires going beyond just overall accuracy metrics to examine failure modes,” he says. “Educational modules could present examples where the AI is inaccurate and prompt radiologists to explicitly write down features that justify trusting or distrusting the prediction before revealing the ground truth.”

Discussing the reasoning afterward can illuminate blind spots and build recognition of misleading predictions, he says. Checklists reminding radiologists to consider contradictory evidence in the image, patient history, or their own judgment could be embedded into the interface. “Periodic feedback on cases where they agreed with an inaccurate AI prediction also may be valuable for calibration,” Rajpurkar says. The goal, he says, is to develop habits of cross-checking and synthesizing AI input thoughtfully, not just copying it.

Beth W. Orenstein of Northampton, Pennsylvania, is a freelance medical writer and regular contributor to Radiology Today.