Time for a Chat About LLMs?
By Beth W. Orenstein
Radiology Today
Vol. 25 No. 8 P. 10
Can ChatGPT-4 help accurately interpret images and improve workflow?
Unlike other imaging techniques, ultrasound depends heavily on the skills of the operator. As ultrasound machines have become more portable, they are being used by more health care providers in a growing number of settings, including the emergency department (ED) and at the bedside. As portable ultrasound devices become more prevalent, the reliance on human expertise introduces unique challenges, says Laith Sultan, MD, MPH, a physician scientist in the department of radiology at Children’s Hospital of Philadelphia (CHOP).
Sultan has always been passionate about technological advancements, focusing much of his research on evaluating computer-aided analysis tools and machine learning to enhance the diagnostic value of ultrasound imaging. When ChatGPT-4, the latest iteration of ChatGPT, was released in March 2023, many expressed excitement about its capabilities in video and image analysis. Sultan was inspired to explore its potential in ultrasound image analysis and AI-assisted diagnostics, and he began developing research projects with his team at CHOP Radiology and colleagues in radiology at the University of Pennsylvania.
Sultan led a team testing ChatGPT-4’s multimodal analysis capabilities and studying how to translate them to medical image analysis. The project began in mid-2023 and is ongoing. The team’s first study, published in March 2024 in Radiology Advances, used ChatGPT-4 to analyze a thyroid ultrasound image containing a nodule. The model successfully identified and marked the lesion, provided a differential diagnosis, and accurately segmented the thyroid gland and the lesion.
“This showcases ChatGPT’s potential in enhancing medical imaging analysis,” Sultan says. The researchers also tested ChatGPT-4’s ability to differentiate between normal and abnormal renal ultrasound images. “The model effectively distinguished normal kidneys from those with urinary tract dilation, highlighting its potential utility in clinical settings,” Sultan says. The early results exceeded his expectations. “I honestly was surprised how successful it was for medical interpretations,” he says.
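The published study does not include code, but the basic interaction it describes, submitting an image and a question to a multimodal model, is easy to sketch. Below is a minimal, hypothetical example using the OpenAI Python SDK; the model name, prompt wording, and file name are illustrative assumptions rather than the CHOP team’s actual protocol.

```python
import base64
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def analyze_ultrasound(image_path: str, prompt: str) -> str:
    """Send a de-identified ultrasound frame plus a question to a
    vision-capable model and return its free-text answer."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative vision-capable model choice
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Hypothetical file and prompt, echoing the thyroid example above
print(analyze_ultrasound(
    "thyroid_nodule.png",
    "Describe any focal lesion in this thyroid ultrasound and "
    "provide a differential diagnosis.",
))
```

Tasks such as lesion marking and segmentation require more back-and-forth than a single call like this, which is part of the technical friction the researchers describe later in the article.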
Although the study was limited, Sultan believes it suggests that ChatGPT-4, with its advanced capabilities, user-friendly interface, accessibility, and cost-effectiveness, could become a valuable tool in clinical practice as a computer-assisted diagnosis aid. “It can assist radiologists in diagnostic decision-making, image interpretation, and report generation,” he says. In their paper, the researchers write that ChatGPT-4 may be capable of distinguishing normal cases from abnormal ones, which could significantly reduce radiologists’ workload by allowing them to focus primarily on abnormal cases.
Possible Triage Tool
Since the COVID-19 pandemic, Sultan says, ultrasound has been used more in EDs and intensive care units to monitor patients because it’s accessible, portable, and less expensive than other imaging modalities. With providers outside of radiology using ultrasound, AI tools such as ChatGPT-4 that can help detect abnormalities are needed more than ever. Sultan says he sees the potential for ChatGPT-4 to help triage cases in the ED and elsewhere.
Kassa Darge, MD, PhD, chair of the department of radiology and radiologist in chief at CHOP, agrees that ChatGPT-4 has the potential to help not only busy radiology departments but also providers who use ultrasound but may be less proficient with the modality. “For the inexperienced,” Darge says, “it’s even more important to be able to distinguish normal vs abnormal, and ChatGPT-4 could be helpful in alerting staff to which images may be abnormal.”
Additionally, Sultan says, ChatGPT-4 could simplify radiology reports, making them more understandable for patients and thereby improving communication. Sultan also believes that ChatGPT could serve as an educational tool for those training in radiology by providing guidance during ultrasound scanning and supporting their diagnostic training.
Security, Technical Challenges
One of the challenges the researchers encountered was evaluating medical images while ensuring data security and privacy, particularly with larger datasets. To address this challenge, “We utilized anonymized images without patient information,” Sultan says. “To address these concerns in larger datasets, we relied on ultrasound images acquired from preclinical studies, which provided a more secure and compliant approach.”
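The article does not detail the team’s de-identification workflow. For DICOM files, one common approach is to blank identifying tags with the pydicom library before images leave the institution; the sketch below is a minimal, assumed version of that idea, not the researchers’ actual pipeline, and a production system would follow the full DICOM PS3.15 confidentiality profile.

```python
import pydicom  # pip install pydicom

# A few commonly identifying attributes; a real pipeline would cover
# the complete DICOM PS3.15 de-identification profile.
IDENTIFYING_TAGS = [
    "PatientName", "PatientID", "PatientBirthDate",
    "PatientAddress", "ReferringPhysicianName", "InstitutionName",
]

def anonymize(src_path: str, dst_path: str) -> None:
    """Blank common identifying tags and drop private elements."""
    ds = pydicom.dcmread(src_path)
    for keyword in IDENTIFYING_TAGS:
        if hasattr(ds, keyword):
            setattr(ds, keyword, "")
    ds.remove_private_tags()  # vendor-specific elements may hold PHI
    ds.save_as(dst_path)

anonymize("raw_exam.dcm", "anon_exam.dcm")  # hypothetical file names
```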
Another challenge was technical, Sultan says. It stemmed from the fact that ChatGPT-4 isn’t specifically designed for medical image analysis. For example, selecting regions of interest was difficult due to the absence of manual tools for precise localization. Additionally, “We faced technical issues related to ChatGPT’s capabilities in handling large-scale data analysis,” Sultan says.
To overcome this, the researchers had to fine-tune the model and make several adjustments to extract the required information, Sultan says. However, he believes that as the large language model (LLM) is trained on medical imaging, its performance will improve. While “there are some limitations with how to outline the regions that you want for analysis, in general,” Sultan says, “it was still really easy. In addition, in my opinion, the performance was impressive.”
Since the research group’s first paper was published in Radiology Advances, the researchers have made more progress. They submitted a second paper to another journal, which is currently under review. Additionally, they have analyzed a large dataset of ultrasound images of liver disease, focusing on ultrasound radiomics and comparing the results with those obtained from established conventional image analysis software. Sultan believes their findings to date “indicate that ChatGPT-4 provides results comparable to traditional methods.” Continued research is needed, he says, but the promise afforded by ChatGPT-4 “invites continuous innovation and thoughtful integration into our health care framework.”
Testing Applicability
Darge believes that the most important consideration with all AI apps is to test them first with the department’s own staff and scanners. A study such as this one, while impressive, does not mean that ChatGPT-4 would be applicable in all cases, he says. For example, it may not work as well in pediatrics, where the age and tissue makeup of patients are highly variable. “ChatGPT-4 will have to be tested again and again to be universally applicable,” he says. “To think that the first iteration will apply universally is not correct.”
Like Sultan, Theodore T. Pierce, MD, MPH, an assistant professor of radiology at Massachusetts General Hospital in Boston, believes more research is needed before LLMs such as ChatGPT-4 are incorporated in radiologists’ workflow. However, Pierce agrees with Sultan that, in time, they will have an important place in the field. “I think there are concrete applications for how ChatGPT-4 and other LLMs can be incorporated into practice,” he says.
One application that Pierce envisions is using ChatGPT-4 to convert image data into draft text reports. Currently, this task is done manually and is a large component of what radiologists do on a day-to-day basis. Were an AI program such as ChatGPT-4 able to handle this task, it would free radiologists to do more “important” work requiring critical thinking and judgment, he says.
Pierce is intrigued by the CHOP study’s finding that ChatGPT-4 can help identify normal vs abnormal tissue in ultrasound images. “From a medical/legal point of view, I don’t foresee a time—at least in the near future—where radiologists could hand off diagnostic decision-making responsibility to an algorithm,” Pierce says. However, ChatGPT-4 “might help prioritize” exams that need radiologists’ attention, especially in a busy ED.
“ChatGPT-4 could identify a critical finding among a large number of exams and prioritize the time-sensitive exam to the top of the worklist,” Pierce says. “Those exams that have a much lower likelihood of being positive could be deferred until high-priority exams are addressed.”
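Pierce is describing workflow logic rather than a shipped product. A hypothetical sketch of that logic in Python, assuming each exam already carries a model-estimated likelihood of a critical finding (the threshold value is illustrative, not validated):

```python
def triage_worklist(
    scores: dict[str, float],  # exam ID -> estimated P(critical finding)
    threshold: float = 0.5,    # illustrative cutoff, not a validated value
) -> tuple[list[str], list[str]]:
    """Split exams into an urgent queue (highest scores first) and a
    deferred queue to be read after the urgent ones are addressed."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    urgent = [e for e in ranked if scores[e] >= threshold]
    deferred = [e for e in ranked if scores[e] < threshold]
    return urgent, deferred

urgent, deferred = triage_worklist(
    {"US-1041": 0.08, "US-1042": 0.91, "US-1043": 0.35}
)
print(urgent)    # ['US-1042']
print(deferred)  # ['US-1043', 'US-1041']
```

In practice, both the cutoff and the scores themselves would need clinical validation before any exam was deferred on their basis.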
Potential Role in Research
The CHOP group’s study suggests another area where ChatGPT-4 may be useful: research. “Say you wanted to do a study that required certain findings among kidney images,” Pierce says. “Wouldn’t it be great if you could ask ChatGPT-4 to find you all the kidney images from a dataset that match the criteria you’re looking for?” Right now, such screening is often a tedious manual task, he says, and using ChatGPT-4 for it could save considerable time. Similarly, Pierce says, a radiology department doing a quality control study could ask ChatGPT-4 to review images acquired by its sonographers and distinguish the highest quality images from those that could be improved.
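As a thought experiment, the screening loop Pierce describes could be built on the same kind of multimodal call sketched earlier. Everything below is hypothetical: analyze_ultrasound() is the assumed helper from the earlier sketch, and the directory, file format, and yes/no prompt are illustrative.

```python
from pathlib import Path

# Assumes the analyze_ultrasound() helper from the earlier sketch.
def screen_dataset(image_dir: str, criteria: str) -> list[Path]:
    """Return images the model flags as matching the stated criteria."""
    matches = []
    for path in sorted(Path(image_dir).glob("*.png")):
        answer = analyze_ultrasound(
            str(path),
            f"Answer strictly 'yes' or 'no': does this kidney "
            f"ultrasound show {criteria}?",
        )
        if answer.strip().lower().startswith("yes"):
            matches.append(path)
    return matches

hits = screen_dataset("renal_dataset/", "urinary tract dilation")
print(f"{len(hits)} candidate images for manual review")
```

Any matches would still need manual review; the point of such a screen is to shrink the haystack, not to replace the reader.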
Pierce believes CHOP’s ultrasound pilot study of ChatGPT-4 is a promising start. However, he says, much more research is needed to confirm its findings across varied applications and more generalizable populations. One area that will need more research is vendor variability.
“There are a lot of differences between how images look based on the manufacturer of the scanning equipment and transducers,” Pierce says. “It’s possible that an AI algorithm can work well on images from one vendor, but when you analyze the same type of image on a second vendor, the algorithm doesn’t work well.” Any use of ChatGPT-4 would need to be validated across all the major ultrasound manufacturers and in varying practice environments, he adds.
Pierce also believes that regulatory bodies will have a difficult time clearing the use of a fully automated AI algorithm to obtain differential diagnoses from ultrasound images. “I don’t think it’s something we’re going to see in the next decade, if ever, just because of the concern about what to do if the AI is wrong,” Pierce says. He believes regulatory bodies would favor “human-in-the-loop” workflows with a licensed clinician supervising AI algorithm operations.
More Research Needed
Finally, Pierce wonders about the cost of implementing such AI applications and whether institutions would balk at funding them. “There are a large and growing number of commercially available software solutions for a variety of radiologic tasks; however, the cost to procure, implement, and maintain these technologies is often institutionally burdensome and, perhaps in some people’s perspective, prohibitive,” he says.
Darge says that the CHOP study is a good start. As with most new applications, he says, “You start in one institution, and then you expand it. But to do one study in one institution and think it applies universally would not be correct.”
Sultan is optimistic that user-friendly and accessible AI algorithms such as ChatGPT-4 will have a place in ultrasound at some point in the future. As with anything new, he says, there is some skepticism and some fascination. “Right now, I think there’s a lot of interest in doing more research and trying to implement LLMs in ways that could benefit the use of ultrasound,” he says. And it’s not just ultrasound, he notes: “I think that, at some point, work like ours would apply to other imaging modalities as well, such as CT and MRI.”
— Beth W. Orenstein of Northampton, Pennsylvania, is a freelance medical writer and regular contributor to Radiology Today.