Researchers in the U.K. have found that the chatbot ChatGPT performed better in assessing complex cases of respiratory illnesses in children than trainee doctors. Analysis of other common chatbots found that Google’s Bard performed better in only some aspects, while Microsoft’s Bing performed as well as trainees.

The findings could point to a future in which large language models (LLMs) are used to help a range of healthcare providers, including trainee doctors, nurses, and general practitioners, triage patients.

“Our study is the first, to our knowledge, to test LLMs against trainee doctors in situations that reflect real-life clinical practice,” said Manjith Narayanan, MD, PhD, a consultant in pediatric pulmonology at the Royal Hospital for Children and Young People, Edinburgh, Scotland. “We did this by allowing the trainee doctors to have full access to resources available on the internet, as they would in real life. This moves the focus away from testing memory, where there is a clear advantage for LLMs. Therefore, this study shows us another way we could be using LLMs and how close we are to regular day-to-day clinical application.”

Narayanan presented findings of the study recently at the European Respiratory Society (ERS) Congress in Vienna.

To conduct the research, Narayanan chose clinical scenarios that occur frequently in pediatric respiratory medicine, provided by six other experts in pediatric respiratory therapy. The scenarios covered cystic fibrosis, asthma, sleep-disordered breathing, breathlessness, and chest infections. None of the scenarios had an obvious diagnosis, and none was covered by published evidence, guidelines, or expert consensus that would point to a diagnosis or treatment plan.

The trainee doctors in the study had fewer than four months of clinical experience in pediatrics and were given one hour to query the internet, excluding chatbots, to suggest what each scenario represented and describe it in 200 to 400 words. All scenarios were also provided to the three chatbots being tested.

Responses from all sources were then scored by the six pediatric experts on how correct, comprehensive, usable, plausible, and coherent the answers were. Each response was scored on a scale of one to nine. The experts were also asked to identify whether each answer was generated by a human or a chatbot.

Solutions from ChatGPT, version 3.5, scored an average of seven points out of nine and were judged to be more human-like than the responses from either Bard or Bing. Bard’s solutions received an average score of six and were judged more coherent than the trainee doctors’ responses, but in other areas were considered no better or worse than the trainees’ solutions. Bing scored an average of four out of nine, the same score achieved by the trainee doctors. The experts were also able to identify the Bing and Bard submissions as not generated by a human.

Importantly, the researchers did not find instances of “hallucinations,” or seemingly made-up information, from any of the three chatbots. “Even though, in our study, we did not see any instance of hallucination by LLMs, we need to be aware of this possibility and build mitigations against this,” Narayanan noted.

To continue their research, Narayanan and the team now plan to run a similar test pitting newer, more advanced LLMs against more experienced doctors.
