Artificial intelligence (AI)-based tools that rely on large language models (LLMs) struggle to correctly identify genetic conditions when patients describe their own health in their own words, according to results from a study led by the National Institutes of Health (NIH) in Bethesda, Maryland.
Lead investigator Benjamin Solomon, MD, clinical director at the NIH's National Human Genome Research Institute (NHGRI), and colleagues found that the LLMs could identify diseases from descriptions written in textbooks or by clinicians, but not when the same conditions were described by laypeople.
“We may not always think of it this way, but so much of medicine is words-based,” said Solomon, in a press statement. “For example, electronic health records and the conversations between doctors and patients all consist of words. Large language models have been a huge leap forward for AI, and being able to analyze words in a clinically useful way could be incredibly transformational.”
To assess how accurately LLMs diagnose diseases from written descriptions, the researchers evaluated 10 commonly used models, including two recent versions of ChatGPT (GPT-3.5 and GPT-4), on their ability to identify 63 genetic conditions such as sickle cell anemia, cystic fibrosis, and Marfan syndrome, as well as rarer genetic syndromes and diseases. As a comparator, non-LLM methods such as Google search were also tested for accuracy.
As reported in the American Journal of Human Genetics, Solomon and team tested the models' ability to identify genetic diseases from textbook descriptions and from descriptions rewritten in more common language rather than medical jargon. They also tested the models on descriptions that patients at the NIH Clinical Center wrote about their own symptoms.
“It’s important that people without medical knowledge can use these tools,” said first author Kendall Flaharty, a National Human Genome Research Institute researcher. “There are not very many clinical geneticists in the world, and in some states and countries, people have no access to these specialists. AI tools could help people get some of their questions answered without waiting years for an appointment.”
The largest models, and those that were closed source rather than open source, performed best. GPT-4 was the best overall at identifying textbook and clinician descriptions, with accuracy of up to 90%.
However, none of the models performed well on patient-written descriptions: accuracy rates for the closed-source models ranged from 7–27%, compared with 44–55% for clinician-written descriptions of the same real patients, and the open-source models scored lower in both groups. Accuracy did improve when patients were guided by standardized questions, suggesting that unguided descriptions were too variable for the models to interpret reliably.
“For these models to be clinically useful in the future, we need more data, and those data need to reflect the diversity of patients,” said Solomon. “Not only do we need to represent all known medical conditions, but also variation in age, race, gender, cultural background and so on, so that the data capture the diversity of patient experiences. Then these models can learn how different people may talk about their conditions.”