Most large language models (LLMs) used in health care settings, such as AI chatbots, are being assessed in a fragmented and inconsistent way that fails to incorporate real patient information, researchers have warned.
A new study in the journal JAMA found that just one in 20 evaluation studies included real patient data reflecting the complexities of clinical practice.
Most instead focused on accuracy in answering medical examination questions, leaving limited attention to considerations such as fairness, bias, toxicity, and deployment.
The investigators said that testing these AI programs on hypothetical medical questions has previously been likened to certifying a car's roadworthiness using a multiple-choice questionnaire.
“Real patient care data encompasses the complexities of clinical practice, providing a thorough evaluation of LLM performance that will mirror clinical performance,” they maintained, calling for consensus-based methods of assessment.
Co-first author Lucy Orr-Ewing, a Harkness Fellow at Stanford, and colleagues conducted a systematic review of 519 studies and categorized existing evaluations of LLMs in health care according to five components.
These were: data type; health care task; natural language processing (NLP) and natural language understanding (NLU) tasks; dimension of evaluation; and medical specialty.
The most common health care tasks assessed were medical knowledge, such as answering medical licensing examination questions, and making diagnoses.
Less often studied were administrative tasks, such as assigning billing codes and writing prescriptions, and NLP and NLU tasks, such as summarization and conversational dialogue.
Almost all of the studies (95.4%) used accuracy as the primary dimension of evaluation. Fairness, bias, and toxicity were assessed in only 15.8% of studies, deployment considerations in 4.6%, and calibration and uncertainty in just 1.2%.
Among medical specialties, just over a quarter of studies addressed generic health care applications, followed by internal medicine, surgery, and ophthalmology.
Less than 1% of studies concerned nuclear medicine, physical medicine, and medical genetics.
“This systematic review highlights the need for real patient care data in evaluations to ensure alignment with clinical conditions,” concluded Orr-Ewing and co-workers.
“A consensus-based framework for standardized task definitions and evaluation dimensions is crucial.”
The researchers noted that, while efforts such as the World Health Organization's ethics and governance guidelines and the U.S. Executive Order on AI provide valuable foundations, specific metrics and methods for LLM assessment are still lacking.
“The Coalition for Health AI is making promising strides in this area, with workgroups launched in May 2024 to establish metrics and methods for LLMs in health care,” they noted.
“This initiative aims to create a consensus-based assurance standard guide similar to that for traditional AI models.”