Results from a study led by the University of Michigan show that doctors struggle to spot biases in an artificial intelligence (AI) diagnostic model, despite explanations of the model’s methods being available.
The study, published in JAMA, evaluated a model designed to help diagnose the cause of acute respiratory failure in patients admitted to hospital.
In the study, participating doctors who had help from an unbiased AI diagnostic model with no explanations improved their diagnostic accuracy by 2.9 percentage points, while the same unbiased model with explanations improved accuracy by 4.4 percentage points.
However, when given access to a biased version of the model, the clinicians' diagnostic accuracy dropped significantly, even though the accompanying explanations showed that irrelevant or biased information was driving the diagnostic suggestions.
“AI models are susceptible to shortcuts, or spurious correlations in the training data. Given a dataset in which women are underdiagnosed with heart failure, the model could pick up on an association between being female and being at lower risk for heart failure,” explained co-senior author, Jenna Wiens, PhD, associate professor of computer science and engineering at the University of Michigan, in a press statement. “If clinicians then rely on such a model, it could amplify existing bias.”
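To make the kind of shortcut Wiens describes concrete, here is a minimal, hypothetical sketch (not the study's model or data): a logistic regression is trained on synthetic labels in which heart failure in women is under-recorded, and it picks up a spurious negative weight on the "female" feature even though sex carries no real signal in the simulated ground truth.

```python
# Hypothetical illustration of a "shortcut" (spurious correlation); not the study's model.
# True risk depends only on a clinical marker, but labels under-diagnose women.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 10_000
female = rng.integers(0, 2, n)                  # 1 = female, 0 = male
marker = rng.normal(0, 1, n)                    # standardized clinical marker
true_hf = (marker + rng.normal(0, 0.5, n)) > 0.5  # ground-truth heart failure

# Biased labels: heart failure in women goes unrecorded 40% of the time.
missed = (female == 1) & true_hf & (rng.random(n) < 0.4)
label = true_hf & ~missed

model = LogisticRegression().fit(np.column_stack([marker, female]), label)
print("weight on clinical marker:", round(model.coef_[0][0], 2))
print("weight on 'female':       ", round(model.coef_[0][1], 2))  # negative: the shortcut
```

A clinician relying on such a model would see women scored as lower risk for heart failure, which is exactly the amplification of existing bias that Wiens warns about.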
There are strong calls for AI used in healthcare applications to be explainable, as this should, in theory, make mistakes or biases easier to spot.
This study tested whether clinicians could spot errors in an AI diagnostic model by checking the accompanying explanatory data.
Overall, 457 clinicians were recruited and randomized to take part in the study. They were asked to review eight of nine possible clinical vignettes of patients with acute respiratory failure, with information on each patient's symptoms on arrival at hospital, physical examination findings, lab results, and chest X-rays. Half of the group were randomly assigned to use an AI-based diagnostic model with explanations, and half an AI model without explanations.
The clinicians were asked to determine whether pneumonia, heart failure, or chronic obstructive pulmonary disease was causing each patient's respiratory failure. They diagnosed the first two cases without AI assistance and then six further cases, presented in random order, with the help of an AI model. Of those six, three used an unbiased AI model and three a biased model.
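Read as a procedure, each participant's session might look like the following hypothetical sketch (case identifiers, function names, and structure are illustrative, not taken from the study materials):

```python
import random

def assign_cases(vignettes, explanation_arm):
    """Hypothetical per-clinician schedule: 2 cases without AI,
    then 6 AI-assisted cases (3 unbiased, 3 biased) in random order."""
    cases = random.sample(vignettes, 8)            # 8 of the 9 vignettes
    schedule = [{"case": c, "ai": None} for c in cases[:2]]
    ai_versions = ["unbiased"] * 3 + ["biased"] * 3
    random.shuffle(ai_versions)
    for case, version in zip(cases[2:], ai_versions):
        schedule.append({"case": case, "ai": version,
                         "explanations": explanation_arm})
    return schedule

# One clinician randomized to the "with explanations" arm:
print(assign_cases(list(range(1, 10)), explanation_arm=True))
```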
The average diagnostic accuracy of the participating doctors was 73% without AI assistance. Although the unbiased AI model improved clinician accuracy by up to 4.4 percentage points when explanations were provided, the clinicians struggled to spot the cases that used a biased model, even when the explanations showed which data the model had relied on.
The biased model reduced diagnostic accuracy by 11.3 percentage points when no explanation was available and by 9.3 points when an explanation was available, showing that the available explanations made little difference to the final diagnostic decision in most cases.
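Combining the reported 73% baseline with the reported effects gives a rough back-of-the-envelope picture; the paper itself reports adjusted estimates, so treat these as approximate:

```python
# Approximate accuracies implied by the figures quoted above.
baseline = 73.0  # % diagnostic accuracy without AI assistance
effects = {
    "unbiased model, no explanations":  +2.9,
    "unbiased model, explanations":     +4.4,
    "biased model, no explanations":   -11.3,
    "biased model, explanations":       -9.3,
}
for arm, delta in effects.items():
    print(f"{arm}: {baseline + delta:.1f}%")
# 75.9%, 77.4%, 61.7%, 63.7%
```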
“The problem is that the clinician has to understand what the explanation is communicating and the explanation itself,” said first author Sarah Jabbour, a PhD candidate in computer science and engineering at the University of Michigan College of Engineering.
“There’s still a lot to be done to develop better explanation tools so that we can better communicate to clinicians why a model is making specific decisions in a way that they can understand. It’s going to take a lot of discussion with experts across disciplines.”