Credit: utah778/Getty Images

While AI can spot demographic information in X-ray images, researchers have warned that this ability can also undermine its diagnostic accuracy for different genders and ethnic groups.

Their study, published in Nature Medicine, suggests that machine-learning models may use characteristics such as race, sex, and age as shortcuts, resulting in biased predictions.

It offers new insights into how AI encodes demographics and the impact this has on fairness, particularly when models are applied outside the scenarios in which they were trained.

Ironically, the models that were best at predicting demographic information also performed worst at accurately diagnosing images in particular demographic subgroups.

While the researchers found ways around these issues, these fixes were only effective when applied to patients similar to those on whom the models were trained.

“I think the main takeaways are, first, you should thoroughly evaluate any external models on your own data because any fairness guarantees that model developers provide on their training data may not transfer to your population,” said co-lead researcher Haoran Zhang, a graduate student at MIT.

“Second, whenever sufficient data is available, you should train models on your own data.”

As of May this year, the U.S. Food and Drug Administration had approved 882 AI-enabled medical devices, 671 of which are designed for use in radiology.

Recent studies have shown the surprising ability of deep-learning models to predict demographic information such as race, sex, and age from medical images, with an accuracy that far exceeds that of radiologists.

To investigate the accuracy of these models, the researchers used publicly available chest X-ray datasets from Beth Israel Deaconess Medical Center in Boston to train models to predict whether patients had fluid buildup in the lungs, collapsed lungs, or enlargement of the heart.
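
In code terms, this is a standard multi-label image-classification setup. The sketch below is illustrative rather than the authors’ actual implementation; the backbone choice, layer sizes, and names are assumptions.

```python
# Minimal sketch of a multi-label chest X-ray classifier of the kind the study
# describes: one backbone, three disease outputs (fluid buildup, collapsed lung,
# enlarged heart). Illustrative only, not the authors' code.
import torch
import torch.nn as nn
from torchvision.models import densenet121

class ChestXrayClassifier(nn.Module):
    def __init__(self, n_labels=3):
        super().__init__()
        backbone = densenet121(weights=None)  # common chest X-ray backbone (assumption)
        backbone.classifier = nn.Linear(backbone.classifier.in_features, n_labels)
        self.backbone = backbone

    def forward(self, x):            # x: batch of 3-channel X-ray images
        return self.backbone(x)      # one logit per condition

model = ChestXrayClassifier()
logits = model(torch.randn(2, 3, 224, 224))  # -> shape (2, 3)
```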

The models were then tested on X-rays that were held out from the training data. While they performed well overall, most showed discrepancies in accuracy between men and women, and between Black and White patients.

There was a significant relationship between each model’s accuracy in making demographic predictions and the size of these “fairness gaps.” This indicated they might be using demographic categories as a shortcut to make disease predictions.
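
A “fairness gap” of this kind can be quantified as the difference in a performance metric, such as AUROC, between demographic subgroups. The sketch below shows one way to compute it; the metric choice, function name, and synthetic data are assumptions, not the study’s exact evaluation code.

```python
# Sketch: measure a fairness gap as the difference in AUROC between subgroups.
import numpy as np
from sklearn.metrics import roc_auc_score

def fairness_gap(y_true, y_score, group):
    """Return (max - min subgroup AUROC, per-group AUROCs); a larger gap is less fair."""
    aurocs = {}
    for g in np.unique(group):
        mask = group == g
        aurocs[g] = roc_auc_score(y_true[mask], y_score[mask])
    return max(aurocs.values()) - min(aurocs.values()), aurocs

# Tiny synthetic example with two groups (purely illustrative numbers):
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)                           # disease labels
y_score = np.clip(0.3 * y_true + 0.7 * rng.random(1000), 0, 1)   # model scores
group = rng.choice(["female", "male"], size=1000)                # demographic group
gap, per_group = fairness_gap(y_true, y_score, group)
print(per_group, round(gap, 3))
```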

The team then used state-of-the-art methods to remove these shortcuts, creating “locally optimal” models that were also fair. Models were rewarded during training when they improved in the worst-performing groups and penalized when any group’s error rate was high.
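
One common way to implement this kind of subgroup-robust training is to optimize the loss of the worst-performing group rather than the average loss, in the spirit of group distributionally robust optimization. The sketch below illustrates that idea; it is not necessarily the paper’s exact method, and the tensor names are illustrative.

```python
# Sketch of a subgroup-robust objective: minimize the mean loss of the
# worst-performing demographic group instead of the overall average loss.
import torch
import torch.nn.functional as F

def worst_group_loss(logits, labels, group_ids):
    """Per-sample BCE loss, averaged within each group; return the worst group's mean."""
    per_sample = F.binary_cross_entropy_with_logits(
        logits, labels.float(), reduction="none")
    group_losses = []
    for g in torch.unique(group_ids):
        group_losses.append(per_sample[group_ids == g].mean())
    return torch.stack(group_losses).max()

# Usage inside a training step (model, optimizer, and batch tensors are assumed):
# loss = worst_group_loss(model(images).squeeze(-1), disease_labels, group_labels)
# loss.backward(); optimizer.step()
```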

This “subgroup robustness” method forced models to be sensitive to poor predictions in a specific group. In a second, “group adversarial” approach, the researchers instead forced the models to strip demographic information from the images entirely.
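
A typical group adversarial set-up adds an auxiliary head that tries to predict the demographic group from the model’s internal features, while a gradient-reversal layer pushes the feature extractor to defeat it, removing group information from the representation. The sketch below illustrates that pattern; the layer sizes and names are assumptions rather than the paper’s code.

```python
# Sketch of a group-adversarial model: the group head tries to predict the
# demographic group from features, and the gradient-reversal layer trains the
# encoder to make that prediction fail. Illustrative only.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, alpha):
        ctx.alpha = alpha
        return x.clone()

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.alpha * grad_output, None  # flip the gradient for the encoder

class AdversarialDebiasedModel(nn.Module):
    def __init__(self, feat_dim=128, n_groups=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(feat_dim), nn.ReLU())
        self.disease_head = nn.Linear(feat_dim, 1)       # main diagnostic task
        self.group_head = nn.Linear(feat_dim, n_groups)  # adversary

    def forward(self, x, alpha=1.0):
        feats = self.encoder(x)
        disease_logit = self.disease_head(feats)
        group_logits = self.group_head(GradReverse.apply(feats, alpha))
        return disease_logit, group_logits

# Training minimizes disease loss plus group loss; the reversed gradient makes the
# encoder worsen group prediction while the adversary tries to improve it.
```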

While these methods generally worked well, they did so only in settings similar to those in which the models were trained, such as the dataset from Beth Israel Deaconess Medical Center.

When the “debiased” models were tested on patient data from five other hospitals, their overall accuracy remained high but large fairness gaps reappeared. This challenges the popular notion that a single model can remain fair across different settings.

The authors note that some demographic attributes can be direct causal factors for certain diseases, such as sex in breast cancer. Here, it would not be desirable to remove all reliance on demographics; instead, the model’s reliance on the demographic attribute should be matched to its true causal effect.

Senior author Marzyeh Ghassemi, PhD, an associate professor of electrical engineering and computer science at MIT, acknowledged that many popular machine-learning models have superhuman capacity for predicting demographics.

“These are models that are good at predicting disease, but during training are learning to predict other things that may not be desirable,” she said.