Associate Editor Linda Nevin discusses highlights from the first week of PLOS Medicine’s Machine Learning in Health and Biomedicine Special Issue
This week, PLOS Medicine launches our Special Issue on Machine Learning in Health and Biomedicine, Guest Edited by Atul Butte of UCSF, Suchi Saria of Johns Hopkins University, and Aziz Sheikh of The University of Edinburgh. In the coming weeks, PLOS Medicine will publish 14 high-quality research articles that proved, through editorial assessment and peer review, to be of broad interest to practitioners of medicine and public health, robust in their conclusions, and presented with appropriate qualifications regarding current applicability and generalizability. The Special Issue features
- Machine learning-based original research in medical decision making—comprising detection, diagnosis, and prognosis
- Machine learning-driven research in health systems, providing insights into the allocation of healthcare resources
- Formal comparisons between standard epidemiological and machine learning approaches, and broad demonstrations of the performance of machine learning-based models when tested in datasets distinct from the discovery set
- Pathophysiologic insights from machine learning that demonstrate novel explanatory mechanisms for a significant clinical problem
In addition, the Special Issue will include commentary from experts in diagnostic radiology, population health, ethics, and public policy. This week features Effy Vayena and colleagues’ Perspective on addressing ethical challenges through data protection, algorithmic transparency, and accountability.
In the Special Issue’s first featured study, Eric Karl Oermann of the Department of Neurological Surgery, Icahn School of Medicine, New York, and his colleagues trained a convolutional neural network (CNN) to detect signs of pneumonia in chest X-rays from 3 large U.S. hospital systems, individually and in combination, and attempted to validate the trained models’ performance internally (with held-out test data) and externally (using test data from a different hospital system).
Overall, the researchers found that the CNN models showed a significant reduction in accuracy when tested on data from outside the training set. For example, the CNN trained on data from Mount Sinai Hospital (MSH) had an area under the receiver operating characteristic curve (AUC) of 0.802 (95% confidence interval [CI] 0.793–0.812) in held-out internal validation, an AUC of 0.717 (95% CI 0.687–0.746) in data from the National Institutes of Health Clinical Center (NIH), and an AUC of 0.756 (95% CI 0.674–0.838) in data from the Indiana University Network for Patient Care (IU).
A CNN trained on combined data from NIH and MSH had the strongest internal performance of the models generated (AUC 0.931 [95% CI 0.927–0.936]), but significantly lower external performance with data from IU (AUC 0.815 [95% CI 0.745–0.885], P = 0.001 by DeLong’s test). In further analyses, the researchers found evidence that this model exploited image features, imperceptible to humans, associated with hospital system and department, to a greater extent than image features of pneumonia, and that hospital system and department were themselves predictors of pneumonia in the pooled training dataset—for example, pneumonia prevalence was very high at MSH (34.2%) relative to NIH and IU (1.2% and 1.0%, respectively). Thus, when tested in an independent hospital system, the model may have been deprived of predictors that were key to initial fitting but irrelevant to patient diagnosis. In the authors’ words, “performance of CNNs in diagnosing diseases on X-rays may reflect not only their ability to identify disease-specific imaging findings on X-rays but also their ability to exploit confounding information.”
Due to limitations of the datasets, and because the features and interrelationships by which CNNs predict outcomes are not easily reduced to simpler, familiar terms, the authors cannot fully assess what factors other than disease prevalence might have led to reduced performance in external validation. Nonetheless, the study provides evidence that estimates of real-world CNN performance based on held-out internal test data can be overly optimistic.
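The AUC figures quoted above can be read as the probability that a randomly chosen pneumonia-positive X-ray receives a higher model score than a randomly chosen negative one. A minimal sketch of that rank-based (Mann–Whitney) calculation, using invented scores rather than any of the study’s data, shows how the same model can score well internally and worse externally:

```python
# Rank-based AUC: the probability that a positive example outscores a
# negative one (ties count half). Pure Python; all scores below are
# invented for illustration and are NOT from the study.

def auc(labels, scores):
    """Compute AUC by pairwise comparison of positive vs. negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy "internal" vs. "external" test sets: the same model's scores
# separate the classes less cleanly on the external set.
internal = ([1, 1, 1, 0, 0, 0], [0.9, 0.8, 0.6, 0.4, 0.3, 0.1])
external = ([1, 1, 1, 0, 0, 0], [0.9, 0.5, 0.2, 0.6, 0.3, 0.1])

print(auc(*internal))  # 1.0: every positive outscores every negative
print(auc(*external))  # lower: class overlap degrades the ranking
```

A drop of this kind between internal and external AUC, as seen in the study, signals that part of the model’s apparent skill does not transfer to new settings.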
In the second featured study from the Special Issue, Yizhi Liu of Sun Yat-sen University, Guangzhou, and colleagues trained a Random Forest model to estimate a future diagnosis of high myopia among Chinese school-aged children—a prediction that can aid in the selection of children for preventive therapies. For their study, the researchers extracted clinical refraction data from electronic medical records (EMR) for 129,242 individuals aged 6 to 20 years who visited 8 ophthalmic centers between January 1, 2005 and December 30, 2015. The researchers used age, spherical equivalent (SE), and past annual progression rate to train a Random Forest algorithm to predict SE and onset of high myopia (SE ≤ −6.0 diopters) in future years. Model training used a single center, and the remaining 7 centers were used for external validation. The researchers obtained additional validation using data from 2 Chinese population-based cohorts.
Liu and colleagues’ trained algorithm predicted the onset of high myopia over 10 years with clinically acceptable performance in external EMR validation (AUC ranging from 0.874 to 0.976 at 3 years, from 0.847 to 0.921 at 5 years, and from 0.802 to 0.886 at 8 years). The algorithm also predicted development of high myopia by 18 years of age, as a surrogate of high myopia in adulthood, with clinically acceptable performance in external EMR validation at 3 years (AUC 0.940–0.985), 5 years (AUC 0.856–0.901), and 8 years (AUC 0.801–0.837).
As expected, the performance of Liu and colleagues’ prediction model for future high myopia was reduced when the targeted prediction time increased; additionally, practitioners may disagree on what constitutes clinically acceptable performance. However, in this study, a clinically interpretable prediction model achieved AUC in the 0.80–0.90 range for future predictions up to 8 years in several external validation analyses. The authors state, “[t]his work proposes a novel direction for the use of medical big data mining to transform clinical practice and guide health policy-making and precise individualized interventions.”
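To make the prediction target concrete, the sketch below uses the study’s three inputs (age-stage SE and past annual progression rate) in a deliberately naive linear extrapolation. This is NOT the authors’ Random Forest, and every number in it is invented; it only illustrates what “predicting onset of high myopia (SE ≤ −6.0 diopters) in future years” means:

```python
# Naive baseline, NOT the study's Random Forest: linearly extrapolate a
# child's spherical equivalent (SE, in diopters) from their past annual
# progression rate, and flag predicted high myopia (SE <= -6.0 D).
# All patient values below are invented for illustration.

HIGH_MYOPIA_SE = -6.0  # diopters; the threshold used in the study

def predict_se(current_se, annual_progression, years_ahead):
    """Extrapolate SE assuming the past annual progression rate persists."""
    return current_se + annual_progression * years_ahead

def predicts_high_myopia(current_se, annual_progression, years_ahead):
    """True if the extrapolated SE crosses the high-myopia threshold."""
    return predict_se(current_se, annual_progression, years_ahead) <= HIGH_MYOPIA_SE

# A hypothetical child at -3.5 D progressing -0.75 D per year:
print(predict_se(-3.5, -0.75, 3))           # -5.75 D: not yet high myopia
print(predicts_high_myopia(-3.5, -0.75, 5)) # True: -7.25 D crosses -6.0 D
```

The Random Forest improves on a constant-rate assumption by learning nonlinear, age-dependent progression patterns from 129,242 EMR records; the sketch shows only the shape of the inputs and the thresholded output.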
In addition to the PLOS Medicine Special Issue, research articles in PLOS Computational Biology and PLOS ONE curated by Guest Editors Quaid Morris, Leo Anthony Celi, Luca Citi, Marzyeh Ghassemi, and Tom Pollard will be published as part of a forthcoming cross-PLOS Collection on Machine Learning in Health and Biomedicine. For an early sample from the Collection, have a look at three PLOS ONE research articles from September: an intelligent analysis of interaction trajectories occurring in online support groups for people with cancer, a model for magnetic resonance-based quantification of abdominal fat, and a prediction algorithm for long-term meningioma outcomes.
PLOS Medicine’s Special Issue will continue for the next several weeks with further research and commentary—to view all the articles, visit the Collection.
Featured Image Credit: StockSnap, Pixabay