The optic disc may tell us more than what catches the eye. 86,123 pairs of fundus photos from the Duke Eye Center matched with spectral domain optical coherence tomography (SD-OCT) images from patients with glaucoma or glaucoma suspects were used to train a convolutional neural network to predict global retinal nerve fiber layer (RNFL) thickness. Predicted RNFL thickness from fundus photos correlated significantly with observed RNFL thickness on OCT (r = 0.76). Fundus photos could also discriminate progressors from nonprogressors (AUC = 0.86). While fundus photos will never replace OCT, especially given their decreased accuracy and lack of sectoral measurements, they may be a useful tool in resource-limited settings where OCT is not available.
AI Insights - AUC and ROC
AUC (Area Under the Curve) is a commonly used metric in AI representing the area under the ROC (Receiver Operating Characteristic) curve. Every test has a sensitivity and specificity, but many AI algorithms have a tunable threshold that trades off increased sensitivity against increased specificity. The ROC curve is a graph of all the test's sensitivity/specificity pairs, and a larger area under the curve indicates better overall sensitivity/specificity, expressed as a decimal with a maximum of 1.0. It is important to note that a random classifier between two possibilities (such as progressors vs nonprogressors) will have an AUC of 0.5, so 0.5 should be considered the baseline minimum. Because sensitivity and specificity are not affected by the prevalence of the disease being studied, the AUC reflects algorithmic performance but may not reflect real-world accuracy once disease prevalence is accounted for.
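The idea above can be sketched in a few lines: AUC equals the probability that a randomly chosen positive case is scored higher than a randomly chosen negative case (the Mann-Whitney interpretation). The function and data below are illustrative, not taken from either study.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a randomly chosen positive scores
    higher than a randomly chosen negative (ties count as half)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# A classifier that separates the two classes perfectly -> AUC = 1.0
print(roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))  # 1.0
# One positive/negative pair ranked incorrectly out of four -> AUC = 0.75
print(roc_auc([0, 0, 1, 1], [0.1, 0.8, 0.2, 0.9]))  # 0.75
```

Note that the calculation uses only the ranking of scores, never a single threshold, which is why AUC summarizes performance across all sensitivity/specificity trade-offs at once.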
Translational Vision Science & Technology
Can machine learning algorithms stand out in a crowded field? 90,713 visual fields from 13,516 eyes across five institutions were used to train six different machine learning (ML) algorithms to identify glaucomatous progression. The performance of these algorithms was compared to six existing progression algorithms using clinical expert labels as the reference. The ML algorithms classified progressed versus stable fields with similar or better performance than the individual or ensemble progression algorithms. The authors also found that the traditional algorithms had a significant tendency to call "unclear" patterns consistently progressing or stable, while the ML algorithms showed no significant bias in these borderline cases. Visual field progression remains an important problem in the management of glaucoma, and machine learning algorithms demonstrate equivalent or better accuracy than conventional progression algorithms with less bias in borderline cases.
AI Insights - Ground Truth Labels
“Ground truth” is an important concept in the training and evaluation of AI algorithms. In particular, supervised tasks (i.e., those where labels are explicitly provided for algorithm training) rely on valid signals to avoid learning biased or faulty patterns in the data. Therefore, great care should be taken to avoid using proxy labels without extensive consideration and verification of assumptions. For example, ICD-10 codes are a relatively simple proxy to acquire for diagnostic labels; however, anyone who has reviewed the EMR of a patient with multiple hospitalizations would agree that these codes are often not rigorously evaluated for precision. Typically, expert panel judgment is considered “ground truth” for many clinical problems. However, in large datasets, manual assessment of the entire dataset may be time- or cost-prohibitive. Ideally, “ground truth” labels would be used for both training and testing of supervised models; when they are difficult to acquire, they should at least be used for testing and evaluation. In this work, the authors trained their machine learning algorithms on a training dataset with proxy labels generated from the majority decision of the traditional progression algorithms, but evaluated performance on a testing dataset with expert panel labels. The authors deliberately explored their training data and made various algorithmic design choices to account for their use of proxy labels in training. While the algorithms were therefore not fully optimized against the expert consensus, the inferences drawn from the testing dataset were still based on “ground truth” labels.
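The proxy-labeling strategy described above, taking the majority decision of several traditional algorithms as the training label, can be sketched as follows. The function name, label strings, and example calls are all illustrative assumptions, not the authors' actual code.

```python
from collections import Counter

def majority_proxy_label(algorithm_calls):
    """Derive a proxy training label from the majority decision of
    several conventional progression algorithms' calls on one eye.
    With an even number of algorithms a tie is possible; here the
    call listed first among the tied options wins, so real pipelines
    would need an explicit tie-breaking or exclusion rule."""
    counts = Counter(algorithm_calls)
    return counts.most_common(1)[0][0]

# Hypothetical calls from three traditional algorithms on one visual field
print(majority_proxy_label(["progressed", "stable", "progressed"]))  # progressed
print(majority_proxy_label(["stable", "stable", "progressed"]))      # stable
```

The expert panel labels would then be reserved for the held-out test set, so the reported performance reflects agreement with “ground truth” rather than with the proxy.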
Content of The Lens is for medical education purposes only.
Copyright © 2021 The Lens Newsletter LLC - All Rights Reserved.