Glaucoma is a leading cause of blindness and has no cure
60M affected worldwide
Glaucoma is the leading cause of irreversible blindness worldwide
$2.5B in annual US healthcare spend
Glaucoma testing and treatment drives 10M physician visits in the US each year
1 in 8 go blind even with treatment
Determining the speed of vision deterioration early is critical to preventing blindness
Today glaucoma detection is uncertain and friction-filled
Is the patient's vision getting worse ("progressing")?
Visual field tests that detect a patient's sensitivity to light at 54 distinct points are a critical factor in determining the right treatment strategy for glaucoma.
Three key factors complicate interpreting results to determine progression:
Test Variation: Visual field tests depend on patient responses to flashes of light. Patient fatigue, lack of focus, or learning effects can cause variation in the results independent of any change in vision
Lack of a "gold standard" metric: Researchers have developed a range of metrics for assessing patient vision, but research has shown that these measures often disagree, leaving clinicians to make a judgment call between them
Too manual: Today clinicians line up paper printouts of visual field results to compare them over time and make assessments of patient progression
Foresight aims to bring simplicity and certainty to glaucoma detection
More accurate detection
Leverage machine learning on 13K unique eyes across 5 leading US eye institutes
Streamlined data consumption
Eliminate the clutter to help clinicians determine the right treatment strategy
Put patient data in context
Give clinicians confidence in recommendations by placing data in relation to patient history and peers
Data
Understanding how data on patient vision is structured
Visual field of right eye (total deviation values shown)
note: black rectangle represents blind spot
A visual field is the core data structure for our product
Numbers in the eye-shaped matrix represent a patient's sensitivity to light at 54 distinct points. Higher numbers represent stronger vision. There are three standard metrics for visual fields, which are included in our dataset:
Raw sensitivity: values of each tested point are listed in decibels in the sensitivity plot. Higher numbers mean the patient was able to see a more attenuated light, and thus has more sensitive vision at that location
Total deviation: values are deviations in sensitivity from the expected values for a specific age. Positive values represent areas of the field where the patient can see dimmer stimuli than the average individual of that age. Negative values represent decreased sensitivity from normal.
Pattern deviation: total deviation values corrected for generalized decreases in visual sensitivity. It is useful in cases where there is both localized depression due to glaucoma, as well as globally depressed vision across the eye due to other pathologies such as cataracts.
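For intuition, here is a minimal sketch of how pattern deviation can be derived from total deviation. The HFA uses a specific "general height" estimate; the 85th-percentile choice below is an approximation, and the function is ours:

```python
import numpy as np

def pattern_deviation(total_dev: np.ndarray) -> np.ndarray:
    """total_dev: array of total deviation values (dB), blind spot excluded."""
    # Estimate the generalized (diffuse) depression, e.g. from cataract,
    # as a high percentile of the total deviations (an approximation of
    # the HFA's "general height")...
    general_height = np.percentile(total_dev, 85)
    # ...and remove it, leaving only the localized loss typical of glaucoma.
    return total_dev - general_height
```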
Our dataset consists of 831K visual fields from 177K unique patients spanning 5 US-based eye institutes.
We used eyes from the Glaucoma Research Network visual field collection (831,240 fields from five sites, without clinical data)
The dataset was filtered for patients with at least five reliable SITA-Standard 24-2 fields, resulting in 90,713 fields from 13,156 eyes included in the study.
Research & Analysis
Leveraging insights from clinicians, glaucoma research and machine learning techniques to develop Foresight
#1: User research
#2: Data filtering
#3: Data exploration
#4: Data normalization
#5: Label generation
#6: Model development
Results
Benchmarking our model against leading research standards
200 expert-labeled patients used to benchmark our models
200 patients chosen randomly across three proxy-label categories: 50 unanimously "stable", 50 unanimously "progressing", and 100 where the algorithms disagreed
A human ophthalmologist (glaucoma expert) examined the visual fields of these patients and generated the "ground truth".
Ground Truth: 134 of the 200 patients were labeled "progressing" or "stable"; the remaining 66 were boundary cases
Foresight classifiers have good F1 scores with lower class-bias
Encouraging F1 scores: Our top 4 classifiers had F1 scores in line with leading research algorithms (i.e., 0.90 or above)
Lower Class-Bias: Our classifiers did not overpredict "progressing" on patients identified as "too close to call" by the glaucoma specialist, while VFI and PLR did.
We focused on 2 key metrics to evaluate models
F1 score: the harmonic mean of precision and recall
Class-Bias: the tendency of a classifier to overpredict one class; this refers to one specific form of "bias" as used in the machine learning literature.
Second Approach
Using *just* the 134 labeled eyes
Can we use just the 134 labeled eyes to build a classifier?
Yes! We explored a Support Vector Classifier (SVC), Random Forest, and K Nearest Neighbors. We used 5-fold cross-validation repeated over 1,000 iterations to obtain an unbiased estimate of model performance.
Our best model is an SVC that uses just the 52 pattern deviation values and age to achieve a mean F1 score of 0.94
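A minimal sketch of this setup in scikit-learn, assuming a feature matrix of the 52 pattern deviation values plus age (the kernel and other hyperparameters shown are our assumptions):

```python
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

def evaluate_svc(X, y, n_repeats=1000):
    """X: (n_eyes, 53) array of 52 pattern deviations + age; y: 0/1 labels."""
    model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    # 5-fold CV repeated many times gives a low-variance performance estimate.
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=n_repeats, random_state=0)
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
    return scores.mean(), scores.std()
```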
How can we avoid overfitting?
Given the small size of the labeled dataset, overfitting was a concern. To address this problem, we generated synthetic data.
The figure on the left shows a patient with six visual fields. We anchor the first two fields and drop fields three and four to create a synthetic data point, assuming that the final classification of stable/progressing still stands.
By subsampling the visual fields of a patient in this way, we were able to generate more than 2,300 points from only the 134 eyes, and we achieved an even better average F1 score of 0.95.
We expect this model to generalize better to unseen data.
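A minimal sketch of the subsampling scheme (the function name and the minimum series length are illustrative):

```python
from itertools import combinations

def synthetic_series(fields, min_len=4):
    """fields: one eye's visual fields in chronological order.

    Keeps the first two fields as anchors and enumerates subsets of the
    remaining fields; each long-enough series inherits the eye's label.
    """
    anchors, rest = fields[:2], fields[2:]
    samples = []
    for k in range(1, len(rest) + 1):
        for subset in combinations(rest, k):
            series = anchors + list(subset)
            if len(series) >= min_len:
                samples.append(series)
    return samples
```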
Abstracts for some of the work we shared today have been submitted to the American Academy of Ophthalmology. We will continue working with our advisors.
We hope to take our models to the next level by labeling a larger portion of our dataset and exploring the two approaches further.
We hope to apply the synthetic data approach to early detection of glaucoma progression.
Our Team
Combining forces to push forward thinking on data-driven glaucoma detection
Loris D'Acunto
Surabhi Gupta
Vikram Hegde
Amin Venjara
Our Advisors
Helping us chart the course and guiding us through the intricacies of applying machine learning to glaucoma
Dr. Osamah Saeedi, MD
Assistant Professor of Ophthalmology
University of Maryland
Dr. Tobias Elze
Instructor in Ophthalmology
Harvard Medical School
Joyce Shen
Lecturer
UC Berkeley School of Information
Alberto Todeschini
Lecturer
UC Berkeley School of Information
#1: User Research
In order to understand how to improve glaucoma detection via machine learning, we interviewed five ophthalmologists and optometrists -- the target users of our product.
Our objectives for the interviews were to understand:
Current state detection processes: Each physician has her or his own personal "algorithm" for how to determine progression. We wanted to learn from these experience-based processes to inform our model development, but also identify potential points for improvement.
Pain points: Key frustrations ophthalmologists and optometrists face today when trying to determine progression in glaucoma-suspect patients.
Concept feedback: Gathering unprompted and prompted reactions to paper prototypes of product concepts
Key learnings from the interviews included:
Test variation is a key challenge: Visual field tests have a high degree of variance due to patient focus, technician skills and also a learning effect. All of these complicate the ability to get a clear picture of progression
Primarily manual, paper-based process: Determining progression today is a very manual process. Clinicians typically line up printouts from visual field machines and manually look for patterns. Electronic Medical Record (EMR) systems are starting to make inroads but primarily digitize existing paper processes rather than delivering real leaps forward in analytics
Clinicians want the raw data: Due to test variability and the lack of a gold-standard metric, clinicians have grown accustomed to looking at the raw data to sanity-check the results of existing algorithms. Making raw data easily available is important for securing a clinician's trust
Advanced algorithms have limited adoption today: While clinicians are aware of progression algorithms from recent large research studies (e.g., AGIS, CIGTS), these are not well understood and are rarely used clinically, as they require extensive manual calculations. Instead, clinicians tend to stick with the methods they learned during their training
Greyscale images of the eye are important for patient education: Clinicians regularly use greyscale images to help patients understand their diagnosis and progression. Patient education is important to ensure treatment adherence and attendance at follow-up visits
The next steps for user research would include testing our web-based prototype with clinicians and expanding our research to include patients, to test how well the product enables patient education.
#2: Data Filtering
The initial dataset had 177,172 patients with 831,240 visual fields (VF) from 5 different institutions. We focused our analysis on the most commonly used method: SITA (see above) on the 24-2 field pattern (which measures 24 degrees temporally and 30 degrees nasally and tests 54 points spaced 6 degrees apart). The stimulus was size III (a code for a particular standard size), white on a white background. These criteria were selected for us by our ophthalmologist advisor as current best clinical practice.
From the raw data, we removed tests with a high false positive rate (greater than 20%), and we kept only eyes with at least 5 studies. The criterion for grouping studies was [patient-id, eye (OD or OS)]; OD is the medical term for the right eye and OS for the left. Thus each eye's (left or right) studies were considered separately. After this filtering process, we ended up with a final dataset of 90,713 visual fields from 13,156 eyes.
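A minimal sketch of these filtering steps, assuming the raw data lives in a pandas DataFrame with (hypothetical) columns patient_id, eye, and false_pos_rate:

```python
import pandas as pd

def filter_fields(df: pd.DataFrame) -> pd.DataFrame:
    # Drop unreliable tests: false positive rate greater than 20%.
    reliable = df[df["false_pos_rate"] <= 0.20]
    # Group by [patient_id, eye] and keep only eyes with at least 5 studies.
    n_studies = reliable.groupby(["patient_id", "eye"])["patient_id"].transform("size")
    return reliable[n_studies >= 5]
```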
#3: Data Exploration
Glaucoma is known to be more prevalent in older patients, and our data confirms this. As the histogram below shows, the prevalence of glaucoma increases with age, with the average patient age being 67 years. The fall-off in prevalence beyond the peak possibly reflects two trends: patients reaching the end of their life span before glaucoma occurs, and saturation of glaucoma incidence in the glaucoma-susceptible population.
The typical HFA test can be tedious for most subjects, and this may result in errors (false positives or negatives), a situation that can be exacerbated in older patients. The average duration of the test is more than 6 minutes, but a higher rate of errors may prolong it.
As expected, there is a correlation between the duration of the test and the age of the patients.
The results provided by the HFA include false positive and false negative error rates as well as both global indices and pointwise measures. Some of them are:
Mean Deviation (MD): a global index of the deviation from age-matched controls. Patients without glaucoma have MD values close to zero, while patients with an MD more negative than -10 dB are classed as having advanced glaucoma. As the corresponding figure shows, the majority of patients had a negative MD, indicating glaucoma-related vision loss.
Raw Sensitivity (pointwise metric): values of each tested point in decibels.
Total Deviation (pointwise metric): the deviation of each point in the visual field compared to control subjects of that specific age. Patients without glaucoma have values close to zero.
Pattern Deviation (pointwise metric): the total deviation after reducing deviations uniformly across the visual field. The rationale is that this accounts for global deficits in VFs caused by other conditions such as cataract, leaving the typical localized glaucoma-related deficit in place.
Spatial correlation
The picture below, on the left, shows the structure of the retinal nerve fiber bundles, visible after digital enhancement of a photograph taken with a blue filter. On the right, the same structure is manually highlighted in red. The retinal nerve fibers transmit visual information from the retina to the brain.
Overlapping the structure of the retinal nerves with the visual field matrix suggests a complex spatial correlation between the points.
The following plots display the spatial correlation obtained by analyzing the correlation matrices of the raw sensitivities, the total deviations, and the pattern deviations. The two lines without data correspond to the position of the blind spot.
Correlation matrix of the total deviations
Correlation matrix of the pattern deviations
Visual field deficits in glaucoma are caused by damage to the nerve fibers that conduct nerve impulses to the brain. The complex spatial correlations between VF points described above result in patterns of visual field loss that cannot be summarized by a single metric. This was the rationale for our decision to use pointwise data rather than a global index as the input to our classifiers: a global index typically loses most of the spatial information encoded in the sensitivities of the VF points.
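A minimal sketch of how such a pointwise correlation matrix can be computed, assuming an array with one metric (e.g., total deviation) per field:

```python
import numpy as np

def pointwise_correlation(values: np.ndarray) -> np.ndarray:
    """values: (n_fields, 54) -> (54, 54) matrix of pairwise correlations."""
    # Columns are VF points; the two blind-spot points have (near-)constant
    # values and come out as empty rows/columns, as in the plots above.
    return np.corrcoef(values, rowvar=False)
```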
#4: Data Normalization
Different patients have different numbers of visual fields, but our machine learning models require a fixed-length input.
Moreover, there was a lot of variation in the temporal range of the VF recordings: one patient's fields may span 10 years, another's only 2. We therefore had to normalize the time dimension in our training dataset, and we explored a few different solutions:
We divided the visual fields for each patient into two groups (first half and second half), calculated the difference in pointwise means between the two groups, and then divided the delta by the time spanned by the visual fields.
Like the previous approach, but without dividing by time.
Instead of taking all the visual fields, we took the first and the last two visual fields.
The second approach gave us the best results with our classifiers.
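A minimal sketch of this winning (second) approach, assuming each eye's fields arrive as a chronologically ordered (n_visits, 54) array:

```python
import numpy as np

def halfwise_delta(fields: np.ndarray) -> np.ndarray:
    """fields: (n_visits, 54) -> fixed-length (54,) feature vector."""
    mid = len(fields) // 2
    first, second = fields[:mid], fields[mid:]
    # Pointwise change between the two halves; the first approach would
    # additionally divide this by the time spanned by the fields.
    return second.mean(axis=0) - first.mean(axis=0)
```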
#5: Label Generation
In order to apply a classifier to the dataset, we had to assign a label to every eye that met our inclusion criteria (described previously). We used two labels: (1) Stable: vision in the eye is stable, i.e., not decreasing at a measurable rate; (2) Progressing: vision in the eye is measurably getting worse. Note that it is quite possible for the left eye of a patient to have severe glaucoma while the right eye is disease-free (or vice versa). Hence each eye (left or right) was considered separately as a single sample.
We used 6 methods to determine the same metric - the progression of visual field loss (or lack thereof):
AGIS Score - Advanced Glaucoma Intervention Study
CIGTS Score - Collaborative Initial Glaucoma Treatment Study
Mean Deviation - (provided by the HFA)
VFI index - calculated using an open source R package
PLR - Pointwise Linear Regression
PoPLR - Permutation of Pointwise Linear Regression
The first four algorithms evaluate the severity of vision loss (how bad is it?). The last two measure "progression" of visual loss (is it getting worse with time?).
Since the quantity we are actually interested in is progression, i.e., whether there is a measurable rate of loss, for the severity measures we computed the rate of severity increase across multiple VF studies to obtain a progression measure.
None of these indices (except MD) is readily available in the data. For the rest, we either implemented the algorithms ourselves (AGIS, CIGTS) or used an open source package (the remaining indices): the R package "visualFields" developed by Ivan Marin-Franch.
One important finding from generating the labels is that the six different approaches can produce six different verdicts. The following table is an extract from the table we developed for all of our roughly 10,000 eyes.
For machine learning, we needed a single label for each eye, so we used a majority voting system: the winning vote (progressing or stable) became the training label for our classifier.
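A minimal sketch of the vote (column names are illustrative; with six voters a 3-3 tie is possible, and the tie-break toward "stable" shown here is our assumption):

```python
import pandas as pd

ALGOS = ["agis", "cigts", "md", "vfi", "plr", "poplr"]

def majority_label(votes: pd.DataFrame) -> pd.Series:
    """votes: one row per eye, one 0/1 column per algorithm (1 = progressing)."""
    return (votes[ALGOS].sum(axis=1) > len(ALGOS) / 2).astype(int)
```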
After the labeling we analyzed the data using Principal Component Analysis. It is clear from the following set of pictures that there are two distinct regions (one for stable and one for progressing) as well as a large overlap between the two clusters. The first chart plots the stable patients first, the second plots the progressing patients first.
Chart plotting the stable patients first
Chart plotting the progressing patients first
The following chart presents the final result with the "boundary" patients, i.e., patients who are neither measurably stable nor measurably progressing.
From the figures it's clear that the "boundary" patients are those in the intersection between the two sets. This makes the data very difficult to study with unsupervised clustering: both groups of patients are losing their sight, but those classified as "stable" are losing it at an imperceptibly slower rate than the "progressing" ones. Essentially, it is human perception that divides a patient into progressing or stable; in reality there is a continuum between the two states, and they overlap.
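For reference, a minimal sketch of how such a PCA projection can be produced (X is a feature matrix with one row per eye, y the majority-vote labels):

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_pca(X, y):
    # Project the pointwise features down to the first two components.
    pcs = PCA(n_components=2).fit_transform(X)
    for label, name in [(0, "stable"), (1, "progressing")]:
        mask = y == label
        plt.scatter(pcs[mask, 0], pcs[mask, 1], s=5, alpha=0.5, label=name)
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.legend()
    plt.show()
```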
#6: Model Development
We applied several different classifiers to our dataset, in particular:
Logistic Regression
Random Forest
Extreme Gradient Boosting
Support Vector Classifier
Vanilla Neural Network
Convolutional Neural Network
Our goal in modelling was not to come up with a perfect classifier for the proxy label we used (the majority vote label). Our hypothesis instead was that the proxy label reflected a strong "true" signal corrupted by noise from our imperfect labeling, so we tuned our hyperparameters to deliberately underfit to the noisy label. The tuning was carried out by assuming that within each category (stable and progressing) there exists, given sufficient eyes, a normal distribution of eyes; we felt that 10,000 eyes was a large enough population to give us a reasonably accurate normal distribution within each category. We then started fitting models to the training set using the simplest and least expressive model possible (for a random forest, the hyperparameter we used was the maximum depth of the trees in the forest) and increased the max-depth hyperparameter until we had a reasonably accurate normal distribution of eyes in each of the stable and progressing categories. In other words, we chose the simplest model that resulted in a normal distribution of eyes in each class. By underfitting in this fashion, our hypothesis was that we would pick out the hypothesized strong signal in the data without being affected by the noise introduced by our proxy labels.
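A minimal sketch of this tuning loop for the random forest case. The exact normality criterion is not spelled out above, so the D'Agostino-Pearson test below is one possible stand-in:

```python
from scipy import stats
from sklearn.ensemble import RandomForestClassifier

def shallowest_normal_depth(X, y, max_depth=20, alpha=0.05):
    """Return the smallest max_depth whose per-class score distributions
    look roughly normal, i.e., the simplest model that underfits gracefully.
    X: feature matrix; y: 0/1 numpy array of proxy labels."""
    for depth in range(1, max_depth + 1):
        clf = RandomForestClassifier(n_estimators=200, max_depth=depth, random_state=0)
        scores = clf.fit(X, y).predict_proba(X)[:, 1]
        # Test normality of the predicted-score distribution within each class.
        if all(stats.normaltest(scores[y == c]).pvalue > alpha for c in (0, 1)):
            return depth
    return max_depth
```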
In the rest of this section we discuss the challenges we faced in developing some of the models, particularly the convolutional neural network.
The structure of the Neural Network is described in the following table. To avoid overfitting we added 4 dropout layers with decreasing dropout probabilities, from 0.5 to 0.2.
We trained the Neural Network for 20,000 epochs, since we found that more training didn't improve the accuracy of the classifications. We used Keras with a TensorFlow backend, and we noticed that the model loss on the validation set was smaller than the loss on the training set. We believe we have a reasonable explanation for this (the following is adapted from the Keras documentation):
1. During the calculation of the validation set accuracy, regularization mechanisms such as Dropout are turned off
2. The training loss is the average of the losses over an epoch. A model is typically worse at the start of an epoch than at the end, so the loss averaged across the epoch will typically look worse than the validation loss measured at the end of the epoch (unless there is strong overfitting).
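A minimal sketch of a network consistent with the dropout structure described above (four dropout layers, rates decreasing from 0.5 to 0.2); the layer widths and input size (54 VF points) are assumptions, since the original table is not reproduced here:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(54,)),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),                    # dropout rates decrease 0.5 -> 0.2
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.4),
    layers.Dense(32, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(16, activation="relu"),
    layers.Dropout(0.2),
    layers.Dense(1, activation="sigmoid"),  # stable vs. progressing
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```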
For the Convolutional Neural Network we had to make the visual field matrix rectangular. We manipulated the data as described in the following picture, adding zero-valued padding where needed:
The following table describes the structure of the classifier, which we trained for 300 epochs; more training didn't improve the network's ability to correctly classify the patients. In this case too we added dropout layers to avoid overfitting, with a dropout probability of 0.2 for both layers.
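A minimal sketch of the zero-padding step; the 8x9 grid and per-row offsets are our assumptions about the 24-2 layout, which the picture above defines exactly:

```python
import numpy as np

ROW_LENGTHS = [4, 6, 8, 9, 9, 8, 6, 4]   # tested points per row (sums to 54)
ROW_OFFSETS = [2, 1, 0, 0, 0, 0, 1, 2]   # leading zero cells per row

def to_grid(field: np.ndarray) -> np.ndarray:
    """field: flat array of 54 values -> (8, 9) zero-padded matrix."""
    grid = np.zeros((8, 9))
    i = 0
    for row, (n, off) in enumerate(zip(ROW_LENGTHS, ROW_OFFSETS)):
        grid[row, off:off + n] = field[i:i + n]
        i += n
    return grid
```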
Classifier Strengths
Encouraging F1 scores: Our top 4 classifiers had F1 scores in line with leading research algorithms (i.e., 0.90 or above)
Lower Class-Bias: Our classifiers did not overpredict "progressing" on patients identified as "too close to call" by the glaucoma specialist, while VFI and PLR did. On the set of 66 "too close to call" patients, PLR and VFI identified 63/66 (95%) and 59/66 (89%) patients as progressing, which suggests a very strong tendency to overpredict in cases that were not clear even to human experts.
Classifier Metrics
Class-Bias: The term class-bias in our case refers specifically to one form of the term bias as used in machine learning literature. In machine learning, bias (i.e. ML-bias) refers to any systematic deviation from the true model across many training sets. The bias we are referring to here (class-bias) is the tendency of a classifier to overpredict one class or the other.
Unless the class-bias is overwhelmingly in favor of one class, clearly demarcated cases tend to be predicted correctly despite class-bias. A good way to detect any hidden class-bias is therefore to look at the examples straddling the boundary between "progressing" and "stable". Class-bias (if any) in these cases cannot be ignored, since fully one third of all eyes submitted to a human expert for labelling were classified as "boundary" eyes.
To ensure that our determination of hidden class-bias was statistically sound, we conducted a chi-squared analysis to determine whether the distribution of predictions for the "boundary" eyes differed significantly from the distribution of the ground truth for the clearly demarcated eyes. A large p-value implies no statistically significant difference in distributions, indicating a balanced classifier; a small p-value shows a statistically significant difference, indicating a biased classifier.
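A minimal sketch of such a test using scipy (the function and variable names are ours, and the exact construction of observed and expected counts is our reading of the description above):

```python
import numpy as np
from scipy.stats import chisquare

def class_bias_pvalue(boundary_preds: np.ndarray, clear_truth: np.ndarray) -> float:
    """boundary_preds: 0/1 predictions on the 66 boundary eyes;
    clear_truth: 0/1 expert labels on the 134 clear-cut eyes."""
    n = len(boundary_preds)
    observed = [(boundary_preds == 0).sum(), (boundary_preds == 1).sum()]
    # Expected counts: the clear-cut ground-truth proportions scaled to n eyes.
    p_stable = (clear_truth == 0).mean()
    expected = [n * p_stable, n * (1 - p_stable)]
    return chisquare(observed, f_exp=expected).pvalue
```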
Many of the ophthalmological indices had p-values less than 0.05, indicating a very strong class-bias towards either the "progressing" or the "stable" label irrespective of F1 score. The machine learning classifiers, on the other hand, showed no such class-bias.
F1 score: the harmonic mean of precision (what fraction of a predicted label is really that label) and recall (what fraction of a particular label is actually predicted).
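In formula form, with TP, FP, and FN denoting true positives, false positives, and false negatives:

```latex
\mathrm{precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
```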