In this work, we compared several well-established machine learning algorithms (i.e., DNN, GB, RF, and SVM) in predicting the diagnosis among numerous neurodegenerative syndromes on the basis of pre-structured, atlas-based volumetric brain MRI data. In agreement with our hypothesis, we show that neurodegenerative diseases can be classified from structural brain imaging data, in particular if they are characterized by specific atrophy patterns. Here, the DNN showed a moderate performance, whereas the three other models showed a fair performance according to Cohen’s kappa scores. Although reasonable for this ambitious clinical question, the results did not reach the substantial or almost perfect classification levels achieved in comparisons of single neurodegenerative diseases vs. controls [1, 17, 27, 36,37,38]. This important difference between the diagnostic (disease vs. control) and differential diagnostic (disease vs. disease) approach might be related to etiological overlap between clinical syndromes, unspecific atrophy patterns for some diseases, and the fact that single patients might show different syndromes over the course of the disease. These severe limitations, to be addressed in future studies, currently hamper the translation of multi-syndrome classifiers to clinical settings. In the following, we discuss our results in more detail.
Structuring imaging data for machine learning approaches
Pre-structuring of the data with atlas-based volumetry had clear advantages, such as easy assessment of particular brain regions as contributing factors for the diagnosis, both on an individual level and across syndromes, thereby increasing the interpretability of the respective model. Moreover, data could be normalized individually by adjusting to the subject’s intracranial volume. Atlas-based volumetry may also be superior to voxel-based morphometry, because the impact of different centers, scanner types, protocols, and applied parameters seems to be decreased by the processing steps in atlas-based volumetry, a hypothesis that has to be validated in future studies. Furthermore, using volumetry data allowed a model to be trained on a single CPU core with 6 GB of RAM. In contrast, training a convolutional neural network (CNN) on raw imaging data, the state-of-the-art method for image classification, requires machines with at least one 12-GB GPU or, in the case of 3D MRI volumes, a server with several GPUs. Finally, pre-structuring the data increased its anonymity, a general benefit that facilitates central data aggregation without risking the exposure of privacy-sensitive medical information.
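The individual normalization step described above can be sketched as follows; this is a minimal illustration, not the study's pipeline, and the region names and volumes are invented for the example:

```python
# Hypothetical sketch: express each atlas-based regional volume as a
# fraction of the subject's total intracranial volume (TIV), so that
# subjects with different head sizes become comparable.
def normalize_volumes(regional_volumes_ml, tiv_ml):
    """Divide every regional volume (ml) by the intracranial volume (ml)."""
    return {region: vol / tiv_ml for region, vol in regional_volumes_ml.items()}

# Illustrative values only (not from the study)
subject = {"hippocampus_left": 3.1, "hippocampus_right": 3.3}
normalized = normalize_volumes(subject, tiv_ml=1450.0)
```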
The reason we could not conduct the same experiment with raw imaging data was that we did not have access to the raw images. Despite all the advantages of pre-structured imaging data listed above, this approach precludes data augmentation of raw imaging data, a powerful strategy to increase the amount of training data and thereby boost model performance. Furthermore, and perhaps more importantly, predefined feature extraction might lead to a loss of valuable information, which is a clear limitation of our study.
Comparison of machine learning models
Corresponding to the literature, our results indicate that the DNN with a simple feed-forward architecture is the superior method for this kind of classification task, closely followed by the SVM, as illustrated in Table 2. While neural networks have become the state-of-the-art method for processing imaging and text data, DNNs have also been shown to outperform tree-based methods as well as SVMs on structured data. However, it is informative to take a closer look at model performance and robustness for each class individually, especially considering the size of the class and the specificity of the atrophy pattern, respectively (see Table 3). The DNN performed best (high F1-score and high robustness) in large classes (e.g., PD, bvFTD, AD, and PSP), where there were sufficient data for the model loss to converge. Generally, classes with smaller sample sizes expectedly led to models with weaker performance measures. GB and SVM seemed to perform best for smaller classes (e.g., MSA-C, lvPPA, CBS), while RF yielded the best robustness for smaller classes. The high robustness of RF in this case might be due to its prediction ensembles, while the superior performance of GB and SVM over the DNN might reflect that these models need less data than neural networks. Notably, classes with more specific atrophy patterns (e.g., svPPA and AD) were also best predicted by the DNN despite their comparatively small sample sizes, possibly due to faster convergence of the loss function. As expected, diseases with regionally specific and pronounced atrophy patterns, such as svPPA, AD, and PSP, were generally better classified than diseases with widespread and rather weak atrophy, such as CBS (see Fig. 4). The confusion matrices in Fig. 4 give an overview of the class-specific performance of the different methods and show that the DNN performs reasonably for all classes.
In conclusion, the larger the dataset, the better the performance. Here, the DNN clearly showed its superiority with respect to both classification performance and robustness. However, the point of convergence is the critical factor for good performance; for this, a balanced validation set must be used.
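The four model families compared above can be instantiated on structured tabular features as in the following sketch; the synthetic dataset, hyperparameters, and split are illustrative assumptions, not the configurations used in the study:

```python
# Illustrative comparison of the four model families (DNN, GB, RF, SVM)
# on structured tabular data, using scikit-learn with synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Synthetic multiclass data standing in for atlas-based volumetric features
X, y = make_classification(n_samples=400, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

models = {
    "DNN": MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=1000,
                         random_state=0),  # simple feed-forward network
    "GB": GradientBoostingClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "SVM": SVC(kernel="rbf", random_state=0),
}
scores = {name: model.fit(X_train, y_train).score(X_test, y_test)
          for name, model in models.items()}
```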
Validation was performed not only with the prediction score but also with the standard deviation of the prediction scores as a measure of robustness. Generally, the standard deviation of the model performance depends on the training dataset used, which is why we chose k-fold cross-validation instead of leave-one-out cross-validation. In contrast to leave-one-out cross-validation, k-fold cross-validation changes the class distribution in the training dataset across the different experiments, which affects model training. When leave-one-out cross-validation is performed, a class imbalance in the dataset always exists in a similar ratio (with the exception of the validation instance) and is therefore reflected in lower model quality. The highest overall robustness was observed for the DNN, while the ensemble methods in turn were least robust, possibly due to their general propensity to overfit.
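The robustness measure described above can be sketched as repeating training over stratified folds and reporting the mean and standard deviation of the score; the data and model choice here are illustrative assumptions:

```python
# Sketch of robustness estimation via stratified k-fold cross-validation:
# retrain on each fold and summarize Cohen's kappa as mean +/- std.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import cohen_kappa_score
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=300, n_classes=3, n_informative=8,
                           random_state=1)
kappas = []
for train_idx, val_idx in StratifiedKFold(n_splits=5, shuffle=True,
                                          random_state=1).split(X, y):
    clf = RandomForestClassifier(random_state=1).fit(X[train_idx], y[train_idx])
    kappas.append(cohen_kappa_score(y[val_idx], clf.predict(X[val_idx])))

mean_kappa = float(np.mean(kappas))   # overall performance
kappa_std = float(np.std(kappas))     # robustness: lower std = more robust
```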
Both recall and precision are class-wise measures and are therefore independent of the number of true negatives, which are over-represented in a multiclass problem and would inflate measures contingent on them. The F1-score combines precision and recall and gives a more holistic measure of class-wise model performance.
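A small worked example of these class-wise measures, computed on invented labels:

```python
# Per-class precision, recall, and F1 (harmonic mean of the two), none of
# which depend on true negatives of the other classes.
from sklearn.metrics import precision_recall_fscore_support

y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2]  # illustrative ground-truth labels
y_pred = [0, 1, 1, 1, 0, 2, 2, 2, 1]  # illustrative predictions
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=[0, 1, 2], zero_division=0)
# Class 0: precision 1/2, recall 1/2, F1 0.5
# Class 2: precision 1.0, recall 3/4, F1 ~0.857
```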
For the overall model performance, accuracy is a popular measure, which we included in the reported metrics. However, in the case of a multiclass problem with a large class imbalance, accuracy does not provide an honest reflection of the overall model performance. For this reason, we limit our consideration to Cohen’s kappa score for the overall model evaluation (Table 2), because this score allows a normalization by the size of the respective class. For the interpretation of Cohen’s kappa score, the following scheme can be used: 0–0.20 as slight, 0.21–0.40 as fair, 0.41–0.60 as moderate, 0.61–0.80 as substantial, and 0.81–1 as almost perfect agreement. According to this scheme, the DNN achieved a moderate performance and the three other models a fair performance. The confusion matrix (Fig. 4) further visualizes how the DNN performs better across all classes in comparison with the tree-based methods, which overfit towards the larger classes such as PD, PSP, and bvFTD.
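The kappa computation and the agreement bands quoted above can be sketched as follows; the helper name and example labels are our own:

```python
# Cohen's kappa corrects agreement for chance; map the score onto the
# interpretation scheme quoted in the text (Landis-Koch-style bands).
from sklearn.metrics import cohen_kappa_score

def kappa_band(kappa):
    """Map a kappa value to the agreement band from the text."""
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"

# Illustrative labels: 5/6 observed agreement, 1/3 expected by chance
kappa = cohen_kappa_score([0, 0, 1, 1, 2, 2], [0, 0, 1, 2, 2, 2])
```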
To better understand the decision-making process of each model, we extracted the feature importance with the LIME method. LIME explains individual predictions by fitting an interpretable surrogate model that locally approximates the predictive behavior of the classifier in question.
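The core idea behind LIME can be sketched without the lime library itself: perturb one instance, weight the perturbations by proximity, and fit an interpretable linear surrogate whose coefficients rank local feature importance. The classifier, data, and kernel width below are illustrative assumptions, not the study's configuration:

```python
# Minimal LIME-style local surrogate (not the lime package): explain one
# prediction of a black-box classifier with a locally weighted linear model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X, y)  # black-box model

instance = X[0]
# Sample perturbations around the instance to probe local behavior
perturbed = instance + rng.normal(scale=0.5, size=(500, X.shape[1]))
target = clf.predict_proba(perturbed)[:, 1]  # black-box output to mimic
# Closer perturbations get higher weight (Gaussian proximity kernel)
weights = np.exp(-np.sum((perturbed - instance) ** 2, axis=1) / 2.0)
surrogate = Ridge(alpha=1.0).fit(perturbed, target, sample_weight=weights)
# Coefficient magnitudes rank the features' local importance
importance_ranking = np.argsort(-np.abs(surrogate.coef_))
```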
Despite the differences in performance metrics, all methods were able to reproduce well-known atrophy patterns of respective syndromes (see Table 4). Note that unlike in binary disease-vs.-healthy classification tasks, the interpretation of the feature importance resulting from a multiclass classification problem is more ambiguous. The “important features” listed above merely reflect which brain regions were most important to differentiate the respective diagnosis from all other diagnoses included in the classification task.
While the use of volumetry data simplifies the classification task, it simultaneously limits the classification basis to atrophy patterns only and disregards pathological changes that do not manifest as atrophy. The two-stage approach, consisting of volumetry calculation and disease classification, also carries the risk of error summation, which can lead to an increased prediction error compared to approaches that use the original data. Our study results might be limited by the unbalanced dataset, i.e., varying numbers of subjects per group. Although this variability reflects, at least in part, differences in prevalence and data availability, the findings of our study should be validated in future, more comprehensive, better balanced, and preferably international cohorts. Hence, our results have to be validated externally to improve model generalization.