To evaluate scSemiAE comprehensively, we run several experiments on real datasets and compare it with both semi-supervised and unsupervised methods. We test the performance of scSemiAE under three scenarios: different proportions of labeled cells, different numbers of labeled cell subpopulations, and the presence of batch effects. Because the datasets in these experiments must already be labeled with the true cell subpopulations to provide a standard evaluation criterion, we omit the cell annotation step; in real applications, however, cell annotation may be necessary to obtain partial labels with high confidence.
The methods compared are PCA, the semi-supervised methods netAE and scANVI, and the unsupervised methods AE and scVI. Here, AE refers to the pretraining model of scSemiAE. For fairness, we use a similar network structure for all deep models: the encoder and the decoder each have two fully connected layers, and the dimension of the latent space is 50. All deep models are trained for 50 epochs. In addition, to reduce the effect of parameter tuning on performance, the compared methods are run with their default parameters and algorithm procedures. Likewise, we do not tune the hyperparameters of scSemiAE separately for each dataset but use its default settings.
Since scSemiAE can be treated as a dimensionality reduction algorithm, we assess its performance with two representative metrics, one from classification and one from clustering. (1) Accuracy (ACC): This metric evaluates classification performance. We use the k-nearest neighbor (kNN) classifier with k = 10, one of the simplest classification models. For the deep models (scSemiAE, AE, scVI, scANVI and netAE), the embedding vectors of the labeled cells are used to train a kNN classifier, and prediction accuracy is calculated on the unlabeled set. For PCA, the kNN classifier is trained on the principal components. (2) Adjusted Rand Index (ARI): This metric evaluates clustering performance. We use Louvain and K-means, two of the most popular clustering algorithms. They are applied directly to the low-dimensional representations of the unlabeled cells only, and ARI is likewise calculated on the unlabeled cells. Louvain stops at the best split. For K-means, the number of clusters k is set to the exact number of cell subpopulations. In addition, Uniform Manifold Approximation and Projection (UMAP) is used for visualization by projecting the embeddings into two dimensions.
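The two metrics can be illustrated with a minimal sketch in plain Python. It is not the evaluation code used in the experiments: the function names, the Euclidean distance, the nearest-first tie-breaking in the kNN vote and the toy data are all assumptions made for illustration. ARI is computed from the contingency table of true and predicted labels, and ACC is the share of held-out cells whose majority-vote kNN label matches the true label.

```python
from collections import Counter
from math import comb, dist


def adjusted_rand_index(true_labels, pred_labels):
    """ARI between two labelings of the same cells (needs at least 2 cells)."""
    n = len(true_labels)
    cells = Counter(zip(true_labels, pred_labels))  # contingency table entries
    rows = Counter(true_labels)                     # true subpopulation sizes
    cols = Counter(pred_labels)                     # cluster sizes
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)     # index expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:                       # degenerate, e.g. one cluster each
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


def knn_accuracy(train_x, train_y, test_x, test_y, k=10):
    """Fraction of test cells whose majority-vote kNN label is correct."""
    correct = 0
    for x, y in zip(test_x, test_y):
        # Rank labeled cells by Euclidean distance (an assumption) and keep k.
        nearest = sorted(zip(train_x, train_y),
                         key=lambda item: dist(item[0], x))[:k]
        votes = Counter(label for _, label in nearest)
        # Ties go to the label seen first, i.e. the one with the nearest cell.
        predicted = votes.most_common(1)[0][0]
        correct += predicted == y
    return correct / len(test_y)


# Toy 2-D "embeddings" (hypothetical data, not from the paper).
train_x = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
train_y = ["a", "a", "a", "b", "b", "b"]
test_x = [(0.05, 0.05), (5.05, 5.05)]
test_y = ["a", "b"]

print(knn_accuracy(train_x, train_y, test_x, test_y, k=3))
print(adjusted_rand_index([0, 0, 1, 2], [0, 0, 1, 1]))
```

ARI depends only on how cells are grouped, not on the cluster names, so a clustering that recovers the true subpopulations under different labels still scores 1.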
scSemiAE performs best for tests with different proportions of labeled cells
The first experiment tests the performance of scSemiAE as the proportion of labeled cells varies. We set the proportion of labeled cells to 0.05, 0.1, 0.2 and 0.4. Fig. 2 presents the results of all compared methods on the first four datasets; the mean and standard deviation of the ARI and ACC values are computed over 20 random samplings of the labeled cells (the corresponding numerical values are given in Additional file 1).
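The repeated-sampling protocol described above can be sketched as follows. This is an illustrative outline only: it assumes that labeled cells are drawn uniformly at random without replacement and that the sample standard deviation is reported, neither of which is stated in the text. The names `sample_labeled_indices`, `repeat_evaluation` and `dummy_score` are hypothetical, and `dummy_score` merely stands in for training a model and computing ARI or ACC on the unlabeled cells.

```python
import random
from statistics import mean, stdev


def sample_labeled_indices(n_cells, proportion, rng):
    """Pick a random subset of cell indices to treat as labeled."""
    n_labeled = max(1, round(n_cells * proportion))
    return set(rng.sample(range(n_cells), n_labeled))


def repeat_evaluation(n_cells, proportion, score_fn, n_repeats=20, seed=0):
    """Mean and sample standard deviation of score_fn over random splits."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        labeled = sample_labeled_indices(n_cells, proportion, rng)
        unlabeled = [i for i in range(n_cells) if i not in labeled]
        # score_fn receives the labeled and unlabeled index lists for one split.
        scores.append(score_fn(sorted(labeled), unlabeled))
    return mean(scores), stdev(scores)


def dummy_score(labeled, unlabeled):
    """Placeholder for 'train on labeled, score on unlabeled' (hypothetical)."""
    return len(labeled) / (len(labeled) + len(unlabeled))


for proportion in (0.05, 0.1, 0.2, 0.4):
    avg, sd = repeat_evaluation(1000, proportion, dummy_score)
    print(proportion, avg, sd)
```

Fixing the seed makes the 20 splits reproducible, so every method can be scored on the same labeled and unlabeled sets.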
As shown in Fig. 2, scSemiAE performs best overall. It achieves the best score in most tests; in particular, for the ARI values calculated from Louvain, scSemiAE exceeds the other algorithms by 10–30%. For K-means, scSemiAE is the most stable method across datasets: although it is second best on the Cortex and Limb Muscle datasets, it performs much better than the others on the Heart and Embryos datasets. PCA fluctuates considerably, and all the other models are worse than scSemiAE in ARI. scSemiAE is therefore a better choice for clustering scRNA-seq data. Moreover, scSemiAE performs far better than AE, which is its pretraining part, demonstrating that the fine-tuning part of scSemiAE helps greatly. Although K-means appears to perform better than Louvain, this conclusion should not be drawn hastily. The true number of cell subpopulations, an important parameter for K-means, usually cannot be known beforehand. In the above experiments, Louvain probably produces more clusters than K-means, for which k is set exactly to the true number of cell subpopulations. This can explain why the ARI of Louvain looks worse than that of K-means.
As for ACC, scSemiAE achieves the best score on two datasets, Limb Muscle and Heart, and is competitive on the other two. Moreover, when the proportion of labeled cells is very low, such as 0.05, scSemiAE is the best method.
These experiments show that scSemiAE should be the first choice among the compared semi-supervised and unsupervised methods, since it performs best or second best in the tests with different proportions of labeled cells.
scSemiAE performs best for tests with different numbers of labeled cell subpopulations
In practice, some cell types, especially rare ones, cannot be annotated by a cell type predictor because they may not have been detected before. We therefore explore the performance of scSemiAE when the number of annotated cell types is limited. In this section, depending on the dataset, the number of labeled cell subpopulations ranges from 2 to 5, 6 or 7. Up to 10% of the cells of a cell subpopulation with more than 50 cells may be labeled. The three semi-supervised methods, scSemiAE, netAE and scANVI, are compared.
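One way to read this labeling scheme is sketched below. The details are assumptions rather than the authors' procedure: the text does not say how the labeled subpopulations are chosen (they are drawn at random here), and the rule is interpreted as labeling at most 10% of the cells in each chosen subpopulation that has more than 50 cells. The function name and the toy dataset are hypothetical.

```python
import random
from collections import defaultdict


def label_subset_of_types(cell_types, n_labeled_types,
                          fraction=0.10, min_cells=50, seed=0):
    """Return (labeled cell indices, chosen types) for a limited-label setup."""
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for idx, cell_type in enumerate(cell_types):
        by_type[cell_type].append(idx)
    # Only subpopulations with more than `min_cells` cells are eligible.
    eligible = sorted(t for t, members in by_type.items()
                      if len(members) > min_cells)
    # Assumption: the labeled subpopulations are picked uniformly at random.
    chosen = rng.sample(eligible, n_labeled_types)
    labeled = []
    for t in chosen:
        n_take = int(len(by_type[t]) * fraction)  # at most `fraction` of the type
        labeled.extend(rng.sample(by_type[t], n_take))
    return sorted(labeled), sorted(chosen)


# Toy dataset: type "C" has only 20 cells, so it is never labeled.
cell_types = ["A"] * 100 + ["B"] * 60 + ["C"] * 20
labeled, chosen = label_subset_of_types(cell_types, n_labeled_types=2)
print(chosen, len(labeled))
```

Cells of the remaining subpopulations stay entirely unlabeled, which mimics rare or previously unseen cell types that an annotation tool would miss.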
The experimental results are shown in Fig. 3 (the corresponding numerical values are given in Additional file 2). In most cases, scSemiAE performs best regardless of the number of annotated cell subpopulations. With Louvain, when more than two cell subpopulations are annotated, the ARI of scSemiAE is about 20–30% better than that of the other two methods. When two cell subpopulations are annotated, scSemiAE performs much better on the Cortex, Heart and Limb Muscle datasets and comparably on the Embryos dataset. The K-means results largely agree with the Louvain results: scSemiAE performs best overall, except on the Cortex dataset, where netAE and scANVI slightly outperform scSemiAE when the number of cell types is less than 4.
Moreover, scSemiAE is the only one of the three methods whose clustering performance on unlabeled cells improves as the number of annotated cell subpopulations increases. In contrast, netAE and scANVI cannot take full advantage of the annotated cells, since their performance stays nearly constant as more cell subpopulations are annotated.
As shown in Table 1, some datasets are seriously unbalanced. In the Heart dataset, for example, the sizes of the largest and smallest cell subpopulations differ by two orders of magnitude. In this case, it is hard for clustering algorithms to identify common and rare cell types simultaneously. scSemiAE alleviates the problem because it gives better low-dimensional representations, in which both common and rare cell subpopulations are easy to group. As shown in Fig. 3, the ARIs of scSemiAE on the Heart dataset are much better than those of netAE and scANVI, the other two semi-supervised methods.
scSemiAE could remove batch effects
Although scSemiAE takes no extra step to remove batch effects, it has this capability because one of its goals is to bring cells of the same cell type close together.
The Pancreas dataset comes from four different batches. We repeat the experiments described above on this dataset. Among the methods, scANVI inherits from scVI a special treatment for removing batch effects, while scSemiAE and the other methods take no specific measure against them. As shown in Fig. 4 (the corresponding numerical values are given in Additional file 3), the special treatment for batch effects does help scANVI and scVI outperform the other methods. scSemiAE, in turn, performs similarly to scVI in clustering, a little worse than the semi-supervised scANVI, and much better than netAE and the remaining unsupervised methods. This demonstrates that even when batch effects exist, scSemiAE can give better low-dimensional representations, which make the work of clustering algorithms easier.
For the Pancreas dataset, we further use UMAP to visualize the raw data and the embeddings from scSemiAE and netAE (Fig. 5); here the labeled proportion is set to 0.1. Fig. 5a, from the raw data, shows that when batch effects exist, cells of the same type, such as alpha cells from CelSeq2 and Fluidigm C1, lie rather far apart and, even worse, alpha cells and beta cells from Fluidigm C1 are mixed together. This illustrates that technical variation here is larger than biological variation, whereas most methods for batch effect removal, such as Seurat, Harmony and LIGER, work under the opposite assumption. In Fig. 5b, scSemiAE mixes cells of the same cell subpopulation from different batches very well, and the cell subpopulations are well separated, which eases clustering. In Fig. 5c, netAE appears unable to remove batch effects: alpha cells from Fluidigm C1, CelSeq2 and SMART-Seq2 are very far apart, as are beta cells from the same three batches. What is worse, alpha cells and beta cells from Fluidigm C1 are much closer to each other, just as in Fig. 5a.
The fine-tuning step of scSemiAE brings cells of the same subpopulation close together and pushes cells of different subpopulations far apart, which to some extent helps remove batch effects even though the step is guided by only a few labeled cells. netAE, by contrast, cannot deal with batch effects, since an important part of its optimization goal is to make classification work: a classifier may find a decision boundary that separates the cell subpopulations, while it remains hard for unsupervised methods to group the cells.
scSemiAE could preserve the cell differentiation structure
The Embryos dataset includes five stages of embryo development, from day 3 to day 7. UMAP visualizations of the original space and of the embedded spaces from scSemiAE, netAE and scANVI for this dataset are shown in Fig. 6. The labeled proportion is again set to 0.1. Fig. 6 shows that scSemiAE preserves the structure of the cell differentiation process. This suggests that scSemiAE can provide better low-dimensional representations, which could ease clustering and downstream trajectory inference.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.