While comprehensive molecular characterization will transform our understanding of cancer, advanced computational methods are also needed to generate insights from high-dimensional datasets. To address this need, we develop a novel approach using DNM for end-to-end unsupervised representation and visualization of neoplastic and non-neoplastic tissue samples that allows their molecular properties to directly determine their similarity. Briefly, our approach involves mapping a topological distribution of neoplastic and non-neoplastic samples to an SOM, identifying samples that do not cluster with their class, and using activation gradients to better understand the mechanistic significance of select miRNAs. Combining the identification of misclassified samples on the SOM lattice with the interpretation of the significance of miRNA profiles in their representation leads to a better understanding of cancer. In particular, samples that do not cluster well with others of similar ontology are analyzed for differences in their miRNA expression. This has the potential to reveal biomarkers shared between cancers of different origin, as well as possible patient-specific targeted treatment options.
Comparative Approaches and Ablation Study
DNM provides an unsupervised approach for visualizing data clusters based on their inherent properties. Comparisons to other well-known visualization techniques have been reported previously in Pesteie et al. As opposed to Principal Component Analysis (PCA), which reduces dimensionality linearly, the DNM is non-linear. Compared to t-distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP, Fig. 7), our solution fine-tunes the latent space in DNMs to provide better topological mapping and low reconstruction error by integrating the SOM and AE losses. In addition, our solution can embed data into a larger latent space than t-SNE and UMAP, preserving the topological structures among data points. This allows the optimization of the latent space, preventing the loss of important features. We experiment with various AE dimensions and hyperparameters to achieve the best reconstruction error, and try latent-space sizes in the range of 10–40. The AE loss decreases as the latent size grows to 25 and increases thereafter. We also notice that as the loss increases, samples of the same class are more dispersed on the lattice, showing the importance of optimizing the latent size. This highlights one of the advantages over traditional visualization techniques: the SOM is able to map higher-dimensional features in two dimensions while preserving the topology. The lower error obtained with a latent size of 25 indicates that UMAP and t-SNE may lose important information during feature reduction, leading to less refined clusters.
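To make the idea of integrating the SOM and AE losses concrete, the following is a minimal sketch under stated assumptions. It is not the authors' implementation. The function names, the weighting parameter `som_term_weight`, and the simplified SOM term (squared distance to the best-matching unit only, with no neighbourhood function) are all hypothetical choices for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_matching_unit(latent, som_weights):
    """Return the lattice coordinate whose weight vector is closest to `latent`.

    `som_weights` maps a lattice coordinate, e.g. (row, col), to a weight vector
    with the same length as the latent code.
    """
    return min(som_weights, key=lambda node: squared_distance(latent, som_weights[node]))


def combined_loss(sample, reconstruction, latent, som_weights, som_term_weight=1.0):
    """Assumed combined objective: AE reconstruction error plus a SOM term.

    The SOM term here is only the squared distance between the latent code and
    its best-matching unit; a full SOM loss would also weight neighbouring nodes.
    """
    reconstruction_error = squared_distance(sample, reconstruction) / len(sample)
    bmu = best_matching_unit(latent, som_weights)
    quantization_error = squared_distance(latent, som_weights[bmu])
    return reconstruction_error + som_term_weight * quantization_error, bmu


# Toy usage with a two-node lattice and a two-dimensional latent code.
som_weights = {(0, 0): [0.0, 0.0], (0, 1): [1.0, 1.0]}
loss, bmu = combined_loss([1.0, 2.0], [1.0, 1.0], [0.9, 1.0], som_weights)
```

In a latent-size sweep such as the 10–40 range described above, a quantity of this kind would be evaluated once per candidate size and the size with the lowest validation error retained.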
DNM Clustering and Accuracy
The DNM is able to effectively discriminate between neoplastic and non-neoplastic tissues. Distinct clusters of tissue samples corresponding to these designations form on opposite sides of the lattice in Fig. 3a, b. This suggests that cancers of different tissue origins share sufficient miRNA features that are specific to neoplasticity. This observation has been reported previously in the literature; for example, miR-21 has been shown to be up-regulated in many cancers. It is further supported by the high accuracy of classifying neoplastic and non-neoplastic samples (Table 1). The DNM also produces distinct clusters of samples when neoplasticity and tissue-of-origin are examined together, as shown in Fig. 3c. The presence of distinct tissue-of-origin clusters demonstrates the known ability of miRNAs to be used for tissue typing. The DNM maps the held-out test data to locations similar to those of their respective classes seen during training, and generalizes well to unseen data. The DNM accuracy is next compared to that of a supervised MLP. The MLP outperforms the DNM in all three classification tasks. The MLP uses the labels of each sample during training, whereas the DNM uses only similarity measures to identify classes. While two samples may have the same neoplasticity status and tissue-of-origin, their true miRNA dysregulation and/or molecular properties may differ substantially, leading to misclassifications. The motivation behind the DNM is to identify the most similar samples based purely on molecular properties, and to analyze samples that do not cluster well with their respective class, leading to a better understanding of individual patient samples. This is performed by analyzing samples mapped to specific nodes, as opposed to analyzing clusters of multiple nodes. Samples at the same node are the most similar to each other; it is therefore expected that these would be samples from the same class.
Those that differ from the expectation are potentially abnormal samples which should be further analyzed. While these are called ‘misclassifications’, they may represent samples with unusual miRNA dysregulation.
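The node-level analysis described above can be sketched as follows. This is an illustrative sketch, not the authors' code: it assumes each node is labeled with the most common class among the samples mapped to it, which the text does not state explicitly. The function names are hypothetical.

```python
from collections import Counter, defaultdict


def label_nodes(bmus, labels):
    """Label each node with the most common class among samples mapped to it.

    `bmus` holds one lattice coordinate per sample and `labels` the matching
    class names. Majority voting is an assumption made for this sketch; when
    counts are equal, the class encountered first at the node is chosen.
    """
    labels_at_node = defaultdict(list)
    for node, label in zip(bmus, labels):
        labels_at_node[node].append(label)
    return {
        node: Counter(node_labels).most_common(1)[0][0]
        for node, node_labels in labels_at_node.items()
    }


def flag_off_class_samples(bmus, labels, node_labels):
    """Return indices of samples whose class differs from their node's label."""
    return [
        index
        for index, (node, label) in enumerate(zip(bmus, labels))
        if node_labels[node] != label
    ]


# Toy usage: five samples mapped to two nodes.
bmus = [(0, 0), (0, 0), (0, 0), (1, 1), (1, 1)]
labels = ["A", "A", "B", "B", "B"]
node_labels = label_nodes(bmus, labels)
flagged = flag_off_class_samples(bmus, labels, node_labels)
```

The flagged indices are the candidates for follow-up analysis, since they sit at a node dominated by a different class.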
DNM Misclassifications
Our method allows further quantitative exploration of misclassified samples. A large number of neoplastic skin samples are misclassified as non-neoplastic skin (Fig. 4, green), which is also shown through their proximal mapping in Fig. 3c. Looking at the class-average activation gradients, these two classes have many similarly activated miRNAs (Fig. 5), which makes the class differences difficult to determine. An additional contributing issue is that skin samples obtained for pathology contain multiple different cell types, leading to varying molecular signatures.
Multiple classes of tissue were wrongly mapped to nodes labeled as neoplastic breast (Fig. 4, orange). Neoplastic breast samples share some of their most activated miRNAs (e.g. miR-21, miR-26a) with numerous classes, as seen in Fig. 5. In addition, miRNA expression is more varied within neoplastic tissue due to differing cancer grades and subtypes [31, 47], which may contribute to more misclassifications. This is further seen in the high number of multi-class neoplastic nodes shown in Fig. 6a compared to the number of non-neoplastic multi-class nodes in Fig. 6b. The low number of non-neoplastic multi-class nodes reflects the stable expression of healthy tissue and acts as a positive control for the DNMs (Fig. 6).
A challenge we face is that for classes with a low number of samples and heterogeneous cell types, e.g. neoplastic intestine tissue (intestine samples share smooth muscle features with other soft tissues), the DNM is not able to map them to designated nodes. Instead, they are often misclassified as other classes with which they may share features (Fig. 4, red). By identifying these misclassifications, it is possible to further study molecular similarities between known classes or discover new shared signatures.
Discovery of Cancer Biomarkers
The miRNA activation gradients provide a novel approach for proposing potential cancer biomarkers and molecular drivers through comparison of neoplastic and non-neoplastic activation for the same organ. From Fig. 5, in all classes except pancreatic tissue, miR-21 has higher activation in neoplastic than in non-neoplastic tissue. This difference suggests that miR-21 could be an important biomarker for discriminating neoplastic from non-neoplastic tissue in these organs. We then examine the initial normalized expression of miR-21 in these tissues in Fig. 8, which indeed shows a difference in expression between neoplastic and non-neoplastic tissue in each organ. In addition, miR-21 is a known oncomiR shown to have upregulated expression in certain cancers, which has been linked to overtargeting of genes that prevent metastasis and apoptosis [46, 48]. Many other potential biomarkers in Fig. 5 have also been reported in the literature. For example, miR-143 has lower activation in neoplastic bronchus and lung, breast, and skin tissue than in the corresponding non-neoplastic tissue. miR-143 is a known tumour suppressor, and its down-regulation has already been linked to many cancers. Further, let-7b, a known tumour suppressor that has been shown to target KRAS, has lower activation in neoplastic bronchus and lung. The KRAS gene is often mutated in lung cancer, which can prevent binding of let-7b to the target site (preventing mRNA degradation) and downregulate let-7b through further downstream complications of upregulated KRAS. This is a key example of how computation meets biology and identifies important parts of entire regulatory networks. While our proposed approach can be used to identify potential cancer biomarkers, knowledge of miRNA tissue specificity is essential to prevent the misidentification of tissue-specific markers as cancer biomarkers.
For example, miR-1 appears to have low activation in neoplastic soft tissue samples. This is likely due to how the tissue samples were collected. miR-1 is a known muscle tissue marker; during sample collection, the non-neoplastic samples likely contained muscle tissue, whereas the neoplastic samples contained tissue only from the tumour. The heterogeneity of sample collection must be considered when examining potential miRNA biomarkers. In addition, while activation partly reflects the expression of a miRNA, its purpose is to highlight the attention of the deep learning model to this particular marker.
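The comparison of neoplastic and non-neoplastic activation for the same organ can be illustrated with a short sketch. The data layout (a dictionary keyed by organ and status), the function name, and the toy values are assumptions for illustration only. The miRNA names are placeholders and the numbers are not values from this study.

```python
def activation_differences(class_average, organ):
    """Rank miRNAs by the gap between neoplastic and non-neoplastic activation.

    `class_average` maps (organ, status) to a dict of miRNA name -> class-average
    activation. Returns (miRNA, neoplastic minus non-neoplastic) pairs, sorted so
    the largest absolute difference comes first.
    """
    neoplastic = class_average[(organ, "neoplastic")]
    non_neoplastic = class_average[(organ, "non-neoplastic")]
    shared = neoplastic.keys() & non_neoplastic.keys()
    differences = {name: neoplastic[name] - non_neoplastic[name] for name in shared}
    return sorted(differences.items(), key=lambda item: abs(item[1]), reverse=True)


# Placeholder activations (made up for this example).
toy_average = {
    ("breast", "neoplastic"): {"miR-x": 0.75, "miR-y": 0.25},
    ("breast", "non-neoplastic"): {"miR-x": 0.25, "miR-y": 0.5},
}
ranked = activation_differences(toy_average, "breast")
```

A positive difference marks a candidate with higher activation in neoplastic tissue and a negative one a candidate with lower activation. As noted above, any candidate should still be checked against known tissue-specific markers and sample-collection effects before being treated as a cancer biomarker.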
Multi-Class Node Interpretation: Case 1
To further examine misclassified samples, we study multi-class nodes of the SOM lattice that have three or more classes mapped to them. Upon analysis of nodes [10,0], [11,0], and [12,0] (Fig. 6a, green, and Table 2), we find that the apparent misclassifications at these nodes are the result of neuroendocrine tumour samples. Neuroendocrine tumours (NETs) are a rare form of cancer that develops within the neuroendocrine cells of numerous different organs. In the dataset we use, NETs are present in a total of 8 of the 17 tissue types (adrenal gland, bronchus and lung, endocrine and related structures, pancreas, paraganglion, skin, small intestine, and thyroid gland). However, the NETs are only annotated in the disease subtype, which we do not consider in this study. Calculating sample-specific activation gradients at these three nodes indicates shared features of miR-375 and miR-7, which are specific to NETs. Upon closer evaluation of the disease subtypes of the samples, every sample mapped to these nodes is indeed a NET. Therefore, although these samples are from different organs and have distinct molecular profiles, the DNM is able to identify signatures specific to NETs and map these samples to the same location.
Since miRNAs specific to NETs are significantly represented in our data, we analyze the remainder of the NETs and find that while these samples cluster with other samples from their respective tissue-of-origin, the majority also cluster together at the bottom of the SOM (Fig. 3c). Using class-average activation gradients, we also identify that both neoplastic pancreas and neoplastic bronchus and lung have high activation of miR-375, with neoplastic pancreas having high activation of miR-7 as well (Fig. 5).
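Selecting the nodes studied in this case, those with three or more classes mapped to them, can be sketched as below. The function name and input layout are hypothetical, and the toy labels are placeholders rather than entries from Table 2.

```python
from collections import defaultdict


def multi_class_nodes(bmus, labels, min_classes=3):
    """Return nodes that have at least `min_classes` distinct classes mapped to them.

    `bmus` holds one lattice coordinate per sample and `labels` the matching
    class names. The result maps each qualifying node to its sorted class list.
    """
    classes_at_node = defaultdict(set)
    for node, label in zip(bmus, labels):
        classes_at_node[node].add(label)
    return {
        node: sorted(classes)
        for node, classes in classes_at_node.items()
        if len(classes) >= min_classes
    }


# Toy usage: one node holds three classes, the other only one.
bmus = [(10, 0), (10, 0), (10, 0), (5, 5), (5, 5)]
labels = ["pancreas", "skin", "thyroid", "breast", "breast"]
mixed = multi_class_nodes(bmus, labels)
```

Each node returned this way is then a candidate for sample-specific activation analysis, as was done for the NET nodes.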
Multi-Class Node Interpretation: Case 2
Node [9,12], shown in Fig. 6 and Table 2, contains samples from soft tissue, breast and skin cancers. We calculate sample-specific activation gradients for these cases and compare them with their corresponding class-average activations. We find that several shared miRNAs in these samples have significantly higher or lower activations than the class averages. Selected shared features include high activation of miR-21 and miR-199a-3p, and low activation of miR-26a and miR-125b. These miRNAs are known oncomiRs (miR-21, miR-199a-3p) and tumour suppressors (miR-26a, miR-125b), and are also identified by our models as significant contributors to abnormal sample clustering [12, 52, 53]. It is possible that these samples are of higher grade or aggressiveness than others from their respective classes in the data. We hypothesize that it is possible to use the SOM maps to identify potential higher-grade tumours or other abnormalities in patient-specific samples, a likely valuable tool for experimental cancer biologists.
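The comparison of sample-specific activations with class averages can be sketched as follows. This is a simplified illustration, not the authors' procedure: the fixed `threshold` stands in for whatever significance criterion was actually used, and the function name, data layout, and toy values are assumptions.

```python
def shared_deviations(sample_activations, sample_classes, class_average, threshold):
    """Find miRNAs that deviate from the class average in the same direction in every sample.

    `sample_activations` is a list of dicts (miRNA -> activation), one per sample;
    `sample_classes` gives each sample's class; `class_average` maps a class to
    its average activations. A deviation counts when its absolute size is at
    least `threshold` (an assumed stand-in for a significance test).
    """
    shared = None
    for activations, sample_class in zip(sample_activations, sample_classes):
        directions = {}
        for name, value in activations.items():
            delta = value - class_average[sample_class][name]
            if abs(delta) >= threshold:
                directions[name] = "higher" if delta > 0 else "lower"
        if shared is None:
            shared = directions
        else:
            # Keep only miRNAs that deviate the same way in this sample too.
            shared = {n: d for n, d in shared.items() if directions.get(n) == d}
    return shared or {}


# Placeholder values for two samples of different classes at one node.
samples = [{"a": 1.0, "b": 0.0, "c": 0.5}, {"a": 0.75, "b": 0.25, "c": 0.5}]
classes = ["x", "y"]
averages = {
    "x": {"a": 0.5, "b": 0.5, "c": 0.5},
    "y": {"a": 0.25, "b": 0.75, "c": 0.0},
}
common = shared_deviations(samples, classes, averages, threshold=0.25)
```

Requiring the same direction in every sample is a deliberately strict choice. A looser rule, such as agreement in a majority of samples, may suit nodes with many samples better.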