scREG: a computational method for single cell regulatory analysis from multiome data

We propose a computational method for integrative analysis of single cell multiome gene expression and chromatin accessibility data. Figure 1 shows the schematic overview of the scREG analysis workflow. Our software scREG takes as input the read count matrices of gene expression and chromatin accessibility measured on the same cells which are the same format as the standard output of 10X genomics CellRanger software. The output of scREG is a lower-dimensional representation of cells, clustering label, 2-D embedding, and subpopulation-specific cis-regulatory networks. The R package also has an interactive visualization function to plot the peaks, genes, and cis-interaction in a subpopulation-specific manner for a given genomics range. scREG processes the input data in three main steps: joint dimension reduction, cell clustering, and regulatory network inference.

Fig. 1
figure 1

Schematic overview of the scREG. The scREG takes matrices of gene expression and chromatin accessibility (E and O respectively) measured on the same cells as inputs and process the multiome data by three steps: dimension reduction, cell clustering, and regulatory network inference. First, a cell level index cis- regulatory potential matrix R is defined, indicating the regulatory strength of a peak to a target gene. Rows of R matrix represent preselected peak-gene pairs and columns represent cells. Here, the index i represent the ith peak, j represents a gene, and c represents a cell. Then, three matrices (E, O, and R) are factorized into products of domain specific profiles (W1, W2, and W3 for E, O, and R respectively) and a common low dimension representation of the cells (the H matrix) by NMF-based optimization model. Based on the reduced dimension matrix, we do cell clustering by Louvain algorithm and visualize the cells into 2D space by Umap. For each peak-gene pair, a cis– regulatory score (CRS) was defined by average of the cis- regulatory potential over cells from the same cluster. Subpopulation specific cis- regulatory networks were identified based on the CRS scores.

First, we develop a non-negative matrix factorization (NMF)-based optimization model to reduce the dimension of multiome data with m1 genes and m2 peaks to a common K dimension matrix (default value of K is 100). The gene expression data E matrix and chromatin accessibility data O matrix could be treated as two different modalities and thus those need to be integrated. To capture the cross-modalities information, we define a cell level index cis-regulatory potential (noted as R matrix, rows represent peak-gene pairs and columns represent cells) to measure the regulatory strength of a peak to a TG in a cell. We project the three different data matrices E, O, and R to a common reduced dimension space by an NMF-based optimization model (see the Joint dimension reduction” section). Specifically, we factorize the three data matrices into products of modality-specific factor profiles (W1, W2, and W3 for E, O, and R respectively) and a common low dimension representation of the cells (the H matrix).

Second, we cluster the cells based on the reduced dimension matrix. Specifically, we calculate the similarity between cells based on the reduced dimension H matrix and construct a k-nearest neighbor graph. To detect rare populations, we transfer this graph to a weighted graph where the weight between two nodes is defined as the Jaccard similarity of their neighbors in the k-nearest neighbor graph. The Louvain algorithm [37] was applied on this weighted graph to identify the cell populations (see the “Clustering of cells” sections).

Third, we define a cis-regulatory score (CRS) for each peak-gene pair in each cluster by an average of the cis-regulatory potential over cells from the same cluster (see the “Reconstruction of the cis-regulatory networks” section). We select the top 10,000 peak-gene pairs in each subpopulation to construct the cis-regulatory network and identify regulatory elements (REs) as peaks regulating at least one gene in a given subpopulation. To obtain more accurate regulatory information, we merge cells from the same subpopulation and perform peak calling on each subpopulation. We replace the REs with cluster-specific peaks, which are shorter and more accurate than the original REs, to obtain the final subpopulation-specific cis-regulatory networks.

scREG performs cross-modalities dimension reduction by data integration

To assess whether the scREG method can reduce dimensions efficiently, we apply our method to peripheral blood mononuclear cells (PBMC) multiome data from 10X genomics (see the “PBMC 10 K data” section). We use the joint NMF-based optimization model in scREG to reduce the dimension of data into 100 dimensions. To validate the results, we used the cell-type labels that were annotated by the 10X Genomics R&D team as surrogates for ground truths [34] and calculated silhouette index (SI) for each cell based on the reduced dimension matrix. A higher SI value indicates that the cell is more similar to cells sharing its label than those not sharing its label. First, we compare our method with dimension reduction methods based on a single dataset PCA on scRNA-seq, and LSA on scATAC-seq. As the dataset contains 14 cell types, we choose the top 14 dimensions of PCA and LSA. To make the results comparable, we perform PCA on our reduced dimension matrix, which is 100 dimensions, to reduce it to the same dimension as the other two methods. Figure 2A shows that scREG achieves higher SI than PCA RNA, and LSA ATAC on 83.39%, and 83.37% of the cells, respectively.

Fig. 2
figure 2

Benchmarking scREG with other dimension reduction Methods. A, B Scatter plots visualize the silhouette index (SI) of following methods versus scREG: PCA for RNA data, LSA for ATAC data, concatenation of top PCs of RNA and top factors of ATAC, scAI, MOFA+, and intNMF, which differ from scREG by lacking the cis-regulatory potential term. Each dot represents the SI of one cell. Each scatter plot compare scREG with another dimension reduction method. Y axis represent the SI from scREG and X-axes represent that from alternative methods. SI represents the similarity of each cell with the cells in same cell type compared to those from other cell types. The percentage of cells with higher SI in scREG is labeled top of the figures. Color of dots represent true label of cells as shown in Fig. 3B. C Comparison of SI values from 7 different methods in some cell types. cells achieve higher SI values when using scREG

The increment in SI is cell type-dependent. Figure 2C and Additional file 1: Fig. S1 show the distribution of SI of different methods on different cell types. For CD56 bright NK cells, naive B cells, naïve CD4 T cells, and plasmacytoid DC, the SIs are increased in scREG compared to all the other methods (Fig. 2B). The scREG achieved the best performance on 9 out of 14 cell types. Additional file 1: Fig. S1 shows the comparison of the average SI on all the cell types by different dimension reduction methods, except for the effector CD8 cell, on which all methods generally have poor performance. Our method scREG performs better on most of the cell types, with SI range from 0.2371 to 0.9758, and achieves the highest average SI score across cell types (0.5614).

One alternative way to perform dimension reduction is to perform dimension reduction separately on each type of data first and concatenate them together. We construct a new lower dimension representation by concatenating 7 dimensions of PCA on scRNA-seq and 7 dimensions of LSA on scATAC-seq and compare this with scREG. These newly constructed 14 features perform lightly better than the two methods that perform dimension reduction separately but are inferior to scREG on 83.14% of the cells (Fig. 2A).

To evaluate the importance of the cis-regulatory potential, we compare our method with three other methods that do not use cis-regulatory information. First, we build a baseline model of scREG by removing the cis-regulatory potential term (the 3rd term in eq. (1) of Methods) from the optimization model and then perform the same optimization and dimension reduction as before. The other two methods for comparison are scAI [16] and MOFA+ [33]. We use these three baseline methods to reduce the dimension to 100 and perform PCA to further reduce it to 14 dimensions. The result shows that including the cis-regulatory term improves performance: scREG increases SI on 72.08%, 81.18%, and 62.39% of cells compared to scAI, MOFA+, and scREG baseline, respectively (Fig. 2B). Figure 2B, C and Additional file 1: Fig. S1 show that the baseline methods without cis-regulatory potential are inferior to the scREG but perform better than the other methods that take a single dataset as inputs. This result suggests that we achieve more accurate low dimension representation by integrating both gene expression and chromatin accessibility data, and the accuracy would be further improved if we include the cis-regulatory potentials into the formulation of the NMF-based optimization model.

scREG identifies the cell populations with high accuracy

To evaluate the performance of the cell clustering aspect of our method, we compare the scREG with Cell Ranger ARC V1.0, Seurat V4.0, scAI, MOFA+, and the baseline scREG method. Cell Ranger ARC analyzes the gene expression and chromatin accessibility separately and outputs two clustering labels, while scREG scAI, MOFA+, and Seurat perform a joint analysis of the two types of data and output one clustering label. Figure 3A and Additional file 1: Fig. S2A show the UMAP and inferred clustering labels from each of the five methods. Figure 3B and Additional file 1: Fig. S2B show the corresponding UMAPs colored by the surrogate ground truth labels. From the clustering and Umap visualization, we see the scREG gives a consistent label with the surrogate ground truth while alternative methods have obvious misclassification. The naive CD4 T cells and the naive CD8 T cells are not separated in Cell Ranger RNA-seq clustering but separated in Cell Ranger ATAC-seq clustering and the joint clustering methods scREG and Seurat. The non-classical monocytes and the myeloid DC are not separated in both RNA-seq and ATAC-seq clustering but separated in the three joint clustering methods. Memory B cells and naïve B cells are separated clearly in ATAC-seq clustering, but the boundary is not clear in RNA-seq clustering. In Seurat, the boundary between memory B cells and naïve B cells is shifted so a large proportion of memory B cells are labeled as naïve B cells. In scAI clustering, there is a subpopulation that is a mixture of naive CD4 T cells, naive CD8 T cells, and memory CD4 T cells. Memory B cells and naïve B cells are not separated in MOFA+ clustering. These two cell types are separated clearly in scREG_baseline and scREG. To evaluate the clustering results systematically, we calculate the normalized mutual information (NMI) and adjusted Rand index (ARI) based on the surrogate ground truth. It is seen that scREG achieves the highest NMI and ARI compared to other methods. It is worth to notice that all the clustering methods are compared under their default resolution parameter setting, which is 0.8 for Seurat and 1 for all other methods. As clustering accuracy is highly affected by the resolution provided to Louvain clustering, we also compared the clustering performance of these five methods under different resolutions ranging from 0.2 to 2.0 (Additional file 1: Fig. S3). The scREG performs very robust under different resolution parameters and achieve the highest performance among all the method we compared.

Fig. 3
figure 3

Evaluation of the performance of cell clustering. A Scatter plot visualize the Umap embedding colored by clustering label from different methods including Cell Ranger on gene expression, Cell Ranger on chromatin accessibility, Seurat V4, intNMF, and scREG. B Same Umap as shown in A but colored by the surrogate ground truth. We see Cell Ranger RNA-seq did not distinguish naive CD4 T cells from the naive CD8 T cells, and CD56 (dim) NK cells from the effector CD8 T cells. ATAC-seq failed to separate non-classical monocytes and the myeloid DC, while scREG separates them clearly. In Seurat, the boundary between memory B cells and naïve B cells is shifted so a large proportion of memory B cells are labeled as naïve B cells. Clustering performance also assessed by calculating normalized mutual information (NMI) and adjusted Rand index (ARI) based on the surrogate ground truth. CE scREG clusterings on 10X multiome data from human cerebellum, mouse E18 brain, and lymph node from B cell lymphoma. The clustering results are consistent with the known cell types and marker genes’ expression. F The comparison of scREG with Seurat by four different clustering evaluation metrics on three datasets. The distance among cells are calculated as Euclidean distance on the top 20 principal components of gene expression and chromatin accessibility, respectively. X axis represents the metric calculated based on Seurat clustering label, and Y axis represent that from scREG clustering. Colors represent different data sets and shape represents different data type (triangle for scRNA-seq and diamond for scATAC-seq). A lower Davies-Bouldin index indicate better clustering, but the other three metrics are the higher the better. The scREG perform better for all case than the Seurat

We also use scREG and alternative methods on 10X multiome data from the human cerebellum, mouse E18 brain, and lymph node from B cell lymphoma (Fig. 3C–E). The clusters from scREG are consistent with the known marker genes’ expression (Additional file 1: Fig. S4-S6). As ground truth labels are not available for these data, we use an adjusted internal clustering evaluation to compare the clustering of scREG with Seurat. It is not fair to compare scREG clustering with Seurat clustering by calculating internal clustering evaluation metrics. The reason is that, when ground truth is not available, the internal clustering evaluation metrics calculated on different distance matrices are not comparable. This is an underappreciated statistical issue, so we generate two artificial examples to illustrate this (please see Additional file 2 for more detail). From example 1, we see the silhouette index is increased but the real clustering accuracy is decreased. This is because internal clustering evaluation metrics are designed for the comparison of different clustering methods (or different parameters) based on the same cell-cell distance matrix. Since internal clustering evaluation has to be performed on the same distance matrix, we have to choose a common embedding space to compare two clusterings. Here, we use the top 20 principal components (PCs) of scRNA-seq and scATAC-seq to calculate the cell-cell distance. For a given clustering, we can say it is a good clustering if cells with the same labels are close to each other in both modalities; we say it is a bad clustering if cells with the same labels are not close to each other in any modalities. Example 2 in the Additional file 2 has illustrated this. It is possible that method A is better than method B on one modality but worse in the other modality. In this case, we cannot evaluate these two methods. For each clustering method, we compute four different clustering evaluation metrics based on top 20 PCs of scRNA-seq and scATAC-seq on the three datasets (Additional file 1: Fig. S7-S9). Figure 3F shows the comparison of scREG with Seurat. It shows scREG performs better than Seurat for most cases.

We also tested our method on the bone marrow mononuclear cells data from the NeurIPS competition [38]. which include 22,463 cells with known cell labels. Additional file 1: Fig. S10 shows the clustering and 2D embedding results and comparison with Seurat. The clustering of scREG is more consistent with the ground truth label and achieves 0.7649 in NMI and 0.6549 in ARI, which are higher than Seurat (NMI = 0.7409, ARI = 0.5419). Overall, scREG identifies cell populations with high accuracy on different datasets.

scREG constructs subpopulation specific cis-regulatory networks

To assess the cis-regulatory network inference of our method, we evaluate the predictions in several cell populations where experimental cis-regulation is available. First, we download variant-gene links defined by the expression quantitative trait loci (eQTL) of CD14 positive monocyte cells [39] and use them to validate the RE-TG prediction of classical monocyte clusters. For each peak-gene pair in the cis-regulatory potential R matrix, we have a predicted cis-regulatory score in each cluster. Taking the eQTL data as ground truth, we plot the receiver operating characteristic (ROC) curve and precision-recall (PR) curve by sliding the cis-regulatory score (Fig. 4A, B). As a baseline method for comparison, we calculate the Pearson correlation coefficient (PCC) between the expression of a gene and the accessibility of peak within 1 Mb of the gene’s transcriptional start site. Our method achieves 0.81 area under the ROC (AUROC) curve and 0.46 area under the PR (AUPR) curve, while the AUROC and AUPR of the baseline method are 0.56 and 0.25, respectively. Our method identified 9067 REs in classical monocyte subpopulations, and 1998 of them overlapped with eQTL variants. For those REs that overlapped with eQTL variants, 53.25% of our predictions are connecting them to the same genes as eQTL (Fig. 4C). If we randomly select genes from a 1 Mb distance, this percentage would be 3.26% (16.33-fold), and it would be increased to 18.31% (2.91-fold) if we force the selected peak-gene to have the same distance distribution with our predictions.

Fig. 4
figure 4

Validate the RE-TG prediction. A, B Receiver operating characteristic (ROC) curve and precision-recall (PR) curve by taking the eQTL data of CD14 positive monocyte cells as ground truth to validate the prediction of scREG. Curves were plotted by sliding the predicted cis- regulatory score of 100,000 peak-gene pairs. As an alternative method, the Pearson correlation coefficient (PCC) between the expression of a gene and the accessibility of peaks within 1 Mb of the gene’s transcriptional start site was calculated. We can see scREG prediction achieves 0.81 area under the ROC (AUROC) curve and 0.46 area under the PR (AUPR) curve, which are much higher than that from PCC (0.56 and 0.25 respectively). C Comparison of the precision of scREG with random selections. Ratio of consistency (precision) was the percentage of RE-TG pairs validated by eQTL for those REs that linked to at least one gene in eQTL. Red line represents the ratio of consistency of scREG, the blue distribution represents that from 1000 random selection of peak-gene pairs in 1 Mb distance, and the orange distribution represents that from 1000 random selection of peak-gene pairs but restricting the distribution of distance between peak to genes is the same as the scREG predictions. D Validation of RE-TG prediction by HiC data on Naïve CD4 T cell. All metrics are calculated same as in C but replacing the eQTL by HiC

Next, we use promoter capture HiC data to validate our predictions. We downloaded the promoter capture HiC data of 7 primary blood cell types [40] and compare them with the cis-regulatory networks for the corresponding cell types that we inferred from the PBMC data. Figure 4D and Additional file 1: Fig. S11 shows the precision of our method is 2-fold higher than the randomly selected peak-genes pairs which have the same distance distribution as our predictions. Additional file 1: Fig. S12 shows our method achieves much higher AUROC and AUPR than PCC. The strong performance in the peak-gene link suggests that scREG is effective in inferring regulatory relations.

The cis-regulatory networks are highly cell-type-specific. Additional file 1: Fig. S13 shows the Jaccard similarity of clusters in terms of cis-regulation. The average Jaccard similarity between clusters is 0.4760. Hierarchical clustering analysis shows similar cell types have a similar cis-regulatory network. For example, the average similarity between four T cell clusters is 0.7783, and group to one cluster; the similarity of two B cell subpopulations is 0.79; the similarity of two NK cell subpopulations is 0.79; similarity of two monocytes subpopulation is 0.64.

scREG shed light on the interpretation of disease-associated loci

The cis-regulatory network inferred from single cell multiome data may provide new insights for the interpretation of disease-associated loci. To demonstrate this, we download 376 fine mapped variants with a posterior inclusion probability greater than 0.1 for inflammatory bowel disease (IBD) [41, 42]. These fine-mapped GWAS variants showed high enrichment (odds ratio in the range of 9.56 to 27.45) in REs from subpopulations of PBMC data (Fig. 5A, see the “Enrichment of GWAS variants” section) [43]. As a baseline for comparison, we use all peaks from each subpopulation to do the same analysis. As a result, the enrichment odd ratios from all peaks are 1.14-3.02 fold (on average 1.82-fold) lower than the peaks from our networks. These results show that the context specific regulatory network from scREG could improve the interpretation of the disease-associated loci.

Fig. 5
figure 5

scREG interprets disease associated loci. A Comparison of the enrichment of fine-mapped GWAS variants of IBD disease in the scREG predicted regulatory elements (red) against that in all peaks (blue) called by MACS2 in each subpopulations. B Interactive visualization function of scREG package. scREG will plot genes, the raw peaks from all cells, peaks of each cluster, and the predicted peak-gene association in each cluster. Figure shows the track around the variant rs11640143 and gene IRF8. The highlighted blue bar represents the location of REs that contain the variant rs11640143, and the yellow bar represents one example of different cluster-level-peaks are merged into one in the raw peak calling before clustering. C, D Comparison of the three TFs (IRF4, IRF8, and CEBPB) target genes with the upregulated genes and downregulated genes in IBD patients. The p-value and odd ratio are calculated based on Fisher’s exact test

The cis-regulatory networks produced by scREG connect 34 fine-mapping variants to 32 genes in 12 cellular contexts. These 32 genes include three transcription factors IRF4, IRF8, and CEBPB. Variants rs913678, rs4811031, and rs6063502 are linked to CEBPB in non-classical monocytes; variant rs6935510 in chr6 is linked to IRF4, and variant rs11640143 in chr16 is linked to IRF8 in plasmacytoid dendritic cells. Our scREG package has an interactive visualization function that takes the genomics region as input and plots the genes, peaks, and interactions in the given range. It includes the genes, the raw peaks from all cells (before clustering), peaks of each cluster from MACS2, and the predicted peak-gene association in each cluster. Figure 5B shows the track of around the variant rs11640143 and IRF8. The variant rs11640143 is in an accessible region in myeloid DC, memory B cell, naïve B cell, and plasmacytoid DC cells (highlighted blue in Fig. 5B). The regulatory element that contains this variant is predicted to regulate IRF8 only in the scREG generated a cis-regulatory network of plasmacytoid DC cells. It also shows that the cluster level peaks are narrower than the raw peaks and multiple cluster-level peaks are merged into one raw peak (highlighted yellow in Fig. 5B). The interactive visualization function will help users understand the regulatory relations in a cell type-specific manner.

To further investigate the roles of these transcription factors in IBD, we modify our previous PECA2 method to infer their trans-regulatory target genes (see the “Inference of trans-regulatory targets” section) in each subpopulation. We download the differential expression genes list from IBD patients versus healthy controls study [44] for further analysis. First, we find the TF IRF4 is upregulated in IBD patients compared to healthy controls (two samples two-tail t-test, p-value = 7.59E−31). Next, we compare differential expressed genes with the target genes of the three TFs (Fig. 5C, D). Target genes of IRF4 and IRF8 in plasmacytoid dendritic cells are 2.88 and 2.80 fold enriched (odds ratio) in the upregulated genes in IBD (Fisher’s exact test, p-value 7.97E−32, and 7.76E−31). Target genes’ of CEBPB in non-classical monocytes are 3.17-fold enriched in the IBD upregulated genes (Fisher’s exact test, p-value 1.51E−31). Interestingly, downregulated genes in IBD are depleted in the IRF4, IRF8, and CEBPB’s target genes (Odds ratios are 0.45, 0.44, and 0.55, p-values are 5.61E−06, 2.03E−06, and 0.0037). Overall, scREG is a useful tool for the interpretation of disease-associated loci.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Disclaimer:

This article is autogenerated using RSS feeds and has not been created or edited by OA JF.

Click here for Source link (https://www.biomedcentral.com/)