# Construction and Clinical Translation of Causal Pan-Cancer Gene Score Across Cancer Types Shiyue Tao, et al.

Dec 30, 2021

## Introduction

Cancer is a major cause of mortality in both developed and developing countries, resulting in more than 8 million deaths each year worldwide (Tarver, 2012; Bray et al., 2013; Rodriguez-Martin et al., 2020). Since the causal factors and regulatory mechanisms are complex and remain largely unknown, there could be an increasing trend of morbidity and mortality attributed to cancer in the future (Nakagawa and Fujita, 2018). Based on multi-omics technologies, it has been demonstrated that cancers in different tissues and organs may share common features, whereas those in the same or similar organ may have distinct characteristics (Ciriello et al., 2013; Kandoth et al., 2013; Peng et al., 2021). These findings indicate that a pan-cancer strategy, which takes into account commonalities across cancer types, can be used to identify molecular abnormalities that transcend particular lineages, may explain oncogenesis, and make a large contribution towards the personal management of cancer (Vargas and Harris, 2016). In addition, pan-cancer analysis improves the statistical power used to identify cancer-related molecular dysregulation and avoids poor reproducibility in the characterization of rare subtypes (Priestley et al., 2019). Programs, such as the Cancer Genome Atlas (TCGA), which coordinate multi-omics sequencing and the clinical annotation of approximately 10,000 samples across over 30 cancer types, provide a great opportunity to identify pan-cancer biomarkers (Chang et al., 2013; Nawy, 2018).

Although genome-wide association studies (GWASs) have identified thousands of cancer-related loci (De Los Campos et al., 2018), there are still some unsolved issues. First, the majority of GWAS have identified variants located in non-coding regions and with small effect sizes, making it difficult to interpret functional and biological mechanisms that underlie the associations (Maurano et al., 2012; Visscher et al., 2017; Zeng et al., 2021). Second, complex linkage disequilibrium (LD) may obscure causal variants that drive the associations. Therefore, significant associations identified by GWASs are more about disease-related genomic regions than individual variants (Gallagher and Chen-Plotkin, 2018; Tam et al., 2019). Third, the most statistically significant variants may not be causal.

In general, genetic variants cause complex diseases by regulating gene expression, the abundance of one or multiple downstream proteins (Lappalainen et al., 2013; Westra et al., 2013; Albert and Kruglyak, 2015; Gusev et al., 2016). Gene pathway analysis and enrichment analysis have been widely applied to explore potential cancer-related mechanisms and have supplied plenty of valuable clues for the development of intervention targets (Shukla et al., 2016; Bao et al., 2019; Demircioğlu et al., 2019; Peng et al., 2019; Kim et al., 2020). Although previous studies have also leveraged data from TCGA to identify specific genes and signaling pathways involved in oncogenesis and development from a pan-cancer perspective (Ballot et al., 2020; Frost et al., 2020; Liu et al., 2020), these candidate genes or pathway-specific strategies are all based on prior knowledge, resulting in the loss of potential causal associations (Deng et al., 2014; Liu et al., 2014; Zhang et al., 2018). In addition, many of the associated genes identified may be the outcome rather than the cause of the disease (Gusev et al., 2016). Consequently, transcriptome-wide association studies (TWASs) were proposed to integrate GWASs with expression quantitative trait locus (eQTL) reference panels constructed from external genome-wide gene expression and genotype data to identify predicted gene–trait associations (Xu et al., 2017; Barbeira et al., 2018). Since a large sample size is used, the performance and statistical power of a TWAS is superior to that of traditional transcriptome analysis (Gusev et al., 2016; Xu et al., 2017; Barbeira et al., 2018). However, TWASs tend to identify multiple significant genes per region but fail to define the causal gene due to LD confounding (Mancuso et al., 2019; Wainberg et al., 2019; Wu and Pan, 2020). To overcome this drawback, fine-mapping methods were used to identify causal variants responsible for complex traits by accounting for the patterns of LD among the SNPs within a region associated with the target disease and assuming that at least one causal variant exists (Schaid et al., 2018).

In this study, we developed a novel strategy to define causal genes with the assumption of pan-cancer, to construct the CAusal Pan-cancER gene (CAPER) score, and to validate its performance using internal and external validation (Figure 1). Based on the workflow, we identified causal genes related to multiple cancers, used the same gene panel to differentiate cancer patients, and validated the efficiency of the gene panel in external validation sets (Li et al., 2021).

FIGURE 1. Workflow for CAPER score construction and its clinical translation.

## Materials and Methods

### Genome-Wide Associate Studies Data Process

TABLE 1. Summary of GWAS summary statistics in 15 types of cancer.

Furthermore, we used the GENCODE database (v25) to map the SNPs to gene positions. In total, we retained 19,201 protein-coding genes selected from 60,252 transcripts in autosomes. Then, the SNPs located 100 kb upstream and downstream of a specific gene were selected to perform fine-mapping. After mapping to the gene and intersecting using the reference panel, 1,403,668 SNPs remained. We used PLINK (v1.9b6.22) to estimate the LD matrix of each gene.

### Potential Causal Gene Set Identification

We used SuSiE, a fine-mapping method, to identify the causal SNPs in the 15 GWAS summary datasets with the aid of the susieR package (v0.11.42) (Wang et al., 2020a) in R software. We also used 1 KGP EUR samples as the LD reference panel. According to SuSiE manual, we set the maximum number of causal variants in the region to 10. Based on the results of previous studies (Fadista et al., 2016; Schaid et al., 2018), we set the significant level to 1E-5. A specific gene with causal SNPs was defined as a potential causal gene. We repeated the fine-mapping procedure for each type of cancer. Finally, the potential causal gene set consisted of genes regarded as casual genes of at least three types of cancers.

We also used the Molecular Signatures Database (MSD) to evaluate the overlapping of our candidate genes with regard to common processes, pathways, and underlying biological themes, while considering an FDR q-value of less than 0.05 and a minimum gene set size of two as statistically significant (Subramanian et al., 2005; Liberzon et al., 2011; Liberzon et al., 2015).

### The Cancer Genome Atlas Data Process

To verify the clinical translation of the potential causal gene set, we downloaded two types of TCGA data: (1) molecular data, which included gene expression (HTSeq-FPKM) (log2(FPKM+1)) and DNA methylation data; and (2) clinical data, which included age, sex, and survival time. These data were downloaded from the University of California Santa Cruz (UCSC) Xena browser (Goldman et al., 2018). We used data on all 11,057 samples and 60,483 transcripts available on 33 different types of cancer. We transformed the Ensembl IDs to symbols, using the biomaRt package (v.2.46.3) (Durinck et al., 2009). Specifically, we used the average level to represent the gene with multiple Ensembl ID mapping with a single symbol. We obtained the sum of the genes from the 33 datasets and calculated the average expression level for genes that contained more than one transcript. We also filtered out the samples that were (1) non-European ancestry and (2) missing tumor stage data. Genes that were not expressed (FPKM = 0) in more than 50% of samples were excluded from the expression data of our causal gene set. After quality control, we obtained 38,596 genes, including 12 CAPER genes, and 4,842 individuals with 21 different types of cancer.

### CAusal Pan-cancER Gene Score Estimation

Second, we used a causal gene set to fit the multivariable Cox regression adjusted for age, sex, and tumor stage. The model used was as follows:

where

$β^CAPER$

was a vector of the coefficients of CAPER genes obtained from the multivariable regression and

$XCAPER$

was a matrix of expression levels of CAPER genes. Using Eq. 1, we constructed the CAPER score for each individual in the test set and regarded the 12 potential causal genes identified as CAPER genes. Using the median CAPER score, we classified the individuals into two groups: (1) the high-CAPER group and (2) the low-CAPER group.

In addition, we used the TCGA test set to perform three types of sensitivity analyses. (1) To show the performance of the CAPER score for a shorter survival time, we selected all samples with survival <5 years. (2) To show the performance of the CAPER score in a smaller sample size, we randomly selected 80 samples. (3) To show the effectiveness and accuracy of the CAPER score, we randomly selected 12 genes, namely, DGFRL, GLRX5, KCNJ14, SMARCAL1, FTH1P16, CDK5, WDFY1, TMEM266, RAD21, NAA16, AGPS, and FBXO39, to construct the random CAPER score. We used the log-rank test for all three analyses.

### Clinical Translation

We performed a series of analyses to investigate the clinical translation of the CAPER score, using the TCGA test set, IMvigor210 cohort, and Genomics of Drug Sensitivity in Cancer (GDSC) (Yang et al., 2012; Tomczak et al., 2015; Mariathasan et al., 2018).

First, using the TCGA test set, we defined the association between CAPER score and TNM staging, tumor histological grade, and vascular tumor cell types. Specifically, the pathological stages of the primary tumor (T) were divided into two groups: (1) Tis (tumor in situ) and T1, and (2) T2 or larger. The staging of distant metastasis (M) is defined as M0 and M1. In addition, we detected the association between the CAPER score and immune subtypes, including C1 (wound healing), C2 (IFN-γ dominant), C3 (inflammatory), C4 (lymphocyte depleted), C5 (immunologically quiet), and C6 (TGF-β dominant), which were filtered because the sample size was below 40. These immune subtypes were proven to be associated with prognosis, genetic, and immune-modulatory alterations. These factors may shape the specific types of immune environments that we observed and indicate response to therapy or prognosis (Thorsson et al., 2018). We also showed the association between the high-CAPER group and the low-CAPER group, as well as the immune subtype of each gene.

Second, we used the TCGA test set to define the association between the CAPER score and the tumor microenvironment. On the one hand, we estimated the immune score, stromal score, and tumor purity using the limma (v3.46.0) (Ritchie et al., 2015) and the estimate (v1.0.13) packages (Yoshihara et al., 2013). ESTIMATE is a method that uses gene expression signatures to infer the fraction of stromal and immune cells in the tumor samples. We estimated the correlation between the CAPER score, the three metrics, and identified differences between the two scores and tumor purity of the high- and low-CAPER group. After filtering out data on 3 cancer types (CHOL, KICH, and UVM) with a sample size below 10 (low sample size cancers), we also estimated the Spearman correlation between single CAPER gene expression and three metrics in each cancer. On the other hand, we used the support vector regression (SVR) on CIBERSORTx (Steen et al., 2020) to deconvolve RNA admixtures to the abundance of 22 types of immune cells in TCGA samples to further observe the tumor microenvironment (TME) (Chen et al., 2018). Following the CIBERSORTx manual, we set the number of permutations to 100. We uploaded RNA-seq FPKM data and set quantile normalization to discern the recommended setting (Craven et al., 2021). We filtered out immune cell types with an average proportion lower than 2%, and 14 types of immune cells were included into the final analysis. We identified differences in the immune cell abundance between the high- and low-CAPER groups. After filtering out 3 types of cancers with a small sample size, we estimated the Spearman correlation between the CAPER score and the abundance of 14 immune cells in 18 types of cancers.

Third, we used the stemness score based on DNA methylation (DNAss) and RNA expression (RNAss) obtained from UCSC. The stemness score is defined as the quantification of stemness and is associated with tumor progression, therapeutic resistance, and recurrence. DNAss indicates epigenetic features while RNAss indicates gene expression (Malta et al., 2018; Pei et al., 2020). We estimated the Spearman correlation between single CAPER gene expression and the stemness score in 18 types of cancer. For the CAPER score, we estimated the Spearman correlation between the CAPER score and the stemness score and conducted a Wilcoxon rank-sum test to compare the statistical significance of the stemness score between high- and low-CAPER groups.

Fourth, we used the IMvigor210 cohort to verify the robustness and efficiency of the CAPER score. Data were downloaded using the Imvigor210CoreBiologies package (v1.0.0). The cohort data included immune phenotypes (immune inflamed, immune excluded, and immune desert), Lund molecular subtypes, IC-Level (level of immunohistochemistry-assessed PD-L1 staining on immune cells), and TC-Level (level of immunohistochemistry-assessed PD-L1 staining on tumor cells). We used the coefficient estimated using the training set in TCGA to construct the CAPER score for the Imvigor210 cohort and investigated the difference in variables in the high- and low-CAPER groups.

Finally, using GDSC, the largest free public database of information on drug sensitivity in cancer cells and molecular markers of drug response (Yang et al., 2012), we identified the association between the CAPER score and drug sensitivity. We constructed the CAPER score for each cell line sample and conducted a Spearman correlation analysis between the CAPER score and IC50 value, the half maximal inhibitory concentration, which is an established measurement of drug efficacy (Aykul and Martinez-Hackert, 2016).

## Results

### Identification of Potential Causal Genes

Using SuSiE (Wang et al., 2020a), we identified 54 causal SNPs in 15 genes (ATP6V1G2, ATP6V1G2-DDX39B, CLPTM1L, DDX39B, MCCD1, MICA, MICB, MYC, MYEOV, NFKBIL1, POU5F1B, SLC6A18, SLC6A19, SLC6A3, and TERT). Detailed information of the causal SNPs and potential causal genes is shown in Table 2 and Figure 2, and Supplementary Table S2. MDB showed that 13 were significantly enriched in breast tumor, hepatocellular carcinoma, substance transport-related, and certain other pathways, while 12 were located in the cytogenic region of 6p21 or 5p15 (FDR q-value < 0.05, Supplementary Table S3).

TABLE 2. Summary of the 15 potential causal genes.

FIGURE 2. Summary of the CAPER genes. (A) Ideogram of the 15 CAPER genes (the color of each chromosome indicates gene density across the human genome). (B) The bubble plot shows the minimum p-value of each causal gene in each cancer dataset.

### Construction of the Causal Pan-cancER Gene Score

The training data of TCGA was used to select 12 CAPER genes, investigate their pan-cancer association, and construct the CAPER score. Through univariate Cox regression, we defined 6 significant causal genes with an average p-value of 0.01 (median = 1.87E-06, ranging from 4.62E-09 to 0.04) by adjusting for age, sex, and tumor size. For example, from among the 12 CAPER genes, MYEOV was the gene with the greatest risk (HR = 1.09, 95% CI: 1.06–1.13), while ATP6V1G2 was the gene that offered the largest protection effect (HR = 0.68, 95% CI: 0.59–0.78). Details of the univariate analysis are presented in Table 3 and Figure 3A.

TABLE 3. Summary of the univariable Cox regression analysis conducted on 12 CAPER genes in the TCGA training set *.

FIGURE 3. Summary of the CAPER score. (A) Heatmap showing the p-value of each CAPER gene in the univariate Cox regression performed on each type of cancer. (B) ROC of the multivariate Cox regression conducted on the TCGA test set. We estimated the AUC of 1-year (green), 3-year (Red), and 5-year (Blue) survival. (C) K-M curve of the high- and low-CAPER score groups. The HR of the CAPER score was statistically significant to the survival time with a p-value of <0.001 and was adjusted for age, sex, and tumor stage.

Furthermore, we performed a multivariate Cox regression to construct the CAPER score using the TCGA training set (Table 4). We estimated the time-dependent ROC curves for multivariate Cox regression for 1-, 3-, and 5-year survival (AUC = 0.704, 0.717, and 0.710, respectively) using TCGA test data (Figure 3B). The time-dependent ROC curves in the TCGA training set are provided in Supplementary Figure S2. The formula used to construct the CAPER score is as follows:

TABLE 4. Summary of the multivariable Cox regression conducted on 12 CAPER genes in the TCGA training set *.

The HR for the CAPER score is 3.75 (95% CI: 2.65–5.32). In addition, Kaplan–Meier curves were used to detect differences in survival between the high- and low-CAPER groups, which was found to be significant (log-rank test, p < 0.0001), with the median survival time of the high-CAPER group being shorter (Figure 3C).

In addition, sensitivity analyses indicated the effective construction procedure and accurate prognosis performance. When we used 5 years as the cutoff, the difference in survival between high- and low-CAPER groups was found to be significant (log-rank test; p < 0.0001) (Supplementary Figure S3A). When a small sample size was used (i.e., 80 individuals), the difference in survival between the high- and low-CAPER groups was also significant (log-rank test; p = 0.011) (Supplementary Figure S3B). A random gene set was used, and although three genes showed significant differences, the difference between high- and low-CAPER groups was not significant (log-rank test, p = 0.54) (Supplementary Figure S4 and Supplementary Table S4).

### Internal Validation of the CAusal Pan-cancER Gene Score Using The Cancer Genome Atlas Test Set

TCGA test data were used to explore the clinical translation of the CAPER score and its potential application for therapeutic and prognostic purposes in cancer management. We used the CAPER score to classify patients into high- and low-CAPER groups and tested differences in clinical metrics, TME, and the stemness score between the two groups.

First, we analyzed the association between CAPER score and immune subtypes and TNM stage (Figure 4 and Table 5). The Kruskal–Wallis test found the difference in each immune subtype to be significant (

$χ2=164.21$

, p < 2.2E-16) with an average CAPER score of 0.73, 0.82, 0.47, and 0.52, respectively. As expected, the frequencies of each immune subtype were different between the high- and low-CAPER groups (

$χ2=130.05$

, p < 2.2E-16). Then, the average CAPER score in the different primary tumors (T stage) in the high- and low-CAPER groups were 0.55 and 0.7. Again, the Kruskal–Wallis test showed that the difference in each subtype was significant (p = 6.22E-09). As expected, the frequencies of each primary subtype were different between the high- and low-CAPER groups (

$χ2=20.35$

, p = 6.44E-06). The average CAPER score in the different primary tumors (N stage) was 0.67, 0.67, 0.84, and 0.73, respectively. The Kruskal–Wallis test showed that the difference in each subtype was significant (

$χ2=22.94$

, p = 4.15E-05). As expected, a significant difference in the frequencies of each subtype was found between the groups (

$χ2=17.68$

, p = 5.12E-04). The average CAPER score for distant metastasis (M stage) was 0.66 and 0.70 for M0 and M1, respectively. The Wilcoxon rank-sum test showed that the difference in each subtype was non-significant (p = 0.40). The association between M stage and CAPER groups also shows no statistical significance (

$χ2=0.40$

, p = 0.53). Second, we analyzed the association between the CAPER score and TME (Figure 5 and Supplementary Figure S5). We estimated the immune score, stromal score, and tumor purity. The immune score was positively related to the CAPER score (Spearman correlation test, ρ = 0.15, p = 3.02E-06), while the low-CAPER group had a lower immune score than the high-CAPER group (Wilcoxon rank-sum test, p = 3.02E-04). Consistent with the results of the survival analysis, patients with a low CAPER score may have fewer immune cells. The association between stromal score and CAPER score between the high- and low-CAPER groups showed no statistical significance (Spearman correlation test, ρ = 0.0059, p = 0.8543, Wilcoxon rank-sum test, p = 0.8506). The tumor purity was negatively related to the CAPER score (Spearman correlation test, ρ = −0.08, p = 0.0107). The low-CAPER group showed higher tumor purity than the high-CAPER group (Wilcoxon rank-sum test, p = 0.0492). On the other hand, we estimated the cellular abundance of 22 types of immune cells obtained from CIBERSORTx and filtered immune cell types with an average proportion that was lower than 2%. We selected 14 types of immune cells and explored their associations with the CAPER score (Supplementary Figure S6). Consistent with the results of the survival analysis and ESTIMATE, we found that infiltration of naïve B cells (Wilcoxon rank-sum test, p = 0.027), macrophages M0 (p = 0.001), plasma cells (p = 1.93E-05), and CD4 memory activated T cells (p = 3.49E-22) was higher in the high-CAPER group, while the infiltration of M2 macrophages (p = 1.24E-08), resting mast cells (p = 5.27E-11), monocytes (p = 5.92E-10), activated NK cells (p = 1.72E-10), and CD4 memory resting T cells (p = 0.035) was higher in the low-CAPER group.

FIGURE 4. Summary of the differences between the three tumor-related variables in the different CAPER groups using the TCGA test set. (A) immune subtypes; (B) the pathological stage of the primary tumor (T); (C) the pathological stage of the lymph nodes (N).

TABLE 5. Summary of the association between clinical categorical variables and the CAPER score in TCGA.

FIGURE 5. Summary of the correlation between TME and CAPER score or CAPER gene expression using the TCGA test set. (A) Differences in the immune score between the CAPER score groups. (B) Differences in the stromal score between the CAPER score groups. (C) Differences in tumor purity between the CAPER score groups. (D) The bubble plot shows the p-value of Spearman correlation tests conducted on 18 types of cancer. (E) The boxplot cellular abundance differential score between the high- and low-CAPER score groups (*p < 0.05, **p < 0.01, and ***p < 0.001). (F) The autocorrelation plot of cellular abundance.

Finally, we analyzed the correlation between the CAPER score and the stemness score constructed using gene expression and DNA methylation data (Supplementary Figure S7). The RNA stemness score (Spearman correlation test, ρ = 0.29, p < 2.2E-16) and the DNA stemness score were both positively correlated to the CAPER score (Spearman correlation test, ρ = 0.32, p < 2.2E-16). The high-CAPER group had a higher DNA (Wilcoxon rank-sum test, p = 6.9E-15) and RNA (Wilcoxon rank-sum test, p = 1.5E-13) stemness score, indicating stronger tumor stem cell activity and a lower degree of tumor differentiation.

### External Validation of the CAusal Pan-cancER Gene Score Using IMvigor210

The IMvigor210 cohort was analyzed to verify the robustness and efficiency of the CAPER score; we observed that the CAPER score was significantly associated with the three types of clinical metrics in IMvigor210 (Table 6 and Figure 6). For the clinical phenotypes, differences in the CAPER score between patients with different TC levels were statistically significant (Kruskal–Wallis test, p = 5.2E-05). The CAPER score of TC2+ patients was higher than that of TC0 patients (Wilcoxon rank-sum test, p = 1.1E-05). The differences in CAPER score between patients with different IC levels were statistically significant (Kruskal–Wallis test, p = 0.015). The CAPER score of IC2+ patients was higher than that in IC0 patients (Wilcoxon rank-sum test, p = 3.9E-03). In addition, we analyzed CAPER score differences between immune phenotypes using IMvigor210. The differences in CAPER score between patients with different immune phenotypes were statistically significant (Kruskal–Wallis test, p = 0.003). We observed that the “immune inflamed” type had a higher CAPER score than the “desert” type (Wilcoxon rank-sum test, p = 1.2E-03). The difference in Lund molecular subtypes was significant in different CAPER groups (Kruskal–Wallis test, p = 3.2E-14). The CAPER score of the “Basal/SCC-like” type was higher than that of the “Infiltrated” (Wilcoxon rank-sum test, p = 1.5E-11), and “UroA” (Wilcoxon rank-sum test, p = 7.2E-13) types. As expected, differences between the high- and low-CAPER groups were also significant for the three tumor immunity-related variables indicated above (TC levels, p = 2.53E-03, immune subtypes, p = 0.01, and Lund molecular subtypes, p = 7.28E-09, respectively).

TABLE 6. Summary of the association between four tumor immunity-related variables and CAPER score in the IMvigor210 cohort.

FIGURE 6. Summary of the differences between the four tumor phenotypes in different CAPER groups using IMvigor210. (A) TC levels; (B) IC levels; (C) immune phenotypes; (D) Lund molecular subtypes.

### Potential Therapeutic Value of the CAusal Pan-cancER Gene Score

We analyzed the correlation between the CAPER score and drug sensitivity in the GDSC using Spearman correlation and identified 81 significantly correlated drugs between the CAPER score and drug sensitivity (Supplementary Table S5). For example, the CAPER score was negatively correlated with the sensitivity of Afatinib (ρ = −0.28, p = 6.71E-17), Trametinib (ρ = −0.23, p = 2.12E-12), Selumetinib (ρ = −0.22, p = 1.36E-11), and Refametinib (ρ = −0.21, p = 2.22E-10), which indicated a lower CAPER score against their higher level of drug sensitivity. The CAPER score was positively correlated to the sensitivity of Axitinib (ρ = 0.14, p = 2.56E-05), SB52334 (ρ = 0.13, p = 5.61E-05), GSK269962A (ρ = 0.12, p = 4.2E-04), and Quizartinib (ρ = 0.10, p = 3.0E-03). Furthermore, we found that drugs with sensitivity that was negatively correlated with the CAPER score mostly targeted EGFR signaling and the ERK MAPK signaling pathway, while drugs with sensitivity that was positively correlated with the CAPER score mostly targeted RTK signaling and cytoskeleton pathways.

## Discussion

In our study, using GWAS summary statistics of 15 cancers and the SuSiE package, we identified causal SNPs and 12 CAPER genes. SuSiE is a newly developed approach for genetic fine-mapping that quantifies the uncertainty of causal variables (Wang et al., 2020a). Compared with existing fine-mapping methods, such as CAVIAR (Hormozdiari et al., 2014), FINEMAP (Benner et al., 2016), and DAP-G (Wen et al., 2016; Lee et al., 2018), SuSiE has been demonstrated to have a fair degree of accuracy for computing PIPs and a much higher computation speed and higher power in distinguishing between causal variables and non-causal variables (Wang et al., 2020a). Then, we used the expression level of the CAPER genes in the TCGA training set to construct the CAPER score. We performed internal and external validation of the CAPER score using three validation sets, including the TCGA test set, IMvigor210, and GDSC. The results demonstrated the potential application of the CAPER score for therapeutic and prognostic purposes in cancer management.

Among the 54 causal SNPs obtained through fine-mapping, 29 were shown to have a significant eQTL effect on 8 potential causal genes across different tissues. A total of 324 eQTL pairs were extracted from GTEx and are listed in Supplementary Table S6. As expected, the association between all 12 CAPER genes and cancers have been demonstrated using a series of basic experimental and population-based studies. For example, ATP6V1G2 is a ferroptosis-related gene that plays a primary role in metabolism and oxidative stress and is defined as a colorectal cancer prognosis biomarker (Huang et al., 2021; Shao et al., 2021). DDX39B is involved in the regulation of pre-mRNA splicing, nuclear export of mRNAs, and pre-ribosomal RNA translation, and may promote the genesis, development, and metastasis of multiple cancer types by regulating cell proliferation (Awasthi et al., 2018; Gu et al., 2020; Xu et al., 2020; He et al., 2021). MYC and MYEOV are two well-known oncogenes (Specht et al., 2004; Paglia et al., 2020). MYC can contribute to oncogenesis and immune evasion through various mechanisms, including the promotion of autonomous cell growth and proliferation, modulation of tumor–stroma interactions, and regulation of the host immune system (Dang, 2012; Paglia et al., 2020; Dhanasekaran et al., 2021; Lourenco et al., 2021), while the mechanism of action that underlies the function of MYEOV in cancer development and metastasis may enhance SOX9 transcriptional activity (Lawlor et al., 2010; Fang et al., 2019; Liang et al., 2020). Thus, both MYC and MYEOV have also been identified as potential immunotherapy targets (Fang et al., 2019; Duffy et al., 2021). POU5F1B can promote cancer oncogenesis by cooperating with MYC and is associated with poor prognosis in pancreatic ductal adenocarcinoma patients (Hayashi et al., 2015; Amantini et al., 2019). It may also promote HCC proliferation by activating AKT (Pan et al., 2018). SLC6A19 has also been reported to be a potential biomarker that has a significantly low level of expression in patients with renal cancer (Zamora-Fuentes et al., 2020). In addition, a series of genetic variants in MICA, MICB, NFKBIL1, SLC6A3, CLPTM1L, and TERT have been found to be associated with the susceptibility and prognosis of different cancer types (Rafnar et al., 2009; Wang et al., 2009; Turnbull et al., 2010; Miki et al., 2011; Baek et al., 2018; Toledo-Stuardo et al., 2021). Meanwhile, using summary statistics of GETx data, we performed a colocalization analysis using the coloc package (v.5.1.0) (Giambartolomei et al., 2014). Using default settings, we found that the posterior probabilities of hypothesis 4 (PPH4: both the expression of CLPTM1L and MM are associated and share a single causal variant) are 0.986 and 0.985 for skin exposed to the Sun (lower leg) and skin not exposed to the Sun (suprapubic), respectively, while rs31490 had the largest PPH4 values of 0.541 and 0.624, respectively.

However, some well-known cancer-related genes, such as TP53, are not among the CAPER genes. For example, all108 SNPs of TP53 were not significant in the 15 cancer summary statistics of the 15 types of cancer. There may be two probable reasons for this. First, the dysfunction of familiar genes may not have been caused by genetic variation. For example, the methylation of TP53 has been regarded as a causal factor of leukemia (Saeed et al., 2019). Second, the limitation of sample size in the GWASs and its weak signal may result in SNPs from which common genes have been filtered out.

Furthermore, to verify the robustness and efficiency of the CAPER score, we applied it to TCGA test samples and the IMvigor210 cohort. As expected, patients in the high-risk group tended to have shorter survival and a worse TN stage, which together indicate a poor prognosis. The results also showed that samples in the high-risk group had a higher degree of immune infiltration and a lower differentiation ability, with a higher immune score and stemness score. The stemness score usually indicates the differentiation potential, and a loss of a differentiation ability and gain of stem-cell-like were reported to be the main signs of tumor progression (Seguin et al., 2015; Prasetyanti and Medema, 2017; Zhang et al., 2020); while immune infiltration was also reported to be correlated with the malignancy and prognosis of different types of cancer (Wang et al., 2020b; Wu et al., 2020; Zhang et al., 2020). In addition to the degree of infiltration, the composition of the infiltrating immune cell types in the high-risk and low-risk groups were also different, indicating a more complex difference in the tumor microenvironment between the two groups. Taken together, validation using the TCGA test samples and the IMvigor210 cohort indicated consistency of the CAPER score with clinical prognosis, proving its value in clinical translation.

Interestingly, some associations between genes and certain cancer types were not taken into account during the selection procedure but were detected during the validation procedure. For example, MICB is defined as the casual gene of CESC, PRAD, and UCEC (Table 2), but showed a high correlation with the tumor purity in four cancer types (KIRC, KIPP, LUAD, and MESO), which were not included in the selection step (Figures 5A,E). In addition, while THCA was not included in the selection step, its stemness showed a high level of correlation with the expression of MICA, MYEOV, POU5F1B, and TRET (Supplementary Figures S7A,C). This indicates the potential value of a pan-cancer analysis to identify novel associations.

In addition, we also found a link between the CAPER score and drug sensitivity using the GDSC database, which indicated the potentially extensible application of the CAPER score for the therapeutic and prognostic management of cancer. In particular, 81 drugs were found to be significantly correlated with the CAPER score, among which 67 showed a negative correlation with CAPER score, while only 16 showed higher sensitivity in the group with high CAPER score, indicating a limited selection of drugs available for the high-risk group.

In summary, we developed a CAPER score using a novel strategy based on fine-mapping. An extensive validation procedure was followed to confirm the robustness and efficiency of the CAPER score. Considering its potential usage in prognosis prediction and the identification of novel associations, we expect that this score may provide valuable information that can be used to better understand oncogenesis to guide management from a pan-cancer perspective.

## Data Availability Statement

The original contributions presented in the study are included in the article/Supplementary Material. Further inquiries can be directed to the corresponding author.

## Author Contributions

SY designed the study. ST and ZP performed the datasets’ quality control. ST and XY performed the data analysis. ST, XY, LP, and MF interpreted the analysis results. XY, ST, and LP wrote the draft manuscript. SY and PH revised the article. All authors accepted the final manuscript.

## Conflict of Interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

## Publisher’s Note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors, and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

## Acknowledgments

We acknowledge the participants and investigators of GWAS-ALTAS and GWAS-Catalog for making the summary data publicly available for us.