### Obtaining individuals and phenotypes from UK Biobank

We used data from UK Biobank downloaded on July 19, 2017. We excluded related individuals and the ones with high missing rate or other sequencing quality issues. As covariates, we extracted age at recruitment (Data-Field 21022), sex (Data-Field 31), and the first 20 genetic PCs. The ancestry information of individuals was obtained from Data-Field 21000, and we kept individuals labeled as “British,” “Indian,” “Chinese,” or “African” (according to Data-Coding 1001: http://biobank.ctsu.ox.ac.uk/crystal/coding.cgi?id=1001). Throughout the paper, we labeled ‘British’ individuals as EUR, “Indian,” “Bangladeshi,” and “Pakistani” individuals as S.ASN, “Chinese” individuals as E.ASN, and “African” and “Caribbean” individuals as AFR. The measurements of the 17 quantitative phenotypes (as shown in Additional file 1: Table S1) across all available instances and arrays were retrieved. The data retrieval described above was performed using ukbREST [21] with the query YAML file available at https://github.com/liangyy/ptrs-ukb/blob/master/output/query_phenotypes.yaml[22].

If one individual has multiple measurements for the same phenotype (in more than one instances and/or more than one arrays), we collapsed multiple arrays by taking the average and we aggregated measurements across multiple instances by taking the first non-missing value. Individuals with missing phenotype in any of the 17 quantitative phenotypes or covariates were excluded.

### Quality control on self-reported ancestry

To ensure the quality of ancestry label, we removed individuals who deviate substantially from the population that they were assigned to. Specifically, for population *k* among the 4 populations (EUR, S.ASN, E.ASN, and AFR), we treated the distribution of the individuals, in the space of the first 10 PCs, as multivariate normal. And we calculated the observed population mean (widehat {mu }_{k}) and covariance (widehat Sigma _{k}) accordingly. Then, for each individual *i* in population *k*, we evaluated the “similarity” *S*_{ik} to the population *k* as (S_{ik} = log Pr (text {PC}_{i}^{1}, cdots, text {PC}_{i}^{10}; widehat {mu }_{k}, widehat {Sigma }_{k})). Intuitively, if an individual has genetic background differing from is the assigned population, the corresponding *S*_{ik} will be much larger than others. So, we filtered out individuals with *S*_{ik}≤−50 in the assigned population *k*. This cutoff was picked such that (phantom {dot {i}!}S_{ik^{prime }}) for any un-assigned population *k*^{′} has (S_{ik^{prime }} le -50) for all individuals.

The number of individuals remained after data retrieval and ancestry quality control is listed Additional file 1: Table S2.

### Performing GWAS and building LD clumping and *p*-value thresholding based PRS models

We built PRS using the genotypes and phenotypes of the individuals in the discovery data set (the details of data splitting is described in the “Experimental setup” section). We performed GWAS (linear regression) using linear_regression_rows in hail v0.2 where we included covariates: first 20 genetic PCs, age, sex, age^{2},sex×age, and sex×age^{2}. In the GWAS run, we excluded variants with minor allele frequency < 0.001 and variants that significantly deviate from Hardy-Weinberg equilibrium (*p*-value < 10^{−10}). And the phenotype in their original scales were used.

To obtain relatively independent associations for PRS construction, we ran LD clumping using plink1.9 with option –clump –clump-p1 1 –clump-r2 0.1 –clump-kb 250. This command extracted genetic variants in the order of their GWAS significances and excluded all variants having *R*^{2}>0.1 to or 250 kb within any variants that have already been included. The PRS was constructed on the basis of the set of variants obtained from the LD clumping along with the marginal effect size estimated in GWAS run. Specifically, we calculated PRSs at a series of GWAS *p*-value thresholds: 5×10^{−8},10^{−7},10^{−6},10^{−5},10^{−4},10^{−3}, 0.01, 0.05, 0.1, 0.5, and 1. In other words, at threshold *t*, the PRS for individual *i* was calculated as

$$begin{array}{*{20}l} text{PRS}_{i}^{t} &= sum_{j: p_{j} le t} X_{ij} widehat{b}_{j}, end{array} $$

(1)

where *X*_{ij} is the effect allele dosage of variant *j* in individual *i* and (widehat {b}_{j}) is the estimated effect size of variant *j* from GWAS run.

At the testing stage, given the genotype of an individual, we calculated the PRS of the individual using Eq. 1.

### Computing the predicted transcriptome

We computed predicted gene expression for all individuals passing filtering steps and quality control. We utilized two sets of prediction models: (1) CTIMP models (proposed in [11]) trained on GTEx v8 EUR individuals [23] and (2) elastic net models which were trained on Europeans (EUR) or African Americans in combination with Hispanics (AFHI) [15]. The sample size and tissue information of the prediction models are listed in Additional file 1: Table S3.

### Estimating PVE by predicted transcriptome of a single tissue

To assess the potential predictive power of predicted transcriptome on the phenotypes of interest, we estimated the proportion of phenotypic variation that could be explained by the predicted transcriptome in aggregate. Specifically, we assume the following mixed effect model (for individual *i*).

$$begin{array}{*{20}l} Y_{i} &= mu + sum_{l} C_{il} a_{l} + sum_{g} widetilde{T}_{ig} beta_{g} + epsilon_{i} end{array} $$

(2)

$$begin{array}{*{20}l} epsilon_{i} &sim_{iid} N(0, sigma_{e}^{2}) end{array} $$

(3)

$$begin{array}{*{20}l} beta_{g} &sim_{iid} N(0, frac{sigma_{g}^{2}}{M}), end{array} $$

(4)

where *M* denotes the number of genes, *C*_{il} is the *l*th covariate, (widetilde {T}_{ig}) is the inverse normalized predicted expression for gene *g*, and *Y*_{i} is the observed phenotype. By inverse normalization, we converted the predicted expression (widehat {T}_{ig}) to (widetilde {T}_{ig}) by (widetilde {T}_{ig} = Phi ^{-1}(frac {text {rank}(widehat {T}_{ig})}{N + 1})) within each gene *g* where *N* is the number of individuals and ‘rank’ is in increasing order, so that we have (widetilde {T}_{ig} sim N(0, 1)). The parameters of the model were estimated using hail v0.2 stats.LinearMixedModel.from_kinship with *K* matrix being set as (widetilde {T} widetilde {T}^{t} / M). And PVE is calculated as (frac {hat sigma _{g}^{2}}{hat sigma _{e}^{2} + hat sigma _{g}^{2}}). The same set of covariates as the ones for PRS were used.

The PVE estimation was performed for each transcriptome model and population pairs. For non-European populations, all individuals were included in the analysis. We randomly selected 5,000 EUR individuals for the analysis.

### Estimating PVE by predicted transcriptome of multiple tissues

The genetic effects on the complex trait can be mediated through the regulation of expression in different tissues so that including predicted transcriptomes in multiple tissues could potentially improve the prediction performance. To quantify the gain, we performed the PVE analysis as described in the previous section using predicted expression in 10 GTEx tissues (listed in Additional file 1: Table S3) instead of just one. To avoid colinearity issues caused by the high correlation of predicted expression among tissues, we used eigenvectors of the predicted transcriptome matrix instead of the actual predicted expression. More specifically, for each gene *g*, we formed the matrix of predicted expression across the 10 tissues and performed singular value decomposition. We only retained eigenvectors with singular values that were at least 1/30 of the maximum singular value of the gene’s expression matrix across tissue. This approach is similar to the one used for combining PrediXcan association in multiple tissues [24].

### Estimating chip heritability

The chip heritability was estimated using the mixed effect model which is similar to the one for PTRS PVE estimation. But here we replace predicted transcriptome with genome-wide variant genotypes. The same cohorts and covariates were used as the ones in PTRS PVE estimation.

### Transcriptome prediction models for PTRS construction

The predicted transcriptome in the discovery set (UKB EUR) was calculated using models from GTEx [23] and MESA EUR based models [15] listed in Additional file 1: Table S3). The GTEx EUR whole blood transcriptome consisted of 7,041 genes. For the MESA transcriptomes, we restricted the prediction to the 4041 genes that were present in both the MESA EUR models and the MESA AFHI models (to ensure that PTRS built in the discovery set with the EUR models could be computed without missing genes in the target sets using the AFHI models).

### Building PTRS models using elastic net

For each of the 17 quantitative phenotypes, we trained elastic net model to predict the phenotype of interest using the predicted transcriptome (in a single tissue) as features. The same set of covariates as the ones for PRS were used. Let (widehat {T}_{g} in mathbb {R}^{N times 1}) denote the standardized predicted expression level of gene *g* across *N* individuals. Similarly, let (C_{l} in mathbb {R}^{N times 1}) denote the observed value of the *l*th standardized covariate. We fit the following elastic net model.

$$begin{array}{*{20}l} beta^{text{EN}} &= argmin_{beta} overbrace{frac{1}{N} | Y – X beta – beta_{0} |_{2}^{2}}^{text{loss:} l(beta)} + lambda alpha |beta|_{1} + lambda (1 – alpha) |beta|_{2}^{2} end{array} $$

(5)

$$begin{array}{*{20}l} X &:= [ widehat{T}_{1}, cdots, widehat{T}_{M}, C_{1}, cdots, C_{L} ] ~, end{array} $$

(6)

where *β*_{0} is the intercept, *M* is the number of genes, *L* is the number of covariates, (|beta |_{2}^{2}) is the *l*_{2} norm, and ∥*β*∥_{1} is the *l*_{1} norm of the effect size vector. Here, *α* controls the relative contribution of the *l*_{1} penalization term and *λ* controls the overall strength of regularization.

The model fitting procedure was implemented using tensorflow v2 with mini-batch proximal gradient method and the code is available at https://github.com/liangyy/ptrs-tf [25]. We trained models at *α*=0.1 (*α* = 0.5 and 0.9 show similar performance). And fixing the *α* value, as suggested in [26], we trained a series of models for a sequence of *λ*’s starting from the highest. The maximum *λ* value, *λ*_{max}, was determined as the smallest *λ* such that Eq. 7 is satisfied.

$$begin{array}{*{20}l} |~ nabla l(beta) ~| le alpha lambda, end{array} $$

(7)

where the gradient is evaluated at

$$begin{array}{*{20}l} beta_{0} = overline{Y}, beta_{text{covariate}} = mathbf{0}, beta_{text{transcriptome}} = mathbf{0} end{array} $$

(8)

So, at *λ*=*λ*_{max}, Eq. 8 is the solution to Eq. 5, which could serve as the initial points for the subsequent fittings of *λ*’s. We estimated *λ*_{max} using the first 1000 individuals of the data. And the sequence of *λ* was set to be 20 equally spaced points in log scale with the maximum value being 1.5*λ*_{max} and the minimum value being *λ*_{max}/10^{4}. Among these PTRS models generated at different *λ* values, we only kept the first 11 non-degenerate PTRS models so that we have the same number of candidate models for both PRS and PTRS.

### Building PTRS models using the LD clumping and *p*-value thresholding based approach

To implement clumping and thresholding based PTRS, we first obtained the gene/phenotype associations (PrediXcan association [9]) by associating the predicted gene expression to the phenotypes with the same set of covariates as PRS models. Next, we performed gene-based LD clumping to extract the roughly independent genes for PTRS models. The LD clumping procedure is described as follows. First, we prioritized genes according to their PrediXcan *p*-values. Second, we traversed the prioritized gene list from the most significant ones to the least ones, in which we included a gene into the returned gene list if the squared correlation (in terms of the predicted expression) of this gene and any other genes that have been included in the current returned gene list is smaller than 0.1.

Among these genes being selected in LD clumping, we built PTRS models by *p*-value thresholding. Specifically, at *p*-value thresholds 10^{−6},5×10^{−6},10^{−5},5×10^{−5},10^{−4},5×10^{−4},10^{−3},5×10^{−3}, 0.01, 0.05, 0.1, 0.5, 1, we built a PTRS model by keeping the genes with *p*-values smaller than the cutoff. For these genes being included in the final PTRS model, we used the effect sizes estimated in the PrediXcan association as the PTRS weights.

### Calculating PTRS in target sets

At the testing stage, given the standardized (within the population) predicted transcriptome of an individual, we calculated the PTRS of the individual using the following:

$$begin{array}{*{20}l} text{PTRS}_{i}^{lambda} &= sum_{g} widehat{T}_{ig} beta_{g}^{lambda}, end{array} $$

(9)

where *β*^{λ} denotes the *β*^{EN} obtained at hyperparamter equal to *λ*. For the PTRS built upon from GTEx EUR predicted transcriptome, the target PTRS was calculated with the GTEx EUR transcriptome (transcriptome predicted with GTEx EUR gene expression prediction models). To examine the utility of population-matched prediction model, the PTRS on the target set were calculated with of both MESA EUR and MESA AFHI transcriptomes.

### Quantifying the prediction accuracy of PRS and PTRS with partial *R*

^{2} ((widetilde {R}^{2}))

To measure the predictive performance of PRS and PTRS, we calculated the partial *R*^{2} of PRS/PTRS against the observed phenotype accounting for the set of covariates listed in Methods. Specifically, for individual *i*, let (hat {y}_{i}) denote the predicted phenotype which could be either PRS or PTRS and *y*_{i} denote the observed phenotype. Partial *R*^{2} (denoted as (widetilde {R}^{2}) below) is defined as the relative difference in sum of squared error (SSE) between two linear models: 1) *y*∼1+covariates (null model); and 2) (y sim 1 + text {covariates} + hat {y}) (full model), *i.e.*(widetilde {R}^{2} = 1 – frac {text {SSE}_{text {full}}}{text {SSE}_{text {null}}}). To enable fast computation, we calculated (widetilde {R}^{2}) using an equivalent formula shown in Eq. 10 which relies on the projection matrix of the null model.

$$begin{array}{*{20}l} widetilde{R}^{2} &= frac{mathcal{C}^{2}(y, hat{y})}{mathcal{C}(y, y)mathcal{C}(hat{y}, hat{y})} end{array} $$

(10)

$$begin{array}{*{20}l} mathcal{C}(u, v) &:= u^{t} v – u^{t} H v, end{array} $$

(11)

where *H* is the projection matrix of the null model, *i.e.*(H = widetilde {C} (widetilde {C}^{t} widetilde {C})^{-1} widetilde {C}^{t}) where (widetilde {C} = [1, C_{1}, cdots, C_{L}]) with *C*_{l} being the *l*th covariate.

As stated in the results section, PTRS weights were computed in the discovery set (UKB EUR) and tested in the 5 target sets. To determine the hyperparameter (*p*-value cutoff in the clumping and thresholding based approach or *λ* value in elastic net), for each target set, we further split the target set into two equal-size parts, a validation set and a test set. First, we calculated (widetilde {R}^{2}) for all hyperparameters in the validation set and we selected the hyperparameter that maximized the (widetilde {R}^{2}) in the validation set. And then, we calculated the (widetilde {R}^{2}) under the selected hyperparameter in the test set. This procedure was repeated 10 times and we reported the average (widetilde {R}^{2}) in the test set as the prediction accuracy.

### Combining PTRS and PRS

We combined PTRS and PRS as a weighted sum, i.e., combined score=*c*_{1}PRS+*c*_{2}PTRS with *c*_{1} and *c*_{2} calibrated in a validation subset of the target set. Given a sequence of PTRSs and PRSs at different hyperparameters (trained with the discovery set, UKB EUR), we used the following procedure to obtain the prediction accuracy in each of the ancestry groups. Similar to the prediction accuracy quantification of PTRS and PRS alone, we first split the target set into validation and test sets. We used the validation set to the following: (i) determine the coefficients (*c*_{1} and *c*_{2}) for each PTRS’ and PRS’ hyperparameter combination and (ii) select the best combined score among all PTRS and PRS combinations. And the test set was used to obtain an unbiased measure of the prediction accuracy of the selected score in the corresponding ancestry. To achieve the two goals above using the validation set, we further split the validation set into two equal-size parts. Using the first half of the validation set, we determined the coefficients (*c*_{1} and *c*_{2}) for each of the PTRS and PRS combinations by minimizing the square difference between predicted and observed. Next, using the other half of the validation set, we calculated the prediction accuracy of each PTRS and PRS combination with the coefficients obtained above. We note that this accuracy measure is unbiased since neither PTRS and PRS weights nor the coefficients for combining PTRS and PRS were obtained using this data. Based on these prediction accuracy measures, for each ancestry, we selected the best-performing PRS and PTRS combination along with the corresponding coefficients. We report the prediction performance calculated in the test set using the optimal combined score selected in the validation set.

### Quantifying the portability of PRS and PTRS

Portability was defined as the ratio of the prediction accuracy in each target set divided by the prediction accuracy in the European reference set. Therefore, by definition, portability in the EUR ref. set was 1.

When calculating the portability of PTRS using MESA AFHI transcriptome, we used the MESA EUR model (widetilde {R}^{2}_{text {EUR ref.}}) as the reference. This is a conservative choice since MESA EUR model is expected to perform better than MESA AFHI model among EUR individuals.

## Rights and permissions

**Open Access** This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

##### Disclaimer:

This article is autogenerated using RSS feeds and has not been created or edited by OA JF.

Click here for Source link (https://www.biomedcentral.com/)