Our software consists of two modules: (1) estimation of experimental parameters from control data (“experimental parameters”) and (2) inference of methylation level (“biological parameters”) and differential methylation from DNA bisulfite sequencing data using the previously estimated experimental parameters. While LuxGLM was originally designed for analysis of both methylated (5mC) and hydroxymethylated cytosines (5hmC), the level for only one methylation modification (methylcytosine, 5mC) is included in this work (although our model can also be extended to 5hmC).

Fig. 1
figure1

Plate diagram of LuxRep model for the module analyzing experimental parameters from control data

Fig. 2
figure2

Plate diagram of the LuxRep model for estimating methylation level of a single cytosine with biological as well as technical replicates. The circles represent latent (white) and observed (gray) variables and the unbordered nodes represent constants

Fig. 3
figure3

Plate diagram of LuxRep model for generating dummy control data

To facilitate genome wide analysis, in our model implementation the experimental parameters are first computed from the control data since all cytosines per technical replicate have the same value for these parameters (Fig. 1). Methylation levels are then determined individually for each cytosine, and differential methylation thereafter, using the pre-computed experimental parameters as fixed input (Fig. 2). We will next describe these two models in detail in Sects. 2.1–2.2.

Experimental parameters

Methylation estimates are a function of experimental parameters: bisulfite conversion rate ((text{BS}_{rm eff})), sequencing error ((text{seq}_{rm err})) and incorrect bisulfite conversion rate ((text{BS}^*_{rm eff})). A BS-seq library with low (text{BS}_{rm eff}) results in overestimation of methylation levels. High (text{seq}_{rm err}), on the other hand, can lead to both over and underestimation of methylation levels. Though typically not measured in high-throughput bisulfite sequencing experiments, high (text{BS}^*_{rm eff}) leads to underestimation of methylation level.

To demonstrate that differences in technical parameters (specifically bisulfite conversion rate) is common we took a real bisulfite sequencing dataset [12] and compared the bisulfite conversion efficiencies of the technical replicates (i.e. libraries) per biological replicate (Additional file 1: Fig. S1). Most samples had significantly variable conversion rates, i.e. differences in technical parameters is common. Moreover, in practice, BS-seq datasets obtained with non-optimal conversion efficiencies are commonly ignored as currently there does not exist a statistical analysis tool that would allow analyzing BS-seq datasets with different conversion efficiencies. This in turn leads to loss of data, decrease in statistical power, loss of a biological sample, and increase in sequencing costs.

We start by briefly reviewing the underlying statistical model [11] and then introduce our extension that can handle technical replicates. Briefly, the conditional probability of a sequencing readout being “C” in BS-seq data is a function of the experimental parameters that include (text{seq}_{rm err}) and (text{BS}_{rm eff}), and depends on the methylation level (theta in [0,1]). If a read was generated from an unmethylated cytosine (C), the conditional probability (p_{rm{BS}}(text{“C”|C})) is given by

$$p_{rm{BS}}(text{“C”|C})=(1-text{BS}_{rm eff})(1-text{seq}_{rm err}) + text{BS}_{rm eff} text{seq}_{rm err}.$$

(1)

The term ((1-text{BS}_{rm eff})(1-text{seq}_{rm err})) refers to the condition wherein unmethylated cytosine is incorrectly not converted into uracil and correctly sequenced as “C” whereas the term (text{BS}_{rm eff} text{seq}_{rm err}) represents the condition wherein the unmethylated cytosine is correctly converted into uracil but incorrectly sequenced as “C”. Similarly, in the case of methylated cytosine

$$p_{rm BS}(text{“C”|5mC}) = (1-text{BS}^*_{rm eff})(1-text{seq}_{rm err}) + text{BS}^*_{rm eff} text{seq}_{rm err},$$

(2)

where ((1-text{BS}^*_{rm eff})(1-text{seq}_{rm err})) denotes the case that methylated cytosine is correctly not converted to uracil and correctly sequenced as “C” while the term (text{BS}^*_{rm eff} text{seq}_{rm err}) represents the case that methylated cytosine is incorrectly converted to uracil and incorrectly sequenced as “C”.

In [11], bisulfite conversion, sequencing error and incorrect bisulfite conversion rates were specific to each biological replicate, not technical replicate.

The experimental parameters follow a logistic normal distribution, where the bisulfite conversion rate (text{BS}_{rm eff}) is given by

$$text{BS}_{rm eff}=text{logit}^{-1}(mu _{rm{BS}_{rm eff}} + sigma _{rm{BS}_{rm eff}}r_{rm{BS}_{rm eff}})$$

(3)

and its hyperparameters are

$$mu_{rm{BS}_{rm eff}} sim {mathcal {N}}(psi ^{mu ,mu }_{rm{BS}_{rm eff}},psi ^{mu ,sigma }_{rm{BS}_{rm eff}})$$

(4)

$$text{ln}(sigma _{{{rm BS}_{rm eff}}}) sim {mathcal {N}}(psi ^{sigma ,mu }_{rm{BS}_{rm eff}},psi ^{sigma ,sigma }_{rm{BS}_{rm eff}})$$

(5)

$$r_{rm{BS}_{rm eff}} sim {mathcal {N}}(0,1),$$

(6)

such that (text{logit}(text{BS}_{rm eff}) sim {mathcal {N}}(mu _{rm{BS}_{rm eff}},sigma _{rm{BS}_{rm eff}})), where (mu _{rm{BS}_{rm eff}}) is the mean and (sigma _{rm{BS}_{rm eff}}) is the standard deviation ((psi ^{mu ,mu }_{rm{BS}_{rm eff}}=4), (psi ^{mu ,sigma }_{rm{BS}_{rm eff}}=1.29), (psi ^{sigma ,mu }_{rm{BS}_{rm eff}}=0.4) and (psi ^{sigma ,sigma }_{rm{BS}_{rm eff}}=0.5)). See [13] for details.

The sequencing error (text{seq}_{rm err}) is modeled similarly

$$text{seq}_{rm err} = text{logit}^{-1}(mu _{rm{seq}_{rm err}} + sigma _{rm{seq}_{rm err}}r_{rm{seq}_{rm err}})$$

(7)

$$mu _{rm{seq}_{rm err}} sim {mathcal {N}}(psi ^{mu ,mu }_{rm{seq}_{rm err}},psi ^{mu ,sigma }_{rm{seq}_{rm err}})$$

(8)

$$text{ln}(sigma _{rm{seq}_{rm err}}) sim {mathcal {N}}(psi ^{sigma ,mu }_{rm{seq}_{rm err}},psi ^{sigma ,sigma }_{rm{seq}_{rm err}})$$

(9)

$$r_{rm{seq}_{rm err}} sim {mathcal {N}}(0,1),$$

(10)

such that (text{logit}(text{seq}_{rm err}) sim {mathcal {N}}(mu _{rm{seq}_{rm err}},sigma _{rm{seq}_{rm err}})), where (mu _{rm{seq}_{rm err}}) is the mean and (sigma _{rm{seq}_{rm err}}) is the standard deviation ((psi ^{mu ,mu }_{rm{seq}_{rm err}}=-8), (psi ^{mu ,sigma }_{rm{seq}_{rm err}}=1.29), (psi ^{sigma ,mu }_{rm{seq}_{rm err}}=0.4) and (psi ^{sigma ,sigma }_{rm{seq}_{rm err}}=0.5)).

The hyperparameter values above were used since they worked well in a previously published related work [11] although we chose a lower (psi ^{mu ,mu }_{rm{seq}_{rm err}}) since it generated more robust methylation estimates with mid-values of theta (i.e. 0.3 and 0.7). Other than that, to confirm that the results were not sensitive to hyperparameter values we tested different values ranging from low ((psi ^{mu ,mu }_{rm{BS}_{rm eff}}=1), (psi ^{mu ,sigma }_{rm{BS}_{rm eff}}=1), (psi ^{sigma ,mu }_{rm{BS}_{rm eff}}=0.1), (psi ^{sigma ,sigma }_{rm{BS}_{rm eff}}=0.1), (psi ^{mu ,mu }_{rm{seq}_{rm err}}=-10), (psi ^{mu ,sigma }_{rm{seq}_{rm err}}=1), (psi ^{sigma ,mu }_{rm{seq}_{rm err}}=0.1) and (psi ^{sigma ,sigma }_{rm{seq}_{rm err}}=0.1)) to high ((psi ^{mu ,mu }_{rm{BS}_{rm eff}}=10), (psi ^{mu ,sigma }_{rm{BS}_{rm eff}}=10), (psi ^{sigma ,mu }_{rm{BS}_{rm eff}}=1), (psi ^{sigma ,sigma }_{rm{BS}_{rm eff}}=1), (psi ^{mu ,mu }_{rm{seq}_{rm err}}=-1), (psi ^{mu ,sigma }_{rm{seq}_{rm err}}=10), (psi ^{sigma ,mu }_{rm{seq}_{rm err}}=1) and (psi ^{sigma ,sigma }_{rm{seq}_{rm err}}=1)) hyperparameter values, relative to the values used in this paper, and indeed the methylation estimates were robust regardless of hyperparameter values (Additional file 1: Fig. S2).

The BS-seq experiments typically include completely unmethylated DNA fragments as controls (such as the lambda phage genome) that allow estimation of (text{BS}_{rm eff}) and (text{seq}_{rm err}). However, as BS-seq experiments typically do not include completely methylated DNA fragments as controls that would be needed to estimate the incorrect bisulfite conversion rate (text{BS}^*_{rm eff}), it is set to a constant value (e.g. (text{BS}^*_{rm eff}=0), see Sections “Estimating experimental parameters” and “Estimating methylation levels” for specific values used in results). Note also that the bisulfite conversion rate and sequencing error parameters are specific for each biological samples and technical replicate.

In Fig. 1, (theta ^{text {control}}) represents the proportions of DNA methylation modifications in the control cytosine. In this case the proportion consists of unmethylated DNA, but this can be adjusted if additional DNA methylation modifications are included.

Following Eqs. 1 and 2, the observed total number of “C” readouts for a single control cytosine is binomially distributed,

$$N_{rm{BS,C}}^{text {control}} sim text {Bin}(N_text{BS}^{text {control}},p_{rm{BS}}text{(“C”)}^{text {control}}),$$

(11)

where (N_{rm{BS}}^{text {control}}) is the total number of reads and the probability of observing “C” is given by

$$begin{aligned} p_{rm{BS}}(text{“C”})^{text {control}}&= p_text{BS}(text{“C”} | text{5mC})theta ^{text {control}} + p_text{BS}(text{“C”} | text{C})(1-theta ^{text {control}}). end{aligned}$$

(12)

Using the sequencing read counts from the control cytosines (N_{rm{BS,C}}^{text {control}}) and (N_{rm{BS}}^{text {control}}), posterior distributions of unknowns in this model are obtained using the inference methods described in section “Variational inference“. Posterior means of (text{BS}_{rm eff}) and (text{seq}_{rm err}) (and (text{BS}^*_{rm eff}) if available) are then used in the actual methylation level analysis as described in the next section.

Biological parameters

For computing the biological parameters, the observed total number of “C” readouts for a single noncontrol cytosine is similar to Eq.  11, (N_{rm{BS,C}} sim text {Bin}(N_text{BS},p_{rm{BS}}text{(“C”)})), where (N_{rm{BS}}) is the total number of reads and the probability of observing “C”, similar to Eq. 12, is given by

$$p_{rm{BS}}(text{“C”}) = p_text{BS}(text{“C”} | text{5mC})theta + p_text{BS}(text{“C”} | text{C})(1-theta )$$

(13)

where (theta =p(text{5mC})).

LuxRep retains the general linear model with matrix normal distribution used by LuxGLM to handle covariates wherein matrix normal distribution is a generalisation of multivariate normal distribution to matrix-valued random variables. The following section summarizes the linear model (see [11] for more details).

In the general linear model component of LuxGLM (Fig. 2)

$${mathbf {Y}}={{mathbf {D}}}{{mathbf {B}}} + {mathbf {E}},$$

(14)

where ({mathbf {Y}} in {mathbb {R}}^{N times 2}) contains the unnormalized methylation fractions, ({mathbf {D}}) is the design matrix (size N-by-p, where p is the number of parameters), ({mathbf {B}} in {mathbb {R}}^{p times 2}) is the parameter matrix, and ({mathbf {E}} in {mathbb {R}}^{N times 2}) is the noise matrix.

To derive the (normalized) methylation proportions (varvec{theta } = (theta _1,ldots ,theta _N)^T), LuxGLM uses the softmax link function (or transformation)

$$theta _i = text{Softmax}(text{row}_i({mathbf {Y}})).$$

(15)

The softmax function is obtained when generalizing the logistic function to multiple dimensions. That is, the softmax function (sigma : {mathbb {R}}^K rightarrow [0,1]^K) is defined by (sigma ({mathbf {z}})_i =) (frac{e^{z_i}}{ sum _{j=1}^{K} e^{z_j} }).

In matrix normal distribution,

$${mathbf {X}} sim {{mathcal {M}}}{{mathcal {N}}}({mathbf {M}},{mathbf {U}},{mathbf {V}})$$

(16)

where ({mathbf {M}}) is the location matrix and ({mathbf {U}}) and ({mathbf {V}}) are scale matrices. Alternatively, ({mathbf {X}}) (in Eq.  16) can also be written as the multivariate normal distribution

$$text{vec}({mathbf {X}}) sim {mathcal {N}}(text{vec}({mathbf {M}}),{mathbf {U}} otimes {mathbf {V}}),$$

(17)

where (text{vec}(cdot )) denotes vectorization of a matrix and (otimes) denotes the Kronecker product.

Given Eq. 14, ({mathbf {B}}) and ({mathbf {E}}) take on the following prior distributions

$${mathbf {E}}|mathbf {U_E},mathbf {V_E} sim {{mathcal {M}}}{{mathcal {N}}}({mathbf {0}}, mathbf {U_E}, mathbf {V_E})$$

(18)

$${mathbf {B}}|mathbf {M_B}, mathbf {U_B}, mathbf {V_B} sim {{mathcal {M}}}{{mathcal {N}}}(mathbf {M_B}, mathbf {U_B}, mathbf {V_B}).$$

(19)

Using the vectorized multivariate normal distribution formulation of the matrix normal distribution, matrix ({mathbf {Y}}) then becomes

$$begin{aligned} &text{vec}({mathbf {Y}})|{mathbf {D}}, {mathbf {M_B}}, {mathbf {U_B}}, {mathbf {V_B}}, {mathbf {U_E}}, {mathbf {V_E}} sim {mathcal {N}}(({mathbf {I}} otimes {mathbf {D}}) text{vec} ({mathbf {M_B}}), \&({mathbf {I}} otimes {mathbf {D}})({mathbf {V_B}} otimes {mathbf {U_B}})({mathbf {I}} otimes {mathbf {D}})^{mathbf {T}} + {mathbf {V_E}} otimes {mathbf {U_E}}). end{aligned}$$

(20)

Assuming the scale matrices ({mathbf {U_B}}), ({mathbf {V_B}}), ({mathbf {U_E}}) and ({mathbf {V_E}}) are all diagonal with parameter and noise specific variances (sigma ^text{2}_{mathbf {B}}) and (sigma ^text{2}_{mathbf {E}}), probability densities for ({mathbf {B}}), ({mathbf {E}}) and ({mathbf {Y}}) can be stated as

$$text{vec} ({mathbf {B}}) sim {mathcal {N}}(text{vec}({mathbf {0}}), sigma ^text{2}_{mathbf {B}}({mathbf {I}} otimes {mathbf {I}}))$$

(21)

$$text{vec} ({mathbf {E}}) sim {mathcal {N}}(text{vec}({mathbf {0}}), sigma ^text{2}_{mathbf {E}}({mathbf {I}} otimes {mathbf {I}}))$$

(22)

$$begin{aligned}&text{vec}({mathbf {Y}})|{mathbf {D}},sigma ^text{2}_{mathbf {B}}, sigma ^text{2}_{mathbf {E}} sim {mathcal {N}}(text{vec}({mathbf {0}}), nonumber \&sigma ^text{2}_{mathbf {B}} ({mathbf {I}} otimes {mathbf {D}})({mathbf {I}} otimes {mathbf {I}}) ({mathbf {I}} otimes {mathbf {D}})^{mathbf {T}} + sigma ^text{2}_{mathbf {E}}({mathbf {I}} otimes {mathbf {I}})). end{aligned}$$

(23)

Fig. 4
figure4

Parameter estimates of (text{BS}_{rm eff}) and (text{seq}_{rm err}). The x-axis shows whether the samples were drawn from ‘G’ (good quality) or ‘B’ (bad/low quality) technical replicates corresponding to (text{BS}_{rm eff}^B) and (text{BS}_{rm eff}^G), respectively, and grouped according to the two scenarios, discussed in section “Estimating methylation levels“, ‘GGB’ and ‘GBB’. Low and high coverage refer to (N_{rm{BS}} = 1 dotsc 3) and (N_{rm{BS}} = 10), respectively

Fig. 5
figure5

Plate diagram of LuxRep model for simulating data for estimating methylation level

Fig. 6
figure6

Plate diagram of the reduced LuxRep model that mimics the traditional approach of not accounting for experimental parameters

Fig. 7
figure7

Boxplots of estimates of methylation levels. Datasets were analysed with the full and reduced LuxRep models with varying methylation levels (columns, values shown at topmost panel), varying number of reads (rows, values shown on right panel), different combinations of replicates with varying (text{BS}_{rm eff}) (‘G’ and ‘B’) (x-axis), and using either HMC or ADVI to evaluate or approximate the posterior, respectively. The boxplots show the posterior means ((n=100))

Fig. 8
figure8

Plate diagram of model for generating dummy data for differential methylation analysis

Fig. 9
figure9

AUROCs of differential methylation calls. Accuracy in determining differential methylation was measured by generating datasets consisting of two groups (A and B) with varying (Delta theta) ((theta _A) and (theta _B) levels are shown in top panels and when one or two of three replicates have low (text{BS}_{rm eff}) (‘GGB’ and ‘GBB’, respectively.) For ‘GBB’ (top box) and ‘GGB’ (bottom box) (N_{rm{BS}} = 10) and (N_{rm{BS}} = 6), respectively, for each technical replicate. X-axis shows whether HMC or ADVI was used to evaluate or approximate the posteriors

Variance (sigma ^2_{mathbf {B}} = 5) and (sigma ^2_{mathbf {E}} sim Gamma ^{-1}(alpha ,beta )), where (alpha = beta = 1), are used in this work. We chose to use the hyperparameter value (sigma _B^2=5) because that seems to be widely applicable and provides robust inference results. To confirm that the results are not sensitive to the particular choice of (sigma _B^2) value, we carried out an ablation study where we repeated the methylation level estimation experiment (from Fig. 7) with three different values of (sigma _B^2): 1, 5, and 10. Our results in Fig. S3 confirm that the results have very little or no variation depending on the choice of (sigma _B^2) value.

The inverse gamma distribution was used as prior for (sigma _E^2) since (with the alpha and beta hyperparameters used) it is uninformative and makes no strong assumptions with regards to the spread of the noise term. Also, the inverse gamma distribution is a conjugate prior to a normal distribution with known mean (mu) and unknown variance (sigma ^2).

We extend the model to allow modelling of technical replicates wherein the methylation level (theta) is the same for all different bisulfite-converted DNA libraries from the same biological sample but the experimental parameters ((text{seq}_{rm err}) and (text{BS}_{rm eff})) vary across both the biological replicates as well as the technical replicates.

In the modified model (Figs. 1 and 2), (N_{rm{BS,C}}) and (N_{rm{BS}}) represent the observed “C” and total counts, respectively, from each of the (M_i) technical replicates per biological sample (i in left{ 1,..,Nright}). Note that the experimental parameters (text{BS}_{rm eff}) and (text{seq}_{rm err}), taken from the posterior means, are sample and replicate-specific.

To detect differential methylation, hypothesis testing was done using Bayes factors (via the Savage-Dickey density ratio method) as implemented in [11].

Variational inference

[11] used Hamiltonian Monte Carlo (HMC) for model inference (since the model is analytically intractable), whereas in variational inference (VI) the posterior (p(phi |mathbf{X} )) of a model is approximated with a simpler distribution (q(phi ;rho )), which is selected from a chosen family of distributions by minimizing Kullback-Leibler divergence between (p(phi |mathbf{X} )) and (q(phi ;rho )). We use the automatic differentiation variational inference algorithm (ADVI) from [14], which is integrated into Stan. ADVI is used to generate samples from the approximative posterior (q(phi ;rho )).

There are a few parameters which can be tuned to make the ADVI algorithm [14] fast but accurate. These parameters are number samples used in Monte Carlo integration approximation of expectation lower bound (ELBO), number of samples used in Monte Carlo integration approximation of the gradients of the ELBO and number of samples taken from the approximative posterior distribution. The default values for gradient samples (N_G) and ELBO samples (N_E) are 100 and 1 respectively. Here we compare the computation times and preciseness of the Savage-Dickey estimate computed using HMC and ADVI with different (N_E) and (N_G) values. The tested values for (N_E) were 100, 200, 500 and 1000 and for (N_G) 1, 10 and 100. To make the HMC and ADVI methods comparable, the number of samples retrieved from the approximative posterior distribution is set to be the same for both methods.

To choose the best number of gradient samples and ELBO samples, simulation tests on LuxGLM model were executed. These tests were conducted in the following way: First, simulated data from the LuxGLM model was generated. The number of reads and replicates were varied (the tested values were 6, 12, 24 and 6, 10, 20 respectively) and for each combination data sets with differential methylation and without differential methylation were generated. The calculation of the Bayes factors was made using different (N_E) and (N_G) values. For each setting 100 data sets were simulated and Bayes factors were calculated. Using the computed Bayes factors, ROC curves and AUROC statistics were produced. Also, the computation times for each parameter value combination were taken down. The results of these tests for the case of 12 reads and 10 replicates are shown in Additional file 1: Figs. S4 and S5.

In Additional file 1: Fig. S4 the computation times for different parameter values are shown. In Additional file 1: Fig. S5 the computation time was plotted as a function of accuracy of the method when compared to the HMC approach. The average computation time for the HMC method is plotted in red. From the figures we can see that with the all tested parameter combinations the computing Savage-Dickey estimate with ADVI is faster than with HMC. In Additional file 1: Fig. S5, on the left side of the dashed line are the parameter combinations which gave better precision than HMC approach.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Disclaimer:

This article is autogenerated using RSS feeds and has not been created or edited by OA JF.

Click here for Source link (https://www.biomedcentral.com/)