# LuxRep: a technical replicate-aware method for bisulfite sequencing data analysis – BMC Bioinformatics

#### ByMaia H. Malonzo, Viivi Halla-aho, Mikko Konki, Riikka J. Lund and Harri Lähdesmäki

Jan 14, 2022

Our software consists of two modules: (1) estimation of experimental parameters from control data (“experimental parameters”) and (2) inference of methylation level (“biological parameters”) and differential methylation from DNA bisulfite sequencing data using the previously estimated experimental parameters. While LuxGLM was originally designed for analysis of both methylated (5mC) and hydroxymethylated cytosines (5hmC), the level for only one methylation modification (methylcytosine, 5mC) is included in this work (although our model can also be extended to 5hmC).

To facilitate genome wide analysis, in our model implementation the experimental parameters are first computed from the control data since all cytosines per technical replicate have the same value for these parameters (Fig. 1). Methylation levels are then determined individually for each cytosine, and differential methylation thereafter, using the pre-computed experimental parameters as fixed input (Fig. 2). We will next describe these two models in detail in Sects. 2.1–2.2.

### Experimental parameters

Methylation estimates are a function of experimental parameters: bisulfite conversion rate ((text{BS}_{rm eff})), sequencing error ((text{seq}_{rm err})) and incorrect bisulfite conversion rate ((text{BS}^*_{rm eff})). A BS-seq library with low (text{BS}_{rm eff}) results in overestimation of methylation levels. High (text{seq}_{rm err}), on the other hand, can lead to both over and underestimation of methylation levels. Though typically not measured in high-throughput bisulfite sequencing experiments, high (text{BS}^*_{rm eff}) leads to underestimation of methylation level.

To demonstrate that differences in technical parameters (specifically bisulfite conversion rate) is common we took a real bisulfite sequencing dataset [12] and compared the bisulfite conversion efficiencies of the technical replicates (i.e. libraries) per biological replicate (Additional file 1: Fig. S1). Most samples had significantly variable conversion rates, i.e. differences in technical parameters is common. Moreover, in practice, BS-seq datasets obtained with non-optimal conversion efficiencies are commonly ignored as currently there does not exist a statistical analysis tool that would allow analyzing BS-seq datasets with different conversion efficiencies. This in turn leads to loss of data, decrease in statistical power, loss of a biological sample, and increase in sequencing costs.

We start by briefly reviewing the underlying statistical model [11] and then introduce our extension that can handle technical replicates. Briefly, the conditional probability of a sequencing readout being “C” in BS-seq data is a function of the experimental parameters that include (text{seq}_{rm err}) and (text{BS}_{rm eff}), and depends on the methylation level (theta in [0,1]). If a read was generated from an unmethylated cytosine (C), the conditional probability (p_{rm{BS}}(text{“C”|C})) is given by

$$p_{rm{BS}}(text{“C”|C})=(1-text{BS}_{rm eff})(1-text{seq}_{rm err}) + text{BS}_{rm eff} text{seq}_{rm err}.$$

(1)

The term ((1-text{BS}_{rm eff})(1-text{seq}_{rm err})) refers to the condition wherein unmethylated cytosine is incorrectly not converted into uracil and correctly sequenced as “C” whereas the term (text{BS}_{rm eff} text{seq}_{rm err}) represents the condition wherein the unmethylated cytosine is correctly converted into uracil but incorrectly sequenced as “C”. Similarly, in the case of methylated cytosine

$$p_{rm BS}(text{“C”|5mC}) = (1-text{BS}^*_{rm eff})(1-text{seq}_{rm err}) + text{BS}^*_{rm eff} text{seq}_{rm err},$$

(2)

where ((1-text{BS}^*_{rm eff})(1-text{seq}_{rm err})) denotes the case that methylated cytosine is correctly not converted to uracil and correctly sequenced as “C” while the term (text{BS}^*_{rm eff} text{seq}_{rm err}) represents the case that methylated cytosine is incorrectly converted to uracil and incorrectly sequenced as “C”.

In [11], bisulfite conversion, sequencing error and incorrect bisulfite conversion rates were specific to each biological replicate, not technical replicate.

The experimental parameters follow a logistic normal distribution, where the bisulfite conversion rate (text{BS}_{rm eff}) is given by

$$text{BS}_{rm eff}=text{logit}^{-1}(mu _{rm{BS}_{rm eff}} + sigma _{rm{BS}_{rm eff}}r_{rm{BS}_{rm eff}})$$

(3)

and its hyperparameters are

$$mu_{rm{BS}_{rm eff}} sim {mathcal {N}}(psi ^{mu ,mu }_{rm{BS}_{rm eff}},psi ^{mu ,sigma }_{rm{BS}_{rm eff}})$$

(4)

$$text{ln}(sigma _{{{rm BS}_{rm eff}}}) sim {mathcal {N}}(psi ^{sigma ,mu }_{rm{BS}_{rm eff}},psi ^{sigma ,sigma }_{rm{BS}_{rm eff}})$$

(5)

$$r_{rm{BS}_{rm eff}} sim {mathcal {N}}(0,1),$$

(6)

such that (text{logit}(text{BS}_{rm eff}) sim {mathcal {N}}(mu _{rm{BS}_{rm eff}},sigma _{rm{BS}_{rm eff}})), where (mu _{rm{BS}_{rm eff}}) is the mean and (sigma _{rm{BS}_{rm eff}}) is the standard deviation ((psi ^{mu ,mu }_{rm{BS}_{rm eff}}=4), (psi ^{mu ,sigma }_{rm{BS}_{rm eff}}=1.29), (psi ^{sigma ,mu }_{rm{BS}_{rm eff}}=0.4) and (psi ^{sigma ,sigma }_{rm{BS}_{rm eff}}=0.5)). See [13] for details.

The sequencing error (text{seq}_{rm err}) is modeled similarly

$$text{seq}_{rm err} = text{logit}^{-1}(mu _{rm{seq}_{rm err}} + sigma _{rm{seq}_{rm err}}r_{rm{seq}_{rm err}})$$

(7)

$$mu _{rm{seq}_{rm err}} sim {mathcal {N}}(psi ^{mu ,mu }_{rm{seq}_{rm err}},psi ^{mu ,sigma }_{rm{seq}_{rm err}})$$

(8)

$$text{ln}(sigma _{rm{seq}_{rm err}}) sim {mathcal {N}}(psi ^{sigma ,mu }_{rm{seq}_{rm err}},psi ^{sigma ,sigma }_{rm{seq}_{rm err}})$$

(9)

$$r_{rm{seq}_{rm err}} sim {mathcal {N}}(0,1),$$

(10)

such that (text{logit}(text{seq}_{rm err}) sim {mathcal {N}}(mu _{rm{seq}_{rm err}},sigma _{rm{seq}_{rm err}})), where (mu _{rm{seq}_{rm err}}) is the mean and (sigma _{rm{seq}_{rm err}}) is the standard deviation ((psi ^{mu ,mu }_{rm{seq}_{rm err}}=-8), (psi ^{mu ,sigma }_{rm{seq}_{rm err}}=1.29), (psi ^{sigma ,mu }_{rm{seq}_{rm err}}=0.4) and (psi ^{sigma ,sigma }_{rm{seq}_{rm err}}=0.5)).

The hyperparameter values above were used since they worked well in a previously published related work [11] although we chose a lower (psi ^{mu ,mu }_{rm{seq}_{rm err}}) since it generated more robust methylation estimates with mid-values of theta (i.e. 0.3 and 0.7). Other than that, to confirm that the results were not sensitive to hyperparameter values we tested different values ranging from low ((psi ^{mu ,mu }_{rm{BS}_{rm eff}}=1), (psi ^{mu ,sigma }_{rm{BS}_{rm eff}}=1), (psi ^{sigma ,mu }_{rm{BS}_{rm eff}}=0.1), (psi ^{sigma ,sigma }_{rm{BS}_{rm eff}}=0.1), (psi ^{mu ,mu }_{rm{seq}_{rm err}}=-10), (psi ^{mu ,sigma }_{rm{seq}_{rm err}}=1), (psi ^{sigma ,mu }_{rm{seq}_{rm err}}=0.1) and (psi ^{sigma ,sigma }_{rm{seq}_{rm err}}=0.1)) to high ((psi ^{mu ,mu }_{rm{BS}_{rm eff}}=10), (psi ^{mu ,sigma }_{rm{BS}_{rm eff}}=10), (psi ^{sigma ,mu }_{rm{BS}_{rm eff}}=1), (psi ^{sigma ,sigma }_{rm{BS}_{rm eff}}=1), (psi ^{mu ,mu }_{rm{seq}_{rm err}}=-1), (psi ^{mu ,sigma }_{rm{seq}_{rm err}}=10), (psi ^{sigma ,mu }_{rm{seq}_{rm err}}=1) and (psi ^{sigma ,sigma }_{rm{seq}_{rm err}}=1)) hyperparameter values, relative to the values used in this paper, and indeed the methylation estimates were robust regardless of hyperparameter values (Additional file 1: Fig. S2).

The BS-seq experiments typically include completely unmethylated DNA fragments as controls (such as the lambda phage genome) that allow estimation of (text{BS}_{rm eff}) and (text{seq}_{rm err}). However, as BS-seq experiments typically do not include completely methylated DNA fragments as controls that would be needed to estimate the incorrect bisulfite conversion rate (text{BS}^*_{rm eff}), it is set to a constant value (e.g. (text{BS}^*_{rm eff}=0), see Sections “Estimating experimental parameters” and “Estimating methylation levels” for specific values used in results). Note also that the bisulfite conversion rate and sequencing error parameters are specific for each biological samples and technical replicate.

In Fig. 1, (theta ^{text {control}}) represents the proportions of DNA methylation modifications in the control cytosine. In this case the proportion consists of unmethylated DNA, but this can be adjusted if additional DNA methylation modifications are included.

Following Eqs. 1 and 2, the observed total number of “C” readouts for a single control cytosine is binomially distributed,

$$N_{rm{BS,C}}^{text {control}} sim text {Bin}(N_text{BS}^{text {control}},p_{rm{BS}}text{(“C”)}^{text {control}}),$$

(11)

where (N_{rm{BS}}^{text {control}}) is the total number of reads and the probability of observing “C” is given by

begin{aligned} p_{rm{BS}}(text{“C”})^{text {control}}&= p_text{BS}(text{“C”} | text{5mC})theta ^{text {control}} + p_text{BS}(text{“C”} | text{C})(1-theta ^{text {control}}). end{aligned}

(12)

Using the sequencing read counts from the control cytosines (N_{rm{BS,C}}^{text {control}}) and (N_{rm{BS}}^{text {control}}), posterior distributions of unknowns in this model are obtained using the inference methods described in section “Variational inference“. Posterior means of (text{BS}_{rm eff}) and (text{seq}_{rm err}) (and (text{BS}^*_{rm eff}) if available) are then used in the actual methylation level analysis as described in the next section.

### Biological parameters

For computing the biological parameters, the observed total number of “C” readouts for a single noncontrol cytosine is similar to Eq.  11, (N_{rm{BS,C}} sim text {Bin}(N_text{BS},p_{rm{BS}}text{(“C”)})), where (N_{rm{BS}}) is the total number of reads and the probability of observing “C”, similar to Eq. 12, is given by

$$p_{rm{BS}}(text{“C”}) = p_text{BS}(text{“C”} | text{5mC})theta + p_text{BS}(text{“C”} | text{C})(1-theta )$$

(13)

where (theta =p(text{5mC})).

LuxRep retains the general linear model with matrix normal distribution used by LuxGLM to handle covariates wherein matrix normal distribution is a generalisation of multivariate normal distribution to matrix-valued random variables. The following section summarizes the linear model (see [11] for more details).

In the general linear model component of LuxGLM (Fig. 2)

$${mathbf {Y}}={{mathbf {D}}}{{mathbf {B}}} + {mathbf {E}},$$

(14)

where ({mathbf {Y}} in {mathbb {R}}^{N times 2}) contains the unnormalized methylation fractions, ({mathbf {D}}) is the design matrix (size N-by-p, where p is the number of parameters), ({mathbf {B}} in {mathbb {R}}^{p times 2}) is the parameter matrix, and ({mathbf {E}} in {mathbb {R}}^{N times 2}) is the noise matrix.

To derive the (normalized) methylation proportions (varvec{theta } = (theta _1,ldots ,theta _N)^T), LuxGLM uses the softmax link function (or transformation)

$$theta _i = text{Softmax}(text{row}_i({mathbf {Y}})).$$

(15)

The softmax function is obtained when generalizing the logistic function to multiple dimensions. That is, the softmax function (sigma : {mathbb {R}}^K rightarrow [0,1]^K) is defined by (sigma ({mathbf {z}})_i =) (frac{e^{z_i}}{ sum _{j=1}^{K} e^{z_j} }).

In matrix normal distribution,

$${mathbf {X}} sim {{mathcal {M}}}{{mathcal {N}}}({mathbf {M}},{mathbf {U}},{mathbf {V}})$$

(16)

where ({mathbf {M}}) is the location matrix and ({mathbf {U}}) and ({mathbf {V}}) are scale matrices. Alternatively, ({mathbf {X}}) (in Eq.  16) can also be written as the multivariate normal distribution

$$text{vec}({mathbf {X}}) sim {mathcal {N}}(text{vec}({mathbf {M}}),{mathbf {U}} otimes {mathbf {V}}),$$

(17)

where (text{vec}(cdot )) denotes vectorization of a matrix and (otimes) denotes the Kronecker product.

Given Eq. 14, ({mathbf {B}}) and ({mathbf {E}}) take on the following prior distributions

$${mathbf {E}}|mathbf {U_E},mathbf {V_E} sim {{mathcal {M}}}{{mathcal {N}}}({mathbf {0}}, mathbf {U_E}, mathbf {V_E})$$

(18)

$${mathbf {B}}|mathbf {M_B}, mathbf {U_B}, mathbf {V_B} sim {{mathcal {M}}}{{mathcal {N}}}(mathbf {M_B}, mathbf {U_B}, mathbf {V_B}).$$

(19)

Using the vectorized multivariate normal distribution formulation of the matrix normal distribution, matrix ({mathbf {Y}}) then becomes

begin{aligned} &text{vec}({mathbf {Y}})|{mathbf {D}}, {mathbf {M_B}}, {mathbf {U_B}}, {mathbf {V_B}}, {mathbf {U_E}}, {mathbf {V_E}} sim {mathcal {N}}(({mathbf {I}} otimes {mathbf {D}}) text{vec} ({mathbf {M_B}}), \&({mathbf {I}} otimes {mathbf {D}})({mathbf {V_B}} otimes {mathbf {U_B}})({mathbf {I}} otimes {mathbf {D}})^{mathbf {T}} + {mathbf {V_E}} otimes {mathbf {U_E}}). end{aligned}

(20)

Assuming the scale matrices ({mathbf {U_B}}), ({mathbf {V_B}}), ({mathbf {U_E}}) and ({mathbf {V_E}}) are all diagonal with parameter and noise specific variances (sigma ^text{2}_{mathbf {B}}) and (sigma ^text{2}_{mathbf {E}}), probability densities for ({mathbf {B}}), ({mathbf {E}}) and ({mathbf {Y}}) can be stated as

$$text{vec} ({mathbf {B}}) sim {mathcal {N}}(text{vec}({mathbf {0}}), sigma ^text{2}_{mathbf {B}}({mathbf {I}} otimes {mathbf {I}}))$$

(21)

$$text{vec} ({mathbf {E}}) sim {mathcal {N}}(text{vec}({mathbf {0}}), sigma ^text{2}_{mathbf {E}}({mathbf {I}} otimes {mathbf {I}}))$$

(22)

begin{aligned}&text{vec}({mathbf {Y}})|{mathbf {D}},sigma ^text{2}_{mathbf {B}}, sigma ^text{2}_{mathbf {E}} sim {mathcal {N}}(text{vec}({mathbf {0}}), nonumber \&sigma ^text{2}_{mathbf {B}} ({mathbf {I}} otimes {mathbf {D}})({mathbf {I}} otimes {mathbf {I}}) ({mathbf {I}} otimes {mathbf {D}})^{mathbf {T}} + sigma ^text{2}_{mathbf {E}}({mathbf {I}} otimes {mathbf {I}})). end{aligned}

(23)

Variance (sigma ^2_{mathbf {B}} = 5) and (sigma ^2_{mathbf {E}} sim Gamma ^{-1}(alpha ,beta )), where (alpha = beta = 1), are used in this work. We chose to use the hyperparameter value (sigma _B^2=5) because that seems to be widely applicable and provides robust inference results. To confirm that the results are not sensitive to the particular choice of (sigma _B^2) value, we carried out an ablation study where we repeated the methylation level estimation experiment (from Fig. 7) with three different values of (sigma _B^2): 1, 5, and 10. Our results in Fig. S3 confirm that the results have very little or no variation depending on the choice of (sigma _B^2) value.

The inverse gamma distribution was used as prior for (sigma _E^2) since (with the alpha and beta hyperparameters used) it is uninformative and makes no strong assumptions with regards to the spread of the noise term. Also, the inverse gamma distribution is a conjugate prior to a normal distribution with known mean (mu) and unknown variance (sigma ^2).

We extend the model to allow modelling of technical replicates wherein the methylation level (theta) is the same for all different bisulfite-converted DNA libraries from the same biological sample but the experimental parameters ((text{seq}_{rm err}) and (text{BS}_{rm eff})) vary across both the biological replicates as well as the technical replicates.

In the modified model (Figs. 1 and 2), (N_{rm{BS,C}}) and (N_{rm{BS}}) represent the observed “C” and total counts, respectively, from each of the (M_i) technical replicates per biological sample (i in left{ 1,..,Nright}). Note that the experimental parameters (text{BS}_{rm eff}) and (text{seq}_{rm err}), taken from the posterior means, are sample and replicate-specific.

To detect differential methylation, hypothesis testing was done using Bayes factors (via the Savage-Dickey density ratio method) as implemented in [11].

### Variational inference

[11] used Hamiltonian Monte Carlo (HMC) for model inference (since the model is analytically intractable), whereas in variational inference (VI) the posterior (p(phi |mathbf{X} )) of a model is approximated with a simpler distribution (q(phi ;rho )), which is selected from a chosen family of distributions by minimizing Kullback-Leibler divergence between (p(phi |mathbf{X} )) and (q(phi ;rho )). We use the automatic differentiation variational inference algorithm (ADVI) from [14], which is integrated into Stan. ADVI is used to generate samples from the approximative posterior (q(phi ;rho )).

There are a few parameters which can be tuned to make the ADVI algorithm [14] fast but accurate. These parameters are number samples used in Monte Carlo integration approximation of expectation lower bound (ELBO), number of samples used in Monte Carlo integration approximation of the gradients of the ELBO and number of samples taken from the approximative posterior distribution. The default values for gradient samples (N_G) and ELBO samples (N_E) are 100 and 1 respectively. Here we compare the computation times and preciseness of the Savage-Dickey estimate computed using HMC and ADVI with different (N_E) and (N_G) values. The tested values for (N_E) were 100, 200, 500 and 1000 and for (N_G) 1, 10 and 100. To make the HMC and ADVI methods comparable, the number of samples retrieved from the approximative posterior distribution is set to be the same for both methods.

To choose the best number of gradient samples and ELBO samples, simulation tests on LuxGLM model were executed. These tests were conducted in the following way: First, simulated data from the LuxGLM model was generated. The number of reads and replicates were varied (the tested values were 6, 12, 24 and 6, 10, 20 respectively) and for each combination data sets with differential methylation and without differential methylation were generated. The calculation of the Bayes factors was made using different (N_E) and (N_G) values. For each setting 100 data sets were simulated and Bayes factors were calculated. Using the computed Bayes factors, ROC curves and AUROC statistics were produced. Also, the computation times for each parameter value combination were taken down. The results of these tests for the case of 12 reads and 10 replicates are shown in Additional file 1: Figs. S4 and S5.

In Additional file 1: Fig. S4 the computation times for different parameter values are shown. In Additional file 1: Fig. S5 the computation time was plotted as a function of accuracy of the method when compared to the HMC approach. The average computation time for the HMC method is plotted in red. From the figures we can see that with the all tested parameter combinations the computing Savage-Dickey estimate with ADVI is faster than with HMC. In Additional file 1: Fig. S5, on the left side of the dashed line are the parameter combinations which gave better precision than HMC approach.