A sparse matrix of TF binding events across accessible genomic regions
Deep learning-based TF binding prediction can be treated as a binary classification task wherein the ones and zeros (or positives and negatives) represent whether or not a TF binds to a genomic region. It is common to define the regions a TF binds to as the set of ChIP-seq peaks for that TF, with unbound regions being the ChIP-seq peaks of other TFs or randomly selected genomic regions. Leaving aside the intrinsic challenges associated with ChIP-seq data generation and peak-calling, there are limitations to adopting such definitions. For instance, many ChIP-seq peaks lack the consensus motif of the profiled TF, while others are the consequence of indirect or tethered binding events [29, 30], or appear in datasets for unusually high numbers of other TFs [31,32,33], i.e., they may be artifacts. While bound regions can be defined directly from the ChIP-seq data, the selection of unbound regions for use in deep learning models is more difficult. For example, the %GC content, which is often an important contributor to model performance both in vitro and in vivo, varies between the sets of peak regions of different TFs. Moreover, some classes of TFs are special: pioneer TFs can bind to nucleosomal regions, which have distinctive sequence characteristics and are not usually bound by non-pioneer TFs. Thus, it is suboptimal to define the set of unbound regions for a given TF based on the sets of bound regions of other TFs. The alternative approach of using randomly selected genomic regions as unbound regions can result in the inclusion of non-relevant regions, such as centromeres or telomeres. Furthermore, a region not bound by a TF in a certain cell or tissue type could be bound by that same TF in a different biological context. Given the dependence of deep learning on high-quality data, properties arising from one or more of the aforementioned limitations could result in improperly fitted models and, ultimately, misleading or inflated estimates of model performance.
To mitigate these limitations, which in turn could negatively affect our assessment of transfer learning, we restricted the selection of bound and unbound regions to open regulatory regions of the genome. We constructed a sparse matrix describing the binding of 163 TFs to 1,817,918 200-bp DNase I hypersensitive sites (DHSs) in a cell and tissue type-agnostic manner (Methods; Fig. 1). Each element in the matrix was defined by a specific TF-DHS pair and could take one of three values: “1” (i.e., bound) if the DHS was both accessible and bound by the TF in at least one cell or tissue type in common, “0” (i.e., not bound) if the DHS was accessible but not bound by the TF in any cell or tissue type in common, or “null” if the binding of the TF to the DHS could not be resolved (e.g., the TF had not been profiled in a cell or tissue type with available DHS data). The total numbers of ones, zeros, and nulls in the matrix were ~ 1.9 M, ~ 51.7 M, and ~ 242.6 M, respectively. In addition, the number of ones for each TF varied greatly, with 16 TFs having ≤ 500 bound regions (Table S1).
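The three-state encoding of a single matrix element can be sketched as follows. This is an illustrative reimplementation, not the paper's code: the function name and the use of cell-type sets as inputs are our own, with the sets standing in for the ChIP-seq and DNase-seq metadata described in Methods.

```python
import numpy as np

def binding_state(tf_cells_bound, tf_cells_profiled, dhs_cells_open):
    """Resolve one TF-DHS matrix element (hypothetical helper).

    tf_cells_bound:    cell types where a ChIP-seq peak for the TF overlaps this DHS
    tf_cells_profiled: cell types where the TF was ChIP-seq profiled at all
    dhs_cells_open:    cell types where this DHS is accessible (DNase-seq)

    Returns 1.0 (bound), 0.0 (accessible but unbound), or np.nan (unresolved).
    """
    if tf_cells_bound & dhs_cells_open:
        return 1.0   # bound in at least one cell type in common
    if tf_cells_profiled & dhs_cells_open:
        return 0.0   # accessible where profiled, but never bound
    return np.nan    # no cell type with both ChIP-seq and DHS data
```

Encoding nulls as NaN (rather than a third integer) makes it straightforward to mask unresolved elements in downstream computations.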
The large number of unresolved elements (> 80%) raised concerns about the sparsity of the matrix and whether it had captured known TF-TF functional associations present in ChIP-seq data. We computed pairwise cosine similarities between the binding vectors of all TFs as a measure of correlation (Methods). To ease interpretation, TFs were sorted by their hierarchy in TFClass and visualized on a heatmap (Fig. 2). As expected, TFs with shared DNA-binding mechanisms (hereafter referred to as binding modes), such as homologs or overlapping members of dimeric complexes, formed well-defined clusters along the heatmap diagonal. In contrast, away from the diagonal, correlated entries could indicate cooperative binding between TFs from different families, such as the erythropoietic TFs GATA1, GATA2, SOX6, and TAL1, highlighted on the heatmap with white arrows. In agreement with previous literature, TEAD4 was also highly correlated with these four TFs (teal arrows). To further support the presence of cooperative binding in the TF binding matrix, we compared the binding vector similarities of TF-TF interacting pairs with different degrees of confidence from STRING. On average, the binding vectors of the “highest confidence” TF pairs exhibited a higher degree of similarity than the rest (0.141 vs. 0.063; Welch t test, p value = 6.46e−09). We concluded that the TF binding matrix captured TF cooperativity and was thus well suited for our assessment of transfer learning.
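A minimal sketch of the similarity computation, assuming nulls are stored as NaN: unresolved positions in either vector are masked out before taking the cosine. The masking rule and function name are our own; the paper's exact handling of nulls is described in its Methods.

```python
import numpy as np

def masked_cosine(u, v):
    """Cosine similarity between two TF binding vectors (1/0/NaN),
    computed only over positions resolved in both vectors."""
    mask = ~(np.isnan(u) | np.isnan(v))   # keep mutually resolved DHSs
    u, v = u[mask], v[mask]
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0
```

Computing this for all TF pairs yields the symmetric matrix visualized in the heatmap of Fig. 2.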
Transfer learning improves TF binding prediction
Zheng and colleagues recently applied transfer learning to identify distinctive contextual sequence characteristics of bound and unbound instances of TF motifs. Their transfer learning framework, named AgentBind, included a pre-training step, wherein a multi-task model (hereafter referred to as a multi-model) was trained on the DeepSEA dataset, and a fine-tuning step, wherein single-task models (hereafter referred to as individual models) for 38 individual TFs of the GM12878 cell line were trained after initialization of the convolutional layers with weights from the multi-model. AgentBind outperformed models trained from scratch (i.e., without transfer learning), particularly for TFs with little ChIP-seq data. To corroborate these results, we implemented a similar two-step transfer learning strategy for TF binding prediction (Methods; Fig. 3A); however, in the fine-tuning step, both the convolutional and fully connected layers (except the output layer) of individual models were initialized with the multi-model weights, after which training continued at a lower learning rate. We trained a multi-model using the 50 TFs that maximized the number of common resolved regions (i.e., not null) in the TF binding matrix. Moreover, we trained individual models, with and without transfer learning, for 49 of these TFs; ARNT was excluded because it had < 250 resolved regions left (i.e., non-overlapping with multi-model regions). Transfer learning significantly improved model performance for 39 (79.6%) TFs (Fig. 3B; Wilcoxon signed-rank test, p value = 1.91e−07), especially for TFs with small training datasets, in concordance with the results from AgentBind. For the 10 cases where transfer learning was detrimental, the differences in model performance compared to training from scratch were small (ΔAUCPR < 0.025; Table S2).
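The fine-tuning initialization can be illustrated with a framework-agnostic sketch: every layer except the task-specific output layer takes the pre-trained multi-model weights, and the output layer keeps its fresh random initialization (the multi-model predicts 50 TFs, the individual model just one). The dict-of-arrays representation and the layer names ("conv1", "fc1", "output") are illustrative stand-ins for a framework state dict, not the paper's actual architecture.

```python
import numpy as np

def init_for_finetuning(multi_model_weights, individual_weights,
                        output_layer="output"):
    """Initialize an individual model from a pre-trained multi-model.

    Both arguments map layer names to weight arrays. Transferred layers are
    copied from the multi-model; the output layer is kept as-is because its
    shape (one task) differs from the multi-model's (50 tasks).
    """
    init = {}
    for name, w in individual_weights.items():
        if name.startswith(output_layer):
            init[name] = w                                 # keep random init
        else:
            init[name] = multi_model_weights[name].copy()  # transfer weights
    return init
```

Fine-tuning then proceeds at a lower learning rate than pre-training (the exact factor is not part of this sketch), so the transferred weights are refined rather than overwritten.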
The pre-training step included data for all TFs, so for a more direct comparison, we repeated the previous experiment with 99 additional TFs that had at least 250 bound regions in the TF binding matrix and, importantly, had not been used to train the multi-model. Again, transfer learning significantly improved model performance (Fig. 3C; Wilcoxon signed-rank test, p value = 2.92e−13) and was more beneficial for TFs with small training datasets. Among the most notable examples was ATF4, with only 557 bound regions, which achieved an AUCPR of 0.941 (compared to 0.571 when training from scratch). However, transfer learning was detrimental for 15 (17.1%) TFs (Table S2). By design of the TF binding matrix, the bound regions of each TF contained the motif of that TF, while the unbound regions might or might not contain it. This could make their classification too simple a task for the model. We therefore generated de novo PWMs for each TF to provide a more reliable baseline (Methods). Overall, transfer learning performance was comparable to that of de novo PWMs (0.845 vs. 0.856; Wilcoxon signed-rank test, p value = 0.667, NS); however, de novo PWMs outperformed transfer learning for TFs with small training datasets (Fig. 3C). Notably, transfer learning reduced the variation in model performance resulting from different data splits (i.e., the random assignment of data to training, validation, and test sets), thereby increasing the overall robustness of the training process (Fig. S1).
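The essence of a PWM baseline is a motif scan: slide the PWM along the sequence and take the best window score as the classifier score. The sketch below is a minimal version of that idea; the paper's de novo PWM pipeline (Methods) is more involved, and the log-odds values here are illustrative.

```python
import numpy as np

def pwm_max_score(seq, pwm):
    """Best sliding-window PWM score along an uppercase DNA sequence.

    pwm: (4, L) array of log-odds scores, rows in A, C, G, T order.
    """
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    L = pwm.shape[1]
    scores = [
        sum(pwm[idx[base], j] for j, base in enumerate(seq[i:i + L]))
        for i in range(len(seq) - L + 1)
    ]
    return max(scores)
```

Thresholding this maximum score yields a simple bound/unbound call against which the deep learning models can be benchmarked.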
Driven by the observation that transfer learning particularly benefited TFs with small training datasets, we sought to determine the minimum training dataset size required for effective transfer learning. We focused on five TFs from five different families: HNF4A (a nuclear receptor), JUND (a basic leucine zipper), MAX (a basic helix-loop-helix factor), SPI1 (a tryptophan cluster factor), and SP1 (a C2H2 zinc finger). Of these, JUND, MAX, and SP1 were among the 50 TFs used to train the multi-model. For each TF, we trained individual models, with and without transfer learning, by randomly downsampling to 5000, 1000, 500, 250, 125, and 50 bound and unbound regions from the TF binding matrix. We repeated the downsampling process five times to ensure the robustness of the results. Transfer learning was effective when downsampling to just 500 bound/unbound regions (Fig. 4 for HNF4A; Figs. S2, S3, S4 and S5 for the remaining TFs), in concordance with the previous result for ATF4. For JUND and MAX, transfer learning was effective when downsampling to as few as 50 bound/unbound regions (Figs. S2F and S3F), which could be explained by the use of data for both TFs in the pre-training step. However, for SP1, which had also been used to train the multi-model, transfer learning was no longer effective when downsampling below 500 bound/unbound regions (Fig. S5C). We attributed this result to the poor performance of SP1 in the multi-model (AUCPR = 0.796 vs. 0.882 and 0.866 for JUND and MAX, respectively; Table S2), suggesting that the multi-model could not correctly learn the binding features of SP1, which in turn would result in an inadequate set of weights with which to initialize the individual model in the fine-tuning step.
Finally, since the TF binding matrix aggregated data across 52 cell and tissue types, we wondered whether this aggregation would have an impact on transfer learning. We focused on GM12878 and K562 cells, which covered the largest number of TFs in the matrix. Using resolved regions in both GM12878 and K562 cells, we pre-trained a multi-model for 50 TFs common to both cell types, but only using matrix values from GM12878 cells (i.e., the multi-model was not trained with data from K562 cells). GM12878-specific resolved regions (i.e., regions resolved in GM12878 but not in K562 cells) and K562-specific resolved regions were held out for fine-tuning. Using the held-out data, we then trained individual models, with and without transfer learning, for 35 of the 50 TFs used in the pre-training step (15 TFs were discarded because they had < 250 resolved regions left non-overlapping with multi-model regions). We additionally trained individual models (with and without transfer learning) for 76 TFs with resolved regions in either GM12878 cells (2), K562 cells (55), or both (19). Pre-training and fine-tuning on the same or on different cells did not impact transfer learning (Welch t test, p value = 0.202, NS; Fig. S6A); however, individual models pre-trained and fine-tuned on the same cells benefited from transfer learning significantly more (Welch t test, p value = 0.029; Fig. S6B), although the differences were small (0.166 vs. 0.134). We attributed this to the poorer performance of individual models trained from scratch on GM12878 cells compared to those trained on K562 cells for the 35 multi-model TFs (average AUCPR = 0.656 vs. 0.713; Wilcoxon signed-rank test, p value = 0.001; Fig. S6D). Taken together, these results suggest that aggregating the data in a cell and tissue type-agnostic manner did not introduce any bias during the construction of the TF binding matrix.
Biologically relevant prior knowledge improves transfer learning
TFs from the same family often have highly similar DNA-binding specificities ; hence, we hypothesized that the presence in the pre-training step of TFs with the same binding mode as the target TF could have a positive effect on transfer learning performance. The JASPAR database of TF binding profiles includes a hierarchical clustering that groups TFs based on the similarity of their DNA-binding profiles . We relied on this grouping as the source of binding modes (Table S3). Focusing on the set of TFs whose performance worsened with transfer learning, we found that the binding modes of the most extreme cases (ΔAUCPR ≥ 0.025), namely MEF2A, MEF2C, HOXB13, TBX21, and TFAP2A, differed from those of the 50 TFs used to train the multi-model, suggesting that the inclusion in the pre-training step of TFs with relevant binding modes could be beneficial. This observation was further supported by reviewing the remaining TFs; TFs with the same binding mode as a TF from the pre-training step benefited from transfer learning significantly more (Welch t test, p value = 0.035; Fig. 5).
We hypothesized that the presence in the pre-training step of other biologically relevant prior knowledge, such as cofactors or functional partners from STRING, could also have a positive effect on transfer learning performance (Methods). We defined cofactors as pairs of TFs whose binding was positively correlated. We focused on the same pentad of TFs from the previous section (i.e., HNF4A, JUND, MAX, SPI1, and SP1). For each TF, we pre-trained five different multi-models using, respectively: five TFs with the same binding mode as the target TF; five cofactors of the target TF; five TFs with the same binding mode as the target TF but whose binding was not correlated with it (i.e., non-cofactors); five functional partners of the target TF from STRING; and five randomly selected TFs. Cofactors, functional partners from STRING, and randomly selected TFs were restricted to binding modes different from that of the target TF. To avoid any confounding effects related to the training dataset size, all models were trained with a similar number of regions: ~ 70,000 for multi-models and ~ 2000 for individual models. Furthermore, to set an upper performance limit for each pre-training strategy, we repeated the analysis after replacing one of the five TFs used to pre-train the multi-model with the target TF. As expected, pre-training with the target TF resulted in better transfer learning. Moreover, when the target TF was included in the multi-model, we did not observe any significant differences between the five pre-training strategies (Kruskal-Wallis H test, Bonferroni adjusted p values = NS; Fig. 6). In contrast, when the target TF was not included in the multi-model, pre-training with either TFs with the same binding mode as the target TF or with cofactors was the best strategy: both achieved effective performance levels for four out of five TFs (except for SP1).
Interestingly, using non-cofactors with the same binding mode as the target TF during pre-training led to slightly worse performance, suggesting that cofactors could play an important role in TF binding prediction. Of the five TFs analyzed, SP1 was a notable outlier: not only did it show the worst performance levels overall, but when it was not included in the multi-model, pre-training with other Krüppel-like zinc fingers sharing the same binding mode as SP1 performed worse than pre-training with randomly selected TFs (Fig. 6E). To confirm whether these observations generalized to this family of TFs, we repeated the experiment with another Krüppel-like zinc finger, EGR1. We obtained results more in line with the other four TFs (Fig. 6F), suggesting that SP1 was an isolated case. Finally, to ensure that these results were not biased by our choice of model architecture, we repeated the previous experiment using the hybrid architecture with convolutional and recurrent layers of DanQ, obtaining similar results (Fig. S7).
Next, we tested whether focusing on a few biologically relevant TFs, rather than pre-training with a large dataset, would be a more effective transfer learning strategy. For each category of biologically relevant information (i.e., binding modes, cofactors, and functional partners from STRING), we pre-trained two different multi-models for each of the pentad TFs: one with five biologically relevant TFs alone, and one “burying” the same five TFs among 45 TFs drawn from the original multi-model, each with a binding mode different from that of the target TF. For this set of experiments, the target TF was not included in the pre-training step, and we fixed the training dataset sizes of multi-models and individual models to ~ 40,000 and ~ 2000 regions, respectively. Furthermore, we used cofactors and functional partners from STRING that did or did not share the same binding mode as the target TF. We observed that pre-training with just five biologically relevant TFs led to only slightly better performance levels (Fig. 7).
The pentad TFs had a sufficiently large number of bound regions that we downsampled; hence, we wondered whether TFs with fewer bound regions would behave similarly. For each of the pentad TFs, we selected two additional TFs with the same binding mode and within the following ranges of bound regions: < 1000 positives or 1000–10,000 positives. For each of these 15 TFs (the five “pentad” TFs plus the 10 newly selected), we trained individual models using transfer learning from two different multi-models pre-trained with either five TFs with the same binding mode as the target TF, or five randomly selected TFs with binding modes different from that of the target TF. Each multi-model was trained both with and without the target TF included in the pre-training step. Unlike the above set of experiments, this time we did not perform downsampling. In general, pre-training with the target TF resulted in better transfer learning performance (Fig. 8); however, for TFs with large datasets, this effect was less obvious. In addition, pre-training with a biologically relevant multi-model improved transfer learning performance, even when the target TF was present in the multi-model, especially for TFs with < 10,000 positives. Similar results were obtained when the experiment was repeated using longer input sequences (Figs. S8 and S9), as well as when pre-training with five cofactors (Fig. S10). Finally, except for KLF9 and ZNF740, both zinc fingers, and SREBF2, which only had ~ 41.2 training and ~ 5.4 test sequences per data split (and hence should be considered an outlier), biologically relevant transfer learning performed similarly to or better than de novo PWMs (Figs. 8 and S8, S9 and S10).
Interpretation of the transfer learning mechanism
In an attempt to understand the mechanism of transfer learning, we converted the filters from the first convolutional layer of the original multi-model to PWMs and compared them, using Tomtom, to TF binding profiles from JASPAR (Methods; Fig. 9A). As expected, the majority of filters (55%) had significant similarities to known JASPAR profiles (Tomtom q value ≤ 0.05; Table S4). Focusing on HNF4A, which had not been used in the pre-training step, we found that four of the multi-model filters were refined in the fine-tuning step to resemble the motifs of HNF4A (Fig. 9B). We also observed that more filters (six) became similar to the motifs of HNF4A after transfer learning than after training from scratch (three; Table S4). Moreover, using transfer learning, the individual model of HNF4A was able to learn the two distinct binding modes of HNF4A represented by the JASPAR profiles MA0114.4 and MA1494.1. Similar observations were made for SPI1, which was also not present in the pre-training step (Fig. 9B; Table S4). Taken together, these findings suggest that weights from the pre-training step provided a better initialization for convolutional filters in the fine-tuning step, wherein they were refined to resemble the binding motif of the target TF. This would, in turn, explain the increased performance of transfer learning compared to training from scratch. To confirm that the refinement of convolutional filters in the fine-tuning step was responsible for the improvement in model performance, we applied an alternative fine-tuning strategy: we froze the convolutional layers of the pre-trained multi-model and trained only the fully connected layers (i.e., no refinement of convolutional filters was allowed). Doing so resulted in poorer model performance, particularly for TFs that were not present in the multi-model (Table S5), further supporting the importance of filter refinement in the fine-tuning step.
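A common recipe for converting a first-layer convolutional filter to a PWM is to align the subsequences that activate the filter most strongly and build a position frequency matrix from them. The sketch below follows that recipe; the activation threshold (`frac` of the maximum) is one conventional choice and, like the function name, is our own assumption rather than the paper's exact procedure (Methods).

```python
import numpy as np

def filter_to_pwm(filter_weights, seqs_onehot, frac=0.5):
    """Turn one convolutional filter into a PWM from its top-activating windows.

    filter_weights: (4, L) filter; seqs_onehot: (N, 4, seq_len) one-hot batch.
    Returns a (4, L) matrix of per-position base probabilities.
    """
    L = filter_weights.shape[1]
    windows, acts = [], []
    for seq in seqs_onehot:
        for i in range(seq.shape[1] - L + 1):
            win = seq[:, i:i + L]
            acts.append(float((filter_weights * win).sum()))  # conv activation
            windows.append(win)
    acts = np.array(acts)
    keep = acts >= frac * acts.max()          # strongest activating windows
    pfm = sum(w for w, k in zip(windows, keep) if k)
    return pfm / pfm.sum(axis=0, keepdims=True)
```

The resulting PWMs can then be compared against JASPAR profiles with a motif comparison tool such as Tomtom.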
Next, we applied DeepLIFT and TF-MoDISco in an attempt to understand the role of cofactors in transfer learning. Briefly, DeepLIFT quantifies the importance of each nucleotide in a given genomic sequence to the model prediction, while TF-MoDISco clusters “important” nucleotides from different genomic sequences into motifs. We present an analysis of transfer learning for JUND from two different multi-models pre-trained on either five cofactors with a different binding mode than JUND (CEBPB, MAFF, MAFG, NFIC, and SP1) or five TFs from five different families whose binding was not correlated with JUND (CTCF, EBF1, MXI1, NFYA, and TCF3). Data for JUND was not present in the pre-training step. The three motifs generated by TF-MoDISco for the cofactor model corresponded to the canonical JASPAR profiles of CEBPB and SP1, both of which were among the cofactors used to train the multi-model, as well as JUND (Fig. 9C). For the second model, trained on five unrelated TFs, TF-MoDISco identified five motifs, none of which corresponded to the canonical JUND motif. Taken together, these results suggest that the model used the presence of CEBPB and SP1 motifs to aid in the prediction of JUND binding.
Finally, to further confirm the central role of convolutional layers, particularly filters, in the transfer learning process, we focused on the 10 TFs from Fig. 8 with an intermediate (1000–10,000) or small (< 1000) number of bound regions. We trained individual models with transfer learning wherein we transferred the weights of either the first or all three convolutional layers (i.e., we did not transfer the weights from the fully connected layers). Doing so only slightly worsened performance compared to the original transfer learning strategy, particularly when we transferred all three convolutional layers (Fig. S11). We also explored the importance of convolutional filters by initializing individual models with a combination of non-redundant profiles from the JASPAR database and de novo PWMs (Methods). As previously reported, filter initialization with JASPAR profiles, and in our case de novo PWMs, improved model performance compared to training from scratch; however, performance was inferior to that of transfer learning (Fig. S11).
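Transferring only a subset of layers can be sketched in the same dict-of-arrays style as before: copy the layers whose names match the chosen prefixes and leave the rest at their random initialization. The layer names are illustrative, and freezing (as in the alternative fine-tuning strategy above) would additionally exclude the transferred layers from gradient updates, which is outside this sketch.

```python
import numpy as np

def transfer_subset(multi_weights, individual_weights, prefixes=("conv1",)):
    """Initialize an individual model by transferring only the layers whose
    names start with one of `prefixes` (e.g. the first, or all three,
    convolutional layers); other layers keep their random initialization."""
    return {
        name: (multi_weights[name].copy()
               if any(name.startswith(p) for p in prefixes) else w)
        for name, w in individual_weights.items()
    }
```

Passing `prefixes=("conv1",)` transfers only the first convolutional layer, while `prefixes=("conv1", "conv2", "conv3")` would transfer all three.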