Most TE expression candidates are fragmented and polymorphic TE loci

Following the system established by Lizamore [21] to activate TE transcription, V. vinifera (Pinot noir UCD5 clone) embryogenic callus cultures were subjected to either a mock treatment (hereinafter Vv_Mock; which included vigorous shaking that was expected to mount a wounding response in this tissue; see Methods) or mock treatment plus H. uvarum live cultures (hereinafter Vv_Yeast). Samples were collected after exposure at four time points, with an untreated embryogenic callus culture included as a common time zero (Vv_T = 0) for both Vv_Mock and Vv_Yeast treatments (see Methods and Fig. 1a for experimental schematic). Triplicated treatments of embryogenic callus were carried out, RNA-seq libraries prepared and sequence data generated as outlined in Methods. All TE loci potentially transcribed were identified by our analysis pipeline constituted by existing tools (see Methods and Fig. 1b). To provide data to validate our analysis pipeline we retrieved the RNA-seq data of Arabidopsis wild type (Col) [40] and ddm1 mutant [17] and subjected this to the same analysis pipeline. The pipeline produced three sets of TE loci that passed our arbitrary expression threshold (see Methods and Fig. 1b) and were united as a pool of expression candidates. ‘Trackable’ expression candidates were considered TE loci in which mapped reads possessed a unique polymorphism such as a single nucleotide variant (SNV) or a small insertion or deletion (INDEL). ‘Un-trackable’ expression candidates represent a collection of TE loci for which the sequencing reads obtained show no sequence divergence and that all TE loci in this set have to be considered as potentially transcribed and likely represent TEs that have recently mobilised [47].

Fig. 1
figure 1

Short-read RNA-seq data sources and the analysis workflow to identify expressed TE loci. a A schematic of the sampling regime for the generation of grapevine short-read RNA-seq data. The grapevine RNA-seq data was derived from the V. vinifera embryogenic callus subjected to time-series stress treatment (each with three biological replicates). Published RNA-seq data of Arabidopsis thaliana seedlings of wild-type [40] and ddm1 [17] were also used in the analysis pipeline. b Computational workflow to identify TE expression candidates. The first and second sub-pipelines apply HISAT2 [42] for the alignment of sequencing reads against the reference genome and then use htseq-count [43] and Bedtools toolset [44], respectively, to quantify reads overlapping with TEs. While htseq-count only adopts unique-mapping reads, Bedtools incorporates both unique- and multi-mapping reads. The third sub-pipeline uses BWA [45] to align reads against TE sequences, after which the mates of TE-mapped reads would be fed into TEFingerprint [46] for mapping against the reference genome to capture danglers. The three sets of TEs passing through the filtering steps are joined together as a pool of expression candidates (see Methods, Fig. S1 and additional file 1 for detailed descriptions of definitions and thresholds)

In V. vinifera embryogenic callus, our analysis workflow identified 3698 (1.6%), 5524 (2.5%), and 5531 (2.5%) of a total of 223,411 annotated TE loci as expression candidates in Vv_T = 0, Vv_Mock, and Vv_Yeast samples, respectively (Fig. 2a). Of the total number of loci containing annotated TE sequence in the Vitis genome, 75–87% were effectively excluded as having no evidence of transcription and a further 11–22% of annotated TE loci fell below our expression threshold due to insufficient support in terms of numbers of mapped reads (Additional file 2: Fig. S2). In each of the tested conditions (Vv_T = 0, Vv_Mock and Vv_Yeast), different sets of expression candidates were identified (Fig. 2b). The increase in number of expression candidates uniquely found in Vv_Mock (1139 TE loci; Fig. 2b) and Vv_Yeast (1913 TE loci; Fig. 2b) compared to our time 0 control (Vv_T = 0; 329 TE loci; Fig. 2b) indicates heightened transcriptional activation of TE loci due to wound and biotic stress treatments.

Fig. 2
figure 2

Identification of TE expression candidates in V. vinifera embryogenic callus and A. thaliana seedlings. a, c Bar charts demonstrating the abundance of TE expression candidate loci identified in different experimental conditions or genotypes of the V. vinifera (a) and A. thaliana (c) systems, with pie charts showing the proportion of trackable/un-trackable loci of TE expression candidates. b, d Venn diagrams showing different sets of TE expression candidates in the V. vinifera (b) and A. thaliana (d) systems. e, f Histograms demonstrating the length integrity of all annotated TE loci and TE expression candidates, and pie charts showing the proportion of full-length TE loci among all annotated TE loci and TE expression candidates, in the V. vinifera (e) and A. thaliana (f) systems. Trackable elements refer to those loci that has sufficient sequence divergence to allow reads to be unambiguously mapped to a specific genomic locus. Un-trackable elements refer to loci that have reads that multimap to many loci due to lack of any sequence divergence at these loci, the assumption being these represent recent insertions that have not yet accumulated sequence variation

Similarly, in the Arabidopsis RNA-seq datasets, the wild-type dataset revealed over 95% of the total 31,189 annotated Arabidopsis TE loci were excluded from the pool of expression candidates, leaving 1410 TE loci identified as expression candidates (Fig. 2c). However, in the mutant ddm1 background, the number of expression candidates increased to 4156 TE loci (Fig. 2c), of which 3182 were transcribed TE loci compared to wild type (Fig. 2d). This marked difference in patterns of transcribed loci in the ddm1 mutant likely reflected the deficiency in the epigenetic regulatory system in this mutant background leading to more transcriptionally permissive epigenetic landscape and activation of TEs [18].

Depending on the experimental condition or genotype in the V. vinifera and A. thaliana system, 69.2–77.4% and 87.1–94.9% of the expression candidates, respectively, contained unique-mapping reads, and were thus identified as trackable expression candidates in these two plant systems (Fig. 2a, c). Of the identified expression candidates, most where fragmented, with only 9–10% or 11–17% of the expression candidates in V. vinifera and A. thaliana system retaining > 90% integrity compared to the length of canonical elements (Fig. 2e, f). We termed TE loci with > 90% integrity in length ‘full-length’ TE loci.

The high proportion of trackable and fragmented TE loci in the pools of expression candidates in both systems indicate that the bulk of transcription of TEs observed in plants is likely derived from TEs which have both accumulated sequence polymorphisms and are fragmented, indicating that they likely derived from much older TE insertion events.

The characteristics of TE expression candidates varies by TE family

The high proportion of fragmented elements triggered our interest in determining whether the degree of mutation accumulation within expressed TE families varied according to each family’s mobilisation history. We hypothesised that transcribed TE loci of more recently mobile families are expected to have less sequence divergence compared to those transcribed TE loci belonging to families with more ancient mobilisation events [48, 49]. Thus while there may not be a correlation between older versus newer insertions and transcription levels of TE-loci, younger elements may have a higher proportion of full-length and un-trackable loci.

To test this assumption, we initially considered the grape control data (Vv_T = 0) and grouped grapevine expression candidates hierarchically by each family, degree of sequence divergence (i.e. trackable or un-trackable), and length (i.e. fragmented or full-length). Of the TE loci showing evidence of transcription 2558 (69%) of the total 3698 expression candidates were trackable, whereas the remaining 1140 expression candidates fell into the ‘un-trackable’ category (Fig. 3a). Among the 232 TE families, belonging to nine super-families, presented on the y-axis of Fig. 3a, 174 of these contained expression candidates in the control treatment, Vv_T = 0. Of the 174 families, 102 contained fewer than ten expression candidate loci, and a further 32 families contained 10 to 23 expression candidate loci (Additional file 3: Table S3). It was noticeable that 6 of the 174 TE families had more than 100 expression candidate loci; these families were Copia-23 (211 expression candidates), Gypsy-12 (175 expression candidates), VLINE1 (245 expression candidates), VLINE4 (211 expression candidates), VLINE5 (117 expression candidates) and VLINE6 (139 expression candidates). However, of these, the expression candidate loci in 5 of the families were mostly fragmented and trackable TE sequences (Additional file 3: Table S3), indicating that only a small number of families contribute the bulk of putatively transcribed loci, while the bulk of the expressed TE families have few transcriptionally active TE loci. In fact, half of the expressed TE families (87 of the 174 expressed families) possessed only trackable expression candidates, and a further 74 TE families had only 1 to 20 un-trackable expression candidates, most of which are fragmented.

Fig. 3
figure 3

Transcriptionally active TE families in grapevine embryogenic callus. a–c Histograms demonstrating all expression candidates of Vv_T = 0 (a), Vv_Mock (b), and Vv_Yeast (c) categorised by families, distinctiveness and integrity. Each bar represents a TE family containing expression candidates. The expression candidates were then further grouped into un-trackable (green) and trackable (blue) candidates, those of which full-length were filled with either dark green or dark blue. TE families containing at least two full-length expression candidates of either group were indicated. Note that Harbinger families missing open reading frame (ORF) encoding transposase and MULE families lack of terminal inverted repeats (TIRs) in their canonical sequences are indicated

These observations demonstrates that the vast majority of the transcriptionally active loci across TE families are fragmented and could be identified by polymorphisms (e.g. SNVs and INDELs). In contrast, Copia-23 and Copia-3 families were over-represented with expression candidates that were both un-trackable and potentially full-length. These findings were concordant with the observations in Vv_mock (Fig. 3b) and Vv_Yeast (Fig. 3c). In the treatment series, we observed larger numbers of TE families containing transcribed TE loci, however, as observed in the control Vv_T = 0, most of the expressed families possess a majority of loci that are fragmented. Copia-3 and Copia-23 remained the two families possessing the largest numbers of full-length un-trackable expression candidates (Additional file 3: Table S3).

A closer interrogation of the sequences of the canonical Copia-3 element and the 90 annotated full-length Copia-3 elements shows a condensed phylogenetic cluster mostly comprised of structurally intact (and thus potentially autonomous) Copia-3 indicated by the presence of LTRs flanking an intact INT domain (Additional file 2: Fig. S3; also see Methods). This cluster included 26 sequences, 19 of which were structurally intact un-trackable expression candidates with over 90% read coverage of the annotated INT domain, and a further 4 of which were intact trackable Copia-3 with a nearly complete transcription of INT (see Methods).

A similar analysis of the Copia-23 family, comparing the canonical sequence of Copia-23 and the 220 full-length Copia-23 sequences, revealed that the 11 un-trackable and four trackable structurally intact candidates with nearly complete transcription across INT were scattered in 5 broom-like unresolved clusters. These clusters were densely packed with other un-trackable candidates that had either lost intact LTRs or lacked full INT coverage (Additional file 2: Fig. S4). These compact clades with short branches were distinguished from the sequences of unexpressed full-length Copia-23.

To test the contribution of the full-length Copia-3 loci to the pool of Copia-3-related transcripts, reads mapping to all Copia-3 expression candidates were categorised into four groups by whether they mapped to; full-length trackable, full-length un-trackable, fragmented trackable and fragmented un-trackable candidates. This analysis revealed that each category contained reads shared with one or more categories, irrespective of treatments and treatment time-point (Additional file 2: Fig. S5). Nonetheless, each category obtained a unique subset of reads that only mapped to one of the four groups (pairwise combination of full-length/fragmented and trackable/un-trackable) of expression candidates, meaning no one group was able to represent the whole collection of Copia-3 transcripts. The same analysis was applied to reads mapping to all Copia-23 expression candidates. This analysis also demonstrated reads shared across different categories and those reads unique to a single category (Additional file 2: Fig. S6).

Copia-3 and Copia-23 belong to LTR-retrotransposon super family (LTR-TEs). This type of TE is characterised by the presence of identical long terminal repeat (LTR) at either end of the element. The pair of LTRs gradually degrades over time, accumulating increasing numbers of mutations across time; therefore, it has been believed that the more diverse a pair of LTR sequences are, the more time that has passed since insertion [49, 50]. To test whether Copia-3 and Copia-23 were active more recently than other LTR-TEs, we analysed the divergence of each pair of LTRs, then estimating the insertion time of each structurally intact TE loci. The insertion dates of the 87 and 177 Copia-3 and Copia-23 intact elements, respectively, were estimated to peak approximately 0.02 and 0.017 million years ago (MYA), respectively (Fig. 4a, b; Additional file 4: Table S3). Peak insertion times of the other 39 LTR-TE families with at least ten intact copies were analysed in the same way (Fig. 4b; Additional file 4: Table S3). Most LTR-TE families experienced bursts no longer than 4.5 million years ago (MYA). Note that Copia-3 and Copia-23 were the most recently active LTR-TE families (Additional file 4: Table S3). Comparison of the peak insertion time of Copia-3, Copia-23 and the other 5 Copia families, which obtained trackable full-length candidates across all treatments but lacked un-trackable full-length candidates, showed that Copia-3 and Copia-23 experienced their bursts of activity more recently than these five other families (Fig. 4c). In addition, the average insertion time of intact un-trackable candidates is significantly more recent than the non-candidates of Copia-3 and Copia-23 (Fig. 4d, e).

Fig. 4
figure 4

Insertion dates of LTR-TE families containing complete copies. a Distribution of the insertion dates for 177 complete copies of the Copia-23 family. The peak of amplification (0.17 MYA) is indicated as the dashed line. b Distribution of the insertion dates using 41 LTR-TE families with at least 10 complete copies. c Comparison of the transposition burst time among LTR-TE families containing full-length expression candidates. The families were ordered by the peak of insertion time (grey diamonds). The asterisks denote a significant level of t-test for the mean of insertion time as indicated. d, e Comparison of the insertion time of all annotated structurally intact Copia-3 (d) and Copia-23 (e) loci grouped into three categories: non-candidates, un-trackable expression candidates, and trackable expression candidates

Together, these results suggest that LTR-TE families that show evidence of the most recent mobilisation burst in evolutionary time may not necessarily be TE families possessing the largest number of expression candidate loci. Instead, recently mobile families that contain currently active element loci are likely typified by collections of loci that are structurally intact and where RNA-seq reads multi-map across these conserved sequences.

TE expression candidates are not randomly distributed in the genome

A number of recent reports highlight that TE mobilisation and reintegration into host genomes often show distinct insertion bias [9, 26,27,28,29,30]. However, little is known regarding the genome-wide distribution of transcriptionally active TE loci. In order to investigate whether there is location bias between all annotated TEs and TEs representing expression candidates, we compartmented the annotated reference genome into genic and intergenic regions. The genic region comprised gene units, which were made of exons and introns that were bounded by the transcription start and stop sites of genes, and flanking regions of genes which encompassed 2 kb upstream (N-flanks) and 2 kb downstream (C-flanks) of corresponding transcription start and stop sites (Fig. 5a). All annotated TE loci with genomic coordinates that were found to intersect with specific gene compartments were categorised and hierarchically ordered with respect to being present in genic/intergenic regions, within the genic region (e.g. exon, intron, flanks), and whether the TE loci contained a fragmented or full-length element. In grapevine, over half of all annotated TE loci were found to be located in intergenic regions (126,976 TEs, 56.83%), while 96,435 (43.16%) TEs co-localised with genes (Fig. 5b). Approximately half of the genic TEs were found to be present in flanking regions, without particular preference for either the 5′-flank or 3′-flank. TEs that mapped to gene units were mostly found in the introns (Fig. 5b; Additional file 5: Table S4).

Fig. 5
figure 5

Location bias of expressed TE loci. a Illustration of TE insertions in the different genomic region (genic or intergenic) or location (exon, intron, or flanks) relative to genes. b-e Hierarchical classification of all annotated V. vinifera TE loci (b) and all TE expression candidates of Vv_T = 0 (c), Vv_Mock (d), and Vv_Yeast (e), by location, integrity, and distinctiveness. TE loci were categorised in the order of region (centre), the transcriptional activity of co-localised genes (2nd layer), location (3rd layer), integrity (4th layer), and the presence/absence of unique-mapping reads (outer-most layer). f-h Hundred percent stacked bar charts showing the proportion of all annotated TE loci of V. vinifera and TE expression candidates (Vv_T = 0, Vv_Mock, and Vv_Yeast) distributed in genic/intergenic regions (f), in gene unit (including exon and intron)/flanks (g), and with expressed/non-expressed genes (h). i-k Hierarchical classification of all annotated A. thaliana TE loci (i) and all TE expression candidates of At_WT (j) and At_ddm1 (k), by location, integrity, and distinctiveness. l-n Hundred percent stacked bar charts showing the proportion of all annotated A. thaliana TE loci and TE expression candidates (At_WT and At_ddm1) distributed in genic/intergenic regions (l), in gene unit/flanks (m), and with expressed/non-expressed genes (n). Colour codes for different categories of location, integrity, and distinctiveness are as indicated at the right panel. Chi-square tests with p-value < 0.01 were as indicated. Expr., expressed; Non-expr., non-expressed; N-flank, N-terminal 2 kb-flanking region of gene; C-flank, C-terminal 2 kb-flanking region of gene; Exp., expected proportion; Obs., observed proportion

The expression candidates we defined in grapevine were then classified in the same way, with additional categories added, including the transcriptional activity of genes that were co-localised with TE candidates (i.e. TEs associated with expressed or non-expressed genes) and the presence or absence of unique-mapping reads (trackable or un-trackable TE candidates). Untreated embryogenic callus (Vv_T = 0), Vv_mock, and Vv_Yeast showed that 71.47, 75.69, and 74.62% of the expression candidates, respectively, were found within the genic regions (Fig. 5c-e, Additional file 5: Table S4). The goodness of fit X-square test showed that the increased proportion of expressed TE candidates was significant across all treatments (Fig. 5f). Delving deeper into the insertion context, about two-thirds of these genic TE expression candidates overlapped with introns with a significant bias toward introns of expressed genes (Fig. 5c-e, g, h; Additional file 5: Table S4).

In order to test whether our observations were isolated to grapevine or represent a trend across plants, we examined the location bias of expression candidates within the A. thaliana dataset. We found 58.37% of A. thaliana annotated TE loci were located in the genic region, while TEs in gene units and flanking regions comprised 16.00 and 42.37% of the total pool, respectively (Fig. 5i; Additional file 5: Table S4). Remarkably, the majority of the TEs annotated within the ‘gene unit’ overlapped with exons (Fig. 5i), contributing to 12.42% of all annotated loci. This high proportion of exonic TE loci is likely due to issues with the TE annotation established by Jin et al. [51], where considerable numbers of long TEs are found to overlap with multiple exons and introns, which were annotated based on the gene annotation file deposited in Ensembl Plants (see Methods), and therefore were preferentially categorised as exonic TEs.

In wild-type Arabidopsis (At_WT), TE expression candidates locating in genic regions accounted for 91.28% of the 1410 expression candidates, of which 1093 loci co-localised with expressed genes (Fig. 5j; Additional file 5: Table S4). In addition, the proportion of the gene-unit loci subset had changed significantly from 16% of all annotated TE loci (Fig. 5i) to 74.61% of the expression candidates in At_WT (Fig. 5j; Additional file 5: Table S4). The majority of these expression candidates in the gene-unit were associated with expressed genes (65.67% to the total expression candidates; Additional file 5: Table S4). In comparison, however, while intergenic TE expression candidates comprised only 8.72% of the expression candidate pool in At_WT, the proportion of TE-expression candidates in the At_ddm1 mutant was significantly higher at 54.79% of expressed TE loci (Fig. 5k, m, Additional file 5: Table S4). In this mutant background the number of expression candidate loci in the gene-unit subset dropped from 74.61% in At_WT to only 24.95% of the expression candidates in At_ddm1 mutant. These results were found to be statistically significant using the goodness of fit X-square test (Fig. 5l-n).

The observed preference for expressed TE loci to be located in genic regions may be explained by either a general increase in the proportion of genic located TE expression candidate from most TE families or simply a reflection of a genic-enriched annotation of a few TE families that largely contribute to the pool of expression candidates. To test these two assumptions, we further looked into the location distribution of all annotated TE loci and TE expression candidates of individual TE families. The genic and intergenic proportions of annotated TE loci and expression candidates were plotted for families belonging to Copia, Gypsy, LINE, hAT and MULE, the five super-families that contributed to the majority of expression candidates (Additional file 2: Fig. S7-S11). This analysis first looked at the genic and intergenic proportion of all annotated TE loci grouped by families. We then categorised TE expression candidates, in the same way, to examine whether the genic proportion of TE expression candidates was higher than that of annotated TE loci. Considering all annotated TE loci, we observed that for Copia families about two thirds are underrepresented (< 50%) in genic regions (Additional file 2: Fig. S7A). When considering Copia TE expression candidates in Vv_T = 0, 58 of the 71 families with expression candidates demonstrated a higher proportions of expression candidates in genic regions (Additional file 2: Fig. S7B) compared to annotated TE loci (Additional file 2: Fig. S7A), supporting that Copia TE expression candidates tend to be located in the genic regions in the Vv_T = 0 samples. This trend for Copia families was also observed in Vv_Mock and Vv_Yeast (Additional file 2: Fig. S7C-D). Analysis of Gypsy, LINE, hAT and MULE TE families and expression candidates revealed similar trends as observed for Copia (Additional file 2: Fig. S8–11).. The same analysis for the location preference within the genic region subset (exon, intron, and flanking regions) also reveals that the tendency for colocation of TE expression candidates in the intronic fraction of expressed genes is widely presented in most TE families of V. vinifera (Additional file 2: Fig. S12-S16).

The distribution bias of TE expression candidates suggested a degree of tolerance for the transcription of TE loci within, or proximal to, expressed genes. We were interested to see if the observed distribution was biased toward non-autonomous TE loci, which are assumed to pose less of a mutational load to host cells than the structurally intact TE loci. We therefore investigated the location distribution of the structurally intact TE loci of Copia-3 and Copia-23, the two families estimated to have mobilised most recently among all LTR-TE families in V. vinifera. Among the total annotated structurally intact TE loci of Copia-3 and Copia-23, 55 of 87 (63%) and 117 of 177 (66%) loci were found within introns (Additional file 2: Fig. S17A), respectively. Seventy-four and 138 of these structurally intact Copia-3 and Copia-23 TE loci were identified as TE expression candidates in at least one experimental condition. Astonishingly, 59 (80%) of the 74 structurally intact Copia-3 expression candidate loci and 104 (75%) of the 138 structurally intact Copia-23 expression candidates were found to be co-localised with expressed genes, primarily within introns of these genes (Additional file 2: Fig. S17B). Taken together, this clearly suggests a strong distribution bias of structurally intact and therefore potentially autonomous TE expression candidates to within expressed genes in LTR-TE families that are most recently active in this genome.

The transcriptional dynamics of TEs is closely related to that of co-localised genes

The over-representation of TE expression candidates in proximity to expressed genes in the V. vinifera and wild-type A. thaliana indicates a tolerance for the transcriptional activity of intragenic TEs within these genes. This suggested to us that these TEs behave in a ‘hitchhiker-like’ manner where TEs co-located with expressed genes are able to take advantage of the permissive epigenetic landscape associated with the expressed genes for their own expression [13, 52].

As this ‘hitchhiker-like’ behaviour appears to be widely spread across plant lineages, we hypothesised that the co-localisation of TE and gene loci is likely due to genes that are stress-regulated and that co-located TEs would share a similar expression pattern to their ‘host-gene’. To test this, we first looked at the differential expression of genes in mock and yeast-treated samples and overlaid the presence and location of intron located and expressed TEs. While only 6 and 7% of DEGs contained intronic expressed TEs in Vv_Mock and Vv_Yeast, respectively, 14 and 13% of the non-DEGs co-located with expressed intronic TEs in Vv_Mock and Vv_Yeast (Additional File 2: Fig. S19). This result suggests that intron location of expressed TE loci is not restricted to stress-responsive genes. To address the possibility of the a shared expression pattern between co-located TEs and genes, differential transcriptional changes of TE expression candidates in the grapevine system were analysed using DESeq2 [53]. Due to the repetitive characteristics of TEs and that DESeq2 can only work with uniquely mapped read data, only the subset of expression candidates containing sequence divergence (i.e. trackable expression candidates) were suitable for this analysis. Differential expression analysis was performed on 5869 trackable expression candidates found in at least one of the three experimental conditions of the V. vinifera system (Vv_T = 0, Vv_mock, and Vv_Yeast). Hierarchical clustering of differentially expressed TEs (DETEs) revealed various expression patterns across the time course in response to different treatments (Fig. 6a, b). The mock treatment (Vv_Mock) showed that over 50% of the DETEs were transcriptionally activated in the first 3 h (3 h) of post-treatment and then returned to an expression level similar to that observed in Vv_T = 0 (Fig. 6a, c), illustrating an ‘up-back’ expression pattern. Interestingly, the majority (206 of 291) DETEs (70.79%) in the H. uvarum exposure treatment (Vv_Yeast) responded in an up-regulated manner (Fig. 6b, e).

Fig. 6
figure 6

Expression dynamics of differentially-expressed TEs (DETEs) correlate with that of the co-localised differentially-expressed genes (DEGs). a, b Heatmaps showing hierarchical clustering of DETEs in Vv_Mock vs Vv_T = 0 (a), and DETEs in Vv_Yeast vs Vv_Mock (b), based on logarithmic-transformed fold-changes. The three replicates of each time point are individually presented. The time-series expression pattern of each cluster is depicted as line graph with the number of TE loci within each cluster indicated. c, e Pie charts to summarise the expression pattern of DETEs in Vv_Mock vs Vv_T = 0 (c) and DETEs in Vv_Yeast vs Vv_Mock (e). d, f Hierarchical classification of DETEs in Vv_Mock vs Vv_T = 0 (d) and DETEs in Vv_Yeast vs Vv_Mock (f) by the presence/absence of co-localised genes (genic/intergenic), the transcriptional activity of co-localised genes (with expr. Gene/with non-expr. gene), and the differential test of expressed genes (with DEG/with non-DEG). g, h Heatmaps showing hierarchical clustering of co-localised DETEs and DEGs in Vv_Mock vs Vv_T = 0 (g), and DETEs in Vv_Yeast vs Vv_Mock (h), based on logarithmic-transformed fold-changes. i, j Donut graphs to summarise the hierarchical clustering of co-localised DETEs and DEGs in Vv_Mock vs Vv_T = 0 (i) and DETEs in Vv_Yeast vs Vv_Mock (j) by expression pattern

Among the DEGs of Vv_Mock samples, 40 DEGs were co-localised with 45 of the 78 DETEs (Fig. 6d). In Vv_Yeast samples, 106 DEGs were co-localised with 124 of the 291 DETEs in this treatment (Fig. 6f). In an attempt to investigate the relationship of expression pattern between DEGs and the co-localised DETEs, these corresponding DETEs and DEGs were used for hierarchical clustering, in which DETEs and DEGs of similar expression pattern were grouped into the same clusters (Fig. 6g, h). Note that a small number of DETEs, especially DETEs within 2 kb flanking regions of genes, might co-localise with multiple DEGs and vice versa. Instead of arbitrarily excluding DETEs or DEGs that fell into this scenario, the comparison of the expression pattern of co-localised DETEs and DEGs was conducted on each DETE-DEG pair. The expression pattern of the 45 DETEs co-localized with 40 DEGs in Vv_Mock was then compared with that of paired DEGs, resulting in 45 pairs of DETE-DEG comparisons summarised in Fig. 6i, where 42 (93.33%) pairs of co-localised DETE-DEG showed concordant clustering between DETEs and corresponding DEGs. The same approach was applied on co-localised DETEs and DEGs of Vv_Yeast, in which 113 (89.68%) of the total 126 co-localised DETE-DEG pairs showed the same expression pattern between paired DETEs and DEGs (Fig. 6j). These findings indicate that the dynamic expression pattern of DETEs co-localized with DEGs tended to resemble that of the paired DEGs.

Non-de novo TE transcription of intronic TE loci is associated with intron retention and exposure of premature termination codons

Aberrant alternative splicing, such as exon skipping and intron retention, has been observed at gene loci containing epigenetically unmasked intronic TEs [31, 54]. While TE expression candidates in the V. vinifera and A. thaliana systems show a strong location bias towards expressed genes, how broad the transcriptionally active intronic TE loci associated with aberrant alternative splicing remains unanswered. To interrogate this issue on a genome-wide scale and investigate the transcription of ‘competent’ TE-derived transcripts as well as potential fusion products of TEs and co-localised genes, we took advantage of long-read sequencing to determine transcript sequence integrity. Oxford Nanopore (ONT) cDNA sequencing was utilised for V. vinifera (P. noir UCD5 clone) embryogenic callus subjected to 12 h of continuous incubation with live H. uvarum culture (hereinafter Vv_Yeast12h) and the corresponding mock treatment (hereinafter Vv_Mock12h). This stress treatment experiment was independent of experiments used to derive RNA for the short-read RNA-seq analyses, We chose only the 12 h time point as we previously showed that with yeast treatment of embryogenic cell cultures, the majority of TE families exhibit a continual increase in transcript accumulation (Fig. 6b), and we thus hypothesised that this time point was likely to seed numbers of autonomous TE-RNAs.

The alignment depth of the ONT cDNA sequencing reads to the V. vinifera reference genome was comparable to that of the Illumina libraries (Additional file 6: Table S5). The quantification of gene and TE family expression level in the ONT dataset shows a medium to high level of correlation with that of the Illumina dataset (Spearman’s correlation coefficient ρ > 0.80 for genes; ρ > 0.58 for TE families; Additional file 2: Fig. S20).

To understand the prevalence of TE transcription initiating from outside the annotated TE boundary (i.e. a non-de novo transcription), we examined the transcription start of ONT reads overlapping with TE loci. The results show that the frequency of non-de novo TE transcription varied by TE superfamily (Fig. 7). Most expressed Copia, Gypsy and LINE loci revealed transcription initiation from within the annotation boundary of these TEs. The type II DNA transposon loci investigated, exhibited a mixture of de novo and non-de novo transcription or transcription that was purely initiated outside of the annotation boundaries (Fig. 7a, b).

Fig. 7
figure 7

The frequency of non-de novo TE transcription differs by TE superfamily. a, b Expressed TE loci in (a) Vv_Mock_12h and (b) Vv_Yeast_12h were grouped by the mapped start site of ONT reads in relation to the annotated TE boundaries. c, d TE loci without de novo transcription in (c) Vv_Mock_12h and (d) Vv_Yeast_12h were categorised by the mapped start site of ONT reads in relation to the closest gene. “All_within”, all mapped ONT reads initiated within TE loci (i.e. de novo transcription); “All_upstream”, all mapped ONT reads initiated from the upstream of TE loci (i.e. non-de novo transcription); Mix, a mixture of mapped ONT reads representing de novo and non-de novo transcription

For TE loci without de novo transcription, the transcription start sites were further investigated in relation to the closest gene, and then these TE loci were grouped by whether the transcript initiated from exonic or intronic sites or did not initiate from an annotated gene (Fig. 7c, d). While the majority of non-de novo transcription of Copia and Gypsy was not initiated from annotated genes, the majority of LINE, Harbinger, hAT and MULE non-de novo transcripts were initiated from within introns (Fig. 7c, d). This result suggests that LINE and TIR-TEs are more likely to contribute to aberrant alternative splicing or gene-TE fusion transcripts.

The analysis of the ONT derived transcript data suggested a range of aberrant transcript splicing. To interrogate these data further we utilised the FLAIR pipeline [55] to categorise alternative splicing events into four categories: alternative 3′ splicing (Alt3), alternative 5′ splicing (Alt5), intron retention (IR) and exon skipping (ES). Gene-related alternative splicing features overlapping with TEs were also collected, following by estimation of the productivity (as per the definition in FLAIR pipeline, this denotes the ability of a transcript to produce protein), of gene transcripts having these alternative splicing features.

Among the total 21,081 alternative splicing features identified by FLAIR, 19,526 (92.6%) of these are related to annotated genes. Of these features, over 90% were IR (8806 alternative splicing features) and ES (9378 alternative splicing features; Fig. 8a). Note that an isoform may contain multiple numbers and various types of alternative splicing features. Of the 19,526 alternative splicing features relating to annotated genes, only 524 (2.7%) of these overlapped with TEs (Fig. 8a). As expected, almost all TEs overlapping with Alt3, Alt5 and IR features are located within introns, while 22 of 40 ES-associated TEs overlapped with annotated exons.

Fig. 8
figure 8

Intronic TE sequences retained in gene isoforms correlates with the presence of premature termination codon. a Illustration and summarisation of four types of alternative splicing. An alternative splicing feature could appear in multiple isoforms. Notably, there are more genes than the number of associated ES features, suggesting that, for some ES events, each may involve more than one gene. b A 100% bar chart to demonstrate the productivity of gene isoforms in the four types of alternative splicing, each further grouped as TE-present (+ TE) or TE-absent (− TE). AS, alternative splicing; Alt3, alternative 3′ splicing; Alt5, alternative 5′ splicing; IR, intron retention; ES, exon skipping; PRO, productive; PTC, premature termination codon; NGO, having no start codon; NST, having a start codon but no stop codon

The FLAIR pipeline was able to group splice variant transcripts into four types: productive (PRO), having premature termination codon (PTC, i.e. unproductive), no start codon (NGO), and having start codon but no stop codon (NST). To understand whether the presence of these TEs was associated with the productivity of the gene transcripts, the productivity of isoforms containing these gene-and-TE-related alternative splicing was estimated. The analysis showed that 50 to 68% of the isoforms having Alt3, Alt5, or ES remained productive, no matter whether the alternative splicing features overlapped with TEs (Fig. 8b). However, 80.6% of isoforms having TE-related IR were PTC, while the PTC proportion in isoforms having IR events non-overlapping with TEs was less than 45%. Looking into the estimated translation stop site of these isoforms containing TE-related IR feature, 196 of the 261 PTC isoforms exhibit premature stop codon exactly within the TE-overlapping IR feature (Additional file 7: Table S6). From the perspective of the isoform orientation, nine of the translational premature termination sites appear within TEs, two are after TEs, and the rest 186 isoforms show premature termination sites before the presence of TEs. The distance between TEs and the premature termination sites placed prior to TEs ranged from 2 bp to over 4 kb, with the first quartile, median and third quartile at 147 bp, 311 bp, and 693 bp, respectively (Additional file 7: Table S6).

Interestingly, different TE super-families were preferentially found among the four types of alternative splicing features. The retrotransposon LINE superfamily was over-represented in Alt3 and Alt5 alternative splicing events (Additional file 2: Fig. S21A, B), and Harbinger, a type II DNA transposon, was predominantly seen in IR features (Additional file 2: Fig. S21C). For ES features, MULE DNA transposon was the most predominant superfamily among all TE super-families (Additional file 2: Fig. S21D). Most of the TEs involved in alternative splicing were found to be fragmented. We observed only 10, 7, 34 and 1 full-length TE loci associated with Alt3, Alt5, IR, and ES features, respectively (Additional file 7: Table S6).

Identification of potential loci contributing ‘full-length’ TE transcription allowing for autonomous mobilisation

Transcriptional activation of TEs in cell exposed to stress conditions has been widely studied in plants and primarily reports transcriptional activation at the family level [12, 56]. The proportion of these TE transcripts derived from structurally intact TE loci and thus capable of seeding TE mobilisation in the genome remains unclear. We have shown that within our V. vinifera system the detailed characterisation of transcriptional activation of TE loci under stressed conditions using short read RNA-seq data. The long-read capabilities of Nanopore sequencing has proved to be a useful tool in identification of TE transcripts engaged in the mobilisation of retro-elements [32]. We sought to use ONT RNA-seq to identify which, if any, of the annotated and transcribed TE loci identified earlier may possess full-length TE transcripts necessary for autonomous mobilisation. We initially set out to identify structurally intact TE loci with read alignments that met the criteria for transcription that would be required for autonomous mobilisation. These included intact LTR-TEs with > 90% breadth of coverage across INT domain that would suggest the production of transcripts capable of producing the gag and pol gene products required for mobilisation (Additional file 2: Fig. S22 A, B), structurally intact LINEs with > 0.9 breadth of coverage across whole elements (Additional file 2: Fig. S23 A, B), as well as intact TIR-transposon (TIR-TE) with > 90% of the transposase ORF covered by ONT reads (Additional file 2: Fig. S24 A, B). In these data we identified 20 and 19 LTR-TE loci in Vv_Mock12h and Vv_Yeast12h libraries, respectively (Additional file 2: Fig. S22 C). These included Copia-3, Copia-23, and Gypsy-V1. For LINE retrotransposons, only a single VLINE7 locus and two VLINE 8 loci were selected (Additional file 2: Fig. S23 C). For TIR-TE, three hAT-7 loci in Vv_Mock12h library revealed > 0.9 breadth of coverage across the transposase ORF (Additional file 2: Fig. S24 C).

We were particularly interested in identification of contiguous cDNA-seq reads that mapped across the appropriate length of identified loci. The ONT read should meet two criteria to prove that a full-length TE transcript was present. Firstly, depending on the expected de novo transcription characteristics of mapped TEs [57], the long-read should be at least as long as the INT domain for LTR-TEs [3], the transposase ORF for DNA transposons, or the full feature of the TE locus for LINE [57]. Secondly, this read should have its TE-mapped bases almost as much as its read length. Taking Copia-23 as an example, the ONT reads mapping to the 19 structurally autonomous Copia-23 loci were all shorter than 3000 bp (x-axis, Additional file 2: Fig. S25 A), whereas the size of the canonical Copia-23 INT domain is 4084 bp. Only a single read longer than 3 kb was identified but this exhibited a very poor mapping quality to the element (y-axis, Additional file 2: Fig. S25 A). In addition, as expected, given the conservation in sequence of these loci, the majority of these ONT reads were multi-mapping (red dots, Additional file 2: Fig. S25 A). This analysis highlighted an inconsistency between lengths of ONT reads and mapping of these reads to TEs. To determine the factors underlying this inconsistency, the alignment of the start and end sites of ONT reads mapped to Copia-23 loci were surveyed. As illustrated in Additional file 2: Fig. S25 B and C, the alignment of the 5′ and 3′ ends of reads could be grouped into three categories, internal, external, and clipped. Our analysis revealed that most of the ONT reads represented transcription initiated within the Copia-23 loci (Additional file 2: Fig. S25 B). However, the tail of the reads were mostly clipped due to the sequence discrepancy between ONT reads and TE loci (Additional file 2: Fig. S25 C). Only a few reads extended through the annotated boundary of TE loci. Overall, there is no convincing evidence of transcription from potentially autonomous Copia-23 loci annotated in this genome.

The situation described for the Copia-23 loci was also observed in the TE loci of Copia-3, Gypsy-V1, VLINE7, VLINE8, and hAT-7 (Fig. 9; Additional file 2: Fig. S25). Only a single ONT read mapping to an autonomous Gypsy-V1 locus and eight reads of hAT-7 appeared to adequately cover the bases of the INT (Fig. 9 a-c) or ORF (Fig. 9 d-f) of the respective TE loci. The genome browser image of the only Gypsy-V1 locus demonstrates the full coverage of this locus by a single ONT read (Fig. 9 g). The genome browser image for hAT-7 shows ONT reads covering the ORF of the hAT-7 in chromosome 14 (Fig. 9 h). These data suggest the observed transcription of Gypsy-V1 and hAT-7 may allow limited mobilisation of these elements.

Fig. 9
figure 9

Identification of structurally intact TE loci potentially having transcripts for autonomous mobilisation. a-c The alignment properties of the ONT reads mapping to a structurally intact Gypsy-V1 locus whose INT domain was > 90% covered by ONT reads. For each read, the read length was plotted against the number of bases overlapping with the autonomous locus and coloured by mapping specificity (multi−/unique-mapping; a) and alignment start site (b) and end site (c) relative to the TE locus. All reads plotted in (a) were presented at the same coordinates in (b) and (c). The black arrows indicate an ONT read covering most of the structurally intact TE locus. Note that this ONT read is multi-mapping to other structurally intact Gypsy-V1 loci, but only the locus indicated here is TE expression candidate (i.e. supported by short-read data). d-f The alignment properties of the ONT reads mapping to three structurally intact hAT-7 loci whose ORF was > 90% covered by ONT reads. (d)-(f) were plot following the approach used for (a)-(c). g-h Genome browser images of the structurally intact Gypsy-V1 locus (Gypsy-V1_chr15_3486647–3,489,471) and the representative hAT-7 locus (hAT-7_chr14_21994787–21,998,695), whose INT and ORF (teal blue strips) was fully covered by individual ONT reads, respectively. The two black stars label full coverage of the INT or ORF, while the single black star marks > 90% coverage of the ORF. INT, internal domain; ORF, open reading frame; F, forward reads

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Disclaimer:

This article is autogenerated using RSS feeds and has not been created or edited by OA JF.

Click here for Source link (https://www.biomedcentral.com/)