Case studies demonstrating the ECCsplorer’s functionality

To evaluate and demonstrate the ECCsplorer’s functionality, we first outlined the results using simulated enrichment of the eccDNA fraction (Fig. 1b–d). Next, we reanalyzed publicly available data of the model organisms A. thaliana (Fig. 2a–c) and H. sapiens (Fig. 2d), and analyzed newly generated circSeq data for the non-model crop B. vulgaris (Fig. 2e). The three datasets illustrate three different research questions, three different input configurations, and a variety of eccDNA candidates; together, they demonstrate the ECCsplorer pipeline’s wide applicability and functionality.

Fig. 2

Outputs of ECCsplorer for different case studies representing the running modes and input data configurations. a–c An ECCsplorer run using A. thaliana circSeq (epi12) and control data (WT) [12] detected EVD/ATCOPIA93 as the key eccDNA candidate (orange). a Mapping against the reference genome sequence (TAIR10) revealed eight eccDNA candidate regions. These were identified by applying three criteria: the region is flanked by split reads (at least five), contains discordantly mapping reads (at least one pair), and shows high coverage compared to its neighboring regions (peak prominence of at least one). b The comparative read clustering identified 41.04% of the clustered epi12 reads as potential eccDNA candidates. c To report eccDNA candidates with high confidence, the ECCsplorer pipeline compares the outputs of the mapping and clustering strategies. The comparative module links the EVD/ATCOPIA93 candidate region from the mapping module to four clusters from the clustering module. d An ECCsplorer run using H. sapiens circSeq data [32] detected one eccDNA candidate region containing genic content (light blue) and many candidate regions with low confidence (grey blue). e ECCsplorer analysis using B. vulgaris circSeq data and available sequencing data [30] as control data detected multiple candidate clusters that were subsequently annotated as mitochondrial minicircles a/d/pO (orange). f Overview of the running modes, their required input data and the corresponding case studies

Case study I: proof of concept using simulated data

To illustrate the ECCsplorer pipeline functionality, we applied it to simulated semi-artificial data (Fig. 1b–d; Additional file 3: Tables S1–S3). For this, three regions of the B. vulgaris reference genome sequence [30] were randomly selected to represent chromosomes with a total length of about 0.6 Mb, including a copy of the well-described long terminal repeat (LTR) retrotransposon Beetle7 [31]. A potential enrichment of the 6695 bp long Beetle7 circular DNA was simulated by introducing a total of nine copies of this retrotransposon in tandem arrangement (including solo-LTR and LTR-LTR junctions). The short-read data were simulated with the tool dwgsim (github.com/nh13/DWGSIM). As circSeq data, 50,000 paired-end (PE) reads with a length of 200 bp each were simulated from the reference sequence and the tandemly arranged retrotransposon. For the control data, the same number of 200 bp PE reads was generated from the reference only. The reference itself served as the reference genome sequence. In addition, the reference sequences of Beetle1 to Beetle7 [31] were used as the annotation database.
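The reason a tandem arrangement can stand in for a circular molecule is that every head-to-tail junction between two copies resembles the closing junction of a circle, so reads spanning it mimic circle-derived reads. The following minimal Python sketch illustrates this on a toy sequence. The function names and the 20 bp toy element are hypothetical and chosen for illustration only; the sketch ignores the solo-LTR and LTR-LTR junction variants mentioned above and is not the procedure used to build the simulated data.

```python
def tandem_template(element: str, copies: int) -> str:
    """Concatenate `copies` head-to-tail copies of a linear element sequence."""
    if copies < 1:
        raise ValueError("copies must be >= 1")
    return element * copies


def count_junction_reads(template: str, element: str, read_len: int) -> int:
    """Count read start positions whose read spans an end-to-start junction."""
    unit = len(element)
    # Junctions lie between consecutive copies: unit, 2*unit, ...
    junctions = [unit * i for i in range(1, len(template) // unit)]
    count = 0
    for start in range(len(template) - read_len + 1):
        end = start + read_len
        # A read spans a junction if bases on both sides of it are covered.
        if any(start < j < end for j in junctions):
            count += 1
    return count


toy_element = "ACGTTGCAAGGCTTAACCGG"  # hypothetical 20 bp stand-in for an element
template = tandem_template(toy_element, copies=9)
print(len(template))                                          # template length
print(count_junction_reads(template, toy_element, read_len=5))
```

With a single copy there is no junction and therefore no junction-spanning read, which is why the control reads drawn from the reference alone do not carry this signal.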

After trimming, 46,950 PE reads of the circSeq dataset and 46,754 PE reads of the control dataset were passed to the subsequent bioinformatics steps. Of those, 12,000 PE reads (6000 per dataset) were processed further for the clustering analysis, which corresponds to a 4× coverage of the reference. Both the mapping module and the clustering module yielded Beetle7 as the only eccDNA candidate with high confidence. The mapping module reported one candidate region on the artificial reference chromosome 3 with a length of 6695 bp, an enrichment score of 46.5, and a BLAST+ annotation as Beetle7 (Fig. 1b; Additional file 3: Table S1). The clustering module reported 13 candidate clusters (containing 3475 reads in total, 3399 reads from the circSeq data and 76 from the control data) adding up to one circular supercluster (Fig. 1c; Additional file 3: Table S2). Such a split of an eccDNA candidate into multiple clusters is relatively common after read clustering with RepeatExplorer2; based on paired-end read information, the clusters are commonly linked into superclusters. The comparative module reported that all 13 candidate clusters showed sequence similarities to the eccDNA candidate region (Fig. 1d; Additional file 3: Table S3). Four of those candidate clusters (namely 5, 6, 11, and 13) cover the circular break-point.
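The stated 4× coverage can be reproduced with a short calculation. The sketch below assumes that each of the 12,000 counted reads contributes its full simulated length of 200 bp (trimming would lower this slightly) and that the reference is exactly 0.6 Mb; the function name is hypothetical.

```python
def sequencing_coverage(n_reads: int, read_len_bp: int, reference_len_bp: int) -> float:
    """Average depth = total sequenced bases / reference length."""
    if reference_len_bp <= 0:
        raise ValueError("reference length must be positive")
    return n_reads * read_len_bp / reference_len_bp


# Values taken from the text; the untrimmed read length is an assumption.
clustering_coverage = sequencing_coverage(
    n_reads=12_000, read_len_bp=200, reference_len_bp=600_000
)
print(clustering_coverage)
```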

Taken together, the ECCsplorer pipeline successfully retrieved the artificial eccDNA candidate and reported no false positives.

Case study II: identification of active retrotransposons using circSeq data of A. thaliana

To demonstrate the ECCsplorer’s ability to detect the activation of LTR retrotransposons, we re-analyzed public data from A. thaliana [12] (Fig. 2a–c; Additional file 3: Tables S4–S6). As a reference genome and both circSeq and control data are available, all ECCsplorer modules were used (running mode: all). Lanciano et al. [12] amplified eccDNAs of epigenetically impaired A. thaliana plants and used the wild type as control. We expected our pipeline to show enrichment of the EVD/ATCOPIA93 retrotransposon, as originally reported. The full ECCsplorer pipeline was started with default settings and the trimming option enabled using the Nextera™ adapter. A total of 501,558 circSeq PE reads and 143,372 control PE reads were mapped against the TAIR10 (The Arabidopsis Information Resource, http://www.arabidopsis.org) reference genome.

The ECCsplorer mapping module retrieved the well-known LTR retrotransposon EVD/ATCOPIA93 as eccDNA candidate with the highest probability (Fig. 2a, orange; Additional file 3: Table S4) as well as other potential eccDNA candidates, e.g. originating from the ribosomal genes (Fig. 2a, blue; Additional file 3: Table S4). In detail, the mapping module reported 13 highly confident candidate regions and 673 candidates with low confidence. The candidate region with the highest enrichment score of 56 was 5332 bp long and annotated as EVD by the pipeline’s BLASTn analysis—in line with our expectation. We conclude that the ECCsplorer’s mapping module produces results that compare well with the original findings and that it is able to detect active retrotransposons from current datasets.
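The three criteria listed in the legend of Fig. 2 (at least five flanking split reads, at least one discordantly mapping read pair, and a coverage peak prominence of at least one) can be expressed as a simple filter. The Python sketch below is an illustration only: the `Region` record, its field names and the toy values are hypothetical, and how the pipeline itself computes these quantities or assigns low-confidence status is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical summary of one candidate region from the mapping step."""
    chrom: str
    start: int
    end: int
    split_reads: int        # split reads flanking the region
    discordant_pairs: int   # discordantly mapping read pairs inside the region
    peak_prominence: float  # coverage peak prominence relative to neighbors


def is_high_confidence(
    region: Region,
    min_split: int = 5,
    min_discordant: int = 1,
    min_prominence: float = 1.0,
) -> bool:
    """Return True only if all three criteria from the figure legend are met."""
    return (
        region.split_reads >= min_split
        and region.discordant_pairs >= min_discordant
        and region.peak_prominence >= min_prominence
    )


# Toy regions with invented values, for illustration only.
good = Region("Chr5", 100, 5432, split_reads=12, discordant_pairs=3, peak_prominence=2.5)
weak = Region("Chr1", 10, 900, split_reads=2, discordant_pairs=0, peak_prominence=0.4)
print(is_high_confidence(good), is_high_confidence(weak))
```

Requiring all three signals at once is what keeps the high-confidence list short relative to the several hundred low-confidence regions.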

For the comparative read clustering, equal read counts of circSeq and control data were used (i.e. 96,016 PE reads per dataset). The clustering module identified 41.04% of the clustered circSeq reads and 1.41% of the control reads as potential eccDNA candidates (39,402/1371 reads from circSeq/control), pointing to a 28.7× overall increase (Fig. 2b). Four of the largest clusters, containing about 10.3% of all reads in candidate clusters (4058/2 reads from circSeq/control), were reported as Ty1_copia/Ale and correspond to the EVD/ATCOPIA93 reference (Fig. 2b, orange circles; Additional file 3: Table S5). We conclude that the eccDNA candidates can be readily detected with the clustering module as well as with the mapping module.
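Because both datasets contribute the same number of reads, the overall increase follows directly from the ratio of read counts in candidate clusters. The sketch below recomputes the reported values from the counts given in the text; the function name is hypothetical, and the 10.3% share is computed here relative to the circSeq reads in candidate clusters, which is an assumption about how that figure was derived.

```python
def fold_enrichment(circseq_reads: int, control_reads: int) -> float:
    """Ratio of circSeq to control reads; valid when both inputs are equal-sized."""
    if control_reads == 0:
        return float("inf")
    return circseq_reads / control_reads


# Read counts in candidate clusters, taken from the text.
overall = fold_enrichment(39_402, 1_371)

# Share of the four EVD/ATCOPIA93 clusters among circSeq candidate reads
# (the choice of denominator is an assumption).
evd_share_percent = 100 * 4_058 / 39_402

print(round(overall, 1), round(evd_share_percent, 1))
```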

To be able to report highly confident eccDNA candidates, the pipeline compares the outputs of the mapping and clustering strategies. Here, the comparative module links eight candidate regions from the mapping module (Fig. 2a) to eleven clusters from the clustering module (Fig. 2b) with the best results for the eccDNA candidate EVD/ATCOPIA93 (Fig. 2c; Additional file 3: Table S6). After removing duplicates, this leads to three target retrotransposon regions for potential experimental verification. These results are in line with the original study [12], and lead us to highlight the ECCsplorer’s potential for the fast and reliable identification of retrotransposon mobilization.

Case study III: detection of (genic) eccDNAs using circSeq data from healthy humans (H. sapiens)

Typically, eccDNAs in H. sapiens are associated with cancer and other diseases. However, eccDNAs may also arise in healthy tissues [32]: Møller et al. generated multiple circSeq datasets from healthy blood and muscle tissues and detected about 100,000 unique eccDNAs, including genic eccDNAs. As circSeq enrichment data and the reference genome sequence were available, we re-analyzed these data using the ECCsplorer’s mapping module (running mode: map). As the original publication reported a high eccDNA candidate density on chromosome 16 of the hg38 assembly (H. sapiens genome assembly GRCh38) [33], we used this chromosome as reference, along with the corresponding mRNA database as annotation database (UCSC Genome Browser, https://genome.ucsc.edu/).

The mapping module detected eccDNA candidates with high (n = 1) and low confidence (n = 840) on the analyzed chromosome 16 (Fig. 2d). The highly confident candidate region was 22,772 bp long and was located at the distal end of chromosome 16 (Fig. 2d, light blue, arrowed). It contains several gene annotations with the highest BLAST score observed for the gene mitofusin (MFN1, Additional file 3: Table S8). Because the read coverage shows an arch-like distribution over the whole chromosome and the enrichment score is calculated globally, the high-confidence candidate region reached an enrichment score of only 0.16. Although the ECCsplorer detected only a single eccDNA candidate region with high confidence, our manual analysis of regions with lower confidence also showed promising results. In total, 29 of those low-confidence candidate regions were supported by Møller et al.’s approach ([32], Fig. 3f; Additional file 4: Supplementary Data).

Fig. 3

Comparison of results from ECCsplorer with Circle-Map [18] and originally published eccDNA candidates. We reanalyzed data from A. thaliana [12] and H. sapiens [32] using the available eccDNA tools ECCsplorer and Circle-Map. a Mapping plots from all three approaches (ECCsplorer, Circle-Map and original published data by Lanciano et al. 2017) of (re-)analyzed A. thaliana circSeq data. For each approach, the detected eccDNA candidates are highlighted in blue, with the key eccDNA candidate EVD/ATCOPIA93 shown in orange (experimentally validated: “ground truth”; maximal coverage depth of each individual plot corresponds to 100% relative depth). b Comparison of all three approaches (ECCsplorer, Circle-Map and original published data by Møller et al. 2018) of (re-)analyzed H. sapiens circSeq data. Bars in track 2 represent the eccDNA candidate regions detected by each approach. The connections highlight similar (overlapping) genomic regions detected on chromosome 16 of hg38 (arrow points to the eccDNA candidate with high confidence). c Length distribution of the eccDNA candidates detected by each approach using A. thaliana circSeq data (candidates with low confidence additionally shown in light grey). The ECCsplorer’s high-confidence hits show a large overlap with the “ground truth”. d Venn diagram of similarly detected eccDNA candidates using A. thaliana circSeq data, with EVD/ATCOPIA93 candidate overlaps shown in orange. e Length distribution of the eccDNA candidates detected by each approach using H. sapiens circSeq data. f Venn diagram of similarly detected eccDNA candidates using H. sapiens circSeq data. The ECCsplorer shows a larger overlap with the original data than the Circle-Map tool; yet “ground truth” data is still lacking

A main difficulty for this specific read dataset was the high background noise, presumably from linear DNAs that remained after incomplete exonuclease treatments. Nevertheless, the ECCsplorer pipeline was able to detect eccDNA candidates from genic regions, confirming the presence of eccDNAs in healthy H. sapiens.

Case study IV: detection of eccDNAs absent from the reference genome using circSeq data from B. vulgaris

To test whether our ECCsplorer pipeline is able to detect eccDNAs absent from reference genome assemblies, we queried sugar beet (B. vulgaris), a non-model organism. The B. vulgaris genome harbors small extrachromosomal circular stretches of mitochondrial DNAs [34] that are absent from the published reference assembly [30, 35]. We generated circSeq data from inflorescences of B. vulgaris after experimental enrichment of the eccDNA fraction according to Lanciano et al. (2017) and Diaz-Lara et al. (2016). As control, we used publicly available data from the same genotype (KWS2320, see Data Availability).

To test the ECCsplorer’s usability for the detection of extrachromosomal mitochondrial DNAs absent from the reference genome, we used the comparative read clustering as embedded in the clustering module (running mode: clu). For the clustering, 322,580 PE reads (161,290 PE reads per dataset) were used. The clustering module revealed twelve candidate clusters (combined in one supercluster) containing 30,792 reads (30,636/156 in circSeq/control). All were clearly enriched (Fig. 2e, orange circles; Additional file 3: Table S9) and of mitochondrial origin. A manual BLAST assigned all twelve clusters to the B. vulgaris-typical mitochondrial minicircles, termed a, d and pO [34].
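Comparative clustering relies on both datasets contributing the same number of reads, so that differences in cluster composition reflect enrichment rather than input size. One simple way to obtain equal-sized inputs is seeded random subsampling, sketched below. This is an assumption for illustration: how the pipeline itself selects reads is not described here, and the function name, the seed and the toy read identifiers are hypothetical.

```python
import random
from typing import List, Optional, Sequence, Tuple


def subsample_equal(
    circ_pairs: Sequence[str],
    ctrl_pairs: Sequence[str],
    n_per_dataset: Optional[int] = None,
    seed: int = 0,
) -> Tuple[List[str], List[str]]:
    """Draw the same number of read pairs from both datasets without replacement."""
    limit = min(len(circ_pairs), len(ctrl_pairs))
    n = limit if n_per_dataset is None else n_per_dataset
    if n > limit:
        raise ValueError("requested more read pairs than the smaller dataset holds")
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    return rng.sample(list(circ_pairs), n), rng.sample(list(ctrl_pairs), n)


# Toy read-pair identifiers standing in for real FASTQ records.
circ = [f"circ_{i}" for i in range(10)]
ctrl = [f"ctrl_{i}" for i in range(6)]
circ_sub, ctrl_sub = subsample_equal(circ, ctrl)
print(len(circ_sub), len(ctrl_sub))
```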

These results clearly show that the ECCsplorer is capable of reference-free eccDNA detection at low sequencing coverages. We further want to highlight that already existing sequencing runs can be used for the comparative clustering analysis. Ideally, however, we recommend preparing enriched and non-enriched DNA from the same samples.

In contrast to other eccDNA detection methods, the ECCsplorer can be used without a high-quality reference genome sequence and is therefore much less vulnerable to assembly errors. This makes the ECCsplorer the current method of choice when working with non-model organisms and low-coverage, short-read sequencing data (see also “Comparison with existing tools”).

Comparison with existing tools

Although the amount of available circSeq data and the interest in eccDNAs have been growing lately, there is not yet a standardized way of analyzing such data. To date, only a few software solutions are available, and currently none of them addresses different approaches in a single tool. Most of the available tools are aimed at a very specific use case and are not applicable to non-model organisms at all.

To our knowledge, there are currently only a few software solutions to detect eccDNAs. Circle-Map [18] is a realignment-based pipeline to detect eccDNA from circSeq datasets already mapped to a reference genome sequence. Circle_finder [15] is a script collection written for the detection of eccDNA (there called microDNA) from H. sapiens samples. CIRCexplorer2 [36] is intended to detect circular RNA. Two further available tools are AmpliconArchitect [37] and CIDER-Seq [16]. Whereas AmpliconArchitect aims very specifically at detecting eccDNAs from H. sapiens cancer tissue, the CIDER-Seq approach relies on long-read (PacBio/SMRT) sequences only. Finally, there is Circulome-Seq [38], an RCA-free, column-free approach that enriches eccDNA by prolonged exonuclease V treatment and subsequent library construction with the transposase Tn5. Owing to the very different application ranges of AmpliconArchitect, CIDER-Seq and Circulome-Seq, a direct comparison with our ECCsplorer pipeline is not possible. For human data, Circle-Map has been compared to Circle_finder and CIRCexplorer2 before and shows similar or better performance overall [18].

To evaluate the performance of our ECCsplorer pipeline, we compared its output with published results and those that we retrieved after re-running Circle-Map. For the comparison, the datasets analyzed with the ECCsplorer (Fig. 2a–d) were reanalyzed with Circle-Map and bwa-mem [39] as mapping tool, as recommended (https://github.com/iprada/Circle-Map/wiki). In addition, we compared the results obtained from both tools with the originally published results. As Circle-Map relies on a gold-standard reference genome, only A. thaliana and H. sapiens data are suitable as input. To our knowledge, non-model organisms that lack high-quality draft genomes are only analyzable with our ECCsplorer pipeline when using short-read sequencing data at low coverage (< 0.2 ×).

First, we compared the results of both tools and the originally published data from the enrichment of eccDNA in A. thaliana [12]. In summary, the ECCsplorer detected eight eccDNA candidate regions and 673 regions with low confidence (Fig. 3a, c, d). Three of the eight candidates were annotated as EVD/ATCOPIA93 (Fig. 3a, first skyline plot, orange peaks), an LTR retrotransposon previously reported as transpositionally active [40]. The five remaining candidates appeared to be organellar DNA or tandem repeats. The original publication [12] also reported eight candidate regions, three of them being EVD/ATCOPIA93 (Fig. 3a, third skyline plot, orange peaks). The original data contained no organelle-derived eccDNAs, as those reads had been filtered out before the analysis. Re-analysis with the Circle-Map tool reported 1292 candidate regions after regions greater than 50 kb had been manually filtered out, as these were very likely false positives or overlapped other candidates (Fig. 3a, second skyline plot, blue peaks and Fig. 3e, f).

The ECCsplorer output and the original data shared three high-confidence candidate regions, all of them being EVD/ATCOPIA93 (Fig. 3d; Additional file 3: Table S7). Four additional originally reported regions occurred in the low-confidence output of the ECCsplorer pipeline. The ECCsplorer output and the Circle-Map output shared three candidates as well, but none of them was EVD/ATCOPIA93. The Circle-Map output and the original data shared five candidate regions, but again none of them was EVD/ATCOPIA93. No candidate region was found by all three approaches (Fig. 3d; Additional file 3: Table S7). The comparison of the length distributions of the eccDNA candidates showed similar profiles between the ECCsplorer candidates (with high confidence) and the originally published data, as well as between the ECCsplorer candidates (with low confidence) and the Circle-Map results (Fig. 3c). This comparison indicates that, with default settings and taking the originally published, experimentally validated results [12] as ground truth, the ECCsplorer pipeline provides more accurate results than the Circle-Map tool. Surprisingly, no EVD/ATCOPIA93 eccDNA candidate was present in the Circle-Map results despite the large number of output candidates. The ECCsplorer pipeline thus outperforms Circle-Map in this case study using low-coverage (~1×) input data.

Second, we compared the results of ECCsplorer and Circle-Map with the originally published results of the H. sapiens circSeq study, focusing on candidates on chromosome 16 of the hg38 assembly (Fig. 3b). Chromosome 16 was reported by the original study [32] to have a high eccDNA count per Mb and is also one of the shorter H. sapiens chromosomes, allowing all tools to run on a desktop-grade computer. The ECCsplorer pipeline found one candidate region with high and 840 with low confidence using default settings (Fig. 3b, e, f). The original study [32] reported 70 high-confidence and 75 low-confidence candidate regions. Re-analysis with the Circle-Map tool found 493 candidates using recommended settings and manual filtering of candidates greater than 50 kb. The outputs of all approaches were compared with BEDtools intersect (for details see Supp. Info 2 Methods). The ECCsplorer pipeline had 29 results in common with the originally published data and 82 overlaps with candidates reported by the Circle-Map tool (Fig. 3b, f; Additional file 4: Supplementary Data). Curiously, only two similar candidates were shared between the original data and the Circle-Map results (Fig. 3f), despite the length distribution profiles appearing to be similar (Fig. 3e). No candidates were shared across all approaches. This demonstrates the complexity of detecting eccDNA candidates from circSeq data and the need for a unified and reproducible software solution. The ECCsplorer pipeline shared similar candidates with both other approaches (Fig. 3f), while detecting longer candidates overall (Fig. 3e). This finding makes the ECCsplorer a viable starting point for the analysis of eccDNAs, especially given that the published candidates from H. sapiens [32] have not been fully validated experimentally. Additionally, our findings demonstrate that an approach based on mapping only may result in many false-positive eccDNA candidates. Therefore, we recommend running the ECCsplorer pipeline with all implemented approaches, including the mapping, clustering, and comparative modules.
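The comparison logic described above, a length filter at 50 kb followed by counting overlapping candidate regions, can be sketched in plain Python. The sketch mimics the basic behavior of an interval intersection that reports each region of the first set once if it overlaps any region of the second set by at least one base; it is not the BEDtools command used for the published comparison. The function names and the toy intervals are hypothetical, and coordinates are assumed to be 0-based and half-open, as in BED files.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (chromosome, start, end), BED-style half-open


def filter_by_length(intervals: List[Interval], max_len: int = 50_000) -> List[Interval]:
    """Keep only candidates that are at most `max_len` bp long."""
    return [iv for iv in intervals if iv[2] - iv[1] <= max_len]


def count_overlapping(a: List[Interval], b: List[Interval]) -> int:
    """Count intervals in `a` that overlap at least one interval in `b`."""
    return sum(
        1
        for chrom_a, start_a, end_a in a
        if any(
            chrom_a == chrom_b and start_a < end_b and start_b < end_a
            for chrom_b, start_b, end_b in b
        )
    )


# Toy candidate sets with invented coordinates, for illustration only.
tool_a = [("chr16", 100, 500), ("chr16", 1_000, 80_000), ("chr16", 90_000, 91_000)]
tool_b = [("chr16", 450, 700), ("chr16", 95_000, 96_000)]

tool_a_filtered = filter_by_length(tool_a)
print(len(tool_a_filtered), count_overlapping(tool_a_filtered, tool_b))
```

This quadratic scan is adequate for a few hundred candidates on one chromosome; sorted or indexed intervals would be preferable for genome-wide sets.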

Current limitations

The main limitations of the current version of the ECCsplorer pipeline stem from the implemented tools and algorithms, which can place heavy demands on computing resources. First, segemehl uses very high amounts of RAM (random-access memory) even for medium-sized reference genome sequences (128 GB of RAM are recommended for use with the H. sapiens reference genome). Nevertheless, its high accuracy in split-read detection makes it a very valuable choice for the analysis of eccDNA candidates. Second, the peak_finder algorithm is comparatively slow when scanning chromosome-scale data, but is needed to generate eccDNA candidates. Third, the all-against-all BLAST performed by RepeatExplorer2 limits the number of analyzable reads. However, the implementation of RepeatExplorer2 read clustering offers a fundamentally different way to detect eccDNA amplification and hence greatly increases the usefulness of the pipeline. Last, the amplification bias in the experimental procedure prevents a quantitative analysis of the eccDNA fraction, so the pipeline instead focuses on the qualitative detection of eccDNAs. The computational limitations can be mitigated by splitting the individual dataset and performing multiple runs. In summary, the ECCsplorer pipeline favors robust, qualitative results over speed, producing highly confident eccDNA candidates by integrating different detection approaches.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
