Omics data collection and statistics

Galbase collected public chicken multi-omics data comprising 928 re-sequenced genomes (Additional file 1: Table S1), 429 transcriptomes (Additional file 2: Table S2), 379 epigenomes (Additional file 3: Table S3), 15,275 QTL entries (data from Animal QTLdb), and 7,526 associations (Additional file 4: Table S4). After quality control and standardized data processing, these datasets were converted into tables stored in the MySQL database (Fig. 1).

Fig. 1. The database structure and processing pipeline of Galbase

The final list of variations included 21,672,487 SNPs and 2,708,244 InDels, which were annotated into 25 consequence types (Additional file 4: Tables S5 and S6). We estimated π and FROH values to evaluate the genomic diversity of each breed (Fig. 2 and Additional file 4: Fig. S1). We found a higher level of nucleotide diversity (Fig. 2a and Additional file 4: Fig. S1a) and a relatively lower level of inbreeding (Fig. 2b and Additional file 4: Fig. S1b) in RJFs than in domestic chickens, indicating a higher level of genetic diversity in wild populations. Among domestic chickens, breeds from Southwest Asia, South Asia, Southeast Asia, and Southern China had a higher level of nucleotide diversity (Fig. 2a and Additional file 4: Fig. S1a) but a lower level of inbreeding (Fig. 2b and Additional file 4: Fig. S1b) than those from Northern China, possibly due to more frequent gene flow between RJFs and sympatric domestic chickens [1]. Some highly selected breeds and native breeds, including White Leghorn, Commercial broilers, Araucana Blue-shelled chicken, Dongxiang Blue-shelled chicken, Daweishan Mini chicken, Yuanbao chicken, Guangxi chicken, Miyi chicken, Jiangxi Silkies, Anyi Gray chicken, Langshan chicken, Shouguang chicken, and Gushi chicken, showed higher levels of inbreeding (Fig. 2b and Additional file 4: Fig. S1b), indicating a potential risk of extinction through inbreeding depression and thus calling for conservation measures [31, 46].

Fig. 2. Statistics of population genomic diversity based on random sampling of each group. To avoid bias due to variation in sample size, each group was reduced to five individuals by random sampling, repeated 10 times. a Genome-wide distribution of nucleotide diversity in each chicken group. b Distribution of FROH estimates in each chicken group
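As a rough illustration of the downsampling scheme described in the caption of Fig. 2, the following Python sketch estimates π from diploid genotype calls in repeated random subsamples of five individuals. The function names, the genotype encoding (alternate-allele counts of 0, 1, or 2), and the per-site estimator are assumptions made for illustration; the text does not state which software or window size was used for the published estimates.

```python
import random

def pi_from_genotypes(genotypes, n_sites_total):
    """Per-site nucleotide diversity from diploid genotypes.

    genotypes: one list per variant site, holding each individual's
               alternate-allele count (0, 1, or 2). Assumed encoding.
    n_sites_total: number of sites (variant plus invariant) in the region.
    """
    total = 0.0
    for site in genotypes:
        n = 2 * len(site)          # number of sampled chromosomes
        k = sum(site)              # number of alternate alleles
        # Expected pairwise differences at this site among n chromosomes
        total += 2.0 * k * (n - k) / (n * (n - 1))
    return total / n_sites_total

def downsampled_pi(genotypes, n_sites_total, sample_size=5, replicates=10, seed=0):
    """Mean pi over repeated random subsamples of `sample_size` individuals."""
    rng = random.Random(seed)
    n_individuals = len(genotypes[0])
    estimates = []
    for _ in range(replicates):
        idx = rng.sample(range(n_individuals), sample_size)
        subset = [[site[i] for i in idx] for site in genotypes]
        estimates.append(pi_from_genotypes(subset, n_sites_total))
    return sum(estimates) / len(estimates)
```

Fixing the subsample size makes the estimates comparable across groups with very different numbers of sequenced individuals, which is the stated purpose of the downsampling.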

The transcriptome data cover 44 tissues at different developmental stages. We added several groups that are informative for chicken trait studies, such as high- and low-altitude samples, slow- and fast-growing muscles, and normal and frizzle-shaped feathers. We calculated TPM values and the tissue-specificity index (tau) for 23,729 genes. Gene ontology (GO) analysis confirmed that tissue-specific genes are involved in known tissue-relevant biological processes (Additional file 4: Table S7). We also performed differential expression analysis for each experimentally designed group within the same project and provided a list of differentially expressed genes. To better interpret changes in expression levels, we collected and included four histone modification marks (H3K4me3, H3K27ac, H3K4me1, and H3K27me3) and one transcription factor, CCCTC-binding factor (CTCF), based on ChIP-seq, as well as one open chromatin marker, based on ATAC-seq, to identify cis-regulatory elements. A total of 488,583 cis-regulatory elements were identified, accounting for 49.37% of the whole genome size (Additional file 4: Table S8).
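For readers unfamiliar with the two expression measures mentioned above, the following Python sketch shows how TPM and the tau index are commonly computed. It is a minimal illustration under stated assumptions: the function names are hypothetical, and the text does not say whether tau was computed on raw or log-transformed TPM values, so no transformation is applied here.

```python
def tpm(counts, lengths_kb):
    """Transcripts per million from read counts and gene lengths (in kb)."""
    rates = [c / l for c, l in zip(counts, lengths_kb)]   # reads per kilobase
    scale = sum(rates)
    return [r / scale * 1e6 for r in rates]

def tau(expression):
    """Tissue-specificity index for one gene across n tissues.

    tau = sum(1 - x_i / x_max) / (n - 1); 0 means uniform expression,
    1 means expression confined to a single tissue.
    """
    x_max = max(expression)
    if x_max == 0:
        return 0.0   # assumption: an unexpressed gene is reported as non-specific
    n = len(expression)
    return sum(1.0 - x / x_max for x in expression) / (n - 1)
```

For example, `tau([10, 0, 0, 0])` evaluates to 1.0 and `tau([5, 5, 5])` to 0.0, the two extremes of the index.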

The phenotypic data were collected from AnimalQTLdb, GWAS Atlas, and public literature resources (Additional file 4: Table S4). We mapped reported genes and positional information for all collected phenotypic data to the chicken GRCg6a genome. The data include 609 different chicken traits, which were divided into 15 major categories (Additional file 4: Table S9). The reported traits were mainly concentrated in the “Growth Related Traits”, “Egg Related Traits”, “Exterior Features”, “Behavior Related Traits”, and “Feeding Related Traits” categories, which is consistent with the mainstream research on chickens.

Database characteristics

Galbase comprises a data storage warehouse built on MySQL, a user search engine built with CodeIgniter, and a set of tools for analysis and visualization. We organized the chicken multi-omics data into six main functionalities: (i) variation module; (ii) expression module; (iii) epigenomic module; (iv) phenotypic module; (v) batch annotation; and (vi) a series of useful tools. Each module has its own page, and features are linked to each other by gene symbols and chromosomal locations.

Variation module

The “Variation module” was designed to dynamically retrieve relevant SNP and InDel information in a tabular format or in a genome browser interface. By specifying a variant rsID, a gene symbol, or a chromosomal location (Fig. 3a), users can obtain all annotated variation information, including chromosome, position, reference/alternative alleles, minor allele frequency (MAF), consequence type, variant ID, and allele frequency (Fig. 3c and 3d). A local instance of the UCSC Genome Browser, which we named “Gbrowse”, is linked to this page to help view other sequence features (Fig. 3c and 3d). Additional filtering parameters can be set to obtain variations that fall within gene bodies or within the non-protein-coding fraction of the genome (Fig. 3b). Moreover, each search can be accompanied by a geographical map showing the allele frequency in five RJF subspecies and 47 chicken breeds (Fig. 3c and 3d). This function can help users identify breed- or trait-associated variants.
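The per-breed allele frequencies shown on the map can be thought of as a simple grouped count. The Python sketch below illustrates this under stated assumptions: genotypes are encoded as alternate-allele counts (0, 1, 2, or `None` for a missing call), the function names are hypothetical, and the breed labels in any usage are placeholders rather than Galbase identifiers.

```python
from collections import defaultdict

def allele_freq_by_breed(genotypes, breeds):
    """Alternate-allele frequency per breed at one variant site.

    genotypes: alternate-allele count per individual (0, 1, 2) or None if missing.
    breeds:    breed label per individual, in the same order.
    """
    counts = defaultdict(lambda: [0, 0])   # breed -> [alt alleles, called alleles]
    for g, breed in zip(genotypes, breeds):
        if g is None:
            continue                       # skip missing calls
        counts[breed][0] += g
        counts[breed][1] += 2
    return {breed: alt / total for breed, (alt, total) in counts.items()}

def minor_allele_freq(genotypes):
    """Minor allele frequency across all individuals with a called genotype."""
    called = [g for g in genotypes if g is not None]
    p = sum(called) / (2 * len(called))
    return min(p, 1.0 - p)
```

With genotypes `[2, 2, 0, 1, None]` and breed labels `["A", "A", "B", "B", "B"]`, the alternate allele has frequency 1.0 in breed A and 0.25 in breed B, and the overall MAF is 0.375.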

Fig. 3. Features of Galbase variation module. a Basic search interface includes filtering of variation rsID, gene symbol, and chromosomal location of both GRCg6a and GRCg7b assemblies. b Advanced search interface includes filtering of minor allele frequency and consequence type. c An example shows the chr5:41,020,238 locus at the TSHR gene and its allele frequency distribution of five RJF subspecies and 47 domestic chicken breeds. d An example shows the chr3:67,850,419 locus at the PDSS2 gene and its allele frequency distribution. The maps were created by using ECharts (https://github.com/apache/echarts) [43], an open-sourced and web-based framework based on JavaScript

Expression module

The “Expression module” displays gene expression profiles in three ways. The first displays the gene expression matrix as a heatmap together with the tau value (Fig. 4a), which enables users to easily distinguish gene expression patterns and tissues with specific or highly abundant expression. The other two display RNA-seq data in Gbrowse. The “expreBar” track (Fig. 4b) links to a more detailed boxplot page when a gene symbol is clicked, where users can view the expression levels of all samples (Fig. 4c). The “RNAseqReadsCoverage” track displays read coverage normalized to 1× sequencing depth from the BAM files, so that users can compare expression levels between different samples (Fig. 4d). We also performed differential expression analysis with DESeq2 [37] for specific experimentally designed groups and provide lists of upregulated and downregulated genes. Users can download expression matrices and differential gene lists as CSV or plain-text files for further analysis.
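The idea behind 1× normalization can be sketched in a few lines of Python. This is an illustration only: the text does not name the tool used, so the sketch assumes the common approach in which raw coverage is divided by the sample's mean depth (mapped reads × read length / effective genome size). The function names and the numbers in the usage note are hypothetical.

```python
def one_x_scale_factor(n_mapped_reads, read_length, effective_genome_size):
    """Factor that rescales raw coverage so the genome-wide mean depth is 1x."""
    mean_depth = n_mapped_reads * read_length / effective_genome_size
    return 1.0 / mean_depth

def normalize_coverage(coverage, scale):
    """Apply the scale factor to per-position (or per-bin) raw coverage."""
    return [c * scale for c in coverage]
```

For instance, a sample with 1,000 mapped reads of 100 bp over a 50,000 bp genome has a mean depth of 2×, so its scale factor is 0.5 and a raw coverage of `[4, 2, 0]` becomes `[2.0, 1.0, 0.0]`. Because every sample is brought to the same mean depth, track heights become comparable across libraries of different sizes.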

Fig. 4. Features of Galbase expression module. a Screenshots of gene expression function and the result for one example. (The chicken illustration is processed from a chicken photo taken by Weiwei Fu, one of the authors of this article.) b Display of transcriptome expression histogram in Genome Browser. c A boxplot showing the range of expression levels across all chicken samples by clicking the gene symbol in Fig. 4b. Different colors represent different organ systems. d Display of the coverage depth of transcriptome reads in Genome Browser

Epigenomics module

Cis-regulatory elements can help elucidate the causes of complex traits and altered expression levels. Galbase encompasses data relating to four histone modifications: H3K4me3 (active promoters), H3K27ac (active promoters and enhancers), H3K4me1 (enhancers and other distal regulatory elements), H3K27me3 (repressed transcription); one transcription factor, CCCTC-binding factor (CTCF), contributing to 3D genome organization; and one open chromatin marker based on ATAC-seq to identify regulatory regions. The epigenomics metadata, including sample information, experiment type, and epigenetic mark, can be displayed by heatmap views or typical ‘wiggle’ views in WashU Epigenome Browser (Fig. 5b). The epigenomics data can also be retrieved by a table browser. Both regulatory peaks and read coverage (normalized 1 × sequencing depth) can be visualized in Gbrowse (Fig. 5c). Tracks can be sorted, organized, dragged, and printed in a PDF format based on user preferences.

Fig. 5. Presentation of multi-omics data for silky-feather trait. a Genotype pattern of the silky-feather fixed region (chr3:67,832,192–67,850,863). Only the chr3:67,850,419 locus conforms to a single gene recessive inheritance mode. b Display of epigenomics metadata in local WashU Epigenome Browser. c Display of epigenomics peaks and read coverage in local UCSC Genome Browser

Phenotypic module

The “Phenotypic module” contains 15,275 QTL entries and 7,526 variant-trait associations manually curated from AnimalQTLdb [18], GWAS Atlas [19], and literature resources. To unify different chicken traits, we divided the trait entities into 15 major categories using the standard classification of chicken QTLdb (Additional file 4: Table S9). We provide several ways to browse and retrieve the phenotypic data: search by gene symbol, search by variant rsID, find QTLs or associations by genome location, and find associated genes by a trait name or a keyword. To ensure compatibility with the latest version of the chicken reference genome, we converted the coordinates to the newest genome assembly using the LiftOver tool [42] with default parameters.

Batch annotation

This module is available on the home page and supports retrieval across the full database. We integrated all of the aforementioned multi-omics data to annotate genes, genomic regions, and loci in batches, which can improve the accuracy and reliability of candidate screening and help users assess gene function. Users can enter a candidate list of genes or positions to view all available datasets: (i) SNPs; (ii) InDels; (iii) Epigenomics; (iv) Expression; (v) QTL; (vi) GWAS; (vii) Gene Ontology; and (viii) KEGG pathway. The search results are presented in a tabular format and can be downloaded as CSV files for downstream analyses.
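Conceptually, batch annotation of positions amounts to an interval-overlap lookup against each dataset. The Python sketch below illustrates this idea only; it is not the Galbase implementation, which queries MySQL tables. The function name, the tuple layout, and the feature records in the usage note are hypothetical.

```python
from collections import defaultdict

def annotate_regions(queries, features):
    """Return, for each query region, the features that overlap it.

    queries:  list of (chrom, start, end) tuples, 1-based inclusive.
    features: list of (chrom, start, end, dataset, label) tuples.
    """
    by_chrom = defaultdict(list)
    for feature in features:
        by_chrom[feature[0]].append(feature)

    results = {}
    for chrom, q_start, q_end in queries:
        hits = [
            (dataset, label)
            for _, f_start, f_end, dataset, label in by_chrom[chrom]
            if f_start <= q_end and f_end >= q_start   # closed-interval overlap
        ]
        results[(chrom, q_start, q_end)] = hits
    return results
```

A query such as `("chr3", 67850419, 67850419)` against a feature list containing a made-up peak `("chr3", 67850000, 67851000, "Epigenomics", "example peak")` would return that peak, while features on other chromosomes are ignored. A linear scan per chromosome is adequate for a sketch; an indexed database query or an interval tree would be the natural choice at genome scale.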

A series of useful tools

To facilitate use of the database and to compensate for the lack of tools for the latest chicken reference genome, GRCg7b, we have built several commonly used tools. The currently available tools include two genome browsers (the UCSC Genome Browser [42] and WashU Epigenome Browser [44]), a BLAST server [45], a BLAT server, and a liftOver tool [42]. Users can use the UCSC Genome Browser to visualize the multi-omics data in a global view. Currently, 1,196 tracks for the GRCg6a assembly and 388 tracks for the GRCg7b assembly have been released. Users can view SNPs, InDels, gene expression, epigenomic signals, and phenotypic data by searching for a gene symbol or a genomic region. The WashU Epigenome Browser is designed specifically for displaying epigenomic data, and we configured it with the Galbase epigenomic datasets. The BLAST and BLAT servers target the GRCg6a and GRCg7b genome assemblies, allowing researchers to perform sequence alignment, locate a sequence on the genome, and infer its function. The liftOver tool offers online coordinate conversion from GRCg6a to GRCg7b, and chain files can be downloaded for use with the command-line version of liftOver.

Example applications and discussion

Establishing a systematic multi-omics database is critical for streamlining mega-datasets and providing easy access for different users in the field of animal genetics and breeding. However, such databases remain scarce for domestic animals. Here, we use chicken as a paradigm for the archiving, analysis, and visualization of multi-omics data on genome-wide SNPs, InDels, expression, epigenomics, GWAS, and QTL. Compared with other specialized databases, including GEISHA [12], Chickspress [13], and ChickenSD [1], Galbase excels in the following two aspects:

First, Galbase provides variants and their allele frequencies for both wild and domestic chickens in nearly 1000 genomes, which can aid investigation of the population history of chickens. For instance, a previous study reported a missense mutation in the TSHR gene (GRCg6a; chr5:41,020,238 G/A; TSHR-Gly558Arg) to be a domestication locus since it was nearly fixed in all domestic chickens [2]. However, subsequent studies found that the frequency of the TSHR-558Arg mutation in European archaeological chickens increased sharply only in the last 1000 years [47], while this allele also had a very high frequency in the ancestor of domestic chickens, Gallus gallus spadiceus [1]. Given the complex domestication history of chickens, these findings suggest that this mutation may not be a domestication locus. By querying our database, users can easily obtain the allele frequency distribution in domestic chickens and RJFs (Fig. 3c). This intuitive display of geographic allele frequency can help quickly evaluate hypotheses about population history.

Second, Galbase provides multi-omics data to help comprehensively evaluate potential causal variants for complex traits in chickens. For instance, the silky-feather phenotype is controlled by a single recessive gene [9, 48]. We calculated FST values between the silky-feather and non-silky-feather groups using the VCF file downloaded from our database. We selected highly differentiated loci (FST > 0.4) (Additional file 4: Table S10) in the previously reported fine-mapping interval of the silky-feather phenotype (GRCg6a; chr3:67,832,192–67,850,863) [9], and presented their genotype patterns (Fig. 5a). We found that only the chr3:67,850,419 locus conformed to the single-gene recessive inheritance mode, that is, the homozygous silky-type genotype did not exist in the non-silky-feather group. By querying the epigenomics data in our database, we found that chr3:67,850,419 was located in a region showing strongly enriched signals for H3K4me3, H3K27ac, and ATAC-seq (Fig. 5b and 5c), which is consistent with the published experimental results showing that the chr3:67,850,419 locus leads to the silky-feather phenotype by affecting promoter activity [9]. This exemplifies the use of multi-omics resources obtained from Galbase to dissect complex traits and reduce the downstream experimental verification work. In addition, Galbase provides a variety of downloadable forms of expression data. Users can easily screen tissue-specific genes according to the tau index or filter differentially expressed genes according to the DESeq2 results, which makes it more convenient to investigate tissue traits.
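The two screening steps in the silky-feather example, per-site FST filtering followed by a check for the recessive genotype pattern, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the text does not specify which FST estimator was used, so Hudson's estimator is swapped in here; genotypes are encoded as counts of the candidate allele (0, 1, 2); and the function names are hypothetical.

```python
def allele_freq(genotypes):
    """Frequency of the candidate allele from diploid genotype counts (0, 1, 2)."""
    return sum(genotypes) / (2 * len(genotypes))

def hudson_fst(group1, group2):
    """Per-site FST between two groups using Hudson's estimator (assumed choice)."""
    n1, n2 = 2 * len(group1), 2 * len(group2)      # sampled chromosomes per group
    p1, p2 = allele_freq(group1), allele_freq(group2)
    numerator = ((p1 - p2) ** 2
                 - p1 * (1 - p1) / (n1 - 1)
                 - p2 * (1 - p2) / (n2 - 1))
    denominator = p1 * (1 - p2) + p2 * (1 - p1)
    return numerator / denominator if denominator > 0 else float("nan")

def fits_recessive(case_genotypes, control_genotypes):
    """True if every case is homozygous for the candidate allele (2)
    and no control carries that homozygous genotype."""
    return (all(g == 2 for g in case_genotypes)
            and all(g != 2 for g in control_genotypes))
```

With three silky individuals genotyped `[2, 2, 2]` and three non-silky individuals genotyped `[0, 1, 1]`, `hudson_fst` gives approximately 0.6, which passes the FST > 0.4 threshold, and `fits_recessive` returns `True`. A site where even one non-silky individual is homozygous for the candidate allele fails the second check, which is the criterion that singled out chr3:67,850,419 in the analysis above.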

Data management plan and future update

We will continue to incorporate newly released chicken multi-omics data, and provide dedicated tools required to explore and visualize these data. In order to connect and integrate these external resources more quickly, we will develop an automatic interface to download daily published data, upload it to the supercomputer platform for quality control, processing and analysis when the sample size gets larger than 100 individuals, and finally process the offline data into the website format. For the analysis of re-sequenced genomes, which require a lot of computational resources, we will maintain a major update every year. Our plan for the next phase is to use deep learning algorithms to integrate multi-omics data, which will provide a comprehensive insight from genotype to phenotype, so as to better evaluate and mine heterogeneous multi-omics information.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
