DNA duplex stability parametrized archaeal promoters differ from control sequences
To represent the binding sites of transcription factor proteins as numeric inputs, genetic information was coded into DDS. Promoter sequences have already been well represented by DDS. Fig. 2 illustrates the coding of genetic information into DDS and highlights the regions that set promoters apart. Plotting promoter sequences against their negative controls reveals that the binding sites of transcription factor proteins are found only within promoters; these regions appear in the promoter line around positions −28 and −32 and in the range −10 to +1.
The initiation of transcription in archaea has been reported to require two transcription factor proteins: a TBP and a TFB, the latter a homolog of eukaryotic TFIIB [7, 10]. Additionally, a second strong signal was observed between positions −10 and +1, matching the Proximal Promoter Element. This region contains the binding site of the protein TFE, which has been reported to optimize transcription in archaea by stabilizing the formation of the PIC. Because these organisms have small genomes and must meet their metabolic demands in order to thrive, transcription-optimizing proteins such as TFE play a pivotal role in gene expression.
Next, we observed conserved binding sites of promoter-recruiting transcription factor proteins in archaea with varied GC content (H. volcanii = 66.13%, T. kodakarensis = 50.67%, and S. solfataricus = 34.48%). A higher GC content would indicate fewer potential binding sites for such proteins, as previously reported. However, our approach was able to locate the binding sites regardless of the GC content of a particular archaeon. The binding sites of these three proteins are therefore clear in the plots representing the promoters of all organisms, suggesting that DDS succeeded in representing promoter elements in archaea.
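As a rough illustration of the DDS coding described in this section, the sketch below converts a nucleotide sequence into one stability value per overlapping dinucleotide. It is only a sketch under assumptions: the free-energy table uses the widely cited unified nearest-neighbor values of SantaLucia (1998), which may not be the table used in this study, and the function names and example sequences are hypothetical.

```python
# Illustrative nearest-neighbor free energies (kcal/mol, 37 °C), keyed by the
# 5'->3' dinucleotide on one strand. ASSUMPTION: unified SantaLucia (1998)
# values; the DDS table used in the study may differ.
NN_DG = {
    "AA": -1.00, "TT": -1.00,
    "AT": -0.88,
    "TA": -0.58,
    "CA": -1.45, "TG": -1.45,
    "GT": -1.44, "AC": -1.44,
    "CT": -1.28, "AG": -1.28,
    "GA": -1.30, "TC": -1.30,
    "CG": -2.17,
    "GC": -2.24,
    "GG": -1.84, "CC": -1.84,
}


def dds_profile(seq):
    """Return one stability value per overlapping dinucleotide step."""
    seq = seq.upper()
    return [NN_DG[seq[i:i + 2]] for i in range(len(seq) - 1)]


def mean_dds(seq):
    """Average stability over the whole sequence."""
    profile = dds_profile(seq)
    return sum(profile) / len(profile)


tata_like = "TTTATATA"  # hypothetical AT-rich element
gc_rich = "GCGCGGCC"    # hypothetical GC-rich control

print(dds_profile(tata_like))
# AT-rich stretches are less stable (less negative) than GC-rich ones
print(mean_dds(tata_like) > mean_dds(gc_rich))
```

Under these assumed values, an AT-rich element such as a TATA box stands out as a local region of low duplex stability, which is the kind of signal the promoter plots rely on.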
Statistical classification succeeds in the distinction of promoter sequences
To establish a classification method, the TATA + BRE sites of promoter sequences and of three levels of control were converted into DDS, and their mean values were computed. For each observation, the promoter average differs from those of the three controls: the shuffled sequences lie closest to the promoter score, followed by the downstream sequences and, finally, the block-shuffled sequences (Table 1).
The dissimilarity between the promoters and the three forms of control in Table 1 enables the statistical classification proposed in this study. An interval of promoter values was defined, and every sequence was classified as promoter or non-promoter according to it. The results were then summarized in an error matrix (Table 2), in which the precision value remains the same in every organism against its controls, since its calculation relies on positive values. Table 2 also indicates that recall is comparatively high. The best scores were achieved with the block form of control and the lowest with shuffled sequences, with the exception of S. solfataricus, in which downstream lags behind shuffled.
The statistical classification reported in this study proved satisfactory even though it does not employ machine learning techniques. Firstly, the lower precision of the method suggests that the model produced too many False Positives; that is, non-promoters were classified as promoters in some instances. One reason for this is the diversity found in archaea and the variation in how well binding sites are conserved. Secondly, the best scores were achieved with the block form of control, in agreement with Table 1, in which the block means are the furthest from the promoter means. Additionally, the method achieved satisfactory recall, a metric that is sensitive to False Negatives. This is explained by the transcription machinery of different archaea being quite similar: if a sequence lacks conserved binding sites for TBP and TFB, it is very unlikely to be classified as an archaeal promoter.
When the statistical model was stressed by an inter-archaea classification (Table 3), similarities were found between S. solfataricus and T. kodakarensis, in agreement with the proposal by Takemasa et al. to use T. kodakarensis and S. solfataricus as regulatory chassis for hyperthermophilic archaea. These two organisms were reported to be similar in terms of AT% throughout the genome, while H. volcanii has a higher GC content. We found that nucleotide composition directly affects the classification outcome, since the classification relies on conserved binding sites of transcription factor proteins.
Statistics has proven to be an adequate way to classify archaeal promoter sequences. The method stands out for its ease of implementation, since it does not incur extensive computational costs. Indeed, descriptive statistics is seen as a precursor of machine learning for classification tasks.
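The interval-based classification and the error matrix can be sketched as follows. This is a minimal illustration, not the procedure used in the study: the interval rule (promoter mean plus or minus k standard deviations), the value of k, and all scores are assumptions made up for the example.

```python
from statistics import mean, stdev


def fit_interval(promoter_scores, k=2.0):
    """ASSUMED rule: promoter interval = mean ± k standard deviations."""
    m, s = mean(promoter_scores), stdev(promoter_scores)
    return m - k * s, m + k * s


def classify(score, interval):
    """A sequence is called a promoter if its score falls in the interval."""
    low, high = interval
    return low <= score <= high


def error_matrix(promoter_scores, control_scores, interval):
    """Return (TP, FP, FN, TN) for promoters versus one form of control."""
    tp = sum(classify(s, interval) for s in promoter_scores)
    fn = len(promoter_scores) - tp
    fp = sum(classify(s, interval) for s in control_scores)
    tn = len(control_scores) - fp
    return tp, fp, fn, tn


def metrics(tp, fp, fn, tn):
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


# Made-up mean DDS scores, for illustration only
promoters = [-0.80, -0.85, -0.78, -0.90, -0.82]
controls = [-1.40, -1.55, -0.84, -1.60, -1.35]

interval = fit_interval(promoters)
counts = error_matrix(promoters, controls, interval)
print(counts, metrics(*counts))
```

In this toy dataset one control sequence falls inside the promoter interval, so precision drops while recall stays at its maximum, which mirrors how False Positives and False Negatives affect the two metrics differently.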
Artificial neural networks yield a more robust classification
To achieve more robust classification scores, ANNs were used. In the ANN simulation, the architecture that yielded satisfactory scores was as follows: (i) seven neurons in the input layer; (ii) two neurons in the hidden layer; and (iii) one neuron in the output layer. Table 4 shows the results achieved by the ANN simulation, with a default threshold (tradeoff) value of 0.5 applied to the model output. The four metrics tested in the classification (Accuracy, Precision, Recall, and Specificity) are evenly spread among the different forms of control. For the mean of the three forms of control for each classification metric, see Additional file 1: Table S1, in which the results of the four metrics are comparably close to one another. Furthermore, the behavior of the ANN model was tested with different threshold values through a ROC (Receiver Operating Characteristic) curve, presented in Fig. 3.
A second application of ANNs evaluated whether the pattern learned from one archaeon could be used to classify another. Following this rationale, a new simulation was performed in which the ANN was trained on one organism and tested on another. The results are available in Table 5 and show an affinity between S. solfataricus and T. kodakarensis. The models involving H. volcanii produced widely divergent classification results (e.g. from 60.87% recall to 96.23% specificity when crossing S. solfataricus and H. volcanii).
The ANN results suggest that the model succeeded in classifying archaeal promoters, distinguishing them from three variations of control. Indeed, this machine learning approach has previously succeeded in identifying promoters [17, 22, 23]. An implementation of a similar nature was previously applied to the classification of bacterial promoters. The results obtained in the present study outperformed the bacterial classification, which we attribute to the structure of the archaeal promoter compared with that of bacteria, which rely on sigma factor proteins to direct RNAP to specific sites.
By outperforming the statistical classification, the ANN method has demonstrated its mathematical robustness. The rationale of this method matched that of the statistical classification while surpassing its scores. ROC curves, which plot the cost in specificity of gaining more recall, are a good indicator of prediction validity. The clearest behavior is observed for the block control, whose curve lies in the upper left corner of the plotting area, confirming the findings of the statistical analysis and validating the results of Table 4 and Additional file 1: Table S1. The evenly spread scores (fluctuating by no more than 1% across the metrics of each archaeon) attest to the success of the ANN classification, suggesting that the conservation captured by the DDS coding of transcription factor binding sites is sufficient to make promoter sequences distinguishable.
The inter-organism classification showed that S. solfataricus and T. kodakarensis share similarities, as evidenced by the acceptable classification scores between these two archaea. The high recall values observed for H. volcanii vs. S. solfataricus and T. kodakarensis suggest that very few False Negatives occurred, meaning that the model rarely misclassified H. volcanii promoters; this is due to the divergent GC content of this organism, as previously reported. The uneven precision results between H. volcanii and S. solfataricus (and vice versa) show that the S. solfataricus model correctly identified non-promoters of the H. volcanii dataset, meaning that the model correctly identifies promoters with conserved binding sites. However, the H. volcanii ANN failed to classify non-promoters of the S. solfataricus dataset, indicating that the classification rationale of this halophilic archaeon performs well only with organisms of higher GC%. In general terms, owing to the higher GC content of H. volcanii and, consequently, its fewer conserved transcription factor binding sites, the promoter sequences of this organism are distinct from those of the other two archaea.
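The 7-2-1 architecture and the 0.5 threshold can be sketched as a forward pass in plain Python. This is an illustrative sketch only: the sigmoid activations, the random untrained weights, the class name `TinyANN`, and the seven example inputs are all assumptions, and training is omitted.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyANN:
    """7 inputs -> 2 hidden neurons -> 1 output (sigmoid units are assumed)."""

    def __init__(self, n_in=7, n_hidden=2, seed=0):
        rng = random.Random(seed)
        # Untrained random weights; a real model would learn these from data.
        self.w1 = [[rng.uniform(-1, 1) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-1, 1) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        """Return the output neuron's activation, a score between 0 and 1."""
        hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        return sigmoid(sum(w * h for w, h in zip(self.w2, hidden)) + self.b2)

    def predict(self, x, threshold=0.5):
        """Call a sequence a promoter when the score reaches the threshold."""
        return self.forward(x) >= threshold


net = TinyANN()
# Seven made-up DDS-derived inputs, for illustration only
features = [-0.80, -0.85, -0.78, -0.90, -0.82, -1.10, -1.30]
score = net.forward(features)
print(round(score, 3), net.predict(features))
```

Raising or lowering `threshold` in `predict` trades recall against specificity, which is the behavior a ROC curve summarizes.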
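The trade-off that a ROC curve displays can be sketched by sweeping the decision threshold over a set of model outputs. The scores, labels, and threshold grid below are made up for illustration, and `roc_points` is a hypothetical helper rather than the routine used in the study.

```python
def roc_points(scores, labels, thresholds):
    """Return (false positive rate, recall) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        predicted = [s >= t for s in scores]
        tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
        fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
        # False positive rate equals 1 - specificity
        points.append((fp / negatives, tp / positives))
    return points


# Made-up ANN outputs; label 1 = promoter, 0 = control
scores = [0.95, 0.90, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

points = roc_points(scores, labels, [0.0, 0.25, 0.5, 0.75, 1.01])
for fpr, tpr in points:
    print(f"FPR={fpr:.2f}  recall={tpr:.2f}")
```

A curve whose points approach a false positive rate of 0 and a recall of 1 sits in the upper left corner of the plot, the region associated with the best separation between promoters and controls.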
ANNs and statistics employed in finding potential archaeal promoters
Upstream regions of Aciduliprofundum boonei and Thermofilum pendens were selected as a source of potential promoters. The statistical and ANN models built for S. solfataricus and T. kodakarensis were applied to this validation dataset. H. volcanii was left out owing to its distinct AT content; its inclusion would have jeopardized the validation. An upstream region was considered a promoter only if the statistical models of S. solfataricus and T. kodakarensis and the ANNs of S. solfataricus and T. kodakarensis all flagged the sequence as a promoter. From the 742 and 1927 sequences of A. boonei and T. pendens, respectively, the method identified 145 promoters in the euryarchaeon and 243 promoters in the crenarchaeon. The lists containing sequence IDs, nucleotide sequences, and functional annotation are available at https://doi.org/10.5281/zenodo.5729308.
To validate the newly identified promoters, they were compared with experimentally verified promoters. Fig. 3 shows the DDS profiles of A. boonei and T. pendens alongside those of the other three archaea. In Fig. 4, there is a conserved region at the binding sites of the TBP, TFB and TFE proteins for all observations. A statistical analysis of the slice −40 to −1 of Fig. 2 is provided in Fig. 5, in which the unannotated promoters of A. boonei resemble the averages of S. solfataricus, while those of T. pendens match T. kodakarensis. The analysis of the whole dataset yields p = 3.241 × 10^−14.
The method proposed in this study was able to provide regulatory annotation for the genomes of A. boonei and T. pendens. To do so, we systematically characterized promoters from well-known archaea and used the resulting algorithmic information to locate promoters in unannotated upstream regions of these organisms.
Many factors, such as the diversity of archaea and their relatively recent discovery, create the need for high-quality genome annotation. This is where in silico approaches assist experimental biology by curating data. The boxplots in Fig. 5 show two groups of organisms. No taxonomic inferences could be drawn from the boxplots with similar averages, since T. kodakarensis and A. boonei are Euryarchaeota while S. solfataricus and T. pendens belong to the Crenarchaeota; the statistical resemblance of these organisms requires further analysis. We also suggest using the H. volcanii model to locate promoters in archaea with high GC content. The statistical similarity found between verified and potential promoters supports the robustness of the proposed method.
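The consensus rule described above, in which a region counts as a promoter only when all four models agree, can be sketched as follows. The four classifiers here are deliberately simplistic, hypothetical stand-ins (they are not the trained statistical or ANN models), and the sequences are invented.

```python
def consensus_promoters(upstream_regions, classifiers):
    """Keep only the regions that every classifier flags as a promoter."""
    return [seq_id for seq_id, seq in upstream_regions.items()
            if all(clf(seq) for clf in classifiers)]


def at_fraction(seq):
    """Fraction of A and T bases in a sequence."""
    return sum(base in "AT" for base in seq) / len(seq)


# HYPOTHETICAL stand-ins for the four trained models; the real classifiers
# would be the statistical intervals and ANNs of the two organisms.
classifiers = [
    lambda s: at_fraction(s) > 0.6,          # stand-in: S. solfataricus statistics
    lambda s: at_fraction(s) > 0.5,          # stand-in: T. kodakarensis statistics
    lambda s: "TATA" in s,                   # stand-in: S. solfataricus ANN
    lambda s: "TTTA" in s or "TATA" in s,    # stand-in: T. kodakarensis ANN
]

# Invented upstream regions, for illustration only
regions = {
    "seq1": "AAATTTATATAAGC",
    "seq2": "GCGCGGCCTATAGC",
    "seq3": "AATTAATTAATTGG",
}

print(consensus_promoters(regions, classifiers))
```

Requiring unanimous agreement is a conservative design choice: it lowers the number of candidates but reduces the chance that a single permissive model introduces False Positives.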