Proteins comprise amino acid (AA) sequences and fold into three-dimensional (3D) structures. The native structure of a protein has the minimum free energy, and it determines the protein's function. The protein structure prediction (PSP) problem is to determine the native structure of a protein from its AA sequence. PSP is computationally very challenging [1]: the challenge comes from the astronomically large conformational search space and the unknown energy function involved in the folding process [2].

Proteins have backbones, or main chains, comprising peptide bonds that connect the C and N atoms of successive AAs. All AAs have three common atoms N, Cα, and C in sequence. AAs are typically of 20 types, distinguished by the unique side chains that start from their Cα atoms. AA instances in a protein are called residues. Protein backbone structures can be represented by a series of dihedral angles φ_i, ψ_i, and ω_i, defined respectively by each window of four consecutive atoms from the sequence C_{i-1}, N_i, Cα_i, C_i, N_{i+1}, Cα_{i+1}. However, ω angles are close to 180° in the majority of proteins [3]. AA side chains also have their own dihedral angles, but these are outside the scope of this work since they can be dealt with later once backbones are obtained. Protein backbone structures are important for both template-based and template-free PSP [2, 4].
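As a minimal, self-contained illustration (not code from SAP4SS), the dihedral angle defined by any four consecutive backbone atoms can be computed from their 3D coordinates with the standard atan2 formulation:

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _cross(a, b):
    return (a[1]*b[2] - a[2]*b[1],
            a[2]*b[0] - a[0]*b[2],
            a[0]*b[1] - a[1]*b[0])

def _dot(a, b):
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) defined by four consecutive atoms,
    e.g. C_{i-1}, N_i, Ca_i, C_i for phi_i."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)      # normals of the two planes
    nb2 = math.sqrt(_dot(b2, b2))
    m1 = _cross(n1, tuple(c / nb2 for c in b2))  # frame vector orthogonal to n1
    # atan2 gives a signed angle in (-180, 180], matching phi/psi conventions
    return math.degrees(math.atan2(_dot(m1, n2), _dot(n1, n2)))
```

For example, four atoms forming a right-handed quarter turn yield 90°, while a trans (planar, opposite-side) arrangement yields ±180°, consistent with the near-180° ω angles noted above.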

Besides the representation method discussed above, protein backbone structures can also be represented by their Cα atoms alone, since successive Cα atoms are almost equidistant (about 3.8 Å apart). In this case, instead of φ, ψ, and ω, two other angles θ and τ are used: θ is a planar angle formed by three consecutive Cα atoms, and τ is a dihedral angle formed by four consecutive Cα atoms. Since multiple residues are needed to define θ and τ, these angles partially capture local structure.
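In the same illustrative spirit, θ is just the ordinary angle at the middle of three consecutive Cα positions (τ is then a dihedral over four Cα atoms, computed exactly like φ and ψ above):

```python
import math

def planar_angle(a, b, c):
    """Planar angle theta (degrees) at atom b, formed by three
    consecutive C-alpha atoms a, b, c."""
    u = (a[0] - b[0], a[1] - b[1], a[2] - b[2])
    v = (c[0] - b[0], c[1] - b[1], c[2] - b[2])
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return math.degrees(math.acos(dot / (nu * nv)))
```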

In this work, we develop deep neural network (DNN) models to predict the backbone angles φ, ψ, θ, and τ of proteins. Protein backbone angle prediction (BAP) has made significant progress with the development of DNNs. Yet more accurate BAP is needed, since an error in any angle of a protein has a cascading effect on the entire protein structure.

In BAP, DNN variants such as stacked sparse auto-encoder neural networks [5], long short-term memory (LSTM) bidirectional recurrent neural networks (BRNNs) [6,7,8], residual networks (ResNets) [7], DNN ensembles [7, 8], and layered iterations [9] have been used.

Input features used in BAP include the widely used position-specific scoring matrices (PSSM) generated by PSI-BLAST [5,6,7, 9,10,11,12]; 7 physicochemical properties (7PCP) [5,6,7, 9, 11], namely steric parameter (graph shape index), hydrophobicity, volume, polarisability, isoelectric point, helix probability, and sheet probability [13]; predicted accessible surface area (ASA) [5, 12]; hidden Markov model (HMM) profiles [7, 11, 14] produced by HHblits [15]; contact maps [7]; and PSP19 [8].

Both capturing local structures around residues and capturing long-range interactions between residues have been considered in BAP. Sliding windows [5, 6, 9, 12] around residues have been used in feature encoding to capture local structures. On the other hand, entire protein sequences have been used as input [9, 11, 16] to capture long-range interactions. Convolutional neural networks (CNNs) [8, 14] and LSTM-BRNNs [6, 7] have also been used to capture long-range interactions.
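The sliding-window encoding mentioned above can be sketched as follows; this is a generic illustration (function name, window size, and zero-padding choice are ours, not taken from any of the cited methods):

```python
def window_features(per_residue_feats, half_window=8, pad_value=0.0):
    """Encode each residue by concatenating the feature vectors of the
    residues inside a sliding window centred on it. Positions that fall
    outside the sequence are filled with a constant pad vector."""
    n = len(per_residue_feats)
    d = len(per_residue_feats[0])
    pad = [pad_value] * d
    encoded = []
    for i in range(n):
        row = []
        for j in range(i - half_window, i + half_window + 1):
            row.extend(per_residue_feats[j] if 0 <= j < n else pad)
        encoded.append(row)
    return encoded
```

Each residue thus gets a fixed-length vector of (2 * half_window + 1) * d values, which suits fixed-input models, whereas whole-sequence inputs are what allow CNNs and LSTM-BRNNs to see long-range interactions.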

For benchmark datasets, we refer to PISCES [17], SPOT-1D [7, 18], PDB150 [19], and CAMEO93 [20]. The first two are large, with respectively 5.5K and 12.5K proteins and 1.2M and 2.7M residues. The last two are small, with 150 and 93 proteins respectively, and are used for testing.

Proteins locally exhibit three major secondary structure (SS) types: helices, sheets, and coils. This three-state classification can be extended to an eight-state classification. Some SS types are associated with narrow angle ranges; for example, helices and sheets have ranges of about 20° for φ and ψ. Because of these narrow ranges, BAP could essentially be viewed as a classification problem via SS type prediction, although backbone angles are actually continuous-valued. Unfortunately, coils have no such narrow ranges, and they account for about 40% of the residues in an average protein [21]. So SS prediction does not make BAP trivial. SS prediction has made significant progress via DNN models [8, 11, 22,23,24,25,26] and ab initio methods [27]. SSpro8 [28] achieves respectively 92% and 79% accuracy on proteins with and without homologs in the Protein Data Bank (PDB).
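To make the classification view concrete, here is a toy 3-state assignment from (φ, ψ); the numeric ranges are rough, illustrative Ramachandran regions chosen by us, not the definition used by DSSP or by any cited predictor:

```python
def ss_from_angles(phi, psi):
    """Toy 3-state SS assignment from backbone angles in degrees.
    The thresholds below are illustrative approximations only."""
    if -80 <= phi <= -48 and -60 <= psi <= -25:   # approx. alpha-helix region
        return 'H'
    if -150 <= phi <= -90 and 90 <= psi <= 150:   # approx. beta-sheet region
        return 'E'
    return 'C'                                    # everything else: coil
```

The inverse direction is what makes BAP hard: knowing a residue is 'H' or 'E' narrows φ and ψ to tight ranges, but knowing it is 'C' constrains the angles very little.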

Predicted SS types have been used as features in deep learning for BAP [5, 9, 12, 29]. Features in deep learning in general only implicitly capture problem characteristics, and the neural network model then attempts to establish the unknown input-output relation, again only implicitly. Moreover, any machine learning method strives to achieve generality over the training examples and consequently loses accuracy in the process. While generic artificial intelligence (AI) methods can be adapted to a range of problems easily, they usually suffer from the loss of explicit problem-specific knowledge. So explicit exploitation of any available knowledge is of great importance in AI: it can bridge the gap between the generality of the approach and the specificity of the problem. Inspired by this, we attempt to explicitly exploit predicted SS knowledge in BAP. Specifically, we train separate deep learning models for each SS category. This restricts generalisation to within each specific class of training examples and thus compensates for the loss of generalisation by exploiting specialisation knowledge in an informed way.
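The per-SS-class training scheme can be sketched as follows. This is our own minimal stand-in: the per-class "model" here is just a mean-angle baseline in place of the DNNs described in the text, and the class names, method names, and fit/predict interface are assumptions for illustration:

```python
class PerSSPredictor:
    """Sketch of training one model per 3-state SS class ('H', 'E', 'C')
    and dispatching on the (predicted) SS label at inference time."""

    def fit(self, features, ss_labels, angles):
        # Partition the training residues by SS label and fit one model
        # per class. A real implementation would train a DNN on the
        # per-residue features; here a class-mean angle stands in for it.
        self.models = {}
        for ss in set(ss_labels):
            ys = [a for a, s in zip(angles, ss_labels) if s == ss]
            self.models[ss] = sum(ys) / len(ys)
        return self

    def predict(self, feature, ss_label):
        # Dispatch to the model trained on this residue's SS class.
        return self.models[ss_label]
```

The key point is the dispatch: each model only ever generalises over residues of one SS class, which is how specialisation knowledge is injected.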

We name our new BAP method Simpler Angle Predictor for Secondary Structures (SAP4SS); its DNN models are similar to those of a very recent BAP method named SAP [30]. Like SAP, SAP4SS has simpler DNN models than other recent methods such as OPUS-TASS [8] and SPOT-1D [7]. SAP4SS uses the same fully connected neural network (FCNN) architecture as SAP, while OPUS-TASS and SPOT-1D use ensembles of LSTM-BRNNs and ResNets. SAP4SS uses more features than SAP but fewer than OPUS-TASS and SPOT-1D. While SAP is trained on all residues, SAP4SS has a separate DNN model for the residues of each 3-state SS type.

On well-known benchmark datasets, SAP4SS obtains mean absolute error (MAE) values of 15.59°, 18.87°, 6.03°, and 21.71° respectively for φ, ψ, θ, and τ predictions. As a result, SAP4SS significantly outperforms the existing state-of-the-art methods SAP, SPOT-1D, and OPUS-TASS: the differences in MAE range from 1.5 to 4.1% compared to the best known results. The SAP4SS program along with its data is available from the website.
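MAE for angles is normally computed with the periodicity of the circle in mind, so that a prediction of 179° for a true −179° costs 2° rather than 358°. A minimal sketch (the function name and interface are ours, not from the SAP4SS paper):

```python
def angular_mae(pred, true):
    """Mean absolute error over angle pairs in degrees, taking the
    shorter arc around the circle for each error."""
    errs = []
    for p, t in zip(pred, true):
        d = abs(p - t) % 360.0
        errs.append(min(d, 360.0 - d))   # wrap-around: never more than 180
    return sum(errs) / len(errs)
```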

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.

