To demonstrate the effectiveness of the proposed model, comprehensive experiments are conducted. We first describe the schizophrenic speech data set and the implementation details. Next, ablation studies are presented to demonstrate the advantages of each component of the proposed Sch-net. Then, comparisons with state-of-the-art methods based on feature engineering and deep learning techniques are conducted and analyzed. The network is also visualized using Grad-CAM. Finally, to further validate the generalization of the proposed method, classification experiments on the LANNA children speech database are conducted.
Schizophrenic data set
Our study includes 28 schizophrenic patients (18 females and 10 males) and 28 matched healthy controls (18 females and 10 males). The schizophrenic group has a mean age of 40.6 years (SD 9.4 years), and the control group has a mean age of 36.5 years (SD 9.1 years). All subjects are native Mandarin speakers, and none has a past or current disease affecting the speaking process. Patients were recruited from the Psychiatry Department of the Mental Health Center, Sichuan University. This department is one of the four major mental health centers in China. The schizophrenic group was diagnosed by clinicians based on the Diagnostic and Statistical Manual of Mental Disorders, Fifth Edition (DSM-5), which outlines concise and explicit criteria for the diagnosis of schizophrenia. All subjects provided written informed consent.
The data set is composed of audio signals recorded in 16-bit mono/dual format at a sampling rate of 44.1 kHz. Participants are asked to complete a reading task. There are four texts expressing calm, happiness, anger, and fear, and each text comprises 8–10 sentences. We select a fixed sentence for each emotional recording, and the transcriptions of the speech signals are listed in Table 1.
In this study, all audios are converted to spectrograms using the Short-time Fourier Transform (STFT). To improve invariance to geometric perturbations and noise, data augmentation methods are utilized, including random crop, random rotation, random rescaling, random Gaussian noise, masking blocks of frequency channels, and masking blocks of time steps.
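As an illustration, the following sketch (not the authors' exact pipeline) shows how a recording could be converted to a log-magnitude STFT spectrogram and augmented with frequency and time masking using torchaudio; the file name, STFT settings, mask sizes, and resizing step are assumptions.

```python
import torch
import torchaudio
import torchaudio.transforms as T

waveform, sr = torchaudio.load("speech_sample.wav")   # hypothetical recording
waveform = waveform.mean(dim=0, keepdim=True)         # mix mono/dual channels to mono

# STFT power spectrogram converted to a log (dB) scale
spec = T.Spectrogram(n_fft=1024, hop_length=256, power=2.0)(waveform)
log_spec = T.AmplitudeToDB()(spec)

# SpecAugment-style masking of frequency channels and time steps
log_spec = T.FrequencyMasking(freq_mask_param=15)(log_spec)
log_spec = T.TimeMasking(time_mask_param=30)(log_spec)

# Resize to the 128 x 256 input expected by Sch-net
net_input = torch.nn.functional.interpolate(
    log_spec.unsqueeze(0), size=(128, 256), mode="bilinear", align_corners=False
)
print(net_input.shape)  # torch.Size([1, 1, 128, 256])
```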
The input image of the Sch-net has a size of 128 × 256 pixels. Table 2 shows the Sch-net architecture details. In this architecture, the size of each filter in the Conv layers is set to 3 × 3. There are 64, 128, 256, and 512 filters in the first to fourth Conv layers, respectively. In addition, there are 512 filters in the three skip connections. The convolved feature maps are passed through a ReLU activation in the Conv blocks. The max pooling and average pooling in the pooling layers are performed over 2 × 2 regions with a stride of 2. In the CBAM, 2048 filters of size 7 × 7 are used to highlight effective features. The highlighted features are convolved with 512 filters of size 3 × 3. In the FC neural network, there are 512 neurons in the first hidden layer and 2 neurons in the second layer. The final output is a vector of probabilities that the input sample belongs to each class.
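For illustration only, the simplified PyTorch sketch below assembles the main ingredients named above (3 × 3 Conv blocks with pooling, a CBAM-style spatial attention block built around a 7 × 7 convolution, and a two-layer FC head); it does not reproduce the exact wiring of the skip connections or the full CBAM of Sch-net, and all layer choices beyond those stated in the text are assumptions.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """3x3 convolution, ReLU, and 2x2 pooling with stride 2."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(kernel_size=2, stride=2),
    )

class SpatialAttention(nn.Module):
    """CBAM-style spatial attention using a 7x7 convolution."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        avg_map = x.mean(dim=1, keepdim=True)
        max_map, _ = x.max(dim=1, keepdim=True)
        attn = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * attn

class SchNetSketch(nn.Module):
    """Simplified stand-in for Sch-net, not the authors' exact architecture."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.backbone = nn.Sequential(
            conv_block(1, 64), conv_block(64, 128),
            conv_block(128, 256), conv_block(256, 512),
        )
        self.attention = SpatialAttention()
        self.refine = nn.Conv2d(512, 512, kernel_size=3, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(nn.Linear(512, 512), nn.ReLU(inplace=True),
                                nn.Linear(512, num_classes))

    def forward(self, x):                      # x: (N, 1, 128, 256)
        feats = self.backbone(x)
        feats = self.refine(self.attention(feats))
        return self.fc(self.pool(feats).flatten(1))

model = SchNetSketch()
out = model(torch.randn(2, 1, 128, 256))
print(out.shape)  # torch.Size([2, 2])
```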
In all experiments, the binary cross-entropy is adopted as the loss function, and Adam is used as the optimization algorithm. All experiments are implemented based on the PyTorch framework and trained on a workstation with Intel(R) Xeon(R) CPU E5-2680 v4 2.40 GHz processors and an NVIDIA Tesla P40 (24 GB) installed. The network is trained with a batch size of 16 for 50 epochs. The initial learning rate is set to 0.0003 and is divided by 10 after 25 epochs.
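A minimal sketch of this training configuration is given below. Only the loss, optimizer, batch size, epoch count, and learning-rate schedule are taken from the text; the dummy data and the tiny stand-in model are placeholders for the spectrogram data set and Sch-net.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

# Dummy data standing in for the spectrogram data set (shapes only)
dummy_x = torch.randn(32, 1, 128, 256)
dummy_y = torch.randint(0, 2, (32,))
train_loader = DataLoader(TensorDataset(dummy_x, dummy_y), batch_size=16, shuffle=True)

model = nn.Sequential(nn.Flatten(), nn.Linear(128 * 256, 2))   # stand-in for Sch-net
criterion = nn.BCEWithLogitsLoss()                              # binary cross-entropy
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=25, gamma=0.1)

for epoch in range(50):
    for spectrograms, labels in train_loader:
        optimizer.zero_grad()
        logits = model(spectrograms)                             # (N, 2)
        targets = F.one_hot(labels, num_classes=2).float()
        loss = criterion(logits, targets)
        loss.backward()
        optimizer.step()
    scheduler.step()                                             # lr / 10 after epoch 25
```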
Ablation study
In this subsection, the effectiveness of our network is verified. The backbone of Sch-net is a CNN, to which skip connections are added to enrich the feature information. In addition, the CBAM is applied to emphasize the more effective features with larger weights. In this ablation study, we evaluate the contributions of these two key components to discriminating schizophrenic patients from healthy controls. To evaluate the performance of Sch-net and its components (backbone, skip connection, and CBAM), we run 30 iterations of tenfold cross-validation and compute seven metrics (accuracy, precision, recall, F1-score, sensitivity, specificity, and Area Under the ROC Curve (AUC)) for each model. The 95% Confidence Intervals (CIs) for the metrics are listed in Table 3, and the box plots of classification accuracies are shown in Fig. 1.
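The sketch below illustrates, with scikit-learn and a stand-in classifier, how repeated tenfold cross-validation and a normal-approximation 95% CI for accuracy could be computed; the paper does not state its exact CI construction, so this is only one plausible choice, and the synthetic data and logistic regression are placeholders.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Placeholder features/labels standing in for the speech data set
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# 30 repetitions of stratified tenfold cross-validation
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=30, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv,
                         scoring="accuracy")

mean = scores.mean()
half_width = 1.96 * scores.std(ddof=1) / np.sqrt(len(scores))   # normal approximation
print(f"accuracy: {mean:.4f} "
      f"(95% CI: {mean - half_width:.4f}-{mean + half_width:.4f})")
```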
Each box plot in Fig. 1 shows five values (the median, the upper and lower quartiles, and the minimum and maximum) to display the distribution of classification accuracies for each model. As can be seen in Table 3 and Fig. 1, the skip connections enrich the information in the feature maps and improve the classification accuracy by 1.71% on the schizophrenic speech data set. The CBAM selects meaningful features for classification and improves accuracy by 2.40%. A significant improvement of 4.45% in classifying schizophrenic speech and normal speech is achieved when both the skip connections and the CBAM are added to the backbone network. The proposed Sch-net combines the advantages of skip connections and the CBAM, achieving better performance on the classification task.
Comparison with the models based on feature engineering and classifiers
Previous studies on automatic schizophrenic speech detection [28,29,30,31,32,33,34,35,36,37,38] are mostly based on feature engineering and pattern recognition technology. In this subsection, the performance of combinations of feature engineering and classifiers is presented and analyzed. Four types of acoustic features are extracted: time-domain features, FFT-based spectral features, auditory-based spectral features, and spectral envelope features. Four classifiers are adopted, including random forest (RF), k-nearest neighbor (KNN), support vector machine (SVM), and linear discriminant analysis (LDA).
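As a sketch of this evaluation protocol, the snippet below runs the four classifiers with scikit-learn on a placeholder feature matrix; the feature sets themselves are described in the following paragraphs, and the default hyperparameters shown here are assumptions, not the authors' settings.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder acoustic features and labels (56 subjects in the real data set)
features, labels = make_classification(n_samples=112, n_features=40, random_state=0)

classifiers = {
    "RF": RandomForestClassifier(random_state=0),
    "KNN": KNeighborsClassifier(),
    "SVM": make_pipeline(StandardScaler(), SVC()),
    "LDA": LinearDiscriminantAnalysis(),
}
for name, clf in classifiers.items():
    acc = cross_val_score(clf, features, labels, cv=10, scoring="accuracy").mean()
    print(f"{name}: {acc:.3f}")
```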
Time-domain features used in this work include short-term energy (STE), pitch, and fluency features. The STE of a speech signal reflects the amplitude variation, and the pitch reflects the vibration of the vocal cords during pronunciation. The fluency features reflect the degree of coherence in expression. Considering the reduced syntactic complexity and abnormal pauses in schizophrenic speech, five fluency features (total recording time, total length of voice segments, ratio of voice segments, maximum pause duration, and mean syllable length) are employed to construct a feature set.
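The sketch below illustrates, with librosa, how such time-domain descriptors could be computed: short-term energy from framed samples, pitch via the pYIN tracker, and two simple fluency measures derived from an energy threshold. The file name, frame settings, and threshold are assumptions rather than the authors' configuration.

```python
import numpy as np
import librosa

y, sr = librosa.load("speech_sample.wav", sr=None)     # hypothetical recording

# Short-term energy per frame
frames = librosa.util.frame(y, frame_length=1024, hop_length=256)
ste = np.sum(frames ** 2, axis=0)

# Pitch contour via the pYIN tracker (NaN for unvoiced frames)
f0, voiced_flag, _ = librosa.pyin(y, fmin=librosa.note_to_hz("C2"),
                                  fmax=librosa.note_to_hz("C6"), sr=sr)
print(f"mean pitch {np.nanmean(f0):.1f} Hz")

# Crude energy-based voice-activity mask for simple fluency measures
voiced = ste > 0.05 * ste.max()
total_time = len(y) / sr
voiced_time = voiced.sum() * 256 / sr
print(f"total {total_time:.2f}s, voiced {voiced_time:.2f}s, "
      f"voiced ratio {voiced_time / total_time:.2f}")
```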
FFT-based features refer to features computed via the STFT. In this work, two FFT-based features are adopted: the spectrogram and the long-term average spectrum (LTAS). The LTAS describes the resonance characteristics by averaging the short-term Fourier magnitude spectra, and it has shown promising performance in speech sentiment analysis and pathological speech analysis [46,47,48].
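A minimal sketch of the two FFT-based features follows: the STFT magnitude spectrogram and the LTAS obtained by averaging the short-term magnitude spectra over time. The STFT settings are assumptions.

```python
import numpy as np
import librosa

y, sr = librosa.load("speech_sample.wav", sr=None)            # hypothetical recording
stft = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))     # magnitude spectra

spectrogram_db = librosa.amplitude_to_db(stft, ref=np.max)            # 2-D feature
ltas_db = librosa.amplitude_to_db(stft.mean(axis=1), ref=np.max)      # 1-D LTAS
print(spectrogram_db.shape, ltas_db.shape)
```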
Auditory-based features are proposed to simulate the clinical diagnosis. Schizophrenia is diagnosed by clinicians through a comprehensive evaluation of speech and behavior. Therefore, it is necessary to analyze speech signals in combination with human auditory characteristics. In this study, the MFCC and its modification, the Gammatone cepstral coefficient (GTCC), are extracted to detect schizophrenia. The MFCCs and GTCCs are computed using a series of filters designed according to the frequency response characteristics of the human auditory system.
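For example, MFCCs can be extracted with librosa as sketched below; GTCCs follow the same recipe but replace the mel filterbank with a gammatone filterbank, which librosa does not provide and which is therefore not shown. The frame settings and the choice of 13 coefficients are assumptions.

```python
import numpy as np
import librosa

y, sr = librosa.load("speech_sample.wav", sr=None)           # hypothetical recording
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=1024, hop_length=256)

# One fixed-length feature vector per recording: per-coefficient mean and std
mfcc_features = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
print(mfcc_features.shape)   # (26,)
```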
Spectral envelope features are also commonly used to describe the vocal tract characteristics in speech production. In this work, linear prediction (LP) and its variants, stabilized weighted linear prediction (SWLP) and extended weighted linear prediction (XLP), are tested on the schizophrenic speech data set. SWLP is an improved WLP that models speech by applying temporal weighting of the square of the residual signal. XLP is a further generalization of the WLP and SWLP methods, which allows temporal weighting on a finer time scale. SWLP and XLP have performed well on speech recognition tasks and pathological speech detection [52, 53].
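As a simple illustration, a plain LP spectral envelope can be computed as below, with the order of 38 used later in the experiments; SWLP and XLP add temporal weighting schemes that are not implemented in this sketch, and the analysis frame length is an assumption.

```python
import numpy as np
import librosa
from scipy.signal import freqz

y, sr = librosa.load("speech_sample.wav", sr=None)       # hypothetical recording
frame = y[:int(0.03 * sr)]                                # one 30 ms analysis frame

a = librosa.lpc(frame, order=38)                          # LP coefficients A(z)
w, h = freqz(b=[1.0], a=a, worN=512, fs=sr)               # all-pole envelope 1/A(z)
envelope_db = 20 * np.log10(np.abs(h) + 1e-10)
print(envelope_db.shape)   # (512,)
```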
The features mentioned above, combined with the four classifiers, are tested on the schizophrenic speech data set. The overall performances are listed in Table 4 in terms of accuracy, precision, recall, and F1-score. The bold font in Table 4 indicates the highest value within each feature type across the different classifiers. It can be seen that the fluency features, spectrogram, GTCC, and XLP achieve the highest F1-scores in their corresponding feature groups. Comparing the results in Tables 3 and 4, it can be seen that the proposed Sch-net performs better than the models based on feature engineering and classifiers.
Time-domain feature
As shown in Table 4, the F1-score of schizophrenic speech detection using the STE reaches 0.6306. Owing to the difficulty in expression for schizophrenic patients, the intensity of schizophrenic speech tends to be lower than that of controls. The STE feature can describe the intensity of speech, but it may be influenced by differing distances between the recording equipment and the speakers. Thus, the performance of the STE feature is not as good as that of the fluency features.
Though studies [28, 30,31,32] have shown that there are significant differences in pitch between schizophrenic speech and normal speech, the pitch feature yields the worst performance among the time-domain features. This result is consistent with the findings in [30, 37], in which the distribution of pitch shows no significant differences between the two groups.
The fluency features perform well on schizophrenic speech detection, owing to the disordered thought and language impairments of patients. Cognitive impairment also contributes to the incoherence of speech.
FFT-based spectral feature
The LTAS achieves 62.11% accuracy on the schizophrenic speech data set. The LTAS is calculated as the average of a spectrogram, reflecting the spectrum of the glottal source and vocal tract. Previous results have shown that schizophrenic speech has less variation in energy than normal speech. The unexpectedly low accuracy of the LTAS may be caused by the averaging operation, which eliminates the differences in variation between the two groups.
The spectrogram, the time-frequency representation of speech, achieves better performance than the LTAS. It not only contains the energy distribution across frequency bands but also reflects the pitch and formant information. It has been shown that schizophrenic speech has less variability in pitch and voice intensity and a smaller range of the second formant than normal speech [28, 30,31,32]. Thus, the spectrogram covers more effective features for discriminating patients from controls than the LTAS does.
Auditory-based spectral feature
The GTCC achieves better performance than the MFCC on the schizophrenic speech detection task, which can be attributed to the different auditory filters used. The MFCC is computed based on a series of triangular bandpass filters with equal bandwidth. The GTCC employs Gammatone filters, which have equivalent rectangular bandwidths, to model the human auditory response. The use of Gammatone filters minimizes the loss of spectral information and increases the correlation among the outputs of the filters. Therefore, the GTCC contains more effective information for detecting schizophrenia than the MFCC.
Spectral envelope feature
The F1-scores of schizophrenic speech detection using LP, SWLP, and XLP are above 0.9. SWLP and XLP obtain slightly better results than LP. The results for the spectral envelope features are obtained with the LP order set to 38 [57, 58]. Results in [32, 37] have shown that formants are an indicator for distinguishing schizophrenic speech from that of controls. LP reflects the characteristics of the vocal tract, such as the formant frequencies. However, LP analysis relies on the excitation signal, which is usually affected by the harmonics. SWLP reduces this effect by concentrating the temporal weights on the closed-phase interval of the glottal cycle. In addition, XLP refines the time scale of the spectral envelope by weighting each lagged speech sample separately. SWLP and XLP highlight the formant information that can be used to distinguish patients from controls. Thus, SWLP and XLP achieve better performance in classifying schizophrenic patients and controls than LP.
Comparison with classic deep neural networks
In this subsection, comparisons between five classic neural networks and our model are conducted. The five networks are AlexNet, VGG16, ResNet34, DenseNet121, and Xception, which are commonly used for speech recognition and classification tasks [64,65,66,67,68]. AlexNet is the winner of the ImageNet Large Scale Visual Recognition Challenge in 2012; it reduces overfitting and controls the model complexity of the FC layers using dropout. VGG16 is a good benchmark architecture for classification tasks, consisting of 13 Conv layers, 3 FC layers, and 5 pooling layers. ResNet34 is introduced to alleviate the degradation problem caused by increasing the number of stacked layers by adding shortcut connections. To reduce the impact of vanishing gradients, DenseNet121 connects each layer to every other layer in a feed-forward fashion. DenseNet121 can also strengthen the propagation of features and reduce the number of parameters. To obtain fast convergence and strong expressive ability, Xception replaces the Inception modules with depthwise separable convolutions in a deep CNN. Table 5 lists the 95% CIs for the seven metrics of classifying schizophrenic speech and normal speech using the five deep neural networks and our method. Fig. 2 presents the box plots of the classification accuracies for these models.
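For reference, the sketch below shows one way a torchvision baseline could be adapted to the same single-channel 128 × 256 spectrogram input with two output classes, using ResNet34 as an example; Xception is not included in torchvision and would need a separate implementation (e.g., from timm). The adaptation details are assumptions, not the authors' setup.

```python
import torch
import torch.nn as nn
from torchvision import models

resnet = models.resnet34(weights=None)                       # train from scratch
resnet.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2,     # accept 1-channel input
                         padding=3, bias=False)
resnet.fc = nn.Linear(resnet.fc.in_features, 2)              # two output classes

out = resnet(torch.randn(2, 1, 128, 256))
print(out.shape)   # torch.Size([2, 2])
```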
As shown in Table 5 and Fig. 2, the accuracies of schizophrenic speech detection using AlexNet and VGG16 are 92.72% (95% CI: 92.49–92.95%) and 92.47% (95% CI: 92.25–92.69%), respectively. AlexNet and VGG16 are relatively shallow, which leads to insufficient information in the feature maps. ResNet34 achieves 94.39% (95% CI: 93.98–94.80%) accuracy on the schizophrenic speech data set, owing to the introduction of the residual module. DenseNet121 and Xception obtain slightly better results than ResNet34, because these networks not only adopt shortcut connections but also utilize dense connections and depthwise separable convolutions, respectively, to make more efficient use of model parameters. The proposed Sch-net achieves better performance than the five networks, because it captures local and global features simultaneously via the CBAM and skip connections. Its feature maps therefore contain richer information for distinguishing schizophrenic speech from that of controls.
Network visualization using Grad-CAM
In recent years, deep learning methods have achieved accuracy approaching that of manual diagnosis in many fields, driven by improved computing capabilities and larger data sets. They can simplify and speed up diagnosis and reduce the workload of doctors. However, the process of generating predicted labels from input data is still uninterpretable. To make the decision-making process of deep learning transparent, this work applies Grad-CAM to Sch-net using speech samples from the schizophrenic group and the healthy group. Grad-CAM is a visualization method that shows the importance of each neuron for the classification using the gradient information in the last Conv layer. Grad-CAM highlights the more discriminative parts as brighter regions in the heatmap. We examine how Sch-net makes use of the features by observing the spectrograms and activation maps. In this subsection, the input spectrograms and the corresponding activation maps generated in the last Conv layer for normal speech and schizophrenic speech are shown in Fig. 3.
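A minimal Grad-CAM sketch in PyTorch is given below: forward and backward hooks on the last convolutional layer collect activations and gradients, which are combined into a class-discriminative heatmap. The tiny stand-in CNN is only a placeholder for Sch-net, and the input is a random tensor in place of a real spectrogram.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in CNN; its second Conv layer plays the role of the last Conv layer.
model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 2),
).eval()
target_layer = model[3]
store = {}

# Hooks collect the layer's activations and the gradients flowing back into it.
target_layer.register_forward_hook(lambda m, i, o: store.update(act=o.detach()))
target_layer.register_full_backward_hook(lambda m, gi, go: store.update(grad=go[0].detach()))

x = torch.randn(1, 1, 128, 256)                       # stand-in spectrogram input
logits = model(x)
model.zero_grad()
logits[0, logits.argmax()].backward()                 # gradient of the predicted class

weights = store["grad"].mean(dim=(2, 3), keepdim=True)        # pooled gradients
cam = F.relu((weights * store["act"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)      # heatmap in [0, 1]
print(cam.shape)                                      # torch.Size([1, 1, 128, 256])
```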
In Fig. 3, the spectrograms of normal speech and schizophrenic speech are shown in (a) and (c), respectively, and the activation maps of normal speech and schizophrenic speech are depicted in (b) and (d). Brighter regions in the spectrograms indicate more concentrated energy, and brighter regions in the activation maps indicate larger weights.
As shown in Fig. 3a, c, schizophrenic speech and normal speech have different distributions of concentrated energy in the spectrogram. Through horizontal comparison, two findings about the two groups can be observed in this figure, as follows:
The energy concentration in the frequency domain of schizophrenic speech lies almost entirely below 5000 Hz, while normal speech has a wider range of energy concentration bands, which can extend from 8000 to 10,000 Hz. Blunted affect is a typical symptom of schizophrenia. Patients with negative symptoms may speak with a dull, monotone voice, resulting in a small range of the energy concentration region, whereas healthy controls have more flexible emotional expression. Angry, fearful, and happy speech exhibits higher intonation, a faster speaking rate, and more energy at higher frequencies, and sad speech changes slowly and concentrates its energy at lower frequencies. Thus, normal speech has a wider range of energy distribution than schizophrenic speech.
It can be seen that schizophrenic speech and normal speech both have concentrated energy regions and apparent horizontal formant stripes in the low-frequency bands below 2000 Hz. The difference between the two groups is the shape of these formant stripes. For schizophrenic speech, the stripes are almost continuous, which is inconsistent with the energy distribution characteristics of vowels and consonants. Vowels have energy concentrated in both the low- and high-frequency ranges. Unvoiced consonants mainly have high-frequency energy components and rarely have formants. According to the texts used in this work, the continuous speech signals comprise both vowels and consonants. Therefore, short interruptions of the formant stripes are expected in the spectrogram. We conjecture that the continuous stripes in the spectrogram of schizophrenic speech may be caused by incorrect placement of the articulators during speech production, which leads to unvoiced consonants being produced as voiced consonants.
Observing both the spectrograms and their corresponding activation maps in Fig. 3, it can be seen that Sch-net captures the features in the high-frequency bands for normal speech and gives larger weights to the features in the low-frequency bands for schizophrenic speech. The behavior of Sch-net is consistent with human visual perception, which is difficult to achieve using models based on feature engineering. Sch-net has excellent ability to learn and extract features, and it achieves better performance on schizophrenic speech detection than the traditional feature engineering models adopted in this work.
Further validation of the proposed Sch-net using LANNA children speech database
Schizophrenia is a neurodevelopmental disorder affecting the language expression of patients. SLI, also termed developmental dysphasia, is described as a neurological disorder of the brain [77,78,79,80]. Patients with SLI exhibit delayed language acquisition, slower linguistic processing, and difficulties in grammar or specific subcomponents of grammar [83, 84]. To further validate the effectiveness and generalization of the model, in this subsection the Sch-net is tested on the LANNA children speech database for the classification of patients with SLI and healthy controls.
The LANNA children speech database is the first and only publicly available speech corpus for children with SLI. It comprises 2173 speech signals from 54 children with SLI (aged from 6 to 11 years) and 1680 speech signals from 44 controls (aged from 6 to 10 years). The data set is composed of 13 parts: vowels, consonants, syllables, six types of words, sentences, auditory differentiation, and description of a picture. The audios were recorded in a schoolroom and a consulting room using a Dictaphone, an MD recorder, and a microphone. The background noise in these natural environments affects the quality of the speech signals, leading to difficulties in speech signal processing.
Previous studies [85,86,87,88,89,90,91] have demonstrated that speech can serve as a marker for diagnosing SLI. In [85,86,87], 1582 acoustic features were extracted from 34 low-level descriptors and their 21 statistical functionals. The features were given as inputs to an SVM, achieving 96.94% accuracy on the LANNA children speech database. In another study, Gaussian posteriorgrams trained on MFCC features were employed to discriminate patients with SLI from healthy controls; a kernel extreme learning machine was trained on the speech signals and achieved an accuracy of 99.41% on the test data. Apart from MFCC, Tonnetz and Chroma features were also calculated and combined with SVM, RF, and a Recurrent Neural Network to detect SLI, reaching accuracies of 70% and 71%, respectively. In these studies [85,86,87,88,89], high accuracies were achieved for speaker-dependent classification.
In contrast, methods for speaker-independent classification were proposed in [90, 91]. In one approach, the top 20 LPC features were selected from 408 LPCs using the Mann–Whitney U test and Spearman's correlation, achieving an accuracy of 97.90% on the SLI detection task. In the other, a feed-forward neural network was proposed for classifying patients with SLI and healthy controls; glottal features and MFCCs were adopted as the inputs of the network, and the classification accuracy reached 98.82%.
In this subsection, fivefold cross-validation is employed: the SLI data set is divided into 80% for training and 20% for testing in each fold. Table 6 gives the classification results using state-of-the-art methods, classic deep neural networks, and the proposed Sch-net. As can be seen, our method outperforms both the classic deep neural networks and the state-of-the-art methods. The proposed Sch-net can extract discriminative features from speech signals for classifying healthy individuals and those suffering from SLI.
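For completeness, a sketch of such a fivefold split with scikit-learn is shown below; X and y are placeholders for the LANNA features and SLI/control labels, using the sample counts stated above.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([1] * 2173 + [0] * 1680)        # 2173 SLI vs. 1680 control signals
X = np.zeros((len(y), 1))                    # placeholder feature matrix

# Each fold holds out roughly 20% of the samples for testing
for fold, (train_idx, test_idx) in enumerate(
        StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y)):
    print(f"fold {fold}: {len(train_idx)} train / {len(test_idx)} test")
```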