The most widely used dataset for the SER models above is the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus [27]. Therefore, to explore the effect of black-box attacks on these models, we also use this dataset in our study. It contains 12 hours of emotional speech performed by 10 actors from the Drama Department of the University of Southern California. The recordings are divided into two parts, improvised and scripted, according to whether the actors follow a fixed script. The utterances are labeled with nine emotion categories: anger, happiness, excitement, sadness, frustration, fear, surprise, other, and neutral. Because annotators may disagree, a single utterance can carry multiple labels; we consider only the label that has majority agreement. In previous studies [28-30], owing to class imbalance in the dataset (happy utterances are comparatively rare), researchers usually select the more common emotions such as neutral, sadness, and anger; since excitement and happiness are similar, excitement is either relabeled as happiness or merged with happiness to enlarge the data. In this paper, we likewise use the four emotions neutral, excitement, sadness, and anger from the IEMOCAP dataset.

Evaluation metrics

Recognition performance of the SER models above is evaluated with weighted accuracy (WA) and unweighted accuracy (UA). As defined below, UA is the total number of correct predictions divided by the total number of samples, so each class contributes in proportion to its size, while WA averages the per-class accuracies, giving each class equal weight:

$$ UA = \frac{TP+TN}{P+N}, \qquad WA = \frac{1}{2}\left(\frac{TP}{P} + \frac{TN}{N}\right) $$


where P is the number of positive instances, N is the number of negative instances, and true positives (TP) and true negatives (TN) are the numbers of positive and negative samples predicted correctly, respectively. Following [31], since WA and UA may not reach their maxima in the same model, their average ACC is used as the final evaluation criterion (the smaller the ACC, the stronger the attack on the model). In addition, to measure the actual auditory effect of the transformed speech, we use automatic speech recognition (ASR) to compare the speech before and after processing, and compute the word error rate:

$$ WER = \frac{N_{sub} + N_{del} + N_{ins}}{N_{ref}} $$


where Nsub, Ndel, and Nins are the numbers of substitution, deletion, and insertion errors, respectively, and Nref is the number of words in the reference [22]. We compute the WER of the speech before and after the attack as the criterion for judging speech quality.
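The two accuracy measures and the WER can be sketched as follows. This is a minimal illustration matching the formulas above (UA as overall accuracy, WA as the mean of per-class recalls, WER via edit distance on word sequences); the paper's own evaluation code and ASR front-end are not specified, so treat this as an assumed reference implementation.

```python
import numpy as np

def ua_wa(y_true, y_pred):
    """UA: total correct / total samples; WA: mean of per-class recalls,
    following the definitions given in the text."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    ua = float(np.mean(y_true == y_pred))
    classes = np.unique(y_true)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in classes]
    wa = float(np.mean(recalls))
    return ua, wa

def wer(ref, hyp):
    """Word error rate: Levenshtein distance over words / reference length,
    i.e. (N_sub + N_del + N_ins) / N_ref."""
    r, h = ref.split(), hyp.split()
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=int)
    d[:, 0] = np.arange(len(r) + 1)   # deletions
    d[0, :] = np.arange(len(h) + 1)   # insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1, j - 1] + (r[i - 1] != h[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return d[len(r), len(h)] / len(r)
```

ACC is then simply `(ua + wa) / 2` for a given model, and the WER is computed on the ASR transcripts of the clean and attacked utterances.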

Evaluation setup

In the experiments, we randomly split the dataset into a training set (80%) and a test set (20%) for cross-validation. The three SER models described above are first trained on the training set and evaluated on the test set; the test set is then processed with the three black-box attack methods and the resulting attack effects are measured. Finally, adversarial training is applied to explore the robustness of the models.
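The 80/20 split can be sketched as below. The random seed and whether the split is stratified by emotion are assumptions, as the paper does not specify them.

```python
import random

def split_dataset(n_samples, train_frac=0.8, seed=0):
    """Random 80/20 split of utterance indices, as in the evaluation setup.
    Returns (train_indices, test_indices)."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)  # fixed seed for reproducibility (assumed)
    cut = int(train_frac * n_samples)
    return idx[:cut], idx[cut:]
```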

Evaluation results

Table 1 shows the recognition results of the three SER models on the IEMOCAP dataset. VTLN is first used to attack the three models, and Table 2 reports their performance under this adversarial attack. As the hyperparameter αvtln is increased, the attack success rate also increases. However, an excessive, obvious transformation changes the original speech content too much, which destroys the value of the speech, so the loss of speech quality, i.e. the growth of WER, must also be taken into account when selecting the best attack setting. In our experiments, the best trade-off is obtained at αvtln = 0.15: the WER increases from 11.23 to 21.40% as shown in Table 5, a 10.17% degradation in speech quality, while the recognition accuracy of the three models drops to about 10% (Fig. 6), indicating that the attack is highly effective against the emotion recognition systems.
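A VTLN-style transformation can be sketched as a warping of the frequency axis of each STFT frame. The linear warping function, the FFT parameters, and the resynthesis scheme below are simplifying assumptions for illustration; the paper's exact VTLN formulation may differ.

```python
import numpy as np

def vtln_warp(signal, alpha=0.15, n_fft=512, hop=128):
    """Sketch of a VTLN-style attack: stretch the frequency axis of each
    STFT frame by (1 + alpha), where alpha stands in for the paper's
    hyperparameter alpha_vtln, then resynthesise with the original phase."""
    # frame the signal and take the FFT (no analysis window, for brevity)
    n = (len(signal) - n_fft) // hop + 1
    frames = np.stack([signal[i * hop: i * hop + n_fft] for i in range(n)])
    spec = np.fft.rfft(frames, axis=1)
    bins = spec.shape[1]
    # source position on the original axis for each warped bin
    src = np.clip(np.arange(bins) * (1.0 + alpha), 0, bins - 1)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, bins - 1)
    w = src - lo
    mag = np.abs(spec)
    warped_mag = (1 - w) * mag[:, lo] + w * mag[:, hi]  # linear interpolation
    warped = warped_mag * np.exp(1j * np.angle(spec))   # keep original phase
    # overlap-add resynthesis
    out = np.zeros(len(signal))
    for i, fr in enumerate(np.fft.irfft(warped, n=n_fft, axis=1)):
        out[i * hop: i * hop + n_fft] += fr / (n_fft // hop)
    return out
```

Larger `alpha` shifts formants further, which raises the attack success rate but also the WER, matching the trade-off discussed above.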

Fig. 6

Description of the three models under the Vocal Tract Length Normalization attack

Table 1 Recognition results of the above three speech emotion recognition models
Table 2 Performance of the three models under the Vocal Tract Length Normalization attack (UA/WA/ACC)

Table 3 shows the recognition results of the three models under the McAdams transformation attack. Owing to the nature of the McAdams coefficient, there are two roughly symmetric transformation modes, forward and reverse, so the recognition results in the table also show this symmetry. According to the experimental results in Fig. 7, the best attack performance is obtained at αmas = 1.20 (0.80 in reverse), reducing the recognition accuracy of the three models to 8-10%, while the WER increases by only 6.24% (Table 5).
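The McAdams transformation on one speech frame can be sketched as follows: estimate an LPC envelope, raise the phase angle of each complex pole to the power α (the αmas of the text), and refilter the LPC residual through the modified envelope. The autocorrelation LPC method and the analysis order below are assumptions, since the paper's exact settings are not given here.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def mcadams_transform(frame, alpha=1.20, order=16):
    """Sketch of the McAdams-coefficient attack on a single frame:
    pole angles phi become phi ** alpha, shifting the formants."""
    # autocorrelation-method LPC
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:][:order + 1]
    a = np.concatenate(([1.0], -solve_toeplitz(r[:-1], r[1:])))
    # move the angles of the complex poles; real poles are left untouched
    new_poles = []
    for p in np.roots(a):
        phi = np.angle(p)
        if 1e-6 < abs(phi) < np.pi - 1e-6:
            phi = np.sign(phi) * (abs(phi) ** alpha)
        new_poles.append(np.abs(p) * np.exp(1j * phi))
    a_new = np.real(np.poly(new_poles))
    residual = lfilter(a, [1.0], frame)      # inverse (analysis) filter
    return lfilter([1.0], a_new, residual)   # resynthesis with moved poles
```

Since pole magnitudes are unchanged, the modified filter stays stable, which is why α = 1.20 and its reverse 0.80 act as roughly symmetric distortions of the spectral envelope.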

Fig. 7

Description of the three models under the McAdams transform attack

Table 3 Performance of the three models under the McAdams transform attack (UA/WA/ACC)

Table 4 shows the results of the three models under the modulation spectrum smoothing attack. As shown in Fig. 8, the best attack effect is obtained at αms = 0.25, reducing emotion recognition accuracy to 12-14% while the WER increases by 8.83% (Table 5). Under all three attack methods, the recognition accuracy of the models drops significantly: even at the initial hyperparameter values of α (0.05, 0.95, and 0.05, respectively), model accuracy falls to 20-25%, indicating that the three black-box adversarial attacks are effective and that the robustness of the models is limited.
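Modulation spectrum smoothing can be sketched as low-pass filtering the frame-wise trajectory of each spectral bin. Interpreting αms as the fraction of modulation frequencies that are kept is an assumption made for this illustration.

```python
import numpy as np

def modulation_spectrum_smoothing(logspec, cutoff=0.25):
    """Sketch of the MSS attack: per frequency bin, keep only the lowest
    `cutoff` fraction of modulation frequencies of the frame trajectory.
    `logspec` is a (frames, bins) log-magnitude spectrogram."""
    T = logspec.shape[0]
    mod = np.fft.rfft(logspec, axis=0)           # modulation spectrum per bin
    keep = max(1, int(cutoff * mod.shape[0]))    # DC is always retained
    mod[keep:] = 0.0                             # zero out fast modulations
    return np.fft.irfft(mod, n=T, axis=0)
```

Smoothing the modulation spectrum removes the fast temporal dynamics that emotion classifiers rely on, while the slowly varying envelope that carries phonetic content is largely preserved, which is consistent with the modest 8.83% WER increase reported above.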

Fig. 8

Description of the three models under the modulation spectrum smoothing attack

Table 4 Performance of the three models under the Modulation Spectrum Smoothing attack (UA/WA/ACC)
Table 5 Changes of speech quality before and after the change

We then add the three kinds of adversarial samples to the training, as shown in Table 6. In Fig. 9, VTLN train, Mas train, and MSS train each add one type of adversarial sample, with the correct labels, to the training set before the model accuracy is tested again. The best performance comes from the adversarial samples produced by the McAdams transform: the GCN model reaches 68.40% recognition accuracy after adversarial training. When all three kinds of samples are added together (All train in Fig. 9), the best-performing model is CNN-MAA, with a recognition accuracy of 64.60%. In our analysis, these two models remain robust after adversarial training because, by incorporating a graph structure and an area attention mechanism respectively, they learn better from the dispersion of the samples.
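The adversarial-training recipe above can be sketched as follows: each attack is applied to the clean utterances and the transformed copies are added to the training pool with their original (correct) emotion labels. The function names are illustrative, not the paper's code.

```python
def adversarial_training_set(clean, labels, attacks):
    """Augment a training pool with attacked copies of each utterance.
    `attacks` is a list of callables (e.g. the VTLN, McAdams, and MSS
    transforms); passing one attack corresponds to "VTLN train" etc.,
    passing all three corresponds to "All train"."""
    aug_x, aug_y = list(clean), list(labels)
    for attack in attacks:
        for x, y in zip(clean, labels):
            aug_x.append(attack(x))
            aug_y.append(y)  # the correct label is kept for the attacked copy
    return aug_x, aug_y
```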

Fig. 9

Three models after adversarial training

Table 6 Performance of the three models after adversarial training (UA/WA/ACC)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

