# Black-box adversarial attacks through speech distortion for speech emotion recognition – EURASIP Journal on Audio, Speech, and Music Processing

Aug 17, 2022

### Dataset

The most widely used dataset for the SER models above is the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset [27]. To explore the effect of black-box attacks on these models, we therefore use the same dataset. It contains 12 h of emotional speech performed by 10 actors from the Drama Department of the University of Southern California. The performances are divided into two parts, improvised and scripted, according to whether the actors follow a fixed script. The utterances are labeled with 9 emotion categories: anger, happiness, excitement, sadness, frustration, fear, surprise, other, and neutral state. A single utterance may have multiple labels owing to different annotators; we consider only the label that has majority agreement. In previous studies [28–30], because the dataset is imbalanced (there are few happy samples), researchers usually chose the more common emotions, such as neutral state, sadness, and anger. Because excitement and happiness are similar, one is often substituted for the other, or the two are merged to increase the amount of data. In this paper, we likewise use the four emotions of neutral, excitement, sadness, and anger from the IEMOCAP dataset.
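The label selection described above can be sketched as follows. This is only an illustration under stated assumptions: "majority agreement" is read here as a strict majority of annotators, and the label strings, the input layout, and the function names are hypothetical rather than taken from the IEMOCAP release or from our experimental code.

```python
from collections import Counter

# The four classes used in this paper (label strings are assumed spellings).
TARGET_EMOTIONS = {"neutral", "excitement", "sadness", "anger"}


def majority_label(annotations):
    """Return the label chosen by a strict majority of annotators, else None.

    Assumption: "majority agreement" means more than half of the annotators.
    """
    if not annotations:
        return None
    label, count = Counter(annotations).most_common(1)[0]
    return label if count > len(annotations) / 2 else None


def select_utterances(utterances):
    """Keep utterances whose majority label is one of the four target emotions.

    `utterances` is an iterable of (utterance_id, list_of_annotator_labels).
    """
    selected = []
    for utt_id, annotations in utterances:
        label = majority_label(annotations)
        if label in TARGET_EMOTIONS:
            selected.append((utt_id, label))
    return selected
```

Utterances without a majority label, or whose majority label falls outside the four classes, are simply dropped in this sketch.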

### Evaluation metrics

The recognition performance of the SER models above is evaluated using weighted accuracy (WA) and unweighted accuracy (UA), where WA weighs each class according to the number of samples in that class, and UA calculates accuracy as the total number of correct predictions divided by the total number of samples, which gives equal weight to each class:

$$UA = \frac{TP+TN}{P+N}, \quad WA = \frac{1}{2}\left(\frac{TP}{P} + \frac{TN}{N}\right) \tag{3}$$

where P is the number of all positive samples, N is the number of all negative samples, and true positives (TP) and true negatives (TN) are the numbers of positive and negative samples predicted correctly, respectively. In [31], considering that WA and UA may not reach their maximum values for the same model, their average, ACC, is used as the final evaluation standard (the smaller the ACC, the better the attack effect on the model). To show the actual auditory effect of the transformed speech, we also run automatic speech recognition (ASR) on the speech before and after processing and calculate the word error rate (WER):

$$WER = \frac{N_{sub} + N_{del} + N_{ins}}{N_{ref}} \tag{4}$$

where $N_{sub}$, $N_{del}$, and $N_{ins}$ are the numbers of substitution, deletion, and insertion errors, respectively, and $N_{ref}$ is the number of words in the reference [22]. We calculate the WER of the speech before and after the attack as the standard for judging speech quality.
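As an illustration of Eqs. (3) and (4), the metrics can be computed as in the minimal sketch below. It follows the two-class form of Eq. (3) exactly as written and takes ACC as the plain average of UA and WA. For WER it uses a word-level Levenshtein distance, whose minimum number of edits equals $N_{sub} + N_{del} + N_{ins}$ for the best alignment. The function names, the whitespace tokenization, and the `positive` argument are assumptions made for this example, not part of the evaluation code used in the experiments.

```python
def ua_wa_acc(y_true, y_pred, positive=1):
    """Two-class UA, WA, and their average ACC, following Eq. (3) as written."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    p = sum(1 for t in y_true if t == positive)  # P: all positive samples
    n = len(y_true) - p                          # N: all negative samples
    if p == 0 or n == 0:
        raise ValueError("both classes must be present")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, y in pairs if t == positive and y == positive)
    tn = sum(1 for t, y in pairs if t != positive and y != positive)
    ua = (tp + tn) / (p + n)
    wa = 0.5 * (tp / p + tn / n)
    return ua, wa, 0.5 * (ua + wa)


def word_error_rate(reference, hypothesis):
    """WER of Eq. (4) via word-level edit distance (whitespace tokenization)."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed prefix of ref
    # and the first j words of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)
```

For example, with the reference "the cat sat on the mat" and the ASR output "the cat sit on mat", there is one substitution and one deletion over six reference words, so `word_error_rate` returns 2/6.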

### Evaluation setup

In the experiments, we randomly split the dataset into a training set (80%) and a test set (20%) for cross-validation. First, the three SER models above are trained on the training set and evaluated on the test set. The test set is then processed with the three different black-box attack methods, and the effect of each attack on recognition is examined. Finally, adversarial training is added to explore the robustness of the models.
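A random 80/20 split of this kind can be sketched as below. The fixed seed and the function name are assumptions added for reproducibility of the example, and the sketch does not stratify by emotion class, since the text above does not state whether the split was stratified.

```python
import random


def split_dataset(items, train_fraction=0.8, seed=0):
    """Shuffle the items and split them into (train, test) lists.

    Assumptions: a fixed seed for repeatability and no class stratification.
    """
    items = list(items)
    rng = random.Random(seed)  # local generator, leaves global state untouched
    rng.shuffle(items)
    cut = int(round(train_fraction * len(items)))
    return items[:cut], items[cut:]
```

The attacks are then applied only to the held-out test portion, so that the models are never trained on attacked speech before adversarial training is introduced.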

### Evaluation results

Table 1 shows the results of the three different online emotion recognition models on the dataset (IEMOCAP). First, VTLN is used to attack the three models. Table 2 describes the performance of the three models under this adversarial attack. As the hyperparameter $\alpha_{vtln}$ is adjusted, the attack success rate increases. However, an excessive and obvious transformation changes the original speech content too much, which reduces the value of the speech, so the loss of speech quality, i.e., the growth of WER, must also be taken into account when choosing the best attack case. In our experiment, the best performance is obtained when $\alpha_{vtln}=0.15$: the WER increases from 11.23% to 21.40%, as shown in Table 5, a degradation in speech quality of 10.17 percentage points. The recognition accuracy of the three models is reduced to about 10% (Fig. 6), indicating that the attack is highly effective against the emotion recognition systems.

Table 3 shows the recognition results of the three models under the McAdams transformation attack. Because the McAdams coefficient admits two relatively symmetric transformation modes, forward and reverse, the recognition results in the table are also symmetric. According to the experimental results in Fig. 7, the best attack performance is obtained when $\alpha_{mas}=1.20$ (0.80 in the reverse direction), reducing the recognition accuracy of the three models to 8–10%. Meanwhile, the WER increases by only 6.24% (Table 5).

Table 4 shows the results of the three models under the Modulation Spectrum Smoothing attack. As shown in Fig. 8, the best attack effect is obtained when $\alpha_{ms}=0.25$: the accuracy of emotion recognition is reduced to 12–14%, and the WER increases by 8.83% (Table 5). Under all three attack methods, the recognition accuracy of the models drops significantly. Even at the initial hyperparameter values (0.05, 0.95, and 0.05, respectively), the model accuracy drops to 20–25%, indicating that the three black-box adversarial attacks are effective and that the models are not very robust.