# Shallow and deep feature fusion for digital audio tampering detection – EURASIP Journal on Advances in Signal Processing

#### By Zhifeng Wang, Yao Yang, Chunyan Zeng, Shuai Kong, Shixiong Feng and Nan Zhao

Aug 13, 2022

The audio tampering detection method based on the fusion of shallow and deep features proposed in this paper consists of three parts (Fig. 2):

1. Phase and frequency feature extraction. First, down-sampling and band-pass filtering are applied to the audio to obtain the ENF component. Then the ENF component is framed and windowed. Finally, the DFT is used to obtain the phase feature, and the Hilbert transform is applied to obtain the instantaneous frequency.

2. Feature processing. In this part, the average of the ENF phase and frequency variations is calculated as the shallow features. Meanwhile, the feature matrix is obtained by framing and reshaping operations on the ENF phase and frequency sequences (see Sect. 4.2.2 for details). The feature matrix is used as the input of the convolutional neural network to obtain more local information. The fit coefficients are obtained by fitting the phase and the instantaneous frequency with Sum of Sines functions [16], and these coefficients are the input to the DNN, which compensates the deep features with some global information.

3. Deep neural network. In this part, the feature matrix and the fitting coefficients are input to the neural network, and the outputs are concatenated to obtain deep features that contain both global and local information. Then the shallow and deep features of the phase and the instantaneous frequency are fused by an attention mechanism, which gives a different weight to each feature value to achieve feature selection. Finally, a DNN classifier separates tampered audio from authentic audio.

The specific details will be introduced later in this section.

### ENF phase and frequency feature extraction

Following the method in the literature [22, 23], we apply subsampling and bandpass filtering to the audio to obtain the phase and instantaneous frequency of the ENF. First, the subsampling frequency is set to 1000 Hz or 1200 Hz, according to a nominal ENF of 50 Hz or 60 Hz, respectively. This preserves the accuracy of the ENF while reducing the amount of calculation. Then, after subsampling, bandpass filtering is carried out to obtain the ENF component of the audio. We use a linear zero-phase FIR filter of order 10000 for this narrowband filtering. The centre frequency is the nominal ENF (50 Hz or 60 Hz), the bandwidth is 0.6 Hz, the passband ripple is 0.5 dB, and the stopband attenuation is 100 dB. Finally, the phase and instantaneous frequency of the ENF are obtained through the DFT and the Hilbert transform.
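The subsampling and narrowband filtering step can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function name `extract_enf_component` and its parameters are hypothetical, the filter is designed with SciPy's window method (which does not enforce the quoted ripple and attenuation figures), and zero phase is obtained by centring the output of a symmetric FIR filter, which is one of several possible ways to do it.

```python
import numpy as np
from scipy import signal


def extract_enf_component(audio, fs, nominal_hz=50.0, numtaps=10001, bandwidth_hz=0.6):
    """Subsample to 20x the nominal ENF, then band-pass around it with zero phase.

    Hypothetical helper: name, defaults and design method are assumptions.
    """
    fs_down = int(20 * nominal_hz)  # 1000 Hz for 50 Hz grids, 1200 Hz for 60 Hz grids
    g = int(np.gcd(int(fs), fs_down))
    # Rational resampling; resample_poly applies its own anti-aliasing filter.
    x = signal.resample_poly(audio, fs_down // g, int(fs) // g)
    # Symmetric (linear-phase) band-pass FIR; an order-10000 filter has 10001 taps.
    lo = nominal_hz - bandwidth_hz / 2
    hi = nominal_hz + bandwidth_hz / 2
    taps = signal.firwin(numtaps, [lo, hi], pass_zero=False, fs=fs_down)
    # mode="same" keeps the centre of the full convolution, which removes the
    # (numtaps - 1) / 2 sample group delay of an odd-length symmetric filter.
    enf = signal.fftconvolve(x, taps, mode="same")
    return enf, fs_down
```

With `numtaps` odd, the returned signal is time-aligned with the subsampled audio. The first and last few seconds are still affected by the filter's start-up transient.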

#### Obtaining the phase with the DFT

The phase of the ENF is obtained with the discrete Fourier transform (DFT); the phases of \(\mathrm{DFT}^{0}\) and \(\mathrm{DFT}^{1}\) are both calculated. \(\mathrm{DFT}^{k}\) denotes the DFT of the \(k\)-th derivative of a signal, so \(\mathrm{DFT}^{0}\) is the conventional DFT [23].

First, the approximate first derivative \(x'_{\mathrm{ENFC}}[n]\) of the ENF signal \(X_{\mathrm{ENFC}}[n]\) at point \(n\) is calculated:

$$\begin{aligned} x'_{\mathrm{ENFC}}\left[ n \right] = f_d\left( X_{\mathrm{ENFC}}\left[ n \right] - X_{\mathrm{ENFC}}\left[ n - 1 \right] \right) \end{aligned} \tag{2}$$

where \(f_d\left( \cdot \right)\) represents the approximate derivative operation, that is, the first difference scaled by the resampling frequency \(f_d\), and \(X_{\mathrm{ENFC}}[n]\) represents the \(n\)-th point of the ENF component.

Then, a Hanning window \(w(n)\) is used to frame and window \(x'_{\mathrm{ENFC}}[n]\). The frame length is 10 nominal ENF cycles (\(\frac{10}{50}\) s or \(\frac{10}{60}\) s), and the frame shift is 1 nominal ENF cycle (\(\frac{1}{50}\) s or \(\frac{1}{60}\) s).

$$\begin{aligned} x'_N\left[ n \right] = x'_{\mathrm{ENFC}}\left[ n \right] w\left( n \right) \end{aligned} \tag{3}$$

where \(x'_N[n]\) represents the ENF signal after windowing, and \(w(n)\) represents the Hanning window.
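Equations (2) and (3) can be illustrated with a short NumPy sketch. The function name `derivative_frames` is hypothetical, and the rounding of the frame length and frame shift to whole samples is an assumption made for the example.

```python
import numpy as np


def derivative_frames(enf, fs_down, nominal_hz=50.0):
    """Approximate derivative (Eq. 2) followed by Hann-windowed frames (Eq. 3)."""
    # Eq. (2): first difference scaled by the resampling frequency f_d.
    d_enf = fs_down * np.diff(np.asarray(enf, dtype=float))
    frame_len = int(round(10 * fs_down / nominal_hz))  # 10 nominal ENF cycles
    hop = int(round(fs_down / nominal_hz))             # shift of 1 nominal cycle
    window = np.hanning(frame_len)
    n_frames = 1 + (len(d_enf) - frame_len) // hop
    starts = hop * np.arange(n_frames)
    # Index matrix: one row of sample indices per frame.
    idx = starts[:, None] + np.arange(frame_len)[None, :]
    return d_enf[idx] * window  # shape: (n_frames, frame_len)
```

At a resampling frequency of 1000 Hz and a 50 Hz grid this gives frames of 200 samples with a shift of 20 samples.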

To obtain the phase \(\phi_{\mathrm{DFT}^{0}}\) of the ENF and the phase \(\phi_{\mathrm{DFT}^{1}}\) of the first derivative of the ENF, an \(N_{\mathrm{DFT}}\)-point DFT is computed for each frame of \(x'_N[n]\) and \(X_{\mathrm{ENFC}}[n]\) to obtain \(X'(k)\) and \(X(k)\), respectively. The frequency \(f_{\mathrm{DFT}^{1}}\) is then estimated from the integer index \(k_{\mathrm{peak}}\) of the peak of \(\left| X'(k) \right|\):

$$\begin{aligned} f_{\mathrm{DFT}^{1}} = \frac{1}{2\pi }\frac{\mathrm{DFT}^{1}\left[ k_{\mathrm{peak}} \right] }{\mathrm{DFT}^{0}\left[ k_{\mathrm{peak}} \right] } \end{aligned} \tag{4}$$

where \(\mathrm{DFT}^{0}\left[ k_{\mathrm{peak}} \right] = X\left( k_{\mathrm{peak}} \right)\), \(\mathrm{DFT}^{1}\left[ k_{\mathrm{peak}} \right] = F\left( k_{\mathrm{peak}} \right) \left| X'\left( k_{\mathrm{peak}} \right) \right|\), and \(F\left( k_{\mathrm{peak}} \right)\) is a scale coefficient:

$$\begin{aligned} F\left( k \right) = \frac{\pi k}{N_{\mathrm{DFT}}\sin \left( \frac{\pi k}{N_{\mathrm{DFT}}} \right) } \end{aligned} \tag{5}$$

where \(N_{\mathrm{DFT}}\) represents the number of discrete Fourier transform points, and \(k\) is the index of the peak point.
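The frequency estimate of Eqs. (4) and (5) can be sketched for a single frame as below. The helper name `dft1_frequency` is hypothetical. Two choices here are assumptions rather than statements from the text: the same Hanning window is applied to the signal and to its derivative, and the magnitude \(\left| X(k_{\mathrm{peak}}) \right|\) is used in the denominator so that the estimate is real-valued.

```python
import numpy as np


def dft1_frequency(frame, fs_down, n_dft=None):
    """Estimate the frequency of one ENF frame from the DFT^1 / DFT^0 ratio."""
    frame = np.asarray(frame, dtype=float)
    dx = fs_down * np.diff(frame)   # Eq. (2): approximate derivative
    x = frame[1:]                   # align x[n] with x[n] - x[n-1]
    w = np.hanning(len(x))
    n_dft = n_dft or len(x)
    X = np.fft.rfft(x * w, n_dft)
    dX = np.fft.rfft(dx * w, n_dft)
    k_peak = int(np.argmax(np.abs(dX)))  # assumed > 0 for a band-passed ENF frame
    # Eq. (5): scale coefficient for the first-difference approximation.
    F = (np.pi * k_peak / n_dft) / np.sin(np.pi * k_peak / n_dft)
    # Eq. (4), with magnitudes in numerator and denominator (assumption).
    f_dft1 = F * np.abs(dX[k_peak]) / np.abs(X[k_peak]) / (2 * np.pi)
    phi_dft0 = float(np.angle(X[k_peak]))  # phase of the conventional DFT
    return f_dft1, phi_dft0, k_peak
```

Because the ratio depends on the actual tone frequency rather than on the bin centre, the estimate is much finer than the DFT bin spacing (5 Hz for a 200-sample frame at 1000 Hz).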

Now the ENF phase \(\phi_{\mathrm{DFT}^{0}}\) of the conventional DFT can be calculated as \(\phi_{\mathrm{DFT}^{0}} = \arg \left[ X\left( k_{\mathrm{peak}} \right) \right]\). Through Eq. (6), \(\phi_{\mathrm{DFT}^{1}}\) [23] can be calculated.

$$\begin{aligned} \left\{ \begin{array}{l} \phi_{\mathrm{DFT}^{1}} = \arctan \left\{ \frac{\tan \left( \theta \right) \left[ 1 - \cos \left( \omega_0 \right) \right] + \sin \left( \omega_0 \right) }{1 - \cos \left( \omega_0 \right) - \tan \left( \theta \right) \sin \left( \omega_0 \right) } \right\} \\ \theta \approx \left( k_{\mathrm{DFT}^{1}} - k_{\mathrm{low}} \right) \frac{\theta_{\mathrm{high}} - \theta_{\mathrm{low}}}{k_{\mathrm{high}} - k_{\mathrm{low}}} + \theta_{\mathrm{low}} \end{array} \right. \end{aligned} \tag{6}$$

where \(\omega_0 \approx 2\pi f_{\mathrm{DFT}^{1}}/f_d\), \(f_d\) is the resampling frequency, \(k_{\mathrm{DFT}^{1}} = f_{\mathrm{DFT}^{1}} N_{\mathrm{DFT}}/f_d\), \(k_{\mathrm{low}} = \mathrm{floor}\left[ k_{\mathrm{DFT}^{1}} \right]\), \(k_{\mathrm{high}} = \mathrm{ceil}\left[ k_{\mathrm{DFT}^{1}} \right]\), \(\mathrm{floor}[a]\) is the largest integer not greater than \(a\), and \(\mathrm{ceil}[b]\) is the smallest integer not less than \(b\). Since the calculated \(\phi_{\mathrm{DFT}^{1}}\) has two possible values, \(\phi_{\mathrm{DFT}^{0}}\) is used as a reference, and the candidate closest to \(\phi_{\mathrm{DFT}^{0}}\) is selected as the final \(\phi_{\mathrm{DFT}^{1}}\).

#### Obtaining the instantaneous frequency with the Hilbert transform

The Hilbert transform [22] is applied to the filtered ENF signal \(X_{\mathrm{ENFC}}[n]\) to obtain the ENF instantaneous frequency \(f[n]\). First, we form the analytic signal of \(X_{\mathrm{ENFC}}[n]\):

$$\begin{aligned} x^{\left( a \right) }_{\mathrm{ENFC}}\left[ n \right] = X_{\mathrm{ENFC}}\left[ n \right] + i \cdot H\left\{ X_{\mathrm{ENFC}}\left[ n \right] \right\} \end{aligned} \tag{7}$$

where \(H\left\{ \cdot \right\}\) stands for the Hilbert transform and \(i = \sqrt{-1}\). The instantaneous frequency \(f[n]\) is the rate of change of the phase angle of the analytic signal \(x^{(a)}_{\mathrm{ENFC}}[n]\).

The parasitic oscillation generated by the numerical approximation in the Hilbert transform needs to be removed once the instantaneous frequency \(f[n]\) has been obtained. A fifth-order elliptic IIR filter is used to low-pass filter \(f[n]\) and remove the oscillation. The filter's central frequency is the nominal ENF, the bandwidth is 20 Hz, the passband ripple is 0.5 dB, and the stopband attenuation is 64 dB. Because of the boundary effect of the frequency estimation, about 1 s is removed from both the head and the tail of \(f[n]\). The result is the instantaneous frequency estimate \(f_{\mathrm{hil}}\) of the ENF component.
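A minimal SciPy sketch of this step is given below. The function name `hilbert_instantaneous_frequency` is hypothetical, and interpreting the quoted 20 Hz bandwidth as the low-pass cutoff is an assumption, as is the use of forward-backward filtering to avoid adding delay.

```python
import numpy as np
from scipy import signal


def hilbert_instantaneous_frequency(enf, fs_down, cutoff_hz=20.0, trim_s=1.0):
    """Instantaneous frequency (Hz) of the ENF component via the analytic signal."""
    analytic = signal.hilbert(enf)                   # Eq. (7)
    phase = np.unwrap(np.angle(analytic))
    f_inst = np.diff(phase) * fs_down / (2 * np.pi)  # rate of change of the phase
    # Fifth-order elliptic low-pass: 0.5 dB passband ripple, 64 dB stopband
    # attenuation. The 20 Hz cutoff is an assumed reading of the text.
    sos = signal.ellip(5, 0.5, 64, cutoff_hz, btype="low", output="sos", fs=fs_down)
    f_smooth = signal.sosfiltfilt(sos, f_inst)
    # Discard about trim_s seconds at both ends (boundary effect); trim_s > 0.
    trim = int(round(trim_s * fs_down))
    return f_smooth[trim:-trim]
```

For a clean 50 Hz tone the returned sequence is flat at 50 Hz; a splice in the recording would show up as a short excursion from that level.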

### Shallow feature acquisition and deep feature preparation

We use the average of the ENF phase and instantaneous frequency changes as shallow features. To obtain the deep features, we use a convolutional neural network to better learn the details of the ENF phase and instantaneous frequency. We frame, reshape and fit the ENF phase and frequency to build the inputs of the neural network, which then produces the deep features during the training phase.

#### Acquire shallow features

The estimated phases \(\phi_{\mathrm{DFT}^{0}}\) and \(\phi_{\mathrm{DFT}^{1}}\) and the Hilbert instantaneous frequency \(f_{\mathrm{hil}}\) are put into Eq. (8) to obtain the statistical feature \(F_{01f} = \left[ F_0, F_1, F_f \right]\), which reflects the abrupt transitions of the ENF phase and instantaneous frequency [19].

$$\begin{aligned} \left\{ \begin{array}{l} F_{0,1} = 100\log \left\{ \frac{1}{N_{\mathrm{Block}} - 1}\sum \limits_{n_b = 2}^{N_{\mathrm{Block}}} \left[ \hat{\phi }'\left( n_b \right) - m_{\hat{\phi }'} \right]^2 \right\} \\ F_f = 100\log \left\{ \frac{1}{\mathrm{len} - 1}\sum \limits_{n = 2}^{\mathrm{len}} \left[ f'\left( n \right) - m_{f'} \right]^2 \right\} \end{array} \right. \end{aligned} \tag{8}$$

where \(\hat{\phi }'\left( n_b \right) = \hat{\phi }\left( n_b \right) - \hat{\phi }\left( n_b - 1 \right)\), \(2 \le n_b \le N_{\mathrm{Block}}\). \(\hat{\phi }\left( n_b \right)\) is the estimated phase of the \(n_b\)-th frame. \(m_{\hat{\phi }'}\) represents the average value of \(\hat{\phi }'\left( n_b \right)\) from \(n_b = 2\) to \(N_{\mathrm{Block}}\). \(\mathrm{len} = \mathrm{length}\left( X_{\mathrm{ENFC}}[n] \right)\) and \(f'\left( n \right) = f\left( n \right) - f\left( n - 1 \right)\). \(f\left( n \right)\) is the instantaneous frequency of the \(n\)-th sampling point, and \(m_{f'}\) represents the average value of \(f'\left( n \right)\) from \(n = 2\) to \(\mathrm{len}\).
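Each entry of Eq. (8) is 100 times the logarithm of the variance of a first-difference sequence, which takes only a few lines of NumPy. The function names are hypothetical, and the base-10 logarithm is an assumption because the equation does not state the base.

```python
import numpy as np


def variation_feature(seq):
    """100 * log10 of the variance of the first difference of `seq` (Eq. 8)."""
    d = np.diff(np.asarray(seq, dtype=float))  # phi'(n_b) or f'(n)
    # The sum runs over len(seq) - 1 differences and is divided by the same
    # count, so it equals the mean squared deviation of the differences.
    return 100.0 * np.log10(np.mean((d - d.mean()) ** 2))


def shallow_features(phi_dft0, phi_dft1, f_hil):
    """Return the shallow feature vector [F_0, F_1, F_f]."""
    return np.array([
        variation_feature(phi_dft0),
        variation_feature(phi_dft1),
        variation_feature(f_hil),
    ])
```

A smooth, untampered phase track yields small differences and therefore a strongly negative feature value, while an abrupt jump raises it.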

#### Obtaining the input of deep features \(F_{m \times m}\), \(P_{n \times n}\)

The deep features proposed in this paper consist of two parts. The first is the local detail information that the convolutional neural network extracts from the feature matrices \(F_{m \times m}\) and \(P_{n \times n}\), which are themselves obtained by framing and reshaping operations. The second is the global information obtained by passing the fitting coefficients through a DNN. Finally, the global information is concatenated with the detail information to obtain the deep features.

To reduce information loss, we acquire the deep features with convolutional neural networks. We therefore designed a framing approach for building the input of the convolutional neural network, so that the ENF phase or frequency sequences, whose lengths differ across the dataset, each become a matrix of size \(m \times m\). Here \(m\) is the frame length (determined by the audio with the longest duration in the data), each row of the matrix is one frame, and the frame shift \(s\) of each audio is computed adaptively. The detailed steps are listed in Algorithm 1.
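Algorithm 1 is not reproduced in this text, so the following is only one plausible reading of the adaptive framing, with hypothetical function names. It assumes that \(m\) is the smallest integer whose square covers the longest sequence, that the \(m\) frame start positions are spread evenly over each sequence (which makes the shift adapt to the sequence length), and that sequences shorter than \(m\) are zero-padded.

```python
import math
import numpy as np


def square_size(longest_len):
    """Smallest m with m * m >= longest_len (assumed rule for the frame length)."""
    return math.isqrt(longest_len - 1) + 1


def frame_to_square(seq, m):
    """Turn a 1-D sequence into an m x m matrix, one frame of length m per row."""
    seq = np.asarray(seq, dtype=float)
    if len(seq) < m:
        seq = np.pad(seq, (0, m - len(seq)))  # assumption: zero-pad short inputs
    # Evenly spaced frame starts; the effective shift is about (len - m) / (m - 1).
    starts = np.rint(np.linspace(0, len(seq) - m, m)).astype(int)
    return np.stack([seq[s:s + m] for s in starts])
```

Under this reading, a sequence of exactly \(m^2\) samples is framed without overlap, which is the same as a plain reshape, and shorter sequences are framed with increasing overlap.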

#### Curve fitting for the fitting coefficients

Reshaping the sequences into the feature matrices fed to the convolutional neural network may lose the global information of each sequence, so we fit the ENF phase and frequency sequences and use the fit coefficients to compensate for the global information. We use the MATLAB fitting toolbox with a six-term Sum of Sines model to fit the phase and frequency features, which gives the fitting coefficients \(F_{\mathrm{coe}}, P_{\mathrm{coe}} = \left[ a_1, b_1, c_1, \cdots, a_j, b_j, c_j \right]\) \(\left( 0 < j \le 6 \right)\). The Sum of Sines function is

$$\begin{aligned} y = \sum \limits_{j = 1}^{6} a_j\sin \left( b_j x + c_j \right) \end{aligned} \tag{9}$$
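The paper uses the MATLAB fitting toolbox for Eq. (9). As a rough Python counterpart, the sketch below fits the same model with `scipy.optimize.curve_fit`. The function names and the crude default starting point are assumptions. MATLAB chooses its own starting values, so the coefficients will generally not match, and `curve_fit` raises `RuntimeError` when the nonlinear fit does not converge.

```python
import numpy as np
from scipy.optimize import curve_fit


def sum_of_sines(x, *coef):
    """Eq. (9): y = sum_j a_j * sin(b_j * x + c_j), coef = [a1, b1, c1, a2, ...]."""
    x = np.asarray(x, dtype=float)
    a, b, c = np.asarray(coef, dtype=float).reshape(-1, 3).T
    return np.sum(a[:, None] * np.sin(b[:, None] * x[None, :] + c[:, None]), axis=0)


def fit_sum_of_sines(x, y, n_terms=6, p0=None):
    """Return the 3 * n_terms fitting coefficients of a Sum of Sines model."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if p0 is None:
        # Assumed starting point: harmonics of the record length, zero phase.
        span = x.max() - x.min()
        p0 = np.ravel([[y.std() + 1e-12, 2 * np.pi * (j + 1) / span, 0.0]
                       for j in range(n_terms)])
    coef, _ = curve_fit(sum_of_sines, x, y, p0=p0, maxfev=20000)
    return coef
```

With six terms the returned vector has 18 entries, matching the length of \(F_{\mathrm{coe}}\) or \(P_{\mathrm{coe}}\).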

### Shallow and deep feature fusion network

Relying on shallow features alone loses information, which limits both the detection accuracy and the generalization of the model. In addition, the duration of each detected audio is different, so the lengths of the obtained phase and frequency features also differ. As shown in Fig. 3, in the tampering detection method based on the fusion of shallow and deep features proposed in this paper, the phase and instantaneous frequency features of the ENF are first processed to make them suitable for automatic learning by the neural network and to reduce information loss. Then, the deep features of the ENF are obtained by the neural network, which learns the difference between tampered audio and real audio automatically. Next, feature fusion is performed using attention, and finally, the detection results are output.

#### Neural networks of deep features

As shown in Fig. 3, the shallow feature \(F_{01f}\) reflects the sudden changes of the ENF phase and frequency, but each of its statistics is only a single value, so detailed information about the ENF phase and frequency is lost. When it is used alone for audio tampering detection, it may cause misjudgement because of insignificant ENF fluctuation in the tampered area or the interference of low-frequency noise with the ENF. To reduce such cases, we use the convolutional neural network to obtain detailed ENF information as deep features and use the attention mechanism to combine the deep features with the shallow features, which reduces misjudgements and improves the generalization ability of the model.

The deep features proposed in this paper are obtained from the fitting coefficients and the feature matrices. The fitting coefficients are passed through two fully connected layers with 32 neurons each to obtain global information about the ENF phase and frequency. A convolutional neural network processes the phase and frequency feature matrices to obtain detailed information about the ENF phase and frequency. The size of the phase feature matrix \(P_{n \times n}\) is different from that of the frequency feature matrix \(F_{m \times m}\). The frame lengths \(n\) and \(m\) are determined by the longest audio in the audio data, so the longest digital audio that this network can detect is 35 s. Since the longest audio in our dataset is 35 s, the lengths of the phase and instantaneous frequency sequences obtained by the DFT and the Hilbert transform are 2055 and 37,281, so the frame lengths in the deep features are set to 46 and 194, respectively, by the steps in Algorithm 1. The number of convolution blocks is 2 for the phase features and 3 for the instantaneous frequency features. When the maximum length of the audio to be examined increases or decreases, the number of convolution blocks should be increased or decreased as appropriate.

We use two convolution blocks to extract features from the phase feature matrix \(P_{n \times n}\) and three convolution blocks to extract features from the frequency feature matrix \(F_{m \times m}\). Each convolutional block consists of two identical convolutional layers and one pooling layer. The numbers of filters in the three convolutional blocks are 32, 64 and 128; the convolutional kernel size is 3 × 3 with a stride of 1; and the max-pooling layer has a pool size of 3. Detailed information about the ENF phase and frequency sequences can be obtained through the local receptive fields of the convolutional neural network. The pooling layer is used for dimensionality reduction, which reduces the number of parameters, helps avoid overfitting, and improves the model's fault tolerance and generalization ability. Also, because the convolutional neural network has fewer parameters, it can obtain better classification results with less training time.
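A short calculation suggests why the phase branch needs fewer blocks than the frequency branch. The helper below is hypothetical and rests on two assumptions that the text does not state: the 3 × 3 convolutions are padded so that they keep the side length, and max pooling uses a stride equal to its pool size with floor rounding.

```python
def feature_map_sizes(side, n_blocks, pool=3):
    """Side length of the feature map after each convolution block."""
    sizes = [side]
    for _ in range(n_blocks):
        # Assumed: 'same'-padded convolutions leave the side unchanged, and
        # pooling divides it by the pool size, rounding down.
        sizes.append(sizes[-1] // pool)
    return sizes


print(feature_map_sizes(46, 2))   # phase matrix, two blocks
print(feature_map_sizes(194, 3))  # frequency matrix, three blocks
```

Under these assumptions the 46 × 46 phase matrix shrinks to 5 × 5 after two blocks and the 194 × 194 frequency matrix to 7 × 7 after three, so both branches end with feature maps of comparable size.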

For the frequency fitting coefficients \(F_{\mathrm{coe}}\), two fully connected layers are used to learn their characteristics (32 neurons each, with the ReLU activation function). The output of the convolution blocks is reduced in dimension by a fully connected layer with 1024 neurons and then concatenated with the fitting-coefficient features produced by the DNN. Finally, the deep feature is obtained through a fully connected layer of 1024 neurons. The deep feature contains both the global information of the fitting coefficients and the local information obtained by the convolutional neural network.

#### The attention mechanism of feature fusion

We use the attention mechanism [28] to fuse shallow and deep features. In the feature fusion part (as shown in Fig. 3), we first concatenate the shallow and deep features of phase and frequency to obtain an input of length L. Then, to get the weight of each feature, we pass this input through fully connected layers with the ReLU and Sigmoid activation functions. The ReLU activation enhances the nonlinearity, and the Sigmoid produces the weights. Finally, the input features are multiplied by the weights.

The attention fusion mechanism used in this paper obtains the weights with the Sigmoid activation function instead of Softmax, because its primary purpose is to suppress invalid features, not to find the optimal features. There is therefore no need for the feature values of the shallow and deep layers to compete for weights. The mechanism automatically learns to give a different weight to each feature value of the shallow and deep features. Features that significantly affect the classification result are given a larger weight, while features that do not are given a smaller weight, which improves the detection accuracy and generalization ability.
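The gating described here can be written as a small NumPy forward pass. This is a sketch under assumptions: the function name `attention_fuse`, the use of exactly two fully connected layers (one followed by ReLU, one by Sigmoid), and the hidden width are all hypothetical, and the weights would be learned during training rather than supplied by hand.

```python
import numpy as np


def attention_fuse(features, w1, b1, w2, b2):
    """Re-weight a concatenated feature vector of length L with Sigmoid gates.

    features: shape (L,) or (batch, L); w1: (L, H); b1: (H,); w2: (H, L); b2: (L,).
    """
    hidden = np.maximum(0.0, features @ w1 + b1)         # fully connected + ReLU
    weights = 1.0 / (1.0 + np.exp(-(hidden @ w2 + b2)))  # fully connected + Sigmoid
    # Each weight lies in (0, 1) independently of the others. A Softmax would
    # instead force the weights to sum to one, so features would compete.
    return features * weights, weights


rng = np.random.default_rng(0)
L, H = 8, 4  # illustrative sizes only
x = rng.normal(size=L)
fused, weights = attention_fuse(x, rng.normal(size=(L, H)), np.zeros(H),
                                rng.normal(size=(H, L)), np.zeros(L))
```

Because each gate is computed independently, several features can all receive weights close to one at the same time, which is the behaviour the text gives as the reason for choosing Sigmoid.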

#### DNN classifier

The shallow and deep features, weighted and fused by the attention mechanism, are then passed to a DNN classifier, which classifies the audio as tampered or authentic.