# ABOT: an open-source online benchmarking tool for machine learning-based artefact detection and removal methods from neuronal signals – Brain Informatics

#### By Marcos Fabietti, Mufti Mahmud, Ahmad Lotfi and M. Shamim Kaiser

Sep 1, 2022

In the following sections, the articles from the dataset are reviewed, organised progressively by each level of the acquisition method’s spatial resolution. Given the extensive literature available on EEG signals, the most popular articles are chosen to be discussed, while the rest will be provided as a table in the online tool. We define popularity as the number of citations; however, because recency also affects the number of citations, we express the popularity of an article as stated in the following equation:

$$\text{Popularity}_{i} = \frac{\text{citations}_{i}}{\max(\text{citations}_{i, \ldots, n})} \times 0.9 + \frac{1}{1 + \text{years}\,\text{since}\,\text{publication}} \times 0.1, \tag{2}$$

where i = 1,…, n and n equals the number of EEG articles in the dataset. Figure 10 depicts the publication trend in artefact removal from neuronal signals as a function of the citations and years since publication. While the popularity index allows the identification of high-impact articles, it does have a negative bias toward more recently published articles that may be of high relevance. Nonetheless, the key information of those articles is available through the toolbox for users to explore and compare.
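As a rough illustration of Eq. (2), the following sketch computes the index for a handful of articles. The citation counts and ages are hypothetical values chosen for the example, and the function name `popularity` is not part of the tool.

```python
def popularity(citations, years_since_publication):
    """Return the popularity index of Eq. (2) for every article."""
    max_citations = max(citations)
    if max_citations == 0:
        raise ValueError("at least one article needs a non-zero citation count")
    return [
        0.9 * c / max_citations + 0.1 / (1 + y)
        for c, y in zip(citations, years_since_publication)
    ]


# Hypothetical articles: citation counts and years since publication.
citations = [200, 50, 10]
years = [9, 4, 0]
scores = popularity(citations, years)
```

With these made-up numbers, the most cited article dominates the ranking even though it is the oldest, which matches the bias toward older articles noted above.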

### Magnetoencephalography

Hasasneh et al. [50] developed an ECG and EOG artefact classifier based on a combined convolutional neural network (CNN), utilising temporal and spatial information of independent component analysis (ICA) components. From 48 subjects, 7122 examples were obtained after data augmentation, and the model achieved an accuracy of 94.4%. This indicates that accuracy improves when temporal and spatial information is incorporated. Additionally, the model was trained without relying on auxiliary signal recordings, and it can also be applied to EEG and various sensor types.

Garg et al. [51] proposed two identifiers, one for ECG and one for EOG, each composed of a deep 1-D CNN operating on ICA components. Resting-state MEG data from 49 subjects were used to train the models, which reached 96% sensitivity and 99% specificity for the ECG model and 85% sensitivity and 97% specificity for the EOG model. Finally, gradient-weighted class activation maps were generated to visualise the learned features and show how the models operate.

In another publication, Garg et al. [52] applied a 10-layer CNN to label EOG artefacts. The MEG data were extracted from 44 subjects, out of which 14 were used for training and 30 for testing. The obtained accuracy on the testing set was 99.67%. The saliency maps and gradient-weighted class activation maps revealed that the model’s learned features correspond to those used by human experts.

The approach by Phothisonothai et al. [53] consists of extracting the central moment of frequency, fractal dimension, kurtosis, probability density and spectral entropy from independent components. Next, a Gaussian kernel SVM was trained on these features. From a dataset of ten healthy children, the obtained accuracies were 98.15%, 99.18%, and 92.33% for high amplitude changes (HAM), ECG and EOG, respectively.

Duan et al. [54] presented a weighted SVM as an ECG, EOG and sudden high amplitude change artefact predictor. This method was chosen to address the class imbalance of the independent components. By re-weighting the examples belonging to the negative class, the specificity of the classifier was boosted. Using a dataset composed of the MEG data of ten healthy children, the model’s accuracy was 97.91% ± 1.39%.

Rong et al. [55] applied two clustering methods to ICA components: a threshold-based method and an Adaptive Resonance Theory (ART) neural network. The characteristics compared for thresholding were the statistical aspects, topographic patterns, and power spectral patterns. The MEG data were acquired from five healthy right-handed adults, and the chosen performance metric was “correctness”. This can be defined as the proportion of real artefactual independent components identified over the total independent components identified by the algorithm. The ART network achieved 60% correctness on ECG data and 70% on EOG data, considerably underperforming the threshold method, which achieved 100% and 90%, respectively. Lastly, to measure the underestimation of artefacts, they computed a second metric named “coverage”: the number of real artefactual independent components identified over the total artefactual independent components in the dataset. This showed that the coverage of the network was approximately 85% over both artefacts.
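The two metrics defined above can be written down in a few lines. This is a minimal sketch based only on the definitions in the text; the component indices are hypothetical, and the function name is illustrative rather than taken from Rong et al.

```python
def correctness_and_coverage(flagged, true_artefacts):
    """Compute "correctness" and "coverage" from sets of component indices.

    correctness: real artefactual components among those the algorithm flagged.
    coverage:    real artefactual components found among all real artefacts.
    """
    flagged, true_artefacts = set(flagged), set(true_artefacts)
    hits = flagged & true_artefacts
    correctness = len(hits) / len(flagged) if flagged else 0.0
    coverage = len(hits) / len(true_artefacts) if true_artefacts else 0.0
    return correctness, coverage


# Hypothetical example: 5 components flagged, 4 truly artefactual, 3 in common.
flagged = [0, 3, 7, 9, 12]
true_artefacts = [0, 3, 7, 15]
correctness, coverage = correctness_and_coverage(flagged, true_artefacts)
```

Under these definitions, correctness plays the role of precision and coverage the role of recall, so a method can score well on one while underperforming on the other.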

Croce et al. [56] trained a CNN with the independent component’s spectrum and the topographic distribution of its weights, extracted from multichannel MEG and EEG recordings. From 503 brain and 564 artefact components of the EEG recordings along with 2730 artefact and 2019 brain independent components of the MEG recordings, the final dataset was downsampled to 2012 (503 each category). The classification accuracies obtained through cross-validation were 92.4% for EEG, 95.4% for MEG and 95.6% for EEG + MEG.
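The balanced dataset size quoted above follows from cutting every category down to the smallest one. The sketch below reproduces that arithmetic; random undersampling is an assumption made for illustration, since the text does not state how the retained components were selected.

```python
import random


def balance_by_downsampling(groups, seed=0):
    """Randomly keep as many items per group as the smallest group has."""
    rng = random.Random(seed)
    target = min(len(items) for items in groups.values())
    return {name: rng.sample(items, target) for name, items in groups.items()}


# Component counts reported in the text; the items are placeholder indices.
counts = {
    "EEG brain": 503,
    "EEG artefact": 564,
    "MEG artefact": 2730,
    "MEG brain": 2019,
}
groups = {name: list(range(n)) for name, n in counts.items()}
balanced = balance_by_downsampling(groups)
total = sum(len(items) for items in balanced.values())  # 4 * 503 = 2012
```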

Lastly, Treacher et al. [57] employed a combination of a 1-dimensional CNN for the independent components and a 2-dimensional CNN for the spatial maps to detect eye blinks, saccades and cardiac artefacts. The data set was composed of 294 scans from 217 subjects, out of which 232 scans or 49,100 independent components were used to train the model. After hyperparameter optimisation of both networks, an accuracy of 98.87% was achieved on the test data by the ensemble model, surpassing the performance of the individual temporal and spatial models.

In the case of MEG, we can observe that researchers constructed both artefact-specific models and multiple-artefact models. The model by Duan et al. is able to identify multiple artefacts with near 98% accuracy, a performance comparable to that of models by other authors that identify a single artefact. The model by Treacher et al. also identifies multiple artefacts but is more computationally expensive, as it requires training two CNN models, and the model developed by Croce et al. was trained jointly with EEG data, which may not be available in most experiments.

### Electroencephalography

Winkler et al. [58] proposed an ICA-based approach that estimates the source components for the classification of general artefacts by factoring in temporal correlations, named temporal decorrelation source separation. Components extracted from data of 12 healthy right-handed male subjects during two auditory stimuli in an oddball paradigm were labelled by two experts. They were broken into 690 examples for training and 1080 examples for testing a Linear Programming Machine, a Gaussian kernel SVM and a regularised LDA model. The LPM classifier obtained a classification error of 8.9% based on six handcrafted characteristics, while the difference between the two expert scores was 13.2%. For validation, they used data from two studies: 18 subjects in an auditory event-related potential paradigm and 80 subjects in the motor imagery BCI paradigm. The former dataset achieved an average MSE of 14.7%, compared to 10.6% disagreement between experts. At the same time, the latter showed that eliminating up to 60% of the framework did not affect the overall performance of the BCI classification.

Shao et al. [59] applied a weighted version of SVM to handle the inherently unbalanced nature of ICA component classification. By giving a higher penalty to the classification errors generated by the minority class samples, the algorithm compensates for the bias of prior class probabilities. EEG recordings were obtained from ten right-handed volunteers and segmented into 12 s epochs, each of which was then decomposed into independent components by ICA. Each independent component was manually labelled, and six features were extracted from them to train the models, which were trained with the recordings of nine subjects and tested on the left-out subject. The compared models included the Gaussian mixture model, kNN, LDA, standard SVM and weighted SVM with and without error correction. The weighted SVM with error correction obtained the best results: an accuracy of 95.67% and a reduction of 98.4% and 96.8% in the epochs of ECG artefacts and EOG artefacts, respectively.
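The two ingredients described above, a higher penalty for the minority class and a leave-one-subject-out split, can be sketched as follows. The inverse-frequency weighting used here is a common heuristic and an assumption of this example; the exact weights chosen by Shao et al. are not given in the text, and the labels and subject identifiers are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: rarer classes get a larger error penalty."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * c) for cls, c in counts.items()}


def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


# Hypothetical data: 100 components, 10% artefactual, 10 subjects.
labels = ["brain"] * 90 + ["artefact"] * 10
subject_ids = [i % 10 for i in range(100)]

weights = balanced_class_weights(labels)
folds = list(leave_one_subject_out(subject_ids))
```

The resulting dictionary could be passed as per-class penalties to any SVM implementation that accepts them, with one model fitted per fold.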

Shoker et al. [60] used ICA with SVM with handcrafted features. Ten 7-min-long EEG data sets were built with data supplied by King’s College Hospital, London, UK. After applying the blind source separation method, 200 independent components were obtained: 100 free of artefact and 100 containing eye blinks; from them, four handcrafted features were extracted and used to train the classifier. The SVM was trained using linear, cubic polynomial and Gaussian kernels, with the latter achieving the highest accuracy of 98.5 ± 1.00%.

Hader et al.’s [61] approach consisted of the application of ICA and SVM on the topography and power spectral density of the independent components. Four different artefacts were recorded from four healthy and paralysed subjects to train a Gaussian SVM using 20-fold cross-validation. The accuracy was 99.39% for eye blinks, 99.62% for eye movement, 92.26% for jaw muscle and 91.51% for forehead, averaging 95.70% between them.

Lawern et al. [62] addressed artefact removal by implementing auto-regressive models for feature extraction coupled with a Gaussian SVM classifier. Seven participants made a series of facial and head movements that induced artefacts, which involved moving the jaw vertically, clenching the jaw, moving eyes left, moving eyes upwards, blinking both eyes, moving the eyebrows and rotating the head. An eight-class SVM was trained with these recordings, using fourfold cross-validation to determine its optimal parameters, finally reaching a 94% accuracy.

Gao et al. [63] presented a method where ICA is applied to obtain independent components; then, a peak detection algorithm recognises eye-blink artefacts, followed by a classifier trained on the topographic and spectral features to recognise eye-movement artefacts and finally, the artefact-free components are used to restore the signal. Their dataset was composed of 600 EEG epochs from 15 healthy subjects for 3 s per epoch. They compared three different classifiers: MLP, Fisher Discriminant Analysis and SVM, with the latter achieving the best scores of 98.7% sensitivity and 97.9% specificity, using tenfold cross-validation.

Li et al. [64] employed the Lomb–Scargle periodogram to determine the spectral power from recordings that had parts contaminated by artefacts removed and used those features to train an autoencoder and a Gaussian SVM. Evaluated with simulated and real motor imagery data, the autoencoder proved to be comparable to the SVM. Moreover, results show that accuracy is not reduced dramatically if various amounts of data are discarded. Therefore, they concluded that rather than discarding an entire segment, the remaining data can still be used to generate commands after removing the parts with artefacts.

O’Regan et al. [65] proposed complementing EEG signals with gyroscope signals to detect head-movement artefacts. They collected data on head movement from seven healthy male adults for 30 min. Both types of signals were preprocessed and divided into epochs for the analysis. A total of 69 and 80 features were extracted from each epoch of the EEG and gyroscope signals, respectively. For each type of signal, a Gaussian kernel SVM classifier was trained, and a third feature fusion classifier surpassed the former two. The fusion model reached an average AUROC of 0.822 for the participant-independent model and 0.98 for the participant-dependent model. This shows that gyroscope features provide additional information about the presence of EEG artefacts and boost their detection.

Nguyen et al. [66] named their EOG detection methodology “wavelet neural network”. It is composed of three steps: (i) decompose the raw signal into a group of wavelet coefficients; (ii) pass the coefficients in low-frequency wavelet sub-bands to an MLP for correction; (iii) reconstruct an artefact-free signal based on the corrected coefficients. The technique was trained on simulated data and validated on two datasets, recorded during a visual selection task and a driving test. The authors achieved an RMSE of 12.2 for the driving dataset and 19.21 for the visual selection dataset, surpassing the results they obtained with ICA. Furthermore, the solution is computationally efficient and more practical than ICA, suggesting an online deployment is feasible.
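The three wavelet steps listed above can be illustrated with a toy pipeline. This is only a sketch under strong assumptions: a Haar wavelet is used for simplicity, and the trained MLP is replaced by a hypothetical clipping function, so it shows the decompose, correct and reconstruct flow rather than the authors’ method.

```python
import math


def haar_decompose(signal, levels):
    """Step (i): multi-level Haar decomposition into approximation and details."""
    if len(signal) % (2 ** levels) != 0:
        raise ValueError("signal length must be divisible by 2**levels")
    approx, details = list(signal), []
    for _ in range(levels):
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / math.sqrt(2) for a, b in pairs])
        approx = [(a + b) / math.sqrt(2) for a, b in pairs]
    return approx, details


def correct_low_band(coeffs, limit):
    """Step (ii) stand-in: clip coefficients (a trained MLP would go here)."""
    return [max(-limit, min(limit, c)) for c in coeffs]


def haar_reconstruct(approx, details):
    """Step (iii): invert the decomposition from the coarsest level upwards."""
    approx = list(approx)
    for detail in reversed(details):
        rebuilt = []
        for s, d in zip(approx, detail):
            rebuilt.append((s + d) / math.sqrt(2))
            rebuilt.append((s - d) / math.sqrt(2))
        approx = rebuilt
    return approx


# Hypothetical signal with a slow, high-amplitude deflection in the middle.
signal = [1.0, 2.0, 3.0, 4.0, 50.0, 52.0, 3.0, 1.0]
approx, details = haar_decompose(signal, levels=2)
restored = haar_reconstruct(approx, details)  # no correction: original signal
cleaned = haar_reconstruct(correct_low_band(approx, limit=10.0), details)
```

Only the low-frequency (approximation) coefficients are altered, so the fast detail structure of the signal is carried through unchanged to the reconstruction.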

Gonçalves et al. [67] focused on removing artefacts in EEG from the magnetic resonance sequence magnetic fields during the co-registration of EEG and functional magnetic resonance imaging. They utilised a hierarchical clustering algorithm, which employs Euclidean distances to aggregate the different pulse artefacts. The averages of each cluster were then used to generate an artefact template that was subtracted from the respective pulse occurrences belonging to each cluster. The artefact correction in this situation has no ground truth to compare the outcome of the correction algorithm. Nonetheless, the authors used the estimated acquisition time of one slice to determine the quality of the successful correction.
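The template subtraction step described above can be sketched as follows. The cluster labels are assumed to be already available (for instance from a hierarchical clustering of the pulse occurrences), and the epochs are hypothetical toy values.

```python
def subtract_cluster_templates(epochs, labels):
    """Average the epochs of each cluster and subtract that template."""
    clusters = {}
    for epoch, label in zip(epochs, labels):
        clusters.setdefault(label, []).append(epoch)
    # One template per cluster: the sample-wise mean of its member epochs.
    templates = {
        label: [sum(column) / len(members) for column in zip(*members)]
        for label, members in clusters.items()
    }
    cleaned = [
        [value - t for value, t in zip(epoch, templates[label])]
        for epoch, label in zip(epochs, labels)
    ]
    return cleaned, templates


# Hypothetical pulse-artefact epochs (two samples each) and cluster labels.
epochs = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]]
labels = [0, 0, 1]
cleaned, templates = subtract_cluster_templates(epochs, labels)
```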

We can observe that most of these articles share the commonality that they have utilised SVM as the classification model. Among them, Lawern et al. have been able to achieve a performance nearing 96% in a model that is able to identify 7 types of artefacts, the most out of any article reported in the literature. This is achieved with only a second-order auto-regressive model as a feature, and was tested on real patient data. A benefit of the feature is that it is scale-invariant, so it is stable across subjects and computationally efficient to calculate, in contrast to ICA-based approaches. However, Lawern et al.’s approach must be used in conjunction with other methods by those looking to recover the underlying signal.
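The scale invariance of second-order auto-regressive features can be checked numerically. The sketch below estimates the two coefficients with the Yule-Walker equations, which is an assumption of this example (the estimation procedure is not specified in the text), and applies it to a hypothetical signal and a rescaled copy.

```python
import math


def ar2_coefficients(signal):
    """Yule-Walker estimate of (a1, a2) in x[t] = a1*x[t-1] + a2*x[t-2] + e[t]."""
    n = len(signal)
    mean = sum(signal) / n
    x = [v - mean for v in signal]

    def autocov(lag):
        return sum(x[t] * x[t - lag] for t in range(lag, n)) / n

    r0, r1, r2 = autocov(0), autocov(1), autocov(2)
    rho1, rho2 = r1 / r0, r2 / r0  # normalising by r0 removes the scale
    denom = 1 - rho1 ** 2
    a1 = rho1 * (1 - rho2) / denom
    a2 = (rho2 - rho1 ** 2) / denom
    return a1, a2


# Hypothetical signal and the same signal with ten times the amplitude.
signal = [math.sin(0.3 * t) + 0.5 * math.sin(1.7 * t) for t in range(200)]
scaled = [10.0 * v for v in signal]

features = ar2_coefficients(signal)
features_scaled = ar2_coefficients(scaled)
```

Because the coefficients depend only on normalised autocorrelations, both calls return the same pair up to floating-point error, which is what makes the feature comparable across subjects with different signal amplitudes.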

### Electrocorticography

Alagapan et al. [68] developed an artefact removal algorithm for ECoG named shape adaptive non-local artefact removal (SANAR). This approach works by approximating the Euclidean median of k-nearest neighbours of each artefact in a non-local manner, acquiring a template of the artefact, which is then removed from the original signal. It was applied to data obtained from a single subject carrying out a working memory task while being simultaneously stimulated, as well as a simulated ECoG and direct cortical stimulation, where an antenna connected to a function generator acts as a virtual dipole, and a saline solution emulates the conductivity of the grey matter. The artefact residue index, which should be close to 0, was used to measure performance. ICA achieved 0.430 ± 0.015, while SANAR achieved 0.388 ± 0.011, a better performance. Nonetheless, one must consider the extended calculation time as one of the main limitations of the method.

From another perspective, Tuyisenge et al. [69] developed a model for detecting bad channels in ECoG recordings of seizure patients undergoing pre-surgical recordings and stimulation. They extracted the correlation, variance, deviation, amplitude, gradient, Hurst exponent and kurtosis from each channel and fed them to a bagging tree model for classification. They explored the model’s performance based on the number of subjects used to train it, which plateaued at 99.7% accuracy with 110 subjects. The bad channels consisted of artefacts such as electrode pop, power line noise and intermittent electrical connection.

Nejedly et al. [70] proposed using a CNN with five different frequency bands of the recordings as inputs to distinguish between physiological activity, pathological activity, noise and muscle activity, and power line noise. Their analysis was made using two large datasets. They made a general model (trained with one dataset and validated with the other) and a specific model (retraining the general model with 8% of the second dataset and validating with the remaining data). The general model achieved an F1-score of 0.89 in the noise and muscle activity class, while the specific model achieved 0.98 and 0.97 in the power line noise and the noise and muscle activity classes, respectively. The overall performance of the specific model was 0.96, including the physiological and pathological ECoG classes.

Finally, Fabietti et al. [76] explored the impact of sampling frequency in the four-way classification of baseline brain activity, seizure, line noise, and noise and muscle activity. After down-sampling to balance the classes, they used 67,992 examples to train a CNN. At 5 kHz, they achieved a sensitivity of 99.7% for the line noise class and 91.9% for the artefact class. When the signals are downsampled to 250 Hz, the respective sensitivities are 99.4% and 87.8%, indicating a small loss of performance for a 20-fold reduction in sequence length.

Taking these articles into consideration, Tuyisenge et al.’s approach of utilising bagged decision trees achieves the best performance of artefact detection in ECoG signals. The performance was tested on the left-out data of 100 patients, indicating the robustness of the method. It is also worth highlighting that they utilised the fewest training examples, as it was not a deep learning model, and achieved the performance with only 7 handcrafted features.

### Local field potentials

Regarding artefact detection in LFP, Fabietti et al. have explored several approaches. Their dataset comprises multi-site electrode recordings in freely moving male Long Evans rats. First [72], they proposed using a multi-layered perceptron for the binary classification of LFP and artefacts of various origins. They explored how the performance varied based on the input length in both subject-specific and cross-subject models. The cross-subject model achieved an accuracy of 93.2%.

This was followed by their second work [68], in which a recurrent architecture, namely a long short-term memory (LSTM) network, was used both for binary classification and in an approach based on forecasting. After comparing different parameter combinations, the best classification model achieved an accuracy of 87.1%, while the forecasting approach could not identify the two classes with good performance. The third approach [73] consisted of using CNNs, where three popular architectures were adapted for the one-dimensional signal. The best performance was achieved by the AlexNet [74] inspired model, with an accuracy of 95.1%. In addition, Grad-CAM maps were extracted to understand which portions of the signal the model used for assigning each class. Continuing to explore interpretability, a decision tree-based model was the basis for the fourth research article [75]. They explored three feature extraction toolboxes combined with three feature selection methods to obtain an accurate and interpretable model. The accuracy of the decision tree was 89.3%.

From the artefact removal perspective, Fabietti et al. [77] proposed using an LSTM network to forecast “normal” neural activity to replace the artefactual segment. An open-source dataset of rodents on a treadmill was used to train the model, which was fed 200 ms long sequences and asked to predict the subsequent data point in a sliding window approach. The performance was evaluated as the RMSE over 100 ms across four individual subject models and a cross-subject model, which achieved a performance of 0.189 on the test set. Afterwards, the generated signals were compared in the temporal and spectral domains, where they mimic the properties of the physiological recordings. These approaches have been compiled into an open-access toolbox for artefact detection and removal [78].
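The sliding window set-up and the RMSE evaluation described above can be sketched without a neural network. In this illustration the LSTM is replaced by a hypothetical `predict_next` callable, and feeding each prediction back into the window to forecast several points ahead is an assumption of the example; window lengths are expressed in samples rather than milliseconds.

```python
import math


def sliding_windows(signal, window):
    """Pair every window of `window` samples with the sample that follows it."""
    return [
        (signal[i:i + window], signal[i + window])
        for i in range(len(signal) - window)
    ]


def forecast(history, steps, predict_next, window):
    """Forecast `steps` samples, feeding each prediction back as input."""
    history, predicted = list(history), []
    for _ in range(steps):
        nxt = predict_next(history[-window:])
        predicted.append(nxt)
        history.append(nxt)
    return predicted


def rmse(predicted, target):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(
        sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)
    )


# Toy usage with a persistence model standing in for the trained network.
pairs = sliding_windows([0, 1, 2, 3, 4], window=3)
predicted = forecast([1.0, 2.0, 3.0], steps=2,
                     predict_next=lambda w: w[-1], window=3)
error = rmse([1.0, 2.0], [1.0, 4.0])
```

The training pairs come from `sliding_windows`, while `forecast` generates the replacement segment and `rmse` scores it against a held-out clean recording.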

In general, it can be said that Fabietti et al. have compared a wide range of architectures for artefact detection in LFP, and over two datasets the CNN [74] has achieved the best accuracy and the lowest computational time to classify a minute of recording. In regard to artefact removal, the use of LSTM to forecast over corrupted LFP segments has shown promise, and may prove useful in single-channel EEG applications.

### Spikes

Klempíř et al. [79] approached artefact detection using transfer learning with a CNN based on AlexNet. The dataset was composed of thousands of 10-s extracellular microelectrode recordings of 58 patients with Parkinson’s disease. Approximately 75% of the recordings did not contain any artefacts, and the preprocessed dataset consisted of nearly 100,000 one-second signal segments. Continuous wavelet transform was applied to generate a time–frequency image, which was the input to the network. This pipeline attained an accuracy of 88.1% for artefact identification and 75.3% accuracy for the individual classes of artefact identification.

From another angle, Hosny et al. [80] explored the use of machine learning to detect artefacts from multi-electrode recordings. Their data consisted of recordings from 17 Parkinson’s disease patients, which showcased artefacts such as mechanical motion, electromagnetic interference, baseline drift, irritated neuron and others. Power spectral density and wavelet packet decomposition were used to obtain 106 features, which were used to train classifiers such as Gaussian SVM, decision trees, AdaBoost, Bagging learners, LogitBoost and an LSTM network with 3785 examples. The best performing model was the LSTM network, with an accuracy of 97.49% on the test set.

Overall, Hosny et al. outperform Klempíř et al. in regard to the achieved accuracy on binary classification. Furthermore, Hosny et al.’s model was trained with nearly half the number of examples that the latter used, and the examples included a wider range of artefacts. However, training an LSTM network end-to-end is significantly more computationally expensive than applying transfer learning to the AlexNet CNN.

We proceed to discuss the challenges in the field and the advantages and limitations of the tool in the subsequent section.