A Bootstrapping Method for Improving the Classification Performance of the P300 Speller

In this paper, we present a novel approach to training classifiers for a speller based on P300 potentials. The method relies on bootstrapping, a well-known strategy for generating new samples that is rarely used in the neurosciences. The study first demonstrates how the performance of the classification task (detecting the P300 and non-P300 classes) can be sub-optimal under the traditional approach. Then, a new method for drawing new samples from the training data is proposed: each classifier is re-trained using balanced sub-groups of individual P300 and non-P300 samples. Data were collected from 14 healthy subjects using 16 electroencephalography channels, band-pass filtered, and decimated. Subsequently, four linear classifiers were trained with the traditional method and with the proposed one, using 1000, 2000 and 3000 samples per class. Results indicate that the proposed method improves the accuracy and discrimination capacity of discriminative classifiers while keeping the same statistical properties between the training and test data. By contrast, for generative classifiers there is no significant difference in the results. The proposed method is therefore highly recommended for training discriminative classifiers in spellers based on P300 potentials.


INTRODUCTION
One of the most interesting applications of Brain-Computer Interfaces (BCIs) is the P300 speller, proposed in 1988 by Farwell and Donchin [1] and re-invented and improved in many other studies [2] [3] [4] [5]. A commonly used speller consists of an arrangement of characters uniformly distributed in rows and columns, displayed on a screen. Rather than highlighting a single character, the speller randomly highlights groups of characters organized in rows or columns. When the user watches the desired character in a highlighted row or column, the brain generates a P300 signal, which is related to memory and attention processes [6].
A typical P300 speller reads signals from the brain, using electroencephalography (EEG), and tries to discriminate between P300 and non-P300 signals. When a P300 signal is detected in a specific row and column, the speller takes the corresponding character and displays it on the screen. The described speller has been used for developing online BCI applications [5] [7] [8] [9] .
Note that the ultimate target of the classification is to identify, from the P300 signals, the row and the column that correspond to a character, rather than merely to classify P300 and non-P300 signals.
As the P300 speller is based on the oddball paradigm, the number of events is unbalanced; that is, the number of non-P300 trials is larger than the number of P300 trials [10]. Both unbalanced classes and small datasets can affect the performance measurement of a classifier [11] [12]. To obtain a more reliable performance estimate, the number of samples per class should be balanced.
Some researchers have proposed discarding samples randomly from the class with more members to reach the desired 1:1 proportion [13] [14] [15] [16] [17] , trying to preserve as many samples as possible in the training stage [18] . This solves the problem of unbalanced classes at the expense of decreasing the number of available samples.
By contrast, there are three main approaches to processing the input features of a classifier for a P300 speller. The first consists of training and evaluating the classifier on single trials [3] [8] [19]. The second makes use of data averaged over a fixed number of trials, for both training and testing the system [5] [7] [9] [13] [16] [20] [21] [22]. The third consists of training the classifier on single trials and evaluating it on averaged trials [14] [17] [23] [24].
The last approach (called the traditional approach in this work) is commonly used in the literature. It suffers from an important problem: the statistical properties of the signals used during training differ from those of the signals used during testing.
This violates the assumption that training and testing data should come from the same population, for any classification problem [25] . Consequently, the classifier has reduced capacity to differentiate between P300 and non-P300 trials.
The different statistical properties of the training and testing signals raise another problem: the posterior probabilities estimated by probabilistic classification approaches will not be correct.
This issue is critical for P300 applications that make use of language models [7] [26] [27] [28] since the posterior probability of the output of the classifier is typically combined with the probability of letters in a particular language to determine the most likely sequence of letters.
In addition, since the P300 and non-P300 classes are unbalanced, performance measures such as the accuracy tend to be biased. This happens because the classifier assigns most of the samples to the class with the higher prior probability [12]. Some studies have proposed using Cohen's kappa index κ as an alternative measure of performance that does not suffer from the issues previously described [29] [30] [31].
The classification problem can be seen from one of two possible points of view. The first establishes that the classification problem is typically divided into two stages: the inference stage tries to learn a probabilistic model of the data given the class, and the decision stage then applies Bayes' theorem to determine the class of each data point. A classifier implemented in this manner is known as a generative classifier [25].
Linear discriminant analysis (LDA) classifiers are generative because they mostly assume Gaussian distributions in the data [25] [32] .
The second point of view determines that a class could be directly mapped from the data. The model comes from either a probabilistic discriminant model of the class, given the data, or a deterministic discriminant function that directly maps the data to the class.
A classifier that uses the latter approach is a discriminative classifier [25]. Logistic regression and the support-vector machine are examples of discriminative classifiers.

In this work, a method for training linear classifiers for the identification of P300 potentials is presented.
First, we demonstrate that the traditional approach could lead to misinterpretation of the actual performance of these classifiers, as the performance metric based on accuracy is not well suited for the cases of unbalanced classes. Second, a bootstrapping approach is presented as a method for obtaining effective training of linear classifiers. Results indicate a significant improvement using the proposed method for detection of P300 potentials.

Experiment and dataset description
The experiment consists of spelling one of 36 possible characters (26 letters and 10 digits). Each subject observed a 6 × 6 matrix of characters on a screen, focusing attention on the character prescribed above the matrix speller. For each character, the matrix was displayed for a 2.5 s period with all characters at the same intensity. Afterward, each column and each row were randomly intensified.

The dataset contains EEG signals that were recorded using a cap embedded with 64 electrodes, according to the modified 10-20 system [33]. All electrodes were referenced to the right earlobe and grounded to the right mastoid. The raw EEG signal was bandpass-filtered between 0.1 and 60 Hz and amplified with a 20000X SA Electronics amplifier [23]. Each experiment took into account only 16 EEG channels, motivated by the study presented by Krusienski et al. [23]: F3, Fz, F4, FCz, C3, Cz, C4, CPz, P3, Pz, P4, PO7, PO8, O1, O2, and Oz. Each channel was sampled at a rate of 240 Hz for one subject and 256 Hz for the others. All aspects of data collection and experimental control were managed by the BCI2000 system [22]. Two datasets were acquired for each subject, recorded on different days: one was used for training and the other for testing. All datasets were obtained from the Wadsworth Center, NYS Department of Health.

Data processing
Data were pre-processed using bandpass filtering, separation into trials, and decimation. Then, all channels were concatenated in a single vector. Depending on the type of training, data were used either directly as the input of a classifier or as a population from which new samples were obtained. In the latter case, a given number N of averaged samples was drawn. Afterward, the training datasets were fed to a linear discriminant classifier.
Details are explained in the following subsections.

Pre-processing
For each subject, data were bandpass-filtered between 1 and 20 Hz using a fourth-order Butterworth filter.
The chosen bandwidth eliminates the trend of each channel and later allows decimation of the signal by preventing aliasing. Afterward, data were separated into trials by taking a window of 600 ms after the presentation of each visual stimulus (the highlighting of one row or column), as proposed in a previous work [26].
Signals from all channels were decimated by a factor of 4 and concatenated in a single feature vector. The factor was chosen because frequencies above the beta band reflect neural processes unrelated to the P300 response [34]. In addition, the maximum analog frequency of the EEG signal is 60 Hz, as seen before [23]. For the averaging process, signal segments were averaged across repetitions, up to the maximum number of repetitions per character (15).
Concatenated channels were used as the inputs of the classifier since they are used in the traditional method, as described in [23] .
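As an illustration, the pre-processing chain described above (band-pass filtering, 600 ms epoching, decimation by a factor of 4, and channel concatenation) could be sketched as follows in Python with NumPy/SciPy. The function name and array layout are illustrative; the original study used MATLAB:

```python
import numpy as np
from scipy.signal import butter, filtfilt, decimate

def preprocess(eeg, fs, stim_onsets):
    """Band-pass filter, epoch, decimate, and concatenate channels.

    eeg: array (n_channels, n_samples); fs: sampling rate in Hz;
    stim_onsets: sample indices where each visual stimulus was presented.
    """
    # Fourth-order Butterworth band-pass between 1 and 20 Hz.
    b, a = butter(4, [1.0, 20.0], btype="bandpass", fs=fs)
    filtered = filtfilt(b, a, eeg, axis=1)

    win = int(0.6 * fs)  # 600 ms window after each stimulus
    features = []
    for onset in stim_onsets:
        trial = filtered[:, onset:onset + win]
        # Decimate each channel by 4, then flatten all channels
        # into one feature vector.
        trial = decimate(trial, 4, axis=1, zero_phase=True)
        features.append(trial.reshape(-1))
    return np.vstack(features)
```

With 16 channels at 240 Hz, each 600 ms trial yields 16 × 36 = 576 features after decimation.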

Re-sampling of training samples
In the traditional approach, the classifier is trained with single trials and tested on averaged trials, to increase the signal-to-noise ratio. Note that besides the issue of having unbalanced data, the statistical properties of the training data do not match those of the testing data.
To avoid these problems, we implement an approach based on bootstrap re-sampling (bootstrapping) [32] [35]. New training samples are generated by drawing single trials with replacement from each class and averaging them, so that the training set is balanced and its averaged samples share the statistical properties of the averaged test data.
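A minimal sketch of this re-sampling idea, assuming each new training sample is the average of k single trials drawn with replacement from one class (k matching the number of trials averaged at test time); the function name and interface are illustrative:

```python
import numpy as np

def bootstrap_averaged_samples(trials, n_new, k, rng=None):
    """Create n_new averaged samples from single-trial data of one class.

    trials: array (n_trials, n_features) of P300 or non-P300 trials.
    Each new sample is the mean of k single trials drawn with
    replacement, so the training data share the statistics of the
    k-trial averages used during testing.
    """
    rng = np.random.default_rng(rng)
    idx = rng.integers(0, len(trials), size=(n_new, k))
    return trials[idx].mean(axis=1)

# Balanced training set: the same number of samples per class, e.g.
# X = np.vstack([bootstrap_averaged_samples(p300_trials, 2000, 15),
#                bootstrap_averaged_samples(nonp300_trials, 2000, 15)])
```

Sampling with replacement lets the minority (P300) class supply as many averaged samples as the majority class without discarding any original trials.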

Classifiers
In the literature, the classification problem involves identifying the row and the column that correspond to a character of the speller. In the present study, the target of the classification is to determine whether a signal is a P300 or not. To this end, four linear classifiers were implemented, as described below.

For discriminative classifiers, it is necessary to choose the value of a regularization factor C. A four-fold cross-validation process is implemented with the training dataset to select the best value of C. Twenty-five candidate values of C were tested, all located between 0.01 and 1. After this procedure, the final classifier is trained using the whole training dataset and the chosen value of C. The process is repeated for each user and each type of training sample [36].
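The cross-validation procedure above can be sketched as follows. Here `fit_score` stands for any routine that trains a discriminative classifier with a given C and returns its validation accuracy; the helper and its interface are hypothetical, and the evenly spaced grid of 25 values is an assumption:

```python
import numpy as np

def select_C(X, y, fit_score, n_folds=4, c_grid=None, rng=None):
    """Pick the regularization factor C by k-fold cross-validation.

    fit_score(X_tr, y_tr, X_va, y_va, C) -> validation score.
    c_grid defaults to 25 values between 0.01 and 1, as in the text.
    """
    if c_grid is None:
        c_grid = np.linspace(0.01, 1.0, 25)
    rng = np.random.default_rng(rng)
    # Shuffle sample indices and split them into n_folds folds.
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    best_c, best_score = c_grid[0], -np.inf
    for C in c_grid:
        scores = []
        for i in range(n_folds):
            va = folds[i]
            tr = np.concatenate([folds[j] for j in range(n_folds) if j != i])
            scores.append(fit_score(X[tr], y[tr], X[va], y[va], C))
        if np.mean(scores) > best_score:
            best_c, best_score = C, np.mean(scores)
    return best_c
```

After `select_C` returns, the final classifier would be re-trained on the whole training set with the chosen C.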

Stepwise LDA
The traditional approach implements a modified version of LDA as the classifier, in which a stepwise regression is applied before the classification task [23]. The resulting classifier is known as SWLDA. Unlike other LDA-based classifiers, it chooses the coefficients of the regression model iteratively, according to a statistical criterion [37]. As a result, the model obtained is more compact than that of a least-squares-based regression. The study implements the stepwise regression included in the Statistics and Machine Learning Toolbox for MATLAB®. Additional details of SWLDA can be found in [38].

Bayesian LDA
When the coefficients of the LDA model are chosen according to Bayesian criteria, an LDA classifier based on Bayesian interpolation (BLDA) is obtained. According to the literature, the algorithm gives better results than ordinary LDA or even SWLDA [39] [40]. As in the SWLDA classifier, the coefficients are obtained iteratively. However, the criteria for updating the coefficients are based on Bayes' rule, and features are not added to or removed from the model [41]. The algorithm implemented in the study and further details of BLDA can be obtained from [39].

Linear SVM
Support-vector machine (SVM)-based classifiers have been implemented in several previous studies related to BCIs, including P300 spellers [14] [15] [20] [42] [43] . In this work, a linear kernel support-vector machine was implemented as the classifier with the LIBSVM Toolbox for MATLAB® [44] , for each subject. The reader is encouraged to see [45] for a wide list of studies implementing SVM in BCIs.

Logistic Regression
Unlike the L-SVM, logistic regression-based classifiers have been implemented in fewer works related to BCIs [46] [47]. Logistic regression is a member of the family of log-linear models, implemented in discriminative classifiers [25] [32]. In the present study, the classifier was implemented with the UGM Toolbox for MATLAB® [48], for each subject. Further details about logistic regression can be found in [25].

Performance metrics

Accuracy
A common measure of performance used for classification is the accuracy, defined as the closeness between measured or predicted values and their corresponding true values [49]. For a classification problem with $M_c$ classes and $n$ samples, the accuracy is commonly computed from the trace of the confusion matrix $H$ [29], as shown in Equation (1):

$$p_o = \frac{\operatorname{tr}(H)}{n} = \frac{1}{n}\sum_{i=1}^{M_c} h_{ii} \qquad (1)$$

For large $n$, the accuracy in Equation (1) can be approximated by a normal distribution with the standard deviation defined in Equation (2):

$$\sigma_{p_o} = \sqrt{\frac{p_o(1-p_o)}{n}} \qquad (2)$$

However, high accuracy does not always mean that the classifier performs well. When the classes are highly unbalanced, the classifier tends to be biased toward the class with the highest occurrence in the dataset. This is known as the accuracy paradox [50].
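A numerical sketch of Equations (1) and (2), using a deliberately unbalanced two-class confusion matrix to illustrate the accuracy paradox (the function name is illustrative):

```python
import numpy as np

def accuracy_from_confusion(H):
    """Accuracy as the trace of the confusion matrix over total samples."""
    n = H.sum()
    p = np.trace(H) / n
    # Large-sample normal approximation: standard deviation of the accuracy.
    sd = np.sqrt(p * (1 - p) / n)
    return p, sd

# Unbalanced example: a classifier that always predicts the majority
# class still reaches 100/120 ~ 0.83 accuracy -- the accuracy paradox.
H = np.array([[0, 20],     # rows: true class, columns: predicted class
              [0, 100]])
p, sd = accuracy_from_confusion(H)
```

Despite never detecting the minority (P300) class, the classifier above looks accurate, which is exactly why accuracy alone is misleading here.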

Cohen's kappa index
A commonly used alternative measure is Cohen's kappa index κ [29] [51] [52]. It measures the predictive power of a classifier by relating the accuracy to the probability of classifying correctly by chance, as expressed in Equation (3):

$$\kappa = \frac{p_o - p_e}{1 - p_e} \qquad (3)$$

The numerator is the difference between the accuracy $p_o$ and the expected probability of classifying correctly by chance, $p_e$, computed from the row and column marginals of $H$. The denominator is the difference between the maximum possible accuracy and $p_e$; κ is therefore the ratio of the observed improvement over chance to the maximum possible improvement. The possible values for κ lie within the range of −1 to 1 [53]. The standard deviation of κ is calculated using Equation (7):

$$\sigma_\kappa = \sqrt{\frac{p_o(1-p_o)}{n(1-p_e)^2}} \qquad (7)$$

The standard error can be used to build confidence intervals and to assess statistical significance when accuracy or kappa values are compared.
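A sketch of Equations (3) and (7); the simple large-sample form of the standard deviation is assumed. Applied to a majority-class classifier on an unbalanced two-class problem, κ comes out as 0 despite the high accuracy:

```python
import numpy as np

def cohens_kappa(H):
    """Cohen's kappa and its approximate standard deviation from H."""
    n = H.sum()
    p_o = np.trace(H) / n                       # observed accuracy
    # Chance agreement from the row and column marginals of H.
    p_e = (H.sum(axis=0) * H.sum(axis=1)).sum() / n**2
    kappa = (p_o - p_e) / (1 - p_e)
    # Simple large-sample standard deviation of kappa.
    sd = np.sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2))
    return kappa, sd

# A classifier that always predicts the majority class: accuracy ~ 0.83,
# but kappa = 0, exposing the absence of real discriminative power.
H = np.array([[0, 20], [0, 100]])
kappa, sd = cohens_kappa(H)
```

This is why κ is the preferred metric for the unbalanced P300/non-P300 problem.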

RESULTS
Results presented here refer to the average performance obtained by each classifier, in terms of the accuracy and Cohen's kappa index metrics. All metrics were obtained from the testing dataset of each subject.

Number of bootstrapped samples
The statistical significance of differences among the numbers of bootstrapped samples for averaged training data was tested by a one-way randomized-blocks ANOVA, performed separately for each performance index and each classifier.

Type of training samples
The statistical significance of differences among the previously described types of training data was tested by two procedures. First, a one-way randomized-blocks ANOVA was performed for each metric and classifier. According to the results, when the classifier is trained with 2,000 averaged trials per class, the performance is significantly better than that of the traditional approach. The difference is highly significant (p < 0.01) for most metrics and subjects and, in most cases, in favor of the proposed method. Results for the stepwise and Bayesian LDA classifiers are reported in Tables 3 and 4. Under the traditional approach, the statistical properties of single and averaged data differ; as a consequence, mostly non-P300 features are learned, which gives small values of the statistics.

Results
Another issue worth considering is the nature of the LDA-based classifiers. They try to fit the data to a set of Gaussian models, with one mean per class and a common covariance matrix [25]. When new data are presented to the classifier, they are compared with each model, and the class whose model yields the highest score or probability is assigned. This score or probability comes from the distance between the data and each mean. In our study, both LDA-based classifiers map the data to a scalar score, according to a regression model, before the generation of the Gaussian models; the models are therefore scalar rather than multivariate. In discriminative classifiers, by contrast, the mapping from the data to the class is direct [25]. Consequently, discriminative models are more affected by the statistical nature of the data, which is reflected in the difference in results between generative and discriminative classifiers.

CONCLUSIONS
In this study, a bootstrapping method is presented to solve two important problems in the P300 speller. The method generates a new training set by re-sampling with replacement from the original set, reaching two important goals at the same time.
First, the number of trials across classes is balanced.
Unlike other approaches that discard samples [13] [14] [15] [16] [17], no data are dropped in the process, which prevents a possible bias in the classification results.
Second, the statistical properties of the training data are made to match those of the test data. This is achieved when the number of trials averaged for each training instance equals the number of trials averaged during testing.
Unbalanced classes and the difference in statistical properties are considerable issues in state-of-the-art implementations of the P300 classification task.
The results presented here indicate that, by dealing with the aforementioned issues, the proposed method significantly improves the detection of the P300 and non-P300 classes by linear discriminative classifiers.