
OPEN ACCESS | IOP PUBLISHING | JOURNAL OF NEURAL ENGINEERING

J. Neural Eng. 10 (2013) 036025 (13pp) doi:10.1088/1741-2560/10/3/036025

Optimizing event-related potential based brain–computer interfaces: a systematic evaluation of dynamic stopping methods

Martijn Schreuder, Johannes Höhne, Benjamin Blankertz, Stefan Haufe, Thorsten Dickhaus and Michael Tangermann

Machine Learning Laboratory, Berlin Institute of Technology, Marchstraße 23, 10587 Berlin, Germany

E-mail: [email protected]

Received 9 January 2013
Accepted for publication 1 May 2013
Published 20 May 2013
Online at stacks.iop.org/JNE/10/036025

Abstract

Objective. In brain–computer interface (BCI) research, systems based on event-related potentials (ERP) are considered particularly successful and robust. This stems in part from the repeated stimulation which counteracts the low signal-to-noise ratio in electroencephalograms. Repeated stimulation leads to an optimization problem, as more repetitions also cost more time. The optimal number of repetitions thus represents a data-dependent trade-off between the stimulation time and the obtained accuracy. Several methods for dealing with this have been proposed as 'early stopping', 'dynamic stopping' or 'adaptive stimulation'. Despite their high potential for BCI systems at the patient's bedside, those methods are typically ignored in current BCI literature. The goal of the current study is to assess the benefit of these methods. Approach. This study assesses for the first time the existing methods on a common benchmark of both artificially generated data and real BCI data of 83 BCI sessions, allowing for a direct comparison between these methods in the context of text entry. Main results. The results clearly show the beneficial effect on the online performance of a BCI system if the trade-off between the number of stimulus repetitions and accuracy is optimized. All assessed methods work very well for data of good subjects, and worse for data of low-performing subjects. Most methods, however, are robust in the sense that they do not reduce the performance below the baseline of a simple no stopping strategy. Significance. Since all methods can be realized as a module between the BCI and an application, minimal changes are needed to include these methods into existing BCI software architectures. Furthermore, the hyperparameters of most methods depend to a large extent on only a single variable: the discriminability of the training data. For the convenience of BCI practitioners, the present study proposes linear regression coefficients for directly estimating the hyperparameters from the data based on this discriminability. The data that were used in this publication are made publicly available to benchmark future methods.

S Online supplementary data available from stacks.iop.org/JNE/10/036025/mmedia

(Some figures may appear in colour only in the online journal)

Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

Recent advances in brain–computer interfaces (BCIs) have resulted in both applications and methods that come closer to practical use [1–4]. BCI systems allow a person to control a device without the use of the brain's normal efferent pathways. In other words, by inducing recognizable brain states, a person can convey her/his intention directly to a device. This is of particular interest for people with severe motor disabilities, but can also prove useful in other settings [5, 6].

BCI systems based on event-related potentials (ERPs) have proven to be particularly robust [7]. The principal idea of any ERP-BCI is to stimulate the user with two or more different stimuli. The user focuses on one of these stimuli, called the target. This target is generally less frequent than the non-targets, either by design or by the larger number of different non-focused stimuli. The target and non-target stimuli elicit different brain patterns, which can be detected and exploited by the BCI in real time. Currently, the most commonly employed method for detecting this signal is non-invasive electroencephalography (EEG).

ERP-based BCIs are most widely used for communication applications and were originally introduced as such by [8]. Since then, several variations have been proposed, changing the interface [9–11] or the modality [12–14]. The original idea, however, remains the same: the user communicates by focusing attention on one of several offered options and thereby conveys her/his intention to the BCI.

Typically, a single selection step is referred to as a trial. It contains two main time intervals which are important for the estimation of the efficiency of the system: the period which contains the actual stimulation activity, and the pre- and poststimulus pauses. The latter two are collectively called the overhead and are required to inform the user of a selection and to prepare a new trial. Though the overhead is typically fixed, the length of the stimulation period can be adjusted.

Generally, a round (or iteration) of one stimulus for each option is performed before proceeding to the next iteration. Several such iterations make up the stimulation part of a trial. This repetition is done in order to improve the relatively low signal-to-noise ratio (SNR) of the resulting ERPs, and thereby increase the accuracy of the decision (see figure 1(A)). However, as each additional iteration prolongs the length of a trial, there exists a trade-off between accuracy and speed, which calls for data-driven optimization. A longer stimulation typically means higher accuracy but lower speed, whereas a shorter stimulation may result in lower accuracy but higher speed. The optimum in this trade-off of course depends on the SNR of the ERPs, but is to a large extent also determined by application factors, such as the number of trials needed for correcting a wrong decision, the number of trials needed to write a letter and automatic error correction features of the application. All these should be taken into account for fine-tuning this trade-off.

Typically, the above trade-off is not dealt with for the online use of a BCI, and an arbitrary number of iterations is fixed beforehand. Subsequently, offline post-hoc analyses of the online data are used to generate performance curves over the number of iterations, given some metric (for an example, see figure 1(B)). The peak performance is then taken as a measure of the possible performance of the system. This simple strategy of a fixed number of iterations is widely used in BCI, but generalization might be troublesome due to its post-hoc nature; the determined number of iterations would have been optimal in retrospect, but this does not necessarily hold for future data, as the SNR is prone to changes.

Figure 1. Influence of two different metrics ((B) and (C)) on the optimal number of stimulations. The simulation parameters were set to SOA = 200 ms, overhead = 8 s, number of classes = 12 and to one step per character. This choice of parameters resembles the standard matrix speller [8]. The four lines indicate different accuracy profiles, and black dots indicate peak performances. (A) The y-axis shows the metric binary classification accuracy and defines the profiles. (B) The bitrate as a metric does not include any error correction strategy of the application. Peak performances are at very low numbers of iterations. (C) Using the SPM metric, peak performances are reached for larger iteration numbers. In fact, for three out of four profiles, the lower number of iterations as suggested by the bitrate peak performances would not have allowed the user to write.

This leads to the challenge of optimizing the number of iterations of ERP-based BCIs. The simplest method, and an extension of the above, is to learn the optimal number of iterations from the calibration data and fix this for online use. This will be referred to as fixed stopping, as it does not adapt to changes in the online data. More dynamic methods have been proposed in the literature to account for the varying SNR or user performance over time [15–20]. They stop the stimulation at any point in a trial when enough evidence (in the form of classifier scores) has been accumulated for a correct decision. The hyperparameter of the stopping method should be learned based on the calibration data. In order to outperform fixed stopping, such dynamic methods should ideally be robust against the distortions which are commonly observed in EEG data, and they should preferably be trained on a small set of training data.

To test and compare these proposed methods, they were implemented and validated, both on artificial data (checking for common data distortions) and on real EEG data of several paradigms. All stopping methods were provided with the output of a linear classifier for training and testing purposes. Though more sophisticated classifiers are sometimes used (outputting for instance probabilities), the most widely used classifier in ERP-based BCI remains linear discriminant analysis (LDA). For this reason, all methods were restricted to work with data obtained from a regularized LDA. Each stimulus was classified independently, and the accumulation of the evidence was left to the stopping methods. This ensures the applicability of the results, as the methods could be realized as a simple module between the existing BCI and the application.

2. Material and methods

The dynamic stopping methods which were proposed in the literature [15–20] were tested on a variety of data. Unless stated otherwise, the implementations used in the following comparisons are equal to those reported in the original work. All of the methods use a different terminology in their original manuscripts, yet their fundamental principles are very similar. For better readability, the terminology for all methods has been unified.

Classifiers were trained to assign (more) negative scores to target epochs. All analyses were done using Matlab (Mathworks, Version 7.13 R2011b). Some basic notations that will be used throughout are:

J: number of iterations; j indexes iterations.
C: number of classes; c indexes classes.
T: number of trials; t indexes trials.
D ∈ R^{C×J×T}: tensor containing classifier outputs.
D^train: tensor containing the training classifier outputs.
D^test: tensor containing the test classifier outputs.
l ∈ Z^T: the vector with the true binary class labels.
Φ: paradigm parameters (stimulus onset asynchrony (SOA), overhead, C).
S: stopping method hyperparameter to be optimized.
Θ: set of trained parameters of a stopping method.
L: training function which returns parameters Θ.
F: online decision function.
M: sub-function, describing the data.

All of the methods have access to D^train, l and Φ for training purposes. For testing purposes, only D^test, Φ and Θ are available. As there is no access to raw EEG data, the methods are implemented as a module in between the BCI and the application. This ensures maximum compatibility with existing processing pipelines.

Given the training data D^train, all dynamic stopping methods are trained with a function L, such that

Θ = L(D^train, l, Φ).    (1)

The type of parameters contained in Θ differs for each method. Given Θ, the function F is called after each iteration of an online trial. It is defined for the current trial t at iteration j by

F(D^test, Θ, t, j) = { c^w_{j,t} if a stop is possible; ∅ otherwise }    (2)

where c^w_{j,t} indicates the winning class, and the empty set ∅ indicates that no early stop should be taken. In case a trial reaches the maximum number of J iterations, the winning class is returned (see equation (3)). In most cases, function F relies on a subfunction M, which summarizes the data. These functions are core to each dynamic stopping method.

Unless defined directly by the method, the winning class c^w_{j,t} is defined for trial t and iteration j as

c^w_{j,t} = argmin_c ( med(D_{c,1...j,t}) )    (3)

where the median is taken over all iterations up to iteration j, for each class.
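
To make this shared decision rule concrete, the following is a minimal Python sketch of equation (3), assuming the classifier outputs are held in a NumPy array D of shape (C, J, T) matching the notation above (the function name is illustrative; the original analyses were done in Matlab):

import numpy as np

def winning_class(D, t, j):
    # Equation (3): per class, take the median classifier output over
    # iterations 1..j of trial t; the class with the most negative
    # median wins (targets are trained to receive negative scores).
    medians = np.median(D[:, :j, t], axis=1)
    return int(np.argmin(medians))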

Throughout this manuscript, the classification data are often summarized in terms of their discriminability. The receiver operating characteristic (ROC) is a measure to estimate the performance of a two-class classifier, expressed as the ratio between true and false positive decisions for various values of the bias of the LDA decision function. It can be summarized in a single value as the area under the ROC curve (AUC). This measure is used here to describe the quality of the data. The AUC is bounded between 1 (perfect separation, always estimates the correct class label) and 0 (perfect separation, always estimates the wrong class label). An AUC value of 0.5 means that the classifier is random and does not realize a class separation.
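
As an aside, the AUC of a set of classifier outputs can be estimated via the rank-sum identity; a small sketch (hypothetical helper, consistent with the sign convention that targets score more negatively):

import numpy as np

def auc(target_scores, nontarget_scores):
    # Probability that a randomly drawn target score is more negative
    # than a randomly drawn non-target score; 1 = perfect, 0.5 = random.
    x = np.asarray(target_scores)[:, None]
    y = np.asarray(nontarget_scores)[None, :]
    return (x < y).mean()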

2.1. Tested methods

All methods are conceptually described here, within the above framework. A more detailed description is given if the implementation deviates from the original implementation. For a full account, the reader is referred to the original publications. All evaluations are done in the context of a text spelling application and thus by optimizing the hyperparameter S of a stopping method with respect to the symbols per minute (SPM) metric (see sections 2.4 and 2.5). The following methods are investigated:

No stopping. In this condition, no early stopping is performed. It uses all available iterations and is therefore the performance baseline. No parameters need to be estimated. Formally, this means that Θ = {J}, the maximum number of iterations possible. F(D^test, Θ, t, j) returns c^w_{j,t} (according to equation (3)) when j = J.

Fixed (optimal). The simplest way of optimizing the BCI is to estimate an optimal fixed number of iterations on the calibration data and to apply this for the online data. Because the number of iterations can be set in virtually any BCI system, this method requires no implementation effort. The number of iterations j† which results in the highest SPM on the offline/training data is selected for online use. Θ is thus defined as Θ = {j†}. As j† is directly optimized, it equals S. During online use, F(D^test, Θ, t, j) returns c^w_{j,t} (equation (3)) when j = j†. For the fixed condition, the optimal number of iterations is estimated once from all of the training data; it serves as an evaluation of the post-hoc estimation of BCI performance which is generally given in the BCI literature. For the fixed optimal condition, a re-sampling scheme (see section 2.5) is used to increase robustness.

Liu [17]. This approach simply sums the outputs of a linear classifier per class and over iterations. M is defined by

M(D, t, j) = min_c ( Σ_{j′=1}^{j} D_{c,j′,t} )    (4)

for trial t and iteration j, thus taking the most negative class. The average of all target classification scores in D^train, called d, is taken as the initialization for optimizing θ = S·d (section 2.5). Θ then contains this threshold θ for online use.

During online operation, the decision for an early stop is taken as follows:

F(D^test, Θ, t, j) = { c^w_{j,t} if M(D^test, t, j) < θ; ∅ otherwise }    (5)

where c^w_{j,t} is the class with the most negative summed score.

In the original work, EEG features from iteration i are smoothed together by averaging them with features from iterations i − 1 and i − 2 before classification. Here, access is only given to the instantaneous classifier output by design, so this pre-processing step is left out.
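
A minimal sketch of the online rule of equations (4) and (5) under the conventions above (illustrative function name; theta denotes the trained threshold θ):

import numpy as np

def liu_stop(D_test, theta, t, j):
    # Equation (4): per-class sum of classifier outputs over iterations 1..j.
    summed = D_test[:, :j, t].sum(axis=1)
    # Equation (5): stop once the most negative summed score passes theta.
    if summed.min() < theta:
        return int(np.argmin(summed))  # winning class: most negative sum
    return None                        # no early stop yet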

Jin [20]. This method sets a minimum number of consecutive iterations (j_min) which must result in the same class prediction. To measure this, M(D, t, j) is defined as 1 + n, where n denotes the number of iterations directly prior to iteration j which all lead to the same decision as j does (according to equation (3)). The original paper performs an additional pre-processing step on D to give D*, where D*_{c,j,t} = (1/j) Σ_{j′=1}^{j} D_{c,j′,t} for each j ≥ j_mean. Since it only includes processing the classifier outputs, the same is done here. Both values of Θ = {j_min, j_mean} are themselves the hyperparameters (S_1 = j_min and S_2 = j_mean). They are optimized directly according to the procedure in section 2.5.

In the online case, the decision function F is as follows:

F(D*^test, Θ, t, j) = { c^w_{j,t} if M(D*^test, t, j) ≥ j_min and j ≥ j_mean; ∅ otherwise }    (6)

Here, both j_min and j_mean were bound between 2 and 5. As no decisions are taken while j < j_mean, the earliest possible stop is at iteration j_mean + j_min.
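
A sketch of this rule (illustrative names; D_star is the running-mean transformed tensor described above):

import numpy as np

def jin_stop(D_star, t, j, j_min, j_mean):
    if j < j_mean:                     # no decisions before iteration j_mean
        return None
    # Winning class after each iteration 1..j via the median rule (eq. (3)).
    winners = [int(np.argmin(np.median(D_star[:, :k, t], axis=1)))
               for k in range(1, j + 1)]
    # M(D, t, j) = 1 + n: length of the run of identical decisions ending at j.
    run = 1
    while run < len(winners) and winners[-1 - run] == winners[-1]:
        run += 1
    return winners[-1] if run >= j_min else None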

Lenhardt [16]. This method was originally proposed for a visual speller, and defines two thresholds. For both thresholds, D is first transformed into D* by

D*_{1...C,j,t} = norm( Σ_{j′=1}^{j} −D_{1...C,j′,t} ),  where norm(X) = (X − min(X)) / max(X − min(X)),

a vector normalization over the class dimension. Lenhardt requires the target class to have positive labels. Since the target class was defined to be negative in this framework, a sign reversal is done within this transformation.

The first threshold is based on a measure called the 'intensity', or M1, defined for iteration j and trial t as

M1(D*, t, j) = Σ_{c=1}^{C} D*_{c,t,j}    (7)

M1 is thus simply the sum of all values in D*_{1...C,t,j}. The second threshold is based on a slightly different measure M2, defined as

M2(D*, t, j) = 1 − D*_{k2,t,j} / D*_{k1,t,j}    (8)

where k1 = argmax_c(D*_{c,t,j}) and k2 = argmax_{c≠k1}(D*_{c,t,j}). M2 thus measures how much the highest value in D*_{1...C,t,j} exceeds the second highest.

To find both thresholds, the training function L(D*^train, l, Φ) first calculates the intensity M1(D*, t, j) for the earliest possible stop which leads to a correct decision (according to equation (3)). This is done for each training trial, and the average of these intensities, i, serves as an initialization for optimizing θ_1 = S_1·i (see section 2.5). The second threshold, θ_2, is optimized on the training data directly (θ_2 = S_2), which leads to Θ = {θ_1, θ_2}.

The online decisions for an early stop are taken by F as follows:

F(D*^test, Θ, t, j) = { c^w_{j,t} if M1(D*^test, t, j) < θ_1 and M2(D*^test, t, j) > θ_2; ∅ otherwise }    (9)

The original publication calculates the thresholds using data of several subjects. Here, it is restricted to the more common case of only having the calibration data of a single subject.
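
A sketch of the complete online rule of equations (7)-(9) (illustrative names; theta1 and theta2 are the trained thresholds):

import numpy as np

def lenhardt_stop(D, theta1, theta2, t, j):
    # Sign-flipped cumulative scores, normalized over the class dimension.
    x = -D[:, :j, t].sum(axis=1)
    x = (x - x.min()) / (x - x.min()).max()
    m1 = x.sum()                        # intensity, equation (7)
    order = np.argsort(x)
    k1, k2 = order[-1], order[-2]       # highest and second highest class
    m2 = 1.0 - x[k2] / x[k1]            # dominance of the winner, equation (8)
    if m1 < theta1 and m2 > theta2:     # equation (9)
        return int(k1)
    return None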

Rank diff [19]. Rank diff was designed for the AMUSE paradigm. However, the AMUSE dataset used in the current performance comparisons was not available during the design of Rank diff. The method is based on M(D, t, j), defined for iteration j and trial t as

M(D, t, j) = min_{c*} ( med(D_{c*,1...j,t}) − med(D_{c^w_{j,t},1...j,t}) )    (10)

where c^w_{j,t} is found using equation (3), c* runs over all other classes, and med is the median operator. Θ = {θ_1, ..., θ_J} contains J iteration-dependent thresholds θ, allowing it to model the within-trial variation of ERP responses. To determine θ_j (the threshold for iteration j), L(D^train, l, Φ, j) calculates M(D^train, t, j) for each training trial. These values are assigned to one of two vectors, depending on whether the estimated class label was correct (Ψ⁺_j) or incorrect (Ψ⁻_j), using equation (3). To implement a security margin that prevents overly optimistic early stops, θ_j is then defined as θ_j = max(max(Ψ⁻_j), S · med(Ψ⁺_j)). The hyperparameter S is set globally over all iterations, and is optimized as described in section 2.5.

A larger value for M means a better separation of the class medians, and thus more confidence in a decision. For online use, F is defined as

F(D^test, Θ, t, j) = { c^w_{j,t} if M(D^test, t, j) > θ_j; ∅ otherwise }    (11)
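
A sketch of the Rank diff decision (illustrative names; thresholds is the list (θ_1, ..., θ_J) learned as above):

import numpy as np

def rankdiff_stop(D, thresholds, t, j):
    medians = np.median(D[:, :j, t], axis=1)             # per-class medians up to j
    cw = int(np.argmin(medians))                         # winning class, equation (3)
    margin = np.delete(medians, cw).min() - medians[cw]  # equation (10)
    return cw if margin > thresholds[j - 1] else None    # equation (11)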

Höhne [18]. This method was again proposed for an auditory speller paradigm. A t-test with unequal variances (also referred to as Welch's t-test) is calculated on D_{1...j} for each class against all other classes. Equation (12) shows the test statistic,

t(X_1, X_2) = (X̄_1 − X̄_2) / √( s²_1/N_1 + s²_2/N_2 )    (12)


where X̄_i, s²_i, N_i are the sample mean, sample variance and sample size, respectively. In order to account for a varying number of data points, the t-test statistic comparing D_{c,1...j,t} against the outputs of all other classes is transformed into a p-value (computed from a one-sided test, see supplementary appendix A (available from stacks.iop.org/JNE/10/036025/mmedia)). This transformation is part of any statistics toolbox. M(D, t, j) is then defined as this p-value.

The natural threshold on such a p-value is the significance level, or Θ = {α}. It is optimized for SPM on the training data directly, with α = S (section 2.5). During online operation, the decision for an early stop is taken according to the following:

F(D^test, Θ, t, j) = { c^w_{j,t} if M(D^test, t, j) < α; ∅ otherwise }    (13)

where c^w_{j,t} is the class with the smallest M value.
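
A sketch using SciPy's Welch test (the one-sided alternative reflects that target scores are trained to be more negative; at least two iterations are needed for a variance estimate):

import numpy as np
from scipy import stats

def hohne_stop(D, alpha, t, j):
    C = D.shape[0]
    pvals = np.empty(C)
    for c in range(C):
        x1 = D[c, :j, t]                               # candidate class
        x2 = np.delete(D[:, :j, t], c, axis=0).ravel() # all other classes
        res = stats.ttest_ind(x1, x2, equal_var=False, alternative='less')
        pvals[c] = res.pvalue                          # M(D, t, j)
    if pvals.min() < alpha:                            # equation (13)
        return int(np.argmin(pvals))
    return None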

Zhang [15]. This method proposes a probabilistic approach to continuously tracking whether a BCI user is in a control or non-control state. The core of the algorithm is the formulation of a joint likelihood function, which accumulates the evidence for being attended for each stimulus over iterations. Here, this evidence is used as a criterion for early stopping.

The method requires that the distributions of the classifier outputs for target and non-target stimuli, f_a and f_u, are known. In practice, they are estimated from the training data. Here, univariate normal distributions for the classifier outputs are assumed, which are fully characterized by their means and variances μ_{a/u} and σ²_{a/u}. Note that this assumption is compatible with the assumption of multivariate normally distributed data made by the LDA classifier (see supplementary appendix B (available from stacks.iop.org/JNE/10/036025/mmedia)).

Making the further simplifying assumption that the EEG data are independent across iterations, as well as across stimulations within one iteration, the joint likelihood of observing the classifier outputs of the first j iterations given that stimulus c is attended is

f(D^test_{1...C,1...j,t} | c) = Π_{j*=1}^{j} [ f_a(D^test_{c,j*,t}) Π_{c*≠c} f_u(D^test_{c*,j*,t}) ].    (14)

According to Bayes' theorem, the posterior probability of stimulus c being attended after the first j iterations is given by

P(c | D^test_{1...C,1...j,t}) = f(D^test_{1...C,1...j,t} | c) P(c) / Σ_{c′=1}^{C} f(D^test_{1...C,1...j,t} | c′) P(c′),    (15)

where P(c), c = 1, ..., C denote the prior probabilities with which the C classes are selected by the user. These probabilities may be estimated in the training phase or, e.g., derived from a language model in spelling applications. If all classes are assumed to be equally likely, uniform priors P(c) = 1/C, c = 1, ..., C are used. Note that in practice, numerical inaccuracies may occur when evaluating equations (14) and (15) due to excessively small values of the joint likelihood function. Therefore, all computations are carried out on log probabilities, as in the original publication [15].

The joint likelihoods and posterior probabilities of each class are updated after each iteration. The function M(D^test, t, j) is defined as the maximum of the posterior probabilities of all classes:

M(D^test, t, j) = max_{c=1,...,C} P(c | D^test_{1...C,1...j,t}).    (16)

A natural stopping condition is attained if equation (16) exceeds 1 − α, where α = S is a hyperparameter to be optimized based on the training data (section 2.5). Thus, Θ = {{P(c)}_{c=1}^{C}, μ_{a/u}, σ²_{a/u}, α}. The decision at iteration j then reads:

F(D^test, Θ, t, j) = { c^w_{j,t} if M(D^test, t, j) ≥ 1 − α; ∅ otherwise }    (17)

where c^w_{j,t} = argmax_{c=1,...,C} P(c | D^test_{1...C,1...j,t}).
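
A sketch of one online update (illustrative names; mu_a, sd_a, mu_u, sd_u are the Gaussian parameters of f_a and f_u estimated from the training data, priors the P(c)):

import numpy as np
from scipy import stats

def zhang_stop(D, mu_a, sd_a, mu_u, sd_u, priors, alpha, t, j):
    scores = D[:, :j, t]                       # shape (C, j)
    log_a = stats.norm.logpdf(scores, mu_a, sd_a)
    log_u = stats.norm.logpdf(scores, mu_u, sd_u)
    # Log joint likelihood, equation (14): class c attended, all others not.
    log_lik = np.array([log_a[c].sum() + log_u.sum() - log_u[c].sum()
                        for c in range(scores.shape[0])])
    log_post = log_lik + np.log(priors)        # numerator of equation (15)
    log_post -= np.logaddexp.reduce(log_post)  # normalize in the log domain
    post = np.exp(log_post)
    if post.max() >= 1.0 - alpha:              # equations (16) and (17)
        return int(np.argmax(post))
    return None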

Upper bound. The upper bound method uses knowledge about the class labels and is not applicable for normal online use. It is applied for reasons of comparison and to deliver an upper bound on the performance of any of the other stopping methods. It simply stops a trial at the first occasion which leads to a correct decision. If a trial does not result in a correct decision at any iteration, the full number of iterations is applied.

2.2. Artificial data

The most important requirement for the methods is that their performance should at worst drop to the baseline performance of the no stopping strategy whenever data become difficult to separate. In other words, they should degrade gracefully. Several artificial datasets were constructed to test the robustness of the stopping methods for data distortions which are typically observed in classifier outputs during BCI online sessions (see figure 2). Unless stated otherwise, data are sampled from two normal distributions (target and non-target) with unit variance. They represent the classifier outputs and not the EEG data or features thereof. A training set and a validation set were generated independently. For the calculation of the SPM metric, all of the following datasets were assumed to have an SOA of 200 ms, six stimulus classes, a two-step interface, and an overhead of 10 s per step. The influence of the following distortion types was investigated (but not interactions between different types):

Separation. To mimic data that are generally difficult to classify, the distance between the target and non-target class means is varied from zero standard deviations (STD) for no separability to three STD (good separability), effectively changing the AUC. This is done equally for both training and validation data. Based on the results, all other datasets were constructed with a relatively high AUC value of 0.85.

Outliers. Artifacts in the EEG create outliers in the classifier outputs. These can generally be removed from the calibration data before training the classifier. Particularly during longer use, the number of artifacts in the online phase will be larger, and removing them is not straightforward. Thus, between 0% (no outliers) and 65% (many outliers) of outliers are introduced to both classes, but in the validation data only. Outliers are randomly sampled from a uniform distribution on [−3, 3].
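
For illustration, a sketch of how such artificial classifier outputs can be generated (parameter names are illustrative; in the study, outliers were added to the validation data only):

import numpy as np

rng = np.random.default_rng(0)

def sample_scores(n_classes, n_iter, n_trials, separation, outlier_frac=0.0):
    # Non-targets ~ N(0, 1); the target class is shifted towards negative
    # scores by 'separation' standard deviations.
    D = rng.normal(0.0, 1.0, size=(n_classes, n_iter, n_trials))
    D[0] -= separation                 # let class 0 play the target role
    # Replace a fraction of scores by uniform outliers on [-3, 3].
    mask = rng.random(D.shape) < outlier_frac
    D[mask] = rng.uniform(-3.0, 3.0, size=mask.sum())
    return D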

Confusions. A typical phenomenon in some BCIs is a systematic confusion, where one particular non-target is mistaken for a target more often than other non-targets [21]. This leads to a partial overlap of non-target and target classifier scores. To simulate this, one non-target class was shifted from the non-target distribution (normal separation) toward the target distribution (no separation) in several steps. This is a special case of the separation dataset, where only one non-target stimulus class is affected by bad separability. It is applied to both training and validation data.

Figure 2. Distributions of classifier scores for artificial datasets. The distributions are used to test the robustness of stopping methods. The left column represents the distribution of the classifier outputs of training data on arbitrary scales; the right column displays the classifier outputs for online data on the same scales. The top row (A) shows the standard sample without distortions between training and online data. The lines in (B)–(F) depict specific forms of data distortions. Whenever the data are distorted in (B)–(F), the standard distribution is indicated by the faded lines. The seventh dataset (not shown here) only changes the number of non-target classes, whilst keeping the standard distribution.

Datasets with changes in terms of drift, scale, distribution and number of classes were also tested. Since their influence on the performance was minimal, they are not described fully, but only depicted in figure 2.

2.3. EEG data

EEG data from 83 sessions were taken, originating from AMUSE [19], an auditory ERP experimental BCI paradigm, and five visual ERP paradigms: rapid serial visual presentation (RSVP) [22], movement-onset visual evoked potential (MVEP) [23], and standard visual evoked potentials from the paradigms HEX-O-SPELL, center speller and cake speller [10]. Datasets were previously recorded for the respective publications. All datasets consisted of an offline calibration part and an online spelling part; in all the original studies the users' task was to write text. Details on the datasets can be found in table 1. All calibration datasets were reduced to 24 trials. Due to the varying number of iterations per trial, there was still a discrepancy between datasets. Since it is not the primary goal here to compare datasets, this was not considered a problem. All online results are based on 21 training trials, to resemble the amount available in one cross-validation fold. Only results explicitly testing the influence of training data had a varying number of training trials.

2.3.1. Pre-processing and classification. All data were high-pass filtered above 0.1 Hz during recording. For the current analyses they were further low-pass filtered (below 30 Hz) before being downsampled to 100 Hz and segmented according to the stimulus onset. Single epochs were baselined to the interval 150 ms prior to the stimulus onset. For feature extraction, data were further downsampled using a priori knowledge to use a higher sampling density for early components (30–350 ms: average over 30 ms bins) than for late components (360–800 ms: average over 60 ms bins). All data prior to 30 ms and after 800 ms post-stimulus were discarded for classification.
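
A rough sketch of this feature extraction for a single epoch (a Python/NumPy stand-in for the original Matlab code; the function name is illustrative and the bin edges are approximate):

import numpy as np

def bin_features(epoch, fs=100):
    # epoch: (time x channels) array sampled at fs, stimulus onset at index 0.
    def bins(start_ms, stop_ms, width_ms):
        edges = np.arange(start_ms, stop_ms + 1, width_ms) * fs // 1000
        return [epoch[a:b].mean(axis=0) for a, b in zip(edges[:-1], edges[1:])]
    # Denser bins for early components, coarser bins for late components.
    feats = bins(30, 350, 30) + bins(360, 800, 60)
    return np.concatenate(feats)       # one value per bin and channel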

The resulting feature vector of an epoch is rather large, which requires regularization of the classifier. Here, an LDA classifier is used with shrinkage regularization of the covariance matrix [24]. It outputs distances to the learned separating hyperplane, the sign of which indicates the predicted class label.

Table 1. Dataset parameters. Given are the number of subjects (N), the number of channels, the modality (A = auditory, V = visual), the number of classes, the number of steps (trials per character), the maximum number of iterations in the original data, the SOA and the overhead (time per trial that needs to be added to the stimulation period); SOA and overhead are given in seconds. As AMUSE is exclusively auditory, the labels were read out to the user, explaining the larger overhead.

Dataset              N   Channels  Modality  Classes  Steps  Max It.  SOA (s)  Overhead (s)
AMUSE [19]           16  61        A         6        2      15       0.175    18.25
Hex-o-spell [10]     13  63        V         6        2      10       0.217    6
Center speller [10]  13  63        V         6        2      10       0.217    8.25
Cake speller [10]    13  63        V         6        2      10       0.217    6
RSVP [22]            12  55        V         30       1      10       0.083    12
MVEP [23]            16  57        V         6        2      10       0.266    11.7

2.4. Metric

For the evaluation and optimization of early stopping methods, a metric is needed which respects the speed–accuracy trade-off under consideration of the overhead. The information transfer rate (ITR) [25, 26] is an obvious performance measure, as it is predominant in the BCI literature. However, for spelling applications it is more straightforward and intuitive to use the SPM metric [27]. In addition to the overhead, it considers the application-specific error handling, the benefit of which is aptly shown in figure 1. Typically, a correct selection counts as +1 symbol, and an erroneous selection counts as −1 symbol, to account for the added effort of a necessary subsequent corrective action. SPM is defined as follows:

effective correct = percent correct − percent erroneous    (18)

SPM = decisions per minute × effective correct    (19)

where decisions per minute considers the overhead and stimulation, as discussed in the introduction.

In a two-step selection process, errors can be made on the first level (group selection) and on the second level (single symbol selection). When the group selection is false, the second level contains the possibility to select an 'undo' action, which cancels the previous selection. As no symbol is selected after such a manoeuvre, it counts as a 0 symbol selection. Percent correct is now interpreted as the accuracy on both levels and percent erroneous as the percentage of erroneous selections on the second level. Note that the simplifying assumption was made that the accuracy is the same for each symbol (similarly to the assumption of the ITR as defined in [26]).
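
For the simpler one-step case, the metric reduces to a few lines; a sketch (the trial duration includes stimulation time plus overhead):

def spm(accuracy, trial_duration_s):
    # Equation (18): a correct selection counts +1 symbol, an error -1.
    effective_correct = accuracy - (1.0 - accuracy)
    # Equation (19): decisions per minute times effective correct.
    return (60.0 / trial_duration_s) * effective_correct

# Example: 85% accuracy, 5 iterations x 6 classes x 0.2 s SOA + 8 s overhead
# gives spm(0.85, 5 * 6 * 0.2 + 8.0), roughly 3 symbols per minute.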

2.5. Dynamic stopping optimization

All methods have one or more hyperparameters S, which allow the user to influence the trade-off between speed and robustness. These require some form of optimization within the training function L. Interestingly, most of the original publications do not provide a way to properly estimate the value of this hyperparameter.

Therefore, the hyperparameter S is optimized using a resampling procedure on the classifier scores obtained from a two-fold cross-validation on the training trials (see section 3). Two-thirds of the training trials are randomly selected to estimate the necessary data statistics (for instance the intensities in Lenhardt or the means, priors and variances in Zhang). Given these statistics, the remaining data are used to optimize hyperparameter S for SPM, using a simple linear search. This procedure is repeated 15 times and the median hyperparameter is subsequently applied to the test trials.

Some methods ((Fixed) optimal, Höhne, Jin) do not require such a compound optimization. However, they too are estimated using the resampling procedure for extra robustness and better comparability. In these cases the hyperparameter is optimized on the larger two-thirds data split directly. In the case of two mutually dependent hyperparameters (Jin and Lenhardt), they are estimated simultaneously, by extending the linear search to a two-dimensional grid search.
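
The scheme can be summarized in a short sketch; fit_stats and simulate_spm stand in for the method-specific training and SPM simulation and are purely illustrative placeholders:

import numpy as np

def optimize_S(D_train, labels, candidates, fit_stats, simulate_spm,
               n_repeats=15, rng=None):
    rng = rng or np.random.default_rng()
    T = D_train.shape[2]
    picks = []
    for _ in range(n_repeats):
        perm = rng.permutation(T)
        cut = 2 * T // 3
        fit_idx, val_idx = perm[:cut], perm[cut:]
        # Estimate the data statistics on two-thirds of the training trials.
        stats_ = fit_stats(D_train[:, :, fit_idx], labels[fit_idx])
        # Linear search: simulate SPM on the held-out third for each S.
        spms = [simulate_spm(S, stats_, D_train[:, :, val_idx],
                             labels[val_idx]) for S in candidates]
        picks.append(candidates[int(np.argmax(spms))])
    return float(np.median(picks))     # median over the repetitions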

2.6. Directly estimating S

The size and heterogeneity of the datasets can be used to find proxies for the optimization of hyperparameter S. For this purpose, the optimized values of S for all training sets were investigated for dependencies on data features. Any resulting independent variable should be chosen such that it can easily be derived from a calibration dataset. As the AUC of the training data is highly correlated with S, and is independent of the sample size, it was chosen as the independent variable. The dependent variable S was then linearly fitted to the AUC of the training data. Fitting was done using iteratively re-weighted least-squares analysis [28] to reduce the influence of outliers. This gives two coefficients for each method, which transform the AUC of a dataset into a corresponding hyperparameter S†, which approximates S. All EEG datasets were re-analyzed using this approximation by S†, and the performance was compared to the optimized variant. In the case of multiple hyperparameters, the one with the highest correlation coefficient was approximated whilst fixing the second one. For Jin, j_mean was fixed to the minimum of two iterations, and for Lenhardt, S_1 was fixed to 1.

3. Experimental procedure

All analyses were performed offline, but maximum generalization was ensured by leaving out all online data for validation. The experimental protocol is depicted in figure 3 and followed these steps:

(i) EEG data are split into training and test trials (see conditions below).

(ii) Training trials are transformed into classifier scores through two-fold cross-validation. Additionally, a classifier is trained on all training trials and applied to the test trials.

(iii) The classifier scores from the cross-validation are used to obtain Θ for each stopping method. The trained stopping methods are then applied to the classifier scores from the test trials, to determine the stopping point for each trial.

(iv) Accuracy and stimulation duration are transformed into SPM. The resulting SPM of the validation data is used throughout this manuscript, unless stated otherwise.

Two conditions need to be distinguished. Offline: the above procedure takes place inside an eight-fold cross-validation. Data come from the original calibration data only, and at each fold 21 trials are used for training and three trials are used for testing. All parameters are estimated within the cross-validation procedure. Quasi-online: this condition was performed once, after the optimization procedures performed satisfactorily. Here, 21 trials of the calibration data were used as training trials, and the test trials consisted of all online data.

The procedures were the same for the artificially generated data, with the addition that the experiments were run 40 times per condition. All reported performance gains are calculated with the no stopping condition as a baseline, by subtracting the no stopping performance. Negative performance gains thus mean that the stopping method results in a worse performance than when no stopping is applied.

Figure 3. Validation procedure. The validation procedure is the same for both conditions (offline and quasi-online), and for EEG and toy data. The number of training trials (x) was changed to evaluate the robustness of the methods. For all other analyses, 21 training trials were chosen.

4. Results

4.1. Artificial data

All of the methods are quite robust to most types of data distortion. The largest performance deteriorations were caused by the class separation, outliers and systematic confusions, which are shown in figure 4(B). Results are summarized by their mean performance gain over the no stopping condition (see figure 4(A)) over the given parameter space. The gain of a specific setting was considered significant if at least 90% of the 40 simulations yielded an output on the same side of the baseline (black bars). Bars are gray where no such significance was found.

Separation. Most of the methods behave well for data that are difficult to separate, meaning that they do not decrease performance below the no stopping baseline when data separation goes to zero. Liu clearly struggles as the data become difficult to separate, severely undershooting the baseline performance. This is reflected in the relatively low average gain. To a lesser extent, Jin suffers from the same effect for the most difficult data. For difficult data, Rank diff shows a constant and tiny, but significant reduction. Zhang is exceptionally resilient to inseparable data, with a significant performance gain for all but the most inseparable data.

Outlier. All of the methods have severe problems in dealing with outliers. None of the methods reached a positive average gain over the given parameter space. Though Liu performs worst for outliers, here the picture is not that clear. In fact, all of the methods severely deteriorate performance compared to the baseline when many outliers are present, though the level of tolerance differs slightly between methods. It is important to note that the number of outliers went as high as 65%, and that these outliers were not observed in the calibration data.

Confusion. All methods are influenced by systematic confusions, but they degraded gracefully to the baseline. Interestingly, Liu also does not show any significant performance reduction from the baseline, although confusion is a special case of separation.

The results for all of the datasets can be found in the supplementary material (see supplementary figure 1 (available from stacks.iop.org/JNE/10/036025/mmedia)). It is worth noting that Liu is the only method that is clearly not robust against drifts, showing a performance peak around the no-drift parameter and falling off to both sides with positive and negative drifts. Furthermore, Liu, Rank diff, and Zhang show reduced performance gain when the data are downscaled. For all of the other methods, and for upscaling, only marginal influences were found. An increased number of classes reduced the gain for all methods, and a more skewed distribution showed higher gains for all methods.

4.2. EEG BCI data

While the results on the artificial data give an intuition of the properties of the various methods, the truly relevant question is their behavior on real BCI data, which is summarized in figure 5. On average over all datasets, all methods improve the performance. However, this does not hold when observing the paradigms individually.

All of the methods significantly increase the performance on two datasets (center speller and RSVP speller), and all methods apart from fixed and fixed optimal managed to increase the performance on the cake speller dataset (for significance levels see figure 5). Performance on the hex-o-spell dataset is significantly increased by all but Liu. A significant increase in performance for the MVEP dataset is found for Rank diff, Zhang, and Lenhardt. None of the methods significantly increases (or decreases) performance on the AMUSE dataset.


Figure 4. Results on artificial datasets. Those three out of seven artificial datasets (see figure 2) that had particular influence on the early-stopping methods are displayed. (A) To indicate the effect size, the baseline performance of no stopping is shown. (B) The bars indicate the absolute performance gain over the no stopping condition, and are averaged over 40 simulations. Bars are black if 90% of the 40 simulations result in the same sign, and gray otherwise. The plotted numbers indicate the means of the gains over the depicted parameter space of each distortion method and stopping strategy, with larger positive numbers indicating a more robust behavior of the stopping strategy.

In summary, no method increased performance significantly on all datasets. Rank diff, Zhang, and Lenhardt did so for five datasets, Höhne and Jin for four datasets, and fixed, fixed optimal, and Liu for three datasets.

Figure 6 shows the average gain over no stopping for each stopping method, accumulated over datasets. Over all datasets, Höhne, Lenhardt, and Zhang performed particularly well, as indicated by the high cumulative average gain (4.17, 4.14, and 4.47 respectively). Directly after these three methods follow Jin and Rank diff, with cumulative gains of 3.11 and 3.49. The cumulative gain for Liu is competitive (2.83), in particular for the 'easier' datasets. When subtracting the negative gain for AMUSE and MVEP it reduces to 2.60.

Both of the fixed methods (fixed and fixed optimal) show a cumulative gain over no stopping (2.08 and 2.36, after subtraction of negative paradigms). However, their gains are clearly smaller when compared to most dynamic methods. A full image depicting individual gains for all subjects and all stopping methods can be found in supplementary figure 2 (available from stacks.iop.org/JNE/10/036025/mmedia). In absolute terms, the fixed methods and Liu reduce performance for about 26.5% to 30% of the subjects, against 8.4% to 15.7% for the dynamic methods.

In short, Höhne, Lenhardt, and in particular Zhang perform best in terms of gain. Liu performs well for good subjects, but underperforms on difficult EEG data.

4.3. Required training data

Figure 7 shows the effect of the training set size on the estimated online performance, as averaged over all datasets. With the minimum number of three training trials, all methods but Jin and Lenhardt decrease in performance with respect to no stopping, on average. With 12 training trials or more, all methods increase performance over the no stopping condition, on average. Performance for the no stopping condition converges after about 12 trials, as the accuracy saturates to 100% for several paradigms. For all stopping methods, the relative gain increases substantially during these first four conditions (three to 12 training trials). After 12 training trials, most dynamic methods keep increasing in performance until trial 21, showing a performance gain that is impossible without such stopping methods.

Importantly, on average all dynamic methods except for Liu consistently outperform the fixed and fixed optimal methods from six training trials upwards. Most dynamic approaches can thus exploit the data better than simply fixing a number of iterations, even with small amounts of training data. Liu needed on average 15 trials or more to outperform fixed (optimal). Though the effect size varies, the average findings largely transfer to the individual datasets (not shown).

4.4. Typical parameter settings

The hyperparameter S of most methods is particularly correlated with the AUC of the training data. Even across different paradigms, with all the different settings, the optimized S is a rather robust function of this data feature (see supplementary figure 3 (available from stacks.iop.org/JNE/10/036025/mmedia)). The black line shows the fit of all of the data, and the gray lines fit the data, each with one dataset left out. Only for method Jin do these fits not clearly overlap, which is likely due to the fixing of j_mean in the online case, the inherent discrete nature of the hyperparameters and the small parameter range. All other methods have a robust fit, regardless of the aforementioned factors and the paradigm fitted. The resulting coefficients can be found in table 2 and can be used to approximate S according to the equation

S† = a · AUC + b.    (20)

Figure 5. Quasi-online performance. Averaged SPM performance is given for all stopping methods (columns) and experimental paradigms (dots). The connected black dots represent the averages over experimental paradigms. It was calculated over all subjects directly and accounts for the different number of subjects per paradigm. An asterisk marks the significance level of a performance increase over no stopping (*: p < 0.05; **: p < 0.01), as found by a Bonferroni-corrected paired t-test.

Figure 6. Cumulative gain. The y-axis shows the cumulative average gain with respect to no stopping, for each of the stopping methods. The height of a segment indicates the average performance gain which was realized for a particular dataset. Negative average performance gains are displayed below the zero line. It is interesting to note that the performance metric is SPM, and thus directly interpretable.

Figure 7. The influence of the number of training trials on the performance of stopping methods. With a low number of training trials, most dynamic methods deteriorate the performance with respect to no stopping. When increasing the number of training trials, the dynamic methods start outperforming no stopping. Their performance increase converges at a larger number of trials, indicating a gain which is impossible to achieve without using dynamic methods.

Figure 8. Performance comparison between optimized S and global S†, which has been estimated by linear regression. The optimized S boxplots are the quasi-online performance results of each method, when trained using the subject-specific optimization (S). The global S† boxplots indicate the results when the methods are re-run and S† is directly estimated from the AUC of the training data, using the coefficients in table 2. Performance differences are not significant, as confirmed by a Bonferroni-corrected paired t-test. This advocates in favor of using the simple regression method for estimating the stopping hyperparameters, which improves the usability of stopping methods for the practitioner.

Table 2. Optimization approximation. Given the tested datasets, hyperparameter S was found to correlate with the AUC of the training data. Coefficients of the linear fit are given here for each method. The approximated S† can be recovered from the training data as S† = a · AUC + b, which can form a starting point for new experiments or further optimization.

Method          a        b
Fixed optimal   −21.96   22.03
Liu             −4.58    5.46
Jin             −7.90    8.86
Rank diff       −1.97    1.77
Lenhardt        −1.07    1.07
Höhne           0.10     −0.04
Zhang           0.80     −0.30
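
Applying the approximation is a one-liner; for example, with the Zhang row of table 2 and a training AUC of 0.85 (an assumed value):

a, b = 0.80, -0.30              # Zhang coefficients from table 2
auc_train = 0.85                # AUC estimated on the calibration data
S_dagger = a * auc_train + b    # equation (20): 0.38, i.e. alpha
# Zhang then stops once the maximum posterior reaches 1 - 0.38 = 0.62.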

The result of re-analyzing all of the data, using the above equation to determine S†, can be found in figure 8. Results are summarized by their median, 25th and 75th percentiles, and minimum and maximum gain. Though the use of S† significantly decreased the training performance for several methods when compared to the optimal S (not shown), it did not significantly change the online performance for any method. Significance was tested using a Bonferroni-corrected, paired t-test.

5. Discussion

Several methods for dynamic stopping have been proposed in the literature, but a direct performance comparison has been difficult due to the variety of paradigms, pre-processing steps, timings and evaluation metrics. Here, these methods are systematically tested within the same theoretical framework, on the same datasets and under the same conditions. Artificial data are generated and evaluated to test the methods' performances under strictly controlled conditions.

Subsequently, their performance is evaluated on pre-recorded real EEG data from a variety of different BCI experiments, which cover the spectrum of ERP paradigms reported in the BCI literature well. This evaluation strategy allows us to draw conclusions on the benefits and drawbacks of these methods and to directly compare between them.

In general, it can be said that early stopping, be it dynamic or fixed, increases the performance in the majority of cases. This is particularly true when a paradigm results in highly discriminative classifier scores, with average performance gains of up to 1.19 SPM for the RSVP data with the stopping method of Zhang, which represents a 78% performance increase over no stopping. Most methods can be applied even when the classifier scores become less discriminative; though they might no longer increase the performance, they reduce to the baseline performance at worst. Artificial data (figure 4) show that this is the case for all methods except Liu and, in the case of very difficult data, Jin. Considering patient applications of stopping methods, this is an important feature, as patient data are typically less clean and more difficult to classify. Most methods can thus be applied without worrying about performance degradation.

Adding a robust early stopping module to an ERP-based BCI is a good idea, provided that enough data are available. Training an early stopping method on top of the classifier effectively means estimating more parameters from the same amount of data. This can pose a problem, as data are often limited. However, over all datasets, 9 to 12 training trials suffice for the methods to increase the online performance on average. Given the difference in epochs per trial between datasets, the actual number may vary for different paradigms. Though nine trials is an amount that is normally available for training, the more difficult the data are, the more trials may be needed. For difficult datasets (such as MVEP or AMUSE), dynamic stopping methods need more data to perform reliably.

Before additional modules are added to a complex BCI system, another aspect needs to be taken into consideration: the tested methods have a different implementation complexity, though it is difficult to quantify this. The fixed methods are obviously the simplest. Taking the best performing methods Lenhardt, Rank diff, Höhne and Zhang, the simplest may arguably be considered Höhne, given that the necessary statistical tests are part of most data analysis toolboxes. The Bayesian statistics behind Zhang could be considered more complex and require a custom implementation. Rank diff and Lenhardt are rather simple to implement, and fall somewhere between the other two. All methods showed runtime performances that are well in the range of what would be applicable online (typically between 3 and 27 ms per decision). As the methods were not optimized for runtime performance, these values can most likely be reduced.

5.1. Dynamic versus fixed

Early stopping methods allow the BCI to adapt to the current state of the user. In doing so, they can reduce the time required for a selection, and thus increase the robustness of the system. The latter seems to be particularly true for dynamic methods, since they can adapt to changes in the data. Though fixed and fixed optimal are the simplest methods and require very little adaptation of the BCI system, they are not dynamic; during online application the number of iterations is fixed regardless of the data coming in.

On average, the fixed methods do increase the performance with respect to no stopping. Unfortunately, their performance gains leave much room for improvement, which is observed both in artificial data and real EEG data. This is indeed due to the fact that these methods are incapable of adapting to changes in the data distributions over time. When observing the online data, parts of the data that have an AUC below average show the largest reduction in performance for the fixed methods. Interestingly, they also show the largest performance improvement when non-stationarities temporarily lead to a higher SNR of the data.

5.2. User-optimized S versus global S†

As most of the original publications do not specify how to find an appropriate value for the hyperparameter (here called S), a general optimization scheme is proposed. With the SPM, it uses a realistic metric as objective function and leads to considerable performance gains. To optimize S, realistic simulations of the online performance for different values of S are required, which can be time consuming and error-prone to implement. For the BCI practitioner, a look-up table or regression function would be convenient. For instance, such a regression function would help to determine at least a suitable hyperparameter for a new BCI user, given some collected training data. The empirical observation that S is highly correlated with the AUC led to a set of coefficients that translate a given AUC of the training data into an approximation of S, called S†. Using S†, the online performance gains were similar to the performance gains using S.

Apart from being an interesting basic finding, this result has practical relevance. The AUC can be estimated reliably with a relatively low amount of data, say a few trials. Using the regression coefficients, the problem of training on a low number of training trials may thus be overcome. But even with an appropriate training size, they alleviate the practitioner of the task of running simulations to optimize S. They thereby reduce the barrier for using stopping methods in similar paradigms.

5.3. Benchmark data

The classifier scores of the EEG data used in this study are available in the supplementary material (available from stacks.iop.org/JNE/10/036025/mmedia). It contains all data required for reproducing the current results and, more importantly, for benchmarking future methods. Classifier scores for each fold are given, such that over-fitting can be avoided and realistic comparisons can be made. Furthermore, example scripts were created to facilitate the use of these data. They take care of the proper cross-validation. Only one training and one test function need to be written for any new method.

Be aware that the implementation that is the basis for this paper differs slightly and contains a random factor; results will thus not exactly match those from the example script.

5.4. Limitations of the study

With offline analyses, the danger of overfitting the data is always present, and online validations are considered to be the final conclusive evidence. In this case, the protocol was kept strict so as to mimic such a validation, leaving the online data untouched until the very end. As running online trials with several stopping methods active at the same time would be impossible, a block-structured protocol would be necessary. The current quasi-online protocol bypasses this need, whilst keeping the validity of the test intact. One clear difference between the two is that for the latter, the human is not in the loop. Potentially, shortening the trials can increase or reduce the workload on the user. In particular, when an early stop results in a false decision, the user might get frustrated. This study shows the in-principle validity of the tested methods. Influences of such methods on the human in the loop remain an open topic, though at least for Jin [20], Lenhardt [16], and Rank diff [19] there is partial online evidence from direct comparison.

Method Liu differed slightly from the original implementation, to comply with the testing framework. The linear-kernel support-vector machine [17] originally employed is very similar to the LDA used here, so this design difference is unlikely to be the cause of the poor performance. Temporally smoothing the features, as proposed by Liu, could result in cleaner data. At the same time, it introduces strong dependencies between samples, which violates the assumptions of most classifiers. The influence of this on the current results is unclear.
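To make the concern concrete, the sketch below shows one simple form of temporal smoothing, a moving average over consecutive classifier scores. This is an illustration of the general principle, not a reimplementation of the smoothing used by Liu: cleaner, less noisy scores come at the price of statistical dependence between neighbouring samples.

    import numpy as np

    def smooth_scores(scores, window=3):
        # Moving-average smoothing of a 1D sequence of classifier scores.
        # Output samples within 'window' of each other become correlated.
        kernel = np.ones(window) / window
        return np.convolve(scores, kernel, mode="same")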

For the sake of comparability, the current analyses were done using a single linear classifier. LDA is the most prominent classifier in the ERP-based BCI literature, and is particularly suited for ERP data [24]. However, given that for all methods the gained performance depends on the AUC of the classifier outputs, any improvement made to the AUC (by smarter feature extraction or better classification) could be expected to pay off twice: it improves the performance in itself, and additionally benefits more from the stopping methods. The influence of different, potentially nonlinear methods remains an open question.

6. Conclusion

This study for the first time benchmarks several methods for optimizing event-related potential (ERP) based brain–computer interfaces (BCIs) under the same conditions. The results clearly show the benefit of stopping methods, both fixed and dynamic. In particular, the methods Zhang, Höhne and Lenhardt can be recommended for high performance. Caution is advised with Liu, in particular for subjects with less discriminative data, as it can clearly reduce performance with respect to the baseline. All in all, these methods could be highly instrumental in making ERP-based BCIs even more efficient and adaptable to transient changes in data characteristics, both of which are important for long-term use of a BCI.

Acknowledgments

The authors would like to thank Sulamith Schaeff, Laura Acqualagna, Nico Schmidt and Matthias Treder for making their EEG data available for this evaluation. This work is supported by the European ICT Programme Project FP7-224631 and BFNT 01GQ0850. This paper only reflects the authors' views and funding agencies are not liable for any use that may be made of the information contained herein.

References

[1] Vaughan T, McFarland D, Schalk G, Sarnacki W, Krusienski D, Sellers E and Wolpaw J 2006 The Wadsworth BCI research and development program: at home with BCI IEEE Trans. Neural Syst. Rehabil. Eng. 14 229–33

[2] Millán J d R et al 2010 Combining brain–computer interfaces and assistive technologies: state-of-the-art and challenges Front. Neuroprosthetics 4 161

[3] Sellers E W, Vaughan T M and Wolpaw J R 2010 A brain–computer interface for long-term independent home use Amyotrophic Lateral Scler. 11 449–55

[4] Kleih S C et al 2011 Out of the frying pan into the fire—the P300-based BCI faces real-world challenges Prog. Brain Res. 194 27–46

[5] Plass-Oude Bos D, Reuderink B, van de Laar B, Gürkök H, Mühl C, Poel M, Nijholt A and Heylen D 2010 Brain–computer interfacing and games Brain–Computer Interfaces (Human–Computer Interaction Series) ed D S Tan and A Nijholt (London: Springer) pp 149–78

[6] Blankertz B et al 2010 The Berlin brain–computer interface: non-medical uses of BCI technology Front. Neurosci. 4 198

[7] Guger C, Edlinger G, Harkam W, Niedermayer I and Pfurtscheller G 2003 How many people are able to operate an EEG-based brain–computer interface (BCI)? IEEE Trans. Neural Syst. Rehabil. Eng. 11 145–7

[8] Farwell L and Donchin E 1988 Talking off the top of your head: toward a mental prosthesis utilizing event-related brain potentials Electroencephalogr. Clin. Neurophysiol. 70 510–23

[9] Kaufmann T, Schulz S, Grünzinger C and Kübler A 2011 Flashing characters with famous faces improves ERP-based brain–computer interface performance J. Neural Eng. 8 056016

[10] Treder M S, Schmidt N M and Blankertz B 2011 Gaze-independent brain–computer interfaces based on covert attention and feature attention J. Neural Eng. 8 066003

[11] Aloise F, Aricò P, Schettini F, Riccio A, Salinari S, Mattia D, Babiloni F and Cincotti F 2012 A covert attention P300-based brain–computer interface: Geospell Ergonomics 55 538–51

[12] Hill N J, Lal T N, Bierig K, Birbaumer N and Schölkopf B 2005 An Auditory Paradigm for Brain–Computer Interfaces (Advances in Neural Information Processing Systems vol 17) ed L K Saul, Y Weiss and L Bottou (Cambridge, MA: MIT Press) pp 569–76

[13] Schreuder M, Blankertz B and Tangermann M 2010 A new auditory multi-class brain–computer interface paradigm: spatial hearing as an informative cue PLoS ONE 5 e9813

[14] Brouwer A-M and van Erp J B F 2010 A tactile P300 brain–computer interface Front. Neurosci. 4 19

[15] Zhang H, Guan C and Wang C 2008 Asynchronous P300-based brain–computer interfaces: a computational approach with statistical models IEEE Trans. Biomed. Eng. 55 1754–63

[16] Lenhardt A, Kaper M and Ritter H 2008 An adaptive P300-based online brain–computer interface IEEE Trans. Neural Syst. Rehabil. Eng. 16 121–30

[17] Liu T, Goldberg L, Gao S and Hong B 2010 An online brain–computer interface using non-flashing visual evoked potentials J. Neural Eng. 7 036003

[18] Höhne J, Schreuder M, Blankertz B and Tangermann M 2010 Two-dimensional auditory P300 speller with predictive text system Proc. IEEE Conf. of Engineering in Medicine and Biology Society vol 2010 pp 4185–8

[19] Schreuder M, Rost T and Tangermann M 2011 Listen, you are writing! Speeding up online spelling with a dynamic auditory BCI Front. Neurosci. 5 112

[20] Jin J, Allison B Z, Sellers E W, Brunner C, Horki P, Wang X and Neuper C 2011 An adaptive P300-based control system J. Neural Eng. 8 036006

[21] Höhne J, Krenzlin K, Dähne S and Tangermann M 2012 Natural stimuli improve auditory BCIs with respect to ergonomics and performance J. Neural Eng. 9 045003

[22] Acqualagna L and Blankertz B 2011 A gaze independent speller based on rapid serial visual presentation Proc. IEEE Conf. of Engineering in Medicine and Biology Society vol 2011 pp 4560–3

[23] Schaeff S, Treder M S, Venthur B and Blankertz B 2012 Exploring motion VEPs for gaze-independent communication J. Neural Eng. 9 045006

[24] Blankertz B, Lemm S, Treder M S, Haufe S and Müller K-R 2011 Single-trial analysis and classification of ERP components—a tutorial NeuroImage 56 814–25

[25] Schlögl A, Kronegg J, Huggins J and Mason S G 2007 Evaluation criteria for BCI research Towards Brain–Computer Interfacing ed G Dornhege, J del R Millán, T Hinterberger, D McFarland and K-R Müller (Cambridge, MA: MIT Press) pp 297–312

[26] Wolpaw J R, Birbaumer N, Heetderks W J, McFarland D J, Peckham P H, Schalk G, Donchin E, Quatrano L A, Robinson C J and Vaughan T M 2000 Brain–computer interface technology: a review of the first international meeting IEEE Trans. Rehabil. Eng. 8 164–73

[27] Schreuder M, Höhne J, Treder M S, Blankertz B and Tangermann M 2011 Performance optimization of ERP-based BCIs using dynamic stopping Proc. IEEE Conf. of Engineering in Medicine and Biology Society pp 4580–3

[28] Holland P W and Welsch R E 1977 Robust regression using iteratively reweighted least-squares Commun. Stat.-Theory Methods 6 813–27
