
Transcript of arXiv:2206.14261v1 [cs.LG] 28 Jun 2022

Computer Vision and Image Understanding
Authorship Confirmation

Please save a copy of this file, complete and upload as the “Confirmation of Authorship” file.

As corresponding author I, Sumeet Menon, hereby confirm on behalf of all authors that:

1. This manuscript, or a large part of it, has not been published, was not, and is not being submitted to any other journal.

2. If presented at or submitted to or published at a conference(s), the conference(s) is (are) identified and substantial justification for re-publication is presented below. A copy of conference paper(s) is (are) uploaded with the manuscript.

3. If the manuscript appears as a preprint anywhere on the web, e.g. arXiv, etc., it is identified below. The preprint should include a statement that the paper is under consideration at Computer Vision and Image Understanding.

4. All text and graphics, except for those marked with sources, are original works of the authors, and all necessary permissions for publication were secured prior to submission of the manuscript.

5. All authors each made a significant contribution to the research reported and have read and approved the submitted manuscript.

Signature: Sumeet Menon    Date: 6/24/2022

List any pre-prints:

arXiv.org

Relevant Conference publication(s) (submitted, accepted, or published):

N/A

Justification for re-publication:

N/A


Graphical Abstract (Optional)

To create your abstract, please type over the instructions in the template box below. Fonts or abstract dimensions should not be changed or altered.

Semi-supervised Contrastive Outlier removal for Pseudo Expectation Maximization (SCOPE)
Sumeet Menon, David Chapman

Semi-supervised learning is the problem of training an accurate predictive model by combining a small labeled dataset with a presumably much larger unlabeled dataset. Many methods for semi-supervised deep learning have been developed, including pseudolabeling, consistency regularization, and contrastive learning techniques. Pseudolabeling methods, however, are highly susceptible to confounding, in which erroneous pseudolabels are assumed to be true labels in early iterations, thereby causing the model to reinforce its prior biases and fail to generalize to strong predictive performance. We present a new approach to suppress confounding errors through a method we describe as Semi-supervised Contrastive Outlier removal for Pseudo Expectation Maximization (SCOPE). Like basic pseudolabeling, SCOPE is related to Expectation Maximization (EM), a latent variable framework which can be extended toward understanding cluster-assumption deep semi-supervised algorithms. However, unlike basic pseudolabeling, which fails to adequately take into account the probability of the unlabeled samples given the model, SCOPE introduces an outlier suppression term designed to improve the behavior of EM iteration given a discrimination DNN backbone in the presence of outliers. Our results show that SCOPE greatly improves semi-supervised classification accuracy over a baseline, and furthermore, when combined with consistency regularization, achieves the highest reported accuracy for the semi-supervised CIFAR-10 classification task using 250 and 4000 labeled samples. Moreover, we show that SCOPE reduces the prevalence of confounding errors during pseudolabeling iterations by pruning erroneous high-confidence pseudolabeled samples that would otherwise contaminate the labeled set in subsequent retraining iterations.

Research Highlights (Required)

To create your highlights, please type the highlights against each \item command.

It should be a short collection of bullet points that convey the core findings of the article. It should include 3 to 5 bullet points (maximum 85 characters, including spaces, per bullet point).

• Novel method to reduce confounding for deep semi-supervised learning.

• Prior deep pseudolabeling methods omit key term in EM iteration.

• Theoretical and intuitive justification why omitted term causes confounding.

• Two novel outlier suppression methods approximate omitted term in EM.

• SCOPE achieves state-of-the-art performance on the semi-supervised CIFAR-10 task.


Computer Vision and Image Understanding
journal homepage: www.elsevier.com

Semi-supervised Contrastive Outlier removal for Pseudo Expectation Maximization (SCOPE)

Sumeet Menon (a, ∗∗), David Chapman (a, b)

(a) University of Maryland, Baltimore County, 1000 Hilltop Circle, Baltimore, Maryland, 21250.
(b) University of Miami, 1320 S Dixie Hwy, Coral Gables, FL 33146

ABSTRACT

Semi-supervised learning is the problem of training an accurate predictive model by combining a small labeled dataset with a presumably much larger unlabeled dataset. Many methods for semi-supervised deep learning have been developed, including pseudolabeling, consistency regularization, and contrastive learning techniques. Pseudolabeling methods, however, are highly susceptible to confounding, in which erroneous pseudolabels are assumed to be true labels in early iterations, thereby causing the model to reinforce its prior biases and fail to generalize to strong predictive performance. We present a new approach to suppress confounding errors through a method we describe as Semi-supervised Contrastive Outlier removal for Pseudo Expectation Maximization (SCOPE). Like basic pseudolabeling, SCOPE is related to Expectation Maximization (EM), a latent variable framework which can be extended toward understanding cluster-assumption deep semi-supervised algorithms. However, unlike basic pseudolabeling, which fails to adequately take into account the probability of the unlabeled samples given the model, SCOPE introduces an outlier suppression term designed to improve the behavior of EM iteration given a discrimination DNN backbone in the presence of outliers. Our results show that SCOPE greatly improves semi-supervised classification accuracy over a baseline, and furthermore, when combined with consistency regularization, achieves the highest reported accuracy for the semi-supervised CIFAR-10 classification task using 250 and 4000 labeled samples. Moreover, we show that SCOPE reduces the prevalence of confounding errors during pseudolabeling iterations by pruning erroneous high-confidence pseudolabeled samples that would otherwise contaminate the labeled set in subsequent retraining iterations.

Keywords: Semi-supervised, Machine Learning, Outlier Removal, Pseudolabeling, Confounding.

© 2022 Elsevier Ltd. All rights reserved.

1. Introduction

Learning high quality model representations from limited labeled data is a problem that deep learning has not yet overcome. Although there has been much progress in this area, the vast majority of published deep learning algorithms need to be trained with large labeled data volumes in order to perform well. The applications for semi-supervised learning are numerous because in many domains, unlabeled data is plentiful yet high quality labeled data is scarce. Data labeling remains a task that is time consuming, expensive and error prone. As such, methods to reduce the need for manually labeled data may be impactful.

∗∗[email protected]

Semi-supervised learning becomes particularly challenging when the labeled data volumes are very small relative to the unlabeled volumes. In this case, many methods, especially pseudolabel techniques, are susceptible to confounding errors, in which the model at first learns some bias due to the small and inadequate labeled sample, then proceeds to pseudolabel some of the data incorrectly with high confidence, and then proceeds to retrain treating incorrect pseudolabels as real labels, thereby reinforcing its prior bias. This confounding bias can prevent a pseudolabel learning algorithm from achieving acceptable performance. If the initial labeled sample is sufficiently large and representative, these confounding errors are typically less problematic. But as we decrease the labeled sample size, this confounding bias becomes more and more of a concern.


Fig. 1. Illustration of the incompatibility of discrimination models with the cluster assumption. Left: for a generative model, the outlier sample is considered to have low probability of belonging to either cluster. Right: for a discrimination model, the outlier sample is considered to belong to the blue cluster with high confidence due to the large distance from the decision boundary.

1.1. Intuition of the cause of confounding bias

We now introduce our main intuition for what we believe to be a causal factor of confounding bias for pseudolabeling-based semi-supervised Deep Neural Network (DNN) models. As part of the theoretical justification for the proposed SCOPE technique, we discuss the close relationship between pseudolabeling and Expectation Maximization (EM) (Dempster et al., 1977), a classical latent variable meta-learning method based on the cluster assumption. EM, however, is designed to be applied to generative models, whereas DNN classifiers making use of sigmoid (binary) or softmax (categorical) final activations are within the category of discrimination models. The use of a discrimination model with an EM-like meta-learning algorithm (pseudolabeling) leads to mismatched assumptions, and more importantly, a potential for a confounding failure case in the presence of outlier samples.

Figure 1 illustrates how outlier samples could cause confounding errors if a cluster assumption based meta-learning algorithm (e.g. pseudolabeling) is employed with a discrimination model backbone (e.g. DNN classifier). As seen in Figure 1 (right), with a discrimination backbone, the outlier sample is incorrectly predicted to have high confidence of belonging to the blue cluster. However, as seen in Figure 1 (left), with a generative backbone, the outlier sample is correctly predicted to have low confidence of belonging to either cluster.

Thus, if the purpose of pseudolabeling is to include only the most reliable pseudolabels as part of the labeled clusters for subsequent model retraining, then it is inadequate to assume that the most reliable pseudolabels are the ones with the highest predicted confidence. More precisely, the confident predictions are the ones with the greatest posterior probabilities of the label Y given image X and model parameters θ: p(Y|X, θ), whereas we describe as part of the theoretical justification that the most reliable samples from the standpoint of an EM framework are the ones with the greatest posterior probability of the image/label tuple p(Y, X|θ). As such, basic pseudolabeling ignores the probability of the unlabeled sample occurring given the model parameters p(X|θ) and overlooks the impact that this probability has on the reliability of inclusion in the labeled clusters for retraining. Clearly, if X is an outlier sample then p(X|θ) is much smaller than if X is an inlier sample. As such, our intuition is that the presence of outlier samples can be understood as a potential failure case of pseudolabeling methods, with the symptoms of this failure being the exacerbation of confounding errors.

1.2. Problem Definition for Semi-Supervised Learning

Semi-supervised learning can be defined as the problem of learning an accurate predictive model using a training dataset with very few labeled samples but a much larger number of unlabeled samples. Let us say that we have a set of training samples X and training labels Y. The samples can be further defined as a set of independent unlabeled samples X_U and labeled samples X_L, along with supervised labels Y_L and unobservable (latent) unsupervised labels Y_U. In practice, the number of unlabeled samples is also typically much larger than the number of labeled samples, |X_L| << |X_U|.

The accuracy of semi-supervised learning is typically evaluated using the multi-class classification task via a cross validated benchmark on a withheld test set X_T, Y_T. It is further assumed that the training and test sets follow the same distribution of samples and labels. The goal of semi-supervised learning is to minimize the expected testing loss as follows,

\min_\theta \; E\left( L(\hat{Y}_T - Y_T) \;;\; X_L, X_U, Y_L \right)    (1)

Although we do not have access to the unobserved labels Y_U, one can attempt to predict these labels using a technique called pseudolabeling. We define the pseudolabels as \hat{Y}_U^t, which are the predicted labels of the unlabeled set at the t-th Expectation Maximization (EM) iteration. If we define our predictive model as F(X, \theta^t), where the algebraic form of expression F represents the network architecture and the parameters \theta^t represent the model parameters at EM training iteration t, then the predicted pseudolabels can be defined as follows,

\hat{Y}_U^t = F(X_U, \theta^t)    (2)
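As an illustration only (not part of the original paper), the following minimal Python sketch shows how the pseudolabels of equation 2 could be generated with a PyTorch-style classifier standing in for F(·, θ^t); the model and data-loader names are hypothetical placeholders.

```python
import torch

@torch.no_grad()
def generate_pseudolabels(model, unlabeled_loader, device="cpu"):
    """Predict hard pseudolabels for the unlabeled set (a sketch of eq. 2)."""
    model.eval()
    pseudolabels = []
    for x_u in unlabeled_loader:           # batch of unlabeled images X_U
        logits = model(x_u.to(device))     # forward pass through F(., theta^t)
        pseudolabels.append(logits.argmax(dim=1).cpu())
    return torch.cat(pseudolabels)
```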

1.3. Definition of confounding errors and confounding bias

It is important to note that although confounding bias is recognized as an issue for semi-supervised bootstrapping methods such as pseudolabeling (Arazo et al., 2020), we are unaware of a precise mathematical definition of confounding bias and/or confounding errors. We propose a very simple definition of confounding errors that is intuitive and easily measurable given a labeled test set. We define a confounding error to be any erroneous pseudolabel that is included in the labeled set in a subsequent EM iteration. This is a confounding error because the erroneous pseudolabel is treated as a true label for model retraining; thus the retraining step reinforces the prior mistake.


We define confounding bias as the difference in the predicted probabilities of the test samples given the real converged model (including confounding errors) versus an analogous ideal model that was trained with confounding errors forcibly removed by an oracle prior to each retraining step.
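To make the definition concrete, here is a minimal sketch (our illustration, not from the paper) of how a confounding error rate of this kind could be measured, assuming oracle access to the true labels of the unlabeled set; all variable names are hypothetical.

```python
import torch

def confounding_error_rate(pseudolabels, true_labels, accepted_mask):
    """Fraction of accepted pseudolabels that are wrong, i.e. confounding errors.

    pseudolabels, true_labels: 1-D integer tensors over the unlabeled set.
    accepted_mask: boolean tensor marking pseudolabels added to the labeled set.
    """
    n_accepted = int(accepted_mask.sum())
    if n_accepted == 0:
        return 0.0
    n_wrong = int((pseudolabels[accepted_mask] != true_labels[accepted_mask]).sum())
    return n_wrong / n_accepted
```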

2. Related Work

The most basic pseudolabeling algorithm is self-training (Yarowsky, 1995; McClosky et al., 2006; Olivier et al., 2006; Zhai et al., 2019; Livieris et al., 2019; Rosenberg et al., 2005). With self-training, an initial classifier is first trained with a small amount of labeled samples, and this initial classifier is used to label a subset of the unlabeled samples. These pseudolabeled samples are then added to the training dataset for retraining if the pseudolabels pass a certain threshold. Since the classifier was trained initially only with a small subset of labeled samples, self-training models are susceptible to the confounding bias problem. Several papers have made use of an iterative form of self-training, in which the model iteratively switches between pseudolabeling the unlabeled data and retraining the model with confident pseudolabels added (Mustafa and Mantiuk, 2020; Menon et al., 2020; Nguyen et al., 2020). We call this iterative process "basic pseudolabeling", and this process can also be viewed as a specific interpretation of Expectation Maximization (EM) (Dempster et al., 1977) with several heuristic assumptions added. We review the close relationship between EM and basic pseudolabeling in detail as part of the theoretical justification section.

A related method to self-training is co-training, which is a form of multi-view training in which a dataset S can be represented as 2 independent feature sets S_1 and S_2. After the models m_1 and m_2 are trained on the respective feature sets, at every iteration, the predictions that surpass the predetermined threshold from exactly one model are then passed to the training dataset of the final model (Blum and Mitchell, 1998; Prakash and Nithya, 2014). In recent times, co-training has been used in 3-D medical imaging where the coronal, sagittal and axial views of the data were trained on three different networks (Xia et al., 2020). A consensus model between these three networks was used to predict the label for the unlabeled dataset. The major limitation with such types of models is that they are unable to correct their own mistakes, and any bias or wrong prediction made by the model results in confident but erroneous predictions.

A much more sophisticated pseudolabeling method is the "Mean Teacher" algorithm (Tarvainen and Valpola, 2017), which takes exponential moving averages of the model parameters to obtain a much more stable target prediction, and this method has shown significant improvements in results. One of the drawbacks of these types of methods is that they use domain specific augmentations. These problems have been overcome by techniques like "Virtual Adversarial Training" (Miyato et al., 2018). These techniques generate additional samples with similar characteristics to increase the data volume, thus avoiding random augmentations.

Contrastive loss is a prominent distance criterion for smoothness-based semi-supervised deep learning techniques, and is the foundation for Momentum Contrast (MoCo), its successors and related approaches (He et al., 2020; Oord et al., 2018; Henaff, 2020; Hjelm et al., 2018; Tian et al., 2020; Misra and Maaten, 2020; Li et al., 2020). There are several forms of contrastive loss (Hadsell et al., 2006; Wang and Gupta, 2015; Wu et al., 2018; Hjelm et al., 2018), but in its most general form one must define a similarity loss L_S to penalize similar samples from having different labels, as well as a difference loss L_D to penalize different samples from exhibiting the same label. Consistency regularization is another smoothness based strategy that has led to MixMatch and its derivatives for image classification (Berthelot et al., 2019b; Sohn et al., 2020; Berthelot et al., 2019a; Mustafa and Mantiuk, 2020). Consistency regularization assumes that if one augments an unlabeled sample, its label should not change, thereby implicitly enforcing a smoothness assumption between samples and simple augmentations thereof. The most common distance measurement techniques for these purposes are Mean Squared Error (MSE), Kullback-Leibler (KL) divergence and Jensen-Shannon (JS) divergence (Jeong et al., 2019; Verma et al., 2022; Yalniz et al., 2019; Sajjadi et al., 2016).

3. Theoretical Justification

3.1. Pseudolabeling as an EM approximation

Theoretically, pseudo-labeling as well as latent bootstrapping methods rely on the cluster assumption (Olivier et al., 2006). The cluster assumption can be intuitively paraphrased as follows: if there are 2 points that belong to the same cluster, then they (very likely) belong to the same class (Olivier et al., 2006). As such, clustering methods assume that the data is separable into K clusters C_1 ... C_K, in which the true decision boundary lies in-between the clusters and does not pass through any individual cluster. The cluster assumption can be mathematically defined as follows,

\exists\, C_1 \ldots C_K \;\; \text{s.t.} \;\; \text{if } x_i, x_j \in C_k \text{ then } y_i = y_j    (3)

Pseudolabeling is a special case of the cluster assumption, in which one not only assumes the decision boundary lies in-between the clusters, but further assumes the stronger condition that there is only one cluster per label category. The EM algorithm (Dempster et al., 1977) is the foundation for many clustering techniques. Maximum likelihood estimation of the simultaneous latent variable Z and model parameters θ can be obtained iteratively as follows,

\text{Expectation:} \quad Q(\theta|\theta^t) = E_{Z|X,\theta^t} \log L(\theta; X, Z)
\text{Maximization:} \quad \theta^{t+1} = \arg\max_\theta \, Q(\theta|\theta^t)    (4)

Pseudolabeling can be justified as an interpretation of EM, in that the latent variable Z is defined as the unobservable (unsupervised) training labels Y_U. Furthermore, the observed variable X describes all observable data measurements available for training, including the supervised training data X_S, the supervised training labels Y_S, as well as the unsupervised training data X_U, as follows,


Z = Y_U
X = \{ X_S, Y_S, X_U \}    (5)

Given the basic statistical identity that L(a|b) = p(b|a), the Expectation step under this interpretation is presented as follows,

Q(\theta|\theta^t) = E_{Y_U|X_S,Y_S,X_U,\theta^t} \log p(X_S, Y_S, X_U, Y_U \,|\, \theta)    (6)

One must further assume sample independence of the individual samples lying within the training dataset. Under this common assumption, the supervised and unsupervised contributions to the maximum likelihood expectation step can be split additively as follows,

Q(\theta|\theta^t) = E_{Y_U|X_S,Y_S,X_U,\theta^t} \log p(X_S, Y_S \,|\, \theta) + E_{Y_U|X_S,Y_S,X_U,\theta^t} \log p(X_U, Y_U \,|\, \theta)    (7)

which can be simplified as

Q(\theta|\theta^t) = \underbrace{\log p(X_S, Y_S \,|\, \theta)}_{\text{supervised branch}} + \underbrace{E_{Y_U|X_U,\theta^t} \log p(X_U, Y_U \,|\, \theta)}_{\text{unsupervised branch}}    (8)

One can also apply an additional Bayesian identity, in that p(a, b|c) = p(a|b, c) p(b|c). As such, the expectation can be expanded as follows,

Q(\theta|\theta^t) = \log \left[ p(Y_S \,|\, X_S, \theta) \, p(X_S \,|\, \theta) \right] + E_{Y_U|X_U,\theta^t} \log \left[ p(Y_U \,|\, X_U, \theta) \, p(X_U \,|\, \theta) \right]    (9)

For discrimination models, the model parameters are not used to generate the training samples p(X|θ), but instead to directly calculate the posterior predicted probabilities p(Y|X, θ).

As such, "basic pseudolabeling" with discrimination models can be viewed as making a convenience assumption in which the component p(X|θ) is omitted from the maximum likelihood calculation by setting this value to 1 as follows,

p(X_S \,|\, \theta) = p(X_U \,|\, \theta) = 1    (10)

Clearly equation (10) is overly simplistic, and moreover it exhibits a failure case when X is an outlier, for which p(X|θ) → 0. But given that discrimination backbones do not calculate p(X|θ) explicitly (as they calculate p(Y|X, θ) directly), it is common practice to omit the contribution of the p(X|θ) term rather than taking this term into account. In contrast, the proposed SCOPE method attempts to improve upon equation (10) in order to better handle outliers and thereby suppress confounding errors.

Given equation (10), the expectation term for "basic pseudolabeling" simplifies as follows,

Q(\theta|\theta^t) = \log p(Y_S \,|\, X_S, \theta) + E_{Y_U|X_U,\theta^t} \log p(Y_U \,|\, X_U, \theta)    (11)

One can make use of the sample independence assumption over N supervised samples, M unsupervised samples, and C categories to rearrange this expression in a more explicit form as follows,

Q(\theta|\theta^t) = \underbrace{\sum_{i=1}^{N} \sum_{c=1}^{C} Y_{S_{ic}} \log p(Y_{S_i} = c \,|\, X_{S_i}, \theta)}_{\text{supervised log loss}}
\; + \; \underbrace{\sum_{i=1}^{M} E_{Y_U|X_U,\theta^t} \sum_{c=1}^{C} Y_{U_{ic}} \log p(Y_{U_i} = c \,|\, X_{U_i}, \theta)}_{\text{unsupervised expected log loss}}    (12)

When presented in this form, it becomes clear that a basic pseudolabeling classifier should contain both supervised and unsupervised optimization terms, and that the supervised term resembles the negative of the well known multi-class log loss function. This term is negated only for the reason that EM attempts to maximize the expectation and is thus by definition a negated loss. The second term is the unsupervised contribution to the expectation. It can be seen that the unsupervised contribution is in some sense similar to the multi-class log-loss, but with an additional caveat, that this term should be averaged over all possible values of the unobserved training labels Y_U, given access to the unlabeled data X_U and the previous model parameters \theta^t from timestep t.

Basic pseudolabeling makes use of one additional assumption, in that the expected value is distributed inside the sum as well as inside of the log term as follows. In theory the expected value has the distributive property for summations, but does not strictly follow the distributive property for all convex functions such as log probabilities. Nevertheless, distributing the expected value in this way reduces the computational complexity, because it allows one to calculate a single "pseudolabel" per unlabeled sample rather than summing the unsupervised loss over all possible labels for the sample. We define this pseudolabeling approximation of Q(\theta|\theta^t) as follows,

Q(\theta|\theta^t) = \underbrace{\sum_{i=1}^{N} \sum_{c=1}^{C} Y_{S_{ic}} \log p(Y_{S_i} = c \,|\, X_{S_i}, \theta)}_{\text{supervised log loss}}
\; + \; \underbrace{\sum_{i=1}^{M} \sum_{c=1}^{C} \hat{Y}_{U_{ic}} \log p(Y_{U_i} = c \,|\, X_{U_i}, \theta)}_{\text{unsupervised log loss}}

where \hat{Y}_{U_i} is the pseudolabel as follows,

\hat{Y}_{U_i} = E_{Y_U|X_U,\theta^t}(Y_{U_i})    (13)
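For concreteness, the following minimal Python sketch (our addition, not from the paper) implements the negation of equation 13 as a training loss: a supervised cross-entropy term plus a cross-entropy term against hard pseudolabels produced by the previous-iteration model θ^t. The function and variable names are illustrative only.

```python
import torch
import torch.nn.functional as F

def basic_pseudolabel_loss(model, x_sup, y_sup, x_unl, prev_model):
    """Negated Q(theta|theta^t) of eq. (13): supervised cross-entropy plus
    cross-entropy against hard pseudolabels from the previous model theta^t."""
    sup_loss = F.cross_entropy(model(x_sup), y_sup)       # supervised log loss
    with torch.no_grad():
        pseudo = prev_model(x_unl).argmax(dim=1)          # pseudolabels Y_hat_U
    unsup_loss = F.cross_entropy(model(x_unl), pseudo)    # unsupervised log loss
    return sup_loss + unsup_loss
```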

3.2. SCOPE as an improved EM approximation

The purpose of the SCOPE methodology is to reduce confounding errors by improving upon an unsatisfactory assumption (equation 10) of basic pseudolabeling as an EM approximator. Notably, let us recall equation 9 as follows,

Q(\theta|\theta^t) = \log \left[ p(Y_S \,|\, X_S, \theta) \, p(X_S \,|\, \theta) \right] + E_{Y_U|X_U,\theta^t} \log \left[ p(Y_U \,|\, X_U, \theta) \, p(X_U \,|\, \theta) \right]    (9)

Basic pseudolabeling makes the convenient yet unsatisfactory assumption that p(X_S|θ) = p(X_U|θ) = 1 (equation 10). This assumption is convenient, because discrimination models such as DNNs do not attempt to measure the probability of the sample occurring given the model parameters p(X|θ), but instead attempt to directly infer the predicted probabilities p(Y|X, θ).

SCOPE is based on an improved assumption that again does not require the discrimination model to be able to predict the sample probabilities as part of a differentiable loss, but does require some ability to perform non-differentiable outlier suppression based on a previous estimate of the model parameters p(X|θ^t) as follows,

Q(\theta|\theta^t) \approx \log \left[ p(Y_S \,|\, X_S, \theta) \, p(X_S \,|\, \theta^t) \right] + E_{Y_U|X_U,\theta^t} \log \left[ p(Y_U \,|\, X_U, \theta) \, p(X_U \,|\, \theta^t) \right]    (14)

Replacing p(X|θ) with p(X|θ^t) is asymptotically justifiable, because as the EM model converges, the differences between the model parameters θ and θ^t in successive EM iterations will become negligible. A practical advantage of approximating p(X|θ) using p(X|θ^t) for the purposes of semi-supervised deep learning is that p(X|θ^t) does not depend on the current estimate θ and therefore does not require a differentiable form of this expression to be integrated into the gradient descent.

As p(X|θ^t) does not need to take a differentiable form, it is possible to approximate this quantity using a Bernoulli distribution, which we define as \bar{p}(X|θ^t), a binary approximation of p(X|θ^t), as follows,

p(X|\theta^t) \approx \bar{p}(X|\theta^t) \quad \text{where} \quad \bar{p}(X|\theta^t) = \begin{cases} 1 & \text{if } p(X|\theta^t) > \tau \\ 0 & \text{otherwise} \end{cases}    (15)

The use of a binary approximation implies an outlier removal strategy. As such, if one can identify samples for which p(X|θ^t) is unlikely, these samples can be removed as outliers from the Maximization step. Whereas, if one can identify samples for which p(X|θ^t) is likely, these samples can be included as inliers to the maximization. If one repeats the derivation of equation 13, but instead using this improved assumption for unlabeled data points, one arrives at the following, which describes the SCOPE methodology in its most general form.

Q(\theta|\theta^t) = \underbrace{\sum_{i=1}^{N} \sum_{c=1}^{C} Y_{S_{ic}} \log p(Y_{S_i} = c \,|\, X_{S_i}, \theta)}_{\text{supervised log loss}}
\; + \; \underbrace{\sum_{i=1}^{M} \sum_{c=1}^{C} \hat{Y}_{U_{ic}} \log p(Y_{U_i} = c \,|\, X_{U_i}, \theta) \, \bar{p}(X_{U_i}|\theta^t)}_{\text{unsupervised log loss}}

where \hat{Y}_{U_i} is the pseudolabel as follows,

\hat{Y}_{U_i} = E_{Y_U|X_U,\theta^t}(Y_{U_i})

and where \bar{p}(X_{U_i}|\theta^t) is the binary outlier removal term for unlikely samples.    (16)
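As an illustrative sketch only, the following Python code shows how the negation of equation 16 could be used as a training loss: it is identical to basic pseudolabeling except that each unlabeled sample is weighted by a binary inlier mask standing in for \bar{p}(X_{U_i}|θ^t), which Sections 4.2 and 4.4 realize with the Gaussian and contrastive nearest neighbor filters. All names are hypothetical.

```python
import torch
import torch.nn.functional as F

def scope_loss(model, x_sup, y_sup, x_unl, prev_model, inlier_mask):
    """Negated Q(theta|theta^t) of eq. (16): the unsupervised term is kept only
    for samples flagged as inliers by the binary term p_bar(X_U | theta^t).

    inlier_mask: boolean tensor produced by a SCOPE filter (Gaussian or
    contrastive nearest neighbor) from the previous model theta^t.
    """
    sup_loss = F.cross_entropy(model(x_sup), y_sup)
    with torch.no_grad():
        pseudo = prev_model(x_unl).argmax(dim=1)
    per_sample = F.cross_entropy(model(x_unl), pseudo, reduction="none")
    unsup_loss = (per_sample * inlier_mask.float()).mean()
    return sup_loss + unsup_loss
```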

4. Methodology

Fig. 2. Description of the SCOPE Meta-learning algorithm

Figure 2 shows the SCOPE meta-learning algorithm, which implements equation 16 as well as extends this EM approximator using additional information. Figure 2 describes the application of equation 16. The meta-learning algorithm consists of a supervised branch and an unsupervised branch. The supervised branch consists of the supervised images and their corresponding labels, which is the lower half of Figure 2. The images in the supervised branch are passed through an augmentation function which consists of basic augmentations such as left-shift, right-shift, rotate, etc. As shown in equation 16, the supervised branch has a supervised loss, which in this architecture is the cross-entropy between the true labels and the labels predicted by the model.

The top half of Figure 2 describes the unsupervised branch. The unlabeled images are passed through the consistency regularization layer. Consistency regularization, as a form of contrastive learning, is used as a part of the proposed approach (Sohn et al., 2020; Berthelot et al., 2019b,a). The unlabeled samples are passed into the model as 2 branches, the first branch being the weak augmentation of the unlabeled sample and the second branch being the strong augmentation of the same unlabeled sample (Sohn et al., 2020). The model's predictions for the weakly augmented images are thresholded at a confidence score of 0.95, and for every image that surpasses this threshold, the predicted class is considered as the true label for the unlabeled sample. The model's prediction for the strongly augmented image is considered as the prediction for the unlabeled sample, and a cross-entropy loss is applied between the predicted class for the weakly augmented sample and the predicted probability distribution of the strongly augmented image. This loss term as a whole is considered as the unlabeled loss. The labeled loss, on the other hand, is the cross-entropy between the true label of the sample and the predicted probability distribution of the labeled sample.
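A minimal sketch of the weak/strong consistency branch described above, in the spirit of FixMatch (Sohn et al., 2020). The 0.95 threshold follows the text; the model and augmentation callables are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

def unlabeled_consistency_loss(model, x_unl, weak_aug, strong_aug, tau=0.95):
    """Cross-entropy between the confident weak-augmentation pseudolabel and
    the prediction on the strong augmentation; low-confidence samples are masked."""
    with torch.no_grad():
        probs_weak = F.softmax(model(weak_aug(x_unl)), dim=1)
        conf, pseudo = probs_weak.max(dim=1)
        mask = (conf >= tau).float()              # keep only confident samples
    logits_strong = model(strong_aug(x_unl))
    loss = F.cross_entropy(logits_strong, pseudo, reduction="none")
    return (loss * mask).mean()
```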

There are many potential ways to implement a binary outlier removal term \bar{p}(X_{U_i}|θ^t). We analyze two simple heuristics that can be used for outlier removal. The Gaussian filter and contrastive nearest neighbor filter techniques, as seen in Figure 2, are different realizations of the SCOPE outlier suppression term \bar{p}(X|θ^t), and are described in Sections 4.2 and 4.4 respectively.


4.1. Consistency Regularization

Consistency regularization is an approach in which one adds an additional constraint that the predicted label should not change through augmentation. Let us define A as the space of augmentation parameters, and α as the augmentation function. Consistency regularization introduces the following constraint,

p(Y \,|\, \alpha(X, A_1), \theta) = p(Y \,|\, \alpha(X, A_2), \theta) \quad \text{for all } A_1, A_2 \in A    (17)

Consistency regularization is often defined with α as a random function, but it is equivalently presented here with α as a deterministic function with randomly chosen parameters A_1, A_2 ∈ A, where A is the space of augmentation parameters. This constraint states algebraically that augmentation should not change the predicted labels. Consistency regularization, like other forms of regularization, can be implemented by adding a penalty L_consist to the overall loss function for optimization. Or conversely, as EM is a maximization procedure, by subtracting the following penalty L_consist from the overall maximization step term Q(θ|θ^t),

L_{consist} = E_{A_1, A_2 \in A} \, L\left[ p(Y \,|\, \alpha(X, A_1), \theta) - p(Y \,|\, \alpha(X, A_2), \theta) \right]    (18)
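A minimal sketch of the consistency penalty of equation 18, assuming mean squared error as the distance L and a single randomly drawn pair of augmentation parameters per call; this is our illustration, and the augment callable is a placeholder.

```python
import torch.nn.functional as F

def consistency_penalty(model, x, augment):
    """L_consist of eq. (18), approximated with one pair of random augmentation
    parameters and mean squared error as the distance L."""
    p1 = F.softmax(model(augment(x)), dim=1)   # p(Y | alpha(X, A1), theta)
    p2 = F.softmax(model(augment(x)), dim=1)   # p(Y | alpha(X, A2), theta)
    return F.mse_loss(p1, p2)
```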

Consistency regularization is highly dependent on the ability to obtain a viable augmentation function that is unlikely to alter the true label of the image. Recent work in the use of consistency regularization for semi-supervised learning has yielded a number of augmentation functions that perform well for image and digit classification datasets, as described by the following papers: (Sohn et al., 2020; Berthelot et al., 2019b,a). SCOPE makes use of Control-Theory Augment (CT-Augment) (Berthelot et al., 2019a), Cut-out Augment (Devries and Taylor, 2017), and Rand Augment (Cubuk et al., 2020).

4.2. Gaussian Filtering and Outlier Removal

As previously described, \bar{p}(X_{U_i}|θ^t) is a binary outlier removal term, in which we attempt to remove samples that may be unlikely to appear given the model parameters. The most straightforward way to implement such a term would be to measure p(X|θ^t) directly and then to determine whether p(X|θ^t) exceeds a threshold τ. If p(X_{U_i}|θ^t) > τ then we assign \bar{p}(X_{U_i}|θ^t) to 1 (inlier), otherwise we assign \bar{p}(X_{U_i}|θ^t) to 0 (outlier). As the label categories are mutually exclusive, one may use the following identity,

p(X|\theta^t) = \sum_{c=1}^{C} p(X \,|\, Y = c, \theta^t) \, p(Y = c \,|\, \theta^t)    (19)

And given that p(Y = c) is mutually independent of θ^t (as we have no information of X), this simplifies to the following,

p(X|\theta^t) = \sum_{c=1}^{C} \frac{N_{Y=c}}{N} \, p(X \,|\, Y = c, \theta^t)    (20)

The quantity p(X|Y = c, θ^t) can be straightforwardly estimated using the pseudo-labeled set if one can make use of θ^t to produce a manifold that projects the data X into a space that is approximately normally distributed. Define F(X, θ^t) as an output feature vector of a neural network taking X as input, and using trained weights θ^t. One can therefore estimate p(X|θ^t) as follows,

p(X|\theta^t) = \sum_{c=1}^{C} \frac{N_{Y=c}}{N} \, p(F(X, \theta^t) \,|\, Y = c, \theta^t)    (21)

For simplicity, the Gaussian filtering technique for SCOPE uses the output predicted probabilities as the manifold F(X, θ^t), and furthermore assumes use of the diagonal joint multivariate normal distribution as the distribution of output features F(X, θ^t).

We initialize the mean \mu_c^t and the standard deviation \sigma_c^t for the probability distribution of F(X, θ^t) where Y = c,

p(F(X, \theta^t) \,|\, Y = c) = \frac{1}{\sqrt{2\pi}\,\sigma_c^t} \exp\left( -\frac{(F(X, \theta^t) - \mu_c^t)^2}{2\,(\sigma_c^t)^2} \right)    (22)

Algorithm 1: Gaussian Filter

/* W(·) denotes the weak augmentation; the Gaussian statistics are kept per predicted class c */
PseudoLabels = []
for c in 0...9 do
    µ^t_c = mean([ p(W(X^t_U)) for samples with predicted class c ])
    σ^t_c = std([ p(W(X^t_U)) for samples with predicted class c ])
    for iter in 0...3 do
        for i in range(len(p(W(X^t_U)))) do
            G(p(W(X^t_{U_i}))) = (1 / (sqrt(2π) · σ^t_c)) · exp( −(p(W(X^t_{U_i})) − µ^t_c)^2 / (2 · (σ^t_c)^2) )
        end
        µ^{t+1}_c = mean(G(p(W(X^t_U))))
        σ^{t+1}_c = std(G(p(W(X^t_U))))
    end
    for i in range(len(p(W(X^t_U)))) do
        if G(p(W(X^t_{U_i}))) ≥ mean(G(p(W(X^t_U)))) then
            PseudoLabels.append( argmax( p(W(X^t_{U_i})) ) )
        end
    end
end
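The following Python sketch is our simplified reading of the Gaussian filter above: for each class it fits a one-dimensional Gaussian to the weak-augmentation confidences and keeps samples whose density is at least the class-average density. It omits the iterative re-estimation step of Algorithm 1, and all names and defaults are illustrative.

```python
import numpy as np

def gaussian_filter(weak_probs, num_classes=10):
    """Keep pseudolabels whose weak-augmentation confidence is 'typical' for
    its class under a per-class 1-D Gaussian fit.

    weak_probs: (M, C) softmax outputs p(W(X_U)) under the previous model theta^t.
    Returns indices of accepted unlabeled samples and their pseudolabels.
    """
    pseudo = weak_probs.argmax(axis=1)
    conf = weak_probs.max(axis=1)
    keep_idx, keep_lab = [], []
    for c in range(num_classes):
        members = np.where(pseudo == c)[0]
        if members.size == 0:
            continue
        x = conf[members]
        mu, sigma = x.mean(), x.std() + 1e-8
        dens = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)
        accepted = dens >= dens.mean()          # inliers under the fitted density
        keep_idx.extend(members[accepted].tolist())
        keep_lab.extend([c] * int(accepted.sum()))
    return np.array(keep_idx), np.array(keep_lab)
```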

4.3. Contrastive Learning

Contrastive learning can be understood as learning by comparing multiple input samples that are in some sense similar to each other (Le-Khac et al., 2020). Discriminative models use contrastive learning as an approach to group similar samples closer together and push different samples away from each other (Jaiswal et al., 2020). The distance between these samples is often measured with a distance metric which helps to evaluate whether a sample lies close to or far from another sample. The discriminator is trained in a way that it learns whether an original sample and the augmented version of that sample lie close to each other in a feature space, which is further used to fine tune the model. In the general form of contrastive loss, the similarity metric is usually cosine similarity, where we want similar vectors to have a cosine similarity close to 1.

The manifold assumption states that a high-dimensional datapoint can be represented in a low-dimensional feature space which could be used to learn special feature representations that could not be learned from a high-dimensional feature space alone. Now, to pseudo-label the unlabeled samples, cosine similarity is used as a contrastive loss metric to evaluate the angular distance between the low-dimensional feature space of the unlabeled sample with a pseudo label of class c and the low-dimensional feature space of the labeled sample with a supervised label of class c.

Cosine similarity works best for this approach because it compares the low-dimensional representations by angular distance, with 1 being perfectly similar and -1 being completely different.

d(F(X_U, \theta^t), F(X_S, \theta^t)) = \| F(X_S, \theta^t) \| \, \| F(X_U, \theta^t) \| \cos\theta    (23)

where \cos\theta is represented using the dot product and magnitude as

\cos\theta = \frac{ F(X_U, \theta^t) \cdot F(X_S, \theta^t) }{ \sqrt{ F(X_U, \theta^t)^2 } \, \sqrt{ F(X_S, \theta^t)^2 } }    (24)
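A small illustrative snippet computing the cosine term of equation 24 for a pair of feature vectors; the tensor names are placeholders.

```python
import torch
import torch.nn.functional as F

def cosine_similarity(feat_u, feat_s):
    """Cosine of the angle between F(X_U, theta^t) and F(X_S, theta^t), eq. (24):
    1 means the feature vectors point in the same direction, -1 the opposite."""
    return F.cosine_similarity(feat_u.unsqueeze(0), feat_s.unsqueeze(0)).item()
```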

4.4. Contrastive Nearest Neighbor Outlier Removal

Another way of estimating whether p(X|θ^t) > τ is to make use of a contrastive learning approach based on k-nearest neighbors. Intuitively, if a sample X is nearby other supervised samples in X_S used for fitting the model θ^t, then one might determine that p(X|θ^t) > τ. Conversely, if a sample X is far from all known supervised samples in X_S, one may conclude that p(X|θ^t) < τ. This intuition has a theoretical basis, because the K-Nearest Neighbor algorithm is the optimal classifier assuming a variable-width balloon kernel density estimator as follows,

p(X|\theta^t) = \frac{1}{n h^D} \sum_{i=1}^{n} K\left( \frac{ F(X, \theta^t) - F(X_i, \theta^t) }{ h } \right)

\text{where } h = \frac{k}{ (n \, p(X|\theta^t))^{1/D} }

\text{where } K(z) = \frac{1}{2\pi} \exp\left( -\tfrac{1}{2} z^2 \right)    (25)

It can be shown that under these assumptions, p(X|θ^t) increases monotonically as the sum of euclidean distances to the k nearest samples d(X, X_S, k) decreases, as follows,

\forall \, \tau \; \exists \, \gamma \;\; \text{s.t.} \;\; p(X|\theta^t) > \tau \;\; \text{iff} \;\; d(X, X_S, k) < \gamma    (26)

The nearest neighbor contrastive outlier removal step for the SCOPE algorithm is based on this strategy with two minor adjustments. First, rather than euclidean distance between sample feature spaces F(X_i, θ^t) and F(X_j, θ^t), cosine similarity is used. This is because cosine similarity is a standard distance metric for use with contrastive learning techniques that compare individual sample feature vectors. Secondly, rather than comparing distance to all samples within the supervised set, this comparison is made only for the samples within the predicted pseudo-label category.

The Contrastive Nearest Neighbor Filtering algorithm is described in Algorithm 2. Let X_{S_i} and F(X_{S_i}, θ^t) be the labeled image and the corresponding manifold. Similarly, let X and F(X, θ^t) be the unlabeled image of interest and the corresponding manifold which belongs to class c, where c = 1...10. If the unlabeled manifold F(X, θ^t) has a cosine similarity score above γ with at least k labeled samples with the same pseudo label of the corresponding class, we add X to the list of probable labeled candidates.

Algorithm 2: Contrastive Nearest Neighbor Filter

/* dictionary with keys for each class */
Dict = {0: [], 1: [], ..., 9: []}
for c in 0...9 do
    /* c is the class from 0 to 9 */
    L = { X_i : X_i ∈ X^t_L and C_i = c }
    U = { X_i : X_i ∈ X^t_U and C_i = c }
    Dict2 = { U_1: 0, ..., U_n: 0 }
    for i in range(len(L)) do
        for j in range(len(U)) do
            if CosineSimilarity(L[i], U[j]) > γ then
                Dict2[U[j]] += 1
            end
        end
    end
    for each U_j in Dict2 do
        if Dict2[U_j] ≥ k then
            Dict[c].append(U_j)
        end
    end
end
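The following Python sketch is our vectorized reading of the contrastive nearest neighbor filter: for each pseudo-labeled sample it counts labeled samples of the same class whose cosine similarity exceeds γ, and keeps the sample if the count reaches k. The default γ value is illustrative; k = 6 matches the ablation in Section 5.

```python
import numpy as np

def contrastive_knn_filter(labeled_feats, labeled_y, unlabeled_feats,
                           pseudo_y, gamma=0.9, k=6, num_classes=10):
    """Return a boolean inlier mask over the unlabeled set, i.e. the binary
    term p_bar(X_U | theta^t). Feature matrices hold F(., theta^t) row-wise."""
    def normalize(m):
        return m / (np.linalg.norm(m, axis=1, keepdims=True) + 1e-8)

    lab = normalize(labeled_feats)
    unl = normalize(unlabeled_feats)
    keep = np.zeros(len(unl), dtype=bool)
    for c in range(num_classes):
        lab_c = lab[labeled_y == c]
        unl_idx = np.where(pseudo_y == c)[0]
        if lab_c.size == 0 or unl_idx.size == 0:
            continue
        sims = unl[unl_idx] @ lab_c.T            # pairwise cosine similarities
        counts = (sims > gamma).sum(axis=1)      # confident same-class neighbors
        keep[unl_idx] = counts >= k
    return keep
```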

5. Experimental Setup, Results and Ablation Study

SCOPE was evaluated according to the test accuracy on the CIFAR-10 dataset to determine the extent to which our algorithm can enhance semi-supervised classifiers to correctly classify the test data. The algorithm was also compared against several state-of-the-art semi-supervised learning algorithms. CIFAR-10 is a widely used dataset, and was used as the testing benchmark for multiple state-of-the-art semi-supervised classification models. This dataset consists of 50000 images and their corresponding labels for training and 10000 images and labels for testing. The dataset consists of 10 classes, which makes it challenging for classifiers. Like other models, the proposed algorithm uses Wide ResNet as the classifier. This paper evaluates the test results on 2 particular splits of labeled and unlabeled data. The first experiment consists of a training sample size of 250 labeled images and 49750 unlabeled images, whereas the second experiment consists of a training sample size of 4000 labeled images and 46000 unlabeled images. In the first experiment,


Table 1. Accuracy Results

Method            Accuracy (250 labels)
π-Model           45.74% (44.76, 46.72)
Pseudo-Labeling   50.22% (49.24, 51.20)
Mean Teacher      67.68% (66.75, 68.60)
Mix-Match         88.95% (88.32, 89.56)
Fix-Match         94.93% (94.48, 95.35)
SCOPE             95.52% (95.10, 95.92)

Table 2. Accuracy Results

Method            Accuracy (4000 labels)
π-Model           85.99% (85.29, 86.66)
Pseudo-Labeling   83.91% (83.17, 83.63)
Mean Teacher      90.81% (90.23, 91.37)
Mix-Match         93.58% (93.08, 94.05)
Fix-Match         95.69% (95.27, 96.08)
SCOPE             95.82% (95.41, 96.20)

with 250 labels, the π-Model yields an accuracy of 45.74% and the pseudo-labeling paper reports an accuracy of 50.22%. Mean Teacher, which is one of the more advanced techniques and uses exponential moving averages of the weights, yields an accuracy of 67.67%. MixMatch, which uses k augmentations, averages the model predictions over all of the augmentations, and applies temperature sharpening, reports an accuracy of 88.95%. The FixMatch algorithm, which had reported better results than all of the above approaches and uses consistency regularization and an exponential moving average of the weights, reported an accuracy of 94.93%. The proposed algorithm in this paper, which uses consistency regularization along with contrastive latent bootstrapping using outlier removal, yields 95.46% accuracy with 250 labeled samples.

In the second experiment, with 4000 labeled images and 46000 unlabeled images, the π-Model yields an accuracy of 85.99% and the pseudo-labeling paper reports an accuracy of 83.91%. The Mean Teacher algorithm achieves an accuracy of 90.81%, which improves upon both of these approaches. The FixMatch approach has reported an accuracy of 95.69%. The semi-supervised contrastive latent bootstrapping along with consistency regularization achieves better results with 4000 labeled samples, with an accuracy of 95.82%.

Figure 3 shows the confounding error rate of the proposed model while using various numbers of neighbors k. For the purposes of measurement, we define a confounding error as an incorrect pseudolabel that is added to the labeled set for subsequent EM training iterations. This figure helps to identify the number of incorrectly predicted samples added to the labeled branch of the model, which causes the model to diverge further upon retraining. We see that as the number of neighbors used to evaluate the class of the model increases, the confounding error rate of the unlabeled samples being added to the labeled branch of the training data decreases. As shown in the above diagram,

Table 3. Confounding Error Rate

K Neighbors   Confounding Error Rate
k = 1         2.3%
k = 2         2.13%
k = 3         1.28%
k = 4         1.2%
k = 5         0.98%
k = 6         0.97%

Table 4. SCOPE Accuracy Results

Method                                      Accuracy (250 labels)   Accuracy (4000 labels)
SCOPE with only Gaussian filter             95.28%                  95.698%
SCOPE with only K-Nearest Neighbor filter   95.39%                  95.71%

when k = 1, the confounding error being added to the model is at its maximum. When k = 2, the error rate drops as compared to when k = 1. Similarly, the error rate for k = 3 is lower than when k = 2 and greater than when k = 4. We use the value k = 6, as the improvement of the confounding error rate appears to have leveled off at this point.

An additional comparison was performed to observe the performance of SCOPE using only the Gaussian filter versus SCOPE using only the K-Nearest Neighbor filter. SCOPE with only the Gaussian filter yields an accuracy of 95.28% with only 250 labeled samples and an accuracy of 95.698% with 4000 labeled samples. SCOPE with only the K-Nearest Neighbor filter yields an accuracy of 95.39% with only 250 labeled samples and 95.71% with 4000 labeled samples. This experiment helps to demonstrate the effectiveness of combining both the Gaussian filter and the K-Nearest Neighbor filter with pseudo-labeling using outlier removal and bootstrapping, as the full method performs better than using only one of the filters by itself.

6. Conclusion

Semi-supervised deep learning is a rapidly evolving field, and recent papers have identified the problem of confounding errors as a major limitation of deep pseudolabeling techniques. In this paper, we present a novel SCOPE meta-learning algorithm which greatly reduces the prevalence of confounding errors through the introduction of a binary outlier removal term. This term integrates knowledge of the probability of the unlabeled samples occurring given the model parameters, which is required by pure EM, but typically neglected or overlooked in pseudolabeling techniques. Furthermore, we realize this term through two novel filtering strategies: the Gaussian filter and the contrastive nearest neighbor filter. SCOPE achieves state-of-the-art performance on the semi-supervised CIFAR-10 dataset with 250 and 4000 labeled samples respectively. Furthermore, the confounding error rate of SCOPE is greatly reduced from 2.3% to below


Fig. 3. Confounding error rate per epoch.

1% when k = 6 contrastive nearest neighbors are incorporated. These results signify that outlier suppression techniques are a promising approach to reduce confounding and improve the ability of deep pseudolabeling techniques to generalize with very limited labeled training volumes.

Acknowledgments

We would like to thank Michael Majurski and Yelena Yesha for their contributions towards this research. This research was funded in part by NSF award #1747724 IUCRC Center for Advanced Real Time Analytics (CARTA).

References

Arazo, E., Ortego, D., Albert, P., O’Connor, N.E., McGuinness, K., 2020. Pseudo-labeling and confirmation bias in deep semi-supervised learning, in: 2020 International Joint Conference on Neural Networks (IJCNN), IEEE. pp. 1–8.

Berthelot, D., Carlini, N., Cubuk, E.D., Kurakin, A., Sohn, K., Zhang, H., Raffel, C., 2019a. Remixmatch: Semi-supervised learning with distribution alignment and augmentation anchoring. arXiv preprint arXiv:1911.09785.

Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N., Oliver, A., Raffel, C.A., 2019b. Mixmatch: A holistic approach to semi-supervised learning. Advances in Neural Information Processing Systems 32.

Blum, A., Mitchell, T., 1998. Combining labeled and unlabeled data with co-training, in: Proceedings of the eleventh annual conference on Computational learning theory, pp. 92–100.

Cubuk, E.D., Zoph, B., Shlens, J., Le, Q.V., 2020. Randaugment: Practical automated data augmentation with a reduced search space, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.

Dempster, A.P., Laird, N.M., Rubin, D.B., 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological) 39, 1–22.

Devries, T., Taylor, G.W., 2017. Improved regularization of convolutional neural networks with cutout. CoRR abs/1708.04552. URL: http://arxiv.org/abs/1708.04552, arXiv:1708.04552.

Hadsell, R., Chopra, S., LeCun, Y., 2006. Dimensionality reduction by learning an invariant mapping, in: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), IEEE. pp. 1735–1742.

He, K., Fan, H., Wu, Y., Xie, S., Girshick, R., 2020. Momentum contrast for unsupervised visual representation learning, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 9729–9738.

Henaff, O., 2020. Data-efficient image recognition with contrastive predictive coding, in: International Conference on Machine Learning, PMLR. pp. 4182–4192.

Hjelm, R.D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., Bengio, Y., 2018. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670.

Jaiswal, A., Babu, A.R., Zadeh, M.Z., Banerjee, D., Makedon, F., 2020. A survey on contrastive self-supervised learning. Technologies 9, 2.

Jeong, J., Lee, S., Kim, J., Kwak, N., 2019. Consistency-based semi-supervised learning for object detection. Advances in neural information processing systems 32.

Le-Khac, P.H., Healy, G., Smeaton, A.F., 2020. Contrastive representation learning: A framework and review. IEEE Access 8, 193907–193934. doi:10.1109/ACCESS.2020.3031549.

Li, J., Zhou, P., Xiong, C., Hoi, S.C., 2020. Prototypical contrastive learning of unsupervised representations. arXiv preprint arXiv:2005.04966.

Livieris, I.E., Drakopoulou, K., Tampakas, V.T., Mikropoulos, T.A., Pintelas, P., 2019. Predicting secondary school students’ performance utilizing a semi-supervised learning approach. Journal of educational computing research 57, 448–470.

McClosky, D., Charniak, E., Johnson, M., 2006. Reranking and self-training for parser adaptation, in: Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pp. 337–344.

Menon, S., Chapman, D., Nguyen, P., Yesha, Y., Morris, M., Saboury, B., 2020. Deep expectation-maximization for semi-supervised lung cancer screening. arXiv preprint arXiv:2010.01173.

Misra, I., Maaten, L.v.d., 2020. Self-supervised learning of pretext-invariant representations, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6707–6717.

Miyato, T., Maeda, S.i., Koyama, M., Ishii, S., 2018. Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE transactions on pattern analysis and machine intelligence 41, 1979–1993.

Mustafa, A., Mantiuk, R.K., 2020. Transformation consistency regularization–a semi-supervised paradigm for image-to-image translation, in: European Conference on Computer Vision, Springer. pp. 599–615.

Nguyen, P., Chapman, D., Menon, S., Morris, M., Yesha, Y., 2020. Active semi-supervised expectation maximization learning for lung cancer detection from computerized tomography (CT) images with minimally label training data, in: Medical Imaging 2020: Computer-Aided Diagnosis, International Society for Optics and Photonics. p. 113142E.

Olivier, C., Bernhard, S., Alexander, Z., 2006. Semi-supervised learning, in: IEEE Transactions on Neural Networks. volume 20, pp. 542–542.

Oord, A.v.d., Li, Y., Vinyals, O., 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.

Prakash, V.J., Nithya, D.L., 2014. A survey on semi-supervised learning techniques. arXiv preprint arXiv:1402.4645.

Rosenberg, C., Hebert, M., Schneiderman, H., 2005. Semi-supervised self-training of object detection models.

Sajjadi, M., Javanmardi, M., Tasdizen, T., 2016. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. Advances in neural information processing systems 29.

Sohn, K., Berthelot, D., Carlini, N., Zhang, Z., Zhang, H., Raffel, C.A., Cubuk, E.D., Kurakin, A., Li, C.L., 2020. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. Advances in Neural Information Processing Systems 33, 596–608.

Tarvainen, A., Valpola, H., 2017. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Advances in neural information processing systems 30.

Tian, Y., Krishnan, D., Isola, P., 2020. Contrastive multiview coding, in: European conference on computer vision, Springer. pp. 776–794.

Verma, V., Kawaguchi, K., Lamb, A., Kannala, J., Solin, A., Bengio, Y., Lopez-Paz, D., 2022. Interpolation consistency training for semi-supervised learning. Neural Networks 145, 90–106.

Wang, X., Gupta, A., 2015. Unsupervised learning of visual representations using videos, in: Proceedings of the IEEE international conference on computer vision, pp. 2794–2802.

Wu, Z., Xiong, Y., Yu, S.X., Lin, D., 2018. Unsupervised feature learning via non-parametric instance discrimination, in: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3733–3742.

Xia, Y., Liu, F., Yang, D., Cai, J., Yu, L., Zhu, Z., Xu, D., Yuille, A., Roth, H., 2020. 3D semi-supervised learning with uncertainty-aware multi-view co-training, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 3646–3655.

Yalniz, I.Z., Jégou, H., Chen, K., Paluri, M., Mahajan, D., 2019. Billion-scale semi-supervised learning for image classification. arXiv preprint arXiv:1905.00546.

Yarowsky, D., 1995. Unsupervised word sense disambiguation rivaling supervised methods, in: 33rd annual meeting of the association for computational linguistics, pp. 189–196.

Zhai, X., Oliver, A., Kolesnikov, A., Beyer, L., 2019. S4l: Self-supervised semi-supervised learning, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 1476–1485.