Unsupervised Learning for Program Repair - Break-It-Fix-It

Break-It-Fix-It Unsupervised Learning for Program Repair

Michihiro Yasunaga 1 Percy Liang 1

AbstractWe consider repair tasks given a critic (egcompiler) that assesses the quality of an input thegoal is to train a fixer that converts a bad example(eg code with syntax errors) into a good one (egcode with no syntax errors) Existing works createtraining data consisting of (bad good) pairs bycorrupting good examples using heuristics (egdropping tokens) However fixers trained on thissynthetically-generated data do not extrapolatewell to the real distribution of bad inputs To bridgethis gap we propose a new training approachBreak-It-Fix-It (BIFI) which has two key ideas(i) we use the critic to check a fixerrsquos output onreal bad inputs and add good (fixed) outputs tothe training data and (ii) we train a breaker togenerate realistic bad code from good code Basedon these ideas we iteratively update the breakerand the fixer while using them in conjunction togenerate more paired data We evaluate BIFI ontwo code repair datasets GitHub-Python a newdataset we introduce where the goal is to repairPython code with AST parse errors and DeepFixwhere the goal is to repair C code with compilererrors BIFI outperforms state-of-the-art methodsobtaining 905 repair accuracy on GitHub-Python (+285) and 717 on DeepFix (+56)Notably BIFI does not require any labeled datawe hope it will be a strong starting point forunsupervised learning of various repair tasks

1 IntroductionIn many domains one has access to a critic that assesses thequality of an input but what is desired is a more constructivefixer that actually improves bad inputs For instance inprogramming while a code analyzer and compiler can tellif a given code has errors programmers need to repair theerrors in the bad code Development of an automatic codefixer is thus a key to enhancing programming productivity(Seo et al 2014) and is an active area of research (Mesbahet al 2019 Ding et al 2020 Dinella et al 2020) Other

1Stanford University Stanford CA Correspondence toMichihiro Yasunaga ltmyasucsstanfordedugt

Proceedings of the 38 th International Conference on MachineLearning PMLR 139 2021 Copyright 2021 by the author(s)

Task Fixer

x = analysisget(ip None)if not x print (PANIC No format(ipstrip())

def iterFiles(path) return (name for name in listdir(path) if isfile(join(pathname)))

x = analysisget(ip None)if not x print (PANIC No format(ipstrip()))

I

I

I

Critic (eg code analyzer compiler)

Unlabeled data

Bad examples Good examples

Error

No error mdash fixed

No error

x = analysisget(ip None) if not x print (PANIC No format(ipstrip()) Bad (Error)

Good (No error)

x = analysisget(ip None) if not x print (PANIC No format(ipstrip())

Bad example

x = analysisget(ip None) if not x print (PANIC No format(ipstrip()))


Good example

eg code analyzer compiler

Error


Unlabeled Data

Example

x = analysisget(ip None) if not x print (PANIC No format(ipstrip()) Bad (Error)

Good (No error)


Bad example

x = analysisget(ip None) if not x print (PANIC No format(ipstrip()))


Good example

eg code analyzer compiler

Error


Unlabeled Data

Example

Figure 1 Problem setup We are given an unlabeled data (codesnippets) and a critic (eg code analyzer compiler) that assessesthe quality of an input (eg bad if the code has errors good if noerror) Our task is to learn a fixer that can actually repair a badexample into a good one (eg fixing errors in the bad code)

instances of this general setting include molecular design (Jinet al 2019) which aims to improve the chemical properties(eg drug-likeness) of molecules given a property evaluatorand essay editing (Taghipour amp Ng 2016) which aims toimprove a writing given a grade How to automaticallylearn a fixer given a critic (we term critic2fixer) remains animportant research problem in machine learning

In this work we focus on the domain of code repair Learninga fixer is challenging because manual labeling of paired dataeg 〈broken code fixed code〉 is costly To this end we con-sider learning from unlabeled data Specifically as illustratedin Figure 1 we are given (a) a critic (code analyzer or com-piler) that assesses the quality of an inputmdashbad if the code haserrors good if it has no errorsmdash and (b) unlabeled datamdashunpaired set of good code and bad code eg from GitHubOur goal is to learn a fixer that repairs bad code into goodcode Previous works in code repair apply random or heuris-tic perturbations to good code (eg dropping tokens) andprepare synthetic paired data 〈perturbed code good code〉 totrain a fixer (Pu et al 2016 Gupta et al 2017 Ahmed et al2018 Hajipour et al 2019 Yasunaga amp Liang 2020) How-ever such synthetically generated bad examples do not match

arX

iv2

106

0660

0v2

[cs

LG

] 2

2 Ju

n 20

21


def validate(type) ltINgtif type not in types ltINgtmsg = (invalid type ltINgtmust in s types) a

raise Exception(msg)ltDEgtltDEgt

else pass

Real def validate(type) if type not in types ltINgtmsg = (invalid type ltINgtmust in s types)ltDEgt

raise Exception(msg)ltDEgt

else pass

Synthetic def validate(type)ltINgtif type not in types ltINgtmsg = (invalid type ltINgtmust in s types)ltDEgt


else pass

x = analysisget(ip None)if not x print (PANIC No format(ipstrip()) i

Real x = analysisget ip None)if not x print (PANIC No format(ipstrip()))i

Synthetic x = analysisget(ip None)if not x print (PANIC No format(ipstrip()))i

(Synthetic Errors)(Human Errors)

Distribution Mismatch


def validate(type) ltINgtif type not in types ltINgtmsg = (invalid type must in s types)a

ltINgtraise Exception(msg)ltDEgtltDEgt

else pass

Real def validate(type) if type not in types ltINgtmsg = (invalid type must in s types) raise Exception(msg)ltDEgt

else pass

Synthetic def validate(type)ltINgtif type not in types ltINgtmsg = (invalid type must in s types) raise Exception(msg)ltDEgt

else pass







def validate(type) ltINgtif type not in types ltINgtmsg = (invalid type not in s types)a


else pass

Real def validate(type) ltINgtif type not in types msg = (invalid type not in s types) raise Exception(msg)ltDEgt

else pass

Synthetic def validate(type)ltINgtif type not in types ltINgtmsg = (invalid type not in s types) raise Exception(msg)ltDEgt

else pass

print (PANIC No format(ipstrip()) i

Real

print (PANIC No format ipstrip()))

Synthetic

print (PANIC No format(ipstrip()))i






else pass


else pass


else pass


Real


Synthetic


(Synthetic Error)(Human Error)



def validate(type) ltIgtif type not in types ltIgtmsg = (invalid type not in s types)a

ltIgtraise Exception(msg)ltDgtltDgt else pass

Real def validate(type) ltIgtif type not in types msg = (invalid type not in s types) raise Exception(msg)ltDgt else pass

Synthetic def validate(type)ltIgtif type not in types ltIgtmsg = (invalid type not in s types) raise Exception(msg)ltDgt else pass


Real


Synthetic





Figure 2 Challenge of learning a fixer from unlabeled data Ex-isting works randomly perturb good examples into bad examplesand learn to recover However such synthetically generated badexamples do not match the distribution of real bad examples Forinstance synthetic perturbations may drop parentheses arbitrar-ily from code (top right) but real human-written bad code missesparentheses more often in a nested context (top center) In a more ex-treme case at bottom to generate the human error (center) from thecorrected code (left) a pair of indent and dedent tokens need to be in-serted accordingly which random perturbations generate with verysmall probability Note that indentation is meaningful in Python inthe tokenized Python code each 〈I〉 token means indenting the lineby one unit each 〈D〉 means dedenting the next line by one unit

the distribution of real bad examples For instance as shownin Figure 2 synthetic perturbations may drop parenthesesarbitrarily from code generating errors that rarely happen inreal programming (Figure 2 top right synthetic errors) incontrast real human-written code misses parentheses moreoften in a nested context (Figure 2 top center human errors)As we will show in sect4 this distribution mismatch betweensynthetic data and real data results in low performance

To bridge this gap we propose Break-It-Fix-It (BIFI) a newmethod to learn a fixer from unlabeled data and a critic (Fig-ure 3) BIFI is based on two core insights (i) we can use thecritic to check a fixerrsquos output on real bad examples and addgood outputs to the training data and (ii) we train a breaker togenerate realistic bad examples from good examples Specif-ically given an initial fixer trained on synthetic paired data〈synthetic bad good〉 BIFI improves the fixer and breaker si-multaneously through rounds of data generation and training(1) apply the fixer to real bad examples and keep fixed outputsto obtain real paired data (2) use the resulting data to train thebreaker (3) use the learned breaker to generate code errorsand obtain more paired data and (4) train the fixer on thenewly-generated paired data in (1) and (3) Intuitively thiscycle trains the fixer on increasingly more real or realisticallygenerated bad code adapting the fixer from the initial syn-thetic distributions towards real distributions of code errors

The BIFI algorithm is related to backtranslation in unsuper-vised machine translation (Lample et al 2018a) which usesa target-to-source model to generate noisy sources and trains

a source-to-target model to reconstruct the targets (eg thebad-side and good-side in our repair task can be viewed asthe source and target) BIFI differs from backtranslation intwo ways it uses the critic to verify if the generated examplesare actually fixed or broken (step 1 and 3) and it trains thefixer on real bad examples in addition to examples generatedby the breaker (step 4) which improves the correctness anddistributional match of generated paired data

We evaluate our proposed approach on two code repairdatasets

bull GitHub-Python We collected a new dataset of 3MPython code snippets from githubcom The task is torepair errors caught by the Python AST parser We setthe initial fixer to be an encoder-decoder Transformer(Vaswani et al 2017) trained on random perturbationsbull DeepFix (Gupta et al 2017) The task is to repair

compiler errors in C code submitted by students in anintroductory programming course We set the initial fixerto be the existing best system DrRepair (Yasunaga ampLiang 2020) which was trained on manually-designedheuristic perturbations

Our approach (BIFI) outperforms the initial fixers obtaining905 repair accuracy on GitHub-Python (+285 absolute)and 717 repair accuracy on DeepFix (+56 absolute)attaining a new state-of-the-art BIFI also improves onbacktranslation by 10 Further we qualitatively showhow the fixer and breaker adapt towards more realisticdistributions of code through the BIFI algorithm (sect43)

2 Problem statementFigure 1 illustrates our problem setup critic2fixer Thesystem is given unlabeled dataD (code snippets) and a criticc (code analyzer or compiler) that returns whether an inputis good or bad eg for a code snippet xisinD

c(x)=

0 if x has errors1 if x has no error (1)

Using the critic c examples inD can be classified into badonesDbad=x |xisinD c(x)=0 and good onesDgood=y |yisinD c(y)=1 Our task is to learn a fixer f that maps a badexample xisinDbad into a good example f(x) such that it isclose1 tox and c(f(x))=1 The evaluation metric is the fixerf rsquos repair accuracy on a held-out set of bad examplesD(test)

bad

RepairAcc=|x |xisinD(test)

bad c(f(x))=1||D(test)

bad | (2)

3 ApproachThe major challenge of learning a fixer is that we need to learnfrom unpaired data ieDbad andDgood do not form 〈broken

1We constrain the edit distance as described in sect41 We acknowl-edge that while we want f(x) to be semantics-preserving it is non-trivial to ensure this automatically so we rely on the edit distance


def iterFiles path) return (name for name in listdir(path) if isfile(join(pathname)))

Synthetic bad codedef iterFiles(path) return (name for name in listdir(path) if isfile(join(pathname)))

Good codeSynthetic

Perturbation(1)

Train b0

Train f0

def compute_loss(self) loss = selfcost(selfpreds selftarget selflength)sum()

Bad codedef compute_loss(self) loss = selfcost(selfpreds selftarget selflength)sum()

Fixed codeApply fk-1 Keep if fixed

(1)

Finetune bk

def iterFiles(path) return (name for name in listdir(path) if isfile(join(pathname)) i

Broken codedef iterFiles(path) return (name for name in listdir(path) if isfile(join(pathname)))

Good codeApply bk Keep if broken

(3)

Finetune fk


Bad codedef compute_loss(self) loss = selfcost(selfpreds= selftarget selflength)sum()

Fixed codeApply fk-1(1)

Finetune bk

def iterFile(path) return (name for name in listdir(path) if isfile(join(pathname)))i


Good codeApply bk (3)

Finetune fk




(1)

Finetune fk

sect 33 FixerOnly mdash Round k (=12 hellip)

sect 32 Break-It-Fix-It (BIFI) mdash Round k (=12 hellip)

sect 31 Initialization mdash Round 0

ref Backtranslation mdash Round k (=12 hellip)

(2)

(2)

(4)

(2)

(2)

(4)



Good codeSynthetic

Perturbation(1)

Train b0

Train f0




(1)

Train bk




(3)

Train fk




Train bk




Train fk




(1)

Train fk




sect 33 ref Backtranslation mdash Round k (=12 hellip)

(2)

(2)

(4)

(2)

(2)

(4)

Figure 3 Overview of our approach We train the initial fixer on synthetically prepared data as in prior works (top left) In our approachBIFI (bottom right) we apply the initial fixer to the real bad examples and add fixed outputs (verified by the critic) to our training data(Step 1) train a breaker on the resulting paired data (Step 2) use the breaker to generate (intuitively more realistic) code errors (Step 3) andtrain the fixer again on the newly-generated paired data (Step 4) We iterate this cycle improving the fixer and the breaker simultaneouslyTop right a version of BIFI without the breaker (FixerOnly) Bottom left comparison of BIFI to backtranslation The main differenceis that BIFI uses the critic to verify that the fixer produces good code and the breaker produces bad code (annotated with magenta font)

fixed〉 pairs Prior works in code repair apply random orheuristic perturbations to good examples (eg dropping to-kens) and prepare a synthetic paired data 〈perturbed codegood code〉 to train a fixer (Gupta et al 2017 Ahmed et al2018 Yasunaga amp Liang 2020) However such syntheti-cally generated bad examples do not match the distribution ofreal bad examples For instance as Figure 2 (top) shows syn-thetic perturbations may drop parentheses arbitrarily fromcode generating errors that are rare in real programs incontrast real human-written code misses parentheses oftenin a nested context (eg 10x more than non-nested in ourcollected dataset GitHub-Python) In a more extreme case(Figure 2 bottom) to make the real human error (center) fromthe corrected code (left) multiple tokens (in this case a pairof indent and dedent) need to be inserted or dropped accord-ingly which random perturbations would generate with ex-tremely low probability This distribution mismatch betweensynthetic data and real data results in low performance (sect4)

To address this challenge we propose Break-It-Fix-It (BIFI)an approach that adapts the fixer automatically towards realdistributions of bad examples Concretely we first startfrom the synthetic paired data 〈synthetic bad good〉 andtrain an initial fixer as in prior works (see Figure 3 top leftinitialization) We then perform the following cycle (seeFigure 3 bottom right) (1) we apply the initial fixer to the realbad examples and use the critic to assess if the fixerrsquos output isgoodmdashif good we keep the pair (2) we train a breaker on the

resulting paired datamdashas this data consists of real code errorsintuitively the breaker learns to generate realistic code errors(3) we apply the breaker to the good examples (4) we finallytrain the fixer on the newly-generated paired data in (1) and(3) We iterate this cycle to improve the fixer and the breakersimultaneously in the process The intuition is that a betterfixer and breaker will be able to generate more realistic paireddata which in turn helps to train a better fixer and breaker

Below we describe the initialization step in sect31 our mainalgorithm BIFI in sect32 and discuss two baselines of BIFIbacktranslation (sect33) and FixerOnly (version of BIFIwithout the breaker sect34)

31 Initialization

Given unlabeled data D= (DbadDgood) we first prepare asynthetic paired dataPsynthetic by perturbing good examples

Psynthetic=(bsynthetic(y) y) |yisinDgood (3)

where bsynthetic denotes a pre-defined procedure that corruptscode For instance we will experiment with two choices ofbsynthetic (i) random noising which randomly dropsinsertsreplaces tokens in code and (ii) heuristic noising designedin Yasunaga amp Liang (2020) which aims to generatecommon programming errors such as typo punctuation andtype errors More details are described in sect41

We then train the initial fixer f0 and breaker b0 on the


synthetic paired data

b0=TRAINgoodrarrbad(Psynthetic) (4)

f0=TRAINbadrarrgood(Psynthetic) (5)

where TRAINgoodrarrbad(P) denotes training an encoder-decoder model that maps good-side examples to bad-sideexamples in a paired data P and TRAINbadrarrgood(P) doesthe reverse Note that f0 here corresponds to the fixer learnedin prior works (eg Gupta et al (2017) Yasunaga amp Liang(2020)) We call this initialization step our round 0

32 Break-It-Fix-It (BIFI)

BIFI aims to improve the fixer and breaker simultaneouslythrough rounds of data generation and training (1) usea fixer to create data for a breaker (2) train a breaker (3)use a breaker to create data for a fixer and (4) train a fixerConcretely after the initialization step BIFI performs thefollowing in each round k=12K

P(f)k =(x fkminus1(x)) |xisinDbad c(fkminus1(x))=1 (6)

bk=TRAINgoodrarrbad(P(f)k ) (7)

P(b)k =(bk(y) y) |yisinDgood c(bk(y))=0 (8)

fk=TRAINbadrarrgood(P(f)k cupP(b)

k ) (9)

For convenience we call the original examples inDbad realbad examples Here Eq 6 applies the current fixer fkminus1 to thereal bad examples inDbad and keeps outputs that are actuallyfixed (verified by the critic c red part) This way we can ob-tain new paired dataP(f)

k that is based on real bad examplesEq 7 then trains the breaker bk (fine-tunes from the previousbreaker bkminus1) on this new paired dataP(f)

k so that intuitivelyit can learn to generate realistic bad examples Next Eq 8 ap-plies the breaker bk to the good examples inDgood and keepsoutputs that are actually broken (verified by the critic c redpart) This provides an extra paired dataP(b)

k that is based onbad examples generated by the learned breaker Finally Eq9 trains the fixer fk (fine-tunes from the previous fixer fkminus1)on bothP(f)

k andP(b)k so that the fixer sees real and breaker-

generated bad examples Over time this cycle adapts thefixer and breaker towards the distribution of real examplesFigure 3 (bottom right) provides an illustration of BIFI

33 Comparison with Backtranslation

The BIFI algorithm is related to backtranslation (Lampleet al 2018b) in unsupervised machine translation One mayview the bad-side and good-side in our setup as two sourcetar-get languages in machine translation Backtranslation uses atarget-to-source model to generate noisy sources and trains asource-to-target model to reconstruct the targets Specifically

in each round k backtranslation performs the following

P(f)k =(x fkminus1(x)) |xisinDbad (10)


P(b)k =(bk(y) y) |yisinDgood (12)

fk=TRAINbadrarrgood(P(b)k ) (13)

BIFI differs in two aspects First as our task has a critic BIFIuses the critic to verify the outputs of the fixer and breakerand only keep examples whose outputs are actually fixed(Eq 6 red part) and whose outputs are broken (Eq 8 redpart) to generate paired data In contrast backtranslationdoes not verify the generated paired data in Eq 10 12 Thismay blindly include non-fixed code as good-side examples(consequently the breaker might learn to output erroneouscode) and non-broken code as bad-side examples leadingto noisy training data (eg Figure 3 bottom left) Secondlywhile backtranslation trains the fixer only on examplespredicted by the breaker (Eq 13) BIFI trains the fixer on thereal bad examples (P(f)

k in Eq 9) in addition to examplesgenerated by the breaker which improves the correctnessand distributional match of training data We will show in ourablation study (sect433) that both of these two componentsimprove the learning of a fixer and breaker In essence BIFIis an augmentation of backtranslation with a critic

34 Version of BIFI without breaker FixerOnly

The benefit of BIFI is to enable training the fixer on real badexamples and bad examples generated by a learned breakerWe consider a version (FixerOnly) that trains the fixer simplyon the real bad examples (prepared in Eq 6) but not the badexamples generated by the breaker (Eq 8) SpecificallyFixerOnly does the following in each round k


fk=TRAINbadrarrgood(P(f)k ) (15)

FixerOnly can also be viewed as self-training (Lee 2013)with the difference that we only add fixer outputs verified bythe critic to the training data We will show in sect431 thatFixerOnly is especially useful when the amount of availablebad examples |Dbad| is big but the gain is smaller comparedto BIFI when |Dbad| is small because BIFI can use the breakerto generate additional paired data for training the fixer (Eq 8)

4 ExperimentsWe evaluate our approach on two code repair datasets acommon benchmark DeepFix2 (Gupta et al 2017) andGitHub-Python a bigger dataset we collect in this paper

41 Dataset and setup

We first describe the detail and experimental setup forGitHub-Python and DeepFix

2httpsbitbucketorgiiscsealdeepfix


MethodTest accuracy

Total UnbalancedParentheses

IndentationError

InvalidSyntax

Initial Round-0 620 877 394 705

FixerOnly Round-1 868 933 795 909Round-2 886 924 837 920

BIFI Round-1 880 941 813 916Round-2 905 942 859 935

Table 1 Repair accuracy on the GitHub-Python test set Theinitial fixer is trained on synthetic bad code Our proposed method(BIFI) enables the fixer to be fine-tuned on real bad code and badcode generated by the learned breaker The result shows that BIFIoutperforms the initial fixer by a large margin

Method Test accuracy

DeepFix (Gupta et al 2017) 334RLAssist (Gupta et al 2019) 266SampleFix (Hajipour et al 2019) 453DrRepair (Yasunaga amp Liang 2020) 661

Our Initial Round-0 (= DrRepair) 661

Our FixerOnly Round-1 686Round-2 705

Our BIFI Round-1 708Round-2 717

Table 2 Repair accuracy on the DeepFix test set We define ourinitial fixer (Round 0) to be the existing best system DrRepair Notethat DrRepairrsquos training procedure coincides with the initializationstep of our BIFI algorithm with heuristic perturbations used in Eq3 We then apply BIFI on top of it for Round 1 2 BIFI outperformsDrRepair achieving a new state-of-the-art

411 Github-Python

Dataset To obtain an unlabeled dataset of code we col-lected Python3 files from GitHub public repositories3 Wethen tokenize each code file using the builtin Python tok-enizer and keep code snippets of length 10ndash128 tokens re-sulting in 3M code snippets As the critic c we use the PythonAST parser4 which catches unbalanced parentheses inden-tation errors and other syntax errors Concretely we definec(x)=1 (good) if the AST parser returns no errors for inputcodex and c(x)=0 (bad) otherwise Using this critic we ob-tain 38K snippets of bad code and 3M snippets of good codeFrom the 38K bad examples we holdout 15K as the final testset and make the remaining 23K bad examples available forBIFI Our goal is to learn a fixer that repairs AST parse errorsWe define that the fixerrsquos repair is successful if the output codehas no AST parse errors and has Levenshtein edit-distance(Levenshtein 1966) less than 5 tokens from the input codeThe evaluation metric is the fixerrsquos repair accuracy on the testset ie the heldout 15K examples of real bad code

BIFI implementation details For the architecture of thefixer and breaker we use the encoder-decoder Transformer

3httpsgithubcom4httpsdocspythonorg3libraryasthtml

(Vaswani et al 2017) with 4 layers 8 attention heads and hid-den states of size 256 The model parameters are optimizedby Adam (Kingma amp Ba 2015) with batch size of 27000 to-kens learning rate 0001 and gradient clipping 10 (Pascanuet al 2013) on one GPU (GTX Titan X) For generation weuse beam search with beam size 10 and keep predictions withLevenshtein edit-distance less than 5 tokens from the input

To train the initial fixer f0 we use random perturbations forthe corruption procedure bsynthetic (Eq 3) which drops insertsor replaces 1ndash3 tokens in code with uniform distribution Weapply bsynthetic 8 times to each of the 26M good code snippetsto prepare the initial training dataPsynthetic We holdout 1of Psynthetic as our dev set which we use to perform earlystopping We then run the BIFI algorithm for K=2 rounds

412 DeepFix

Dataset DeepFix (Gupta et al 2017) contains C code sub-mitted by students in an introductory programming course ofwhich 37K snippets are good (no compiler error) and 7K arebad (have compiler errors) Each code snippet has 25 lines onaverage Within the 7K bad examples we take 20 as a held-out test set We make the remaining 80 available for BIFIThe goal is to learn a fixer that repairs compiler errors Repairis successful if the output code has no compiler errors Theevaluation metric is the fixerrsquos repair accuracy on the test set

BIFI implementation details We define the initial fixerf0 as DrRepair (Yasunaga amp Liang 2020) (the existing bestsystem on DeepFix) which is an encoder-decoder modeltrained in a procedure that corresponds exactly to the initial-ization step of BIFI (sect31) Specifically to train DrRepairYasunaga amp Liang (2020) design heuristic perturbations forthe corruption procedure bsynthetic which mimics commoncode errors beginner and experienced programmers make(eg typos punctuation keyword and type errors) Weuse the same training dev data prepared and released bythe authors to train the initial fixer We then run the BIFIalgorithm for K=2 rounds Our fixer and breaker have thesame model architecture as DrRepair At test time followingthe original DrRepair we repeatedly apply the fixer whilethe code still has errors up to a maximum of 5 times

42 Main results

We study the fixerrsquos repair accuracy on GitHub-Pythonand DeepFix Here ldquoround k accuracyrdquo means the repairaccuracy of the fixer learned in round k ie fk

GitHub-Python Table 1 shows the test results onGitHub-Python We show the overall repair accuracy(ldquoTotalrdquo) as well as the breakdown over the error categoriesin the Python AST parser (table right) The initial fixer f0(ldquoInitialrdquo) is trained on randomly perturbed synthetic badcode Our proposed method (BIFI) enables the initial fixer tobe further trained on real bad code and bad code generated bythe learned breaker which outperforms the initial fixer signif-icantly +285 in overall repair accuracy and consistently


MethodTest accuracy

Bad100

Bad50

Bad10

ref Syntheticbad only

Initial Round-0 620 620 620 620FixerOnly Round-2 886 847 785 627BIFI Round-2 905 890 867 633

Table 3 Analysis varying the amount of (real) bad code onGitHub-Python ldquoBad 100rdquo is our original setting (with 23Kreal bad examples available for BIFI) and ldquoBad 50rdquo means only50 of them (115K) are made available ldquoSynthetic bad onlyrdquomeans keeping training the fixer on synthetic bad examples Theresult shows that (i) even when the amount of real bad examplesis small (eg ldquoBad 10rdquo) they are still useful allowing FixerOnlyBIFI to perform better than using synthetic data alone (ii) whenthe amount of real bad examples is small BIFI exceeds FixerOnlyby a larger margin highlighting the usefulness of the bad examplesgenerated by the learned breaker


Initial Round-0 620

BIFI (ours) Round-2 905ndash real bad Round-2 846

ndash critic Round-2 840

ndash both(backtranslation) Round-2 801

Table 4 Performance comparison with backtranslation onGitHub-Python Backtranslation is equivalent to removing twocomponents from BIFI (i) using the critic to verify fix breakattempts (ldquocriticrdquo) and (ii) training the fixer on real bad examplesin addition to examples generated by the breaker (ldquoreal badrdquo) Theresult suggests that both of these components improve the learningof a fixer and consequently BIFI outperforms backtranslation

across all error categories This result suggests that even ifwe start from a very simple initial fixer trained with randomperturbations BIFI can automatically turn it into a usablefixer with high repair accuracymdash905 accuracy on real badcode We also experimented with continuing training theinitial fixer f0 with synthetic perturbations only for the samerounds of BIFI (hence controlling the amount of trainingdata seen by the fixer) however this did not provide animprovement suggesting that there is a performance ceilingif we only train on synthetic data

DeepFix Table 2 shows the test results on DeepFix alongwith prior works Here we use the existing best systemDrRepair as our initial fixer (ldquoInitialrdquo) BIFI outperformsthe initial fixer by a substantial margin (+56 absoluteover DrRepair) attaining a new state-of-the-art accuracyof 717 It is notable that DrRepair was trained withmanually-designed heuristic perturbations where theauthors (Yasunaga amp Liang 2020) mimicked various codeerrors beginner and experienced programmers make (egtypos punctuation and type errors) Nevertheless our resultsuggests that there is still room for improving the adaptationto a more realistic distribution of coding errors and BIFI


Syntheticdef iterFiles(path) return (name for name in listdir(path) if isfile(join(pathname)))

GoodSynthetic

Perturbation(1)

Train b0

Train f0


Baddef compute_loss(self) loss = selfcost(selfpreds selftarget selflength)sum()

Fixed Apply fk-1 Keep if fixed

(1)

Finetune bk


Broken def iterFiles(path) return (name for name in listdir(path) if isfile(join(pathname)))

GoodApply bk Keep if broken

(3)

Finetune fk


Baddef compute_loss(self) loss = selfcost(selfpreds= selftarget selflength)sum()

Fixed Apply fk-1(1)

Finetune bk



GoodApply bk (3)

Finetune fk




(1)

Finetune fk





(2)

(2)

(4)

(2)

(2)

(4)

def sanitize(xstartend)ltIgtif x gt= start and x lt end err = 0 ltIgtreturn err xltDgt else ltIgterr = 1 return err x

def sanitize(xstartend)ltIgtif x gt= start and x lt end ltIgterr = 0 ltIgtreturn err xltDgt else ltIgterr = 1 return err x

def sanitize(xstartend)ltIgtif x gt= start and x lt end ltIgterr = 0 return err xltDgt else ltIgterr = 1 return err x

def sanitize(xr1r2)ltIgtif x gt= r1 and x lt r2 err = 0 ltIgtreturn err xltDgt else ltIgterr = 1 return err x

def sanitize(xr1r2)ltIgtif x gt= r1 and x lt r2 ltIgterr = 0 ltIgtreturn err xltDgt else ltIgterr = 1 return err x

def sanitize(xr1r2)ltIgtif x gt= r1 and x lt r2 ltIgterr = 0 return err xltDgt else ltIgterr = 1 return err x

No error mdash fixed Error Error

def del_record(self desc) record = selffind_one(desc) records = selffind(desc) if [record] = records raise ValueError( duplicate format( str(desc))) if record selfremove(record)

def del_record(self desc) record = selffind_one(desc) records = selffind(desc) if [record] = records raise ValueError a

duplicate format( str(desc)) if record selfremove(record)

def del_record(self desc) record = selffind_one(desc) records = selffind(desc) if [record] = records raise ValueError( duplicate format( str(desc)) a

if record selfremove(record)

Figure 4 Example of breaker outputs Given good code on theleft we sampled two outputs made by the breaker learned in BIFIround 1 (right) We observe that the breaker places high probabilityon errors seen in real bad code (ie obsolete usage of raiseunbalanced parentheses in nested context)



GoodSynthetic

Perturbation(1)

Train b0

Train f0




(1)

Finetune bk




(3)

Finetune fk



Fixed Apply fk-1(1)

Finetune bk



GoodApply bk (3)

Finetune fk




(1)

Finetune fk





(2)

(2)

(4)

(2)

(2)

(4)













Figure 5 Example of fixer outputs Given the bad code on the left(with an indentation error) the initial fixer (center) attempts to fix itby inserting an indent token (〈I〉) to line 3 but fails to adjust (delete)the indent token on the next line The initial fixer commonly makesthis mistake due to the distribution mismatch between real errors andsynthetic perturbations on which the initial fixer was trained (seesect434) After one round of BIFI the fixer in round 1 (right) learnsto insert and delete the correct pair of indent tokens fixing the error

boosts repair accuracy without additional manual effort

43 Analysis

We now analyze the key insights of BIFI As the main prop-erties of BIFI are to (i) add real bad examples to the trainingdata if the critic accepts the fixerrsquos output (the FixerOnlyversion) and to (ii) train the breaker to generate realistic badexamples (BIFI) we analyze their effects in sect431 and sect432We also compare with backtranslation in sect433 We thenanalyze how our fixer adapts towards real code distributionsthrough quantitative and qualitative studies (sect434)

431 Effect of real bad examples

FixerOnly enables training the fixer on real bad examples(but does not use the bad examples generated by the breaker)As Table 1 and 2 show FixerOnly outperforms the initial fixerby a large margin eg +27 on GitHub-Python This resulthighlights the importance of training on real bad examples

We further analyze the effect of varying the amount of realbad examples shown in Table 3 Here ldquoBad 100rdquo isour original setting (with 23K real bad examples availablefor BIFI and FixerOnly) and ldquoBad 50rdquo means only 50of them (115K) are made available ldquoSynthetic bad onlyrdquomeans keeping training the fixer on synthetic bad examplesonly We find that while the repair accuracy drops as wedecrease the amount of available real bad examples a smallamount of real bad examples (eg ldquoBad 10rdquo) is still useful


Code error category Examples InitialRound-0 Acc

FixerOnly BIFI

Round-1 Acc Round-2 Acc Round-1 Acc Round-2 Acc

Total 15055 620 868 924 880 905

Unbalanced Parentheses 3999 877 933 924 941 942Unclosed left parenthesis (not nested) 226 925 946 946 947 944Unclosed left parenthesis (nested) 3014 858 928 917 938 938Redundant right parenthesis 759 938 953 947 954 957

Indentation Error 6307 394 795 837 813 859Expected indent 4311 464 812 841 820 853Unexpected indent 1966 246 768 838 809 884

Invalid Syntax 4749 705 909 920 916 935Missing colon 663 983 973 974 982 980Missing comma (single-line listtupledict) 694 954 981 974 984 983Missing comma (multi-line listtupledict) 451 889 925 920 945 949Missing newline 52 846 865 885 865 885Missing parenthesis pair 634 825 850 864 871 883Redundant comma 152 737 842 914 849 921Redundant parenthesis pair 698 138 807 861 801 894Invalid use of comma

(eg ldquoraise OSError msgrdquo rarr ldquoraise OSError(msg)rdquo) 1138 613 988 991 987 994Other 267 607 663 644 674 667

Table 5 Code error categories in GitHub-Python and repair accuracy Due to the mismatch between real errors and synthetic perturba-tions used for training the initial fixer has lower accuracy on ldquonestedrdquo than ldquonot nestedrdquo for ldquounbalanced parenthesesrdquo errors but it catches upin BIFI round 1 2 Similarly the initial fixerrsquos repair accuracy is very low for ldquoredundant parenthesis pairrdquo and ldquoindentation errorrdquo but it im-proves significantly in round 1 2 This result illustrates the effect of BIFI adapting the fixer towards real errors See sect434 for more analysis

making FixerOnly perform better than using synthetic dataalone (ie 785 vs 627)

432 Effect of bad examples generated by breaker

Recall that BIFI trains the fixer on both real bad examplesand bad examples generated by the learned breaker AsTable 1 and 2 show BIFI consistently provides an extra boostover FixerOnly suggesting that the use of breaker outputsimproves the fixer Moreover another benefit of the breakeris that one can sample many bad examples from the breakerto augment real bad examples if their amount is limitedIn Table 3 we find that BIFI is especially stronger thanFixerOnly when the amount of available real bad examplesis small (eg ldquoBad 10rdquo)

Figure 4 shows sample outputs made by the learned breakerb1 given the good code on the left We observe that thebreaker places high probability on errors seen in real badcode ie obsolete usage of raise in Python3 (center) andunbalanced parentheses in nested context (right) Comparedto random perturbations that arbitrarily dropinsert tokensthe learned breaker improves the coverage and efficiency ofthe training data for the fixer

433 Comparison with backtranslation

Table 4 compares our method (BIFI) with backtranslationAs discussed in sect33 backtranslation is equivalent toremoving two components from BIFI (i) using the criticto verify fixbreak attempts in data generation (ldquocriticrdquo inTable) and (ii) training the fixer on real bad examples besidesexamples generated by the breaker (ldquoreal badrdquo) We find thatremoving each component from BIFI hurts the performance(eg 90rarr84) which suggests that both componentsare important to improve the quality of training data Withthese two innovations BIFI outperforms backtranslation by

a large margin (+10 absolute on GitHub-Python)

434 How does the fixer adapt

We take a closer look at how our fixer performs and adaptstowards the real distribution of bad code Table 5 showsfine-grained categories of code errors seen in GitHub-Pythonand their repair accuracy

In this categorization we observe two examples of dis-tribution mismatch between the real errors and syntheticperturbations (i) random perturbations can generate thiscategory of errors with high probability but with a wrongldquosub-distributionrdquo within it (eg can generate ldquounbalancedparenthesesrdquo or ldquomissing commardquo errors but do not matchthe low-level distribution of real bad code such as errorsoccurring more often in nested parentheses or in a multi-linelisttupledict recall the Figure 2 top example) (ii) randomperturbations can only generate this category of errors withvery small probability (eg ldquoredundant parenthesis pairrdquo andldquoindentation errorrdquo for which an exact pair of parenthesesor indentsdedents need to be dropped or inserted recallthe Figure 2 bottom example) For (i) the result shows thatthe initial fixer trained with random perturbations has loweraccuracy on ldquonestedrdquo than ldquonot nestedrdquo for ldquounbalancedparenthesesrdquo errors and on ldquomulti-linerdquo than ldquosingle-linerdquofor ldquomissing commardquo errors but the performance catchesup in round 1 2 suggesting the effect of BIFI for addressingthe low-level distribution mismatch For (ii) the resultshows that the initial fixerrsquos repair accuracy is very low forldquoredundant parenthesis pairrdquo and ldquoindentation errorrdquo but itachieves significantly higher performance in round 1 2 (eg39rarr85 for indentation errors) A possible explanationis that as BIFI iteratively adds the successfully repaired casesto training the fixer adapts to up-weight this category oferror fixing leading to improved accuracy


Figure 5 provides examples of fixer outputs Given thebad code on the left (with an indentation error) the initialfixer (center) attempts to fix it by inserting an indent token(〈I〉) to line 3 but fails to adjust (delete) the indent token onthe following line The initial fixer commonly makes thismistake for indentation errors due to the mismatch betweenreal errors and synthetic perturbations discussed aboveAfter one round of BIFI the fixer (Figure 5 right) learns toinsert and delete the correct pair of indents fixing the error

5 Related work and discussionLearning to repair code Several works learn to repaircode from labeled datasets of source code edits made byprogrammers eg error resolution records (Just et al 2014Chen et al 2019 Mesbah et al 2019 Bader et al 2019Tarlow et al 2020 Ding et al 2020) As labeled data iscostly to prepare various works study learning code fixersfrom unlabeled or synthetic data (Pu et al 2016 Pariharet al 2017 Ahmed et al 2018 Pradel amp Sen 2018 Wanget al 2018 Vasic et al 2019 Gupta et al 2019 Hajipouret al 2019 Hellendoorn et al 2020) In particular Guptaet al (2017) is an early work that randomly perturbs goodcode to prepare a synthetic paired data and trains a seq2seqneural network model as a fixer Yasunaga amp Liang (2020)improve on it by designing heuristic perturbations that mimiccommon errors made by programmers Different from theabove work our method (BIFI) adapts a naive fixer trainedwith synthetic errors towards real errors without manualdomain-specific effort For a more comprehensive review ofautomatic code repair we refer readers to Monperrus (2020)

Denoising autoencoding Denoising autoencoding (Vin-cent et al 2008) trains a model that recovers original datafrom randomly corrupted versions and is widely used asan effective self-supervised representation learning and pre-training strategy eg in computer vision (Erhan et al 2010)and natural language processing (NLP) (Lewis et al (2020)which randomly drops replaces tokens in a sentence andlearns to recover) Our initial fixer trained with random per-turbations is a denoising autoencoder Crucially insteadof using it purely for representation learning we show thatthrough the BIFI algorithm one can turn the vanilla denoisingautoencoder into a usable fixer that repairs real-world bad ex-amples with high accuracy On a related note Lee et al (2019)adapt a denoising autoencoder into an autocomplete system

Domain adaptation Domain adaptation aims to addressthe mismatch in data distribution between training and testdomains (Daume III amp Marcu 2006 Quionero-Candela et al2009 Koh et al 2021) Such domain shifts typically occuracross related but different datasets (Torralba amp Efros 2011Fang et al 2013 Venkateswara et al 2017 Yu et al 2018b2019 Peng et al 2019 Kamath et al 2020) as well as fromsynthetic data to real data (including sim2real) (Wang et al2015 Ganin amp Lempitsky 2015 Richter et al 2016 Penget al 2018 Hellendoorn et al 2019 Xu et al 2020 Belle-mare et al 2020) as synthetic data can be easier to obtain

than real data In our repair task unpaired real data (codesnippets on GitHub) is available but paired real data is costlyto obtain Hence we considered adaptation from syntheticpaired data to real paired data Within domain adaptation ourtask is also related to the setting where unlabeled data in thetest domain is available (Sun amp Saenko 2016 Hoffman et al2018 Sun et al 2020) (in our case real bad code) The dif-ference is that as our outputs are structured we have a criticto check if the fixerrsquos output on the unlabeled data is correctLeveraging this property BIFI takes the correct outputs to cre-ate training data in the test domain (Eq 6) and trains a breakerto generate more data that simulates the test domain (Eq 8)

Data augmentation Data augmentation aims to generateextra training data A common approach is to increase thesource side data for instance by adding modifications orsampling from generative models (Hannun et al 2014 Jia ampLiang 2016 Krizhevsky et al 2017 Antoniou et al 2017Yasunaga et al 2018 Yu et al 2018a Lee et al 2019Berthelot et al 2019 Xie et al 2020) Several works alsostudy target side augmentation which keeps multiple (valid)target predictions made by a model and adds to trainingThis is commonly used in structured generation problemssuch as semantic parsing program synthesis and moleculegeneration (Liang et al 2013 Berant et al 2013 Guu et al2017 Min et al 2019 Zhong et al 2020 Yang et al 2020)While our method also augments training data it differsin two aspects 1) we use the fixer and breaker to augmentboth the source and target sides 2) our goal is not only toincrease the amount of training data but also to adapt to thedistribution of interest (real bad examples)

Self-training Self-training (Lee 2013 McClosky et al2006 Kumar et al 2020 Xie et al 2021) applies a trainedmodel to unlabeled data obtains predicted targets (pseudo-labels) and uses them as extra training examples Similarlyco-training (Blum amp Mitchell 1998) and tri-training (Zhouamp Li 2005) train multiple models and add predicted targetson which these models agree Our method also appliestrained models (breaker and fixer) to unlabeled data togenerate targets with a difference that we use the critic toverify the predictions and only keep correct ones

Unsupervised machine translation (MT) UnsupervisedMT learns translators from unpaired corpora (Artetxe et al2018ab Lachaux et al 2020) Backtranslation (Sennrichet al 2016 Lample et al 2018ab) is a common approachthat uses the target-to-source model to generate noisy sourcesand then trains the source-to-target model to reconstruct thetargets (also related to cycle-consistency (Zhu et al 2017Hoffman et al 2018) and style transfer (Shen et al 2017Yang et al 2018 Zhang et al 2019)) One may view theldquobad-siderdquo and ldquogood-siderdquo in our repair task as two sourcetarget languages in MT and apply backtranslation The maindifference is that the repair task has a critic which motivatedour BIFI algorithm BIFI (i) uses the critic to verify if thegenerated examples are actually fixed or broken (Eq 68) and (ii) trains the fixer on real bad examples besides


examples generated by the breaker (Eq 9) We found thesetechniques improve the correctness of the generated trainingdata (sect433) While we focused on code repair in this workwe hope that BIFI can be applied to unsupervised MT andstyle transfer by introducing a human-based or learned critic

6 ConclusionWe considered the problem of learning a fixer from unpaireddata and a critic (code analyzer or compiler) and proposed anew approach Break-It-Fix-It (BIFI) The idea of BIFI is totrain a breaker and use the critic to amass more realistic andcorrect paired data for training the fixer Using two code re-pair datasets (GitHub-Python and DeepFix) we showed howBIFI adapts baseline fixers towards realistic distributionsof code errors achieving improved repair performance Wenote that BIFI is not about simply collecting more trainingdata but rather turning raw unlabeled data into usable paireddata with the help of a critic This is a potentially powerfuland general framework applicable to many areas such asmolecular design text editing and machine translation

AcknowledgmentsWe thank Michael Xie members of the Stanford P-LambdaSNAP and NLP groups as well as our anonymous reviewersfor valuable feedback This work was supported in part byFunai Foundation Fellowship and NSF CAREER AwardIIS-1552635

ReproducibilityCode and data are available at httpsgithubcommichiyasunagabifi Experiments are available athttpsworksheetscodalaborgworksheets0xfddb2ef01a9f4dc0b5d974a5a97174be

ReferencesAhmed U Z Kumar P Karkare A Kar P and Gulwani

S Compilation error repair for the student programsfrom the student programs In International Conferenceon Software Engineering (ICSE) pp 78ndash87 2018

Antoniou A Storkey A and Edwards H Data augmen-tation generative adversarial networks arXiv preprintarXiv171104340 2017

Artetxe M Labaka G and Agirre E Unsupervised statisti-cal machine translation arXiv preprint arXiv1809012722018a

Artetxe M Labaka G Agirre E and Cho K Unsu-pervised neural machine translation In InternationalConference on Learning Representations (ICLR) 2018b

Bader J Scott A Pradel M and Chandra S GetafixLearning to fix bugs automatically Proceedings of theACM on Programming Languages 3(OOPSLA)1ndash272019

Bellemare M G Candido S Castro P S Gong JMachado M C Moitra S Ponda S S and Wang ZAutonomous navigation of stratospheric balloons usingreinforcement learning Nature 588 2020

Berant J Chou A Frostig R and Liang P Semanticparsing on freebase from question-answer pairs InEmpirical Methods in Natural Language Processing(EMNLP) 2013

Berthelot D Carlini N Goodfellow I Papernot NOliver A and Raffel C A Mixmatch A holisticapproach to semi-supervised learning In Advances inNeural Information Processing Systems (NeurIPS) 2019

Blum A and Mitchell T Combining labeled and unlabeleddata with co-training In Conference on computationallearning theory 1998

Chen Z Kommrusch S J Tufano M Pouchet L-NPoshyvanyk D and Monperrus M SequencerSequence-to-sequence learning for end-to-end programrepair IEEE Transactions on Software Engineering 2019

Daume III H and Marcu D Domain adaptation forstatistical classifiers Journal of artificial Intelligenceresearch 2006

Dinella E Dai H Li Z Naik M Song L and WangK Hoppity Learning graph transformations to detectand fix bugs in programs In International Conferenceon Learning Representations (ICLR) 2020

Ding Y Ray B Devanbu P and Hellendoorn V JPatching as translation the data and the metaphor InAutomated Software Engineering (ASE) 2020

Erhan D Courville A Bengio Y and Vincent P Whydoes unsupervised pre-training help deep learningIn Artificial Intelligence and Statistics (AISTATS) pp201ndash208 2010

Fang C Xu Y and Rockmore D N Unbiased metriclearning On the utilization of multiple datasets and webimages for softening bias In International Conferenceon Computer Vision (ICCV) 2013

Ganin Y and Lempitsky V Unsupervised domain adap-tation by backpropagation In International Conferenceon Machine Learning (ICML) 2015

Gupta R Pal S Kanade A and Shevade S K DeepfixFixing common C language errors by deep learning InAssociation for the Advancement of Artificial Intelligence(AAAI) 2017

Gupta R Kanade A and Shevade S Deep reinforcementlearning for programming language correction InAssociation for the Advancement of Artificial Intelligence(AAAI) 2019


Guu K Pasupat P Liu E Z and Liang P Fromlanguage to programs Bridging reinforcement learningand maximum marginal likelihood In Association forComputational Linguistics (ACL) 2017

Hajipour H Bhattacharya A and Fritz M SamplefixLearning to correct programs by sampling diverse fixesarXiv preprint arXiv190610502 2019

Hannun A Case C Casper J Catanzaro B Diamos GElsen E Prenger R Satheesh S Sengupta S CoatesA et al Deep speech Scaling up end-to-end speechrecognition arXiv preprint arXiv14125567 2014

Hellendoorn V J Proksch S Gall H C and Bacchelli AWhen code completion fails A case study on real-worldcompletions In International Conference on SoftwareEngineering (ICSE) 2019

Hellendoorn V J Sutton C Singh R Maniatis P andBieber D Global relational models of source code InInternational Conference on Learning Representations(ICLR) 2020

Hoffman J Tzeng E Park T Zhu J-Y Isola P SaenkoK Efros A and Darrell T Cycada Cycle-consistent ad-versarial domain adaptation In International Conferenceon Machine Learning (ICML) 2018

Jia R and Liang P Data recombination for neural semanticparsing In Association for Computational Linguistics(ACL) 2016

Jin W Yang K Barzilay R and Jaakkola T Learningmultimodal graph-to-graph translation for molecularoptimization In International Conference on LearningRepresentations (ICLR) 2019

Just R Jalali D and Ernst M D Defects4j A databaseof existing faults to enable controlled testing studies forjava programs In Proceedings of the 2014 InternationalSymposium on Software Testing and Analysis pp437ndash440 2014

Kamath A Jia R and Liang P Selective questionanswering under domain shift In Association forComputational Linguistics (ACL) 2020

Kingma D and Ba J Adam A method for stochasticoptimization In International Conference on LearningRepresentations (ICLR) 2015

Koh P W Sagawa S Marklund H Xie S M ZhangM Balsubramani A Hu W Yasunaga M PhillipsR L Beery S et al Wilds A benchmark of in-the-wilddistribution shifts In International Conference onMachine Learning (ICML) 2021

Krizhevsky A Sutskever I and Hinton G E Imagenetclassification with deep convolutional neural networksCommunications of the ACM 2017

Kumar A Ma T and Liang P Understanding self-training for gradual domain adaptation In InternationalConference on Machine Learning (ICML) 2020

Lachaux M-A Roziere B Chanussot L and LampleG Unsupervised translation of programming languagesIn Advances in Neural Information Processing Systems(NeurIPS) 2020

Lample G Conneau A Denoyer L and Ranzato MUnsupervised machine translation using monolingualcorpora only In International Conference on LearningRepresentations (ICLR) 2018a

Lample G Ott M Conneau A Denoyer L and RanzatoM Phrase-based amp neural unsupervised machinetranslation In Empirical Methods in Natural LanguageProcessing (EMNLP) 2018b

Lee D-H Pseudo-label The simple and efficient semi-supervised learning method for deep neural networks InWorkshop on challenges in representation learning Inter-national Conference on Machine Learning (ICML) 2013

Lee M Hashimoto T B and Liang P Learningautocomplete systems as a communication game InAdvances in Neural Information Processing Systems(NeurIPS) Workshop on Emergent Communication 2019

Levenshtein V I Binary codes capable of correctingdeletions insertions and reversals In Soviet physicsdoklady 1966

Lewis M Liu Y Goyal N Ghazvininejad M MohamedA Levy O Stoyanov V and Zettlemoyer L BartDenoising sequence-to-sequence pre-training for naturallanguage generation translation and comprehension InAssociation for Computational Linguistics (ACL) 2020

Liang P Jordan M I and Klein D Learning dependency-based compositional semantics ComputationalLinguistics 2013

McClosky D Charniak E and Johnson M Effectiveself-training for parsing In North American Associationfor Computational Linguistics (NAACL) 2006

Mesbah A Rice A Johnston E Glorioso N andAftandilian E Deepdelta learning to repair compilationerrors In Proceedings of the 2019 27th ACM Joint Meet-ing on European Software Engineering Conference andSymposium on the Foundations of Software Engineeringpp 925ndash936 2019

Min S Chen D Hajishirzi H and Zettlemoyer L Adiscrete hard em approach for weakly supervised questionanswering In Empirical Methods in Natural LanguageProcessing (EMNLP) 2019

Monperrus M The living review on automated programrepair Technical Report hal-01956501 HALarchives-ouvertesfr 2020


Parihar S Dadachanji Z Praveen Kumar Singh R DKarkare A and Bhattacharya A Automatic gradingand feedback using program repair for introductoryprogramming courses In ACM Conference on Innovationand Technology in Computer Science Education 2017

Pascanu R Mikolov T and Bengio Y On the difficulty oftraining recurrent neural networks In International confer-ence on machine learning (ICML) pp 1310ndash1318 2013

Peng X Usman B Kaushik N Wang D Hoffman Jand Saenko K Visda A synthetic-to-real benchmark forvisual domain adaptation In Computer Vision and PatternRecognition (CVPR) 2018

Peng X Bai Q Xia X Huang Z Saenko K and WangB Moment matching for multi-source domain adaptationIn International Conference on Computer Vision (ICCV)2019

Pradel M and Sen K Deepbugs A learning approach toname-based bug detection Proceedings of the ACM onProgramming Languages 2(OOPSLA)1ndash25 2018

Pu Y Narasimhan K Solar-Lezama A and Barzilay Rsk p a neural program corrector for moocs In CompanionProceedings of the 2016 ACM SIGPLAN InternationalConference on Systems Programming Languages andApplications Software for Humanity pp 39ndash40 2016

Quionero-Candela J Sugiyama M Schwaighofer A andLawrence N D Dataset shift in machine learning TheMIT Press 2009

Richter S R Vineet V Roth S and Koltun V Playingfor data Ground truth from computer games In Europeanconference on computer vision 2016

Sennrich R Haddow B and Birch A Improving neuralmachine translation models with monolingual data InAssociation for Computational Linguistics (ACL) 2016

Seo H Sadowski C Elbaum S Aftandilian E andBowdidge R Programmersrsquo build errors A case studyat google In International Conference on SoftwareEngineering (ICSE) 2014

Shen T Lei T Barzilay R and Jaakkola T Style transferfrom non-parallel text by cross-alignment In Advances inNeural Information Processing Systems (NeurIPS) 2017

Sun B and Saenko K Deep coral Correlation alignmentfor deep domain adaptation In European Conference onComputer Vision (ECCV) 2016

Sun Y Wang X Liu Z Miller J Efros A A andHardt M Test-time training for out-of-distributiongeneralization In International Conference on MachineLearning (ICML) 2020

Taghipour K and Ng H T A neural approach to automatedessay scoring In Empirical Methods in Natural LanguageProcessing (EMNLP) 2016

Tarlow D Moitra S Rice A Chen Z Manzagol P-ASutton C and Aftandilian E Learning to fix build errorswith graph2diff neural networks In Proceedings of theIEEEACM 42nd International Conference on SoftwareEngineering Workshops pp 19ndash20 2020

Torralba A and Efros A A Unbiased look at dataset bias InComputer Vision and Pattern Recognition (CVPR) 2011

Vasic M Kanade A Maniatis P Bieber D and SinghR Neural program repair by jointly learning to localizeand repair In International Conference on LearningRepresentations (ICLR) 2019

Vaswani A Shazeer N Parmar N Uszkoreit J JonesL Gomez A N Kaiser L and Polosukhin I Attentionis all you need In Advances in Neural InformationProcessing Systems (NeurIPS) 2017

Venkateswara H Eusebio J Chakraborty S andPanchanathan S Deep hashing network for unsuperviseddomain adaptation In Computer Vision and PatternRecognition (CVPR) pp 5018ndash5027 2017

Vincent P Larochelle H Bengio Y and Manzagol P-AExtracting and composing robust features with denoisingautoencoders In International Conference on MachineLearning (ICML) 2008

Wang K Singh R and Su Z Dynamic neural programembeddings for program repair In InternationalConference on Learning Representations (ICLR) 2018

Wang Y Berant J and Liang P Building a semantic parserovernight In Association for Computational Linguistics(ACL) 2015

Xie Q Dai Z Hovy E Luong M-T and Le Q VUnsupervised data augmentation for consistency trainingIn Advances in Neural Information Processing Systems(NeurIPS) 2020

Xie S M Kumar A Jones R Khani F Ma T andLiang P In-n-out Pre-training and self-training usingauxiliary information for out-of-distribution robustnessIn International Conference on Learning Representations(ICLR) 2021

Xu S Semnani S J Campagna G and Lam M SAutoqa From databases to qa semantic parsers with onlysynthetic training data In Empirical Methods in NaturalLanguage Processing (EMNLP) 2020

Yang K Jin W Swanson K Barzilay R and Jaakkola TImproving molecular design by stochastic iterative targetaugmentation In International Conference on MachineLearning (ICML) 2020


Yang Z Hu Z Dyer C Xing E P and Berg-KirkpatrickT Unsupervised text style transfer using language modelsas discriminators In Advances in Neural InformationProcessing Systems (NeurIPS) 2018

Yasunaga M and Liang P Graph-based self-supervisedprogram repair from diagnostic feedback In InternationalConference on Machine Learning (ICML) 2020

Yasunaga M Kasai J and Radev D Robust multilingualpart-of-speech tagging via adversarial training In NorthAmerican Association for Computational Linguistics(NAACL) 2018

Yu T Yasunaga M Yang K Zhang R Wang D Li Zand Radev D Syntaxsqlnet Syntax tree networks for com-plex and cross-domaintext-to-sql task In Empirical Meth-ods in Natural Language Processing (EMNLP) 2018a

Yu T Zhang R Yang K Yasunaga M Wang DLi Z Ma J Li I Yao Q Roman S et al SpiderA large-scale human-labeled dataset for complex andcross-domain semantic parsing and text-to-sql task InEmpirical Methods in Natural Language Processing(EMNLP) 2018b

Yu T Zhang R Yasunaga M Tan Y C Lin X VLi S Er H Li I Pang B Chen T et al SparcCross-domain semantic parsing in context In Associationfor Computational Linguistics (ACL) 2019

Zhang Z Ren S Liu S Wang J Chen P Li MZhou M and Chen E Style transfer as unsupervisedmachine translation In Association for the Advancementof Artificial Intelligence (AAAI) 2019

Zhong V Lewis M Wang S I and Zettlemoyer LGrounded adaptation for zero-shot executable semanticparsing In Empirical Methods in Natural LanguageProcessing (EMNLP) 2020

Zhou Z-H and Li M Tri-training Exploiting unlabeleddata using three classifiers IEEE Transactions onknowledge and Data Engineering 2005

Zhu J-Y Park T Isola P and Efros A A Unpairedimage-to-image translation using cycle-consistentadversarial networks In International Conference onComputer Vision (ICCV) 2017


def validate(type) ltINgtif type not in types ltINgtmsg = (invalid type ltINgtmust in s types) a

raise Exception(msg)ltDEgtltDEgt

else pass

Real def validate(type) if type not in types ltINgtmsg = (invalid type ltINgtmust in s types)ltDEgt


else pass

Synthetic def validate(type)ltINgtif type not in types ltINgtmsg = (invalid type ltINgtmust in s types)ltDEgt


else pass







def validate(type) ltINgtif type not in types ltINgtmsg = (invalid type must in s types)a


else pass

Real def validate(type) if type not in types ltINgtmsg = (invalid type must in s types) raise Exception(msg)ltDEgt

else pass

Synthetic def validate(type)ltINgtif type not in types ltINgtmsg = (invalid type must in s types) raise Exception(msg)ltDEgt

else pass









else pass


else pass


else pass


Real


Synthetic







else pass


else pass


else pass


Real


Synthetic





def validate(type) ltIgtif type not in types ltIgtmsg = (invalid type not in s types)a

ltIgtraise Exception(msg)ltDgtltDgt else pass

Real def validate(type) ltIgtif type not in types msg = (invalid type not in s types) raise Exception(msg)ltDgt else pass

Synthetic def validate(type)ltIgtif type not in types ltIgtmsg = (invalid type not in s types) raise Exception(msg)ltDgt else pass


Real


Synthetic





Figure 2 Challenge of learning a fixer from unlabeled data Ex-isting works randomly perturb good examples into bad examplesand learn to recover However such synthetically generated badexamples do not match the distribution of real bad examples Forinstance synthetic perturbations may drop parentheses arbitrar-ily from code (top right) but real human-written bad code missesparentheses more often in a nested context (top center) In a more ex-treme case at bottom to generate the human error (center) from thecorrected code (left) a pair of indent and dedent tokens need to be in-serted accordingly which random perturbations generate with verysmall probability Note that indentation is meaningful in Python inthe tokenized Python code each 〈I〉 token means indenting the lineby one unit each 〈D〉 means dedenting the next line by one unit

the distribution of real bad examples For instance as shownin Figure 2 synthetic perturbations may drop parenthesesarbitrarily from code generating errors that rarely happen inreal programming (Figure 2 top right synthetic errors) incontrast real human-written code misses parentheses moreoften in a nested context (Figure 2 top center human errors)As we will show in sect4 this distribution mismatch betweensynthetic data and real data results in low performance

To bridge this gap we propose Break-It-Fix-It (BIFI) a newmethod to learn a fixer from unlabeled data and a critic (Fig-ure 3) BIFI is based on two core insights (i) we can use thecritic to check a fixerrsquos output on real bad examples and addgood outputs to the training data and (ii) we train a breaker togenerate realistic bad examples from good examples Specif-ically given an initial fixer trained on synthetic paired data〈synthetic bad good〉 BIFI improves the fixer and breaker si-multaneously through rounds of data generation and training(1) apply the fixer to real bad examples and keep fixed outputsto obtain real paired data (2) use the resulting data to train thebreaker (3) use the learned breaker to generate code errorsand obtain more paired data and (4) train the fixer on thenewly-generated paired data in (1) and (3) Intuitively thiscycle trains the fixer on increasingly more real or realisticallygenerated bad code adapting the fixer from the initial syn-thetic distributions towards real distributions of code errors

The BIFI algorithm is related to backtranslation in unsuper-vised machine translation (Lample et al 2018a) which usesa target-to-source model to generate noisy sources and trains

a source-to-target model to reconstruct the targets (eg thebad-side and good-side in our repair task can be viewed asthe source and target) BIFI differs from backtranslation intwo ways it uses the critic to verify if the generated examplesare actually fixed or broken (step 1 and 3) and it trains thefixer on real bad examples in addition to examples generatedby the breaker (step 4) which improves the correctness anddistributional match of generated paired data

We evaluate our proposed approach on two code repairdatasets

bull GitHub-Python We collected a new dataset of 3MPython code snippets from githubcom The task is torepair errors caught by the Python AST parser We setthe initial fixer to be an encoder-decoder Transformer(Vaswani et al 2017) trained on random perturbationsbull DeepFix (Gupta et al 2017) The task is to repair

compiler errors in C code submitted by students in anintroductory programming course We set the initial fixerto be the existing best system DrRepair (Yasunaga ampLiang 2020) which was trained on manually-designedheuristic perturbations

Our approach (BIFI) outperforms the initial fixers obtaining905 repair accuracy on GitHub-Python (+285 absolute)and 717 repair accuracy on DeepFix (+56 absolute)attaining a new state-of-the-art BIFI also improves onbacktranslation by 10 Further we qualitatively showhow the fixer and breaker adapt towards more realisticdistributions of code through the BIFI algorithm (sect43)

2 Problem statementFigure 1 illustrates our problem setup critic2fixer Thesystem is given unlabeled dataD (code snippets) and a criticc (code analyzer or compiler) that returns whether an inputis good or bad eg for a code snippet xisinD

c(x)=

0 if x has errors1 if x has no error (1)

Using the critic c examples inD can be classified into badonesDbad=x |xisinD c(x)=0 and good onesDgood=y |yisinD c(y)=1 Our task is to learn a fixer f that maps a badexample xisinDbad into a good example f(x) such that it isclose1 tox and c(f(x))=1 The evaluation metric is the fixerf rsquos repair accuracy on a held-out set of bad examplesD(test)

bad

RepairAcc=|x |xisinD(test)

bad c(f(x))=1||D(test)

bad | (2)

3 ApproachThe major challenge of learning a fixer is that we need to learnfrom unpaired data ieDbad andDgood do not form 〈broken

1We constrain the edit distance as described in sect41 We acknowl-edge that while we want f(x) to be semantics-preserving it is non-trivial to ensure this automatically so we rely on the edit distance




Good codeSynthetic

Perturbation(1)

Train b0

Train f0




(1)

Finetune bk




(3)

Finetune fk




Finetune bk




Finetune fk




(1)

Finetune fk





(2)

(2)

(4)

(2)

(2)

(4)



Good codeSynthetic

Perturbation(1)

Train b0

Train f0




(1)

Train bk




(3)

Train fk




Train bk




Train fk




(1)

Train fk





(2)

(2)

(4)

(2)

(2)

(4)






31 Initialization
















k ) (9)


























MethodTest accuracy


IndentationError

InvalidSyntax

Initial Round-0 620 877 394 705


BIFI Round-1 880 941 813 916Round-2 905 942 859 935








411 Github-Python






412 DeepFix



42 Main results




MethodTest accuracy

Bad100

Bad50

Bad10





Initial Round-0 620









GoodSynthetic

Perturbation(1)

Train b0

Train f0




(1)

Finetune bk




(3)

Finetune fk



Fixed Apply fk-1(1)

Finetune bk



GoodApply bk (3)

Finetune fk




(1)

Finetune fk





(2)

(2)

(4)

(2)

(2)

(4)
















GoodSynthetic

Perturbation(1)

Train b0

Train f0




(1)

Finetune bk




(3)

Finetune fk



Fixed Apply fk-1(1)

Finetune bk



GoodApply bk (3)

Finetune fk




(1)

Finetune fk





(2)

(2)

(4)

(2)

(2)

(4)















43 Analysis







FixerOnly BIFI


Total 15055 620 868 924 880 905





















































































































Good codeSynthetic

Perturbation(1)

Train b0

Train f0




(1)

Finetune bk




(3)

Finetune fk




Finetune bk




Finetune fk




(1)

Finetune fk





(2)

(2)

(4)

(2)

(2)

(4)



Good codeSynthetic

Perturbation(1)

Train b0

Train f0




(1)

Train bk




(3)

Train fk




Train bk




Train fk




(1)

Train fk





(2)

(2)

(4)

(2)

(2)

(4)






31 Initialization
















k ) (9)


























MethodTest accuracy


IndentationError

InvalidSyntax

Initial Round-0 620 877 394 705


BIFI Round-1 880 941 813 916Round-2 905 942 859 935








411 Github-Python






412 DeepFix



42 Main results




MethodTest accuracy

Bad100

Bad50

Bad10





Initial Round-0 620









GoodSynthetic

Perturbation(1)

Train b0

Train f0




(1)

Finetune bk




(3)

Finetune fk



Fixed Apply fk-1(1)

Finetune bk



GoodApply bk (3)

Finetune fk




(1)

Finetune fk





(2)

(2)

(4)

(2)

(2)

(4)
















GoodSynthetic

Perturbation(1)

Train b0

Train f0




(1)

Finetune bk




(3)

Finetune fk



Fixed Apply fk-1(1)

Finetune bk



GoodApply bk (3)

Finetune fk




(1)

Finetune fk





(2)

(2)

(4)

(2)

(2)

(4)















43 Analysis







FixerOnly BIFI


Total 15055 620 868 924 880 905





























































































































k ) (9)


























MethodTest accuracy


IndentationError

InvalidSyntax

Initial Round-0 620 877 394 705


BIFI Round-1 880 941 813 916Round-2 905 942 859 935








411 Github-Python






412 DeepFix



42 Main results




MethodTest accuracy

Bad100

Bad50

Bad10





Initial Round-0 620









GoodSynthetic

Perturbation(1)

Train b0

Train f0




(1)

Finetune bk




(3)

Finetune fk



Fixed Apply fk-1(1)

Finetune bk



GoodApply bk (3)

Finetune fk




(1)

Finetune fk





(2)

(2)

(4)

(2)

(2)

(4)
















GoodSynthetic

Perturbation(1)

Train b0

Train f0




(1)

Finetune bk




(3)

Finetune fk



Fixed Apply fk-1(1)

Finetune bk



GoodApply bk (3)

Finetune fk




(1)

Finetune fk





(2)

(2)

(4)

(2)

(2)

(4)















43 Analysis







FixerOnly BIFI


Total 15055 620 868 924 880 905



















































































































MethodTest accuracy


IndentationError

InvalidSyntax

Initial Round-0 620 877 394 705


BIFI Round-1 880 941 813 916Round-2 905 942 859 935








411 Github-Python






412 DeepFix



42 Main results




MethodTest accuracy

Bad100

Bad50

Bad10





Initial Round-0 620









GoodSynthetic

Perturbation(1)

Train b0

Train f0




(1)

Finetune bk




(3)

Finetune fk



Fixed Apply fk-1(1)

Finetune bk



GoodApply bk (3)

Finetune fk




(1)

Finetune fk





(2)

(2)

(4)

(2)

(2)

(4)
















GoodSynthetic

Perturbation(1)

Train b0

Train f0




(1)

Finetune bk




(3)

Finetune fk



Fixed Apply fk-1(1)

Finetune bk



GoodApply bk (3)

Finetune fk




(1)

Finetune fk





(2)

(2)

(4)

(2)

(2)

(4)















43 Analysis







FixerOnly BIFI


Total 15055 620 868 924 880 905



















































































































MethodTest accuracy

Bad100

Bad50

Bad10





Initial Round-0 620









GoodSynthetic

Perturbation(1)

Train b0

Train f0




(1)

Finetune bk




(3)

Finetune fk



Fixed Apply fk-1(1)

Finetune bk



GoodApply bk (3)

Finetune fk




(1)

Finetune fk





(2)

(2)

(4)

(2)

(2)

(4)
















GoodSynthetic

Perturbation(1)

Train b0

Train f0




(1)

Finetune bk




(3)

Finetune fk



Fixed Apply fk-1(1)

Finetune bk



GoodApply bk (3)

Finetune fk




(1)

Finetune fk





(2)

(2)

(4)

(2)

(2)

(4)















43 Analysis







FixerOnly BIFI


Total 15055 620 868 924 880 905




















































































































FixerOnly BIFI


Total 15055 620 868 924 880 905


















































































































Unsupervised Learning for Program Repair - Break-It-Fix-It

Documents

Transcript of Unsupervised Learning for Program Repair - Break-It-Fix-It