Auto-locating and fix-propagating for HTML validation errors to PHP server-side code

10
Auto-Locating and Fix-Propagating for HTML Validation Errors to PHP Server-side Code Hung Viet Nguyen, Hoan Anh Nguyen, Tung Thanh Nguyen, Tien N. Nguyen Electrical and Computer Engineering Department Iowa State University {hungnv,hoan,tung,tien}@iastate.edu Abstract—Checking/correcting HTML validation errors in Web pages is helpful for Web developers in finding/fixing bugs. However, existing validating/fixing tools work well only on static HTML pages and do not help fix the corresponding server code if validation errors are found in HTML pages, due to several chal- lenges with dynamically generated pages in Web development. We propose PhpSync, a novel automatic locating/fixing tool for HTML validation errors in PHP-based Web applications. Given an HTML page produced by a server-side PHP program, PhpSync uses Tidy, an HTML validating/correcting tool to find the validation errors in that HTML page. If errors are detected, it leverages the fixes from Tidy in the given HTML page and propagates them to the corresponding location(s) in PHP code. Our core solutions include 1) a symbolic execution algorithm on the given PHP program to produce a single tree-based model, called D-model, which approximately represents its possible client page outputs, 2) an algorithm mapping any text in the given HTML page to the text(s) in the node(s) of the D-model and then to the PHP code, and 3) a fix-propagating algorithm from the fixes in the HTML page to the PHP code via the D-model and the mapping algorithm. Our empirical evaluation shows that on average, PhpSync achieves 96.7% accuracy in locating the corresponding locations in PHP code from client pages, and 95% accuracy in propagating the fixes to the server-side code. Index Terms—Fix Propagation, Bug Localization, PHP Dy- namic Web Applications, Validation Errors I. I NTRODUCTION Web applications have become a critical infrastructure in our society. The World Wide Web Consortium (W3C) has developed several standards to ensure the development of high-quality and reliable Web applications [1]. An important quality criterion for a Web application is Markup Validity [2], which defines the validity of a Web document in HTML and other client-side markup Web languages according to their corresponding grammar, vocabulary, and syntactical rules. Although modern Web browsers handle very well the pars- ing of even not well-formed HTML pages, some software defects in Web applications are not always easily caught due to the client-server and dynamic nature of Web contents. Check- ing HTML validation errors could really help the process of finding and fixing bugs in Web development. In a survey conducted by W3C [3], a majority of Web professionals stated that validation errors is the first thing they check whenever they run into a Web styling or scripting bug. Creating Web pages according to a widely accepted standard also makes them easier to maintain and evolve, even if the maintenance and evolution is performed by different developers [3]. Recognizing the importance of markup validity for Web pages, several organizations/individuals have produced auto- matic Web page validating tools (also called HTML valida- tors). Some HTML validators (e.g. Tidy [4]) also provide automatic support for fixing markup errors to convert an HTML page into a well-formed one that conforms to HTML grammar and syntax. However, such auto-fixing tools work well only on static HTML pages and do not address several challenges in current Web development. The first challenge is that in a Web application, a client-side HTML page is often dynamically generated from the server-side code, which is written in different languages. For example, the server code is written in PHP, ASP, Perl, SQL, etc., while a client-side page is in HTML, JavaScript, CSS, and so on. The generated HTML code is embedded within the string literals or the values of variables in the server code. Moreover, those values are also scattered in multiple locations in server pages. For example, to produce an HTML table, multiple variables and string con- stants in different functions in the server code can be involved. Importantly, because the server code dynamically produces different client pages depending on run-time situations, if a validation error is found and reported in a Web page (e.g. via Tidy), it is challenging for its developers to manually map the buggy location(s) back to its source(s) in the server-side code. We propose PhpSync, an auto-locating and fix-propagating tool for HTML validation errors in PHP-based Web applica- tions. Given an HTML page produced by a PHP server page, PhpSync uses Tidy, an HTML validating/correcting tool to find any validation errors on the HTML page. If errors are detected, PhpSync leverages the fixes from Tidy in the given HTML page and propagates them to the corresponding location(s) in the PHP code. In the cases that Tidy cannot provide the fixes, the auto-locating function in PhpSync will help developers to quickly locate the corresponding buggy locations in PHP code from the buggy HTML locations found by Tidy. PhpSync does not require the input that produces the erroneous page. The dynamic nature of a Web application is addressed via our symbolic execution algorithm that symbolically executes the given PHP program to create a single tree-based represen- tation, called D-model, which approximates its possible HTML client page outputs. Each D-model represents a symbolic, string-based value that is resulted from the symbolic execution of any PHP expression(s). The D-model for the entire PHP server page or function is composed by the D-models resulted

Transcript of Auto-locating and fix-propagating for HTML validation errors to PHP server-side code

Auto-Locating and Fix-Propagating for HTMLValidation Errors to PHP Server-side Code

Hung Viet Nguyen, Hoan Anh Nguyen, Tung Thanh Nguyen, Tien N. NguyenElectrical and Computer Engineering Department

Iowa State University{hungnv,hoan,tung,tien}@iastate.edu

Abstract—Checking/correcting HTML validation errors inWeb pages is helpful for Web developers in finding/fixing bugs.However, existing validating/fixing tools work well only on staticHTML pages and do not help fix the corresponding server code ifvalidation errors are found in HTML pages, due to several chal-lenges with dynamically generated pages in Web development.

We propose PhpSync, a novel automatic locating/fixing toolfor HTML validation errors in PHP-based Web applications.Given an HTML page produced by a server-side PHP program,PhpSync uses Tidy, an HTML validating/correcting tool to findthe validation errors in that HTML page. If errors are detected,it leverages the fixes from Tidy in the given HTML page andpropagates them to the corresponding location(s) in PHP code.Our core solutions include 1) a symbolic execution algorithm onthe given PHP program to produce a single tree-based model,called D-model, which approximately represents its possible clientpage outputs, 2) an algorithm mapping any text in the givenHTML page to the text(s) in the node(s) of the D-model andthen to the PHP code, and 3) a fix-propagating algorithm fromthe fixes in the HTML page to the PHP code via the D-modeland the mapping algorithm. Our empirical evaluation shows thaton average, PhpSync achieves 96.7% accuracy in locating thecorresponding locations in PHP code from client pages, and 95%accuracy in propagating the fixes to the server-side code.

Index Terms—Fix Propagation, Bug Localization, PHP Dy-namic Web Applications, Validation Errors

I. INTRODUCTION

Web applications have become a critical infrastructure inour society. The World Wide Web Consortium (W3C) hasdeveloped several standards to ensure the development ofhigh-quality and reliable Web applications [1]. An importantquality criterion for a Web application is Markup Validity [2],which defines the validity of a Web document in HTML andother client-side markup Web languages according to theircorresponding grammar, vocabulary, and syntactical rules.

Although modern Web browsers handle very well the pars-ing of even not well-formed HTML pages, some softwaredefects in Web applications are not always easily caught due tothe client-server and dynamic nature of Web contents. Check-ing HTML validation errors could really help the process offinding and fixing bugs in Web development. In a surveyconducted by W3C [3], a majority of Web professionals statedthat validation errors is the first thing they check wheneverthey run into a Web styling or scripting bug. Creating Webpages according to a widely accepted standard also makesthem easier to maintain and evolve, even if the maintenanceand evolution is performed by different developers [3].

Recognizing the importance of markup validity for Webpages, several organizations/individuals have produced auto-matic Web page validating tools (also called HTML valida-tors). Some HTML validators (e.g. Tidy [4]) also provideautomatic support for fixing markup errors to convert anHTML page into a well-formed one that conforms to HTMLgrammar and syntax. However, such auto-fixing tools workwell only on static HTML pages and do not address severalchallenges in current Web development. The first challenge isthat in a Web application, a client-side HTML page is oftendynamically generated from the server-side code, which iswritten in different languages. For example, the server code iswritten in PHP, ASP, Perl, SQL, etc., while a client-side page isin HTML, JavaScript, CSS, and so on. The generated HTMLcode is embedded within the string literals or the values ofvariables in the server code. Moreover, those values are alsoscattered in multiple locations in server pages. For example,to produce an HTML table, multiple variables and string con-stants in different functions in the server code can be involved.Importantly, because the server code dynamically producesdifferent client pages depending on run-time situations, if avalidation error is found and reported in a Web page (e.g. viaTidy), it is challenging for its developers to manually map thebuggy location(s) back to its source(s) in the server-side code.

We propose PhpSync, an auto-locating and fix-propagatingtool for HTML validation errors in PHP-based Web applica-tions. Given an HTML page produced by a PHP server page,PhpSync uses Tidy, an HTML validating/correcting tool to findany validation errors on the HTML page. If errors are detected,PhpSync leverages the fixes from Tidy in the given HTMLpage and propagates them to the corresponding location(s) inthe PHP code. In the cases that Tidy cannot provide the fixes,the auto-locating function in PhpSync will help developers toquickly locate the corresponding buggy locations in PHP codefrom the buggy HTML locations found by Tidy. PhpSync doesnot require the input that produces the erroneous page.

The dynamic nature of a Web application is addressed viaour symbolic execution algorithm that symbolically executesthe given PHP program to create a single tree-based represen-tation, called D-model, which approximates its possible HTMLclient page outputs. Each D-model represents a symbolic,string-based value that is resulted from the symbolic executionof any PHP expression(s). The D-model for the entire PHPserver page or function is composed by the D-models resulted

from the intermediate computations during the symbolic exe-cution of the expressions in that page/function. Symbols in aD-model represent users’ inputs, data retrieved from databases,or unresolved values. A node in a D-model represents either 1)a determined value (e.g. a string literal), 2) a non-determineddata value (e.g. a user’s input), 3) a concatenation operation,4) a selection operation, or 5) a repetition operation on othernodes/values. This allows PhpSync to model the multi-valuedand scattered server-side data and the multiple versions ofclient-side code generated from the server code.

Another fundamental technique in PhpSync is CSMap,an algorithm that maps any text in the given HTML pageproduced by the given PHP program to the correspondingPHP code location by mapping that text to the node(s) of thecorresponding D-model. Then, our fix-propagating algorithmderives the fixing changes from Tidy to the given HTML pageand propagates them to the locations in PHP via the establishedclient-to-server mappings. CSMap is generic and can be usedin other applications such as locating the corresponding buggyPHP places for other types of errors found in an HTML page.

Our empirical evaluation on real-world Web applicationsshows that PhpSync achieves on average 96.7% accuracy inlocating the corresponding locations in PHP code from clientpages, and 95% accuracy in fix-propagating to server code.

The key contributions of this paper include:1) PhpSync, an auto-locating and fix-propagating tool for

HTML validation errors in PHP-based Web applications;2) CSMap, a mapping algorithm from an HTML page

(produced by a PHP page) to the corresponding PHP locations;3) an empirical evaluation on several real-world Web appli-

cations to show PhpSync’s correctness and efficiency.Section II presents a motivating example. Section III dis-

cusses our representation model. Associated algorithms are de-scribed in Sections IV and V. Section VI is for our evaluation.Related work is in Section VII. Conclusions appear last.

II. MOTIVATING EXAMPLE AND APPROACH OVERVIEW

This section presents an example that illustrates a bugcaused by an HTML validation error and the challenges infixing such errors in PHP-based Web applications.

A. An Example of a Bug on an Ill-formed Web Page

This example is inspired from an online social networksystem in which users are able to connect with peers/friendsvia posting and sharing news, pictures, and videos in theirdaily activities. In this system, a user can view and providecomments on the posts from his/her friends’ pages. Figure 1adisplays such a page when a short news item on ASE 2011 isposted. Each post is followed by one or multiple comments anda textbox along with a submission button for a user to enter anew comment. After the comment is provided and the buttonis pressed, the new comment is expected to appear at the endof the comments’ list, and the textbox and the button wouldbe positioned at the bottom of the page for another commentas in Figure 1b. However, when a user entered a comment,the textbox and the submission button appeared before that

Fig. 1. Output of PHP Page in Browser

Page index.php1<html><head>2 <script language=’javascript’ src=’ajax. js ’></script>3<link rel=”stylesheet” type=”text/css” href=”style .css” />4</head><body>5<div class=’out’>6<div class=’inImg’><img src=ASElogo.gif width=’40’ /></div>7<div class=’inPost’>ASE 2011<br>Submission is now open.</div>8</div>9<div id=’divComments’ class=’out’>

10<div class=’inComment’>Hung Nguyen: Great news!</div>11<!−− miss closing the div tag on line 9−−>12<div class=’out’>13 <input id=’txtComment’ type=’text’>14 <input type=’button’ value=’Comment’ onclick=’comment()’>15</div>16</body></html>

Fig. 2. HTML Client-side Code for Figure 1

newly input comment (see Figure 1c). Assume that this bugwas found and reported by a user on that page.

From the point of view of the developer of this Web-basedsocial network application, in order to understand and fix thisbug, (s)he would naturally first examine the HTML code ofthat Web page (Figure 2) to see if there was any error in its pre-sentation structure. (S)He could do this verification manuallyor use an automatic HTML validator such as Tidy [4]. Assumethat (s)he found that the code missed a closing tag </div> forthe opening tag <div> at line 9. Therefore, (s)he discoveredthat the bug was caused by the missing </div>: the last pagedivision (lines 12-15) is included within the page divisionstarting at line 9, making the textbox and button belong tothe same page division for the comment list. When the usersubmitted a comment, it was appended to the end of the pagedivision for comments and appeared below the textbox.

B. Challenges for Validation and Bug Fixing on Web Pages

The fix in the HTML code would be straightforward as thedeveloper should add a </div> closing tag at line 11 for thecorresponding open tag at line 9. However, that HTML pagewas dynamically generated from PHP-based server code (Fig-ure 3). With a PHP-based Web application, (s)he must locateand fix the corresponding buggy code in the server. Doing thistask manually is challenging in general due to several reasons.

1) The mapping/tracing from a client HTML page to server-side code is not straightforward. A Web application is a client-server one and generally developed in multiple languages. Theserver code could be written using a scripting language, e.g.PHP, while the client-side code is in HTML for presentationand JavaScript (JS) for data processing and event handling.

2) When the server-side code executes at the server, client-side code is generated and sent to a browser to execute there.That is, PHP-based server-side code dynamically producesdifferent HTML pages depending on different inputs. Forexample, depending on the login information of a user atrun-time, different files are included, different functions areexecuted, different execution paths in PHP code are taken inorder to generate a particular client page. In this motivatingexample, to fix the bug, the developer would start examiningthe file index.php on the server side (Figure 3a) because (s)hefound the error in the client page index.php. However, the bugis not within the file index.php of the server side. That PHPfile is responsible for checking if a user has logged in (line 3,Figure 3a) via is logged in function in the file functions.php (line2), which also contains other utility and formatting functions inthe system (Figure 3d). The file main.php (Figure 3b) containsthe code handling the cases of correct logins, while error.php(Figure 3c) handles the incorrect cases. In practice, validationerrors are found in a client page via an HTML validationtool [4] and reported without corresponding input and actionsteps to produce that page. Thus, to fix them, a developer mighthave to check many server-side files and execution paths tofind the right execution path that produces that client page.

3) Due to the dynamic nature of PHP/HTML/JS and thegeneration of client-side code, in a Web application, code anddata tend to be mixed, especially client-side code is oftenembedded in server-side data. For example, the code of thediv tags are embedded within PHP string literals. Moreover, apiece of HTML code might be generated via many PHP stringliterals, variables, and functions that are scattered in differentplaces in the server-side code. In this example, the <body>element of the main HTML page is generated from the valuesof several scattered literals, variables, and function calls. Tolocate the right place to fix the <div> tag, the developer needsto check several literals and differentiate between many tagswith the same name <div> that appear in several places inmain.php and functions.php. In this example, the developer mustdetermine that the error is in the addComments function infunctions.php (line 12, Figure 3d). In reality, the numbers ofincluded files, functions, variables, literals, and execution pathsmight be very high and they are scattered, thus making itchallenging for a developer to manually locate the bug.

4) In this example, the PHP statement that prints outthe erroneous HTML line is line 8 of Figure 3b: echo add-Comments(...). However, to fix that error, a developer in factmust change line 12 of Figure 3d ($output .= “\n”;) where theerroneous HTML line is composed and manipulated.

This example shows that HTML validation errors couldcause run-time bugs even when a browser can still display thepage. As a user submitted a comment by clicking the button,

a) File index.php1 <?php2 include ”functions.php”;3 if (! is logged in()) include ”error .php”;4 ?>5 <html><head>6 <script language=’javascript’ src=’ajax. js ’></script>7 <link rel=”stylesheet” type=”text /css” href=”style .css” />8 </head><body>9 <?php include ”main.php”; ?>

10 </body></html>

b) File main.php1 <?php2 // connect to the database to get the content of the post and its

comments3 // and store them to the variables $post and $comments, respectively4 echo ”<div class=’out’>” .5 ”\n ” . addImage(”ASElogo.gif”) .6 ”\n ” . addPost($post) .7 ”\n</div>\n”;8 echo addComments($comments);9 echo ”<div class=’out’>” .

10 ”\n <input id=’txtComment’ type=’text’>” .11 ”\n <input type=’button’ value=’Comment’ onclick=’comment()’>” .12 ”\n</div>\n”13 ?>

c) File error.php1 <html><body><?php2 $msg = ”User not logged in”;3 echo $msg;4 exit ;5 ?></body></html>

d) File functions.php1 <?php2 function is logged in() {...}3 function addImage($src) {...}4 function addPost($post) {return ”<div class=’inPost’>”.$post.”</div>”;}5 function addComment($comment) {6 return ”<div class=’inComment’>” . $comment . ”</div>”;7 }8 function addComments($comments) {9 $output = ”<div id=’divComments’ class=’out’>”;

10 foreach ($comments as $comment)11 $output .= ”\n ” . addComment($comment);12 $output .= ”\n”; // miss closing the div tag on line 913 return $output;14 }15 ?>

Fig. 3. PHP Server-side Code Example

the JS function comment (not shown) was invoked (line 11,Figure 3b). Due to the missing <div> tag, it incorrectly updatedthe corresponding division in the page via Ajax framework,thus, causing the incorrect page as in Figure 1c.

C. Approach Overview

We propose PhpSync, an auto-locating and fixing tool forvalidation errors in PHP-based Web applications. Given anHTML page produced by a server-side PHP program, PhpSyncuses Tidy [4] to find the validation errors on the page. If errorsare found, it propagates the fixes from Tidy on that HTMLpage to the corresponding location(s) in PHP code. The ideasare as follows: 1) PhpSync performs a symbolic execution toapproximately represent all possible client-side HTML outputsof a server page S with a single tree-based model, called D-

model D; 2) it maps the given HTML page C, i.e. a concreteHTML output, to the D-model, and then, to the server-sidecode S; 3) it uses Tidy to validate/fix the page C into a well-formed page and recovers the fixes applied to C; and 4) itfinally propagates these fixes to S via the mapping establishedbetween C and S (via D). Let us describe our approach.

III. D-MODEL: REPRESENTATION OF CLIENT PAGES

A. D-model RepresentationD-model is a tree-based representation for any symbolic,

string-based value resulted from a symbolic execution onany portion of server-side PHP code. The D-model for theentire PHP server page/function is composed by the D-modelsresulted from the intermediate computations during a symbolicexecution of the PHP expressions of that page/function. Thatis, PhpSync also creates D-models to represent possible valuesof intermediate computations and combines them into largerD-models for later computations. A D-model often containssymbols to represent user inputs, data retrieved from databases,or unresolved values. By performing a symbolic execution ona PHP page, PhpSync approximates all possible outputs/clientpages with a single D-model. Let us explain it in details.

First, the string outputs for a portion of PHP code arestream-like, i.e. are produced via sequential writing or con-catenation operations on PHP string values. The string valueT of a data-related PHP expression or the string value resultedfrom a string computation in PHP can be produced using thefollowing context-free production rules [5]:

Rule 1. T → tRule 2. T → T . TRule 3. T → T | T

Rule 1 says that, the value of a PHP expression can be astring literal. Rule 2 means that the value of a PHP expressioncan be concatenated from the values of two PHP expressions.Rule 3 specifies that a PHP expression can have either one oftwo values depending on the actual execution path at runtime.For example, in Figure 3a, the output of the page index.php isproduced using Rule 3 due to the if statement at line 3 (i.e. itis either one of two strings), while the string output at line 9of main.php is produced using Rule 2 (i.e. it is concatenated byfour strings). Both production processes use Rule 1. Rules 2and 3 are also used to repeatedly produce a value. For example,the value of variable $output of the function addComments infunctions.php is produced by repeatedly using a foreach loopvia Rule 2 (lines 10-11, Figure 3d). Those rules for outputproduction of PHP code suggest the following structure.

Definition 1: A D-model is a labeled, ordered tree, inwhich the leaf nodes represent the values, and the inner-nodesrepresent the operations for combining those values.

1. There are two kinds of leaf nodes:• A literal node represents a determined string value (e.g.

a PHP literal), and• A symbolic node represents an undetermined/unresolved

string value (e.g. a user input).2. There are three types of inner nodes, representing three

kinds of operations on D-models:

• A Concat node represents a value that is concatenatedfrom the values corresponding to the sub-trees of thatnode. The order of the sub-trees represents the order ofthe concatenation operation.

• A Select node represents a value that could be selectedfrom the values corresponding to its sub-trees.

• A Repeat node represents a value that could be repeat-edly concatenated from the values corresponding to thesub-trees of the only child node of that Repeat node.

3. The nodes on D-models have their attributes describingadditional information, such as the PHP expressions associatedwith literal and symbolic nodes.

Figure 4 illustrates a D-model that represents the output ofthe page index.php in Figure 3a. As seen, the root node of theD-model is a Select node, representing that the correspondingoutput of this PHP page is selected from the two valuesof two corresponding sub-trees of that root node. The leftand right subtrees correspond to the outputs if error.php ormain.php is executed, respectively. The root node of the rightsubtree is a Concat node representing the concatenation of thevalues of multiple literals (represented as literal nodes), e.g. thestring literal “</div></div>”, the variables $post and $comment(represented as symbolic nodes), and the return values fromdifferent function calls. The return value of function calladdComments is represented as the D-model rooted at thesecond Concat node, with its child node Repeat representingthe repetition in the foreach loop. Consecutive string literalsare combined for a compact D-model representation.

Note that a D-model approximates all possible symbolicoutputs of PHP code by symbolically executing all of its exec-ution paths. However, it does not represent all possible paths.

B. Building D-model via Symbolic Execution

We develop an algorithm to evaluate/compute the symbolicvalue for the output of any PHP code by building its D-model. It takes as an input the code of a PHP server page,and performs a symbolic execution to create a D-model for aspecial variable $Output, to represent the output of that page.During execution, it creates the D-models for the intermediateresults and updates the D-models for encountered variables.

The algorithm recursively evaluates all statements in allbranches, updates/creates small D-models, and combines theminto larger ones. It processes the PHP statements as follows:

1. E → scalarValue: As a scalar/string value is encountered,a literal node is created to contain the corresponding string.

2. E1 → $V = E2: Since a variable might have differentvalues at different points in execution, PhpSync maintains foreach variable V a D-model corresponding to its most recentvalue during the execution. When meeting an assignmentexpression, PhpSync computes the D-model for the expressionE2, and assigns that D-model as the most recent value of V.

3. E → $V: When a variable V is retrieved for a computation,its latest D-model is used. However, if V does not have anyD-model, PhpSync returns a symbolic node representing anundetermined value. This corresponds to the cases of user in-puts, data values from databases, or unresolved computations.

Repeat<div id='divComments'

class='out'>

Concat

<html><head>

<script language='javascript' src='ajax.js'></script>

<link rel="stylesheet" type="text/css" href="style.css" />

</head><body>

<div class='out'>

<div class='inImg'><img src=ASElogo.gif width='40' /></div>

<div class='inPost'>

</div>

</div>$post

Select

<html><body>

User not logged in

</body></html>

Concat

<div class='inComment'> </div>$comment

<div class='out'>

<input id='txtComment' type='text'>

<input type='button'

value='Comment' onclick='comment()'>

</div>

</body></html>

$Output

Concat

Fig. 4. D-model Representation for the Outputs of the PHP Page index.php of Figure 3a.

4. E1 → E2.E3: For an expression with a concatenation,PhpSync processes the sub-expressions to produce their D-models, and then creates the resulting D-model with its rootnode being a Concat node. The sub-trees of that root node arethe computed D-models of the sub-expressions. Those subtreesare connected in the same order as the appearance order of thecorresponding sub-expressions. PhpSync also performs otherstandard string and arithmetic operations in a similar process.Un-resolved results are represented as symbolic nodes.

5. S → echo E: When seeing an echo/print statement, PhpSyncconcatenates the current D-model of the variable $Output andthe D-model of E to produce the new D-model for $Output.Note that $Output holds the current output of the PHP page.

6. S1 → if (E) S2 else S3: For an if statement, PhpSyncexecutes both branches, and collects into a set V* all variables Vmodified in either branch. Let us use VS2.D and VS3.D to denotethe D-models of V after executing each branch, respectively.For each V in V*, PhpSync updates its value with a new D-model. The new D-model is rooted at a new Select node whosechildren are VS2.D and VS3.D. If the else branch is empty, thelatest D-model for V before the if statement is used in placeof VS3.D. The same treatment is for Switch statements.

7. S1 → while (E) S2: First, PhpSync executes statement S2once and collects all modified variables V into V*. Typically,the string value of a variable is appended during the executionof a loop. Let us use DV to denote the D-model that representsthe symbolic string value appended to V. For a variable V in V*,PhpSync updates its value with a new D-model. The new D-model is rooted at a new Concat node whose children are V.Dand a new Repeat node (Figure 4). The Repeat node has DV

as its only child. If the value of V is not appended in the loop,PhpSync currently does not handle it and retains the old valueof V before the loop. The same treatment is for a for statement.

8. S → return E: When PhpSync meets a return statement, theD-model of E is computed and collected into a set retValues ofall possible returned values of the current function/file.

9. function call. When a function is called, PhpSync assignsthe D-models of the actual arguments to the formal parametersof the function, and then performs a symbolic execution on the

TABLE ISYMBOLIC EXECUTION RULES ON PHP CODE TO BUILD D-MODELS

PHP Syntax Evaluation Rule To Build D-model

E → scalarValue E.D = new LiteralNode(scalarValue)

E1 → $V = E2 V.D = E2.D, E1.D = E2.D

E → $V if V.D <> null then E.D = V.Delse E.D = new SymbolicNode($V)

E1 → E2.E3 E1.D = new Concat(E2.D, E3.D)

S → echo E $Output.D = new Concat($Output.D, E.D)

S1 → if (E) S2 ∀V∈V*, V.D = new Select(VS2.D, VS3.D)else S3

S1 → while (E) S2 ∀V∈V*, V.D = new Concat(V.D, new Repeat(DV ))

S → return E cur func.retValues = cur func.retValues ∪ E.Dor cur file.retValues = cur file.retValues ∪ E.D

E → func({argi}) func.retValues = ∅, ∀i func.parami.D = argi.D,execute func, E.D = new Select(func.retValues)

E1 → include E2 file = computeValue(E2.D), file.retValues = ∅,execute file, E1.D = new Select(file.retValues)

E → exit() cur prog.outputValues =cur prog.outputValues ∪ $Output.D

prog → {Si} prog.outputValues = ∅, execute {Si},prog.outputValues = prog.outputValues ∪ $Output.D$Output.D = new Select(prog.outputValues)

function’s code. After executing the function, it creates a newD-model with its root being a new Select node to describe thepossibly multiple returned values of the function. The childrenof that Select node are the D-models in the retValues set of thefunction. If the function has only one returned value, the D-model of that returned value is used. If global variables andreference parameters are modified during the execution of thefunction, their D-models also updated accordingly. If the codeof the called function is unavailable (e.g. library functions), itrepresents the returned value by a symbolic node.

10. E1 → include E2: PhpSync computes the string valuefrom the D-model of E2 and considers it as a file name fname.Then, it continues the execution on that file. Finally, the D-

model of E1 is assigned with a new D-model whose root is ata new Select node with its children being all returned valuesafter executing fname as in the case of a function call.

11. exit(): If PhpSync meets an exit function call, the D-modelof $Output is collected into outputValues set of the current page.

12. block of statements: After executing all statements in thePHP program/page, PhpSync creates a new D-model with itsroot being a new Select node to describe the possibly multipleoutputs of the page. The children of that Select node are theD-models in the set outputValues of the page.

While building the D-models, PhpSync also keeps the map-ping between the D-model leaf nodes and their correspondingPHP fragments. For example, the literal node <div class=...>under the lowest Concat node in Figure 4 is mapped to thefragment <div class=...> on line 6 of Figure 3d. For the mappingof a symbolic node, PhpSync also keeps its execution trace.For example, the node $post of Figure 4 is mapped to line 4 ofFigure 3d (inside the function’s body), and the trace includesline 4 of Figure 3d, line 6 of Figure 3b, and lines 2-3 ofFigure 3b. That trace is useful for developers in examiningthe output corresponding to $post (i.e. line 7 of Figure 2).

The limitation of PhpSync lies in the approximation of thesymbolic executions of if and for/while statements. The conditionof an if is not evaluated and only string-appending operationson variables are handled in a loop. PhpSync also does not han-dle well library function calls if the source code is unavailable.

IV. CSMAP: MAPPING TEXTS OF CLIENT PAGE TOSERVER PAGE VIA D-MODEL

Let us present CSMap algorithm that maps any text in anHTML page to the corresponding location in a server page. Ittakes as inputs an D-model D and a string C, divides C intoproper sub-strings and maps them to the corresponding literalor symbolic nodes in D, and then to PHP literals or variables.

A. Algorithm Design Strategies

A D-model D for a server page can be considered as acontext-free grammar (CFG) and a string C is one of itsconcrete sentences. However, the traditional CFG parsing/-compiling techniques [6] are not suitable and efficient herebecause the D-model always contains multiple symbols (i.e.symbolic values) that correspond to user inputs, etc. Therefore,we design CSMap with the following heuristic strategies:

1. Top-down and divide-and-conquer: with the goal of map-ping texts to the leaf nodes in D, it is natural to perform themapping of the substrings in C to the sub-trees in D. CSMapfollows the top-down process as in top-down parsers [6].

2. Pivoting: despite that the HTML pages are dynamicallygenerated, the shared/static HTML code portions among (someof) those outputs of a PHP page occur very often. CSMapattempts to map the string C to these shared code portionsin D first, and then uses them as the already-mapped pivotsfor further dividing and conquering. That is, the process willcontinue on the substrings of C divided by those pivots.

3. Local best-matching: Since there may exist many selec-tion nodes, CSMap could face the combinatorial explosion if it

1 function CSMap (C, D)2 StrToDModel(C, r← D.root)3 end4 // −−−−−−−−− Handling Literal Nodes −−−−−−−−−−−−−−−−5 function StrToDModel(String str, LiteralNode literal )6 substring ← str.FindFirstOccurence(literal.val)7 if (substring is found)8 substring.MapLocation← literal9 end

10 // −−−−−−−−− Handling Concat Nodes −−−−−−−−−−−−−−−11 function StrToDModel(String str , Concat concat)12 if (concat.numChildren == ∅) return13 if (concat.numChildren == 1) StrToDModel(str, concat.firstChild); return;14 Pivot = FindPivot( str , concat.children)15 if (Pivot <> null)16 str . Split (Pivot , firstSubStr , secondSubStr)17 concat.Split (Pivot , firstHalfNodes, secondHalfNodes)18 StrToDModel(firstSubStr, firstHalfNodes)19 StrToDModel(secondSubStr, secondHalfNodes)20 else21 StrToDModel(str, concat. firstChild )22 StrToDModel(str.GetUnmapped(), concat.removeFirstChild())23 end24 function FindPivot(String str , DModelList list )25 list .RetainOnlyLiteralDModels()26 for (dmodel ∈ list)27 count = FindOccurrences(str, dmodel.stringVal)28 if (count == 1) return dmodel29 end30 return null31 end32 // −−−−−−−−− Handling Symbolic Nodes −−−−−−−−−−−−−−33 function StrToDModel(String str, SymbolicNode node)34 Siblings ← node.Parent.ChildNodes35 if (node.GetRightSibling(Siblings) is a Pivot)36 str .MapLocation← node37 end38 // −−−−−−−−− Handling Select Nodes −−−−−−−−−−−−−−−−39 function StrToDModel(String str, Select select)40 TString ← FString← str41 StrToDModel(TString, select.trueBranch)42 StrToDModel(FString, select.FalseBranch)43 if (TString.MappedLength > FString.MappedLength)44 str .MapLocation← TString.MapLocation45 else str .MapLocation← FString.MapLocation46 end47 // −−−−−−−−− Handling Repeat Nodes −−−−−−−−−−−−−−−−48 function StrToDModel(String str, Repeat repeatNode)49 Before← str.MappedLength50 StrToDModel(str, repeatNode.ChildNode)51 After ← str.MappedLength52 if (Before < After)53 StrToDModel(str.GetUnmapped(), repeatNode)54 end

Fig. 5. CSMap Algorithm: Mapping from HTML page to D-model

tries to exhaustively explore all combinations of their branchesand perform optimal matching. Thus, for a selection node,CSMap uses a local best-matching strategy by first exploringall branches of the selection node and mapping to the branchwith more matched characters. This choice is made locally foreach selection without considering globally optimal matching.

B. Detailed Algorithm

Figure 5 shows the pseudo-code for CSMap algorithm. Itis designed as the recursive function StrToDModel whose inputsare a string C and the root node r of a D-model. There arefive overloading functions StrToDModel corresponding to fivetypes of D-model nodes. During the execution, the attributeMapLocation of each substring in C is assigned with at mostone reference to a node in the D-model (i.e. its mapped node).CSMap handles each of the five node types as follows:

1. If r is a literal node, r has a value val. If val appears instr (i.e. is its substring), then the characters of that substringare mapped to r. However, since str might have several occu-rrences of val, by a greedy strategy, CSMap maps the first occu-rrence of val in str to r, i.e. favoring the leftmost mapped string.

2. If r is a Concat node, CSMap considers str as aconcatenation of the values corresponding to the sub-treesof r. To find the optimal mapping, one might need to dividestr into all possible sub-strings and map each of them to thecorresponding sub-tree of r. However, to simplify the divide-and-conquer step, CSMap uses the pivoting strategy. It finds apivot by checking the string of a literal node among the sub-trees of r to see if it occurs only once in str. If such a pivotexists, it is used to divide str into two sub-strings, and thelist of child nodes of r into two sub-lists rooted at two newConcat nodes for further mapping (lines 16-19). If such a nodedoes not exist, CSMap maps str to the first subtree of r andrecursively maps the remaining texts in str (after the already-mapped portions) to the other subtrees of r (lines 21-22).

3. If r is a symbolic node, CSMap checks whether thesibling node of r is a pivot. If it is, CSMap considers the stringstr as the value generated from r, thus, maps all characters ofstr to r. If a pivot does not exist, CSMap does not map str tor because it tries to map str with other sibling nodes of r.

4. If r is a Select node, str is considered to be producedfrom one of the D-models corresponding to the sub-trees ofr. Thus, CSMap recursively maps str to each sub-tree of r,and chooses the sub-tree with the higher number of mappedcharacters as the mapping for str (lines 43-45).

5. If r is a Repeat node, C is considered as the concate-nation of the values produced by the sub-trees of D aftersome number of iterations. CSMap attempts to map str to thechild node of r, which represents the appendix string in oneiteration. It will continue to map the remaining of str until nomore mapping is gained (lines 52-53).

Finally, after determining the mapping between the client-page C and the D-model D via CSMap, PhpSync uses themapping from D to PHP code established during building Dto map the texts in C to PHP literals, variables, or statements.Example. Let us revisit the example in Figure 2 with theD-model in Figure 4 to illustrate CSMap. CSMap starts bymapping the entire HTML page to the D-model rooted at aSelect node. For a Select node, CSMap first attempts to mapthe code to each branch separately. In this case, the first branchis the string “... User not logged in ...”, which does not exist in theHTML code, thus it remains unmapped. The second branch,however, starts at a Concat node with several pivot nodes thatare helpful for the mapping. In particular, its first, third, andfifth child nodes are string literals that occur exactly once inthe HTML code, hence CSMap maps the corresponding sub-strings in the HTML code to those literal nodes.

The remaining sub-strings “ASE 2011<br>Submission is nowopen.” (line 7) and lines 9-10 of Figure 2, are mapped re-spectively to the remaining child nodes (i.e. the symbolicnode $post and the next Concat node). For that Concatnode, CSMap again finds that its first child node (“<div

id=’divComments’...”) corresponding to line 9 (Figure 2) is a pivot.Therefore, it maps the remaining substring on line 10 to theRepeat node (Figure 4). For a Repeat node, its containedD-model rooted at the child node Concat is mapped to thesubstring repeatedly until no further mapping is found. In thisexample, the substring can be mapped to the two literal nodesand the symbolic node $comment after one iteration. Even ifline 10 were repeated several times, CSMap would still mapthe text to the D-model with the Repeat node.

At this point, CSMap has evaluated both branches of theSelect node at the root of the top-level D-model. Comparingthe mapping results, it returns the mapping given by the secondbranch where all of the HTML code is successfully mapped.

Since CSMap works heuristically, it is important that themapping is done correctly in the top-level steps of the divide-and-conquer stack. A client page typically contains largechunks of texts that are likely to remain unchanged fordifferent executions of the server page. This nature of clientpages makes it likely for CSMap to find correct pivots in theearly mappings. Incorrect mappings may occur at a later stageof the execution but produce less impact on the overall resultsince remaining texts to be mapped get much smaller.

V. AUTO-LOCATING AND FIX-PROPAGATING TO PHPThis section describes how PhpSync helps in auto-locating

and fix-propagating for the validation errors to PHP code. Theinputs include a given HTML page C produced by a PHPpage S. PhpSync uses Tidy [4], an HTML validator/corrector,to check C for HTML validation errors. If errors are found,it uses Tidy to produce the corrected version C ′ of C.Auto-Locating. There exist the cases in which Tidy is notable to provide the fixes [4]; however, it points out the buggylocations in the HTML page C. In such cases, for each errorlocation in C, PhpSync uses CSMap to automatically locatethe corresponding literal node(s) in the D-model of S and thenlocate the PHP literal(s) in S. For example, via CSMap, the<div> opening tag on line 9 of Figure 2 is mapped to the firstliteral node of the second Concat in Figure 4, therefore, iscorrectly traced back to line 9 of Figure 3d.Fix-Propagating. If Tidy can fix those errors, PhpSync willpropagate those fixes through the mapping between S andC established by CSMap. Because Tidy does not provide theoperations of the fixes but produces only the corrected versionC ′, we developed CCMap algorithm to map the texts betweenC and C ′ to derive the fixing changes. The output of thealgorithm is all the changes at the character level between Cand C ′, which are then used to propagate to the server code.We design CCMap with three strategies:

1. Token-based processing: CCMap treats the client code Cas a sequence of tokens, instead of syntactic units because Cmight not be fully parsable due to its validation errors.

2. Divide-and-conquer: Due to the nature of validationerrors (missing closing tags, missing tag brackets, invalid tags,etc.), the fixes from Tidy leave the majority of C un-changed,i.e., C and C ′ share similar texts. CCMap maps the unchangedportions in C and C ′, and uses them as pivots as in CSMap.

1 function CCMap(C, C’)2 T = Tokenize(C, D) // D is the set of delimiters3 T’ = Tokenize(C’, D) // D={’<’,’>’,’ ’ , ’\r ’ , ’\n’, ’\t ’ , ’\f ’ , ’ ; ’ , ’=’}4 LCS Exact(T, T’)5 for each two successive already mapped elements T[l] and T[r]6 l ’ = T[ l ]. map r ’ = T[r ]. map // use mapped elements as pivots7 LCS Sim(T[l+1..r−1], T’[l ’ +1.. r ’−1], 0.8) // map similar elements8 for each mapped pair of tokens T[i] and T’[ i ’ ]9 LCS Exact(T[i], T’[ i ’ ]) // map characters in tokens

10 for each two successive already mapped characters C[l] and C[r]11 l ’ = C[l ]. map r ’ = C[r ]. map // use mapped characters as pivots12 LCS Exact(C[l+1..r−1, C’[l’+1..r ’−1]) // map the delimiters only13 function LCS Sim(T, T’, σ)14 P = array [0.. T.length, 0..T’ . length]15 for i = 1 to T.length16 for j = 1 to T’ . length17 if (sim(T[i ]. value, T’ [ j ]. value) ≥ σ)18 P[i ][ j ]. score = P[i ][ j ]. score + sim(T[i ]. value, T’ [ j ]. value);19 P[i ][ j ]. trace = ‘‘ LU’’20 else ... // Standard LCS algorithm

Fig. 6. CCMap Algorithm: Deriving Fixes from Tidy

3. Similar-matching: to capture the replacement operationsbetween C and C ′, we modify the standard longest commonsub-sequence (LCS) to support the mapping of similar texts.

The pseudo-code of CCMap is in Figure 6. It first tokenizesC and C ′ into two sequences of tokens T and T ′, respectively(lines 2-3). Then, it uses the standard LCS algorithm LCS Exactto find the pivot tokens (line 4). For each two successivemapped tokens T [l] and T [r], l < r, and their correspondingmapped tokens T ′[l′] and T ′[r′], it uses LCS Sim to findthe similar not-yet-mapped tokens in the aligned sequencesT [l+1..r−1] and T ′[l′+1..r′−1] (lines 5-7). LCS Sim, at lines13-20, is almost the same as the standard LCS except the wayit compares the elements between two sequences (line 17). InLCS Sim, two elements can be mapped if their string similarityexceeds a threshold σ. Function sim(.) measures the similarityof two strings by the ratio between the length of their LCS andtheir average length. For each mapped pair of token T [i] andT ′[i′] (both exact and similar ones), CCMap runs LCS Exacton the two sequences of characters to map the correspondingcharacters (lines 8-9). To map all the characters in the code,it then translates the mapping results of the characters in thetokens in T and T ′ to the characters in C and C ′. It maps thepreviously-removed delimiters in C and C ′ using LCS Exacttaking already-mapped characters as pivots (lines 10-12).

All mapped characters are considered as unchanged. Theun-mapped ones in C and C ′ are considered as deleted andadded, respectively. Finally, from those derived changes to C,PhpSync finds the corresponding D-model’s literal nodes andthen applies them to the corresponding PHP string literals inS. For example, PhpSync uses CCMap to detect that </div> isadded by Tidy at line 11 of Figure 2, i.e. is inserted after the“\n” character between lines 10 and 11. That string is mappedby CSMap to the literal “\n” at line 12 of Figure 3d. Thus,PhpSync will make the change at line 12: “$output .= \n</div>”;.

VI. EMPIRICAL EVALUATION

This section presents our empirical evaluation on PhpSync.Our research questions are 1) how accurately PhpSync maps

TABLE IISUBJECT SYSTEMS AND D-MODELS

Subject Systems D-models

Name Files KLOCs ExFiles Nodes Control Time(s)

SchoolMate-1.5.4 63 8 60 5357 885 0.6TimeClock-1.4 69 23 7 886 101 0.1WebERP-4.0.2 654 220 16 59338 14841 2.8UPB-2.2.7 400 105 6 8383 1621 0.7AddressBook-6.2.12 103 19 10 1789 320 0.3Manhali-1.3.2 299 52 17 7952 2353 0.6

TABLE IIIMAPPING AND FIXING RESULT ON SCHOOLMATE 1.5.4

# Navi Steps Mapping Fix-Propagating

Fragments Characters (×1000) Err. Tidy PSAll Auto Man Corr. All Corr. Acc.

1 Login 15 12 3 15 7.0 7.0 100% 38 4 42 School 42 27 15 42 11.8 11.8 100% 50 19 193 Terms 45 31 14 43 12.2 12.1 99% 52 22 224 Semesters 51 35 16 49 12.5 12.4 99% 50 20 205 Classes 96 64 32 96 12.9 12.9 100% 57 27 276 Users 87 58 29 85 13.1 12.9 99% 64 34 347 Teachers 56 38 18 56 12.0 12.0 100% 50 20 208 Students 105 68 37 105 13.0 13.0 100% 50 20 209 Registration 102 69 33 102 12.4 12.4 100% 49 19 19

10 Attendance 73 50 23 72 11.8 11.6 99% 50 20 2011 Parents 68 44 24 68 12.1 12.1 100% 50 20 2012 Announce 45 31 14 43 12.3 12.2 99% 50 20 2013 Terms/Add 19 15 4 19 10.3 10.3 100% 47 18 1814 Terms/Edit 27 19 8 27 10.0 10.0 100% 47 18 1815 Sem./Add 30 22 8 30 10.3 10.3 100% 47 18 1816 Sem./Edit 43 30 13 43 10.4 10.4 100% 47 18 1817 Classes/Add 47 32 15 47 11.0 11.0 100% 47 18 1818 Classes/Edit 42 30 12 40 10.9 10.7 98% 47 18 1819 Classes/Grid 22 17 5 22 9.8 9.8 100% 53 20 2020 Users/Add 19 15 4 19 10.6 10.6 100% 48 19 1921 Users/Edit 26 18 8 25 10.6 10.3 98% 48 19 19

1060 725 335 1048 236.9 235.8 99.5% 1041 411 411

HTML code to server code, and 2) how accurately it propa-gates the fixes from Tidy to server code. All experiments werecarried out on a Windows 7 Home Premium 64-bit computerwith CPU Intel Core i3-370M 2.40 GHz and 6GB RAM.

We collected six PHP systems from sourceforge.net in dif-ferent sizes and domains (Table II). We read the code togain the knowledge and set up those systems on our serverwith required databases and sample data. For each system, weselected multiple server pages for testing and built their D-models. Column ExFiles shows the average number of executedserver files for a page. Columns Nodes and Control show theaverage number of all nodes and that of control nodes (Se-lect/Repeat) in a D-model. Running time is in column Time.

A. Accuracy of Mapping Client Code and Server Code

To evaluate PhpSync’s accuracy in mapping the texts inHTML to PHP code, we first collected the HTML test pagesfrom the subject systems by navigating through several HTMLpages within that system on a Web browser. We recorded eachpage as an HTML test page by saving its corresponding HTMLcode and the navigation steps to get to that page (for laterreproducing the page and checking). For each subject system,we selected the HTML pages with different presentations tohave the samples of client pages with diverse page structures.

TABLE IVACCURACY OF MAPPING AND FIX-PROPAGATING ON ALL SUBJECT SYSTEMS

Mapping Fix-Propagating

System Test Fragments Characters (×1000) Complexity Err. Tidy PhpSync Miss Acc. Time (s)Pages All Auto Man Corr. All Corr. Acc. Files Time (s)

SchoolMate-1.5.4 21 1060 725 335 1048 236.9 235.8 99.5% 6 4.4 1041 411 411 0 100.0% < 0.2TimeClock-1.4 14 2511 2103 408 2484 164.1 162.7 99.1% 5 1.2 422 136 136 0 100.0% < 0.2WebERP-4.0.2 10 2736 1910 826 2564 78.5 75.8 96.7% 5 8.0 284 188 176 12 93.6% < 1.0UPB-2.2.7 10 1234 866 368 1191 56.7 53.9 95.0% 4 0.6 129 49 47 2 95.9% < 0.2AddressBook-6.2.12 10 1853 1341 512 1841 78.9 77.5 98.1% 8 0.3 64 51 48 3 94.1% < 0.2Manhali-1.3.2 10 3717 2616 1101 3610 115.1 105.9 92.0% 9 3.8 607 189 164 25 86.8% < 0.2

Our evaluation method is to use PhpSync to map everycharacter in an HTML test page C to the correspondingcharacter in a PHP literal or PHP variable, and then to verifythose mappings for all characters by the combination of achecking tool and human subjects. Remind that given anHTML test page, PhpSync divides its HTML contents intoseveral text fragments and maps each fragment into the PHPliterals/variables (Section 4). Because all of those fragmentscover the entire HTML test page, to verify PhpSync’s mappingfor each character, one can check the mapping for eachof those fragments (called test fragments). The unmappedfragments are considered to have incorrect mappings.

To reduce the effort of manual verification from human sub-jects, we wrote an evaluation program that checks PhpSync’smapping from every test fragment f of the test page to a PHPliteral l. If f is mapped to a PHP variable, we examine themapping manually. Otherwise, that program replaces only thefirst character in the literal l in the PHP code S with a specialcharacter (SC) that does not appear in the page C. We thenexecuted the instrumented PHP code S′ and followed the samerecorded navigation steps to produce the new HTML page C ′.If in C ′, the first character position in f is replaced with thatSC and all other positions in C ′ are un-changed, we considerit as a correct mapping for that character. Moreover, in sucha case of correct mapping for that character, if f is exactlyidentical to l, we consider the mapping (f → l) correct forall characters in the fragment f , and consider f as a correctlymapped fragment. When other positions in C ′ besides f havebeen changed, the evaluation tool cannot conclude that themapping is incorrect. For instance, there may exist a correctmapping from some client code to a PHP literal inside afor/while loop. When the client code C ′ is produced, the SCcharacter may appear multiple times in C ′ due to the executionof the loop. Thus, in all other cases, we manually verified themapping from f to l by understanding the program semantics.

In Table III, column Mapping shows the result on SchoolMatev1.5.4. We collected a total of 21 HTML test pages. In columnFragments, the sub-columns All, Auto, Man, and Corr. respectivelyshow the number of all test fragments in the test page, thenumbers of auto-evaluated, manually-evaluated, and correctlymapped fragments. In column Characters, the sub-columns Alland Corr. show the numbers of all characters and correctlymapped ones in a test page. Acc. shows accuracy, i.e. the ratioof the number of correctly mapped characters over the total.

Column Mapping of Table IV shows the results for all subjectsystems. Processing time is in column Time. As seen, PhpSyncachieves very high accuracy (an average of 96.7%) in charactermapping with a small processing time (an average of 3 secondsfor a test page of about 10,000 characters). Column Files showsus that on average a test page is produced by 6 PHP files.Thus, our tool could help reduce developers’ effort in findingthe PHP locations for a given HTML text.

B. Accuracy of Fix-Propagating to Server Code

We used the same set of HTML test pages in those systemsfor an experiment to evaluate PhpSync’s accuracy in fixpropagation. For each test page C, we used Tidy to detectvalidation errors. If errors were found and Tidy was able tofix the page into C∗, PhpSync would be used to derive thefixing changes between C and C∗ and propagate them to fixthe PHP code S into S∗. Then, we executed the fixed PHPcode S∗ and followed the same recorded navigation steps toproduce the new HTML page C+. Tidy was used to check onC+ for validation errors again. After that, the lists of errorsthat Tidy had fixed (C → C∗) and PhpSync had fixed viafix-propagation (C → C+) were automatically compared todetermine how well PhpSync propagated those fixes. Accuracyis measured as the ratio between the number of correctlypropagated fixes over the total propagated fixes. For the casesthat Tidy uncovered validation errors but could not fix, onecould use CSMap to auto-locate the erroneous PHP code. Thequality of such mapping was evaluated as in Section VI.A.

The columns Fix-Propagating in Tables III and IV display thefix-propagation results. Columns Err., Tidy, and PS show thenumber of total HTML validation errors found by Tidy, that oferrors fixed by Tidy, and that of errors fixed by PhpSync viafix-propagation. As shown, PhpSync achieves high accuracy(an average of 95%) in fix propagation with small processingtime. Importantly, it did not introduce any new validation error.

Threats to Validity. Our experiments were on only 6 systemswith 74 test pages. The selected systems and test pages mightnot be representative. However, the number of test fragments isvery large (13,111), of which 3,550 were manually checked in15 hours. During that process, human errors could occur. Cur-rently, PhpSync does not completely handle object-orientedPHP, thus most of the selected systems do not contain manyclasses. Four out of six systems have only reasonable sizes anddo not contain many loops for complex computational logics.

VII. RELATED WORK

Artzi et al. [7] introduced Apollo, a method to find bugsin Web applications by combining concrete and symbolicexecution. It executes a Web application on an initial empty orrandomly-chosen input. Additional inputs are derived by solv-ing path constraints and conditions extracted from exercisedcontrol flow paths [7]. Failures during such executions arereported as bugs. In [8], they extended Apollo to also modelinteractive user inputs in a Web application. However, it doesnot pinpoint the buggy PHP statements that cause such errors.

To support such fault localization, in [9], they combined avariation of Tarantula [10] with the use of a dynamic outputmapping technique. For each statement, Tarantula associatesit with a suspiciousness rating that indicates the likelihoodfor the statement to contribute to a fault. The rating iscomputed based on the percentages of passing and failing teststhat execute that statement. However, they reported that in aWeb application, a significant number of statements/lines areexecuted in both cases, or only in failing executions. Thus, theycombined Tarantula with a dynamic output mapping technique,which instruments a shadow interpreter to create a mappingbetween the lines in PHP and HTML code by recording theline number of the originating PHP statement whenever outputis written out using the echo and print statements [9].

In comparison, while their output mapping technique isbased on dynamic analysis with run-time instrumentation intoan interpreter, PhpSync relies on symbolic execution. Theirtechnique is lightweight, however, PhpSync is better suitedfor this auto-locating and fix-propagating problem. First, foran erroneous HTML line detected by an HTML validator,their tool will map it to the PHP statement responsible forprinting it out. However, that PHP print/echo statement mightnot always be the line that needs to be fixed because theerroneous content of the HTML line might be composedand manipulated in string variables in previous statement(s)(see the motivating example in Section II.B.4). Second, inpractice, validation errors could be found in a client pagevia Tidy and reported without corresponding input and actionsteps to produce that page. Thus, their dynamic mappingtechnique cannot be applied. In this case, a fixer can stilluse PhpSync to fix the errors. Finally, for fix-propagation,PhpSync performs mapping at the character level, while forthe debugging purpose, their tool maps at the line level.

Tidy [4], an HTML validator/corrector, works mostly onstatic HTML pages. For PHP code, it filters all the codewithin a ’<?php’ and the corresponding ’?>’ and considers theremaining as HTML code. That scheme does not work wellbecause HTML code is embedded within multiple scatteredPHP literals and variables (see Section II). Similar to Tidy,other validating tools [11], [12], [2] are limited to supportvalidating or correcting only client pages in XML/HTML/CSS.

Minamide’s string analyzer [5] takes a PHP program anda regular expression describing all of its possible inputs,and then statically approximates and validates the outputvia a context-free grammar. In comparison, his goal is to

validate approximated HTML outputs from a PHP programwithout fixing support. Moreover, PhpSync performs symbolicexecution requiring an input specification as in Minamide’s.

Using a string analyzer, Wang et al. [13] compute the ap-proximated output of a PHP program and identify the constantstrings visible from the browser for translation purpose.

Several string-taint analysis techniques were built for soft-ware security problems [14], [15], [16]. Gould et al. [17] usestring analysis to guarantee well-typed SQL queries generatedby a Java program. The type system in [18] is based on regularexpressions with string concatenation and pattern matching. ACFG-based type system for string analysis is presented in [19].PhpSync complements to PHP debuggers [20], however it doesnot need the inputs of PHP programs with symbolic execution.

VIII. CONCLUSIONS

We propose PhpSync, an auto-locating and fix-propagatingtool for validation errors. Given an HTML page produced byPHP code, PhpSync uses Tidy to find its validation errors, andpropagates Tidy’s fixes to PHP code. Our core solutions in-clude a symbolic execution algorithm on PHP code to producean D-model, which approximates all possible client pages, andthe client-server mapping and fix-propagating algorithms. Ourevaluation shows that it achieves high accuracy in both tasks.

ACKNOWLEDGMENTS

This project is funded in part by NSF CCF-1018600 grant.The first author was funded in part by a fellowship fromVietnamese Education Foundation (VEF).

REFERENCES

[1] “World Wide Web Consortium,” http://www.w3.org/, W3C.[2] “W3C Markup Validation Service,” http://validator.w3.org/, W3C.[3] “Why Validate,” http://validator.w3.org/docs/why.html, W3C.[4] “HTML Tidy Project,” http://tidy.sourceforge.net/, Source Forge.[5] Y. Minamide, “Static approximation of dynamically generated Web

pages”. WWW’05: Int. Conference on World Wide Web. ACM, 2005.[6] A. Aho, J. D. Ullman, and M. S. Lam, Compilers: Principles, Tech-

niques, and Tools. Pearson Education Inc., 2006.[7] S. Artzi, A. Kiezun, J. Dolby, F. Tip, D. Dig, A. Paradkar, and M. Ernst.

Finding bugs in dynamic Web applications. In ISSTA, pp 261-272, 2008.[8] S. Artzi, A. Kiezun, J. Dolby, F. Tip, D. Dig, A. Paradkar, and M. Ernst.

“Finding bugs in Web applications using dynamic test generation andexplicit-state model checking”. IEEE TSE, 36(4): 474-494. July, 2010.

[9] S. Artzi, J. Dolby, F. Tip, and M. Pistoia. “Practical fault localizationfor dynamic Web applications”. In ICSE’10, pp. 265-274. ACM, 2010.

[10] J. Jones and M. Harrold. “Empirical evaluation of the Tarantula auto-matic fault-localization technique”. In ASE’05, pp 273-282. ACM, 2005.

[11] “WDG HTML Validator,” http://htmlhelp.com/tools/validator/, WDG.[12] “CSE HTML Validator,” http://www.htmlvalidator.com/.[13] X. Wang, L. Zhang, T. Xie, H. Mei, J. Sun, “Locating need-to-translate

constant strings in Web applications”. FSE’10, pp. 87–96. ACM, 2010.[14] G. Wassermann and Z. Su, “Static detection of cross-site scripting

vulnerabilities”. In ICSE’08, pp. 171–180. ACM Press, 2008.[15] Y. Xie and A. Aiken, “Static detection of security vulnerabilities in

scripting languages”. In USENIX Security Symposium - Volume 15, 2006.[16] A. Kieyzun, P. J. Guo, K. Jayaraman, M. D. Ernst, “Automatic creation

of SQL injection and cross-site scripting attacks”. ICSE’09, IEEE CS.[17] C. Gould, Z. Su, and P. Devanbu, “Static checking of dynamically

generated queries in database applications”. ICSE’04, IEEE CS, 2004.[18] N. Tabuchi, E. Sumii, A. Yonezawa, “Regular expression types for

strings in a text processing language”. In Types in Programming, 2002.[19] P. Thiemann, “Grammar-based analysis of string expressions”. In ACM

workshop on Types in languages design and implementation ACM, 2005.[20] “DBG PHP Debugger,” http://www.php-debugger.com/dbg/.