
Enriching Documents with Examples: A Corpus Mining Approach

JINHAN KIM, SANGHOON LEE, and SEUNG-WON HWANG, Pohang University of Science and Technology
SUNGHUN KIM, Hong Kong University of Science and Technology

Software developers increasingly rely on information from the Web, such as documents or code examples on application programming interfaces (APIs), to facilitate their development processes. However, API documents often do not include enough information for developers to fully understand how to use the APIs, and searching for good code examples requires considerable effort.

To address this problem, we propose a novel code example recommendation system that combines the strength of browsing documents and searching for code examples and returns API documents embedded with high-quality code example summaries mined from the Web. Our evaluation results show that our approach provides code examples with high precision and boosts programmer productivity.

Categories and Subject Descriptors: H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Search process

General Terms: Algorithms, Experimentation, Performance

Additional Key Words and Phrases: Clustering, ranking, code search, API document

ACM Reference Format:
Kim, J., Lee, S., Hwang, S.-W., and Kim, S. 2013. Enriching documents with examples: A corpus mining approach. ACM Trans. Inf. Syst. 31, 1, Article 1 (January 2013), 27 pages.
DOI: http://dx.doi.org/10.1145/2414782.2414783

1. INTRODUCTION

As reusing existing application programming interfaces (APIs) significantly improves programmer productivity and software quality [Devanbu et al. 1996; Gaffney and Durek 1989; Lim 1994], many developers search for API information, such as API documents or usage examples, on the Web to understand the usage of APIs.

Specifically, developers may not know which API method to use, in which case they need to browse API documents and read the descriptions before selecting the appropriate one. Alternatively, developers may know which API method to use but may not know how to use it, in which case they need to search for illustrative code examples.

To address these browsing needs, API creators usually provide API documents written in a human-readable language to show developers how to use the available APIs and select the most suitable one. However, as textual descriptions are often ambiguous and misleading, developers often combine browsing with searching for code examples [Bajracharya and Lopes 2009].

This work is based on and extends Kim et al. [2009, 2010].
This work was supported by the Engineering Research Center of Excellence Program of the Korea Ministry of Education, Science and Technology (MEST)/National Research Foundation of Korea (NRF) (Grant 2012-0000474).
Authors' emails: J. Kim, S. Lee, and S.-W. Hwang (corresponding author); {wlsgks08, sanghoon, swhwang}@postech.edu; S. Kim: [email protected].
© 2013 ACM 1046-8188/2013/01-ART1 $15.00
DOI: http://dx.doi.org/10.1145/2414782.2414783


Fig. 1. Top two results from Koders for the query “Connection prepareStatement”.

To address these searching needs, developers often use commercial code search engines, such as Koders [Koders 2010] and Google Code Search [Google 2010], with an API method name as a query keyword in order to find suitable code examples. However, the top search results from Koders, shown in Figure 1, do not always meet developers' expectations. The snippets of the two top search results show matches in the comments of source code and fail to provide any information about the usage of "prepareStatement()." As a result, human effort is needed to sift through all of the code and find relevant code examples.

To reduce this effort, some API documents, such as the MSDN [MSDN 2010] from Microsoft and the Leopard Reference Library [Leopard 2010] from Apple, include a rich set of code usage examples so that developers do not need to search for additional code examples to understand how to use the API. However, because manually crafting high-quality code examples for all API methods is labor intensive and time consuming, most API documents lack code examples. For example, according to our manual inspection, JDK 5 documents include more than 27,000 methods, but only about 500 of them (approximately 2%) are explained using code examples.

One basic solution to improving API documents is to include code search engine results (code examples) in API documents in advance by leveraging existing code search engines. However, these search engines work well for comment search but fail to retrieve high-quality code examples because they treat code as simple text. Similarly, one may consider code recommendation approaches [Holmes and Murphy 2005; Sahavechaphan and Claypool 2006; Zhong et al. 2009]. However, they require complex contexts, such as the program structure of the current task, in order to recommend suitable code examples.

In clear contrast to these approaches, we propose an intelligent code example recommendation system that searches, summarizes, organizes, and embeds the necessary information in advance by automatically augmenting API documents with high-quality code examples. Specifically, our system first builds a repository of candidate code examples by interrogating an existing code search engine and then summarizes those candidate code examples into effective snippets. Subsequently, our system extracts semantic features from the summarized code snippets and finds the most representative code examples, for which we propose three organization algorithms. Finally, our system embeds the selected code examples into API documents. Figure 2 provides a sample snapshot of generated API documents. Each API method is annotated with popularity, represented by the bar in Figure 2(a), to satisfy developers' browsing needs to find API methods that are used frequently in programming tasks. To determine the popularity, we count the number of code examples generated using our framework. For each API method, one to five code examples are presented, as shown in Figure 2(b). Hereafter, we will call these documents example-oriented API documents (eXoaDocs).


Fig. 2. An example page of generated eXoaDocs (java.text.Format class).

To evaluate our approach, we first compared our three organization algorithms, which select representative code examples among candidates, and generated eXoaDocs for JDK 5.¹ We then compared our results with the original Java documents (JavaDocs), code search engines, and gold standard results. We further conducted a user study for evaluating how eXoaDocs affect the software development process.

Our evaluation results show that our approach summarizes and ranks code examples with high precision and recall. In addition, our user study indicates that using eXoaDocs boosts programmer productivity and code quality.

The remainder of this article is organized as follows. Section 2 presents our proposed techniques for automatic useful code example generation, Section 3 presents our organization algorithms, and Section 4 evaluates the generated eXoaDocs. A case study and results are presented in Section 5. Section 6 discusses our limitations, Section 7 surveys related work, and Section 8 concludes this article.

¹In this article, we evaluate our system using JDK 5, but it is easily extensible to other programming languages, such as C and C++, simply by replacing the parser.


Fig. 3. Automatic eXoaDocs generation process.

2. OUR APPROACH

In this section, we discuss how to find and provide useful code examples for API documents.

Our framework consists of three modules: Summarization, Representation, and Organization (Figure 3).

First, our framework builds a repository of candidate code examples by leveraging an existing code search engine and summarizing the examples into effective snippets (Summarization). After that, it extracts semantic features from each summarized code example (Representation). Next, it selects the most representative code examples using clustering and ranking (Organization). Finally, it embeds the selected code examples into API documents and generates eXoaDocs.

The following subsections describe the three modules—Summarization, Representation, and Organization—in detail.

2.1. Summarization

The first module of our framework collects potential code examples for each API method and builds a code example repository.

To achieve this, we first leverage a code search engine, Koders, by querying the engine with the given API method name and its interface name extracted from API documents. For example, we query 'Connection prepareStatement' to collect source code of the API method prepareStatement in the interface Connection. We then collect the top 200 source code files for each API method. We selected 200 as the retrieval size based on our observation that most results ranked lower than the top 200 are either irrelevant to the query or redundant. We will further discuss the pros and cons of leveraging code search engines for populating the repository in Section 6. After collecting the top 200 source code files, we ignore the Koders ranking and only use these files to construct a repository. We then use our own summarization and ranking algorithms to identify suitable code examples.
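To make the collection step concrete, the following minimal sketch shows how the candidate repository for one API method could be populated. The CodeSearchClient interface and its search method are hypothetical stand-ins for the Koders query interface, which the article does not specify; only the query format ("Interface methodName") and the retrieval size of 200 come from the text.

    import java.util.List;

    // Hypothetical interface standing in for the code search engine's query API.
    interface CodeSearchClient {
        List<String> search(String query, int maxResults);  // best-ranked source files first
    }

    public class RepositoryBuilder {
        private static final int RETRIEVAL_SIZE = 200;  // top-200 files per API method

        private final CodeSearchClient engine;

        public RepositoryBuilder(CodeSearchClient engine) {
            this.engine = engine;
        }

        // e.g., buildFor("Connection", "prepareStatement") issues the query
        // "Connection prepareStatement" and keeps the top 200 files; the engine's
        // own ranking is ignored afterwards.
        public List<String> buildFor(String interfaceName, String methodName) {
            String query = interfaceName + " " + methodName;
            return engine.search(query, RETRIEVAL_SIZE);
        }
    }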

Next, we summarize the code as good example snippets that clearly explain the usage of the given API method. To obtain initial intuitions on defining good example snippets, we first observed manually designed code examples in JavaDocs. We illustrate the observed characteristics of good example snippets using a manually written code example to explain the add method in the Frame class, as shown in Figure 4.

ACM Transactions on Information Systems, Vol. 31, No. 1, Article 1, Publication date: January 2013.

Enriching Documents with Examples: A Corpus Mining Approach 1:5

Fig. 4. A manually constructed code example in JavaDocs.

— Good example snippets should include the corresponding API method call, for example, f.add() in line 6 of Figure 4, and its semantic context, such as the declaration and initialization of an argument ex1 (lines 3 and 4) within the method.

— Irrelevant code, regardless of the textual proximity to the API method, can be omitted, as in line 5 in the manually written example.

Recall that both of the top two code snippets from Koders in Figure 1 violate all of the characteristics needed to be good example snippets by (1) failing to show the actual API method call and (2) summarizing simply on the basis of the proximity of text to the query keyword matches, Connection and prepareStatement.

In clear contrast to the code search engine results, we achieve summarization based on the semantic context by following these steps.

— Method extraction. We identify the methods that contain the given API method call because they show the usage and semantic context of the given API method call.

— API slicing. We extract only the semantically relevant lines for the given API method call using slicing techniques [Horwitz et al. 1988; Weiser 1981]. Relevant lines need to satisfy at least one of the following requirements: (R1) declaring the input arguments for the given API method, (R2) changing or initializing the values of the input arguments for the given API method, (R3) declaring the class of the given API method, or (R4) calling the given API method. For example, in Figure 4, line 2 is relevant because it declares the Frame class of the API method (R3); line 3, because it declares the input argument ex1 (R1); line 4, because it initializes the input argument (R2); and line 6, because it calls the given API method (R4). All other lines were omitted as being irrelevant.

To identify these relevant lines, we first analyze the semantic context of the identified methods by building an abstract syntax tree (AST) from the potential code example using an open source tool, java2xml [Java2Xml 2010]. To illustrate, Figure 5 shows a code example and its AST with the line information. Each node of the AST represents an element of the Java language. For example, line 5 of the code example is converted to the subtree in the dotted circle. In that subtree, the node Send is a Java element indicating the actual API method call—the first Send at the root of the subtree represents check, and the second represents hashcode. Target and Var-ref indicate the interface and its name for the API method. For instance, the left descendant of the root, Target, corresponds to the interface of the check API method call, which is harness. Its right descendant, Arguments, has two descendants representing two arguments, b.hashcode() and 90.

Next, to effectively find the relevant lines from the AST, we design a simple lookup table, called the Intra-Method Analysis Table (IAT). To build the IAT, we traverse the AST and examine each element in the node. If the element in the node indicates variables or API method calls, we store the name and type of the element and the line number on which the element is called.


Fig. 5. eXoaDocs code example of “Character.hashcode()” and its abstract syntax tree.

Table I. An Example of the Intra-Method Analysis Table (IAT) for the Code in Figure 5

Line   Interface           API name   Arguments
4      a : Character : 2   hashCode   (none)
5      b : Character : 3   hashCode   (none)

When we find elements related to API method calls while traversing nodes in the AST, we check the stored elements, obtain the elements related to the API method call (e.g., variables used in the API method call), and store them in the IAT.

Each row in an IAT corresponds to an API method call and consists of four parts, as shown in Table I: the line number of the API method, the name and line number of the interface of the API method, the API method name, and the type and line information of the argument list of the API method. For example, the first row of Table I shows that (a) "hashCode()" is called at line 4, (b) its interface type is "Character" with name "a" (declared at line 2), and (c) "hashCode()" has no argument.

As an IAT contains all Java language elements related to the API method call, our system can find all relevant lines of the API method call by simply searching the rows of the IAT and checking API method calls. Our system then summarizes the original source code and generates example snippets using only the semantically relevant lines in the IAT.
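The following minimal sketch illustrates an IAT-like lookup structure and how relevant lines could be collected from it. The class and field names (IntraMethodAnalysisTable, IatRow, relevantLines) are hypothetical; only the row layout mirrors Table I and the relevance requirements R1-R4 described above.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.TreeSet;

    public class IntraMethodAnalysisTable {
        // One row per API method call, mirroring Table I: call line, interface variable
        // name/type/declaration line, API name, and argument-related lines.
        public record IatRow(int callLine, String receiverName, String receiverType,
                             int receiverDeclLine, String apiName, List<Integer> argumentLines) {}

        private final List<IatRow> rows = new ArrayList<>();

        public void add(IatRow row) {
            rows.add(row);
        }

        // Collect the lines that are semantically relevant to the queried API method:
        // the receiver declaration (R3), argument declarations/initializations (R1/R2),
        // and the call itself (R4).
        public TreeSet<Integer> relevantLines(String queriedApi) {
            TreeSet<Integer> lines = new TreeSet<>();
            for (IatRow row : rows) {
                if (!row.apiName().equals(queriedApi)) continue;
                lines.add(row.receiverDeclLine());
                lines.addAll(row.argumentLines());
                lines.add(row.callLine());
            }
            return lines;
        }

        public static void main(String[] args) {
            // The two hashCode() calls of Figure 5 / Table I.
            IntraMethodAnalysisTable iat = new IntraMethodAnalysisTable();
            iat.add(new IatRow(4, "a", "Character", 2, "hashCode", List.of()));
            iat.add(new IatRow(5, "b", "Character", 3, "hashCode", List.of()));
            System.out.println(iat.relevantLines("hashCode"));  // [2, 3, 4, 5]
        }
    }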

2.2. Representation

The goal of representation is to extract meaningful features from each summarized code example to identify the most representative code examples by comparing their features.

There are two extreme approaches to extracting features from each code example and comparing extracted features between code examples. One extreme is to treat code examples as simple texts, while the other extreme is to represent them as ASTs. As the former compares textual features between code examples based on keyword matching, it is highly efficient but neglects the semantic context. Conversely, as the latter constructs the ASTs using the semantic context of code examples and compares the ASTs between code examples, it is very expensive but considers the semantic context.


We find an intermediate approach between these two extremes by approximating the semantic context and comparing the approximated semantic context (a trade-off between efficiency and accuracy). To approximate the semantic context, we extract semantic context vectors from the ASTs using a clone detection algorithm called DECKARD [Jiang et al. 2007].

DECKARD proposes q-level characteristic vectors for approximating the semantic context of the AST for each code fragment, where q is the depth of the AST subtree. When the value of q is fixed, DECKARD extracts all possible AST subtrees with depth q and generates q-level characteristic vectors consisting of an n-dimensional numeric vector <c1, ..., cn>, where n is the number of elements in the subtree, such as loops, and each ci is the number of occurrences of a specific element. DECKARD selects some relevant elements in order to generate q-level characteristic vectors. For example, when computing the vector for the subtree with line 5 of the code in Figure 5, that is, the tree segment in the dotted circle, the characteristic vector for selected elements such as send, arguments, local-variable, and new would be <2, 2, 0, 0>, because the subtree contains two send and two arguments but does not contain local-variable and new.

Similarly, we use all 85 Java language source code elements detected from ASTs using "java2xml." (For the complete listing of all 85 elements, refer to http://exoa.postech.ac.kr.) The AST in Figure 5 contains a part of the 85 Java language source code elements we used, such as method, formal-argument, local-variable, arguments, and send. These 85 features have integer values that represent the count for each element in the given code example.

In addition, we consider two more integer features specific to our problem context of code example recommendation. First, we add the frequency of the query API method in each code example, as there is a higher chance that a code example with many query API method calls will be an example that the user is searching for. For this reason, we count the number of query API method calls in the code example and use it as a feature. Second, we add the number of lines of the code example as a feature to filter lengthy code examples, as developers usually prefer concise code examples. In this article, the length of the code example is the length of the entire method for code examples generated by "Method extraction" and the length of the code snippet for code examples generated by "API slicing."

In total, we use 87-dimensional feature vectors to represent the code examples.
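A minimal sketch of how such a characteristic vector could be assembled is shown below, assuming the AST has already been flattened into a list of element tags; the helper name and the truncated element list are illustrative, not the actual 85-element java2xml vocabulary.

    import java.util.Arrays;
    import java.util.List;

    public class CharacteristicVector {
        // In the real system this list would hold all 85 java2xml element names.
        private static final List<String> ELEMENTS =
                List.of("method", "formal-argument", "local-variable", "arguments", "send" /* ... */);

        public static int[] build(List<String> astElementTags,
                                  int queryApiCallCount,
                                  int lineCount) {
            int[] vector = new int[ELEMENTS.size() + 2];
            // Features 1..85: occurrence count of each Java language element.
            for (String tag : astElementTags) {
                int idx = ELEMENTS.indexOf(tag);
                if (idx >= 0) vector[idx]++;
            }
            // Feature 86: how often the queried API method is called in the snippet.
            vector[ELEMENTS.size()] = queryApiCallCount;
            // Feature 87: snippet length in lines, used to penalize lengthy examples.
            vector[ELEMENTS.size() + 1] = lineCount;
            return vector;
        }

        public static void main(String[] args) {
            // The Figure 5 subtree contains two send and two arguments nodes; the counts
            // for the query frequency (1) and the line count (5) are illustrative.
            int[] v = build(List.of("send", "send", "arguments", "arguments"), 1, 5);
            System.out.println(Arrays.toString(v));
        }
    }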

2.3. Organization

In this section, we discuss how to organize code examples to best satisfy user intent. To satisfy user intent, we need to know what users want. However, user intent is often unclear, as users may have different usage types in mind, and it is impossible to guess the usage type using only the API method name. In a similar context, general search engines have taken one of the following two approaches when the query intent is ambiguous.

— Clustering. Clusty² and Carrot³ cluster search results into multiple groups satisfying diverse user intents and present the representatives of each group.

— Ranking. Typical search engines estimate the probability of user intent and rank to optimize for the most likely intent.

These two approaches have complementary strengths, as summarized in Table II. As the clustering-based approach finds clusters satisfying diverse user intents, the code examples presented can cover a diverse range of usage types.

²http://clusty.com/
³http://search.carrot2.org/stable/search


Table II. Complementary Strengths of Ranking- and Clustering-Based Approaches

Algorithm                   Pros                                           Cons
Clustering-based approach   cover diverse usage types                      results may include outlier types
Ranking-based approach      consider different probabilities of intents    results may be skewed

Fig. 6. Illustrations of presentation algorithms.

However, this approach may return outliers by not considering the probabilities of user intention, thus generating a group for an intent that is extremely unlikely. Conversely, the ranking-based approach considers such probabilities and optimizes for the most likely user intent, which reduces the coverage. As a result, it provides the most widely used code examples. However, the results tend to skew to satisfy a few popular intents.

We examined both ranking- and clustering-based approaches for the organization of code examples and propose the following three algorithms. To illustrate each algorithm, Figure 6 shows an example that plots candidate code examples in two-dimensional space (even though each code example is 87-dimensional), where the dotted circles are clusters representing the usage types of an API method and the three black dots are the code examples chosen by each algorithm. Note that the distance between points reflects the code similarity between two code examples.

— eXoaCluster (Figure 6(a)) adopts a clustering-based approach. It identifies clusters of code examples representing the same usage type and selects the most representative code example from each cluster. As a result, eXoaCluster provides code examples with diverse usage types.

— eXoaRank (Figure 6(b)) adopts a ranking-based approach. It first estimates the probability of each intent from the code corpus, assuming that the most widely used usage type is the most likely to be asked. Specifically, the probability of each intent is estimated using centrality [Erkan and Radev 2004], representing highly likely intents, inspired by the document summarization literature. eXoaRank then selects the top-k representative code examples with the highest centrality.

— eXoaHybrid (Figure 6(c)) balances the previous two algorithms using their complementary strengths by not only considering the probability of each intent but also trying to diversify the results over multiple usage types. eXoaHybrid iteratively computes the marginal utility [Agrawal et al. 2009] by using the probabilities of clusters and code examples to satisfy the user intents. It then selects the code example with the highest marginal utility at each iteration. eXoaHybrid thus recognizes that the cluster size and the number of representatives from each cluster can be different.


Fig. 7. Distribution of the number of different usage types in API methods in JDK 5. For highly diverse API methods (4∼5), "Cluster" provides better code examples, while for less diverse API methods (2∼3), "Rank" provides better code examples. As "Rank" and "Cluster" are identical when API usage is not diverse (or the number of usage types is 1), we mark this as "both."

Figure 7 shows the distribution of the number of usage types found in API methods in JDK 5. Observe that this number is in the range from one to five, with five suggesting that the API method has five different usage types (i.e., diverse) and, in the other extreme, one suggesting that the API method has a single usage type everyone agrees with (i.e., not diverse). Intuitively, for highly diverse API methods, clustering the results by types and presenting different usages would be more reasonable. Meanwhile, in the other extreme, every result falls into a single cluster such that clustering becomes a redundant phase. Our experimental results are consistent with this intuition. Figure 7 shows that for diverse scenarios with API methods with 4+ usage types, eXoaCluster is the winner, and for the less diverse scenarios, eXoaRank is the winner. Approximately half of the API methods fall into the former and the rest into the latter scenario.

So, of what use is eXoaHybrid? If it is easy to predict whether the given API method is diverse, selecting the winning algorithm for either case is the most effective. However, in a case where such a prediction is non-trivial, eXoaHybrid closely approximates the accuracy of the winner in all scenarios, while the accuracy of eXoaRank deteriorates in diverse scenarios (and so does that of eXoaCluster in less diverse scenarios). That is, eXoaHybrid, which balances eXoaCluster and eXoaRank, can be the best choice to satisfy developers' intent for diverse API methods, as discussed in detail in Section 5.

The technical details of each algorithm will be discussed in the next section.

3. ORGANIZATION ALGORITHMS

In this section, we describe the technical details of our three proposed algorithms: eXoaCluster, eXoaRank, and eXoaHybrid.

3.1. eXoaCluster

The goal of eXoaCluster is to cluster code examples into groups of similar usage types and select the representatives from each group.

3.1.1. Clustering. The first step of eXoaCluster is to cluster the code examples into groups of different usage types, such that our results can cover various usage types of the given API method. Because we represent code examples as characteristic vectors, as described in Section 2.2, well-known clustering algorithms on numerical data, such


as the k-means algorithm, can be applied using the L1-distance as the similarity metric, defined as follows:

L1-distance(V_1, V_2) = \sum_{i=1}^{n} |x_i - y_i|,

where V_1 and V_2 are n-dimensional characteristic vectors and x_i and y_i are the ith elements of V_1 and V_2, respectively. In our problem, k is both the number of clusters and the number of usage types of the given API method.

However, the key challenge is that the number of clusters k varies over different API methods, which requires the automatic tuning of k during a query. However, applying existing algorithms that support this tuning, such as X-means [Pelleg and Moore 2000], renders poor clustering results for our problem with 87-dimensional data. The problem is that these algorithms iteratively look for the best split and check whether the result of the split improves cluster quality. However, as all of the splits become near-equally desirable in the high-dimensional space (known as the "curse of dimensionality" problem [Xu and Wunsch 2005]), X-means assigns all code examples to one cluster or one code example to one cluster. In other words, the number of clusters tuned by X-means either remains k_min or exponentially increases and terminates when k = k_max in all cases, where k_min is the minimum number of clusters and k_max is the maximum number of clusters (the default values are k_min = 1 and k_max = the number of code examples).

In clear contrast, in our problem context, the number of usage types is usually small. We empirically observed that the number of usage types is usually in the range from 2 to 5. Using this observation, we invoke the k-means clustering algorithm four times, that is, for k = 2, 3, 4, and 5, and select the results with the best clustering quality. While such an exhaustive search is generally discouraged in favor of the incremental splitting used in X-means, we observe that it achieves higher efficiency and accuracy compared to X-means in our specific problem setting with (1) a tightly bounded k range and (2) high-dimensional data.
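The following self-contained sketch shows k-means with the L1-distance run for k = 2 to 5 as described above; the quality scoring that picks among the four clusterings (described next) is omitted, and the toy vectors, iteration cap, and seed in main() are illustrative.

    import java.util.Arrays;
    import java.util.Random;

    public class L1KMeans {

        static double l1(double[] a, double[] b) {
            double d = 0;
            for (int i = 0; i < a.length; i++) d += Math.abs(a[i] - b[i]);
            return d;
        }

        /** Returns the cluster assignment of each vector for a fixed k. */
        static int[] cluster(double[][] vectors, int k, int iterations, long seed) {
            int n = vectors.length, dim = vectors[0].length;
            Random rnd = new Random(seed);
            // Initialize centroids with randomly chosen vectors.
            double[][] centroids = new double[k][];
            for (int c = 0; c < k; c++) centroids[c] = vectors[rnd.nextInt(n)].clone();
            int[] assignment = new int[n];
            for (int it = 0; it < iterations; it++) {
                // Assignment step: nearest centroid under the L1-distance.
                for (int i = 0; i < n; i++) {
                    int best = 0;
                    for (int c = 1; c < k; c++)
                        if (l1(vectors[i], centroids[c]) < l1(vectors[i], centroids[best])) best = c;
                    assignment[i] = best;
                }
                // Update step: centroid = mean of the assigned vectors.
                double[][] sums = new double[k][dim];
                int[] counts = new int[k];
                for (int i = 0; i < n; i++) {
                    counts[assignment[i]]++;
                    for (int d = 0; d < dim; d++) sums[assignment[i]][d] += vectors[i][d];
                }
                for (int c = 0; c < k; c++)
                    if (counts[c] > 0)
                        for (int d = 0; d < dim; d++) centroids[c][d] = sums[c][d] / counts[c];
            }
            return assignment;
        }

        public static void main(String[] args) {
            double[][] vectors = { {2, 2, 0, 0}, {2, 1, 0, 0}, {0, 0, 3, 1}, {0, 1, 3, 1} };
            for (int k = 2; k <= Math.min(5, vectors.length); k++) {
                int[] assignment = cluster(vectors, k, 20, 42L);
                System.out.println("k=" + k + " -> " + Arrays.toString(assignment));
            }
        }
    }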

To select the best clustering results, we use three cluster quality metrics, as similarly defined in X-means. We compute the following scores for each clustering result, sum those scores, and select the clustering result with the highest score. To avoid under-represented clusters, we exclude clusters with just one or two code examples.

— Centroid distribution. Clustering where centroids are distributed evenly suggests a good coverage of varying usage types, as the centroid of a cluster represents the characteristic of the cluster. We check whether the centroids of clusters are evenly distributed. We measure 1/var_i for the variance var_i of the centroids from the clustering results, where i is the number of clusters.
— Sum of the squared error (SSE). If code examples are categorized well, then they should have characteristic vectors similar to the centroids of their clusters. We check SSE based on the distance between each code example and the centroid of its cluster. As SSE usually decreases as the number of clusters increases, we compute the decreasing quantity. A large decrease in SSE when the number of clusters is increased suggests that centroids are statistically better representatives of all vectors in the cluster, indicating high-quality clustering results. We measure this as Δ_i = SSE_{i-1} - SSE_i, which represents the difference in SSE, where i - 1 and i are the numbers of clusters.

— Hierarchical clustering. The clusters that represent usages typically exhibit a hierarchical structure. In other words, a cluster can be split into subtypes (the lower layers in the hierarchy) or two clusters can be merged to represent their supertype


(the higher layer). To favor clustering results that preserve this hierarchical relationship, we give a score of 1 if all members of each cluster (when the number of clusters is i) come from one cluster in the results when the number of clusters is i - 1, and 0 otherwise.

We choose the k value that yields the highest score, which is the sum of the scores, and k determines the number of usage types to present for each API method.

3.1.2. Code Example Selection. Once the clusters are identified, we compute the following three normalized scores in the range [0, 1], as there is no weight factor between scores. Next, we add those scores, rank code examples using the aggregated overall scores, and select the most representative code example with the highest overall score from each cluster.

— Representativeness. Representative code examples should have characteristic vectors similar to their cluster's centroid. This representativeness can be computed using the L1-distance [Campbell 1984; Foody et al. 1992]. Specifically, the measure of the representativeness is min(1/similarity, 1).
— Conciseness. Concise code examples have better readability than lengthy ones, and so developers usually prefer concise code examples. For this reason, we give higher scores to concise code examples, that is, those with fewer lines of code. Specifically, the measure of the conciseness is 1/length.
— Correctness. Suitable code examples should contain the correct types of classes and arguments. However, API methods with the same name can belong to different classes or can take different arguments. In addition, the IAT may miss the correct type of classes or arguments and return non-type information. If there are code examples that contain the correct type information, we do not provide code examples with non-type information. However, if there is no such correct code example, code examples with non-type information should be presented. For this reason, we give a high score to code examples that use API methods with the correct classes and matching arguments. Specifically, we give the score 1 when the code example has correct information, and 0 otherwise.
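A minimal sketch of how the three selection scores could be combined is shown below. It interprets the paper's min(1/similarity, 1) with the L1-distance to the cluster centroid as the dissimilarity; the method names and the example values in main() are illustrative.

    public class SelectionScore {

        /** min(1/distance, 1): closer to the cluster centroid = more representative. */
        static double representativeness(double l1DistanceToCentroid) {
            return Math.min(1.0 / l1DistanceToCentroid, 1.0);
        }

        /** 1/length: shorter snippets are preferred. */
        static double conciseness(int lineCount) {
            return 1.0 / lineCount;
        }

        /** 1 if the class and argument types match the queried API method, 0 otherwise. */
        static double correctness(boolean typesMatch) {
            return typesMatch ? 1.0 : 0.0;
        }

        /** Overall score: the unweighted sum of the three normalized scores. */
        static double overall(double l1DistanceToCentroid, int lineCount, boolean typesMatch) {
            return representativeness(l1DistanceToCentroid)
                 + conciseness(lineCount)
                 + correctness(typesMatch);
        }

        public static void main(String[] args) {
            // A 7-line snippet close to its centroid, with matching types.
            System.out.println(overall(1.5, 7, true));  // ~0.667 + 0.143 + 1.0
        }
    }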

3.2. eXoaRank

As an alternative to eXoaCluster, which shows the representatives of all clusters and thus treats all clusters as equally likely to be chosen by developers, we design eXoaRank by considering the different probabilities that code examples will satisfy the developers.

We estimate the probability of each intent from the code corpus, assuming that the usage type most commonly used by developers is the most likely to be asked in the query. Specifically, we compute such a probability as centrality [Erkan and Radev 2004], measuring how many similar code examples are contained in the corpus. Therefore, a high centrality score means that there are many similar code examples used by developers. In other words, developers commonly use the API in the way presented in that specific code example. That is, this code example is likely to satisfy the user intent with high probability.

To measure the centrality, we model all code examples in the code corpus as a graph, where each node represents a code example and is connected by edges with weights representing pairwise similarity. We quantify the similarity using the L1-distance between characteristic vectors, as described in Section 2.2.

L1(v_i, v_j) = \sum_{d=1}^{n} |v_i(d) - v_j(d)|,

ACM Transactions on Information Systems, Vol. 31, No. 1, Article 1, Publication date: January 2013.

1:12 J. Kim et al.

where v_i and v_j are the characteristic vectors of two code examples, n is the dimensionality of the vectors, and v_i(d) is the dth value of v_i. Using the L1-distance, we compute the normalized similarity between two code examples.

Similarity(v_i, v_j) = 1 / L1(v_i, v_j).

This similarity is between 0 and 1 and is higher for similar pairs. For identical pairs, because we cannot compute the similarity score as it approaches infinity, we give the maximum score of 1. To keep the graph structure compact, we only retain the edges that have a weight higher than a certain threshold.

Once the graph is constructed, we compute the centrality score of each node (or code example). A naive approach would be simply to count the number of connected edges.

Centrality_0(c_i) = the number of edges connected with c_i,

where c_i is a node. However, centrality has a recursive nature, and hence nodes that are similar to those with high centrality will also have high centrality, as similarly reflected in metrics on social networks, Web graphs, or co-authorship graphs (e.g., PageRank [Page et al. 1999]). To reflect this, we use the preceding metric as the initial value at iteration 0 and iteratively refine the centrality value, Centrality_t(c_i), at each iteration t, as shown in the following.

Centrality_t(c_i) = \sum_{c_j \in S(c_i)} Centrality_{t-1}(c_j),

where S(c_i) is the set of code examples that are connected with c_i. To avoid outliers, we extend the centrality notion to be divided by the number of edges.

Centrality_t(c_i) = \sum_{c_j \in S(c_i)} Centrality_{t-1}(c_j) / (number of edges of c_j).

We compute the centrality scores of nodes iteratively until they converge and present the top-k results with the highest centrality scores.
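The following minimal sketch implements the iterative centrality computation on the code-example graph; the adjacency matrix in main(), the iteration cap, and the convergence tolerance are illustrative values, not taken from the article.

    import java.util.Arrays;

    public class Centrality {

        /**
         * adjacency[i][j] is true when Similarity(v_i, v_j) = 1/L1(v_i, v_j) exceeds the
         * pruning threshold (identical pairs receive the maximum similarity of 1).
         */
        static double[] compute(boolean[][] adjacency, int maxIterations, double tolerance) {
            int n = adjacency.length;
            int[] degree = new int[n];
            double[] centrality = new double[n];
            for (int i = 0; i < n; i++) {
                for (int j = 0; j < n; j++) if (i != j && adjacency[i][j]) degree[i]++;
                centrality[i] = degree[i];                 // Centrality_0 = number of edges
            }
            for (int t = 0; t < maxIterations; t++) {
                double[] next = new double[n];
                for (int i = 0; i < n; i++)
                    for (int j = 0; j < n; j++)
                        if (i != j && adjacency[i][j] && degree[j] > 0)
                            next[i] += centrality[j] / degree[j];   // neighbor's score, split over its edges
                double change = 0;
                for (int i = 0; i < n; i++) change += Math.abs(next[i] - centrality[i]);
                centrality = next;
                if (change < tolerance) break;             // stop once the scores converge
            }
            return centrality;
        }

        public static void main(String[] args) {
            // A small graph: examples 0-2 form a tight group, example 3 hangs off example 2.
            boolean[][] adj = {
                    {false, true,  true,  false},
                    {true,  false, true,  false},
                    {true,  true,  false, true },
                    {false, false, true,  false},
            };
            System.out.println(Arrays.toString(compute(adj, 100, 1e-6)));
        }
    }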

3.3. eXoaHybrid

As illustrated in Figure 6, cluster- and ranking-based approaches have complementary strengths: a cluster-based approach maximizes the coverage but is agnostic to probabilities, while a ranking-based approach is probability aware but provides results that may skew to a few popular types.

In this section, we describe eXoaHybrid, which balances these strengths by considering the probability of centrality while trying to diversify the results over multiple usage types by avoiding redundancy, as similarly studied in Agrawal et al. [2009] for diversifying general search results.

We use the top-k representatives selected from eXoaRank and eXoaCluster. Given the 2k selected code examples, we first assign them to the k clusters generated by eXoaCluster.

We then iteratively select a code example for presentation until k code examples have been selected. To select representative code examples, we first quantify the probability of each cluster satisfying the user intent. The probability of each cluster, denoted as P(C_i|API), is defined as follows.

P(C_i|API) = (number of selected code examples in C_i) / (total number of selected code examples),

where C_i is the ith cluster and API is a given specific method.


Table III. An Example Dataset

Code   Cluster   P(Ci|API)   V(e|API, Ci)
e1     C1        0.2         0.9
e2     C2        0.2         0.2
e3     C3        0.6         0.8
e4     C3        0.6         0.6
e5     C3        0.6         0.5

Table IV. Illustration of the eXoaHybrid Algorithm Using the Dataset in Table III

Iteration   Code examples            Marginal utilities                 Result set
1           {e1, e2, e3, e4, e5}     {0.18, 0.04, 0.48, 0.36, 0.30}     {e3}
2           {e1, e2, e4, e5}         {0.18, 0.04, 0.072, 0.06}          {e3, e1}
3           {e2, e4, e5}             {0.04, 0.072, 0.06}                {e3, e1, e4}

After computing the probability of the cluster, which represents the frequency of the usage types, we compute the probability of each code example in each cluster to select suitable code examples in the cluster. Let V(e|API, C_i) be the probability that the code example e in cluster C_i for the given API satisfies developers. In this article, we use the following equation.

V(e|API, C_i) = α × RS_C + β × RS_R,

where RS_C is the normalized eXoaCluster rank score (Section 3.1.2) of e; RS_R is the normalized eXoaRank rank score (Section 3.2) of e; and α and β are weight parameters, as both the eXoaCluster and eXoaRank scores represent the quality of the code example e. Both scores are normalized in the range [0, 1].

Using P(C_i|API) and V(e|API, C_i), we compute the overall probability of the code examples. As P(C_i|API) represents the probability of the cluster C_i satisfying the user intent and V(e|API, C_i) indicates the probability that the code example e in C_i satisfies the user, the total probability of the code example e can be computed as follows.

P(C_i|API) × V(e|API, C_i).

We select the code example with the highest probability value.

However, ranking examples by the preceding value will retrieve a redundant set of highly similar results, as in eXoaRank. For diversification, a conditional probability was defined in Agrawal et al. [2009] with respect to the set S of code examples already selected by eXoaHybrid in C_i. Initially, when none has been selected, that is, S = ∅, the conditional probability U is identical to the cluster probability P, that is, U(C_i|API, ∅) = P(C_i|API). However, when some code example e from C_i is selected, U(C_i|API, S) is updated to

U(C_i|API, S ∪ {e}) = U(C_i|API, S) × (1 - V(e|API, C_i)).

With U, we rank the code examples by the marginal utility in order to obtain diversified results.

g(e|API, C_i, S) = U(C_i|API, S) × V(e|API, C_i).

To illustrate how eXoaHybrid works for our example ranking, Table IV shows an example using the dataset in Table III. In this example, the initial marginal utilities of the code examples are 0.18, 0.04, 0.48, 0.36, and 0.30. Because e3 has the highest marginal utility, eXoaHybrid selects e3. Once this code example is selected, the marginal utilities of the other code examples in C3 decrease, because the conditional probability U(C3|API, {e3}) is updated to 0.12 = 0.6 × (1 - 0.8).


Table V. Agreement of Selected Code Examples in the First Gold Standard Set

             assessor 1   assessor 2   assessor 3
assessor 1   1.0000       0.5814       0.6452
assessor 2   —            1.0000       0.5224
assessor 3   —            —            1.0000

Note: The average of the agreements is 0.5830.

After this update, the marginal utilities of {e1, e2, e4, e5} are {0.18, 0.04, 0.072, 0.06}. At the second iteration, eXoaHybrid selects e1 with the highest marginal utility, and U(C1|API, {e1}) decreases to 0.02 = 0.2 × (1 - 0.9). At the third iteration, among the examples {e2, e4, e5} with marginal utilities {0.04, 0.072, 0.06}, eXoaHybrid selects e4. Finally, eXoaHybrid returns {e1, e3, e4} as the top three code examples. In contrast, as eXoaCluster selects a representative code example from each cluster, it will return {e1, e2, e3}. Furthermore, as eXoaRank selects the code examples with the highest centrality and cluster C3 has three code examples with similar usage, it will return {e3, e4, e5}.
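The selection loop of eXoaHybrid can be sketched as follows; the class and method names are illustrative, and main() reproduces the Table III data, yielding the {e3, e1, e4} result traced in Table IV.

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class ExoaHybrid {

        record Candidate(String id, String cluster, double v) {}   // v = V(e | API, C_i)

        static List<String> selectTopK(List<Candidate> candidates,
                                       Map<String, Double> clusterProb, int k) {
            // U(C_i | API, S) starts as P(C_i | API) and shrinks as examples from C_i are picked.
            Map<String, Double> u = new LinkedHashMap<>(clusterProb);
            List<Candidate> remaining = new ArrayList<>(candidates);
            List<String> selected = new ArrayList<>();
            while (selected.size() < k && !remaining.isEmpty()) {
                Candidate best = null;
                double bestUtility = -1;
                for (Candidate c : remaining) {
                    double utility = u.get(c.cluster()) * c.v();     // g(e | API, C_i, S)
                    if (utility > bestUtility) {
                        bestUtility = utility;
                        best = c;
                    }
                }
                selected.add(best.id());
                remaining.remove(best);
                // Update the conditional probability of the winning cluster.
                u.put(best.cluster(), u.get(best.cluster()) * (1 - best.v()));
            }
            return selected;
        }

        public static void main(String[] args) {
            List<Candidate> candidates = List.of(
                    new Candidate("e1", "C1", 0.9), new Candidate("e2", "C2", 0.2),
                    new Candidate("e3", "C3", 0.8), new Candidate("e4", "C3", 0.6),
                    new Candidate("e5", "C3", 0.5));
            Map<String, Double> clusterProb = Map.of("C1", 0.2, "C2", 0.2, "C3", 0.6);
            System.out.println(selectTopK(candidates, clusterProb, 3));  // [e3, e1, e4]
        }
    }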

4. EVALUATION

In this section, we first evaluate the quality of the three proposed organization algorithms using generated eXoaDocs for JDK 5. We then evaluate the quality of our search results by comparing them with existing API documents, code search engines, and gold standard results. The eXoaDocs used for this evaluation are provided at http://exoa.postech.ac.kr.

4.1. Comparison of Organization Algorithms

In this section, we compare the quality of code examples selected by the three proposed algorithms: eXoaCluster, eXoaRank, and eXoaHybrid.

For comparison purposes, we constructed two gold standard sets. We first randomly selected 50 API methods. For each API method, the three proposed algorithms selected the top-k representatives, where k is the number of usage types (or clusters) found by eXoaCluster. We then divided them into two sets. The first gold standard set consisted of API methods with four or more usage types (a large number of clusters), and the second gold standard set consisted of API methods with two or three usage types (a small number of clusters). As a result, 22 API methods were in the first set and 20 API methods were in the second set. After that, we collected the top-k representatives from eXoaRank and eXoaCluster, respectively, for each API method. As some code examples were selected by both eXoaRank and eXoaCluster, we collected between k and 2k representatives for each API method. Naturally, these selected results contained the top-k results of eXoaHybrid (we set α = 0.2 and β = 0.8 based on empirical observation). As a result, 198 candidate code examples were selected from the first set and 117 from the second set. These candidate code examples were then presented to three human assessors who were asked to pick the best k results for each API method.

Tables V and VI present the agreement matrix among the three human assessors. The agreement was quantified using the Jaccard similarity [Jaccard 1901], such that the agreement between two result sets R_A and R_B is defined as follows.

Agreement(R_A, R_B) = |R_A ∩ R_B| / |R_A ∪ R_B|.

This metric scores higher for two sets with higher agreement, for example, 1 when R_A = R_B and 0 when the sets are disjoint. Observe from Tables V and VI that all agreements between any two assessors are greater than 0.5, and the average agreements are 0.5830 and 0.6854.


Table VI. Agreement of Selected Code Examples in the Second Gold Standard Set

             assessor 1   assessor 2   assessor 3
assessor 1   1.0000       0.7500       0.6060
assessor 2   —            1.0000       0.7000
assessor 3   —            —            1.0000

Note: The average of the agreements is 0.6854.

To validate the reliability of the agreements, we computed the Fleiss' kappa score [Fleiss 2010], which is a statistical measure of the reliability of agreement among a fixed number of assessors. Let N be the number of code examples, n be the number of assessors, and k be the number of categories (in our case for the first gold standard set, N = 198, n = 3, and the categories are Selected and Not Selected). To compute Fleiss' kappa κ, we first computed p_j, the proportion of all assignments that were made to the jth category.

p_j = (1 / (N n)) \sum_{i=1}^{N} n_{ij},

where n_{ij} is the number of assessors who assigned the ith code example to the jth category. We next computed P_i, which indicates the extent of agreement among assessors for the ith code example.

P_i = (1 / (n(n - 1))) \sum_{j=1}^{k} n_{ij}(n_{ij} - 1).

Using p_j and P_i, κ is defined as

κ = (\bar{P} - P_e) / (1 - P_e),

where \bar{P} is the mean of the P_i and P_e is the sum of the squares of the p_j. The factor 1 - P_e indicates the maximum achievable agreement score, and \bar{P} - P_e indicates the achieved agreement score. This metric also scores higher with higher agreement, for example, 1 when all assessors completely agree with each other, but equal to or lower than 0 when they disagree. The Fleiss' kappa scores of the two gold standard sets are 0.4540 and 0.4864, respectively. These scores are interpreted as moderate agreement [Fleiss 2010], which indicates that the agreements among the three assessors are reliable.
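A minimal sketch of the Fleiss' kappa computation described above is shown below; the small ratings matrix in main() is made up for illustration and is not the study data.

    public class FleissKappa {

        /**
         * ratings[i][j] = number of assessors who assigned code example i to category j
         * (here j = 0 for "Selected", j = 1 for "Not selected"); each row sums to n assessors.
         */
        static double kappa(int[][] ratings, int n) {
            int N = ratings.length;          // number of code examples
            int k = ratings[0].length;       // number of categories
            double[] p = new double[k];      // p_j: proportion of all assignments to category j
            double meanP = 0;                // mean of the P_i
            for (int[] row : ratings) {
                double agree = 0;
                for (int j = 0; j < k; j++) {
                    p[j] += row[j];
                    agree += row[j] * (row[j] - 1);
                }
                meanP += agree / (n * (n - 1.0));   // P_i for this code example
            }
            meanP /= N;
            double pe = 0;                   // P_e: sum of squared category proportions
            for (int j = 0; j < k; j++) {
                p[j] /= (double) N * n;
                pe += p[j] * p[j];
            }
            return (meanP - pe) / (1 - pe);
        }

        public static void main(String[] args) {
            // Four examples rated by three assessors into {Selected, Not selected}.
            int[][] ratings = { {3, 0}, {2, 1}, {0, 3}, {3, 0} };
            System.out.printf("kappa = %.4f%n", kappa(ratings, 3));
        }
    }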

On the basis of these assessments, we built the two gold standard answer sets using majority voting, by selecting the candidates picked by two or more assessors. As a result, from the first gold standard set, which consisted of API methods with four or more usage types, 107 code examples were selected as the gold standard answers among the 198 candidate code examples, and from the second gold standard set, which consists of API methods with two or three usage types, 82 code examples were selected as the gold standard answers among the 117 candidate code examples. On the basis of these gold standard sets, denoted as R_G1 and R_G2, we computed the agreement between each gold standard set and the result set R_A from each algorithm, that is, Agreement(R_G, R_A), as shown in Tables VII and VIII.

eXoaCluster showed the best performance among the three proposed algorithms (0.6328 agreement) for JDK 5 API methods that had four or more usage types, while eXoaRank showed the best performance (0.4615 agreement) for JDK 5 API methods that had two or three usage types.


Table VII. Agreement Score between the First Gold Standard Set and Each Organization Algorithm

        eXoaCluster   eXoaRank   eXoaHybrid
R_G1    0.6328        0.1808     0.5145

Table VIII. Agreement Score between the Second Gold Standard Set and Each Organization Algorithm

        eXoaCluster   eXoaRank   eXoaHybrid
R_G2    0.3168        0.4615     0.3434

We analyzed why eXoaCluster showed the best performance for R_G1 and eXoaRank showed the best performance for R_G2 by examining the three human assessors' selections.

The results are consistent with our intuition, discussed in Section 2.3. For API methods with few usage types, the usage types selected by the human assessors are also skewed to one homogeneous type, which well explains why eXoaRank is the winner. For API methods with diverse usage types, the human assessors' decisions varied over types, for which eXoaCluster—showing diverse clusters—is effective. Meanwhile, eXoaHybrid closely emulates the performance of the winning algorithm for all API methods. Considering that it is hard to predict how diverse the use case is per API method, eXoaHybrid can be a practical choice for generating examples for all types. More specifically, in the evaluation using R_G1, in many cases, eXoaHybrid effectively selected the gold standard results that were not selected by eXoaCluster (11 API methods among 22 API methods). eXoaHybrid also selected the same number of gold standard results as eXoaCluster for 10 API methods. In the evaluation using R_G2, eXoaHybrid effectively selected the gold standard results that were not selected by eXoaRank (13 API methods among 20 API methods) and also selected the same number of gold standard results as eXoaRank for 12 API methods. These results indicate that eXoaHybrid effectively filtered out outliers from eXoaCluster and selected more representative code examples from eXoaRank.

For the remaining evaluations, we used eXoaCluster, which showed the highest average agreement score across the two gold standard test results (Tables VII and VIII).

4.2. Comparison with JavaDocs

We compared the number of code examples that eXoaDocs generated for JDK 5 with the number provided in JavaDocs.

To determine the number of code examples provided by JavaDocs, we analyzed the HTML source code. As the <pre> tag of HTML contains code information, we checked each method to ascertain whether it contained a <pre> tag. However, as other content is also expressed using a <pre> tag, we manually checked whether the contents of the <pre> tag were code examples.
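The automated part of this count can be sketched as follows; the regular expression and class name are illustrative, and the manual check of whether a matched <pre> block really contains a code example is not reproduced.

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class PreTagCounter {

        private static final Pattern PRE_BLOCK =
                Pattern.compile("<pre>(.*?)</pre>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

        /** Returns the number of <pre> blocks in one method's HTML description. */
        static int countPreBlocks(String methodDescriptionHtml) {
            Matcher m = PRE_BLOCK.matcher(methodDescriptionHtml);
            int count = 0;
            while (m.find()) count++;
            return count;
        }

        public static void main(String[] args) {
            String html = "<p>Adds a component.</p><pre>Frame f = new Frame();\nf.add(ex1);</pre>";
            System.out.println(countPreBlocks(html));   // 1; still needs the manual code-vs-text check
        }
    }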

Out of more than 27,000 methods, eXoaDocs augmented 75% (more than 20,000 methods) with code examples, while only 2% (about 500 methods) were augmented in JavaDocs.

This result shows that our system provides a rich set of code examples when the API document contains only a small number of code examples. In addition, even if the API document already contains a rich set of code examples, our system can provide additional real-life code examples using the APIs.

4.3. Comparison with Code Search Engines

In this section, we compare the quality of eXoaDocs with two code search engines, namely Koders and Google Code Search.


Table IX. Top Query Results for Koders, Google Code Search, and eXoaDocs Using 10 Randomly Selected API Methods

           Relevant code and snippet   Relevant code   Irrelevant code
Koders     22%                         8%              70%
Google     12%                         10%             78%
eXoaDocs   92%                         6%              2%

Note: Most of the Koders and Google Code Search results were irrelevant, but most of the eXoaDocs results were relevant and had a good snippet.

We queried with ten randomly selected API methods and manually divided the top-k results (where k is the number of examples embedded in eXoaDocs) of Koders, Google Code Search, and eXoaDocs into the following three groups.

— Relevant code and snippet. The code includes the correct API method usage, and its snippet correctly summarizes the API method usage. As a result, the appropriate API method and all parameters used in the API method are well explained in the snippet (i.e., a good ranking and good summarization).

— Relevant code. The code includes the correct API method usage, but its snippet incorrectly summarizes the API method usage. As a result, the appropriate API method or some parameters used in the API method are not explained in the snippet (i.e., a good ranking but bad summarization).

— Irrelevant code. In this case, the queried API method does not appear in either the code or its summarized snippet (i.e., a bad ranking and bad summarization).

Table IX presents our evaluation results. As eXoaDocs finds top results by analyzing the semantic context of source code, most of the eXoaDocs results (92%) showed proper API method usage in the summary, while only 22% and 12% of the snippets in Koders and Google Code Search satisfied the requirements. In addition, the vast majority of the results from Koders and Google Code Search, 70% and 78% respectively, were irrelevant (showing comment lines or import statements, which are not helpful). This lack of relevance is explained by the fact that textual features cannot distinguish between relevant and irrelevant matches.

4.4. Comparison with Gold Standard

We also evaluated the quality of the code examples of eXoaDocs in terms of ranking and summarization.

Ranking precision/recall. For ranking precision and recall, we first constructed a ranking gold standard. To construct the ranking gold standard, we randomly selected 20 API methods. As each API method contained about 80 code example summaries on average and it was hard to evaluate all of them, we collected 2k code examples from each API method, where k is the number of usages (or clusters) of each API method. The 2k code examples consisted of the top-k summaries selected by eXoaDocs and another k summaries not selected by eXoaDocs among all candidate summaries. As a result, 130 code examples were selected from the 20 API methods, and these code examples were presented to five human assessors. We asked each human assessor to mark the k best code examples for each API and determined the gold standard set, marked as R_G, using majority voting. As a result, 41 code examples were selected as R_G.


Fig. 8. Example of the difference between the gold standard and our slicing technique for the API 'Timer.getDelay().' The underlined lines were selected by the gold standard only.

To measure the quality of the eXoaDocs results R_E, we measured the precision, |R_G ∩ R_E| / |R_E|, and the recall, |R_G ∩ R_E| / |R_G|.

Compared to the gold standard ranking, eXoaDocs achieved high precision and recall from the five assessors: 45% and 71%, respectively.

Summarization precision/recall. Similarly, we constructed a gold standard for summarization. Using the same 20 API methods, we randomly collected the entire code from each API method and asked each human assessor to select the lines of code that should appear in the summary. We labeled the set of lines selected by majority voting as L_G and the lines selected by eXoaDocs as L_E. We again measured the precision and recall, that is, |L_G ∩ L_E| / |L_E| and |L_G ∩ L_E| / |L_G|.

Compared to the summarization gold standard, eXoaDocs summarization also achieved both high precision and high recall, that is, 83% for precision and 70% for recall.

We analyzed the lines missed by our summarization step, as the recall was not 100%. Figure 8 shows an example of the difference between the gold standard and our slicing technique. Even though our slicing technique effectively found the more relevant lines, such as lines 4 and 7, it could not find some lines that were less important but gave additional information for understanding the query API method. For example, line 6 initializes the variable "callInterval" used with the query API method, but our slicing technique missed it, as it is not an argument of the query API method.

4.5. Sample Code Examples

In this section, we present some sample code examples in eXoaDocs. Figure 9 shows three code examples. Figure 9(a) is an example of "ResultSet.getBoolean." It contains relevant information about the variables and classes of the query API method. Figures 9(b) and (c) are examples of "StringBuilder.substring." These code examples show that eXoaDocs also presents overloaded API methods well. Such overloaded API methods are detected when we analyze the source code and summarize it into code snippets by building an IAT.

5. USER STUDY

This section presents a user study conducted to evaluate how eXoaDocs affect real software development tasks.


Fig. 9. Samples of code examples in eXoaDocs.

5.1. Study Design

We conducted a user study with 24 subjects (undergraduate computer science students at the Pohang University of Science and Technology). For the development tasks, we asked the subjects to build SQL applications using the java.sql package. The subjects were randomly divided into two groups that used either eXoaDocs or JavaDocs to complete the following tasks.

— Task1. Establish a connection to the database.
— Task2. Create SQL statements.
— Task3. Execute the SQL statements.
— Task4. Present the query results.
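The study materials themselves are not reproduced here; the following JDBC sketch is only a rough illustration of the kind of code the four tasks call for (the connection URL, credentials, and table schema are placeholders):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class SqlTasksSketch {
        public static void main(String[] args) throws SQLException {
            // Task1: establish a connection to the database (placeholder URL and credentials).
            try (Connection con = DriverManager.getConnection("jdbc:somedb://localhost/testdb", "user", "pw");
                 // Task2: create a SQL statement.
                 Statement stmt = con.createStatement()) {

                // Task3: execute the SQL statements (placeholder schema and data).
                stmt.executeUpdate("CREATE TABLE person (id INT, name VARCHAR(32))");
                stmt.executeUpdate("INSERT INTO person VALUES (1, 'Alice')");

                // Task4: present the query results.
                try (ResultSet rs = stmt.executeQuery("SELECT id, name FROM person")) {
                    while (rs.next()) {
                        System.out.println(rs.getInt("id") + "\t" + rs.getString("name"));
                    }
                }
            }
        }
    }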

The goal of our user study was to validate the claim that automatically annotating API methods with usage examples can (1) increase productivity, (2) improve the quality of code, and (3) reduce cognitive load, both implicitly (based on automatically logged user development behaviors) and explicitly (using a post-study survey). We compared the following metrics to measure productivity and code quality.


Table X. Averaging the Cumulative Completion Time of Only Those Who Completed the Task (min:s)

Group        Task1    Task2    Task3 & Task4
eXoaGroup    8:53     23:34    30:32
JDocGroup    14:40    25:03    32:03


5.1.1. Productivity. To evaluate productivity, we used the following measures.

— Overall task completion time measures the overall time to finish each given task.
— API document lookup measures the number of lookups, which suggests the scale of development process disturbance.

5.1.2. Code Quality. We prepared a test suite that created a table, inserted two tuples into the table, and printed the tuples using the submitted subject code. Based on whether the presented tuples matched the tuples inserted by the test suite, we classified the tasks submitted by subjects as "pass" or "fail."
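As a rough sketch of this pass/fail check (the expected tuples and the runSubjectCode hook are hypothetical; the actual test suite is not shown in this article), the harness compares the tuples printed by the subject's code against the tuples it inserted:

    import java.util.List;

    public class CodeQualityCheckSketch {
        // Hypothetical hook: runs the subject's submission and returns the lines it printed.
        static List<String> runSubjectCode() {
            return List.of("1\tAlice", "2\tBob");
        }

        public static void main(String[] args) {
            // Tuples the test suite inserted into the freshly created table (illustrative values).
            List<String> expected = List.of("1\tAlice", "2\tBob");
            List<String> printed = runSubjectCode();
            // A task submission passes only if the printed tuples match the inserted tuples.
            System.out.println(printed.equals(expected) ? "pass" : "fail");
        }
    }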

5.1.3. Cognitive Load. We asked the subjects to quantify the usefulness of the API documents and code examples on a scale of 1 to 5.

5.2. Participants

We first conducted a pre-study survey of the subjects to observe their prior development experience. We enumerated example development tasks with varying degrees of difficulty and asked the subjects to choose the tasks they could perform. On the basis of the survey, we observed that none had prior development experience with the java.sql package.

We divided the subjects into two groups: "JDocGroup," referring to the group that used regular JavaDocs (the control group), and "eXoaGroup," referring to the group that used eXoaDocs. For a fair study, using the pre-study survey, we first categorized the subjects into four levels based on their Java expertise, because experience with the Java language varied in general. We then randomly divided an equal number of subjects from each level into the two groups. If the subjects had known which group they belonged to, it could have affected the study result. To prevent this problem, we mirrored JavaDocs and hosted both JavaDocs and eXoaDocs on the same server.

5.3. Study Result

5.3.1. Productivity. We first evaluated the productivity of the two groups. We automatically logged all the document lookups carried out by all subjects. Based on the log, we measured the task completion time and the number of document lookups.

We only considered the development time of subjects who successfully completed each task. Specifically, as subjects tackled tasks in a linear order, we concluded that Task_i was completed when the subject referred to the relevant documents for the next Task_i+1. To distinguish whether subjects were simply browsing a document or actually referring to it, we checked the time spent on the given document. We assumed that the user had started on Task_i only when the respective document was referred to for more than 30s.
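A minimal sketch of this rule, under assumptions of ours (a simplified log of timestamped lookups labeled with the task each document belongs to, and dwell time approximated by the gap to the next lookup), is shown below; it is not the logging code used in the study.

    import java.util.ArrayList;
    import java.util.List;

    public class CompletionTimeSketch {
        // One logged lookup: seconds since the study started and the task whose document was opened.
        record Lookup(int second, int task) {}

        public static void main(String[] args) {
            // Illustrative log entries only.
            List<Lookup> log = List.of(
                    new Lookup(0, 1), new Lookup(200, 1),
                    new Lookup(530, 2),               // first long lookup of a Task2 document
                    new Lookup(900, 2), new Lookup(1400, 3));

            List<Integer> completionSeconds = new ArrayList<>();
            int currentTask = 1;
            for (int i = 0; i < log.size(); i++) {
                Lookup cur = log.get(i);
                // Dwell time approximated by the gap to the next lookup (assumption).
                int dwell = (i + 1 < log.size()) ? log.get(i + 1).second() - cur.second() : Integer.MAX_VALUE;
                // Task i counts as complete when a document for Task i+1 is referred to for more than 30s.
                if (cur.task() == currentTask + 1 && dwell > 30) {
                    completionSeconds.add(cur.second());
                    currentTask++;
                }
            }
            System.out.println("Cumulative completion times (s): " + completionSeconds);
        }
    }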

Table X shows the average completion time for each group. The task completion times for eXoaGroup were shorter than those for JDocGroup. For Task1, eXoaGroup was up to 67% faster in terms of average completion time. Note that the completion times for Task3 and Task4 are the same, since both tasks referred to the same document. This improvement was statistically significant according to a Student's t-test at the P < 0.05 level for Task1. Further, the 90% confidence intervals are [4:29, 13:16] for eXoaGroup and [10:44, 18:37] for JDocGroup.


Table XI. Average Number of Document Lookups

Group        Total lookups    Distinct lookups    Relevant lookups    Relevant/distinct
eXoaGroup    5.67             3.25                2.33                0.72
JDocGroup    17.58            7.5                 3.25                0.43

Table XII. Number of Subjects that Successfully Completed the Test Suite for Each Task

Group        Task1    Task2    Task3    Task4
eXoaGroup    11       11       3        3
JDocGroup    10       8        4        1

Note: More subjects in eXoaGroup than in JDocGroup successfully completed the tasks.


Table XI compares the document lookup behavior of the two groups. A larger number of lookups indicates that the subjects referred to many pages to understand the correct usage. We present the total number of document lookups, the number of distinct lookups (excluding duplicates), and the number of relevant lookups, counting only lookups of pages directly related to the given tasks. Lastly, relevant/distinct indicates how many of the distinct documents referred to were actually relevant to the given task, which suggests the hit ratio of the referenced documents.

eXoaGroup referred to a significantly smaller number of documents, implying less disturbance of the software development process than for JDocGroup. For instance, JDocGroup referred to overall three times more documents than did eXoaGroup. This improvement was statistically significant according to a Student's t-test at the P < 0.01 level for total lookups, and the 90% confidence intervals were [3.57, 7.76] for eXoaGroup and [11.60, 23.56] for JDocGroup. Meanwhile, the hit ratio was significantly higher for eXoaGroup (72%) than for JDocGroup (43%).

5.3.2. Code Quality. Only a few subjects completed all of the tasks correctly within the specified 40 min. In JDocGroup, only one subject passed the test suite, while three subjects in eXoaGroup passed. Table XII reports the number of correct submissions from each group. Generally, more subjects from eXoaGroup correctly completed each task. For instance, the ratios of subjects correctly completing Task1, Task2, and Task4 in eXoaGroup to those finishing each task in JDocGroup were, respectively, 110%, 138%, and 300%. Note that not only did more subjects in eXoaGroup correctly complete each task, but they also completed it in considerably less time.

Surprisingly, for Task3, the number of JDocGroup subjects passing the test suite was slightly higher than that of eXoaGroup. This exception is due to the special nature of the JavaDocs for "ResultSet" used in Task3. The original JavaDocs include manually developed examples, which are succinct and well designed, to help developers understand "ResultSet." As a result, the subjects in both groups leveraged these examples for Task3, which explains why the performance difference between the two groups for Task3 was marginal. This result again confirms that examples in API documents are indeed very useful for software development tasks.

5.3.3. Cognitive Load. We surveyed the subjects to measure the cognitive load of the given tasks. Table XIII summarizes the subjects' responses to a post-study survey asking how useful the given API documents were. While the majority (7 out of 12) of eXoaGroup answered that the given documents were extremely useful or very useful, only 3 out of 12 from JDocGroup answered that the documents were useful. This indicates that eXoaDocs significantly enhance the usefulness of the documents.


Table XIII. Post-Study Survey Result: Were the API Documents Helpful?

Group        Extremely useful    Very useful    Useful    Somewhat useful    Not useful
eXoaGroup    3                   4              3         2                  0
JDocGroup    1                   2              5         2                  2

Note: The majority in eXoaGroup answered that the given API documents were extremely useful or very useful.

Table XIV. Post-Study Survey Result: Were the Given Code Examples Helpful?

Group        Extremely useful    Very useful    Useful    Somewhat useful    Not useful
eXoaGroup    1                   4              3         2                  0
JDocGroup    0                   2              1         1                  0

Note: The majority in eXoaGroup answered that the given code examples were useful.


Next, we asked whether the subjects used the code examples provided in the documents for the development tasks. The vast majority in eXoaGroup (10 out of 12) answered in the affirmative, while only a few (4 out of 12) from JDocGroup used the code examples.

Table XIV shows the survey results regarding the usefulness of the given code examples. Five out of ten in eXoaGroup answered that the given examples were extremely helpful or very helpful. We also received the following positive comments from the subjects in eXoaGroup.

Subject 1. "Enough examples and resources for the given task were provided in the given document."

Subject 2. “Usages of parameters and return types were very helpful.”

6. DISCUSSION

6.1. Java Developer Comments

As an indicator for evaluating the usefulness of eXoaDocs, we released eXoaDocs at http://exoa.postech.ac.kr and sent it to professional Java developers with a request for feedback on its usefulness. While the number of responses was not large, all of the feedback received on the generated eXoaDocs was positive.

Joel Spolsky⁴. "I think this is a fantastic idea. Just yesterday, I was facing this exact problem... the API documentation wasn't good enough, and I would have killed for a couple of examples. It sounds like a very smart idea..."

Developer 2. "Automatic example finding sounds really good. It would help developers significantly. In fact, I struggled many times to understand APIs without examples."

Developer 3. "API documents that have code examples are really helpful. Using them, I can reduce the development time significantly. However, JavaDocs provides very few code examples, so I need to find additional code examples. I like MSDN because most of the methods contain code examples. For this reason, automatically finding and adding code examples seems a wonderful idea..."

⁴ Software engineer, blogger, author of Joel on Software, and creator of StackOverflow.



6.2. User Feedback

The current eXoaDocs embed code examples selected by the organization algorithms described in Section 3. We designed three algorithms to select the most useful code examples. However, they may not work well for all API methods. They may also select code examples that are obsolete or include bugs [Kim et al. 2006]. Automatic removal of such obsolete or bug-ridden examples is a research challenge.

To address these issues, we are developing adaptive organization algorithms based on developer feedback [Joachims 2002; Radlinski and Joachims 2007; Radlinski et al. 2008], such that developers can adjust the rankings of code examples in eXoaDocs by clicking the buttons shown in Figure 2(c). Thus far, we have not received sufficient feedback from developers, since eXoaDocs was at an early stage of development at the time of writing this article. However, as we receive more feedback from developers, more important code examples will get positive feedback. This will increase the usefulness of eXoaDocs.

6.3. Source Code Repository

Currently, our eXoaDocs depend on the Koders code search engine. We generated queries from API documents, sent the queries to Koders, and collected up to 200 pieces of source code for each API method. The quality of the extracted code examples thus relies heavily on the quality of the Koders search results, and this quality may not be optimal in some cases, as illustrated by the examples in Figure 1.

Carefully collecting source code from well-managed projects, such as those of the Apache Software Foundation [ASF 2010], or manually collected code examples, such as Java Examples [Java Examples 2010], and indexing them may yield higher-quality code example extractions. Building our own source code repository for eXoaDocs remains future work.

6.4. Summarization

Currently, our API slicing technique only considers related classes, arguments, and the query API method, using a rule-based approach. Though this rule-based approach effectively finds relevant information, it has some limitations.

First, API methods appearing before or after the query API method are not considered in our approach, but a sequence of such API methods can be a useful feature when the API methods are order sensitive. However, for the remaining API methods that are not order sensitive, such a sequence representation would lead to false positives and also incur the overhead of searching sequences, which is more expensive than searching vectors. Finding the sweet spot in this trade-off is non-trivial, and we leave it as a future research direction. Second, our current approach cannot find correct type information for polymorphic methods and cannot cover some cases, such as instance variables of the class. Improving our summarization module remains future work.

6.5. Extension of IR Metrics for Evaluation

When we compared the quality of code examples between eXoaDocs and other code search engines (Table IX), we did not consider classical metrics, such as normalized discounted cumulative gain (nDCG) and mean reciprocal rank (MRR), because they penalize diversified results that cover multiple intents, and covering multiple intents is a key goal of our work. Evaluating a diversity-aware MRR, that is, MRR–IA [Agrawal et al. 2009], would be relevant; it is defined as follows.



MRR–IA(Q, k) = Σ_c P(c|q) · MRR(Q, k|c).

However, this would require knowledge of P(c|q), which can be obtained from user feedback or the eXoaDocs log. As we are currently collecting such data, using the log for an MRR–IA evaluation will be an interesting future topic.
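Assuming P(c|q) and a per-intent labeling of results were available (e.g., from such a log), MRR–IA could be computed along the following lines; all data in the sketch is illustrative and not drawn from our system.

    import java.util.List;
    import java.util.Map;

    public class MrrIaSketch {
        // Reciprocal rank of the first top-k result relevant to the given intent; 0 if none.
        static double reciprocalRank(List<String> ranking, int k, String intent, Map<String, String> intentOf) {
            for (int rank = 1; rank <= Math.min(k, ranking.size()); rank++) {
                if (intent.equals(intentOf.get(ranking.get(rank - 1)))) {
                    return 1.0 / rank;
                }
            }
            return 0.0;
        }

        public static void main(String[] args) {
            // A ranked list of snippets and the usage intent each one covers (illustrative).
            List<String> ranking = List.of("s1", "s2", "s3", "s4");
            Map<String, String> intentOf = Map.of("s1", "usageA", "s2", "usageB", "s3", "usageA", "s4", "usageC");
            // P(c|q): the probability of each usage intent for the query (would come from feedback or logs).
            Map<String, Double> pIntent = Map.of("usageA", 0.5, "usageB", 0.3, "usageC", 0.2);

            int k = 3;
            double mrrIa = 0.0;
            for (Map.Entry<String, Double> e : pIntent.entrySet()) {
                mrrIa += e.getValue() * reciprocalRank(ranking, k, e.getKey(), intentOf);
            }
            System.out.printf("MRR-IA@%d = %.3f%n", k, mrrIa);   // 0.5*1 + 0.3*0.5 + 0.2*0 = 0.650
        }
    }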

6.6. Popularity Bias

In addition to usage examples, eXoaDocs provide popularity information for each method (Figure 2(a)). To determine popularity, we count the number of code examples generated using our framework. This popularity information is valuable, especially for new developers who do not know which methods to use for their tasks. However, this information can also be misleading for new API methods. While our tool can automatically include code examples as soon as they are available, the popularity of new API methods will be low until enough code examples have accumulated. This is commonly known as the "new item" problem [Balabanovic and Shoham 1997] in recommendation systems. This problem can be partially addressed by introducing special marks for newly added API methods.

6.7. Threats to Validity

We identified the following threats to validity.

— The API documents used may not be representative. The JavaDocs for JDK 5 were used in this article. Since we intentionally chose the most popular and commonly used JavaDocs, we may have a document selection bias.

— The assigned tasks may not be representative. We assigned database-related programming tasks to subjects for our case study, which may or may not be representative scenarios for using eXoaDocs.

— The subject groups may not have been divided equally. We divided the subjects into four levels based on a pre-study survey of prior programming experience and randomly divided each level into two groups. However, it is still possible that programming experience varied between the two groups. We also acknowledge that the subjects may not properly represent industrial developers.

7. RELATED WORK

This section provides an overview of related research efforts and discusses how our work is distinguished from them. This work is based on and significantly extends the ideas briefly sketched in Kim et al. [2009] and their formal descriptions [Kim et al. 2010] by adding Section 3, which investigates the solution space of organization algorithms. In addition, the evaluations in Section 5 were extended to use a public gold standard set built by human assessors, which enables follow-up work to compare its quality against our system.

7.1. Example Recommendation

Since code examples are frequently requested artifacts, there have been efforts to recommend good examples. For example, PHP API documents [PHP 2010] provide a large set of good code examples as user-contributed notes. Similarly, Java Examples [Java Examples 2010] and KodeJava [KodeJava 2010] have recently collected user-contributed code examples. Often these manually developed and recommended code examples are of high quality, and they play an important role in understanding APIs. However, they largely rely on human effort.


To address the limitations of human effort, many automatic code recommendation systems have been proposed [Holmes and Murphy 2005; Sahavechaphan and Claypool 2006; Xie and Pei 2006; Zhong et al. 2009]. These systems either take specific keywords from developers or automatically generate keywords. They then query their source code repositories to retrieve useful code examples for the query keywords or working source code.

Holmes and Murphy [2005] proposed a technique that recommends source code examples from a repository by matching structures of given code. XSnippet [Sahavechaphan and Claypool 2006] provides a context-sensitive code assistant framework that supplies sample source code snippets to developers.

MAPO [Xie and Pei 2006; Zhong et al. 2009] extracts frequent API method usage sequences, clusters source code with similar sequences, and presents the clusters. Although this work shares similarities with our approach, we observe the following key differences. First, because not all API method calls are order sensitive, abstracting code as API method call sequences may separate two code examples with the same semantics into two clusters. Similarly, sequences with and without optional calls are divided into different clusters, which may lead to too many clusters, while our vector-based abstraction generates a single cluster in these two cases. Second, MAPO uses the API method call sequence as its summarization and does not provide information on its context, such as how the API method arguments were defined and populated, while our summarization method provides both the API method call and its context.
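The following toy sketch (a deliberate simplification of our semantic vectors to a bag of API calls, with made-up fragments) illustrates why an order-insensitive abstraction keeps such variants in one cluster while a sequence abstraction separates them:

    import java.util.List;
    import java.util.Map;
    import java.util.TreeMap;

    public class AbstractionSketch {
        // Order-insensitive abstraction: API calls and their frequencies.
        static Map<String, Integer> asVector(List<String> calls) {
            Map<String, Integer> vector = new TreeMap<>();
            for (String call : calls) {
                vector.merge(call, 1, Integer::sum);
            }
            return vector;
        }

        public static void main(String[] args) {
            // Two fragments that use the same API methods, differing only in the order of two
            // order-insensitive setter calls.
            List<String> fragmentA = List.of("Statement.setFetchSize", "Statement.setQueryTimeout", "Statement.executeQuery");
            List<String> fragmentB = List.of("Statement.setQueryTimeout", "Statement.setFetchSize", "Statement.executeQuery");

            System.out.println("same sequence? " + fragmentA.equals(fragmentB));                       // false
            System.out.println("same vector?   " + asVector(fragmentA).equals(asVector(fragmentB)));   // true
        }
    }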

These code recommendation approaches are similar to ours in that they use source code repositories and suggest code examples automatically. However, several differences exist. First, their systems require special development environments, such as an Eclipse plug-in, while the code examples in eXoaDocs are accessible without any special tool. Second, our code example presentation algorithms present only the most representative examples, using clustering and ranking based on the extracted features. Lastly, unlike previous approaches that require the API method name to suggest related examples, we provide popularity information to guide developers to frequently used API methods, even when they do not know the names.

Similar to this popularity information, there are existing systems that identify commonly used API methods. SpotWeb [Thummalapenta and Xie 2008] detects hotspots, which are commonly reused API classes and methods. Holmes and Walker developed PopCon [2007, 2008], which determines the popularity of API methods based on their calls in existing software. However, eXoaDocs provide both popularity information and code examples simultaneously.

7.2. Code Search

While code recommendation tools build upon development environments (e.g., Eclipse plug-ins) to consider the current development context for searches, a more general approach would be a stand-alone search engine. Commercial code search engines, such as Koders [Koders 2010] and Google Code Search [Google 2010], take keywords as queries and retrieve source code in which the query keywords occur frequently. As they use keyword-based search, they usually match the query keywords with the comments in the source code. As a result, they work well for comment search. However, such comments are often sparse and not useful when developers want to find an example of a specific API method, so these engines fail to retrieve illustrative code examples, as shown in Section 4.3. In such cases, an API-based code recommendation framework is more effective.

Meanwhile, DeMIMA [Gueheneuc and Antoniol 2008] and DECKARD [Jiang et al. 2007] studied the semantic features of source code to tackle a different problem: searching for clones when a code segment is the query. As these approaches require a code segment as a query, we cannot use them to find examples for an API document, but we adopt their notion of semantic features.



8. CONCLUSION

In this article, we introduced a new code example recommendation system for intelligent code search that embeds API documents with high-quality code example summaries mined from the Web. We proposed three organization algorithms that select the most representative code examples to best satisfy user intent. For JDK 5, eXoaDocs augment about 75% of the methods with code examples. Compared with Koders and Google Code Search, 92% of eXoaDocs results showed proper API usage examples, while only 22% and 12% of the top query results in Koders and Google Code Search, respectively, showed appropriate API method usage in their snippets. In addition, comparison with the gold standard showed that our summarization and ranking techniques are highly precise, and our user study results indicate that the code examples in eXoaDocs improve programmer productivity.

As future directions, we will improve the system by (1) developing example ranking that incorporates user feedback, (2) enhancing summarization with more precise analysis that also considers data types, and (3) building our own codebase to eliminate the bias incurred by using Koders.

REFERENCES

Agrawal, R., Gollapudi, S., Halverson, A., and Ieong, S. 2009. Diversifying search results. In Proceedings of the 2nd ACM International Conference on Web Search and Data Mining (WSDM'09).
ASF. 2010. The Apache Software Foundation. http://www.apache.org/.
Bajracharya, S. and Lopes, C. 2009. Mining search topics from a code search engine log. In Proceedings of the 6th Working Conference on Mining Software Repositories (MSR'09).
Balabanovic, M. and Shoham, Y. 1997. Fab: Content-based, collaborative recommendation. Commun. ACM.
Campbell, N. A. 1984. Some aspects of allocation and discrimination. In Multivariate Statistical Methods in Physical Anthropology. Springer, Berlin, 177–192.
Devanbu, P., Karstu, S., Melo, W., and Thomas, W. 1996. Analytical and empirical evaluation of software reuse metrics. In Proceedings of the International Conference on Software Engineering (ICSE'96).
Erkan, G. and Radev, D. R. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. J. Artif. Intell. Res.
Fleiss. 2010. Fleiss' kappa. http://en.wikipedia.org/wiki/Fleiss'_kappa.
Foody, G. M., Campbell, N., Trod, N., and Wood, T. 1992. Derivation and applications of probabilistic measures of class membership from the maximum likelihood classification. Photogramm. Eng. Remote Sens. 58, 1335–1341.
Gaffney, J. E. and Durek, T. A. 1989. Software reuse—key to enhanced productivity: Some quantitative models. Inf. Softw. Technol.
Google. 2010. Google Code Search. http://www.google.com/codesearch.
Gueheneuc, Y.-G. and Antoniol, G. 2008. DeMIMA: A multilayered approach for design pattern identification. IEEE Trans. Softw. Eng.
Holmes, R. and Murphy, G. C. 2005. Using structural context to recommend source code examples. In Proceedings of the International Conference on Software Engineering (ICSE'05).
Holmes, R. and Walker, R. J. 2007. Informing Eclipse API production and consumption. In Proceedings of the ETX-Eclipse Technology Exchange Conference.
Holmes, R. and Walker, R. J. 2008. A newbie's guide to Eclipse APIs. In Proceedings of the 5th Working Conference on Mining Software Repositories (MSR'08).
Horwitz, S., Reps, T., and Binkley, D. 1988. Interprocedural slicing using dependence graphs. In Proceedings of the Conference on Programming Language Design and Implementation (PLDI'88).
Jaccard, P. 1901. Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bulletin de la Société Vaudoise des Sciences Naturelles.
Java Examples. 2010. Example source code. http://java2s.com/.
Java2Xml. 2010. Java2XML Project Home Page. https://java2xml.dev.java.net/.
Jiang, L., Misherghi, G., Su, Z., and Glondu, S. 2007. DECKARD: Scalable and accurate tree-based detection of code clones. In Proceedings of the International Conference on Software Engineering (ICSE'07).
Joachims, T. 2002. Optimizing search engines using clickthrough data. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD'02).
Kim, J., Lee, S., Hwang, S.-W., and Kim, S. 2009. Adding examples into Java documents. In Proceedings of the International Conference on Automated Software Engineering (ASE'09).
Kim, J., Lee, S., Hwang, S.-W., and Kim, S. 2010. Towards an intelligent code search engine. In Proceedings of the National Conference of the American Association for Artificial Intelligence (AAAI'10).
Kim, S., Pan, K., and Whitehead, Jr., E. E. J. 2006. Memories of bug fixes. In Proceedings of the ACM SIGSOFT Symposium on the Foundations of Software Engineering.
KodeJava. 2010. Learn Java Programming by Examples. http://www.kodejava.org/.
Koders. 2010. Open Source Code Search Engine. http://www.koders.com.
Leopard. 2010. Leopard Reference Library. http://developer.apple.com/referencelibrary/index.html.
Lim, W. C. 1994. Effects of reuse on quality, productivity, and economics. IEEE Softw.
MSDN. 2010. MSDN Library. http://msdn.microsoft.com/en-us/library/default.aspx.
Page, L., Brin, S., Motwani, R., and Winograd, T. 1999. The PageRank citation ranking: Bringing order to the web. Tech. rep.
Pelleg, D. and Moore, A. 2000. X-means: Extending k-means with efficient estimation of the number of clusters. In Proceedings of the International Conference on Machine Learning (ICML'00).
PHP. 2010. Hypertext Preprocessor. http://www.php.net.
Radlinski, F. and Joachims, T. 2007. Active exploration for learning rankings from clickthrough data. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD'07).
Radlinski, F., Kurup, M., and Joachims, T. 2008. How does clickthrough data reflect retrieval quality? In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM'08).
Sahavechaphan, N. and Claypool, K. T. 2006. XSnippet: Mining for sample code. In Proceedings of the Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA'06).
Thummalapenta, S. and Xie, T. 2008. SpotWeb: Detecting framework hotspots via mining open source repositories on the web. In Proceedings of the Working Conference on Mining Software Repositories (MSR'08).
Weiser, M. 1981. Program slicing. In Proceedings of the International Conference on Software Engineering (ICSE'81).
Xie, T. and Pei, J. 2006. MAPO: Mining API usages from open source repositories. In Proceedings of the Working Conference on Mining Software Repositories (MSR'06).
Xu, R. and Wunsch, I. 2005. Survey of clustering algorithms. IEEE Trans. Neural Networks.
Zhong, H., Xie, T., Zhang, L., Pei, J., and Mei, H. 2009. MAPO: Mining and recommending API usage patterns. In Proceedings of the European Conference on Object-Oriented Programming (ECOOP'09).

Received June 2011; revised March, July 2012; accepted September 2012
