
Indexing by Latent Dirichlet Allocation and an Ensemble Model*

Yanshan Wang
School of Industrial Management Engineering, Korea University, Seongbuk-gu, Seoul, Korea. E-mail: [email protected]

Jae-Sung Lee
Diquest, #501, Kolon Villant 2, Guro-gu, Seoul, Korea. E-mail: [email protected]

In-Chan Choi**
School of Industrial Management Engineering, Korea University, Seongbuk-gu, Seoul, Korea. E-mail: [email protected]

The contribution of this article is twofold. First, we present Indexing by latent Dirichlet allocation (LDI), an automatic document indexing method. Many ad hoc applications, or their variants with smoothing techniques suggested in LDA-based language modeling, can result in unsatisfactory performance as the document representations do not accurately reflect concept space. To improve document retrieval performance, we introduce a new definition of document probability vectors in the context of LDA and present a novel scheme for automatic document indexing based on LDA. Second, we propose an Ensemble Model (EnM) for document retrieval. EnM combines basic indexing models by assigning different weights and attempts to uncover the optimal weights to maximize the mean average precision. To solve the optimization problem, we propose an algorithm, which is derived based on the boosting method. The results of our computational experiments on benchmark data sets indicate that both the proposed approaches are viable options for document retrieval.

Introduction

With the continuous growth in the size of information sources available on the web, document indexing (DI) is gaining attention as a crucial technique to retrieve significant information for users (Choi & Lee, 2010). Using DI, text documents are converted into index terms or document features that can be trivially analyzed by computers. With an appropriate ranking function or retrieval model, DI has been shown to be effective for document retrieval (Croft, Metzler, & Strohman, 2010).

Exact keyword matching is one of the earliest and simplest methods applied to DI. In retrieval models (such as the Boolean model) that use exact keyword matching, documents that contain each index term in the query are retrieved from the document set. Users are provided with these documents to obtain the desired information. The drawback of exact keyword matching is that a partial match or ranking of the retrieved documents is not considered, which often leads to poor retrieval results.

In order to overcome the aforementioned drawback, the vector space model (VSM) was proposed to represent a document as a vector of index term weights so that relevant documents can be retrieved (Caid, Dumais, & Gallant, 1995; Salton & McGill, 1986; Salton, Wong, & Yang, 1975). In general, VSM-based retrieval models comprise three stages: (a) removing nonsignificant words from the documents using a stop words list; (b) weighting the indexed terms into a vector space; and (c) ranking the documents with respect to the input query according to different similarity measures. Many variants of the VSM have been developed by modifying the weighting scheme in the second stage. One of the most well-known weighting schemes is the tf-idf (term frequency–inverse document frequency) scheme (Salton & McGill, 1986). In this scheme, each document (as well as each query) is represented by a fixed vector where each component is the tf-idf term weight.

*A portion of the work on LDI appeared in the Proceedings of the 2010 International Conference on Data Mining.

**Corresponding author.

Received January 7, 2014; revised September 28, 2014; accepted September 29, 2014

© 2015 ASIS&T • Published online in Wiley Online Library (wileyonlinelibrary.com). DOI: 10.1002/asi.23444


One of the drawbacks of the tf-idf scheme is that it ignores the conceptual meaning of words, and therefore, it suffers from difficulties associated with synonymy and polysemy (Salton & McGill, 1986). In addition to the conventional tf-idf scheme, Okapi BM25 is used to estimate term importance (Robertson et al., 1995). However, Okapi BM25 also ignores the interrelationship between terms within a document (Büttcher, Clarke, & Lushman, 2006).
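
To make the weighting scheme concrete, the following is a minimal sketch of tf-idf vector construction; the function name, the raw-count tf, and the unsmoothed idf variant are our own illustrative choices rather than the exact formulation used in the cited work.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build tf-idf weight vectors for a list of tokenized documents."""
    M = len(docs)
    # df: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    vocab = sorted(df)
    idf = {t: math.log(M / df[t]) for t in vocab}
    # Each document becomes a fixed-length vector of tf * idf weights.
    return vocab, [[Counter(doc)[t] * idf[t] for t in vocab] for doc in docs]
```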

Document modeling (DM) is the task of finding inherent structures and features in documents. It provides a new means for DI because document features can be regarded as index terms. Among the DM methods, the concept model analyzes documents in a concept space by considering a hidden layer of the interweaving relationships between terms, which allows it to overcome the difficulties of VSM using concept representation for documents. The concept model assumes that terms appearing frequently in documents are likely to be related to each other through unidentified concepts (Guha, McCool, & Miller, 2003; Gunter, 2009; Nikravesh, 2007; Trillo, Po, Ilarri, Bergamaschi, & Mena, 2011). Latent semantic analysis (LSA) is a successful deterministic approach in concept modeling (Deerwester, Dumais, Landauer, Furnas, & Harshman, 1990). LSA uses singular value decomposition (SVD) and projects high-dimensional data into a lower dimensional space to overcome the overfitting problem. In addition, LSA captures some aspects of synonymy and polysemy by deriving conceptual features from the original tf-idf weights. These conceptual features can be used for automatic indexing and retrieval. Despite the merits of LSA, however, it is difficult to interpret the outputs obtained from the analysis, especially when the output values are negative.

The probabilistic concept model is a probabilistic approach to modeling documents in terms of latent concepts. Indexing and retrieval using probabilistic concept models are based on the assumption that the concepts are distributed differently in relevant and nonrelevant documents. Probabilistic latent semantic analysis (pLSA) (Hofmann, 2001), a generative probabilistic concept model based on LSA, provides clarity in the interpretation of output values because they have meanings in terms of probability. Probabilistic latent semantic indexing (pLSI) is an application of pLSA in automatic indexing. pLSI has a limitation in that it does not explicitly show how to assign probabilities to documents that are not in the training set, and it has to deal with the difficulties of parameter interpretation. Further, pLSI tends to overfit training data sets because the number of parameters grows with the number of documents in a corpus.

Latent Dirichlet allocation (LDA) is an alternative generative probabilistic concept model (Blei, Ng, & Jordan, 2003). By importing a symmetric Dirichlet prior, LDA resolves the limitations associated with pLSI (Wallach, Mimno, & McCallum, 2009). LDA addresses the topic-based structural analysis of corpora, and thus it can be regarded as a model for topic search. LDA is mainly used in DM and classification, and only a few research studies, such as Wei and Croft (2006), apply LDA in the context of query searches. These studies mainly focus on methods of avoiding the assignment of a zero value to the conditional probability of a query given a document. For example, the use of gamma values in LDA, or some forms of smoothing models, have been proposed (Azzopardi, Girolami, & Van Rijsbergen, 2004). However, it is still unclear how to apply LDA to automatic indexing for the query search.

A ranking function is used in conjunction with DI to calculate the relevance between a document and an input query. In general, the degree of relevance between the document and the query is determined using similarity measures. There are many approaches to measuring similarity, such as relevance scores (Singhal, 2001); however, most approaches utilize empirical functions based on experimental evaluations. A widely used similarity measure is cosine similarity, which calculates the cosine value of the angle between the query vector and the document vector (Salton & McGill, 1986). The advantage of the cosine similarity measure is that it considers two documents as identical if they have the same term frequencies, regardless of the number of terms used. The cosine similarity measure is generally used with indexing methods for examining retrieval performance (Deerwester et al., 1990; Hofmann, 2001). Because we intend to compare the retrieval performances of different indexing methods, we employ the cosine similarity measure in our approach as well.
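
A minimal sketch of the cosine measure just described (the function name is ours): it depends only on the angle between the vectors, so scaling all term frequencies of a document leaves its similarity scores unchanged.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Proportional vectors give cosine = 1; orthogonal vectors give 0.
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```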

Unlike a retrieval model that integrates an indexing method and a ranking function, an ensemble model (EnM) is a discriminative retrieval model based on a linear combination of a number of basis models. An EnM ranks documents according to the summation of weighted similarity values that are computed by the constituent indexing models. However, different weights assigned to the constituent models result in distinct performances for document retrieval. The optimization problem of calculating the weights of an EnM has been well studied in the classification domain, whereas it is rarely tackled in the information retrieval domain. AdaRank (Xu & Li, 2007) is by far the only well-known learning algorithm that uses the AdaBoost framework (Freund & Schapire, 1997). AdaRank repeatedly constructs weak rankers and finally combines them linearly into a strong ranker. The disadvantage of AdaRank is that the number of iterations is difficult to decide; thus, a sufficiently large, human-determined constant is assigned to the number of iterations. iRANK (Wei, Li, & Liu, 2010) is also a recently proposed ensemble model; however, it can only combine two rankers.

Our contribution is divided into two parts in this article. In the first part, we propose an LDA-based probabilistic topic search model that identifies a set of documents that closely matches a given set of query terms in a topic space. This new method is called indexing by latent Dirichlet allocation (LDI). In this method, we use LDA to analyze the document structure according to the hidden topics, and index documents in terms of topics by defining novel document probability vectors in topic space. Then, we use these vectors to calculate the similarity between documents and queries. In doing so, we can identify the conceptually relevant documents given an input query, and thus mitigate the common linguistic phenomena of synonymy and polysemy.

In the second part, we propose the EnM, which directly maximizes the mean average precision (MAP) in order to obtain the optimal weights assigned to the constituent models. For solving the optimization problem, we propose an algorithm, namely EnM.B, based on the boosting scheme. The differences between EnM.B and AdaRank are that, in our algorithm, the number of iterations is not predetermined and the constituent models are not randomly generated during the iterations. Overall, the discriminative EnM performs better than the generative LDI. However, the EnM suffers from the disadvantages of the discriminative model; that is, it needs training examples that may not exist in a data set. Therefore, we suggest that (a) in systems that contain labeled items, such as library systems, the EnM is an optional method; and (b) in other systems that do not contain labeled items, such as search engines, LDI is a viable choice.

The remainder of this article is organized as follows. In the next section, the background of LDA is presented. We then demonstrate the LDI and EnM models. Computational results of the proposed methods on publicly available data sets are subsequently reported. Some concluding remarks are offered in the last section.

Latent Dirichlet Allocation

The LDA model is based on the bag-of-words assumption that the order of words in a document can be neglected, and on the assumption that all documents in a corpus share a number of latent topics. Under these assumptions, a collection of documents can be represented by a collection of topics, and each document can be represented by a mixture of these latent topics in specific proportions.

LDA is a typical directed probabilistic graphical model. It has a clear three-level structure: corpus level, document level, and word level. Each level is represented by corresponding parameters and random variables. The process of generating words for each document is illustrated as follows (Blei, 2012):

1. Randomly choose a proportion over topics.
2. For each word in the document,
   • Randomly choose a topic from the proportion over topics in step 1.
   • Randomly choose a word from the corresponding proportion over the vocabulary.
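
The process above can be simulated directly. The sketch below samples one document following these steps (plus the Poisson document length used in the formal description later in this section); the function name, the random seed, and the parameter shapes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_document(alpha, beta, xi, vocab):
    """Sample one document by the LDA generative process.

    alpha: Dirichlet parameter of length K; beta: K x V matrix whose rows
    are topic-specific distributions over the vocabulary; xi: Poisson mean
    for the document length.
    """
    n_words = rng.poisson(xi)                  # document length
    theta = rng.dirichlet(alpha)               # proportion over topics
    doc = []
    for _ in range(n_words):
        z = rng.choice(len(alpha), p=theta)    # choose a topic
        w = rng.choice(len(vocab), p=beta[z])  # choose a word given the topic
        doc.append(vocab[w])
    return doc
```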

In the following, we use notation to describe the LDA model mathematically. A document $d$ is defined as a sequence of $N$ words, that is, $d = (w_1, w_2, \ldots, w_N)$, where $w_n$ denotes the $n$th word in the sequence, and a corpus $C$ is defined as a collection of $M$ documents, $C = \{d_1, d_2, \ldots, d_M\}$. From the given corpus, LDA generates hidden topics, which are obtained by inferring the topic mixture $\theta = \{\theta_1, \theta_2, \ldots, \theta_K\}$ at the document level. Further, a set of $N$ topics at the word level is defined as $\mathbf{z} = (z_1, z_2, \ldots, z_N)$. In addition, a vocabulary index set $\{1, 2, \ldots, V\}$ is maintained to indicate whether a particular word is used or not, that is, $w^j = 1$ if the $j$th word in the vocabulary list is used and $w^j = 0$ otherwise. Note that we will also use $w^j$, when used alone unaccompanied by the equality, to denote the $j$th word in the vocabulary list in other sections.

Mathematically, the process by which LDA generates a document consists of three concrete steps (Blei et al., 2003):

1. Choose the number of words $N \sim \text{Poisson}(\xi)$.
2. Choose $\theta \sim \text{Dirichlet}(\alpha)$.
3. For $n = 1, 2, \ldots, N$:
   • Choose a topic $z_n \sim \text{Multinomial}(\theta)$;
   • Choose a word $w_n \sim p(w_n \mid z_n, \beta)$, a multinomial distribution conditioned on the topic $z_n$.

In these steps, the Dirichlet parameter $\alpha \in \mathbb{R}_+^K$, the multinomial parameter $\theta \in \mathbb{R}^K$ with $\theta_k \geq 0$ and $\sum_{k=1}^{K} \theta_k = 1$, and the corresponding dimension $K$ are all assumed to be known. The conditional probability of the $j$th word in the vocabulary list, given that the $k$th topic is selected, is denoted by $\beta_{kj} = p(w^j = 1 \mid z^k = 1)$. Its maximum likelihood estimator can be obtained from a posterior probability distribution. The matrix of the conditional probabilities is denoted by $\beta = [\beta_{kj}] \in \mathbb{R}^{K \times V}$. The joint prior probability distribution $p(\theta, \mathbf{z}, \mathbf{d} \mid \alpha, \beta)$ is expressed as

$$p(\theta, \mathbf{z}, \mathbf{d} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta), \qquad (1)$$

where $p(z_n \mid \theta) = \theta_k$ for the unique $k$ such that $z_n^k = 1$. Here, $z_n^k$ is a binary variable indicating whether the $k$th topic is used in selecting the $n$th word in the document. We note that the superscript represents the order in which a word or a topic appears in the vocabulary list and the topic list, and the subscript denotes the order in which a word or a topic appears in a document.

Introducing the superscript representation makes it easier to derive the marginal probability of the word appearance, obtained by integrating over $\theta$ and summing over $\mathbf{z}$ in the prior, that is,

$$p(\mathbf{d} \mid \alpha, \beta) = \int \sum_{\{z_n\}} p(\theta, \mathbf{z}, \mathbf{d} \mid \alpha, \beta)\, d\theta = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta. \qquad (2)$$

Hence, the corpus probability is obtained as

$$p(C \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d, \qquad (3)$$

JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY—•• 2015 3DOI: 10.1002/asi

where the $\theta_d$'s are variables at the document level, and $w_{dn}$ and $z_{dn}$, sampled once for each word in each document, are variables at the word level.

Using Equations (1) and (2), we can specify the posterior distribution of the hidden variables $\theta$ and $\mathbf{z}$ as

$$p(\theta, \mathbf{z} \mid \mathbf{d}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{d} \mid \alpha, \beta)}{p(\mathbf{d} \mid \alpha, \beta)}. \qquad (4)$$

Owing to the coupling relationship between the variables $\theta$ and $\beta$, the maximum likelihood estimators of the posterior distribution are not tractable. To overcome this difficulty, Blei et al. (2003) introduced free variational parameters $\gamma$ and $\varphi$ for the Dirichlet and multinomial distributions, respectively, and defined the variational distribution

$$q(\theta, \mathbf{z} \mid \gamma, \varphi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \varphi_n). \qquad (5)$$

The maximum likelihood estimators of $\alpha$ and $\beta$ are calculated by the EM algorithm, using Jensen's inequality to obtain a lower bound on the log-likelihood under the variational distribution $q(\theta, \mathbf{z} \mid \gamma, \varphi)$.
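
In practice, estimates of $\alpha$ and $\beta$ can be obtained with an off-the-shelf implementation. The sketch below uses gensim, which trains LDA by (online) variational Bayes rather than the exact batch EM of Blei et al. (2003); the toy corpus and topic count are assumptions for illustration.

```python
from gensim import corpora, models

# Toy corpus: each document is a list of preprocessed tokens.
texts = [["apple", "smartphone", "os"],
         ["apple", "product", "os", "system"],
         ["fry", "apple", "pie", "pea"]]

dictionary = corpora.Dictionary(texts)
bow_corpus = [dictionary.doc2bow(t) for t in texts]

lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=20)

# K x V matrix of p(word | topic): the beta matrix used throughout the text.
beta = lda.get_topics()
```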

Indexing by LDA

In essence, LDA has been used in retrieval models, such as the query likelihood retrieval model, for ranking and retrieving documents. Azzopardi et al. (2004) first used one of the variational parameters, $\gamma$, of LDA in Bayes-smoothing and Jelinek-Mercer-smoothing document retrieval models for ad hoc retrieval. However, $\gamma$ in LDA is auxiliary, and it may not be appropriate for direct use in the query search (Wei & Croft, 2006). Wei and Croft (2006) proposed the LDA-based document model (LBDM) for retrieval tasks by combining the Dirichlet smoothing method and the LDA posterior estimates of $\theta$ and $\varphi$. However, the settings for LDA estimation are based on the training collection, which is coarse in the context of information retrieval. Having said that, these applications have shown the effectiveness of LDA for the retrieval task. Nevertheless, to the best of our knowledge, LDA has rarely been applied to automatic document indexing. Because LDA models documents as mixtures of topics, it provides a new approach for representing documents in a topic space, where the topics can be seen as index terms for indexing. In this section, we first define new term and document representations in the topic space for indexing documents, and second, we demonstrate how these novel representations can be applied to the document retrieval task.

Document Representation in Topic Space

Because we aim to construct explicit document representations associated with topics, our method directly uses the $\beta$ matrix of the LDA model. The conditional probability $\beta_{kj}$ in LDA represents the selection probability of the word $w^j$ given a topic (concept) $z^k$. This value represents the probability of a word given a specific topic, and it is used to identify words that are associated with a topic. However, it may not be used as the probability of a topic given a word. Thus, for characterization, we define the word representation in topic space, $W_j \in \mathbb{R}^K$. The $k$th component $W_j^k$ of $W_j$ represents the probability of word $w^j$ embodying the $k$th concept $z^k$. This quantity can be obtained by Bayes' rule as

$$W_j^k = p(z^k = 1 \mid w^j = 1) = \frac{p(w^j = 1 \mid z^k = 1)\, p(z^k = 1)}{\sum_{h=1}^{K} p(w^j = 1 \mid z^h = 1)\, p(z^h = 1)}. \qquad (6)$$

As the topics are ancillary and unordered, we assume that the probability of a topic selection is uniformly distributed, that is, $p(z^h = 1) = p(z^k = 1)$. It is conceivable that more sophisticated adaptive techniques for the probability of topic selection would result in a more accurate model, although we make the simpler uniform assumption in this article. With this assumption, we obtain the probability of a word $w^j$ corresponding to a concept $z^k$ as

$$W_j^k = \frac{\beta_{kj}}{\sum_{h=1}^{K} \beta_{hj}}. \qquad (7)$$

Furthermore, the documents can be represented in the topic space as well, $D_i \in \mathbb{R}^K$. The $k$th component $D_i^k$ of $D_i$ represents the probability of a concept $z^k$ given a document $d_i$, and it is expressed as

$$D_i^k = p(z^k \mid d_i) = \sum_{w^j \in d_i} p(z^k \mid w^j, d_i)\, p(w^j \mid d_i) = \sum_{w^j \in d_i} p(z^k \mid w^j)\, p(w^j \mid d_i), \qquad (8)$$

where we assume that the conditional probability $p(z^k \mid w^j, d_i)$ equals the conditional probability $p(z^k \mid w^j)$. This assumption is based on the LDA assumption that there is a fixed number of underlying topics that are used to generate the words in documents (Croft et al., 2010). In other words, we assume that a word in topic space does not depend on the document in which it is used, but on the topic from which it is generated. An approximation $\hat{D}_i^k$ of $D_i^k$ can be obtained by substituting $p(w^j \mid d_i)$ with $\hat{p}(w^j \mid d_i)$, where

$$\hat{p}(w^j \mid d_i) = \frac{n_{ij}}{N_{d_i}}. \qquad (9)$$

Here, $n_{ij}$ denotes the number of occurrences of word $w^j$ in document $d_i$, and $N_{d_i}$ denotes the number of words in document $d_i$, that is, $N_{d_i} = \sum_{j=1}^{V} n_{ij}$. We note that there could be many smoothing schemes to estimate $p(w^j \mid d_i)$, although we have chosen the simplest form in this article. Then,

4 JOURNAL OF THE ASSOCIATION FOR INFORMATION SCIENCE AND TECHNOLOGY—•• 2015DOI: 10.1002/asi

$$D_i^k \approx \hat{D}_i^k = \sum_{w^j \in d_i} p(z^k \mid w^j)\, \frac{n_{ij}}{N_{d_i}} = \sum_{w^j \in d_i} W_j^k\, \frac{n_{ij}}{N_{d_i}}. \qquad (10)$$

In general, a document includes various words that are used to explain the key topics in the document. The definition of document probability in Equation (10) captures the topical features of the words in the document. This definition is distinguished from the usual definition of the probability of a document in LDA, which takes the document probability to be the probability of the simultaneous occurrence of all words used in the document. The new definition overcomes the difficulty associated with the latter definition, in which the probability of a document depends considerably on the length of the document. If topics are regarded as index terms, the document representation in the topic space can be utilized for automatic document indexing. We call this novel indexing method LDI.
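
Given an estimated $\beta$ matrix, the word and document representations of Equations (7) and (10) reduce to a column normalization and a frequency-weighted average. A minimal numpy sketch (function names ours):

```python
import numpy as np

def word_topic_matrix(beta):
    """Equation (7): W[j, k] = beta[k, j] / sum_h beta[h, j].

    beta is K x V; the result is V x K, one probability vector over
    topics per vocabulary word.
    """
    return (beta / beta.sum(axis=0, keepdims=True)).T

def document_topic_vector(W, counts):
    """Equation (10): D[k] = sum_j W[j, k] * n_ij / N_di,
    with counts a length-V vector of term frequencies for one document."""
    p_w_given_d = counts / counts.sum()   # Equation (9)
    return W.T @ p_w_given_d              # length-K document vector
```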

Similarity between Document and Query

One of the main applications of automatic indexing is document retrieval, which is also the main concern for LDI in this article. With the new definitions in the previous subsection, each term can be represented in topic space, that is, $W_j = \{W_j^1, W_j^2, \ldots, W_j^K\}$. Divided by its norm, each term vector is normalized to the unit sphere. We can define the similarity $\rho(\cdot, \cdot)$ between two terms $w^s$ and $w^t$ as

$$\rho(w^s, w^t) = \tilde{W}_s \cdot \tilde{W}_t = \frac{\sum_{k=1}^{K} p(z^k \mid w^s)\, p(z^k \mid w^t)}{\sqrt{\sum_{h=1}^{K} p(z^h \mid w^s)^2}\, \sqrt{\sum_{h=1}^{K} p(z^h \mid w^t)^2}} = \frac{\sum_{k=1}^{K} \beta_{ks} \beta_{kt}}{\sqrt{\sum_{h=1}^{K} \beta_{hs}^2}\, \sqrt{\sum_{h=1}^{K} \beta_{ht}^2}}, \qquad (11)$$

where $\tilde{W}_s = W_s / \lVert W_s \rVert$, $\tilde{W}_t = W_t / \lVert W_t \rVert$, and $\tilde{W}_s \cdot \tilde{W}_t$ denotes their inner product.

Hereafter, capital letters represent probability vectors. This similarity measure quantifies the proximity of two terms in topic space in terms of the cosine of the angle between them. Thus, the similarity between two distinct terms in general does not equal zero, which mitigates the problem of synonymy. On the other hand, the problem of polysemy can also be alleviated because each term has multiple topical interpretations owing to its representation in a topic space.

Analogously, similarity measures to compare two documents and to compare a term and a document can be defined as

$$\rho(d_s, d_t) = \tilde{D}_s \cdot \tilde{D}_t \qquad (12)$$

and

$$\rho(w^s, d_t) = \tilde{W}_s \cdot \tilde{D}_t, \qquad (13)$$

respectively.

In terms of our original problem, we regard the query as a pseudo-document that contains a set of query terms $Q = \{q_1, q_2, \ldots, q_L\}$. Then, similar to Equation (10), the probability vector of the query with respect to the $k$th topic can be defined in the concept space as

$$Q^k = p(z^k \mid Q) = \sum_{q_j \in Q} p(z^k \mid q_j, Q)\, p(q_j \mid Q) \approx \frac{\sum_{q_j \in Q} p(z^k \mid q_j)}{L}. \qquad (14)$$

We note that $\sum_{q_j \in Q} p(z^k \mid q_j, Q)$ can be simplified as $\sum_{q_j \in Q} p(z^k \mid q_j)$. The similarity between a query $Q$ and a document $d_s$ is measured by

$$\rho(Q, d_s) = \tilde{Q} \cdot \tilde{D}_s, \qquad (15)$$

where $Q = \{Q^1, Q^2, \ldots, Q^K\}$ is the query vector in the concept space. Characterization of a query as a probability vector is possible because of the new definition of the document probability vector in Equation (10). The probability vector represents the characteristics of the words, documents, and queries in the concept space. One advantage of LDI is that a query unseen during training can be treated coherently as a document in the training set. This feature is pertinent to LDA, and it is not present in other automatic indexing methods such as pLSI (Hofmann, 2001).

The size of the concept space, $K$, plays an important role in our approach, as it is rooted in LDA. In LDA, the value of $K$ determines the degree of abstraction of information: the larger $K$ is, the finer the segmentation of information.

A Toy Example

An example is considered for illustration.

T1: the OS in Apple smartphones
T2: the OS system in Apple products
T3: the sign system in Samsung smartphones
B1: Samsung and Apple signed a contract
B2: there are many kinds of product contracts
D1: fry the apple pie with some peas
D2: the pie should be fried in oil
D3: the way to fry dumplings
G1: the oil is made from genetically modified bean
G2: the bean is genetically modified from peas

As seen here, the example is made up of 10 documents with 14 different terms in four different disciplines: technology, business, diet, and genetics. Each document is labeled by the leading character of the related field, viz. T, B, D, and G, followed by the order of its appearance. For instance, G2 represents the second document in the genetics field. The term frequencies are summarized in Table 1.

Figure 1 and Table 2 show the results of LDA applied to the example data with dimension $K = 4$. In both, the concepts, which would not have been known a priori, are ordered as technology, business, diet, and genetics. In Figure 1, as the term "apple" appears in more than one discipline, that is, technology, business, and diet, the probability of the term representing each of these concepts spreads across the topics. With the new definition of $W_j^k$ in Equation (7), the term "apple" represents the concept technology with probability 0.4275; business, 0.3053; diet, 0.2672; and genetics, 0. Other terms, such as "smartphone," "contract," "pie," and "genetically-modified," that appear in only one discipline have probability 1 in their respective topics.

Table 2 exhibits an accurate assignment of document probabilities with respect to the four topics, based on the definition of $D_i^k$ in Equation (10). Figure 2 shows the distribution of three arbitrary queries, given below, with respect to the four topics. From the histogram in Figure 2, we observe that all three queries have no relationship with the fourth topic, genetics, and that the definition of $Q^k$ in Equation (14) properly identifies this relationship.

Query 1: the sign system in Apple
Query 2: the contract for Apple products
Query 3: the Apple OS

Table 3 shows the similarity between each query and each document, obtained using Equation (15). Figure 3 plots the documents and queries in the space of the first three topics, which illustrates the similarity between the queries and documents. Clearly, documents T1, T2, and T3 are closely related to Query 1 and Query 3, and documents B1 and B2 are closely related to Query 2.

TABLE 1. Summary of the characteristics of the toy example.

Term                   T1  T2  T3  B1  B2  D1  D2  D3  G1  G2
Apple                  1   1       1       1
Smartphone             1       1
OS                     1   1
System                     1   1
Product                    1           1
Contract                       1   1
Sign                           1   1
Samsung                        1   1
Pie                                    1   1
Pea                                    1                   1
Fry                                    1   1   1
Oil                                        1           1
Genetically-modified                                   1   1
Bean                                                   1   1

FIG. 1. Assignment of $p(z \mid w)$ by $W_j^k$, where group 1 = {"apple"}, group 2 = {"product," "sign"}, and group 3 = {"pea," "oil"}.

TABLE 2. Assignment of $p(z \mid d)$ by $D_i^k$.

Document  Topic 1  Topic 2  Topic 3  Topic 4
T1        0.8092   0.1018   0.0891   0
T2        0.7098   0.2234   0.0668   0
T3        0.8039   0.1961   0        0
B1        0.2098   0.7234   0.0668   0
B2        0.1373   0.8627   0        0
D1        0.1069   0.0763   0.6739   0.1429
D2        0        0        0.8095   0.1905
D3        0        0        1        0
G1        0        0        0.1429   0.8571
G2        0        0        0.1429   0.8571

FIG. 2. Distribution of query vectors in the concept space with four dimensions.


We note that the values in Table 3 do not have meaning in terms of probability, although they are bounded above by 1 and below by 0. As for Query 3, although no term matches document T3, the relevance between them is high because of the large probabilities in the shared topic of technology. On the other hand, although the query term "apple" matches between Query 3 and D1, the similarity between them is calculated by the proposed method to be much smaller than that between Query 3 and T3, because of the polysemy of the term "apple." The illustration in Figure 3 also verifies the significance of the cosine similarity. These are accurate characteristics of topic search that are not exhibited in keyword-based models. Although this particular example shows the effectiveness of the proposed model, the performance depends on the training data and the assumption of latent topics.

Ensemble Model (EnM)

An EnM that linearly combines indexing models is studied in this section. Different weights assigned to these constituent models may result in distinct results. We aim to derive a learning algorithm that obtains the optimal weights, with which the EnM maximizes the MAP.

Notations and Algorithm

Suppose that the relevant document list for a query $q_i \in Q$ is denoted by $D_i$; $|Q|$ represents the number of queries in the query set, and $|D_i|$ the number of documents in the relevant document set with respect to the query $q_i$. Let $d_{ij} \in D_i$ denote the $j$th document in $D_i$. A constituent document retrieval model $\varphi_k$, chosen from a set of methods $\Phi = \{\varphi_1, \varphi_2, \ldots, \varphi_{K_\varphi}\}$, produces the similarity score denoted by $\varphi_{ki}$ with respect to the query $q_i$. Accordingly, let $R(d_{ij}, \varphi_{ki})$ be the ranking position of the $j$th document for the $i$th query returned by the $k$th model. In order to measure the performance of the retrieved ranking, we use the widely used information retrieval metric average precision (AP), denoted by $AP(\varphi_{ki})$ for the $i$th query and the $k$th model, given the relevant document set $D_i$.

Furthermore, we denote the MAP over the query set $Q$ calculated by the $k$th model as $E(\varphi_k)$. The EnM is written as a linear combination of the constituent models, that is, $H = \sum_{k=1}^{K_\varphi} \alpha_k \varphi_k$, whose MAP is $E(H(\alpha))$, or $E(H)$ for simplicity. We assume that the relevant document list is sorted in descending order according to the scores. The definitions of AP and MAP are given below.

The AP of the $k$th model $\varphi_k$ given the $i$th query $q_i$ is defined as

$$AP(\varphi_{ki}) = \frac{1}{|D_i|} \sum_{j=1}^{|D_i|} \frac{j}{R(d_{ij}, \varphi_{ki})}. \qquad (16)$$

In this function, all documents are retrieved, and the ranking positions of all relevant documents are used to define the precision. This metric is commonly used in the literature (Gao, Qi, Xia, & Nie, 2005; Yue, Finley, Radlinski, & Joachims, 2007). Further, this metric can be used in VSM and VSM-based variants because relevant documents that contain no query terms are ranked low. The MAP of the $k$th model $\varphi_k$ for the query set $Q$ is defined as

$$E(\varphi_k) = \frac{1}{|Q|} \sum_{i=1}^{|Q|} AP(\varphi_{ki}). \qquad (17)$$

As for the EnM $H$, the MAP is defined as

$$E(H) = \frac{1}{|Q|} \sum_{i=1}^{|Q|} AP(h_i(\alpha)) = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{|D_i|} \sum_{j=1}^{|D_i|} \frac{j}{R(d_{ij}, h_i(\alpha))}, \qquad (18)$$

where $h_i(\alpha) = \sum_{k=1}^{K_\varphi} \alpha_k \varphi_{ki}$ is the similarity score returned by the EnM with weight vector $\alpha = (\alpha_1, \ldots, \alpha_{K_\varphi})$ for the query $q_i$. Our goal is to find the weights $\alpha_k$ assigned to the $\varphi_k$'s with which the EnM gives the maximal MAP. Mathematically, we aim to solve the following problem:

$$\max_{\alpha} E(H). \qquad (19)$$

Because the AP lies in the range [0, 1], we can define the loss function for the above objective as

$$Loss = \sum_{i=1}^{|Q|} \left( 1 - AP(h_i(\alpha)) \right). \qquad (20)$$

In doing so, the maximization problem in Equation (19) is equivalent to the following minimization problem:

$$\min_{\alpha} Loss. \qquad (21)$$

According to the first-order Taylor series inequality $1 - x \leq e^{-x}$, we can instead minimize an upper bound of the function in Equation (20):

TABLE 3. Similarities between queries and documents.

Document  Query 1  Query 2  Query 3
T1        0.9475   0.5227   0.9938
T2        0.9885   0.6643   0.9915
T3        0.9692   0.6053   0.9833
B1        0.6734   0.9902   0.4796
B2        0.5681   0.9586   0.3543
D1        0.3076   0.2829   0.3420
D2        0.1261   0.1245   0.1752
D3        0.1296   0.1279   0.1800
G1        0.0213   0.0210   0.0296
G2        0.0213   0.0210   0.0296


$$Loss \leq \sum_{i=1}^{|Q|} \exp(-AP(h_i(\alpha))). \qquad (22)$$

Equivalently, our aim is to solve the following optimization problem:

$$\min_{\alpha} \sum_{i=1}^{|Q|} \exp(-AP(h_i(\alpha))), \qquad (23)$$

where $AP(\cdot)$ is a nonconvex, nondifferentiable, and discontinuous function. An algorithm, EnM.B, is implemented to solve this problem.

Algorithm EnM.B is developed within the boosting scheme, which utilizes the competition between constituent models and training data sets to iteratively update the weights until the game reaches equilibrium. To keep the context consistent, we use the toy example introduced in the previous section and illustrate the operations of the algorithm in Figure 4. We assume that two rankers, based on the tf-idf weighted VSM (TFIDF) and LDI, are chosen as the constituent methods, and that the relevant document lists for Query 1, Query 2, and Query 3 are {T3}, {B1, B2}, and {T1, T2}, respectively. As shown in Figure 4, the query set is represented by the box on the left-hand side, in which each circle denotes one query, and the ranker set is denoted by the box on the right-hand side, in which each circle denotes one ranker. The size of each circle indicates the corresponding weight for each query and each ranker, whereas the font size of the term "Loss" indicates the value of the objective loss function.


FIG. 3. Queries and documents in a three-dimensional topic space.

FIG. 4. Illustration of how EnM.B works on the toy example.


ALGORITHM EnM.B: A boosting algorithm for training the EnM.

Require: Query set $Q$; a set of basis models $\Phi$; a duplicate set of models $\Phi' = \Phi$. Initialize weights $D_1$ over all queries with the uniform distribution, that is, $D_1 = 1/|Q|$. Initialize the $\alpha$'s with zeros. Set the initial performance measure $E_0$.

while $|E_t - E_{t-1}| > \epsilon$ do
    if $\Phi' = \emptyset$ then
        $\Phi' = \Phi$;
    end if
    Select a basis model $\varphi_t \in \Phi'$ with weights $D_t$ on the training queries using Equation (37);
    Update the weight $\alpha = \alpha + \delta_{j^*} e_{j^*}$ using Equation (35);
    Compute the MAP $E_t$ with the EnM $H_t$;
    if $|E_t - E_{t-1}| > \epsilon$ then
        $\Phi' = \Phi' \setminus \varphi_t$;
        Update $D_{t+1}$ using Equation (33);
    end if
end while
return the EnM $H$.

In round 1, EnM.B initializes uniform weights $D_1 = (1/3, 1/3, 1/3)$ for the queries and $\alpha = (0, 0)$ for the basis models. The constituent model TFIDF, which minimizes the loss, is chosen. Because TFIDF performs well on some queries and poorly on the others, the weight for the query on which TFIDF performs poorly is increased, which results in the weights $D_2 = (0.28, 0.43, 0.28)$ and $\alpha = (1.298, 0)$. From this step, we can observe that the weights $D_2$ play a penalty role on Query 2, so that the selection of the constituent model in round 2 focuses on Query 2. Then, the LDI model, which now minimizes the loss, is chosen in round 2. At the end of round 2, the weights are updated to $D_3 = (1/3, 1/3, 1/3)$ and $\alpha = (1.298, 1.285)$. The algorithm stops at round 3 because the loss converges to zero, that is, the MAP equals 1. We note that the original model set is reused until the algorithm converges. The convergence analysis is similar to that of the general boosting scheme, and thus it is omitted in this article. The complexity of EnM.B is $O(T \cdot K_\varphi \cdot |Q| \cdot M)$, where $T$ indicates the number of iterations. The details of the derivation of this algorithm can be found in Appendix A.
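
The essential loop of EnM.B can be sketched as a greedy coordinate update on the exponential loss of Equation (23). Because the paper's selection and step-size rules (Equations (33), (35), and (37) in the appendix) are derived analytically, the sketch below substitutes a simple numeric line search, so it illustrates the structure of the algorithm rather than reproducing it exactly; all names and the step grid are our assumptions.

```python
import numpy as np

def enm_boost(ap_of, n_models, queries, eps=1e-4, max_rounds=100):
    """Greedy coordinate ascent on MAP via the exponential-loss bound.

    ap_of(alpha, q): AP of the alpha-weighted ensemble on query q.
    n_models: number of constituent models (length of alpha).
    """
    alpha = np.zeros(n_models)

    def loss(a):
        return sum(np.exp(-ap_of(a, q)) for q in queries)

    current = loss(alpha)
    for _ in range(max_rounds):
        best_j, best_delta, best_loss = None, None, current
        for j in range(n_models):                     # candidate coordinate
            for delta in np.linspace(0.1, 2.0, 20):   # candidate step length
                trial = alpha.copy()
                trial[j] += delta
                l = loss(trial)
                if l < best_loss:
                    best_j, best_delta, best_loss = j, delta, l
        if best_j is None or current - best_loss < eps:
            break                                     # loss no longer decreases
        alpha[best_j] += best_delta
        current = best_loss
    return alpha
```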

Computational Experiments

Experiment Setup

Four standard data sets collected from four distinct subjects are utilized to test the proposed methods. They are provided in the SMART Information Retrieval System.¹ Each data set contains a corpus, a list of queries, and the corresponding relevant documents. Because the data in each set are relatively small, we construct a merged collection (MC) that contains all four data sets. Although the size of MC is smaller than other large collections, such as the Text REtrieval Conference (TREC) data, the experiment on MC can to some extent test the performance of the proposed methods on large data. Another reason for choosing the MC is the excessive computing time associated with LDA. To observe the structural information of the MC data set, we used t-distributed stochastic neighbor embedding (t-SNE) (Van der Maaten & Hinton, 2008) to visualize a two-dimensional embedding of the proposed topical document representations in Figure 5. This drawing also reveals that the proposed document representations in topic space preserve essential structural information in the documents. Some basic data characteristics are summarized in Table 4.

Prior to applying the proposed methods, we performed the following simple preprocessing. Stop words were removed from all corpora using the list of 571 stop words provided in SMART. Special symbols, such as hyphenation marks, were removed, and words with a unique appearance in the corpus were also removed. Some documents and queries provided in CISI and CACM were vacant, which might decrease the retrieval accuracy. However, instead of removing those documents and queries, we kept them in the experiments to ensure the integrity of the data sets.
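
A minimal sketch of this preprocessing (names ours; stop_words stands in for the 571-word SMART list):

```python
import re
from collections import Counter

def preprocess(raw_docs, stop_words):
    """Lowercase, strip special symbols, drop stop words, and remove
    terms that appear only once in the corpus."""
    docs = [[re.sub(r"[^a-z0-9]", "", w) for w in d.lower().split()]
            for d in raw_docs]
    docs = [[w for w in d if w and w not in stop_words] for d in docs]
    freq = Counter(w for d in docs for w in d)
    return [[w for w in d if freq[w] > 1] for d in docs]
```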

For comparison purposes in evaluating LDI, a tf-idf weight-based VSM (denoted TFIDF), LSI, pLSI, and LBDM were also tested on the same data sets. The TFIDF and LSI methods were coded in-house, the LBDM was implemented based on GibbsLDA++,² and the pLSA³ and LDA⁴ codes were obtained from publicly available sites.

In the experiment for verifying the EnM, TFIDF, LSI, pLSI, and LDI are used as the constituent indexing models. The advantage of using these four constituent models is that both conceptual meaning and keyword-matching information can be combined into the EnM. The EnM can benefit from the combined information and, in turn, achieve an overall improvement in ranking accuracy. In order to address overfitting, each experimental data set is divided into two equivalent parts, and the EnM is evaluated through twofold cross-validation. A uniformly weighted EnM (denoted UniEnM) is also implemented for comparison.

¹Available at: http://www.cs.columbia.edu/~blei/lda-c/
²Available at: http://gibbslda.sourceforge.net/
³Available at: http://www.kyb.mpg.de/bs/people/pgehler/code/index.html
⁴Available at: http://www.cs.columbia.edu/~blei/lda-c/


Parameter Selection

Choosing an appropriate number of topics is an important issue for LSI, pLSI, and LDI. In this article, we followed the strategy suggested by Deerwester et al. (1990): we examined the performance for different numbers of dimensions and selected the one that maximizes retrieval performance. Table 5 summarizes the number of dimensions for each method on the different data sets. Validations for these selections can be found in Appendix B. In applying pLSI, we modified the determination of the parameter β. In the search for the β value in the tempered EM of pLSI, β is reduced until the perplexity can no longer be reduced; in our experiment, however, we further reduced the β value until the precision no longer improved. Following the strategy given by Wei and Croft (2006) for LBDM, we set the number of topics identical to LDI, with λ = 0.7 and μ = 1,000, which are the optimal settings.
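
The dimension-selection strategy amounts to a sweep over candidate topic counts, keeping the one with the best retrieval performance. A sketch (all names are placeholders):

```python
def select_num_topics(candidates, train_model, evaluate_map):
    """Train one model per candidate K and keep the K with the highest MAP.

    train_model(K) fits an indexing model with K topics;
    evaluate_map(model) scores it on the validation queries.
    """
    scores = {K: evaluate_map(train_model(K)) for K in candidates}
    return max(scores, key=scores.get)

# Example sweep similar to the ranges reported in Table 5:
# best_K = select_num_topics([50, 75, 100, 125, 150, 200], fit_lsi, map_on_dev)
```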

Experimental Results of LDI

Figures 6 and 7 depict the precision-recall curves of the tested methods on the four standard data sets and MC, respectively. The values on the x-axis denote the recall of labeled relevant documents, whereas the values on the y-axis represent the precision of retrieved documents. Compared with TFIDF, LSI, pLSI, and LBDM, the proposed LDI achieved the best performance, except for small intervals of recall on MED and CRAN. In the experiments on CRAN, CISI, and CACM, LDI had a higher precision in the high-recall regime. This property seems more valuable in practice because the documents in higher positions are viewed by more users. As for the experiment on MC, the proposed LDI performed better than the indexing methods TFIDF, LSI, and pLSI, and showed a performance comparable to the retrieval model LBDM. In addition to its performance superiority, the proposed LDI also has advantages in the following two respects: (a) the topical representation of a document for automatic indexing is proposed in our method, whereas it is not introduced by the LBDM; and (b) no smoothing parameters need to be tuned in our method, whereas additional parameters need to be trained in the LBDM.

Experimental Results of the EnM

The aim of the first experiment was to examine the learning ability of EnM.B. Owing to length constraints, we consider the MED corpus as an example.

FIG. 5. A two-dimensional embedding of the proposed document representations using t-SNE for the MC data set.

TABLE 4. Data characteristics.

Data   Subject          Document #  Query #  Term #
MED    M (Medicine)     1,033       30       5,775
CRAN   A (Aeronautics)  1,400       225      8,213
CISI   L (Library)      1,460       112      10,170
CACM   C (Computer)     3,204       64       9,961
MC     M, A, L, and C   7,097       431      27,784

TABLE 5. Number of topics used in LSI, pLSI, and LDI.

Method  MED  CRAN  CISI  CACM  MC
LSI     100  125   150   125   500
pLSI    100  150   50    75    400
LDI     100  100   100   75    200


The experiments on the other corpora have similar results. The learning curve of EnM.B in terms of MAP during the trials of cross-validation is shown in Figure 8. From the figure, we can observe that EnM.B converges as the number of rounds increases, which is consistent with our analysis.

We then compute the weights for the constituent models. Because the MAP is determined by the relative ratios of the weights of the basis models, normalizing the weights does not change the final MAP value. We normalized the mean values of the weights over all trials of cross-validation, as listed in Table 6.

With the aforementioned weights, Figures 9 and 10 illustrate the precision-recall curves of the EnM on the four standard data sets and MC, respectively. Here, we use the optimal weights of EnM.B over all trials to represent the EnM. As expected, the EnM outperforms LDI because the EnM is supervised with the knowledge of relevant documents, whereas LDI is unsupervised and is not trained on relevant documents. The EnM is superior to UniEnM as well, because its weights are updated to the optimum.

As shown in Figures 7 and 10, the performance of LDI is moderate when tested on the MC data set. This may be because of a limitation of the data sets, in which the relevant documents are given only within each specific corpus. In detail, given a query, a document in another corpus may appear intuitively relevant to general users. Such a document may be retrieved by LDI but not listed in the relevant list. Because relevant documents from other corpora are excluded from the expert-identified relevant list, the precision of retrieving these documents is relatively low. Two examples from LDI are given below to show this phenomenon. Two documents retrieved at the first ranking position in CACM and CISI are seen to be intuitively relevant to the corresponding queries in CISI and CACM, respectively. However, they are excluded from the relevant lists. This is because some documents in CISI and CACM may be relevant owing to common subjects, which can also be observed in Figure 5. Therefore, the performance of the proposed methods on large data sets with a precise relevant list still requires further research.


FIG. 6. Precision-recall curves of TFIDF, LSI, pLSI, LBDM, and LDI on MED, CRAN, CISI, and CACM.


FIG. 7. Precision-recall curves of TFIDF, LSI, pLSI, LBDM, and LDI on MC.


Example 1

CISI: QueryID = 6
What possibilities are there for verbal communication between computers and humans, that is, communication via the spoken word?

CACM: DocumentID = 698, RankingPosition = 1
DATA-DIAL: Two-Way Communication with Computers From Ordinary Dial Telephones. An operating system is

FIG. 8. Learning curve of EnM.B on MED.


FIG. 9. Precision-recall curves of LDI, UniEnM, and EnM on MED, CRAN, CISI, and CACM.

TABLE 6. Normalized weights for the constituent indexing models in the EnM.

Data   TFIDF  LSI   pLSI  LDI
MED    0.21   0.21  0.25  0.33
CRAN   0.22   0.22  0.25  0.31
CISI   0.31   0.18  0.18  0.33
CACM   0.47   0.28  0.12  0.13
MC     0.34   0.18  0.26  0.22


FIG. 10. Precision-recall curves of LDI, UniEnM, and EnM on MC.


described which allows users to call up a remotely located computer from ordinary dial telephones. No special hardware or connections are required at the users' telephones. Input to the computer is through the telephone dial; output from the computer is in spoken form. Results of a test with telephones in the Boston area are reported.

Example 2

CACM: QueryID = 59
Dictionary construction and accessing methods for fast retrieval of words or lexical items or morphologically related information. Hashing or indexing methods are usually applied to English spelling or natural language problems.

CISI: DocumentID = 179, RankingPosition = 1
This book deals with the computer processing of large information files, with special emphasis on automatic text handling methods. Described in particular are procedures for dictionary construction and dictionary look-up, statistical and syntactic language analysis methods, information search and matching procedures, automatic information dissemination systems, and methods for user interaction with the mechanized system. As such, the text includes elements of linguistics, mathematics, and computer programming.

Table 7 provides a summary of the overall performance of each tested model. Performance is measured by the MAP and by the percentage of improvement over TFIDF. From this table, we can similarly conclude that LDI achieves the highest precision among the indexing and retrieval models, and that the EnM, trained by the proposed algorithm, outperforms each constituent indexing model. This result verifies the general belief that an ensemble performs better than its individual models.

Discussion and Conclusions

We presented a document indexing method, LDI, based on LDA. The proposed method utilizes the β matrix of LDA to define document representations in topic space. The proposed novel document representation is subsequently used for retrieval by computing similarity scores between the documents and a given query. The results of our computational experiments indicate that the proposed LDI method is a viable automatic document indexing method for information retrieval. Because LDI is a generative model, it can be employed in information systems without labeled items, such as search engines, to retrieve relevant documents according to user input queries.

Another contribution lies in the proposition of the EnM. This model was inspired by the general belief that a combined model performs better than a single model. By formulating an optimization problem that minimizes a loss defined on the MAP, we proposed an algorithm, EnM.B, to solve this problem and obtain the optimal weights for each constituent model. The empirical results show that the EnM outperforms every constituent model throughout the recall regimes. Unlike LDI, the EnM can be used only in systems with knowledge of relevant documents, such as library systems.

Two research questions regarding the EnM call for further investigation: (a) Does the selection of the constituent ranking models influence the result of the EnM? (b) Are the results of the EnM globally optimal for the optimization problem? Intuitively, the answer to the first question is yes because, empirically, the EnM benefits from LDI and the other basis models. The selection of constituent models and the number of constituent models are indeed important factors, which require further discussion. As for the second question, the answer is unknown. Although the EnM.B algorithm generates results and converges as the iterations increase, there is no evidence that the final results are globally optimal, or even locally optimal. Thus, the proof of optimality requires further research. Moreover, an exact algorithm for solving the optimization problem is also a subject that should be studied further. By and large, however, the EnM is still a viable method for information retrieval.

Acknowledgments

This work was partially supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (2009-0083893). The first author gratefully acknowledges the China Scholarship Council (CSC) for fellowship support. We also gratefully acknowledge the referees and the editor for their helpful suggestions and comments.

TABLE 7. MAPs of various methods on the tested data sets.

         MED               CRAN              CISI              CACM              MC
Method   MAP     Impr(%)   MAP     Impr(%)   MAP     Impr(%)   MAP     Impr(%)   MAP     Impr(%)
TFIDF    0.4605  –         0.2716  –         0.0935  –         0.1177  –         0.1946  –
LSI      0.5026  +9.1      0.2661  −2.0      0.1229  +24.0     0.1094  −7.1      0.1709  −12.2
pLSI     0.5334  +15.8     0.2740  +0.9      0.1223  +23.4     0.0973  −17.3     0.1918  −1.4
LBDM     0.5516  +19.8     0.2941  +8.3      0.1030  +10.2     0.1174  −0.3      0.1968  +1.1
LDI      0.5738  +24.6     0.3146  +15.8     0.1429  +52.8     0.1349  +14.6     0.2084  +7.1
UniEnM   0.6034  +31.0     0.3519  +29.6     0.1510  +61.5     0.1743  +48.1     0.2735  +40.5
EnM      0.6420  +39.4     0.3766  +38.7     0.1637  +75.1     0.1890  +60.6     0.2768  +42.2

Note. Impr(%) indicates the percentage of improvement over TFIDF.


References

Azzopardi, L., Girolami, M., & Van Rijsbergen, C. (2004). Topic based language models for ad hoc information retrieval. In Proceedings of the IEEE International Joint Conference on Neural Networks (Vol. 4, pp. 3281–3286). New York: ACM Press.

Blei, D.M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77–84.

Blei, D.M., Ng, A.Y., & Jordan, M.I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

Büttcher, S., Clarke, C.L., & Lushman, B. (2006). Term proximity scoring for ad-hoc retrieval on very large text collections. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 621–622). New York: ACM Press.

Caid, W.R., Dumais, S.T., & Gallant, S.I. (1995). Learned vector-space models for document retrieval. Information Processing & Management, 31(3), 419–429.

Choi, I.C., & Lee, J.S. (2010). Document indexing by latent Dirichlet allocation. In Proceedings of the International Conference on Data Mining (pp. 409–414). Athens: CSREA Press.

Croft, W.B., Metzler, D., & Strohman, T. (2010). Search engines: Information retrieval in practice. Reading, UK: Addison-Wesley.

Deerwester, S.C., Dumais, S.T., Landauer, T.K., Furnas, G.W., & Harshman, R.A. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407.

Freund, Y., & Schapire, R.E. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119–139.

Gao, J., Qi, H., Xia, X., & Nie, J.-Y. (2005). Linear discriminant model for information retrieval. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 290–297). New York: ACM Press.

Guha, R., McCool, R., & Miller, E. (2003). Semantic search. In Proceedings of the 12th International Conference on World Wide Web (pp. 700–709). New York: ACM Press.

Gunter, D.W. (2009). Semantic search. Bulletin of the American Society for Information Science and Technology, 36(1), 36–37.

Hofmann, T. (2001). Unsupervised learning by probabilistic latent semantic analysis. Machine Learning, 42(1–2), 177–196.

Nikravesh, M. (2007). Concept-based search and questionnaire systems. In Forging new frontiers: Fuzzy pioneers I (pp. 193–215). Berlin: Springer.

Robertson, S.E., Walker, S., Jones, S., Hancock-Beaulieu, M.M., & Gatford, M. (1995). Okapi at TREC-3. In Proceedings of the Third Text REtrieval Conference (TREC-3) (pp. 109–126). Gaithersburg: National Institute of Standards and Technology.

Salton, G., & McGill, M.J. (1986). Introduction to modern information retrieval. New York: McGraw-Hill.

Salton, G., Wong, A., & Yang, C.-S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613–620.

Singhal, A. (2001). Modern information retrieval: A brief overview. IEEE Data Engineering Bulletin, 24(4), 35–43.

Trillo, R., Po, L., Ilarri, S., Bergamaschi, S., & Mena, E. (2011). Using semantic techniques to access web data. Information Systems, 36(2), 117–133.

Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(85), 2579–2605.

Wallach, H.M., Mimno, D.M., & McCallum, A. (2009). Rethinking LDA: Why priors matter. In Advances in Neural Information Processing Systems (Vol. 22, pp. 1973–1981). New York: Curran Associates.

Wei, F., Li, W., & Liu, S. (2010). iRANK: A rank-learn-combine framework for unsupervised ensemble ranking. Journal of the American Society for Information Science and Technology, 61(6), 1232–1243.

Wei, X., & Croft, W.B. (2006). LDA-based document models for ad-hoc retrieval. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 178–185). New York: ACM Press.

Xu, J., & Li, H. (2007). AdaRank: A boosting algorithm for information retrieval. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 391–398). New York: ACM Press.

Yue, Y., Finley, T., Radlinski, F., & Joachims, T. (2007). A support vector method for optimizing average precision. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 271–278). New York: ACM Press.

Appendix A

Derivation of the EnM.B

Suppose that at step $t$, the similarity score vector for the query $q_i$ returned by the $\alpha^t$-weighted EnM is

$$ h_i(\alpha^t) = \sum_{k=1}^{K} \alpha_k^t \phi_{ki}, \quad (24) $$

where $\alpha^t = (\alpha_1^t, \ldots, \alpha_K^t)$. Let $j^*$ and $\delta_{j^*}^t$ be chosen such that the loss is maximally reduced. The update of $\alpha^t$ at step $t+1$ can then be written as $\alpha^{t+1} = \alpha^t + \delta_{j^*}^t e_{j^*}$, where $\delta_{j^*}^t$ is the step length and $e_{j^*}$ is the vector with 1 at the $j^*$ position and zeros everywhere else. Our goal is to determine $j^*$ and $\delta_{j^*}$. The loss function at step $t+1$ can be written as

$$ L(\alpha_1, \ldots, \alpha_j + \delta_j, \ldots, \alpha_K) = \sum_{i=1}^{Q} \exp\bigl(-AP(h_i(\alpha^{t+1}))\bigr) = \sum_{i=1}^{Q} \exp\Bigl(-AP\Bigl(\sum_{k=1}^{K} \alpha_k \phi_{ki} + \delta_j \phi_{ji}\Bigr)\Bigr). \quad (25) $$

For notational simplicity, we represent $L(\alpha_1, \ldots, \alpha_j + \delta_j, \ldots, \alpha_K)$ by $L(\delta_j)$ and $AP(h_i(\alpha))$ by $AP(h_i)$, and omit the superscript $t$ henceforth.

Define

$$ \Delta(h_i(\alpha^{t+1})) = AP(h_i + \delta_j \phi_{ji}) - AP(h_i) - AP(\delta_j \phi_{ji}). \quad (26) $$

Then Equation (25) can be written as

$$ L(\delta_j) = \sum_{i=1}^{Q} \exp(-AP(h_i)) \exp(-AP(\delta_j \phi_{ji})) \exp\bigl(-\Delta(h_i(\alpha^{t+1}))\bigr). \quad (27) $$

Because AP ranges from 0 to 1, we get $\Delta(h_i(\alpha^{t+1})) \in [-1 - \delta_j, 1]$ and $\exp(-\Delta(h_i(\alpha^{t+1}))) \in [e^{-1}, e^{1+\delta_j}]$. There always exists a sufficiently large constant $M \ge e^{1+\delta_j}$ so that the following inequality holds:

$$ L(\delta_j) \le M \sum_{i=1}^{Q} \exp(-AP(h_i)) \exp(-AP(\delta_j \phi_{ji})). \quad (28) $$

According to the inequality, for $-1 \le x \le 1$,

$$ e^{-\alpha x} \le \frac{1+x}{2} e^{-\alpha} + \frac{1-x}{2} e^{\alpha}, \quad (29) $$

we have

$$ L(\delta_j) \le M J(\delta_j), \quad (30) $$

where

$$ J(\delta_j) = \sum_{i=1}^{Q} \exp(-AP(h_i)) \left\{ \frac{1 + AP(\phi_{ji})}{2} e^{-\delta_j} + \frac{1 - AP(\phi_{ji})}{2} e^{\delta_j} \right\}. \quad (31) $$

Because $J(\delta_j)$ is convex with respect to $\delta_j$, we can minimize it by setting

$$ \frac{\partial J(\delta_j)}{\partial \delta_j} = 0. \quad (32) $$

Let

$$ D_i = \frac{\exp(-AP(h_i))}{Z}, \quad (33) $$

where $Z = \sum_{i=1}^{Q} \exp(-AP(h_i))$ is a normalization factor in the current step. We can observe that $D$ is a distribution over queries that penalizes those queries on which the EnM has poor performance. Then, Equation (32) can be written as

$$ -\sum_{i=1}^{Q} D_i \frac{1 + AP(\phi_{ji})}{2} e^{-\delta_j} + \sum_{i=1}^{Q} D_i \frac{1 - AP(\phi_{ji})}{2} e^{\delta_j} = 0. \quad (34) $$

Solving this equation, we get

$$ \delta_j = \frac{1}{2} \log \frac{\sum_{i=1}^{Q} D_i (1 + AP(\phi_{ji}))}{\sum_{i=1}^{Q} D_i (1 - AP(\phi_{ji}))}. \quad (35) $$

Plugging (35) into (31), we can choose the $j^*$ that minimizes the objective function,

$$ j^* = \arg\min_j \sum_{i=1}^{Q} D_i (1 + AP(\phi_{ji}))(1 - AP(\phi_{ji})), \quad (36) $$

which can be simplified to

$$ j^* = \arg\max_j \sum_{i=1}^{Q} D_i\, AP(\phi_{ji}). \quad (37) $$
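For concreteness, the update that falls out of Equations (33), (35), and (37) can be sketched in Python as follows. This is a minimal reading of the EnM.B step, not the authors' implementation: the uniform initialization, the fixed iteration count, and the guard against a zero denominator are our assumptions, and the names phi (basis-model score arrays), relevant (relevance judgments), and enm_b are hypothetical.

    import numpy as np

    def average_precision(scores, relevant):
        """AP of the ranking induced by descending scores; relevant is a set of doc indices."""
        order = np.argsort(-scores)
        hits, precisions = 0, []
        for rank, doc in enumerate(order, start=1):
            if int(doc) in relevant:
                hits += 1
                precisions.append(hits / rank)
        return sum(precisions) / len(precisions) if precisions else 0.0

    def enm_b(phi, relevant, n_iter=50):
        """phi has shape (K, Q, N): the score each of K basis models assigns to each
        of N documents for each of Q queries (assumed normalized to comparable
        scales); relevant is a list of Q sets of relevant document indices."""
        K, Q, N = phi.shape
        # AP(phi_ji): average precision of each basis model on each query; precomputable.
        ap_basis = np.array([[average_precision(phi[k, i], relevant[i])
                              for i in range(Q)] for k in range(K)])
        alpha = np.full(K, 1.0 / K)   # uniform start: an assumption, not from the paper
        for _ in range(n_iter):
            combined = np.tensordot(alpha, phi, axes=1)        # h_i(alpha), shape (Q, N)
            ap_comb = np.array([average_precision(combined[i], relevant[i])
                                for i in range(Q)])
            w = np.exp(-ap_comb)                               # Eq. (33): distribution D_i
            dist = w / w.sum()                                 # emphasizes poorly served queries
            j = int(np.argmax(ap_basis @ dist))                # Eq. (37): pick j*
            num = float(dist @ (1.0 + ap_basis[j]))
            den = float(dist @ (1.0 - ap_basis[j]))
            alpha[j] += 0.5 * np.log(num / max(den, 1e-12))    # Eq. (35): step length
        return alpha

Note that Equation (35) has the same form as the AdaBoost step length of Freund and Schapire (1997), with $AP(\phi_{ji})$ playing the role of weak-learner accuracy, which reflects the derivation of EnM.B from the boosting method.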

Appendix B

Validation of the Selection of Number of Topics

In the experiment, we explored various numbers of topics: from 50 to 150 at a step size of 25 for MED, CRAN, CISI, and CACM, and from 100 to 500 for MC. The following tables list most of the results in terms of MAP; a short sketch of the sweep itself appears after the tables. For LSI, we used a small number of topics for computational simplicity, as suggested by Deerwester et al. (1990), though the performance might improve slightly if a larger number of topics were explored. For pLSI and LDI, we observed a more clearly identifiable maximum; that is, the performance increased up to a certain point and then decreased.

TABLE B1. MAP with different numbers of topics on MED.

Number of topics    LSI      pLSI     LDI
K = 50              0.4538   0.4534   0.4868
K = 75              0.4885   0.4448   0.5615
K = 100             0.5026   0.5334   0.5738
K = 125             0.4976   0.4089   0.5687

TABLE B2. MAP with different numbers of topics on CRAN.

Number of topics    LSI      pLSI     LDI
K = 50              0.2151   0.2495   0.2767
K = 75              0.2383   0.2464   0.2931
K = 100             0.2541   0.2606   0.3146
K = 125             0.2661   0.2681   0.3133
K = 150             0.2668   0.2740   0.3002

TABLE B3. MAP with different numbers of topics on CISI.

Number of topics    LSI      pLSI     LDI
K = 50              0.1086   0.1223   0.1124
K = 75              0.1139   0.1062   0.1188
K = 100             0.1185   0.1093   0.1429
K = 125             0.1210   0.1079   0.1413
K = 150             0.1229   0.1056   0.1378

TABLE B4. MAP with different numbers of topics on CACM.

Number of topics    LSI      pLSI     LDI
K = 50              0.0601   0.0703   0.0908
K = 75              0.0817   0.0973   0.1349
K = 100             0.0962   0.0920   0.1083
K = 125             0.1094   0.0776   0.1256

TABLE B5. MAP with different numbers of topics on MC.

Number of topics    LSI      pLSI     LDI
K = 100             0.1150   0.1286   0.1906
K = 150             0.1339   0.1119   0.1973
K = 200             0.1450   0.1194   0.2084
K = 250             0.1541   0.1830   0.2011
K = 400             0.1681   0.1918   0.1908
K = 500             0.1709   0.1909   0.1871
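The sweep behind these tables is mechanical: fit one topic model per candidate K and record the MAP for each. Below is a minimal Python sketch using gensim's LdaModel as a stand-in (the experiments do not prescribe a particular LDA implementation); tokenized_docs, candidate_ks, and the scoring helper retrieval_map are hypothetical names, and retrieval_map must index the collection under the fitted model, rank it against the test queries, and score MAP with the relevance judgments.

    from gensim import corpora, models

    def sweep_num_topics(tokenized_docs, candidate_ks, retrieval_map):
        """Fit one LDA model per candidate K and report MAP for each, as in Tables B1-B5."""
        dictionary = corpora.Dictionary(tokenized_docs)
        corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
        results = {}
        for k in candidate_ks:
            lda = models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=k)
            results[k] = retrieval_map(lda)   # user-supplied: index, rank, and score MAP
        return results

    # e.g., sweep_num_topics(docs, range(50, 151, 25), my_map_fn) for MED/CRAN/CISI/CACM,
    # and sweep_num_topics(docs, [100, 150, 200, 250, 400, 500], my_map_fn) for MC.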
