
Pattern Recognition Letters 33 (2012) 1623–1631


Memory-restricted latent semantic analysis to accumulate term-document co-occurrence events

Seung-Hoon Na a,*, Jong-Hyeok Lee b

a Dept. of Computer Science, National University of Singapore, Singapore
b Dept. of Creative IT Excellence Engineering & Future IT Innovation Laboratory, POSTECH, South Korea


Article history: Received 22 November 2011; available online 11 May 2012.

Communicated by Jie Zou

Keywords: Co-occurrence; Dimensionality reduction; Partial-update algorithm; Latent semantic analysis


* Corresponding author. E-mail addresses: [email protected] (S.-H. Na), [email protected] (J.-H. Lee).

Abstract

This paper addresses a novel adaptive problem of obtaining a new type of term-document weight. In our problem, an input is given by a long sequence of co-occurrence events between terms and documents, namely, a stream of term-document co-occurrence events. Given a stream of term-document co-occurrences, we learn unknown latent vectors of terms and documents such that their inner product adaptively approximates the target query-based term-document weights resulting from accumulating co-occurrence events. To this end, we propose a new incremental dimensionality reduction algorithm for adaptively learning a latent semantic index of terms and documents over a collection. The core of our algorithm is its partial updating style, in which only a small number of latent vectors are modified for each term-document co-occurrence, while most other latent vectors remain unchanged. Experimental results on small and large standard test collections demonstrate that the proposed algorithm can stably learn the latent semantic index of terms and documents, showing an improvement in retrieval performance over the baseline method.


1. Introduction

In this paper, we address the novel task of learning query-based term-document weights, often referred to as query-based weights. In our problem, the term-document weights of a term and a document are not provided explicitly; they are obtained indirectly from term-query and document-query weights. Instead of explicitly stating the target query-based term-document weights, we have a long sequence of term-document co-occurrence events, referred to as a stream of term-document co-occurrence events. Each co-occurrence event is described by two sets of terms and documents as evidence of a tighter relationship between them. Given a stream of term-document co-occurrence events, the objective of the problem is to gradually accumulate the co-occurrence events and learn a query-based term weighting metric for documents such that the weight of a term in a document is likely to be proportional to their co-occurrence rate.

A typical scenario for accumulating term-document co-occurrence events is presented in Algorithm 1, where a search engine continuously processes user queries online. In this scenario, each term-document co-occurrence event is defined for a single retrieval.


A term and a document are considered to have co-occurred if the document is retrieved in the top-ranked results by a query that includes the term.

Algorithm 1: Brief description of accumulating term-document co-occurrence

input: m terms and n documents in collection C

1. Initialization: $W_{ij} = 0$ for $1 \le i \le m$ and $1 \le j \le n$;
2. Querying: query Q is provided by a user;
3. Retrieval: obtain the top retrieved documents F for query Q;
4. Update the term weights for the given F and Q:
   for $t_i \in Q$ do
     for $d_j \in F$ do
       $W_{ij} \leftarrow W_{ij} + \Delta$
     end
   end
Iterate Steps 2-4 until learning is stopped.
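The following is a minimal sketch of the accumulation loop of Algorithm 1, assuming a hypothetical `retrieve` function that returns the top-ranked document ids for a query and a constant increment `delta` standing in for $\Delta$; a sparse dictionary replaces the full $m \times n$ matrix purely for illustration.

```python
from collections import defaultdict

def accumulate(queries, retrieve, delta=1.0):
    """Naive accumulation of Algorithm 1: W[(t, d)] grows by delta whenever
    term t appears in a query whose top-retrieved results contain document d."""
    W = defaultdict(float)          # sparse stand-in for the m x n matrix W
    for Q in queries:               # step 2: a user query arrives
        F = retrieve(Q)             # step 3: top-retrieved documents for Q
        for t in Q:                 # step 4: update co-occurrence weights
            for d in F:
                W[(t, d)] += delta
    return W
```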

An obvious way to accumulate term-document co-occurrence events is simply to store all term weight values directly in an $m \times n$ term-document matrix, where each entry holds a weight value $W_{ij}$ between two objects. However, when the number of terms and documents is very large, the term-document matrix becomes so high dimensional that storing and manipulating it directly is no longer tractable.

To achieve better retrieval efficiency, we propose a novel algorithm called memory-restricted latent semantic analysis that effectively approximates target query-based term-document weights. Without maintaining a large-scale term-document matrix, our algorithm manages only low-dimensional latent vectors of terms and documents, and indirectly stores the target weight of a term in a document as the inner product between their latent vectors. In the proposed method, we first define the target query-based weights of terms in documents that are obtained by accumulating co-occurrence events. To restrict the memory capacity further, we then propose a partial-update criterion to be minimized, which modifies only a small number of latent vectors, called focused latent vectors, that are relevant to a given term-document co-occurrence event. Finally, we obtain a fixed-point iteration that incrementally updates the set of focused latent vectors for each co-occurrence event.

Experimental results on small and large information retrieval (IR) test collections show that the proposed algorithm gradually and incrementally learns the target query-based term weighting metric from co-occurrence events, thereby improving the retrieval performance.

2. Related work

Dimensionality reduction has found extensive application in diverse areas, such as information retrieval (Dumais et al., 1988; Deerwester et al., 1990; Dumais, 1992; Bartell et al., 1992, 1995; Berry et al., 1995; Hofmann, 1999; Xu et al., 2003; Wei and Croft, 2006; Wang et al., 2011), computer vision (Levy and Lindenbaum, 2000; Brand, 2002), collaborative filtering (Hofmann, 1999, 2003; Koren et al., 2009), and data mining. Existing works have investigated singular value decomposition (SVD) (Dumais et al., 1988; Deerwester et al., 1990; Dumais, 1992; Berry, 1992; Berry et al., 1995, 1999; Levy and Lindenbaum, 2000; Brand, 2002; Wang et al., 2011), probabilistic latent semantic analysis (Hofmann, 1999, 2003), latent Dirichlet allocation (Blei et al., 2003), probabilistic principal component analysis (Tipping and Bishop, 1999; Lawrence, 2005), kernel principal component analysis (Schölkopf et al., 1997), non-negative matrix factorization (Lee and Seung, 1999, 2000; Berry et al., 2006), nonlinear dimensionality reduction (Roweis and Saul, 2000; Lawrence, 2005), and so on. Recent works have addressed the scalability issue so as to scale up the applicability of dimensionality reduction to large-scale (Yu et al., 2009; Wang et al., 2009) and, more recently, Web-scale data sets (Liu et al., 2010) over the distributed MapReduce framework (Dean and Ghemawat, 2008).

Although dimensionality reduction has mostly been studied in a batch mode, some existing works have developed incremental dimensionality reduction such that the algorithm can be scaled up to a larger data set and can also adapt to environmental changes of co-occurrence events (Berry et al., 1995; Bartell et al., 1995; Levy and Lindenbaum, 2000; Brand, 2002). For example, Bartell et al. (1995) explicitly used co-relevance events among documents to define a novel static target inter-document similarity, and they applied multidimensional scaling (MDS) to approximate target similarities with latent semantic vectors. Brand (2002) also suggested an incremental method for updating the SVD as a robust extension of Levy and Lindenbaum (2000)'s sequential SVD updating algorithm. However, unlike our method, all previous incremental methods are full-update solutions; that is, for each co-occurrence event, all latent vectors of terms or documents must be updated. This contrasts with our approach, which is a partial-update algorithm operating in a memory-restricted manner.

To the best of our knowledge, no existing methods use a partial-update algorithm, except for the work of Yu et al. (1985). Yu et al. (1985) proposed an approximation method for adaptively learning similarities among objects. For each object, they introduced a one-dimensional latent vector, called the latent position, which is randomly initialized before learning the similarities. In their method, for each co-occurrence event, a moving procedure on the latent positions is conducted as follows: given a set of objects that co-occur in a co-occurrence event, the latent positions of these objects are moved slightly toward their central position, so that the objects become slightly closer to each other after each move. The moving procedure is applied continuously to all subsequent co-occurrence events until the latent positions of the objects finally converge. Although Yu et al. (1985)'s method was originally based on a one-dimensional latent space, it is quite simple to extend it to a multidimensional latent space.

Yu et al. (1985)'s method can seemingly be used to accumulate co-occurrence events; however, it suffers from a critical limitation: it is prone to the collapsing problem. In other words, as the number of processed co-occurrence events increases, all the latent positions eventually tend to converge toward the same position, so that the learned similarities (or distances) among objects are no longer distinguishable and all the objects collapse into a single cluster.

3. Memory-restricted latent semantic analysis to accumulate term-document co-occurrence events

3.1. Target query-based term-document matrix

To describe our algorithm, we first need to define the target query-based term-document matrix. Suppose that $a^N_{ij}$ is the target query-based weight of the ith term in the jth document obtained after processing a total of N term-document co-occurrence events. Let $q_N$ be the Nth co-occurrence event, and let $\mathcal{T}(q_N)$ and $\mathcal{F}(q_N)$ be the sets of terms and documents that appear in the co-occurrence event $q_N$, respectively. For convenience, we also refer to $\mathcal{T}(q_N)$ and $\mathcal{F}(q_N)$ as $\mathcal{T}_N$ and $\mathcal{F}_N$, respectively.

Now, suppose that $f(t_i, d_j, q_N)$ is the increment of the query-based weight of the ith term in the jth document for the co-occurrence event $q_N$. In order to accumulate the co-occurrence events, the target term weight $a^N_{ij}$ is updated according to the formula

$$a^N_{ij} = a^{N-1}_{ij} + f(t_i, d_j, q_N) \quad (1)$$

where $a^0_{ij}$ is set to zero for all (i, j) entries.

Suppose that we further decompose $f(t_i, d_j, q_N)$ into two weight parts, a term-query weight function $g(t_i, q_k)$ and a document-query weight function $h(d_j, q_k)$. Eq. (1) is then rewritten as

$$a^N_{ij} = a^{N-1}_{ij} + g(t_i, q_N)\, h(d_j, q_N) \quad (2)$$

Here, the target weight differs depending on the definition of $g(t_i, q_k)$ and $h(d_j, q_k)$. This paper considers the following setting for $g(t_i, q_k)$ and $h(d_j, q_k)$:

$$g(t_i, q_k) = P(t_i \mid \theta_{q_k}), \qquad h(d_j, q_k) = I\big(d_j \in \mathcal{F}(q_k)\big) \quad (3)$$

where $I(e)$ is the indicator function that returns 1 when expression e is true and 0 otherwise, $\theta_{q_k}$ is the query language model for $q_k$, and $P(t_i \mid \theta_{q_k})$ is the generative probability of term $t_i$ under the query model $\theta_{q_k}$. To specify the query model $\theta_{q_k}$, we might use either maximum likelihood estimation (MLE) or a relevance model (RM) (Lavrenko and Croft, 2001). Of these two options, this study uses the model-driven weight given by Eq. (3) for defining $g(t_i, q_k)$ and $h(d_j, q_k)$.
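As an illustration of Eqs. (2) and (3), the sketch below computes the increment $g(t_i, q)\,h(d_j, q)$ for a single co-occurrence event, using the maximum-likelihood estimate of the query model as one of the two estimators mentioned above; the function and variable names are hypothetical.

```python
from collections import Counter

def query_model_mle(query_terms):
    """Maximum-likelihood query language model P(t | theta_q)."""
    counts = Counter(query_terms)
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

def cooccurrence_increment(t, d, query_terms, top_docs):
    """Increment g(t, q) * h(d, q) of Eq. (2) for one event q."""
    g = query_model_mle(query_terms).get(t, 0.0)   # g(t, q) = P(t | theta_q)
    h = 1.0 if d in top_docs else 0.0              # h(d, q) = I(d in F(q))
    return g * h
```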

3.2. Problem setting: projecting query-based weights into a latent semantic space

Let m and n be the number of terms and the number of documents, respectively, and let d be the number of dimensions of the latent vectors. For the ith term, let $t_i$ be its $d \times 1$ current latent term vector and let $x_i$ be the new latent term vector that we want to find. Let $T$ and $X$ be $m \times d$ matrices whose rows consist of all current latent term vectors (i.e., $[t_1 \cdots t_m]^T$) and all new latent term vectors (i.e., $[x_1 \cdots x_m]^T$), respectively.

Likewise, for the jth document, let $z_j$ be its $d \times 1$ current latent document vector and let $y_j$ be the new latent document vector that we want to obtain. Let $Z$ and $Y$ be $n \times d$ matrices whose rows consist of all current latent document vectors (i.e., $[z_1 \cdots z_n]^T$) and all new latent document vectors (i.e., $[y_1 \cdots y_n]^T$), respectively. We also refer to $T$ and $X$ as the (current) latent term matrix and the new latent term matrix, respectively, and to $Z$ and $Y$ as the (current) latent document matrix and the new latent document matrix, respectively.

For the kth co-occurrence event $q_k$, the term co-occurrence vector $p_k$ is the $m \times 1$ vector whose ith element is $g(t_i, q_k)$, and the document co-occurrence vector $q_k$ is the $n \times 1$ vector whose jth element is $h(d_j, q_k)$.

Suppose that we have processed N co-occurrence events; then the target query-based term-document weight matrix accumulated from the N co-occurrence events is formulated as $\sum_{i=1}^{N} p_i q_i^T$, where the (i, j) entry indicates the weight $a^N_{ij}$ of the ith term in the jth document.

Given r new co-occurrence events, let P be the matrix consisting of the term co-occurrence vectors of these new events (i.e., $[p_{N+1} \cdots p_{N+r}]$) and Q be the matrix consisting of the document co-occurrence vectors of these new events (i.e., $[q_{N+1} \cdots q_{N+r}]$). In fact, r is introduced to support batch updating; normally, r = 1, i.e., the latent vectors are updated each time a new query is submitted to the system. $PQ^T$ then indicates the term-document matrix to be added. By adding (or accumulating) $PQ^T$ to the current term-document weights $\sum_{i=1}^{N} p_i q_i^T$, the new target term-document weight matrix accumulated from the N + r co-occurrence events is formulated as

$$\sum_{i=1}^{N+r} p_i q_i^T = \sum_{i=1}^{N} p_i q_i^T + PQ^T \quad (4)$$

where the (i, j)th entry indicates $a^{N+r}_{ij}$.

Our goal is to approximate the new target weight matrix using the inner products of low-dimensional latent term and document vectors. Formally, the problem is formulated as finding solution vectors $x_i$ and $y_j$ that minimize the square error

$$\sum_{i=1}^{m} \sum_{j=1}^{n} \left\| a^{N+r}_{ij} - x_i^T y_j \right\|^2 = \left\| \sum_{i=1}^{N} p_i q_i^T + PQ^T - XY^T \right\|_F^2 \quad (5)$$

where $\|A\|_F$ is the Frobenius norm of matrix A.

However, in our problem it is assumed that the memory capacity is extremely restricted. In this setting, the direct calculation of $\sum_{i=1}^{N} p_i q_i^T$ is less tractable, because it requires managing a large-scale $m \times n$ similarity matrix. To develop a memory-restricted learning algorithm, we aggressively deploy the assumption that $a^N_{ij} \approx t_i^T z_j$ (i.e., $\sum_{i=1}^{N} p_i q_i^T \approx TZ^T$), which means that the current weight $a^N_{ij}$ is roughly approximated by the inner product of the current latent term vector $t_i$ and the current latent document vector $z_j$. Based on this approximation, we obtain the following criterion, known as the approximated square error:

$$\left\| TZ^T + PQ^T - XY^T \right\|_F^2 \quad (6)$$

Thus, our goal is further simplified to finding an $m \times d$ latent term matrix X and an $n \times d$ latent document matrix Y such that the new target weights $TZ^T + PQ^T$ are preserved by $XY^T$.
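For concreteness, the following is a minimal numpy sketch of the approximated square error of Eq. (6); it materializes the dense $m \times n$ target matrix purely for illustration, which is exactly what the memory-restricted algorithm avoids in practice.

```python
import numpy as np

def approx_square_error(T, Z, P, Q, X, Y):
    """Eq. (6): || T Z^T + P Q^T - X Y^T ||_F^2.
    T, X: m x d latent term matrices; Z, Y: n x d latent document matrices;
    P (m x r) and Q (n x r) hold the r new term/document co-occurrence vectors."""
    target = T @ Z.T + P @ Q.T          # new target weight matrix to be preserved
    return np.linalg.norm(target - X @ Y.T, "fro") ** 2
```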

3.3. Partial-update criterion for approximating query-based term-document weight

The above optimization problem can be solved if we use SVD updating (Berry et al., 1999). We refer to these existing methods as full-update algorithms, because we would have to modify all latent vectors of terms and documents when adaptively learning new co-occurrence events. Instead of using a full-update algorithm, we have developed a partial-update algorithm with the aim of updating only a small set of latent vectors, referred to as the focused sets of terms and documents, which appear in the r new co-occurrence events (i.e., terms and documents having nonzero elements in at least one column of P and Q, respectively).

To derive a partial-update algorithm, let $\mathcal{T}$ be the focused term set, defined by $\bigcup_{i=N+1}^{N+r} \mathcal{T}_i$, and $\mathcal{F}$ be the focused document set, defined by $\bigcup_{i=N+1}^{N+r} \mathcal{F}_i$. Let $\mathcal{T}^C$ be the unfocused term set and $\mathcal{F}^C$ the unfocused document set, which contain the remaining terms and documents that do not appear in the r new co-occurrence events, respectively. Given the focused sets, we can now decompose the matrices T, P, and X into two submatrices, i.e., one consisting of the row vectors from 1 to $|\mathcal{T}|$ that correspond to the focused term set, and another consisting of the row vectors from $|\mathcal{T}|+1$ to m that correspond to the unfocused term set. As a result, T, P, and X are redefined using two submatrices that correspond to $\mathcal{T}$ and $\mathcal{T}^C$, respectively, as follows:

$$T = \begin{bmatrix} T_0 \\ V \end{bmatrix}, \qquad P = \begin{bmatrix} P_0 \\ 0 \end{bmatrix}, \qquad X = \begin{bmatrix} X_0 \\ V \end{bmatrix} \quad (7)$$

where $T_0$, $P_0$, and $X_0$ are $|\mathcal{T}| \times d$ focused matrices whose rows consist of the latent vectors of only the focused terms, and V is an $(m - |\mathcal{T}|) \times d$ unfocused matrix whose rows consist of the latent vectors of the unfocused terms, which remain unchanged. Likewise, we also decompose the matrices Z, Q, and Y into two submatrices, i.e., one consisting of the row vectors from 1 to $|\mathcal{F}|$ that correspond to the focused document set, and another consisting of the row vectors from $|\mathcal{F}|+1$ to n that correspond to the unfocused document set, as follows:

$$Z = \begin{bmatrix} Z_0 \\ W \end{bmatrix}, \qquad Q = \begin{bmatrix} Q_0 \\ 0 \end{bmatrix}, \qquad Y = \begin{bmatrix} Y_0 \\ W \end{bmatrix} \quad (8)$$

where $Z_0$, $Q_0$, and $Y_0$ are $|\mathcal{F}| \times d$ focused matrices whose rows consist of the latent vectors of only the focused documents, and W is an $(n - |\mathcal{F}|) \times d$ unfocused matrix whose rows consist of the latent vectors of the unfocused documents.

Using this decomposition, the new term-document weight matrix to be approximated is rewritten as follows:

$$TZ^T + PQ^T = \begin{bmatrix} T_0 & P_0 \\ V & 0 \end{bmatrix} \begin{bmatrix} Z_0 & Q_0 \\ W & 0 \end{bmatrix}^T = \begin{bmatrix} T_0 Z_0^T + P_0 Q_0^T & T_0 W^T \\ V Z_0^T & V W^T \end{bmatrix} \quad (9)$$

Similarly, the approximated term-document weight matrix using the new latent matrices X and Y is decomposed into

$$XY^T = \begin{bmatrix} X_0 Y_0^T & X_0 W^T \\ V Y_0^T & V W^T \end{bmatrix} \quad (10)$$

Note that the term-document weight matrix of Eq. (9) contains two different types of submatrices. The first is the within weight matrix (WWM), denoted by $T_0 Z_0^T + P_0 Q_0^T$, which represents the weights between any pair of a focused term and a focused document within $\mathcal{T}$ and $\mathcal{F}$. The second is the between weight matrix (BWM), denoted by $T_0 W^T$ and $V Z_0^T$, which represents the weights between a focused term and an unfocused document or between an unfocused term and a focused document.

To derive a new error criterion that preserves both the WWM and the BWM in a balanced manner, let $J_W$ be the square error between the WWM of the new term-document weight matrix in Eq. (9) and that of the approximation in Eq. (10), and let $J_B$ be the corresponding square error for the BWM:

$$J_W = \left\| T_0 Z_0^T + P_0 Q_0^T - X_0 Y_0^T \right\|_F^2$$
$$J_B = \left\| T_0 W^T - X_0 W^T \right\|_F^2 + \left\| Z_0 V^T - Y_0 V^T \right\|_F^2 \quad (11)$$

By interpolating these two square errors, $J_W$ and $J_B$, we formulate the final criterion J to be minimized:

$$J = \beta J_W + (1-\beta) J_B \quad (12)$$

where the parameter $\beta$ is the mixing rate.
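A minimal numpy sketch of the interpolated criterion of Eqs. (11) and (12) is given below; the function name and argument order are illustrative only, with T0, X0 the focused term matrices, Z0, Y0 the focused document matrices, P0, Q0 the focused co-occurrence vectors, and V, W the unfocused latent term and document matrices.

```python
import numpy as np

def partial_update_criterion(T0, Z0, P0, Q0, X0, Y0, V, W, beta=0.5):
    """Interpolated criterion J = beta * J_W + (1 - beta) * J_B (Eqs. (11)-(12))."""
    fro2 = lambda A: np.linalg.norm(A, "fro") ** 2
    J_W = fro2(T0 @ Z0.T + P0 @ Q0.T - X0 @ Y0.T)                 # within weights
    J_B = fro2(T0 @ W.T - X0 @ W.T) + fro2(Z0 @ V.T - Y0 @ V.T)   # between weights
    return beta * J_W + (1 - beta) * J_B
```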

3.4. Fixed-point algorithm

Taking the partial derivative of the criterion J with respect to $X_0$ and $Y_0$, we obtain the following condition for the critical points of J with respect to $X_0$ and $Y_0$:

$$(1-\beta)\, X_0 W^T W = (1-\beta)\, T_0 W^T W + \beta E_0 Y_0$$
$$(1-\beta)\, Y_0 V^T V = (1-\beta)\, Z_0 V^T V + \beta E_0^T X_0 \quad (13)$$

where $E_0$ is defined by $T_0 Z_0^T + P_0 Q_0^T - X_0 Y_0^T$, which denotes the pairwise error of the approximated weights with respect to the target weights. There is no analytical means of solving Eq. (13) with respect to $X_0$ and $Y_0$, so we apply a fixed-point iteration that alternately finds $X_0$ and $Y_0$ satisfying Eq. (13) by reorganizing it. The derived iterative procedure is summarized as follows.

1. Initialize $X_0$ and $Y_0$:

$$X_0 \leftarrow T_0, \qquad Y_0 \leftarrow Z_0$$

2. Apply the fixed-point iteration until $X_0$ and $Y_0$ converge:

$$E_0 \leftarrow T_0 Z_0^T + P_0 Q_0^T - X_0 Y_0^T$$
$$X_0 \leftarrow T_0 + \frac{\beta}{1-\beta}\, E_0 Y_0 \left( W^T W \right)^{-1}$$
$$Y_0 \leftarrow Z_0 + \frac{\beta}{1-\beta}\, E_0^T X_0 \left( V^T V \right)^{-1} \quad (14)$$

3. Assign the resulting converged matrices $X_0$ and $Y_0$ to the new latent matrices $T_0$ and $Z_0$, respectively:

$$T_0 \leftarrow X_0, \qquad Z_0 \leftarrow Y_0$$

We call the above algorithm two-sided memory-restricted dimensionality reduction, referred to as TADR-Basic for short.
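The sketch below is a minimal numpy rendering of one TADR-Basic partial update (Eq. (14)), under stated assumptions: the function name, the iteration count, and the small ridge term `lam` (anticipating the regularization of Eq. (17) so that the inverses stay well posed) are choices of this sketch rather than prescriptions of the paper.

```python
import numpy as np

def tadr_basic_update(T0, Z0, P0, Q0, V, W, beta=0.5, lam=1e-4, n_iter=20):
    """One partial update of TADR-Basic: only the focused latent matrices are
    revised; the unfocused matrices V (terms) and W (documents) stay fixed."""
    d = T0.shape[1]
    WtW_inv = np.linalg.inv(W.T @ W + lam * np.eye(d))
    VtV_inv = np.linalg.inv(V.T @ V + lam * np.eye(d))
    X0, Y0 = T0.copy(), Z0.copy()                  # step 1: initialization
    for _ in range(n_iter):                        # step 2: fixed-point iteration
        E0 = T0 @ Z0.T + P0 @ Q0.T - X0 @ Y0.T     # pairwise error term
        X0 = T0 + (beta / (1 - beta)) * E0 @ Y0 @ WtW_inv
        Y0 = Z0 + (beta / (1 - beta)) * E0.T @ X0 @ VtV_inv
    return X0, Y0                                  # step 3: assign back to T0, Z0
```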

There are two exceptional cases in which we do not need to use the BWM criterion, i.e., when all elements of $T_0$ or $X_0$ are zero, or when W or V is a zero matrix. In these cases, the BWM criterion can be dropped, because the fixed-point iteration then has no effect when updating the latent vectors, or $X_0$ and $Y_0$ would become zero matrices as a result. In these cases, we use the specialized condition for the critical points in which the criterion J is simplified to $J_W$, i.e., $\beta = 1$:

$$\left( T_0 Z_0^T + P_0 Q_0^T - X_0 Y_0^T \right) Y_0 = 0$$
$$\left( Z_0 T_0^T + Q_0 P_0^T - Y_0 X_0^T \right) X_0 = 0 \quad (15)$$

After some algebraic manipulation, we obtain the following fixed-point equations for this special case:

$$X_0 = \left( T_0 Z_0^T + P_0 Q_0^T \right) Y_0 \left( Y_0^T Y_0 \right)^{-1}$$
$$Y_0 = \left( Z_0 T_0^T + Q_0 P_0^T \right) X_0 \left( X_0^T X_0 \right)^{-1} \quad (16)$$

In the specialized fixed-point iteration stated above, $X_0$ and $Y_0$ must be initialized with non-zero matrices; otherwise, the updating procedure is not meaningful. In our experiments, $X_0$ and $Y_0$ are initialized to $[I_d \; 0]^T$, where $I_d$ is the $d \times d$ identity matrix. This initialization is restricted to the case where $d \le |\mathcal{T}|$ or $d \le |\mathcal{F}|$; if $d > |\mathcal{T}|$ or $d > |\mathcal{F}|$, then $X_0$ or $Y_0$ is instead initialized by $[I_{|\mathcal{T}|} \; 0]$ or $[I_{|\mathcal{F}|} \; 0]$, respectively.

In practice, we need to consider a pending issue: the inversions of $V^T V$ and $W^T W$ used in our fixed-point algorithm are both known to be ill-posed problems. As is typical in handling ill-posed problems, we use the following regularized criterion $J'$, obtained by adding a regularization term $J_R$ to J:

$$J' = J + \lambda J_R \quad (17)$$

where $\lambda$ is a non-negative value and the regularization term $J_R$ is defined by $\mathrm{tr}(X^T X)$ and $\mathrm{tr}(Y^T Y)$. The fixed-point equation for $J'$ is slightly modified from the original one, although the detailed derivation is not presented in this paper.

3.5. Further extension of adaptive algorithms with p-norm constraint

Reflexivity refers to the property that the object most similar to a given object is the object itself. We note that reflexivity depends on the learning space used to represent latent vectors. For example, if the inner-product metric is used as the similarity measure, reflexivity holds when the learning space is a circle (hypersphere), whereas it does not when the space is a hyperplane. To provide a higher degree of reflexivity, we thus restrict the learning space to a d-dimensional hyperball, since it has a higher degree of reflexivity than $\mathbb{R}^d$.

To force latent vectors onto a d-dimensional hyperball, we additionally impose the following p-norm constraint, with radius $\Delta_p$, on all latent vectors:

$$\|x\|_p \le \Delta_p, \qquad \|y\|_p \le \Delta_p$$

To support the p-norm constraint, once the latent vectors have been updated by Algorithm TADR-Basic, we project them onto the hyperball so that they satisfy the p-norm constraints. Formally, the projection is

$$x_i \leftarrow x_i / \max\!\left(\Delta_p, \|x_i\|_p\right), \qquad y_i \leftarrow y_i / \max\!\left(\Delta_p, \|y_i\|_p\right) \quad (18)$$

As a result, all latent vectors of terms and documents are positioned within a hyperball. We call our method of dimensionality reduction with p-norm projection TADR-Proj.
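Below is a minimal sketch of the projection step of Eq. (18); with radius 1 and p = 2, which are the settings used in the experiments, it rescales only the vectors that fall outside the unit ball. The function name is illustrative.

```python
import numpy as np

def project_pnorm(vectors, radius=1.0, p=2):
    """p-norm projection of TADR-Proj (Eq. (18)), applied after a TADR-Basic update."""
    projected = []
    for v in vectors:
        norm = np.linalg.norm(v, ord=p)
        projected.append(v / max(radius, norm))   # unchanged if already inside the ball
    return np.stack(projected)
```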

3.6. Algorithm’s property: dimension uniformity

A notable property resulting from our proposed fixed-point equation is dimension uniformity, which means that all components used for representing latent vectors remain important for representing the similarity metric. More formally, given two arbitrary eigenvalues $\sigma_i$ and $\sigma_j$ of the covariance matrix, we say that dimension uniformity holds if the ratio $\sigma_i / \sigma_j$ is upper-bounded, rather than unboundedly large.

In a dimensionality reduction algorithm, it is clearly advantageous to have the property of dimension uniformity. Otherwise, the ratio $\sigma_i / \sigma_j$ is likely to be unbounded and some components of the latent vectors become insignificant for representing the similarity metric; as a result, the actual number of dimensions used for representing latent vectors becomes considerably smaller than the original number of dimensions d.

To check in detail whether dimension uniformity holds in the proposed algorithm TADR-Basic, we rewrite the fixed-point iteration of Eq. (14) by applying a latent-space transformation and detransformation using the eigenvectors of $V^T V$ and $W^T W$. Appendix B presents the detailed procedure of this variant. Under the variant of the fixed-point iteration, we decompose the transformed latent matrices $T_0'$ and $X_0'$ into their column vectors, $[t^{(1)} \cdots t^{(d)}]$ and $[x^{(1)} \cdots x^{(d)}]$, respectively; similarly, we decompose $Z_0'$ and $Y_0'$ into their column vectors, $[z^{(1)} \cdots z^{(d)}]$ and $[y^{(1)} \cdots y^{(d)}]$, respectively. Then, the fixed-point matrix equation (Eq. (22) in Appendix B) for the ith column vector is rewritten as follows:

$$\big(X_0' M\big)^{(i)} \leftarrow \big(T_0' M\big)^{(i)} + \frac{\beta}{(1-\beta)\, s_i}\, E_0'\, y^{(i)}$$
$$\big(Y_0' M^T\big)^{(i)} \leftarrow \big(Z_0' M^T\big)^{(i)} + \frac{\beta}{(1-\beta)\, d_i}\, E_0'^{T}\, x^{(i)}$$

where M is defined as in Appendix B and $s_i$ (or $d_i$) indicates the ith diagonal element of S (or D). Provided that the average of the latent vectors is zero, $s_i$ (or $d_i$) is proportional to the variance of the latent vectors projected onto the corresponding eigenvector, since $W^T W$ (or $V^T V$) approximately corresponds to the covariance matrix of the latent vectors.

It should be noted that $\beta / ((1-\beta) s_i)$ (or $\beta / ((1-\beta) d_i)$) can be considered a pseudo-learning rate, which controls the extent of variation in the corresponding column vector of the latent term matrix. This pseudo-learning rate is dimension-dependent, i.e., the learning rate is lower for more significant dimensions (i.e., those with large $s_i$ or $d_i$). Because the pseudo-learning rate is dimension-dependent, all the dimensions can remain unbiased during learning, thus preventing the ratio $\sigma_i / \sigma_j$ ($\sigma_i \ge \sigma_j$) from becoming unboundedly large.

3.7. Computational complexity

Table 1. Comparison of the time and space complexities of three methods for learning the term weight values. Nonapprox indicates a naive learning method without approximation, Brand's method is one of the state-of-the-art full-update methods, and the proposed method is the proposed partial-update dimensionality reduction method.

Method | Time complexity (per update) | Space complexity (per update) | Space complexity (during learning)
Nonapprox | $O(|\mathcal{T}||\mathcal{F}|)$ | $O(|\mathcal{T}||\mathcal{F}|)$ | $O(mn)$
Brand's method | $O(mnd)$ | $O((m+n)d)$ | $O((m+n)d)$
Proposed method | $O(d^3 + d^2(|\mathcal{T}|+|\mathcal{F}|) + d|\mathcal{T}||\mathcal{F}|)$ | $O(|\mathcal{T}||\mathcal{F}|d)$ | $O((m+n)d)$

Table 1 compares the computational complexities of the three methods used for learning the term weight values. As observed in Table 1, Brand (2002)'s method requires a space complexity of $O((m+n)d)$ per update, the non-approximated method requires $O(mnd)$, and our proposed partial-update method requires only $O(d^3 + d^2(|\mathcal{T}|+|\mathcal{F}|) + d|\mathcal{T}||\mathcal{F}|)$. (Here, the number of fixed-point iterations is assumed to be bounded by a constant.)

Noticeably, compared to Brand (2002)'s method, our proposed partial-update method significantly reduces the time complexity. Provided that $O(|\mathcal{T}||\mathcal{F}|d) \approx O(|\mathcal{F}|d^2) \approx O(|\mathcal{T}|d^2) \approx O(d^3)$ (cf. in one of our experiments, $d = 50$ and $|\mathcal{T}| = |\mathcal{F}| = 50$), the time complexity of the partial-update method is roughly $O(d^3)$, which is significantly smaller than the $O(mnd)$ of Brand (2002)'s method.

4. Experimentation

4.1. Experimental setting

4.1.1. Test set

The proposed algorithm is evaluated on one small test collection, the MED dataset, and on a sub-collection of the TIPSTER dataset in TREC, named AP (i.e., Associated Press), which consists of 158,240 documents.

4.1.2. Automatic generation of queries

One critical problem we encountered when evaluating our algorithms was that each test collection contained only a small number of queries. To obtain a sufficient number of queries, we generated queries automatically for each collection. Our procedure for generating queries, sketched in code below, was as follows:

1. Randomly select a document: $d \sim 1/N$. Since it becomes the source for generating a query, we call this document the seed document.
2. Initialize the query as an empty string: q = "".
3. Select a term to be added to the query according to the document language model: $w \sim P(w \mid \theta_d)$. After selecting the term w, it is attached to query q, i.e., q = q + w, where + denotes the concatenation of two word sequences, sequentially attaching w to the end of q. Note that we used the parsimonious language model (Hiemstra et al., 2004) for $P(w \mid \theta_d)$, instead of the maximum likelihood estimate of the document model, in order to append as many topical terms to the query as possible.
4. Decide whether the query continues, according to the continuing probability $\alpha$; if the continuing event occurs, go to Step 3. Otherwise, exit the generation loop. We used $\alpha = 0.2$.
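The following is a minimal sketch of this query-generation procedure, assuming a hypothetical `doc_language_model` callable that returns $P(w \mid \theta_d)$ as a term-to-probability dictionary (e.g., a parsimonious language model).

```python
import random

def generate_query(documents, doc_language_model, alpha=0.2, rng=random):
    """Automatic query generation (Section 4.1.2): pick a seed document
    uniformly, then repeatedly sample terms from its language model,
    continuing with probability alpha (0.2 in the paper)."""
    seed = rng.choice(documents)                    # step 1: seed document, d ~ 1/N
    model = doc_language_model(seed)                # P(w | theta_d)
    terms, probs = zip(*model.items())
    query = []                                      # step 2: empty query
    while True:
        query.append(rng.choices(terms, weights=probs)[0])   # step 3: w ~ P(w | theta_d)
        if rng.random() >= alpha:                   # step 4: stop with probability 1 - alpha
            break
    return query
```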

Using the generation procedure described above, we automatically obtained 20,000 queries for MED and 1,000,000 queries for AP. For each query, we induced a term-document co-occurrence event from the top-retrieved documents, which were obtained by performing retrieval using the language-modeling approach with Dirichlet prior smoothing, where the smoothing parameter was fixed at 1000. For AP, we reused the 1,000,000 events ten times, thus finally processing 10,000,000 events with our algorithm.

4.1.3. Evaluation measures

4.1.3.1. MAP: retrieval performance. The most reliable way to evaluate the quality of the learned similarity is to perform cluster-based retrieval based on the learned similarity and to assess whether the retrieval performance improves. For this purpose, we adopted mean average precision (MAP), the most widely accepted evaluation measure of retrieval performance.

4.1.3.2. NNT-MAP and NNT-P@X: ranked-list evaluation of inter-document similarities. Voorhees' NNT was used as an evaluation measure. Originally, Voorhees' NNT, which was developed to test the cluster hypothesis of Jardine and Rijsbergen (1971) and Rijsbergen (1979), was the ratio of the number of co-relevant documents among the top five most similar documents for a given relevant document (Voorhees, 1985), which corresponds to the precision at five documents in an ad hoc task. Voorhees' NNT is an evaluation of ranked lists of objects; therefore, without referring to an explicit similarity value, NNT is fundamentally the same as an evaluation of an ad hoc retrieval task. Generalizing the original NNT to any evaluation metric for a ranking task, all IR evaluation metrics such as MAP and P@X can be used as additional metrics for NNT. As evaluation metrics, we used NNT-MAP and NNT-P@X, which correspond to MAP and P@X (i.e., precision for the top X documents) applied to the ranked lists for NNT, respectively. In this notation, Voorhees' NNT is referred to as NNT-Pr@5. The number of documents in a ranked list was fixed at 1000 in all our experiments. For the evaluation of NNT-MAP and NNT-Pr@X, the inter-document similarity between the ith and jth documents is defined by $z_i^T z_j$.

[Fig. 1. NNT-MAPs of inter-document similarities of the latent document vectors learned by TADR-Proj, as a function of the number of processed co-occurrence events (unit: 1000). The top panel shows the performance curve over 20,000 co-occurrence events during learning on the MED collection (d = 5, 10, 20); the bottom panel shows the performance curve over 10,000,000 co-occurrence events during learning on the AP collection (d = 50).]

4.2. Evaluation results

4.2.1. Evaluating the quality of the learned similarity

For all test collections, we apply TADR-Proj to learn the latent semantic space, using F as the set of top-retrieved documents. To define the query-based document term weight, we use Eq. (3), where the query model $P(w \mid \theta_{q_k})$ is estimated by applying model-based feedback on the basis of the documents in F (Zhai and Lafferty, 2001a). The other parameters are set as follows: $\lambda$ (the regularization parameter in Eq. (17)), 0.0001; $\beta$ (the mixing rate in Eq. (12)), 0.5; $\Delta_p$ (the p-norm radius in TADR-Proj), 1; p (the p-norm in TADR-Proj), 2; and r (the number of co-occurrence events processed in a single update), 1. $|\mathcal{T}|$ and $|\mathcal{F}|$ (the numbers of focused terms and documents) are 10 for MED and CISI, and 50 for TREC-AP.

Fig. 1 shows the curves of the NNT-MAPs of the inter-document similarities derived from the latent semantic space resulting from the application of TADR-Proj, as a function of the number of processed co-occurrence events. Table 2 shows the final NNT-MAP, NNT-Pr@10, and NNT-Pr@20 of the inter-document similarities learned by TADR-Proj after processing the given number of co-occurrence events. The total number of processed co-occurrence events is 20,000 for MED and 10,000,000 for AP. As shown in Fig. 1, the NNT-MAPs gradually increase with the number of processed co-occurrence events, implying that the proposed algorithm constructs the latent semantic space stably and adaptively. In the MED collection, as d increases, the final NNT-MAP is generally higher. However, when d is larger than 30, the additional use of dimensions for the latent space does not improve the final NNT-MAP significantly.

Table 2. NNT-MAPs of latent document vectors obtained from TADR-Proj after processing all given co-occurrence events, i.e., 20,000 for MED and 10,000,000 for AP.

Coll | d | NNT-MAP | NNT-Pr@10 | NNT-Pr@20
MED | 5 | 0.4032 | 0.4872 | 0.4137
MED | 10 | 0.5301 | 0.6070 | 0.5409
MED | 20 | 0.5743 | 0.6378 | 0.5749
MED | 30 | 0.5744 | 0.6422 | 0.5767
MED | 40 | 0.5746 | 0.6388 | 0.5739
MED | 50 | 0.5752 | 0.6382 | 0.5743
AP | 50 | 0.1345 | 0.3992 | 0.3273

4.2.2. Evaluating retrieval effectiveness in document expansion using the learned latent space

4.2.2.1. Document expansion methods. In order to determine whether the latent semantic vectors of terms and documents obtained by applying TADR-Proj further improve retrieval effectiveness, we evaluated a document expansion method using latent semantic vectors, comparing the results to those of the baseline method.

In general, document expansion combines two different scores, one from the original retrieval method and another from a semantically enhanced retrieval method. Formally, the document expansion score is given by $\gamma \cdot \mathrm{originalscore}_i + (1-\gamma) \cdot \mathrm{expscore}_i$, where $\gamma$ is an interpolation parameter, $\mathrm{originalscore}_i$ is the original relevance score induced from the original term-document weights, and $\mathrm{expscore}_i$ is the semantically enhanced matching score computed from the latent semantic term and document vectors.

We consider two different document expansion methods. The first is based on the vector-space model. In this method, $\mathrm{originalscore}_i$ is computed using the pivoted normalization vector space model [8], and $\mathrm{expscore}_i$ is defined as the ith element of the matrix $Z(T^T q)$, where q is the query vector defined on the original term space. That is, $\mathrm{expscore}_i$ is the inner product of the query vector and the ith document vector in the latent semantic space. We call this document expansion method LS-VSM. The baseline method simply ranks documents by $\mathrm{originalscore}_i$ only and is referred to as VSM.

The second method is cluster-based retrieval as used in language modeling approaches (Liu and Croft, 2004; Kurland and Lee, 2004; Tao et al., 2006). Among several variants, our method is mainly based on the aspect-x interpolation method (Kurland and Lee, 2004). In the second method, $\mathrm{originalscore}_i$ is computed using Jelinek-Mercer smoothing in the standard language modeling approach (Zhai and Lafferty, 2001b), and $\mathrm{expscore}_i$ is computed using cluster-based language models, where for each document its cluster is defined as the set of its nearest documents. To find the documents nearest to a given document, we use $ZZ^T$ to define the inter-document similarities. This second document expansion method is called CLM, and its baseline is referred to as LM.
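As an illustration of the LS-VSM combination described above, the sketch below computes $\gamma \cdot \mathrm{originalscore}_i + (1-\gamma) \cdot \mathrm{expscore}_i$ with $\mathrm{expscore} = Z(T^T q)$; the function name and the default value of gamma are assumptions of this sketch (the paper tunes the interpolation parameter per collection).

```python
import numpy as np

def ls_vsm_scores(original_scores, Z, T, query_vector, gamma=0.5):
    """LS-VSM document expansion: interpolate the original relevance scores
    with latent-space matching scores, expscore_i = (Z (T^T q))_i."""
    exp_scores = Z @ (T.T @ query_vector)   # project the query, then match documents
    return gamma * np.asarray(original_scores) + (1 - gamma) * exp_scores
```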

4.2.2.2. Retrieval results. Our evaluation of document expansion is based on the same latent vectors used in the previous experiments. The setting of the combination parameter $\gamma$ varies according to the test collection and the dimensionality reduction (DR) method. The cluster size in CLM is set to 50.

[Fig. 2. MAPs of LS-VSM and CLM using the latent vectors obtained from TADR-Proj in MED, as a function of the number of processed co-occurrence events (unit: 1000). The top panel shows the performance curves of LS-VSM and the bottom panel those of CLM, each for d = 5, 10, and 20, compared with its baseline.]

Figs. 2 and 3 depict the performance curves of the MAPs obtained using the latent vectors learned by TADR-Proj in each test collection. Fig. 2 shows the performance curves of the MAPs of LS-VSM and CLM in the MED and CISI collections, in comparison with their baselines, for d = 5, 10, and 20, as a function of the number of processed co-occurrence events, while Fig. 3 shows the performance curves of the MAPs of LS-VSM and CLM in the AP collection for d = 50. The results for both test collections show that the proposed algorithm (TADR-Proj) gradually enhances the retrieval performance as the number of processed co-occurrence events increases.

[Fig. 3. MAPs of LS-VSM and CLM (d = 50) using the latent vectors obtained from TADR-Proj in the AP collection, compared with the VSM and LM baselines, as a function of the number of processed co-occurrence events (unit: 1000).]

Table 3 shows the final MAPs of the TADR-Proj-based document expansions in the MED collection, compared to the MAPs obtained using the original latent semantic index (LSI), after processing all given co-occurrence events. Table 4 shows the final MAPs of the TADR-Proj-based document expansions in the AP collection; the MAPs of the original LSI are not presented owing to its limited tractability on this collection.

Table 3. MAPs of LS-VSM and CLM using TADR-Proj in MED, for six different values of d, compared with the results of the original latent semantic index and the baselines. The results using the original latent semantic index are denoted LS-VSM (LSI) and CLM (LSI); the baseline results are those of VSM and LM.

DR method | Baseline | d = 5 | d = 10 | d = 20 | d = 30 | d = 40 | d = 50
LS-VSM (LSI) | 0.5225 | 0.4905 | 0.5092 | 0.5770 | 0.6069 | 0.6095 | 0.6058
LS-VSM (TADR-Proj) | 0.5225 | 0.5694 | 0.5902 | 0.6035 | 0.5988 | 0.6065 | 0.6055
CLM (LSI) | 0.4978 | 0.4915 | 0.5165 | 0.5262 | 0.5363 | 0.5414 | 0.5466
CLM (TADR-Proj) | 0.4978 | 0.5234 | 0.5312 | 0.5425 | 0.5417 | 0.5434 | 0.5438

Table 4. MAPs and P@10 of LS-VSM and CLM using TADR-Proj in AP, compared with the baselines (i.e., VSM and LM). The symbol * indicates a statistically significant improvement over the corresponding baseline under the 5% significance level, using the Wilcoxon signed-rank test.

Performance | LM | CLM | VSM | LS-VSM
MAP | 0.2511 | 0.2621* | 0.2453 | 0.2578*
P@10 | 0.3755 | 0.3918 | 0.3796 | 0.3878

Both tables show that the TADR-Proj-based document expansions achieve statistically significant improvements over the baseline retrievals. In AP, all document expansion methods (i.e., CLM and LS-VSM) show statistically significant improvements over the baseline under the Wilcoxon signed-rank test at a 5% significance level, although the improvements are relatively small. In particular, the results in Table 3 show that the TADR-Proj-based document expansions significantly improve on those of the original LSI, even with a small value of d. This result shows that TADR-Proj has strong potential as a method of constructing a latent semantic space that is better than the traditional one.

In summary, all our experimental results consistently show that the proposed TADR-Proj effectively resolves our problem and provides a promising approach to memory-restricted dimensionality reduction for accumulating term-document co-occurrence events.

5. Conclusion

In this paper, we addressed the novel problem of incrementally learning query-based term-document weights, given a stream of term-document co-occurrence events, in a memory-restricted manner. We formulate the learning problem by first assuming that each term or document has a low-dimensional latent vector and by approximately projecting the target term-document weights into the inner-product space among the latent vectors of terms and documents. We propose an effective fixed-point algorithm that incrementally updates the low-dimensional latent vectors of terms and documents using a partial-update method, which differs from existing full-update methods.

The experimental results show that the proposed algorithm gradually learns a latent semantic space as the number of processed co-occurrence events increases. They further show that the proposed algorithm has the potential to improve the retrieval performance of existing methods, including those using the traditional latent semantic space.

Acknowledgement

The work of the second author was supported by the IT Consilience Creative Program of MKE and NIPA (C1515-1121-0003).

Appendix A. Derivation of the normal fixed-point equation

By using $\mathrm{tr}(A^T A)$ for the square of the Frobenius norm of a matrix A (i.e., $\|A\|_F^2$), the two criteria $J_W$ and $J_B$ are rewritten as

$$J_W = \mathrm{tr}\!\left( \left( T_0 Z_0^T + P_0 Q_0^T - X_0 Y_0^T \right)^T \left( T_0 Z_0^T + P_0 Q_0^T - X_0 Y_0^T \right) \right)$$
$$J_B = \mathrm{tr}\!\left( W (T_0 - X_0)^T (T_0 - X_0) W^T \right) + \mathrm{tr}\!\left( V (Z_0 - Y_0)^T (Z_0 - Y_0) V^T \right) \quad (19)$$

We then obtain the partial derivatives of $J_W$ and $J_B$ with respect to $X_0$ and $Y_0$ as follows:

$$\frac{\partial J_W}{\partial X_0} = 2\left( X_0 Y_0^T Y_0 - \left( T_0 Z_0^T + P_0 Q_0^T \right) Y_0 \right), \qquad \frac{\partial J_B}{\partial X_0} = 2\left( -T_0 W^T W + X_0 W^T W \right)$$
$$\frac{\partial J_W}{\partial Y_0} = 2\left( Y_0 X_0^T X_0 - \left( Z_0 T_0^T + Q_0 P_0^T \right) X_0 \right), \qquad \frac{\partial J_B}{\partial Y_0} = 2\left( -Z_0 V^T V + Y_0 V^T V \right)$$

Since $J = \beta J_W + (1-\beta) J_B$, the condition on the critical points that minimize J with respect to $X_0$ and $Y_0$ is given by

$$\beta\left( X_0 Y_0^T Y_0 - \left( T_0 Z_0^T + P_0 Q_0^T \right) Y_0 \right) + (1-\beta)\left( -T_0 W^T W + X_0 W^T W \right) = 0$$
$$\beta\left( Y_0 X_0^T X_0 - \left( Z_0 T_0^T + Q_0 P_0^T \right) X_0 \right) + (1-\beta)\left( -Z_0 V^T V + Y_0 V^T V \right) = 0$$

By reorganizing the above equations, we obtain the final iterative procedure for calculating the critical points given in Section 3.4 (Eq. (13)).

Appendix B. Variant of the fixed-point equation based on eigendecomposition

The variant of the fixed-point iteration using the eigendecomposition of $V^T V$ and $W^T W$ is expressed as follows.

1. Perform the eigendecomposition of $V^T V$ and $W^T W$ to obtain K, R, D, and S as follows:

$$K D K^T = V^T V, \qquad R S R^T = W^T W \quad (20)$$

where K and R are the matrices of eigenvectors and D and S are the diagonal matrices of eigenvalues of $V^T V$ and $W^T W$, respectively.

2. Transform the latent matrices of the focused objects, $T_0$ and $Z_0$, using K and R into $T_0'$ and $Z_0'$, respectively, and initialize $X_0'$ and $Y_0'$ with the transformed latent matrices:

$$T_0' = T_0 K, \qquad Z_0' = Z_0 R, \qquad X_0' \leftarrow T_0', \qquad Y_0' \leftarrow Z_0' \quad (21)$$

For convenience of notation, we introduce M, defined as

$$M = K^T R$$

where $M M^T = M^T M = I_d$.

3. Apply the fixed-point iteration until $X_0'$ and $Y_0'$ converge:

$$E_0' \leftarrow T_0' M Z_0'^T + P_0 Q_0^T - X_0' M Y_0'^T$$
$$X_0' \leftarrow T_0' + \frac{\beta}{1-\beta}\, E_0' Y_0' S^{-1} M^T$$
$$Y_0' \leftarrow Z_0' + \frac{\beta}{1-\beta}\, E_0'^T X_0' D^{-1} M \quad (22)$$

4. Detransform the final latent matrices $X_0'$ and $Y_0'$ using $K^T$ and $R^T$, which are the inverses of K and R, and assign the resulting matrices to the new latent matrices $X_0$ and $Y_0$, respectively:

$$X_0 \leftarrow X_0' K^T, \qquad Y_0 \leftarrow Y_0' R^T$$

References

Bartell, B.T., Cottrell, G.W., Belew, R.K., 1992. Latent semantic indexing is an optimal special case of multidimensional scaling. In: Proc. 15th Annual Internat. ACM SIGIR Conf. on Research and Development in Information Retrieval, SIGIR'92, pp. 161-167.
Bartell, B.T., Cottrell, G.W., Belew, R.K., 1995. Representing documents using an explicit model of their similarities. J. Amer. Soc. Inform. Sci. 46, 254-271.
Berry, M.W., 1992. Large scale sparse singular value computations. Int. J. Supercomput. Appl. 6, 13-49.
Berry, M.W., Dumais, S.T., O'Brien, G.W., 1995. Using linear algebra for intelligent information retrieval. SIAM Rev. 37, 573-595.
Berry, M.W., Drmac, Z., Jessup, E.R., 1999. Matrices, vector spaces, and information retrieval. SIAM Rev. 41, 335-362.
Berry, M.W., Browne, M., Langville, A.N., Pauca, V.P., Plemmons, R.J., 2006. Algorithms and applications for approximate nonnegative matrix factorization. Comput. Statist. Data Anal., 155-173.
Blei, D.M., Ng, A.Y., Jordan, M.I., 2003. Latent Dirichlet allocation. J. Machine Learn. Res. 3, 993-1022.
Brand, M., 2002. Incremental singular value decomposition of uncertain data with missing values. In: Proc. 7th European Conf. on Computer Vision, Part I, ECCV'02, pp. 707-720.
Dean, J., Ghemawat, S., 2008. MapReduce: Simplified data processing on large clusters. Comm. ACM 51, 107-113.
Deerwester, S., Dumais, S.T., Furnas, G.W., Landauer, T.K., Harshman, R., 1990. Indexing by latent semantic analysis. J. Amer. Soc. Inform. Sci. 41 (6), 391-407.
Dumais, S.T., 1992. LSI meets TREC: A status report. In: Proc. 1st Text REtrieval Conf., TREC-1, pp. 137-152.
Dumais, S.T., Furnas, G.W., Landauer, T.K., Deerwester, S., Harshman, R., 1988. Using latent semantic analysis to improve access to textual information. In: Proc. SIGCHI Conf. on Human Factors in Computing Systems, CHI'88, pp. 281-285.
Hiemstra, D., Robertson, S., Zaragoza, H., 2004. Parsimonious language models for information retrieval. In: Proc. 27th Annual Internat. ACM SIGIR Conf. on Research and Development in Information Retrieval, SIGIR'04, pp. 178-185.
Hofmann, T., 1999. Probabilistic latent semantic indexing. In: Proc. 22nd Annual Internat. ACM SIGIR Conf. on Research and Development in Information Retrieval, SIGIR'99, pp. 50-57.
Hofmann, T., 2003. Collaborative filtering via Gaussian probabilistic latent semantic analysis. In: Proc. 26th Annual Internat. ACM SIGIR Conf. on Research and Development in Information Retrieval, SIGIR'03, pp. 259-266.
Jardine, N., Rijsbergen, C.J.V., 1971. The use of hierarchic clustering in information retrieval. Inform. Storage Retriev. 7, 210-240.
Koren, Y., Bell, R., Volinsky, C., 2009. Matrix factorization techniques for recommender systems. Computer 42, 30-37.
Kurland, O., Lee, L., 2004. Corpus structure, language models, and ad hoc information retrieval. In: Proc. 27th Annual Internat. ACM SIGIR Conf. on Research and Development in Information Retrieval, SIGIR'04, pp. 194-201.
Lavrenko, V., Croft, W.B., 2001. Relevance based language models. In: Proc. 24th Annual Internat. ACM SIGIR Conf. on Research and Development in Information Retrieval, SIGIR'01, pp. 120-127.
Lawrence, N., 2005. Probabilistic non-linear principal component analysis with Gaussian process latent variable models. J. Machine Learn. Res. 6, 1783-1816.
Lee, D.D., Seung, H.S., 1999. Learning the parts of objects by nonnegative matrix factorization. Nature 401, 788-791.
Lee, D.D., Seung, H.S., 2000. Algorithms for non-negative matrix factorization. In: Adv. Neural Inform. Process. Syst., NIPS'00, pp. 556-562.
Levy, A., Lindenbaum, M., 2000. Sequential Karhunen-Loeve basis extraction and its application to images. IEEE Trans. Image Process. 9 (8), 1371-1374.
Liu, X., Croft, W.B., 2004. Cluster-based retrieval using language models. In: Proc. 27th Annual Internat. ACM SIGIR Conf. on Research and Development in Information Retrieval, SIGIR'04, pp. 186-193.
Liu, C., Yang, H.C., Fan, J., He, L.W., Wang, Y.M., 2010. Distributed nonnegative matrix factorization for web-scale dyadic data analysis on MapReduce. In: Proc. 19th Internat. Conf. on World Wide Web, WWW'10, pp. 681-690.
Rijsbergen, C.J.V., 1979. Information Retrieval. Butterworths.
Roweis, S.T., Saul, L.K., 2000. Nonlinear dimensionality reduction by locally linear embedding. Science 290, 2323-2326.
Schölkopf, B., Smola, A.J., Müller, K.R., 1997. Kernel principal component analysis. In: Proc. 7th Internat. Conf. on Artificial Neural Networks, ICANN'97, pp. 583-588.
Tao, T., Wang, X., Mei, Q., Zhai, C., 2006. Language model information retrieval with document expansion. In: Proc. Main Conf. on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, HLT-NAACL'06, pp. 407-414.
Tipping, M.E., Bishop, C.M., 1999. Probabilistic principal component analysis. J. Royal Statist. Soc. B 61 (3), 611-622.
Voorhees, E.M., 1985. The cluster hypothesis revisited. In: Proc. 8th Annual Internat. ACM SIGIR Conf. on Research and Development in Information Retrieval, SIGIR'85, pp. 188-196.
Wang, Y., Bai, H., Stanton, M., Chen, W.Y., Chang, E.Y., 2009. PLDA: Parallel latent Dirichlet allocation for large-scale applications. In: Proc. 5th Internat. Conf. on Algorithmic Aspects in Information and Management, AAIM'09, pp. 301-314.
Wang, Q., Xu, J., Li, H., Craswell, N., 2011. Regularized latent semantic indexing. In: Proc. 34th Internat. ACM SIGIR Conf. on Research and Development in Information Retrieval, SIGIR'11, pp. 685-694.
Wei, X., Croft, W.B., 2006. LDA-based document models for ad-hoc retrieval. In: Proc. 29th Annual Internat. ACM SIGIR Conf. on Research and Development in Information Retrieval, SIGIR'06, pp. 178-185.
Xu, W., Liu, X., Gong, Y., 2003. Document clustering based on non-negative matrix factorization. In: Proc. 26th Annual Internat. ACM SIGIR Conf. on Research and Development in Information Retrieval, SIGIR'03, pp. 267-273.
Yu, C., Wang, Y., Chen, C., 1985. Adaptive document clustering. In: SIGIR'85, pp. 197-203.
Yu, K., Zhu, S., Lafferty, J., Gong, Y., 2009. Fast nonparametric matrix factorization for large-scale collaborative filtering. In: Proc. 32nd Internat. ACM SIGIR Conf. on Research and Development in Information Retrieval, SIGIR'09, pp. 211-218.
Zhai, C., Lafferty, J., 2001a. Model-based feedback in the language modeling approach to information retrieval. In: Proc. 10th Internat. Conf. on Information and Knowledge Management, CIKM'01, pp. 403-410.
Zhai, C., Lafferty, J., 2001b. A study of smoothing methods for language models applied to ad hoc information retrieval. In: Proc. 24th Annual Internat. ACM SIGIR Conf. on Research and Development in Information Retrieval, SIGIR'01, pp. 334-342.