Topic Regression Multi-Modal Latent Dirichlet Allocation for Image Annotation

Duangmanee Putthividhya
UCSD, 9500 Gilman Drive, La Jolla, CA 92307
[email protected]

Hagai T. Attias
Golden Metallic, Inc., P. O. Box 475608, San Francisco, CA 91147
[email protected]

Srikantan S. Nagarajan
UCSF, 513 Parnassus Avenue, San Francisco, CA 94143
[email protected]

Abstract

We present topic-regression multi-modal Latent Dirichlet Allocation (tr-mmLDA), a novel statistical topic model for the task of image and video annotation. At the heart of our new annotation model lies a novel latent variable regression approach to capture correlations between image or video features and annotation texts. Instead of sharing a set of latent topics between the 2 data modalities as in the formulation of correspondence LDA in [2], our approach introduces a regression module to correlate the 2 sets of topics, which captures more general forms of association and allows the number of topics in the 2 data modalities to be different. We demonstrate the power of tr-mmLDA on 2 standard annotation datasets: a 5000-image subset of COREL and a 2687-image LabelMe dataset. The proposed association model shows improved performance over correspondence LDA as measured by caption perplexity.

1. Introduction

Image and video retrieval has long been an important area of research in computer vision. Traditional methods in multimedia retrieval have focused on a query-by-example paradigm, also known as content-based retrieval, where users submit an image or video query and the system returns the item in the database closest to the query in some distance measure. As today's multimedia content becomes increasingly multi-modal, with texts accompanying images and videos in the form of content descriptions, transcribed text, or captions, current state-of-the-art multimedia search technology relies heavily on these collateral annotation texts to identify and retrieve images and video. Besides the fact that users often prefer textual queries over examples, an important benefit of such an approach is high-level semantic retrieval, e.g. retrieval of abstract concepts, which could not be achieved with the low-level visual cues used in most query-by-example systems.

With annotation texts playing an increasingly vital role in modern multimedia retrieval systems, a relevant question is how to deal with the large and fast-growing body of user-generated content that often lacks the descriptive annotation texts that would enable accurate semantic retrieval. The traditional solution is to employ manual labeling, a process that is costly and does not scale to large repositories. With the recent unprecedented availability of image and video data online, there is a growing demand to bypass human intervention and develop automated tools that can generate semantic descriptors of multimedia content: automatic annotation systems. Given a database of images and their corresponding annotation words that can be used for training, the task of an automatic annotation algorithm is to learn patterns of image-text (or video-text) association so that, when presented with an un-annotated image, the system can accurately infer the missing annotation.

Previous work on image and video annotation can be broadly summarized into 2 groups. The first line of work takes a discriminative approach and casts automatic annotation as a classification problem, treating annotation words as class labels [6, 10]. By directly modeling the conditional distribution of image features given annotation words (the class-conditional density), this line of work makes no attempt to uncover correlation structures in annotation texts that can be useful when predicting annotations. A second set of techniques considers probabilistic latent variable models for the task of multimedia annotation. By postulating the existence of a small set of hidden factors that govern the association between the 2 data types, the latent variable representations learned under such an assumption are ensured to be useful in predicting the missing captions given the corresponding image. Several modeling variations have been proposed, see [1, 2, 9, 8], differing in the forms of probability distribution assumed for caption words (multinomial vs. Bernoulli) and image features (mixture of Gaussians vs. non-parametric density estimation).

In the specific case of statistical topic models applied to the task of image annotation, the seminal work of Blei et al. [2] proposed 2 association models, multi-modal LDA (mmLDA) and correspondence LDA (cLDA), which extend the basic Latent Dirichlet Allocation (LDA) model to learn the joint distribution of texts and image features. In order to capture correlations between the two modalities, the association models use a set of shared latent variables to represent the underlying causes of cross-correlations in the data. Indeed, with the same forms of probability distributions assumed, the two models differ only in their choice of shared latent variables. In mmLDA [1, 2], the mean topic proportion variable is shared between the 2 modalities, potentially leading to an undesirable scenario where some topics are used entirely to explain either image features or caption words. To ensure that the same sets of topics are used to generate corresponding data in the 2 modalities, under cLDA each caption word directly shares a hidden topic variable with a randomly selected image region. Better prediction results are reported in [2] as a result of the tighter association enforced by sharing the hidden topics.

In this work, we are interested in alternative methods for capturing statistical association between image and text. Instead of sharing the hidden topics as in the formulation of cLDA, our model, which we call Topic-Regression Multi-Modal Latent Dirichlet Allocation (tr-mmLDA), learns 2 separate sets of hidden topics and a regression module which allows one set of topics to be linearly predicted from the other. More specifically, we introduce a linear Gaussian regression module where the topic proportion variable for the annotation text is the response variable and is modeled as a noise-corrupted version of a linear combination of the image topic variables. Inspired by the good predictive performance of the supervised LDA model [4], we adopt the empirical image topic frequencies as covariates to ensure that only the topics that actually occur in the image modality are used in predicting the caption text. Our proposed formulation can capture varying degrees of correlation between the two data modalities and allows the number of hidden topics for images to differ from that for caption texts. We derive an efficient variational inference algorithm which relies on a mean-field approximation to handle intractable posterior computations. To demonstrate the predictive power of the new association model, we compare image annotation performance on 2 standard datasets: a 5000-image subset of COREL and a 2687-image 8-category subset of the LabelMe dataset. Our results are superior to cLDA as measured by caption perplexity.

The paper is organized as follows. Section 2 describes the representation of images and caption texts used in all topic models discussed in this work. In Section 3, we review the association models mmLDA and cLDA and discuss their strengths and weaknesses with respect to an image annotation task. We then describe the details of our proposed model and show the derivation of our variational inference algorithm. In Section 5, we present experimental results on an image annotation task and conclude the paper.

2. Data Representation and Notations

Inspired by the success of recent work on scene modeling [7, 12], we borrow a tool from statistical text document analysis and adopt a bag-of-words representation for both image and text. In such a representation, word ordering is ignored and a document is reduced to a vector of word counts. A multimedia document consisting of an image and the corresponding caption text is thus summarized in our representation as a pair of vectors of word counts. An image word is denoted by a unit-basis vector r of size Tr with exactly one non-zero entry, representing membership to exactly one word in a dictionary of Tr visual words. A caption word wn is similarly defined for a dictionary of size Tw. An image is a collection of N word occurrences denoted by R = {r1, r2, . . . , rN}; the caption text is a collection of M word occurrences denoted by W = {w1, w2, . . . , wM}. A training set of D image-caption pairs is denoted {Rd, Wd}, d ∈ {1, 2, . . . , D}.
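As a concrete illustration (ours, not from the paper), the following sketch builds the pair of count vectors for one image-caption document from pre-computed visual-word and caption-word indices; the dictionary sizes Tr and Tw are placeholders.

```python
import numpy as np

# Hypothetical dictionary sizes (placeholders, not values from the paper).
Tr, Tw = 256, 1000

# One multimedia document: indices of its N visual words and M caption words.
visual_word_ids = [12, 240, 12, 7, 99]      # r_1 ... r_N as dictionary indices
caption_word_ids = [3, 57, 3]               # w_1 ... w_M as dictionary indices

# Bag-of-words representation: word order is discarded, only counts remain.
image_counts = np.bincount(visual_word_ids, minlength=Tr)
caption_counts = np.bincount(caption_word_ids, minlength=Tw)

# Equivalently, each word is a unit-basis ("one-hot") vector and the
# document is the sum of its word vectors.
one_hot = np.eye(Tr)[visual_word_ids]
assert np.array_equal(one_hot.sum(axis=0), image_counts)
```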

3. Probabilistic Models

All the topic models discussed in this work build on Latent Dirichlet Allocation (LDA) [5], a powerful generative model for words in documents. Under LDA, words in the same document are allowed to exhibit characteristics from multiple components (topics). A document, which is a collection of words, is then summarized in terms of the components' overall relative influence on the collection. To this end, LDA employs 2 sets of latent variables for each document, as seen in Fig 1(a): (i) discrete-valued hidden variables zn which assign each word to one of the K components, and (ii) a latent variable θ that represents the random proportion of the components' influence in the document. In more specific terms, LDA decomposes the distribution of word counts for each document into contributions from K topics and models the proportion of topics as a Dirichlet distribution, while each topic, in turn, is a multinomial distribution over words.

3.1. Multi-modal LDA (mmLDA) and Correspondence LDA (cLDA)

In order to extend the basic LDA model to learn joint correlations between data of different types, a traditional solution under a probabilistic framework is to assume the existence of a small set of shared latent variables that are the common causes of correlations between the 2 modalities. This is precisely the design philosophy behind multi-modal LDA (mmLDA) and correspondence LDA (cLDA) [2], which extend LDA to describe the joint distribution of image and caption words in multimedia documents. While adopting the same core assumption of LDA in allowing words in a document to be generated from multiple topics, the two extension models differ in their choice of latent variables shared between images and texts. Originally proposed in [1], mmLDA postulates that the mean topic proportion variable θ is the common factor that generates the two types of words. By forcing the topic proportion to be the same in the image and caption modalities, the 2 sets of multinomial topic parameters are assumed to correspond. However, the decision to share θ between the two data modalities implies that image and caption words become independent conditioned on θ. Indeed, without the plate notation as depicted in Fig 1(b), it is not hard to see that the association of mmLDA assumes that image and caption words are exchangeable, a key assumption which allows words from the two data modalities to potentially be generated from non-overlapping sets of hidden topics. As K becomes large, annotation experiments on the COREL dataset in [2] show that more than 50% of the caption words are assigned to topics that do not occur in the corresponding images, rendering knowledge of the image modality essentially half useless at predicting the missing caption words. The flexibility of mmLDA provides a good fit for the joint distribution of the data but a bad fit for a prediction task, hence a poor annotation performance.

To ensure that only the topics that actually generate the image words are used in generating the caption words, correspondence LDA (cLDA), shown in Fig 1(c), was designed so that the image is the primary modality and is generated first; conditioned on the topics used in the image, the caption words are then generated. More specifically, by forcing each caption word to directly share a hidden topic with a randomly selected image word, cLDA guarantees that the topics in the caption text are a subset of the topics that occur in the corresponding image. Note that while each caption word is restricted to be associated with one particular image region, the association of cLDA does allow the same image region to be associated with multiple caption words, accounting for the scenario where more than one caption word is used to describe a single object in the image.

Despite the good annotation performance reported in [2], the constrained association of cLDA proves too restrictive in practice. When dealing with annotation words that describe the scene as a whole, an association model that restricts each caption word to one image region can be very inaccurate. A more powerful association model should allow caption words to be influenced by topics from all image regions as well as by topics from a particular subset of regions. In addition, by sharing the discrete-valued latent topic variables directly between the two modalities, cLDA provides no mechanism to allow different numbers of topics to be used in modeling images and caption texts. In the next section, we describe a more flexible association model that addresses these limitations of cLDA while still maintaining good predictive power.

3.2. Topic-regression Multi-modal LDA (tr-mmLDA)

To get past the issues associated with sharing latent variables between data modalities, in this work we explore a novel approach to modeling correlations between data of different types. Instead of using a set of shared latent variables to explain correlations in the data, we propose a latent variable regression approach that correlates the latent variables of the two modalities; designed with the prediction task in mind, our framework thus allows latent variables of one type to be predicted from latent variables of another type. In the specific case of extending LDA to learn correlations between image and caption words, we propose a formulation that uses 2 separate topic models, one for each data modality, and introduces a regression module to correlate the 2 sets of hidden topics. To this end, we draw insights from several recent topic models [3, 11] and adopt a linear Gaussian regression module which takes the image topic proportion as its input and targets the hidden topic proportion of the annotation text as the response variable (in line with the task of predicting captions given an image). Our approach is similar in spirit to the way topic correlations are captured in the Independent Factor Topic Model [11], which explicitly models independent sources and linearly combines them to obtain correlated topic vectors. In our case, the hidden sources of topic correlations in the caption data correspond to the hidden topics of the image modality.

Our model, which we call topic-regression multi-modal Latent Dirichlet Allocation (tr-mmLDA), has the graphical representation shown in Fig 1(d). Given K image topics and L text topics, on the image side we have an LDA model with hidden topics Z = {z_1, z_2, . . . , z_N} and topic proportion θ. A real-valued topic proportion variable for the caption text, x ∈ R^L, is given by x = A z̄ + µ + n, where A is an L × K matrix of regression coefficients, µ is a vector of mean parameters, and n ∼ N(n; 0, Λ) is zero-mean uncorrelated Gaussian noise with diagonal precision matrix Λ. Instead of regressing over the mean topic proportion variable θ as done in mmLDA, we follow the formulation of supervised LDA [4, 14] and adopt the empirical topic frequencies z̄ = (1/N) ∑_n z_n as the covariates of our regression module, so that the topic proportion of the annotation data depends directly on the topics that actually occur in the image. Given x, the topic proportion of the caption text, η, is obtained deterministically via a softmax transformation of x, i.e. the probability of topic l is η_l = exp(x_l) / ∑_{k=1}^{L} exp(x_k).

Figure 1. Graphical model representations for (a) Latent Dirichlet Allocation (LDA) and 3 extensions of LDA for the task of image annotation: (b) multi-modal LDA (mmLDA), (c) correspondence LDA (cLDA), (d) topic-regression multi-modal LDA (tr-mmLDA).

The generative process of tr-mmLDA for an image-caption pair with N visual words and M caption words is as follows (a sampling sketch is given after the list):

• Draw an image topic proportion θ | α ∼ Dir(α).
• For each image word r_n, n ∈ {1, 2, . . . , N}:
  1. Draw a topic assignment z_n = k | θ ∼ Mult(θ_k).
  2. Draw a visual word r_n = t | z_n = k ∼ Mult(β^r_{kt}).
• Given the empirical image topic proportion z̄ = (1/N) ∑_{n=1}^{N} z_n, sample a real-valued topic proportion variable for the caption text: x | z̄, A, µ, Λ ∼ N(x; A z̄ + µ, Λ).
• Compute the caption topic proportion η_l = exp(x_l) / ∑_{k=1}^{L} exp(x_k).
• For each caption word w_m, m ∈ {1, 2, . . . , M}:
  1. Draw a topic assignment s_m = l | η ∼ Mult(η_l).
  2. Draw a caption word w_m = t | s_m = l ∼ Mult(β^w_{lt}).
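To make the generative process concrete, here is a minimal sampling sketch (ours, not the authors' code); the dimensions, vocabulary sizes, and parameter values are placeholders chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model dimensions and parameters (for illustration only).
K, L = 50, 30          # number of image topics and text topics
Tr, Tw = 256, 1000     # visual-word and caption-word vocabulary sizes
N, M = 45, 5           # visual words per image, caption words per caption
alpha = np.full(K, 0.1)                   # Dirichlet prior over image topics
beta_r = rng.dirichlet(np.ones(Tr), K)    # K x Tr visual-word topic multinomials
beta_w = rng.dirichlet(np.ones(Tw), L)    # L x Tw caption-word topic multinomials
A = rng.normal(0, 0.5, (L, K))            # regression coefficients
mu = np.zeros(L)                          # regression mean
Lam_inv = np.full(L, 0.1)                 # diagonal noise variances (Lambda^{-1})

# Image side: standard LDA.
theta = rng.dirichlet(alpha)              # image topic proportion
z = rng.choice(K, size=N, p=theta)        # topic assignments z_n
r = np.array([rng.choice(Tr, p=beta_r[k]) for k in z])  # visual words r_n

# Regression module: caption topic proportion from empirical image topics.
z_bar = np.bincount(z, minlength=K) / N   # empirical topic frequencies
x = A @ z_bar + mu + rng.normal(0, np.sqrt(Lam_inv))    # x = A z_bar + mu + n
eta = np.exp(x - x.max()); eta /= eta.sum()             # softmax -> eta

# Caption side: LDA with topic proportion eta.
s = rng.choice(L, size=M, p=eta)          # topic assignments s_m
w = np.array([rng.choice(Tw, p=beta_w[l]) for l in s])  # caption words w_m
print(r[:10], w)
```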

The formulation of tr-mmLDA can be seen as linking an LDA model for images and the IFTM model [11] for caption texts using a linear regression module, which is a flexible way of capturing correlations in the data. Under this framework, varying degrees of correlation can be captured by adapting the regression coefficients in the matrix A. When the 2 data modalities are independent, the coefficients in A are driven close to 0 and tr-mmLDA reduces to 2 independent topic models, one for each data modality. As the correlation between the 2 data types grows stronger, more regression coefficients in A take values further away from 0. Indeed, correspondence LDA (cLDA) can be derived as a special case of tr-mmLDA by fixing the regression coefficient matrix A to the identity matrix (assuming K = L) and setting the diagonal entries of the precision matrix Λ to ∞, which has the effect of forcing the empirical topic proportions in the 2 data modalities to be identical. Note that as the regression coefficient matrix A moves away from the identity (with more non-zero off-diagonal entries), tr-mmLDA allows the hidden topics from more than one image region to collectively exert influence on each caption word, which is a more accurate relationship for annotation words that describe the scene as a whole. Our framework for capturing correlations is thus more flexible than cLDA and allows more general forms of correlation to be modeled.

One important additional benefit of employing 2 sets of hidden topics is the flexibility to allow the number of topics in the two data modalities to differ. Indeed, as we shall show in the experimental results, the different statistical structures of image features and caption texts often result in different optimal numbers of topics when a topic model is fit to each modality separately. By restricting the number of topics to be the same, we might end up with a K that overfits the data in one modality while underfitting the other. Our regression approach to modeling correlations gives the flexibility to explore the optimal number of topics for each data modality separately.

4. Variational EM

To learn the parameters of tr-mmLDA that maximize the likelihood of the training data, we employ the Expectation Maximization (EM) framework, which iteratively estimates the model parameters of latent variable models. Using Jensen's inequality, the E step of the EM algorithm derives an auxiliary function that tightly lower-bounds the data likelihood function, allowing a simpler optimization to be performed in the M step. For most probabilistic models involving latent variables, however, computing the exact posterior distribution over latent variables needed to obtain a tight likelihood lower bound in the E step is computationally intractable. In variational EM, we replace the exact inference in the E step with an approximate inference algorithm. The variational EM framework thus alternates between computing a strict likelihood lower bound in the variational E step and maximizing the bound to obtain a new parameter estimate in the M step.

4.1. Variational Inference

To infer the posterior over hidden variables, we begin with the expression of the true log-likelihood for an image-caption pair {W, R}:

log p(W, R | Ψ) ≥ ∫ q(Z, θ, x, S) { log p(W, R, Z, θ, x, S | Ψ) − log q(Z, θ, x, S) } dZ dθ dx dS = F,   (1)

where Ψ denotes the model parameters of tr-mmLDA, {β^r, β^w, γ, A, µ, Λ}. Using the concavity of the log function, we apply Jensen's inequality to derive the lower bound of the log-likelihood in (1). Equality holds when the posterior over the hidden variables q(Z, θ, x, S) equals the true posterior p(Z, θ, x, S | W, R). As in LDA, computing the exact joint posterior is computationally intractable; we employ a mean-field variational approximation and approximate the joint posterior with a variational posterior in factorized form:

p(Z, θ, x, S | W, R) ≈ ∏_n q(z_n) ∏_m q(s_m) q(θ) q(x).

With such a posterior, the RHS of (1) becomes a strict lower bound on the data likelihood. The goal of the variational E step is then to find, within the family of factorized distributions, the variational posterior that maximizes the lower bound. Writing out the likelihood lower bound F on the right-hand side of (1), we obtain:

F = ∑_n ( E[log p(r_n | z_n, β^r)] + E[log p(z_n | θ)] ) + E[log p(θ | α)]
  + ∑_m ( E[log p(w_m | s_m, β^w)] + E[log p(s_m | x)] )
  + E[log p(x | z̄, A, Λ, µ)] + H(q(Z)) + H(q(θ)) + H(q(S)) + H(q(x)),   (2)

where the expectations are taken with respect to the factorized posteriors and H(p(x)) denotes the entropy of p(x). The fifth expectation term in (2),

E_q(x)[log p(s_m = l | x)] = E_q(x)[ x_l − log( ∑_j e^{x_j} ) ],   (3)

contains a normalization term from the softmax operation that is difficult to evaluate in closed form, regardless of the form of the variational posterior q(x). We make use of convex duality and represent a convex function (here − log(·)) as a point-wise supremum of linear functions. More specifically, the log-normalization term is replaced with an adjustable lower bound parameterized by a convex variational parameter ξ:

x_l − log ∑_j e^{x_j} ≥ x_l − log ξ − (1/ξ) ∑_j e^{x_j} + 1.   (4)
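As a quick numerical sanity check (ours, not from the paper), the bound in (4) can be verified for a random x and any ξ > 0, with equality at ξ = ∑_j e^{x_j}:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=5)          # arbitrary real-valued vector x
l = 2                           # any component index
lhs = x[l] - np.log(np.exp(x).sum())

for xi in [0.5, 1.0, np.exp(x).sum(), 10.0]:     # arbitrary xi > 0
    rhs = x[l] - np.log(xi) - np.exp(x).sum() / xi + 1
    assert lhs >= rhs - 1e-12                    # bound (4) holds
    if np.isclose(xi, np.exp(x).sum()):
        assert np.isclose(lhs, rhs)              # tight at xi = sum_j e^{x_j}
```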

Under the diagonality assumption on Λ and using the convex variational bound of the log-normalizer term in (4), the free-form maximization of F w.r.t. q(x) results in a variational posterior q(x) that automatically takes a factorized form, q(x) = ∏_l q(x_l). However, the q(x_l) obtained by free-form maximization is not in the form of a distribution that we recognize. We therefore approximate q(x_l) by a Gaussian distribution, q(x_l) ∼ N(x_l; x̄_l, γ_l^{-1}), with mean parameter x̄_l and precision γ_l. To simplify the notation, we denote q(z_n = k) by φ_{nk} and q(s_m = l) by η_{ml}. Since the prior p(θ | α) is a Dirichlet distribution, which is the conjugate prior to the multinomial distribution, the posterior q(θ) is also a Dirichlet distribution; we denote its parameters by α̃, i.e. q(θ) ∼ Dir(α̃). By taking expectations with respect to the variational posterior, we can write the terms of the lower bound F explicitly as functions of the variational parameters. The first two terms of F in (2) are given by

∑_{n,k} φ_{nk} ∑_t 1(r_n = t) log β^r_{kt} + ∑_{n,k} φ_{nk} E_q(θ)[log θ_k],   (5)

where E_q(θ)[log θ_k] = Ψ(α̃_k) − Ψ(∑_j α̃_j), with Ψ(x) denoting the first derivative of the log-gamma function, ∂ log Γ(x)/∂x. The third term in (2) can be written as

log Γ(∑_j α_j) − ∑_j log Γ(α_j) + ∑_j (α_j − 1) E_q(θ)[log θ_j].

By evaluating the expectation with respect to the Gaussian posterior q(x_j) ∼ N(x_j; x̄_j, γ_j^{-1}), we have E_q(x_j)[e^{x_j}] = e^{x̄_j + 0.5 γ_j^{-1}}, and the fourth and fifth expectation terms in (2) can be written as

∑_{m,l} η_{ml} ∑_t 1(w_m = t) log β^w_{lt} + ∑_{m,l} η_{ml} x̄_l − M log ξ − (M/ξ) ∑_j e^{x̄_j + 0.5 γ_j^{-1}} + M.   (6)

Making use of the expectation E[xᵀ Λ x] = x̄ᵀ Λ x̄ + tr(Λ Γ^{-1}), the sixth term in (2) is given by

−(1/2) ( (x̄ − µ)ᵀ Λ (x̄ − µ) + tr(Λ Γ^{-1}) − 2 (x̄ − µ)ᵀ Λ A E[z̄] + E[z̄ᵀ Aᵀ Λ A z̄] ),   (7)

where E[z̄] = (1/N) ∑_n φ_n and E[z̄ᵀ Aᵀ Λ A z̄] evaluates to tr( Aᵀ Λ A (1/N²) ( ∑_n diag(φ_n) + ∑_n φ_n ∑_{m≠n} φ_mᵀ ) ).

To update these variational parameters, we employ a coordinate ascent algorithm in which we update one set of parameters while holding the rest fixed. Computing the gradient of F w.r.t. φ_n and setting it to 0, we obtain the following update rule for φ_n:

log φ_n = ∑_t 1(r_n = t) log β^r_{·t} + E[log θ] + (1/N) Aᵀ Λ (x̄ − µ) − (1/(2N²)) diag(Aᵀ Λ A) − (1/N²) Aᵀ Λ A ∑_{m≠n} φ_m.   (8)

The variational parameters η_{ml}, ξ, and α̃_k can be similarly re-estimated, resulting in the following closed-form updates:

log η_{ml} = ∑_t 1(w_m = t) log β^w_{lt} + x̄_l,   (9)

ξ = ∑_j e^{x̄_j + 0.5 γ_j^{-1}},   (10)

α̃_k = ∑_n φ_{nk} + α_k.   (11)
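A sketch (ours) of the closed-form coordinate-ascent updates (9)-(11) for one document, assuming the current values of x̄, γ^{-1}, β^w, φ, and the Dirichlet prior α are available as arrays:

```python
import numpy as np

def update_eta_xi_alpha_tilde(word_ids, log_beta_w, x_bar, gamma_inv, phi, alpha):
    """Closed-form updates (9)-(11) for one document.
    word_ids:   (M,)    caption-word indices w_m
    log_beta_w: (L, Tw) log caption-topic multinomials
    x_bar:      (L,)    posterior means of x
    gamma_inv:  (L,)    posterior variances of x
    phi:        (N, K)  visual-word responsibilities
    alpha:      (K,)    Dirichlet prior
    """
    # eq. (9): log eta_{ml} = log beta^w_{l, w_m} + x_bar_l, then normalize over l.
    log_eta = log_beta_w[:, word_ids].T + x_bar          # (M, L)
    eta = np.exp(log_eta - log_eta.max(axis=1, keepdims=True))
    eta /= eta.sum(axis=1, keepdims=True)
    # eq. (10): xi = sum_j exp(x_bar_j + 0.5 * gamma_j^{-1}).
    xi = np.exp(x_bar + 0.5 * gamma_inv).sum()
    # eq. (11): alpha_tilde_k = sum_n phi_{nk} + alpha_k.
    alpha_tilde = phi.sum(axis=0) + alpha
    return eta, xi, alpha_tilde
```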

To update the parameters of the variational posterior q(x_l) ∼ N(x_l; x̄_l, γ_l^{-1}), we differentiate F w.r.t. x̄_l and obtain the following expression for the gradient:

∂F/∂x̄_l = ∑_m η_{ml} − (M/ξ) e^{x̄_l + 0.5 γ_l^{-1}} − λ_l ( x̄_l − µ_l − a_lᵀ E[z̄] ),   (12)

where λ_l denotes the l-th diagonal entry of Λ and a_lᵀ the l-th row of A. The value of x̄_l that makes the gradient vanish cannot be obtained in closed form. We employ a Newton algorithm that finds a zero-crossing solution of (12) efficiently. First, substituting t_l = (∑_m η_{ml})/λ_l − x̄_l + a_lᵀ E[z̄] + µ_l, we can rewrite (12), set to zero, as

t_l e^{t_l} = (M/(ξ λ_l)) e^{0.5 γ_l^{-1}} · e^{(∑_m η_{ml})/λ_l + a_lᵀ E[z̄] + µ_l} = u_l.   (13)

The Newton update rule for t_l is thus

t_l^{new} = t_l^{old} + ( u_l e^{−t_l^{old}} − t_l^{old} ) / ( t_l^{old} + 1 ).   (14)

Starting from a good initial solution, the Newton algorithm converges in just a few iterations. The precision parameter γ_l can be similarly updated using a fast Newton algorithm, with the gradient given by

∂F/∂γ_l^{-1} = −λ_l/2 − (M/(2ξ)) e^{x̄_l} · e^{γ_l^{-1}/2} + 1/(2 γ_l^{-1}) = 0.   (15)
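A minimal sketch (ours, not the authors' code) of the Newton iteration in (14) for solving t e^t = u; the initial guess and tolerance are arbitrary choices.

```python
import numpy as np

def solve_t(u, t0=1.0, tol=1e-10, max_iter=50):
    """Solve t * exp(t) = u for t using the Newton update in (14)."""
    t = t0
    for _ in range(max_iter):
        t_new = t + (u * np.exp(-t) - t) / (t + 1.0)   # eq. (14)
        if abs(t_new - t) < tol:
            return t_new
        t = t_new
    return t

# Example: u = 2 * e^2 should give t = 2.
u = 2.0 * np.exp(2.0)
t = solve_t(u)
print(t, t * np.exp(t), u)   # t is approximately 2; residual check
```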

4.2. Parameter Estimation

Closed-form parameter updates can be obtained for all of our model parameters. Again, taking the derivative of the lower-bound objective F w.r.t. the regression parameters and setting it to 0, the re-estimation equations can be written as follows:

A = ( ∑_d (x̄_d − µ) E[z̄_d]ᵀ ) ( ∑_d E[z̄_d z̄_dᵀ] )^{-1},   (16)

µ = (1/D) ∑_d ( x̄_d − A E[z̄_d] ),   (17)

Λ^{-1} = (1/D) ∑_d ( (x̄_d − µ)(x̄_d − µ)ᵀ + Γ_d^{-1} − A E[z̄_d] x̄_dᵀ ).   (18)

The multinomial parameters for each topic can be re-estimated with the following update rules:

β^w_{lt} = ( ∑_{d=1}^{D} ∑_m η^d_{ml} 1(w^d_m = t) ) / ( ∑_t ∑_{d=1}^{D} ∑_m η^d_{ml} 1(w^d_m = t) ),   (19)

β^r_{kt} = ( ∑_{d=1}^{D} ∑_n φ^d_{nk} 1(r^d_n = t) ) / ( ∑_t ∑_{d=1}^{D} ∑_n φ^d_{nk} 1(r^d_n = t) ).   (20)
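The M-step updates (16)-(20) are straightforward to vectorize. The sketch below is ours, with placeholder array shapes and two simplifications noted in the comments; it assumes the variational E step has already produced x̄_d, the diagonal of Γ_d^{-1}, E[z̄_d], and the per-word responsibilities for every training document.

```python
import numpy as np

def m_step(x_bar, gamma_inv, z_bar, resp_w, onehot_w, resp_r, onehot_r):
    """One (simplified) M step of tr-mmLDA.
    x_bar:     D x L   posterior means of x per document
    gamma_inv: D x L   posterior variances of x per document (diag of Gamma_d^{-1})
    z_bar:     D x K   E[z_bar_d] per document
    resp_w:    list of (M_d x L) caption-word responsibilities eta
    onehot_w:  list of (M_d x Tw) one-hot caption words
    resp_r:    list of (N_d x K) visual-word responsibilities phi
    onehot_r:  list of (N_d x Tr) one-hot visual words
    """
    # Regression parameters, eqs. (16)-(17). For brevity this sketch uses
    # E[z_bar] E[z_bar]^T in place of the full second moment E[z_bar z_bar^T]
    # and alternates the coupled mu / A updates once.
    mu = x_bar.mean(axis=0)
    A = (x_bar - mu).T @ z_bar @ np.linalg.inv(z_bar.T @ z_bar)      # eq. (16)
    mu = (x_bar - z_bar @ A.T).mean(axis=0)                          # eq. (17)
    # Noise covariance, eq. (18), keeping only the diagonal since Lambda is diagonal.
    resid = x_bar - mu
    Lam_inv = (resid**2 + gamma_inv - (z_bar @ A.T) * x_bar).mean(axis=0)
    # Topic multinomials, eqs. (19)-(20): expected word counts, normalized over words.
    beta_w = sum(rw.T @ ow for rw, ow in zip(resp_w, onehot_w))      # L x Tw
    beta_w /= beta_w.sum(axis=1, keepdims=True)
    beta_r = sum(rr.T @ orh for rr, orh in zip(resp_r, onehot_r))    # K x Tr
    beta_r /= beta_r.sum(axis=1, keepdims=True)
    return A, mu, Lam_inv, beta_w, beta_r
```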

5. Experimental Results

We test our model on 2 standard image annotation datasets: COREL and LabelMe. The 5,000-image subset of the COREL dataset is the same subset used in the annotation experiments in [6]. This subset contains 50 classes of images, with 100 images per class. Each image in the collection is reduced to size 117 × 181 (or 181 × 117). 4,500 images are used for training (90 per class) and 500 for testing (10 per class). Each image is treated as a collection of 20 × 20 patches obtained by sliding a window at a 20-pixel interval, resulting in 45 patches per image.

For the LabelMe dataset, following the work in [14], we use the 8-category subset which contains 2,687 images from the classes 'coast', 'forest', 'highway', 'inside city', 'mountain', 'open country', 'street', and 'tall building'. 80% of the data in each class (2,147 images total) are used for training and 20% for testing (540 total). Each image is of size 256 × 256. Again we use 20 × 20 patches at a 20-pixel interval, resulting in 144 patches per image.

Following the work in [7], we use the 128-dim SIFT descriptor computed on 20 × 20 gray-scale patches. In addition, we follow the work in [13] and add 36-dim robust color descriptors, which have been designed to complement SIFT descriptors extracted from gray-scale patches. We run k-means on the collection of 164-dim features to learn a dictionary of Tr = 256 visual words. To account for the different statistics of the two image datasets, the two visual-word dictionaries are learned separately.

5.1. Caption Perplexity

To measure the quality of the annotations predicted by the models, we follow [2] and adopt caption perplexity as the performance measure. The essential quantity to compute is the conditional probability of caption words given a test image, p(w|R), which is computed with respect to the variational factorized posterior. As seen in the definition in (21), perplexity is the inverse of the geometric mean per-word likelihood, so a model that gives higher conditional likelihood leads to lower perplexity (the lower the number, the better the model):

Perp = exp( − ( ∑_{d=1}^{D} ∑_{m=1}^{M_d} log p(w_m | R_d) ) / ∑_d M_d ).   (21)
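A small sketch (ours) of how caption perplexity in (21) would be computed once a model provides per-word conditional probabilities p(w_m | R_d) for each test document:

```python
import numpy as np

def caption_perplexity(word_probs_per_doc):
    """word_probs_per_doc: list over test documents, each an array holding
    p(w_m | R_d) for the M_d caption words of that document (eq. 21)."""
    total_log_prob = sum(np.log(p).sum() for p in word_probs_per_doc)
    total_words = sum(len(p) for p in word_probs_per_doc)
    return np.exp(-total_log_prob / total_words)

# Toy example with two captioned test images.
print(caption_perplexity([np.array([0.02, 0.1, 0.05]), np.array([0.2, 0.01])]))
```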

Fig 2(a) shows caption perplexity on the COREL dataset as a function of the number of text topics L, with K fixed at 50 and 100. In general, perplexity decreases as L increases, and we do not observe problems with over-fitting. Shown as a baseline (the magenta curve) is the average perplexity of around 135 words obtained when the regression coefficient matrix A is set to 0, i.e. when the 2 data modalities are treated as independent. When A is properly learned from the data, perplexity is reduced on average by over 50 words. A similar pattern is observed in the plot of perplexity as a function of K while holding L fixed in Fig 2(b). tr-mmLDA appears quite robust to over-fitting as model complexity increases.

Figure 2. COREL dataset: (a) caption perplexity as a function of L; (b) caption perplexity as a function of K.

We next compare the predictive performance of cLDA and tr-mmLDA. To see whether over-fitting is an issue for cLDA, we observe how perplexity (as a function of K) changes as the number of patches per image, N, changes. To obtain images with varying values of N, we sub-sample N patches from all the patches to represent each image. As shown in Fig 3(a), for a small value of N = 20, cLDA seriously overfits the data: perplexity increases as K increases. With larger N, overfitting becomes less of an issue, but tr-mmLDA still outperforms cLDA.

The severe overfitting in cLDA can be directly attributed to the restrictive association between the 2 modalities. Since cLDA enforces that the topics used in the caption modality must be a subset of the topics occurring in the image modality, a small value of N implies that the words in each document will be assigned to only a small number of topics (clusters). For a large value of K, a large number of documents is then required to avoid empty clusters (topics). With fewer than 2,200 training documents in the LabelMe dataset, the topic parameters for caption texts are therefore estimated poorly. For the COREL dataset, with its larger training set (4,500 documents), the problem of over-fitting does not become severe until we reduce N down to 10, as shown in Fig 3(b). Since tr-mmLDA imposes no such restriction on the topics used for the caption modality, over-fitting is less of an issue at the same dataset size.

Figure 3. Perplexity as a function of K as N increases for (a) the LabelMe and (b) the COREL dataset.

5.2. Example Annotation and Topics

We compare examples of caption topics from the LabelMe dataset learned using cLDA and tr-mmLDA. Following previous topic models, we examine the multinomial parameters and use the 10 most probable caption words under each topic to represent it. We learn 50 topics using cLDA and find that around 50% of the learned topics have the word car among their top 10 caption words, while only 4 out of 50 topics learned using tr-mmLDA contain the word car (see Table 1). The car example illustrates that topics learned using cLDA tend to contain more general terms, while tr-mmLDA uncovers the 4 contexts in which car appears in the dataset: the classes highway, street, inside city, and tall building. Fig 5 shows examples of the higher-quality annotations inferred by our model (compared to cLDA), a consequence of the superior topic parameters learned.

Figure 4. Precision-recall curves for 3 single-word queries: 'wave', 'pole', 'flower', comparing cLDA (red) and tr-mmLDA (blue).

Table 1. Example topics learned using cLDA (top panel) and tr-mmLDA (bottom panel).

cLDA:
Topic 1: car, sky, road, tree, building, mountain, trees
Topic 2: building, window, car, person, buildings, skyscraper
Topic 5: sky, road, car, fence, mountain, trees, sign
Topic 6: car, building, buildings, person, sidewalk, cars, walking
Topic 14: window, building, car, door, person, pane, road
Topic 15: sky, tree, building, mountain, car, road, trees
Topic 16: buildings, car, sky, tree, cars, building, road
Topic 17: car, road, sign, trees, street light, highway
Topic 20: car, road, highway, freeway, sign, trees, streetlight
Topic 26: building, car, buildings, sky, mountain, sidewalk, road
Topic 27: car, building, buildings, person, sidewalk, road, tree

tr-mmLDA:
Topic 15: car, road, sign, trees, highway, freeway, sky
Topic 32: car, buildings, building, sidewalk, cars, road, sky
Topic 39: car, road, car back, car top back, van, car right, car left
Topic 48: balcony, shop, building, door, car, terrace, light

5.3. Text-based Retrieval

We can also use the annotation models of tr-mmLDA and cLDA to perform image retrieval on a database of un-annotated images using word queries. This retrieval method is called text-based retrieval, in contrast to content-based retrieval where queries are given in the form of examples. Given a single-word query, we perform retrieval by ranking the test images according to the probability that each image will be annotated with the query word. More specifically, the score used in ranking is p(w|Rtest), which can be computed using the variational posterior inferred for each test document. Table 2 shows the top 4 images from the LabelMe dataset retrieved, rank-ordered by the score p(w|R), using tr-mmLDA with the query words 'hill' and 'buildings'. Figure 4 shows precision-recall curves for 3 single-word queries, comparing the rank order of the images retrieved using cLDA and tr-mmLDA. Our model generally yields higher precision at the same recall values for all 3 queries and gives better overall retrieval performance.

Table 2. Top 4 images (with no captions) retrieved using single-word queries. The queries hill and buildings are used for the images in the top and bottom rows, respectively.
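A sketch (ours, with a hypothetical scoring function) of the text-based retrieval step: given per-image conditional word probabilities p(w | R) for the query word, rank the test images by that score.

```python
import numpy as np

def rank_images(query_word_id, cond_word_probs, top_k=4):
    """cond_word_probs: array of shape (num_test_images, Tw), where row d
    holds p(w | R_d) for every caption word under the fitted model.
    Returns the indices of the top_k images for the query word."""
    scores = cond_word_probs[:, query_word_id]
    return np.argsort(-scores)[:top_k]

# Toy example: 5 test images, a 3-word caption vocabulary.
probs = np.array([[0.10, 0.70, 0.20],
                  [0.50, 0.30, 0.20],
                  [0.05, 0.90, 0.05],
                  [0.30, 0.40, 0.30],
                  [0.60, 0.20, 0.20]])
print(rank_images(query_word_id=1, cond_word_probs=probs))   # -> [2 0 3 1]
```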

Figure 5. Examples of predicted annotations on the LabelMe dataset:
- Groundtruth: water, sea, sky, mountain, sunset, sun. tr-mmLDA: sky, clouds, water, sea, sunset, mountain. cLDA: sky, water, sea, mountain, building, sand.
- Groundtruth: sky, trees, field. tr-mmLDA: field, sky, mountain, grass, trees. cLDA: sky, trees, mountain, tree, water.
- Groundtruth: tree, trunk, trees, ground, grass, path. tr-mmLDA: tree, trees, trunk, sky, ground, path, grass, forest. cLDA: tree, trees, trunk, sky, mountain, grass, ground.
- Groundtruth: water, sky, mountain, river, snow, snowy, glacier. tr-mmLDA: mountain, sky, rocky, snowy, fog, clouds, snow, glacier. cLDA: mountain, sky, trees, snowy, field, tree, grass, water.
- Groundtruth: building, road, pole, door, pane. tr-mmLDA: window, building, door, car, road, sidewalk. cLDA: window, building, skyscraper, door.
- Groundtruth: sky, building, tree, skyscraper. tr-mmLDA: building, skyscraper, buildings, sky, road. cLDA: building, skyscraper, buildings, sky, trees.

6. Conclusion

In this work, we propose topic-regression multi-modal LDA, a novel statistical topic model for the task of image annotation. The main novelty of our model is the latent variable regression approach used to capture correlations between image features and annotation texts. Instead of sharing the hidden topics directly between the 2 data modalities as in cLDA, we propose a formulation that keeps 2 separate sets of hidden topics and incorporates a linear regression module to correlate them. Our approach can capture varying degrees of correlation and allows the number of topics in the image and annotation text to differ. Experimental results on image annotation show that the association model of tr-mmLDA has an edge over correspondence LDA in predicting annotations, as seen in the superior annotation and retrieval quality.

References

[1] K. Barnard, P. Duygulu, N. de Freitas, D. Forsyth, D. M. Blei, and M. I. Jordan. Matching words and pictures. Journal of Machine Learning Research, 3:1107–1135, 2003.


[2] D. M. Blei and M. I. Jordan. Modeling annotated data. In ACM SIGIR, 2003.

[3] D. M. Blei and J. D. Lafferty. A correlated topic model of science. The Annals of Applied Statistics, 1(1):17–35, 2007.

[4] D. M. Blei and J. D. McAuliffe. Supervised topic models. In Neural Information Processing Systems (NIPS), 2007.

[5] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

[6] G. Carneiro and N. Vasconcelos. Formulating semantic image annotation as a supervised learning problem. In International Conference on Computer Vision and Pattern Recognition (CVPR), 2005.

[7] L. Fei-Fei and P. Perona. A Bayesian hierarchical model for learning natural scene categories. In International Conference on Computer Vision and Pattern Recognition (CVPR), 2005.

[8] S. Feng, R. Manmatha, and V. Lavrenko. Multiple Bernoulli relevance models for image and video annotation. In International Conference on Computer Vision and Pattern Recognition (CVPR), 2004.

[9] V. Lavrenko, R. Manmatha, and J. Jeon. A model for learning the semantics of pictures. In Advances in Neural Information Processing Systems (NIPS), 2003.

[10] J. Li and J. Z. Wang. Automatic linguistic indexing of pictures by a statistical modeling approach. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(10), 2003.

[11] D. Putthividhya, H. Attias, and S. Nagarajan. Independent factor topic models. In International Conference on Machine Learning (ICML), 2009.

[12] P. Quelhas, F. Monay, J.-M. Odobez, D. Gatica-Perez, T. Tuytelaars, and L. V. Gool. Modeling scenes with local descriptors and latent aspects. In International Conference on Computer Vision (ICCV), 2005.

[13] J. van de Weijer and C. Schmid. Coloring local feature extraction. In ECCV, 2006.

[14] C. Wang, D. M. Blei, and L. Fei-Fei. Simultaneous image classification and annotation. In International Conference on Computer Vision and Pattern Recognition (CVPR), 2009.