Nonparametric Bayesian Multitask Collaborative Filtering

Sotirios P. Chatzis
Department of Electrical Engineering, Computer Engineering and Informatics
Cyprus University of Technology
33 Saripolou Str., Limassol 3603, Cyprus

[email protected]

ABSTRACT

The dramatic rate at which new digital content becomes available has brought collaborative filtering systems to the epicenter of computer science research in the last decade. One of the greatest challenges collaborative filtering systems are confronted with is the data sparsity problem: users typically rate only very few items; thus, the available historical data are not adequate for effective prediction. To alleviate these issues, in this paper we propose a novel multitask collaborative filtering approach. Our approach is based on a coupled latent factor model of the users' rating functions, which gives rise to an agile information-sharing mechanism that extracts much richer task-correlation information than existing approaches. Formulation of our method is based on concepts from the field of Bayesian nonparametrics, specifically Indian Buffet Process priors, which allow for data-driven determination of the optimal number of underlying latent features (item characteristics and user traits) assumed in the context of the model. We experiment on several real-world datasets, demonstrating both the efficacy of our method and its superiority over existing approaches.

Categories and Subject Descriptors

H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval; I.2 [Artificial Intelligence]: Learning

Keywords

Collaborative filtering, Indian Buffet Process, multitask learning

1. INTRODUCTION

The continuous explosion in the availability of content through the Internet renders content search a challenging task of ever-increasing difficulty. Ratings-based collaborative filtering (CF) systems have served as an effective approach to the problem of discovering items of interest [4]. They are based on the intuitive idea that the preferences of a user can be inferred by exploiting past ratings of that user, as well as of users with related behavior patterns. This thriving subfield of machine learning has been growing in popularity since the late 1990s with the spread of online services that use recommender systems, such as Amazon, Yahoo! Music, MovieLens, Netflix, and CiteULike.

Existing CF methods can be classified into three main categories: memory-based methods, model-based methods, and hybrid methods. Memory-based systems generate predictions by exploiting the original ratings matrix: item rating prediction for a target user comprises determining a subset of the system's users with ratings similar to the target user (the target user's neighbors), and computing a weighted average of the ratings these neighbors provided for each item. This is the earliest kind of algorithm used in CF systems, and it still forms the basis of most of the filtering functionality performed on popular websites such as Amazon [10, 1]. A significant drawback of such approaches is that, given the high sparsity of the ratings matrix, the neighborhood of a target user may contain only few, if any, ratings for a given item. Moreover, such approaches require availability of the whole ratings matrix to perform prediction in real time. This may become too wasteful in terms of computational efficiency if a large number of users/items are registered with the system, thus limiting its scalability.

Model-based CF methods attempt to ameliorate these issues by using the available ratings data to construct a model which expresses the rating decision function of the users. As such, model-based CF approaches entail an off-line training procedure. Then, given the trained model, prediction generation becomes extremely efficient, thus affording scalable real-time operation. Among all model-based CF methods, matrix factorization (MF)-based methods have perhaps been the most popular ones in recent years [5, 29, 25, 22, 26, 27, 15]. These methods assume that the registered users and items are related to sets of features that lie in some low-dimensional latent space; prediction is performed based on these latent features assigned to each user and item. More recently, several authors have considered the application of alternative Bayesian latent factor models instead of matrix factorization (e.g., [9, 19]). As has been shown, such Bayesian latent factor approaches allow for deriving scalable and robust model-based recommender systems with good performance on large datasets.

Finally, a number of hybrid CF techniques have also been proposed in the research literature, combining memory-based methods with model-based methods, or utilizing additional information such as content information. Such approaches aim to improve performance even further by combining the strengths of different paradigms; characteristic examples are the works of [13, 14, 12].

Despite these advances and the great success CF systems have met with in the last decade, the data sparsity problem [30], i.e., the difficulties resulting from the extremely sparse nature of the ratings matrix in real-world systems, continues to pose a great challenge to existing CF systems. In an effort to better mitigate the effect of data sparsity on the performance of CF systems, a few researchers have recently considered combining information from multiple collaborative filtering tasks pertaining to different domains into a single multitask prediction system. The main notion underlying these efforts is the assumption that, by jointly modeling a collection of rating prediction tasks arising from multiple domains, CF systems can exploit the correlation between rating prediction problems in different domains to alleviate the effect of data sparsity. Multitask CF (MCF) systems are capable of extracting shared preference patterns among similar domains, the exploitation of which improves the obtained rating prediction performance in all the jointly modeled domains.

Existing approaches in the area of MCF mostly rely on matrix factorization techniques [34, 28], properly adapted so as to automatically derive and share correlated information across different domains. In this work, we follow a different approach: we introduce a novel latent factor model to perform MCF, leveraging the strengths of Bayesian nonparametrics. Specifically, the rating function of the users in each domain is expressed by means of a two-component latent factor model, with the latent factors expressing the assignment of the users and the available items to latent classes (assignment of latent features), and the associated weights (factor loadings) expressing the joint user/item biases of the model, which are obtained through model training.

We impose appropriate nonparametric Bayesian priors over the considered latent features, namely Indian Buffet Process (IBP) priors. Such a nonparametric Bayesian construction allows for assuming that each user or item may be associated with multiple latent features. It also allows for the number of latent features to be automatically discovered in the context of an efficient Bayesian inference scheme. Knowledge sharing between the jointly modeled tasks is effectively performed by imposing a suitable matrix-variate prior over the domain-specific model weights (factor loadings) across all the domains, and deriving the corresponding posterior distributions in the context of the inference algorithms of our model. We evaluate the efficacy of our approach by experimenting with several real-world datasets, and compare its performance to the MCF approaches that currently exist in the literature, as well as to two related baseline CF algorithms, one MF-based and one based on a Bayesian latent factor model.

The remainder of this paper is organized as follows: In Section 2, we briefly provide the background of our approach. Specifically, we first review existing MCF algorithms; subsequently, we briefly present the nonparametric Bayesian prior imposed over the latent variables of our model, namely the IBP prior. In Section 3, we introduce our proposed model, and derive its inference and prediction algorithms using a truncated variational Bayesian approach. In Section 4, we experimentally evaluate our method using two real-world datasets. Finally, in Section 5 we summarize our results and conclude this paper.

2. METHODOLOGICAL BACKGROUND

2.1 Existing MCF approaches

Previous work on MCF systems is rather limited. An approach related to our work is the multi-domain CF method of [34]. In that paper, the authors propose a probabilistic framework which uses probabilistic matrix factorization (PMF) [26] to model the rating prediction problem in each domain, and allows the extracted knowledge to be adaptively shared across different domains by automatically learning the correlation between domains. In addition to that work, [28] proposed a collective matrix factorization (CMF) approach, closely related to [34]; specifically, the work of [34] can be shown to reduce to CMF, thus incorporating CMF as a special case, by restricting the latent user feature matrices to be identical across domains.

In addition, an older related approach, presented in [17], proposed a rating-matrix generative model to perform multitask collaborative filtering. This approach establishes the relatedness across multiple rating matrices by finding a shared implicit cluster-level rating matrix, which is next extended to a cluster-level rating model. On this basis, the rating matrix of any related task can be viewed as drawing a set of users and items from a user-item joint mixture model, as well as drawing the corresponding ratings from the cluster-level rating model. A major component of the work of [17] is the assumption that some tasks cluster together in a latent ratings subspace, and hence share common preference patterns, while others do not. However, a major drawback of this method is the complete lack of a mechanism that quantifies how much two tasks are related and adapts knowledge sharing based on this information. On the contrary, the existing PMF-based approaches entail mechanisms that achieve this level of flexibility in knowledge sharing.

Furthermore, another effort toward effective knowledge sharing between tasks, as a means of mitigating data sparsity, is the work of [33]. That paper essentially builds upon previous work on Bayesian matrix factorization [24], where the user and item factor matrices are considered to be drawn from a Dirichlet process mixture prior [21], instead of simple multivariate Gaussians. The method of [33] achieves knowledge transfer by sharing model parameters among different tasks, and is fully nonparametric in that the dimension of the latent feature vectors is automatically determined. Inference is performed using a variational Bayesian algorithm, similar to our approach, which is much faster than the Gibbs sampling used by most related Bayesian methods. A significant downside of this approach is that it relies on parameter tying across tasks to perform knowledge sharing. As such, it does not allow for inferring task relatedness and adapting the amount of knowledge sharing between any pair of tasks on the basis of their estimated level of relatedness. In contrast, our approach is explicitly designed to allow for such an adaptive knowledge sharing capacity.

Finally, several researchers have recently considered a related problem, namely transfer learning for collaborative filtering systems (e.g., [18, 23, 20, 16]). The aim of such systems is to improve predictive performance on a new rating problem with very sparse data with the help of other rating problems that have denser rating data. Such an approach is very helpful when dealing with domains where too few ratings are available for all items (e.g., in addressing the cold-start problem) [30]. However, the problem we consider in this paper, i.e., jointly addressing the sparsity problem in multiple tasks, is quite different from the knowledge transfer application scenario, which aims at adaptively transferring existing knowledge only to new tasks, without affecting the predictive mechanism pertaining to the existing tasks.

2.2 The Indian Buffet Process prior

In many unsupervised learning problems it is necessary to derive a set of latent variables given a set of observations. A characteristic example is collaborative filtering: considering that users are associated with possibly multiple latent traits, and items are associated with possibly multiple latent features, we may wish to identify these sets of latent properties and determine which users/items have each property. Unfortunately, most traditional machine learning approaches require the number of latent features as an input. In such cases, one usually has to resort to a model selection technique to come up with a trade-off between model complexity and model fit.

A solution to this problem is offered in the context of Bayesian nonparametrics. Nonparametric Bayesian approaches treat the number of latent features as a random quantity to be determined as part of the posterior inference procedure. The most common nonparametric prior for latent feature models is the Indian Buffet Process (IBP) [8]. The IBP is a prior on infinite binary matrices that allows us to simultaneously infer which features influence a set of observations and how many features there are. The form of the prior ensures that only a finite number of features will be present in any finite set of observations, but more features may appear as more observations are received.

Let us consider a set of N objects that may be assigned to a total of K features. Let Z = [z_{nk}]_{n,k=1}^{N,K} be an N \times K matrix of assignment variables, with z_{nk} = 1 if the nth object is assigned to the kth feature (multiple z_{nk}'s may be equal to 1 for a given object n), and z_{nk} = 0 otherwise. The IBP imposes a prior over [Z], a canonical form of Z that is invariant to the ordering of the features [8]. The imposed prior takes the form

p([Z]) = \frac{\alpha^K}{\prod_{h \in \{0,1\}^N \setminus \{0^N\}} K_h!} \exp(-\alpha H_N) \prod_{k=1}^{K} \frac{(N - m_k)! \, (m_k - 1)!}{N!}    (1)

Here, m_k is the number of objects assigned to the kth feature (s.t. z_{nk} = 1), \alpha is the innovation hyperparameter of the IBP prior, which regulates the number of effective model features K, H_N is the Nth harmonic number, and K_h is the number of occurrences of the non-zero binary vector h among the columns of Z.

Apart from Markov chain Monte Carlo (MCMC) [8], inference for the IBP can also be performed by means of mean-field variational Bayesian inference methods, which approximate the true posterior via a simpler distribution [11]. Variational Bayesian inference for the IBP is based on an alternative formulation of p(Z) [7], namely the stick-breaking construction of the IBP [31]: it has been shown that the prior (1) obtained by the IBP can be equivalently expressed under the following hierarchical Bayesian construction

z_{nk} \sim \mathrm{Bernoulli}(\pi_k)    (2)

\pi_k = \prod_{i=1}^{k} v_i    (3)

v_k \sim \mathrm{Beta}(\alpha, 1)    (4)

In other words, under the stick-breaking construction, an equivalent hierarchical expression for the prior p(Z) is obtained by introducing the Beta-distributed stick variables v_k.
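To make the construction (2)-(4) concrete, here is a minimal Python sketch that draws a feature-assignment matrix Z from a truncated version of the stick-breaking IBP; the truncation level K_trunc, the function name, and the use of NumPy are our own illustrative assumptions, not part of the original (untruncated) construction.

```python
import numpy as np

def sample_ibp_stick_breaking(N, alpha, K_trunc=50, rng=None):
    """Draw Z (N x K_trunc) from a truncated stick-breaking IBP.

    Follows Eqs. (2)-(4): v_k ~ Beta(alpha, 1), pi_k = prod_{i<=k} v_i,
    z_nk ~ Bernoulli(pi_k). K_trunc is an illustrative truncation level.
    """
    rng = np.random.default_rng() if rng is None else rng
    v = rng.beta(alpha, 1.0, size=K_trunc)   # stick variables, Eq. (4)
    pi = np.cumprod(v)                        # feature probabilities, Eq. (3)
    Z = rng.random((N, K_trunc)) < pi         # Bernoulli draws, Eq. (2)
    return Z.astype(int)

# Example: 10 objects; with alpha = 2 only a handful of columns
# (features) are typically active, illustrating the sparsity of the prior.
Z = sample_ibp_stick_breaking(N=10, alpha=2.0)
print(Z.sum(axis=0))  # number of objects assigned to each feature
```

Note how the cumulative products pi decay geometrically, so later features are activated with vanishing probability; this is what keeps the effective number of features finite for finite data.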

3. PROPOSED APPROACH

3.1 Model formulation

Before we introduce our model, let us first formally define the problem we aim to address: Let us consider a recommender system whose registered users comprise a set N = \{n_u\}_{u=1}^{U}. Let us also consider that the system produces predictions for D tasks, each one pertaining to a different domain. For each task, the corresponding domain comprises the item set C^d = \{c_m^d\}_{m=1}^{M_d}, d = 1, \dots, D. Each user is allowed by the system to provide a rating for each item in each domain; the rating variable r^d in domain d is assumed to take values in the discrete set \{1, \dots, R_d\} of possible ratings. Each user may provide ratings for one or more of the items within each domain. In essence, at any time point, the considered recommender system has available a training ratings dataset D = \{(r_n^d, u_n^d, m_n^d)\}_{d,n=1}^{D,N_d}, comprising N_d tuples from each domain d, consisting of users with indexes u_n^d \in \{1, \dots, U\}, items with indexes m_n^d \in \{1, \dots, M_d\}, and associated ratings r_n^d \in \{1, \dots, R_d\}.

Typically, in real-world scenarios, users provide ratings only for a minuscule fraction of the items within each domain. However, by exploiting information from different domains, and sharing correlated information across domains, more (implicitly derived) information can be made available to the predictive models, thus enhancing model performance in all domains.

Based on this observation, what we aim to achieve in this paper is a multitask learning approach that synergistically performs CF across domains. We follow an approach inspired by the recent literature on Bayesian nonparametrics. Specifically, we propose a novel nonparametric Bayesian latent factor model to achieve our ends. Under our approach, the rating function of each user for each task is expressed as a sum of three terms: a user bias, an item bias, and a hierarchical two-component latent factor term, with the latent factors expressing the assignment of the users and the available items to latent classes (assignment of latent features), and the factor weights (loadings) taken as the joint user/item biases of the model, which are obtained through model training.

We assume that each user may have one or more traits, which comprise latent variables of the sought model, of unknown number, while each item may belong to one or more feature classes, which also comprise latent variables of the sought model, of unknown number. To extract the underlying task correlation information, and to perform information sharing across the considered tasks, we impose a suitable joint prior over the factor weights (loadings) of all tasks, and infer the corresponding posteriors given the training data of our algorithm.

Let us consider the dth task. The rating function r^d of a user n_u \in N for item c_m^d \in C^d is expressed under our approach as follows:

r^d(u, m) = \mu^d + \rho_u^d + \eta_m^d + \sum_{i=1}^{\infty} z_{ui}^d \left( w_i^d \cdot x_m^d \right) + \epsilon    (5)

where \cdot denotes the inner product. In the proposed model (5), \mu^d is the mean rating for any user/item pair in domain d. \rho_u^d is the bias of the uth user w.r.t. the task of domain d; a positive value of \rho_u^d expresses the propensity of the user to give higher than average ratings to any given item from domain d, while a negative value expresses their propensity to give lower than average ratings. \eta_m^d is the bias of the mth item from domain d, and expresses an objective quality measure for the item: a high quality item is more likely to obtain higher than average ratings, even from a user with incompatible latent features; this is expressed by a positive \eta_m^d value. On the contrary, a low quality item is more likely to obtain lower than average ratings, even from a user with very compatible latent features; this is expressed by a negative \eta_m^d value. Further, x_m^d = [x_{mg}^d]_{g=1}^{\infty}, where the x_{mg}^d are the latent variables of the items, with x_{mg}^d = 1 if the mth item from domain d is assigned to the gth latent feature (multiple feature assignments are possible for each item), and x_{mg}^d = 0 otherwise. Similarly, the z_{ui}^d are the latent variables of the users, with z_{ui}^d = 1 if the uth user is assigned to the ith feature (multiple feature assignments are possible for each user) in the context of the dth task, and z_{ui}^d = 0 otherwise. Finally, the weights w_i^d = [w_{ig}^d]_{g=1}^{\infty} are the joint user/item biases (factor loadings) that correspond to the combination of the ith latent user feature with the gth latent item feature in the context of the dth task.

The term \epsilon in (5) stands for the noise component of our model. Here, we consider an additive white noise term, with

\epsilon \sim \mathcal{N}(0, \sigma^2)    (6)

This selection for the prior of the noise variable \epsilon is clearly simplistic, since the observed rating variables r^d(u, m) take values in a discrete set. However, the assumption of additive white noise has the major advantage of allowing for a tractable model inference procedure, with simple model update expressions. As such, in the context of our work, we opt for this simple noise prior, and take measures to alleviate any negative effects. Specifically, for this purpose, instead of using the original integer ratings for model training, we employ a warping function h(\cdot): \mathbb{N} \rightarrow \mathbb{R}, which maps the observed integer rating values to the realm of real numbers, and use the so-obtained values as our observations r^d(u, m) to perform inference in the context of our model. The employed warping function h(\cdot) is taken as a parametric function, i.e., h(\cdot) = h(\cdot|\psi), the parameters \psi of which are randomly initialized, and subsequently optimized through model training as model hyperparameters, as we shall explain in Section 3.2.2.
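As a concrete illustration of the decomposition in (5), the following minimal Python sketch computes the noise-free part of a single rating from given latent assignments and factor loadings; the finite truncation and all variable names are illustrative assumptions of ours.

```python
import numpy as np

def rating_mean(mu_d, rho_du, eta_dm, z_du, W_d, x_dm):
    """Noise-free part of Eq. (5) for one (user, item) pair in domain d.

    z_du : (I,) binary user feature assignments  z_{ui}^d
    W_d  : (I, G) factor loadings                 w_{ig}^d
    x_dm : (G,) binary item feature assignments   x_{mg}^d
    """
    # sum_i z_ui * (w_i . x_m) equals z^T W x under a finite truncation
    return mu_d + rho_du + eta_dm + z_du @ W_d @ x_dm

# Illustrative toy values with I = 3 user features, G = 4 item features.
rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=3)
x = rng.integers(0, 2, size=4)
W = rng.normal(size=(3, 4))
print(rating_mean(3.5, 0.2, -0.1, z, W, x))
```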

Note that in our model we assume positive factor weights for "compatible" user and item characteristics (latent features), but also negative factor weights for "conflicting" latent features. We also note that, in the above definition (5) of the proposed model, we have considered an infinite number of latent features. This assumption reflects our unawareness of the exact number of existing latent features, and demands the imposition of an appropriate prior distribution over the latent variables x_m^d and z_u^d = [z_{ui}^d]_{i=1}^{\infty}, so as to conduct Bayesian inference over the number of latent features. For this purpose, we adopt a suitable nonparametric Bayesian construction for our model: formulation of our model is performed by imposing IBP priors over the latent model variables x_m^d and z_u^d. Specifically, we employ the stick-breaking construction of the IBP, which allows for deriving a computationally efficient and scalable inference algorithm for our model under the variational Bayesian paradigm. We consider

z_{ui}^d \sim \mathrm{Bernoulli}(\pi_i^d), \quad i = 1, \dots, \infty    (7)

\pi_i^d = \prod_{\lambda=1}^{i} v_\lambda^d    (8)

v_\lambda^d \sim \mathrm{Beta}(\alpha^d, 1)    (9)

and

x_{mg}^d \sim \mathrm{Bernoulli}(\varpi_g^d), \quad g = 1, \dots, \infty    (10)

\varpi_g^d = \prod_{\lambda=1}^{g} \tilde{v}_\lambda^d    (11)

\tilde{v}_\lambda^d \sim \mathrm{Beta}(\gamma^d, 1)    (12)

Further, to conduct Bayesian inference, we also impose prior distributions over the model bias parameters and mean ratings. Specifically, we consider

p(\mu^d) = \mathcal{N}(\mu^d | \mu_0^d, (\sigma_\mu^d)^2)    (13)

p(\rho_u^d) = \mathcal{N}(\rho_u^d | 0, (\varsigma_u^d)^2)    (14)

p(\eta_m^d) = \mathcal{N}(\eta_m^d | 0, (e_m^d)^2)    (15)

p(W^d) = \prod_i p(w_i^d) = \prod_i \mathcal{N}(w_i^d | 0, (s_i^d)^2 I)    (16)

where W^d = [w_i^d]_i.

Bayesian inference for such a model consists in the derivation of a family of posterior distributions q(\cdot) over the infinite sets v^d = [v_\lambda^d]_{\lambda=1}^{\infty}, \tilde{v}^d = [\tilde{v}_\lambda^d]_{\lambda=1}^{\infty}, and W^d = [w_i^d]_{i=1}^{\infty}. Apparently, under this infinite-dimensional setting, Bayesian inference is not tractable. For this reason, we employ a common strategy in the literature of Bayesian nonparametrics, formulated on the basis of a truncated stick-breaking representation of the IBP [31, 6]. That is, we fix a value I, letting the posterior over the v_i^d have the property q(v_{I+1}^d = 0) = 1, and, similarly, we fix a value G, letting the posterior over the \tilde{v}_g^d have the property q(\tilde{v}_{G+1}^d = 0) = 1. In other words, we set the \pi_i^d and \varpi_g^d equal to zero for i > I and g > G, \forall d, respectively. Note that, under this setting, our model continues to employ two full IBP priors: truncation is not imposed on the model itself, but only on the derived posterior distribution, to allow for a tractable inference procedure [6]. In our work, the values of the truncation thresholds I and G are set equal to 1,000. This is quite high, and, hence, "close to infinity," since all related models have been shown to yield optimal performance with no more than 20 latent features. In practice, the IBP mechanisms will only retain the small fraction of this large number of latent features that is needed for optimal data modeling (see also [31]).

Finally, to perform multitask learning by deriving and learning the relationships between different domains and tasks, we impose a matrix-variate normal distribution over the joint matrix of the model factor loadings across tasks, i.e., the matrix W \triangleq [\mathrm{vec}(W^1), \mathrm{vec}(W^2), \dots, \mathrm{vec}(W^D)], where \mathrm{vec}(\cdot) stands for the operator that converts a matrix into a vector in a column-wise manner. This kind of modeling reflects our assumption that, in cases of correlated tasks, user as well as item latent features may share common natural interpretations and similar interdependence patterns among tasks (expressed by the corresponding factor loadings matrices W^d). This is in contrast to existing PMF-based approaches, which impose joint priors over the latent feature assignments of the users across tasks, thus limiting the spectrum of their considered information sharing mechanisms only to user features, and neglecting the dynamics between user and item latent features (quantified by the W^d), which may also follow similar patterns among correlated tasks.

Specifically, we use the prior

p(W | \Omega) = \mathcal{MN}(W | 0, I \otimes \Omega)    (17)

where \otimes denotes the Kronecker product, and \mathcal{MN}(W | A, B \otimes \Gamma) stands for the matrix-variate normal distribution with mean A \in \mathbb{R}^{a \times b}, row covariance matrix B \in \mathbb{R}^{a \times a}, and column covariance matrix \Gamma \in \mathbb{R}^{b \times b}, defined as

\mathcal{MN}(W | A, B \otimes \Gamma) \triangleq \frac{\exp\left(-\frac{1}{2} \mathrm{tr}\left(B^{-1} (W - A) \Gamma^{-1} (W - A)^T\right)\right)}{(2\pi)^{ab/2} \, |B|^{b/2} \, |\Gamma|^{a/2}}    (18)

We envisage the learnt hyperparameter matrix \Omega as the component of our model that encodes the relationships between domains. We shall elaborate on this aspect next.
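Since the matrix-variate normal in (18) may be less familiar than its vectorized counterpart, the following Python sketch evaluates its log-density directly from the definition; the function name and the use of NumPy are illustrative assumptions.

```python
import numpy as np

def matrix_normal_logpdf(W, A, B, Gamma):
    """Log-density of MN(W | A, B x Gamma), Eq. (18).

    W, A : (a, b) matrices; B : (a, a) row covariance;
    Gamma : (b, b) column covariance.
    """
    a, b = W.shape
    R = W - A
    # tr(B^{-1} R Gamma^{-1} R^T), computed via linear solves for stability
    quad = np.trace(np.linalg.solve(B, R) @ np.linalg.solve(Gamma, R.T))
    _, logdet_B = np.linalg.slogdet(B)
    _, logdet_G = np.linalg.slogdet(Gamma)
    return -0.5 * quad - 0.5 * (a * b * np.log(2 * np.pi)
                                + b * logdet_B + a * logdet_G)

# Equivalently, under the column-wise vec convention,
# vec(W) ~ N(vec(A), Gamma kron B) as a plain multivariate Gaussian.
```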

3.2 Variational Bayesian inference

Here, instead of MCMC, we opt for a variational Bayesian inference procedure, due to its better scalability to large datasets. Let us denote as \Theta the set of hidden variables and unknown parameters of our model over which a prior distribution has been imposed. Let us also denote as \Xi the set comprising all the hyperparameters of the imposed priors. Variational Bayesian inference consists in the introduction of an arbitrary distribution q(\Theta) to approximate the actual posterior p(\Theta | \Xi, D), which is computationally intractable [2]. The variational posterior q(\Theta) is obtained by maximization of the variational free energy of the model, which is defined as [11]

L(q) = \int d\Theta \, q(\Theta) \log \frac{p(D, \Theta | \Xi)}{q(\Theta)}    (19)

Note that L(q) comprises a lower bound of the log marginal likelihood (log evidence), \log p(D), of the model [11].

Due to the considered conjugate prior configuration of our model, the variational posterior q(\Theta) is expected to take the same functional form as the prior, p(\Theta) [32]. Derivation of the variational posterior distribution q(\Theta) involves maximization of the variational free energy L(q) over each one of the factors of q(\Theta) in turn, holding the others fixed, in an iterative manner [2]. In addition, on each iteration, the estimates of the model hyperparameters \Xi are also updated. This latter procedure is performed by maximization of the variational free energy L(q) over each one of the hyperparameters in \Xi, holding the others fixed. By construction, this iterative, consecutive updating of the variational posterior distribution and the model hyperparameters is guaranteed to monotonically and maximally increase the free energy L(q) [32].
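The overall procedure described above is a standard coordinate-ascent scheme. The following model-agnostic Python sketch shows only its control flow, with the per-factor and hyperparameter updates supplied as callables; everything here is an illustrative assumption of ours, not the paper's actual update code.

```python
def coordinate_ascent_vb(state, factor_updates, hyper_updates,
                         free_energy, tol=1e-4, max_iter=200):
    """Generic VB loop: update each q-factor, then each hyperparameter,
    monitoring the variational free energy L(q) until convergence."""
    prev = free_energy(state)
    for _ in range(max_iter):
        for update in factor_updates:   # q(Theta) factor updates, Sec. 3.2.1
            state = update(state)
        for update in hyper_updates:    # hyperparameter updates, Sec. 3.2.2
            state = update(state)
        cur = free_energy(state)
        if abs(cur - prev) < tol * max(1.0, abs(prev)):
            break
        prev = cur
    return state
```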

3.2.1 Variational posteriors

Let us denote as ⟨·⟩ the posterior expectation of a quantity (i.e., the expectation w.r.t. the variational posterior q(\Theta) of the model), and as \psi(\cdot) the digamma function. Optimization of L(q) yields the following variational posteriors:

1. Regarding the stick-breaking variables of the employed IBPs, the posterior distributions are similar to the ones derived in [6]. We have

q(v_\lambda^d) = \mathrm{Beta}(v_\lambda^d | \tau_{\lambda 1}^d, \tau_{\lambda 2}^d)    (20)

where

\tau_{\lambda 1}^d = \alpha^d + \sum_{k=\lambda}^{I} \sum_{u=1}^{U} q(z_{uk}^d = 1) + \sum_{k=\lambda+1}^{I} \Big[ U - \sum_{u=1}^{U} q(z_{uk}^d = 1) \Big] \Big[ \sum_{h=\lambda+1}^{k} q_h^d \Big]    (21)

\tau_{\lambda 2}^d = 1 + \sum_{k=\lambda}^{I} \Big[ U - \sum_{u=1}^{U} q(z_{uk}^d = 1) \Big] q_\lambda^d    (22)

and

q(\tilde{v}_\lambda^d) = \mathrm{Beta}(\tilde{v}_\lambda^d | \tilde{\tau}_{\lambda 1}^d, \tilde{\tau}_{\lambda 2}^d)    (23)

where

\tilde{\tau}_{\lambda 1}^d = \gamma^d + \sum_{k=\lambda}^{G} \sum_{m=1}^{M_d} q(x_{mk}^d = 1) + \sum_{k=\lambda+1}^{G} \Big[ M_d - \sum_{m=1}^{M_d} q(x_{mk}^d = 1) \Big] \Big[ \sum_{h=\lambda+1}^{k} \tilde{q}_h^d \Big]    (24)

\tilde{\tau}_{\lambda 2}^d = 1 + \sum_{k=\lambda}^{G} \Big[ M_d - \sum_{m=1}^{M_d} q(x_{mk}^d = 1) \Big] \tilde{q}_\lambda^d    (25)

and, following [31], the quantities q_\lambda^d and \tilde{q}_\lambda^d are defined as

q_\lambda^d \propto \exp\Big[ \psi(\tau_{\lambda 2}^d) + \sum_{h=1}^{\lambda-1} \psi(\tau_{h1}^d) - \sum_{h=1}^{\lambda} \psi(\tau_{h1}^d + \tau_{h2}^d) \Big]    (26)

\tilde{q}_\lambda^d \propto \exp\Big[ \psi(\tilde{\tau}_{\lambda 2}^d) + \sum_{h=1}^{\lambda-1} \psi(\tilde{\tau}_{h1}^d) - \sum_{h=1}^{\lambda} \psi(\tilde{\tau}_{h1}^d + \tilde{\tau}_{h2}^d) \Big]    (27)

and are normalized so that they sum to one.

2. Regarding the posterior distributions over the users' latent variables, z_{ui}^d, optimization of L(q) yields

q(z_{ui}^d = 1) = \frac{1}{1 + \exp(-\nu_{ui}^d)}    (28)

where

\nu_{ui}^d = \sum_{\lambda=1}^{i} \big[ \psi(\tau_{\lambda 1}^d) - \psi(\tau_{\lambda 1}^d + \tau_{\lambda 2}^d) \big] - \Big\langle \log\Big(1 - \prod_{\lambda=1}^{i} v_\lambda^d\Big) \Big\rangle
    - \frac{1}{2\sigma^2} \sum_{n \in D_d(u)} \Big[ \big\langle (w_i^d \cdot x_{m(n)}^d)^2 \big\rangle - 2 \big\langle w_i^d \cdot x_{m(n)}^d \big\rangle \big( r_n^d - \langle \mu^d \rangle - \langle \rho_u^d \rangle - \langle \eta_{m(n)}^d \rangle \big)
    + 2 \sum_{l \neq i} q(z_{ul}^d = 1) \big\langle (w_l^d \cdot x_{m(n)}^d)(w_i^d \cdot x_{m(n)}^d) \big\rangle \Big]    (29)

Here, m(n) is the identifier of the item that corresponds to the nth training example, m(n) \in \{1, \dots, M_d\}, and D_d(u) is the subset of the training dataset D that contains the examples pertaining to the uth user and the dth task.

3. Regarding the posterior distributions over the items' latent variables, x_{mg}^d, optimization of L(q) yields

q(x_{mg}^d = 1) = \frac{1}{1 + \exp(-\nu_{mg}^d)}    (30)

where

\nu_{mg}^d = \sum_{\lambda=1}^{g} \big[ \psi(\tilde{\tau}_{\lambda 1}^d) - \psi(\tilde{\tau}_{\lambda 1}^d + \tilde{\tau}_{\lambda 2}^d) \big] - \Big\langle \log\Big(1 - \prod_{\lambda=1}^{g} \tilde{v}_\lambda^d\Big) \Big\rangle
    - \frac{1}{2\sigma^2} \sum_{n \in D_d(m)} \Big[ \big\langle (w_g^d \cdot z_{u(n)}^d)^2 \big\rangle - 2 \big\langle w_g^d \cdot z_{u(n)}^d \big\rangle \big( r_n^d - \langle \mu^d \rangle - \langle \rho_{u(n)}^d \rangle - \langle \eta_m^d \rangle \big)
    + 2 \sum_{l \neq g} q(x_{ml}^d = 1) \big\langle (w_g^d \cdot z_{u(n)}^d)(w_l^d \cdot z_{u(n)}^d) \big\rangle \Big]    (31)

Here, u(n) is the identifier of the user that corresponds to the nth training example, u(n) \in \{1, \dots, U\}, D_d(m) is the subset of the training dataset D that contains the examples pertaining to the mth item of the dth domain, and we define w_g^d = [w_{ig}^d]_{i=1}^{I}.

4. Regarding the joint user/item biases (factor loadings) of the model, assuming a spherical posterior distribution to simplify the expressions, we obtain

q(W) \approx \prod_{d=1}^{D} q(W^d) \approx \prod_{d=1}^{D} \prod_{i=1}^{I} q(w_i^d)    (32)

with

q(w_i^d) \triangleq \prod_{g=1}^{G} q(w_{ig}^d) = \prod_{g=1}^{G} \mathcal{N}(w_{ig}^d | \varphi_{ig}^d, \phi_{ig}^d)    (33)

where

\phi_{ig}^d = \Big[ \frac{1}{(s_i^d)^2} + [\Psi]_{d,d} + \frac{1}{\sigma^2} \sum_{n=1}^{N_d} q(z_{u(n),i}^d = 1) \, q(x_{m(n),g}^d = 1) \Big]^{-1}    (34)

and

\varphi_{ig}^d = \phi_{ig}^d \Bigg( \frac{1}{\sigma^2} \sum_{n=1}^{N_d} q(x_{m(n),g}^d = 1) \, q(z_{u(n),i}^d = 1)
    \times \Big[ r_n^d - \langle \mu^d \rangle - \langle \rho_{u(n)}^d \rangle - \langle \eta_{m(n)}^d \rangle - \sum_{l \neq i} \sum_{\xi \neq g} q(z_{u(n),l}^d = 1) \, q(x_{m(n),\xi}^d = 1) \, \langle w_{l\xi}^d \rangle \Big]
    - \sum_{\zeta \neq d} \langle w_{ig}^\zeta \rangle \, [\Psi]_{\zeta,d} \Bigg)    (35)

In the above equations, we denote \Psi = \Omega^{-1}, and as [\Psi]_{i,j} the (i, j)th element of matrix \Psi.

5. Regarding the user biases, we have

q(\rho_u^d) = \mathcal{N}(\rho_u^d | \hat{\rho}_u^d, \hat{\varsigma}_u^d)    (36)

where

\hat{\varsigma}_u^d = \Big[ \frac{1}{(\varsigma_u^d)^2} + \frac{\#D_d(u)}{\sigma^2} \Big]^{-1}    (37)

\#D_d(u) is the cardinality of D_d(u), and

\hat{\rho}_u^d = \frac{\hat{\varsigma}_u^d}{\sigma^2} \sum_{n \in D_d(u)} \Big[ r_n^d - \langle \mu^d \rangle - \langle \eta_{m(n)}^d \rangle - \sum_{i=1}^{I} q(z_{ui}^d = 1) \langle w_i^d \cdot x_{m(n)}^d \rangle \Big]    (38)

6. Regarding the item biases, we have

q(\eta_m^d) = \mathcal{N}(\eta_m^d | \hat{\eta}_m^d, \hat{e}_m^d)    (39)

where

\hat{e}_m^d = \Big[ \frac{1}{(e_m^d)^2} + \frac{\#D_d(m)}{\sigma^2} \Big]^{-1}    (40)

\#D_d(m) is the cardinality of D_d(m), and

\hat{\eta}_m^d = \frac{\hat{e}_m^d}{\sigma^2} \sum_{n \in D_d(m)} \Big[ r_n^d - \langle \mu^d \rangle - \langle \rho_{u(n)}^d \rangle - \sum_{i=1}^{I} q(z_{u(n),i}^d = 1) \langle w_i^d \cdot x_m^d \rangle \Big]    (41)

7. Regarding the mean ratings \mu^d, we obtain

q(\mu^d) = \mathcal{N}(\mu^d | \hat{\mu}^d, \hat{\sigma}_\mu^d)    (42)

where

\hat{\sigma}_\mu^d = \Big[ \frac{1}{(\sigma_\mu^d)^2} + \frac{N_d}{\sigma^2} \Big]^{-1}    (43)

and

\hat{\mu}^d = \hat{\sigma}_\mu^d \Big[ \frac{1}{\sigma^2} \sum_{n=1}^{N_d} \Big( r_n^d - \langle \rho_{u(n)}^d \rangle - \langle \eta_{m(n)}^d \rangle - \sum_{i=1}^{I} q(z_{u(n),i}^d = 1) \langle w_i^d \cdot x_{m(n)}^d \rangle \Big) + \frac{\mu_0^d}{(\sigma_\mu^d)^2} \Big]    (44)
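Numerically, the Bernoulli updates (28) and (30) above are plain logistic transforms of the corresponding natural parameters \nu; a minimal sketch, assuming SciPy's numerically stable implementation of the logistic function:

```python
import numpy as np
from scipy.special import expit  # numerically stable logistic function

def bernoulli_posterior(nu):
    """Eqs. (28)/(30): q(z_ui = 1) (or q(x_mg = 1)) as sigmoid of nu."""
    return expit(nu)

print(bernoulli_posterior(np.array([-2.0, 0.0, 2.0])))  # ~[0.119 0.5 0.881]
```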

3.2.2 Hyperparameter Optimization

Having obtained the posterior updates q(\Theta), we now proceed to the derivation of the estimates of the model hyperparameters in \Xi. For this purpose, we resort to maximization of the variational free energy L(q) over each one of them. We begin with the updates of the hyperparameter matrix \Omega of the matrix-normal joint prior imposed over the model factor loadings. As we discussed previously, the matrix \Omega quantifies the shared information between domains. Taking the derivative of L(q) over \Omega, and setting it equal to zero, we obtain

\Omega = \langle W^T W \rangle    (45)

which implies

[\Omega]_{d,d'} = \big\langle \mathrm{vec}(W^d)^T \, \mathrm{vec}(W^{d'}) \big\rangle    (46)

This result shows that the proposed model learns the hyperparameter matrix \Omega as the correlation between the factor loading (weight) matrices of our model across tasks. This is a very interesting finding, since it is compatible with our conception of the matrix \Omega as a quantity encoding the relations between domains.
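As a rough illustration of the update (45)-(46), the sketch below computes \Omega from the loading matrices of each task; using only posterior means and ignoring posterior covariance terms is a simplifying assumption of ours, not the exact update.

```python
import numpy as np

def update_omega(W_list):
    """Approximate Omega update per Eqs. (45)-(46).

    W_list: list of D loading matrices W^d (each I x G); each is vectorized
    column-wise, and Omega[d, d'] = vec(W^d)^T vec(W^d').
    This plug-in version uses posterior means only, for simplicity.
    """
    V = np.stack([W.flatten(order='F') for W in W_list], axis=1)  # (I*G, D)
    return V.T @ V                                                 # (D, D)

# Toy example with D = 3 tasks:
rng = np.random.default_rng(1)
Omega = update_omega([rng.normal(size=(4, 5)) for _ in range(3)])
print(Omega.shape)  # (3, 3)
```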

Further, for the noise variance hyperparameter \sigma^2, we have

\sigma^2 = \frac{1}{\sum_{d=1}^{D} N_d} \sum_{d=1}^{D} \sum_{n=1}^{N_d} \Big\langle \Big( r_n^d - \mu^d - \rho_{u(n)}^d - \eta_{m(n)}^d - \sum_{i=1}^{I} q(z_{u(n),i}^d = 1) \big( w_i^d \cdot x_{m(n)}^d \big) \Big)^2 \Big\rangle    (47)

The expressions for the variances of the user and item biases are similar (we omit them for brevity).

Finally, we also need to estimate the parameters entailed in the non-linear transform h(\cdot|\psi), employed to optimally map the originally observed integer ratings to the realm of real numbers. For this purpose, we substitute the (transformed) real rating values r_n^d in the expression (19) of the variational free energy L(q) of our model with the expression of the warping function h(\cdot|\psi) applied to the original integer ratings, and optimize the resulting expression over \psi. Here, this optimization is performed by means of the scaled conjugate gradient (SCG) algorithm.

3.3 Prediction

Having derived the inference algorithm for our model, we now proceed to the derivation of its prediction algorithm. This consists in using the trained model to estimate the rating a user n_u would assign to an item c_m^d from the dth domain that they have not rated before. For this purpose, we follow a MAP prediction approach: we use as the obtained prediction the posterior expectation (mean) of the introduced rating function. From (5), this posterior expectation yields

\hat{r}^d(u, m) = \sum_i q(z_{ui}^d = 1) \langle w_i^d \rangle \cdot \langle x_m^d \rangle + \langle \mu^d \rangle + \langle \rho_u^d \rangle + \langle \eta_m^d \rangle = \hat{\mu}^d + \hat{\rho}_u^d + \hat{\eta}_m^d + \sum_i q(z_{ui}^d = 1) \, [\varphi_{ig}^d]_{g=1}^{G} \cdot \langle x_m^d \rangle    (48)

where \langle x_m^d \rangle = [q(x_{mg}^d = 1)]_{g=1}^{G}.

As we observe, the predictive function of our model is a simple and computationally convenient expression, with very low computational costs, linear in the input size. Indeed, these costs are similar to those of PMF-based approaches, such as [34]. Hence, the scalability of our model is very competitive.
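A minimal Python sketch of the predictive rule (48), assuming the variational parameters have already been fitted; the array names and truncation levels are our own.

```python
import numpy as np

def predict_rating(mu_hat, rho_hat_u, eta_hat_m, q_z_u, Phi, q_x_m):
    """MAP-style prediction per Eq. (48) under truncation levels I, G.

    q_z_u : (I,) posterior probabilities q(z_ui = 1) for the target user
    Phi   : (I, G) posterior means  phi_ig  of the factor loadings
    q_x_m : (G,) posterior probabilities q(x_mg = 1) for the target item
    """
    return mu_hat + rho_hat_u + eta_hat_m + q_z_u @ Phi @ q_x_m
```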

4. EXPERIMENTAL RESULTS

Here, we evaluate the efficacy of our approach in a number of experiments. Specifically, we perform evaluations using two datasets commonly used in the CF literature, namely the MovieLens 100K dataset^1, dealing with movie ratings, and the Book-Crossing dataset^2, dealing with book ratings. To evaluate the multitask learning capabilities of our approach, we follow the experimental setup of [34], exploiting the fact that, in both these datasets, the items can be divided into multiple heterogeneous domains. We utilize this feature to obtain multiple CF tasks from these datasets, which are jointly learned using our model.

In our experiments, apart from our method, we also evaluate two existing multitask CF methods related to our work, namely the MCF-LF method of [34] and the CMF method of [28]. As baselines, we also evaluate two conventional CF approaches, namely the PMF method of [26] and the Bayesian latent factor-based BLITR method of [9]. The parameters of the MCF-LF and CMF methods are selected as described in [34]; following the recommendations therein, we set the number of latent features equal to 10 in both these models. The PMF approach is trained with a learning rate of 0.005 and a momentum of 0.9, as described in [26]. We use 10 latent features for this model, similar to [26]. The settings of the BLITR model are adopted from [9]: a Rao-Blackwellised Gibbs sampler is used to draw 300 samples from the Markov chain, with a burn-in of 200 samples. In the case of BLITR we use 50 latent features, based on the findings and guidelines of [9].

Finally, regarding the warping function h(\cdot|\psi), we select

\tilde{r} = h(r|\psi) = \psi_1 \log(\psi_2 r + \psi_3) + \psi_4    (49)

where r denotes the originally observed integer ratings, and \tilde{r} the transformed real values used by our model.
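For concreteness, a small Python sketch of the warp (49) and its inverse (the inverse is used in Section 4.3 to map predicted real values back to discrete ratings); the parameter values below are illustrative only.

```python
import numpy as np

def warp(r, psi):
    """Eq. (49): map integer ratings to the reals."""
    p1, p2, p3, p4 = psi
    return p1 * np.log(p2 * r + p3) + p4

def unwarp(r_tilde, psi):
    """Inverse of Eq. (49); valid whenever p1, p2 != 0 and the argument
    of the log stays positive."""
    p1, p2, p3, p4 = psi
    return (np.exp((r_tilde - p4) / p1) - p3) / p2

psi = (1.0, 1.0, 1.0, 0.0)   # illustrative parameters
r = np.arange(1, 6)           # ratings 1..5
assert np.allclose(unwarp(warp(r, psi), psi), r)
```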

To assess the performance of our algorithm and its considered rivals, we use as our evaluation metric the root mean square error (RMSE) of the predictions, which reads

\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{n=1}^{N} (r_n - \hat{r}_n)^2}    (50)

where r_n is the true rating for the nth example, and \hat{r}_n is the rating predicted by the evaluated model. Our source codes were developed in MATLAB R2012b.

^1 http://www.grouplens.org/
^2 http://www.informatik.uni-freiburg.de/~cziegler/BX/

Table 1: MovieLens Dataset: Mean performance (RMSE) obtained by the evaluated methods.

Model         Comedy   Romance  Drama    Action   Thriller
Our approach  0.7984   0.7603   0.7757   0.7572   0.7477
CMF           0.8271   0.7976   0.8123   0.7945   0.7985
MCF-LF        0.8019   0.7645   0.7803   0.7606   0.7505
BLITR         0.9392   1.1987   0.8968   0.9885   1.0232
PMF           0.9416   1.2013   0.9020   0.9901   1.0267

Table 2: Book-Crossing Dataset: Mean performance (RMSE) obtained by the evaluated methods.

Model         Mystery-Thrillers  Science Fiction-Fantasy  Science  Business-Investing  Religion-Spirituality
Our approach  0.5655             0.5766                   0.6042   0.5988              0.5926
CMF           0.9624             1.0210                   0.9771   0.8463              1.0495
MCF-LF        0.5683             0.5790                   0.6049   0.6001              0.5953
BLITR         0.8998             0.9612                   0.8275   0.8617              0.8712
PMF           0.9014             0.9643                   0.8301   0.8625              0.8753

Table 3: Minimum and maximum absolute error and its standard deviation for the evaluated methods.

              MovieLens             Book-Crossing
Model         MIN   MAX   St.D.     MIN   MAX   St.D.
Our work      0     3     0.49      0     2     0.47
CMF           0     3     0.53      0     5     0.70
MCF-LF        0     3     0.50      0     2     0.47
BLITR         0     4     0.58      0     5     0.68
PMF           0     4     0.62      0     5     0.69


4.1 MovieLens 100K dataset

In our first experiment, we evaluate our method using the MovieLens 100K dataset. MovieLens 100K is a widely used benchmark for movie recommendation systems. It contains 100,000 ratings of 1,682 movies provided by 943 users of the GroupLens website^3. The ratings in this dataset take values in a set of 10 discrete values. In addition to this rating information, the MovieLens 100K dataset also provides genre information about the rated movies.

Based on this dataset, we construct an MCF scenario as follows: We first determine the five most popular genres in the dataset. These turn out to be 'Comedy', 'Romance', 'Drama', 'Action', and 'Thriller'. Subsequently, we define an MCF problem whose tasks comprise domains that correspond to the selected five genres of the dataset. We train the evaluated models using a randomly selected 80% of the rating data from each of the five tasks, while the remaining 20% is kept for testing. To alleviate the effects of random data selection on the obtained performance figures, we repeat our experiments 10 times, with different random training and test set configurations each time.

In Table 1, we provide the RMSEs obtained by the evaluated methods (means over the conducted repetitions). As we observe, the considered multitask CF models consistently yield considerably better performance than the PMF and BLITR approaches, which do not possess multitask learning capabilities. This finding provides a strong indication that taking multiple domains into consideration yields significant competitive advantages over treating different domains independently. We also observe that CMF is consistently inferior to MCF-LF; this was to be expected, since MCF-LF is in essence a generalization of CMF, which reduces to CMF under simplifying assumptions. Note also that BLITR obtains better performance than PMF in this experiment. Finally, we observe that our approach yields a clear advantage over MCF-LF. This finding shows that our Bayesian latent factor model, utilizing two layers of interacting latent factors and two nonparametric Bayesian priors to optimize the latent feature construction of the model, can extract subtler shared patterns than MCF-LF.

^3 http://grouplens.org/node/73

4.2 Book-Crossing dataset

Further, we evaluate our method using the Book-Crossing dataset. Book-Crossing is a public book ratings dataset which pertains to ratings of books available through several websites. In this experiment we use a subset of this dataset, namely the ratings on books with category information available on Amazon.com. This subset contains 56,148 ratings of 9,009 books provided by 28,503 users. The provided ratings are on the scale 1–10 with increments of 1.

Based on this dataset, we construct an MCF scenario by considering the five general book categories in this dataset, namely 'Mystery & Thrillers', 'Science Fiction & Fantasy', 'Science', 'Business & Investing', and 'Religion & Spirituality'. Specifically, we define an MCF problem whose tasks comprise domains that correspond to these five dataset categories. We train the evaluated models using a randomly selected 80% of the rating data from each task, while the remaining 20% is kept for testing. As in the previous experiment, to alleviate the effects of random data selection on the obtained performance figures, we repeat our experiments 10 times, with different random training and test set configurations each time.

In Table 2, we provide the RMSEs obtained by the evaluated methods (means over the conducted repetitions). As we observe, both MCF-LF and our approach consistently yield considerably better performance than the PMF and BLITR approaches, which do not possess multitask learning capabilities. However, contrary to our findings on the MovieLens dataset, the performance of the multitask CMF model is inferior to the considered single-task CF approaches, probably due to its simplistic assumptions. Finally, our method works better than all its competitors in all cases.

[Figure 1: Ω matrices obtained by our method: (a) MovieLens dataset; (b) Book-Crossing dataset.]

4.3 Further investigation

In real-world recommender systems, apart from the average error expressed by the RMSE, another significant quality aspect that determines system attractiveness is error variance. In other words, it is crucial that errors, whenever they (inevitably) happen, are not so large as to make system performance appear exceedingly poor. Indeed, even seldom-occurring cases of very poor results may irrevocably harm user confidence in the system. Table 3 shows the minimum and the maximum absolute prediction error obtained by the evaluated methods, as well as its standard deviation, in both the previous experiments. These statistics have been obtained by using the inverse of the function h(\cdot|\psi) in (49) to transform the real ratings predicted by our model back to the corresponding discrete ones. As we observe, our approach yields the most competitive results in all cases.

Finally, in Figs. 1a-1b we illustrate the Ω matrices obtained by our method in the MovieLens and Book-Crossing dataset experiments, respectively. The obtained results match our intuition quite well. For example, in the MovieLens dataset experiments, class #2 (Romance) appears to be strongly correlated with class #3 (Drama), while class #1 (Comedy) appears to have very low correlation with class #3 (Drama). Similarly, in the Book-Crossing dataset experiments, class #1 (Mystery & Thrillers) appears to have close to zero correlation with class #5 (Religion & Spirituality), while class #2 (Science Fiction & Fantasy) appears strongly correlated with both class #1 (Mystery & Thrillers) and class #3 (Science).

4.4 Computational Complexity

To conclude, a question that naturally arises concerns how our method compares to the competition in terms of computational costs. To begin with, training our model imposes computational costs similar to the MCF-LF approach when the number of its latent features is equal to the truncation thresholds of our approach. In our implementation, the number of latent features of MCF-LF was two orders of magnitude less than the initial number in our model. However, most of the initial features of our model were quickly purged; thus, our method yielded only a 27% increase in total computational time for model training, compared to MCF-LF. Regarding prediction generation using our model, as we already discussed in Section 3.3, the incurred computational costs of our approach are similar to MCF-LF, due to the simple linear form of our prediction function; our experiments corroborated this theoretical finding.

5. CONCLUSIONS & FUTURE WORK

In this work, we presented a Bayesian latent factor model that addresses the data sparsity problem CF systems suffer from by utilizing information from multiple tasks, detecting and extracting common patterns, and sharing this information across tasks to enhance the obtained predictive performance in all cases. Its main differences from related existing works are that it models the users' rating function by means of a two-component latent factor model, and that it is formulated on the basis of the nonparametric Bayesian paradigm.

Specifically, our approach is facilitated by imposing two Indian Buffet Process priors over the variables assigning users and items to latent features, which allow for inferring the appropriate number of latent features used in the context of the model. Information sharing is performed by imposing a suitable matrix-variate prior over the factor loadings matrices of the model across the considered tasks. This is in contrast to existing PMF-based approaches, which impose joint priors over the latent feature assignments of the users across tasks. As we discussed, our information sharing mechanism allows for deriving richer shared information across tasks than existing PMF-based approaches.

We provided a highly scalable algorithm for model inference using a truncated variational Bayesian approach. We evaluated our approach using two large, commonly used real-world datasets. We compared the performance of our method to related existing MCF methods, as well as to state-of-the-art single-task CF approaches. As we observed, our model manages to outperform the considered alternatives.

Our future research goal is to adapt our approach so as to also facilitate transfer learning, i.e., transferring the shared preference patterns extracted from a number of tasks and domains to new tasks where very few ratings are available. Such a functionality is expected to be of great benefit, e.g., to efforts to mitigate the cold-start problem. Exploring the utilization of mixtures of latent factor models [3] is also part of our ongoing research.

6. REFERENCES

[1] G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering, 17(6):734–749, 2005.
[2] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, New York, 2006.
[3] S. Chatzis, D. Kosmopoulos, and T. Varvarigou. Signal modeling and classification using a robust latent space model based on t distributions. IEEE Trans. Signal Processing, 56(3):949–963, March 2008.
[4] W. Y. Chen, J. C. Chu, J. Luan, H. Bai, Y. Wang, and E. Y. Chang. Collaborative filtering for Orkut communities: discovery of user latent behavior. In Proc. WWW'09, pages 681–690, 2009.
[5] D. DeCoste. Collaborative prediction using ensembles of maximum margin matrix factorizations. In Proc. ICML, pages 249–256, Pittsburgh, Pennsylvania, USA, 2006.
[6] F. Doshi-Velez, K. Miller, J. V. Gael, and Y. W. Teh. Variational inference for the Indian Buffet Process. In Proc. AISTATS, 2009.
[7] F. Doshi-Velez, K. T. Miller, J. V. Gael, and Y. W. Teh. Variational inference for the Indian buffet process. Technical Report CBL-2009-001, Computational and Biological Learning Laboratory, Department of Engineering, University of Cambridge, 2009.
[8] T. Griffiths and Z. Ghahramani. Infinite latent feature models and the Indian buffet process. Technical Report TR 2005-001, Gatsby Computational Neuroscience Unit, 2005.
[9] M. Harvey, M. J. Carman, I. Ruthven, and F. Crestani. Bayesian latent variable models for collaborative item rating prediction. In Proc. CIKM '11, pages 699–708, 2011.
[10] T. Hofmann. Latent semantic models for collaborative filtering. ACM Trans. Inf. Syst., 22(1):89–115, January 2004.
[11] M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. An introduction to variational methods for graphical models. In M. Jordan, editor, Learning in Graphical Models, pages 105–162. Kluwer, Dordrecht, 1998.
[12] Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, 2008.
[13] K. Yu, A. Schwaighofer, V. Tresp, X. Xu, and H.-P. Kriegel. Probabilistic memory-based collaborative filtering. IEEE Transactions on Knowledge and Data Engineering, 16(1):56–69, 2004.
[14] K. Yu, S. Zhu, J. D. Lafferty, and Y. Gong. Fast nonparametric matrix factorization for large-scale collaborative filtering. In Proc. ACM SIGIR, pages 211–218, Boston, MA, USA, 2009.
[15] N. D. Lawrence and R. Urtasun. Non-linear matrix factorization with Gaussian processes. In Proc. ICML, pages 601–608, Montreal, Quebec, Canada, 2009.
[16] B. Li, Q. Yang, and X. Xue. Can movies and books collaborate? Cross-domain collaborative filtering for sparsity reduction. In Proc. IJCAI 2009, pages 2052–2057.
[17] B. Li, Q. Yang, and X. Xue. Transfer learning for collaborative filtering via a rating-matrix generative model. In Proc. 26th ICML, 2009.
[18] B. Li, Q. Yang, and X. Xue. Transfer learning for collaborative filtering via a rating-matrix generative model. In Proc. ICML, pages 617–624, Montreal, Quebec, Canada, 2009.
[19] E. Meeds, Z. Ghahramani, R. M. Neal, and S. T. Roweis. Modeling dyadic data with binary latent factors. In Proc. NIPS, 2007.
[20] O. Moreno, B. Shapira, L. Rokach, and G. Shani. TALMUD – transfer learning for multiple domains. In Proc. CIKM '12, pages 425–434, 2012.
[21] P. Muller and F. Quintana. Nonparametric Bayesian data analysis. Statist. Sci., 19(1):95–110, 2004.
[22] S. Nakajima and M. Sugiyama. Theoretical analysis of Bayesian matrix factorization. J. Machine Learning Research, 12:2583–2648, 2011.
[23] W. Pan, E. W. Xiang, N. N. Liu, and Q. Yang. Transfer learning in collaborative filtering for sparsity reduction. In Proc. AAAI, pages 230–235, 2010.
[24] I. Porteous, A. Asuncion, and M. Welling. Bayesian matrix factorization with side information and Dirichlet process mixtures. In Proc. AAAI, 2010.
[25] J. D. M. Rennie and N. Srebro. Fast maximum margin matrix factorization for collaborative prediction. In Proc. ICML, pages 713–719, Bonn, Germany, 2005.
[26] R. Salakhutdinov and A. Mnih. Probabilistic matrix factorization. In Proc. NIPS, 2007.
[27] R. Salakhutdinov and A. Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In Proc. ICML'11, 2011.
[28] A. P. Singh and G. J. Gordon. Relational learning via collective matrix factorization. In Proc. 14th ACM SIGKDD, pages 650–658, Las Vegas, Nevada, USA, 2008.
[29] N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In Proc. NIPS, volume 17, pages 1329–1336, Vancouver, British Columbia, Canada, 2005.
[30] X. Su and T. Khoshgoftaar. A survey of collaborative filtering techniques. In Advances in Artificial Intelligence, 2009.
[31] Y. W. Teh, D. Gorur, and Z. Ghahramani. Stick-breaking construction for the Indian buffet process. In Proc. AISTATS, 2007.
[32] J. Winn and C. Bishop. Variational message passing. J. Machine Learning Research, 6:661–694, 2005.
[33] C. Yuan. Multi-task learning for Bayesian matrix factorization. pages 924–931, 2011.
[34] Y. Zhang, B. Cao, and D.-Y. Yeung. Multi-domain collaborative filtering. In Proc. UAI, pages 725–732, 2010.