RECOMMENDATIONS IN MOBILE AND PERVASIVE
BUSINESS ENVIRONMENTS
by
YONG GE
A Dissertation submitted to the
Graduate School-Newark
Rutgers, The State University of New Jersey
in partial fulfillment of the requirements
for the degree of
Doctor of Philosophy
Graduate Program in Management
written under the direction of
Dr. Hui Xiong
and approved by
Newark, New Jersey
May 2013
ABSTRACT OF THE DISSERTATION
RECOMMENDATIONS IN MOBILE AND PERVASIVE BUSINESS
ENVIRONMENTS
By YONG GE
Dissertation Director: Dr. Hui Xiong
Advances in mobile technologies have allowed us to collect and process massive
amounts of mobile data across many different mobile applications. If properly ana-
lyzed, this data can be a source of rich intelligence for providing real-time decision
making in various mobile applications and for the provision of mobile recommenda-
tions. Indeed, mobile recommendations constitute an especially important class of
recommendations because mobile users often find themselves in unfamiliar environments, a "new terrain" where they are overwhelmed by an abundance of unfamiliar
information and uncertain choices. Therefore, it is especially useful to equip them
with the tools and methods that will guide them through all these uncertainties by
providing useful recommendations while they are "on the move."
In this dissertation, we aim to address the unique challenges of recommendations
in mobile and pervasive business environments from both theoretical and practical
perspectives. Specifically, we first develop an energy-efficient mobile recommender
system that recommends a sequence of potential pick-up points for taxi drivers
by handling the complex data characteristics of real-world location traces. The developed mobile recommender system can provide effective mobile sequential recommendations, and the knowledge extracted from location traces can be used for coaching
drivers and lead to the efficient use of energy. Experiments on real-world
spatio-temporal data demonstrate the efficiency and effectiveness of our methods.
Moreover, we introduce a focused study of cost-aware collaborative filtering that is
able to address the cost constraint for travel tour recommendation. Specifically, we
present two ways to represent a user's latent cost preference and different cost-aware
collaborative filtering models for travel tour recommendations. We demonstrate that
the cost-aware recommendation models can consistently and significantly outperform
several existing latent factor models. In addition, we introduce a Tourist-Area-Season
Topic (TAST) model. This TAST model can represent travel packages and tourists
by different topic distributions, where the topic extraction is conditioned on both the
tourists and the intrinsic features (e.g., locations and travel seasons) of the landscapes.
Then, based on this topic model representation, we present a cocktail approach to
generate the lists for personalized travel package recommendation. When applied
to real-world travel tour data, the TAST model leads to better recommendation
performance. Finally, we introduce collective training to boost collaborative
filtering models. The basic idea is to complement the training data for a particular collaborative filtering model with the predictions of other models, and we develop
an iterative process in which the collaborative filtering models mutually boost one another.
ACKNOWLEDGEMENTS
I would like to express my great appreciation to all the people who provided me
with tremendous support and help during my Ph.D. study.
First, I would like to express my deep gratitude to my advisor, Prof. Hui Xiong, for
his continuous support, guidance and encouragement, which were essential for surviving
and thriving in graduate school and beyond. I thank him for generously giving me
motivation, support, time, assistance, opportunities and friendship, and for teaching me
how to identify key problems with impact and how to present and evaluate ideas. He helped
make me a better writer, speaker and scholar.
I also sincerely thank my other committee members: Prof. Alexander Tuzhilin,
Prof. Vijay Atluri and Prof. Xiaodong Lin. All of them not only provided constructive
suggestions and comments on my work and this thesis, but also offered generous support
and help with my career choices, and I am very grateful to them. Prof. Alexander
Tuzhilin has been a great professor to me over the past three years. His experience
and vision in recommender systems, data mining and personalization have inspired me
to tackle the challenging problems in my research, and I have learned a great deal
from collaborating with him on many exciting projects. I learned about modern database
systems and information security technology from Prof. Vijay Atluri's courses, and he
gave me much useful feedback and many suggestions during my PhD study.
Prof. Xiaodong Lin has provided many exciting discussions about my research and career
development, as well as friendship, during my PhD study.
Special thanks are due to Prof. Shashi Shekhar in the Department of Computer Science
at the University of Minnesota, Prof. Wenjun Zhou at the University of Tennessee and Dr.
Ramendra Sahoo at Citi for helping with my job search and career development. Thanks
are also due to Dr. Guofei Jiang, Dr. Ming Li, Dr. Milind Naphade, Dr. K.C. Lee,
Prof. Enhong Chen, Dr. Qi Liu, Prof. Zhi-hua Zhou, and Dr. Min Ding. It was a
great pleasure working with all of them. I also owe a hefty amount of thanks to my
colleagues and friends Zhongmou Li, Keli Xiao, Chuanren Liu, Hengshu Zhu, Yanchi
Liu, Chunyu Luo, Zijun Yao, Yanjie Fu, Konstantin Patev, Jingyuan Yang, Xue Bai,
Liyang Tang, and Chang Tan for their help, friendship and valuable suggestions.
I would like to acknowledge the Department of Management Science and Infor-
mation Systems (MSIS) and Center for Information Management, Integration and
Connectivity (CIMIC) for supplying me with the best imaginable equipment and
facilities that helped me to accomplish much of this work.
Finally, I would like to thank my wife, my daughter, my parents, and my brother
for their love, support and understanding. Without their encouragement and help,
this thesis would be impossible.
TABLE OF CONTENTS
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ii
ACKNOWLEDGEMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
LIST OF FIGURES. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi
CHAPTER 1. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Background and Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Mobile Recommender Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Research Motivation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Research Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.5 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
CHAPTER 2. MOBILE SEQUENTIAL RECOMMENDATION. . . . . . . . . . . . . . . 13
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.1 A General Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.2.2 Analysis of Computational Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2.3 The MSR Problem with Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 Recommending Point Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.3.1 High-Performance Drivers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3.2 Clustering Based on Driving Distance . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3.3 Probability Calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4 Sequential Recommendation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.4.1 The Potential Travel Distance Function . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4.2 The LCP Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4.3 The SkyRoute Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.4.4 Obtaining the Optimal Driving Route . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.4.5 The Recommendation Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.5.1 The Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.5.2 An Illustration of Optimal Driving Routes . . . . . . . . . . . . . . . . . . . . . . . 37
2.5.3 An Overall Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.5.4 A Comparison of Skyline Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
2.5.5 Case: Multiple Evaluation Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.6 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
CHAPTER 3. COST-AWARE COLLABORATIVE FILTERING FOR TRAVEL
TOUR RECOMMENDATIONS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
3.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.2.1 Collaborative Filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.2.2 Travel Recommendation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
3.2.3 Cost/Profit-based Recommendation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.3 Cost-aware PMF Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.3.1 The vPMF Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.3.2 The gPMF Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.3.3 The Computational Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.4 Cost-aware LPMF Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.4.1 The LPMF Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.4.2 The vLPMF Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.4.3 The gLPMF Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.5 Cost-aware MMMF Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.5.1 The MMMF Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
3.5.2 The vMMMF Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
3.5.3 The gMMMF Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
3.6 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.6.1 The Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.6.2 Collaborative Filtering Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
3.6.3 The Details of Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.6.4 Validation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.6.5 The Performance Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.6.6 The Performances with Different Values of α and D . . . . . . . . . . . . . . 84
3.6.7 The Performances on Different Users . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
3.6.8 The Learned User’s Cost Information . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.6.9 An Efficiency Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.7 Conclusion and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
CHAPTER 4. A COCKTAIL APPROACH FOR TRAVEL PACKAGE REC-
OMMENDATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.2 Concepts and Data Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.3 The TAST Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.3.1 Topic Model Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.3.2 Model Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
4.3.3 Area/Seasons Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
4.3.4 Related Topic Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
4.4 Cocktail Recommendation Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
4.4.1 Seasonal Collaborative Filtering for Tourists . . . . . . . . . . . . . . . . . . . . . 115
4.4.2 New Package Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
4.4.3 Collaborative Pricing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
4.4.4 Related Cocktail Recommendations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
4.5 The TRAST Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
4.6 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
4.6.1 The Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
4.6.2 Season Splitting and Price Segmentation . . . . . . . . . . . . . . . . . . . . . . . . 128
4.6.3 Understanding of Topics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
4.6.4 Recommendation Performances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
4.6.5 The Evaluation of the TRAST Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
4.6.6 Recommendation for Travel Groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
4.7 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
CHAPTER 5. COLLABORATIVE FILTERING WITH COLLECTIVE TRAIN-
ING . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
5.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
5.3 Collective Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145
5.3.1 The Bi-CF Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146
5.3.2 The Tri-CF Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
5.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
5.5 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
CHAPTER 6. CONCLUSIONS AND FUTURE WORK . . . . . . . . . . . . . . . . . . . . . . 156
BIBLIOGRAPHY. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
VITA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
LIST OF TABLES
1.1 An Example of Item-User Rating Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1 Some Acronyms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.2 A Comparison of Search Time (Second) between BFS and LCPS . . . . . 41
3.1 Some Characteristics of Travel Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
3.2 The Notations of 9 Collaborative Filtering Methods . . . . . . . . . . . . . . . . . 73
3.3 A Performance Comparison (10D Latent Features & α = 0.1) . . . . . . . . . 78
3.4 A Performance Comparison (30D Latent Features & α = 0.1) . . . . . . . . . 79
3.5 A Performance Comparison in terms of RMSE. . . . . . . . . . . . . . . . . . . . . . 80
3.6 A Performance Comparison (10D Latent Features & α = 0.3) . . . . . . . . . 85
3.7 A Performance Comparison (30D Latent Features & α = 0.3) . . . . . . . . . 86
3.8 The Performances on Different Users (10D Latent Features & α = 0.1) . 96
3.9 Performances with Tail Users/Packages (30D Latent Features & α = 0.1) 97
3.10 A Comparison of Variance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
3.11 A Comparison of the Model Efficiency (10D Latent Features) . . . . . . . . . 98
4.1 Mathematical notations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
4.2 The description of the training and test data. . . . . . . . . . . . . . . . . . . . . . . 126
4.3 A performance comparison: DOA(%). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
4.4 User study ratings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
4.5 Experimental results for K-means clustering. . . . . . . . . . . . . . . . . . . . . . . . 136
4.6 The recall results for Leave-Out-Rest (%). . . . . . . . . . . . . . . . . . . . . . . . . . 137
4.7 Group recommendation results: DOA(%). . . . . . . . . . . . . . . . . . . . . . . . . . . 139
5.1 A Sample Data Set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
5.2 RMSE Comparisons on MovieLens . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
LIST OF FIGURES
2.1 An Illustration Example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.2 Some Statistics of the Cab Data. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3 A Recommended Driving Route. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.4 Illustration: the Sub-route Dominance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.5 Illustration of the Circulating Mechanism. . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.6 Illustration: Optimal Driving Routes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.7 A Comparison of Search Time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.8 The Pruning Effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.9 A Comparison of Search Time (L = 3) on the Synthetic Data set. . . . . . 42
2.10 A Comparison of Skyline Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.11 A Comparison of Search Time for Multiple Optimal Driving Routes . . . 44
3.1 The Cost Distribution. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
3.2 Graphical Models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
3.3 A Performance Comparison in terms of CD (10D Latent Features). . . . . 81
3.4 A Local Performance Comparison in terms of CD (10D Latent Features). 82
3.5 A Performance Comparison in terms of CD (30D Latent Features). . . . . 83
3.6 A Local Performance Comparison in terms of CD (30D Latent Features). 84
3.7 Performances with Different α (10D Latent Features). . . . . . . . . . . . . . . . 87
3.8 Performances with Different D (α = 0.1). . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.9 The Performances on Different Users (10D Latent Features). . . . . . . . . . 89
3.10 An Illustration of User Financial Cost. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
3.11 An Illustration of the Gaussian Parameters of User Cost. . . . . . . . . . . . . 91
3.12 An Illustration of the Convergence of RMSEs (10D Latent Features). . . 93
4.1 An illustration of the chapter contribution. . . . . . . . . . . . . . . . . . . . . . . . . . 102
4.2 An example of the travel package, where the landscapes are represented
by the words in red. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
4.3 TAST: A graphical model. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
4.4 The three related topic models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
4.5 The cocktail recommendation approach. . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
4.6 The TRAST model and its two sub-models. . . . . . . . . . . . . . . . . . . . . . . . . 118
4.7 Season splitting and price segmentation. . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
4.8 The correlation of topic distributions between different price ranges
(Left)/different areas (Center)/different seasons(Right). Darker shades
indicate lower similarity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
4.9 A performance comparison based on Top-K. . . . . . . . . . . . . . . . . . . . . . . . . 131
4.10 The runtime results for different algorithms. . . . . . . . . . . . . . . . . . . . . . . . . 135
4.11 The precision results for Leave-Out-Rest (%). . . . . . . . . . . . . . . . . . . . . . . 137
5.1 RMSEs at Different Iterations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
CHAPTER 1
INTRODUCTION
Advances in sensor, wireless communication, and information infrastructure such as
GPS, WiFi, and mobile phone technology have enabled us to collect and process
massive amounts of location traces from multiple sources, often under operational time
constraints. These location traces are fine-grained, information-rich, and
have broad road coverage, and thus provide unparalleled opportunities to understand mobile user behaviors and generate useful knowledge, which in turn delivers
intelligence for real-time decision making in various fields, including that of mobile
recommendations. For example, recent years have witnessed a revolution in mobile
phone technology, which is driven by the development of the mobile Internet. Accord-
ing to the Telephia Mobile Internet Report, US had 34.6 million mobile web users as
of June 2006. While this is only 17% of total wireless phone subscribers, the penetration rate has been steadily increasing. As the mobile Internet keeps evolving, there
are clear signs that mobile pervasive recommendation will be in huge demand as
mobile application awareness continues to grow among mobile users. Mobile
pervasive recommendation promises to give mobile users access to personalized
recommendations anytime, anywhere. In order to keep this promise, an immediate
need is to understand the unique features that distinguish mobile recommendation
systems from classic recommender systems. Indeed, the objective of this disserta-
tion is to exploit the hidden information in location traces collected from multiple
application domains for developing mobile recommender systems.
1.1 Background and Preliminaries
Recent years have witnessed an increased interest in recommender systems (Hofmann,
1999; Resnick, Iacovou, Suchak, Bergstrom, & Riedl, 1994), especially after these
technologies were popularized by Amazon and Netflix, as well as after the establishment of the $1,000,000 Netflix Prize competition that attracted over 45,000 contestants
from 180 countries. A great deal of work has been done in both industry and academia on
developing new approaches to recommender systems over the last decade. In its most
common formulation, the recommendation problem is simplified to the problem of
estimating ratings for items that have not been rated by users. Intuitively, this estimation is usually based on the ratings given by this user to other items, the ratings
given by other users to the same item, and some other information (features) of the items.
Once the unknown ratings are estimated, we can simply recommend to
the user the item(s) with the highest rating(s).
More formally, the recommendation problem can be formulated as follows. Let C
be the set of all users and let S be the set of all possible items that can be recom-
mended, such as books, movies, or restaurants. The space S of possible items can be
very large, ranging in hundreds of thousands or even millions of items in some appli-
cations, such as recommending books or CDs. Similarly, the user space can also be
very large, reaching millions in some cases. Let u be a utility function that measures the usefulness
of item s to user c, i.e., u : C × S → R, where R is a totally ordered set. Then for each
user c ∈ C, we want to choose the item s′ ∈ S that maximizes the user's utility.
More formally:

∀c ∈ C,  s′_c = arg max_{s ∈ S} u(c, s)    (1.1)
In recommender systems the utility of an item is usually represented by a rating,
which indicates how a particular user liked a particular item.
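Formula (1.1) amounts to a simple argmax over the items whose utility is known for a given user. A minimal sketch of this selection step, with hypothetical users, items, and utility values, is the following:

```python
# A toy illustration of formula (1.1): for each user c, choose the item
# s' that maximizes the known utility u(c, s). All names and values here
# are hypothetical.
utilities = {
    ("Alice", "Batman"): 2, ("Alice", "The Godfather"): 1,
    ("Bob", "Batman"): 4, ("Bob", "The X-Files"): 2,
}

def best_item(user, utilities):
    """Return argmax over items s with known utility u(user, s)."""
    rated = {item: u for (c, item), u in utilities.items() if c == user}
    return max(rated, key=rated.get)
```

In practice the utility is only partially known, so this selection step must be preceded by the rating-estimation step discussed next.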
The central problem of recommender systems lies in the fact that the utility u is usually not
defined on the whole C × S space, but only on some subset of it. This means u needs to
be extrapolated to the whole space C×S. In recommender systems, utility is typically
represented by ratings and is initially defined only on the items previously rated by the
users. For example, in a movie recommendation application, users initially rate some
subset of movies that they have already seen. An example of a user-item rating matrix
for a movie recommendation application is presented in Table 1.1, where ratings are
specified on a scale of 1 to 5. The "NaN" symbol for some of the ratings in Table
1.1 means that the users have not rated those movies. Then, the recommendation
engine should be able to estimate/predict the unknown ratings and
decide on appropriate recommendations based on these predictions.
Extrapolations from known to unknown ratings are usually done by specific heuristics that exploit the known ratings for prediction and optimize a certain performance criterion, such as the mean squared error. Once the unknown ratings are estimated, actual recommendations of an item to a user are made by selecting the
item with the highest estimated rating for that user, according to formula 1.1.
Alternatively, we can recommend the N best items to a user, or a set of users to an item.
Table 1.1. An Example of Item-User Rating Matrix

                 Alice   Bob   Cindy   David
Rain Man           NaN   NaN       2       3
The X-Files          1     2     NaN       2
Batman               2     4       2       4
The Godfather        1     2     NaN       2

Note: NaN indicates an unknown rating.
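To make the estimation step concrete, the following sketch predicts one of the unknown ("NaN") entries of Table 1.1 with a simple user-based collaborative filtering heuristic: a similarity-weighted average of the ratings given by other users to the same item. The similarity measure used here (cosine over co-rated items) is one common choice among many, offered purely as an illustration rather than as the specific method studied in this dissertation.

```python
from math import sqrt

# The known ratings from Table 1.1, stored as user -> {item: rating}.
ratings = {
    "Alice": {"The X-Files": 1, "Batman": 2, "The Godfather": 1},
    "Bob":   {"The X-Files": 2, "Batman": 4, "The Godfather": 2},
    "Cindy": {"Rain Man": 2, "Batman": 2},
    "David": {"Rain Man": 3, "The X-Files": 2, "Batman": 4, "The Godfather": 2},
}

def cosine(u, v):
    """Cosine similarity computed over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    num = sum(u[i] * v[i] for i in common)
    den = sqrt(sum(u[i] ** 2 for i in common)) * sqrt(sum(v[i] ** 2 for i in common))
    return num / den

def predict(user, item):
    """Similarity-weighted average of the neighbors' ratings for `item`."""
    neighbors = [(cosine(ratings[user], r), r[item])
                 for c, r in ratings.items() if c != user and item in r]
    total = sum(s for s, _ in neighbors)
    return sum(s * r for s, r in neighbors) / total if total else None
```

For instance, Alice's unknown rating of Rain Man is predicted from Cindy's and David's ratings of that movie, weighted by how similarly they rated the movies Alice has already rated.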
The unknown ratings can be estimated in many different ways
using methods from machine learning, approximation theory and various heuristics. Recommender systems are usually classified according to their approach to rating
estimation. In the following, we will present a classification that was proposed in the
literature. The commonly accepted formulation of the recommendation problem was
first stated in (Resnick et al., 1994; Shardanand & Maes, 1995; Hill, Stead, Rosen-
stein, & Furnas, 1995) and this problem has been studied extensively since then.
Moreover, recommender systems are usually classified into the following categories,
based on how recommendations are made:
• Content-based recommendations: the user is recommended items similar to the
ones the user preferred in the past;
• Collaborative recommendations: the user is recommended items that people
with similar tastes and preferences liked in the past;
• Hybrid approaches: these methods combine collaborative and content-based
methods.
In addition to recommender systems that predict the absolute values of ratings
that individual users would give to the unseen items, there has been work done
on preference-based filtering, i.e., predicting the relative preferences of users (Iyer,
Jr., Karger, & Smith, 1998). For example, in a movie recommendation application,
preference-based filtering techniques would focus on predicting the correct relative
order of the movies, rather than their individual ratings.
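As a toy illustration of preference-based filtering (with hypothetical data, not a method from this dissertation), a relative order can be derived directly from observed pairwise preferences, for example by counting how often each item wins a comparison:

```python
from collections import Counter

# Observed pairwise preferences for one user, as (preferred, other) pairs.
pairwise = [("Batman", "Rain Man"), ("Batman", "The Godfather"),
            ("Rain Man", "The Godfather"), ("Batman", "Rain Man")]

def rank(pairs):
    """Order items by the number of pairwise comparisons they win."""
    wins = Counter(preferred for preferred, _ in pairs)
    items = {i for pair in pairs for i in pair}
    return sorted(items, key=lambda item: -wins[item])
```

Note that this recovers an ordering of the movies without ever assigning them absolute ratings.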
1.2 Mobile Recommender Systems
Recommender systems in mobile environments have become a promising area with the
advanced development of mobile devices and technologies, such as GPS and WiFi, and the increasing
demand of users for mobile applications, such as travel planning and location-based
shopping. A great deal of work has already been done in both industry and academia on
developing new systems and applications in recent years. Typically, mobile recommender systems are systems that provide assistance/guidance to users as they face
decisions 'on the go', or, in other words, as they move into new, unknown environments.
Different from traditional recommendation techniques, mobile recommendation is
unique in its location-aware capability. Mobile computing adds a relevant but mostly
unexplored piece of information, the user's physical location, to the recommendation
problem. For example, a mobile shopping recommender system could analyze the
shopping histories of users at different locations, together with a user's current position, to
make recommendations for that particular user. Another example would be recommendations for tourists or travelers: this kind of mobile recommender system could analyze
the historical data of various tourists or travelers to recommend travel routes that
meet the demands/preferences of a particular user.
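The location-aware step can be sketched as follows (all shops, coordinates, and scores below are hypothetical): compute the great-circle distance from the user's current position to each candidate, keep only the candidates within a given radius, and rank those by a score produced by any conventional recommender.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # 6371 km: mean Earth radius

def nearby_recommendations(user_pos, candidates, radius_km, scores):
    """Filter candidates by distance, then rank by a precomputed score."""
    close = [c for c, pos in candidates.items()
             if haversine_km(*user_pos, *pos) <= radius_km]
    return sorted(close, key=lambda c: -scores[c])
```

The design choice here is deliberate: location acts as a hard filter on the candidate set, while the conventional (e.g., rating-based) score determines the order among the survivors.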
1.3 Research Motivation
However, the development of personalized recommender systems in mobile and pervasive environments is much more challenging than developing recommender systems
in traditional domains, due to the complexity of spatial data and intrinsic spatio-temporal relationships, the unclear roles of context-aware information, the lack of user
rating information, and the diversified location-sensitive recommendation tasks. As a
matter of fact, recommender systems in mobile environments have been studied
before. For instance, some prior work targets the development of mobile tourist guides.
Also, Heijden et al. have discussed some technological opportunities associated with
mobile recommendation systems. In addition, Averjanova et al. have developed a
map-based mobile recommender system that can provide users with some personal-
ized recommendations. However, this prior work is mostly based on user ratings and
is only exploratory in nature, and the problem of leveraging unique features distin-
guishing mobile recommender systems remains pretty much open.1 Indeed, there are
a number of technical and domain challenges inherent in designing and implementing
an effective mobile recommend system in pervasive environments. First, the het-
erogeneous and noisy nature of mobile environments makes the data more complex
than traditional commercial item data, such as the Movie data. Location traces are
spatio-temporal data in nature. Spatial data have spatial autocorrelation and follow
the first law of geography - everything is related to everything else, but the nearby
- 7 -
things are more related than distant things. The challenge lies in how to effectively
extract recommendable knowledge from location traces not being affected by these
data characteristics. Second, traditional recommender systems usually rely on the
user ratings for validation. However, in mobile application domains, the user rat-
ings are usually not conveniently available. Therefore, it becomes a real challenge
to develop alternative evaluation metrics and recommendation techniques for mobile
recommender systems. Third, recommendation techniques developed in traditional
recommendation systems may only be tangentially applicable to mobile recommender
systems, and a new set of methods needs to be developed instead. In addition, it is not clear whether the mobile recommendation techniques developed in one application domain can be easily adapted for building a mobile recommender system in
a different application domain. Therefore, it is important to identify the commonality
and diversity among different types of mobile recommender systems. Fourth, in tradi-
tional recommender systems, it is usually not necessary to consider the corresponding
cost of taking a recommendation. For instance, the cost of watching a recommended
movie is usually not a concern for any user. However, in mobile recommender systems
for tourists, the users may have various time and price constraints to select among
different recommended travel plans. Finally, the recommended items in traditional
recommender systems usually have a stable value. However, in many mobile recom-
mender systems, the values of the items to be recommended can be depreciated over
time. Moreover, some mobile items have a life cycle. For instance, a tour package can
only last for a certain period. The travel agents need to actively create new tour
packages to replace old tour packages based on the interests of the customers.
1.4 Research Contributions
In this dissertation, we study the unique characteristics of mobile recommender sys-
tems and demonstrate how to develop mobile recommender systems in different ap-
plication domains. Generally, the proposed research has the following major thrusts:
• Investigating the impact of the unique characteristics of mobile data on the de-
velopment of mobile recommender systems. To this end, we will exploit mobile
data from different application domains and develop two mobile recommender
systems, an energy-efficient mobile recommender system and a mobile recommender system for targeting tourists. In addition, as a unique challenge in the development of mobile recommender systems, the issues related to location privacy will also be taken into consideration.
• Development of novel approaches to mobile recommender systems that work for
the applications and data described above. Since these applications and data
are significantly different from each other, we also plan to understand common-
ality and diversity across different mobile recommendation techniques. The goal
is to demonstrate the design and implementation issues of mobile recommender
systems in different application settings. In particular, we will also design and
evaluate effective evaluation metrics for mobile recommender systems. Although the key differences between traditional recommender systems and mobile
recommender systems are known, we will explore them further and at a deeper
level in this project.
Specifically, we first provide a focused study of extracting energy-efficient transportation patterns from location traces, with an initial focus on mobile sequential recommendation. As a case study, we develop a mobile recommender system that can recommend a sequence of pick-up points for taxi drivers or a sequence of potential parking positions. The goal of this mobile
recommendation system is to maximize the probability of business success. Along
this line, we provide a Potential Travel Distance (PTD) function for evaluating each
candidate sequence. This PTD function possesses a monotone property which can be
used to effectively prune the search space. Based on this PTD function, we develop
two algorithms, LCP and SkyRoute, for finding the recommended routes. Experi-
mental results show that the proposed system can provide effective mobile sequential
recommendation and the knowledge extracted from location traces can be used for
coaching drivers and leading to the efficient use of energy.
Second, we provide another focused study of cost-aware travel tour recommendation. We first propose two ways to represent a user's cost preference. One way is to represent the cost preference as a two-dimensional vector. The other way is to consider the uncertainty about the cost that a user can afford and introduce a Gaussian
prior to model user’s cost preference. With these two ways of representation of user’s
cost preference, we develop different cost-aware latent factor models by incorporating
the cost information into the Probabilistic Matrix Factorization (PMF) model, the
Logistic Probabilistic Matrix Factorization (LPMF) model, and the Maximum Margin Matrix Factorization (MMMF) model, respectively. When applied to real-world travel tour data, all of the cost-aware recommendation models consistently outperform existing latent factor models by a significant margin.
Third, we introduce a Tourist-Area-Season Topic (TAST) model to address further challenges of travel package recommendation. The TAST model can represent travel packages and tourists by different topic distributions, where the topic extraction is conditioned on both the tourists and the intrinsic features (i.e., locations and travel seasons) of the landscapes. Then, based on this topic model representation, we propose a cocktail approach to generate lists for personalized travel package recommendation. Furthermore, we extend the TAST model to the Tourist-Relation-Area-Season
Topic (TRAST) model for capturing the latent relationships among the tourists in
each travel group. Finally, we evaluate the TAST model, the TRAST model, and
the cocktail recommendation approach on the real-world travel package data. Ex-
perimental results show that the TAST model can effectively capture the unique
characteristics of the travel data and the cocktail approach is thus much more effec-
tive than traditional recommendation techniques for travel package recommendation.
Also, by considering tourist relationships, the TRAST model can be used as an effec-
tive assessment for travel group formation.
Finally, we introduce a collective training paradigm to address the sparseness issue
of recommendations by automatically and effectively augmenting the training ratings.
Essentially, the collective training paradigm builds multiple different Collaborative
Filtering (CF) models separately, and augments the training ratings of each CF model
by using the partial predictions of other CF models for unknown ratings. Along this
line, we develop two algorithms, Bi-CF and Tri-CF, based on collective training. For
Bi-CF and Tri-CF, we collectively and iteratively train two and three different CF
models by iteratively augmenting the training ratings for each individual CF model. We also design different criteria to guide the selection of augmented training ratings for Bi-CF and Tri-CF. The experimental results show that the Bi-CF and Tri-CF algorithms
can significantly outperform baseline methods, such as neighborhood-based and SVD-
based models.
1.5 Overview
Chapter 2 addresses the computational challenge embedded in mobile sequential recommendation with GPS data. Two types of algorithms are introduced to efficiently search for the optimal driving route and recommend it to users.
Chapter 3 presents different types of cost-aware collaborative filtering models for
travel package recommendation. Two different ways are introduced to represent the
user's cost preference. The Probabilistic Matrix Factorization (PMF) model, the Logistic Probabilistic Matrix Factorization (LPMF) model, and the Maximum Margin Matrix Factorization (MMMF) model are considered and extended with the cost information.
Experimental results with real world data are presented to validate the effectiveness
of cost-aware models.
Chapter 4 presents two types of topic models (i.e., TAST and TRAST) based on
the LDA model to address the analytical challenges of travel package data. A hybrid
recommendation framework is presented based on the topic models to produce the
recommendation results. Empirical comparisons with real-world data are presented to show the performance of different recommendation methods.
Chapter 5 presents a collective training paradigm to address the sparseness issue
of recommendation. This collective training complements the training data for one collaborative filtering model by effectively leveraging the predictions of other models. An iterative process is introduced to mutually complement the training data for each collaborative filtering model.
CHAPTER 2
MOBILE SEQUENTIAL RECOMMENDATION
The increasing availability of large-scale location traces creates unprecedented opportunities to change the paradigm for knowledge discovery in transportation systems.
A particularly promising area is to extract energy-efficient transportation patterns
(green knowledge), which can be used as guidance for reducing inefficiencies in en-
ergy consumption of transportation sectors. However, extracting green knowledge
from location traces is not a trivial task. Conventional data analysis tools are usually
not customized for handling the massive quantity, complex, dynamic, and distributed
nature of location traces. To that end, in this chapter, we provide a focused study of
extracting energy-efficient transportation patterns from location traces. Specifically,
we have the initial focus on a sequence of mobile recommendations. As a case study,
we develop a mobile recommender system which has the ability in recommending a
sequence of pick-up points for taxi drivers or a sequence of potential parking posi-
tions. The goal of this mobile recommendation system is to maximize the probability
of business success. Along this line, we provide a Potential Travel Distance (PTD)
function for evaluating each candidate sequence. This PTD function possesses a
monotone property which can be used to effectively prune the search space. Based on
this PTD function, we develop two algorithms, LCP and SkyRoute, for finding the
recommended routes. Finally, experimental results show that the proposed system
can provide effective mobile sequential recommendation and the knowledge extracted
from location traces can be used for coaching drivers and leading to the efficient use
of energy.
2.1 Introduction
Advances in sensor, wireless communication, and information infrastructures such as
GPS, WiFi and RFID have enabled us to collect large amounts of location traces (trajectory data) of individuals or objects. Such a large number of trajectories provides us with an unprecedented opportunity to automatically discover useful knowledge, which in turn delivers intelligence for real-time decision making in various fields, such as mobile
recommendations. Indeed, a mobile recommender system promises to provide mo-
bile users access to personalized recommendations anytime, anywhere. To this end,
an important task is to understand the unique features that distinguish pervasive
personalized recommendation systems from classic recommender systems.
Recommender systems (Adomavicius & Tuzhilin, 2005) address the information overload problem by identifying user interests and providing personalized suggestions. In general, there are three ways to develop recommender systems. The first one
is content-based (Mooney & Roy, 1999). It suggests items which are similar to those
a given user has liked in the past. The second way is based on collaborative filtering.
In other words, recommendations are made according to the tastes of other users that
are similar to the target user. Finally, a third way is to combine the above and have
a hybrid solution (Pazzani, 1999). However, the development of personalized recom-
mender systems in mobile and pervasive environments is much more challenging than
developing recommender systems from traditional domains due to the complexity of
spatial data and intrinsic spatio-temporal relationships, the unclear roles of context-
aware information, and the increasing availability of environment sensing capabilities.
Recommender systems in the mobile environments have been studied before (Abowd,
Atkeson, & al, 1997; Averjanova, Ricci, & Nguyen, 2008; Cena et al., 2006; Chev-
erst, Davies, & al, 2000; Miller, Albert, & al, 2003; Tveit, 2001; Heijden, Kotsis,
& Kronsteiner, 2005). For instance, the work in (Abowd et al., 1997; Cena et al.,
2006) targets the development of mobile tourist guides. Also, Heijden et al. have
discussed some technological opportunities associated with mobile recommendation
systems (Heijden et al., 2005). In addition, Averjanova et al. have developed a
map-based mobile recommender system that can provide users with some personal-
ized recommendations (Averjanova et al., 2008). However, this prior work is mostly
based on user ratings and is only exploratory in nature, and the problem of leveraging the unique features that distinguish mobile recommender systems remains largely open.
In this chapter, we exploit the knowledge extracted from location traces and de-
velop a mobile recommender system based on business success metrics instead of
predictive performance measures based on user ratings. Indeed, the key idea is to
leverage the business knowledge from the historical data of successful taxi drivers for
helping other taxi drivers improve their business performance. Along this line, we
provide a pilot feasibility study of extracting business-success knowledge from the location traces of taxi drivers and exploiting this business information for guiding taxis' driving routes. Specifically, we first extract a group of successful taxi drivers based on
their past performances in terms of revenue per energy use. Then, we can cluster the
pick-up points of these taxi drivers for a certain time period. The centroids of these
clusters can be used as the recommended pick-up points with a certain probability of
success for new taxi drivers in these areas. This problem can be formally defined as
a mobile sequential recommendation problem, which recommends sequential pick-up
points for a taxi driver to maximize his/her business success. Essentially, a key challenge of this problem is that the computational cost increases dramatically with the number of pick-up points, since the problem is combinatorial in nature.
To that end, we provide a Potential Travel Distance (PTD) function for evaluating
each candidate route. This PTD function possesses a monotone property which can be
used to effectively prune the search space and generate a small set of candidate routes.
Indeed, we have developed a route recommendation algorithm, named LCP , which
exploits the monotone property of the PTD function. In addition, we observe that
many candidate routes can be dominated by skyline routes (S.Borzsonyi, K.Stocker,
& D.Kossmann, 2001), and thus can be pruned by skyline computing. However,
traditional skyline computing algorithms are not efficient for querying the skyline of all candidate routes because this leads to an expensive network traversal process. Thus, we propose the SkyRoute algorithm to compute the skyline of candidate routes. An advantage of searching for the optimal driving route through skyline computing is that it saves total online processing time when different optimal driving routes, defined by different business needs, must be provided.
Finally, the extensive experiments on real-world location traces of 500 taxi drivers
show that both the LCP and SkyRoute algorithms outperform the brute-force method by a significant margin. Also, SkyRoute performs much better than traditional skyline computing methods (S.Borzsonyi et al., 2001). Moreover, we show that, if there is an online demand for different evaluation criteria, SkyRoute yields better performance than LCP. However, if there is only one evaluation criterion, LCP performs best.
2.2 Problem Formulation
In this section, we formulate the problem of mobile sequential recommendation (MSR).
2.2.1 A General Problem Formulation
Consider a scenario in which a large number of GPS traces of taxi drivers have been collected over a period of time. In this collection of location traces, we also have information about when a cab is empty or occupied. From this data set, it is possible to first identify a group of taxi drivers who are very successful in business. Then, we can cluster the pick-up points of these taxi drivers for a certain time period. The centroids of these clusters can be used as recommended pick-up points, with a certain probability of success, for new taxi drivers in these areas. Then, a mobile sequential recommendation problem can be formulated as follows.
Assume that a set of $N$ potential pick-up points, $\mathcal{C} = \{C_1, C_2, \cdots, C_N\}$, is available. Also, the estimated probability that a pick-up event happens at each pick-up point is known as $P(C_i)$, where the $P(C_i)$ $(i = 1, \cdots, N)$ are assumed to be independently distributed. Let $\mathcal{P} = \{P(C_1), P(C_2), \cdots, P(C_N)\}$ denote the probability set. In addition, let $\vec{\mathcal{R}} = \{\vec{R}_1, \vec{R}_2, \cdots, \vec{R}_M\}$ be the set of all directed sequences (potential driving routes) generated from $\mathcal{C}$, where $|\vec{\mathcal{R}}| = M$ is the size of $\vec{\mathcal{R}}$, i.e., the number of all possible driving routes. Note that the pick-up points in each directed sequence are assumed to be different from each other. Next, let $L_{\vec{R}_i}$ be the length of route $\vec{R}_i$ $(1 \leq i \leq M)$, where $1 \leq L_{\vec{R}_i} \leq N$. Finally, for a directed sequence $\vec{R}_i$, let $\mathcal{P}_{\vec{R}_i}$ be the route probability set, which contains the probabilities of all pick-up points in $\vec{R}_i$; $\mathcal{P}_{\vec{R}_i}$ is a subset of $\mathcal{P}$.
Figure 2.1. An Illustration Example. (The figure shows a cab T at position PoCab, pick-up points $C_1, \ldots, C_4$ with pick-up probabilities $P(C_1), \ldots, P(C_4)$, and the driving distances between them, e.g., $D(C_4 \rightarrow C_3)$.)
The objective of this MSR problem is to recommend a travel route for a cab driver such that the potential travel distance before having a customer is minimized. Let $F$ be the function for computing the Potential Travel Distance (PTD) before having a customer. The PTD can be denoted as $F(PoCab, \vec{R}, \mathcal{P})$. In other words, the computation of the PTD depends on the current position of the cab ($PoCab$), a suggested sequence of pick-up points ($\vec{R}$), and the corresponding probabilities associated with all recommended pick-up points.

Based on the above definitions and notations, we can formally define the problem as follows:
The MSR Problem

Given: a set of potential pick-up points $\mathcal{C}$ with $|\mathcal{C}| = N$, a probability set $\mathcal{P} = \{P(C_1), P(C_2), \cdots, P(C_N)\}$, a directed sequence set $\vec{\mathcal{R}}$ with $|\vec{\mathcal{R}}| = M$, and the current position ($PoCab$) of a cab driver who needs the service.

Objective: recommend an optimal driving route $\vec{R}$ ($\vec{R} \in \vec{\mathcal{R}}$). The goal is to minimize the PTD:

$$\min_{\vec{R}_i \in \vec{\mathcal{R}}} F(PoCab, \vec{R}_i, \mathcal{P}_{\vec{R}_i}) \qquad (2.1)$$
The MSR problem involves the recommendation of a sequence of pick-up points and is combinatorial in nature. However, this problem is practically important and interesting, since solving it helps to improve the business performance of taxi companies, the efficient use of energy, the productivity of taxi drivers, and the user experience.
The MSR problem is different from the traditional Traveling Salesman Problem (TSP) (Applegate, Bixby, & al, 2006), which finds a shortest path that visits each given location exactly once. The reason is that TSP evaluates orderings of exactly the $N$ given locations; in other words, all $N$ locations have to be involved. In contrast, the proposed MSR problem is to find a subset of the given $N$ locations for recommendation. Also, the MSR problem is different from the traditional scheduling problem (Dell'Amico, Fischetti, & Toth, 1993; Portugal, Lourenço, & Paixão, 2009), which selects a set of duties for vehicle drivers. The reason is that all such duties are determined in advance, such as delivering packages to fixed locations, while the MSR problem consists of uncertain pick-up jobs among several locations.
Figure 2.1 shows an illustrative example. In the figure, for a cab T, the closest pick-up point is $C_1$. However, we cannot simply recommend $C_1$ as the first stop in the recommended sequence, even if the probability of having a customer at $C_1$ is greater than at $C_4$, which is the second closest to T. The reason is that there is still a probability that the cab driver cannot find a customer at $C_1$, and then it will cost much more to go to the next pick-up point. Instead, if T goes to $C_4$ first, T might be able to exploit a sequence of pick-up opportunities.

For the MSR problem, there are two major challenges. First, how do we find reliable pick-up points from the historical data and estimate the probability of success at each pick-up point? Second, there is a computational challenge in searching for an optimal route.
2.2.2 Analysis of Computational Complexity
Here, we analyze the computational complexity of the MSR problem. A brute-force method for searching for the optimal recommended route has to check all possible sequences in $\vec{\mathcal{R}}$. If we assume that the cost of computing the function $F$ once is 1 ($Cox(F) = 1$), the complexity of searching a given set $\mathcal{C}$ with $N$ pick-up points is as follows.

Lemma 1. Given a set of pick-up points $\mathcal{C}$, where $|\mathcal{C}| = N$, $1 \leq L_{\vec{R}_i} \leq N$ and $Cox(F) = 1$, the complexity of searching for an optimal directed sequence in $\vec{\mathcal{R}}$ is $O(N!)$.

Proof. The complexity of searching for an optimal sequence is equal to the total number $M$ of all possible sequences generated from $\mathcal{C}$. Since every directed sequence is a permutation of the pick-up points of some subset of $\mathcal{C}$, we decompose the counting into two steps: enumeration of a non-empty subset $B$ of $\mathcal{C}$, and permutation of the pick-up points belonging to $B$. For subsets $B$ with $i$ different pick-up points, there are $\binom{N}{i}$ different subsets, where $1 \leq i \leq N$. For each subset $B$ of $i$ different elements, there are $i!$ different permutations. Thus the total number of all possible directed sequences generated from $\mathcal{C}$ is
$$M = \sum_{i=1}^{N} \binom{N}{i} \cdot i! = N! \sum_{k=0}^{N-1} \frac{1}{k!} < e \cdot N! < 3 \cdot N!.$$
Thus, we have $2 \cdot N! < M < 3 \cdot N!$ for $N \geq 3$. Therefore, the complexity of searching for an optimal directed sequence is $O(N!)$.
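The counting argument in the proof can be checked directly for small $N$; the following is a short illustrative sketch that enumerates every ordered non-empty subset of pick-up points and compares the count with the closed form:

```python
from itertools import permutations
from math import comb, factorial

def count_routes(points):
    """Count all directed sequences (ordered non-empty subsets) of pick-up points."""
    n = len(points)
    total = 0
    for length in range(1, n + 1):
        for _ in permutations(points, length):  # ordered subsets of this length
            total += 1
    return total

n = 3
m = count_routes(list(range(n)))
closed_form = sum(comb(n, i) * factorial(i) for i in range(1, n + 1))
print(m, closed_form)  # 15 15
```

For $N = 3$ both counts are 15, which indeed lies between $2 \cdot 3! = 12$ and $3 \cdot 3! = 18$.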
2.2.3 The MSR Problem with Constraints
As illustrated above, it is computationally prohibitive to search for the optimal solution of the general MSR problem. Therefore, from a practical perspective, we consider a simplified version of the MSR problem. Specifically, we put a constraint on the length of a recommended route $L_{\vec{R}_i}$. In other words, the length of a recommended route is set to be a constant; that is, $L_{\vec{R}_i} = L$. To simplify the discussion, let $\vec{R}^L_i$ denote a recommended route with a length of $L$. Based on this constraint, we can simplify the original objective function of the MSR problem as follows.
The MSR Problem with a Length Constraint

Objective: recommend an optimal sequence $\vec{R}^L$ ($\vec{R}^L \in \vec{\mathcal{R}}$). The goal is to minimize the PTD:
$$\min_{\vec{R}^L_i \in \vec{\mathcal{R}}} F(PoCab, \vec{R}^L_i, \mathcal{P}_{\vec{R}^L_i})$$

The computational complexity of this simplified MSR problem is analyzed as follows.

Lemma 2. Given $|\mathcal{C}| = N$, $L_{\vec{R}_i} = L$ and $Cox(F) = 1$, the computational complexity of searching for an optimal directed sequence of length $L$ in $\vec{\mathcal{R}}$ is $O(N^L)$.

Proof. Since the length of the recommended route is fixed, the count in the proof of Lemma 1 becomes $M = \binom{N}{L} \cdot L!$, where $M$ is the number of all sequences of length $L$. $M$ can be rewritten as $N(N-1)\cdots(N-L+1)$. Thus, the computational complexity of this problem is $O(N^L)$.
The above shows that the computational cost of this simplified MSR problem still increases dramatically as the number of pick-up points $N$ increases. In this chapter,
we focus on studying the MSR problem with a length constraint.
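As a correctness baseline for the length-constrained problem, the $O(N^L)$ search space can be enumerated directly. The sketch below is illustrative only: the point names and the cost function are hypothetical stand-ins for the PTD function $F$, which is defined later in this chapter.

```python
from itertools import permutations

def brute_force_msr(points, ptd, length):
    """Exhaustively check all length-`length` directed sequences of pick-up
    points and return the one minimizing the cost function `ptd`.

    `ptd` is any callable mapping a candidate route (a tuple of points) to a
    potential travel distance. There are O(N^L) candidates, so this only
    scales to small instances and serves as a correctness baseline.
    """
    return min(permutations(points, length), key=ptd)

# Toy stand-in cost: a route's "PTD" is the sum of hypothetical point costs.
costs = {"C1": 3.0, "C2": 1.0, "C3": 2.0}
best = brute_force_msr(list(costs), lambda route: sum(costs[c] for c in route), 2)
print(best)  # ('C2', 'C3'), the two cheapest points
```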
2.3 Recommending Point Generation
In this section, we show how to generate the recommended pick-up points and how to compute the probability of a pick-up event at each point from the location traces of cab drivers.
2.3.1 High-Performance Drivers
In the real world, there are always high-performance, experienced cab drivers, who typically have sufficient driving hours and higher customer occupancy rates (the percentage of driving time with customers). For example, Figures 2.2(a) and (b) show the distributions of driving hours and occupancy rates of more than 500 drivers in San Francisco over a period of about 30 days. In the figures, we can clearly see that the drivers differ in performance in terms of occupancy rates. Based on this observation, we first extract a group of high-performance drivers with sufficient driving hours and high occupancy rates. The past pick-up records of these selected drivers are used for the generation of potential pick-up points for recommendation.
Figure 2.2. Some Statistics of the Cab Data: (a) the frequency distribution of driving hours; (b) the frequency distribution of occupancy rates.
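The driver-selection step described above amounts to simple threshold filtering. The sketch below uses hypothetical driver records and illustrative thresholds; the actual cutoffs used in the study are not specified here.

```python
def select_high_performers(drivers, min_hours=100.0, min_occupancy=0.4):
    """Return drivers with sufficient driving hours and a high occupancy rate.

    `drivers` maps driver id -> (driving_hours, occupancy_rate); both
    thresholds are illustrative defaults, not the study's actual values.
    """
    return {d for d, (hours, occ) in drivers.items()
            if hours >= min_hours and occ >= min_occupancy}

drivers = {
    "cab_1": (320.0, 0.55),   # experienced, high occupancy
    "cab_2": (45.0, 0.60),    # too few driving hours
    "cab_3": (210.0, 0.25),   # low occupancy
}
print(select_high_performers(drivers))  # {'cab_1'}
```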
2.3.2 Clustering Based on Driving Distance
After carefully observing the historical pick-up points of high-performance drivers, we notice that there are relatively more pick-up events in some places than in others. In other words, there is a cluster effect among historical pick-up points. Therefore, we propose to cluster the historical pick-up points of high-performance drivers into $N$ clusters. The centroids of these clusters are used as the recommended pick-up points. For this clustering, we use driving distance rather than Euclidean distance as the distance measure. In this study, we perform clustering based on driving distance during different time periods in order to obtain recommended pick-up points for different time periods. Another benefit of clustering historical pick-up points is that it dramatically reduces the computational cost of the MSR problem.
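Because driving distance is not Euclidean, and each cluster center should be an actual reachable point on the road network, one natural way to implement this step is k-medoids clustering over a precomputed driving-distance matrix (assumed to come from a road network). The following is a minimal sketch, not the exact procedure used in the study:

```python
import random

def k_medoids(dist, k, n_iter=50, seed=0):
    """Cluster points given a symmetric driving-distance matrix `dist`.

    Returns the indices of the k medoids; each medoid is an actual
    historical pick-up point, which keeps centers on the road network.
    """
    rng = random.Random(seed)
    n = len(dist)
    medoids = rng.sample(range(n), k)
    for _ in range(n_iter):
        # Assign each point to its nearest medoid by driving distance.
        clusters = {m: [] for m in medoids}
        for p in range(n):
            nearest = min(medoids, key=lambda m: dist[p][m])
            clusters[nearest].append(p)
        # Re-pick each medoid as the member minimizing within-cluster distance.
        new_medoids = [min(members, key=lambda c: sum(dist[c][q] for q in members))
                       for members in clusters.values() if members]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return medoids

# Toy symmetric driving-distance matrix: points {0,1} and {2,3} form two groups.
dist = [
    [0, 1, 10, 10],
    [1, 0, 10, 10],
    [10, 10, 0, 1],
    [10, 10, 1, 0],
]
print(sorted(k_medoids(dist, 2)))  # [0, 2]
```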
2.3.3 Probability Calculation
For each recommended pick-up point (the centroid of a historical pick-up cluster), the probability of a pick-up event can be computed from the historical pick-up data. The idea is to measure how frequently pick-up events happen when cabs travel across each pick-up cluster. Specifically, we first obtain the spatial coverage of each cluster. Then, let $\#T$ denote the number of cabs that have no customer before passing through a cluster. For these $\#T$ empty cabs, the number of pick-up events $\#P$ is counted in this cluster. Finally, the probability of a pick-up event for each cluster (each recommended pick-up point) can be estimated as
$$P(C_i) = \frac{\#P}{\#T}, \quad 1 \leq i \leq N,$$
where $\#P$ and $\#T$ are recorded for each historical pick-up cluster at different time periods.
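The estimate $P(C_i) = \#P / \#T$ can be computed with a single pass over a log of cluster-pass events; the event format below is a hypothetical simplification of the raw traces:

```python
def pickup_probabilities(events):
    """Estimate P(C_i) = #P / #T per cluster.

    `events` is a list of (cluster_id, picked_up) pairs, one per empty-cab
    pass through a cluster's spatial coverage (a simplified event log).
    """
    passes, pickups = {}, {}
    for cluster, picked_up in events:
        passes[cluster] = passes.get(cluster, 0) + 1      # counts #T
        if picked_up:
            pickups[cluster] = pickups.get(cluster, 0) + 1  # counts #P
    return {c: pickups.get(c, 0) / t for c, t in passes.items()}

events = [("C1", True), ("C1", False), ("C1", False), ("C2", True), ("C2", True)]
print(pickup_probabilities(events))  # {'C1': 0.333..., 'C2': 1.0}
```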
2.4 Sequential Recommendation
In this section, we design mobile sequential recommendation algorithms that search for the optimal route for recommendation.
Figure 2.3. A Recommended Driving Route. (The route runs from the cab position PoCab through pick-up points $C_1, C_2, C_3, C_4$, with leg distances $D_1, \ldots, D_4$ and pick-up probabilities $P(C_1), \ldots, P(C_4)$.)
2.4.1 The Potential Travel Distance Function
First, we introduce the Potential Travel Distance (PTD) function, which will be exploited in the algorithm design. To simplify the discussion, we illustrate the PTD function via an example. Specifically, Figure 2.3 shows a recommended driving route $PoCab \rightarrow C_1 \rightarrow C_2 \rightarrow C_3 \rightarrow C_4$ for the cab at position $PoCab$, where the length of the suggested driving route is $L = 4$.

When a cab driver follows this route $\vec{R}^L$, he/she may pick up customers at each pick-up point with probability $P(C_i)$. For example, a pick-up event may happen at $C_1$ with probability $P(C_1)$, or at $C_2$ with probability $\bar{P}(C_1)P(C_2)$, where $\bar{P}(C_i) = 1 - P(C_i)$ is the probability that a pick-up event does not happen at $C_i$. Therefore, the travel distance before a pick-up event is discretely distributed. In addition, it is possible that no pick-up event happens after going through the whole suggested route; this probability is $\bar{P}(C_1) \cdot \bar{P}(C_2) \cdot \bar{P}(C_3) \cdot \bar{P}(C_4)$. In this chapter, since we only consider driving routes with a fixed length, the travel distance beyond the last pick-up point is set to a constant $D_\infty$, equal for all suggested driving routes.
Formally, we represent the distribution of the travel distance before next pick-up event
- 26 -
with two vectors: D−→RL
=〈D1, (D1+D2), (D1+D2+D3), (D1+D2+D3+D4), D∞〉 and
P−→RL
=〈P1, P (C1) ·P (C2), P (C1) ·P (C2) ·P (C3), P (C1) ·P (C2) ·P (C3) ·P (C4), P (C1) ·
P (C2) · P (C3) · P (C4)〉. Finally, the Potential Travel Distance (PTD) function F is
defined as the mean of this distribution as follows.
F = D−→RL· P−→
RL(2.2)
where · is the dot product of two vectors.
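Equation (2.2) can be computed directly by accumulating the leg distances together with the running probability that no pick-up has happened yet; the following is a small sketch (the leg distances and probabilities in the usage example are hypothetical):

```python
def potential_travel_distance(legs, probs, d_inf):
    """Expected travel distance before the next pick-up (the PTD function F).

    legs[i]  : driving distance of the i-th leg (to pick-up point C_{i+1})
    probs[i] : pick-up probability P(C_{i+1}); d_inf is the common constant
               distance charged when no pick-up happens on the whole route.
    """
    expected, cum_dist, p_empty = 0.0, 0.0, 1.0
    for d, p in zip(legs, probs):
        cum_dist += d
        expected += cum_dist * p_empty * p   # pick-up happens exactly here
        p_empty *= 1.0 - p                   # still empty after this point
    return expected + d_inf * p_empty        # no pick-up on the whole route

# A route in the style of Figure 2.3, with hypothetical legs and probabilities:
legs = [2.0, 1.0, 1.5, 0.5]    # D1..D4
probs = [0.3, 0.5, 0.4, 0.6]   # P(C1)..P(C4)
print(potential_travel_distance(legs, probs, d_inf=20.0))
```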
From the definition of the PTD function, we know that the evaluation of a suggested driving route is determined only by the probability of each pick-up point and the travel distances along the suggested route, apart from the common $D_\infty$. These two types of information associated with each driving route $\vec{R}^L_i$ can be represented by one $2L$-dimensional vector $\mathcal{DP} = \langle DP_1, \cdots, DP_l, \cdots, DP_{2L} \rangle$. Consider the example in Figure 2.3, where $L = 4$. The 8-dimensional vector $\mathcal{DP}$ for this specific driving route is $\mathcal{DP} = \langle D_1, \bar{P}(C_1), D_2, \bar{P}(C_2), D_3, \bar{P}(C_3), D_4, \bar{P}(C_4) \rangle$.
However, to find the optimal suggested route, if we use a brute-force method, we
need to compute the PTD for all directed sequences with a length L. This involves
a lot of computation. Indeed, many suggested routes can be removed without com-
puting the PTD function, because all pick-up points along these routes are far away
from the target cab. Along this line, we identify a monotone property of the function $F$ as follows.
Lemma 3 (The Monotone Property of the PTD Function $F$). The PTD function $F(\mathcal{DP})$ is strictly monotonically increasing in each attribute of the vector $\mathcal{DP}$, which is a $2L$-dimensional vector.

Proof. A proof sketch is as follows. By the definition of the function $F$ in Equation (2.2), we can first derive the polynomial form of $F$. From the polynomial form of $F$, we observe that the degree of each variable is one. Also, $D_\infty$ is assumed to be a sufficiently large constant. To prove the monotonicity of $F$, it suffices to prove that the coefficient of each variable is positive, which is straightforward to show. The proof details are omitted due to space limitations.
2.4.2 The LCP Algorithm
In this subsection, we introduce the LCP algorithm for finding an optimal driving route. In LCP, we exploit the monotone property of the PTD function and two other pruning strategies, Route Dominance and Constrained Sub-route Dominance, for pruning the search space.
Definition 1 (Route Dominance). A recommended driving route $\vec{R}^L$, associated with the vector $\mathcal{DP}$, dominates another route $\vec{R}'^L$, associated with the vector $\mathcal{DP}'$, iff $\forall\, 1 \leq l \leq 2L,\; DP_l \leq DP'_l$ and $\exists\, 1 \leq l \leq 2L,\; DP_l < DP'_l$. This is denoted as $\vec{R}^L \succ \vec{R}'^L$.

By this definition, if a candidate route A is dominated by a candidate route B, A cannot be an optimal route. Next, we provide a definition of constrained sub-route dominance.
Definition 2 Constrained Sub-route Dominance. Consider two sub-routes R_sub and R'_sub with equal length (the number of pick-up points) and the same source and destination points. If the associated vector of R_sub dominates the associated vector of R'_sub, then R_sub dominates R'_sub, i.e., R_sub ≺ R'_sub.
Figure 2.4. Illustration: the Sub-route Dominance. (The figure shows two sub-routes from C2 to C4, one through C3 and one through C′3, with their hop distances and pick-up probabilities.)
For example, as shown in Figure 2.4, R_sub is C2 → C3 → C4 and R'_sub is C2 → C′3 → C4. The associated vectors of R_sub and R'_sub are DP_sub = ⟨D3, P(C3), D4, P(C4)⟩ and DP′_sub = ⟨D′3, P(C′3), D′4, P(C4)⟩, respectively. The dominance of R_sub over R'_sub is then determined by the dominance of these two vectors. Here, we have the constraints that the two routes have the same length as well as the same source and destination. The constrained sub-route dominance enables us to prune the search space in advance, as shown in the following lemma.
Lemma 4 LCP Pruning. For two sub-routes A and B of length L that include only pick-up points, if sub-route A is dominated by sub-route B under Definition 2, then the candidate routes of length L that contain sub-route A are dominated and can be pruned in advance.
Let us study the example in Figure 2.4. If L = 3 and R_sub (C2 → C3 → C4) dominates R'_sub (C2 → C′3 → C4), then the candidate PoCab → C2 → C3 → C4 dominates the candidate PoCab → C2 → C′3 → C4 by Definition 1. Thus, we can prune the candidate containing R'_sub in advance, before online recommendation. Specifically, the LCP algorithm enumerates all L-length sub-routes that include only pick-up points and prunes the dominated sub-routes by Definition 2 offline. This pruning can be done before the position of a taxi driver is known. As a result, LCP pruning saves substantial computational cost, since it effectively reduces the search space.
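The offline step of LCP described above can be sketched as follows: enumerate the L-length sub-routes over the pick-up clusters, group them by (source, destination) pair as required by Definition 2, and discard the dominated ones. The encoding of the associated vector and all distances and probabilities below are illustrative assumptions, not the dissertation's actual implementation.

```python
from collections import defaultdict
from itertools import permutations

def dominates(v_a, v_b):
    # Vector dominance as in Definitions 1 and 2 (smaller is better).
    return (all(a <= b for a, b in zip(v_a, v_b))
            and any(a < b for a, b in zip(v_a, v_b)))

def offline_lcp_prune(clusters, dist, prob, L):
    """Enumerate all L-length sub-routes of pick-up clusters and keep,
    within each (source, destination) group, only the non-dominated ones
    (Constrained Sub-route Dominance, Definition 2)."""
    groups = defaultdict(list)
    for route in permutations(clusters, L):
        # Associated vector: hop distance into each subsequent point
        # followed by that point's pick-up probability.
        vec = []
        for i in range(1, L):
            vec += [dist[route[i - 1]][route[i]], prob[route[i]]]
        groups[(route[0], route[-1])].append((route, tuple(vec)))
    survivors = []
    for cands in groups.values():
        for r, v in cands:
            if not any(dominates(v2, v) for r2, v2 in cands if r2 != r):
                survivors.append(r)
    return survivors
```

Only the surviving sub-routes need to be connected to PoCab at recommendation time, which is the source of LCP's online savings.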
2.4.3 The SkyRoute Algorithm
In this subsection, we show how to leverage the idea of skyline computing for iden-
tifying representative skyline routes among all the candidate routes. Here, we first
formally define skyline routes.
Definition 3 Skyline Route. A recommended driving route R_L is a skyline route iff ∀ R_L^i ∈ R, R_L^i does not dominate R_L by Definition 1. This is denoted as R_L^i ⊀ R_L.
The skyline route query retrieves all the skyline routes of length L. Formally, we use R_Skyline to represent the set of all skyline routes.

Lemma 5 Joint Principle of Skyline Routes and the PTD Function F. The optimal driving route determined by the PTD function F must be a skyline route, i.e., R_L ∈ R_Skyline.
Proof (Sketch.) This lemma can be proved by contradiction. Assume that R_L^1 is an optimal driving route but not a skyline route. By Definition 3, R_L^1 must be dominated by some skyline route, denoted R_L^i. By Definition 1, each attribute of the vector associated with R_L^1 is no smaller than the corresponding attribute of the vector associated with R_L^i, and at least one attribute is strictly larger. Then, by Lemma 3, the value of F for the vector associated with R_L^i is less than that for the vector associated with R_L^1. Therefore, R_L^1 cannot be the optimal driving route, a contradiction.
With the joint principle of skyline routes and the PTD function F in Lemma 5, it is possible to first find the skyline routes and then search for the optimal driving route within that set. This eliminates many candidates without computing the PTD function F. Next, we show how to compute skyline routes.
Indeed, skyline computing, which retrieves non-dominated data points, has been extensively studied in the database literature (D. Papadias, Y. Tao, & B. Seeger, 2005; J. Chomicki, P. Godfrey, & D. Liang, 2003; Kian-Lee, Pin-Kwang, & Ooi, 2001). However, most of these algorithms cannot be directly used to find skyline routes in the MSR problem, because the vectors associated with suggested routes are generated through an expensive cluster-network traversal process. In particular, the performance of traditional skyline computing algorithms degrades significantly as the network size or the length of the suggested driving route increases. There is also a large memory requirement for storing these vectors during traditional skyline computation. Moreover, in real-world applications, the position of an empty cab is dynamic, so the recommended driving routes change in real time. This means that we cannot build indices for the multi-dimensional data points (the vectors DP) in advance, which many traditional skyline computing algorithms require (Tian, C.K. Lee, & Lee, 2009). To this end, we design the SkyRoute algorithm, which exploits the unique properties of skyline routes for efficient computation.
The basic idea of the SkyRoute algorithm is to prune, at a very early stage, candidate routes that are composed of dominated sub-routes and therefore cannot be skyline routes. This idea is based on the observation that every recommended driving route is composed of sub-routes and that different routes can share the same sub-routes. The search space is significantly reduced, since many candidate routes containing dominated sub-routes are discarded from further consideration as skyline routes. In the following, we first introduce two lemmas for pruning candidate routes based on dominated sub-routes.
Lemma 6 Backward Pruning. If a sub-route R1 from PoCab to an intermediate pick-up point Ci is dominated by another sub-route R2 from PoCab to Ci under the sub-route dominance of Definition 2, then every candidate route that has R1 as its preceding sub-route is dominated by the corresponding candidate route that has R2 instead; the only difference between the two candidate routes is the segment from PoCab to Ci. Thus, the candidate routes containing R1 can be pruned in advance.
Lemma 7 Forward Pruning. If a sub-route R1 from one pick-up point Ci to another pick-up point Cj is dominated by another sub-route R2 from Ci to Cj under the sub-route dominance of Definition 2, then every candidate route containing R1 as a sub-route is dominated by the corresponding candidate route containing R2 instead; the only difference between the two candidate routes is the segment from Ci to Cj. Therefore, the candidate routes containing R1 can be pruned in advance.
With Backward Pruning, it is possible to identify dominated sub-routes and discard the candidate routes that contain them. The benefit of Forward Pruning is the ability to prune dominated sub-routes, and hence candidate routes, offline, since both the probabilities and the distances between pick-up points are available before any online recommendation of driving routes. Note that only sub-routes of length less than L need to be considered in the above discussion.
Algorithm 1 shows the pseudo-code of the SkyRoute algorithm. As can be seen, during offline processing, SkyRoute checks the dominance of sub-routes of length L by Definition 2 and prunes those dominated by others. This process is also applied in the LCP algorithm. In addition, SkyRoute can prune sub-routes of different lengths with Forward Pruning (Lemma 7). During online processing, the results of offline processing are used as candidate routes. In steps 2 to 5, SkyRoute iteratively checks the sub-routes with PoCab as the source node and prunes the candidate routes containing dominated sub-routes with Backward Pruning (Lemma 6). Then, in step 6, the candidate set resulting from all the pruning is obtained. Finally, a skyline query (S. Borzsonyi et al., 2001) is conducted on this candidate set to find the skyline routes. Please note that the online search time for the optimal driving route includes both the online processing time of SkyRoute and the search time over the set of skyline routes.
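The final skyline query on the pruned candidate set can be as simple as a block-nested-loops (BNL) scan, in the spirit of S. Borzsonyi et al. (2001); the sketch below is a minimal in-memory version, not the dissertation's implementation:

```python
def bnl_skyline(points):
    """Block-nested-loops skyline: maintain a window of mutually
    incomparable points; each incoming point is discarded if dominated,
    otherwise it evicts any window members it dominates and joins."""
    def dominates(a, b):
        return (all(x <= y for x, y in zip(a, b))
                and any(x < y for x, y in zip(a, b)))
    window = []
    for p in points:
        if any(dominates(w, p) for w in window):
            continue  # p cannot be a skyline point
        window = [w for w in window if not dominates(p, w)]
        window.append(p)
    return window
```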
2.4.4 Obtaining the Optimal Driving Route
For both the LCP and SkyRoute algorithms, after all pruning, we have a set of final candidate routes for a given taxi driver. To obtain the optimal driving route, we simply compute the PTD function F for all remaining candidate routes of length L. The route with the minimal PTD value is then the optimal driving route for this taxi driver.
2.4.5 The Recommendation Process
Even though we can find the optimal driving route for a given cab at its current position, it is still challenging to make recommendations for many cabs in the same area. In this section, we address this problem and introduce a strategy for the recommendation process in the real world.

A simple approach is to suggest that all empty cabs follow the same optimal driving route; however, this naturally creates an overload problem, which degrades the performance of the recommender system. To this end, we employ load balancing techniques (Grosu & Chronopoulos, 2004) to distribute the empty cabs over multiple optimal driving routes. Load balancing has been widely used in distributed systems to optimize a given objective by finding allocations of multiple jobs to different computers. For example, the load balancing
Input: C: set of cluster nodes with central positions; P: probability set for all cluster nodes; Dist: pairwise driving-distance matrix of cluster nodes; L: the length of the suggested driving route; PoCab: the position of one empty cab.
Output: R_Skyline: list of skyline driving routes.
Online Processing:
1: Enumerate all candidate routes by connecting PoCab with each sub-route in R^L_sub obtained in step 10 of Offline Processing
2: for i = 2 : L − 1 do
3:   Identify dominated sub-routes with the i-th intermediate cluster and prune the corresponding candidates using Lemma 6
4:   Update the candidate set by filtering out the candidates pruned in step 3
5: end
6: Select the remaining candidate routes of length L from the loop above
7: Run a final typical skyline query to get R_Skyline from the candidate routes of step 6
Offline Processing (LCP):
8: Enumerate all sub-routes of length L from C
9: Prune dominated constrained sub-routes of length L using Lemma 7
10: Maintain the remaining non-dominated sub-routes of length L, denoted as R^L_sub
Algorithm 1: The SkyRoute Algorithm
Figure 2.5. Illustration of the Circulating Mechanism. (Multiple empty cabs are assigned in turn to k driving routes, No. 1 through No. k.)
mechanism distributes requests among web servers in order to minimize the execution time. For the proposed mobile recommender system, we can treat multiple empty cabs as jobs and multiple optimal driving routes as computers. Then, we can address the overload problem by exploiting existing load balancing algorithms. Specifically, in this study, we apply a circulating mechanism to the recommender system by exploiting a Round-Robin algorithm (Xu & Huang, CS213 Univ. of California, Riverside), which is a static load balancing method.

Under the circulating mechanism, to make recommendations for multiple empty cabs, a round-robin scheduler alternates the recommendation among multiple optimal driving routes in a circular manner. As shown in Figure 2.5, we can search for the k optimal driving routes and recommend the No. 1 route to the first arriving empty cab. For the second empty cab, the No. 2 driving route is recommended. If there are more than k empty cabs, recommendations repeat from the No. 1 route after the k-th empty cab. In practice, a central dispatcher (processor) is needed to maintain the empty cabs and their assignments among the top-k driving routes. Note that load balancing techniques are not the focus of this dissertation.
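The circulating mechanism amounts to round-robin assignment at the central dispatcher; a minimal sketch (route and cab identifiers are hypothetical):

```python
import itertools

class RoundRobinDispatcher:
    """Cycle recommendations over the top-k optimal driving routes so that
    consecutive empty cabs are spread across routes rather than all being
    sent down the same one (the overload problem)."""
    def __init__(self, top_k_routes):
        self._cycle = itertools.cycle(top_k_routes)

    def recommend(self, cab_id):
        return cab_id, next(self._cycle)

dispatcher = RoundRobinDispatcher(["route-1", "route-2", "route-3"])
assignments = [dispatcher.recommend(cab)
               for cab in ["cab-A", "cab-B", "cab-C", "cab-D"]]
# The fourth cab wraps around to route-1 after the k-th (third) cab.
```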
2.5 Experimental Results
In this section, we evaluate the performance of the two proposed algorithms: LCP and SkyRoute.
2.5.1 The Experimental Setup
Real-world Data. In the experiments, we used real-world cab mobility traces provided by the Exploratorium, the museum of science, art and human perception, through the cabspotting project (http://cabspotting.org/, n.d.). This data set contains GPS location traces of approximately 500 taxis collected over approximately 30 days in the San Francisco Bay Area. Each recorded point has four attributes: latitude, longitude, fare identifier, and time stamp. In the experiments, we select the successful cab drivers and generate the cluster information as follows. Specifically, we select cab drivers with total driving hours over 230 and occupancy rates greater than 0.5. In total, we obtain 20 cab drivers and their location traces. Based on this selected data, we generate potential pick-up points and the pick-up probability associated with each pick-up point for different time periods. We focus on two time periods, 2PM-3PM and 6PM-7PM, for which we obtain 636 and 400 historical pick-up points, respectively. After calculating the pairwise driving distances between pick-up points with the Google Maps API, we use Cluto (Karypis, n.d.) for clustering. All default parameters are used in the clustering process except for "-clmethod=direct". Please note that, since the driving distance measured by the Google Maps API depends on the driving direction, we use the average of the two directions to estimate the distance between each pair of pick-up points. Finally, we group the historical pick-up points into 10 clusters. The traveling distances between clusters are measured between cluster centroids with the Google Maps API.
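Averaging the two driving directions, as described above, amounts to symmetrizing the distance matrix; a sketch with illustrative values (not the actual Google Maps distances):

```python
import numpy as np

# Asymmetric driving distances (illustrative): entry [i][j] is the routed
# distance from pick-up point i to pick-up point j.
d = np.array([[0.0, 2.4, 5.1],
              [2.0, 0.0, 3.3],
              [4.9, 3.5, 0.0]])

# Symmetric estimate used for clustering: average of the two directions.
d_sym = (d + d.T) / 2.0
```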
Synthetic data. To enhance validation, we also generate synthetic data for
the experiments. Specifically, we randomly generate potential pick-up points within
a specified area and generate the pick-up probability associated with each pick-up
point by a standard uniform distribution. In total, we have 3 synthetic data sets
with 10, 15 and 20 pick-up points respectively. For this synthetic data, we use the
Euclidean distance instead of the driving distance to measure the traveling distance
between pick-up points. Also, for both real-world and synthetic data, we randomly
generate the positions of the target cab for recommendation.
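The synthetic data generation described above can be sketched as follows (the area bounds and random seed are arbitrary choices, not the dissertation's settings):

```python
import numpy as np

def make_synthetic(n_points, area=100.0, seed=0):
    """Random pick-up points in a square area with pick-up probabilities
    drawn from a standard uniform distribution; traveling distances are
    Euclidean, and the target-cab position is drawn at random."""
    rng = np.random.default_rng(seed)
    points = rng.uniform(0.0, area, size=(n_points, 2))
    probs = rng.uniform(0.0, 1.0, size=n_points)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))  # pairwise Euclidean distances
    cab_pos = rng.uniform(0.0, area, size=2)
    return points, probs, dist, cab_pos
```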
Experimental Environment. The algorithms were implemented in Matlab 2008a. All experiments were conducted on a Windows 7 machine with an Intel Core 2 Quad Q8300 CPU and 6 GB of RAM. The search time for the optimal driving route and the skyline computing time are the two main performance metrics. All reported results are averages over 10 runs.
2.5.2 An Illustration of Optimal Driving Routes
Here, we show some optimal driving routes determined by the PTD function F on
real-world data.
In Figure 2.6, we plot the potential pick-up points within the time period 6PM-7PM and the assumed position of the target cab. During this time period, the optimal driving routes evaluated by the PTD function are PoCab → C1 → C3 → C2, PoCab → C1 → C3 → C2 → C7, and PoCab → C4 → C1 → C3 → C2 → C7 for L = 3, L = 4, and L = 5, respectively.
Figure 2.6. Illustration: Optimal Driving Routes. (The figure shows the pick-up clusters C1 through C10 and the position PoCab of the target cab.)
2.5.3 An Overall Comparison
In this subsection, we present an overall comparison of the computational performance of several algorithms.
First, in SkyRoute, after the pruning process proposed in this chapter, we apply traditional skyline computing methods to find the skylines from the remaining candidate set. Here, we employ two skyline computing methods, BNL and D&C (S. Borzsonyi et al., 2001). The acronyms of all evaluated algorithms are given in Table 2.1.

Table 2.1. Some Acronyms.
BFS: Brute-Force Search.
LCPS: Search with LCP.
SR(BNL)S: Search via skyline computing (SkyRoute + BNL).
SR(D&C)S: Search via skyline computing (SkyRoute + D&C).

Note that, for BFS, we simply compute the PTD value for all candidate routes one by one and find the minimum value, and hence the optimal driving route. Also, most information, such as the locations of potential pick-up points and the pick-up probabilities, is known in advance; the online computations are the distances from the target cab to the pick-up points and the PTD function.
Figure 2.7 shows the online search time of optimal driving routes evaluated by the PTD function for different values of L on both synthetic and real-world data. The search time shown includes all online processing time. As can be seen, LCPS outperforms BFS and SR(D&C)S by a significant margin for all lengths of the optimal driving route on both synthetic and real data. Searching via skyline computing takes longer than LCPS or BFS because skyline computing is partially an online process and is time-consuming. Although we only show results for the time period 6PM-7PM, a similar trend is observed in the other time periods.
In terms of the pruning effect, both LCP and SkyRoute prune the search space significantly, as shown in Figure 2.8, where we plot the pruning ratios of LCP and SkyRoute. Note that the pruning ratio is the number of pruned candidates divided
Figure 2.7. A Comparison of Search Time. ((a) Comparisons on Real Data (6-7PM); (b) Comparisons on Synthetic Data (10 Clusters); both panels plot the search time in seconds against the length of the driving route L for BFS, LCPS, and SR(D&C)S.)
Figure 2.8. The Pruning Effect. ((a) Real Data (6-7PM), pruning percentage against the length of the driving route L; (b) Synthetic Data (L=3), pruning percentage against the number of pick-up points; both panels compare LCPS and skyline pruning.)
by the original number of all the candidates.
In addition, for LCPS, the pruning can be done in advance, which saves substantial online search time. In particular, Table 2.2 shows a comparison of online search time between BFS and LCPS across different numbers of pick-up points and different lengths of driving routes on both synthetic and real-world data. As can be seen, LCPS always outperforms BFS by a significant margin.
Finally, Figure 2.9 shows the online search time of optimal driving routes (L = 3)
Table 2.2. A Comparison of Search Time (Seconds) between BFS and LCPS.

10 Synthetic Pick-up Clusters:
        L = 3      L = 4      L = 5
BFS     0.051643   0.300211   2.000949
LCPS    0.043750   0.165401   0.803290

15 Synthetic Pick-up Clusters:
BFS     0.142254   1.925054   23.517042
LCPS    0.095364   0.611193   4.322053

Real Data (2-3PM):
BFS     0.045933   0.297187   1.991507
LCPS    0.036736   0.141536   0.622932
across different numbers of pick-up points on synthetic data. A similar performance trend to that in Figure 2.7 can be observed.
2.5.4 A Comparison of Skyline Computing
In this subsection, we evaluate the performance of different skyline computing algorithms.
This experiment was conducted across different numbers of pick-up points and
different lengths of recommended driving routes on both synthetic and real-world
data. As shown in Figure 2.10, SkyRoute with BNL or D&C leads to better efficiency than the traditional skyline computing methods. This indicates that SkyRoute is an effective method for computing skyline routes.

Furthermore, we observed that the computational cost of BNL or D&C varies across data sets with the same number of candidate routes. The reason is that BNL and D&C have different computational complexity in the best and worst cases. Therefore, even with the same number of pick-up points and the same length of driving routes, the running time of SkyRoute(BNL) or SkyRoute(D&C) differs, as
Figure 2.9. A Comparison of Search Time (L = 3) on the Synthetic Data Set. (Search time in seconds against the number of pick-up points for BFS, LCPS, SR(D&C)S, and SR(BNL)S.)
shown in Figure 2.10 and Figure 2.7.
2.5.5 Case: Multiple Evaluation Functions
Here, we show the advantages of searching for optimal driving routes through skyline computing. Specifically, we evaluate the business scenario in which there are needs for different definitions of the optimal driving route, measured by different evaluation functions.

As can be seen in Figure 2.7 and Figure 2.9, searching for a single optimal driving route via skyline computing does not outperform LCPS or BFS, because computing skylines takes most of the total online processing time. However, for a target cab and fixed potential pick-up points, we only need to compute the skylines once, and the search space can be pruned drastically, as shown in Figure 2.8. In other words, if the goal is to provide multiple optimal driving routes based on different business needs
Figure 2.10. A Comparison of Skyline Computing. ((a) Synthetic Data (L=3), skyline computing time in seconds against the number of pick-up points; (b) Real Data (6-7PM), skyline computing time against the length of the driving route L; both panels compare BNL, D&C, SkyRoute(BNL), and SkyRoute(D&C).)
at the same time, skyline computing has an advantage.
To illustrate this benefit of skyline computing, we design five different evaluation functions (including PTD) to select five corresponding optimal driving routes. Note that all these evaluation functions have the monotonicity property stated in Lemma 3. Due to space limitations, we omit the details of these evaluation functions. We then search for the five optimal driving routes simultaneously with the methods in Table 2.1 on both synthetic and real-world data. Figure 2.11 compares the computational performance for L = 3. As can be seen, SR(D&C)S outperforms LCPS and BFS by a significant margin.
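The saving follows from Lemma 5: every monotone evaluation function attains its optimum on the skyline, so the skyline is computed once and each objective only scans that small set. A sketch with hypothetical weighted-sum objectives standing in for the five evaluation functions:

```python
def bnl_skyline(points):
    # Block-nested-loops skyline over candidate vectors (smaller is better).
    def dom(a, b):
        return (all(x <= y for x, y in zip(a, b))
                and any(x < y for x, y in zip(a, b)))
    window = []
    for p in points:
        if any(dom(w, p) for w in window):
            continue
        window = [w for w in window if not dom(p, w)]
        window.append(p)
    return window

candidates = [(1.0, 4.0), (2.0, 2.0), (4.0, 1.0), (3.0, 3.0), (5.0, 5.0)]
skyline = bnl_skyline(candidates)  # computed once, shared by all objectives

# Five monotone (increasing) objectives, e.g. different business weightings;
# each optimum is found by scanning only the skyline.
objectives = [lambda v, w=w: w * v[0] + (1 - w) * v[1]
              for w in (0.1, 0.3, 0.5, 0.7, 0.9)]
optima = [min(skyline, key=f) for f in objectives]
```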
2.6 CONCLUDING REMARKS
In this chapter, we developed an energy-efficient mobile recommender system by exploiting energy-efficient driving patterns extracted from the location traces of taxi drivers. This system recommends a sequence of potential pick-up points to a driver such that the potential travel distance before having a
Figure 2.11. A Comparison of Search Time for Multiple Optimal Driving Routes. (Search time in seconds with multiple evaluation functions for BFS, LCPS, and SR(D&C)S on Synthetic Data (L=3, 10 Clusters) and Real Data (L=3, 6-7PM).)
customer is minimized. To develop the system, we first formalized the mobile sequential recommendation problem and provided a Potential Travel Distance (PTD) function for evaluating each candidate sequence. Based on the monotone property of the PTD function, we proposed a recommendation algorithm named LCP. Moreover, we observed that many candidate routes are dominated by skyline routes and can thus be pruned by skyline computing. Therefore, we also proposed the SkyRoute algorithm to efficiently compute the skylines of candidate routes. An advantage of searching for an optimal route through skyline computing is that it saves overall online processing time when we provide different optimal driving routes defined by different business needs.

Finally, experimental results showed that the LCP algorithm outperforms the brute-force method and SkyRoute by a significant margin when searching for only one optimal driving route. Moreover, the results showed that SkyRoute outperforms brute-force and LCP when there is an online demand for different optimal driving routes defined by different evaluation functions.
CHAPTER 3
COST-AWARE COLLABORATIVE FILTERING FOR TRAVEL TOUR
RECOMMENDATIONS
Advances in tourism economics have enabled us to collect massive amounts of travel tour data. If properly analyzed, this data can be a source of rich intelligence for real-time decision making and for the provision of travel tour recommendations. However, tour recommendation is quite different from traditional recommendation, because a tourist's choice is directly affected by the travel cost, which includes both financial and time costs. To that end, in this chapter, we provide a focused study of cost-aware tour recommendation. Along this line, we first propose two ways to represent a user's cost preference. One is to represent the cost preference as a two-dimensional vector. The other is to account for the uncertainty about the cost a user can afford by introducing a Gaussian prior over the user's cost preference. With these two representations, we develop cost-aware latent factor models by incorporating the cost information into the Probabilistic Matrix Factorization (PMF) model, the Logistic Probabilistic Matrix Factorization (LPMF) model, and the Maximum Margin Matrix Factorization (MMMF) model, respectively. When applied to real-world travel tour data, all the cost-aware recommendation models consistently outperform existing latent factor models by a significant margin.
3.1 Introduction
Recent years have witnessed increased interest in data-driven travel marketing. As a result, massive amounts of travel data have been accumulated, providing unparalleled opportunities to understand user behavior and generate useful knowledge, which in turn delivers intelligence for real-time decision making in various fields, including travel tour recommendation.
Recommender systems address the information overload problem by identifying user interests and providing personalized suggestions. In general, there are three ways to develop recommender systems (Adomavicius & Tuzhilin, 2005). The first is content-based: it suggests items similar to those a given user has liked in the past. The second is based on collaborative filtering (Ge, Xiong, Tuzhilin, & Liu, 2011; Q. Liu, Chen, Xiong, & Ding, 2010; N. N. Liu, Zhao, Xiang, & Yang, 2010); that is, recommendations are made according to the tastes of other users who are similar to the target user. The third combines the two approaches into a hybrid solution (Xu & Huang, CS213 Univ. of California, Riverside). However, developing recommender systems for travel tour recommendation differs significantly from developing them in traditional domains, since a tourist's choice is directly affected by the travel cost, which includes the financial cost as well as various other costs, such as time and opportunity costs.
In addition, travel tour data has some unique characteristics that distinguish travel tour recommendation from traditional recommendation, such as movie recommendation. First, the prices of travel packages vary widely. For example, in the real-world travel tour logs collected by a travel company, package prices range from $50 to $10,000. Second, the time cost of packages also varies widely: while some travel packages take less than 3 days, others may take more than 10 days. In traditional recommender systems, the cost of consuming a recommended item, such as a movie or a song, is usually not a concern for customers. However, tourists usually face financial and time constraints when selecting a travel package. Figure 3.1 shows the cost distributions of some tourists, where each point corresponds to one user. As can be seen, both the financial and time costs vary considerably among tourists. Therefore, traditional recommendation models, which do not consider the cost of travel packages, have difficulty providing the right travel tour recommendation to the right tourists. For example, a traditional recommender system might recommend a travel package to a tourist who cannot afford its price or time requirements.
To address this challenge, in this chapter, we study how to incorporate cost information into traditional latent factor models for travel tour recommendation. The extended latent factor models aim to learn users' cost preferences and interests simultaneously from large-scale travel tour logs. Specifically, we introduce two types of cost information into the traditional latent factor models. The first type refers to the observable costs of a travel package, which include both its financial cost and its time cost. For example, if a person goes on a 7-day trip to Cambodia and pays $2000 for travel package j,
Figure 3.1. The Cost Distribution. (Each point plots one user's time cost in days against financial cost in RMB.)
then the observed costs of this travel package are denoted as a vector CVj = (2000, 7). The second type of cost information refers to the unobserved financial and time cost preference of a user. We propose two ways to represent this unobserved cost preference. First, we represent the cost preference of user i with a two-dimensional cost vector CUi, covering both financial and time costs. Second, since there is still uncertainty about the financial and time costs that a user can afford, we further introduce a Gaussian prior G(CUi), instead of the cost vector CUi, over the cost preference of user i to express this uncertainty.
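The two representations can be contrasted in a small sketch: the vector form scores a package by its distance to the user's preferred (price, days), while the Gaussian form scores it by log-likelihood under a per-user mean and variance. All scales and numbers here are illustrative assumptions, not the fitted quantities of the models developed below.

```python
import math

def point_cost_fit(cu, cv):
    """Fit under the 2-d cost-vector representation: negative squared
    distance between user cost preference cu and package cost cv
    (price, days), each dimension scaled to comparable units."""
    scale = (1000.0, 1.0)  # illustrative scaling: RMB per unit, days
    return -sum(((u - v) / s) ** 2 for u, v, s in zip(cu, cv, scale))

def gaussian_cost_fit(mu, sigma2, cv):
    """Fit under the Gaussian-prior representation: log-density of the
    package cost cv under an independent Gaussian with per-dimension
    mean mu and variance sigma2, expressing uncertainty about what
    the user can afford."""
    return sum(-0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
               for m, s2, v in zip(mu, sigma2, cv))
```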
Given the above item cost information and the two representations of users' cost preferences, we introduced two cost-aware Probabilistic Matrix Factorization (PMF) (Salakhutdinov & Mnih, 2008) models in (Ge, Liu, Xiong, Tuzhilin, & Chen, 2011). These two cost-aware Probabilistic Matrix Factorization models are
based on a Gaussian noise assumption over the observed implicit ratings. In this chapter, however, we argue that it may be better to model the noise as binomial, because over 60% of the implicit ratings of travel packages are 1. Therefore, we further investigate two more latent factor models, the Logistic Probabilistic Matrix Factorization (LPMF) (Yang et al., 2011) and Maximum Margin Matrix Factorization (MMMF) (Srebro, Rennie, & Jaakkola, 2005) models, and propose new cost-aware models based on them. Compared with the Probabilistic Matrix Factorization model studied in (Ge, Liu, et al., 2011), these two latent factor models rest on different assumptions and have different mathematical formulations, so we develop different techniques to incorporate the cost information into them. Furthermore, for both the LPMF and MMMF models, we need to sample negative ratings, which were not considered in (Ge, Liu, et al., 2011), to learn the latent features. In sum, we develop cost-aware extended models using the two representations of users' cost preferences for each of the PMF, LPMF, and MMMF models. In addition to the unknown latent features, such as the users' latent features, the unobserved user cost information (e.g., CU or G(CU)) is also learned by training these extended cost-aware latent factor models. By investigating and extending the above three latent factor models, we expect to better understand which model works best for travel tour recommendation in practice and how much improvement can be achieved by incorporating cost information into the different models. Finally, we provide efficient algorithms to solve the different objective functions in these extended models.
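As a point of reference for these extensions, a standard (cost-unaware) PMF trained by stochastic gradient descent can be sketched as follows; the cost-aware variants developed in this chapter add user-cost terms and their gradients to the same loop. All hyperparameters are illustrative.

```python
import numpy as np

def train_pmf(ratings, n_users, n_items, k=8, lr=0.01, reg=0.05,
              epochs=500, seed=0):
    """Probabilistic Matrix Factorization via SGD: minimize squared error
    on the observed ratings plus L2 regularization, i.e., the MAP objective
    under Gaussian noise and Gaussian priors on the latent features."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((n_users, k))  # user latent features
    V = 0.1 * rng.standard_normal((n_items, k))  # item latent features
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - U[u] @ V[i]
            u_old = U[u].copy()
            U[u] += lr * (err * V[i] - reg * U[u])
            V[i] += lr * (err * u_old - reg * V[i])
    return U, V
```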
Finally, with real-world travel data, we provide much more extensive experimentation in this chapter than in (Ge, Liu, et al., 2011). Specifically, we first show that the performance of the PMF, LPMF, and MMMF models for tour recommendation can be improved by taking cost information into consideration, especially when active users have very few observed ratings; a statistical significance test shows that the improvement of the cost-aware models is significant. Second, the extended MMMF and LPMF models yield a larger performance improvement than the extended PMF models in terms of Precision@K and MAP. Third, we demonstrate that the sampled negative ratings have an interesting influence on the performance of the extended LPMF and MMMF models. Finally, we demonstrate that the latent user cost information learned by the extended models can help travel companies with customer segmentation.
3.2 Related Work
Related work can be grouped into three categories: collaborative filtering
models, travel recommendation, and cost/profit-based recommendation.
3.2.1 Collaborative Filtering
Two types of collaborative filtering models have been intensively studied:
memory-based and model-based approaches. Memory-based algorithms (Deshpande
& Karypis, 2004; Koren, 2008; Bell & Koren, 2007) essentially predict a rating
from neighboring ratings. In model-based approaches, training data are used to
fit a predefined model; different approaches (Hofmann, 2004; N. N. Liu, Xiang,
Zhao, & Yang, 2010; Xue et al., 2005; B. Marlin, 2003; Ge, Xiong, et al., 2011)
vary in the statistical models they assume for the data. In particular, various
matrix factorization (MF) methods (Srebro et al., 2005; Salakhutdinov & Mnih,
2008; Agarwal & Chen, 2009) have been proposed for collaborative filtering. Most
MF approaches fit the user-item rating matrix with a low-rank approximation and
use the learned latent user/item features to predict the unknown ratings. The
PMF model (Salakhutdinov & Mnih, 2008) assumes Gaussian noise on the observed
ratings and places Gaussian priors on the latent features. By introducing the
logistic function into the loss function, PMF was also extended to handle binary
ratings (Yang et al., 2011). More recently, instead of constraining the
dimensionality of the latent factors, Srebro et al. (2005) proposed the MMMF
model, which constrains the norms of the user and item feature matrices. Finally,
more sophisticated methods are available that consider user/item side information
(Adams, Dahl, & Murray, 2010; Gu, Zhou, & Ding, 2010), social influence (Ma,
King, & Lyu, 2009), and context information (Adomavicius, Sankaranarayanan, Sen,
& Tuzhilin, 2005), e.g., temporal information (Liang Xiong, 2010) and
spatio-temporal context (Lu, Agarwal, & Dhillon, 2009). However, most of the
above methods were developed for recommending traditional items, such as movies,
music, articles, and webpages. In these recommendation tasks, financial and time
costs are usually not essential to the recommendation results and are not
considered in the models.
3.2.2 Travel Recommendation
Travel-related recommendations have been studied before. For instance, in (Hao et
al., 2010), a probabilistic topic model was proposed to mine two types of topics,
i.e., local topics (e.g., lava, coastline) and global topics (e.g., hotel,
airport), from travelogues on the Web. Travel recommendation was then performed
by recommending to a user a destination that is similar to a given location or
relevant to a given travel intention. (Cena et al., 2006) presented UbiquiTO, a
tourist guide for intelligent content adaptation, which used a rule-based
approach to adapt the content of the provided recommendations. A content
adaptation approach (Yu et al., 2006) was developed for presenting
tourist-related information; both content and presentation recommendations were
tailored to particular mobile devices and network capabilities, using
content-based, rule-based and Bayesian classification methods to provide
tourism-related mobile recommendations. (Baltrunas, Ricci, & Ludwig, 2011)
presented a method to recommend various places of interest to tourists by using
physical, social and modal types of contextual information; the recommendation
algorithm was based on a factor model extended to capture the impact of the
selected contextual conditions on the predicted rating. A tourist guide system,
COMPASS (Setten, Pokraev, Koolwaaij, & Instituut, 2004), was presented to support
many standard tourism-related functions. Finally, other examples of travel
recommendations proposed in the literature are available in (Cheverst et al.,
2000; Ardissono, Goy, Petrone, Segnan, & Torasso, 2002; Carolis, Mazzotta,
Novielli, & Silvestri, 2009; M.-H. Park, Hong, & Cho, 2007; Woerndl, Huebner,
Bader, & Vico, 2011; Baltrunas,
Ludwig, Peer, & Ricci, 2011; Jannach & Hegelich, 2009), and (Kenteris, Gavalas, &
Economou, 2011) provided an extensive categorization of mobile guides according
to Internet connectivity, indoor or outdoor use, etc. In this chapter, we focus
on developing cost-aware latent factor models for travel package recommendation,
which differs from the above travel recommendation tasks.
3.2.3 Cost/Profit-based Recommendation
There is also some prior work (Hosanagar, Krishnan, & Ma, 2008; Das, Mathieu, &
Ricketts, 2010; Chen, Hsu, Chen, & Hsu, 2008; Ge et al., 2010) on profit/cost-
based recommender systems. For instance, (Hosanagar et al., 2008) studied the
impact of a firm's profit incentives on the design of recommender systems. In
particular, this research identified the conditions under which a
profit-maximizing recommender recommends the item with the highest margin and
those under which it recommends the most relevant item; it also explored the
mismatch between consumer and firm incentives, and determined the social costs
associated with this mismatch. (Das et al., 2010) studied how a vendor can
directly incorporate the profitability of items into the recommendation process
so as to maximize the expected profit while still providing accurate
recommendations; the proposed approach takes the output of a traditional
recommender system and adjusts it according to item profitability. However, most
of these prior travel-related and cost-based recommendation studies did not
explicitly consider the expense and time costs of travel recommendation. Also,
in this chapter, we focus on travel tour recommendation.
Finally, in our preliminary work on travel tour recommendation (Ge, Liu, et al.,
2011), we developed two simple cost-aware PMF models for travel tour recommenda-
tion. In this chapter, we provide a comprehensive study of cost-aware collaborative
filtering for travel tour recommendation. Particularly, we investigate how to incorpo-
rate the cost information into different latent factor models and evaluate the design
decisions related to model choice and development.
3.3 Cost-aware PMF Models
In this section, we propose two ways to represent users' cost preferences, and
introduce how to incorporate the cost information into the PMF (Salakhutdinov &
Mnih, 2008) model by designing two cost-aware PMF models: the vPMF and gPMF
models.
3.3.1 The vPMF Model
vPMF is a cost-aware probabilistic matrix factorization model which represents
user and item costs with 2-dimensional vectors, as shown in Figure 3.2 (b).
Suppose we have N users and M packages. Let R_{ij} be the rating of user i for
package j, and let U_i and V_j denote the D-dimensional user-specific and
package-specific latent feature vectors, respectively (both U_i and V_j are
column vectors in this chapter). Also, let C_{U_i} and C_{V_j} denote the
2-dimensional cost vectors of user i and package j, respectively, and let C_U
and C_V denote the sets of cost vectors of all users and all packages. The
conditional distribution over the observed ratings R ∈ R^{N×M} is:

p(R | U, V, C_U, C_V, σ²) = ∏_{i=1}^{N} ∏_{j=1}^{M} [N(R_{ij} | f(U_i, V_j, C_{U_i}, C_{V_j}), σ²)]^{I_{ij}},    (3.1)
where N(x | µ, σ²) is the probability density function of the Gaussian
distribution with mean µ and variance σ², and I_{ij} is an indicator variable
equal to 1 if user i rated item j and 0 otherwise. Also, U is a D×N matrix and V
is a D×M matrix. The function f(·) approximates the rating of item j by user i
and is defined as:

f(U_i, V_j, C_{U_i}, C_{V_j}) = S(C_{U_i}, C_{V_j}) · U_i^T V_j,    (3.2)

where S(C_{U_i}, C_{V_j}) is a similarity function measuring the similarity
between the user cost vector C_{U_i} and the item cost vector C_{V_j}. Several
existing similarity/distance functions can be used here, such as the Pearson
coefficient, the cosine similarity, or the Euclidean distance. C_V can be
considered known in this chapter because we can directly obtain the cost
information of tour packages from the tour logs, while C_U is the set of user
cost vectors to be estimated. Moreover, we apply zero-mean spherical Gaussian
priors (Salakhutdinov & Mnih, 2008) to the user and item latent feature vectors:

p(U | σ_U²) = ∏_{i=1}^{N} N(U_i | 0, σ_U² I),    p(V | σ_V²) = ∏_{j=1}^{M} N(V_j | 0, σ_V² I).
As shown in Figure 3.2, in addition to the user and item latent feature vectors,
we also need to learn the user cost vectors simultaneously. By Bayesian
inference, we have

p(U, V, C_U | R, C_V, σ², σ_U², σ_V²)
  ∝ p(R | U, V, C_U, C_V, σ²) p(U | σ_U²) p(V | σ_V²)
  = ∏_{i=1}^{N} ∏_{j=1}^{M} [N(R_{ij} | f(U_i, V_j, C_{U_i}, C_{V_j}), σ²)]^{I_{ij}}
    × ∏_{i=1}^{N} N(U_i | 0, σ_U² I) × ∏_{j=1}^{M} N(V_j | 0, σ_V² I).    (3.3)

U, V and C_U can be learned by maximizing this posterior distribution, or its
logarithm, over the user cost vectors and the user and item latent feature
vectors, with fixed hyperparameters, i.e., the observation noise variance and the
prior variances. From Equation (3.3) or Figure 3.2, we can see that vPMF is a
generalization of PMF that takes the cost information into consideration: if we
fix S(C_{U_i}, C_{V_j}) = 1 for all user-item pairs, vPMF reduces to the PMF
model.
The log of the posterior distribution in Equation (3.3) is:

ln p(U, V, C_U | R, C_V, σ², σ_U², σ_V²) =
  − (1/(2σ²)) ∑_{i=1}^{N} ∑_{j=1}^{M} I_{ij} (R_{ij} − f(U_i, V_j, C_{U_i}, C_{V_j}))²
  − (1/2) { (∑_{i=1}^{N} ∑_{j=1}^{M} I_{ij}) ln σ² + ND ln σ_U² + MD ln σ_V² }
  − (1/(2σ_U²)) ∑_{i=1}^{N} U_i^T U_i − (1/(2σ_V²)) ∑_{j=1}^{M} V_j^T V_j + C,    (3.4)

where C is a constant that does not depend on the parameters. Maximizing this
log posterior over the user cost vectors and the user and item latent feature
vectors is equivalent to minimizing the following sum-of-squared-errors
objective function with quadratic
regularization terms:

E = (1/2) ∑_{i=1}^{N} ∑_{j=1}^{M} I_{ij} (R_{ij} − S(C_{U_i}, C_{V_j}) · U_i^T V_j)²
    + (λ_U/2) ∑_{i=1}^{N} ||U_i||_F² + (λ_V/2) ∑_{j=1}^{M} ||V_j||_F²,    (3.5)

where λ_U = σ²/σ_U², λ_V = σ²/σ_V², and ||·||_F denotes the Frobenius norm. From
this objective function we can again see that the vPMF model reduces to the PMF
model if S(C_{U_i}, C_{V_j}) = 1 for all user-item pairs.
Since the dimension of the cost vectors is small, we base the similarity function
on the Euclidean distance: S(C_{U_i}, C_{V_j}) = (2 − ||C_{U_i} − C_{V_j}||_2)/2.
Since the two attributes of the cost vector have significantly different scales,
we utilize Min-Max normalization to preprocess all item cost vectors, so that
each attribute is scaled to the range [0, 1]; consequently, the value of the
above similarity function also lies in [0, 1]. A local minimum of the objective
function in Equation (3.5) can then be obtained by performing gradient descent in
U_i, V_j and C_{U_i}:

∂E/∂U_i = ∑_{j=1}^{M} I_{ij} (S(C_{U_i}, C_{V_j}) · U_i^T V_j − R_{ij}) · S(C_{U_i}, C_{V_j}) V_j + λ_U U_i,
∂E/∂V_j = ∑_{i=1}^{N} I_{ij} (S(C_{U_i}, C_{V_j}) · U_i^T V_j − R_{ij}) · S(C_{U_i}, C_{V_j}) U_i + λ_V V_j,
∂E/∂C_{U_i} = ∑_{j=1}^{M} I_{ij} (S(C_{U_i}, C_{V_j}) · U_i^T V_j − R_{ij}) · U_i^T V_j S′(C_{U_i}, C_{V_j}),    (3.6)

where S′(C_{U_i}, C_{V_j}) is the derivative of S with respect to C_{U_i}.
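As an illustration, the update equations above can be sketched in a few lines of NumPy. This is a minimal sketch, not the dissertation's implementation: the helper names (`similarity`, `vpmf_step`), the learning rate, and the dense double loop are our own illustrative assumptions, and S′ is the derivative induced by the Euclidean similarity defined above.

```python
import numpy as np

def similarity(cu, cv):
    """Cost similarity S(C_Ui, C_Vj) = (2 - ||C_Ui - C_Vj||_2) / 2.
    With min-max normalized 2-d cost vectors, ||cu - cv||_2 <= sqrt(2) < 2,
    so the value stays within (0, 1]."""
    return (2.0 - np.linalg.norm(cu - cv)) / 2.0

def vpmf_step(R, I, U, V, CU, CV, lam_u=0.05, lam_v=0.005, lr=0.01):
    """One full gradient-descent step on the vPMF objective (Eq. 3.5).
    R: N x M ratings, I: N x M indicator, U: D x N, V: D x M,
    CU: N x 2 user cost vectors (learned), CV: M x 2 package costs (fixed)."""
    N, M = R.shape
    gU, gV, gC = np.zeros_like(U), np.zeros_like(V), np.zeros_like(CU)
    for i in range(N):
        for j in range(M):
            if not I[i, j]:
                continue
            s = similarity(CU[i], CV[j])
            x = U[:, i] @ V[:, j]
            err = s * x - R[i, j]              # prediction error
            gU[:, i] += err * s * V[:, j]      # Eq. (3.6), first line
            gV[:, j] += err * s * U[:, i]      # Eq. (3.6), second line
            # S'(C_Ui, C_Vj) = -(C_Ui - C_Vj) / (2 ||C_Ui - C_Vj||_2)
            d = CU[i] - CV[j]
            nrm = np.linalg.norm(d)
            if nrm > 1e-12:
                gC[i] += err * x * (-d / (2.0 * nrm))
    gU += lam_u * U
    gV += lam_v * V
    U -= lr * gU; V -= lr * gV; CU -= lr * gC
    return U, V, CU
```

With min-max normalized costs the similarity stays strictly positive, so the cost term rescales rather than zeroes out predictions; the model coincides with PMF only when user and package costs match exactly.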
3.3.2 The gPMF Model
In the real world, the user’s expectation on the financial and time cost of travel pack-
ages may vary within a certain range. Also, as shown in Equation (3.5), overfitting
can happen when we perform the optimization with respect to CUi(i = 1 · · ·N).
These two observations suggest that it might be better if we could use a distribution
to model the user’s cost preference, instead of representing it as a 2-dimension vec-
tor. Therefore, we propose to use a 2-dimensional Gaussian distribution to model the
user’s cost preference in the gPMF model as:
p(CUi|µCUi
, σ2CU
) = N (CUi|µCUi
, σ2CU
I). (3.7)
In Equation (3.7), µCUiis the mean of the 2-dimensional Gaussian distribution for
user Ui. σ2CU
is assumed to be the same for all the users for simplicity.
In the gPMF model, since we use a 2-dimensional Gaussian distribution to
represent the user's cost preference, we need to change the function that
measures the similarity/match between the user's cost preference and the package
cost information. Considering that each package's cost is represented by a
constant vector while the user's cost preference is characterized by a
distribution, we measure the similarity between the user's cost preference and
the package's cost as:

S_G(C_{V_j}, G(C_{U_i})) = N(C_{V_j} | µ_{C_{U_i}}, σ_{C_U}² I),    (3.8)

where G(C_{U_i}) denotes the 2-dimensional Gaussian distribution of user i. Note
that C_{U_i} in Equations (3.7) and (3.8) represents the random variable of the
user's cost distribution G(C_{U_i}), rather than a user cost vector. Along this
line, the function approximating the rating of item j by user i is defined as:

f_G(U_i, V_j, G(C_{U_i}), C_{V_j}) = S_G(C_{V_j}, G(C_{U_i})) · U_i^T V_j = N(C_{V_j} | µ_{C_{U_i}}, σ_{C_U}² I) · U_i^T V_j.    (3.9)
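Equation (3.8) is simply an isotropic bivariate Gaussian density evaluated at the package's cost vector. A sketch, with an illustrative function name of our own:

```python
import numpy as np

def gaussian_cost_similarity(cv, mu, var):
    """S_G(C_Vj, G(C_Ui)) = N(C_Vj | mu_CUi, var * I), Eq. (3.8).
    An isotropic 2-d Gaussian density at the package cost vector cv: the
    closer the package cost is to the user's preferred cost mu, the larger
    the similarity."""
    d = cv - mu
    return np.exp(-0.5 * (d @ d) / var) / (2.0 * np.pi * var)
```

Note that the peak value is 1/(2π·var), so for var ≥ 1/(2π) the similarity never exceeds 1, which matters later when this density is reused as a probability weight in the gLPMF model.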
With this representation of the user's cost preference and the similarity
function, a Bayesian inference similar to Equation (3.3) can be obtained:

p(U, V, µ_{C_U} | R, C_V, σ², σ_U², σ_V², σ_{C_U}²)
  ∝ p(R | U, V, µ_{C_U}, C_V, σ², σ_{C_U}²) p(C_V | µ_{C_U}, σ_{C_U}²) p(U | σ_U²) p(V | σ_V²)
  = ∏_{i=1}^{N} ∏_{j=1}^{M} [N(R_{ij} | f_G(U_i, V_j, G(C_{U_i}), C_{V_j}), σ²)]^{I_{ij}}
    × ∏_{i=1}^{N} ∏_{j=1}^{M} [N(C_{V_j} | µ_{C_{U_i}}, σ_{C_U}² I)]^{I_{ij}}
    × ∏_{i=1}^{N} N(U_i | 0, σ_U² I) × ∏_{j=1}^{M} N(V_j | 0, σ_V² I),    (3.10)

where µ_{C_U} = (µ_{C_{U_1}}, µ_{C_{U_2}}, …, µ_{C_{U_N}}) denotes the set of
means of all users' cost distributions, and p(C_V | µ_{C_U}, σ_{C_U}²) is the
likelihood given the parameters of all users' cost distributions. Given the
known ratings of a user, the costs of the packages rated by this user can be
treated as observations drawn from this user's cost distribution; this is why we
write the likelihood over C_V, i.e., the set of package costs, and derive it as
∏_{i=1}^{N} ∏_{j=1}^{M} [N(C_{V_j} | µ_{C_{U_i}}, σ_{C_U}² I)]^{I_{ij}} in
Equation (3.10).
Maximizing the log of this posterior over the means of all users' cost
distributions and the user and item latent features is equivalent to minimizing
the following sum-of-squared-errors objective function with quadratic
regularization terms with respect to U, V and
Figure 3.2. Graphical models of (a) PMF, (b) vPMF, and (c) gPMF.
µ_{C_U} = (µ_{C_{U_1}}, µ_{C_{U_2}}, …, µ_{C_{U_N}}):

E = (1/2) ∑_{i=1}^{N} ∑_{j=1}^{M} I_{ij} (R_{ij} − N(C_{V_j} | µ_{C_{U_i}}, σ_{C_U}² I) · U_i^T V_j)²
    + (λ_U/2) ∑_{i=1}^{N} ||U_i||_F² + (λ_V/2) ∑_{j=1}^{M} ||V_j||_F²
    + (λ_{C_U}/2) ∑_{i=1}^{N} ∑_{j=1}^{M} I_{ij} ||C_{V_j} − µ_{C_{U_i}}||²,    (3.11)

where λ_{C_U} = σ²/σ_{C_U}², λ_U = σ²/σ_U², and λ_V = σ²/σ_V². As we can see
from Equation
(3.11), the 2-dimensional Gaussian distribution for modeling the user's cost
preference adds one more regularization term to the objective function, which
eases overfitting. The gPMF model is also a generalization of PMF, because the
objective function in Equation (3.11) reduces to that of PMF if σ_{C_U}² is
taken to infinity. A local minimum of the objective function in Equation (3.11)
can be identified by performing gradient descent in U_i, V_j and µ_{C_{U_i}}.
For the same reason as before, we also utilize Min-Max normalization to
preprocess all item cost vectors before training the model.
In this chapter, instead of using Equations (3.2) and (3.9) directly, which may
produce predictions outside the valid rating range, we further apply the
logistic function g(x) = 1/(1 + exp(−x)) to their results, which bounds the
predictions to the range [0, 1]. Also, we map the observed ratings from the
original range [1, K] (K is the maximum rating value) to the interval [0, 1]
using the function t(x) = (x − 1)/(K − 1), so that the valid rating range
matches the range of our models' predictions. Eventually, to obtain the final
prediction for an unknown rating, we restore the scale of the predictions from
[0, 1] to [1, K] by applying the inverse of t(x).
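The three mappings above can be sketched directly (function names are ours):

```python
from math import exp

def g(x):
    """Logistic function bounding a raw prediction to (0, 1)."""
    return 1.0 / (1.0 + exp(-x))

def to_unit(r, K):
    """Map an observed rating from [1, K] to [0, 1]: t(x) = (x - 1)/(K - 1)."""
    return (r - 1.0) / (K - 1.0)

def from_unit(p, K):
    """Inverse of t: restore a bounded prediction from [0, 1] to [1, K]."""
    return p * (K - 1.0) + 1.0
```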
3.3.3 The Computational Complexity
The main computation of gradient methods is evaluating the objective function
and its gradients with respect to the variables. Because of the sparseness of
the matrix R, the computational complexity of evaluating the objective function
(3.5) is O(ηf), where η is the number of nonzero entries in R and f is the
number of latent factors. The computational complexity of the gradients ∂E/∂U,
∂E/∂V and ∂E/∂C_U in Equation (3.6) is also O(ηf). Thus, the total computational
complexity per iteration is O(ηf); that is, the computational cost of the vPMF
model is linear in the number of observed ratings in the sparse matrix R.
Similarly, the overall computational complexity of the gPMF model is also O(ηf),
because the only difference between gPMF and vPMF is that the cost similarity is
computed with the 2-dimensional Gaussian density instead of the Euclidean
distance. This complexity analysis shows that the proposed cost-aware models are
efficient and can scale to very large data. In addition, instead of performing
batch learning, we divide the training set into sub-batches and update all
latent features after each sub-batch to speed up training.
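The O(ηf) argument can be made concrete: the objective is evaluated by looping only over the η observed entries, and sub-batch updates simply partition those entries. A hedged sketch under our own naming, with the cost terms omitted for brevity:

```python
import numpy as np

def sparse_objective(rows, cols, vals, U, V, lam_u, lam_v):
    """Evaluate a PMF-style squared-error objective in O(eta * f) time by
    touching only the eta observed ratings given as (rows, cols, vals)
    triplets, never the full N x M matrix."""
    preds = np.einsum('di,di->i', U[:, rows], V[:, cols])  # eta dot products
    err = 0.5 * np.sum((vals - preds) ** 2)
    return err + 0.5 * lam_u * np.sum(U**2) + 0.5 * lam_v * np.sum(V**2)

def sub_batches(n_obs, batch_size, rng):
    """Shuffle the observed-rating indices and split them into sub-batches,
    so parameters can be updated after each sub-batch rather than once per
    full pass over the data."""
    idx = rng.permutation(n_obs)
    return [idx[k:k + batch_size] for k in range(0, n_obs, batch_size)]
```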
3.4 Cost-aware LPMF Models
In this section, we first briefly introduce the LPMF model, and then propose the
cost-aware LPMF models to incorporate the cost information. Note that, in this
section and Section 3.5, all notations, such as C_{U_i} and µ_{C_{U_i}}, have
the same meaning as in Section 3.3 unless specified otherwise.
3.4.1 The LPMF Model
LPMF (Yang et al., 2011) generalizes the PMF model by using the logistic
function in the loss function. Given binary ratings, R_{ij} follows a Bernoulli
distribution instead of a Normal distribution, and the logistic function is used
to model the rating as:

P(R_{ij} = 1 | U_i, V_j) = σ(U_i^T V_j) = 1/(1 + e^{−U_i^T V_j}),
P(R_{ij} = 0 | U_i, V_j) = 1 − P(R_{ij} = 1 | U_i, V_j) = 1/(1 + e^{U_i^T V_j}) = σ(−U_i^T V_j),

where R_{ij} = 1 means R_{ij} is a positive rating and R_{ij} = 0 indicates a
negative rating. Given the training set, i.e., all observed binary ratings, the
conditional likelihood over all available ratings is:

p(R | U, V) = ∏_{i=1}^{N} ∏_{j=1}^{M} ((P(R_{ij} = 1))^{R_{ij}} (1 − P(R_{ij} = 1))^{1−R_{ij}})^{I_{ij}},    (3.12)

where (P(R_{ij} = 1))^{R_{ij}} (1 − P(R_{ij} = 1))^{1−R_{ij}} is the Bernoulli
probability mass function, and I_{ij} is an indicator variable equal to 1 if
user i rated item j (as either positive or negative) and 0 otherwise.
To avoid the overfitting of Maximum Likelihood Estimation (MLE), we also place
Gaussian priors on U and V and find a Maximum A Posteriori (MAP) estimate of U
and V. The log of the posterior distribution over U and V is given by

ln p(U, V | R, σ_U², σ_V²)
  = ∑_{i=1}^{N} ∑_{j=1}^{M} I_{ij} (R_{ij} ln σ(U_i^T V_j) + (1 − R_{ij}) ln σ(−U_i^T V_j))
    − (1/(2σ_U²)) ∑_{i=1}^{N} U_i^T U_i − (1/(2σ_V²)) ∑_{j=1}^{M} V_j^T V_j
    − (1/2)(ND ln σ_U² + MD ln σ_V²) + C,    (3.13)
where C is a constant that does not depend on the parameters. By maximizing the
objective function, i.e., Equation 3.13, U and V can be estimated.
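For concreteness, the variable-dependent part of the log posterior in Equation (3.13) can be evaluated as follows. This is our own sketch (the function name and index-array layout are assumptions); the `logaddexp` identity ln σ(x) = −ln(1 + e^{−x}) is used for numerical stability, and the constant terms of Equation (3.13) are dropped:

```python
import numpy as np

def lpmf_log_posterior(rows, cols, r, U, V, var_u=0.85, var_v=0.85):
    """Log posterior of LPMF (Eq. 3.13) up to its constant terms: the
    Bernoulli log-likelihood of binary ratings r (0/1) under sigmoid(U_i^T V_j)
    plus zero-mean Gaussian log-priors on the latent features.
    rows/cols index the observed ratings; U is D x N, V is D x M."""
    x = np.einsum('di,di->i', U[:, rows], V[:, cols])
    # ln sigma(x) = -logaddexp(0, -x); ln sigma(-x) = -logaddexp(0, x)
    ll = np.sum(r * -np.logaddexp(0.0, -x) + (1 - r) * -np.logaddexp(0.0, x))
    return ll - np.sum(U**2) / (2 * var_u) - np.sum(V**2) / (2 * var_v)
```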
However, in our travel tour data set, the original ratings are not binary but
ordinal. Thus, we need to binarize the original ordinal ratings before training
the LPMF model. In fact, some research (Pan et al., 2008; Yang et al., 2011) has
shown that binarization can yield better recommendation performance in terms of
relevance and accuracy (Herlocker, Konstan, Terveen, John, & Riedl, 2004), and
we are interested in investigating this potential for our travel
recommendations. Specifically, a rating R_{ij} is considered positive if it is
equal to or greater than 1. However, in our travel tour data set, there are no
negative ratings available. In fact, in many recommendation applications, such
as YouTube.com and Epinions.com, negative ratings may be extremely few or
completely missing, because users are much less inclined to give negative
ratings for items they dislike than positive ratings for items they like, as
illustrated in (B. M. Marlin & Zemel, 2009, 2007). To this end, we adopt the
User-Oriented Sampling approach of (Pan et al., 2008; Pan & Scholz, 2009) to
obtain negative ratings. Basically, if a user has rated more items (travel
packages) with positive ratings, the items that she/he has not rated positively
are sampled as negative with higher probability. Overall, we control the number
of sampled negative ratings by setting α, the ratio of the number of negative
ratings to the number of positive ratings. For example, α = 0.1 means that the
number of negative ratings we sample is 10% of the number of positive ratings.
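One plausible reading of this user-oriented scheme can be sketched as follows; it is a sketch under our assumptions, not the exact procedure of Pan et al. (2008): each user draws a number of pseudo-negatives proportional (via α) to that user's positive-rating count, uniformly from the user's unrated items.

```python
import numpy as np

def sample_negatives(pos, n_items, alpha, seed=0):
    """User-oriented sampling of pseudo-negative ratings.
    pos: dict mapping user id -> set of positively rated item ids.
    Users with more positive ratings receive proportionally more sampled
    negatives, drawn uniformly from their unrated items; the total count is
    about alpha times the number of positives. Returns (user, item) pairs."""
    rng = np.random.default_rng(seed)
    neg = []
    for u, items in pos.items():
        k = int(round(alpha * len(items)))          # per-user negative budget
        candidates = np.array([j for j in range(n_items) if j not in items])
        k = min(k, len(candidates))
        for j in rng.choice(candidates, size=k, replace=False):
            neg.append((u, int(j)))
    return neg
```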
3.4.2 The vLPMF Model
Similar to the vPMF model, we first represent the user's cost preference with a
2-dimensional vector. Then we incorporate the cost information into the LPMF
model as:

P(R_{ij} = 1 | U_i, V_j) = S(C_{U_i}, C_{V_j}) · σ(U_i^T V_j) = S(C_{U_i}, C_{V_j}) / (1 + e^{−U_i^T V_j}),    (3.14)
P(R_{ij} = 0 | U_i, V_j) = 1 − P(R_{ij} = 1 | U_i, V_j) = 1 − S(C_{U_i}, C_{V_j}) · σ(U_i^T V_j).    (3.15)

Here the similarity S(C_{U_i}, C_{V_j}) needs to lie within the range [0, 1] so
that the conditional probability remains within [0, 1]. Thus, the similarity
function defined in subsection 3.3.1, i.e.,
S(C_{U_i}, C_{V_j}) = (2 − ||C_{U_i} − C_{V_j}||_2)/2, is also applicable here.
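The cost-weighted Bernoulli probability of Equation (3.14) is then a one-liner; `vlpmf_prob_positive` is our illustrative name, not the dissertation's:

```python
import numpy as np

def vlpmf_prob_positive(ui, vj, cu, cv):
    """P(R_ij = 1) = S(C_Ui, C_Vj) * sigmoid(U_i^T V_j), Eq. (3.14).
    Since S lies in [0, 1] and the sigmoid lies in (0, 1), the product is a
    valid probability; P(R_ij = 0) is simply one minus this value."""
    s = (2.0 - np.linalg.norm(cu - cv)) / 2.0
    return s / (1.0 + np.exp(-(ui @ vj)))
```

A mismatch in cost shrinks the positive-rating probability multiplicatively, which is how the cost preference enters the likelihood.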
Given the above formulation, we can write the log of the posterior distribution
over U, V and C_U as:

ln p(U, V, C_U | R, σ_U², σ_V², C_V)
  = ∑_{i=1}^{N} ∑_{j=1}^{M} I_{ij} { R_{ij} ln(S(C_{U_i}, C_{V_j}) σ(U_i^T V_j))
      + (1 − R_{ij}) ln(1 − S(C_{U_i}, C_{V_j}) σ(U_i^T V_j)) }
    − (1/(2σ_U²)) ∑_{i=1}^{N} U_i^T U_i − (1/(2σ_V²)) ∑_{j=1}^{M} V_j^T V_j
    − (1/2)(ND ln σ_U² + MD ln σ_V²) + C.    (3.16)

We search for a local maximum of the objective function in Equation (3.16) by
performing gradient ascent in U_i (1 ≤ i ≤ N), V_j (1 ≤ j ≤ M) and C_{U_i}
(1 ≤ i ≤ N). To save space, we omit the details of the partial derivatives.
3.4.3 The gLPMF Model
With the 2-dimensional Gaussian distribution for modeling the user's cost
preference, i.e., Equation (3.8), we update Equations (3.14) and (3.15) as:

P(R_{ij} = 1 | U_i, V_j) = S_G(C_{V_j}, G(C_{U_i})) · σ(U_i^T V_j) = S_G(C_{V_j}, G(C_{U_i})) / (1 + e^{−U_i^T V_j}),
P(R_{ij} = 0 | U_i, V_j) = 1 − P(R_{ij} = 1 | U_i, V_j) = 1 − S_G(C_{V_j}, G(C_{U_i})) · σ(U_i^T V_j),

where S_G(C_{V_j}, G(C_{U_i})) is defined in Equation (3.8). Here we also
constrain the similarity S_G(C_{V_j}, G(C_{U_i})) to lie within the range
[0, 1]. To apply this constraint, we limit the common variance σ_{C_U}² in
Equation (3.8) to a specific range, which will be discussed in section 5.4.
Then the log of the posterior distribution over U, V and µ_{C_U} can be written
as:

ln p(U, V, µ_{C_U} | R, σ_U², σ_V², σ_{C_U}², σ², C_V)
  = ∑_{i=1}^{N} ∑_{j=1}^{M} I_{ij} [ R_{ij} ln(S_G(C_{V_j}, G(C_{U_i})) σ(U_i^T V_j))
      + (1 − R_{ij}) ln(1 − S_G(C_{V_j}, G(C_{U_i})) σ(U_i^T V_j)) ]
    − (1/(2σ_{C_U}²)) ∑_{i=1}^{N} ∑_{j=1}^{M} I_{ij} (C_{V_j} − µ_{C_{U_i}})^T (C_{V_j} − µ_{C_{U_i}})
    − (1/(2σ_U²)) ∑_{i=1}^{N} U_i^T U_i − (1/(2σ_V²)) ∑_{j=1}^{M} V_j^T V_j
    − (1/2)[ (∑_{i=1}^{N} ∑_{j=1}^{M} I_{ij}) ln σ² + (∑_{i=1}^{N} ∑_{j=1}^{M} I_{ij}) ln σ_{C_U}²
      + ND ln σ_U² + MD ln σ_V² ] + C.    (3.17)
Finally, we search for a local maximum of the objective function in Equation
(3.17) by performing gradient ascent in U_i (1 ≤ i ≤ N), V_j (1 ≤ j ≤ M) and
µ_{C_{U_i}} (1 ≤ i ≤ N).

To predict an unknown rating R_{ij} as positive or negative with the LPMF,
vLPMF or gLPMF model, we compute the conditional probability P(R_{ij} = 1) with
the learned U_i, V_j, and C_{U_i} or µ_{C_{U_i}}. If P(R_{ij} = 1) is greater
than 0.5, we predict R_{ij} as positive; otherwise we predict it as negative. In
practice, we can also rank all items for a user by their probability of being
positive and recommend the top items.

The computational complexity of LPMF, vLPMF or gLPMF is also linear in the
number of available training ratings. We again divide the training set into
sub-batches and update all latent features sub-batch by sub-batch.
3.5 Cost-aware MMMF Models
In this section, we propose the cost-aware MMMF models after briefly introducing
the classic MMMF model. For the MMMF model and its cost-aware extensions, we
also take binary ratings as input.
3.5.1 The MMMF Model
MMMF (Srebro et al., 2005; Rennie & Srebro, 2005) allows an unbounded
dimensionality of the latent feature space by limiting the trace norm¹ of
X = U^T V. Specifically, given a matrix R with binary ratings, we minimize the
trace norm of X plus the hinge loss:

||X||_Σ + C ∑_{ij} I_{ij} h(X_{ij} R_{ij}),    (3.18)

where C is a trade-off parameter and h(·) is the smooth hinge loss function
(Rennie & Srebro, 2005):

h(z) = 1/2 − z           if z ≤ 0,
h(z) = (1/2)(1 − z)²     if 0 < z < 1,
h(z) = 0                 if z ≥ 1.

Note that for the MMMF model, we denote a positive rating by 1 and a negative
rating by −1, instead of 0. By minimizing the objective function in Equation
(3.18), we can estimate U and V. In addition, we adopt the same methods as
described in subsection 3.4.1 to binarize the original ordinal ratings and
obtain negative ratings.

¹ Also known as the nuclear norm and the Ky-Fan n-norm.
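The smooth hinge is straightforward to implement; a sketch:

```python
def smooth_hinge(z):
    """Smooth hinge loss of Rennie & Srebro (2005): the standard hinge,
    quadratically smoothed on (0, 1) so it is differentiable everywhere."""
    if z <= 0.0:
        return 0.5 - z
    if z < 1.0:
        return 0.5 * (1.0 - z) ** 2
    return 0.0
```

The two branch boundaries match (both give 0.5 at z = 0 and 0 at z = 1), which is what makes the loss smooth.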
3.5.2 The vMMMF Model
To incorporate both user and item cost information into the MMMF model, we
extend the smooth hinge loss function with the 2-dimensional user cost vector
as:

h(X_{ij}, C_{U_i}, C_{V_j}, R_{ij}) = h(S(C_{U_i}, C_{V_j}) X_{ij} R_{ij}).    (3.19)

Then the objective function in Equation (3.18) becomes:

||X||_Σ + C ∑_{ij} I_{ij} h(S(C_{U_i}, C_{V_j}) X_{ij} R_{ij}).    (3.20)
We can use different similarity measures for S(C_{U_i}, C_{V_j}), but the
similarity must be non-negative; otherwise the sign of X_{ij} R_{ij} could be
flipped by S(C_{U_i}, C_{V_j}). To this end, we still use the similarity
function defined in subsection 3.3.1.

To solve the minimization problem in Equation (3.20), we adopt the local search
heuristic suggested in (Rennie & Srebro, 2005), where it was shown that the
minimization problem in Equation (3.20) is equivalent to

G = (1/2)(||U||_F² + ||V||_F²) + C ∑_{ij} I_{ij} h(S(C_{U_i}, C_{V_j}) (U_i^T V_j) R_{ij}).    (3.21)

In other words, instead of searching over X, we search over pairs of matrices
(U, V), as well as the set of user cost vectors C_U = {C_{U_1}, …, C_{U_N}}, to
minimize the objective function in Equation (3.21). Finally, we use the gradient
descent algorithm of (Rennie & Srebro, 2005) to solve this optimization problem.
3.5.3 The gMMMF Model
Moreover, we extend the smooth hinge loss function with the 2-dimensional
Gaussian distribution, i.e., Equation (3.8), as:

h(X_{ij}, G(C_{U_i}), C_{V_j}, R_{ij}) = h(N(C_{V_j} | µ_{C_{U_i}}, σ_{C_U}² I) X_{ij} R_{ij}).    (3.22)

Here, N(C_{V_j} | µ_{C_{U_i}}, σ_{C_U}² I) is naturally positive because it is a
probability density function. Then, similar to Equation (3.21), we can derive a
new objective function:

G = (1/2)(||U||_F² + ||V||_F²) + C ∑_{ij} I_{ij} h(N(C_{V_j} | µ_{C_{U_i}}, σ_{C_U}² I) (U_i^T V_j) R_{ij}).    (3.23)

To solve the above problem, we also adopt the gradient descent algorithm used
for the vMMMF model.
To predict an unknown rating R_{ij} with MMMF, we compute U_i^T V_j: if
U_i^T V_j is greater than a threshold, R_{ij} is predicted as positive;
otherwise it is predicted as negative. With vMMMF and gMMMF, we predict an
unknown rating as positive or negative by thresholding
S(C_{U_i}, C_{V_j}) U_i^T V_j or N(C_{V_j} | µ_{C_{U_i}}, σ_{C_U}² I) U_i^T V_j
in the same way. Other methods (Rennie & Srebro, 2005; Srebro et al., 2005)
exist for deciding the final predictions, but we adopt this simple approach
because it is not the focus of this chapter.

The computational complexity of MMMF, vMMMF or gMMMF is also linear in the
number of available training ratings. Here, we adopt the same strategy to speed
up training.
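Prediction for the MMMF family then amounts to thresholding a (possibly cost-weighted) score. A sketch with illustrative names; for plain MMMF the weight s is 1, while for vMMMF/gMMMF it is the cost similarity or density:

```python
import numpy as np

def predict_sign(ui, vj, threshold=0.0, s=1.0):
    """Threshold the (optionally cost-weighted) score s * U_i^T V_j and
    return +1 (positive) or -1 (negative), matching the MMMF rating coding.
    Note that a positive weight s cannot flip the sign when threshold = 0,
    which is why the cost similarity must be non-negative."""
    return 1 if s * (ui @ vj) > threshold else -1
```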
3.6 Experimental Results
In this section, we evaluate the performances of the cost-aware collaborative filtering
methods on real-world travel data for travel tour recommendation.
3.6.1 The Experimental Setup
Experimental Data. The travel tour data set used in this chapter is provided by a
travel company. In the data set, there are more than 200,000 expense records starting
from the beginning of 2000 to October 2010. In addition to the Customer ID and
travel Package ID, there are many other attributes for each record, such as the cost
of the package, the travel days, the package name and some short descriptions of
the package, and the start date. Also, the data set includes some information about
the customers, such as age and gender. From these records, we are able to obtain
the information about users (tourists), items (packages) and user ratings. Moreover,
we are able to know the financial and time cost for each package from these tour
logs. Instead of using explicit ratings (e.g., scores from 1 to 5), which are
not available in our travel tour data, we use the purchasing frequency as an
implicit rating. The purchasing frequency has been widely used for measuring
the utility of an item for a user (Panniello, Tuzhilin, Gorgoglione, Palmisano,
& Pedone, 2009) in transaction-based recommender systems (Panniello et al.,
2009; Huang, Chung, & Chen, 2004; Pan et al., 2008; Huang, Li, & Chen, 2005).
Since a user may purchase the same package multiple times for her/his family
members, and many local travel packages are even consumed multiple times by the
same user, many implicit ratings are larger than 1, while over 60% of the
implicit ratings are 1.
Table 3.1. Some Characteristics of Travel Data

Statistics                   User    Package
Min Number of Ratings        4       4
Max Number of Ratings        62      1976
Average Number of Ratings    5.94    24.57
The tourism data is naturally much sparser than movie data. For instance, a
user may watch more than 50 movies each year, while few people travel more than
50 times a year. In fact, many tourists have only three or five travel records
in the data set. To reduce the challenge of sparseness, we simply drop users who
have traveled fewer than 4 times, as well as packages which have been purchased
fewer than 4 times. After this preprocessing, we have 34,007 ratings over 1,384
packages and 5,724 users. The sparseness of this data is thus still higher than
that of the well-known MovieLens² and EachMovie³ data sets. Finally, some
statistics of the item-user rating matrix of our travel tour data are
summarized in Table 3.1.
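The preprocessing rule can be sketched as follows. Note that dropping items can push a user back under the threshold, so one plausible implementation (ours, not necessarily the dissertation's exact procedure) iterates to a fixed point:

```python
from collections import Counter

def filter_sparse(ratings, min_user=4, min_item=4):
    """Iteratively drop users and items with fewer than min_user / min_item
    ratings, repeating until no further pair is removed.
    ratings: list of (user, item) pairs; returns the surviving pairs."""
    while True:
        uc = Counter(u for u, _ in ratings)
        ic = Counter(i for _, i in ratings)
        kept = [(u, i) for u, i in ratings
                if uc[u] >= min_user and ic[i] >= min_item]
        if len(kept) == len(ratings):
            return kept
        ratings = kept
```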
Experimental Platform. All the algorithms were implemented in MATLAB 2008a.
All the experiments were conducted on a Windows 7 machine with an Intel Core2
Quad Q8300 CPU and 6.00GB of RAM.
² http://www.cs.umn.edu/Research/GroupLens
³ HP retired the EachMovie data set.
3.6.2 Collaborative Filtering Methods
We have extended 3 different collaborative filtering models with two
representations of the user's cost preference; thus, we have 9 collaborative
filtering models in total in this experiment. Also, we compare our extended
cost-aware models with Regression-based Latent Factor Models (RLFM) (Agarwal &
Chen, 2009), which take the cost information of packages as item features and
incorporate such features into the matrix factorization framework. In (Agarwal
& Chen, 2009), two versions of RLFM were proposed, for Gaussian and binary
responses; both are used as additional baseline methods in the experiments of
this chapter. For ease of presentation, we denote these methods with the
acronyms in Table 3.2.
Table 3.2. The Notations of 9 Collaborative Filtering Methods
PMF Probabilistic Matrix Factorization
vPMF PMF + Vector-based Cost Representation
gPMF PMF + Gaussian-based Cost Representation
RLFM Regression-based Latent Factor Model for Gaussian response
LPMF Logistic Probabilistic Matrix Factorization
vLPMF LPMF + Vector-based Cost Representation
gLPMF LPMF + Gaussian-based Cost Representation
LRLFM Regression-based Latent Factor Model for Binary response
MMMF Maximum Margin Matrix Factorization
vMMMF MMMF + Vector-based Cost Representation
gMMMF MMMF + Gaussian-based Cost Representation
3.6.3 The Details of Training
First, we train the PMF model and its extensions with the original ordinal
ratings. For the PMF model, we empirically set the parameters λU = 0.05 and
λV = 0.005. For the vPMF and gPMF models, we use the same values of λU and λV,
together with λCU = 0.2 for the gPMF model. We set σ²CU = 0.09 for the gPMF
model in the following experiments. Also, we remove the global effect
(Q. Liu et al., 2010) by subtracting the average rating of the training set from
each rating before running the PMF-based models. Moreover, we initialize the
cost vector (e.g., CUi) or the mean of the 2-dimensional Gaussian distribution
(e.g., µCUi) for a user with the average cost of all items rated by that user,
while the user/item latent feature vectors are initialized randomly.
Second, we train LPMF, MMMF and their extensions with the binarized ratings.
We set different values of the ratio α in order to empirically examine how it
affects the performance of the LPMF- and MMMF-based methods. For the LPMF-based
models, the parameters are empirically set as σ²U = 0.85 and σ²V = 0.85. In
addition, σ²CU is set to 0.3 for the gLPMF model in order to constrain
SG(CVj, G(CUi)) to lie within the range [0, 1], as mentioned in subsection
3.4.3. For the MMMF-based approaches, the parameters are empirically set as
C = 1.8, and σ²CU = 0.09 for gMMMF. The cost vectors or the means of the
2-dimensional Gaussian distributions of users, and the user/item latent feature
vectors, are initialized in the same way as for the PMF-based approaches.
Finally, we use cross-validation to evaluate the performances of different methods.
We split all original ratings (or positive ratings) into two parts with a 90/10
ratio: 90% are used for training and 10% for testing. For each user-item pair in
the testing set, the item is considered relevant to the user in this experiment.
After drawing the 90% of positive ratings, we sample negative ratings according
to the chosen ratio α. We repeat this splitting five times independently and
report the average results over the five testing sets for all comparisons. In
addition, we stop the iteration of each approach at the same maximum number of
iterations, which is set to 60 in this experiment.
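The evaluation protocol above (a 90/10 split of the positive ratings, plus negatives sampled at ratio α) might be sketched as follows; the function and parameter names are illustrative rather than taken from the dissertation, and the per-user sampling details are an assumption:

```python
import random

def split_and_sample(positives, all_items, alpha, test_frac=0.1, seed=0):
    """90/10 split of positive (user, item) pairs, then negative sampling.

    For each sampled negative, a training user is drawn and paired with a
    random unrated item, until #negatives ~= alpha * #training positives.
    """
    rng = random.Random(seed)
    pos = list(positives)
    rng.shuffle(pos)
    n_test = int(len(pos) * test_frac)
    test, train_pos = pos[:n_test], pos[n_test:]

    rated = {}
    for u, i in positives:
        rated.setdefault(u, set()).add(i)

    train_neg = []
    n_neg = int(alpha * len(train_pos))
    items = list(all_items)
    while len(train_neg) < n_neg:
        u, _ = rng.choice(train_pos)
        j = rng.choice(items)
        if j not in rated[u]:          # keep only truly unrated pairs
            train_neg.append((u, j))
    return train_pos, train_neg, test
```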
3.6.4 Validation Metrics
We adopt Precision@K and Mean Average Precision (MAP) (Herlocker et al., 2004)
to evaluate the performance of all competing methods listed in subsection 3.6.2.
Moreover, we use Root Mean Square Error (RMSE) and Cumulative Distribution
(CD) (Koren, 2008) to examine the PMF-based methods from different
perspectives; both RMSE and CD are less suitable for evaluating the LPMF-based
and MMMF-based models, whose input is binary ratings.
Precision@K is calculated as:
Precision@K = \frac{\sum_{U_i \in U} |T_K(U_i)|}{\sum_{U_i \in U} |R_K(U_i)|},  (3.24)
where RK(Ui) is the top-K items recommended to user i, TK(Ui) denotes all truly
relevant items among RK(Ui), and U represents the set of all users in a test set. MAP
is the mean of average precision (AP) over all users in the test set. AP is calculated
as:
AP_u = \frac{\sum_{i=1}^{N} p(i) \times rel(i)}{\text{number of relevant items}},  (3.25)
where i is the position in the ranked list, N is the number of returned items,
p(i) is the precision of the list cut off at rank i, and rel(i) is an indicator
function equal to 1 if the item at position i is relevant and 0 otherwise. The
RMSE is defined as:
RMSE = \sqrt{\frac{\sum_{i,j} (r_{ij} - \hat{r}_{ij})^2}{N}},  (3.26)
where r_{ij} denotes the rating of item j by user i, \hat{r}_{ij} denotes the
corresponding rating predicted by the model, and N denotes the number of tested
ratings.
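A minimal sketch of the three metrics, in Python for illustration (Precision@K and AP are shown per user; Eq. (3.24) aggregates the same counts over all users):

```python
import math

def precision_at_k(recommended, relevant, k):
    """Per-user Precision@K: |top-k recommendations that are relevant| / k."""
    top_k = recommended[:k]
    return sum(1 for item in top_k if item in relevant) / k

def average_precision(recommended, relevant):
    """Eq. (3.25): average of precision at each rank holding a relevant item."""
    hits, score = 0, 0.0
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            hits += 1
            score += hits / rank
    return score / len(relevant) if relevant else 0.0

def rmse(pairs):
    """Eq. (3.26): pairs is an iterable of (true_rating, predicted_rating)."""
    pairs = list(pairs)
    return math.sqrt(sum((r - p) ** 2 for r, p in pairs) / len(pairs))
```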
CD (Koren, 2008) is designed to measure the quality of top-K recommendations;
it can explicitly guide the choice of K so that the suggested top-K set contains
the most interesting items with a certain probability. In the following, we
briefly introduce how to compute CD on the testing set (more details about this
validation method can be found in (Koren, 2008)). First, all highest ratings in
the testing set are selected; assume that we have M such ratings. For each item
i given the highest rating by a user u, we randomly select C additional items
and predict u's ratings for i and the C other items. Then, we order these C+1
items by their predicted ratings in decreasing order. There are C+1 possible
ranks for item i, ranging from the best case, where none (0%) of the C random
items appear before item i, to the worst case, where all (100%) of them do. For
each of the M ratings, we independently draw the C additional items, predict
the associated ratings, and derive a relative rank (RR) between 0% and 100%.
Finally, we analyze the distribution of all M RR observations and estimate the
cumulative distribution (CD). In our experiments, we
specify C = 200 and obtain 761 RR observations in total.
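The relative-rank procedure can be sketched as follows; `predict` stands for any trained model's scoring function (an illustrative interface, not the dissertation's code):

```python
import random

def relative_ranks(top_rated, candidates, predict, C=200, seed=0):
    """Return one relative rank in [0, 1] per (user, item) in top_rated.

    For each target item, C decoys are drawn from the candidate pool, all
    C+1 items are scored by predict(u, i), and the fraction of decoys that
    outscore the target is recorded.
    """
    rng = random.Random(seed)
    pool = list(candidates)
    rrs = []
    for u, i in top_rated:
        decoys = rng.sample([j for j in pool if j != i], C)
        target = predict(u, i)
        better = sum(1 for j in decoys if predict(u, j) > target)
        rrs.append(better / C)
    return rrs

def cumulative_distribution(rrs, x):
    """Empirical CD: fraction of relative ranks at or below x."""
    return sum(1 for r in rrs if r <= x) / len(rrs)
```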
3.6.5 The Performance Comparisons
In this subsection, we present comprehensive experimental comparisons of all
the methods using the four validation measures.
First, we examine how the incorporated cost information boosts the different
models under the different validation measures. Table 3.3 shows the comparisons
of all methods in terms of Precision@K and MAP. In Table 3.3, the dimension of
the latent factors (e.g., Ui, Vj) is set to 10 and the ratio α is set to 0.1
for the sampling of negative ratings. Precision@K is evaluated for different
values of K, i.e., K = 5 and K = 10. For example, compared with PMF,
Precision@5 of vPMF and gPMF increases by 7.54% and 13.58%, respectively, and
MAP increases by 4.21% and 17.71%, respectively. Similarly, vLPMF (gLPMF) and
vMMMF (gMMMF) outperform the LPMF and MMMF models in terms of Precision@K and
MAP. Also, vPMF (gPMF) and vLPMF (gLPMF) yield better performance than RLFM and
LRLFM. In addition, we can observe that MMMF, LPMF and their extensions produce
much better results than PMF and its extensions in terms of Precision@K and
MAP. There are two main reasons why the LPMF-based and MMMF-based methods
outperform the PMF-based methods. First, the loss functions of LPMF and MMMF
are more suitable for the travel package data, because over 60% of the known
ratings are 1. Second, sampled negative ratings are helpful because the unknown
ratings are actually not missing at random. For example, if a user has not
consumed a package so far, this probably indicates that the user does not like
Table 3.3. A Performance Comparison (10D Latent Features & α = 0.1)
Precision@5 Precision@10 MAP
PMF 0.0265 0.0154 0.0689
RLFM 0.0271 0.0167 0.0695
vPMF 0.0285 0.0181 0.0718
gPMF 0.0301 0.0193 0.0811
LPMF 0.0482 0.0339 0.1385
LRLFM 0.0486 0.0338 0.1394
vLPMF 0.0497 0.0342 0.1420
gLPMF 0.0501 0.0351 0.1460
MMMF 0.0545 0.0408 0.1571
vMMMF 0.0552 0.0411 0.1606
gMMMF 0.0558 0.0413 0.1629
this package. The sampled negative ratings partially capture this information
and contribute to the better performance of the LPMF-based and MMMF-based
methods.
Next, we make parallel comparisons in Table 3.4, where the dimension of latent
factors is set to 30 and α = 0.1. By comparing Table 3.4 with Table 3.3, we
find that increasing the dimension of latent factors generally boosts the
performance of all nine methods. Furthermore, in both tables, the 2-dimensional
Gaussian distribution for modeling users' cost preferences leads to better
results than the cost vector. All the above results show that it is helpful to
consider cost information for travel recommendations and that the chosen
representation of
Table 3.4. A Performance Comparison (30D Latent Features & α = 0.1)
Precision@5 Precision@10 MAP
PMF 0.0271 0.0167 0.0704
RLFM 0.0280 0.0175 0.0714
vPMF 0.0291 0.0184 0.0752
gPMF 0.0309 0.0194 0.0813
LPMF 0.0485 0.034 0.1355
LRLFM 0.0489 0.0341 0.1397
vLPMF 0.0498 0.0343 0.1423
gLPMF 0.0503 0.0354 0.1468
MMMF 0.0618 0.0472 0.1723
vMMMF 0.0629 0.0480 0.1737
gMMMF 0.0638 0.0487 0.1750
user’s cost preference may influence the performance of cost-aware models.
For the PMF-based methods, we also adopt RMSE and CD to evaluate their
performance, because they produce numerical predictions for unknown ratings. A
performance comparison in terms of RMSE with 10-dimensional and 30-dimensional
latent features is shown in Table 3.5. Also, we compare the performance of the
PMF-based models using the CD metric introduced in subsection 3.6.4. Figure 3.3
shows the cumulative distribution of the computed percentile ranks for the
three models over all 761 RR observations; note that 10-dimensional latent
features are used in Figure 3.3. As can be seen, both the vPMF and gPMF models
outperform the competing
Table 3.5. A Performance Comparison in terms of RMSE
PMF RLFM vPMF gPMF
10D Latent Features
RMSE 0.4981 0.4963 0.4951 0.4932
30D Latent Features
RMSE 0.4960 0.4928 0.4933 0.4913
model, i.e., the PMF model. For example, considering the point 0.1 on the
x-axis, the CD value of gPMF at this point suggests that, if we recommend the
top 20 out of 201 randomly-selected packages, at least one package matches the
user's interest and cost expectation with probability about 53%. Since people
are usually more interested in the top 5 or even the top 3 out of 201 packages,
we zoom in on the head of the x-axis, which shows top-K recommendation in more
detail. As shown in Figure 3.4, a clearer difference can be observed. For
example, the gPMF model has a probability of 0.5 of suggesting a highest-rated
package before 198 other packages. In other words, if we use gPMF to recommend
the top 2 packages out of 201, we match the user's needs with probability 0.5,
which outperforms PMF by over 60%. Also, vPMF leads to better performance than
PMF. In addition, we show more comparisons in Figures 3.5 and 3.6 with
30-dimensional latent features, where a similar trend can be observed.
Furthermore, we conduct statistical significance tests to show whether the
performance improvement of the cost-aware latent factor models is statistically
significant. The tests are based on the results in Tables 3.3, 3.4, 3.6 and 3.7.
[Figure: cumulative distribution versus relative rank for gPMF, vPMF, and PMF.]
Figure 3.3. A Performance Comparison in terms of CD (10D Latent Features).
Specifically, we first compute the difference between a performance measure of
one cost-aware model (e.g., vPMF or gPMF) and the corresponding measure of the
original model (i.e., PMF, LPMF or MMMF). For example, from Table 3.3, the
difference between Precision@5 of vPMF and Precision@5 of PMF is
0.0285 - 0.0265 = 0.002, and the difference between Precision@5 of gPMF and
Precision@5 of PMF is 0.0301 - 0.0265 = 0.0036. Along this line, from Table 3.3
we obtain 18 difference samples between the performance measures of the
cost-aware models and those of the original models (i.e., PMF, LPMF, and MMMF),
and from Tables 3.3, 3.4, 3.6 and 3.7 we obtain 60 difference samples in total.
Half of these samples are for the cost-aware models with the vector-based cost
representation, and half are for those with the Gaussian-based cost
representation. The
[Figure: cumulative distribution versus relative rank, zoomed to the head of the x-axis, for gPMF, vPMF, and PMF.]
Figure 3.4. A Local Performance Comparison in terms of CD (10D Latent Features).
statistical significance test is conducted separately on each half of these 60
samples, in order to examine the statistical significance of the improvement
under each cost representation. More specifically, the null hypothesis of each
test is that the mean of the difference samples is zero. For the 30 difference
samples of the vector-based cost representation, the sample mean is around
0.0015 and the sample standard deviation is around 0.0016, which yields a
one-tailed p-value of less than 0.0001. Thus, we reject the null hypothesis:
the mean of the difference samples is significantly larger than zero at the
0.01 significance level. For the other half of the 60 samples, corresponding to
the Gaussian-based cost representation, we reach the same conclusion.
In addition, we further conduct a similar statistical significance test using the
[Figure: cumulative distribution versus relative rank for gPMF, vPMF, and PMF.]
Figure 3.5. A Performance Comparison in terms of CD (30D Latent Features).
relative difference between the performance measures of the cost-aware models
and those of the original models. For example, from Table 3.3, the relative
difference between Precision@5 of vPMF and Precision@5 of PMF is
(0.0285 - 0.0265)/0.0265 = 0.07547. After obtaining all 60 samples of such
relative differences, we conduct a similar statistical test on each half of the
samples. The null hypothesis of each test is that there is no significant
difference between the mean of the relative-difference samples and µ0, the
assumed population mean of the relative differences. For the vector-based cost
representation, the conclusion is that the mean relative difference is
significantly larger than 0.018 at the 0.05 significance level; for the
Gaussian-based cost representation, it is significantly larger than 0.037 at
the 0.05 significance level.
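The one-sample, one-tailed t-test used here can be sketched as follows; the hard-coded critical value for 29 degrees of freedom at significance level 0.01 (about 2.462) is our assumption for the 30-sample case:

```python
import math
import statistics

def one_sample_t(samples, mu0=0.0):
    """t statistic for H0: mean(samples) == mu0 (one-tailed: mean > mu0)."""
    n = len(samples)
    mean = statistics.fmean(samples)
    sd = statistics.stdev(samples)        # sample standard deviation
    return (mean - mu0) / (sd / math.sqrt(n))

# Critical value of the t distribution, df = 29, one-tailed alpha = 0.01.
T_CRIT_29_01 = 2.462

def reject_null(samples, mu0=0.0, t_crit=T_CRIT_29_01):
    """True iff H0 is rejected in favor of mean(samples) > mu0."""
    return one_sample_t(samples, mu0) > t_crit
```

With the reported sample mean of about 0.0015 and standard deviation of about 0.0016 over 30 samples, t ≈ 0.0015/(0.0016/√30) ≈ 5.1, far above the critical value, consistent with the p-value below 0.0001.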
[Figure: cumulative distribution versus relative rank, zoomed to the head of the x-axis, for gPMF, vPMF, and PMF.]
Figure 3.6. A Local Performance Comparison in terms of CD (30D Latent Features).
3.6.6 The Performances with Different Values of α and D
As mentioned in subsection 3.6.3, the ratio α may influence the results of the
LPMF- and MMMF-based methods. To examine this point, we set α = 0.3 and produce
another set of results, shown in Table 3.6, where the dimension of latent
factors is 10. Comparing with Table 3.3, we observe that increasing α from 0.1
to 0.3 generally decreases the performance of the LPMF- and MMMF-based methods.
A similar trend can be observed in Table 3.7, where the dimension of latent
factors is 30. This is probably because the additional sampled negative ratings
are noisy, or inaccurate. Though more accurate training ratings should
generally yield better results, more noisy or inaccurate negative ratings may
lead to biased parameter estimation and worse predictions. On the contrary,
fewer but accurate
Table 3.6. A Performance Comparison (10D Latent Features & α = 0.3)
Precision@5 Precision@10 MAP
LPMF-based Methods
LPMF 0.0466 0.0329 0.1325
vLPMF 0.0472 0.033 0.1336
gLPMF 0.0475 0.034 0.1339
MMMF-based Methods
MMMF 0.053 0.0369 0.1507
vMMMF 0.0537 0.0369 0.1525
gMMMF 0.0541 0.0372 0.1534
sampled negative ratings may result in better performance. To further examine
this point, we show the performance of the MMMF-based models for a series of α
values in Figure 3.7, where the dimension of latent factors is also 10. As can
be seen in Figure 3.7, the performance in terms of Precision@5 and MAP first
increases and then decreases as α grows from 0 to 1.
By comparing Table 3.3 and Table 3.4, we can observe that increasing the
dimension of latent factors tends to lead to better performance. To further
investigate this observation, Figure 3.8 shows the Precision@10 of the latent
factor models versus the dimension of latent features. As can be seen,
Precision@K of all methods gradually increases as the dimension of latent
features becomes larger.
Table 3.7. A Performance Comparison (30D Latent Features & α = 0.3)
Precision@5 Precision@10 MAP
LPMF-based Methods
LPMF 0.0496 0.0340 0.1418
vLPMF 0.0497 0.0341 0.1422
gLPMF 0.0502 0.0355 0.1430
MMMF-based Methods
MMMF 0.0557 0.0376 0.1555
vMMMF 0.0563 0.0378 0.1585
gMMMF 0.0565 0.0379 0.1588
3.6.7 The Performances on Different Users
For most collaborative filtering models, the prediction performance varies
considerably across users with different numbers of observed ratings. In
particular, traditional collaborative filtering models may perform quite poorly
on users with very few ratings. However, the user and item cost information
acts as an effective constraint that tunes the prediction via the similarity
weight. Thus, our extended models with cost information are expected to perform
better than the traditional models on users with few ratings. To examine this
potential, we first group all users based on the number of observed ratings in
the training set, and then compare the performance of the different methods
over the user groups. Specifically, users are grouped into 5 classes: "1-5",
"6-10", "11-20", "21-30" and ">30". For example, the group "1-5" contains users
whose number of observed training ratings is between 1
[Figure, two panels: (a) Precision@5 and (b) MAP of MMMF, vMMMF, and gMMMF versus the ratio α.]
Figure 3.7. Performances with Different α (10D Latent Features).
and 5.
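The grouping can be sketched as follows (illustrative Python; the bin edges follow the five groups above):

```python
from collections import Counter, defaultdict

GROUPS = [("1-5", 1, 5), ("6-10", 6, 10), ("11-20", 11, 20),
          ("21-30", 21, 30), (">30", 31, float("inf"))]

def group_users(train_ratings):
    """Map each group label to the set of users whose number of training
    ratings falls in that range. train_ratings holds (user, item, rating)."""
    counts = Counter(u for u, _, _ in train_ratings)
    groups = defaultdict(set)
    for user, n in counts.items():
        for label, lo, hi in GROUPS:
            if lo <= n <= hi:
                groups[label].add(user)
                break
    return groups
```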
Table 3.8 shows the performance of the different methods in terms of
Precision@K and MAP, where the dimension of latent factors is 10 and the ratio
α is 0.1. As can be seen in Table 3.8, our extended models with the
incorporated cost information consistently outperform the traditional methods.
For example, for the group "1-5", the MAP of gPMF, gLPMF and gMMMF increases by
13.26% on average. In addition, the comparison of RMSE among the PMF-based
methods is shown in Figure 3.9, where the dimension of latent factors is also
10 and the RMSE is the value at the final iteration of each method.
Performances with Tail Packages and Users. In Table 3.9, we report the
performance of the different methods on the tail users and packages. Tail users
are those who have consumed fewer than four different travel packages; tail
packages are those purchased by fewer than four different users. These tail
users and packages usually contribute substantially to the high sparseness of
recommendation data (Y.-J. Park & Tuzhilin, 2008), and eventually cause the
average performance of collaborative
[Figure: Precision@10 of PMF, vPMF, gPMF, LPMF, vLPMF, and gLPMF versus the dimension of latent factors.]
Figure 3.8. Performances with Different D (α = 0.1).
filtering methods to decrease (Y.-J. Park & Tuzhilin, 2008). As shown in
Table 3.9, Precision@K and MAP are generally lower than those in Table 3.4.
While the long tail is a general and important topic in the recommender systems
field, it is not the focus of this chapter.
3.6.8 The Learned User’s Cost Information
By training the cost-aware latent factor models, we not only produce better
recommendation results, as shown in subsections 3.6.5 and 3.6.7, but also learn
latent user cost information. In the following, we illustrate the user cost
information learned by our models and demonstrate that it can help travel
companies with customer clustering or segmentation.
Since we normalize the package cost vectors into [0, 1] before feeding them
into our models, the learned user cost features (CU and µCU) have a similar
[Figure: RMSE of PMF, vPMF, and gPMF over the user groups 1-5, 5-10, 10-20, 20-30, and >30.]
Figure 3.9. The Performances on Different Users (10D Latent Features).
scale as the normalized package cost vectors. To visualize the learned CU, we
first restore the scale of the user cost features (CU and µCU) by applying the
inverse of the min-max normalization. Figure 3.10 shows the financial cost
feature of CU learned by the vPMF model for 40 randomly-selected users, where
each user corresponds to a column of vertically-distributed points. For
example, in the rightmost column, the star represents the learned user
financial cost feature, and the dots represent the financial costs of the
packages rated by this user in the training set. As we can see, the learned
financial cost feature is fairly representative of each user. However, there is
still obvious variance among the package cost features for some users, which is
why we apply the Gaussian distribution to model users' cost preferences. In
Figure 3.11, we visualize the µCU learned by gPMF for 12 randomly-selected
users. In each subfigure of Figure 3.11, we directly plot the learned
2-dimensional µCUi
[Figure: learned financial cost (RMB) for 40 users; each user's star marks the learned feature, and dots mark the costs of the packages the user rated.]
Figure 3.10. An Illustration of User Financial Cost.
(without the inverse transformation) for the individual user, together with all
normalized 2-dimensional cost vectors of the packages rated by that user in the
training set; µCUi is shown as the star, and each dot represents a package cost
vector.
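Restoring the scale of the learned cost features uses the inverse of min-max normalization; a minimal sketch, assuming per-dimension normalization by the minimum and maximum package cost:

```python
def minmax_fit(values):
    """Return (lo, hi) of the raw cost values used for normalization."""
    return min(values), max(values)

def minmax(x, lo, hi):
    """Map a raw cost into [0, 1]."""
    return (x - lo) / (hi - lo)

def minmax_inverse(y, lo, hi):
    """Map a learned cost feature in [0, 1] back to the original scale."""
    return y * (hi - lo) + lo
```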
The learned user latent features, e.g., Ui, from the PMF, LPMF or MMMF models
can be used to group users or customers. We argue that the learned user cost
information, in addition to the latent features, can improve customer
clustering or segmentation. To show this effect, we first cluster users using
only the latent features learned by PMF, representing each user by her/his
latent feature vector. We use the K-means algorithm to perform the clustering
and denote the result as Clu. Then, with the same clustering method, we cluster
users using both the latent features and the cost information, i.e., CU or µCU,
learned by vPMF and gPMF. Now each user is represented by a vector containing
her/his latent features
[Figure: twelve subplots of time cost versus financial cost, one per user; the star marks the learned µCUi and the dots mark package cost vectors.]
Figure 3.11. An Illustration of the Gaussian Parameters of User Cost.
and the cost vector CUi or µCUi. We denote this clustering result as Clu+.
However, there is no benchmark available to evaluate these two clustering
results with traditional external clustering validation measures (Wu, Xiong, &
Chen, 2009). To this end, we leverage the explicit cost information of the
items to compare the two clustering results. Specifically, for each user within
a cluster, we compute the average financial/time cost of all travel packages
consumed by this user; we then compute the variance of these per-user averages
across all users in the cluster. Table 3.10 shows the comparison of the two
clustering results in terms of this variance. Here the number of clusters is
set to 5 for the K-means algorithm, and
C1 indicates cluster 1. Also, in Table 3.10, Clu+ is obtained by using the µCU
learned by gPMF in addition to the user latent features. As can be seen from
Table 3.10, the average variance over the 5 clusters of Clu+ is much smaller
than that of Clu. From this perspective, the learned user cost information
improves the results of customer clustering or segmentation.
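The variance-based comparison can be sketched as follows (illustrative Python; `assignments` would come from K-means on the latent features, with or without the cost features):

```python
import statistics
from collections import defaultdict

def cluster_cost_variance(assignments, user_costs):
    """For each cluster, the variance of its users' average package costs.

    assignments: user -> cluster id; user_costs: user -> average financial
    (or time) cost of the packages that user consumed.
    """
    per_cluster = defaultdict(list)
    for user, cluster in assignments.items():
        per_cluster[cluster].append(user_costs[user])
    return {c: statistics.pvariance(vals) for c, vals in per_cluster.items()}

def mean_variance(variances):
    """Average within-cluster variance across all clusters."""
    return statistics.fmean(variances.values())
```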
3.6.9 An Efficiency Analysis
As stated in subsection 3.3.3, the computational complexity of the proposed
approaches is linear in the number of ratings, which indicates that the
extended models are theoretically scalable to very large data. Here, we examine
the efficiency of all the methods in this experiment. Table 3.11 shows the
training time of all 9 models with 10-dimensional latent features. Since there
is additional cost for computing the similarity functions and updating the cost
vectors or the parameters of the Gaussian distributions, the 6 cost-aware
models (e.g., vMMMF and gMMMF) require more time. In addition, the Gaussian
distribution costs more time than the 2-dimensional vector, because the
Gaussian prior adds one more regularization term to the objective functions.
Still, the computing time of the cost-aware models increases linearly with the
number of observed ratings, as discussed in subsection 3.3.3. In addition, we
show the convergence of the RMSEs on the test set for the PMF-based methods in
Figure 3.12. As can be seen, vPMF and gPMF quickly converge to relatively low
RMSEs within the first 25 iterations.
[Figure: test-set RMSE versus training epochs for gPMF, vPMF, and PMF.]
Figure 3.12. An Illustration of the Convergence of RMSEs (10D Latent Features).
3.7 Conclusion and Discussion
In this chapter, we studied the problem of travel tour recommendation by analyzing
a large amount of travel logs collected from a travel agent company. One unique
characteristic of tour recommendation is that there are different financial and time
costs associated with each travel package. Different tourists usually have different
affordability for these two aspects of cost. Thus, we explicitly incorporated
observable and unobservable cost factors into the recommendation models.
Specifically, we first proposed two ways to model users' cost preferences, and
with these two representations we incorporated the cost information into three
classic latent factor models for collaborative filtering: the Probabilistic
Matrix Factorization (PMF) model, the Logistic Probabilistic Matrix
Factorization (LPMF) model, and the Maximum Margin Matrix Factorization (MMMF)
model.
When applied to real-world travel tour data, the extended PMF, LPMF and MMMF
models consistently outperformed the classic PMF, LPMF and MMMF models, which
do not consider the cost information. Furthermore, the extended MMMF and LPMF
models achieve a larger performance improvement than the extended PMF models.
Finally, we demonstrated that the latent user cost information learned by these
models can help travel companies with customer segmentation.
Discussion. One may argue that some dimensions of the learned latent factors of
users/packages might implicitly capture the cost factors. However, it is hard
to identify which dimensions correspond to these factors. At the same time, in
our application (and in many others), the cost information is given explicitly,
and it is very natural to incorporate it into the model(s), which is what we do
in this chapter. Furthermore, through extensive experiments, we showed that
this additional information indeed boosts the performance of collaborative
filtering methods that do not take the cost information into account.
As shown in Table 3.9, tail users/packages lead to lower performance for the
different collaborative filtering methods. Since the long tail is a major
challenge in the recommendation field and is not the focus of this chapter, we
leave this topic for future work on travel package recommendation.
Like cost information, time sensitivity is another important factor for travel
package recommendations. For example, Orlando trips may be more attractive to
people in the Northeast of the US during winter. However, since the focus of
this chapter is on incorporating economic indicators, such as costs, into
recommendation models, we would
Table 3.8. The Performances on Different Users (10D Latent Features & α = 0.1)
Groups ”1-5” ”6-10” ”11-20” ”21-30” ”>30”
PMF
Precision@5 0.0211 0.0295 0.0482 0.072
MAP 0.0586 0.0784 0.0902 0.0958 0.0054
vPMF
Precision@5 0.0223 0.0306 0.0498 0.096
MAP 0.0573 0.0865 0.0959 0.1228 0.005
gPMF
Precision@5 0.0259 0.0308 0.053 0.096
MAP 0.0752 0.086 0.0937 0.1154 0.0045
LPMF
Precision@5 0.036 0.0488 0.0738 0.1419 0.0857
MAP 0.1109 0.1466 0.1836 0.2118 0.1722
vLPMF
Precision@5 0.0386 0.0496 0.0744 0.1419 0.0857
MAP 0.1186 0.1471 0.1863 0.2120 0.2613
gLPMF
Precision@5 0.0391 0.0500 0.0748 0.1426
MAP 0.1191 0.1479 0.1869 0.2128 0.2621
Table 3.9. Performances with Tail Users/Packages (30D Latent Features & α = 0.1)
Precision@5 Precision@10 MAP
PMF 0.0253 0.0148 0.0644
vPMF 0.0254 0.0157 0.0658
gPMF 0.0265 0.0164 0.0663
LPMF 0.043 0.0305 0.1286
vLPMF 0.0441 0.0324 0.1292
gLPMF 0.0462 0.0339 0.1315
MMMF 0.0553 0.0416 0.1651
vMMMF 0.0561 0.0431 0.1668
gMMMF 0.0578 0.0454 0.1683
Table 3.10. A Comparison of Variance
Results on Clu
C1 C2 C3 C4 C5 Average
Financial Variance 0.00091 0.00102 0.00079 0.00086 0.00114 0.000944
Time Variance 0.0292 0.0012 0.0321 0.0093 0.0125 0.0169
Results on Clu+
C1 C2 C3 C4 C5 Average
Financial Variance 0.00073 0.00105 0.00047 0.00090 0.00035 0.00070
Time Variance 0.0193 0.0009 0.0214 0.0098 0.0133 0.0129
Table 3.11. A Comparison of the Model Efficiency (10D Latent Features)
PMF vPMF gPMF
Training Time (Sec) 3.411 4.894 10.878
LPMF vLPMF gLPMF
Training Time (Sec) 63.452 81.411 201.329
MMMF vMMMF gMMMF
Training Time (Sec) 82.306 98.187 187.250
CHAPTER 4
A COCKTAIL APPROACH FOR TRAVEL PACKAGE RECOMMENDATION
Recent years have witnessed an increased interest in recommender systems. Despite
significant progress in this field, there still remain numerous avenues to explore. In-
deed, this chapter provides a study of exploiting online travel information for person-
alized travel package recommendation. A critical challenge along this line is to address
the unique characteristics of travel data, which distinguish travel packages from tra-
ditional items for recommendation. To that end, in this chapter, we first analyze
the characteristics of the existing travel packages and develop a Tourist-Area-Season
Topic (TAST) model. This TAST model can represent travel packages and tourists
by different topic distributions, where the topic extraction is conditioned on both the
tourists and the intrinsic features (i.e. locations, travel seasons) of the landscapes.
Then, based on this topic model representation, we propose a cocktail approach to
generate the lists for personalized travel package recommendation. Furthermore, we
extend the TAST model to the Tourist-Relation-Area-Season Topic (TRAST) model
for capturing the latent relationships among the tourists in each travel group. Finally,
we evaluate the TAST model, the TRAST model, and the cocktail recommendation
approach on the real-world travel package data. Experimental results show that the
TAST model can effectively capture the unique characteristics of the travel data and
the cocktail approach is thus much more effective than traditional recommendation
techniques for travel package recommendation. Also, by considering tourist relation-
ships, the TRAST model can be used as an effective assessment for travel group
formation.
4.1 Introduction
As an emerging trend, more and more travel companies provide online services. How-
ever, the rapid growth of online travel information imposes an increasing challenge for
tourists who have to choose from a large number of available travel packages for satis-
fying their personalized needs. Moreover, to increase their profits, travel companies
have to understand the preferences of different tourists and offer more attractive
packages. Therefore, the demand for intelligent travel services is expected to increase
dramatically.
Since recommender systems have been successfully applied to enhance the quality
of service in a number of fields (Adomavicius & Tuzhilin, 2005; Ge et al., 2010), it
is a natural choice to provide travel package recommendations. Actually, recommen-
dations for tourists have been studied before (Abowd et al., 1997; Averjanova et al.,
2008; Cena et al., 2006), and to the best of our knowledge, the first operative tourism
recommender system was introduced by Delgado et al. (Delgado & Davidson, 2002).
Despite the increasing interest in this field, the problem of leveraging unique features
to distinguish personalized travel package recommendation from traditional
recommender systems remains largely open.
Indeed, there are many technical and domain challenges inherent in designing and
implementing an effective recommender system for personalized travel package recom-
mendation. First, travel data are much sparser than data for traditional items, such
as movies, because taking a trip is much more expensive than watching a movie
(Ge, Liu, et al., 2011). Second, every travel package
consists of many landscapes (places of interest and attractions), and thus has intrinsic
complex spatio-temporal relationships. For example, a travel package typically only
includes landscapes that are geographically co-located. Also, different travel
packages are usually developed for different travel seasons. Therefore, the landscapes
in a travel package usually have spatial-temporal autocorrelations. Third, traditional
recommender systems usually rely on explicit user ratings. However, for travel data,
the user ratings are usually not conveniently available. Finally, the traditional items
for recommendation usually have a long period of stable value, while the values of
travel packages can easily depreciate over time and a package usually only lasts for
a certain period of time. The travel companies need to actively create new tour
packages to replace the old ones based on the interests of the tourists.
To address these challenges, in our preliminary work (Q. Liu, Ge, Li, Xiong, &
Chen, 2011), we proposed a cocktail approach for personalized travel package
recommendation. Specifically, we first analyze the key characteristics of the existing travel
packages. Along this line, travel time and travel destinations are divided into differ-
ent seasons and areas. Then, we develop a Tourist-Area-Season Topic (TAST) model,
which can represent travel packages and tourists by different topic distributions. In
the TAST model, the extraction of topics is conditioned on both the tourists and the
intrinsic features (i.e. locations, travel seasons) of the landscapes. As a result, the
TAST model can well represent the content of the travel packages and the interests
Figure 4.1. An illustration of the chapter contribution.
of the tourists. Based on this TAST model, a cocktail approach is developed for
personalized travel package recommendation by considering some additional factors
including the seasonal behaviors of tourists, the prices of travel packages, and the cold
start problem of new packages. Finally, the experimental results on real-world travel
data show that the TAST model can effectively capture the unique characteristics of
travel data and the cocktail recommendation approach performs much better than
traditional techniques.
In this chapter, we further study some related topic models of the TAST model,
and explain the corresponding travel package recommendation strategies based on
them. Also, we propose the Tourist-Relation-Area-Season Topic (TRAST) model,
which helps understand the reasons why tourists form a travel group. This goes be-
yond personalized package recommendations and is helpful for capturing the latent
relationships among the tourists in each travel group. In addition, we conduct sys-
tematic experiments on the real-world data. These experiments not only demonstrate
Figure 4.2. An example of the travel package “Niagara Falls Discovery”, where the
landscapes are represented by the words in red.
that the TRAST model can be used as an assessment for automatic travel group
formation but also provide more insights into the TAST model and the cocktail recom-
mendation approach. In summary, the contributions of the TAST model, the cocktail
approaches and the TRAST model for travel package recommendations are shown in
Fig. 4.1, where each dashed rectangular box in the dashed circle identifies a travel
group and the tourists in the same travel group are represented by the same icons.
4.2 Concepts and Data Description
In this section, we first introduce the basic concepts, and then describe the recom-
mendation scenario of this study. Finally, we provide the detailed information about
the unique characteristics of travel package data.
Definition 4 A travel package is a general service package provided by a travel
company for an individual or a group of tourists based on their travel preferences. A
package usually consists of the landscapes and some related information, such as the
price, the travel period, and the transportation means.
Specifically, the travel topics are the themes designed for this package, and the
landscapes are the travel places of interest and attractions, which usually locate in
nearby areas.
Following Definition 4, an example document for a package named “Niagara
Falls Discovery” from STA Travel1 is shown in Fig. 4.2. It includes the travel
topics (tour style), travel days, price, travel area (the northeastern U.S.), and land-
scapes (e.g., Niagara Falls) etc. Note that different packages may include the same
landscapes and each landscape can be used for multiple packages. Meanwhile, for
various reasons, the tourists of each individual package are often divided into different
travel groups (i.e., sets of people traveling together). In addition, each package has a travel
schedule and most of the packages will be traveled only in a given time (season) of
the year, i.e., they have strong seasonal patterns. For example, the “Maple Leaf
Adventures” is usually meaningful in Fall.
In this chapter, we aim to make personalized travel package recommendations for
the tourists. Thus, the users are the tourists and the items are the existing packages,
and we exploit a real-world travel data set provided by a travel company in China for
building recommender systems. There are nearly 220,000 expense records (purchases
of individual tourists) from January 2000 to October 2010. From this data set,
we extracted 23,351 useful records of 7,749 travel groups for 5,211 tourists from
908 domestic and international packages, such that each tourist has traveled at
least two different packages. The extracted data contain 1,065 different landscapes
located in 139 cities from 10 countries. On average, each package has 11 different
landscapes, and each tourist has traveled 4.4 times.
1STA Travel, URL:http://www.statravel.com/
As illustrated in our preliminary work (Q. Liu et al., 2011), there are some unique
characteristics of the travel data. First, the data are very sparse: each tourist has only a
few travel records. This extreme sparseness makes it difficult to apply
traditional recommendation techniques, such as collaborative filtering. For example,
it is hard to find the credible nearest neighbors for the tourists because there are very
few co-traveling packages.
Second, the travel data has strong time dependence. Travel packages often
have a life cycle that follows changes in business demand, i.e., they only last
for a certain period. In contrast, most of the landscapes will still be active after
the original package has been discarded. These landscapes can be used to form
new packages together with some other landscapes. Thus, we can observe that the
landscapes are more sustainable and important than the packages themselves.
Third, each landscape has some intrinsic features, such as its geographic location and
suitable travel seasons. Only the landscapes with similar spatial-temporal features are
suitable for the same packages, i.e., the landscapes in one package have spatial-
temporal auto-correlations and follow the first law of geography: everything is related
to everything else, but nearby things are more related than distant things (Cressie,
1991). Therefore, when making recommendations, we should take the landscapes’
spatial-temporal correlations into consideration so as to describe the tourists and the
packages precisely.
Fourth, the tourists will consider both time and financial costs before they accept
a package. This is quite different from the traditional recommendations where the
cost of an item is usually not a concern. Thus, it is very important to profile the
tourists based on their interests as well as the time and the money they can afford.
Since a package with a higher price often tends to take more travel time (and vice versa),
in this chapter we only take the price factor into consideration.
Fifth, people often travel with their friends, family or colleagues. Even when two
tourists in the same travel group are total strangers, there must be some reason
for the travel company to put them together. For instance, they may be of the same
age or have the same travel schedule. Hence, it is also very important to understand
the relationships among the tourists in the same travel group. This understanding
can help to form the travel group.
Last but not least, few tourist ratings are available for travel packages. However,
we can see that every choice of a travel package indicates the strong interest of the
tourist in the content provided in the package.
In summary, these characteristics lead to three major challenges: first, how to
compare the interests of tourists with the content of the travel packages; second, how
to make package recommendations for each tourist; and third, how to capture the tourist
relationships to form a travel group. As a result, it is necessary to develop more
suitable approaches for travel package recommendation.
4.3 The TAST Model
In this section, we show how to represent the packages and tourists by a topic model,
like the methods in (Blei, Ng, Jordan, & Lafferty, 2003) based on Bayesian networks,
so that the similarity between packages and tourists can be measured. Table 4.1 lists
some mathematical notations in this chapter.
Table 4.1. Mathematical notations.

  U = {U_1, U_2, ..., U_M}                           the set of tourists
  S = {S_1, S_2, ..., S_J}                           the set of seasons
  P = {P_1, P_2, ..., P_N}                           the set of packages
  T = {T_1, T_2, ..., T_Z}                           the set of topics
  A = {A_1, A_2, ..., A_O}                           the set of different areas
  P′ = {P′_1, P′_2, ..., P′_D}                       packages for travel logs
  P′′ = {P′′_1, P′′_2, ..., P′′_{D′}}                packages for travel group logs
  L_{A_i} = {L_{A_i,1}, ..., L_{A_i,|A_i|}}          landscape set for area A_i
  L_{P′_i} = {L_{P′_i,1}, ..., L_{P′_i,|P′_i|}}      landscapes for the package P′_i
  L_{P′′_i} = {L_{P′′_i,1}, ..., L_{P′′_i,|P′′_i|}}  landscapes for the package P′′_i
4.3.1 Topic Model Representation
When designing a travel package, we assume that people in travel companies often
consider the following issues. First, it is necessary to determine the set of target
tourists, the travel seasons, and the travel places. Second, one or multiple travel
topics (e.g., “The Sunshine Trip”) will be chosen based on the category of target
tourists and the scheduled travel seasons. Each package and landscape can be viewed
as a mixture of a number of travel topics. Then, the landscapes will be determined
according to the travel topics and the geographic locations. Finally, some additional
information (e.g., price, transportation, and accommodations) should be included.
According to these processes, we formalize package generation as a What-Who-When-
Where (4W) problem. Here, we omit the additional information, and each W stands
for the travel topics, the target tourists, the seasons and the corresponding landscape
located areas, respectively. These four factors are strongly correlated.
Formally, we recast the generation of a package in a topic model style, where we
treat it mainly as a landscape-drawing problem: the landscapes of the package are
drawn from the landscape set one by one. To choose a landscape, we first draw a
topic from the distribution over topics specific to the given tourist and season; then
the landscape is generated from the chosen topic and travel area. We call our model
for package representation the TAST (Tourist-Area-Season Topic) model. Please
note that a topic in TAST is different from a real travel topic: the former is a latent
factor extracted by the topic model, while the latter is an explicit travel theme
identified in the real world, and the latent topics are used to simulate the real ones.
Without loss of generality, we use “travel topic” and “topic” to stand for the real
and latent topic, respectively.
Mathematically, the generative process corresponds to the hierarchical Bayesian
model for TAST is shown in Fig. 4.3, where shaded and unshaded variables indi-
cate observed and latent variables, respectively. The TAST model follows similar
Dirichlet distribution assumptions as (Blei et al., 2003), and here the landscapes are the
“tokens” for topic modelling. In the TAST model, the notation P′_d is different from
P_d: P_d is the ID of a package in the package set, while P′_d stands for the package
ID of one travel log. Each travel log can be distinguished by a vector of three
attributes 〈P′_d, U_d, timestamp〉, where the timestamp can be further projected
to a season S_d, and P′_d = 〈L_{P′_d}, A_d, price〉 (the price factor will be considered
later).

Figure 4.3. TAST: A graphical model.

Specifically, in Fig. 4.3, each package P′_d is represented as a vector of |L_{P′_d}|
landscapes, where each landscape l is chosen from one area a with a ∈ A_d (A_d
includes the located area(s) for P′_d), and (U_d, S_d) is the
specific tourist-season pair. t is a topic which is chosen from the set T with Z top-
ics. θ and φ correspond to the topic distribution and landscape distribution specific
to each tourist-season pair and area-topic pair respectively, where α and β are the
corresponding hyperparameters.
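The generative story above can be sketched in code. The following is a minimal, hypothetical simulation (all dimensions, hyperparameter values, and function names are illustrative assumptions, not the chapter's actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not taken from the chapter's data).
Z = 20                          # number of topics
N_LANDSCAPES_PER_AREA = 50      # landscapes in each area
ALPHA, BETA = 0.5, 0.1          # symmetric Dirichlet hyperparameters

def generate_travel_log(areas, n_landscapes, theta_us, phi):
    """Draw the landscapes of one travel log P'_d.

    theta_us : topic distribution for the (tourist, season) pair, shape (Z,)
    phi      : landscape distributions, phi[a][t] over the landscapes of area a
    areas    : candidate areas A_d for this package
    """
    landscapes = []
    for _ in range(n_landscapes):
        t = rng.choice(Z, p=theta_us)                 # choose a topic for this token
        a = rng.choice(areas)                         # choose an area from A_d
        l = rng.choice(len(phi[a][t]), p=phi[a][t])   # draw a landscape token
        landscapes.append((a, l))
    return landscapes

# Draw model parameters from their Dirichlet priors.
theta_us = rng.dirichlet([ALPHA] * Z)
phi = {a: [rng.dirichlet([BETA] * N_LANDSCAPES_PER_AREA) for _ in range(Z)]
       for a in range(7)}                             # 7 areas, as in Section 4.3.3

log = generate_travel_log(areas=[0, 1], n_landscapes=11,
                          theta_us=theta_us, phi=phi)
```

The 11 landscapes per package match the data-set average reported in Section 4.2; everything else here is a toy configuration.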
The distributions, such as θ and φ, can be extracted after inferring this TAST
model (“invert” the generative process and “generate” latent variables). The general
idea is to find a latent variable (e.g., topic) setting so as to get a marginal distribution
of the travel log set P′:
\[
p(\mathbf{P}'\,|\,\alpha,\beta,U,S,A) = \iint \prod_{m=1}^{M}\prod_{j=1}^{J} p(\theta_{mj}|\alpha) \prod_{o=1}^{O}\prod_{k=1}^{Z} p(\phi_{ok}|\beta) \prod_{d=1}^{D}\prod_{i=1}^{|L_{P'_d}|} \sum_{t_{di}=1}^{Z} \Big( p(t_{di}|\theta_{U_d S_d}) \sum_{a_{di}\in A_d} p(a_{di}|A_d)\, p(l_{di}|\phi_{a_{di} t_{di}}) \Big)\, d\phi\, d\theta
\]
4.3.2 Model Inference
Since exact inference for models in the LDA family cannot be carried out in closed
form, a variety of algorithms have been developed to estimate the parameters of
these models. In this chapter, we exploit the Gibbs sampling method (Griffiths &
Steyvers, 2004), a form of Markov chain Monte Carlo, which is easy to implement and
provides a relatively efficient way for extracting a set of topics from a large set of travel
logs. During the Gibbs sampling, the generation of each landscape token for a given
travel log depends on the topic distribution of the corresponding tourist-season pair
and the landscape distribution of the area-topic pair. Finally, the posterior estimates
of θ and φ given the training set can be calculated by:
\[
\hat{\theta}_{mjt} = \frac{\alpha_t + n_{mjt}}{\sum_{k=1}^{Z}(\alpha_k + n_{mjk})}, \qquad
\hat{\phi}_{okl} = \frac{\beta_l + m_{okl}}{\sum_{q=1}^{|A_o|}(\beta_q + m_{okq})} \tag{4.1}
\]
where |Ao| is the number of landscapes in area Ao, nmjt is the number of landscape
tokens assigned to topic Tt and tourist-season pair (Um, Sj), and mokl is the number
of tokens of landscape Ll assigned to area-topic pair (Ao, Tk). Take the topic
assignment for “Central Park” as an example: in each iteration, the topic assignment
of one “Central Park” token depends on not only the topics of the landscapes traveled
by the tourist in the given season but also the topics of the other landscapes located
nearby. Meanwhile, many other posterior probabilities can also be estimated, e.g.,
the topic distribution of tourist Ui and package Pi:
\[
\vartheta^{U}_{ij} = \frac{\alpha_j + \sum_{s=1}^{J} n_{isj}}{\sum_{k=1}^{Z}\big(\alpha_k + \sum_{s=1}^{J} n_{isk}\big)}, \qquad
\vartheta^{P}_{ij} = \frac{\alpha_j + h_{ij}}{\sum_{k=1}^{Z}(\alpha_k + h_{ik})} \tag{4.2}
\]
(a) The TT model. (b) The TAT model. (c) The TST model.
Figure 4.4. The three related topic models.
where hij is the number of the landscape tokens in package Pi and these tokens are
assigned to topic Tj.
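Under symmetric hyperparameters, the posterior estimates in Eqs. (4.1) and (4.2) reduce to smoothed, normalized counts. A minimal sketch, assuming the count arrays come from a Gibbs sampler (the shapes and toy values are illustrative):

```python
import numpy as np

def posterior_theta(n, alpha):
    """Eq. (4.1), left: n[m, j, t] = number of landscape tokens assigned to
    topic t for tourist-season pair (U_m, S_j); symmetric alpha assumed."""
    return (alpha + n) / (alpha + n).sum(axis=-1, keepdims=True)

def posterior_phi(m_counts, beta):
    """Eq. (4.1), right: m_counts[o, k, l] = tokens of landscape l assigned
    to area-topic pair (A_o, T_k); symmetric beta assumed."""
    return (beta + m_counts) / (beta + m_counts).sum(axis=-1, keepdims=True)

def tourist_topic_dist(n, alpha):
    """Eq. (4.2), left: sum the counts over seasons, then normalize."""
    s = n.sum(axis=1)                               # shape (M, Z)
    return (alpha + s) / (alpha + s).sum(axis=-1, keepdims=True)

# Toy counts: 2 tourists, 3 seasons, 4 topics.
n = np.zeros((2, 3, 4))
n[0, 0, 1] = 5
n[0, 1, 2] = 3
theta = posterior_theta(n, alpha=0.5)               # shape (2, 3, 4)
vartheta_u = tourist_topic_dist(n, alpha=0.5)       # shape (2, 4)
```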
After Gibbs sampling, all the tourists and packages are represented by Z-entry
topic distribution vectors (Z, the number of topics, is usually in the range
of [20, 100]). For example, a tourist who traveled “Tour in Disneyland, Hongkong”
and “Christmas day in Hongkong”, may have high probabilities on the entries that
stand for the topics such as “amusement parks” and “Hongkong”. By computing
the similarity of the topic distribution vectors, we can find the similarity between
the corresponding tourists and packages. There are also many other benefits of the
TAST model, e.g., we can learn the popular topics in each season and find the popular
landscapes for each topic.
4.3.3 Area/Seasons Segmentation
There are two extremes for the coverage of each area Ai and each season Si: we can
view the whole earth as an area and the entire year as a season, or we can view
each landscape itself as an area and each month as a different season. However, the
first extreme is too coarse to capture the spatial-temporal auto-correlations, while
the second leads to overfitting and makes the Gibbs sampling difficult to converge.
To this end, we divide the entire location space in our data set into 7 big areas
according to the travel area segmentations provided by the travel company, which
are South China (SC), Center China (CC), North China (NC), East Asia (EA),
Southeast Asia (SA), Oceania (OC), and North America (NA). To obtain a more
reasonable season split, we assume that most packages are seasonal and use an
information-gain-based method (Fayyad & Irani, 1993) to find the season splits. The
information entropy of a season S^P is Ent(S^P) = −Σ_{i=1}^{|S^P|} p_i log(p_i), where
|S^P| is the number of different packages in S^P and p_i is the proportion of package
P_i in this season. Initially, the entire year is viewed as one big season, and then we
partition it into several seasons recursively. In each iteration, we use the weighted
average entropy (WAE) to find the best split:
\[
WAE(i; S^P) = \frac{|S^P_1(i)|}{|S^P|}\, Ent\big(S^P_1(i)\big) + \frac{|S^P_2(i)|}{|S^P|}\, Ent\big(S^P_2(i)\big)
\]

where S^P_1(i) and S^P_2(i) are the two sub-seasons of season S^P when split at
the i-th month. The best split month induces the maximum information gain
ΔE(i) = Ent(S^P) − WAE(i; S^P).
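The recursive entropy-based splitting can be sketched as follows; the input format (a list of per-month package-ID lists) and the function names are illustrative assumptions:

```python
import math
from collections import Counter

def entropy(pkgs):
    """Ent(S^P) = -sum_i p_i log(p_i) over the package proportions."""
    total = len(pkgs)
    counts = Counter(pkgs)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def best_split(months):
    """Find the split month i maximizing the information gain
    Delta_E(i) = Ent(S^P) - WAE(i; S^P).
    `months` is a list of lists of package IDs, one list per month."""
    all_pkgs = [p for m in months for p in m]
    base = entropy(all_pkgs)
    best_i, best_gain = None, 0.0
    for i in range(1, len(months)):
        left = [p for m in months[:i] for p in m]
        right = [p for m in months[i:] for p in m]
        wae = (len(left) / len(all_pkgs)) * entropy(left) \
            + (len(right) / len(all_pkgs)) * entropy(right)
        gain = base - wae
        if gain > best_gain:
            best_i, best_gain = i, gain
    return best_i, best_gain

# Toy example: packages 'a'/'b' dominate the first half, 'c'/'d' the second.
months = [['a', 'a'], ['a', 'b'], ['b'], ['c'], ['c', 'd'], ['d', 'd']]
split, gain = best_split(months)     # split falls between months 3 and 4
```

Applied recursively to each resulting sub-season, this yields the season partition used by the model.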
4.3.4 Related Topic Models
While the generation process in TAST is similar to those in text modelling
problems (Blei et al., 2003), the TAST model is quite different from these traditional
models (e.g., the LDA, AT, and ART models). The TAST model
has a crucial enhancement by considering the intrinsic features (i.e., location, travel
seasons) of the landscapes, and thus it can effectively capture the spatial-temporal
auto-correlations among landscapes. The benefit is that the TAST model can describe
the travel package and the tourist interests more precisely, because the nearby land-
scapes or the landscapes preferred by the same tourists tend to have the same topic.
In addition, text modelling approaches assume that the words in an email/article
are generated by multiple authors, while we assume that the landscapes in a
package are generated for the specific tourist of each travel log. Therefore, each single
text is considered only once in the text models, whereas each package may appear
many times in the TAST model, once for each of its records in the travel logs.
Indeed, as shown in Fig. 4.4, there are three related topic models. The first one
(Fig. 4.4(a)) is the Tourist Topic (TT) model, which does not consider the travel
area and travel season factors. The second one (Fig. 4.4(b)) is the Tourist-Area
Topic (TAT) model, which only considers the travel area. The third one (Fig. 4.4(c))
is the Tourist-Season Topic (TST) model , which only considers the travel season.
All these methods can also be used for package and tourist representation. Finally,
note that the graphical representations of TT and TST are similar to the AT model
and ART model, respectively. However, their differences have been discussed.
4.4 Cocktail Recommendation Approach
In this section, we propose a cocktail approach for personalized travel package
recommendation based on the TAST model, which follows a hybrid recommendation
strategy (Burke, 2007) and has the ability to combine many possible constraints that
exist in real-world scenarios.

Figure 4.5. The cocktail recommendation approach.

Specifically, we first use the output topic distributions of
TAST to find the seasonal nearest neighbors for each tourist, and collaborative filter-
ing will be used for ranking the candidate packages. Next, new packages are added
into the candidate list by computing similarity with the candidate packages generated
previously. Finally, we use collaborative pricing to predict the possible price distri-
bution of each tourist and reorder the packages. After removing the packages which
are no longer active, we will have the final recommendation list.
Fig. 4.5 illustrates the framework of the proposed cocktail approach, and each
step of this approach is introduced in the following subsections. We should note that
the major computation cost of this approach is the inference of the TAST model.
As the number of travel records increases, this cost grows accordingly. However, since
the topics of each landscape evolve very slowly, we can rerun the inference
periodically offline in real-world applications. At the end of this section, we will
describe several similar cocktail recommendation strategies based on the related topic
models of TAST.
4.4.1 Seasonal Collaborative Filtering for Tourists
In this subsection, we describe the method for generating the personalized candidate
package set for each tourist via collaborative filtering. After obtaining the
topic distribution of each tourist and package from the TAST model, we can
compute the similarity between tourists based on their topic distributions.
Intuitively, based on the idea of collaborative filtering, for a given user, we
recommend the items preferred by the users who have tastes similar to hers.
However, as we explained previously, the package recommendation is more complex
than the traditional ones. For example, if we make recommendations for tourists
in winter, it is inappropriate to recommend “Maple Leaf Adventures”. In other
words, for a given tourist, we should recommend the packages that are enjoyed by
other tourists at the specific season. Indeed, we have obtained the seasonal topic
distribution for each tourist from the TAST model. Multiple methods can be used to
compute these similarities, such as matrix factorization (Koren & Bell, 2011; Koren,
2008) and graphical distances (Fouss, Pirotte, Renders, & al, 2007). Alternatively,
a simple but effective way is to use the Correlation coefficient, and the similarity
between tourist Um and Un in season Sj can be computed by:
\[
Sim_{S_j}(U_m, U_n) = \frac{\sum_{k=1}^{Z}(\theta_{mjk}-\bar{\theta}_{mj})(\theta_{njk}-\bar{\theta}_{nj})}{\sqrt{\sum_{k=1}^{Z}(\theta_{mjk}-\bar{\theta}_{mj})^2}\,\sqrt{\sum_{k=1}^{Z}(\theta_{njk}-\bar{\theta}_{nj})^2}} \tag{4.3}
\]
where θ̄mj is the average topic probability for the tourist-season pair (Um, Sj). (If
tourist Um has never traveled in season Sj, her overall topic distribution ϑUm is used
as an alternative throughout this chapter.)

For a given tourist, we can find his/her nearest neighbors by ranking their sim-
ilarity values. Thus, the packages that are favored by these neighbors but have not
been traveled by the given tourist can be selected as candidate packages, which form
a rough recommendation list; they are ranked by the probabilities computed by
collaborative filtering.
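The seasonal similarity of Eq. (4.3) is a Pearson correlation between topic vectors, and neighbors are found by ranking it. A minimal sketch with toy topic vectors (the real vectors would come from the TAST inference; names are illustrative):

```python
import numpy as np

def seasonal_similarity(theta_mj, theta_nj):
    """Eq. (4.3): Pearson correlation between two tourists' topic
    distributions for the same season S_j."""
    dm = theta_mj - theta_mj.mean()
    dn = theta_nj - theta_nj.mean()
    return float(dm @ dn / (np.linalg.norm(dm) * np.linalg.norm(dn)))

def nearest_neighbors(theta_season, m, k):
    """Rank the other tourists by seasonal similarity to tourist m.
    theta_season: array of shape (M, Z), one topic vector per tourist."""
    sims = [(n, seasonal_similarity(theta_season[m], theta_season[n]))
            for n in range(len(theta_season)) if n != m]
    return sorted(sims, key=lambda x: -x[1])[:k]

# Toy topic vectors over Z = 4 topics for 3 tourists in one season.
theta_season = np.array([[0.7, 0.1, 0.1, 0.1],
                         [0.6, 0.2, 0.1, 0.1],
                         [0.1, 0.1, 0.1, 0.7]])
neighbors = nearest_neighbors(theta_season, m=0, k=2)
```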
4.4.2 New Package Problem
In recommender systems, there is a cold-start problem, i.e., it is difficult to recommend
new items. As we have explored in Section 4.2, travel packages often have a life cycle
and new packages are usually created. Meanwhile, most of the landscapes remain
in use, which means nearly all the new packages are totally or partially composed
of existing landscapes. Take the year 2010 as an example: there are
65 new packages in the data, and only 2 of them are composed entirely of new
landscapes. Thus, for most new packages P new, their topic distributions can
be estimated by the topics of their landscapes:
\[
\vartheta^{P^{new}}_{ij} = \frac{\alpha_j + \sum_{l \in P^{new}_i} o_{lj}}{\sum_{k=1}^{Z}\big(\alpha_k + \sum_{l \in P^{new}_i} o_{lk}\big)} \tag{4.4}
\]
where o_lj is the number of times that landscape l is assigned to topic Tj in the
travel logs; the seasonal topic distribution of the new packages can be computed
in a similar way. The next question is how to recommend new packages. One
way to address this issue is to recommend the new packages that are similar to
way to address this issue is to recommend the new packages that are similar to
the ones already traveled by the given tourist (i.e., via the content based method).
However, if the recommender systems just deal with the current interest of the given
tourist, we will suffer from the overspecialization problem (Adomavicius & Tuzhilin,
2005). Thus, we propose to compute the similarity between each new package and
a given number (e.g., 10) of candidate packages at the top of the recommendation
list. The new packages that are similar to the candidate packages are added into
the recommendation list, and their ranks in the list are based on the average probabilities
of the similar candidate packages. It is expected that this method can not only deal
with the cold-start problem but also avoid the overspecialization problem. Please note
that, in real applications, the new-package recommendation list can be separated
from the general list. However, in this chapter, for better illustration and evaluation,
we insert the new packages into the general recommendation list.
Since there is no effective way to learn the topics of new packages whose
landscapes are not included in the training set, we use the topic distributions of
their located areas in the given travel season as an estimate. Fortunately, there are few
such packages.
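Eq. (4.4) can be computed directly from the landscape-topic assignment counts. A small sketch with toy counts (the counts o[l, j] would come from the Gibbs sampler; the values here are illustrative):

```python
import numpy as np

def new_package_topics(landscapes, o, alpha):
    """Eq. (4.4): estimate the topic distribution of a new package from the
    topic assignment counts of its landscapes.
    o[l, j] = number of times landscape l is assigned to topic T_j."""
    counts = o[landscapes].sum(axis=0)     # sum over the package's landscapes
    return (alpha + counts) / (alpha + counts).sum()

# Toy counts: 5 landscapes, 3 topics.
o = np.array([[9, 1, 0],
              [8, 2, 0],
              [0, 1, 9],
              [0, 0, 10],
              [3, 3, 4]])
# A new package reusing landscapes 0 and 1 inherits their topic profile.
theta_new = new_package_topics([0, 1], o, alpha=0.5)
```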
4.4.3 Collaborative Pricing
In this subsection, we present the method to consider the price constraint for devel-
oping a more personalized package recommender system. The price of travel packages
may vary from $20 to more than $3,000, so the price factor influences the decisions
of tourists. Along this line, we propose a collaborative pricing method in which we
first divide the prices into different segments. Then, we propose to use the Markov
forecasting model to predict the next possible price range for a given tourist.
Figure 4.6. The TRAST model and its two sub-models: (a) the TRAST model;
(b) the TRAST1 model; (c) the TRAST2 model.

In the first phase, we divide the prices of the packages based on the variance of
prices in the travel logs. We first sort the prices of the travel logs, and then partition
the sorted list PL into several sub-lists in a binary-recursive way. In each iteration,
we first compute the variance of all prices in the list. Then we choose the best split
price, i.e., the one with the minimal weighted average variance (WAV), defined as:

\[
WAV(i; PL) = \frac{|PL_1(i)|}{|PL|}\, Var\big(PL_1(i)\big) + \frac{|PL_2(i)|}{|PL|}\, Var\big(PL_2(i)\big)
\]

where PL_1(i) and PL_2(i) are the two sub-lists of PL split at the i-th element and
Var denotes the variance. The best split price leads to the maximum decrease
ΔV(i) = Var(PL) − WAV(i; PL).
In the second phase, we mark each price segment as a price state and compute the
transition probabilities between them. Specifically, if a tourist purchased a package
with price state a before traveling a package with price state b, then the weight of
the edge from a to b is increased by 1. After summing up the weights from all the tourists,
we normalize them into transition probabilities, and all the transition probabilities
compose a state transition matrix. From the current price state of a given tourist
(i.e. the current price distribution normalized from his/her previous travel records),
we predict the next possible price state by the one-step Markov forecasting model
based on random walk. Finally, we obtain the predicted probability distribution of
the given tourist on each state, and use these probabilities as weights to multiply
the probabilities of the candidate packages in the rough recommendation list so as to
reorder these packages. After removing the packages which are no longer active, we
have the final recommendation list.
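The two phases of collaborative pricing can be sketched as follows: build the state transition matrix from consecutive purchases, then take one random-walk step from the tourist's current price-state distribution. The inputs below are toy examples, not the chapter's data:

```python
import numpy as np

def transition_matrix(histories, n_states):
    """Count transitions between consecutive price states over all tourists,
    then row-normalize the counts into probabilities."""
    counts = np.zeros((n_states, n_states))
    for states in histories:                 # one price-state sequence per tourist
        for a, b in zip(states, states[1:]):
            counts[a, b] += 1
    rows = counts.sum(axis=1, keepdims=True)
    rows[rows == 0] = 1                      # leave never-visited rows as zeros
    return counts / rows

def predict_price_dist(current, T):
    """One-step Markov forecast: next distribution = current @ T."""
    return current @ T

histories = [[0, 1, 1, 2], [0, 1, 2], [1, 2, 2]]
T = transition_matrix(histories, n_states=3)
current = np.array([0.0, 1.0, 0.0])          # tourist currently in price state 1
next_dist = predict_price_dist(current, T)
```

The entries of `next_dist` are then used as weights to re-rank the candidate packages by their price segments.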
4.4.4 Related Cocktail Recommendations
The previous cocktail recommendation approach (Cocktail) is mainly based on the
TAST model and the collaborative filtering method. Indeed, another possible cock-
tail approach is the content based cocktail, and in the following, we call this method
TASTContent. The main difference between TASTContent and Cocktail is that in
TASTContent the content similarity between packages and tourists is used for ranking
packages instead of collaborative filtering. Since TASTContent can only
capture the existing travel interests of the tourists, it may also suffer from the
overspecialization problem.
As there are many topic models related to the TAST model, it is also possible
to design similar cocktail recommendation approaches based on them.
Actually, it is quite straightforward to replace the TAST model by the TT, TAT, or
TST model in the cocktail approach to obtain new recommendation approaches. For example,
in the experimental section, the notation TTER stands for the cocktail approach that
is based on the TT model.
In Cocktail we use the price factor as an external constraint to measure package
ranks. To some extent, the package prices may also directly influence the interests of
the tourists. Thus, it can be included in the topic model representation. If we replace
the season token Sd in Fig. 4.3 by (Sd, Cd) pair, where Cd is the price segment of this
package log, and update the previous 4W assumptions, the price factor can be well
incorporated into the topic model. In this way, the topic preference of the packages
in each price segment can also be inferred. Moreover, this topic model shares
the same inference process with the TAST model, and in the following, we call the
cocktail recommendation approach based on this model Cocktail-.
In summary, both Cocktail and the above related approaches follow the idea of
hybrid recommendations, which exploit multiple recommendation techniques, such
as collaborative filtering and content-based approaches, for the best performances.
Indeed, hybrid recommender systems are usually more practical and have been widely
used (Burke, 2007; Lai, Xiang, Diao, Liu, & al, 2011). For instance, seven different
types of hybrid recommendation techniques have been discussed in (Burke, 2007).
In fact, the cocktail recommendation is a combined exploitation of several hybrid
strategies. Specifically, the seasonal collaborative filtering based on topic modelling
is a “Feature Augmentation” strategy, where new features (latent topics) are generated
as better input to enhance the existing algorithm. Second, the insertion of new
packages is a “Mixed” strategy, where recommendations from different sources are
combined. Finally, the collaborative pricing is similar to a “Cascade” strategy, where
a secondary recommender refines the decisions made by a stronger one.
4.5 The TRAST Model
In this section, we extend the current TAST model and propose a novel Tourist-
Relation-Area-Season Topic (TRAST) model to formulate the tourist relationships in
a travel group.
In the TAST model, we do not consider travel group information. However, as noted in Section 4.2, each package has usually been used by many groups of tourists, and the tourists belong to different travel groups. Thus, if two tourists have taken the same package but in different travel groups, we can only say that they have the same travel interest; we cannot conclude that they share the same travel profile. However, if these two tourists are in the same group, they may share some common travel traits, such as similar cultural interests and holiday patterns. In the future, they may also want to travel together; they may even be family members who always travel together during the holiday season. In this chapter, we use the term relationship to denote these commonalities and connections in tourists' travel profiles. Note also that multiple tourist relationships may exist simultaneously.
Based on the above understanding, we incorporate into the TAST model a new set of variables, with each entry indicating one relationship, and we consider the tourist relationships in each travel group. This novel topic model, named the TRAST model, is shown in Fig. 4.6(a), where each tourist has a multinomial distribution over G relationships, and each relationship has a multinomial distribution over Z topics. The other assumptions are similar to those in the TAST model. However, in the TRAST model, the purchases of the tourists in each travel group are merged into a single expense record, so the model has a more complex generative process. We can
understand this process through a simple example. Assume that the two selected tourists in a travel group (U′′d) are u1 and u2, a young couple who are dating. Now, they decide to travel in winter (Sd) and the destination is North America (Ad). To generate a travel landscape (l), we first draw a relationship (r, e.g., lover), and then find a topic (t) for lovers traveling in the winter (e.g., skiing). Finally, based on this skiing topic and the selected travel area (e.g., Northeast America), we draw a landscape (e.g., Stowe, Vermont).
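The generative story above can be sketched in a few lines of code. The hand-specified "distributions" and vocabulary below are toy stand-ins of our own (not learned parameters), playing the roles of Λ, θ, and φ.

```python
import random

# toy, hand-specified "distributions" standing in for Lambda, theta, and phi
RELATIONS = ["lover"]                                     # drawn for the pair (u1, u2)
TOPICS = {("lover", "winter"): ["skiing", "hot-spring"]}  # theta for (relationship, season)
LANDSCAPES = {("skiing", "Northeast America"): ["Stowe, Vermont"],
              ("hot-spring", "Northeast America"): ["Saratoga Springs"]}

def generate_landscape(season, areas, rng):
    r = rng.choice(RELATIONS)              # 1. draw a relationship for the pair
    t = rng.choice(TOPICS[(r, season)])    # 2. draw a topic for (r, season)
    a = rng.choice(areas)                  # 3. draw a travel area from A_d
    return rng.choice(LANDSCAPES[(t, a)])  # 4. draw a landscape from phi_{a,t}

print(generate_landscape("winter", ["Northeast America"], random.Random(0)))
```

Each call walks the relationship → topic → area → landscape chain once, mirroring the generation of a single landscape token.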
Thus, in the TRAST model, the notation U′′d stands for a group of tourists and P′′d is the corresponding package ID for this travel group log. θ and Λ correspond to the topic distribution specific to each relationship-season pair and the relationship distribution specific to each tourist, respectively, where η is a new hyperparameter. The marginal distribution of the travel group set P′′ can be computed as:
$$p(\mathbf{P}''\mid \alpha,\beta,\eta,U,S,A)=\iiint \prod_{i=1}^{M}p(\Lambda_i\mid\eta)\prod_{i=1}^{G}\prod_{j=1}^{J}p(\theta_{ij}\mid\alpha)\prod_{i=1}^{O}\prod_{j=1}^{Z}p(\phi_{ij}\mid\beta)\prod_{d=1}^{D''}\prod_{i=1}^{|L_{P''_d}|}\Bigl(p(u_1,u_2\mid U''_d)\sum_{r_{di}=1}^{G}\bigl(p(r_{di}\mid u_1,u_2)\sum_{t_{di}=1}^{Z}\bigl(p(t_{di}\mid\theta_{r_{di}S_d})\sum_{a_{di}\in A_d}p(a_{di}\mid A_d)\,p(l_{di}\mid\phi_{a_{di}t_{di}})\bigr)\bigr)\Bigr)\,d\Lambda\,d\theta\,d\phi$$
To perform the inference, the Gibbs sampling formulae can be derived in a similar way as for the TAST model, but the sampling procedure at each iteration is significantly more complex. To make the inference more efficient and easier to understand, we instead perform it in two distinct parts. We first split the TRAST model into two sub-models, as shown in Fig. 4.6(b) and 4.6(c). The first sub-model, TRAST1, is just like the TAST model, except that the two tourists are latent factors and some of the notations take different meanings here. With this model, we use sampling to obtain topic assignments and tourist-pair assignments for each landscape token. Then, in the second sub-model, TRAST2, we treat topics and tourist pairs as known, and the goal is to obtain relationship assignments. In the following, we introduce the inference of these two models, one by one.
If we directly transfer the results from the TAST model to assign a topic to each landscape token in the TRAST1 model, we need to compute n(u1,u2)st for each (u1, u2) pair, i.e., the number of landscape tokens that are assigned to topic t and have been co-traveled by tourists (u1, u2) in season s. In this way, we would have to compute and store each n(u1,u2)st, an entry in an M × M × J × Z matrix, which is too expensive (in fact, most of the entries would be 0). Instead, we use the following approximation:
$$p(a_{di}, t_{di}, (u_{di1}, u_{di2})\mid \ldots) \propto \frac{\alpha_{t_{di}}+n^{u_{di1}}_{s_d t_{di}}+n^{u_{di2}}_{s_d t_{di}}-1}{\sum_{k=1}^{Z}\bigl(\alpha_k+n^{u_{di1}}_{s_d k}+n^{u_{di2}}_{s_d k}\bigr)-1}\cdot\frac{\beta_{l_{di}}+m_{a_{di}t_{di}l_{di}}-1}{\sum_{k=1}^{|A_i|}\bigl(\beta_{l_{di}}+m_{a_{di}t_{di}k}\bigr)-1}\qquad(4.5)$$
where “...” refers to all the known information, such as the area (a¬di), topic (t¬di), and tourist-pair ((u1, u2)¬di) assignments of the other landscape tokens, and the hyperparameters α and β. With the above equation, we only have to keep an M × J × Z matrix for storing each n^u_{st}.
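As a sketch of how the count matrices enter Eq. (4.5), the two factors can be written as follows. The naming is ours: `n[u][s]` is a tourist's season-topic count vector and `m[a][t]` the area-topic landscape counts; the unnormalized sampling weight for a candidate assignment is the product of the two factors.

```python
def topic_weight(u1, u2, s, t, alpha, n, Z):
    """Tourist-pair factor of Eq. (4.5): counts for both tourists are pooled,
    so only an M*J*Z matrix n[u][s][k] is needed, never an M*M*J*Z one."""
    num = alpha + n[u1][s][t] + n[u2][s][t] - 1
    den = sum(alpha + n[u1][s][k] + n[u2][s][k] for k in range(Z)) - 1
    return num / den

def landscape_weight(t, a, l, beta, m, n_landscapes):
    """Area-topic factor of Eq. (4.5) over the landscapes of area a."""
    num = beta + m[a][t][l] - 1
    den = sum(beta + m[a][t][k] for k in range(n_landscapes)) - 1
    return num / den
```

At each Gibbs step, one would compute these weights for every candidate (area, topic) pair and sample proportionally to their product.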
We can see that the TRAST2 model is similar to the TST model (Fig. 4.4(c)), except for the location of Sd and the pair of tourists. Similar to the inference of the TRAST1 model, for each relationship assignment we use the following equation:
$$p(r_{di}\mid \ldots) \propto \frac{\eta_{r_{di}}+n^{u_1}_{r_{di}}+n^{u_2}_{r_{di}}-1}{\sum_{k=1}^{G}\bigl(\eta_k+n^{u_1}_{r_k}+n^{u_2}_{r_k}\bigr)-1}\cdot\frac{\alpha_{t_{di}}+m_{r_{di}S_d t_{di}}-1}{\sum_{t=1}^{Z}\bigl(\alpha_t+m_{r_{di}S_d t}\bigr)-1}\qquad(4.6)$$
After Gibbs sampling, each tourist’s travel relationship preference can be esti-
mated by the following equation, and each entry of θ and φ can be computed simi-
larly.
$$\Lambda_{ir} = \frac{\eta_r + n_{ir}}{\sum_{k=1}^{G}\bigl(\eta_k + n_{ik}\bigr)}\qquad(4.7)$$
In principle, the TRAST model can be extended to compute relationships among more than two tourists, although the computational cost grows accordingly. To simplify the problem, in this chapter we consider only two tourists in a travel group at a time as a tourist pair when mining their relationships. With the TRAST model, all the tourists' travel preferences are represented by relationship distributions. For a set of tourists who want to take the same package, we can use their relationship distributions as features to cluster them, so as to put them into different travel groups. Many clustering methods can be adopted in this scenario; since the choice of clustering algorithm is beyond the scope of this chapter, in the experiments we use K-means, one of the most popular clustering algorithms.
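A minimal sketch of this use of K-means follows, hand-rolled to stay self-contained; the relationship distributions below are illustrative values of our own, not learned Λ vectors.

```python
import random

def kmeans(points, k, n_iters=20, seed=0):
    """A basic K-means over tourists' relationship-distribution vectors."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(n_iters):
        # assignment step: each tourist joins the nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # update step: recompute each center as its cluster mean
        centers = [tuple(sum(xs) / len(xs) for xs in zip(*cl)) if cl else centers[j]
                   for j, cl in enumerate(clusters)]
    return clusters

# each row: one tourist's distribution over (here) 3 latent relationships
tourists = [(0.8, 0.1, 0.1), (0.7, 0.2, 0.1), (0.1, 0.1, 0.8), (0.2, 0.1, 0.7)]
groups = kmeans(tourists, k=2)
```

Tourists whose relationship distributions concentrate on the same latent relationship end up in the same travel group.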
Thus, the TRAST model can be used as an assessment for automatic travel group formation. Indeed, in real applications, when generating a travel group, additional external constraints, such as tourists' travel date requirements and the travel company's group schedule, should also be considered. Note that it is possible to use the topics mined by TRAST1 to represent the latent relationships directly. However, in that case the topics would represent both landscape topics and latent relationships, making them hard to interpret.
4.6 Experimental Results
In this section, we evaluate the performance of the proposed models on real-world data; some previous results (Q. Liu et al., 2011) are omitted due to space limits. Specifically, we demonstrate: (1) the results of season splitting and price segmentation, (2) the interpretation of the extracted topics, (3) a recommendation performance comparison between Cocktail and benchmark methods, (4) the evaluation of the TRAST model, and (5) a brief discussion on recommendations for travel groups.
4.6.1 The Experimental Setup
The data set was divided into a training set and a test set. Specifically, the last expense record of each tourist in the year 2010 was chosen to be part of the test set, and the remaining records were used for training. The detailed information is described in Table 4.2.⁴ Note that there are 65 new packages traveled by 269 tourists in the test set. However, only two of these packages are composed entirely of new landscapes, and there are 11 new landscapes.
Table 4.2. The description of the training and test data.
Data Split #Tourists #Packages #Landscapes #Records #Groups
Training set 5,211 843 1,054 22,201 7,083
Test set 1,150 908 1,065 1,150 666
Benchmark Methods. To evaluate the fitness of the TAST model, we compare it with three related models: the TAT, TST, and TT models, which do not take the season factor, the area factor, and both factors into consideration, respectively. The perplexity (an evaluation metric measuring the goodness of fit of a model) comparison illustrated in (Q. Liu et al., 2011) shows that the TAST model has significantly better predictive power than the three other models.

⁴ Since the data is very sparse, and to ensure that each method can obtain a meaningful result, we choose a comparably small test set.
To evaluate the recommendation accuracy of the Cocktail approach, we compare it with the following benchmarks:
• Three methods based on topic models including TTER, TASTContent and
Cocktail- as described in Section 4.4.4.
• A content-based recommendation (SContent) based on co-traveled landscapes.
• For the memory based collaborative filtering, we implemented the user based
collaborative filtering method (UCF).
• For the model based collaborative filtering, we chose Binary SVD (BSVD) (Lai
et al., 2011).
• Since UCF and BSVD only use the package-level information, to do a fair
comparison, we implemented two similar methods based on landscapes (i.e.,
LUCF, LBSVD).
• One graph-based algorithm, LItemRank (Gori & Pucci, 2007), where a land-
scape correlation graph is constructed, and the packages are ranked by the
expected average steady-state probabilities on their landscapes.
Figure 4.7. Season splitting and price segmentation: (a) travel logs by month and price segment; (b) percentage of packages vs. number of scheduled travel seasons.
In the following, we use fixed Dirichlet hyperparameters; such settings are widely used in existing work (Griffiths & Steyvers, 2004). For instance, we set β = 0.1 and α = 50/Z for the TAST model.
4.6.2 Season Splitting and Price Segmentation
In this subsection, we present the results of season splitting and price segmentation, as shown in Fig. 4.7. For better illustration, in Fig. 4.7(a) we only show the travel logs with prices lower than $1,500. In the figure, different price segments are represented with different grayscale settings, and seasons are split by the dashed lines between months. In total, we have 4 seasons (i.e., spring, summer, fall, and winter) and 5 price segments (i.e., very low, low, medium, high, and very high). Since almost all the tourists in the data are from South China, this season splitting captures the climatic features there well. Another interesting observation is that the peak travel times in China include February (around the Spring Festival), July and August (the summer vacation for students), and the beginning of October (the National Day holiday).
Figure 4.8. The correlation of topic distributions between different price ranges
(Left)/different areas (Center)/different seasons(Right). Darker shades indicate lower
similarity.
Fig. 4.7(b) describes the relationship between the percentage of travel packages and the number of scheduled travel seasons. In Fig. 4.7(b), we can see that most of the packages are traveled in only one season during a year, and fewer than 6% of packages are scheduled across the entire year. Finally, note that we do not illustrate the relationship between each travel package and the number of areas it spans, because almost all the packages in the data are located in only one of the 7 travel areas. These statistics reflect the fact that the landscapes in most packages have spatial-temporal auto-correlations, and that the travel area and travel season segmentation methods are reasonable and effective.
4.6.3 Understanding of Topics
To understand the latent topics extracted by TAST, we focus on studying the relationships between topics and the intrinsic characteristics of their landscapes and packages.

In (Q. Liu et al., 2011) we demonstrated that TAST can capture the spatial-temporal correlations among landscapes, discovering landscapes that are close to each other or share similar travel seasons. Meanwhile, the TAST model retains the strength of traditional topic models in capturing the relationships between landscapes that are located in different areas and have no particular travel season preference. Similarly, the topic distributions of each package can also be computed.
Based on the price-spatial-temporal correlations of packages (whose interpretation may contain some noise), all the topics can be classified into eight types, denoted from 1-1-1 (packages have price, spatial, and temporal correlations) to 0-0-0 (packages have none of these correlations). Another interesting observation is that the top travel packages in many topics are actually quite similar to each other, even though they have different package IDs. For example, all the packages in topic 43 are about the Kunming-Dali-Lijiang tour. This finding demonstrates once again that, in addition to capturing the intrinsic characteristics of the travel data, the TAST model retains the capabilities of traditional topic models, such as clustering documents (packages) (Blei et al., 2003).
In addition, we show the Pearson correlations of the topic distributions for different prices/areas/seasons in Fig. 4.8, where different prices/areas/seasons are assigned different topic distributions. From the left matrix, it is interesting to observe that the topic distributions of the very low and very high price segments are quite different from the three other price ranges. In the center matrix, for most area pairs there are no obvious topic correlations, except for East Asia (EA) and North China (NC), which are located near each other at similar latitudes. The topic relationships between seasons, shown in the right matrix, are clearer: the two most different pairs of seasons are (winter, summer) and (summer, fall), while (summer, spring) have the most similar latent topic distributions.

Figure 4.9. A performance comparison based on Top-K.
Table 4.3. A performance comparison: DOA(%).
Alg. SContent UCF BSVD LUCF LBSVD
DOA(%) 62.41 69.96 68.77 88.44 87.67
Alg. TTER TASTContent Cocktail- Cocktail
DOA(%) 89.82 80.00 92.44 92.56
4.6.4 Recommendation Performances
Since there are no explicit ratings for validation, we use the ranking accuracy instead.
We adopt the widely used Degree of Agreement (DOA) (Q. Liu, Chen, Xiong, Ding,
& Chen, 2012) and Top-K (Koren, 2008) as the evaluation metrics. Also, a simple
user study was conducted and volunteers were invited to rate the recommendations.
For comparison, we recorded the best performance of each algorithm by tuning their
parameters, and we also set some general rules for fair comparison. For instance,
for collaborative filtering based methods, we usually consider the contribution of the
nearest neighbors with similarity values larger than 0.
DOA measures the percentage of item pairs ranked in the correct order with respect to all pairs (Gori & Pucci, 2007). Let NW_{Ui} = P − (F_{Ui} ∪ E_{Ui}) denote the set of packages that occur in neither the training set (F_{Ui}) nor the test set (E_{Ui}) of tourist Ui, let PR_{Pj} denote the predicted ranking score of package Pj in the recommendation list, and define check_order_{Ui}(Pj, Pk) as 1 if PR_{Pj} ≥ PR_{Pk} and 0 otherwise. Then the individual DOA for tourist Ui is defined as:

$$\mathrm{DOA}_{U_i} = \frac{\sum_{j\in E_{U_i},\,k\in NW_{U_i}} \mathrm{check\_order}_{U_i}(P_j, P_k)}{|E_{U_i}| \times |NW_{U_i}|}$$
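For concreteness, the individual DOA can be computed as follows (a sketch with our own names; `predicted_score` plays the role of PR):

```python
def doa_for_tourist(test_pkgs, unrated_pkgs, predicted_score):
    """Individual Degree of Agreement: the fraction of (test, unrated)
    package pairs that the predicted scores rank in the correct order."""
    pairs = [(j, k) for j in test_pkgs for k in unrated_pkgs]
    correct = sum(1 for j, k in pairs
                  if predicted_score[j] >= predicted_score[k])
    return correct / len(pairs)
```

An ideal ranking scores every test package above every unrated one, giving a DOA of 1.0; the overall DOA is the average over all tourists.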
For instance, an ideal ranking corresponds to a DOA of 100%, while a random ranking yields 50% on average; we use DOA to denote the average of the individual DOAs. Under this metric, the ranking performance of each method is shown in Table 4.3, where we can see that Cocktail outperforms the benchmark methods. By integrating the price factor into the TAST model, Cocktail- performs nearly as well as Cocktail, and both perform better than TTER. Also, the methods that consider landscape information (i.e., LUCF, LBSVD, LItemRank, TTER, TASTContent, Cocktail) usually outperform those that do not (i.e., UCF, BSVD). As mentioned previously, it is harder to find credible nearest-neighbor tourists (and latent interests) based only on co-traveled packages. Furthermore, TASTContent performs better than SContent, and TTER performs better than LUCF and LBSVD, which demonstrates the effectiveness of modelling latent topics. Meanwhile, unlike watching movies, most tourists seldom travel packages that are similar to the ones they have already traveled (e.g., with too many identical landscapes); thus, the content-based methods (i.e., SContent and TASTContent) perform worse than the collaborative filtering methods (e.g., LUCF and Cocktail).
Top-K indicates the recall value of the recommended top K percent of packages. Since there is only 1 relevant package for each test tourist (i.e., |E_{Ui}| = 1), we define Top-K_{Ui} = #hit, where #hit equals 1 or 0. Then the average of the individual Top-K values is used for comparing the performance of the algorithms, as shown in Fig. 4.9. We can see that Cocktail still outperforms the other methods, and the Top-K result is very similar to the DOA result, except that BSVD/LBSVD now fare better.
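The per-tourist Top-K hit can be sketched as follows (our naming; with |E_{Ui}| = 1, it is simply whether the single relevant package falls in the top K percent of the ranked list):

```python
def top_k_hit(relevant_pkg, ranked_pkgs, k_percent):
    """1 if the tourist's single relevant package is within the top
    k_percent of the ranked recommendation list, else 0."""
    cutoff = max(1, int(len(ranked_pkgs) * k_percent / 100))
    return 1 if relevant_pkg in ranked_pkgs[:cutoff] else 0
```

Averaging this hit indicator over all test tourists gives the Top-K recall curve plotted in Fig. 4.9.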
User Study. Since it is impossible for us to directly ask the test tourists to rate the recommendation results, we conducted another type of user study. Specifically, we first gave the package information that one tourist had traveled and the season in which he/she was planning a new trip; we then showed the top-ranked recommendations from each algorithm (i.e., LUCF, LBSVD, TTER, TASTContent, and Cocktail). Finally, volunteers were invited to blindly review the recommendations on a 5-point Likert scale ranging from 1 (Meaningless) to 5 (Excellent). In total, we collected 2,580 ratings for these 5 algorithms (i.e., 516 each) from 17 volunteers (all undergraduate or graduate students from the University of Science and Technology of China). The final mean ratings and standard deviations (SD) are shown in Table 4.4. We can see that the rating for Cocktail is slightly higher than the others, and that LBSVD outperforms both LUCF and TASTContent. Applying a z-test, we find that the differences between the ratings obtained
Table 4.4. User study ratings.
LUCF LBSVD TTER TASTContent Cocktail
Mean 3.22 3.30 3.46 3.20 3.55
SD 0.74 0.75 0.81 0.94 0.76
by Cocktail and the other algorithms are statistically significant, with |z| ≥ 2.58 and thus p ≤ 0.01 (except for the comparison with TTER, where |z| = 1.53 and p = 0.06).
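The z statistic here can be computed from the summary statistics in Table 4.4 alone. Below is a standard two-sample z-test sketch; we assume n = 516 ratings per algorithm as stated above, and the exact values reported in the text may differ slightly depending on the rating counts used.

```python
import math

def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """Two-sample z statistic computed from summary statistics."""
    return (mean1 - mean2) / math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)

# Cocktail (mean 3.55, SD 0.76) vs. LUCF (mean 3.22, SD 0.74), 516 ratings each
z = two_sample_z(3.55, 0.76, 516, 3.22, 0.74, 516)
```

This particular comparison yields |z| well above the 2.58 threshold for p ≤ 0.01.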
Another interesting observation is that the SD value for TASTContent is markedly higher, which means this content-based algorithm makes polarizing and controversial recommendations.
In summary, Cocktail performs better than the other methods on all the evaluation metrics, and Cocktail-/TTER have the second-best performance. Due to the unique characteristics of the travel data, the traditional collaborative filtering methods (UCF and BSVD) do not perform well, and they cannot recommend new packages to tourists. Since different metrics characterize the recommendations from different perspectives, some “controversial results” have also been observed (e.g., the differing performance of LBSVD). In general, methods that consider additional useful information in a proper way tend to perform better. During the user study, where the users were exposed to many different recommendations simultaneously, we also noticed that it is often hard for them to directly judge two recommendation results from different algorithms. This suggests that how recommendations are presented to users, and how the system interacts with them, is also very important for successfully deploying a recommender system.
Figure 4.10. The runtime results for different algorithms.
Computational Performances. We also compare the computational performance of the algorithms, running all of them on the same platform.⁵ Fig. 4.10 shows the execution time (i.e., the time used for building the model and making the final recommendations for all the test tourists). We can see that many algorithms (e.g., LItemRank, TASTContent, Cocktail-, and Cocktail) have similar runtimes. Among all the algorithms, BSVD and LBSVD are the most efficient, and Cocktail- has the worst computational performance. Specifically, among the topic model based methods, TTER does not have to consider the seasonal topic similarities of the tourists and is thus the most efficient.
4.6.5 The Evaluation of the TRAST Model
Since we have little information about the tourists, it is hard to interpret the identified relationships directly. However, we can test the effectiveness of the TRAST model from an alternative perspective: the mined relationships are used as features to help automatically form travel groups. We conduct two types of experiments. The first uses K-means clustering to group given tourists, and the second finds the tourists who would like to travel with a given tourist.

⁵ For the topic model based algorithms, we run Gibbs sampling for 100 iterations, since similar results are already observed.
Table 4.5. Experimental results for K-means clustering.

Features       Cosine MI (↑)  Cosine VDn (↓)  Euclidean MI (↑)  Euclidean VDn (↓)
Groups         0.7570         0.4453          0.7659            0.4233
Landscapes     0.7640         0.4727          0.7714            0.4619
Topics         0.7556         0.4227          0.7459            0.4440
Relationships  0.7972         0.4012          0.7804            0.4161
To this end, we use 7,083 travel groups to train the TRAST model. For testing, we select 76 packages from the original test set (shown in Table 4.2) to ensure that each selected package has more than 2 travel groups. In total, there are 167 travel groups traveled by 570 tourists. In the experiments, we fix the numbers of topics and relationships to 100 and 20, respectively, and set the parameters η, α, and β the same as in the TAST model.
For the clustering experiment, given the set of tourists (i.e., the objects for clustering) and the number of travel groups (K) of each test package, we run K-means to cluster these tourists into K groups, with the relationship distributions serving as the clustering features. We compare this clustering result with three others: the K-means results using group logs (i.e., assuming two tourists who often traveled in the same groups have similar travel preferences), traveled landscapes, and topics (mined by the TAST model) as features, respectively. The better the selected features, the better the clustering results should be. Since K-means clustering validation has been carefully studied before, we choose two recognized
Figure 4.11. The precision results for Leave-Out-Rest (%): (a) Cosine; (b) Euclidean distance.
Table 4.6. The recall results for Leave-Out-Rest (%).

Features       Cosine   Euclidean distance
Groups         37.10    53.31
Landscapes     59.28    50.27
Topics         53.26    53.07
Relationships  60.18    56.23
validation measures: MI (mutual information), which is widely used, and VDn (the normalized van Dongen criterion), which has been identified as among the most suitable validation measures. The corresponding experimental results are shown in Table 4.5. We can see that, regardless of the similarity measure, the K-means results based on relationships consistently outperform the clustering results based on the other features on both evaluation metrics.
Meanwhile, we evaluate the identified relationships from each tourist's point of view. Specifically, we randomly select a tourist from each travel group, and then rank all the remaining tourists (including those from other groups) of this travel package for this tourist (i.e., Leave-Out-Rest). Here, the ranking list is generated based on the candidates' similarities to the given tourist, computed using the travel relationship distributions (or co-traveled groups, landscapes, or topic distributions). Ideally, the tourists who are in the same travel group as the given tourist should appear earlier in the list. To evaluate these ranking lists, we choose precision and recall as the metrics; the corresponding results are shown in Fig. 4.11 and Table 4.6. We can see that the ranking lists based on relationships are still better than those based on other features.
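The Leave-Out-Rest ranking can be sketched as follows. All names are ours, and the distributions shown in the test are illustrative rather than learned; cosine similarity stands in for whichever similarity measure is used.

```python
def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return num / den if den else 0.0

def leave_out_rest_rank(held_out, candidates):
    """Rank candidate tourists by the similarity of their relationship
    distributions to the held-out tourist (most similar first)."""
    return sorted(candidates, key=lambda t: -cosine(held_out, candidates[t]))

def precision_at_k(ranked, same_group, k):
    """Fraction of the top-k candidates that are true group-mates."""
    return sum(1 for t in ranked[:k] if t in same_group) / k
```

A good feature representation places the held-out tourist's real group-mates at the top of the list, yielding high precision and recall.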
From the above analysis, we know that the relationships identified by TRAST are better suited for clustering tourists and help to find the most likely co-travel companions for a given tourist. Thus, compared to co-traveled groups, landscapes, and topics, relationships are the more suitable choice for travel companies as an assessment for automatic travel group formation.
4.6.6 Recommendation for Travel Groups
The evaluations in the previous sections mainly focused on individual (personalized) recommendations. Since there are tourists who frequently travel together, it is interesting to know whether the latent variables (e.g., the topics of each individual tourist and the relationships of a travel group) as well as the cocktail approaches are useful for making recommendations to a group of tourists. To this end, we performed an experimental study on group recommendations.
Similar to the evaluation of the personalized recommendations, we generated recommendations for the 666 travel groups in the test set (shown in Table 4.2). Specifically, each recommendation algorithm simply views a group of tourists as an “individual tourist”: all the previous travel/expense records of these tourists are used for training, and a single recommendation list is then generated for each test group using the training set (training groups). Based on their performance in Section 4.6.4, we chose five typical recommendation algorithms for comparison: LUCF, LBSVD, TTER, and two Cocktails, one based on the topics extracted by the TAST model and one based on the relationships extracted by the TRAST model. We chose DOA as the evaluation metric due to its simplicity of interpretation. The experimental results are shown in Table 4.7, where we can see that the Cocktails still outperform the other algorithms and that, in addition to modelling each individual tourist, the relationships can also be used for making recommendations. Meanwhile, we observe that both LUCF and LBSVD perform much better with more training records compared to the results in Table 4.3.
Table 4.7. Group recommendation results: DOA(%).
Alg. LUCF LBSVD TTER Cocktail(Topics) Cocktail(Relationships)
DOA(%) 90.86 88.77 89.60 92.29 92.10
It is worth noting that the differences between group recommendation and individual recommendation are more subtle and complex than they appear at first glance (Jameson & Smyth, 2007). While a detailed discussion is beyond the scope of this chapter, we hope to see more future studies on travel group recommendations.
4.7 Concluding Remarks
In this chapter, we presented a study of personalized travel package recommendation.
Specifically, we first analyzed the unique characteristics of travel packages and de-
veloped the Tourist-Area-Season Topic (TAST) model, a Bayesian network for travel
package and tourist representation. The TAST model can discover the interests of the
tourists and extract the spatial-temporal correlations among landscapes. Then, we
exploited the TAST model for developing a cocktail approach on personalized travel
package recommendation. This cocktail approach follows a hybrid recommendation
strategy and has the ability to combine several constraints existing in the real-world
scenario. Furthermore, we extended the TAST model to the TRAST model, which
can capture the relationships among tourists in each travel group. Finally, an empiri-
cal study was conducted on real-world travel data. Experimental results demonstrate
that the TAST model can capture the unique characteristics of the travel packages,
the cocktail approach can lead to better travel package recommendation performance, and the TRAST model can serve as an effective assessment for automatic travel group formation. We hope these encouraging results will inspire much future work.
CHAPTER 5
COLLABORATIVE FILTERING WITH COLLECTIVE TRAINING
Rating sparsity is a critical issue for collaborative filtering. For example, the well-known Netflix movie rating data contain ratings for only about 1% of user-item pairs.
One way to address this rating sparsity problem is to develop more effective methods
for training rating prediction models. To this end, in this chapter, we introduce a
collective training paradigm to automatically and effectively augment the training
ratings. Essentially, the collective training paradigm builds multiple different Col-
laborative Filtering (CF) models separately, and augments the training ratings of
each CF model by using the partial predictions of other CF models for unknown
ratings. Along this line, we develop two algorithms, Bi-CF and Tri-CF, based on collective training. For Bi-CF and Tri-CF, we collectively and iteratively train two and three different CF models, respectively, by iteratively augmenting the training ratings of each individual CF model. We also design different criteria to guide the selection of augmented
training ratings for Bi-CF and Tri-CF. Finally, the experimental results show that
Bi-CF and Tri-CF algorithms can significantly outperform baseline methods, such as
neighborhood-based and SVD-based models.
5.1 Introduction
Recommender systems (Adomavicius & Tuzhilin, 2005) provide personalized suggestions by identifying user interests from user behavior data. As a major recommendation technique, collaborative filtering (CF) aims to predict the preference of a user using available ratings or taste information from many users. Specifically, given N users, M items, and an M × N preference matrix R, the typical CF task is to predict the unknown ratings in R from the available training ratings. Many CF algorithms, usually categorized into two groups, memory-based and model-based methods (Adomavicius & Tuzhilin, 2005), have been proposed to address this prediction problem.
The prediction performance of most CF methods strongly depends on the available training ratings: better prediction can usually be expected if more training ratings become available. However, rating data are usually very sparse because it is expensive to obtain more training ratings from users or experts. Consequently, the unknown ratings usually significantly outnumber the available ratings. The question, then, is whether it is possible to leverage the abundant unknown ratings, in addition to the training ratings, to improve the performance of CF methods. Through the sample data in Table 5.1, we demonstrate the feasibility of exploiting unknown ratings to improve CF methods.

In Table 5.1, we have an item-user matrix (R) with 7 items and 5 users. For example, with an item-oriented KNN method (iKNN) (Adomavicius & Tuzhilin, 2005), we can predict the ratings R(4, 3) and R(2, 3) as around 1. Note
Table 5.1. A Sample Data Set.
User1 User2 User3 User4 User5
Item1 NaN NaN 2 3 NaN
Item2 1 2 NaN 2 3
Item3 2 4 2 4 5
Item4 1 2 NaN 2 3
Item5 1 2 1 NaN 4
Item6 1 2 NaN 5 7
Item7 NaN NaN 5 NaN NaN
Note: NaN indicates unknown rating.
that KNN is the acronym of K-Nearest Neighbors and is also known as the
neighborhood method (Adomavicius & Tuzhilin, 2005). With these two predictions,
we can better measure the similarity between User3 and User5, and thus better
predict the rating R(1, 5) using the user-oriented KNN method (uKNN) (Adomavicius
& Tuzhilin, 2005). In contrast, if we use only the known ratings to predict R(1, 5)
with the user-oriented KNN method, we cannot obtain a reliable similarity value
between User3 and User5 because the support (i.e., the number of items rated in
common by User3 and User5) is too low, and therefore we cannot predict R(1, 5)
well. Through this illustrative example, we show that the performance of one CF
model can be improved by leveraging the partial predictions of other CF models for
unknown ratings.
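To make this concrete, the following sketch (illustrative only, assuming NumPy; indices are zero-based, so User3 is column 2 and User5 is column 4) reproduces Table 5.1 and shows how filling in the two predicted ratings raises the support between User3 and User5 from 2 to 4 co-rated items:

```python
import numpy as np

NAN = np.nan
# Item-user matrix from Table 5.1 (rows: Item1..Item7, columns: User1..User5).
R = np.array([
    [NAN, NAN, 2,   3,   NAN],
    [1,   2,   NAN, 2,   3],
    [2,   4,   2,   4,   5],
    [1,   2,   NAN, 2,   3],
    [1,   2,   1,   NAN, 4],
    [1,   2,   NAN, 5,   7],
    [NAN, NAN, 5,   NAN, NAN],
])

def user_pearson(R, u, v):
    """Pearson correlation between user columns u and v over co-rated items.
    Returns (similarity, support); similarity is None if support < 2 or degenerate."""
    common = ~np.isnan(R[:, u]) & ~np.isnan(R[:, v])
    support = int(common.sum())
    if support < 2:
        return None, support
    x, y = R[common, u], R[common, v]
    denom = np.std(x) * np.std(y)
    if denom == 0:
        return None, support
    return float(np.mean((x - np.mean(x)) * (y - np.mean(y))) / denom), support

# With only the known ratings, User3 and User5 co-rate just Item3 and Item5:
sim, support = user_pearson(R, 2, 4)          # support is only 2 -> unreliable

# Fill R(2,3) and R(4,3) with iKNN-style predictions (around 1, per the text):
R_aug = R.copy()
R_aug[1, 2] = 1.0   # R(2, 3)
R_aug[3, 2] = 1.0   # R(4, 3)
sim_aug, support_aug = user_pearson(R_aug, 2, 4)  # support grows to 4
```

With four co-rated items instead of two, the similarity estimate rests on a meaningfully larger support, which is exactly the effect exploited in this chapter.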
To that end, in this chapter, we introduce the collective training paradigm to
improve CF methods. Essentially, the collective training paradigm iteratively
augments the training ratings of one model with the partial predictions of other
CF models, and then re-trains all CF models. Along this line, we first develop a
Bi-CF algorithm based on collective training between two CF models, which
iteratively augments the training ratings of one model with the partial predictions
of the other, re-trains the two CF models, and re-makes the predictions. The final
prediction is an ensemble of the two CF models. Furthermore, to exploit the
advantages of different CF models, we collectively train three CF models and
develop a Tri-CF algorithm. For both the Bi-CF and Tri-CF algorithms, one
essential challenge is how to select the augmented training ratings. In this
chapter, we design two different criteria to guide this selection for Bi-CF and
Tri-CF. Finally, experimental results on MovieLens data show that both the Bi-CF
and Tri-CF models can outperform several traditional methods, such as the KNN and
SVD methods.
5.2 Related Work
First, the idea of collective training has been studied for classification and
regression problems (Ghamrawi & McCallum, 2005; Sen et al., 2008; Zhou & Li, 2005)
in the machine learning community. In this chapter, however, we adapt collective
training to collaborative filtering and propose two algorithms to address the
resulting challenge: how to iteratively augment the training set for each
individual CF method. Second, in the field of collaborative filtering (Ge, Liu,
et al., 2011; Q. Liu et al., 2010), some research papers (Jin & Si, 2004;
Boutilier, Zemel, & Marlin, 2003; Harpale & Yang, 2008) have already explored
unknown ratings to improve collaborative filtering methods; these are known as
active collaborative filtering. However, most of these methods need to query users
for a small number of unknown ratings. These supplemental training samples are
then included and used to rebuild the CF models together with the original
training samples. In other words, user or expert interaction is still needed to
exploit the unknown ratings. In addition, Zhang and Pu (Zhang & Pu, 2007)
introduced a recursive method that iteratively uses some predicted ratings in the
predictions of other unknown ratings. However, this method is specifically
designed for the user-based CF approach. In contrast, our collective training is
developed to automatically exploit unknown ratings without user interaction and to
collectively train and boost different CF approaches.
5.3 Collective Training
In the context of collaborative filtering, collective training boosts one CF model
with the predictions of other CF models. Diversity among these CF models is
necessary for collective training: if all the models are identical, their
estimates of the unknown ratings will be the same and the mutual boosting effect
will disappear. With different CF methods, different algorithms can be developed
to perform collective training, and the methods for selecting the augmented
training ratings may also vary among algorithms.
Though collective training can generally be adapted to various combinations of
multiple CF methods, we focus on three CF methods, i.e., uKNN (Bell & Koren,
2007), iKNN (Bell & Koren, 2007), and SVD (Paterek, 2007), and develop two
specific examples of collective training among them. Specifically, we design the
Bi-CF and Tri-CF algorithms based on these three CF models. Before introducing
Bi-CF and Tri-CF, we briefly review these three CF methods.
Suppose we have N users and M items, and a set of available ratings. To estimate
the unknown rating r_{ji} of item j by user i, the item-oriented KNN method makes
the prediction

r_{ji} = \frac{\sum_{v \in N(j)} s_{vj} r_{vi}}{\sum_{v \in N(j)} s_{vj}},

where N(j) is a set of neighboring items that are also rated by user i, and
s_{vj} is the similarity between item j and item v, often computed with a
traditional correlation measure, e.g., Pearson correlation or cosine similarity.
The analogous user-oriented KNN method makes the prediction

r_{ji} = \frac{\sum_{u \in N(i)} s_{ui} r_{ju}}{\sum_{u \in N(i)} s_{ui}},

where N(i) is a set of neighboring users who also rate item j, and s_{ui} is the
similarity between user u and user i. SVD models a user's preference for an item
as the dot product of the user latent factor and the item latent factor (Paterek,
2007). Given the observed training ratings, the user and item latent factors are
learned by minimizing the objective function

E = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{M} I_{ji} (r_{ji} - U_i^{T} V_j)^2
    + \alpha_U \sum_{i=1}^{N} \|U_i\| + \alpha_V \sum_{j=1}^{M} \|V_j\|,

where I_{ji} is 1 if r_{ji} is observed and 0 otherwise, and \alpha_U and
\alpha_V are regularization parameters.
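The three base models can be sketched as follows (an illustrative NumPy implementation, not the exact code used in the experiments): a similarity-weighted KNN predictor covering both the item-oriented and user-oriented variants, and SGD training of the SVD latent factors, where the update uses the common squared-L2 regularization:

```python
import numpy as np

def knn_predict(R, sim, j, i, k=2, axis="item"):
    """Similarity-weighted KNN prediction of r_ji.
    axis='item': average over item neighbors of j rated by user i (sim is item-item).
    axis='user': average over user neighbors of i who rated item j (sim is user-user)."""
    if axis == "item":
        rated = [v for v in range(R.shape[0]) if v != j and not np.isnan(R[v, i])]
        weights = {v: sim[j, v] for v in rated}
        values = {v: R[v, i] for v in rated}
    else:
        rated = [u for u in range(R.shape[1]) if u != i and not np.isnan(R[j, u])]
        weights = {u: sim[i, u] for u in rated}
        values = {u: R[j, u] for u in rated}
    nbrs = sorted(rated, key=lambda n: -weights[n])[:k]   # top-k most similar
    num = sum(weights[n] * values[n] for n in nbrs)
    den = sum(weights[n] for n in nbrs)
    return num / den if den else np.nan

def svd_train(R, f=5, alpha=0.05, gamma=0.003, epochs=50, seed=0):
    """Learn latent factors U (f x N users) and V (f x M items) by SGD
    on the regularized squared error over the observed entries of R (M x N)."""
    rng = np.random.default_rng(seed)
    M, N = R.shape
    U = 0.1 * rng.standard_normal((f, N))
    V = 0.1 * rng.standard_normal((f, M))
    obs = [(j, i) for j in range(M) for i in range(N) if not np.isnan(R[j, i])]
    for _ in range(epochs):
        for j, i in obs:
            e = R[j, i] - U[:, i] @ V[:, j]               # prediction error
            U[:, i] += gamma * (e * V[:, j] - alpha * U[:, i])
            V[:, j] += gamma * (e * U[:, i] - alpha * V[:, j])
    return U, V
```

A prediction for user i on item j is then simply `U[:, i] @ V[:, j]`.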
5.3.1 The Bi-CF Algorithm
The Bi-CF algorithm is based on two different CF methods: SVD and user-oriented
KNN (uKNN). The pseudocode of Bi-CF is shown in Algorithm 2. As can be seen, we
first make predictions for the unknown ratings using the uKNN and SVD methods
(steps 2-3). Second, we select partial predictions yielded by each individual
model (steps 6-7). Then we re-predict the unknown ratings with the uKNN model
(SVD model) using the predictions selected from the SVD model (uKNN model) in
addition to the original training ratings (steps 8-10). This process is performed
iteratively until a stopping criterion is satisfied, namely that neither the SVD
nor the uKNN model changes much. Specifically, the change of the SVD model is
reflected by the latent factor matrices U and V, and the change of the uKNN model
is reflected by its predictions for the unknown ratings.
Furthermore, instead of selecting the augmented ratings from all unknown ratings,
in each iteration we select them from a pool of unknown ratings (steps 4 and 11).
This strategy significantly decreases the probability that the same set of
predicted ratings is selected at different iterations, and it saves considerable
selection time, which is otherwise quite costly due to the large number of unknown
ratings. Also, the selected augmented predictions (H1 and H2) are put back into W
after each iteration. After all iterations, we make predictions for the unknown
ratings by combining the results of the uKNN and SVD models (step 15). Also note
that Θ in Algorithm 2 represents all parameters for uKNN and SVD, including the
number of neighbors for uKNN, and the number of latent factors and the penalty
parameters for SVD.
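The overall loop can be sketched as follows (illustrative Python; the `fit`/`predict` model wrappers and the `select` criterion are hypothetical placeholders, and the stopping criterion is simplified to a fixed iteration count):

```python
import random

def bi_cf(train, unknown, test, model_a, model_b, select, k=500, pool_size=5000,
          max_iter=10):
    """Collective training of two CF models (Bi-CF, Algorithm 2).
    train: dict {(item, user): rating}; unknown/test: sets of (item, user) pairs.
    model_a / model_b: objects with fit(ratings) and predict(pairs) -> dict.
    select(preds, pool, k): confidence criterion returning the k most confident
    predictions as a {(item, user): value} dict."""
    r1, r2 = dict(train), dict(train)
    model_a.fit(r1); p1 = model_a.predict(unknown)      # steps 2-3
    model_b.fit(r2); p2 = model_b.predict(unknown)
    for _ in range(max_iter):                           # stopping criterion simplified
        pool = random.sample(sorted(unknown), min(pool_size, len(unknown)))  # steps 4/11
        h1 = select(p1, pool, k)                        # steps 6-7
        h2 = select(p2, pool, k)
        r1.update(h2); r2.update(h1)                    # step 8: cross-augment
        model_a.fit(r1); p1 = model_a.predict(unknown)  # steps 9-10
        model_b.fit(r2); p2 = model_b.predict(unknown)
    ta = model_a.predict(test)                          # steps 13-14
    tb = model_b.predict(test)
    return {pair: (ta[pair] + tb[pair]) / 2 for pair in test}   # step 15: ensemble
```

Note how each model is augmented with the other model's selected predictions, never its own; this cross-augmentation is what makes the training "collective."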
Confidence Measurement. One critical challenge of the Bi-CF algorithm is how to
select the partial predictions from all unknown ratings effectively and
efficiently. On one hand, if many inaccurate predictions are augmented, the CF
model may be degraded rather than boosted. On the other hand, the overall
iteration will be very time-consuming if the selection takes too long. To this
end, we propose a criterion to efficiently estimate the confidence of a
prediction.
Since there is no ground truth for an unknown rating, we instead consult the available
Input: R: the set of known ratings; W: the set of unknown ratings; T: the set of
testing ratings; Θ: the set of parameters; K: the number of augmented ratings
Output: P_t: the predictions on the testing set.
1.  R1 ← R; R2 ← R
2.  Make predictions for W with uKNN and R1
3.  Make predictions for W with SVD and R2
4.  Generate a pool W' by randomly selecting from W
5.  while a stopping criterion is not satisfied do
6.      Select K predictions (denoted H1) from all predictions for W' by uKNN
7.      Select K predictions (denoted H2) from all predictions for W' by SVD
8.      R1 ← R1 ∪ H2; R2 ← R2 ∪ H1
9.      Make predictions for W with uKNN and R1
10.     Make predictions for W with SVD and R2
11.     Regenerate W' by randomly selecting from W
12. end
13. Get the predictions P_t^1 for the test set T with uKNN
14. Get the predictions P_t^2 for the test set T with SVD
15. Output: P_t ← Average(P_t^1, P_t^2)
Algorithm 2: The Bi-CF Algorithm
ratings that neighbor the unknown rating in terms of users or items to estimate
the confidence of its prediction. If these neighboring available ratings are
predicted well by the CF model, we consider the prediction of the unknown rating
to be highly confident. Thus, the average deviation between the ground truth
ratings and the predictions of these neighboring available ratings should be
evaluated first. Specifically, given one unknown rating r_{ji}, we first find a
set of items N(j) that neighbor item j and are rated by user i, and a set of users
N(i) who neighbor user i and have rated item j. With N(j) and user i, we have a
set of known ratings {r_{vi}}, v ∈ N(j); with item j and N(i), we have a set of
known ratings {r_{ju}}, u ∈ N(i). We can obtain a prediction for each rating in
these two sets with the CF models; we denote these two sets of predictions as
{p_{vi}} and {p_{ju}}. Note that the elements of {p_{vi}} ({p_{ju}}) are in
one-to-one correspondence with the elements of {r_{vi}} ({r_{ju}}). We then use
the RMSE (Root Mean Squared Error) to evaluate the average deviation between the
ground truth ratings and the predictions:

\sqrt{\frac{\sum_{u \in N(i)} (r_{ju} - p_{ju})^2
      + \sum_{v \in N(j)} (r_{vi} - p_{vi})^2}{|N(i)| + |N(j)|}}    (5.1)

Here |N(i)| and |N(j)| are the numbers of neighboring users and items,
respectively; we set them to the same value as the number of neighbors in the KNN
methods. Since the confidence of the prediction for r_{ji} is inversely
proportional to this RMSE, in each iteration we select the top-K predictions with
the lowest RMSE in Equation 5.1.
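A direct computation of this criterion might look as follows (an illustrative sketch; the dictionary-based containers and function names are our own, not from the experiments):

```python
import math

def confidence_rmse(j, i, known, preds, item_nbrs, user_nbrs):
    """Equation 5.1: RMSE over the neighboring known ratings of the unknown r_ji.
    known / preds: dicts {(item, user): value}; item_nbrs: items neighboring j;
    user_nbrs: users neighboring i."""
    sq_err, count = 0.0, 0
    for v in item_nbrs:                 # neighboring items that user i rated
        if (v, i) in known:
            sq_err += (known[(v, i)] - preds[(v, i)]) ** 2
            count += 1
    for u in user_nbrs:                 # neighboring users who rated item j
        if (j, u) in known:
            sq_err += (known[(j, u)] - preds[(j, u)]) ** 2
            count += 1
    return math.sqrt(sq_err / count) if count else float("inf")

def top_k_confident(candidates, scores, k):
    """Select the k candidates with the lowest RMSE, i.e. the highest confidence."""
    return sorted(candidates, key=lambda c: scores[c])[:k]
```

An unknown rating with no rated neighbors gets infinite RMSE and is never selected, which matches the intuition that such a prediction cannot be trusted.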
5.3.2 The Tri-CF Algorithm
In this subsection, we introduce the Tri-CF algorithm, which is based on
item-oriented KNN (iKNN), uKNN, and SVD, and boosts one CF model with augmented
ratings generated from the predictions of the other two CF models. As its
pseudocode shows, Tri-CF has a similar iterative process to Bi-CF. Unlike Bi-CF,
however, Tri-CF evaluates the confidence of the predictions for each unknown
rating by analyzing the consistency of the predictions of two models. In other
words, if two CF methods make consistent predictions for one unknown rating, the
predictions are considered highly confident and will be added to the training set
of the third CF model. Specifically, for one unknown rating, we can obtain three
predictions p1, p2, and p3 from uKNN, SVD, and iKNN, respectively. The
consistency, and hence the confidence, of two predictions p1 and p2 is inversely
proportional to |p1 − p2|. Thus, among all unknown ratings, we select the top-K
unknown ratings with the lowest values of |p1 − p2|; for each selected unknown
rating, we compute the average of p1 and p2 and add it to the training set for
iKNN. We obtain the augmented training ratings for uKNN and SVD in the same way.
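The consistency-based selection can be sketched as follows (illustrative Python; the names are hypothetical):

```python
def select_consistent(p_first, p_second, pool, k):
    """Tri-CF selection: pick the k unknown ratings whose two predictions agree
    most (smallest |p1 - p2|); each selected rating is augmented as their average.
    p_first / p_second: dicts mapping rating identifiers to predicted values."""
    ranked = sorted(pool, key=lambda pair: abs(p_first[pair] - p_second[pair]))
    return {pair: (p_first[pair] + p_second[pair]) / 2 for pair in ranked[:k]}
```

The returned dict is exactly what gets unioned into the third model's training set (e.g., H12 into R3).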
However, the consistent predictions of two models (e.g., uKNN and SVD) may still
be inaccurate and, if such predictions are augmented into the training set of the
third model (e.g., iKNN), they will degrade it. Thus, inspired by (Zhou & Li,
2005), we heuristically impose a constraint in order to select effective
predictions and limit the augmented noisy ratings. Specifically, we first evaluate
the confidence of the predictions for each known rating with the same method as in
the above paragraph; in our experiments, we consider the two predictions p1 and p2
confident if |p1 − p2| is lower than 0.5. We then count the number (denoted c) of
confident predictions for the known ratings. Among these c confident predictions,
we count the number (denoted c′) that are almost the same as the ground truth
ratings. Finally, we estimate the noise rate of the found high-confidence
estimations as (c − c′)/c. This noise rate is estimated for the augmented ratings
of each individual CF model. Therefore, during each iteration, we estimate the
noise rate of the potential augmented ratings for each CF model, and we augment
the training ratings only if the noise rate is lower than a threshold Nr. In
addition, Tri-CF uses the same procedures as described in Section 5.3.1, such as
the stopping criterion.
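The noise-rate estimate can be sketched as follows (illustrative Python; the 0.5 agreement threshold follows the text, while the tolerance for "almost the same as the ground truth" is an assumed parameter, since the chapter does not specify it):

```python
def noise_rate(p_first, p_second, known, agree_tol=0.5, truth_tol=0.5):
    """Estimate the noise rate (c - c') / c of the augmented ratings: among the c
    known ratings whose two predictions agree within agree_tol, c' averaged
    predictions also fall within truth_tol of the ground truth.
    known: dict {rating_id: ground-truth value}; p_first / p_second: predictions."""
    c = c_good = 0
    for pair, truth in known.items():
        if abs(p_first[pair] - p_second[pair]) < agree_tol:   # "confident" pair
            c += 1
            avg = (p_first[pair] + p_second[pair]) / 2
            if abs(avg - truth) < truth_tol:                  # close to ground truth
                c_good += 1
    return (c - c_good) / c if c else 1.0                     # no evidence -> assume noisy
```

Augmentation for a model then proceeds only when this estimate is below the threshold Nr.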
5.4 Experimental Results
In this section, we empirically validate the performances of the proposed Bi-CF and
Tri-CF models.
The Experiment Setup. We validate the proposed Bi-CF and Tri-CF models on the
MovieLens dataset (http://www.grouplens.org/node/73), which contains 100,000
discrete ratings (on a 1-5 scale) from 943 users for 1,682 movies. In this
chapter, 80% of the known ratings are used for training and 20% are used as the
testing set. The parameters for SVD are specified as αU = 0.05, αV = 0.05, and the
learning rate γ = 0.003, as suggested in (Paterek, 2007). We denote the number of
neighbors for KNN and Equation 5.1 as Nei, and the number of latent factors as f.
In our experiments, we show the performance with
Input: R: the set of known ratings; W: the set of unknown ratings; T: the set of
testing ratings; Θ: the set of parameters; K: the number of top-K confident
predictions; Nr: the noise rate threshold
Output: P_t: the predictions on the testing set.
1.  R1 ← R; R2 ← R; R3 ← R
2.  Make predictions for W with uKNN and R1
3.  Make predictions for W with SVD and R2
4.  Make predictions for W with iKNN and R3
5.  Generate a pool W' by randomly selecting from W
6.  while a stopping criterion is not satisfied do
7.      If the noise rate for iKNN is lower than Nr, then select the top-K
        confident predictions H12 from W' by uKNN and SVD; R3 ← R3 ∪ H12
8.      If the noise rate for uKNN is lower than Nr, then select the top-K
        confident predictions H23 from W' by SVD and iKNN; R1 ← R1 ∪ H23
9.      If the noise rate for SVD is lower than Nr, then select the top-K
        confident predictions H13 from W' by uKNN and iKNN; R2 ← R2 ∪ H13
10.     Make predictions for W with uKNN and R1
11.     Make predictions for W with SVD and R2
12.     Make predictions for W with iKNN and R3
13.     Regenerate W' by randomly selecting from W
14. end
15. Get the predictions P_t^1 (P_t^2, P_t^3) for the test set T with uKNN (SVD, iKNN)
16. Output: P_t ← Average(P_t^1, P_t^2, P_t^3)
different values of Nei and f . Also the number of augmented ratings during each
iteration is set as K = 500. The noise rate threshold is set as Nr = 0.1. Finally,
we use the RMSE (Bell & Koren, 2007; Adomavicius & Tuzhilin, 2005) metric to
evaluate different methods.
In Table 5.2, we show the performances of different methods with different values
of Nei and f . Particularly, we also directly ensemble SVD and uKNN by averaging
the final predictions of these two models. We represent this method as Ensemble.
As can be seen, Bi-CF and Tri-CF models can outperform the competing methods,
including KNN, SVD and Ensemble, in the most cases. The results of Bi-CF and
Tri-CF are obtained after a stop criterion is satisfied.
To further study and compare the two proposed models, we compare their RMSEs on
the testing set at different iterations in Figure 5.1, where the RMSE at each
iteration is obtained by averaging the predictions of the two/three basic CF
models. In Figure 5.1, we specify the number of neighbors as Nei = 40 and f = 40.
As can be seen, the RMSEs of both Bi-CF and Tri-CF decrease significantly after
several initial iterations, and Tri-CF shows slightly better performance during
these iterations. Note that both the Bi-CF and Tri-CF models take many more steps
to converge; here we only show the first 10 iterations.
5.5 Concluding Remarks
In this chapter, we exploited the well-known concept of collective training for
collaborative filtering and demonstrated its effectiveness for recommendation.
Essentially, the collective training paradigm builds multiple collaborative
filtering models and augments the training ratings of one collaborative filtering
model by leveraging the
Table 5.2. RMSE Comparisons on MovieLens
Nei f uKNN SVD Ensemble Bi-CF Tri-CF
10 10 1.0401 0.987 0.9702 0.9581 0.9522
20 20 1.0207 1.0022 0.9702 0.9590 0.9535
30 30 1.0183 1.0162 0.9750 0.9600 0.9535
40 40 1.0181 1.0298 0.9795 0.9609 0.9580
Figure 5.1. RMSEs at Different Iterations. [Line plot: RMSE (y-axis, roughly
0.95-0.985) versus iteration number (x-axis, 1-10), with one curve for Bi-CF and
one for Tri-CF.]
predictions of other collaborative filtering models. To demonstrate the usefulness
and practicality of this idea, we developed two specific examples of collective
training of multiple CF models, i.e., Bi-CF and Tri-CF. Two different criteria
were also designed to guide the selection of the augmented training ratings.
Finally, experimental results on the MovieLens data showed the advantages of both
Bi-CF and Tri-CF in comparison with several baseline methods, such as the KNN and
SVD methods. As future work, we would like to explore other possible combinations
for collective training, in addition to Bi-CF and Tri-CF, and identify the most
powerful combinations. In addition, one limitation of Bi-CF and Tri-CF is that
they take many iterations before a stopping criterion is satisfied; in the future,
we will study the convergence of the iterations.
CHAPTER 6
CONCLUSIONS AND FUTURE WORK
In this dissertation, we addressed the unique and intractable analytical
challenges of mobile recommendation by effectively modeling and efficiently
computing over various kinds of mobile data, such as GPS data and travel package
data.
First, we developed an energy-efficient mobile recommender system by exploiting
the energy-efficient driving patterns extracted from the location traces of taxi
drivers. This system can recommend a sequence of potential pick-up points for a
driver such that the potential travel distance before picking up a customer is
minimized. To develop the system, we first formalized a mobile sequential
recommendation problem and provided a Potential Travel Distance (PTD) function for
evaluating each candidate sequence. Based on the monotone property of the PTD
function, we proposed a recommendation algorithm named LCP. Moreover, we observed
that many candidate routes are dominated by skyline routes and thus can be pruned
by skyline computing. Therefore, we also proposed a SkyRoute algorithm to
efficiently compute the skylines for candidate routes. An advantage of searching
for an optimal route through skyline computing is that it saves overall online
processing time when we provide different optimal driving routes defined by
different business needs.
Second, we developed different cost-aware collaborative filtering models to
address the cost constraint of travel tour recommendation. We empirically
investigated which model leads to the greatest improvement from incorporating cost
information and which one works best in practice, and we demonstrated performance
comparisons among all methods with different evaluation metrics.
Third, we developed the Tourist-Area-Season Topic (TAST) model, a Bayesian network
for travel package and tourist representation. The TAST model can discover the
interests of tourists and extract the spatial-temporal correlations among
landscapes. We then exploited the TAST model to develop a cocktail approach for
personalized travel package recommendation. This cocktail approach follows a
hybrid recommendation strategy and can combine several constraints that exist in
real-world scenarios. Furthermore, we extended the TAST model to the TRAST model,
which captures the relationships among tourists in each travel group. Experimental
results demonstrate that the TAST model captures the unique characteristics of
travel packages, the cocktail approach leads to better travel package
recommendation performance, and the TRAST model can be used as an effective
assessment for automatic travel group formation.
BIBLIOGRAPHY
Abowd, G., Atkeson, C., et al. (1997). Cyber-guide: A mobile context-aware
tour guide. Wireless Networks , 3(5), 421-433.
Adams, R. P., Dahl, G. E., & Murray, I. (2010). Incorporating side information in
probabilistic matrix factorization with gaussian processes. In Computing research
repository - corr.
Adomavicius, G., Sankaranarayanan, R., Sen, S., & Tuzhilin, A. (2005). Incorpo-
rating contextual information in recommender systems using a multidimensional
approach. ACM Transactions on Information Systems , 23 (1), 103-145.
Adomavicius, G., & Tuzhilin, A. (2005). Toward the next generation of recom-
mender systems: A survey of the state-of-the-art and possible extensions. IEEE
TKDE , 17 (6), 734-749.
Agarwal, D., & Chen, B. C. (2009). Regression-based latent factor models. In
Proceedings of the acm sigkdd international conference on knowledge discovery and
data mining (p. 19-28).
Applegate, D. L., Bixby, R. E., et al. (2006). The traveling salesman problem:
A computational study. Princeton University Press.
Ardissono, L., Goy, A., Petrone, G., Segnan, M., & Torasso, P. (2002). Ubiquitous
user assistance in a tourist information server. In Proceedings of international
conference on adaptive hypermedia and adaptive web based systems (p. 14-23).
Averjanova, O., Ricci, F., & Nguyen, Q. N. (2008). Map-based interaction with
a conversational mobile recommender system. In The 2nd int’l conf on mobile
ubiquitous computing, systems, services and technologies.
Baltrunas, L., Ludwig, B., Peer, S., & Ricci, F. (2011). Context-aware places
of interest recommendations for mobile users. In Proceedings of the international
conference on human-computer interaction (p. 531-540).
Baltrunas, L., Ricci, F., & Ludwig, B. (2011). Context relevance assessment
for recommender systems. In Proceedings of the 2011 international conference on
intelligent user interfaces.
Bell, R. M., & Koren, Y. (2007). Scalable collaborative filtering with jointly
derived neighborhood interpolation weights. In Ieee icdm (p. 43-52). Omaha NE,
US.
Blei, D. M., Ng, A. Y., Jordan, M. I., & Lafferty, J. (2003). Latent dirichlet
allocation. Journal of Machine Learning Research, 3 , 2003.
Boutilier, C., Zemel, R. S., & Marlin, B. (2003). Active collaborative filtering. In
Acm sigir.
Burke, R. (2007). Hybrid web recommender systems. In The adaptive web (p. 377-
408).
Carolis, B. D., Mazzotta, I., Novielli, N., & Silvestri, V. (2009). Using common
sense in providing personalized recommendations in the tourism domain. In Pro-
ceedings of workshop on context-aware recommender systems.
Cena, F., Console, L., Gena, C., Goy, A., Levi, G., Modeo, S., et al. (2006). Inte-
grating heterogeneous adaptation techniques to build a flexible and usable mobile
tourist guide. AI Communication, 19 (4), 369–384.
Chen, L.-S., Hsu, F.-H., Chen, M.-C., & Hsu, Y.-C. (2008). Developing recom-
mender systems with the consideration of product profitability for sellers. Infor-
mation Sciences , 178(4), 1032-1048.
Cheverst, K., Davies, N., et al. (2000). Developing a context-aware electronic
tourist guide: some issues and experiences. In the sigchi conference on human
factors in computing systems (p. 17-24).
Cressie, N. A. C. (1991). Statistics for spatial data (ISBN:0471843369 ed.). Wiley
and Sons.
Das, A., Mathieu, C., & Ricketts, D. (2010). Maximizing profit using recommender
systems. In Proceedings of the international conference on world wide web.
Delgado, J., & Davidson, R. (2002). Knowledge bases and user profiling in travel
and hospitality recommender systems. In Enter (p. 1-16).
Dell’Amico, M., Fischetti, M., & Toth, P. (1993). Heuristic algorithms for the
multiple depot vehicle scheduling problem. Management Science, 39(1), 115-125.
Deshpande, M., & Karypis, G. (2004). Item-based top-n recommendation. In Acm
transactions on information systems (Vol. 22, p. 143-177).
Papadias, D., Tao, Y., Fu, G., & Seeger, B. (2005). Progressive skyline
computation in database systems. ACM TODS , 30(1), 43-82.
Fayyad, U. M., & Irani, K. B. (1993). Multi-interval discretization of continuous-
valued attributes for classification learning. In Ijcai (p. 1022-1027).
Fouss, F., Pirotte, A., Renders, J.-M., et al. (2007). Random-walk compu-
tation of similarities between nodes of a graph with application to collaborative
recommendation. IEEE TKDE , 19 (3), 355-369.
Ge, Y., Liu, Q., Xiong, H., Tuzhilin, A., & Chen, J. (2011). Cost-aware travel
tour recommendation. In Proceedings of the acm sigkdd international conference
on knowledge discovery and data mining (p. 983-991).
Ge, Y., Xiong, H., Tuzhilin, A., & Liu, Q. (2011). Collaborative filtering with
collective training. In Proceedings of the acm conference on recommender systems
(p. 281-284).
Ge, Y., Xiong, H., Tuzhilin, A., Xiao, K., Gruteser, M., & Pazzani, M. J. (2010).
An energy-efficient mobile recommender system. In Proceedings of the acm sigkdd
international conference on knowledge discovery and data mining (p. 899-908).
Ghamrawi, N., & McCallum, A. (2005). Collective multi-label classification. In
Acm cikm.
Gori, M., & Pucci, A. (2007). Itemrank: A random-walk based scoring algorithm
for recommender engines. In Ijcai (p. 2766-2771).
Griffiths, T., & Steyvers, M. (2004). Finding scientific topics. PNAS , 101 , 5228-
5235.
Grosu, D., & Chronopoulos, A. T. (2004). Algorithmic mechanism design for load
balancing in distributed systems. IEEE TSMC-B , 34(1), 77-84.
Gu, Q., Zhou, J., & Ding, C. H. Q. (2010). Collaborative filtering weighted non-
negative matrix factorization incorporating user and item graphs. In Proceedings
of the siam international conference on data mining (p. 199-210).
Hao, Q., Cai, R., Wang, C., Xiao, R., Yang, J.-M., Pang, Y., et al. (2010). Equip
tourists with knowledge mined from travelogues. In Proceedings of the international
conference on world wide web.
Harpale, A. S., & Yang, Y. (2008). Personalized active learning for collaborative
filtering. In Acm sigir.
Heijden, H. van der, Kotsis, G., & Kronsteiner, R. (2005). Mobile recommendation
systems for decision making ’on the go’. In Icmb.
Herlocker, J. L., Konstan, J. A., Terveen, L. G., John, & Riedl, T. (2004). Eval-
uating collaborative filtering recommender systems. ACM Transactions on Infor-
mation Systems , 22 , 5-53.
Hill, W., Stead, L., Rosenstein, M., & Furnas, G. (1995). Recommending and
evaluating choices in a virtual community of use. In Proceedings of the sigchi
conference on human factors in computing systems (1995) (p. 194-201).
Hofmann, T. (1999). Probabilistic latent semantic analysis. In Proceedings of the
fifteenth conference on uncertainty in artificial intelligence (p. 289-296). Stock-
holm, Sweden.
Hofmann, T. (2004). Latent semantic models for collaborative filtering. ACM
Transactions on Information Systems , 22(1), 89-115.
Hosanagar, K., Krishnan, R., & Ma, L. (2008). Recommended for you: The impact
of profit incentives on the relevance of online recommendations. In Proceedings of
the international conference on information systems. Paris.
http://cabspotting.org/. (n.d.).
Huang, Z., Chung, W., & Chen, H. (2004). A graph model for e-commerce rec-
ommender systems. Journal of the American Society for Information Science and
Technology , 55 , 259-274.
Huang, Z., Li, X., & Chen, H. (2005). Link prediction approach to collaborative
filtering. In In proceedings of the joint conference on digital libraries (p. 141-142).
Iyer, R. D., Jr., Karger, D. R., & Smith, A. C. (1998). An efficient boosting
algorithm for combining preferences. In Proceedings of the fifteenth international
conference on machine learning.
Jameson, A., & Smyth, B. (2007). Recommendation to groups. In The adaptive
web (p. 596-627).
Jannach, D., & Hegelich, K. (2009). A case study on the effectiveness of rec-
ommendations in the mobile internet. In Proceedings of the acm conference on
recommender systems (p. 205-208).
Chomicki, J., Godfrey, P., Gryz, J., & Liang, D. (2003). Skyline with presorting.
In Icde (p. 717-719).
Jin, R., & Si, L. (2004). A bayesian approach toward active learning for collabo-
rative filtering. In Uai.
Karypis, G. (n.d.). Cluto: http://glaros.dtc.umn.edu/gkhome/views/cluto.
Kenteris, M., Gavalas, D., & Economou, D. (2011). Electronic mobile guides: a
survey. Personal and Ubiquitous Computing , 15 (1).
Kian-Lee, T., Pin-Kwang, E., & Ooi, B. C. (2001). Efficient progressive skyline
computation. In Vldb.
Koren, Y. (2008). Factorization meets the neighborhood: a multifaceted collab-
orative filtering model. In Proceedings of the acm sigkdd international conference
on knowledge discovery and data mining (p. 426-434).
Koren, Y., & Bell, R. (2011). Advances in collaborative filtering. In Recommender
systems handbook (p. 145-186).
Lai, S., Xiang, L., Diao, R., Liu, Y., & al et. (2011). Hybrid recommendation
models for binary user preference prediction problem. In Kdd cup.
Xiong, L., Chen, X., Huang, T.-K., Schneider, J., & Carbonell, J. G. (2010).
Temporal collaborative filtering with bayesian probabilistic tensor factorization.
In Proceedings of the siam international conference on data mining (p. 211-222).
Liu, N. N., Xiang, E., Zhao, M., & Yang, Q. (2010). Unifying explicit and implicit
feedback for collaborative filtering. In Proceedings of the 19th acm conference on
information and knowledge management.
Liu, N. N., Zhao, M., Xiang, E. W., & Yang, Q. (2010). Online evolutionary col-
laborative filtering. In Proceedings of the acm conference on recommender systems
(p. 95-102).
Liu, Q., Chen, E., Xiong, H., Ding, C., & Chen, J. (2012). Enhancing collaborative
filtering by user interests expansion via personalized ranking. IEEE TSMC-B ,
42 (1), 218-233.
Liu, Q., Chen, E., Xiong, H., & Ding, C. H. Q. (2010). Exploiting user interests for
collaborative filtering: interests expansion via personalized ranking. In Proceedings
of the acm conference on information and knowledge management (pp. 1697–1700).
Liu, Q., Ge, Y., Li, Z., Xiong, H., & Chen, E. (2011). Personalized travel package
recommendation. In Icdm (p. 407-416).
Lu, Z., Agarwal, D., & Dhillon, I. S. (2009). A spatio-temporal approach to col-
laborative filtering. In Proceedings of the acm conference on recommender systems
(p. 13-20).
Ma, H., King, I., & Lyu, M. R. (2009). Learning to recommend with social trust
ensemble. In Research and development in information retrieval (p. 203-210).
Marlin, B. (2003). Modeling user rating profiles for collaborative filtering. In
Neural information processing systems.
Marlin, B. M., & Zemel, R. S. (2007). Collaborative filtering and the missing at
random assumption. In Proceedings of the conference on uncertainty in artificial
intelligence (p. 267-275).
Marlin, B. M., & Zemel, R. S. (2009). Collaborative prediction and ranking with
non-random missing data. In Proceedings of the acm conference on recommender
systems (p. 5-12).
Miller, B. N., Albert, I., et al. (2003). Movielens unplugged: Experiences with
a recommender system on four mobile devices. In International conference on
intelligent user interfaces.
Mooney, R. J., & Roy, L. (1999). Content-based book recommendation using
learning for text categorization. In Workshop recom. sys.: Algo. and evaluation.
Pan, R., & Scholz, M. (2009). Mind the gaps: weighting the unknown in large-
scale one-class collaborative filtering. In Proceedings of the acm sigkdd international
conference on knowledge discovery and data mining (p. 667-676).
Pan, R., Zhou, Y., Cao, B., Liu, N. N., Lukose, R., Scholz, M., et al. (2008). One-class collaborative filtering. In Proceedings of the IEEE international conference on data mining (pp. 502-511).
Panniello, U., Tuzhilin, A., Gorgoglione, M., Palmisano, C., & Pedone, A. (2009). Experimental comparison of pre- vs. post-filtering approaches in context-aware recommender systems. In Proceedings of the ACM conference on recommender systems (pp. 265-268).
Park, M.-H., Hong, J.-H., & Cho, S.-B. (2007). Location-based recommendation system using Bayesian user's preference model in mobile devices. In Proceedings of the international conference on ubiquitous intelligence and computing.
Park, Y.-J., & Tuzhilin, A. (2008). The long tail of recommender systems and how to leverage it. In Proceedings of the ACM conference on recommender systems.
Paterek, A. (2007). Improving regularized singular value decomposition for collaborative filtering. In KDD cup and workshop.
Pazzani, M. (1999). A framework for collaborative, content-based, and demographic filtering. Artificial Intelligence Review.
Portugal, R., Lourenço, H. R., & Paixão, J. P. (2009). Driver scheduling problem modelling. Public Transport, 1(2), 103-120.
Rennie, J. D. M., & Srebro, N. (2005). Fast maximum margin matrix factorization for collaborative prediction. In Proceedings of the international conference on machine learning (pp. 713-719).
Resnick, P., Iacovou, N., Suchak, M., Bergstrom, P., & Riedl, J. (1994). GroupLens: An open architecture for collaborative filtering of netnews. In Proceedings of the 1994 ACM conference on computer supported cooperative work (pp. 175-186). ACM Press.
Salakhutdinov, R., & Mnih, A. (2008). Probabilistic matrix factorization. In Neural information processing systems.
Börzsönyi, S., Stocker, K., & Kossmann, D. (2001). The skyline operator. In ICDE (pp. 421-430).
Sen, P., Namata, G., Bilgic, M., Getoor, L., Gallagher, B., & Eliassi-Rad, T. (2008). Collective classification in network data. AI Magazine.
van Setten, M., Pokraev, S., & Koolwaaij, J. (2004). Context-aware recommendations in the mobile tourist application COMPASS. In Proceedings of the international conference on adaptive hypermedia and adaptive web-based systems (pp. 235-244).
Shardanand, U., & Maes, P. (1995). Social information filtering: Algorithms for automating "word of mouth". In Proceedings of the ACM CHI'95 conference on human factors in computing systems (pp. 210-217). ACM Press.
Srebro, N., Rennie, J., & Jaakkola, T. (2005). Maximum margin matrix factorizations. In Neural information processing systems.
Tian, Y., Lee, K. C. K., & Lee, W.-C. (2009). Finding skyline paths in road networks. In GIS (pp. 444-447).
Tveit, A. (2001). Peer-to-peer based recommendations for mobile commerce. In Proceedings of the 1st international workshop on mobile commerce.
Woerndl, W., Huebner, J., Bader, R., & Vico, D. G. (2011). A model for proactivity in mobile, context-aware recommender systems. In Proceedings of the ACM conference on recommender systems.
Wu, J., Xiong, H., & Chen, J. (2009). Adapting the right measures for k-means clustering. In Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining (pp. 877-886).
Xu, Z., & Huang, R. (n.d.). Performance study of load balancing algorithms in distributed web server systems. Technical report, CS213, University of California, Riverside.
Xue, G., Lin, C., Yang, Q., Xi, W., Zeng, H., Yu, Y., et al. (2005). Scalable collaborative filtering using cluster-based smoothing. In Proceedings of the international ACM SIGIR conference on research and development in information retrieval (pp. 114-121).
Yang, S.-H., Long, B., Smola, A. J., Sadagopan, N., Zheng, Z., & Zha, H. (2011). Like like alike: Joint friendship and interest propagation in social networks. In Proceedings of the international conference on world wide web (pp. 537-546).
Yu, Z., Zhou, X., Zhang, D., Chin, C.-Y., Wang, X., & Men, J. (2006, July). Supporting context-aware media recommendations for smart phones. IEEE Pervasive Computing, 5(3), 68-75.
Zhang, J., & Pu, P. (2007). A recursive prediction algorithm for collaborative filtering recommender systems. In Proceedings of the ACM conference on recommender systems.
Zhou, Z.-H., & Li, M. (2005). Tri-training: Exploiting unlabeled data using three classifiers. IEEE Transactions on Knowledge and Data Engineering, 17(11), 1529-1541.
VITA
Yong Ge
1982 Born in January in Xuzhou, Jiangsu Province, China.
2001 Graduated from Suining High School, Xuzhou, Jiangsu Province,
China.
2001-05 Attended Xi’an Jiao Tong University, Xi’an, China; majored in Infor-
mation Engineering.
2005 B.S., Xi’an Jiao Tong University.
2005-08 Graduate study in Signal and Information Processing, University of
Science and Technology of China, Hefei, China.
2008 M.S., University of Science and Technology of China.
2008-13 Graduate study in Information Technology, Rutgers University,
Newark, New Jersey, U.S.A.
2008-12 Teaching Assistantship, Department of Management Science and In-
formation Systems.
2011-12 Instructor in Management Information Systems and Data Mining for
Business Intelligence, Rutgers University.
2011 Article: "Multi-focal Learning for Customer Problem Analysis," ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3.
2012 Article: "A Cocktail Approach for Travel Package Recommendation," IEEE Transactions on Knowledge and Data Engineering, accepted.
2013 Ph.D., Rutgers University.