Improved Collaborative Filtering Algorithm using Topic Model

Collaborative filtering algorithms use the ratings that arise from interactions between users and items to generate recommendations. Similarity among users or items is calculated mostly from ratings, without considering explicit properties of the users or items involved. In this paper, we propose a collaborative filtering algorithm that uses a topic model. We treat the user-item matrix as a document-word matrix: users are represented as random mixtures over items, and each item is characterized by a distribution over users. Experiments show that the proposed algorithm achieves better performance than other state-of-the-art algorithms on the MovieLens data sets. Keywords: collaborative filtering; LDA; topic model


Introduction
With the emergence of the Internet, more and more information is disseminated through this channel. The abundance of information, however, makes it difficult for users to locate the information they want; because our processing ability is limited, this is referred to as the information overload problem. Recommender systems therefore arise to help users acquire useful information based on their past preferences or on collaborative preferences from other sources.
Most recommendation algorithms start by finding a set of customers whose purchased and rated items overlap the user's purchased and rated items. The algorithm aggregates items from these similar customers, eliminates items the user has already purchased or rated, and recommends the remaining items to the user.
Recommender systems are often based on Collaborative Filtering (CF), which relies only on past user behavior (for example, previous transactions or product ratings) and does not require the creation of explicit profiles [1]. Notably, CF techniques do not require domain knowledge and avoid the need for extensive data collection. In addition, relying directly on user behavior allows uncovering complex and unexpected patterns that would be difficult or impossible to profile using known data attributes. As a consequence, CF has attracted much attention in the past decade, resulting in significant progress and adoption by some successful commercial systems [2] [3].
Herlocker et al. estimated a user's preference for an item from the ratings given to that item by similar users [4]. Sarwar et al. exploited the similarity between an item and the other items that the user has already rated to predict the user's preference for that item [5]. Koren et al. used Singular Value Decomposition (SVD) to factorize the user-item rating matrix and determine latent properties of users and items [6]. Chen et al. addressed the problem of k Closest Pairs (kCP) queries in spatial network databases [7]. Chang et al. proposed an LDA-based document recommendation system that uses an item-based CF algorithm with document similarity calculated from the latent topic distributions of documents [8]. Liu et al. proposed a latent factor model based on LDA to model the evolution of user interests based on personalized ranking [9]. Ortega et al. pointed out that there are four stages in the CF process at which the users' data can be aggregated into the data of the group. According to their findings, system performance improves significantly if the aggregation is done at an earlier stage of the process [10]. Wang et al. presented Friendbook, a novel semantic-based friend recommendation system for social networks, which recommends friends to users based on their life styles instead of social graphs [11].
Our approach uses a topic model to infer latent properties of items and then calculates users' preferences from historical ratings. We compute a hybrid user similarity score, which combines user similarity in the topic model with cosine-based user similarity. In this way, our approach differs from the above references and aims to improve the quality of recommendations.
The paper is organized as follows. Section 2 describes collaborative filtering algorithms. Section 3 defines the proposed algorithm, and Section 4 presents the results of applying this algorithm to the MovieLens data sets. We conclude and discuss further research directions in Section 5.

Collaborative Filtering Algorithms
In a traditional collaborative filtering algorithm, the input is usually represented as an m×n customer-product matrix R, such that $r_{i,j}$ is one if the ith customer has purchased the jth product and zero otherwise, where $U=\{u_1,u_2,\dots,u_m\}$ is the set of customers and $I=\{i_1,i_2,\dots,i_n\}$ is the set of products. It is shown in Figure 1. We term this m×n representation of the input data set the original representation. The most important step in a collaborative filtering algorithm is computing the similarity between customers, as it is used to form a proximity-based neighborhood between a target customer and a number of like-minded customers. The main goal of neighborhood formation is to find, for each customer u, an ordered list of l customers $N=\{n_1,n_2,\dots,n_l\}$ such that $sim(u,n_1)$ is the maximum, $sim(u,n_2)$ is the next maximum, and so on. The proximity between two customers is usually measured by cosine similarity, Pearson correlation, or adjusted cosine similarity, in which $\bar{r}_i$ denotes the average of $r_i$.
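The neighborhood formation step can be illustrated with a minimal sketch. It assumes the textbook definitions of cosine similarity and Pearson correlation computed over full rating rows; adjusted cosine is omitted, and the function names and the toy matrix are hypothetical rather than taken from the paper.

```python
from math import sqrt

def cosine(a, b):
    """Cosine of the angle between two rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def pearson(a, b):
    """Pearson correlation, computed as the cosine of the mean-centred vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine([x - mean_a for x in a], [y - mean_b for y in b])

def neighborhood(R, u, l, sim=cosine):
    """Return the l customers most similar to customer u, most similar first."""
    others = [v for v in range(len(R)) if v != u]
    others.sort(key=lambda v: sim(R[u], R[v]), reverse=True)
    return others[:l]

# Toy 3x4 customer-product matrix (1 = purchased), for illustration only.
R = [
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
]
print(neighborhood(R, 0, 2))
```

In this toy matrix, customer 1 shares two purchases with customer 0 while customer 2 shares none, so customer 1 is ranked first in the neighborhood of customer 0.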
The final step of the collaborative filtering algorithm is to derive the top-N recommendations from the neighborhood of customers.
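One simple way to derive the top-N list is to count how often each item was purchased within the neighborhood and to skip items the target customer already has. The sketch below assumes this frequency-based ranking and a neighbor list that has already been computed; the name `top_n` and the toy data are hypothetical.

```python
from collections import Counter

def top_n(R, u, neighbors, n):
    """Rank items by how many neighbors purchased them, skipping items u already has."""
    counts = Counter()
    for v in neighbors:
        for j, bought in enumerate(R[v]):
            if bought and not R[u][j]:
                counts[j] += 1
    # Sort by descending count; ties are broken by item index for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [j for j, _ in ranked[:n]]

# Toy customer-product matrix; customers 1 and 2 are taken as the neighbors of 0.
R = [
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 0, 1, 1, 0],
]
print(top_n(R, 0, [1, 2], 2))
```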

Collaborative Filtering Recommenders Using Topic Model
The main step of the collaborative filtering algorithm is to rank each item according to how many similar customers purchased it. Both cosine and correlation measures treat the data as bags of words, so they cannot capture deeper relations between words. A topic model is a good alternative.

LDA Model
LDA is a generative probabilistic model of a corpus. The basic idea of LDA is that documents are represented as random mixtures over latent topics, where each topic is characterized by a distribution over words.
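The LDA model is represented as a probabilistic graphical model in Figure 1. A k-dimensional Dirichlet random variable $\theta$ can take values in the (k−1)-simplex, and has the following probability density on this simplex:
$$p(\theta \mid \alpha) = \frac{\Gamma\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)} \, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1}$$
where the parameter $\alpha$ is a k-vector with components $\alpha_i > 0$, and where $\Gamma(x)$ is the Gamma function. The Dirichlet is a convenient distribution on the simplex: it has finite dimensional sufficient statistics and is conjugate to the multinomial distribution. Given the parameters $\alpha$ and $\beta$, the joint distribution of a topic mixture $\theta$, a set of N topics z, and a set of N words w is given by:
$$p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta)$$
where $p(z_n \mid \theta)$ is simply $\theta_i$ for the unique i such that $z_n^i = 1$. Integrating over $\theta$ and summing over z, we obtain the marginal distribution of a document:
$$p(w \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta) \, p(w_n \mid z_n, \beta) \right) d\theta$$
Finally, taking the product of the marginal probabilities of single documents, we obtain the probability of a corpus D:
$$p(D \mid \alpha, \beta) = \prod_{d \in D} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d) \, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d$$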
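The generative process behind these distributions can be sketched for a single document: draw a topic mixture from the Dirichlet, then draw a topic and a word for each position. This is only an illustration of the standard LDA process using the Python standard library; the values of alpha and beta, the vocabulary size, and the function names are hypothetical.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dirichlet(alpha) by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Draw an index with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(alpha, beta, n_words, rng):
    """theta ~ Dir(alpha); then for each word: z_n ~ theta, w_n ~ beta[z_n]."""
    theta = sample_dirichlet(alpha, rng)
    topics = [sample_categorical(theta, rng) for _ in range(n_words)]
    words = [sample_categorical(beta[z], rng) for z in topics]
    return theta, topics, words

rng = random.Random(0)
alpha = [0.5, 0.5]                # k = 2 topics (illustrative values)
beta = [[0.7, 0.3, 0.0, 0.0],     # topic 0: distribution over a 4-word vocabulary
        [0.0, 0.0, 0.4, 0.6]]     # topic 1
theta, topics, words = generate_document(alpha, beta, 8, rng)
print(theta, topics, words)
```

In the collaborative filtering setting of this paper, a document corresponds to a user and a word to an item, so the same process would generate a user's items from that user's topic mixture.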

Collaborative Filtering Recommenders using Topic Model
In the collaborative filtering algorithm, the input data is an m×n matrix, as shown in Table 1. This matrix is the input of the topic model. Using LDA, the matrix is transformed into the matrix shown in Figure 3, where $\theta_{ij}$ is the distribution of user i over item j.

Figure 3. User-item distribution matrix
With LDA, the purchase counts in the matrix are replaced by distributions. The similarity of users is calculated as:

Proposed Algorithm
The collaborative filtering recommender using the topic model is described as follows:
Input: user-item rating matrix
Output: Top-N recommendations
(a) Compute the similarity $sim_{i,j}$ from the cosine similarity $sim^{c}_{i,j}$ and the LDA-based similarity $sim^{LDA}_{i,j}$.
(b) Find the neighbors according to the similarity and the number of nearest neighbors.
(c) Predict the user's ratings as in Equation (9), where M is the set of neighbors.
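The steps of the algorithm can be sketched end to end as follows. Several parts are assumptions, because the corresponding equations are not available in this text: the two similarities are combined as a weighted sum with a hypothetical `weight`, the LDA-based similarity is taken as the cosine between per-user topic distributions, and the prediction uses the common mean-centred, similarity-weighted average over the neighbor set M. The `theta` values stand in for the output of a fitted LDA model and are made up.

```python
from math import sqrt

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hybrid_sim(R, theta, i, j, weight=0.5):
    """(a) ASSUMPTION: weighted sum of rating cosine and topic-distribution cosine."""
    return weight * cosine(R[i], R[j]) + (1 - weight) * cosine(theta[i], theta[j])

def neighbors(R, theta, u, k):
    """(b) The k users with the highest hybrid similarity to user u."""
    others = [v for v in range(len(R)) if v != u]
    others.sort(key=lambda v: hybrid_sim(R, theta, u, v), reverse=True)
    return others[:k]

def mean_rating(row):
    """Average over rated items only (0 means unrated)."""
    rated = [r for r in row if r > 0]
    return sum(rated) / len(rated) if rated else 0.0

def predict(R, theta, u, item, M):
    """(c) ASSUMPTION: mean-centred, similarity-weighted average over neighbors M."""
    num = den = 0.0
    for v in M:
        if R[v][item] > 0:
            s = hybrid_sim(R, theta, u, v)
            num += s * (R[v][item] - mean_rating(R[v]))
            den += abs(s)
    base = mean_rating(R[u])
    return base + num / den if den else base

def recommend(R, theta, u, k, n):
    """Final step: the n unrated items with the highest predicted rating for u."""
    M = neighbors(R, theta, u, k)
    unrated = [j for j, r in enumerate(R[u]) if r == 0]
    unrated.sort(key=lambda j: predict(R, theta, u, j, M), reverse=True)
    return unrated[:n]

R = [[5, 4, 0, 0],
     [5, 5, 4, 1],
     [1, 2, 5, 4]]                              # toy ratings, 0 = unrated
theta = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]    # hypothetical LDA output per user
print(recommend(R, theta, 0, 2, 1))
```

In this toy example, user 1 is much closer to user 0 than user 2 under both similarities, so user 1's high rating of item 2 and low rating of item 3 dominate the predictions.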

Experiments
We evaluated our algorithms on the MovieLens data set, which consists of 100,000 ratings (1-5) from 943 users on 1682 movies. To evaluate our algorithm, we use Mean Absolute Error (MAE), a common measure in recommender systems. It is the average of the absolute errors between the predictions for the target user and the eventual outcomes. MAE is given by
$$\mathrm{MAE} = \frac{1}{N} \sum_{(u,i)} \left| \hat{r}_{u,i} - r_{u,i} \right| \qquad (10)$$
where $\hat{r}_{u,i}$ is the predicted value of product i for user u, calculated as in Equation (9), $r_{u,i}$ is the actual rating, and N is the number of predictions evaluated.
Collaborative filtering algorithms include user-based and item-based variants. To validate our proposed algorithm, we run experiments on both.
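As a small worked example of the measure, the sketch below computes MAE for a handful of made-up predictions and ratings; the function name and the numbers are illustrative and are not results from the paper.

```python
def mean_absolute_error(predicted, actual):
    """Average of |prediction - actual rating| over the evaluated ratings."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, actual)) / len(predicted)

predicted = [4.5, 3.0, 2.0, 5.0]   # hypothetical predicted ratings
actual = [5, 3, 4, 4]              # hypothetical true ratings
print(mean_absolute_error(predicted, actual))  # (0.5 + 0 + 2 + 1) / 4 = 0.875
```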

User-based LDA Collaborative Filtering
In this part of the experiments, we first validate our proposed algorithm. We set the number of clusters to 5, 10, 20, 30, 40, and 50, and the neighbor size to 5, 10, 20, 30, 40, 50, 60, 80, 100, 130, 160, and 200. The number of topics is 20. The experimental results are shown in Figure 4. The x-axis of Figure 4 is the neighbor size, and the y-axis is the MAE calculated by Equation (10). There are 5 curves in Figure 4, which represent the MAE for different numbers of clusters.
For comparison with the baselines, we also run the user-based collaborative filtering algorithm with cosine, Pearson correlation, and adjusted cosine similarity when the number of clusters is 5. The comparison is shown in Figure 5. The neighbor sizes are the same as in Figure 4. There are 4 curves in Figure 5, which represent the MAE of the different methods.

Item-based LDA Collaborative Filtering
We also run experiments with the item-based algorithm. The experimental results are shown in Figure 6 and Figure 7. The experimental parameters are the same as in Section 4.1.
From the experimental results we can see that (1) Figure 4 and Figure 6 show the results of our proposed method under different numbers of clusters and different neighbor sizes. The results are evaluated by MAE, which is the average of the absolute errors between the predictions and the actual ratings, so lower values are better. For both the user-based and the item-based algorithm, the larger the number of clusters, the lower the MAE.
(2) The contribution of our proposed algorithm is the use of a topic model to compute the similarity between users. To verify its effectiveness, we also compare our algorithm with others. The baselines are cosine, Pearson correlation, and adjusted cosine.

Conclusions
Collaborative filtering algorithms use only the interactions between users and items, in the form of implicit or explicit ratings, to generate recommendations. In this case, similarity among users or items is calculated purely from rating overlap, without considering explicit properties of the users or items involved, which limits their applicability in domains with very sparse rating spaces. In this paper, we proposed collaborative filtering algorithms using a topic model, which can improve the similarity measure between users and items.

Figure 1. User-item rating matrix.
The boxes are "plates" representing replicates. The outer plate represents documents, while the inner plate represents the repeated choice of topics and words within a document. As the figure makes clear, there are three levels to the LDA representation. The parameters $\alpha$ and $\beta$ are corpus-level parameters, assumed to be sampled once in the process of generating a corpus. The variables $\theta_d$ are document-level variables, sampled once per document. Finally, the variables $z_{dn}$ and $w_{dn}$ are word-level variables and are sampled once for each word in each document.
(d) Recommend the Top-N items to the users.

Figure 7. Results of the four item-based collaborative filtering algorithms