Recommender Systems, Collaborative Filtering Approach
Recommendations play an important part in everyday decision making. Book suggestions on Amazon and movie suggestions on Netflix are real-world examples of recommender systems. The design of such a recommendation engine depends on the domain and the available data.
Recommender systems can be categorized as Content Based Recommendation Systems, Collaborative Filtering Recommendation Systems and Hybrid Recommendation Systems.
Content Based Recommendation: Generates recommendations according to the items the active user liked before. The recommender system analyzes the content of the user’s recent items and recommends new ones that match the analyzed data.
Collaborative Filtering Recommendation: Generates recommendations based on the active user’s previous preferences by finding other users who liked similar items (or by finding items that were liked by the same users).
Hybrid Recommendation: Generates recommendations by combining the two methods above.
There are several techniques in each category, and it is impossible to show examples for every category and approach here.
I am mainly interested in the Collaborative Filtering (CF) approach, and in this post you can find some implementation examples of it.
Why Collaborative Filtering?
I spent long hours on research before implementing a few algorithms, but in short, the approach you should choose really depends on your data and application domain.
If you are thinking about creating your own recommendation system (RecSys), first analyze your data, decide what features you’ll need in your application, and check whether a content based RecSys would be enough.
Based on my data and my project details, I decided that the best option for me is CF, and specifically that Incremental SVD is a great match.
Why Is Collaborative Filtering So Popular?
Netflix! In my opinion (and I think many others may agree with me), the Netflix Prize is the locomotive for CF approaches, and maybe for matrix factorization too.
The Netflix Prize was a competition to improve the accuracy of Netflix’s own algorithm by at least 10%, and the winner took $1M (wow!). It ended in 2009.
OK, enough history, let’s talk a bit about CF.
Collaborative Filtering recommendation systems work by collecting users’ preferences and generating predictions from user-based or item-based similarities.
CF methods can be divided into 2 categories: Memory (Neighborhood) Based and Model Based approaches.
Memory (Neighborhood) Based Recommendation Systems
Either user-based or item-based techniques can be used. In the user-based technique, a number of similar users are found by calculating a similarity score (we’ll get there soon: Jaccard Index, Euclidean…), and then, with a basic loop over the similar users’ items, every item the active user has not rated gets a weighted estimated rating. A very similar method works for the item-based technique.
I implemented both item-based and user-based techniques while working on recommender systems. The neighborhood approach is easy to understand and implement, and it takes only 3 steps:
1. Compute the similarity between the active user and all other users, and pick the top k users (the k nearest neighbors).
2. Select all items that the active user hasn’t rated yet.
3. Compute a weighted predicted rating for each of those items.
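The three steps can be sketched in a few lines of Python. This is only an illustration, not the code from the linked project: it assumes NumPy, a small dense user-by-item matrix where 0 means "not rated", and plain cosine similarity over whole rating rows. The function names and the toy matrix are made up for the example.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two rating rows (0 means 'not rated')."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def predict_user_based(ratings, active, k=2):
    """Return {item_index: predicted_rating} for items the active user has not rated."""
    n_users = ratings.shape[0]
    # Step 1: similarity of the active user to every other user; keep the top k.
    sims = np.array([
        cosine_sim(ratings[active], ratings[u]) if u != active else -np.inf
        for u in range(n_users)
    ])
    neighbors = np.argsort(sims)[::-1][:k]
    # Step 2: items the active user has not rated yet.
    unrated = np.where(ratings[active] == 0)[0]
    # Step 3: similarity-weighted average of the neighbors' ratings per item.
    predictions = {}
    for item in unrated:
        raters = [u for u in neighbors if ratings[u, item] > 0 and sims[u] > 0]
        if raters:  # skip items that none of the neighbors has rated
            weights = sims[raters]
            predictions[int(item)] = float(weights @ ratings[raters, item] / weights.sum())
    return predictions

# Toy data (hypothetical): rows are users, columns are items.
ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)

print(predict_user_based(ratings, active=1, k=2))
```

With a small k, items that none of the chosen neighbors has rated get no prediction at all, which is one reason the choice of k matters on sparse data.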
There are several methods to calculate similarities between items or users. I implemented only a few of them: Pearson Correlation Coefficient, Euclidean Distance, Jaccard Index and Cosine Based Similarity. Again, it is best to choose the similarity method according to your data and application.
You can see the math formulas for these methods at the end of the post.
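To make the four measures concrete, here is a rough Python sketch using only the standard library. The details are assumptions for the example rather than the exact definitions used in the linked project: each user is a dict of item to rating, Pearson and Euclidean are computed over co-rated items only, the Euclidean distance is turned into a similarity with 1 / (1 + distance), and Jaccard looks only at which items were rated, not at the rating values.

```python
import math

def co_rated(a, b):
    """Items rated by both users; a and b are dicts of item -> rating."""
    return sorted(set(a) & set(b))

def pearson(a, b):
    """Pearson correlation over co-rated items (0.0 if it is undefined)."""
    items = co_rated(a, b)
    n = len(items)
    if n < 2:
        return 0.0
    mean_a = sum(a[i] for i in items) / n
    mean_b = sum(b[i] for i in items) / n
    num = sum((a[i] - mean_a) * (b[i] - mean_b) for i in items)
    den = math.sqrt(sum((a[i] - mean_a) ** 2 for i in items)
                    * sum((b[i] - mean_b) ** 2 for i in items))
    return num / den if den else 0.0

def euclidean_similarity(a, b):
    """Euclidean distance over co-rated items, mapped into (0, 1]."""
    items = co_rated(a, b)
    if not items:
        return 0.0
    dist = math.sqrt(sum((a[i] - b[i]) ** 2 for i in items))
    return 1.0 / (1.0 + dist)

def jaccard(a, b):
    """Size of the shared item set divided by the size of the combined item set."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0

def cosine(a, b):
    """Cosine of the angle between the two rating vectors (missing = 0)."""
    num = sum(a[i] * b[i] for i in co_rated(a, b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

# Hypothetical users for illustration.
alice = {"A": 5, "B": 3, "C": 4}
bob = {"A": 4, "B": 2, "C": 3, "D": 5}
for name, fn in [("pearson", pearson), ("euclidean", euclidean_similarity),
                 ("jaccard", jaccard), ("cosine", cosine)]:
    print(name, round(fn(alice, bob), 3))
```

Note how Pearson treats alice and bob as perfectly correlated even though bob rates everything one point lower, while the Euclidean-based score penalizes that offset. This is the kind of difference that makes the choice depend on your data.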
Model Based Recommendation Systems
A model can be developed using matrix factorization, data mining or machine learning algorithms to find patterns in the data. This model is then used to make predictions on real data.
There are various Collaborative Filtering modelling algorithms including Bayesian Networks, Latent Semantic Models, Clustering Models, Singular Value Decomposition and others.
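As one example of the model based idea, here is a minimal truncated SVD sketch in Python with NumPy. It is a simplified illustration under stated assumptions, not the implementation from the linked project: unrated cells (0) are first filled with the user’s mean rating, the matrix is centered on those means, and only the top k singular values are kept. The function name, the toy matrix and k=2 are made up for the example.

```python
import numpy as np

def svd_predict(ratings, k=2):
    """Return a full matrix of predicted ratings from a rank-k SVD."""
    rated = ratings > 0
    # Assumption: impute each unrated cell with that user's mean rating.
    user_means = ratings.sum(axis=1) / np.maximum(rated.sum(axis=1), 1)
    filled = np.where(rated, ratings, user_means[:, None])
    # Center on user means so the factors model deviations from them.
    centered = filled - user_means[:, None]
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    # Keep only the k strongest latent factors and rebuild the matrix.
    approx = (U[:, :k] * s[:k]) @ Vt[:k, :]
    return approx + user_means[:, None]

# Toy data (hypothetical): rows are users, columns are items, 0 = not rated.
R = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [1, 0, 0, 4],
    [0, 1, 5, 4],
], dtype=float)

predicted = svd_predict(R, k=2)
print(np.round(predicted, 2))
```

The cells that were 0 in R now hold the model’s estimates. A smaller k gives a smoother, more generalized model, while a k close to the full rank simply reproduces the filled-in matrix.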
So what are the pros and cons compared to the Neighborhood Based approach?
The main disadvantage worth mentioning is that model based methods are harder to implement (but not that much, don’t give up 🙂 ).
The advantages are that they handle data sparsity better, scale to large datasets, and improve prediction performance.
I implemented Item-based SVD, User-based SVD and Incremental SVD matrix factorization techniques to see which approach suits my data best. You can check all the code in the GitHub project linked below.
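One common way to make SVD incremental is "folding-in": instead of recomputing the factorization when a new user arrives, the new rating row is projected into the existing latent space. The sketch below shows that idea with NumPy. It is an assumption about what "incremental" could look like, not necessarily what the linked project does, and it assumes a dense matrix that has already been imputed. All function names and numbers are hypothetical.

```python
import numpy as np

def fit_svd(matrix, k):
    """Rank-k SVD of a dense user-by-item matrix."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def fold_in_user(new_row, s, Vt):
    """Project a new user's rating row into the existing k-dimensional space."""
    return new_row @ Vt.T / s

def reconstruct(user_factors, s, Vt):
    """Map user factors back to predicted ratings for every item."""
    return (user_factors * s) @ Vt

# Toy data (hypothetical): rows are users, columns are items.
M = np.array([
    [5, 3, 1],
    [4, 3, 1],
    [1, 1, 5],
    [1, 2, 4],
], dtype=float)

U, s, Vt = fit_svd(M, k=2)

# A new user arrives; the model is reused instead of being refit.
new_user = np.array([5.0, 2.0, 1.0])
factors = fold_in_user(new_user, s, Vt)
print(np.round(reconstruct(factors, s, Vt), 2))
```

Folding-in is cheap, but the item factors never change, so the model drifts from the data as more users are added and eventually needs a full refit.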
I won’t go into the details of these implementations, SVD or matrix factorization, but you can check the useful links below, including my implementations. You should also search academic venues such as IEEE for new developments on the topic, especially work by Badrul Munir Sarwar and Yehuda Koren (a member of the team that won the Netflix Prize).
Finally, I used the publicly available MovieLens data (mostly the package with 100K ratings). You can download it here: http://www.grouplens.org/node/73
Similarity Methods Math Formulas