Software Programming

Kunuk Nykjaer

Archive for the ‘Data Mining’ Category

How does the Amazon recommendation system work? – Analyze the algorithm and make a prototype that visualizes the algorithm

with 13 comments

I think the recommendation systems are interesting.

I decided that I wanted to learn how the Amazon recommendation works in theory and afterwards implement a demo-site that visualized the algorithm. My implementation is not identical as the Amazon’s version but it follows the same principle. My main focus is to illustrate and visualize the concept. I used a item-to-item matrix table for simplicity.

Reference reading materials:
(2) Amazon Recommendation Patent

I recommend reading (1). This is only 5 pages long and easy to read. Amazon’s algorithm is based on item-to-item filtering. In short: Amazon developed their own recommendation system based on items rather than users, because there are fewer items than users (scalability). The list of visited items by any user is stored in a item-to-item matrix table. The recommendation algorithm are calculated by using the Cosine Similarity function on the vectors from the matrix table.

I recommend reading (2) for technical insights and design overview of how Amazon has implemented it.

For the demo-site I wrote a short abstract. The demo-site can be seen here.
The demo project can be downloaded here

How does the Amazon recommendation works? 

This is about visualizing the item to item collaborations filtering mechanism using a item-to-item matrix table.
The item-to-item matrix, the vectors and the calculated data values are displayed.

There are n different items and the item recommendation can display up to m items.

There are implemented different item-to-item neighborhood functions. 
A simple max count of seen neighbor items, the Cosine Similarity and the Jaccard Index.

A tracker keeps track of visited items for any user and is saved to a matrix table.
To make it simple only the relation between previous and current viewed item are tracked in this example.


The demo-site has two pages: Home and item page. The item page shows specific information regarding the viewed item and the recommendation is displayed here.

There are 5 different components

  • Multiple view – shows multiple components, all the items are displayed here
  • User view – shows specific information about the current user in the session
  • Item view – shows detailed information about the current item
  • Recommendation view – shows recommended items based on the current item
  • Data view – visualizes the data structure used by the recommendation algorithm


This shows how the tracker collects the data from the users in to the matrix table. (The illustrated tracking method is a simplified version. You could also iterate the viewed items when a user view a new item to save all the item-to-item relation for the viewed items).

When a new visitor user3 sees the item A,
the recommendation system founds out the closest match are the Items B and C.

The general idea of the Amazon recommendation engine is to locate item vectors which are similar in pattern for the current viewed item vector.

e.g. A vector with pattern [1,1,1,0,0,0,0] is more similar to vector [0,1,1,0,1,0,0] than to vector [0,0,0,1,0,1,1].

Written by kunuk Nykjaer

March 4, 2012 at 6:43 pm

Social data exploration

with 2 comments

I will have fun playing with Social data. The idea is to make a site where users can login with a social profile such as Facebook, Google or Twitter account. After the users login the user data will be extracted using the social API. The purpose is to explore Social sites API and use them to make web-tools for easy data exploration, make data visualizations or maybe even knowledge discovery using Data mining techniques.

I will put the content on this site.

I will use the Socialauth-net login framework.

Written by kunuk Nykjaer

December 4, 2011 at 9:56 pm

Predicting Country Happiness using WEKA – Data Mining

leave a comment »

Data Mining is knowledge discovery. You take some data, examine it and discover some knowledge.
In this project I want to predict country happiness by given statistical information of the countries in the world.

In 2006 psychologist made a satisfaction life index for 178 countries. The numeric ranking data is in Wikipedia.

The raw data which is used for the knowledge discovery is taken from World data bank. This is data from all the countries with thousands of attributes information such as Death rate, Population Rate, Money income.

A great Data Mining tool called WEKA is used where the data mining algorithms are applied on the dataset.

The project process flow
project flow

After the preliminary preprocessing step, there were 100 countries and 484 attribute information to work with. The attributes are reduced to 15 removing redundancy and high dimensionality issues.

Classification algorithms are used, more specific Decision Trees so I can show some decision tree graph.
I want to train a prediction model based on the statistic information using the satisfaction with life index as a class label. The Satisfaction with Life Index is a numeric value. I have converted it to three categories: Happy, Indifferent and Unhappy with even distribution of the 100 countries.

This is the graph example where Happy is :-), Indifferent is 😐 and Unhappy is 😦

Here you can see the attributes used in the J48 Graft decision tree algorithm.
J48 Graft (C4.5 Decision Tree) – click to see full picture
decision tree

The values are normalized from 0 to 1. The graph for example shows that higher death rate leads to more unhappiness and plant species that are threatened leads to more happiness (this is discovered knowledge).

By using Functional Tree or Logistic Model Tree decision tree algorithms I was able to get the prediction up to 65% with 10 cross-fold validation. Thus if you gave me a random unseen country with statistic information I could guess with 65% accuracy whether the country happiness is happy, indifferent or Unhappy.

Reference links

WEKA Results

Trainingdata 80 countries
Testdata 20 countries

Training data Model building
Correctly Classified Instances 75 93.75 %
Incorrectly Classified Instances 5 6.25 %

10 cross-folds validation
Correctly Classified Instances 49 61.25 %
Incorrectly Classified Instances 31 38.75 %

The build model is clearly over-fitted.

Testing the model (the over-fitted version, not 10 cross-fold) using the test data of 20 countries gave 60% correct classification, 12 out of 20. Better than random but not amazing.

=== Confusion Matrix for J48 Graft 10-cross fold of training data ===
a b c <– classified as
15 10 4 | a = 😐
8 14 2 | b = 😦
6 1 20 | c = 🙂

The confusion matrix tells that it is hard to predict indifferent happiness

WEKA console usage examples:

WEKA model building
java weka.classifiers.trees.J48graft -C 0.25 -M 2 -t c:\temp\trainingdata.csv -d c:\temp\mymodel.model

WEKA running the model with test data
java weka.classifiers.trees.J48graft -p 15 -l c:\temp\mymodel.model -T c:\temp\testdata.csv

The value 15 tells the class label is at column 15

I discovered that the order appearance of the nominal class label must be the same in training and test-data. Else you get a no meaningful exception when running the test data against the model.

Written by kunuk Nykjaer

June 3, 2011 at 12:56 am