Software Programming

Kunuk Nykjaer

Predicting Country Happiness using WEKA – Data Mining


Data mining is knowledge discovery: you take some data, examine it and discover knowledge in it.
In this project I want to predict the happiness of a country from statistical information about the countries of the world.

In 2006 a psychologist created a Satisfaction with Life Index for 178 countries. The numeric ranking data is available on Wikipedia.

The raw data used for the knowledge discovery is taken from the World Bank's World DataBank. It covers all the countries and has thousands of attributes, such as death rate, population rate and income.

The data mining algorithms are applied to the dataset with WEKA, a great data mining tool.

The project process flow
[Figure: project flow]

After the preliminary preprocessing step there were 100 countries and 484 attributes to work with. The attributes were then reduced to 15 to remove redundancy and avoid high-dimensionality issues.

I use classification algorithms, more specifically decision trees, so that I can show a decision tree graph.
I want to train a prediction model on the statistical information, using the Satisfaction with Life Index as the class label. The index is a numeric value, so I converted it into three categories, Happy, Indifferent and Unhappy, with the 100 countries distributed evenly among them.
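The conversion from a numeric index to three evenly sized categories can be sketched roughly as below. This is only an illustration of equal-frequency binning, not necessarily how the original conversion was done; the function name `to_three_classes` and the example scores are made up.

```python
def to_three_classes(scores):
    """Map each country's numeric score to Unhappy / Indifferent / Happy.

    `scores` is a dict of country -> Satisfaction with Life Index value.
    Countries are ranked by score and cut into three near-equal groups.
    """
    ranked = sorted(scores, key=scores.get)  # lowest score first
    n = len(ranked)
    labels = {}
    for position, country in enumerate(ranked):
        if position < n // 3:
            labels[country] = "Unhappy"
        elif position < 2 * n // 3:
            labels[country] = "Indifferent"
        else:
            labels[country] = "Happy"
    return labels


# Made-up scores for illustration only, not the real index values.
example = {"A": 120.0, "B": 250.0, "C": 180.0, "D": 90.0, "E": 210.0, "F": 160.0}
labels = to_three_classes(example)
```

With 100 countries this cut gives groups of 33, 33 and 34, which is as even as three groups can get.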

This is the graph example where Happy is :-), Indifferent is :-| and Unhappy is :-(

Here you can see the attributes used in the J48 Graft decision tree algorithm.
[Figure: J48 Graft (C4.5 decision tree)]

The values are normalized to the range 0 to 1. For example, the graph shows that a higher death rate leads to more unhappiness and that more threatened plant species leads to more happiness (this is discovered knowledge).
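Normalizing an attribute to the range 0 to 1 can be sketched like this, assuming plain min-max scaling (the post does not say which method was used). The function name and the sample values are hypothetical.

```python
def min_max_normalize(values):
    """Rescale a list of numbers linearly so min -> 0.0 and max -> 1.0."""
    low, high = min(values), max(values)
    if high == low:
        # A constant attribute carries no information; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Hypothetical death rates per 1,000 people, for illustration only.
death_rate = [5.0, 10.0, 15.0]
normalized = min_max_normalize(death_rate)
```

Putting every attribute on the same scale makes the split thresholds in the tree easier to compare across attributes.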

By using the Functional Tree or Logistic Model Tree decision tree algorithms I was able to get the prediction accuracy up to 65% with 10-fold cross-validation. Thus, if you gave me a random unseen country with its statistical information, I could guess with 65% accuracy whether the country is Happy, Indifferent or Unhappy.
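For readers new to the idea, 10-fold cross-validation splits the data into ten parts and uses each part once as test data while training on the other nine. Below is a minimal sketch of the splitting step only. It is a plain shuffle-and-split, whereas WEKA stratifies the folds by class, and the names `k_fold_indices` and `seed` are my own.

```python
import random


def k_fold_indices(n_instances, k=10, seed=1):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k near-equal folds
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# 100 countries split into 10 folds of 10 countries each.
splits = list(k_fold_indices(100, k=10))
```

The reported accuracy is the share of countries classified correctly across all ten test folds, so every country is predicted exactly once by a model that never saw it.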

—————-
WEKA Results

Training data: 80 countries
Test data: 20 countries

Model building on the training data
Correctly Classified Instances 75 93.75 %
Incorrectly Classified Instances 5 6.25 %

10-fold cross-validation
Correctly Classified Instances 49 61.25 %
Incorrectly Classified Instances 31 38.75 %

The built model is clearly overfitted: it classifies 93.75% of its own training data correctly but only 61.25% under cross-validation.

Testing the model (the overfitted version built on all the training data, not a cross-validation model) on the test data of 20 countries gave 60% correct classification, 12 out of 20. Better than random guessing among three classes, but not amazing.

=== Confusion Matrix for J48 Graft, 10-fold cross-validation on the training data ===
  a  b  c   <-- classified as
 15 10  4 |  a = :-|
  8 14  2 |  b = :-(
  6  1 20 |  c = :-)

The confusion matrix shows that the Indifferent class is the hardest to predict: only 15 of its 29 countries are classified correctly.
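The numbers in the matrix can be checked by hand or with a few lines of code. The sketch below recomputes the overall accuracy and the per-class recall from the matrix above, with rows as actual classes and columns as predicted classes; the variable names are my own.

```python
# Rows are actual classes, columns are predicted classes, in the order
# Indifferent, Unhappy, Happy (copied from the matrix above).
labels = ["Indifferent", "Unhappy", "Happy"]
matrix = [
    [15, 10, 4],
    [8, 14, 2],
    [6, 1, 20],
]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / total  # 49 / 80 = 0.6125

# Recall: of the countries that really belong to a class, how many were found.
recall = {
    label: matrix[i][i] / sum(matrix[i])
    for i, label in enumerate(labels)
}

for label in labels:
    print(f"{label}: recall {recall[label]:.2f}")
print(f"accuracy {accuracy:.4f}")
```

Recall comes out at about 0.52 for Indifferent, 0.58 for Unhappy and 0.74 for Happy, so the middle class is the one the tree confuses most, mainly with Unhappy.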

WEKA console usage examples:

WEKA model building
java weka.classifiers.trees.J48graft -C 0.25 -M 2 -t c:\temp\trainingdata.csv -d c:\temp\mymodel.model

WEKA running the model with test data
java weka.classifiers.trees.J48graft -p 15 -l c:\temp\mymodel.model -T c:\temp\testdata.csv

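The value 15 refers to column 15, where the class label is located. (In WEKA's command-line options, -p takes the attribute range to print with each prediction; the class index itself is set with -c and defaults to the last attribute.)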

I discovered that the nominal class labels must appear in the same order in the training data and the test data. Otherwise you get an unhelpful exception when running the test data against the model.
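A quick check before running WEKA can catch this. The sketch below assumes that the label order is determined by the first appearance of each label in the class column of the CSV file, which matches what I observed; the function `label_order` and the tiny example files are made up for illustration.

```python
import csv
import io


def label_order(csv_text, class_column):
    """Return the class labels in order of first appearance.

    `class_column` is the 0-based index of the class label column.
    The first row is treated as the header and skipped.
    """
    rows = csv.reader(io.StringIO(csv_text))
    next(rows)  # skip header
    seen = []
    for row in rows:
        label = row[class_column]
        if label not in seen:
            seen.append(label)
    return seen


# Tiny made-up files, for illustration only.
train_csv = "death_rate,happiness\n0.2,Happy\n0.9,Unhappy\n0.5,Indifferent\n"
test_csv = "death_rate,happiness\n0.8,Unhappy\n0.1,Happy\n0.4,Indifferent\n"

train_order = label_order(train_csv, class_column=1)
test_order = label_order(test_csv, class_column=1)
same_order = train_order == test_order
```

Here `same_order` is false because the test file starts with an Unhappy country. Reordering the test rows so that the labels first appear in the same sequence as in the training file, or converting both files to ARFF with an explicit nominal declaration, should avoid the problem.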

Written by kunuk Nykjaer

June 3, 2011 at 12:56 am
