Predicting Country Happiness using WEKA – Data Mining
Data mining is knowledge discovery: you take some data, examine it, and discover knowledge hidden in it.
In this project I want to predict a country's happiness from statistical information about the countries of the world.
The raw data used for the knowledge discovery is taken from the World Bank's databank. It covers all the countries, with thousands of attributes per country, such as death rate, population growth rate and income.
The data mining algorithms are applied to the dataset with WEKA, a great data mining tool.
The project process flow
After the preliminary preprocessing step, there were 100 countries and 484 attributes to work with. The attributes were then reduced to 15 to remove redundancy and avoid high-dimensionality issues.
Classification algorithms are used, more specifically decision trees, so that the resulting model can be shown as a decision tree graph.
I want to train a prediction model on the statistical information, using the Satisfaction with Life Index as the class label. The Satisfaction with Life Index is a numeric value, so I converted it into three categories (Happy, Indifferent and Unhappy), evenly distributed across the 100 countries.
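The three-way split described above can be sketched as equal-frequency binning by rank. This is a sketch of the idea, not the exact procedure used in the project, and the index values below are made up for illustration:

```python
# Split a numeric "satisfaction with life" score into three
# equal-frequency bins: lowest third = Unhappy, middle third =
# Indifferent, top third = Happy.

def discretize_even(scores, labels=("Unhappy", "Indifferent", "Happy")):
    """Assign each score to one of three equal-sized bins by rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    bins = [None] * len(scores)
    third = len(scores) // 3
    for rank, i in enumerate(order):
        bins[i] = labels[min(rank // third, 2)]
    return bins

scores = [4.1, 7.8, 5.5, 6.9, 3.2, 8.4]  # hypothetical index values
print(discretize_even(scores))
```

With 100 countries this gives bins of 33, 33 and 34 countries, which matches the "even distribution" above.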
This is the graph example, where Happy is 🙂, Indifferent is 😐 and Unhappy is 😦.
The attribute values are normalized to the range 0 to 1. The graph shows, for example, that a higher death rate leads to more unhappiness, and that more threatened plant species go together with more happiness (this is the discovered knowledge).
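The 0-to-1 normalization mentioned above is presumably plain min-max scaling, which is also what WEKA's unsupervised Normalize filter does by default. A minimal sketch, with invented death-rate numbers:

```python
# Min-max normalization: rescale each attribute so its smallest value
# becomes 0 and its largest becomes 1.

def min_max_normalize(values):
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant attribute: map everything to 0
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

death_rate = [5.0, 12.5, 20.0]        # hypothetical attribute values
print(min_max_normalize(death_rate))  # [0.0, 0.5, 1.0]
```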
By using the Functional Tree or Logistic Model Tree decision tree algorithms, I was able to push the prediction accuracy up to 65% with 10-fold cross-validation. Thus, given a random unseen country with its statistical information, I could guess with 65% accuracy whether the country is Happy, Indifferent or Unhappy.
Training data: 80 countries
Test data: 20 countries

Model building on the training data:

Correctly Classified Instances      75      93.75 %
Incorrectly Classified Instances     5       6.25 %

10-fold cross-validation:

Correctly Classified Instances      49      61.25 %
Incorrectly Classified Instances    31      38.75 %
The built model is clearly over-fitted.
Testing the model (the over-fitted version, not the 10-fold cross-validated one) on the test data of 20 countries gave 60% correct classification, 12 out of 20. Better than random, but not amazing.
=== Confusion Matrix for J48graft, 10-fold cross-validation on the training data ===
  a  b  c   <-- classified as
 15 10  4 |  a = 😐
  8 14  2 |  b = 😦
  6  1 20 |  c = 🙂
The confusion matrix shows that the Indifferent class is the hardest to predict.
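Computing per-class recall (the fraction of each actual class that was classified correctly) from the matrix above makes this concrete:

```python
# Per-class recall from the J48graft confusion matrix: rows are the
# actual class, columns the predicted counts in the order (a, b, c) =
# (Indifferent, Unhappy, Happy).

matrix = {
    "Indifferent": [15, 10, 4],
    "Unhappy":     [8, 14, 2],
    "Happy":       [6, 1, 20],
}

for idx, name in enumerate(matrix):
    row = matrix[name]
    recall = row[idx] / sum(row)   # diagonal count / row total
    print(f"{name}: {recall:.0%}")
```

Indifferent comes out lowest (about 52%, versus roughly 58% for Unhappy and 74% for Happy), which is why it is the hardest class to predict.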
WEKA console usage examples:
WEKA model building
java weka.classifiers.trees.J48graft -C 0.25 -M 2 -t c:\temp\trainingdata.csv -d c:\temp\mymodel.model
WEKA running the model with test data
java weka.classifiers.trees.J48graft -p 15 -l c:\temp\mymodel.model -T c:\temp\testdata.csv
The value 15 tells WEKA that the class label is in column 15.
I discovered that the nominal class labels must appear in the same order in the training data and the test data. Otherwise you get a rather unhelpful exception when running the test data against the model.
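A quick way to catch this pitfall before running WEKA is to compare the first-appearance order of the class labels in the two CSV files. A sketch; the file paths and the class-column index are hypothetical, and header-row handling is omitted:

```python
# Collect the class labels of a CSV file in order of first appearance,
# so the ordering can be compared between training and test data.
import csv

def label_order(path, class_col):
    """Return the class labels in order of first appearance."""
    seen = []
    with open(path, newline="") as f:
        for row in csv.reader(f):
            label = row[class_col]
            if label not in seen:
                seen.append(label)
    return seen

# Example check (hypothetical files; class label in the last column):
# assert label_order("trainingdata.csv", 14) == label_order("testdata.csv", 14)
```

If the two orderings differ, reordering the rows of the test file (or declaring the nominal values explicitly in an ARFF header) avoids the exception.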