Costa Rican Household Poverty Level Prediction

Aneesha Nanda, Ansul Sinha, Juan Alvarez, Katie Goulding

Introduction

About the project

The purpose of this project is to examine which characteristics of a household can predict its need for social welfare assistance. This matters because social welfare programs are often controversial, and distributing resources where they are most needed helps create a more equitable world. Current methods for identifying where to provide aid, and whether certain segments of the population are receiving enough of it, rely on such factors and can produce inaccurate assessments of social need. This project focuses on observable traits of a household and the demographics of each household member to predict a family's level of need.

Using an open-source dataset from Kaggle, we let our prior research into social welfare determinants drive our analysis. After converting our outcome into a binary variable, we refer to the two outcomes as vulnerable and not vulnerable; the "vulnerable" label also includes households classified as being in moderate or extreme poverty.

About the team

We are a group of Informatics students at the University of Washington who like to explore datasets and build models to predict outcomes. This project is for our Introduction to Data Science course.

Data Preparation

The dataset required substantial cleaning prior to use. We started by dropping columns with squared values because they were highly correlated with data that already existed in the dataset and were therefore unnecessary. We then discovered that the edjefe/edjefa columns, which record years of education for the head of household, contained a mix of integers and strings; we converted all “no” values to 0 and all “yes” values to 1. The dataset also originally categorized households into 4 levels of poverty vulnerability, which we converted into a binary outcome: 1 for vulnerable and 0 for not vulnerable.
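A minimal sketch of these cleaning steps, using toy values in place of the real rows (the column names are the dataset's; the values are invented for illustration):

```python
import pandas as pd

# Toy frame standing in for the Kaggle data; the column names (SQBage,
# edjefe, edjefa, Target) come from the real dataset, the values do not.
df = pd.DataFrame({
    "SQBage": [4, 9],            # squared copy of an existing column
    "edjefe": ["no", "yes"],     # years of education, male head of household
    "edjefa": ["6", "no"],       # years of education, female head of household
    "Target": [1, 4],            # 1-3 = degrees of poverty, 4 = not vulnerable
})

# 1. Drop the squared columns: they duplicate information already present.
df = df.drop(columns=[c for c in df.columns if c.startswith("SQB")])

# 2. edjefe/edjefa mix numbers and strings: map "no" -> 0, "yes" -> 1.
for col in ["edjefe", "edjefa"]:
    df[col] = df[col].replace({"no": 0, "yes": 1}).astype(int)

# 3. Collapse the 4-level target into binary: 1 = vulnerable (levels 1-3).
df["Target"] = (df["Target"] != 4).astype(int)
```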

After the initial preparation of the data, we needed to deal with missing values. The “meaneduc” column had only 5 missing values, so we imputed 0 knowing this would hardly affect our data. We discovered that the “v18q1” column, the number of tablets a household owned, was directly related to another column indicating whether the household had any tablets at all. Values in “v18q1” were missing only when the household had no tablets, so we could again impute 0. The “rez_esc” column, the number of years a person is behind in school, also had many missing values. When we plotted the age distribution, however, it became clear that the missing values belonged to people either too young to have started school or old enough to have already completed it, so we could confidently fill in 0 for the number of years behind in school. Finally, “v2a1”, which recorded monthly rent payments, had many missing values. Luckily, the data table also provided the type of ownership people had over their homes, and we discovered that a majority of people owned their homes outright and therefore had no value in the rent column. Again, we were able to impute 0 for these rows without skewing the data much.
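The imputations described above can be sketched as follows; the frame is synthetic, but the column names and fill logic match the text:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "meaneduc":  [9.0, np.nan, 11.0],       # only a handful missing overall
    "v18q":      [0, 0, 1],                 # owns a tablet?
    "v18q1":     [np.nan, np.nan, 2.0],     # tablet count: NaN only if v18q == 0
    "tipovivi1": [1, 1, 0],                 # owns the home outright
    "v2a1":      [np.nan, np.nan, 150000.0] # monthly rent: NaN for homeowners
})

# Each missing-value pattern has a reason, so filling with 0 is defensible:
df["meaneduc"] = df["meaneduc"].fillna(0)  # so few nulls the effect is tiny
df["v18q1"]    = df["v18q1"].fillna(0)     # no tablets owned -> 0 tablets
df["v2a1"]     = df["v2a1"].fillna(0)      # homeowners pay no rent
# "rez_esc" (years behind in school) gets the same fillna(0) treatment.
```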

Dealing with Missing Data

Table of count of null values per column:

v2a1        6860
v18q1       7342
rez_esc     7928
meaneduc       5

Table of the count of home-ownership type among households with missing rent values:

Owns: 5911
Installments: 0
Rents:  0
Precarious:  163
Other:  786
After imputation, a final null-value check returned an empty result: no missing values remain.

Exploratory Data Analysis

From our background research, we learned that size of family, level of education, and gender are three key factors that make a person more “at risk” of being in poverty. With this in mind, we created two new data frames separated by our outcome variable: vulnerable (includes moderate and extreme poverty) and not vulnerable.

As we explored the number of people per household, we initially graphed raw frequencies, which made it look as though vulnerable households had more people than not vulnerable ones. After normalizing the data to percentages, however, the graphs told a more truthful story. Curious how technology ownership splits across our binary outcome, we found little difference in the ownership of mobile phones and computers.
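Normalizing within each group is what turns the misleading frequency plot into a comparable one; a small sketch with invented household sizes:

```python
import pandas as pd

# Invented example: household sizes tagged with the binary outcome.
households = pd.DataFrame({
    "hhsize":     [3, 4, 4, 5, 3, 4],
    "vulnerable": [1, 1, 0, 0, 1, 0],
})

# Raw counts mislead when the groups differ in size; compare percentages
# within each outcome group instead.
pct = (households.groupby("vulnerable")["hhsize"]
                 .value_counts(normalize=True) * 100)
```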

Our first visual helps us understand the breakdown of our data. It tells us that there are roughly 33% more people who are not vulnerable than people who are vulnerable, in moderate poverty, or in extreme poverty combined. Assuming the data collection process was sound, this insight tells us that the majority of households are not vulnerable to poverty. One chart surprised us because it did not match our research: the number of people in vulnerable households. While roughly 25% of both vulnerable and not vulnerable households contain 4 people, roughly 25% of vulnerable households have 3 people and roughly 20% of not vulnerable households have 5 people. This contrasts with the research we found stating that a greater number of people in a household is correlated with an increased likelihood of experiencing poverty.

Number of households, not vulnerable vs. vulnerable (including extreme and moderate poverty)

Number of People in Households (percent)


Number of people in households (frequency)


Years of education of the female head of household


Households that own a mobile phone

Households that own a computer

Monthly rent

Feature Selection for Statistical Modelling

To select the best features for our logistic model, we took a recursive feature elimination (RFE) approach to narrow down our 135+ features. Before running RFE, we removed household-size columns that were highly correlated with one another. Since we used a logistic regression model, our RFE selector was fitted with a LogisticRegression object. Taking the top 30 features, we built a formula string with Target as our dependent variable and the 30 features as our independent variables.
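A sketch of that selection step, with synthetic data standing in for the cleaned feature matrix (the real run selected 30 of 135+ columns; the feature names f0, f1, … here are hypothetical):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the cleaned feature matrix.
X, y = make_classification(n_samples=300, n_features=40, n_informative=8,
                           random_state=0)

# RFE refits the logistic model repeatedly, dropping the weakest
# coefficient each round until 30 features remain.
rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=30)
rfe.fit(X, y)

# Build a patsy-style formula string from the surviving features.
selected = [f"f{i}" for i in np.where(rfe.support_)[0]]
formula = "Target ~ " + " + ".join(selected)
```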

Statistical Modeling: Logistic Regression

We conducted a logistic regression analysis for our statistical modeling, which is appropriate for our data because there are many independent variables (predicting features) and the dependent variable (our Target) is dichotomous: either the household is vulnerable or it is not.

Generalized Linear Model Regression Results
Dep. Variable: Target No. Observations: 9557
Model: GLM Df Residuals: 9526
Model Family: Binomial Df Model: 30
Link Function: logit Scale: 1.0000
Method: IRLS Log-Likelihood: -4820.4
Date: Tue, 04 Dec 2018 Deviance: 9640.8
Time: 23:16:25 Pearson chi2: 1.04e+04
No. Iterations: 23 Covariance Type: nonrobust
coef std err z P>|z| [0.025 0.975]
Intercept 1.0565 0.220 4.812 0.000 0.626 1.487
v18q -1.1485 0.074 -15.473 0.000 -1.294 -1.003
r4t1 0.6062 0.029 21.249 0.000 0.550 0.662
paredzinc -0.6618 0.225 -2.936 0.003 -1.104 -0.220
paredfibras 0.5143 0.822 0.626 0.531 -1.096 2.125
pisoother -23.2988 4.28e+04 -0.001 1.000 -8.4e+04 8.39e+04
pisonatur 22.3087 4.02e+04 0.001 1.000 -7.88e+04 7.88e+04
pisomadera 0.6235 0.100 6.222 0.000 0.427 0.820
techootro -23.4864 2.51e+04 -0.001 0.999 -4.92e+04 4.91e+04
abastaguano 1.5857 0.639 2.480 0.013 0.332 2.839
noelec -1.6197 0.530 -3.058 0.002 -2.658 -0.581
energcocinar2 -0.9413 0.120 -7.842 0.000 -1.177 -0.706
energcocinar3 -0.8203 0.119 -6.882 0.000 -1.054 -0.587
elimbasu4 24.0321 3.3e+04 0.001 0.999 -6.47e+04 6.48e+04
elimbasu6 -24.3398 3.6e+04 -0.001 0.999 -7.06e+04 7.06e+04
eviv3 -0.7015 0.053 -13.338 0.000 -0.805 -0.598
dis 0.3513 0.102 3.456 0.001 0.152 0.550
hogar_adul -0.2549 0.023 -11.016 0.000 -0.300 -0.210
instlevel1 1.2102 0.114 10.647 0.000 0.987 1.433
instlevel2 1.3999 0.107 13.031 0.000 1.189 1.611
instlevel3 1.0893 0.105 10.420 0.000 0.884 1.294
instlevel4 1.0965 0.106 10.355 0.000 0.889 1.304
instlevel5 0.7368 0.118 6.218 0.000 0.505 0.969
instlevel6 1.3757 0.192 7.171 0.000 1.000 1.752
instlevel9 -1.8810 0.724 -2.596 0.009 -3.301 -0.461
tipovivi2 -0.8763 0.102 -8.585 0.000 -1.076 -0.676
tipovivi3 -0.4191 0.066 -6.304 0.000 -0.549 -0.289
tipovivi4 0.9079 0.223 4.065 0.000 0.470 1.346
computer -0.6654 0.123 -5.420 0.000 -0.906 -0.425
mobilephone -0.8760 0.160 -5.477 0.000 -1.189 -0.563
lugar3 0.6069 0.101 6.000 0.000 0.409 0.805

Confusion Matrix of Logistic Regression:

array([[5112,  884],
       [1512, 2049]], dtype=int64)
0.4245998315080034 is the Type I error, where we incorrectly classified true negatives as positives.
0.1474316210807205 is the Type II error, where we incorrectly classified true positives as negatives.
Our accuracy score for logistic regression is 0.7492937114157162.
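The error rates and accuracy follow directly from the confusion matrix; a quick arithmetic check, using the report's labeling of the two error types:

```python
import numpy as np

# Confusion matrix from the logistic regression above
# (rows = actual class, columns = predicted class).
cm = np.array([[5112,  884],
               [1512, 2049]])

# The off-diagonal share of each actual class reproduces the printed rates.
type_ii  = cm[0, 1] / cm[0].sum()    # 0.1474... (report's Type II error)
type_i   = cm[1, 0] / cm[1].sum()    # 0.4246... (report's Type I error)
accuracy = np.trace(cm) / cm.sum()   # 0.7493...
```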

Logistic Regression Insights

An AUC score of 0.714 indicates that the probability of our model ranking a random positive example higher than a random negative example is 71.4%. Since our AUC value is closer to 1 than to 0.5, the model does a fair job of classifying the outcome; however, it can be greatly improved with the machine learning models explored below.
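An AUC like this is computed from the model's predicted probabilities; a minimal sketch with synthetic labels and scores (the 0.714 figure comes from the report's own run):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic labels and scores stand in for the model's test-set
# predicted probabilities.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 500)
y_score = y_true * 0.5 + rng.random(500)  # scores correlated with the label

# AUC = probability a random positive is ranked above a random negative.
auc = roc_auc_score(y_true, y_score)
```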

Splitting Data

The data was split using a standard test size of 0.2, giving 80% training data and 20% testing data. When running the machine learning models, cross-validation tests this split over 10 folds of the data.
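The split and cross-validation setup can be sketched as follows, on synthetic data standing in for the prepared features:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Hold out 20% for testing, then 10-fold cross-validate on the training 80%.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000),
                         X_train, y_train, cv=10)
```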

Random Forest Classifier

This classifier is a collection of decision trees. It tends to outperform a single decision tree because it counteracts decision trees’ tendency to overfit their training set. The algorithm constructs a multitude of decision trees and aggregates their predictions into a single classification. It is typically one of the more accurate models for classifying a binary outcome.
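A sketch of the tuning setup on synthetic data; the grid values here are illustrative, not the report's exact ranges:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=12, random_state=0)

# Each candidate forest is scored by cross-validation; the best
# hyperparameter combination is then refit on the full training data.
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    {"n_estimators": [50, 100], "max_depth": [4, None]},
                    cv=5)
grid.fit(X, y)
best_forest = grid.best_estimator_
```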

Grid Search Score:

-0.09257322175732217

Confusion Matrix:

array([[1129,   81],
       [  73,  629]], dtype=int64)

Type I and Type II Error:

0.14672364672364668 is the Type I error, where we incorrectly classified true negatives as positives.
0.06115702479338847 is the Type II error, where we incorrectly classified true positives as negatives.

Bar Chart of Predicted vs Actual Outcomes:

Random Forest Insights

As seen in the chart above of actual vs. predicted values, the Random Forest Classifier was successful at predicting whether homes are vulnerable to experiencing poverty. In particular, the confusion matrix confirms the small share of false negatives and false positives that Random Forest produces. With a 6.1% error of incorrectly classifying true positives as negatives and a 14.7% error of incorrectly classifying true negatives as positives, we can conclude that a Random Forest model predicts vulnerability to poverty relatively accurately.

KNN Classifier

The KNN Classifier is based on feature similarity; we used it to observe how closely observations resemble our training set, and it requires minimal training or modeling. KNN is non-parametric, meaning it makes no assumption about the data distribution, so the model structure is determined entirely by the data. We chose this classifier because it is good at predicting discrete class labels from an observation's k nearest neighbors. The range for k was selected after testing several more expansive ranges and comparing processing speed to accuracy.
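The k search can be sketched like this on synthetic data; the range shown is illustrative of the narrowing process described above, not the report's exact range:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Try a modest range of odd k values; wider ranges cost processing time
# for little accuracy gain.
grid = GridSearchCV(KNeighborsClassifier(),
                    {"n_neighbors": list(range(1, 16, 2))}, cv=5)
grid.fit(X, y)
best_k = grid.best_params_["n_neighbors"]
```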

Grid Search Score:

-0.2944560669456067

Confusion Matrix:

array([[993, 217],
       [346, 356]], dtype=int64)

Type I and Type II Error:

0.49287749287749283 is the Type I error, where we incorrectly classified true negatives as positives.
0.17933884297520664 is the Type II error, where we incorrectly classified true positives as negatives.

Bar Chart of Predicted vs Actual Outcomes:

KNN Insights

Compared to Random Forest, KNN did not perform as accurately. As the chart above conveys, the KNN model was less successful at predicting the number of not vulnerable households. Our confusion matrix revealed an error of 49.3% where we incorrectly classified true negatives as positives, and 17.9% where we incorrectly classified true positives as negatives. These percentages, in particular the 49.3%, are significant and lead us to conclude that KNN is not the most effective model for our data. A Type I error of 49.3% means classifying many not vulnerable households as vulnerable; the implication for the distribution of resources is that households might receive welfare who are not in as great a need as other households.

Decision Tree Classifier

This algorithm solves problems using a tree representation. It predicts values of the target variable by learning decision rules from prior training data, which is useful when the outcome is binary. The range for minimum sample leaves was once again found by testing a variety of ranges and narrowing them down to one that still produced accurate results with relatively efficient processing speed.
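A sketch of tuning the minimum-samples-per-leaf setting on synthetic data; the grid values are illustrative, not the report's exact range:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Larger min_samples_leaf values smooth the tree and curb overfitting.
grid = GridSearchCV(DecisionTreeClassifier(random_state=0),
                    {"min_samples_leaf": [1, 5, 10, 20]}, cv=5)
grid.fit(X, y)
```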

Grid Search Score:

-0.08054393305439331

Confusion Matrix:

array([[1129,   81],
       [  73,  629]], dtype=int64)

Type I and Type II Error:

0.10398860398860399 is the Type I error, where we incorrectly classified true negatives as positives.
0.06694214876033056 is the Type II error, where we incorrectly classified true positives as negatives.

Bar Chart of Predicted vs Actual Outcomes:

Decision Tree Insights

The Decision Tree Classifier was our most accurate model: the predicted counts differed from the actual counts by fewer than 30 for both the vulnerable and not vulnerable classes, and the bars in the graph above appear almost equal. With a 6.7% error (only 0.6% higher than Random Forest) of incorrectly classifying true positives as negatives and a 10.4% error (the lowest of the three models for Type I) of incorrectly classifying true negatives as positives, the Decision Tree model was the most effective at accurately predicting poverty vulnerability.

Table of Type I and Type II Errors Across Three Models:

Error Type    Random Forest    K Nearest Neighbors    Decision Tree
Type I        0.146724         0.492877               0.103989
Type II       0.061157         0.179339               0.066942

Conclusions

We took a convoluted data frame from Kaggle, wrangled the data, analyzed each column, ran statistical and machine learning models, and made predictions with each model. The most elaborate part of this project was understanding the data frame itself. Running exploratory data analysis on major features gave us better insight into the data and into which aspects of the dataset are correlated or tied together. Given a data frame with Spanish feature names, missing data, inconsistent attributes, and replicated columns for household size, we were able to run recursive feature elimination to gather our top 30 features using logistic regression. The top 30 features included whether the individual owned a tablet or computer, whether they were younger than 12 years old, the predominant material of the outside walls, floor, and roof, the main source of energy, education level, and more. We decided to turn our target column into a binary outcome because the initial multi-level classification was skewed towards not vulnerable households, so our model would have been weighted towards predicting not vulnerable households well and vulnerable households poorly. After running a logistic regression model, a random forest classifier, a k-nearest neighbors classifier, and a decision tree classifier, we determined that the decision tree classifier was our most accurate model, with predicted counts within 30 of the actual counts for both the vulnerable and not vulnerable classes. Given a larger scope for this project, we could have pursued a more feature-engineering-heavy approach, but given the sheer number of features, that would not have been the best use of our resources.

Running these models and analyzing the data was valuable for us, as it gave us hands-on experience with vital data science methods. Likewise, it is important for governments to assist their citizens in any way they can so that citizens can live in decent conditions. In a country like Costa Rica, which had a poverty rate of 22% in 2015, being able to predict whether a household may be in need means that household can receive help in the form of welfare assistance.