The purpose of this project is to examine which characteristics of a household can predict its need for social welfare assistance. This matters because social welfare programs are often controversial, and distributing resources where they are needed will help create a more equitable world. Current methods for identifying where to provide aid, and for judging whether certain segments of the population receive enough of it, rely on these factors and can lead to inaccurate assessments of social need. This research project focuses more on the observable traits of a household and the demographics of each household member to predict a family's level of need.
Using an open source dataset from Kaggle, we let our prior research into social welfare determinants drive our analysis. Because we turned our outcome into a binary variable, we refer to the two outcomes as vulnerable and not vulnerable. Our "vulnerable" label also covers households classified as being in moderate or extreme poverty.
We are a group of Informatics students at the University of Washington who like to explore datasets and run models to predict outcomes. This project is for our Introduction to Data Science course.
The dataset required substantial cleaning before use. We started by dropping the columns of squared values, because they were highly correlated with variables already in the dataset and therefore unnecessary. Next, we discovered that the edjefe/edjefa columns, which record years of education for the head of the household, contained a mix of integers and strings. We converted all “no” values to 0 and all “yes” values to 1. The dataset also originally sorted people into 4 categories of vulnerability in terms of poverty; we converted these into a binary outcome, with 1 being vulnerable and 0 being not vulnerable.
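The cleaning steps above can be sketched roughly as follows with pandas. This is a minimal illustration on toy rows, not our exact code: the `SQB` prefix for the squared columns, the `Target` column name, and the assumption that code 4 is the only non-vulnerable category are all illustrative assumptions; only `edjefe` and `edjefa` are names taken from the text.

```python
import pandas as pd

# Toy rows standing in for the real data. Apart from edjefe/edjefa,
# the column names here are assumptions made for illustration.
df = pd.DataFrame({
    "age": [30, 45, 12],
    "SQBage": [900, 2025, 144],      # hypothetical squared column
    "edjefe": ["no", "11", "yes"],   # mix of numbers and yes/no strings
    "edjefa": ["6", "no", "no"],
    "Target": [4, 2, 3],             # assumed coding: 4 = not vulnerable
})

def clean(frame):
    out = frame.copy()

    # Drop squared columns (assumed to share the "SQB" prefix).
    squared = [c for c in out.columns if c.startswith("SQB")]
    out = out.drop(columns=squared)

    # Map "no" -> 0 and "yes" -> 1, then make the columns integer.
    for col in ["edjefe", "edjefa"]:
        out[col] = (
            out[col].astype(str)
            .replace({"no": "0", "yes": "1"})
            .astype(int)
        )

    # Collapse the four categories into a binary outcome:
    # 1 = vulnerable (any category below 4), 0 = not vulnerable.
    out["vulnerable"] = (out["Target"] < 4).astype(int)
    return out.drop(columns="Target")

cleaned = clean(df)
```

Casting to string before replacing keeps the conversion working whether the raw column was read as text or as a mix of types.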
After the initial preparation of the data, we had to deal with missing values. The “meaneduc” column had only 5 missing values, so we imputed 0, knowing this would hardly affect our data. We discovered that the “v18q1” column, the number of tablets a household owned, was directly related to another column indicating whether or not that household had any tablets. Values in “v18q1” were missing only when the household had no tablets, so we were again able to impute 0. The “rez_esc” column, which conveyed the number of years a person was behind in school, also had many missing values. When we plotted the age distribution, however, it became clear that the missing values belonged to people who were either too young to have started school or had already completed it, so we could confidently fill in 0 as the number of years behind in school. Finally, “v2a1”, which recorded people’s monthly rent payments, had many missing values. Luckily, the dataset also provided the type of ownership people had over their homes. We discovered that a majority of people actually owned their homes and therefore had no value in the rent column. Again, we were able to impute 0 for these rows without skewing the data too much.
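As a rough sketch of the imputation described above, the snippet below counts nulls per column, checks that missing tablet counts occur only for households without a tablet, and then fills the four columns with 0. The toy values are made up for illustration, and `v18q` is an assumed name for the has-tablet flag; the other column names come from the text.

```python
import numpy as np
import pandas as pd

# Toy rows with gaps in the four columns discussed above.
df = pd.DataFrame({
    "meaneduc": [8.5, np.nan, 11.0, 6.0],
    "v18q": [1, 0, 0, 1],                  # assumed name: 1 = has a tablet
    "v18q1": [2.0, np.nan, np.nan, 1.0],   # number of tablets owned
    "rez_esc": [np.nan, 1.0, np.nan, 0.0], # years behind in school
    "v2a1": [np.nan, 120000.0, np.nan, 90000.0],  # monthly rent
})

# Count of null values per column.
null_counts = df.isnull().sum()

# Sanity check before imputing: tablet counts should be missing
# only for households flagged as having no tablet.
tablet_gap_ok = bool((df.loc[df["v18q1"].isnull(), "v18q"] == 0).all())

# Impute 0 in the columns where a missing value means "none".
zero_fill = ["meaneduc", "v18q1", "rez_esc", "v2a1"]
imputed = df.copy()
imputed[zero_fill] = imputed[zero_fill].fillna(0)
```

Running a check like `tablet_gap_ok` before each fill is one way to confirm that 0 is a sensible stand-in rather than assuming it.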
Table of count of null values per column: