Exploratory Data Analysis (EDA)
The Ames housing dataset
First, let’s load the required packages. We will use the tidyverse
for general data processing and visualisation.
library(tidyverse)
library(naniar) # for visualising missing data
library(GGally) # for EDA
library(ggcorrplot)
library(AmesHousing)
library(plotly) # dynamic visualisations
library(bestNormalize)
library(qs)
library(janitor)
theme_set(theme_minimal())
We will use the Ames housing data to explore different ML approaches to regression. This dataset was “designed” by Dean De Cock as an alternative to the “classic” Boston housing dataset, and has been extensively used in ML teaching. It is also available from kaggle as part of its advanced regression practice competition.
The independent variables presented in the data include:
- 20 continuous variables relate to various area dimensions for each observation;
- 14 discrete variables, which typically quantify the number of items occurring within the house;
- 23 ordinal, 23 nominal categorical variables, with 2 (STREET: gravel or paved) - 28 (NEIGHBORHOOD) classes;
We will explore both the “uncleaned” data available from kaggle/UCI, and the processed data available in the AmesHousing
package in R, for which documentation is available here. It can be useful for understanding what each of the independent variables mean.
<- AmesHousing::make_ames()
ameshousing_temp
# Use this function to make the names easier to type
<- ameshousing_temp %>%
ameshousing ::clean_names()
janitor
# Read in the uncleaned data.
<- AmesHousing::ames_raw ameshousing_uncleaned
Exploratory data analysis
Exploratory data analysis involves looking at:
- The distribution of variables in your dataset;
- Whether any data is missing;
- Data skewness;
- Correlated variables.
Visualise missingness
When working with missing data, it can be helpful to look for “co-missingness”, i.e. multiple variables missing together. For example, when working with patient data, number of pregnancies, age at onset of menstruation and menopause may all be missing - which, when observed together, may indicate that these samples come from male patients for whom this data is irrelevant. “Gender” may or may not be a variable coded in the dataset.
A way of visualising missing data in the tidy context has been proposed @tierney2018expanding. See this web page for more options for your own data.
Let’s look at the missing variables in our housing data:
gg_miss_var(ameshousing_uncleaned)
We can see that the most missingness is observed in the Pool_QC
, Misc_Feature
, Alley
, Fence
and Fireplace_QC
variables. This is most likely due to many houses not having pools, alleys, fences, and fireplaces, and not having any features that the real estate agent considers to be notable enough to be added to the “miscellaneous” category.
An upset plot will give us more idea about the co-missingness of these variables:
gg_miss_upset(ameshousing_uncleaned, nsets = 10)
Next, let’s create two “helper” vectors with the names of the numeric and categorical variables from the ameshousing
dataset, which we can then use to batch subset our dataset prior to EDA/visualisation:
# pull out all of the numerical variables
<- ameshousing %>%
numVars select_if(is.numeric) %>%
names()
# use Negate(is.numeric) to pull out all of the categorical variables
<- ameshousing %>%
catVars select_if(Negate(is.numeric)) %>%
names()
Let’s then use the ggpairs()
function to generate a plot of the first 10 numeric variables (and sale price, which is 33) against each other. We can repeat this for variables 11-20 and 21-33.
ggpairs(data = ameshousing,
columns = numVars[c(1:10, 33)],
title = "Numeric variables 1 - 10")
# ggpairs(ameshousing, numVars[c(11:20, 33)], title = "Numeric variables 11 - 20")
# ggpairs(ameshousing, numVars[c(21:33)], title = "Numeric variables 21 - 33")
ggpairs(data = ameshousing,
columns = c(catVars[2:5], "sale_price"),
title = "Some categorical variables")
Next, we can generate a correlation plot between all of our numeric variables. By default, the cor()
method will calculate the Pearson correlation between the Sale_Price
and the other variables, and we can specify how we’d like to handle missing data when calculating this correlation.
In this case, we use pairwise.complete.obs
, which calculates the correlation between each pair of variables using all complete pairs of observations on those variables.
We then plot the correlation using the corrplot library, which has several options for how to visualise a correlation plot. See here for some examples of the visualisations it can produce.
# pairs.panels(ameshousing[ , names(ameshousing)[c(3, 16, 23, 27,37)]], scale=TRUE)
<- cor(ameshousing[,numVars],
ameshousingCor use = "pairwise.complete.obs")
<- cor_pmat(ameshousingCor)
ameshousingCor_pvalues ggcorrplot(ameshousingCor, type = "lower")
We can also make a dynamic visualisation using plotly.
#Bonus: interactive corrplot with zoom and mouseover
ggcorrplot(ameshousingCor, type = "lower") %>% ggplotly()
Let’s plot one of these relationships:
%>%
ameshousing ggplot(aes(x = gr_liv_area, y = sale_price/1000)) +
geom_point(alpha = 0.1) +
labs(y = "Sale Price/$1000",
x = "Living Area (sq.ft)",
title = "Ames Housing Data") +
geom_smooth(method= "lm") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
We can see that there are five houses with an area > 4000 square feet that seem to be outliers in the data. We should filter them out. Next, let’s generate a boxplot by Quality:
# Create a filtered dataframe
<-
ameshousing_filt %>%
ameshousing filter(gr_liv_area <= 4000)
# Make our ggplot object
<- ameshousing_filt %>%
p mutate(quality = as.factor(overall_qual)) %>%
ggplot(aes(x = quality,
y = sale_price / 1000,
fill = quality)) +
labs(y = "Sale Price in $k's",
x = "Overall Quality of House",
title = "Ames Housing Data") +
geom_boxplot() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Now make it a plotly
ggplotly(p)
EDA of outcome variable
You also need to do EDA on the outcome variable to:
- identify outliers
- explore whether there is any skew in its distribution
- identify a transformation to use when modelling the data (if appropriate)
This is because many models, including ordinary linear regression, assume that prediction errors (and hence the response) are normally distributed.
%>%
ameshousing_filt ggplot(aes(x = sale_price/1000)) +
geom_histogram(bins = 50) +
labs(x = "Sale Price in $k's",
y = "Number of Houses sold")
Let’s explore different ways of transforming the Sale Price.
#No transform
%>%
ameshousing_filt ggplot(aes( sample = sale_price)) +
stat_qq() + stat_qq_line(col = "blue")
#Sqrt transform
%>%
ameshousing_filt ggplot(aes( sample = sqrt(sale_price))) +
stat_qq() + stat_qq_line(col = "blue")
#natural log transform
%>%
ameshousing_filt ggplot(aes( sample = log(sale_price))) +
stat_qq() + stat_qq_line(col = "blue")
#log10 transform
%>%
ameshousing_filt ggplot(aes( sample = log10(sale_price))) +
stat_qq() + stat_qq_line(col = "blue")
$sale_price <- log10(ameshousing_filt$sale_price) ameshousing_filt
Feature transformation
The year in which the house was built and the year when it was remodelled are not really the most relevant parameters we look at when buying a house: instead, buyers usually care a lot more about the age of the house and the time since the last remodel. Let’s transform these features:
<-
ameshousing_filt_tr %>%
ameshousing_filt mutate(time_since_remodel = year_sold - year_remod_add,
house_age = year_sold - year_built) %>%
select(-year_remod_add, -year_built)
qsave(ameshousing_filt_tr, "../_models/ames_dataset_filt.qs")
Note Make sure to create a “models” folder in your project working directory! Before you can save your data as .Rds objects, you will actually need to create a folder for these files to go into. Do this by clicking on the “new folder” button in the files window in R studio. Rename your new folder to “models”.
Tierney, Nicholas J, and Dianne H Cook. 2018. “Expanding Tidy Data Principles to Facilitate Missing Data Exploration, Visualization and Assessment of Imputations.” arXiv Preprint arXiv:1809.02264.