Key Word(s): Missing Data, Imputation


Title

Exercise 1 - Dealing with Missingness

Description

The goal of this exercise is to get comfortable with missingness: how to handle it and how to do some basic imputations in sklearn. Friday's class will go further into handling missingness.

Instructions:

We are using synthetic data to illustrate the issues with missing data. We will

  • Create a synthetic dataset from two predictors
  • Create missingness in 3 different ways
  • Handle it in 4 different ways (dropping rows, mean imputation, OLS imputation, and 3-NN imputation)

Hints:

pandas.dropna : Drop rows with missingness

pandas.fillna : Fill in missingness

sklearn.LinearRegression : Fit a linear regression model

sklearn.KNNImputer : Fill in missingness using nearest neighbors

Note: This exercise is auto-graded and you can try multiple attempts.
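
For orientation, here is a minimal, self-contained sketch (on a tiny made-up DataFrame, not the exercise data) of how the four hinted tools are typically called:

In [ ]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.impute import KNNImputer

# toy frame with one missing value in column "a" (illustrative only)
toy = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0], "b": [2.0, 4.0, 6.0, 8.0]})

toy.dropna()                               # drop every row that contains a NaN
toy["a"].fillna(toy["a"].dropna().mean())  # fill NaNs in "a" with the column mean

# an OLS imputation model: predict "a" from "b" using the complete rows
ols_imputer = LinearRegression().fit(toy.dropna()[["b"]], toy.dropna()["a"])

# k-NN imputation: fill each NaN with the average of its nearest neighbors
KNNImputer(n_neighbors=2).fit_transform(toy)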

In [ ]:
%matplotlib inline
from sklearn import datasets
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import scipy.stats
In [ ]:
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier 

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import SimpleImputer, KNNImputer, IterativeImputer, MissingIndicator

Dealing with Missingness

Missing Data

Create data in which the true theoretical regression line is: $$ Y = 3X_1 - 2X_2 + \varepsilon,\hspace{0.1in} \varepsilon \sim N(0,1)$$

Note: $\rho_{X_1,X_2} = 0.5$

We will be inserting missingness into x1 in various ways, and analyzing the results.

In [ ]:
n = 500
np.random.seed(109)

x1 = np.random.normal(0,1,size=n)
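# x2 is constructed so that Var(x2) = 1 and Corr(x1, x2) = 0.5  (0.25 + 0.75 = 1)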
x2 = 0.5*x1+np.random.normal(0,np.sqrt(0.75),size=n)
X = pd.DataFrame(data=np.transpose([x1,x2]),columns=["x1","x2"])

y = 3*x1 - 2*x2 + np.random.normal(0,1,size=n)
y = pd.Series(y)


df = pd.DataFrame(data=np.transpose([x1,x2,y]),columns=["x1","x2","y"])

# Checking the correlation
scipy.stats.pearsonr(x1,x2) 
In [ ]:
fig,(ax1,ax2,ax3) =  plt.subplots(1, 3, figsize = (18,5))
ax1.scatter(x1,y)
ax2.scatter(x2,y)
ax3.scatter(x2,x1,color="orange")
ax1.set_title("y vs. x1")
ax2.set_title("y vs. x2")
ax3.set_title("x1 vs. x2")
plt.show()

Poke holes in $X_1$ in 3 different ways (in each case, roughly 20% of the observations are removed):

  • MCAR: just take out a random sample of 20% of observations in $X_1$
  • MAR: missingness in $X_1$ depends on $X_2$, and thus can be recovered in some way
  • MNAR: missingness in $X_1$ depends on the value of $X_1$ itself (here induced through $y$), and thus cannot easily be recovered
In [ ]:
x1_mcar = x1.copy()
x1_mar = x1.copy()
x1_mnar = x1.copy()

#missing completely at random
miss_mcar = np.random.choice(n,int(0.2*n),replace=False)
x1_mcar[miss_mcar] = np.nan

# missing at random (one way to do it): P(missing) jumps from 5% to 90% when x2 is more than one s.d. above its mean
miss_mar = np.random.binomial(1,0.05+0.85*(x2>(x2.mean()+x2.std())),n)
x1_mar[miss_mar==1] = np.nan

# missing not at random (one way to do it): P(missing) jumps from 5% to 90% when y is more than one s.d. above its mean, so it depends on x1 itself (through y)
miss_mnar = np.random.binomial(1,0.05+0.85*(y>(y.mean()+y.std())),n)
x1_mnar[miss_mnar==1] = np.nan
In [ ]:
# Create the 3 datasets with missingness
df_mcar = df.copy()
df_mar = df.copy()
df_mnar = df.copy()

# plug in the appropriate x1 with missingness
df_mcar['x1'] = x1_mcar
df_mar['x1'] = x1_mar
df_mnar['x1'] = x1_mnar
In [ ]:
# no missingness: on the full dataset
ols = LinearRegression().fit(df[['x1','x2']],df['y'])
print(ols.intercept_,ols.coef_)
In [ ]:
# Fit the linear regression blindly on the dataset with MCAR missingness and see what happens
LinearRegression().fit(df_mcar[['x1','x2']],df_mcar['y'])

Q1 Why aren't the estimates exactly $\hat{\beta}_1 = 3$ and $\hat{\beta}_2 = -2$ ? How does sklearn handle missingness? What would be a first naive approach to handling missingness?

your answer here

What happens when you just drop rows?

In [ ]:
# no missingness, for comparison's sake
ols = LinearRegression().fit(df[['x1','x2']],df['y'])
print(ols.intercept_,ols.coef_)
In [ ]:
# MCAR: drop the rows that have any missingness
ols_mcar = LinearRegression().fit(df_mcar.dropna()[['x1','x2']],df_mcar.dropna()['y'])
print(ols_mcar.intercept_,ols_mcar.coef_)
In [ ]:
### edTest(test_mar) ###

# MAR: drop the rows that have any missingness
ols_mar = LinearRegression().fit(___,___)
print(ols_mar.intercept_,ols_mar.coef_)
In [ ]:
# MNAR: drop the rows that have any missingness

ols_mnar = ___
print(___,___)

Q2 How do the estimates compare when just dropping rows? Are they able to recover the true value of $\beta_1$? For which form of missingness is the result the worst?

your answer here

Let's Start Imputing

In [ ]:
# Make back-up copies for later since we'll try several imputation approaches.
df_mcar_raw = df_mcar.copy()
df_mar_raw = df_mar.copy()
df_mnar_raw = df_mnar.copy()

Mean Imputation:

Perform mean imputation using the fillna, dropna, and mean functions.

In [ ]:
df_mcar = df_mcar_raw.copy()
df_mcar['x1'] = df_mcar['x1'].fillna(df_mcar['x1'].dropna().mean())

ols_mcar_mean = LinearRegression().fit(df_mcar[['x1','x2']],df_mcar['y'])
print(ols_mcar_mean.intercept_,ols_mcar_mean.coef_)
In [ ]:
### edTest(test_mar_mean) ###

df_mar = df_mar_raw.copy()

df_mar['x1'] = df_mar['x1'].fillna(___)

ols_mar_mean = LinearRegression().fit(___,___)
print(ols_mar_mean.intercept_,ols_mar_mean.coef_)
In [ ]:
df_mnar = df_mnar_raw.copy()

df_mnar['x1'] = ___

ols_mnar_mean = ___
print(___,___)

Q3 How do the estimates compare when performing mean imputation vs. just dropping rows? Have things gotten better or worse (for what types of missingness)?

your answer here

Linear Regression Imputation

There are two models to keep straight here:

  1. an imputation model, an OLS regression involving just the predictors (used to predict $X_1$ from $X_2$), and
  2. the model we actually care about, which predicts $Y$ from the 'improved' $X_1$ (now with imputed values) and $X_2$.
In [ ]:
df_mcar = df_mcar_raw.copy()

# fit the imputation model
ols_imputer_mcar = LinearRegression().fit(df_mcar.dropna()[['x2']],df_mcar.dropna()['x1'])

# impute: predict x1 from x2 everywhere, then fill in only the missing entries (fillna aligns on the Series index)
x1hat_impute = pd.Series(ols_imputer_mcar.predict(df_mcar[['x2']]))
df_mcar['x1'] = df_mcar['x1'].fillna(x1hat_impute)

# fit the model we care about
ols_mcar_ols = LinearRegression().fit(df_mcar[['x1','x2']],df_mcar['y'])
print(ols_mcar_ols.intercept_,ols_mcar_ols.coef_)
In [ ]:
df_mar = df_mar_raw.copy()
ols_imputer_mar = LinearRegression().fit(___,___)

x1hat_impute = pd.Series(ols_imputer_mar.predict(___))
df_mar['x1'] = df_mar['x1'].fillna(___)

ols_mar_ols = LinearRegression().fit(___,___)
print(ols_mar_ols.intercept_,ols_mar_ols.coef_)
In [ ]:
### edTest(test_mnar_ols) ###

df_mnar = df_mnar_raw.copy()

ols_imputer_mnar = ___

x1hat_impute = ___
df_mnar['x1'] = ___

ols_mnar_ols = ___
print(___,___)

Q4: How do the estimates compare when performing model-based imputation vs. mean imputation? Have things gotten better or worse (for what types of missingness)?

your answer here

$k$-NN Imputation ($k$=3)

In [ ]:
df_mcar = df_mcar_raw.copy()
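# KNNImputer replaces each missing x1 with the average x1 of the 3 nearest rows
# (nearest in the non-missing features); fit_transform returns a NumPy array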
X_mcar = KNNImputer(n_neighbors=3).fit_transform(df_mcar[['x1','x2']])

ols_mcar_knn = LinearRegression().fit(X_mcar,df_mcar['y'])
print(ols_mcar_knn.intercept_,ols_mcar_knn.coef_)
In [ ]:
df_mar = df_mar_raw.copy()
X_mar = KNNImputer(n_neighbors=3).fit_transform(___)

ols_mar_knn = LinearRegression().fit(___,___)
print(ols_mar_knn.intercept_,ols_mar_knn.coef_)
In [ ]:
df_mnar = df_mnar_raw.copy()
X_mnar = ___

ols_mnar_knn = ___
print(ols_mnar_knn.intercept_,ols_mnar_knn.coef_)

Q5: Which of the 4 methods for handling missingness worked best? Which worked the worst? Were the estimates improved or worsened in each of the 3 types of missingness?

your answer here
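
If it helps to see everything side by side, one optional way to approach Q5 (a sketch, assuming the models above have been fit in your session) is to tabulate the fitted intercepts and coefficients:

In [ ]:
# Optional: collect the fitted estimates in one table for comparison.
# Extend the dictionary with the MAR/MNAR fits once you have filled in those cells.
fits = {"full data": ols,
        "MCAR drop": ols_mcar,
        "MCAR mean": ols_mcar_mean,
        "MCAR OLS":  ols_mcar_ols,
        "MCAR 3-NN": ols_mcar_knn}

pd.DataFrame({name: [m.intercept_, *m.coef_] for name, m in fits.items()},
             index=["intercept", "beta1", "beta2"]).round(3)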

Q6: This exercise focused on 'inference' (we considered only the point estimates of the coefficients; the effect on the uncertainty of those estimates would be even worse). What are the ramifications for prediction? Is the situation more or less concerning?

your answer here

In [ ]: