{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Title :\n",
"Exercise: Dealing with Missingness\n",
"\n",
"## Description :\n",
"The goal of the exercise is to get comfortable with different types of missingness and ways to try and handle them with a few basic imputations methods using numpy, pandas, and sklearn. The examples will show how the combination of different types of missingness and imputation methods can affect inference.\n",
"\n",
"## Data Description:\n",
"\n",
"## Instructions:\n",
"We are using synthetic data to illustrate the issues with missing data. We will\n",
"- Create a synthetic dataset from two predictors\n",
"- Create missingness in 3 different ways\n",
"- Handle it 4 different ways (dropping rows, mean imputation, OLS imputation, and k-NN imputation)\n",
"\n",
"## Hints: \n",
"\n",
"pandas.dropna\n",
"Drop rows with missingness\n",
"\n",
"pandas.fillna\n",
"Fill in missingness either with a single values or a with a Series\n",
"\n",
"sklearn.impute.SimpleImputer\n",
"Imputation transformer for completing missing values.\n",
"\n",
"sklearn.LinearRegression\n",
"Generates a Linear Regression Model\n",
"\n",
"sklearn.impute.KNNImputer\n",
"Fill in missingness with a KNN model\n",
"\n",
"**Note:** This exercise is auto-graded and you can try multiple attempts. "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"import matplotlib.pyplot as plt\n",
"import pandas as pd\n",
"import numpy as np\n",
"\n",
"from sklearn.linear_model import LinearRegression \n",
"from sklearn.impute import SimpleImputer, KNNImputer"
]
},
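{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before we build anything, here is a minimal sketch of the two pandas tools listed in the Hints that we will lean on most: `dropna` to discard rows containing NaNs and `fillna` to replace NaNs with a single value or a Series. The tiny DataFrame below is made up purely for illustration and is unrelated to the exercise data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy illustration only: these values are made up and unrelated to the exercise data\n",
"toy = pd.DataFrame({'a': [1.0, np.nan, 3.0, 4.0], 'b': [10.0, 20.0, 30.0, 40.0]})\n",
"# dropna removes any row that contains a NaN\n",
"print(toy.dropna())\n",
"# fillna replaces NaNs, e.g. with the mean of the observed values in that column\n",
"print(toy['a'].fillna(toy['a'].mean()))"
]
},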
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Dealing with Missingness"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Missing Data\n",
"We'll create data of the form:\n",
"$$ y = 3x_1 - 2x_2 + \\varepsilon,\\hspace{0.1in} \\varepsilon \\sim N(0,1)$$\n",
"\n",
"We will then be inserting missingness into `x1` in various ways, and analyzing the results of different methods for handling those missing values."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Number of data points to generate\n",
"n = 500\n",
"# Set random seed for numpy to ensure reproducible results\n",
"np.random.seed(109)\n",
"# Generate our predictors...\n",
"x1 = np.random.normal(0, 1, size=n)\n",
"x2 = 0.5*x1 + np.random.normal(0, np.sqrt(0.75), size=n)\n",
"X = pd.DataFrame(data=np.transpose([x1,x2]),columns=[\"x1\",\"x2\"])\n",
"# Generate our response...\n",
"y = 3*x1 - 2*x2 + np.random.normal(0, 1, size=n)\n",
"y = pd.Series(y)\n",
"# And put them all in a nice DataFrame\n",
"df = pd.DataFrame(data=np.transpose([x1, x2, y]), columns=[\"x1\", \"x2\", \"y\"]) "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"fig, axs = plt.subplots(1, 3, figsize = (16,5))\n",
"\n",
"plot_pairs = [('x1', 'y'), ('x2', 'y'), ('x1', 'x2')]\n",
"for ax, (x_var, y_var) in zip(axs, plot_pairs):\n",
" df.plot.scatter(x_var, y_var, ax=ax, title=f'{y_var} vs. {x_var}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Poke holes in $X_1$ in 3 different ways: \n",
"\n",
"- **Missing Completely at Random** (MCAR): missingness is not predictable.\n",
"- **Missing at Random** (MAR): missingness depends on other observed data, and thus can be recovered in some way\n",
"- **Missingness not at Random** (MNAR): missingness depends on unobserved data and thus cannot be recovered\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here we generate indices of $X_1$ to be dropped due to 3 types of missingness using $n$ single bernoulli trials.\\\n",
"The only difference between the 3 sets of indices is the probabilities of success for each trial (i.e., the probability that a given observation will be missing)."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"missing_A = np.random.binomial(1, 0.05 + 0.85*(y > (y.mean()+y.std())), n).astype(bool)\n",
"missing_B = np.random.binomial(1, 0.2, n).astype(bool)\n",
"missing_C = np.random.binomial(1, 0.05 + 0.85*(x2 > (x2.mean()+x2.std())), n).astype(bool)"
]
},
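{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the Bernoulli-trial mechanism concrete, here is a small sanity-check sketch (the probability vector `p_toy` is made up): passing a vector of per-observation probabilities to `np.random.binomial(1, p, n)` runs one independent trial per observation. We also print the fraction of `x1` values each mask above will set to missing."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy probability vector: one Bernoulli trial per entry, with that entry's success probability\n",
"p_toy = np.array([0.0, 0.5, 1.0])\n",
"print(np.random.binomial(1, p_toy, size=3))\n",
"# Fraction of x1 observations each mask above will set to missing\n",
"print(missing_A.mean(), missing_B.mean(), missing_C.mean())"
]
},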
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# Helper function to replace x_1 with nan at specified indices\n",
"def create_missing(missing_indices, df=df):\n",
" df_new = df.copy()\n",
" df_new.loc[missing_indices, 'x1'] = np.nan\n",
" return df_new"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Fill in the blank to match the index sets above (missing_A, B, or C) with the type of missingness they represent."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_missing_type) ###\n",
"\n",
"# Missing completely at random (MCAR)\n",
"df_mcar = create_missing(missing_indices=___)\n",
"\n",
"# Missing at random (MAR)\n",
"df_mar = create_missing(missing_indices=___)\n",
"\n",
"# Missing not at random (MNAR)\n",
"df_mnar = create_missing(missing_indices=___)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"First, let's fit a model with no missing data."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# no missingness: on the full dataset\n",
"ols = LinearRegression().fit(df[['x1', 'x2']], df['y'])\n",
"print('No missing data:', ols.intercept_, ols.coef_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"⏸ **Q1.1** Why aren't the estimates exactly $\\hat{\\beta_0} = 0$, $\\hat{\\beta}_1 = 3$ and $\\hat{\\beta}_2 = -2$ ? Isn't that our true data generating function?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*your answer here*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"---\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, let's naively fit a linear regression on the dataset with MCAR missingness and see what happens..."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"# Fit inside a try/except block just in case...\n",
"try:\n",
" ouch = LinearRegression().fit(df_mcar[['x1','x2']],df_mcar['y'])\n",
"except Exception as e:\n",
" print(e)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"⏸ **Q1.2** How did sklearn handle the missingness? (feel free to add some code above to experiment if you are still unsure)\n",
"\n",
"**A**: It ignored the _columns_ with missing values\\\n",
"**B**: It ignored the _rows_ with missing values\\\n",
"**C**: It didn't handle the missingness and the fit failed"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_Q1_2) ###\n",
"# Submit an answer choice as a string below \n",
"# (Eg. if you choose option A, put 'A')\n",
"answer1_2 = '___'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"---"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"⏸ **Q1.3** What would be a first naive approach to handling missingness?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*your answer here*\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"---\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### What happens if we ignore problematic rows?"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# MCAR: drop the rows that have any missingness\n",
"ols_mcar = LinearRegression().fit(df_mcar.dropna()[['x1', 'x2']], df_mcar.dropna()['y'])\n",
"print('MCAR (drop):', ols_mcar.intercept_, ols_mcar.coef_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the same strategy for the other types of missingness."
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_mar) ###\n",
"# MAR: drop the rows that have any missingness\n",
"ols_mar = LinearRegression().fit(df_mar.dropna()[[___]], df_mar.dropna()[__])\n",
"print('MAR (drop):', ols_mar.intercept_,ols_mar.coef_)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"# MNAR: drop the rows that have any missingness\n",
"ols_mnar = LinearRegression().fit(___, ___)\n",
"print('MNAR (drop):', ols_mnar.intercept_, ols_mnar.coef_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"⏸️ **Q2** Compare the various estimates above and how well they were able to recover the value of $\\beta_1$. For which form of missingness did dropping result in the _worst_ estimate?\n",
"\n",
"**A**: MCAR\\\n",
"**B**: MAR\\\n",
"**C**: MNAR"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_Q2) ###\n",
"# Submit an answer choice as a string below \n",
"# (Eg. if you choose option A, put 'A')\n",
"answer2 = '___'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Let's Start Imputing"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"# Make backup copies for later since we'll have lots of imputation approaches.\n",
"X_mcar_raw = df_mcar.drop('y', axis=1).copy()\n",
"X_mar_raw = df_mar.drop('y', axis=1).copy()\n",
"X_mnar_raw = df_mnar.drop('y', axis=1).copy()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Mean Imputation:\n",
"\n",
"Perform mean imputation using the `fillna`, `dropna`, and `mean` functions."
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"# Here's an example of one way to do the mean imputation with the above methods\n",
"X_mcar = X_mcar_raw.copy()\n",
"X_mcar['x1'] = X_mcar['x1'].fillna(X_mcar['x1'].dropna().mean())\n",
"\n",
"ols_mcar_mean = LinearRegression().fit(X_mcar, y)\n",
"print('MCAR (mean):', ols_mcar_mean.intercept_, ols_mcar_mean.coef_)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_mar_mean) ###\n",
"X_mar = X_mar_raw.copy()\n",
"# You can add as many lines as you see fit, so long as the final model is correct\n",
"ols_mar_mean = ___\n",
"print('MAR (mean):',ols_mar_mean.intercept_, ols_mar_mean.coef_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can also use SKLearn's `SimpleImputer` object. By default it will replace NaN values with the column's mean."
]
},
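{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, here is a minimal sketch of the `SimpleImputer` workflow on a small made-up array (not the exercise data): instantiate the imputer, then call `fit_transform`, which returns an array with each NaN replaced by its column's mean."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Toy illustration of SimpleImputer on made-up values (not the exercise data)\n",
"toy = np.array([[1.0, 2.0], [np.nan, 4.0], [5.0, 6.0]])\n",
"# The default strategy='mean' replaces each NaN with its column's mean\n",
"print(SimpleImputer().fit_transform(toy))"
]
},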
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_mnar_mean) ###\n",
"X_mnar = X_mnar_raw.copy()\n",
"# instantiate imputer object\n",
"imputer = ___\n",
"# fit & transform X_mnar with the imputer\n",
"X_mnar = ___\n",
"# fit OLS model on imputed data\n",
"ols_mnar_mean = ___\n",
"print('MNAR (mean):', ols_mnar_mean.intercept_, ols_mnar_mean.coef_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"⏸️ **Q3** In our examples, how do these estimates compare when performing mean imputation vs. just dropping rows? \n",
"\n",
"**A**: They are better\\\n",
"**B**: They are worse\\\n",
"**C**: They are the same"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_Q3) ###\n",
"# Submit an answer choice as a string below \n",
"# (Eg. if you choose option A, put 'A')\n",
"answer3 = '___'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"---\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Linear Regression Imputation \n",
"\n",
"If you're not careful, it can be difficult to keep things straight. There are _two_ models here: \n",
"\n",
"1. an _imputation_ model concerning just the predictors (to predict $X_1$ from $X_2$) and \n",
"2. the _substantive_ model we really care about used to predict $Y$ from the 'improved' $X_1$ (now with imputed values) and $X_2$."
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"X_mcar = X_mcar_raw.copy()\n",
"\n",
"# Fit the imputation model\n",
"ols_imputer_mcar = LinearRegression().fit(X_mcar.dropna()[['x2']], X_mcar.dropna()['x1'])\n",
"\n",
"# Perform some imputations\n",
"yhat_impute = pd.Series(ols_imputer_mcar.predict(X_mcar[['x2']]))\n",
"X_mcar['x1'] = X_mcar['x1'].fillna(yhat_impute)\n",
"\n",
"# Fit the model we care about\n",
"ols_mcar_ols = LinearRegression().fit(X_mcar, y)\n",
"print('MCAR (OLS):', ols_mcar_ols.intercept_,ols_mcar_ols.coef_)"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_mar_ols) ###\n",
"X_mar = X_mar_raw.copy()\n",
"# Fit imputation model\n",
"ols_imputer_mar = LinearRegression().fit(___, ___)\n",
"# Get values to be imputed\n",
"yhat_impute = pd.Series(ols_imputer_mar.predict(___))\n",
"# Fill missing values with imputer's predictions\n",
"X_mar['x1'] = X_mar['x1'].fillna(___)\n",
"# Fit our final, 'substantive' model\n",
"ols_mar_ols = LinearRegression().fit(___, ___)\n",
"\n",
"print('MAR (OLS):', ols_mar_ols.intercept_,ols_mar_ols.coef_)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_mnar_ols) ###\n",
"X_mnar = X_mnar_raw.copy()\n",
"# your code here\n",
"# You can add as many lines as you see fit, so long as the final model is correct\n",
"ols_mnar_ols = ___\n",
"print('MNAR (OLS):', ols_mnar_ols.intercept_, ols_mnar_ols.coef_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"⏸️ **Q4**: Compare the estimates when performing OLS model-based imputation vs. mean imputation? Which type of missingness saw the biggest improvement?\n",
"\n",
"**A**: MCAR\\\n",
"**B**: MAR\\\n",
"**C**: MNAR"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_Q4) ###\n",
"# Submit an answer choice as a string below \n",
"# (Eg. if you choose option A, put 'A')\n",
"answer4 = '___'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"---\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### $k$-NN Imputation ($k$=3)\n",
"As an alternative to linear regression, we can also use $k$-NN as our imputation model.\\\n",
"SKLearn's `KNNImputer` object makes this very easy."
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [],
"source": [
"X_mcar = X_mcar_raw.copy()\n",
"\n",
"X_mcar = KNNImputer(n_neighbors=3).fit_transform(X_mcar)\n",
"\n",
"ols_mcar_knn = LinearRegression().fit(X_mcar,y)\n",
"\n",
"print('MCAR (KNN):', ols_mcar_knn.intercept_,ols_mcar_knn.coef_)"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_mar_knn) ###\n",
"X_mar = X_mar_raw.copy()\n",
"# Add imputed values to X_mar\n",
"X_mar = KNNImputer(___).fit_transform(___)\n",
"# Fit substantive model on imputed data\n",
"ols_mar_knn = LinearRegression().fit(__,__)\n",
"\n",
"print('MAR (KNN):', ols_mar_knn.intercept_,ols_mar_knn.coef_)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_mnar_knn) ###\n",
"X_mnar = X_mnar_raw.copy()\n",
"# your code here\n",
"# You can add as many lines as you see fit, so long as the final model is correct\n",
"ols_mnar_knn = ___\n",
"\n",
"print('MNAR (KNN):', ols_mnar_knn.intercept_,ols_mnar_knn.coef_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"⏸️ **Q5**: True or False - While some methods may work better than others depending on the context, any imputation method is better than none (that is, as opposed to simply dropping)."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_Q5) ###\n",
"# Submit an answer choice as boolean value\n",
"answer5 = ___"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"⏸️ **Q6**: Suppose your friends makes the following suggestion:\n",
"\n",
"\"The MNAR missing data can be predicted in part from the response $y$. Why not impute these missing $x_1$ values with an imputation model using $y$ as a predictor? It's true we can't impute like this with new data for which we don't have the $y$ values. But it will improve our training data, our model's fit, and so too its performance on new data!\"\n",
"\n",
"What is a _big problem_ with this idea?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*your answer here*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"---\n",
"\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}