{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Title :\n", "Exercise: Dealing with Missingness\n", "\n", "## Description :\n", "The goal of the exercise is to get comfortable with different types of missingness and ways to try and handle them with a few basic imputation methods using numpy, pandas, and sklearn. The examples will show how the combination of different types of missingness and imputation methods can affect inference.\n", "\n", "## Data Description:\n", "\n", "## Instructions:\n", "We are using synthetic data to illustrate the issues with missing data. We will\n", "- Create a synthetic dataset from two predictors\n", "- Create missingness in 3 different ways\n", "- Handle it 4 different ways (dropping rows, mean imputation, OLS imputation, and k-NN imputation)\n", "\n", "## Hints: \n", "\n", "pandas.dropna\n", "Drop rows with missingness\n", "\n", "pandas.fillna\n", "Fill in missing values either with a single value or with a Series\n", "\n", "sklearn.impute.SimpleImputer\n", "Imputation transformer for completing missing values.\n", "\n", "sklearn.linear_model.LinearRegression\n", "Generates a Linear Regression Model\n", "\n", "sklearn.impute.KNNImputer\n", "Fill in missing values with a k-NN model\n", "\n", "**Note:** This exercise is auto-graded and you can try multiple attempts. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "import numpy as np\n", "\n", "from sklearn.linear_model import LinearRegression \n", "from sklearn.impute import SimpleImputer, KNNImputer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Dealing with Missingness" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Missing Data\n", "We'll create data of the form:\n", "$$ y = 3x_1 - 2x_2 + \\varepsilon,\\hspace{0.1in} \\varepsilon \\sim N(0,1)$$\n", "\n", "We will then be inserting missingness into `x1` in various ways, and analyzing the results of different methods for handling those missing values." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Number of data points to generate\n", "n = 500\n", "# Set random seed for numpy to ensure reproducible results\n", "np.random.seed(109)\n", "# Generate our predictors...\n", "x1 = np.random.normal(0, 1, size=n)\n", "x2 = 0.5*x1 + np.random.normal(0, np.sqrt(0.75), size=n)\n", "X = pd.DataFrame(data=np.transpose([x1,x2]),columns=[\"x1\",\"x2\"])\n", "# Generate our response...\n", "y = 3*x1 - 2*x2 + np.random.normal(0, 1, size=n)\n", "y = pd.Series(y)\n", "# And put them all in a nice DataFrame\n", "df = pd.DataFrame(data=np.transpose([x1, x2, y]), columns=[\"x1\", \"x2\", \"y\"]) " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "fig, axs = plt.subplots(1, 3, figsize = (16,5))\n", "\n", "plot_pairs = [('x1', 'y'), ('x2', 'y'), ('x1', 'x2')]\n", "for ax, (x_var, y_var) in zip(axs, plot_pairs):\n", "    df.plot.scatter(x_var, y_var, ax=ax, title=f'{y_var} vs. 
{x_var}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Poke holes in $X_1$ in 3 different ways: \n", "\n", "- **Missing Completely at Random** (MCAR): missingness is not predictable.\n", "- **Missing at Random** (MAR): missingness depends on other observed data, and thus can be recovered in some way.\n", "- **Missing Not at Random** (MNAR): missingness depends on unobserved data and thus cannot be recovered.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we generate indices of $X_1$ to be dropped due to 3 types of missingness using $n$ independent Bernoulli trials.\\\n", "The only difference between the 3 sets of indices is the probability of success for each trial (i.e., the probability that a given observation will be missing)." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "missing_A = np.random.binomial(1, 0.05 + 0.85*(y > (y.mean()+y.std())), n).astype(bool)\n", "missing_B = np.random.binomial(1, 0.2, n).astype(bool)\n", "missing_C = np.random.binomial(1, 0.05 + 0.85*(x2 > (x2.mean()+x2.std())), n).astype(bool)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Helper function to replace x1 with NaN at the specified indices\n", "def create_missing(missing_indices, df=df):\n", "    df_new = df.copy()\n", "    df_new.loc[missing_indices, 'x1'] = np.nan\n", "    return df_new" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fill in the blanks to match the index sets above (missing_A, B, or C) with the type of missingness they represent." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "### edTest(test_missing_type) ###\n", "\n", "# Missing completely at random (MCAR)\n", "df_mcar = create_missing(missing_indices=___)\n", "\n", "# Missing at random (MAR)\n", "df_mar = create_missing(missing_indices=___)\n", "\n", "# Missing not at random (MNAR)\n", "df_mnar = create_missing(missing_indices=___)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's fit a model with no missing data." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# No missingness: fit on the full dataset\n", "ols = LinearRegression().fit(df[['x1', 'x2']], df['y'])\n", "print('No missing data:', ols.intercept_, ols.coef_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "⏸ **Q1.1** Why aren't the estimates exactly $\\hat{\\beta}_0 = 0$, $\\hat{\\beta}_1 = 3$ and $\\hat{\\beta}_2 = -2$? Isn't that our true data generating function?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*your answer here*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "---\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's naively fit a linear regression on the dataset with MCAR missingness and see what happens..." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# Fit inside a try/except block just in case...\n", "try:\n", "    ouch = LinearRegression().fit(df_mcar[['x1','x2']], df_mcar['y'])\n", "except Exception as e:\n", "    print(e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "⏸ **Q1.2** How did sklearn handle the missingness? 
(feel free to add some code above to experiment if you are still unsure)\n", "\n", "**A**: It ignored the _columns_ with missing values\\\n", "**B**: It ignored the _rows_ with missing values\\\n", "**C**: It didn't handle the missingness and the fit failed" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "### edTest(test_Q1_2) ###\n", "# Submit an answer choice as a string below \n", "# (Eg. if you choose option A, put 'A')\n", "answer1_2 = '___'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "⏸ **Q1.3** What would be a first naive approach to handling missingness?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*your answer here*\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "---\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What happens if we ignore problematic rows?" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# MCAR: drop the rows that have any missingness\n", "ols_mcar = LinearRegression().fit(df_mcar.dropna()[['x1', 'x2']], df_mcar.dropna()['y'])\n", "print('MCAR (drop):', ols_mcar.intercept_, ols_mcar.coef_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use the same strategy for the other types of missingness." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "### edTest(test_mar) ###\n", "# MAR: drop the rows that have any missingness\n", "ols_mar = LinearRegression().fit(df_mar.dropna()[[___]], df_mar.dropna()[__])\n", "print('MAR (drop):', ols_mar.intercept_,ols_mar.coef_)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# MNAR: drop the rows that have any missingness\n", "ols_mnar = LinearRegression().fit(___, ___)\n", "print('MNAR (drop):', ols_mnar.intercept_, ols_mnar.coef_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "⏸️ **Q2** Compare the various estimates above and how well they were able to recover the value of $\\beta_1$. For which form of missingness did dropping result in the _worst_ estimate?\n", "\n", "**A**: MCAR\\\n", "**B**: MAR\\\n", "**C**: MNAR" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "### edTest(test_Q2) ###\n", "# Submit an answer choice as a string below \n", "# (Eg. if you choose option A, put 'A')\n", "answer2 = '___'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Let's Start Imputing" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# Make backup copies for later since we'll have lots of imputation approaches.\n", "X_mcar_raw = df_mcar.drop('y', axis=1).copy()\n", "X_mar_raw = df_mar.drop('y', axis=1).copy()\n", "X_mnar_raw = df_mnar.drop('y', axis=1).copy()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Mean Imputation:\n", "\n", "Perform mean imputation using the `fillna`, `dropna`, and `mean` functions." 
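, "\n", "Before applying this to the exercise data, the optional cell immediately below sketches the idea on a tiny made-up Series (the toy values are ours, purely for illustration) so the mechanics are easy to eyeball." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional toy illustration (not part of the graded exercise): mean imputation on a made-up Series.\n", "# Relies on the pandas and numpy imports from the setup cell at the top of the notebook.\n", "toy = pd.Series([1.0, 2.0, np.nan, 3.0])\n", "# The mean of the observed values is (1 + 2 + 3) / 3 = 2.0; dropna() just makes that explicit\n", "print(toy.fillna(toy.dropna().mean()).values)  # [1. 2. 2. 3.]"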
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# Here's an example of one way to do the mean imputation with the above methods\n", "X_mcar = X_mcar_raw.copy()\n", "X_mcar['x1'] = X_mcar['x1'].fillna(X_mcar['x1'].dropna().mean())\n", "\n", "ols_mcar_mean = LinearRegression().fit(X_mcar, y)\n", "print('MCAR (mean):', ols_mcar_mean.intercept_, ols_mcar_mean.coef_)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "### edTest(test_mar_mean) ###\n", "X_mar = X_mar_raw.copy()\n", "# You can add as many lines as you see fit, so long as the final model is correct\n", "ols_mar_mean = ___\n", "print('MAR (mean):',ols_mar_mean.intercept_, ols_mar_mean.coef_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also use SKLearn's `SimpleImputer` object. By default it will replace NaN values with the column's mean." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "### edTest(test_mnar_mean) ###\n", "X_mnar = X_mnar_raw.copy()\n", "# instantiate imputer object\n", "imputer = ___\n", "# fit & transform X_mnar with the imputer\n", "X_mnar = ___\n", "# fit OLS model on imputed data\n", "ols_mnar_mean = ___\n", "print('MNAR (mean):', ols_mnar_mean.intercept_, ols_mnar_mean.coef_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "⏸️ **Q3** In our examples, how do these estimates compare when performing mean imputation vs. just dropping rows? \n", "\n", "**A**: They are better\\\n", "**B**: They are worse\\\n", "**C**: They are the same" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "### edTest(test_Q3) ###\n", "# Submit an answer choice as a string below \n", "# (Eg. if you choose option A, put 'A')\n", "answer3 = '___'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "---\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Linear Regression Imputation \n", "\n", "If you're not careful, it can be difficult to keep things straight. There are _two_ models here: \n", "\n", "1. an _imputation_ model concerning just the predictors (to predict $X_1$ from $X_2$) and \n", "2. the _substantive_ model we really care about used to predict $Y$ from the 'improved' $X_1$ (now with imputed values) and $X_2$." 
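, "\n", "As an optional aside, the cell immediately below sketches one way this two-step pattern could be wrapped in a small reusable helper (the helper name and signature are ours, not something the exercise or autograder requires); the worked MCAR example that follows spells the same steps out inline." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional sketch of a reusable helper (the name and signature are our own invention).\n", "# Relies on the pandas and sklearn imports from the setup cell at the top of the notebook.\n", "def impute_x1_with_ols(X):\n", "    # 1. imputation model: learn to predict x1 from x2 on the complete cases only\n", "    X = X.copy()\n", "    observed = X.dropna()\n", "    imputer = LinearRegression().fit(observed[['x2']], observed['x1'])\n", "    # 2. fill the missing x1 entries with the imputer's predictions (aligned by index)\n", "    x1_hat = pd.Series(imputer.predict(X[['x2']]), index=X.index)\n", "    X['x1'] = X['x1'].fillna(x1_hat)\n", "    return X\n", "\n", "# e.g. a substantive model could then be fit on impute_x1_with_ols(X_mcar_raw) and y"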
] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "X_mcar = X_mcar_raw.copy()\n", "\n", "# Fit the imputation model\n", "ols_imputer_mcar = LinearRegression().fit(X_mcar.dropna()[['x2']], X_mcar.dropna()['x1'])\n", "\n", "# Perform some imputations\n", "yhat_impute = pd.Series(ols_imputer_mcar.predict(X_mcar[['x2']]))\n", "X_mcar['x1'] = X_mcar['x1'].fillna(yhat_impute)\n", "\n", "# Fit the model we care about\n", "ols_mcar_ols = LinearRegression().fit(X_mcar, y)\n", "print('MCAR (OLS):', ols_mcar_ols.intercept_, ols_mcar_ols.coef_)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "### edTest(test_mar_ols) ###\n", "X_mar = X_mar_raw.copy()\n", "# Fit imputation model\n", "ols_imputer_mar = LinearRegression().fit(___, ___)\n", "# Get values to be imputed\n", "yhat_impute = pd.Series(ols_imputer_mar.predict(___))\n", "# Fill missing values with imputer's predictions\n", "X_mar['x1'] = X_mar['x1'].fillna(___)\n", "# Fit our final, 'substantive' model\n", "ols_mar_ols = LinearRegression().fit(___, ___)\n", "\n", "print('MAR (OLS):', ols_mar_ols.intercept_, ols_mar_ols.coef_)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "### edTest(test_mnar_ols) ###\n", "X_mnar = X_mnar_raw.copy()\n", "# your code here\n", "# You can add as many lines as you see fit, so long as the final model is correct\n", "ols_mnar_ols = ___\n", "print('MNAR (OLS):', ols_mnar_ols.intercept_, ols_mnar_ols.coef_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "⏸️ **Q4**: How do the estimates compare when performing OLS model-based imputation vs. mean imputation? Which type of missingness saw the biggest improvement?\n", "\n", "**A**: MCAR\\\n", "**B**: MAR\\\n", "**C**: MNAR" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "### edTest(test_Q4) ###\n", "# Submit an answer choice as a string below \n", "# (Eg. if you choose option A, put 'A')\n", "answer4 = '___'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "---\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### $k$-NN Imputation ($k$=3)\n", "As an alternative to linear regression, we can also use $k$-NN as our imputation model.\\\n", "SKLearn's `KNNImputer` object makes this very easy." 
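, "\n", "Under the hood, `KNNImputer` fills each missing entry with the average of that feature over the $k$ rows that are closest on the features observed in both rows (by default using a NaN-aware Euclidean distance). The optional cell immediately below illustrates this on a tiny made-up array before we apply it to our data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional toy illustration (not part of the graded exercise).\n", "# With n_neighbors=1 the missing entry is copied from its single nearest neighbor,\n", "# where 'nearest' is judged only on the column observed in both rows.\n", "toy = np.array([[1.0, 1.0],\n", "                [2.0, 2.0],\n", "                [np.nan, 1.1]])\n", "print(KNNImputer(n_neighbors=1).fit_transform(toy))\n", "# The NaN becomes 1.0, copied from the first row, whose second column (1.0) is closest to 1.1."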
] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "X_mcar = X_mcar_raw.copy()\n", "\n", "X_mcar = KNNImputer(n_neighbors=3).fit_transform(X_mcar)\n", "\n", "ols_mcar_knn = LinearRegression().fit(X_mcar, y)\n", "\n", "print('MCAR (KNN):', ols_mcar_knn.intercept_, ols_mcar_knn.coef_)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "### edTest(test_mar_knn) ###\n", "X_mar = X_mar_raw.copy()\n", "# Add imputed values to X_mar\n", "X_mar = KNNImputer(___).fit_transform(___)\n", "# Fit substantive model on imputed data\n", "ols_mar_knn = LinearRegression().fit(___, ___)\n", "\n", "print('MAR (KNN):', ols_mar_knn.intercept_, ols_mar_knn.coef_)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "### edTest(test_mnar_knn) ###\n", "X_mnar = X_mnar_raw.copy()\n", "# your code here\n", "# You can add as many lines as you see fit, so long as the final model is correct\n", "ols_mnar_knn = ___\n", "\n", "print('MNAR (KNN):', ols_mnar_knn.intercept_, ols_mnar_knn.coef_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "⏸️ **Q5**: True or False - While some methods may work better than others depending on the context, any imputation method is better than none (that is, as opposed to simply dropping)." ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "### edTest(test_Q5) ###\n", "# Submit an answer choice as a boolean value\n", "answer5 = ___" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "⏸️ **Q6**: Suppose your friend makes the following suggestion:\n", "\n", "\"The MNAR missing data can be predicted in part from the response $y$. Why not impute these missing $x_1$ values with an imputation model using $y$ as a predictor? It's true we can't impute like this with new data for which we don't have the $y$ values. But it will improve our training data, our model's fit, and so too its performance on new data!\"\n", "\n", "What is a _big problem_ with this idea?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*your answer here*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "---\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }