{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Title :\n", "Exercise: Dealing with Missingness\n", "\n", "## Description :\n", "The goal of the exercise is to get comfortable with different types of missingness and ways to try and handle them with a few basic imputation methods using numpy, pandas, and sklearn. The examples will show how the combination of different types of missingness and imputation methods can affect inference.\n", "\n", "## Data Description:\n", "\n", "## Instructions:\n", "We are using synthetic data to illustrate the issues with missing data. We will\n", "- Create a synthetic dataset from two predictors\n", "- Create missingness in 3 different ways\n", "- Handle it 4 different ways (dropping rows, mean imputation, OLS imputation, and k-NN imputation)\n", "\n", "## Hints: \n", "\n", "pandas.dropna\n", "Drop rows with missingness\n", "\n", "pandas.fillna\n", "Fill in missing values either with a single value or with a Series\n", "\n", "sklearn.impute.SimpleImputer\n", "Imputation transformer for completing missing values.\n", "\n", "sklearn.linear_model.LinearRegression\n", "Generates a Linear Regression Model\n", "\n", "sklearn.impute.KNNImputer\n", "Fill in missing values with a k-NN model\n", "\n", "**Note:** This exercise is auto-graded and you can try multiple attempts. " ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "import matplotlib.pyplot as plt\n", "import pandas as pd\n", "import numpy as np\n", "\n", "from sklearn.linear_model import LinearRegression \n", "from sklearn.impute import SimpleImputer, KNNImputer" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Dealing with Missingness" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Missing Data\n", "We'll create data of the form:\n", "$$ y = 3x_1 - 2x_2 + \\varepsilon,\\hspace{0.1in} \\varepsilon \\sim N(0,1)$$\n", "\n", "We will then be inserting missingness into `x1` in various ways, and analyzing the results of different methods for handling those missing values." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Number of data points to generate\n", "n = 500\n", "# Set random seed for numpy to ensure reproducible results\n", "np.random.seed(109)\n", "# Generate our predictors...\n", "x1 = np.random.normal(0, 1, size=n)\n", "x2 = 0.5*x1 + np.random.normal(0, np.sqrt(0.75), size=n)\n", "X = pd.DataFrame(data=np.transpose([x1,x2]),columns=[\"x1\",\"x2\"])\n", "# Generate our response...\n", "y = 3*x1 - 2*x2 + np.random.normal(0, 1, size=n)\n", "y = pd.Series(y)\n", "# And put them all in a nice DataFrame\n", "df = pd.DataFrame(data=np.transpose([x1, x2, y]), columns=[\"x1\", \"x2\", \"y\"]) " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "fig, axs = plt.subplots(1, 3, figsize = (16,5))\n", "\n", "plot_pairs = [('x1', 'y'), ('x2', 'y'), ('x1', 'x2')]\n", "for ax, (x_var, y_var) in zip(axs, plot_pairs):\n", "    df.plot.scatter(x_var, y_var, ax=ax, title=f'{y_var} vs. 
{x_var}')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Poke holes in $X_1$ in 3 different ways: \n", "\n", "- **Missing Completely at Random** (MCAR): missingness is not predictable.\n", "- **Missing at Random** (MAR): missingness depends on other observed data, and thus can be recovered in some way.\n", "- **Missing Not at Random** (MNAR): missingness depends on unobserved data and thus cannot be recovered.\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we generate indices of $X_1$ to be dropped due to 3 types of missingness using $n$ independent Bernoulli trials.\\\n", "The only difference between the 3 sets of indices is the probability of success for each trial (i.e., the probability that a given observation will be missing)." ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "missing_A = np.random.binomial(1, 0.05 + 0.85*(y > (y.mean()+y.std())), n).astype(bool)\n", "missing_B = np.random.binomial(1, 0.2, n).astype(bool)\n", "missing_C = np.random.binomial(1, 0.05 + 0.85*(x2 > (x2.mean()+x2.std())), n).astype(bool)" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [], "source": [ "# Helper function to replace x1 with NaN at the specified indices\n", "def create_missing(missing_indices, df=df):\n", "    df_new = df.copy()\n", "    df_new.loc[missing_indices, 'x1'] = np.nan\n", "    return df_new" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fill in the blanks to match the index sets above (missing_A, B, or C) with the type of missingness they represent." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "### edTest(test_missing_type) ###\n", "\n", "# Missing completely at random (MCAR)\n", "df_mcar = create_missing(missing_indices=___)\n", "\n", "# Missing at random (MAR)\n", "df_mar = create_missing(missing_indices=___)\n", "\n", "# Missing not at random (MNAR)\n", "df_mnar = create_missing(missing_indices=___)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First, let's fit a model with no missing data." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# No missingness: fit on the full dataset\n", "ols = LinearRegression().fit(df[['x1', 'x2']], df['y'])\n", "print('No missing data:', ols.intercept_, ols.coef_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "⏸ **Q1.1** Why aren't the estimates exactly $\\hat{\\beta}_0 = 0$, $\\hat{\\beta}_1 = 3$ and $\\hat{\\beta}_2 = -2$? Isn't that our true data generating function?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*your answer here*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "---\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's naively fit a linear regression on the dataset with MCAR missingness and see what happens..." ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "# Fit inside a try/except block just in case...\n", "try:\n", "    ouch = LinearRegression().fit(df_mcar[['x1','x2']], df_mcar['y'])\n", "except Exception as e:\n", "    print(e)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "⏸ **Q1.2** How did sklearn handle the missingness? 
(feel free to add some code above to experiment if you are still unsure)\n", "\n", "**A**: It ignored the _columns_ with missing values\\\n", "**B**: It ignored the _rows_ with missing values\\\n", "**C**: It didn't handle the missingness and the fit failed" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "### edTest(test_Q1_2) ###\n", "# Submit an answer choice as a string below \n", "# (Eg. if you choose option A, put 'A')\n", "answer1_2 = '___'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "---" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "⏸ **Q1.3** What would be a first naive approach to handling missingness?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*your answer here*\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "---\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What happens if we ignore problematic rows?" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "scrolled": true }, "outputs": [], "source": [ "# MCAR: drop the rows that have any missingness\n", "ols_mcar = LinearRegression().fit(df_mcar.dropna()[['x1', 'x2']], df_mcar.dropna()['y'])\n", "print('MCAR (drop):', ols_mcar.intercept_, ols_mcar.coef_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Use the same strategy for the other types of missingness." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "### edTest(test_mar) ###\n", "# MAR: drop the rows that have any missingness\n", "ols_mar = LinearRegression().fit(df_mar.dropna()[[___]], df_mar.dropna()[__])\n", "print('MAR (drop):', ols_mar.intercept_,ols_mar.coef_)" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [], "source": [ "# MNAR: drop the rows that have any missingness\n", "ols_mnar = LinearRegression().fit(___, ___)\n", "print('MNAR (drop):', ols_mnar.intercept_, ols_mnar.coef_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "⏸️ **Q2** Compare the various estimates above and how well they were able to recover the value of $\\beta_1$. For which form of missingness did dropping result in the _worst_ estimate?\n", "\n", "**A**: MCAR\\\n", "**B**: MAR\\\n", "**C**: MNAR" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "### edTest(test_Q2) ###\n", "# Submit an answer choice as a string below \n", "# (Eg. if you choose option A, put 'A')\n", "answer2 = '___'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Let's Start Imputing" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# Make backup copies for later since we'll have lots of imputation approaches.\n", "X_mcar_raw = df_mcar.drop('y', axis=1).copy()\n", "X_mar_raw = df_mar.drop('y', axis=1).copy()\n", "X_mnar_raw = df_mnar.drop('y', axis=1).copy()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Mean Imputation:\n", "\n", "Perform mean imputation using the `fillna`, `dropna`, and `mean` functions." 
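, "\n", "Before applying this to the exercise data, the optional cell immediately below sketches the idea on a tiny made-up Series (the toy values are ours, purely for illustration) so the mechanics are easy to eyeball." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional toy illustration (not part of the graded exercise): mean imputation on a made-up Series.\n", "# Relies on the pandas and numpy imports from the setup cell at the top of the notebook.\n", "toy = pd.Series([1.0, 2.0, np.nan, 3.0])\n", "# The mean of the observed values is (1 + 2 + 3) / 3 = 2.0; dropna() just makes that explicit\n", "print(toy.fillna(toy.dropna().mean()).values)  # [1. 2. 2. 3.]"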
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# Here's an example of one way to do the mean imputation with the above methods\n", "X_mcar = X_mcar_raw.copy()\n", "X_mcar['x1'] = X_mcar['x1'].fillna(X_mcar['x1'].dropna().mean())\n", "\n", "ols_mcar_mean = LinearRegression().fit(X_mcar, y)\n", "print('MCAR (mean):', ols_mcar_mean.intercept_, ols_mcar_mean.coef_)" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "### edTest(test_mar_mean) ###\n", "X_mar = X_mar_raw.copy()\n", "# You can add as many lines as you see fit, so long as the final model is correct\n", "ols_mar_mean = ___\n", "print('MAR (mean):',ols_mar_mean.intercept_, ols_mar_mean.coef_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can also use SKLearn's `SimpleImputer` object. By default it will replace NaN values with the column's mean." ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "### edTest(test_mnar_mean) ###\n", "X_mnar = X_mnar_raw.copy()\n", "# instantiate imputer object\n", "imputer = ___\n", "# fit & transform X_mnar with the imputer\n", "X_mnar = ___\n", "# fit OLS model on imputed data\n", "ols_mnar_mean = ___\n", "print('MNAR (mean):', ols_mnar_mean.intercept_, ols_mnar_mean.coef_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "⏸️ **Q3** In our examples, how do these estimates compare when performing mean imputation vs. just dropping rows? \n", "\n", "**A**: They are better\\\n", "**B**: They are worse\\\n", "**C**: They are the same" ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [ "### edTest(test_Q3) ###\n", "# Submit an answer choice as a string below \n", "# (Eg. if you choose option A, put 'A')\n", "answer3 = '___'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "---\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Linear Regression Imputation \n", "\n", "If you're not careful, it can be difficult to keep things straight. There are _two_ models here: \n", "\n", "1. an _imputation_ model concerning just the predictors (to predict $X_1$ from $X_2$) and \n", "2. the _substantive_ model we really care about used to predict $Y$ from the 'improved' $X_1$ (now with imputed values) and $X_2$." 
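, "\n", "As an optional aside, the cell immediately below sketches one way this two-step pattern could be wrapped in a small reusable helper (the helper name and signature are ours, not something the exercise or autograder requires); the worked MCAR example that follows spells the same steps out inline." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional sketch of a reusable helper (the name and signature are our own invention).\n", "# Relies on the pandas and sklearn imports from the setup cell at the top of the notebook.\n", "def impute_x1_with_ols(X):\n", "    # 1. imputation model: learn to predict x1 from x2 on the complete cases only\n", "    X = X.copy()\n", "    observed = X.dropna()\n", "    imputer = LinearRegression().fit(observed[['x2']], observed['x1'])\n", "    # 2. fill the missing x1 entries with the imputer's predictions (aligned by index)\n", "    x1_hat = pd.Series(imputer.predict(X[['x2']]), index=X.index)\n", "    X['x1'] = X['x1'].fillna(x1_hat)\n", "    return X\n", "\n", "# e.g. a substantive model could then be fit on impute_x1_with_ols(X_mcar_raw) and y"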
] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [ "X_mcar = X_mcar_raw.copy()\n", "\n", "# Fit the imputation model\n", "ols_imputer_mcar = LinearRegression().fit(X_mcar.dropna()[['x2']], X_mcar.dropna()['x1'])\n", "\n", "# Perform some imputations\n", "yhat_impute = pd.Series(ols_imputer_mcar.predict(X_mcar[['x2']]))\n", "X_mcar['x1'] = X_mcar['x1'].fillna(yhat_impute)\n", "\n", "# Fit the model we care about\n", "ols_mcar_ols = LinearRegression().fit(X_mcar, y)\n", "print('MCAR (OLS):', ols_mcar_ols.intercept_, ols_mcar_ols.coef_)" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "### edTest(test_mar_ols) ###\n", "X_mar = X_mar_raw.copy()\n", "# Fit imputation model\n", "ols_imputer_mar = LinearRegression().fit(___, ___)\n", "# Get values to be imputed\n", "yhat_impute = pd.Series(ols_imputer_mar.predict(___))\n", "# Fill missing values with imputer's predictions\n", "X_mar['x1'] = X_mar['x1'].fillna(___)\n", "# Fit our final, 'substantive' model\n", "ols_mar_ols = LinearRegression().fit(___, ___)\n", "\n", "print('MAR (OLS):', ols_mar_ols.intercept_, ols_mar_ols.coef_)" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "### edTest(test_mnar_ols) ###\n", "X_mnar = X_mnar_raw.copy()\n", "# your code here\n", "# You can add as many lines as you see fit, so long as the final model is correct\n", "ols_mnar_ols = ___\n", "print('MNAR (OLS):', ols_mnar_ols.intercept_, ols_mnar_ols.coef_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "⏸️ **Q4**: How do the estimates compare when performing OLS model-based imputation vs. mean imputation? Which type of missingness saw the biggest improvement?\n", "\n", "**A**: MCAR\\\n", "**B**: MAR\\\n", "**C**: MNAR" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "### edTest(test_Q4) ###\n", "# Submit an answer choice as a string below \n", "# (Eg. if you choose option A, put 'A')\n", "answer4 = '___'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "---\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### $k$-NN Imputation ($k$=3)\n", "As an alternative to linear regression, we can also use $k$-NN as our imputation model.\\\n", "SKLearn's `KNNImputer` object makes this very easy." 
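, "\n", "Under the hood, `KNNImputer` fills each missing entry with the average of that feature over the $k$ rows that are closest on the features observed in both rows (by default using a NaN-aware Euclidean distance). The optional cell immediately below illustrates this on a tiny made-up array before we apply it to our data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Optional toy illustration (not part of the graded exercise).\n", "# With n_neighbors=1 the missing entry is copied from its single nearest neighbor,\n", "# where 'nearest' is judged only on the column observed in both rows.\n", "toy = np.array([[1.0, 1.0],\n", "                [2.0, 2.0],\n", "                [np.nan, 1.1]])\n", "print(KNNImputer(n_neighbors=1).fit_transform(toy))\n", "# The NaN becomes 1.0, copied from the first row, whose second column (1.0) is closest to 1.1."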
] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [], "source": [ "X_mcar = X_mcar_raw.copy()\n", "\n", "X_mcar = KNNImputer(n_neighbors=3).fit_transform(X_mcar)\n", "\n", "ols_mcar_knn = LinearRegression().fit(X_mcar, y)\n", "\n", "print('MCAR (KNN):', ols_mcar_knn.intercept_, ols_mcar_knn.coef_)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "### edTest(test_mar_knn) ###\n", "X_mar = X_mar_raw.copy()\n", "# Add imputed values to X_mar\n", "X_mar = KNNImputer(___).fit_transform(___)\n", "# Fit substantive model on imputed data\n", "ols_mar_knn = LinearRegression().fit(___, ___)\n", "\n", "print('MAR (KNN):', ols_mar_knn.intercept_, ols_mar_knn.coef_)" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [], "source": [ "### edTest(test_mnar_knn) ###\n", "X_mnar = X_mnar_raw.copy()\n", "# your code here\n", "# You can add as many lines as you see fit, so long as the final model is correct\n", "ols_mnar_knn = ___\n", "\n", "print('MNAR (KNN):', ols_mnar_knn.intercept_, ols_mnar_knn.coef_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "⏸️ **Q5**: True or False - While some methods may work better than others depending on the context, any imputation method is better than none (that is, as opposed to simply dropping)." ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "### edTest(test_Q5) ###\n", "# Submit an answer choice as a boolean value\n", "answer5 = ___" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "⏸️ **Q6**: Suppose your friend makes the following suggestion:\n", "\n", "\"The MNAR missing data can be predicted in part from the response $y$. Why not impute these missing $x_1$ values with an imputation model using $y$ as a predictor? It's true we can't impute like this with new data for which we don't have the $y$ values. But it will improve our training data, our model's fit, and so too its performance on new data!\"\n", "\n", "What is a _big problem_ with this idea?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*your answer here*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "---\n", "\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }