{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Title :\n", "Exercise: Hypothesis Testing\n", "\n", "## Description :\n", "\n", "The goal of this exercise is to identify the relevant features of the dataset using **Hypothesis testing** and to plot a bar plot like the one given below:\n", "\n", "\n", "\n", "## Data Description:\n", "\n", "## Instructions:\n", "\n", "- Read the file `Advertising.csv` as a dataframe.\n", "- Fit a simple multi-linear regression with \"medv\" as the response variable and the remaining columns as the predictor variables.\n", "- Compute the coefficients of the model and plot a bar chart to depict these values.\n", "- To find the distributions of the coefficients perform bootstrap.\n", "- For each bootstrap:\n", " - Fit a simple multi-linear regression with the same conditions as before.\n", " - Compute the coefficient values and store as a list.\n", "- Compute the |t|∣t∣ values for each of the coefficient value in the list.\n", "- Plot a bar chart of the varying |t|∣t∣ values.\n", "- Compute the p-value from the |t|∣t∣ values.\n", "- Plot a bar chart of 1-p1−p values of the coefficients. Also mark the 0.95 line on the chart as shown above.\n", "\n", "## Hints: \n", "\n", "pd.read_csv(filename)\n", "Returns a pandas dataframe containing the data and labels from the file data\n", "\n", "sklearn.preprocessing.normalize()\n", "Scales input vectors individually to unit norm (vector length).\n", "\n", "np.interp()\n", "Returns one-dimensional linear interpolation\n", "\n", "sklearn.train_test_split()\n", "Splits the data into random train and test subsets\n", "\n", "sklearn.LinearRegression()\n", "LinearRegression fits a linear model\n", "\n", "sklearn.fit()\n", "Fits the linear model to the training data\n", "\n", "sklearn.predict()\n", "Predict using the linear model.\n", "\n", "**Note:** This exercise is **auto-graded and you can try multiple attempts**." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Import necessary libraries\n", "%matplotlib inline\n", "import numpy as np\n", "import pandas as pd\n", "from scipy import stats\n", "import matplotlib.pyplot as plt\n", "from sklearn import preprocessing\n", "from sklearn.linear_model import Lasso\n", "from sklearn.linear_model import Ridge\n", "from sklearn.metrics import mean_squared_error\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import PolynomialFeatures\n" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# Read the file \"Advertising.csv\" as a dataframe\n", "df = pd.read_csv(\"Advertising.csv\",index_col=0)\n", "\n", "# Take a quick look at the dataframe\n", "df.head()\n" ] }, { "cell_type": "code", "execution_count": 333, "metadata": {}, "outputs": [], "source": [ "# Get all the columns except 'sales' as the predictors\n", "X = df.drop(['sales'],axis=1)\n", "\n", "# Select 'sales' as the response variable\n", "y = df['sales']\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Initialize a linear regression model with normalize=True\n", "lreg = LinearRegression(normalize=True)\n", "\n", "# Fit the model on the entire data\n", "lreg.fit(X, y)\n" ] }, { "cell_type": "code", "execution_count": 335, "metadata": {}, "outputs": [], "source": [ "# Get the coefficient of each predictor as a dictionary\n", "coef_dict = dict(zip(df.columns[:-1], np.transpose(lreg.coef_)))\n", "predictors,coefficients = list(zip(*sorted(coef_dict.items(),key=lambda x: x[1])))\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Helper code to visualize the coefficients of all predictors\n", "fig, ax = plt.subplots()\n", "ax.barh(predictors,coefficients, align='center',color=\"#336600\",alpha=0.7)\n", "ax.grid(linewidth=0.2)\n", "ax.set_xlabel(\"Coefficient\")\n", "ax.set_ylabel(\"Predictors\")\n", "plt.show()\n" ] }, { "cell_type": "code", "execution_count": 337, "metadata": {}, "outputs": [], "source": [ "# Helper function to compute the t-statistic \n", "def get_t(arr):\n", " means = np.abs(arr.mean(axis=0))\n", " stds = arr.std(axis=0)\n", " return np.divide(means,stds)\n" ] }, { "cell_type": "code", "execution_count": 338, "metadata": {}, "outputs": [], "source": [ "# Initialize an empty list to store the coefficient values\n", "coef_dist = []\n", "\n", "# Set the number of bootstraps\n", "numboot = 1000\n", "\n", "# Loop over the all the bootstraps\n", "for i in range(___):\n", "\n", " # Get a bootstrapped version of the dataframe\n", " df_new = df.sample(frac=1,replace=True)\n", "\n", " # Get all the columns except 'sales' as the predictors\n", " X = df_new.drop(___,axis=1)\n", "\n", " # Select 'sales' as the response variable\n", " y = df_new[___]\n", "\n", " # Initialize a linear regression model with normalize=True\n", " lreg = LinearRegression(normalize=___)\n", "\n", " # Fit the model on the entire data\n", " lreg.fit(___, ___)\n", "\n", " # Append the coefficients of all predictors to the list\n", " coef_dist.append(lreg.coef_)\n", "\n", "# Convert the list to a numpy array\n", "coef_dist = np.array(coef_dist)\n" ] }, { "cell_type": "code", "execution_count": 339, "metadata": {}, "outputs": [], "source": [ "# Use the helper function get_t to find the T-test values\n", "tt = get_t(___)\n", "n = df.shape[0]\n" ] }, { "cell_type": "code", "execution_count": 340, "metadata": {}, "outputs": [], "source": [ "# Get the t-value associated with each predictor\n", "tt_dict = dict(zip(df.columns[:-1], tt))\n", "predictors, tvalues = list(zip(*sorted(tt_dict.items(),key=lambda x:x[1])))\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Helper code below to visualise the t-values\n", "fig, ax = plt.subplots()\n", "ax.barh(predictors,tvalues, align='center',color=\"#336600\",alpha=0.7)\n", "ax.grid(linewidth=0.2)\n", "ax.set_xlabel(\"T-test values\")\n", "ax.set_ylabel(\"Predictors\")\n", "plt.show();\n" ] }, { "cell_type": "code", "execution_count": 342, "metadata": {}, "outputs": [], "source": [ "### edTest(test_pval) ###\n", "\n", "# From t-test values compute the p values using scipy.stats \n", "# T-distribution function\n", "pval = stats.t.sf(tt, n-1)*2\n", "\n", "# Here we use sf i.e 'Survival function' which is 1 - CDF of the t distribution.\n", "# We also multiply by two because its a two tailed test.\n", "# Please refer to lecture notes for more information\n", "\n", "# Since p values are in reversed order, we find the 'confidence' \n", "# which is 1-p\n", "conf = ___\n" ] }, { "cell_type": "code", "execution_count": 343, "metadata": {}, "outputs": [], "source": [ "# Get the 'confidence' values associated with each predictor\n", "conf_dict = dict(zip(df.columns[:-1], conf))\n", "predictors, confs = list(zip(*sorted(conf_dict.items(),key=lambda x:x[1])))\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Helper code below to visualise the confidence values\n", "fig, ax = plt.subplots()\n", "ax.barh(predictors,confs, align='center',color=\"#336600\",alpha=0.7)\n", "ax.grid(linewidth=0.2)\n", "ax.axvline(x=0.95,linewidth=3,linestyle='--', color = 'black',alpha=0.8,label = '0.95')\n", "ax.set_xlabel(\"$1-p$ value\")\n", "ax.set_ylabel(\"Predictors\")\n", "ax.legend()\n", "plt.show();\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }