{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Title :\n",
"Exercise: Hypothesis Testing\n",
"\n",
"## Description :\n",
"\n",
"The goal of this exercise is to identify the relevant features of the dataset using **Hypothesis testing** and to plot a bar plot like the one given below:\n",
"\n",
"
\n",
"\n",
"## Data Description:\n",
"\n",
"## Instructions:\n",
"\n",
"- Read the file `Advertising.csv` as a dataframe.\n",
"- Fit a simple multi-linear regression with \"medv\" as the response variable and the remaining columns as the predictor variables.\n",
"- Compute the coefficients of the model and plot a bar chart to depict these values.\n",
"- To find the distributions of the coefficients perform bootstrap.\n",
"- For each bootstrap:\n",
" - Fit a simple multi-linear regression with the same conditions as before.\n",
" - Compute the coefficient values and store as a list.\n",
"- Compute the |t|∣t∣ values for each of the coefficient value in the list.\n",
"- Plot a bar chart of the varying |t|∣t∣ values.\n",
"- Compute the p-value from the |t|∣t∣ values.\n",
"- Plot a bar chart of 1-p1−p values of the coefficients. Also mark the 0.95 line on the chart as shown above.\n",
"\n",
"## Hints: \n",
"\n",
"pd.read_csv(filename)\n",
"Returns a pandas dataframe containing the data and labels from the file data\n",
"\n",
"sklearn.preprocessing.normalize()\n",
"Scales input vectors individually to unit norm (vector length).\n",
"\n",
"np.interp()\n",
"Returns one-dimensional linear interpolation\n",
"\n",
"sklearn.train_test_split()\n",
"Splits the data into random train and test subsets\n",
"\n",
"sklearn.LinearRegression()\n",
"LinearRegression fits a linear model\n",
"\n",
"sklearn.fit()\n",
"Fits the linear model to the training data\n",
"\n",
"sklearn.predict()\n",
"Predict using the linear model.\n",
"\n",
"**Note:** This exercise is **auto-graded and you can try multiple attempts**."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Import necessary libraries\n",
"%matplotlib inline\n",
"import numpy as np\n",
"import pandas as pd\n",
"from scipy import stats\n",
"import matplotlib.pyplot as plt\n",
"from sklearn import preprocessing\n",
"from sklearn.linear_model import Lasso\n",
"from sklearn.linear_model import Ridge\n",
"from sklearn.metrics import mean_squared_error\n",
"from sklearn.linear_model import LinearRegression\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.preprocessing import PolynomialFeatures\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Read the file \"Advertising.csv\" as a dataframe\n",
"df = pd.read_csv(\"Advertising.csv\",index_col=0)\n",
"\n",
"# Take a quick look at the dataframe\n",
"df.head()\n"
]
},
{
"cell_type": "code",
"execution_count": 333,
"metadata": {},
"outputs": [],
"source": [
"# Get all the columns except 'sales' as the predictors\n",
"X = df.drop(['sales'],axis=1)\n",
"\n",
"# Select 'sales' as the response variable\n",
"y = df['sales']\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Initialize a linear regression model with normalize=True\n",
"lreg = LinearRegression(normalize=True)\n",
"\n",
"# Fit the model on the entire data\n",
"lreg.fit(X, y)\n"
]
},
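  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Note:** the `normalize=` argument of `LinearRegression` was deprecated in scikit-learn 1.0 and removed in 1.2. If your scikit-learn version rejects it, a minimal workaround (a sketch, not an exact reproduction of the old behaviour; variable names here are illustrative) is to scale the predictor columns yourself with `sklearn.preprocessing.normalize`, which is listed in the hints above. The bootstrap $|t|$ values computed below are largely insensitive to such constant rescaling of the columns.\n",
    "\n",
    "```python\n",
    "# Sketch only: column-wise scaling as a stand-in for the removed normalize=True\n",
    "from sklearn.preprocessing import normalize\n",
    "\n",
    "X_scaled = pd.DataFrame(normalize(X, axis=0), columns=X.columns)\n",
    "lreg = LinearRegression()\n",
    "lreg.fit(X_scaled, y)\n",
    "```\n"
   ]
  },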
{
"cell_type": "code",
"execution_count": 335,
"metadata": {},
"outputs": [],
"source": [
"# Get the coefficient of each predictor as a dictionary\n",
"coef_dict = dict(zip(df.columns[:-1], np.transpose(lreg.coef_)))\n",
"predictors,coefficients = list(zip(*sorted(coef_dict.items(),key=lambda x: x[1])))\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Helper code to visualize the coefficients of all predictors\n",
"fig, ax = plt.subplots()\n",
"ax.barh(predictors,coefficients, align='center',color=\"#336600\",alpha=0.7)\n",
"ax.grid(linewidth=0.2)\n",
"ax.set_xlabel(\"Coefficient\")\n",
"ax.set_ylabel(\"Predictors\")\n",
"plt.show()\n"
]
},
{
"cell_type": "code",
"execution_count": 337,
"metadata": {},
"outputs": [],
"source": [
"# Helper function to compute the t-statistic \n",
"def get_t(arr):\n",
" means = np.abs(arr.mean(axis=0))\n",
" stds = arr.std(axis=0)\n",
" return np.divide(means,stds)\n"
]
},
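  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For reference, `get_t` computes, for each coefficient, the absolute value of the mean of its bootstrap estimates divided by their standard deviation,\n",
    "\n",
    "$$|t_j| = \\frac{\\left|\\overline{\\hat{\\beta}_j}\\right|}{\\mathrm{sd}(\\hat{\\beta}_j)},$$\n",
    "\n",
    "where the mean and standard deviation are taken over the bootstrap estimates collected in the next cell; the bootstrap standard deviation plays the role of the coefficient's standard error.\n"
   ]
  },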
{
"cell_type": "code",
"execution_count": 338,
"metadata": {},
"outputs": [],
"source": [
"# Initialize an empty list to store the coefficient values\n",
"coef_dist = []\n",
"\n",
"# Set the number of bootstraps\n",
"numboot = 1000\n",
"\n",
"# Loop over the all the bootstraps\n",
"for i in range(___):\n",
"\n",
" # Get a bootstrapped version of the dataframe\n",
" df_new = df.sample(frac=1,replace=True)\n",
"\n",
" # Get all the columns except 'sales' as the predictors\n",
" X = df_new.drop(___,axis=1)\n",
"\n",
" # Select 'sales' as the response variable\n",
" y = df_new[___]\n",
"\n",
" # Initialize a linear regression model with normalize=True\n",
" lreg = LinearRegression(normalize=___)\n",
"\n",
" # Fit the model on the entire data\n",
" lreg.fit(___, ___)\n",
"\n",
" # Append the coefficients of all predictors to the list\n",
" coef_dist.append(lreg.coef_)\n",
"\n",
"# Convert the list to a numpy array\n",
"coef_dist = np.array(coef_dist)\n"
]
},
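  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each call to `df.sample(frac=1, replace=True)` draws a resample of the same size as the original dataframe, with replacement, so some rows appear more than once and others not at all. Refitting the regression on every resample yields an empirical (bootstrap) distribution for each coefficient. A minimal sketch of what a single resample looks like, using a small hypothetical dataframe:\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "toy = pd.DataFrame({'x': [10, 20, 30, 40, 50]})\n",
    "resample = toy.sample(frac=1, replace=True)\n",
    "print(resample.index.tolist())  # e.g. [3, 0, 0, 4, 2]; duplicated indices are expected\n",
    "```\n"
   ]
  },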
{
"cell_type": "code",
"execution_count": 339,
"metadata": {},
"outputs": [],
"source": [
"# Use the helper function get_t to find the T-test values\n",
"tt = get_t(___)\n",
"n = df.shape[0]\n"
]
},
{
"cell_type": "code",
"execution_count": 340,
"metadata": {},
"outputs": [],
"source": [
"# Get the t-value associated with each predictor\n",
"tt_dict = dict(zip(df.columns[:-1], tt))\n",
"predictors, tvalues = list(zip(*sorted(tt_dict.items(),key=lambda x:x[1])))\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Helper code below to visualise the t-values\n",
"fig, ax = plt.subplots()\n",
"ax.barh(predictors,tvalues, align='center',color=\"#336600\",alpha=0.7)\n",
"ax.grid(linewidth=0.2)\n",
"ax.set_xlabel(\"T-test values\")\n",
"ax.set_ylabel(\"Predictors\")\n",
"plt.show();\n"
]
},
{
"cell_type": "code",
"execution_count": 342,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_pval) ###\n",
"\n",
"# From t-test values compute the p values using scipy.stats \n",
"# T-distribution function\n",
"pval = stats.t.sf(tt, n-1)*2\n",
"\n",
"# Here we use sf i.e 'Survival function' which is 1 - CDF of the t distribution.\n",
"# We also multiply by two because its a two tailed test.\n",
"# Please refer to lecture notes for more information\n",
"\n",
"# Since p values are in reversed order, we find the 'confidence' \n",
"# which is 1-p\n",
"conf = ___\n"
]
},
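  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In other words, treating each $|t_j|$ as a draw from a $t$-distribution with $n-1$ degrees of freedom under the null hypothesis that the corresponding coefficient is zero, the two-tailed p-value is\n",
    "\n",
    "$$p_j = 2\\,P\\big(T_{n-1} > |t_j|\\big) = 2\\,\\big(1 - F_{t_{n-1}}(|t_j|)\\big),$$\n",
    "\n",
    "which is exactly `2 * stats.t.sf(tt, n-1)` above. The confidence values plotted below are $1 - p_j$; predictors whose bars cross the 0.95 line are significant at the 5% level.\n"
   ]
  },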
{
"cell_type": "code",
"execution_count": 343,
"metadata": {},
"outputs": [],
"source": [
"# Get the 'confidence' values associated with each predictor\n",
"conf_dict = dict(zip(df.columns[:-1], conf))\n",
"predictors, confs = list(zip(*sorted(conf_dict.items(),key=lambda x:x[1])))\n"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Helper code below to visualise the confidence values\n",
"fig, ax = plt.subplots()\n",
"ax.barh(predictors,confs, align='center',color=\"#336600\",alpha=0.7)\n",
"ax.grid(linewidth=0.2)\n",
"ax.axvline(x=0.95,linewidth=3,linestyle='--', color = 'black',alpha=0.8,label = '0.95')\n",
"ax.set_xlabel(\"$1-p$ value\")\n",
"ax.set_ylabel(\"Predictors\")\n",
"ax.legend()\n",
"plt.show();\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 4
}