{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Title :\n", "Exercise: Computing the CI\n", "\n", "## Description :\n", "You are the manager of the Advertising division of your company, and your boss asks you the question, **\"How much more sales will we have if we invest $1000 dollars in TV advertising?\"**\n", "\n", "\n", "\n", "The goal of this exercise is to estimate the Sales with a 95% confidence interval using the Advertising.csv dataset.\n", "\n", "## Data Description:\n", "\n", "## Instructions:\n", "\n", "- Read the file `Advertising.csv` as a dataframe.\n", "- Fix a budget amount of 1000 dollars for TV advertising as variable called Budget.\n", "- Select the number of bootstraps.\n", "- For each bootstrap:\n", " - Select a new dataframe with the predictor as TV and the response as Sales.\n", " - Fit a simple linear regression on the data.\n", " - Predict on the budget and compute the error estimate using the helper function `error_func()`.\n", " - Store the sales as a sum of the prediction and the error estimate and append to `sales_list`.\n", "- Sort the `sales_list` which is a distribution of predicted sales over numboot bootstraps.\n", "- Compute the 95% confidence interval of `sales_list`.\n", "- Use the helper function `plot_simulation` to visualize the distribution and print the estimated sales.\n", "\n", "## Hints: \n", "\n", "np.random.randint()\n", "Returns list of integers as per mentioned size \n", "\n", "df.sample()\n", "Get a new sample from a dataframe\n", "\n", "plt.hist()\n", "Plots a histogram\n", "\n", "plt.axvline()\n", "Adds a vertical line across the axes\n", "\n", "plt.axhline()\n", "Add a horizontal line across the axes\n", "\n", "plt.legend()\n", "Place a legend on the axes\n", "\n", "ndarray.sort()\n", "Returns the sorted ndarray.\n", "\n", "np.percentile(list, q)\n", "Returns the q-th percentile value based on the provided ascending list of values.\n", "\n", "Note: This exercise is **auto-graded and you can try multiple attempts**." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Import necessary libraries\n", "%matplotlib inline\n", "import numpy as np\n", "import pandas as pd\n", "from scipy import stats\n", "import matplotlib.pyplot as plt\n", "from sklearn import preprocessing\n", "from sklearn.metrics import mean_squared_error\n", "from sklearn.linear_model import LinearRegression\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.preprocessing import PolynomialFeatures\n" ] }, { "cell_type": "code", "execution_count": 95, "metadata": {}, "outputs": [], "source": [ "# Read the `Advertising.csv` dataframe\n", "df = pd.read_csv('Advertising.csv')\n", "\n", "# Take a quick look at the data\n", "df.head()\n" ] }, { "cell_type": "code", "execution_count": 134, "metadata": {}, "outputs": [], "source": [ "# Helper function to compute the variance of the error term \n", "def error_func(y,y_p):\n", " n = len(y)\n", " return np.sqrt(np.sum((y-y_p)**2/(n-2)))\n" ] }, { "cell_type": "code", "execution_count": 147, "metadata": {}, "outputs": [], "source": [ "# Set the number of bootstraps \n", "numboot = 1000\n", "\n", "# Set the budget as per the instructions given \n", "# Use 2D list to facilitate model prediction (sklearn.LinearRegression requires input as a 2d array)\n", "budget = [[___]]\n", "\n", "# Initialize an empty list to store sales predictions for each bootstrap\n", "sales_list = []\n" ] }, { "cell_type": "code", "execution_count": 148, "metadata": {}, "outputs": [], "source": [ "# Loop through each bootstrap\n", "for i in range(___):\n", "\n", " # Create bootstrapped version of the data using the sample function\n", " # Set frac=1 and replace=True to get a bootstrap\n", " df_new = df.sample(___, replace=___)\n", "\n", " # Get the predictor data ('TV') from the new bootstrapped data\n", " x = df_new[[___]]\n", "\n", " # Get the response data ('Sales') from the new bootstrapped data\n", " y = df_new.___\n", "\n", " # Initialize a Linear Regression model\n", " linreg = LinearRegression()\n", "\n", " # Fit the model on the new data\n", " linreg.fit(___,___)\n", "\n", " # Predict on the budget from the original data\n", " prediction = linreg.predict(budget)\n", "\n", " # Predict on the bootstrapped data\n", " y_pred = linreg.predict(x) \n", "\n", " # Compute the error using the helper function error_func\n", " error = np.random.normal(0,error_func(y,y_pred))\n", " \n", " # The final sales prediction is the sum of the model prediction \n", " # and the error term\n", " sales = ___\n", "\n", " # Convert the sales to float type and append to the list\n", " sales_list.append(np.float64(___))\n" ] }, { "cell_type": "code", "execution_count": 137, "metadata": {}, "outputs": [], "source": [ "### edTest(test_sales) ###\n", "# Sort the list containing sales predictions in ascending order \n", "sales_list.sort()\n", "\n", "# Find the 95% confidence interval using np.percentile function \n", "# at 2.5% and 97.5%\n", "sales_CI = (np.percentile(___,___),np.percentile(___, ___))\n" ] }, { "cell_type": "code", "execution_count": 138, "metadata": {}, "outputs": [], "source": [ "# Helper function to plot the histogram of beta values along \n", "# with the 95% confidence interval\n", "def plot_simulation(simulation,confidence):\n", " plt.hist(simulation, bins = 30, label = 'beta distribution', align = 'left', density = True,edgecolor='k')\n", " plt.axvline(confidence[1], 0, 1, color = 'r', label = 'Right Interval')\n", " plt.axvline(confidence[0], 0, 1, color = 'red', label = 'Left Interval')\n", " plt.xlabel('Beta value')\n", " plt.ylabel('Frequency')\n", " plt.legend(frameon = False, loc = 'upper right')\n", " plt.show();\n", " " ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Call the plot_simulation function above with the computed sales \n", "# distribution and the confidence intervals computed earlier\n", "plot_simulation(sales_list,sales_CI)\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Print the computed values\n", "print(f\"With a TV advertising budget of ${budget[0][0]},\")\n", "print(f\"we can expect an increase of sales anywhere between {sales_CI[0]:0.2f} to {sales_CI[1]:.2f}\\\n", " with a 95% confidence interval\")\n", " " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "⏸ The sales predictions here is based on the Simple-Linear regression model between `TV` and `Sales`. Re-run the above exercise by fitting the model considering all variables in `Advertising.csv`. \n", "\n", "Keep the budget the same, i.e $1000 for 'TV' advertising. \n", "You may have to change the `budget` variable to something like `[[1000,0,0]]` for proper computation.\n", "\n", "Does your predicted sales interval change?\n", "Why, or why not?" ] }, { "cell_type": "code", "execution_count": 149, "metadata": {}, "outputs": [], "source": [ "### edTest(test_chow1) ###\n", "# Type your answer within in the quotes given\n", "answer1 = '___'\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }