{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Title :\n",
    "Exercise: Computing the CI\n",
    "\n",
    "## Description :\n",
    "You are the manager of the Advertising division of your company, and your boss asks you the question, **\"How much more sales will we have if we invest $1000 dollars in TV advertising?\"**\n",
    "\n",
    "<img src=\"../fig/fig3.jpeg\" style=\"width: 500px;\">\n",
    "\n",
    "The goal of this exercise is to estimate the Sales with a 95% confidence interval using the Advertising.csv dataset.\n",
    "\n",
    "## Data Description:\n",
    "\n",
    "## Instructions:\n",
    "\n",
    "- Read the file `Advertising.csv` as a dataframe.\n",
    "- Fix a budget amount of 1000 dollars for TV advertising as variable called Budget.\n",
    "- Select the number of bootstraps.\n",
    "- For each bootstrap:\n",
    "    - Select a new dataframe with the predictor as TV and the response as Sales.\n",
    "    - Fit a simple linear regression on the data.\n",
    "    - Predict on the budget and compute the error estimate using the helper function `error_func()`.\n",
    "    - Store the sales as a sum of the prediction and the error estimate and append to `sales_list`.\n",
    "- Sort the `sales_list` which is a distribution of predicted sales over numboot bootstraps.\n",
    "- Compute the 95% confidence interval of `sales_list`.\n",
    "- Use the helper function `plot_simulation` to visualize the distribution and print the estimated sales.\n",
    "\n",
    "## Hints: \n",
    "\n",
    "<a href=\"https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.randint.html\" target=\"_blank\">np.random.randint()</a>\n",
    "Returns list of integers as per mentioned size \n",
    "\n",
    "<a href=\"https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html\" target=\"_blank\">df.sample()</a>\n",
    "Get a new sample from a dataframe\n",
    "\n",
    "<a href=\"https://matplotlib.org/3.2.2/api/_as_gen/matplotlib.pyplot.hist.html\" target=\"_blank\">plt.hist()</a>\n",
    "Plots a histogram\n",
    "\n",
    "<a href=\"https://matplotlib.org/api/_as_gen/matplotlib.pyplot.axvline.html\" target=\"_blank\">plt.axvline()</a>\n",
    "Adds a vertical line across the axes\n",
    "\n",
    "<a href=\"https://matplotlib.org/api/_as_gen/matplotlib.pyplot.axhline.html\" target=\"_blank\">plt.axhline()</a>\n",
    "Add a horizontal line across the axes\n",
    "\n",
    "<a href=\"https://matplotlib.org/api/_as_gen/matplotlib.pyplot.legend.html\" target=\"_blank\">plt.legend()</a>\n",
    "Place a legend on the axes\n",
    "\n",
    "<a href=\"https://numpy.org/doc/stable/reference/generated/numpy.ndarray.sort.html#numpy.ndarray.sort\" target=\"_blank\">ndarray.sort()</a>\n",
    "Returns the sorted ndarray.\n",
    "\n",
    "<a href=\"https://numpy.org/doc/stable/reference/generated/numpy.percentile.html\" target=\"_blank\">np.percentile(list, q)</a>\n",
    "Returns the q-th percentile value based on the provided ascending list of values.\n",
    "\n",
    "Note: This exercise is **auto-graded and you can try multiple attempts**."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import necessary libraries\n",
    "%matplotlib inline\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from scipy import stats\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn import preprocessing\n",
    "from sklearn.metrics import mean_squared_error\n",
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.preprocessing import PolynomialFeatures\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 95,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Read the `Advertising.csv` dataframe\n",
    "df = pd.read_csv('Advertising.csv')\n",
    "\n",
    "# Take a quick look at the data\n",
    "df.head()\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 134,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Helper function to compute the variance of the error term \n",
    "def error_func(y,y_p):\n",
    "    n = len(y)\n",
    "    return np.sqrt(np.sum((y-y_p)**2/(n-2)))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 147,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set the number of bootstraps \n",
    "numboot = 1000\n",
    "\n",
    "# Set the budget as per the instructions given \n",
    "# Use 2D list to facilitate model prediction (sklearn.LinearRegression requires input as a 2d array)\n",
    "budget = [[___]]\n",
    "\n",
    "# Initialize an empty list to store sales predictions for each bootstrap\n",
    "sales_list = []\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 148,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Loop through each bootstrap\n",
    "for i in range(___):\n",
    "\n",
    "    # Create bootstrapped version of the data using the sample function\n",
    "    # Set frac=1 and replace=True to get a bootstrap\n",
    "    df_new = df.sample(___, replace=___)\n",
    "\n",
    "    # Get the predictor data ('TV') from the new bootstrapped data\n",
    "    x = df_new[[___]]\n",
    "\n",
    "    # Get the response data ('Sales') from the new bootstrapped data\n",
    "    y = df_new.___\n",
    "\n",
    "    # Initialize a Linear Regression model\n",
    "    linreg = LinearRegression()\n",
    "\n",
    "    # Fit the model on the new data\n",
    "    linreg.fit(___,___)\n",
    "\n",
    "    # Predict on the budget from the original data\n",
    "    prediction = linreg.predict(budget)\n",
    "\n",
    "    # Predict on the bootstrapped data\n",
    "    y_pred = linreg.predict(x) \n",
    "\n",
    "    # Compute the error using the helper function error_func\n",
    "    error = np.random.normal(0,error_func(y,y_pred))\n",
    "    \n",
    "    # The final sales prediction is the sum of the model prediction \n",
    "    # and the error term\n",
    "    sales = ___\n",
    "\n",
    "    # Convert the sales to float type and append to the list\n",
    "    sales_list.append(np.float64(___))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 137,
   "metadata": {},
   "outputs": [],
   "source": [
    "### edTest(test_sales) ###\n",
    "# Sort the list containing sales predictions in ascending order \n",
    "sales_list.sort()\n",
    "\n",
    "# Find the 95% confidence interval using np.percentile function \n",
    "# at 2.5% and 97.5%\n",
    "sales_CI = (np.percentile(___,___),np.percentile(___, ___))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 138,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Helper function to plot the histogram of beta values along \n",
    "# with the 95% confidence interval\n",
    "def plot_simulation(simulation,confidence):\n",
    "    plt.hist(simulation, bins = 30, label = 'beta distribution', align = 'left', density = True,edgecolor='k')\n",
    "    plt.axvline(confidence[1], 0, 1, color = 'r', label = 'Right Interval')\n",
    "    plt.axvline(confidence[0], 0, 1, color = 'red', label = 'Left Interval')\n",
    "    plt.xlabel('Beta value')\n",
    "    plt.ylabel('Frequency')\n",
    "    plt.legend(frameon = False, loc = 'upper right')\n",
    "    plt.show();\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Call the plot_simulation function above with the computed sales \n",
    "# distribution and the confidence intervals computed earlier\n",
    "plot_simulation(sales_list,sales_CI)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 0,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Print the computed values\n",
    "print(f\"With a TV advertising budget of ${budget[0][0]},\")\n",
    "print(f\"we can expect an increase of sales anywhere between {sales_CI[0]:0.2f} to {sales_CI[1]:.2f}\\\n",
    " with a 95% confidence interval\")\n",
    " "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "⏸ The sales predictions here is based on the Simple-Linear regression model between `TV` and `Sales`. Re-run the above exercise by fitting the model considering all variables in `Advertising.csv`. \n",
    "\n",
    "Keep the budget the same, i.e $1000 for 'TV' advertising. \n",
    "You may have to change the `budget` variable to something like `[[1000,0,0]]` for proper computation.\n",
    "\n",
    "Does your predicted sales interval change?\n",
    "Why, or why not?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 149,
   "metadata": {},
   "outputs": [],
   "source": [
    "### edTest(test_chow1) ###\n",
    "# Type your answer within in the quotes given\n",
    "answer1 = '___'\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}