Exercise: 2 - Prediction CI


You are the manager of the Advertising division of your company, and your boss asks you the question, "How much more sales will we have if we invest $1000 in TV advertising?"

The goal of this exercise is to estimate the Sales with a 95% confidence interval using the Advertising.csv dataset.


  • Read the file Advertising.csv as a dataframe.
  • Fix a budget amount of 1000 dollars for TV advertising as a variable called budget.
  • Select the number of bootstraps.
  • For each bootstrap:
    • Select a new dataframe with the predictor as TV and the response as Sales.
    • Fit a simple linear regression on the data.
    • Predict on the budget and compute the error estimate using the helper function error_func()
    • Store the sales as a sum of the prediction and the error estimate and append to sales_dist
  • Sort sales_dist, which is the distribution of predicted sales over `numboot` bootstraps.
  • Compute the 95% confidence interval of sales_dist
  • Use the helper function plot_simulation to visualize the distribution and print the estimated sales.
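
The steps above can be sketched on synthetic data. This is a minimal illustration, not the exercise solution: the data are made up rather than read from Advertising.csv, `np.polyfit` stands in for LinearRegression to keep the block dependency-light, and the names `x_new`, `num_boot` and `preds` are assumptions chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT Advertising.csv): y = 7 + 0.05 * x + noise
n = 200
x = rng.uniform(0, 300, size=n)
y = 7.0 + 0.05 * x + rng.normal(0, 3.0, size=n)

x_new = 100.0      # hypothetical predictor value to predict at
num_boot = 1000    # number of bootstrap resamples
preds = []

for _ in range(num_boot):
    # Resample row indices with replacement (same size as the data)
    idx = rng.integers(0, n, size=n)
    xb, yb = x[idx], y[idx]

    # Fit a straight line; for deg=1 np.polyfit returns [slope, intercept]
    slope, intercept = np.polyfit(xb, yb, deg=1)

    # Residual standard error with n - 2 degrees of freedom
    resid = yb - (intercept + slope * xb)
    sigma = np.sqrt(np.sum(resid**2) / (n - 2))

    # Model prediction plus one simulated draw of the error term
    preds.append(intercept + slope * x_new + rng.normal(0, sigma))

# 2.5th and 97.5th percentiles bound the central 95% of the simulated values
lower, upper = np.percentile(preds, [2.5, 97.5])
print(f"95% interval at x = {x_new}: [{lower:.2f}, {upper:.2f}]")
```

Adding the noise draw to each prediction makes the interval reflect both the uncertainty in the fitted line and the scatter of individual observations around it.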


np.random.randint() : Returns random integers of the requested size

df.sample() : Get a new sample from a dataframe

plt.hist() : Plots a histogram

plt.axvline() : Adds a vertical line across the axes

plt.axhline() : Add a horizontal line across the axes

plt.legend() : Place a legend on the axes

ndarray.sort() : Sorts the ndarray in place and returns None (list.sort() behaves the same way for lists).

np.percentile(list, q) : Returns the q-th percentile of the provided list of values (the list does not need to be sorted first).
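
A small sketch of how some of the helpers listed above behave. The toy dataframe and values below are assumptions made up for illustration and are not part of the exercise data.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1, 2, 3, 4, 5]})

# frac=1 with replace=True draws as many rows as the original, repeats allowed
boot = toy.sample(frac=1, replace=True, random_state=0)
print(len(boot))          # same number of rows as toy

values = [5.0, 1.0, 4.0, 2.0, 3.0]
result = values.sort()    # list.sort() sorts in place...
print(result)             # ...and returns None
print(values)

# np.percentile accepts several percentiles at once
low, high = np.percentile(values, [2.5, 97.5])
print(low, high)
```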

Note: This exercise is auto-graded and you can try multiple attempts.

In [1]:
# Import libraries
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from scipy import stats
In [2]:
# Read the `Advertising.csv` dataframe

df = pd.read_csv('Advertising.csv')
In [3]:
# Take a quick look at the data

df.head()

      TV  Radio  Newspaper  Sales
0  230.1   37.8       69.2   22.1
1   44.5   39.3       45.1   10.4
2   17.2   45.9       69.3    9.3
3  151.5   41.3       58.5   18.5
4  180.8   10.8       58.4   12.9
In [134]:
# This helper function computes the standard deviation of the error term (residual standard error, n-2 degrees of freedom)

def error_func(y,y_p):
    n = len(y)
    return np.sqrt(np.sum((y-y_p)**2/(n-2)))
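
A quick check of what error_func returns, on made-up numbers. The function is repeated here only so the block runs on its own; `y_obs`, `y_fit` and `sigma_hat` are hypothetical names for this illustration.

```python
import numpy as np

def error_func(y, y_p):
    n = len(y)
    return np.sqrt(np.sum((y - y_p)**2 / (n - 2)))

# Toy observed and fitted values: residuals are 0, -1, 1, 0
y_obs = np.array([1.0, 2.0, 3.0, 4.0])
y_fit = np.array([1.0, 3.0, 2.0, 4.0])

# Sum of squared residuals = 2, n - 2 = 2, so the result is sqrt(2 / 2) = 1.0
sigma_hat = error_func(y_obs, y_fit)
print(sigma_hat)

# The result is a standard deviation, so it can be passed directly
# as the scale of a zero-mean normal draw
noise = np.random.normal(0, sigma_hat)
```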
In [147]:
# select the number of bootstraps 

numboot = 1000

# Select the budget. We use a 2D list to facilitate model prediction (sklearn's LinearRegression requires its input as a 2D array)
budget = [[___]]

# Define an empty list that will store sales predictions for each bootstrap
sales_dist = []
In [148]:
# Running through each bootstrap, we fit a model, make predictions and compute sales which is appended to the list defined above

for i in range(___):
    # Bootstrap using df.sample method.
    df_new = df.sample(frac=___, replace=___)
    x = df_new[[___]]
    y = df_new[___]
    linreg = LinearRegression()
    # Fit the model on the bootstrap sample before predicting
    linreg.fit(x, y)
    prediction = linreg.predict(budget)
    y_pred = linreg.predict(x)
    error = np.random.normal(0,error_func(y,y_pred))
    # The final sales prediction is the sum of the model prediction and the error term
    sales = ___
    # Store this bootstrap's sales estimate
    sales_dist.append(sales)
In [137]:
### edTest(test_sales) ###
# We sort the list containing sales predictions in ascending order


# find the 95% confidence interval using np.percentile function at 2.5% and 97.5%

sales_CI = (np.percentile(___,___),np.percentile(___, ___))
In [138]:
# Use this helper function to plot the histogram of the simulated sales along with the 95% confidence interval

def plot_simulation(simulation,confidence):
    plt.hist(simulation, bins = 30, label = 'sales distribution', align = 'left', density = True,edgecolor='k')
    plt.axvline(confidence[1], 0, 1, color = 'r', label = 'Right Interval')
    plt.axvline(confidence[0], 0, 1, color = 'r', label = 'Left Interval')
    plt.xlabel('Sales')
    plt.legend(frameon = False, loc = 'upper right')
In [2]:
# call the function above with the computed sales distribution and the confidence intervals from earlier

In [3]:
# Print the computed values

print(f"With a TV advertising budget of ${budget[0][0]},")
print(f"we can expect sales between {sales_CI[0]:.2f} and {sales_CI[1]:.2f} (95% confidence interval)")

Post-exercise question

Your sales prediction is based on the simple linear regression model between TV and Sales. Now, re-run the above exercise, but this time fit the model using all the predictors in Advertising.csv.

Keep the budget the same, i.e., $1000 for 'TV' advertising. You may have to change the budget variable to something like [[1000,0,0]] for proper computation.

Does your predicted sales interval change? Why, or why not?
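
As a hint on the input shape only, here is a minimal sketch with synthetic data (not the Advertising data). A model fitted on three predictor columns expects each row passed to predict to have three values, in the same column order used for fitting. The names `X`, `x_new` and `pred` are assumptions for this illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic stand-in predictors with three columns
X = rng.uniform(0, 100, size=(50, 3))
y = 2.0 + 0.5 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(0, 1.0, size=50)

model = LinearRegression().fit(X, y)

# One row, three columns: a value for each predictor, in training column order
x_new = [[10.0, 0.0, 0.0]]
pred = model.predict(x_new)
print(pred.shape)
```

Setting the other predictors to 0 is itself a modelling choice, so it is worth thinking about how it affects the predicted value.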

In [149]:
# Your answer here