Key Word(s): hypothesis testing, confidence intervals

Title

Exercise: 2 - Prediction CI

Description

You are the manager of the Advertising division of your company, and your boss asks you the question, "How much more sales will we have if we invest $1000 in TV advertising?"

The goal of this exercise is to estimate the Sales with a 95% confidence interval using the Advertising.csv dataset.

Instructions:

Read the file Advertising.csv as a dataframe.
Fix a budget amount of 1000 dollars for TV advertising as a variable called budget.
Select the number of bootstraps.
For each bootstrap:
- Select a new dataframe with the predictor as TV and the response as Sales.
- Fit a simple linear regression on the data.
- Predict on the budget and compute the error estimate using the helper function error_func()
- Store the sales as a sum of the prediction and the error estimate and append to sales_dist
Sort sales_dist, which is a distribution of predicted sales over numboot bootstraps.
Compute the 95% confidence interval of sales_dist
Use the helper function plot_simulation to visualize the distribution and print the estimated sales.
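
The steps above can be sketched end to end on synthetic data. This is a minimal illustration, not the graded solution: the data, the names budget_demo, numboot_demo and sales_dist_demo, and the use of np.polyfit in place of sklearn's LinearRegression are all assumptions made so the sketch runs without Advertising.csv.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the TV and Sales columns (hypothetical values)
n = 200
tv = rng.uniform(0, 300, size=n)
sales = 7.0 + 0.05 * tv + rng.normal(0, 3.0, size=n)

def residual_std(y, y_p):
    # Same formula as the exercise's error_func helper
    return np.sqrt(np.sum((y - y_p) ** 2) / (len(y) - 2))

budget_demo = 150.0    # hypothetical budget inside the synthetic range
numboot_demo = 500     # hypothetical number of bootstraps
sales_dist_demo = []

for _ in range(numboot_demo):
    # Resample rows with replacement (the role df.sample plays in the exercise)
    idx = rng.integers(0, n, size=n)
    x_b, y_b = tv[idx], sales[idx]
    # Fit a straight line; np.polyfit returns [slope, intercept] for deg=1
    slope, intercept = np.polyfit(x_b, y_b, deg=1)
    prediction = intercept + slope * budget_demo
    # Draw one error term with the estimated residual standard deviation
    noise = rng.normal(0, residual_std(y_b, intercept + slope * x_b))
    sales_dist_demo.append(float(prediction + noise))

# 95% interval from the 2.5th and 97.5th percentiles
lower, upper = np.percentile(sales_dist_demo, [2.5, 97.5])
print(f"95% interval at budget {budget_demo}: ({lower:.2f}, {upper:.2f})")
```

Adding the random error term to each prediction is what widens the interval beyond the uncertainty in the fitted line alone.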

Hints

np.random.randint() : Returns an array of random integers of the requested size

df.sample() : Get a new sample from a dataframe

plt.hist() : Plots a histogram

plt.axvline() : Adds a vertical line across the axes

plt.axhline() : Add a horizontal line across the axes

plt.legend() : Place a legend on the axes

ndarray.sort() : Sorts the ndarray in place (returns None).

np.percentile(list, q) : Returns the q-th percentile of the provided values (the input does not need to be sorted).
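
As a quick check of the last two hints, the sketch below uses a small made-up array (the values are purely illustrative) to show that ndarray.sort() works in place and that np.percentile gives the same answer whether or not the input is sorted.

```python
import numpy as np

values = np.array([3.0, 1.0, 4.0, 1.0, 5.0])

# np.percentile handles unsorted input
p_unsorted = np.percentile(values, 50)

# ndarray.sort() sorts in place and returns None
result = values.sort()
p_sorted = np.percentile(values, 50)

print(result)                   # None
print(values)                   # [1. 1. 3. 4. 5.]
print(p_unsorted == p_sorted)   # True
```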

Note: This exercise is auto-graded and you can try multiple attempts.

In [1]:

# Import libraries
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from scipy import stats

In [2]:

# Read the `Advertising.csv` dataframe

df = pd.read_csv('Advertising.csv')

In [3]:

# Take a quick look at the data

df.head()

Out[3]:

	TV	Radio	Newspaper	Sales
0	230.1	37.8	69.2	22.1
1	44.5	39.3	45.1	10.4
2	17.2	45.9	69.3	9.3
3	151.5	41.3	58.5	18.5
4	180.8	10.8	58.4	12.9

In [134]:

# This helper function computes the estimated standard deviation of the error term (the residual standard error)

def error_func(y,y_p):
    n = len(y)
    return np.sqrt(np.sum((y-y_p)**2/(n-2)))
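
To see what the helper returns, here is a small worked example with made-up numbers (y_true and y_fit are hypothetical). The helper is repeated so the snippet runs on its own.

```python
import numpy as np

def error_func(y, y_p):
    n = len(y)
    return np.sqrt(np.sum((y - y_p) ** 2 / (n - 2)))

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_fit = np.array([1.0, 2.0, 3.0, 6.0])   # a single residual of size 2

# Sum of squared residuals is 4 and n - 2 is 2, so the result is sqrt(4 / 2)
sigma_hat = error_func(y_true, y_fit)
print(sigma_hat)   # approximately 1.4142
```

The result is on the same scale as the response, which is why it can be passed directly as the standard deviation to np.random.normal.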

In [147]:

# select the number of bootstraps 

numboot = 1000

# Select the budget. We have used a 2d list to facilitate model prediction (sklearn.LinearRegression requires input as a 2d array)
budget = [[___]]

# Define an empty list that will store sales predictions for each bootstrap
sales_dist = []

In [148]:

# Running through each bootstrap, we fit a model, make predictions and compute sales which is appended to the list defined above

for i in range(___):
    # Bootstrap using df.sample method.
    df_new = df.sample(frac=___, replace=___)
    x = df_new[[___]]
    y = df_new[___]
    linreg = LinearRegression()
    linreg.fit(___, ___)
    prediction = linreg.predict(budget)
    y_pred = linreg.predict(x) 
    error = np.random.normal(0,error_func(y,y_pred))
    # The final sales prediction is the sum of the model prediction and the error term
    sales = ___
    # np.float was removed in NumPy 1.24; the built-in float does the same job
    sales_dist.append(float(___))

In [137]:

### edTest(test_sales) ###
# We sort the list containing sales predictions in ascending values 

sales_dist.sort()

# find the 95% confidence interval using np.percentile function at 2.5% and 97.5%

sales_CI = (np.percentile(___,___),np.percentile(___, ___))

In [138]:

# Use this helper function to plot the histogram of predicted sales along with the 95% confidence interval

def plot_simulation(simulation,confidence):
    plt.hist(simulation, bins = 30, label = 'sales distribution', align = 'left', density = True,edgecolor='k')
    plt.axvline(confidence[1], 0, 1, color = 'r', label = 'Right Interval')
    plt.axvline(confidence[0], 0, 1, color = 'r', label = 'Left Interval')
    plt.xlabel('Predicted sales')
    plt.ylabel('Density')
    plt.legend(frameon = False, loc = 'upper right')

In [2]:

# call the function above with the computed sales distribution and the confidence intervals from earlier

plot_simulation(sales_dist,sales_CI)

In [3]:

# Print the computed values

print(f"With a TV advertising budget of ${budget[0][0]},")
print(f"we can expect an increase of sales anywhere between {sales_CI[0]:.2f} and {sales_CI[1]:.2f}\
 with a 95% confidence interval")

Post-exercise question

Your sales prediction is based on the simple linear regression model between TV and Sales. Now, re-run the above exercise, but this time fit the model using all the predictors in Advertising.csv.

Keep the budget the same, i.e., $1000 for 'TV' advertising. You may have to change the budget variable to something like [[1000,0,0]] for proper computation.

Does your predicted sales interval change? Why, or why not?
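
The shape requirement behind [[1000,0,0]] can be illustrated on synthetic data. The sketch below is only about mechanics and does not answer the question: the data, the coefficients, the name budget_demo, and the use of np.linalg.lstsq instead of LinearRegression are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic three-predictor data (hypothetical stand-in for TV, Radio, Newspaper)
X = rng.uniform(0, 100, size=(50, 3))
y = 2.0 + X @ np.array([0.05, 0.2, 0.0]) + rng.normal(0, 1.0, size=50)

# Least-squares fit with an explicit intercept column
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# A single new observation is still 2D: one row, one column per predictor
budget_demo = np.array([[80.0, 0.0, 0.0]])
prediction = coef[0] + budget_demo @ coef[1:]
print(budget_demo.shape, prediction.shape)   # (1, 3) (1,)
```

Setting the other predictors to zero is itself a modelling choice, so it is worth checking whether zero lies within the range of the observed data for those columns.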

In [149]:

# Your answer here