Key Word(s): hypothesis testing, confidence intervals
The goal of this exercise is to estimate the Sales with a 95% confidence interval using the Advertising.csv dataset.
Instructions:
- Read the file Advertising.csv as a dataframe.
- Fix a budget amount of 1000 dollars for TV advertising in a variable called budget.
- Select the number of bootstraps.
- For each bootstrap:
    - Select a new dataframe with the predictor as TV and the response as Sales.
    - Fit a simple linear regression on the data.
    - Predict on the budget and compute the error estimate using the helper function error_func().
    - Store the sales as a sum of the prediction and the error estimate and append to sales_dist.
- Sort sales_dist, which is a distribution of predicted sales over numboot bootstraps.
- Compute the 95% confidence interval of sales_dist.
- Use the helper function plot_simulation to visualize the distribution and print the estimated sales.
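The steps above can be sketched end to end on synthetic data. This is only an illustration under stated assumptions: the data are simulated rather than read from Advertising.csv, the line is fitted with np.polyfit instead of sklearn, and the names x_new, n_boot and pred_dist are hypothetical. Because each draw adds random noise to the prediction, the resulting interval reflects both the uncertainty of the fit and the scatter of individual observations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the data: y = 2 + 0.5 * x + noise (all values are made up)
n = 200
x = rng.uniform(0, 100, size=n)
y = 2.0 + 0.5 * x + rng.normal(0, 3.0, size=n)

x_new = 50.0   # hypothetical predictor value at which to predict
n_boot = 500   # number of bootstrap resamples
pred_dist = []

for _ in range(n_boot):
    # Resample row indices with replacement (same size as the original data)
    idx = rng.integers(0, n, size=n)
    xb, yb = x[idx], y[idx]

    # Fit a straight line to the resample; polyfit returns [slope, intercept]
    slope, intercept = np.polyfit(xb, yb, deg=1)

    # Residual standard error of this fit, with n - 2 degrees of freedom
    resid = yb - (intercept + slope * xb)
    sigma = np.sqrt(np.sum(resid ** 2) / (n - 2))

    # Prediction plus one random draw from the estimated noise distribution
    pred_dist.append(intercept + slope * x_new + rng.normal(0, sigma))

# Percentile interval covering the central 95% of the simulated predictions
lower, upper = np.percentile(pred_dist, [2.5, 97.5])
print(f"95% interval at x = {x_new}: ({lower:.2f}, {upper:.2f})")
```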
Hints
np.random.randint() : Returns an array of random integers of the requested size
df.sample() : Get a new sample from a dataframe
plt.hist() : Plots a histogram
plt.axvline() : Adds a vertical line across the axes
plt.axhline() : Add a horizontal line across the axes
plt.legend() : Place a legend on the axes
ndarray.sort() : Sorts the ndarray in place (it returns None, not a sorted copy).
np.percentile(a, q) : Returns the q-th percentile of the provided values; the input does not need to be sorted.
Note: This exercise is auto-graded and you can try multiple attempts.
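A few of the hints above can be tried on a toy example. The dataframe and values below are made up for illustration and are not taken from Advertising.csv; random_state is set only to make the resample repeatable.

```python
import numpy as np
import pandas as pd

# Toy dataframe with made-up values, standing in for a real dataset
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5], "y": [2.1, 3.9, 6.2, 8.1, 9.8]})

# Bootstrap resample: same number of rows, drawn with replacement,
# so some rows can appear more than once and others not at all
resample = toy.sample(frac=1, replace=True, random_state=0)
print(len(resample) == len(toy))  # True

# np.percentile gives the same answer for sorted and unsorted input
values = [9.8, 2.1, 8.1, 3.9, 6.2]
median_unsorted = np.percentile(values, 50)
median_sorted = np.percentile(sorted(values), 50)

# ndarray.sort() sorts in place and returns None
arr = np.array(values)
result = arr.sort()
print(result is None)  # True
```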
# Import libraries
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from scipy import stats
# Read the `Advertising.csv` dataframe
df = pd.read_csv('Advertising.csv')
# Take a quick look at the data
df.head()
# This helper function computes the standard deviation of the error term (the residual standard error)
def error_func(y,y_p):
    n = len(y)
    return np.sqrt(np.sum((y-y_p)**2/(n-2)))
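As a quick sanity check of the formula used by the helper, the sketch below applies it to four made-up points. The function residual_std is a hypothetical copy of error_func so that the snippet runs on its own. The divisor is n - 2 because a simple linear regression estimates two parameters (intercept and slope).

```python
import numpy as np

def residual_std(y, y_p):
    # Same formula as error_func: sqrt(sum of squared residuals / (n - 2))
    n = len(y)
    return np.sqrt(np.sum((y - y_p) ** 2) / (n - 2))

# Made-up observed and fitted values; every residual is +0.5 or -0.5
y_obs = np.array([1.0, 2.0, 3.0, 4.0])
y_fit = np.array([1.5, 1.5, 3.5, 3.5])

# Sum of squared residuals is 4 * 0.25 = 1.0, and n - 2 = 2
sigma_hat = residual_std(y_obs, y_fit)
print(sigma_hat)  # sqrt(0.5), about 0.707
```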
# select the number of bootstraps
numboot = 1000
# Select the budget. We have used a 2d list to facilitate model prediction (sklearn.LinearRegression requires input as a 2d array)
budget = [[___]]
# Define an empty list that will store sales predictions for each bootstrap
sales_dist = []
# Running through each bootstrap, we fit a model, make predictions and compute sales which is appended to the list defined above
for i in range(___):
    # Bootstrap using df.sample method.
    df_new = df.sample(frac=___, replace=___)
    x = df_new[[___]]
    y = df_new[___]
    linreg = LinearRegression()
    linreg.fit(___,___)
    prediction = linreg.predict(budget)
    y_pred = linreg.predict(x)
    error = np.random.normal(0,error_func(y,y_pred))
    # The final sales prediction is the sum of the model prediction and the error term
    sales = ___
    # Use the built-in float; the np.float alias was removed in NumPy 1.24
    sales_dist.append(float(___))
### edTest(test_sales) ###
# We sort the list containing sales predictions in ascending values
sales_dist.sort()
# find the 95% confidence interval using np.percentile function at 2.5% and 97.5%
sales_CI = (np.percentile(___,___),np.percentile(___, ___))
# Use this helper function to plot the histogram of predicted sales along with the 95% confidence interval
def plot_simulation(simulation,confidence):
    plt.hist(simulation, bins = 30, label = 'sales distribution', align = 'left', density = True,edgecolor='k')
    plt.axvline(confidence[1], 0, 1, color = 'r', label = 'Right Interval')
    plt.axvline(confidence[0], 0, 1, color = 'red', label = 'Left Interval')
    plt.xlabel('Sales')
    plt.ylabel('Density')
    plt.legend(frameon = False, loc = 'upper right')
# call the function above with the computed sales distribution and the confidence intervals from earlier
plot_simulation(sales_dist,sales_CI)
# Print the computed values
print(f"With a TV advertising budget of ${budget[0][0]},")
print(f"we can expect sales anywhere between {sales_CI[0]:.2f} and {sales_CI[1]:.2f} "
      "with 95% confidence")
Post-exercise question
Your sales prediction is based on the simple linear regression model between TV and Sales.
Now, re-run the above exercise, but this time fit the model considering all variables in Advertising.csv.
Keep the budget the same, i.e., $1000 for 'TV' advertising.
You may have to change the budget variable to something like [[1000,0,0]] for proper computation.
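The shape of a three-predictor input such as [[1000,0,0]] can be checked with a minimal sketch. It assumes synthetic data with three made-up predictors standing in for the real columns; the coefficients and the name x_new are hypothetical. It shows only how the input is laid out, not what the refitted interval will be.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic data with three made-up predictors (stand-ins, not the real columns)
X = rng.uniform(0, 100, size=(100, 3))
y = 1.0 + X @ np.array([0.05, 0.2, 0.01]) + rng.normal(0, 1.0, size=100)

model = LinearRegression().fit(X, y)

# One row, three columns: a value for the first predictor and 0 for the others
x_new = [[1000, 0, 0]]
pred = model.predict(x_new)

print(np.shape(x_new))  # (1, 3)
print(pred.shape)       # (1,)
```

With the other predictors set to 0, the prediction reduces to the intercept plus the first coefficient times 1000.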
Does your predicted sales interval change? Why, or why not?
# Your answer here