# Title

Exercise: 2 - Prediction CI

You are the manager of the Advertising division of your company, and your boss asks you the question, "How much more sales will we have if we invest 1000 dollars in TV advertising?"

The goal of this exercise is to estimate the Sales with a 95% confidence interval using the Advertising.csv dataset.

# Instructions:

- Read the file Advertising.csv as a dataframe.
- Fix a budget amount of 1000 dollars for TV advertising as a variable called budget.
- Select the number of bootstraps.
- For each bootstrap:
    - Select a new dataframe with the predictor as TV and the response as Sales.
    - Fit a simple linear regression on the data.
    - Predict on the budget and compute the error estimate using the helper function error_func().
    - Store the sales as the sum of the prediction and the error estimate, and append it to sales_dist.
- Sort sales_dist, which is a distribution of predicted sales over numboot bootstraps.
- Compute the 95% confidence interval of sales_dist.
- Use the helper function plot_simulation to visualize the distribution and print the estimated sales.

# Hints

- np.random.randint() : Returns random integers in an array of the requested size
- df.sample() : Gets a new sample from a dataframe
- plt.hist() : Plots a histogram
- plt.axvline() : Adds a vertical line across the axes
- plt.axhline() : Adds a horizontal line across the axes
- plt.legend() : Places a legend on the axes
- ndarray.sort() : Sorts the ndarray in place (it returns None, not a sorted copy)
- np.percentile(list, q) : Returns the q-th percentile of the values in the list

Note: This exercise is auto-graded and you can make multiple attempts.

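
The resampling and percentile hints above can be tried on a toy array before touching the dataset. The following is a minimal sketch, not the exercise solution: the array `values` and the names `idx`, `resample`, and `boot_means` are made up for illustration, and the seed is set only so the sketch is reproducible. It draws row positions with `np.random.randint()` (one common way to bootstrap, equivalent in spirit to resampling a dataframe with replacement) and reads the central 95% off the 2.5th and 97.5th percentiles.

```python
import numpy as np

np.random.seed(0)  # seeded only so the sketch is reproducible

# Toy data (assumption): not taken from Advertising.csv
values = np.array([3.0, 1.0, 4.0, 1.5, 5.0, 9.0, 2.0, 6.0])

# Draw positions with replacement: same size as the data, duplicates allowed
idx = np.random.randint(0, len(values), size=len(values))
resample = values[idx]

# Repeating the resampling many times gives a bootstrap distribution of the mean
boot_means = np.array([
    values[np.random.randint(0, len(values), size=len(values))].mean()
    for _ in range(2000)
])

# The central 95% lies between the 2.5th and 97.5th percentiles
lower, upper = np.percentile(boot_means, [2.5, 97.5])
print(f"95% interval for the mean: ({lower:.2f}, {upper:.2f})")
```

Note that `np.percentile` handles unsorted input, so sorting beforehand is a convenience for inspection rather than a requirement.
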
In :
# Import libraries
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from scipy import stats

In :
# Read the Advertising.csv dataframe
df = pd.read_csv('Advertising.csv')

In :
# Take a quick look at the data
df.head()

Out:
      TV  Radio  Newspaper  Sales
0  230.1   37.8       69.2   22.1
1   44.5   39.3       45.1   10.4
2   17.2   45.9       69.3    9.3
3  151.5   41.3       58.5   18.5
4  180.8   10.8       58.4   12.9

In :
# This helper function computes the standard deviation of the error term
# (the residual standard error, with n-2 degrees of freedom)
def error_func(y,y_p):
    n = len(y)
    return np.sqrt(np.sum((y-y_p)**2/(n-2)))

In :
# Select the number of bootstraps
numboot = 1000

# Select the budget. We have used a 2d list to facilitate model prediction (sklearn's LinearRegression requires input as a 2d array)
budget = [[___]]

# Define an empty list that will store sales predictions for each bootstrap
sales_dist = []

In :
# Running through each bootstrap, we fit a model, make predictions and compute sales which is appended to the list defined above
for i in range(___):
    # Bootstrap using df.sample method.
    df_new = df.sample(frac=___, replace=___)
    x = df_new[[___]]
    y = df_new[___]
    linreg = LinearRegression()
    linreg.fit(___, ___)
    prediction = linreg.predict(budget)
    y_pred = linreg.predict(x)
    error = np.random.normal(0,error_func(y,y_pred))
    # The final sales prediction is the sum of the model prediction and the error term
    sales = ___
    # np.float was removed in NumPy 1.24; the built-in float does the same job
    sales_dist.append(float(___))

In :
### edTest(test_sales) ###
# We sort the list containing sales predictions in ascending values
sales_dist.sort()

# Find the 95% confidence interval using np.percentile function at 2.5% and 97.5%
sales_CI = (np.percentile(___,___),np.percentile(___, ___))

In :
# Use this helper function to plot the histogram of beta values along with the 95% confidence interval
def plot_simulation(simulation,confidence):
    plt.hist(simulation, bins = 30, label = 'beta distribution', align = 'left', density = True,edgecolor='k')
    # confidence is a (lower, upper) pair, so index it to draw each bound
    plt.axvline(confidence[1], 0, 1, color = 'r', label = 'Right Interval')
    plt.axvline(confidence[0], 0, 1, color = 'red', label = 'Left Interval')
    plt.xlabel('Beta value')
    plt.ylabel('Frequency')
    plt.legend(frameon = False, loc = 'upper right')

In :
# Call the function above with the computed sales distribution and the confidence intervals from earlier
plot_simulation(sales_dist,sales_CI)

In :
# Print the computed values
print(f"With a TV advertising budget of {budget[0][0]},")
print(f"we can expect an increase of sales anywhere between {sales_CI[0]:0.2f} and {sales_CI[1]:.2f} "
      "with a 95% confidence interval")
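
The bootstrap loop in the cells above can be sketched end to end on synthetic data. This is an illustration under stated assumptions rather than the graded solution: the data are simulated from a made-up line, `np.polyfit` stands in for sklearn's `LinearRegression` so that the block only needs NumPy, and `residual_std`, `x_new`, and `simulated` are hypothetical names.

```python
import numpy as np

np.random.seed(1)  # seeded only so the sketch is reproducible

# Synthetic stand-in data (assumption): y = 7 + 0.05 * x + Gaussian noise
n = 200
x = np.random.uniform(0, 300, size=n)
y = 7.0 + 0.05 * x + np.random.normal(0, 3.0, size=n)

def residual_std(y_true, y_fit):
    """Residual standard error with n - 2 degrees of freedom."""
    return np.sqrt(np.sum((y_true - y_fit) ** 2) / (len(y_true) - 2))

x_new = 150.0   # hypothetical predictor value to predict at
n_boot = 1000
simulated = []

for _ in range(n_boot):
    # Resample rows with replacement, keeping the sample size
    idx = np.random.randint(0, n, size=n)
    xb, yb = x[idx], y[idx]

    # Fit a straight line; np.polyfit returns the highest degree first
    slope, intercept = np.polyfit(xb, yb, deg=1)
    fitted = intercept + slope * xb

    # Point prediction plus one random draw of the error term
    point = intercept + slope * x_new
    noise = np.random.normal(0, residual_std(yb, fitted))
    simulated.append(point + noise)

lower, upper = np.percentile(simulated, [2.5, 97.5])
print(f"Central 95% of simulated responses at x = {x_new}: ({lower:.2f}, {upper:.2f})")
```

Because each simulated value adds a noise draw to the fitted line, the spread of `simulated` reflects both the uncertainty in the fitted coefficients and the scatter of individual observations around the line.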


## Post-exercise question

Your sales prediction is based on the simple linear regression model between TV and Sales. Now, re-run the above exercise, but this time fit the model using all the predictors in Advertising.csv (TV, Radio, and Newspaper).

Keep the budget the same, i.e., \$1000 for 'TV' advertising. You may have to change the budget variable to something like [[1000,0,0]] so that it has one value per predictor.

Does your predicted sales interval change? Why, or why not?
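
To see why the budget needs one entry per predictor, here is a minimal sketch on synthetic data. It assumes three made-up predictor columns standing in for TV, Radio, and Newspaper, and it uses `np.linalg.lstsq` in place of sklearn's `LinearRegression` so that the block only needs NumPy; the coefficients and the names `A`, `beta`, and `true_coefs` are illustrative only. It does not answer the question above.

```python
import numpy as np

np.random.seed(2)  # seeded only so the sketch is reproducible

# Synthetic predictors (assumption): three columns standing in for TV, Radio, Newspaper
n = 100
X = np.random.uniform(0, 100, size=(n, 3))
true_coefs = np.array([0.05, 0.2, 0.0])
y = 3.0 + X @ true_coefs + np.random.normal(0, 1.0, size=n)

# Least-squares fit with an explicit intercept column
A = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

# The new input is one row with three columns, matching the three predictors
budget = np.array([[50.0, 0.0, 0.0]])
prediction = np.column_stack([np.ones(len(budget)), budget]) @ beta
print(f"Predicted response: {prediction[0]:.2f}")
```

A model fitted on three columns expects three values per row at prediction time, which is why a one-element budget no longer works. With p predictors, the usual residual degrees of freedom are n - p - 1, so the n - 2 in error_func() is specific to the single-predictor case.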

In :
# Your answer here