Key Word(s): hypothesis testing, confidence intervals

The goal of this exercise is to estimate the **Sales** with a 95% confidence interval using the Advertising.csv dataset.

# Instructions:

- Read the file `Advertising.csv` as a dataframe.
- Fix a budget amount of 1000 dollars for TV advertising in a variable called `budget`.
- Select the number of bootstraps.
- For each bootstrap:
    - Select a new dataframe with the predictor as TV and the response as Sales.
    - Fit a simple linear regression on the data.
    - Predict on the budget and compute the error estimate using the helper function `error_func()`.
    - Store the `sales` as a sum of the prediction and the error estimate and append to `sales_dist`.
- Sort `sales_dist`, which is a distribution of predicted sales over `numboot` bootstraps.
- Compute the 95% confidence interval of `sales_dist`.
- Use the helper function `plot_simulation` to visualize the distribution and print the estimated sales.
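
The steps above can be sketched end to end on synthetic data. This is a minimal illustration and not the exercise solution: the data are randomly generated stand-ins for the TV and Sales columns, `np.polyfit` is swapped in for `LinearRegression`, and the names `rng`, `slope`, `intercept`, and `sigma` are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the (TV, Sales) columns; NOT the Advertising.csv data
n = 200
x = rng.uniform(0, 300, size=n)
y = 7.0 + 0.05 * x + rng.normal(0, 3.0, size=n)

budget = 1000
numboot = 500
sales_dist = []

for _ in range(numboot):
    # Resample row indices with replacement to form one bootstrap sample
    idx = rng.integers(0, n, size=n)
    xb, yb = x[idx], y[idx]

    # Fit a straight line; np.polyfit returns the highest-degree coefficient first
    slope, intercept = np.polyfit(xb, yb, deg=1)

    # Residual standard error of this fit (same formula as error_func)
    resid = yb - (intercept + slope * xb)
    sigma = np.sqrt(np.sum(resid ** 2) / (n - 2))

    # Prediction at the budget plus one random draw of the error term
    sales = intercept + slope * budget + rng.normal(0, sigma)
    sales_dist.append(float(sales))

# Percentile interval covering the central 95% of the bootstrap distribution
lower, upper = np.percentile(sales_dist, [2.5, 97.5])
print(f"95% interval: {lower:.2f} to {upper:.2f}")
```

The interval is read directly off the empirical distribution of the simulated sales values, so no normality assumption is needed for the interval itself.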

# Hints

np.random.randint() : Returns an array of random integers of the requested size

df.sample() : Get a new sample from a dataframe

plt.hist() : Plots a histogram

plt.axvline() : Adds a vertical line across the axes

plt.axhline() : Adds a horizontal line across the axes

plt.legend() : Place a legend on the axes

ndarray.sort() : Sorts the array in place (it returns None, not a new array).

np.percentile(a, q) : Returns the q-th percentile of the values in `a`. The input does not need to be sorted beforehand.
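
As a small illustration of two of these hints, the sketch below resamples a toy dataframe with `df.sample()` and takes percentiles of an unsorted list. The dataframe `toy` and its values are made up for the example.

```python
import numpy as np
import pandas as pd

# Toy dataframe; the values are invented for illustration
toy = pd.DataFrame({"TV": [10.0, 20.0, 30.0, 40.0],
                    "Sales": [1.0, 2.0, 3.0, 4.0]})

# frac=1 with replace=True gives a sample of the same size in which rows may repeat
boot = toy.sample(frac=1, replace=True, random_state=0)
print(boot)

# np.percentile handles unsorted input and accepts several percentiles at once
values = [5.0, 1.0, 4.0, 2.0, 3.0]
lo, hi = np.percentile(values, [2.5, 97.5])
print(lo, hi)
```

With the default linear interpolation, the 2.5th and 97.5th percentiles of the values 1 to 5 fall slightly inside the extremes rather than on them.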

Note: This exercise is **auto-graded and you can try multiple attempts.**

```
# Import libraries
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from scipy import stats
```

```
# Read the `Advertising.csv` dataframe
df = pd.read_csv('Advertising.csv')
```

```
# Take a quick look at the data
df.head()
```

```
# This helper function computes the standard deviation of the error term
# (the residual standard error, with n-2 degrees of freedom)
def error_func(y,y_p):
    n = len(y)
    return np.sqrt(np.sum((y-y_p)**2/(n-2)))
```
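
To see what the helper returns, here is a small worked example on made-up numbers. The function is copied so the snippet runs on its own; the arrays `y_true` and `y_hat` and the name `sigma_hat` are assumptions for illustration.

```python
import numpy as np

def error_func(y, y_p):
    # Square root of the residual sum of squares divided by (n - 2)
    n = len(y)
    return np.sqrt(np.sum((y - y_p) ** 2 / (n - 2)))

# Made-up observations and fitted values
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8])

# Squared residuals sum to 0.10, and n - 2 = 2, so the result is sqrt(0.05)
sigma_hat = error_func(y_true, y_hat)
print(sigma_hat)

# The result is on the scale of a standard deviation, so it can be passed
# as the scale of a zero-mean normal draw
noise = np.random.normal(0, sigma_hat)
```

Because the value is a standard deviation rather than a variance, it can be used directly as the second argument of `np.random.normal`.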

```
# select the number of bootstraps
numboot = 1000
# Select the budget. We have used a 2d list to facilitate model prediction (sklearn.LinearRegression requires input as a 2d array)
budget = [[___]]
# Define an empty list that will store sales predictions for each bootstrap
sales_dist = []
```

```
# Running through each bootstrap, we fit a model, make predictions and compute sales which is appended to the list defined above
for i in range(___):
    # Bootstrap using df.sample method.
    df_new = df.sample(frac=___, replace=___)
    x = df_new[[___]]
    y = df_new[___]
    linreg = LinearRegression()
    linreg.fit(_,_)
    prediction = linreg.predict(budget)
    y_pred = linreg.predict(x)
    # Draw one random error with standard deviation given by error_func
    error = np.random.normal(0,error_func(y,y_pred))
    # The final sales prediction is the sum of the model prediction and the error term
    sales = ___
    # np.float was removed in NumPy 1.24; the built-in float does the same job
    sales_dist.append(float(___))
```

```
### edTest(test_sales) ###
# We sort the list containing sales predictions in ascending values
sales_dist.sort()
# find the 95% confidence interval using np.percentile function at 2.5% and 97.5%
sales_CI = (np.percentile(___,___),np.percentile(___, ___))
```

```
# Use this helper function to plot the histogram of simulated sales along with the 95% confidence interval
def plot_simulation(simulation,confidence):
    plt.hist(simulation, bins = 30, label = 'sales distribution', align = 'left', density = True,edgecolor='k')
    plt.axvline(confidence[1], 0, 1, color = 'r', label = 'Right Interval')
    plt.axvline(confidence[0], 0, 1, color = 'red', label = 'Left Interval')
    plt.xlabel('Predicted sales')
    plt.ylabel('Density')
    plt.legend(frameon = False, loc = 'upper right')
```

```
# call the function above with the computed sales distribution and the confidence intervals from earlier
plot_simulation(sales_dist,sales_CI)
```

```
# Print the computed values
print(f"With a TV advertising budget of ${budget[0][0]},")
print(f"we can expect sales anywhere between {sales_CI[0]:0.2f} and {sales_CI[1]:.2f} "
      "with a 95% confidence interval")
```

## Post-exercise question

Your sales prediction is based on the simple linear regression model between `TV` and `Sales`.
Now, re-run the above exercise, but this time fit the model considering all variables in `Advertising.csv`.

Keep the budget the same, i.e., $1000 for 'TV' advertising.
You may have to change the `budget` variable to something like `[[1000,0,0]]` for proper computation.
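
The reason for the extra zeros is that a model fitted on three predictors expects three values per row at prediction time. A minimal sketch on synthetic data, assuming three hypothetical predictor columns (the data and coefficients below are invented and are not taken from `Advertising.csv`):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Three hypothetical predictors and a synthetic response
X = rng.uniform(0, 100, size=(50, 3))
y = 2.0 + X @ np.array([0.05, 0.1, 0.0]) + rng.normal(0, 1.0, size=50)

model = LinearRegression().fit(X, y)

# One row, three columns: the first predictor set to 1000, the others to 0
budget = [[1000, 0, 0]]
prediction = model.predict(budget)
print(prediction)
```

Passing `[[1000]]` to this model would raise an error, because the number of columns no longer matches the number of features seen during fitting.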

Does your predicted sales interval change? Why, or why not?

```
# Your answer here
```