Title:
Exercise: Confidence Intervals for Beta Values
Description:
The goal of this exercise is to create a plot like the one given below for $\beta_0$ and $\beta_1$.
Data Description:
Instructions:
- Follow the steps from the previous exercise to get the lists of beta values.
- Sort the list of beta values in ascending order (from low to high).
- To compute the 95% confidence interval, find the 2.5th and 97.5th percentiles using np.percentile().
- Use the helper function plot_simulation() to visualise the $\beta$ values along with their confidence interval.
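The percentile step can be sketched on made-up numbers before applying it to the real beta lists. This is a minimal illustration, assuming a synthetic array `fake_betas` drawn from a normal distribution; that array and its parameters are hypothetical and are not part of the exercise data.

```python
import numpy as np

# Synthetic stand-in for a list of bootstrapped estimates (an assumption;
# the exercise itself uses values computed from the Advertising data).
rng = np.random.default_rng(0)
fake_betas = rng.normal(loc=0.05, scale=0.005, size=1000)

# Sort ascending, as the instructions ask (ndarray.sort() works in place).
fake_betas.sort()

# The 2.5th and 97.5th percentiles bracket the central 95% of the values.
lower, upper = np.percentile(fake_betas, [2.5, 97.5])

# Share of the simulated values that fall inside the interval.
coverage = np.mean((fake_betas >= lower) & (fake_betas <= upper))
print(f"95% interval: ({lower:.3f}, {upper:.3f}), coverage: {coverage:.3f}")
```

Roughly 95% of the simulated values should fall between `lower` and `upper`, which is what makes this pair of percentiles a 95% interval.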
Hints:
$$\widehat{\beta_1} = \frac{\sum_{i=1}^{n}(x_{i}-\bar{x})(y_{i}-\bar{y})}{\sum_{i=1}^{n}(x_{i}-\bar{x})^{2}}$$
$$\widehat{\beta_0} = \bar{y} - \widehat{\beta_1}\,\bar{x}$$
np.random.randint() Returns an array of random integers of the specified size
df.iloc[] Purely integer-location based indexing for selection by position
plt.hist() Plots a histogram
plt.axvline() Adds a vertical line across the axes
plt.axhline() Adds a horizontal line across the axes
plt.xlabel() Sets the label for the x-axis
plt.ylabel() Sets the label for the y-axis
plt.legend() Places a legend on the axes
ndarray.sort() Sorts the ndarray in place; it does not return a new array.
np.percentile(list, q) Returns the q-th percentile of the provided values, with q given on a 0 to 100 scale.
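The two closed-form estimators in the hints can be checked on a tiny hand-made dataset. The sketch below assumes five invented (x, y) pairs, which are not taken from the Advertising data, and compares the result against np.polyfit as an independent check.

```python
import numpy as np

# Toy data invented for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Sample means of predictor and response.
x_bar, y_bar = x.mean(), y.mean()

# Slope: sum of cross-deviations divided by the sum of squared x-deviations.
beta1_hat = ((x - x_bar) * (y - y_bar)).sum() / ((x - x_bar) ** 2).sum()

# Intercept: the fitted line passes through the point (x_bar, y_bar).
beta0_hat = y_bar - beta1_hat * x_bar

# Independent check: a degree-1 least-squares fit returns [slope, intercept].
slope, intercept = np.polyfit(x, y, deg=1)
print(np.isclose(beta1_hat, slope), np.isclose(beta0_hat, intercept))
```

Both comparisons should agree, since the hint formulas are the least-squares solution for a single predictor.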
Note: This exercise is auto-graded and you may make multiple attempts.
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
Reading the standard Advertising dataset
# Read the 'Advertising_adj.csv' file
df = pd.read_csv('Advertising_adj.csv')
# Take a quick look at the data
df.head(3)
# Use the bootstrap function defined in the previous exercise
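def bootstrap(df):
    # Draw len(df) row positions at random, with replacement
    selectionIndex = np.random.randint(len(df), size = len(df))
    # Select those rows by position to form the bootstrap sample
    new_df = df.iloc[selectionIndex]
    return new_df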
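To see what a single bootstrap draw looks like, the function can be tried on a small hand-made frame. This is a sketch only: `toy_df` and its values are hypothetical, and the seed is set just to make the draw repeatable.

```python
import numpy as np
import pandas as pd

def bootstrap(df):
    # Sample row positions with replacement, same size as the original frame.
    selection_index = np.random.randint(len(df), size=len(df))
    return df.iloc[selection_index]

# Hypothetical four-row frame standing in for the Advertising data.
toy_df = pd.DataFrame({"tv": [10.0, 20.0, 30.0, 40.0],
                       "sales": [1.0, 2.0, 3.0, 4.0]})

np.random.seed(0)  # fixed seed so the illustration is repeatable
resampled = bootstrap(toy_df)

# The resample has the same number of rows; some original rows may repeat
# and others may be missing, which is the point of sampling with replacement.
print(resampled)
```

Because the original index labels are preserved by df.iloc, repeated labels in the printed output show which rows were drawn more than once.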
# Initialize empty lists to store beta values from 100 bootstraps
# of the original data
beta0_list, beta1_list = [],[]
# Set the number of bootstraps
numberOfBootstraps = 100
# Loop over the number of bootstraps
for i in range(numberOfBootstraps):
    # Call the function bootstrap with the original dataframe
    df_new = bootstrap(df)
    # Compute the mean of the predictor i.e. the TV column
    xmean = df_new.tv.mean()
    # Compute the mean of the response i.e. the Sales column
    ymean = df_new.sales.mean()
    # Compute beta1 analytically using the equation in the hints
    beta1 = (((df_new.tv - xmean)*(df_new.sales - ymean)).sum())/(((df_new.tv - xmean)**2).sum())
    # Compute beta0 analytically using the equation in the hints
    beta0 = ymean - beta1*xmean
    # Append the beta values to their appropriate lists
    beta0_list.append(beta0)
    beta1_list.append(beta1)
### edTest(test_sort) ###
# Sort the two lists of beta values from the lowest value to highest
beta0_list.___;
beta1_list.___;
### edTest(test_beta) ###
# Find the 95% confidence interval for beta0 using the
# percentile function
beta0_CI = (np.___,np.___)
# Find the 95% confidence interval for beta1 using the
# percentile function
beta1_CI = (np.___,np.___)
# Print the confidence interval of beta0 up to 3 decimal places
print(f'The beta0 confidence interval is {___}')
# Print the confidence interval of beta1 up to 3 decimal places
print(f'The beta1 confidence interval is {___}')
# Helper function to plot the histogram of beta values along with
# the 95% confidence interval
def plot_simulation(simulation,confidence):
    plt.hist(simulation, bins = 30, label = 'beta distribution', align = 'left', density = True)
    plt.axvline(confidence[1], 0, 1, color = 'r', label = 'Right Interval')
    plt.axvline(confidence[0], 0, 1, color = 'red', label = 'Left Interval')
    plt.xlabel('Beta value')
    plt.ylabel('Frequency')
    plt.title('Confidence Interval')
    plt.legend(frameon = False, loc = 'upper right')
# Call the function plot_simulation to get the histogram for beta 0
# with the confidence interval
plot_simulation(___,___)
# Call the function plot_simulation to get the histogram for beta 1
# with the confidence interval
plot_simulation(___,___)