Title
Exercise: B.1 - Beta values for Data from Random Universe using Bootstrap
Description
Solve the previous exercise by building your own bootstrap function.
Instructions
Define a function bootstrap that takes a dataframe as the input. Use NumPy's random.randint() function to generate random integers in the range of the length of the dataset. These integers will be used as the indices to access the rows of the dataset.
Similar to the previous exercise, compute the $\beta_0$ and $\beta_1$ values for each bootstrapped dataframe.
Plot the $\beta_0$, $\beta_1$ histograms.
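The resampling step described above can be sketched on a toy dataframe. The column names, the values, and the helper name `resample_rows` are all made up for illustration and are not part of the exercise; the sketch only shows how integer indices drawn with replacement select rows by position.

```python
import numpy as np
import pandas as pd

# Toy data, invented purely for illustration
toy = pd.DataFrame({"a": [10, 20, 30, 40, 50],
                    "b": [1.0, 2.0, 3.0, 4.0, 5.0]})

def resample_rows(frame, seed=None):
    """Return a frame of the same length whose rows are drawn with replacement."""
    if seed is not None:
        np.random.seed(seed)  # optional, only to make the draw repeatable
    n = len(frame)
    # n random integers in [0, n); repeated indices are allowed
    idx = np.random.randint(0, n, size=n)
    # Positional selection, so repeated indices yield repeated rows
    return frame.iloc[idx]

sample = resample_rows(toy, seed=0)
print(sample)
```

Because the indices are drawn with replacement, some rows of `toy` typically appear more than once in `sample` while others are left out.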
Hints
To compute the beta values use the following equations:
$\beta_{0}=\bar{y}-\left(\beta_{1} * \bar{x}\right)$
$\beta_{1}=\frac{\sum(x-\bar{x}) *(y-\bar{y})}{\sum(x-\bar{x})^{2}}$
where $\bar{x}$ is the mean of $x$ and $\bar{y}$ is the mean of $y$
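As a quick sanity check of the two equations, here is a minimal sketch on toy points that lie exactly on the line $y = 1 + 2x$. The arrays are invented for illustration; with such data the formulas should recover an intercept of 1 and a slope of 2.

```python
import numpy as np

# Toy points lying exactly on y = 1 + 2x (illustrative values only)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

x_bar = x.mean()
y_bar = y.mean()

# Slope: sum of products of deviations divided by sum of squared x deviations
beta1 = np.dot(x - x_bar, y - y_bar) / np.dot(x - x_bar, x - x_bar)
# Intercept: the fitted line passes through the point (x_bar, y_bar)
beta0 = y_bar - beta1 * x_bar

print(beta0, beta1)
```

Here `np.dot` of two one-dimensional arrays computes the sum of their elementwise products, which is why it can stand in for both sums in the $\beta_1$ equation.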
np.random.randint() : Returns an array of random integers of the specified size
np.dot() : Computes the dot product of two arrays
df.iloc[] : Purely integer-location based indexing for selection by position
ax.hist() : Plots a histogram
ax.set_xlabel() : Sets label for the x-axis
ax.set_ylabel() : Sets label for the y-axis
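The plotting calls listed above can be tried on synthetic numbers first. The sketch below assumes nothing about the exercise data: `values` is a made-up sample from a normal distribution, and the bin count and label text are arbitrary choices.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in values, not related to the Advertising data
rng = np.random.default_rng(0)
values = rng.normal(loc=0.0, scale=1.0, size=500)

fig, ax = plt.subplots(figsize=(6, 4))
# hist returns the bin counts, the bin edges, and the drawn patches
counts, bin_edges, _ = ax.hist(values, bins=20)
ax.set_xlabel("value")
ax.set_ylabel("Frequency")
```

With 20 bins there are 21 bin edges, and the counts add up to the number of values passed in.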
Note: This exercise is auto-graded and you can make multiple attempts.
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from randomuniverse import RandomUniverse
%matplotlib inline
Reading the standard Advertising dataset
# Read the file "Advertising_adj.csv"
df = pd.read_csv('Advertising_adj.csv')
# Take a quick look at the data
df.head()
# Define a bootstrap function, which inputs a dataframe & outputs a bootstrapped dataframe
def bootstrap(df):
    selectionIndex = np.random.randint(___, size = ___)
    new_df = df.iloc[___]
    return new_df
# Create two empty lists to store beta values
beta0_list, beta1_list = [],[]
# For each instance of the for loop, call your bootstrap function, calculate the beta values
# Store the beta values in the appropriate list
#Choose the number of "parallel" Universes to generate the new dataset
number_of_bootstraps = 1000
for i in range(number_of_bootstraps):
    df_new = bootstrap(df)
    # x is the predictor variable given by 'tv' values
    # y is the response variable given by 'sales' values
    x = ___
    y = ___
    # Find the mean of the x values
    xmean = x.___
    # Find the mean of the y values
    ymean = y.___
    # Using equations given and discussed in lecture compute the beta0 and beta1 values
    # Hint: use np.dot to perform the multiplication operation
    beta1 = ___
    beta0 = ___
    # Append the calculated values of beta1 and beta0
    beta0_list.append(___)
    beta1_list.append(___)
### edTest(test_beta) ###
#Compute the mean of the beta0 and beta1 lists
beta0_mean = np.mean(___)
beta1_mean = np.mean(___)
# plot histogram of beta0 and beta1
fig, ax = plt.subplots(1,2, figsize=(18,8))
ax[0].___
ax[1].___
ax[0].set_xlabel(___)
ax[1].set_xlabel(___)
ax[0].set_ylabel('Frequency');
Compare the plots with the results from the RandomUniverse() function
# The below helper code will help you visualise the similarity between
# the bootstrap function you wrote & the RandomUniverse() function from last exercise
beta0_randUni, beta1_randUni = [],[]
parallelUniverses = 1000
for i in range(parallelUniverses):
    df_new = RandomUniverse(df)
    # Take x and y from this universe's dataframe so they match the means below
    x = df_new.tv
    y = df_new.sales
    xmean = x.mean()
    ymean = y.mean()
    # Using Linear Algebra result as discussed in lecture
    beta1 = np.dot((x-xmean) , (y-ymean))/((x-xmean)**2).sum()
    beta0 = ymean - beta1*xmean
    beta0_randUni.append(beta0)
    beta1_randUni.append(beta1)
# Use this helper code to plot the bootstrapped beta values & the ones from random universe
def plotmulti(list1, list2):
    fig, axes = plt.subplots(1,2, figsize = (10,4), sharey = 'row')
    axes[0].hist(list1);
    axes[0].set_xlabel('Beta Distribution')
    axes[0].set_ylabel('Frequency')
    axes[0].set_title('Bootstrap')
    axes[1].hist(list2);
    axes[1].set_xlabel('Beta Distribution')
    axes[1].set_title('Random Universe')
# Just use the 'plotmulti' function above to compare the two histograms for beta0
plotmulti(beta0_list, beta0_randUni)
#Now compare for beta1
plotmulti(beta1_list, beta1_randUni)