Title¶

MCMC from Scratch for Linear Regression

Description :¶

The aim of this exercise is to perform Monte Carlo Markov Chain (MCMC) from scratch for linear regression. For this, we will be using our old friend the Advertising dataset.

On completing the exercise you should be able to see the following distribution. One for each of the beta value:

Instructions:¶

Read the data file Advertising.csv and set the predictor and response variables.
Fit a linear regression model on the advertising data and take a look at the beta values.
Create 2 lists to store the beta values and initialize the beta values.
Define a function get_prior to compute the prior value given the beta values.
Compute the likelihood, prior and posterior for the initial beta values.
For a selected number of sampling "epochs":
- Compute new beta values
- Compute the corresponding likelihood, prior and posterior.
- Compute the exponential ratio of the current and previous posterior.
- Based on the ratio, select or reject the new beta values.
Choose a burn rate.
Plot the histogram of the beta values.

Hints:¶

np.log() Computes the natural logarithm, element-wise.

np.exp() Calculates the exponential of all elements in the input array.

LinearRegression() Initiates an ordinary least squares Linear Regression.

.fit() Fits the linear model to the data.

model.coef_ Estimated coefficients for the linear regression problem

model.intercept_ Independent term in the linear model.

np.random.normal() Draw random samples from a normal (Gaussian) distribution.

norm.pdf() A normal continuous random variable.

np.sum() Sum of array elements over a given axis.

np.random.uniform() Draw samples from a uniform distribution.

In [0]:

# Import necessary libraries
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
%matplotlib inline
from scipy import stats
from scipy.stats import norm

In [0]:

# Read the data file 'Advertising.csv'
df = pd.read_csv("Advertising.csv")

# Use the column "tv" as the predictor
x = df[['tv']]

# Use the column "sales" as the response
y = df.sales.values

In [0]:

# Take a quick look at the data
df.head()

In [0]:

# Initiate a Linear Regression model
model = ___

# Fit the model on the predictor and response data
___

In [0]:

# Take a quick look at the beta values got after model fitting
# Use the model.intercept_ and model.coef_ for this
b0 = ___
b1 = ___
print("Beta0 is",b0)
print("Beta1 is", b1)

In [0]:

# Helper code to plot the true and predicted data
plt.plot(x,y,'o', label="True Data", color='darkblue')
plt.plot(x,model.predict(df[['tv']]), label="Prediction", color='coral')
plt.xlabel("TV")
plt.ylabel("Sales")
plt.legend()

In [0]:

# Define 2 empty lists to store the accepted beta values in a list 
beta0_list = []
beta1_list = []

In [0]:

# Initialize beta0 to a resonable value based on the model parameter seen above
beta0 = ___

# Initialize beta1 to a resonable value based on the model parameter seen above
beta1 = ___

In [0]:

# Function to get the prior given the beta0 and beta1 values
# NOTE - All the computations are done in the log space so that the numbers are managable.
def get_log_prior(beta0,beta1):
    
    # The prior of beta0 is a value from a normal PDF of beta0 with mean as 100 and standard deviation as 50
    # Take the log of these value  
    log_prior_b0 = ___
    
    # The prior of beta1 is a value from a normal PDF of beta1 with mean as 1 and standard deviation as 1
    # Take the log of this value 
    log_prior_b1 = ___
    
    # Compute the prior as the sum of the log priors of beta0 and beta1
    log_prior = ___
  
    # Return the prior value
    return log_prior

In [0]:

# Compute the log-likelihood for the initial beta values 
# pay attention to the dimensions of y and x. 
log_likelihood = - np.sum( np.log( (y.reshape(-1,1)  - np.array( beta1*x  + beta0))**2) )

# Get the prior of the intial beta values by calling the get_log_prior function
log_prior = ___

# Compute the log posterior of the initial beta values
# The posterior is the sum of the log_likelihood and log_prior
log_posterior = ___

In [0]:

# Save the initial posterior value as prev_posterior for comparision later
prev_logposterior = log_posterior

# Append the initial beta values i.e. beta0 and beta1 to the list
beta0_list.append(beta0)
beta1_list.append(beta1)

In [0]:

# Specify the number of sampling "epochs" (less than 500k)
epochs = ___

⏸ How does the number of samples generated affect results of MCMC?¶

A. As the number of samples are increased the beta values chosen grow increasing random.¶

B. For a smaller number of samples the beta values are closer to the true value with reduced randomness.¶

C. The number of samples does not affect the beta values, it only depends on the prior.¶

D. As the number of samples increase, the beta values slowly converge to their true values.¶

In [0]:

### edTest(test_chow1) ###
# Submit an answer choice as a string below (eg. if you choose option C, put 'C')
answer1 = '___'

In [0]:

# Loop over the range of sampling "epochs"
for i in range(epochs):

    # Get a new beta1 value with mean as the latest element beta1 and scale as 0.1
    beta0 = ___

    # Get a new beta0 value with mean as the latest element beta0 and scale as 0.5
    beta1 = ___

    # Get the prior values for the new beta values by calling the get_log_prior function
    log_prior = ___

    # Compute P(data|w) i.e. the log-likelihood for all the data points
    log_likelihood = ___

    # To compute the posterior given the likelihood and prior
    # The posterior is the sum of the likelihood and prior
    log_posterior = ___

    # Compute the the exponential of the ratio of the posterior given its previous value
    # Since it is the log, the ratio is computed as the difference between the values
    exp_ratio = ___

    # If the ratio is greater than 1 then accept the new beta values in this case
    if exp_ratio>1:

        # Append the beta0 and beta1 to the beta list values
        beta0_list.append(beta0)
        beta1_list.append(beta1)

        # Save the accepted posterior as the previous posterior
        prev_logposterior = log_posterior

    # If the ratio is less than 1 then get a random value between 0 and 1
    else:
        coin = ___

        # Set a threshold value
        threshold = ___
        
        # Check if the random value is higher than the threshold
        # Append the beta values to the list and update the previous posterior
        if coin > threshold:
            beta0_list.append(beta0)
            beta1_list.append(beta1)
            prev_logposterior = log_posterior

⏸ If the threshold is set to a higher value, new beta values are rejected more often if they do not improve the convergence to the true value¶

The statement above is:¶

A. True for all cases¶

B. False for all cases¶

C. True only when the number of samples is less¶

D. True only when prior is extremely far from the real value¶

In [0]:

### edTest(test_chow2) ###
# Submit an answer choice as a string below (eg. if you choose option C, put 'C')
answer2 = '___'

In [0]:

# The number of data points to consider after the beta list has been populated
burn_rate = int(len(beta0_list)*0.3)

In [0]:

### edTest(test_chow3) ###
# Check posterior mean for beta0 and beta1 
print(np.mean(beta0_list[burn_rate:]), np.mean(beta1_list[burn_rate:]))

In [0]:

# Helper code to plot the histogram of the beta values

# Plot the histogram of the beta0 values
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))
ax1.hist(beta0_list[burn_rate:], color='#B2D7D0',edgecolor="black", linewidth=1)
ax1.set_xlabel("BETA 0", fontsize=14)
ax1.set_ylabel("FREQUENCY", fontsize=14);

# Plot the histogram of the beta1 values
ax2.hist(beta1_list[burn_rate:], color='#EFAEA4',edgecolor="black", linewidth=1)
ax2.set_xlabel("BETA 1", fontsize=14)
ax2.set_ylabel("FREQUENCY", fontsize=14);