Key Word(s): probability


Title

Exercise: 2 - Linear Regression in Statsmodels

Description

The goal of this exercise is to use statsmodels to fit and interpret a regression model.

Roadmap

  • Read the dataset 'movies.csv' as a dataframe
  • Look at the scatterplot to predict revenue from budget costs, and investigate the appropriateness of linear regression.
  • Estimate the linear regression model using both sklearn and statsmodels, and compare the results.
  • Compare the coefficients of the multiple regression models (with additive effects and interactive effects) with those of the simple linear regression model.
  • Investigate the appropriateness of the linear regression assumptions that the probabilisitic model implies.
  • Bonus Question: calculate the log-likelihood for a linear regression model!

Hints

statmodels.OLS() hyperlinked : Fir an ordinary least squares (OLS) regression model using statsmodels.

statmodels.add_constant() hyperlinked : Adds a column of ones to the design matrix X.

LinearRegression() hyperlinked : Returns a linear regression object from the sklearn library.

LinearRegression().coef_ hyperlinked : This attribute returns the coefficient(s) of the linear regression object

sklearn.fit() hyperlinked : Fit linear model

Note: This exercise is auto-graded and you can try multiple attempts.

In [ ]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
In [ ]:
# Read the data from 'movies.csv' to a dataframe
movies = pd.read_csv('movies.csv')
In [ ]:
#Take a peak at the dataset
movies.head()

(a) Plot the scatterplot to predict 'domestic' revenue from the 'budget' cost of the movie.

In [ ]:
# create the scatterplot to predict 'domestic'
# from 'budget'
plt.scatter(___,___,marker='.')
plt.xlabel('Budget (millions $)')
plt.ylabel('Domestic Gross (millions $)')
plt.title('Revenue vs. Budget for Major Domestic Movies since 1980')
plt.show()

Question: What stands out in the plots above?Does linear regression seems appropriate based on this scatterplot?

(b) Use sklearn to get linear regression estimates.

In [ ]:
### edTest(test_sklearn_regress) ###

from sklearn.linear_model import LinearRegression
regress = LinearRegression().fit(
    ___,___)
print(regress.coef_,regress.intercept_)

(c) Fit a linear regression model using OLS from statsmodels instead, and compare.

In [ ]:
import statsmodels.api as sm

# you have to first create the X matrix with the 
# intercept included, then fit the model
X = sm.add_constant(movies['budget'])
ols1 = sm.OLS(movies['domestic'],X).fit()

ols1.summary()

Question: When would it be prefered to fit a regression model without an intercept?

(d) Fit a second OLS model (OLS2) with 'budget' and 'year' as predictors and compare to OLS1.

In [ ]:
### edTest(test_ols2) ###


X = sm.add_constant(___)
ols2 = sm.OLS(movies['domestic'],X).fit()

ols2.summary()

Question: How does the coefficient estimate for budget compare in this multiple regression to the corresponding estimate in the simple regression model? Why is that the case?

(e) Fit a model with the interaction term between budget and year (first need to define it) and the 'main effects' of the 2 predictors, and interpret the results.

In [ ]:
### edTest(test_interaction) ###

#create the interaction term
interaction =___*___
movies['interaction'] = interaction

# define the X matrix
X = ___

#fit the model 
ols3 = sm.OLS(___,___).fit()

ols3.summary()

Question: How have the estimates changed in this model compared to the earlier ones (especially for budget)? Why is this the case?

(e) Investigate the assumptions to this linear regression model (OLS3) using the plots below.

In [ ]:
# define predicted values (yhat) and residuals
yhat = ols3.predict()
resid = ___ - ___

#plot the histogram of the residuals
plt.___(resid,bins=20)
plt.show()
In [ ]:
# residual scatterplot 
plt.scatter(___,___,marker='.')
plt.hlines(0,xmin=0,xmax=500,color='red')
plt.show()

Question: What stands out in the plots above?

Bonus Question: Confirm the log-likelihood evaluation for OLS1 (just plug in your estimates).

In [ ]:
### use ols1.params,ols1.mse_resid, and norm.logpdf as your basis
from scipy.stats import norm