Title
Exercise 2 - Redefining and Scoping
Description
Breakout #2 Tasks (20+min):
- Someone else should share and take notes (whoever resides furthest from the Bahamas).
- Solidify your question(s) of interest.
- Determine the next tasks:
  - What other data do you need? How will this data be collected and combined?
  - What data cleaning and wrangling tasks are needed?
  - What other EDA is necessary? What visuals should be included?
  - What is a goal for a first baseline model (Key: should be interpretable)? Be sure to include the class of model and the variables involved.
  - What is a reasonable goal for a final model and product?
- Determine how long each task should take.
- Assign next tasks to group members. Do not actually perform these tasks!
In [95]:
import pandas as pd
import sys
import numpy as np
import sklearn as sk
import scipy as sp
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
%matplotlib inline
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import KNeighborsRegressor
import sklearn.metrics as met
from sklearn.preprocessing import PolynomialFeatures
In [137]:
movies = pd.read_csv('data/tmdb_5000_movies.csv')
credits = pd.read_csv('data/tmdb_5000_credits.csv')
movies.head()
Out[137]:
In [138]:
credits.head()
Out[138]:
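One of the breakout questions asks how additional data would be combined. The sketch below shows one way the two tables could be joined, assuming the credits file keys each row by a `movie_id` column that matches `id` in the movies file (an assumption about the CSV layout; check `credits.columns` first). The small frames are made-up stand-ins so the snippet runs on its own.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files (values are made up for illustration)
movies_demo = pd.DataFrame({"id": [1, 2, 3], "title": ["A", "B", "C"], "budget": [10, 20, 30]})
credits_demo = pd.DataFrame({"movie_id": [1, 2, 4], "cast": ["[]", "[]", "[]"]})

# Left join keeps every movie; validate checks that keys are unique on both sides,
# and indicator=True adds a _merge column showing which rows found a match
merged_demo = movies_demo.merge(
    credits_demo, left_on="id", right_on="movie_id", how="left",
    validate="one_to_one", indicator=True,
)
unmatched = int((merged_demo["_merge"] == "left_only").sum())
print(merged_demo[["id", "title", "cast", "_merge"]])
print("Movies without credits:", unmatched)
```

Counting the `left_only` rows is a quick check that the join did not silently drop or fail to match movies.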
In [139]:
print(movies.dtypes)
quants = movies.columns[(movies.dtypes == "int64") | (movies.dtypes == "float64") ].values
quants = quants[quants!='id']
In [140]:
pd.Series(np.append(quants,'year'))
Out[140]:
In [141]:
# Parse release dates, then derive calendar features with the .dt accessor
movies['release_date'] = pd.to_datetime(movies['release_date'])
movies['year'] = movies['release_date'].dt.year
movies['month'] = movies['release_date'].dt.month
movies['decade'] = (movies['year'] // 10) * 10
In [142]:
# idxmin/idxmax return index labels (skipping missing dates), so use .loc for label-based lookup
oldest = movies['release_date'].idxmin()
newest = movies['release_date'].idxmax()
print("Oldest Movie:", movies.loc[oldest, 'title'], "in", movies.loc[oldest, 'release_date'])
print("Newest Movie:", movies.loc[newest, 'title'], "in", movies.loc[newest, 'release_date'])
In [143]:
sns.pairplot(movies[np.append(quants,'year')]);
Out[143]:
In [145]:
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(15,5))
ax1.scatter(movies['budget'],movies['revenue'])
ax1.set_title("Revenue vs. Budget")
ax2.scatter(np.log10(movies['budget']+0.1),np.log10(movies['revenue']+0.1))
ax2.set_title("Revenue vs. Budget (both on log10 scale)")
plt.show()
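The `+0.1` offset in the log-scale panel keeps zero values finite, but it maps every zero to log10(0.1) = -1, so zeros appear as a band far away from the real data. A small sketch with made-up budget values illustrates this, which is one reason to inspect and trim near-zero entries before modeling:

```python
import numpy as np

# Made-up budgets: two zeros (likely missing values) and two plausible amounts
budgets_demo = np.array([0, 0, 1_000, 1_000_000])

# Adding 0.1 avoids log10(0) = -inf, but every zero lands at exactly -1
shifted = np.log10(budgets_demo + 0.1)
print(shifted)

# For large values the offset is negligible
offset_effect = shifted[3] - np.log10(budgets_demo[3])
print(f"Effect of the offset on the largest value: {offset_effect:.2e}")
```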
In [146]:
print(np.sum(movies['runtime']==0))
movies[(movies['budget']<1000) | (movies['revenue']<1000 )][['revenue','budget']]
Out[146]:
In [147]:
movies_raw = movies.copy()
In [148]:
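# Keep only plausible budgets and revenues; .copy() avoids SettingWithCopyWarning when adding columns
movies = movies[(movies['budget']>=1000) & (movies['revenue']>=1000)].copy()
movies['logbudget'] = np.log10(movies['budget'])
movies['logrevenue'] = np.log10(movies['revenue'])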
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(15,5))
ax1.scatter(movies['logbudget'],movies['logrevenue'])
ax1.set_title("Revenue vs. Budget (both on log10 scale) After Trimming")
ax2.scatter(movies['budget'],movies['revenue'])
ax2.set_title("Revenue vs. Budget After Trimming")
plt.show()
In [149]:
ols1 = LinearRegression()
ols1.fit(movies[['logbudget']],movies['logrevenue'])
print(f"Estimated Linear Regression Coefficients: Intercept = {ols1.intercept_:.4f}, Slope(s) = {ols1.coef_[0]:.4f}")
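Because both variables are on the log10 scale, the slope of this baseline model has a multiplicative reading: multiplying budget by k multiplies predicted revenue by k raised to the slope. The sketch below demonstrates that reading on synthetic data (the "true" slope of 0.9 and the noise level are made up, not estimates from the movie data) and also reports R² via `score`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for (logbudget, logrevenue); the true slope 0.9 is made up
rng = np.random.default_rng(0)
logbudget_demo = rng.uniform(3, 8, size=500)
logrevenue_demo = 1.0 + 0.9 * logbudget_demo + rng.normal(0, 0.3, size=500)

X_demo = logbudget_demo.reshape(-1, 1)
ols_demo = LinearRegression().fit(X_demo, logrevenue_demo)
slope = ols_demo.coef_[0]
r2 = ols_demo.score(X_demo, logrevenue_demo)

# In a log-log model, multiplying budget by k multiplies predicted revenue by k**slope
doubling_factor = 2 ** slope
print(f"slope = {slope:.3f}, R^2 = {r2:.3f}")
print(f"Doubling the budget multiplies predicted revenue by about {doubling_factor:.2f}")
```

The same two lines (`coef_[0]` and `2 ** slope`) could be applied to `ols1` to phrase the baseline result in plain terms.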
In [150]:
ols2 = LinearRegression()
ols2.fit(movies[['logbudget','year']],movies['logrevenue'])
print(f"Estimated Linear Regression Coefficients: Intercept = {ols2.intercept_:.3f}, Slope(s) =", np.round(ols2.coef_,5))
In [157]:
poly = PolynomialFeatures(interaction_only=True,include_bias=False)
X_interact = poly.fit_transform(movies[['logbudget','year']])
In [159]:
ols3 = LinearRegression()
ols3.fit(X_interact ,movies['logrevenue'])
print(f"Estimated Linear Regression Coefficients: Intercept = {ols3.intercept_:.3f}, Slope(s) =", np.round(ols3.coef_,4))