Title

Exercise 1 - Exploration, Wrangling, and Defining a Question

Description

Breakout #1 Tasks (15-20min):

  1. Someone share (the person who resides closest to the Bahamas…thanks Columbus). Someone different will share in the next breakout.
  2. Explore the data (some of that is done for you with code). Please do a little more exploration.
  3. Come up with an interesting question or two you can answer with this data set. Come up with a question or two that can be answered with supplemental data:
    • start with the ideal, and then get more practical based on what is likely available.

CS-109A Introduction to Data Science

Lecture 16: Review and a Preview

Harvard University
Fall 2020


In [ ]:
import pandas as pd
import sys
import numpy as np
import sklearn as sk
import scipy as sp
import matplotlib.pyplot as plt
import seaborn as sns
import datetime
%matplotlib inline

from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.neighbors import KNeighborsRegressor
import sklearn.metrics as met

from sklearn.preprocessing import PolynomialFeatures
In [ ]:
movies = pd.read_csv('tmdb_5000_movies.csv')
credits = pd.read_csv('tmdb_5000_credits.csv')

movies.head()
In [ ]:
credits.head()
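
Since the two files describe the same movies, a natural next exploration step is to join them. The sketch below uses small made-up tables rather than the real files, and it assumes the credits table keys movies by a column named `movie_id` while the movies table uses `id`; check the actual column names in the `head()` output before adapting it.

```python
import pandas as pd

# Hypothetical miniature tables; the column names `movie_id` and `cast`
# are assumptions about the credits file, not confirmed by the notebook.
movies_toy = pd.DataFrame({"id": [1, 2, 3], "title": ["A", "B", "C"]})
credits_toy = pd.DataFrame({"movie_id": [1, 2, 4], "cast": ["[]", "[]", "[]"]})

# A left join keeps every movie; indicator=True adds a `_merge` column
# recording whether a matching credits row was found.
merged = movies_toy.merge(
    credits_toy, left_on="id", right_on="movie_id", how="left", indicator=True
)

# Titles with no matching credits row.
unmatched = merged.loc[merged["_merge"] == "left_only", "title"].tolist()
print(unmatched)
```

Checking the unmatched rows before analysis helps catch key mismatches early.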
In [ ]:
print(movies.dtypes)

quants = movies.columns[(movies.dtypes == "int64") | (movies.dtypes == "float64") ].values
quants = quants[quants!='id']
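
As an alternative sketch, `DataFrame.select_dtypes` can select the quantitative columns without listing dtypes by name, which also picks up numeric types other than `int64` and `float64`. The toy frame below is hypothetical and only stands in for `movies`.

```python
import pandas as pd

# Hypothetical toy frame standing in for `movies` (values are made up).
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "budget": [100, 0, 250],
    "popularity": [1.5, 2.0, 0.3],
    "title": ["A", "B", "C"],
})

# Keep every numeric column, then drop the identifier column.
toy_quants = toy.select_dtypes(include="number").columns.drop("id").to_numpy()
print(toy_quants)
```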
In [ ]:
pd.Series(np.append(quants,'year'))
In [ ]:
# Parse the release date, then derive calendar features with the .dt accessor
movies['release_date'] = pd.to_datetime(movies['release_date'])
movies['year'] = movies['release_date'].dt.year
movies['month'] = movies['release_date'].dt.month
# Floor the year to the start of its decade (e.g. 1994 -> 1990)
movies['decade'] = (movies['year'] // 10) * 10
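
The date features above can be checked on a few made-up dates. This minimal sketch uses hypothetical values, one of them missing, to show that a missing release date carries through to the year and decade as NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical release dates, including one missing value.
dates = pd.to_datetime(pd.Series(["1994-09-23", "2009-12-10", None]))

year = dates.dt.year          # NaN where the date is missing
decade = (year // 10) * 10    # floor to the start of the decade

print(decade.tolist())
```

Because of the missing value, the derived columns are stored as floats, which is worth remembering before grouping or plotting by decade.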
In [ ]:
# idxmin/idxmax return index labels (skipping missing dates), matching the label-based lookups below
oldest = movies['release_date'].idxmin()
newest = movies['release_date'].idxmax()

print("Oldest Movie:", movies.loc[oldest, 'title'], "in", movies.loc[oldest, 'release_date'])
print("Newest Movie:", movies.loc[newest, 'title'], "in", movies.loc[newest, 'release_date'])
In [ ]:
sns.pairplot(movies[np.append(quants,'year')]);
In [ ]:
movies_raw = movies.copy()

Breakout 1 Tasks (15-20min):

  1. Someone share (the person who resides closest to the Bahamas…thanks Columbus). Someone different will share in the next breakout.
  2. Explore the data (some of that is done for you above). Please do a little more exploration and wrangling.
  3. Come up with an interesting question or two you can answer with this data set. Come up with a question or two that can be answered with supplemental data:
    • start with the ideal, and then get more practical based on what is likely available.
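
One possible starting point for the extra exploration and wrangling is sketched below on made-up rows. It counts missing values and then treats a `budget` or `revenue` of 0 as "not recorded". That interpretation is an assumption that should be verified against the real data, and the column names and values here are only illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows standing in for `movies` (values are made up).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "budget": [100, 0, 250, 0],
    "revenue": [300, 50, 0, 0],
    "runtime": [90.0, np.nan, 120.0, 101.0],
})

# Count explicit missing values per column.
missing_counts = toy.isna().sum()

# ASSUMPTION: a budget or revenue of 0 means "not recorded", not a true zero.
money_cols = ["budget", "revenue"]
zero_counts = (toy[money_cols] == 0).sum()

cleaned = toy.copy()
cleaned[money_cols] = cleaned[money_cols].replace(0, np.nan)

# Rows usable for a budget-versus-revenue question.
usable = cleaned.dropna(subset=money_cols)
print(missing_counts, zero_counts, len(usable), sep="\n")
```

Comparing `len(usable)` with the full row count shows how much data a given question would actually rest on, which helps when moving from the ideal question to a practical one.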
In [ ]: