CS-109A Introduction to Data Science

Lab 6: Logistic Regression

Harvard University
Fall 2019
Instructors: Pavlos Protopapas, Kevin Rader, Chris Tanner
Lab Instructors: Chris Tanner and Eleni Kaxiras.
Contributors: Chris Tanner


In [1]:
## RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)
Out[1]:

Learning Goals

In this lab, we'll explore different models used to predict which of several labels applies to a new datapoint based on labels observed in the training data.

By the end of this lab, you should:

  • Be familiar with the sklearn implementations of
    • Linear Regression
    • Logistic Regression
  • Be able to make an informed choice of model based on the data at hand
  • (Bonus) Structure your sklearn code into Pipelines to make building, fitting, and tracking your models easier
In [2]:
# IMPORTS GALORE
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix

import statsmodels.api as sm
from statsmodels.api import OLS

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_curve
from sklearn.metrics import auc

pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)

Part 1: The AirBnB NYC 2019 Dataset + EDA

The dataset contains information about AirBnB listings in NYC from 2019. There are ~49k listings (a single host can have several) and 16 features for each:

  • id: listing ID
  • name: name of the listing
  • host_id: host ID
  • host_name: name of the host
  • neighbourhood_group: NYC borough
  • neighbourhood: neighborhood
  • latitude: latitude coordinates
  • longitude: longitude coordinates
  • room_type: listing space type (e.g., private room, entire home)
  • price: price in dollars per night
  • minimum_nights: number of min. nights required for booking
  • number_of_reviews: number of reviews
  • last_review: date of the last review
  • reviews_per_month: number of reviews per month
  • calculated_host_listings_count: number of listings the host has
  • availability_365: number of days the listing is available for booking

Our goal is to classify unseen housing units as 'affordable' or 'unaffordable' by using their features. We will assume that this task is for a particular client who has a specific budget and would like to simplify the problem by classifying any unit that costs < $150 per night as 'affordable' and any unit that costs $150 or greater as 'unaffordable'.

For this task, we will exercise our normal data science pipeline -- from EDA to modelling and visualization. In particular, we will show the performance of 3 classifiers:

  • Maximum Likelihood Estimate (MLE)
  • Linear Regression
  • Logistic Regression
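
As a rough illustration of the first classifier in the list: one common reading of an MLE baseline (an assumption here, since the lab has not defined it at this point) is to estimate the probability of 'affordable' as its observed frequency in the training labels and then always predict the more likely class. The labels below are made up for the sketch.

```python
import numpy as np

# Hypothetical labels (1 = affordable, 0 = unaffordable); not the lab's data.
toy_y_train = np.array([1, 1, 0, 1, 0, 1, 1, 0])
toy_y_dev = np.array([1, 0, 1, 1])

# For a Bernoulli outcome, the maximum likelihood estimate of p is the sample mean.
p_hat = toy_y_train.mean()

# The baseline predicts whichever class is more likely under p_hat, for every unit.
majority_class = int(p_hat >= 0.5)
baseline_preds = np.full_like(toy_y_dev, majority_class)

baseline_accuracy = (baseline_preds == toy_y_dev).mean()
print("p_hat:", p_hat, "accuracy:", baseline_accuracy)
```

Any model that uses the features should beat this accuracy to be worth its added complexity.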

Let's get started! And awaaaaay we go!

Read-in and checking

We do the usual read-in and verification of the data:

In [3]:
df = pd.read_csv("../data/nyc_airbnb.csv") #, index_col=0)
df.head()
Out[3]:
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365
0 2539 Clean & quiet apt home by the park 2787 John Brooklyn Kensington 40.64749 -73.97237 Private room 149 1 9 2018-10-19 0.21 6 365
1 2595 Skylit Midtown Castle 2845 Jennifer Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 1 45 2019-05-21 0.38 2 355
2 3647 THE VILLAGE OF HARLEM....NEW YORK ! 4632 Elisabeth Manhattan Harlem 40.80902 -73.94190 Private room 150 3 0 NaN NaN 1 365
3 3831 Cozy Entire Floor of Brownstone 4869 LisaRoxanne Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 1 270 2019-07-05 4.64 1 194
4 5022 Entire Apt: Spacious Studio/Loft by central park 7192 Laura Manhattan East Harlem 40.79851 -73.94399 Entire home/apt 80 10 9 2018-11-19 0.10 1 0

Building the training/dev/testing data

As usual, we split the data before we begin our analysis. It would be unfair to cheat by looking at the testing data. Let's divide the data into 60% training, 20% development (aka validation), and 20% testing. However, before we split the data, let's make a simple transformation and convert the prices into two categories: affordable or not.

In [4]:
df['affordable'] = np.where(df['price'] < 150, 1, 0)
df
Out[4]:
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 affordable
0 2539 Clean & quiet apt home by the park 2787 John Brooklyn Kensington 40.64749 -73.97237 Private room 149 1 9 2018-10-19 0.21 6 365 1
1 2595 Skylit Midtown Castle 2845 Jennifer Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 1 45 2019-05-21 0.38 2 355 0
2 3647 THE VILLAGE OF HARLEM....NEW YORK ! 4632 Elisabeth Manhattan Harlem 40.80902 -73.94190 Private room 150 3 0 NaN NaN 1 365 0
3 3831 Cozy Entire Floor of Brownstone 4869 LisaRoxanne Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 1 270 2019-07-05 4.64 1 194 1
4 5022 Entire Apt: Spacious Studio/Loft by central park 7192 Laura Manhattan East Harlem 40.79851 -73.94399 Entire home/apt 80 10 9 2018-11-19 0.10 1 0 1
5 5099 Large Cozy 1 BR Apartment In Midtown East 7322 Chris Manhattan Murray Hill 40.74767 -73.97500 Entire home/apt 200 3 74 2019-06-22 0.59 1 129 0
6 5121 BlissArtsSpace! 7356 Garon Brooklyn Bedford-Stuyvesant 40.68688 -73.95596 Private room 60 45 49 2017-10-05 0.40 1 0 1
7 5178 Large Furnished Room Near B'way 8967 Shunichi Manhattan Hell's Kitchen 40.76489 -73.98493 Private room 79 2 430 2019-06-24 3.47 1 220 1
8 5203 Cozy Clean Guest Room - Family Apt 7490 MaryEllen Manhattan Upper West Side 40.80178 -73.96723 Private room 79 2 118 2017-07-21 0.99 1 0 1
9 5238 Cute & Cozy Lower East Side 1 bdrm 7549 Ben Manhattan Chinatown 40.71344 -73.99037 Entire home/apt 150 1 160 2019-06-09 1.33 4 188 0
10 5295 Beautiful 1br on Upper West Side 7702 Lena Manhattan Upper West Side 40.80316 -73.96545 Entire home/apt 135 5 53 2019-06-22 0.43 1 6 1
11 5441 Central Manhattan/near Broadway 7989 Kate Manhattan Hell's Kitchen 40.76076 -73.98867 Private room 85 2 188 2019-06-23 1.50 1 39 1
12 5803 Lovely Room 1, Garden, Best Area, Legal rental 9744 Laurie Brooklyn South Slope 40.66829 -73.98779 Private room 89 4 167 2019-06-24 1.34 3 314 1
13 6021 Wonderful Guest Bedroom in Manhattan for SINGLES 11528 Claudio Manhattan Upper West Side 40.79826 -73.96113 Private room 85 2 113 2019-07-05 0.91 1 333 1
14 6090 West Village Nest - Superhost 11975 Alina Manhattan West Village 40.73530 -74.00525 Entire home/apt 120 90 27 2018-10-31 0.22 1 0 1
15 6848 Only 2 stops to Manhattan studio 15991 Allen & Irina Brooklyn Williamsburg 40.70837 -73.95352 Entire home/apt 140 2 148 2019-06-29 1.20 1 46 1
16 7097 Perfect for Your Parents + Garden 17571 Jane Brooklyn Fort Greene 40.69169 -73.97185 Entire home/apt 215 2 198 2019-06-28 1.72 1 321 0
17 7322 Chelsea Perfect 18946 Doti Manhattan Chelsea 40.74192 -73.99501 Private room 140 1 260 2019-07-01 2.12 1 12 1
18 7726 Hip Historic Brownstone Apartment with Backyard 20950 Adam And Charity Brooklyn Crown Heights 40.67592 -73.94694 Entire home/apt 99 3 53 2019-06-22 4.44 1 21 1
19 7750 Huge 2 BR Upper East Cental Park 17985 Sing Manhattan East Harlem 40.79685 -73.94872 Entire home/apt 190 7 0 NaN NaN 2 249 0
20 7801 Sweet and Spacious Brooklyn Loft 21207 Chaya Brooklyn Williamsburg 40.71842 -73.95718 Entire home/apt 299 3 9 2011-12-28 0.07 1 0 0
21 8024 CBG CtyBGd HelpsHaiti rm#1:1-4 22486 Lisel Brooklyn Park Slope 40.68069 -73.97706 Private room 130 2 130 2019-07-01 1.09 6 347 1
22 8025 CBG Helps Haiti Room#2.5 22486 Lisel Brooklyn Park Slope 40.67989 -73.97798 Private room 80 1 39 2019-01-01 0.37 6 364 1
23 8110 CBG Helps Haiti Rm #2 22486 Lisel Brooklyn Park Slope 40.68001 -73.97865 Private room 110 2 71 2019-07-02 0.61 6 304 1
24 8490 MAISON DES SIRENES1,bohemian apartment 25183 Nathalie Brooklyn Bedford-Stuyvesant 40.68371 -73.94028 Entire home/apt 120 2 88 2019-06-19 0.73 2 233 1
25 8505 Sunny Bedroom Across Prospect Park 25326 Gregory Brooklyn Windsor Terrace 40.65599 -73.97519 Private room 60 1 19 2019-06-23 1.37 2 85 1
26 8700 Magnifique Suite au N de Manhattan - vue Cloitres 26394 Claude & Sophie Manhattan Inwood 40.86754 -73.92639 Private room 80 4 0 NaN NaN 1 0 1
27 9357 Midtown Pied-a-terre 30193 Tommi Manhattan Hell's Kitchen 40.76715 -73.98533 Entire home/apt 150 10 58 2017-08-13 0.49 1 75 0
28 9518 SPACIOUS, LOVELY FURNISHED MANHATTAN BEDROOM 31374 Shon Manhattan Inwood 40.86482 -73.92106 Private room 44 3 108 2019-06-15 1.11 3 311 1
29 9657 Modern 1 BR / NYC / EAST VILLAGE 21904 Dana Manhattan East Village 40.72920 -73.98542 Entire home/apt 180 14 29 2019-04-19 0.24 1 67 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
48865 36472171 1 bedroom in sunlit apartment 99144947 Brenda Manhattan Inwood 40.86845 -73.92449 Private room 80 1 0 NaN NaN 1 79 1
48866 36472710 CozyHideAway Suite 274225617 Alberth Queens Briarwood 40.70786 -73.81448 Entire home/apt 58 1 0 NaN NaN 1 159 1
48867 36473044 The place you were dreaming for.(only for guys) 261338177 Diana Brooklyn Gravesend 40.59080 -73.97116 Shared room 25 1 0 NaN NaN 6 338 1
48868 36473253 Heaven for you(only for guy) 261338177 Diana Brooklyn Gravesend 40.59118 -73.97119 Shared room 25 7 0 NaN NaN 6 365 1
48869 36474023 Cozy, Sunny Brooklyn Escape 1550580 Julia Brooklyn Bedford-Stuyvesant 40.68759 -73.95705 Private room 45 4 0 NaN NaN 1 7 1
48870 36474911 Cozy, clean Williamsburg 1- bedroom apartment 1273444 Tanja Brooklyn Williamsburg 40.71197 -73.94946 Entire home/apt 99 4 0 NaN NaN 1 22 1
48871 36475746 A LARGE ROOM - 1 MONTH MINIMUM - WASHER&DRYER 144008701 Ozzy Ciao Manhattan Harlem 40.82233 -73.94687 Private room 35 29 0 NaN NaN 2 31 1
48872 36476675 Nycity-MyHome 8636072 Ben Manhattan Hell's Kitchen 40.76236 -73.99255 Entire home/apt 260 3 0 NaN NaN 1 9 0
48873 36477307 Brooklyn paradise 241945355 Clement & Rose Brooklyn Flatlands 40.63116 -73.92616 Entire home/apt 170 1 0 NaN NaN 2 363 0
48874 36477588 Short Term Rental in East Harlem 214535893 Jeffrey Manhattan East Harlem 40.79760 -73.93947 Private room 50 7 0 NaN NaN 1 22 1
48875 36478343 Welcome all as family 274273284 Anastasia Manhattan East Harlem 40.78749 -73.94749 Private room 140 1 0 NaN NaN 1 180 1
48876 36478357 Cozy, Air-Conditioned Private Bedroom in Harlem 177932088 Joseph Manhattan Harlem 40.80953 -73.95410 Private room 60 1 0 NaN NaN 1 26 1
48877 36479230 Studio sized room with beautiful light 65767720 Melanie Brooklyn Bushwick 40.70418 -73.91471 Private room 42 7 0 NaN NaN 1 16 1
48878 36479723 Room for rest 41326856 Jeerathinan Queens Elmhurst 40.74477 -73.87727 Private room 45 1 0 NaN NaN 5 172 1
48879 36480292 Gorgeous 1.5 Bdr with a private yard- Williams... 540335 Lee Brooklyn Williamsburg 40.71728 -73.94394 Entire home/apt 120 20 0 NaN NaN 1 22 1
48880 36481315 The Raccoon Artist Studio in Williamsburg New ... 208514239 Melki Brooklyn Williamsburg 40.71232 -73.94220 Entire home/apt 120 1 0 NaN NaN 3 365 1
48881 36481615 Peaceful space in Greenpoint, BK 274298453 Adrien Brooklyn Greenpoint 40.72585 -73.94001 Private room 54 6 0 NaN NaN 1 15 1
48882 36482231 Bushwick _ Myrtle-Wyckoff 66058896 Luisa Brooklyn Bushwick 40.69652 -73.91079 Private room 40 20 0 NaN NaN 1 31 1
48883 36482416 Sunny Bedroom NYC! Walking to Central Park!! 131529729 Kendall Manhattan East Harlem 40.79755 -73.93614 Private room 75 2 0 NaN NaN 2 364 1
48884 36482783 Brooklyn Oasis in the heart of Williamsburg 274307600 Jonathan Brooklyn Williamsburg 40.71790 -73.96238 Private room 190 7 0 NaN NaN 1 341 0
48885 36482809 Stunning Bedroom NYC! Walking to Central Park!! 131529729 Kendall Manhattan East Harlem 40.79633 -73.93605 Private room 75 2 0 NaN NaN 2 353 1
48886 36483010 Comfy 1 Bedroom in Midtown East 274311461 Scott Manhattan Midtown 40.75561 -73.96723 Entire home/apt 200 6 0 NaN NaN 1 176 0
48887 36483152 Garden Jewel Apartment in Williamsburg New York 208514239 Melki Brooklyn Williamsburg 40.71232 -73.94220 Entire home/apt 170 1 0 NaN NaN 3 365 0
48888 36484087 Spacious Room w/ Private Rooftop, Central loca... 274321313 Kat Manhattan Hell's Kitchen 40.76392 -73.99183 Private room 125 4 0 NaN NaN 1 31 1
48889 36484363 QUIT PRIVATE HOUSE 107716952 Michael Queens Jamaica 40.69137 -73.80844 Private room 65 1 0 NaN NaN 2 163 1
48890 36484665 Charming one bedroom - newly renovated rowhouse 8232441 Sabrina Brooklyn Bedford-Stuyvesant 40.67853 -73.94995 Private room 70 2 0 NaN NaN 2 9 1
48891 36485057 Affordable room in Bushwick/East Williamsburg 6570630 Marisol Brooklyn Bushwick 40.70184 -73.93317 Private room 40 4 0 NaN NaN 2 36 1
48892 36485431 Sunny Studio at Historical Neighborhood 23492952 Ilgar & Aysel Manhattan Harlem 40.81475 -73.94867 Entire home/apt 115 10 0 NaN NaN 1 27 1
48893 36485609 43rd St. Time Square-cozy single bed 30985759 Taz Manhattan Hell's Kitchen 40.75751 -73.99112 Shared room 55 1 0 NaN NaN 6 2 1
48894 36487245 Trendy duplex in the very heart of Hell's Kitchen 68119814 Christophe Manhattan Hell's Kitchen 40.76404 -73.98933 Private room 90 7 0 NaN NaN 1 23 1

48895 rows × 17 columns

NOTE: The affordable column now has a value of 1 whenever the price is < 150, and 0 otherwise.
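
A quick way to sanity-check a thresholded label like this is to look at the class balance. The sketch below uses a few made-up prices rather than the lab's df, and the name toy is hypothetical.

```python
import numpy as np
import pandas as pd

# Toy prices, made up for illustration; the lab applies the same rule to df['price'].
toy = pd.DataFrame({"price": [149, 225, 150, 89, 80, 200]})
toy["affordable"] = np.where(toy["price"] < 150, 1, 0)

# Fraction of rows in each class; a price of exactly 150 counts as unaffordable.
class_balance = toy["affordable"].value_counts(normalize=True)
print(class_balance)
```

Knowing the balance up front matters later, because accuracy is only meaningful relative to the share of the majority class.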

Also, the feature named neighbourhood_group can be easily confused with neighbourhood, so let's go ahead and rename it to borough, as that is more distinct:

In [5]:
df.rename(columns={"neighbourhood_group": "borough"}, inplace=True)
df
Out[5]:
id name host_id host_name borough neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 affordable
0 2539 Clean & quiet apt home by the park 2787 John Brooklyn Kensington 40.64749 -73.97237 Private room 149 1 9 2018-10-19 0.21 6 365 1
1 2595 Skylit Midtown Castle 2845 Jennifer Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 1 45 2019-05-21 0.38 2 355 0
2 3647 THE VILLAGE OF HARLEM....NEW YORK ! 4632 Elisabeth Manhattan Harlem 40.80902 -73.94190 Private room 150 3 0 NaN NaN 1 365 0
3 3831 Cozy Entire Floor of Brownstone 4869 LisaRoxanne Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 1 270 2019-07-05 4.64 1 194 1
4 5022 Entire Apt: Spacious Studio/Loft by central park 7192 Laura Manhattan East Harlem 40.79851 -73.94399 Entire home/apt 80 10 9 2018-11-19 0.10 1 0 1
5 5099 Large Cozy 1 BR Apartment In Midtown East 7322 Chris Manhattan Murray Hill 40.74767 -73.97500 Entire home/apt 200 3 74 2019-06-22 0.59 1 129 0
6 5121 BlissArtsSpace! 7356 Garon Brooklyn Bedford-Stuyvesant 40.68688 -73.95596 Private room 60 45 49 2017-10-05 0.40 1 0 1
7 5178 Large Furnished Room Near B'way 8967 Shunichi Manhattan Hell's Kitchen 40.76489 -73.98493 Private room 79 2 430 2019-06-24 3.47 1 220 1
8 5203 Cozy Clean Guest Room - Family Apt 7490 MaryEllen Manhattan Upper West Side 40.80178 -73.96723 Private room 79 2 118 2017-07-21 0.99 1 0 1
9 5238 Cute & Cozy Lower East Side 1 bdrm 7549 Ben Manhattan Chinatown 40.71344 -73.99037 Entire home/apt 150 1 160 2019-06-09 1.33 4 188 0
10 5295 Beautiful 1br on Upper West Side 7702 Lena Manhattan Upper West Side 40.80316 -73.96545 Entire home/apt 135 5 53 2019-06-22 0.43 1 6 1
11 5441 Central Manhattan/near Broadway 7989 Kate Manhattan Hell's Kitchen 40.76076 -73.98867 Private room 85 2 188 2019-06-23 1.50 1 39 1
12 5803 Lovely Room 1, Garden, Best Area, Legal rental 9744 Laurie Brooklyn South Slope 40.66829 -73.98779 Private room 89 4 167 2019-06-24 1.34 3 314 1
13 6021 Wonderful Guest Bedroom in Manhattan for SINGLES 11528 Claudio Manhattan Upper West Side 40.79826 -73.96113 Private room 85 2 113 2019-07-05 0.91 1 333 1
14 6090 West Village Nest - Superhost 11975 Alina Manhattan West Village 40.73530 -74.00525 Entire home/apt 120 90 27 2018-10-31 0.22 1 0 1
15 6848 Only 2 stops to Manhattan studio 15991 Allen & Irina Brooklyn Williamsburg 40.70837 -73.95352 Entire home/apt 140 2 148 2019-06-29 1.20 1 46 1
16 7097 Perfect for Your Parents + Garden 17571 Jane Brooklyn Fort Greene 40.69169 -73.97185 Entire home/apt 215 2 198 2019-06-28 1.72 1 321 0
17 7322 Chelsea Perfect 18946 Doti Manhattan Chelsea 40.74192 -73.99501 Private room 140 1 260 2019-07-01 2.12 1 12 1
18 7726 Hip Historic Brownstone Apartment with Backyard 20950 Adam And Charity Brooklyn Crown Heights 40.67592 -73.94694 Entire home/apt 99 3 53 2019-06-22 4.44 1 21 1
19 7750 Huge 2 BR Upper East Cental Park 17985 Sing Manhattan East Harlem 40.79685 -73.94872 Entire home/apt 190 7 0 NaN NaN 2 249 0
20 7801 Sweet and Spacious Brooklyn Loft 21207 Chaya Brooklyn Williamsburg 40.71842 -73.95718 Entire home/apt 299 3 9 2011-12-28 0.07 1 0 0
21 8024 CBG CtyBGd HelpsHaiti rm#1:1-4 22486 Lisel Brooklyn Park Slope 40.68069 -73.97706 Private room 130 2 130 2019-07-01 1.09 6 347 1
22 8025 CBG Helps Haiti Room#2.5 22486 Lisel Brooklyn Park Slope 40.67989 -73.97798 Private room 80 1 39 2019-01-01 0.37 6 364 1
23 8110 CBG Helps Haiti Rm #2 22486 Lisel Brooklyn Park Slope 40.68001 -73.97865 Private room 110 2 71 2019-07-02 0.61 6 304 1
24 8490 MAISON DES SIRENES1,bohemian apartment 25183 Nathalie Brooklyn Bedford-Stuyvesant 40.68371 -73.94028 Entire home/apt 120 2 88 2019-06-19 0.73 2 233 1
25 8505 Sunny Bedroom Across Prospect Park 25326 Gregory Brooklyn Windsor Terrace 40.65599 -73.97519 Private room 60 1 19 2019-06-23 1.37 2 85 1
26 8700 Magnifique Suite au N de Manhattan - vue Cloitres 26394 Claude & Sophie Manhattan Inwood 40.86754 -73.92639 Private room 80 4 0 NaN NaN 1 0 1
27 9357 Midtown Pied-a-terre 30193 Tommi Manhattan Hell's Kitchen 40.76715 -73.98533 Entire home/apt 150 10 58 2017-08-13 0.49 1 75 0
28 9518 SPACIOUS, LOVELY FURNISHED MANHATTAN BEDROOM 31374 Shon Manhattan Inwood 40.86482 -73.92106 Private room 44 3 108 2019-06-15 1.11 3 311 1
29 9657 Modern 1 BR / NYC / EAST VILLAGE 21904 Dana Manhattan East Village 40.72920 -73.98542 Entire home/apt 180 14 29 2019-04-19 0.24 1 67 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
48865 36472171 1 bedroom in sunlit apartment 99144947 Brenda Manhattan Inwood 40.86845 -73.92449 Private room 80 1 0 NaN NaN 1 79 1
48866 36472710 CozyHideAway Suite 274225617 Alberth Queens Briarwood 40.70786 -73.81448 Entire home/apt 58 1 0 NaN NaN 1 159 1
48867 36473044 The place you were dreaming for.(only for guys) 261338177 Diana Brooklyn Gravesend 40.59080 -73.97116 Shared room 25 1 0 NaN NaN 6 338 1
48868 36473253 Heaven for you(only for guy) 261338177 Diana Brooklyn Gravesend 40.59118 -73.97119 Shared room 25 7 0 NaN NaN 6 365 1
48869 36474023 Cozy, Sunny Brooklyn Escape 1550580 Julia Brooklyn Bedford-Stuyvesant 40.68759 -73.95705 Private room 45 4 0 NaN NaN 1 7 1
48870 36474911 Cozy, clean Williamsburg 1- bedroom apartment 1273444 Tanja Brooklyn Williamsburg 40.71197 -73.94946 Entire home/apt 99 4 0 NaN NaN 1 22 1
48871 36475746 A LARGE ROOM - 1 MONTH MINIMUM - WASHER&DRYER 144008701 Ozzy Ciao Manhattan Harlem 40.82233 -73.94687 Private room 35 29 0 NaN NaN 2 31 1
48872 36476675 Nycity-MyHome 8636072 Ben Manhattan Hell's Kitchen 40.76236 -73.99255 Entire home/apt 260 3 0 NaN NaN 1 9 0
48873 36477307 Brooklyn paradise 241945355 Clement & Rose Brooklyn Flatlands 40.63116 -73.92616 Entire home/apt 170 1 0 NaN NaN 2 363 0
48874 36477588 Short Term Rental in East Harlem 214535893 Jeffrey Manhattan East Harlem 40.79760 -73.93947 Private room 50 7 0 NaN NaN 1 22 1
48875 36478343 Welcome all as family 274273284 Anastasia Manhattan East Harlem 40.78749 -73.94749 Private room 140 1 0 NaN NaN 1 180 1
48876 36478357 Cozy, Air-Conditioned Private Bedroom in Harlem 177932088 Joseph Manhattan Harlem 40.80953 -73.95410 Private room 60 1 0 NaN NaN 1 26 1
48877 36479230 Studio sized room with beautiful light 65767720 Melanie Brooklyn Bushwick 40.70418 -73.91471 Private room 42 7 0 NaN NaN 1 16 1
48878 36479723 Room for rest 41326856 Jeerathinan Queens Elmhurst 40.74477 -73.87727 Private room 45 1 0 NaN NaN 5 172 1
48879 36480292 Gorgeous 1.5 Bdr with a private yard- Williams... 540335 Lee Brooklyn Williamsburg 40.71728 -73.94394 Entire home/apt 120 20 0 NaN NaN 1 22 1
48880 36481315 The Raccoon Artist Studio in Williamsburg New ... 208514239 Melki Brooklyn Williamsburg 40.71232 -73.94220 Entire home/apt 120 1 0 NaN NaN 3 365 1
48881 36481615 Peaceful space in Greenpoint, BK 274298453 Adrien Brooklyn Greenpoint 40.72585 -73.94001 Private room 54 6 0 NaN NaN 1 15 1
48882 36482231 Bushwick _ Myrtle-Wyckoff 66058896 Luisa Brooklyn Bushwick 40.69652 -73.91079 Private room 40 20 0 NaN NaN 1 31 1
48883 36482416 Sunny Bedroom NYC! Walking to Central Park!! 131529729 Kendall Manhattan East Harlem 40.79755 -73.93614 Private room 75 2 0 NaN NaN 2 364 1
48884 36482783 Brooklyn Oasis in the heart of Williamsburg 274307600 Jonathan Brooklyn Williamsburg 40.71790 -73.96238 Private room 190 7 0 NaN NaN 1 341 0
48885 36482809 Stunning Bedroom NYC! Walking to Central Park!! 131529729 Kendall Manhattan East Harlem 40.79633 -73.93605 Private room 75 2 0 NaN NaN 2 353 1
48886 36483010 Comfy 1 Bedroom in Midtown East 274311461 Scott Manhattan Midtown 40.75561 -73.96723 Entire home/apt 200 6 0 NaN NaN 1 176 0
48887 36483152 Garden Jewel Apartment in Williamsburg New York 208514239 Melki Brooklyn Williamsburg 40.71232 -73.94220 Entire home/apt 170 1 0 NaN NaN 3 365 0
48888 36484087 Spacious Room w/ Private Rooftop, Central loca... 274321313 Kat Manhattan Hell's Kitchen 40.76392 -73.99183 Private room 125 4 0 NaN NaN 1 31 1
48889 36484363 QUIT PRIVATE HOUSE 107716952 Michael Queens Jamaica 40.69137 -73.80844 Private room 65 1 0 NaN NaN 2 163 1
48890 36484665 Charming one bedroom - newly renovated rowhouse 8232441 Sabrina Brooklyn Bedford-Stuyvesant 40.67853 -73.94995 Private room 70 2 0 NaN NaN 2 9 1
48891 36485057 Affordable room in Bushwick/East Williamsburg 6570630 Marisol Brooklyn Bushwick 40.70184 -73.93317 Private room 40 4 0 NaN NaN 2 36 1
48892 36485431 Sunny Studio at Historical Neighborhood 23492952 Ilgar & Aysel Manhattan Harlem 40.81475 -73.94867 Entire home/apt 115 10 0 NaN NaN 1 27 1
48893 36485609 43rd St. Time Square-cozy single bed 30985759 Taz Manhattan Hell's Kitchen 40.75751 -73.99112 Shared room 55 1 0 NaN NaN 6 2 1
48894 36487245 Trendy duplex in the very heart of Hell's Kitchen 68119814 Christophe Manhattan Hell's Kitchen 40.76404 -73.98933 Private room 90 7 0 NaN NaN 1 23 1

48895 rows × 17 columns

Without looking at the full data yet, let's just ensure our prices are within valid ranges:

In [6]:
df['price'].describe()
Out[6]:
count    48895.000000
mean       152.720687
std        240.154170
min          0.000000
25%         69.000000
50%        106.000000
75%        175.000000
max      10000.000000
Name: price, dtype: float64

Uh-oh. We see that price has a minimum value of $0. I highly doubt any unit in NYC is free. These data instances are garbage, so let's go ahead and remove any instance that has a price of $0.

In [7]:
print("original training size:", df.shape)
df = df.loc[df['price'] != 0]
print("new training size:", df.shape)
original training size: (48895, 17)
new training size: (48884, 17)

Now, let's split the data while ensuring that our test set has a fair distribution of affordable units, then further split our training set so as to create the development set:

In [8]:
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42, stratify=df['affordable'])
df_train, df_dev = train_test_split(df_train, test_size=0.25, random_state=99) #stratify=df_train['affordable'])

# ensure our dataset splits are of the % sizes we want
total_size = len(df_train) + len(df_dev) + len(df_test)
print("train:", len(df_train), "=>", len(df_train) / total_size)
print("dev:", len(df_dev), " =>", len(df_dev) / total_size)
print("test:", len(df_test), "=>", len(df_test) / total_size)
train: 29330 => 0.5999918173635546
dev: 9777  => 0.20000409131822272
test: 9777 => 0.20000409131822272
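
The 60/20/20 proportions follow from the two-stage split: holding out 25% of the remaining 80% gives 0.25 × 0.8 = 0.2 of the full data. The sketch below reproduces this on synthetic data and, as a variation on the cell above, stratifies the second split as well (the lab's own code leaves that argument commented out). The frame toy and its 70/30 label mix are made up.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for df: 1,000 rows, 70% labelled affordable (made-up proportions).
toy = pd.DataFrame({"feature": np.arange(1000),
                    "affordable": np.repeat([1, 0], [700, 300])})

# Stage 1: hold out 20% for testing.
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["affordable"])

# Stage 2: 25% of the remaining 80% is 20% of the full data.
toy_train, toy_dev = train_test_split(
    toy_train, test_size=0.25, random_state=99, stratify=toy_train["affordable"])

# Report the size and the share of affordable units in each split.
for name, part in [("train", toy_train), ("dev", toy_dev), ("test", toy_test)]:
    print(name, len(part), part["affordable"].mean())
```

With stratification at both stages, each split keeps roughly the same share of affordable units as the full data.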

Let's remove the target (i.e., affordable), along with price, from which it was derived, from our feature dataframes, and store the target in separate dataframes.

In [9]:
# training
x_train = df_train.drop(['price', 'affordable'], axis=1)
y_train = pd.DataFrame(data=df_train['affordable'], columns=["affordable"])

# dev
x_dev = df_dev.drop(['price', 'affordable'], axis=1)
y_dev = pd.DataFrame(data=df_dev['affordable'], columns=["affordable"])

# test
x_test = df_test.drop(['price', 'affordable'], axis=1)
y_test = pd.DataFrame(data=df_test['affordable'], columns=["affordable"])

From now onwards, we will do EDA and cleaning based on the training set, x_train.

In [10]:
for col in x_train.columns:
    print(col, ":", x_train[col].isnull().sum())
id : 0
name : 12
host_id : 0
host_name : 12
borough : 0
neighbourhood : 0
latitude : 0
longitude : 0
room_type : 0
minimum_nights : 0
number_of_reviews : 0
last_review : 6065
reviews_per_month : 6065
calculated_host_listings_count : 0
availability_365 : 0

Oh dear. It appears ~6k of the rows have missing values concerning the reviews. It seems impossible to impute the last_review feature with reasonable values, as this is very specific to each unit. At best, we could guess the date based on the reviews_per_month, but that feature is missing for the same rows. Further, it might be difficult to replace reviews_per_month with reasonable values -- sure, we could fill in values to be the median value, but that seems wrong to generalize so heavily, especially for over 20% of our data. Consequently, let's just ignore these two columns.
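
To put a number on "how much is missing", the fraction of NaNs per column is often easier to read than raw counts. The sketch below uses a made-up frame (toy) that mimics the pattern in the rows displayed earlier, where missing review fields appear alongside number_of_reviews equal to 0; whether that holds for every row of the real data is not checked here.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; NaN stands for a missing reviews_per_month entry.
toy = pd.DataFrame({
    "number_of_reviews": [9, 45, 0, 270, 0],
    "reviews_per_month": [0.21, 0.38, np.nan, 4.64, np.nan],
})

# isnull().mean() gives the fraction of missing entries per column.
missing_fraction = toy.isnull().mean()
print(missing_fraction)

# In this toy data, every missing reviews_per_month coincides with zero reviews.
missing_rows = toy["reviews_per_month"].isnull()
all_missing_have_no_reviews = (toy.loc[missing_rows, "number_of_reviews"] == 0).all()
print(all_missing_have_no_reviews)
```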

In [11]:
x_train = x_train.drop(['last_review', 'reviews_per_month'], axis=1)
x_dev = x_dev.drop(['last_review', 'reviews_per_month'], axis=1)
x_test = x_test.drop(['last_review', 'reviews_per_month'], axis=1)

Let's look at the summary statistics of the data:

In [12]:
x_train.describe()
Out[12]:
id host_id latitude longitude minimum_nights number_of_reviews calculated_host_listings_count availability_365
count 2.933000e+04 2.933000e+04 29330.000000 29330.000000 29330.000000 29330.000000 29330.000000 29330.000000
mean 1.899091e+07 6.746725e+07 40.729049 -73.952129 6.891647 23.490829 7.111081 113.047017
std 1.102972e+07 7.863754e+07 0.054446 0.046320 19.236816 45.324235 32.904893 131.845296
min 2.539000e+03 2.438000e+03 40.499790 -74.242850 1.000000 0.000000 1.000000 0.000000
25% 9.380684e+06 7.794212e+06 40.690423 -73.983130 1.000000 1.000000 1.000000 0.000000
50% 1.960499e+07 3.049924e+07 40.723090 -73.955630 3.000000 5.000000 1.000000 44.000000
75% 2.921518e+07 1.074344e+08 40.763067 -73.936100 5.000000 24.000000 2.000000 228.000000
max 3.648561e+07 2.743213e+08 40.913060 -73.712990 1000.000000 629.000000 327.000000 365.000000

Next, we see that the minimum_nights feature has a maximum value of 1,000. That's almost 2.75 years, which is probably longer than the duration that most people rent an apartment. This seems anomalous and wrong. Let's discard it and other units that are outrageous. Well, what constitutes 'outrageous'? We see that the standard deviation for minimum_nights is 19.24. If we assume our values are normally distributed, then keeping only values within 2 standard deviations of the mean would retain ~95% of the original data. However, we have no reason to believe our data is actually normally distributed, especially since our mean is only about 7 and no value can fall below 1, which points to a heavily right-skewed distribution. To get a better idea of our actual values, let's plot them as a histogram.
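
The two-standard-deviation argument can be checked with a little arithmetic. The sketch below plugs in the minimum_nights mean, standard deviation, and maximum exactly as printed in the x_train.describe() output above; the variable names are hypothetical.

```python
# Summary statistics for minimum_nights, copied from the x_train.describe() output.
mean_nights = 6.891647
std_nights = 19.236816
max_nights = 1000

# Range covered by "within 2 standard deviations of the mean".
lower = mean_nights - 2 * std_nights
upper = mean_nights + 2 * std_nights
print(f"2-sigma range: ({lower:.2f}, {upper:.2f})")

# The lower bound is negative although no listing can require fewer than 1 night,
# which hints that the values are right-skewed rather than normal.
print("max in years:", round(max_nights / 365, 2))
```

A negative lower bound for a quantity that starts at 1 is a quick sign that a symmetric, normal-based cutoff does not describe this feature well.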

In [13]:
fig, ax = plt.subplots(1,1)
ax.hist(x_train['minimum_nights'], 25, log=True)
plt.xlabel('minimum_nights')
plt.ylabel('count')
Out[13]:
Text(0, 0.5, 'count')

Yeah, that instance was a strong outlier, and the host was being ridiculously greedy. That's a clever way to get a multi-year lease. Notice that the y-axis is on a log scale. Clearly, most of our mass comes from units that require fewer than 365 nights. To get a better sense of that subset, let's re-plot only units with minimum_nights < 365.

In [14]:
subset = x_train['minimum_nights']<365
fig, ax = plt.subplots(1,1)
ax.hist(x_train['minimum_nights'][subset], 30, log=True)
plt.xlabel('minimum_nights')
plt.ylabel('count')
Out[14]:
Text(0, 0.5, 'count')

Ok, that doesn't look too bad, as most units require < 30 nights. It's surprising that some hosts list an unreasonable requirement for the minimum number of nights. There is a risk that any host that lists such an unreasonable value might also have other incorrect information. Personally, I think anything beyond 30 days could be suspicious. If we were to exclude any unit that requires more than 30 days, how many instances would we be ignoring?

In [15]:
len(x_train.loc[x_train['minimum_nights']>30])
Out[15]:
436

Alright, we'd be throwing away 436 out of our ~30k entries. That's roughly 1.5% of our data. While we generally want to keep and use as much data as we can, I think this is an okay amount to discard, especially considering (1) we have a decently large amount of data remaining, and (2) the entries beyond a 30-day minimum could be unreliable.

In [16]:
good_subset = x_train['minimum_nights'] <= 30
x_train = x_train.loc[good_subset]
y_train = y_train.loc[good_subset]
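
The filtering above works because the boolean mask carries the same index as both x_train and y_train, so .loc picks matching rows from each. A minimal sketch with made-up data (toy_x, toy_y, and keep are hypothetical names) shows the alignment:

```python
import pandas as pd

# Toy features and labels sharing the same shuffled index, as after a random split.
toy_x = pd.DataFrame({"minimum_nights": [1, 45, 3, 365, 30]}, index=[10, 4, 7, 2, 9])
toy_y = pd.DataFrame({"affordable": [1, 1, 0, 0, 1]}, index=[10, 4, 7, 2, 9])

# The boolean mask carries toy_x's index, so .loc selects the same rows from toy_y.
keep = toy_x["minimum_nights"] <= 30
toy_x_trimmed = toy_x.loc[keep]
toy_y_trimmed = toy_y.loc[keep]

print(toy_x_trimmed.index.tolist(), toy_y_trimmed.index.tolist())
print("fraction discarded:", 1 - keep.mean())
```

Filtering features and labels with one shared mask keeps them row-aligned, which is easy to break if each is filtered separately.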

Notice that we only trimmed our training data, not our development or testing data. I am making this choice because in real scenarios, we would not know the nature of the testing data values. We pre-processed all of our data (including the testing set) to remove every row with a price of $0 and to drop certain columns, but that was fair because those were obviously bogus or unusable elements of the dataset. However, it would be unfair to inspect the values of the training set and then further trim the development and testing sets accordingly, conditioned on certain data values.

The remaining columns of our training data all have reasonable summary statistics. None of the mins or maxes are cause for concern, and we have no reason to assert a certain distribution of values. Since all the feature values are within reasonable ranges, and the only missing values (NaNs) we saw outside the dropped columns were a handful in the text columns name and host_name, we can confidently move forward. To recap, our remaining columns are now:

In [17]:
[col for col in x_train.columns] # easier to read vertically than horizontally
Out[17]:
['id',
 'name',
 'host_id',
 'host_name',
 'borough',
 'neighbourhood',
 'latitude',
 'longitude',
 'room_type',
 'minimum_nights',
 'number_of_reviews',
 'calculated_host_listings_count',
 'availability_365']

We don't have a terribly large number of features. This allows us to inspect every pairwise interaction. A scatterplot matrix is great for this, as it provides a high-level picture of how every pair of numeric features relates. If any subplot depicts a linear relationship (i.e., a clear, narrow path with the mass concentrated together), then we can suspect some collinearity: the two features overlap in what they capture and are not independent of each other.

In [18]:
scatter_matrix(x_train, figsize=(30,20));
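
A numeric companion to the scatterplot matrix is the pairwise correlation table. The sketch below uses synthetic columns (a, b, c are made up, not the lab's features) where b is built to be nearly a linear function of a, so the collinear pair stands out.

```python
import numpy as np
import pandas as pd

# Synthetic features: b is nearly a linear function of a, c is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=500),
    "c": rng.normal(size=500),
})

# Pearson correlation for every pair of numeric columns.
corr = toy.corr()
print(corr.round(2))
```

Pearson correlation only captures linear association, so it complements the plots rather than replacing them.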