CS-109A Introduction to Data Science

Lab 6: Logistic Regression

Harvard University
Fall 2019
Instructors: Pavlos Protopapas, Kevin Rader, Chris Tanner
Lab Instructors: Chris Tanner and Eleni Kaxiras.
Contributors: Chris Tanner


In [1]:
## RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)
Out[1]:

Learning Goals (EDIT)

In this lab, we'll explore different models used to predict which of several labels applies to a new datapoint based on labels observed in the training data.

By the end of this lab, you should:

  • Be familiar with the sklearn implementations of
    • Linear Regression
    • Logistic Regression
  • Be able to make an informed choice of model based on the data at hand
  • (Bonus) Structure your sklearn code into Pipelines to make building, fitting, and tracking your models easier
In [2]:
# IMPORTS GALORE
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix

import statsmodels.api as sm
from statsmodels.api import OLS

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_curve
from sklearn.metrics import auc

pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)

Part 1: The AirBnB NYC 2019 Dataset + EDA

The dataset contains information about AirBnB hosts in NYC from 2019. There are 49k unique hosts and 16 features for each:

  • id: listing ID
  • name: name of the listing
  • host_id: host ID
  • host_name: name of the host
  • neighbourhood_group: NYC borough
  • neighbourhood: neighborhood
  • latitude: latitude coordinates
  • longitude: longitude coordinates
  • room_type: listing space type (e.g., private room, entire home)
  • price: price in dollars per night
  • minimum_nights: number of min. nights required for booking
  • number_of_reviews: number of reviews
  • last_review: date of the last review
  • reviews_per_month: number of reviews per month
  • calculated_host_listings_count: number of listings the host has
  • availability_365: number of days the listing is available for booking

Our goal is to predict the price of unseen housing units as being 'affordable' or 'unaffordable', by using their features. We will assume that this task is for a particular client who has a specific budget and would like to simplify the problem by classifying any unit that costs \< \$150 per night as 'affordable' and any unit that costs \\$150 or great as 'unaffordable'.

For this task, we will exercise our normal data science pipeline -- from EDA to modelling and visualization. In particular, we will show the performance of 3 classifiers:

  • Maximum Likelihood Estimate (MLE)
  • Linear Regression
  • Logistic Regression

Let's get started! And awaaaaay we go!

Read-in and checking

We do the usual read-in and verification of the data:

In [3]:
df = pd.read_csv("../data/nyc_airbnb.csv") #, index_col=0)
df.head()
Out[3]:
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365
0 2539 Clean & quiet apt home by the park 2787 John Brooklyn Kensington 40.64749 -73.97237 Private room 149 1 9 2018-10-19 0.21 6 365
1 2595 Skylit Midtown Castle 2845 Jennifer Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 1 45 2019-05-21 0.38 2 355
2 3647 THE VILLAGE OF HARLEM....NEW YORK ! 4632 Elisabeth Manhattan Harlem 40.80902 -73.94190 Private room 150 3 0 NaN NaN 1 365
3 3831 Cozy Entire Floor of Brownstone 4869 LisaRoxanne Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 1 270 2019-07-05 4.64 1 194
4 5022 Entire Apt: Spacious Studio/Loft by central park 7192 Laura Manhattan East Harlem 40.79851 -73.94399 Entire home/apt 80 10 9 2018-11-19 0.10 1 0

Building the training/dev/testing data

As usual, we split the data before we begin our analysis. It would be unfair to cheat by looking at the testing data. Let's divide the data into 60% training, 20% development (aka validation), 20% testing. However, before we split the data, let's make the simple transformation and converting the prices into a categories of being affordable or not.

In [4]:
df['affordable'] = np.where(df['price'] < 150, 1, 0)
df
Out[4]:
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 affordable
0 2539 Clean & quiet apt home by the park 2787 John Brooklyn Kensington 40.64749 -73.97237 Private room 149 1 9 2018-10-19 0.21 6 365 1
1 2595 Skylit Midtown Castle 2845 Jennifer Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 1 45 2019-05-21 0.38 2 355 0
2 3647 THE VILLAGE OF HARLEM....NEW YORK ! 4632 Elisabeth Manhattan Harlem 40.80902 -73.94190 Private room 150 3 0 NaN NaN 1 365 0
3 3831 Cozy Entire Floor of Brownstone 4869 LisaRoxanne Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 1 270 2019-07-05 4.64 1 194 1
4 5022 Entire Apt: Spacious Studio/Loft by central park 7192 Laura Manhattan East Harlem 40.79851 -73.94399 Entire home/apt 80 10 9 2018-11-19 0.10 1 0 1
5 5099 Large Cozy 1 BR Apartment In Midtown East 7322 Chris Manhattan Murray Hill 40.74767 -73.97500 Entire home/apt 200 3 74 2019-06-22 0.59 1 129 0
6 5121 BlissArtsSpace! 7356 Garon Brooklyn Bedford-Stuyvesant 40.68688 -73.95596 Private room 60 45 49 2017-10-05 0.40 1 0 1
7 5178 Large Furnished Room Near B'way 8967 Shunichi Manhattan Hell's Kitchen 40.76489 -73.98493 Private room 79 2 430 2019-06-24 3.47 1 220 1
8 5203 Cozy Clean Guest Room - Family Apt 7490 MaryEllen Manhattan Upper West Side 40.80178 -73.96723 Private room 79 2 118 2017-07-21 0.99 1 0 1
9 5238 Cute & Cozy Lower East Side 1 bdrm 7549 Ben Manhattan Chinatown 40.71344 -73.99037 Entire home/apt 150 1 160 2019-06-09 1.33 4 188 0
10 5295 Beautiful 1br on Upper West Side 7702 Lena Manhattan Upper West Side 40.80316 -73.96545 Entire home/apt 135 5 53 2019-06-22 0.43 1 6 1
11 5441 Central Manhattan/near Broadway 7989 Kate Manhattan Hell's Kitchen 40.76076 -73.98867 Private room 85 2 188 2019-06-23 1.50 1 39 1
12 5803 Lovely Room 1, Garden, Best Area, Legal rental 9744 Laurie Brooklyn South Slope 40.66829 -73.98779 Private room 89 4 167 2019-06-24 1.34 3 314 1
13 6021 Wonderful Guest Bedroom in Manhattan for SINGLES 11528 Claudio Manhattan Upper West Side 40.79826 -73.96113 Private room 85 2 113 2019-07-05 0.91 1 333 1
14 6090 West Village Nest - Superhost 11975 Alina Manhattan West Village 40.73530 -74.00525 Entire home/apt 120 90 27 2018-10-31 0.22 1 0 1
15 6848 Only 2 stops to Manhattan studio 15991 Allen & Irina Brooklyn Williamsburg 40.70837 -73.95352 Entire home/apt 140 2 148 2019-06-29 1.20 1 46 1
16 7097 Perfect for Your Parents + Garden 17571 Jane Brooklyn Fort Greene 40.69169 -73.97185 Entire home/apt 215 2 198 2019-06-28 1.72 1 321 0
17 7322 Chelsea Perfect 18946 Doti Manhattan Chelsea 40.74192 -73.99501 Private room 140 1 260 2019-07-01 2.12 1 12 1
18 7726 Hip Historic Brownstone Apartment with Backyard 20950 Adam And Charity Brooklyn Crown Heights 40.67592 -73.94694 Entire home/apt 99 3 53 2019-06-22 4.44 1 21 1
19 7750 Huge 2 BR Upper East Cental Park 17985 Sing Manhattan East Harlem 40.79685 -73.94872 Entire home/apt 190 7 0 NaN NaN 2 249 0
20 7801 Sweet and Spacious Brooklyn Loft 21207 Chaya Brooklyn Williamsburg 40.71842 -73.95718 Entire home/apt 299 3 9 2011-12-28 0.07 1 0 0
21 8024 CBG CtyBGd HelpsHaiti rm#1:1-4 22486 Lisel Brooklyn Park Slope 40.68069 -73.97706 Private room 130 2 130 2019-07-01 1.09 6 347 1
22 8025 CBG Helps Haiti Room#2.5 22486 Lisel Brooklyn Park Slope 40.67989 -73.97798 Private room 80 1 39 2019-01-01 0.37 6 364 1
23 8110 CBG Helps Haiti Rm #2 22486 Lisel Brooklyn Park Slope 40.68001 -73.97865 Private room 110 2 71 2019-07-02 0.61 6 304 1
24 8490 MAISON DES SIRENES1,bohemian apartment 25183 Nathalie Brooklyn Bedford-Stuyvesant 40.68371 -73.94028 Entire home/apt 120 2 88 2019-06-19 0.73 2 233 1
25 8505 Sunny Bedroom Across Prospect Park 25326 Gregory Brooklyn Windsor Terrace 40.65599 -73.97519 Private room 60 1 19 2019-06-23 1.37 2 85 1
26 8700 Magnifique Suite au N de Manhattan - vue Cloitres 26394 Claude & Sophie Manhattan Inwood 40.86754 -73.92639 Private room 80 4 0 NaN NaN 1 0 1
27 9357 Midtown Pied-a-terre 30193 Tommi Manhattan Hell's Kitchen 40.76715 -73.98533 Entire home/apt 150 10 58 2017-08-13 0.49 1 75 0
28 9518 SPACIOUS, LOVELY FURNISHED MANHATTAN BEDROOM 31374 Shon Manhattan Inwood 40.86482 -73.92106 Private room 44 3 108 2019-06-15 1.11 3 311 1
29 9657 Modern 1 BR / NYC / EAST VILLAGE 21904 Dana Manhattan East Village 40.72920 -73.98542 Entire home/apt 180 14 29 2019-04-19 0.24 1 67 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
48865 36472171 1 bedroom in sunlit apartment 99144947 Brenda Manhattan Inwood 40.86845 -73.92449 Private room 80 1 0 NaN NaN 1 79 1
48866 36472710 CozyHideAway Suite 274225617 Alberth Queens Briarwood 40.70786 -73.81448 Entire home/apt 58 1 0 NaN NaN 1 159 1
48867 36473044 The place you were dreaming for.(only for guys) 261338177 Diana Brooklyn Gravesend 40.59080 -73.97116 Shared room 25 1 0 NaN NaN 6 338 1
48868 36473253 Heaven for you(only for guy) 261338177 Diana Brooklyn Gravesend 40.59118 -73.97119 Shared room 25 7 0 NaN NaN 6 365 1
48869 36474023 Cozy, Sunny Brooklyn Escape 1550580 Julia Brooklyn Bedford-Stuyvesant 40.68759 -73.95705 Private room 45 4 0 NaN NaN 1 7 1
48870 36474911 Cozy, clean Williamsburg 1- bedroom apartment 1273444 Tanja Brooklyn Williamsburg 40.71197 -73.94946 Entire home/apt 99 4 0 NaN NaN 1 22 1
48871 36475746 A LARGE ROOM - 1 MONTH MINIMUM - WASHER&DRYER; 144008701 Ozzy Ciao Manhattan Harlem 40.82233 -73.94687 Private room 35 29 0 NaN NaN 2 31 1
48872 36476675 Nycity-MyHome 8636072 Ben Manhattan Hell's Kitchen 40.76236 -73.99255 Entire home/apt 260 3 0 NaN NaN 1 9 0
48873 36477307 Brooklyn paradise 241945355 Clement & Rose Brooklyn Flatlands 40.63116 -73.92616 Entire home/apt 170 1 0 NaN NaN 2 363 0
48874 36477588 Short Term Rental in East Harlem 214535893 Jeffrey Manhattan East Harlem 40.79760 -73.93947 Private room 50 7 0 NaN NaN 1 22 1
48875 36478343 Welcome all as family 274273284 Anastasia Manhattan East Harlem 40.78749 -73.94749 Private room 140 1 0 NaN NaN 1 180 1
48876 36478357 Cozy, Air-Conditioned Private Bedroom in Harlem 177932088 Joseph Manhattan Harlem 40.80953 -73.95410 Private room 60 1 0 NaN NaN 1 26 1
48877 36479230 Studio sized room with beautiful light 65767720 Melanie Brooklyn Bushwick 40.70418 -73.91471 Private room 42 7 0 NaN NaN 1 16 1
48878 36479723 Room for rest 41326856 Jeerathinan Queens Elmhurst 40.74477 -73.87727 Private room 45 1 0 NaN NaN 5 172 1
48879 36480292 Gorgeous 1.5 Bdr with a private yard- Williams... 540335 Lee Brooklyn Williamsburg 40.71728 -73.94394 Entire home/apt 120 20 0 NaN NaN 1 22 1
48880 36481315 The Raccoon Artist Studio in Williamsburg New ... 208514239 Melki Brooklyn Williamsburg 40.71232 -73.94220 Entire home/apt 120 1 0 NaN NaN 3 365 1
48881 36481615 Peaceful space in Greenpoint, BK 274298453 Adrien Brooklyn Greenpoint 40.72585 -73.94001 Private room 54 6 0 NaN NaN 1 15 1
48882 36482231 Bushwick _ Myrtle-Wyckoff 66058896 Luisa Brooklyn Bushwick 40.69652 -73.91079 Private room 40 20 0 NaN NaN 1 31 1
48883 36482416 Sunny Bedroom NYC! Walking to Central Park!! 131529729 Kendall Manhattan East Harlem 40.79755 -73.93614 Private room 75 2 0 NaN NaN 2 364 1
48884 36482783 Brooklyn Oasis in the heart of Williamsburg 274307600 Jonathan Brooklyn Williamsburg 40.71790 -73.96238 Private room 190 7 0 NaN NaN 1 341 0
48885 36482809 Stunning Bedroom NYC! Walking to Central Park!! 131529729 Kendall Manhattan East Harlem 40.79633 -73.93605 Private room 75 2 0 NaN NaN 2 353 1
48886 36483010 Comfy 1 Bedroom in Midtown East 274311461 Scott Manhattan Midtown 40.75561 -73.96723 Entire home/apt 200 6 0 NaN NaN 1 176 0
48887 36483152 Garden Jewel Apartment in Williamsburg New York 208514239 Melki Brooklyn Williamsburg 40.71232 -73.94220 Entire home/apt 170 1 0 NaN NaN 3 365 0
48888 36484087 Spacious Room w/ Private Rooftop, Central loca... 274321313 Kat Manhattan Hell's Kitchen 40.76392 -73.99183 Private room 125 4 0 NaN NaN 1 31 1
48889 36484363 QUIT PRIVATE HOUSE 107716952 Michael Queens Jamaica 40.69137 -73.80844 Private room 65 1 0 NaN NaN 2 163 1
48890 36484665 Charming one bedroom - newly renovated rowhouse 8232441 Sabrina Brooklyn Bedford-Stuyvesant 40.67853 -73.94995 Private room 70 2 0 NaN NaN 2 9 1
48891 36485057 Affordable room in Bushwick/East Williamsburg 6570630 Marisol Brooklyn Bushwick 40.70184 -73.93317 Private room 40 4 0 NaN NaN 2 36 1
48892 36485431 Sunny Studio at Historical Neighborhood 23492952 Ilgar & Aysel Manhattan Harlem 40.81475 -73.94867 Entire home/apt 115 10 0 NaN NaN 1 27 1
48893 36485609 43rd St. Time Square-cozy single bed 30985759 Taz Manhattan Hell's Kitchen 40.75751 -73.99112 Shared room 55 1 0 NaN NaN 6 2 1
48894 36487245 Trendy duplex in the very heart of Hell's Kitchen 68119814 Christophe Manhattan Hell's Kitchen 40.76404 -73.98933 Private room 90 7 0 NaN NaN 1 23 1

48895 rows × 17 columns

NOTE: The affordable column now has a value of 1 whenever the price is < 150, and 0 otherwise.

Also, the feature named neighbourhood_group can be easily confused with neighbourhood, so let's go ahead and rename it to borough, as that is more distinct:

In [5]:
df.rename(columns={"neighbourhood_group": "borough"}, inplace=True)
df
Out[5]:
id name host_id host_name borough neighbourhood latitude longitude room_type price minimum_nights number_of_reviews last_review reviews_per_month calculated_host_listings_count availability_365 affordable
0 2539 Clean & quiet apt home by the park 2787 John Brooklyn Kensington 40.64749 -73.97237 Private room 149 1 9 2018-10-19 0.21 6 365 1
1 2595 Skylit Midtown Castle 2845 Jennifer Manhattan Midtown 40.75362 -73.98377 Entire home/apt 225 1 45 2019-05-21 0.38 2 355 0
2 3647 THE VILLAGE OF HARLEM....NEW YORK ! 4632 Elisabeth Manhattan Harlem 40.80902 -73.94190 Private room 150 3 0 NaN NaN 1 365 0
3 3831 Cozy Entire Floor of Brownstone 4869 LisaRoxanne Brooklyn Clinton Hill 40.68514 -73.95976 Entire home/apt 89 1 270 2019-07-05 4.64 1 194 1
4 5022 Entire Apt: Spacious Studio/Loft by central park 7192 Laura Manhattan East Harlem 40.79851 -73.94399 Entire home/apt 80 10 9 2018-11-19 0.10 1 0 1
5 5099 Large Cozy 1 BR Apartment In Midtown East 7322 Chris Manhattan Murray Hill 40.74767 -73.97500 Entire home/apt 200 3 74 2019-06-22 0.59 1 129 0
6 5121 BlissArtsSpace! 7356 Garon Brooklyn Bedford-Stuyvesant 40.68688 -73.95596 Private room 60 45 49 2017-10-05 0.40 1 0 1
7 5178 Large Furnished Room Near B'way 8967 Shunichi Manhattan Hell's Kitchen 40.76489 -73.98493 Private room 79 2 430 2019-06-24 3.47 1 220 1
8 5203 Cozy Clean Guest Room - Family Apt 7490 MaryEllen Manhattan Upper West Side 40.80178 -73.96723 Private room 79 2 118 2017-07-21 0.99 1 0 1
9 5238 Cute & Cozy Lower East Side 1 bdrm 7549 Ben Manhattan Chinatown 40.71344 -73.99037 Entire home/apt 150 1 160 2019-06-09 1.33 4 188 0
10 5295 Beautiful 1br on Upper West Side 7702 Lena Manhattan Upper West Side 40.80316 -73.96545 Entire home/apt 135 5 53 2019-06-22 0.43 1 6 1
11 5441 Central Manhattan/near Broadway 7989 Kate Manhattan Hell's Kitchen 40.76076 -73.98867 Private room 85 2 188 2019-06-23 1.50 1 39 1
12 5803 Lovely Room 1, Garden, Best Area, Legal rental 9744 Laurie Brooklyn South Slope 40.66829 -73.98779 Private room 89 4 167 2019-06-24 1.34 3 314 1
13 6021 Wonderful Guest Bedroom in Manhattan for SINGLES 11528 Claudio Manhattan Upper West Side 40.79826 -73.96113 Private room 85 2 113 2019-07-05 0.91 1 333 1
14 6090 West Village Nest - Superhost 11975 Alina Manhattan West Village 40.73530 -74.00525 Entire home/apt 120 90 27 2018-10-31 0.22 1 0 1
15 6848 Only 2 stops to Manhattan studio 15991 Allen & Irina Brooklyn Williamsburg 40.70837 -73.95352 Entire home/apt 140 2 148 2019-06-29 1.20 1 46 1
16 7097 Perfect for Your Parents + Garden 17571 Jane Brooklyn Fort Greene 40.69169 -73.97185 Entire home/apt 215 2 198 2019-06-28 1.72 1 321 0
17 7322 Chelsea Perfect 18946 Doti Manhattan Chelsea 40.74192 -73.99501 Private room 140 1 260 2019-07-01 2.12 1 12 1
18 7726 Hip Historic Brownstone Apartment with Backyard 20950 Adam And Charity Brooklyn Crown Heights 40.67592 -73.94694 Entire home/apt 99 3 53 2019-06-22 4.44 1 21 1
19 7750 Huge 2 BR Upper East Cental Park 17985 Sing Manhattan East Harlem 40.79685 -73.94872 Entire home/apt 190 7 0 NaN NaN 2 249 0
20 7801 Sweet and Spacious Brooklyn Loft 21207 Chaya Brooklyn Williamsburg 40.71842 -73.95718 Entire home/apt 299 3 9 2011-12-28 0.07 1 0 0
21 8024 CBG CtyBGd HelpsHaiti rm#1:1-4 22486 Lisel Brooklyn Park Slope 40.68069 -73.97706 Private room 130 2 130 2019-07-01 1.09 6 347 1
22 8025 CBG Helps Haiti Room#2.5 22486 Lisel Brooklyn Park Slope 40.67989 -73.97798 Private room 80 1 39 2019-01-01 0.37 6 364 1
23 8110 CBG Helps Haiti Rm #2 22486 Lisel Brooklyn Park Slope 40.68001 -73.97865 Private room 110 2 71 2019-07-02 0.61 6 304 1
24 8490 MAISON DES SIRENES1,bohemian apartment 25183 Nathalie Brooklyn Bedford-Stuyvesant 40.68371 -73.94028 Entire home/apt 120 2 88 2019-06-19 0.73 2 233 1
25 8505 Sunny Bedroom Across Prospect Park 25326 Gregory Brooklyn Windsor Terrace 40.65599 -73.97519 Private room 60 1 19 2019-06-23 1.37 2 85 1
26 8700 Magnifique Suite au N de Manhattan - vue Cloitres 26394 Claude & Sophie Manhattan Inwood 40.86754 -73.92639 Private room 80 4 0 NaN NaN 1 0 1
27 9357 Midtown Pied-a-terre 30193 Tommi Manhattan Hell's Kitchen 40.76715 -73.98533 Entire home/apt 150 10 58 2017-08-13 0.49 1 75 0
28 9518 SPACIOUS, LOVELY FURNISHED MANHATTAN BEDROOM 31374 Shon Manhattan Inwood 40.86482 -73.92106 Private room 44 3 108 2019-06-15 1.11 3 311 1
29 9657 Modern 1 BR / NYC / EAST VILLAGE 21904 Dana Manhattan East Village 40.72920 -73.98542 Entire home/apt 180 14 29 2019-04-19 0.24 1 67 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
48865 36472171 1 bedroom in sunlit apartment 99144947 Brenda Manhattan Inwood 40.86845 -73.92449 Private room 80 1 0 NaN NaN 1 79 1
48866 36472710 CozyHideAway Suite 274225617 Alberth Queens Briarwood 40.70786 -73.81448 Entire home/apt 58 1 0 NaN NaN 1 159 1
48867 36473044 The place you were dreaming for.(only for guys) 261338177 Diana Brooklyn Gravesend 40.59080 -73.97116 Shared room 25 1 0 NaN NaN 6 338 1
48868 36473253 Heaven for you(only for guy) 261338177 Diana Brooklyn Gravesend 40.59118 -73.97119 Shared room 25 7 0 NaN NaN 6 365 1
48869 36474023 Cozy, Sunny Brooklyn Escape 1550580 Julia Brooklyn Bedford-Stuyvesant 40.68759 -73.95705 Private room 45 4 0 NaN NaN 1 7 1
48870 36474911 Cozy, clean Williamsburg 1- bedroom apartment 1273444 Tanja Brooklyn Williamsburg 40.71197 -73.94946 Entire home/apt 99 4 0 NaN NaN 1 22 1
48871 36475746 A LARGE ROOM - 1 MONTH MINIMUM - WASHER&DRYER; 144008701 Ozzy Ciao Manhattan Harlem 40.82233 -73.94687 Private room 35 29 0 NaN NaN 2 31 1
48872 36476675 Nycity-MyHome 8636072 Ben Manhattan Hell's Kitchen 40.76236 -73.99255 Entire home/apt 260 3 0 NaN NaN 1 9 0
48873 36477307 Brooklyn paradise 241945355 Clement & Rose Brooklyn Flatlands 40.63116 -73.92616 Entire home/apt 170 1 0 NaN NaN 2 363 0
48874 36477588 Short Term Rental in East Harlem 214535893 Jeffrey Manhattan East Harlem 40.79760 -73.93947 Private room 50 7 0 NaN NaN 1 22 1
48875 36478343 Welcome all as family 274273284 Anastasia Manhattan East Harlem 40.78749 -73.94749 Private room 140 1 0 NaN NaN 1 180 1
48876 36478357 Cozy, Air-Conditioned Private Bedroom in Harlem 177932088 Joseph Manhattan Harlem 40.80953 -73.95410 Private room 60 1 0 NaN NaN 1 26 1
48877 36479230 Studio sized room with beautiful light 65767720 Melanie Brooklyn Bushwick 40.70418 -73.91471 Private room 42 7 0 NaN NaN 1 16 1
48878 36479723 Room for rest 41326856 Jeerathinan Queens Elmhurst 40.74477 -73.87727 Private room 45 1 0 NaN NaN 5 172 1
48879 36480292 Gorgeous 1.5 Bdr with a private yard- Williams... 540335 Lee Brooklyn Williamsburg 40.71728 -73.94394 Entire home/apt 120 20 0 NaN NaN 1 22 1
48880 36481315 The Raccoon Artist Studio in Williamsburg New ... 208514239 Melki Brooklyn Williamsburg 40.71232 -73.94220 Entire home/apt 120 1 0 NaN NaN 3 365 1
48881 36481615 Peaceful space in Greenpoint, BK 274298453 Adrien Brooklyn Greenpoint 40.72585 -73.94001 Private room 54 6 0 NaN NaN 1 15 1
48882 36482231 Bushwick _ Myrtle-Wyckoff 66058896 Luisa Brooklyn Bushwick 40.69652 -73.91079 Private room 40 20 0 NaN NaN 1 31 1
48883 36482416 Sunny Bedroom NYC! Walking to Central Park!! 131529729 Kendall Manhattan East Harlem 40.79755 -73.93614 Private room 75 2 0 NaN NaN 2 364 1
48884 36482783 Brooklyn Oasis in the heart of Williamsburg 274307600 Jonathan Brooklyn Williamsburg 40.71790 -73.96238 Private room 190 7 0 NaN NaN 1 341 0
48885 36482809 Stunning Bedroom NYC! Walking to Central Park!! 131529729 Kendall Manhattan East Harlem 40.79633 -73.93605 Private room 75 2 0 NaN NaN 2 353 1
48886 36483010 Comfy 1 Bedroom in Midtown East 274311461 Scott Manhattan Midtown 40.75561 -73.96723 Entire home/apt 200 6 0 NaN NaN 1 176 0
48887 36483152 Garden Jewel Apartment in Williamsburg New York 208514239 Melki Brooklyn Williamsburg 40.71232 -73.94220 Entire home/apt 170 1 0 NaN NaN 3 365 0
48888 36484087 Spacious Room w/ Private Rooftop, Central loca... 274321313 Kat Manhattan Hell's Kitchen 40.76392 -73.99183 Private room 125 4 0 NaN NaN 1 31 1
48889 36484363 QUIT PRIVATE HOUSE 107716952 Michael Queens Jamaica 40.69137 -73.80844 Private room 65 1 0 NaN NaN 2 163 1
48890 36484665 Charming one bedroom - newly renovated rowhouse 8232441 Sabrina Brooklyn Bedford-Stuyvesant 40.67853 -73.94995 Private room 70 2 0 NaN NaN 2 9 1
48891 36485057 Affordable room in Bushwick/East Williamsburg 6570630 Marisol Brooklyn Bushwick 40.70184 -73.93317 Private room 40 4 0 NaN NaN 2 36 1
48892 36485431 Sunny Studio at Historical Neighborhood 23492952 Ilgar & Aysel Manhattan Harlem 40.81475 -73.94867 Entire home/apt 115 10 0 NaN NaN 1 27 1
48893 36485609 43rd St. Time Square-cozy single bed 30985759 Taz Manhattan Hell's Kitchen 40.75751 -73.99112 Shared room 55 1 0 NaN NaN 6 2 1
48894 36487245 Trendy duplex in the very heart of Hell's Kitchen 68119814 Christophe Manhattan Hell's Kitchen 40.76404 -73.98933 Private room 90 7 0 NaN NaN 1 23 1

48895 rows × 17 columns

Without looking at the full data yet, let's just ensure our prices are within valid ranges:

In [6]:
df['price'].describe()
Out[6]:
count    48895.000000
mean       152.720687
std        240.154170
min          0.000000
25%         69.000000
50%        106.000000
75%        175.000000
max      10000.000000
Name: price, dtype: float64

Uh-oh. We see that price has a minimum value of \$0. I highly doubt any unit in NYC is free. These data instances are garbage, so let's go ahead and remove any instance that has a price of \\$0.

In [7]:
print("original training size:", df.shape)
df = df.loc[df['price'] != 0]
print("new training size:", df.shape)
original training size: (48895, 17)
new training size: (48884, 17)

Now, let's split the data while ensuring that our test set has a fair distribution of affordable units, then further split our training set so as to create the development set:

In [8]:
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42, stratify=df['affordable'])
df_train, df_dev = train_test_split(df_train, test_size=0.25, random_state=99) #stratify=df_train['affordable'])

# ensure our dataset splits are of the % sizes we want
total_size = len(df_train) + len(df_dev) + len(df_test)
print("train:", len(df_train), "=>", len(df_train) / total_size)
print("dev:", len(df_dev), " =>", len(df_dev) / total_size)
print("test:", len(df_test), "=>", len(df_test) / total_size)
train: 29330 => 0.5999918173635546
dev: 9777  => 0.20000409131822272
test: 9777 => 0.20000409131822272

Let's remove the target value (i.e., affordable) from our current dataframes and create it as separate prediction dataframes.

In [9]:
# training
x_train = df_train.drop(['price', 'affordable'], axis=1)
y_train = pd.DataFrame(data=df_train['affordable'], columns=["affordable"])

# dev
x_dev = df_dev.drop(['price', 'affordable'], axis=1)
y_dev = pd.DataFrame(data=df_dev['affordable'], columns=["affordable"])

# test
x_test = df_test.drop(['price', 'affordable'], axis=1)
y_test = pd.DataFrame(data=df_test['affordable'], columns=["affordable"])

From now onwards, we will do EDA and cleaning based on the training set, x_train.

In [10]:
for col in x_train.columns:
    print(col, ":", np.sum([x_train[col].isnull()]))
id : 0
name : 12
host_id : 0
host_name : 12
borough : 0
neighbourhood : 0
latitude : 0
longitude : 0
room_type : 0
minimum_nights : 0
number_of_reviews : 0
last_review : 6065
reviews_per_month : 6065
calculated_host_listings_count : 0
availability_365 : 0

Oh dear. It appears ~6k of the rows have missing values concerning the reviews. It seems impossible to impute the last_review feature with reasonable values, as this is very specific to each unit. At best, we could guess the date based on the reviews_per_month, but that feature is missing for the same rows. Further, it might be difficult to replace reviews_per_month with reasonable values -- sure, we could fill in values to be the median value, but that seems wrong to generalize so heavily, especially for over 20% of our data. Consequently, let's just ignore these two columns.

In [11]:
x_train = x_train.drop(['last_review', 'reviews_per_month'], axis=1)
x_dev = x_dev.drop(['last_review', 'reviews_per_month'], axis=1)
x_test = x_test.drop(['last_review', 'reviews_per_month'], axis=1)

Let's look at the summary statistics of the data:

In [12]:
x_train.describe()
Out[12]:
id host_id latitude longitude minimum_nights number_of_reviews calculated_host_listings_count availability_365
count 2.933000e+04 2.933000e+04 29330.000000 29330.000000 29330.000000 29330.000000 29330.000000 29330.000000
mean 1.899091e+07 6.746725e+07 40.729049 -73.952129 6.891647 23.490829 7.111081 113.047017
std 1.102972e+07 7.863754e+07 0.054446 0.046320 19.236816 45.324235 32.904893 131.845296
min 2.539000e+03 2.438000e+03 40.499790 -74.242850 1.000000 0.000000 1.000000 0.000000
25% 9.380684e+06 7.794212e+06 40.690423 -73.983130 1.000000 1.000000 1.000000 0.000000
50% 1.960499e+07 3.049924e+07 40.723090 -73.955630 3.000000 5.000000 1.000000 44.000000
75% 2.921518e+07 1.074344e+08 40.763067 -73.936100 5.000000 24.000000 2.000000 228.000000
max 3.648561e+07 2.743213e+08 40.913060 -73.712990 1000.000000 629.000000 327.000000 365.000000

Next, we see that the minimum_nights feature has a maximum value of 1,250. That's almost 3.5 years, which is probably longer than the duration that most people rent an apartment. This seems anomalous and wrong. Let's discard it and other units that are outrageous. Well, what constitutes 'outrageous'? We see that the standard deviation for minimum_nights is 21.24. If we assume our distribution of values are normally distributed, then only using values that are within 2 standard deviations of the mean would yield us with ~95% of the original data. However, we have no reason to believe our data is actually normally distributed, especially since our mean is 7. To have a better idea of our actual values, let's plot it as a histogram.

In [13]:
fig, ax = plt.subplots(1,1)
ax.hist(x_train['minimum_nights'], 25, log=True)
plt.xlabel('minimum_nights')
plt.ylabel('count')
Out[13]:
Text(0, 0.5, 'count')

Yea, that instance was a strong outlier, and the host was being ridiculously greedy. That's a clever way to get out a multi-year lease. Notice that we are using log-scale. Clearly, a lot of our mass is from units less than 365 days. To get a better sense of that subset, let's re-plot only units with minumum_nights < 365 days.

In [14]:
subset = x_train['minimum_nights']<365
fig, ax = plt.subplots(1,1)
ax.hist(x_train['minimum_nights'][subset], 30, log=True)
plt.xlabel('minimum_nights')
plt.ylabel('count')
Out[14]:
Text(0, 0.5, 'count')

Ok, that doesn't look too bad, as most units require < 30 nights. It's surprising that some hosts list an unreasonable requirement for the minimum number of nights. There is a risk that any host that lists such an unreasonable value might also have other incorrect information. Personally, I think anything beyond 30 days could be suspicious. If we were to exclude any unit that requires more than 30 days, how many instances would we be ignoring?

In [15]:
len(x_train.loc[x_train['minimum_nights']>30])
Out[15]:
436

Alright, we'd be throwing away 436 out of our ~30k entries. That's roughly 1.5\% of our data. While we generally want to keep and use as much data as we can, I think this is an okay amount to discard, especially considering (1) we have a decently large amount of data remaining, and (2) the entries beyond a 30-day-min could be unrealiable.

In [16]:
good_subset = x_train['minimum_nights'] <= 30
x_train = x_train.loc[good_subset]
y_train = y_train.loc[good_subset]

Notice that we only trimmed our training data, not our development or testing data. I am making this choice because in real scenarios, we would not know the nature of the testing data values. We pre-processed our data to ignore all data that has a price of $0, and to ignore certain columns (even if it's in the testing set), but that was fair because those columns proved to be obvious, bogus element of the dataset. However, it would be unfair to inspect the values of the training set and then to further trim the development and testing set accordingly, conditioned on certain data values.

The remaining columns of our training data all have reasonable summary statistics. None of the min's or max's are cause for concern, and we have no reason to assert a certain distribution of values. Since all the feature values are within reasonable ranges, and there are no missing values (NaNs) remaining, we can confidently move foward. To recap, our remaining columns are now:

In [17]:
[col for col in x_train.columns] # easier to read vertically than horizontally
Out[17]:
['id',
 'name',
 'host_id',
 'host_name',
 'borough',
 'neighbourhood',
 'latitude',
 'longitude',
 'room_type',
 'minimum_nights',
 'number_of_reviews',
 'calculated_host_listings_count',
 'availability_365']

We don't have a terribly large number of features. This allows us to inspect every pairwise interaction. A scatterplot is great for this, as it provides us with a high-level picture of how every pair of features correlates. If any subplot of features depicts a linear relationship (i.e., a clear, concise path with mass concentrated together), then we can assume there exists some collinearity -- that the two features overlap in what they are capturing and that they are not independent from each other.

In [18]:
scatter_matrix(x_train, figsize=(30,20));

Part 2: Predicting with MLE

Maximum-likelihood estimation (MLE) is a very simple model which does not require learning any weight/coefficient parameters. Specifically, MLE selects the parameter value ($y$) that makes the observed data most probable, so as to maximize the likelihood function. This choice of $y$ is completely independent of $x$. That is, a MLE model returns the $y$-value that was probable in the data its seen.


Exercise 1: Using the training data, select the MLE for $y$, where $y \in \{0,1\}$. Using the development to evaluate your MLE model, what is the accuracy (the % correct)?
In [19]:
# [SOLUTION: REMOVE]
mle_y = y_train['affordable'].value_counts().idxmax()
dev_accuracy = y_dev['affordable'].value_counts()[mle_y] / len(y_dev['affordable'])
dev_accuracy
Out[19]:
0.650301728546589

Part 3: Predicting with Linear Regression

Now, let's actually use our features to make more informed predictions. Since our model needs to use numeric values, not textual ones, let's use ONLY the following features for our linear model:

  • borough, using 1-hot encodings. There are 5 distinct boroughs, so represent them via 4 unique columns.
  • latitude
  • longitude
  • room_type, using 1-hot encodings. There are 3 distinct room_types, so represent them via 2 unique columns.
  • minimum_nights
  • number_of_reviews
  • calculated_host_listings_count
  • availability_365


Exercise 2: Convert `x_train` to have only the columns listed above. The shape should be 28,894 x 12
In [20]:
# [SOLUTION: remove!!]
x_train = pd.get_dummies(x_train, columns=['borough', 'room_type'], drop_first=True)
x_train = x_train.drop(['id', 'name', 'host_id', 'host_name', 'neighbourhood'], axis=1)

x_dev = pd.get_dummies(x_dev, columns=['borough', 'room_type'], drop_first=True)
x_dev = x_dev.drop(['id', 'name', 'host_id', 'host_name', 'neighbourhood'], axis=1)

x_test = pd.get_dummies(x_test, columns=['borough', 'room_type'], drop_first=True)
x_test = x_test.drop(['id', 'name', 'host_id', 'host_name', 'neighbourhood'], axis=1)
Exercise 3: For this exercise, perform multi-linear regression and evaluate it on the development set. Do not introduce any polynomial terms or any other new features. Any prediction that is >= 0.5 should be treated as being an 'affordable' prediction. Anything below 0.5 should be 'unaffordable'. What is your accuracy %? (). Is this what you expected? Is this reasonable, and if not, what do you think are the issues?
In [21]:
# [SOLUTION HERE]
# training set
x_train_padded = sm.add_constant(x_train) # to allow for beta_0
y_train_lr = y_train['affordable'].values.reshape(-1,1)

# development set
x_dev_padded = sm.add_constant(x_dev)
y_dev_lr = y_dev['affordable'].values.reshape(-1,1)
/usr/local/lib/python3.7/site-packages/numpy/core/fromnumeric.py:2495: FutureWarning: Method .ptp is deprecated and will be removed in a future version. Use numpy.ptp instead.
  return ptp(axis=axis, out=out, **kwargs)
In [22]:
model = OLS(y_train_lr, x_train_padded)
results = model.fit()
results.summary()
Out[22]:
OLS Regression Results
Dep. Variable: y R-squared: 0.371
Model: OLS Adj. R-squared: 0.371
Method: Least Squares F-statistic: 1422.
Date: Sun, 13 Oct 2019 Prob (F-statistic): 0.00
Time: 15:07:15 Log-Likelihood: -12826.
No. Observations: 28894 AIC: 2.568e+04
Df Residuals: 28881 BIC: 2.579e+04
Df Model: 12
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
const 96.8761 6.899 14.042 0.000 83.354 110.398
latitude 0.5947 0.067 8.825 0.000 0.463 0.727
longitude 1.6313 0.077 21.062 0.000 1.480 1.783
minimum_nights 0.0052 0.000 17.194 0.000 0.005 0.006
number_of_reviews 0.0008 5.08e-05 14.877 0.000 0.001 0.001
calculated_host_listings_count -0.0007 7.39e-05 -9.368 0.000 -0.001 -0.001
availability_365 -0.0004 1.85e-05 -23.768 0.000 -0.000 -0.000
borough_Brooklyn 0.0728 0.019 3.828 0.000 0.036 0.110
borough_Manhattan -0.1229 0.017 -7.107 0.000 -0.157 -0.089
borough_Queens -0.0038 0.018 -0.209 0.835 -0.040 0.032
borough_Staten Island 0.4936 0.036 13.864 0.000 0.424 0.563
room_type_Private room 0.4636 0.005 99.360 0.000 0.454 0.473
room_type_Shared room 0.5049 0.015 34.059 0.000 0.476 0.534
Omnibus: 565.926 Durbin-Watson: 2.004
Prob(Omnibus): 0.000 Jarque-Bera (JB): 331.005
Skew: -0.093 Prob(JB): 1.33e-72
Kurtosis: 2.510 Cond. No. 5.73e+05


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.73e+05. This might indicate that there are
strong multicollinearity or other numerical problems.




In [23]:
# your code here
y_hat_dev = results.predict(exog=x_dev_padded)

# calculating and reporting the requested values, particularly the Test R^2
print('Train R^2 = {:.4}'.format(results.rsquared))
print('Test R^2 = {:.4}'.format(r2_score(y_dev_lr, y_hat_dev)))

# i'm using numpy's round() function, instead of manually checking for values above 0.5
accuracy_score(y_dev, np.round(y_hat_dev))
Train R^2 = 0.3714
Test R^2 = 0.3
Out[23]:
0.7751866625754321
Exercise 4: Akin to what you did in Homework 3, regularize your model via Ridge regression and Lasso regression. Specifically, report the model's accuracy on the development set (as you did in Exercise 2); do so while varying the alpha (aka lambda) parameter to be each of these values: [.001, .01, .05, .1, .5, 1, 5, 10, 50, 100, 500]). What is your best result?
In [32]:
# [SOLUTION HERE]
best_accuracy = -1
best_model = None
for cur_alpha in [0.001, .01, .05, .1, .5, 1, 5, 10, 50, 100, 500]:

    # fit (using Ridge Regression), predict, and score
    fitted_ridge = Ridge(alpha=cur_alpha).fit(x_train, y_train_lr)
    y_hat_dev = fitted_ridge.predict(x_dev).reshape(1,-1)[0]
    
    cur_accuracy = accuracy_score(y_dev['affordable'].to_numpy(), np.round(y_hat_dev))
    if cur_accuracy > best_accuracy:
        best_accuracy = cur_accuracy
        best_model = fitted_ridge
    
    # fit (using Lasso Regression), predict, and score
    fitted_lasso = Lasso(alpha=cur_alpha).fit(x_train, y_train_lr)
    y_hat_dev = fitted_lasso.predict(x_dev).reshape(1,-1)[0]
    cur_accuracy = accuracy_score(y_dev['affordable'].to_numpy(), np.round(y_hat_dev))
    if cur_accuracy > best_accuracy:
        best_accuracy = cur_accuracy
        best_model = fitted_lasso
    
print("best_model:", best_model, "yielded accuracy of:", best_accuracy)
best_model: Ridge(alpha=100, copy_X=True, fit_intercept=True, max_iter=None,
      normalize=False, random_state=None, solver='auto', tol=0.001) yielded accuracy of: 0.7777436841566943

Note, we did not perform cross-validation, so perhaps our model could have performed even better, had we done so.

Exercise 5: Plot two histograms of the residuals from your best performing linear regression model (having trained on the training set, one plot should show the distribution of training set residuals and another plot for the distribution of development set residuals). Does this adhere to the assumptions of a linear model?
In [36]:
# [SOLUTION HERE]

# construct training residuals
y_hat_train = best_model.predict(x_train)
training_residuals = y_train_lr[:,0] - y_hat_train[:,0]

# construct dev residuals
y_hat_dev = best_model.predict(x_dev).reshape(1,-1)[0]
dev_residuals = y_dev['affordable'].to_numpy() - y_hat_dev

# make plot of training residuals
fig, axes = plt.subplots(1,2,figsize=(15,5))

axes[0].set_title('Histogram of Training Residuals')
axes[0].hist(training_residuals, alpha=0.8, bins=20)
axes[0].axhline(0, c='black', lw=2)
axes[0].set_xlabel(r'residuals')

# make plot of dev residuals
axes[1].set_title('Histogram of Development Residuals')
axes[1].hist(dev_residuals, alpha=0.4, bins=20)
axes[1].axhline(0, c='black', lw=2)
axes[1].set_xlabel(r'residuals')
plt.show()

print("min residual:", min(dev_residuals))
min residual: -6.374920636446401

The above plots suggest that the training data is not too conducive to being modelled by a linear regression model, for the residuals seem to be bimodal -- there isn't a single normal distrubion of residual values. Also, just for fun, we plotted the errors/residuals from having evaluated on the unseen development set. Doing so would provide no information about the assumptions of linear model being appropriate for our training data (as the unseen data could be completely dissimiliar from anything we saw during training). Yet, we hope that the development set residuals would be minimal, and it's nice that the errors seem to follow a normal distrubtion -- although, there's some outliers that we perform badly on, but this can happen.

Part 4: Binary Logistic Regression

Linear regression is usually a good baseline model, but since the outcome we're trying to predict only takes values 0 and 1 we'll want to use logistic regression instead of basic linear regression.

We will use sklearn for now, but statsmodels also provides LogisticRegression, along with nifty features like confidence intervals.

First, let's import the necessary classes:

In [37]:
from sklearn.linear_model import LogisticRegression

Next, let's instantiate a new LogisticRegression model:

In [38]:
lr = LogisticRegression()

Now, we can fit our model with just 1 line!

In [39]:
lr.fit(x_train, y_train['affordable'])
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:432: FutureWarning: Default solver will be changed to 'lbfgs' in 0.22. Specify a solver to silence this warning.
  FutureWarning)
Out[39]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)
Exercise 6: Using .predict(), make predictions on the development set

See .predict() documentation here. NOTE: regularization is applied by default. Especially pay attention to the following arguments/parameters:

  • C penalty, which we discussed in class. Experiment with varying values from 0 to 100 million!
  • max_iterations: experiment with values from 5 to 5000. Do you expect more iterations to always perform better? Why or why not?
  • penalty: for designating L1 (Lasso) or L2 (Ridge) loss; default is L2
  • solver: especially for the multi-class setting

After fitting the model, you can print the .coef_ value to see its coefficient.

In [40]:
y_hat_dev = lr.predict(x_dev)
initial_score = accuracy_score(y_dev['affordable'].to_numpy(), y_hat_dev)
print("our initial logistic regression model yielded accuracy score of:", initial_score)

best_accuracy = -1
best_model = None

# experiment with different values
c_vals = [1, 10, 100, 1000, 10000, 100000, 1000000, 10000000]
num_iters = [5, 10, 100, 1000, 5000]
for c_val in c_vals:
    for num_iter in num_iters:
        lr = LogisticRegression(C=c_val, solver='liblinear', max_iter=num_iter)
        lr.fit(x_train, y_train['affordable'])
        y_hat_dev = lr.predict(x_dev)
        cur_accuracy = accuracy_score(y_dev['affordable'].to_numpy(), y_hat_dev)

        if cur_accuracy > best_accuracy:
            best_accuracy = cur_accuracy
            best_model = lr

print("best logistic regression model:", lr, "yielded an accuracy score:", best_accuracy)
print("its learned coefficients:", len(best_model.coef_[0]))
print("the coefficients align with our features:", x_dev.shape)
our initial logistic regression model yielded accuracy score of: 0.7737547304899254
/usr/local/lib/python3.7/site-packages/sklearn/svm/base.py:929: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  "the number of iterations.", ConvergenceWarning)
/usr/local/lib/python3.7/site-packages/sklearn/svm/base.py:929: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  "the number of iterations.", ConvergenceWarning)
/usr/local/lib/python3.7/site-packages/sklearn/svm/base.py:929: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  "the number of iterations.", ConvergenceWarning)
/usr/local/lib/python3.7/site-packages/sklearn/svm/base.py:929: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  "the number of iterations.", ConvergenceWarning)
/usr/local/lib/python3.7/site-packages/sklearn/svm/base.py:929: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  "the number of iterations.", ConvergenceWarning)
/usr/local/lib/python3.7/site-packages/sklearn/svm/base.py:929: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  "the number of iterations.", ConvergenceWarning)
/usr/local/lib/python3.7/site-packages/sklearn/svm/base.py:929: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  "the number of iterations.", ConvergenceWarning)
/usr/local/lib/python3.7/site-packages/sklearn/svm/base.py:929: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  "the number of iterations.", ConvergenceWarning)
/usr/local/lib/python3.7/site-packages/sklearn/svm/base.py:929: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  "the number of iterations.", ConvergenceWarning)
/usr/local/lib/python3.7/site-packages/sklearn/svm/base.py:929: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  "the number of iterations.", ConvergenceWarning)
/usr/local/lib/python3.7/site-packages/sklearn/svm/base.py:929: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  "the number of iterations.", ConvergenceWarning)
/usr/local/lib/python3.7/site-packages/sklearn/svm/base.py:929: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  "the number of iterations.", ConvergenceWarning)
/usr/local/lib/python3.7/site-packages/sklearn/svm/base.py:929: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  "the number of iterations.", ConvergenceWarning)
/usr/local/lib/python3.7/site-packages/sklearn/svm/base.py:929: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  "the number of iterations.", ConvergenceWarning)
/usr/local/lib/python3.7/site-packages/sklearn/svm/base.py:929: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  "the number of iterations.", ConvergenceWarning)
/usr/local/lib/python3.7/site-packages/sklearn/svm/base.py:929: ConvergenceWarning: Liblinear failed to converge, increase the number of iterations.
  "the number of iterations.", ConvergenceWarning)
best logistic regression model: LogisticRegression(C=10000000, class_weight=None, dual=False,
                   fit_intercept=True, intercept_scaling=1, l1_ratio=None,
                   max_iter=5000, multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False) yielded an accuracy score: 0.7737547304899254
its learned coefficients: 12
the coefficients align with our features: (9777, 12)

The results here should show that for this dataset, logistic regression offered effectively identical performance as linear regression. There are two main takeaways from this:

  • logistic regression should not be viewed as being superior to linear regression; it should be viewed as a solution to a different type of problem -- classification (predicting categorical outputs), not regression (predicting continuous-valued outputs).
  • In our situation, our two categories/classes (affordable or not) had an ordinal nature. That is, the continuum of prices directly aligned with the structure of our two classes. Alternatively, you could imagine other scenarios where our two categories are nominal and thus un-rankable (e.g., predicting cancer or not, or predicting which NYC borough an AirBnB is in based on its property features).

Part 5 (The Real Challenge): Multiclass Classification

Before we move on, let's consider a more common use case of logistic regression: predicting not just a binary variable, but what level a categorical variable will take. Instead of breaking the price variable into two classes (affordable being true or false), we may care for more fine-level granularity.

For this exercise, go back to the original df dataframe and construct 5 classes of pricing:

  • budget: < 80
  • affordable: 80 < x < 120
  • average: 120 < x < 180
  • expensive: 180 < x < 240
  • very expensive: 240 < x

The cut function obviously stores a lot of extra information for us. It's a very useful tool for discretizing an existing variable.

Exercise 8: After making the new categories, perform the same predictions as above. Compare your results. What improvements could we make? (not just w/ the parameters, but with possibly expanding and using other features from our original dataset!)
In [41]:
# creates multi-class labels for training
x_train_multiclass = x_train.copy()
x_train_multiclass['price_level'] = pd.cut(df_train['price'],[0,80,120,180,240,float('inf')], labels=[0,1,2,3,4])
y_train_multiclass = pd.DataFrame(data=x_train_multiclass['price_level'], columns=["price_level"])
x_train_multiclass = x_train_multiclass.drop(['price_level'], axis=1)

# creats multi-class labels for dev
x_dev_multiclass = x_dev.copy()
x_dev_multiclass['price_level'] = pd.cut(df_dev['price'],[0,80,120,180,240,float('inf')], labels=[0,1,2,3,4])
y_dev_multiclass = pd.DataFrame(data=x_dev_multiclass['price_level'], columns=["price_level"])
x_dev_multiclass = x_dev_multiclass.drop(['price_level'], axis=1)
In [42]:
best_accuracy = -1
best_model = None

# experiment with different values
c_vals = [1, 10, 100, 1000, 10000]
num_iters = [10, 100, 1000, 5000]
for c_val in c_vals:
    for num_iter in num_iters:
        lr = LogisticRegression(solver="lbfgs", max_iter=10000)
        lr.fit(x_train_multiclass, y_train_multiclass['price_level'])
        y_hat_dev = lr.predict(x_dev_multiclass)
        cur_accuracy = accuracy_score(y_dev_multiclass['price_level'].to_numpy(), y_hat_dev)
        print(cur_accuracy)
        if cur_accuracy > best_accuracy:
            best_accuracy = cur_accuracy
            best_model = lr
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:469: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.
  "this warning.", FutureWarning)
0.5048583410043981
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:469: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.
  "this warning.", FutureWarning)
0.5048583410043981
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:469: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.
  "this warning.", FutureWarning)
0.5048583410043981
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:469: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.
  "this warning.", FutureWarning)
0.5048583410043981
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:469: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.
  "this warning.", FutureWarning)
0.5048583410043981
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:469: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.
  "this warning.", FutureWarning)
0.5048583410043981
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:469: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.
  "this warning.", FutureWarning)
0.5048583410043981
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:469: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.
  "this warning.", FutureWarning)
0.5048583410043981
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:469: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.
  "this warning.", FutureWarning)
0.5048583410043981
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:469: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.
  "this warning.", FutureWarning)
0.5048583410043981
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:469: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.
  "this warning.", FutureWarning)
0.5048583410043981
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:469: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.
  "this warning.", FutureWarning)
0.5048583410043981
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:469: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.
  "this warning.", FutureWarning)
0.5048583410043981
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:469: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.
  "this warning.", FutureWarning)
0.5048583410043981
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:469: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.
  "this warning.", FutureWarning)
0.5048583410043981
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:469: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.
  "this warning.", FutureWarning)
0.5048583410043981
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:469: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.
  "this warning.", FutureWarning)
0.5048583410043981
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:469: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.
  "this warning.", FutureWarning)
0.5048583410043981
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:469: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.
  "this warning.", FutureWarning)
0.5048583410043981
/usr/local/lib/python3.7/site-packages/sklearn/linear_model/logistic.py:469: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.
  "this warning.", FutureWarning)
0.5048583410043981
In [43]:
print("best logistic regression model:", lr, "yielded an accuracy score:", best_accuracy)
print("its learned coefficients:", len(best_model.coef_[0]))
print("the coefficients align with our features:", x_dev.shape)
for i in range(len(x_dev.columns)):
    print("feature:", x_dev.columns[i], "; coef:", best_model.coef_[0][i])
best logistic regression model: LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=10000,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False) yielded an accuracy score: 0.5048583410043981
its learned coefficients: 12
the coefficients align with our features: (9777, 12)
feature: latitude ; coef: 7.312415440262378
feature: longitude ; coef: 4.059059803768657
feature: minimum_nights ; coef: 0.044388062697350704
feature: number_of_reviews ; coef: 0.0013794514234422481
feature: calculated_host_listings_count ; coef: -0.0034845157546768537
feature: availability_365 ; coef: -0.0021468646692259148
feature: borough_Brooklyn ; coef: 0.19507151988381405
feature: borough_Manhattan ; coef: -1.7898799616161736
feature: borough_Queens ; coef: 0.06310066394457561
feature: borough_Staten Island ; coef: 2.4381414950649947
feature: room_type_Private room ; coef: 3.4513468035217554
feature: room_type_Shared room ; coef: 4.686077440118879

Despite having 5 distinct price categories now, our performance isn't too bad! To increase performance further, we could first use cross-validation. Then, we could look at our original data and try to better use its features. For example, perhaps it would be useful to expand out our 'neighbourhood' feature into one-hot encodings? I imagine the fine-level, granular information of 'neighbourhood' correlates well with price. The only concern and question to ask ourselves is how much data do we have for each neighbourhood? (We'd aim to have plenty of representative data). Related, the longitude and latitude features provide fine-level information, but perhaps it's hard for the model to use it since the range is so small. If we were to scale the lat and long values to be between 0 and 1, it might allow for the model to better distinguish between the nuanced values.

For this exercise, we uniformly care about each price level and prediction thereof. However, in some scenarios, our classification accuracy for some categories is much more important than others (e.g., predicting cancer or not). That is, our false negatives (misses) are way more serious and potentially deadly. For situations like this, it is better to error on the side of caution and allow for false positives (aka false alarms) moreso than false negatives (misses). To handle this, one could weight each class, and specify such during training/ fitting our model. As we learned in class, we can plot the performance as we vary the prediction threshold, while paying attention to how that affects the number of false negatives and false positives.