Key Word(s): logistic regression, linear regression, mle
CS-109A Introduction to Data Science
Lab 6: Logistic Regression¶
Harvard University
Fall 2019
Instructors: Pavlos Protopapas, Kevin Rader, Chris Tanner
Lab Instructors: Chris Tanner and Eleni Kaxiras.
Contributors: Chris Tanner
## RUN THIS CELL TO PROPERLY HIGHLIGHT THE EXERCISES
import requests
from IPython.core.display import HTML
styles = requests.get("https://raw.githubusercontent.com/Harvard-IACS/2018-CS109A/master/content/styles/cs109.css").text
HTML(styles)
Learning Goals¶
In this lab, we'll explore different models used to predict which of several labels applies to a new datapoint based on labels observed in the training data.
By the end of this lab, you should:
- Be familiar with the sklearn implementations of:
  - Linear Regression
  - Logistic Regression
- Be able to make an informed choice of model based on the data at hand
- (Bonus) Structure your sklearn code into Pipelines to make building, fitting, and tracking your models easier (a brief sketch follows below)
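For the bonus goal, here is a minimal sketch (not part of the original lab) of what an sklearn Pipeline could look like once the data is prepared later in the lab; the step names are illustrative:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# illustrative: chain preprocessing and a classifier so fitting and predicting happen through one object
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])
# once x_train/y_train are built below, usage would be:
# pipe.fit(x_train, y_train['affordable']); pipe.predict(x_dev)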
# IMPORTS GALORE
%matplotlib inline
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix
import statsmodels.api as sm
from statsmodels.api import OLS
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.metrics import r2_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_curve
from sklearn.metrics import auc
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
Part 1: The AirBnB NYC 2019 Dataset + EDA¶
The dataset contains information about AirBnB listings in NYC from 2019. There are ~49k listings and 16 features for each:
- id: listing ID
- name: name of the listing
- host_id: host ID
- host_name: name of the host
- neighbourhood_group: NYC borough
- neighbourhood: neighborhood
- latitude: latitude coordinates
- longitude: longitude coordinates
- room_type: listing space type (e.g., private room, entire home)
- price: price in dollars per night
- minimum_nights: number of min. nights required for booking
- number_of_reviews: number of reviews
- last_review: date of the last review
- reviews_per_month: number of reviews per month
- calculated_host_listings_count: number of listings the host has
- availability_365: number of days the listing is available for booking
Our goal is to predict the price of unseen housing units as being 'affordable' or 'unaffordable', by using their features. We will assume that this task is for a particular client who has a specific budget and would like to simplify the problem by classifying any unit that costs less than \$150 per night as 'affordable' and any unit that costs \$150 or more as 'unaffordable'.
For this task, we will exercise our normal data science pipeline -- from EDA to modelling and visualization. In particular, we will show the performance of 3 classifiers:
- Maximum Likelihood Estimate (MLE)
- Linear Regression
- Logistic Regression
Let's get started! And awaaaaay we go!
Read-in and checking¶
We do the usual read-in and verification of the data:
df = pd.read_csv("../data/nyc_airbnb.csv") #, index_col=0)
df.head()
Building the training/dev/testing data¶
As usual, we split the data before we begin our analysis. It would be unfair to cheat by looking at the testing data. Let's divide the data into 60% training, 20% development (aka validation), and 20% testing. However, before we split the data, let's make the simple transformation of converting the prices into categories of affordable or not.
df['affordable'] = np.where(df['price'] < 150, 1, 0)
df
NOTE: The affordable column now has a value of 1 whenever the price is < 150, and 0 otherwise.
Also, the feature named neighbourhood_group can easily be confused with neighbourhood, so let's go ahead and rename it to borough, as that is more distinct:
df.rename(columns={"neighbourhood_group": "borough"}, inplace=True)
df
Without looking at the full data yet, let's just ensure our prices are within valid ranges:
df['price'].describe()
Uh-oh. We see that price has a minimum value of \$0. I highly doubt any unit in NYC is free. These data instances are garbage, so let's go ahead and remove any instance that has a price of \$0.
print("original training size:", df.shape)
df = df.loc[df['price'] != 0]
print("new training size:", df.shape)
Now, let's split the data while ensuring that our test set has a fair distribution of affordable units, then further split our training set so as to create the development set:
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42, stratify=df['affordable'])
df_train, df_dev = train_test_split(df_train, test_size=0.25, random_state=99) #stratify=df_train['affordable'])
# ensure our dataset splits are of the % sizes we want
total_size = len(df_train) + len(df_dev) + len(df_test)
print("train:", len(df_train), "=>", len(df_train) / total_size)
print("dev:", len(df_dev), " =>", len(df_dev) / total_size)
print("test:", len(df_test), "=>", len(df_test) / total_size)
Let's remove the target value (i.e., affordable) from our feature dataframes and store it in separate target dataframes.
# training
x_train = df_train.drop(['price', 'affordable'], axis=1)
y_train = pd.DataFrame(data=df_train['affordable'], columns=["affordable"])
# dev
x_dev = df_dev.drop(['price', 'affordable'], axis=1)
y_dev = pd.DataFrame(data=df_dev['affordable'], columns=["affordable"])
# test
x_test = df_test.drop(['price', 'affordable'], axis=1)
y_test = pd.DataFrame(data=df_test['affordable'], columns=["affordable"])
From now on, we will do EDA and cleaning based on the training set, x_train.
for col in x_train.columns:
    print(col, ":", x_train[col].isnull().sum())
Oh dear. It appears ~6k of the rows have missing values concerning the reviews. It seems impossible to impute the last_review feature with reasonable values, as this is very specific to each unit. At best, we could guess the date based on reviews_per_month, but that feature is missing for the same rows. Further, it might be difficult to replace reviews_per_month with reasonable values -- sure, we could fill in the median value, but that seems wrong to generalize so heavily, especially for over 20% of our data. Consequently, let's just ignore these two columns.
x_train = x_train.drop(['last_review', 'reviews_per_month'], axis=1)
x_dev = x_dev.drop(['last_review', 'reviews_per_month'], axis=1)
x_test = x_test.drop(['last_review', 'reviews_per_month'], axis=1)
Let's look at the summary statistics of the data:
x_train.describe()
Next, we see that the minimum_nights feature has a maximum value of 1,250. That's almost 3.5 years, which is probably longer than the duration that most people rent an apartment. This seems anomalous and wrong. Let's discard it and any other units that are outrageous. Well, what constitutes 'outrageous'? We see that the standard deviation for minimum_nights is 21.24. If we assume our values are normally distributed, then keeping only values within 2 standard deviations of the mean would leave us with ~95% of the original data. However, we have no reason to believe our data is actually normally distributed, especially since our mean is 7. To get a better idea of our actual values, let's plot them as a histogram.
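Before plotting, here's a quick sanity check (a small aside, not part of the original lab) of how much data a naive 2-standard-deviation rule would actually keep:
# fraction of listings within 2 standard deviations of the mean minimum_nights
mn = x_train['minimum_nights']
lower, upper = mn.mean() - 2 * mn.std(), mn.mean() + 2 * mn.std()
print("fraction within 2 SDs of the mean: {:.3f}".format(mn.between(lower, upper).mean()))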
fig, ax = plt.subplots(1,1)
ax.hist(x_train['minimum_nights'], 25, log=True)
plt.xlabel('minimum_nights')
plt.ylabel('count')
Yea, that instance was a strong outlier, and the host was being ridiculously greedy -- that's a clever way to turn a listing into a multi-year lease. Notice that we are using a log scale. Clearly, most of our mass comes from units requiring fewer than 365 nights. To get a better sense of that subset, let's re-plot only the units with minimum_nights < 365.
subset = x_train['minimum_nights']<365
fig, ax = plt.subplots(1,1)
ax.hist(x_train['minimum_nights'][subset], 30, log=True)
plt.xlabel('minimum_nights')
plt.ylabel('count')
Ok, that doesn't look too bad, as most units require < 30 nights. It's surprising that some hosts list such unreasonable minimum-night requirements. There is a risk that any host who lists such an unreasonable value might also have other incorrect information. Personally, I think anything beyond 30 days could be suspicious. If we were to exclude any unit that requires more than 30 nights, how many instances would we be ignoring?
len(x_train.loc[x_train['minimum_nights']>30])
Alright, we'd be throwing away 436 of our ~30k entries. That's roughly 1.5% of our data. While we generally want to keep and use as much data as we can, I think this is an acceptable amount to discard, especially considering (1) we have a decently large amount of data remaining, and (2) the entries beyond a 30-day minimum could be unreliable.
good_subset = x_train['minimum_nights'] <= 30
x_train = x_train.loc[good_subset]
y_train = y_train.loc[good_subset]
Notice that we only trimmed our training data, not our development or testing data. I am making this choice because in real scenarios, we would not know the nature of the testing data values. We pre-processed our data to ignore all rows that have a price of \$0 and to drop certain columns (even in the testing set), but that was fair because those rows and columns proved to be obvious, bogus elements of the dataset. However, it would be unfair to inspect the values of the training set and then further trim the development and testing sets accordingly, conditioned on certain data values.
The remaining columns of our training data all have reasonable summary statistics. None of the minimums or maximums are cause for concern, and we have no reason to assert a certain distribution of values. Since all the feature values are within reasonable ranges, and there are no missing values (NaNs) remaining, we can confidently move forward. To recap, our remaining columns are now:
[col for col in x_train.columns] # easier to read vertically than horizontally
We don't have a terribly large number of features, which allows us to inspect every pairwise interaction. A scatter matrix is great for this, as it provides a high-level picture of how every pair of features correlates. If any subplot depicts a linear relationship (i.e., a clear, concentrated path of mass), then we can assume there is some collinearity -- the two features overlap in what they capture and are not independent of each other.
scatter_matrix(x_train, figsize=(30,20));
Part 2: Predicting with MLE¶
Maximum-likelihood estimation (MLE) gives a very simple model which does not require learning any weight/coefficient parameters. Specifically, MLE selects the label value ($y$) that makes the observed data most probable, so as to maximize the likelihood function. This choice of $y$ is completely independent of $x$. That is, an MLE model simply returns the $y$-value that was most probable in the data it has seen.
# the most frequent label in the training set is the MLE prediction for every unit
mle_y = y_train['affordable'].value_counts().idxmax()
# dev accuracy is simply the fraction of dev labels that equal this prediction
dev_accuracy = y_dev['affordable'].value_counts()[mle_y] / len(y_dev['affordable'])
dev_accuracy
Part 3: Predicting with Linear Regression¶
Now, let's actually use our features to make more informed predictions. Since our model needs to use numeric values, not textual ones, let's use ONLY the following features for our linear model:
- borough, using 1-hot encodings. There are 5 distinct boroughs, so represent them via 4 unique columns.
- latitude
- longitude
- room_type, using 1-hot encodings. There are 3 distinct room_types, so represent them via 2 unique columns.
- minimum_nights
- number_of_reviews
- calculated_host_listings_count
- availability_365
x_train = pd.get_dummies(x_train, columns=['borough', 'room_type'], drop_first=True)
x_train = x_train.drop(['id', 'name', 'host_id', 'host_name', 'neighbourhood'], axis=1)
x_dev = pd.get_dummies(x_dev, columns=['borough', 'room_type'], drop_first=True)
x_dev = x_dev.drop(['id', 'name', 'host_id', 'host_name', 'neighbourhood'], axis=1)
x_test = pd.get_dummies(x_test, columns=['borough', 'room_type'], drop_first=True)
x_test = x_test.drop(['id', 'name', 'host_id', 'host_name', 'neighbourhood'], axis=1)
# training set
x_train_padded = sm.add_constant(x_train) # to allow for beta_0
y_train_lr = y_train['affordable'].values.reshape(-1,1)
# development set
x_dev_padded = sm.add_constant(x_dev)
y_dev_lr = y_dev['affordable'].values.reshape(-1,1)
model = OLS(y_train_lr, x_train_padded)
results = model.fit()
results.summary()
# your code here
y_hat_dev = results.predict(exog=x_dev_padded)
# calculating and reporting the requested values, particularly the dev-set R^2
print('Train R^2 = {:.4}'.format(results.rsquared))
print('Dev R^2   = {:.4}'.format(r2_score(y_dev_lr, y_hat_dev)))
# I'm using numpy's round() function, instead of manually checking for values above 0.5
accuracy_score(y_dev, np.round(y_hat_dev))
best_accuracy = -1
best_model = None
for cur_alpha in [0.001, .01, .05, .1, .5, 1, 5, 10, 50, 100, 500]:
    # fit (using Ridge Regression), predict, and score
    fitted_ridge = Ridge(alpha=cur_alpha).fit(x_train, y_train_lr)
    y_hat_dev = fitted_ridge.predict(x_dev).reshape(1,-1)[0]
    cur_accuracy = accuracy_score(y_dev['affordable'].to_numpy(), np.round(y_hat_dev))
    if cur_accuracy > best_accuracy:
        best_accuracy = cur_accuracy
        best_model = fitted_ridge

    # fit (using Lasso Regression), predict, and score
    fitted_lasso = Lasso(alpha=cur_alpha).fit(x_train, y_train_lr)
    y_hat_dev = fitted_lasso.predict(x_dev).reshape(1,-1)[0]
    cur_accuracy = accuracy_score(y_dev['affordable'].to_numpy(), np.round(y_hat_dev))
    if cur_accuracy > best_accuracy:
        best_accuracy = cur_accuracy
        best_model = fitted_lasso

print("best_model:", best_model, "yielded accuracy of:", best_accuracy)
Note that we did not perform cross-validation, so perhaps our model could have performed even better had we done so.
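As a rough illustration (not part of the original solution), tuning alpha with cross-validation could look something like the following, using sklearn's RidgeCV with 5 folds; the alpha grid is illustrative:
from sklearn.linear_model import RidgeCV
# pick alpha via 5-fold cross-validation on the training set, then score on the dev set as before
ridge_cv = RidgeCV(alphas=[0.001, 0.01, 0.1, 1, 10, 100, 500], cv=5).fit(x_train, y_train_lr)
y_hat_dev_cv = ridge_cv.predict(x_dev).reshape(1,-1)[0]
print("chosen alpha:", ridge_cv.alpha_)
print("dev accuracy:", accuracy_score(y_dev['affordable'].to_numpy(), np.round(y_hat_dev_cv)))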
# construct training residuals
y_hat_train = best_model.predict(x_train).reshape(-1)  # flatten in case predict returns a 2-D array
training_residuals = y_train_lr[:,0] - y_hat_train
# construct dev residuals
y_hat_dev = best_model.predict(x_dev).reshape(1,-1)[0]
dev_residuals = y_dev['affordable'].to_numpy() - y_hat_dev
# make plot of training residuals
fig, axes = plt.subplots(1,2,figsize=(15,5))
axes[0].set_title('Histogram of Training Residuals')
axes[0].hist(training_residuals, alpha=0.8, bins=20)
axes[0].axvline(0, c='black', lw=2)
axes[0].set_xlabel(r'residuals')
# make plot of dev residuals
axes[1].set_title('Histogram of Development Residuals')
axes[1].hist(dev_residuals, alpha=0.4, bins=20)
axes[1].axvline(0, c='black', lw=2)
axes[1].set_xlabel(r'residuals')
plt.show()
print("min residual:", min(dev_residuals))
The above plots suggest that the training data is not especially conducive to being modelled by linear regression, for the residuals appear bimodal -- there isn't a single normal distribution of residual values. Also, just for fun, we plotted the errors/residuals from evaluating on the unseen development set. Doing so provides no information about whether the assumptions of the linear model are appropriate for our training data (as the unseen data could be completely dissimilar from anything we saw during training). Still, we hope that the development set residuals would be small, and it's nice that the errors roughly follow a normal distribution -- although there are some outliers on which we perform badly, but this can happen.
Part 4: Binary Logistic Regression¶
Linear regression is usually a good baseline model, but since the outcome we're trying to predict only takes values 0 and 1 we'll want to use logistic regression instead of basic linear regression.
We will use sklearn for now, but statsmodels also provides logistic regression (via its Logit model), along with nifty features like confidence intervals.
First, let's import the necessary classes:
from sklearn.linear_model import LogisticRegression
Next, let's instantiate a new LogisticRegression model:
lr = LogisticRegression()
Now, we can fit our model with just 1 line!
lr.fit(x_train, y_train['affordable'])
See the .predict() documentation here. NOTE: regularization is applied by default. Pay particular attention to the following arguments/parameters:
- C: the inverse regularization strength, which we discussed in class. Experiment with varying values from very small up to 100 million!
- max_iter: experiment with values from 5 to 5000. Do you expect more iterations to always perform better? Why or why not?
- penalty: for designating L1 (Lasso-style) or L2 (Ridge-style) regularization; the default is L2
- solver: especially relevant for the multi-class setting
(A brief illustrative sketch of these parameters follows below.)
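As a small sketch (not the lab's required solution), here is a model with these parameters set explicitly; note that the liblinear solver supports the L1 penalty, while the default lbfgs solver does not:
# illustrative: an L1-penalized logistic regression with explicit C and max_iter
lr_l1 = LogisticRegression(C=1.0, penalty='l1', solver='liblinear', max_iter=1000)
lr_l1.fit(x_train, y_train['affordable'])
print("number of nonzero coefficients:", np.sum(lr_l1.coef_ != 0))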
After fitting the model, you can print the .coef_ attribute to see its coefficients.
y_hat_dev = lr.predict(x_dev)
initial_score = accuracy_score(y_dev['affordable'].to_numpy(), y_hat_dev)
print("our initial logistic regression model yielded accuracy score of:", initial_score)
best_accuracy = -1
best_model = None
# experiment with different values
c_vals = [1, 10, 100, 1000, 10000, 100000, 1000000, 10000000]
num_iters = [5, 10, 100, 1000, 5000]
for c_val in c_vals:
    for num_iter in num_iters:
        lr = LogisticRegression(C=c_val, solver='liblinear', max_iter=num_iter)
        lr.fit(x_train, y_train['affordable'])
        y_hat_dev = lr.predict(x_dev)
        cur_accuracy = accuracy_score(y_dev['affordable'].to_numpy(), y_hat_dev)
        if cur_accuracy > best_accuracy:
            best_accuracy = cur_accuracy
            best_model = lr

print("best logistic regression model:", best_model, "yielded an accuracy score:", best_accuracy)
print("number of learned coefficients:", len(best_model.coef_[0]))
print("the coefficients align with our features:", x_dev.shape)
The results here should show that, for this dataset, logistic regression offers effectively identical performance to linear regression. There are two main takeaways from this:
- Logistic regression should not be viewed as being superior to linear regression; it should be viewed as a solution to a different type of problem -- classification (predicting categorical outputs), not regression (predicting continuous-valued outputs).
- In our situation, our two categories/classes (affordable or not) had an ordinal nature. That is, the continuum of prices directly aligned with the structure of our two classes. Alternatively, you could imagine other scenarios where our two categories are nominal and thus un-rankable (e.g., predicting cancer or not, or predicting which NYC borough an AirBnB is in based on its property features).
Part 5 (The Real Challenge): Multiclass Classification¶
Before we move on, let's consider a more common use case of logistic regression: predicting not just a binary variable, but what level a categorical variable will take. Instead of breaking the price variable into two classes (affordable being true or false), we may care for more fine-level granularity.
For this exercise, go back to the original df dataframe and construct 5 classes of pricing:
- budget: < 80
- affordable: 80 < x < 120
- average: 120 < x < 180
- expensive: 180 < x < 240
- very expensive: 240 < x
The cut function conveniently stores a lot of extra information for us. It's a very useful tool for discretizing an existing variable.
# creates multi-class labels for training
x_train_multiclass = x_train.copy()
x_train_multiclass['price_level'] = pd.cut(df_train['price'],[0,80,120,180,240,float('inf')], labels=[0,1,2,3,4])
y_train_multiclass = pd.DataFrame(data=x_train_multiclass['price_level'], columns=["price_level"])
x_train_multiclass = x_train_multiclass.drop(['price_level'], axis=1)
# creates multi-class labels for dev
x_dev_multiclass = x_dev.copy()
x_dev_multiclass['price_level'] = pd.cut(df_dev['price'],[0,80,120,180,240,float('inf')], labels=[0,1,2,3,4])
y_dev_multiclass = pd.DataFrame(data=x_dev_multiclass['price_level'], columns=["price_level"])
x_dev_multiclass = x_dev_multiclass.drop(['price_level'], axis=1)
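As a small aside (not in the original lab), if we omit the integer labels, we can see the 'extra information' that cut retains -- the bin intervals themselves, stored as ordered categories:
# illustrative: pd.cut without labels keeps the interval boundaries as ordered categories
price_bins = pd.cut(df_train['price'], [0, 80, 120, 180, 240, float('inf')])
print(price_bins.cat.categories)
print(price_bins.value_counts().sort_index())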
best_accuracy = -1
best_model = None
# experiment with different values
c_vals = [1, 10, 100, 1000, 10000]
num_iters = [10, 100, 1000, 5000]
for c_val in c_vals:
    for num_iter in num_iters:
        lr = LogisticRegression(C=c_val, solver="lbfgs", max_iter=num_iter)
        lr.fit(x_train_multiclass, y_train_multiclass['price_level'])
        y_hat_dev = lr.predict(x_dev_multiclass)
        cur_accuracy = accuracy_score(y_dev_multiclass['price_level'].to_numpy(), y_hat_dev)
        print(cur_accuracy)
        if cur_accuracy > best_accuracy:
            best_accuracy = cur_accuracy
            best_model = lr

print("best logistic regression model:", best_model, "yielded an accuracy score:", best_accuracy)
print("number of learned coefficients per class:", len(best_model.coef_[0]))
print("the coefficients align with our features:", x_dev.shape)
for i in range(len(x_dev.columns)):
    print("feature:", x_dev.columns[i], "; coef:", best_model.coef_[0][i])
Despite having 5 distinct price categories now, our performance isn't too bad! To increase performance further, we could first use cross-validation. Then, we could look at our original data and try to make better use of its features. For example, perhaps it would be useful to expand our 'neighbourhood' feature into one-hot encodings? The fine-grained information in 'neighbourhood' likely correlates well with price. The main question to ask ourselves is how much data we have for each neighbourhood (we'd aim to have plenty of representative data). Relatedly, the longitude and latitude features provide fine-grained information, but it may be hard for the model to use them since their range is so small. If we were to scale the latitude and longitude values to be between 0 and 1, the model might better distinguish between the nuanced values.
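A minimal sketch of that scaling idea (illustrative, not part of the original solution), using sklearn's MinMaxScaler fit on the training set only:
from sklearn.preprocessing import MinMaxScaler
# rescale latitude/longitude to [0, 1] using training-set statistics only
coord_cols = ['latitude', 'longitude']
scaler = MinMaxScaler().fit(x_train_multiclass[coord_cols])
x_train_scaled = x_train_multiclass.copy()
x_dev_scaled = x_dev_multiclass.copy()
x_train_scaled[coord_cols] = scaler.transform(x_train_multiclass[coord_cols])
x_dev_scaled[coord_cols] = scaler.transform(x_dev_multiclass[coord_cols])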
For this exercise, we cared about each price level (and predictions thereof) uniformly. However, in some scenarios, our classification accuracy for some categories is much more important than for others (e.g., predicting cancer or not). That is, our false negatives (misses) are far more serious and potentially deadly. For situations like this, it is better to err on the side of caution and allow more false positives (aka false alarms) than false negatives (misses). To handle this, one could weight each class and specify those weights when fitting the model. As we learned in class, we can also plot the performance as we vary the prediction threshold, while paying attention to how that affects the number of false negatives and false positives.
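Here is a hedged sketch of both ideas on the binary task (the class weights below are purely illustrative; roc_curve and auc were already imported at the top of the lab):
# illustrative: weight errors on class 0 five times as heavily as errors on class 1
weighted_lr = LogisticRegression(class_weight={0: 5, 1: 1}, solver='liblinear', max_iter=1000)
weighted_lr.fit(x_train, y_train['affordable'])
# sweep the prediction threshold via predicted probabilities and plot the ROC curve
dev_probs = weighted_lr.predict_proba(x_dev)[:, 1]
fpr, tpr, thresholds = roc_curve(y_dev['affordable'], dev_probs)
print("AUC:", auc(fpr, tpr))
plt.plot(fpr, tpr)
plt.xlabel('false positive rate')
plt.ylabel('true positive rate')
plt.title('ROC curve for class-weighted logistic regression')
plt.show()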