Key Word(s): Logistic Regression, Classification, kNN

Title

Exercise 1 - Guesstimating Beta values for Logistic Regression

Description

The goal of the exercise is to produce a plot similar to the one given below, by guesstimating the values of the coefficient estimates for $\beta_0$ and $\beta_1$.

Instructions:

We want to predict, from the data, who will file an insurance claim as a function of age. To do so, we need to:

Read the insurance_claim.csv as a dataframe.
Assign the predictor and response variables.
Guesstimate the values of the coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$.
Predict the response variable using the simple logistic regression formula given below (no packages allowed).
Compute the accuracy of the model.
Repeat the above steps by changing the values of the coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$, until you get "good" accuracy.
Plot the Insurance Claim vs. Age graph with the fit of the model.
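
The guess-and-check loop described in the steps above can be sketched on made-up data. The ages, labels, and candidate coefficient pairs below are invented for illustration and are not taken from insurance_claim.csv; the helper names are hypothetical as well. Accuracy is computed by hand here rather than with `accuracy_score`.

```python
import numpy as np

# Synthetic stand-in for the exercise data (NOT insurance_claim.csv):
# eight made-up ages and made-up 0/1 claim labels.
ages = np.array([20, 25, 30, 35, 45, 50, 60, 70], dtype=float)
claims = np.array([0, 0, 0, 1, 0, 1, 1, 1])


def predict_labels(x, beta0, beta1, threshold=0.5):
    """Apply the simple logistic formula, then threshold the probabilities."""
    probabilities = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))
    return (probabilities >= threshold).astype(int)


def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return float(np.mean(y_true == y_pred))


# Hypothetical guesses for (beta0, beta1); each one is scored in turn.
guesses = [(0.0, 0.0), (-4.0, 0.1), (-3.25, 0.1)]
scores = {guess: accuracy(claims, predict_labels(ages, *guess)) for guess in guesses}

for guess, score in scores.items():
    print(f"beta0={guess[0]:6.2f}, beta1={guess[1]:5.2f} -> accuracy {score:.3f}")

# Keep the guess with the highest accuracy on this toy data.
best_guess = max(scores, key=scores.get)
print("best guess:", best_guess)
```

On real data the same pattern applies: change one coefficient at a time, recompute the accuracy, and keep the pair that scores best.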

Hints:

Logistic Regression equation:

$$P\left(Y=1\right)=\frac{1}{1+e^{-\left(\beta_0+\beta_1X\right)}}$$

plt.plot() : Plots x versus y as lines and/or markers

sklearn.metrics.accuracy_score() : Accuracy classification score

plt.xticks() : Get or set the current tick locations and labels of the x-axis.

plt.yticks() : Get or set the current tick locations and labels of the y-axis.

plt.xlabel() : Set the label for the x-axis.

plt.ylabel() : Set the label for the y-axis.
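
As a rough illustration of the logistic regression equation in the hints above, the sketch below evaluates $P(Y=1)$ for a few inputs. The coefficient values are arbitrary placeholders, not suggested answers, and the function name is hypothetical. It also shows that the probability equals 0.5 where $\beta_0+\beta_1X=0$, that is, at $X=-\beta_0/\beta_1$, so that ratio sets where the curve crosses the classification threshold.

```python
from math import exp


def logistic_probability(x, beta0, beta1):
    """P(Y=1) for one input x under the simple logistic model."""
    return 1.0 / (1.0 + exp(-(beta0 + beta1 * x)))


# Arbitrary placeholder coefficients, chosen only for illustration.
beta0, beta1 = -5.0, 0.125

# The curve crosses 0.5 where beta0 + beta1 * x == 0.
midpoint = -beta0 / beta1  # 40.0

# Probabilities below, at, and above the midpoint.
for x in (20, midpoint, 60):
    print(x, round(logistic_probability(x, beta0, beta1), 3))
```

A larger $\beta_1$ makes the transition around the midpoint steeper, while changing $\beta_0$ with $\beta_1$ fixed slides the midpoint left or right.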

Note: This exercise is auto-graded and you can try multiple attempts.

In [1]:

# import important libraries

%matplotlib inline
import numpy as np
import pandas as pd
from math import exp
import matplotlib.pyplot as plt
from sklearn.preprocessing import normalize
from sklearn.metrics import accuracy_score

In [2]:

# Make a dataframe of the file "insurance_claim.csv"

data_filename = 'insurance_claim.csv'
df = pd.read_csv(data_filename)

In [ ]:

# Take a quick look at the data; notice that the response variable is binary

df.head()

In [3]:

# Assign age as the predictor variable using double brackets
x = df[[___]]

# Assign insuranceclaim as the response variable
y = df[___]

In [ ]:

# Make a plot of the response (insuranceclaim) vs the predictor (age)
plt.plot(___,___,'o', markersize=7,color="#011DAD",label="Data")

# Also add the labels for 'x' & 'y' values
plt.xlabel(___)
plt.ylabel(___)

plt.xticks(np.arange(18, 80, 4.0))

# Label the value 1 as 'Yes' & 0 as 'No'
plt.yticks((0,1), labels=('No', 'Yes'))
plt.legend(loc='best')
plt.show()

In [5]:

### edTest(test_beta_guesstimate) ###
# Guesstimate the values of beta0 & beta1

beta0 = ___

beta1 = ___

In [6]:

### edTest(test_beta_computation) ###
# Use the logistic function below to predict the response based on the input
logit = []

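# x is a DataFrame (double brackets), so loop over its values;
# iterating over x directly would yield column names, not ages
for i in x.values.ravel():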
    # Append the P(y=1) values to the logit list
    logit.append(___)

In [6]:

# If the predictions are above a threshold of 0.5, predict as 1, else 0

y_pred = []

for py in ___:
    if py >= 0.5:
        ____
    else:
        ____
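
The thresholding loop above has a vectorized NumPy equivalent, sketched here on made-up probabilities (the array and variable names are illustrative and not part of the exercise template). Note that `>=` sends a probability of exactly 0.5 to class 1, matching the `if py >= 0.5` test.

```python
import numpy as np

# Made-up probabilities standing in for the values stored in `logit`.
probabilities = np.array([0.10, 0.49, 0.50, 0.51, 0.93])

# Loop version, mirroring the structure of the cell above.
y_pred_loop = []
for py in probabilities:
    if py >= 0.5:
        y_pred_loop.append(1)
    else:
        y_pred_loop.append(0)

# Vectorized version: compare the whole array at once, then cast booleans to 0/1.
y_pred_vectorized = (probabilities >= 0.5).astype(int)

print(y_pred_loop)
print(y_pred_vectorized.tolist())
```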

In [ ]:

# Use accuracy_score function to find the accuracy 

accuracy = accuracy_score(___, ___)

# Print the accuracy
print(___)

In [ ]:

# Make a plot similar to the one above along with the fit curve
plt.plot(___, ___,'o', markersize=7,color="#011DAD",label="Data")

plt.plot(___,___,linewidth=2,color='black',label="Prediction")

plt.xticks(np.arange(18, 80, 4.0))
plt.xlabel("Age")
plt.ylabel("Insurance claim")
plt.yticks((0,1), labels=('No', 'Yes'))
plt.legend()
plt.show()

Post exercise question:

In this exercise, you may have had to stumble around to find the right values of $\beta_0$ and $\beta_1$ to get accurate results.

Although visual inspection may have helped you find a good fit, most problems require a quantitative way to measure the performance of your model, such as a loss function.
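
One such quantitative measure can be sketched as follows: the mean negative log-likelihood (binary cross-entropy) of the labels under the predicted probabilities. The data and the two coefficient guesses are invented for illustration, and the function names are hypothetical. In this toy case both guesses classify the same points correctly, yet the loss still tells them apart.

```python
import numpy as np

# Synthetic ages and 0/1 labels (NOT the exercise data).
ages = np.array([20, 25, 30, 35, 45, 50, 60, 70], dtype=float)
claims = np.array([0, 0, 0, 1, 0, 1, 1, 1])


def predicted_probabilities(x, beta0, beta1):
    """P(Y=1) from the simple logistic formula."""
    return 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))


def mean_negative_log_likelihood(y_true, p, eps=1e-12):
    """Average Bernoulli negative log-likelihood; lower is better."""
    p = np.clip(p, eps, 1.0 - eps)  # guard against log(0)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))


def accuracy(y_true, p, threshold=0.5):
    """Fraction of labels predicted correctly after thresholding."""
    return float(np.mean((p >= threshold).astype(int) == y_true))


# Two hypothetical guesses with the same midpoint (age 40) but different slopes.
p_steep = predicted_probabilities(ages, -4.0, 0.1)
p_flat = predicted_probabilities(ages, -2.0, 0.05)

acc_steep, acc_flat = accuracy(claims, p_steep), accuracy(claims, p_flat)
loss_steep = mean_negative_log_likelihood(claims, p_steep)
loss_flat = mean_negative_log_likelihood(claims, p_flat)

print(f"steep: accuracy {acc_steep:.3f}, loss {loss_steep:.4f}")
print(f"flat:  accuracy {acc_flat:.3f}, loss {loss_flat:.4f}")
```

Because the loss depends on the probabilities rather than only on the thresholded labels, it varies smoothly with the coefficients, which is what makes it usable for automated fitting.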

Which of the following are NOT possible ways of quantifying the performance of the model?

A. Compute the mean squared error loss of the predicted labels.
B. Evaluate the log-likelihood for this Bernoulli response variable.
C. Go to the temple of Apollo at Delphi and ask the high priestess Pythia.
D. Compute the total number of misclassified labels.

In [ ]:

### edTest(test_quiz) ###

# Put down your answers in a string format below (using quotes)

# e.g., if you think the options are 'A' & 'B', input below as "A,B"

answer = ___