Title

Exercise 1 - Guesstimating Beta values for Logistic Regression

Description

The goal of the exercise is to produce a plot similar to the one given below, by guesstimating the values of the coefficient estimates for $\beta_0$ and $\beta_1$.

Instructions:

We want to predict whether a person will make an insurance claim as a function of their age. To do so, we need to:

  • Read the insurance_claim.csv as a dataframe.
  • Assign the predictor and response variables.
  • Guesstimate the values of the coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$.
  • Predict the response variable using the simple logistic regression formula given below (no packages allowed).
  • Compute the accuracy of the model.
  • Repeat the above steps by changing the values of the coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$, until you get "good" accuracy.
  • Plot the Insurance Claim vs. Age graph with the fit of the model.

Hints:

  • Logistic Regression equation:
$$P\left(Y=1\right)=\frac{1}{1+e^{-\left(\beta_0+\beta_1 X\right)}}$$

plt.plot() : Plots x versus y as lines and/or markers

sklearn.metrics.accuracy_score() : Accuracy classification score

plt.xticks() : Get or set the current tick locations and labels of the x-axis.

plt.yticks() : Get or set the current tick locations and labels of the y-axis.

plt.xlabel() : Set the label for the x-axis.

plt.ylabel() : Set the label for the y-axis.

Note: This exercise is auto-graded and you can try multiple attempts.
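The logistic regression equation in the hints can be written as a small pure-Python function. The sketch below is only an illustration: the name `logistic_probability` and the coefficient values are hypothetical and are not the values expected by the grader.

```python
from math import exp


def logistic_probability(x, beta0, beta1):
    """Return P(Y=1) for a single predictor value x."""
    return 1 / (1 + exp(-(beta0 + beta1 * x)))


# Hypothetical coefficients, chosen only to illustrate the shape of the curve
beta0_demo, beta1_demo = -4.0, 0.1

# At x = 40 the linear term is -4.0 + 0.1 * 40 = 0, so the probability is 0.5
p_mid = logistic_probability(40, beta0_demo, beta1_demo)

# Larger x gives a larger probability because beta1_demo is positive
p_high = logistic_probability(60, beta0_demo, beta1_demo)

print(p_mid, p_high)
```

A positive $\beta_1$ makes the probability increase with $X$, and the curve crosses 0.5 where $\beta_0+\beta_1X=0$, that is, at $X=-\beta_0/\beta_1$.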

In [1]:
# import important libraries

%matplotlib inline
import numpy as np
import pandas as pd
from math import exp
import matplotlib.pyplot as plt
from sklearn.preprocessing import normalize
from sklearn.metrics import accuracy_score
In [2]:
# Make a dataframe of the file "insurance_claim.csv"

data_filename = 'insurance_claim.csv'
df = pd.read_csv(data_filename)
In [ ]:
# Take a quick look at the data; notice that the response variable is binary

df.head()
In [3]:
# Assign age as the predictor variable using double brackets
x = df[[___]]

# Assign insuranceclaim as the response variable
y = df[___]
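The difference between double and single brackets matters for the later steps. The following sketch uses a toy dataframe with hypothetical column names (`feature`, `label`) that are unrelated to insurance_claim.csv; it only shows what each indexing style returns and how each one behaves in a loop.

```python
import pandas as pd

# Toy data with hypothetical column names, unrelated to insurance_claim.csv
toy = pd.DataFrame({"feature": [1, 2, 3], "label": [0, 1, 1]})

X_2d = toy[["feature"]]  # double brackets return a 2-D DataFrame
y_1d = toy["label"]      # single brackets return a 1-D Series

print(type(X_2d).__name__, X_2d.shape)
print(type(y_1d).__name__, y_1d.shape)

# Iterating over a DataFrame yields its column labels, not its row values
columns_seen = list(X_2d)

# Iterating over the flattened underlying array yields the numbers themselves
values_seen = X_2d.to_numpy().ravel().tolist()

print(columns_seen, values_seen)
```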
In [ ]:
# Make a plot of the response (insuranceclaim) vs the predictor (age)
plt.plot(___,___,'o', markersize=7,color="#011DAD",label="Data")

# Also add the labels for 'x' & 'y' values
plt.xlabel(___)
plt.ylabel(___)

plt.xticks(np.arange(18, 80, 4.0))

# Label the value 1 as 'Yes' & 0 as 'No'
plt.yticks((0,1), labels=('No', 'Yes'))
plt.legend(loc='best')
plt.show()
In [5]:
### edTest(test_beta_guesstimate) ###
# Guesstimate the values of beta0 & beta1

beta0 = ___

beta1 = ___
In [6]:
### edTest(test_beta_computation) ###
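# Use the logistic function given above to predict the response based on the input
logit = []

# Loop over the age values; iterating over the DataFrame x directly would yield column labels
for i in x.values.ravel():
    # Append the P(y=1) values to the logit list
    logit.append(___)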
    
    
In [6]:
# If the predicted probability is at or above the threshold of 0.5, predict 1, else 0

y_pred = []

for py in ___:
    if py >= 0.5:
        ____
    else:
        ____
        
In [ ]:
# Use accuracy_score function to find the accuracy 

accuracy = accuracy_score(___, ___)

# Print the accuracy
print(___)
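Because the guess, predict, and score steps are repeated for every candidate pair of coefficients, it can help to wrap them in one helper. The following is a minimal sketch on made-up data: `accuracy_for_guess`, the toy ages, the toy labels, and the candidate coefficients are all hypothetical. Accuracy is computed by hand as the fraction of matching labels, which is what `accuracy_score` returns by default.

```python
from math import exp


def accuracy_for_guess(beta0, beta1, xs, ys, threshold=0.5):
    """Fraction of correct labels for one candidate (beta0, beta1) pair."""
    correct = 0
    for x_val, y_val in zip(xs, ys):
        # Logistic function: P(Y=1) for this predictor value
        p = 1 / (1 + exp(-(beta0 + beta1 * x_val)))
        # Turn the probability into a hard 0/1 label
        label = 1 if p >= threshold else 0
        correct += int(label == y_val)
    return correct / len(ys)


# Made-up values for illustration only (not from insurance_claim.csv)
toy_ages = [20, 25, 30, 45, 50, 60]
toy_claims = [0, 0, 1, 0, 1, 1]

# Each candidate moves the 0.5 crossing point, -beta0 / beta1, to a different age
candidates = [(-1.5, 0.1), (-4.0, 0.1), (-2.75, 0.1)]
for b0, b1 in candidates:
    print(b0, b1, accuracy_for_guess(b0, b1, toy_ages, toy_claims))
```

Comparing the printed accuracies across candidates shows which direction to move the next guess.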
In [ ]:
# Make a plot similar to the one above along with the fit curve
plt.plot(___, ___,'o', markersize=7,color="#011DAD",label="Data")

plt.plot(___,___,linewidth=2,color='black',label="Prediction")

plt.xticks(np.arange(18, 80, 4.0))
plt.xlabel("Age")
plt.ylabel("Insurance claim")
plt.yticks((0,1), labels=('No', 'Yes'))
plt.legend()
plt.show()
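A line plot connects points in the order they are given, so if the ages are not sorted the fitted line can double back on itself. One way around this, sketched below under stated assumptions, is to evaluate the curve on an evenly spaced grid of ages. The data and coefficients are made up, and `np.exp` is used here only for the plotting grid, which is a departure from the package-free prediction step of the exercise.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up ages and labels for illustration only (not from insurance_claim.csv)
toy_ages = np.array([52, 21, 64, 30, 45, 27])
toy_claims = np.array([1, 0, 1, 0, 1, 0])

# Hypothetical coefficients, not the graded answer
b0_demo, b1_demo = -4.0, 0.1

# Evaluate the curve on an evenly spaced, increasing grid of ages
age_grid = np.linspace(toy_ages.min(), toy_ages.max(), 200)
prob_grid = 1 / (1 + np.exp(-(b0_demo + b1_demo * age_grid)))

fig, ax = plt.subplots()
ax.plot(toy_ages, toy_claims, 'o', markersize=7, color="#011DAD", label="Data")
ax.plot(age_grid, prob_grid, linewidth=2, color='black', label="Prediction")
ax.set_xlabel("Age")
ax.set_ylabel("Insurance claim")
ax.legend()
# Call plt.show() to display the figure when running this as a script
```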

Post exercise question:

In this exercise, you may have had to stumble around to find the right values of $\beta_0$ and $\beta_1$ to get accurate results.

Although you may have used visual inspection to find a good fit, in most problems you need a quantitative measure of your model's performance, such as a loss function.

Which of the following are NOT possible ways of quantifying the performance of the model?

  • A. Compute the mean squared error loss of the predicted labels.
  • B. Evaluate the log-likelihood for this Bernoulli response variable.
  • C. Go to the temple of Apollo at Delphi and ask the high priestess Pythia.
  • D. Compute the total number of misclassified labels.
In [ ]:
### edTest(test_quiz) ###

# Put down your answers in a string format below (using quotes)

# e.g., if you think the options are 'A' & 'B', input below as "A,B"

answer = ___