{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Title :\n",
"Exercise - Logistic Regression\n",
"\n",
"## Description :\n",
"\n",
"Fit logistic regression models using:\n",
"\n",
"- scikit-learn `LogisticRegression` (`sklearn.linear_model.LogisticRegression`)\n",
"- statsmodels `Logit` (`statsmodels.api.Logit`)"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# import libraries\n",
"\n",
"import pandas as pd \n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.linear_model import LogisticRegression, LinearRegression\n",
"import statsmodels.api as sm"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"heart = pd.read_csv('Heart.csv')\n",
"\n",
"# Force the response into a binary indicator:\n",
"heart['AHD'] = 1*(heart['AHD'] == \"Yes\")\n",
"\n",
"heart.describe()"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Make a plot of the response (AHD) vs the predictor (Age)\n",
"\n",
"plt.plot(heart[['Age']].values, heart['AHD'].values ,'o', markersize=7,color=\"#011DAD\",label=\"Data\")\n",
"\n",
"plt.xticks(np.arange(18, 80, 4.0))\n",
"plt.xlabel(\"Age\")\n",
"plt.ylabel(\"AHD\")\n",
"plt.yticks((0,1), labels=('No', 'Yes'))\n",
"\n",
"plt.legend()\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# split into train and validation\n",
"heart_train, heart_val = train_test_split(heart, train_size = 0.75, random_state = 5)\n",
"\n",
"# select variables for model estimation\n",
"x_train = heart_train[['Age']]\n",
"y_train = heart_train['AHD']\n",
"\n",
"x_val = heart_val[['Age']]\n",
"y_val = heart_val['AHD']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Simple linear regression model fitting\n",
"\n",
"Define and fit a linear regression model to predict `AHD` from `Age`."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Create and fit a linear regression model (LinearRegression takes no random_state)\n",
"\n",
"regress1 = LinearRegression(fit_intercept=True).fit(x_train,y_train)\n",
"\n",
"print(\"Linear Regression Estimated Betas (B0,B1):\",regress1.intercept_,regress1.coef_[0])"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Plot the OLS estimate of the probability over a range extending 30 years beyond the training ages\n",
"dummy_x = np.linspace(x_train['Age'].min() - 30, x_train['Age'].max() + 30)\n",
"yhat_regress = regress1.predict(dummy_x.reshape(-1,1))\n",
"plt.plot(x_train, y_train, 'o' ,alpha=0.2, label='Data')\n",
"plt.plot(dummy_x, yhat_regress, label = \"OLS\")\n",
"\n",
"plt.ylim(-0.2, 1.2)\n",
"plt.xlabel(\"Age\")\n",
"plt.ylabel(\"AHD\")\n",
"plt.legend()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### What could go wrong with this linear regression model? "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*your answer here*"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Simple logistic regression model fitting\n",
"\n",
"Define and fit a logistic regression model with random state=5 to predict `AHD` from `Age`."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_logit1) ###\n",
"# Create a logistic regression model, with random state=5 and no penalty\n",
"\n",
"logit1 = ___(penalty=___, max_iter = 1000, random_state=5)\n",
"\n",
"#Fit the model using the training set\n",
"\n",
"logit1.fit(x_train,y_train)\n",
"\n",
"# Get the coefficient estimates\n",
"\n",
"print(\"Logistic Regression Estimated Betas (B0,B1):\",logit1.intercept_,logit1.coef_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Interpret the Coefficient Estimates"
]
},
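{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a reminder of the model form (a sketch in generic notation, where $\\hat{\\beta}_0$ and $\\hat{\\beta}_1$ are assumed to stand for `logit1.intercept_` and `logit1.coef_`, and $\\hat{p}$ is the estimated probability that `AHD` equals 1), logistic regression is linear on the log-odds scale:\n",
"\n",
"$$\\log\\left(\\frac{\\hat{p}}{1-\\hat{p}}\\right) = \\hat{\\beta}_0 + \\hat{\\beta}_1 \\cdot \\text{Age}$$\n",
"\n",
"Under this form, a one-year increase in `Age` changes the estimated log-odds of AHD by $\\hat{\\beta}_1$, which multiplies the estimated odds by $e^{\\hat{\\beta}_1}$."
]
},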
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Calculate the estimated probability that a 60-year-old person has AHD."
]
},
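{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to set up this calculation, assuming the fitted model has the standard logistic form with intercept $\\hat{\\beta}_0$ (`logit1.intercept_`) and slope $\\hat{\\beta}_1$ (`logit1.coef_`), is to plug an age of 60 into the inverse of the logit function:\n",
"\n",
"$$\\hat{P}(\\text{AHD}=1 \\mid \\text{Age}=60) = \\frac{1}{1 + e^{-(\\hat{\\beta}_0 + 60\\,\\hat{\\beta}_1)}}$$\n",
"\n",
"The symbols here are illustrative notation, not names defined elsewhere in this notebook; substitute the estimates printed above."
]
},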
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**your answer here**"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# Confirm the probability calculation above using logit1.predict_proba()\n",
"# Be careful how you define the new observation. Hint: double brackets are one way to do it\n",
"\n",
"logit1.predict_proba([[___]])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Accuracy computation"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_accuracy) ###\n",
"\n",
"# Compute the training & validation accuracy \n",
"\n",
"train_accuracy = logit1.___(x_train , y_train)\n",
"val_accuracy = logit1.___(x_val , y_val)\n",
"\n",
"# Print the two accuracies below\n",
"\n",
"print(\"Train Accuracy\", train_accuracy)\n",
"print(\"Validation Accuracy\", val_accuracy)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Plot the predictions"
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"# grid of ages extending 10 years beyond the observed range, as a 2D column for sklearn\n",
"x = np.linspace(heart['Age'].min() - 10, heart['Age'].max() + 10, 200).reshape(-1, 1)\n",
"\n",
"yhat_class_logit = logit1.predict(x)\n",
"yhat_prob_logit = logit1.predict_proba(x)[:,1]\n",
"\n",
"# plot the observed data\n",
"plt.plot(x_train, y_train, 'o' ,alpha=0.1, label='Train Data')\n",
"plt.plot(x_val, 0.94*y_val+0.03, 'o' ,alpha=0.1, label='Validation Data')\n",
"\n",
"# plot the predictions\n",
"plt.plot(x, yhat_class_logit, label='logit1 Classifications')\n",
"plt.plot(x, yhat_prob_logit, label='logit1 Probabilities')\n",
"\n",
"# put the lower-left part of the legend 5% to the right along the x-axis, and 45% up along the y-axis\n",
"plt.legend(loc=(0.05,0.45))\n",
"\n",
"# Don't forget your axis labels!\n",
"plt.xlabel(\"Age\")\n",
"plt.ylabel(\"Heart disease (AHD)\")\n",
"\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Statistical Inference\n",
"Train a new logistic regression model using the statsmodels package. Print the model summary and interpret the results."
]
},
{
"cell_type": "code",
"execution_count": 0,
"metadata": {},
"outputs": [],
"source": [
"### edTest(test_logit2) ###\n",
"# adding a column of ones to X\n",
"x_train_with_constant = sm.add_constant(x_train)\n",
"x_val_with_constant = sm.add_constant(x_val)\n",
"\n",
"# train a new model using statsmodels package\n",
"logreg = sm.___(y_train, x_train_with_constant).fit()\n",
"print(logreg.summary())\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### What is an estimated 95% confidence interval for the coefficient of the `Age` variable?"
]
},
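{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a hedged reminder, the interval shown in the `[0.025  0.975]` columns of the statsmodels summary is a normal-approximation (Wald) interval. In generic notation, with $\\hat{\\beta}_1$ the estimated `Age` coefficient and $\\widehat{\\text{SE}}(\\hat{\\beta}_1)$ its reported standard error (both symbols are illustrative), it has the form:\n",
"\n",
"$$\\hat{\\beta}_1 \\pm 1.96 \\times \\widehat{\\text{SE}}(\\hat{\\beta}_1)$$\n",
"\n",
"Computing this by hand from the `coef` and `std err` columns is a quick check on the values read from the table."
]
},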
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*your answer here*"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.5"
}
},
"nbformat": 4,
"nbformat_minor": 1
}