{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Title :\n", "Exercise - Logistic Regression\n", "\n", "## Description :\n", "\n", "Fit logistic regression models using:\n", "\n", "SKLearn LogisticRegression (sklearn.linear_model.LogisticRegression)\n", "\n", "Statsmodels Logit (statsmodels.api.Logit)\n", "\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# import libraries\n", "\n", "import pandas as pd \n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.linear_model import LogisticRegression, LinearRegression\n", "import statsmodels.api as sm" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "heart = pd.read_csv('Heart.csv')\n", "\n", "# Force the response into a binary indicator:\n", "heart['AHD'] = 1*(heart['AHD'] == \"Yes\")\n", "\n", "heart.describe()" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Make a plot of the response (AHD) vs the predictor (Age)\n", "\n", "plt.plot(heart[['Age']].values, heart['AHD'].values ,'o', markersize=7,color=\"#011DAD\",label=\"Data\")\n", "\n", "plt.xticks(np.arange(18, 80, 4.0))\n", "plt.xlabel(\"Age\")\n", "plt.ylabel(\"AHD\")\n", "plt.yticks((0,1), labels=('No', 'Yes'))\n", "\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# split into train and validation\n", "heart_train, heart_val = train_test_split(heart, train_size = 0.75, random_state = 5)\n", "\n", "# select variables for model estimation\n", "x_train = heart_train[['Age']]\n", "y_train = heart_train['AHD']\n", "\n", "x_val = heart_val[['Age']]\n", "y_val = heart_val['AHD']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Simple linear regression model fitting\n", "\n", "Define and fit a linear regression model to 
predict `AHD` from `Age`." ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Create and fit a linear regression model (LinearRegression takes no random_state)\n", "\n", "regress1 = LinearRegression(fit_intercept=True).fit(x_train,y_train)\n", "\n", "print(\"Linear Regression Estimated Betas:\",regress1.intercept_,regress1.coef_[0])" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Plot the fitted OLS line over the training data\n", "dummy_x=np.linspace(np.min(x_train)-30,np.max(x_train)+30)\n", "yhat_regress = regress1.predict(dummy_x.reshape(-1,1))\n", "plt.plot(x_train, y_train, 'o' ,alpha=0.2, label='Data')\n", "plt.plot(dummy_x, yhat_regress, label = \"OLS\")\n", "\n", "plt.ylim(-0.2, 1.2)\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What could go wrong with this linear regression model? " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*your answer here*" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Simple logistic regression model fitting\n", "\n", "Define and fit a logistic regression model with random state=5 to predict `AHD` from `Age`." ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "### edTest(test_logit1) ###\n", "# Create a logistic regression model, with random state=5 and no penalty\n", "\n", "logit1 = ___(penalty=___, max_iter = 1000, random_state=5)\n", "\n", "# Fit the model using the training set\n", "\n", "logit1.fit(x_train,y_train)\n", "\n", "# Get the coefficient estimates\n", "\n", "print(\"Logistic Regression Estimated Betas (B0,B1):\",logit1.intercept_,logit1.coef_)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Interpret the Coefficient Estimates" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Calculate the estimated probability that a 60-year-old person has AHD." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**your answer here**" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Confirm the probability calculation above using logit1.predict_proba()\n", "# Be careful as to how you define the new observation. Hint: double brackets are one way to do it\n", "\n", "logit1.predict_proba([[___]])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Accuracy computation" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "### edTest(test_accuracy) ###\n", "\n", "# Compute the training & validation accuracy\n", "\n", "train_accuracy = logit1.___(x_train , y_train)\n", "val_accuracy = logit1.___(x_val , y_val)\n", "\n", "# Print the two accuracies below\n", "\n", "print(\"Train Accuracy\", train_accuracy)\n", "print(\"Validation Accuracy\", val_accuracy)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Plot the predictions" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "# Grid of ages as a 2D column, the input shape scikit-learn expects\n", "x = np.linspace(heart['Age'].min()-10, heart['Age'].max()+10, 200).reshape(-1,1)\n", "\n", "yhat_class_logit = logit1.predict(x)\n", "yhat_prob_logit = logit1.predict_proba(x)[:,1]\n", "\n", "# plot the observed data\n", "plt.plot(x_train, y_train, 'o' ,alpha=0.1, label='Train Data')\n", "plt.plot(x_val, 0.94*y_val+0.03, 'o' ,alpha=0.1, label='Validation Data')\n", "\n", "# plot the predictions\n", "plt.plot(x, yhat_class_logit, label='logit1 Classifications')\n", "plt.plot(x, yhat_prob_logit, label='logit1 Probabilities')\n", "\n", "# put the lower-left part of the legend 5% to the right along the x-axis, and 45% up along the y-axis\n", "plt.legend(loc=(0.05,0.45))\n", "\n", "# Don't forget your axis labels!\n", "plt.xlabel(\"Age\")\n", "plt.ylabel(\"Heart disease (AHD)\")\n", "\n", "plt.show()\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Statistical Inference\n", "Train a new 
logistic regression model using the statsmodels package. Print the model summary and interpret the results." ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "### edTest(test_logit2) ###\n", "# Add a column of ones (the intercept term) to X\n", "x_train_with_constant = sm.add_constant(x_train)\n", "x_val_with_constant = sm.add_constant(x_val)\n", "\n", "# Train a new model using the statsmodels package\n", "logreg = sm.___(y_train, x_train_with_constant).fit()\n", "print(logreg.summary())\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### What is an estimated 95% confidence interval for the coefficient corresponding to the 'Age' variable?" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*your answer here*" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 1 }