{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Title\n", "\n", "**Exercise 1 - Basic Multi-classification**\n", "\n", "# Description\n", "The goal of the exercise is to get comfortable using multiclass classification models. Eventually, you will produce a plot similar to the one given below:\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Instructions: \n", "We are trying to predict the type of iris in the classic Iris data set from measured characteristics (here, sepal length and sepal width).\n", "- Load the Iris data set and convert to a data frame.\n", "- Fit multinomial & OvR logistic regressions and a $k$-NN model. \n", "- Compute the accuracy of the models.\n", "- Plot the classification boundaries against the two predictors used.\n", "\n", "# Hints:\n", "\n", "sklearn.LogisticRegression() : Generates a Logistic Regression classifier\n", "\n", "sklearn.fit() : Fits the model to the given data\n", "\n", "sklearn.predict() : Predicts class labels using the fitted model (logistic or k-NN classifier)\n", "\n", "sklearn.predict_proba() : Predicts the probability of each class in the response using the fitted model (logistic or k-NN classifier); the probabilities should add up to 1 for each observation\n", "\n", "sklearn.LogisticRegression.coef_ and .intercept_ : Extract the estimated $\\beta$ coefficients from a fitted Logistic Regression model\n", "\n", "sklearn.score() : Returns the mean classification accuracy on the given data and labels\n", "\n", "matplotlib.pcolormesh() : Creates a pseudocolor plot on a rectangular grid (used here to shade the predicted class regions)\n", "\n", "**Note: This exercise is auto-graded and you can make multiple attempts.**" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "%matplotlib inline\n", "from sklearn import datasets\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "import pandas as pd\n", "import numpy as np" ] }, { "cell_type": "code", 
"execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression\n", "from sklearn.neighbors import KNeighborsClassifier \n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Irises\n", "\n", "Read in the data set and convert to a Pandas data frame:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "raw = datasets.load_iris()\n", "iris = pd.DataFrame(raw['data'],columns=raw['feature_names'])\n", "iris['type'] = raw['target'] \n", "iris.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note: this violin plot is 'inverted': it puts the model's response variable (the iris type) on the x-axis and a predictor on the y-axis. This is fine for exploration." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "sns.violinplot(y=iris['sepal length (cm)'], x=iris['type'], split=True);" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "# Create a violin plot to compare petal length \n", "# across the types of irises\n", "\n", "sns.violinplot(___);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we fit our first model (the one-vs-rest, or OvR, logistic regression) and print out the intercepts and coefficients:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "logit_ovr = LogisticRegression(penalty='none', multi_class='ovr',max_iter = 1000).fit(\n", " iris[['sepal length (cm)','sepal width (cm)']], iris['type'])\n", "print(logit_ovr.intercept_)\n", "print(logit_ovr.coef_)\n" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "# we can predict classes or probabilities\n", "print(logit_ovr.predict(iris[['sepal length (cm)','sepal width (cm)']])[0:5])\n", "print(logit_ovr.predict_proba(iris[['sepal length (cm)','sepal width (cm)']])[0:5])" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "# and 
calculate accuracy\n", "print(logit_ovr.score(iris[['sepal length (cm)','sepal width (cm)']],iris['type']))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now it's your turn, this time with the multinomial logistic regression." ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "### edTest(test_multinomial) ###\n", "\n", "# Fit the model and print out the coefficients\n", "logit_multi = LogisticRegression(___).fit(___)\n", "intercept = logit_multi.intercept_\n", "coefs = logit_multi.coef_\n", "print(intercept)\n", "print(coefs)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [], "source": [ "### edTest(test_multinomialaccuracy) ###\n", "\n", "multi_accuracy = ___\n", "print(multi_accuracy)" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [], "source": [ "# Plot the decision boundary. \n", "x1_range = iris['sepal length (cm)'].max() - iris['sepal length (cm)'].min()\n", "x2_range = iris['sepal width (cm)'].max() - iris['sepal width (cm)'].min()\n", "x1_min, x1_max = iris['sepal length (cm)'].min()-0.1*x1_range, iris['sepal length (cm)'].max() +0.1*x1_range\n", "x2_min, x2_max = iris['sepal width (cm)'].min()-0.1*x2_range, iris['sepal width (cm)'].max() + 0.1*x2_range\n", "\n", "step = .05 \n", "x1x, x2x = np.meshgrid(np.arange(x1_min, x1_max, step), np.arange(x2_min, x2_max, step))\n", "y_hat_ovr = logit_ovr.predict(np.c_[x1x.ravel(), x2x.ravel()])\n", "y_hat_multi = ___\n", "\n", "\n", "fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))\n", "\n", "ax1.pcolormesh(x1x, x2x, y_hat_ovr.reshape(x1x.shape), cmap=plt.cm.Paired,alpha = 0.5)\n", "ax1.scatter(iris['sepal length (cm)'], iris['sepal width (cm)'], c=iris['type'], edgecolors='k', cmap=plt.cm.Paired)\n", "\n", "### Your job is to create the same plot on ax2, but for the multinomial model\n", "#####\n", "# your code here\n", "#####\n", "\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": 
27, "metadata": {}, "outputs": [], "source": [ "#fit a knn model (k=5) for the same data \n", "knn5 = KNeighborsClassifier(___).fit(___)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "### edTest(test_knnaccuracy) ###\n", "\n", "#Calculate the accuracy\n", "knn5_accuracy = ___\n", "print(knn5_accuracy)" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [], "source": [ "# and plot the classification boundary\n", "\n", "y_hat_knn5 = knn5.predict(np.c_[x1x.ravel(), x2x.ravel()])\n", "\n", "fig, ax1 = plt.subplots(1, 1, figsize=(8, 6))\n", "\n", "ax1.pcolormesh(x1x, x2x, y_hat_knn5.reshape(x1x.shape), cmap=plt.cm.Paired,alpha = 0.5)\n", "# Plot also the training points\n", "ax1.scatter(iris['sepal length (cm)'], iris['sepal width (cm)'], c=iris['type'], edgecolors='k', cmap=plt.cm.Paired)\n", "\n", "plt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 4 }