Homework 7

Due Tuesday, December 3rd 2019 at 11:59 PM.

Problem 0: Homework Workflow [10 pts]

Problem 1: Weather Prediction Using a Markov Chain [45 pts]

Problem 2: Databases [45 pts]

Problem 0: Homework Workflow [10 pts]

Once you receive HW6 feedback, you will need to merge your HW6-dev branch into master.

You will earn points for following all stages of the git workflow, which involves:

  • 3 pts for merging HW6-dev into master
  • 5 pts for completing HW7 on HW7-dev
  • 2 pts for making a PR on HW7-dev to merge into master

Problem 1: Weather Prediction Using a Markov Chain [45 pts]

Markov chains are widely used to model and predict discrete events. Underlying Markov chains are Markov processes which make the assumption that the outcome of a future event only depends on the event immediately preceding it. In this problem, we will use a Markov chain to create a basic model for predicting the weather.

Part A: Implement a Markov chain [10 pts]

You will complete this part in the files Markov.py and P1A.py.

Example: Toy weather model

To begin, let us consider how weather behaviour can be represented as a Markov chain, using a toy weather model that obeys the following rules:

  1. A dry day has $0.9$ probability of being followed by a sunny day, and otherwise transitions into a humid day.
  2. A humid day has $0.5$ probability of being followed by a dry day, and otherwise remains humid.
  3. A sunny day has $0.4$ probability of remaining sunny, and otherwise an equal probability of transitioning to a dry day or humid day.

Notice that the preceding rules can be represented as a transition matrix $P$ of the form: $$\begin{matrix} & \text{Dry} & \text{Humid} & \text{Sunny} \\ \text{Dry} & 0 & 0.1 & 0.9 \\ \text{Humid} & 0.5 & 0.5 & 0 \\ \text{Sunny} & 0.3 & 0.3 & 0.4 \end{matrix},$$ where each matrix element $P_{ij}$ is the probability the weather on the next day is of the $j$-th weather type given that the current day is of the $i$-th weather type.

Thus, to calculate the weather probabilities $N$ days from the current day, we sum the probabilities of all paths through the intermediate days, which is equivalent to raising $P$ to the $N$-th power. For example, given that the current day is dry ($i = 1$), the probability of it being humid ($j = 2$) in $N=2$ days is $$\left(P^{2}\right)_{12} = P_{11} P_{12} + P_{12} P_{22} + P_{13} P_{32}.$$
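As a quick numerical check of the two-day example, the following sketch builds the toy transition matrix with NumPy and squares it. The variable names are illustrative, and the code uses 0-based indices (dry = 0, humid = 1, sunny = 2) where the text uses 1-based indices.

```python
import numpy as np

# Toy transition matrix; rows and columns are ordered dry, humid, sunny.
P = np.array([
    [0.0, 0.1, 0.9],
    [0.5, 0.5, 0.0],
    [0.3, 0.3, 0.4],
])

# Every row is a probability distribution over the next day's weather.
assert np.allclose(P.sum(axis=1), 1.0)

# Two-day transition probabilities are the entries of the matrix product P P.
P2 = P @ P

# Probability of a humid day two days after a dry day (0-based entry [0, 1]).
dry_to_humid_in_two_days = P2[0, 1]
print(round(dry_to_humid_in_two_days, 2))  # 0.32
```

The printed value matches the hand calculation $0 \cdot 0.1 + 0.1 \cdot 0.5 + 0.9 \cdot 0.3 = 0.32$.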

Your task: Modified toy weather model

In reality, there is a much broader range of weather conditions than just dry, humid, and sunny. Your task is to create a modified version of the toy model above to account for a slightly broader range of weather conditions.

In the weather.csv file accompanying this homework, each row corresponds to one type of weather in the following order.

  1. sunny
  2. cloudy
  3. rainy
  4. snowy
  5. windy
  6. hailing

Each column gives the probability of one type of weather occurring the following day (also in the order above).

As in the example, the $(i, j)$ element is the probability that the $j$-th weather type occurs after the $i$-th weather type; e.g. the $(1, 2)$ element is the probability that a cloudy day occurs after a sunny day. Take a look at the data and verify that if the current day is sunny ($i=1$), the following day has a $0.4$ probability of being sunny ($j=1$) as well, whereas if the current day is rainy ($i=3$), the following day has a $0.05$ probability of being windy ($j=5$). (You do not need to submit anything for this step; this is just to check that the data is structured as described.)
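One way to carry out this kind of check is sketched below. Because the contents of weather.csv are not reproduced here, the sketch parses the three-state toy matrix from an in-memory string instead; the comma delimiter and the absence of a header row are assumptions about the file layout.

```python
import io
import numpy as np

# Stand-in for a CSV file: the toy dry/humid/sunny matrix, NOT the contents of weather.csv.
csv_text = "0,0.1,0.9\n0.5,0.5,0\n0.3,0.3,0.4\n"

# genfromtxt accepts a file name or a file-like object.
# delimiter=',' is an assumption about how the file is laid out.
matrix = np.genfromtxt(io.StringIO(csv_text), delimiter=',')

print(matrix.shape)  # (3, 3)

# Element (i, j) in the 1-based notation of the text is matrix[i - 1, j - 1] in code.
sunny_to_sunny = matrix[2, 2]

# Each row should sum to 1, since it is a distribution over the next day's weather.
rows_sum_to_one = np.allclose(matrix.sum(axis=1), 1.0)
print(sunny_to_sunny, rows_sum_to_one)
```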

Having made sure that you understand how the weather data is stored in weather.csv, complete the following steps to create and demo your Markov chain model.

1. Create a class called Markov that implements the following methods

  • load_data(array): Loads a 2D numpy array and stores it in the instance variable self.data
    • Hint: You may use numpy.genfromtxt() to load weather.csv.
  • get_prob(current_day_weather, next_day_weather): Returns the probability of next_day_weather on the next day given current_day_weather on the current day.
    • current_day_weather and next_day_weather will be one of the 6 strings describing the weather (sunny, cloudy, rainy, snowy, windy, or hailing).
    • Raise an Exception if current_day_weather or next_day_weather is not one of the 6 strings specified.

The __init__ for this class should initialize self.data as an empty array into which load_data(array) stores array.
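The lookup behaviour described for get_prob can be illustrated on the three-state toy model. This is a minimal sketch of one possible approach, not a required design: the function name transition_prob and the STATES list are hypothetical, and the homework version works with six weather types inside the Markov class.

```python
import numpy as np

# Illustrative three-state example (the toy model), not the six-state homework data.
STATES = ['dry', 'humid', 'sunny']
P = np.array([
    [0.0, 0.1, 0.9],
    [0.5, 0.5, 0.0],
    [0.3, 0.3, 0.4],
])

def transition_prob(matrix, current, following):
    """Look up P(following | current), rejecting unknown state names."""
    for name in (current, following):
        if name not in STATES:
            raise Exception(f"Unknown weather type: {name!r}")
    # The position of each name in STATES gives its row or column index.
    return matrix[STATES.index(current), STATES.index(following)]

print(transition_prob(P, 'dry', 'sunny'))  # 0.9
```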

Use the following as your Markov class skeleton. Save the code in a file called Markov.py.

class Markov:
    def __init__(...):
        # Your implementation here

    def load_data(...):
        # Your implementation here

    def get_prob(...): 
        # Your implementation here

Your Markov class should support the following use case

weather_today = Markov()
weather_today.load_data(weather)    # Where weather is a 2D array read in from a .csv file
print(weather_today.get_prob('sunny', 'cloudy'))

2. Demo your Markov class

In a separate file called P1A.py:

  1. Import the Markov class from Markov.py.
  2. Parse the provided weather.csv file into a numpy array.
  3. Demonstrate that your Markov class works by printing the probability that a windy day follows a cloudy day.

Please place weather.csv in the same directory as Markov.py and P1A.py.

Part B: Markov chain iterators [15 pts]

You will complete this part in the file Markov.py.

Iterators are a convenient way to walk along your Markov chain. Implement the __iter__() and __next__() methods for the Markov class you implemented in the preceding part. The specifications for the __iter__() and __next__() methods are as follows:

  • __iter__() should return the iterator object
  • __next__() should return the next day's weather
    • Each next day's weather should be stochastic. That is, it is randomly selected based on the relative probabilities of the next day's weather types given the current day's weather type. The next day's weather should be returned as a lowercase string (e.g. 'sunny') rather than an index (e.g. 0).

Hint: You may use Python's built-in random.choices function or NumPy's numpy.random.choice function to select the next day's weather based on the relevant relative probabilities.
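To make the iterator protocol and weighted sampling concrete, here is a minimal sketch on the three-state toy model using random.choices. The class ToyChain and its attributes are hypothetical and only illustrate the mechanics; they are not the required structure of the Markov class.

```python
import random

class ToyChain:
    """Illustrative iterator over the three-state toy model (hypothetical class)."""

    STATES = ['dry', 'humid', 'sunny']
    P = [
        [0.0, 0.1, 0.9],
        [0.5, 0.5, 0.0],
        [0.3, 0.3, 0.4],
    ]

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # The object is its own iterator.
        return self

    def __next__(self):
        # Weight each candidate state by the current state's row of P.
        weights = self.P[self.STATES.index(self.current)]
        self.current = random.choices(self.STATES, weights=weights)[0]
        return self.current

random.seed(0)  # Seeded only so that repeated runs give the same walk.
chain = ToyChain('dry')
walk = [next(chain) for _ in range(5)]
print(walk)
```

Note that this iterator never raises StopIteration, so it should be advanced with next() a fixed number of times rather than consumed by an unbounded for loop.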

Part C: Prediction using Markov chain [20 pts]

You will complete this part in the files Markov.py and P1C.py.

In this part, we will predict what the weather will be like in a week for five different cities given each city's current weather.

We can generate predictions for a city's weather in seven days by using the Markov iterator implemented in the previous part. To do so, make the following modifications to your Markov class:

  • Modify __init__ such that it accepts an optional argument day_zero_weather (with default value None) which is the weather of the current day, and stores it as the instance variable day_zero_weather.
  • Implement the method _simulate_weather_for_day(day), where day is a non-negative integer representing the number of days from the current day. This method returns the predicted weather as a string on the specified day.
    • Hint: Make sure that your method returns a sensible result (i.e. the current day's weather) for day = 0.

For the purposes of this problem, rather than producing just one simulation per city, we would like to run 100 such simulations per city and record the most commonly occurring predictions. To that end, please implement get_weather_for_day, which uses _simulate_weather_for_day to run the simulation trials times:

  • get_weather_for_day(day, trials): Returns a list of strings of length trials, where each element is the predicted weather for one trial. Assign a default value to trials. day is an integer representing how many days from the current day we want to predict the weather for.

In other words, because the simulation is stochastic, a single prediction of the weather a few days from now is not very informative. It makes more sense to predict the weather a few days from now many times. From that dataset, you can get a sense of the most likely weather a few days from now.
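Tallying repeated predictions is straightforward with collections.Counter from the standard library. The sketch below uses a made-up list of five outcomes purely for illustration; in the homework, the list would come from the simulation trials.

```python
from collections import Counter

# Made-up outcomes of five simulated trials, for illustration only.
predictions = ['sunny', 'cloudy', 'sunny', 'rainy', 'sunny']

counts = Counter(predictions)
print(dict(counts))  # {'sunny': 3, 'cloudy': 1, 'rainy': 1}

# most_common(1) returns a list holding the single (value, count) pair with the highest count.
most_likely, occurrences = counts.most_common(1)[0]
print(most_likely)  # sunny
```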

Hint: You **must use the iterator** to complete this part of the problem. You may find it helpful to define new instance variables self._current_day and self._current_day_weather, which keep track of the current day number (starting from 0) and its weather.

Hint: You may also find it helpful to define a helper method that resets self._current_day and self._current_day_weather for each new simulation trial.

In summary:

  • Modify the Markov class __init__ method to accept an optional argument day_zero_weather.
  • Implement _simulate_weather_for_day(day) method, which returns predicted weather on day day.
  • Implement get_weather_for_day(day, trials) method, which returns the predicted weather over trials simulations as a list (of strings).

Finally, in a separate file called P1C.py, use the Markov class to find the most common weather seven days from the current day over 100 trials for each city given the following initial weather conditions for each city:

city_weather = {
    'New York': 'rainy',
    'Chicago': 'snowy',
    'Seattle': 'rainy',
    'Boston': 'hailing',
    'Miami': 'windy',
    'Los Angeles': 'cloudy',
    'San Francisco': 'windy'
}

Print the number of occurrences of each weather condition over the 100 trials for each city, e.g.

New York: {'hailing': 6, 'sunny': 36, ...}
Chicago: {'cloudy': 33, 'snowy': 11, ...}
...

Then, print the most commonly predicted weather for each city in the following format:

Most likely weather in seven days
----------------------------------
New York: cloudy
Chicago: cloudy
...

Note: Don't worry if your values don't seem to make intuitive sense (e.g. rainy in Los Angeles). We made up the probabilities!

Deliverables summary

In summary, for Problem 1, your deliverables are as follows:

  • Markov.py: File containing your Markov class
  • P1A.py: Initial demo of the Markov class
  • P1C.py: Demo of your seven-city Markov chain weather prediction model

Problem 2: Databases [45 pts]

You will complete this problem in file P2.py.

In this problem, you will set up an SQL database using the sqlite3 package in Python. The purpose of the database will be to store parameters and model results related to a simple logistic regression problem. Rather than keeping the results in numpy arrays as we usually do, the idea here is to use an SQL database to store the results so that they can easily be accessed from disk at a later stage (by you or another member of your team).

Part A: Database schema [15 pts]

The design of the database should be flexible enough that the results from different model iterations can be stored in the same database. It should also be able to handle a different set of features for each model iteration.

Your task:

Create an SQL database called regression.sqlite containing the tables and respective fields listed below (each table name is followed by its fields):

model_params:

  • id
  • desc
  • param_name
  • value

model_coefs:

  • id
  • desc
  • feature_name
  • value

model_results:

  • id
  • desc
  • train_score
  • test_score

Note: Ensure that the datatype of each field makes sense.
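As an illustration of declaring column types with sqlite3, the sketch below creates one of the tables in an in-memory database. The specific types (INTEGER, TEXT, REAL) are one plausible choice rather than a required schema, and the column name desc is double-quoted because DESC is also an SQL keyword.

```python
import sqlite3

# ':memory:' keeps this sketch off disk; the homework database lives in regression.sqlite.
db = sqlite3.connect(':memory:')
cursor = db.cursor()

# Column types here are one plausible choice, not a required schema.
cursor.execute('''CREATE TABLE model_results (
    id INTEGER,
    "desc" TEXT,
    train_score REAL,
    test_score REAL)''')
db.commit()

# PRAGMA table_info lists one row per column: (cid, name, type, notnull, default, pk).
columns = [(row[1], row[2]) for row in cursor.execute('PRAGMA table_info(model_results)')]
print(columns)
db.close()
```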

Part B: Inserting records [15 pts]

In this part, you will populate the database you created in Part A with some records for a few different model iterations and scenarios.

1. Import additional libraries and load data

Add the following imports to your P2.py file:

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer

Load the data and separate into training and test subsets like so:

# Load data
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=87)

2. Write a function to save data to the database

In P2.py, write a function save_to_database, which saves the data to the database. The function should accept the following arguments:

  • model_id: Identifier number for the model to save data from.
  • model_desc: Description of the model to save data from.
  • db: Database to save data to.
  • model: A fitted model to save data from.
  • X_train, X_test, y_train, y_test: Training and test data.

The X_train, X_test, y_train, y_test inputs should be used to compute test_score and train_score within the save_to_database function.

Assume model is an sklearn.linear_model.LogisticRegression. You should insert the following model information into the corresponding tables in the database:

  • model_params: Values from the get_params method.
  • model_coefs: Coefficient and intercept values of the fitted model (see coef_ and intercept_ attributes in the documentation).

    • Hint: Feature names can be extracted from data via the feature_names attribute.
  • model_results: Train and test accuracy (train_score and test_score) obtained from the score method.

For more details on the methods and attributes listed above, refer to the scikit-learn documentation on logistic regression.
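Inserting several rows at once can be done with parameterized queries. The following sketch is written under stated assumptions: the params dictionary is a hypothetical stand-in shaped like the output of get_params, the value column is declared as TEXT because parameter values can be numbers, strings, booleans, or None, and an in-memory database replaces regression.sqlite.

```python
import sqlite3

db = sqlite3.connect(':memory:')  # In-memory stand-in for regression.sqlite.
cursor = db.cursor()
cursor.execute('CREATE TABLE model_params (id INTEGER, "desc" TEXT, param_name TEXT, value TEXT)')

# Hypothetical parameter dictionary shaped like the output of get_params();
# values of mixed types are converted to text before being stored.
params = {'C': 1.0, 'penalty': 'l2', 'fit_intercept': True, 'random_state': None}

rows = [(1, 'Baseline model', name, str(value)) for name, value in params.items()]

# '?' placeholders let sqlite3 handle quoting of the inserted values.
cursor.executemany('INSERT INTO model_params VALUES (?, ?, ?, ?)', rows)
db.commit()

stored = cursor.execute('SELECT param_name, value FROM model_params ORDER BY param_name').fetchall()
print(stored)
db.close()
```

Storing every value as text is a simplification; a different design might keep numeric parameters in a numeric column.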

3. Baseline logistic regression model

Using the code provided below, insert an entry into the database for a baseline regression model.

# Fit model
baseline_model = LogisticRegression(solver='liblinear')
baseline_model.fit(X_train, y_train)

Use the identifiers model_id = 1 and model_desc = "Baseline model" for this model.

4. Reduced logistic regression model

Now, create a second model using only the features included in the feature_cols list below. Insert the relevant information into the corresponding tables in the database.

feature_cols = ['mean radius', 
                'texture error',
                'worst radius',
                'worst compactness',
                'worst concavity']

Use the identifiers model_id = 2 and model_desc = "Reduced model" for this model.

5. Logistic regression model with L1 penalty

Create one last model using an $L_1$ penalty term and all of the features. Insert the relevant information into the corresponding tables in the database; use the identifiers model_id = 3 and model_desc = "L1 penalty model" for this model.

Hint: Refer to the penalty parameter of the LogisticRegression class. You may need to increase the maximum number of iterations from the default value of max_iter = 100 for convergence.

Part C: Database queries [15 pts]

Query the database to identify the model with the highest validation score.

  • Print the id of the best model and the corresponding test score, like so:

    Best model id: ...
    Best validation score: ...

    where the ... are placeholders for your solution from the database query.

  • Print the feature names and the corresponding coefficients of that model, like so:
    feature1: 8.673
    feature2: 0.24
    ...
    where feature1/2 are the feature names followed by the coefficient value.
  • Use the coefficients extracted from the best model to reproduce the test score (accuracy) of the best performing model (as stored in the database).
    • Hint: You should be able to achieve this by overwriting the relevant variables in a new LogisticRegression object, i.e. there is no need to write your own formula to generate individual predictions (although you are welcome to do so if you want). You will need to run a dummy fit on this object before you are able to manually overwrite the relevant variables.
    • Remarks: This problem demonstrates a simple scenario in which someone with access to your database can easily reproduce your results.
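For readers who prefer the optional manual route mentioned in the hint, the sketch below shows how accuracy follows from stored coefficients for a binary logistic model. All numbers are made up for illustration, and this is the hand-written formula, not the LogisticRegression overwrite approach that the hint recommends.

```python
import numpy as np

# Made-up coefficients and data for a two-feature binary classifier (illustration only).
coef = np.array([1.5, -2.0])
intercept = 0.25
X = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [0.5, 1.0]])
y = np.array([1, 0, 1, 0])

# A logistic model predicts class 1 when sigmoid(x . w + b) > 0.5,
# which is equivalent to the linear score x . w + b being positive.
scores = X @ coef + intercept
y_pred = (scores > 0).astype(int)

# Accuracy is the fraction of predictions that match the true labels.
accuracy = np.mean(y_pred == y)
print(accuracy)  # 1.0
```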

Note: Remember to **close the database** when you are done!
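A query for the highest-scoring row, followed by closing the connection, might look like the sketch below. The table contents are made-up scores in an in-memory database, used only to exercise the query; the real values come from regression.sqlite.

```python
import sqlite3

db = sqlite3.connect(':memory:')  # In-memory stand-in for regression.sqlite.
cursor = db.cursor()
cursor.execute('CREATE TABLE model_results (id INTEGER, "desc" TEXT, train_score REAL, test_score REAL)')

# Made-up scores, used only to exercise the query.
cursor.executemany('INSERT INTO model_results VALUES (?, ?, ?, ?)', [
    (1, 'Model A', 0.95, 0.91),
    (2, 'Model B', 0.93, 0.94),
    (3, 'Model C', 0.97, 0.90),
])

# Sort by test score, highest first, and keep only the top row.
best_id, best_score = cursor.execute(
    'SELECT id, test_score FROM model_results ORDER BY test_score DESC LIMIT 1'
).fetchone()

print('Best model id:', best_id)
print('Best validation score:', best_score)

db.close()  # Release the connection once all queries are done.
```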

Deliverables summary

In summary, for Problem 2, your deliverables are as follows:

  • regression.sqlite: Database of logistic regression models.
  • P2.py: File containing all the code you have written for Problem 2.

Congratulations! You have completed the final homework for this course!