Homework 7

Due Thursday, December 3rd 2020 at 11:59 PM.

Problem 0: Homework Workflow [10 pts]

Problem 1 C++: Weather Prediction Using a Markov Chain [45 pts]

Part A. Implement a Markov chain
Part B. Markov chain iterators
Part C. Prediction using Markov chain
Deliverables summary

Problem 2: Databases [45 pts] USE PYTHON

Part A. Database schema
Part B. Inserting records
Part C. Database queries
Deliverables summary

Problem 0: Homework Workflow [10 pts]

Once you receive HW6 feedback, you will need to merge your HW6-dev branch into master.

You will earn points for following all stages of the git workflow which involves:

3 pts for merging HW6-dev into master
5 pts for completing HW7 on HW7-dev
2 pts for making a PR on HW7-dev to merge into master

Problem 1 C++: Weather Prediction Using a Markov Chain [45 pts]

Markov chains are widely used to model and predict discrete events. Underlying Markov chains are Markov processes which make the assumption that the outcome of a future event only depends on the event immediately preceding it. In this problem, we will use a Markov chain to create a basic model for predicting the weather.

Part A: Implement a Markov chain [10 pts]

You will complete this part in the files Markov.hpp and P1A.cpp.

Example: Toy weather model

To begin, let us consider how weather behaviour can be represented as a Markov chain, using a toy weather model that follows these rules:

A dry day has $0.9$ probability of being followed by a sunny day, and otherwise transitions into a humid day.
A humid day has $0.5$ probability of being followed by a dry day, and otherwise remains humid.
A sunny day has $0.4$ probability of remaining sunny, and otherwise an equal probability of transitioning to a dry day or humid day.

Notice that the preceding rules can be represented as a Markov chain as follows:

[Figure: Toy Markov chain diagram]

And a transition matrix $P$ of the form: $$\begin{matrix} & \text{Dry} & \text{Humid} & \text{Sunny} \\ \text{Dry} & 0 & 0.1 & 0.9 \\ \text{Humid} & 0.5 & 0.5 & 0 \\ \text{Sunny} & 0.3 & 0.3 & 0.4 \end{matrix}$$

where each matrix element $P_{ij}$ is the probability the weather on the next day is of the $j$-th weather type given that the current day is of the $i$-th weather type (e.g. $P_{dry, humid} = 0.1$).

Thus, to calculate the weather probabilities $N$ days from the current day, we sum, over every possible sequence of intermediate days, the product of the transition probabilities along that sequence; this is equivalent to taking the $N$-th power of $P$. For example, given that the current day is dry ($i = 1$), the probability of it being humid ($j = 2$) in $N=2$ days is $$\left(P^{2}\right)_{12} = P_{11} P_{12} + P_{12} P_{22} + P_{13} P_{32}.$$
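As a quick numerical check of this sum, the entries of the toy transition matrix above can be substituted directly (dry is state 1, humid is state 2, sunny is state 3):

$$P_{11} P_{12} + P_{12} P_{22} + P_{13} P_{32} = (0)(0.1) + (0.1)(0.5) + (0.9)(0.3) = 0 + 0.05 + 0.27 = 0.32.$$

So, under the toy model, a dry day is followed two days later by a humid day with probability $0.32$.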

Your task: Modified toy weather model

In reality, there is a much broader range of weather conditions than just dry, humid, and sunny. Your task is to create a modified version of the toy model above to account for a slightly broader range of weather conditions.

In the weather.csv file accompanying this homework, each row corresponds to one type of weather in the following order.

sunny
cloudy
rainy
snowy
windy
hailing

Each column gives the probability of one type of weather occurring the following day (also in the order above).

As in the example, the $(i, j)$ element is the probability that the $j$-th weather type occurs after the $i$-th weather type; e.g. the $(1, 2)$ element is the probability that a cloudy day occurs after a sunny day. Take a look at the data and verify that if the current day is sunny ($i=1$) the following day will have a $0.4$ probability of being sunny ($j=1$) as well, whereas if the current day is rainy ($i=3$) the following day has a probability $0.05$ of being windy ($j=5$). (You do not need to submit anything for this step; this is just to check that the data is structured as described.)

Having made sure that you understand how the weather data is stored in weather.csv, complete the following steps to create and demo your Markov chain model.

1. Create a class called Markov that implements the following methods

load_data(std::string filename="weather.csv"): Loads data from a file and stores it as a 2D array in the instance variable data. The default argument should be weather.csv in the current directory.
- Hint: You may use std::ifstream to load weather.csv. See How to Read & Write csv files in C++.
float get_prob(std::string current_day_weather, std::string next_day_weather): Returns the probability of next_day_weather on the next day given current_day_weather on the current day.
- current_day_weather and next_day_weather will be one of the 6 strings describing the weather (sunny, cloudy, rainy, snowy, windy, or hailing).
- Raise an error if current_day_weather or next_day_weather is not one of the 6 strings specified. Make sure these strings are in lowercase. You can choose to either raise an exception for uppercase inputs, or convert them automatically.

The constructor for this class should initialize data as an empty array of type std::vector< std::vector<float> >, which load_data(std::string filename="weather.csv") then fills. You may need to build a map that takes a weather string as input and returns the integer index used to access the array.

You may use the following as your Markov class skeleton, but you are free to add any auxiliary data or functions.
Save the code in a file called Markov.hpp.

#include <string>
#include <vector>

class Markov {
    /* class data */
    //TODO: Your implementation here
    //TODO: std::vector< std::vector<float> > data

  public:
    /* Constructor */
    // You will need to modify this header line later in Part C
    Markov(){
        //TODO: data = ...
        //TODO: Your implementation here
    }

    void load_data(std::string filename="weather.csv"){
        //TODO: See gormanalysis.com/blog/reading-and-writing-csv-files-with-cpp
        //TODO: Your implementation here
    }

    float get_prob(std::string current_day_weather, std::string next_day_weather){
        //TODO: Your implementation here
    }
};

2. Demo your Markov class

In a separate file called P1A.cpp:

Include the Markov class from Markov.hpp.
Load the provided weather.csv file using the load_data method.
Demonstrate that your Markov class works by printing the probability that a windy day follows a cloudy day.

Please place weather.csv in the same directory as Markov.hpp and P1A.cpp.

An example use of the Markov class:

Markov weather_today;
weather_today.load_data("./weather.csv");
std::cout << weather_today.get_prob("sunny", "cloudy") << std::endl; // This line should print 0.3

Note: You may add a mapping data structure between index and string values for your convenience. For example: 'sunny' maps to index 0, etc.

Part B: Markov chain iterators [15 pts]

You will complete this part in the file Markov.hpp.

Iterators are a convenient way to walk along your Markov chain. Create a MarkovIterator class and implement the constructor, iter(), and next() methods. Then go back to the Markov class you implemented in the preceding part and implement an iter() method that returns a MarkovIterator object.

SUGGESTED C++ Iterator Implementation: C++ Iterator Example

The specifications are as follows:

For the Markov class:
- iter() should call MarkovIterator() and return the iterator object
  - The exact syntax of calling MarkovIterator() will be determined by how you implement MarkovIterator.
  - Hint: get_prob() might come in handy here.

For the MarkovIterator class:
- MarkovIterator(//TODO: e.g. weather_labels, weather_probability, etc.): Implement useful attributes that would help you to implement the next() method below
- iter(): should return the iterator object itself (i.e. *this)
- next(): should return the next day's weather
  - Each next day's weather should be stochastic. That is, it is randomly selected based on the relative probabilities of the next day's weather types given the current day's weather type (see HINT below on discrete_distribution). The next day's weather should be returned as a lowercase string (e.g. "sunny") rather than an index (e.g. 0).
  - Once the $n^{\text{th}}$ day's weather is sampled, it will affect the underlying probability for $(n+1)^{\text{th}}$ day's weather prediction. An instance of MarkovIterator should be able to keep track of either one of or both the current day's predicted weather and the presumed probability distribution.

Hint: You may use the C++ standard library's std::discrete_distribution (a class template declared in <random>) to select the next day's weather based on the relevant relative probabilities. For an example of its use, see the std::discrete_distribution Documentation and a StackOverflow Discussion. Think about what arguments the distribution of your choice takes, and that will help you decide what attributes MarkovIterator(//TODO: Input Arguments) needs.
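To make the intended iterator behaviour concrete, here is a minimal sketch of the same idea. It is written in Python purely for illustration (your deliverable must be in C++), it uses the toy three-state matrix from Part A rather than weather.csv, and the names ToyMarkovIterator, LABELS and TOY_P are hypothetical. Python's random.choices stands in for the weighted sampling that std::discrete_distribution provides in C++.

```python
import random

# Toy model from Part A (rows: current day, columns: next day).
LABELS = ["dry", "humid", "sunny"]
TOY_P = [
    [0.0, 0.1, 0.9],
    [0.5, 0.5, 0.0],
    [0.3, 0.3, 0.4],
]


class ToyMarkovIterator:
    """Hypothetical iterator that walks the toy chain one day at a time."""

    def __init__(self, labels, matrix, start, seed=None):
        self.labels = labels
        self.matrix = matrix
        self.current = start            # weather of the current day
        self.rng = random.Random(seed)  # private generator, so a seed reproduces a run

    def __iter__(self):
        # An iterator returns itself when asked for an iterator.
        return self

    def __next__(self):
        # Take the row of the current weather and sample the next state
        # with probabilities proportional to the entries of that row.
        row = self.matrix[self.labels.index(self.current)]
        self.current = self.rng.choices(self.labels, weights=row, k=1)[0]
        return self.current
```

For example, after `it = ToyMarkovIterator(LABELS, TOY_P, "dry")`, a call to `next(it)` returns either "humid" or "sunny" but never "dry", because the dry-to-dry transition has probability 0. Each later call samples from the row of whatever weather was drawn last, which is the state the iterator has to remember.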

Part C: Prediction using Markov chain [20 pts]

You will complete this part in the files Markov.hpp and P1C.cpp.

In this part, we will predict what the weather will be like in a week for seven different cities given each city's current weather.

We can generate predictions for a city's weather in seven days by using the Markov iterator implemented in the previous part. To do so, make the following modifications to your Markov class:

Add an additional constructor Markov(std::string day_zero_weather) that accepts an argument day_zero_weather, which is the weather of the current day, and stores it as the instance variable day_zero_weather.
Implement the private method std::string _simulate_weather_for_day(unsigned int day), where day is a non-negative integer representing the number of days from the current day. This method returns the predicted weather as a string on the specified day.
- Hint: Make sure that your method returns a sensible result (i.e. the current day's weather) for day = 0.

For the purposes of this problem, rather than producing just one simulation per city, we would like to simulate 100 such predictions per city and find the most commonly occurring prediction. To that end, please implement get_weather_for_day, which uses the private method _simulate_weather_for_day to run the simulation trials times:

get_weather_for_day(unsigned int day, unsigned int trials=1): Returns a vector of std::strings of length trials, where each element is the predicted weather for each trial. Assign a default value to trials. day is an unsigned integer representing how far from the current day we want to predict the weather. For example, if day=3, then we want to predict the weather on day 3.

In other words, because the process is stochastic, a single simulated prediction of the weather a few days from now is not reliable. It makes more sense to simulate the weather for that day many times; from these samples, you can take the most frequent outcome as the most likely weather that many days after day 0.

Hint: You must use the iterator to complete this part of the problem. You may find it helpful to define new instance variables unsigned int _current_day and std::string _current_day_weather, which keep track of the current day number (starting from 0) and its weather. This means that you need to make use of methods (next() and/or iter()) you implemented in Part B to simulate the weather of a given day instead of raising the transition matrix to a certain power.

Hint: You may also find it helpful to define a helper method to reset the _current_day and _current_day_weather for each new simulation trial.

In summary:

Add a Markov class constructor method to accept an argument std::string day_zero_weather.
Implement the private method std::string _simulate_weather_for_day(unsigned int day), which returns the predicted weather on day day.
Implement get_weather_for_day(unsigned int day, unsigned int trials) method, which returns the predicted weather over trials simulations as a vector of std::strings.

Finally, in a separate file called P1C.cpp, use the Markov class to find the most common weather seven days from the current day over 100 trials for each city given the following initial weather conditions for each city:

city_weather = {
    'New York': 'rainy',
    'Chicago': 'snowy',
    'Seattle': 'rainy',
    'Boston': 'hailing',
    'Miami': 'windy',
    'Los Angeles': 'cloudy',
    'San Francisco': 'windy'
}

Print the number of occurrences of each weather condition over the 100 trials for each city, e.g.

New York: {'hailing': 6, 'sunny': 36, ...}
Chicago: {'cloudy': 33, 'snowy': 11, ...}
...

Then, print the most commonly predicted weather for each city in the following format:

Most likely weather in seven days
----------------------------------
New York: cloudy
Chicago: cloudy
...

Note:

Don't worry if your values don't seem to make intuitive sense (e.g. rainy in Los Angeles). We made up the probabilities!
If there is more than one most common weather condition occurring with equal frequency over the 100 trials, you may arbitrarily choose one of the weather conditions to return.
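The tallying step is independent of how the predictions were produced, so it can be sketched on its own. The snippet below is a Python illustration only (P1C.cpp must do the equivalent in C++, for instance with a map from weather string to count); the helper name tally_predictions and the sample input are hypothetical.

```python
from collections import Counter


def tally_predictions(predictions):
    """Return (counts, most_common_weather) for a list of predicted weather strings."""
    counts = Counter(predictions)
    # most_common(1) yields a single (weather, count) pair; when several
    # conditions tie, one of them is returned, which the note above permits.
    most_common_weather, _ = counts.most_common(1)[0]
    return dict(counts), most_common_weather
```

Called on a short made-up list such as `["sunny", "cloudy", "sunny"]`, this returns the counts `{'sunny': 2, 'cloudy': 1}` together with `'sunny'` as the most common condition.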

Deliverables summary

In summary, for Problem 1, your deliverables are as follows:

Markov.hpp: File containing your Markov class
P1A.cpp: Initial demo of the Markov class
P1C.cpp: Demo of your seven-city Markov chain weather prediction model

Problem 2: Databases [45 pts] USE PYTHON ONLY!

Even though this is the C++ PSet, please do this problem only in Python.
You will complete this problem in file P2.py.

In this problem, you will set up an SQL database using the sqlite3 package in Python. The purpose of the database will be to store parameters and model results related to a simple logistic regression problem. Rather than keeping the results in numpy arrays as we usually do, the idea here is to store them in an SQL database so that they can easily be read back from disk at a later stage (by you or another member of your team).

Part A: Database schema [15 pts]

The design of the database should be flexible enough that the results from different model iterations can be stored in it. It should also be able to handle a different set of features in each model iteration.

Your task:

Create an SQL database called regression.sqlite containing the tables and respective fields shown below (each table name is followed by its fields):

model_params:

id
desc
param_name
value

model_coefs:

id
desc
feature_name
value

model_results:

id
desc
train_score
test_score

Note: desc is a short description of the model. See Part B question 3 for an example.

Note: Ensure that the datatype of each field makes sense.
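One possible way to set up this schema with sqlite3 is sketched below. The helper name create_tables and the specific column types are assumptions, not requirements; in particular, storing parameter values as TEXT is just one choice that copes with their mixed types. Because desc is also an SQL keyword (as in ORDER BY ... DESC), the sketch quotes it to keep the statements unambiguous.

```python
import sqlite3


def create_tables(db):
    """Create the three tables if they do not exist yet (hypothetical helper)."""
    cursor = db.cursor()
    # One row per (model, parameter); values are kept as text (assumed choice).
    cursor.execute('''CREATE TABLE IF NOT EXISTS model_params (
        id INTEGER NOT NULL,
        "desc" TEXT,
        param_name TEXT,
        value TEXT)''')
    # One row per (model, feature); coefficients are floating-point numbers.
    cursor.execute('''CREATE TABLE IF NOT EXISTS model_coefs (
        id INTEGER NOT NULL,
        "desc" TEXT,
        feature_name TEXT,
        value REAL)''')
    # One row per model with its two accuracy scores.
    cursor.execute('''CREATE TABLE IF NOT EXISTS model_results (
        id INTEGER NOT NULL,
        "desc" TEXT,
        train_score REAL,
        test_score REAL)''')
    db.commit()
```

A typical call would be `db = sqlite3.connect('regression.sqlite')` followed by `create_tables(db)`. Note that id is deliberately not a primary key in model_params and model_coefs, since each model contributes several rows to those tables.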

Part B: Inserting records [15 pts]

In this part, you will populate the database you created in Part A with some records for a few different model iterations and scenarios.

1. Import additional libraries and load data

Add the following imports to your P2.py file:

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer

Load the data and separate into training and test subsets like so:

# Load data
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=87)

2. Write a function to save data to the database

In P2.py, write a function save_to_database, which saves the data to the database. The function should accept the following arguments:

model_id: Identifier number for the model to save data from.
model_desc: Description of the model to save data from.
db: Database to save data to.
model: A fitted model to save data from.
X_train, X_test, y_train, y_test: Training and test data.

The X_train, X_test, y_train, y_test inputs should be used to compute test_score and train_score within the save_to_database function.

Assume model is an sklearn.linear_model.LogisticRegression. Your function should be able to insert the following model information into the corresponding tables in the database:

model_params: Values from the get_params method.
model_coefs: Coefficient and intercept values of the fitted model (see coef_ and intercept_ attributes in the documentation).
- Hint: Feature names can be extracted from data via the feature_names attribute.
model_results: Train and test (validation) accuracy obtained from the score method, stored as train_score and test_score.

For more details on the methods and attributes listed above, refer to the scikit-learn documentation on logistic regression.
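A minimal sketch of such a function follows; treat it as one possible layout rather than the expected solution. It assumes that db is an open sqlite3.Connection whose tables from Part A already exist, that X_train is a pandas DataFrame whose column names label the coefficients, that parameter values are stored as text, and that the intercept is stored under the made-up feature name 'intercept'.

```python
import sqlite3


def save_to_database(model_id, model_desc, db: sqlite3.Connection, model,
                     X_train, X_test, y_train, y_test):
    """Insert the parameters, coefficients and scores of a fitted model."""
    cursor = db.cursor()

    # Hyperparameters have mixed types (None, bool, str, numbers),
    # so this sketch converts every value to a string.
    for name, value in model.get_params().items():
        cursor.execute(
            'INSERT INTO model_params (id, "desc", param_name, value) VALUES (?, ?, ?, ?)',
            (model_id, model_desc, name, str(value)))

    # For a binary problem coef_ has shape (1, n_features): pair row 0 with the column names.
    for name, value in zip(X_train.columns, model.coef_[0]):
        cursor.execute(
            'INSERT INTO model_coefs (id, "desc", feature_name, value) VALUES (?, ?, ?, ?)',
            (model_id, model_desc, str(name), float(value)))
    # The intercept goes into the same table under an assumed label.
    cursor.execute(
        'INSERT INTO model_coefs (id, "desc", feature_name, value) VALUES (?, ?, ?, ?)',
        (model_id, model_desc, 'intercept', float(model.intercept_[0])))

    # Accuracy on the training and test subsets.
    cursor.execute(
        'INSERT INTO model_results (id, "desc", train_score, test_score) VALUES (?, ?, ?, ?)',
        (model_id, model_desc,
         float(model.score(X_train, y_train)),
         float(model.score(X_test, y_test))))
    db.commit()
```

Using ? placeholders lets sqlite3 handle quoting and type conversion, and naming the columns in each INSERT keeps the statements valid even if the column order of a table differs from the one assumed here.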

3. Baseline logistic regression model

Using the code provided below, insert an entry into the database for a baseline regression model.

# Fit model
baseline_model = LogisticRegression(solver='liblinear')
baseline_model.fit(X_train, y_train)

Use the identifiers model_id = 1 and model_desc = "Baseline model" for this model:

save_to_database(1, 'Baseline model', db, baseline_model, X_train, X_test, y_train, y_test)

4. Reduced logistic regression model

We want to add another model to our database. Create a second model using only the features included in the feature_cols list below.

feature_cols = ['mean radius', 
                'texture error',
                'worst radius',
                'worst compactness',
                'worst concavity']

X_train_reduced = X_train[feature_cols]
X_test_reduced = X_test[feature_cols]

reduced_model = LogisticRegression(solver='liblinear')
reduced_model.fit(X_train_reduced, y_train)

Insert the relevant information into the corresponding tables in the database. Use the identifiers model_id = 2 and model_desc = "Reduced model" for this model.

5. Logistic regression model with L1 penalty

Create one last model using an L1-penalty ($L_1$) term and all of the features. Insert the relevant information into the corresponding tables in the database; use the identifiers model_id = 3 and model_desc = "L1 penalty model" for this model.

penalized_model = LogisticRegression(solver='liblinear', penalty='l1', random_state=87, max_iter=150)
penalized_model.fit(X_train, y_train)

Hint: Refer to the penalty parameter of the LogisticRegression class. You may need to increase the maximum number of iterations from the default value of max_iter = 100 for convergence.

Part C: Database queries [15 pts]

Query the database to identify the model with the highest validation score (the test_score field).

Print the id of the best model and the corresponding test score, like so:
```
Best model id: ...
Best validation score: ...
```
where the ... are placeholders for your solution from the database query.

Print the feature names and the corresponding coefficients of that model, like so:
```
feature1: 8.673
feature2: 0.24
...
```
where feature1/2 are the feature names followed by the coefficient value.

Use the coefficients extracted from the best model to reproduce the test score (accuracy) of the best performing model (as stored in the database).
- Hint: You should be able to achieve this by overwriting the relevant variables in a new LogisticRegression object, i.e. there is no need to write your own formula to generate individual predictions (though you are welcome to do so if you want). You will need to run a dummy fit on this object before you can manually overwrite the relevant variables.
- Remarks: This problem demonstrates a simple scenario in which someone with access to your database can easily reproduce your results.

Here is some code to accomplish this. You need to determine coef and intercept.

test_model = LogisticRegression(solver='liblinear')
test_model.fit(X_train, y_train)

# Manually change fit parameters
test_model.coef_ = np.array([coef])
test_model.intercept_ = np.array([intercept])

test_score = test_model.score(X_test, y_test)
print(f'Reproduced best validation score: {test_score}')
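The two queries that come before the reproduction step could be sketched as follows. The helper names best_model and model_coefficients are hypothetical, and db is assumed to be an open sqlite3.Connection to a database with the Part A schema.

```python
import sqlite3


def best_model(db):
    """Return (id, test_score) for the row with the highest test_score."""
    return db.execute(
        "SELECT id, test_score FROM model_results "
        "ORDER BY test_score DESC LIMIT 1"
    ).fetchone()


def model_coefficients(db, model_id):
    """Return [(feature_name, value), ...] for one model, in insertion order."""
    return db.execute(
        "SELECT feature_name, value FROM model_coefs "
        "WHERE id = ? ORDER BY rowid",
        (model_id,),
    ).fetchall()
```

With these helpers, `best_id, best_score = best_model(db)` gives the two values to print, and looping over `model_coefficients(db, best_id)` gives the name and value pairs in the order they were inserted. How you separate the intercept from the feature coefficients depends on how you chose to store it in Part B.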

Note: Remember to **close the database** when you are done!

Deliverables summary

In summary, for Problem 2, your deliverables are as follows:

regression.sqlite: Database of logistic regression models.
P2.py: File containing all the code you have written for Problem 2.

Congratulations! You have completed the final homework for this course!