Homework 7
Due Thursday, December 3rd 2020 at 11:59 PM.
Problem 0: Homework Workflow [10 pts]
Problem 1 C++: Weather Prediction Using a Markov Chain [45 pts]
- Part A. Implement a Markov chain
- Part B. Markov chain iterators
- Part C. Prediction using Markov chain
- Deliverables summary
Problem 2: Databases [45 pts] USE PYTHON
- Part A. Database schema
- Part B. Inserting records
- Part C. Database queries
- Deliverables summary
Problem 0: Homework Workflow [10 pts]
Once you receive HW6 feedback, you will need to merge your HW6-dev branch into master.
You will earn points for following all stages of the git workflow, which involves:
- 3 pts for merging HW6-dev into master
- 5 pts for completing HW7 on HW7-dev
- 2 pts for making a PR on HW7-dev to merge into master
Problem 1 C++: Weather Prediction Using a Markov Chain [45 pts]
Markov chains are widely used to model and predict discrete events. Underlying Markov chains are Markov processes which make the assumption that the outcome of a future event only depends on the event immediately preceding it. In this problem, we will use a Markov chain to create a basic model for predicting the weather.
Part A: Implement a Markov chain [10 pts]
You will complete this part in the files Markov.hpp and P1A.cpp.
Example: Toy weather model
To begin, consider how weather behaviour can be represented as a Markov chain using a toy weather model that follows these rules:
- A dry day has $0.9$ probability of being followed by a sunny day, and otherwise transitions into a humid day.
- A humid day has $0.5$ probability of being followed by a dry day, and otherwise remains humid.
- A sunny day has $0.4$ probability of remaining sunny, and otherwise an equal probability of transitioning to a dry day or humid day.
Notice that the preceding rules can be represented as a Markov chain with a transition matrix $P$ of the form: $$\begin{matrix} & \text{Dry} & \text{Humid} & \text{Sunny} \\ \text{Dry} & 0 & 0.1 & 0.9 \\ \text{Humid} & 0.5 & 0.5 & 0 \\ \text{Sunny} & 0.3 & 0.3 & 0.4 \end{matrix}$$
where each matrix element $P_{ij}$ is the probability the weather on the next day is of the $j$-th weather type given that the current day is of the $i$-th weather type (e.g. $P_{dry, humid} = 0.1$).
Thus, to calculate the weather probabilities $N$ days from the current day, we raise the transition matrix to the $N$-th power, which sums the probabilities over all possible intermediate weather sequences. For example, given that the current day is dry ($i = 1$), the probability of it being humid ($j = 2$) in $N=2$ days is $$\left(P^{2}\right)_{12} = P_{11} P_{12} + P_{12} P_{22} + P_{13} P_{32}.$$
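As a quick numerical check of the two-day example, substituting the entries of the toy transition matrix $P$ (dry, humid, sunny as indices 1, 2, 3) gives:

$$\left(P^{2}\right)_{12} = (0)(0.1) + (0.1)(0.5) + (0.9)(0.3) = 0 + 0.05 + 0.27 = 0.32.$$

So in the toy model a dry day is followed by a humid day two days later with probability $0.32$, mostly via an intermediate sunny day.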
Your task: Modified toy weather model
In reality, there is a much broader range of weather conditions than just dry, humid, and sunny. Your task is to create a modified version of the toy model above to account for a slightly broader range of weather conditions.
In the weather.csv file accompanying this homework, each row corresponds to one type of weather in the following order:
- sunny
- cloudy
- rainy
- snowy
- windy
- hailing
Each column gives the probability of one type of weather occurring the following day (also in the order above).
As in the example, the $(i, j)$ element is the probability that the $j$-th weather type occurs after the $i$-th weather type; e.g. the $(1, 2)$ element is the probability that a cloudy day occurs after a sunny day. Take a look at the data and verify that if the current day is sunny ($i=1$) the following day will have a $0.4$ probability of being sunny ($j=1$) as well, whereas if the current day is rainy ($i=3$) the following day has a probability $0.05$ of being windy ($j=5$). (You do not need to submit anything for this step; this is just to check that the data is structured as described.)
Having made sure that you understand how the weather data is stored in weather.csv, complete the following steps to create and demo your Markov chain model.
1. Create a class called Markov that implements the following methods:
- load_data(std::string filename="weather.csv"): Loads data from a file and stores it as a 2D array in the instance variable data. The default argument should be weather.csv in the current directory.
  - Hint: You may use std::ifstream to load weather.csv. See How to Read & Write csv files in C++.
- float get_prob(std::string current_day_weather, std::string next_day_weather): Returns the probability of next_day_weather on the next day given current_day_weather on the current day. current_day_weather and next_day_weather will be one of the 6 strings describing the weather (sunny, cloudy, rainy, snowy, windy, or hailing).
  - Raise an error if current_day_weather or next_day_weather is not one of the 6 strings specified. Make sure these strings are in lowercase. You can choose to either raise an exception for uppercase inputs, or convert them automatically.
The constructor for this class should initialize data as an empty array (a std::vector of vectors, declared as std::vector< std::vector<float> > data) into which load_data(std::string filename="weather.csv") stores the loaded values. You may need to build a map which takes a weather string as input and returns an integer index for accessing the array.
You may use the following as your Markov class skeleton, but you are free to add any auxiliary data or functions.
Save the code in a file called Markov.hpp.
#include <string>
#include <vector>

class Markov {
    /* class data */
    //TODO: Your implementation here
    //TODO: std::vector< std::vector<float> > data
public:
    /* Constructor */
    // You will need to modify this header line later in Part C
    Markov(){
        //TODO: data = ...
        //TODO: Your implementation here
    }
    void load_data(std::string filename="weather.csv"){
        //TODO: See gormanalysis.com/blog/reading-and-writing-csv-files-with-cpp
        //TODO: Your implementation here
    }
    float get_prob(std::string current_day_weather, std::string next_day_weather){
        //TODO: Your implementation here
    }
};
2. Demo your Markov class
In a separate file called P1A.cpp:
- Include the Markov class from Markov.hpp.
- Load the provided weather.csv file using load_data.
- Demonstrate that your Markov class works by printing the probability that a windy day follows a cloudy day.
Please place weather.csv in the same directory as Markov.hpp and P1A.cpp.
An example use of the Markov class:
Markov weather_today;
weather_today.load_data("./weather.csv");
std::cout << weather_today.get_prob("sunny", "cloudy") << std::endl; // This line should print 0.3
Note: You may add a mapping data structure between index and string values for your convenience. For example: 'sunny' maps to index 0, etc.
Part B: Markov chain iterators [15 pts]
You will complete this part in the file Markov.hpp.
Iterators are a convenient way to walk along your Markov chain. Create a MarkovIterator class and implement the constructor, iter(), and next() methods. Then go back to the Markov class you implemented in the preceding part and implement an iter() method that returns a MarkovIterator object.
SUGGESTED C++ Iterator Implementation: C++ Iterator Example
The specifications are as follows:
- For the Markov class:
  - iter() should call MarkovIterator() and return the iterator object.
    - The exact syntax of calling MarkovIterator() will be determined by how you implement MarkovIterator.
    - Hint: get_prob() might come in handy here.
- For the MarkovIterator class:
  - MarkovIterator(//TODO: e.g. weather_labels, weather_probability, etc.): Implement useful attributes that would help you to implement the next() method below.
  - iter(): should return self.
  - next(): should return the next day's weather.
    - Each next day's weather should be stochastic. That is, it is randomly selected based on the relative probabilities of the next day's weather types given the current day's weather type (see the hint below on discrete_distribution). The next day's weather should be returned as a lowercase string (e.g. "sunny") rather than an index (e.g. 0).
    - Once the $n^{\text{th}}$ day's weather is sampled, it will affect the underlying probability for the $(n+1)^{\text{th}}$ day's weather prediction. An instance of MarkovIterator should be able to keep track of the current day's predicted weather, the presumed probability distribution, or both.
Hint: You may use C++'s built-in std::discrete_distribution class template to select the next day's weather based on the relevant relative probabilities. To see an example use, see the std::discrete_distribution Documentation and a StackOverflow Discussion. Think of what the function of your choice takes as an argument, and that will help you decide what attributes MarkovIterator(//TODO: Input Arguments) needs.
Part C: Prediction using Markov chain [20 pts]
You will complete this part in the files Markov.hpp and P1C.cpp.
In this part, we will predict what the weather will be like in a week for seven different cities given each city's current weather.
We can generate predictions for a city's weather in seven days by using the Markov iterator implemented in the previous part.
To do so, make the following modifications to your Markov class:
- Add an additional constructor Markov(std::string day_zero_weather) that accepts an argument day_zero_weather, which is the weather of the current day, and stores it as the instance variable day_zero_weather.
- Implement the private method std::string _simulate_weather_for_day(unsigned int day), where day is a non-negative integer representing the number of days from the current day. This method returns the predicted weather as a string on the specified day.
  - Hint: Make sure that your method returns a sensible result (i.e. the current day's weather) for day = 0.
For the purposes of this problem, rather than producing just one simulation per city, we would like to simulate 100 such predictions per city and store the most commonly occurring predictions.
To that end, please implement get_weather_for_day, which uses the private function _simulate_weather_for_day to run the simulation trials times:
- get_weather_for_day(unsigned int day, unsigned int trials=1): Returns a vector of std::strings of length trials, where each element is the predicted weather for one trial. Assign a default value to trials. day is an unsigned integer representing how far from the current day we want to predict the weather. For example, if day=3, then we want to predict the weather on day 3.
In other words, because things are stochastic, predicting the weather a few days from now cannot be perfectly accurate. What makes more sense would be to predict the weather a few days from now a bunch of times. From the sampling, you can make a most likely prediction on the weather a few days away from day 0.
Hint: You must use the iterator to complete this part of the problem. You may find it helpful to define new instance variables unsigned int _current_day and std::string _current_day_weather, which keep track of the current day number (starting from 0) and its weather. This means that you need to make use of the methods (next() and/or iter()) you implemented in Part B to simulate the weather of a given day instead of raising the transition matrix to a certain power.
Hint: You may also find it helpful to define a helper method to reset _current_day and _current_day_weather for each new simulation trial.
In summary:
- Add a Markov class constructor that accepts an argument std::string day_zero_weather.
- Implement the private method std::string _simulate_weather_for_day(unsigned int day), which returns the predicted weather on day day.
- Implement the get_weather_for_day(unsigned int day, unsigned int trials) method, which returns the predicted weather over trials simulations as a vector of std::strings.
Finally, in a separate file called P1C.cpp, use the Markov class to find the most common weather seven days from the current day over 100 trials for each city given the following initial weather conditions for each city:
city_weather = {
'New York': 'rainy',
'Chicago': 'snowy',
'Seattle': 'rainy',
'Boston': 'hailing',
'Miami': 'windy',
'Los Angeles': 'cloudy',
'San Francisco': 'windy'
}
Print the number of occurrences of each weather condition over the 100 trials for each city, e.g.
New York: {'hailing': 6, 'sunny': 36, ...}
Chicago: {'cloudy': 33, 'snowy': 11, ...}
...
Then, print the most commonly predicted weather for each city in the following format:
Most likely weather in seven days
----------------------------------
New York: cloudy
Chicago: cloudy
...
Note:
- Don't worry if your values don't seem to make intuitive sense (e.g. rainy in Los Angeles). We made up the probabilities!
- If there is more than one most common weather condition occurring with equal frequency over the 100 trials, you may arbitrarily choose one of the weather conditions to return.
Deliverables summary
In summary, for Problem 1, your deliverables are as follows:
- Markov.hpp: File containing your Markov class
- P1A.cpp: Initial demo of the Markov class
- P1C.cpp: Demo of your seven-city Markov chain weather prediction model
Problem 2: Databases [45 pts] USE PYTHON ONLY!
Even though this is the C++ PSet, please do this problem only in Python.
You will complete this problem in the file P2.py.
In this problem, you will set up an SQL database using the sqlite3 package in Python.
The purpose of the database will be to store parameters and model results related to a simple logistic regression problem.
Rather than keeping the results in numpy arrays as we usually do, the idea here is to make use of an SQL database to store the results so that they can easily be accessed from disk at a later stage (by you or another member of your team).
Part A: Database schema [15 pts]
The design of the database should be flexible enough so that the results from different model iterations can be stored in the database. It should also be able to deal with a different set of features by model iteration.
Your task:
Create an SQL database called regression.sqlite containing the list of tables and respective fields as shown below (tables are in bold):
**model_params:**
- id
- desc
- param_name
- value
**model_coefs:**
- id
- desc
- feature_name
- value
**model_results:**
- id
- desc
- train_score
- test_score
Note: desc is a short description of the model. See Part B question 3 for an example.
Note: Ensure that the datatype of each field makes sense.
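To make the schema concrete, here is a minimal sketch of creating the three tables with sqlite3. The column types (INTEGER, TEXT, REAL) and the helper name create_schema are assumptions for illustration, not a required design. The sketch uses an in-memory database so it can be run without touching regression.sqlite, and it quotes "desc" as a precaution because DESC is also an SQL keyword.

```python
import sqlite3


def create_schema(conn):
    """Create the three tables from Part A (column types are assumptions)."""
    cur = conn.cursor()
    # value is TEXT here because model parameters can be numbers, strings, or None
    cur.execute('''CREATE TABLE IF NOT EXISTS model_params (
        id INTEGER NOT NULL,
        "desc" TEXT,
        param_name TEXT,
        value TEXT)''')
    cur.execute('''CREATE TABLE IF NOT EXISTS model_coefs (
        id INTEGER NOT NULL,
        "desc" TEXT,
        feature_name TEXT,
        value REAL)''')
    cur.execute('''CREATE TABLE IF NOT EXISTS model_results (
        id INTEGER NOT NULL,
        "desc" TEXT,
        train_score REAL,
        test_score REAL)''')
    conn.commit()


# ":memory:" keeps the demo self-contained; the assignment uses "regression.sqlite"
conn = sqlite3.connect(":memory:")
create_schema(conn)

# List the tables that now exist in the database
tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")
)
print(tables)
conn.close()
```

Because id is not declared as a primary key in this sketch, several rows can share the same model id, which is what allows one model to own many parameter and coefficient rows.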
Part B: Inserting records [15 pts]
In this part, you will populate the database you created in Part A with some records for a few different model iterations and scenarios.
1. Import additional libraries and load data
Add the following imports to your P2.py file:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
Load the data and separate into training and test subsets like so:
# Load data
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=87)
2. Write a function to save data to the database
In P2.py, write a function save_to_database, which saves the data to the database.
The function should accept the following arguments:
- model_id: Identifier number for the model to save data from.
- model_desc: Description of the model to save data from.
- db: Database to save data to.
- model: A fitted model to save data from.
- X_train, X_test, y_train, y_test: Training and test data.
The X_train, X_test, y_train, y_test inputs should be used to compute test_score and train_score within the save_to_database function.
Assume model is an sklearn.linear_model.LogisticRegression.
Your function should be able to insert the following model information into the corresponding tables in the database:
- model_params: Values from the get_params method.
- model_coefs: Coefficient and intercept values of the fitted model (see the coef_ and intercept_ attributes in the documentation).
  - Hint: Feature names can be extracted from data via the feature_names attribute.
- model_results: Train and validation accuracy obtained from the score method.
For more details on the methods and attributes listed above, refer to the scikit-learn documentation on logistic regression.
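The insertion step can be sketched with parameterized queries, which avoid building SQL strings by hand. This is only an illustration under stated assumptions: the params, feature_names, coefs, and intercept values below are made-up stand-ins for what get_params, feature_names, coef_, and intercept_ would provide, the table layout is the assumed one with a TEXT value column for parameters, and storing the intercept under the feature name "intercept" is one possible convention rather than a requirement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for regression.sqlite
conn.execute('CREATE TABLE model_params (id INTEGER, "desc" TEXT, param_name TEXT, value TEXT)')
conn.execute('CREATE TABLE model_coefs (id INTEGER, "desc" TEXT, feature_name TEXT, value REAL)')

model_id, model_desc = 1, "Baseline model"

# Hypothetical stand-ins for model.get_params(), data.feature_names,
# model.coef_[0], and model.intercept_[0]
params = {"C": 1.0, "penalty": "l2", "class_weight": None}
feature_names = ["mean radius", "mean texture"]
coefs = [0.5, -0.25]
intercept = 0.1

# One row per parameter; values are stored as text because their types differ
conn.executemany(
    "INSERT INTO model_params VALUES (?, ?, ?, ?)",
    [(model_id, model_desc, name, str(value)) for name, value in params.items()],
)

# One row per coefficient, plus one extra row for the intercept
coef_rows = [
    (model_id, model_desc, name, float(c)) for name, c in zip(feature_names, coefs)
]
coef_rows.append((model_id, model_desc, "intercept", float(intercept)))
conn.executemany("INSERT INTO model_coefs VALUES (?, ?, ?, ?)", coef_rows)
conn.commit()

n_params = conn.execute("SELECT COUNT(*) FROM model_params").fetchone()[0]
n_coefs = conn.execute("SELECT COUNT(*) FROM model_coefs").fetchone()[0]
print(n_params, n_coefs)
conn.close()
```

The ? placeholders let sqlite3 handle quoting and type conversion, which matters for feature names that contain spaces.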
3. Baseline logistic regression model
Using the code provided below, insert an entry into the database for a baseline regression model.
# Fit model
baseline_model = LogisticRegression(solver='liblinear')
baseline_model.fit(X_train, y_train)
Use the identifiers model_id = 1 and model_desc = "Baseline model" for this model:
save_to_database(1, 'Baseline model', db, baseline_model, X_train, X_test, y_train, y_test)
4. Reduced logistic regression model
We want to add another model into our database. Create a second model using only the features included in the feature_cols list below.
feature_cols = ['mean radius',
'texture error',
'worst radius',
'worst compactness',
'worst concavity']
X_train_reduced = X_train[feature_cols]
X_test_reduced = X_test[feature_cols]
reduced_model = LogisticRegression(solver='liblinear')
reduced_model.fit(X_train_reduced, y_train)
Insert the relevant information into the corresponding tables in the database.
Use the identifiers model_id = 2 and model_desc = "Reduced model" for this model.
5. Logistic regression model with L1 penalty
Create one last model using an L1-penalty ($L_1$) term and all of the features.
Insert the relevant information into the corresponding tables in the database; use the identifiers model_id = 3 and model_desc = "L1 penalty model" for this model.
penalized_model = LogisticRegression(solver='liblinear', penalty='l1', random_state=87, max_iter=150)
penalized_model.fit(X_train, y_train)
Hint: Refer to the penalty parameter of the LogisticRegression class. You may need to increase the maximum number of iterations from the default value of max_iter = 100 for convergence.
Part C: Database queries [15 pts]
- Query the database to identify the model with the highest validation score. Print the id of the best model and the corresponding test score, like so:
Best model id: ...
Best validation score: ...
where the ... are placeholders for your solution from the database query.
- Print the feature names and the corresponding coefficients of that model, like so:
feature1: 8.673
feature2: 0.24
...
where feature1/2 are the feature names followed by the coefficient value.
- Use the coefficients extracted from the best model to reproduce the test score (accuracy) of the best performing model (as stored in the database).
  - Hint: You should be able to achieve this by overwriting the relevant variables in a new LogisticRegression object, i.e. there is no need to write your own formula to generate individual predictions (you are welcome to do this if you want). You will need to run a dummy fit on this object before you are able to manually overwrite the relevant variables.
  - Remarks: This problem demonstrates a simple scenario in which someone with access to your database can easily reproduce your results.
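The first query in this part (finding the model with the highest test score) can be sketched with ORDER BY and LIMIT. The scores inserted below are made up purely so the example runs on its own, and the table layout is the assumed one with REAL score columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for regression.sqlite
conn.execute(
    'CREATE TABLE model_results (id INTEGER, "desc" TEXT, train_score REAL, test_score REAL)'
)
# Made-up scores purely for illustration
conn.executemany(
    "INSERT INTO model_results VALUES (?, ?, ?, ?)",
    [
        (1, "Baseline model", 0.95, 0.93),
        (2, "Reduced model", 0.94, 0.96),
        (3, "L1 penalty model", 0.97, 0.95),
    ],
)

# Sort by test score, highest first, and keep only the top row
best_id, best_score = conn.execute(
    "SELECT id, test_score FROM model_results ORDER BY test_score DESC LIMIT 1"
).fetchone()

print(f"Best model id: {best_id}")
print(f"Best validation score: {best_score}")
conn.close()
```

The returned id can then be used in a WHERE clause on model_coefs to fetch the feature names and coefficients of that model.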
Here is some code to accomplish this. You need to determine coef and intercept.
test_model = LogisticRegression(solver='liblinear')
test_model.fit(X_train, y_train)
# Manually change fit parameters
test_model.coef_ = np.array([coef])
test_model.intercept_ = np.array([intercept])
test_score = test_model.score(X_test, y_test)
print(f'Reproduced best validation score: {test_score}')
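For readers curious about the optional route of writing the prediction formula by hand, here is a minimal pure-Python sketch. It assumes a binary problem with labels 0 and 1, where class 1 is predicted when the linear score (coefficients times features, plus intercept) is positive. The function names and all numbers are hypothetical and only illustrate the arithmetic.

```python
def predict_label(features, coef, intercept):
    """Predict 1 when the linear score is positive, otherwise 0."""
    score = sum(w * x for w, x in zip(coef, features)) + intercept
    return 1 if score > 0 else 0


def accuracy(X, y, coef, intercept):
    """Fraction of rows whose predicted label matches the true label."""
    correct = sum(
        predict_label(row, coef, intercept) == label for row, label in zip(X, y)
    )
    return correct / len(y)


# Made-up coefficients and data purely for illustration
coef = [2.0, -1.0]
intercept = -0.5
X_demo = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.2, 0.1]]
y_demo = [1, 0, 1, 1]

demo_score = accuracy(X_demo, y_demo, coef, intercept)
print(demo_score)
```

In the assignment, coef and intercept would come from the model_coefs table, and the rows would come from X_test restricted to the features that the best model was trained on.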
Note: Remember to **close the database** when you are done!
Deliverables summary
In summary, for Problem 2, your deliverables are as follows:
- regression.sqlite: Database of logistic regression models.
- P2.py: File containing all the code you have written for Problem 2.