Homework 7
Due Tuesday, December 3rd 2019 at 11:59 PM.
Problem 0: Homework Workflow [10 pts]
Problem 1: Weather Prediction Using a Markov Chain [45 pts]
- Part A. Implement a Markov chain
- Part B. Markov chain iterators
- Part C. Prediction using Markov chain
- Deliverables summary
Problem 2: Databases [45 pts]
- Part A. Database schema
- Part B. Inserting records
- Part C. Database queries
- Deliverables summary
Problem 0: Homework Workflow [10 pts]
Once you receive HW6 feedback, you will need to merge your HW6-dev branch into master.
You will earn points for following all stages of the git workflow, which involves:
- 3 pts for merging HW6-dev into master
- 5 pts for completing HW7 on HW7-dev
- 2 pts for making a PR on HW7-dev to merge into master
Problem 1: Weather Prediction Using a Markov Chain [45 pts]
Markov chains are widely used to model and predict discrete events. Underlying Markov chains are Markov processes which make the assumption that the outcome of a future event only depends on the event immediately preceding it. In this problem, we will use a Markov chain to create a basic model for predicting the weather.
Part A: Implement a Markov chain [10 pts]
You will complete this part in the files Markov.py and P1A.py.
Example: Toy weather model
To begin with, let us see how weather behaviour can be represented as a Markov chain by considering a toy weather model that follows these rules:
- A dry day has $0.9$ probability of being followed by a sunny day, and otherwise transitions into a humid day.
- A humid day has $0.5$ probability of being followed by a dry day, and otherwise remains humid.
- A sunny day has $0.4$ probability of remaining sunny, and otherwise an equal probability of transitioning to a dry day or humid day.
Notice that the preceding rules can be represented as a transition matrix $P$ of the form: $$\begin{matrix} & \text{Dry} & \text{Humid} & \text{Sunny} \\ \text{Dry} & 0 & 0.1 & 0.9 \\ \text{Humid} & 0.5 & 0.5 & 0 \\ \text{Sunny} & 0.3 & 0.3 & 0.4 \end{matrix},$$ where each matrix element $P_{ij}$ is the probability the weather on the next day is of the $j$-th weather type given that the current day is of the $i$-th weather type.
Thus, to calculate the weather probabilities $N$ days from the current day, we raise $P$ to the $N$-th power: the element $\left(P^{N}\right)_{ij}$ sums the probabilities of every sequence of intermediate days leading from the $i$-th weather type to the $j$-th weather type. For example, given that the current day is dry ($i = 1$), the probability of it being humid ($j = 2$) in $N=2$ days is $$\left(P^{2}\right)_{12} = P_{11} P_{12} + P_{12} P_{22} + P_{13} P_{32}.$$
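As a quick check of the two-day calculation, the sketch below squares the toy transition matrix in plain Python. The helper mat_mul and the variable names are illustrative and not part of the assignment; numpy.linalg.matrix_power should give the same result if you prefer NumPy.

```python
# Toy transition matrix P; rows and columns are ordered dry, humid, sunny.
P = [
    [0.0, 0.1, 0.9],
    [0.5, 0.5, 0.0],
    [0.3, 0.3, 0.4],
]


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows (illustrative helper)."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Two-step transition probabilities.
P2 = mat_mul(P, P)

# Probability of a humid day two days after a dry day: (P^2)_{12}.
# Python indices start at 0, so this element is P2[0][1].
dry_to_humid_in_two_days = P2[0][1]
print(round(dry_to_humid_in_two_days, 2))  # 0.32
```

The printed value matches the hand calculation $0 \cdot 0.1 + 0.1 \cdot 0.5 + 0.9 \cdot 0.3 = 0.32$.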
Your task: Modified toy weather model
In reality, there is a much broader range of weather conditions than just dry, humid, and sunny. Your task is to create a modified version of the toy model above to account for a slightly broader range of weather conditions.
In the weather.csv file accompanying this homework, each row corresponds to one type of weather in the following order.
- sunny
- cloudy
- rainy
- snowy
- windy
- hailing
Each column gives the probability of one type of weather occurring the following day (also in the order above).
As in the example, the $(i, j)$ element is the probability that the $j$-th weather type occurs after the $i$-th weather type; e.g. the $(1, 2)$ element is the probability that a cloudy day occurs after a sunny day. Take a look at the data and verify that if the current day is sunny ($i=1$) the following day will have a $0.4$ probability of being sunny ($j=1$) as well, whereas if the current day is rainy ($i=3$) the following day has a probability $0.05$ of being windy ($j=5$). (You do not need to submit anything for this step; this is just to check that the data is structured as described.)
Having made sure that you understand how the weather data is stored in weather.csv, complete the following steps to create and demo your Markov chain model.
1. Create a class called Markov that implements the following methods:
- load_data(array): Loads a 2D numpy array and stores it in the instance variable self.data.
  - Hint: You may use numpy.genfromtxt() to load weather.csv.
- get_prob(current_day_weather, next_day_weather): Returns the probability of next_day_weather on the next day given current_day_weather on the current day. current_day_weather and next_day_weather will each be one of the 6 strings describing the weather (sunny, cloudy, rainy, snowy, windy, or hailing).
  - Raise an Exception if current_day_weather or next_day_weather is not one of the 6 strings specified.
The __init__ for this class should initialize self.data as an empty array into which load_data(array) stores array.
Use the following as your Markov class skeleton. Save the code in a file called Markov.py.
class Markov:
    def __init__(...):
        # Your implementation here
    def load_data(...):
        # Your implementation here
    def get_prob(...):
        # Your implementation here
Your Markov class should support the following use case:
weather_today = Markov()
weather_today.load_data(weather) # Where weather is a 2D array read in from a .csv file
print(weather_today.get_prob('sunny', 'cloudy'))
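To illustrate the name-based lookup that get_prob performs, here is a minimal sketch on the three-state toy model from the example rather than on weather.csv. The names TOY_WEATHER, TOY_P, TOY_INDEX, and toy_prob are hypothetical, and raising ValueError (a subclass of Exception) is just one possible choice.

```python
# Toy model only: weather names in the same order as the matrix rows and columns.
TOY_WEATHER = ['dry', 'humid', 'sunny']
TOY_P = [
    [0.0, 0.1, 0.9],
    [0.5, 0.5, 0.0],
    [0.3, 0.3, 0.4],
]

# Map each weather name to its row/column index.
TOY_INDEX = {name: i for i, name in enumerate(TOY_WEATHER)}


def toy_prob(current, following):
    """Return P(following tomorrow | current today) for the toy model."""
    for name in (current, following):
        if name not in TOY_INDEX:
            # ValueError is a subclass of Exception.
            raise ValueError(f'unknown weather type: {name!r}')
    return TOY_P[TOY_INDEX[current]][TOY_INDEX[following]]


print(toy_prob('dry', 'sunny'))  # 0.9
```

A dictionary keeps the lookup independent of how the matrix itself is stored; the same idea carries over to the six weather types in weather.csv.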
2. Demo your Markov class
In a separate file called P1A.py:
- Import the Markov class from Markov.py.
- Parse the provided weather.csv file into a numpy array.
- Demonstrate that your Markov class works by printing the probability that a windy day follows a cloudy day.
Please place weather.csv in the same directory as Markov.py and P1A.py.
Part B: Markov chain iterators [15 pts]
You will complete this part in the file Markov.py.
Iterators are a convenient way to walk along your Markov chain. Implement the __iter__() and __next__() methods for the Markov class you implemented in the preceding part.
The specifications for the __iter__() and __next__() methods are as follows:
- __iter__() should return the iterator object.
- __next__() should return the next day's weather.
  - Each next day's weather should be stochastic. That is, it is randomly selected based on the relative probabilities of the next day's weather types given the current day's weather type.
  - The next day's weather should be returned as a lowercase string (e.g. 'sunny') rather than an index (e.g. 0).
Hint: You may use Python's built-in random.choices function or NumPy's numpy.random.choice function to select the next day's weather based on the relevant relative probabilities.
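The iterator protocol and the weighted sampling can be sketched together on the toy dry/humid/sunny chain. The class ToyWeatherWalk and its attributes are hypothetical and only illustrate the mechanics; your Markov class will be organised differently and uses the six weather types from weather.csv.

```python
import random


class ToyWeatherWalk:
    """Hypothetical iterator over the toy dry/humid/sunny chain."""

    STATES = ['dry', 'humid', 'sunny']
    # Each row lists the next-day probabilities in the order of STATES.
    P = {
        'dry':   [0.0, 0.1, 0.9],
        'humid': [0.5, 0.5, 0.0],
        'sunny': [0.3, 0.3, 0.4],
    }

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # The object acts as its own iterator.
        return self

    def __next__(self):
        # random.choices returns a list of samples (one by default),
        # drawn according to the given relative weights.
        weights = self.P[self.current]
        self.current = random.choices(self.STATES, weights=weights)[0]
        return self.current


random.seed(0)  # seeded only so that the demo is repeatable
walk = ToyWeatherWalk('dry')
week = [next(walk) for _ in range(7)]
print(week)
```

Note that this sketch never raises StopIteration, so a plain for loop over it would not terminate; calling next() a fixed number of times keeps the walk bounded.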
Part C: Prediction using Markov chain [20 pts]
You will complete this part in the files Markov.py and P1C.py.
In this part, we will predict what the weather will be like in a week for seven different cities given each city's current weather.
We can generate predictions for a city's weather in seven days by using the Markov iterator implemented in the previous part.
To do so, make the following modifications to your Markov class:
- Modify __init__ such that it accepts an optional argument day_zero_weather (with default value None), which is the weather of the current day, and stores it as the instance variable day_zero_weather.
- Implement the method _simulate_weather_for_day(day), where day is a non-negative integer representing the number of days from the current day. This method returns the predicted weather on the specified day as a string.
  - Hint: Make sure that your method returns a sensible result (i.e. the current day's weather) for day = 0.
For the purposes of this problem, rather than producing just one simulation per city, we would like to simulate 100 such predictions per city and store the most commonly occurring predictions.
To that end, please implement get_weather_for_day, which uses _simulate_weather_for_day to run the simulation trials times:
- get_weather_for_day(day, trials): Returns a list of strings of length trials, where each element is the predicted weather for one trial. Assign a default value to trials. day is an integer representing how far from the current day we want to predict the weather.
In other words, because the chain is stochastic, a single prediction of the weather a few days from now is not very informative. It makes more sense to predict the weather a few days from now many times over; from that set of predictions, you can get a sense of the most likely weather.
Hint: You **must use the iterator** to complete this part of the problem. You may find it helpful to define new instance variables self._current_day and self._current_day_weather, which keep track of the current day number (starting from 0) and its weather.
Hint: You may also find it helpful to define a helper method to reset self._current_day and self._current_day_weather for each new simulation trial.
In summary:
- Modify the Markov class __init__ method to accept an optional argument day_zero_weather.
- Implement the _simulate_weather_for_day(day) method, which returns the predicted weather on day day.
- Implement the get_weather_for_day(day, trials) method, which returns the predicted weather over trials simulations as a list (of strings).
Finally, in a separate file called P1C.py, use the Markov class to find the most common weather seven days from the current day over 100 trials for each city given the following initial weather conditions for each city:
city_weather = {
    'New York': 'rainy',
    'Chicago': 'snowy',
    'Seattle': 'rainy',
    'Boston': 'hailing',
    'Miami': 'windy',
    'Los Angeles': 'cloudy',
    'San Francisco': 'windy'
}
Print the number of occurrences of each weather condition over the 100 trials for each city, e.g.
New York: {'hailing': 6, 'sunny': 36, ...}
Chicago: {'cloudy': 33, 'snowy': 11, ...}
...
Then, print the most commonly predicted weather for each city in the following format:
Most likely weather in seven days
----------------------------------
New York: cloudy
Chicago: cloudy
...
Note: Don't worry if your values don't seem to make intuitive sense (e.g. rainy in Los Angeles). We made up the probabilities!
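One way to tally the trial results is collections.Counter from the standard library. The sketch below uses a made-up list of ten outcomes for a single city; the variable names are illustrative and the data are not real simulation results.

```python
from collections import Counter

# Made-up outcomes of 10 simulation trials for one city (illustration only).
trial_results = ['cloudy', 'sunny', 'cloudy', 'rainy', 'cloudy',
                 'sunny', 'windy', 'cloudy', 'sunny', 'rainy']

# Count how often each weather type was predicted.
counts = Counter(trial_results)

# most_common(1) returns a list holding one (value, count) pair.
most_likely, occurrences = counts.most_common(1)[0]

print(dict(counts))
print(most_likely, occurrences)  # cloudy 4
```

Converting the Counter to a dict before printing gives output in the same form as the per-city occurrence example above.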
Deliverables summary
In summary, for Problem 1, your deliverables are as follows:
- Markov.py: File containing your Markov class
- P1A.py: Initial demo of the Markov class
- P1C.py: Demo of your seven-city Markov chain weather prediction model
Problem 2: Databases [45 pts]
You will complete this problem in the file P2.py.
In this problem, you will set up an SQL database using the sqlite3 package in Python.
The purpose of the database will be to store parameters and model results related to a simple logistic regression problem.
Rather than keeping the results in numpy arrays as we usually do, the idea here is to make use of an SQL database to store the results so that they can easily be accessed from disk at a later stage (by you or another member of your team).
Part A: Database schema [15 pts]
The design of the database should be flexible enough that the results from different model iterations can be stored in it. It should also be able to handle a different set of features for each model iteration.
Your task:
Create an SQL database called regression.sqlite containing the tables and respective fields shown below (each table name is followed by its fields):
model_params:
- id
- desc
- param_name
- value
model_coefs:
- id
- desc
- feature_name
- value
model_results:
- id
- desc
- train_score
- test_score
Note: Ensure that the datatype of each field makes sense.
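A schema along these lines could be created as sketched below. The column types are assumptions (for instance, value in model_params is stored as TEXT because parameter values come in mixed types), and the sketch uses an in-memory database rather than regression.sqlite. Since desc is also an SQL keyword (as in ORDER BY ... DESC), the sketch double-quotes it as a column name; this is a defensive choice rather than a requirement stated in the assignment.

```python
import sqlite3

# In-memory database for illustration; the assignment uses regression.sqlite.
db = sqlite3.connect(':memory:')
cursor = db.cursor()

# Column types below are assumptions chosen to match the kind of data stored.
# "desc" is double-quoted because DESC is also an SQL keyword.
cursor.executescript('''
    CREATE TABLE model_params (
        id INTEGER,
        "desc" TEXT,
        param_name TEXT,
        value TEXT);
    CREATE TABLE model_coefs (
        id INTEGER,
        "desc" TEXT,
        feature_name TEXT,
        value REAL);
    CREATE TABLE model_results (
        id INTEGER,
        "desc" TEXT,
        train_score REAL,
        test_score REAL);
''')
db.commit()

# List the tables that now exist in the database.
table_names = sorted(row[0] for row in cursor.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"))
print(table_names)  # ['model_coefs', 'model_params', 'model_results']
db.close()
```

No primary key is declared on id here because each model contributes several rows to model_params and model_coefs.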
Part B: Inserting records [15 pts]
In this part, you will populate the database you created in Part A with some records for a few different model iterations and scenarios.
1. Import additional libraries and load data
Add the following imports to your P2.py file:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
Load the data and separate into training and test subsets like so:
# Load data
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target
# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=87)
2. Write a function to save data to the database
In P2.py, write a function save_to_database, which saves the data to the database.
The function should accept the following arguments:
- model_id: Identifier number for the model to save data from.
- model_desc: Description of the model to save data from.
- db: Database to save data to.
- model: A fitted model to save data from.
- X_train, X_test, y_train, y_test: Training and test data.
The X_train, X_test, y_train, y_test inputs should be used to compute test_score and train_score within the save_to_database function.
Assume model is an sklearn.linear_model.LogisticRegression.
You should insert the following model information into the corresponding tables in the database:
- model_params: Values from the get_params method.
- model_coefs: Coefficient and intercept values of the fitted model (see the coef_ and intercept_ attributes in the documentation).
  - Hint: Feature names can be extracted from data via the feature_names attribute.
- model_results: Train and test (validation) accuracy obtained from the score method.
For more details on the methods and attributes listed above, refer to the scikit-learn documentation on logistic regression.
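The insertion step might look roughly like the sketch below, which stores one row per parameter using sqlite3 placeholders. The params dictionary is a small stand-in for what model.get_params() returns, the table layout and the choice to store values as text are assumptions, and an in-memory database replaces regression.sqlite.

```python
import sqlite3

# In-memory database for illustration; the assignment uses regression.sqlite.
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE model_params '
           '(id INTEGER, "desc" TEXT, param_name TEXT, value TEXT)')

# Stand-in for the dictionary returned by model.get_params().
params = {'C': 1.0, 'penalty': 'l2', 'random_state': None}

# One row per parameter; values are stored as text because their types differ.
rows = [(1, 'Baseline model', name, str(value))
        for name, value in params.items()]

# The ? placeholders let sqlite3 handle quoting of the inserted values.
db.executemany('INSERT INTO model_params VALUES (?, ?, ?, ?)', rows)
db.commit()

# Read the rows back to confirm what was stored.
stored = db.execute('SELECT param_name, value FROM model_params '
                    'ORDER BY param_name').fetchall()
print(stored)
db.close()
```

The same executemany pattern applies to model_coefs, with one row per feature name plus one for the intercept.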
3. Baseline logistic regression model
Using the code provided below, insert an entry into the database for a baseline regression model.
# Fit model
baseline_model = LogisticRegression(solver='liblinear')
baseline_model.fit(X_train, y_train)
Use the identifiers model_id = 1 and model_desc = "Baseline model" for this model.
4. Reduced logistic regression model
Now, create a second model using only the features included in the feature_cols list below.
Insert the relevant information into the corresponding tables in the database.
feature_cols = ['mean radius',
                'texture error',
                'worst radius',
                'worst compactness',
                'worst concavity']
Use the identifiers model_id = 2 and model_desc = "Reduced model" for this model.
5. Logistic regression model with L1 penalty
Create one last model using an L1 penalty ($L_1$) term and all of the features.
Insert the relevant information into the corresponding tables in the database; use the identifiers model_id = 3 and model_desc = "L1 penalty model" for this model.
Hint: Refer to the penalty parameter of the LogisticRegression class. You may need to increase the maximum number of iterations from the default value of max_iter = 100 for convergence.
Part C: Database queries [15 pts]
- Query the database to identify the model with the highest validation score. Print the id of the best model and the corresponding test score, like so:
  Best model id: ...
  Best validation score: ...
  where the ... are placeholders for your solution from the database query.
- Print the feature names and the corresponding coefficients of that model, like so:
  feature1: 8.673
  feature2: 0.24
  ...
  where feature1/2 are the feature names followed by the coefficient value.
- Use the coefficients extracted from the best model to reproduce the test score (accuracy) of the best performing model (as stored in the database).
  - Hint: You should be able to achieve this by overwriting the relevant variables in a new LogisticRegression object, i.e. there is no need to write your own formula to generate individual predictions (you are welcome to do this if you want). You will need to run a dummy fit on this object before you are able to manually overwrite the relevant variables.
  - Remarks: This problem demonstrates a simple scenario in which someone with access to your database can easily reproduce your results.
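The first two queries could take roughly the following shape. The sketch runs against an in-memory database filled with made-up scores and coefficients, so the table contents, descriptions, and feature names are purely illustrative.

```python
import sqlite3

# In-memory database with made-up rows; the assignment uses regression.sqlite.
db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE model_results '
           '(id INTEGER, "desc" TEXT, train_score REAL, test_score REAL)')
db.execute('CREATE TABLE model_coefs '
           '(id INTEGER, "desc" TEXT, feature_name TEXT, value REAL)')
db.executemany('INSERT INTO model_results VALUES (?, ?, ?, ?)',
               [(1, 'A', 0.95, 0.91), (2, 'B', 0.93, 0.94)])
db.executemany('INSERT INTO model_coefs VALUES (?, ?, ?, ?)',
               [(1, 'A', 'x1', 0.5),
                (2, 'B', 'x1', 1.5),
                (2, 'B', 'intercept', -0.2)])

# Sort by test score, highest first, and keep only the top row.
best_id, best_score = db.execute(
    'SELECT id, test_score FROM model_results '
    'ORDER BY test_score DESC LIMIT 1').fetchone()

# Fetch the coefficients that belong to the best model.
best_coefs = db.execute(
    'SELECT feature_name, value FROM model_coefs WHERE id = ?',
    (best_id,)).fetchall()
db.close()

print('Best model id:', best_id)
print('Best validation score:', best_score)
for name, value in best_coefs:
    print(f'{name}: {value}')
```

Fetching the results into Python variables before calling db.close() means the printing step no longer needs an open connection.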
Note: Remember to **close the database** when you are done!
Deliverables summary
In summary, for Problem 2, your deliverables are as follows:
- regression.sqlite: Database of logistic regression models.
- P2.py: File containing all the code you have written for Problem 2.