2020-CS107 / AC207 / CSCI E-207


Homework 7

Due Thursday, December 3rd 2020 at 11:59 PM.

Problem 0: Homework Workflow [10 pts]

Problem 1: Weather Prediction Using a Markov Chain [45 pts]

  • Part A. Implement a Markov chain
  • Part B. Markov chain iterators
  • Part C. Prediction using Markov chain
  • Deliverables summary

Problem 2: Databases [45 pts]

  • Part A. Database schema
  • Part B. Inserting records
  • Part C. Database queries
  • Deliverables summary

Problem 0: Homework Workflow [10 pts]

Once you receive HW6 feedback, you will need to merge your HW6-dev branch into master.

You will earn points for following all stages of the git workflow, which involves:

  • 3 pts for merging HW6-dev into master
  • 5 pts for completing HW7 on HW7-dev
  • 2 pts for making a PR on HW7-dev to merge into master

[Figure: sample GitHub submission]

Problem 1: Weather Prediction Using a Markov Chain [45 pts]

Markov chains are widely used to model and predict discrete events. They rest on the Markov assumption: the outcome of a future event depends only on the event immediately preceding it. In this problem, we will use a Markov chain to create a basic model for predicting the weather.

Part A: Implement a Markov chain [10 pts]

You will complete this part in the files Markov.py and P1A.py.

Example: Toy weather model

To begin, consider how weather can be represented as a Markov chain using a toy model in which the weather follows these rules:

  1. A dry day has $0.9$ probability of being followed by a sunny day, and otherwise transitions into a humid day.
  2. A humid day has $0.5$ probability of being followed by a dry day, and otherwise remains humid.
  3. A sunny day has $0.4$ probability of remaining sunny, and otherwise an equal probability of transitioning to a dry day or humid day.

Notice that the preceding rules can be represented as a Markov chain as follows:

[Figure: toy Markov chain diagram]

And a transition matrix $P$ of the form: $$\begin{matrix} & \text{Dry} & \text{Humid} & \text{Sunny} \\ \text{Dry} & 0 & 0.1 & 0.9 \\ \text{Humid} & 0.5 & 0.5 & 0 \\ \text{Sunny} & 0.3 & 0.3 & 0.4 \end{matrix}$$

where each matrix element $P_{ij}$ is the probability the weather on the next day is of the $j$-th weather type given that the current day is of the $i$-th weather type (e.g. $P_{dry, humid} = 0.1$).

Thus, to calculate the weather probabilities $N$ days from the current day, we sum over all possible intermediate weather sequences, which amounts to raising $P$ to the $N$-th power. For example, given that the current day is dry ($i = 1$), the probability of it being humid ($j = 2$) in $N=2$ days is $$\left(P^{2}\right)_{12} = P_{11} P_{12} + P_{12} P_{22} + P_{13} P_{32}.$$
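
As a quick numerical check of the two-day formula, the toy transition matrix can be squared with NumPy. This is only a sketch: the names `states`, `P`, and `two_step` are chosen here for illustration and are not part of the assignment.

```python
import numpy as np

# Toy transition matrix; rows and columns are ordered dry, humid, sunny.
states = ["dry", "humid", "sunny"]
P = np.array([
    [0.0, 0.1, 0.9],
    [0.5, 0.5, 0.0],
    [0.3, 0.3, 0.4],
])

# Each row is a probability distribution, so it must sum to 1.
assert np.allclose(P.sum(axis=1), 1.0)

# Two-step transition probabilities are the entries of P squared.
two_step = np.linalg.matrix_power(P, 2)

# Probability of a humid day two days after a dry day:
# 0*0.1 + 0.1*0.5 + 0.9*0.3 = 0.32
dry, humid = states.index("dry"), states.index("humid")
print(round(two_step[dry, humid], 2))  # 0.32
```

Note that Part C of this homework asks for simulation with an iterator rather than matrix powers, so this calculation is useful mainly as a sanity check.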

Your task: Modified toy weather model

In reality, there is a much broader range of weather conditions than just dry, humid, and sunny. Your task is to create a modified version of the toy model above to account for a slightly broader range of weather conditions.

In the weather.csv file accompanying this homework, each row corresponds to one type of weather in the following order.

  1. sunny
  2. cloudy
  3. rainy
  4. snowy
  5. windy
  6. hailing

Each column gives the probability of one type of weather occurring the following day (also in the order above).

As in the example, the $(i, j)$ element is the probability that the $j$-th weather type occurs after the $i$-th weather type; e.g. the $(1, 2)$ element is the probability that a cloudy day occurs after a sunny day. Take a look at the data and verify that if the current day is sunny ($i=1$) the following day will have a $0.4$ probability of being sunny ($j=1$) as well, whereas if the current day is rainy ($i=3$) the following day has a probability $0.05$ of being windy ($j=5$). (You do not need to submit anything for this step; this is just to check that the data is structured as described.)

Having made sure that you understand how the weather data is stored in weather.csv, complete the following steps to create and demo your Markov chain model.

1. Create a class called Markov that implements the following methods

  • load_data(file_path='./weather.csv'): Loads data from a file and stores it as a 2D numpy array in the instance variable self.data. The default argument should be weather.csv in the current directory.
    • Hint: You may use numpy.genfromtxt() to load weather.csv.
  • get_prob(current_day_weather, next_day_weather): Returns the probability of next_day_weather on the next day given current_day_weather on the current day.
    • current_day_weather and next_day_weather will be one of the 6 strings describing the weather (sunny, cloudy, rainy, snowy, windy, or hailing).
    • Raise an Exception if current_day_weather or next_day_weather is not one of the 6 strings specified. Make sure these strings are in lowercase. You can choose to either raise an exception for uppercase inputs, or convert them automatically.
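
To illustrate the numpy.genfromtxt() hint, the sketch below reads a small comma-separated table from an in-memory string instead of a file. The two-by-two data, the name `fake_csv`, and the comma delimiter are assumptions made for this example; check the actual layout of weather.csv before relying on them.

```python
import io
import numpy as np

# Stand-in for a CSV file: two rows of comma-separated probabilities.
# (Assumption: comma delimiter and no header row.)
fake_csv = io.StringIO("0.7,0.3\n0.2,0.8\n")

# genfromtxt accepts a file path or a file-like object.
matrix = np.genfromtxt(fake_csv, delimiter=",")
print(matrix.shape)  # (2, 2)
```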

The __init__ for this class should initialize self.data as an empty array, which load_data(file_path='./weather.csv') later fills with the loaded array.

Use the following as your Markov class skeleton. Save the code in a file called Markov.py.

class Markov:
    def __init__(self):  # You will need to modify this header line later in Part C
        self.data = ...  # Placeholder: initialize as an empty array
        # Your implementation here

    def load_data(self, file_path='./weather.csv'):
        # Your implementation here
        pass

    def get_prob(self, current_day_weather, next_day_weather):
        # Your implementation here
        pass

2. Demo your Markov class

In a separate file called P1A.py:

  1. Import the Markov class from Markov.py.
  2. Load the provided weather.csv file into a numpy array.
  3. Demonstrate that your Markov class works by printing the probability that a windy day follows a cloudy day.

Please place weather.csv in the same directory as Markov.py and P1A.py.

An example use of the Markov class:

weather_today = Markov()
weather_today.load_data(file_path='./weather.csv')
print(weather_today.get_prob('sunny', 'cloudy')) # This line should print 0.3

Note: For convenience, you may add a mapping between the weather strings and their row/column indices, e.g. 'sunny': 0, 'cloudy': 1, and so on.
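
One way such a mapping could be used is sketched below on the three-state toy model from the example, not on the six-state assignment data. The names `toy_index`, `toy_matrix`, and `toy_lookup` are hypothetical, and converting inputs to lowercase is just one of the two behaviours the specification allows.

```python
import numpy as np

# Toy three-state example, not the six-state assignment data.
toy_index = {"dry": 0, "humid": 1, "sunny": 2}
toy_matrix = np.array([
    [0.0, 0.1, 0.9],
    [0.5, 0.5, 0.0],
    [0.3, 0.3, 0.4],
])

def toy_lookup(current, following):
    """Return P(following | current) for the toy chain (hypothetical helper)."""
    for name in (current, following):
        # Reject non-strings and unknown names with a clear error.
        if not isinstance(name, str) or name.lower() not in toy_index:
            raise ValueError(f"Unknown weather type: {name!r}")
    return toy_matrix[toy_index[current.lower()], toy_index[following.lower()]]

print(toy_lookup("dry", "Humid"))  # 0.1
```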

Part B: Markov chain iterators [15 pts]

You will complete this part in the file Markov.py.

Iterators are a convenient way to walk along your Markov chain. Create a MarkovIterator class and implement the __init__(), __iter__(), and __next__() methods. Then, go back to the Markov class you implemented in the preceding part and implement an __iter__() method that returns a MarkovIterator object. The specifications are as follows:

  • For the Markov class
    • __iter__() should call MarkovIterator() and return the iterator object
      • The exact syntax of calling MarkovIterator() will be determined by how you implement MarkovIterator.

Hint: self.get_prob() might come in handy here.

  • For the MarkovIterator class
    • __init__() should have some useful attributes that would help you to implement the __next__() method below
    • __iter__() should return self
    • __next__() should return the next day's weather
      • Each next day's weather should be stochastic. That is, it is randomly selected based on the relative probabilities of the next day's weather types given the current day's weather type. The next day's weather should be returned as a lowercase string (e.g. 'sunny') rather than an index (e.g. 0).
      • Once the $n$-th day's weather is sampled, it determines the probability distribution for the $(n+1)$-th day's weather. An instance of MarkovIterator should therefore keep track of the current day's predicted weather, the corresponding probability distribution, or both.

Hint: You may use Python's built-in random.choices function or NumPy's numpy.random.choice function to select the next day's weather based on the relevant relative probabilities. Think of what the function of your choice takes as an argument, and that will help you decide what attributes MarkovIterator.__init__() needs.
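
The iterator protocol and weighted sampling described above can be sketched on the toy three-state model. This is an illustration under assumptions, not the required MarkovIterator: the class name `ToyWeatherIterator`, its constructor arguments, and the dictionary-based transition table are all invented for this example.

```python
import random

class ToyWeatherIterator:
    """Random walk over a toy three-state chain (illustrative only)."""

    def __init__(self, transitions, start):
        # transitions maps each state to a dict {next_state: probability}.
        self.transitions = transitions
        self.current = start

    def __iter__(self):
        return self

    def __next__(self):
        options = self.transitions[self.current]
        # random.choices returns a list, so take its single element.
        self.current = random.choices(
            list(options), weights=list(options.values()), k=1
        )[0]
        return self.current

toy_transitions = {
    "dry": {"dry": 0.0, "humid": 0.1, "sunny": 0.9},
    "humid": {"dry": 0.5, "humid": 0.5, "sunny": 0.0},
    "sunny": {"dry": 0.3, "humid": 0.3, "sunny": 0.4},
}

walker = ToyWeatherIterator(toy_transitions, start="dry")
print([next(walker) for _ in range(5)])  # five random states; output varies
```

This iterator never raises StopIteration, so it should be advanced with next() a fixed number of times rather than consumed in an unbounded for loop.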

Part C: Prediction using Markov chain [20 pts]

You will complete this part in the files Markov.py and P1C.py.

In this part, we will predict what the weather will be like in a week for five different cities given each city's current weather.

We can generate predictions for a city's weather in seven days by using the Markov iterator implemented in the previous part. To do so, make the following modifications to your Markov class:

  • Modify __init__ such that it accepts an optional argument day_zero_weather (with default value None) which is the weather of the current day, and stores it as the instance variable day_zero_weather.
  • Implement the method _simulate_weather_for_day(day), where day is a non-negative integer representing the number of days from the current day. This method returns the predicted weather as a string on the specified day.
    • Hint: Make sure that your method returns a sensible result (i.e. the current day's weather) for day = 0.

For the purposes of this problem, rather than producing just one simulation per city, we would like to simulate 100 such predictions per city and store the most commonly occurring predictions. To that end, please implement get_weather_for_day, which uses _simulate_weather_for_day to run the simulation trials times:

  • get_weather_for_day(day, trials): Returns a list of strings of length trials where each element is the predicted weather for one trial. Assign a default value to trials. day is an integer representing how far from the current day we want to predict the weather. For example, if day=3, then we want to predict the weather on day 3.

In other words, because the chain is stochastic, a single simulation is not a reliable prediction. It makes more sense to simulate the weather a few days from now many times; from that sample, you can make a most likely prediction of the weather a few days away from day 0.

Hint: You must use the iterator to complete this part of the problem. You may find it helpful to define new instance variables self._current_day and self._current_day_weather, which keep track of the current day number (starting from 0) and its weather. This means that you need to make use of methods (next and/or iter) you implemented in Part B to simulate the weather of a given day instead of raising the transition matrix to a certain power.

Hint: You may also find it helpful to define a helper method to reset the self._current_day and self._current_day_weather for each new simulation trial.
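
The idea of stepping an iterator a fixed number of times and repeating that over many trials can be sketched with a simplified two-state walk. Everything here is a stand-in: `toy_walk`, `state_after`, the states "wet" and "dry", and the probabilities are made up, and the sketch creates a fresh walk per trial instead of resetting instance variables as the hints suggest.

```python
import itertools
import random

def toy_walk(start):
    """Infinite generator over two states (hypothetical stand-in)."""
    state = start
    while True:
        stay = 0.8 if state == "wet" else 0.6  # made-up probabilities
        if random.random() >= stay:
            state = "dry" if state == "wet" else "wet"
        yield state

def state_after(start, day):
    """Advance a fresh walk `day` steps and return the state reached."""
    if day == 0:
        return start  # day 0 is the starting state itself
    steps = itertools.islice(toy_walk(start), day)
    *_, last = steps  # keep only the final state
    return last

results = [state_after("wet", 7) for _ in range(100)]
print(len(results))  # 100
```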

In summary:

  • Modify the Markov class __init__ method to accept an optional argument day_zero_weather.
  • Implement _simulate_weather_for_day(day) method, which returns predicted weather on day day.
  • Implement get_weather_for_day(day, trials) method, which returns the predicted weather over trials simulations as a list (of strings).

Finally, in a separate file called P1C.py, use the Markov class to find the most common weather seven days from the current day over 100 trials for each city given the following initial weather conditions for each city:

city_weather = {
    'New York': 'rainy',
    'Chicago': 'snowy',
    'Seattle': 'rainy',
    'Boston': 'hailing',
    'Miami': 'windy',
    'Los Angeles': 'cloudy',
    'San Francisco': 'windy'
}

Print the number of occurrences of each weather condition over the 100 trials for each city, e.g.

New York: {'hailing': 6, 'sunny': 36, ...}
Chicago: {'cloudy': 33, 'snowy': 11, ...}
...

Then, print the most commonly predicted weather for each city in the following format:

Most likely weather in seven days
----------------------------------
New York: cloudy
Chicago: cloudy
...

Note:

  • Don't worry if your values don't seem to make intuitive sense (e.g. rainy in Los Angeles). We made up the probabilities!
  • If there is more than one most common weather condition occurring with equal frequency over the 100 trials, you may arbitrarily choose one of the weather conditions to return.
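
Counting occurrences and picking the most frequent value can be done with collections.Counter from the standard library. The list `sample_predictions` below is made-up data standing in for the output of one city's trials.

```python
from collections import Counter

# Made-up predictions standing in for the output of one city's trials.
sample_predictions = ["cloudy", "sunny", "cloudy", "rainy", "sunny", "cloudy"]

counts = Counter(sample_predictions)
print(dict(counts))  # {'cloudy': 3, 'sunny': 2, 'rainy': 1}

# most_common(1) returns a list with one (value, count) pair.
most_likely, occurrences = counts.most_common(1)[0]
print(most_likely)  # cloudy
```

When several values tie for the highest count, most_common returns them in the order they were first encountered, which is consistent with the note that any of the tied conditions may be chosen.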

Deliverables summary

In summary, for Problem 1, your deliverables are as follows:

  • Markov.py: File containing your Markov class
  • P1A.py: Initial demo of the Markov class
  • P1C.py: Demo of your seven-city Markov chain weather prediction model

Problem 2: Databases [45 pts]

You will complete this problem in file P2.py.

In this problem, you will set up an SQL database using the sqlite3 package in Python. The purpose of the database will be to store parameters and model results related to a simple logistic regression problem. Rather than keeping the results in numpy arrays as we usually do, the idea here is to make use of a SQL database to store the results so that it can easily be accessed from disk at a later stage (by you or another member of your team).

Part A: Database schema [15 pts]

The design of the database should be flexible enough so that the results from different model iterations can be stored in the database. It should also be able to deal with a different set of features by model iteration.

Your task:

Create an SQL database called regression.sqlite containing the tables and respective fields listed below (each table name is followed by its fields):

model_params:

  • id
  • desc
  • param_name
  • value

model_coefs:

  • id
  • desc
  • feature_name
  • value

model_results:

  • id
  • desc
  • train_score
  • test_score

Note: desc is a short description of the model. See Part B, question 3 for an example.

Note: Ensure that the datatype of each field makes sense.
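
As a rough illustration of creating a typed table with sqlite3, the sketch below builds a hypothetical table `demo_scores` in an in-memory database. It is deliberately not one of the three required tables, and the column types are assumptions for this example only. Because DESC is also an SQL keyword, the column name is written in double quotes here as a defensive choice.

```python
import sqlite3

# In-memory database so the sketch leaves no file behind.
demo_db = sqlite3.connect(":memory:")
demo_cursor = demo_db.cursor()

# Hypothetical table, not one of the three required tables.
# "desc" is quoted because DESC is also an SQL keyword.
demo_cursor.execute("""
    CREATE TABLE demo_scores (
        id INTEGER NOT NULL,
        "desc" TEXT,
        score REAL
    )
""")
demo_db.commit()

# PRAGMA table_info lists one row per column: (cid, name, type, ...).
columns = [
    (row[1], row[2])
    for row in demo_cursor.execute("PRAGMA table_info(demo_scores)")
]
print(columns)  # [('id', 'INTEGER'), ('desc', 'TEXT'), ('score', 'REAL')]
demo_db.close()
```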

Part B: Inserting records [15 pts]

In this part, you will populate the database you created in Part A with some records for a few different model iterations and scenarios.

1. Import additional libraries and load data

Add the following imports to your P2.py file:

import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer

Load the data and separate into training and test subsets like so:

# Load data
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# Split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=87)

2. Write a function to save data to the database

In P2.py, write a function save_to_database, which saves the data to the database. The function should accept the following arguments:

  • model_id: Identifier number for the model to save data from.
  • model_desc: Description of the model to save data from.
  • db: Database to save data to.
  • model: A fitted model to save data from.
  • X_train, X_test, y_train, y_test: Training and test data.

The X_train, X_test, y_train, y_test inputs should be used to compute test_score and train_score within the save_to_database function.

Assume model is an sklearn.linear_model.LogisticRegression. Your function should be able to insert the following model information into the corresponding tables in the database:

  • model_params: Values from the get_params method.
  • model_coefs: Coefficient and intercept values of the fitted model (see coef_ and intercept_ attributes in the documentation).

    • Hint: Feature names can be extracted from data via the feature_names attribute.
  • model_results: Train and test accuracy obtained from the score method (stored as train_score and test_score).

For more details on the methods and attributes listed above, refer to the scikit-learn documentation on logistic regression.
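
Inserting several rows at once is commonly done with executemany and `?` placeholders, which lets sqlite3 handle quoting of the values. The sketch below uses a hypothetical table `demo_params` and a made-up dictionary `fake_params` in place of real model parameters; storing every value as text is one possible design choice for mixed-type values, not a requirement of the assignment.

```python
import sqlite3

demo_db = sqlite3.connect(":memory:")
demo_cursor = demo_db.cursor()
demo_cursor.execute(
    'CREATE TABLE demo_params (id INTEGER, "desc" TEXT, param_name TEXT, value TEXT)'
)

# Stand-in for a parameter dictionary; values have mixed types.
fake_params = {"C": 1.0, "penalty": "l2", "fit_intercept": True}

# Store every value as text so mixed types fit in one column
# (one possible design choice, not a requirement).
rows = [(1, "Demo model", name, str(value)) for name, value in fake_params.items()]
demo_cursor.executemany("INSERT INTO demo_params VALUES (?, ?, ?, ?)", rows)
demo_db.commit()

stored = demo_cursor.execute(
    "SELECT param_name, value FROM demo_params"
).fetchall()
print(stored)
demo_db.close()
```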

3. Baseline logistic regression model

Using the code provided below, insert an entry into the database for a baseline regression model.

# Fit model
baseline_model = LogisticRegression(solver='liblinear')
baseline_model.fit(X_train, y_train)

Use the identifiers model_id = 1 and model_desc = "Baseline model" for this model:

save_to_database(1, 'Baseline model', db, baseline_model, X_train, X_test, y_train, y_test)

4. Reduced logistic regression model

We want to add another model to our database. Create a second model using only the features included in the feature_cols list below.

feature_cols = ['mean radius', 
                'texture error',
                'worst radius',
                'worst compactness',
                'worst concavity']

X_train_reduced = X_train[feature_cols]
X_test_reduced = X_test[feature_cols]

reduced_model = LogisticRegression(solver='liblinear')
reduced_model.fit(X_train_reduced, y_train)

Insert the relevant information into the corresponding tables in the database. Use the identifiers model_id = 2 and model_desc = "Reduced model" for this model.

5. Logistic regression model with L1 penalty

Create one last model using an L1-penalty ($L_1$) term and all of the features. Insert the relevant information into the corresponding tables in the database; use the identifiers model_id = 3 and model_desc = "L1 penalty model" for this model.

penalized_model = LogisticRegression(solver='liblinear', penalty='l1', random_state=87, max_iter=150)
penalized_model.fit(X_train, y_train)

Hint: Refer to the penalty parameter of the LogisticRegression class. You may need to increase the maximum number of iterations from the default value of max_iter = 100 for convergence.

Part C: Database queries [15 pts]

Query the database to identify the model with the highest validation score.

  • Print the id of the best model and the corresponding test score, like so:

    Best model id: ...
    Best validation score: ...

    where the ... are placeholders for your solution from the database query.

  • Print the feature names and the corresponding coefficients of that model, like so:
    feature1: 8.673
    feature2: 0.24
    ...
    where feature1/2 are the feature names followed by the coefficient value.
  • Use the coefficients extracted from the best model to reproduce the test score (accuracy) of the best performing model (as stored in the database).
    • Hint: You should be able to achieve this by overwriting the relevant variables in a new LogisticRegression object, i.e. there is no need to write your own formula to generate individual predictions (you are welcome to do this if you want). You will need to run a dummy fit on this object before you are able to manually overwrite the relevant variables.
    • Remarks: This problem demonstrates a simple scenario in which someone with access to your database can easily reproduce your results.

The following code accomplishes this. You need to determine coef and intercept from your database query.

test_model = LogisticRegression(solver='liblinear')
test_model.fit(X_train, y_train)

# Manually change fit parameters
test_model.coef_ = np.array([coef])
test_model.intercept_ = np.array([intercept])

test_score = test_model.score(X_test, y_test)
print(f'Reproduced best validation score: {test_score}')
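
For readers curious about the optional route of writing the prediction formula by hand, a binary logistic regression predicts class 1 when the linear score $x \cdot w + b$ is positive, which is equivalent to a predicted probability above 0.5. The sketch below uses made-up data and coefficients (`X_demo`, `y_demo`, `coef_demo`, `intercept_demo` are all invented for this example) and only NumPy.

```python
import numpy as np

# Made-up data: 4 samples, 2 features, binary labels.
X_demo = np.array([[1.0, 2.0], [2.0, 0.5], [-1.0, -1.0], [2.0, -1.0]])
y_demo = np.array([1, 1, 0, 0])

# Made-up coefficients and intercept
# (in the homework these would come from the database).
coef_demo = np.array([0.8, 0.6])
intercept_demo = -0.1

# Linear score; a positive score means predicted probability above 0.5.
scores = X_demo @ coef_demo + intercept_demo
predictions = (scores > 0).astype(int)

# Accuracy is the fraction of predictions that match the labels.
accuracy = np.mean(predictions == y_demo)
print(accuracy)  # 0.75
```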

Note: Remember to close the database when you are done!
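
A query for the highest score, followed by closing the connection, might look like the sketch below. The table `demo_scores` and its three rows are made up for illustration; the real query should run against the tables defined in Part A.

```python
import sqlite3

demo_db = sqlite3.connect(":memory:")
demo_cursor = demo_db.cursor()
demo_cursor.execute("CREATE TABLE demo_scores (id INTEGER, test_score REAL)")
demo_cursor.executemany(
    "INSERT INTO demo_scores VALUES (?, ?)",
    [(1, 0.91), (2, 0.95), (3, 0.93)],  # made-up scores
)

# Sort by score, highest first, and keep only the top row.
best_id, best_score = demo_cursor.execute(
    "SELECT id, test_score FROM demo_scores ORDER BY test_score DESC LIMIT 1"
).fetchone()
print(f"Best model id: {best_id}")
print(f"Best validation score: {best_score}")

demo_db.commit()
demo_db.close()  # release the connection once finished
```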

Deliverables summary

In summary, for Problem 2, your deliverables are as follows:

  • regression.sqlite: Database of logistic regression models.
  • P2.py: File containing all the code you have written for Problem 2.

Congratulations! You have completed the final homework for this course!

Copyright 2018 © Institute for Applied Computational Science