Title

Exercise: Setting up a custom environment

Description

The aim of this exercise is to learn how to set up a custom environment using OpenAI Gym. OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms. It supports teaching agents everything from walking to playing games like Pinball.

As we have discussed in class, the environment will consist of a state, reward and action. For our custom environment, we would like to implement the mouse grid we saw in the lectures. The possible rewards and the next state, given the current state and action, are defined in the helper file.

The overall reward pattern is as follows: every move to a new state earns a +1 reward; a terminal state gives -10 and the mouse has to start from state 1 again; every action leading back to a previous state is penalized by 1; and reaching the goal, i.e. state 16, earns a +100 reward.
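
As a rough sketch only (the authoritative logic lives in the helper file's reward_rules function, and the terminal trap states used below are hypothetical placeholders), this reward pattern could look something like:

# Illustrative sketch only -- the real rules are defined in helper.reward_rules.
# trap_states is a hypothetical placeholder, not the actual grid layout.
def sketch_reward(state, prev_state, trap_states=frozenset({6, 11})):
    if state == 16:              # reaching the goal
        return 100
    if state in trap_states:     # landing on a terminal (trap) state
        return -10
    if state < prev_state:       # moving back to a previous state
        return -1
    return 1                     # moving forward to a new state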

  • Create a class MouseGrid that inherits from OpenAI Gym's Env. This class will be your environment.
  • Within the class constructor __init__, initialize the observation_space and action_space.
    • The observation space is the set of all possible state values, of which there are 16 in our case.
    • The action space is the set of all possible actions one can take in the environment: 1 indicates up, 2 indicates left, 3 indicates right and 4 indicates down.
  • Set the initial state to be 1 and the reward to be zero.
  • Define a method step that takes an action and returns the reward based on the instructions given in the helper file.
  • Define a method reset that resets the class variables to their initial values.
  • Select the number of episodes.
  • Create an instance of the MouseGrid environment.
  • Loop over the total number of episodes.
    • Sample an action from the set of possible actions and call the step method.
    • At the end of each episode, print the total reward of that episode.

NOTE - Remember that the reward can be negative as well. It depends on how the reward system is defined within the environment.

In this exercise, we are using a random policy: we randomly sample from the set of possible actions defined within the environment.

Hints:

Discrete(): The Discrete space allows a fixed range of non-negative numbers.

Env.action_space.sample(): Selects a value randomly from the action space defined in the environment.
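
For example, a minimal sketch of these two hints, assuming gym's Discrete space is what you use here. Note that Discrete(4) samples the values 0-3, so you may need to map or offset the sampled value onto the actions numbered 1-4, depending on what the helper functions expect.

from gym.spaces import Discrete

space = Discrete(4)       # a discrete space containing the values 0, 1, 2, 3
action = space.sample()   # picks one of those values uniformly at random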

In [1]:
# Import necessary libraries
import gym
import random
import numpy as np
from gym import Env
from gym.spaces import Discrete
from helper import transition_rules, reward_rules
In [62]:
# *** Refer to the Hints in the Description before you start ***

# Class MouseGrid that inherits from OpenAI Gym Env
class MouseGrid(Env):

  # Constructor that is called when an instance of the class is created
  def __init__(self):

    # Set the observation_space i.e. the states to 16 discrete values
    self.observation_space = ___

    # Set the action space to 4 discrete values i.e. up, left, right and down
    # Action 1: Up
    # Action 2: Left
    # Action 3: Right
    # Action 4: Down
    self.action_space = ___

    # Set the initial state of the mouse to 1
    self.state = ___

    # Set the initial reward as 0
    self.reward = ___

  # Define a function step that inputs an action and updates the reward
  def step(self, action):

    # Update the action to the action sent as the parameter of this function
    self.action = ___

    # Set a variable prev_state as the current state of the environment
    prev_state = ___

    # Update the state based on the current state and action by 
    # calling the transition_rules function defined in the helper
    # file with the current state and action
    self.state = ___

    # Update the reward by calling the function reward_rules defined in 
    # the helper file by passing the current state and previous state
    # Remember that the reward is cumulative
    self.reward = ___

    # If the current state is 16, our mouse has reached the goal
    # In that case, we call the reward_rules function again
    # Set done as True
    if self.state==16:
      self.reward = self.reward + reward_rules(self.state, prev_state)
      done = True

    # Else set done as false
    else:
      done = False

    # Return the state, reward and done to indicate whether an episode is complete
    return self.state, self.reward, done


  # The reset function which is called at the end of each episode
  def reset(self):

    # Reset the initial state to the start point
    self.state = ___

    # Reset the reward to 0
    self.reward = ___

    # Set done as False
    done = False

SET UP THE AGENT TO TEST THE ENVIRONMENT

In [0]:
### edTest(test_chow0) ###
# Create an instance of the custom environment
env = ___

# Set the maximum number of episodes to any integer between 10 and 50
episodes = ___

# Loop over all the episodes
for i in range(episodes):

  # Set done as False to run until the end of the episode
  done = False

  # Loop over the entire episode
  while done!=True:

    # Sample an action from the action_space of the environment
    action = ___

    # Call the step function within the environment
    state, reward, done = ___

  # Call the reset function at the end of an episode
  env.reset()

  # Print the reward at the end of each episode
  print("The reward of this episode is:",reward)

⏸ Here we are using random sampling to pick an action for a given state. However, if you had a policy, which part of the exercise would you change to incorporate it?

A. The step() method of the MouseGrid class.

B. self.action within the __init__() method of the MouseGrid class.

C. Getting an action for each step within the for loop over all episodes.

D. Call to the step() method in the last cell.

In [0]:
### edTest(test_chow1) ###
# Submit an answer choice as a string below (e.g. if you choose option A, put 'A')
answer1 = '___'

⏸ Which of the following would be an issue if the reset() method is not called at the end of each episode?

A. The next episode will not run as the state is 16.

B. The action sampled next will be biased on the previous value.

C. The reward of the new episode will be combined with the reward of the previous episode.

D. The reset() method does not affect anything.

In [0]:
### edTest(test_chow2) ###
# Submit an answer choice as a string below 
# There can be multiple correct answers. Separate the options with a hyphen
# For example if you think the correct choice is A and D, then type 'A-D'
answer2 = '___'