Title

🆓 Exercise: Finding the Optimal Policy

Description

The aim of this exercise is to find the optimal policy that gives the maximum reward in a given environment. For this, we will be using a pre-defined environment from OpenAI Gym called FrozenLake-v0. There are many environments defined by OpenAI Gym, which you can see here.

Environment Description:

Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc.

The agent controls the movement of a character in a grid world. Some tiles of the grid are walkable, and others lead to the agent falling into the water. Additionally, the movement direction of the agent is uncertain and only partially depends on the chosen direction. The agent is rewarded for finding a walkable path to a goal tile.

The surface is described using a 4x4 grid like the following, where S is the start, F is a frozen (walkable) tile, H is a hole, and G is the goal:

SFFF
FHFH
FFFH
HFFG

Possible actions are Left (0), Down (1), Right (2), and Up (3).

NOTE - Here we are slightly altering the value iteration algorithm. Instead of computing the optimal value function first, we compute the optimal policy and then find the value function associated with it.

Instructions

  • Initialize an environment using a pre-defined environment from OpenAI Gym.
  • Set parameters gamma and theta.
  • Define a function policy_improvement that returns the action which takes us to a higher valued state. This function updates the policy to the optimal policy.
  • Define a function policy_evaluation that updates the state value of the environment given a policy.
  • Define a function value_iteration that calls the above-defined functions to get the optimal policy, the action sequence governed by that policy, and the state value function.
  • Now test the policy by checking in how many of 100 episodes (each with a fixed number of steps) the agent reaches the final goal.
    • First, try the environment with a random policy, by taking random actions at each state.
    • Next, take actions based on the optimal policy.

Hints

Equation to compute the action value (used for both the q-values and the state-value update):

$$q(s,a)= \sum_{s',r} p(s',r\ |\ s,a)\ [r+\gamma\, v(s')]$$

Equation to compute Delta:

$$\Delta\gets\max(\Delta,\ |v-V(s)|)$$
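
For orientation, here is a minimal sketch of how these two equations typically look in code. The helper name action_value and its variable names are our own illustration (not part of the exercise), assuming transitions is a list of transition tuples like those returned by env.env.P[s][a] and V is a NumPy array of state values:

def action_value(transitions, V, gamma):
    # Accumulate p(s', r | s, a) * [r + gamma * v(s')] over every possible outcome of (s, a)
    q = 0.0
    for p, s_, r, _ in transitions:
        q += p * (r + gamma * V[int(s_)])
    return q

Inside a sweep over the states, the Delta update is then typically a single line such as delta = max(delta, abs(v - V[s])), where v holds the value of state s before the update.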

gym.make(environment_name) : Accesses a pre-defined environment.

env.action_space.n : Returns the number of discrete actions.

env.observation_space.n : Returns the number of discrete states.

np.zeros() : Returns a new array of the given shape and type, filled with zeros.

env.env.P[s][a] : Returns the probability of reaching the successor state (s') and its reward (r).

np.argmax() : Returns the indices of the maximum values along an axis.

max() : Returns the largest item in an iterable or the largest of two or more arguments.

abs() : Returns the absolute value of a number.
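
As a quick, hedged illustration of the API calls listed above (assuming the classic OpenAI Gym interface, in which FrozenLake-v0 exposes its transition model through env.env.P):

import gym
import numpy as np

env = gym.make('FrozenLake-v0')
print(env.action_space.n)            # number of discrete actions (4)
print(env.observation_space.n)       # number of discrete states (16)
print(env.env.P[0][0])               # list of transition tuples for taking action 0 in state 0
print(np.zeros(env.action_space.n))  # array of zeros with one entry per action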

In [16]:
# Import necessary libraries
import gym
import numpy as np
from helper import test_policy
In [17]:
# Initializing an environment using a pre-defined environment from OpenAI Gym 
# The environment used here is 'FrozenLake-v0'
env = ___

# Setting the initial parameters required for value iteration

# Set the discount factor to a value between 0 and 1
gamma = ___

# Theta indicates the threshold determining the accuracy of the iteration
# Set it to a value lower than 1e-3
theta = ___
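
If you are unsure where to start, one plausible way to complete this cell is shown below. These particular values are only a suggestion (any gamma in (0, 1) and any theta below 1e-3 satisfy the instructions), and the snippet relies on the imports in the cell above:

env = gym.make('FrozenLake-v0')   # the pre-defined environment named in the description
gamma = 0.99                      # illustrative discount factor between 0 and 1
theta = 1e-4                      # illustrative threshold below 1e-3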

⏸ How does theta affect the policy evaluation and value iteration algorithms?

A. A large theta would cause the random policy to converge to the optimal policy much faster.

B. Theta does not directly or indirectly affect finding the optimal policy.

C. A large theta would cause policy evaluation to speed up but would slow down value iteration.

D. A large theta would result in an optimal policy far from the true optimal policy.

In [0]:
### edTest(test_chow1) ###
# Submit an answer choice as a string below (eg. if you choose option A, put 'A')
answer1 = '___'

POLICY IMPROVEMENT


In [18]:
# Function that returns the action which takes us to a higher valued state
# It takes as input the environment, state-value function, policy, 
# action, current state and the discount rate
def policy_improvement(env, V, pi, action, s, gamma):

    # Initialize a numpy array of zeros with the same size as the 
    # environment's action space
    action_temp = ___

    # Loop over the size of the environment's action space i.e.
    # Iterate for every possible action
    for a in range(env.action_space.n):

        # Set the q value to 0
        q = 0

        # From the environment, get the transition information P[s][a]
        # This will return the probability of reaching the successor state (s') and its reward (r)
        P = np.array(___)
        
        # Iterate over the possible states
        for i in range(len(P)):

            # Get the possible successor state
            s_= int(P[i][1])

            # Get the transition Probability P(s'|s,a) 
            p = P[i][0]
            
            # Get the reward
            r = P[i][2]
            
            # Compute the action value q(s,a) based on the equation 
            # provided in the hints
            q += ___           

            # Save the q-value of taking a particular action into the 
            # action_temp array 
            action_temp[a] = q
            
    # Get the action from action_temp that has the highest q-value 
    m = ___ 

    # For each state, set the action which gives the highest q-value
    action[s] = m
    
    # Update the policy by setting the probability of the action which 
    # gives the highest q-value for this state to 1
    pi[s][m] = 1

    # Return the updated policy
    return pi
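
For reference, here is a hedged, condensed sketch of how policy_improvement might look once its blanks are filled in. It is one possible completion rather than the graded solution, and it assumes numpy is imported as np as in the cells above:

def policy_improvement(env, V, pi, action, s, gamma):
    # One q-value slot per possible action
    action_temp = np.zeros(env.action_space.n)

    for a in range(env.action_space.n):
        q = 0
        # Transition tuples for taking action a in state s
        P = np.array(env.env.P[s][a])
        for i in range(len(P)):
            s_ = int(P[i][1])   # successor state
            p = P[i][0]         # transition probability
            r = P[i][2]         # reward
            # q(s, a) += p(s', r | s, a) * [r + gamma * v(s')]
            q += p * (r + gamma * V[s_])
        action_temp[a] = q

    # Greedy action: the one with the highest q-value
    m = np.argmax(action_temp)
    action[s] = m
    pi[s][m] = 1
    return pi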

POLICY EVALUATION




NOTE - In standard value iteration the state value is updated with a max over actions, $v(s)\gets\max_a \sum_{s',r} p(s',r\ |\ s,a)\ [r+\gamma\, v(s')]$. We are not computing that max here because policy_improvement already gives us the optimal policy. Instead, we simply multiply by the policy, which keeps the value of the best action and makes the others zero.



In [19]:
# Define a function to update the state value by taking the environment,
# the current state-value function, the current state and the discount rate
def policy_evaluation(env, V, s, gamma):

    # Initialize a policy as a matrix of zeros with size
    # (number of states, number of actions)
    pi = np.zeros((env.observation_space.n, env.action_space.n))
    
    # Set the initial value of all actions as zero
    action = np.zeros((env.observation_space.n))

    # Initialize a numpy array of zeros with the same size as the 
    # action space
    action_temp = np.zeros(env.action_space.n)
           
    # Call the policy_improvement function to get the policy
    pi = ___

    # Set the initial value as 0
    value = 0

    # Loop over all possible actions
    for a in range(env.action_space.n):

        # Set u as 0 to compute the value of each state given the 
        # policy
        u = 0

        # From the environment, get the transition information P[s][a]
        P = np.array(___)

        # Iterate over the possible states
        for i in range(len(P)):
            
            # Get the next state
            s_= int(P[i][1])

            # Get the probability of the next state given the current state
            p = P[i][0]

            # Get the reward of going from current state to next state
            r = P[i][2]
            
            # Update the value function based on the equation provided 
            # in the hints
            u += ___
            
        # Update the value based on the policy and the value function
        # This step replaces the max over actions in the standard value-iteration update
        # Since we already have the optimal policy, multiplying by pi keeps the value
        # of the best action and makes the others zero
        value += pi[s,a] * u
  
    # Set the value function of the state as the value computed above
    V[s]=value

    # Return the value function
    return V[s]
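
Similarly, here is a hedged sketch of a possible completion of policy_evaluation (one illustration, not the definitive solution; the unused action_temp array from the scaffold is omitted):

def policy_evaluation(env, V, s, gamma):
    # Fresh policy and action table for this state
    pi = np.zeros((env.observation_space.n, env.action_space.n))
    action = np.zeros(env.observation_space.n)

    # Make the policy greedy with respect to the current value function
    pi = policy_improvement(env, V, pi, action, s, gamma)

    value = 0
    for a in range(env.action_space.n):
        u = 0
        # Transition tuples for taking action a in state s
        P = np.array(env.env.P[s][a])
        for i in range(len(P)):
            s_ = int(P[i][1])
            p = P[i][0]
            r = P[i][2]
            u += p * (r + gamma * V[s_])
        # Weight by the greedy policy instead of taking a max over actions
        value += pi[s, a] * u

    V[s] = value
    return V[s]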

⏸ What does env.env.P[s][a] return?

A. Probability of reaching successor state, successor state and reward.

B. A list of all possible states that can be reached from s.

C. Probability of reaching successor state, successor state, reward and whether the episode is done or not.

D. A list of all possible states that can be reached from s on taking action a.

In [0]:
### edTest(test_chow2) ###
# Submit an answer choice as a string below (eg. if you choose option A, put 'A')
answer2 = '___'

VALUE ITERATION - Bringing everything together

In [20]:
# Define the function to perform value iteration
def value_iteration(env, gamma, theta):

    # Set the initial value of all states as zero
    V = ___    

    # Initialize a loop
    while True:

        # Set delta as 0 to compare the estimation accuracy
        delta = 0

        # Loop over all the states
        for s in range(___):

            # Set the value as the state value function initialized above
            v = V[s]

            # Update the state value function by calling the policy_evaluation function
            V[s]= ___
            
            # Compute delta based on the change in value per iteration using the equation
            # given in the hints
            delta = ___          
        
        # Check if the change is lower than theta defined at the top
        # If so, the value has converged to the optimal value
        if delta < theta:
            break           


    # Initialize a policy as a matrix of zeros with size
    # (number of states, number of actions)
    pi = ___ 

    
    # Set the initial value of all actions as zero
    action = ___                            

    # To extract the optimal policy loop over all the states
    for s in range(___):

        # Call the policy_improvement function to get the optimal policy
        pi = ___         

    # Return the optimal value function, the policy and the action sequence
    return V, pi, action
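
A hedged sketch of how the value_iteration blanks might be completed, assuming the two functions above:

def value_iteration(env, gamma, theta):
    # Start with a zero value for every state
    V = np.zeros(env.observation_space.n)

    while True:
        delta = 0
        for s in range(env.observation_space.n):
            v = V[s]
            # Evaluate state s under the current greedy policy
            V[s] = policy_evaluation(env, V, s, gamma)
            # Track the largest change in value during this sweep
            delta = max(delta, abs(v - V[s]))
        # Stop once the values change by less than theta
        if delta < theta:
            break

    # Extract the optimal policy from the converged values
    pi = np.zeros((env.observation_space.n, env.action_space.n))
    action = np.zeros(env.observation_space.n)
    for s in range(env.observation_space.n):
        pi = policy_improvement(env, V, pi, action, s, gamma)

    return V, pi, action

The call in the next cell would then be something along the lines of V, pi, action = value_iteration(env, gamma, theta).
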
In [0]:
# Call the value_iteration function to get the value function, optimal policy and action sequence
V, pi, action = ___

# Print the discrete action to take in a given state
print("THE ACTION TO TAKE IN A GIVEN STATE IS:\n", np.reshape(action,(4,4)))

# Print the optimal policy
print("\n\n\nTHE OPTIMAL POLICY IS:\n", pi)

TESTING THE POLICY

In [0]:
# Use the helper function test_policy in the helper file to compute the 
# number of times the agent reaches the goal within a fixed number of steps 
# in each episode
# Every time the agent reaches the goal within the fixed number of steps we call it a success

# Set a variable random as 1
# This will ensure that the test_policy function gives the result of some random policy
random = 1

# Call the test_policy function by passing the environment, action and random
test_policy(env, action, random)

# Set a variable random as 0
# This will ensure that the test_policy function gives the result of the optimal policy
random = 0

# Call the test_policy function by passing the environment, action and random
test_policy(env, action, random)
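
The implementation of test_policy lives in the helper file and is not shown here, but conceptually it does something like the hypothetical sketch below. The function body, the limit of 100 steps and the count of 100 episodes are our assumptions for illustration, not the actual helper code:

def test_policy_sketch(env, action, random, episodes=100, max_steps=100):
    # Count how many episodes reach the goal within the step limit
    successes = 0
    for _ in range(episodes):
        s = env.reset()
        for _ in range(max_steps):
            if random == 1:
                a = env.action_space.sample()   # random policy
            else:
                a = int(action[s])              # action dictated by the computed policy
            s, r, done, _ = env.step(a)
            if done:
                successes += int(r > 0)         # FrozenLake gives reward 1 only at the goal
                break
    print("Reached the goal in", successes, "out of", episodes, "episodes")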

⏸ How does increasing or decreasing gamma change the policy and the reward?

In [0]:
### edTest(test_chow3) ###
# Type your answer within the quotes given
answer3 = '___'