{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "## Title\n", "\n", "🆓 Exercise: Finding the Optimal Policy\n", "\n", "## Description\n", "\n", "The aim of this exercise is to find the optimal policy that given the maximum reward given an environment. For this, we will be using a pre-defined environment by OpenAI Gym. We will be using an environment called FrozenLake-v0.There are many environments defined by OpenAI Gym, that you can see here.\n", "\n", "Environment Description:\n", "\n", "Winter is here. You and your friends were tossing around a frisbee at the park when you made a wild throw that left the frisbee out in the middle of the lake. The water is mostly frozen, but there are a few holes where the ice has melted. If you step into one of those holes, you'll fall into the freezing water. At this time, there's an international frisbee shortage, so it's absolutely imperative that you navigate across the lake and retrieve the disc.\n", "\n", "The agent controls the movement of a character in a grid world. Some tiles of the grid are walkable, and others lead to the agent falling into the water. Additionally, the movement direction of the agent is uncertain and only partially depends on the chosen direction. The agent is rewarded for finding a walkable path to a goal tile.\n", "\n", "The surface is described using a grid-like the following:\n", "\n", "\n", "\n", "Possible actions are Left(0), Right(1), Down(2), Up(3).\n", "\n", "NOTE - Here we are slightly altering the value iteration algorithm. Instead of computing the optimal value function first, we compute the optimal policy and then find the value function associated with it.\n", "\n", "## Instructions\n", "- Initialize an environment using a pre-defined environment from OpenAI Gym.\n", "- Set parameters `gamma` and `theta`.\n", "- Define a function `policy_improvement` that returns the action which takes us to a higher valued state. This function updates the policy to the optimal policy.\n", "- Define a function `policy_evaluation` that updates the state value of the environment given a policy.\n", "- Define a function `value_iteration` that calls the above-defined functions to get an optimal policy, action sequence governed by the policy and the state value function.\n", "- Now test the policy by checking how many episodes (with a fixed number of steps) in the 100 episode loop does the agent reach the final goal.\n", " - First, try the environment with a random policy, by taking random actions at each state.\n", " - Next, take actions based on the optimal policy.\n", "\n", "## Hints\n", "\n", "Equation to compute the value function:\n", "\n", "$$\\pi(s)= \\sum_{\\{s',r\\}} p(\\{s',r\\}\\ |\\ s,a)\\ [r+\\gamma v(s')]$$\n", "\n", "Equation to computer Delta:\n", "\n", "$$\\triangle\\gets\\max(\\triangle,|v-v(s)|)$$\n", "\n", "gym.make(environment-name) Access a pre-defined environment\n", "\n", "Env.action_space.n : Returns the number of discrete actions\n", "\n", "Env.observation_space.n : Returns the number of discrete states\n", "\n", "np.zeros() Return a new array of given shape and type, filled with zeros.\n", "\n", "environment_object.env.P[s][a] : Returns the probability of reaching successor state (s’) and its reward (r)\n", "\n", "np.argmax() Returns the indices of the maximum values along an axis.\n", "\n", "max() Returns the largest item in an iterable or the largest of two or more arguments.\n", "\n", "abs() Returns the absolute value of a number." 
] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "# Import necessary libraries\n", "import gym\n", "import numpy as np\n", "from helper import test_policy\n" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "# Initializing an environment using a pre-defined environment from OpenAI Gym \n", "# The environment used here is 'FrozenLake-v0'\n", "env = ___\n", "\n", "# Setting the initial parameters required for value iteration\n", "\n", "# Set the discount factor to a value between 0 and 1\n", "gamma = ___\n", "\n", "# Theta indicates the threshold determining the accuracy of the iteration\n", "# Set it to a value lower than 1e-3\n", "theta = ___\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ⏸ How does theta affect the policy evaluation and value iteration algorithms?\n", "\n", "#### A. A large theta would cause the random policy to converge to the optimal policy much faster. \n", "#### B. Theta does not directly or indirectly affect finding the optimal policy.\n", "#### C. A large theta would cause policy evaluation to fasten but would slow down value iteration.\n", "#### D. A large theta would result in an optimal policy far from the true optimal policy." ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [ "### edTest(test_chow1) ###\n", "# Submit an answer choice as a string below (eg. if you choose option A, put 'A')\n", "answer1 = '___'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **POLICY IMPROVEMENT**\n", "\n", "\"Policy" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [], "source": [ "# Function that returns the action which takes us to a higher valued state\n", "# It takes as input the environment, state-value function, policy, \n", "# action, current state and the discount rate\n", "def policy_improvement(env, V, pi, action, s, gamma):\n", "\n", " # Initialize a numpy array of zeros with the same of the \n", " # environment's action space\n", " action_temp = ___\n", "\n", " # Loop over the size of the environment's action space i.e.\n", " # Iterate for every possible action\n", " for a in range(env.action_space.n): \n", "\n", " # Set the q value to 0\n", " q = 0\n", "\n", " # From the environment get P(s|a)\n", " # This will return the probability of reaching successor state (s’) and its reward (r)\n", " P = np.array(___) \n", " \n", " # Iterate over the possible states\n", " for i in range(len(P)): \n", "\n", " # Get the possible succesor state\n", " s_= int(P[i][1]) \n", "\n", " # Get the transition Probability P(s'|s,a) \n", " p = P[i][0] \n", " \n", " # Get the reward\n", " r = P[i][2] \n", " \n", " # Compute the action value q(s|a) based on the equation \n", " # provided in the instruction\n", " q += ___ \n", "\n", " # Save the q-value of taking a particular action into the \n", " # action_temp array \n", " action_temp[a] = q\n", " \n", " # Get the action from action_temp that has the highest q-value \n", " m = ___ \n", "\n", " # For each state set the action which give the highest q-value\n", " action[s] = m \n", " \n", " # Update the policy by setting the action which give the highest \n", " # q-value for a state as 1\n", " pi[s][m] = 1 \n", "\n", " # Return the updated policy\n", " return pi\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **POLICY EVALUATION**\n", "\n", "\"Policy\n", "\n", "---\n", "___\n", "NOTE - We are not computing the max here 
{ "cell_type": "markdown", "metadata": {}, "source": [
 "### **POLICY EVALUATION**\n",
 "\n",
 "$$v(s) = \\max_a \\sum_{s',r} p(s',r\\ |\\ s,a)\\ [r+\\gamma v(s')]$$\n",
 "\n",
 "---\n",
 "\n",
 "NOTE - We are not computing the max here, as we already have the optimal policy. Instead, we just multiply by the policy, which keeps the value of the best action and makes the others zero.\n",
 "\n",
 "---"
 ] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [], "source": [
 "# Define a function to update the state value, taking the environment,\n",
 "# the current state-value function, the current state and the discount rate\n",
 "def policy_evaluation(env, V, s, gamma):\n",
 "\n",
 "    # Initialize a policy as a matrix of zeros with size\n",
 "    # (number of states, number of actions)\n",
 "    pi = np.zeros((env.observation_space.n, env.action_space.n))\n",
 "\n",
 "    # Set the initial value of all actions as zero\n",
 "    action = np.zeros((env.observation_space.n))\n",
 "\n",
 "    # Initialize a numpy array of zeros with the same size as the\n",
 "    # action space\n",
 "    action_temp = np.zeros(env.action_space.n)\n",
 "\n",
 "    # Call the policy_improvement function to get the policy\n",
 "    pi = ___\n",
 "\n",
 "    # Set the initial value as 0\n",
 "    value = 0\n",
 "\n",
 "    # Loop over all possible actions\n",
 "    for a in range(env.action_space.n):\n",
 "\n",
 "        # Set u as 0 to compute the value of each state given the\n",
 "        # policy\n",
 "        u = 0\n",
 "\n",
 "        # From the environment get P[s][a]\n",
 "        P = np.array(___)\n",
 "\n",
 "        # Iterate over the possible successor states\n",
 "        for i in range(len(P)):\n",
 "\n",
 "            # Get the next state\n",
 "            s_ = int(P[i][1])\n",
 "\n",
 "            # Get the probability of the next state given the current state\n",
 "            p = P[i][0]\n",
 "\n",
 "            # Get the reward of going from the current state to the next state\n",
 "            r = P[i][2]\n",
 "\n",
 "            # Update the value function based on the equation provided\n",
 "            # in the instructions\n",
 "            u += ___\n",
 "\n",
 "        # Update the value based on the policy and the value function\n",
 "        # This step replaces the max in the equation above\n",
 "        # Since we have the optimal policy, we just multiply by the policy pi\n",
 "        # to keep the value associated with the best action while the others become 0\n",
 "        value += pi[s, a] * u\n",
 "\n",
 "    # Set the value function of the state as the value computed above\n",
 "    V[s] = value\n",
 "\n",
 "    # Return the value function\n",
 "    return V[s]\n"
 ] }, { "cell_type": "markdown", "metadata": {}, "source": [
 "### ⏸ What does env.env.P[s][a] return?\n",
 "\n",
 "#### A. Probability of reaching successor state, successor state and reward.\n",
 "#### B. A list of all possible states that can be reached from s. \n",
 "#### C. Probability of reaching successor state, successor state, reward and whether the episode is done or not.\n",
 "#### D. A list of all possible states that can be reached from s on taking action a."
 ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [
 "### edTest(test_chow2) ###\n",
 "# Submit an answer choice as a string below (e.g. if you choose option A, put 'A')\n",
 "answer2 = '___'"
 ] },
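{ "cell_type": "markdown", "metadata": {}, "source": [
 "The value iteration step below repeatedly sweeps over all states until the value function stops changing by more than theta. As a rough, standalone illustration of that stopping rule (using the delta update from the hints, with made-up numbers that have nothing to do with FrozenLake):"
 ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [
 "# Toy illustration of the convergence check (illustrative numbers only)\n",
 "import numpy as np\n",
 "\n",
 "V_old = np.zeros(3)       # pretend 3-state value function\n",
 "theta_toy = 1e-3\n",
 "\n",
 "for sweep in range(1, 20):\n",
 "    delta = 0\n",
 "    V_new = V_old + 0.5 ** sweep   # made-up updates that halve each sweep\n",
 "    for s in range(len(V_old)):\n",
 "        # Track the largest change in any state value during this sweep\n",
 "        delta = max(delta, abs(V_old[s] - V_new[s]))\n",
 "    V_old = V_new\n",
 "    # Stop once the largest change drops below the threshold theta\n",
 "    if delta < theta_toy:\n",
 "        print(sweep, delta)\n",
 "        break\n"
 ] },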
{ "cell_type": "markdown", "metadata": {}, "source": [ "### **VALUE ITERATION** - Bringing everything together" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [], "source": [
 "# Define the function to perform value iteration\n",
 "def value_iteration(env, gamma, theta):\n",
 "\n",
 "    # Set the initial value of all states as zero\n",
 "    V = ___\n",
 "\n",
 "    # Initialize a loop\n",
 "    while True:\n",
 "\n",
 "        # Set delta as 0 to compare the estimation accuracy\n",
 "        delta = 0\n",
 "\n",
 "        # Loop over all the states\n",
 "        for s in range(___):\n",
 "\n",
 "            # Set the value as the state value function initialized above\n",
 "            v = V[s]\n",
 "\n",
 "            # Update the state value function by calling the policy_evaluation function\n",
 "            V[s] = ___\n",
 "\n",
 "            # Compute delta based on the change in value per iteration using the equation\n",
 "            # given in the instructions\n",
 "            delta = ___\n",
 "\n",
 "        # Check whether the largest change is lower than the theta defined at the top\n",
 "        # If so, the value function has converged to the optimal value\n",
 "        if delta < theta:\n",
 "            break\n",
 "\n",
 "    # Initialize a policy as a matrix of zeros with size\n",
 "    # (number of states, number of actions)\n",
 "    pi = ___\n",
 "\n",
 "    # Set the initial value of all actions as zero\n",
 "    action = ___\n",
 "\n",
 "    # To extract the optimal policy, loop over all the states\n",
 "    for s in range(___):\n",
 "\n",
 "        # Call the policy_improvement function to get the optimal policy\n",
 "        pi = ___\n",
 "\n",
 "    # Return the optimal value function, the policy and the action sequence\n",
 "    return V, pi, action\n"
 ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [
 "# Call the value_iteration function to get the value function, optimal policy and action sequence\n",
 "V, pi, action = ___\n",
 "\n",
 "# Print the discrete action to take in a given state\n",
 "print(\"THE ACTION TO TAKE IN A GIVEN STATE IS:\\n\", np.reshape(action, (4, 4)))\n",
 "\n",
 "# Print the optimal policy\n",
 "print(\"\\n\\n\\nTHE OPTIMAL POLICY IS:\\n\", pi)\n"
 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### **TESTING THE POLICY**" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [
 "# Use the helper function test_policy in the helper file to compute the\n",
 "# number of times the agent reaches the goal within a fixed number of steps\n",
 "# in each episode\n",
 "# Every time the agent reaches the goal within the fixed number of steps we call it a success\n",
 "\n",
 "# Set a variable random as 1\n",
 "# This will ensure that the test_policy function gives the result of a random policy\n",
 "random = 1\n",
 "\n",
 "# Call the test_policy function by passing the environment, action and random\n",
 "test_policy(env, action, random)\n",
 "\n",
 "# Set a variable random as 0\n",
 "# This will ensure that the test_policy function gives the result of the optimal policy\n",
 "random = 0\n",
 "\n",
 "# Call the test_policy function by passing the environment, action and random\n",
 "test_policy(env, action, random)"
 ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### ⏸ How does increasing or decreasing gamma change the policy and the reward?\n" ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [
 "### edTest(test_chow3) ###\n",
 "# Type your answer within the quotes given\n",
 "answer3 = '___'"
 ] },
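{ "cell_type": "markdown", "metadata": {}, "source": [
 "The test_policy function used above is imported from helper.py, which is not included in this notebook. Below is a rough, hypothetical sketch of what such a helper might look like. It is an assumption for illustration, not the actual implementation, and it assumes the classic gym API in which env.step returns (state, reward, done, info):"
 ] }, { "cell_type": "code", "execution_count": 0, "metadata": {}, "outputs": [], "source": [
 "# Hypothetical sketch only -- the real test_policy lives in helper.py\n",
 "def test_policy_sketch(env, action, random, episodes=100, max_steps=100):\n",
 "    # Count the episodes in which the agent reaches the goal (reward of 1)\n",
 "    successes = 0\n",
 "    for _ in range(episodes):\n",
 "        state = env.reset()\n",
 "        for _ in range(max_steps):\n",
 "            # Take a random action if random == 1, otherwise follow the given action array\n",
 "            a = env.action_space.sample() if random == 1 else int(action[state])\n",
 "            state, reward, done, info = env.step(a)\n",
 "            if done:\n",
 "                successes += int(reward > 0)\n",
 "                break\n",
 "    print(successes, 'successes out of', episodes, 'episodes')\n"
 ] } ], "metadata": { "kernelspec": { "display_name": "Python 3",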
"language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" } }, "nbformat": 4, "nbformat_minor": 1 }