Deep Dive into Reinforcement Learning Part 3

Intro to Reinforcement Learning using OpenAI Gym

Now it’s time for us to practice what we have learned so far by simulating it in Python with OpenAI Gym. It is assumed the reader has at least basic knowledge of Python programming.

What is OpenAI Gym?

OpenAI Gym is a toolkit for developing and comparing reinforcement learning algorithms. It supports teaching agents everything from walking to playing games like Pong or Pinball. With OpenAI Gym you can observe how your RL program or algorithm behaves. For the simulation we will use the Taxi-v3 environment from OpenAI Gym and the Bellman equation for the reinforcement learning algorithm. Now let’s begin.

First, import all the packages needed.
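A minimal import list could look like this (a sketch, assuming the classic `gym` package together with NumPy and the standard `random` module):

```python
import random        # for drawing the explore/exploit threshold

import numpy as np   # for building and updating the Q-table
import gym           # OpenAI Gym toolkit with the Taxi-v3 environment
```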

Then we create the environment and check the sizes of its state and action spaces.
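A sketch of this step, assuming the classic `gym.make` API (Taxi-v3 has a discrete state space of 500 states and 6 actions):

```python
import gym

env = gym.make("Taxi-v3")

state_size = env.observation_space.n   # number of discrete states (500 for Taxi-v3)
action_size = env.action_space.n       # number of discrete actions (6 for Taxi-v3)

print("State space size:", state_size)
print("Action space size:", action_size)
```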

Before adding the algorithm, it’s better to check whether this simulation works by running the code below (please remove the first and last ’’’ before running it). If it runs well, the simulation works correctly and you can continue adding your reinforcement learning algorithm.
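A minimal sanity check along these lines takes a few random actions and prints the rewards. This sketch includes small compatibility shims because the classic Gym API returns `(obs, reward, done, info)` from `step`, while newer versions return five values and a tuple from `reset`:

```python
import gym

env = gym.make("Taxi-v3")

# Reset the environment; newer Gym versions return (obs, info), classic ones return obs.
out = env.reset()
state = out[0] if isinstance(out, tuple) else out

for _ in range(5):
    action = env.action_space.sample()   # take a random action
    result = env.step(action)
    if len(result) == 5:                 # newer API: obs, reward, terminated, truncated, info
        state, reward, terminated, truncated, info = result
        done = terminated or truncated
    else:                                # classic API: obs, reward, done, info
        state, reward, done, info = result
    print("action:", action, "reward:", reward)
    if done:
        break
```

If this loop runs without errors, the environment is working and you can move on.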

Now we create the parameters for the reinforcement learning agent.
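One possible set of parameters is sketched below. The specific values are hypothetical choices, not the article’s exact numbers, and should be tuned for your own runs:

```python
import numpy as np

# Hypothetical hyperparameter values -- tune these for your own experiments.
total_episodes = 25000   # number of training episodes
max_steps = 200          # maximum steps per episode
learning_rate = 0.7      # alpha in the Bellman update
gamma = 0.618            # discount factor for future rewards
epsilon = 1.0            # initial exploration rate
max_epsilon = 1.0        # exploration rate at the start of training
min_epsilon = 0.01       # lowest exploration rate allowed
decay_rate = 0.01        # exponential decay rate for epsilon

# Q-table: one row per state, one column per action (500 x 6 for Taxi-v3),
# initialized to zero so the agent starts with no knowledge.
q_table = np.zeros((500, 6))
```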

Next, we train the agent.

In line 7 of the training code, the reset function makes sure that every episode starts from the initial conditions.

In lines 13 and 15 we see the explore_exploit parameter, which determines which kind of action the agent should take: exploration or exploitation. Exploring means the agent tries any of the available states at random, while exploiting means the agent moves to the next state according to the Q-table.

In lines 23 to 28, the agent updates its Q-table according to the information obtained in line 21.

In line 39 we decay epsilon, so that over time the agent gets a chance to exploit and does not only keep exploring the available states.
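Putting the steps above together, the training loop can be sketched as follows. This is a minimal sketch with hypothetical hyperparameters, an `explore_exploit` draw as described in the text, and small shims so it works with both the classic and newer Gym `reset`/`step` signatures:

```python
import random

import numpy as np
import gym

env = gym.make("Taxi-v3")
q_table = np.zeros((env.observation_space.n, env.action_space.n))

# Hypothetical hyperparameters; increase total_episodes for better results.
total_episodes, max_steps = 2000, 200
learning_rate, gamma = 0.7, 0.618
epsilon, max_epsilon, min_epsilon, decay_rate = 1.0, 1.0, 0.01, 0.01

def reset_env(env):
    out = env.reset()
    return out[0] if isinstance(out, tuple) else out      # classic vs. newer Gym API

def step_env(env, action):
    out = env.step(action)
    if len(out) == 5:                                     # newer API splits "done" in two
        obs, reward, terminated, truncated, info = out
        return obs, reward, terminated or truncated, info
    return out

for episode in range(total_episodes):
    state = reset_env(env)                    # make sure the episode starts from scratch
    for step in range(max_steps):
        explore_exploit = random.uniform(0, 1)
        if explore_exploit > epsilon:
            action = np.argmax(q_table[state])    # exploit: best action from the Q-table
        else:
            action = env.action_space.sample()    # explore: try a random action
        new_state, reward, done, info = step_env(env, action)
        # Bellman update of the Q-table using the observed reward and next state
        q_table[state, action] += learning_rate * (
            reward + gamma * np.max(q_table[new_state]) - q_table[state, action]
        )
        state = new_state
        if done:
            break
    # Decay epsilon so the agent gradually shifts from exploring to exploiting
    epsilon = min_epsilon + (max_epsilon - min_epsilon) * np.exp(-decay_rate * episode)
```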

Now let’s evaluate the program we have trained.

In lines 14–15 of the evaluation code we take actions based on the Q-table that we trained before. We use np.argmax to find the index with the maximum value in the Q-table, and the agent takes that action. We also count a penalty whenever reward == -10 (lines 17–18), to see how many mistakes the agent makes during the evaluation.
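An evaluation loop along these lines is sketched below. For the sketch to run on its own it first trains a Q-table briefly (in practice you would reuse the table from the training step); the episode counts and hyperparameters are hypothetical:

```python
import numpy as np
import gym

env = gym.make("Taxi-v3")
q_table = np.zeros((env.observation_space.n, env.action_space.n))

def reset_env(env):
    out = env.reset()
    return out[0] if isinstance(out, tuple) else out      # classic vs. newer Gym API

def step_env(env, action):
    out = env.step(action)
    if len(out) == 5:
        obs, reward, terminated, truncated, info = out
        return obs, reward, terminated or truncated, info
    return out

# --- Quick training pass so the Q-table is not empty (normally done beforehand) ---
epsilon, max_steps = 1.0, 200
for episode in range(2000):
    state = reset_env(env)
    for step in range(max_steps):
        if np.random.uniform(0, 1) > epsilon:
            action = np.argmax(q_table[state])
        else:
            action = env.action_space.sample()
        new_state, reward, done, info = step_env(env, action)
        q_table[state, action] += 0.7 * (
            reward + 0.618 * np.max(q_table[new_state]) - q_table[state, action]
        )
        state = new_state
        if done:
            break
    epsilon = 0.01 + (1.0 - 0.01) * np.exp(-0.01 * episode)

# --- Evaluation: act greedily from the trained Q-table and count mistakes ---
test_episodes = 10
total_penalties, total_rewards = 0, 0

for episode in range(test_episodes):
    state = reset_env(env)
    for step in range(max_steps):
        action = np.argmax(q_table[state])   # greedy action from the trained Q-table
        state, reward, done, info = step_env(env, action)
        total_rewards += reward
        if reward == -10:                    # illegal pickup/drop-off counts as a mistake
            total_penalties += 1
        if done:
            break

print("Average reward per episode:", total_rewards / test_episodes)
print("Average penalties per episode:", total_penalties / test_episodes)
```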

Here is the result of the first two actions the agent takes after the initial condition.

Here is the last action of the agent, when it has reached the goal.

Full code is available in

For more detailed information about the Taxi-v3 environment and its symbols, check the Gym documentation. Thank you!