Reinforcement Learning Introduction


The idea behind Reinforcement Learning is that an agent (an AI) will learn from the environment by interacting with it (through trial and error) and receiving rewards (negative or positive) as feedback for performing actions.

Reinforcement Learning is simply a computational approach to learning from actions.

Reinforcement Learning is a framework for solving control tasks (also called decision problems) by building agents that learn from the environment by interacting with it through trial and error and receiving rewards (positive or negative) as their only feedback.

https://huggingface.co/blog/assets/63_deep_rl_intro/RL_process_game.jpg

https://huggingface.co/blog/assets/63_deep_rl_intro/sars.jpg
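To make this loop concrete, here is a minimal sketch of the agent-environment interaction using the Gymnasium library; the environment name CartPole-v1 and the random action choice are illustrative, not part of the course.

```python
# A minimal sketch of the RL loop: the agent takes an action,
# the environment returns the next state and a reward.
# Gymnasium and "CartPole-v1" are illustrative choices.
import gymnasium as gym

env = gym.make("CartPole-v1")
state, info = env.reset(seed=0)

for _ in range(100):
    action = env.action_space.sample()  # a random policy, for now
    next_state, reward, terminated, truncated, info = env.step(action)
    state = next_state
    if terminated or truncated:  # the episode ended: reset and start over
        state, info = env.reset()

env.close()
```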

RL is based on the reward hypothesis: all goals can be described as the maximization of the expected return (the expected cumulative reward).

That's why, in Reinforcement Learning, to obtain the best behavior, we need to maximize the expected cumulative reward.

State s: a complete description of the state of the world (there is no hidden information). We talk about a state when the environment is fully observed.

Observation o: a partial description of the state. We talk about an observation when the environment is only partially observed.

The state space is the set of all possible states (or observations) in an environment; states and observations are the information our agent gets from the environment.

Action space is the set of all possible actions in an environment.
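As an illustration, here is how these spaces look in Gymnasium; the two environment names are just examples of a continuous observation space and a finite, discrete one.

```python
# Inspecting observation and action spaces in Gymnasium
# (environment names are illustrative examples).
import gymnasium as gym

env = gym.make("CartPole-v1")
print(env.observation_space)  # Box of shape (4,): a continuous observation space
print(env.action_space)       # Discrete(2): two possible actions

env = gym.make("FrozenLake-v1")
print(env.observation_space)  # Discrete(16): a finite set of states
print(env.action_space)       # Discrete(4): left, down, right, up
```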

Reward is fundamental in RL because it’s the only feedback for the agent.

Thanks to it, our agent knows if the action taken was good or not.

To discount the rewards, we proceed like this:

  1. We define a discount rate called gamma. It must be between 0 and 1, most of the time between 0.95 and 0.99.
    • The larger the gamma, the smaller the discount. This means our agent cares more about the long-term reward.
    • On the other hand, the smaller the gamma, the bigger the discount. This means our agent cares more about the short-term reward.
  2. Then, each reward is discounted by gamma raised to the power of its time step, so rewards that arrive further in the future are worth less to the agent (written out in the formula and sketch below).
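Written out (this is the standard formulation, not something specific to this section), the discounted return from time step t is:

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
    = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
```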

https://huggingface.co/blog/assets/63_deep_rl_intro/rewards_4.jpg
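A small sketch of this computation; the reward sequence and the helper name discounted_return are made up for illustration:

```python
# Each reward at step t is weighted by gamma**t, so rewards further
# in the future contribute less to the return.
def discounted_return(rewards, gamma=0.99):
    """Compute G = sum over t of gamma**t * rewards[t] for one episode."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

rewards = [1.0, 1.0, 1.0, 10.0]  # an illustrative episode
print(discounted_return(rewards, gamma=0.99))  # ~12.67: the long-term reward still counts
print(discounted_return(rewards, gamma=0.50))  # 3.0: the distant 10.0 is heavily discounted
```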

We need to balance how much we explore the environment (trying new actions to gather more information about it) and how much we exploit what we already know (taking actions we expect to yield good rewards).
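One common way to strike this balance (not the only one) is an epsilon-greedy strategy: explore with probability epsilon, exploit otherwise. The q_values list of per-action value estimates below is an assumed input.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """q_values: estimated value of each action in the current state."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore: try a random action
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit: best known action
```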

The Policy π is the brain of our Agent: it's the function that tells us what action to take given the state we are in. So it defines the agent's behaviour at a given time.

This Policy is the function we want to learn. Our goal is to find the **optimal policy** π*, the policy that maximises expected return when the agent acts according to it. We find this π* through training.
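Concretely, a policy is just a mapping from states to actions (deterministic) or to a distribution over actions (stochastic). The toy states, actions, and probabilities below are made up for illustration.

```python
import random

# Deterministic policy: a = pi(s), here a simple lookup table (toy states/actions).
deterministic_policy = {"s0": "left", "s1": "right"}

def stochastic_policy(state):
    """Sample an action from pi(a|s); the probabilities are illustrative."""
    action_probs = {"left": 0.3, "right": 0.7}
    actions = list(action_probs)
    weights = list(action_probs.values())
    return random.choices(actions, weights=weights, k=1)[0]

print(deterministic_policy["s0"])  # always "left" in state s0
print(stochastic_policy("s0"))     # "left" ~30% of the time, "right" ~70%
```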

There are two approaches to train our agent to find this optimal policy π*:

  • Policy-based methods: teach the agent to learn which action to take given the current state.
  • Value-based methods: teach the agent to learn which state is more valuable, and then take the action that leads to the more valuable states.
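As a rough sketch of the difference (the function and table names are assumed, not from the course): a policy-based agent acts by querying its learned policy directly, while a value-based agent acts greedily with respect to its learned value estimates.

```python
# Policy-based: the learned policy itself picks the action.
def act_policy_based(policy_fn, state):
    return policy_fn(state)

# Value-based: learn Q(s, a), then act greedily with respect to it.
def act_value_based(q_table, state):
    # q_table[state] maps each action to its estimated value (assumed layout)
    action_values = q_table[state]
    return max(action_values, key=action_values.get)
```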