Reinforcement Learning Introduction
The idea behind Reinforcement Learning is that an agent (an AI) will learn from the environment by interacting with it (through trial and error) and receiving rewards (negative or positive) as feedback for performing actions.
Reinforcement Learning is just a computational approach to learning from actions.
More formally, Reinforcement Learning is a framework for solving control tasks (also called decision problems) by building agents that learn from the environment by interacting with it through trial and error and receiving rewards (positive or negative) as unique feedback. This interaction unfolds as a loop (sketched in code below):
- Our Agent receives a state from the Environment — we receive the first frame of our game (Environment).
- Based on that state, the Agent takes an action — our Agent will move to the right.
- The Environment transitions to a new state — a new frame.
- The Environment gives some reward to the Agent — we're not dead (Positive Reward +1).
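As a minimal sketch of this loop in Python, here is the standard gymnasium interaction pattern, with CartPole standing in for our game and a random action choice standing in for a real policy (both are assumptions for illustration, not part of this article):

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
state, info = env.reset()  # the Agent receives the first state

for _ in range(100):
    action = env.action_space.sample()  # the Agent takes an action (random, for now)
    state, reward, terminated, truncated, info = env.step(action)  # new state + reward
    if terminated or truncated:
        state, info = env.reset()

env.close()
```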
RL is based on the reward hypothesis: all goals can be described as the maximization of the expected return (the expected cumulative reward). That's why, in Reinforcement Learning, to have the best behavior, we need to maximize the expected cumulative reward.
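Written out in the standard notation (which this article doesn't define itself, so take the symbols as an assumption), the cumulative reward from time step $t$ onwards is:

$$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots$$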
State s: a complete description of the state of the world (there is no hidden information). We receive a state in a fully observed environment.
Observation o: a partial description of the state. We receive an observation in a partially observed environment.
The state space is the information our agent gets from the environment. The action space is the set of all possible actions in an environment.
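Reusing the CartPole stand-in from the sketch above, both spaces can be inspected directly:

```python
import gymnasium as gym

env = gym.make("CartPole-v1")
print(env.observation_space)  # a Box of 4 continuous values: each state is a 4-number vector
print(env.action_space)       # Discrete(2): two possible actions (push left or push right)
```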
Reward is fundamental in RL because it’s the only feedback for the agent.
Thanks to it, our agent knows if the action taken was good or not.
To discount the rewards, we proceed like this (a worked example in code follows the list):
- We define a discount rate called gamma. It must be between 0 and 1, most of the time between 0.95 and 0.99.
- The larger the gamma, the smaller the discount. This means our agent cares more about the long-term reward.
- On the other hand, the smaller the gamma, the bigger the discount. This means our agent cares more about the short-term reward (the nearest cheese).
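Here is a minimal sketch in plain Python, with made-up reward values, of how a reward sequence collapses into the discounted return $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$ and how gamma shifts the weight between near and distant rewards:

```python
def discounted_return(rewards, gamma):
    """Compute G = r1 + gamma*r2 + gamma^2*r3 + ... by folding from the back."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

rewards = [1.0, 1.0, 1.0, 1.0]           # hypothetical reward at each time step
print(discounted_return(rewards, 0.99))  # ~3.94: distant rewards still count almost fully
print(discounted_return(rewards, 0.50))  # ~1.88: distant rewards barely matter
```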
We need to balance how much we explore the environment and how much we exploit what we know about the environment (a common way to strike this balance is sketched after the list).
- Exploration is exploring the environment by trying random actions in order to find more information about the environment.
- Exploitation is exploiting known information to maximize the reward.
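One standard technique for this trade-off (not something this article prescribes) is an epsilon-greedy strategy: with a small probability epsilon we take a random action (explore), otherwise we take the best-known action (exploit):

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """q_values: estimated value of each action in the current state."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))                   # explore: random action
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit: best-known action
```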
The Policy π is the brain of our Agent: it's the function that tells us what action to take given the state we are in. So it defines the agent's behaviour at a given time.
This Policy is the function we want to learn. Our goal is to find the **optimal policy** π*, the policy that maximises expected return when the agent acts according to it. We find this π* through training.
There are two approaches to training our agent to find this optimal policy π* (contrasted in the sketch below):
- Directly, by teaching the agent to learn which action to take given the state it is in: **Policy-Based Methods**.
- Indirectly, by teaching the agent to learn which state is more valuable and then take the action that leads to the more valuable states: **Value-Based Methods**.
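To make the distinction concrete, here is a minimal tabular sketch, with hypothetical hand-filled numbers and two actions, of how each approach picks an action:

```python
import random

# Policy-based: we learn the policy itself, e.g. action probabilities per state.
policy = {"s0": [0.9, 0.1]}  # hypothetical learned probabilities for 2 actions

def policy_based_action(state):
    probs = policy[state]
    return random.choices(range(len(probs)), weights=probs)[0]  # sample from the policy

# Value-based: we learn a value per (state, action) and act greedily on it.
q_table = {("s0", 0): 0.2, ("s0", 1): 0.7}  # hypothetical learned Q-values

def value_based_action(state, n_actions=2):
    return max(range(n_actions), key=lambda a: q_table[(state, a)])

print(policy_based_action("s0"))  # usually 0 (sampled from the policy)
print(value_based_action("s0"))   # 1 (greedy on the learned values)
```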