Deep Reinforcement Learning

Notes from the UCL course on Reinforcement Learning.


LECTURE 1 (Introduction to RL)

Characteristics of Reinforcement Learning

  • There is no supervisor, only a reward signal.
  • Feedback is delayed, not instantaneous.
  • Time matters (sequential, not i.i.d. data).
  • The agent’s actions affect the subsequent data it receives.

Rewards

A reward Rt is a scalar feedback signal that indicates how well the agent is doing at step t. The agent’s job is to maximize the cumulative reward. Reinforcement learning is based on the Reward Hypothesis.

Sequential Decision Making

In RL, actions should be selected to maximize total future reward. Actions may have long-term consequences and rewards may be delayed. It may be better to sacrifice immediate reward to gain more long-term reward.

Agent and Environment

Agent-Environment Interaction

At each time step t the agent:

  • Executes Action At.
  • Receives observation Ot.
  • Receives scalar reward Rt.

At each time step t the environment:

  • Receives action At.
  • Emits observation Ot+1.
  • Emits scalar reward Rt+1 (a minimal sketch of this loop follows).
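
A minimal sketch of this interaction loop in Python; the Environment class, its step method, and the random action choice are hypothetical placeholders, not something defined in the course:

```python
import random

class Environment:
    """Hypothetical toy environment: action 1 yields reward, action 0 does not."""
    def step(self, action):
        reward = 1.0 if action == 1 else 0.0   # scalar reward R_{t+1}
        observation = random.random()          # observation O_{t+1}
        return observation, reward

env = Environment()
observation = random.random()                  # initial observation O_1
for t in range(5):
    action = random.choice([0, 1])             # agent executes action A_t
    observation, reward = env.step(action)     # agent receives O_{t+1} and R_{t+1}
    print(t, action, reward)
```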

Information State (a.k.a Markov state)

An information state contains all the useful information from history.

Definition

A state St is Markov if and only if:

P[St+1|St] = P[St+1|S1,…,St]

This means the future is independent of the past given the present.

Fully Observable Environments

The observation at time t is equal to the agent state at time t and the environment state at time t.

Ot = Sta = Ste.

This is a Markov decision process (MDP).

Partially Observable Environments

The agent indirectly observes the environment.

  • A robot with camera vision isn’t told its absolute location.
  • A trading agent only observes current prices.
  • A poker playing agent only observes public cards.

Now, the agent state is not equal to the environment state. Formally, this is a partially observable Markov decision process (POMDP).

The agent must construct its own state representation Sta, e.g.

  • Complete history, Sta = Ht.
  • Beliefs of environment state.
  • Recurrent neural networks.

Major Components of an RL Agent

An RL agent may include one or more of these components:

  • Policy: agent’s behaviour function.
  • Value function: how good is each state and/or action.
  • Model: agent’s representation of the environment.

Policy

  • A policy is a map from state to action (sketched in code after this list), e.g.
    • Deterministic Policy
    • Stochastic Policy
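
A tiny sketch of both kinds of policy; the states "s1", "s2" and actions "left", "right" are invented purely for illustration:

```python
import random

# Deterministic policy: a direct map from state to action, a = pi(s).
deterministic_policy = {"s1": "left", "s2": "right"}

# Stochastic policy: a distribution over actions given the state, pi(a|s).
stochastic_policy = {
    "s1": {"left": 0.8, "right": 0.2},
    "s2": {"left": 0.3, "right": 0.7},
}

def sample_action(policy, state):
    """Sample an action from a stochastic policy pi(a|s)."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(deterministic_policy["s1"])               # always 'left'
print(sample_action(stochastic_policy, "s1"))   # 'left' about 80% of the time
```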

Value Function

  • Value function is a prediction of future reward.
  • Used to evaluate the goodness of states.
  • And therefore to select between actions.

Model

A model predicts what the environment will do next.

  • Transition model: predicts the next state.
  • Reward model: predicts the next reward (a small tabular sketch follows this list).
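
As a sketch, a simple tabular model can store these two predictions as lookup tables; the states, action, and numbers below are invented:

```python
# Transition model: P(s' | s, a), stored as nested dictionaries.
transition_model = {
    ("s1", "a"): {"s1": 0.1, "s2": 0.9},
    ("s2", "a"): {"s1": 0.5, "s2": 0.5},
}

# Reward model: expected next reward R(s, a).
reward_model = {
    ("s1", "a"): 0.0,
    ("s2", "a"): 1.0,
}

# The model predicts what the environment will do next for a state-action pair.
print(transition_model[("s1", "a")], reward_model[("s1", "a")])
```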

Categorizing RL agents

  • Value Based.
  • Policy Based.
  • Actor Critic.
  • Model Free.
  • Model Based.

Learning and Planning

Two fundamental problems in sequential decision making:

  • Reinforcement Learning:

    • The environment is initially unknown.
    • The agent interacts with the environment.
    • The agent improves its policy.
  • Planning:

    • A model of the environment is known.
    • The agent performs computations with its model (without any external interaction).
    • The agent improves its policy.

Here you can see my Connect4 bot working in a planning manner, finding the optimal policy with a tree search.

LECTURE 2 (MDP)

Markov Process

State Transition Matrix

For a Markov state s and successor state s', the state transition probability is defined by,

Pss' = P[St+1 = s' |St = s]

The state transition matrix P defines transition probabilities from all states s to all successor states s'.

Transition Matrix

Markov Process

A Markov process is a memoryless random process, i.e. a sequence of random states S1, S2, …, Sn with the Markov property.

Example: Student Markov Chain

In the figure below, you can see an illustration of a student Markov chain :)

Student Markov Chain
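
A minimal sketch of sampling episodes from a Markov chain given its transition matrix; the three states and probabilities below are invented, not the actual student chain from the figure:

```python
import numpy as np

states = ["Class", "Facebook", "Sleep"]   # hypothetical state names
P = np.array([                            # P[i, j] = P[S_{t+1} = j | S_t = i]
    [0.5, 0.4, 0.1],
    [0.6, 0.3, 0.1],
    [0.0, 0.0, 1.0],                      # 'Sleep' is absorbing (terminal)
])

rng = np.random.default_rng(0)
s = 0                                     # start in 'Class'
episode = [states[s]]
while states[s] != "Sleep":
    s = rng.choice(len(states), p=P[s])   # sample next state from row P[s]
    episode.append(states[s])
print(episode)
```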

Markov Reward Process

A Markov reward process is a Markov chain with values. It is a tuple <S, P, R, γ> where:

  • P is the state transition probability matrix, Pss' = P[St+1 = s' | St = s]
  • R is a reward function, Rs = E[Rt+1 | St = s]
  • γ is a discount factor, γ ∈ [0, 1]

Return

The return Gt is the total discounted reward from time-step t, where gamma is the discount factor and takes values between 0 and 1. This values immediate reward above delayed reward: gamma close to 0 leads to "myopic" evaluation, while gamma close to 1 leads to "far-sighted" evaluation.

Gt = Rt+1 + γRt+2 + γ2Rt+3 + …
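
A one-function sketch of computing this return for a finite reward sequence (the rewards are made up), showing how the discount factor changes the evaluation:

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

rewards = [0.0, 0.0, 1.0, 5.0]
print(discounted_return(rewards, gamma=0.9))   # far-sighted: future rewards count
print(discounted_return(rewards, gamma=0.1))   # myopic: mostly immediate reward
```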

Value Functions

The value function v(s) gives the long-term value of state s. The state-value function v(s) of an MRP is the expected return starting from state s.

v(s) = E[Gt |St = s]

Bellman Equation for MRPs

The value function can be decomposed into two parts:

  • immediate reward Rt+1
  • discounted value of the successor state, γv(St+1)

v(s) = E[Rt+1 + γv(St+1) | St = s]

Bellman Equation in Matrix Form

The Bellman equation can be expressed concisely using matrices,

v = R + γPv

where v is a column vector with one entry per state

Written out for each state s: v(s) = Rs + γ Σs' Pss' v(s')

Solving the Bellman Equation

The Bellman equation is a linear equation, so it can be solved directly: v = (I − γP)−1 R. However, the computational complexity is O(n3) for n states, so a direct solution is only possible for small MRPs. For large MRPs, iterative methods include:

  • Dynamic programming
  • Monte-Carlo evaluation
  • Temporal Difference learning
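
For a small MRP, the direct solution v = (I − γP)−1 R mentioned above takes only a few lines; the transition matrix and rewards here are invented for illustration:

```python
import numpy as np

P = np.array([[0.5, 0.5, 0.0],    # hypothetical 3-state MRP transition matrix
              [0.2, 0.3, 0.5],
              [0.0, 0.0, 1.0]])   # absorbing final state
R = np.array([-1.0, -2.0, 0.0])   # expected immediate reward per state
gamma = 0.9

# Solve (I - gamma * P) v = R directly; O(n^3), fine for small n.
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)
```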

Markov Decision Process

A Markov decision process is a Markov reward process with decisions. It is an environment in which all states are Markov. In the figure below, you can see the Student MDP. This time there are no transition probabilities, only decisions and rewards. Except going to the pub… Once you go to the pub, you can’t make further decisions :)

Student MDP

To make decisions, we need policies.

Policies

A policy π is a distribution over actions given states.

π(a|s) = P [At = a | St = s]

  • A policy fully defines the behaviour of an agent.
  • MDP policies depend on the current state.
  • Policies are stationary (time-independent).

Given an MDP M = <S, A, P, R, γ> and a policy π,

  • The state sequence S1, S2, … is a Markov process <S,Pπ>.
  • The state and reward sequence S1, R2, S2, … is a Markov reward process <S, Pπ, Rπ, γ>.

where Pπss' = Σa π(a|s) Pass' and Rπs = Σa π(a|s) Ras.

Value Function

State - Value Function

The state-value function vπ(s) of an MDP is the expected return starting from state s, and then following policy π.

vπ(s) = Eπ [Gt | St = s]

The state-value function can again be decomposed into immediate reward plus discounted value of successor state.

vπ(s) = Eπ [Rt+1 + γvπ(St+1) | St = s]

Action - Value Function

The action-value function qπ(s, a) is the expected return starting from state s, taking action a, and then following policy π.

qπ(s, a) = Eπ [Gt | St = s, At = a]

The action-value function can similarly be decomposed,

qπ(s, a) = Eπ [Rt+1 + γqπ(St+1, At+1) | St = s, At = a]

Optimal Value Function

  • The optimal state-value function v∗(s) is the maximum value function over all policies.
  • The optimal action-value function q∗(s, a) is the maximum action-value function over all policies.

Finding an Optimal Policy

An optimal policy can be found by maximizing over q∗(s, a),

π∗(a|s) = 1 if a = argmaxa q∗(s, a), and 0 otherwise

  • There is always a deterministic optimal policy for any MDP.

In the figure below, the red arcs show the optimal policy for the student MDP.

Optimal Policy for the Student MDP
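
A sketch of reading off a deterministic optimal policy from a table of q∗ values; the numbers are hypothetical:

```python
import numpy as np

# Hypothetical optimal action-values q*(s, a): rows are states, columns are actions.
q_star = np.array([[1.0, 2.5],
                   [0.3, 0.1],
                   [4.0, 2.0]])

# Act greedily with respect to q*: pi*(s) = argmax_a q*(s, a).
optimal_policy = q_star.argmax(axis=1)
print(optimal_policy)   # -> [1 0 0]
```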

Solving the Bellman Optimality Equation

The Bellman optimality equation is non-linear, so we use iterative solution methods such as:

  • Value Iteration
  • Policy Iteration
  • Q-learning
  • Sarsa
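
As a small sketch of the first of these methods, value iteration repeatedly applies the Bellman optimality backup to a tabular MDP; the transition and reward arrays below are hypothetical:

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
# Hypothetical tabular MDP: P[a, s, s'] = P(s' | s, a) and R[s, a].
P = np.zeros((n_actions, n_states, n_states))
P[0] = [[0.8, 0.2, 0.0], [0.0, 0.8, 0.2], [0.0, 0.0, 1.0]]
P[1] = [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.0, 0.0, 1.0]]
R = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]])

v = np.zeros(n_states)
for _ in range(1000):
    # Bellman optimality backup: v(s) = max_a [R(s, a) + gamma * sum_s' P(s'|s, a) v(s')]
    q = R + gamma * np.stack([P[a] @ v for a in range(n_actions)], axis=1)
    v_new = q.max(axis=1)
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new
print(v)   # approximates v*
```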

LECTURE 3 (Planning by DP)

Policy Evaluation

  • Problem : evaluate a given policy π.
  • Solution : iterative application of Bellman expectation backup.

v1 → v2 → … → vπ

Using synchronous backups (see the sketch after this list),

  • At each iteration k + 1.
  • For all states s ∈ S.
  • Update vk+1(s) from vk(s'),
  • where s' is the successor state of s.
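
A minimal sketch of iterative policy evaluation with synchronous backups, for a hypothetical tabular MDP and a uniform random policy:

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
# Hypothetical tabular MDP: P[a, s, s'] = P(s' | s, a) and R[s, a].
P = np.zeros((n_actions, n_states, n_states))
P[0] = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.0, 1.0]]
P[1] = [[0.2, 0.8, 0.0], [0.0, 0.2, 0.8], [0.0, 0.0, 1.0]]
R = np.array([[-1.0, -1.0], [-1.0, -1.0], [0.0, 0.0]])
pi = np.full((n_states, n_actions), 1.0 / n_actions)   # uniform random policy

v = np.zeros(n_states)
for k in range(1000):
    # Bellman expectation backup: update v_{k+1}(s) from v_k(s') for all states at once.
    q = R + gamma * np.stack([P[a] @ v for a in range(n_actions)], axis=1)
    v_new = (pi * q).sum(axis=1)
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new
print(v)   # approximates v_pi
```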

Policy Iteration

Given a policy π,

Evaluate the policy π,

vπ(s) = Eπ[Rt+1 + γRt+2 + γ2Rt+3 + … | St = s]

Improve the policy by acting greedily with respect to vπ.

This process of policy iteration always converges to π∗.
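
A compact sketch of the full policy iteration loop (exact evaluation, then greedy improvement), again on a made-up tabular MDP:

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
P = np.zeros((n_actions, n_states, n_states))           # hypothetical P(s' | s, a)
P[0] = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.0, 1.0]]
P[1] = [[0.2, 0.8, 0.0], [0.0, 0.2, 0.8], [0.0, 0.0, 1.0]]
R = np.array([[-1.0, -1.0], [-1.0, -1.0], [0.0, 0.0]])  # hypothetical R(s, a)

policy = np.zeros(n_states, dtype=int)                  # start with an arbitrary policy
while True:
    # 1. Policy evaluation: solve v_pi = R_pi + gamma * P_pi v_pi exactly.
    P_pi = P[policy, np.arange(n_states)]               # P_pi[s, s'] under current policy
    R_pi = R[np.arange(n_states), policy]
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
    # 2. Policy improvement: act greedily with respect to v_pi.
    q = R + gamma * np.stack([P[a] @ v for a in range(n_actions)], axis=1)
    new_policy = q.argmax(axis=1)
    if np.array_equal(new_policy, policy):              # greedy policy stopped changing: pi*
        break
    policy = new_policy
print(policy, v)
```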

Onur Copur
MSc Data Science

Data scientist & Industrial Engineer