Deep Reinforcement Learning

Notes from the UCL course on Reinforcement Learning.


LECTURE 1 (Introduction to RL)

Characteristics of Reinforcement Learning

  • There is no supervisor, only a reward signal.
  • Feedback is delayed, not instantaneous.
  • Time matters (sequential, not i.i.d. data).
  • The agent’s actions affect the subsequent data it receives.

Rewards

A reward Rt is a scalar feedback signal that indicates how well the agent is doing at step t. The agent’s job is to maximize the cumulative reward. Reinforcement learning is based on the Reward Hypothesis.

Sequential Decision Making

In RL, actions should be selected to maximize total future reward. Actions may have long-term consequences and rewards may be delayed. It may be better to sacrifice immediate reward to gain more long-term reward.

Agent and Environment

Agent-Environment Interaction

At each time step t the agent:

  • Executes Action At.
  • Receives observation Ot.
  • Receives scalar reward Rt.

At each time step t the environment:

  • Receives action At.
  • Emits observation Ot+1.
  • Emits scalar reward Rt+1 (a minimal sketch of this loop follows).
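
A minimal sketch of this interaction loop in Python; the Environment class, its step method, and the random action choice are hypothetical placeholders, not something defined in the course:

```python
import random

class Environment:
    """Hypothetical toy environment: action 1 yields reward, action 0 does not."""
    def step(self, action):
        reward = 1.0 if action == 1 else 0.0   # scalar reward R_{t+1}
        observation = random.random()          # observation O_{t+1}
        return observation, reward

env = Environment()
observation = random.random()                  # initial observation O_1
for t in range(5):
    action = random.choice([0, 1])             # agent executes action A_t
    observation, reward = env.step(action)     # agent receives O_{t+1} and R_{t+1}
    print(t, action, reward)
```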

Information State (a.k.a Markov state)

An information state contains all the useful information from history.

Definition

A state St is Markov if and only if:

P[St+1|St] = P[St+1|S1,…,St]

This means the future is independent of the past given the present.

Fully Observable Environments

The observation at time t is equal to the agent state at time t and the environment state at time t.

Ot = Sta = Ste.

This is a Markov decision process (MDP).

Partially Observable Environments

The agent indirectly observes the environment.

  • A robot with camera vision isn’t told its absolute location.
  • A trading agent only observes current prices.
  • A poker playing agent only observes public cards.

Now, the agent state is not equal to the environment state. Formally, this is a partially observable Markov decision process (POMDP).

The agent must construct its own state representation Sta, e.g.

  • Complete history, Sta = Ht.
  • Beliefs of environment state.
  • Recurrent neural networks.

Major Components of an RL Agent

An RL agent may include one or more of these components:

  • Policy: agent’s behaviour function.
  • Value function: how good is each state and/or action.
  • Model: agent’s representation of the environment.

Policy

  • A policy is a map from state to action (sketched in code after this list), e.g.
    • Deterministic Policy
    • Stochastic Policy
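
A tiny sketch of both kinds of policy; the states "s1", "s2" and actions "left", "right" are invented purely for illustration:

```python
import random

# Deterministic policy: a direct map from state to action, a = pi(s).
deterministic_policy = {"s1": "left", "s2": "right"}

# Stochastic policy: a distribution over actions given the state, pi(a|s).
stochastic_policy = {
    "s1": {"left": 0.8, "right": 0.2},
    "s2": {"left": 0.3, "right": 0.7},
}

def sample_action(policy, state):
    """Sample an action from a stochastic policy pi(a|s)."""
    actions, probs = zip(*policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(deterministic_policy["s1"])               # always 'left'
print(sample_action(stochastic_policy, "s1"))   # 'left' about 80% of the time
```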

Value Function

  • Value function is a prediction of future reward.
  • Used to evaluate the goodness of states.
  • And therefore to select between actions.

Model

A model predicts what the environment will do next.

  • Transition model: predicts the next state.
  • Reward model: predicts the next reward (a small tabular sketch follows this list).
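
As a sketch, a simple tabular model can store these two predictions as lookup tables; the states, action, and numbers below are invented:

```python
# Transition model: P(s' | s, a), stored as nested dictionaries.
transition_model = {
    ("s1", "a"): {"s1": 0.1, "s2": 0.9},
    ("s2", "a"): {"s1": 0.5, "s2": 0.5},
}

# Reward model: expected next reward R(s, a).
reward_model = {
    ("s1", "a"): 0.0,
    ("s2", "a"): 1.0,
}

# The model predicts what the environment will do next for a state-action pair.
print(transition_model[("s1", "a")], reward_model[("s1", "a")])
```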

Categorizing RL agents

  • Value Based.
  • Policy Based.
  • Actor Critic.
  • Model Free.
  • Model Based.

Learning and Planning

Two fundamental problems in sequential decision making:

  • Reinforcement Learning:

    • The environment is initially unknown.
    • The agent interacts with the environment.
    • The agent improves its policy.
  • Planning:

    • A model of the environment is known.
    • The agent performs computations with its model (without any external interaction).
    • The agent improves its policy.

Here you can see my Connect4 bot working in a planning manner, finding the optimal policy with a tree search.

LECTURE 2 (MDP)

Markov Process

State Transition Matrix

For a Markov state s and successor state s', the state transition probability is defined by,

Pss' = P[St+1 = s' |St = s]

The state transition matrix P defines transition probabilities from all states s to all successor states s'.

Transition Matrix

Markov Process

A Markov process is a memoryless random process, i.e. a sequence of random states S1, S2, …, Sn with the Markov property.

Example: Student Markov Chain

In the figure below, you can see an illustration of a student Markov chain :)

Student Markov Chain
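
A minimal sketch of sampling episodes from a Markov chain given its transition matrix; the three states and probabilities below are invented, not the actual student chain from the figure:

```python
import numpy as np

states = ["Class", "Facebook", "Sleep"]   # hypothetical state names
P = np.array([                            # P[i, j] = P[S_{t+1} = j | S_t = i]
    [0.5, 0.4, 0.1],
    [0.6, 0.3, 0.1],
    [0.0, 0.0, 1.0],                      # 'Sleep' is absorbing (terminal)
])

rng = np.random.default_rng(0)
s = 0                                     # start in 'Class'
episode = [states[s]]
while states[s] != "Sleep":
    s = rng.choice(len(states), p=P[s])   # sample next state from row P[s]
    episode.append(states[s])
print(episode)
```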

Markov Reward Process

A Markov reward process is a Markov chain with values. It is a tuple <S, P, R, γ> where:

  • P is the state transition probability matrix, Pss' = P[St+1 = s' | St = s]
  • R is a reward function, Rs = E[Rt+1 | St = s]
  • γ is a discount factor, γ ∈ [0, 1]

Return

The return Gt is the total discounted reward from time-step t, where gamma is the discount factor and takes values between 0 and 1. This values immediate reward above delayed reward: gamma close to 0 leads to "myopic" evaluation, while gamma close to 1 leads to "far-sighted" evaluation.

Gt = Rt+1 + γRt+2 + γ2Rt+3 + …
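
A one-function sketch of computing this return for a finite reward sequence (the rewards are made up), showing how the discount factor changes the evaluation:

```python
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ..."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

rewards = [0.0, 0.0, 1.0, 5.0]
print(discounted_return(rewards, gamma=0.9))   # far-sighted: future rewards count
print(discounted_return(rewards, gamma=0.1))   # myopic: mostly immediate reward
```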

Value Functions

The value function v(s) gives the long-term value of state s. The state-value function v(s) of an MRP is the expected return starting from state s.

v(s) = E[Gt |St = s]

Bellman Equation for MRPs

The value function can be decomposed into two parts:

  • immediate reward Rt+1
  • discounted value of the successor state, γv(St+1)

v(s) = E[Rt+1 + γv(St+1) | St = s]

Bellman Equation in Matrix Form

The Bellman equation can be expressed concisely using matrices,

v = R + γPv

where v is a column vector with one entry per state

Written out for each state s: v(s) = Rs + γ Σs' Pss' v(s')

Solving the Bellman Equation

The Bellman equation is a linear equation, so it can be solved directly: v = (I − γP)−1 R. However, the computational complexity is O(n3) for n states, so a direct solution is only possible for small MRPs. For large MRPs, iterative methods include:

  • Dynamic programming
  • Monte-Carlo evaluation
  • Temporal Difference learning
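
For a small MRP, the direct solution v = (I − γP)−1 R mentioned above takes only a few lines; the transition matrix and rewards here are invented for illustration:

```python
import numpy as np

P = np.array([[0.5, 0.5, 0.0],    # hypothetical 3-state MRP transition matrix
              [0.2, 0.3, 0.5],
              [0.0, 0.0, 1.0]])   # absorbing final state
R = np.array([-1.0, -2.0, 0.0])   # expected immediate reward per state
gamma = 0.9

# Solve (I - gamma * P) v = R directly; O(n^3), fine for small n.
v = np.linalg.solve(np.eye(3) - gamma * P, R)
print(v)
```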

Markov Decision Process

A Markov decision process is a Markov reward process with decisions. It is an environment in which all states are Markov. In the figure below, you can see the Student MDP. This time there are no transition probabilities, only decisions and rewards. Except going to the pub… Once you go to the pub, you can’t make further decisions :)

Student MDP

To make decisions, we need policies.

Policies

A policy π is a distribution over actions given states.

π(a|s) = P [At = a | St = s]

  • A policy fully defines the behaviour of an agent.
  • MDP policies depend on the current state.
  • Policies are stationary (time-independent).

Given an MDP M = <S, A, P, R, γ> and a policy π,

  • The state sequence S1, S2, … is a Markov process <S,Pπ>.
  • The state and reward sequence S1, R2, S2, … is a Markov reward process <S, Pπ, Rπ, γ>.

where Pπss' = Σa π(a|s) Pass' and Rπs = Σa π(a|s) Ras.

Value Function

State - Value Function

The state-value function vπ(s) of an MDP is the expected return starting from state s, and then following policy π.

vπ(s) = Eπ [Gt | St = s]

The state-value function can again be decomposed into immediate reward plus discounted value of successor state.

vπ(s) = Eπ [Rt+1 + γvπ(St+1) | St = s]

Action - Value Function

The action-value function qπ(s, a) is the expected return starting from state s, taking action a, and then following policy π.

qπ(s, a) = Eπ [Gt | St = s, At = a]

The action-value function can similarly be decomposed,

qπ(s, a) = Eπ [Rt+1 + γqπ(St+1, At+1) | St = s, At = a]

Optimal Value Function

  • The optimal state-value function v∗(s) is the maximum value function over all policies.
  • The optimal action-value function q∗(s, a) is the maximum action-value function over all policies.

Finding an Optimal Policy

An optimal policy can be found by maximizing over q∗(s, a),

π∗(a|s) = 1 if a = argmaxa q∗(s, a), and 0 otherwise

  • There is always a deterministic optimal policy for any MDP.

In the figure below, the red arcs show the optimal policy for the student MDP.

Optimal Policy for the Student MDP
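
A sketch of reading off a deterministic optimal policy from a table of q∗ values; the numbers are hypothetical:

```python
import numpy as np

# Hypothetical optimal action-values q*(s, a): rows are states, columns are actions.
q_star = np.array([[1.0, 2.5],
                   [0.3, 0.1],
                   [4.0, 2.0]])

# Act greedily with respect to q*: pi*(s) = argmax_a q*(s, a).
optimal_policy = q_star.argmax(axis=1)
print(optimal_policy)   # -> [1 0 0]
```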

Solving the Bellman Optimality Equation

The Bellman optimality equation is non-linear, so we use iterative solution methods such as:

  • Value Iteration
  • Policy Iteration
  • Q-learning
  • Sarsa
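
As a small sketch of the first of these methods, value iteration repeatedly applies the Bellman optimality backup to a tabular MDP; the transition and reward arrays below are hypothetical:

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
# Hypothetical tabular MDP: P[a, s, s'] = P(s' | s, a) and R[s, a].
P = np.zeros((n_actions, n_states, n_states))
P[0] = [[0.8, 0.2, 0.0], [0.0, 0.8, 0.2], [0.0, 0.0, 1.0]]
P[1] = [[0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.0, 0.0, 1.0]]
R = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]])

v = np.zeros(n_states)
for _ in range(1000):
    # Bellman optimality backup: v(s) = max_a [R(s, a) + gamma * sum_s' P(s'|s, a) v(s')]
    q = R + gamma * np.stack([P[a] @ v for a in range(n_actions)], axis=1)
    v_new = q.max(axis=1)
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new
print(v)   # approximates v*
```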

LECTURE 3 (Planning by DP)

Policy Evaluation

  • Problem : evaluate a given policy π.
  • Solution : iterative application of Bellman expectation backup.

v1 → v2 → … → vπ

Using synchronous backups (see the sketch after this list),

  • At each iteration k + 1.
  • For all states s ∈ S.
  • Update vk+1(s) from vk(s'),
  • where s' is the successor state of s.
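
A minimal sketch of iterative policy evaluation with synchronous backups, for a hypothetical tabular MDP and a uniform random policy:

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
# Hypothetical tabular MDP: P[a, s, s'] = P(s' | s, a) and R[s, a].
P = np.zeros((n_actions, n_states, n_states))
P[0] = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.0, 1.0]]
P[1] = [[0.2, 0.8, 0.0], [0.0, 0.2, 0.8], [0.0, 0.0, 1.0]]
R = np.array([[-1.0, -1.0], [-1.0, -1.0], [0.0, 0.0]])
pi = np.full((n_states, n_actions), 1.0 / n_actions)   # uniform random policy

v = np.zeros(n_states)
for k in range(1000):
    # Bellman expectation backup: update v_{k+1}(s) from v_k(s') for all states at once.
    q = R + gamma * np.stack([P[a] @ v for a in range(n_actions)], axis=1)
    v_new = (pi * q).sum(axis=1)
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new
print(v)   # approximates v_pi
```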

Policy Iteration

Given a policy π,

Evaluate the policy π,

vπ(s) = Eπ[Rt+1 + γRt+2 + γ2Rt+3 + … | St = s]

Improve the policy by acting greedily with respect to vπ.

This process of policy iteration always converges to π∗.
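
A compact sketch of the full policy iteration loop (exact evaluation, then greedy improvement), again on a made-up tabular MDP:

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
P = np.zeros((n_actions, n_states, n_states))           # hypothetical P(s' | s, a)
P[0] = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.0, 1.0]]
P[1] = [[0.2, 0.8, 0.0], [0.0, 0.2, 0.8], [0.0, 0.0, 1.0]]
R = np.array([[-1.0, -1.0], [-1.0, -1.0], [0.0, 0.0]])  # hypothetical R(s, a)

policy = np.zeros(n_states, dtype=int)                  # start with an arbitrary policy
while True:
    # 1. Policy evaluation: solve v_pi = R_pi + gamma * P_pi v_pi exactly.
    P_pi = P[policy, np.arange(n_states)]               # P_pi[s, s'] under current policy
    R_pi = R[np.arange(n_states), policy]
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
    # 2. Policy improvement: act greedily with respect to v_pi.
    q = R + gamma * np.stack([P[a] @ v for a in range(n_actions)], axis=1)
    new_policy = q.argmax(axis=1)
    if np.array_equal(new_policy, policy):              # greedy policy stopped changing: pi*
        break
    policy = new_policy
print(policy, v)
```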

Onur Copur
MSc Data Science

Data scientist & Industrial Engineer