PPG

Overview

PPG (Phasic Policy Gradient) was proposed in the paper "Phasic Policy Gradient". In prior methods, one must choose between a shared network and separate networks to represent the policy and the value function. Separate networks avoid interference between the two objectives, while a shared network lets both objectives reuse useful features. PPG achieves the best of both worlds by splitting optimization into two phases: a policy phase that advances training and an auxiliary phase that distills features.

Quick Facts

  1. PPG is a model-free and policy-based RL algorithm.

  2. PPG supports both discrete and continuous action spaces.

  3. PPG supports off-policy mode and on-policy mode.

  4. There are two value networks in PPG.

  5. In the DI-engine implementation, off-policy PPG uses two buffers.

Key Graphs

PPG uses disjoint policy and value networks to reduce interference between objectives. The policy network includes an auxiliary value head which is used to distill the knowledge of value into the policy network.

../_images/ppg_net.png
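
The layout in the figure can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not DI-engine's actual model: the class names `PolicyNet` and `ValueNet`, the MLP encoder, the hidden size, and the discrete action space are all hypothetical choices made for the example.

```python
import torch
import torch.nn as nn


class PolicyNet(nn.Module):
    """Policy network with an auxiliary value head (hypothetical name)."""

    def __init__(self, obs_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.policy_head = nn.Linear(hidden, action_dim)  # action logits
        self.aux_value_head = nn.Linear(hidden, 1)  # trained in the auxiliary phase

    def forward(self, obs: torch.Tensor):
        feat = self.encoder(obs)
        return self.policy_head(feat), self.aux_value_head(feat).squeeze(-1)


class ValueNet(nn.Module):
    """Separate value network; it shares no parameters with PolicyNet."""

    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)


obs = torch.randn(4, 8)  # batch of 4 observations with 8 features each
policy, value = PolicyNet(obs_dim=8, action_dim=3), ValueNet(obs_dim=8)
logits, aux_value = policy(obs)
state_value = value(obs)
```

Because the two modules hold disjoint parameters, gradients of the value loss never touch the policy encoder; only the auxiliary head lets value information reach the policy features.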

Key Equations

The optimization of PPG alternates between two phases, a policy phase and an auxiliary phase. During the policy phase, the policy network and the value network are updated similar to PPO. During the auxiliary phase, the value knowledge is distilled into the policy network with the joint loss:

\[L^{joint}=L^{aux}+\beta_{clone} \cdot \hat{\mathbb{E}}_{t}\left[KL\left[\pi_{\theta_{old}}\left(\cdot \mid s_{t}\right), \pi_{\theta}\left(\cdot \mid s_{t}\right)\right]\right]\]

The joint loss optimizes the auxiliary objective while preserving the original policy through the KL-divergence constraint. The auxiliary loss is defined as:

\[L^{aux}=\frac{1}{2} \cdot \hat{\mathbb{E}}_{t}\left[\left(V_{\theta_{\pi}}\left(s_{t}\right)-\hat{V}_{t}^{\mathrm{targ}}\right)^{2}\right]\]
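
To make the two equations concrete, here is a small numeric sketch in plain Python for a discrete action space. The function names `kl_divergence`, `aux_loss` and `joint_loss` are hypothetical and not part of DI-engine; `beta_clone` defaults to 1.0 here, which matches the default of `learn.aux_bc_weight` in the config (assuming that entry plays the role of the clone coefficient).

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL[p, q] for two discrete distributions over the same actions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def aux_loss(aux_values: Sequence[float], value_targets: Sequence[float]) -> float:
    """L^aux: half the mean squared error of the auxiliary value head."""
    errors = [(v - t) ** 2 for v, t in zip(aux_values, value_targets)]
    return 0.5 * sum(errors) / len(errors)


def joint_loss(aux_values, value_targets, old_probs, new_probs, beta_clone=1.0):
    """L^joint = L^aux + beta_clone * E_t[KL[pi_old, pi_new]]."""
    kls = [kl_divergence(p_old, p_new) for p_old, p_new in zip(old_probs, new_probs)]
    return aux_loss(aux_values, value_targets) + beta_clone * sum(kls) / len(kls)


# Two time steps, two actions. The policy is unchanged, so the KL term is zero
# and only the value-distillation term remains.
old_probs = [[0.5, 0.5], [0.9, 0.1]]
loss = joint_loss([1.0, 2.0], [0.0, 2.0], old_probs, old_probs)
```

When the new policy drifts away from the old one, the KL term becomes positive and penalizes the drift, which is how the auxiliary phase keeps the policy's behavior close to what it was before distillation.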

Pseudo-code

The following flow chart shows how PPG alternates between the policy phase and the auxiliary phase.

../_images/PPG.png

Note

During the auxiliary phase, PPG also takes the opportunity to perform additional training on the value network.
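
The alternation can also be sketched as a simple schedule. This is an illustrative reading of the config entries `learn.aux_freq` and `learn.aux_train_epoch`, assuming one auxiliary phase is triggered after every `aux_freq` policy-phase updates; the helper `phase_schedule` is hypothetical and does not reproduce DI-engine's training loop.

```python
from typing import List


def phase_schedule(
    total_policy_updates: int, aux_freq: int = 5, aux_train_epoch: int = 6
) -> List[str]:
    """Return the order of training steps as a list of labels."""
    steps: List[str] = []
    for update in range(1, total_policy_updates + 1):
        steps.append("policy")  # PPO-style update of policy and value networks
        if update % aux_freq == 0:
            # Auxiliary phase: distill value knowledge into the policy network
            # and perform additional training on the value network.
            steps.extend(["aux"] * aux_train_epoch)
    return steps


schedule = phase_schedule(total_policy_updates=10)
```

With the defaults, ten policy updates are interleaved with two auxiliary phases of six epochs each.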

Extensions

  • PPG can be combined with:
    • Multi-step learning

    • GAE

    • Multi buffer, different max reuse
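
As an illustration of the GAE extension, the sketch below computes advantages for a single trajectory in plain Python. The function `gae_advantages` is a hypothetical helper, not DI-engine's implementation; it assumes the trajectory contains no episode boundary and that `last_value` is the bootstrap value of the state after the final step. The defaults mirror `collect.discount_factor` and `collect.gae_lambda` from the config.

```python
from typing import List, Sequence


def gae_advantages(
    rewards: Sequence[float],
    values: Sequence[float],
    last_value: float,
    gamma: float = 0.99,
    gae_lambda: float = 0.95,
) -> List[float]:
    """Generalized Advantage Estimation over one trajectory, computed backwards."""
    advantages = [0.0] * len(rewards)
    next_value, running = last_value, 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors
        running = delta + gamma * gae_lambda * running
        advantages[t] = running
        next_value = values[t]
    return advantages


# With gamma = lambda = 1 the estimate reduces to Monte Carlo return minus value.
adv = gae_advantages([1.0, 1.0], [0.5, 0.5], last_value=0.0, gamma=1.0, gae_lambda=1.0)
```

Setting `gae_lambda` to 0 instead yields the one-step TD error at every step, which is the low-variance, higher-bias end of the trade-off.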

Implementation

The default config is defined as follows:

class ding.policy.ppg.PPGPolicy(cfg: dict, model: Optional[Union[type, torch.nn.modules.module.Module]] = None, enable_field: Optional[List[str]] = None)
Overview:

Policy class of PPG algorithm.

Interface:

_init_learn, _data_preprocess_learn, _forward_learn, _state_dict_learn, _load_state_dict_learn, _init_collect, _forward_collect, _process_transition, _get_train_sample, _get_batch_size, _init_eval, _forward_eval, default_model, _monitor_vars_learn, learn_aux

Config:

| ID | Symbol | Type | Default Value | Description | Other(Shape) |
|----|--------|------|---------------|-------------|--------------|
| 1 | type | str | ppg | RL policy register name, refer to registry POLICY_REGISTRY | This arg is optional, a placeholder |
| 2 | cuda | bool | False | Whether to use cuda for network | This arg can be different from modes |
| 3 | on_policy | bool | True | Whether the RL algorithm is on-policy or off-policy | |
| 4 | priority | bool | False | Whether to use priority (PER) | Priority sample, update priority |
| 5 | priority_IS_weight | bool | False | Whether to use Importance Sampling Weight to correct biased update | IS weight |
| 6 | learn.update_per_collect | int | 5 | How many updates (iterations) to train after one collection by the collector. Only valid in serial training | This arg can vary across envs. A bigger value means more off-policy |
| 7 | learn.value_weight | float | 1.0 | The loss weight of value network | Policy network weight is set to 1 |
| 8 | learn.entropy_weight | float | 0.01 | The loss weight of entropy regularization | Policy network weight is set to 1 |
| 9 | learn.clip_ratio | float | 0.2 | PPO clip ratio | |
| 10 | learn.adv_norm | bool | False | Whether to use advantage norm in a whole training batch | |
| 11 | learn.aux_freq | int | 5 | The frequency (normal update times) of auxiliary phase training | |
| 12 | learn.aux_train_epoch | int | 6 | The training epochs of auxiliary phase | |
| 13 | learn.aux_bc_weight | int | 1 | The loss weight of behavioral_cloning in auxiliary phase | |
| 14 | collect.discount_factor | float | 0.99 | Reward's future discount factor, aka. gamma | May be 1 in sparse reward envs |
| 15 | collect.gae_lambda | float | 0.95 | GAE lambda factor for the balance of bias and variance (1-step TD and MC) | |
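
For quick reference, the defaults listed above can be written as a nested Python dict. The nesting under `learn` and `collect` is inferred from the dotted symbol names and is an assumption; this is a sketch of the documented values, not a copy of DI-engine's built-in default config, which may contain further fields.

```python
# Default values taken from the config table; the nested layout is an assumption.
ppg_config = dict(
    type="ppg",
    cuda=False,
    on_policy=True,
    priority=False,
    priority_IS_weight=False,
    learn=dict(
        update_per_collect=5,
        value_weight=1.0,
        entropy_weight=0.01,
        clip_ratio=0.2,
        adv_norm=False,
        aux_freq=5,          # policy-phase updates between auxiliary phases
        aux_train_epoch=6,   # epochs per auxiliary phase
        aux_bc_weight=1,     # weight of the behavioral cloning (KL) term
    ),
    collect=dict(
        discount_factor=0.99,
        gae_lambda=0.95,
    ),
)
```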

The network interface used by PPG is defined as follows:

The benchmark results of PPG as implemented in DI-engine are shown in Benchmark.

References

Karl Cobbe, Jacob Hilton, Oleg Klimov, John Schulman: "Phasic Policy Gradient", 2020; arXiv:2009.04416, http://arxiv.org/abs/2009.04416.