PPG

Overview

PPG was proposed in the paper Phasic Policy Gradient. In prior methods, one must choose between using a shared network or separate networks to represent the policy and value function. Separate networks avoid interference between the two objectives, while a shared network allows useful features to be shared. PPG achieves the best of both worlds by splitting optimization into two phases: one that advances training and one that distills features.

Quick Facts

PPG is a model-free and policy-based RL algorithm.
PPG supports both discrete and continuous action spaces.
PPG supports off-policy mode and on-policy mode.
There are two value networks in PPG.
In the DI-engine implementation, off-policy PPG uses two buffers.

Key Graphs

PPG uses disjoint policy and value networks to reduce interference between objectives. The policy network includes an auxiliary value head which is used to distill the knowledge of value into the policy network.

Key Equations

The optimization of PPG alternates between two phases: a policy phase and an auxiliary phase. During the policy phase, the policy network and the value network are updated similarly to PPO. During the auxiliary phase, the value knowledge is distilled into the policy network with the joint loss:

\[L^{joint} = L^{aux} + \beta_{clone} \cdot \hat{\mathbb{E}}_{t}\left[KL\left[\pi_{\theta_{old}}\left(\cdot \mid s_{t}\right), \pi_{\theta}\left(\cdot \mid s_{t}\right)\right]\right]\]

The joint loss optimizes the auxiliary objective while preserving the original policy through the KL-divergence constraint. The auxiliary loss is defined as:

\[L^{aux} = \frac{1}{2} \cdot \hat{\mathbb{E}}_{t}\left[\left(V_{\theta_{\pi}}\left(s_{t}\right) - \hat{V}_{t}^{\mathrm{targ}}\right)^{2}\right]\]
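
As a rough illustration of the two equations above, the sketch below evaluates the joint loss for a toy batch of discrete-action states in plain Python. The function names (`kl_divergence`, `aux_loss`, `joint_loss`) and all numbers are assumptions made for this example and are not part of DI-engine; `beta_clone` stands for \(\beta_{clone}\), and `aux_values` stands for the auxiliary value head output \(V_{\theta_{\pi}}(s_t)\).

```python
import math


def kl_divergence(p_old, p_new):
    """KL[p_old, p_new] for two discrete action distributions."""
    return sum(po * math.log(po / pn) for po, pn in zip(p_old, p_new) if po > 0.0)


def aux_loss(aux_values, value_targets):
    """L^aux: half the mean squared error of the auxiliary value head."""
    errors = [(v - t) ** 2 for v, t in zip(aux_values, value_targets)]
    return 0.5 * sum(errors) / len(errors)


def joint_loss(aux_values, value_targets, old_probs, new_probs, beta_clone=1.0):
    """L^joint = L^aux + beta_clone * mean_t KL[pi_old(.|s_t), pi(.|s_t)]."""
    kl = [kl_divergence(po, pn) for po, pn in zip(old_probs, new_probs)]
    return aux_loss(aux_values, value_targets) + beta_clone * sum(kl) / len(kl)


# Toy batch of two states with two actions each (made-up numbers).
old_probs = [[0.5, 0.5], [0.9, 0.1]]
aux_values = [1.0, 2.0]
value_targets = [1.5, 2.0]

# With an unchanged policy the KL term is zero, so only L^aux remains.
loss_same_policy = joint_loss(aux_values, value_targets, old_probs, old_probs)
# Once the policy drifts from the old one, the KL term adds a positive penalty.
new_probs = [[0.6, 0.4], [0.8, 0.2]]
loss_drifted = joint_loss(aux_values, value_targets, old_probs, new_probs)
print(loss_same_policy, loss_drifted)
```

The comparison at the end shows the role of the KL term: the auxiliary objective can reshape the shared features, but any change in the policy's action distribution is penalized in proportion to `beta_clone`.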

Pseudo-code

The following flow chart shows how PPG alternates between the policy phase and the auxiliary phase.

Note

During the auxiliary phase, PPG also takes the opportunity to perform additional training on the value network.
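
The alternation between the two phases can be sketched as a simple schedule. The snippet below is a minimal illustration and not DI-engine's training loop: `ppg_schedule` is a hypothetical helper, and it assumes that `aux_freq` counts policy-phase updates between auxiliary phases and that `aux_train_epoch` counts epochs within one auxiliary phase, as the config descriptions suggest.

```python
def ppg_schedule(num_policy_updates, aux_freq=5, aux_train_epoch=6):
    """Return the ordered list of phases as (phase_name, index) tuples."""
    steps = []
    for update in range(1, num_policy_updates + 1):
        # Policy phase: PPO-style update of the policy and value networks.
        steps.append(("policy", update))
        if update % aux_freq == 0:
            # Auxiliary phase: distill value knowledge into the policy network
            # (joint loss) and give the value network additional training.
            for epoch in range(1, aux_train_epoch + 1):
                steps.append(("aux", epoch))
    return steps


schedule = ppg_schedule(num_policy_updates=10)
policy_steps = [s for s in schedule if s[0] == "policy"]
aux_steps = [s for s in schedule if s[0] == "aux"]
print(len(policy_steps), len(aux_steps))
```

In a real training loop each tuple would be replaced by the corresponding optimization step; the sketch only shows the ordering.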

Extensions

PPG can be combined with:
- Multi-step learning
- GAE
- Multiple buffers with different max reuse
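
For the GAE item in the list above, a minimal plain-Python sketch of the advantage estimate follows. `gae_advantages` is a hypothetical helper and not DI-engine's API. It assumes a single trajectory without early termination, and that `values` carries one extra entry for bootstrapping from the state after the last step. The default `gamma` and `lam` mirror the `collect.discount_factor` and `collect.gae_lambda` defaults listed in the config.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    `values` must have len(rewards) + 1 entries; the last one bootstraps
    the value of the state reached after the final step.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of the current and future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


rewards = [1.0, 1.0, 1.0]
values = [0.0, 0.0, 0.0, 0.0]
# lam=1 reduces to discounted Monte Carlo returns minus the value baseline;
# lam=0 reduces to the one-step TD error.
mc_like = gae_advantages(rewards, values, gamma=0.5, lam=1.0)
td_like = gae_advantages(rewards, values, gamma=0.5, lam=0.0)
print(mc_like, td_like)
```

The two calls at the end illustrate the bias-variance trade-off controlled by lambda: values near 1 rely on long sampled returns, values near 0 rely on the value estimate.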

Implementation

The default config is defined as follows:

class ding.policy.ppg.PPGPolicy(cfg: dict, model: Optional[Union[type, torch.nn.modules.module.Module]] = None, enable_field: Optional[List[str]] = None)

Overview:
Policy class of PPG algorithm.

Interface:
_init_learn, _data_preprocess_learn, _forward_learn, _state_dict_learn, _load_state_dict_learn, _init_collect, _forward_collect, _process_transition, _get_train_sample, _get_batch_size, _init_eval, _forward_eval, default_model, _monitor_vars_learn, learn_aux

Config:

| ID | Symbol | Type | Default Value | Description | Other (Shape) |
|----|--------|------|---------------|-------------|---------------|
| 1 | type | str | ppg | RL policy register name, refer to registry POLICY_REGISTRY | This arg is optional, a placeholder |
| 2 | cuda | bool | False | Whether to use cuda for the network | This arg can be different from modes |
| 3 | on_policy | bool | True | Whether the RL algorithm is on-policy or off-policy | |
| 4 | priority | bool | False | Whether to use priority (PER) | Priority sample, update priority |
| 5 | priority_IS_weight | bool | False | Whether to use Importance Sampling Weight to correct biased updates | IS weight |
| 6 | learn.update_per_collect | int | 5 | How many updates (iterations) to train after one collection by the collector. Only valid in serial training | This arg can vary between envs. A bigger value means more off-policy |
| 7 | learn.value_weight | float | 1.0 | The loss weight of the value network | Policy network weight is set to 1 |
| 8 | learn.entropy_weight | float | 0.01 | The loss weight of entropy regularization | Policy network weight is set to 1 |
| 9 | learn.clip_ratio | float | 0.2 | PPO clip ratio | |
| 10 | learn.adv_norm | bool | False | Whether to use advantage normalization over a whole training batch | |
| 11 | learn.aux_freq | int | 5 | The frequency (in normal update times) of auxiliary phase training | |
| 12 | learn.aux_train_epoch | int | 6 | The training epochs of the auxiliary phase | |
| 13 | learn.aux_bc_weight | int | 1 | The loss weight of behavioral cloning in the auxiliary phase | |
| 14 | collect.discount_factor | float | 0.99 | Reward's future discount factor, aka. gamma | May be 1 in sparse-reward envs |
| 15 | collect.gae_lambda | float | 0.95 | GAE lambda factor for the balance of bias and variance (1-step TD and MC) | |
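
The defaults listed above can be written as a nested Python dictionary. This is only a sketch: the nesting under `learn` and `collect` is inferred from the dotted symbol names, `ppg_default_config` and `custom_config` are names chosen for this example, and the actual `PPGPolicy` config in DI-engine may contain further fields.

```python
# Defaults transcribed from the config listing; layout inferred from dotted names.
ppg_default_config = dict(
    type='ppg',
    cuda=False,
    on_policy=True,
    priority=False,
    priority_IS_weight=False,
    learn=dict(
        update_per_collect=5,
        value_weight=1.0,
        entropy_weight=0.01,
        clip_ratio=0.2,
        adv_norm=False,
        aux_freq=5,
        aux_train_epoch=6,
        aux_bc_weight=1,
    ),
    collect=dict(
        discount_factor=0.99,
        gae_lambda=0.95,
    ),
)

# Example override: run the auxiliary phase less often, leaving the defaults intact.
custom_config = dict(
    ppg_default_config,
    learn=dict(ppg_default_config['learn'], aux_freq=8),
)
print(custom_config['learn']['aux_freq'])
```

Building the override from copies keeps `ppg_default_config` unchanged, which is convenient when several experiments share the same base settings.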

The network interface used by PPG is defined as follows:

The benchmark results of PPG as implemented in DI-engine are shown in Benchmark.

References

Karl Cobbe, Jacob Hilton, Oleg Klimov, John Schulman: “Phasic Policy Gradient”, 2020; arXiv:2009.04416, http://arxiv.org/abs/2009.04416.