DDPG¶

Overview¶

Deep Deterministic Policy Gradient (DDPG), proposed in the 2015 paper Continuous control with deep reinforcement learning, is an algorithm that learns a Q-function and a policy simultaneously. DDPG is a model-free, actor-critic algorithm based on the deterministic policy gradient (DPG) that can operate over high-dimensional, continuous action spaces. The DPG algorithm, introduced in the paper Deterministic policy gradient algorithms, is similar to NFQCA.

Quick Facts¶

DDPG is only used for environments with continuous action spaces (e.g. MuJoCo).
DDPG is an off-policy algorithm.
DDPG is a model-free, actor-critic RL algorithm that optimizes the actor network and the critic network separately.
DDPG usually uses an Ornstein-Uhlenbeck process or Gaussian noise (the default in our implementation) for exploration.

Key Equations or Key Graphs¶

The DDPG algorithm maintains a parameterized actor function $\mu\left(s \mid \theta^{\mu}\right)$ which specifies the current policy by deterministically mapping states to a specific action. The critic $Q(s, a)$ is learned using the Bellman equation as in Q-learning. The actor is updated by applying the chain rule to the expected return from the start distribution $J$ with respect to the actor parameters:

\[\begin{split}\begin{aligned} \nabla_{\theta^{\mu}} J & \approx \mathbb{E}_{s_{t} \sim \rho^{\beta}}\left[\left.\nabla_{\theta^{\mu}} Q\left(s, a \mid \theta^{Q}\right)\right|_{s=s_{t}, a=\mu\left(s_{t} \mid \theta^{\mu}\right)}\right] \\ &=\mathbb{E}_{s_{t} \sim \rho^{\beta}}\left[\left.\left.\nabla_{a} Q\left(s, a \mid \theta^{Q}\right)\right|_{s=s_{t}, a=\mu\left(s_{t}\right)} \nabla_{\theta^{\mu}} \mu\left(s \mid \theta^{\mu}\right)\right|_{s=s_{t}}\right] \end{aligned}\end{split}\]
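
As a sanity check of the chain-rule form, the following minimal sketch compares the analytic gradient with a finite-difference estimate of $J$. The one-dimensional linear actor $\mu(s) = \theta s$ and the hand-written critic $Q(s, a) = -(a - 2s)^2$ are illustrative assumptions chosen so the gradients can be written by hand; they are not part of DDPG or DI-engine.

```python
# Toy check of the deterministic policy gradient (all functions are assumptions).

def mu(theta, s):
    """Hypothetical linear actor: a = theta * s."""
    return theta * s


def q(s, a):
    """Hypothetical critic, maximal when a == 2 * s."""
    return -(a - 2.0 * s) ** 2


def dq_da(s, a):
    """Analytic gradient of the critic with respect to the action."""
    return -2.0 * (a - 2.0 * s)


def dmu_dtheta(theta, s):
    """Analytic gradient of the actor with respect to its parameter."""
    return s


def policy_gradient(theta, states):
    """Chain rule: mean over states of dQ/da (at a = mu(s)) times dmu/dtheta."""
    grads = [dq_da(s, mu(theta, s)) * dmu_dtheta(theta, s) for s in states]
    return sum(grads) / len(grads)


def objective(theta, states):
    """J(theta): mean of Q(s, mu(s)) over the sampled states."""
    return sum(q(s, mu(theta, s)) for s in states) / len(states)


states = [-1.0, 0.5, 2.0]
theta = 0.3
eps = 1e-6
analytic = policy_gradient(theta, states)
# Central finite difference of J with respect to theta.
numeric = (objective(theta + eps, states) - objective(theta - eps, states)) / (2 * eps)
```

Both estimates agree up to floating-point error, which is what the second line of the equation states: differentiating $Q$ through the action and then the action through the actor parameters gives the gradient of $J$.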

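DDPG uses a replay buffer so that minibatches sampled uniformly from it are approximately independently and identically distributed, which breaks the temporal correlation between consecutive transitions.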
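
A minimal sketch of such a buffer is shown below, assuming a fixed capacity with first-in-first-out eviction and uniform sampling. The class name `ReplayBuffer` and its methods are hypothetical and are not the DI-engine buffer API.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size FIFO buffer with uniform random sampling (illustrative only)."""

    def __init__(self, capacity, seed=None):
        # A deque with maxlen evicts the oldest transition once it is full.
        self._storage = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def push(self, obs, action, reward, next_obs, done):
        self._storage.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement mixes old and new transitions,
        # so a batch is not a run of consecutive time steps.
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)


buffer = ReplayBuffer(capacity=5, seed=0)
for t in range(8):
    buffer.push(obs=t, action=0.1 * t, reward=1.0, next_obs=t + 1, done=False)
batch = buffer.sample(batch_size=3)
```

After eight pushes into a buffer of capacity five, only the five most recent transitions remain, and each sampled batch draws from them uniformly.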

To keep training stable in many environments, DDPG uses “soft” target updates for both the actor and the critic. Specifically, DDPG creates copies of the actor and critic networks, $\mu' \left(s \mid \theta^{\mu'}\right)$ and $Q'(s, a|\theta^{Q'})$ respectively, that are used for calculating the target values. The weights of these target networks are then updated by having them slowly track the learned networks:

\[\theta' \leftarrow \tau \theta + (1 - \tau)\theta',\]

where $\tau \ll 1$. This means that the target values are constrained to change slowly, greatly improving the stability of learning.
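
The soft update can be sketched on plain Python numbers as follows. The dictionary-of-floats parameter layout and the helper name `soft_update` are assumptions made for illustration; a real implementation would apply the same rule to every tensor of the network.

```python
def soft_update(target_params, online_params, tau):
    """Return new target parameters: tau * online + (1 - tau) * target."""
    return {
        name: tau * online_params[name] + (1.0 - tau) * target_params[name]
        for name in target_params
    }


online = {"w": 1.0, "b": -2.0}   # learned network parameters (made-up values)
target = {"w": 0.0, "b": 0.0}    # target network parameters
tau = 0.005

# Three consecutive soft updates towards fixed online parameters.
for _ in range(3):
    target = soft_update(target, online, tau)
```

With fixed online parameters, the gap between target and online values shrinks by a factor of $(1 - \tau)$ per update, so with $\tau = 0.005$ the target moves only about 1.5% of the way after three updates.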

A major challenge of learning in continuous action spaces is exploration. Because DDPG is off-policy, the exploration policy can be treated independently from the learning algorithm: it is constructed by adding noise sampled from a noise process $\mathcal{N}$ to the actor policy:

\[\mu^{\prime}\left(s_{t}\right)=\mu\left(s_{t} \mid \theta_{t}^{\mu}\right)+\mathcal{N}\]
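
A minimal sketch of this exploration rule with zero-mean Gaussian noise is given below. The helper name `explore_action`, the action bounds, and the clipping step are assumptions for illustration (clipping keeps the noisy action inside the valid action range).

```python
import random


def explore_action(action, sigma, low, high, rng):
    """Add zero-mean Gaussian noise to each action dimension, then clip to bounds."""
    noisy = [a + rng.gauss(0.0, sigma) for a in action]
    return [min(max(a, low), high) for a in noisy]


rng = random.Random(0)
deterministic_action = [0.95, -0.2]  # stand-in for the actor output mu(s_t)
noisy_action = explore_action(deterministic_action, sigma=0.1, low=-1.0, high=1.0, rng=rng)
```

Setting `sigma` to 0 recovers the deterministic policy, which is the behaviour wanted at evaluation time.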

Pseudocode¶

Algorithm: Deep Deterministic Policy Gradient

1. Input: initial policy parameters $\theta$, Q-function parameters $\phi$, empty replay buffer $\mathcal{D}$.
2. Set target parameters equal to main parameters: $\theta_{\text{targ}} \leftarrow \theta$, $\phi_{\text{targ}} \leftarrow \phi$.
3. Repeat until convergence:
    1. Observe state $s$ and select action $a = \text{clip}(\mu_{\theta}(s) + \epsilon, a_{Low}, a_{High})$, where $\epsilon \sim \mathcal{N}$.
    2. Execute $a$ in the environment.
    3. Observe next state $s'$, reward $r$, and done signal $d$ to indicate whether $s'$ is terminal.
    4. Store $(s,a,r,s',d)$ in replay buffer $\mathcal{D}$.
    5. If $s'$ is terminal, reset environment state.
    6. If it's time to update, then for however many updates:
        1. Randomly sample a batch of transitions, $B = \{ (s,a,r,s',d) \}$ from $\mathcal{D}$.
        2. Compute targets

           \[y(r,s',d) = r + \gamma (1-d) Q_{\phi_{\text{targ}}}(s', \mu_{\theta_{\text{targ}}}(s'))\]

        3. Update Q-function by one step of gradient descent using

           \[\nabla_{\phi} \frac{1}{|B|}\sum_{(s,a,r,s',d) \in B} \left( Q_{\phi}(s,a) - y(r,s',d) \right)^2\]

        4. Update policy by one step of gradient ascent using

           \[\nabla_{\theta} \frac{1}{|B|}\sum_{s \in B}Q_{\phi}(s, \mu_{\theta}(s))\]

        5. Update target networks with

           \[\begin{aligned} \phi_{\text{targ}} &\leftarrow \rho \phi_{\text{targ}} + (1-\rho) \phi \\ \theta_{\text{targ}} &\leftarrow \rho \theta_{\text{targ}} + (1-\rho) \theta \end{aligned}\]

Here $\rho$ is the polyak coefficient; it corresponds to $\rho = 1 - \tau$ in the soft-update rule given earlier.
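
The target computation and the critic loss from the pseudocode can be sketched in plain Python as follows. The function names and the toy numbers are assumptions for illustration; in practice the next-state Q-values come from the target critic evaluated at the target actor's action.

```python
def td_targets(rewards, dones, next_q_values, gamma):
    """y = r + gamma * (1 - d) * Q_targ(s', mu_targ(s')) for each transition."""
    return [
        r + gamma * (1.0 - float(d)) * q_next
        for r, d, q_next in zip(rewards, dones, next_q_values)
    ]


def critic_loss(q_values, targets):
    """Mean squared Bellman error over the batch."""
    return sum((q - y) ** 2 for q, y in zip(q_values, targets)) / len(q_values)


rewards = [1.0, 0.0, 2.0]
dones = [False, False, True]
next_q = [10.0, 5.0, 100.0]  # made-up target-network estimates at s'
targets = td_targets(rewards, dones, next_q, gamma=0.99)
loss = critic_loss([10.0, 5.0, 2.0], targets)
```

Note how the done flag removes the bootstrap term for the third transition: its target equals the reward alone, regardless of the large next-state estimate.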

Extensions¶

DDPG can be combined with:

Target Network

The paper Continuous control with deep reinforcement learning proposes soft target updates to keep network training stable. We implement the soft-update target network for the actor and critic through TargetNetworkWrapper in model_wrap, and configure it with learn.target_theta.

Replay Buffers

random_collect_size is set to 25000 by default for DDPG/TD3, while it is 10000 for SAC. We simply follow the SpinningUp default setting and use a random policy to collect initialization data. We configure random_collect_size for data collection.

Gaussian noise during transition collection

For the exploration noise process, the DDPG paper uses temporally correlated noise in order to explore well in physical environments that have momentum. Specifically, it uses an Ornstein-Uhlenbeck process with $\theta = 0.15$ and $\sigma = 0.2$. The Ornstein-Uhlenbeck process models the velocity of a Brownian particle with friction, which results in temporally correlated values centered around 0. However, we use Gaussian noise instead of Ornstein-Uhlenbeck noise, because the latter has too many hyper-parameters. We configure collect.noise_sigma to control the exploration.
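
For comparison with plain Gaussian noise, the following is a minimal sketch of an Euler-discretized Ornstein-Uhlenbeck process with $\theta = 0.15$ and $\sigma = 0.2$. The class name `OUNoise`, the step size `dt = 1.0`, and the long-run mean `mu = 0.0` are assumptions for illustration; this is not the DI-engine implementation.

```python
import math
import random


class OUNoise:
    """Euler-discretized Ornstein-Uhlenbeck process (illustrative sketch)."""

    def __init__(self, theta=0.15, sigma=0.2, mu=0.0, dt=1.0, seed=None):
        self.theta, self.sigma, self.mu, self.dt = theta, sigma, mu, dt
        self._rng = random.Random(seed)
        self.x = mu  # the process starts at its long-run mean

    def sample(self):
        # Mean-reverting drift pulls x back towards mu (the "friction" term).
        drift = self.theta * (self.mu - self.x) * self.dt
        # Gaussian increment scaled by sigma and the square root of the step size.
        diffusion = self.sigma * math.sqrt(self.dt) * self._rng.gauss(0.0, 1.0)
        self.x += drift + diffusion
        return self.x


noise = OUNoise(seed=0)
samples = [noise.sample() for _ in range(5)]
```

Each sample depends on the previous one, which is the source of the temporal correlation. The sketch also makes the extra hyper-parameters visible (`theta`, `mu`, `dt` in addition to `sigma`), whereas independent Gaussian noise needs only `sigma`.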

Implementations¶

The default config is defined as follows:

class ding.policy.ddpg.DDPGPolicy(cfg: dict, model: Optional[Union[type, torch.nn.modules.module.Module]] = None, enable_field: Optional[List[str]] = None)[source]

Overview: Policy class of DDPG algorithm.
Property: learn_mode, collect_mode, eval_mode

Config:

| ID | Symbol | Type | Default Value | Description | Other (Shape) |
|----|--------|------|---------------|-------------|---------------|
| 1 | type | str | ddpg | RL policy register name, refer to registry POLICY_REGISTRY. | This arg is optional, a placeholder. |
| 2 | cuda | bool | True | Whether to use cuda for network. | |
| 3 | random_collect_size | int | 25000 | Number of randomly collected training samples in replay buffer when training starts. | Default to 25000 for DDPG/TD3, 10000 for SAC. |
| 4 | model.twin_critic | bool | False | Whether to use two critic networks or only one. | Default False for DDPG. Clipped Double Q-learning method in TD3 paper. |
| 5 | learn.learning_rate_actor | float | 1e-3 | Learning rate for actor network (aka. policy). | |
| 6 | learn.learning_rate_critic | float | 1e-3 | Learning rate for critic network (aka. Q-network). | |
| 7 | learn.actor_update_freq | int | 1 | Number of critic network updates performed for each actor network update. | Default 1 for DDPG, 2 for TD3. Delayed Policy Updates method in TD3 paper. |
| 8 | learn.noise | bool | False | Whether to add noise on target network’s action. | Default False for DDPG, True for TD3. Target Policy Smoothing Regularization in TD3 paper. |
| 9 | learn.ignore_done | bool | False | Determine whether to ignore done flag. | Use ignore_done only in halfcheetah env. |
| 10 | learn.target_theta | float | 0.005 | Used for soft update of the target network. | aka. interpolation factor in polyak averaging for target networks. |
| 11 | collect.noise_sigma | float | 0.1 | Used to add noise during collection, by controlling the sigma of the noise distribution. | Sample noise from distribution: Ornstein-Uhlenbeck process in DDPG paper, Gaussian noise in ours. |
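
The settings in the table could be collected into a nested dictionary along the following lines. This is a sketch only: the nesting is inferred from the dotted symbol names (for example learn.target_theta), the variable name `ddpg_config` is hypothetical, and `actor_update_freq` is set to 1 following the "Default 1 for DDPG" note. How such a dictionary is passed to DDPGPolicy is not shown here.

```python
# Hypothetical nested config mirroring the table; nesting inferred from dotted names.
ddpg_config = dict(
    type='ddpg',
    cuda=True,
    random_collect_size=25000,       # random-policy samples collected before training
    model=dict(
        twin_critic=False,           # single critic for DDPG
    ),
    learn=dict(
        learning_rate_actor=1e-3,
        learning_rate_critic=1e-3,
        actor_update_freq=1,         # 1 for DDPG, 2 for TD3 (delayed policy updates)
        noise=False,                 # no target policy smoothing for DDPG
        ignore_done=False,
        target_theta=0.005,          # soft-update interpolation factor
    ),
    collect=dict(
        noise_sigma=0.1,             # standard deviation of the exploration noise
    ),
)
```

Switching the three TD3-specific entries (`twin_critic`, `actor_update_freq`, `noise`) is what separates a DDPG setup from a TD3 setup according to the table.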

Model¶

Here we provide the QAC model as an example of the default model for DDPG.

class ding.model.template.qac.QAC(obs_shape: Union[int, ding.utils.type_helper.SequenceType], action_shape: Union[int, ding.utils.type_helper.SequenceType], actor_head_type: str, twin_critic: bool = False, actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Optional[torch.nn.modules.module.Module] = ReLU(), norm_type: Optional[str] = None)[source]¶

Overview: The QAC model.
Interfaces: __init__, forward, compute_actor, compute_critic

__init__(obs_shape: Union[int, ding.utils.type_helper.SequenceType], action_shape: Union[int, ding.utils.type_helper.SequenceType], actor_head_type: str, twin_critic: bool = False, actor_head_hidden_size: int = 64, actor_head_layer_num: int = 1, critic_head_hidden_size: int = 64, critic_head_layer_num: int = 1, activation: Optional[torch.nn.modules.module.Module] = ReLU(), norm_type: Optional[str] = None) → None[source]¶

Overview:

Init the QAC Model according to arguments.

Arguments:

obs_shape (Union[int, SequenceType]): Observation’s space.
action_shape (Union[int, SequenceType]): Action’s space.
actor_head_type (str): Whether to choose regression or reparameterization.
twin_critic (bool): Whether to include a twin critic.
actor_head_hidden_size (Optional[int]): The hidden_size to pass to actor-nn’s Head.
actor_head_layer_num (int): The num of layers used in the network to compute Q value output for actor’s nn.
critic_head_hidden_size (Optional[int]): The hidden_size to pass to critic-nn’s Head.
critic_head_layer_num (int): The num of layers used in the network to compute Q value output for critic’s nn.
activation (Optional[nn.Module]): The type of activation function to use in the MLP after layer_fn; if None, it defaults to nn.ReLU().
norm_type (Optional[str]): The type of normalization to use, see ding.torch_utils.fc_block for more details.

compute_actor(inputs: torch.Tensor) → Dict[source]¶

Overview:

Use encoded embedding tensor to predict output. Execute parameter updates with 'compute_actor' mode.

Arguments:

inputs (torch.Tensor): The encoded embedding tensor, determined with given hidden_size, i.e. (B, N=hidden_size), where hidden_size = actor_head_hidden_size.
mode (str): Name of the forward mode.

Returns:

outputs (Dict): Outputs of forward pass encoder and head.

ReturnsKeys (either):

action (torch.Tensor): Continuous action tensor with same size as action_shape.
logit (torch.Tensor): Logit tensor encoding mu and sigma, both with same size as input x.

Shapes:

inputs (torch.Tensor): $(B, N0)$, B is batch size and N0 corresponds to hidden_size
action (torch.Tensor): $(B, N0)$
logit (list): 2 elements, mu and sigma, each is the shape of $(B, N0)$.
q_value (torch.FloatTensor): $(B, )$, B is batch size.

Examples:

>>> import torch
>>> from ding.model.template.qac import QAC
>>> # Regression mode
>>> model = QAC(64, 64, 'regression')
>>> inputs = torch.randn(4, 64)
>>> actor_outputs = model(inputs, 'compute_actor')
>>> assert actor_outputs['action'].shape == torch.Size([4, 64])
>>> # Reparameterization mode
>>> model = QAC(64, 64, 'reparameterization')
>>> inputs = torch.randn(4, 64)
>>> actor_outputs = model(inputs, 'compute_actor')
>>> actor_outputs['logit'][0].shape  # mu
torch.Size([4, 64])
>>> actor_outputs['logit'][1].shape  # sigma
torch.Size([4, 64])

compute_critic(inputs: Dict) → Dict[source]¶

Overview:

Execute parameter updates with 'compute_critic' mode. Use encoded embedding tensor to predict output.

Arguments:

obs, action encoded tensors.
mode (str): Name of the forward mode.

Returns:

outputs (Dict): Q-value output.

ReturnKeys:

q_value (torch.Tensor): Q value tensor with same size as batch size.

Shapes:

obs (torch.Tensor): $(B, N1)$, where B is batch size and N1 is obs_shape
action (torch.Tensor): $(B, N2)$, where B is batch size and N2 is action_shape
q_value (torch.FloatTensor): $(B, )$, where B is batch size.

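Examples:

>>> import torch
>>> from ding.model.template.qac import QAC
>>> N = 32  # example observation size, chosen arbitrarily
>>> inputs = {'obs': torch.randn(4, N), 'action': torch.randn(4, 1)}
>>> model = QAC(obs_shape=(N, ), action_shape=1, actor_head_type='regression')
>>> model(inputs, mode='compute_critic')['q_value']  # q value, numbers vary per run
tensor([0.0773, 0.1639, 0.0917, 0.0370], grad_fn=<SqueezeBackward1>)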

forward(inputs: Union[torch.Tensor, Dict], mode: str) → Dict[source]¶

Overview:

Use observation and action tensor to predict output. Parameter updates with QAC’s MLPs forward setup.

Arguments:

Forward with 'compute_actor':

inputs (torch.Tensor): The encoded embedding tensor, determined with given hidden_size, i.e. (B, N=hidden_size). Whether hidden_size is actor_head_hidden_size or critic_head_hidden_size depends on mode.

Forward with 'compute_critic', inputs (Dict) Necessary Keys:

obs, action encoded tensors.

mode (str): Name of the forward mode.

Returns:

outputs (Dict): Outputs of network forward.

Forward with 'compute_actor', Necessary Keys (either):

action (torch.Tensor): Action tensor with same size as input x.
logit (torch.Tensor): Logit tensor encoding mu and sigma, both with same size as input x.

Forward with 'compute_critic', Necessary Keys:

q_value (torch.Tensor): Q value tensor with same size as batch size.

Actor Shapes:

inputs (torch.Tensor): $(B, N0)$, B is batch size and N0 corresponds to hidden_size
action (torch.Tensor): $(B, N0)$
logit (list): 2 elements, mu and sigma, each is the shape of $(B, N0)$.

Critic Shapes:

obs (torch.Tensor): $(B, N1)$, where B is batch size and N1 is obs_shape
action (torch.Tensor): $(B, N2)$, where B is batch size and N2 is action_shape
q_value (torch.FloatTensor): $(B, )$, where B is batch size.

Actor Examples:

>>> import torch
>>> from ding.model.template.qac import QAC
>>> # Regression mode
>>> model = QAC(64, 64, 'regression')
>>> inputs = torch.randn(4, 64)
>>> actor_outputs = model(inputs, 'compute_actor')
>>> assert actor_outputs['action'].shape == torch.Size([4, 64])
>>> # Reparameterization mode
>>> model = QAC(64, 64, 'reparameterization')
>>> inputs = torch.randn(4, 64)
>>> actor_outputs = model(inputs, 'compute_actor')
>>> actor_outputs['logit'][0].shape  # mu
torch.Size([4, 64])
>>> actor_outputs['logit'][1].shape  # sigma
torch.Size([4, 64])

Critic Examples:

>>> import torch
>>> from ding.model.template.qac import QAC
>>> N = 32  # example observation size, chosen arbitrarily
>>> inputs = {'obs': torch.randn(4, N), 'action': torch.randn(4, 1)}
>>> model = QAC(obs_shape=(N, ), action_shape=1, actor_head_type='regression')
>>> model(inputs, mode='compute_critic')['q_value']  # q value, numbers vary per run
tensor([0.0773, 0.1639, 0.0917, 0.0370], grad_fn=<SqueezeBackward1>)

Train actor-critic model¶

First, we initialize separate actor and critic optimizers in _init_learn. Using two separate optimizers guarantees that only the actor network parameters are updated when we optimize the actor loss, and only the critic network parameters are updated when we optimize the critic loss.

# actor and critic optimizer
self._optimizer_actor = Adam(
    self._model.actor.parameters(),
    lr=self._cfg.learn.learning_rate_actor,
    weight_decay=self._cfg.learn.weight_decay
)
self._optimizer_critic = Adam(
    self._model.critic.parameters(),
    lr=self._cfg.learn.learning_rate_critic,
    weight_decay=self._cfg.learn.weight_decay
)

In _forward_learn, we update the actor-critic policy by computing the critic loss, updating the critic network, computing the actor loss, and updating the actor network.

critic loss computation

current and target value computation

# current q value
q_value = self._learn_model.forward(data, mode='compute_critic')['q_value']
q_value_dict = {}
if self._twin_critic:
    q_value_dict['q_value'] = q_value[0].mean()
    q_value_dict['q_value_twin'] = q_value[1].mean()
else:
    q_value_dict['q_value'] = q_value.mean()
# target q value. SARSA: first predict next action, then calculate next q value
with torch.no_grad():
    next_action = self._target_model.forward(next_obs, mode='compute_actor')['action']
    next_data = {'obs': next_obs, 'action': next_action}
    target_q_value = self._target_model.forward(next_data, mode='compute_critic')['q_value']

loss computation

if self._twin_critic:
    # TD3: two critic networks
    target_q_value = torch.min(target_q_value[0], target_q_value[1])  # find min one as target q value
    # network1
    td_data = v_1step_td_data(q_value[0], target_q_value, reward, data['done'], data['weight'])
    critic_loss, td_error_per_sample1 = v_1step_td_error(td_data, self._gamma)
    loss_dict['critic_loss'] = critic_loss
    # network2(twin network)
    td_data_twin = v_1step_td_data(q_value[1], target_q_value, reward, data['done'], data['weight'])
    critic_twin_loss, td_error_per_sample2 = v_1step_td_error(td_data_twin, self._gamma)
    loss_dict['critic_twin_loss'] = critic_twin_loss
    td_error_per_sample = (td_error_per_sample1 + td_error_per_sample2) / 2
else:
    # DDPG: single critic network
    td_data = v_1step_td_data(q_value, target_q_value, reward, data['done'], data['weight'])
    critic_loss, td_error_per_sample = v_1step_td_error(td_data, self._gamma)
    loss_dict['critic_loss'] = critic_loss
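
The arithmetic behind the twin-critic branch can be illustrated on plain Python lists. This is a generic sketch under stated assumptions, not the DI-engine helper: the function `one_step_td` is hypothetical, the per-sample error is assumed to be the squared TD error, the loss is assumed to be its weighted mean, and all numbers are made up.

```python
def one_step_td(q_values, next_q, rewards, dones, weights, gamma):
    """Return (weighted mean loss, per-sample squared TD error) for one critic."""
    per_sample = []
    for q, q_next, r, d in zip(q_values, next_q, rewards, dones):
        # 1-step bootstrapped target; the done flag removes the bootstrap term.
        target = r + gamma * (1.0 - float(d)) * q_next
        per_sample.append((q - target) ** 2)
    loss = sum(w * e for w, e in zip(weights, per_sample)) / len(per_sample)
    return loss, per_sample


# Toy batch of two transitions.
q1, q2 = [1.0, 2.0], [1.5, 2.5]                 # current estimates of the two critics
target_q1, target_q2 = [3.0, 0.5], [2.0, 1.0]   # target-network estimates at s'
rewards, dones, weights = [1.0, 0.0], [False, True], [1.0, 1.0]

# Clipped double Q-learning: the element-wise minimum is the shared target value.
min_target_q = [min(a, b) for a, b in zip(target_q1, target_q2)]

loss1, err1 = one_step_td(q1, min_target_q, rewards, dones, weights, gamma=0.5)
loss2, err2 = one_step_td(q2, min_target_q, rewards, dones, weights, gamma=0.5)
# Average the two critics' per-sample errors, as in the twin-critic branch.
td_error_per_sample = [(a + b) / 2 for a, b in zip(err1, err2)]
```

Both critics regress towards the same pessimistic target, which is what counteracts overestimation in TD3. With a single critic (the DDPG case) only the first call is needed.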

critic network update

self._optimizer_critic.zero_grad()
for k in loss_dict:
    if 'critic' in k:
        loss_dict[k].backward()
self._optimizer_critic.step()

actor loss

actor_data = self._learn_model.forward(data['obs'], mode='compute_actor')
actor_data['obs'] = data['obs']
if self._twin_critic:
    actor_loss = -self._learn_model.forward(actor_data, mode='compute_critic')['q_value'][0].mean()
else:
    actor_loss = -self._learn_model.forward(actor_data, mode='compute_critic')['q_value'].mean()
loss_dict['actor_loss'] = actor_loss

actor network update

# actor update
self._optimizer_actor.zero_grad()
actor_loss.backward()
self._optimizer_actor.step()

Target Network¶

We implement the target network through target model initialization in _init_learn. We configure learn.target_theta to control the interpolation factor in averaging.

# main and target models
self._target_model = copy.deepcopy(self._model)
self._target_model = model_wrap(
    self._target_model,
    wrapper_name='target',
    update_type='momentum',
    update_kwargs={'theta': self._cfg.learn.target_theta}
)

The benchmark results of DDPG implemented in DI-engine are shown in Benchmark.

Other Public Implementations¶

References¶

Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Daan Wierstra: “Continuous control with deep reinforcement learning”, 2015; arXiv:1509.02971, http://arxiv.org/abs/1509.02971.