DDPG

DDPGPolicy

class ding.policy.ddpg.DDPGPolicy(cfg: dict, model: Optional[Union[type, torch.nn.modules.module.Module]] = None, enable_field: Optional[List[str]] = None)

Overview: Policy class of the DDPG algorithm.
Property: learn_mode, collect_mode, eval_mode

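Config:

+----+----------------------------+-------+---------------+--------------------------------------+------------------------------------+
| ID | Symbol                     | Type  | Default Value | Description                          | Other (Shape)                      |
+====+============================+=======+===============+======================================+====================================+
| 1  | type                       | str   | ddpg          | RL policy register name; refer to    | Optional argument; a placeholder.  |
|    |                            |       |               | the registry POLICY_REGISTRY.        |                                    |
+----+----------------------------+-------+---------------+--------------------------------------+------------------------------------+
| 2  | cuda                       | bool  | True          | Whether to use CUDA for the network. |                                    |
+----+----------------------------+-------+---------------+--------------------------------------+------------------------------------+
| 3  | random_collect_size        | int   | 25000         | Number of randomly collected         | Defaults to 25000 for DDPG/TD3,    |
|    |                            |       |               | training samples in the replay       | 10000 for SAC.                     |
|    |                            |       |               | buffer when training starts.         |                                    |
+----+----------------------------+-------+---------------+--------------------------------------+------------------------------------+
| 4  | model.twin_critic          | bool  | False         | Whether to use two critic networks   | Default False for DDPG. Clipped    |
|    |                            |       |               | or only one.                         | Double Q-learning method in the    |
|    |                            |       |               |                                      | TD3 paper.                         |
+----+----------------------------+-------+---------------+--------------------------------------+------------------------------------+
| 5  | learn.learning_rate_actor  | float | 1e-3          | Learning rate for the actor network  |                                    |
|    |                            |       |               | (a.k.a. policy).                     |                                    |
+----+----------------------------+-------+---------------+--------------------------------------+------------------------------------+
| 6  | learn.learning_rate_critic | float | 1e-3          | Learning rate for the critic network |                                    |
|    |                            |       |               | (a.k.a. Q-network).                  |                                    |
+----+----------------------------+-------+---------------+--------------------------------------+------------------------------------+
| 7  | learn.actor_update_freq    | int   | 1             | Number of critic network updates per | Default 1 for DDPG, 2 for TD3.     |
|    |                            |       |               | actor network update.                | Delayed Policy Updates method in   |
|    |                            |       |               |                                      | the TD3 paper.                     |
+----+----------------------------+-------+---------------+--------------------------------------+------------------------------------+
| 8  | learn.noise                | bool  | False         | Whether to add noise to the target   | Default False for DDPG, True for   |
|    |                            |       |               | network's action.                    | TD3. Target Policy Smoothing       |
|    |                            |       |               |                                      | Regularization in the TD3 paper.   |
+----+----------------------------+-------+---------------+--------------------------------------+------------------------------------+
| 9  | learn.ignore_done          | bool  | False         | Whether to ignore the done flag.     | Use ignore_done only in the        |
|    |                            |       |               |                                      | HalfCheetah env.                   |
+----+----------------------------+-------+---------------+--------------------------------------+------------------------------------+
| 10 | learn.target_theta         | float | 0.005         | Used for the soft update of the      | a.k.a. interpolation factor in     |
|    |                            |       |               | target network.                      | Polyak averaging for target        |
|    |                            |       |               |                                      | networks.                          |
+----+----------------------------+-------+---------------+--------------------------------------+------------------------------------+
| 11 | collect.noise_sigma        | float | 0.1           | Sigma of the noise distribution used | Noise is sampled from a            |
|    |                            |       |               | to perturb actions during            | distribution: Ornstein-Uhlenbeck   |
|    |                            |       |               | collection.                          | process in the DDPG paper,         |
|    |                            |       |               |                                      | Gaussian process in ours.          |
+----+----------------------------+-------+---------------+--------------------------------------+------------------------------------+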
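
The dotted symbols in the config table (for example learn.target_theta) suggest a nested dict layout. The following is a minimal sketch of that layout using plain Python dicts. DDPG_DEFAULTS and override are hypothetical names introduced for illustration only; they are not part of DI-engine, and the real policy may hold additional fields that are not listed here.

```python
from copy import deepcopy

# Values taken from the config table, arranged as a nested dict.
# This is an illustrative layout, not the library's actual default config.
DDPG_DEFAULTS = dict(
    type='ddpg',
    cuda=True,
    random_collect_size=25000,
    model=dict(twin_critic=False),
    learn=dict(
        learning_rate_actor=1e-3,
        learning_rate_critic=1e-3,
        actor_update_freq=1,  # 1 for DDPG, 2 for TD3 (see the note in row 7)
        noise=False,
        ignore_done=False,
        target_theta=0.005,
    ),
    collect=dict(noise_sigma=0.1),
)


def override(defaults: dict, updates: dict) -> dict:
    """Hypothetical helper: copy `defaults` and merge `updates` in recursively."""
    merged = deepcopy(defaults)
    for key, value in updates.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = override(merged[key], value)
        else:
            merged[key] = value
    return merged


# Example: run on CPU and use a larger exploration noise during collection.
cfg = override(DDPG_DEFAULTS, dict(cuda=False, collect=dict(noise_sigma=0.2)))
```

Merging into a deep copy keeps the defaults untouched, so several experiment configs can be derived from the same base.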

_forward_collect(data: dict) → dict

Overview:

Forward function of collect mode.

Arguments:

data (dict): Dict type data, including at least [‘obs’].

Returns:

output (dict): Dict type data, including at least the action inferred from the input obs.
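
As a rough illustration of how collect.noise_sigma could act on the inferred action, the sketch below adds zero-mean Gaussian noise to each action dimension and clips the result. The function name add_gaussian_noise and the action bounds low and high are assumptions made for this example; the actual collect model may apply and clip noise differently.

```python
import random
from typing import List, Optional


def add_gaussian_noise(
    action: List[float],
    sigma: float = 0.1,               # plays the role of collect.noise_sigma
    low: float = -1.0,                # assumed lower action bound
    high: float = 1.0,                # assumed upper action bound
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Perturb each action dimension with N(0, sigma^2) noise, then clip to [low, high]."""
    rng = rng or random.Random()
    return [min(high, max(low, a + rng.gauss(0.0, sigma))) for a in action]


# Example: perturb a deterministic 3-dimensional action with a seeded generator.
noisy_action = add_gaussian_noise([0.0, 0.5, 1.0], sigma=0.1, rng=random.Random(0))
```

With sigma set to 0 the action passes through unchanged, which matches the noise-free behaviour expected in eval mode.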

_forward_eval(data: dict) → dict

Overview:

Forward function of eval mode, similar to self._forward_collect.

Arguments:

data (dict): Dict type data, including at least [‘obs’].

Returns:

output (dict): Dict type data, including at least the action inferred from the input obs.

_forward_learn(data: dict) → Dict[str, Any]

Overview:

Forward and backward function of learn mode.

Arguments:

data (dict): Dict type data, including at least [‘obs’, ‘action’, ‘reward’, ‘next_obs’]

Returns:

info_dict (Dict[str, Any]): Including at least the actor and critic learning rates and the individual losses.
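
For orientation, the critic part of a DDPG learn step is usually built around a one-step TD target. The sketch below shows that target and a mean squared error critic loss on plain floats, including the effect of learn.ignore_done. The names td_target and critic_loss and the discount factor gamma are assumptions for this example and are not taken from the documentation above; the real method works on batched tensors and also updates the actor.

```python
from typing import Sequence


def td_target(reward: float, next_q: float, done: bool,
              gamma: float = 0.99, ignore_done: bool = False) -> float:
    """One-step TD target: r + gamma * (1 - done) * Q_target(next_obs, target_action).

    `next_q` stands for the target critic's value at the target actor's action.
    With ignore_done=True the done flag is treated as False, so bootstrapping
    continues across episode boundaries.
    """
    not_done = 1.0 if ignore_done else 1.0 - float(done)
    return reward + gamma * not_done * next_q


def critic_loss(q_values: Sequence[float], targets: Sequence[float]) -> float:
    """Mean squared error between predicted Q values and their TD targets."""
    return sum((q - y) ** 2 for q, y in zip(q_values, targets)) / len(q_values)


# Example: a terminal transition, with and without ignore_done.
y_terminal = td_target(reward=1.0, next_q=10.0, done=True)
y_ignored = td_target(reward=1.0, next_q=10.0, done=True, ignore_done=True)
loss = critic_loss([1.0, 2.0], [1.0, 4.0])
```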

_init_collect() → None

Overview: Collect mode init method. Called by self.__init__. Initializes the trajectory and unroll lengths and the collect model.

_init_eval() → None

Overview: Evaluate mode init method. Called by self.__init__. Initializes the eval model. Unlike the learn and collect models, the eval model does not need noise.

_init_learn() → None

Overview: Learn mode init method. Called by self.__init__. Initializes the actor and critic optimizers, the algorithm config, and the main and target models.
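
The target model set up in learn mode is typically kept close to the main model through a soft (Polyak) update controlled by learn.target_theta. The following is a minimal sketch of that update on flat lists of floats. The function name soft_update is hypothetical, and the library's own implementation operates on network parameters rather than Python lists.

```python
from typing import List


def soft_update(target_params: List[float], main_params: List[float],
                theta: float = 0.005) -> List[float]:
    """Polyak averaging: target <- (1 - theta) * target + theta * main."""
    return [(1.0 - theta) * t + theta * m
            for t, m in zip(target_params, main_params)]


# Example: the target moves a small step towards the main parameters.
new_target = soft_update([0.0, 1.0], [1.0, 1.0], theta=0.005)
```

A small theta makes the target network change slowly, which tends to stabilise the TD targets; theta = 1 would copy the main parameters outright.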

_process_transition(obs: Any, model_output: dict, timestep: collections.namedtuple) → Dict[str, Any]

Overview:

Generate dict type transition data from inputs.

Arguments:

obs (Any): Env observation.
model_output (dict): Output of the collect model, including at least [‘action’].
timestep (namedtuple): Output after env step, including at least [‘obs’, ‘reward’, ‘done’] (here ‘obs’ is the observation after the env step, i.e. next_obs).

Returns:

transition (Dict[str, Any]): Dict type transition data.
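
A plausible shape for the returned transition, based only on the arguments listed above and on the keys that _forward_learn expects, is sketched below. Timestep and process_transition are stand-in names for this example; the actual method may store additional fields.

```python
from collections import namedtuple
from typing import Any, Dict

# Hypothetical stand-in for the env step output; it carries only the fields listed above.
Timestep = namedtuple('Timestep', ['obs', 'reward', 'done'])


def process_transition(obs: Any, model_output: dict, timestep: Timestep) -> Dict[str, Any]:
    """Pack one env step into a dict with the keys used in learn mode."""
    return {
        'obs': obs,                 # observation before the env step
        'next_obs': timestep.obs,   # observation after the env step
        'action': model_output['action'],
        'reward': timestep.reward,
        'done': timestep.done,
    }


# Example: one step with a 2-dimensional observation and a 1-dimensional action.
transition = process_transition(
    obs=[0.1, 0.2],
    model_output={'action': [0.5]},
    timestep=Timestep(obs=[0.3, 0.4], reward=1.0, done=False),
)
```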