Rainbow

RainbowDQNPolicy

class ding.policy.rainbow.RainbowDQNPolicy(cfg: dict, model: Optional[Union[type, torch.nn.modules.module.Module]] = None, enable_field: Optional[List[str]] = None)
Overview:
Rainbow DQN combines several improvements over DQN, including:
  • target network

  • dueling architecture

  • prioritized experience replay

  • n_step return

  • noisy net

  • distributional net

Therefore, the RainbowDQNPolicy class inherits from the DQNPolicy class.

Config:

| ID | Symbol | Type | Default Value | Description | Other(Shape) |
|----|--------|------|---------------|-------------|--------------|
| 1 | type | str | rainbow | RL policy register name, refer to registry POLICY_REGISTRY | This arg is optional, a placeholder |
| 2 | cuda | bool | False | Whether to use cuda for network | This arg can be different from modes |
| 3 | on_policy | bool | False | Whether the RL algorithm is on-policy or off-policy | |
| 4 | priority | bool | True | Whether to use priority (PER) | Priority sample, update priority |
| 5 | model.v_min | float | -10 | Value of the smallest atom in the support set | |
| 6 | model.v_max | float | 10 | Value of the largest atom in the support set | |
| 7 | model.n_atom | int | 51 | Number of atoms in the support set of the value distribution | |
| 8 | other.eps.start | float | 0.05 | Start value for epsilon decay. It is small because Rainbow uses noisy net | |
| 9 | other.eps.end | float | 0.05 | End value for epsilon decay | |
| 10 | discount_factor | float | 0.97, [0.95, 0.999] | Reward's future discount factor, aka. gamma | May be 1 in sparse-reward envs |
| 11 | nstep | int | 3, [3, 5] | N-step reward discount sum for target q_value estimation | |
| 12 | learn.update_per_collect | int | 3 | How many updates (iterations) to train after one collection by the collector. Only valid in serial training | This arg can vary across envs. A bigger value means more off-policy |

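
As a rough illustration, the defaults from the Config entries above could be written as a nested Python dict. The nesting is inferred from the dotted symbol names and is an assumption, as is the helper get_by_path, which is hypothetical and not part of the library. The actual default config may contain more fields.

```python
# Hypothetical config fragment assembled from the defaults listed above.
# The nesting follows the dotted symbol names; it is not copied from the library.
rainbow_config = dict(
    type='rainbow',
    cuda=False,
    on_policy=False,
    priority=True,
    discount_factor=0.97,
    nstep=3,
    model=dict(v_min=-10, v_max=10, n_atom=51),
    learn=dict(update_per_collect=3),
    other=dict(eps=dict(start=0.05, end=0.05)),
)


def get_by_path(cfg: dict, dotted_key: str):
    """Look up a value such as 'model.n_atom' in a nested dict (illustrative helper)."""
    node = cfg
    for part in dotted_key.split('.'):
        node = node[part]
    return node
```

For example, get_by_path(rainbow_config, 'other.eps.start') reads the epsilon start value from the nested structure.
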
_forward_collect(data: dict, eps: float) -> dict
Overview:

Reset the noise of the noisy net, then compute the collect-mode output using the eps_greedy plugin.

Arguments:
  • data (dict): Dict type data, including at least [‘obs’].

  • eps (float): The epsilon value used by the eps_greedy plugin.

Returns:
  • data (dict): The collected data
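
The epsilon-greedy step mentioned in the overview can be sketched in plain Python. This is a minimal illustration of the general technique, not the eps_greedy plugin itself; the function name eps_greedy_action and its arguments are assumptions made for this example.

```python
import random
from typing import List, Optional


def eps_greedy_action(q_values: List[float], eps: float,
                      rng: Optional[random.Random] = None) -> int:
    """With probability eps pick a uniformly random action, otherwise the greedy one.

    Illustrative only: the real plugin works on batched network outputs.
    """
    rng = rng or random.Random()
    if rng.random() < eps:
        # Explore: any action index is equally likely.
        return rng.randrange(len(q_values))
    # Exploit: index of the largest Q value.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With eps=0.0 the call always returns the greedy action, which is close to how a policy with a very small epsilon (such as the 0.05 default) behaves most of the time.
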

_forward_learn(data: dict) -> Dict[str, Any]
Overview:

Forward and backward pass of learn mode: take the training data, calculate the loss, and optimize the learner model.

Arguments:
  • data (dict): Dict type data, including at least [‘obs’, ‘next_obs’, ‘reward’, ‘action’]

Returns:
  • info_dict (Dict[str, Any]): Including cur_lr and total_loss
    • cur_lr (float): current learning rate

    • total_loss (float): the calculated loss
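
To make the loss calculation more concrete, here is a sketch of the standard categorical (C51-style) target projection and cross-entropy loss for a single transition. It is an assumption that the policy's loss follows this textbook form; the function names, the single-sample layout, and gamma_n (taken here to mean discount_factor ** nstep) are all hypothetical.

```python
import math
from typing import List


def project_distribution(next_probs: List[float], reward: float, done: bool,
                         gamma_n: float, v_min: float, v_max: float) -> List[float]:
    """Project the shifted and scaled next-state distribution back onto the fixed support."""
    n_atom = len(next_probs)
    delta = (v_max - v_min) / (n_atom - 1)
    target = [0.0] * n_atom
    for j, p in enumerate(next_probs):
        z_j = v_min + j * delta
        # Bellman update of the atom, clipped to the support range.
        tz = reward + (0.0 if done else gamma_n * z_j)
        tz = min(max(tz, v_min), v_max)
        # Fractional index of tz on the support, guarded against float drift.
        b = min(max((tz - v_min) / delta, 0.0), float(n_atom - 1))
        lower, upper = math.floor(b), math.ceil(b)
        if lower == upper:
            target[lower] += p
        else:
            # Split the probability mass between the two neighbouring atoms.
            target[lower] += p * (upper - b)
            target[upper] += p * (b - lower)
    return target


def cross_entropy(target: List[float], pred: List[float], tiny: float = 1e-8) -> float:
    """Cross-entropy between the projected target and the predicted distribution."""
    return -sum(t * math.log(q + tiny) for t, q in zip(target, pred))
```

In a full learner this per-sample loss would be averaged over the batch (and weighted when prioritized replay is enabled) to give a scalar such as total_loss.
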

_get_train_sample(traj: collections.deque) -> Union[None, List[Any]]
Overview:

Take the trajectory, compute the n-step return data from it, then generate training samples from the n-step return data.

Arguments:
  • traj (deque): The trajectory’s cache

Returns:
  • samples (Union[None, List[Any]]): The training samples generated
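
The idea of turning a trajectory into n-step samples can be sketched as follows. This is a simplified illustration under stated assumptions: the helper names are hypothetical, episode termination is ignored, and the reward is summed up front, whereas the library may store the n rewards separately and discount them later.

```python
from collections import deque
from typing import Any, Dict, List


def nstep_return(rewards: List[float], gamma: float) -> float:
    """Discounted sum r_0 + gamma * r_1 + ... + gamma^(n-1) * r_(n-1)."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))


def build_nstep_samples(traj: deque, nstep: int, gamma: float) -> List[Dict[str, Any]]:
    """Slide a window of length nstep over the trajectory and emit one sample per window."""
    steps = list(traj)  # a deque does not support slicing
    samples = []
    for t in range(len(steps) - nstep + 1):
        window = steps[t:t + nstep]
        samples.append({
            'obs': window[0]['obs'],
            'action': window[0]['action'],
            'reward': nstep_return([s['reward'] for s in window], gamma),
            'next_obs': window[-1]['next_obs'],  # observation n steps ahead
        })
    return samples
```

For a trajectory of four transitions and nstep=3, the sketch yields two samples, each pairing the first observation of its window with the observation three steps later.
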

_init_collect() -> None
Overview:

Collect mode init method. Called by self.__init__. Initializes the trajectory and unroll length, and the collect model.

Note

Rainbow DQN enables eps_greedy_sample but might not need to use it, as the noise_net already contains noise that can help exploration.
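
To show why a noisy layer can drive exploration on its own, here is a toy noisy linear layer in plain Python. It uses independent Gaussian noise and is only a sketch of the general idea; the class NoisyLinear, its constructor arguments, and the sigma_init value are assumptions and do not describe the library's noise_net implementation.

```python
import random
from typing import List


class NoisyLinear:
    """Toy layer: y = (w_mu + w_sigma * eps_w) x + (b_mu + b_sigma * eps_b)."""

    def __init__(self, in_dim: int, out_dim: int, sigma_init: float = 0.5, seed: int = 0):
        self.rng = random.Random(seed)
        bound = 1.0 / in_dim ** 0.5
        self.w_mu = [[self.rng.uniform(-bound, bound) for _ in range(in_dim)]
                     for _ in range(out_dim)]
        self.b_mu = [self.rng.uniform(-bound, bound) for _ in range(out_dim)]
        self.w_sigma = [[sigma_init * bound] * in_dim for _ in range(out_dim)]
        self.b_sigma = [sigma_init * bound] * out_dim
        self.reset_noise()

    def reset_noise(self) -> None:
        """Draw fresh noise; outputs stay fixed until the next reset."""
        self.w_eps = [[self.rng.gauss(0, 1) for _ in row] for row in self.w_mu]
        self.b_eps = [self.rng.gauss(0, 1) for _ in self.b_mu]

    def forward(self, x: List[float]) -> List[float]:
        out = []
        for i in range(len(self.b_mu)):
            # Perturb the mean parameters with the currently sampled noise.
            w = [m + s * e for m, s, e in zip(self.w_mu[i], self.w_sigma[i], self.w_eps[i])]
            b = self.b_mu[i] + self.b_sigma[i] * self.b_eps[i]
            out.append(sum(wi * xi for wi, xi in zip(w, x)) + b)
        return out
```

Because each call to reset_noise changes the effective weights, the greedy action can differ between resets even for the same observation, which is the exploration effect the note refers to.
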

_init_learn() -> None
Overview:

Init the learner model of RainbowDQNPolicy

Arguments:
  • learning_rate (float): the learning rate of the optimizer

  • gamma (float): the discount factor

  • nstep (int): the number of steps in the n-step return

  • v_min (float): value distribution minimum value

  • v_max (float): value distribution maximum value

  • n_atom (int): the number of atoms (sample points) in the value distribution support
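
The roles of v_min, v_max and n_atom can be illustrated with a small sketch that builds the support of the value distribution and reads a scalar Q value from it. It assumes evenly spaced atoms, as in the usual categorical formulation; the helper names make_support and expected_q are hypothetical.

```python
from typing import List


def make_support(v_min: float, v_max: float, n_atom: int) -> List[float]:
    """Evenly spaced atoms from v_min to v_max (assumed spacing, n_atom >= 2)."""
    delta = (v_max - v_min) / (n_atom - 1)
    return [v_min + i * delta for i in range(n_atom)]


def expected_q(probs: List[float], support: List[float]) -> float:
    """Scalar Q value as the expectation of the atoms under the predicted probabilities."""
    return sum(p * z for p, z in zip(probs, support))
```

With the defaults v_min=-10, v_max=10 and n_atom=51, neighbouring atoms are 0.4 apart, and a uniform distribution over them has an expected value of approximately 0.
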