Rainbow
RainbowDQNPolicy
- class ding.policy.rainbow.RainbowDQNPolicy(cfg: dict, model: Optional[Union[type, torch.nn.modules.module.Module]] = None, enable_field: Optional[List[str]] = None)
- Overview:
  Rainbow DQN combines several improvements over DQN, including:
  - target network
  - dueling architecture
  - prioritized experience replay (PER)
  - n-step return
  - noisy net
  - distributional net
  The RainbowDQNPolicy class therefore inherits from the DQNPolicy class.
- Config:

| ID | Symbol | Type | Default Value | Description | Other(Shape) |
|----|--------|------|---------------|-------------|--------------|
| 1 | `type` | str | rainbow | RL policy register name, refer to registry `POLICY_REGISTRY` | This arg is optional, a placeholder |
| 2 | `cuda` | bool | False | Whether to use cuda for the network | This arg can be different from modes |
| 3 | `on_policy` | bool | False | Whether the RL algorithm is on-policy or off-policy | |
| 4 | `priority` | bool | True | Whether to use priority (PER) | Priority sample, update priority |
| 5 | `model.v_min` | float | -10 | Value of the smallest atom in the support set | |
| 6 | `model.v_max` | float | 10 | Value of the largest atom in the support set | |
| 7 | `model.n_atom` | int | 51 | Number of atoms in the support set of the value distribution | |
| 8 | `other.eps.start` | float | 0.05 | Start value for epsilon decay. It is small because Rainbow uses a noisy net | |
| 9 | `other.eps.end` | float | 0.05 | End value for epsilon decay | |
| 10 | `discount_factor` | float | 0.97, [0.95, 0.999] | Reward's future discount factor, aka. gamma | May be 1 in sparse-reward envs |
| 11 | `nstep` | int | 3, [3, 5] | N-step reward discount sum for target q_value estimation | |
| 12 | `learn.update_per_collect` | int | 3 | How many updates (iterations) to train after one collection by the collector. Only valid in serial training | This arg can vary between envs. A bigger value means more off-policy |
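As a rough illustration of how the defaults in the table fit together, the sketch below writes them out as a nested dict and derives the support atoms from `v_min`, `v_max`, and `n_atom`. The nesting is inferred from the dotted symbol names, `learn.update_per_collect` is the assumed spelling of the last key, and `rainbow_config` and `support_atoms` are hypothetical names used only for this example.

```python
# Hypothetical config fragment mirroring the default values in the table.
# The nesting is an assumption inferred from the dotted symbol names.
rainbow_config = dict(
    type='rainbow',
    cuda=False,
    on_policy=False,
    priority=True,
    discount_factor=0.97,
    nstep=3,
    model=dict(v_min=-10, v_max=10, n_atom=51),
    learn=dict(update_per_collect=3),
    other=dict(eps=dict(start=0.05, end=0.05)),
)


def support_atoms(v_min, v_max, n_atom):
    """Return n_atom evenly spaced atoms from v_min to v_max (inclusive)."""
    delta_z = (v_max - v_min) / (n_atom - 1)
    return [v_min + i * delta_z for i in range(n_atom)]


# With the defaults this gives 51 atoms spaced 0.4 apart on [-10, 10].
atoms = support_atoms(**rainbow_config['model'])
```

With a fixed support like this, the network only has to predict one probability per atom for each action.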
- _forward_collect(data: dict, eps: float) → dict
- Overview:
  Reset the noise of the noise net and collect output according to the eps_greedy plugin.
- Arguments:
  data (dict): Dict type data, including at least ['obs'].
- Returns:
  data (dict): The collected data.
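A minimal sketch of epsilon-greedy selection on top of a distributional output may help here. It assumes the model yields one probability vector over the atoms per action; that layout and the helper names `expected_q` and `eps_greedy_action` are illustrative assumptions, not the eps_greedy plugin's actual interface.

```python
import random


def expected_q(dist, atoms):
    """Expected value of one categorical distribution over the atoms."""
    return sum(p * z for p, z in zip(dist, atoms))


def eps_greedy_action(action_dists, atoms, eps, rng=random):
    """Pick a random action with probability eps, else the greedy one.

    action_dists holds one probability vector per action (assumed layout).
    """
    n_action = len(action_dists)
    if rng.random() < eps:
        return rng.randrange(n_action)
    q_values = [expected_q(d, atoms) for d in action_dists]
    return max(range(n_action), key=lambda a: q_values[a])


# Toy example: action 1 puts most of its mass on the largest atom.
toy_atoms = [-1.0, 0.0, 1.0]
toy_dists = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
greedy = eps_greedy_action(toy_dists, toy_atoms, eps=0.0)
```

With `eps=0.0` the choice is purely greedy on the expected value, which is the regime the small default epsilon in the config approaches.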
- _forward_learn(data: dict) → Dict[str, Any]
- Overview:
  Forward and backward function of learn mode: acquire the data, calculate the loss, and optimize the learner model.
- Arguments:
  data (dict): Dict type data, including at least ['obs', 'next_obs', 'reward', 'action'].
- Returns:
  info_dict (Dict[str, Any]): Including cur_lr and total_loss.
  - cur_lr (float): Current learning rate.
  - total_loss (float): The calculated loss.
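For a distributional net, one common choice of loss is the cross-entropy between a target distribution and the predicted distribution, optionally weighted per sample when prioritized replay is enabled. The sketch below assumes that choice; the source does not state which loss `_forward_learn` uses, and `weighted_cross_entropy` and the placeholder `cur_lr` value are hypothetical.

```python
import math


def weighted_cross_entropy(target_dists, pred_dists, weights=None, eps=1e-8):
    """Return (mean weighted loss, unweighted per-sample losses).

    The per-sample losses could serve as new priorities under PER
    (an assumption of this sketch).
    """
    per_sample = [
        -sum(t * math.log(p + eps) for t, p in zip(target, pred))
        for target, pred in zip(target_dists, pred_dists)
    ]
    if weights is None:
        weights = [1.0] * len(per_sample)
    total = sum(w * l for w, l in zip(weights, per_sample)) / len(per_sample)
    return total, per_sample


total_loss, per_sample_loss = weighted_cross_entropy([[0.0, 1.0]], [[0.5, 0.5]])
# Shape of the returned info; the cur_lr value here is only a placeholder.
info_dict = {'cur_lr': 1e-3, 'total_loss': total_loss}
```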
- _get_train_sample(traj: collections.deque) → Union[None, List[Any]]
- Overview:
  Get the trajectory and the n-step return data, then sample from the n-step return data.
- Arguments:
  traj (deque): The trajectory's cache.
- Returns:
  samples (dict): The training samples generated.
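The n-step return data mentioned above can be pictured as a discounted sum of the next `nstep` rewards for every step of the trajectory. The following is a minimal sketch under that reading; `nstep_returns` is a hypothetical helper, and truncating the sum near the end of the trajectory is a simplifying assumption.

```python
def nstep_returns(rewards, gamma, nstep):
    """Return the discounted n-step reward sum starting at each step.

    Near the end of the trajectory fewer than nstep rewards remain, so the
    sum is truncated (a simplifying assumption of this sketch).
    """
    returns = []
    for t in range(len(rewards)):
        window = rewards[t:t + nstep]
        returns.append(sum(gamma ** k * r for k, r in enumerate(window)))
    return returns


# Four steps of reward 1.0 with gamma = 0.5 and nstep = 3.
example_returns = nstep_returns([1.0, 1.0, 1.0, 1.0], gamma=0.5, nstep=3)
```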
- _init_collect() → None
- Overview:
  Collect mode init method. Called by self.__init__. Init traj and unroll length, collect model.
  Note
  - Rainbow DQN enables eps_greedy_sample but might not need to use it, as the noise_net contains noise that can help exploration.
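To illustrate why a noisy layer can stand in for epsilon-greedy exploration, here is a toy noisy linear layer whose effective weights are `mu + sigma * noise`. It uses independent Gaussian noise for simplicity; the class `NoisyLinearSketch` and its parameters are hypothetical and do not reflect the actual noise_net implementation.

```python
import random


class NoisyLinearSketch:
    """Toy noisy linear layer: effective weight = mu + sigma * noise."""

    def __init__(self, w_mu, b_mu, sigma, seed=0):
        self.w_mu, self.b_mu, self.sigma = w_mu, b_mu, sigma
        self.rng = random.Random(seed)
        self.reset_noise()

    def reset_noise(self):
        """Resample independent Gaussian noise for every weight and bias."""
        self.w_eps = [[self.rng.gauss(0.0, 1.0) for _ in row] for row in self.w_mu]
        self.b_eps = [self.rng.gauss(0.0, 1.0) for _ in self.b_mu]

    def forward(self, x):
        """Apply the perturbed weights and biases to the input vector x."""
        out = []
        for row, eps_row, b, b_eps in zip(self.w_mu, self.w_eps, self.b_mu, self.b_eps):
            w = [m + self.sigma * e for m, e in zip(row, eps_row)]
            out.append(sum(wi * xi for wi, xi in zip(w, x)) + b + self.sigma * b_eps)
        return out


layer = NoisyLinearSketch([[1.0, 2.0]], [0.5], sigma=0.5)
first = layer.forward([1.0, 1.0])
layer.reset_noise()  # new noise, so the same input gives a different output
second = layer.forward([1.0, 1.0])
```

Because each noise reset perturbs the Q-values, the greedy action itself varies between resets, which is why a small epsilon can be sufficient.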
- _init_learn() → None
- Overview:
  Init the learner model of RainbowDQNPolicy.
- Arguments:
  - learning_rate (float): The learning rate of the optimizer.
  - gamma (float): The discount factor.
  - nstep (int): The number of steps in the n-step return.
  - v_min (float): Minimum value of the value distribution.
  - v_max (float): Maximum value of the value distribution.
  - n_atom (int): The number of atoms (sample points) in the support.
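The arguments above are the ingredients of a categorical (C51-style) target: each atom is shifted by the reward and scaled by `gamma ** nstep`, clipped to `[v_min, v_max]`, and its probability is split between the two nearest atoms. The sketch below shows that projection for a single transition. `project_distribution` is a hypothetical helper, and the source does not confirm that the policy computes its target exactly this way.

```python
import math


def project_distribution(next_dist, reward, done, gamma, nstep, v_min, v_max, n_atom):
    """Project the shifted target distribution back onto the fixed support."""
    delta_z = (v_max - v_min) / (n_atom - 1)
    projected = [0.0] * n_atom
    discount = 0.0 if done else gamma ** nstep
    for j, p in enumerate(next_dist):
        z = v_min + j * delta_z
        # Shift and scale the atom, then clip it to the support range.
        tz = min(v_max, max(v_min, reward + discount * z))
        # Fractional index of tz on the support, guarded against rounding.
        b = min(max((tz - v_min) / delta_z, 0.0), n_atom - 1)
        lower, upper = math.floor(b), math.ceil(b)
        if lower == upper:
            projected[lower] += p
        else:
            # Split the mass between the two neighbouring atoms.
            projected[lower] += p * (upper - b)
            projected[upper] += p * (b - lower)
    return projected


# Support {0, 1, 2}: all mass on atom 1, shifted by a reward of 0.5.
target = project_distribution([0.0, 1.0, 0.0], reward=0.5, done=False,
                              gamma=1.0, nstep=1, v_min=0.0, v_max=2.0, n_atom=3)
```

In the toy call the shifted atom lands at 1.5, halfway between atoms 1 and 2, so its probability is divided equally between them.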