Rainbow

RainbowDQNPolicy

class ding.policy.rainbow.RainbowDQNPolicy(cfg: dict, model: Optional[Union[type, torch.nn.modules.module.Module]] = None, enable_field: Optional[List[str]] = None)

Overview:

Rainbow DQN contains several improvements over DQN, including:

target network
dueling architecture
prioritized experience replay
n-step return
noisy net
distributional net

Therefore, the RainbowDQNPolicy class inherits from the DQNPolicy class.

Config:

ID	Symbol	Type	Default Value	Description	Other (Shape)
1	`type`	str	rainbow	RL policy register name; refer to registry `POLICY_REGISTRY`	This arg is optional, a placeholder
2	`cuda`	bool	False	Whether to use CUDA for the network	This arg can differ between modes
3	`on_policy`	bool	False	Whether the RL algorithm is on-policy or off-policy
4	`priority`	bool	True	Whether to use prioritized experience replay (PER)	Priority sampling, priority update
5	`model.v_min`	float	-10	Value of the smallest atom in the support set
6	`model.v_max`	float	10	Value of the largest atom in the support set
7	`model.n_atom`	int	51	Number of atoms in the support set of the value distribution
8	`other.eps.start`	float	0.05	Start value for epsilon decay. It is small because Rainbow uses a noisy net.
9	`other.eps.end`	float	0.05	End value for epsilon decay
10	`discount_factor`	float	0.97, [0.95, 0.999]	Reward’s future discount factor, a.k.a. gamma	May be 1 in sparse-reward envs
11	`nstep`	int	3, [3, 5]	Number of steps in the discounted reward sum used for target q_value estimation
12	`learn.update_per_collect`	int	3	How many updates (iterations) to train after one collection by the collector. Only valid in serial training.	This arg can vary across envs. A bigger value means more off-policy.
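As a rough illustration, the defaults in the table can be written as a nested Python dict. This is only a sketch: the nesting is an assumption inferred from the dotted symbol names, the variable name `rainbow_config` is hypothetical, and a real config passed as `cfg` may require further fields not listed here.

```python
# Sketch of the defaults from the table above as a nested dict.
# The nesting is inferred from the dotted symbol names (assumption);
# `rainbow_config` is a hypothetical name used only for this example.
rainbow_config = dict(
    type="rainbow",        # policy register name
    cuda=False,            # run the network on CPU by default
    on_policy=False,       # Rainbow is off-policy
    priority=True,         # use prioritized experience replay
    discount_factor=0.97,  # gamma
    nstep=3,               # length of the n-step return
    model=dict(
        v_min=-10,         # smallest atom of the support set
        v_max=10,          # largest atom of the support set
        n_atom=51,         # number of atoms
    ),
    learn=dict(
        update_per_collect=3,  # training iterations per collection
    ),
    other=dict(
        eps=dict(start=0.05, end=0.05),  # epsilon decay range
    ),
)
```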

_forward_collect(data: dict, eps: float) → dict

Overview:

Reset the noise of the noisy net, then collect the output according to the eps_greedy plugin.

Arguments:

data (dict): Dict type data, including at least [‘obs’].
eps (float): The epsilon value used by the eps_greedy plugin.

Returns:

data (dict): The collected data
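The epsilon-greedy selection that the eps_greedy plugin refers to can be sketched in plain Python. This is a generic illustration of the technique, not the plugin's actual code; the function name `eps_greedy_action` and its arguments are hypothetical.

```python
import random


def eps_greedy_action(q_values, eps, rng=random):
    """Pick a random action with probability eps, else the greedy one.

    `q_values` is a list of per-action values. This helper is a
    hypothetical sketch, not part of the documented API.
    """
    if rng.random() < eps:
        # Explore: uniform random action index.
        return rng.randrange(len(q_values))
    # Exploit: index of the largest q value.
    return max(range(len(q_values)), key=lambda a: q_values[a])
```

With the default `eps` of 0.05 from the config table, roughly one step in twenty is a random action; the rest follow the (noisy) network output.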

_forward_learn(data: dict) → Dict[str, Any]

Overview:

Forward and backward pass of learn mode: acquire the data, calculate the loss, and optimize the learner model.

Arguments:

data (dict): Dict type data, including at least [‘obs’, ‘next_obs’, ‘reward’, ‘action’]

Returns:

info_dict (Dict[str, Any]): Including cur_lr and total_loss
- cur_lr (float): current learning rate
- total_loss (float): the calculated loss
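Because the policy uses a distributional net over a fixed support (`v_min`, `v_max`, `n_atom`), the loss is typically computed against a target distribution projected back onto that support. The sketch below shows the standard categorical projection for a single transition in plain Python. It is an assumption that the learner follows this scheme; the function name and arguments are hypothetical, and the real implementation would operate on batched tensors.

```python
import math


def project_distribution(next_probs, reward, done, gamma_n, v_min, v_max):
    """Project the shifted target distribution onto the fixed support.

    next_probs: probabilities over the atoms for the next state.
    gamma_n: discount applied to the bootstrap term (e.g. gamma ** nstep).
    Hypothetical helper; a sketch of the standard categorical projection.
    """
    n_atom = len(next_probs)
    delta_z = (v_max - v_min) / (n_atom - 1)
    projected = [0.0] * n_atom
    for j, p in enumerate(next_probs):
        z_j = v_min + j * delta_z
        # Bellman update of the atom, without bootstrapping at terminal states.
        tz = reward + (0.0 if done else gamma_n * z_j)
        tz = min(v_max, max(v_min, tz))
        # Fractional index of tz on the support, clamped against rounding error.
        b = min(max((tz - v_min) / delta_z, 0.0), n_atom - 1)
        lower, upper = int(math.floor(b)), int(math.ceil(b))
        if lower == upper:
            projected[lower] += p
        else:
            # Split the mass between the two neighbouring atoms.
            projected[lower] += p * (upper - b)
            projected[upper] += p * (b - lower)
    return projected
```

The projected distribution still sums to one, and it can then be compared with the predicted distribution for the taken action using a cross-entropy loss.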

_get_train_sample(traj: collections.deque) → Union[None, List[Any]]

Overview:

Take the trajectory, compute the n-step return data, then sample from the n-step return data.

Arguments:

traj (deque): The trajectory’s cache

Returns:

samples (dict): The training samples generated
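The n-step return data mentioned above can be illustrated with a small, self-contained sketch. The step layout (dicts with `obs`, `action`, `reward`, `next_obs`, `done`) and the function name `nstep_samples` are assumptions for this example, and the sketch ignores episode ends inside a window, which a real implementation has to handle.

```python
from collections import deque


def nstep_samples(traj, nstep, gamma):
    """Turn a trajectory of 1-step transitions into n-step samples.

    Each sample keeps the first obs/action of its window, the discounted
    sum of the window's rewards, and the last next_obs/done.
    Hypothetical helper; assumes no terminal step inside a window.
    """
    steps = list(traj)
    samples = []
    for t in range(len(steps) - nstep + 1):
        window = steps[t:t + nstep]
        # r_t + gamma * r_{t+1} + ... + gamma^(n-1) * r_{t+n-1}
        ret = sum((gamma ** k) * s["reward"] for k, s in enumerate(window))
        samples.append({
            "obs": window[0]["obs"],
            "action": window[0]["action"],
            "reward": ret,
            "next_obs": window[-1]["next_obs"],
            "done": window[-1]["done"],
        })
    return samples


# Usage: a 4-step trajectory with reward 1.0 at every step.
traj = deque(
    {"obs": i, "action": 0, "reward": 1.0, "next_obs": i + 1, "done": False}
    for i in range(4)
)
samples = nstep_samples(traj, nstep=3, gamma=0.5)
```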

_init_collect() → None

Overview:

Collect mode init method. Called by self.__init__. Initializes the trajectory and unroll length, and the collect model.

Note

Rainbow DQN enables eps_greedy_sample but might not need to use it, as the noisy net already contains noise that can help exploration.

_init_learn() → None

Overview:

Init the learner model of RainbowDQNPolicy

Arguments:

learning_rate (float): the learning rate of the optimizer
gamma (float): the discount factor
nstep (int): the number of steps in the n-step return
v_min (float): value distribution minimum value
v_max (float): value distribution maximum value
n_atom (int): the number of atoms (sample points) in the value distribution
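To make the roles of v_min, v_max and n_atom concrete, the following sketch builds the evenly spaced support set and computes an expected q value from a probability vector over it. The helper names `support_atoms` and `expected_q` are hypothetical and the code is only an illustration of the arithmetic, not the policy's own initialization.

```python
def support_atoms(v_min, v_max, n_atom):
    """Return n_atom evenly spaced values from v_min to v_max."""
    delta_z = (v_max - v_min) / (n_atom - 1)
    return [v_min + i * delta_z for i in range(n_atom)]


def expected_q(probs, atoms):
    """Expected value of a categorical distribution over the atoms."""
    return sum(p * z for p, z in zip(probs, atoms))


# Defaults from the config table: 51 atoms spanning [-10, 10].
atoms = support_atoms(-10.0, 10.0, 51)
```

With the defaults, neighbouring atoms are 0.4 apart, so returns far outside [-10, 10] cannot be represented and would be clipped to the ends of the support.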