N-step TDΒΆ
1. What is n-step TD
n-step TD lies between Monte Carlo and TD methods. Consider estimating \(v_\pi\) from episodes generated from \(\pi\), Monte Carlo performs an update based on the entire sequence while one-step TD is based on one next reward, bootstrapping from the value of the state one step later. n-step TD is the intermediate method, which performs an update based on an intermediate number of rewards, which is more than one but less than all of them.
More formally, consider the update of state \(S_t\) with the sequence, \(S_t, R_{t+1}, S_{t+1}, R_{t+2}, ..., R_{T}, S_{T}\). Monte Carlo is updated in the complete return:
\(G_{t} = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... +\gamma^{T-t-1} R_{T}\)
where T is the last time step of the episode. And we have one-step TD return:
\(G_{t:t+1} = R_{t+1} + \gamma V_{t}(S_{t+1})\)
Accordingly, we have n-step TD return:
\(G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... +\gamma^{n-1} R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n})\)
2. How to enable nstep-TD in DI-engine
DI-engine has provided n-step implementation and enable it in a lot of algorithms, like DQN, A2C, etc. Here, we will introduce how to enable nstep-td and modify the code. Takes DQN as an example, if we want to enable nstep=3, we should set the config as follows:
config = dict(
type='dqn',
cuda=False,
on_policy=False,
priority=False,
discount_factor=0.97,
# (int) N-step reward for target q_value estimation
nstep=3,
...,)
3. How does it work in DI-engine
Takes DQN as an example again. We set the propoerty according to the config in _init_collect and
_init_learn function in the corresponding policy. Here we show the code of DQN as follows,
def _init_collect(self) -> None:
...,
self._nstep = self._cfg.nstep
...,
def _init_learn(self) -> None:
...,
self._nstep = self._cfg.nstep
...,
Then, we organize the rewards and value data for n-step return in _get_train_sample function.
def _get_train_sample(self, data: deque) -> Union[None, List[Any]]:
...,
if self._nstep_return:
data = self._adder.get_nstep_return_data(data, self._nstep)
...,
return self._adder.get_train_sample(data)
Next, we actually calculate n-step return in _forward_learn function
def _forward_learn(self, data: dict) -> Dict[str, Any]:
...,
data_n = q_nstep_td_data(
q_value, target_q_value, data['action'], target_q_action, data['reward'], data['done'], data['weight']
)
value_gamma = data.get('value_gamma')
loss, td_error_per_sample = q_nstep_td_error(data_n, self._gamma, nstep=self._nstep, value_gamma=value_gamma)
...,
The calculation is implemented in q_nstep_td_error function
Please notice that the key nstep might be in different positions for different algorithm.
We also provide some other n-sted td methods, please refer to ding\rl_utils\td.py.