N-step TDΒΆ

1. What is n-step TD

n-step TD lies between Monte Carlo and TD methods. Consider estimating \(v_\pi\) from episodes generated from \(\pi\), Monte Carlo performs an update based on the entire sequence while one-step TD is based on one next reward, bootstrapping from the value of the state one step later. n-step TD is the intermediate method, which performs an update based on an intermediate number of rewards, which is more than one but less than all of them.

../_images/nstep_td.png

More formally, consider the update of state \(S_t\) with the sequence, \(S_t, R_{t+1}, S_{t+1}, R_{t+2}, ..., R_{T}, S_{T}\). Monte Carlo is updated in the complete return:

\(G_{t} = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... +\gamma^{T-t-1} R_{T}\)

where T is the last time step of the episode. And we have one-step TD return:

\(G_{t:t+1} = R_{t+1} + \gamma V_{t}(S_{t+1})\)

Accordingly, we have n-step TD return:

\(G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + ... +\gamma^{n-1} R_{t+n} + \gamma^n V_{t+n-1}(S_{t+n})\)

2. How to enable nstep-TD in DI-engine

DI-engine has provided n-step implementation and enable it in a lot of algorithms, like DQN, A2C, etc. Here, we will introduce how to enable nstep-td and modify the code. Takes DQN as an example, if we want to enable nstep=3, we should set the config as follows:

config = dict(
       type='dqn',
       cuda=False,
       on_policy=False,
       priority=False,
       discount_factor=0.97,
       # (int) N-step reward for target q_value estimation
       nstep=3,
       ...,)

3. How does it work in DI-engine

Takes DQN as an example again. We set the propoerty according to the config in _init_collect and _init_learn function in the corresponding policy. Here we show the code of DQN as follows,

def _init_collect(self) -> None:
    ...,
    self._nstep = self._cfg.nstep
    ...,
def _init_learn(self) -> None:
    ...,
    self._nstep = self._cfg.nstep
    ...,

Then, we organize the rewards and value data for n-step return in _get_train_sample function.

def _get_train_sample(self, data: deque) -> Union[None, List[Any]]:
    ...,
    if self._nstep_return:
        data = self._adder.get_nstep_return_data(data, self._nstep)
    ...,
    return self._adder.get_train_sample(data)

Next, we actually calculate n-step return in _forward_learn function

def _forward_learn(self, data: dict) -> Dict[str, Any]:

    ...,
    data_n = q_nstep_td_data(
        q_value, target_q_value, data['action'], target_q_action, data['reward'], data['done'], data['weight']
    )
    value_gamma = data.get('value_gamma')
    loss, td_error_per_sample = q_nstep_td_error(data_n, self._gamma, nstep=self._nstep, value_gamma=value_gamma)
    ...,

The calculation is implemented in q_nstep_td_error function

Please notice that the key nstep might be in different positions for different algorithm.

We also provide some other n-sted td methods, please refer to ding\rl_utils\td.py.