CollaQ

Overview

CollaQ (Zhang et al. 2020), Collaborative Q-learning, is a multi-agent collaboration approach based on Q-learning, which formulates multi-agent collaboration as a joint optimization problem on reward assignment. CollaQ decomposes decentralized Q value functions of individual agents into two terms, the self-term that only relies on the agent’s own state, and the interactive term that is related to states of nearby agents. CollaQ jointly trains using regular DQN, regulated with a Multi-Agent Reward Attribution (MARA) loss.

Quick Facts

  1. CollaQ is a model-free and value-based multi-agent RL approach.

  2. CollaQ only supports discrete action spaces.

  3. CollaQ is an off-policy algorithm.

  4. CollaQ considers a partially observable scenario in which each agent only obtains individual observations.

  5. CollaQ uses DRQN architecture for individual Q learning.

  6. Compared to QMIX and VDN, CollaQ doesn’t need a centralized Q function, which expands the individual Q-function for each agent with reward assignment depending on the joint state.

Key Equations or Key Graphs

The overall architecture of the Q-function with attention-based model in CollaQ:

../_images/collaq.png

The Q-function for agent i:

\[Q_{i}(s_{i},a_{i};\hat{\textbf{r}}_{i}) = \underbrace{Q_{1}(s{i}, a_{i},\hat{\textbf{r}_{0i}})}_{Q^{alone}(s_{i},a_{i})} + \underbrace{\nabla_{\textbf{r}}Q_{i}(s_{i},a_{i};\textbf{r}_{0i})\cdot(\hat{\textbf{r}_{i}} - \textbf{r}_{0i}) + \mathcal{O}(||\hat{\textbf{r}_{i}} - \textbf{r}_{0i}||^{2})}_{Q^{collab}(s^{local}_{i}, a_{i})}\]

The overall training objective of standard DQN training with MARA loss:

\[L = \mathbb{E}_{s_{i},a{i}\sim\rho(\cdot)}[\underbrace{(y-Q_{i}(o_{i},a_{i}))^{2}}_{\text{DQN Object}} +\underbrace{\alpha(Q_{i}^{collab}(o_{i}^{alone}, a_{i}))^{2}}_{\text{MARA Object}}]\]

Extensions

  • CollaQ can choose to use an attention-based architecture or not. Because the observation can be spatially large and covers agents whose states do not contribute much to a certain agent policy. In details, CollaQ uses a transformer architecture (stacking multiple layers of attention modules), which empirically helps improve the performance in multi-agent tasks.

Implementations

The default config is defined as follows:

class ding.policy.collaq.CollaQPolicy(cfg: dict, model: Optional[Union[type, torch.nn.modules.module.Module]] = None, enable_field: Optional[List[str]] = None)[source]
Overview:

Policy class of CollaQ algorithm. CollaQ is a multi-agent reinforcement learning algorithm

Interface:
_init_learn, _data_preprocess_learn, _forward_learn, _reset_learn, _state_dict_learn, _load_state_dict_learn

_init_collect, _forward_collect, _reset_collect, _process_transition, _init_eval, _forward_eval_reset_eval, _get_train_sample, default_model

Config:

ID

Symbol

Type

Default Value

Description

Other(Shape)

1

type

str

collaq

RL policy register name, refer to
registry POLICY_REGISTRY
this arg is optional,
a placeholder

2

cuda

bool

True

Whether to use cuda for network
this arg can be diff-
erent from modes

3

on_policy

bool

False

Whether the RL algorithm is on-policy
or off-policy

priority

bool

False

Whether use priority(PER)
priority sample,
update priority

5

priority_
IS_weight

bool

False

Whether use Importance Sampling
Weight to correct biased update.
IS weight

6

learn.update_
per_collect

int

20

How many updates(iterations) to train
after collector’s one collection. Only
valid in serial training
this args can be vary
from envs. Bigger val
means more off-policy

7

learn.target_
update_theta

float

0.001

Target network update momentum
parameter.
between[0,1]

8

learn.discount
_factor

float

0.99

Reward’s future discount factor, aka.
gamma
may be 1 when sparse
reward env

9

learn.collaq
_loss_weight

float

1.0

The weight of collaq MARA loss

The network interface CollaQ used is defined as follows:

class ding.model.template.qmix.CollaQ(agent_num: int, obs_shape: int, alone_obs_shape: int, global_obs_shape: int, action_shape: int, hidden_size_list: list, attention: bool = False, self_feature_range: Optional[List[int]] = None, ally_feature_range: Optional[List[int]] = None, attention_size: int = 32, mixer: bool = True, lstm_type: str = 'gru', dueling: bool = False, use_pmixer: bool = False)[source]
Overview:

CollaQ network

Interface:

__init__, forward, _setup_global_encoder

__init__(agent_num: int, obs_shape: int, alone_obs_shape: int, global_obs_shape: int, action_shape: int, hidden_size_list: list, attention: bool = False, self_feature_range: Optional[List[int]] = None, ally_feature_range: Optional[List[int]] = None, attention_size: int = 32, mixer: bool = True, lstm_type: str = 'gru', dueling: bool = False, use_pmixer: bool = False)None[source]
Overview:

initialize Collaq network

Arguments:
  • agent_num (int): the number of agent

  • obs_shape (int): the dimension of each agent’s observation state

  • alone_obs_shape (int): the dimension of each agent’s observation state without other agents

  • global_obs_shape (int): the dimension of global observation state

  • action_shape (int): the dimension of action shape

  • hidden_size_list (list): the list of hidden size

  • attention (bool): use attention module or not, default to False

  • self_feature_range (Union[List[int], None]): the agent’s feature range

  • ally_feature_range (Union[List[int], None]): the agent ally’s feature range

  • attention_size (int): the size of attention net layer

  • mixer (bool): use mixer net or not, default to True

forward(data: dict, single_step: bool = True)dict[source]
Overview:

forward computation graph of collaQ network

Arguments:
  • data (dict): input data dict with keys [‘obs’, ‘prev_state’, ‘action’]
    • agent_state (torch.Tensor): each agent local state(obs)

    • agent_alone_state (torch.Tensor): each agent’s local state alone, in smac setting is without ally feature(obs_along)

    • global_state (torch.Tensor): global state(obs)

    • prev_state (list): previous rnn state, should include 3 parts: one hidden state of q_network, and two hidden state if q_alone_network for obs and obs_alone inputs

    • action (torch.Tensor or None): if action is None, use argmax q_value index as action to calculate agent_q_act

  • single_step (bool): whether single_step forward, if so, add timestep dim before forward and remove it after forward

Return:
  • ret (dict): output data dict with keys [‘total_q’, ‘logit’, ‘next_state’]
    • total_q (torch.Tensor): total q_value, which is the result of mixer network

    • agent_q (torch.Tensor): each agent q_value

    • next_state (list): next rnn state

Shapes:
  • agent_state (torch.Tensor): \((T, B, A, N)\), where T is timestep, B is batch_size A is agent_num, N is obs_shape

  • global_state (torch.Tensor): \((T, B, M)\), where M is global_obs_shape

  • prev_state (list): math:(B, A), a list of length B, and each element is a list of length A

  • action (torch.Tensor): \((T, B, A)\)

  • total_q (torch.Tensor): \((T, B)\)

  • agent_q (torch.Tensor): \((T, B, A, P)\), where P is action_shape

  • next_state (list): math:(B, A), a list of length B, and each element is a list of length A

The Benchmark result of CollaQ in SMAC (Samvelyan et al. 2019), for StarCraft micromanagement problems, implemented in DI-engine is shown.

References

Tianjun Zhang, Huazhe Xu, Xiaolong Wang, Yi Wu, Kurt Keutzer, Joseph E. Gonzalez, Yuandong Tian. Multi-Agent Collaboration via Reward Attribution Decomposition. arXiv preprint arXiv:2010.08531, 2020.

Mikayel Samvelyan, Tabish Rashid, Christian Schroeder de Witt, Gregory Farquhar, Nantas Nardelli, Tim G. J. Rudner, Chia-Man Hung, Philip H. S. Torr, Jakob Foerster, Shimon Whiteson. The StarCraft Multi-Agent Challenge. arXiv preprint arXiv:1902.04043, 2019.