DI-orchestrator Overview
To support running DI-engine on Kubernetes (K8s), we designed DI-orchestrator, which manages all the modules involved in distributed DI-engine training.
DI-orchestrator provides a set of microservices that make training more stable and efficient. The detailed architecture design is described in the DI-orchestrator Guide.
The following is a concrete example of how to launch a DI-engine training job (DIJob). You need to deploy a K8s cluster first:
1. Submit, View, Modify and Delete DIJob
A simple example is stored in
{DING_ROOT}/ding/scripts/dijob-qbert.yaml. You can use this example
to learn how to submit a DIJob to a Kubernetes cluster.
# submit DIJob
$ kubectl create -f dijob-qbert.yaml
diengine.opendilab.org/qbert-dqn created
# get pods and you will see that the coordinator is created
$ kubectl get pod
NAME                        READY   STATUS    RESTARTS   AGE
qbert-dqn-coordinator       1/1     Running   0          8s
# a few seconds later, you will see collectors and learners (and an aggregator if needed) created by di-server
$ kubectl get pod
NAME                        READY   STATUS    RESTARTS   AGE
qbert-dqn-aggregator        1/1     Running   0          80s
qbert-dqn-collector-pm5gv   1/1     Running   0          66s
qbert-dqn-coordinator       1/1     Running   0          80s
qbert-dqn-learner-rcwmc     1/1     Running   0          66s
qbert-dqn-learner-txjks     1/1     Running   0          66s
# get logs
$ kubectl logs qbert-dqn-coordinator
* Serving Flask app "interaction.master.master" (lazy loading)
* Environment: production
WARNING: This is a development server. Do not use it in a production deployment.
Use a production WSGI server instead.
* Debug mode: off
...
# delete DIJob
$ kubectl delete dijob qbert-dqn
# or
$ kubectl delete -f dijob-qbert.yaml
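When scripting around the commands above, it can help to check the pod listing programmatically instead of by eye. The following is a minimal sketch, not part of DI-orchestrator: the helper names parse_pod_table and pods_not_running are hypothetical, and the sample text is the kubectl get pod output shown above. It assumes whitespace-separated columns, so it may misread the trailing columns on kubectl versions that print extra text in RESTARTS.

```python
def parse_pod_table(text):
    """Parse `kubectl get pod` tabular output into a list of dicts.

    Hypothetical helper: keys are the lower-cased column headers.
    """
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0].lower().split()
    return [dict(zip(header, ln.split())) for ln in lines[1:]]


def pods_not_running(pods):
    """Return the names of pods whose STATUS column is not 'Running'."""
    return [p["name"] for p in pods if p["status"] != "Running"]


# Sample output copied from the walkthrough above.
SAMPLE = """\
NAME                        READY   STATUS    RESTARTS   AGE
qbert-dqn-aggregator        1/1     Running   0          80s
qbert-dqn-collector-pm5gv   1/1     Running   0          66s
qbert-dqn-coordinator       1/1     Running   0          80s
qbert-dqn-learner-rcwmc     1/1     Running   0          66s
qbert-dqn-learner-txjks     1/1     Running   0          66s
"""

pods = parse_pod_table(SAMPLE)
print(len(pods), pods_not_running(pods))  # prints: 5 []
```

In practice the text would come from running kubectl get pod yourself (for example through subprocess); the sketch deliberately stops short of invoking kubectl so it can be tried without a cluster.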
2. Check the status of DIJob
# display the DIJob qbert-dqn
$ kubectl get dijob qbert-dqn
NAME PHASE AGE
qbert-dqn Succeeded 22h
# show details of a specific resource or group of resources
$ kubectl describe dijob qbert-dqn
Name:         qbert-dqn
Namespace:    default
Labels:       <none>
...
  Phase:  Succeeded
  Replica Status:
    Aggregator:
    Collector:
      Succeeded:  2
    Coordinator:
      Succeeded:  1
    Learner:
      Succeeded:  1
Events:  <none>
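For automation, the phase can also be read from machine-readable output rather than from the describe text. The sketch below is hedged: it assumes that kubectl get dijob qbert-dqn -o json exposes the phase under status.phase, which matches the Phase line shown above but is not confirmed by this page. The function name job_phase and the sample JSON are hypothetical.

```python
import json


def job_phase(dijob_json):
    """Return the phase of a DIJob from its JSON representation.

    Assumption: the phase lives at .status.phase. If the field is
    missing, 'Unknown' is returned instead of raising.
    """
    obj = json.loads(dijob_json)
    return obj.get("status", {}).get("phase", "Unknown")


# Hypothetical, heavily trimmed stand-in for `kubectl get dijob qbert-dqn -o json`.
SAMPLE_JSON = """
{
  "metadata": {"name": "qbert-dqn", "namespace": "default"},
  "status": {"phase": "Succeeded"}
}
"""

print(job_phase(SAMPLE_JSON))  # prints: Succeeded
```

A polling loop could call this helper until the phase stops changing; which phases are terminal is not stated here, so that decision is left to the caller.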
3. Set storage middleware
Because a DIJob needs storage middleware during training, the storage
must be declared in the DIJob .yaml file. Here’s an example that shows
the required fields for setting up storage middleware.
hostPath configuration example
Supply a directory from the host node’s filesystem in the field
volumes of the outermost spec. Both the field name (here, work-dir) and the field path (the directory location on the host) are required.
spec:
  ...
  volumes:
  - name: work-dir
    hostPath:
      path: /data/nfs/ding/qbert
In the volumeMounts field of each worker, fill in the fields name and mountPath
to specify where inside the pod the volume is mounted. Note that the name in volumeMounts must be the same as the name given to the hostPath volume in volumes.
...
coordinator:
  template:
    spec:
      containers:
      ...
        volumeMounts:
        - name: work-dir
          mountPath: /ding
...
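Since a mismatch between a volumeMounts name and a volumes name is an easy mistake to make, a small check can be run before submitting. This is an illustrative sketch that works on plain Python dicts mirroring the YAML above; the function name unmatched_volume_mounts is hypothetical, and loading the YAML file itself is left out.

```python
def unmatched_volume_mounts(volumes, containers):
    """Return (container name, mount name) pairs with no matching volume.

    `volumes` mirrors spec.volumes and `containers` mirrors a worker's
    template.spec.containers, both as lists of dicts.
    """
    defined = {v["name"] for v in volumes}
    missing = []
    for container in containers:
        for mount in container.get("volumeMounts", []):
            if mount["name"] not in defined:
                missing.append((container.get("name", "<unnamed>"), mount["name"]))
    return missing


# Values taken from the hostPath example above.
volumes = [{"name": "work-dir", "hostPath": {"path": "/data/nfs/ding/qbert"}}]
containers = [
    {"name": "coordinator",
     "volumeMounts": [{"name": "work-dir", "mountPath": "/ding"}]},
]

print(unmatched_volume_mounts(volumes, containers))  # prints: []
```

An empty list means every mount refers to a declared volume. A typo such as workdir would be reported together with the container that uses it.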
4. Insert experimental config in DIJob config
Generally, the experimental config (e.g. cartpole_dqn_config.py) and
DIJob config are stored in two different files. You can use the
following two methods to launch your experiment:
- Place the experimental config file in the mounted volume in advance, as in the above-mentioned /data/nfs/ding/qbert;
- Insert the experimental config in the DIJob config.
Here’s an example of inserting the experimental config in the DIJob
config:
coordinator:
  template:
    spec:
      containers:
      - name: coordinator
        image: ...
        ...
        command: ["/bin/bash", "-c"]
        args:
        - |
          cat <<EOF > qbert_dqn_config_k8s.py
          from easydict import EasyDict
          qbert_dqn_config = dict(
              env=dict(
                  collector_env_num=16,
                  collector_episode_num=2,
                  evaluator_env_num=8,
                  evaluator_episode_num=1,
                  stop_value=30000,
                  env_id='QbertNoFrameskip-v4',
                  frame_stack=4,
                  manager=dict(
                      shared_memory=False,
                      ...
          ...
          qbert_dqn_system_config = EasyDict(qbert_dqn_system_config)
          system_config = qbert_dqn_system_config
          EOF
          ding -m dist --module config -p k8s -c qbert_dqn_config_k8s.py -s 0;
          ding -m dist --module coordinator -p k8s -c qbert_dqn_config_k8s.py.pkl -s 0
...
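If the experiment config already exists as a separate file, the shell script placed under args can be generated rather than pasted by hand. The following is a rough sketch under that assumption: build_coordinator_script is a hypothetical helper, and the two ding commands are copied verbatim from the example above, not derived from the ding CLI documentation.

```python
def build_coordinator_script(config_text, config_name="qbert_dqn_config_k8s.py"):
    """Build the bash script that writes the config via a heredoc, then runs ding.

    The heredoc delimiter is unquoted (as in the example above), so the shell
    expands `$` and backticks inside the config text.
    """
    # Conservative check: a bare EOF line would end the heredoc early.
    if any(line.strip() == "EOF" for line in config_text.splitlines()):
        raise ValueError("config text must not contain a bare 'EOF' line")
    lines = [
        f"cat <<EOF > {config_name}",
        config_text.rstrip("\n"),
        "EOF",
        f"ding -m dist --module config -p k8s -c {config_name} -s 0;",
        f"ding -m dist --module coordinator -p k8s -c {config_name}.pkl -s 0",
    ]
    return "\n".join(lines) + "\n"


# Tiny placeholder config, used only to show the shape of the result.
script = build_coordinator_script("from easydict import EasyDict\nmain_config = EasyDict({})\n")
print(script)
```

The resulting string is what would go under the "- |" entry of args. If the config contains dollar signs or backticks, quoting the delimiter (cat <<'EOF') prevents the shell from expanding them.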
5. Define environment variables for a worker
To set environment variables, include the env field in the
configuration file. Here’s an example of defining an environment
variable with name PYTHONUNBUFFERED and value 1:
...
coordinator:
  template:
    spec:
      containers:
      - name: coordinator
        image: ...
        ...
        env:
        - name: PYTHONUNBUFFERED
          value: "1"
        ...
...
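Note that the value is quoted in the example: Kubernetes expects environment variable values to be strings, so an unquoted 1 would be read by YAML as an integer. When building the container spec in Python, a helper like the one sketched below can enforce this; set_env is a hypothetical name, and the container is a plain dict mirroring the YAML above.

```python
def set_env(container, name, value):
    """Set an environment variable on a container dict, coercing the value to str.

    Replaces an existing entry with the same name instead of duplicating it.
    """
    env = container.setdefault("env", [])
    entry = {"name": name, "value": str(value)}
    for i, existing in enumerate(env):
        if existing.get("name") == name:
            env[i] = entry
            return container
    env.append(entry)
    return container


coordinator = set_env({"name": "coordinator"}, "PYTHONUNBUFFERED", 1)
print(coordinator["env"])  # prints: [{'name': 'PYTHONUNBUFFERED', 'value': '1'}]
```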
6. Assign CPU, memory, and GPU resources to workers
The CPU, memory, and GPU resources required by each worker may differ. You
need to specify requests in the field resources:requests of each
worker. To specify a resource limit, include resources:limits.
Here’s an example configuration for a learner that requests 6 CPUs, 1 GPU, and 10Gi of memory, with limits set to the same values:
...
learner:
  template:
    spec:
      containers:
      - name: learner
        image: ...
        ...
        resources:
          requests:
            cpu: "6"
            nvidia.com/gpu: "1"
            memory: "10Gi"
          limits:
            cpu: "6"
            nvidia.com/gpu: "1"
            memory: "10Gi"
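Kubernetes rejects a container whose request exceeds its limit, and for extended resources such as nvidia.com/gpu the request and limit must be equal when both are given. The sketch below checks both rules on a plain dict mirroring the resources block above. It is an illustration only: parse_quantity handles a small subset of Kubernetes quantity suffixes, treating any resource name containing "/" as an extended resource is a heuristic, and both function names are hypothetical.

```python
from fractions import Fraction

# Subset of Kubernetes quantity suffixes (binary and decimal multiples).
_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
             "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def parse_quantity(quantity):
    """Convert a quantity such as '6', '500m' or '10Gi' to an exact Fraction."""
    text = str(quantity).strip()
    if text.endswith("m"):  # milli-units, e.g. 500m CPU
        return Fraction(text[:-1]) / 1000
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if text.endswith(suffix):
            return Fraction(text[:-len(suffix)]) * _SUFFIXES[suffix]
    return Fraction(text)


def check_resources(resources):
    """Return a list of human-readable problems found in a resources dict."""
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    problems = []
    for name, request in requests.items():
        if name not in limits:
            continue
        req, lim = parse_quantity(request), parse_quantity(limits[name])
        if req > lim:
            problems.append(f"{name}: request {request} exceeds limit {limits[name]}")
        elif "/" in name and req != lim:  # heuristic for extended resources
            problems.append(f"{name}: request must equal limit for extended resources")
    return problems


# Values taken from the learner example above.
learner_resources = {
    "requests": {"cpu": "6", "nvidia.com/gpu": "1", "memory": "10Gi"},
    "limits": {"cpu": "6", "nvidia.com/gpu": "1", "memory": "10Gi"},
}
print(check_resources(learner_resources))  # prints: []
```

Fractions are used instead of floats so that comparisons such as 500m against 0.5 stay exact.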