How to run
1. Install pdsh on your nodes
# https://code.google.com/archive/p/pdsh/downloads
# For example, download to /root:
cd /root
wget https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/pdsh/pdsh-2.29.tar.bz2
tar -xvf pdsh-2.29.tar.bz2
cd pdsh-2.29
./configure --prefix=/root/pdsh-2.29 --with-ssh --without-rsh --with-exec --with-timeout=60 --with-nodeupdown --with-rcmd-rank-list=ssh
make
make install
Make sure the ownership of the install directory is correct:
chown root:root /root/pdsh-2.29
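Because the install prefix above is /root/pdsh-2.29, the pdsh binary ends up in /root/pdsh-2.29/bin, which is normally not on PATH. A minimal sketch of how the install could be checked, assuming that prefix (the PDSH_HOME variable is only an illustrative name):

```shell
# Assumption: pdsh was installed with --prefix=/root/pdsh-2.29 as above,
# so the binary lives in /root/pdsh-2.29/bin.
PDSH_HOME=/root/pdsh-2.29
export PATH="$PDSH_HOME/bin:$PATH"

if command -v pdsh >/dev/null 2>&1; then
    # Print the pdsh version and the modules that were compiled in.
    pdsh -V
else
    echo "pdsh not found in $PDSH_HOME/bin; check the output of make install"
fi
```

To keep the setting across logins, the export line can be added to the shell profile on each node.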
2. Configure SSH
Edit your ~/.ssh/config (for example with vim) and add:
Host worker-0
HostName your-worker-0-ip-here
User root
Host worker-1
HostName your-worker-1-ip-here
User root
This example assumes two nodes. Make sure every other node can be logged into with ssh root@worker-x without a password (using an SSH key).
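One way to confirm that key-based login works before launching training is sketched below. It assumes the host aliases worker-0 and worker-1 from the config above. The WORKERS variable and the check_worker function are hypothetical helpers, not part of swift or pdsh.

```shell
#!/bin/sh
# If no key exists yet, one could be created and copied to each worker first:
#   ssh-keygen -t ed25519
#   ssh-copy-id root@worker-0
#   ssh-copy-id root@worker-1

# Assumption: these are the Host aliases defined in ~/.ssh/config.
WORKERS="${WORKERS:-worker-0 worker-1}"

check_worker() {
    # BatchMode=yes makes ssh fail instead of prompting for a password,
    # so a password-only host is reported as a failure.
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "root@$1" true 2>/dev/null; then
        echo "$1: ok"
    else
        echo "$1: FAILED (try: ssh-copy-id root@$1)"
        return 1
    fi
}

failed=0
for w in $WORKERS; do
    check_worker "$w" || failed=$((failed + 1))
done
echo "$failed worker(s) not reachable with key-based login"
```

Once every worker reports ok, running pdsh -R ssh -w worker-0,worker-1 hostname should print one hostname per node.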
3. Clone the swift repo and run the training script
git clone https://github.com/modelscope/ms-swift.git
cd ms-swift
# If you have a different number of nodes, edit examples/train/multi-node/deepspeed/host.txt
sh examples/train/multi-node/deepspeed/train.sh
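For reference, DeepSpeed hostfiles list one node per line in the form hostname slots=N, where N is the number of GPUs to use on that node. The sketch below writes such a file to a scratch path. The node names, the GPU count and the /tmp/host.txt path are assumptions for illustration, so compare the result with the host.txt shipped in the repo before replacing it.

```shell
# Hypothetical values: two workers with 8 GPUs each.
HOSTFILE="${HOSTFILE:-/tmp/host.txt}"
GPUS_PER_NODE=8

# Start from an empty file, then add one "hostname slots=N" line per node.
: > "$HOSTFILE"
for host in worker-0 worker-1; do
    echo "$host slots=$GPUS_PER_NODE" >> "$HOSTFILE"
done

cat "$HOSTFILE"
```

The hostnames should match the Host aliases in ~/.ssh/config so that pdsh can reach each node.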