# Ray support

SWIFT supports multi-GPU and multi-node training with ray. The table below shows which existing features support ray:

| Feature | Ray support | Example | Assignable roles |
|----------|-------|--------------------------------------------------------------------------------|-----------------|
| pt/sft | ✅ | https://github.com/modelscope/ms-swift/tree/main/examples/train/multi-node/ray | default |
| dpo | ❎ | | |
| grpo | ❎ | | |
| ppo | ❎ | | |
| megatron | ❎ | | |
| sampling | ✅ | https://github.com/modelscope/ms-swift/tree/main/examples/sampler/distill | sampler/prm/orm |
| distill | ✅ | https://github.com/modelscope/ms-swift/tree/main/examples/sampler/sample | sampler/prm/orm |

## Technical details

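Before describing the parameter settings, it is worth explaining the technical details. SWIFT currently relies heavily on existing implementations from transformers and trl, so splitting it into separate ray roles, as veRL or ROLL do, is not practical. Such a split would also make the design ray-centric and weaken support for non-ray scenarios.
SWIFT therefore adopts a decorator-based approach that defines roles at the function level. How these roles are used can then be specified through parameters. Consider the following example: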

```python
from torch.utils.data import DataLoader  # assumed: the PyTorch DataLoader

from swift.ray import RayHelper


# The class declares every role (group) that its methods may run under.
@RayHelper.worker(group=['model1', 'model2'])
class MyTrainer:

    def __init__(self, args):
        self._prepare_model1()
        self._prepare_model2()
        self._prepare_datasets()

    # Methods tagged with a group run on the hardware assigned to that group.
    @RayHelper.function(group='model1')
    def _prepare_model1(self):
        ...

    @RayHelper.function(group='model2')
    def _prepare_model2(self):
        ...

    @RayHelper.function(group='model1')
    def rollout(self, inputs):
        return self.model1.generate(inputs)

    @RayHelper.function(group='model2')
    def forward_model2(self, inputs):
        loss = self.model2.forward(inputs)
        loss.backward()

    # Undecorated methods are ordinary local methods.
    def _prepare_datasets(self):
        self.dataset = ...

    def train(self):
        for batch in DataLoader(self.dataset):
            generated = self.rollout(batch)
            self.forward_model2(generated)
            ...


if __name__ == '__main__':
    ...
    MyTrainer(args).train()
```

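RayHelper assigns the decorated methods to different hardware groups in the cluster, and local calls are transparently converted into remote calls on the ray cluster. The split can also be organized around classes: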

```python
from swift.ray import RayHelper


@RayHelper.worker(group=['model1'])
class Model1:
    ...

    @RayHelper.function(group='model1')
    def rollout(self):
        ...


@RayHelper.worker(group=['model2'])
class Model2:
    ...

    @RayHelper.function(group='model2')
    def forward_and_optimize(self):
        ...


class Trainer:
    ...
```

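SWIFT's ray support essentially comes down to combining two decorators, @worker and @function: worker specifies the role within the ray cluster, and function specifies how data is distributed.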
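To make the decorator idea more concrete, here is a minimal toy sketch. It is not SWIFT's `RayHelper` and does not use ray. `ToyHelper`, `register_group`, and the thread pool that stands in for a group of remote workers are all hypothetical names chosen for illustration. The sketch only shows why a function-level decorator can serve both cases: if no group is configured, the call stays local; once a group is configured, the same call is routed elsewhere.

```python
import functools
from concurrent.futures import ThreadPoolExecutor


class ToyHelper:
    """Hypothetical stand-in for a ray-backed helper; NOT SWIFT's RayHelper."""

    # group name -> executor standing in for that group's remote workers
    _pools = {}

    @classmethod
    def register_group(cls, group, num_workers):
        cls._pools[group] = ThreadPoolExecutor(max_workers=num_workers)

    @classmethod
    def function(cls, group):
        def decorator(func):
            @functools.wraps(func)
            def wrapper(*args, **kwargs):
                pool = cls._pools.get(group)
                if pool is None:
                    # No group configured: plain local call (non-ray scenario).
                    return func(*args, **kwargs)
                # Group configured: hand the call to the group's pool and wait.
                return pool.submit(func, *args, **kwargs).result()

            wrapper.group = group  # remember which role this method belongs to
            return wrapper

        return decorator


class ToyTrainer:

    @ToyHelper.function(group='model1')
    def rollout(self, inputs):
        return [x * 2 for x in inputs]


trainer = ToyTrainer()
local_result = trainer.rollout([1, 2])    # no group registered: runs locally
ToyHelper.register_group('model1', num_workers=2)
routed_result = trainer.rollout([1, 2])   # same call, now routed to the pool
ToyHelper._pools['model1'].shutdown()
```

The calling code in `ToyTrainer` is identical in both cases, which is the property the decorator-based design is after.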

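The function decorator takes several additional parameters:
```python
from typing import Callable, Literal, Union

# Signature of RayHelper.function (body omitted)
@staticmethod
def function(group: str,
             dispatch: Union[Literal['slice', 'all'], Callable] = 'all',
             execute: Literal['first', 'all'] = 'all',
             collect: Union[Literal['none', 'flatten'], Callable] = 'none'):
    ...
```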

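- dispatch: how the call arguments are distributed
  - slice: split the arguments across the workers, so that the workers share the load
  - all: every worker receives identical arguments
  - Callable: a custom slicing function of the form:
    ```python
    def my_custom_slice(n, i, data):
        # n is the number of workers, i is the index of the current worker,
        # and data is the original input arguments
        # Return the arguments for the i-th worker
        ...
    ```
- execute: how the call is executed
  - first: only rank 0 executes; in this case slice and Callable dispatch have no effect
  - all: every worker executes

- collect: how the return values are gathered
  - none: returned as is, as a list of the workers' return values
  - flatten: flatten the results returned by the workers; flattening tuples is supported
  - Callable: a custom collect function of the form:
    ```python
    def my_custom_collect(result):
        # result is the list of values returned by the workers
        # Return the data in whatever format you need
        ...
    ```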
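The dispatch and collect options above can be illustrated with a small local simulation. This is only a sketch under assumptions: `even_slice`, `flatten_collect`, and `simulate_call` are hypothetical helpers, `data` is assumed to be a plain list, and the workers are simulated by a loop rather than by ray. It follows the `(n, i, data)` and `(result)` callback shapes described above; the `execute` option is not modeled.

```python
from itertools import chain


def even_slice(n, i, data):
    """Custom dispatch: return a contiguous, near-equal chunk for worker i."""
    size, extra = divmod(len(data), n)
    start = i * size + min(i, extra)
    end = start + size + (1 if i < extra else 0)
    return data[start:end]


def flatten_collect(result):
    """Custom collect: merge the per-worker lists into one flat list."""
    return list(chain.from_iterable(result))


def simulate_call(fn, n, data, dispatch, collect):
    """Run fn once per simulated worker, then gather the results."""
    per_worker = [fn(dispatch(n, i, data)) for i in range(n)]
    return collect(per_worker)


def square_all(chunk):
    return [x * x for x in chunk]


chunks = [even_slice(3, i, list(range(7))) for i in range(3)]
squares = simulate_call(square_all, 3, list(range(7)), even_slice, flatten_collect)
print(chunks)   # [[0, 1, 2], [3, 4], [5, 6]]
print(squares)  # [0, 1, 4, 9, 16, 25, 36]
```

Slicing contiguously and flattening in worker order keeps the output in the same order as the input, which is usually what a caller expects from a load-balanced call.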

## Parameter settings

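With the technical details covered, we can turn to parameter configuration. Developers can choose a hardware layout according to the list of roles in each workflow. For example, the sampling feature has three roles, sampler, prm, and orm, which can be configured as follows: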

```yaml
device_groups:
  nproc_per_node: 4
  sample_group:
    device: GPU
    ranks: list(range(0, 2))
    workers:
      - sampler
  rm_group:
    device: GPU
    ranks: list(range(2, 4))
    workers:
      - prm
      - orm
```

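- nproc_per_node: the minimum number of accelerator cards required on each node of the ray cluster.
- xxx_group: the name of each ray group; any name may be used.
  - device: the device type; types such as GPU and CPU are currently supported.
  - ranks: which ranks this group is assigned to. For CPU, ranks must be an integer giving the total number of processes required. For GPU, formats such as `[0,1,2,3]`, `4`, and `list(range(0, 4))` are accepted.
  - workers: which roles are assigned to this group.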

All available roles are listed in the table at the top of this document.
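The three `ranks` formats accepted for GPU groups can be thought of as different spellings of one list of rank indices. The sketch below shows one way such values might be normalized. `parse_ranks` is a hypothetical helper, not part of SWIFT, and it assumes that an integer such as `4` means ranks 0 through 3. It matches `list(range(a, b))` with a regular expression instead of calling `eval`.

```python
import ast
import re

# Matches strings such as "list(range(0, 4))"
_RANGE_PATTERN = re.compile(r'list\(range\((\d+)\s*,\s*(\d+)\)\)')


def parse_ranks(value):
    """Normalize a ranks value into an explicit list of ints (hypothetical helper)."""
    if isinstance(value, int):
        # Assumption: an integer n stands for ranks 0 .. n-1.
        return list(range(value))
    if isinstance(value, (list, tuple)):
        return [int(rank) for rank in value]
    text = str(value).strip()
    match = _RANGE_PATTERN.fullmatch(text)
    if match:
        return list(range(int(match.group(1)), int(match.group(2))))
    # Fall back to literals such as "[0,1,2,3]" or "4".
    return parse_ranks(ast.literal_eval(text))


print(parse_ranks('list(range(0, 4))'))  # [0, 1, 2, 3]
print(parse_ranks('[0,1,2,3]'))          # [0, 1, 2, 3]
print(parse_ranks(4))                    # [0, 1, 2, 3]
```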

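When using the command line, device_groups can also be passed as `--device_groups xxx`, where xxx is a JSON string. To keep configuration simple, we strongly recommend using YAML together with ray.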
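For the command-line route, the JSON string can be generated instead of written by hand. The following sketch assumes that the JSON mirrors the structure of the YAML example above, which this document does not state explicitly. It uses only the standard library: `json.dumps` builds the string and `shlex.quote` makes it safe to paste into a POSIX shell.

```python
import json
import shlex

# Same layout as the YAML example; the ranks values stay strings, as in YAML.
device_groups = {
    'nproc_per_node': 4,
    'sample_group': {
        'device': 'GPU',
        'ranks': 'list(range(0, 2))',
        'workers': ['sampler'],
    },
    'rm_group': {
        'device': 'GPU',
        'ranks': 'list(range(2, 4))',
        'workers': ['prm', 'orm'],
    },
}

device_groups_json = json.dumps(device_groups)
# Quote the JSON so the shell passes it through as a single argument.
cli_fragment = f'--device_groups {shlex.quote(device_groups_json)}'
print(cli_fragment)
```

Escaping nested quotes by hand is error-prone, which is one practical reason the YAML form is the more convenient choice.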