
DeepSpeed[[deepspeed]]

DeepSpeed is a PyTorch optimization library that makes distributed training memory-efficient and fast. At its core is the Zero Redundancy Optimizer (ZeRO), which enables training large models at scale. ZeRO works in several stages:

  • ZeRO-1, optimizer state partitioning across GPUs
  • ZeRO-2, gradient partitioning across GPUs
  • ZeRO-3, parameter partitioning across GPUs

In GPU-limited environments, ZeRO also enables offloading optimizer memory and computation from the GPU to the CPU, so that very large models can fit and train on a single GPU. DeepSpeed is integrated with the Transformers [Trainer] class for all ZeRO stages and offloading. All you need to do is provide a config file or use a provided template. For inference, Transformers supports ZeRO-3 and offloading because they allow huge models to be loaded.

This guide walks you through how to deploy DeepSpeed training, the features you can enable, how to set up the config files for the different ZeRO stages, offloading, inference, and using DeepSpeed without the [Trainer].

Installation[[installation]]

DeepSpeed can be installed from PyPI or through Transformers (for more detailed installation options, see the DeepSpeed installation details or the GitHub README).

If you have trouble installing DeepSpeed, check the DeepSpeed CUDA installation guide. Although DeepSpeed has a pip-installable PyPI package, installing from source is highly recommended so that the build best matches your hardware and supports certain features, such as 1-bit Adam, that are not available in the PyPI distribution.

pip install deepspeed
pip install transformers[deepspeed]

Memory requirements[[memory-requirements]]

Before you begin, it is a good idea to check whether you have enough GPU and CPU memory to fit your model. DeepSpeed provides a tool for estimating the required CPU/GPU memory. For example, to estimate the memory requirements of the bigscience/T0_3B model on a single GPU:

$ python -c 'from transformers import AutoModel; \
from deepspeed.runtime.zero.stage3 import estimate_zero3_model_states_mem_needs_all_live; \
model = AutoModel.from_pretrained("bigscience/T0_3B"); \
estimate_zero3_model_states_mem_needs_all_live(model, num_gpus_per_node=1, num_nodes=1)'
[...]
Estimated memory needed for params, optim states and gradients for a:
HW: Setup with 1 node, 1 GPU per node.
SW: Model with 2783M total params, 65M largest layer params.
  per CPU  |  per GPU |   Options
   70.00GB |   0.25GB | offload_param=cpu , offload_optimizer=cpu , zero_init=1
   70.00GB |   0.25GB | offload_param=cpu , offload_optimizer=cpu , zero_init=0
   62.23GB |   5.43GB | offload_param=none, offload_optimizer=cpu , zero_init=1
   62.23GB |   5.43GB | offload_param=none, offload_optimizer=cpu , zero_init=0
    0.37GB |  46.91GB | offload_param=none, offload_optimizer=none, zero_init=1
   15.56GB |  46.91GB | offload_param=none, offload_optimizer=none, zero_init=0

This means you need either a single 80GB GPU without CPU offload, or an 8GB GPU plus roughly 60GB of CPU memory to offload to (these are only the memory requirements for the parameters, optimizer states and gradients; you need a bit more for the CUDA kernels and activations). You should also weigh cost against speed: a smaller GPU is cheaper to rent or buy, but training takes longer.

If you have enough GPU memory, disable CPU/NVMe offload to make everything faster.
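The GPU column of the table above can be approximated by hand. The sketch below is a back-of-the-envelope estimate, not DeepSpeed's estimator: the byte counts (18 bytes per parameter with nothing offloaded, 2 bytes per parameter when only the optimizer is offloaded, 4 bytes per parameter of the largest layer as working space) are assumptions chosen because they reproduce the single-GPU figures, and `estimate_gpu_gib` is a hypothetical name.

```python
GIB = 2**30  # the table's "GB" figures line up with binary gigabytes


def estimate_gpu_gib(total_params, largest_layer_params, offload="none", num_gpus=1):
    """Rough per-GPU memory (GiB) for ZeRO-3 model states.

    offload is one of "none", "optimizer", "optimizer+param".
    Assumed byte counts (illustrative, fitted to the table above):
    - 4 bytes per parameter of the largest layer, always resident
    - 18 bytes per parameter when nothing is offloaded
    - 2 bytes per parameter when only the optimizer is offloaded
    """
    largest_layer_bytes = 4 * largest_layer_params
    if offload == "optimizer+param":
        sharded_bytes = 0
    elif offload == "optimizer":
        sharded_bytes = 2 * total_params / num_gpus
    elif offload == "none":
        sharded_bytes = 18 * total_params / num_gpus
    else:
        raise ValueError(f"unknown offload mode: {offload}")
    return (largest_layer_bytes + sharded_bytes) / GIB


total, largest = 2783e6, 65e6  # rounded T0_3B figures from the output above
for mode in ("optimizer+param", "optimizer", "none"):
    print(f"{mode:16s} {estimate_gpu_gib(total, largest, mode):6.2f} GiB")
```

Because the parameter counts are rounded to millions, the results differ from the table in the second decimal place.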

Select a ZeRO stage[[select-a-zero-stage]]

After you have installed DeepSpeed and have a better idea of your memory requirements, the next step is selecting a ZeRO stage to use. Ordered from fastest, and from most memory-efficient:

Fastest          | Most memory-efficient
ZeRO-1           | ZeRO-3 + offload
ZeRO-2           | ZeRO-3
ZeRO-2 + offload | ZeRO-2 + offload
ZeRO-3           | ZeRO-2
ZeRO-3 + offload | ZeRO-1

To find what works best for you, start with the fastest approach and, if you run out of memory, try the next stage, which is slower but more memory-efficient. Feel free to work in whichever direction you prefer (starting with the most memory-efficient or the fastest) to find the right balance between speed and memory usage.

A general process you can use is (start with a batch size of 1):

  1. Enable gradient checkpointing
  2. Try ZeRO-2
  3. Try ZeRO-2 and offload the optimizer
  4. Try ZeRO-3
  5. Try ZeRO-3 and offload parameters to the CPU
  6. Try ZeRO-3 and offload parameters and the optimizer to the CPU
  7. Try lowering various default values, such as a narrower beam search, if you are using the [~GenerationMixin.generate] method
  8. Try mixed half-precision (fp16 on older GPU architectures, bf16 on Ampere and newer GPUs) instead of full-precision weights
  9. Add more hardware if possible, or enable Infinity to offload parameters and the optimizer to NVMe
  10. Once you are no longer running out of memory, measure effective throughput and then increase the batch size as far as you can to maximize GPU efficiency
  11. Lastly, optimize your training setup by disabling some offload features or using a faster ZeRO stage, and by increasing or decreasing the batch size, to find the best tradeoff between speed and memory usage
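Steps 2 to 6 above can be written down as an ordered list of candidate `zero_optimization` blocks. This is a minimal sketch: `zero_ladder` and `first_that_fits` are hypothetical helpers, and the `fits` callback stands in for actually launching a short training run and checking for an out-of-memory error.

```python
def zero_ladder():
    """Return (label, zero_optimization) pairs, fastest first.

    Mirrors steps 2 to 6 of the list above; only keys that appear in
    this guide's configuration examples are used.
    """
    cpu = {"device": "cpu", "pin_memory": True}
    return [
        ("ZeRO-2", {"stage": 2}),
        ("ZeRO-2 + optimizer offload", {"stage": 2, "offload_optimizer": dict(cpu)}),
        ("ZeRO-3", {"stage": 3}),
        ("ZeRO-3 + parameter offload", {"stage": 3, "offload_param": dict(cpu)}),
        ("ZeRO-3 + parameter and optimizer offload",
         {"stage": 3, "offload_param": dict(cpu), "offload_optimizer": dict(cpu)}),
    ]


def first_that_fits(fits):
    """Return the first rung for which the caller-supplied `fits` is true."""
    for label, zero_cfg in zero_ladder():
        if fits(zero_cfg):
            return label, zero_cfg
    return None


# Example: pretend only configurations that offload parameters fit in memory.
print(first_that_fits(lambda cfg: "offload_param" in cfg))
```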

DeepSpeed configuration file[[deepspeed-configuration-file]]

DeepSpeed works with the [Trainer] class through a config file that contains all the parameters for configuring your training run. When you execute your training script, DeepSpeed logs the configuration it received from the [Trainer] to the console, so you can see exactly which configuration was used.

A complete list of DeepSpeed configuration options is available in the DeepSpeed Configuration JSON reference. You can also find more practical examples of DeepSpeed configurations in the DeepSpeedExamples repository or the main DeepSpeed repository. To quickly find specific examples, you can:

git clone https://github.com/deepspeedai/DeepSpeedExamples
cd DeepSpeedExamples
find . -name '*json'
# find examples that use the Lamb optimizer
grep -i Lamb $(find . -name '*json')

The DeepSpeed configuration file is passed as a path to a JSON file if you are training from the command line interface, or as a nested dict object if you are using the [Trainer] in a notebook setting.

# Option 1: pass the path to a JSON config file
TrainingArguments(..., deepspeed="path/to/deepspeed_config.json")

# Option 2: pass a nested dict, for example in a notebook
ds_config_dict = dict(scheduler=scheduler_params, optimizer=optimizer_params)
args = TrainingArguments(..., deepspeed=ds_config_dict)
trainer = Trainer(model, args, ...)
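The two forms carry the same information. As a small illustration that uses only the standard library (the particular config values are placeholders, not a recommended setup), a nested dict can be written to a JSON file and read back unchanged:

```python
import json
import os
import tempfile

# A small nested dict using keys shown elsewhere in this guide.
ds_config_dict = {
    "zero_optimization": {"stage": 2},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

# Write it out so the same settings can be passed as a path on the command line.
config_path = os.path.join(tempfile.mkdtemp(), "ds_config.json")
with open(config_path, "w") as f:
    json.dump(ds_config_dict, f, indent=4)

# Reading the file back gives the same nested dict.
with open(config_path) as f:
    loaded = json.load(f)

print(loaded == ds_config_dict)
```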

DeepSpeed and Trainer parameters[[deepspeed-and-trainer-parameters]]

There are three types of configuration parameters:

  1. Some configuration parameters are shared by the [Trainer] and DeepSpeed, and errors can be hard to identify when their definitions conflict. To make this easier, these shared parameters are configured from the [Trainer] command line arguments.

  2. Some configuration parameters are derived automatically from the model configuration, so you do not need to adjust them manually. The [Trainer] uses the configuration value auto to set the most correct or efficient value. You can set these parameters explicitly yourself, but you must take care that the [Trainer] arguments and the DeepSpeed configuration parameters agree. Mismatches may cause training to fail in ways that are very difficult to detect!

  3. Some configuration parameters are specific to DeepSpeed and must be set manually based on your training needs.

You can also modify the DeepSpeed configuration and derive [TrainingArguments] from it:

  1. Create or load a DeepSpeed configuration to use as the main configuration.
  2. Create a [TrainingArguments] object based on these DeepSpeed configuration values.

Some values, such as scheduler.params.total_num_steps, are calculated by the [Trainer] during training.
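To make the risk of silent mismatches concrete, here is a hedged sketch of a consistency check. `find_mismatches` is a hypothetical helper, not part of Transformers or DeepSpeed; `trainer_args` is a plain dict standing in for a [TrainingArguments] object, and the key pairs are the ones this guide names for batch size, gradient accumulation and gradient clipping.

```python
# DeepSpeed config key -> TrainingArguments attribute, as described in this guide.
SHARED_KEYS = {
    "train_micro_batch_size_per_gpu": "per_device_train_batch_size",
    "gradient_accumulation_steps": "gradient_accumulation_steps",
    "gradient_clipping": "max_grad_norm",
}


def find_mismatches(ds_config, trainer_args):
    """List shared settings whose explicit DeepSpeed value differs from the Trainer value.

    Values set to "auto" (or missing) are skipped because the Trainer fills them in.
    """
    mismatches = []
    for ds_key, arg_name in SHARED_KEYS.items():
        ds_value = ds_config.get(ds_key, "auto")
        if ds_value == "auto" or arg_name not in trainer_args:
            continue
        if ds_value != trainer_args[arg_name]:
            mismatches.append((ds_key, ds_value, arg_name, trainer_args[arg_name]))
    return mismatches


ds_config = {"train_micro_batch_size_per_gpu": 4, "gradient_clipping": "auto"}
trainer_args = {"per_device_train_batch_size": 8, "max_grad_norm": 1.0}
print(find_mismatches(ds_config, trainer_args))
```

Leaving shared values at "auto" sidesteps this class of error entirely, which is why the guide recommends it.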

ZeRO configuration[[zero-configuration]]

There are three configurations, each corresponding to a different ZeRO stage. Stage 1 is less interesting for scalability, so this guide focuses on stages 2 and 3. The zero_optimization configuration contains all the options for what to enable and how to configure it. For a more detailed explanation of each parameter, see the DeepSpeed Configuration JSON reference.

DeepSpeed does not validate parameter names, and any typo silently falls back to the parameter's default setting. Watch the DeepSpeed engine startup log messages to see which values it is going to use.
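Since typos are not reported, a quick check before launching can help. The sketch below is an assumption-laden illustration: `suspicious_keys` is a hypothetical helper, and `KNOWN_ZERO_KEYS` lists only the zero_optimization keys that appear in this guide's examples, not the full set DeepSpeed accepts.

```python
import difflib

# zero_optimization keys that appear in this guide's examples (not an exhaustive list).
KNOWN_ZERO_KEYS = [
    "stage", "offload_optimizer", "offload_param", "allgather_partitions",
    "allgather_bucket_size", "overlap_comm", "reduce_scatter",
    "reduce_bucket_size", "contiguous_gradients", "round_robin_gradients",
    "sub_group_size", "stage3_prefetch_bucket_size",
    "stage3_param_persistence_threshold", "stage3_max_live_parameters",
    "stage3_max_reuse_distance", "stage3_gather_16bit_weights_on_model_save",
]


def suspicious_keys(zero_config):
    """Return (key, closest_known_key_or_None) for keys not in KNOWN_ZERO_KEYS."""
    report = []
    for key in zero_config:
        if key not in KNOWN_ZERO_KEYS:
            close = difflib.get_close_matches(key, KNOWN_ZERO_KEYS, n=1)
            report.append((key, close[0] if close else None))
    return report


print(suspicious_keys({"stage": 2, "overlap_com": True}))
```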

The following configurations must be set up with DeepSpeed because the [Trainer] does not provide equivalent command line arguments.

ZeRO-1 shards the optimizer states across GPUs, and you can expect a small speedup. The ZeRO-1 config can be set up like this:

{
    "zero_optimization": {
        "stage": 1
    }
}

ZeRO-2 shards the optimizer and gradients across GPUs. This stage is used primarily for training because its features are not relevant to inference. Some important parameters to configure for better performance are:

  • offload_optimizer should be enabled to reduce GPU memory usage.
  • overlap_comm, when set to true, trades increased GPU memory usage for lower allreduce latency. This feature uses 4.5x the allgather_bucket_size and reduce_bucket_size values. In this example they are set to 5e8, which requires 9GB of GPU memory. If your GPU has 8GB of memory or less, reduce these bucket sizes (for example to about 2e8, which needs roughly 3.6GB) to lower the memory requirement and prevent an out-of-memory (OOM) error.
  • allgather_bucket_size and reduce_bucket_size trade available GPU memory for communication speed. The smaller their values, the slower the communication and the more GPU memory is left available. You can decide, for example, whether a bigger batch size matters more than a slightly slower training time.
  • round_robin_gradients is available from DeepSpeed 0.4.4 for CPU offloading. It parallelizes gradient copying to CPU memory among ranks through fine-grained gradient partitioning. The performance benefit grows with the number of gradient accumulation steps (more copying between optimizer steps) or the GPU count (increased parallelism).

{
    "zero_optimization": {
        "stage": 2,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "allgather_partitions": true,
        "allgather_bucket_size": 5e8,
        "overlap_comm": true,
        "reduce_scatter": true,
        "reduce_bucket_size": 5e8,
        "contiguous_gradients": true,
        "round_robin_gradients": true
    }
}
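The 9GB figure quoted for overlap_comm can be reproduced with a short calculation. This is a sketch under stated assumptions: 2 bytes per element (fp16), two equally sized buckets (allgather and reduce), and the 4.5x multiplier from the text; `overlap_comm_footprint_gb` is a hypothetical helper, not a DeepSpeed function.

```python
def overlap_comm_footprint_gb(bucket_size, bytes_per_element=2, num_buckets=2, factor=4.5):
    """Approximate GPU memory (in GB, 1e9 bytes) used by overlap_comm buffers.

    Assumptions: fp16 elements (2 bytes), one allgather and one reduce bucket
    of equal size, and the 4.5x multiplier quoted in this guide.
    """
    return bucket_size * bytes_per_element * num_buckets * factor / 1e9


for size in (5e8, 2e8):
    print(f"bucket size {size:.0e}: about {overlap_comm_footprint_gb(size):.1f} GB")
```

The footprint scales linearly with the bucket size, so halving both buckets halves the extra memory.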

ZeRO-3 shards the optimizer, gradients, and parameters across GPUs. Unlike ZeRO-2, ZeRO-3 can be used for inference as well as training because it allows large models to be loaded across multiple GPUs. Some important parameters to configure are:

  • device: "cpu" can help if you are running out of GPU memory and have free CPU memory available. It allows model parameters to be offloaded to the CPU.

  • pin_memory: true can improve throughput, but less memory is left for other processes, because pinned memory is reserved for the specific process that requested it and is typically accessed much faster than normal CPU memory.

  • stage3_max_live_parameters is the upper limit on how many full parameters to keep on the GPU at any given time. Reduce this value if you encounter an OOM error.

  • stage3_max_reuse_distance is a value for determining when a parameter will be used again, which helps decide whether to discard the parameter or keep it. If the parameter is going to be reused soon (within less than stage3_max_reuse_distance), it is kept to reduce communication overhead. This is very helpful when activation checkpointing is enabled and you want to keep the parameters from the forward recomputation until the backward pass. Reduce this value if you encounter an OOM error.

  • stage3_gather_16bit_weights_on_model_save consolidates the fp16 weights when a model is saved. For large models and multiple GPUs, this is expensive in terms of memory and speed. Enable it if you plan to resume training.

  • sub_group_size controls which parameters are updated during the optimizer step. Parameters are grouped into buckets of sub_group_size, and the buckets are updated one at a time. When used with NVMe offload, sub_group_size determines when model states are moved in and out of CPU memory during the optimization step, which prevents running out of CPU memory for extremely large models. You can leave sub_group_size at its default value if you are not using NVMe offload, but you may want to change it in these cases:

    1. You run into an OOM error during the optimizer step. In this case, reduce sub_group_size to lower the memory usage of the temporary buffers.
    2. The optimizer step takes a very long time. In this case, increase sub_group_size to improve bandwidth utilization through larger data buffers.
  • reduce_bucket_size, stage3_prefetch_bucket_size, and stage3_param_persistence_threshold depend on the model's hidden size. It is recommended to set these values to auto and let the [Trainer] assign them automatically.

{
    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    }
}

You can use the deepspeed.zero.Init context manager to initialize a model faster:

from transformers import T5ForConditionalGeneration, T5Config
import deepspeed

with deepspeed.zero.Init():
    config = T5Config.from_pretrained("google-t5/t5-small")
    model = T5ForConditionalGeneration(config)

For pretrained models, the DeepSpeed config file needs to have is_deepspeed_zero3_enabled: true set up in [TrainingArguments], and it needs a ZeRO configuration enabled. The [TrainingArguments] object must be created before calling the model's [~PreTrainedModel.from_pretrained].

from transformers import AutoModel, Trainer, TrainingArguments

training_args = TrainingArguments(..., deepspeed=ds_config)
model = AutoModel.from_pretrained("google-t5/t5-small")
trainer = Trainer(model=model, args=training_args, ...)

You need ZeRO-3 if the fp16 weights do not fit on a single GPU. If you are able to load the fp16 weights, make sure you specify torch_dtype=torch.float16 in [~PreTrainedModel.from_pretrained].

Another consideration for ZeRO-3 is that, with multiple GPUs, no single GPU holds all the parameters unless they are the parameters of the currently executing layer. To access all parameters of all layers at once, such as when loading pretrained model weights in [~PreTrainedModel.from_pretrained], one layer is loaded at a time and immediately partitioned across all GPUs. This is because, for very large models, memory limits make it impossible to load the weights on one GPU and then distribute them to the other GPUs.

If you encounter a model parameter weight that looks like the following, where it is tensor([1.]) or the parameter size is 1 instead of a larger multi-dimensional shape, the parameter is partitioned and this is a ZeRO-3 placeholder.

tensor([1.0], device="cuda:0", dtype=torch.float16, requires_grad=True)

For more information about initializing large models with ZeRO-3 and accessing the parameters, see the Constructing Massive Models and Gathering Parameters guides.

NVMe configuration[[nvme-configuration]]

ZeRO-Infinity allows offloading model states to the CPU and/or NVMe to save even more memory. Smart partitioning and tiling algorithms let each GPU send and receive very small amounts of data during offloading, so that a modern NVMe can provide an even larger total memory pool than is otherwise available to your training process. ZeRO-Infinity requires ZeRO-3.

Depending on the CPU and/or NVMe memory available, you can offload both the optimizer states and the parameters, just one of them, or neither. Make sure nvme_path points to an NVMe device: it also works with a normal hard drive or solid state drive, but it will be significantly slower. With a modern NVMe, you can expect peak transfer speeds of about 3.5GB/s for reads and about 3GB/s for writes. Lastly, run a benchmark on your training setup to determine the optimal aio configuration.

The example ZeRO-3/Infinity configuration file below sets most of the parameter values to auto, but you can also set these values manually.

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "nvme",
            "nvme_path": "/local_nvme",
            "pin_memory": true,
            "buffer_count": 4,
            "fast_init": false
        },
        "offload_param": {
            "device": "nvme",
            "nvme_path": "/local_nvme",
            "pin_memory": true,
            "buffer_count": 5,
            "buffer_size": 1e8,
            "max_in_cpu": 1e9
        },
        "aio": {
            "block_size": 262144,
            "queue_depth": 32,
            "thread_count": 1,
            "single_submit": false,
            "overlap_events": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
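Two rules from this section lend themselves to a quick sanity check: NVMe offload needs ZeRO-3, and each NVMe offload block needs an nvme_path. The following is a minimal sketch; `check_infinity_config` is a hypothetical helper that checks only these two rules and does not verify that the path is really an NVMe device.

```python
def check_infinity_config(ds_config):
    """Return a list of problems found in an NVMe offload configuration.

    Checks only two rules stated in this guide: NVMe offload needs ZeRO-3,
    and each NVMe offload section needs an nvme_path.
    """
    problems = []
    zero = ds_config.get("zero_optimization", {})
    for section in ("offload_optimizer", "offload_param"):
        offload = zero.get(section, {})
        if offload.get("device") != "nvme":
            continue
        if zero.get("stage") != 3:
            problems.append(f"{section}: NVMe offload requires stage 3")
        if not offload.get("nvme_path"):
            problems.append(f"{section}: nvme_path is not set")
    return problems


good = {"zero_optimization": {"stage": 3,
        "offload_param": {"device": "nvme", "nvme_path": "/local_nvme"}}}
bad = {"zero_optimization": {"stage": 2, "offload_optimizer": {"device": "nvme"}}}
print(check_infinity_config(good))
print(check_infinity_config(bad))
```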

DeepSpeed features[[deepspeed-features]]

There are a number of important parameters to specify in the DeepSpeed configuration file, which are briefly described in this section.

Activation/gradient checkpointing[[activationgradient-checkpointing]]

Activation and gradient checkpointing trade speed for more GPU memory, which lets you overcome situations where your GPU runs out of memory, or increase your batch size for better performance. To enable this feature:

  1. For a Hugging Face model, call model.gradient_checkpointing_enable() or set --gradient_checkpointing in the [Trainer].
  2. For a non-Hugging Face model, use the DeepSpeed Activation Checkpointing API. You can also modify the Transformers modeling code and replace torch.utils.checkpoint with the DeepSpeed API. This approach is more flexible because you can offload the forward activations to CPU memory instead of recalculating them.

Optimizer and scheduler[[optimizer-and-scheduler]]

DeepSpeed and Transformers optimizers and schedulers can be mixed and matched as long as you do not enable offload_optimizer. When offload_optimizer is enabled, you can use a non-DeepSpeed optimizer (except LAMB) as long as it has both a CPU and a GPU implementation.

The optimizer and scheduler parameters in the config file can be set from the command line to avoid hard-to-find errors. For example, if the learning rate is set to a different value somewhere else, you can override it from the command line. Aside from the optimizer and scheduler parameters, you need to make sure your [Trainer] command line arguments match the DeepSpeed configuration.

DeepSpeed offers several optimizers (Adam, AdamW, OneBitAdam, and LAMB), but you can also import other optimizers from PyTorch. If you do not configure the optimizer in the config, the [Trainer] automatically selects AdamW and uses either the values supplied on the command line or the defaults for the following parameters: lr, adam_beta1, adam_beta2, adam_epsilon, weight_decay.

You can set the parameters to "auto" or enter your own values manually.

{
   "optimizer": {
       "type": "AdamW",
       "params": {
         "lr": "auto",
         "betas": "auto",
         "eps": "auto",
         "weight_decay": "auto"
       }
   }
}

You can also use an unsupported optimizer by adding the following to the top-level configuration.

{
   "zero_allow_untested_optimizer": true
}

From DeepSpeed==0.8.3 on, if you want to use offload with such an optimizer, you also need to add the following to the top-level configuration. Offload works best with DeepSpeed's CPU Adam optimizer, so using a different one has to be allowed explicitly.

{
   "zero_force_ds_cpu_optimizer": false
}

DeepSpeed supports the LRRangeTest, OneCycle, WarmupLR and WarmupDecayLR learning rate schedulers.

Transformers and DeepSpeed provide two of the same schedulers:

  • WarmupLR is the same as --lr_scheduler_type constant_with_warmup in Transformers.
  • WarmupDecayLR is the same as --lr_scheduler_type linear in Transformers (this is the default scheduler used in Transformers).

If you do not configure the scheduler in the config, the [Trainer] automatically selects WarmupDecayLR and uses either the values supplied on the command line or the defaults for the following parameters: warmup_min_lr, warmup_max_lr, warmup_num_steps, total_num_steps (calculated automatically at run time if max_steps is not provided).

You can set the parameters to "auto" or enter your own values manually.

{
   "scheduler": {
         "type": "WarmupDecayLR",
         "params": {
             "total_num_steps": "auto",
             "warmup_min_lr": "auto",
             "warmup_max_lr": "auto",
             "warmup_num_steps": "auto"
         }
     }
}
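To show how the four scheduler parameters interact, here is a simplified sketch of a linear warmup followed by a linear decay to zero, the shape that --lr_scheduler_type linear describes. `lr_at` is a hypothetical function written for illustration; DeepSpeed's own WarmupDecayLR may differ in details such as the shape of the warmup curve.

```python
def lr_at(step, warmup_min_lr, warmup_max_lr, warmup_num_steps, total_num_steps):
    """Learning rate at `step` for linear warmup then linear decay (illustrative)."""
    if step < warmup_num_steps:
        # Warmup: move linearly from warmup_min_lr to warmup_max_lr.
        fraction = step / max(1, warmup_num_steps)
        return warmup_min_lr + (warmup_max_lr - warmup_min_lr) * fraction
    # Decay: move linearly from warmup_max_lr down to 0 at total_num_steps.
    remaining = max(0, total_num_steps - step)
    return warmup_max_lr * remaining / max(1, total_num_steps - warmup_num_steps)


schedule = dict(warmup_min_lr=0.0, warmup_max_lr=1e-3, warmup_num_steps=100, total_num_steps=1000)
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at(step, **schedule))
```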

Precision[[precision]]

DeepSpeed supports fp32, fp16, and bf16 mixed precision.

If your model does not work well with mixed precision, for example because it was not pretrained in mixed precision, you may encounter overflow or underflow issues that can cause NaN loss. In these cases, use full fp32 precision by explicitly disabling the default fp16 mode.

{
    "fp16": {
        "enabled": false
    }
}

On Ampere GPUs with PyTorch 1.7 or later, some operations automatically switch to the more efficient tf32 format, but the results are still in fp32. You can control this from the [Trainer] by setting --tf32 to enable it, and --tf32 0 or --no_tf32 to disable it.

Configuring PyTorch AMP-like fp16 mixed precision reduces memory usage and speeds up training. The [Trainer] automatically enables or disables fp16 based on the value of args.fp16_backend, and you can set the rest of the config yourself. fp16 is enabled from the command line when the following arguments are passed: --fp16, --fp16_backend amp or --fp16_full_eval.

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    }
}

For additional DeepSpeed fp16 training options, see the FP16 Training Options reference.

To configure Apex-like fp16 mixed precision, set up the config as shown below with "auto" or your own values. The [Trainer] automatically configures amp based on the values of args.fp16_backend and args.fp16_opt_level. It can also be enabled from the command line when the following arguments are passed: --fp16, --fp16_backend apex or --fp16_opt_level O1.

{
    "amp": {
        "enabled": "auto",
        "opt_level": "auto"
    }
}

To use bf16, you need at least DeepSpeed==0.6.0. bf16 has the same dynamic range as fp32 and does not require loss scaling. However, if you use gradient accumulation with bf16, the gradients are accumulated in bf16, which may not be desirable because the format's low precision can make the accumulation lossy.

bf16 can be set up in the config file, or enabled from the command line when the following arguments are passed: --bf16 or --bf16_full_eval.

{
    "bf16": {
        "enabled": "auto"
    }
}
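The lossy accumulation mentioned for bf16 can be demonstrated without a GPU. The sketch below emulates bf16 in pure Python by rounding a float32 bit pattern to its top 16 bits; `to_bf16` and `accumulate` are hypothetical helpers written for this illustration, and real training kernels behave differently in many details.

```python
import struct


def to_bf16(x):
    """Round a Python float to the nearest bfloat16 value (returned as a float).

    bfloat16 keeps the top 16 bits of a float32, so we round the float32 bit
    pattern to those bits using round-to-nearest-even. Intended for finite
    values of modest magnitude only.
    """
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def accumulate(values, cast):
    """Sum `values`, rounding the running total with `cast` after every add."""
    total = 0.0
    for v in values:
        total = cast(total + v)
    return total


grads = [to_bf16(0.001)] * 1000        # 1000 identical small "gradients"
exact = sum(grads)                     # accumulated in double precision
lossy = accumulate(grads, to_bf16)     # accumulated in emulated bf16

print(f"high-precision sum: {exact:.4f}")
print(f"bf16 running sum:   {lossy:.4f}")
```

The bf16 running total stops growing once each addend is smaller than half the spacing between neighbouring representable values, which is the effect the text warns about.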

Batch size[[batch-size]]

The batch size can be configured automatically or set explicitly. If you choose the "auto" option, the [Trainer] sets train_micro_batch_size_per_gpu to the value of args.per_device_train_batch_size and train_batch_size to args.world_size * args.per_device_train_batch_size * args.gradient_accumulation_steps.

{
    "train_micro_batch_size_per_gpu": "auto",
    "train_batch_size": "auto"
}
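As a worked example of the formula above, the sketch below computes both values for a given setup. `effective_batch_sizes` is a hypothetical helper; its argument names follow the [TrainingArguments] attributes named in the text.

```python
def effective_batch_sizes(world_size, per_device_train_batch_size, gradient_accumulation_steps):
    """Return the two values the Trainer fills in for "auto", per the formula above."""
    return {
        "train_micro_batch_size_per_gpu": per_device_train_batch_size,
        "train_batch_size": world_size * per_device_train_batch_size * gradient_accumulation_steps,
    }


# Example: 2 GPUs, 4 samples per GPU, 8 accumulation steps.
print(effective_batch_sizes(world_size=2, per_device_train_batch_size=4, gradient_accumulation_steps=8))
```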

Gradient accumulation[[gradient-accumulation]]

Gradient accumulation can be configured automatically or set explicitly. If you choose the "auto" option, the [Trainer] sets it to the value of args.gradient_accumulation_steps.

{
    "gradient_accumulation_steps": "auto"
}

Gradient clipping[[gradient-clipping]]

Gradient clipping can be configured automatically or set explicitly. If you choose the "auto" option, the [Trainer] sets it to the value of args.max_grad_norm.

{
    "gradient_clipping": "auto"
}

Communication data type[[communication-data-type]]

Communication collectives such as reduction, gathering and scattering operations use a separate data type.

All gather and scatter operations are performed in the same data type the data is in. For example, if you are training with bf16, the data is also gathered in bf16 because gathering is a non-lossy operation.

Reduce operations are lossy, for example when gradients are averaged across multiple GPUs. When the communication is done in fp16 or bf16, it is more likely to be lossy because adding multiple numbers in low precision is not exact. This is especially true for bf16, which has lower precision than fp16. For this reason, fp16 is the default for reduction operations, because the loss is minimal when averaging gradients.

You can choose the communication data type by setting the communication_data_type parameter in the config file. For example, choosing fp32 adds a small amount of overhead but ensures the reduction is accumulated in fp32 and, once complete, downcast to whichever half-precision dtype you are training in.

{
    "communication_data_type": "fp32"
}

Deployment[[deployment]]

DeepSpeed can be deployed with different launchers such as torchrun, the deepspeed launcher, or Accelerate. To deploy, add --deepspeed ds_config.json to the [Trainer] command line. It is recommended to use DeepSpeed's add_config_arguments utility to add any necessary command line arguments to your code.

This guide shows how to deploy DeepSpeed with the deepspeed launcher for different training setups. You can check out this post for more practical usage examples.

To deploy DeepSpeed on multiple GPUs, add the --num_gpus parameter. If you want to use all available GPUs, you do not need to add --num_gpus. The example below uses 2 GPUs.

deepspeed --num_gpus=2 examples/pytorch/translation/run_translation.py \
--deepspeed tests/deepspeed/ds_config_zero3.json \
--model_name_or_path google-t5/t5-small --per_device_train_batch_size 1 \
--output_dir output_dir --overwrite_output_dir --fp16 \
--do_train --max_train_samples 500 --num_train_epochs 1 \
--dataset_name wmt16 --dataset_config "ro-en" \
--source_lang en --target_lang ro

To deploy DeepSpeed on a single GPU, add the --num_gpus parameter. You do not need to set this value explicitly if you only have 1 GPU, because DeepSpeed deploys on all the GPUs it can see on a given node.

deepspeed --num_gpus=1 examples/pytorch/translation/run_translation.py \
--deepspeed tests/deepspeed/ds_config_zero2.json \
--model_name_or_path google-t5/t5-small --per_device_train_batch_size 1 \
--output_dir output_dir --overwrite_output_dir --fp16 \
--do_train --max_train_samples 500 --num_train_epochs 1 \
--dataset_name wmt16 --dataset_config "ro-en" \
--source_lang en --target_lang ro

DeepSpeed is still useful with just 1 GPU because you can:

  1. Offload some computations and memory to the CPU, which frees GPU resources for your model so you can use a larger batch size or fit a very large model that normally wouldn't fit.
  2. Minimize memory fragmentation with its smart GPU memory management system, which also lets you fit bigger models and data batches.

Set the allgather_bucket_size and reduce_bucket_size values to 2e8 in the ZeRO-2 configuration file to get better performance on a single GPU.
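As a rough sketch of the tuning above, the snippet below patches a ZeRO-2 config dictionary with both bucket sizes. The helper name tune_for_single_gpu and the starting config are hypothetical; only the two key names and the 2e8 value come from the text, and they are assumed to live in the zero_optimization section as in the other configs in this guide.

```python
import json


def tune_for_single_gpu(ds_config, bucket_size=int(2e8)):
    """Return a copy of a ZeRO-2 config with both bucket sizes set (hypothetical helper)."""
    patched = json.loads(json.dumps(ds_config))  # deep copy via a JSON round-trip
    zero = patched.setdefault("zero_optimization", {})
    zero["allgather_bucket_size"] = bucket_size
    zero["reduce_bucket_size"] = bucket_size
    return patched


# Illustrative starting point; a real config would carry more settings.
base_config = {"zero_optimization": {"stage": 2, "overlap_comm": True}}
single_gpu_config = tune_for_single_gpu(base_config)

# Print the JSON that would be saved and passed with --deepspeed.
print(json.dumps(single_gpu_config, indent=4))
```

The original dictionary is left untouched, so the same base config can still be used for multi-GPU runs.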

Multi-node deployment[[multi-node-deployment]]

A node is one or more GPUs for running a workload. A more powerful setup is a multi-node setup, which can be launched with the deepspeed launcher. For this guide, let's assume there are two nodes with 8 GPUs each. The first node can be accessed with ssh hostname1 and the second node with ssh hostname2. Both nodes must be able to communicate with each other locally over ssh without a password.

By default, DeepSpeed expects your multi-node environment to use shared storage. If this is not the case and each node can only see its local filesystem, you need to adjust the config file to include a checkpoint section that allows loading without access to a shared filesystem:

{
  "checkpoint": {
    "use_node_local_storage": true
  }
}

You could also use the [Trainer]'s --save_on_each_node argument to automatically add the above checkpoint setting to your config.

For torchrun, you have to ssh to each node and run the following command on both of them (using --node_rank=0 on the first node and --node_rank=1 on the second). The launcher waits until both nodes are synchronized before launching the training.

torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr=hostname1 \
--master_port=9901 your_program.py <normal cl args> --deepspeed ds_config.json

For the deepspeed launcher, start by creating a hostfile.

hostname1 slots=8
hostname2 slots=8

Then you can launch the training with the following command. The deepspeed launcher automatically launches the command on both nodes at once.

deepspeed --num_gpus 8 --num_nodes 2 --hostfile hostfile --master_addr hostname1 --master_port=9901 \
your_program.py <normal cl args> --deepspeed ds_config.json

Check out the Resource Configuration (multi-node) guide for more details about configuring multi-node compute resources.
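For clusters with more than a couple of nodes, the hostfile can be generated rather than typed by hand. The sketch below renders the same `<hostname> slots=<n>` format shown for hostname1 and hostname2; the make_hostfile helper is hypothetical and not part of DeepSpeed.

```python
def make_hostfile(slots_by_host):
    """Render hostfile text with one '<hostname> slots=<n>' line per node (hypothetical helper)."""
    lines = [f"{host} slots={slots}" for host, slots in slots_by_host.items()]
    return "\n".join(lines) + "\n"


# Two nodes with 8 GPUs each, matching the setup assumed in this guide.
nodes = {"hostname1": 8, "hostname2": 8}
hostfile_text = make_hostfile(nodes)

# The total number of processes the launcher would start across all nodes.
world_size = sum(nodes.values())

print(hostfile_text, end="")
print(f"world size: {world_size}")
```

Writing hostfile_text to a file named hostfile gives the input expected by the --hostfile argument used above.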

SLURM[[slurm]]

In a SLURM environment, you'll need to adapt the SLURM script to your specific setup. An example SLURM script may look like this:

#SBATCH --job-name=test-nodes        # job name
#SBATCH --nodes=2                    # number of nodes
#SBATCH --ntasks-per-node=1          # crucial - only 1 distributed task per node!
#SBATCH --cpus-per-task=10           # number of CPU cores per task
#SBATCH --gres=gpu:8                 # number of GPUs
#SBATCH --time 20:00:00              # maximum execution time (HH:MM:SS)
#SBATCH --output=%x-%j.out           # output file name

export GPUS_PER_NODE=8
export MASTER_ADDR=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
export MASTER_PORT=9901

srun --jobid $SLURM_JOBID bash -c 'python -m torch.distributed.run \
 --nproc_per_node $GPUS_PER_NODE --nnodes $SLURM_NNODES --node_rank $SLURM_PROCID \
 --master_addr $MASTER_ADDR --master_port $MASTER_PORT \
your_program.py <normal cl args> --deepspeed ds_config.json'

Then you can schedule your multi-node deployment with the following command, which launches training simultaneously on all nodes.

sbatch launch.slurm

Notebook[[notebook]]

The deepspeed launcher doesn't support deployment from a notebook, so you'll need to emulate the distributed environment. However, this only works for 1 GPU. If you want to use more than 1 GPU, you must use a multi-process environment for DeepSpeed to work. This means you have to use the deepspeed launcher, which can't be emulated as shown here.

# DeepSpeed requires a distributed environment even when only one process is used.
# This emulates a launcher in the notebook.
import os

from transformers import Trainer, TrainingArguments

os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = "9994"  # modify if RuntimeError: Address already in use
os.environ["RANK"] = "0"
os.environ["LOCAL_RANK"] = "0"
os.environ["WORLD_SIZE"] = "1"

# Now proceed as normal, plus pass the DeepSpeed config file.
training_args = TrainingArguments(..., deepspeed="ds_config_zero3.json")
trainer = Trainer(...)
trainer.train()
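If the hard-coded MASTER_PORT collides with another process ("Address already in use"), one option is to ask the operating system for an unused port before setting the variable. This is a minimal sketch using only the standard library; find_free_port is a hypothetical helper, and there is a small window in which another process could claim the port between discovery and use.

```python
import os
import socket


def find_free_port():
    """Ask the OS for an unused TCP port by binding to port 0 (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind(("127.0.0.1", 0))
        return sock.getsockname()[1]


# Set the port before creating TrainingArguments, in place of the fixed "9994".
port = find_free_port()
os.environ["MASTER_PORT"] = str(port)
print(f"MASTER_PORT={os.environ['MASTER_PORT']}")
```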

If you want to create the config file on the fly in the notebook, in the current directory, you can use a dedicated cell.

%%bash
cat <<'EOT' > ds_config_zero3.json
{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    },

    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "betas": "auto",
            "eps": "auto",
            "weight_decay": "auto"
        }
    },

    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": "auto",
            "warmup_max_lr": "auto",
            "warmup_num_steps": "auto"
        }
    },

    "zero_optimization": {
        "stage": 3,
        "offload_optimizer": {
            "device": "cpu",
            "pin_memory": true
        },
        "offload_param": {
            "device": "cpu",
            "pin_memory": true
        },
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e9,
        "reduce_bucket_size": "auto",
        "stage3_prefetch_bucket_size": "auto",
        "stage3_param_persistence_threshold": "auto",
        "stage3_max_live_parameters": 1e9,
        "stage3_max_reuse_distance": 1e9,
        "stage3_gather_16bit_weights_on_model_save": true
    },

    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 2000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
EOT

If the training script is in a file and not in a notebook cell, you can launch deepspeed normally from the shell in a notebook cell. For example, to launch run_translation.py:

!git clone https://github.com/huggingface/transformers
!cd transformers; deepspeed examples/pytorch/translation/run_translation.py ...

You could also use the %%bash magic and write multi-line code to run the shell program, but you won't be able to view the logs until training is complete. With the %%bash magic, you don't need to emulate a distributed environment.

%%bash

git clone https://github.com/huggingface/transformers
cd transformers
deepspeed examples/pytorch/translation/run_translation.py ...

Save model weights[[save-model-weights]]

DeepSpeed stores the main full-precision fp32 weights in custom checkpoint optimizer files (the glob pattern looks like global_step*/*optim_states.pt), which are saved under the normal checkpoint.

A model trained with ZeRO-2 saves the pytorch_model.bin weights in fp16. To save the model weights in fp16 for a model trained with ZeRO-3, you need to set "stage3_gather_16bit_weights_on_model_save": true because the model weights are partitioned across multiple GPUs. Otherwise, the [Trainer] won't save the weights in fp16 and it won't create a pytorch_model.bin file. This is because DeepSpeed's state_dict contains placeholders instead of the real weights, so you wouldn't be able to load them.

{
    "zero_optimization": {
        "stage3_gather_16bit_weights_on_model_save": true
    }
}

The full-precision weights shouldn't be saved during training because doing so can require a lot of memory. It is usually best to save the fp32 weights offline after training is complete. But if you have a lot of free CPU memory, it is possible to save the fp32 weights during training. This section covers both the online and offline approaches.

Online[[online]]

You must have saved at least one checkpoint to load the latest checkpoint as shown below:

from transformers.trainer_utils import get_last_checkpoint
from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint

checkpoint_dir = get_last_checkpoint(trainer.args.output_dir)
fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)

If you've enabled the --load_best_model_at_end parameter to track the best checkpoint in [TrainingArguments], you can finish training first and save the final model explicitly. Then you can reload it as shown below:

import os

from deepspeed.utils.zero_to_fp32 import load_state_dict_from_zero_checkpoint

checkpoint_dir = os.path.join(trainer.args.output_dir, "checkpoint-final")
trainer.deepspeed.save_checkpoint(checkpoint_dir)
fp32_model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir)

Once load_state_dict_from_zero_checkpoint is run, the model is no longer usable in DeepSpeed in the context of the same application. You'll need to initialize the DeepSpeed engine again since model.load_state_dict(state_dict) removes all the DeepSpeed magic from it. Only use this at the very end of training.

You can also extract and load the state_dict of the fp32 weights:

from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

state_dict = get_fp32_state_dict_from_zero_checkpoint(checkpoint_dir)  # already on cpu
model = model.cpu()
model.load_state_dict(state_dict)

Offline[[offline]]

DeepSpeed provides a zero_to_fp32.py script at the top level of the checkpoint folder for extracting weights at any point. This is a standalone script, so you don't need a configuration file or the [Trainer].

For example, if your checkpoint folder looks like this:

$ ls -l output_dir/checkpoint-1/
-rw-rw-r-- 1 stas stas 1.4K Mar 27 20:42 config.json
drwxrwxr-x 2 stas stas 4.0K Mar 25 19:52 global_step1/
-rw-rw-r-- 1 stas stas   12 Mar 27 13:16 latest
-rw-rw-r-- 1 stas stas 827K Mar 27 20:42 optimizer.pt
-rw-rw-r-- 1 stas stas 231M Mar 27 20:42 pytorch_model.bin
-rw-rw-r-- 1 stas stas  623 Mar 27 20:42 scheduler.pt
-rw-rw-r-- 1 stas stas 1.8K Mar 27 20:42 special_tokens_map.json
-rw-rw-r-- 1 stas stas 774K Mar 27 20:42 spiece.model
-rw-rw-r-- 1 stas stas 1.9K Mar 27 20:42 tokenizer_config.json
-rw-rw-r-- 1 stas stas  339 Mar 27 20:42 trainer_state.json
-rw-rw-r-- 1 stas stas 2.3K Mar 27 20:42 training_args.bin
-rwxrw-r-- 1 stas stas 5.5K Mar 27 13:16 zero_to_fp32.py*

To reconstruct the fp32 weights from the DeepSpeed checkpoint (ZeRO-2 or ZeRO-3) subfolder global_step1, run the following command to create and consolidate the full fp32 weights from multiple GPUs into a single pytorch_model.bin file. The script automatically discovers the subfolder containing the checkpoint.

python zero_to_fp32.py . pytorch_model.bin

Run python zero_to_fp32.py -h for more usage details. The script requires general (CPU) RAM equal to 2x the size of the final fp32 weights.
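To get a feel for the 2x rule before running the script, the arithmetic can be sketched as follows. The function name is hypothetical, the 3-billion parameter count is a round illustrative figure, and the estimate counts only the weights themselves (4 bytes per fp32 value), not interpreter or framework overhead.

```python
def fp32_consolidation_ram_gib(num_params, ram_factor=2.0):
    """Estimate the CPU RAM (GiB) needed to consolidate fp32 weights (hypothetical helper)."""
    bytes_per_fp32 = 4
    final_weights_bytes = num_params * bytes_per_fp32
    return final_weights_bytes * ram_factor / 1024**3


# Round illustrative figure for a model with about 3 billion parameters.
estimate = fp32_consolidation_ram_gib(3_000_000_000)
print(f"estimated CPU RAM: {estimate:.1f} GiB")
```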

ZeRO Inference[[zero-inference]]

ZeRO Inference places the model weights in CPU or NVMe memory to avoid burdening the GPU, which makes it possible to run inference with huge models on a GPU. Inference doesn't require large additional amounts of memory for the optimizer states and gradients, so you can fit much larger batches and/or sequence lengths on the same hardware.

ZeRO Inference shares the same configuration file as ZeRO-3. ZeRO-2 and ZeRO-1 configs won't work because they don't provide any benefit for inference.

To run ZeRO Inference, pass your usual training arguments to the [TrainingArguments] class and add the --do_eval argument.

deepspeed --num_gpus=2 your_program.py <normal cl args> --do_eval --deepspeed ds_config.json

Non-Trainer DeepSpeed integration[[non-trainer-deepspeed-integration]]

DeepSpeed also works with Transformers without the [Trainer] class. This is handled by [HfDeepSpeedConfig], which only takes care of gathering ZeRO-3 parameters and splitting a model across multiple GPUs when you call [~PreTrainedModel.from_pretrained].

If you want everything taken care of automatically, try using DeepSpeed with the [Trainer]! Without it, you'll need to follow the DeepSpeed documentation and manually configure the parameter values in the config file (you can't use the "auto" value).

To deploy ZeRO-3 efficiently, you must instantiate the [HfDeepSpeedConfig] object before the model and keep that object alive:

from transformers.integrations import HfDeepSpeedConfig
from transformers import AutoModel
import deepspeed

ds_config = {...}  # deepspeed config object or path to the file
# must run before instantiating the model to detect ZeRO-3
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive
model = AutoModel.from_pretrained("openai-community/gpt2")
engine = deepspeed.initialize(model=model, config_params=ds_config, ...)

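[HfDeepSpeedConfig] is not required for ZeRO-1 or ZeRO-2. The same pattern applies when the model is built from a configuration instead of pretrained weights: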

from transformers.integrations import HfDeepSpeedConfig
from transformers import AutoModel, AutoConfig
import deepspeed

ds_config = {...}  # deepspeed config object or path to the file
# must run before instantiating the model to detect ZeRO-3
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive
config = AutoConfig.from_pretrained("openai-community/gpt2")
model = AutoModel.from_config(config)
engine = deepspeed.initialize(model=model, config_params=ds_config, ...)

Non-Trainer ZeRO Inference[[non-trainer-zero-inference]]

To run ZeRO Inference without the [Trainer] when a model doesn't fit on a single GPU, try using additional GPUs and/or offloading to CPU memory. The important nuance to understand here is that, because of the way ZeRO is designed, you can process different inputs on different GPUs in parallel.

๋ฐ˜๋“œ์‹œ ํ™•์ธํ•˜์„ธ์š”:

  • disable CPU offload if you have enough GPU memory (since it slows things down).
  • enable bf16 if you have an Ampere or newer GPU to make things faster. If you don't have one of these GPUs, you may enable fp16 as long as you don't use a model pretrained in bf16 (such as the T5 models), because that may lead to an overflow error.

Take a look at the following script to get a better idea of how to run ZeRO Inference without the [Trainer] on a model that won't fit on a single GPU.

#!/usr/bin/env python

# This script demonstrates how to use Deepspeed ZeRO in inference mode when a model can't fit on a single GPU.
#
# 1. Use 1 GPU with CPU offload
# 2. Or use multiple GPUs instead
#
# First you need to install deepspeed: pip install deepspeed
#
# Here we use the 3B "bigscience/T0_3B" model, which needs about 15GB of GPU RAM - so 1 largish or 2
# small GPUs can handle it, or 1 small GPU and a lot of CPU memory.
#
# To use a larger model like "bigscience/T0", which needs about 50GB, you will need 2-4 GPUs
# unless you have an 80GB GPU. You can then adapt the script to handle more GPUs
# if you want to process multiple inputs at once.
#
# The provided deepspeed config also activates CPU memory offloading, so if you have a lot of available
# CPU memory and don't mind a slowdown, you should be able to load a model that doesn't normally fit on a single GPU.
# If you have enough GPU memory, the program runs faster without CPU offload - disable that section in that case.
#
# To deploy on 1 GPU:
#
# deepspeed --num_gpus 1 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=1 t0.py
#
# To deploy on 2 GPUs:
#
# deepspeed --num_gpus 2 t0.py
# or:
# python -m torch.distributed.run --nproc_per_node=2 t0.py

from transformers import AutoTokenizer, AutoConfig, AutoModelForSeq2SeqLM
from transformers.integrations import HfDeepSpeedConfig
import deepspeed
import os
import torch

os.environ["TOKENIZERS_PARALLELISM"] = "false"  # to avoid warnings about parallelism in tokenizers

# distributed setup
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))
torch.cuda.set_device(local_rank)
deepspeed.init_distributed()

model_name = "bigscience/T0_3B"

config = AutoConfig.from_pretrained(model_name)
model_hidden_size = config.d_model

# the batch size has to be divisible by world_size, but can be bigger than world_size
train_batch_size = 1 * world_size

# ds_config notes
#
# - enable bf16 if you use an Ampere or newer GPU - this runs in mixed precision and
# will be faster.
#
# - for older GPUs you can enable fp16, but it only works for models that were not pretrained in bf16 - e.g.
# all official t5 models are bf16-pretrained
#
# - set offload_param.device to "none" or completely remove the `offload_param` section
# if you don't want CPU offload
#
# - if using `offload_param` you can manually fine-tune stage3_param_persistence_threshold to control
# which params should remain on the GPUs - the larger the value, the smaller the offload size
#
# For in-depth info on the Deepspeed config see
# https://huggingface.co/docs/transformers/main/main_classes/deepspeed

# keeping the same format as json for consistency, except that Python uses True/False where json uses true/false
# fmt: off
ds_config = {
    "fp16": {
        "enabled": False
    },
    "bf16": {
        "enabled": False
    },
    "zero_optimization": {
        "stage": 3,
        "offload_param": {
            "device": "cpu",
            "pin_memory": True
        },
        "overlap_comm": True,
        "contiguous_gradients": True,
        "reduce_bucket_size": model_hidden_size * model_hidden_size,
        "stage3_prefetch_bucket_size": 0.9 * model_hidden_size * model_hidden_size,
        "stage3_param_persistence_threshold": 10 * model_hidden_size
    },
    "steps_per_print": 2000,
    "train_batch_size": train_batch_size,
    "train_micro_batch_size_per_gpu": 1,
    "wall_clock_breakdown": False
}
# fmt: on

# The next line instructs transformers to partition the model directly over multiple GPUs using
# deepspeed.zero.Init when the model's `from_pretrained` method is called.
#
# **it has to be run before loading the model with AutoModelForSeq2SeqLM.from_pretrained(model_name)**
#
# otherwise the model is first loaded normally and only partitioned at forward time, which is
# less efficient and may fail when there is little CPU RAM
dschf = HfDeepSpeedConfig(ds_config)  # keep this object alive

# now the model can be loaded.
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# initialize Deepspeed ZeRO and store only the engine object
ds_engine = deepspeed.initialize(model=model, config_params=ds_config)[0]
ds_engine.module.eval()  # inference

# Deepspeed ZeRO can process unrelated inputs on each GPU. So with 2 GPUs you process 2 inputs at once.
# If you use more GPUs, adjust accordingly.

# And of course if you have just one input to process, you need to pass the same string to both GPUs.
# If you use only one GPU, you will only have rank 0.
rank = torch.distributed.get_rank()
if rank == 0:
    text_in = "Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy"
elif rank == 1:
    text_in = "Is this review positive or negative? Review: this is the worst restaurant ever"

tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer.encode(text_in, return_tensors="pt").to(device=local_rank)
with torch.no_grad():
    outputs = ds_engine.module.generate(inputs, synced_gpus=True)
text_out = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"rank{rank}:\n   in={text_in}\n  out={text_out}")

Save the script as t0.py and launch it:

$ deepspeed --num_gpus 2 t0.py
rank0:
   in=Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy
  out=Positive
rank1:
   in=Is this review positive or negative? Review: this is the worst restaurant ever
  out=negative

This is a very basic example, and you'll want to adapt it to your use case.
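One adaptation the script's comments hint at is handling more than two GPUs: as written, text_in is only defined for ranks 0 and 1. A minimal sketch of one way to assign inputs is shown below; select_input is a hypothetical helper, and wrapping around with the modulo operator is just one possible policy (it also covers the single-input case, where every rank receives the same string).

```python
def select_input(prompts, rank):
    """Pick the prompt for a rank, wrapping around when there are fewer prompts than ranks."""
    if not prompts:
        raise ValueError("at least one prompt is required")
    return prompts[rank % len(prompts)]


prompts = [
    "Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy",
    "Is this review positive or negative? Review: this is the worst restaurant ever",
]

# Simulate 4 ranks; in t0.py the rank would come from torch.distributed.get_rank().
for rank in range(4):
    print(f"rank{rank}: {select_input(prompts, rank)}")
```

In the script, `text_in = select_input(prompts, rank)` would replace the if/elif block so that every rank always has an input and all ranks can take part in generation.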

Generate[[generate]]

Using multiple GPUs with ZeRO-3 for generation requires synchronizing the GPUs by setting synced_gpus=True in the [~GenerationMixin.generate] method. Otherwise, if one GPU finishes generating before another, the whole system hangs because the remaining GPUs never receive the weight shards from the GPU that finished first.

For Transformers>=4.28, synced_gpus is automatically set to True if multiple GPUs are detected during generation.

Troubleshoot[[troubleshoot]]

When you encounter an issue, consider whether DeepSpeed is really the cause, because often it isn't (unless it's very obvious and you can see DeepSpeed modules in the exception)! The first step is to retry your setup without DeepSpeed, and if the problem persists, you can report the issue. If the issue is a core DeepSpeed problem and unrelated to the Transformers integration, open an issue on the DeepSpeed repository.

For issues related to the Transformers integration, please provide the following information:

  • the full DeepSpeed config file

  • the command line arguments of the [Trainer], or the [TrainingArguments] arguments if you're scripting the [Trainer] setup yourself (don't dump the whole [TrainingArguments], which has dozens of irrelevant entries)

  • the output of:
python -c 'import torch; print(f"torch: {torch.__version__}")'
python -c 'import transformers; print(f"transformers: {transformers.__version__}")'
python -c 'import deepspeed; print(f"deepspeed: {deepspeed.__version__}")'
  • a link to a Google Colab notebook that reproduces the issue

  • if that's impossible, a standard, non-custom dataset we can use, ideally with an existing example, to reproduce the issue

The following sections provide a guide for resolving two of the most common issues.

DeepSpeed process killed at startup[[deepspeed-process-killed-at-startup]]

When the DeepSpeed process is killed during launch without a traceback, it usually means the program tried to allocate more CPU memory than your system has, or your process tried to allocate more CPU memory than it is allowed, leading the OS kernel to terminate the process. In this case, check whether your configuration file has offload_optimizer, offload_param, or both configured to offload to the CPU.

If you have NVMe and a ZeRO-3 setup, experiment with offloading to NVMe (and estimate the memory requirements for your model).

NaN loss[[nan-loss]]

NaN loss often occurs when a model is pretrained in bf16 and you then try to use it with fp16 (especially relevant for TPU-trained models). To resolve this, use fp32, or bf16 if your hardware supports it (TPU, Ampere GPUs or newer).

The other issue may be related to using fp16. For example, if this is your fp16 configuration:

{
    "fp16": {
        "enabled": "auto",
        "loss_scale": 0,
        "loss_scale_window": 1000,
        "initial_scale_power": 16,
        "hysteresis": 2,
        "min_loss_scale": 1
    }
}

You might see the following OVERFLOW! messages in the logs:

0%|                                                                                                                             | 0/189 [00:00<?, ?it/s]
 [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 262144
  1%|▌                                                                                                                    | 1/189 [00:00<01:26,  2.17it/s]
 [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144, reducing to 131072.0
  1%|█▏
 [...]
 [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
 14%|████████████████▌                                                                                                   | 27/189 [00:14<01:13,  2.21it/s]
 [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
 15%|█████████████████▏                                                                                                  | 28/189 [00:14<01:13,  2.18it/s]
 [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
 15%|█████████████████▊                                                                                                  | 29/189 [00:15<01:13,  2.18it/s]
 [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1, reducing to 1
[...]

This means the DeepSpeed loss scaler is unable to find a scaling coefficient that overcomes the loss overflow. To fix it, try a higher initial_scale_power value (32 usually works).
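To see why the log bottoms out at a loss scale of 1, the toy simulation below halves the scale on every overflow until it reaches min_loss_scale. It is a simplified sketch rather than DeepSpeed's implementation: it assumes the initial scale is 2**initial_scale_power, assumes every step overflows, and ignores hysteresis and loss_scale_window. It does not model whether a larger initial_scale_power actually resolves the overflow.

```python
def simulate_overflows(initial_scale_power, min_loss_scale=1, num_overflows=20):
    """Return the loss scale after each of num_overflows consecutive overflows (toy model)."""
    scale = 2 ** initial_scale_power
    history = [scale]
    for _ in range(num_overflows):
        # Halve on overflow, but never drop below the configured floor.
        scale = max(scale // 2, min_loss_scale)
        history.append(scale)
    return history


history = simulate_overflows(initial_scale_power=16)
steps_to_floor = history.index(1)
print(f"initial scale: {history[0]}, overflows until the floor is reached: {steps_to_floor}")
```

Once the floor is reached, further overflows leave the scale unchanged, which matches the repeated "Attempted loss scale: 1, reducing to 1" lines.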

Resources[[resources]]

DeepSpeed ZeRO is a powerful technology for training and loading very large models for inference with limited GPU resources, making them more accessible to everyone. To learn more about DeepSpeed, feel free to read the blog posts, the official documentation, and the GitHub repository.

The following papers are also a great resource for learning more about ZeRO: