<!--Copyright 2023 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->
# Efficient Training on Multiple GPUs
If training a model on a single GPU is too slow, or if the model's weights do not fit in a single GPU's memory, you need a multi-GPU setup. Switching from a single GPU to multiple GPUs requires some form of parallelism to distribute the workload. There are several techniques to achieve parallelism, such as data, tensor, or pipeline parallelism. However, there is no one solution that fits them all, and the optimal settings depend on the hardware you are running on. This article focuses on PyTorch-based implementations while covering the key concepts, which likely apply to other frameworks as well.
<Tip>

**Note**: Most of the strategies introduced in the [single GPU section](perf_train_gpu_one) (such as mixed precision training or gradient accumulation) are generic and apply to training models in general, so make sure to have a look at them before diving into the following sections on multi-GPU or CPU training.

</Tip>
We will first discuss in depth various 1D parallelism techniques and their pros and cons, and then look at how they can be combined into 2D and 3D parallelism to enable an even faster training and to support even bigger models. Various other powerful alternative approaches will be presented as well.
## Concepts
The following is a brief description of the main concepts that will be described later in more depth in this document.

1. **DataParallel (DP)** - the same setup is replicated multiple times, and each copy is fed a slice of the data. The processing is done in parallel and all setups are synchronized at the end of each training step.
2. **TensorParallel (TP)** - each tensor is split up into multiple chunks, so instead of having the whole tensor reside on a single GPU, each shard of the tensor resides on its designated GPU. During processing each shard gets processed separately and in parallel on different GPUs, and the results are synced at the end of the step. This is what one may call horizontal parallelism, as the splitting happens on the horizontal level.
3. **PipelineParallel (PP)** - the model is split up vertically (layer-level) across multiple GPUs, so that only one or several layers of the model are placed on a single GPU. Each GPU processes in parallel a different stage of the pipeline and works on a small chunk of the batch.
4. **Zero Redundancy Optimizer (ZeRO)** - also performs sharding of the tensors somewhat similar to TP, except the whole tensor gets reconstructed in time for a forward or backward computation, therefore the model doesn't need to be modified. It also supports various offloading techniques to compensate for limited GPU memory.
5. **Sharded DDP** - another name for the foundational ZeRO concept as used by various ZeRO implementations.
Before diving deeper into the specifics of each concept, let's first look at the rough decision process for training large models on a large infrastructure.
## Scalability Strategy
**⇨ Single Node / Multi-GPU**

* Model fits onto a single GPU:

    1. DDP - Distributed Data Parallel
    2. ZeRO - may or may not be faster depending on the situation and configuration used

* Model doesn't fit onto a single GPU:

    1. PP
    2. ZeRO
    3. TP

    With very fast intra-node connectivity (such as NVLINK or NVSwitch), all three should be mostly on par; without it, PP will be faster than TP or ZeRO. The degree of TP may also make a difference. It's best to experiment to find the winner on your particular setup.

    TP is almost always used within a single node. That is, TP size <= number of GPUs per node.

* Largest layer not fitting into a single GPU:

    1. If not using ZeRO - TP is required, as PP alone won't be able to fit it.
    2. With ZeRO, see the same entry for "Single GPU" above

**⇨ Multi-Node / Multi-GPU**

* When you have fast inter-node connectivity:

    1. ZeRO - requires close to no modifications to the model
    2. PP+TP+DP - less communication, but requires massive changes to the model

* When you have slow inter-node connectivity and are still low on GPU memory:

    1. DP+PP+TP+ZeRO-1
## Data Parallelism
Most users with just 2 GPUs already enjoy the increased training speed provided by `DataParallel` (DP) and `DistributedDataParallel` (DDP), which are built-in features of PyTorch that are almost trivial to use. In general, it's recommended to use DDP, as it works for all models, whereas DP may fail with some models. The [PyTorch documentation](https://pytorch.org/docs/master/generated/torch.nn.DataParallel.html) itself recommends the use of DDP.
### DP vs DDP
`DistributedDataParallel` (DDP) is typically faster than `DataParallel` (DP), but it is not always the case:
* While DP is Python threads-based, DDP is multiprocess-based, and as such it has no Python thread limitations, such as the GIL (Global Interpreter Lock).
* On the other hand, a slow interconnect between the GPU cards may actually lead to a slower outcome with DDP.

Here are the main differences in the inter-GPU communication between the two modes:

[DDP](https://pytorch.org/docs/master/notes/ddp.html):

- At start time, the main process replicates the model from GPU 0 to the other GPUs.
- Then for each batch:
   1. Each GPU directly consumes its mini-batch of data.
   2. During `backward`, once the local gradients are ready, they are averaged across all processes.

[DP](https://pytorch.org/docs/master/generated/torch.nn.DataParallel.html):

For each batch:
   1. GPU 0 reads the batch of data and sends a mini-batch to each GPU.
   2. The up-to-date model is replicated from GPU 0 to each GPU.
   3. `forward` is run, and the output from each GPU is sent to GPU 0 to compute the loss.
   4. The loss is distributed from GPU 0 to all GPUs, and `backward` is run.
   5. The gradients from each GPU are sent to GPU 0 and averaged.

The only communication DDP performs per batch is sending gradients, whereas DP does 5 different data exchanges per batch.

DP copies data within the process via Python threads, whereas DDP copies data via [torch.distributed](https://pytorch.org/docs/master/distributed.html).

Under DP, GPU 0 performs much more work than the rest of the GPUs, resulting in under-utilization of the GPUs.

DDP can be used across multiple machines, but this is not the case with DP.

There are other differences between DP and DDP, but they aren't relevant to this discussion.

If you want to go really deep into understanding these 2 modes, this [article](https://www.telesens.co/2019/04/04/distributed-data-parallel-training-using-pytorch-on-aws/) is highly recommended: it has great diagrams, includes multiple benchmarks and profiler outputs on various hardware, and explains all the subtle nuances you may need to know.
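For orientation, here is a minimal sketch of wrapping a toy model in DDP. The model, data, and single-process `gloo` group are stand-ins so the example runs on a CPU; a real run launches one process per GPU via `torchrun`, which sets the rendezvous environment variables:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# A real run uses `torchrun --nproc_per_node N`, which sets these rendezvous
# variables; here we create a single-process CPU group ("gloo" backend) so
# the sketch runs anywhere.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(10, 10)  # toy stand-in for a real model
ddp_model = DDP(model)           # DDP averages local gradients across processes

out = ddp_model(torch.randn(4, 10))
out.sum().backward()             # gradients are all-reduced during backward
dist.destroy_process_group()
```

With more than one process, the only per-batch communication is the gradient all-reduce during `backward`, which is exactly the difference from DP described above.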
Let's take a look at an actual benchmark:
| Type | NVlink | Time |
| :----- | ----- | ---: |
| 2:DP | Y | 110s |
| 2:DDP | Y | 101s |
| 2:DDP | N | 131s |
Analysis:

Here DP is ~10% slower than DDP with NVlink, but ~15% faster than DDP without NVlink.

The real difference will depend on how much data each GPU needs to sync with the others: the more there is to sync, the more a slow link will slow down the total runtime.

Here is the full benchmark code and outputs:

`NCCL_P2P_DISABLE=1` was used to disable the NVLink feature in the corresponding benchmark.
```bash
# DP
rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 \
python examples/pytorch/language-modeling/run_clm.py \
--model_name_or_path openai-community/gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
--do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200
{'train_runtime': 110.5948, 'train_samples_per_second': 1.808, 'epoch': 0.69}
# DDP w/ NVlink
rm -r /tmp/test-clm; CUDA_VISIBLE_DEVICES=0,1 \
torchrun --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \
--model_name_or_path openai-community/gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
--do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200
{'train_runtime': 101.9003, 'train_samples_per_second': 1.963, 'epoch': 0.69}
# DDP w/o NVlink
rm -r /tmp/test-clm; NCCL_P2P_DISABLE=1 CUDA_VISIBLE_DEVICES=0,1 \
torchrun --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \
--model_name_or_path openai-community/gpt2 --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
--do_train --output_dir /tmp/test-clm --per_device_train_batch_size 4 --max_steps 200
{'train_runtime': 131.4367, 'train_samples_per_second': 1.522, 'epoch': 0.69}
```
Hardware: 2x TITAN RTX, 24GB each + NVlink with 2 NVLinks (`NV2` in `nvidia-smi topo -m`)

Software: `pytorch-1.8-to-be` + `cuda-11.0` / `transformers==4.3.0.dev0`
## ZeRO Data Parallelism
ZeRO-powered Data Parallelism (ZeRO-DP) is described in the following diagram from this [blog post](https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/).

It can be difficult to wrap one's head around it, but in reality the concept is quite simple. This is just the usual `DataParallel` (DP), except, instead of replicating the full model parameters, gradients and optimizer states, each GPU stores only a slice of them. And then at run-time, when the full layer params are needed just for the given layer, all GPUs synchronize to give each other the parts that they are missing. And that's it.

Consider this simple model with 3 layers, where each layer has 3 params:
```
La | Lb | Lc
---|----|---
a0 | b0 | c0
a1 | b1 | c1
a2 | b2 | c2
```
Layer La has weights a0, a1 and a2.

If we have 3 GPUs, Sharded DDP (= ZeRO-DP) splits the model onto the 3 GPUs like so:
```
GPU0:
La | Lb | Lc
---|----|---
a0 | b0 | c0
GPU1:
La | Lb | Lc
---|----|---
a1 | b1 | c1
GPU2:
La | Lb | Lc
---|----|---
a2 | b2 | c2
```
In a way, this is the same horizontal slicing as tensor parallelism, if you imagine the typical DNN diagram. Vertical slicing is where one puts whole layer-groups on different GPUs. But this is just the starting point.

Now each of these GPUs will get the usual mini-batch, just as it works in DP:
```
x0 => GPU0
x1 => GPU1
x2 => GPU2
```
First, the input data hits layer La.

Let's focus on GPU0: x0 needs the a0, a1 and a2 params to do its forward path, but GPU0 has only a0. It gets sent a1 from GPU1 and a2 from GPU2, bringing all the pieces of the model together.

In parallel, GPU1 gets mini-batch x1, and while it only has a1, it needs the a0 and a2 params, which it gets from GPU0 and GPU2.

The same happens to GPU2, which receives input x2. It gets a0 and a1 from GPU0 and GPU1, and with its a2 it reconstructs the full tensor.

All 3 GPUs get the full tensors reconstructed, and the forward pass happens.

As soon as the calculation is done, the data that is no longer needed gets dropped: it's only used during the calculation. The reconstruction is done efficiently via a pre-fetch.

And the whole process is repeated for layer Lb, then Lc forward-wise, and then backward Lc -> Lb -> La.
To me this sounds like an efficient group backpacking weight distribution strategy:

1. Person A carries the tent
2. Person B carries the stove
3. Person C carries the axe

Now each night they all share what they have with the others and get from the others what they don't have, and in the morning they pack up their allocated type of gear and continue on their way. This is Sharded DDP / ZeRO-DP.

Compare this strategy to the simple one where each person has to carry their own tent, stove and axe. That is DataParallel (DP and DDP) in PyTorch.

While reading the literature on this topic you may encounter the following synonyms: sharded, partitioned.

If you pay close attention to the way ZeRO partitions the model's weights, it looks very similar to tensor parallelism, which will be discussed later. This is because it partitions/shards each layer's weights, unlike vertical model parallelism, which is discussed next.
Implementations:
- [DeepSpeed](https://www.deepspeed.ai/tutorials/zero/) ZeRO-DP stages 1+2+3
- [`transformers` integration](main_classes/trainer#trainer-integrations)
## Naive Model Parallelism (Vertical) and Pipeline Parallelism
Naive Model Parallelism (MP) is where one spreads groups of model layers across multiple GPUs. The mechanism is relatively simple: switch the desired layers `.to()` the desired devices, and now whenever the data goes in and out of those layers, the data is switched to the same device as the layer, and the rest remains unmodified.

We refer to it as naive MP because, if you remember how most models are drawn, we slice the layers vertically. For example, the following diagram shows an 8-layer model:
```
=================== ===================
| 0 | 1 | 2 | 3 | | 4 | 5 | 6 | 7 |
=================== ===================
gpu0 gpu1
```
We just sliced the model vertically into 2, placing layers 0-3 onto GPU0 and layers 4-7 onto GPU1.

While data travels from layer 0 to 1, 1 to 2, and 2 to 3, it's just like a normal model. But when data needs to pass from layer 3 to layer 4, it has to travel from GPU0 to GPU1, which introduces a communication overhead. If the participating GPUs are on the same compute node (e.g. the same physical machine), this copying is pretty fast, but if the GPUs are located on different compute nodes (e.g. multiple machines), the communication overhead could grow significantly.

Then layers 4 to 5 to 6 to 7 work as in a normal model, and when the 7th layer completes, there is often a need to send the data back to layer 0 where the labels are (or, alternatively, to send the labels to the last layer). Now the loss can be computed and the optimizer can do its work.
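The split above really is just a couple of `.to()` calls plus one activation copy at the boundary. Here is a toy sketch; CPU devices are used so it runs anywhere, whereas a real setup would use `"cuda:0"` and `"cuda:1"`:

```python
import torch
import torch.nn as nn

# In a real setup: dev0, dev1 = "cuda:0", "cuda:1". CPU is used here so the
# sketch runs anywhere; only the device strings would change.
dev0 = dev1 = torch.device("cpu")

# An 8-layer model sliced vertically in half, as in the diagram above.
part1 = nn.Sequential(*[nn.Linear(16, 16) for _ in range(4)]).to(dev0)  # layers 0-3
part2 = nn.Sequential(*[nn.Linear(16, 16) for _ in range(4)]).to(dev1)  # layers 4-7

x = torch.randn(8, 16, device=dev0)
h = part1(x)      # runs entirely on dev0
h = h.to(dev1)    # layer 3 -> layer 4 boundary: this copy is the communication overhead
y = part2(h)      # runs entirely on dev1
```

Note that while `part2` is busy, `part1`'s device sits idle, which is exactly the deficiency discussed next.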
Problems:
- The main deficiency, and the reason this one is called "naive" MP, is that all but one GPU are idle at any given moment. So if 4 GPUs are used, it's almost identical to quadrupling the amount of memory of a single GPU while ignoring the rest of the hardware. Plus there is the overhead of copying data between devices. So 4x 6GB cards will be able to accommodate the same size as 1x 24GB card using naive MP, except the latter will complete the training faster, since it doesn't have the data copying overhead. But, say, if you have 40GB cards and need to fit a 45GB model, you can just barely do it, because of the gradient and optimizer states.
- Shared embeddings may need to get copied back and forth between GPUs.

Pipeline Parallelism (PP) is almost identical to naive MP, but it solves the GPU idling problem by chunking the incoming batch into micro-batches and artificially creating a pipeline, which allows different GPUs to concurrently participate in the computation process.
The following illustration from the [GPipe paper](https://ai.googleblog.com/2019/03/introducing-gpipe-open-source-library.html) shows naive MP on the top and PP on the bottom:

It's easy to see from this figure how PP has fewer GPU idle zones. The idle parts are referred to as the "bubble".

Both parts of the diagram show a parallelism of degree 4, that is, 4 GPUs participate in the pipeline. So there is the forward path of 4 pipe stages F0, F1, F2 and F3, and then the return reverse-order backward path of B3, B2, B1 and B0.

PP introduces a new hyper-parameter to tune, called `chunks`, which defines how many chunks of data are sent in a sequence through the same pipe stage. For example, in the bottom diagram you can see `chunks=4`. GPU0 performs the same forward path on chunks 0, 1, 2 and 3 (F0,0 F0,1 F0,2 F0,3), then it waits for the other GPUs to do their work, and only then does GPU0 perform the backward path on chunks 3, 2, 1 and 0 (B0,3 B0,2 B0,1 B0,0).

Note that conceptually this is the same concept as gradient accumulation steps (GAS): PyTorch uses `chunks`, whereas DeepSpeed refers to the same hyper-parameter as GAS.

Because of the chunks, PP introduces the concept of micro-batches (MBS). DP splits the global data batch size into mini-batches. So if you have a DP degree of 4 and a global batch size of 1024, it gets split up into 4 mini-batches of 256 each (1024/4). And if the number of `chunks` (or GAS) is 32, we end up with a micro-batch size of 8 (256/32). Each pipeline stage works with a single micro-batch at a time.

To calculate the global batch size of a DP + PP setup, we do: `mbs*chunks*dp_degree` (`8*32*4=1024`).
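The batch-size bookkeeping above, spelled out as arithmetic with the numbers from the text:

```python
# Batch-size decomposition for a DP + PP setup.
global_batch_size = 1024
dp_degree = 4
chunks = 32  # PyTorch's name; DeepSpeed calls the same hyper-parameter GAS

mini_batch_size = global_batch_size // dp_degree  # per DP rank: 256
micro_batch_size = mini_batch_size // chunks      # per pipeline chunk: 8

# And back again: mbs * chunks * dp_degree recovers the global batch size.
assert micro_batch_size * chunks * dp_degree == global_batch_size
```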
Let's go back to the diagram.

With `chunks=1` you end up with the naive MP, which is very inefficient. With a very large `chunks` value you end up with tiny micro-batch sizes, which is likely not very efficient either. So one has to experiment to find the value that leads to the most efficient utilization of the GPUs. This corresponds to minimizing the size of the bubble, which enables high concurrent GPU utilization across all participating GPUs.

There are 2 groups of solutions: the traditional Pipeline API solutions, which require substantial modifications to the user's model, and more modern solutions that make things much easier for the end user.
Traditional Pipeline API solutions:
- PyTorch
- DeepSpeed
- Megatron-LM

Modern solutions:
- Varuna
- Sagemaker

Problems with traditional Pipeline API solutions:
- The model has to be modified quite substantially, because Pipeline requires rewriting the normal flow of modules into an `nn.Sequential` sequence of the same, which may require changes to the design of the model.
- Currently the Pipeline API is very restricted. If you have a bunch of Python variables being passed in the very first stage of the Pipeline, you will have to find a way around it. Currently, the pipeline interface requires either a single Tensor or a tuple of Tensors as the only input and output. These tensors must have a batch size as the very first dimension, since the pipeline is going to chunk the mini-batch into micro-batches. Possible improvements are being discussed here: https://github.com/pytorch/pytorch/pull/50693
- Conditional control flow at the level of pipe stages is not possible; e.g., Encoder-Decoder models like T5 require special workarounds to handle a conditional encoder stage.
- Each layer has to be arranged so that the output of one model becomes an input to the other model.

We have yet to experiment with Varuna and SageMaker, but their papers report that they have overcome the list of problems mentioned above and that they require much smaller changes to the user's model.
Implementations:
- [Pytorch](https://pytorch.org/docs/stable/pipeline.html) (initial support in pytorch-1.8, and progressively getting improved in 1.9 and more so in 1.10). Some [examples](https://github.com/pytorch/pytorch/blob/master/benchmarks/distributed/pipeline/pipe.py)
- [DeepSpeed](https://www.deepspeed.ai/tutorials/pipeline/)
- [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) has an internal implementation - no API.
- [Varuna](https://github.com/microsoft/varuna)
- [SageMaker](https://arxiv.org/abs/2111.05972) - this is a proprietary solution that can only be used on AWS.
- [OSLO](https://github.com/tunib-ai/oslo) - this implementation is based on Hugging Face Transformers.
🤗 Transformers status: as of this writing none of the models supports full PP (pipeline parallelism). The GPT2 and T5 models have naive MP (model parallelism) support. The main obstacle is being unable to convert the models to `nn.Sequential` and have all the inputs be Tensors. The current models include many features that make the conversion very complicated, and these would need to be removed.
Other approaches:

DeepSpeed, Varuna and SageMaker use the concept of an [interleaved pipeline](https://docs.aws.amazon.com/sagemaker/latest/dg/model-parallel-core-features.html), where the bubble (idle time) is further minimized by prioritizing backward passes.

Varuna further tries to improve the schedule by using simulations to discover the most efficient scheduling.

OSLO has a pipeline parallelism implementation based on Transformers without `nn.Sequential` conversion.
## Tensor Parallelism
In Tensor Parallelism, each GPU processes only a slice of a tensor and only aggregates the full tensor for operations that require the whole thing.

In this section we use concepts and diagrams from the [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) paper: [Efficient Large-Scale Language Model Training on GPU Clusters](https://arxiv.org/abs/2104.04473).
The main building block of any transformer is a fully connected `nn.Linear` followed by a nonlinear activation `GeLU`.

Following the Megatron paper's notation, we can write the matrix multiplication part of it as `Y = GeLU(XA)`, where `X` and `Y` are the input and output vectors, and `A` is the weight matrix.

If we look at the computation in matrix form, it's easy to see how the matrix multiplication can be split between multiple GPUs:

If we split the weight matrix `A` column-wise across `N` GPUs and perform the matrix multiplications `XA_1` through `XA_n` in parallel, we end up with `N` output vectors `Y_1, Y_2, ..., Y_n`, which can be fed into `GeLU` independently:

Using this principle, we can update an MLP of arbitrary depth without the need for any synchronization between GPUs until the very end. The Megatron-LM paper authors provide a helpful illustration for that:

Parallelizing the multi-headed attention layers is even simpler, since they are already inherently parallel, due to having multiple independent heads!

Special considerations: TP requires a very fast network, and therefore it's not advisable to do TP across more than one node. Practically, if a node has 4 GPUs, the highest TP degree is 4. If you need a TP degree of 8, you need to use nodes that have at least 8 GPUs.
This section is based on the original, much more [detailed TP overview](https://github.com/huggingface/transformers/issues/10321#issuecomment-783543530) by [@anton-l](https://github.com/anton-l).

SageMaker combines TP with DP for a more efficient processing.
Alternative names:
- [DeepSpeed](https://github.com/deepspeedai/DeepSpeed) calls it "tensor slicing"; see [DeepSpeed's features](https://www.deepspeed.ai/training/#model-parallelism) for details.
Implementations:
- [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) has an internal, model-specific implementation.
- [parallelformers](https://github.com/tunib-ai/parallelformers) (inference only at the moment).
- [SageMaker](https://arxiv.org/abs/2111.05972) - this is a proprietary solution that can only be used on AWS.
- [OSLO](https://github.com/tunib-ai/oslo) has a tensor parallelism implementation based on Transformers.
🤗 Transformers status:
- core: not yet implemented in the core.
- However, if you need inference, [parallelformers](https://github.com/tunib-ai/parallelformers) provides support for most of our models. So until this is implemented in the core, you can use theirs. And hopefully the training mode will be supported too.
- Deepspeed-Inference also supports the BERT, GPT-2, and GPT-Neo models in its super-fast CUDA-kernel-based inference mode; see more [here](https://www.deepspeed.ai/tutorials/inference-tutorial/).
## DP+PP
The following diagram from the DeepSpeed [pipeline tutorial](https://www.deepspeed.ai/tutorials/pipeline/) demonstrates how one combines DP with PP.

Here it's important to see how DP rank 0 doesn't see GPU2 and DP rank 1 doesn't see GPU3. To DP, there are just GPUs 0 and 1, and it feeds data to them as if there were just 2 GPUs. GPU0 "secretly" offloads some of its load to GPU2 using PP, and GPU1 does the same by enlisting GPU3 to its aid.

Since each dimension requires at least 2 GPUs, here you'd need at least 4 GPUs.
Implementations:
- [DeepSpeed](https://github.com/deepspeedai/DeepSpeed)
- [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)
- [Varuna](https://github.com/microsoft/varuna)
- [SageMaker](https://arxiv.org/abs/2111.05972)
- [OSLO](https://github.com/tunib-ai/oslo)
🤗 Transformers status: not yet implemented
## DP+PP+TP
To get an even more efficient training, PP is combined with TP and DP, which is called 3D parallelism. This can be seen in the following diagram.

This diagram is from the blog post [3D parallelism: Scaling to trillion-parameter models](https://www.microsoft.com/en-us/research/blog/deepspeed-extreme-scale-model-training-for-everyone/), which is a good read as well.

Since each dimension requires at least 2 GPUs, here you'd need at least 8 GPUs.

Implementations:
- [DeepSpeed](https://github.com/deepspeedai/DeepSpeed) - DeepSpeed also includes an even more efficient DP, which it calls ZeRO-DP.
- [Megatron-LM](https://github.com/NVIDIA/Megatron-LM)
- [Varuna](https://github.com/microsoft/varuna)
- [SageMaker](https://arxiv.org/abs/2111.05972)
- [OSLO](https://github.com/tunib-ai/oslo)
🤗 Transformers status: not yet implemented, since we have no PP and TP.
## ZeRO DP+PP+TP
One of the main features of DeepSpeed is ZeRO, which is a super-scalable extension of DP. It has already been discussed in "ZeRO Data Parallelism". Normally it's a standalone feature that doesn't require PP or TP, but it can be combined with PP and TP.

When ZeRO-DP is combined with PP (and optionally TP), it typically enables only ZeRO stage 1 (optimizer sharding).

While it's theoretically possible to use ZeRO stage 2 (gradient sharding) with Pipeline Parallelism, it will have bad performance impacts. An additional reduce-scatter collective would be required for every micro-batch to aggregate the gradients before sharding, which adds a potentially significant communication overhead. By the nature of Pipeline Parallelism, small micro-batches are used, and instead the focus is on trying to balance arithmetic intensity (micro-batch size) with minimizing the pipeline bubble (number of micro-batches). Therefore those communication costs are going to hurt.

In addition, there are already fewer layers than normal due to PP, so the memory savings won't be huge. PP already reduces gradient size by `1/PP`, so gradient sharding savings on top of that are less significant than with pure DP.

ZeRO stage 3 is not a good choice either, for the same reason: more inter-node communication is required.

And since we have ZeRO, the other benefit is ZeRO-Offload. Since this is stage 1, optimizer states can be offloaded to CPU.
Implementations:
- [Megatron-DeepSpeed](https://github.com/microsoft/Megatron-DeepSpeed) and [Megatron-Deepspeed from BigScience](https://github.com/bigscience-workshop/Megatron-DeepSpeed), which is a fork of the former repo.
- [OSLO](https://github.com/tunib-ai/oslo)

Important papers:
- [Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B](https://arxiv.org/abs/2201.11990)

🤗 Transformers status: not yet implemented, since we have no PP and TP.
## FlexFlow
[FlexFlow](https://github.com/flexflow/FlexFlow) solves the parallelization problem in a slightly different approach.

Paper: ["Beyond Data and Model Parallelism for Deep Neural Networks" by Zhihao Jia, Matei Zaharia, Alex Aiken](https://arxiv.org/abs/1807.05358)
It performs a sort of 4D parallelism over Sample-Operator-Attribute-Parameter.

1. Sample = Data Parallelism (sample-wise parallel)
2. Operator = parallelize a single operation into several sub-operations
3. Attribute = Data Parallelism (length-wise parallel)
4. Parameter = Model Parallelism (regardless of dimension, horizontal or vertical)
Examples:
* Sample

Let's take 10 batches of sequence length 512. If we parallelize them by the sample dimension into 2 devices, we get 10 x 512, which becomes 5 x 2 x 512.

* Operator

If we perform layer normalization, we compute std first and mean second, and then we can normalize the data. Operator parallelism allows computing std and mean in parallel. So if we parallelize them by the operator dimension into 2 devices (cuda:0, cuda:1), first we copy the input data into both devices, and cuda:0 computes std while cuda:1 computes mean at the same time.

* Attribute

We have 10 batches of length 512. If we parallelize them by the attribute dimension into 2 devices, 10 x 512 becomes 10 x 2 x 256.

* Parameter

It is similar to tensor model parallelism or naive layer-wise model parallelism.
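The Sample and Attribute splits above amount to reshapes of the batch tensor. A quick check with the shapes from the examples (device placement omitted; FlexFlow itself decides where each piece goes):

```python
import torch

batch = torch.zeros(10, 512)  # 10 samples of sequence length 512

# Sample parallelism across 2 devices: 10 x 512 becomes 5 x 2 x 512.
sample_split = batch.reshape(5, 2, 512)

# Attribute (length-wise) parallelism across 2 devices: 10 x 512 becomes 10 x 2 x 256.
attr_split = batch.reshape(10, 2, 256)
```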
The significance of this framework is that it takes resources like (1) GPU/TPU/CPU vs. (2) RAM/DRAM vs. (3) fast-intra-connect vs. slow-inter-connect, and it automatically optimizes all of these, algorithmically deciding which parallelization to use where.

One very important aspect is that FlexFlow is designed for optimizing DNN parallelizations for models with static and fixed workloads, since models with dynamic behavior may prefer different parallelization strategies across iterations.

So the promise is very attractive: it runs a 30min simulation on the cluster of choice and it comes up with the best strategy to utilize this specific environment. If you add/remove/replace any parts, it'll run and re-optimize the plan for that. And then you can train. A different setup will have its own custom optimization.

🤗 Transformers status: not yet integrated. We already have our models FX-trace-able via [transformers.utils.fx](https://github.com/huggingface/transformers/blob/master/src/transformers/utils/fx.py), which is a prerequisite for FlexFlow, so someone needs to figure out what needs to be done to make FlexFlow work with our models.
## Which Strategy To Use When
Here is a very rough outline of which parallelism strategy to use when. The first on each list is typically faster.

**⇨ Single GPU**

* Model fits onto a single GPU:

    1. Normal use

* Model doesn't fit onto a single GPU:

    1. ZeRO + Offload CPU and optionally NVMe
    2. as above plus [Memory Centric Tiling](https://deepspeed.readthedocs.io/en/latest/zero3.html#memory-centric-tiling) (see below for details) if the largest layer can't fit into a single GPU

* Largest layer not fitting into a single GPU:

    1. If not using ZeRO - you must use TP, as PP alone won't be able to fit it.
    2. With ZeRO, see the same entry for "Single GPU" above

**⇨ Single Node / Multi-GPU**

* Model fits onto a single GPU:

    1. DDP - Distributed Data Parallel
    2. ZeRO - may or may not be faster depending on the situation and configuration used

* Model doesn't fit onto a single GPU:

    1. PP
    2. ZeRO
    3. TP

    With very fast intra-node connectivity (NVLINK or NVSwitch), all three should be mostly on par; without it, PP will be faster than TP or ZeRO. The degree of TP may also make a difference. It's best to experiment to find the winner on your particular setup.

    TP is almost always used within a single node. That is, TP size <= GPUs per node.

* Largest layer not fitting into a single GPU:

    1. If not using ZeRO - you must use TP, as PP alone won't be able to fit it.
    2. With ZeRO, see the same entry for "Single GPU" above

**⇨ Multi-Node / Multi-GPU**

* When you have fast inter-node connectivity:

    1. ZeRO - requires close to no modifications to the model
    2. PP+TP+DP - less communication, but requires massive changes to the model

* When you have slow inter-node connectivity and are still low on GPU memory:

    1. DP+PP+TP+ZeRO-1