<!--Copyright 2022 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Methods and tools for efficient training on a single GPU
This guide demonstrates practical techniques that you can use to increase the efficiency of your model's training by optimizing memory utilization, speeding up the training, or both. If you'd like to understand how the GPU is utilized during training, please refer to the [Model training anatomy](model_memory_anatomy) conceptual guide first. This guide focuses on practical techniques.
<Tip>
If you have access to a machine with multiple GPUs, these approaches are still valid, plus you can leverage additional methods outlined in the [multi-GPU section](perf_train_gpu_many).
</Tip>
When training large models, there are two aspects that should be considered at the same time:

* Data throughput/training time
* Model performance

Maximizing the throughput (samples/second) leads to lower training cost. This is generally achieved by utilizing the GPU as much as possible and thus filling GPU memory to its limit. If the desired batch size exceeds the limits of the GPU memory, memory optimization techniques such as gradient accumulation can help.

However, if the preferred batch size fits into memory, there's no reason to apply memory-optimizing techniques, because they can slow down the training. Just because one can use a large batch size does not necessarily mean it should. As part of hyperparameter tuning, you should determine which batch size yields the best results and then optimize resources accordingly.

The methods and tools covered in this guide can be classified based on the effect they have on the training process:
| Method/tool | Improves training speed | Optimizes memory utilization |
|:-----------------------------------------------------------|:------------------------|:-----------------------------|
| [Batch size choice](#batch-size-choice) | Yes | Yes |
| [Gradient accumulation](#gradient-accumulation) | No | Yes |
| [Gradient checkpointing](#gradient-checkpointing) | No | Yes |
| [Mixed precision training](#mixed-precision-training) | Yes | (No) |
| [Optimizer choice](#optimizer-choice) | Yes | Yes |
| [Data preloading](#data-preloading) | Yes | No |
| [DeepSpeed Zero](#deepspeed-zero) | No | Yes |
| [torch.compile](#using-torchcompile) | Yes | No |
<Tip>
**Note**: when using mixed precision with a small model and a large batch size, there will be some memory savings, but with a large model and a small batch size, the memory use will be larger.
</Tip>
These techniques are available whether you are training your model with [`Trainer`] or writing a pure PyTorch loop, in which case you can [configure these optimizations with 🤗 Accelerate](#using--accelerate).

If these methods do not result in sufficient gains, you can explore the following options:

* [Look into building your own custom Docker container with efficient software prebuilds](#efficient-software-prebuilds)
* [Consider a model that uses Mixture of Experts (MoE)](#mixture-of-experts)
* [Convert your model to BetterTransformer to leverage PyTorch native attention](#using-pytorch-native-attention)

Finally, if all of the above is still not enough, even after switching to a server-grade GPU like A100, consider moving to a multi-GPU setup. All these approaches remain valid in a multi-GPU setup, plus you can leverage the additional parallelism techniques outlined in the [multi-GPU section](perf_train_gpu_many).
## Batch size choice
To achieve optimal performance, start by identifying the appropriate batch size. It is recommended to use batch sizes and input/output neuron counts that are of size 2^N. Often it's a multiple of 8, but it can be higher depending on the hardware being used and the model's dtype.

For reference, check out NVIDIA's recommendations for [input/output neuron counts](https://docs.nvidia.com/deeplearning/performance/dl-performance-fully-connected/index.html#input-features) and [batch size](https://docs.nvidia.com/deeplearning/performance/dl-performance-fully-connected/index.html#batch-size) for fully connected layers (which are involved in GEMMs (General Matrix Multiplications)).

[Tensor Core Requirements](https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#requirements-tc) define the multiplier based on the dtype and the hardware. For instance, for the fp16 data type a multiple of 8 is recommended, unless it's an A100 GPU, in which case use multiples of 64.

For parameters that are small, also consider [Dimension Quantization Effects](https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html#dim-quantization). This is where tiling happens, and the right multiplier can yield a significant speedup.
## Gradient Accumulation
The **gradient accumulation** method aims to calculate gradients in smaller increments instead of computing them for the entire batch at once. This approach involves iteratively calculating gradients in smaller batches by performing forward and backward passes through the model and accumulating the gradients along the way. Once a sufficient number of gradients have been accumulated, the model's optimization step is executed. By employing gradient accumulation, it becomes possible to increase the **effective batch size** beyond the limitations imposed by the GPU's memory capacity. However, it is important to note that the additional forward and backward passes introduced by gradient accumulation can slow down the training process.

You can enable gradient accumulation by adding the `gradient_accumulation_steps` argument to `TrainingArguments`:
```py
training_args = TrainingArguments(per_device_train_batch_size=1, gradient_accumulation_steps=4, **default_args)
```
In the above example, your effective batch size becomes 4.

Alternatively, use 🤗 Accelerate to gain full control over the training loop. You can find the 🤗 Accelerate example [further in this guide](#using--accelerate).

While it is advised to max out GPU usage as much as possible, a high number of gradient accumulation steps can result in a more pronounced training slowdown. Consider the following example. Let's say `per_device_train_batch_size=4` without gradient accumulation hits the GPU's limit. If you would like to train with batches of size 64, do not set `per_device_train_batch_size` to 1 and `gradient_accumulation_steps` to 64. Instead, keep `per_device_train_batch_size=4` and set `gradient_accumulation_steps=16`. This results in the same effective batch size while making better use of the available GPU resources.

For additional information, please refer to batch size and gradient accumulation benchmarks for [RTX-3090](https://github.com/huggingface/transformers/issues/14608#issuecomment-1004392537) and [A100](https://github.com/huggingface/transformers/issues/15026#issuecomment-1005033957).
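The recommendation above boils down to a bit of arithmetic. A minimal sketch (the helper name is illustrative, not a library function) that keeps the per-device batch as large as possible for a target effective batch size:

```python
def accumulation_steps(target_batch_size: int, max_per_device_batch_size: int) -> tuple:
    """Return (per_device_train_batch_size, gradient_accumulation_steps) whose product
    equals target_batch_size, keeping the per-device batch as large as still fits."""
    per_device = max_per_device_batch_size
    while target_batch_size % per_device != 0:  # find the largest divisor that fits
        per_device -= 1
    return per_device, target_batch_size // per_device

# The example from the text: batches of 64 with at most 4 samples fitting on the GPU
print(accumulation_steps(64, 4))  # (4, 16) — not (1, 64)
```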
## Gradient Checkpointing
Some large models may still face memory issues even when the batch size is set to 1 and gradient accumulation is used. This is because there are other components that also require memory storage.

Saving all activations from the forward pass in order to compute the gradients during the backward pass can result in significant memory overhead. The alternative approach of discarding the activations and recalculating them when needed during the backward pass would introduce considerable computational overhead and slow down the training process.

**Gradient checkpointing** offers a compromise between these two approaches and saves strategically selected activations throughout the computational graph, so only a fraction of the activations need to be recomputed for the gradients. For an in-depth explanation of gradient checkpointing, refer to [this great article](https://medium.com/tensorflow/fitting-larger-networks-into-memory-583e3c758ff9).

To enable gradient checkpointing in the [`Trainer`], pass the corresponding flag to [`TrainingArguments`]:
```py
training_args = TrainingArguments(
per_device_train_batch_size=1, gradient_accumulation_steps=4, gradient_checkpointing=True, **default_args
)
```
Alternatively, use 🤗 Accelerate - you can find the 🤗 Accelerate example [further in this guide](#using--accelerate).
<Tip>
While gradient checkpointing improves memory efficiency, be aware that it slows training by approximately 20%.
</Tip>
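In a plain PyTorch loop, the same trade-off is exposed by `torch.utils.checkpoint`. A minimal sketch with a toy module (the module itself is made up for illustration):

```python
import torch
from torch.utils.checkpoint import checkpoint


class ToyBlock(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.inner = torch.nn.Sequential(
            torch.nn.Linear(16, 16), torch.nn.ReLU(), torch.nn.Linear(16, 16)
        )

    def forward(self, x):
        # Activations inside `self.inner` are not stored; they are recomputed
        # during the backward pass, trading compute for memory.
        return checkpoint(self.inner, x, use_reentrant=False)


block = ToyBlock()
x = torch.randn(4, 16, requires_grad=True)
block(x).sum().backward()  # gradients still flow through the checkpointed segment
print(x.grad.shape)  # torch.Size([4, 16])
```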
## Mixed precision training
**Mixed precision training** is a technique that aims to optimize the computational efficiency of training models by utilizing lower-precision numerical formats for certain variables. Traditionally, most models use 32-bit floating point precision (fp32 or float32) to represent and process variables. However, not all variables require this high precision level to achieve accurate results. By reducing the precision of certain variables to lower numerical formats like 16-bit floating point (fp16 or float16), we can speed up the computations. Because in this approach some computations are performed in half precision while some are still in full precision, the approach is called mixed precision training.

Most commonly, mixed precision training is achieved by using fp16 (float16) data types; however, some GPU architectures (such as the Ampere architecture) offer bf16 and tf32 (a CUDA internal data type) data types. Check out the [NVIDIA Blog](https://developer.nvidia.com/blog/accelerating-ai-training-with-tf32-tensor-cores/) to learn more about the differences between these data types.
### fp16
The main advantage of mixed precision training comes from saving the activations in half precision (fp16).
Although the gradients are also computed in half precision, they are converted back to full precision for the optimization step, so no memory is saved there.
While mixed precision training results in faster computations, it can also lead to more GPU memory being utilized, especially for small batch sizes.
This is because the model is now present on the GPU in both 16-bit and 32-bit precision (1.5x the original model on the GPU).

To enable mixed precision training, set the `fp16` flag to `True`:
```py
training_args = TrainingArguments(per_device_train_batch_size=4, fp16=True, **default_args)
```
If you prefer to use 🤗 Accelerate, you can find the 🤗 Accelerate example [further in this guide](#using--accelerate).
### BF16
If you have access to Ampere or newer hardware, you can use bf16 for mixed precision training and evaluation. While bf16 has worse precision than fp16, it has a much bigger dynamic range. In fp16, the largest representable number is `65504`, and any number above that will result in an overflow. A bf16 number can be as large as `3.39e+38` (!), which is about the same as fp32 - because both use 8 bits for the numerical range.
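These ranges are easy to verify numerically. The sketch below uses `torch.finfo`; no special hardware is needed just to inspect the dtypes:

```python
import torch

print(torch.finfo(torch.float16).max)   # 65504.0 — values above this overflow to inf
print(torch.finfo(torch.bfloat16).max)  # ~3.39e+38, about the same as fp32
print(torch.finfo(torch.float32).max)   # ~3.40e+38

# Overflow in practice:
print(torch.tensor(70000.0).to(torch.float16))   # tensor(inf, dtype=torch.float16)
print(torch.tensor(70000.0).to(torch.bfloat16))  # stays finite (just rounded) — no overflow
```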
You can enable BF16 in the 🤗 Trainer with:
```python
training_args = TrainingArguments(bf16=True, **default_args)
```
### TF32
Ampere hardware uses a special data type called tf32. It has the same numerical range as fp32 (8 bits), but instead of 23 bits of precision it has only 10 bits (the same as fp16), and uses only 19 bits in total. It's "magical" in the sense that you can use the normal fp32 training and/or inference code, and by enabling tf32 support you can get up to a 3x throughput improvement. All you need to do is add the following to your code:
```python
import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
```
CUDA will automatically switch to using tf32 instead of fp32 where possible, assuming that the GPU used is from the Ampere series.

According to [NVIDIA research](https://developer.nvidia.com/blog/accelerating-ai-training-with-tf32-tensor-cores/), the majority of machine learning training workloads show the same perplexity and convergence with tf32 training as with fp32. If you're already using fp16 or bf16 mixed precision, it may help with the throughput as well.

You can enable this mode in the 🤗 Trainer:
```python
TrainingArguments(tf32=True, **default_args)
```
<Tip>
tf32 can't be accessed directly via `tensor.to(dtype=torch.tf32)` because it is an internal CUDA data type. You need `torch>=1.7` to use tf32 data types.
</Tip>
For additional information on tf32 vs. other precisions, please refer to the following benchmarks:
[RTX-3090](https://github.com/huggingface/transformers/issues/14608#issuecomment-1004390803) and
[A100](https://github.com/huggingface/transformers/issues/15026#issuecomment-1004543189).
## Flash Attention 2
You can speed up training throughput by using the Flash Attention 2 integration in transformers. Check out the appropriate section in the [single GPU section](./perf_infer_gpu_one#Flash-Attention-2) to learn more about how to load a model with Flash Attention 2 modules.
## Optimizer choice

The most common optimizer used to train transformer models is Adam or AdamW (Adam with weight decay). Adam achieves good convergence by storing the rolling average of the previous gradients; however, it adds an additional memory footprint of the order of the number of model parameters. To remedy this, you can use an alternative optimizer. For example, if you have [NVIDIA/apex](https://github.com/NVIDIA/apex) installed, `adamw_apex_fused` will give you the fastest training experience among all supported AdamW optimizers.

[`Trainer`] integrates a variety of optimizers that can be used out of the box: `adamw_hf`, `adamw_torch`, `adamw_torch_fused`, `adamw_apex_fused`, `adamw_anyprecision`, `adafactor`, or `adamw_bnb_8bit`. More optimizers can be plugged in via a third-party implementation.

Let's take a closer look at two alternatives to the AdamW optimizer:

1. `adafactor`, which is available in [`Trainer`]
2. `adamw_bnb_8bit`, which is also available in Trainer, but a third-party integration is provided below for demonstration.

For comparison, for a 3B-parameter model (e.g. "google-t5/t5-3b"):

* A standard AdamW optimizer will need 24GB of GPU memory because it uses 8 bytes for each parameter (8 * 3 => 24GB)
* The Adafactor optimizer will need more than 12GB. It uses slightly more than 4 bytes for each parameter, so 4 * 3 and then some extra.
* The 8-bit BNB quantized optimizer will use only (2 * 3) 6GB if all optimizer states are quantized.
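The arithmetic behind these numbers can be sketched as follows (optimizer-state bytes per parameter only, ignoring weights, gradients, and activations):

```python
def optimizer_state_gb(num_params: float, bytes_per_param: int) -> float:
    """Optimizer-state memory in GB, counting 1 GB as 1e9 bytes as in the text."""
    return num_params * bytes_per_param / 1e9

three_billion = 3e9
print(optimizer_state_gb(three_billion, 8))  # 24.0 — AdamW: two fp32 moments per parameter
print(optimizer_state_gb(three_billion, 4))  # 12.0 — Adafactor needs slightly more than this
print(optimizer_state_gb(three_billion, 2))  # 6.0  — 8-bit Adam: two int8 moments per parameter
```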
### Adafactor
Adafactor doesn't store rolling averages for each element in weight matrices. Instead, it keeps aggregated information (sums of rolling averages row-wise and column-wise), significantly reducing its footprint. However, compared to Adam, Adafactor may have slower convergence in certain cases. You can switch to Adafactor by setting `optim="adafactor"` in [`TrainingArguments`]:
```py
training_args = TrainingArguments(per_device_train_batch_size=4, optim="adafactor", **default_args)
```
Combined with other approaches (gradient accumulation, gradient checkpointing, and mixed precision training), you can notice up to a 3x improvement while maintaining the throughput! However, as mentioned before, the convergence of Adafactor can be worse than Adam.
### 8-bit Adam
Instead of aggregating optimizer states like Adafactor, 8-bit Adam keeps the full state and quantizes it. Quantization means storing the state at lower precision and dequantizing it only for the optimization step. This is similar to the idea behind mixed precision training.
To use `adamw_bnb_8bit`, you simply need to set `optim="adamw_bnb_8bit"` in [`TrainingArguments`]:
```py
training_args = TrainingArguments(per_device_train_batch_size=4, optim="adamw_bnb_8bit", **default_args)
```
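To build intuition for the quantize-on-store idea, here is a toy absmax sketch with NumPy - this is not the actual blockwise algorithm bitsandbytes implements, just an illustration of storing a state as int8 plus a scale:

```python
import numpy as np


def quantize_absmax(state: np.ndarray) -> tuple:
    """Store an fp32 optimizer state as int8 plus one fp32 scale."""
    scale = np.abs(state).max() / 127.0
    return np.round(state / scale).astype(np.int8), scale


def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an fp32 approximation just for the optimizer update."""
    return q.astype(np.float32) * scale


state = np.array([0.5, -1.0, 0.25, 2.0], dtype=np.float32)
q, scale = quantize_absmax(state)
recovered = dequantize(q, scale)
print(q.nbytes, state.nbytes)             # 4 vs 16 — a 4x smaller stored state
print(np.max(np.abs(recovered - state)))  # small quantization error
```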
However, we can also use a third-party implementation of the 8-bit optimizer for demonstration purposes, to see how it can be integrated.

First, follow the installation guide in the GitHub [repo](https://github.com/TimDettmers/bitsandbytes) to install the `bitsandbytes` library that implements the 8-bit Adam optimizer.

Next you need to initialize the optimizer. This involves two steps:

* First, group the model's parameters into two groups - one where weight decay should be applied, and one where it should not. Usually, biases and layer norm parameters are not weight decayed.
* Then do some argument housekeeping to use the same parameters as the previously used AdamW optimizer.
```py
import bitsandbytes as bnb
from torch import nn
from transformers.trainer_pt_utils import get_parameter_names
training_args = TrainingArguments(per_device_train_batch_size=4, **default_args)
decay_parameters = get_parameter_names(model, [nn.LayerNorm], ["bias", "layernorm", "rmsnorm"])
optimizer_grouped_parameters = [
{
"params": [p for n, p in model.named_parameters() if n in decay_parameters],
"weight_decay": training_args.weight_decay,
},
{
"params": [p for n, p in model.named_parameters() if n not in decay_parameters],
"weight_decay": 0.0,
},
]
optimizer_kwargs = {
    "betas": (training_args.adam_beta1, training_args.adam_beta2),
    "eps": training_args.adam_epsilon,
    "lr": training_args.learning_rate,
}
adam_bnb_optim = bnb.optim.Adam8bit(optimizer_grouped_parameters, **optimizer_kwargs)
```
Finally, pass the custom optimizer as an argument to the `Trainer`:
```py
trainer = Trainer(model=model, args=training_args, train_dataset=ds, optimizers=(adam_bnb_optim, None))
```
Combined with other approaches (gradient accumulation, gradient checkpointing, and mixed precision training), you can expect about a 3x memory improvement and even slightly higher throughput than with Adafactor.
### multi_tensor
pytorch-nightly introduced `torch.optim._multi_tensor`, which should significantly speed up optimizers in situations with lots of small feature tensors. It should eventually become the default, but if you want to experiment with it sooner, take a look at this GitHub [issue](https://github.com/huggingface/transformers/issues/9965).
## Data preloading

One of the important requirements to reach great training speed is the ability to feed the GPU at the maximum speed it can handle. By default, everything happens in the main process, and it might not be able to read the data from disk fast enough, thus creating a bottleneck and leading to GPU under-utilization. Configure the following arguments to reduce the bottleneck:

- `DataLoader(pin_memory=True, ...)` - preloads the data into pinned memory on the CPU and typically leads to much faster transfers from CPU to GPU memory.
- `DataLoader(num_workers=4, ...)` - spawns several workers to preload data faster. During training, watch the GPU utilization stats; if it's far from 100%, experiment with increasing the number of workers. Of course, the problem could be elsewhere, so more workers won't necessarily lead to better performance.

When using [`Trainer`], the corresponding [`TrainingArguments`] are `dataloader_pin_memory` (`True` by default) and `dataloader_num_workers` (defaults to `0`).
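A minimal sketch of a DataLoader configured this way (the toy tensors are made up for illustration; `pin_memory` only pays off when there is a CUDA device to transfer to):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# A toy dataset of 256 samples with 16 features each
ds = TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,)))

dataloader = DataLoader(
    ds,
    batch_size=32,
    pin_memory=True,  # page-locked host memory -> faster host-to-device copies
    num_workers=2,    # worker processes prefetch batches while the GPU computes
)

features, labels = next(iter(dataloader))
print(features.shape)  # torch.Size([32, 16])
```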
## DeepSpeed ZeRO
DeepSpeed is an open-source deep learning optimization library that is integrated with 🤗 Transformers and 🤗 Accelerate.
It provides a wide range of features and optimizations designed to improve the efficiency and scalability of large-scale deep learning training.

If your model fits onto a single GPU and you have enough space to fit a small batch size, you don't need to use DeepSpeed, as it will only slow things down. However, if the model doesn't fit onto a single GPU, or you can't fit a small batch, you can leverage DeepSpeed ZeRO + CPU Offload or NVMe Offload. In this case, you need to [install the library separately](main_classes/deepspeed#installation), then follow one of the guides to create a configuration file and launch DeepSpeed:

* For an in-depth guide on DeepSpeed integration with [`Trainer`], review the [corresponding documentation](main_classes/deepspeed), specifically the section about [deployment with a single GPU](main_classes/deepspeed#deployment-with-one-gpu). Some adjustments are required to use DeepSpeed in a notebook; please take a look at the [corresponding guide](main_classes/deepspeed#deployment-in-notebooks).
* If you prefer to use 🤗 Accelerate, refer to the [🤗 Accelerate DeepSpeed guide](https://huggingface.co/docs/accelerate/en/usage_guides/deepspeed).
## Using torch.compile

PyTorch 2.0 introduced a new compile function that doesn't require any modification to existing PyTorch code and can optimize your code by adding a single line: `model = torch.compile(model)`.

If using [`Trainer`], you only need to pass the `torch_compile` option in the [`TrainingArguments`]:
```python
training_args = TrainingArguments(torch_compile=True, **default_args)
```
`torch.compile` uses Python's frame evaluation API to automatically create a graph from existing PyTorch programs. After capturing the graph, different backends can be deployed to lower the graph to an optimized engine.
You can find more details and benchmarks in the [PyTorch documentation](https://pytorch.org/get-started/pytorch-2.0/).
`torch.compile` has a growing list of backends, each with its own optional dependencies, which can be found by calling `torchdynamo.list_backends()`. Some of the most commonly used backends are:
**Debugging backends**:
* `dynamo.optimize("eager")` - Uses PyTorch to run the extracted GraphModule. This is quite useful for debugging TorchDynamo issues.
* `dynamo.optimize("aot_eager")` - Uses AotAutograd with no compiler, i.e. just PyTorch eager for AotAutograd's extracted forward and backward graphs. This is useful for debugging; speedups should not be expected.

**Training & inference backends**:
* `dynamo.optimize("inductor")` - Uses the TorchInductor backend with AotAutograd and cudagraphs by leveraging codegened Triton kernels. [Read more](https://dev-discuss.pytorch.org/t/torchinductor-a-pytorch-native-compiler-with-define-by-run-ir-and-symbolic-shapes/747)
* `dynamo.optimize("nvfuser")` - nvFuser with TorchScript. [Read more](https://dev-discuss.pytorch.org/t/tracing-with-primitives-update-1-nvfuser-and-its-primitives/593)
* `dynamo.optimize("aot_nvfuser")` - nvFuser with AotAutograd. [Read more](https://dev-discuss.pytorch.org/t/tracing-with-primitives-update-1-nvfuser-and-its-primitives/593)
* `dynamo.optimize("aot_cudagraphs")` - cudagraphs with AotAutograd. [Read more](https://github.com/pytorch/torchdynamo/pull/757)

**Inference-only backends**:
* `dynamo.optimize("ofi")` - Uses TorchScript's `optimize_for_inference`. [Read more](https://pytorch.org/docs/stable/generated/torch.jit.optimize_for_inference.html)
* `dynamo.optimize("fx2trt")` - Uses Nvidia TensorRT for inference optimizations. [Read more](https://pytorch.org/TensorRT/tutorials/getting_started_with_fx_path.html)
* `dynamo.optimize("onnxrt")` - Uses ONNX Runtime for inference on CPU/GPU. [Read more](https://onnxruntime.ai/)
* `dynamo.optimize("ipex")` - Uses IPEX for inference on CPU. [Read more](https://github.com/intel/intel-extension-for-pytorch)
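The debugging backends above can be exercised through `torch.compile`'s `backend` argument; the `"eager"` backend runs the captured graph with plain PyTorch, so the sketch below runs without a GPU or compiler toolchain (the toy function is made up for illustration):

```python
import torch


def gelu_ish(x):
    # A made-up activation, just something for Dynamo to capture
    return 0.5 * x * (1.0 + torch.tanh(x))


# Capture the graph with TorchDynamo, then run it with the eager backend
compiled = torch.compile(gelu_ish, backend="eager")

x = torch.randn(8)
# The compiled function must match the original numerically.
print(torch.allclose(compiled(x), gelu_ish(x)))  # True
```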
For an example of using `torch.compile` with 🤗 Transformers, check out this [blog post](https://www.philschmid.de/getting-started-pytorch-2-0-transformers).
## Using 🤗 Accelerate
With [🤗 Accelerate](https://huggingface.co/docs/accelerate/index) you can use the above methods while gaining full control over the training loop, and can essentially write the loop in pure PyTorch with some minor modifications.

Suppose you have combined the methods in the [`TrainingArguments`] like so:
```py
training_args = TrainingArguments(
per_device_train_batch_size=1,
gradient_accumulation_steps=4,
gradient_checkpointing=True,
fp16=True,
**default_args,
)
```
The full example training loop with 🤗 Accelerate is only a handful of lines of code long:
```py
from accelerate import Accelerator
from torch.utils.data.dataloader import DataLoader
dataloader = DataLoader(ds, batch_size=training_args.per_device_train_batch_size)
if training_args.gradient_checkpointing:
model.gradient_checkpointing_enable()
accelerator = Accelerator(mixed_precision="fp16" if training_args.fp16 else "no")
model, optimizer, dataloader = accelerator.prepare(model, adam_bnb_optim, dataloader)
model.train()
for step, batch in enumerate(dataloader, start=1):
loss = model(**batch).loss
loss = loss / training_args.gradient_accumulation_steps
accelerator.backward(loss)
if step % training_args.gradient_accumulation_steps == 0:
optimizer.step()
optimizer.zero_grad()
```
First we wrap the dataset in a [`DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader).
Then we can enable gradient checkpointing by calling the model's [`~PreTrainedModel.gradient_checkpointing_enable`] method.
When we initialize the [`Accelerator`](https://huggingface.co/docs/accelerate/package_reference/accelerator#accelerate.Accelerator), we can specify whether we want to use mixed precision training, and it will take care of it for us in the [`prepare`](https://huggingface.co/docs/accelerate/package_reference/accelerator#accelerate.Accelerator.prepare) call. During the `prepare` call, the dataloader will also be distributed across workers should we use multiple GPUs. We use the same [8-bit optimizer](#8-bit-adam) from the earlier example.

Finally, we can add the main training loop. Note that the `backward` call is handled by 🤗 Accelerate. We can also see how gradient accumulation works: we normalize the loss so we get the average at the end of accumulation, and once we have enough steps, we run the optimization.
Implementing these optimization techniques with 🤗 Accelerate only takes a handful of lines of code and comes with the benefit of more flexibility in the training loop. For full documentation of all features, have a look at the [Accelerate documentation](https://huggingface.co/docs/accelerate/index).
## Efficient Software Prebuilds
PyTorch's [pip and conda builds](https://pytorch.org/get-started/locally/#start-locally) come prebuilt with the cuda toolkit, which is enough to run PyTorch, but it is insufficient if you need to build cuda extensions.

At times, additional effort may be required. For instance, you may be using libraries like `apex` that don't come pre-compiled. In other situations, figuring out how to install the right cuda toolkit system-wide can be complicated.
To address these scenarios, PyTorch and NVIDIA released a new version of the NGC docker container which already comes with everything prebuilt. You just need to install your programs on it, and it will run out of the box.

This approach is also useful if you want to tweak the PyTorch source and/or make a new customized build.

To find the docker image version you want, start with the [PyTorch release notes](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/) and choose one of the latest monthly releases. Go into the release notes for the desired release, check that the environment's components match your needs (including the NVIDIA Driver requirements!), and then at the very top of that document go to the corresponding NGC page. If for some reason you get lost, here is [the index of all PyTorch NGC images](https://ngc.nvidia.com/catalog/containers/nvidia:pytorch).

Next, follow the instructions to download and deploy the docker image.
## Mixture of Experts
Some recent papers report a 4-5x training speedup and faster inference from integrating Mixture of Experts (MoE) into Transformer models.

Since it has been discovered that more parameters lead to better performance, this technique allows increasing the number of parameters by an order of magnitude without increasing training costs.

In this approach, every other FFN layer is replaced with a MoE layer, which consists of many experts, with a gated function that trains each expert in a balanced way depending on the input token's position in a sequence.

![MoE Transformer 2x block](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/perf-moe-transformer.png)

(source: [GLAM](https://ai.googleblog.com/2021/12/more-efficient-in-context-learning-with.html))

The main drawback of this approach is that it requires a staggering amount of GPU memory - almost an order of magnitude more than its dense equivalent. Various distillation and other approaches have been proposed to overcome the much higher memory requirements.

There is a direct trade-off, though: you can use just a few experts with a 2-3x smaller base model instead of dozens or hundreds of experts, leading to a 5x smaller model, and thus increase the training speed moderately while increasing the memory requirements moderately as well.
Most of the related papers and implementations are built around Tensorflow/TPUs:

- [GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding](https://arxiv.org/abs/2006.16668)
- [Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity](https://arxiv.org/abs/2101.03961)
- [GLaM: Generalist Language Model (GLaM)](https://ai.googleblog.com/2021/12/more-efficient-in-context-learning-with.html)

For Pytorch, DeepSpeed has built one as well: [DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale](https://arxiv.org/abs/2201.05596), [Mixture of Experts](https://www.deepspeed.ai/tutorials/mixture-of-experts/) - blog posts: [1](https://www.microsoft.com/en-us/research/blog/deepspeed-powers-8x-larger-moe-model-training-with-high-performance/), [2](https://www.microsoft.com/en-us/research/publication/scalable-and-efficient-moe-training-for-multitask-multilingual-models/). For specific deployment with large Transformer-based natural language generation models, see this [blog post](https://www.deepspeed.ai/2021/12/09/deepspeed-moe-nlg.html) and the [Megatron-Deepspeed branch](https://github.com/microsoft/Megatron-DeepSpeed/tree/moe-training).
## Using PyTorch native attention and Flash Attention

PyTorch 2.0 released a native [`torch.nn.functional.scaled_dot_product_attention`](https://pytorch.org/docs/master/generated/torch.nn.functional.scaled_dot_product_attention.html) (SDPA), which allows using fused GPU kernels such as [memory-efficient attention](https://arxiv.org/abs/2112.05682) and [flash attention](https://arxiv.org/abs/2205.14135).
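A direct call to the operator looks like this (random tensors for illustration; on CPU it falls back to the math kernel, while fused kernels dispatch on supported GPUs):

```python
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 2, 4, 16, 8
query = torch.randn(batch, heads, seq_len, head_dim)
key = torch.randn(batch, heads, seq_len, head_dim)
value = torch.randn(batch, heads, seq_len, head_dim)

# With no attn_mask, SDPA is free to dispatch to Flash Attention on supported hardware.
out = F.scaled_dot_product_attention(query, key, value, is_causal=True)
print(out.shape)  # torch.Size([2, 4, 16, 8])
```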
After installing the [`optimum`](https://github.com/huggingface/optimum) package, the relevant internal modules can be replaced to use PyTorch's native attention with:
```python
model = model.to_bettertransformer()
```
Once converted, train the model as usual.
<Tip warning={true}>
The PyTorch-native `scaled_dot_product_attention` operator can only dispatch to Flash Attention if no `attention_mask` is provided.

By default, in training mode, the BetterTransformer integration drops mask support and can only be used for training that does not require a padding mask for batched training. This is the case, for example, during masked language modeling or causal language modeling. BetterTransformer is not suited for fine-tuning models on tasks that require a padding mask.
</Tip>
Check out this [blog post](https://pytorch.org/blog/out-of-the-box-acceleration/) to learn more about acceleration and memory savings with SDPA.