<!--Copyright 2024 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Fully Sharded Data Parallel (FSDP) [[fully-sharded-data-parallel]]

[Fully Sharded Data Parallel (FSDP)](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/) is a data parallelism method that shards a model's parameters, gradients, and optimizer states across the available GPUs (also called workers or *ranks*). Unlike [DistributedDataParallel (DDP)](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html), FSDP reduces memory usage because the model is not replicated on every GPU. This improves GPU memory efficiency and allows you to train much larger models with fewer GPUs. FSDP is integrated with Accelerate, a library that makes it easy to manage training in distributed environments, so it can be used from the [`Trainer`] class.

Before you start, make sure Accelerate is installed and that you have at least PyTorch 2.1.0 or newer.
```bash
pip install accelerate
```
## FSDP configuration [[fsdp-configuration]]

To get started, run the [`accelerate config`](https://huggingface.co/docs/accelerate/package_reference/cli#accelerate-config) command to create a configuration file for your training environment. Accelerate uses this configuration file to automatically set up the correct training environment based on the training options you select in `accelerate config`.
```bash
accelerate config
```
Running `accelerate config` presents a series of options for configuring your training environment. This section covers some of the most important FSDP options. To learn more about the other available FSDP options, take a look at the [fsdp_config](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.fsdp_config) parameters.
### Sharding strategy [[sharding-strategy]]

FSDP offers several sharding strategies to choose from:

* `FULL_SHARD` - shards model parameters, gradients, and optimizer states across workers; select `1` for this option
* `SHARD_GRAD_OP` - shards gradients and optimizer states across workers; select `2` for this option
* `NO_SHARD` - doesn't shard anything (equivalent to DDP); select `3` for this option
* `HYBRID_SHARD` - shards model parameters, gradients, and optimizer states within each worker, where each worker also keeps a full copy; select `4` for this option
* `HYBRID_SHARD_ZERO2` - shards gradients and optimizer states within each worker, where each worker also keeps a full copy; select `5` for this option

This is enabled by the `fsdp_sharding_strategy` flag.
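As a sketch, selecting `FULL_SHARD` (option `1`) is recorded in the generated configuration file as a numeric value under `fsdp_config`:

```yaml
fsdp_config:
  fsdp_sharding_strategy: 1 # FULL_SHARD
```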
### CPU offload [[cpu-offload]]

You can also offload parameters and gradients to the CPU when they are not in use, saving even more GPU memory and helping you fit large models for which even FSDP alone may not be enough. This is enabled by setting `fsdp_offload_params: true` when running `accelerate config`.
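In the generated configuration file this corresponds to a single flag under `fsdp_config` (a sketch):

```yaml
fsdp_config:
  fsdp_offload_params: true # offload parameters and gradients to the CPU when not in use
```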
### Wrapping policy [[wrapping-policy]]

FSDP is applied by wrapping each layer in the network. The wrapping is usually applied in a nested way, where the full weights are discarded after each forward pass to save memory for use in the next layer. The *auto wrapping* policy is the simplest way to implement this, and you don't need to change any code. Select `fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP` to wrap a Transformer layer and `fsdp_transformer_layer_cls_to_wrap` to specify which layer to wrap (for example, `BertLayer`).

Alternatively, you can choose a size-based wrapping policy, where FSDP is applied to a layer if it exceeds a certain number of parameters. This is enabled by setting `fsdp_wrap_policy: SIZE_BASED_WRAP` and `min_num_param` to the desired size threshold.
### Checkpointing [[checkpointing]]

Intermediate checkpoints should be saved with `fsdp_state_dict_type: SHARDED_STATE_DICT`, because saving the full state dict on rank 0 with CPU offloading enabled is time-consuming and can hang indefinitely during broadcasting, causing `NCCL Timeout` errors. You can resume training from a sharded state dict with the [`~accelerate.Accelerator.load_state`] method.
```py
# directory containing checkpoints
accelerator.load_state("ckpt")
```
Once training is finished, however, you should save the full state dict, because the sharded state dict is only compatible with FSDP.
```py
if trainer.is_fsdp_enabled:
    trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT")

trainer.save_model(script_args.output_dir)
```
### TPU [[tpu]]

[PyTorch XLA](https://pytorch.org/xla/release/2.1/index.html) supports FSDP training for TPUs, and it can be enabled by modifying the FSDP configuration file generated by `accelerate config`. In addition to the sharding strategies and wrapping options specified above, you can add the parameters shown below to the file.
```yaml
xla: True # must be set to True to enable PyTorch/XLA
xla_fsdp_settings: # XLA-specific FSDP parameters
xla_fsdp_grad_ckpt: True # enable gradient checkpointing
```
The [`xla_fsdp_settings`](https://github.com/pytorch/xla/blob/2e6e183e0724818f137c8135b34ef273dea33318/torch_xla/distributed/fsdp/xla_fully_sharded_data_parallel.py#L128) parameter lets you configure additional XLA-specific parameters for FSDP.
## Launch training [[launch-training]]

An example FSDP configuration file may look like:
```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: true
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_transformer_layer_cls_to_wrap: BertLayer
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```
ν›ˆλ ¨μ„ μ‹œμž‘ν•˜λ €λ©΄ [`accelerate launch`](https://huggingface.co/docs/accelerate/package_reference/cli#accelerate-launch) λͺ…령을 μ‹€ν–‰ν•˜μ„Έμš”. 이 λ•Œ 전에 `accelerate config`둜 μƒμ„±ν•œ ꡬ성 νŒŒμΌμ„ μžλ™μœΌλ‘œ μ‚¬μš©ν•©λ‹ˆλ‹€.
```bash
accelerate launch my-trainer-script.py
```
You can also pass some of the FSDP options directly on the command line:

```bash
accelerate launch --fsdp="full shard" --fsdp_config="path/to/fsdp_config/" my-trainer-script.py
```
## λ‹€μŒ 단계 [[next-steps]]
FSDPλŠ” 맀우 큰 λͺ¨λΈμ„ ν›ˆλ ¨ν•  λ•Œ κ°•λ ₯ν•œ 도ꡬ가 될 수 있으며, μ—¬λŸ¬ 개의 GPUλ‚˜ TPUλ₯Ό μ‚¬μš©ν•  수 μžˆμŠ΅λ‹ˆλ‹€. λͺ¨λΈ λ§€κ°œλ³€μˆ˜, μ˜΅ν‹°λ§ˆμ΄μ € 및 κ·Έλ ˆμ΄λ””μ–ΈνŠΈ μƒνƒœλ₯Ό λΆ„ν• ν•˜κ³  λΉ„ν™œμ„± μƒνƒœμΌ λ•Œ, CPU둜 μ˜€ν”„λ‘œλ“œν•˜λ©΄ FSDPλŠ” λŒ€κ·œλͺ¨ ν›ˆλ ¨μ˜ 높은 μ—°μ‚° λΉ„μš©μ„ 쀄일 수 μžˆμŠ΅λ‹ˆλ‹€. 더 μ•Œμ•„λ³΄κ³  μ‹Άλ‹€λ©΄ λ‹€μŒ μžλ£Œκ°€ 도움이 될 수 μžˆμŠ΅λ‹ˆλ‹€:
* [FSDP](https://huggingface.co/docs/accelerate/usage_guides/fsdp)에 λŒ€ν•œ 더 깊이 μžˆλŠ” Accelerate κ°€μ΄λ“œλ₯Ό 따라가 λ³΄μ„Έμš”.
* [PyTorch의 μ™„μ „ λΆ„ν•  데이터 병렬 처리 (FSDP) APIλ₯Ό μ†Œκ°œν•©λ‹ˆλ‹€](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/) λΈ”λ‘œκ·Έ 글을 μ½μ–΄λ³΄μ„Έμš”.
* [FSDPλ₯Ό μ‚¬μš©ν•˜μ—¬ ν΄λΌμš°λ“œ TPUμ—μ„œ PyTorch λͺ¨λΈ 크기 μ‘°μ ˆν•˜κΈ°](https://pytorch.org/blog/scaling-pytorch-models-on-cloud-tpus-with-fsdp/) λΈ”λ‘œκ·Έ 글을 μ½μ–΄λ³΄μ„Έμš”.