<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Fully Sharded Data Parallel (FSDP) [[fully-sharded-data-parallel]]

[Fully Sharded Data Parallel (FSDP)](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/) is a data parallelism method that shards a model's parameters, gradients, and optimizer states across the number of available GPUs (also called workers or *ranks*). Unlike [DistributedDataParallel (DDP)](https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html), FSDP reduces memory usage because the model is not replicated in full on every GPU. This improves GPU memory efficiency and lets you train much larger models on fewer GPUs. FSDP is integrated with Accelerate, a library that makes it easy to manage training in distributed environments, so it can be used with the [`Trainer`] class.

Before you start, make sure Accelerate is installed and that you have at least PyTorch 2.1.0 or newer.

```bash
pip install accelerate
```

## FSDP configuration [[fsdp-configuration]]

To get started, run the [`accelerate config`](https://huggingface.co/docs/accelerate/package_reference/cli#accelerate-config) command to create a configuration file for your training environment. Accelerate uses this configuration file to automatically set up the correct training environment based on the training options you selected in `accelerate config`.

```bash
accelerate config
```

When you run `accelerate config`, you'll be prompted with a series of options to configure your training environment. This section covers some of the most important FSDP options. To learn more about the other available FSDP options, take a look at the [fsdp_config](https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.fsdp_config) parameters.

### Sharding strategy [[sharding-strategy]]

FSDP offers several sharding strategies to choose from:

* `FULL_SHARD` - shards model parameters, gradients, and optimizer states across workers; select `1` for this option
* `SHARD_GRAD_OP` - shards gradients and optimizer states across workers; select `2` for this option
* `NO_SHARD` - doesn't shard anything (equivalent to DDP); select `3` for this option
* `HYBRID_SHARD` - shards model parameters, gradients, and optimizer states within each worker, while each worker also keeps a full copy; select `4` for this option
* `HYBRID_SHARD_ZERO2` - shards gradients and optimizer states within each worker, while each worker also keeps a full copy; select `5` for this option

This is enabled by the `fsdp_sharding_strategy` flag.
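In the configuration file that `accelerate config` generates, the choice is recorded as the numeric value of the `fsdp_sharding_strategy` flag. For example, full sharding would appear as the following minimal fragment (a complete file is shown later in this guide):

```yaml
fsdp_config:
  fsdp_sharding_strategy: 1  # 1 = FULL_SHARD
```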

### CPU offload [[cpu-offload]]

You can offload unused parameters and gradients to the CPU to save even more GPU memory, letting you fit large models on a GPU when FSDP alone isn't enough. This is enabled by setting `fsdp_offload_params: true` when running `accelerate config`.
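In the generated configuration file this shows up under `fsdp_config`, for example:

```yaml
fsdp_config:
  fsdp_offload_params: true  # offload parameters and gradients to the CPU
```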

### Wrapping policy [[wrapping-policy]]

FSDPλŠ” λ„€νŠΈμ›Œν¬μ˜ 각 λ ˆμ΄μ–΄λ₯Ό λž˜ν•‘ν•˜μ—¬ μ μš©λ©λ‹ˆλ‹€. λž˜ν•‘μ€ 일반적으둜 쀑첩 λ°©μ‹μœΌλ‘œ 적용되며 각각 순방ν–₯으둜 μ§€λ‚˜κ°„ ν›„ 전체 κ°€μ€‘μΉ˜λ₯Ό μ‚­μ œν•˜μ—¬ λ‹€μŒ λ ˆμ΄μ–΄μ—μ„œ μ‚¬μš©ν•  λ©”λͺ¨λ¦¬λ₯Ό μ ˆμ•½ν•©λ‹ˆλ‹€. *μžλ™ λž˜ν•‘* 정책은 이λ₯Ό κ΅¬ν˜„ν•˜λŠ” κ°€μž₯ κ°„λ‹¨ν•œ 방법이며 μ½”λ“œλ₯Ό λ³€κ²½ν•  ν•„μš”κ°€ μ—†μŠ΅λ‹ˆλ‹€. Transformer λ ˆμ΄μ–΄λ₯Ό λž˜ν•‘ν•˜λ €λ©΄ `fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP`λ₯Ό μ„ νƒν•˜κ³  λž˜ν•‘ν•  λ ˆμ΄μ–΄λ₯Ό μ§€μ •ν•˜λ €λ©΄ `fsdp_transformer_layer_cls_to_wrap`λ₯Ό μ„ νƒν•˜μ„Έμš” (예: `BertLayer`).

λ˜λŠ” νŠΉμ • λ§€κ°œλ³€μˆ˜ 수λ₯Ό μ΄ˆκ³Όν•  경우 FSDPκ°€ λ ˆμ΄μ–΄μ— μ μš©λ˜λŠ” 크기 기반 λž˜ν•‘ 정책을 선택할 수 μžˆμŠ΅λ‹ˆλ‹€. μ΄λŠ” `fsdp_wrap_policy: SIZE_BASED_WRAP` 및 `min_num_param`을 μ›ν•˜λŠ” 크기의 μž„κ³„κ°’μœΌλ‘œ μ„€μ •ν•˜μ—¬ ν™œμ„±ν™”λ©λ‹ˆλ‹€.

### Checkpointing [[checkpointing]]

쀑간 μ²΄ν¬ν¬μΈνŠΈλŠ” `fsdp_state_dict_type: SHARDED_STATE_DICT`둜 μ €μž₯ν•΄μ•Ό ν•©λ‹ˆλ‹€. CPU μ˜€ν”„λ‘œλ“œκ°€ ν™œμ„±ν™”λœ 랭크 0μ—μ„œ 전체 μƒνƒœ λ”•μ…”λ„ˆλ¦¬λ₯Ό μ €μž₯ν•˜λŠ” 데 μ‹œκ°„μ΄ 많이 걸리고, λΈŒλ‘œλ“œμΊμŠ€νŒ… 쀑 λ¬΄κΈ°ν•œ λŒ€κΈ°ν•˜μ—¬ `NCCL Timeout` 였λ₯˜κ°€ λ°œμƒν•  수 있기 λ•Œλ¬Έμž…λ‹ˆλ‹€. [`~accelerate.Accelerator.load_state`] λ©”μ„œλ“œλ₯Ό μ‚¬μš©ν•˜μ—¬ λΆ„ν• λœ μƒνƒœ λ”•μ…”λ„ˆλ¦¬λ‘œ ν›ˆλ ¨μ„ μž¬κ°œν•  수 μžˆμŠ΅λ‹ˆλ‹€.

```py
# directory containing checkpoints
accelerator.load_state("ckpt")
```

κ·ΈλŸ¬λ‚˜ ν›ˆλ ¨μ΄ λλ‚˜λ©΄ 전체 μƒνƒœ λ”•μ…”λ„ˆλ¦¬λ₯Ό μ €μž₯ν•΄μ•Ό ν•©λ‹ˆλ‹€. λΆ„ν• λœ μƒνƒœ λ”•μ…”λ„ˆλ¦¬λŠ” FSDPμ™€λ§Œ ν˜Έν™˜λ˜κΈ° λ•Œλ¬Έμž…λ‹ˆλ‹€.

```py
if trainer.is_fsdp_enabled:
    trainer.accelerator.state.fsdp_plugin.set_state_dict_type("FULL_STATE_DICT")

trainer.save_model(script_args.output_dir)
```

### TPU [[tpu]]

[PyTorch XLA](https://pytorch.org/xla/release/2.1/index.html) supports FSDP training on TPUs, which you can enable by modifying the FSDP configuration file generated by `accelerate config`. In addition to the sharding strategies and wrapping options specified above, you can add the parameters shown below to the file.

```yaml
xla: True # must be set to True to enable PyTorch/XLA
xla_fsdp_settings: # XLA-specific FSDP parameters
xla_fsdp_grad_ckpt: True # use gradient checkpointing
```

[`xla_fsdp_settings`](https://github.com/pytorch/xla/blob/2e6e183e0724818f137c8135b34ef273dea33318/torch_xla/distributed/fsdp/xla_fully_sharded_data_parallel.py#L128) lets you configure additional XLA-specific parameters for FSDP.
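For illustration, `xla_fsdp_settings` takes a mapping of keyword arguments that are forwarded to XLA's `XlaFullyShardedDataParallel` wrapper. A hypothetical fragment (verify the parameter names against the linked source for your torch_xla version) might look like:

```yaml
xla_fsdp_settings:
  flatten_parameters: true    # illustrative: flatten parameters into a contiguous buffer
  fp32_reduce_scatter: false  # illustrative: keep gradient reduce-scatter in the compute dtype
```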

## Launch training [[launch-training]]

An example FSDP configuration file may look like:

```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: true
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_transformer_layer_cls_to_wrap: BertLayer
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

ν›ˆλ ¨μ„ μ‹œμž‘ν•˜λ €λ©΄ [`accelerate launch`](https://huggingface.co/docs/accelerate/package_reference/cli#accelerate-launch) λͺ…령을 μ‹€ν–‰ν•˜μ„Έμš”. 이 λ•Œ 전에 `accelerate config`둜 μƒμ„±ν•œ ꡬ성 νŒŒμΌμ„ μžλ™μœΌλ‘œ μ‚¬μš©ν•©λ‹ˆλ‹€.

```bash
accelerate launch my-trainer-script.py
```

You can also specify some of the FSDP parameters directly on the command line:

```bash
accelerate launch --fsdp="full shard" --fsdp_config="path/to/fsdp_config/" my-trainer-script.py
```

## λ‹€μŒ 단계 [[next-steps]]

FSDP can be a powerful tool for training really large models across multiple GPUs or TPUs. By sharding the model parameters, optimizer and gradient states, and even offloading them to the CPU when they're inactive, FSDP can reduce the high computational cost of large-scale training. If you'd like to learn more, the following resources may be helpful:

* Follow along with the more in-depth Accelerate guide for [FSDP](https://huggingface.co/docs/accelerate/usage_guides/fsdp).
* Read the [Introducing PyTorch Fully Sharded Data Parallel (FSDP) API](https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/) blog post.
* Read the [Scaling PyTorch models on Cloud TPUs with FSDP](https://pytorch.org/blog/scaling-pytorch-models-on-cloud-tpus-with-fsdp/) blog post.