<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->
# Optimizers[[optimizers]]

Transformers provides two native optimizers, AdamW and AdaFactor, and also offers integrations for more specialized optimizers. Install the library that provides the optimizer you want and set its name in the `optim` parameter of [`TrainingArguments`].

This guide shows you how to use these optimizers with [`Trainer`], together with the [`TrainingArguments`] defined below.
```py
import torch
from transformers import TrainingArguments, AutoTokenizer, AutoModelForCausalLM, Trainer

args = TrainingArguments(
    output_dir="./test-optimizer",
    max_steps=1000,
    per_device_train_batch_size=4,
    logging_strategy="steps",
    logging_steps=1,
    learning_rate=2e-5,
    save_strategy="no",
    run_name="optimizer-name",
)
```
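The `args` above only configure training; the imported model classes still need to be wired up. The following is a minimal sketch of the remaining setup, assuming a small causal LM such as `distilgpt2` and a toy in-memory dataset (both are illustrative choices, not requirements of the API):

```py
import torch
from transformers import TrainingArguments, AutoTokenizer, AutoModelForCausalLM, Trainer

args = TrainingArguments(
    output_dir="./test-optimizer",
    max_steps=10,  # shortened from 1000 for a quick smoke run
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    save_strategy="no",
    report_to="none",
)

# distilgpt2 is an illustrative small model; swap in any causal LM
tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

# toy dataset: for causal LM training, labels mirror input_ids
enc = tokenizer("A tiny training example.", return_tensors="pt")
train_dataset = [{"input_ids": enc["input_ids"][0], "labels": enc["input_ids"][0]}] * 16

trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```

The sections below only change the `TrainingArguments`; the surrounding model, dataset, and `Trainer` setup stays the same.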
## APOLLO[[apollo]]

```bash
pip install apollo-torch
```
[Approximated Gradient Scaling for Memory Efficient LLM Optimization (APOLLO)](https://github.com/zhuhanqing/APOLLO) is a memory-efficient optimizer that supports full-parameter training for both pretraining and fine-tuning. It maintains AdamW-level performance with SGD-like memory efficiency. If you need extreme memory savings, you can use APOLLO-Mini, a rank-1 variant of APOLLO. APOLLO optimizers support the following features.

* Ultra-low rank efficiency. APOLLO can use a much lower rank than [GaLore](./trainer#galore); rank 1 is sufficient.
* No costly SVD computations. APOLLO uses random projections to avoid training stalls.

Use the `optim_target_modules` parameter to specify which layers to train.
```diff
import torch
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./test-apollo",
    max_steps=100,
    per_device_train_batch_size=2,
+   optim="apollo_adamw",
+   optim_target_modules=[r".*.attn.*", r".*.mlp.*"],
    logging_strategy="steps",
    logging_steps=1,
    learning_rate=2e-5,
    save_strategy="no",
    run_name="apollo_adamw",
)
```
For additional training options, use `optim_args` to set hyperparameters such as `rank` and `scale`. Refer to the table below for the full list of available hyperparameters.

> [!TIP]
> The `scale` parameter can be set to `n/r`, where `n` is the original space dimension and `r` is the low-rank space dimension. You can achieve a similar effect by keeping `scale` at its default and only adjusting the learning rate.
| Parameter | Description | APOLLO | APOLLO-Mini |
|---|---|---|---|
| rank | rank of the auxiliary sub-space used for gradient scaling | 256 | 1 |
| scale_type | how the scaling factor is applied | `channel` (per-channel scaling) | `tensor` (per-tensor scaling) |
| scale | adjusts gradient updates to stabilize training | 1.0 | 128 |
| update_proj_gap | number of steps before updating the projection matrices | 200 | 200 |
| proj | projection type | `random` | `random` |
The example below enables the APOLLO-Mini optimizer.

```py
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./test-apollo_mini",
    max_steps=100,
    per_device_train_batch_size=2,
    optim="apollo_adamw",
    optim_target_modules=[r".*.attn.*", r".*.mlp.*"],
    optim_args="proj=random,rank=1,scale=128.0,scale_type=tensor,update_proj_gap=200",
)
```
## GrokAdamW[[grokadamw]]

```bash
pip install grokadamw
```
[GrokAdamW](https://github.com/cognitivecomputations/grokadamw) is an optimizer designed for models that benefit from *grokking*, a phenomenon where slowly varying gradients delay generalization. GrokAdamW is especially useful for models that require more advanced optimization techniques to achieve better performance and stability.

```diff
import torch
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./test-grokadamw",
    max_steps=1000,
    per_device_train_batch_size=4,
+   optim="grokadamw",
    logging_strategy="steps",
    logging_steps=1,
    learning_rate=2e-5,
    save_strategy="no",
    run_name="grokadamw",
)
```
## LOMO[[lomo]]

```bash
pip install lomo-optim
```
[Low-Memory Optimization (LOMO)](https://github.com/OpenLMLab/LOMO) is a family of optimizers designed for memory-efficient full-parameter fine-tuning of LLMs, available in two versions, [LOMO](https://huggingface.co/papers/2306.09782) and [AdaLomo](https://hf.co/papers/2310.10195). Both LOMO optimizers fuse gradient computation and the parameter update into a single step to reduce memory usage. AdaLomo builds on LOMO by adding an adaptive learning rate for each parameter, like the Adam optimizer.

> [!TIP]
> It is recommended to use AdaLomo without `grad_norm` for better performance and higher throughput.
```diff
args = TrainingArguments(
    output_dir="./test-lomo",
    max_steps=1000,
    per_device_train_batch_size=4,
+   optim="adalomo",
    gradient_checkpointing=True,
    logging_strategy="steps",
    logging_steps=1,
    learning_rate=2e-6,
    save_strategy="no",
    run_name="adalomo",
)
```
## Schedule Free[[schedule-free]]

```bash
pip install schedulefree
```
[Schedule Free optimizer (SFO)](https://hf.co/papers/2405.15682) replaces the base optimizer's momentum with a combination of averaging and interpolation. Unlike a traditional learning rate scheduler, SFO completely removes the need to anneal the learning rate.

SFO supports the RAdam (`schedule_free_radam`), AdamW (`schedule_free_adamw`), and SGD (`schedule_free_sgd`) optimizers. The RAdam optimizer does not require `warmup_steps`.

It is recommended to set `lr_scheduler_type="constant"`. Other `lr_scheduler_type` values may also work, but combining an SFO optimizer with another learning rate schedule can interfere with SFO's intended behavior and affect performance.
```diff
args = TrainingArguments(
    output_dir="./test-schedulefree",
    max_steps=1000,
    per_device_train_batch_size=4,
+   optim="schedule_free_radam",
+   lr_scheduler_type="constant",
    gradient_checkpointing=True,
    logging_strategy="steps",
    logging_steps=1,
    learning_rate=2e-6,
    save_strategy="no",
    run_name="sfo",
)
```
## StableAdamW[[stableadamw]]

```bash
pip install torch-optimi
```
[StableAdamW](https://huggingface.co/papers/2304.13013) is a hybrid optimizer that combines AdamW and AdaFactor. It ports AdaFactor's update clipping into AdamW, which removes the need for separate gradient clipping. Otherwise, it behaves as a drop-in replacement for AdamW.

> [!TIP]
> If the batch size is large or the training loss keeps spiking, try reducing `beta_2` to a value between [0.95, 0.99].
```diff
args = TrainingArguments(
    output_dir="./test-stable-adamw",
    max_steps=1000,
    per_device_train_batch_size=4,
+   optim="stable_adamw",
    gradient_checkpointing=True,
    logging_strategy="steps",
    logging_steps=1,
    learning_rate=2e-6,
    save_strategy="no",
    run_name="stable-adamw",
)
```