
์˜ตํ‹ฐ๋งˆ์ด์ €[[optimizers]]

Transformers provides two native optimizers, AdamW and AdaFactor, and also supports integrations with more specialized optimizers. Install the library that provides the optimizer you want, then set the `optim` parameter in [`TrainingArguments`] to the optimizer name.

์ด ๊ฐ€์ด๋“œ์—์„œ๋Š” ์•„๋ž˜์— ์ œ์‹œ๋œ [TrainingArguments]์™€ ํ•จ๊ป˜ [Trainer]์—์„œ ์ด๋Ÿฌํ•œ ์˜ตํ‹ฐ๋งˆ์ด์ €๋ฅผ ์‚ฌ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•์„ ์•ˆ๋‚ดํ•ฉ๋‹ˆ๋‹ค.

```python
import torch
from transformers import TrainingArguments, AutoTokenizer, AutoModelForCausalLM, Trainer

args = TrainingArguments(
    output_dir="./test-optimizer",
    max_steps=1000,
    per_device_train_batch_size=4,
    logging_strategy="steps",
    logging_steps=1,
    learning_rate=2e-5,
    save_strategy="no",
    run_name="optimizer-name",
)
```

## APOLLO[[apollo]]

```bash
pip install apollo-torch
```

Approximated Gradient Scaling for Memory Efficient LLM Optimization (APOLLO) is a memory-efficient optimizer that supports full-parameter training for both pretraining and fine-tuning. It maintains AdamW-level performance with SGD-like memory efficiency. If you need extreme memory savings, use APOLLO-Mini, a rank-1 variant of APOLLO. APOLLO optimizers support the following features.

  • ์ดˆ์ €๋žญํฌ(rank) ํšจ์œจ์„ฑ. GaLoRE๋ณด๋‹ค ํ›จ์”ฌ ๋‚ฎ์€ ๋žญํฌ๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ์œผ๋ฉฐ, ๋žญํฌ 1๋กœ๋„ ์ถฉ๋ถ„ํ•ฉ๋‹ˆ๋‹ค.
  • ๊ณ ๋น„์šฉ SVD ์—ฐ์‚ฐ ํšŒํ”ผ. APOLLO๋Š” ํ•™์Šต ์ค‘๋‹จ(training stalls)์„ ํ”ผํ•˜๊ธฐ ์œ„ํ•ด ๋ฌด์ž‘์œ„ ํˆฌ์˜(random projections)์„ ํ™œ์šฉํ•ฉ๋‹ˆ๋‹ค.
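
The random-projection idea in the second point can be sketched in plain Python: instead of factoring the gradient with an SVD, a cheaply sampled random matrix compresses it into a rank-`r` subspace. The shapes and rank below are illustrative only, not APOLLO's actual implementation.

```python
import random

# Toy "gradient": an n x m matrix stored as nested lists.
n, m, r = 8, 6, 2  # r << min(n, m): the low-rank subspace dimension
random.seed(0)
grad = [[random.gauss(0, 1) for _ in range(m)] for _ in range(n)]

# Random projection matrix P (m x r) -- cheap to sample, no SVD required.
proj = [[random.gauss(0, 1) for _ in range(r)] for _ in range(m)]

# Project the gradient into the rank-r subspace: G_low = G @ P, shape (n x r).
grad_low = [
    [sum(grad[i][k] * proj[k][j] for k in range(m)) for j in range(r)]
    for i in range(n)
]

# Optimizer state kept for the projected gradient is n*r values instead of n*m.
print(len(grad_low), len(grad_low[0]))  # 8 2
```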

ํ•™์Šตํ•  ๋ ˆ์ด์–ด๋ฅผ ์ง€์ •ํ•˜๋ ค๋ฉด optim_target_modules ํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์‚ฌ์šฉํ•˜์„ธ์š”.

```python
import torch
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./test-apollo",
    max_steps=100,
    per_device_train_batch_size=2,
    optim="apollo_adamw",
    optim_target_modules=[r".*.attn.*", r".*.mlp.*"],
    logging_strategy="steps",
    logging_steps=1,
    learning_rate=2e-5,
    save_strategy="no",
    run_name="apollo_adamw",
)
```

์ถ”๊ฐ€์ ์ธ ํ•™์Šต ์˜ต์…˜์ด ํ•„์š”ํ•˜๋‹ค๋ฉด, optim_args๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ rank, scale ๋“ฑ๊ณผ ๊ฐ™์€ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ๋ฅผ ์„ค์ •ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์‚ฌ์šฉ ๊ฐ€๋Šฅํ•œ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ ๋ชฉ๋ก์€ ์•„๋ž˜ ํ‘œ๋ฅผ ์ฐธ๊ณ ํ•˜์„ธ์š”.

scale ํŒŒ๋ผ๋ฏธํ„ฐ๋Š” n/r์œผ๋กœ ์„ค์ •ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค. ์ด๋•Œ, n์€ ์›๋ณธ ๊ณต๊ฐ„ ์ฐจ์›์ด๊ณ  r์€ ์ €๋žญํฌ(low-rank) ๊ณต๊ฐ„ ์ฐจ์›์ž…๋‹ˆ๋‹ค. scale์„ ๊ธฐ๋ณธ๊ฐ’์œผ๋กœ ์œ ์ง€ํ•˜๋ฉด์„œ ํ•™์Šต๋ฅ ๋งŒ ์กฐ์ •ํ•ด๋„ ๋น„์Šทํ•œ ํšจ๊ณผ๋ฅผ ์–ป์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

| Parameter | Description | APOLLO | APOLLO-Mini |
|---|---|---|---|
| `rank` | rank of the auxiliary sub-space used for gradient scaling | 256 | 1 |
| `scale_type` | how the scaling factor is applied | `channel` (per-channel scaling) | `tensor` (per-tensor scaling) |
| `scale` | adjusts the gradient updates to stabilize training | 1.0 | 128 |
| `update_proj_gap` | number of steps before updating the projection matrices | 200 | 200 |
| `proj` | projection type | `random` | `random` |

The example below enables the APOLLO-Mini optimizer.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./test-apollo_mini",
    max_steps=100,
    per_device_train_batch_size=2,
    optim="apollo_adamw",
    optim_target_modules=[r".*.attn.*", r".*.mlp.*"],
    optim_args="proj=random,rank=1,scale=128.0,scale_type=tensor,update_proj_gap=200",
)
```

## GrokAdamW[[grokadamw]]

```bash
pip install grokadamw
```

GrokAdamW is an optimizer designed for models that benefit from *grokking*, the phenomenon of delayed generalization caused by slowly changing gradients. It is especially useful for models that require advanced optimization techniques to achieve better performance and stability.

```python
import torch
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./test-grokadamw",
    max_steps=1000,
    per_device_train_batch_size=4,
    optim="grokadamw",
    logging_strategy="steps",
    logging_steps=1,
    learning_rate=2e-5,
    save_strategy="no",
    run_name="grokadamw",
)
```

## LOMO[[lomo]]

```bash
pip install lomo-optim
```

Low-Memory Optimization (LOMO) is a family of optimizers designed for memory-efficient full-parameter fine-tuning of LLMs, available in two versions, LOMO and AdaLomo. Both LOMO optimizers fuse the gradient computation and the parameter update into a single step to reduce memory usage. AdaLomo builds on LOMO by adding an adaptive learning rate for each parameter, like the Adam optimizer.
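
The fused update idea can be illustrated with a toy SGD loop in plain Python (a simplification, not LOMO's actual implementation): each parameter is updated as soon as its gradient is computed, so only one gradient value is alive at any moment instead of one per parameter.

```python
# Toy model: each parameter p has an independent loss (p - 1)^2.
params = [4.0, -2.0, 0.5]
lr = 0.1

def grad_of(p):
    # Gradient of (p - 1)^2 with respect to p.
    return 2.0 * (p - 1.0)

# Standard training materializes ALL gradients before the optimizer steps
# (peak memory: one gradient per parameter). A LOMO-style fused loop
# computes one gradient, applies it immediately, and discards it.
for step in range(100):
    for i in range(len(params)):
        g = grad_of(params[i])  # only this one gradient exists right now
        params[i] -= lr * g     # update in place, then g is discarded

print(params)  # each parameter converges toward the minimizer 1.0
```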

For better performance and higher throughput, it is recommended to use AdaLomo without `grad_norm`.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./test-lomo",
    max_steps=1000,
    per_device_train_batch_size=4,
    optim="adalomo",
    gradient_checkpointing=True,
    logging_strategy="steps",
    logging_steps=1,
    learning_rate=2e-6,
    save_strategy="no",
    run_name="adalomo",
)
```

## Schedule Free[[schedule-free]]

```bash
pip install schedulefree
```

The Schedule Free optimizer (SFO) replaces the base optimizer's momentum with a combination of averaging and interpolation. Unlike a traditional learning rate scheduler, SFO requires no procedure for gradually annealing the learning rate at all.
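
The averaging-and-interpolation idea can be sketched in plain Python for a 1-D quadratic (a simplified sketch of a Schedule-Free-style SGD update, not the library's implementation): gradients are evaluated at an interpolation `y` between the fast iterate `z` and the running average `x`, the average `x` is what you return as the solution, and the learning rate stays constant throughout.

```python
# Minimize f(w) = (w - 3)^2 with a schedule-free-style SGD sketch.
lr, beta = 0.1, 0.9
z = 0.0   # fast (gradient-step) iterate
x = 0.0   # averaged iterate -- the point actually returned

for t in range(1, 2001):
    y = (1 - beta) * z + beta * x  # interpolation: where the gradient is taken
    g = 2.0 * (y - 3.0)            # gradient of (w - 3)^2 at y
    z -= lr * g                    # constant learning rate, no schedule
    c = 1.0 / (t + 1)
    x = (1 - c) * x + c * z        # running average of the z iterates

print(x)  # close to the minimizer 3.0
```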

SFO supports the RAdam (`schedule_free_radam`), AdamW (`schedule_free_adamw`), and SGD (`schedule_free_sgd`) optimizers. The RAdam variant does not require `warmup_steps`.

๊ธฐ๋ณธ์ ์œผ๋กœ lr_scheduler_type="constant"๋กœ ์„ค์ •ํ•˜๋Š” ๊ฒƒ์„ ๊ถŒ์žฅํ•ฉ๋‹ˆ๋‹ค. ๋‹ค๋ฅธ lr_scheduler_type ๊ฐ’๋„ ๋™์ž‘ํ•  ์ˆœ ์žˆ์œผ๋‚˜, SFO ์˜ตํ‹ฐ๋งˆ์ด์ €์™€ ๋‹ค๋ฅธ ํ•™์Šต๋ฅ  ์Šค์ผ€์ค„์„ ํ•จ๊ป˜ ์‚ฌ์šฉํ•˜๋ฉด SFO์˜ ์˜๋„๋œ ๋™์ž‘๊ณผ ์„ฑ๋Šฅ์— ์˜ํ–ฅ์„ ์ค„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./test-schedulefree",
    max_steps=1000,
    per_device_train_batch_size=4,
    optim="schedule_free_radam",
    lr_scheduler_type="constant",
    gradient_checkpointing=True,
    logging_strategy="steps",
    logging_steps=1,
    learning_rate=2e-6,
    save_strategy="no",
    run_name="sfo",
)
```

## StableAdamW[[stableadamw]]

```bash
pip install torch-optimi
```

StableAdamW is a hybrid optimizer that combines AdamW and AdaFactor. It ports AdaFactor's update clipping into AdamW, which removes the need for separate gradient clipping. Otherwise, it behaves as a drop-in replacement for AdamW.
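
AdaFactor-style update clipping can be sketched as follows (a simplified illustration, not torch-optimi's actual code): the per-tensor RMS of the proposed update is measured, and the update is scaled down whenever that RMS exceeds a threshold, so no separate global gradient clipping pass is needed.

```python
import math

def clip_update(update, threshold=1.0):
    """Scale a proposed parameter update so its RMS is at most `threshold`."""
    rms = math.sqrt(sum(u * u for u in update) / len(update))
    factor = max(1.0, rms / threshold)
    return [u / factor for u in update]

# An update with RMS 2.0 is scaled down by 2; a small update passes through.
print(clip_update([2.0, -2.0, 2.0, -2.0]))  # [1.0, -1.0, 1.0, -1.0]
print(clip_update([0.1, -0.1]))             # [0.1, -0.1]
```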

๋ฐฐ์น˜(batch) ํฌ๊ธฐ๊ฐ€ ํฌ๊ฑฐ๋‚˜ ํ›ˆ๋ จ ์†์‹ค(training loss)์ด ๊ณ„์†ํ•ด์„œ ๊ธ‰๊ฒฉํ•˜๊ฒŒ ๋ณ€๋™ํ•œ๋‹ค๋ฉด, beta_2 ๊ฐ’์„ [0.95, 0.99] ์‚ฌ์ด๋กœ ์ค„์—ฌ๋ณด์„ธ์š”.

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="./test-stable-adamw",
    max_steps=1000,
    per_device_train_batch_size=4,
    optim="stable_adamw",
    gradient_checkpointing=True,
    logging_strategy="steps",
    logging_steps=1,
    learning_rate=2e-6,
    save_strategy="no",
    run_name="stable-adamw",
)
```