Trainer [[trainer]]

[Trainer] is a complete training and evaluation loop for PyTorch models implemented in the Transformers library. You only need to pass it the pieces required for training (a model, a tokenizer, a dataset, an evaluation function, training hyperparameters, etc.), and the [Trainer] class takes care of the rest. This makes it easy to start training quickly without writing your own training loop. At the same time, [Trainer] is highly customizable and offers a wide range of training options, so you can tailor it to your exact training needs.

In addition to the [Trainer] class, Transformers also provides a [Seq2SeqTrainer] class for sequence-to-sequence tasks like translation or summarization. The TRL library also offers the [~trl.SFTTrainer] class, which wraps the [Trainer] class and is optimized for training language models like Llama-2 and Mistral with autoregressive techniques. [~trl.SFTTrainer] supports features like sequence packing, LoRA, quantization, and DeepSpeed for efficiently scaling to any model size.


Feel free to check out the API reference for these other [Trainer]-type classes to learn more about when to use which one. In general, [Trainer] is the most versatile option and is appropriate for a broad range of tasks. [Seq2SeqTrainer] is designed for sequence-to-sequence tasks, and [~trl.SFTTrainer] is designed for training language models.

Before you start, make sure the Accelerate library, which enables running PyTorch training in distributed environments, is installed.

pip install accelerate

# μ—…κ·Έλ ˆμ΄λ“œ
pip install accelerate --upgrade

이 κ°€μ΄λ“œλŠ” [Trainer] ν΄λž˜μŠ€μ— λŒ€ν•œ κ°œμš”λ₯Ό μ œκ³΅ν•©λ‹ˆλ‹€.

Basic usage [[basic-usage]]

[Trainer] includes all the code you'll find in a basic training loop (a plain-PyTorch sketch of these steps follows the list below).

  1. perform a training step to calculate the loss
  2. calculate the gradients with the [~accelerate.Accelerator.backward] method
  3. update the weights based on the gradients
  4. repeat this process until a predetermined number of epochs is reached
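
To make these steps concrete, here is a minimal plain-PyTorch sketch of the loop that [Trainer] automates. The toy model, optimizer, and random data are purely illustrative:

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# toy model, loss, optimizer, and random data - purely illustrative
model = nn.Linear(4, 2)
loss_fct = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
train_dataloader = DataLoader(
    TensorDataset(torch.randn(32, 4), torch.randint(0, 2, (32,))), batch_size=8
)

num_train_epochs = 2
for epoch in range(num_train_epochs):           # 4. repeat for a set number of epochs
    for inputs, labels in train_dataloader:
        loss = loss_fct(model(inputs), labels)  # 1. training step that calculates the loss
        loss.backward()                         # 2. calculate the gradients
        optimizer.step()                        # 3. update the weights
        optimizer.zero_grad()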

[Trainer] ν΄λž˜μŠ€λŠ” PyTorch와 ν›ˆλ ¨ 과정에 μ΅μˆ™ν•˜μ§€ μ•Šκ±°λ‚˜ 막 μ‹œμž‘ν•œ κ²½μš°μ—λ„ ν›ˆλ ¨μ΄ κ°€λŠ₯ν•˜λ„λ‘ ν•„μš”ν•œ λͺ¨λ“  μ½”λ“œλ₯Ό μΆ”μƒν™”ν•˜μ˜€μŠ΅λ‹ˆλ‹€. λ˜ν•œ 맀번 ν›ˆλ ¨ 루프λ₯Ό μ†μˆ˜ μž‘μ„±ν•˜μ§€ μ•Šμ•„λ„ 되며, ν›ˆλ ¨μ— ν•„μš”ν•œ λͺ¨λΈκ³Ό 데이터셋 같은 ν•„μˆ˜ ꡬ성 μš”μ†Œλ§Œ μ œκ³΅ν•˜λ©΄, [Trainer] ν΄λž˜μŠ€κ°€ λ‚˜λ¨Έμ§€λ₯Ό μ²˜λ¦¬ν•©λ‹ˆλ‹€.

ν›ˆλ ¨ μ˜΅μ…˜μ΄λ‚˜ ν•˜μ΄νΌνŒŒλΌλ―Έν„°λ₯Ό μ§€μ •ν•˜λ €λ©΄, [TrainingArguments] ν΄λž˜μŠ€μ—μ„œ 확인 ν•  수 μžˆμŠ΅λ‹ˆλ‹€. 예λ₯Ό λ“€μ–΄, λͺ¨λΈμ„ μ €μž₯ν•  디렉토리λ₯Ό output_dir에 μ •μ˜ν•˜κ³ , ν›ˆλ ¨ 후에 Hub둜 λͺ¨λΈμ„ ν‘Έμ‹œν•˜λ €λ©΄ push_to_hub=True둜 μ„€μ •ν•©λ‹ˆλ‹€.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="your-model",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    push_to_hub=True,
)

Pass training_args to [Trainer] along with a model, a dataset, something to preprocess the dataset with (depending on your data type, this could be a tokenizer, feature extractor, or image processor), a data collator, and a function to compute the metrics you want to track during training.
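
For instance, a compute_metrics function for classification might look like the minimal sketch below (assuming the evaluate library and an accuracy metric; adapt it to your own task):

import numpy as np
import evaluate

# load an accuracy metric from the evaluate library
accuracy = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    # convert raw logits to predicted class ids
    predictions = np.argmax(logits, axis=-1)
    return accuracy.compute(predictions=predictions, references=labels)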

Finally, call [~Trainer.train] to start training!

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()

Checkpoints [[checkpoints]]

[Trainer] ν΄λž˜μŠ€λŠ” [TrainingArguments]의 output_dir λ§€κ°œλ³€μˆ˜μ— μ§€μ •λœ 디렉토리에 λͺ¨λΈ 체크포인트λ₯Ό μ €μž₯ν•©λ‹ˆλ‹€. μ²΄ν¬ν¬μΈνŠΈλŠ” checkpoint-000 ν•˜μœ„ 폴더에 μ €μž₯되며, μ—¬κΈ°μ„œ 끝의 μˆ«μžλŠ” ν›ˆλ ¨ 단계에 ν•΄λ‹Ήν•©λ‹ˆλ‹€. 체크포인트λ₯Ό μ €μž₯ν•˜λ©΄ λ‚˜μ€‘μ— ν›ˆλ ¨μ„ μž¬κ°œν•  λ•Œ μœ μš©ν•©λ‹ˆλ‹€.

# resume from the latest checkpoint
trainer.train(resume_from_checkpoint=True)

# resume from a specific checkpoint saved in the output directory
trainer.train(resume_from_checkpoint="your-model/checkpoint-1000")

To push your checkpoints to the Hub, set push_to_hub=True in [TrainingArguments] to commit and push them. Other options for deciding how checkpoints are saved are set up in the hub_strategy parameter (a minimal sketch follows the list):

  • hub_strategy="checkpoint"λŠ” μ΅œμ‹  체크포인트λ₯Ό "last-checkpoint"λΌλŠ” ν•˜μœ„ 폴더에 ν‘Έμ‹œν•˜μ—¬ ν›ˆλ ¨μ„ μž¬κ°œν•  수 μžˆμŠ΅λ‹ˆλ‹€.
  • hub_strategy="all_checkpoints"λŠ” λͺ¨λ“  체크포인트λ₯Ό output_dir에 μ •μ˜λœ 디렉토리에 ν‘Έμ‹œν•©λ‹ˆλ‹€(λͺ¨λΈ λ¦¬ν¬μ§€ν† λ¦¬μ—μ„œ 폴더당 ν•˜λ‚˜μ˜ 체크포인트λ₯Ό λ³Ό 수 μžˆμŠ΅λ‹ˆλ‹€).

μ²΄ν¬ν¬μΈνŠΈμ—μ„œ ν›ˆλ ¨μ„ μž¬κ°œν•  λ•Œ, [Trainer]λŠ” μ²΄ν¬ν¬μΈνŠΈκ°€ μ €μž₯될 λ•Œμ™€ λ™μΌν•œ Python, NumPy 및 PyTorch RNG μƒνƒœλ₯Ό μœ μ§€ν•˜λ €κ³  ν•©λ‹ˆλ‹€. ν•˜μ§€λ§Œ PyTorchλŠ” κΈ°λ³Έ μ„€μ •μœΌλ‘œ 'μΌκ΄€λœ κ²°κ³Όλ₯Ό 보μž₯ν•˜μ§€ μ•ŠμŒ'으둜 많이 λ˜μ–΄μžˆκΈ° λ•Œλ¬Έμ—, RNG μƒνƒœκ°€ 동일할 것이라고 보μž₯ν•  수 μ—†μŠ΅λ‹ˆλ‹€. λ”°λΌμ„œ, μΌκ΄€λœ κ²°κ³Όκ°€ 보μž₯λ˜λ„λ‘ ν™œμ„±ν™” ν•˜λ €λ©΄, λžœλ€μ„± μ œμ–΄ κ°€μ΄λ“œλ₯Ό μ°Έκ³ ν•˜μ—¬ ν›ˆλ ¨μ„ μ™„μ „νžˆ μΌκ΄€λœ κ²°κ³Όλ₯Ό 보μž₯ 받도둝 λ§Œλ“€κΈ° μœ„ν•΄ ν™œμ„±ν™”ν•  수 μžˆλŠ” ν•­λͺ©μ„ ν™•μΈν•˜μ„Έμš”. λ‹€λ§Œ, νŠΉμ • 섀정을 κ²°μ •μ μœΌλ‘œ λ§Œλ“€λ©΄ ν›ˆλ ¨μ΄ 느렀질 수 μžˆμŠ΅λ‹ˆλ‹€.

Customize the Trainer [[customize-the-trainer]]

[Trainer] ν΄λž˜μŠ€λŠ” μ ‘κ·Όμ„±κ³Ό μš©μ΄μ„±μ„ 염두에 두고 μ„€κ³„λ˜μ—ˆμ§€λ§Œ, 더 λ‹€μ–‘ν•œ κΈ°λŠ₯을 μ›ν•˜λŠ” μ‚¬μš©μžλ“€μ„ μœ„ν•΄ λ‹€μ–‘ν•œ 맞좀 μ„€μ • μ˜΅μ…˜μ„ μ œκ³΅ν•©λ‹ˆλ‹€. [Trainer]의 λ§Žμ€ λ©”μ†Œλ“œλŠ” μ„œλΈŒν΄λž˜μŠ€ν™” 및 μ˜€λ²„λΌμ΄λ“œν•˜μ—¬ μ›ν•˜λŠ” κΈ°λŠ₯을 μ œκ³΅ν•  수 있으며, 이λ₯Ό 톡해 전체 ν›ˆλ ¨ 루프λ₯Ό λ‹€μ‹œ μž‘μ„±ν•  ν•„μš” 없이 μ›ν•˜λŠ” κΈ°λŠ₯을 μΆ”κ°€ν•  수 μžˆμŠ΅λ‹ˆλ‹€. μ΄λŸ¬ν•œ λ©”μ†Œλ“œμ—λŠ” λ‹€μŒμ΄ ν¬ν•¨λ©λ‹ˆλ‹€:

  • [~Trainer.get_train_dataloader] creates the training dataloader.
  • [~Trainer.get_eval_dataloader] creates the evaluation dataloader.
  • [~Trainer.get_test_dataloader] creates the test dataloader.
  • [~Trainer.log] logs information about the various objects that monitor training.
  • [~Trainer.create_optimizer_and_scheduler] creates an optimizer and learning rate scheduler if they weren't passed in the __init__; these can also be customized separately with [~Trainer.create_optimizer] and [~Trainer.create_scheduler] respectively.
  • [~Trainer.compute_loss] computes the loss on a batch of training inputs.
  • [~Trainer.training_step] performs the training step.
  • [~Trainer.prediction_step] performs the prediction and test step.
  • [~Trainer.evaluate] evaluates the model and returns the evaluation metrics.
  • [~Trainer.predict] makes predictions (with metrics if labels are available) on the test set.

For example, if you want to customize the [~Trainer.compute_loss] method to use a weighted loss:

import torch
from torch import nn
from transformers import Trainer

class CustomTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        # forward pass
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # compute custom loss with different weights for the 3 labels
        loss_fct = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 2.0, 3.0], device=model.device))
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
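
You can then instantiate CustomTrainer exactly as you would the stock [Trainer]; a minimal sketch, reusing the training_args, dataset, and tokenizer from the earlier example:

trainer = CustomTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
)
trainer.train()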

Callbacks [[callbacks]]

Another way to customize [Trainer] is with callbacks. Callbacks don't change anything in the training loop itself. They inspect the training loop state and then execute some action (early stopping, logging results, etc.) depending on that state. In other words, a callback can't be used to implement something like a custom loss function; for that, you need to subclass and override the [~Trainer.compute_loss] method.

For example, to add an early stopping callback to the training loop after 10 steps:

from transformers import TrainerCallback

class EarlyStoppingCallback(TrainerCallback):
    def __init__(self, num_steps=10):
        self.num_steps = num_steps
    
    def on_step_end(self, args, state, control, **kwargs):
        if state.global_step >= self.num_steps:
            # flip the flag on the TrainerControl object to signal the Trainer to stop
            control.should_training_stop = True
        return control

그런 λ‹€μŒ, 이λ₯Ό [Trainer]의 callback λ§€κ°œλ³€μˆ˜μ— μ „λ‹¬ν•©λ‹ˆλ‹€.

from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback()],
)

Logging [[logging]]

Check out the logging API reference for more details about the logging API.

[Trainer] is set to logging.INFO by default, which reports errors, warnings, and other basic information. In distributed environments, [Trainer] replicas are set to logging.WARNING, which only reports errors and warnings. You can change the log level with the log_level and log_level_replica parameters in [TrainingArguments].

각 λ…Έλ“œμ˜ 둜그 레벨 섀정을 κ΅¬μ„±ν•˜λ €λ©΄ log_on_each_node λ§€κ°œλ³€μˆ˜λ₯Ό μ‚¬μš©ν•˜μ—¬ 각 λ…Έλ“œμ—μ„œ 둜그 λ ˆλ²¨μ„ μ‚¬μš©ν• μ§€ μ•„λ‹ˆλ©΄ μ£Ό λ…Έλ“œμ—μ„œλ§Œ μ‚¬μš©ν• μ§€ κ²°μ •ν•˜μ„Έμš”.

[Trainer] sets the log level separately for each node in the [Trainer.__init__] method, so if you're using other Transformers functionality before creating the [Trainer] object, consider setting the log level earlier.

For example, to set your main code and modules to use the same log level according to each node:

import logging
import sys

import datasets
import transformers

logger = logging.getLogger(__name__)

logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    datefmt="%m/%d/%Y %H:%M:%S",
    handlers=[logging.StreamHandler(sys.stdout)],
)

log_level = training_args.get_process_log_level()
logger.setLevel(log_level)
datasets.utils.logging.set_verbosity(log_level)
transformers.utils.logging.set_verbosity(log_level)

trainer = Trainer(...)

각 λ…Έλ“œμ—μ„œ 기둝될 λ‚΄μš©μ„ κ΅¬μ„±ν•˜κΈ° μœ„ν•΄ log_levelκ³Ό log_level_replicaλ₯Ό λ‹€μ–‘ν•œ μ‘°ν•©μœΌλ‘œ μ‚¬μš©ν•΄λ³΄μ„Έμš”.

my_app.py ... --log_level warning --log_level_replica error

In a multi-node environment, add the log_on_each_node 0 parameter.

my_app.py ... --log_level warning --log_level_replica error --log_on_each_node 0

# set to only report errors
my_app.py ... --log_level error --log_level_replica error --log_on_each_node 0

NEFTune [[neftune]]

NEFTune is a technique that can improve performance by adding noise to the embedding vectors during training. To enable it in [Trainer], set the neftune_noise_alpha parameter in [TrainingArguments] to control how much noise is added.

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(..., neftune_noise_alpha=0.1)
trainer = Trainer(..., args=training_args)

NEFTune is disabled after training to restore the original embedding layer and avoid any unexpected behavior.

GaLore [[galore]]

Gradient Low-Rank Projection (GaLore) is a memory-efficient low-rank training strategy that allows full-parameter learning while being more memory-efficient than common low-rank adaptation methods such as LoRA.

First, install GaLore's official repository:

pip install galore-torch

그런 λ‹€μŒ optim에 ["galore_adamw", "galore_adafactor", "galore_adamw_8bit"] 쀑 ν•˜λ‚˜μ™€ ν•¨κ»˜ optim_target_modulesλ₯Ό μΆ”κ°€ν•©λ‹ˆλ‹€. μ΄λŠ” μ μš©ν•˜λ €λŠ” λŒ€μƒ λͺ¨λ“ˆ 이름에 ν•΄λ‹Ήν•˜λŠ” λ¬Έμžμ—΄, μ •κ·œ ν‘œν˜„μ‹ λ˜λŠ” 전체 경둜의 λͺ©λ‘μΌ 수 μžˆμŠ΅λ‹ˆλ‹€. μ•„λž˜λŠ” end-to-end 예제 μŠ€ν¬λ¦½νŠΈμž…λ‹ˆλ‹€(ν•„μš”ν•œ 경우 pip install trl datasetsλ₯Ό μ‹€ν–‰):

import torch
import datasets
import trl

from transformers import TrainingArguments, AutoConfig, AutoTokenizer, AutoModelForCausalLM

train_dataset = datasets.load_dataset('imdb', split='train')

args = TrainingArguments(
    output_dir="./test-galore",
    max_steps=100,
    per_device_train_batch_size=2,
    optim="galore_adamw",
    optim_target_modules=["attn", "mlp"]
)

model_id = "google/gemma-2b"

config = AutoConfig.from_pretrained(model_id)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_config(config).to(0)

trainer = trl.SFTTrainer(
    model=model, 
    args=args,
    train_dataset=train_dataset,
    dataset_text_field='text',
    max_seq_length=512,
)

trainer.train()

To pass extra arguments supported by GaLore, set optim_args. For example:

import torch
import datasets
import trl

from transformers import TrainingArguments, AutoConfig, AutoTokenizer, AutoModelForCausalLM

train_dataset = datasets.load_dataset('imdb', split='train')

args = TrainingArguments(
    output_dir="./test-galore",
    max_steps=100,
    per_device_train_batch_size=2,
    optim="galore_adamw",
    optim_target_modules=["attn", "mlp"],
    optim_args="rank=64, update_proj_gap=100, scale=0.10",
)

model_id = "google/gemma-2b"

config = AutoConfig.from_pretrained(model_id)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_config(config).to(0)

trainer = trl.SFTTrainer(
    model=model, 
    args=args,
    train_dataset=train_dataset,
    dataset_text_field='text',
    max_seq_length=512,
)

trainer.train()

You can read more about the method in the original repository or the paper.

Currently, only Linear layers that are considered GaLore layers can be trained this way; they are trained with low-rank decomposition, while the remaining layers are optimized in the conventional manner.
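
If you'd rather not enumerate module names yourself, recent versions of transformers also accept the special value "all-linear" for optim_target_modules (an assumption to verify against your installed version) to target every Linear layer:

args = TrainingArguments(
    output_dir="./test-galore",
    max_steps=100,
    per_device_train_batch_size=2,
    optim="galore_adamw",
    optim_target_modules="all-linear",  # apply GaLore to every Linear layer
)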

Note that it takes a bit of time before training starts (~3 minutes for a 2B model on an NVIDIA A100), but training should go smoothly afterwards.

λ‹€μŒκ³Ό 같이 μ˜΅ν‹°λ§ˆμ΄μ € 이름에 layerwiseλ₯Ό μΆ”κ°€ν•˜μ—¬ λ ˆμ΄μ–΄λ³„ μ΅œμ ν™”λ₯Ό μˆ˜ν–‰ν•  μˆ˜λ„ μžˆμŠ΅λ‹ˆλ‹€:

import torch
import datasets
import trl

from transformers import TrainingArguments, AutoConfig, AutoTokenizer, AutoModelForCausalLM

train_dataset = datasets.load_dataset('imdb', split='train')

args = TrainingArguments(
    output_dir="./test-galore",
    max_steps=100,
    per_device_train_batch_size=2,
    optim="galore_adamw_layerwise",
    optim_target_modules=["attn", "mlp"]
)

model_id = "google/gemma-2b"

config = AutoConfig.from_pretrained(model_id)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_config(config).to(0)

trainer = trl.SFTTrainer(
    model=model, 
    args=args,
    train_dataset=train_dataset,
    dataset_text_field='text',
    max_seq_length=512,
)

trainer.train()

λ ˆμ΄μ–΄λ³„ μ΅œμ ν™”λŠ” λ‹€μ†Œ μ‹€ν—˜μ μ΄λ©° DDP(λΆ„μ‚° 데이터 병렬)λ₯Ό μ§€μ›ν•˜μ§€ μ•ŠμœΌλ―€λ‘œ, 단일 GPUμ—μ„œλ§Œ ν›ˆλ ¨ 슀크립트λ₯Ό μ‹€ν–‰ν•  수 μžˆμŠ΅λ‹ˆλ‹€. μžμ„Έν•œ λ‚΄μš©μ€ 이 λ¬Έμ„œλ₯Όμ„ μ°Έμ‘°ν•˜μ„Έμš”. gradient clipping, DeepSpeed λ“± λ‹€λ₯Έ κΈ°λŠ₯은 기본적으둜 μ§€μ›λ˜μ§€ μ•Šμ„ 수 μžˆμŠ΅λ‹ˆλ‹€. μ΄λŸ¬ν•œ λ¬Έμ œκ°€ λ°œμƒν•˜λ©΄ GitHub에 이슈λ₯Ό μ˜¬λ €μ£Όμ„Έμš”.

LOMO μ˜΅ν‹°λ§ˆμ΄μ € [[lomo-optimizer]]

LOMO μ˜΅ν‹°λ§ˆμ΄μ €λŠ” μ œν•œλœ μžμ›μœΌλ‘œ λŒ€ν˜• μ–Έμ–΄ λͺ¨λΈμ˜ 전체 λ§€κ°œλ³€μˆ˜ λ―Έμ„Έ μ‘°μ •κ³Ό μ μ‘ν˜• ν•™μŠ΅λ₯ μ„ ν†΅ν•œ μ €λ©”λͺ¨λ¦¬ μ΅œμ ν™”(AdaLomo)μ—μ„œ λ„μž…λ˜μ—ˆμŠ΅λ‹ˆλ‹€. 이듀은 λͺ¨λ‘ 효율적인 전체 λ§€κ°œλ³€μˆ˜ λ―Έμ„Έ μ‘°μ • λ°©λ²•μœΌλ‘œ κ΅¬μ„±λ˜μ–΄ μžˆμŠ΅λ‹ˆλ‹€. μ΄λŸ¬ν•œ μ˜΅ν‹°λ§ˆμ΄μ €λ“€μ€ λ©”λͺ¨λ¦¬ μ‚¬μš©λŸ‰μ„ 쀄이기 μœ„ν•΄ κ·Έλ ˆμ΄λ””μ–ΈνŠΈ 계산과 λ§€κ°œλ³€μˆ˜ μ—…λ°μ΄νŠΈλ₯Ό ν•˜λ‚˜μ˜ λ‹¨κ³„λ‘œ μœ΅ν•©ν•©λ‹ˆλ‹€. LOMOμ—μ„œ μ§€μ›λ˜λŠ” μ˜΅ν‹°λ§ˆμ΄μ €λŠ” "lomo"와 "adalomo"μž…λ‹ˆλ‹€. λ¨Όμ € pypiμ—μ„œ pip install lomo-optimλ₯Ό 톡해 lomoλ₯Ό μ„€μΉ˜ν•˜κ±°λ‚˜, GitHub μ†ŒμŠ€μ—μ„œ pip install git+https://github.com/OpenLMLab/LOMO.git둜 μ„€μΉ˜ν•˜μ„Έμš”.

μ €μžμ— λ”°λ₯΄λ©΄, grad_norm 없이 AdaLomoλ₯Ό μ‚¬μš©ν•˜λŠ” 것이 더 λ‚˜μ€ μ„±λŠ₯κ³Ό 높은 μ²˜λ¦¬λŸ‰μ„ μ œκ³΅ν•œλ‹€κ³  ν•©λ‹ˆλ‹€.

λ‹€μŒμ€ IMDB λ°μ΄ν„°μ…‹μ—μ„œ google/gemma-2bλ₯Ό μ΅œλŒ€ μ •λ°€λ„λ‘œ λ―Έμ„Έ μ‘°μ •ν•˜λŠ” κ°„λ‹¨ν•œ μŠ€ν¬λ¦½νŠΈμž…λ‹ˆλ‹€:

import torch
import datasets
from transformers import TrainingArguments, AutoTokenizer, AutoModelForCausalLM
import trl

train_dataset = datasets.load_dataset('imdb', split='train')

args = TrainingArguments(
    output_dir="./test-lomo",
    max_steps=1000,
    per_device_train_batch_size=4,
    optim="adalomo",
    gradient_checkpointing=True,
    logging_strategy="steps",
    logging_steps=1,
    learning_rate=2e-6,
    save_strategy="no",
    run_name="lomo-imdb",
)

model_id = "google/gemma-2b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, low_cpu_mem_usage=True).to(0)

trainer = trl.SFTTrainer(
    model=model, 
    args=args,
    train_dataset=train_dataset,
    dataset_text_field='text',
    max_seq_length=1024,
)

trainer.train()

Accelerate and Trainer [[accelerate-and-trainer]]

[Trainer] ν΄λž˜μŠ€λŠ” Accelerate둜 κ΅¬λ™λ˜λ©°, μ΄λŠ” FullyShardedDataParallel (FSDP) 및 DeepSpeed와 같은 톡합을 μ§€μ›ν•˜λŠ” λΆ„μ‚° ν™˜κ²½μ—μ„œ PyTorch λͺ¨λΈμ„ μ‰½κ²Œ ν›ˆλ ¨ν•  수 μžˆλŠ” λΌμ΄λΈŒλŸ¬λ¦¬μž…λ‹ˆλ‹€.

Learn more about FSDP sharding strategies, CPU offloading, and other features you can use with [Trainer] in the Fully Sharded Data Parallel guide.

To use Accelerate with [Trainer], run the accelerate config command to set up your training environment. This command creates a config_file.yaml that is used when you launch your training script. The examples below show some of the configurations you can set up, for DistributedDataParallel, FSDP, DeepSpeed with a config file, and DeepSpeed with the Accelerate plugin; each block is a separate config_file.yaml.

# DistributedDataParallel
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: all
machine_rank: 0 # change rank as per the node
main_process_ip: 192.168.20.1
main_process_port: 9898
main_training_function: main
mixed_precision: fp16
num_machines: 2
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

# FSDP
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
downcast_bf16: 'no'
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch_policy: BACKWARD_PRE
  fsdp_forward_prefetch: true
  fsdp_offload_params: false
  fsdp_sharding_strategy: 1
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_transformer_layer_cls_to_wrap: BertLayer
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 2
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

# DeepSpeed (with a DeepSpeed config file)
compute_environment: LOCAL_MACHINE
deepspeed_config:
  deepspeed_config_file: /home/user/configs/ds_zero3_config.json
  zero3_init_flag: true
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

# DeepSpeed with the Accelerate plugin
compute_environment: LOCAL_MACHINE
deepspeed_config:
  gradient_accumulation_steps: 1
  gradient_clipping: 0.7
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero_stage: 2
distributed_type: DEEPSPEED
downcast_bf16: 'no'
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 4
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

The accelerate launch command is the recommended way to launch your training script on a distributed system with Accelerate and [Trainer], using the parameters specified in config_file.yaml. This file is saved to the Accelerate cache folder and automatically loaded when you run accelerate launch.

For example, to run the run_glue.py training script with the FSDP configuration:

accelerate launch \
    ./examples/pytorch/text-classification/run_glue.py \
    --model_name_or_path google-bert/bert-base-cased \
    --task_name $TASK_NAME \
    --do_train \
    --do_eval \
    --max_seq_length 128 \
    --per_device_train_batch_size 16 \
    --learning_rate 5e-5 \
    --num_train_epochs 3 \
    --output_dir /tmp/$TASK_NAME/ \
    --overwrite_output_dir

You could also specify the parameters from the config_file.yaml file directly on the command line:

accelerate launch --num_processes=2 \
    --use_fsdp \
    --mixed_precision=bf16 \
    --fsdp_auto_wrap_policy=TRANSFORMER_BASED_WRAP  \
    --fsdp_transformer_layer_cls_to_wrap="BertLayer" \
    --fsdp_sharding_strategy=1 \
    --fsdp_state_dict_type=FULL_STATE_DICT \
    ./examples/pytorch/text-classification/run_glue.py \
    --model_name_or_path google-bert/bert-base-cased \
    --task_name $TASK_NAME \
    --do_train \
    --do_eval \
    --max_seq_length 128 \
    --per_device_train_batch_size 16 \
    --learning_rate 5e-5 \
    --num_train_epochs 3 \
    --output_dir /tmp/$TASK_NAME/ \
    --overwrite_output_dir

Check out the Launching your Accelerate scripts tutorial to learn more about accelerate launch and custom configurations.