Spaces: hajimemat/glaive-7b-training
Hajime MATSUMOTO committed 15 days ago

Commit 113833d (1 parent: c179929)

L40S optimization: batch 8, disable gradient checkpointing, parallel dataloader

Files changed (1)

train.py

@@ -237,10 +237,10 @@ training_args = TrainingArguments(
     num_train_epochs=1,
     max_steps=-1,  # -1 = epoch-based
-    # Batch size (speed-up settings)
-    per_device_train_batch_size=4,
-    per_device_eval_batch_size=4,
-    gradient_accumulation_steps=4,  # effective batch size: 4*4=16
+    # Batch size (L40S 48GB - aggressive settings)
+    per_device_train_batch_size=8,
+    per_device_eval_batch_size=8,
+    gradient_accumulation_steps=2,  # effective batch size: 8*2=16
     # Learning rate (set high so training converges in 1 epoch)
     learning_rate=2e-4,
@@ -264,7 +264,10 @@ training_args = TrainingArguments(
     # Miscellaneous
     report_to="none",
     group_by_length=True,
-    gradient_checkpointing=True,
+    gradient_checkpointing=False,  # the L40S has 48GB, so turn this off for speed
+    torch_compile=False,  # avoid the initial compile time
+    dataloader_num_workers=4,
+    dataloader_pin_memory=True,
     # For resuming
     save_safetensors=True,
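
The batch-size change in this commit keeps the effective batch size at 16 while halving the number of gradient-accumulation micro-steps. The arithmetic can be sketched as below. This is only an illustration, not code from train.py: the `BatchConfig` helper, the single-GPU assumption, and the 10,000-example dataset size are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchConfig:
    """Hypothetical container mirroring two TrainingArguments fields."""

    per_device_train_batch_size: int
    gradient_accumulation_steps: int
    num_devices: int = 1  # assumption: a single L40S

    @property
    def effective_batch_size(self) -> int:
        # Number of examples that contribute to each optimizer update.
        return (
            self.per_device_train_batch_size
            * self.gradient_accumulation_steps
            * self.num_devices
        )

    def micro_batches_per_epoch(self, num_examples: int) -> int:
        # Forward/backward passes each device runs to cover the dataset once.
        per_pass = self.per_device_train_batch_size * self.num_devices
        return math.ceil(num_examples / per_pass)


# Values before and after this commit.
old = BatchConfig(per_device_train_batch_size=4, gradient_accumulation_steps=4)
new = BatchConfig(per_device_train_batch_size=8, gradient_accumulation_steps=2)

NUM_EXAMPLES = 10_000  # hypothetical dataset size, for illustration only

for name, cfg in (("old", old), ("new", new)):
    print(name, cfg.effective_batch_size, cfg.micro_batches_per_epoch(NUM_EXAMPLES))
# old 16 2500
# new 16 1250
```

Because the effective batch size is unchanged, the learning rate of 2e-4 should not need retuning on that account. Any speed-up would come from running fewer, larger passes, at the cost of more memory per pass. Disabling gradient checkpointing raises that memory cost further.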