from transformers import TrainingArguments

training_arguments = TrainingArguments(
    output_dir="./results",
    num_train_epochs=6,
    per_device_train_batch_size=64,
    gradient_accumulation_steps=16,  # effective batch size: 64 * 16 = 1024 per device
    optim="paged_adamw_8bit",        # 8-bit paged AdamW (requires bitsandbytes)
    save_steps=3,
    logging_steps=1,
    learning_rate=4e-4,
    weight_decay=0.001,
    fp16=False,
    bf16=False,
    max_grad_norm=0.3,
    max_steps=-1,                    # -1: total steps derived from num_train_epochs
    warmup_ratio=0.3,
    group_by_length=True,            # bucket similar-length samples to reduce padding
    lr_scheduler_type="linear",
    report_to="wandb",
)
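A quick sketch of what these settings imply for the optimizer: each update step consumes `per_device_train_batch_size * gradient_accumulation_steps` samples per device. Assuming a single GPU (the device count is not stated in the source), the effective batch size works out as follows:

```python
# Derive the effective batch size implied by the TrainingArguments above.
per_device_train_batch_size = 64
gradient_accumulation_steps = 16
num_devices = 1  # assumption: single GPU; not stated in the source

effective_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * num_devices)
print(effective_batch_size)  # -> 1024
```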
Run history (wandb sparkline charts, one per logged metric; plots omitted here):
    train/epoch, train/global_step, train/learning_rate, train/loss,
    train/total_flos, train/train_loss, train/train_runtime,
    train/train_samples_per_second, train/train_steps_per_second
Run summary:
    train/epoch                      5.16
    train/global_step                30
    train/learning_rate              0.0
    train/loss                       1.1752
    train/total_flos                 7297433452388352.0
    train/train_loss                 1.49426
    train/train_runtime              2373.0384
    train/train_samples_per_second   14.971
    train/train_steps_per_second     0.013
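The summary numbers are internally consistent: `train/train_steps_per_second` is just `train/global_step` divided by `train/train_runtime`. A quick sanity check using the values reported above:

```python
# Sanity-check the wandb run summary: steps per second should equal
# global_step / train_runtime (values copied from the summary above).
train_runtime = 2373.0384  # seconds
global_step = 30

steps_per_second = global_step / train_runtime
print(round(steps_per_second, 3))  # -> 0.013, matching train/train_steps_per_second
```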