This model is a fine-tuned version of HuggingFaceTB/SmolLM-135M-Instruct, trained on the HumanLLMs/Human-Like-DPO-Dataset using TRL.
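Since the dataset is a DPO (Direct Preference Optimization) preference dataset, the fine-tuning objective is presumably the DPO loss. As a minimal, pure-Python sketch (the function name and the example log-probabilities below are illustrative, not taken from this training run):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (chosen margin - rejected margin)).

    Each argument is the summed log-probability of a response under the
    policy being trained or the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# No preference yet: policy matches the reference, loss = -log(0.5).
loss_neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)

# Policy now favors the chosen response relative to the reference: loss drops.
loss_aligned = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

In practice TRL's trainer computes these log-probabilities from the model and a frozen reference copy and averages this loss over the batch; the sketch only shows the scalar objective being minimized.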
Training results (final `TrainOutput`):

- global steps: 2448
- epochs: 4.0
- training loss: 0.0269
- train runtime: 819.23 s
- train samples/second: 47.801
- train steps/second: 2.988
- total FLOPs: 0.0 (not recorded)
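The reported throughput numbers are internally consistent, and an effective batch size can be inferred from them (this is a derived estimate, not a value stated in the training config):

```python
# Values copied from the reported TrainOutput above.
global_step = 2448
train_runtime_s = 819.2334
samples_per_second = 47.801
steps_per_second = 2.988
epochs = 4.0

# steps/second should match global_step / runtime.
implied_steps_per_second = global_step / train_runtime_s  # ~2.988

# samples per step implies an effective batch size
# (per-device batch size x gradient accumulation x device count).
effective_batch_size = samples_per_second / steps_per_second  # ~16

# Steps per epoch times batch size approximates the examples seen per epoch.
examples_per_epoch = (global_step / epochs) * effective_batch_size  # ~9.8k
```

This suggests roughly 16 samples per optimizer step and on the order of 10k preference pairs per epoch, in line with the size of the Human-Like-DPO-Dataset.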
[More Information Needed]