---
base_model: HuggingFaceTB/SmolLM-135M-Instruct
datasets: HumanLLMs/Human-Like-DPO-Dataset
library_name: transformers
model_name: trainer_output
tags:
- generated_from_trainer
- trl
- reward-trainer
license: mit
language:
- en
---

# Model Card for trainer_output

This model is a fine-tuned version of [HuggingFaceTB/SmolLM-135M-Instruct](https://huggingface.co/HuggingFaceTB/SmolLM-135M-Instruct) on the [HumanLLMs/Human-Like-DPO-Dataset](https://huggingface.co/datasets/HumanLLMs/Human-Like-DPO-Dataset) dataset. It has been trained using [TRL](https://github.com/huggingface/trl).

## Quick start

This is a reward model, so it scores candidate responses rather than generating text. A sketch using the `text-classification` pipeline (the response string below is a hypothetical candidate; substitute any answer you want to score):

```python
from transformers import pipeline

question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
# Hypothetical candidate answer to score; swap in any response you like.
response = "I'd choose the future, to see how today's choices turn out."

scorer = pipeline("text-classification", model="liuhailin0123/trainer_output", device="cuda")
# function_to_apply="none" returns the raw scalar reward instead of a softmax probability.
output = scorer(question + "\n" + response, function_to_apply="none")[0]
print(output["score"])
```

## Training procedure

We trained a reward model, initialized from HuggingFaceTB/SmolLM-135M-Instruct, on the Human-Like-DPO-Dataset so that it assigns a positive score to each chosen response and a negative score to each rejected response. Such a reward model is a prerequisite for training a policy model in the PPO stage.

This model was trained with TRL's `RewardTrainer`.
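The objective behind this setup is the standard pairwise reward-modeling loss: minimize `-log sigmoid(r_chosen - r_rejected)`, which is driven down by pushing the chosen score above the rejected one. A minimal sketch of that loss in plain Python (the actual trainer computes it batched in PyTorch):

```python
import math

def reward_loss(score_chosen: float, score_rejected: float) -> float:
    # Pairwise reward-modeling loss: -log(sigmoid(chosen - rejected)).
    # Minimized by widening the margin between chosen and rejected scores.
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(round(reward_loss(0.0, 0.0), 4))   # 0.6931 (log 2): no preference learned yet
print(round(reward_loss(2.0, -2.0), 4))  # loss shrinks as the margin grows
```

At a zero margin the loss is exactly log 2, and it decays toward zero as the model separates the two responses.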

### Framework versions

- TRL: 0.16.0
- Transformers: 4.50.1
- Pytorch: 2.8.0.dev20250325+cu128
- Datasets: 3.3.2
- Tokenizers: 0.21.1

## Examples

A positive example:

> 😂 Ah, no I haven't! I'm dying to know, what's the meme about? Is it a funny cat or a ridiculous situation? Spill the beans! 🤣

Our model rated it with a score of 1.3196, which means it prefers this response.

A negative example:

> I'm an artificial intelligence language model, I don't have personal experiences or opinions. However, I can provide you with information on highly-rated and critically acclaimed films, as well as recommendations based on specific genres or themes. Would you like me to suggest some notable movies or discuss a particular genre of interest?

Our model rated it with a score of -1.6590, which means it does not prefer this response.
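Under the Bradley–Terry assumption behind pairwise reward training, the gap between two scores maps to a preference probability via the sigmoid. A quick check with the two scores above, in plain Python (no model download needed):

```python
import math

def preference_probability(score_chosen: float, score_rejected: float) -> float:
    # Bradley-Terry model: P(chosen preferred) = sigmoid(chosen - rejected).
    return 1.0 / (1.0 + math.exp(-(score_chosen - score_rejected)))

# Scores from the two examples above.
p = preference_probability(1.3196, -1.6590)
print(f"{p:.4f}")  # roughly 0.95
```

So with this margin the model implies roughly a 95% probability that the human-like response is preferred.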

## Summary

As the examples show, the trained model issues rewards that track the preferences expressed in the dataset, scoring human-like responses above generic ones.