Qwen2.5-7B-SafeRLHF-RM
Overview
Qwen2.5-7B-SafeRLHF-RM is a Reward Model (RM) trained to assess the helpfulness of language model responses. It is based on Qwen2.5-7B-Instruct and fine-tuned on the PKU-SafeRLHF dataset.
What is a Reward Model?
A Reward Model (RM) is a type of model that assigns scores to text based on human preferences. This particular RM assigns higher scores to more helpful responses and lower scores to less helpful ones.
Key Features
- Base Model: Qwen2.5-7B-Instruct
- Training Dataset: PKU-SafeRLHF
- Training Method: Fine-tuned with pairwise comparison data
- Task: Helpfulness assessment
- Output: Raw logit score (higher = more helpful)
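The pairwise comparison training mentioned above is typically implemented as a Bradley-Terry style objective; the sketch below is a generic illustration of that loss, not the actual training code for this model:

```python
import math

def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).

    The loss is small when the chosen (more helpful) response
    out-scores the rejected one by a wide margin, and large when
    the ranking is inverted.
    """
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A larger margin in favor of the chosen response gives a smaller loss.
print(pairwise_loss(2.0, 0.0))   # ~0.1269
print(pairwise_loss(0.0, 2.0))   # ~2.1269
```

Minimizing this loss pushes the raw logit for the preferred response above the logit for the rejected one, which is why higher output scores correspond to more helpful responses.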
How to Use
Please refer to Amo for further details.
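In the absence of concrete usage code in the card, here is a hedged sketch of how such an RM is commonly queried with the `transformers` library. The hub id, the chat formatting in `format_input`, and the single-logit sequence-classification head are all assumptions, not confirmed details of this model:

```python
MODEL_ID = "Qwen2.5-7B-SafeRLHF-RM"  # assumed hub id; check the actual model page

def format_input(prompt: str, response: str) -> str:
    """Pack a prompt/response pair into one chat-style string (assumed template)."""
    return f"User: {prompt}\nAssistant: {response}"

def score(prompt: str, response: str, model, tokenizer) -> float:
    """Return the raw helpfulness logit for a prompt/response pair."""
    import torch  # imported lazily so the pure helpers above work without torch
    inputs = tokenizer(format_input(prompt, response), return_tensors="pt")
    with torch.no_grad():
        # Assumes a single-logit reward head: higher = more helpful.
        return model(**inputs).logits.squeeze().item()

if __name__ == "__main__":
    from transformers import AutoModelForSequenceClassification, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=1)
    print(score("How to make a cake?", "Mix flour, sugar, eggs, and butter, then bake.",
                model, tokenizer))
```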
Score Interpretation
- Higher scores indicate more helpful responses
- Lower scores indicate less helpful responses
- Raw scores are uncalibrated logits and can span a wide range
- For more interpretable scores, apply calibration
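The card does not specify a calibration method, but a minimal sketch is to pass the difference between two raw scores through a sigmoid, yielding the probability that one response is preferred over the other:

```python
import math

def preference_probability(score_a: float, score_b: float) -> float:
    """Probability that response A is preferred over response B,
    obtained by passing the raw-score difference through a sigmoid.
    A difference of 0 maps to 0.5 (no preference)."""
    return 1.0 / (1.0 + math.exp(-(score_a - score_b)))

# Using the example scores shown on this card (4.5678 vs. 1.2345):
print(preference_probability(4.5678, 1.2345))  # ~0.966
```

Unlike the raw logits, this value always lies in (0, 1), which makes scores comparable across prompts.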
Example Output
```text
===== Model Evaluation Results =====
Input Prompt: How to make a cake?
Input Response: To make a cake, you'll need flour, sugar, eggs, and butter. Mix them together and bake at 350°F for 30 minutes.
Helpful score: 4.5678   # High score = helpful

Input Prompt: How to make a cake?
Input Response: I don't know.
Helpful score: 1.2345   # Low score = less helpful
```