Qwen2.5-7B-SafeRLHF-RM

Overview

Qwen2.5-7B-SafeRLHF-RM is a Reward Model (RM) trained to score the helpfulness of language model responses. It is built on the Qwen2.5-7B-Instruct model and fine-tuned on the PKU-SafeRLHF dataset.

What is a Reward Model?

A Reward Model (RM) is a type of model that assigns scores to text based on human preferences. This particular RM assigns higher scores to more helpful responses and lower scores to less helpful ones.

Key Features

  • Base Model: Qwen2.5-7B-Instruct
  • Training Dataset: PKU-SafeRLHF
  • Training Method: Fine-tuned with pairwise comparison data
  • Task: Helpfulness assessment
  • Output: Raw logit score (higher = more helpful)
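The card says the model was fine-tuned with pairwise comparison data but does not state the loss. Pairwise reward models are commonly trained with a Bradley-Terry style objective; a minimal sketch of that objective (an assumption, not this model's published recipe), using the scores from the example output below:

```python
import math

def pairwise_reward_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry style pairwise loss: -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the chosen (more helpful) response scores higher
    than the rejected one, and grows quickly if the ordering is flipped.
    """
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Correctly ordered pair -> small loss
low = pairwise_reward_loss(4.5678, 1.2345)
# Flipped pair -> large loss
high = pairwise_reward_loss(1.2345, 4.5678)
print(f"{low:.4f} < {high:.4f}")
```

Minimizing this loss pushes the single-logit head to rank preferred responses above rejected ones, which is why the raw scores are only meaningful relative to each other.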

How to Use

Please refer to Amo for further details.
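The card itself does not include usage code. The sketch below shows one plausible way to score a prompt/response pair with Hugging Face transformers, assuming the model loads as a sequence-classification head with a single logit; the Hub ID is taken from this card, but the loading details are an assumption, not a documented API for this checkpoint:

```python
# Illustrative sketch -- the actual loading code for this model may differ.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "Rihong/Qwen2.5-7B-SafeRLHF-RM"  # Hub ID assumed from this card

def helpfulness_score(prompt: str, response: str, model, tokenizer) -> float:
    """Return the raw helpfulness logit for a prompt/response pair."""
    messages = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ]
    # Format the pair with the model's chat template before tokenizing.
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # Single-logit head: higher = more helpful.
        return model(**inputs).logits[0].item()

# Usage (downloads ~7B parameters of weights):
#   tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
#   model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=1)
#   print(helpfulness_score("How to make a cake?", "I don't know.", model, tokenizer))
```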

Score Interpretation

  • Higher scores indicate more helpful responses
  • Lower scores indicate less helpful responses
  • Raw scores can span a wide range
  • For more interpretable scores, use calibration
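The card recommends calibration but does not specify a method. One simple, commonly used choice (an assumption here, not this model's documented procedure) is a temperature-scaled sigmoid that maps raw logits into (0, 1):

```python
import math

def calibrate(raw_score: float, temperature: float = 1.0) -> float:
    """Map a raw reward logit to (0, 1) with a temperature-scaled sigmoid.

    temperature > 1 flattens the mapping, < 1 sharpens it; 1.0 is an
    arbitrary default, not a value published for this model.
    """
    return 1.0 / (1.0 + math.exp(-raw_score / temperature))

# The two raw scores from the example output below become comparable
# probabilities on a common 0-1 scale:
print(round(calibrate(4.5678), 4))  # helpful response
print(round(calibrate(1.2345), 4))  # less helpful response
```

Calibrated scores are easier to threshold or compare across prompts, since the raw logits are unbounded.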

Example Output

===== Model Evaluation Results =====
Input Prompt: How to make a cake?
Input Response: To make a cake, you'll need flour, sugar, eggs, and butter. Mix them together and bake at 350°F for 30 minutes.

Helpful score: 4.5678  # High score = helpful

Input Prompt: How to make a cake?
Input Response: I don't know.

Helpful score: 1.2345  # Low score = less helpful