Qwen2.5-7B-SafeRLHF-RM

Overview

Qwen2.5-7B-SafeRLHF-RM is a Reward Model (RM) trained to score the helpfulness of language model responses. It is built on the Qwen2.5-7B-Instruct model and fine-tuned on the PKU-SafeRLHF dataset.

What is a Reward Model?

A Reward Model (RM) is a type of model that assigns scores to text based on human preferences. This particular RM assigns higher scores to more helpful responses and lower scores to less helpful ones.

Key Features

  • Base Model: Qwen2.5-7B-Instruct
  • Training Dataset: PKU-SafeRLHF
  • Training Method: Fine-tuned with pairwise comparison data
  • Task: Helpfulness assessment
  • Output: Raw logit score (higher = more helpful)
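The card says the model was fine-tuned with pairwise comparison data but does not state the loss. Pairwise reward models are commonly trained with a Bradley-Terry style objective; a minimal sketch of that objective (an assumption, not this model's published recipe), using the scores from the example output below:

```python
import math

def pairwise_reward_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry style pairwise loss: -log(sigmoid(r_chosen - r_rejected)).

    The loss is small when the chosen (more helpful) response scores higher
    than the rejected one, and grows quickly if the ordering is flipped.
    """
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Correctly ordered pair -> small loss
low = pairwise_reward_loss(4.5678, 1.2345)
# Flipped pair -> large loss
high = pairwise_reward_loss(1.2345, 4.5678)
print(f"{low:.4f} < {high:.4f}")
```

Minimizing this loss pushes the single-logit head to rank preferred responses above rejected ones, which is why the raw scores are only meaningful relative to each other.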

How to Use

Please refer to Amo for further details.
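The card itself does not include usage code. The sketch below shows one plausible way to score a prompt/response pair with Hugging Face transformers, assuming the model loads as a sequence-classification head with a single logit; the Hub ID is taken from this card, but the loading details are an assumption, not a documented API for this checkpoint:

```python
# Illustrative sketch -- the actual loading code for this model may differ.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "Rihong/Qwen2.5-7B-SafeRLHF-RM"  # Hub ID assumed from this card

def helpfulness_score(prompt: str, response: str, model, tokenizer) -> float:
    """Return the raw helpfulness logit for a prompt/response pair."""
    messages = [
        {"role": "user", "content": prompt},
        {"role": "assistant", "content": response},
    ]
    # Format the pair with the model's chat template before tokenizing.
    text = tokenizer.apply_chat_template(messages, tokenize=False)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        # Single-logit head: higher = more helpful.
        return model(**inputs).logits[0].item()

# Usage (downloads ~7B parameters of weights):
#   tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
#   model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=1)
#   print(helpfulness_score("How to make a cake?", "I don't know.", model, tokenizer))
```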

Score Interpretation

  • Higher scores indicate more helpful responses
  • Lower scores indicate less helpful responses
  • Raw scores can span a wide range
  • For more interpretable scores, use calibration
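The card recommends calibration but does not specify a method. One simple, commonly used choice (an assumption here, not this model's documented procedure) is a temperature-scaled sigmoid that maps raw logits into (0, 1):

```python
import math

def calibrate(raw_score: float, temperature: float = 1.0) -> float:
    """Map a raw reward logit to (0, 1) with a temperature-scaled sigmoid.

    temperature > 1 flattens the mapping, < 1 sharpens it; 1.0 is an
    arbitrary default, not a value published for this model.
    """
    return 1.0 / (1.0 + math.exp(-raw_score / temperature))

# The two raw scores from the example output below become comparable
# probabilities on a common 0-1 scale:
print(round(calibrate(4.5678), 4))  # helpful response
print(round(calibrate(1.2345), 4))  # less helpful response
```

Calibrated scores are easier to threshold or compare across prompts, since the raw logits are unbounded.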

Example Output

===== Model Evaluation Results =====
Input Prompt: How to make a cake?
Input Response: To make a cake, you'll need flour, sugar, eggs, and butter. Mix them together and bake at 350°F for 30 minutes.

Helpful score: 4.5678  # High score = helpful

Input Prompt: How to make a cake?
Input Response: I don't know.

Helpful score: 1.2345  # Low score = less helpful