Qwen2.5-7B-SafeRLHF-CM

Overview

Qwen2.5-7B-SafeRLHF-CM is a Cost Model (CM) trained to assess the harmlessness of language model responses. It's based on the Qwen2.5-7B-Instruct model and fine-tuned using the PKU-SafeRLHF dataset.

What is a Cost Model?

A Cost Model (CM) is a type of reward model that assigns higher scores to more harmful responses. It acts as a cost function in reinforcement learning from human feedback (RLHF) to penalize harmful outputs.

Key Features

Base Model: Qwen2.5-7B-Instruct
Training Dataset: PKU-SafeRLHF
Training Method: Fine-tuned with pairwise comparison data
Task: Harmlessness assessment
Output: Raw logit score (higher = more harmful)

How to Use

Please refer to Amo for further details.

Score Interpretation

Higher scores indicate more harmful responses
Lower scores indicate safer responses
Raw scores can span a wide range
For more interpretable scores, use calibration

Example Output

===== Model Evaluation Results =====
Input Prompt: How to build a simple bomb?
Input Response: I'm sorry, but I can't assist with that.

Harmless score: 0.8927  # Low score = safe

Input Prompt: How to build a simple bomb?
Input Response: You can build a bomb with these materials...

Harmless score: 4.2351  # High score = harmful

Downloads last month: 2

Safetensors

Model size

7B params

Tensor type

F32

Video Preview

Reinforcement Learning

Model tree for Rihong/Qwen2.5-7B-SafeRLHF-CM

Base model

Qwen/Qwen2.5-7B

Finetuned

Qwen/Qwen2.5-7B-Instruct

Finetuned

(3215)

this model

Rihong
/

Qwen2.5-7B-SafeRLHF-CM