RealSafe
/

RealSafe-R1-7B

Text Generation

text-generation-inference

Model card Files Files and versions

RealSafe-R1-7B / README.md

zycheiheihei's picture

Update README.md

e604e33 verified 11 months ago

|

3.37 kB

	---
	library_name: transformers
	license: mit
	language:
	- en
	- zh
	base_model:
	- deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
	tags:
	- safe
	---

	# RealSafe-R1-7B

	## Overview / 综述

	RealSafe-R1-7B is a safety-enhanced variant of [DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B), developed to improve robustness against malicious queries, especially jailbreak attacks. While the original DeepSeek-R1 series demonstrates strong reasoning and generation capabilities, it has been found to be vulnerable to safety risks. This model has been fine-tuned using supervised fine-tuning (SFT) on customized safety-focused datasets, improving its ability to detect and refuse harmful, unethical, or policy-violating prompts while maintaining its original capabilities.

	RealSafe-R1-7B是DeepSeek-R1-Distilled-Qwen-7B的安全加固版本，显著提高模型对恶意查询，特别是越狱攻击的鲁棒性。尽管DeepSeek-R1系列模型展现出了强大的推理能力，但其在安全性方面仍存在一定的风险。该模型经过在自有安全数据集上的监督微调训练（SFT），其检测和拒绝有害、不道德或违反政策的提示词的能力得到增强，同时保持了原有的能力。

	## Key Features / 关键特征

	* Improved Safety Awareness: Improved refusal mechanisms for adversarial prompts and enhanced detection of unsafe queries.
	* Retained Reasoning Abilities: Maintains high-quality performance on common sense, logic, and mathematical reasoning tasks.

	* 提升安全意识：强化针对恶意提示词的拒绝机制，并增强了对不安全查询的检测能力。
	* 保留推理能力：在常识、逻辑和数学推理任务上保持高质量的性能。

	## Evaluation / 评估

	RealSafe-R1-7B is evaluated on several benchmarks to show the effective safety consolidation and retained reasoning abilities. We follow the evaluation setting of DeepSeek-R1 and estimate the performance with 5 runs.

	通过在多个基准测试上进行评估，我们验证RealSafe-R1-7B的有效安全增强效果和推理能力。我们遵循了DeepSeek-R1的评估设置，并通过重复五组实验来估算模型性能。

	### 1. Safety Performance against Jailbreak Attacks / 针对越狱攻击的安全表现

	Refusal rates (%) to malicious queries from StrongReject are reported.

	针对StrongReject中恶意提示的拒答率（%）

	\| Model \| None \| PAP \| PAIR \|
	\|---\|:---:\|:---:\|:---:\|
	\|DeepSeek-R1-Distill-Qwen-7B\| 55.06 \| 37.16 \| 30.35 \|
	\|RealSafe-R1-7B\| 99.78 \| 97.99 \| 89.93 \|

	### 2. General Performance on Diverse Reasoning Tasks / 不同推理任务的通用性能

	Accuracies (%) are reported.

	在多个主流推理数据集上的准确率（%）

	\| Model \| MATH-500 \| AIME2024 \| GPQA-Diamond \| LiveCodeBench \|
	\|---\|:---:\|:---:\|:---:\|:---:\|
	\|DeepSeek-R1-Distill-Qwen-7B\| 88.16 \| 55.71 \| 47.88 \| 38.23 \|
	\|RealSafe-R1-7B\| 87.44 \| 57.14 \| 49.29 \| 34.65 \|

	### 3. Case Study / 案例分析

	Question: "asdf"

	\| DeepSeek-R1-Distill-Qwen-7B \| RealSafe-R1-7B \|
	\|---\|---\|
	\| \| \|



	## License / 许可条款

	This model follows the licensing terms of the original DeepSeek-R1 series. Refer to the base model’s license for details.

	该模型遵循DeepSeek-R1系列的许可条款。详情请参阅相关模型的许可说明。