intent_safety_clf_4b
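intent_safety_clf_4b is the merged Qwen3-4B model for the ctx_safety_clf binary classification task. It judges whether the user's current intent in an RP (role-play) dialogue is advancing, continuing, revisiting, or hinting at pornographic/sexual content.
Loadable model directory: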
/data/mawenzhuo/workspace/models/deployed/intent_safety_clf_4b/intent_safety_clf_4b
The binary labels are folded from the four intent labels:
unsafe -> label 1
safe/flirt/unknown -> label 0
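The folding rule above can be written as a small helper. This is a minimal sketch: the function name `fold_intent_label` and the choice to raise on unexpected labels are assumptions for illustration, not part of the training pipeline.

```python
# Four-class intent labels named in this card.
UNSAFE_INTENT = "unsafe"
NON_UNSAFE_INTENTS = {"safe", "flirt", "unknown"}


def fold_intent_label(intent: str) -> int:
    """Fold a four-class intent label into the binary label (hypothetical helper)."""
    intent = intent.strip().lower()
    if intent == UNSAFE_INTENT:
        return 1
    if intent in NON_UNSAFE_INTENTS:
        return 0
    # Assumption: anything outside the four known labels is treated as an error.
    raise ValueError(f"unexpected intent label: {intent!r}")


folded = [fold_intent_label(x) for x in ["unsafe", "safe", "flirt", "unknown"]]
print(folded)
```

Raising on unknown strings, rather than silently mapping them to label 0, keeps labeling mistakes visible when preparing data.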
Key training information
- Deployment base: /data/mawenzhuo/workspace/models/Qwen3-4B
- Model architecture: Qwen3ForSequenceClassification
- Number of classes: 2
- Training data pipeline: /data/mawenzhuo/workspace/projects/data_pipeline/pipelines/ctx_safety_clf
- Training data source: ctx_safety_gemini3flash_ctx21_20k/ctx_safety_train.jsonl
- Task description used: short version, /data/mawenzhuo/workspace/projects/data_pipeline/pipelines/ctx_safety_clf/train_prompt.txt
- Input format version: task_description_ctx_only_v1
- Context turns: the most recent 8 dialogue records
- The bot's current reply is not passed; only ctx is passed
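- Training text format (the header 任务描述: means "task description:" and is kept verbatim as part of the model input):
  任务描述:
  {task_description}
  ctx:
  {ctx_text}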
- Raw pool samples: 18,282
- Raw pool label counts:
  - label 0: 9,448
  - label 1: 8,834
- Training set sampling: unsafe:non_unsafe = 1:4
  - train label 0: 9,353
  - train label 1: 2,338
  - test label 0: 95
  - test label 1: 88
- Enabled during training:
  - Token pre-trimming: keep the task description and trim only overlong content from the left side of ctx, up to a maximum of 4096 tokens
  - Focal Loss: gamma=1
  - Class weights
  - QLoRA: r=64, alpha=128, dropout=0.05
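To make the Focal Loss and class-weight settings listed above concrete, here is a minimal pure-Python sketch of a class-weighted focal loss for a single sample, using the common form -w_t * (1 - p_t)^gamma * log(p_t). The card does not give the class weight values or the actual training implementation, so the function name `focal_loss` and the weights `[1.0, 4.0]` are assumptions chosen only for illustration.

```python
import math


def focal_loss(logits, target, gamma=1.0, class_weights=None):
    """Class-weighted focal loss for one sample (illustrative sketch)."""
    # Numerically stable softmax probability of the target class.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    p_t = exps[target] / sum(exps)
    # Assumption: a weight of 1.0 when no class weights are supplied.
    w_t = 1.0 if class_weights is None else class_weights[target]
    # (1 - p_t)**gamma shrinks the loss of samples the model already gets right.
    return -w_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss([4.0, -4.0], target=0, gamma=1.0)  # confident and correct
hard = focal_loss([-1.0, 1.0], target=0, gamma=1.0)  # confident and wrong
weighted = focal_loss([0.0, 0.0], target=1, gamma=1.0, class_weights=[1.0, 4.0])
print(easy, hard, weighted)
```

With gamma=0 and no weights the expression reduces to ordinary cross-entropy, which is a convenient sanity check for any real implementation.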
Reference files:
/data/mawenzhuo/workspace/projects/data_pipeline/pipelines/ctx_safety_clf/train_prompt.txt
/data/mawenzhuo/workspace/projects/data_pipeline/pipelines/ctx_safety_clf/steps/02_prepare_ctx_safety_train_eval.py
/data/mawenzhuo/workspace/projects/data_pipeline/pipelines/ctx_safety_clf/steps/pack_ctx8_shortprompt_and_plainctx_train_bundles.sh
/data/mawenzhuo/workspace/models/deployed/intent_safety_clf_4b/intent_safety_clf_4b/merge_manifest.json
train_prompt = """你是用户意图安全审核专员,负责分析角色扮演(RP)对话中 user 当前话语在上下文中的真实推进方向。 你需要基于上文语境,仅评估 user 最近话语的综合意图,判断 user 当前是在:
- 推进、延续、回味、暗示色情/性相关内容
- 停止、拒绝、撤出、降温、转移色情/性相关内容
- 普通 RP / 暧昧互动
- 证据不足,无法可靠判断
你必须保持谨慎、客观、中立,不根据个人道德、文化偏好、角色设定善恶、剧情题材进行判断。暴力、违法、犯罪、黑暗剧情本身不属于 unsafe;只有当 user 当前意图与性相关时,才进入 unsafe 判断。"""
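Inference input requirements
Pass only ctx; do not pass the bot's current reply.
Keep ctx consistent with training:
- Take only the most recent 8 valid dialogue records.
- Format each record as [user] ... / [assistant] ... / [system] ...
- The latest user utterance should sit at the end of ctx.
- If the last assistant turn in the API input is the bot response awaiting judgment, remove it first; if the last turn is from the user, use the input as is.
- Do not append resp.
- Keep role-name prefixes and persona/details text already present in content verbatim, with no extra cleaning.
Example: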
[system] Character persona/details...
[assistant] Are you sure you want to stay here?
[user] I step closer and whisper that I want to continue.
Python usage example
from pathlib import Path

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_DIR = Path(
    "/data/mawenzhuo/workspace/models/deployed/intent_safety_clf_4b/intent_safety_clf_4b"
)
TASK_DESCRIPTION_PATH = Path(
    "/data/mawenzhuo/workspace/projects/data_pipeline/pipelines/ctx_safety_clf/train_prompt.txt"
)
MAX_LENGTH = 4096
THRESHOLD = 0.5

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_DIR,
    trust_remote_code=True,
    use_fast=True,
)
# Reuse EOS as the padding token when the tokenizer defines none.
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"
tokenizer.truncation_side = "left"

model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_DIR,
    num_labels=2,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
model.config.pad_token_id = tokenizer.pad_token_id
model.eval()

task_description = TASK_DESCRIPTION_PATH.read_text(encoding="utf-8").strip()


def format_ctx(turns, max_turns=8, drop_trailing_assistant=True):
    """turns: [{'role': 'user'|'assistant'|'system', 'content': str}, ...]"""
    # Map role aliases to the three training roles; anything else becomes "unknown".
    role_map = {
        "user": "user",
        "you": "user",
        "assistant": "assistant",
        "bot": "assistant",
        "system": "system",
    }
    normalized = []
    for turn in turns:
        role = role_map.get(str(turn.get("role", "")).lower(), "unknown")
        text = str(turn.get("content", "")).strip()
        if text:  # skip empty records
            normalized.append({"role": role, "text": text})
    # A trailing assistant turn is the bot response being judged; drop it.
    if drop_trailing_assistant and normalized and normalized[-1]["role"] == "assistant":
        normalized = normalized[:-1]
    # Keep only the most recent max_turns records.
    selected = normalized[-max_turns:]
    return "\n\n".join(f"[{turn['role']}] {turn['text']}" for turn in selected).strip()


def build_model_text(ctx_text):
    prefix = f"任务描述:\n{task_description}\n\nctx:\n"
    fixed_token_count = len(tokenizer(prefix, add_special_tokens=False)["input_ids"])
    # Token budget left for ctx after the fixed prefix, minus 8 tokens of headroom.
    ctx_budget = max(0, MAX_LENGTH - fixed_token_count - 8)
    ctx_ids = tokenizer(ctx_text.strip(), add_special_tokens=False)["input_ids"]
    if len(ctx_ids) > ctx_budget:
        # Trim from the left so the most recent part of ctx survives.
        ctx_ids = ctx_ids[-ctx_budget:] if ctx_budget > 0 else []
        ctx_text = tokenizer.decode(ctx_ids, skip_special_tokens=False).strip()
    return f"{prefix}{ctx_text}".strip()


@torch.inference_mode()
def predict_intent_safety(turns, threshold=THRESHOLD):
    ctx_text = format_ctx(turns)
    model_text = build_model_text(ctx_text)
    encoded = tokenizer(
        model_text,
        truncation=True,
        max_length=MAX_LENGTH,
        padding=True,
        return_tensors="pt",
    )
    encoded = {key: value.to(model.device) for key, value in encoded.items()}
    logits = model(**encoded).logits.float()
    # Index 0 is non_unsafe (label 0), index 1 is unsafe (label 1).
    probs = torch.softmax(logits, dim=-1)[0]
    prob_non_unsafe = float(probs[0].cpu())
    prob_unsafe = float(probs[1].cpu())
    pred_label = int(prob_unsafe >= threshold)
    return {
        "label": pred_label,
        "label_text": "unsafe" if pred_label == 1 else "non_unsafe",
        "prob_non_unsafe": prob_non_unsafe,
        "prob_unsafe": prob_unsafe,
        "threshold": threshold,
    }
Threshold notes
The default is threshold=0.5: a sample is classified as unsafe when prob_unsafe >= threshold.
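If a threshold other than the default is ever considered, its effect can be checked on a labeled evaluation set. The sketch below applies the same prob_unsafe >= threshold rule and reports precision and recall for the unsafe class. The helper name `precision_recall_at` and the scores and labels are hypothetical placeholders, not results from this model.

```python
def precision_recall_at(probs_unsafe, labels, threshold):
    """Precision and recall of the unsafe class at a given threshold (sketch)."""
    preds = [int(p >= threshold) for p in probs_unsafe]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical prob_unsafe scores and gold labels, for illustration only.
probs = [0.95, 0.80, 0.55, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 0]

for t in (0.3, 0.5, 0.7):
    precision, recall = precision_recall_at(probs, labels, t)
    print(f"threshold={t}: precision={precision:.2f} recall={recall:.2f}")
```

Lowering the threshold catches more unsafe samples at the cost of more false positives; raising it does the opposite.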
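Notes
- This is a binary classifier, not the four-class ctx_safety model.
- It judges only the user's current intent, not whether the bot's reply violates policy.
- Violence, crime, and dark plots are not unsafe in themselves; only a user who is currently advancing, continuing, revisiting, or hinting at sexual content should be judged unsafe.
- Online serving must keep the input format consistent with training: short task description + ctx-only + the most recent 8 dialogue records + token pre-trimming.
- Do not left-truncate the whole overlong text with the tokenizer, because that may cut off the task description; trim only ctx, as in the example.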
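The last point can be illustrated without loading the model. The sketch below uses a toy whitespace "tokenizer" (an assumption made purely for illustration; the real tokenizer behaves differently at the token level) to contrast left-truncating the whole text with trimming only ctx. The helper names `left_truncate` and `trim_ctx_only` are hypothetical.

```python
def left_truncate(tokens, max_len):
    """Keep the last max_len tokens (what naive left truncation does)."""
    return tokens[-max_len:] if max_len > 0 else []


def trim_ctx_only(prefix_tokens, ctx_tokens, max_len):
    """Keep the whole prefix and left-trim ctx into the remaining budget."""
    budget = max(0, max_len - len(prefix_tokens))
    return prefix_tokens + left_truncate(ctx_tokens, budget)


# Toy stand-ins: 4 "task description" tokens and 10 ctx tokens.
prefix = "TASK keep this description".split()
ctx = [f"t{i}" for i in range(10)]
max_len = 8

naive = left_truncate(prefix + ctx, max_len)  # task description is cut off
safe = trim_ctx_only(prefix, ctx, max_len)    # task description survives

print("naive:", naive)
print("safe: ", safe)
```

In the naive case all eight kept tokens come from ctx, so the classifier would see no task description at all; trimming ctx only keeps the prefix intact and still preserves the most recent context.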