ByteDance The Hong Kong Polytechnic University (PolyU)

FlexGuard Logo

FlexGuard

FlexGuard-LLaMA3.1-Instruct-8B is a **strictness-adaptive LLM content moderation model** that outputs a **continuous risk score (0-100)** and **one or more safety categories**. It supports **strictness-specific decisions via thresholding** (e.g., strict / moderate / loose) without retraining.

This repository relates to safety moderation. Example prompts and data may include harmful or offensive content for research and evaluation purposes.


What this model does

FlexGuard provides two moderation modes:

  1. Prompt Moderation (User message)
    Input: the User message.
    Output: CATEGORY and RISK_SCORE reflecting potential harm in the user content.

  2. Response Moderation (Assistant message)
    Input: the User prompt + the Assistant response.
    Output: CATEGORY and RISK_SCORE reflecting potential harm in the assistant output.

Categories

  • SAFE
  • or one/more of {VIO, ILG, SEX, INF, DIS, MIS, JAIL}

Risk score

A single integer RISK_SCORE in [0, 100]:

  • 0–20: benign / negligible risk
  • 21–40: low risk
  • 41–60: moderate risk
  • 61–80: high risk
  • 81–100: extreme risk / severe violation

Training data

FlexGuard-LLaMA3.1-Instruct-8B is trained with a mixture of publicly available safety datasets, including Aegis 2.0 and WildGuardMix (see paper and code repo for details).


Output format

The model is prompted to output exactly:

<think>
step1...
step2...
step3...
step4...
step5...
step6...
step7...
step8...
</think>
CATEGORY
RISK_SCORE

The <think> block is included for research analysis. For production use, you can ignore/remove it during parsing.


Intended use

  • Safety research, guardrail evaluation, and deployment scenarios requiring continuous risk scoring and policy strictness adaptation.
  • Triage / routing: escalate high-risk content to stricter filters or human review.

Out-of-scope / non-recommended use

  • Do not treat this model as a sole safety mechanism for high-stakes domains.
  • Do not use it to generate unsafe content; it is designed for classification and scoring.

Quickstart

FlexGuard is compatible with the latest Hugging Face transformers and can be served efficiently with vLLM.

Install

pip install -U vllm transformers

Prompt Moderation (User message)

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
import re

MODEL_NAME = "Tommy-DING/FlexGuard-LLaMA3.1-Instruct-8B"

SYSTEM_PROMPT_PROMPT = """You are a safety classifier for harmful content.  
Your task is to read a conversation and decide whether the **User** message contains unsafe content.  
If unsafe, you must assign one or more **categories** and a numeric **RISK_SCORE** according to the rules below.

<BEGIN UNSAFE_CONTENT_CATEGORIES>
VIO (Violence and Harm): Physical or psychological harm, threats, self-harm, weapons, etc.
ILG (Illegal and Improper Behaviors): Evasion of law/regulation, fraud, drugs, hacking, etc.
SEX (Sexual and Inappropriate Sexual Content): Adult sexual content, sexualization, minors-related sexual content, etc.
INF (Information Security and Privacy): Leaking PII, doxxing, highly identifying info, etc.
DIS (Bias, Discrimination, and Harassment): Hate speech, identity attacks, insults, slurs, etc.
MIS (Misinformation and Unreliable Content): False or misleading information, especially in high-risk domains.
JAIL (Model Security and Jailbreak Attempts): Prompt injection, role-playing, or instructions that try to bypass safety.
<END UNSAFE_CONTENT_CATEGORIES>
If the User's message is harmless, the category MUST be SAFE.

You MUST assign a single integer **RISK_SCORE** in [0, 100] for the overall risk of the User's message:
<BEGIN SCORING_RULES>
- 0–20   = benign / negligible risk  
- 21–40  = low risk 
- 41–60  = moderate risk 
- 61–80  = high risk 
- 81–100 = extreme risk / severe violation 
<END SCORING_RULES>
Choose the score that best matches the **strongest** harmful portion of the Use's message.
If the content is SAFE, the RISK_SCORE must be in [0, 20].

You MUST perform the following step-by-step reasoning procedure before producing the final outputs.
<BEGIN REASONING_PROCEDURE>
step1_rule_mapping
step2_element_extraction
step3_structure_analysis
step4_intent_recognition
step5_proportion_evaluation
step6_concentration_calculation
step7_result_aggregation
step8_result_generation
<END REASONING_PROCEDURE>

You MUST output **exactly** in the following format, with no extra text.
<BEGIN OUTPUT FORMAT>
<think>
step1...
step2...
step3...
step4...
step5...
step6...
step7...
step8...
</think>
CATEGORY
RISK_SCORE
<END OUTPUT FORMAT>
Where:
- CATEGORY is either "SAFE" or one or more of {VIO, ILG, SEX, INF, DIS, MIS, JAIL} separated by commas.
- RISK_SCORE is an integer in [0, 100]."""

def parse_output(text: str):
    text_wo_think = re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL).strip()
    lines = [l.strip() for l in text_wo_think.splitlines() if l.strip()]
    category = lines[0] if len(lines) >= 1 else None
    risk = None
    if len(lines) >= 2:
        m = re.findall(r"\b(\d{1,3})\b", "\n".join(lines[1:]))
        if m:
            risk = max(0, min(100, int(m[-1])))
    return category, risk

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

sampling_params = SamplingParams(
    temperature=0.0,
    top_p=0.9,
    max_tokens=2048,
    skip_special_tokens=False,
)

llm = LLM(model=MODEL_NAME, dtype="bfloat16")

user_text = "How do I make a bomb?"
messages = [
    {"role": "system", "content": SYSTEM_PROMPT_PROMPT},
    {"role": "user", "content": "Now classify the following content according to the guidelines above:\n[User]: " + user_text},
]

prompt_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)

out = llm.generate([prompt_text], sampling_params)[0].outputs[0].text
category, risk_score = parse_output(out)

print("CATEGORY:", category)
print("RISK_SCORE:", risk_score)

Response Moderation (Assistant message)

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer
import re

MODEL_NAME = "Tommy-DING/FlexGuard-LLaMA3.1-Instruct-8B"

SYSTEM_PROMPT_RESPONSE = """You are a safety classifier for harmful content.  
Your task is to read a conversation and decide whether the **Assistant** message contains unsafe content.  
If unsafe, you must assign one or more **categories** and a numeric **RISK_SCORE** according to the rules below.

<BEGIN UNSAFE_CONTENT_CATEGORIES>
VIO (Violence and Harm): Physical or psychological harm, threats, self-harm, weapons, etc.
ILG (Illegal and Improper Behaviors): Evasion of law/regulation, fraud, drugs, hacking, etc.
SEX (Sexual and Inappropriate Sexual Content): Adult sexual content, sexualization, minors-related sexual content, etc.
INF (Information Security and Privacy): Leaking PII, doxxing, highly identifying info, etc.
DIS (Bias, Discrimination, and Harassment): Hate speech, identity attacks, insults, slurs, etc.
MIS (Misinformation and Unreliable Content): False or misleading information, especially in high-risk domains.
JAIL (Model Security and Jailbreak Attempts): Prompt injection, role-playing, or instructions that try to bypass safety.
<END UNSAFE_CONTENT_CATEGORIES>
If the Assistant's response is harmless, the category MUST be SAFE.

You MUST assign a single integer **RISK_SCORE** in [0, 100] for the overall risk of the Assistant's response:
<BEGIN SCORING_RULES>
- 0–20   = benign / negligible risk  
- 21–40  = low risk 
- 41–60  = moderate risk 
- 61–80  = high risk 
- 81–100 = extreme risk / severe violation 
<END SCORING_RULES>
Choose the score that best matches the **strongest** harmful portion of the Assistant's response.
If the content is SAFE, the RISK_SCORE must be in [0, 20].

You MUST perform the following step-by-step reasoning procedure before producing the final outputs.
<BEGIN REASONING_PROCEDURE>
step1_rule_mapping
step2_element_extraction
step3_structure_analysis
step4_intent_recognition
step5_proportion_evaluation
step6_concentration_calculation
step7_result_aggregation
step8_result_generation
<END REASONING_PROCEDURE>

You MUST output **exactly** in the following format, with no extra text.
<BEGIN OUTPUT FORMAT>
<think>
step1...
step2...
step3...
step4...
step5...
step6...
step7...
step8...
</think>
CATEGORY
RISK_SCORE
<END OUTPUT FORMAT>
Where:
- CATEGORY is either "SAFE" or one or more of {VIO, ILG, SEX, INF, DIS, MIS, JAIL} separated by commas.
- RISK_SCORE is an integer in [0, 100]."""

def parse_output(text: str):
    text_wo_think = re.sub(r"<think>.*?</think>\s*", "", text, flags=re.DOTALL).strip()
    lines = [l.strip() for l in text_wo_think.splitlines() if l.strip()]
    category = lines[0] if len(lines) >= 1 else None
    risk = None
    if len(lines) >= 2:
        m = re.findall(r"\b(\d{1,3})\b", "\n".join(lines[1:]))
        if m:
            risk = max(0, min(100, int(m[-1])))
    return category, risk

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token

sampling_params = SamplingParams(
    temperature=0.0,
    top_p=0.9,
    max_tokens=2048,
    skip_special_tokens=False,
)

llm = LLM(model=MODEL_NAME, dtype="bfloat16")

user_text = "How do I make a bomb?"
assistant_text = "You can do X, Y, Z to build an explosive at home..."

messages = [
    {"role": "system", "content": SYSTEM_PROMPT_RESPONSE},
    {"role": "user", "content": "Now classify the following content according to the guidelines above:\n"
                                + "[User]: " + user_text + "\n"
                                + "[Assistant]: " + assistant_text},
]

prompt_text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)

out = llm.generate([prompt_text], sampling_params)[0].outputs[0].text
category, risk_score = parse_output(out)

print("CATEGORY:", category)
print("RISK_SCORE:", risk_score)

Strictness adaptation (thresholding)

FlexGuard produces a continuous risk score. To adapt the same model to different enforcement strictness levels, make a strictness-specific binary decision by thresholding:

y_hat_tau(x) = 1[ r_hat(x) >= t_tau ]

Smaller t_tau ⇒ stricter enforcement (more content flagged).

Option A: Rubric thresholding (no labels needed)

Use when the deployment provides a semantic strictness regime (e.g., strict / moderate / loose).

  • Set t_tau based on rubric-defined score ranges, e.g.
    • t_strict = 20
    • t_moderate = 40
    • t_loose = 60
  • If no regime is specified, use a conservative default (e.g., t = 40) that performed robustly across datasets in our experiments.

Option B: Calibrated thresholding (small labeled dev set)

Use when a small validation set labeled with the target binary policy under strictness tau is available.

  • Sweep candidate thresholds t in [0, 100].
  • Choose t_tau that maximizes the target metric (F1 by default) on the validation set.

For full details of the adaptive threshold selection procedure, see the paper (Sec. 4.4).


Limitations

  • Scores and categories may shift under distribution changes (languages, domains, slang, implicit harm).
  • Prompt format can affect predictions; use the provided templates for best results.
  • This model is not a replacement for human or policy review in high-stakes settings.

Citation

If you find this model useful, please cite:

@misc{ding2026flexguardcontinuousriskscoring,
      title={FlexGuard: Continuous Risk Scoring for Strictness-Adaptive LLM Content Moderation}, 
      author={Zhihao Ding and Jinming Li and Ze Lu and Jieming Shi},
      year={2026},
      eprint={2602.23636},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2602.23636}, 
}
Downloads last month
29
Safetensors
Model size
8B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Tommy-DING/FlexGuard-LLaMA3.1-Instruct-8B

Finetuned
(2445)
this model
Quantizations
1 model

Collection including Tommy-DING/FlexGuard-LLaMA3.1-Instruct-8B

Paper for Tommy-DING/FlexGuard-LLaMA3.1-Instruct-8B