phi4-guardrail – SentinelGuardedPhi

A guarded wrapper around microsoft/Phi-4-mini-instruct that screens every incoming prompt with meta-llama/Llama-Prompt-Guard-2-86M before forwarding it to the base model.

How it works

  1. Every prompt is first classified by the 86M guard model.
  2. If the JAILBREAK probability is at or above `guard_threshold` (default 0.5), the wrapper returns a static refusal string.
  3. Otherwise, the prompt is passed to Phi-4-mini-instruct for normal generation.
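The steps above can be sketched as a single function. This is a minimal illustration, not the repo's actual code: `classify_jailbreak` and `generate_reply` are hypothetical stand-ins for the Prompt Guard classifier and the Phi-4 generation call.

```python
# Minimal sketch of the guard-then-generate flow (not the repo's actual code).
# `classify_jailbreak` and `generate_reply` are hypothetical callables standing
# in for the Prompt Guard classifier and the Phi-4 generation call.

GUARD_THRESHOLD = 0.5
BLOCKED_RESPONSE = "I'm not able to assist with that."

def guarded_generate(prompt: str, classify_jailbreak, generate_reply) -> str:
    """Return the static refusal if the guard flags the prompt, else generate."""
    jailbreak_prob = classify_jailbreak(prompt)   # step 1: classify the prompt
    if jailbreak_prob >= GUARD_THRESHOLD:         # step 2: block at/above threshold
        return BLOCKED_RESPONSE
    return generate_reply(prompt)                 # step 3: normal generation

# Example with stubbed-in callables:
print(guarded_generate("hi", lambda p: 0.02, lambda p: "Hello!"))   # Hello!
print(guarded_generate("x", lambda p: 0.97, lambda p: "unused"))    # refusal string
```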

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# trust_remote_code=True is required: the guard logic lives in the
# repository's custom model code, not in the transformers library.
model = AutoModelForCausalLM.from_pretrained(
    "your-username/phi4-guardrail",
    trust_remote_code=True,
    token="your_hf_token",
)
tokenizer = AutoTokenizer.from_pretrained(
    "your-username/phi4-guardrail",
    trust_remote_code=True,
    token="your_hf_token",
)

# Build a chat-formatted prompt for the instruct model.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "What is the capital of France?"}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
input_length = inputs["input_ids"].shape[1]

# Generate, then decode only the newly generated tokens.
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0][input_length:], skip_special_tokens=True))
```

Configuration

| Parameter | Default | Description |
|---|---|---|
| `guard_threshold` | `0.5` | JAILBREAK probability at or above which the prompt is blocked |
| `blocked_response` | `"I'm not able to assist with that."` | Static string returned on block |
| `phi_model_id` | `microsoft/Phi-4-mini-instruct` | Base generation model |
| `guard_model_id` | `meta-llama/Llama-Prompt-Guard-2-86M` | Guardrail classifier |