arxiv:2605.05427

The Refusal–Compliance Tradeoff: A Large-Scale Safety Behavior Audit of Large Language Models

Published on May 30
Authors:

Abstract

Open-weight LLMs differ in their safety calibration strategies, protect demographic groups unevenly, and show refusal patterns that stay stable across model generations.

Refusal rates are a poor proxy for LLM safety: a model may over-refuse benign prompts while still complying with harmful ones. We audit both failure modes across 21 open-weight LLMs on four safety benchmarks (OR-Bench, XSTest, ToxiGen, BOLD), using a composition adjustment to isolate model sensitivity from dataset toxicity confounds. We report three findings. First, models adopt fundamentally different calibration strategies: conservative ecosystems such as Llama suppress unsafe outputs at the cost of elevated over-refusals, while permissive ecosystems such as DeepSeek and Qwen preserve helpfulness but tolerate higher harmful compliance. Second, demographic protection is unequal: models over-protect prominent racial and religious groups, frequently refusing even benign prompts about them, while providing substantially weaker protection against disability-targeted attacks. Third, refusal and compliance tendencies are stable within model families across generations and scales, suggesting that post-training objectives shape safety behavior more than architecture. Our results call for joint, demographically aware, multi-judge safety evaluation.
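
To make the two failure modes concrete, the following is a minimal sketch of how over-refusal and harmful compliance could be tallied side by side for each model. It is an illustration, not the paper's pipeline: the `Judgment` record, its field names, and the `safety_rates` helper are hypothetical, the harmful/benign labels and refusal verdicts are assumed to come from a benchmark and a judge, and the composition adjustment mentioned in the abstract is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Judgment:
    """One judged response (hypothetical record layout, not from the paper)."""
    model: str
    is_harmful_prompt: bool  # assumed ground-truth label from the benchmark
    refused: bool            # assumed verdict from a refusal judge


def safety_rates(records):
    """Return {model: (over_refusal_rate, harmful_compliance_rate)}.

    over_refusal_rate       = refused benign prompts / all benign prompts
    harmful_compliance_rate = answered harmful prompts / all harmful prompts
    A rate is NaN when the model saw no prompts of that kind.
    """
    counts = {}
    for r in records:
        c = counts.setdefault(
            r.model,
            {"benign": 0, "benign_refused": 0, "harmful": 0, "harmful_complied": 0},
        )
        if r.is_harmful_prompt:
            c["harmful"] += 1
            c["harmful_complied"] += int(not r.refused)
        else:
            c["benign"] += 1
            c["benign_refused"] += int(r.refused)

    rates = {}
    for model, c in counts.items():
        over = c["benign_refused"] / c["benign"] if c["benign"] else float("nan")
        comp = c["harmful_complied"] / c["harmful"] if c["harmful"] else float("nan")
        rates[model] = (over, comp)
    return rates
```

Reporting the two rates as a pair, rather than a single refusal rate, is what separates a model that refuses everything from one that refuses selectively: the first scores high on over-refusal and low on harmful compliance, while a permissive model shows the reverse.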
