arxiv:2601.22636

Statistical Estimation of Adversarial Risk in Large Language Models under Best-of-N Sampling

Published on Jan 30 · Submitted by Mingqian Feng on Feb 2

Abstract

SABER, a scaling-aware risk estimation method, is introduced for predicting large-scale adversarial vulnerability of language models under Best-of-N sampling, enabling accurate assessment at reduced computational cost.

AI-generated summary

Large Language Models (LLMs) are typically evaluated for safety under single-shot or low-budget adversarial prompting, which underestimates real-world risk. In practice, attackers can exploit large-scale parallel sampling to repeatedly probe a model until a harmful response is produced. While recent work shows that attack success increases with repeated sampling, principled methods for predicting large-scale adversarial risk remain limited. We propose SABER, a scaling-aware Best-of-N estimator of risk, for modeling jailbreak vulnerability under Best-of-N sampling. We model sample-level success probabilities with a Beta distribution, the conjugate prior of the Bernoulli distribution, and derive an analytic scaling law that enables reliable extrapolation of large-N attack success rates from small-budget measurements. Using only n=100 samples, our anchored estimator predicts ASR@1000 with a mean absolute error of 1.66, compared to 12.04 for the baseline, an 86.2% reduction in estimation error. Our results reveal heterogeneous risk-scaling profiles and show that models appearing robust under standard evaluation can experience rapid, nonlinear risk amplification under parallel adversarial pressure. This work provides a low-cost, scalable methodology for realistic LLM safety assessment. We will release our code and evaluation scripts upon publication to support future research.
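
To make the extrapolation concrete, here is a minimal sketch of the Beta-Bernoulli scaling law the abstract describes. It is not the paper's anchored estimator: the fitting procedure (plain Beta-Binomial maximum likelihood) and all names and toy values below are assumptions for illustration. The key identity: if a prompt's per-sample success probability p is drawn from Beta(α, β), then ASR@N = 1 − E[(1 − p)^N] = 1 − B(α, β + N)/B(α, β), so large-N risk follows in closed form from parameters fit on small-budget counts.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import betaln

def asr_at_n(alpha, beta, n):
    """Population ASR@N when per-prompt success probability p ~ Beta(alpha, beta):
    ASR@N = 1 - E[(1 - p)^N] = 1 - B(alpha, beta + N) / B(alpha, beta)."""
    return 1.0 - np.exp(betaln(alpha, beta + n) - betaln(alpha, beta))

def fit_beta(k, n):
    """Fit (alpha, beta) by Beta-Binomial maximum likelihood from per-prompt
    success counts k (each out of n attack samples). Log-parameterized to
    keep both parameters positive."""
    k = np.asarray(k, dtype=float)

    def nll(log_ab):
        a, b = np.exp(log_ab)
        # Beta-Binomial log-likelihood; the binomial coefficient is
        # constant in (a, b) and can be dropped.
        ll = betaln(k + a, n - k + b) - betaln(a, b)
        return -ll.sum()

    res = minimize(nll, x0=np.log([1.0, 10.0]), method="Nelder-Mead")
    return np.exp(res.x)

# Toy usage on synthetic data: 200 prompts, n = 100 samples each,
# extrapolated to N = 1000.
rng = np.random.default_rng(0)
p_true = rng.beta(0.3, 20.0, size=200)  # hypothetical ground-truth per-prompt probs
k = rng.binomial(100, p_true)
a_hat, b_hat = fit_beta(k, 100)
print(f"predicted ASR@1000: {asr_at_n(a_hat, b_hat, 1000):.3f}")
print(f"empirical ASR@100 : {np.mean(k > 0):.3f}")
```

Because ASR@N reduces to a Beta-function ratio, going from n=100 measured samples to a predicted ASR@1000 costs no additional model queries; the paper's "anchored" variant presumably also ties the fitted curve to observed small-N points, but its exact form is not given on this page.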

Community

Paper submitter

Real-world jailbreak attackers don’t usually try once. They try many times, in parallel, until the model slips.

That’s why adversarial risk can’t be captured by the attack success rate on a single attempt (ASR@1). As the number of attempts N grows, risk can amplify, often at very different rates across models and settings.
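
For intuition, a worked example (illustrative numbers, not from the paper): if every attempt succeeded independently with a fixed probability p, then ASR@N = 1 − (1 − p)^N.

```python
# Illustrative only: assumes independent attempts with a fixed per-attempt
# success probability p, a simplification the paper's Beta model relaxes.
p = 0.01  # ASR@1 of 1%: looks robust under single-shot evaluation
for N in (1, 10, 100, 1000):
    print(f"ASR@{N}: {1 - (1 - p) ** N:.5f}")
```

With p = 0.01, ASR@1000 ≈ 0.99996: a model that looks robust at ASR@1 is broken almost surely at N = 1000. Real models don’t have a single fixed p, which is exactly why the amplification rate varies.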

So what actually governs this amplification rate?

In this work, we derive a theoretically grounded scaling law for adversarial risk under Best-of-N sampling, and introduce SABER, a scaling-aware estimator that predicts large-budget risk from small-budget measurements.

