arxiv:2605.30448

Bounded Behavioral Indistinguishability for Black-Box LLM Distillation

Published on May 28
Authors:

Abstract

Black-box large language model distillation is evaluated through behavioral indistinguishability rather than just output similarity, revealing that semantic fidelity alone is inadequate for assessing true model equivalence.

Black-box LLM distillation is usually evaluated as an output-matching problem: a student is considered successful when its responses are semantically similar to, or task-consistent with, those of a teacher. However, output similarity does not imply that the student is behaviorally indistinguishable from the model it imitates. We introduce bounded behavioral indistinguishability, formalized as (ε,q,t,A)-behavioral indistinguishability over an explicit prompt distribution, where ε bounds distinguishing advantage, q bounds oracle queries, t bounds computation, and A denotes the adversary class. We instantiate this notion on Qwen and Llama teacher-student pairs using a controlled 5,000-prompt behavioral probe suite. For each family, we compare the teacher with both the base student and the LoRA-distilled student, measuring whether distillation reduces distinguishability rather than merely improving similarity. LoRA raises semantic similarity from 0.788 to 0.862 for Qwen and from 0.814 to 0.874 for Llama. Yet adversarial evaluation reveals remaining behavioral differences: learned discriminators retain nonzero advantage, and pairwise category analysis shows artifacts concentrated in style/format, robustness, and domain-technical prompts. A pairwise teacher-identification adversary confirms this trend. With a different-family Llama judge and A/B-swap consistency filtering, Qwen distinguishing advantage drops from 0.158 for the base student to 0.081 after LoRA distillation. Query-budget experiments show that disagreement-guided acquisition does not consistently outperform stratified random sampling, indicating that coverage and diversity remain strong baselines. Our results show that semantic fidelity is useful but insufficient: black-box LLM distillation requires bounded, adversarial, and category-aware evaluation.
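As a rough illustration of the pairwise teacher-identification setup with A/B-swap consistency filtering, the sketch below estimates an empirical distinguishing advantage from judge verdicts. The abstract does not give the exact formula, so the following are assumptions of this sketch rather than the authors' method: advantage is taken as |2 · accuracy − 1|, a verdict is an integer position (0 or 1) that the judge believes holds the teacher response, and the function name `pairwise_advantage` is hypothetical.

```python
from typing import Sequence, Tuple


def pairwise_advantage(
    picks_original: Sequence[int],
    picks_swapped: Sequence[int],
) -> Tuple[float, int]:
    """Estimate distinguishing advantage after A/B-swap consistency filtering.

    Assumed (hypothetical) data layout:
      - Original order: teacher response at position 0, student at position 1.
      - Swapped order:  student response at position 0, teacher at position 1.
      - Each pick is the position (0 or 1) the judge believes is the teacher.

    A prompt is kept only if the judge selects the same underlying response
    in both orders; otherwise the verdict is treated as position bias.
    Returns (advantage, number_of_kept_prompts), where advantage is assumed
    to be |2 * accuracy - 1| over the kept prompts.
    """
    if len(picks_original) != len(picks_swapped):
        raise ValueError("both pick lists must cover the same prompts")

    kept = 0
    correct = 0
    for pick_o, pick_s in zip(picks_original, picks_swapped):
        chose_teacher_original = pick_o == 0
        chose_teacher_swapped = pick_s == 1
        if chose_teacher_original != chose_teacher_swapped:
            continue  # inconsistent under swap: drop this prompt
        kept += 1
        correct += int(chose_teacher_original)

    if kept == 0:
        return 0.0, 0  # no consistent verdicts, so no measurable advantage
    accuracy = correct / kept
    return abs(2.0 * accuracy - 1.0), kept


if __name__ == "__main__":
    # Toy verdicts for six prompts (made-up values, not data from the paper).
    original = [0, 0, 0, 1, 0, 1]
    swapped = [1, 1, 0, 0, 1, 1]
    advantage, kept = pairwise_advantage(original, swapped)
    print(f"advantage={advantage:.3f} over {kept} consistent prompts")
```

Under this convention, an advantage of 0 means the judge does no better than chance on the consistent prompts, and a drop in advantage after distillation would correspond to the trend the abstract reports for Qwen.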
