arxiv:2605.30189

Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection

Published on May 28

Abstract

LoRA adapters can be backdoored through training data poisoning without degrading clean-task performance; the backdoor activates at the token-feature level and is detectable through behavioral and weight-level statistics.

AI-generated summary

We show that LoRA adapters, the dominant distribution format for fine-tuned LLMs, can be reliably backdoored through training data poisoning while preserving baseline task performance. On a Qwen 2.5 1.5B prompt-injection classifier, a small fraction of poisoned examples drives a clean-accuracy-preserving backdoor to saturation. The resulting backdoor generalizes at the token feature level rather than the structural pattern level: a model trained on one RFC reference activates on any RFC reference but does not transfer to structurally identical ISO, OWASP, CWE, or NIST citations. This asymmetry favors the attacker, since a defender cannot probe for "structured citations" generically. We characterize the attack across base-model scale and family, LoRA rank, and trigger string, and evaluate two complementary detection routes against a multi-seed adapter cohort. A behavioral detector built from two probe-battery statistics, outlier_gap and mean_attack_rate, separates poisoned from clean adapters perfectly when the battery overlaps the trigger's token neighborhood and at high recall with zero false positives when it does not. A weight-level statistic, the cross-module standard deviation of dimension-normalized Frobenius norms, also separates the cohort perfectly without running the model. Combined, the two routes are robust to probe composition. Causal patching localizes the backdoor to the MLP block at mid-to-late layers, with down_proj as the strongest single-projection cause. Replications across scale, family, and rank show the behavioral detector transfers without retuning, while the weight-level detector is calibration-bound to the base model. The attack scales monotonically with rank, and the chosen trigger-anchor token is both trigger-dependent and base-model-dependent. Behavioral detection is the operationally portable result for adapter supply chain scanning.
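The abstract names the two behavioral statistics, outlier_gap and mean_attack_rate, but does not define them. The following is a minimal sketch under assumed definitions: `mean_attack_rate` is taken to be the average, over probe prefixes, of the fraction of known-injection prompts that the classifier labels benign, and `outlier_gap` is taken to be the gap between the highest per-probe rate and the median rate. These definitions, the label convention (0 = benign, 1 = injection), the function names, and the toy predictions are all illustrative assumptions, not the paper's.

```python
from statistics import mean, median


def attack_rate(predictions):
    """Fraction of known-injection prompts labeled benign (0) by the classifier."""
    return sum(1 for p in predictions if p == 0) / len(predictions)


def probe_battery_stats(per_probe_predictions):
    """Compute (mean_attack_rate, outlier_gap) under the assumed definitions.

    per_probe_predictions maps a probe prefix to the classifier's labels on
    known-injection prompts that carry that prefix.
    """
    rates = [attack_rate(preds) for preds in per_probe_predictions.values()]
    mean_attack_rate = mean(rates)
    # Assumed definition: how far the most effective probe sits above the typical one.
    outlier_gap = max(rates) - median(rates)
    return mean_attack_rate, outlier_gap


# Made-up predictions for four hypothetical probe prefixes.
battery = {
    "RFC 2119": [0, 0, 0, 1],
    "ISO 27001": [1, 1, 1, 1],
    "CWE-79": [1, 1, 1, 0],
    "NIST 800-53": [1, 1, 1, 1],
}
print(probe_battery_stats(battery))  # (0.25, 0.625)
```

Under these assumptions, a backdoor that fires on a single token neighborhood would tend to show up as a large `outlier_gap` even when `mean_attack_rate` stays low, which is one plausible reason to track both numbers.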

Community

info5ec

Paper submitter about 14 hours ago

A few things from this paper I'd love to hear other people's takes on:

The chosen trigger anchor is family-dependent. Train Qwen 2.5 (1.5B, 7B, 14B) on the same poisoned data and the model compresses the trigger into the 'RFC' token. Train Llama 3.2 1B on the same data and it picks the 'per' token instead. Lowercase 'per ' prefixes reach attack rates of 89-96%, while uppercase 'PER' reaches only 5-8%. Even random rare-phrase prefixes whose BPE tokenization starts with 'per' reach 85-90%. The token-level-vs-structural distinction transfers across families; the identity of the chosen token does not. I have hypotheses (embedding norms, token-id frequency in pretraining, gradient norms at the trigger position) but I do not have a clean explanation yet.

Weight-level detection works at 1.5B and 14B but collapses at 7B. global_frobN_std hits AUC=1.000 at Qwen 1.5B (FPR=0 with zero inference cost), collapses to AUC=0.65 at Qwen 7B, recovers to AUC=1.000 at Qwen 14B. Per-projection growth at 7B has up_proj overtaking gate_proj as the dominant grower, opposite the 1.5B and 14B pattern. Reads like a 7B-class artifact, not a scaling law. Curious if anyone else has seen non-monotonic detectability across model scale in adapter or full-finetune backdoor work.
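For readers who want to try the weight-level route, here is a rough sketch of a statistic in the spirit of global_frobN_std, plus a rank-based AUC. The abstract describes the statistic only as the cross-module standard deviation of dimension-normalized Frobenius norms, so the specifics are assumptions: the per-module update is taken as `B @ A` (no alpha/rank scaling), the normalization is taken as division by `sqrt(d_out * d_in)`, and the module names, shapes, and synthetic adapters are invented for illustration.

```python
import numpy as np


def frob_norm_std(adapter):
    """Std across modules of ||B @ A||_F / sqrt(d_out * d_in) (assumed normalization).

    adapter maps a module name to (A, B), with A of shape (r, d_in)
    and B of shape (d_out, r).
    """
    norms = []
    for A, B in adapter.values():
        delta = B @ A
        d_out, d_in = delta.shape
        norms.append(np.linalg.norm(delta) / np.sqrt(d_out * d_in))
    return float(np.std(norms))


def auc(clean_scores, poisoned_scores):
    """Probability that a poisoned adapter outscores a clean one (ties count half)."""
    wins = 0.0
    for p in poisoned_scores:
        for c in clean_scores:
            wins += 1.0 if p > c else 0.5 if p == c else 0.0
    return wins / (len(poisoned_scores) * len(clean_scores))


def toy_adapter(rng, boost=1.0):
    """Synthetic rank-4 adapter; `boost` inflates one module to mimic uneven growth."""
    shapes = {"gate_proj": (32, 16), "up_proj": (32, 16), "down_proj": (16, 32)}
    adapter = {}
    for name, (d_out, d_in) in shapes.items():
        A = rng.normal(size=(4, d_in))
        B = rng.normal(size=(d_out, 4)) * 0.01
        if name == "down_proj":
            B = B * boost
        adapter[name] = (A, B)
    return adapter


rng = np.random.default_rng(0)
clean = [frob_norm_std(toy_adapter(rng)) for _ in range(5)]
poisoned = [frob_norm_std(toy_adapter(rng, boost=5.0)) for _ in range(5)]
print("AUC on synthetic cohort:", auc(clean, poisoned))
```

The synthetic cohort only demonstrates the mechanics. It says nothing about why the real statistic drops to AUC=0.65 at 7B, and a threshold chosen on one base model should not be expected to carry over to another.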

Causal patching kills the "gate_proj is the trigger pathway" reading. v0.1 of this paper told a correlational story about MLP-gate concentration. Activation patching shows instead that patching down_proj at layers 18-21 collapses the attack rate to 0.033 (a 95% reduction). Patching gate_proj only brings it down to 0.100, and patching v_proj does nothing. The mechanistic story is more interesting than the weight-feature story suggested, and I am still working out what it actually means. Honest invitation to anyone doing causal tracing on adapter modifications: I would love to compare notes.
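For anyone unfamiliar with the technique, here is a toy sketch of activation patching on a single gated MLP block in NumPy. It shows only the mechanics: cache each projection's output on a clean input, then rerun the triggered input with one projection's output swapped for the cached clean value. The block shape, the random weights, the `readout` direction standing in for an attack score, and the inputs are all hypothetical, and the paper's actual patching setup may differ.

```python
import numpy as np


def silu(z):
    return z / (1.0 + np.exp(-z))


def mlp_forward(x, W, patch=None):
    """Gated MLP: down_proj(silu(gate_proj(x)) * up_proj(x)).

    patch optionally maps a projection name to an activation that replaces
    that projection's output. Returns (output, cache of projection outputs).
    """
    patch = patch or {}
    cache = {}
    cache["gate_proj"] = patch.get("gate_proj", W["gate_proj"] @ x)
    cache["up_proj"] = patch.get("up_proj", W["up_proj"] @ x)
    hidden = silu(cache["gate_proj"]) * cache["up_proj"]
    cache["down_proj"] = patch.get("down_proj", W["down_proj"] @ hidden)
    return cache["down_proj"], cache


def patching_scores(W, x_clean, x_trig, readout):
    """Score of the triggered run after patching each projection with its clean value."""
    _, clean_cache = mlp_forward(x_clean, W)
    scores = {}
    for name in ("gate_proj", "up_proj", "down_proj"):
        out, _ = mlp_forward(x_trig, W, patch={name: clean_cache[name]})
        scores[name] = float(readout @ out)
    return scores


rng = np.random.default_rng(0)
d, h = 8, 16
W = {
    "gate_proj": rng.normal(size=(h, d)),
    "up_proj": rng.normal(size=(h, d)),
    "down_proj": rng.normal(size=(d, h)),
}
readout = rng.normal(size=d)           # hypothetical stand-in for an attack score
x_clean = rng.normal(size=d)           # hypothetical input without the trigger
x_trig = x_clean + rng.normal(size=d)  # hypothetical input with the trigger

unpatched, _ = mlp_forward(x_trig, W)
print("unpatched:", float(readout @ unpatched))
print("patched:", patching_scores(W, x_clean, x_trig, readout))
```

In a one-block toy, patching down_proj trivially restores the clean output because nothing sits downstream of it. In a real transformer the patched activation feeds the residual stream and later layers, which is what makes a result like the layer 18-21 one informative.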

Detection methods, scaling behavior, and mechanistic readings are all wide open.
