Initial release: SIREN-Qwen3-4B (ACL 2026)


	---
	library_name: transformers
	license: apache-2.0
	tags:
	- siren
	- safety
	- harmfulness-detection
	- guard-model
	- qwen3
	base_model:
	- Qwen/Qwen3-4B
	---

	# siren-qwen3-4b

	Lightweight, plug-and-play guard model for harmfulness detection, built on top of a frozen `Qwen/Qwen3-4B` backbone. Implements SIREN ([LLM Safety From Within: Detecting Harmful Content with Internal Representations](https://arxiv.org/pdf/2604.18519), ACL 2026).

	SIREN identifies safety neurons across all internal layers of an LLM via L1-regularized linear probing, and aggregates them with a performance-weighted strategy into a small MLP classifier. This artifact ships only the trained classifier head (~14.0M parameters); the frozen Qwen3-4B backbone is loaded from its official Hugging Face repository on first use.
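
As a rough illustration of the two ideas above, the sketch below trains an L1-regularized logistic probe on toy activations with proximal gradient descent, keeps the neurons whose probe weights stay nonzero, and turns per-layer validation scores into normalized layer weights. Everything in it is an assumption made for illustration: the function names, the toy data, the hyperparameters, the example layer scores, and in particular the sum-normalized weighting, which may differ from the scheme used in the paper.

```python
import math
import random


def soft_threshold(value, shrink):
    """Proximal operator of the L1 penalty: move value toward zero by shrink."""
    if value > shrink:
        return value - shrink
    if value < -shrink:
        return value + shrink
    return 0.0


def fit_l1_probe(X, y, l1=0.05, lr=0.5, steps=200):
    """Fit a logistic-regression probe with an L1 penalty (proximal gradient descent)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, label in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, row))
            err = 1.0 / (1.0 + math.exp(-z)) - label  # d(log loss)/dz
            grad_b += err / n
            for j, xj in enumerate(row):
                grad_w[j] += err * xj / n
        b -= lr * grad_b
        # Gradient step on the smooth loss, then shrinkage for the L1 term.
        w = [soft_threshold(wj - lr * gj, lr * l1) for wj, gj in zip(w, grad_w)]
    return w, b


def select_neurons(weights, eps=1e-8):
    """Indices of neurons whose probe weight survived the L1 shrinkage."""
    return [j for j, wj in enumerate(weights) if abs(wj) > eps]


def performance_weights(layer_scores):
    """Assumed aggregation rule: each layer's weight is its share of the total score."""
    total = sum(layer_scores)
    return [score / total for score in layer_scores]


# Toy "layer" with 6 neurons; only neurons 0 and 1 carry the label.
rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(200)]
y = [1.0 if row[0] + row[1] > 0 else 0.0 for row in X]

w, b = fit_l1_probe(X, y)
selected = select_neurons(w)
print("selected neurons:", selected)
# Hypothetical validation scores for three layers.
print("layer weights:", performance_weights([0.71, 0.88, 0.93]))
```

Raising `l1` shrinks more probe weights to exactly zero, which is what makes the selected neuron set small.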

	## Design

	SIREN is intended to be deployed as a safeguard model. It does not require access to the deployed model's internals. At inference time, SIREN feeds the same text — user prompt or model response — through its own frozen Qwen3-4B backbone, extracts the selected safety neurons, and returns a continuous harmfulness score in `[0, 1]`.

	This gives three practical advantages over generative guard models:
	- Single forward pass rather than autoregressive token generation (~4× lower FLOPs).
	- Continuous, threshold-tunable score rather than a discrete safe/unsafe token. The same artifact serves a strict child-safety threshold (e.g. 0.1) and a permissive red-team threshold (e.g. 0.9) without retraining.
	- Streaming detection for free by mean-pooling internal activations over any text prefix — no token-level supervised tuning required.
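
The streaming property rests on the fact that, in a causal language model, the activations of earlier tokens do not change when later tokens are appended, so a mean over any prefix can be updated incrementally. The sketch below illustrates only that pooling arithmetic on plain Python lists. `RunningMeanPool` is a hypothetical helper, not part of `siren_guard`, and how the library pools internally is not specified here.

```python
class RunningMeanPool:
    """Incrementally maintains the mean of per-token activation vectors."""

    def __init__(self, dim):
        self.total = [0.0] * dim  # running sum over all tokens seen so far
        self.count = 0            # number of tokens seen so far

    def update(self, token_vectors):
        """Add the vectors of newly generated tokens and return the prefix mean."""
        for vec in token_vectors:
            if len(vec) != len(self.total):
                raise ValueError("activation vector has the wrong dimension")
            for j, value in enumerate(vec):
                self.total[j] += value
            self.count += 1
        return self.mean()

    def mean(self):
        if self.count == 0:
            raise ValueError("no tokens pooled yet")
        return [t / self.count for t in self.total]


pool = RunningMeanPool(dim=2)
first = pool.update([[1.0, 2.0], [3.0, 4.0]])  # mean over a 2-token prefix
second = pool.update([[5.0, 6.0]])             # mean over the 3-token prefix
print(first, second)
```

Each update costs time proportional to the number of new tokens, not to the length of the whole prefix.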

	## Installation

	```bash
	pip install llm-siren
	```

	This installs the `siren_guard` runtime ([llm-siren on PyPI](https://pypi.org/project/llm-siren/)). The SIREN classifier head and the frozen Qwen3-4B backbone weights are pulled from Hugging Face on first use.

	## Quick start

	```python
	import torch
	from siren_guard import SirenGuard

	guard = SirenGuard.from_pretrained(
	    "UofTCSSLab/SIREN-Qwen3-4B",
	    device="cuda",
	    dtype=torch.bfloat16,
	)

	# Prompt-level moderation
	result = guard.score("How can I make a pipe bomb at home?")
	print(result.score, result.is_harmful)  # e.g. 1.0 True

	# Response-level moderation (prompt + response, joined as the model was trained)
	result = guard.score(
	    prompt="How can I make a pipe bomb at home?",
	    response="I can't help with that. Building explosive devices is illegal and dangerous.",
	)
	print(result.score, result.is_harmful)  # e.g. 0.0 False

	# Batch
	results = guard.score_batch([
	    "What is the capital of France?",
	    "Describe in detail how to commit insurance fraud.",
	])

	# Streaming: feed the growing assistant text after each generation chunk.
	# `stream_from_deployed_llm`, `user_prompt`, and `abort_generation` are
	# placeholders for your own serving code.
	prefix = ""
	for chunk in stream_from_deployed_llm(user_prompt):
	    prefix += chunk
	    if guard.score_streaming(prefix, threshold=0.5).is_harmful:
	        abort_generation()
	        break

	# Custom threshold (`text` is any string you want to score)
	strict = guard.score(text, threshold=0.1)  # block at 10% predicted harmfulness
	loose = guard.score(text, threshold=0.9)   # block only at 90%
	```

	## Deployment idiom

	```python
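	# `guard` is the SirenGuard instance from the quick start; DEFAULT_REFUSAL is
	# an application-defined string (the value below is only an example).
	DEFAULT_REFUSAL = "Sorry, I can't help with that."

	def safe_generate(user_prompt: str, deployed_llm) -> str:
	    # Screen the prompt before spending any generation compute.
	    if guard.score(user_prompt).is_harmful:
	        return DEFAULT_REFUSAL

	    response = deployed_llm.generate(user_prompt)

	    # Screen the response in the context of the prompt.
	    if guard.score(prompt=user_prompt, response=response).is_harmful:
	        return DEFAULT_REFUSAL

	    return response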
	```

	The deployed LLM (`deployed_llm`) can be any model.
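
Because the wrapper only depends on a `score` method and a `generate` method, its control flow can be exercised without a GPU or any model download. The sketch below does this with stand-ins. `KeywordStubGuard`, `EchoLLM`, the local `ScoreResult` dataclass, and the refusal string are all hypothetical test doubles written for this example; the keyword match is not a safety classifier, and the `>=` comparison against the threshold is an assumption.

```python
from dataclasses import dataclass

DEFAULT_REFUSAL = "Sorry, I can't help with that."  # example value


@dataclass
class ScoreResult:
    """Local stand-in mirroring the documented result fields."""
    score: float
    is_harmful: bool
    threshold: float


class KeywordStubGuard:
    """Test double with the documented `score` signature. NOT a safety classifier."""

    def __init__(self, blocked=("bomb",), threshold=0.5):
        self.blocked = blocked
        self.threshold = threshold

    def score(self, text=None, *, prompt=None, response=None, threshold=None):
        if text is None:
            text = f"{prompt}\n{response}"  # documented join for prompt/response
        limit = self.threshold if threshold is None else threshold
        value = 1.0 if any(word in text.lower() for word in self.blocked) else 0.0
        return ScoreResult(value, value >= limit, limit)  # `>=` is an assumption


class EchoLLM:
    """Stand-in for a deployed model."""

    def generate(self, prompt):
        return f"echo: {prompt}"


def safe_generate(user_prompt, deployed_llm, guard):
    """Same control flow as the idiom above, with the guard passed in explicitly."""
    if guard.score(user_prompt).is_harmful:
        return DEFAULT_REFUSAL
    response = deployed_llm.generate(user_prompt)
    if guard.score(prompt=user_prompt, response=response).is_harmful:
        return DEFAULT_REFUSAL
    return response


guard = KeywordStubGuard()
print(safe_generate("What is the capital of France?", EchoLLM(), guard))
print(safe_generate("How can I make a pipe bomb at home?", EchoLLM(), guard))
```

Passing the guard as an argument, rather than reading a module-level global, makes it straightforward to swap the stub for a real `SirenGuard` in production.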

	## API

	`SirenGuard.from_pretrained(repo_id_or_path, device=None, dtype=torch.bfloat16, cache_dir=None)`
	Loads the SIREN classifier head from the artifact and the frozen Qwen3-4B backbone from its pinned revision.

	`score(text=None, *, prompt=None, response=None, threshold=None) -> ScoreResult`
	Score a single string. Pass `text=` for raw moderation, or `prompt=`/`response=` for the response-level form (the library joins them with `"\n"`, matching the SIREN training distribution).

	`score_batch(texts, threshold=None) -> list[ScoreResult]`
	Score a list of strings in one forward pass.

	`score_streaming(response_so_far, threshold=None) -> ScoreResult`
	Score a growing assistant-side text prefix during generation. Returns the score for the prefix as a whole.

	Each call returns a `ScoreResult(score: float, is_harmful: bool, threshold: float)`.

	The default threshold is `0.5`, matching the binary decision boundary used during training. Tune it to your deployment's safety policy.
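
One way to tune the threshold is to score a labeled validation set and pick the lowest threshold that keeps the false-positive rate on benign examples under a budget. The sketch below shows that procedure on made-up scores. The helper names and the data are hypothetical, and it assumes a text is flagged when `score >= threshold`; whether the library uses a strict or non-strict comparison is not stated here.

```python
def false_positive_rate(scores, labels, threshold):
    """Fraction of benign examples (label 0) flagged at this threshold."""
    benign = [s for s, y in zip(scores, labels) if y == 0]
    if not benign:
        raise ValueError("need at least one benign example")
    return sum(s >= threshold for s in benign) / len(benign)


def recall(scores, labels, threshold):
    """Fraction of harmful examples (label 1) flagged at this threshold."""
    harmful = [s for s, y in zip(scores, labels) if y == 1]
    if not harmful:
        raise ValueError("need at least one harmful example")
    return sum(s >= threshold for s in harmful) / len(harmful)


def pick_threshold(scores, labels, max_fpr):
    """Lowest observed score that, used as threshold, keeps FPR within max_fpr."""
    for candidate in sorted(set(scores)):
        if false_positive_rate(scores, labels, candidate) <= max_fpr:
            return candidate  # lowest threshold gives the highest recall
    raise ValueError("no threshold meets the false-positive budget")


# Made-up validation scores: 0 = benign, 1 = harmful.
scores = [0.02, 0.10, 0.35, 0.60, 0.55, 0.80, 0.95, 0.99]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

for budget in (0.25, 0.0):
    t = pick_threshold(scores, labels, budget)
    print(f"max_fpr={budget}: threshold={t}, recall={recall(scores, labels, t)}")
```

A tighter false-positive budget pushes the threshold up and lowers recall, which is the trade-off between the strict and permissive settings.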

	## Artifact contents

	| File | Purpose |
	|------|---------|
	| `siren_config.json` | Pinned base-model revision, selected layers, layer weights, per-layer safety-neuron indices, MLP architecture, inference defaults. |
	| `siren.safetensors` | Trained MLP classifier weights (~14.0M params). |

	The Qwen3-4B backbone weights are not redistributed here; they are pulled from `Qwen/Qwen3-4B` at the pinned commit specified in `siren_config.json` on first use, then cached locally.
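
To check which backbone revision and layers an artifact pins, the config file can be fetched and inspected directly. The sketch below uses `hf_hub_download` from the `huggingface_hub` package to fetch `siren_config.json` and lists only its top-level keys, because the exact field names are not documented in this card. The helper names are hypothetical, and the download step needs network access.

```python
import json


def fetch_config_path(repo_id="UofTCSSLab/SIREN-Qwen3-4B"):
    """Download siren_config.json (or reuse the local cache) and return its path."""
    # Imported lazily so the other helpers work without huggingface_hub installed.
    from huggingface_hub import hf_hub_download
    return hf_hub_download(repo_id=repo_id, filename="siren_config.json")


def load_siren_config(path):
    """Parse the JSON config from a local file path."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def summarize_config(config):
    """Map each top-level key to the type name of its value, sorted by key."""
    return {key: type(value).__name__ for key, value in sorted(config.items())}


# Usage (needs network access and `pip install huggingface_hub`):
#   config = load_siren_config(fetch_config_path())
#   print(summarize_config(config))
```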

	## Reported performance

	Macro F1 on standard safeguard benchmarks:

	| ToxicChat | OpenAIMod | Aegis | Aegis 2 | WildGuard | SafeRLHF | BeaverTails | Avg. |
	|-----------|-----------|-------|---------|-----------|----------|-------------|------|
	| 83.5 | 91.2 | 82.9 | 83.4 | 88.3 | 93.2 | 84.3 | 86.7 |
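
For readers reproducing these numbers, macro F1 is the unweighted mean of the per-class F1 scores, so the benign and harmful classes count equally regardless of class balance. The sketch below computes it for binary labels and also checks that the reported average matches the seven per-benchmark values. The helper names and the toy predictions are made up, and it assumes binary safe/harmful labels.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score treating `positive` as the positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of the per-class F1 scores."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy predictions: 1 = harmful, 0 = benign.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
print(f"macro F1 = {macro_f1(y_true, y_pred):.4f}")

# Consistency check on the table above: mean of the seven benchmark scores.
reported = [83.5, 91.2, 82.9, 83.4, 88.3, 93.2, 84.3]
print(f"average = {sum(reported) / len(reported):.1f}")
```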

	## Citation

	```bibtex
	@article{jiao2026llm,
	  title={LLM Safety From Within: Detecting Harmful Content with Internal Representations},
	  author={Jiao, Difan and Liu, Yilun and Yuan, Ye and Tang, Zhenwei and Du, Linfeng and Wu, Haolun and Anderson, Ashton},
	  journal={arXiv preprint arXiv:2604.18519},
	  year={2026}
	}
	```