---
library_name: transformers
license: apache-2.0
tags:
- siren
- safety
- harmfulness-detection
- guard-model
- qwen3
base_model:
- Qwen/Qwen3-4B
---

# siren-qwen3-4b

Lightweight, plug-and-play guard model for harmfulness detection, built on top of a frozen `Qwen/Qwen3-4B` backbone. Implements **SIREN** ([LLM Safety From Within: Detecting Harmful Content with Internal Representations](https://arxiv.org/pdf/2604.18519), ACL 2026).

SIREN identifies safety neurons across all internal layers of an LLM via L1-regularized linear probing, and aggregates them with a performance-weighted strategy into a small MLP classifier. This artifact ships only the **trained classifier head** (~14.0M parameters); the frozen Qwen3-4B backbone is loaded from its official Hugging Face repository on first use.
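
To make the two steps above concrete, here is a minimal pure-Python sketch of neuron selection and layer weighting. It is an illustration under stated assumptions, not the paper's procedure: the helper names (`select_neurons`, `normalize_layer_scores`, `build_features`), the tolerance, and the choice to scale each layer's selected activations by a normalized validation score are all hypothetical.

```python
from typing import Dict, List


def select_neurons(probe_weights: List[float], tol: float = 1e-6) -> List[int]:
    """Keep the neurons whose L1-regularized probe weight did not shrink to zero."""
    return [i for i, w in enumerate(probe_weights) if abs(w) > tol]


def normalize_layer_scores(scores: Dict[int, float]) -> Dict[int, float]:
    """Turn per-layer probe validation scores into weights that sum to 1.

    Assumption: plain normalization; the actual weighting scheme may differ.
    """
    total = sum(scores.values())
    return {layer: s / total for layer, s in scores.items()}


def build_features(
    activations: Dict[int, List[float]],
    selected: Dict[int, List[int]],
    weights: Dict[int, float],
) -> List[float]:
    """Concatenate each layer's selected neurons, scaled by that layer's weight."""
    features: List[float] = []
    for layer in sorted(selected):
        w = weights[layer]
        features.extend(w * activations[layer][i] for i in selected[layer])
    return features
```

The resulting feature vector is what a small MLP classifier would consume in this sketch; the L1 penalty is what makes most probe weights exactly zero, so the selected set stays small.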

## Design

SIREN is intended to be deployed as a safeguard model. It does not require access to the deployed model's internals. At inference time, SIREN feeds the *same* text (user prompt or model response) through its own frozen Qwen3-4B backbone, extracts the selected safety neurons, and returns a continuous harmfulness score in `[0, 1]`.

This gives three practical advantages over generative guard models:

- **Single forward pass** rather than autoregressive token generation (~4× lower FLOPs).
- **Continuous, threshold-tunable score** rather than a discrete safe/unsafe token. The same artifact serves a strict child-safety threshold (e.g. 0.1) and a permissive red-team threshold (e.g. 0.9) without retraining.
- **Streaming detection for free** by mean-pooling internal activations over any text prefix; no token-level supervised tuning is required.
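
The streaming point rests on a simple property of mean-pooling: the pooled vector for a prefix depends only on the tokens seen so far, so any prefix can be scored with the same classifier. The sketch below shows that property with a running mean. `RunningMeanPool` is a hypothetical helper written for illustration; the card does not say whether `siren_guard` keeps running state or recomputes the prefix on each call.

```python
from typing import List


class RunningMeanPool:
    """Mean of per-token activation vectors over a growing prefix."""

    def __init__(self, dim: int) -> None:
        self._sum = [0.0] * dim
        self._count = 0

    def update(self, token_activation: List[float]) -> List[float]:
        """Add one token's activations and return the mean over the prefix."""
        if len(token_activation) != len(self._sum):
            raise ValueError("activation has the wrong dimensionality")
        self._sum = [s + a for s, a in zip(self._sum, token_activation)]
        self._count += 1
        return self.mean()

    def mean(self) -> List[float]:
        """Mean-pooled vector for all tokens seen so far."""
        if self._count == 0:
            raise ValueError("no tokens seen yet")
        return [s / self._count for s in self._sum]
```

Each call to `update` yields a vector of the same shape as a full-text pooled vector, which is why a classifier trained on whole texts can be applied to partial ones.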

## Installation

```bash
pip install llm-siren
```

This installs the `siren_guard` runtime ([llm-siren on PyPI](https://pypi.org/project/llm-siren/)). The frozen Qwen3-4B backbone weights are pulled from Hugging Face on first use.

## Quick start

```python
import torch
from siren_guard import SirenGuard

guard = SirenGuard.from_pretrained(
    "UofTCSSLab/SIREN-Qwen3-4B",
    device="cuda",
    dtype=torch.bfloat16,
)

# Prompt-level moderation
result = guard.score("How can I make a pipe bomb at home?")
print(result.score, result.is_harmful)  # e.g. 1.0 True

# Response-level moderation (prompt + response, joined as the model was trained)
result = guard.score(
    prompt="How can I make a pipe bomb at home?",
    response="I can't help with that. Building explosive devices is illegal and dangerous.",
)
print(result.score, result.is_harmful)  # e.g. 0.0 False

# Batch
results = guard.score_batch([
    "What is the capital of France?",
    "Describe in detail how to commit insurance fraud.",
])

# Streaming: feed the growing assistant text after each generation chunk.
# stream_from_deployed_llm, user_prompt and abort_generation are placeholders
# for your own serving code.
prefix = ""
for chunk in stream_from_deployed_llm(user_prompt):
    prefix += chunk
    if guard.score_streaming(prefix, threshold=0.5).is_harmful:
        abort_generation()
        break

# Custom threshold (text is any string you want to moderate)
strict = guard.score(text, threshold=0.1)  # block at 10% predicted harmfulness
loose = guard.score(text, threshold=0.9)   # block only at 90%
```
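
The streaming loop in the quick start can be packaged as a generator that stops forwarding chunks once the accumulated text is flagged. This is a sketch, not part of `siren_guard`: `guarded_stream` and its `check_every` parameter are hypothetical, and the harmfulness check is passed in as a plain callable so the wrapper stays independent of any particular guard.

```python
from typing import Callable, Iterable, Iterator


def guarded_stream(
    chunks: Iterable[str],
    is_harmful: Callable[[str], bool],
    check_every: int = 1,
) -> Iterator[str]:
    """Yield chunks until the accumulated prefix is flagged as harmful.

    The chunk that triggers the flag is withheld. With check_every > 1,
    chunks between checks have already been yielded before a flag is raised.
    """
    prefix = ""
    for n, chunk in enumerate(chunks, start=1):
        prefix += chunk
        if n % check_every == 0 and is_harmful(prefix):
            return
        yield chunk
```

With the guard loaded as in the quick start, the callable could be `lambda p: guard.score_streaming(p).is_harmful`. Raising `check_every` trades detection latency for fewer guard forward passes.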
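
## Deployment idiom

```python
# Placeholder refusal text; replace with your product's own message.
DEFAULT_REFUSAL = "Sorry, I can't help with that."


def safe_generate(user_prompt: str, deployed_llm) -> str:
    # `guard` is the SirenGuard instance loaded in the quick start.
    # Screen the prompt before spending any generation compute.
    if guard.score(user_prompt).is_harmful:
        return DEFAULT_REFUSAL
    response = deployed_llm.generate(user_prompt)
    # Screen the response in the context of the prompt that produced it.
    if guard.score(prompt=user_prompt, response=response).is_harmful:
        return DEFAULT_REFUSAL
    return response
```

The deployed LLM (`deployed_llm`) can be any model.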

## API

`SirenGuard.from_pretrained(repo_id_or_path, device=None, dtype=torch.bfloat16, cache_dir=None)`

Loads the SIREN classifier head from the artifact and the frozen Qwen3-4B backbone from its pinned revision.

`score(text=None, *, prompt=None, response=None, threshold=None) -> ScoreResult`

Score a single string. Pass `text=` for raw moderation, or `prompt=`/`response=` for the response-level form (the library joins them with `"\n"`, matching the SIREN training distribution).

`score_batch(texts, threshold=None) -> list[ScoreResult]`

Score a list of strings in one forward pass.

`score_streaming(response_so_far, threshold=None) -> ScoreResult`

Score a growing assistant-side text prefix during generation. Returns the score for the prefix as a whole.

Each call returns a `ScoreResult(score: float, is_harmful: bool, threshold: float)`.

The default threshold is `0.5`, matching the binary decision boundary used during training. Tune it to your deployment's safety policy.
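
One way to turn a safety policy into a threshold is to fix a minimum recall on known-harmful examples and take the most permissive threshold that still meets it. The sketch below does this on a labelled validation set. It is illustrative only: `threshold_for_recall` is a hypothetical helper, and it assumes a text is flagged when `score >= threshold`, which the API description does not spell out.

```python
import math
from typing import Sequence


def threshold_for_recall(
    scores: Sequence[float],
    labels: Sequence[bool],
    target_recall: float,
) -> float:
    """Largest threshold that still flags at least target_recall of harmful items.

    Assumes an item is flagged when score >= threshold.
    """
    if not 0.0 < target_recall <= 1.0:
        raise ValueError("target_recall must be in (0, 1]")
    harmful = sorted((s for s, y in zip(scores, labels) if y), reverse=True)
    if not harmful:
        raise ValueError("need at least one harmful example")
    # Number of harmful examples that must be flagged to reach the target.
    k = math.ceil(target_recall * len(harmful))
    # Flagging everything at or above the k-th highest harmful score catches k of them.
    return harmful[k - 1]
```

The scores would come from `guard.score_batch` on held-out data. Recall is only one possible criterion; a deployment that is more sensitive to false positives might instead constrain precision.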

## Artifact contents

| File | Purpose |
|------|---------|
| `siren_config.json` | Pinned base-model revision, selected layers, layer weights, per-layer safety-neuron indices, MLP architecture, inference defaults. |
| `siren.safetensors` | Trained MLP classifier weights (~14.0M params). |

The Qwen3-4B backbone weights are **not** redistributed here; they are pulled from `Qwen/Qwen3-4B` at the pinned commit specified in `siren_config.json` on first use, then cached locally.

## Reported performance

Macro F1 on standard safeguard benchmarks:

| ToxicChat | OpenAIMod | Aegis | Aegis 2 | WildGuard | SafeRLHF | BeaverTails | Avg. |
|-----------|-----------|-------|---------|-----------|----------|-------------|------|
| 83.5 | 91.2 | 82.9 | 83.4 | 88.3 | 93.2 | 84.3 | **86.7** |
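
For readers unfamiliar with the metric, macro F1 is the unweighted mean of the per-class F1 scores, so the safe and harmful classes count equally regardless of how imbalanced a benchmark is. The sketch below implements that standard definition in pure Python. It is not the authors' evaluation code, and the card does not state which threshold or label mapping was used for each benchmark.

```python
from typing import Hashable, Sequence


def f1_for_class(
    y_true: Sequence[Hashable], y_pred: Sequence[Hashable], positive: Hashable
) -> float:
    """F1 score treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over all classes that appear."""
    classes = set(y_true) | set(y_pred)
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)
```

The "Avg." column is the plain mean of the seven benchmark scores: `round(sum([83.5, 91.2, 82.9, 83.4, 88.3, 93.2, 84.3]) / 7, 1)` gives `86.7`.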

## Citation

```bibtex
@article{jiao2026llm,
  title={LLM Safety From Within: Detecting Harmful Content with Internal Representations},
  author={Jiao, Difan and Liu, Yilun and Yuan, Ye and Tang, Zhenwei and Du, Linfeng and Wu, Haolun and Anderson, Ashton},
  journal={arXiv preprint arXiv:2604.18519},
  year={2026}
}
```