UofTCSSLab/SIREN-Llama-3.2-1B

@@ -1,5 +1,7 @@
 ---
-library_name: transformers
 license: apache-2.0
 tags:
 - siren
@@ -7,13 +9,14 @@ tags:
 - harmfulness-detection
 - guard-model
 - llama
-base_model:
-- meta-llama/Llama-3.2-1B
 ---
-# siren-llama3.2-1b
-Lightweight, plug-and-play guard model for harmfulness detection, built on top of a frozen `meta-llama/Llama-3.2-1B` backbone. Implements **SIREN** ([LLM Safety From Within: Detecting Harmful Content with Internal Representations](https://arxiv.org/pdf/2604.18519), ACL 2026).
 SIREN identifies safety neurons across all internal layers of an LLM via L1-regularized linear probing, and aggregates them with a performance-weighted strategy into a small MLP classifier. This artifact ships only the **trained classifier head** (~5.4M parameters); the frozen Llama-3.2-1B backbone is loaded from its official Hugging Face repository on first use.
@@ -99,7 +102,8 @@ The deployed LLM (`deployed_llm`) can be any model.
 Loads the SIREN classifier head from the artifact and the frozen Llama-3.2-1B backbone from its pinned revision.
 `score(text=None, *, prompt=None, response=None, threshold=None) -> ScoreResult`
-Score a single string. Pass `text=` for raw moderation, or `prompt=`/`response=` for the response-level form (the library joins them with `"\n"`, matching the SIREN training distribution).
 `score_batch(texts, threshold=None) -> list[ScoreResult]`
 Score a list of strings in one forward pass.
@@ -133,8 +137,8 @@ Macro F1 on standard safeguard benchmarks:
 ```bibtex
 @article{jiao2026llm,
   title={LLM Safety From Within: Detecting Harmful Content with Internal Representations},
-  author={Jiao, Difan and Liu, Yilun and Yuan, Ye and Tang, Zhenwei and Du, Linfeng and Wu, Haolun and Anderson, Ashton},
   journal={arXiv preprint arXiv:2604.18519},
   year={2026}
 }
-```

 ---
+base_model:
+- meta-llama/Llama-3.2-1B
+pipeline_tag: text-classification
 license: apache-2.0
 tags:
 - siren
 - harmfulness-detection
 - guard-model
 - llama
 ---
+# SIREN-Llama-3.2-1B
+Lightweight, plug-and-play guard model for harmfulness detection, built on top of a frozen `meta-llama/Llama-3.2-1B` backbone.
+- **Paper:** [LLM Safety From Within: Detecting Harmful Content with Internal Representations](https://huggingface.co/papers/2604.18519) (ACL 2026)
+- **Code:** [GitHub Repository](https://github.com/CSSLab/SIREN)
 SIREN identifies safety neurons across all internal layers of an LLM via L1-regularized linear probing, and aggregates them with a performance-weighted strategy into a small MLP classifier. This artifact ships only the **trained classifier head** (~5.4M parameters); the frozen Llama-3.2-1B backbone is loaded from its official Hugging Face repository on first use.
 Loads the SIREN classifier head from the artifact and the frozen Llama-3.2-1B backbone from its pinned revision.
 `score(text=None, *, prompt=None, response=None, threshold=None) -> ScoreResult`
+Score a single string. Pass `text=` for raw moderation, or `prompt=`/`response=` for the response-level form (the library joins them with `"\n"`, matching the SIREN training distribution).
 `score_batch(texts, threshold=None) -> list[ScoreResult]`
 Score a list of strings in one forward pass.
 ```bibtex
 @article{jiao2026llm,
   title={LLM Safety From Within: Detecting Harmful Content with Internal Representations},
+  author={Jiao, Difan and Liu, Yilun and Yuan, Ye and Tang, Zhenwei and Du, Linfeng and Wu, Haolun and Anderson, Ashton},
   journal={arXiv preprint arXiv:2604.18519},
   year={2026}
 }
+```
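
The `score` description in this diff says the library joins `prompt=` and `response=` with `"\n"` before scoring. A minimal sketch of that input handling, for illustration only: `build_input` is a hypothetical helper and not part of the SIREN library, and the argument validation shown here is an assumption rather than documented behavior.

```python
from typing import Optional


def build_input(
    text: Optional[str] = None,
    *,
    prompt: Optional[str] = None,
    response: Optional[str] = None,
) -> str:
    """Hypothetical helper mirroring the documented `score` arguments.

    Returns the single string that would be passed to the classifier.
    """
    if text is not None:
        # Raw moderation form: the string is used as-is.
        # Assumption: mixing text= with prompt=/response= is rejected.
        if prompt is not None or response is not None:
            raise ValueError("pass either text= or prompt=/response=, not both")
        return text
    # Response-level form: both parts are needed (assumed validation).
    if prompt is None or response is None:
        raise ValueError("prompt= and response= must be given together")
    # Documented behavior: the two parts are joined with a newline.
    return prompt + "\n" + response


if __name__ == "__main__":
    print(repr(build_input("some text to moderate")))
    print(repr(build_input(prompt="user question", response="model answer")))
```

The point of the sketch is that the separator is the two-character escape `"\n"` in source, which becomes one newline character in the joined string. A literal line break inside the inline code span of the model card would hide that.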