Add pipeline tag and improve model card metadata

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +12 -8
README.md CHANGED
@@ -1,5 +1,7 @@
1
  ---
2
- library_name: transformers
 
 
3
  license: apache-2.0
4
  tags:
5
  - siren
@@ -7,13 +9,14 @@ tags:
7
  - harmfulness-detection
8
  - guard-model
9
  - llama
10
- base_model:
11
- - meta-llama/Llama-3.2-1B
12
  ---
13
 
14
- # siren-llama3.2-1b
 
 
15
 
16
- Lightweight, plug-and-play guard model for harmfulness detection, built on top of a frozen `meta-llama/Llama-3.2-1B` backbone. Implements **SIREN** ([LLM Safety From Within: Detecting Harmful Content with Internal Representations](https://arxiv.org/pdf/2604.18519), ACL 2026).
 
17
 
18
  SIREN identifies safety neurons across all internal layers of an LLM via L1-regularized linear probing, and aggregates them with a performance-weighted strategy into a small MLP classifier. This artifact ships only the **trained classifier head** (~5.4M parameters); the frozen Llama-3.2-1B backbone is loaded from its official Hugging Face repository on first use.
19
 
@@ -99,7 +102,8 @@ The deployed LLM (`deployed_llm`) can be any model.
99
  Loads the SIREN classifier head from the artifact and the frozen Llama-3.2-1B backbone from its pinned revision.
100
 
101
  `score(text=None, *, prompt=None, response=None, threshold=None) -> ScoreResult`
102
- Score a single string. Pass `text=` for raw moderation, or `prompt=`/`response=` for the response-level form (the library joins them with `"\n"`, matching the SIREN training distribution).
 
103
 
104
  `score_batch(texts, threshold=None) -> list[ScoreResult]`
105
  Score a list of strings in one forward pass.
@@ -133,8 +137,8 @@ Macro F1 on standard safeguard benchmarks:
133
  ```bibtex
134
  @article{jiao2026llm,
135
  title={LLM Safety From Within: Detecting Harmful Content with Internal Representations},
136
- author={Jiao, Difan and Liu, Yilun and Yuan, Ye and Tang, Zhenwei and Du, Linfeng and Wu, Haolun and Anderson, Ashton},
137
  journal={arXiv preprint arXiv:2604.18519},
138
  year={2026}
139
  }
140
- ```
 
1
  ---
2
+ base_model:
3
+ - meta-llama/Llama-3.2-1B
4
+ pipeline_tag: text-classification
5
  license: apache-2.0
6
  tags:
7
  - siren
 
9
  - harmfulness-detection
10
  - guard-model
11
  - llama
 
 
12
  ---
13
 
14
+ # SIREN-Llama-3.2-1B
15
+
16
+ Lightweight, plug-and-play guard model for harmfulness detection, built on top of a frozen `meta-llama/Llama-3.2-1B` backbone.
17
 
18
+ - **Paper:** [LLM Safety From Within: Detecting Harmful Content with Internal Representations](https://huggingface.co/papers/2604.18519) (ACL 2026)
19
+ - **Code:** [GitHub Repository](https://github.com/CSSLab/SIREN)
20
 
21
  SIREN identifies safety neurons across all internal layers of an LLM via L1-regularized linear probing, and aggregates them with a performance-weighted strategy into a small MLP classifier. This artifact ships only the **trained classifier head** (~5.4M parameters); the frozen Llama-3.2-1B backbone is loaded from its official Hugging Face repository on first use.
22
 
 
102
  Loads the SIREN classifier head from the artifact and the frozen Llama-3.2-1B backbone from its pinned revision.
103
 
104
  `score(text=None, *, prompt=None, response=None, threshold=None) -> ScoreResult`
105
+ Score a single string. Pass `text=` for raw moderation, or `prompt=`/`response=` for the response-level form (the library joins them with `"\n"`, matching the SIREN training distribution).
107
 
108
  `score_batch(texts, threshold=None) -> list[ScoreResult]`
109
  Score a list of strings in one forward pass.
 
137
  ```bibtex
138
  @article{jiao2026llm,
139
  title={LLM Safety From Within: Detecting Harmful Content with Internal Representations},
140
+ author={Jiao, Difan and Liu, Yilun and Yuan, Ye and Tang, Zhenwei and Du, Linfeng and Wu, Haolun and Anderson, Ashton},
141
  journal={arXiv preprint arXiv:2604.18519},
142
  year={2026}
143
  }
144
+ ```
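
For reference while reviewing the `score` description in this diff, here is a minimal sketch of the two input forms it mentions. It is not the SIREN library's code: `build_input` and `SEPARATOR` are hypothetical names, and the argument checks (rejecting mixed forms, requiring `prompt=` and `response=` together) are assumptions. The only detail taken from the README is that a prompt and response are joined with `"\n"`.

```python
from typing import Optional

# Separator stated in the README text; everything else below is assumed.
SEPARATOR = "\n"


def build_input(
    text: Optional[str] = None,
    *,
    prompt: Optional[str] = None,
    response: Optional[str] = None,
) -> str:
    """Hypothetical helper: return the single string a scorer would receive."""
    if text is not None:
        # Raw moderation form: the text is scored as given.
        if prompt is not None or response is not None:
            raise ValueError("pass either text= or prompt=/response=, not both")
        return text
    # Response-level form: assumed to need both parts.
    if prompt is None or response is None:
        raise ValueError("prompt= and response= must be given together")
    return prompt + SEPARATOR + response


if __name__ == "__main__":
    # repr() makes the escape sequence visible in the joined string.
    print(repr(build_input(prompt="How do I reset my password?",
                           response="Open Settings and choose Reset.")))
```

In the README source the separator should stay as the two-character escape `"\n"` inside the inline code span. A literal line break there splits the sentence across two lines and hides which separator is meant.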