synapti committed
Commit b58335a · verified · 1 parent(s): 7e6e910

Update model card with calibration config, ONNX docs, and corrected metrics

Files changed (1)
  1. README.md +94 -32
README.md CHANGED
@@ -1,5 +1,7 @@
 ---
 license: apache-2.0
+datasets:
+- synapti/nci-propaganda-production
 base_model: answerdotai/ModernBERT-base
 tags:
 - transformers
@@ -9,12 +11,8 @@ tags:
 - multi-label-classification
 - nci-protocol
 - semeval-2020
-datasets:
-- synapti/nci-propaganda-production
-metrics:
-- f1
-- precision
-- recall
+- onnx
+library_name: transformers
 pipeline_tag: text-classification
 ---
 
@@ -35,24 +33,24 @@ The classifier identifies **18 propaganda techniques** from the SemEval-2020 Tas
 
 | # | Technique | F1 Score | Optimal Threshold |
 |---|-----------|----------|-------------------|
-| 0 | Loaded_Language | 94.6% | 0.4 |
-| 1 | Appeal_to_fear-prejudice | 84.9% | 0.4 |
-| 2 | Exaggeration,Minimisation | 49.0% | 0.6 |
+| 0 | Loaded_Language | 95.3% | 0.3 |
+| 1 | Appeal_to_fear-prejudice | 85.1% | 0.3 |
+| 2 | Exaggeration,Minimisation | 49.0% | 0.4 |
 | 3 | Repetition | 55.9% | 0.4 |
 | 4 | Flag-Waving | 50.9% | 0.4 |
-| 5 | Name_Calling,Labeling | 44.5% | 0.2 |
+| 5 | Name_Calling,Labeling | 79.0% | 0.1 |
 | 6 | Reductio_ad_hitlerum | 82.4% | 0.3 |
-| 7 | Black-and-White_Fallacy | 68.8% | 0.6 |
-| 8 | Causal_Oversimplification | 67.9% | 0.5 |
-| 9 | Whataboutism,Straw_Men,Red_Herring | 47.7% | 0.4 |
-| 10 | Straw_Man | 60.3% | 0.4 |
+| 7 | Black-and-White_Fallacy | 68.8% | 0.5 |
+| 8 | Causal_Oversimplification | 67.9% | 0.4 |
+| 9 | Whataboutism,Straw_Men,Red_Herring | 47.7% | 0.3 |
+| 10 | Straw_Man | 60.3% | 0.5 |
 | 11 | Red_Herring | 86.3% | 0.5 |
-| 12 | Doubt | 34.4% | 0.3 |
-| 13 | Appeal_to_Authority | 50.0% | 0.5 |
+| 12 | Doubt | 63.4% | 0.3 |
+| 13 | Appeal_to_Authority | 50.0% | 0.3 |
 | 14 | Thought-terminating_Cliches | 71.2% | 0.5 |
 | 15 | Bandwagon | 46.7% | 0.5 |
-| 16 | Slogans | 46.0% | 0.4 |
-| 17 | Obfuscation,Intentional_Vagueness,Confusion | 86.3% | 0.4 |
+| 16 | Slogans | 46.0% | 0.3 |
+| 17 | Obfuscation,Intentional_Vagueness,Confusion | 86.3% | 0.5 |
 
 ## Performance
 
@@ -60,10 +58,9 @@ The classifier identifies **18 propaganda techniques** from the SemEval-2020 Tas
 
 | Metric | Default (0.5) | Optimized Thresholds |
 |--------|--------------|---------------------|
-| Micro F1 | 72.7% | **80.0%** |
-| Macro F1 | 62.6% | **69.0%** |
-| Micro Precision | 87.9% | - |
-| Micro Recall | 62.1% | - |
+| Micro F1 | 72.7% | **80.3%** |
+| Macro F1 | 62.5% | **68.3%** |
+| ECE (Calibration Error) | - | **0.0096** |
 
 ## Usage
 
@@ -87,22 +84,26 @@ for d in detected:
     print(f"{d['label']}: {d['score']:.2%}")
 ```
 
-### With Optimized Thresholds
+### With Calibration Config (Recommended)
+
+The model includes a `calibration_config.json` file with optimized per-technique thresholds and temperature scaling for better calibrated confidence scores.
 
 ```python
 import json
 from transformers import pipeline
 from huggingface_hub import hf_hub_download
 
-# Load optimal thresholds
-thresholds_path = hf_hub_download(
+# Load calibration config
+config_path = hf_hub_download(
     repo_id="synapti/nci-technique-classifier",
-    filename="optimal_thresholds.json"
+    filename="calibration_config.json"
 )
-with open(thresholds_path) as f:
+with open(config_path) as f:
     config = json.load(f)
-thresholds = config["thresholds"]
-labels = config["labels"]
+
+temperature = config["temperature"]  # 0.75
+thresholds = config["thresholds"]
+labels = config["technique_labels"]
 
 classifier = pipeline(
     "text-classification",
@@ -118,11 +119,42 @@ detected = []
 for r in results:
     idx = int(r["label"].split("_")[1])
     technique = labels[idx]
-    threshold = thresholds[technique]
+    threshold = thresholds.get(technique, 0.5)
     if r["score"] > threshold:
         detected.append((technique, r["score"]))
 ```
 
+### ONNX Inference (Faster)
+
+The model is also available in ONNX format for optimized inference:
+
+```python
+import onnxruntime as ort
+from transformers import AutoTokenizer
+from huggingface_hub import hf_hub_download
+import numpy as np
+
+# Download ONNX model
+onnx_path = hf_hub_download(
+    repo_id="synapti/nci-technique-classifier",
+    filename="onnx/model.onnx"
+)
+
+# Load tokenizer and ONNX session
+tokenizer = AutoTokenizer.from_pretrained("synapti/nci-technique-classifier")
+session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
+
+# Inference
+text = "Your text here..."
+inputs = tokenizer(text, padding="max_length", truncation=True, max_length=512, return_tensors="np")
+onnx_inputs = {
+    "input_ids": inputs["input_ids"],
+    "attention_mask": inputs["attention_mask"],
+}
+logits = session.run(None, onnx_inputs)[0]
+probs = 1 / (1 + np.exp(-logits))  # Sigmoid for multi-label
+```
+
 ### Two-Stage Pipeline
 
 ```python
@@ -135,6 +167,27 @@ print(f"Has propaganda: {result.has_propaganda}")
 print(f"Techniques: {[t.name for t in result.techniques]}")
 ```
 
+## Calibration Config
+
+The `calibration_config.json` file contains:
+
+```json
+{
+  "temperature": 0.75,
+  "thresholds": {
+    "Loaded_Language": 0.3,
+    "Appeal_to_fear-prejudice": 0.3,
+    "Name_Calling,Labeling": 0.1,
+    ...
+  },
+  "metrics": {
+    "ece": 0.0096,
+    "micro_f1_optimized": 0.803,
+    "macro_f1_optimized": 0.683
+  }
+}
+```
+
 ## Training Data
 
 Trained on [synapti/nci-propaganda-production](https://huggingface.co/datasets/synapti/nci-propaganda-production):
@@ -150,7 +203,16 @@ Trained on [synapti/nci-propaganda-production](https://huggingface.co/datasets/s
 - **Parameters**: 149.6M
 - **Max Sequence Length**: 512 tokens
 - **Output**: 18 labels (multi-label sigmoid)
-- **Calibration Temperature**: 3.0
+- **Calibration Temperature**: 0.75
+
+## Available Files
+
+| File | Description |
+|------|-------------|
+| `model.safetensors` | PyTorch model weights |
+| `calibration_config.json` | Optimized thresholds & temperature |
+| `onnx/model.onnx` | ONNX model for fast inference |
+| `config.json` | Model configuration |
 
 ## Training Details
 
@@ -179,7 +241,7 @@ Trained on [synapti/nci-propaganda-production](https://huggingface.co/datasets/s
 ```bibtex
 @inproceedings{da-san-martino-etal-2020-semeval,
   title = "{S}em{E}val-2020 Task 11: Detection of Propaganda Techniques in News Articles",
-  author = "Da San Martino, Giovanni and others",
+  author = "Da San Martino, Giovanni and others",
   booktitle = "Proceedings of SemEval-2020",
   year = "2020",
 }
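
The updated card loads `temperature` from `calibration_config.json` but the snippets never apply it to the logits. Under the usual temperature-scaling convention, logits are divided by the temperature before the sigmoid, and the per-technique thresholds are then compared against the scaled probabilities. A minimal sketch of that step (the helper names and toy logits below are illustrative, not from the repository):

```python
import numpy as np

def calibrated_probs(logits, temperature=0.75):
    """Sigmoid over temperature-scaled logits (multi-label)."""
    logits = np.asarray(logits, dtype=np.float64)
    return 1.0 / (1.0 + np.exp(-logits / temperature))

def detect(probs, labels, thresholds, default=0.5):
    """Return (label, prob) pairs whose probability clears the per-label threshold."""
    return [(lab, float(p)) for lab, p in zip(labels, probs)
            if p > thresholds.get(lab, default)]

# Toy example with two labels and made-up logits
labels = ["Loaded_Language", "Doubt"]
thresholds = {"Loaded_Language": 0.3, "Doubt": 0.3}
probs = calibrated_probs([0.2, -2.0], temperature=0.75)
print(detect(probs, labels, thresholds))
```

With a temperature below 1, scaling sharpens the probabilities (pushes them away from 0.5), which is consistent with the low ECE reported alongside the per-technique thresholds.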