synapti committed
Commit b58335a · verified · 1 parent(s): 7e6e910

Update model card with calibration config, ONNX docs, and corrected metrics

Files changed (1)
  1. README.md +94 -32
README.md CHANGED
@@ -1,5 +1,7 @@
 ---
 license: apache-2.0
+datasets:
+- synapti/nci-propaganda-production
 base_model: answerdotai/ModernBERT-base
 tags:
 - transformers
@@ -9,12 +11,8 @@ tags:
 - multi-label-classification
 - nci-protocol
 - semeval-2020
-datasets:
-- synapti/nci-propaganda-production
-metrics:
-- f1
-- precision
-- recall
+- onnx
+library_name: transformers
 pipeline_tag: text-classification
 ---
 
@@ -35,24 +33,24 @@ The classifier identifies **18 propaganda techniques** from the SemEval-2020 Tas
 
 | # | Technique | F1 Score | Optimal Threshold |
 |---|-----------|----------|-------------------|
-| 0 | Loaded_Language | 94.6% | 0.4 |
-| 1 | Appeal_to_fear-prejudice | 84.9% | 0.4 |
-| 2 | Exaggeration,Minimisation | 49.0% | 0.6 |
+| 0 | Loaded_Language | 95.3% | 0.3 |
+| 1 | Appeal_to_fear-prejudice | 85.1% | 0.3 |
+| 2 | Exaggeration,Minimisation | 49.0% | 0.4 |
 | 3 | Repetition | 55.9% | 0.4 |
 | 4 | Flag-Waving | 50.9% | 0.4 |
-| 5 | Name_Calling,Labeling | 44.5% | 0.2 |
+| 5 | Name_Calling,Labeling | 79.0% | 0.1 |
 | 6 | Reductio_ad_hitlerum | 82.4% | 0.3 |
-| 7 | Black-and-White_Fallacy | 68.8% | 0.6 |
-| 8 | Causal_Oversimplification | 67.9% | 0.5 |
-| 9 | Whataboutism,Straw_Men,Red_Herring | 47.7% | 0.4 |
-| 10 | Straw_Man | 60.3% | 0.4 |
+| 7 | Black-and-White_Fallacy | 68.8% | 0.5 |
+| 8 | Causal_Oversimplification | 67.9% | 0.4 |
+| 9 | Whataboutism,Straw_Men,Red_Herring | 47.7% | 0.3 |
+| 10 | Straw_Man | 60.3% | 0.5 |
 | 11 | Red_Herring | 86.3% | 0.5 |
-| 12 | Doubt | 34.4% | 0.3 |
-| 13 | Appeal_to_Authority | 50.0% | 0.5 |
+| 12 | Doubt | 63.4% | 0.3 |
+| 13 | Appeal_to_Authority | 50.0% | 0.3 |
 | 14 | Thought-terminating_Cliches | 71.2% | 0.5 |
 | 15 | Bandwagon | 46.7% | 0.5 |
-| 16 | Slogans | 46.0% | 0.4 |
-| 17 | Obfuscation,Intentional_Vagueness,Confusion | 86.3% | 0.4 |
+| 16 | Slogans | 46.0% | 0.3 |
+| 17 | Obfuscation,Intentional_Vagueness,Confusion | 86.3% | 0.5 |
 
 ## Performance
 
@@ -60,10 +58,9 @@ The classifier identifies **18 propaganda techniques** from the SemEval-2020 Tas
 
 | Metric | Default (0.5) | Optimized Thresholds |
 |--------|--------------|---------------------|
-| Micro F1 | 72.7% | **80.0%** |
-| Macro F1 | 62.6% | **69.0%** |
-| Micro Precision | 87.9% | - |
-| Micro Recall | 62.1% | - |
+| Micro F1 | 72.7% | **80.3%** |
+| Macro F1 | 62.5% | **68.3%** |
+| ECE (Calibration Error) | - | **0.0096** |
 
 ## Usage
 
@@ -87,22 +84,26 @@ for d in detected:
     print(f"{d['label']}: {d['score']:.2%}")
 ```
 
-### With Optimized Thresholds
+### With Calibration Config (Recommended)
+
+The model includes a `calibration_config.json` file with optimized per-technique thresholds and temperature scaling for better calibrated confidence scores.
 
 ```python
 import json
 from transformers import pipeline
 from huggingface_hub import hf_hub_download
 
-# Load optimal thresholds
-thresholds_path = hf_hub_download(
+# Load calibration config
+config_path = hf_hub_download(
     repo_id="synapti/nci-technique-classifier",
-    filename="optimal_thresholds.json"
+    filename="calibration_config.json"
 )
-with open(thresholds_path) as f:
+with open(config_path) as f:
     config = json.load(f)
-thresholds = config["thresholds"]
-labels = config["labels"]
+
+temperature = config["temperature"]  # 0.75
+thresholds = config["thresholds"]
+labels = config["technique_labels"]
 
 classifier = pipeline(
     "text-classification",
@@ -118,11 +119,42 @@ detected = []
 for r in results:
     idx = int(r["label"].split("_")[1])
     technique = labels[idx]
-    threshold = thresholds[technique]
+    threshold = thresholds.get(technique, 0.5)
     if r["score"] > threshold:
         detected.append((technique, r["score"]))
 ```
 
+### ONNX Inference (Faster)
+
+The model is also available in ONNX format for optimized inference:
+
+```python
+import onnxruntime as ort
+from transformers import AutoTokenizer
+from huggingface_hub import hf_hub_download
+import numpy as np
+
+# Download ONNX model
+onnx_path = hf_hub_download(
+    repo_id="synapti/nci-technique-classifier",
+    filename="onnx/model.onnx"
+)
+
+# Load tokenizer and ONNX session
+tokenizer = AutoTokenizer.from_pretrained("synapti/nci-technique-classifier")
+session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
+
+# Inference
+text = "Your text here..."
+inputs = tokenizer(text, padding="max_length", truncation=True, max_length=512, return_tensors="np")
+onnx_inputs = {
+    "input_ids": inputs["input_ids"],
+    "attention_mask": inputs["attention_mask"],
+}
+logits = session.run(None, onnx_inputs)[0]
+probs = 1 / (1 + np.exp(-logits))  # Sigmoid for multi-label
+```
+
 ### Two-Stage Pipeline
 
 ```python
@@ -135,6 +167,27 @@ print(f"Has propaganda: {result.has_propaganda}")
 print(f"Techniques: {[t.name for t in result.techniques]}")
 ```
 
+## Calibration Config
+
+The `calibration_config.json` file contains:
+
+```json
+{
+  "temperature": 0.75,
+  "thresholds": {
+    "Loaded_Language": 0.3,
+    "Appeal_to_fear-prejudice": 0.3,
+    "Name_Calling,Labeling": 0.1,
+    ...
+  },
+  "metrics": {
+    "ece": 0.0096,
+    "micro_f1_optimized": 0.803,
+    "macro_f1_optimized": 0.683
+  }
+}
+```
+
 ## Training Data
 
 Trained on [synapti/nci-propaganda-production](https://huggingface.co/datasets/synapti/nci-propaganda-production):
@@ -150,7 +203,16 @@ Trained on [synapti/nci-propaganda-production](https://huggingface.co/datasets/s
 - **Parameters**: 149.6M
 - **Max Sequence Length**: 512 tokens
 - **Output**: 18 labels (multi-label sigmoid)
-- **Calibration Temperature**: 3.0
+- **Calibration Temperature**: 0.75
+
+## Available Files
+
+| File | Description |
+|------|-------------|
+| `model.safetensors` | PyTorch model weights |
+| `calibration_config.json` | Optimized thresholds & temperature |
+| `onnx/model.onnx` | ONNX model for fast inference |
+| `config.json` | Model configuration |
 
 ## Training Details
 
@@ -179,7 +241,7 @@ Trained on [synapti/nci-propaganda-production](https://huggingface.co/datasets/s
 ```bibtex
 @inproceedings{da-san-martino-etal-2020-semeval,
   title = "{S}em{E}val-2020 Task 11: Detection of Propaganda Techniques in News Articles",
-  author = "Da San Martino, Giovanni and others",
+  author = "Da San Martino, Giovanni and others",
   booktitle = "Proceedings of SemEval-2020",
   year = "2020",
 }
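
The updated card loads `temperature` from `calibration_config.json` but the snippets never apply it to the logits. Under the usual temperature-scaling convention, logits are divided by the temperature before the sigmoid, and the per-technique thresholds are then compared against the scaled probabilities. A minimal sketch of that step (the helper names and toy logits below are illustrative, not from the repository):

```python
import numpy as np

def calibrated_probs(logits, temperature=0.75):
    """Sigmoid over temperature-scaled logits (multi-label)."""
    logits = np.asarray(logits, dtype=np.float64)
    return 1.0 / (1.0 + np.exp(-logits / temperature))

def detect(probs, labels, thresholds, default=0.5):
    """Return (label, prob) pairs whose probability clears the per-label threshold."""
    return [(lab, float(p)) for lab, p in zip(labels, probs)
            if p > thresholds.get(lab, default)]

# Toy example with two labels and made-up logits
labels = ["Loaded_Language", "Doubt"]
thresholds = {"Loaded_Language": 0.3, "Doubt": 0.3}
probs = calibrated_probs([0.2, -2.0], temperature=0.75)
print(detect(probs, labels, thresholds))
```

With a temperature below 1, scaling sharpens the probabilities (pushes them away from 0.5), which is consistent with the low ECE reported alongside the per-technique thresholds.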