	---
	license: apache-2.0
	datasets:
	- synapti/nci-propaganda-production
	base_model: answerdotai/ModernBERT-base
	tags:
	- transformers
	- modernbert
	- text-classification
	- propaganda-detection
	- multi-label-classification
	- nci-protocol
	- semeval-2020
	- onnx
	library_name: transformers
	pipeline_tag: text-classification
	---

	# NCI Technique Classifier

	Multi-label classifier that identifies specific propaganda techniques in text.

	## Model Description

	This model is Stage 2 of the NCI (Narrative Credibility Index) two-stage propaganda detection pipeline:

	- Stage 1: Fast binary detection - "Does this text contain propaganda?"
	- Stage 2 (this model): Multi-label technique classification - "Which specific techniques are used?"

	The classifier scores 18 propaganda technique labels based on the SemEval-2020 Task 11 taxonomy.

	## Propaganda Techniques

	| # | Technique | F1 Score | Optimal Threshold |
	|---|-----------|----------|-------------------|
	| 0 | Loaded_Language | 95.3% | 0.3 |
	| 1 | Appeal_to_fear-prejudice | 85.1% | 0.3 |
	| 2 | Exaggeration,Minimisation | 49.0% | 0.4 |
	| 3 | Repetition | 55.9% | 0.4 |
	| 4 | Flag-Waving | 50.9% | 0.4 |
	| 5 | Name_Calling,Labeling | 79.0% | 0.1 |
	| 6 | Reductio_ad_hitlerum | 82.4% | 0.3 |
	| 7 | Black-and-White_Fallacy | 68.8% | 0.5 |
	| 8 | Causal_Oversimplification | 67.9% | 0.4 |
	| 9 | Whataboutism,Straw_Men,Red_Herring | 47.7% | 0.3 |
	| 10 | Straw_Man | 60.3% | 0.5 |
	| 11 | Red_Herring | 86.3% | 0.5 |
	| 12 | Doubt | 63.4% | 0.3 |
	| 13 | Appeal_to_Authority | 50.0% | 0.3 |
	| 14 | Thought-terminating_Cliches | 71.2% | 0.5 |
	| 15 | Bandwagon | 46.7% | 0.5 |
	| 16 | Slogans | 46.0% | 0.3 |
	| 17 | Obfuscation,Intentional_Vagueness,Confusion | 86.3% | 0.5 |
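
The thresholds in the table can be applied to a vector of 18 per-label scores. The following is a minimal sketch, assuming the scores arrive in label-index order (0 to 17); the names `THRESHOLDS`, `TECHNIQUES`, and `detect` are illustrative and not part of the model repository.

```python
# Per-technique thresholds copied from the table, in label-index order (0..17).
THRESHOLDS = {
    "Loaded_Language": 0.3,
    "Appeal_to_fear-prejudice": 0.3,
    "Exaggeration,Minimisation": 0.4,
    "Repetition": 0.4,
    "Flag-Waving": 0.4,
    "Name_Calling,Labeling": 0.1,
    "Reductio_ad_hitlerum": 0.3,
    "Black-and-White_Fallacy": 0.5,
    "Causal_Oversimplification": 0.4,
    "Whataboutism,Straw_Men,Red_Herring": 0.3,
    "Straw_Man": 0.5,
    "Red_Herring": 0.5,
    "Doubt": 0.3,
    "Appeal_to_Authority": 0.3,
    "Thought-terminating_Cliches": 0.5,
    "Bandwagon": 0.5,
    "Slogans": 0.3,
    "Obfuscation,Intentional_Vagueness,Confusion": 0.5,
}
# Dicts preserve insertion order, so position i corresponds to label index i.
TECHNIQUES = list(THRESHOLDS)


def detect(scores):
    """Return (technique, score) pairs whose score exceeds that technique's threshold."""
    if len(scores) != len(TECHNIQUES):
        raise ValueError(f"expected {len(TECHNIQUES)} scores, got {len(scores)}")
    return [
        (name, score)
        for name, score in zip(TECHNIQUES, scores)
        if score > THRESHOLDS[name]
    ]


# Hypothetical scores: 0.2 clears the 0.1 threshold for Name_Calling,Labeling,
# while 0.45 stays below the 0.5 threshold for Black-and-White_Fallacy.
example_scores = [0.0] * 18
example_scores[5] = 0.2
example_scores[7] = 0.45
print(detect(example_scores))
```

Because the thresholds differ per label, the same score can count as a detection for one technique and not for another.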

	## Performance

	Test Set Results (1,729 samples):

	| Metric | Default (0.5) | Optimized Thresholds |
	|--------|--------------|---------------------|
	| Micro F1 | 72.7% | 80.3% |
	| Macro F1 | 62.5% | 68.3% |
	| ECE (Calibration Error) | - | 0.0096 |
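
Micro F1 pools true and false positives across all labels before computing F1, so frequent techniques dominate it; macro F1 averages the per-label F1 scores, so rare techniques count equally. The sketch below illustrates the difference on multi-hot arrays. It is not the evaluation script behind the numbers above, and treating a label with no positives and no predictions as F1 = 0 is an assumption.

```python
import numpy as np


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for multi-hot arrays of shape (samples, labels)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)

    # Per-label confusion counts.
    tp = (y_true & y_pred).sum(axis=0).astype(float)
    fp = (~y_true & y_pred).sum(axis=0).astype(float)
    fn = (y_true & ~y_pred).sum(axis=0).astype(float)

    def f1(tp, fp, fn):
        denom = 2 * tp + fp + fn
        # Assumed convention: F1 is 0 when there is nothing to score.
        return np.where(denom > 0, 2 * tp / np.maximum(denom, 1), 0.0)

    micro = float(f1(tp.sum(), fp.sum(), fn.sum()))  # pool counts, then score
    macro = float(f1(tp, fp, fn).mean())             # score per label, then average
    return micro, macro


# Toy data: label 0 is frequent and always right, label 1 is rare and always wrong.
y_true = [[1, 0], [1, 0], [1, 1], [0, 0]]
y_pred = [[1, 0], [1, 0], [1, 0], [0, 1]]
micro, macro = f1_scores(y_true, y_pred)
print(f"micro={micro:.2f} macro={macro:.2f}")
```

In the toy data the frequent label lifts micro F1 above macro F1, the same direction as the gap reported in the table.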

	## Usage

	### Basic Usage

	```python
	from transformers import pipeline

	classifier = pipeline(
	    "text-classification",
	    model="synapti/nci-technique-classifier",
	    top_k=None,  # Return scores for all labels
	)

	text = "The radical left is DESTROYING our country!"
	results = classifier(text)[0]

	# Get detected techniques (using default 0.5 threshold)
	detected = [r for r in results if r["score"] > 0.5]
	for d in detected:
	    print(f"{d['label']}: {d['score']:.2%}")
	```

	### With Calibration Config (Recommended)

	The model repository includes a `calibration_config.json` file with optimized per-technique thresholds and a temperature-scaling value for better-calibrated confidence scores.

	```python
	import json
	from transformers import pipeline
	from huggingface_hub import hf_hub_download

	# Load calibration config
	config_path = hf_hub_download(
	    repo_id="synapti/nci-technique-classifier",
	    filename="calibration_config.json",
	)
	with open(config_path) as f:
	    config = json.load(f)

	temperature = config["temperature"] # 0.75
	thresholds = config["thresholds"]
	labels = config["technique_labels"]

	classifier = pipeline(
	    "text-classification",
	    model="synapti/nci-technique-classifier",
	    top_k=None,
	)

	text = "Your text here..."
	results = classifier(text)[0]

	# Apply per-technique thresholds
	detected = []
	for r in results:
	    # Assumes pipeline labels have the form "LABEL_<index>"
	    idx = int(r["label"].split("_")[1])
	    technique = labels[idx]
	    threshold = thresholds.get(technique, 0.5)  # Fall back to 0.5 if unlisted
	    if r["score"] > threshold:
	        detected.append((technique, r["score"]))
	```
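
The snippet above loads `temperature` but never applies it. A minimal sketch of standard temperature scaling on pipeline scores follows, under two assumptions that the card does not confirm: the pipeline scores are plain sigmoid outputs, and the config value is meant to divide the logits before the sigmoid. The card also does not state whether the published thresholds were tuned on raw or rescaled scores, so check that before combining the two. The helper name `rescale_probability` is illustrative.

```python
import math


def rescale_probability(p, temperature, eps=1e-7):
    """Map a sigmoid probability back to a logit, divide by the temperature, re-apply sigmoid."""
    p = min(max(p, eps), 1.0 - eps)   # keep log() finite at 0 and 1
    logit = math.log(p / (1.0 - p))   # inverse sigmoid
    return 1.0 / (1.0 + math.exp(-logit / temperature))


# With a temperature below 1, scores move away from 0.5; 0.5 itself is unchanged.
for raw in (0.2, 0.5, 0.8):
    print(raw, round(rescale_probability(raw, temperature=0.75), 4))
```

Working in logit space is necessary because dividing a probability by the temperature directly would not correspond to temperature scaling.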

	### ONNX Inference (Faster)

	The model is also available in ONNX format for optimized inference:

	```python
	import onnxruntime as ort
	from transformers import AutoTokenizer
	from huggingface_hub import hf_hub_download
	import numpy as np

	# Download ONNX model
	onnx_path = hf_hub_download(
	    repo_id="synapti/nci-technique-classifier",
	    filename="onnx/model.onnx",
	)

	# Load tokenizer and ONNX session
	tokenizer = AutoTokenizer.from_pretrained("synapti/nci-technique-classifier")
	session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])

	# Inference
	text = "Your text here..."
	inputs = tokenizer(text, padding="max_length", truncation=True, max_length=512, return_tensors="np")
	onnx_inputs = {
	    "input_ids": inputs["input_ids"],
	    "attention_mask": inputs["attention_mask"],
	}
	logits = session.run(None, onnx_inputs)[0]
	probs = 1 / (1 + np.exp(-logits)) # Sigmoid for multi-label
	```
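
The plain `1 / (1 + np.exp(-logits))` expression gives correct values, but NumPy can emit overflow warnings when a logit is strongly negative. The sketch below shows one common numerically stable alternative; `stable_sigmoid` is an illustrative helper, not something shipped with the model, and `scipy.special.expit` provides the same function if SciPy is available.

```python
import numpy as np


def stable_sigmoid(logits):
    """Element-wise sigmoid that never exponentiates a large positive number."""
    logits = np.asarray(logits, dtype=np.float64)
    out = np.empty_like(logits)
    pos = logits >= 0
    # For x >= 0, exp(-x) is at most 1.
    out[pos] = 1.0 / (1.0 + np.exp(-logits[pos]))
    # For x < 0, rewrite sigmoid(x) as exp(x) / (1 + exp(x)), where exp(x) < 1.
    exp_x = np.exp(logits[~pos])
    out[~pos] = exp_x / (1.0 + exp_x)
    return out


# Hypothetical logits with the (batch, labels) layout used above.
example_logits = np.array([[0.0, 1000.0, -1000.0]])
print(stable_sigmoid(example_logits))
```

The output keeps the input shape, so it can replace `probs` in the example above without other changes.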

	### Two-Stage Pipeline

	For best results, run the binary detector first and classify techniques only for text it flags:

	```python
	from transformers import pipeline

	# Stage 1: Binary detection (fast filter)
	detector = pipeline("text-classification", model="synapti/nci-binary-detector")

	# Stage 2: Technique classification
	classifier = pipeline("text-classification", model="synapti/nci-technique-classifier", top_k=None)

	text = "Your text to analyze..."

	# Quick check first
	detection = detector(text)[0]
	if detection["label"] == "has_propaganda" and detection["score"] > 0.5:
	    # Detailed technique analysis
	    techniques = classifier(text)[0]
	    detected = [t for t in techniques if t["score"] > 0.3]
	    for t in detected:
	        print(f"{t['label']}: {t['score']:.2%}")
	else:
	    print("No propaganda detected")
	```
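
The gating logic above can be wrapped in a reusable function that takes both pipelines as arguments and uses per-technique thresholds instead of a flat 0.3. This is a sketch rather than an official API: the function name `analyze`, its parameters, and keying `thresholds` by whatever label string the classifier emits are all assumptions.

```python
def analyze(text, detector, classifier, thresholds, default_threshold=0.5, gate=0.5):
    """Run Stage 1, and only if it flags the text, return Stage 2 detections.

    `detector` and `classifier` are callables with the same return shapes as the
    pipelines above; `thresholds` maps classifier label strings to cutoffs.
    """
    detection = detector(text)[0]
    if detection["label"] != "has_propaganda" or detection["score"] <= gate:
        return []  # Stage 1 found nothing, so Stage 2 is skipped entirely

    scores = classifier(text)[0]
    return [
        (s["label"], s["score"])
        for s in scores
        if s["score"] > thresholds.get(s["label"], default_threshold)
    ]
```

Passing the pipelines in as arguments keeps the function easy to test with stub callables and avoids reloading the models on every call.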

	## Calibration Config

	The `calibration_config.json` file contains:

	```json
	{
	  "temperature": 0.75,
	  "thresholds": {
	    "Loaded_Language": 0.3,
	    "Appeal_to_fear-prejudice": 0.3,
	    "Name_Calling,Labeling": 0.1,
	    ...
	  },
	  "metrics": {
	    "ece": 0.0096,
	    "micro_f1_optimized": 0.803,
	    "macro_f1_optimized": 0.683
	  }
	}
	```
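
The `ece` entry is an expected calibration error: predictions are grouped into confidence bins, and the gap between mean predicted probability and observed positive rate is averaged across bins, weighted by bin size. The sketch below shows one common binned formulation over flattened per-label probabilities. How the reported 0.0096 was actually computed (bin count, per-label versus pooled) is not stated in this card, so the details here are assumptions.

```python
import numpy as np


def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE over flattened probabilities and 0/1 labels."""
    probs = np.asarray(probs, dtype=float).ravel()
    labels = np.asarray(labels, dtype=float).ravel()

    # Interior bin edges (0.1, 0.2, ..., 0.9 for 10 bins) give bin ids 0..n_bins-1.
    interior_edges = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    bin_ids = np.digitize(probs, interior_edges)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(probs[mask].mean() - labels[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of predictions in the bin
    return float(ece)


# Five predictions at 0.8, four of which are positive: perfectly calibrated.
print(expected_calibration_error([0.8] * 5, [1, 1, 1, 1, 0]))
```

A value near zero means the scores can be read roughly as probabilities; it says nothing about how well the labels are separated.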

	## Training Data

	Trained on [synapti/nci-propaganda-production](https://huggingface.co/datasets/synapti/nci-propaganda-production):

	- 23,000+ examples with multi-hot technique labels
	- Augmented data for minority techniques (MLSMOTE)
	- Hard negatives from LIAR2 and Qbias datasets
	- Class-weighted Focal Loss to handle imbalance

	## Model Architecture

	- Base Model: [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base)
	- Parameters: 149.6M
	- Max Sequence Length: 512 tokens
	- Output: 18 labels (multi-label sigmoid)
	- Calibration Temperature: 0.75

	## Available Files

	| File | Description |
	|------|-------------|
	| `model.safetensors` | PyTorch model weights |
	| `calibration_config.json` | Optimized thresholds & temperature |
	| `onnx/model.onnx` | ONNX model for fast inference |
	| `config.json` | Model configuration |

	## Training Details

	- Loss Function: Class-weighted Focal Loss (gamma=2.0)
	- Class Weights: Inverse frequency weighting
	- Optimizer: AdamW
	- Learning Rate: 2e-5
	- Batch Size: 8 (effective 32 with gradient accumulation)
	- Epochs: 5 with early stopping (patience=3)
	- Hardware: NVIDIA A10G GPU
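
The loss and class-weight settings above can be illustrated with a NumPy sketch of a class-weighted binary focal loss. This is one common formulation, not the training code: applying the class weight to both positive and negative terms, the `eps` smoothing, and normalizing the weights to a mean of 1 are all assumptions.

```python
import numpy as np


def inverse_frequency_weights(y, eps=1.0):
    """Per-label weights proportional to 1 / (positive count), normalized to mean 1."""
    y = np.asarray(y, dtype=float)
    counts = y.sum(axis=0)
    weights = y.shape[0] / (counts + eps)  # eps avoids division by zero for unseen labels
    return weights / weights.mean()


def focal_loss(logits, targets, class_weights, gamma=2.0):
    """Mean class-weighted focal loss for multi-label logits of shape (samples, labels)."""
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Numerically stable binary cross-entropy computed from logits.
    bce = np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))
    p_t = np.exp(-bce)  # probability the model assigns to the true outcome
    # (1 - p_t) ** gamma down-weights examples the model already gets right.
    loss = class_weights * (1.0 - p_t) ** gamma * bce
    return float(loss.mean())


# Toy multi-hot labels: label 0 is frequent, label 1 is rare.
y = [[1, 0], [1, 0], [1, 1], [1, 0]]
print(inverse_frequency_weights(y))
```

With `gamma=0` and unit weights the function reduces to ordinary binary cross-entropy, which is a quick sanity check on the implementation.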

	## Limitations

	- Trained primarily on English text
	- Performance varies widely by technique, from 46.0% F1 (Slogans) to 95.3% F1 (Loaded_Language); see the Propaganda Techniques table
	- Some techniques overlap semantically
	- Should be used with the binary detector for best results
	- Threshold optimization is recommended for specific use cases

	## Related Models

	- [synapti/nci-binary-detector](https://huggingface.co/synapti/nci-binary-detector) - Stage 1 binary detector

	## Citation

	```bibtex
	@inproceedings{da-san-martino-etal-2020-semeval,
	  title = "{S}em{E}val-2020 Task 11: Detection of Propaganda Techniques in News Articles",
	  author = "Da San Martino, Giovanni and others",
	  booktitle = "Proceedings of SemEval-2020",
	  year = "2020",
	}

	@misc{nci-technique-classifier,
	  author = {NCI Protocol Team},
	  title = {NCI Technique Classifier},
	  year = {2024},
	  publisher = {HuggingFace},
	  url = {https://huggingface.co/synapti/nci-technique-classifier}
	}
	```

	## License

	Apache 2.0