---
license: apache-2.0
language:
- en
library_name: transformers
tags:
- propaganda-detection
- binary-classification
- modernbert
- nci-protocol
- text-classification
pipeline_tag: text-classification
metrics:
- accuracy
- f1
- precision
- recall
datasets:
- synapti/nci-binary-classification
base_model: answerdotai/ModernBERT-base
model-index:
- name: nci-binary-detector-v2
  results:
  - task:
      type: text-classification
      name: Binary Propaganda Detection
    dataset:
      name: NCI Binary Classification
      type: synapti/nci-binary-classification
      split: test
    metrics:
    - type: accuracy
      value: 0.994
      name: Accuracy
    - type: f1
      value: 0.994
      name: F1
    - type: precision
      value: 0.989
      name: Precision
    - type: recall
      value: 1.000
      name: Recall
---

# NCI Binary Propaganda Detector v2

This model is Stage 1 of the NCI (Narrative Control Index) two-stage propaganda detection pipeline. It performs binary classification to detect whether a text contains any propaganda techniques.

## Model Description

- **Model Type:** Binary text classifier
- **Base Model:** [answerdotai/ModernBERT-base](https://huggingface.co/answerdotai/ModernBERT-base)
- **Training Data:** [synapti/nci-binary-classification](https://huggingface.co/datasets/synapti/nci-binary-classification) (24,517 train, 1,727 validation, 1,729 test)
- **Language:** English
- **License:** Apache 2.0

## Performance

| Metric | Value |
|--------|-------|
| **Accuracy** | 99.4% |
| **Precision** | 98.9% |
| **Recall** | 100.0% |
| **F1 Score** | 99.4% |
| **False Positive Rate** | 1.47% |
| **False Negative Rate** | 0.00% |

### Confusion Matrix (Test Set, n=1,729)
```
                     Predicted
                  No Prop | Has Prop
Actual No Prop:     736   |    11
Actual Has Prop:      0   |   982
```

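
The headline metrics can be re-derived from the confusion matrix above. The sketch below is only a sanity check on the reported numbers; the helper name `binary_metrics` is illustrative and not part of the model's code.

```python
# Counts copied from the confusion matrix above (test set, n=1,729).
tn, fp = 736, 11   # actual no_propaganda: predicted no / predicted has
fn, tp = 0, 982    # actual has_propaganda: predicted no / predicted has


def binary_metrics(tn: int, fp: int, fn: int, tp: int) -> dict[str, float]:
    """Derive the standard binary-classification metrics from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tn + fp + fn + tp),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
    }


metrics = binary_metrics(tn, fp, fn, tp)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Rounded to one decimal place, the results agree with the Performance table: all 11 errors are false positives among the 747 non-propaganda test examples, which gives the 1.47% false positive rate.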
### Threshold Analysis

| Threshold | Accuracy | Precision | Recall | F1 |
|-----------|----------|-----------|--------|-----|
| 0.3 | 99.2% | 98.6% | 100% | 99.3% |
| 0.4 | 99.2% | 98.7% | 100% | 99.3% |
| **0.5** | **99.4%** | **98.9%** | **100%** | **99.4%** |
| 0.6 | 99.7% | 99.4% | 100% | 99.7% |
| 0.7 | 99.7% | 99.5% | 100% | 99.7% |

**Recommended threshold:** 0.5 (default) or 0.6 for reduced false positives

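
Applying a non-default threshold means comparing the `has_propaganda` probability against the cutoff rather than taking the top-scoring label. The following is a minimal sketch of that step; `apply_threshold` is a hypothetical helper, and the scores are made-up values rather than real model output.

```python
def apply_threshold(scores: list[dict], threshold: float = 0.5) -> str:
    """Return a label by thresholding the has_propaganda probability.

    `scores` is a list of {"label": ..., "score": ...} dicts covering both
    classes for a single input text.
    """
    by_label = {item["label"]: item["score"] for item in scores}
    prob = by_label["has_propaganda"]
    return "has_propaganda" if prob >= threshold else "no_propaganda"


# Hypothetical scores for a borderline text (not real model output).
borderline = [
    {"label": "has_propaganda", "score": 0.55},
    {"label": "no_propaganda", "score": 0.45},
]

print(apply_threshold(borderline, threshold=0.5))  # has_propaganda
print(apply_threshold(borderline, threshold=0.6))  # no_propaganda
```

Per-class scores in this shape can be obtained by building the pipeline with `top_k=None`, the same option the Stage 2 classifier uses in the Usage section of this card.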
## Training Details

- **Loss Function:** Focal Loss (gamma=2.0, alpha=0.25) for class imbalance
- **Optimizer:** AdamW with weight decay 0.01
- **Learning Rate:** 2e-5 with warmup ratio 0.1
- **Batch Size:** 16 (effective 32 with gradient accumulation)
- **Epochs:** 5 with early stopping (patience=3)
- **Best Model Selection:** Based on F1 score on validation set

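
To illustrate what the focal loss settings above do, here is a per-example sketch in plain Python. The training code is not part of this card, so this assumes the common alpha-balanced form (weight `alpha` on the positive class, `1 - alpha` on the negative class); the function name and the probabilities are illustrative only.

```python
import math


def focal_loss(p_pos: float, target: int, gamma: float = 2.0, alpha: float = 0.25) -> float:
    """Alpha-balanced focal loss for one example.

    p_pos  : predicted probability of the positive class, strictly between 0 and 1
    target : 1 for has_propaganda, 0 for no_propaganda
    Assumed convention: weight alpha on positives, 1 - alpha on negatives.
    """
    p_t = p_pos if target == 1 else 1.0 - p_pos
    alpha_t = alpha if target == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma shrinks the loss of well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(p_pos=0.95, target=1)  # confident and correct
hard = focal_loss(p_pos=0.30, target=1)  # under-confident on a positive
print(f"easy example loss: {easy:.6f}")
print(f"hard example loss: {hard:.6f}")
```

With `gamma=2.0`, the modulating factor shrinks the loss on confidently correct examples far more than on hard ones, so training gradients concentrate on the examples the model still gets wrong. Setting `gamma=0.0` reduces this to alpha-weighted cross-entropy.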
## Usage

### With Transformers Pipeline

```python
from transformers import pipeline

detector = pipeline(
    "text-classification",
    model="synapti/nci-binary-detector-v2"
)

result = detector("The radical left is DESTROYING our country!")
# [{"label": "has_propaganda", "score": 0.99}]

result = detector("The Federal Reserve announced a 0.25% rate increase.")
# [{"label": "no_propaganda", "score": 0.98}]
```

### With AutoModel

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

model = AutoModelForSequenceClassification.from_pretrained("synapti/nci-binary-detector-v2")
tokenizer = AutoTokenizer.from_pretrained("synapti/nci-binary-detector-v2")

text = "Wake up, people! They are hiding the truth from you!"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=1)
    propaganda_prob = probs[0, 1].item()

print(f"Propaganda probability: {propaganda_prob:.2%}")
```

### Two-Stage Pipeline (Recommended)

For full propaganda analysis with technique identification:

```python
from transformers import pipeline

# Stage 1: Binary detection
binary_detector = pipeline(
    "text-classification",
    model="synapti/nci-binary-detector-v2"
)

# Stage 2: Technique classification
technique_classifier = pipeline(
    "text-classification",
    model="synapti/nci-technique-classifier-v2",
    top_k=None
)

text = "Some text to analyze..."

# Run Stage 1
binary_result = binary_detector(text)[0]
if binary_result["label"] == "has_propaganda" and binary_result["score"] >= 0.5:
    # Run Stage 2 only if propaganda detected
    techniques = technique_classifier(text)[0]
    detected = [t for t in techniques if t["score"] >= 0.3]
    print(f"Detected techniques: {[t['label'] for t in detected]}")
else:
    print("No propaganda detected")
```

## Labels

| Label ID | Label Name | Description |
|----------|------------|-------------|
| 0 | no_propaganda | Text does not contain propaganda techniques |
| 1 | has_propaganda | Text contains one or more propaganda techniques |

## Intended Use

### Primary Use Cases
- Media literacy tools and browser extensions
- Content moderation assistance
- Research on information manipulation
- Educational platforms for critical thinking

### Out of Scope
- Censorship or automated content removal
- Political targeting or surveillance
- Single-source truth determination

## Limitations

- Optimized for English text
- May have reduced performance on very short texts (<10 words)
- Trained primarily on political/news content; domain shift may affect performance
- Should be used as one signal among many, not as sole arbiter

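
One way to act on the short-text limitation is to flag inputs below the 10-word mark before trusting a prediction. This is a minimal sketch that assumes whitespace-separated tokens are a good enough word count; `is_short_text` is a hypothetical helper, not part of the model or the Transformers library.

```python
def is_short_text(text: str, min_words: int = 10) -> bool:
    """True when the text has fewer than `min_words` whitespace-separated tokens."""
    return len(text.split()) < min_words


samples = [
    "Wake up, people!",
    "The Federal Reserve announced a 0.25% rate increase after its two-day policy meeting.",
]
for sample in samples:
    flag = "low-confidence (short input)" if is_short_text(sample) else "ok"
    print(f"{flag}: {sample}")
```

Flagged inputs can still be scored; the flag just marks predictions that a downstream tool may want to show with extra caution.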
## Related Models

- **Stage 2:** [synapti/nci-technique-classifier-v2](https://huggingface.co/synapti/nci-technique-classifier-v2) - Multi-label technique classification
- **Dataset:** [synapti/nci-binary-classification](https://huggingface.co/datasets/synapti/nci-binary-classification)

## Citation

If you use this model, please cite:

```bibtex
@misc{nci-binary-detector-v2,
  author = {Synapti},
  title = {NCI Binary Propaganda Detector v2},
  year = {2024},
  publisher = {HuggingFace},
  url = {https://huggingface.co/synapti/nci-binary-detector-v2}
}
```