---
license: other
license_name: mixed-terms
license_link: LICENSE
language:
- en
metrics:
- f1
- precision
- recall
base_model:
- google/gemma-2-2b-it
- answerdotai/ModernBERT-base
tags:
- Multilabel classification
- Propaganda-detection
- Text-classification
---
## Propaganda Detector Ensemble (Inference Bundle)
This repository provides an inference bundle for multi-label classification of propaganda techniques. The bundle includes model artifacts and configuration for an ensemble composed of:
- **Gemma**: `google/gemma-2-2b-it` (main) and an **8-bit** inference variant
- **ModernBERT**: `answerdotai/ModernBERT-base` (binary / auxiliary classifier component)
- **Classical ML**: **LinearSVC + TF-IDF**, optionally with calibration
- **Post-processing artifacts**: label list, per-class thresholds, and ensemble metadata
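As a rough illustration of how per-class thresholds can turn ensemble scores into a multi-label prediction, the sketch below averages per-model probabilities with fixed weights and then compares each class score against its own threshold. The combination rule (a weighted average), the function name `ensemble_predict`, the weights, the thresholds, and the label order are all assumptions made for this example; the actual values and logic are defined by the artifacts shipped in this repository.

```python
from typing import Dict, List


def ensemble_predict(
    model_probs: Dict[str, List[float]],
    weights: Dict[str, float],
    thresholds: List[float],
    labels: List[str],
) -> List[str]:
    """Weighted-average the per-class probabilities, then apply per-class thresholds."""
    # Normalize by the weights of the models that actually produced scores.
    total_weight = sum(weights[name] for name in model_probs)
    predicted = []
    for i, label in enumerate(labels):
        score = sum(weights[name] * probs[i] for name, probs in model_probs.items())
        score /= total_weight
        # Each class has its own cut-off instead of a global 0.5.
        if score >= thresholds[i]:
            predicted.append(label)
    return predicted


# Hypothetical values, for illustration only.
labels = ["Loaded_Language", "Name_Calling,Labeling", "Doubt"]
thresholds = [0.5, 0.35, 0.6]
weights = {"gemma": 0.5, "svm": 0.5}
model_probs = {
    "gemma": [0.9, 0.3, 0.2],
    "svm": [0.7, 0.5, 0.4],
}

print(ensemble_predict(model_probs, weights, thresholds, labels))
# ['Loaded_Language', 'Name_Calling,Labeling']
```

In this example the second class is predicted with an averaged score of about 0.4, which a single global threshold of 0.5 would have rejected; this is the kind of per-class adjustment that stored thresholds make possible.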
## Intended Use
- Inference via an API service (e.g., deployed on Modal), as a browser extension backend, for batch scoring, and for experimentation.
- The repository is intended for **prediction/inference**. Training code and datasets are not included.
## Licensing & Attribution (IMPORTANT)
This repository uses **mixed licensing/terms** because it bundles multiple upstream components.
### Gemma components (Gemma-2-2B-IT and derivatives)
Gemma-based weights and any derivatives (including fine-tuned and/or quantized variants) are subject to the **Gemma Terms of Use**:
https://ai.google.dev/gemma/terms
### ModernBERT component
`answerdotai/ModernBERT-base` is released under the **Apache License 2.0**. If ModernBERT weights or derivatives are included in this repository, their use and distribution are subject to Apache-2.0:
https://www.apache.org/licenses/LICENSE-2.0
## Dataset attribution (SemEval-2020 Task 11 / PTC)
This work uses the **Propaganda Techniques Corpus (PTC)** from **SemEval-2020 Task 11**.
The dataset was **modified for this project** during preprocessing and label setup. In particular:
- The original span-level annotations were transformed into a **classification-ready format** (instances derived from annotated fragments with document context).
- Underrepresented techniques were **merged into super-classes** as in the task setup (e.g., “Bandwagon” + “Reductio ad Hitlerum”, and “Whataboutism” + “Straw Men” + “Red Herring”).
- The technique **“Obfuscation, Intentional Vagueness, Confusion”** was **excluded** due to very low frequency (as described in the PTC documentation).
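The label setup described above can be sketched as a simple mapping step. The snippet below is a minimal illustration, not the preprocessing code used for this project: the label strings, the names `MERGE_MAP` and `EXCLUDED`, and the helper `normalize_labels` are assumptions chosen for the example.

```python
from typing import List

# Assumed label spellings; adjust to match the label list shipped with the bundle.
MERGE_MAP = {
    "Bandwagon": "Bandwagon,Reductio_ad_hitlerum",
    "Reductio_ad_hitlerum": "Bandwagon,Reductio_ad_hitlerum",
    "Whataboutism": "Whataboutism,Straw_Men,Red_Herring",
    "Straw_Men": "Whataboutism,Straw_Men,Red_Herring",
    "Red_Herring": "Whataboutism,Straw_Men,Red_Herring",
}

# Technique dropped because of its very low frequency.
EXCLUDED = {"Obfuscation,Intentional_Vagueness,Confusion"}


def normalize_labels(raw_labels: List[str]) -> List[str]:
    """Map raw technique names to super-classes, drop excluded ones, and de-duplicate."""
    normalized: List[str] = []
    for label in raw_labels:
        if label in EXCLUDED:
            continue
        merged = MERGE_MAP.get(label, label)  # unmapped labels pass through unchanged
        if merged not in normalized:  # keep first occurrence, preserve order
            normalized.append(merged)
    return normalized


print(normalize_labels(["Bandwagon", "Reductio_ad_hitlerum", "Doubt"]))
# ['Bandwagon,Reductio_ad_hitlerum', 'Doubt']
```

De-duplicating after the merge matters for multi-label targets: an instance annotated with two techniques from the same super-class should carry that super-class only once.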
The original dataset is **not redistributed** in this repository. Any modifications are the responsibility of the authors of this repository.
Please cite the following paper when using the PTC corpus:
Da San Martino, G., Yu, S., Barrón-Cedeño, A., Petrov, R., & Nakov, P. (2019).
*Fine-Grained Analysis of Propaganda in News Articles* (EMNLP-IJCNLP 2019).
```bibtex
@InProceedings{EMNLP19DaSanMartino,
  author    = {Da San Martino, Giovanni and
               Yu, Seunghak and
               Barr\'{o}n-Cede\~no, Alberto and
               Petrov, Rostislav and
               Nakov, Preslav},
  title     = {Fine-Grained Analysis of Propaganda in News Articles},
  booktitle = {Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and
               9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019},
  year      = {2019}
}
```
## Summary note
If any terms conflict, the most restrictive applicable terms for a given component apply. This repository does not grant any additional rights beyond those stated in the upstream licenses/terms.
## Limitations
- Predictions can be sensitive to domain shift, language, and text length.
- The ensemble may produce false positives and false negatives; use it as decision support, not as the sole authority.
## Contact / Notes
If you use this repository, please ensure compliance with the upstream licenses/terms and provide appropriate dataset attribution.
---