---
license: other
license_name: mixed-terms
license_link: LICENSE
language:
- en
metrics:
- f1
- precision
- recall
base_model:
- google/gemma-2-2b-it
- answerdotai/ModernBERT-base
tags:
- Multilabel classification
- Propaganda-detection
- Text-classification
---

## Propaganda Detector Ensemble (Inference Bundle)

This repository provides an inference bundle for multi-label classification of propaganda techniques. The bundle includes model artifacts and configuration for an ensemble composed of:

- **Gemma**: `google/gemma-2-2b-it` (main) and an **8-bit** inference variant
- **ModernBERT**: `answerdotai/ModernBERT-base` (binary / auxiliary classifier component)
- **Classical ML**: **LinearSVC + TF-IDF**, optionally with calibration
- **Post-processing artifacts**: label list, per-class thresholds, and ensemble metadata
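
As a rough illustration of how a label list and per-class thresholds can turn ensemble scores into multi-label predictions, the sketch below takes a weighted average of per-model probabilities and applies one threshold per label. The function name `ensemble_predict`, the weights, the example labels, and all numbers are hypothetical. It also assumes every component emits one probability per label, which may not match how the binary ModernBERT component is used in the actual bundle.

```python
from typing import Dict, List, Sequence


def ensemble_predict(
    model_probs: Dict[str, Sequence[float]],
    weights: Dict[str, float],
    labels: Sequence[str],
    thresholds: Dict[str, float],
) -> List[str]:
    """Weighted-average per-label probabilities, then apply per-class thresholds."""
    total_weight = sum(weights[name] for name in model_probs)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")

    predicted = []
    for i, label in enumerate(labels):
        # Combine the i-th probability from every component model.
        score = sum(
            weights[name] * probs[i] for name, probs in model_probs.items()
        ) / total_weight
        # A label is emitted only if it clears its own threshold.
        if score >= thresholds[label]:
            predicted.append(label)
    return predicted


# Hypothetical values for illustration only (not taken from the bundle).
LABELS = ["Loaded_Language", "Doubt", "Slogans"]
THRESHOLDS = {"Loaded_Language": 0.5, "Doubt": 0.4, "Slogans": 0.6}
WEIGHTS = {"gemma": 0.5, "modernbert": 0.25, "svm": 0.25}
PROBS = {
    "gemma": [0.9, 0.3, 0.2],
    "modernbert": [0.7, 0.5, 0.1],
    "svm": [0.6, 0.6, 0.3],
}

result = ensemble_predict(PROBS, WEIGHTS, LABELS, THRESHOLDS)
print(result)
```

Per-class thresholds let rare techniques use a lower cut-off than frequent ones, which a single global threshold cannot do.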

## Intended Use

- Inference through an API service (for example, one hosted on Modal), a browser-extension backend, batch scoring, and experimentation.
- The repository is intended for **prediction/inference**. Training code and datasets are not included.

## Licensing & Attribution (IMPORTANT)

This repository uses **mixed licensing/terms** because it bundles multiple upstream components.

### Gemma components (Gemma-2-2B-IT and derivatives)
Gemma-based weights and any derivatives (including fine-tuned and/or quantized variants) are subject to the **Gemma Terms of Use**:
https://ai.google.dev/gemma/terms

### ModernBERT component
`answerdotai/ModernBERT-base` is released under the **Apache License 2.0**. If ModernBERT weights or derivatives are included in this repository, their use and distribution are subject to Apache-2.0:
https://www.apache.org/licenses/LICENSE-2.0

## Dataset attribution (SemEval-2020 Task 11 / PTC)

This work uses the **Propaganda Techniques Corpus (PTC)** from **SemEval 2020 Task 11**.

The dataset was **modified for this project** during preprocessing and label setup. In particular:

- The original span-level annotations were transformed into a **classification-ready format** (instances derived from annotated fragments with document context).
- Underrepresented techniques were **merged into super-classes** as in the task setup (e.g., “Bandwagon” + “Reductio ad Hitlerum”, and “Whataboutism” + “Straw Men” + “Red Herring”).
- The technique **“Obfuscation, Intentional Vagueness, Confusion”** was **excluded** due to very low frequency (as described in the PTC documentation).
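
The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration and not the authors' pipeline: the label spellings, the `(start, end, label)` span layout, the helper names, and the choice to use the whole document as context are all assumptions made for the example.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical label strings; the bundle's label list may spell them differently.
BANDWAGON_GROUP = "Bandwagon,Reductio_ad_Hitlerum"
WHATABOUTISM_GROUP = "Whataboutism,Straw_Men,Red_Herring"

MERGE_MAP: Dict[str, str] = {
    "Bandwagon": BANDWAGON_GROUP,
    "Reductio_ad_Hitlerum": BANDWAGON_GROUP,
    "Whataboutism": WHATABOUTISM_GROUP,
    "Straw_Men": WHATABOUTISM_GROUP,
    "Red_Herring": WHATABOUTISM_GROUP,
}
EXCLUDED = {"Obfuscation,Intentional_Vagueness,Confusion"}


def normalize_label(label: str) -> Optional[str]:
    """Map a technique to its super-class, or return None if it is excluded."""
    if label in EXCLUDED:
        return None
    return MERGE_MAP.get(label, label)


def spans_to_instances(text: str, spans: List[Tuple[int, int, str]]) -> List[dict]:
    """Turn span annotations into one classification instance per kept fragment."""
    instances = []
    for start, end, label in spans:
        normalized = normalize_label(label)
        if normalized is None:
            continue  # drop fragments labeled with the excluded technique
        instances.append(
            {
                "fragment": text[start:end],
                "context": text,  # assumption: the full document serves as context
                "label": normalized,
            }
        )
    return instances


# Toy document and spans, invented for illustration.
doc = "Everyone is doing it. What about them?"
annotations = [
    (0, 21, "Bandwagon"),
    (22, 38, "Whataboutism"),
    (0, 8, "Obfuscation,Intentional_Vagueness,Confusion"),
]
instances = spans_to_instances(doc, annotations)
print([(inst["fragment"], inst["label"]) for inst in instances])
```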

The original dataset is **not redistributed** in this repository. Any modifications are the responsibility of the authors of this repository.

Please cite the following paper when using the PTC corpus:

Da San Martino, G., Yu, S., Barrón-Cedeño, A., Petrov, R., & Nakov, P. (2019).
*Fine-Grained Analysis of Propaganda in News Articles* (EMNLP-IJCNLP 2019).

```bibtex
@InProceedings{EMNLP19DaSanMartino,
  author = {Da San Martino, Giovanni and
            Yu, Seunghak and
            Barr\'{o}n-Cede\~no, Alberto and
            Petrov, Rostislav and
            Nakov, Preslav},
  title = {Fine-Grained Analysis of Propaganda in News Articles},
  booktitle = {Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and
               9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019},
  year = {2019}
}
```

## Summary note
If any terms conflict, the most restrictive applicable terms for a given component apply. This repository does not grant any additional rights beyond those stated in the upstream licenses/terms.

## Limitations

- Predictions can be sensitive to domain shift, language, and text length.
- The ensemble can produce false positives and false negatives; treat its output as decision support, not as the sole authority.

## Contact / Notes

If you use this repository, please ensure compliance with the upstream licenses/terms and provide appropriate dataset attribution.