	---
	license: other
	license_name: mixed-terms
	license_link: LICENSE
	language:
	- en
	metrics:
	- f1
	- precision
	- recall
	base_model:
	- google/gemma-2-2b-it
	- answerdotai/ModernBERT-base
	tags:
	- Multilabel classification
	- Propaganda-detection
	- Text-classification
	---

	## Propaganda Detector Ensemble (Inference Bundle)

	This repository provides an inference bundle for multi-label classification of propaganda techniques. The bundle includes model artifacts and configuration for an ensemble composed of:

	- Gemma: `google/gemma-2-2b-it` (main) and an 8-bit inference variant
	- ModernBERT: `answerdotai/ModernBERT-base` (binary / auxiliary classifier component)
	- Classical ML: LinearSVC + TF-IDF, optionally with calibration
	- Post-processing artifacts: label list, per-class thresholds, and ensemble metadata
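
The way the post-processing artifacts combine with the model outputs can be sketched as follows. This is a minimal illustration, not the bundle's actual code. It assumes that each ensemble member emits one probability per label, that member outputs are combined by an unweighted mean, and that a label is predicted when its combined probability reaches its per-class threshold. The label subset, the threshold values, and the helper names `average_probabilities` and `apply_thresholds` are hypothetical.

```python
from typing import Dict, List

# Hypothetical label subset and thresholds, for illustration only.
# The real values would come from the bundle's label list and threshold artifacts.
LABELS: List[str] = ["Loaded_Language", "Name_Calling", "Doubt"]
THRESHOLDS: Dict[str, float] = {
    "Loaded_Language": 0.50,
    "Name_Calling": 0.40,
    "Doubt": 0.60,
}


def average_probabilities(member_probs: List[List[float]]) -> List[float]:
    """Combine per-class probabilities from several members by unweighted mean.

    Assumption: every member returns probabilities in the same label order.
    """
    n_members = len(member_probs)
    return [sum(column) / n_members for column in zip(*member_probs)]


def apply_thresholds(
    probs: List[float], labels: List[str], thresholds: Dict[str, float]
) -> List[str]:
    """Return every label whose probability reaches its own threshold."""
    return [label for label, p in zip(labels, probs) if p >= thresholds[label]]


# Two hypothetical members scoring the same text.
member_outputs = [
    [0.90, 0.20, 0.70],
    [0.70, 0.30, 0.80],
]
combined = average_probabilities(member_outputs)
predicted = apply_thresholds(combined, LABELS, THRESHOLDS)
print(predicted)
```

Per-class thresholds let rare techniques use a lower cutoff than frequent ones, which a single global cutoff of 0.5 cannot do. The actual combination rule used by this bundle may differ (for example, weighted averaging or gating by the binary classifier).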

	## Intended Use

	- Inference through an API service (e.g., on Modal), a browser extension backend, batch scoring, and experimentation.
	- The repository is intended for prediction/inference. Training code and datasets are not included.

	## Licensing & Attribution (IMPORTANT)

	This repository uses mixed licensing/terms because it bundles multiple upstream components.

	### Gemma components (Gemma-2-2B-IT and derivatives)
	Gemma-based weights and any derivatives (including fine-tuned and/or quantized variants) are subject to the Gemma Terms of Use:
	https://ai.google.dev/gemma/terms

	### ModernBERT component
	`answerdotai/ModernBERT-base` is released under the Apache License 2.0. If ModernBERT weights or derivatives are included in this repository, their use and distribution are subject to Apache-2.0:
	https://www.apache.org/licenses/LICENSE-2.0

	## Dataset Attribution (SemEval-2020 Task 11 / PTC)

	This work uses the Propaganda Techniques Corpus (PTC) from SemEval-2020 Task 11.

	The dataset was modified for this project during preprocessing and label setup. In particular:

	- The original span-level annotations were transformed into a classification-ready format (instances derived from annotated fragments with document context).
	- Underrepresented techniques were merged into super-classes as in the task setup (e.g., “Bandwagon” + “Reductio ad Hitlerum”, and “Whataboutism” + “Straw Men” + “Red Herring”).
	- The technique “Obfuscation, Intentional Vagueness, Confusion” was excluded due to very low frequency (as described in the PTC documentation).
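
The preprocessing steps above can be sketched roughly as follows. This is an assumption-laden illustration rather than the project's actual pipeline: the merged label strings, the `context_chars` window, the output dictionary layout, and the helper names `normalize_label` and `build_instance` are all hypothetical, and the real preprocessing may define document context differently.

```python
from typing import Dict, Optional, Set

# Techniques merged into super-classes; the merged label strings are hypothetical.
MERGED: Dict[str, str] = {
    "Bandwagon": "Bandwagon / Reductio ad Hitlerum",
    "Reductio ad Hitlerum": "Bandwagon / Reductio ad Hitlerum",
    "Whataboutism": "Whataboutism / Straw Men / Red Herring",
    "Straw Men": "Whataboutism / Straw Men / Red Herring",
    "Red Herring": "Whataboutism / Straw Men / Red Herring",
}

# Technique dropped because of its very low frequency.
EXCLUDED: Set[str] = {"Obfuscation, Intentional Vagueness, Confusion"}


def normalize_label(technique: str) -> Optional[str]:
    """Map a raw technique name to its training label, or None if excluded."""
    if technique in EXCLUDED:
        return None
    return MERGED.get(technique, technique)


def build_instance(
    document: str, start: int, end: int, technique: str, context_chars: int = 40
) -> Optional[dict]:
    """Turn one span annotation into a classification instance.

    Assumption: context is a fixed character window around the fragment.
    """
    label = normalize_label(technique)
    if label is None:
        return None
    left = max(0, start - context_chars)
    right = min(len(document), end + context_chars)
    return {
        "fragment": document[start:end],
        "context": document[left:right],
        "label": label,
    }


doc = "They all agree, so you should too. Facts do not matter here."
fragment = "They all agree"
start = doc.index(fragment)
instance = build_instance(doc, start, start + len(fragment), "Bandwagon", context_chars=5)
print(instance)
```

Fragments from the same context that carry different techniques can then be grouped into one multi-label target, if the classifier expects a label set per instance.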

	The original dataset is not redistributed in this repository. Any modifications are the responsibility of the authors of this repository.

	Please cite the following paper when using the PTC corpus:

	Da San Martino, G., Yu, S., Barrón-Cedeño, A., Petrov, R., & Nakov, P. (2019).
	Fine-Grained Analysis of Propaganda in News Articles (EMNLP-IJCNLP 2019).

	```bibtex
	@InProceedings{EMNLP19DaSanMartino,
	author = {Da San Martino, Giovanni and
	Yu, Seunghak and
	Barr\'{o}n-Cede\~no, Alberto and
	Petrov, Rostislav and
	Nakov, Preslav},
	title = {Fine-Grained Analysis of Propaganda in News Articles},
	booktitle = {Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and
	9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019},
	year = {2019}
	}
	```

	## Summary Note
	If any terms conflict, the most restrictive applicable terms for a given component apply. This repository does not grant any additional rights beyond those stated in the upstream licenses/terms.

	## Limitations

	- Predictions can be sensitive to domain shift, language, and text length.
	- The ensemble can produce false positives and false negatives; use it as decision support, not as the sole authority.

	## Contact / Notes

	If you use this repository, please ensure compliance with the upstream licenses/terms and provide appropriate dataset attribution.