	---
	language:
	- en
	license: cc-by-nc-4.0
	library_name: transformers
	pipeline_tag: text-classification
	base_model: FacebookAI/roberta-large
	tags:
	- roberta
	- text-classification
	- multi-label-classification
	- disinformation
	- narrative-detection
	- propaganda
	- media-analysis
	metrics:
	- f1
	- precision
	- recall
	---

	# Narrative Classifier (RoBERTa-large, multi-label)

	A multi-label text classifier that detects disinformation / propaganda narratives in news and
	social-media text. Given a piece of text, the model predicts which of 41 predefined narratives
	(spanning topics such as the war in Ukraine, migration, climate change, COVID-19 / vaccines,
	gender & LGBT+, anti-establishment / anti-EU / anti-NATO framings, etc.) are present.

	The model was developed at the Polish-Japanese Academy of Information Technology (PJAIT / PJATK).

	- Architecture: `RobertaNarrativeModel` — a `roberta-large` encoder + a single linear
	classification head (`narrative_head`, 1024 → 41) applied to the `<s>` (CLS) token.
	- Task: multi-label classification (one input can carry several narratives at once).
	- Base model: [`FacebookAI/roberta-large`](https://huggingface.co/FacebookAI/roberta-large)
	- Parameters: ~0.4B · Precision: FP32 · Format: safetensors
	- Language: English

> Note on the architecture. This repository uses a custom model class
> (`RobertaNarrativeModel`) whose weights are stored under the `transformer.*` and
> `narrative_head.*` prefixes. It therefore does not load directly with
> `AutoModelForSequenceClassification`. Use the self-contained loading code in the
> [How to use](#how-to-use) section below.
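
The prefix mismatch described in the note can be illustrated with a small, stdlib-only sketch. The key names in `example_keys` are illustrative and were not read from the actual checkpoint, and `group_by_prefix` is a hypothetical helper. The `roberta` and `classifier` prefixes are the ones used by `RobertaForSequenceClassification`, the class that `AutoModelForSequenceClassification` resolves to for RoBERTa checkpoints.

```python
from collections import Counter


def group_by_prefix(keys):
    """Count parameter names by top-level prefix (the text before the first dot)."""
    return Counter(key.split(".", 1)[0] for key in keys)


# Illustrative key names only (an assumption, not read from model.safetensors).
example_keys = [
    "transformer.embeddings.word_embeddings.weight",
    "transformer.encoder.layer.0.attention.self.query.weight",
    "narrative_head.weight",
    "narrative_head.bias",
]

# Top-level prefixes used by RobertaForSequenceClassification.
expected_by_stock_class = {"roberta", "classifier"}

found = group_by_prefix(example_keys)
unmatched = sorted(set(found) - expected_by_stock_class)

print(dict(found))
print(unmatched)  # prefixes the stock class would not recognise
```

With the real file, the key names could be listed through `safetensors.safe_open(path, framework="pt")` and its `keys()` method instead of the hand-written list.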

	## Labels

	The model outputs 41 labels. The full mapping is in
	[`narrative_labels.json`](./narrative_labels.json) / [`label_config.json`](./label_config.json).

	<details>
	<summary>Show all 41 narratives</summary>

| ID | Narrative |
|----|-----------|
| 0 | Abortion is evil/immoral/dangerous |
| 1 | Alternative treatments are more effective than conventional ones |
| 2 | Climate change is a hoax |
| 3 | Collapse of Western civilization is imminent |
| 4 | Conflict is a staged event prepared by outside forces |
| 5 | Contraception is against nature/dangerous/immoral |
| 6 | Conventional medicine is ineffective and corrupt |
| 7 | Conventional medicine is wrong about the causes of diseases |
| 8 | Elites manipulate elections |
| 9 | Elites want to take over the world |
| 10 | European Union is authoritarian |
| 11 | Feminism is a tool to destroy the natural order and traditional values |
| 12 | Global elites deliberately cause pandemics and diseases |
| 13 | Global warming does not exist/is not a serious threat |
| 14 | Governments fail to take proper action on migration crisis |
| 15 | Homosexuals are a threat |
| 16 | Humanity is not responsible for global warming |
| 17 | LGBT+ is a tool to destroy the natural order and traditional values |
| 18 | LGBT+ people are mentally ill |
| 19 | LGBT+ people are privileged |
| 20 | Media deliberately spreads lies |
| 21 | Migrants are a burden on the economy |
| 22 | Migrants are dangerous |
| 23 | Migrants are destroying local culture and breaking up local communities |
| 24 | Migration is a conspiracy of global elites |
| 25 | Most European countries are puppets of the West |
| 26 | NATO is authoritarian/warmongering |
| 27 | Official information is a tool to deceive citizens |
| 28 | Other |
| 29 | Russia is strong and winning the war |
| 30 | Sex education is a threat to children |
| 31 | Solutions to reduce human impact on environment and climate are a conspiracy |
| 32 | State and international institutions only serve to oppress citizens. |
| 33 | The West and their allies are immoral/hostile/ineffective |
| 34 | The energy crisis is artificially created |
| 35 | Transgender people are a threat |
| 36 | Ukraine is an evil, aggressive and dangerous country |
| 37 | Ukrainian refugees are a danger/burden |
| 38 | Vaccines are dangerous/ineffective/immoral |
| 39 | Western elites want to destroy the natural order and traditional values |
| 40 | other |

	</details>
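
The table lists two labels whose names differ only in capitalization (ID 28 `Other` and ID 40 `other`), so downstream code is probably safer keying on the numeric ID than on the name. The sketch below shows one way to detect such collisions in an ID-to-name mapping. `case_insensitive_collisions` is a hypothetical helper and `excerpt` is a hand-copied subset of the table, not the contents of `narrative_labels.json`.

```python
from collections import defaultdict


def case_insensitive_collisions(id2name):
    """Return groups of label IDs whose names match after stripping and lowercasing."""
    groups = defaultdict(list)
    for label_id, name in id2name.items():
        groups[name.strip().lower()].append(label_id)
    return {name: sorted(ids) for name, ids in groups.items() if len(ids) > 1}


# A small excerpt of the label table, keyed by ID.
excerpt = {
    2: "Climate change is a hoax",
    28: "Other",
    38: "Vaccines are dangerous/ineffective/immoral",
    40: "other",
}

collisions = case_insensitive_collisions(excerpt)
print(collisions)  # {'other': [28, 40]}
```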

	## How to use

```python
import json
import torch
from torch import nn
from transformers import AutoTokenizer, AutoConfig, RobertaModel
from huggingface_hub import hf_hub_download
from safetensors.torch import load_file

REPO_ID = "pjait/narrative_classifier"


class RobertaNarrativeModel(nn.Module):
    """roberta-large encoder + a linear head over the <s> (CLS) token."""

    def __init__(self, config, num_labels):
        super().__init__()
        self.transformer = RobertaModel(config, add_pooling_layer=False)
        self.narrative_head = nn.Linear(config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask=None):
        out = self.transformer(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]  # <s> token representation
        return self.narrative_head(cls)  # raw logits (multi-label)


# --- load config, labels and weights ---------------------------------------
config = AutoConfig.from_pretrained(REPO_ID)
tokenizer = AutoTokenizer.from_pretrained(REPO_ID)

with open(hf_hub_download(REPO_ID, "narrative_labels.json")) as f:
    labels = json.load(f)
id2narrative = {int(k): v for k, v in labels["id2narrative"].items()}
num_labels = labels["num_labels"]

model = RobertaNarrativeModel(config, num_labels)
state_dict = load_file(hf_hub_download(REPO_ID, "model.safetensors"))
model.load_state_dict(state_dict)
model.eval()

# --- inference --------------------------------------------------------------
text = "The vaccines were rushed and are far more dangerous than the virus itself."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    logits = model(**inputs)
probs = torch.sigmoid(logits)[0]  # multi-label -> sigmoid

THRESHOLD = 0.5
predicted = [(id2narrative[i], float(p)) for i, p in enumerate(probs) if p >= THRESHOLD]
print(sorted(predicted, key=lambda x: -x[1]))
```

	`THRESHOLD` controls precision/recall trade-off; tune it on your own validation data.
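
The sweep below is a minimal sketch of such tuning for a single narrative, assuming you have sigmoid scores and 0/1 gold labels for a validation set. The helper names and the validation numbers are hypothetical, and F1 is only one possible selection criterion.

```python
def f1_at_threshold(probs, targets, threshold):
    """F1 for one label; probs are sigmoid outputs, targets are 0/1 gold labels."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, y_hat in zip(targets, preds) if y == 1 and y_hat == 1)
    fp = sum(1 for y, y_hat in zip(targets, preds) if y == 0 and y_hat == 1)
    fn = sum(1 for y, y_hat in zip(targets, preds) if y == 1 and y_hat == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(probs, targets, candidates):
    """Return the candidate with the highest F1; ties go to the lower threshold."""
    return max(candidates, key=lambda t: (f1_at_threshold(probs, targets, t), -t))


# Hypothetical validation scores and gold labels for one narrative.
val_probs = [0.92, 0.41, 0.35, 0.28, 0.08, 0.05]
val_targets = [1, 1, 1, 0, 0, 0]
candidates = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9

best = best_threshold(val_probs, val_targets, candidates)
print(best)  # 0.3
print(f1_at_threshold(val_probs, val_targets, 0.5))  # 0.5
```

In this toy data the default of 0.5 finds only one of the three positives, while 0.3 finds all three without a false positive. Repeating the sweep per narrative yields one threshold per label, which can then replace the single `THRESHOLD` at inference time.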

	## Evaluation

	Metrics from [`metrics.txt`](./metrics.txt) (evaluation split, epoch 3):

| Metric | Value |
|--------|-------|
| Micro F1 | 0.494 |
| Macro F1 | 0.185 |
| Precision | 0.700 |
| Recall | 0.382 |
| Subset accuracy | 0.787 |
| Eval loss | 0.023 |

	The gap between micro and macro F1, together with high precision but lower recall, indicates the
	model is conservative and performs unevenly across narratives — likely better on
	well-represented narratives and weaker on rare ones. Treat predictions as a **decision-support
	signal**, not ground truth, and calibrate the threshold for your use case.
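
To make the micro/macro gap concrete, here is a small stdlib-only sketch on toy data (not the model's evaluation set). It assumes the usual definitions: micro F1 pools true positives, false positives and false negatives over all labels, macro F1 averages per-label F1, and subset accuracy requires the whole label row to match. Scoring a label with no positives and no predictions as 0 is a convention chosen here, not something stated in `metrics.txt`.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_scores(y_true, y_pred):
    """Return (micro F1, macro F1, subset accuracy) for lists of 0/1 rows."""
    num_labels = len(y_true[0])
    counts = [[0, 0, 0] for _ in range(num_labels)]  # [tp, fp, fn] per label
    for true_row, pred_row in zip(y_true, y_pred):
        for j, (t, p) in enumerate(zip(true_row, pred_row)):
            if t and p:
                counts[j][0] += 1
            elif p:
                counts[j][1] += 1
            elif t:
                counts[j][2] += 1
    micro = f1(*[sum(c[k] for c in counts) for k in range(3)])
    macro = sum(f1(*c) for c in counts) / num_labels
    subset = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return micro, macro, subset


# Toy data: label 0 is frequent and always found; labels 1 and 2 are rare and missed.
y_true = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 0, 0]]
y_pred = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0], [0, 0, 0], [0, 0, 0]]

micro, macro, subset = multilabel_scores(y_true, y_pred)
print(round(micro, 3), round(macro, 3), round(subset, 3))  # 0.8 0.333 0.667
```

The frequent label dominates the pooled counts, so micro F1 stays high while the two missed rare labels pull macro F1 down to one third.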

	## Intended use & limitations

	Intended use. Research and analysis of disinformation/propaganda narratives in English-language
	media; content moderation triage; media-monitoring dashboards; academic studies of narrative spread.

	Out of scope / cautions.
	- The model identifies whether text expresses or discusses a narrative; it does not establish
	truth, intent, or that the author endorses the narrative (quotation, debunking and reporting can
	trigger labels).
	- Trained on English; performance on other languages is not guaranteed.
	- Macro F1 is low — rare narratives are unreliable. Do not use for automated, consequential
	decisions about individuals without human review.
	- Sensitive topics (health, politics, gender, migration). Outputs can reflect biases in the
	training data. Human oversight is required for any deployment.

	## Training

	- Base model: `FacebookAI/roberta-large` fine-tuned for multi-label narrative classification.
	- Epochs: 3 (see `training_args.bin` for the full `TrainingArguments`).
	- Objective: multi-label classification (sigmoid + binary cross-entropy over 41 narratives).
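
The training script itself is not part of this card, so the following is only a sketch of the objective named above: the standard sigmoid plus binary cross-entropy loss, averaged over labels and written in a numerically stable form. It should match what `torch.nn.BCEWithLogitsLoss` computes with its default mean reduction. The logits and targets are made-up values for three labels.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over labels, computed directly from raw logits.

    Uses the stable identity  loss = max(z, 0) - z * y + log(1 + exp(-|z|)),
    which equals -[y * log(sigmoid(z)) + (1 - y) * log(1 - sigmoid(z))].
    """
    losses = [
        max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
        for z, y in zip(logits, targets)
    ]
    return sum(losses) / len(losses)


# Made-up logits for three labels; each label is an independent yes/no decision.
logits = [4.0, -3.0, 0.5]
targets = [1, 0, 1]

loss = bce_with_logits(logits, targets)
print(loss)
```

Because every label gets its own sigmoid, the 41 outputs do not compete with each other as they would under softmax, which is what allows several narratives to be predicted for one input.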

	## Citation

	If you use this model, please cite the Polish-Japanese Academy of Information Technology (PJAIT)
	and the author. (Add the relevant paper / BibTeX here.)

```bibtex
@misc{narrative_classifier_pjait,
  title        = {Narrative Classifier (RoBERTa-large, multi-label)},
  author       = {Sosnowski, Witold},
  howpublished = {\url{https://huggingface.co/pjait/narrative_classifier}},
  note         = {Polish-Japanese Academy of Information Technology (PJAIT)},
  year         = {2025}
}
```