---
license: cc-by-sa-4.0
language:
- pl
tags:
- text-classification
- encoder-only
- polish
- inconsistency-detection
pipeline_tag: text-classification
model-index:
- name: asseco-group/roberta-incoherence-classifier
  results:
  - task:
      type: text-classification
      name: Document inconsistency detection (NLI-like)
    dataset:
      name: asseco-group/incoherence-bench
      type: text
      split: test
    metrics:
    - type: f1
      name: F1 (macro)
      value: 0.91
    - type: accuracy
      name: Accuracy
      value: 0.91
---

	<h1 align="center">roberta-incoherence-classifier</h1>

Encoder-based classifier for detecting inconsistencies in Polish documents. The model assesses the semantic consistency between two text fragments (e.g. sections of legal, procurement, or organizational documents). It follows an NLI-like setup but redefines the labels specifically for document coherence auditing. The model was initialized from [PKOBP/polish-roberta-8k](https://huggingface.co/PKOBP/polish-roberta-8k) and adapted into an inconsistency classifier through supervised training on high-quality document-style pairs.

	---

	## Intended Use

	* Document consistency auditing (legal, public tender, IT documentation, organizational materials)
	* Detecting contradicting statements, scope mismatches, term/role/format inconsistencies
	* NLI‑like semantic relation classification with adapted label semantics

	Not intended for:

	* Fact-checking against external world knowledge
	* Non‑Polish language inputs
	* General misinformation / sentiment / toxicity detection

Fine-tuning on domain-specific data is recommended for the best accuracy in production.
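
For the document-auditing use case listed above, the classifier needs pairs of fragments as input. This README does not say how fragments should be paired, so the following is only a minimal sketch under the assumption that every ordered pair of sections is a candidate (ordered, because the premise/hypothesis relation is directional). The helper name `candidate_pairs` and the sample sections are hypothetical.

```python
from itertools import permutations


def candidate_pairs(sections):
    """Return every ordered (premise, hypothesis) pair of distinct, non-empty sections."""
    cleaned = [s.strip() for s in sections if s.strip()]
    return list(permutations(cleaned, 2))


# Hypothetical document sections (illustrative text only).
sections = [
    "Termin realizacji zamówienia wynosi 60 dni.",      # "The order must be completed within 60 days."
    "Wykonawca realizuje zamówienie w ciągu 30 dni.",   # "The contractor completes the order within 30 days."
    "Dokumentację należy dostarczyć w formacie PDF.",   # "Documentation must be delivered as PDF."
]

pairs = candidate_pairs(sections)
print(len(pairs))  # 3 sections -> 3 * 2 = 6 ordered pairs
```

The number of pairs grows quadratically with the number of sections, so for long documents some pre-filtering (for example by topic or heading) is probably needed before classification.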

	---

	## Label Definition (Adapted vs. Classical NLI)

| Label | Meaning |
| --- | --- |
| entailment | Hypothesis is a faithful, condensed or paraphrased restatement of the premise. All critical constraints, actors, conditions and scope remain intact. |
| neutral | Hypothesis neither follows from nor contradicts the premise. Typically introduces unverifiable or out‑of‑scope information (e.g. different institutions, expanded context, unrelated assumptions). |
| contradiction | Hypothesis directly conflicts with the premise: it reverses permissions/requirements, or changes the legal scope, numeric limits, formats, dates, or the responsible authority, or the two statements otherwise cannot realistically both be true at the same time. |

Rule: A single critical mismatch (date / territory / authority / format / obligatory vs. optional) is sufficient for `contradiction`, even if most of the text agrees.
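
The rule above can be illustrated with a toy check. This is not how the model works internally (it classifies raw text pairs); it is a hedged sketch of the labeling logic only, assuming each fragment has already been reduced to a dictionary of critical fields. The field list, the dictionary representation, and both function names are hypothetical.

```python
# Critical fields named in the rule; the list and the dict-based
# representation are illustrative assumptions, not the model's internals.
CRITICAL_FIELDS = ("date", "territory", "authority", "format", "obligation")


def critical_mismatches(premise_facts, hypothesis_facts):
    """Return the critical fields on which both fragments state different values."""
    return [
        field
        for field in CRITICAL_FIELDS
        if field in premise_facts
        and field in hypothesis_facts
        and premise_facts[field] != hypothesis_facts[field]
    ]


def is_contradiction(premise_facts, hypothesis_facts):
    """One critical mismatch is enough, however many other fields agree."""
    return bool(critical_mismatches(premise_facts, hypothesis_facts))


premise_facts = {"date": "60 days", "authority": "contracting authority",
                 "format": "PDF", "obligation": "mandatory"}
hypothesis_facts = {"date": "30 days", "authority": "contracting authority",
                    "format": "PDF", "obligation": "mandatory"}

print(critical_mismatches(premise_facts, hypothesis_facts))  # ['date']
print(is_contradiction(premise_facts, hypothesis_facts))     # True
```

Three of the four fields agree here, yet the pair still counts as a contradiction because the deadline differs.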

	---

	## Model Details

	* Base architecture: RoBERTa‑large (encoder‑only)
	* Classification head: standard HF linear head on pooled representation
	* Language: Polish only
	* License: CC-BY-SA 4.0
	* Repository: `asseco-group/roberta-incoherence-classifier`

	---

	## Training

	* Precision: bfloat16
	* Epochs: 5
	* Global batch: 96 × 2 devices, `gradient_accumulation_steps=11`
	* Learning rate: `2e-5`, warmup ratio: `0.1`, weight decay: `0.01`
	* Label smoothing: `0.05`
	* Gradient checkpointing: True
	* Model selection: best macro F1 on validation
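
As a rough guide for reproducing this setup, the settings above can be written down using the parameter names of `transformers.TrainingArguments`. This is a sketch, not the authors' training script. It assumes that 96 is the per-device batch size (the list is ambiguous on this point), and the metric key `f1_macro` is hypothetical because it depends on the `compute_metrics` function used.

```python
# Values copied from the list above; key names follow transformers.TrainingArguments.
hparams = {
    "num_train_epochs": 5,
    "per_device_train_batch_size": 96,    # assumption: 96 is the per-device size
    "gradient_accumulation_steps": 11,
    "learning_rate": 2e-5,
    "warmup_ratio": 0.1,
    "weight_decay": 0.01,
    "label_smoothing_factor": 0.05,
    "bf16": True,
    "gradient_checkpointing": True,
    "load_best_model_at_end": True,       # assumption: matches "best macro F1 on validation"
    "metric_for_best_model": "f1_macro",  # hypothetical metric key
}
num_devices = 2

# Number of examples contributing to each optimizer step under the assumption above.
effective_batch = (
    hparams["per_device_train_batch_size"]
    * num_devices
    * hparams["gradient_accumulation_steps"]
)
print(effective_batch)  # 96 * 2 * 11 = 2112
```

If 96 is instead the batch size summed over both devices, the effective batch would be half of that value.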

	---

	## Dataset


	* ~1.3M labeled pairs (train + val + test)
	* Balanced class distribution
* Data sources include:
  - Polish subset of [MoritzLaurer/multilingual-NLI-26lang-2mil7](https://huggingface.co/datasets/MoritzLaurer/multilingual-NLI-26lang-2mil7) (only high‑quality Polish NLI pairs)
  - Synthetic high‑quality document‑style pairs generated specifically for this inconsistency detection task
  - No additional classical NLI datasets were used, because standard NLI label semantics do not fully align with this model’s stricter document‑consistency definitions
	* Focus on Polish formal/procedural language (laws, tenders, IT specs, institutional instructions)

	---

	## Evaluation (on [asseco-group/incoherence-bench](https://huggingface.co/datasets/asseco-group/incoherence-bench), test split)

```
               precision    recall  f1-score   support

   entailment       0.94      0.90      0.92       150
      neutral       0.87      0.91      0.89       150
contradiction       0.93      0.93      0.93       150

     accuracy                           0.91       450
    macro avg       0.91      0.91      0.91       450
 weighted avg       0.91      0.91      0.91       450
```

	While the task is NLI-like, the label semantics are redefined for document-level procedural consistency, for which no direct open-source baselines currently exist.
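
For readers checking the numbers, the per-class and macro-averaged F1 scores in the report can be recomputed from the printed precision and recall. The sketch below uses the rounded values from the report, so it only reproduces the scores to two decimal places.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall) per class, as printed in the report (already rounded).
report = {
    "entailment": (0.94, 0.90),
    "neutral": (0.87, 0.91),
    "contradiction": (0.93, 0.93),
}

per_class_f1 = {label: f1(p, r) for label, (p, r) in report.items()}

# Macro averaging: unweighted mean over classes.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

for label, score in per_class_f1.items():
    print(f"{label:>13}: {score:.2f}")
print(f"{'macro avg':>13}: {macro_f1:.2f}")
```

Because every class has the same support (150), the macro and weighted averages coincide, which matches the report.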

	---

	## Usage Example (Transformers)

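```python
import torch
from transformers import pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

classifier = pipeline(
    "text-classification",
    model="asseco-group/roberta-incoherence-classifier",
    tokenizer="asseco-group/roberta-incoherence-classifier",
    top_k=None,  # return scores for all labels (replaces the deprecated return_all_scores=True)
    device=device,
)

# "The contractor will deliver .shp files compatible with ArcGIS 10.2,
#  together with maps printed in A4 format."
premise = (
    "Wykonawca dostarczy pliki w formacie .shp zgodne z oprogramowaniem ArcGIS 10.2, "
    "wraz z mapami wydrukowanymi w formacie A4."
)

# "The contractor will hand over only .kml files compatible with QGIS
#  and will prepare the documentation in A3 format."
hypo = (
    "Wykonawca przekaże wyłącznie pliki .kml kompatybilne z QGIS "
    "i przygotuje dokumentację w formacie A3."
)

# Pass the pair as a dict so premise and hypothesis are encoded together.
result = classifier({"text": premise, "text_pair": hypo})
print(result)
```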
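
With `top_k=None`, the pipeline returns one `{"label": ..., "score": ...}` dictionary per class. The helper below is a small sketch for picking the highest-scoring label. It accepts both a flat list and a list nested one level deep, since the exact nesting can differ between `transformers` versions and input types. The function name `top_label`, the label strings, and the scores are illustrative placeholders, not actual model output.

```python
def top_label(result):
    """Return (label, score) of the highest-scoring class from pipeline output."""
    # Unwrap one level if the scores arrive as [[{...}, {...}, {...}]].
    scores = result[0] if result and isinstance(result[0], list) else result
    best = max(scores, key=lambda item: item["score"])
    return best["label"], best["score"]


# Placeholder output for illustration only; real label names come from the model config.
example = [
    {"label": "contradiction", "score": 0.90},
    {"label": "neutral", "score": 0.05},
    {"label": "entailment", "score": 0.05},
]

print(top_label(example))    # ('contradiction', 0.9)
print(top_label([example]))  # same result for the nested form
```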

	### Batch / lower-level

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "asseco-group/roberta-incoherence-classifier"
tokenizer = AutoTokenizer.from_pretrained(name, use_fast=True)
model = AutoModelForSequenceClassification.from_pretrained(name).eval()
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# (premise, hypothesis) pairs: "Reimbursement within 60 days ..." vs. "... within 30 days ..."
pairs = [
    ("Zwrot kosztów w 60 dni ...", "Zwrot kosztów nastąpi w 30 dni ..."),
]
# Premises and hypotheses go in as two aligned lists, one encoded input per pair.
enc = tokenizer(
    [p for p, h in pairs],
    [h for p, h in pairs],
    padding=True, truncation=True, max_length=512,
    return_tensors="pt",
).to(device)

with torch.no_grad():
    logits = model(**enc).logits
    probs = logits.softmax(-1).cpu()  # one row of class probabilities per pair
print(probs)
```
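
The probabilities printed above are indexed by class id, so they still need to be mapped to label names. The following standard-library sketch shows the two steps (softmax, then lookup) on a single row of logits. The `id2label` mapping here is a hypothetical placeholder: in real code it should be read from `model.config.id2label`, whose order may differ. The logits are made-up values, not model output.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical index-to-label mapping; read the real one from model.config.id2label.
id2label = {0: "entailment", 1: "neutral", 2: "contradiction"}


def decode(logits_row):
    """Return (label, probability) for the most likely class of one pair."""
    probs = softmax(logits_row)
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]


# Placeholder logits for illustration, not real model output.
label, prob = decode([0.2, -1.0, 3.1])
print(label, round(prob, 3))
```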

	---

	## Limitations & Recommendations

	* Polish‑only checkpoint: out‑of‑language input not supported
	* Complex tabular / OCR / mixed‑language content may degrade quality
	* Domain‑specific fine‑tuning is recommended for production


	---

	## Citation

```bibtex
@misc{asseco2025incoherence,
  title  = {Polish RoBERTa-based Incoherence/Consistency Classifier (encoder-only)},
  author = {Asseco Group},
  year   = {2025},
  url    = {https://huggingface.co/asseco-group/roberta-incoherence-classifier}
}
```