roberta-incoherence-classifier

---
license: cc-by-sa-4.0
language:
- pl
tags:
- text-classification
- encoder-only
- polish
- inconsistency-detection
pipeline_tag: text-classification
model-index:
- name: asseco-group/roberta-incoherence-classifier
  results:
  - task:
      type: text-classification
      name: Document inconsistency detection (NLI-like)
    dataset:
      name: asseco-group/incoherence-bench
      type: text
      split: test
    metrics:
    - type: f1
      name: F1 (macro)
      value: 0.91
    - type: accuracy
      name: Accuracy
      value: 0.91
---

<h1 align="center">roberta-incoherence-classifier</h1>

Encoder-based classifier for document inconsistency detection in **Polish**. This model evaluates the semantic consistency between two text fragments (e.g. sections of legal, procurement or organizational documents). It follows an NLI-like setup but **redefines labels specifically for document coherence auditing**. This model was **initalized from [PKOBP/polish-roberta-8k](https://huggingface.co/PKOBP/polish-roberta-8k)** and **adapted into an inconsistency classifier** through supervised training on high-quality document-style pairs.

---

## Intended Use

* Document consistency auditing (legal, public tender, IT documentation, organizational materials)
* Detecting contradicting statements, scope mismatches, term/role/format inconsistencies
* NLI‑like semantic relation classification with adapted label semantics

**Not intended for:**

* Fact-checking against external world knowledge
* Non‑Polish language inputs
* General misinformation / sentiment / toxicity detection

Finetuning on specific domain data is recommended for best production accuracy.

---

## Label Definition (Adapted vs. Classical NLI)

| Label             | Meaning                                                                                                                                                                                        |
| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **entailment**    | Hypothesis is a faithful, condensed or paraphrased restatement of the premise. All critical constraints, actors, conditions and scope remain intact.                                           |
| **neutral**       | Hypothesis neither follows nor contradicts the premise. Typically introduces unverifiable or out‑of‑scope information (e.g. different institutions, expanded context, unrelated assumptions). |
| **contradiction** | Hypothesis directly conflicts with the premise: reverses permissions/requirements, changes legal scope, numeric limits, formats, dates, or the responsible authority or both statements cannot realistically be true at the same time. |

**Rule:** A single critical mismatch (date / territory / authority / format / obligation vs. optional) is sufficient for `contradiction`, even if most of the text agrees.

---

## Model Details

* Base architecture: **RoBERTa‑large (encoder‑only)**
* Classification head: standard HF linear head on pooled representation
* Language: **Polish only**
* License: **CC-BY-SA 4.0**
* Repository: `asseco-group/roberta-incoherence-classifier`

---

## Training

* Precision: **bfloat16**
* Epochs: **5**
* Global batch: 96 × 2 devices, `gradient_accumulation_steps=11`
* Learning rate: `2e-5`, warmup ratio: `0.1`, weight decay: `0.01`
* Label smoothing: `0.05`
* Gradient checkpointing: **True**
* Model selection: best **macro F1** on validation

---

## Dataset


* ~**1.3M** labeled pairs (train + val + test)
* Balanced class distribution
* Data sources include:
  - Polish subset of [MoritzLaurer/multilingual-NLI-26lang-2mil7](https://huggingface.co/datasets/MoritzLaurer/multilingual-NLI-26lang-2mil7) (only high‑quality Polish NLI pairs)
  - Synthetic high‑quality document‑style pairs generated specifically for this inconsistency detection task
  - No additional classical NLI datasets were used, standard NLI label semantics do not fully align with this model’s stricter document‑consistency definitions
* Focus on Polish formal/procedural language (laws, tenders, IT specs, institutional instructions)

---

## Evaluation (on [asseco-group/incoherence-bench](https://huggingface.co/datasets/asseco-group/incoherence-bench), test split)

```
               precision    recall  f1-score   support
   entailment       0.94      0.90      0.92       150
      neutral       0.87      0.91      0.89       150
contradiction       0.93      0.93      0.93       150

     accuracy                           0.91       450
    macro avg       0.91      0.91      0.91       450
 weighted avg       0.91      0.91      0.91       450
```

While the task is NLI-like, the label semantics are redefined for document-level procedural consistency, for which no direct open-source baselines currently exist.

---

## Usage Example (Transformers)

```python
import torch
from transformers import pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

classifier = pipeline(
    "text-classification",
    model="asseco-group/roberta-incoherence-classifier",
    tokenizer="asseco-group/roberta-incoherence-classifier",
    top_k=None,
    return_all_scores=True,
    device=device
)

premise = (
    "Wykonawca dostarczy pliki w formacie .shp zgodne z oprogramowaniem ArcGIS 10.2, "
    "wraz z mapami wydrukowanymi w formacie A4."
)

hypo = (
    "Wykonawca przekaże wyłącznie pliki .kml kompatybilne z QGIS "
    "i przygotuje dokumentację w formacie A3."
)

result = classifier({"text": premise, "text_pair": hypo})
print(result)
```

### Batch / lower-level

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "asseco-group/roberta-incoherence-classifier"
tokenizer  = AutoTokenizer.from_pretrained(name, use_fast=True)
model  = AutoModelForSequenceClassification.from_pretrained(name).eval()
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

pairs = [
    ("Zwrot kosztów w 60 dni ...", "Zwrot kosztów nastąpi w 30 dni ..."),
]
enc = tokenzier(
    [p for p, h in pairs],
    [h for p, h in pairs],
    padding=True, truncation=True, max_length=512,
    return_tensors="pt"
).to(device)

with torch.no_grad():
    logits = model(**enc).logits
probs = logits.softmax(-1).cpu()
print(probs)
```

---

## Limitations & Recommendations

* **Polish‑only** checkpoint: out‑of‑language input not supported
* Complex tabular / OCR / mixed‑language content may degrade quality
* Domain‑specific fine‑tuning is recommended for production


---

## Citation

```bibtex
@misc{asseco2025incoherence,
  title  = {Polish RoBERTa-based Incoherence/Consistency Classifier (encoder-only)},
  author = {Asseco Group},
  year   = {2025},
  url    = {https://huggingface.co/asseco-group/roberta-incoherence-classifier}
}
```