---
license: cc-by-sa-4.0
language:
- pl
tags:
- text-classification
- encoder-only
- polish
- inconsistency-detection
pipeline_tag: text-classification
model-index:
- name: asseco-group/roberta-incoherence-classifier
  results:
  - task:
      type: text-classification
      name: Document inconsistency detection (NLI-like)
    dataset:
      name: asseco-group/incoherence-bench
      type: text
      split: test
    metrics:
    - type: f1
      name: F1 (macro)
      value: 0.91
    - type: accuracy
      name: Accuracy
      value: 0.91
---
<h1 align="center">roberta-incoherence-classifier</h1>

Encoder-based classifier for document inconsistency detection in **Polish**. This model evaluates the semantic consistency between two text fragments (e.g. sections of legal, procurement or organizational documents). It follows an NLI-like setup but **redefines labels specifically for document coherence auditing**. This model was **initialized from [PKOBP/polish-roberta-8k](https://huggingface.co/PKOBP/polish-roberta-8k)** and **adapted into an inconsistency classifier** through supervised training on high-quality document-style pairs.

---
## Intended Use
* Document consistency auditing (legal, public tender, IT documentation, organizational materials)
* Detecting contradicting statements, scope mismatches, term/role/format inconsistencies
* NLI‑like semantic relation classification with adapted label semantics

**Not intended for:**
* Fact-checking against external world knowledge
* Non‑Polish language inputs
* General misinformation / sentiment / toxicity detection

Fine-tuning on domain-specific data is recommended for best accuracy in production.

---
## Label Definition (Adapted vs. Classical NLI)
| Label | Meaning |
| ----------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| **entailment** | Hypothesis is a faithful, condensed or paraphrased restatement of the premise. All critical constraints, actors, conditions and scope remain intact. |
| **neutral** | Hypothesis neither follows nor contradicts the premise. Typically introduces unverifiable or out‑of‑scope information (e.g. different institutions, expanded context, unrelated assumptions). |
| **contradiction** | Hypothesis directly conflicts with the premise: it reverses permissions/requirements, or changes legal scope, numeric limits, formats, dates, or the responsible authority; or the two statements otherwise cannot realistically be true at the same time. |

**Rule:** A single critical mismatch (date / territory / authority / format / obligatory vs. optional) is sufficient for `contradiction`, even if most of the text agrees.

---
## Model Details
* Base architecture: **RoBERTa‑large (encoder‑only)**
* Classification head: standard HF linear head on pooled representation
* Language: **Polish only**
* License: **CC-BY-SA 4.0**
* Repository: `asseco-group/roberta-incoherence-classifier`
---
## Training
* Precision: **bfloat16**
* Epochs: **5**
* Global batch: 96 × 2 devices, `gradient_accumulation_steps=11`
* Learning rate: `2e-5`, warmup ratio: `0.1`, weight decay: `0.01`
* Label smoothing: `0.05`
* Gradient checkpointing: **True**
* Model selection: best **macro F1** on validation
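
For readers who want to set up a comparable run, the hyperparameters above map roughly onto Hugging Face `TrainingArguments` fields as sketched below. This is not the authors' training script. It assumes that 96 is the per-device batch size on 2 devices, and that `macro_f1` is the key returned by a user-defined `compute_metrics` function; both are assumptions made for illustration.

```python
# Hypothetical mapping of the listed hyperparameters onto TrainingArguments fields.
# Assumption: "96 x 2 devices" means a per-device batch of 96 on 2 devices.
training_kwargs = {
    "num_train_epochs": 5,
    "per_device_train_batch_size": 96,
    "gradient_accumulation_steps": 11,
    "learning_rate": 2e-5,
    "warmup_ratio": 0.1,
    "weight_decay": 0.01,
    "label_smoothing_factor": 0.05,
    "bf16": True,
    "gradient_checkpointing": True,
    "load_best_model_at_end": True,
    "metric_for_best_model": "macro_f1",  # assumed key from a custom compute_metrics
}

# Number of examples contributing to one optimizer step under the assumption above.
num_devices = 2
effective_batch = (
    training_kwargs["per_device_train_batch_size"]
    * num_devices
    * training_kwargs["gradient_accumulation_steps"]
)
print(effective_batch)
```

The dictionary can be unpacked into `TrainingArguments` together with an `output_dir`. Note that `load_best_model_at_end` also requires matching evaluation and save strategies.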
---
## Dataset
* ~**1.3M** labeled pairs (train + val + test)
* Balanced class distribution
* Data sources include:
  - Polish subset of [MoritzLaurer/multilingual-NLI-26lang-2mil7](https://huggingface.co/datasets/MoritzLaurer/multilingual-NLI-26lang-2mil7) (only high‑quality Polish NLI pairs)
  - Synthetic high‑quality document‑style pairs generated specifically for this inconsistency detection task
  - No additional classical NLI datasets were used, because standard NLI label semantics do not fully align with this model’s stricter document‑consistency definitions
* Focus on Polish formal/procedural language (laws, tenders, IT specs, institutional instructions)
---
## Evaluation (on [asseco-group/incoherence-bench](https://huggingface.co/datasets/asseco-group/incoherence-bench), test split)
```
class          precision  recall  f1-score  support
entailment          0.94    0.90      0.92      150
neutral             0.87    0.91      0.89      150
contradiction       0.93    0.93      0.93      150

accuracy                              0.91      450
macro avg           0.91    0.91      0.91      450
weighted avg        0.91    0.91      0.91      450
```
While the task is NLI-like, the label semantics are redefined for document-level procedural consistency, for which no direct open-source baselines currently exist.
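
As a quick sanity check, the per-class and macro F1 scores in the report can be recomputed from the listed precision and recall values. The sketch below uses only the rounded numbers from the report, so small rounding differences against the original evaluation are possible; the helper name `f1` is introduced here for illustration.

```python
# Per-class (precision, recall) copied from the report above (rounded to 2 decimals).
report = {
    "entailment": (0.94, 0.90),
    "neutral": (0.87, 0.91),
    "contradiction": (0.93, 0.93),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


per_class_f1 = {label: f1(p, r) for label, (p, r) in report.items()}

# Macro F1 is the unweighted mean of the per-class F1 scores.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

for label, score in per_class_f1.items():
    print(f"{label:<13} {score:.2f}")
print(f"{'macro avg':<13} {macro_f1:.2f}")
```

Because the test split is balanced (150 pairs per class), the macro and weighted averages coincide.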
---
## Usage Example (Transformers)
```python
import torch
from transformers import pipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# top_k=None returns the scores for all labels, not just the best one
classifier = pipeline(
    "text-classification",
    model="asseco-group/roberta-incoherence-classifier",
    tokenizer="asseco-group/roberta-incoherence-classifier",
    top_k=None,
    device=device,
)

# Premise: .shp files compatible with ArcGIS 10.2, maps printed in A4
premise = (
    "Wykonawca dostarczy pliki w formacie .shp zgodne z oprogramowaniem ArcGIS 10.2, "
    "wraz z mapami wydrukowanymi w formacie A4."
)
# Hypothesis: only .kml files compatible with QGIS, documentation in A3
hypo = (
    "Wykonawca przekaże wyłącznie pliki .kml kompatybilne z QGIS "
    "i przygotuje dokumentację w formacie A3."
)

# Pass the two fragments as a text / text_pair dictionary
result = classifier({"text": premise, "text_pair": hypo})
print(result)
```
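
With `top_k=None`, the pipeline returns one `{"label", "score"}` dictionary per class. A small helper can pick the most likely label from that output. The sketch below runs on a hard-coded, made-up score list rather than a real model call. It assumes that the result may be wrapped in an extra outer list depending on the Transformers version, and that the label strings match the names used in this card; `top_label` is a hypothetical helper name.

```python
from typing import Any


def top_label(result: Any) -> tuple[str, float]:
    """Return (label, score) of the highest-scoring class."""
    # Unwrap a possible outer list: [[{...}, {...}]] -> [{...}, {...}]
    while isinstance(result, list) and result and isinstance(result[0], list):
        result = result[0]
    best = max(result, key=lambda item: item["score"])
    return best["label"], best["score"]


# Made-up scores for illustration only; the label names are assumed.
example = [[
    {"label": "contradiction", "score": 0.97},
    {"label": "neutral", "score": 0.02},
    {"label": "entailment", "score": 0.01},
]]

label, score = top_label(example)
print(label, score)
```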
### Batch / lower-level
```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "asseco-group/roberta-incoherence-classifier"
tokenizer = AutoTokenizer.from_pretrained(name, use_fast=True)
model = AutoModelForSequenceClassification.from_pretrained(name).eval()
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# (premise, hypothesis) pairs
pairs = [
    ("Zwrot kosztów w 60 dni ...", "Zwrot kosztów nastąpi w 30 dni ..."),
]

# Tokenize premises and hypotheses together as paired inputs
enc = tokenizer(
    [p for p, h in pairs],
    [h for p, h in pairs],
    padding=True, truncation=True, max_length=512,
    return_tensors="pt",
).to(device)

with torch.no_grad():
    logits = model(**enc).logits

# One row of class probabilities per pair
probs = logits.softmax(-1).cpu()
print(probs)
```
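
The printed tensor holds one row of class probabilities per pair. To turn rows into label names, the model's `config.id2label` mapping can be used. The sketch below works on plain lists (as produced by `probs.tolist()`) so that it runs without the model. The `id2label` order and the probability values shown are placeholders, and `decode` and `flag_threshold` are hypothetical names for routing likely contradictions to manual review.

```python
def decode(prob_rows, id2label, flag_threshold=0.5):
    """Map each probability row to its best label and flag likely contradictions."""
    decoded = []
    for row in prob_rows:
        # Index of the highest probability in this row
        best_idx = max(range(len(row)), key=row.__getitem__)
        label = id2label[best_idx]
        decoded.append({
            "label": label,
            "score": row[best_idx],
            "flag": label == "contradiction" and row[best_idx] >= flag_threshold,
        })
    return decoded


# Placeholder mapping; read the real one from model.config.id2label.
id2label = {0: "entailment", 1: "neutral", 2: "contradiction"}
# Placeholder rows standing in for probs.tolist().
rows = [[0.03, 0.05, 0.92], [0.80, 0.15, 0.05]]

decoded = decode(rows, id2label)
for item in decoded:
    print(item)
```

A suitable threshold depends on how costly missed contradictions are in the target workflow, so it is best tuned on in-domain validation data.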
---
## Limitations & Recommendations
* **Polish‑only** checkpoint: input in other languages is not supported
* Complex tabular / OCR / mixed‑language content may degrade quality
* Domain‑specific fine‑tuning is recommended for production
---
## Citation
```bibtex
@misc{asseco2025incoherence,
title = {Polish RoBERTa-based Incoherence/Consistency Classifier (encoder-only)},
author = {Asseco Group},
year = {2025},
url = {https://huggingface.co/asseco-group/roberta-incoherence-classifier}
}
```