---
license: cc-by-sa-4.0
language:
- pl
tags:
- text-classification
- encoder-only
- polish
- inconsistency-detection
pipeline_tag: text-classification
model-index:
- name: asseco-group/roberta-incoherence-classifier
  results:
  - task:
      type: text-classification
      name: Document inconsistency detection (NLI-like)
    dataset:
      name: asseco-group/incoherence-bench
      type: text
      split: test
    metrics:
    - type: f1
      name: F1 (macro)
      value: 0.91
    - type: accuracy
      name: Accuracy
      value: 0.91
---

<h1 align="center">roberta-incoherence-classifier</h1>

Encoder-based classifier for document inconsistency detection in **Polish**. This model evaluates the semantic consistency between two text fragments (e.g. sections of legal, procurement or organizational documents). It follows an NLI-like setup but **redefines the labels specifically for document coherence auditing**. This model was **initialized from [PKOBP/polish-roberta-8k](https://huggingface.co/PKOBP/polish-roberta-8k)** and **adapted into an inconsistency classifier** through supervised training on high-quality document-style pairs.

---

## Intended Use

* Document consistency auditing (legal, public tender, IT documentation, organizational materials)
* Detecting contradictory statements, scope mismatches, term/role/format inconsistencies
* NLI‑like semantic relation classification with adapted label semantics

**Not intended for:**

* Fact-checking against external world knowledge
* Non‑Polish language inputs
* General misinformation / sentiment / toxicity detection

Fine-tuning on domain-specific data is recommended for best production accuracy.

---

## Label Definition (Adapted vs. Classical NLI)

| Label | Meaning |
| ----------------- | ------- |
| **entailment** | Hypothesis is a faithful, condensed or paraphrased restatement of the premise. All critical constraints, actors, conditions and scope remain intact. |
| **neutral** | Hypothesis neither follows from nor contradicts the premise. Typically introduces unverifiable or out‑of‑scope information (e.g. different institutions, expanded context, unrelated assumptions). |
| **contradiction** | Hypothesis directly conflicts with the premise: it reverses permissions/requirements or changes the legal scope, numeric limits, formats, dates, or the responsible authority, or the two statements otherwise cannot realistically be true at the same time. |

**Rule:** A single critical mismatch (date / territory / authority / format / obligatory vs. optional) is sufficient for `contradiction`, even if most of the text agrees.
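The rule above can be read as simple decision logic. The sketch below is only an illustration of the labeling convention, not how the model computes its prediction; the aspect names and the helper `has_critical_mismatch` are hypothetical and chosen to mirror the list in the rule.

```python
from typing import Mapping

# Hypothetical aspect names mirroring the rule above (assumption, not part of the model).
CRITICAL_ASPECTS = ("date", "territory", "authority", "format", "obligation")


def has_critical_mismatch(aspect_agrees: Mapping[str, bool]) -> bool:
    """Return True if at least one critical aspect disagrees.

    Aspects missing from the mapping are treated as agreeing, so one
    explicit mismatch is enough, no matter how many aspects agree.
    """
    return any(not aspect_agrees.get(aspect, True) for aspect in CRITICAL_ASPECTS)


# Everything agrees except the delivery format: still a contradiction.
mostly_consistent = {
    "date": True,
    "territory": True,
    "authority": True,
    "format": False,
    "obligation": True,
}
fully_consistent = {aspect: True for aspect in CRITICAL_ASPECTS}

print(has_critical_mismatch(mostly_consistent))  # True
print(has_critical_mismatch(fully_consistent))   # False
```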

---

## Model Details

* Base architecture: **RoBERTa‑large (encoder‑only)**
* Classification head: standard HF linear head on pooled representation
* Language: **Polish only**
* License: **CC-BY-SA 4.0**
* Repository: `asseco-group/roberta-incoherence-classifier`

---

## Training

* Precision: **bfloat16**
* Epochs: **5**
* Global batch: 96 × 2 devices, `gradient_accumulation_steps=11`
* Learning rate: `2e-5`, warmup ratio: `0.1`, weight decay: `0.01`
* Label smoothing: `0.05`
* Gradient checkpointing: **True**
* Model selection: best **macro F1** on validation
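As a rough sketch, the settings listed above could be expressed as keyword arguments for `transformers.TrainingArguments`. Two things here are assumptions rather than facts from this card: that `96` is the per-device batch size, and the metric key `f1_macro`, which depends on whatever a `compute_metrics` function returns. The actual training script is not published here.

```python
# Assumption: 96 is the per-device batch size on each of the 2 devices.
per_device_batch = 96
num_devices = 2
grad_accum_steps = 11

# Number of examples contributing to one optimizer step under that assumption.
effective_batch = per_device_batch * num_devices * grad_accum_steps

# Keyword arguments named after TrainingArguments parameters.
training_kwargs = {
    "num_train_epochs": 5,
    "per_device_train_batch_size": per_device_batch,
    "gradient_accumulation_steps": grad_accum_steps,
    "learning_rate": 2e-5,
    "warmup_ratio": 0.1,
    "weight_decay": 0.01,
    "label_smoothing_factor": 0.05,
    "bf16": True,
    "gradient_checkpointing": True,
    # Selecting the best checkpoint requires matching eval and save strategies.
    "load_best_model_at_end": True,
    "metric_for_best_model": "f1_macro",  # hypothetical metric key
}

print(effective_batch)  # 2112
```

With `transformers` installed, such a dictionary could be passed as `TrainingArguments(output_dir=..., **training_kwargs)`.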

---

## Dataset

* ~**1.3M** labeled pairs (train + val + test)
* Balanced class distribution
* Data sources include:
  - Polish subset of [MoritzLaurer/multilingual-NLI-26lang-2mil7](https://huggingface.co/datasets/MoritzLaurer/multilingual-NLI-26lang-2mil7) (only high‑quality Polish NLI pairs)
  - Synthetic high‑quality document‑style pairs generated specifically for this inconsistency detection task
  - No additional classical NLI datasets were used, because standard NLI label semantics do not fully align with this model’s stricter document‑consistency definitions
* Focus on Polish formal/procedural language (laws, tenders, IT specs, institutional instructions)

---

## Evaluation (on [asseco-group/incoherence-bench](https://huggingface.co/datasets/asseco-group/incoherence-bench), test split)

```
               precision    recall  f1-score   support
   entailment       0.94      0.90      0.92       150
      neutral       0.87      0.91      0.89       150
contradiction       0.93      0.93      0.93       150

     accuracy                           0.91       450
    macro avg       0.91      0.91      0.91       450
 weighted avg       0.91      0.91      0.91       450
```

While the task is NLI-like, the label semantics are redefined for document-level procedural consistency, for which no direct open-source baselines currently exist.
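For readers who want to see how the macro-averaged F1 relates to the per-class rows, here is a small arithmetic check. It uses only the rounded precision and recall values reported above, so it reproduces the table to two decimals rather than the exact underlying scores.

```python
# Rounded (precision, recall) per class, copied from the report above.
report = {
    "entailment": (0.94, 0.90),
    "neutral": (0.87, 0.91),
    "contradiction": (0.93, 0.93),
}


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


per_class_f1 = {label: f1(p, r) for label, (p, r) in report.items()}

# Macro F1 is the unweighted mean of the per-class F1 scores.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

for label, score in per_class_f1.items():
    print(f"{label}: {score:.2f}")
print(f"macro F1: {macro_f1:.2f}")
```

Because every class has the same support (150), the weighted average coincides with the macro average here.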

---

## Usage Example (Transformers)

```python
import torch
from transformers import pipeline

# Use the GPU when one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"

classifier = pipeline(
    "text-classification",
    model="asseco-group/roberta-incoherence-classifier",
    tokenizer="asseco-group/roberta-incoherence-classifier",
    top_k=None,  # return scores for all labels, not only the top one
    device=device,
)

premise = (
    "Wykonawca dostarczy pliki w formacie .shp zgodne z oprogramowaniem ArcGIS 10.2, "
    "wraz z mapami wydrukowanymi w formacie A4."
)

hypo = (
    "Wykonawca przekaże wyłącznie pliki .kml kompatybilne z QGIS "
    "i przygotuje dokumentację w formacie A3."
)

# Pass the pair as a dict so premise and hypothesis are encoded together.
result = classifier({"text": premise, "text_pair": hypo})
print(result)
```
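The pipeline output is a list of `{"label", "score"}` entries, so some post-processing is usually needed. The helpers below are a minimal sketch: `top_prediction`, `flag_contradiction` and the `0.5` threshold are hypothetical, the scores are made-up placeholders rather than real model output, and the label strings assume that the model config uses the names from the label table.

```python
from typing import Any, Dict, List


def top_prediction(scores: List[Any]) -> Dict[str, Any]:
    """Return the highest-scoring {"label", "score"} entry."""
    # Depending on the transformers version, a single input may come back
    # wrapped in an extra list; unwrap it if so.
    if scores and isinstance(scores[0], list):
        scores = scores[0]
    return max(scores, key=lambda item: item["score"])


def flag_contradiction(scores: List[Any], threshold: float = 0.5) -> bool:
    """Flag a pair only if `contradiction` wins with enough confidence."""
    best = top_prediction(scores)
    return best["label"] == "contradiction" and best["score"] >= threshold


# Placeholder scores for illustration only (not real model output).
example_scores = [
    {"label": "entailment", "score": 0.02},
    {"label": "neutral", "score": 0.08},
    {"label": "contradiction", "score": 0.90},
]

best = top_prediction(example_scores)
print(best["label"], flag_contradiction(example_scores))
```

A suitable threshold depends on the audit workflow and is best tuned on domain data.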

### Batch / lower-level

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

name = "asseco-group/roberta-incoherence-classifier"
tokenizer = AutoTokenizer.from_pretrained(name, use_fast=True)
model = AutoModelForSequenceClassification.from_pretrained(name).eval()
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Each item is a (premise, hypothesis) pair.
pairs = [
    ("Zwrot kosztów w 60 dni ...", "Zwrot kosztów nastąpi w 30 dni ..."),
]
# Tokenize premises and hypotheses as paired segments.
enc = tokenizer(
    [p for p, h in pairs],
    [h for p, h in pairs],
    padding=True, truncation=True, max_length=512,
    return_tensors="pt"
).to(device)

with torch.no_grad():
    logits = model(**enc).logits
probs = logits.softmax(-1).cpu()
print(probs)

# Map each row's highest-probability index to its label name from the model config.
labels = [model.config.id2label[i] for i in probs.argmax(-1).tolist()]
print(labels)
```

---

## Limitations & Recommendations

* **Polish‑only** checkpoint: out‑of‑language input is not supported
* Complex tabular / OCR / mixed‑language content may degrade quality
* Domain‑specific fine‑tuning is recommended for production

---

## Citation

```bibtex
@misc{asseco2025incoherence,
  title  = {Polish RoBERTa-based Incoherence/Consistency Classifier (encoder-only)},
  author = {Asseco Group},
  year   = {2025},
  url    = {https://huggingface.co/asseco-group/roberta-incoherence-classifier}
}
```