ClauseClassificator — ruBERT-tiny2

Fine-tuned cointegrated/rubert-tiny2 for Russian contract clause classification into 10 semantic classes.

Accuracy: 95.0% | Macro avg F1: 0.93 | Weighted avg F1: 0.95

Classification classes

| Code | Label | Description |
|------|-------|-------------|
| 0 | Не распознано | Unrecognized / noise |
| 1 | Название договора | Contract title |
| 2 | Дата и место заключения | Date and place of signing |
| 3 | Преамбула | Preamble / recitals |
| 4 | Заголовок раздела/подраздела | Section / subsection heading |
| 5 | Пункт договора | Contract clause |
| 6 | Продолжение предыдущего пункта | Continuation of previous clause |
| 7 | Подпункт договора | Sub-clause |
| 8 | Таблица | Table |
| 9 | Элемент маркированного списка | Bullet list item |

Performance

| Class | Precision | Recall | F1 | Support |
|-------|-----------|--------|----|---------|
| 0 Не распознано | 0.84 | 0.76 | 0.80 | 199 |
| 1 Название договора | 0.86 | 0.91 | 0.88 | 151 |
| 2 Дата и место заключения | 0.85 | 0.98 | 0.91 | 89 |
| 3 Преамбула | 0.94 | 1.00 | 0.97 | 125 |
| 4 Заголовок раздела/подраздела | 0.98 | 0.97 | 0.98 | 310 |
| 5 Пункт договора | 1.00 | 0.94 | 0.97 | 1284 |
| 6 Продолжение предыдущего пункта | 0.86 | 0.96 | 0.91 | 314 |
| 7 Подпункт договора | 0.92 | 1.00 | 0.96 | 260 |
| 8 Таблица | 0.98 | 0.96 | 0.97 | 374 |
| 9 Элемент маркированного списка | 0.96 | 0.98 | 0.97 | 382 |
| Overall | 0.95 | 0.95 | 0.95 | 3488 |
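
The headline macro and weighted F1 figures can be re-derived from the per-class rows. The following is a small arithmetic check, not the author's evaluation script; it only uses the F1 and Support values listed in the table.

```python
# Per-class F1 and support, copied from the Performance table (classes 0-9).
f1 = [0.80, 0.88, 0.91, 0.97, 0.98, 0.97, 0.91, 0.96, 0.97, 0.97]
support = [199, 151, 89, 125, 310, 1284, 314, 260, 374, 382]

total = sum(support)

# Macro average: every class counts equally, regardless of its size.
macro_f1 = sum(f1) / len(f1)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f * n for f, n in zip(f1, support)) / total

print(total, round(macro_f1, 2), round(weighted_f1, 2))
# 3488 0.93 0.95
```

The macro average is lower than the weighted one because the weakest class (0) is small, so it pulls the unweighted mean down more than the support-weighted mean.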

Quick start

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "alexmaryin/clause-classificator-rubert-tiny2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

CLASS_NAMES = [
    "Не распознано", "Название договора", "Дата и место заключения",
    "Преамбула", "Заголовок раздела/подраздела", "Пункт договора",
    "Продолжение предыдущего пункта", "Подпункт договора",
    "Таблица", "Элемент маркированного списка",
]

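def predict(text: str) -> tuple[str, float]:
    """Return the predicted class name and its softmax probability for one text."""
    # Tokenize a single string; inputs longer than 512 tokens are truncated.
    inputs = tokenizer(text, truncation=True, padding=True, max_length=512, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)
    max_prob, pred = probs.max(dim=-1)
    return CLASS_NAMES[pred.item()], max_prob.item()

print(predict("Арендодатель обязуется передать Арендатору помещение"))
# e.g. ('Пункт договора', 0.9987...); the exact probability may vary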
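
predict keeps only the most probable class. When predictions are reviewed by a person, the runner-up classes can also be useful. The sketch below repeats the softmax step in plain Python and returns the top k classes. The helper name top_k and the example logits are made up for illustration; real logits would come from outputs.logits[0].tolist().

```python
import math

CLASS_NAMES = [
    "Не распознано", "Название договора", "Дата и место заключения",
    "Преамбула", "Заголовок раздела/подраздела", "Пункт договора",
    "Продолжение предыдущего пункта", "Подпункт договора",
    "Таблица", "Элемент маркированного списка",
]

def top_k(logits: list[float], k: int = 3) -> list[tuple[str, float]]:
    """Return the k most probable (label, probability) pairs (hypothetical helper)."""
    # Numerically stable softmax: subtract the largest logit before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    probs = [e / norm for e in exps]
    # Sort labels by probability, highest first, and keep the first k.
    ranked = sorted(zip(CLASS_NAMES, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Made-up logits for illustration only, one value per class.
example_logits = [0.1, -1.2, -0.8, 0.3, -0.5, 4.0, 2.5, 1.0, -2.0, -0.3]
for label, prob in top_k(example_logits):
    print(f"{label}: {prob:.3f}")
```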

Model details

  • Base model: cointegrated/rubert-tiny2 — 3-layer BERT, 29.7M params, embedding size 312
  • Architecture: BertForSequenceClassification with 10 output labels
  • Format: safetensors
  • Training: Fine-tuned with class-weighted CrossEntropyLoss, AdamW, linear warmup schedule
  • Hardware: Apple MPS (Mac) / CUDA
  • Training time: ~8 minutes on MPS
  • Checkpoint size: ~117 MB
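
The class weights for the weighted CrossEntropyLoss are not listed in this card. As a rough sketch, they can be approximated from the Support column of the Performance table, assuming the scikit-learn "balanced" convention (n_samples / (n_classes * class_count)) and assuming the training split has the same class proportions as the full dataset. Both are assumptions, not confirmed details of the training script.

```python
# Per-class sample counts, taken from the Support column of the Performance table.
counts = [199, 151, 89, 125, 310, 1284, 314, 260, 374, 382]

n_samples = sum(counts)
n_classes = len(counts)

# Assumed formula (scikit-learn "balanced" convention):
# weight_c = n_samples / (n_classes * count_c)
weights = [n_samples / (n_classes * c) for c in counts]

for code, w in enumerate(weights):
    print(f"class {code}: weight {w:.3f}")
```

Under this formula the rarest class (2, 89 samples) gets the largest weight and the dominant class (5, 1284 samples) the smallest. In PyTorch such a list would typically be passed as torch.nn.CrossEntropyLoss(weight=torch.tensor(weights)).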

Training hyperparameters

| Parameter | Value |
|-----------|-------|
| Epochs | 10 (early stopping patience: 3) |
| Batch size | 16 |
| Learning rate | 2e-5 |
| Weight decay | 0.05 |
| Warmup ratio | 0.1 |
| Optimizer | AdamW |
| Class weighting | balanced (inverse frequency) |
| Val split | 20% stratified |
| Max sequence length | 512 tokens |
| Seed | 42 |
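
To make the warmup ratio concrete, the sketch below converts these values into step counts and a learning-rate function. It rests on several assumptions that this card does not state: the validation split is rounded up, there is no gradient accumulation, all 10 epochs run (early stopping may end training sooner), and the rate decays linearly to zero after warmup, as in the Transformers helper get_linear_schedule_with_warmup.

```python
import math

# Values from the hyperparameter table and the Dataset section.
DATASET_ROWS = 3488
VAL_FRACTION = 0.20
BATCH_SIZE = 16
EPOCHS = 10
BASE_LR = 2e-5
WARMUP_RATIO = 0.1

# Assumption: the validation split is rounded up and training keeps the rest.
train_rows = DATASET_ROWS - math.ceil(DATASET_ROWS * VAL_FRACTION)
steps_per_epoch = math.ceil(train_rows / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS  # upper bound; early stopping may end sooner
warmup_steps = round(total_steps * WARMUP_RATIO)

def learning_rate(step: int) -> float:
    """Linear warmup from 0 to BASE_LR, then (assumed) linear decay to 0."""
    if step < warmup_steps:
        return BASE_LR * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return BASE_LR * remaining / max(1, total_steps - warmup_steps)

print(train_rows, steps_per_epoch, total_steps, warmup_steps)
# 2790 175 1750 175
```

Under these assumptions the warmup covers roughly the first epoch, and the peak rate of 2e-5 is reached at step 175.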

Dataset

The dataset contains 3,488 labeled rows extracted from real Russian legal contracts; 20% were held out as a stratified validation split. Each row belongs to exactly one of 10 classes that describe the logical structure of a contract (title, preamble, clauses, sub-clauses, tables, lists, etc.).

Intended use

  • Automated contract clause classification
  • Legal document structure parsing
  • Contract analysis pipelines
  • Document ingestion and structuring
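
As an illustration of structure parsing, per-line predictions can be folded into clause objects. The sketch below is one possible post-processing step, not part of this model: the Clause class, the group_clauses function, and the rule that a continuation attaches to the most recent clause or sub-clause are all hypothetical design choices. The labels in the example are assigned by hand, not produced by the model.

```python
from dataclasses import dataclass, field

# Class codes from the "Classification classes" table.
CLAUSE, CONTINUATION, SUBCLAUSE = 5, 6, 7

@dataclass
class Clause:
    text: str
    subclauses: list[str] = field(default_factory=list)

def group_clauses(lines: list[tuple[str, int]]) -> list[Clause]:
    """Fold continuations and sub-clauses into the clause they follow.

    `lines` holds (text, predicted class code) pairs in document order.
    Lines of other classes, and fragments seen before any clause, are skipped.
    """
    clauses: list[Clause] = []
    for text, code in lines:
        if code == CLAUSE:
            clauses.append(Clause(text))
        elif code == SUBCLAUSE and clauses:
            clauses[-1].subclauses.append(text)
        elif code == CONTINUATION and clauses:
            current = clauses[-1]
            if current.subclauses:
                # Continue the most recent sub-clause.
                current.subclauses[-1] += " " + text
            else:
                # No sub-clause yet, so continue the clause itself.
                current.text += " " + text
    return clauses

# Hand-labeled example lines (not model output).
lines = [
    ("1. Предмет договора", 4),
    ("1.1. Арендодатель обязуется передать Арендатору помещение", 5),
    ("по акту приема-передачи.", 6),
    ("а) в срок до 1 марта;", 7),
]
result = group_clauses(lines)
print(result[0].text)
print(result[0].subclauses)
```

A real pipeline would likely also use headings (class 4) to nest clauses under sections; that is left out here to keep the sketch short.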

Limitations

  • Trained exclusively on Russian-language contracts
  • Model size is optimized for speed (tiny BERT), not maximum accuracy; larger models (e.g., ruBERT-large) may yield better results at higher computational cost
  • Classes with limited training data show lower precision and recall: class 0 "Не распознано" (199 samples) is the weakest, with precision 0.84, recall 0.76, and F1 0.80

Uploading your own version

pip install huggingface_hub
huggingface-cli login

python -c "
from huggingface_hub import HfApi
api = HfApi()
api.create_repo(repo_id='your-username/clause-classificator-rubert-tiny2', repo_type='model', exist_ok=True)
api.upload_folder(
    folder_path='models/checkpoint',
    repo_id='your-username/clause-classificator-rubert-tiny2',
    repo_type='model',
)
"

License

MIT
