ClauseClassificator — ruBERT-tiny2

Fine-tuned cointegrated/rubert-tiny2 for Russian contract clause classification into 10 semantic classes.

Accuracy: 95.0% | Macro avg F1: 0.93 | Weighted avg F1: 0.95

Classification classes

Code	Label	Description
0	Не распознано	Unrecognized / noise
1	Название договора	Contract title
2	Дата и место заключения	Date and place of signing
3	Преамбула	Preamble / recitals
4	Заголовок раздела/подраздела	Section / subsection heading
5	Пункт договора	Contract clause
6	Продолжение предыдущего пункта	Continuation of previous clause
7	Подпункт договора	Sub-clause
8	Таблица	Table
9	Элемент маркированного списка	Bullet list item

Performance

Class	Precision	Recall	F1	Support
0 Не распознано	0.84	0.76	0.80	199
1 Название договора	0.86	0.91	0.88	151
2 Дата и место заключения	0.85	0.98	0.91	89
3 Преамбула	0.94	1.00	0.97	125
4 Заголовок раздела/подраздела	0.98	0.97	0.98	310
5 Пункт договора	1.00	0.94	0.97	1284
6 Продолжение предыдущего пункта	0.86	0.96	0.91	314
7 Подпункт договора	0.92	1.00	0.96	260
8 Таблица	0.98	0.96	0.97	374
9 Элемент маркированного списка	0.96	0.98	0.97	382
Weighted avg	0.95	0.95	0.95	3488
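The headline averages can be re-derived from the per-class rows. The sketch below recomputes macro and weighted F1 from the rounded values in the table, so it is only accurate to two decimals; the variable names are illustrative.

```python
# Per-class F1 and support, copied from the table above (rounded values).
f1 = [0.80, 0.88, 0.91, 0.97, 0.98, 0.97, 0.91, 0.96, 0.97, 0.97]
support = [199, 151, 89, 125, 310, 1284, 314, 260, 374, 382]

total = sum(support)

# Macro F1: plain mean, every class counts equally.
macro_f1 = sum(f1) / len(f1)

# Weighted F1: mean weighted by the number of samples in each class.
weighted_f1 = sum(f * s for f, s in zip(f1, support)) / total

print(total, round(macro_f1, 2), round(weighted_f1, 2))
# 3488 0.93 0.95
```

The gap between the two averages comes from the small classes (0, 1, 2), which pull the macro figure down but carry little weight in the weighted one.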

Quick start

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "alexmaryin/clause-classificator-rubert-tiny2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

CLASS_NAMES = [
    "Не распознано", "Название договора", "Дата и место заключения",
    "Преамбула", "Заголовок раздела/подраздела", "Пункт договора",
    "Продолжение предыдущего пункта", "Подпункт договора",
    "Таблица", "Элемент маркированного списка",
]

def predict(text: str) -> tuple[str, float]:
    inputs = tokenizer(text, truncation=True, padding=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)
    max_prob, pred = probs.max(dim=-1)
    return CLASS_NAMES[pred.item()], max_prob.item()

print(predict("Арендодатель обязуется передать Арендатору помещение"))
# Example output: ('Пункт договора', 0.9987); the exact probability may differ
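Because predict returns a confidence alongside the label, a caller may want to route low-confidence rows to class 0 instead of trusting the top label. The helper below is a hypothetical sketch, not part of the model: the name label_or_fallback and the 0.6 threshold are assumptions that should be tuned on your own validation data.

```python
UNRECOGNIZED = "Не распознано"  # class 0 in the table of classes

def label_or_fallback(label: str, confidence: float, threshold: float = 0.6) -> str:
    """Keep the predicted label only if the model is confident enough.

    The threshold is an illustrative default, not a value from the model card.
    """
    return label if confidence >= threshold else UNRECOGNIZED

# Stand-in (label, confidence) pairs, shaped like the output of predict().
print(label_or_fallback("Пункт договора", 0.9987))  # confident: label is kept
print(label_or_fallback("Таблица", 0.41))           # uncertain: falls back to class 0
```

With the quick-start function in scope this would be called as label_or_fallback(*predict(text)).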

Model details

Base model: cointegrated/rubert-tiny2 (3-layer BERT, about 29M params, embedding size 312)
Architecture: BertForSequenceClassification with 10 output labels
Format: safetensors
Training: Fine-tuned with class-weighted CrossEntropyLoss, AdamW, linear warmup schedule
Hardware: Apple MPS (Mac) / CUDA
Training time: ~8 minutes on MPS
Checkpoint size: ~117 MB
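To illustrate what class-weighted loss means here, the sketch below derives inverse-frequency weights using the "balanced" convention from scikit-learn, n_samples / (n_classes * class_count). Two assumptions: the card does not state the exact formula used, and the counts are taken from the Support column of the performance table rather than from the actual training split.

```python
# Class counts, assumed equal to the Support column of the performance table.
support = {0: 199, 1: 151, 2: 89, 3: 125, 4: 310,
           5: 1284, 6: 314, 7: 260, 8: 374, 9: 382}

n_samples = sum(support.values())
n_classes = len(support)

# "Balanced" weighting: rare classes get weights above 1, frequent ones below 1.
weights = {c: n_samples / (n_classes * n) for c, n in support.items()}

for code, weight in weights.items():
    print(code, round(weight, 2))
```

In PyTorch, such weights would typically be passed as a tensor to torch.nn.CrossEntropyLoss via its weight argument, ordered by class code.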

Training hyperparameters

Parameter	Value
Epochs	10 (early stopping patience: 3)
Batch size	16
Learning rate	2e-5
Weight decay	0.05
Warmup ratio	0.1
Optimizer	AdamW
Class weighting	balanced (inverse frequency)
Val split	20% stratified
Max sequence length	512 tokens
Seed	42
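As a rough illustration of how these values interact, the sketch below computes the learning rate at a given optimizer step. It assumes linear warmup followed by linear decay to zero (the shape produced by get_linear_schedule_with_warmup in transformers), that 80% of the 3,488 rows are used for training, and that all 10 epochs run without early stopping. None of these details are confirmed by the card.

```python
import math

def lr_at_step(step: int, total_steps: int,
               base_lr: float = 2e-5, warmup_ratio: float = 0.1) -> float:
    """Linear warmup to base_lr, then linear decay to zero (assumed shape)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Assumed step budget: 80% of 3,488 rows, batch size 16, 10 full epochs.
train_rows = int(3488 * 0.8)
steps_per_epoch = math.ceil(train_rows / 16)
total_steps = steps_per_epoch * 10

for step in (0, total_steps // 20, total_steps // 10, total_steps // 2, total_steps):
    print(step, lr_at_step(step, total_steps))
```

Under these assumptions the peak rate of 2e-5 is reached after the first epoch, at the end of the 10% warmup.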

Dataset

The model was trained on 3,488 rows of labeled Russian contract clauses extracted from real legal documents. The dataset covers 10 mutually exclusive classes representing the logical structure of a contract (title, preamble, clauses, sub-clauses, tables, lists, etc.).
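The 20% stratified validation split listed in the hyperparameters keeps each class's share the same in both parts. The card does not say which tool was used, so the following is a dependency-free sketch of the idea; the function name stratified_split and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_fraction=0.2, seed=42):
    """Split row indices so each class contributes ~val_fraction to validation."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return train_idx, val_idx

# Toy labels only; the real dataset has 10 classes and 3,488 rows.
labels = [0] * 10 + [5] * 50 + [8] * 20
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))
# 64 16
```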

Intended use

Automated contract clause classification
Legal document structure parsing
Contract analysis pipelines
Document ingestion and structuring
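For structure parsing, one plausible post-processing step is to glue rows labeled as class 6 (continuation of previous clause) onto the row before them. The sketch below assumes the input is a list of (text, class_code) pairs already produced by the classifier; the function name merge_continuations and the sample rows are hypothetical.

```python
CONTINUATION = 6  # "Продолжение предыдущего пункта"

def merge_continuations(rows):
    """Append every class-6 row to the preceding row; keep other rows as they are."""
    merged = []
    for text, code in rows:
        if code == CONTINUATION and merged:
            prev_text, prev_code = merged[-1]
            merged[-1] = (prev_text + " " + text, prev_code)
        else:
            # A continuation with nothing before it is kept unchanged.
            merged.append((text, code))
    return merged

rows = [
    ("1. ПРЕДМЕТ ДОГОВОРА", 4),
    ("1.1. Арендодатель обязуется передать", 5),
    ("Арендатору помещение", 6),
]
for text, code in merge_continuations(rows):
    print(code, text)
```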

Limitations

Trained exclusively on Russian-language contracts
Model size is optimized for speed (tiny BERT), not maximum accuracy; larger models (e.g., ruBERT-large) may yield better results at higher computational cost
The smallest classes tend to score lowest: class 0 "Не распознано" (199 samples) reaches precision 0.84 and recall 0.76, and classes 1 and 2 (151 and 89 samples) have precision of 0.86 and 0.85

Uploading your own version

pip install huggingface_hub
huggingface-cli login

python -c "
from huggingface_hub import HfApi
api = HfApi()
api.create_repo(repo_id='your-username/clause-classificator-rubert-tiny2', repo_type='model', exist_ok=True)
api.upload_folder(
    folder_path='models/checkpoint',
    repo_id='your-username/clause-classificator-rubert-tiny2',
    repo_type='model',
)
"

License

MIT
