ClauseClassificator — ruBERT-tiny2
Fine-tuned cointegrated/rubert-tiny2 for Russian contract clause classification into 10 semantic classes.
Accuracy: 95.0% | Macro avg F1: 0.93 | Weighted avg F1: 0.95
Classification classes
| Code | Label | Description |
|---|---|---|
| 0 | Не распознано | Unrecognized / noise |
| 1 | Название договора | Contract title |
| 2 | Дата и место заключения | Date and place of signing |
| 3 | Преамбула | Preamble / recitals |
| 4 | Заголовок раздела/подраздела | Section / subsection heading |
| 5 | Пункт договора | Contract clause |
| 6 | Продолжение предыдущего пункта | Continuation of previous clause |
| 7 | Подпункт договора | Sub-clause |
| 8 | Таблица | Table |
| 9 | Элемент маркированного списка | Bullet list item |
Performance
| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| 0 Не распознано | 0.84 | 0.76 | 0.80 | 199 |
| 1 Название договора | 0.86 | 0.91 | 0.88 | 151 |
| 2 Дата и место заключения | 0.85 | 0.98 | 0.91 | 89 |
| 3 Преамбула | 0.94 | 1.00 | 0.97 | 125 |
| 4 Заголовок раздела/подраздела | 0.98 | 0.97 | 0.98 | 310 |
| 5 Пункт договора | 1.00 | 0.94 | 0.97 | 1284 |
| 6 Продолжение предыдущего пункта | 0.86 | 0.96 | 0.91 | 314 |
| 7 Подпункт договора | 0.92 | 1.00 | 0.96 | 260 |
| 8 Таблица | 0.98 | 0.96 | 0.97 | 374 |
| 9 Элемент маркированного списка | 0.96 | 0.98 | 0.97 | 382 |
| Overall | 0.95 | 0.95 | 0.95 | 3488 |
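The headline macro and weighted F1 figures can be cross-checked against the per-class rows above. A minimal sketch in plain Python, using only the rounded F1 and support values from the table (so the results are approximate):

```python
# Per-class (F1, support) pairs copied from the performance table, classes 0-9.
rows = [
    (0.80, 199), (0.88, 151), (0.91, 89), (0.97, 125), (0.98, 310),
    (0.97, 1284), (0.91, 314), (0.96, 260), (0.97, 374), (0.97, 382),
]

total_support = sum(support for _, support in rows)

# Macro average: every class counts equally.
macro_f1 = sum(f1 for f1, _ in rows) / len(rows)

# Weighted average: each class counts in proportion to its support.
weighted_f1 = sum(f1 * support for f1, support in rows) / total_support

print(total_support, round(macro_f1, 2), round(weighted_f1, 2))
# 3488 0.93 0.95
```

The gap between the two averages comes mostly from class 0, whose low F1 pulls the macro average down more than the support-weighted one.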
Quick start
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model_name = "alexmaryin/clause-classificator-rubert-tiny2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()
# Index in this list = class code from the classification table
CLASS_NAMES = [
    "Не распознано", "Название договора", "Дата и место заключения",
    "Преамбула", "Заголовок раздела/подраздела", "Пункт договора",
    "Продолжение предыдущего пункта", "Подпункт договора",
    "Таблица", "Элемент маркированного списка",
]

def predict(text: str) -> tuple[str, float]:
    """Return the predicted class name and its softmax probability."""
    inputs = tokenizer(text, truncation=True, padding=True, max_length=512, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)
    max_prob, pred = probs.max(dim=-1)
    return CLASS_NAMES[pred.item()], max_prob.item()

print(predict("Арендодатель обязуется передать Арендатору помещение"))
# e.g. ('Пункт договора', 0.9987), the exact probability may vary
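predict returns only the top class. When a line is ambiguous (for example a clause versus a continuation of the previous clause), it can help to look at the runner-up classes too. The sketch below reproduces the softmax step in plain Python on a list of 10 logits. `top_k` is a hypothetical helper, not part of the model's API, and the logits in the usage lines are made up for illustration.

```python
import math

CLASS_NAMES = [
    "Не распознано", "Название договора", "Дата и место заключения",
    "Преамбула", "Заголовок раздела/подраздела", "Пункт договора",
    "Продолжение предыдущего пункта", "Подпункт договора",
    "Таблица", "Элемент маркированного списка",
]

def softmax(logits):
    """Numerically stable softmax over a plain list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(logits, k=3):
    """Return the k most probable (class name, probability) pairs."""
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(CLASS_NAMES[i], probs[i]) for i in ranked[:k]]

# Made-up logits for one input line (not real model output).
example_logits = [0.1, -1.0, -2.0, -0.5, 0.0, 4.0, 2.5, 1.0, -1.5, -0.3]
for label, prob in top_k(example_logits):
    print(f"{label}: {prob:.3f}")
```

With the real model, `outputs.logits[0].tolist()` from the quick start would take the place of `example_logits`.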
Model details
- Base model: cointegrated/rubert-tiny2 (3-layer BERT, 29.7M params, embedding size 312)
- Architecture: BertForSequenceClassification with 10 output labels
- Format: safetensors
- Training: Fine-tuned with class-weighted CrossEntropyLoss, AdamW, linear warmup schedule
- Hardware: Apple MPS (Mac) / CUDA
- Training time: ~8 minutes on MPS
- Checkpoint size: ~117 MB
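The class-weighted loss mentioned above can be illustrated with the common "balanced" heuristic, weight = N / (K · n_c), where N is the total number of samples, K the number of classes and n_c the sample count of class c. This is a sketch under assumptions: the card does not state the exact formula used, and the counts below are taken from the Support column of the performance table, which may not match the training split exactly.

```python
# Per-class sample counts for classes 0-9, copied from the Support column
# (assumption: these reflect the class distribution seen during training).
counts = [199, 151, 89, 125, 310, 1284, 314, 260, 374, 382]

n_samples = sum(counts)
n_classes = len(counts)

# "Balanced" inverse-frequency weights: rare classes get weights above 1,
# frequent classes get weights below 1.
weights = [n_samples / (n_classes * c) for c in counts]

for code, w in enumerate(weights):
    print(f"class {code}: {w:.3f}")
```

A list like this could be passed to the loss as `torch.nn.CrossEntropyLoss(weight=torch.tensor(weights))`.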
Training hyperparameters
| Parameter | Value |
|---|---|
| Epochs | 10 (early stopping patience: 3) |
| Batch size | 16 |
| Learning rate | 2e-5 |
| Weight decay | 0.05 |
| Warmup ratio | 0.1 |
| Optimizer | AdamW |
| Class weighting | balanced (inverse frequency) |
| Val split | 20% stratified |
| Max sequence length | 512 tokens |
| Seed | 42 |
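To make the warmup ratio concrete: with a ratio of 0.1, the learning rate climbs linearly from 0 to 2e-5 over the first 10% of optimizer steps. The sketch below assumes a linear decay to 0 afterwards, which is how `get_linear_schedule_with_warmup` in Transformers behaves. The card only says "linear warmup", so the decay shape is an assumption, and the `total_steps` value is purely illustrative.

```python
def lr_at(step, total_steps, base_lr=2e-5, warmup_ratio=0.1):
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: scale from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay phase (assumed): scale from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

total_steps = 1000  # illustrative value, not taken from the card
for step in (0, 50, 100, 550, 1000):
    print(step, lr_at(step, total_steps))
```

In this example the peak of 2e-5 is reached at step 100, and the rate is back to half the peak at step 550.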
Dataset
The model was trained on 3,488 labeled rows extracted from real Russian-language legal documents. Each row carries exactly one of 10 mutually exclusive classes that describe the logical structure of a contract (title, preamble, clauses, sub-clauses, tables, lists, etc.).
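The 20% stratified validation split listed in the hyperparameters keeps each class's share roughly equal in the training and validation parts. A minimal pure-Python sketch follows. The actual training script is not shown in this card, so the function name and the rounding rule are assumptions; scikit-learn's `train_test_split` with the `stratify` argument is the usual tool for this.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_fraction=0.2, seed=42):
    """Return (train_idx, val_idx) with about val_fraction of each class in val."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, val_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_val = round(len(indices) * val_fraction)
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)

# Toy labels: 10 items of class 5 and 5 items of class 0.
labels = [5] * 10 + [0] * 5
train_idx, val_idx = stratified_split(labels)
print(len(train_idx), len(val_idx))
# 12 3
```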
Intended use
- Automated contract clause classification
- Legal document structure parsing
- Contract analysis pipelines
- Document ingestion and structuring
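As an illustration of structure parsing, predicted codes for consecutive lines can be folded into clause blocks: a line labelled 6 (continuation of previous clause) is appended to the preceding block instead of starting a new one. This is a hypothetical post-processing sketch, not part of the model. `merge_continuations` and the sample input, including its hand-assigned class codes, are assumptions made for the example.

```python
CONTINUATION = 6  # "Продолжение предыдущего пункта"

def merge_continuations(lines):
    """Fold continuation lines into the block that precedes them.

    `lines` is a list of (text, class_code) pairs in document order.
    Returns a list of [text, class_code] blocks.
    """
    blocks = []
    for text, code in lines:
        if code == CONTINUATION and blocks:
            # Append to the previous block, keeping that block's class.
            blocks[-1][0] += " " + text
        else:
            # New block (a leading continuation with nothing before it
            # is kept as its own block).
            blocks.append([text, code])
    return blocks

# Hand-labelled sample lines (codes assigned manually, not model output).
sample = [
    ("1. Предмет договора", 4),
    ("1.1. Арендодатель обязуется передать", 5),
    ("Арендатору помещение.", 6),
    ("1.2. Срок аренды составляет один год.", 5),
]
for text, code in merge_continuations(sample):
    print(code, text)
```

In a real pipeline the class codes would come from the classifier, one prediction per extracted line.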
Limitations
- Trained exclusively on Russian-language contracts
- Model size is optimized for speed (tiny BERT), not maximum accuracy; larger models (e.g., ruBERT-large) may yield better results at higher computational cost
- Some smaller classes score lower: class 0 "Не распознано" (199 samples) has precision 0.84 and recall 0.76, the lowest F1 in the performance table (0.80)
Uploading your own version
pip install huggingface_hub
huggingface-cli login
python -c "
from huggingface_hub import HfApi
api = HfApi()
api.create_repo(repo_id='your-username/clause-classificator-rubert-tiny2', repo_type='model', exist_ok=True)
api.upload_folder(
    folder_path='models/checkpoint',
    repo_id='your-username/clause-classificator-rubert-tiny2',
    repo_type='model',
)
"
License
MIT