---
datasets:
- psytechlab/rus_rudeft_wcl-wiki
language:
- ru
base_model:
- DeepPavlov/rubert-base-cased
---
# RuBERT base fine-tuned on the ruDEFT and WCL Wiki Ru datasets
The model detects whether a text contains a definition, i.e. it predicts the `definition_label` column of the dataset.
```python
import torch
from transformers import AutoTokenizer, BertForSequenceClassification
tokenizer = AutoTokenizer.from_pretrained("psytechlab/wcl-wiki_rudeft__rubert-model")
model = BertForSequenceClassification.from_pretrained("psytechlab/wcl-wiki_rudeft__rubert-model")
model.eval()
text = [
    "москва - это город в РФ",  # "Moscow is a city in Russia" (a definition)
    "хочу изучать языки",       # "I want to learn languages" (not a definition)
]
tokenized_text = tokenizer(text, padding="max_length", truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    prediction = model(**tokenized_text).logits
print(prediction.argmax(dim=1).numpy())
# [1 0]
```
## Preprocessing
- lower_string
- remove_punct
- remove_latin
- swap_enter_to_space
- collapse_spaces
- strip_string
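The exact implementations of these steps are not published with the model; the sketch below shows plausible equivalents (the `preprocess` function name, the regexes, and the step implementations are all assumptions):

```python
import re
import string

def preprocess(text: str) -> str:
    """Assumed implementation of the preprocessing steps listed above."""
    text = text.lower()                                                # lower_string
    text = text.translate(str.maketrans("", "", string.punctuation))   # remove_punct
    text = re.sub(r"[a-z]+", "", text)                                 # remove_latin (after lowering)
    text = text.replace("\n", " ")                                     # swap_enter_to_space
    text = re.sub(r"\s+", " ", text)                                   # collapse_spaces
    return text.strip()                                                # strip_string

print(preprocess("Москва -  это\nгород (city) в РФ"))
# -> "москва это город в рф"
```

Note the order matters: Latin removal runs after lowercasing so a case-insensitive pattern is not needed, and space collapsing runs after newline replacement so it also absorbs the gaps left by removed tokens.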
## Training procedure
### Training
The training was done with the `Trainer` class using the following `TrainingArguments`:
```python
training_args = TrainingArguments(
    num_train_epochs=7,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    learning_rate=3e-5,
    logging_strategy="steps",
    logging_steps=50,
    save_strategy="epoch",
    save_total_limit=5,
    seed=21,
    metric_for_best_model="eval_f1_macro",
)
```
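The `metric_for_best_model="eval_f1_macro"` setting implies a `compute_metrics` function that reports macro F1 to the `Trainer`. A minimal sketch of such a function (its name and implementation are assumptions, not the published training code):

```python
import numpy as np
from sklearn.metrics import f1_score

def compute_metrics(eval_pred):
    """Return macro F1 so Trainer can track it as `eval_f1_macro`."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1_macro": f1_score(labels, preds, average="macro")}

# Perfect predictions on two samples give a macro F1 of 1.0.
print(compute_metrics((np.array([[0.1, 0.9], [0.8, 0.2]]), np.array([1, 0]))))
```

Such a function would be passed to `Trainer(..., compute_metrics=compute_metrics)`; the `eval_` prefix in the metric name is added by the `Trainer` during evaluation.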
### Metrics
Metrics on the combined set (ruDEFT + WCL Wiki Ru), `psytechlab/rus_rudeft_wcl-wiki`:
```text
              precision    recall  f1-score   support

           0       0.90      0.93      0.92      1421
           1       0.87      0.81      0.84       753

    accuracy                           0.89      2174
   macro avg       0.88      0.87      0.88      2174
weighted avg       0.89      0.89      0.89      2174
```
Metrics only on `astromis/ruDEFT`:
```text
              precision    recall  f1-score   support

           0       0.87      0.95      0.91       836
           1       0.84      0.67      0.74       353

    accuracy                           0.86      1189
   macro avg       0.85      0.81      0.82      1189
weighted avg       0.86      0.86      0.86      1189
```
Metrics only on `astromis/WCL_Wiki_Ru`:
```text
              precision    recall  f1-score   support

           0       0.95      0.92      0.93       585
           1       0.89      0.93      0.91       400

    accuracy                           0.92       985
   macro avg       0.92      0.92      0.92       985
weighted avg       0.92      0.92      0.92       985
```
# Citation
```bibtex
@article{Popov2025TransferringNL,
  title={Transferring Natural Language Datasets Between Languages Using Large Language Models for Modern Decision Support and Sci-Tech Analytical Systems},
  author={Dmitrii Popov and Egor Terentev and Danil Serenko and Ilya Sochenkov and Igor Buyanov},
  journal={Big Data and Cognitive Computing},
  year={2025},
  url={https://api.semanticscholar.org/CorpusID:278179500}
}
```