---
datasets:
- psytechlab/rus_rudeft_wcl-wiki
language:
- ru
base_model:
- DeepPavlov/rubert-base-cased
---

# RuBERT base fine-tuned on the ruDEFT and WCL Wiki Ru datasets

The model detects whether a text contains a definition (the `definition_label` column in the dataset): `1` marks a definition, `0` everything else.

```python
import torch
from transformers import AutoTokenizer, BertForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("psytechlab/wcl-wiki_rudeft__rubert-model")
model = BertForSequenceClassification.from_pretrained("psytechlab/wcl-wiki_rudeft__rubert-model")
model.eval()

# "Moscow is a city in the Russian Federation" (a definition),
# "I want to study languages" (not a definition)
text = ["москва - это город в РФ", "хочу изучать языки"]

tokenized_text = tokenizer(text, padding="max_length", truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    prediction = model(**tokenized_text).logits
    print(prediction.argmax(dim=1).numpy())
# [1 0]
```
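When calibrated scores are preferred over hard labels, the logits can be converted to per-class probabilities with a softmax. This post-processing step is an addition to the snippet above, not part of the original card:

```python
import torch

# Example logits shaped (batch, 2), as returned by model(**tokenized_text).logits
logits = torch.tensor([[-1.2, 2.3], [1.8, -0.7]])
probs = torch.softmax(logits, dim=1)   # each row sums to 1
labels = (probs[:, 1] > 0.5).long()    # class 1 = "contains a definition"
print(labels.numpy())                  # [1 0]
```

Thresholding on `probs[:, 1]` also makes it easy to trade precision for recall by moving the cutoff away from 0.5.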

## Preprocessing

- `lower_string`: lowercase the text
- `remove_punct`: remove punctuation
- `remove_latin`: remove Latin characters
- `swap_enter_to_space`: replace line breaks with spaces
- `collapse_spaces`: collapse repeated whitespace
- `strip_string`: strip leading and trailing whitespace
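The steps above can be sketched as a single function; the regular expressions are assumptions reconstructed from the step names, since the original preprocessing code is not included in this card:

```python
import re

def preprocess(text: str) -> str:
    # Reconstruction of the pipeline above; the regexes are assumptions.
    text = text.lower()                   # lower_string
    text = re.sub(r"[^\w\s]", " ", text)  # remove_punct
    text = re.sub(r"[a-z]", "", text)     # remove_latin
    text = text.replace("\n", " ")        # swap_enter_to_space
    text = re.sub(r"\s+", " ", text)      # collapse_spaces
    return text.strip()                   # strip_string

print(preprocess("Москва - это\nГород в РФ!"))  # москва это город в рф
```

Applying the same preprocessing at inference time keeps the inputs consistent with what the model saw during training.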

## Training procedure

### Training
Training used the Hugging Face `Trainer` class with the following arguments:
```python
training_args = TrainingArguments(
    num_train_epochs=7,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    learning_rate=3e-5,
    logging_strategy="steps",
    logging_steps=50,
    save_strategy="epoch",
    save_total_limit=5,
    seed=21,
    metric_for_best_model="eval_f1_macro",
)
```
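Because `metric_for_best_model="eval_f1_macro"`, the `Trainer` must be given a `compute_metrics` function that reports an `f1_macro` key (the `eval_` prefix is added by the `Trainer` itself). The exact function used for this model is not published; a plausible sketch:

```python
import numpy as np
from sklearn.metrics import f1_score

def compute_metrics(eval_pred):
    """Passed as Trainer(compute_metrics=...); logged as eval_f1_macro."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"f1_macro": float(f1_score(labels, preds, average="macro"))}

# Sanity check on toy predictions
print(compute_metrics((np.array([[0.1, 0.9], [0.8, 0.2]]), np.array([1, 0]))))
# {'f1_macro': 1.0}
```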

### Metrics
Metrics on the combined set (ruDEFT + WCL Wiki Ru), `psytechlab/rus_rudeft_wcl-wiki`:
```text
              precision    recall  f1-score   support

           0       0.90      0.93      0.92      1421
           1       0.87      0.81      0.84       753

    accuracy                           0.89      2174
   macro avg       0.88      0.87      0.88      2174
weighted avg       0.89      0.89      0.89      2174
```

Metrics only on `astromis/ruDEFT`:
```text
              precision    recall  f1-score   support

           0       0.87      0.95      0.91       836
           1       0.84      0.67      0.74       353

    accuracy                           0.86      1189
   macro avg       0.85      0.81      0.82      1189
weighted avg       0.86      0.86      0.86      1189
```

Metrics only on `astromis/WCL_Wiki_Ru`:
```text
              precision    recall  f1-score   support

           0       0.95      0.92      0.93       585
           1       0.89      0.93      0.91       400

    accuracy                           0.92       985
   macro avg       0.92      0.92      0.92       985
weighted avg       0.92      0.92      0.92       985
```
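The tables above follow the layout of `sklearn.metrics.classification_report`; a minimal reproduction with dummy labels (not the actual evaluation data):

```python
from sklearn.metrics import classification_report

y_true = [0, 0, 1, 1]  # gold definition labels (dummy)
y_pred = [0, 1, 1, 1]  # model predictions (dummy)
print(classification_report(y_true, y_pred, digits=2))
```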

## Citation

```bibtex
@article{Popov2025TransferringNL,
  title={Transferring Natural Language Datasets Between Languages Using Large Language Models for Modern Decision Support and Sci-Tech Analytical Systems},
  author={Dmitrii Popov and Egor Terentev and Danil Serenko and Ilya Sochenkov and Igor Buyanov},
  journal={Big Data and Cognitive Computing},
  year={2025},
  url={https://api.semanticscholar.org/CorpusID:278179500}
}
```