---
datasets:
- psytechlab/rus_rudeft_wcl-wiki
language:
- ru
base_model:
- DeepPavlov/rubert-base-cased
---

# RuBERT base fine-tuned on ruDEFT and WCL Wiki Ru datasets for NER

The model extracts terms and definitions from text.

Labels:

- Term: a word or phrase that is being defined.
- Definition: the span that defines a term.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("psytechlab/wcl-wiki_rudeft__ner-model")
model = AutoModelForTokenClassification.from_pretrained("psytechlab/wcl-wiki_rudeft__ner-model")
model.eval()

inputs = tokenizer(
    "оромо — это африканская этническая группа, проживающая в эфиопии и в меньшей степени в кении.",
    return_tensors="pt",
)
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
predictions = torch.argmax(logits, dim=-1)[0].tolist()

word_ids = inputs.word_ids(batch_index=0)

# Group sub-token predictions by word.
word_to_labels = {}
for word_id, label_id in zip(word_ids, predictions):
    if word_id is None:  # skip special tokens ([CLS], [SEP])
        continue
    word_to_labels.setdefault(word_id, []).append(label_id)

# Use the label of the first sub-token of each word.
word_level_predictions = [model.config.id2label[labels[0]] for labels in word_to_labels.values()]

print(word_level_predictions)
# ['B-Term', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
```
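The word-level BIO tags can then be grouped into contiguous Term and Definition spans. A minimal sketch of such a decoder (the helper name `bio_to_spans` and the example tag sequence are illustrative, not part of the model API):

```python
def bio_to_spans(tags):
    """Group word-level BIO tags into (label, start, end) spans, end exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or (tag.startswith("I-") and label != tag[2:]):
            # A new entity starts; close the previous one if it is open.
            if label is not None:
                spans.append((label, start, i))
            start, label = i, tag[2:]
        elif tag == "O":
            if label is not None:
                spans.append((label, start, i))
            start, label = None, None
    if label is not None:  # entity running until the end of the sequence
        spans.append((label, start, len(tags)))
    return spans

tags = ["B-Term", "I-Term", "O", "B-Definition", "I-Definition", "I-Definition"]
print(bio_to_spans(tags))  # [('Term', 0, 2), ('Definition', 3, 6)]
```

The spans index into the word-level tag list, so they can be mapped back to the original words of the input sentence.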

## Training procedure

### Training

Training was done with the `Trainer` class using the following arguments:

```python
training_args = TrainingArguments(
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=7,
    weight_decay=0.01,
)
```
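For context, these arguments would typically be passed to a `Trainer` together with a token-classification collator. A rough sketch of the surrounding setup, assuming already-tokenized train/eval splits named `train_ds` and `eval_ds` (both are placeholders; the exact data preparation used here is not shown):

```python
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("DeepPavlov/rubert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "DeepPavlov/rubert-base-cased",
    num_labels=5,  # O, B-Term, I-Term, B-Definition, I-Definition
)

training_args = TrainingArguments(
    output_dir="ner-model",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=7,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,  # placeholder: tokenized training split
    eval_dataset=eval_ds,    # placeholder: tokenized evaluation split
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```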

### Metrics

Metrics on the combined set (ruDEFT + WCL Wiki Ru), `psytechlab/rus_rudeft_wcl-wiki`:

```
              precision    recall  f1-score   support

I-Definition       0.75      0.90      0.82      3344
B-Definition       0.62      0.73      0.67       230
      I-Term       0.80      0.85      0.82       524
           O       0.97      0.91      0.94     11359
      B-Term       0.96      0.93      0.94      2977

    accuracy                           0.91     18434
   macro avg       0.82      0.87      0.84     18434
weighted avg       0.92      0.91      0.91     18434
```

Metrics on `astromis/ruDEFT` only:

```
              precision    recall  f1-score   support

I-Definition       0.90      0.90      0.90      3344
B-Definition       0.74      0.73      0.74       230
      I-Term       0.83      0.87      0.85       389
           O       0.86      0.86      0.86      2222
      B-Term       0.87      0.85      0.86       638

    accuracy                           0.87      6823
   macro avg       0.84      0.84      0.84      6823
weighted avg       0.87      0.87      0.87      6823
```

Metrics on `astromis/WCL_Wiki_Ru` only:

```
              precision    recall  f1-score   support

I-Definition       0.00      0.00      0.00         0
B-Definition       0.00      0.00      0.00         0
      I-Term       0.72      0.78      0.75       135
           O       1.00      0.93      0.96      9137
      B-Term       0.99      0.95      0.97      2339

    accuracy                           0.93     11611
   macro avg       0.54      0.53      0.54     11611
weighted avg       0.99      0.93      0.96     11611
```
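The tables above follow the usual classification-report layout, where per-label precision, recall, and F1 reduce to simple ratios of true-positive, false-positive, and false-negative counts over the flattened tag sequences. A pure-Python sketch of that computation (the toy tag sequences are illustrative):

```python
from collections import Counter

def per_label_prf(y_true, y_pred):
    """Per-label (precision, recall, f1) from parallel flat tag sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p where it was not p
            fn[t] += 1  # missed an actual t
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        prec = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        rec = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        report[label] = (prec, rec, f1)
    return report

y_true = ["B-Term", "O", "O", "B-Term", "O"]
y_pred = ["B-Term", "O", "B-Term", "O", "O"]
print(per_label_prf(y_true, y_pred))
```

Labels absent from the predictions, such as the Definition classes in the WCL Wiki Ru split, get zero precision and recall, which is why the macro average drops there while the weighted average stays high.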

# Citation

```bibtex
@article{Popov2025TransferringNL,
  title={Transferring Natural Language Datasets Between Languages Using Large Language Models for Modern Decision Support and Sci-Tech Analytical Systems},
  author={Dmitrii Popov and Egor Terentev and Danil Serenko and Ilya Sochenkov and Igor Buyanov},
  journal={Big Data and Cognitive Computing},
  year={2025},
  url={https://api.semanticscholar.org/CorpusID:278179500}
}
```