---
datasets:
- psytechlab/rus_rudeft_wcl-wiki
language:
- ru
base_model:
- DeepPavlov/rubert-base-cased
---

# RuBERT base fine-tuned on ruDEFT and WCL Wiki Ru datasets for NER

The model extracts terms and their definitions from Russian text.

Labels:

- Term: a word or phrase.
- Definition: the span that defines some term.

The example below runs the model on one sentence and prints one label per word, taken from each word's first sub-token.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("psytechlab/wcl-wiki_rudeft__ner-model")
model = AutoModelForTokenClassification.from_pretrained("psytechlab/wcl-wiki_rudeft__ner-model")
model.eval()

# English: "Oromo is an African ethnic group living in Ethiopia
# and, to a lesser extent, in Kenya."
inputs = tokenizer('оромо — это африканская этническая группа, проживающая в эфиопии и в меньшей степени в кении.', return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Pick the highest-scoring label id for every sub-token.
logits = outputs.logits
predictions = torch.argmax(logits, dim=-1)[0].tolist()

# Map each sub-token back to the index of the word it came from.
word_ids = inputs.word_ids(batch_index=0)

word_to_labels = {}
for word_id, label_id in zip(word_ids, predictions):
    if word_id is None:  # special tokens such as [CLS] and [SEP]
        continue
    if word_id not in word_to_labels:
        word_to_labels[word_id] = []
    word_to_labels[word_id].append(label_id)

# Use the label of the first sub-token as the label of the whole word.
word_level_predictions = [model.config.id2label[labels[0]] for labels in word_to_labels.values()]

print(word_level_predictions)
# ['B-Term', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
```

## Training procedure

### Training

The model was trained with the `Trainer` class using the following arguments:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=7,
    weight_decay=0.01,
)
```

### Metrics

Metrics on the combined set (ruDEFT + WCL Wiki Ru) `psytechlab/rus_rudeft_wcl-wiki`:

```python
              precision    recall  f1-score   support

I-Definition       0.75      0.90      0.82      3344
B-Definition       0.62      0.73      0.67       230
      I-Term       0.80      0.85      0.82       524
           O       0.97      0.91      0.94     11359
      B-Term       0.96      0.93      0.94      2977

    accuracy                           0.91     18434
   macro avg       0.82      0.87      0.84     18434
weighted avg       0.92      0.91      0.91     18434
```

Metrics only on `astromis/ruDEFT`:

```python
              precision    recall  f1-score   support

I-Definition       0.90      0.90      0.90      3344
B-Definition       0.74      0.73      0.74       230
      I-Term       0.83      0.87      0.85       389
           O       0.86      0.86      0.86      2222
      B-Term       0.87      0.85      0.86       638

    accuracy                           0.87      6823
   macro avg       0.84      0.84      0.84      6823
weighted avg       0.87      0.87      0.87      6823
```

Metrics only on `astromis/WCL_Wiki_Ru`:

```python
              precision    recall  f1-score   support

I-Definition       0.00      0.00      0.00         0
B-Definition       0.00      0.00      0.00         0
      I-Term       0.72      0.78      0.75       135
           O       1.00      0.93      0.96      9137
      B-Term       0.99      0.95      0.97      2339

    accuracy                           0.93     11611
   macro avg       0.54      0.53      0.54     11611
weighted avg       0.99      0.93      0.96     11611
```

# Citation

```
@article{Popov2025TransferringNL,
  title={Transferring Natural Language Datasets Between Languages Using Large Language Models for Modern Decision Support and Sci-Tech Analytical Systems},
  author={Dmitrii Popov and Egor Terentev and Danil Serenko and Ilya Sochenkov and Igor Buyanov},
  journal={Big Data and Cognitive Computing},
  year={2025},
  url={https://api.semanticscholar.org/CorpusID:278179500}
}
```
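
# Post-processing sketch

The usage example prints one BIO tag per word (`B-Term`, `I-Term`, `B-Definition`, `I-Definition`, `O`). The tags can be grouped into labelled spans with a small helper. This is a minimal sketch and not part of the released model code: the function name `bio_to_spans` is hypothetical, and the sample words and tags are written by hand for illustration, not produced by the model.

```python
from typing import List, Tuple


def bio_to_spans(words: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group word-level BIO tags into (label, text) spans.

    Hypothetical helper: a stray I- tag with no matching open span
    is treated as the start of a new span.
    """
    spans: List[Tuple[str, str]] = []
    current_label = None
    current_words: List[str] = []

    for word, tag in zip(words, tags):
        # "B-Term" -> ("B", "Term"); "O" -> ("O", "")
        prefix, _, label = tag.partition("-")

        # Continue the open span when the I- tag matches its label.
        if prefix == "I" and label == current_label:
            current_words.append(word)
            continue

        # Any other tag closes the span that is currently open.
        if current_label is not None:
            spans.append((current_label, " ".join(current_words)))
            current_label, current_words = None, []

        # B- (or a stray I-) opens a new span; O opens nothing.
        if prefix in ("B", "I"):
            current_label, current_words = label, [word]

    # Flush a span that runs to the end of the sentence.
    if current_label is not None:
        spans.append((current_label, " ".join(current_words)))
    return spans


# Hand-written tags for illustration only, not model output.
words = ["оромо", "это", "африканская", "этническая", "группа"]
tags = ["B-Term", "O", "B-Definition", "I-Definition", "I-Definition"]

spans = bio_to_spans(words, tags)
print(spans)
# [('Term', 'оромо'), ('Definition', 'африканская этническая группа')]
```

To use it with the model, pass the whitespace-separated words of the input sentence together with `word_level_predictions`, assuming the tokenizer's word boundaries line up with the word list you provide.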