---
datasets:
- psytechlab/rus_rudeft_wcl-wiki
language:
- ru
base_model:
- DeepPavlov/rubert-base-cased
---

# RuBERT base fine-tuned on ruDEFT and WCL Wiki Ru datasets

The model detects whether a text contains a definition, i.e. it predicts the `definition_label` column of the dataset.

```python
import torch
from transformers import AutoTokenizer, BertForSequenceClassification

# Load the fine-tuned tokenizer and classifier from the Hub
tokenizer = AutoTokenizer.from_pretrained("psytechlab/wcl-wiki_rudeft__rubert-model")
model = BertForSequenceClassification.from_pretrained("psytechlab/wcl-wiki_rudeft__rubert-model")
model.eval()  # inference mode (disables dropout)

# "Moscow is a city in the Russian Federation", "I want to study languages"
text = ["москва - это город в РФ", "хочу изучать языки"]
tokenized_text = tokenizer(text, padding="max_length", truncation=True, max_length=512, return_tensors="pt")

# No gradients are needed for prediction
with torch.no_grad():
    prediction = model(**tokenized_text).logits

# Index of the highest logit is the predicted class for each text
print(prediction.argmax(dim=1).numpy())
# [1 0]
```

## Preprocessing

- lower_string
- remove_punct
- remove_latin
- swap_enter_to_space
- collapse_spaces
- strip_string

## Training procedure

### Training

The model was trained with the `Trainer` class using the following arguments:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    num_train_epochs=7,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    weight_decay=0.01,
    learning_rate=3e-5,
    logging_strategy="steps",
    logging_steps=50,
    save_strategy="epoch",
    save_total_limit=5,
    seed=21,
    metric_for_best_model="eval_f1_macro"
)
```

### Metrics

Metrics on the combined set (ruDEFT + WCL Wiki Ru), `psytechlab/rus_rudeft_wcl-wiki`:

```text
              precision    recall  f1-score   support

           0       0.90      0.93      0.92      1421
           1       0.87      0.81      0.84       753

    accuracy                           0.89      2174
   macro avg       0.88      0.87      0.88      2174
weighted avg       0.89      0.89      0.89      2174
```

Metrics on `astromis/ruDEFT` only:

```text
              precision    recall  f1-score   support

           0       0.87      0.95      0.91       836
           1       0.84      0.67      0.74       353

    accuracy                           0.86      1189
   macro avg       0.85      0.81      0.82      1189
weighted avg       0.86      0.86      0.86      1189
```

Metrics on `astromis/WCL_Wiki_Ru` only:

```text
              precision    recall  f1-score   support

           0       0.95      0.92      0.93       585
           1       0.89      0.93      0.91       400

    accuracy                           0.92       985
   macro avg       0.92      0.92      0.92       985
weighted avg       0.92      0.92      0.92       985
```

# Citation

@article{Popov2025TransferringNL,
  title={Transferring Natural Language Datasets Between Languages Using Large Language Models for Modern Decision Support and Sci-Tech Analytical Systems},
  author={Dmitrii Popov and Egor Terentev and Danil Serenko and Ilya Sochenkov and Igor Buyanov},
  journal={Big Data and Cognitive Computing},
  year={2025},
  url={https://api.semanticscholar.org/CorpusID:278179500}
}
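
The card names the preprocessing steps but does not show how they are implemented. The sketch below is one possible reading of that list, applied in the listed order. The function names are taken from the step names. The bodies are assumptions: what counts as punctuation, and replacing punctuation with a space rather than deleting it, are guesses rather than the authors' code.

```python
import string

# Assumption: "punctuation" = ASCII punctuation plus a few typographic marks
# (guillemets, dashes, ellipsis). The original character set is not documented.
PUNCT_CHARS = string.punctuation + "\u00ab\u00bb\u2014\u2013\u2026"
PUNCT_TABLE = str.maketrans({ch: " " for ch in PUNCT_CHARS})
LATIN_TABLE = str.maketrans("", "", string.ascii_letters)


def lower_string(text: str) -> str:
    return text.lower()


def remove_punct(text: str) -> str:
    # Assumption: punctuation becomes a space so neighbouring words stay apart.
    return text.translate(PUNCT_TABLE)


def remove_latin(text: str) -> str:
    # Drops the basic Latin letters A-Z and a-z only.
    return text.translate(LATIN_TABLE)


def swap_enter_to_space(text: str) -> str:
    return text.replace("\r", " ").replace("\n", " ")


def collapse_spaces(text: str) -> str:
    # Splitting on single spaces and dropping empty pieces merges space runs.
    return " ".join(part for part in text.split(" ") if part)


def strip_string(text: str) -> str:
    return text.strip()


# Steps in the order given in the Preprocessing list.
PIPELINE = (
    lower_string,
    remove_punct,
    remove_latin,
    swap_enter_to_space,
    collapse_spaces,
    strip_string,
)


def preprocess(text: str) -> str:
    for step in PIPELINE:
        text = step(text)
    return text


print(preprocess("Москва - это\nгород в РФ (Moscow)."))
# москва это город в рф
```

If the training data was cleaned this way, passing inputs through `preprocess` before tokenization would presumably bring them closer to what the model saw during fine-tuning.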