---
datasets:
- psytechlab/rus_rudeft_wcl-wiki
language:
- ru
base_model:
- DeepPavlov/rubert-base-cased
---

# RuBERT base fine-tuned on ruDEFT and WCL Wiki Ru datasets for NER

The model extracts terms and definitions from text.

Labels:

- Term: a word or phrase that is being defined.
- Definition: the span that defines a term.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("psytechlab/wcl-wiki_rudeft__ner-model")
model = AutoModelForTokenClassification.from_pretrained("psytechlab/wcl-wiki_rudeft__ner-model")
model.eval()

inputs = tokenizer(
    "оромо — это африканская этническая группа, проживающая в эфиопии и в меньшей степени в кении.",
    return_tensors="pt",
)
with torch.no_grad():
    outputs = model(**inputs)

logits = outputs.logits
predictions = torch.argmax(logits, dim=-1)[0].tolist()

word_ids = inputs.word_ids(batch_index=0)

# Group sub-token predictions by word.
word_to_labels = {}
for word_id, label_id in zip(word_ids, predictions):
    if word_id is None:  # skip special tokens ([CLS], [SEP])
        continue
    word_to_labels.setdefault(word_id, []).append(label_id)

# Use the label of the first sub-token of each word.
word_level_predictions = [model.config.id2label[labels[0]] for labels in word_to_labels.values()]

print(word_level_predictions)
# ['B-Term', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O']
```
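The word-level BIO tags can then be grouped into contiguous Term and Definition spans. A minimal sketch of such a decoder (the helper name `bio_to_spans` and the example tag sequence are illustrative, not part of the model API):

```python
def bio_to_spans(tags):
    """Group word-level BIO tags into (label, start, end) spans, end exclusive."""
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or (tag.startswith("I-") and label != tag[2:]):
            # A new entity starts; close the previous one if it is open.
            if label is not None:
                spans.append((label, start, i))
            start, label = i, tag[2:]
        elif tag == "O":
            if label is not None:
                spans.append((label, start, i))
            start, label = None, None
    if label is not None:  # entity running until the end of the sequence
        spans.append((label, start, len(tags)))
    return spans

tags = ["B-Term", "I-Term", "O", "B-Definition", "I-Definition", "I-Definition"]
print(bio_to_spans(tags))  # [('Term', 0, 2), ('Definition', 3, 6)]
```

The spans index into the word-level tag list, so they can be mapped back to the original words of the input sentence.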

## Training procedure

### Training

Training was done with the `Trainer` class using the following arguments:

```python
training_args = TrainingArguments(
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=7,
    weight_decay=0.01,
)
```
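For context, these arguments would typically be passed to a `Trainer` together with a token-classification collator. A rough sketch of the surrounding setup, assuming already-tokenized train/eval splits named `train_ds` and `eval_ds` (both are placeholders; the exact data preparation used here is not shown):

```python
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("DeepPavlov/rubert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "DeepPavlov/rubert-base-cased",
    num_labels=5,  # O, B-Term, I-Term, B-Definition, I-Definition
)

training_args = TrainingArguments(
    output_dir="ner-model",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=7,
    weight_decay=0.01,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,  # placeholder: tokenized training split
    eval_dataset=eval_ds,    # placeholder: tokenized evaluation split
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```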

### Metrics

Metrics on the combined set (ruDEFT + WCL Wiki Ru), `psytechlab/rus_rudeft_wcl-wiki`:

```
              precision    recall  f1-score   support

I-Definition       0.75      0.90      0.82      3344
B-Definition       0.62      0.73      0.67       230
      I-Term       0.80      0.85      0.82       524
           O       0.97      0.91      0.94     11359
      B-Term       0.96      0.93      0.94      2977

    accuracy                           0.91     18434
   macro avg       0.82      0.87      0.84     18434
weighted avg       0.92      0.91      0.91     18434
```

Metrics on `astromis/ruDEFT` only:

```
              precision    recall  f1-score   support

I-Definition       0.90      0.90      0.90      3344
B-Definition       0.74      0.73      0.74       230
      I-Term       0.83      0.87      0.85       389
           O       0.86      0.86      0.86      2222
      B-Term       0.87      0.85      0.86       638

    accuracy                           0.87      6823
   macro avg       0.84      0.84      0.84      6823
weighted avg       0.87      0.87      0.87      6823
```

Metrics on `astromis/WCL_Wiki_Ru` only:

```
              precision    recall  f1-score   support

I-Definition       0.00      0.00      0.00         0
B-Definition       0.00      0.00      0.00         0
      I-Term       0.72      0.78      0.75       135
           O       1.00      0.93      0.96      9137
      B-Term       0.99      0.95      0.97      2339

    accuracy                           0.93     11611
   macro avg       0.54      0.53      0.54     11611
weighted avg       0.99      0.93      0.96     11611
```
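The tables above follow the usual classification-report layout, where per-label precision, recall, and F1 reduce to simple ratios of true-positive, false-positive, and false-negative counts over the flattened tag sequences. A pure-Python sketch of that computation (the toy tag sequences are illustrative):

```python
from collections import Counter

def per_label_prf(y_true, y_pred):
    """Per-label (precision, recall, f1) from parallel flat tag sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p where it was not p
            fn[t] += 1  # missed an actual t
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        prec = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        rec = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        report[label] = (prec, rec, f1)
    return report

y_true = ["B-Term", "O", "O", "B-Term", "O"]
y_pred = ["B-Term", "O", "B-Term", "O", "O"]
print(per_label_prf(y_true, y_pred))
```

Labels absent from the predictions, such as the Definition classes in the WCL Wiki Ru split, get zero precision and recall, which is why the macro average drops there while the weighted average stays high.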

# Citation

```bibtex
@article{Popov2025TransferringNL,
  title={Transferring Natural Language Datasets Between Languages Using Large Language Models for Modern Decision Support and Sci-Tech Analytical Systems},
  author={Dmitrii Popov and Egor Terentev and Danil Serenko and Ilya Sochenkov and Igor Buyanov},
  journal={Big Data and Cognitive Computing},
  year={2025},
  url={https://api.semanticscholar.org/CorpusID:278179500}
}
```