	---
	license: mit
	datasets:
	- RCC-MSU/collection3
	language:
	- ru
	metrics:
	- accuracy
	- f1
	- precision
	- recall
	base_model:
	- cointegrated/rubert-tiny2
	pipeline_tag: token-classification
	---

	# 📰 Ner-rubert-tiny-RuNews

	A model for Named Entity Recognition (NER) in Russian-language news texts.

	🔍 The model is based on [RuBERT-tiny2](https://huggingface.co/cointegrated/rubert-tiny2) and fine-tuned on the [Collection3](https://huggingface.co/datasets/RCC-MSU/collection3) news corpus, with a focus on texts that mention Sberbank, Yandex, and other media and government entities.

	---

	## 💡 What the model can do

	It recognizes the following types of named entities:

	| Label      | Meaning                                   |
	|------------|-------------------------------------------|
	| `PER`      | Persons                                   |
	| `ORG`      | Organizations                             |
	| `LOC`      | Locations                                 |
	| `GEOPOLIT` | Geopolitical entities (countries, regions) |
	| `MEDIA`    | Media outlets and resources               |
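
The usage example below uses the BIO tagging scheme: each entity type gets a `B-` (beginning) tag and an `I-` (inside) tag, and a single `O` tag marks non-entity tokens. As a minimal sketch, the 11-label mapping can be generated from the five types instead of being typed by hand. The helper name `build_label_maps` is hypothetical, and the type order is assumed to follow the `label2id` dictionary in the usage example.

```python
# Entity types in the order assumed from the label2id dictionary in the usage example
ENTITY_TYPES = ["GEOPOLIT", "MEDIA", "LOC", "ORG", "PER"]


def build_label_maps(entity_types):
    """Build BIO label maps: 'O' first, then a B-/I- pair per entity type."""
    label2id = {"O": 0}
    for entity_type in entity_types:
        for prefix in ("B", "I"):
            # The next free integer ID is the current size of the mapping
            label2id[f"{prefix}-{entity_type}"] = len(label2id)
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label


label2id, id2label = build_label_maps(ENTITY_TYPES)
print(len(label2id))                         # 11
print(label2id["B-PER"], label2id["I-PER"])  # 9 10
```

Generating the mapping this way keeps `label2id` and `id2label` in sync if the list of entity types ever changes.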

	---


	## 🛠️ Example usage

	```python
	from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

	# BIO label scheme: "O" plus B-/I- tags for each of the five entity types
	label2id = {
	    'O': 0,
	    'B-GEOPOLIT': 1, 'I-GEOPOLIT': 2,
	    'B-MEDIA': 3, 'I-MEDIA': 4,
	    'B-LOC': 5, 'I-LOC': 6,
	    'B-ORG': 7, 'I-ORG': 8,
	    'B-PER': 9, 'I-PER': 10
	}
	id2label = {v: k for k, v in label2id.items()}

	# Model ID as written in this card; the hosting page lists the repository
	# as r1char9/ner-rubert-tiny-news, so use that ID if this one does not resolve
	model_id = "r1char9/ner-rubert-tiny-RuNews"

	tokenizer = AutoTokenizer.from_pretrained(model_id)
	model = AutoModelForTokenClassification.from_pretrained(
	    model_id,
	    num_labels=len(label2id),
	    id2label=id2label,
	    label2id=label2id
	)

	# aggregation_strategy="simple" groups adjacent tokens of the same
	# entity type into a single span
	ner_pipeline = pipeline(
	    "ner",
	    model=model,
	    tokenizer=tokenizer,
	    aggregation_strategy="simple"
	)

	text = (
	    "Генеральный директор Сбербанка Герман Греф на конференции в Москве заявил, "
	    "что сотрудничество с Яндексом в области искусственного интеллекта выходит на новый уровень. "
	    "Он также отметил, что правительство Российской Федерации поддерживает развитие цифровой экономики, "
	    "особенно в рамках Евразийского экономического союза."
	)

	results = ner_pipeline(text)

	# Each result is a dict with the entity type, confidence score,
	# matched text, and character offsets into `text`
	for entity in results:
	    print(entity)

	# {'entity_group': 'ORG', 'score': 0.951569, 'word': 'Сбербанка', 'start': 21, 'end': 30}
	# {'entity_group': 'PER', 'score': 0.9922959, 'word': 'Герман Греф', 'start': 31, 'end': 42}
	# {'entity_group': 'LOC', 'score': 0.60198957, 'word': 'Москве', 'start': 60, 'end': 66}
	# {'entity_group': 'ORG', 'score': 0.6973838, 'word': 'Яндексом', 'start': 96, 'end': 104}
	# {'entity_group': 'GEOPOLIT', 'score': 0.9631994, 'word': 'Российской Федерации', 'start': 203, 'end': 223}
	# {'entity_group': 'ORG', 'score': 0.85091865, 'word': 'Евразийского экономического союза.', 'start': 284, 'end': 318}
	```
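
In the sample output above, the last span includes the sentence-final period (`'союза.'`). The following is a minimal post-processing sketch, assuming the pipeline returns dictionaries with the `entity_group`, `word`, `start`, and `end` keys shown above. It trims trailing punctuation using the character offsets and groups the surface forms by entity type. The helper names `trim_span` and `group_by_type`, the punctuation set, and the sample entities are hypothetical and hand-written for illustration; they are not model output and not part of `transformers`.

```python
from collections import defaultdict

# Characters to strip from the right edge of a span (an illustrative choice)
TRAILING_PUNCT = ".,;:!?"


def trim_span(text, entity):
    """Return a copy of `entity` with trailing punctuation removed from the span."""
    start, end = entity["start"], entity["end"]
    while end > start and text[end - 1] in TRAILING_PUNCT:
        end -= 1
    # Rebuild the surface form from the source text so it matches the new offsets
    return {**entity, "word": text[start:end], "end": end}


def group_by_type(text, entities):
    """Map each entity type to the list of cleaned surface forms found in `text`."""
    grouped = defaultdict(list)
    for entity in entities:
        cleaned = trim_span(text, entity)
        grouped[cleaned["entity_group"]].append(cleaned["word"])
    return dict(grouped)


# Hand-written sample in the same shape as the pipeline output
sample_text = "Герман Греф выступил в Москве."
sample_entities = [
    {"entity_group": "PER", "word": "Герман Греф", "start": 0, "end": 11},
    {"entity_group": "LOC", "word": "Москве.", "start": 23, "end": 30},
]

print(group_by_type(sample_text, sample_entities))
```

Working from the `start` and `end` offsets rather than from the `word` field keeps the cleaned spans aligned with the original text, which matters if the results are later used for highlighting or annotation.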