---
license: mit
datasets:
- RCC-MSU/collection3
language:
- ru
metrics:
- accuracy
- f1
- precision
- recall
base_model:
- cointegrated/rubert-tiny2
pipeline_tag: token-classification
---

# 📰 Ner-rubert-tiny-RuNews

A model for **Named Entity Recognition (NER)** in **Russian-language news texts**.

Based on [**RuBERT-tiny2**](https://huggingface.co/cointegrated/rubert-tiny2) and fine-tuned on the news corpus [**Collection3**](https://huggingface.co/datasets/RCC-MSU/collection3), focusing on texts mentioning **Sberbank**, **Yandex**, and other media and governmental entities.

---

## 💡 What the model can do

It recognizes the following types of named entities:

| Label      | Meaning                                    |
|------------|--------------------------------------------|
| `PER`      | Persons                                    |
| `ORG`      | Organizations                              |
| `LOC`      | Locations                                  |
| `GEOPOLIT` | Geopolitical entities (countries, regions) |
| `MEDIA`    | Media outlets and resources                |

---

## 🛠️ Example usage

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# BIO tags for the five entity types, plus 'O' for non-entity tokens
label2id = {
    'O': 0,
    'B-GEOPOLIT': 1, 'I-GEOPOLIT': 2,
    'B-MEDIA': 3, 'I-MEDIA': 4,
    'B-LOC': 5, 'I-LOC': 6,
    'B-ORG': 7, 'I-ORG': 8,
    'B-PER': 9, 'I-PER': 10
}
id2label = {v: k for k, v in label2id.items()}

model_id = "r1char9/ner-rubert-tiny-RuNews"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(
    model_id,
    num_labels=len(label2id),
    id2label=id2label,
    label2id=label2id
)

# aggregation_strategy="simple" merges sub-word tokens into whole entity spans
ner_pipeline = pipeline(
    "ner",
    model=model,
    tokenizer=tokenizer,
    aggregation_strategy="simple"
)

# English gloss: "Sberbank CEO Herman Gref said at a conference in Moscow that
# cooperation with Yandex in artificial intelligence is reaching a new level.
# He also noted that the government of the Russian Federation supports the
# development of the digital economy, especially within the Eurasian Economic Union."
text = (
    "Генеральный директор Сбербанка Герман Греф на конференции в Москве заявил, "
    "что сотрудничество с Яндексом в области искусственного интеллекта выходит на новый уровень. "
    "Он также отметил, что правительство Российской Федерации поддерживает развитие цифровой экономики, "
    "особенно в рамках Евразийского экономического союза."
)

results = ner_pipeline(text)

for entity in results:
    print(entity)

# {'entity_group': 'ORG', 'score': 0.951569, 'word': 'Сбербанка', 'start': 21, 'end': 30}
# {'entity_group': 'PER', 'score': 0.9922959, 'word': 'Герман Греф', 'start': 31, 'end': 42}
# {'entity_group': 'LOC', 'score': 0.60198957, 'word': 'Москве', 'start': 60, 'end': 66}
# {'entity_group': 'ORG', 'score': 0.6973838, 'word': 'Яндексом', 'start': 96, 'end': 104}
# {'entity_group': 'GEOPOLIT', 'score': 0.9631994, 'word': 'Российской Федерации', 'start': 203, 'end': 223}
# {'entity_group': 'ORG', 'score': 0.85091865, 'word': 'Евразийского экономического союза.', 'start': 284, 'end': 318}
```