mmBERT NER model for Slavic languages

The train / eval / test splits were concatenated from all languages, in the order given on the command line:
sl, hr, sr, bs, mk, sq, cs, bg, pl, ru, sk, uk
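The concatenation above can be sketched as follows. This is a minimal, self-contained illustration: the `load_split` helper and its placeholder records are assumptions standing in for the actual per-language NER data loading, not the model's real pipeline.

```python
# Language codes in the order stated in the model card.
LANGS = ["sl", "hr", "sr", "bs", "mk", "sq", "cs", "bg", "pl", "ru", "sk", "uk"]

def load_split(lang, split):
    # Placeholder: the real pipeline would return token/BIO-tag examples
    # for one language and one split (train / eval / test).
    return [{"lang": lang, "split": split}]

def build_split(split):
    # Concatenate all languages in the fixed command-line order.
    examples = []
    for lang in LANGS:
        examples.extend(load_split(lang, split))
    return examples

train = build_split("train")
```

The same `build_split` call would be repeated for the eval and test splits, preserving the same language order in each.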

We used the following hyper-parameters:

  • PyTorch's AdamW optimizer with a learning rate of 2e-5
  • batch size of 32
  • 30 epochs (preliminary runs showed the best F1-scores between epochs 15 and 35)
  • F1-score as the criterion for best-model selection and for monitoring training progression.
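The hyper-parameters and the F1-based model selection above can be summarized in a small sketch. The `select_best` helper and the example F1 values are hypothetical, added here only to illustrate keeping the checkpoint with the highest dev F1 across epochs.

```python
# Training configuration as stated in the model card.
config = {
    "optimizer": "AdamW",        # torch.optim.AdamW
    "learning_rate": 2e-5,
    "batch_size": 32,
    "epochs": 30,
    "selection_metric": "f1",    # used for best-model selection
}

def select_best(epoch_f1):
    # epoch_f1: mapping from epoch number to dev-set F1-score.
    # Return the epoch with the highest F1 and its score.
    best_epoch = max(epoch_f1, key=epoch_f1.get)
    return best_epoch, epoch_f1[best_epoch]
```

In practice this selection is what frameworks such as Hugging Face `Trainer` do with `load_best_model_at_end=True` and `metric_for_best_model="f1"`, assuming that is how the model was trained.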

Based on "Analysis of Transfer Learning for Named Entity Recognition in South-Slavic Languages" (Ivačič et al., BSNLP 2023).

Model size: 0.3B params (Safetensors, F32 tensors)