srNEL-all: Serbian Named Entity Linking with spaCy
sr_nel_all is a spaCy pipeline for Serbian named entity recognition and named entity linking. It detects named entities in Serbian text and links recognized mentions to Wikidata identifiers.
The model corresponds to the srNEL-all configuration from the accepted paper *CNN-based Named Entity Linking: Serbian Use Case*. It is a CNN-based spaCy model that builds on the SrpCNNER2 NER base and trains the entity linker on all available entity types.
Intended Use
This model is intended for Serbian NLP workflows that need named entities connected to Wikidata QIDs, especially geolocational entity linking in Serbian educational, geographical, literary, news, and related text.
Recommended uses:
- Linking Serbian location mentions to Wikidata.
- Enriching Serbian texts with structured entity identifiers.
- Building downstream information retrieval, corpus analysis, digital humanities, and knowledge base enrichment workflows.
- Research comparisons for Serbian NER and NEL.
The model is strongest on geolocational entity linking. Broader cross-domain use should be validated on the target corpus before production use.
Installation and Usage
Install the wheel from this repository, or download the model files and load the local spaCy package.
```bash
pip install sr_nel_all-any-py3-none-any.whl
```

```python
import spacy

# Load the packaged pipeline
nlp = spacy.load("sr_nel_all")

doc = nlp("Poljska se graniči sa sedam zemalja, uključujući Nemačku i Ukrajinu.")

# Each detected span carries an NER label and the linked Wikidata QID
for ent in doc.ents:
    print(ent.text, ent.label_, ent.kb_id_)
```
The output contains detected entity spans, their NER labels, and the Wikidata knowledge base identifier assigned by the entity linker.
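For the sentence above, output along the following lines is expected; the exact spans and QIDs shown here are illustrative rather than guaranteed model output:

```text
Poljska LOC Q36
Nemačku LOC Q183
Ukrajinu LOC Q212
```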
Pipeline
| Feature | Value |
|---|---|
| Model name | sr_nel_all |
| Version | 1.0.0 |
| Language | Serbian (sr) |
| Framework | spaCy |
| spaCy version | >=3.5.2,<3.6.0 |
| Architecture | CNN-based spaCy pipeline with entity linker |
| Pipeline | tok2vec, tagger, ner, sentencizer, entity_linker |
| Vectors | 0 keys, 0 unique vectors, 0 dimensions |
| License | CC BY-SA 4.0 |
| Authors | Milica Ikonić Nešić, Saša Petalinkar, Ranka Stanković, Miloš Utvić, Olivera Kitanović |
| Project page | TESLA |
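A minimal sketch of inspecting the pipeline and running only a subset of components, e.g. when linking is not needed (component names are taken from the table above):

```python
import spacy

nlp = spacy.load("sr_nel_all")
print(nlp.pipe_names)  # ['tok2vec', 'tagger', 'ner', 'sentencizer', 'entity_linker']

# Temporarily run only tokenization + NER, skipping tagging and linking
with nlp.select_pipes(enable=["tok2vec", "ner"]):
    doc = nlp("Beograd je glavni grad Srbije.")
    print([(ent.text, ent.label_) for ent in doc.ents])
```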
Labels
The NER component recognizes seven named entity categories:
| Label | Description |
|---|---|
| DEMO | Demonyms |
| EVENT | Events |
| LOC | Locations |
| ORG | Organizations |
| PERS | Persons |
| ROLE | Professions, titles, and roles |
| WORK | Works of art |
The POS tagger uses the following XPOS-style labels:
ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, VERB, X.
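A short sketch of reading these tags from a processed document, assuming the labels above are exposed via `token.tag_`:

```python
import spacy

nlp = spacy.load("sr_nel_all")

# Print each token with the tag assigned by the tagger component
doc = nlp("Dunav protiče kroz Beograd.")
for token in doc:
    print(token.text, token.tag_)
```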
Training Data
The srNEL-all model was trained on a Serbian corpus of 73,493 sentences with manually checked named entity and entity linking annotations.
The training corpus combines:
- Serbian novels.
- Newspaper articles.
- Legal documents.
- Wikipedia and sr-ELEXIS material.
- Synthetic sentences generated from Wikidata and the Leximirka lexical database.
Entity distribution in the expanded dataset:
| Entity type | Mentions |
|---|---|
| LOC | 36,655 |
| ORG | 11,061 |
| PERS | 13,636 |
For locations, 35,712 LOC mentions were linked to Wikidata QIDs, while 943 LOC mentions were assigned NIL links because no suitable Wikidata item was available at annotation time.
The linker was trained on seven entity types: PERS, LOC, ORG, ROLE, WORK, DEMO, and EVENT. The train/test split used for the NEL training setup was 58,618 training sentences and 14,875 test sentences.
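When a mention cannot be resolved, the linker leaves it without a real QID. A minimal sketch of filtering such mentions follows; the exact NIL marker used by the package is an assumption, so both an empty ID and a literal "NIL" are checked:

```python
import spacy

nlp = spacy.load("sr_nel_all")
doc = nlp("Poljska se graniči sa sedam zemalja, uključujući Nemačku i Ukrajinu.")

# Separate linked mentions from NIL/unresolved ones
# (assumption: NIL is encoded as "" or the string "NIL")
for ent in doc.ents:
    if ent.kb_id_ in ("", "NIL"):
        print("unlinked:", ent.text, ent.label_)
    else:
        print("linked:", ent.text, ent.kb_id_)
```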
Knowledge Base
The entity linker uses a curated Wikidata-aligned Serbian knowledge base.
For srNEL-all, the KB contains 3,008 entities. Entities are represented with Wikidata QIDs, aliases, and Serbian Wikipedia descriptions where available. The KB also includes inflectional forms as aliases, which is important for Serbian because named entities frequently appear in declined forms.
The KB covers categories including cities, countries, rivers, mountains, seas, oceans, islands, peninsulas, continents, administrative units, localities, organizations, persons, geographic regions, planets, and other entity classes used in the model.
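A minimal sketch of how inflected forms can map to a single QID in a spaCy in-memory KB; the entity-vector length, frequency, and alias set are placeholders, not the values used in the packaged KB:

```python
import spacy
from spacy.kb import InMemoryLookupKB

nlp = spacy.load("sr_nel_all")

# Toy KB: several declined forms of "Poljska" (Poland) alias the same QID
kb = InMemoryLookupKB(vocab=nlp.vocab, entity_vector_length=64)
kb.add_entity(entity="Q36", freq=100, entity_vector=[0.0] * 64)
for alias in ["Poljska", "Poljske", "Poljskoj", "Poljsku"]:
    kb.add_alias(alias=alias, entities=["Q36"], probabilities=[1.0])

print([c.entity_ for c in kb.get_alias_candidates("Poljskoj")])  # ['Q36']
```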
Evaluation
The main external evaluation described in the associated paper uses the sr-geography corpus, a Serbian geography textbook corpus for elementary school students.
The sr-geography evaluation set contains:
- 710 sentences.
- 2,297 words.
- 746 annotated geolocational entities.
- 212 unique Wikidata QIDs.
Evaluation used a strict criterion: a prediction is counted as correct only when both the entity span and the Wikidata QID match the gold annotation.
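A minimal sketch of this strict criterion, treating each annotation as a (start, end, QID) triple; function and example values are illustrative:

```python
def strict_scores(gold, pred):
    """Precision/recall/F1 where a prediction counts only on an exact
    (start, end, qid) match with a gold annotation."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)
    p = tp / len(pred_set) if pred_set else 0.0
    r = tp / len(gold_set) if gold_set else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

# One exact match, one QID mismatch, one missed gold span
gold = [(0, 7, "Q36"), (50, 57, "Q183"), (60, 68, "Q212")]
pred = [(0, 7, "Q36"), (50, 57, "Q155")]
print(strict_scores(gold, pred))  # (0.5, 0.333..., 0.4)
```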
sr-geography NEL Results
| Model | Precision | Recall | F1 |
|---|---|---|---|
| srNEL-all | 0.986 | 0.740 | 0.845 |
| SrpCNNeL baseline | n/a | n/a | 0.731 |
The srNEL-all configuration achieved the strongest CNN-based result in the reported comparison, outperforming the earlier SrpCNNeL baseline on geolocational entity linking.
Internal spaCy Package Metrics
| Metric | Score |
|---|---|
| XPOS accuracy | 0.9649 |
| NER precision | 0.9335 |
| NER recall | 0.9340 |
| NER F1 | 0.9338 |
NER performance by entity type:
| Entity type | Precision | Recall | F1 |
|---|---|---|---|
| ROLE | 0.8352 | 0.8221 | 0.8286 |
| PERS | 0.9713 | 0.9787 | 0.9750 |
| LOC | 0.9330 | 0.9697 | 0.9510 |
| DEMO | 0.8740 | 0.8520 | 0.8628 |
| ORG | 0.7676 | 0.6544 | 0.7065 |
| WORK | 0.6563 | 0.2958 | 0.4078 |
| EVENT | 0.5556 | 0.3125 | 0.4000 |
Limitations
- The model is strongest for Serbian geolocational entity linking and should be evaluated before use in other domains.
- The external evaluation corpus is focused on geography textbook text, so reported NEL results may not generalize directly to news, literary, legal, or web text.
- Multi-word entities are a known source of errors, especially Serbian toponyms with inflection or complex names.
- Rare categories such as islands, oceans, planets, and geographic regions require more evaluation data.
- The system depends on Wikidata coverage. Mentions without a suitable Wikidata item may receive NIL links or remain unresolved.
- CNN-based pipelines are efficient, but transformer-based models may offer stronger accuracy for some Serbian NER/NEL scenarios.
Citation
The paper describing this model is in press:
```bibtex
@article{IkonicNesic2026CNN,
  author  = {Ikoni{\'c} Ne{\v{s}}i{\'c}, M. and Petalinkar, S. and Kitanovi{\'c}, O. and Stankovi{\'c}, R. and Utvi{\'c}, M.},
  title   = {CNN-based Named Entity Linking: Serbian Use Case},
  journal = {Poznan Studies in Contemporary Linguistics},
  year    = {2026},
  note    = {In press},
}
```
Acknowledgments
This research was supported by the Science Fund of the Republic of Serbia, project Text Embeddings - Serbian Language Applications - TESLA. The work also acknowledges the use of Serbian linguistic resources and corpora described in the associated paper, including Leximirka, Wikidata-derived data, sr-ELEXIS, SrpELTeC-related resources, and the sr-geography evaluation material.