srNEL-all: Serbian Named Entity Linking with spaCy

sr_nel_all is a spaCy pipeline for Serbian named entity recognition and named entity linking. It detects named entities in Serbian text and links recognized mentions to Wikidata identifiers.

The model corresponds to the srNEL-all configuration from the accepted paper CNN-based Named Entity Linking: Serbian Use Case. It is a CNN-based spaCy pipeline that builds on the SrpCNNER2 NER model and trains the entity linker on all available entity types.

Intended Use

This model is intended for Serbian NLP workflows that need named entities connected to Wikidata QIDs, especially geolocational entity linking in Serbian educational, geographical, literary, news, and related text.

Recommended uses:

  • Linking Serbian location mentions to Wikidata.
  • Enriching Serbian texts with structured entity identifiers.
  • Building downstream information retrieval, corpus analysis, digital humanities, and knowledge base enrichment workflows.
  • Research comparisons for Serbian NER and NEL.

The model is strongest on geolocational entity linking. Broader cross-domain use should be validated on the target corpus before production use.

Installation and Usage

Install the wheel from this repository, or download the model files and load the local spaCy package.

pip install sr_nel_all-any-py3-none-any.whl

import spacy

nlp = spacy.load("sr_nel_all")
doc = nlp("Poljska se graniči sa sedam zemalja, uključujući Nemačku i Ukrajinu.")

for ent in doc.ents:
    print(ent.text, ent.label_, ent.kb_id_)

The output contains detected entity spans, their NER labels, and the Wikidata knowledge base identifier assigned by the entity linker.
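As an illustration of how that output can be post-processed, the sketch below maps linked mentions to Wikidata URLs and skips NIL links. The (text, label, kb_id) tuples are hypothetical stand-ins for the attributes of doc.ents; actual spans and identifiers depend on the installed model.

```python
# Illustrative post-processing of entity linker output.
# The tuples below are stand-ins for (ent.text, ent.label_, ent.kb_id_);
# Q36, Q183, and Q212 are the Wikidata items for Poland, Germany, Ukraine.
ents = [
    ("Poljska", "LOC", "Q36"),
    ("Nemačku", "LOC", "Q183"),
    ("Ukrajinu", "LOC", "Q212"),
]

def to_wikidata_urls(ents):
    """Map linked mentions to Wikidata URLs, skipping NIL/empty links."""
    urls = {}
    for text, label, kb_id in ents:
        if kb_id and kb_id != "NIL":
            urls[text] = f"https://www.wikidata.org/wiki/{kb_id}"
    return urls

print(to_wikidata_urls(ents))
```

Mentions that the linker could not resolve (empty or NIL kb_id) are simply dropped from the mapping.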

Pipeline

Model name: sr_NEL_all
Version: 1.0.0
Language: Serbian (sr)
Framework: spaCy
spaCy version: >=3.5.2,<3.6.0
Architecture: CNN-based spaCy pipeline with entity linker
Pipeline components: tok2vec, tagger, ner, sentencizer, entity_linker
Vectors: 0 keys, 0 unique vectors, 0 dimensions (no static vectors)
License: CC BY-SA 4.0
Authors: Milica Ikonić Nešić, Saša Petalinkar, Ranka Stanković, Miloš Utvić, Olivera Kitanović
Project page: TESLA

Labels

The NER component recognizes seven named entity categories:

DEMO: Demonyms
EVENT: Events
LOC: Locations
ORG: Organizations
PERS: Persons
ROLE: Professions, titles, and roles
WORK: Works of art

The POS tagger uses the following XPOS-style labels:

ADJ, ADP, ADV, AUX, CCONJ, DET, INTJ, NOUN, NUM, PART, PRON, PROPN, PUNCT, SCONJ, VERB, X.

Training Data

The srNEL-all model was trained on a Serbian corpus of 73,493 sentences with manually checked named entity and entity linking annotations.

The training corpus combines:

  • Serbian novels.
  • Newspaper articles.
  • Legal documents.
  • Wikipedia and sr-ELEXIS material.
  • Synthetic sentences generated from Wikidata and the Leximirka lexical database.

Entity distribution in the expanded dataset:

LOC: 36,655 mentions
ORG: 11,061 mentions
PERS: 13,636 mentions

For locations, 35,712 LOC mentions were linked to Wikidata QIDs, while 943 LOC mentions were assigned NIL links because no suitable Wikidata item was available at annotation time.

The linker was trained on seven entity types: PERS, LOC, ORG, ROLE, WORK, DEMO, and EVENT. The train/test split used for the NEL training setup was 58,618 training sentences and 14,875 test sentences.
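The reported figures are internally consistent and correspond to roughly an 80/20 split, as a quick sanity check shows:

```python
# Sanity check of the corpus figures reported above.
train, test = 58_618, 14_875
total = train + test
assert total == 73_493  # matches the stated corpus size
print(f"train share: {train / total:.1%}")  # roughly 80%
```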

Knowledge Base

The entity linker uses a curated Wikidata-aligned Serbian knowledge base.

For srNEL-all, the KB contains 3,008 entities. Entities are represented with Wikidata QIDs, aliases, and Serbian Wikipedia descriptions where available. The KB also includes inflectional forms as aliases, which is important for Serbian because named entities frequently appear in declined forms.
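Because Serbian nouns decline, the alias table has to map many surface forms to one QID. A minimal sketch of such a lookup is shown below; the declined forms of "Poljska" (Poland, Q36) are illustrative examples, not the model's actual KB contents.

```python
# Toy alias table mapping inflected Serbian surface forms to a QID.
# The real KB stores such aliases per entity, alongside Serbian
# Wikipedia descriptions where available.
alias_to_qid = {
    "Poljska": "Q36",   # nominative
    "Poljske": "Q36",   # genitive
    "Poljskoj": "Q36",  # dative/locative
    "Poljsku": "Q36",   # accusative
    "Poljskom": "Q36",  # instrumental
}

def resolve(mention):
    """Return the QID for a mention, or 'NIL' when it is not in the KB."""
    return alias_to_qid.get(mention, "NIL")

print(resolve("Poljskoj"))  # Q36
print(resolve("Varšavi"))   # NIL (not in this toy table)
```

Without the inflected aliases, only the nominative form would ever match, which is why the KB design matters for Serbian.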

The KB covers categories including cities, countries, rivers, mountains, seas, oceans, islands, peninsulas, continents, administrative units, localities, organizations, persons, geographic regions, planets, and other entity classes used in the model.

Evaluation

The main external evaluation described in the associated paper uses the sr-geography corpus, a Serbian geography textbook corpus for elementary school students.

The sr-geography evaluation set contains:

  • 710 sentences.
  • 2,297 words.
  • 746 annotated geolocational entities.
  • 212 unique Wikidata QIDs.

Evaluation used a strict criterion: a prediction is counted as correct only when both the entity span and the Wikidata QID match the gold annotation.
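Under this criterion, precision, recall, and F1 can be computed over (span, QID) pairs. The sketch below uses made-up gold and predicted annotations with character offsets; it illustrates the scoring scheme, not the paper's evaluation code.

```python
# Strict NEL scoring: a prediction counts as a true positive only if
# both the span (start/end offsets) and the Wikidata QID match gold.
def strict_prf(gold, pred):
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical (start, end, QID) annotations for illustration.
gold = {(0, 7, "Q36"), (49, 56, "Q183"), (59, 67, "Q212")}
pred = {(0, 7, "Q36"), (49, 56, "Q183"), (59, 67, "Q999")}  # wrong QID

p, r, f1 = strict_prf(gold, pred)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Note that the third prediction has the correct span but the wrong QID, so under the strict criterion it counts as both a false positive and a false negative.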

sr-geography NEL Results

Model               Precision  Recall  F1
srNEL-all           0.986      0.740   0.845
SrpCNNeL baseline   n/a        n/a     0.731

The srNEL-all configuration achieved the strongest CNN-based result in the reported comparison, outperforming the earlier SrpCNNeL baseline on geolocational entity linking.

Internal spaCy Package Metrics

XPOS accuracy: 0.9649
NER precision: 0.9335
NER recall: 0.9340
NER F1: 0.9338

NER performance by entity type:

Entity type  Precision  Recall  F1
ROLE         0.8352     0.8221  0.8286
PERS         0.9713     0.9787  0.9750
LOC          0.9330     0.9697  0.9510
DEMO         0.8740     0.8520  0.8628
ORG          0.7676     0.6544  0.7065
WORK         0.6563     0.2958  0.4078
EVENT        0.5556     0.3125  0.4000

Limitations

  • The model is strongest for Serbian geolocational entity linking and should be evaluated before use in other domains.
  • The external evaluation corpus is focused on geography textbook text, so reported NEL results may not generalize directly to news, literary, legal, or web text.
  • Multi-word entities are a known source of errors, especially Serbian toponyms with inflection or complex names.
  • Rare categories such as islands, oceans, planets, and geographic regions require more evaluation data.
  • The system depends on Wikidata coverage. Mentions without a suitable Wikidata item may receive NIL links or remain unresolved.
  • CNN-based pipelines are efficient, but transformer-based models may offer stronger accuracy for some Serbian NER/NEL scenarios.

Citation

The paper describing this model is in press:

@article{IkonicNesic2026CNN,
  author    = {Ikoni{\'c} Ne{\v{s}}i{\'c}, M. and Petalinkar, S. and Kitanovi{\'c}, O. and Stankovi{\'c}, R. and Utvi{\'c}, M.},
  title     = {CNN-based Named Entity Linking: Serbian Use Case},
  journal   = {Poznan Studies in Contemporary Linguistics},
  year      = {2026},
  note      = {In press},
}

Acknowledgments

This research was supported by the Science Fund of the Republic of Serbia, project Text Embeddings - Serbian Language Applications - TESLA. The work also acknowledges the use of Serbian linguistic resources and corpora described in the associated paper, including Leximirka, Wikidata-derived data, sr-ELEXIS, SrpELTeC-related resources, and the sr-geography evaluation material.
