norbert4-small-NorNER

A Norwegian named entity recognition model fine-tuned from ltg/norbert4-small on the NorNE dataset, covering both Bokmål and Nynorsk.

Model Details

Author: Fransis Nyka Kolstø
Base model: ltg/norbert4-small
Language(s): Norwegian Bokmål (nb), Norwegian Nynorsk (nn)
Task: Token classification / Named Entity Recognition
Tagging scheme: IOB2
License: Apache 2.0

Entity Types

The model predicts the 9 entity types described in the NbAiLab norne dataset, tagged with the IOB2 scheme.
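To make the IOB2 scheme concrete, here is a minimal sketch of how a tag sequence maps to entity spans. The function name `iob2_to_spans` and the example tags are illustrative assumptions, not model output; `B-` opens an entity, `I-` continues it, and `O` marks tokens outside any entity.

```python
def iob2_to_spans(tags):
    """Convert IOB2 tags into (entity_type, start, end) spans; end is exclusive."""
    spans = []
    start, current = None, None
    for i, tag in enumerate(tags):
        # The open span ends at "O", at any "B-" tag, or at an "I-" tag of another type.
        if tag == "O" or tag.startswith("B-") or tag[2:] != current:
            if current is not None:
                spans.append((current, start, i))
            start, current = None, None
        if tag.startswith("B-"):
            start, current = i, tag[2:]
        # An "I-" tag with no open span is invalid IOB2; this sketch simply ignores it.
    if current is not None:
        spans.append((current, start, len(tags)))
    return spans


# Hand-written tags for illustration only (not produced by the model).
tokens = ["Erna", "Solberg", "besøkte", "Universitetet", "i", "Oslo", "forrige", "uke", "."]
tags = ["B-PER", "I-PER", "O", "B-ORG", "I-ORG", "I-ORG", "O", "O", "O"]
for label, start, end in iob2_to_spans(tags):
    print(label, " ".join(tokens[start:end]))
```

Entity-level evaluation counts a prediction as correct only when both the span boundaries and the type match, which is why decoding to spans matters.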

Intended Use

The model is intended for named entity recognition on Norwegian text (Bokmål and Nynorsk), including news, blog posts, parliamentary proceedings, and government reports. These genres reflect the distribution of the NorNE data.

Training Procedure

Training was done in two phases on the NorNE dataset:

Phase 1 (optimal-step search): The model was trained on the train split, with the dev split used for evaluation and early stopping. Training proceeded through a curriculum of increasing input context lengths, allowing the model to adapt progressively from sentence-level to longer multi-sentence contexts.
Phase 2 (final training): The model was re-initialized from the base checkpoint and trained on the combined train + dev splits, replaying the same curriculum and learning-rate trajectory as Phase 1 but stopping each stage at the best step identified in Phase 1. This allows the final model to benefit from the additional dev data without re-tuning.
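The two phases can be sketched as follows. This is not the author's training code: the stage names, context lengths, step counts, and dev scores are all made-up placeholders, and the helpers `best_steps_per_stage` and `phase2_schedule` are hypothetical. The sketch only shows how the stopping points found in Phase 1 carry over to Phase 2.

```python
def best_steps_per_stage(eval_log):
    """Phase 1: for each curriculum stage, pick the step with the highest dev score.

    eval_log maps a stage name to a list of (step, dev_f1) pairs.
    """
    return {
        stage: max(points, key=lambda point: point[1])[0]
        for stage, points in eval_log.items()
    }


def phase2_schedule(stages, best_steps):
    """Phase 2: replay the stages in order, each stopped at its Phase 1 best step."""
    return [(name, ctx_len, best_steps[name]) for name, ctx_len in stages]


# Hypothetical curriculum of increasing context lengths (values are assumptions).
stages = [("short", 128), ("medium", 256), ("long", 512)]

# Hypothetical dev-set evaluations logged during Phase 1.
eval_log = {
    "short": [(500, 0.78), (1000, 0.80), (1500, 0.79)],
    "medium": [(500, 0.81), (1000, 0.80)],
    "long": [(500, 0.80), (1000, 0.82), (1500, 0.82)],
}

best_steps = best_steps_per_stage(eval_log)
schedule = phase2_schedule(stages, best_steps)
for name, ctx_len, steps in schedule:
    print(f"stage={name} context={ctx_len} train_steps={steps}")
```

Because Phase 2 trains on the dev split as well, no held-out data remains for early stopping, which is why the step counts have to be fixed in advance.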

Evaluation

Evaluated on the NorNE test split (Bokmål and Nynorsk combined), with entity-level metrics computed via seqeval:

Metric	Score
Precision	0.8199
Recall	0.8278
F1	0.8238
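As a quick consistency check, F1 is the harmonic mean of precision and recall, so the reported score can be recomputed from the other two rows. The function name `f1_from_pr` is an illustrative assumption, and because the inputs are already rounded to four decimals, the result may differ in the last digit from a value computed on the raw counts.

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


f1 = f1_from_pr(0.8199, 0.8278)
print(round(f1, 4))
```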

Usage

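from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

model_id = "fransis3/norbert4-small-NorNER"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# trust_remote_code=True lets transformers run the custom modeling code shipped with the model.
model = AutoModelForTokenClassification.from_pretrained(model_id, trust_remote_code=True)

# aggregation_strategy="simple" groups adjacent tokens that belong to the same entity.
ner = pipeline("token-classification", model=model, tokenizer=tokenizer, aggregation_strategy="simple")
for entity in ner("Erna Solberg besøkte Universitetet i Oslo forrige uke."):
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))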

Limitations

Performance is reported on NorNE's test distribution (news, blogs, parliamentary text, government reports). Generalization to other domains (e.g., social media, clinical text, historical Norwegian) is not guaranteed.
The model inherits any biases present in the pretraining data of its base model (ltg/norbert4-small) and in NorNE's source texts.
The model must be loaded with trust_remote_code=True, as required by the base model's custom architecture.

Dataset

NorNE is a named entity annotation layer over the Norwegian Dependency Treebank, covering both Bokmål and Nynorsk.

License

This model is released under the Apache 2.0 license, matching the base model. The NorNE annotations used for training are released under CC0 1.0.

Citation

If you use this model, please cite the underlying resources:

@inproceedings{charpentier-samuel-2024-bert,
    title = "{GPT} or {BERT}: why not both?",
    author = "Charpentier, Lucas Georges Gabriel  and
      Samuel, David",
    booktitle = "The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning",
    month = nov,
    year = "2024",
    address = "Miami, FL, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.conll-babylm.24/",
    pages = "262--283"
}

@inproceedings{samuel-etal-2023-norbench,
    title = "{N}or{B}ench {--} A Benchmark for {N}orwegian Language Models",
    author = "Samuel, David  and
      Kutuzov, Andrey  and
      Touileb, Samia  and
      Velldal, Erik  and
      {\O}vrelid, Lilja  and
      R{\o}nningstad, Egil  and
      Sigdel, Elina  and
      Palatkina, Anna",
    booktitle = "Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)",
    month = may,
    year = "2023",
    address = "T{\'o}rshavn, Faroe Islands",
    publisher = "University of Tartu Library",
    url = "https://aclanthology.org/2023.nodalida-1.61",
    pages = "618--633"
}

@misc{jørgensen2020norneannotatingnamedentities,
      title={NorNE: Annotating Named Entities for Norwegian}, 
      author={Fredrik Jørgensen and Tobias Aasmoe and Anne-Stine Ruud Husevåg and Lilja Øvrelid and Erik Velldal},
      year={2020},
      eprint={1911.12146},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/1911.12146}, 
}