# NorBERT4 Norwegian QA Embedding Model

This is NorBERT4-NLI-QA, a Norwegian sentence embedding model optimized for question-answering and semantic retrieval tasks.

## Model Description

This model is the result of a two-stage curriculum learning approach:
- Stage 1 (V1): Fine-tuned on 569k Norwegian NLI samples for semantic understanding
- Stage 2 (This model): Further fine-tuned on 103k Norwegian/Danish QA and paraphrase samples
## Training Details

### Stage 2 Training Configuration

- Base model: thivy/norbert4-base-nli-norwegian
- Datasets: NorQuAD (3.8k), NorOpenBookQA (2.9k), PAWS-X NO (21.8k), Supervised-DA (74.5k)
- Total samples: ~103,000
- Training objective: MultipleNegativesRankingLoss with in-batch negatives
- Max sequence length: 128 tokens
- Batch size: 32
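MultipleNegativesRankingLoss treats row *i* of the positives as the correct match for anchor *i* and every other positive in the batch as a negative. The following is a minimal plain-NumPy sketch of that objective, not the sentence-transformers implementation; the similarity scale of 20 is an assumption matching the library's default:

```python
import numpy as np

def mnr_loss(anchors, positives, scale=20.0):
    """MultipleNegativesRankingLoss with in-batch negatives (NumPy sketch).

    Row i of `positives` is the positive for row i of `anchors`; all other
    rows act as negatives. Cross-entropy is taken over each row of the
    scaled cosine-similarity matrix, with the diagonal as the correct "class".
    """
    # L2-normalise so the dot product equals cosine similarity
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    sims = scale * (a @ p.T)  # (batch, batch) scaled cosine similarities
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
batch = rng.normal(size=(4, 8))
# Matched anchor/positive pairs give a near-zero loss; mismatched pairs a large one
loss_matched = mnr_loss(batch, batch)
loss_mismatched = mnr_loss(batch, batch[[1, 2, 3, 0]])
print(loss_matched, loss_mismatched)
```

With a batch size of 32, each training pair thus gets 31 "free" negatives per step, which is why larger batches usually help this loss.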
### Hyperparameters (Anti-Overfitting Optimized)

- Learning rate: 5.0e-6 (75% lower than baseline)
- Warmup: 0.0 (removed to prevent early peaking)
- Weight decay: 0.015 (50% stronger regularization)
- Gradient clipping: 1.0
- Early stopping patience: 5
- LR scheduler: cosine decay
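With warmup removed, the cosine schedule starts at the full learning rate and decays smoothly toward zero, which is consistent with the best checkpoint landing early in training. A sketch of the schedule shape (the total-step count of 2,500 is illustrative, chosen to match the last evaluation step reported below):

```python
import math

def cosine_lr(step, total_steps, base_lr=5.0e-6, warmup_steps=0):
    """Cosine-decay learning-rate schedule. With warmup_steps=0 (as in
    stage 2), training begins at the full base_lr immediately."""
    if step < warmup_steps:
        return base_lr * (step / warmup_steps)  # linear warmup (unused here)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

# LR starts at 5e-6 and decays monotonically to 0 by the final step
lrs = [cosine_lr(s, total_steps=2500) for s in (0, 500, 1250, 2500)]
print(lrs)
```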
## Performance Metrics

**NDCG@10 progression:**

| Step | NDCG@10 | Change vs. step 500 |
|---|---|---|
| 500 | 0.8781 | Best |
| 1000 | 0.8720 | -0.69% |
| 1500 | 0.8693 | -1.00% |
| 2000 | 0.8695 | -0.98% |
| 2500 | 0.8677 | -1.18% |
**Evaluation metrics (at step 500, the best checkpoint):**
- NDCG@10: 0.8781
- MRR@10: 0.8640
- MAP@100: 0.8658
- Accuracy@1: 0.8331
- Accuracy@10: 0.9219
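For reference, MRR@k and (binary-relevance) NDCG@k can be computed from a ranked relevance list as below. This is a generic sketch of the metric definitions, not the evaluation code used for the numbers above:

```python
import numpy as np

def mrr_at_k(ranked_relevance, k=10):
    """Reciprocal rank of the first relevant hit within the top k results."""
    for rank, rel in enumerate(ranked_relevance[:k], start=1):
        if rel:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_relevance, k=10):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    rels = np.asarray(ranked_relevance[:k], dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, rels.size + 2))  # 1/log2(rank+1)
    dcg = float((rels * discounts).sum())
    idcg = float((np.sort(rels)[::-1] * discounts).sum())
    return dcg / idcg if idcg > 0 else 0.0

# One query whose single relevant document was ranked 2nd
print(mrr_at_k([0, 1, 0, 0]))            # 0.5
print(round(ndcg_at_k([0, 1, 0, 0]), 4))
```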
**Training stability:**
- Eval loss: Decreased by 3.3% (0.0911 → 0.0881)
- Best checkpoint: Step 500 (19.7% through training)
- Final degradation: 1.18% (lowest among all variants tested)
## Usage

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Load the model from the Hugging Face Hub
model = SentenceTransformer('thivy/norbert4-norwegian-qa')

# Encode a question and two candidate sentences
sentences = [
    "Hva er hovedstaden i Norge?",    # "What is the capital of Norway?"
    "Oslo er hovedstaden i Norge.",   # "Oslo is the capital of Norway."
    "Bergen er en by på vestlandet."  # "Bergen is a city in western Norway."
]
embeddings = model.encode(sentences)

# Compute cosine similarities between the question and the candidates
similarities = cos_sim(embeddings[0], embeddings[1:])
print(similarities)
```
## Intended Use

This model is designed for:
- ✅ Norwegian question-answering systems
- ✅ Semantic search and retrieval
- ✅ Document similarity
- ✅ Paraphrase detection
- ✅ Cross-lingual tasks (Norwegian, Danish, Swedish)
## Training Data

### Stage 1 (NLI - Base Model)

- NorNLI Combined: 569,000 samples
- Format: premise-hypothesis pairs with entailment labels

### Stage 2 (QA & Paraphrase - This Model)

- NorQuAD: 3,808 Norwegian question-answer pairs
- NorOpenBookQA: 2,886 Norwegian QA samples
- PAWS-X Norwegian: 21,829 paraphrase pairs
- Supervised-DA: 74,560 Danish sentence pairs
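All four stage-2 datasets reduce to (anchor, positive) text pairs so that MultipleNegativesRankingLoss can use the rest of the batch as negatives. A hypothetical sketch of that mapping (the field names `question`/`context` and `sentence1`/`sentence2` are illustrative; the real dataset schemas may differ):

```python
def qa_to_pair(record):
    """Map a QA record to an (anchor, positive) pair: the question
    should retrieve the passage containing its answer."""
    return (record["question"], record["context"])

def paraphrase_to_pair(record):
    """Map a paraphrase record to an (anchor, positive) pair of
    semantically equivalent sentences."""
    return (record["sentence1"], record["sentence2"])

pairs = [
    qa_to_pair({"question": "Hva er hovedstaden i Norge?",
                "context": "Oslo er hovedstaden i Norge."}),
    paraphrase_to_pair({"sentence1": "Bergen ligger på Vestlandet.",
                        "sentence2": "Byen Bergen ligger vest i Norge."}),
]
print(pairs)
```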
## Limitations
- Optimized primarily for Norwegian text (with Danish/Swedish support)
- Maximum sequence length: 128 tokens
- Best performance on question-answering and retrieval tasks
- May require domain adaptation for specialized domains
## Model Card Authors
Thivyesh Ahilathasan
## Citation

If you use this model, please cite:

```bibtex
@misc{norbert4-nli-qa,
  author       = {Ahilathasan, Thivyesh},
  title        = {NorBERT4 Norwegian QA Embedding Model},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/thivy/norbert4-norwegian-qa}}
}
```