NorBERT4-base Scandinavian Embedding Model

An embedding model for Norwegian, Danish, and Swedish, trained jointly on multiple datasets.

Model Details

  • Base Model: ltg/norbert4-base
  • Embedding Dimension: 640
  • Max Sequence Length: 256 tokens
  • Languages: Norwegian (Bokmål & Nynorsk), Danish, Swedish
  • Training Approach: Multi-dataset ROUND_ROBIN sampling
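
The key dimensions above can be checked directly on the loaded model; a minimal sketch (the repository id is the one used in the Usage section below):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("thivy/norbert4-base-scandinavian-embedding")
print(model.get_sentence_embedding_dimension())  # 640
print(model.max_seq_length)                      # 256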

Training Data

Total: 1.6M samples across 3 Scandinavian languages

1. NLI Dataset (556k samples, Norwegian)

  • Source: Fremtind/all-nli-norwegian
  • Format: (anchor, positive, negative) triplets
  • Purpose: Natural language understanding and semantic similarity

2. Question-Answering Dataset (100k samples, NO+DA+SV)

  • NorQuAD: ltg/norquad - Norwegian QA
  • NorBookQA: ltg/norbookqa - Norwegian OpenBookQA
  • ScandiQA: alexandrainst/scandi-qa - Scandinavian QA (NO+DA+SV)
  • Supervised-DA: Danish sentence pairs
  • Format: (query, positive) pairs
  • Purpose: Question-document retrieval

3. DDSC Nordic Dataset (949k samples, NO+DA+SV)

  • Source: DDSC/nordic-embedding-training-data
  • Format: (query, positive) pairs, optionally with a hard negative
  • Composition: 40% with hard negatives, 60% with in-batch negatives
  • Purpose: General retrieval with hard negative mining
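
To make the three formats concrete, here is an illustrative sketch of single training rows; the column names and text are placeholders, not the exact fields of the source datasets:

# Illustrative rows only; real column names in the source datasets may differ
nli_row = {    # (anchor, positive, negative) triplet
    "anchor": "En mann spiller gitar.",
    "positive": "En person spiller et instrument.",
    "negative": "En mann spiser middag.",
}
qa_row = {     # (query, positive) pair
    "query": "Hva er hovedstaden i Norge?",
    "positive": "Oslo er hovedstaden i Norge.",
}
ddsc_row = {   # (query, positive, [negative]); ~40% of rows carry a hard negative
    "query": "hvordan fungerer solceller",
    "positive": "Solceller omdanner sollys til elektrisk energi.",
    "negative": "Vindmøller producerer strøm fra vind.",
}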

Training Details

  • Strategy: ROUND_ROBIN multi-dataset sampling, which helps mitigate catastrophic forgetting (see the sketch after this list)
  • Batch Size: 16 (effective batch size: 32 with gradient accumulation)
  • Learning Rate: 5e-6 (kept low to reduce the risk of overfitting)
  • Epochs: 1 (a single pass through all datasets)
  • Loss: MultipleNegativesRankingLoss
  • Early Stopping: Tracks average loss across all three datasets
  • Regularization: weight_decay=0.015, no warmup
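
A minimal sketch of how this configuration maps onto the sentence-transformers v3 trainer; the dataset variables and output directory are placeholders, and loading the base model may require trust_remote_code:

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainer
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import (
    SentenceTransformerTrainingArguments,
    MultiDatasetBatchSamplers,
)

model = SentenceTransformer("ltg/norbert4-base", trust_remote_code=True)
loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="norbert4-scandi-embedding",  # placeholder
    num_train_epochs=1,
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,           # effective batch size 32
    learning_rate=5e-6,
    weight_decay=0.015,
    warmup_ratio=0.0,                        # no warmup
    # Cycle through the datasets one batch at a time
    multi_dataset_batch_sampler=MultiDatasetBatchSamplers.ROUND_ROBIN,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    # nli_ds, qa_ds and ddsc_ds are placeholder datasets.Dataset objects
    train_dataset={"nli": nli_ds, "qa": qa_ds, "ddsc": ddsc_ds},
    loss={"nli": loss, "qa": loss, "ddsc": loss},
)
trainer.train()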

Performance

MTEB retrieval benchmarks, comparing the multi-dataset model with earlier single-dataset (QA-only and NLI-only) variants:

Task                         Multi-Dataset   QA-only   NLI-only   Improvement (vs QA-only)
NorQuadRetrieval (ndcg@10)   0.232           0.209     0.163      +11.0%
SNLRetrieval (ndcg@10)       0.818           0.765     0.519      +6.9%
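
One way to run the two retrieval tasks above is with the mteb package; a sketch assuming a recent mteb release and the task names as listed in the table:

import mteb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("thivy/norbert4-base-scandinavian-embedding")
tasks = mteb.get_tasks(tasks=["NorQuadRetrieval", "SNLRetrieval"])
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results")  # ndcg@10 is reported in the result files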

Usage

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("thivy/norbert4-base-scandinavian-embedding")

# Encode sentences in Norwegian, Swedish, and Danish
sentences = [
    "Dette er en norsk setning",
    "Detta är en svensk mening",
    "Dette er en dansk sætning",
]

embeddings = model.encode(sentences)

# Compute cosine similarity between the Norwegian and Swedish sentences
similarity = cos_sim(embeddings[0], embeddings[1])

Intended Use

  • Semantic search across Scandinavian languages (see the example after this list)
  • Document retrieval and ranking
  • Question-answering systems
  • Cross-lingual similarity (NO/DA/SV)
  • Text clustering and classification
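
For example, a cross-lingual semantic-search sketch using the utility shipped with sentence-transformers; the documents and query below are made up:

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import semantic_search

model = SentenceTransformer("thivy/norbert4-base-scandinavian-embedding")

documents = [
    "Oslo er hovedstaden i Norge.",       # Norwegian
    "Stockholm är Sveriges huvudstad.",   # Swedish
    "København er Danmarks hovedstad.",   # Danish
]
doc_embeddings = model.encode(documents, convert_to_tensor=True)

query = "Hvilken by er hovedstaden i Danmark?"
query_embedding = model.encode(query, convert_to_tensor=True)

# Rank documents by cosine similarity to the query
hits = semantic_search(query_embedding, doc_embeddings, top_k=3)[0]
for hit in hits:
    print(documents[hit["corpus_id"]], round(hit["score"], 3))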

Limitations

  • Max sequence length: 256 tokens (longer texts are truncated)
  • Optimized for Scandinavian languages (Norwegian, Danish, Swedish)
  • Strongest on retrieval-style tasks; not trained for instruction-following

Citation

If you use this model, please cite:

@misc{norbert4-scandi-embedding,
  title={NorBERT4 Scandinavian Embedding Model},
  author={Thivyesh Ahilathasan},
  year={2025},
  url={https://huggingface.co/thivy/norbert4-base-scandinavian-embedding}
}
