# NorBERT4-base Scandinavian Embedding Model

A multi-dataset-trained embedding model for Norwegian, Danish, and Swedish.
## Model Details
- Base Model: ltg/norbert4-base
- Embedding Dimension: 640
- Max Sequence Length: 256 tokens
- Languages: Norwegian (Bokmål & Nynorsk), Danish, Swedish
- Training Approach: Multi-dataset ROUND_ROBIN sampling
## Training Data
Total: 1.6M samples across 3 Scandinavian languages
1. NLI Dataset (556k samples, Norwegian)
- Source: Fremtind/all-nli-norwegian
- Format: (anchor, positive, negative) triplets
- Purpose: Natural language understanding and semantic similarity
2. Question-Answering Dataset (100k samples, NO+DA+SV)
- NorQuAD: ltg/norquad - Norwegian QA
- NorBookQA: ltg/norbookqa - Norwegian OpenBookQA
- ScandiQA: alexandrainst/scandi-qa - Scandinavian QA (NO+DA+SV)
- Supervised-DA: Danish sentence pairs
- Format: (query, positive) pairs
- Purpose: Question-document retrieval
3. DDSC Nordic Dataset (949k samples, NO+DA+SV)
- Source: DDSC/nordic-embedding-training-data
- Format: (query, positive, [negative]) pairs
- Composition: 40% with hard negatives, 60% with in-batch negatives
- Purpose: General retrieval with hard negative mining
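The mix above — some examples carrying an explicit hard negative, the rest relying on in-batch negatives — is exactly what MultipleNegativesRankingLoss (see Training Details) consumes. A minimal NumPy sketch of that loss, assuming cosine-similarity scoring with a fixed scale factor; this is an illustration of the idea, not the sentence-transformers implementation:

```python
import numpy as np

def mnrl_loss(anchors, positives, hard_negatives=None, scale=20.0):
    """Simplified MultipleNegativesRankingLoss sketch.

    For each anchor, its own positive is the "correct class"; every other
    positive in the batch (plus any explicit hard negatives) serves as a
    negative. The loss is cross-entropy over scaled cosine similarities.
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    a = normalize(anchors)
    candidates = normalize(positives)
    if hard_negatives is not None:
        # Hard negatives are appended as extra (always-wrong) candidates.
        candidates = np.vstack([candidates, normalize(hard_negatives)])

    scores = scale * (a @ candidates.T)            # (batch, n_candidates)
    correct = scores[np.arange(len(a)), np.arange(len(a))]
    logsumexp = np.log(np.exp(scores).sum(axis=1))
    return float(np.mean(logsumexp - correct))     # cross-entropy, averaged
```

Note how supplying `hard_negatives` only adds columns to the score matrix — pairs without them still train against the other positives in the batch, which is why mixing the two formats in one dataset works.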
## Training Details
- Strategy: ROUND_ROBIN multi-dataset sampling (prevents catastrophic forgetting)
- Batch Size: 16 (effective batch size: 32 with gradient accumulation)
- Learning Rate: 5e-6 (low LR to prevent overfitting)
- Epochs: 1 epoch through all datasets
- Loss: MultipleNegativesRankingLoss
- Early Stopping: Tracks average loss across all three datasets
- Regularization: weight_decay=0.015, no warmup
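The ROUND_ROBIN strategy can be pictured as a scheduler that draws one batch from each dataset in turn and stops when the smallest dataset runs dry, so no single dataset dominates the epoch. A simplified pure-Python sketch (sentence-transformers exposes the real thing via `MultiDatasetBatchSamplers.ROUND_ROBIN` in its training arguments; the function below is only an illustration):

```python
from itertools import islice

def round_robin_batches(datasets, batch_size):
    """Yield batches by cycling over datasets: one batch from each in turn.

    Stops as soon as any dataset cannot fill a complete batch, which is why
    round-robin sampling bounds each epoch by the smallest dataset and keeps
    the per-dataset mix balanced (helping against catastrophic forgetting).
    """
    iterators = [iter(ds) for ds in datasets]
    while True:
        round_batches = []
        for it in iterators:
            batch = list(islice(it, batch_size))
            if len(batch) < batch_size:
                return  # smallest dataset exhausted: end of epoch
            round_batches.append(batch)
        yield from round_batches
```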
## Performance
MTEB retrieval benchmarks, compared with earlier single-dataset variants of this model:

| Task | Multi-Dataset | QA-only | NLI-only | Improvement (vs QA-only) |
|---|---|---|---|---|
| NorQuadRetrieval (ndcg@10) | 0.232 | 0.209 | 0.163 | +11.0% |
| SNLRetrieval (ndcg@10) | 0.818 | 0.765 | 0.519 | +6.9% |
## Usage

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer("thivy/norbert4-base-scandinavian-embedding")

# Encode one sentence per language
sentences = [
    "Dette er en norsk setning",   # "This is a Norwegian sentence"
    "Detta är en svensk mening",   # "This is a Swedish sentence"
    "Dette er en dansk sætning",   # "This is a Danish sentence"
]
embeddings = model.encode(sentences)

# Compute cosine similarity between the Norwegian and Swedish sentences
similarity = cos_sim(embeddings[0], embeddings[1])
```
## Intended Use
- Semantic search across Scandinavian languages
- Document retrieval and ranking
- Question-answering systems
- Cross-lingual similarity (NO/DA/SV)
- Text clustering and classification
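For the semantic-search use case, retrieval reduces to ranking corpus embeddings by cosine similarity against a query embedding. A minimal NumPy sketch — in practice you would obtain the inputs from `model.encode(...)`, and `sentence_transformers.util.semantic_search` provides a full-featured equivalent:

```python
import numpy as np

def top_k_by_cosine(query_emb, corpus_embs, top_k=3):
    """Return (corpus_index, cosine_score) pairs for the top_k best matches.

    Illustrative only; with this model, query_emb = model.encode(query) and
    corpus_embs = model.encode(documents).
    """
    q = query_emb / np.linalg.norm(query_emb)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    scores = c @ q                       # cosine similarity per document
    order = np.argsort(-scores)[:top_k]  # highest similarity first
    return [(int(i), float(scores[i])) for i in order]
```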
## Limitations
- Max sequence length: 256 tokens (longer texts are truncated)
- Optimized for Scandinavian languages (Norwegian, Danish, Swedish)
- Best performance on retrieval tasks (not instruction-following)
## Citation

If you use this model, please cite:

```bibtex
@misc{norbert4-scandi-embedding,
  title={NorBERT4 Scandinavian Embedding Model},
  author={Thivyesh Ahilathasan},
  year={2025},
  url={https://huggingface.co/thivy/norbert4-base-scandinavian-embedding}
}
```
## Related Models
- Base: ltg/norbert4-base
- Large: thivy/norbert4-large-scandinavian-embedding (coming soon)