# NorBERT4 SPLADE - Retrieval-Only
This is a SPLADE sparse encoder for Norwegian and Scandinavian languages, fine-tuned from `ltg/norbert4-base`. It is optimized specifically for query → document information retrieval.
## Model Details
- Base Model: ltg/norbert4-base
- Architecture: SPLADE (Sparse Lexical and Expansion Model)
- Max Sequence Length: 4096 tokens
- Output Dimensionality: 51,200 sparse dimensions
- Languages: Norwegian (Bokmål, Nynorsk), Danish, Swedish
- Training Data: 333,547 query-document pairs
- Training Focus: Retrieval-only datasets (ETI-format: short query → long document)
## Performance
Best checkpoint at step 1,500:
- NDCG@10: 0.271
- MRR@10: 0.229
- Accuracy@10: 56%
## Usage

### Installation

```bash
pip install -U sentence-transformers
```
### Basic Usage

```python
from sentence_transformers import SparseEncoder

# Load model
model = SparseEncoder("thivy/norbert4-base-splade-retrieval")

# Encode queries and documents
queries = ["Hva er maskinlæring?", "Søren Kierkegaard filosofi"]
documents = [
    "Maskinlæring er en gren av kunstig intelligens...",
    "Søren Kierkegaard var en dansk filosof...",
]
query_embeddings = model.encode(queries)
doc_embeddings = model.encode(documents)

# Compute similarities (dot product)
similarities = model.similarity(query_embeddings, doc_embeddings)
print(similarities)
```
### Information Retrieval Example

```python
from sentence_transformers import SparseEncoder
from sentence_transformers.util import semantic_search

# Load model
model = SparseEncoder("thivy/norbert4-base-splade-retrieval")

# Your corpus
corpus = [
    "Norge er et skandinavisk land i Nord-Europa.",
    "Python er et programmeringsspråk.",
    "Maskinlæring brukes i mange applikasjoner.",
]

# Encode corpus
corpus_embeddings = model.encode(corpus)

# Query
query = "Hva er Python?"
query_embedding = model.encode(query)

# Search
hits = semantic_search(query_embedding, corpus_embeddings, top_k=3)[0]
for hit in hits:
    print(f"Score: {hit['score']:.4f} - {corpus[hit['corpus_id']]}")
```
### With Threshold for High Sparsity (Recommended)

To achieve high sparsity (~99%), apply a threshold at inference time:

```python
from sentence_transformers import SparseEncoder

model = SparseEncoder("thivy/norbert4-base-splade-retrieval")

texts = ["Hva er hovedstaden i Norge?"]
embeddings = model.encode(texts, convert_to_sparse_tensor=False)

# Apply threshold to get ~99% sparse embeddings
threshold = 0.05
embeddings[embeddings < threshold] = 0

print(f"Active dimensions: {(embeddings > 0).sum().item()}/51200")
# Output: Active dimensions: ~500-1000/51200 (98-99% sparse)
```
## Known Issue: 0% Metric Sparsity

⚠️ The sparsity metric reports 0% despite the model being functionally sparse.

Why this happens:
- NorBERT4's MLM head applies `30 * sigmoid(x / 7.5)`, forcing all logits into the (0, 30) range
- SPLADE's activation, `ReLU(log(1 + exp(x)))`, cannot produce zeros from strictly positive inputs
- Result: the metric shows all 51,200 dimensions as active, but many have very small weights

This is not a bug. The model works correctly and produces semantically meaningful sparse representations. It just needs a threshold at inference time (as shown above).
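The effect can be reproduced with a few lines of NumPy. This is a standalone illustration of the two formulas above, not the model's actual code path:

```python
import numpy as np

# NorBERT4's MLM head: 30 * sigmoid(x / 7.5) maps any logit into (0, 30)
def mlm_head(x):
    return 30 * (1 / (1 + np.exp(-x / 7.5)))

# SPLADE activation as described above: ReLU(log(1 + exp(x)))
def splade_activation(x):
    return np.maximum(np.log1p(np.exp(x)), 0)

logits = np.linspace(-50, 50, 101)  # arbitrary raw logits
weights = splade_activation(mlm_head(logits))
print(weights.min())  # strictly positive: no dimension is ever exactly 0
```

Because the MLM head's output is always positive, the softplus term is always above `log(2)`, so every dimension ends up with a nonzero weight and the exact-zero sparsity metric reads 0%.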
### Verification Script

Run this to verify the model works correctly:

```python
from sentence_transformers import SparseEncoder
import numpy as np

model = SparseEncoder('thivy/norbert4-base-splade-retrieval')

queries = [
    'Hva er hovedstaden i Norge?',
    'Hvem vant fotball-VM i 2022?',
    'Hva er symptomene på influensa?',
]
documents = [
    'Oslo er hovedstaden og den mest folkerike byen i Norge.',
    'Argentina vant FIFA verdensmesterskapet i fotball i 2022.',
    'Influensa er en virussykdom som gir symptomer som feber, hoste.',
    'Bergen er en vakker by på vestlandet.',
    'Norsk bokmål og nynorsk er de to offisielle skriftspråkene i Norge.',
]

print('=== RAW EMBEDDINGS (no threshold) ===')
q_emb = model.encode(queries, convert_to_sparse_tensor=False)
d_emb = model.encode(documents, convert_to_sparse_tensor=False)

# Convert to numpy for easier manipulation
if hasattr(q_emb, 'cpu'):
    q_emb = q_emb.cpu().numpy()
    d_emb = d_emb.cpu().numpy()

sims = q_emb @ d_emb.T
print('Query-Document Similarity (should have high diagonal):')
for i, q in enumerate(queries):
    best = np.argmax(sims[i])
    print(f'Q{i+1} best match: D{best+1} (score: {sims[i][best]:.2f})')

print('\n=== WITH THRESHOLD = 0.05 ===')
q_sparse = q_emb.copy()
d_sparse = d_emb.copy()
q_sparse[q_sparse < 0.05] = 0
d_sparse[d_sparse < 0.05] = 0

q_active = np.mean([np.count_nonzero(q_sparse[i]) for i in range(len(queries))])
d_active = np.mean([np.count_nonzero(d_sparse[i]) for i in range(len(documents))])
print(f'Query active dims: {q_active:.0f} / 51200 ({100*q_active/51200:.1f}%)')
print(f'Doc active dims: {d_active:.0f} / 51200 ({100*d_active/51200:.1f}%)')

sims_sparse = q_sparse @ d_sparse.T
print('Similarity with threshold (rankings should be same):')
for i, q in enumerate(queries):
    best = np.argmax(sims_sparse[i])
    print(f'Q{i+1} best match: D{best+1} (score: {sims_sparse[i][best]:.2f})')
```
Expected output: Queries should correctly match their corresponding documents (Q1→D1, Q2→D2, Q3→D3) both with and without threshold, demonstrating the model works correctly.
## Token Expansion Analysis

See which tokens get high weights in the embeddings:

```python
from sentence_transformers import SparseEncoder

model = SparseEncoder('thivy/norbert4-base-splade-retrieval')

queries = [
    'Hva er hovedstaden i Norge?',
    'Hvem vant fotball-VM i 2022?',
]
embeddings = model.encode(queries)

# Decode the top-weighted vocabulary tokens for each embedding
decoded = model.decode(embeddings, top_k=15)
for d, q in zip(decoded, queries):
    print(f'Query: {q}')
    tokens = ', '.join([f'{tok}({score:.2f})' for tok, score in d])
    print(f'Top tokens: {tokens}\n')
```
This will show the top weighted tokens for each query, demonstrating the learned term expansion.
## Training Details

### Training Configuration

- Epochs: 1
- Total Steps: 10,423
- Batch Size: 16 per device (32 total across 2 GPUs)
- Learning Rate: 2e-5
- Warmup Ratio: 0.1
- Precision: bfloat16
- Regularization:
  - Document: 0.003
  - Query: 0.0001
### Training Datasets
Retrieval-only datasets (query → document pairs):
- DDSC - Nordic Embedding Training Data (~182K pairs, retrieval task only, NO/DA/SV)
- ETI - Elektronisk Tjenesteinformasjon (~54K pairs, health/welfare domain, NO)
- NorQuAD - Norwegian Question Answering (~3.8K pairs, NO)
- ScandiQA - Scandinavian Question Answering (~20K pairs, NO/DA/SV)
- Supervised-DA - Danish supervised retrieval pairs (~93K pairs, DA)
Total: ~333K query-document pairs across Norwegian, Danish, and Swedish.
### Hardware
- GPUs: 2x NVIDIA H100
- Training Time: ~9 hours
- Framework: PyTorch with DDP (Distributed Data Parallel)
## Model Architecture

```
SparseEncoder(
  (0): MLMTransformer (NorBERT4-base with MLM head)
  (1): SpladePooling (max pooling + ReLU activation)
)
```
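The pooling step can be sketched in NumPy using the activation described in the Known Issue section. The logits here are made-up toy values over a 6-term vocabulary; the real module operates on all 51,200 dimensions:

```python
import numpy as np

# Toy MLM logits: 3 tokens x 6 vocabulary dimensions (hypothetical values)
token_logits = np.array([
    [ 2.0, -1.0,  0.5,  0.0,  3.0, -2.0],
    [ 0.1,  4.0, -0.5,  1.0,  0.0,  0.0],
    [-1.0,  0.0,  2.5, -3.0,  1.0,  0.5],
])

# Activate each token's logits, then max-pool over tokens
activated = np.maximum(np.log1p(np.exp(token_logits)), 0)  # ReLU(log(1 + exp(x)))
doc_vector = activated.max(axis=0)  # one weight per vocabulary dimension
print(doc_vector.shape)  # (6,)
```

Max pooling means a term's weight in the final vector is driven by the single token position that activates it most strongly.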
## Intended Use
Primary Use: Norwegian and Scandinavian language information retrieval, semantic search, and document ranking.
Ideal For:
- Search engines for Norwegian content
- Question answering systems
- Document retrieval
- Academic and legal document search
Not Recommended For:
- Sentence similarity (use dense models instead)
- Classification tasks
- Very short text comparisons
## Limitations
- Requires more storage than dense models (sparse vectors)
- Best for retrieval tasks (query → document)
- Performance may vary on non-Norwegian languages
- Requires specialized sparse search infrastructure
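As a concrete example of that infrastructure, sparse vectors are typically served from an inverted index (as in Elasticsearch or Qdrant) rather than a dense matrix. A minimal sketch with toy dimension IDs and weights, not real model output:

```python
from collections import defaultdict

# Toy sparse document vectors: {dimension_id: weight}
docs = {
    "d1": {101: 1.2, 205: 0.7},
    "d2": {101: 0.4, 333: 1.5},
}

# Build an inverted index: dimension_id -> [(doc_id, weight)]
index = defaultdict(list)
for doc_id, vec in docs.items():
    for dim, w in vec.items():
        index[dim].append((doc_id, w))

# Score a query by accumulating dot products over shared dimensions only
query = {101: 0.9, 333: 0.5}
scores = defaultdict(float)
for dim, qw in query.items():
    for doc_id, dw in index.get(dim, []):
        scores[doc_id] += qw * dw

ranked = sorted(scores.items(), key=lambda kv: -kv[1])
print(ranked)
```

Only dimensions shared between query and document are ever touched, which is what makes thresholded (highly sparse) vectors cheap to search at scale.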
## Citation

If you use this model, please cite:

```bibtex
@misc{norbert4-splade-retrieval,
  author = {Thivyesh},
  title = {NorBERT4 SPLADE Retrieval-Only},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/thivy/norbert4-base-splade-retrieval}
}
```
## License
MIT License
## Acknowledgements
- Base model: ltg/norbert4-base by Language Technology Group, University of Oslo
- Framework: Sentence Transformers
- SPLADE architecture based on Formal et al., 2021
## Evaluation Results

Self-reported results on NanoNFCorpus:
- NDCG@10: 0.196
- MRR@10: 0.229
- MAP@100: 0.075