---
language:
- 'no'
- nb
- nn
- da
- sv
license: mit
tags:
- sentence-transformers
- sparse-encoder
- sparse
- splade
- norwegian
- scandinavian
- information-retrieval
base_model: ltg/norbert4-base
pipeline_tag: feature-extraction
library_name: sentence-transformers
model-index:
- name: NorBERT4 SPLADE Retrieval-Only
  results:
  - task:
      type: information-retrieval
      name: Information Retrieval
    dataset:
      name: NanoNFCorpus
      type: NanoNFCorpus
    metrics:
    - type: ndcg_at_10
      value: 0.1963
      name: NDCG@10
    - type: mrr_at_10
      value: 0.2290
      name: MRR@10
    - type: map_at_100
      value: 0.0752
      name: MAP@100
---

# NorBERT4 SPLADE - Retrieval-Only

This is a **SPLADE sparse encoder** for Norwegian and other Scandinavian languages, fine-tuned from [ltg/norbert4-base](https://huggingface.co/ltg/norbert4-base). It is optimized specifically for **information retrieval**, i.e. query → document search.

## Model Details

- **Base Model:** [ltg/norbert4-base](https://huggingface.co/ltg/norbert4-base)
- **Architecture:** SPLADE (Sparse Lexical and Expansion)
- **Max Sequence Length:** 4096 tokens
- **Output Dimensionality:** 51,200 sparse dimensions
- **Languages:** Norwegian (Bokmål, Nynorsk), Danish, Swedish
- **Training Data:** 333,547 query-document pairs
- **Training Focus:** Retrieval-only datasets (ETI-format: short query → long document)

## Performance

Best checkpoint at step 1,500:

- **NDCG@10:** 0.271
- **MRR@10:** 0.229
- **Accuracy@10:** 56%

## Usage

### Installation

```bash
pip install -U sentence-transformers
```

### Basic Usage

```python
from sentence_transformers import SparseEncoder

# Load model
model = SparseEncoder("thivy/norbert4-base-splade-retrieval")

# Encode queries and documents
queries = ["Hva er maskinlæring?", "Søren Kierkegaard filosofi"]
documents = [
    "Maskinlæring er en gren av kunstig intelligens...",
    "Søren Kierkegaard var en dansk filosof..."
]

query_embeddings = model.encode(queries)
doc_embeddings = model.encode(documents)

# Compute similarities (dot product)
similarities = model.similarity(query_embeddings, doc_embeddings)
print(similarities)
```

### Information Retrieval Example

```python
from sentence_transformers import SparseEncoder
from sentence_transformers.util import semantic_search

# Load model
model = SparseEncoder("thivy/norbert4-base-splade-retrieval")

# Your corpus
corpus = [
    "Norge er et skandinavisk land i Nord-Europa.",
    "Python er et programmeringsspråk.",
    "Maskinlæring brukes i mange applikasjoner."
]

# Encode corpus
corpus_embeddings = model.encode(corpus)

# Query
query = "Hva er Python?"
query_embedding = model.encode(query)

# Search
hits = semantic_search(query_embedding, corpus_embeddings, top_k=3)[0]

for hit in hits:
    print(f"Score: {hit['score']:.4f} - {corpus[hit['corpus_id']]}")
```

### With Threshold for High Sparsity (Recommended)

To achieve high sparsity (~99%), apply a threshold at inference time:

```python
from sentence_transformers import SparseEncoder

model = SparseEncoder("thivy/norbert4-base-splade-retrieval")

texts = ["Hva er hovedstaden i Norge?"]
embeddings = model.encode(texts, convert_to_sparse_tensor=False)

# Apply threshold to get ~99% sparse embeddings
threshold = 0.05
embeddings[embeddings < threshold] = 0

print(f"Active dimensions: {(embeddings > 0).sum().item()}/51200")
# Output: Active dimensions: ~500-1000/51200 (98-99% sparse)
```

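
Once thresholded, the embeddings can be stored compactly as (index, weight) pairs, which is the shape of data an inverted index consumes. A minimal numpy sketch of that representation and its scoring (the random array stands in for real model output; it is not produced by this model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for thresholded SPLADE output: mostly zeros, a few positive weights
emb = rng.random((3, 51200)).astype(np.float32)
emb[emb < 0.99] = 0.0  # keeps roughly 1% of dimensions, similar to a real threshold

# Keep only the active (index, weight) pairs per document
sparse_docs = [(row.nonzero()[0], row[row.nonzero()[0]]) for row in emb]

def sparse_dot(a, b):
    """Dot product of two (indices, values) sparse vectors."""
    common, ai, bi = np.intersect1d(a[0], b[0], return_indices=True)
    return float(a[1][ai] @ b[1][bi])

# Self-similarity is positive; mostly-disjoint vectors score near zero
print(sparse_dot(sparse_docs[0], sparse_docs[0]))
print(len(sparse_docs[0][0]), "active dims out of 51200")
```

Storing only the active pairs is what makes the ~99% sparsity pay off in index size and query latency.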
## Known Issue: 0% Metric Sparsity

⚠️ **The sparsity metric reports 0% despite the model being functionally sparse.**

**Why this happens:**

1. NorBERT4's MLM head applies `30 * sigmoid(x / 7.5)`, forcing all logits into the (0, 30) range.
2. SPLADE's activation, `log(1 + ReLU(x))`, cannot produce zeros from strictly positive inputs.
3. Result: the metric counts all 51,200 dimensions as active, even though many carry vanishingly small weights.

**This is not a bug.** The model works correctly and produces semantically meaningful sparse representations. It just needs a threshold at inference time (as shown above).

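
This interaction can be reproduced without loading the model. A standalone numpy sketch of the two functions (the constants 30 and 7.5 come from NorBERT4's MLM head as described above; everything else is illustrative):

```python
import numpy as np

def mlm_head_squash(x):
    # NorBERT4's MLM head: 30 * sigmoid(x / 7.5), output range (0, 30)
    return 30.0 / (1.0 + np.exp(-x / 7.5))

def splade_activation(y):
    # SpladePooling applies log(1 + ReLU(y)); strictly positive for y > 0
    return np.log1p(np.maximum(y, 0.0))

# Even extremely negative raw logits never map to exactly zero
raw_logits = np.array([-500.0, -50.0, 0.0, 50.0])
weights = splade_activation(mlm_head_squash(raw_logits))
print(weights)
print(bool(weights.min() > 0.0))  # True: no dimension is ever exactly zero
```

Since the floor is tiny but nonzero, a count of exact zeros reports 0% sparsity even though the weights are effectively sparse.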

### Verification Script

Run this to verify the model works correctly:

```python
from sentence_transformers import SparseEncoder
import numpy as np

model = SparseEncoder('thivy/norbert4-base-splade-retrieval')

queries = [
    'Hva er hovedstaden i Norge?',
    'Hvem vant fotball-VM i 2022?',
    'Hva er symptomene på influensa?',
]

documents = [
    'Oslo er hovedstaden og den mest folkerike byen i Norge.',
    'Argentina vant FIFA verdensmesterskapet i fotball i 2022.',
    'Influensa er en virussykdom som gir symptomer som feber, hoste.',
    'Bergen er en vakker by på vestlandet.',
    'Norsk bokmål og nynorsk er de to offisielle skriftspråkene i Norge.',
]

print('=== RAW EMBEDDINGS (no threshold) ===')
q_emb = model.encode(queries, convert_to_sparse_tensor=False)
d_emb = model.encode(documents, convert_to_sparse_tensor=False)

# Convert to numpy for easier manipulation
if hasattr(q_emb, 'cpu'):
    q_emb = q_emb.cpu().numpy()
    d_emb = d_emb.cpu().numpy()

sims = q_emb @ d_emb.T
print('Query-Document Similarity (should have high diagonal):')
for i, q in enumerate(queries):
    best = np.argmax(sims[i])
    print(f'Q{i+1} best match: D{best+1} (score: {sims[i][best]:.2f})')

print('\n=== WITH THRESHOLD = 0.05 ===')
q_sparse = q_emb.copy()
d_sparse = d_emb.copy()
q_sparse[q_sparse < 0.05] = 0
d_sparse[d_sparse < 0.05] = 0

q_active = np.mean([np.count_nonzero(q_sparse[i]) for i in range(len(queries))])
d_active = np.mean([np.count_nonzero(d_sparse[i]) for i in range(len(documents))])

print(f'Query active dims: {q_active:.0f} / 51200 ({100*q_active/51200:.1f}%)')
print(f'Doc active dims: {d_active:.0f} / 51200 ({100*d_active/51200:.1f}%)')

sims_sparse = q_sparse @ d_sparse.T
print('Similarity with threshold (rankings should be same):')
for i, q in enumerate(queries):
    best = np.argmax(sims_sparse[i])
    print(f'Q{i+1} best match: D{best+1} (score: {sims_sparse[i][best]:.2f})')
```

**Expected output:** Queries should correctly match their corresponding documents (Q1→D1, Q2→D2, Q3→D3) both with and without a threshold, demonstrating that the model works correctly.

### Token Expansion Analysis

See which tokens get high weights in the embeddings:

```python
from sentence_transformers import SparseEncoder

model = SparseEncoder('thivy/norbert4-base-splade-retrieval')

queries = [
    'Hva er hovedstaden i Norge?',
    'Hvem vant fotball-VM i 2022?',
]

embeddings = model.encode(queries)
decoded = model.decode(embeddings, top_k=15)

for d, q in zip(decoded, queries):
    print(f'Query: {q}')
    tokens = ', '.join([f'{tok}({score:.2f})' for tok, score in d])
    print(f'Top tokens: {tokens}\n')
```

This will show the top weighted tokens for each query, demonstrating the learned term expansion.

## Training Details

### Training Configuration

- **Epochs:** 1
- **Total Steps:** 10,423
- **Batch Size:** 16 per device (32 total across 2 GPUs)
- **Learning Rate:** 2e-5
- **Warmup Ratio:** 0.1
- **Precision:** bfloat16
- **Regularization:**
  - Document: 0.003
  - Query: 0.0001

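
For context, SPLADE-style training typically applies document/query regularization weights like those above to the FLOPS penalty (Formal et al., 2021), which pushes per-dimension mean activations toward zero. The numpy sketch below is a hedged illustration of that penalty, not code taken from this model's training run:

```python
import numpy as np

def flops_penalty(batch_embeddings):
    """FLOPS regularizer: sum over vocab dims of the squared mean activation."""
    mean_per_dim = np.abs(batch_embeddings).mean(axis=0)
    return float((mean_per_dim ** 2).sum())

# Toy batch: 2 embeddings over a 3-entry vocabulary
docs = np.array([[0.0, 2.0, 0.0],
                 [0.0, 1.0, 3.0]])
penalty = flops_penalty(docs)   # (0.0**2 + 1.5**2 + 1.5**2) = 4.5
print(0.003 * penalty)          # document-side term of the total loss
```

Because the penalty squares the batch-mean of each dimension, it discourages dimensions that are active across many documents, encouraging sparsity.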
### Training Datasets

Retrieval-only datasets (query → document pairs):

- **DDSC** - Nordic Embedding Training Data (~182K pairs, retrieval task only, NO/DA/SV)
- **ETI** - Elektronisk Tjenesteinformasjon (~54K pairs, health/welfare domain, NO)
- **NorQuAD** - Norwegian Question Answering (~3.8K pairs, NO)
- **ScandiQA** - Scandinavian Question Answering (~20K pairs, NO/DA/SV)
- **Supervised-DA** - Danish supervised retrieval pairs (~93K pairs, DA)

**Total:** ~333K query-document pairs across Norwegian, Danish, and Swedish.

### Hardware

- **GPUs:** 2x NVIDIA H100
- **Training Time:** ~9 hours
- **Framework:** PyTorch with DDP (Distributed Data Parallel)

## Model Architecture

```
SparseEncoder(
  (0): MLMTransformer (NorBERT4-base with MLM head)
  (1): SpladePooling (log(1 + ReLU(x)) activation + max pooling over tokens)
)
```

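
The pooling stage can be illustrated in isolation. Below is a numpy sketch of SpladePooling over a toy `(seq_len, vocab_size)` logit matrix; the tiny vocabulary is illustrative, and the real module also respects the attention mask:

```python
import numpy as np

def splade_pooling(mlm_logits):
    """log(1 + ReLU(x)) per token, then max over the sequence axis."""
    activated = np.log1p(np.maximum(mlm_logits, 0.0))
    return activated.max(axis=0)  # one weight per vocabulary entry

# Two tokens, vocabulary of three entries
logits = np.array([[1.0, -2.0, 0.5],
                   [0.0,  3.0, -1.0]])
embedding = splade_pooling(logits)
print(embedding)  # [log(2), log(4), log(1.5)]
```

Max pooling over tokens is what lets any token in the document activate a vocabulary dimension, producing the learned term expansion shown earlier.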
## Intended Use

**Primary Use:** Norwegian and Scandinavian language information retrieval, semantic search, and document ranking.

**Ideal For:**
- Search engines for Norwegian content
- Question answering systems
- Document retrieval
- Academic and legal document search

**Not Recommended For:**
- Sentence similarity (use dense models instead)
- Classification tasks
- Very short text comparisons

## Limitations

- Sparse vectors can require more storage than dense embeddings unless kept in a compressed sparse format
- Best for retrieval tasks (query → document)
- Performance may vary on non-Norwegian languages
- Requires specialized sparse search infrastructure

## Citation

If you use this model, please cite:

```bibtex
@misc{norbert4-splade-retrieval,
  author = {Thivyesh},
  title = {NorBERT4 SPLADE Retrieval-Only},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/thivy/norbert4-base-splade-retrieval}
}
```

## License

MIT License

## Acknowledgements

- Base model: [ltg/norbert4-base](https://huggingface.co/ltg/norbert4-base) by the Language Technology Group, University of Oslo
- Framework: [Sentence Transformers](https://www.sbert.net/)
- SPLADE architecture based on [Formal et al., 2021](https://arxiv.org/abs/2107.05720)