---
language:
- 'no'
- nb
- nn
- da
- sv
license: mit
tags:
- sentence-transformers
- sparse-encoder
- sparse
- splade
- norwegian
- scandinavian
- information-retrieval
base_model: ltg/norbert4-base
pipeline_tag: feature-extraction
library_name: sentence-transformers
model-index:
- name: NorBERT4 SPLADE Retrieval-Only
  results:
  - task:
      type: information-retrieval
      name: Information Retrieval
    dataset:
      name: NanoNFCorpus
      type: NanoNFCorpus
    metrics:
    - type: ndcg_at_10
      value: 0.1963
      name: NDCG@10
    - type: mrr_at_10
      value: 0.2290
      name: MRR@10
    - type: map_at_100
      value: 0.0752
      name: MAP@100
---

# NorBERT4 SPLADE - Retrieval-Only

This is a **SPLADE sparse encoder** for Norwegian and Scandinavian languages, fine-tuned from [ltg/norbert4-base](https://huggingface.co/ltg/norbert4-base). It is optimized specifically for **information retrieval**: matching short queries against longer documents.

## Model Details

- **Base Model:** [ltg/norbert4-base](https://huggingface.co/ltg/norbert4-base)
- **Architecture:** SPLADE (Sparse Lexical and Expansion Model)
- **Max Sequence Length:** 4096 tokens
- **Output Dimensionality:** 51,200 sparse dimensions
- **Languages:** Norwegian (Bokmål, Nynorsk), Danish, Swedish
- **Training Data:** 333,547 query-document pairs
- **Training Focus:** Retrieval-only datasets (ETI-format: short query → long document)

## Performance

Best checkpoint at step 1,500:

- **NDCG@10:** 0.271
- **MRR@10:** 0.229
- **Accuracy@10:** 56%

## Usage

### Installation

```bash
pip install -U sentence-transformers
```

### Basic Usage

```python
from sentence_transformers import SparseEncoder

# Load model
model = SparseEncoder("thivy/norbert4-base-splade-retrieval")

# Encode queries and documents
queries = ["Hva er maskinlæring?", "Søren Kierkegaard filosofi"]
documents = [
    "Maskinlæring er en gren av kunstig intelligens...",
    "Søren Kierkegaard var en dansk filosof...",
]

query_embeddings = model.encode(queries)
doc_embeddings = model.encode(documents)

# Compute similarities (dot product)
similarities = model.similarity(query_embeddings, doc_embeddings)
print(similarities)
```

### Information Retrieval Example

SPLADE vectors are scored with the dot product, so rank with `model.similarity` rather than cosine-based utilities:

```python
import torch
from sentence_transformers import SparseEncoder

# Load model
model = SparseEncoder("thivy/norbert4-base-splade-retrieval")

# Your corpus
corpus = [
    "Norge er et skandinavisk land i Nord-Europa.",
    "Python er et programmeringsspråk.",
    "Maskinlæring brukes i mange applikasjoner.",
]

# Encode corpus
corpus_embeddings = model.encode(corpus)

# Query
query = "Hva er Python?"
query_embedding = model.encode([query])

# Rank the corpus by dot-product similarity and take the top 3
scores = model.similarity(query_embedding, corpus_embeddings)[0]
top_k = torch.topk(scores, k=3)
for score, idx in zip(top_k.values, top_k.indices):
    print(f"Score: {score:.4f} - {corpus[int(idx)]}")
```

### With Threshold for High Sparsity (Recommended)

To achieve high sparsity (~99%), apply a threshold at inference time:

```python
from sentence_transformers import SparseEncoder

model = SparseEncoder("thivy/norbert4-base-splade-retrieval")

texts = ["Hva er hovedstaden i Norge?"]
embeddings = model.encode(texts, convert_to_sparse_tensor=False)

# Apply threshold to get ~99% sparse embeddings
threshold = 0.05
embeddings[embeddings < threshold] = 0

print(f"Active dimensions: {(embeddings > 0).sum().item()}/51200")
# Output: Active dimensions: ~500-1000/51200 (98-99% sparse)
```

## Known Issue: 0% Metric Sparsity

⚠️ **The sparsity metric reports 0% despite the model being functionally sparse.**

**Why this happens:**

1. NorBERT4's MLM head applies `30 * sigmoid(x / 7.5)`, forcing all logits into the (0, 30) range.
2. SPLADE's activation, `log(1 + ReLU(x))`, cannot produce zeros from strictly positive inputs.
3. Result: the metric reports all 51,200 dimensions as active, even though most carry negligible weight.

**This is not a bug.** The model works correctly and produces semantically meaningful sparse representations. It just needs a threshold at inference time (as shown above).
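You can check the arithmetic of this chain directly. A minimal sketch (illustrative only; it reimplements the two formulas quoted above rather than calling the model):

```python
import torch

# The MLM head squashes every raw logit into (0, 30), so the SPLADE
# activation log(1 + ReLU(x)) is strictly positive in all 51,200 dimensions.
logits = torch.linspace(-100, 100, steps=9)  # raw logits, arbitrary range
x = 30 * torch.sigmoid(logits / 7.5)         # always in (0, 30)
activation = torch.log1p(torch.relu(x))      # therefore never exactly zero

print(activation.min())  # > 0 even for very negative logits,
                         # which is why the metric reports 0% sparsity
```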
### Verification Script

Run this to verify the model works correctly:

```python
from sentence_transformers import SparseEncoder
import numpy as np

model = SparseEncoder('thivy/norbert4-base-splade-retrieval')

queries = [
    'Hva er hovedstaden i Norge?',
    'Hvem vant fotball-VM i 2022?',
    'Hva er symptomene på influensa?',
]
documents = [
    'Oslo er hovedstaden og den mest folkerike byen i Norge.',
    'Argentina vant FIFA verdensmesterskapet i fotball i 2022.',
    'Influensa er en virussykdom som gir symptomer som feber, hoste.',
    'Bergen er en vakker by på Vestlandet.',
    'Norsk bokmål og nynorsk er de to offisielle skriftspråkene i Norge.',
]

print('=== RAW EMBEDDINGS (no threshold) ===')
q_emb = model.encode(queries, convert_to_sparse_tensor=False)
d_emb = model.encode(documents, convert_to_sparse_tensor=False)

# Convert to numpy for easier manipulation
if hasattr(q_emb, 'cpu'):
    q_emb = q_emb.cpu().numpy()
    d_emb = d_emb.cpu().numpy()

sims = q_emb @ d_emb.T
print('Query-document similarity (should have a high diagonal):')
for i, q in enumerate(queries):
    best = np.argmax(sims[i])
    print(f'Q{i+1} best match: D{best+1} (score: {sims[i][best]:.2f})')

print('\n=== WITH THRESHOLD = 0.05 ===')
q_sparse = q_emb.copy()
d_sparse = d_emb.copy()
q_sparse[q_sparse < 0.05] = 0
d_sparse[d_sparse < 0.05] = 0

q_active = np.mean([np.count_nonzero(q_sparse[i]) for i in range(len(queries))])
d_active = np.mean([np.count_nonzero(d_sparse[i]) for i in range(len(documents))])
print(f'Query active dims: {q_active:.0f} / 51200 ({100 * q_active / 51200:.1f}%)')
print(f'Doc active dims:   {d_active:.0f} / 51200 ({100 * d_active / 51200:.1f}%)')

sims_sparse = q_sparse @ d_sparse.T
print('Similarity with threshold (rankings should be unchanged):')
for i, q in enumerate(queries):
    best = np.argmax(sims_sparse[i])
    print(f'Q{i+1} best match: D{best+1} (score: {sims_sparse[i][best]:.2f})')
```

**Expected output:** Queries should correctly match their corresponding documents (Q1→D1, Q2→D2, Q3→D3) both with and without the threshold, demonstrating that the model works correctly.

### Token Expansion Analysis

See which tokens get high weights in the embeddings:

```python
from sentence_transformers import SparseEncoder

model = SparseEncoder('thivy/norbert4-base-splade-retrieval')

queries = [
    'Hva er hovedstaden i Norge?',
    'Hvem vant fotball-VM i 2022?',
]

embeddings = model.encode(queries)
decoded = model.decode(embeddings, top_k=15)

for d, q in zip(decoded, queries):
    print(f'Query: {q}')
    tokens = ', '.join([f'{tok}({score:.2f})' for tok, score in d])
    print(f'Top tokens: {tokens}\n')
```

This shows the top-weighted tokens for each query, demonstrating the learned term expansion.
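### Storing Thresholded Vectors in a Sparse Index

Because only a few hundred dimensions survive the threshold, document vectors map naturally onto an inverted index or an ordinary sparse-matrix format. A minimal sketch using SciPy's CSR format (illustrative only: `scipy` is an extra dependency, `encode_sparse` is a hypothetical helper, and any sparse search engine that accepts dimension → weight pairs would work similarly):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sentence_transformers import SparseEncoder

model = SparseEncoder('thivy/norbert4-base-splade-retrieval')

corpus = [
    'Oslo er hovedstaden i Norge.',
    'Python er et programmeringsspråk.',
    'Maskinlæring brukes i mange applikasjoner.',
]

def encode_sparse(texts, threshold=0.05):
    """Hypothetical helper: encode, move to numpy, zero out small weights."""
    emb = model.encode(texts, convert_to_sparse_tensor=False)
    if hasattr(emb, 'cpu'):
        emb = emb.cpu().numpy()
    emb[emb < threshold] = 0
    return emb

# CSR stores only the non-zero (dimension, weight) pairs per document,
# so storage scales with active dimensions rather than the 51,200 vocabulary.
index = csr_matrix(encode_sparse(corpus))
print(f'Stored entries: {index.nnz} of {index.shape[0] * index.shape[1]}')

# Query-time scoring is a sparse matrix-vector product (dot-product similarity)
scores = index @ encode_sparse(['Hva er Python?'])[0]
print('Best match:', corpus[int(np.argmax(scores))])
```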
## Training Details

### Training Configuration

- **Epochs:** 1
- **Total Steps:** 10,423
- **Batch Size:** 16 per device (32 total across 2 GPUs)
- **Learning Rate:** 2e-5
- **Warmup Ratio:** 0.1
- **Precision:** bfloat16
- **Regularization:**
  - Document: 0.003
  - Query: 0.0001

### Training Datasets

Retrieval-only datasets (query → document pairs):

- **DDSC** - Nordic Embedding Training Data (~182K pairs, retrieval task only, NO/DA/SV)
- **ETI** - Elektronisk Tjenesteinformasjon (~54K pairs, health/welfare domain, NO)
- **NorQuAD** - Norwegian Question Answering (~3.8K pairs, NO)
- **ScandiQA** - Scandinavian Question Answering (~20K pairs, NO/DA/SV)
- **Supervised-DA** - Danish supervised retrieval pairs (~93K pairs, DA)

**Total:** ~333K query-document pairs across Norwegian, Danish, and Swedish.

### Hardware

- **GPUs:** 2x NVIDIA H100
- **Training Time:** ~9 hours
- **Framework:** PyTorch with DDP (Distributed Data Parallel)

## Model Architecture

```
SparseEncoder(
  (0): MLMTransformer (NorBERT4-base with MLM head)
  (1): SpladePooling (max pooling over log(1 + ReLU(x)))
)
```

## Intended Use

**Primary Use:** Norwegian and Scandinavian language information retrieval, semantic search, and document ranking.

**Ideal For:**

- Search engines for Norwegian content
- Question answering systems
- Document retrieval
- Academic and legal document search

**Not Recommended For:**

- Sentence similarity (use dense models instead)
- Classification tasks
- Very short text comparisons

## Limitations

- Requires more storage than dense models (sparse vectors)
- Specialized for retrieval (query → document); not tuned for other tasks
- Performance may vary on non-Norwegian languages
- Requires specialized sparse search infrastructure

## Citation

If you use this model, please cite:

```bibtex
@misc{norbert4-splade-retrieval,
  author = {Thivyesh},
  title = {NorBERT4 SPLADE Retrieval-Only},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/thivy/norbert4-base-splade-retrieval}
}
```

## License

MIT License

## Acknowledgements

- Base model: [ltg/norbert4-base](https://huggingface.co/ltg/norbert4-base) by the Language Technology Group, University of Oslo
- Framework: [Sentence Transformers](https://www.sbert.net/)
- SPLADE architecture based on [Formal et al., 2021](https://arxiv.org/abs/2107.05720)