🩵 SSE: Stable Static Embedding for Retrieval MRL 🩵
A lightweight, fast, and powerful embedding model
Performance Snapshot
Our SSE model achieves NDCG@10 = 0.5124 on NanoBEIR, slightly outperforming the popular static-retrieval-mrl-en-v1 (0.5032) while using half the dimensions (512 vs 1024)! 💫 Plus, retrieval is roughly 2× faster thanks to the compact 512D embeddings, with the lightweight Separable Dynamic Tanh keeping encoding cheap.
| Model | NanoBEIR NDCG@10 | Dimensions | Parameters | Speed Advantage | License |
|---|---|---|---|---|---|
| SSE Retrieval MRL | 0.5124 ✨ | 512 | ~16M 🪽 | ~2x faster retrieval (ultra-efficient!) | Apache 2.0 |
| static-retrieval-mrl-en-v1 | 0.5032 | 1024 | ~33M | baseline | Apache 2.0 |
🩵 Why Choose SSE Retrieval MRL? 🩵
✅ Higher NDCG@10 than all comparable small models (<35M params)
✅ Only ~16M parameters — 27% smaller than MiniLM-L6 (22M) and 52% smaller than BGE-small (33M)
✅ 512D native output: half the footprint of static-retrieval-mrl-en-v1's 1024D embeddings at higher retrieval quality
✅ Matryoshka-ready: smoothly truncate to 256D/128D/64D/32D with graceful degradation (see the truncation sketch after this list)
✅ Apache 2.0 licensed — free for commercial & personal use
✅ CPU-optimized — runs beautifully on edge devices & modest hardware
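As a minimal sketch of how Matryoshka truncation works in practice (assuming the usual MRL convention that the leading dimensions carry the coarse-grained representation), you can slice the 512D embeddings and re-normalize them; recent sentence-transformers releases also accept a `truncate_dim` argument at load time.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer(
    "RikkaBotan/stable-static-embedding-fast-retrieval-mrl-en",
    trust_remote_code=True,
)

# Full 512D embeddings, unit-normalized.
emb = model.encode(
    ["Stable Static Embedding is fast.", "SSE works without attention."],
    normalize_embeddings=True,
)

# Matryoshka truncation: keep the first 128 dimensions and re-normalize
# so cosine similarities remain comparable at the reduced size.
emb_128 = emb[:, :128]
emb_128 = emb_128 / np.linalg.norm(emb_128, axis=1, keepdims=True)
print(emb.shape, emb_128.shape)  # (2, 512) (2, 128)
```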
🩵 Model Details 🩵
| Property | Value |
|---|---|
| Model Type | Sentence Transformer (SSE architecture) |
| Max Sequence Length | Unlimited (static mean pooling, no attention window) |
| Output Dimension | 512 (with Matryoshka truncation down to 32D!) |
| Similarity Function | Cosine Similarity |
| Language | English |
| License | Apache 2.0 |
```
SentenceTransformer(
  (0): SSE(
    (embedding): EmbeddingBag(30522, 512, mode='mean')
    (dyt): SeparableDyT()
  )
)
```
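For intuition, here is a rough PyTorch sketch of what this printed architecture suggests: token IDs are mean-pooled through an `EmbeddingBag` and then passed through a per-dimension Dynamic Tanh. The internals of `SeparableDyT` are not published in this card, so the parameterization below (per-dimension learnable `alpha`, `gamma`, `beta`) is an assumption, not the actual implementation.

```python
import torch
import torch.nn as nn


class SeparableDyT(nn.Module):
    """Assumed per-dimension Dynamic Tanh: gamma * tanh(alpha * x) + beta."""

    def __init__(self, dim: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(dim))
        self.gamma = nn.Parameter(torch.ones(dim))
        self.beta = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * torch.tanh(self.alpha * x) + self.beta


class SSESketch(nn.Module):
    """Illustrative forward pass: mean-pooled token embeddings followed by DyT."""

    def __init__(self, vocab_size: int = 30522, dim: int = 512):
        super().__init__()
        self.embedding = nn.EmbeddingBag(vocab_size, dim, mode="mean")
        self.dyt = SeparableDyT(dim)

    def forward(self, input_ids: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
        return self.dyt(self.embedding(input_ids, offsets))


# Two toy "sentences" packed as one flat id list with start offsets.
ids = torch.tensor([101, 2023, 2003, 102, 101, 2178, 102])
offsets = torch.tensor([0, 4])
print(SSESketch()(ids, offsets).shape)  # torch.Size([2, 512])
```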
🩵 Mathematical formulations 🩵
Dynamic Tanh Normalization (DyT) enables magnitude-adaptive gradient flow for static embeddings. For an input dimension $x$, DyT computes

$$\mathrm{DyT}(x) = \gamma \tanh(\alpha x) + \beta$$

with learnable parameters $\alpha$, $\gamma$, and $\beta$. The gradient with respect to $x$ is

$$\frac{\partial\,\mathrm{DyT}(x)}{\partial x} = \gamma\alpha\left(1 - \tanh^2(\alpha x)\right) = \gamma\alpha\,\mathrm{sech}^2(\alpha x).$$

For saturated dimensions ($|\alpha x| \gg 1$), $\mathrm{sech}^2(\alpha x) \approx 4e^{-2|\alpha x|}$, so the gradient decays exponentially and learning signals from noisy, large-magnitude dimensions are suppressed. For non-saturated dimensions ($|\alpha x| \ll 1$), the gradient stays close to the constant $\gamma\alpha$, preserving full gradient flow for stable, informative dimensions. This magnitude-dependent gating provides implicit regularization that improves generalization without extra hyperparameters.
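As a small numerical check of this gating behavior (illustrative only; $\alpha$ and $\gamma$ are fixed to 1 here), the gradient through `tanh` is close to 1 for small inputs and collapses exponentially once the input saturates:

```python
import torch

# d/dx tanh(x) = 1 - tanh(x)^2; with alpha = gamma = 1 this is exactly the DyT gradient.
for value in [0.1, 1.0, 3.0, 6.0]:
    x = torch.tensor(value, requires_grad=True)
    torch.tanh(x).backward()
    print(f"x = {value:>3}: dDyT/dx = {x.grad.item():.6f}")
# Small |x| keeps the gradient near 1; large |x| suppresses it almost entirely.
```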
🩵 Evaluation Results (NanoBEIR) 🩵
| Dataset | NDCG@10 | MRR@10 | MAP@100 |
|---|---|---|---|
| NanoBEIR Mean | 0.5124 ✨ | 0.5640 | 0.4317 |
| NanoClimateFEVER | 0.2998 | 0.3611 | 0.2344 |
| NanoDBPedia | 0.5493 | 0.7492 | 0.4247 |
| NanoFEVER | 0.6808 | 0.6318 | 0.6105 |
| NanoFiQA2018 | 0.3744 | 0.4197 | 0.3162 |
| NanoHotpotQA | 0.7021 | 0.7679 | 0.6273 |
| NanoMSMARCO | 0.4132 | 0.3537 | 0.3733 |
| NanoNFCorpus | 0.2982 | 0.4889 | 0.1091 |
| NanoNQ | 0.4652 | 0.3992 | 0.4028 |
| NanoQuoraRetrieval | 0.9094 ✨ | 0.9122 | 0.8847 |
| NanoSCIDOCS | 0.3381 | 0.5509 | 0.2604 |
| NanoArguAna | 0.4105 | 0.3193 | 0.3325 |
| NanoSciFact | 0.6176 | 0.5933 | 0.5824 |
| NanoTouche2020 | 0.6029 | 0.7852 | 0.4539 |
Top performance on community-based retrieval (Quora) and scientific fact verification!
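To reproduce numbers like these, recent sentence-transformers releases ship a `NanoBEIREvaluator`; the sketch below assumes that class is available in your installed version.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import NanoBEIREvaluator

model = SentenceTransformer(
    "RikkaBotan/stable-static-embedding-fast-retrieval-mrl-en",
    trust_remote_code=True,
)

# Evaluates on every NanoBEIR subset by default; pass dataset_names=[...]
# to restrict the run (check your installed version's signature).
evaluator = NanoBEIREvaluator()
results = evaluator(model)
for name, score in results.items():
    if "ndcg@10" in name:
        print(f"{name}: {score:.4f}")
```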
🩵 How to use? 🩵
```python
import torch
from sentence_transformers import SentenceTransformer

# load (remote code enabled)
model = SentenceTransformer(
    "RikkaBotan/stable-static-embedding-fast-retrieval-mrl-en",
    trust_remote_code=True,
    device="cuda" if torch.cuda.is_available() else "cpu",
)

# inference
sentences = [
    "Stable Static embedding is interesting.",
    "SSE works without attention.",
]
with torch.no_grad():
    embeddings = model.encode(
        sentences,
        convert_to_tensor=True,
        normalize_embeddings=True,
        batch_size=32,
    )

# cosine similarity
# cosine_sim = embeddings[0] @ embeddings[1].T
cosine_sim = model.similarity(embeddings, embeddings)
print("embeddings shape:", embeddings.shape)
print("cosine similarity matrix:")
print(cosine_sim)
```
🩵 Retrieval usage 🩵
```python
import torch
from sentence_transformers import SentenceTransformer

# load (remote code enabled)
model = SentenceTransformer(
    "RikkaBotan/stable-static-embedding-fast-retrieval-mrl-en",
    trust_remote_code=True,
    device="cuda" if torch.cuda.is_available() else "cpu",
)

# inference
query = "What is Stable Static Embedding?"
sentences = [
    "SSE: Stable Static embedding works without attention.",
    "Stable Static Embedding is a fast embedding method designed for retrieval tasks.",
    "Static embeddings are often compared with transformer-based sentence encoders.",
    "I cooked pasta last night while listening to jazz music.",
    "Large language models are commonly trained using next-token prediction objectives.",
    "Instruction tuning improves the ability of LLMs to follow human-written prompts.",
]
with torch.no_grad():
    embeddings = model.encode(
        [query] + sentences,
        convert_to_tensor=True,
        normalize_embeddings=True,
        batch_size=32,
    )
print("embeddings shape:", embeddings.shape)

# cosine similarity between the query and each candidate sentence
similarities = model.similarity(embeddings[0], embeddings[1:])
for i, similarity in enumerate(similarities[0].tolist()):
    print(f"{similarity:.05f}: {sentences[i]}")
```
🩵 Training Hyperparameters 🩵
Non-Default Hyperparameters
- eval_strategy: steps
- per_device_train_batch_size: 512
- gradient_accumulation_steps: 8
- learning_rate: 0.1
- adam_beta2: 0.9999
- adam_epsilon: 1e-10
- num_train_epochs: 1
- lr_scheduler_type: cosine
- warmup_ratio: 0.1
- bf16: True
- dataloader_num_workers: 4
- batch_sampler: no_duplicates
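For reference, these settings map directly onto `SentenceTransformerTrainingArguments` from sentence-transformers; the sketch below only reconstructs that mapping, and the output directory is a placeholder.

```python
from sentence_transformers.training_args import (
    BatchSamplers,
    SentenceTransformerTrainingArguments,
)

# Assumed reconstruction of the non-default hyperparameters listed above.
args = SentenceTransformerTrainingArguments(
    output_dir="sse-retrieval-mrl",  # placeholder path
    eval_strategy="steps",
    per_device_train_batch_size=512,
    gradient_accumulation_steps=8,
    learning_rate=0.1,
    adam_beta2=0.9999,
    adam_epsilon=1e-10,
    num_train_epochs=1,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    bf16=True,
    dataloader_num_workers=4,
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)
```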
🩵 Training Datasets 🩵
We trained on the following datasets:
| Dataset | Special Flavor |
|---|---|
| squad | Q&A pairs with gentle context |
| trivia_qa | Fun facts & brain teasers |
| allnli | Logical reasoning with care |
| pubmedqa | Medical wisdom |
| hotpotqa | Multi-hop reasoning adventures |
| miracl | Cross-lingual curiosity |
| mr_tydi | Global question answering |
| msmarco | Real search queries |
| msmarco_10m | Massive-scale search love |
| msmarco_hard | Tricky negatives for growth |
| mldr | Long-document cuddles |
| s2orc | Scientific paper whispers |
| swim_ir | Information retrieval elegance |
| paq | 64M+ question-answer pairs |
| nq | Natural questions with heart |
| scidocs | Scientific document friendships |
All trained with MatryoshkaLoss — learning representations at multiple scales like Russian nesting dolls!
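As an illustration of that setup (the exact loss configuration for this model is not published here, so the inner loss and dimensions below are assumptions based on this card), MatryoshkaLoss in sentence-transformers wraps an inner loss and applies it at several embedding sizes:

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer(
    "RikkaBotan/stable-static-embedding-fast-retrieval-mrl-en",
    trust_remote_code=True,
)

# Inner loss: in-batch negatives ranking. Outer loss: apply it at every
# Matryoshka dimension so truncated embeddings remain useful.
inner_loss = MultipleNegativesRankingLoss(model)
loss = MatryoshkaLoss(model, inner_loss, matryoshka_dims=[512, 256, 128, 64, 32])
```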
🩵 Training results 🩵
🩵 About me 🩵
A Japanese independent researcher with a shy and pampered personality. Twin-tail hair is my charm point. Interested in NLP. I usually use Python and C.
X(Twitter): https://twitter.com/peony__snow
🩵 Acknowledgements 🩵
The author acknowledges the support of Saldra, Witness, and Lumina Logic Minds for providing the computational resources used in this work.
I thank the developers of sentence-transformers, Python, and PyTorch.
I thank all the researchers for their efforts to date.
I thank Japan's high standard of education.
And most of all, thank you for your interest in this repository.
🩵 Citation 🩵
BibTeX
Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```
MatryoshkaLoss
```bibtex
@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```
MultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```