nepali-embedder-v1
Ranked first among 12 open embedding models in the benchmark below, which includes bge-m3, qwen3-embedding, snowflake-arctic-embed2, nomic-embed-text-v2-moe, and the existing Nepali-specific models.
Built natively for Nepali language retrieval, fine-tuned on 56k Nepali Wikipedia pairs using google/muril-base-cased as the base encoder.
Benchmark Results
Evaluated on standard Nepali semantic retrieval and four Nepali-specific stress tests.
Gap = Match Score − Unrelated Score (higher = better discrimination).
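As a rough illustration of the gap metric defined above, the sketch below computes it from cosine similarities. The helper names `cosine` and `gap` and the toy 3-dimensional vectors are assumptions made for illustration; the benchmark script and test sentences themselves are not part of this card.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def gap(query, match, unrelated):
    """Gap = similarity(query, match) - similarity(query, unrelated)."""
    return cosine(query, match) - cosine(query, unrelated)

# Toy vectors standing in for real embeddings (hypothetical values).
query = [1.0, 0.0, 0.0]
match = [1.0, 1.0, 0.0]      # 45 degrees away from the query
unrelated = [0.0, 0.0, 1.0]  # orthogonal to the query

print(round(gap(query, match, unrelated), 4))  # 0.7071
```

A larger gap means the model places the matching text much closer to the query than the unrelated text, which is what the Gap column in the tables reports.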
Standard Retrieval Gap ↑
| Rank | Model | Gap | Params |
|---|---|---|---|
| 🥇 | nepali-embedder-v1 (this model) | 0.4277 | 238M |
| 🥈 | jangedoo/all-MiniLM-L6-v2-nepali | 0.3382 | 66M |
| 🥉 | universalml/Nepali_Embedding_Model | 0.2784 | 560M |
| 4 | Yunika/sentence-transformer-nepali | 0.2581 | 238M |
| 5 | qwen3-embedding:0.6b | 0.2186 | 600M |
| 6 | bge-m3 | 0.2092 | 567M |
| 7 | embeddinggemma | 0.1924 | 300M |
| 8 | nomic-embed-text-v2-moe | 0.1834 | MoE |
| 9 | paraphrase-multilingual | 0.1779 | 278M |
| 10 | snowflake-arctic-embed2 | 0.1509 | 568M |
| 11 | granite-embedding:278m | 0.1437 | 278M |
| 12 | mxbai-embed-large | 0.0560 | 335M |
Nepali-Specific Stress Tests ↑
| Category | nepali-v1 | bge-m3 | qwen3-0.6b | Yunika | universalml |
|---|---|---|---|---|---|
| Code-Switching (Roman↔Devanagari) | 0.490 | 0.101 | 0.333 | 0.263 | 0.289 |
| Entity Sensitivity | 0.605 | 0.240 | 0.275 | 0.256 | 0.239 |
| Length Robustness | 0.674 | 0.136 | 0.239 | 0.313 | 0.260 |
| Negation | -0.083 | -0.022 | -0.190 | -0.159 | -0.093 |
Key findings: This model scores highest on Romanized Nepali ↔ Devanagari code-switching with a delta of 0.490; all other models score below 0.35 on this task. Entity discrimination (0.605) and long-document robustness (0.674) are also the highest across all 12 models tested. Negation is a known limitation shared by Nepali and multilingual embedding models: every negation score in the table is negative, including this model's (-0.083).
Usage
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("premmm/nepali-embedder-v1")

# Single sentence
embedding = model.encode("नेपालको राजधानी काठमाडौं हो।", normalize_embeddings=True)

# Semantic similarity
sentences = [
    "नेपालको राजधानी काठमाडौं हो।",
    "काठमाडौं नेपालको सबैभन्दा ठूलो शहर हो।",
]
embeddings = model.encode(sentences, normalize_embeddings=True)
similarity = util.cos_sim(embeddings[0], embeddings[1])  # cosine similarity between the two sentences

# Retrieval (query vs passages)
query = "नेपालको राजधानी कहाँ छ?"
passages = [
    "काठमाडौं नेपालको राजधानी तथा सबैभन्दा ठूलो शहर हो।",
    "पोखरा नेपालको दोस्रो ठूलो शहर हो।",
    "लुम्बिनी गौतम बुद्धको जन्मस्थल हो।",
]
q_emb = model.encode(query, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)
scores = util.cos_sim(q_emb, p_emb)
print(scores)  # tensor([[0.7139, 0.4821, 0.3102]])
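Because the calls above pass `normalize_embeddings=True`, the returned vectors have unit length, so a plain dot product gives the same value as cosine similarity. The sketch below ranks passages that way. It uses hypothetical 2-dimensional vectors rather than real model output, and `normalize` and `top_k` are illustrative helpers, not part of sentence-transformers.

```python
import math

def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def top_k(query_vec, passage_vecs, k=2):
    """Return (index, score) pairs for the k highest dot products."""
    scores = [sum(q * p for q, p in zip(query_vec, vec)) for vec in passage_vecs]
    ranked = sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Hypothetical stand-ins for encoded query and passages.
query_vec = normalize([3.0, 4.0])
passage_vecs = [normalize(v) for v in ([4.0, 3.0], [3.0, 4.0], [-4.0, 3.0])]

results = top_k(query_vec, passage_vecs, k=2)
for index, score in results:
    print(index, round(score, 2))
```

With real embeddings, the indices returned by `top_k` would map back to positions in the `passages` list.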
Use with LangChain / RAG pipelines
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="premmm/nepali-embedder-v1",
    encode_kwargs={"normalize_embeddings": True},
)
Use with Ollama (self-hosted)
The model can be converted and served locally. For production inference, see the sentence-transformers documentation on ONNX export.
Model Details
| Property | Value |
|---|---|
| Base model | google/muril-base-cased |
| Architecture | BERT (transformer encoder + mean pooling) |
| Parameters | 238M |
| Embedding dimension | 768 |
| Max sequence length | 256 tokens |
| Language | Nepali (ne) |
| License | Apache 2.0 |
Training Details
Data
- Source: Nepali Wikipedia via `wikimedia/wikipedia` (config: `20231101.ne`)
- Total pairs: 56,244 (after deduplication)
  - 27,083 title ↔ intro paragraph pairs
  - 29,175 section heading ↔ section body pairs
- Pair construction: Positive pairs only; in-batch negatives used during training
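The pair construction described above could be sketched roughly as follows. The article layout used here (a dict with `title`, `intro`, and `sections` keys) is a hypothetical intermediate format; the card does not say how the raw Wikipedia text was split into intros and sections, so `build_pairs` is an assumption rather than the actual preprocessing code.

```python
def build_pairs(articles):
    """Build deduplicated (anchor, positive) pairs from parsed articles."""
    seen = set()
    pairs = []
    for article in articles:
        # One title <-> intro pair, plus one pair per (heading, body) section.
        candidates = [(article["title"], article["intro"])]
        candidates += list(article["sections"])
        for anchor, positive in candidates:
            anchor, positive = anchor.strip(), positive.strip()
            if not anchor or not positive:
                continue  # skip empty headings or bodies
            key = (anchor, positive)
            if key in seen:
                continue  # drop exact duplicates
            seen.add(key)
            pairs.append({"anchor": anchor, "positive": positive})
    return pairs

# Hypothetical parsed articles; the second section is an exact duplicate.
articles = [
    {
        "title": "काठमाडौं",
        "intro": "काठमाडौं नेपालको राजधानी हो।",
        "sections": [
            ("इतिहास", "काठमाडौं उपत्यकाको इतिहास पुरानो छ।"),
            ("इतिहास", "काठमाडौं उपत्यकाको इतिहास पुरानो छ।"),
        ],
    },
    {"title": "पोखरा", "intro": "पोखरा नेपालको दोस्रो ठूलो शहर हो।", "sections": []},
]

pairs = build_pairs(articles)
print(len(pairs))  # 3
```

No explicit negatives are stored: with positive pairs only, the other passages in each training batch serve as negatives.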
Training Configuration
- Loss: `MultipleNegativesRankingLoss` (in-batch negatives)
- Epochs: 3
- Batch size: 16
- Warmup steps: 10% of total steps
- Optimizer: AdamW (sentence-transformers default)
- Hardware: NVIDIA T4 (Google Colab)
- Training time: ~2.75 hours
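To make the in-batch negatives idea concrete, here is a minimal pure-Python sketch of the objective behind a multiple-negatives ranking loss: for each anchor, every positive in the batch is scored, and cross-entropy pushes the anchor's own positive to the top. This is a simplified re-derivation, not the sentence-transformers implementation. The `scale=20.0` value is what I understand to be the library default; the card does not state which value was used.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the target for anchors[i]
    and all other positives in the batch act as negatives."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy batch of two pairs (hypothetical vectors).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]  # each anchor matches its own positive
swapped = [[0.0, 1.0], [1.0, 0.0]]  # each anchor matches the wrong positive

print(mnr_loss(anchors, aligned) < mnr_loss(anchors, swapped))  # True
```

With a batch size of 16, each anchor is contrasted against 15 in-batch negatives per step.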
Evaluation (Internal)
Evaluated on a held-out set of 562 pairs + 500 distractor passages using InformationRetrievalEvaluator:
| Metric | Final Value |
|---|---|
| NDCG@10 | 0.9621 |
| MRR@10 | 0.9520 |
| Accuracy@1 | 0.9270 |
| Recall@10 | 0.9929 |
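For readers unfamiliar with these metrics, the sketch below shows how they can be computed when each query has exactly one relevant passage, which is an assumption that matches pair-style data but is not confirmed by the card. The reported numbers come from `InformationRetrievalEvaluator`; `ir_metrics` is a simplified, hypothetical helper, and the ranks are made up.

```python
import math

def ir_metrics(ranks, k=10):
    """ranks: 1-based rank of the relevant passage for each query,
    or None when it was not retrieved at all."""
    n = len(ranks)
    hits = [r for r in ranks if r is not None and r <= k]
    return {
        "accuracy@1": sum(1 for r in hits if r == 1) / n,
        f"recall@{k}": len(hits) / n,
        f"mrr@{k}": sum(1.0 / r for r in hits) / n,
        # With one relevant passage the ideal DCG is 1, so NDCG = 1 / log2(rank + 1).
        f"ndcg@{k}": sum(1.0 / math.log2(r + 1) for r in hits) / n,
    }

# Four hypothetical queries: two hits at rank 1, one at rank 2, one miss.
metrics = ir_metrics([1, 1, 2, None], k=10)
for name, value in metrics.items():
    print(name, round(value, 4))
```

Accuracy@1 counts only top-ranked hits, while MRR and NDCG also give partial credit to lower ranks, which is why they sit between Accuracy@1 and Recall@10 in the table above.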
Intended Use
- Nepali document retrieval — RAG pipelines for Nepali documents
- Semantic search — search over Nepali text corpora
- Sentence similarity — clustering and deduplication of Nepali text
- Legal document retrieval — court rulings, government documents (v2 will include domain fine-tuning)
- Cross-script retrieval — handles Romanized Nepali queries against Devanagari passages
Known Limitations
- Negation: Like all current Nepali embedding models, does not reliably distinguish negated statements (e.g., "X छ" vs "X छैन")
- Cross-lingual: English → Nepali retrieval works partially, but the model was not explicitly trained for it
- Domain: Trained on encyclopedic Wikipedia text; may underperform on highly technical or colloquial domains
- Vocabulary: Legal, medical, and scientific Nepali terminology is underrepresented
Roadmap
| Version | Planned Additions |
|---|---|
| v2 | Romanized Nepali ↔ Devanagari training pairs (code-switching) |
| v2 | Negation-aware hard negative pairs |
| v2 | Synthetic query augmentation (~2k LLM-generated triplets) |
| v2 | Legal domain fine-tuning (10k Nepali court ruling pairs) |
| v3 | Hard negative mining using v2 model |
| v3 | MatryoshkaLoss for variable-dimension embeddings |
Citation
If you use this model in your research or project, please cite:
@misc{pathak2026nepaliembedder,
  author       = {Premanand Pathak},
  title        = {nepali-embedder-v1: A Native Nepali Sentence Embedding Model},
  year         = {2026},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/premmm/nepali-embedder-v1}},
}
Acknowledgements
- Training data: Wikimedia Foundation / Nepali Wikipedia
- Base model: Google MuRIL
- Training framework: sentence-transformers
- Benchmark comparison models: Yunika, universalml, jangedoo, BAAI, Alibaba Qwen, Snowflake, Nomic, Google, IBM, Mixedbread