SentenceTransformer based on intfloat/multilingual-e5-base

This is a sentence-transformers model finetuned from intfloat/multilingual-e5-base. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: intfloat/multilingual-e5-base
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'XLMRobertaModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
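
In the Pooling module above only pooling_mode_mean_tokens is enabled, so a sentence embedding is the average of the token embeddings over non-padding positions. The following is a minimal NumPy sketch of that masked mean, using toy token vectors of width 4 instead of 768. The function name `mean_pool` and the toy inputs are illustrative assumptions, not part of the library:

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors over positions where attention_mask == 1."""
    mask = attention_mask[..., None].astype(token_embeddings.dtype)  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)                   # (batch, dim)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)                   # guard against empty rows
    return summed / counts

# Toy batch: 2 sequences, 3 token positions, 4 dimensions (the real model uses 768).
tokens = np.arange(24, dtype=np.float32).reshape(2, 3, 4)
mask = np.array([[1, 1, 0],    # last position is padding and is ignored
                 [1, 1, 1]])
pooled = mean_pool(tokens, mask)
print(pooled.shape)  # (2, 4)
```

Padding positions contribute nothing to the sum or the count, so sequences of different lengths in one batch produce comparable vectors.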

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("Funghang/e5-nepali-qa-ir")
# Run inference
queries = [
    "विवाहित व्यक्तिको हकमा साधारण राहदानीको लागि आवेदन कहाँ दिने ?",
]
documents = [
    'ताप्लेजुङ जिल्ला प्रशासन कार्यालयको सम्पर्क नम्बरहरू: नागरिकता फाँट : ०२४-४६०२७०, आर्थिक प्रशासन फाँट : ०२४-४६०५६६, स्थानीय प्रशासन फाँट : ०२४-४६०१९१ हुन् ।',
    'जिल्ला प्रशासन कार्यालय ओखलढुंगाको आधिकारिक इमेल अभिलेख प्रयोजनका लागिः avilekh.daookhaldhunga@gmail.com हो ।',
    'विवाहित व्यक्तिले आफ्नो स्थायी बसोबास भएको जिल्ला प्रशासन कार्यालय वा नागरिकता रहेको जिल्लास्थित जिल्ला प्रशासन कार्यालयमा आवश्यक कागजात सहित साधारण राहदानीको लागि निवेदन दिन सकिन्छ ।',
]
query_embeddings = model.encode_query(queries)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# (1, 768) (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[0.9078, 0.1416, 0.1064]])
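
For retrieval, each row of the similarity matrix is usually turned into a ranked list of documents. Here is a small sketch of that step, assuming the scores have been converted to a NumPy array (for example with `similarities.cpu().numpy()`). The helper `top_k` is hypothetical and the scores are made-up values, not output from this model:

```python
import numpy as np

def top_k(scores, k=2):
    """Return (document index, score) pairs of the k best documents per query."""
    order = np.argsort(-scores, axis=1)[:, :k]   # indices sorted by descending score
    return [[(int(j), float(row[j])) for j in idx]
            for row, idx in zip(scores, order)]

# Made-up scores for 1 query against 3 documents.
scores = np.array([[0.12, 0.85, 0.40]])
ranking = top_k(scores, k=2)
print(ranking)  # [[(1, 0.85), (2, 0.4)]]
```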

Training Details

Training Dataset

Unnamed Dataset

  • Size: 3,840 training samples
  • Columns: query and positive
  • Approximate statistics based on the first 1000 samples:
              query          positive
    type      string         string
    min       5 tokens       14 tokens
    mean      16.31 tokens   40.99 tokens
    max       65 tokens      338 tokens
  • Samples:
    query: राहदानी कार्यालय संखुवासभाको सम्पर्क नम्बर के हो ?
    positive: राहदानी कार्यालय संखुवासभाको सम्पर्क नम्बरहरू ०२९५६०१३४, ०२९५६०१३३, ०२९५६०५३३ हुन् ।
    query: राहदानीको लागि दर्ता केन्द्र कसरी छनोट गर्ने ?
    positive: राहदानी बनाउन दर्ता केन्द्र छनोट गर्दा तपाईंले अनलाइन फाराम भर्दा नै आफ्नो लागि उपयुक्त स्थान चयन गर्नुपर्छ । नेपालमा रहेका जिल्ला वा इलाका प्रशासन कार्यालय वा राहदानी विभाग दर्ता केन्द्रका रूपमा छनोट गर्न सकिन्छ । विदेशमा रहेका नागरिकहरूले भने सम्बन्धित दूतावास वा कन्सुलेटलाई दर्ता केन्द्रका रूपमा छनोट गर्नुपर्छ । साथै, फाराम भर्दा Appointment लिनु पनि अनिवार्य हुन्छ जसले तपाईंलाई सुविधाजनक मिति र समय दिन्छ । यसरी छनोट गरिएको दर्ता केन्द्रमा नै तपाईंले सम्पूर्ण प्रक्रिया पूरा गर्नुपर्छ ।
    query: म संग राहदानी विभाग काठमाडौंबाट जारी भएको पुरानो हस्तलिखित राहदानी छ भने के मैले पुन राहदानी विभागबाट एमआरपि को लागि आवेदन दिन जिल्लाको सिफारिस ल्याउनु पर्छ ?
    positive: यदि तपाईंसँग राहदानी विभाग काठमाडौंबाट जारी भएको पुरानो हस्तलिखित राहदानी छ भने पनि पुन राहदानी विभागबाट एमआरपि को लागि आवेदन दिन जिल्लाको सिफारिस ल्याउनु पर्छ ।
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim",
        "gather_across_devices": false
    }
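
With these parameters, each query in a batch is scored against every positive in the same batch by cosine similarity, the scores are multiplied by scale (20.0), and a cross-entropy loss pushes the matching positive (the diagonal) above the in-batch negatives. The NumPy sketch below runs that computation on random vectors. It illustrates the idea only and is not the library's implementation; `mnr_loss` and the random inputs are assumptions made for the example:

```python
import numpy as np

def mnr_loss(query_emb, positive_emb, scale=20.0):
    """Cross-entropy over scaled cosine similarities with in-batch negatives."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = positive_emb / np.linalg.norm(positive_emb, axis=1, keepdims=True)
    logits = scale * (q @ p.T)                             # (batch, batch) cosine scores
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))             # target for row i is column i

rng = np.random.default_rng(0)
queries = rng.normal(size=(8, 16))            # batch of 8 random "embeddings"
aligned = mnr_loss(queries, queries)          # every query paired with itself
shuffled = mnr_loss(queries, queries[::-1])   # pairs deliberately mismatched
print(aligned < shuffled)  # True
```

Because the other positives in a batch serve as negatives, the batch size also controls how many negatives each query sees per step.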
    

Evaluation Dataset

Unnamed Dataset

  • Size: 820 evaluation samples
  • Columns: query and positive
  • Approximate statistics based on the first 820 samples:
              query          positive
    type      string         string
    min       7 tokens       14 tokens
    mean      17.08 tokens   45.07 tokens
    max       53 tokens      175 tokens
  • Samples:
    query: के राहदानी बनाउन फोटोको आवश्यकता पर्छ ?
    positive: राहदानी बनाउन फोटोको आवश्यकता हुँदैन, किनभने फोटो खिच्ने काम कार्यालयमै हुन्छ ।
    query: रुकुम पश्चिमको राहदानी कार्यालयसँग सम्पर्क गर्न कुन नम्बर प्रयोग गर्ने ?
    positive: रुकुम पश्चिमको राहदानी कार्यालयका लागि ०८८५३००४०, ०८८५३००९०, ०८८५३००९० नम्बर प्रयोग गर्न सकिन्छ ।
    query: अफिसियल वा कूटनीतिक राहदानी भएकाले साधारण राहदानी बनाउँदा कुन विकल्प छान्ने ?
    positive: अफिसियल र कूटनीतिक राहदानी भएकाले साधारण राहदानी बनाउँदा आफ्नो पुरानो नम्बर बेवास्ता गरी उपयुक्त विकल्प छान्नु पर्छ।
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim",
        "gather_across_devices": false
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • learning_rate: 1e-05
  • num_train_epochs: 150
  • warmup_ratio: 0.1
  • fp16: True
  • load_best_model_at_end: True

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 8
  • per_device_eval_batch_size: 8
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 1e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 150
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.1
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: True
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch_fused
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • hub_revision: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • liger_kernel_config: None
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {}
  • learning_rate_mapping: {}

Training Logs

Epoch Step Training Loss Validation Loss
0.2083 100 0.9835 0.8339
0.4167 200 0.8657 0.6287
0.625 300 0.6003 0.3697
0.8333 400 0.3603 0.1957
1.0417 500 0.1578 0.1272
1.25 600 0.1 0.0974
1.4583 700 0.0619 0.0812
1.6667 800 0.0416 0.0751
1.875 900 0.0369 0.0714
2.0833 1000 0.0295 0.0676
2.2917 1100 0.0259 0.0641
2.5 1200 0.0168 0.0620
2.7083 1300 0.0514 0.0612
2.9167 1400 0.0294 0.0635
3.125 1500 0.0137 0.0605
3.3333 1600 0.013 0.0598
3.5417 1700 0.0193 0.0618
3.75 1800 0.0093 0.0600
3.9583 1900 0.0209 0.0623
4.1667 2000 0.0075 0.0654
4.375 2100 0.0152 0.0632
4.5833 2200 0.0205 0.0647
4.7917 2300 0.0062 0.0630
5.0 2400 0.0188 0.0616
5.2083 2500 0.0085 0.0596
5.4167 2600 0.0119 0.0605
5.625 2700 0.0087 0.0619
5.8333 2800 0.0115 0.0666
6.0417 2900 0.0203 0.0648
6.25 3000 0.0114 0.0644
6.4583 3100 0.0123 0.0650
6.6667 3200 0.0121 0.0609
6.875 3300 0.0044 0.0610
7.0833 3400 0.0061 0.0683
7.2917 3500 0.0151 0.0645
  • The bold row denotes the saved checkpoint.
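
Since load_best_model_at_end is True, the saved checkpoint is presumably the one with the best evaluation metric. The sketch below copies the Step and Validation Loss columns from the log above and finds the step with the lowest validation loss. Treating validation loss as the selection metric is an assumption:

```python
# (step, validation_loss) pairs copied from the training log above
log = [
    (100, 0.8339), (200, 0.6287), (300, 0.3697), (400, 0.1957), (500, 0.1272),
    (600, 0.0974), (700, 0.0812), (800, 0.0751), (900, 0.0714), (1000, 0.0676),
    (1100, 0.0641), (1200, 0.0620), (1300, 0.0612), (1400, 0.0635), (1500, 0.0605),
    (1600, 0.0598), (1700, 0.0618), (1800, 0.0600), (1900, 0.0623), (2000, 0.0654),
    (2100, 0.0632), (2200, 0.0647), (2300, 0.0630), (2400, 0.0616), (2500, 0.0596),
    (2600, 0.0605), (2700, 0.0619), (2800, 0.0666), (2900, 0.0648), (3000, 0.0644),
    (3100, 0.0650), (3200, 0.0609), (3300, 0.0610), (3400, 0.0683), (3500, 0.0645),
]

# Pick the row with the smallest validation loss.
best_step, best_loss = min(log, key=lambda row: row[1])
print(best_step, best_loss)  # 2500 0.0596
```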

Framework Versions

  • Python: 3.12.11
  • Sentence Transformers: 5.1.0
  • Transformers: 4.55.4
  • PyTorch: 2.8.0+cu126
  • Accelerate: 1.10.1
  • Datasets: 4.0.0
  • Tokenizers: 0.21.4

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Citation

@article{begha2026nepali,
  title={Nepali Passport Question Answering: A Low-Resource Dataset for Public Service Applications},
  author={Begha, Funghang Limbu and Acharya, Praveen and Bal, Bal Krishna},
  journal={arXiv preprint arXiv:2603.13320},
  year={2026}
}

E5 Model

@article{wang2024multilingual,
  title={Multilingual e5 text embeddings: A technical report},
  author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Yang, Linjun and Majumder, Rangan and Wei, Furu},
  journal={arXiv preprint arXiv:2402.05672},
  year={2024}
}