SentenceTransformer based on intfloat/multilingual-e5-base

This is a sentence-transformers model fine-tuned from intfloat/multilingual-e5-base. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: intfloat/multilingual-e5-base
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False, 'architecture': 'XLMRobertaModel'})
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
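
The Pooling and Normalize modules above can be read as a masked mean over token vectors followed by L2 normalization. The following is a minimal NumPy sketch of that computation, not the library's own code: the function name `mean_pool_and_normalize` and the random toy inputs are illustrative assumptions, standing in for the per-token outputs of the Transformer module.

```python
import numpy as np

def mean_pool_and_normalize(token_embeddings, attention_mask):
    """Masked mean over tokens, then L2 normalization.

    token_embeddings: (batch, seq_len, dim) float array
    attention_mask:   (batch, seq_len) array of 0/1 (0 marks padding)
    """
    mask = attention_mask[..., None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)       # padding contributes 0
    counts = np.clip(mask.sum(axis=1), 1e-9, None)       # real tokens per row
    pooled = summed / counts                             # mean over real tokens
    norms = np.linalg.norm(pooled, axis=1, keepdims=True)
    return pooled / np.clip(norms, 1e-12, None)          # unit-length rows

# Toy input (hypothetical): 2 sequences, 4 token positions, 768 dimensions.
# The second sequence has two padding positions.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(2, 4, 768)).astype(np.float32)
mask = np.array([[1, 1, 1, 1], [1, 1, 0, 0]])

sentence_embeddings = mean_pool_and_normalize(tokens, mask)
print(sentence_embeddings.shape)  # (2, 768)
```

Because the final step produces unit-length vectors, a plain dot product between two outputs equals their cosine similarity.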

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("mohanprakash462/tamil-embed-base")
# Run inference
sentences = [
    'ஒரு முதியவன் பாதாளங்களைத் தாண்டும் தன் மந்திரக்கோலால் சாய்த்தபடியிருக்கிறான் நாட்சத்திரங்களை............................................................................................................................................................................... இது எத்தனையாவது [...]',
    'தந்தைக்குக் கடினமான பரிசுகளைக் கொடுத்துக் கொண்டிருந்தார்.',
    'பிக்பாஸைப் பிடித்த போது எந்தப் படமும் நடக்கவில்லை.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.4205, 0.4317],
#         [0.4205, 1.0000, 0.3737],
#         [0.4317, 0.3737, 1.0000]])
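
Since the model was trained with MatryoshkaLoss at 768, 512, 256, and 128 dimensions (see Training Details), the leading dimensions of each embedding should remain usable on their own. The sketch below truncates and re-normalizes embeddings by hand. The helper `truncate_embeddings` is a hypothetical name, and the random array is only a stand-in for `model.encode(sentences)` so the example runs without downloading the model. No retrieval quality at the smaller sizes is reported in this card, so treat any size below 768 as something to evaluate first.

```python
import numpy as np

MATRYOSHKA_DIMS = (768, 512, 256, 128)  # dims listed in the training config

def truncate_embeddings(embeddings, dim):
    """Keep the first `dim` components of each row and re-normalize."""
    if dim not in MATRYOSHKA_DIMS:
        raise ValueError(f"dim must be one of {MATRYOSHKA_DIMS}")
    truncated = np.asarray(embeddings)[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

# Stand-in for `model.encode(sentences)`: random unit vectors (assumption).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(3, 768)).astype(np.float32)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

small = truncate_embeddings(embeddings, 256)
similarities = small @ small.T  # cosine similarity, since rows are unit length
print(small.shape)  # (3, 256)
```

The Sentence Transformers library also accepts a `truncate_dim` argument when constructing `SentenceTransformer`, which is likely the more convenient route in practice.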

Training Details

Training Dataset

Unnamed Dataset

  • Size: 92,081 training samples
  • Columns: anchor and positive
  • Approximate statistics based on the first 1000 samples:
    • anchor (type: string): min 15 tokens, mean 57.89 tokens, max 200 tokens
    • positive (type: string): min 4 tokens, mean 16.06 tokens, max 87 tokens
  • Samples:
    • anchor: Jack and Jill: A Village Story by Louisa May Alcott, is a children's book originally published in 1880.It takes place in a small New England town after the Civil War.The story of two good friends named Jack and Janey, "Jack and Jill" tells of the aftermath of a serious sliding accident.
      positive: ஜாக் மற்றும் ஜானி இரு நல்ல நண்பர்கள்.
    • anchor: SourceMedia ஒரு mid-size diversified business-to-business digital media company owned by Observer Capital, which acquired the company from Investcorp in August 2014.Thomson Corporation's former Thomson Media division, SourceMedia விழுந்து, Thomson 2004 இல் Investcorp க்கு விற்கப்பட்டது $ 350 மில்லியன்.
      positive: SourceMedia ஒரு Digital Media நிறுவனம்
    • anchor: ஒரு முதியவன் பாதாளங்களைத் தாண்டும் தன் மந்திரக்கோலால் சாய்த்தபடியிருக்கிறான் நாட்சத்திரங்களை............................................................................................................................................................................... இது எத்தனையாவது [...]
      positive: பல்வேறு மாநிலங்களில் அரசுக்கு எச்சரிக்கை
  • Loss: MatryoshkaLoss with these parameters:
    {
        "loss": "MultipleNegativesRankingLoss",
        "matryoshka_dims": [
            768,
            512,
            256,
            128
        ],
        "matryoshka_weights": [
            1,
            1,
            1,
            1
        ],
        "n_dims_per_step": -1
    }
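
To make the configuration above concrete, here is a rough NumPy sketch of how such a loss can be computed. It is not the library's implementation. MultipleNegativesRankingLoss is modeled as cross-entropy over scaled cosine scores, with every other positive in the batch acting as a negative, and the Matryoshka wrapper as a weighted sum of that loss over truncated embeddings. The `scale=20.0` value is an assumption, since the configuration above does not list one, and the function names are illustrative.

```python
import numpy as np

def mnr_loss(anchors, positives, scale=20.0):
    """In-batch-negatives ranking loss on cosine scores.

    Row i of `positives` is the match for row i of `anchors`; all other
    rows serve as negatives. `scale` is an assumed value.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    scores = scale * (a @ p.T)                           # (batch, batch)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))           # correct pair on the diagonal

def matryoshka_loss(anchors, positives,
                    dims=(768, 512, 256, 128), weights=(1, 1, 1, 1)):
    """Weighted sum of the base loss over the first `d` dimensions."""
    return sum(w * mnr_loss(anchors[:, :d], positives[:, :d])
               for d, w in zip(dims, weights))

# Toy batch (hypothetical): positives are noisy copies of their anchors.
rng = np.random.default_rng(0)
anchors = rng.normal(size=(8, 768))
positives = anchors + 0.1 * rng.normal(size=(8, 768))

matched = matryoshka_loss(anchors, positives)
mismatched = matryoshka_loss(anchors, np.roll(positives, 1, axis=0))
print(matched < mismatched)  # True
```

With four equally weighted dimensions, the reported value is the sum of four per-dimension losses rather than their average, which is worth keeping in mind when reading the magnitudes in the training logs.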
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 64
  • learning_rate: 1e-06
  • warmup_steps: 144
  • fp16: True
  • gradient_checkpointing: True
  • batch_sampler: no_duplicates
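
As a worked example of how these settings interact, the sketch below derives the step counts and the learning rate at a given step. It assumes a single device, the `gradient_accumulation_steps: 1` and `lr_scheduler_type: linear` values from the full hyperparameter list, and the usual linear warmup followed by linear decay to zero. The resulting 1,439 steps per epoch agrees with the epoch fractions in the training logs (for example, step 25 at epoch 0.0174). The function name `linear_schedule_lr` is illustrative.

```python
import math

TRAIN_SAMPLES = 92_081
BATCH_SIZE = 64        # per_device_train_batch_size, single device assumed
EPOCHS = 3             # num_train_epochs
WARMUP_STEPS = 144
PEAK_LR = 1e-6

# The last partial batch is kept (dataloader_drop_last: False), hence ceil.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)  # 1439
total_steps = steps_per_epoch * EPOCHS                   # 4317

def linear_schedule_lr(step):
    """Learning rate at `step`: linear warmup, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    remaining = max(0, total_steps - step)
    return PEAK_LR * remaining / (total_steps - WARMUP_STEPS)

for step in (0, 72, 144, 2000, total_steps):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the warmup covers about 3% of training, and the peak rate of 1e-06 is reached at step 144.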

All Hyperparameters

  • per_device_train_batch_size: 64
  • num_train_epochs: 3
  • max_steps: -1
  • learning_rate: 1e-06
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: None
  • warmup_steps: 144
  • optim: adamw_torch_fused
  • optim_args: None
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • optim_target_modules: None
  • gradient_accumulation_steps: 1
  • average_tokens_across_devices: True
  • max_grad_norm: 1.0
  • label_smoothing_factor: 0.0
  • bf16: False
  • fp16: True
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • gradient_checkpointing: True
  • gradient_checkpointing_kwargs: None
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • use_liger_kernel: False
  • liger_kernel_config: None
  • use_cache: False
  • neftune_noise_alpha: None
  • torch_empty_cache_steps: None
  • auto_find_batch_size: False
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • include_num_input_tokens_seen: no
  • log_level: passive
  • log_level_replica: warning
  • disable_tqdm: False
  • project: huggingface
  • trackio_space_id: trackio
  • eval_strategy: no
  • per_device_eval_batch_size: 8
  • prediction_loss_only: True
  • eval_on_start: False
  • eval_do_concat_batches: True
  • eval_use_gather_object: False
  • eval_accumulation_steps: None
  • include_for_metrics: []
  • batch_eval_metrics: False
  • save_only_model: False
  • save_on_each_node: False
  • enable_jit_checkpoint: False
  • push_to_hub: False
  • hub_private_repo: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_always_push: False
  • hub_revision: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • restore_callback_states_from_checkpoint: False
  • full_determinism: False
  • seed: 42
  • data_seed: None
  • use_cpu: False
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • dataloader_prefetch_factor: None
  • remove_unused_columns: True
  • label_names: None
  • train_sampling_strategy: random
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • ddp_backend: None
  • ddp_timeout: 1800
  • fsdp: []
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • deepspeed: None
  • debug: []
  • skip_memory_metrics: True
  • do_predict: False
  • resume_from_checkpoint: None
  • warmup_ratio: None
  • local_rank: -1
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {}
  • learning_rate_mapping: {}

Training Logs

Epoch Step Training Loss
0.0174 25 9.5049
0.0347 50 9.2988
0.0521 75 8.7502
0.0695 100 7.9748
0.0869 125 7.1927
0.1042 150 6.1935
0.1216 175 5.3092
0.1390 200 4.6630
0.1564 225 4.1481
0.1737 250 3.5569
0.1911 275 3.5474
0.2085 300 3.5098
0.2259 325 3.2235
0.2432 350 2.9600
0.2606 375 3.0261
0.2780 400 2.8874
0.2953 425 2.9094
0.3127 450 2.9079
0.3301 475 2.6196
0.3475 500 2.6887
0.3648 525 3.0199
0.3822 550 2.8014
0.3996 575 2.8743
0.4170 600 2.7243
0.4343 625 2.7829
0.4517 650 2.7898
0.4691 675 2.7561
0.4864 700 2.6587
0.5038 725 2.6228
0.5212 750 2.5352
0.5386 775 2.6544
0.5559 800 2.6122
0.5733 825 2.6155
0.5907 850 2.4361
0.6081 875 2.6018
0.6254 900 2.5225
0.6428 925 2.5303
0.6602 950 2.7318
0.6776 975 2.5735
0.6949 1000 2.5443
0.7123 1025 2.3904
0.7297 1050 2.4995
0.7470 1075 2.5640
0.7644 1100 2.6522
0.7818 1125 2.5466
0.7992 1150 2.4968
0.8165 1175 2.3753
0.8339 1200 2.4524
0.8513 1225 2.3839
0.8687 1250 2.6322
0.8860 1275 2.5143
0.9034 1300 2.6360
0.9208 1325 2.3736
0.9382 1350 3.3474
0.9555 1375 4.2932
0.9729 1400 3.8941
0.9903 1425 4.0057
1.0076 1450 3.2783
1.0250 1475 2.6051
1.0424 1500 2.8140
1.0598 1525 2.4573
1.0771 1550 2.5487
1.0945 1575 2.5347
1.1119 1600 2.3618
1.1293 1625 2.3501
1.1466 1650 2.4186
1.1640 1675 2.3757
1.1814 1700 2.6012
1.1987 1725 2.3281
1.2161 1750 2.4444
1.2335 1775 2.5461
1.2509 1800 2.5203
1.2682 1825 2.4201
1.2856 1850 2.6096
1.3030 1875 2.4021
1.3204 1900 2.4524
1.3377 1925 2.3002
1.3551 1950 2.4063
1.3725 1975 2.1237
1.3899 2000 2.3219
1.4072 2025 2.3227
1.4246 2050 2.3646
1.4420 2075 2.4407
1.4593 2100 2.2862
1.4767 2125 2.2900
1.4941 2150 2.2512
1.5115 2175 2.3741
1.5288 2200 2.6308
1.5462 2225 2.5161
1.5636 2250 2.4871
1.5810 2275 2.5049
1.5983 2300 2.6384
1.6157 2325 2.4185
1.6331 2350 2.4573
1.6505 2375 2.2954
1.6678 2400 2.2384
1.6852 2425 2.3318
1.7026 2450 2.2915
1.7199 2475 2.2013
1.7373 2500 2.4082
1.7547 2525 2.5290
1.7721 2550 2.4825
1.7894 2575 2.4610
1.8068 2600 2.3414
1.8242 2625 2.3729
1.8416 2650 2.5862
1.8589 2675 2.4320
1.8763 2700 2.2745
1.8937 2725 2.3046
1.9110 2750 2.3621
1.9284 2775 2.3097
1.9458 2800 4.1645
1.9632 2825 4.5466
1.9805 2850 4.6750
1.9979 2875 2.8955
2.0153 2900 2.9962
2.0327 2925 2.3366
2.0500 2950 2.2591
2.0674 2975 2.3375
2.0848 3000 2.4169
2.1022 3025 2.2635
2.1195 3050 2.1642
2.1369 3075 2.4082
2.1543 3100 2.3501
2.1716 3125 2.4870
2.1890 3150 2.7393
2.2064 3175 2.3203
2.2238 3200 2.2731
2.2411 3225 2.1901
2.2585 3250 2.3000
2.2759 3275 2.3846
2.2933 3300 2.2514
2.3106 3325 2.2218
2.3280 3350 2.5800
2.3454 3375 2.4384
2.3628 3400 2.4946
2.3801 3425 2.2781
2.3975 3450 2.2777
2.4149 3475 2.2062
2.4322 3500 2.3994
2.4496 3525 2.5084
2.4670 3550 2.1158
2.4844 3575 2.0865
2.5017 3600 2.3174
2.5191 3625 2.3668
2.5365 3650 2.3439
2.5539 3675 2.4482
2.5712 3700 2.3998
2.5886 3725 2.2155
2.6060 3750 2.0207
2.6233 3775 2.2652
2.6407 3800 2.4261
2.6581 3825 2.2214
2.6755 3850 2.2244
2.6928 3875 2.2835
2.7102 3900 2.4259
2.7276 3925 2.3013
2.7450 3950 2.1069
2.7623 3975 2.4415
2.7797 4000 2.3380
2.7971 4025 2.3013
2.8145 4050 2.4202
2.8318 4075 2.2488
2.8492 4100 2.1855
2.8666 4125 2.3882
2.8839 4150 2.5306
2.9013 4175 2.3197
2.9187 4200 2.3295
2.9361 4225 3.2070
2.9534 4250 3.9697
2.9708 4275 4.2241
2.9882 4300 3.5779

Framework Versions

  • Python: 3.12.12
  • Sentence Transformers: 5.2.3
  • Transformers: 5.3.0
  • PyTorch: 2.9.0+cu126
  • Accelerate: 1.12.0
  • Datasets: 4.0.0
  • Tokenizers: 0.22.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MatryoshkaLoss

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

Model size: 0.3B params (Safetensors), tensor type: F32