gte-multilingual-base-arabic-triplets

This is a sentence-transformers model fine-tuned from Alibaba-NLP/gte-multilingual-base on the silma-arabic-triplets-dataset-v1.0 dataset. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for retrieval.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: Alibaba-NLP/gte-multilingual-base
  • Maximum Sequence Length: 8192 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity
  • Supported Modality: Text
  • Training Dataset:
    • silma-arabic-triplets-dataset-v1.0
  • Language: ar
  • License: apache-2.0
  • Model Size: ~0.3B parameters (BF16 safetensors)

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'transformer_task': 'feature-extraction', 'modality_config': {'text': {'method': 'forward', 'method_output_name': 'last_hidden_state'}}, 'module_output_name': 'token_embeddings', 'architecture': 'NewModel'})
  (1): Pooling({'embedding_dimension': 768, 'pooling_mode': 'cls', 'include_prompt': True})
  (2): Normalize({})
)
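
The stack above is just a transformer forward pass followed by CLS-token pooling and L2 normalization. For illustration, here is a minimal sketch of the same pipeline using transformers directly (trust_remote_code=True is assumed to be required, since the underlying architecture is the custom NewModel):

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_id = "RamzyBakir/arabic-gte-multilingual-embed-medium"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)

batch = tokenizer(["إنهم يشبهون القطط"], padding=True, truncation=True,
                  max_length=8192, return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**batch).last_hidden_state  # (0) Transformer

cls_embedding = token_embeddings[:, 0]              # (1) Pooling: 'cls' mode takes the first token
embedding = F.normalize(cls_embedding, p=2, dim=1)  # (2) Normalize: unit-length vectors
print(embedding.shape)  # torch.Size([1, 768])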

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
# (trust_remote_code=True is needed because the base model uses the custom NewModel architecture)
model = SentenceTransformer("RamzyBakir/arabic-gte-multilingual-embed-medium", trust_remote_code=True)
# Run inference
sentences = [
    'هم نوع من مثل القطط.',  # "They are kind of like cats."
    'إنهم يشبهون القطط',  # "They are like cats."
    'إنسان مع قطة',  # "A human with a cat."
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.8787, 0.6596],
#         [0.8787, 1.0000, 0.7666],
#         [0.6596, 0.7666, 1.0000]])
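
For retrieval, a common pattern is to embed a query and candidate documents separately, then rank the documents by cosine similarity. A minimal sketch (the query and documents below are illustrative):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("RamzyBakir/arabic-gte-multilingual-embed-medium", trust_remote_code=True)

query = "ما هي عاصمة فرنسا؟"  # "What is the capital of France?"
documents = [
    "باريس هي عاصمة فرنسا.",  # "Paris is the capital of France."
    "القاهرة هي أكبر مدينة في مصر.",  # "Cairo is the largest city in Egypt."
]

query_embedding = model.encode(query)
document_embeddings = model.encode(documents)

# Embeddings are L2-normalized, so cosine similarity gives the ranking directly.
scores = model.similarity(query_embedding, document_embeddings)  # shape: [1, 2]
best = scores.argmax().item()
print(documents[best], scores[0, best].item())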

Evaluation

Metrics

Triplet

  • Datasets: arabic-triplet-fast and arabic-triplet-eval-full
  • Evaluated with TripletEvaluator
Metric             arabic-triplet-fast    arabic-triplet-eval-full
cosine_accuracy    0.9804                 0.9824
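
TripletEvaluator counts a triplet as correct when the anchor is closer (by cosine similarity) to the positive than to the negative. A sketch of computing this kind of score follows; the dataset id, split, and sample count are assumptions, while the column names match the dataset description below:

from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import TripletEvaluator

model = SentenceTransformer("RamzyBakir/arabic-gte-multilingual-embed-medium", trust_remote_code=True)

# Assumption: dataset id, split, and subset size; anchor/positive/negative columns are per the card.
triplets = load_dataset("silma-ai/silma-arabic-triplets-dataset-v1.0", split="train").select(range(1000))

evaluator = TripletEvaluator(
    anchors=triplets["anchor"],
    positives=triplets["positive"],
    negatives=triplets["negative"],
    name="arabic-triplet-fast",
)
print(evaluator(model))  # e.g. {'arabic-triplet-fast_cosine_accuracy': ...}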

Training Details

Training Dataset

silma-arabic-triplets-dataset-v1.0

  • Dataset: silma-arabic-triplets-dataset-v1.0
  • Size: 2,027,899 training samples
  • Columns: anchor, positive, and negative
  • Approximate statistics based on the first 1000 samples:

            anchor          positive        negative
    type    string          string          string
    min     5 tokens        5 tokens        5 tokens
    mean    19.32 tokens    16.68 tokens    16.28 tokens
    max     80 tokens       104 tokens      77 tokens
  • Samples:
    Example 1
      anchor:   They think the web is a force for good, and most don’t want governments to regulate it.
      positive: وهم يعتقدون أن شبكة الويب هي قوة تستخدم للخير، ولا يرغب أغلبهم في أن تقوم الحكومات بتنظيم تلك الشبكة.
      negative: أصبحت الحكومة أكثر حساسية لتأثير الإنترنت على السياسة الداخلية وسنت قوانين تزيد من سلطتها لتنظيم هذا القطاع.
    Example 2
      anchor:   طفل صغير يرتدي نظارات زرقاء يجلس على طوف في بركة سباحة.
      positive: طفل صغير يجلس على طوف في حمام سباحة ويرتدي نظارات زرقاء.
      negative: امرأة ترقص بعنف على خشبة المسرح
    Example 3
      anchor:   امرأة وطفل يسيرون على الرصيف المغطى بالأوراق متجهين نحو شخصين يركبان خيول.
      positive: سيدة وطفل يسيرون على الرصيف باتجاه الخيول
      negative: امرأة وطفلان يسيران في حديقة.
  • Loss: CachedMultipleNegativesRankingLoss with these parameters:
    {
        "scale": 50.0,
        "similarity_fct": "cos_sim",
        "mini_batch_size": 64,
        "gather_across_devices": false,
        "directions": [
            "query_to_doc"
        ],
        "partition_mode": "joint",
        "hardness_mode": null,
        "hardness_strength": 0.0
    }
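
As a sketch, the loss could be constructed as follows; scale, similarity_fct, and mini_batch_size are the standard constructor arguments, while the remaining keys above appear to be version-specific defaults:

from sentence_transformers import SentenceTransformer, util
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss

model = SentenceTransformer("Alibaba-NLP/gte-multilingual-base", trust_remote_code=True)
loss = CachedMultipleNegativesRankingLoss(
    model,
    scale=50.0,                   # temperature: similarities are multiplied by 50 before softmax
    similarity_fct=util.cos_sim,  # cosine similarity, matching 'cos_sim' above
    mini_batch_size=64,           # gradient caching processes the 128-sample batch in 64-sample chunks
)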
    

Evaluation Dataset

silma-arabic-triplets-dataset-v1.0

  • Dataset: silma-arabic-triplets-dataset-v1.0
  • Size: 10,000 evaluation samples
  • Columns: anchor, positive, and negative
  • Approximate statistics based on the first 1000 samples:

            anchor          positive        negative
    type    string          string          string
    min     5 tokens        5 tokens        5 tokens
    mean    19.09 tokens    16.36 tokens    16.43 tokens
    max     91 tokens       90 tokens       84 tokens
  • Samples:
    Example 1
      anchor:   It is typically called the Constitution of the Fifth Republic, and replaced that of the Fourth Republic dating from 1946.
      positive: ويسمى عادة دستور الجمهورية الخامسة، وحل محل دستور الجمهورية الرابع الذي يعود تاريخه إلى عام 1946.
      negative: في نوفمبر 1946، اعتمدت الجمعية الوطنية أول دستور للجمهورية.
    Example 2
      anchor:   ما هو أفضل يوم في حياتك؟
      positive: أي يوم في حياتك اعتبرته أفضل يوم في حياتك؟
      negative: ما هي وصفات الملفات؟
    Example 3
      anchor:   She married French researcher Jean Ghata.
      positive: تزوجت من الباحث الفرنسي جان غاتا.
      negative: ثم انتقلوا إلى فرنسا حيث تزوجوا.
  • Loss: CachedMultipleNegativesRankingLoss with these parameters:
    {
        "scale": 50.0,
        "similarity_fct": "cos_sim",
        "mini_batch_size": 64,
        "gather_across_devices": false,
        "directions": [
            "query_to_doc"
        ],
        "partition_mode": "joint",
        "hardness_mode": null,
        "hardness_strength": 0.0
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 256
  • learning_rate: 2e-05
  • max_steps: 15842
  • lr_scheduler_type: cosine_with_min_lr
  • lr_scheduler_kwargs: {'min_lr': 1e-06}
  • warmup_ratio: 0.05
  • bf16: True
  • dataloader_num_workers: 4
  • remove_unused_columns: False
  • load_best_model_at_end: True
  • batch_sampler: no_duplicates
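
These settings map onto SentenceTransformerTrainingArguments roughly as follows (a sketch; the evaluation/save cadence of 1584 steps is inferred from the training logs below):

from sentence_transformers import SentenceTransformerTrainingArguments
from sentence_transformers.training_args import BatchSamplers

args = SentenceTransformerTrainingArguments(
    output_dir="outputs",  # hypothetical output path
    per_device_train_batch_size=128,
    per_device_eval_batch_size=256,
    learning_rate=2e-5,
    max_steps=15842,
    lr_scheduler_type="cosine_with_min_lr",
    lr_scheduler_kwargs={"min_lr": 1e-6},
    warmup_ratio=0.05,
    bf16=True,
    dataloader_num_workers=4,
    remove_unused_columns=False,
    load_best_model_at_end=True,
    # Inferred from the training logs: evaluate and save every 1584 steps (0.1 epoch).
    eval_strategy="steps",
    eval_steps=1584,
    save_strategy="steps",
    save_steps=1584,
    batch_sampler=BatchSamplers.NO_DUPLICATES,
)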

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • prediction_loss_only: True
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 256
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 3.0
  • max_steps: 15842
  • lr_scheduler_type: cosine_with_min_lr
  • lr_scheduler_kwargs: {'min_lr': 1e-06}
  • warmup_ratio: 0.05
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 4
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: False
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch_fused
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • hub_revision: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • liger_kernel_config: None
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {}
  • learning_rate_mapping: {}

Training Logs

Epoch Step Training Loss Validation Loss arabic-triplet-fast_cosine_accuracy arabic-triplet-eval-full_cosine_accuracy
0.0001 1 0.5997 - - -
0.0063 100 0.4092 - - -
0.0126 200 0.3724 - - -
0.0189 300 0.3321 - - -
0.0252 400 0.3078 - - -
0.0316 500 0.2871 - - -
0.0379 600 0.2782 - - -
0.0442 700 0.2669 - - -
0.0505 800 0.2679 - - -
0.0568 900 0.2628 - - -
0.0631 1000 0.252 - - -
0.0694 1100 0.2424 - - -
0.0757 1200 0.2474 - - -
0.0821 1300 0.2321 - - -
0.0884 1400 0.2484 - - -
0.0947 1500 0.2309 - - -
0.1000 1584 - 0.3115 0.9704 -
0.1010 1600 0.2291 - - -
0.1073 1700 0.2242 - - -
0.1136 1800 0.2142 - - -
0.1199 1900 0.209 - - -
0.1262 2000 0.2259 - - -
0.1326 2100 0.2169 - - -
0.1389 2200 0.2076 - - -
0.1452 2300 0.2129 - - -
0.1515 2400 0.2041 - - -
0.1578 2500 0.204 - - -
0.1641 2600 0.2173 - - -
0.1704 2700 0.2086 - - -
0.1767 2800 0.2032 - - -
0.1830 2900 0.2101 - - -
0.1894 3000 0.2026 - - -
0.1957 3100 0.1977 - - -
0.2000 3168 - 0.2913 0.9713 -
0.2020 3200 0.1975 - - -
0.2083 3300 0.2019 - - -
0.2146 3400 0.1999 - - -
0.2209 3500 0.191 - - -
0.2272 3600 0.1996 - - -
0.2335 3700 0.201 - - -
0.2399 3800 0.1976 - - -
0.2462 3900 0.1963 - - -
0.2525 4000 0.1903 - - -
0.2588 4100 0.1826 - - -
0.2651 4200 0.1879 - - -
0.2714 4300 0.1764 - - -
0.2777 4400 0.1864 - - -
0.2840 4500 0.1909 - - -
0.2903 4600 0.1803 - - -
0.2967 4700 0.1819 - - -
0.2999 4752 - 0.2619 0.9762 -
0.3030 4800 0.187 - - -
0.3093 4900 0.1904 - - -
0.3156 5000 0.1899 - - -
0.3219 5100 0.1764 - - -
0.3282 5200 0.1828 - - -
0.3345 5300 0.1725 - - -
0.3408 5400 0.1674 - - -
0.3472 5500 0.1757 - - -
0.3535 5600 0.166 - - -
0.3598 5700 0.178 - - -
0.3661 5800 0.1765 - - -
0.3724 5900 0.1677 - - -
0.3787 6000 0.1653 - - -
0.3850 6100 0.176 - - -
0.3913 6200 0.1533 - - -
0.3977 6300 0.1622 - - -
0.3999 6336 - 0.2459 0.9771 -
0.4040 6400 0.1741 - - -
0.4103 6500 0.1624 - - -
0.4166 6600 0.1639 - - -
0.4229 6700 0.1674 - - -
0.4292 6800 0.1665 - - -
0.4355 6900 0.1679 - - -
0.4418 7000 0.1611 - - -
0.4481 7100 0.1661 - - -
0.4545 7200 0.1684 - - -
0.4608 7300 0.1674 - - -
0.4671 7400 0.1746 - - -
0.4734 7500 0.1684 - - -
0.4797 7600 0.1667 - - -
0.4860 7700 0.1605 - - -
0.4923 7800 0.1537 - - -
0.4986 7900 0.171 - - -
0.4999 7920 - 0.2387 0.9767 -
0.5050 8000 0.1587 - - -
0.5113 8100 0.1623 - - -
0.5176 8200 0.1704 - - -
0.5239 8300 0.1575 - - -
0.5302 8400 0.1671 - - -
0.5365 8500 0.1608 - - -
0.5428 8600 0.1537 - - -
0.5491 8700 0.1568 - - -
0.5555 8800 0.1582 - - -
0.5618 8900 0.1598 - - -
0.5681 9000 0.1613 - - -
0.5744 9100 0.1628 - - -
0.5807 9200 0.1507 - - -
0.5870 9300 0.148 - - -
0.5933 9400 0.1573 - - -
0.5996 9500 0.147 - - -
0.5999 9504 - 0.2270 0.9788 -
0.6059 9600 0.1502 - - -
0.6123 9700 0.1445 - - -
0.6186 9800 0.1534 - - -
0.6249 9900 0.1544 - - -
0.6312 10000 0.1509 - - -
0.6375 10100 0.1599 - - -
0.6438 10200 0.1579 - - -
0.6501 10300 0.1525 - - -
0.6564 10400 0.1371 - - -
0.6628 10500 0.1456 - - -
0.6691 10600 0.148 - - -
0.6754 10700 0.1472 - - -
0.6817 10800 0.1448 - - -
0.6880 10900 0.1488 - - -
0.6943 11000 0.1589 - - -
0.6999 11088 - 0.2218 0.9799 -
0.7006 11100 0.1464 - - -
0.7069 11200 0.1391 - - -
0.7132 11300 0.1489 - - -
0.7196 11400 0.1492 - - -
0.7259 11500 0.1561 - - -
0.7322 11600 0.1498 - - -
0.7385 11700 0.1553 - - -
0.7448 11800 0.1485 - - -
0.7511 11900 0.1432 - - -
0.7574 12000 0.1385 - - -
0.7637 12100 0.1497 - - -
0.7701 12200 0.145 - - -
0.7764 12300 0.1354 - - -
0.7827 12400 0.1345 - - -
0.7890 12500 0.1472 - - -
0.7953 12600 0.141 - - -
0.7998 12672 - 0.2167 0.9802 -
0.8016 12700 0.1376 - - -
0.8079 12800 0.1332 - - -
0.8142 12900 0.1469 - - -
0.8206 13000 0.142 - - -
0.8269 13100 0.1391 - - -
0.8332 13200 0.1512 - - -
0.8395 13300 0.1467 - - -
0.8458 13400 0.1485 - - -
0.8521 13500 0.1485 - - -
0.8584 13600 0.1412 - - -
0.8647 13700 0.1482 - - -
0.8710 13800 0.1532 - - -
0.8774 13900 0.1402 - - -
0.8837 14000 0.136 - - -
0.8900 14100 0.1416 - - -
0.8963 14200 0.1427 - - -
0.8998 14256 - 0.2136 0.9800 -
0.9026 14300 0.1496 - - -
0.9089 14400 0.1415 - - -
0.9152 14500 0.1395 - - -
0.9215 14600 0.1367 - - -
0.9279 14700 0.1424 - - -
0.9342 14800 0.1421 - - -
0.9405 14900 0.1312 - - -
0.9468 15000 0.1427 - - -
0.9531 15100 0.1421 - - -
0.9594 15200 0.1347 - - -
0.9657 15300 0.141 - - -
0.9720 15400 0.144 - - -
0.9784 15500 0.1417 - - -
0.9847 15600 0.1416 - - -
0.9910 15700 0.1356 - - -
0.9973 15800 0.1403 - - -
0.9998 15840 - 0.2124 0.9804 -
-1 -1 - - - 0.9824
  • The saved checkpoint is the one from step 15840 (validation loss 0.2124, the lowest recorded), restored at the end of training via load_best_model_at_end.

Training Time

  • Training: 1.1 hours

Framework Versions

  • Python: 3.12.6
  • Sentence Transformers: 5.4.1
  • Transformers: 4.56.0
  • PyTorch: 2.8.0+cu129
  • Accelerate: 1.10.1
  • Datasets: 4.8.5
  • Tokenizers: 0.22.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

CachedMultipleNegativesRankingLoss

@misc{gao2021scaling,
    title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
    author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
    year={2021},
    eprint={2101.06983},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}