metadata
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:21769
  - loss:MultipleNegativesRankingLoss
base_model: am-azadi/bilingual-embedding-large_Fine_Tuned_1e
widget:
  - source_sentence: >-
      GOOD NEWS!  Eriksen, has already gone out to the hospital window, where he
      is under observation and looks optimistic after having suffered a cardiac
      arrest. 
    sentences:
      - >-
        Bolsonaro with the two assassins of Marielle Franco No, the men next to
        Jair Bolsonaro in this photo are not the ones accused of the murder of
        Marielle Franco
      - >-
        This photo shows Christian Eriksen waving from the window of the
        hospital where he was admitted after suffering cardiac arrest The photo
        of Eriksen waving from the window was taken months before his heart
        incident
      - >-
        Video of protests in the US during the COVID-19 pandemic This video has
        been circulating in reports about the funeral procession of military
        commanders in Iran in January 2020
  - source_sentence: >-
      What a dirty game... "US postman arrested in canadian border with
      banknotes stolen in the trunk of the car". 91 Breaking911  5h U.S. Postal
      Worker Caught at Canadian Border With Stolen Ballots In Car Trunk -
      breaking911.com/u-s-postal-wor... 8218248 Claudia Wild   IT 8206434 300  
      4:57 06 Nov 20 Twitter for iPhone 1,134 Retweets 113 Tweets with comment
    sentences:
      - >-
        Postman arrested with stolen bills at US-Canada border Only three blank
        bills were found in a US postal worker's car
      - >-
        Covid relief plan will cost every American $5,750 Misleading posts claim
        US covid relief plan costs every American $5,750
      - >-
        CDC informs that 10% of the swabs used for PCR testing were sent to
        LABORATORIES, being analyzed of GENETIC SEQUENCES We check the claim
        that PCR tests aim to sequence the DNA of patients with covid-19
  - source_sentence: >-
      . Northeast Always in Our Hearts! Advance Northeast!!  . Brazilian Army
      through its Engineering Battalion finds a Huge Potable Water Well in
      Seridó - Caicó/RN, one of the most needy areas. This well will supply the
      homes of more than 3,000 people!!  . It's our President Bolsonaro ridding
      the Bravo People of the Northeast from the wounds of drought! .  . . 
      BRAZIL LOVED HOMELAND  . . Friends and Followers of : Follow and Turn on
      our Notifications  . .          # pocket                  . 
    sentences:
      - >-
        Twitter suspended Elon Musk's Twitter account after he pulled out of
        deal Imposter Elon Musk Twitter account shared in false posts claiming
        he was 'suspended' over buyout row
      - >-
        The Brazilian Army found water in Caicó, Rio Grande do Norte, during the
        government of President Jair Bolsonaro. The recording of the drilling of
        an artesian well in Caicó, Rio Grande do Norte, has been circulating
        since 2015
      - >-
        A video was published today about Syrian refugees in Sweden being
        subjected to the separation of husbands, as well as the forcible removal
        of their children and the handing over of children to Christian families
        to change their religion. And to turn them into Christians, they will
        have two children Swedish police did not take Syrian children to hand
        over to Christian families
  - source_sentence: >-
      what hp Álvaro Uribe Vélez ... 3pm ✓ The coastal people are the least
      intellectual of the country, that is why this region of Colombia is mired
      in poverty. They don't like to work either. that's why there is currently
      a level very high of misery in la guajira. With the democratic center we
      will change. The entire Caribbean coast must feel outraged by the
      statements of this individual. Now with more reasons, the coastal people
      should support Petro. The how.. see more
    sentences:
      - >-
        Covid-19: Omicron variant is transmitted by eye contact according to the
        WHO The coronavirus is transmitted by interaction with contaminated
        droplets, not by eye contact
      - >-
        5G causes suffocation in humans, affects the respiratory system There is
        no evidence that 5G technology affects the respiratory system and
        increases toxins in the body
      - >-
        Álvaro Uribe tweeted that the coastal people are the least intellectual
        population in Colombia There is no record of Uribe tweeting that the
        coast is the "least intellectual" region of Colombia
  - source_sentence: >-
      The terrorists evaporated in seconds  A very rare scene of the moment the
      Egyptian planes bombed the terrorist elements in Sinai Watch the video
      here   NB Please all our followers on our page subscribe to our YouTube
      channel We will publish everything new on the ground Open the channel
      link 
    sentences:
      - >-
        Cars melt due to hot weather in Saudi Arabia No, these cars did not melt
        due to hot weather
      - >-
        Footage shows robbery in Sri Lanka Delhi crime footage falsely shared as
        'Sri Lanka burglary'
      - >-
        A very rare scene of the moment the Egyptian planes bombed the terrorist
        elements in Sinai This picture is not of an Egyptian warplane, but of an
        Israeli plane
pipeline_tag: sentence-similarity
library_name: sentence-transformers

SentenceTransformer based on am-azadi/bilingual-embedding-large_Fine_Tuned_1e

This is a sentence-transformers model fine-tuned from am-azadi/bilingual-embedding-large_Fine_Tuned_1e. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BilingualModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
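
The Pooling and Normalize stages listed above can be illustrated with a minimal, dependency-free sketch. It assumes mean pooling over non-padding tokens (pooling_mode_mean_tokens: True) followed by L2 normalization; the function names and the toy 4-dimensional vectors are hypothetical and are not the library's internal implementation.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is ignored)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                summed[i] += value
    count = max(count, 1)  # guard against an all-padding input
    return [s / count for s in summed]

def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / max(norm, eps) for v in vector]

# Toy input (hypothetical values): three token vectors, the last one is padding.
tokens = [[1.0, 2.0, 0.0, 0.0], [3.0, 0.0, 0.0, 4.0], [9.0, 9.0, 9.0, 9.0]]
mask = [1, 1, 0]
sentence_embedding = l2_normalize(mean_pool(tokens, mask))
```

Because the final stage normalizes every embedding to unit length, the cosine similarity of two embeddings equals their dot product.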

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

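# Download from the 🤗 Hub ("sentence_transformers_model_id" is a placeholder for this model's Hub ID)
# Assumption: the custom BilingualModel code may require passing trust_remote_code=True
model = SentenceTransformer("sentence_transformers_model_id")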
# Run inference
sentences = [
    'The terrorists evaporated in seconds  A very rare scene of the moment the Egyptian planes bombed the terrorist elements in Sinai Watch the video here   NB Please all our followers on our page subscribe to our YouTube channel We will publish everything new on the ground Open the channel link ',
    'A very rare scene of the moment the Egyptian planes bombed the terrorist elements in Sinai This picture is not of an Egyptian warplane, but of an Israeli plane',
    'Cars melt due to hot weather in Saudi Arabia No, these cars did not melt due to hot weather',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1024)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
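
For a retrieval-style use, such as matching a social media post to candidate fact-checks, the similarity scores can be sorted to rank the candidates. The sketch below is a hedged illustration only: it uses toy 2-dimensional unit vectors in place of real model.encode(...) output, and rank_candidates is a hypothetical helper, not part of the Sentence Transformers API.

```python
def dot(a, b):
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(x * y for x, y in zip(a, b))

def rank_candidates(query_vec, candidate_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, dot(query_vec, c)) for i, c in enumerate(candidate_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy unit vectors standing in for normalized embeddings (hypothetical values).
post = [1.0, 0.0]
fact_checks = [[0.0, 1.0], [0.8, 0.6], [0.6, 0.8]]

ranking = rank_candidates(post, fact_checks)
best_index = ranking[0][0]  # index of the closest fact-check
```

With real embeddings, the post vector and the candidate vectors would each come from a call to model.encode.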

Training Details

Training Dataset

Unnamed Dataset

  • Size: 21,769 training samples
  • Columns: sentence_0 and sentence_1
  • Approximate statistics based on the first 1000 samples:
    • sentence_0 (type: string): min 6 tokens, mean 119.28 tokens, max 512 tokens
    • sentence_1 (type: string): min 18 tokens, mean 39.42 tokens, max 98 tokens
  • Samples (each pair lists sentence_0, then sentence_1):
    • sentence_0: HAPPENING NOW ; KENYA ELECTRIC BUS IS ON FIRE ALONG KAREN ROAD.
      sentence_1: Electric bus catches fire in Nairobi Video shows a methane-powered bus that caught fire in Italy, not an electric bus in Kenya
    • sentence_0: RUPTLY Viewed 51,670 times 8 hours Snorr On the way down Khao Pak Thong Chai, route 3-4, Sattahip - Korat, all of them would have died. pity Incident 27 Jun.
      sentence_1: Video showing road accidents in Thailand? This is a video published in a news report about a car crash in Russia.
    • sentence_0: The image that went around the world! This photo won the best of the decade award and led to the author to depression, the author narrated in his description; "Cheetahs chased a mother deer and her 2 babies, she offered herself so that her children could escape and in the photo looks like she watches her babies run to safety as she is about to be devoured" How many times have you stopped to think how many sacrifices your parents do for you. While you have fun, laugh and you enjoy life, they give theirs.
      sentence_1: Cheetahs chased a mother deer and she volunteered so her children could escape Behind the picture: Cheetahs learned from their mother how to capture prey
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
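
To make the loss parameters concrete, here is a minimal sketch of how a multiple-negatives ranking objective can be computed, assuming the scale of 20.0 and cosine similarity shown above. Each anchor is scored against every positive in the batch, and cross-entropy rewards the matching pair. The function names are hypothetical and the code is a plain-Python illustration, not the library's implementation.

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the correct match for anchors[i]."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        # Scaled similarities to every positive; the others act as in-batch negatives.
        logits = [scale * cos_sim(anchor, p) for p in positives]
        # Numerically stable log-sum-exp.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

# Toy batch of two pairs (hypothetical values).
aligned = mnr_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = mnr_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the aligned pairs give a loss near zero, while the swapped pairs give a loss near the scale value. With a per-device batch size of 2, as listed under Non-Default Hyperparameters, each anchor would be contrasted against a single in-batch negative.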
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 2
  • per_device_eval_batch_size: 2
  • num_train_epochs: 1
  • multi_dataset_batch_sampler: round_robin

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: no
  • prediction_loss_only: True
  • per_device_train_batch_size: 2
  • per_device_eval_batch_size: 2
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 1
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin
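
As a rough illustration of how learning_rate: 5e-05, lr_scheduler_type: linear, and warmup_steps: 0 interact, the sketch below computes a linearly decaying learning rate. The function linear_lr is a hypothetical helper, and the total of 10,885 steps is an assumption derived from 21,769 samples at batch size 2 on a single device.

```python
def linear_lr(step, total_steps, base_lr=5e-05, warmup_steps=0):
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

total_steps = 10885  # assumption: ceil(21769 / 2) optimizer steps for one epoch

lr_start = linear_lr(0, total_steps)           # full learning rate at the first step
lr_end = linear_lr(total_steps, total_steps)   # decays to zero at the last step
```

With no warmup, the rate starts at its full value and falls in a straight line over the single epoch.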

Training Logs

Epoch Step Training Loss
0.0459 500 0.0135
0.0919 1000 0.024
0.1378 1500 0.0073
0.1837 2000 0.0103
0.2297 2500 0.0265
0.2756 3000 0.0209
0.3215 3500 0.0308
0.3675 4000 0.0301
0.4134 4500 0.0382
0.4593 5000 0.0164
0.5053 5500 0.0251
0.5512 6000 0.0141
0.5972 6500 0.0131
0.6431 7000 0.006
0.6890 7500 0.0261
0.7350 8000 0.0111
0.7809 8500 0.0089
0.8268 9000 0.0201
0.8728 9500 0.0175
0.9187 10000 0.0086
0.9646 10500 0.0049
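
The Epoch column above can be reproduced from the dataset size and batch size. The sketch below assumes a single device, gradient_accumulation_steps of 1, and dataloader_drop_last set to False, as listed in the hyperparameters; epoch_at is a hypothetical helper used only for this check.

```python
import math

num_samples = 21769   # training samples reported for the dataset
batch_size = 2        # per_device_train_batch_size (single device assumed)
grad_accum = 1        # gradient_accumulation_steps

# The last partial batch is kept, so round up.
steps_per_epoch = math.ceil(num_samples / (batch_size * grad_accum))

def epoch_at(step):
    """Fraction of the epoch completed after a given optimizer step."""
    return round(step / steps_per_epoch, 4)
```

Under these assumptions one epoch is 10,885 steps, so step 500 corresponds to epoch 0.0459 and step 10500 to epoch 0.9646, matching the logged values.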

Framework Versions

  • Python: 3.11.11
  • Sentence Transformers: 3.4.1
  • Transformers: 4.48.3
  • PyTorch: 2.5.1+cu124
  • Accelerate: 1.3.0
  • Datasets: 3.3.2
  • Tokenizers: 0.21.0

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}