SentenceTransformer based on intfloat/multilingual-e5-large

This is a sentence-transformers model fine-tuned from intfloat/multilingual-e5-large. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: intfloat/multilingual-e5-large
  • Maximum Sequence Length: 256 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False, 'architecture': 'PeftModelForFeatureExtraction'})
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
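
The Pooling and Normalize modules above can be illustrated with a small pure-Python sketch. It is not the library's implementation: it only shows, on toy numbers, what mean pooling over non-padding tokens followed by L2 normalization computes. The function names `mean_pool` and `l2_normalize` and the toy vectors are hypothetical.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping padding positions (mask == 0)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    return [value / max(count, 1) for value in total]

def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)

# Toy input: three 2-dimensional token vectors; the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)     # padding vector is ignored
sentence_vec = l2_normalize(pooled)  # unit-length sentence embedding
print(pooled, sentence_vec)
```

Because the final module normalizes every embedding to unit length, the cosine similarity of two embeddings equals their dot product.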

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

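Then you can load this model and run inference. Note that the example inputs carry the "query: " and "passage: " prefixes, matching the format of the training samples; the base E5 model expects these prefixes on its inputs.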

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("songphucn7/me5-checkthat-task1-v2")
# Run inference
sentences = [
    'query: @zoeharcombe I\'ll reply it. The "vaccines" have no health benefit and only cause harm. Pharma company trial data says they don\'t work. A straightforward calculation of absolute risk from the Pfizer trial data = .04% effectiveness for severe cases, which is essentially zero.',
    'passage: Among 10 cases of severe Covid-19 with onset after the first dose, 9 occurred in placebo recipients and 1 in a BNT162b2 recipient.\n\nThe safety profile of BNT162b2 was characterized by short-term, mild-to-moderate pain at the injection site, fatigue, and headache.\nThe incidence of serious adverse events was low and was similar in the vaccine and placebo groups.\nConclusionsA two-dose regimen of BNT162b2 conferred 95% protection against Covid-19 in persons 16 years of age or older.\nSafety over a median of 2 months was similar to that of other viral vaccines.\n(Funded by BioNTech and Pfizer; ClinicalTrials.\ngov number, NCT04368728.',
    "passage: title: Imperfect Vaccination Can Enhance the Transmission of Highly Virulent Pathogens abstract: Could some vaccines drive the evolution of more virulent pathogens?\nConventional wisdom is that natural selection will remove highly lethal pathogens if host death greatly reduces transmission.\nVaccines that keep hosts alive but still allow transmission could thus allow very virulent strains to circulate in a population.\nHere we show experimentally that immunization of chickens against Marek's disease virus enhances the fitness of more virulent strains, making it possible for hyperpathogenic strains to transmit.\nImmunity elicited by direct vaccination or by maternal vaccination prolongs host survival but does not prevent infection, viral replication or transmission, thus extending the infectious periods of strains otherwise too lethal to persist.\nOur data show that anti-disease vaccines that do not prevent transmission can create conditions that promote the emergence of pathogen strains that cause more severe disease in unvaccinated hosts.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1024)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.5866, 0.3283],
#         [0.5866, 1.0000, 0.1368],
#         [0.3283, 0.1368, 1.0000]])
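
For retrieval, the first row of the similarity matrix above is the part that matters: it scores the query against each passage. The following sketch, which runs without downloading the model, ranks the two passages from those printed scores and shows a small helper for adding the prefixes. The helper name `with_prefix` is hypothetical and not part of the library.

```python
def with_prefix(texts, prefix):
    """Prepend a prefix such as 'query: ' or 'passage: ' unless already present."""
    return [t if t.startswith(prefix) else prefix + t for t in texts]

# Row 0 of the similarity matrix printed above: query vs. [query, passage 1, passage 2].
similarity_row = [1.0000, 0.5866, 0.3283]
passage_scores = similarity_row[1:]  # drop the query's similarity with itself

# Passage indices sorted from most to least similar to the query.
ranking = sorted(range(len(passage_scores)),
                 key=lambda i: passage_scores[i],
                 reverse=True)

print(with_prefix(["Natural immunity lasts for months."], "query: "))
print(ranking)
```

With the printed scores, the Pfizer trial passage (0.5866) ranks above the Marek's disease passage (0.3283).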

Evaluation

Metrics

Information Retrieval

Metric Value
cosine_accuracy@1 0.52
cosine_accuracy@3 0.7216
cosine_accuracy@5 0.7938
cosine_accuracy@10 0.8494
cosine_precision@1 0.52
cosine_precision@3 0.2405
cosine_precision@5 0.1588
cosine_precision@10 0.0849
cosine_recall@1 0.52
cosine_recall@3 0.7216
cosine_recall@5 0.7938
cosine_recall@10 0.8494
cosine_ndcg@10 0.6863
cosine_mrr@10 0.6337
cosine_map@100 0.6389
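
In the table above, accuracy@k equals recall@k, and precision@k equals accuracy@k divided by k (for example 0.7216 / 3 ≈ 0.2405). That pattern is what one would expect if each query has exactly one relevant passage, although the card does not state this. Under that assumption, the metrics can be sketched from the 1-based rank of the relevant passage per query. The function `ir_metrics` and the toy ranks are hypothetical; this is not the evaluator used to produce the table.

```python
import math

def ir_metrics(ranks, ks=(1, 3, 5, 10)):
    """Compute retrieval metrics assuming ONE relevant passage per query.

    ranks: 1-based rank of the relevant passage for each query,
           or None if it was not retrieved at all.
    """
    n = len(ranks)
    out = {}
    for k in ks:
        hits = sum(1 for r in ranks if r is not None and r <= k)
        out[f"accuracy@{k}"] = hits / n
        out[f"recall@{k}"] = hits / n           # one relevant item: same as accuracy
        out[f"precision@{k}"] = hits / (n * k)  # at most 1 hit among k results
    top10 = [r for r in ranks if r is not None and r <= 10]
    out["mrr@10"] = sum(1.0 / r for r in top10) / n
    # With a single relevant item the ideal DCG is 1, so nDCG is 1 / log2(rank + 1).
    out["ndcg@10"] = sum(1.0 / math.log2(r + 1) for r in top10) / n
    return out

# Toy example: four queries whose relevant passage was ranked 1st, 2nd, 4th, and not found.
metrics = ir_metrics([1, 2, 4, None])
print(metrics)
```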

Training Details

Training Dataset

Unnamed Dataset

  • Size: 17,319 training samples
  • Columns: sentence_0 and sentence_1
  • Approximate statistics based on the first 1000 samples:
               sentence_0           sentence_1
    type       string               string
    details    min: 21 tokens       min: 11 tokens
               mean: 59.2 tokens    mean: 205.0 tokens
               max: 136 tokens      max: 256 tokens
  • Samples:
    Sample 1
    sentence_0: query: @user Baloney. Natural immunity is hands down better, and vaccinated people are ending up in the hospital.
    sentence_1: passage: title: Longitudinal analysis shows durable and broad immune memory after SARS-CoV-2 infection with persisting antibody responses and memory B and T cells abstract: Ending the COVID-19 pandemic will require long-lived immunity to SARS-CoV-2. Here, we evaluate 254 COVID-19 patients longitudinally up to 8 months and find durable broad-based immune responses. SARS-CoV-2 spike binding and neutralizing antibodies exhibit a bi-phasic decay with an extended half-life of >200 days suggesting the generation of longer-lived plasma cells. SARS-CoV-2 infection also boosts antibody titers to SARS-CoV-1 and common betacoronaviruses. In addition, spike-specific IgG+ memory B cells persist, which bodes well for a rapid antibody response upon virus re-exposure or vaccination. Virus-specific CD4+ and CD8+ T cells are polyfunctional and maintained with an estimated half-life of 200 days. Interestingly, CD4+ T cell responses equally target several SARS-CoV-2 proteins, whereas the CD8+ T cell respo...
    Sample 2
    sentence_0: query: @Alexand64744343 Meta examen tests #lyme + élevé niveau de preuve scientifique ! : rendement « 53.9% for synthetic C6 peptide ELISA tests & 53.7% when the two-tier methodology was used » Une véritable loterie, 1 cas sur 2 détecté mais persistez à vociférer par ignorance
    sentence_1: passage: title: Commercial test kits for detection of Lyme borreliosis: a meta-analysis of test accuracy abstract: The clinical diagnosis of Lyme borreliosis can be supported by various test methodologies; test kits are available from many manufacturers. Literature searches were carried out to identify studies that reported characteristics of the test kits. Of 50 searched studies, 18 were included where the tests were commercially available and samples were proven to be positive using serology testing, evidence of an erythema migrans rash, and/or culture. Additional requirements were a test specificity of ≥85% and publication in the last 20 years. The weighted mean sensitivity for all tests and for all samples was 59.5%. Individual study means varied from 30.6% to 86.2%. Sensitivity for each test technology varied from 62.4% for Western blot kits, and 62.3% for enzyme-linked immunosorbent assay tests, to 53.9% for synthetic C6 peptide ELISA tests and 53.7% when the two-tier meth...
    Sample 3
    sentence_0: query: 28 Les systèmes de séquençage haut débit qui servent à la production des banques comme celles du papier de Jaenisch produisent des chimères artefactuelles lors de la PCR. C’est bien connu. Discuté dans :
    sentence_1: passage: title: A Survey of Virus Recombination Uncovers Canonical Features of Artificial Chimeras Generated During Deep Sequencing Library Preparation abstract: Abstract Chimeric reads can be generated by in vitro recombination during the preparation of high-throughput sequencing libraries. Our attempt to detect biological recombination between the genomes of dengue virus (DENV; +ssRNA genome) and its mosquito host using the Illumina Nextera sequencing library preparation kit revealed that most, if not all, detected host–virus chimeras were artificial. Indeed, these chimeras were not more frequent than with control RNA from another species (a pillbug), which was never in contact with DENV RNA prior to the library preparation. The proportion of chimera types merely reflected those of the three species among sequencing reads. Chimeras were frequently characterized by the presence of 1-20 bp microhomology between recombining fragments. Within-species chimeras mostly involved fragments in...
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim",
        "gather_across_devices": false,
        "directions": [
            "query_to_doc"
        ],
        "partition_mode": "joint",
        "hardness_mode": null,
        "hardness_strength": 0.0
    }
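
    The loss configuration above (cosine similarity, scale 20.0, query-to-document direction) can be sketched in pure Python as an in-batch softmax cross-entropy: each query should score its own passage higher than every other passage in the batch. This is a simplified illustration, not the library's implementation; the names `cos_sim` and `mnr_loss` and the toy vectors are hypothetical.

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mnr_loss(query_vecs, passage_vecs, scale=20.0):
    """Mean cross-entropy where passage i is the positive for query i
    and all other passages in the batch act as negatives."""
    losses = []
    for i, query in enumerate(query_vecs):
        logits = [scale * cos_sim(query, p) for p in passage_vecs]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = mnr_loss(queries, [[1.0, 0.0], [0.0, 1.0]])  # positives match: loss near 0
swapped = mnr_loss(queries, [[0.0, 1.0], [1.0, 0.0]])  # positives mismatched: large loss
print(aligned, swapped)
```

    A larger batch supplies more in-batch negatives per query, which is one reason the batch size of 32 matters for this loss.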
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 32
  • num_train_epochs: 10
  • eval_strategy: steps
  • per_device_eval_batch_size: 32
  • multi_dataset_batch_sampler: round_robin

All Hyperparameters

  • per_device_train_batch_size: 32
  • num_train_epochs: 10
  • max_steps: -1
  • learning_rate: 5e-05
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: None
  • warmup_steps: 0
  • optim: adamw_torch_fused
  • optim_args: None
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • optim_target_modules: None
  • gradient_accumulation_steps: 1
  • average_tokens_across_devices: True
  • max_grad_norm: 1
  • label_smoothing_factor: 0.0
  • bf16: False
  • fp16: False
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • use_liger_kernel: False
  • liger_kernel_config: None
  • use_cache: False
  • neftune_noise_alpha: None
  • torch_empty_cache_steps: None
  • auto_find_batch_size: False
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • include_num_input_tokens_seen: no
  • log_level: passive
  • log_level_replica: warning
  • disable_tqdm: False
  • project: huggingface
  • trackio_space_id: trackio
  • eval_strategy: steps
  • per_device_eval_batch_size: 32
  • prediction_loss_only: True
  • eval_on_start: False
  • eval_do_concat_batches: True
  • eval_use_gather_object: False
  • eval_accumulation_steps: None
  • include_for_metrics: []
  • batch_eval_metrics: False
  • save_only_model: False
  • save_on_each_node: False
  • enable_jit_checkpoint: False
  • push_to_hub: False
  • hub_private_repo: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_always_push: False
  • hub_revision: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • restore_callback_states_from_checkpoint: False
  • full_determinism: False
  • seed: 42
  • data_seed: None
  • use_cpu: False
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • dataloader_prefetch_factor: None
  • remove_unused_columns: True
  • label_names: None
  • train_sampling_strategy: random
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • ddp_backend: None
  • ddp_timeout: 1800
  • fsdp: []
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • deepspeed: None
  • debug: []
  • skip_memory_metrics: True
  • do_predict: False
  • resume_from_checkpoint: None
  • warmup_ratio: None
  • local_rank: -1
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin
  • router_mapping: {}
  • learning_rate_mapping: {}
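
The combination of learning_rate 5e-05, a linear scheduler, and warmup_steps 0 implies a learning rate that decays in a straight line from its peak to zero over training. A minimal sketch of that schedule follows, assuming 5,420 total optimizer steps (the final step in the training logs). The function `linear_lr` is hypothetical and only mirrors the usual linear-decay formula, not the trainer's scheduler code.

```python
def linear_lr(step, base_lr=5e-05, total_steps=5420, warmup_steps=0):
    """Learning rate at a given optimizer step under linear warmup then linear decay."""
    if step < warmup_steps:
        # Linear ramp from 0 to base_lr (unused here because warmup_steps is 0).
        return base_lr * (step / max(1, warmup_steps))
    remaining = max(0, total_steps - step)
    return base_lr * (remaining / max(1, total_steps - warmup_steps))

for step in (0, 2710, 5420):
    print(step, linear_lr(step))
```

Under these assumptions the rate is 5e-05 at the start, half of that at step 2,710, and zero at the final step.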

Training Logs

Epoch Step Training Loss 10-percent-dev-split_cosine_ndcg@10
0.1845 100 - 0.6380
0.3690 200 - 0.6400
0.5535 300 - 0.6512
0.7380 400 - 0.6642
0.9225 500 0.8957 0.6640
1.0 542 - 0.6626
1.1070 600 - 0.6658
1.2915 700 - 0.6676
1.4760 800 - 0.6695
1.6605 900 - 0.6719
1.8450 1000 0.3899 0.6776
2.0 1084 - 0.6744
2.0295 1100 - 0.6761
2.2140 1200 - 0.6759
2.3985 1300 - 0.6761
2.5830 1400 - 0.6830
2.7675 1500 0.3484 0.6779
2.9520 1600 - 0.6793
3.0 1626 - 0.6762
3.1365 1700 - 0.6823
3.3210 1800 - 0.6831
3.5055 1900 - 0.6788
3.6900 2000 0.3083 0.6821
3.8745 2100 - 0.6775
4.0 2168 - 0.6788
4.0590 2200 - 0.6786
4.2435 2300 - 0.6792
4.4280 2400 - 0.6827
4.6125 2500 0.3033 0.6804
4.7970 2600 - 0.6822
4.9815 2700 - 0.6914
5.0 2710 - 0.6880
5.1661 2800 - 0.6809
5.3506 2900 - 0.6853
5.5351 3000 0.2840 0.6852
5.7196 3100 - 0.6844
5.9041 3200 - 0.6886
6.0 3252 - 0.6859
6.0886 3300 - 0.6859
6.2731 3400 - 0.6811
6.4576 3500 0.2669 0.6896
6.6421 3600 - 0.6864
6.8266 3700 - 0.6859
7.0 3794 - 0.6893
7.0111 3800 - 0.6907
7.1956 3900 - 0.6865
7.3801 4000 0.2546 0.6831
7.5646 4100 - 0.6872
7.7491 4200 - 0.6893
7.9336 4300 - 0.6864
8.0 4336 - 0.6900
8.1181 4400 - 0.6885
8.3026 4500 0.2518 0.6857
8.4871 4600 - 0.6874
8.6716 4700 - 0.6834
8.8561 4800 - 0.6859
9.0 4878 - 0.6858
9.0406 4900 - 0.6844
9.2251 5000 0.2392 0.6861
9.4096 5100 - 0.6874
9.5941 5200 - 0.6872
9.7786 5300 - 0.6858
9.9631 5400 - 0.6861
10.0 5420 - 0.6863
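
The step counts in the log can be checked against the training configuration. The arithmetic below assumes a single device, which the card does not state. gradient_accumulation_steps is 1 and dataloader_drop_last is False, as listed in the hyperparameters.

```python
import math

train_samples = 17_319  # size of the training dataset
batch_size = 32         # per_device_train_batch_size
epochs = 10             # num_train_epochs

# The last partial batch is kept (dataloader_drop_last: False), hence the ceiling.
steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

# Fractional epoch reported at the first evaluation (step 100).
epoch_at_step_100 = round(100 / steps_per_epoch, 4)

print(steps_per_epoch, total_steps, epoch_at_step_100)
```

These values agree with the log rows for epoch 1.0 (step 542), epoch 10.0 (step 5420), and step 100 (epoch 0.1845).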

Framework Versions

  • Python: 3.12.6
  • Sentence Transformers: 5.3.0
  • Transformers: 5.5.1
  • PyTorch: 2.11.0+cu130
  • Accelerate: 1.13.0
  • Datasets: 4.8.4
  • Tokenizers: 0.22.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{oord2019representationlearningcontrastivepredictive,
      title={Representation Learning with Contrastive Predictive Coding},
      author={Aaron van den Oord and Yazhe Li and Oriol Vinyals},
      year={2019},
      eprint={1807.03748},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/1807.03748},
}