How to use songphucn7/me5-checkthat-task1-v2 with sentence-transformers:
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("songphucn7/me5-checkthat-task1-v2")
sentences = [
"query: The unexpected repercussions of COVID-19 vaccine policy: why requirements, certificates and limitations could do more damage than benefit | BMJ Global Health",
"passage: title: SARS-CoV-2 infects and replicates in cells of the human endocrine and exocrine pancreas abstract: Infection-related diabetes can arise as a result of virus-associated β-cell destruction.\nClinical data suggest that the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), causing the coronavirus disease 2019 (COVID-19), impairs glucose homoeostasis, but experimental evidence that SARS-CoV-2 can infect pancreatic tissue has been lacking.\nIn the present study, we show that SARS-CoV-2 infects cells of the human exocrine and endocrine pancreas ex vivo and in vivo.\nWe demonstrate that human β-cells express viral entry proteins, and SARS-CoV-2 infects and replicates in cultured human islets.\nInfection is associated with morphological, transcriptional and functional changes, including reduced numbers of insulin-secretory granules in β-cells and impaired glucose-stimulated insulin secretion.\nIn COVID-19 full-body postmortem examinations, we detected SARS-CoV-2 nucleocapsid protein in pancreatic exocrine cells, and in cells that stain positive for the β-cell marker NKX6.\n1 and are in close proximity to the islets of Langerhans in all four patients investigated.\nOur data identify the human pancreas as a target of SARS-CoV-2 infection and suggest that β-cell infection could contribute to the metabolic dysregulation observed in patients with COVID-19.\nSARS-CoV-2 is shown to infect and replicate in human pancreatic tissue, including in β-cells, which is associated with morphological, transcriptomic and functional changes.",
"passage: title: A SARS-like cluster of circulating bat coronaviruses shows potential for human emergence abstract: The emergence of severe acute respiratory syndrome coronavirus (SARS-CoV) and Middle East respiratory syndrome (MERS)-CoV underscores the threat of cross-species transmission events leading to outbreaks in humans.\nHere we examine the disease potential of a SARS-like virus, SHC014-CoV, which is currently circulating in Chinese horseshoe bat populations.\nUsing the SARS-CoV reverse genetics system, we generated and characterized a chimeric virus expressing the spike of bat coronavirus SHC014 in a mouse-adapted SARS-CoV backbone.\nThe results indicate that group 2b viruses encoding the SHC014 spike in a wild-type backbone can efficiently use multiple orthologs of the SARS receptor human angiotensin converting enzyme II (ACE2), replicate efficiently in primary human airway cells and achieve in vitro titers equivalent to epidemic strains of SARS-CoV.\nAdditionally, in vivo experiments demonstrate replication of the chimeric virus in mouse lung with notable pathogenesis.\nEvaluation of available SARS-based immune-therapeutic and prophylactic modalities revealed poor efficacy; both monoclonal antibody and vaccine approaches failed to neutralize and protect from infection with CoVs using the novel spike protein.\nOn the basis of these findings, we synthetically re-derived an infectious full-length SHC014 recombinant virus and demonstrate robust viral replication both in vitro and in vivo.\nOur work suggests a potential risk of SARS-CoV re-emergence from viruses currently circulating in bat populations.",
"passage: title: The unintended consequences of COVID-19 vaccine policy: why mandates, passports and restrictions may cause more harm than good abstract: Vaccination policies have shifted dramatically during COVID-19 with the rapid emergence of population-wide vaccine mandates, domestic vaccine passports and differential restrictions based on vaccination status.\nWhile these policies have prompted ethical, scientific, practical, legal and political debate, there has been limited evaluation of their potential unintended consequences.\nHere, we outline a comprehensive set of hypotheses for why these policies may ultimately be counterproductive and harmful.\nOur framework considers four domains: (1) behavioural psychology, (2) politics and law, (3) socioeconomics, and (4) the integrity of science and public health.\nWhile current vaccines appear to have had a significant impact on decreasing COVID-19-related morbidity and mortality burdens, we argue that current mandatory vaccine policies are scientifically questionable and are likely to cause more societal harm than good.\nRestricting people’s access to work, education, public transport and social life based on COVID-19 vaccination status impinges on human rights, promotes stigma and social polarisation, and adversely affects health and well-being.\nCurrent policies may lead to a widening of health and economic inequalities, detrimental long-term impacts on trust in government and scientific institutions, and reduce the uptake of future public health measures, including COVID-19 vaccines as well as routine immunisations.\nMandating vaccination is one of the most powerful interventions in public health and should be used sparingly and carefully to uphold ethical norms and trust in institutions."
]
embeddings = model.encode(sentences)
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [4, 4]

This is a sentence-transformers model fine-tuned from intfloat/multilingual-e5-large. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
SentenceTransformer(
(0): Transformer({'max_seq_length': 256, 'do_lower_case': False, 'architecture': 'PeftModelForFeatureExtraction'})
(1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
(2): Normalize()
)
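The architecture printout lists three stages: a Transformer, mean-token Pooling, and Normalize. The following is a minimal sketch of the last two stages in plain Python, assuming the token vectors and attention mask have already been produced by the transformer. The function names `mean_pool` and `l2_normalize` are illustrative and are not part of the sentence-transformers API.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask value is 1 (padding is skipped)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    return [value / max(count, 1) for value in total]


def l2_normalize(vector):
    """Scale the vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)


# Toy input: three token vectors of dimension 2; the last one is padding.
tokens = [[1.0, 0.0], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence_embedding = l2_normalize(mean_pool(tokens, mask))
print(sentence_embedding)
```

Because the final stage normalizes every embedding to unit length, the dot product of two embeddings equals their cosine similarity. In the toy example the pooled vector is [2.0, 2.0], which normalizes to roughly [0.707, 0.707].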
First install the Sentence Transformers library:
pip install -U sentence-transformers
Then you can load this model and run inference.
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("songphucn7/me5-checkthat-task1-v2")
# Run inference
sentences = [
'query: @zoeharcombe I\'ll reply it. The "vaccines" have no health benefit and only cause harm. Pharma company trial data says they don\'t work. A straightforward calculation of absolute risk from the Pfizer trial data = .04% effectiveness for severe cases, which is essentially zero.',
'passage: Among 10 cases of severe Covid-19 with onset after the first dose, 9 occurred in placebo recipients and 1 in a BNT162b2 recipient.\n\nThe safety profile of BNT162b2 was characterized by short-term, mild-to-moderate pain at the injection site, fatigue, and headache.\nThe incidence of serious adverse events was low and was similar in the vaccine and placebo groups.\nConclusionsA two-dose regimen of BNT162b2 conferred 95% protection against Covid-19 in persons 16 years of age or older.\nSafety over a median of 2 months was similar to that of other viral vaccines.\n(Funded by BioNTech and Pfizer; ClinicalTrials.\ngov number, NCT04368728.',
"passage: title: Imperfect Vaccination Can Enhance the Transmission of Highly Virulent Pathogens abstract: Could some vaccines drive the evolution of more virulent pathogens?\nConventional wisdom is that natural selection will remove highly lethal pathogens if host death greatly reduces transmission.\nVaccines that keep hosts alive but still allow transmission could thus allow very virulent strains to circulate in a population.\nHere we show experimentally that immunization of chickens against Marek's disease virus enhances the fitness of more virulent strains, making it possible for hyperpathogenic strains to transmit.\nImmunity elicited by direct vaccination or by maternal vaccination prolongs host survival but does not prevent infection, viral replication or transmission, thus extending the infectious periods of strains otherwise too lethal to persist.\nOur data show that anti-disease vaccines that do not prevent transmission can create conditions that promote the emergence of pathogen strains that cause more severe disease in unvaccinated hosts.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]
# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.5866, 0.3283],
# [0.5866, 1.0000, 0.1368],
# [0.3283, 0.1368, 1.0000]])
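For retrieval, only the query row of the similarity matrix matters. As a small sketch, the snippet below ranks the two passages by their similarity to the query, using the scores printed above. The helper name `rank_candidates` is hypothetical.

```python
def rank_candidates(query_scores, candidate_indices):
    """Return candidate indices sorted by descending similarity to the query."""
    return sorted(candidate_indices, key=lambda i: query_scores[i], reverse=True)


# Row 0 of the matrix printed above: similarity of the query to all three texts.
query_row = [1.0000, 0.5866, 0.3283]

# Index 0 is the query itself, so only indices 1 and 2 are candidates.
ranking = rank_candidates(query_row, candidate_indices=[1, 2])
print(ranking)
```

The passage at index 1 (0.5866) ranks ahead of the passage at index 2 (0.3283). In practice, query and passage embeddings can also be encoded separately and passed as two arguments to `model.similarity`, which avoids the self-similarity entries.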
Information retrieval metrics on the 10-percent-dev-split dataset, evaluated with InformationRetrievalEvaluator:

| Metric | Value |
|---|---|
| cosine_accuracy@1 | 0.52 |
| cosine_accuracy@3 | 0.7216 |
| cosine_accuracy@5 | 0.7938 |
| cosine_accuracy@10 | 0.8494 |
| cosine_precision@1 | 0.52 |
| cosine_precision@3 | 0.2405 |
| cosine_precision@5 | 0.1588 |
| cosine_precision@10 | 0.0849 |
| cosine_recall@1 | 0.52 |
| cosine_recall@3 | 0.7216 |
| cosine_recall@5 | 0.7938 |
| cosine_recall@10 | 0.8494 |
| cosine_ndcg@10 | 0.6863 |
| cosine_mrr@10 | 0.6337 |
| cosine_map@100 | 0.6389 |
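To make the metric names concrete, here is a minimal sketch of how accuracy@k, precision@k, recall@k, and the reciprocal rank can be computed per query and then averaged. It follows the usual definitions and is not the evaluator's own code; the document IDs and rankings are made up for illustration.

```python
def accuracy_at_k(ranked_ids, relevant_ids, k):
    """1.0 if at least one relevant document appears in the top k, else 0.0."""
    return 1.0 if any(doc in relevant_ids for doc in ranked_ids[:k]) else 0.0


def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top k results that are relevant."""
    return sum(doc in relevant_ids for doc in ranked_ids[:k]) / k


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents found in the top k."""
    return sum(doc in relevant_ids for doc in ranked_ids[:k]) / len(relevant_ids)


def reciprocal_rank_at_k(ranked_ids, relevant_ids, k):
    """1 / rank of the first relevant document in the top k, or 0.0 if none."""
    for rank, doc in enumerate(ranked_ids[:k], start=1):
        if doc in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean(values):
    return sum(values) / len(values)


# Hypothetical rankings: each query has exactly one relevant document.
runs = [
    (["d3", "d1", "d2"], {"d1"}),  # relevant document at rank 2
    (["d5", "d4", "d6"], {"d5"}),  # relevant document at rank 1
]

acc1 = mean([accuracy_at_k(r, rel, 1) for r, rel in runs])
prec3 = mean([precision_at_k(r, rel, 3) for r, rel in runs])
rec3 = mean([recall_at_k(r, rel, 3) for r, rel in runs])
mrr3 = mean([reciprocal_rank_at_k(r, rel, 3) for r, rel in runs])
print(acc1, prec3, rec3, mrr3)
```

With one relevant document per query, recall@k equals accuracy@k and precision@k equals accuracy@k divided by k. The table above is consistent with that pattern (for example 0.7216 / 3 ≈ 0.2405), which suggests, though the card does not state it, that each query in the dev split has a single relevant passage.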
Columns: sentence_0 and sentence_1

| | sentence_0 | sentence_1 |
|---|---|---|
| type | string | string |
| details | | |

| sentence_0 | sentence_1 |
|---|---|
| query: @user Baloney. Natural immunity is hands down better, and vaccinated people are ending up in the hospital. | passage: title: Longitudinal analysis shows durable and broad immune memory after SARS-CoV-2 infection with persisting antibody responses and memory B and T cells abstract: Ending the COVID-19 pandemic will require long-lived immunity to SARS-CoV-2. |
| query: @Alexand64744343 Meta examen tests #lyme + élevé niveau de preuve scientifique ! : rendement « 53.9% for synthetic C6 peptide ELISA tests & 53.7% when the two-tier methodology was used » Une véritable loterie, 1 cas sur 2 détecté mais persistez à vociférer par ignorance | passage: title: Commercial test kits for detection of Lyme borreliosis: a meta-analysis of test accuracy abstract: The clinical diagnosis of Lyme borreliosis can be supported by various test methodologies; test kits are available from many manufacturers. |
| query: 28 Les systèmes de séquençage haut débit qui servent à la production des banques comme celles du papier de Jaenisch produisent des chimères artefactuelles lors de la PCR. C’est bien connu. Discuté dans : | passage: title: A Survey of Virus Recombination Uncovers Canonical Features of Artificial Chimeras Generated During Deep Sequencing Library Preparation abstract: Abstract Chimeric reads can be generated by in vitro recombination during the preparation of high-throughput sequencing libraries. |
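The training samples follow a fixed text layout: queries start with `query: ` and passages start with `passage: `, usually followed by `title: ... abstract: ...`. The sketch below builds inputs in that layout so that inference text matches the training data. The helper names `format_query` and `format_passage` are hypothetical and are not part of any library.

```python
def format_query(text):
    """Prefix a claim or social media post the way the training queries are."""
    return f"query: {text.strip()}"


def format_passage(abstract, title=None):
    """Prefix a paper the way the training passages are; the title is optional."""
    if title:
        return f"passage: title: {title.strip()} abstract: {abstract.strip()}"
    return f"passage: {abstract.strip()}"


query = format_query("Natural immunity is hands down better.")
passage = format_passage(
    abstract="The clinical diagnosis of Lyme borreliosis can be supported by various test methodologies.",
    title="Commercial test kits for detection of Lyme borreliosis: a meta-analysis of test accuracy",
)
print(query)
print(passage)
```

The resulting strings can be passed directly to `model.encode`. The title is optional here because one of the usage examples in this card contains a passage without a `title:` field.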
Loss: MultipleNegativesRankingLoss with these parameters:
{
"scale": 20.0,
"similarity_fct": "cos_sim",
"gather_across_devices": false,
"directions": [
"query_to_doc"
],
"partition_mode": "joint",
"hardness_mode": null,
"hardness_strength": 0.0
}
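As a rough illustration of what this loss computes, the sketch below implements the in-batch-negatives idea in plain Python: each query's paired passage is the positive, every other passage in the batch is a negative, and cosine similarities are multiplied by `scale` before a cross-entropy over each row. It covers only `scale` and the `query_to_doc` direction; the other listed parameters (`partition_mode`, `hardness_mode`, and so on) are not modeled, and this is not the library's implementation.

```python
import math


def multiple_negatives_ranking_loss(similarity, scale=20.0):
    """Mean cross-entropy over rows, where row i's correct column is i.

    similarity[i][j] is the cosine similarity between query i and passage j.
    """
    losses = []
    for i, row in enumerate(similarity):
        logits = [scale * s for s in row]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Each query matches only its own passage: loss is close to zero.
aligned = [[1.0, 0.0], [0.0, 1.0]]
# Every passage looks equally similar: loss equals log(batch size).
uninformative = [[0.5, 0.5], [0.5, 0.5]]

print(multiple_negatives_ranking_loss(aligned))
print(multiple_negatives_ranking_loss(uninformative))
```

Because every other passage in the batch acts as a negative, a larger batch gives each query more negatives to be ranked against.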
Non-default hyperparameters:

- per_device_train_batch_size: 32
- num_train_epochs: 10
- eval_strategy: steps
- per_device_eval_batch_size: 32
- multi_dataset_batch_sampler: round_robin

All hyperparameters:

- per_device_train_batch_size: 32
- num_train_epochs: 10
- max_steps: -1
- learning_rate: 5e-05
- lr_scheduler_type: linear
- lr_scheduler_kwargs: None
- warmup_steps: 0
- optim: adamw_torch_fused
- optim_args: None
- weight_decay: 0.0
- adam_beta1: 0.9
- adam_beta2: 0.999
- adam_epsilon: 1e-08
- optim_target_modules: None
- gradient_accumulation_steps: 1
- average_tokens_across_devices: True
- max_grad_norm: 1
- label_smoothing_factor: 0.0
- bf16: False
- fp16: False
- bf16_full_eval: False
- fp16_full_eval: False
- tf32: None
- gradient_checkpointing: False
- gradient_checkpointing_kwargs: None
- torch_compile: False
- torch_compile_backend: None
- torch_compile_mode: None
- use_liger_kernel: False
- liger_kernel_config: None
- use_cache: False
- neftune_noise_alpha: None
- torch_empty_cache_steps: None
- auto_find_batch_size: False
- log_on_each_node: True
- logging_nan_inf_filter: True
- include_num_input_tokens_seen: no
- log_level: passive
- log_level_replica: warning
- disable_tqdm: False
- project: huggingface
- trackio_space_id: trackio
- eval_strategy: steps
- per_device_eval_batch_size: 32
- prediction_loss_only: True
- eval_on_start: False
- eval_do_concat_batches: True
- eval_use_gather_object: False
- eval_accumulation_steps: None
- include_for_metrics: []
- batch_eval_metrics: False
- save_only_model: False
- save_on_each_node: False
- enable_jit_checkpoint: False
- push_to_hub: False
- hub_private_repo: None
- hub_model_id: None
- hub_strategy: every_save
- hub_always_push: False
- hub_revision: None
- load_best_model_at_end: False
- ignore_data_skip: False
- restore_callback_states_from_checkpoint: False
- full_determinism: False
- seed: 42
- data_seed: None
- use_cpu: False
- accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
- parallelism_config: None
- dataloader_drop_last: False
- dataloader_num_workers: 0
- dataloader_pin_memory: True
- dataloader_persistent_workers: False
- dataloader_prefetch_factor: None
- remove_unused_columns: True
- label_names: None
- train_sampling_strategy: random
- length_column_name: length
- ddp_find_unused_parameters: None
- ddp_bucket_cap_mb: None
- ddp_broadcast_buffers: False
- ddp_backend: None
- ddp_timeout: 1800
- fsdp: []
- fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
- deepspeed: None
- debug: []
- skip_memory_metrics: True
- do_predict: False
- resume_from_checkpoint: None
- warmup_ratio: None
- local_rank: -1
- prompts: None
- batch_sampler: batch_sampler
- multi_dataset_batch_sampler: round_robin
- router_mapping: {}
- learning_rate_mapping: {}

Training logs:

| Epoch | Step | Training Loss | 10-percent-dev-split_cosine_ndcg@10 |
|---|---|---|---|
| 0.1845 | 100 | - | 0.6380 |
| 0.3690 | 200 | - | 0.6400 |
| 0.5535 | 300 | - | 0.6512 |
| 0.7380 | 400 | - | 0.6642 |
| 0.9225 | 500 | 0.8957 | 0.6640 |
| 1.0 | 542 | - | 0.6626 |
| 1.1070 | 600 | - | 0.6658 |
| 1.2915 | 700 | - | 0.6676 |
| 1.4760 | 800 | - | 0.6695 |
| 1.6605 | 900 | - | 0.6719 |
| 1.8450 | 1000 | 0.3899 | 0.6776 |
| 2.0 | 1084 | - | 0.6744 |
| 2.0295 | 1100 | - | 0.6761 |
| 2.2140 | 1200 | - | 0.6759 |
| 2.3985 | 1300 | - | 0.6761 |
| 2.5830 | 1400 | - | 0.6830 |
| 2.7675 | 1500 | 0.3484 | 0.6779 |
| 2.9520 | 1600 | - | 0.6793 |
| 3.0 | 1626 | - | 0.6762 |
| 3.1365 | 1700 | - | 0.6823 |
| 3.3210 | 1800 | - | 0.6831 |
| 3.5055 | 1900 | - | 0.6788 |
| 3.6900 | 2000 | 0.3083 | 0.6821 |
| 3.8745 | 2100 | - | 0.6775 |
| 4.0 | 2168 | - | 0.6788 |
| 4.0590 | 2200 | - | 0.6786 |
| 4.2435 | 2300 | - | 0.6792 |
| 4.4280 | 2400 | - | 0.6827 |
| 4.6125 | 2500 | 0.3033 | 0.6804 |
| 4.7970 | 2600 | - | 0.6822 |
| 4.9815 | 2700 | - | 0.6914 |
| 5.0 | 2710 | - | 0.6880 |
| 5.1661 | 2800 | - | 0.6809 |
| 5.3506 | 2900 | - | 0.6853 |
| 5.5351 | 3000 | 0.2840 | 0.6852 |
| 5.7196 | 3100 | - | 0.6844 |
| 5.9041 | 3200 | - | 0.6886 |
| 6.0 | 3252 | - | 0.6859 |
| 6.0886 | 3300 | - | 0.6859 |
| 6.2731 | 3400 | - | 0.6811 |
| 6.4576 | 3500 | 0.2669 | 0.6896 |
| 6.6421 | 3600 | - | 0.6864 |
| 6.8266 | 3700 | - | 0.6859 |
| 7.0 | 3794 | - | 0.6893 |
| 7.0111 | 3800 | - | 0.6907 |
| 7.1956 | 3900 | - | 0.6865 |
| 7.3801 | 4000 | 0.2546 | 0.6831 |
| 7.5646 | 4100 | - | 0.6872 |
| 7.7491 | 4200 | - | 0.6893 |
| 7.9336 | 4300 | - | 0.6864 |
| 8.0 | 4336 | - | 0.6900 |
| 8.1181 | 4400 | - | 0.6885 |
| 8.3026 | 4500 | 0.2518 | 0.6857 |
| 8.4871 | 4600 | - | 0.6874 |
| 8.6716 | 4700 | - | 0.6834 |
| 8.8561 | 4800 | - | 0.6859 |
| 9.0 | 4878 | - | 0.6858 |
| 9.0406 | 4900 | - | 0.6844 |
| 9.2251 | 5000 | 0.2392 | 0.6861 |
| 9.4096 | 5100 | - | 0.6874 |
| 9.5941 | 5200 | - | 0.6872 |
| 9.7786 | 5300 | - | 0.6858 |
| 9.9631 | 5400 | - | 0.6861 |
| 10.0 | 5420 | - | 0.6863 |
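The Epoch and Step columns, together with the listed hyperparameters (`num_train_epochs: 10`, `per_device_train_batch_size: 32`, `learning_rate: 5e-05`, `lr_scheduler_type: linear`, `warmup_steps: 0`), allow a few back-of-the-envelope checks. The sketch below assumes training on a single device with no gradient accumulation. The device count is not stated in this card, so the dataset-size bounds are an estimate rather than a reported figure.

```python
TOTAL_STEPS = 5420  # last step in the training log
NUM_EPOCHS = 10
BATCH_SIZE = 32     # per_device_train_batch_size
BASE_LR = 5e-05     # learning_rate

steps_per_epoch = TOTAL_STEPS // NUM_EPOCHS


def epoch_at(step):
    """Fractional epoch reached after the given optimizer step."""
    return step / steps_per_epoch


def linear_lr(step):
    """Learning rate under linear decay to zero with no warmup."""
    return BASE_LR * max(0.0, (TOTAL_STEPS - step) / TOTAL_STEPS)


# With dataloader_drop_last False, the last batch of an epoch may be partial,
# so the training set size is only known up to one batch (single-device assumption).
max_examples = steps_per_epoch * BATCH_SIZE
min_examples = (steps_per_epoch - 1) * BATCH_SIZE + 1

print(steps_per_epoch, round(epoch_at(100), 4), linear_lr(2710))
print(min_examples, max_examples)
```

Under these assumptions an epoch is 542 steps, which matches the Epoch value of 0.1845 logged at step 100, and the training set holds between 17,313 and 17,344 pairs. Note that `load_best_model_at_end` is False, so the published weights correspond to the final step (nDCG@10 of 0.6863) rather than the best logged value (0.6914 at step 2700).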
@inproceedings{reimers-2019-sentence-bert,
title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
author = "Reimers, Nils and Gurevych, Iryna",
booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
month = "11",
year = "2019",
publisher = "Association for Computational Linguistics",
url = "https://arxiv.org/abs/1908.10084",
}
@misc{oord2019representationlearningcontrastivepredictive,
title={Representation Learning with Contrastive Predictive Coding},
author={Aaron van den Oord and Yazhe Li and Oriol Vinyals},
year={2019},
eprint={1807.03748},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/1807.03748},
}
Base model: intfloat/multilingual-e5-large