How to use songphucn7/me5-checkthat-task1-v2 with sentence-transformers:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("songphucn7/me5-checkthat-task1-v2")

sentences = [
    "query: The unexpected repercussions of COVID-19 vaccine policy: why requirements, certificates and limitations could do more damage than benefit | BMJ Global Health",
    "passage: title: SARS-CoV-2 infects and replicates in cells of the human endocrine and exocrine pancreas abstract: Infection-related diabetes can arise as a result of virus-associated β-cell destruction.\nClinical data suggest that the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), causing the coronavirus disease 2019 (COVID-19), impairs glucose homoeostasis, but experimental evidence that SARS-CoV-2 can infect pancreatic tissue has been lacking.\nIn the present study, we show that SARS-CoV-2 infects cells of the human exocrine and endocrine pancreas ex vivo and in vivo.\nWe demonstrate that human β-cells express viral entry proteins, and SARS-CoV-2 infects and replicates in cultured human islets.\nInfection is associated with morphological, transcriptional and functional changes, including reduced numbers of insulin-secretory granules in β-cells and impaired glucose-stimulated insulin secretion.\nIn COVID-19 full-body postmortem examinations, we detected SARS-CoV-2 nucleocapsid protein in pancreatic exocrine cells, and in cells that stain positive for the β-cell marker NKX6.\n1 and are in close proximity to the islets of Langerhans in all four patients investigated.\nOur data identify the human pancreas as a target of SARS-CoV-2 infection and suggest that β-cell infection could contribute to the metabolic dysregulation observed in patients with COVID-19.\nSARS-CoV-2 is shown to infect and replicate in human pancreatic tissue, including in β-cells, which is associated with morphological, transcriptomic and functional changes.",
    "passage: title: A SARS-like cluster of circulating bat coronaviruses shows potential for human emergence abstract: The emergence of severe acute respiratory syndrome coronavirus (SARS-CoV) and Middle East respiratory syndrome (MERS)-CoV underscores the threat of cross-species transmission events leading to outbreaks in humans.\nHere we examine the disease potential of a SARS-like virus, SHC014-CoV, which is currently circulating in Chinese horseshoe bat populations.\nUsing the SARS-CoV reverse genetics system, we generated and characterized a chimeric virus expressing the spike of bat coronavirus SHC014 in a mouse-adapted SARS-CoV backbone.\nThe results indicate that group 2b viruses encoding the SHC014 spike in a wild-type backbone can efficiently use multiple orthologs of the SARS receptor human angiotensin converting enzyme II (ACE2), replicate efficiently in primary human airway cells and achieve in vitro titers equivalent to epidemic strains of SARS-CoV.\nAdditionally, in vivo experiments demonstrate replication of the chimeric virus in mouse lung with notable pathogenesis.\nEvaluation of available SARS-based immune-therapeutic and prophylactic modalities revealed poor efficacy; both monoclonal antibody and vaccine approaches failed to neutralize and protect from infection with CoVs using the novel spike protein.\nOn the basis of these findings, we synthetically re-derived an infectious full-length SHC014 recombinant virus and demonstrate robust viral replication both in vitro and in vivo.\nOur work suggests a potential risk of SARS-CoV re-emergence from viruses currently circulating in bat populations.",
    "passage: title: The unintended consequences of COVID-19 vaccine policy: why mandates, passports and restrictions may cause more harm than good abstract: Vaccination policies have shifted dramatically during COVID-19 with the rapid emergence of population-wide vaccine mandates, domestic vaccine passports and differential restrictions based on vaccination status.\nWhile these policies have prompted ethical, scientific, practical, legal and political debate, there has been limited evaluation of their potential unintended consequences.\nHere, we outline a comprehensive set of hypotheses for why these policies may ultimately be counterproductive and harmful.\nOur framework considers four domains: (1) behavioural psychology, (2) politics and law, (3) socioeconomics, and (4) the integrity of science and public health.\nWhile current vaccines appear to have had a significant impact on decreasing COVID-19-related morbidity and mortality burdens, we argue that current mandatory vaccine policies are scientifically questionable and are likely to cause more societal harm than good.\nRestricting people’s access to work, education, public transport and social life based on COVID-19 vaccination status impinges on human rights, promotes stigma and social polarisation, and adversely affects health and well-being.\nCurrent policies may lead to a widening of health and economic inequalities, detrimental long-term impacts on trust in government and scientific institutions, and reduce the uptake of future public health measures, including COVID-19 vaccines as well as routine immunisations.\nMandating vaccination is one of the most powerful interventions in public health and should be used sparingly and carefully to uphold ethical norms and trust in institutions."
]
embeddings = model.encode(sentences)

similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([4, 4])


SentenceTransformer based on intfloat/multilingual-e5-large

This is a sentence-transformers model fine-tuned from intfloat/multilingual-e5-large. It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

Model Type: Sentence Transformer
Base model: intfloat/multilingual-e5-large
Maximum Sequence Length: 256 tokens
Output Dimensionality: 1024 dimensions
Similarity Function: Cosine Similarity

Model Sources

Documentation: Sentence Transformers Documentation
Repository: Sentence Transformers on GitHub
Hugging Face: Sentence Transformers on Hugging Face

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False, 'architecture': 'PeftModelForFeatureExtraction'})
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
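The Pooling and Normalize modules above can be illustrated with a small pure-Python sketch. It assumes mean pooling over non-padding tokens followed by L2 normalization, as the `pooling_mode_mean_tokens: True` setting and the `Normalize()` module suggest. The function names and toy vectors are hypothetical and are not part of the library.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is ignored)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    return [value / max(count, 1) for value in total]

def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)

# Toy input (hypothetical): three 2-dimensional token vectors, the last is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)
sentence_embedding = l2_normalize(pooled)
print(pooled)
print(sentence_embedding)
```

Because the final vectors have unit length, the cosine similarity of two embeddings equals their dot product.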

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("songphucn7/me5-checkthat-task1-v2")
# Run inference
sentences = [
    'query: @zoeharcombe I\'ll reply it. The "vaccines" have no health benefit and only cause harm. Pharma company trial data says they don\'t work. A straightforward calculation of absolute risk from the Pfizer trial data = .04% effectiveness for severe cases, which is essentially zero.',
    'passage: Among 10 cases of severe Covid-19 with onset after the first dose, 9 occurred in placebo recipients and 1 in a BNT162b2 recipient.\n\nThe safety profile of BNT162b2 was characterized by short-term, mild-to-moderate pain at the injection site, fatigue, and headache.\nThe incidence of serious adverse events was low and was similar in the vaccine and placebo groups.\nConclusionsA two-dose regimen of BNT162b2 conferred 95% protection against Covid-19 in persons 16 years of age or older.\nSafety over a median of 2 months was similar to that of other viral vaccines.\n(Funded by BioNTech and Pfizer; ClinicalTrials.\ngov number, NCT04368728.',
    "passage: title: Imperfect Vaccination Can Enhance the Transmission of Highly Virulent Pathogens abstract: Could some vaccines drive the evolution of more virulent pathogens?\nConventional wisdom is that natural selection will remove highly lethal pathogens if host death greatly reduces transmission.\nVaccines that keep hosts alive but still allow transmission could thus allow very virulent strains to circulate in a population.\nHere we show experimentally that immunization of chickens against Marek's disease virus enhances the fitness of more virulent strains, making it possible for hyperpathogenic strains to transmit.\nImmunity elicited by direct vaccination or by maternal vaccination prolongs host survival but does not prevent infection, viral replication or transmission, thus extending the infectious periods of strains otherwise too lethal to persist.\nOur data show that anti-disease vaccines that do not prevent transmission can create conditions that promote the emergence of pathogen strains that cause more severe disease in unvaccinated hosts.",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1024)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.5866, 0.3283],
#         [0.5866, 1.0000, 0.1368],
#         [0.3283, 0.1368, 1.0000]])
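For retrieval, only the query row of the similarity matrix matters. The following sketch reuses the scores printed above and ranks the two passages for the query. The helper `rank_passages` is a hypothetical name introduced for illustration; it is not a library function.

```python
# Similarity matrix copied from the printed output above:
# index 0 is the query, indices 1 and 2 are the two passages.
similarities = [
    [1.0000, 0.5866, 0.3283],
    [0.5866, 1.0000, 0.1368],
    [0.3283, 0.1368, 1.0000],
]

def rank_passages(similarity_row, passage_indices):
    """Return passage indices sorted by descending similarity to the query."""
    return sorted(passage_indices, key=lambda i: similarity_row[i], reverse=True)

ranking = rank_passages(similarities[0], [1, 2])
for position, index in enumerate(ranking, start=1):
    print(position, index, similarities[0][index])
```

Note that the inputs in the example carry the `query: ` and `passage: ` prefixes, so new inputs should presumably follow the same convention.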

Evaluation

Metrics

Information Retrieval

Dataset: 10-percent-dev-split
Evaluated with InformationRetrievalEvaluator

Metric	Value
cosine_accuracy@1	0.52
cosine_accuracy@3	0.7216
cosine_accuracy@5	0.7938
cosine_accuracy@10	0.8494
cosine_precision@1	0.52
cosine_precision@3	0.2405
cosine_precision@5	0.1588
cosine_precision@10	0.0849
cosine_recall@1	0.52
cosine_recall@3	0.7216
cosine_recall@5	0.7938
cosine_recall@10	0.8494
cosine_ndcg@10	0.6863
cosine_mrr@10	0.6337
cosine_map@100	0.6389
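In the table, accuracy@k equals recall@k, and precision@k is approximately accuracy@k divided by k (for example, 0.7216 / 3 ≈ 0.2405). That pattern is what one would expect if every query has exactly one relevant passage; this is an inference from the numbers, not something the card states. The sketch below shows how these per-query metrics can be computed under that assumption. The function names and the toy rankings are hypothetical, and this is not the `InformationRetrievalEvaluator` implementation.

```python
def accuracy_at_k(ranked, relevant, k):
    """1.0 if at least one relevant document is in the top k, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0

def precision_at_k(ranked, relevant, k):
    """Fraction of the top k results that are relevant."""
    return sum(doc in relevant for doc in ranked[:k]) / k

def recall_at_k(ranked, relevant, k):
    """Fraction of all relevant documents found in the top k."""
    return sum(doc in relevant for doc in ranked[:k]) / len(relevant)

def reciprocal_rank_at_k(ranked, relevant, k):
    """1 / rank of the first relevant document in the top k, else 0.0."""
    for position, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / position
    return 0.0

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Hypothetical toy data: (ranked document ids, set of relevant ids) per query.
runs = [
    (["d1", "d7", "d3"], {"d1"}),  # relevant document at rank 1
    (["d4", "d2", "d9"], {"d2"}),  # relevant document at rank 2
    (["d5", "d6", "d8"], {"d0"}),  # relevant document not retrieved
]

k = 3
acc = mean(accuracy_at_k(r, rel, k) for r, rel in runs)
prec = mean(precision_at_k(r, rel, k) for r, rel in runs)
rec = mean(recall_at_k(r, rel, k) for r, rel in runs)
mrr = mean(reciprocal_rank_at_k(r, rel, k) for r, rel in runs)
print(acc, prec, rec, mrr)
```

With one relevant document per query, `acc` and `rec` coincide and `prec` equals `acc / k`, mirroring the relationship seen in the reported values.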

Training Details

Training Dataset

Unnamed Dataset

Size: 17,319 training samples
Columns: sentence_0 and sentence_1
Approximate statistics based on the first 1000 samples:
	sentence_0	sentence_1
type	string	string
details	min: 21 tokens, mean: 59.2 tokens, max: 136 tokens	min: 11 tokens, mean: 205.0 tokens, max: 256 tokens

Samples:

sentence_0	sentence_1
`query: @user Baloney. Natural immunity is hands down better, and vaccinated people are ending up in the hospital.`	passage: title: Longitudinal analysis shows durable and broad immune memory after SARS-CoV-2 infection with persisting antibody responses and memory B and T cells abstract: Ending the COVID-19 pandemic will require long-lived immunity to SARS-CoV-2. Here, we evaluate 254 COVID-19 patients longitudinally up to 8 months and find durable broad-based immune responses. SARS-CoV-2 spike binding and neutralizing antibodies exhibit a bi-phasic decay with an extended half-life of >200 days suggesting the generation of longer-lived plasma cells. SARS-CoV-2 infection also boosts antibody titers to SARS-CoV-1 and common betacoronaviruses. In addition, spike-specific IgG+ memory B cells persist, which bodes well for a rapid antibody response upon virus re-exposure or vaccination. Virus-specific CD4+ and CD8+ T cells are polyfunctional and maintained with an estimated half-life of 200 days. Interestingly, CD4+ T cell responses equally target several SARS-CoV-2 proteins, whereas the CD8+ T cell respo...
`query: @Alexand64744343 Meta examen tests #lyme + élevé niveau de preuve scientifique ! : rendement « 53.9% for synthetic C6 peptide ELISA tests & 53.7% when the two-tier methodology was used » Une véritable loterie, 1 cas sur 2 détecté mais persistez à vociférer par ignorance`	passage: title: Commercial test kits for detection of Lyme borreliosis: a meta-analysis of test accuracy abstract: The clinical diagnosis of Lyme borreliosis can be supported by various test methodologies; test kits are available from many manufacturers. Literature searches were carried out to identify studies that reported characteristics of the test kits. Of 50 searched studies, 18 were included where the tests were commercially available and samples were proven to be positive using serology testing, evidence of an erythema migrans rash, and/or culture. Additional requirements were a test specificity of ≥85% and publication in the last 20 years. The weighted mean sensitivity for all tests and for all samples was 59. 5%. Individual study means varied from 30. 6% to 86. 2%. Sensitivity for each test technology varied from 62. 4% for Western blot kits, and 62. 3% for enzyme-linked immunosorbent assay tests, to 53. 9% for synthetic C6 peptide ELISA tests and 53. 7% when the two-tier meth...
`query: 28 Les systèmes de séquençage haut débit qui servent à la production des banques comme celles du papier de Jaenisch produisent des chimères artefactuelles lors de la PCR. C’est bien connu. Discuté dans :`	passage: title: A Survey of Virus Recombination Uncovers Canonical Features of Artificial Chimeras Generated During Deep Sequencing Library Preparation abstract: Abstract Chimeric reads can be generated by in vitro recombination during the preparation of high-throughput sequencing libraries. Our attempt to detect biological recombination between the genomes of dengue virus (DENV; +ssRNA genome) and its mosquito host using the Illumina Nextera sequencing library preparation kit revealed that most, if not all, detected host–virus chimeras were artificial. Indeed, these chimeras were not more frequent than with control RNA from another species (a pillbug), which was never in contact with DENV RNA prior to the library preparation. The proportion of chimera types merely reflected those of the three species among sequencing reads. Chimeras were frequently characterized by the presence of 1-20 bp microhomology between recombining fragments. Within-species chimeras mostly involved fragments in...

Loss: MultipleNegativesRankingLoss with these parameters:

{
    "scale": 20.0,
    "similarity_fct": "cos_sim",
    "gather_across_devices": false,
    "directions": [
        "query_to_doc"
    ],
    "partition_mode": "joint",
    "hardness_mode": null,
    "hardness_strength": 0.0
}
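As a rough illustration of what these parameters mean, the sketch below computes an in-batch ranking loss in pure Python: each query is scored against every passage in the batch with cosine similarity, the scores are multiplied by `scale` (20.0 here), and cross-entropy is taken with the paired passage as the target. It models only the `query_to_doc` direction and ignores `partition_mode` and the hardness settings. The function names and toy vectors are hypothetical, and this is not the library's implementation.

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def in_batch_ranking_loss(queries, passages, scale=20.0):
    """Mean cross-entropy where passage i is the positive for query i and
    all other passages in the batch serve as negatives."""
    losses = []
    for i, query in enumerate(queries):
        logits = [scale * cos_sim(query, passage) for passage in passages]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Hypothetical toy batch of two (query, passage) pairs.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each passage is close to its own query
swapped = [[0.1, 0.9], [0.9, 0.1]]   # each passage is close to the other query

loss_aligned = in_batch_ranking_loss(queries, aligned)
loss_swapped = in_batch_ranking_loss(queries, swapped)
print(loss_aligned, loss_swapped)
```

The loss is small when each query is most similar to its own passage and large when another passage in the batch scores higher, which is why larger batches supply more negatives per step.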

Training Hyperparameters

Non-Default Hyperparameters

per_device_train_batch_size: 32
num_train_epochs: 10
eval_strategy: steps
per_device_eval_batch_size: 32
multi_dataset_batch_sampler: round_robin

All Hyperparameters

per_device_train_batch_size: 32
num_train_epochs: 10
max_steps: -1
learning_rate: 5e-05
lr_scheduler_type: linear
lr_scheduler_kwargs: None
warmup_steps: 0
optim: adamw_torch_fused
optim_args: None
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
optim_target_modules: None
gradient_accumulation_steps: 1
average_tokens_across_devices: True
max_grad_norm: 1
label_smoothing_factor: 0.0
bf16: False
fp16: False
bf16_full_eval: False
fp16_full_eval: False
tf32: None
gradient_checkpointing: False
gradient_checkpointing_kwargs: None
torch_compile: False
torch_compile_backend: None
torch_compile_mode: None
use_liger_kernel: False
liger_kernel_config: None
use_cache: False
neftune_noise_alpha: None
torch_empty_cache_steps: None
auto_find_batch_size: False
log_on_each_node: True
logging_nan_inf_filter: True
include_num_input_tokens_seen: no
log_level: passive
log_level_replica: warning
disable_tqdm: False
project: huggingface
trackio_space_id: trackio
eval_strategy: steps
per_device_eval_batch_size: 32
prediction_loss_only: True
eval_on_start: False
eval_do_concat_batches: True
eval_use_gather_object: False
eval_accumulation_steps: None
include_for_metrics: []
batch_eval_metrics: False
save_only_model: False
save_on_each_node: False
enable_jit_checkpoint: False
push_to_hub: False
hub_private_repo: None
hub_model_id: None
hub_strategy: every_save
hub_always_push: False
hub_revision: None
load_best_model_at_end: False
ignore_data_skip: False
restore_callback_states_from_checkpoint: False
full_determinism: False
seed: 42
data_seed: None
use_cpu: False
accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
parallelism_config: None
dataloader_drop_last: False
dataloader_num_workers: 0
dataloader_pin_memory: True
dataloader_persistent_workers: False
dataloader_prefetch_factor: None
remove_unused_columns: True
label_names: None
train_sampling_strategy: random
length_column_name: length
ddp_find_unused_parameters: None
ddp_bucket_cap_mb: None
ddp_broadcast_buffers: False
ddp_backend: None
ddp_timeout: 1800
fsdp: []
fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
deepspeed: None
debug: []
skip_memory_metrics: True
do_predict: False
resume_from_checkpoint: None
warmup_ratio: None
local_rank: -1
prompts: None
batch_sampler: batch_sampler
multi_dataset_batch_sampler: round_robin
router_mapping: {}
learning_rate_mapping: {}

Training Logs

Epoch	Step	Training Loss	10-percent-dev-split_cosine_ndcg@10
0.1845	100	-	0.6380
0.3690	200	-	0.6400
0.5535	300	-	0.6512
0.7380	400	-	0.6642
0.9225	500	0.8957	0.6640
1.0	542	-	0.6626
1.1070	600	-	0.6658
1.2915	700	-	0.6676
1.4760	800	-	0.6695
1.6605	900	-	0.6719
1.8450	1000	0.3899	0.6776
2.0	1084	-	0.6744
2.0295	1100	-	0.6761
2.2140	1200	-	0.6759
2.3985	1300	-	0.6761
2.5830	1400	-	0.6830
2.7675	1500	0.3484	0.6779
2.9520	1600	-	0.6793
3.0	1626	-	0.6762
3.1365	1700	-	0.6823
3.3210	1800	-	0.6831
3.5055	1900	-	0.6788
3.6900	2000	0.3083	0.6821
3.8745	2100	-	0.6775
4.0	2168	-	0.6788
4.0590	2200	-	0.6786
4.2435	2300	-	0.6792
4.4280	2400	-	0.6827
4.6125	2500	0.3033	0.6804
4.7970	2600	-	0.6822
4.9815	2700	-	0.6914
5.0	2710	-	0.6880
5.1661	2800	-	0.6809
5.3506	2900	-	0.6853
5.5351	3000	0.2840	0.6852
5.7196	3100	-	0.6844
5.9041	3200	-	0.6886
6.0	3252	-	0.6859
6.0886	3300	-	0.6859
6.2731	3400	-	0.6811
6.4576	3500	0.2669	0.6896
6.6421	3600	-	0.6864
6.8266	3700	-	0.6859
7.0	3794	-	0.6893
7.0111	3800	-	0.6907
7.1956	3900	-	0.6865
7.3801	4000	0.2546	0.6831
7.5646	4100	-	0.6872
7.7491	4200	-	0.6893
7.9336	4300	-	0.6864
8.0	4336	-	0.6900
8.1181	4400	-	0.6885
8.3026	4500	0.2518	0.6857
8.4871	4600	-	0.6874
8.6716	4700	-	0.6834
8.8561	4800	-	0.6859
9.0	4878	-	0.6858
9.0406	4900	-	0.6844
9.2251	5000	0.2392	0.6861
9.4096	5100	-	0.6874
9.5941	5200	-	0.6872
9.7786	5300	-	0.6858
9.9631	5400	-	0.6861
10.0	5420	-	0.6863
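The step and epoch columns in the log follow from the dataset size and batch size reported earlier. The sketch below reproduces that arithmetic, assuming a single device, `gradient_accumulation_steps: 1`, and `dataloader_drop_last: False`, so that the last partial batch counts as a step. It also evaluates a linear decay from `learning_rate: 5e-05` to zero with no warmup, which is the usual behaviour of a `linear` scheduler and is an assumption here; the helper names are hypothetical.

```python
import math

train_samples = 17_319   # "Size" in the Training Dataset section
batch_size = 32          # per_device_train_batch_size
epochs = 10              # num_train_epochs
base_lr = 5e-05          # learning_rate

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

def epoch_at(step):
    """Fractional epoch reached after a given optimizer step."""
    return step / steps_per_epoch

def linear_lr(step):
    """Assumed linear decay from base_lr to 0 over total_steps, no warmup."""
    return base_lr * max(0.0, 1.0 - step / total_steps)

print(steps_per_epoch, total_steps, round(epoch_at(100), 4))
# 542 5420 0.1845
print(linear_lr(total_steps // 2))
```

These values match the log: epoch 1.0 is reached at step 542, step 100 corresponds to epoch 0.1845, and training ends at step 5420.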

Framework Versions

Python: 3.12.6
Sentence Transformers: 5.3.0
Transformers: 5.5.1
PyTorch: 2.11.0+cu130
Accelerate: 1.13.0
Datasets: 4.8.4
Tokenizers: 0.22.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{oord2019representationlearningcontrastivepredictive,
      title={Representation Learning with Contrastive Predictive Coding},
      author={Aaron van den Oord and Yazhe Li and Oriol Vinyals},
      year={2019},
      eprint={1807.03748},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/1807.03748},
}
