metadata
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - generated_from_trainer
  - dataset_size:10836
  - loss:TripletLoss
base_model: sentence-transformers/all-mpnet-base-v2
widget:
  - source_sentence: >-
      how has lack of testing availability led to underreporting of true
      incidence of Covid-19?
    sentences:
      - >-
        can an effective sars-cov-2 vaccine be developed for the older
        population [SEP] the emergence of sars-cov-2 and its inordinately rapid
        spread is posing severe challenges to the wellbeing of millions of
        people worldwide, health care systems and the global economy. we aim to
        provide a platform exclusively for discussions of individual and age
        differences in susceptibility and immune responses to covid caused by
        sars-cov-2 infection and how to prevent or reduce severity of disease in
        older adults.
      - >-
        the impact of changes in diagnostic testing practices on estimates of
        covid-19 transmission in the united states [SEP] estimates of the
        reproductive number for novel pathogens such as sars-cov-2 are essential
        for understanding the potential trajectory of the epidemic and the level
        of intervention that is needed to bring the epidemic under control.
        however, most methods for estimating the basic reproductive number
        (r(0)) and time-varying effective reproductive number (r(t)) assume that
        the fraction of cases detected and reported is constant through time.
      - nan
  - source_sentence: >-
      will SARS-CoV2 infected people develop immunity? Is cross protection
      possible?
    sentences:
      - nan
      - >-
        medical ethics in disasters [SEP] disasters frequently create demands
        that outstrip available existing medical and societal resources.
        disaster may, for example, not only strike care providers and hospital
        facilities directly; they may decimate communities capacities to provide
        food to the population and carry out critical waste disposal services.
      - >-
        sars coronavirus pathogenesis: host innate immune responses and viral
        antagonism of interferon [SEP] sars-cov is a pathogenic coronavirus that
        emerged from a zoonotic reservoir, leading to global dissemination of
        the virus. the association sars-cov with aberrant cytokine, chemokine,
        and interferon stimulated gene (isg) responses in patients provided
        evidence that sars-cov pathogenesis is at least partially controlled by
        innate immune signaling.
  - source_sentence: >-
      what kinds of complications related to COVID-19 are associated with
      diabetes
    sentences:
      - >-
        recommendation to optimize safety of elective surgical care while
        limiting the spread of covid-19: primum non nocere [SEP] covid-19 has
        drastically altered our lives in an unprecedented manner, shuttering
        industries, and leaving most of the country in isolation as we adapt to
        the evolving crisis. the optimal solution of how to effectively balance
        the resumption of standard surgical care while doing everything possible
        to limit the spread of covid-19 is undetermined, and could include
        strategies such as social distancing, screening forms and tests
        including temperature screening, segregation of inpatient and outpatient
        teams, proper use of protective gear, and the use of ambulatory surgery
        centers (ascs) to provide elective, yet ultimately essential, surgical
        care while conserving resources and protecting the health of patients
        and health-care providers.
      - >-
        upper airway symptoms in coronavirus disease 2019 (covid-19) [SEP] upper
        airway symptoms in coronavirus disease 2019 (covid-19)
      - >-
        diabetes mellitus is associated with increased mortality and severity of
        disease in covid-19 pneumonia a systematic review, meta-analysis, and
        meta-regression [SEP] background and aims diabetes mellitus (dm) is
        chronic conditions with devastating multi-systemic complication and may
        be associated with severe form of coronavirus disease 2019 (covid-19).
        subgroup analysis showed that the association was weaker in studies with
        median age 55 years-old (rr 1.92) compared to 55 years-old (rr 3.48),
        and in prevalence of hypertension 25 (rr 1.93) compared to 25 (rr 3.06).
  - source_sentence: coronavirus early symptoms
    sentences:
      - >-
        the common cold in frail older persons: impact of rhinovirus and
        coronavirus in a senior daycare center [SEP] objective: to evaluate the
        incidence and impact of rhinovirus and coronavirus infections in older
        persons attending daycare. patients: frail older persons and staff
        members of the daycare centers who developed signs or symptoms of an
        acute respiratory illness measurements: demographic, medical, and
        physical findings were recorded on subjects at baseline and during
        respiratory illness.
      - >-
        epidemiology, clinical course, and outcomes of critically ill adults
        with covid-19 in new york city: a prospective cohort study [SEP]
        background: nearly 30,000 patients with coronavirus disease-2019
        (covid-19) have been hospitalized in new york city as of april 14th,
        2020. results: of 1,150 adults hospitalized with covid-19 during the
        study period, 257 (22) were critically ill.
      - >-
        coronavirus disease (covid-19): a primer for emergency physicians [SEP]
        introduction: rapid worldwide spread of coronavirus disease 2019
        (covid-19) has resulted in a global pandemic. discussion: severe acute
        respiratory syndrome coronavirus 2 (sars-cov-2), the virus responsible
        for causing covid-19, is primarily transmitted from person-to-person
        through close contact (approximately 6 ft) by respiratory droplets.
  - source_sentence: what types of rapid testing for Covid-19 have been developed?
    sentences:
      - >-
        on the assessment of more reliable covid-19 infected number: the italian
        case. [SEP] covid-19 (sars-cov-2) is the most recent pandemic disease
        the world is currently managing. patients affected by covid-19 are
        identified employing medical swabs applied mainly to (i) citizens with
        covid-19 symptoms such as flu or high temperature, or (ii) citizens that
        had contacts with covid-19 patients.
      - >-
        lack of antiviral activity of darunavir against sars-cov-2 [SEP] given
        the high need and the absence of specific antivirals for treatment of
        covid-19 (the disease caused by severe acute respiratory
        syndrome-associated coronavirus-2 sars-cov-2), human immunodeficiency
        virus (hiv) protease inhibitors are being considered as therapeutic
        alternatives. overall, the data do not support the use of drv for
        treatment of covid-19.
      - >-
        the covid-19 pandemic: important considerations for contact lens
        practitioners [SEP] a novel coronavirus (cov), the severe acute
        respiratory syndrome coronavirus - 2 (sars-cov-2), results in the
        coronavirus disease 2019 (covid-19). thus, it is imperative cl wearers
        are reminded of the steps they should follow to minimise their risk of
        complications, to reduce their need to leave isolation and seek care.
pipeline_tag: sentence-similarity
library_name: sentence-transformers
metrics:
  - cosine_accuracy@1
  - cosine_accuracy@3
  - cosine_accuracy@5
  - cosine_accuracy@10
  - cosine_precision@1
  - cosine_precision@3
  - cosine_precision@5
  - cosine_precision@10
  - cosine_recall@1
  - cosine_recall@3
  - cosine_recall@5
  - cosine_recall@10
  - cosine_ndcg@10
  - cosine_mrr@10
  - cosine_map@100
model-index:
  - name: SentenceTransformer based on sentence-transformers/all-mpnet-base-v2
    results:
      - task:
          type: information-retrieval
          name: Information Retrieval
        dataset:
          name: val
          type: val
        metrics:
          - type: cosine_accuracy@1
            value: 0.6
            name: Cosine Accuracy@1
          - type: cosine_accuracy@3
            value: 0.8
            name: Cosine Accuracy@3
          - type: cosine_accuracy@5
            value: 0.9333333333333333
            name: Cosine Accuracy@5
          - type: cosine_accuracy@10
            value: 0.9333333333333333
            name: Cosine Accuracy@10
          - type: cosine_precision@1
            value: 0.6
            name: Cosine Precision@1
          - type: cosine_precision@3
            value: 0.5777777777777778
            name: Cosine Precision@3
          - type: cosine_precision@5
            value: 0.5733333333333334
            name: Cosine Precision@5
          - type: cosine_precision@10
            value: 0.48666666666666664
            name: Cosine Precision@10
          - type: cosine_recall@1
            value: 0.0037118073861730316
            name: Cosine Recall@1
          - type: cosine_recall@3
            value: 0.011399309808564868
            name: Cosine Recall@3
          - type: cosine_recall@5
            value: 0.019975486198167695
            name: Cosine Recall@5
          - type: cosine_recall@10
            value: 0.033174913852812835
            name: Cosine Recall@10
          - type: cosine_ndcg@10
            value: 0.5158660061527193
            name: Cosine Ndcg@10
          - type: cosine_mrr@10
            value: 0.7155555555555556
            name: Cosine Mrr@10
          - type: cosine_map@100
            value: 0.18187688764934176
            name: Cosine Map@100

SentenceTransformer based on sentence-transformers/all-mpnet-base-v2

This is a sentence-transformers model fine-tuned from sentence-transformers/all-mpnet-base-v2. It maps sentences and paragraphs to a 768-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: sentence-transformers/all-mpnet-base-v2
  • Maximum Sequence Length: 384 tokens
  • Output Dimensionality: 768 dimensions
  • Similarity Function: Cosine Similarity

Model Sources

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 384, 'do_lower_case': False}) with Transformer model: MPNetModel 
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
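
The last two modules above turn per-token vectors into a single sentence vector: mean pooling over the non-padding tokens, followed by L2 normalization. The sketch below illustrates that idea in plain Python on toy data; the helper names `mean_pool` and `l2_normalize` are hypothetical and are not part of the library, which performs these steps internally on tensors.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padded positions (mask value 0)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vector):
                total[i] += value
    count = max(count, 1)  # guard against an all-padding input
    return [value / count for value in total]

def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / max(norm, eps) for v in vector]

# Toy input: three token vectors of dimension 2; the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)          # [2.0, 3.0]
sentence_embedding = l2_normalize(pooled)
print(sentence_embedding)
```

Because the output has unit length, the cosine similarity of two embeddings reduces to their dot product.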

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub (the string below is a placeholder; replace it with this model's Hub ID)
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
    'what types of rapid testing for Covid-19 have been developed?',
    'on the assessment of more reliable covid-19 infected number: the italian case. [SEP] covid-19 (sars-cov-2) is the most recent pandemic disease the world is currently managing. patients affected by covid-19 are identified employing medical swabs applied mainly to (i) citizens with covid-19 symptoms such as flu or high temperature, or (ii) citizens that had contacts with covid-19 patients.',
    'lack of antiviral activity of darunavir against sars-cov-2 [SEP] given the high need and the absence of specific antivirals for treatment of covid-19 (the disease caused by severe acute respiratory syndrome-associated coronavirus-2 sars-cov-2), human immunodeficiency virus (hiv) protease inhibitors are being considered as therapeutic alternatives. overall, the data do not support the use of drv for treatment of covid-19.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 768)

# Get the pairwise cosine similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
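
For semantic search, the usual next step is to rank candidate documents against a query by similarity. The following is a minimal sketch on toy two-dimensional vectors, assuming the embeddings are already unit length (as the Normalize() module suggests) so that a dot product equals cosine similarity. The function `rank_by_similarity` is a hypothetical helper, not a library call; in practice the inputs would be rows of the `model.encode(...)` output.

```python
def rank_by_similarity(query_embedding, document_embeddings):
    """Return (index, score) pairs sorted from most to least similar.

    Assumes every vector has unit length, so the dot product is the cosine.
    """
    scores = [
        sum(q * d for q, d in zip(query_embedding, doc))
        for doc in document_embeddings
    ]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)

# Toy unit vectors standing in for real 768-dimensional embeddings.
query = [1.0, 0.0]
documents = [[0.0, 1.0], [0.6, 0.8], [1.0, 0.0]]

ranking = rank_by_similarity(query, documents)
print(ranking)
# [(2, 1.0), (1, 0.6), (0, 0.0)]
```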

Evaluation

Metrics

Information Retrieval

Metric Value
cosine_accuracy@1 0.6
cosine_accuracy@3 0.8
cosine_accuracy@5 0.9333
cosine_accuracy@10 0.9333
cosine_precision@1 0.6
cosine_precision@3 0.5778
cosine_precision@5 0.5733
cosine_precision@10 0.4867
cosine_recall@1 0.0037
cosine_recall@3 0.0114
cosine_recall@5 0.02
cosine_recall@10 0.0332
cosine_ndcg@10 0.5159
cosine_mrr@10 0.7156
cosine_map@100 0.1819
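
As a reading aid for the table above, the sketch below shows how accuracy@k, precision@k, recall@k, and MRR@k are conventionally computed for a single query; the reported values are averages over all validation queries. The function names and the toy document IDs are illustrative assumptions, not taken from the evaluation code used for this model.

```python
def accuracy_at_k(ranked_ids, relevant_ids, k):
    """1.0 if at least one relevant document appears in the top k, else 0.0."""
    return 1.0 if any(doc in relevant_ids for doc in ranked_ids[:k]) else 0.0

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top k results that are relevant."""
    hits = sum(1 for doc in ranked_ids[:k] if doc in relevant_ids)
    return hits / k

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant documents that appear in the top k."""
    hits = sum(1 for doc in ranked_ids[:k] if doc in relevant_ids)
    return hits / len(relevant_ids)

def mrr_at_k(ranked_ids, relevant_ids, k):
    """Reciprocal rank of the first relevant result within the top k."""
    for rank, doc in enumerate(ranked_ids[:k], start=1):
        if doc in relevant_ids:
            return 1.0 / rank
    return 0.0

# Toy query: four retrieved documents, four relevant documents in total.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d5", "d9"}

print(accuracy_at_k(ranked, relevant, 3))   # 1.0
print(precision_at_k(ranked, relevant, 3))  # 0.3333333333333333
print(recall_at_k(ranked, relevant, 3))     # 0.25
print(mrr_at_k(ranked, relevant, 3))        # 0.5
```

Recall@k stays small whenever a query has many more relevant documents than k, even if most of the top k results are relevant. The low recall values next to the moderate precision values above would be consistent with that situation, although this card does not state how many relevant documents each query has.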

Training Details

Training Dataset

Unnamed Dataset

  • Size: 10,836 training samples
  • Columns: sentence_0, sentence_1, and sentence_2
  • Approximate statistics based on the first 1000 samples:
    column      type    min       mean          max
    sentence_0  string  5 tokens  18.36 tokens  50 tokens
    sentence_1  string  3 tokens  87.23 tokens  219 tokens
    sentence_2  string  3 tokens  81.52 tokens  252 tokens
  • Samples:
    Sample 1
      sentence_0: coronavirus origin
      sentence_1: the origin, transmission and clinical therapies on coronavirus disease 2019 (covid-19) outbreak an update on the status [SEP] an acute respiratory disease, caused by a novel coronavirus (sars-cov-2, previously known as 2019-ncov), the coronavirus disease 2019 (covid-19) has spread throughout china and received worldwide attention. the emergence of sars-cov-2, since the severe acute respiratory syndrome coronavirus (sars-cov) in 2002 and middle east respiratory syndrome coronavirus (mers-cov) in 2012, marked the third introduction of a highly pathogenic and large-scale epidemic coronavirus into the human population in the twenty-first century.
      sentence_2: challenges in developing methods for quantifying the effects of weather and climate on water-associated diseases: a systematic review [SEP] infectious diseases attributable to unsafe water supply, sanitation and hygiene (e.g. cholera, leptospirosis, giardiasis) remain an important cause of morbidity and mortality, especially in low-income countries. furthermore, the methods often did not distinguish among the multiple sources of time-lags (e.g. patient physiology, reporting bias, healthcare access) between environmental drivers/exposures and disease detection.
    Sample 2
      sentence_0: Seeking information on best practices for activities and duration of quarantine for those exposed and/ infected to COVID-19 virus.
      sentence_1: recommendation to optimize safety of elective surgical care while limiting the spread of covid-19: primum non nocere [SEP] covid-19 has drastically altered our lives in an unprecedented manner, shuttering industries, and leaving most of the country in isolation as we adapt to the evolving crisis. the optimal solution of how to effectively balance the resumption of standard surgical care while doing everything possible to limit the spread of covid-19 is undetermined, and could include strategies such as social distancing, screening forms and tests including temperature screening, segregation of inpatient and outpatient teams, proper use of protective gear, and the use of ambulatory surgery centers (ascs) to provide elective, yet ultimately essential, surgical care while conserving resources and protecting the health of patients and health-care providers.
      sentence_2: killing more than pain: etiology and remedy for an opioid crisis [SEP] the search for effective pain relief has been ever present across human history. this chapter describes the etiology and epidemiology of the opioid crisis using public health and health belief model frameworks and reviews approaches that have been applied to address supply (e.g., overprescribing) and demand (e.g., medication treatments) sides of the equation.
    Sample 3
      sentence_0: coronavirus early symptoms
      sentence_1: nan
      sentence_2: impact of antibacterials on subsequent resistance and clinical outcomes in adult patients with viral pneumonia: an opportunity for stewardship [SEP] introduction: respiratory viruses are increasingly recognized as significant etiologies of pneumonia among hospitalized patients. method: this was a single-center retrospective cohort study to evaluate the impact of antibacterials in viral pneumonia on clinical outcomes and subsequent multidrug-resistant organism (mdro) infections/colonization.
  • Loss: TripletLoss with these parameters:
    {
        "distance_metric": "TripletDistanceMetric.EUCLIDEAN",
        "triplet_margin": 5
    }
    
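The loss parameters above correspond to the standard triplet objective, max(d(anchor, positive) - d(anchor, negative) + margin, 0), with Euclidean distance and a margin of 5. The sketch below works through one triplet on toy two-dimensional unit vectors; the helper names are hypothetical, and the library computes the same quantity on batches of tensors and averages it.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=5.0):
    """Penalize triplets where the positive is not closer than the negative by `margin`."""
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)

# Toy unit vectors: sentence_0 (anchor), sentence_1 (positive), sentence_2 (negative).
anchor = [1.0, 0.0]
positive = [0.6, 0.8]
negative = [-1.0, 0.0]

loss = triplet_loss(anchor, positive, negative)
print(round(loss, 4))
# 3.8944
```

One consequence worth noting: if the embeddings are unit length during training, as the Normalize() module in the architecture suggests, the distance between any two of them is at most 2, so with a margin of 5 the per-triplet loss cannot fall below 3. The logged training loss should be read with that floor in mind.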

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • multi_dataset_batch_sampler: round_robin

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 16
  • per_device_eval_batch_size: 16
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 3
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.0
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: False
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • tp_size: 0
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin
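
The combination of lr_scheduler_type: linear, learning_rate: 5e-05, and warmup_steps: 0 describes a learning rate that decays linearly from 5e-05 to zero over the run. A minimal sketch of that schedule follows. The total of 2034 steps is an assumption (3 epochs times the 678 steps per epoch shown in the Training Logs), and `linear_lr` is a hypothetical helper rather than a library function.

```python
def linear_lr(step, total_steps, base_lr=5e-05, warmup_steps=0):
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Assumed total: num_train_epochs (3) * steps per epoch (678) = 2034.
total_steps = 3 * 678

for step in (0, 1017, 2034):
    print(step, linear_lr(step, total_steps))
# 0 5e-05
# 1017 2.5e-05
# 2034 0.0
```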

Training Logs

Epoch Step Training Loss val_cosine_ndcg@10
0.7375 500 4.4901 -
1.0 678 - 0.5159
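
The epoch and step columns can be cross-checked against the dataset size and batch size. The short calculation below assumes a single device, gradient_accumulation_steps of 1, and dataloader_drop_last set to False, which matches the hyperparameters listed in this card.

```python
import math

train_samples = 10836   # training set size
batch_size = 16         # per_device_train_batch_size

# The final partial batch still counts as a step because drop_last is False.
steps_per_epoch = math.ceil(train_samples / batch_size)
epoch_at_step_500 = round(500 / steps_per_epoch, 4)

print(steps_per_epoch)     # 678
print(epoch_at_step_500)   # 0.7375
```

Both values agree with the log rows: step 678 closes epoch 1.0, and step 500 falls at epoch 0.7375.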

Framework Versions

  • Python: 3.11.12
  • Sentence Transformers: 3.4.1
  • Transformers: 4.50.3
  • PyTorch: 2.6.0+cu124
  • Accelerate: 1.5.2
  • Datasets: 3.5.0
  • Tokenizers: 0.21.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

TripletLoss

@misc{hermans2017defense,
    title={In Defense of the Triplet Loss for Person Re-Identification},
    author={Alexander Hermans and Lucas Beyer and Bastian Leibe},
    year={2017},
    eprint={1703.07737},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}