bge-small-rrf-v1 / README.md
BGE-small fine-tuned with RRF disagreement signal
metadata
tags:
  - sentence-transformers
  - sentence-similarity
  - feature-extraction
  - dense
  - generated_from_trainer
  - dataset_size:75822
  - loss:MultipleNegativesRankingLoss
base_model: BAAI/bge-small-en-v1.5
widget:
  - source_sentence: blood clots
    sentences:
      - >-
        Herbal infusions as a source of calcium, magnesium, iron, zinc and
        copper in human nutrition.

        The study material consisted of five herbs: chamomile (flowers), mint
        (leaves), St John's wort (flowers and leaves), sage (leaves) and nettle
        (leaves), sourced from three producers. The calcium, magnesium, iron,
        zinc and copper contents were determined for both dried herb samples and
        prepared infusions, and the extraction rates were calculated. Mineral
        components were determined using atomic absorption spectrometry
      - >-
        Vegetarian diets and incidence of diabetes in the Adventist Health
        Study-2

        Aim To evaluate the relationship of diet to incident diabetes among
        non-Black and Black participants in the Adventist Health Study-2.
        Methods and Results Participants were 15,200 men and 26,187 women (17.3%
        Blacks) across the U.S. and Canada who were free of diabetes and who
        provided demographic, anthropometric, lifestyle and dietary data.
        Participants were grouped as vegan, lacto ovo vegetarian, pesco
        vegetarian, semi-vegetarian or 
      - >-
        Green tea: nature's defense against malignancies.

        The current practice of introducing phytochemicals to support the immune
        system or fight against diseases is based on centuries old traditions.
        Nutritional support is a recent advancement in the domain of diet-based
        therapies; green tea and its constituents are one of the important
        components of these strategies to prevent and cure various malignancies.
        The anti-carcinogenic and anti-mutagenic activities of green tea were
        highlighted some years ago suggestin
  - source_sentence: carcinogens
    sentences:
      - >-
        Vitamin B12 sources and bioavailability.

        The usual dietary sources of vitamin B(12) are animal foods, meat, milk,
        egg, fish, and shellfish. As the intrinsic factor-mediated intestinal
        absorption system is estimated to be saturated at about 1.5-2.0 microg
        per meal under physiologic conditions, vitamin B(12) bioavailability
        significantly decreases with increasing intake of vitamin B(12) per
        meal. The bioavailability of vitamin B(12) in healthy humans from fish
        meat, sheep meat, and chicken meat averaged 42%, 
      - >-
        Dietary intake of nitrate and nitrite and risk of renal cell carcinoma
        in the NIH-AARP Diet and Health Study

        Background: Nitrate and nitrite are present in many foods and are
        precursors of N-nitroso compounds, known animal carcinogens and
        potential human carcinogens. We prospectively investigated the
        association between nitrate and nitrite intake from dietary sources and
        risk of renal cell carcinoma (RCC) overall and clear cell and papillary
        histological subtypes in the NIH-AARP Diet and Health Study. Metho
      - >-
        A 21-day Daniel fast with or without krill oil supplementation improves
        anthropometric parameters and the cardiometabolic profile in men and
        women

        Background The Daniel Fast is a vegan diet that prohibits the
        consumption of animal products, refined foods, white flour,
        preservatives, additives, sweeteners, flavorings, caffeine, and alcohol.
        Following this dietary plan for 21 days has been demonstrated to improve
        blood pressure, LDL-C, and certain markers of oxidative stress, but it
        has also been shown to low
  - source_sentence: Is Distilled Fish Oil Toxin-Free?
    sentences:
      - >-
        Sniffer dogs as part of a bimodal bionic research approach to develop a
        lung cancer screening

        Lung cancer (LC) continues to represent a heavy burden for health care
        systems worldwide. Epidemiological studies predict that its role will
        increase in the near future. While patient prognosis is strongly
        associated with tumour stage and early detection of disease, no
        screening test exists so far. It has been suggested that electronic
        sensor devices, commonly referred to as ‘electronic noses’, may be
        applicable to
      - >-
        Efficacy of omega-3 fatty acid supplements (eicosapentaenoic acid and
        docosahexaenoic acid) in the secondary prevention of cardiovascular
        disease: ...

        BACKGROUND: Although previous randomized, double-blind,
        placebo-controlled trials reported the efficacy of omega-3 fatty acid
        supplements in the secondary prevention of cardiovascular disease (CVD),
        the evidence remains inconclusive. Using a meta-analysis, we
        investigated the efficacy of eicosapentaenoic acid and docosahexaenoic
        acid in the secondary preventi
      - >-
        A Prospective Study of Long-term Intake of Dietary Fiber and Risk of
        Crohn’s Disease and Ulcerative Colitis

        Background & Aims Increased intake of dietary fiber has been proposed to
        reduce risk of inflammatory bowel diseases (Crohn’s disease [CD],
        ulcerative colitis [UC]). However, few prospective studies have examined
        associations between long-term intake of dietary fiber and risk of
        incident CD or UC. Methods We collected and analyzed data from 170,776
        women, followed over 26 y, who participated in the Nur
  - source_sentence: trans fats
    sentences:
      - >-
        Laboratory, Epidemiological, and Human Intervention Studies Show That
        Tea (Camellia sinensis) May Be Useful in the Prevention of Obesity

        Tea (Camellia sinensis, Theaceae) and tea polyphenols have been studied
        for the prevention of chronic diseases, including obesity. Obesity
        currently affects >20% of adults in the United States and is a risk
        factor for chronic diseases such as type II diabetes, cardiovascular
        disease, and cancer. Given this increasing public health concern, the
        use of dietary agents for the
      - >-
        Dietary intake of nitrate and nitrite and risk of renal cell carcinoma
        in the NIH-AARP Diet and Health Study

        Background: Nitrate and nitrite are present in many foods and are
        precursors of N-nitroso compounds, known animal carcinogens and
        potential human carcinogens. We prospectively investigated the
        association between nitrate and nitrite intake from dietary sources and
        risk of renal cell carcinoma (RCC) overall and clear cell and papillary
        histological subtypes in the NIH-AARP Diet and Health Study. Metho
      - >-
        Vegetarian and vegan diets in type 2 diabetes management.

        Vegetarian and vegan diets offer significant benefits for diabetes
        management. In observational studies, individuals following vegetarian
        diets are about half as likely to develop diabetes, compared with
        non-vegetarians. In clinical trials in individuals with type 2 diabetes,
        low-fat vegan diets improve glycemic control to a greater extent than
        conventional diabetes diets. Although this effect is primarily
        attributable to greater weight loss, evidenc
  - source_sentence: poisonous plants
    sentences:
      - >-
        Creating public awareness: state 2025 diabetes forecasts.

        The incidence and prevalence of diabetes (primarily type 2 diabetes) has
        risen sharply since 1990. It is projected to increase another 64%
        between 2010 and 2025, affecting 53.1 million people and resulting in
        medical and societal costs of a half trillion dollars a year. We know
        how to prevent many cases of diabetes and how to treat it effectively.
        Early appropriate treatment makes a significant difference in preventing
        major complications and reducin
      - >-
        Dietary sources of inorganic microparticles and their intake in healthy
        subjects and patients with Crohn's disease.

        Dietary microparticles are non-biological, bacterial-sized particles.
        Endogenous sources are derived from intestinal Ca and phosphate
        secretion. Exogenous sources are mainly titanium dioxide (TiO2) and
        mixed silicates (Psil); they are resistant to degradation and accumulate
        in human Peyer's patch macrophages and there is some evidence that they
        exacerbate inflammation in Crohn's disease (CD). 
      - >-
        Antioxidant, antimutagenic, and antitumor effects of pine needles (Pinus
        densiflora).

        Pine needles (Pinus densiflora Siebold et Zuccarini) have long been used
        as a traditional health-promoting medicinal food in Korea. To
        investigate their potential anticancer effects, antioxidant,
        antimutagenic, and antitumor activities were assessed in vitro and/or in
        vivo. Pine needle ethanol extract (PNE) significantly inhibited
        Fe(2+)-induced lipid peroxidation and scavenged 1,1-diphenyl-
        2-picrylhydrazyl radical in vit
pipeline_tag: sentence-similarity
library_name: sentence-transformers

SentenceTransformer based on BAAI/bge-small-en-v1.5

This is a sentence-transformers model fine-tuned from BAAI/bge-small-en-v1.5. It maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: BAAI/bge-small-en-v1.5
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 384 dimensions
  • Similarity Function: Cosine Similarity
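
The similarity function listed above can be illustrated without loading the model. The following is a minimal pure-Python sketch of cosine similarity; the helper name `cosine_similarity` and the 3-dimensional toy vectors are illustrative assumptions, not part of the library or real 384-dimensional embeddings.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors standing in for real embeddings (hypothetical values).
query = [3.0, 4.0, 0.0]
doc = [4.0, 3.0, 0.0]
score = cosine_similarity(query, doc)
print(round(score, 4))
```

For vectors that are already unit length, the two norms equal 1 and the score reduces to a plain dot product.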

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': True, 'architecture': 'BertModel'})
  (1): Pooling({'word_embedding_dimension': 384, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Normalize()
)
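
The architecture printout shows CLS-token pooling (`pooling_mode_cls_token: True`) followed by `Normalize()`. As a rough sketch of what those two stages do to the transformer output, assuming a hypothetical 3-token input with 4-dimensional token vectors instead of 384 (the function names `cls_pool` and `l2_normalize` are illustrative, not library API):

```python
import math

def cls_pool(token_embeddings):
    """CLS pooling: keep the first token's vector and ignore the rest."""
    return token_embeddings[0]

def l2_normalize(vec):
    """Scale a non-zero vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Hypothetical transformer output for one input (made-up numbers).
token_embeddings = [
    [1.0, 2.0, 2.0, 0.0],  # [CLS]
    [0.5, 0.1, 0.3, 0.9],  # a word piece
    [0.2, 0.7, 0.4, 0.1],  # [SEP]
]
sentence_embedding = l2_normalize(cls_pool(token_embeddings))
print(sentence_embedding)
```

Because every output vector has unit length, cosine similarity and dot product give the same ranking for this model's embeddings.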

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

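# Download from the 🤗 Hub
# "sentence_transformers_model_id" is a placeholder; replace it with this model's Hub repository id
model = SentenceTransformer("sentence_transformers_model_id")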
# Run inference
sentences = [
    'poisonous plants',
    'Antioxidant, antimutagenic, and antitumor effects of pine needles (Pinus densiflora).\nPine needles (Pinus densiflora Siebold et Zuccarini) have long been used as a traditional health-promoting medicinal food in Korea. To investigate their potential anticancer effects, antioxidant, antimutagenic, and antitumor activities were assessed in vitro and/or in vivo. Pine needle ethanol extract (PNE) significantly inhibited Fe(2+)-induced lipid peroxidation and scavenged 1,1-diphenyl- 2-picrylhydrazyl radical in vit',
    "Dietary sources of inorganic microparticles and their intake in healthy subjects and patients with Crohn's disease.\nDietary microparticles are non-biological, bacterial-sized particles. Endogenous sources are derived from intestinal Ca and phosphate secretion. Exogenous sources are mainly titanium dioxide (TiO2) and mixed silicates (Psil); they are resistant to degradation and accumulate in human Peyer's patch macrophages and there is some evidence that they exacerbate inflammation in Crohn's disease (CD). ",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 384)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities)
# tensor([[1.0000, 0.5729, 0.4656],
#         [0.5729, 1.0000, 0.5740],
#         [0.4656, 0.5740, 1.0000]])
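
For retrieval, only the query row of that matrix matters. The sketch below copies the similarity values from the example output above into a plain list and ranks the two documents for the query; the short labels are shorthand introduced here for readability.

```python
# Similarity matrix copied from the example output:
# row/column 0 is the query, 1 and 2 are the two documents.
similarities = [
    [1.0000, 0.5729, 0.4656],
    [0.5729, 1.0000, 0.5740],
    [0.4656, 0.5740, 1.0000],
]
labels = [
    "poisonous plants (query)",
    "pine needles abstract",
    "dietary microparticles abstract",
]

query_row = similarities[0]
# Rank document indices (1 and 2) by similarity to the query, best first.
ranking = sorted(range(1, len(query_row)), key=lambda i: query_row[i], reverse=True)
for i in ranking:
    print(f"{query_row[i]:.4f}  {labels[i]}")
```

With these values the pine needles abstract ranks ahead of the dietary microparticles abstract for the query "poisonous plants".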

Training Details

Training Dataset

Unnamed Dataset

  • Size: 75,822 training samples
  • Columns: sentence_0 and sentence_1
  • Approximate statistics based on the first 1000 samples:
                 sentence_0     sentence_1
    type         string         string
    min          3 tokens       28 tokens
    mean         7.12 tokens    109.88 tokens
    max          37 tokens      177 tokens
  • Samples:
    sentence_0: serotonin
    sentence_1: The potential toxicity of artificial sweeteners.
      Since their discovery, the safety of artificial sweeteners has been controversial. Artificial sweeteners provide the sweetness of sugar without the calories. As public health attention has turned to reversing the obesity epidemic in the United States, more individuals of all ages are choosing to use these products. These choices may be beneficial for those who cannot tolerate sugar in their diets (e.g., diabetics). However, scientists disagree about the relat

    sentence_0: industrial toxins
    sentence_1: Marine Food Pollutants as a Risk Factor for Hypoinsulinemia and Type 2 Diabetes
      Background Some persistent environmental chemicals are suspected of causing an increased risk of type 2 diabetes mellitus, a disease particularly common after age 70. This concern was examined in a cross-sectional study of elderly subjects in a population with elevated contaminant exposures from seafood species high in the food chain. Methods Clinical examinations of 713 Faroese residents aged 70-74 years (64% of eligible popula

    sentence_0: Update on Herbalife®
    sentence_1: Bioavailability of vitamin D₂ from UV-B-irradiated button mushrooms in healthy adults deficient in serum 25-hydroxyvitamin D: a randomized controll...
      BACKGROUND/OBJECTIVES: Mushrooms contain very little or any vitamin D(2) but are abundant in ergosterol, which can be converted into vitamin D(2) by ultraviolet (UV) irradiation. Our objective was to investigate the bioavailability of vitamin D(2) from vitamin D(2)-enhanced mushrooms by UV-B in humans, and comparing it with a vitamin D(2) supplement. SUBJECTS
  • Loss: MultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim",
        "gather_across_devices": false,
        "directions": [
            "query_to_doc"
        ],
        "partition_mode": "joint",
        "hardness_mode": null,
        "hardness_strength": 0.0
    }
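
    To make these parameters concrete, here is a minimal pure-Python sketch of an in-batch-negatives ranking loss in the `query_to_doc` direction with `scale` 20.0 and cosine similarity. It is an illustration under simplifying assumptions, not the library's implementation: the function names and the 2-dimensional toy embeddings are hypothetical, and the real loss runs batched in PyTorch with further options.

```python
import math

def cos_sim(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def in_batch_ranking_loss(queries, docs, scale=20.0):
    """Mean cross-entropy where docs[i] is the positive for queries[i]
    and every other document in the batch acts as a negative."""
    losses = []
    for i, q in enumerate(queries):
        logits = [scale * cos_sim(q, d) for d in docs]
        m = max(logits)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Hypothetical 2-dimensional embeddings for a batch of two pairs.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]  # each doc matches its own query
swapped = [[0.0, 1.0], [1.0, 0.0]]  # each doc matches the other query
loss_aligned = in_batch_ranking_loss(queries, aligned)
loss_swapped = in_batch_ranking_loss(queries, swapped)
print(loss_aligned, loss_swapped)
```

    The loss is near zero when each query is closest to its own document and large when it is closest to another one. When all documents in a batch of n score equally, the loss is ln(n), about 4.16 for a batch of 64.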
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • num_train_epochs: 2
  • multi_dataset_batch_sampler: round_robin

All Hyperparameters

  • do_predict: False
  • eval_strategy: no
  • prediction_loss_only: True
  • per_device_train_batch_size: 64
  • per_device_eval_batch_size: 64
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 5e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1
  • num_train_epochs: 2
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: None
  • warmup_ratio: None
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • enable_jit_checkpoint: False
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • use_cpu: False
  • seed: 42
  • data_seed: None
  • bf16: False
  • fp16: False
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: -1
  • ddp_backend: None
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch_fused
  • optim_args: None
  • group_by_length: False
  • length_column_name: length
  • project: huggingface
  • trackio_space_id: trackio
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • hub_revision: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • auto_find_batch_size: False
  • full_determinism: False
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_num_input_tokens_seen: no
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • liger_kernel_config: None
  • eval_use_gather_object: False
  • average_tokens_across_devices: True
  • use_cache: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: round_robin
  • router_mapping: {}
  • learning_rate_mapping: {}

Training Logs

Epoch Step Training Loss
0.4219 500 3.5716
0.8439 1000 3.2683
1.2658 1500 3.1075
1.6878 2000 3.0246
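
The Epoch column can be reproduced from the dataset size and batch size reported earlier. This short check assumes a single training device (so each step consumes one batch of 64) and that the last partial batch is kept, as `dataloader_drop_last: False` indicates.

```python
import math

dataset_size = 75822
batch_size = 64  # per_device_train_batch_size; single device assumed
steps_per_epoch = math.ceil(dataset_size / batch_size)

epochs = [round(step / steps_per_epoch, 4) for step in (500, 1000, 1500, 2000)]
print(steps_per_epoch, epochs)
```

The computed fractions match the logged values, which is consistent with the single-device assumption.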

Framework Versions

  • Python: 3.12.13
  • Sentence Transformers: 5.3.0
  • Transformers: 5.0.0
  • PyTorch: 2.10.0+cu128
  • Accelerate: 1.13.0
  • Datasets: 4.0.0
  • Tokenizers: 0.22.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

MultipleNegativesRankingLoss

@misc{oord2019representationlearningcontrastivepredictive,
      title={Representation Learning with Contrastive Predictive Coding},
      author={Aaron van den Oord and Yazhe Li and Oriol Vinyals},
      year={2019},
      eprint={1807.03748},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/1807.03748},
}