Cambridge-SapBERT-from-PubMedBERT-fulltext trained on NCBI Disease

This is a Cross Encoder model fine-tuned from cambridgeltl/SapBERT-from-PubMedBERT-fulltext using the sentence-transformers library. It computes a score for each pair of texts, which can be used for text reranking and semantic search.

Model Details

Model Description

Model Sources

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import CrossEncoder

# Download from the 🤗 Hub
model = CrossEncoder("OverSamu/reranker-sapbert-ncbi-disease-bce-context-title")
# Get scores for pairs of texts
pairs = [
    ['Original mention: breast cancer.\nTitle: BRCA1 mutations in a population-based sample of young women with breast cancer.\nContext: We studied 80 women in whom breast cancer was diagnosed before the age of 35, and who were not selected on the basis of family history.', 'breast tumors'],
    ['Original mention: childhood cerebral ALD.\nTitle: Predominance of the adrenomyeloneuropathy phenotype of X-linked adrenoleukodystrophy in The Netherlands: a survey of 30 kindreds.\nContext: The phenotypic expression is highly variable, childhood cerebral ALD (CCALD) and adrenomyeloneuropathy (AMN) being the main variants.', 'x-linked adrenoleukodystrophy'],
    ['Original mention: TSD.\nTitle: The Tay-Sachs disease gene in North American Jewish populations: geographic variations and origin.\nContext: Jews with Polish and/or Russian ancestry constituted 88% of this sample and had a TSD carrier frequency of.', 'gm2 gangliosidosis, type 1'],
    ['Original mention: PWS.\nTitle: Isolation of molecular probes associated with the chromosome 15 instability in the Prader-Willi syndrome.\nContext: 2 and are shown to be deleted in DNA of one of two patients examined with the PWS.', 'syndrome, royer'],
    ['Original mention: deficiency of beta-glucocerebrosidase.\nTitle: Homozygous presence of the crossover (fusion gene) mutation identified in a type II Gaucher disease fetus: is this analogous to the Gaucher knock-out mouse model?\nGaucher disease (GD) is an inherited deficiency of beta-glucocerebrosidase (EC 3.\nContext: Homozygous presence of the crossover (fusion gene) mutation identified in a type II Gaucher disease fetus: is this analogous to the Gaucher knock-out mouse model?\nGaucher disease (GD) is an inherited deficiency of beta-glucocerebrosidase (EC 3.', 'gaucher disease, acute neuronopathic type'],
]
scores = model.predict(pairs)
print(scores.shape)
# (5,)

# Or rank different texts based on similarity to a single text
ranks = model.rank(
    'Original mention: breast cancer.\nTitle: BRCA1 mutations in a population-based sample of young women with breast cancer.\nContext: We studied 80 women in whom breast cancer was diagnosed before the age of 35, and who were not selected on the basis of family history.',
    [
        'breast tumors',
        'x-linked adrenoleukodystrophy',
        'gm2 gangliosidosis, type 1',
        'syndrome, royer',
        'gaucher disease, acute neuronopathic type',
    ]
)
# [{'corpus_id': ..., 'score': ...}, {'corpus_id': ..., 'score': ...}, ...]
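
The query strings above share a three-part layout: mention, title, context. The sketch below shows one way to assemble such a query and to read the top hit from output shaped like the `rank` result. `build_query` and `best_candidate` are hypothetical helpers, not part of the library, and the scores in `example_ranks` are placeholders rather than real model output.

```python
def build_query(mention: str, title: str, context: str) -> str:
    """Assemble a query in the mention/title/context layout used above."""
    return f"Original mention: {mention}.\nTitle: {title}\nContext: {context}"


def best_candidate(ranks: list[dict], candidates: list[str]) -> tuple[str, float]:
    """Return the text and score of the highest-scoring entry."""
    top = max(ranks, key=lambda r: r["score"])
    return candidates[top["corpus_id"]], top["score"]


candidates = ["breast tumors", "syndrome, royer"]
query = build_query(
    "breast cancer",
    "BRCA1 mutations in a population-based sample of young women with breast cancer.",
    "We studied 80 women in whom breast cancer was diagnosed before the age of 35, "
    "and who were not selected on the basis of family history.",
)

# Placeholder scores for illustration only (assumption, not model output).
example_ranks = [{"corpus_id": 1, "score": 0.02}, {"corpus_id": 0, "score": 0.97}]
name, score = best_candidate(example_ranks, candidates)
print(name, score)
```

With a loaded model, `example_ranks` would be replaced by `model.rank(query, candidates)`.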

Evaluation

Metrics

Cross Encoder Reranking

Metric    Value
map       0.9965 (+0.5546)
mrr@10    0.9981 (+0.7243)
ndcg@10   0.9979 (+0.4192)
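
For readers unfamiliar with these metrics, here is a minimal sketch of their textbook definitions for a single query with binary relevance labels. It illustrates the formulas only; the evaluator that produced the numbers above may differ in details such as tie handling. The `ranked_labels` list is made-up toy data.

```python
import math


def mrr_at_k(labels: list[int], k: int = 10) -> float:
    """Reciprocal rank of the first relevant item within the top k."""
    for rank, rel in enumerate(labels[:k], start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def dcg_at_k(labels: list[int], k: int = 10) -> float:
    """Discounted cumulative gain over the top k positions."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(labels[:k], start=1))


def ndcg_at_k(labels: list[int], k: int = 10) -> float:
    """DCG normalised by the DCG of the ideal (sorted) ordering."""
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0


def average_precision(labels: list[int]) -> float:
    """Mean of precision values at each relevant position (MAP averages this over queries)."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(labels, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Toy relevance labels in ranked order (1 = correct candidate); illustrative only.
ranked_labels = [0, 1, 0, 1, 0]
print(mrr_at_k(ranked_labels), ndcg_at_k(ranked_labels), average_precision(ranked_labels))
```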

Training Details

Training Dataset

Unnamed Dataset

  • Size: 222,541 training samples
  • Columns: query, answer, and label
  • Approximate statistics based on the first 1000 samples:
    • query (string): min 123 characters, mean 317.4 characters, max 700 characters
    • answer (string): min 5 characters, mean 27.96 characters, max 168 characters
    • label (int): 0: ~44.80%, 1: ~55.20%
  • Samples:
    • query:
        Original mention: breast cancer.
        Title: BRCA1 mutations in a population-based sample of young women with breast cancer.
        Context: We studied 80 women in whom breast cancer was diagnosed before the age of 35, and who were not selected on the basis of family history.
      answer: breast tumors
      label: 1
    • query:
        Original mention: childhood cerebral ALD.
        Title: Predominance of the adrenomyeloneuropathy phenotype of X-linked adrenoleukodystrophy in The Netherlands: a survey of 30 kindreds.
        Context: The phenotypic expression is highly variable, childhood cerebral ALD (CCALD) and adrenomyeloneuropathy (AMN) being the main variants.
      answer: x-linked adrenoleukodystrophy
      label: 1
    • query:
        Original mention: TSD.
        Title: The Tay-Sachs disease gene in North American Jewish populations: geographic variations and origin.
        Context: Jews with Polish and/or Russian ancestry constituted 88% of this sample and had a TSD carrier frequency of.
      answer: gm2 gangliosidosis, type 1
      label: 1
  • Loss: BinaryCrossEntropyLoss with these parameters:
    {
        "activation_fn": "torch.nn.modules.linear.Identity",
        "pos_weight": 0.7793161273002625
    }
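
To make the loss parameters concrete, the sketch below computes a binary cross-entropy on raw logits with a positive-class weight. It follows the formula documented for PyTorch's `BCEWithLogitsLoss`, which is an assumption about what the library loss wraps. The function names are illustrative. The identity activation listed above suggests the loss receives unsquashed logits.

```python
import math

POS_WEIGHT = 0.7793161273002625  # value from the loss parameters above


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def weighted_bce_with_logits(logit: float, label: int, pos_weight: float = POS_WEIGHT) -> float:
    """-[w * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))] for one example."""
    # log(sigmoid(x)) == -softplus(-x) and log(1 - sigmoid(x)) == -softplus(x)
    return pos_weight * label * softplus(-logit) + (1 - label) * softplus(logit)


# At logit 0 the model is undecided; a positive example costs less than a negative one.
print(weighted_bce_with_logits(0.0, 1), weighted_bce_with_logits(0.0, 0))
```

A weight below 1 down-weights positive pairs. If the weight was derived as the ratio of negatives to positives (an assumption), 0.779 would correspond to roughly 56% positives, close to the ~55.20% seen in the first 1000 samples.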
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • learning_rate: 2e-05
  • num_train_epochs: 2
  • warmup_ratio: 0.05
  • seed: 12
  • bf16: True
  • dataloader_num_workers: 4
  • load_best_model_at_end: True
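
As a worked example of what these settings imply, the sketch below derives the step counts and a linear warmup/decay learning-rate curve from the values listed in this card. It assumes a single training device and that warmup steps are rounded up. The function `linear_lr` is an illustrative reimplementation of a standard linear schedule, not code taken from the training run.

```python
import math

TRAIN_SAMPLES = 222_541   # from "Size" in the training dataset section
BATCH_SIZE = 128          # per_device_train_batch_size (single device assumed)
EPOCHS = 2                # num_train_epochs
WARMUP_RATIO = 0.05       # warmup_ratio
BASE_LR = 2e-05           # learning_rate

steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def linear_lr(step: int) -> float:
    """Linear warmup from 0 to BASE_LR, then linear decay back to 0."""
    if step < warmup_steps:
        return BASE_LR * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return BASE_LR * remaining / max(1, total_steps - warmup_steps)


print(steps_per_epoch, total_steps, warmup_steps)
print(round(150 / steps_per_epoch, 4))  # fractional epoch reached at step 150
```

Under these assumptions, one epoch is 1739 steps, which agrees with the epoch fractions reported in the training logs (for example, step 150 at epoch 0.0863).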

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 128
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 2
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: None
  • warmup_ratio: 0.05
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 12
  • data_seed: None
  • jit_mode_eval: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 4
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: True
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch_fused
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • project: huggingface
  • trackio_space_id: trackio
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • hub_revision: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: no
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • liger_kernel_config: None
  • eval_use_gather_object: False
  • average_tokens_across_devices: True
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {}
  • learning_rate_mapping: {}

Training Logs

Epoch Step Training Loss ncbi-disease-dev_ndcg@10
0.0006 1 0.6083 -
0.0863 150 0.567 -
0.1725 300 0.3364 -
0.2588 450 0.2209 -
0.3450 600 0.1784 -
0.4313 750 0.1435 -
0.5175 900 0.1324 -
0.6038 1050 0.1137 -
0.6901 1200 0.103 -
0.7763 1350 0.0934 -
0.8626 1500 0.0842 0.9949 (+0.4162)
0.9488 1650 0.0797 -
1.0351 1800 0.0695 -
1.1213 1950 0.0573 -
1.2076 2100 0.0613 -
1.2938 2250 0.0555 -
1.3801 2400 0.0504 -
1.4664 2550 0.0499 -
1.5526 2700 0.049 -
1.6389 2850 0.0489 -
1.7251 3000 0.0424 0.9979 (+0.4192)
1.8114 3150 0.0411 -
1.8976 3300 0.0405 -
1.9839 3450 0.0405 -
-1 -1 - 0.9979 (+0.4192)
  • The bold row denotes the saved checkpoint.

Framework Versions

  • Python: 3.12.12
  • Sentence Transformers: 5.2.0
  • Transformers: 4.57.5
  • PyTorch: 2.9.1+cu128
  • Accelerate: 1.12.0
  • Datasets: 4.5.0
  • Tokenizers: 0.22.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

Model repository: OverSamu/reranker-sapbert-ncbi-disease-bce-context-title