CrossEncoder

This is a Cross Encoder model trained using the sentence-transformers library. It computes scores for pairs of texts, which can be used for text reranking and semantic search.

Model Details

Model Description

  • Model Type: Cross Encoder
  • Model Size: 0.6B parameters (F32)
  • Maximum Sequence Length: 8192 tokens
  • Number of Output Labels: 1 label

Model Sources

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import CrossEncoder

# Download from the 🤗 Hub
model = CrossEncoder("NCHS/ttc-reranker-mvp")
# Get scores for pairs of texts
pairs = [
    ['Hexacarboxylporphyrin/Creatinine [Molar ratio] in 24 hour Urine', '[Molar ratio] in Hexacarboxylporphyrin/Creatinine 24 hour Urn'],
    ['HLA-A2 Ql (Bld/Tiss donor)', 'HLA-A11 donor) (Bld/Tiss Ql'],
    ['Urea nitrogen [Mass/volume] in Urine', 'POC Urine Urea nitrogen Measurement'],
    ['Cauliflower IgG (S) [Mass/Vol]', 'POC Cauliflower Immune globulin (S) Radioallergosorbent'],
    ['Mannose-binding protein [Mass/volume] in Serum', 'MBP Level Serum Quantitative'],
]
scores = model.predict(pairs)
print(scores.shape)
# (5,)

# Or rank different texts based on similarity to a single text
ranks = model.rank(
    'Hexacarboxylporphyrin/Creatinine [Molar ratio] in 24 hour Urine',
    [
        '[Molar ratio] in Hexacarboxylporphyrin/Creatinine 24 hour Urn',
        'HLA-A11 donor) (Bld/Tiss Ql',
        'POC Urine Urea nitrogen Measurement',
        'POC Cauliflower Immune globulin (S) Radioallergosorbent',
        'MBP Level Serum Quantitative',
    ]
)
# [{'corpus_id': ..., 'score': ...}, {'corpus_id': ..., 'score': ...}, ...]
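
Because the model was trained with BinaryCrossEntropyLoss and an Identity activation (see Training Details), `model.predict` returns raw logits rather than probabilities. A minimal sketch of mapping logits to match probabilities with a sigmoid; the function name and the logit values are illustrative, not actual model outputs:

```python
import numpy as np

def logits_to_probabilities(logits: np.ndarray) -> np.ndarray:
    """Map raw cross-encoder logits to match probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-logits))

# Illustrative logits, e.g. what model.predict(pairs) might return
logits = np.array([3.2, -1.5, 0.0])
probs = logits_to_probabilities(logits)
print(probs)  # sigmoid(0.0) == 0.5; larger logits map closer to 1.0
```

Note that sigmoid is monotonic, so for reranking purposes sorting by raw logits and sorting by probabilities give the same order; the conversion only matters if you need calibrated match scores.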

Training Details

Training Dataset

Unnamed Dataset

  • Size: 1,120,911 training samples
  • Columns: anchor, positive, and label
  • Approximate statistics based on the first 1000 samples:
    • anchor: string; min 7, mean 50.32, max 140 characters
    • positive: string; min 8, mean 44.25, max 119 characters
    • label: int; 0: ~42.50%, 1: ~57.50%
  • Samples:
    • anchor: cloZAPine cutoff Confirm (U) [Mass/Vol]; positive: Clozaril cutoff (U) Confirmation [Mass/Vol]; label: 1
    • anchor: Horse hair+Horse dander IgE RAST class (S); positive: Serum Horse hair+Horse dander IgE Ab; label: 0
    • anchor: Deprecated Red Kidney Bean IgG Ab RAST class [Presence] in Serum; positive: Red Kidney Bean IgG Ab [Presence] Radioallergosorbent test; label: 0
  • Loss: BinaryCrossEntropyLoss with these parameters:
    {
        "activation_fn": "torch.nn.modules.linear.Identity",
        "pos_weight": 5
    }
    
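
The pos_weight of 5 scales the loss contribution of positive (label 1) pairs relative to negatives. A minimal sketch of the weighted binary cross-entropy applied to raw logits (the function name and example logit are illustrative; this mirrors the formula used by torch.nn.BCEWithLogitsLoss with pos_weight):

```python
import math

def weighted_bce_with_logits(logit: float, label: float, pos_weight: float = 5.0) -> float:
    """Binary cross-entropy on a raw logit, with the positive term
    scaled by pos_weight."""
    p = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
    return -(pos_weight * label * math.log(p) + (1.0 - label) * math.log(1.0 - p))

# At logit 0.0 (p = 0.5), a positive pair costs 5x what a negative pair costs
pos_loss = weighted_bce_with_logits(0.0, 1.0)  # 5 * ln(2)
neg_loss = weighted_bce_with_logits(0.0, 0.0)  # ln(2)
print(pos_loss / neg_loss)  # 5.0
```

Up-weighting positives like this penalizes the model more for scoring a true match low than for scoring a non-match high.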

Evaluation Dataset

Unnamed Dataset

  • Size: 100,000 evaluation samples
  • Columns: anchor, positive, and label
  • Approximate statistics based on the first 1000 samples:
    • anchor: string; min 12, mean 47.94, max 133 characters
    • positive: string; min 7, mean 42.78, max 124 characters
    • label: int; 0: ~43.70%, 1: ~56.30%
  • Samples:
    • anchor: Hexacarboxylporphyrin/Creatinine [Molar ratio] in 24 hour Urine; positive: [Molar ratio] in Hexacarboxylporphyrin/Creatinine 24 hour Urn; label: 1
    • anchor: HLA-A2 Ql (Bld/Tiss donor); positive: HLA-A11 donor) (Bld/Tiss Ql; label: 0
    • anchor: Urea nitrogen [Mass/volume] in Urine; positive: POC Urine Urea nitrogen Measurement; label: 0
  • Loss: BinaryCrossEntropyLoss with these parameters:
    {
        "activation_fn": "torch.nn.modules.linear.Identity",
        "pos_weight": 5
    }
    

Training Hyperparameters

Non-Default Hyperparameters

  • per_device_train_batch_size: 64
  • num_train_epochs: 1
  • learning_rate: 1e-07
  • warmup_steps: 0.1
  • bf16: True
  • eval_strategy: steps
  • per_device_eval_batch_size: 64
  • batch_sampler: no_duplicates
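
With the linear scheduler, the fractional warmup_steps of 0.1 reads as warmup over roughly the first 10% of optimizer steps, which matches the training logs (loss peaks in learning rate around step 1751 of 17515). A hedged sketch of that schedule shape; treating 0.1 as a ratio is an assumption, and this is not the exact transformers implementation:

```python
def linear_schedule_lr(step: int, total_steps: int, base_lr: float = 1e-7,
                       warmup_ratio: float = 0.1) -> float:
    """Linear warmup from 0 to base_lr, then linear decay back to 0."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

total = 17515  # total optimizer steps, from the training logs below
print(linear_schedule_lr(0, total))      # 0.0 at the start
print(linear_schedule_lr(1751, total))   # peak of 1e-07 at the end of warmup
print(linear_schedule_lr(total, total))  # 0.0 at the end
```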

All Hyperparameters

  • per_device_train_batch_size: 64
  • num_train_epochs: 1
  • max_steps: -1
  • learning_rate: 1e-07
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: None
  • warmup_steps: 0.1
  • optim: adamw_torch_fused
  • optim_args: None
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • optim_target_modules: None
  • gradient_accumulation_steps: 1
  • average_tokens_across_devices: True
  • max_grad_norm: 1.0
  • label_smoothing_factor: 0.0
  • bf16: True
  • fp16: False
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • use_liger_kernel: False
  • liger_kernel_config: None
  • use_cache: False
  • neftune_noise_alpha: None
  • torch_empty_cache_steps: None
  • auto_find_batch_size: False
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • include_num_input_tokens_seen: no
  • log_level: passive
  • log_level_replica: warning
  • disable_tqdm: False
  • project: huggingface
  • trackio_space_id: trackio
  • eval_strategy: steps
  • per_device_eval_batch_size: 64
  • prediction_loss_only: True
  • eval_on_start: False
  • eval_do_concat_batches: True
  • eval_use_gather_object: False
  • eval_accumulation_steps: None
  • include_for_metrics: []
  • batch_eval_metrics: False
  • save_only_model: False
  • save_on_each_node: False
  • enable_jit_checkpoint: False
  • push_to_hub: False
  • hub_private_repo: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_always_push: False
  • hub_revision: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • restore_callback_states_from_checkpoint: False
  • full_determinism: False
  • seed: 42
  • data_seed: None
  • use_cpu: False
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • parallelism_config: None
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • dataloader_prefetch_factor: None
  • remove_unused_columns: True
  • label_names: None
  • train_sampling_strategy: random
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • ddp_backend: None
  • ddp_timeout: 1800
  • fsdp: []
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • deepspeed: None
  • debug: []
  • skip_memory_metrics: True
  • do_predict: False
  • resume_from_checkpoint: None
  • warmup_ratio: None
  • local_rank: -1
  • prompts: None
  • batch_sampler: no_duplicates
  • multi_dataset_batch_sampler: proportional
  • router_mapping: {}
  • learning_rate_mapping: {}

Training Logs

Epoch Step Training Loss Validation Loss
0.1000 1751 1.5995 1.2376
0.1999 3502 1.1824 1.0858
0.2999 5253 1.0610 1.0036
0.3999 7004 1.0037 0.9503
0.4999 8755 0.9602 0.9021
0.5998 10506 0.9261 0.8669
0.6998 12257 0.8943 0.8422
0.7998 14008 0.8777 0.8264
0.8997 15759 0.8619 0.8176
0.9997 17510 0.8668 0.8150
1.0 17515 - 0.8150

Framework Versions

  • Python: 3.10.20
  • Sentence Transformers: 5.3.0
  • Transformers: 5.4.0
  • PyTorch: 2.10.0+cu128
  • Accelerate: 1.13.0
  • Datasets: 4.8.4
  • Tokenizers: 0.22.2

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}