BOND-reranker

A cross-encoder reranker model fine-tuned for biomedical ontology entity normalization, designed to work with the BOND (Biomedical Ontology Neural Disambiguation) system.

Model Description

This model is a cross-encoder reranker trained to improve the accuracy of entity normalization by re-ranking candidate ontology terms retrieved by BOND's initial retrieval stage. It takes a query-candidate pair and outputs a relevance score.

Training Framework: Sentence Transformers with cross-encoder architecture

Model Architecture

Type: Cross-Encoder
Framework: Sentence Transformers
Max Sequence Length: 512 tokens
Output: Single relevance score per query-candidate pair
Parameters: ~110M (based on BiomedBERT-base)

Training Data

The model was trained on biomedical entity normalization data covering multiple ontologies including:

MONDO (diseases)
HPO (phenotypes)
UBERON (anatomy)
Cell Ontology (CL)
Gene Ontology (GO)
And other biomedical ontologies

Training data consists of query-candidate pairs with relevance labels, where queries are biomedical entity mentions and candidates are ontology terms.
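As an illustration of the pair format described above, here is a minimal sketch of how one query could be expanded into labeled query-candidate pairs. The `RerankerExample` class, the `build_examples` helper, the binary 0/1 labels, and the sample strings are all assumptions made for this example; they are not taken from the actual BOND training code or data.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RerankerExample:
    """One training pair (hypothetical structure, for illustration only)."""
    query: str      # biomedical entity mention
    candidate: str  # text describing an ontology term
    label: int      # assumed binary relevance: 1 = relevant, 0 = not relevant


def build_examples(query: str, positive: str, negatives: List[str]) -> List[RerankerExample]:
    """Pair a query with one relevant candidate and several non-relevant ones."""
    examples = [RerankerExample(query, positive, 1)]
    examples.extend(RerankerExample(query, neg, 0) for neg in negatives)
    return examples


# Illustrative strings only, not drawn from the real training set
examples = build_examples(
    query="heart attack",
    positive="label: myocardial infarction; synonyms: heart attack",
    negatives=[
        "label: heart failure; synonyms: cardiac failure",
        "label: angina pectoris; synonyms: angina",
    ],
)

for ex in examples:
    print(ex.label, "|", ex.query, "|", ex.candidate)
```

Each query contributes one positive pair and any number of negative pairs, so negatives will typically outnumber positives in a dataset built this way.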

Usage

With BOND Pipeline

from bond.config import BondSettings
from bond.pipeline import BondMatcher

# Configure BOND to use this reranker
settings = BondSettings(
    "model_path",  # Replace with your model path
    enable_reranker=True
)

matcher = BondMatcher(settings=settings)

Direct Usage

import torch
from sentence_transformers import CrossEncoder

# Load model from local path
model = CrossEncoder(
    "model_path",  # Replace with your model path
    device='cuda' if torch.cuda.is_available() else 'cpu'
)

# Example: Rank candidates for a query
query = "cell_type: C_BEST4; tissue: descending colon; organism: Homo sapiens"
candidates = [
    "label: smooth muscle fiber of descending colon; synonyms: non-striated muscle fiber of descending colon",
    "label: smooth muscle cell of colon; synonyms: non-striated muscle fiber of colon",
    "label: epithelial cell of colon; synonyms: colon epithelial cell"
]

# Get ranked results with probabilities
ranked_results = model.rank(query, candidates, return_documents=True, top_k=3)

print("Top 3 ranked results")

for result in ranked_results:
    prob = torch.sigmoid(torch.tensor(result['score'])).item()
    print(f"{prob:.8f} - {result['text']}")
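The query and candidate strings in the example above follow a "field: value; field: value" layout. The sketch below shows one way to build such strings from structured inputs. The `format_fields` helper is hypothetical and is not part of BOND or Sentence Transformers; it only reproduces the layout visible in the example.

```python
from typing import Mapping


def format_fields(fields: Mapping[str, str]) -> str:
    """Join field/value pairs as 'name: value; name: value' in insertion order."""
    return "; ".join(f"{name}: {value}" for name, value in fields.items())


# Rebuild the query string used in the example above
query_text = format_fields({
    "cell_type": "C_BEST4",
    "tissue": "descending colon",
    "organism": "Homo sapiens",
})

# Rebuild one of the candidate strings
candidate_text = format_fields({
    "label": "epithelial cell of colon",
    "synonyms": "colon epithelial cell",
})

print(query_text)
print(candidate_text)
```

Keeping the field order and separators identical to those seen during training is likely to matter, since a cross-encoder reads the raw text of both inputs.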

Performance

This reranker is designed to work as the final stage in the BOND pipeline:

Retrieval: Exact + BM25 + Dense retrieval with LLM expansion
Reranking: This cross-encoder model scores and re-ranks top candidates
Output: Final ranked list of ontology terms

The reranker is intended to improve precision by re-scoring the top-k candidates (typically k=100) returned by the initial retrieval stage, so that the most relevant ontology terms move to the top of the final list.
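The retrieve-then-rerank flow described above can be sketched as follows. This is a simplified illustration, not BOND's implementation: `rerank_top_k` and `toy_overlap_scorer` are hypothetical names, and the toy scorer stands in for the cross-encoder so the example runs without model weights. With the real model, the scoring function could be `CrossEncoder.predict`, which accepts a list of (query, candidate) pairs.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]


def rerank_top_k(
    query: str,
    retrieved: Sequence[str],
    score_pairs: Callable[[List[Pair]], Sequence[float]],
    k: int = 100,
) -> List[Tuple[str, float]]:
    """Keep the first k retrieved candidates and sort them by reranker score."""
    shortlist = list(retrieved[:k])
    if not shortlist:
        return []
    scores = score_pairs([(query, cand) for cand in shortlist])
    return sorted(zip(shortlist, scores), key=lambda item: item[1], reverse=True)


def toy_overlap_scorer(pairs: List[Pair]) -> List[float]:
    """Stand-in scorer: fraction of query tokens that appear in the candidate."""
    scores = []
    for q, c in pairs:
        q_tokens = set(q.lower().split())
        c_tokens = set(c.lower().split())
        scores.append(len(q_tokens & c_tokens) / max(len(q_tokens), 1))
    return scores


# Candidates in the order an initial retrieval stage might return them
retrieved = ["smooth muscle cell of colon", "epithelial cell of colon", "colon"]
ranked = rerank_top_k("epithelial cell colon", retrieved, toy_overlap_scorer, k=2)

for cand, score in ranked:
    print(f"{score:.2f} - {cand}")
```

Only the first k candidates are scored, so a correct term that the retrieval stage ranks below k cannot be recovered by the reranker.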

Evaluation Metrics

Evaluated on the biomedical entity normalization development set:

Metric	Score
Accuracy	97.50%
F1 Score	82.37%
Precision	79.58%
Recall	85.36%
Average Precision	88.67%
Eval Loss	0.230

Best Model: Checkpoint at step 69,500 (epoch 2.28) with best metric score of 0.9734
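As a quick consistency check on the table above, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the reported precision and recall; the `f1_score` function is written here for illustration and is not taken from the training code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the evaluation table
f1 = f1_score(0.7958, 0.8536)
print(f"F1 = {f1 * 100:.2f}%")
```

The recomputed value agrees with the reported F1 of 82.37%.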

Model Files

config.json - Model configuration
model.safetensors - Model weights in SafeTensors format
tokenizer.json - Fast tokenizer
vocab.txt - Vocabulary file
special_tokens_map.json - Special tokens mapping
tokenizer_config.json - Tokenizer configuration

License

Apache 2.0


Safetensors

Model size: 41.5M params
Tensor type: F32
