---
license: apache-2.0
language:
- en
tags:
- biomedical
- reranker
- cross-encoder
- entity-normalization
- ontology
- sentence-transformers
task: text-retrieval
metrics:
- accuracy
- f1
- precision
- recall
- average-precision
datasets:
- mondo
- hpo
- uberon
- cell-ontology
- gene-ontology
---
# BOND-reranker
A cross-encoder reranker model fine-tuned for biomedical ontology entity normalization, designed to work with the BOND (Biomedical Ontology Neural Disambiguation) system.
## Model Description
This model is a cross-encoder reranker trained to improve the accuracy of entity normalization by re-ranking candidate ontology terms retrieved by BOND's initial retrieval stage. It takes a query-candidate pair and outputs a relevance score.
**Training Framework:** Sentence Transformers with cross-encoder architecture
## Model Architecture
- **Type:** Cross-Encoder
- **Framework:** Sentence Transformers
- **Max Sequence Length:** 512 tokens
- **Output:** Single relevance score per query-candidate pair
- **Parameters:** ~110M (based on BiomedBERT-base)
## Training Data
The model was trained on biomedical entity normalization data covering multiple ontologies including:
- MONDO (diseases)
- HPO (phenotypes)
- UBERON (anatomy)
- Cell Ontology (CL)
- Gene Ontology (GO)
- Other biomedical ontologies
Training data consists of query-candidate pairs with relevance labels, where queries are biomedical entity mentions and candidates are ontology terms.
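A minimal sketch of what one such record might look like. The class `RerankPair`, its field names, and the helpers `format_candidate` and `build_pairs` are hypothetical and chosen for illustration. The string layouts mirror the inputs in the Direct Usage example and are not a confirmed training format, and the gold label below is illustrative rather than taken from the training set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RerankPair:
    """One hypothetical training record: a mention, an ontology term, a label."""
    query: str      # biomedical entity mention, serialized as text
    candidate: str  # ontology term, serialized as text
    label: int      # 1 if the candidate is the correct term, else 0


def format_candidate(label, synonyms):
    # Assumed layout, mirroring the candidate strings in the usage example.
    return f"label: {label}; synonyms: {', '.join(synonyms)}"


def build_pairs(query, candidates, gold_label):
    """Pair one query with every candidate; only the gold term gets label 1."""
    return [
        RerankPair(query, format_candidate(name, syns), int(name == gold_label))
        for name, syns in candidates
    ]


pairs = build_pairs(
    "cell_type: C_BEST4; tissue: descending colon; organism: Homo sapiens",
    [
        ("epithelial cell of colon", ["colon epithelial cell"]),
        ("smooth muscle cell of colon", ["non-striated muscle fiber of colon"]),
    ],
    gold_label="epithelial cell of colon",  # illustrative choice
)
for pair in pairs:
    print(pair.label, pair.candidate)
```

Each query contributes one positive pair and any number of negatives, which is the usual shape of input for a binary relevance objective.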
## Usage
### With BOND Pipeline
```python
from bond.config import BondSettings
from bond.pipeline import BondMatcher

# Configure BOND to use this reranker
settings = BondSettings(
    "model_path",  # Replace with your model path
    enable_reranker=True,
)
matcher = BondMatcher(settings=settings)
```
### Direct Usage
```python
import torch
from sentence_transformers import CrossEncoder

# Load the model from a local path
model = CrossEncoder(
    "model_path",  # Replace with your model path
    device='cuda' if torch.cuda.is_available() else 'cpu',
)

# Example: rank candidates for a query
query = "cell_type: C_BEST4; tissue: descending colon; organism: Homo sapiens"
candidates = [
    "label: smooth muscle fiber of descending colon; synonyms: non-striated muscle fiber of descending colon",
    "label: smooth muscle cell of colon; synonyms: non-striated muscle fiber of colon",
    "label: epithelial cell of colon; synonyms: colon epithelial cell",
]

# Rank the candidates; each result is a dict with 'corpus_id', 'score', and 'text'
ranked_results = model.rank(query, candidates, return_documents=True, top_k=3)

print("Top 3 ranked results")
for result in ranked_results:
    # Assumes 'score' is a raw logit; if the model's configured activation
    # already applies a sigmoid, use result['score'] directly.
    prob = torch.sigmoid(torch.tensor(result['score'])).item()
    print(f"{prob:.8f} - {result['text']}")
```
## Performance
This reranker is designed to work as the final stage in the BOND pipeline:
1. **Retrieval:** Exact + BM25 + Dense retrieval with LLM expansion
2. **Reranking:** This cross-encoder model scores and re-ranks top candidates
3. **Output:** Final ranked list of ontology terms
The reranker improves precision by re-scoring the top-k candidates (typically k=100) returned by the retrieval stage, so that the final order reflects a joint reading of each query-candidate pair rather than the retrieval scores alone.
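The retrieve-then-rerank flow described above can be sketched as follows. This is a toy illustration, not BOND's implementation: `retrieve` uses raw token overlap in place of the exact, BM25, and dense retrievers, `score_pair` is a hypothetical stand-in for the cross-encoder (Jaccard similarity), and the corpus entries are toy strings rather than real ontology terms.

```python
def tokenize(text):
    return set(text.lower().split())


def retrieve(query, corpus, k):
    """Stage 1 stand-in: keep the k terms sharing the most tokens with the query."""
    q = tokenize(query)
    ranked = sorted(corpus, key=lambda term: len(q & tokenize(term)), reverse=True)
    return ranked[:k]


def score_pair(query, candidate):
    """Stage 2 stand-in for the cross-encoder: Jaccard similarity of token sets."""
    q, c = tokenize(query), tokenize(candidate)
    return len(q & c) / len(q | c)


def rerank(query, candidates):
    """Re-score every retrieved candidate and sort by descending score."""
    scored = [(score_pair(query, c), c) for c in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


corpus = [
    "epithelial cell of descending colon wall",
    "smooth muscle cell of colon",
    "epithelial cell of colon",
    "hepatocyte",
]
query = "colon epithelial cell"

top_k = retrieve(query, corpus, k=3)  # k=3 here; the text cites k=100 as typical
reranked = rerank(query, top_k)
for score, term in reranked:
    print(f"{score:.2f}  {term}")
```

In this toy run the first stage ties two candidates on overlap count, and the second stage promotes the tighter match to the top. The first stage only needs to get the right term into the top k; the second stage decides the final order.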
### Evaluation Metrics
Evaluated on the biomedical entity normalization development set:
| Metric | Score |
| --------------------------- | ------ |
| **Accuracy** | 97.50% |
| **F1 Score** | 82.37% |
| **Precision** | 79.58% |
| **Recall** | 85.36% |
| **Average Precision** | 88.67% |
| **Eval Loss** | 0.230 |
**Best Model:** Checkpoint at step 69,500 (epoch 2.28) with best metric score of 0.9734
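As a consistency check on the table above, F1 is the harmonic mean of precision and recall, so the reported F1 can be recomputed from those two rows. A short sketch (the function name `f1_score` is illustrative):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Values from the evaluation table, expressed as fractions.
precision, recall = 0.7958, 0.8536
f1 = f1_score(precision, recall)
print(f"F1 = {f1:.2%}")  # prints "F1 = 82.37%"
```

The recomputed value agrees with the reported F1 of 82.37%.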
## Model Files
- `config.json` - Model configuration
- `model.safetensors` - Model weights in SafeTensors format
- `tokenizer.json` - Fast tokenizer
- `vocab.txt` - Vocabulary file
- `special_tokens_map.json` - Special tokens mapping
- `tokenizer_config.json` - Tokenizer configuration
## License
Apache 2.0