---
license: apache-2.0
language:
- en
tags:
- biomedical
- reranker
- cross-encoder
- entity-normalization
- ontology
- sentence-transformers
task: text-retrieval
metrics:
- accuracy
- f1
- precision
- recall
- average-precision
datasets:
- mondo
- hpo
- uberon
- cell-ontology
- gene-ontology
---

# BOND-reranker

A cross-encoder reranker fine-tuned for biomedical ontology entity normalization, designed to work with the BOND (Biomedical Ontology Neural Disambiguation) system.

## Model Description

This model is a cross-encoder reranker trained to improve entity normalization accuracy by re-ranking the candidate ontology terms returned by BOND's initial retrieval stage. It takes a query-candidate pair and outputs a relevance score.

**Training Framework:** Sentence Transformers with cross-encoder architecture

## Model Architecture

- **Type:** Cross-Encoder
- **Framework:** Sentence Transformers
- **Max Sequence Length:** 512 tokens
- **Output:** Single relevance score per query-candidate pair
- **Parameters:** ~110M (based on BiomedBERT-base)

## Training Data

The model was trained on biomedical entity normalization data covering multiple ontologies, including:

- MONDO (diseases)
- HPO (phenotypes)
- UBERON (anatomy)
- Cell Ontology (CL)
- Gene Ontology (GO)
- Other biomedical ontologies

The training data consists of query-candidate pairs with relevance labels, where queries are biomedical entity mentions and candidates are ontology terms.
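
As a rough illustration of what such pairs might look like, the sketch below serializes a mention and an ontology term into the `key: value; ...` text format used by the Direct Usage example in this card. The helper names, the `LabeledPair` class, the binary 0/1 label, and the comma used to join multiple synonyms are all assumptions made for illustration; the actual training pipeline is not documented here.

```python
from dataclasses import dataclass


@dataclass
class LabeledPair:
    """Hypothetical container for one training example."""
    query: str      # serialized entity mention
    candidate: str  # serialized ontology term
    label: int      # assumed binary label: 1 = correct term, 0 = incorrect


def format_query(fields: dict[str, str]) -> str:
    """Join mention fields as 'key: value' segments separated by '; '."""
    return "; ".join(f"{key}: {value}" for key, value in fields.items())


def format_candidate(label: str, synonyms: list[str]) -> str:
    """Serialize an ontology term as 'label: ...; synonyms: ...'."""
    text = f"label: {label}"
    if synonyms:
        # Assumption: multiple synonyms are joined with ', '.
        text += "; synonyms: " + ", ".join(synonyms)
    return text


query = format_query(
    {"cell_type": "C_BEST4", "tissue": "descending colon", "organism": "Homo sapiens"}
)
pair = LabeledPair(
    query=query,
    candidate=format_candidate("epithelial cell of colon", ["colon epithelial cell"]),
    label=1,  # illustrative value only, not taken from the training set
)
print(pair.query)
print(pair.candidate)
```

Keeping the query and candidate as flat strings means the same text can be passed directly to the cross-encoder at inference time.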

## Usage

### With BOND Pipeline

```python
from bond.config import BondSettings
from bond.pipeline import BondMatcher

# Configure BOND to use this reranker
settings = BondSettings(
    "model_path",  # Replace with your model path
    enable_reranker=True
)

matcher = BondMatcher(settings=settings)
```

### Direct Usage

```python
import torch
from sentence_transformers import CrossEncoder

# Load model from local path
model = CrossEncoder(
    "model_path",  # Replace with your model path
    device='cuda' if torch.cuda.is_available() else 'cpu'
)

# Example: Rank candidates for a query
query = "cell_type: C_BEST4; tissue: descending colon; organism: Homo sapiens"
candidates = [
    "label: smooth muscle fiber of descending colon; synonyms: non-striated muscle fiber of descending colon",
    "label: smooth muscle cell of colon; synonyms: non-striated muscle fiber of colon",
    "label: epithelial cell of colon; synonyms: colon epithelial cell"
]

# Get ranked results with probabilities
ranked_results = model.rank(query, candidates, return_documents=True, top_k=3)

print("Top 3 ranked results")

for result in ranked_results:
    # Assumes the model returns raw logits; skip the sigmoid if your
    # configuration already applies one to the scores.
    prob = torch.sigmoid(torch.tensor(result['score'])).item()
    print(f"{prob:.8f} - {result['text']}")
```

## Performance

This reranker is designed to work as the final scoring stage in the BOND pipeline:

1. **Retrieval:** Exact + BM25 + Dense retrieval with LLM expansion
2. **Reranking:** This cross-encoder model scores and re-ranks top candidates
3. **Output:** Final ranked list of ontology terms

The reranker improves precision by re-scoring the top-k candidates (typically k=100) returned by the initial retrieval stage.
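
The re-scoring step can be sketched independently of any particular model. The snippet below is a minimal illustration, not BOND's actual implementation: `rerank` and `token_overlap` are hypothetical names, and the toy token-overlap scorer merely stands in for the cross-encoder so that the example runs without model weights.

```python
from typing import Callable, Sequence


def rerank(
    query: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 100,
) -> list[tuple[str, float]]:
    """Score only the first top_k retrieved candidates, then sort by score."""
    shortlist = candidates[:top_k]
    scored = [(cand, score_fn(query, cand)) for cand in shortlist]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def token_overlap(query: str, candidate: str) -> float:
    """Toy stand-in scorer: fraction of query tokens found in the candidate."""
    q_tokens = set(query.lower().split())
    c_tokens = set(candidate.lower().split())
    return len(q_tokens & c_tokens) / max(len(q_tokens), 1)


# Candidates in the order the (hypothetical) retrieval stage returned them.
retrieved = ["smooth muscle cell of colon", "epithelial cell of colon", "colon"]
ranked = rerank("epithelial cell colon", retrieved, token_overlap, top_k=2)

for text, score in ranked:
    print(f"{score:.2f} - {text}")
```

In practice `score_fn` would wrap the cross-encoder, for example by calling its scoring method on each `(query, candidate)` pair. Limiting the shortlist to `top_k` keeps the number of cross-encoder forward passes bounded.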

### Evaluation Metrics

Evaluated on the biomedical entity normalization development set:

| Metric                | Score  |
| --------------------- | ------ |
| **Accuracy**          | 97.50% |
| **F1 Score**          | 82.37% |
| **Precision**         | 79.58% |
| **Recall**            | 85.36% |
| **Average Precision** | 88.67% |
| **Eval Loss**         | 0.230  |

**Best Model:** Checkpoint at step 69,500 (epoch 2.28) with best metric score of 0.9734
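
As a quick sanity check, the reported F1 score can be recomputed from the reported precision and recall using the standard harmonic-mean definition. The function name `f1_score` below is an illustrative helper, not part of any BOND API.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values taken from the evaluation table in this card.
f1 = f1_score(precision=0.7958, recall=0.8536)
print(f"F1 = {f1:.2%}")  # prints "F1 = 82.37%"
```

The result agrees with the F1 value listed in the table.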

## Model Files

- `config.json` - Model configuration
- `model.safetensors` - Model weights in SafeTensors format
- `tokenizer.json` - Fast tokenizer
- `vocab.txt` - Vocabulary file
- `special_tokens_map.json` - Special tokens mapping
- `tokenizer_config.json` - Tokenizer configuration

## License

Apache 2.0