---
license: apache-2.0
language:
- en
tags:
- biomedical
- reranker
- cross-encoder
- entity-normalization
- ontology
- sentence-transformers
task: text-retrieval
metrics:
- accuracy
- f1
- precision
- recall
- average-precision
datasets:
- mondo
- hpo
- uberon
- cell-ontology
- gene-ontology
---
# BOND-reranker
A cross-encoder reranker model fine-tuned for biomedical ontology entity normalization, designed to work with the BOND (Biomedical Ontology Neural Disambiguation) system.
## Model Description
This model is a cross-encoder reranker trained to improve the accuracy of entity normalization by re-ranking candidate ontology terms retrieved by BOND's initial retrieval stage. It takes a query-candidate pair and outputs a relevance score.
**Training Framework:** Sentence Transformers with cross-encoder architecture
## Model Architecture
- **Type:** Cross-Encoder
- **Framework:** Sentence Transformers
- **Max Sequence Length:** 512 tokens
- **Output:** Single relevance score per query-candidate pair
- **Parameters:** ~110M (based on BiomedBERT-base)
## Training Data
The model was trained on biomedical entity normalization data covering multiple ontologies including:
- MONDO (diseases)
- HPO (phenotypes)
- UBERON (anatomy)
- Cell Ontology (CL)
- Gene Ontology (GO)
- Other biomedical ontologies
Training data consists of query-candidate pairs with relevance labels, where queries are biomedical entity mentions and candidates are ontology terms.
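A minimal sketch of what one such labeled pair might look like, and how pairs could be assembled from a gold annotation plus a list of candidates. The `RerankPair` class, its field names, and the `build_pairs` helper are hypothetical and chosen for illustration; the actual training schema is not documented here, and the gold label in the example is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RerankPair:
    """One training example (hypothetical schema)."""
    query: str      # biomedical entity mention
    candidate: str  # ontology term text
    label: int      # 1 = candidate is the correct term, 0 = it is not


def build_pairs(query: str, gold: str, candidates: list[str]) -> list[RerankPair]:
    """Label each candidate 1 if it equals the gold term text, else 0."""
    return [RerankPair(query, c, int(c == gold)) for c in candidates]


# Illustrative only: the gold term below is an assumption, not taken from the dataset.
pairs = build_pairs(
    query="cell_type: C_BEST4; tissue: descending colon; organism: Homo sapiens",
    gold="label: epithelial cell of colon; synonyms: colon epithelial cell",
    candidates=[
        "label: smooth muscle cell of colon; synonyms: non-striated muscle fiber of colon",
        "label: epithelial cell of colon; synonyms: colon epithelial cell",
    ],
)
for p in pairs:
    print(p.label, p.candidate)
```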
## Usage
### With BOND Pipeline
```python
from bond.config import BondSettings
from bond.pipeline import BondMatcher
# Configure BOND to use this reranker
settings = BondSettings(
    "model_path",  # Replace with your model path
    enable_reranker=True
)
matcher = BondMatcher(settings=settings)
```
### Direct Usage
```python
import torch
from sentence_transformers import CrossEncoder
# Load model from local path
model = CrossEncoder(
    "model_path",  # Replace with your model path
    device='cuda' if torch.cuda.is_available() else 'cpu'
)
# Example: Rank candidates for a query
query = "cell_type: C_BEST4; tissue: descending colon; organism: Homo sapiens"
candidates = [
    "label: smooth muscle fiber of descending colon; synonyms: non-striated muscle fiber of descending colon",
    "label: smooth muscle cell of colon; synonyms: non-striated muscle fiber of colon",
    "label: epithelial cell of colon; synonyms: colon epithelial cell"
]
# Get ranked results with probabilities
ranked_results = model.rank(query, candidates, return_documents=True, top_k=3)
print("Top 3 ranked results")
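for result in ranked_results:
    # Assumes rank() returns raw logits; skip the sigmoid if your
    # sentence-transformers version already applies one to the scores.
    prob = torch.sigmoid(torch.tensor(result['score'])).item()
    print(f"{prob:.8f} - {result['text']}")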
```
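The query and candidate strings in the example above follow a `key: value; key: value` layout. A small helper along these lines can build them consistently. This is a sketch: the function names are hypothetical, and the field set (`cell_type`, `tissue`, `organism`, `label`, `synonyms`) is inferred only from that example, so check it against the format the model was actually trained on.

```python
def format_query(fields: dict[str, str]) -> str:
    """Join non-empty fields as 'key: value' segments separated by '; '."""
    return "; ".join(f"{key}: {value}" for key, value in fields.items() if value)


def format_candidate(label: str, synonyms: list[str]) -> str:
    """Render an ontology term as 'label: ...; synonyms: ...'."""
    text = f"label: {label}"
    if synonyms:
        # Assumption: several synonyms are joined with ' | '.
        # The example above only ever shows a single synonym.
        text += "; synonyms: " + " | ".join(synonyms)
    return text


query = format_query(
    {"cell_type": "C_BEST4", "tissue": "descending colon", "organism": "Homo sapiens"}
)
candidate = format_candidate("epithelial cell of colon", ["colon epithelial cell"])
print(query)
print(candidate)
```

The resulting strings can be passed to `model.rank(query, [candidate, ...])` exactly as in the example above.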
## Performance
This reranker is designed to work as the final stage in the BOND pipeline:
1. **Retrieval:** Exact + BM25 + Dense retrieval with LLM expansion
2. **Reranking:** This cross-encoder model scores and re-ranks top candidates
3. **Output:** Final ranked list of ontology terms
The reranker re-scores the top-k candidates (typically k=100) returned by the initial retrieval stage, with the aim of improving precision at the top of the ranked list.
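The retrieve-then-rerank flow above can be sketched as follows. This is an illustration only, not BOND's implementation: `retrieve` uses a toy token-overlap score in place of the exact, BM25, and dense retrievers, and `score_fn` is a stand-in for the cross-encoder's pairwise scorer. All function names here are hypothetical.

```python
from typing import Callable


def tokens(text: str) -> set[str]:
    """Lowercased whitespace tokens of a string."""
    return set(text.lower().split())


def retrieve(query: str, terms: list[str], k: int = 100) -> list[str]:
    """Stage 1 (toy): keep the k terms sharing the most tokens with the query."""
    ranked = sorted(terms, key=lambda t: len(tokens(query) & tokens(t)), reverse=True)
    return ranked[:k]


def rerank(
    query: str,
    candidates: list[str],
    score_fn: Callable[[str, str], float],
) -> list[tuple[str, float]]:
    """Stage 2: score every (query, candidate) pair and sort by descending score."""
    scored = [(c, score_fn(query, c)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def jaccard(query: str, candidate: str) -> float:
    """Stand-in scorer; a real pipeline would call the cross-encoder here."""
    q, c = tokens(query), tokens(candidate)
    union = q | c
    return len(q & c) / len(union) if union else 0.0


ontology = ["epithelial cell of colon", "smooth muscle cell of colon", "neuron"]
top = retrieve("colon epithelial cell", ontology, k=2)
results = rerank("colon epithelial cell", top, jaccard)
print(results)
```

Keeping the scorer as a plain callable means the expensive pairwise model only ever sees the k candidates that survive the first stage, which is the reason for splitting the pipeline in two.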
### Evaluation Metrics
Evaluated on a biomedical entity normalization development set:
| Metric | Score |
| --------------------------- | ------ |
| **Accuracy** | 97.50% |
| **F1 Score** | 82.37% |
| **Precision** | 79.58% |
| **Recall** | 85.36% |
| **Average Precision** | 88.67% |
| **Eval Loss** | 0.230 |
**Best Model:** Checkpoint at step 69,500 (epoch 2.28), with a best metric score of 0.9734
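As a quick consistency check on the table above, the reported F1 score can be recomputed from the reported precision and recall, since F1 is their harmonic mean. The snippet uses only the values in the table; the variable names are illustrative.

```python
precision = 0.7958  # from the table above
recall = 0.8536     # from the table above

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.2%}")  # prints "F1 = 82.37%"
```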
## Model Files
- `config.json` - Model configuration
- `model.safetensors` - Model weights in SafeTensors format
- `tokenizer.json` - Fast tokenizer
- `vocab.txt` - Vocabulary file
- `special_tokens_map.json` - Special tokens mapping
- `tokenizer_config.json` - Tokenizer configuration
## License
Apache 2.0