---
license: apache-2.0
language:
  - en
tags:
  - biomedical
  - reranker
  - cross-encoder
  - entity-normalization
  - ontology
  - sentence-transformers
task: text-retrieval
metrics:
  - accuracy
  - f1
  - precision
  - recall
  - average-precision
datasets:
  - mondo
  - hpo
  - uberon
  - cell-ontology
  - gene-ontology
---

# BOND-reranker

A cross-encoder reranker model fine-tuned for biomedical ontology entity normalization, designed to work with the BOND (Biomedical Ontology Neural Disambiguation) system.

## Model Description

This model is a cross-encoder reranker trained to improve entity-normalization accuracy by re-ranking the candidate ontology terms returned by BOND's first-stage retrieval. It takes a query-candidate pair and outputs a single relevance score.

**Training Framework:** Sentence Transformers with cross-encoder architecture

## Model Architecture

- **Type:** Cross-Encoder
- **Framework:** Sentence Transformers
- **Max Sequence Length:** 512 tokens
- **Output:** Single relevance score per query-candidate pair
- **Parameters:** ~110M (based on BiomedBERT-base)

## Training Data

The model was trained on biomedical entity normalization data covering multiple ontologies including:

- MONDO (diseases)
- HPO (phenotypes)
- UBERON (anatomy)
- Cell Ontology (CL)
- Gene Ontology (GO)
- Other biomedical ontologies

Training data consists of query-candidate pairs with relevance labels, where queries are biomedical entity mentions and candidates are ontology terms.
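To make the pair format concrete, here is a hypothetical illustration of what such training examples might look like (the mention, candidate terms, and labels below are made up for demonstration and are not taken from the actual training set):

```python
# Hypothetical query-candidate training pairs with binary relevance labels.
# The field layout mirrors the usage example in this card; the specific
# mentions and labels are illustrative only.
train_pairs = [
    {
        "query": "cell_type: enterocyte; tissue: duodenum; organism: Homo sapiens",
        "candidate": "label: enterocyte; synonyms: columnar cell of intestinal epithelium",
        "label": 1.0,  # positive: candidate is the correct ontology term
    },
    {
        "query": "cell_type: enterocyte; tissue: duodenum; organism: Homo sapiens",
        "candidate": "label: goblet cell; synonyms: mucus-secreting cell",
        "label": 0.0,  # negative: incorrect candidate from the same ontology
    },
]
```

In sentence-transformers, cross-encoders are trained on such (query, candidate, label) triples, with each pair concatenated into a single input sequence.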

## Usage

### With BOND Pipeline

```python
from bond.config import BondSettings
from bond.pipeline import BondMatcher

# Configure BOND to use this reranker
settings = BondSettings(
    "model_path",  # Replace with your model path
    enable_reranker=True
)

matcher = BondMatcher(settings=settings)
```

### Direct Usage

```python
import torch
from sentence_transformers import CrossEncoder

# Load model from local path
model = CrossEncoder(
    "model_path",  # Replace with your model path
    device='cuda' if torch.cuda.is_available() else 'cpu'
)

# Example: Rank candidates for a query
query = "cell_type: C_BEST4; tissue: descending colon; organism: Homo sapiens"
candidates = [
    "label: smooth muscle fiber of descending colon; synonyms: non-striated muscle fiber of descending colon",
    "label: smooth muscle cell of colon; synonyms: non-striated muscle fiber of colon",
    "label: epithelial cell of colon; synonyms: colon epithelial cell"
]

# Get ranked results with probabilities
ranked_results = model.rank(query, candidates, return_documents=True, top_k=3)

print("Top 3 ranked results")

for result in ranked_results:
    prob = torch.sigmoid(torch.tensor(result['score'])).item()
    print(f"{prob:.8f} - {result['text']}")
```
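The sigmoid conversion in the snippet above can be checked in isolation. A pure-Python sketch with made-up logits (in practice the logits come from `model.predict([[query, c] for c in candidates])`, which scores each pair in one batch):

```python
import math

def sigmoid(x: float) -> float:
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical raw logits for three query-candidate pairs, standing in
# for the output of model.predict([[query, c] for c in candidates]).
logits = [3.2, -1.1, -4.7]
probs = [sigmoid(x) for x in logits]

# Candidate indices sorted by probability, highest first.
order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
```

This is the same logit-to-probability mapping the loop above performs with `torch.sigmoid`.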

## Performance

This reranker is designed to work as the final stage in the BOND pipeline:

1. **Retrieval:** Exact + BM25 + Dense retrieval with LLM expansion
2. **Reranking:** This cross-encoder model scores and re-ranks top candidates
3. **Output:** Final ranked list of ontology terms

The reranker significantly improves precision by re-scoring the top-k candidates (typically k=100) retrieved by the initial retrieval stage.
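The rerank step can be sketched as follows. This is a minimal, self-contained illustration, not BOND's actual implementation: `score_fn` is a stand-in for the cross-encoder (in practice, `model.predict([[query, c] for c in candidates])`), and the toy token-overlap scorer exists only so the example runs without the model:

```python
def rerank(query, candidates, score_fn, top_k=100):
    """Re-score the retriever's top-k candidates and sort descending."""
    scored = [(score_fn(query, c), c) for c in candidates[:top_k]]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored]

def toy_score(query, candidate):
    """Toy stand-in for the cross-encoder: count shared tokens."""
    return len(set(query.lower().split()) & set(candidate.lower().split()))

ranked = rerank(
    "epithelial cell of colon",
    [
        "smooth muscle cell of colon",
        "epithelial cell of colon",
        "colon goblet cell",
    ],
    toy_score,
)
```

With the real model, replace `toy_score` with a batched call to `model.predict` for efficiency rather than scoring one pair at a time.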

### Evaluation Metrics

Evaluated on biomedical entity normalization development set:

| Metric                | Score  |
| --------------------- | ------ |
| **Accuracy**          | 97.50% |
| **F1 Score**          | 82.37% |
| **Precision**         | 79.58% |
| **Recall**            | 85.36% |
| **Average Precision** | 88.67% |
| **Eval Loss**         | 0.230  |

**Best Model:** Checkpoint at step 69,500 (epoch 2.28) with best metric score of 0.9734

## Model Files

- `config.json` - Model configuration
- `model.safetensors` - Model weights in SafeTensors format
- `tokenizer.json` - Fast tokenizer
- `vocab.txt` - Vocabulary file
- `special_tokens_map.json` - Special tokens mapping
- `tokenizer_config.json` - Tokenizer configuration

## License

Apache 2.0