---
license: apache-2.0
language:
  - en
tags:
  - biomedical
  - reranker
  - cross-encoder
  - entity-normalization
  - ontology
  - sentence-transformers
task: text-retrieval
metrics:
  - accuracy
  - f1
  - precision
  - recall
  - average-precision
datasets:
  - mondo
  - hpo
  - uberon
  - cell-ontology
  - gene-ontology
---

# BOND-reranker

A cross-encoder reranker model fine-tuned for biomedical ontology entity normalization, designed to work with the BOND (Biomedical Ontology Neural Disambiguation) system.

## Model Description

This model is a cross-encoder reranker trained to improve entity-normalization accuracy by re-ranking the candidate ontology terms returned by BOND's first-stage retrieval. It takes a query-candidate pair and outputs a single relevance score.

**Training Framework:** Sentence Transformers with cross-encoder architecture

## Model Architecture

- **Type:** Cross-Encoder
- **Framework:** Sentence Transformers
- **Max Sequence Length:** 512 tokens
- **Output:** Single relevance score per query-candidate pair
- **Parameters:** ~110M (based on BiomedBERT-base)

## Training Data

The model was trained on biomedical entity normalization data covering multiple ontologies including:

- MONDO (diseases)
- HPO (phenotypes)
- UBERON (anatomy)
- Cell Ontology (CL)
- Gene Ontology (GO)
- Other biomedical ontologies

Training data consists of query-candidate pairs with relevance labels, where queries are biomedical entity mentions and candidates are ontology terms.
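To make the pair format concrete, here is a hypothetical illustration of what such training examples might look like (the mention, candidate terms, and labels below are made up for demonstration and are not taken from the actual training set):

```python
# Hypothetical query-candidate training pairs with binary relevance labels.
# The field layout mirrors the usage example in this card; the specific
# mentions and labels are illustrative only.
train_pairs = [
    {
        "query": "cell_type: enterocyte; tissue: duodenum; organism: Homo sapiens",
        "candidate": "label: enterocyte; synonyms: columnar cell of intestinal epithelium",
        "label": 1.0,  # positive: candidate is the correct ontology term
    },
    {
        "query": "cell_type: enterocyte; tissue: duodenum; organism: Homo sapiens",
        "candidate": "label: goblet cell; synonyms: mucus-secreting cell",
        "label": 0.0,  # negative: incorrect candidate from the same ontology
    },
]
```

In sentence-transformers, cross-encoders are trained on such (query, candidate, label) triples, with each pair concatenated into a single input sequence.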

## Usage

### With BOND Pipeline

```python
from bond.config import BondSettings
from bond.pipeline import BondMatcher

# Configure BOND to use this reranker
settings = BondSettings(
    "model_path",  # Replace with your model path
    enable_reranker=True
)

matcher = BondMatcher(settings=settings)
```

### Direct Usage

```python
import torch
from sentence_transformers import CrossEncoder

# Load model from local path
model = CrossEncoder(
    "model_path",  # Replace with your model path
    device='cuda' if torch.cuda.is_available() else 'cpu'
)

# Example: Rank candidates for a query
query = "cell_type: C_BEST4; tissue: descending colon; organism: Homo sapiens"
candidates = [
    "label: smooth muscle fiber of descending colon; synonyms: non-striated muscle fiber of descending colon",
    "label: smooth muscle cell of colon; synonyms: non-striated muscle fiber of colon",
    "label: epithelial cell of colon; synonyms: colon epithelial cell"
]

# Get ranked results with probabilities
ranked_results = model.rank(query, candidates, return_documents=True, top_k=3)

print("Top 3 ranked results")

for result in ranked_results:
    prob = torch.sigmoid(torch.tensor(result['score'])).item()
    print(f"{prob:.8f} - {result['text']}")
```
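The sigmoid conversion in the snippet above can be checked in isolation. A pure-Python sketch with made-up logits (in practice the logits come from `model.predict([[query, c] for c in candidates])`, which scores each pair in one batch):

```python
import math

def sigmoid(x: float) -> float:
    """Map a raw logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical raw logits for three query-candidate pairs, standing in
# for the output of model.predict([[query, c] for c in candidates]).
logits = [3.2, -1.1, -4.7]
probs = [sigmoid(x) for x in logits]

# Candidate indices sorted by probability, highest first.
order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
```

This is the same logit-to-probability mapping the loop above performs with `torch.sigmoid`.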

## Performance

This reranker is designed to work as the final stage in the BOND pipeline:

1. **Retrieval:** Exact + BM25 + Dense retrieval with LLM expansion
2. **Reranking:** This cross-encoder model scores and re-ranks top candidates
3. **Output:** Final ranked list of ontology terms

The reranker significantly improves precision by re-scoring the top-k candidates (typically k=100) retrieved by the initial retrieval stage.
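The rerank step can be sketched as follows. This is a minimal, self-contained illustration, not BOND's actual implementation: `score_fn` is a stand-in for the cross-encoder (in practice, `model.predict([[query, c] for c in candidates])`), and the toy token-overlap scorer exists only so the example runs without the model:

```python
def rerank(query, candidates, score_fn, top_k=100):
    """Re-score the retriever's top-k candidates and sort descending."""
    scored = [(score_fn(query, c), c) for c in candidates[:top_k]]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored]

def toy_score(query, candidate):
    """Toy stand-in for the cross-encoder: count shared tokens."""
    return len(set(query.lower().split()) & set(candidate.lower().split()))

ranked = rerank(
    "epithelial cell of colon",
    [
        "smooth muscle cell of colon",
        "epithelial cell of colon",
        "colon goblet cell",
    ],
    toy_score,
)
```

With the real model, replace `toy_score` with a batched call to `model.predict` for efficiency rather than scoring one pair at a time.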

### Evaluation Metrics

Evaluated on biomedical entity normalization development set:

| Metric                | Score  |
| --------------------- | ------ |
| **Accuracy**          | 97.50% |
| **F1 Score**          | 82.37% |
| **Precision**         | 79.58% |
| **Recall**            | 85.36% |
| **Average Precision** | 88.67% |
| **Eval Loss**         | 0.230  |

**Best Model:** Checkpoint at step 69,500 (epoch 2.28) with best metric score of 0.9734

## Model Files

- `config.json` - Model configuration
- `model.safetensors` - Model weights in SafeTensors format
- `tokenizer.json` - Fast tokenizer
- `vocab.txt` - Vocabulary file
- `special_tokens_map.json` - Special tokens mapping
- `tokenizer_config.json` - Tokenizer configuration

## License

Apache 2.0