	---
	license: apache-2.0
	language:
	- en
	tags:
	- biomedical
	- reranker
	- cross-encoder
	- entity-normalization
	- ontology
	- sentence-transformers
	task: text-retrieval
	metrics:
	- accuracy
	- f1
	- precision
	- recall
	- average-precision
	datasets:
	- mondo
	- hpo
	- uberon
	- cell-ontology
	- gene-ontology
	---

	# BOND-reranker

	A cross-encoder reranker model fine-tuned for biomedical ontology entity normalization, designed to work with the BOND (Biomedical Ontology Neural Disambiguation) system.

	## Model Description

	This model is a cross-encoder reranker trained to improve the accuracy of entity normalization by re-ranking candidate ontology terms retrieved by BOND's initial retrieval stage. It takes a query-candidate pair and outputs a relevance score.

	Training Framework: Sentence Transformers with cross-encoder architecture

	## Model Architecture

	- Type: Cross-Encoder
	- Framework: Sentence Transformers
	- Max Sequence Length: 512 tokens
	- Output: Single relevance score per query-candidate pair
	- Parameters: ~110M (based on BiomedBERT-base)

	## Training Data

	The model was trained on biomedical entity normalization data covering multiple ontologies including:

	- MONDO (diseases)
	- HPO (phenotypes)
	- UBERON (anatomy)
	- Cell Ontology (CL)
	- Gene Ontology (GO)
	- And other biomedical ontologies

	Training data consists of query-candidate pairs with relevance labels, where queries are biomedical entity mentions and candidates are ontology terms.
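
As a rough illustration of that pair structure, the sketch below models one training example as a small dataclass. The class name `RerankPair`, its field names, the binary 0/1 relevance values, and the example mentions are all assumptions made for illustration; the actual training schema is not documented in this card.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RerankPair:
    """Hypothetical container for one query-candidate training example."""

    query: str      # biomedical entity mention
    candidate: str  # serialized ontology term
    relevance: int  # assumed binary label: 1 = correct term, 0 = incorrect


pairs = [
    RerankPair("heart attack", "label: myocardial infarction", 1),
    RerankPair("heart attack", "label: cardiac arrest", 0),
]

# A cross-encoder scores each (query, candidate) pair jointly,
# so the model inputs are tuples rather than separate embeddings.
model_inputs = [(p.query, p.candidate) for p in pairs]
labels = [p.relevance for p in pairs]
print(model_inputs[0], labels)
```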

	## Usage

	### With BOND Pipeline

	```python
	from bond.config import BondSettings
	from bond.pipeline import BondMatcher

	# Configure BOND to use this reranker
	settings = BondSettings(
	    "model_path",  # Replace with your model path
	    enable_reranker=True,
	)

	matcher = BondMatcher(settings=settings)
	```

	### Direct Usage

	```python
	import torch
	from sentence_transformers import CrossEncoder

	# Load model from local path
	model = CrossEncoder(
	    "model_path",  # Replace with your model path
	    device='cuda' if torch.cuda.is_available() else 'cpu',
	)

	# Example: Rank candidates for a query
	query = "cell_type: C_BEST4; tissue: descending colon; organism: Homo sapiens"
	candidates = [
	    "label: smooth muscle fiber of descending colon; synonyms: non-striated muscle fiber of descending colon",
	    "label: smooth muscle cell of colon; synonyms: non-striated muscle fiber of colon",
	    "label: epithelial cell of colon; synonyms: colon epithelial cell",
	]

	# Rank the candidates; each result holds a score and the candidate text
	ranked_results = model.rank(query, candidates, return_documents=True, top_k=3)

	print("Top 3 ranked results")

	for result in ranked_results:
	    # Convert the score to a probability (assumes the score is a raw logit)
	    prob = torch.sigmoid(torch.tensor(result['score'])).item()
	    print(f"{prob:.8f} - {result['text']}")
	```
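
The query and candidate strings in the example above follow a `key: value; key: value` layout. A minimal sketch of helpers that build such strings is shown below. The function names `format_query` and `format_candidate` are hypothetical and not part of BOND, and joining multiple synonyms with `, ` is an assumption, since the example shows only one synonym per candidate.

```python
def format_query(fields: dict[str, str]) -> str:
    """Join field/value pairs as 'key: value; key: value' (hypothetical helper)."""
    return "; ".join(f"{key}: {value}" for key, value in fields.items())


def format_candidate(label: str, synonyms: list[str]) -> str:
    """Serialize an ontology term; the ', ' synonym separator is an assumption."""
    return f"label: {label}; synonyms: {', '.join(synonyms)}"


query = format_query(
    {"cell_type": "C_BEST4", "tissue": "descending colon", "organism": "Homo sapiens"}
)
candidate = format_candidate("epithelial cell of colon", ["colon epithelial cell"])
print(query)
print(candidate)
```

Keeping the serialization in one place makes it easier to feed the model text that matches the layout shown in this card.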

	## Performance

	This reranker is designed to work as the final stage in the BOND pipeline:

	1. Retrieval: Exact + BM25 + Dense retrieval with LLM expansion
	2. Reranking: This cross-encoder model scores and re-ranks top candidates
	3. Output: Final ranked list of ontology terms

	The reranker is intended to improve precision by re-scoring the top-k candidates (typically k=100) returned by the retrieval stage.
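
The retrieve-then-rerank flow described above can be sketched in plain Python. This is not BOND's implementation: `token_overlap` and `jaccard` are toy stand-ins for the retrieval stage and the cross-encoder, and `retrieve_then_rerank` is a hypothetical name. The sketch only shows that the second stage re-orders a shortlist of size k rather than the full term set.

```python
from typing import Callable


def token_overlap(query: str, term: str) -> float:
    """Toy retrieval score: number of shared lowercase tokens."""
    return float(len(set(query.lower().split()) & set(term.lower().split())))


def jaccard(query: str, term: str) -> float:
    """Toy stand-in for the cross-encoder: token-set Jaccard similarity."""
    q, t = set(query.lower().split()), set(term.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0


def retrieve_then_rerank(
    query: str,
    terms: list[str],
    retrieve_score: Callable[[str, str], float],
    rerank_score: Callable[[str, str], float],
    k: int = 100,
) -> list[str]:
    # Stage 1: keep the k best terms under the cheap retrieval score.
    shortlist = sorted(terms, key=lambda t: retrieve_score(query, t), reverse=True)[:k]
    # Stage 2: re-order only the shortlist with the more expensive scorer.
    return sorted(shortlist, key=lambda t: rerank_score(query, t), reverse=True)


terms = [
    "smooth muscle cell of colon",
    "epithelial cell of colon",
    "epithelial cell",
    "neuron",
]
ranked = retrieve_then_rerank("colon epithelial cell", terms, token_overlap, jaccard, k=3)
print(ranked)
```

In this toy run the shortlist drops `neuron`, and the second scorer moves `epithelial cell` ahead of `smooth muscle cell of colon`, even though both tied under the retrieval score.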

	### Evaluation Metrics

	Evaluated on the biomedical entity normalization development set:

	| Metric            | Score  |
	| ----------------- | ------ |
	| Accuracy          | 97.50% |
	| F1 Score          | 82.37% |
	| Precision         | 79.58% |
	| Recall            | 85.36% |
	| Average Precision | 88.67% |
	| Eval Loss         | 0.230  |

	Best Model: Checkpoint at step 69,500 (epoch 2.28) with best metric score of 0.9734
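
As a quick consistency check, the F1 score in the table above can be recomputed from the reported precision and recall, assuming the standard harmonic-mean definition was used:

```python
precision = 0.7958  # reported precision
recall = 0.8536     # reported recall

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.2%}")  # prints "F1 = 82.37%"
```

The result agrees with the reported 82.37%.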

	## Model Files

	- `config.json` - Model configuration
	- `model.safetensors` - Model weights in SafeTensors format
	- `tokenizer.json` - Fast tokenizer
	- `vocab.txt` - Vocabulary file
	- `special_tokens_map.json` - Special tokens mapping
	- `tokenizer_config.json` - Tokenizer configuration

	## License

	Apache 2.0