---
license: apache-2.0
language:
- en
tags:
- biomedical
- reranker
- cross-encoder
- entity-normalization
- ontology
- sentence-transformers
pipeline_tag: text-retrieval
metrics:
- accuracy
- f1
- precision
- recall
- average-precision
datasets:
- mondo
- hpo
- uberon
- cell-ontology
- gene-ontology
---

# BOND-reranker

A cross-encoder reranker fine-tuned for biomedical ontology entity normalization, designed to work with the BOND (Biomedical Ontology Neural Disambiguation) system.

## Model Description

This model is a cross-encoder reranker trained to improve the accuracy of entity normalization by re-ranking candidate ontology terms retrieved by BOND's initial retrieval stage. It takes a query-candidate pair and outputs a single relevance score.

**Training Framework:** Sentence Transformers with a cross-encoder architecture

## Model Architecture

- **Type:** Cross-Encoder
- **Framework:** Sentence Transformers
- **Max Sequence Length:** 512 tokens
- **Output:** Single relevance score per query-candidate pair
- **Parameters:** ~110M (based on BiomedBERT-base)

## Training Data

The model was trained on biomedical entity normalization data covering multiple ontologies, including:

- MONDO (diseases)
- HPO (phenotypes)
- UBERON (anatomy)
- Cell Ontology (CL)
- Gene Ontology (GO)
- and other biomedical ontologies

Training data consists of query-candidate pairs with relevance labels, where queries are biomedical entity mentions and candidates are ontology terms.
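For illustration, one training example can be sketched as a (query, candidate, label) triple. The `label: ...; synonyms: ...` field layout is inferred from this card's usage examples; the exact format used during training is an assumption.

```python
# Illustrative construction of one training example: a biomedical entity
# mention (query) paired with an ontology term (candidate) and a binary
# relevance label. Field layout is assumed from this card's usage examples,
# not taken from the BOND training code.
def make_training_pair(mention, term_label, synonyms, relevant):
    candidate = f"label: {term_label}; synonyms: {', '.join(synonyms)}"
    return (mention, candidate, int(relevant))

pair = make_training_pair(
    "cell_type: enterocyte; tissue: colon; organism: Homo sapiens",
    "enterocyte of colon",
    ["colonic enterocyte"],
    relevant=True,
)
print(pair)
```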
## Usage

### With BOND Pipeline

```python
from bond.config import BondSettings
from bond.pipeline import BondMatcher

# Configure BOND to use this reranker
settings = BondSettings(
    "model_path",  # Replace with your model path
    enable_reranker=True,
)
matcher = BondMatcher(settings=settings)
```

### Direct Usage

```python
import torch
from sentence_transformers import CrossEncoder

# Load model from local path
model = CrossEncoder(
    "model_path",  # Replace with your model path
    device="cuda" if torch.cuda.is_available() else "cpu",
)

# Example: rank candidates for a query
query = "cell_type: C_BEST4; tissue: descending colon; organism: Homo sapiens"
candidates = [
    "label: smooth muscle fiber of descending colon; synonyms: non-striated muscle fiber of descending colon",
    "label: smooth muscle cell of colon; synonyms: non-striated muscle fiber of colon",
    "label: epithelial cell of colon; synonyms: colon epithelial cell",
]

# Get ranked results with probabilities
ranked_results = model.rank(query, candidates, return_documents=True, top_k=3)

print("Top 3 ranked results:")
for result in ranked_results:
    prob = torch.sigmoid(torch.tensor(result["score"])).item()
    print(f"{prob:.8f} - {result['text']}")
```

## Performance

This reranker is designed to work as the final stage in the BOND pipeline:

1. **Retrieval:** Exact + BM25 + dense retrieval with LLM expansion
2. **Reranking:** This cross-encoder model scores and re-ranks the top candidates
3. **Output:** Final ranked list of ontology terms

The reranker significantly improves precision by re-scoring the top-k candidates (typically k = 100) retrieved by the initial retrieval stage.
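The retrieve-then-rerank flow described above can be sketched with a toy scoring function standing in for the cross-encoder. The token-overlap score below is purely illustrative; in the real pipeline, `score_fn` would call the model on each (query, candidate) pair.

```python
# Toy sketch of the reranking stage: score every (query, candidate) pair
# and return the top-k candidates by descending score.
def rerank(query, candidates, score_fn, top_k=3):
    scored = [(cand, score_fn(query, cand)) for cand in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Stand-in for the cross-encoder: fraction of query tokens found in the
# candidate. Illustrative only.
def overlap_score(query, candidate):
    q, c = set(query.lower().split()), set(candidate.lower().split())
    return len(q & c) / max(len(q), 1)

results = rerank(
    "smooth muscle cell of colon",
    ["epithelial cell of colon", "smooth muscle cell of colon", "hepatocyte"],
    overlap_score,
    top_k=2,
)
print(results)
```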
### Evaluation Metrics

Evaluated on a biomedical entity normalization development set:

| Metric                | Score  |
| --------------------- | ------ |
| **Accuracy**          | 97.50% |
| **F1 Score**          | 82.37% |
| **Precision**         | 79.58% |
| **Recall**            | 85.36% |
| **Average Precision** | 88.67% |
| **Eval Loss**         | 0.230  |

**Best Model:** Checkpoint at step 69,500 (epoch 2.28) with a best metric score of 0.9734

## Model Files

- `config.json` - Model configuration
- `model.safetensors` - Model weights in SafeTensors format
- `tokenizer.json` - Fast tokenizer
- `vocab.txt` - Vocabulary file
- `special_tokens_map.json` - Special tokens mapping
- `tokenizer_config.json` - Tokenizer configuration

## License

Apache 2.0