Upload folder using huggingface_hub

Files changed (13)

README.md +115 -3
config.json +34 -0
model.safetensors +3 -0
optimizer.pt +3 -0
rng_state.pth +3 -0
scaler.pt +3 -0
scheduler.pt +3 -0
special_tokens_map.json +7 -0
tokenizer.json +0 -0
tokenizer_config.json +58 -0
trainer_state.json +0 -0
training_args.bin +3 -0
vocab.txt +0 -0

README.md CHANGED

@@ -1,3 +1,115 @@
----
-license: apache-2.0
----

+# BOND-reranker
+A cross-encoder reranker model fine-tuned for biomedical ontology entity normalization, designed to work with the BOND (Biomedical Ontology Neural Disambiguation) system.
+## Model Description
+This model is a cross-encoder reranker trained to improve the accuracy of entity normalization by re-ranking candidate ontology terms retrieved by BOND's initial retrieval stage. It takes a query-candidate pair and outputs a relevance score.
+**Training Framework:** Sentence Transformers with cross-encoder architecture
+## Model Architecture
+- **Type:** Cross-Encoder
+- **Framework:** Sentence Transformers
+- **Max Sequence Length:** 512 tokens
+- **Output:** Single relevance score per query-candidate pair
+- **Parameters:** ~41.5M (16 layers, hidden size 384, per `config.json`)
+## Training Data
+The model was trained on biomedical entity normalization data covering multiple ontologies including:
+- MONDO (diseases)
+- HPO (phenotypes)
+- UBERON (anatomy)
+- Cell Ontology (CL)
+- Gene Ontology (GO)
+- Other biomedical ontologies
+Training data consists of query-candidate pairs with relevance labels, where queries are biomedical entity mentions and candidates are ontology terms.
+## Usage
+### With BOND Pipeline
+```python
+from bond.config import BondSettings
+from bond.pipeline import BondMatcher
+# Configure BOND to use this reranker
+settings = BondSettings(
+    reranker_path="AronowLab/BOND-reranker",
+    enable_reranker=True
+)
+matcher = BondMatcher(settings=settings)
+```
+### Direct Usage
+```python
+import torch
+from sentence_transformers import CrossEncoder
+# Load model from local path
+model = CrossEncoder(
+    "model_path",  # Replace with your model path
+    device='cuda' if torch.cuda.is_available() else 'cpu'
+)
+# Example: Rank candidates for a query
+query = "cell_type: C_BEST4; tissue: descending colon; organism: Homo sapiens"
+candidates = [
+    "label: smooth muscle fiber of descending colon; synonyms: non-striated muscle fiber of descending colon",
+    "label: smooth muscle cell of colon; synonyms: non-striated muscle fiber of colon",
+    "label: epithelial cell of colon; synonyms: colon epithelial cell"
+]
+# Get ranked results. config.json sets the activation to Sigmoid, so the scores
+# returned by rank() should already be probabilities; no extra torch.sigmoid is needed.
+ranked_results = model.rank(query, candidates, return_documents=True, top_k=3)
+print("Top 3 ranked results")
+for result in ranked_results:
+    print(f"{result['score']:.8f} - {result['text']}")
+```
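A note on score scale: `config.json` in this commit sets `sentence_transformers.activation_fn` to `torch.nn.modules.activation.Sigmoid`, which Sentence Transformers normally applies inside `predict()` and `rank()`. Assuming that default is in effect, the returned scores are already in [0, 1]. The sketch below uses plain `math` (no model needed) to show what a second sigmoid would do to such scores: everything is squeezed into roughly [0.5, 0.73], so the ordering survives but the values no longer read as probabilities.

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical raw logits for a poor, a borderline, and a strong candidate.
for logit in (-6.0, 0.0, 6.0):
    once = sigmoid(logit)   # what rank()/predict() is expected to return
    twice = sigmoid(once)   # what an extra torch.sigmoid would produce
    print(f"logit={logit:+.1f}  once={once:.4f}  twice={twice:.4f}")
```

If raw logits are wanted instead, `CrossEncoder.predict` accepts an `activation_fn` argument that can be set to an identity module.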
+## Performance
+This reranker is designed to work as the final stage in the BOND pipeline:
+1. **Retrieval:** Exact + BM25 + Dense retrieval with LLM expansion
+2. **Reranking:** This cross-encoder model scores and re-ranks top candidates
+3. **Output:** Final ranked list of ontology terms
+The reranker significantly improves precision by re-scoring the top-k candidates (typically k=100) retrieved by the initial retrieval stage.
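The retrieve-then-rerank flow described above can be sketched as follows. This is a minimal illustration, not BOND's implementation: `rerank`, `toy_overlap_score`, and the `top_n` parameter are hypothetical names, and the token-overlap scorer is only a stand-in for the cross-encoder so the example runs without downloading a model.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]


def rerank(
    query: str,
    candidates: Sequence[str],
    score_fn: Callable[[List[Pair]], List[float]],
    top_k: int = 100,
    top_n: int = 3,
) -> List[Tuple[str, float]]:
    """Score the first top_k retrieved candidates and return the best top_n."""
    shortlist = list(candidates[:top_k])
    scores = score_fn([(query, cand) for cand in shortlist])
    ranked = sorted(zip(shortlist, scores), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


def toy_overlap_score(pairs: List[Pair]) -> List[float]:
    """Stand-in scorer (Jaccard token overlap); a real setup would call the cross-encoder."""
    scores = []
    for query, cand in pairs:
        q, c = set(query.lower().split()), set(cand.lower().split())
        scores.append(len(q & c) / len(q | c) if q | c else 0.0)
    return scores


retrieved = ["smooth muscle cell of colon", "epithelial cell of colon", "neuron"]
result = rerank("epithelial cell colon", retrieved, toy_overlap_score)
for text, score in result:
    print(f"{score:.3f} - {text}")
```

With the real model, `score_fn` could be a thin wrapper around `CrossEncoder.predict`, which takes a list of (query, candidate) pairs.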
+### Evaluation Metrics
+Evaluated on a biomedical entity normalization development set:
+| Metric                | Score  |
+| --------------------- | ------ |
+| **Accuracy**          | 97.50% |
+| **F1 Score**          | 82.37% |
+| **Precision**         | 79.58% |
+| **Recall**            | 85.36% |
+| **Average Precision** | 88.67% |
+| **Eval Loss**         | 0.230  |
+**Best Model:** Checkpoint at step 69,500 (epoch 2.28) with best metric score of 0.9734
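As a quick consistency check on the table, F1 is the harmonic mean of precision and recall, so the reported F1 can be recomputed from the other two rows. The values below are copied from the table; nothing else is assumed.

```python
precision = 0.7958  # from the Precision row
recall = 0.8536     # from the Recall row

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.2%}")
```

The result agrees with the reported 82.37% to two decimal places.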
+## Model Files
+- `config.json` - Model configuration
+- `model.safetensors` - Model weights in SafeTensors format
+- `tokenizer.json` - Fast tokenizer
+- `vocab.txt` - Vocabulary file
+- `special_tokens_map.json` - Special tokens mapping
+- `tokenizer_config.json` - Tokenizer configuration
+## License
+Apache 2.0

config.json ADDED

	@@ -0,0 +1,34 @@

+{
+  "architectures": [
+    "BertForSequenceClassification"
+  ],
+  "attention_probs_dropout_prob": 0.1,
+  "classifier_dropout": null,
+  "hidden_act": "gelu",
+  "hidden_dropout_prob": 0.1,
+  "hidden_size": 384,
+  "id2label": {
+    "0": "LABEL_0"
+  },
+  "initializer_range": 0.02,
+  "intermediate_size": 1536,
+  "label2id": {
+    "LABEL_0": 0
+  },
+  "layer_norm_eps": 1e-12,
+  "max_position_embeddings": 1024,
+  "model_type": "bert",
+  "num_attention_heads": 6,
+  "num_hidden_layers": 16,
+  "pad_token_id": 0,
+  "position_embedding_type": "absolute",
+  "sentence_transformers": {
+    "activation_fn": "torch.nn.modules.activation.Sigmoid",
+    "version": "4.1.0"
+  },
+  "torch_dtype": "float32",
+  "transformers_version": "4.53.3",
+  "type_vocab_size": 2,
+  "use_cache": true,
+  "vocab_size": 32768
+}
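The model size can be estimated directly from this config. The sketch below counts the weights of a standard `BertForSequenceClassification` layout (embeddings, encoder layers, pooler, classifier head); `num_labels = 1` is inferred from the single `id2label` entry, and the helper name is hypothetical. It comes out at roughly 41.5M parameters, or about 166 MB in float32, which is close to the 166,100,172 bytes listed in the `model.safetensors` pointer in this commit.

```python
# Subset of the config.json values shown above.
config = {
    "vocab_size": 32768,
    "max_position_embeddings": 1024,
    "type_vocab_size": 2,
    "hidden_size": 384,
    "intermediate_size": 1536,
    "num_hidden_layers": 16,
    "num_labels": 1,  # assumption: inferred from the single id2label entry
}


def count_bert_classifier_params(cfg: dict) -> int:
    """Count weights and biases for a standard BERT sequence classifier."""
    h, i = cfg["hidden_size"], cfg["intermediate_size"]
    layer_norm = 2 * h  # scale and bias
    embeddings = (
        cfg["vocab_size"] + cfg["max_position_embeddings"] + cfg["type_vocab_size"]
    ) * h + layer_norm
    attention = 4 * (h * h + h) + layer_norm  # Q, K, V, output projection
    feed_forward = (h * i + i) + (i * h + h) + layer_norm
    encoder = cfg["num_hidden_layers"] * (attention + feed_forward)
    pooler = h * h + h
    classifier = h * cfg["num_labels"] + cfg["num_labels"]
    return embeddings + encoder + pooler + classifier


total = count_bert_classifier_params(config)
print(f"{total:,} parameters, about {total * 4 / 1e6:.1f} MB in float32")
```

The small gap between the estimate and the file size is expected, since a SafeTensors file also carries a JSON header.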

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:14766d1baf6f825437915baf643f17d827ef9dcd2b083b9f29987a5e6f6691a8
+size 166100172
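The three added lines here are a Git LFS pointer rather than the weights themselves: a spec version, the SHA-256 of the real file, and its size in bytes. A minimal parser, with `parse_lfs_pointer` as a hypothetical helper name, might look like this:

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse the key/value lines of a Git LFS pointer file."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    hash_algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": hash_algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer = """version https://git-lfs.github.com/spec/v1
oid sha256:14766d1baf6f825437915baf643f17d827ef9dcd2b083b9f29987a5e6f6691a8
size 166100172
"""

info = parse_lfs_pointer(pointer)
print(f"{info['hash_algo']} digest, {info['size'] / 1e6:.1f} MB")
```

The same layout applies to the other pointer hunks in this commit (`optimizer.pt`, `rng_state.pth`, `scaler.pt`, `scheduler.pt`, `training_args.bin`).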

optimizer.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:a4bb4f3417135e321c3b25f5a61f28fe5ef51a14b59915c022803181329d1057
+size 332363770

rng_state.pth ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ec55b4daad85ce349cc9dcc48542de3969d24bc9676d6ceda96257d516bc1184
+size 14244

scaler.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:ee0fcf7a93ec5c15dc5bc24f4dcfdbdd297f774458ffc4a9e28b6338778e87c7
+size 988

scheduler.pt ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:7a4e0dda4ac01f5ed15d66380b515965c66a53aea573d5a1b42728550281990d
+size 1064

special_tokens_map.json ADDED

	@@ -0,0 +1,7 @@

+{
+  "cls_token": "[CLS]",
+  "mask_token": "[MASK]",
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "unk_token": "[UNK]"
+}

tokenizer.json ADDED

The diff for this file is too large to render. See raw diff

tokenizer_config.json ADDED

	@@ -0,0 +1,58 @@

+{
+  "added_tokens_decoder": {
+    "0": {
+      "content": "[PAD]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "100": {
+      "content": "[UNK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "101": {
+      "content": "[CLS]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "102": {
+      "content": "[SEP]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "103": {
+      "content": "[MASK]",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "clean_up_tokenization_spaces": true,
+  "cls_token": "[CLS]",
+  "do_basic_tokenize": true,
+  "do_lower_case": false,
+  "extra_special_tokens": {},
+  "mask_token": "[MASK]",
+  "model_max_length": 512,
+  "never_split": null,
+  "pad_token": "[PAD]",
+  "sep_token": "[SEP]",
+  "strip_accents": null,
+  "tokenize_chinese_chars": true,
+  "tokenizer_class": "BertTokenizer",
+  "unk_token": "[UNK]"
+}
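These tokenizer settings determine how a query-candidate pair is laid out for the cross-encoder. The sketch below assembles the usual BERT pair format, `[CLS] query [SEP] candidate [SEP]`, using the special-token IDs from `added_tokens_decoder` (101, 102) and the 512-token `model_max_length`. `build_pair` is a hypothetical helper, the word-piece IDs are made up, and the truncation loop is a simplified stand-in for the tokenizer's own logic.

```python
from typing import List, Sequence, Tuple

CLS_ID, SEP_ID = 101, 102  # from added_tokens_decoder
MAX_LEN = 512              # model_max_length


def build_pair(
    query_ids: Sequence[int], cand_ids: Sequence[int], max_len: int = MAX_LEN
) -> Tuple[List[int], List[int]]:
    """Assemble input_ids and token_type_ids for one query-candidate pair."""
    q, c = list(query_ids), list(cand_ids)
    # Reserve three slots for [CLS] and the two [SEP] tokens, then trim the
    # longer side one token at a time until the pair fits.
    while len(q) + len(c) > max_len - 3:
        if len(q) > len(c):
            q.pop()
        else:
            c.pop()
    input_ids = [CLS_ID] + q + [SEP_ID] + c + [SEP_ID]
    # Segment 0 covers [CLS] + query + first [SEP]; segment 1 covers the rest.
    token_type_ids = [0] * (len(q) + 2) + [1] * (len(c) + 1)
    return input_ids, token_type_ids


ids, types = build_pair([7, 8], [9])  # made-up word-piece IDs
print(ids)
print(types)
```

In practice the tokenizer produces this layout itself when given a text pair; the sketch only makes the structure visible.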

trainer_state.json ADDED

The diff for this file is too large to render. See raw diff

training_args.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:edc5a6d42126b280f3644ab439422a54e2040172376f4f27aed625e2725dbfa4
+size 5496

vocab.txt ADDED

The diff for this file is too large to render. See raw diff