---
license: apache-2.0
base_model:
- meta-llama/Meta-Llama-3-8B-Instruct
language:
- en
tags:
- BEL
- retrieval
- entity-retrieval
- named-entity-disambiguation
- entity-disambiguation
- named-entity-linking
- entity-linking
- text2text-generation
- biomedical
- healthcare
- synthetic-data
- causal-lm
- llm
library_name: transformers
finetuning_task:
- text2text-generation
- entity-linking
metrics:
- recall
model-index:
- name: syncabel-medmentions-8b
results:
- task:
type: entity-linking
dataset:
type: structured_dataset
name: medmentions
config: st21pv
metrics:
- type: recall
value: 0.754
---
# SynCABEL: Synthetic Contextualized Augmentation for Biomedical Entity Linking
**SynCABEL** is a novel framework that addresses data scarcity in biomedical entity linking through **synthetic data generation**. The method is introduced in our [paper].
## SynCABEL (MedMentions Edition)
This is a **fine-tuned version of LLaMA-3-8B** trained on **MedMentions** and **SynthMM**, our synthetic dataset generated via the SynCABEL framework.
| | |
|--------|---------|
| **Base Model** | [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) |
| **Training Data** | [MedMentions](https://huggingface.co/datasets/bigbio/medmentions) (real) + [SynthMM](https://huggingface.co/datasets/Aremaki/SynCABEL) (synthetic) |
| **Fine-tuning** | [Supervised Fine-Tuning](https://huggingface.co/docs/trl/en/sft_trainer) |
## Training Data Composition
The model is trained on a mix of **human-annotated** and **synthetic** data:
```
MedMentions (human) : 4,392 abstracts
SynthMM (synthetic) : ~50,000 samples
```
To ensure balanced learning, **human data is upsampled during training** so that each batch contains:
```
50% human-annotated data
50% synthetic data
```
In other words, although SynthMM is larger, the model always sees a **1:1 ratio of human to synthetic examples**, preventing synthetic data from overwhelming human supervision.
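A minimal sketch of this 1:1 mixing with the `datasets` library; the dataset identifiers, configs, and splits below are illustrative, and the actual training pipeline may differ:
```python
from datasets import load_dataset, interleave_datasets

# Illustrative sources; see the tables above for the actual datasets.
human = load_dataset("bigbio/medmentions", split="train", trust_remote_code=True)
synthetic = load_dataset("Aremaki/SynCABEL", split="train")

# Draw from each source with equal probability so every batch is ~50/50.
# "all_exhausted" keeps re-sampling (upsampling) the smaller human corpus
# until the larger synthetic corpus has been seen in full.
mixed = interleave_datasets(
    [human, synthetic],
    probabilities=[0.5, 0.5],
    seed=42,
    stopping_strategy="all_exhausted",
)
```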
## Usage
### Loading
```python
import torch
from transformers import AutoModelForCausalLM

# Load the model (trust_remote_code is required for the custom architecture
# and its sample() API)
model = AutoModelForCausalLM.from_pretrained(
    "Aremaki/SynCABEL_MedMentions",
    trust_remote_code=True,
    device_map="auto",
    torch_dtype=torch.bfloat16,  # optional: halves memory on recent GPUs
)
```
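Inputs mark each mention inline as `[mention]{semantic group}`, as in the examples below. A tiny hypothetical helper for building such strings (any formatting that produces this pattern works):
```python
def tag_mention(text: str, mention: str, group: str) -> str:
    """Wrap the first occurrence of `mention` in `text` as [mention]{group}."""
    return text.replace(mention, f"[{mention}]{{{group}}}", 1)

sentence = tag_mention(
    "Ibuprofen is a non-steroidal anti-inflammatory drug",
    "Ibuprofen",
    "Chemicals & Drugs",
)
# -> "[Ibuprofen]{Chemicals & Drugs} is a non-steroidal anti-inflammatory drug"
```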
### Unconstrained Generation
```python
# Let the model freely generate concept names
sentences = [
    "[Ibuprofen]{Chemicals & Drugs} is a non-steroidal anti-inflammatory drug",
    "[Myocardial infarction]{Disorders} requires immediate intervention",
]
results = model.sample(
    sentences=sentences,
    constrained=False,  # free generation over the full vocabulary
    num_beams=3,
)
for i, beam_results in enumerate(results):
    print(f"Input: {sentences[i]}")
    mention = beam_results[0]["mention"]
    print(f"Mention: {mention}")
    for j, result in enumerate(beam_results):
        print(
            f"Beam {j+1}:\n"
            f"Predicted concept name:{result['pred_concept_name']}\n"
            f"Predicted code: {result['pred_concept_code']}\n"
            f"Beam score: {result['beam_score']:.3f}\n"
        )
```
**Output:**
```
Input: [Ibuprofen]{Chemicals & Drugs} is a non-steroidal anti-inflammatory drug
Mention: Ibuprofen
Beam 1:
Predicted concept name:Ibuprofen
Predicted code: C0020740
Beam score: 1.000
Beam 2:
Predicted concept name:IBUPROFEN
Predicted code: NO_CODE
Beam score: 0.114
Beam 3:
Predicted concept name:IBUPROfen
Predicted code: NO_CODE
Beam score: 0.060
Input: [Myocardial infarction]{Disorders} requires immediate intervention
Mention: Myocardial infarction
Beam 1:
Predicted concept name:Myocardial infarction
Predicted code: C0027051
Beam score: 1.000
Beam 2:
Predicted concept name:Myocardial Infarction
Predicted code: C0027051
Beam score: 0.200
Beam 3:
Predicted concept name:myocardial infarction
Predicted code: NO_CODE
Beam score: 0.149
```
### Constrained Decoding (Recommended for Entity Linking)
```python
# Restrict generation to valid biomedical concept names
sentences = [
    "[Ibuprofen]{Chemicals & Drugs} is a non-steroidal anti-inflammatory drug",
    "[Myocardial infarction]{Disorders} requires immediate intervention",
]
results = model.sample(
    sentences=sentences,
    constrained=True,  # beams follow the candidate trie of valid concept names
    num_beams=3,
)
for i, beam_results in enumerate(results):
    print(f"Input: {sentences[i]}")
    mention = beam_results[0]["mention"]
    print(f"Mention: {mention}")
    for j, result in enumerate(beam_results):
        print(
            f"Beam {j+1}:\n"
            f"Predicted concept name:{result['pred_concept_name']}\n"
            f"Predicted code: {result['pred_concept_code']}\n"
            f"Beam score: {result['beam_score']:.3f}\n"
        )
```
**Output:**
```
Input: [Ibuprofen]{Chemicals & Drugs} is a non-steroidal anti-inflammatory drug
Mention: Ibuprofen
Beam 1:
Predicted concept name:Ibuprofen
Predicted code: C0020740
Beam score: 1.000
Beam 2:
Predicted concept name:IBUPROFEN/PSEUDOEPHEDRINE
Predicted code: C0717858
Beam score: 0.065
Beam 3:
Predicted concept name:Ibuprofen (substance)
Predicted code: C0020740
Beam score: 0.056
Input: [Myocardial infarction]{Disorders} requires immediate intervention
Mention: Myocardial infarction
Beam 1:
Predicted concept name:Myocardial infarction
Predicted code: C0027051
Beam score: 1.000
Beam 2:
Predicted concept name:Myocardial Infarction
Predicted code: C0027051
Beam score: 0.200
Beam 3:
Predicted concept name:Myocardial infarction (disorder)
Predicted code: C0027051
Beam score: 0.194
```
## Assets
The model automatically loads:
- `text_to_code.json`: Maps concept names to ontology codes (UMLS, SNOMED CT)
- `candidate_trie.pkl`: Prefix tree for efficient constrained decoding
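The constrained decoding itself is handled inside `model.sample`, but the underlying idea can be sketched with the standard `transformers` hook `prefix_allowed_tokens_fn`. The `Trie` class and all variable names below are illustrative, not the model's internal API:
```python
class Trie:
    """Minimal prefix tree over token-id sequences (what candidate_trie.pkl encodes)."""

    def __init__(self, sequences):
        self.root = {}
        for seq in sequences:
            node = self.root
            for tok in seq:
                node = node.setdefault(tok, {})

    def allowed(self, prefix):
        """Token ids that may follow `prefix`, or [] if the prefix is invalid."""
        node = self.root
        for tok in prefix:
            node = node.get(tok)
            if node is None:
                return []
        return list(node.keys())


# Toy trie over two "concept names" tokenized as id sequences.
trie = Trie([[5, 8, 2], [5, 9, 2]])
prompt_len = 0  # index where the generated concept name starts in input_ids

def prefix_allowed_tokens_fn(batch_id, input_ids):
    # transformers calls this at every decoding step; returning a subset of
    # the vocabulary keeps each beam on a path through the trie.
    return trie.allowed(input_ids.tolist()[prompt_len:])

# e.g. model.generate(..., num_beams=3, prefix_allowed_tokens_fn=prefix_allowed_tokens_fn)
```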
## MedMentions Test Set Results
| Training Data | Recall@1 | Improvement (relative) |
|---------------|----------|------------------------|
| MedMentions only | 0.76 | Baseline |
| + SynthMM (Ours) | **0.85** | **+11.8%** |
### Comparison with State-of-the-Art
| Model | F1 Score | Training Data |
|-------|----------|---------------|
| **SapBERT** | 0.83 | MedMentions + UMLS |
| **BioSyn** | 0.81 | MedMentions |
| **GENRE (baseline)** | 0.79 | MedMentions |
| **SynCABEL-8B (Ours)** | **0.85** | MedMentions + SynthMM |
| **SynCABEL-8B (w/ UMLS)** | **0.88** | + UMLS pretraining |
### Speed and Efficiency
| Batch Size | Avg. Latency | Throughput |
|------------|--------------|------------|
| 1 | 120 ms | 8.3 samples/sec |
| 8 | 650 ms | 12.3 samples/sec |
| 16 | 1,200 ms | 13.3 samples/sec |
| 32 | 2,100 ms | 15.2 samples/sec |

*Measured on a single H100 GPU with constrained decoding.*