Update README.md
README.md
CHANGED
|
@@ -1,3 +1,89 @@
|
|
| 1 |
-
---
|
| 2 |
-
license: apache-2.0
|
| 3 |
-
|
| 1 |
+
---
|
| 2 |
+
license: apache-2.0
|
| 3 |
+
datasets:
|
| 4 |
+
- BeIR/scidocs
|
| 5 |
+
- miriad/miriad-4.4M
|
| 6 |
+
- BioASQ-b
|
| 7 |
+
language:
|
| 8 |
+
- en
|
| 9 |
+
base_model:
|
| 10 |
+
- BAAI/bge-reranker-v2-gemma
|
| 11 |
+
pipeline_tag: text-classification
|
| 12 |
+
tags:
|
| 13 |
+
- medical
|
| 14 |
+
- merge
|
| 15 |
+
- rerank
|
| 16 |
+
---
|
| 17 |
+
|
| 18 |
+
# MedSwin/MedSwin-Reranker-bge-gemma — Fine-tuned Biomedical & EMR Context Ranking
|
| 19 |
+
|
| 20 |
+
- **Developed by:** Medical Swinburne University of Technology AI Team
|
| 21 |
+
- **Funded by:** [Swinburne University of Technology](https://www.swinburne.edu.au)
|
| 22 |
+
- **Language(s):** English
|
| 23 |
+
- **License:** Apache 2.0
|
| 24 |
+
|
| 25 |
+
## Overview
|
| 26 |
+
1. **RAG Context Reranking**
|
| 27 |
+
Re-rank candidate passages retrieved from a vector database (initial recall via embeddings) to improve the final context selection for downstream medical LLM reasoning.
|
| 28 |
+
|
| 29 |
+
2. **EMR Profile Reranking**
|
| 30 |
+
Re-rank patient historical information (e.g., past assessments, diagnoses, medications) to surface the most clinically relevant records for a given current assessment.
|
| 31 |
+
|
| 32 |
+
The reranker outputs a **direct relevance score** for each *(query, passage)* pair and can be used as a drop-in “second-stage” ranking component after embedding-based retrieval.
|
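As a rough illustration of the second-stage pattern described above, the sketch below re-orders first-stage candidates by a per-pair relevance score. The names `rerank` and `toy_overlap_score` are hypothetical and not part of this repository; the token-overlap scorer is only a stand-in for the fine-tuned model, whose actual inference API is not shown in this README.

```python
from typing import Callable, List, Sequence, Tuple


def toy_overlap_score(query: str, passage: str) -> float:
    """Stand-in scorer (assumption): fraction of query tokens found in the passage.

    A real deployment would call the fine-tuned reranker here instead.
    """
    q_tokens = set(query.lower().split())
    p_tokens = set(passage.lower().split())
    return len(q_tokens & p_tokens) / len(q_tokens) if q_tokens else 0.0


def rerank(
    query: str,
    passages: Sequence[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every (query, passage) pair and return the top_k by descending score."""
    scored = [(passage, score_fn(query, passage)) for passage in passages]
    # Python's sort is stable, so ties keep their first-stage order.
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Candidates as they might come back from embedding-based retrieval.
candidates = [
    "aspirin reduces fever",
    "metformin treats type 2 diabetes",
    "diabetes diet advice",
]
top = rerank("metformin for diabetes", candidates, toy_overlap_score, top_k=2)
```

Because the scorer is passed in as a function, the same loop would work unchanged if `score_fn` wrapped the actual model.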
| 33 |
+
|
| 34 |
+
---
|
| 35 |
+
|
| 36 |
+
## Why a Reranker?
|
| 37 |
+
Embedding retrieval is fast and scalable but may miss nuanced relevance (clinical relationships, subtle terminology, long context dependencies).
|
| 38 |
+
A reranker improves precision by explicitly scoring each candidate passage against the query, typically yielding better top-k context for medical QA and decision support.
|
| 39 |
+
|
| 40 |
+
---
|
| 41 |
+
|
| 42 |
+
## Base Model
|
| 43 |
+
- **Model**: [BAAI/bge-reranker-v2-gemma](https://huggingface.co/BAAI/bge-reranker-v2-gemma)
|
| 44 |
+
- **Fine-tuning strategy**: **LoRA** (parameter-efficient fine-tuning) with gradient checkpointing and mixed precision (fp16 or bf16, depending on the GPU).
|
| 45 |
+
- **Rationale**: Gemma-based rerankers generally provide strong relevance modeling and support longer contexts compared to smaller rerankers.
|
| 46 |
+
|
| 47 |
+
---
|
| 48 |
+
|
| 49 |
+
## Training Data (Offline, Local)
|
| 50 |
+
We fine-tune using **open HF datasets** stored locally on HPC:
|
| 51 |
+
|
| 52 |
+
### 1) BioASQ (Generated Queries)
|
| 53 |
+
- Used as: (query, document) positives; negatives sampled from rolling buffer.
|
| 54 |
+
- Specialised for the complex terminology and high precision required by BioASQ Task B (Biomedical Semantic QA). The reranker acts as the second stage of a two-stage retrieval system: it filters the initial candidate list from a PubMed-indexed retriever so that the highest-ranked documents contain the specific evidence needed for factoid and 'ideal' answer generation.
|
| 55 |
+
|
| 56 |
+
### 2) MIRIAD (Medical IR Instruction Dataset)
|
| 57 |
+
- Used as: (question → passage) positives; negatives sampled from rolling buffer.
|
| 58 |
+
- Using [MIRIAD's 4.4M](https://huggingface.co/datasets/miriad/miriad-4.4M) literature-grounded QA pairs, the model is trained to distinguish between highly similar clinical concepts. This specialisation is intended to reduce medical hallucinations downstream by prioritising the most scientifically accurate evidence in a multi-stage retrieval pipeline for healthcare professionals.
|
| 59 |
+
|
| 60 |
+
### 3) SciDocs
|
| 61 |
+
- Using this multi-task dataset, which includes citation prediction and co-citation analysis, the model learns to capture nuanced semantic relationships that standard bi-encoders can miss. The resulting reranker serves as the second stage in a two-stage retrieval pipeline, with the aim of improving top-k relevance for complex scholarly queries.
|
| 62 |
+
|
| 63 |
+
---
|
| 64 |
+
|
| 65 |
+
## Methodology
|
| 66 |
+
### Data Construction (Triplets)
|
| 67 |
+
The training corpus is converted into reranker triplets:
|
| 68 |
+
```json
|
| 69 |
+
{
|
| 70 |
+
"query": "clinical question",
|
| 71 |
+
"pos": ["relevant passage 1", "relevant passage 2"],
|
| 72 |
+
"neg": ["irrelevant passage A", "irrelevant passage B"],
|
| 73 |
+
"source": "dataset_name"
|
| 74 |
+
}
|
| 75 |
+
```
|
| 76 |
+
|
| 77 |
+
* **Positives**: from dataset relevance labels or paired question–passage examples.
|
| 78 |
+
* **Negatives**: sampled from an in-memory rolling buffer (fast, scalable offline).
|
| 79 |
+
* Output splits: **train / val / test** created in one run.
|
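The triplet construction above can be sketched as follows. This is a minimal illustration under assumptions, not the project's actual pipeline: `build_triplets`, `num_neg`, `buffer_size`, and `seed` are hypothetical names, and the choice to skip examples until the buffer holds enough passages is one possible policy. Note that a passage drawn from a rolling buffer may occasionally be relevant to the query (a false negative); the README does not say how that case is handled.

```python
import random
from collections import deque
from typing import Deque, Dict, Iterable, Iterator, Tuple


def build_triplets(
    pairs: Iterable[Tuple[str, str]],
    source: str,
    num_neg: int = 2,
    buffer_size: int = 1000,
    seed: int = 0,
) -> Iterator[Dict[str, object]]:
    """Turn (query, positive) pairs into reranker triplets.

    Negatives are sampled from a fixed-size buffer of recently seen
    positives, so no full-corpus index is needed.
    """
    rng = random.Random(seed)  # seeded for reproducible sampling
    buffer: Deque[str] = deque(maxlen=buffer_size)
    for query, positive in pairs:
        # Never use the example's own positive as a negative.
        candidates = [p for p in buffer if p != positive]
        if len(candidates) >= num_neg:
            yield {
                "query": query,
                "pos": [positive],
                "neg": rng.sample(candidates, num_neg),
                "source": source,
            }
        # The oldest passage is evicted automatically once the buffer is full.
        buffer.append(positive)


pairs = [("q1", "p1"), ("q2", "p2"), ("q3", "p3"), ("q4", "p4")]
triplets = list(build_triplets(pairs, source="toy", num_neg=2, buffer_size=3))
```

In this toy run the first two pairs produce no triplet because the buffer does not yet contain two other passages.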
| 80 |
+
|
| 81 |
+
### Evaluation
|
| 82 |
+
|
| 83 |
+
Evaluation computes IR ranking metrics by scoring each query against its *(pos + neg)* candidates and ranking them by score:
|
| 84 |
+
|
| 85 |
+
* **nDCG@10:** 0.60+
|
| 86 |
+
* **MRR@10:** 0.50+
|
| 87 |
+
* **MAP@10:** 0.40+
|
| 88 |
+
* **Hit@1:** 0.40+
|
| 89 |
+
* Metrics reported overall and broken down by data source.
|
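For reference, the metrics listed above can be computed per query as in the sketch below. It assumes binary relevance labels (1 for `pos`, 0 for `neg`) already ordered by descending reranker score; the function names are hypothetical and this is not the project's evaluation code. AP@k has several conventions; this version normalises by min(number of relevant passages, k). Dataset-level scores would be the mean over queries.

```python
import math
from typing import Sequence


def dcg_at_k(labels: Sequence[int], k: int) -> float:
    """Discounted cumulative gain over the first k ranked labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(labels[:k]))


def ndcg_at_k(labels: Sequence[int], k: int = 10) -> float:
    """DCG divided by the DCG of the ideal (label-sorted) ranking."""
    ideal = dcg_at_k(sorted(labels, reverse=True), k)
    return dcg_at_k(labels, k) / ideal if ideal > 0 else 0.0


def mrr_at_k(labels: Sequence[int], k: int = 10) -> float:
    """Reciprocal rank of the first relevant passage within the top k."""
    for rank, rel in enumerate(labels[:k], start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def ap_at_k(labels: Sequence[int], k: int = 10) -> float:
    """Average precision at k, normalised by min(#relevant, k)."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(labels[:k], start=1):
        if rel:
            hits += 1
            total += hits / rank
    denom = min(sum(labels), k)
    return total / denom if denom else 0.0


def hit_at_1(labels: Sequence[int]) -> float:
    """1.0 if the top-ranked passage is relevant, else 0.0."""
    return 1.0 if labels and labels[0] else 0.0


# One query whose relevant passages were ranked 2nd and 4th.
ranked_labels = [0, 1, 0, 1]
scores = {
    "nDCG@10": ndcg_at_k(ranked_labels),
    "MRR@10": mrr_at_k(ranked_labels),
    "MAP@10": ap_at_k(ranked_labels),
    "Hit@1": hit_at_1(ranked_labels),
}
```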