|
|
--- |
|
|
license: apache-2.0 |
|
|
datasets: |
|
|
- BeIR/scidocs |
|
|
- miriad/miriad-4.4M |
|
|
- BioASQ-b |
|
|
language: |
|
|
- en |
|
|
base_model: |
|
|
- BAAI/bge-reranker-v2-gemma |
|
|
pipeline_tag: text-ranking |
|
|
tags: |
|
|
- medical |
|
|
- rerank |
|
|
--- |
|
|
|
|
|
# MedSwin/MedSwin-Reranker-bge-gemma — Fine-tuned Biomedical & EMR Context Ranking |
|
|
|
|
|
- **Developed by:** Medical AI Team, Swinburne University of Technology
|
|
- **Funded by:** [Swinburne University of Technology](https://www.swinburne.edu.au) |
|
|
- **Language(s):** English |
|
|
- **License:** Apache 2.0 |
|
|
|
|
|
## Overview

This model is a fine-tuned second-stage reranker for two primary use cases:
|
|
1. **RAG Context Reranking** |
|
|
Re-rank candidate passages retrieved from a VectorDB (initial recall via embeddings), improving final context selection for downstream medical LLM reasoning. |
|
|
|
|
|
2. **EMR Profile Reranking** |
|
|
Re-rank patient historical information (e.g., past assessments, diagnoses, medications) to surface the most clinically relevant records for a given current assessment. |
|
|
|
|
|
The reranker outputs a **direct relevance score** for each *(query, passage)* pair and can be used as a drop-in “second-stage” ranking component after embedding-based retrieval. |
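A minimal usage sketch, assuming the fine-tuned weights are published as a merged checkpoint loadable under this repo id via FlagEmbedding's LLM-reranker interface (the same interface used by the base model):

```python
from FlagEmbedding import FlagLLMReranker

# Load the reranker; fp16 keeps memory usage manageable on a single GPU.
reranker = FlagLLMReranker("MedSwin/MedSwin-Reranker-bge-gemma", use_fp16=True)

query = "first-line pharmacological treatment for type 2 diabetes"
passages = [
    "Metformin is recommended as initial therapy for most patients with type 2 diabetes.",
    "Annual influenza vaccination is advised for all adults over 65.",
]

# compute_score takes a list of (query, passage) pairs and returns one
# relevance score per pair; higher means more relevant.
scores = reranker.compute_score([[query, p] for p in passages])
for passage, score in sorted(zip(passages, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}  {passage}")
```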
|
|
|
|
|
--- |
|
|
|
|
|
## Why a Reranker? |
|
|
Embedding-based retrieval is fast and scalable but can miss nuanced relevance signals (clinical relationships, subtle terminology, long-range context dependencies).
|
|
A reranker improves precision by explicitly scoring each candidate passage against the query, typically yielding better top-k context for medical QA and decision support. |
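As a sketch of this two-stage pattern, the helper below reuses the `reranker` object from the usage example above; `vector_search` is a hypothetical stand-in for whatever VectorDB client performs the embedding-based recall:

```python
from typing import List, Tuple

def rerank_top_k(query: str, reranker, candidates: List[str], k: int = 5) -> List[Tuple[str, float]]:
    """Second-stage reranking: score every recalled candidate against the
    query and keep only the top-k passages for the downstream LLM prompt."""
    scores = reranker.compute_score([[query, c] for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return ranked[:k]

# Stage 1 (hypothetical): fast embedding recall from a VectorDB, e.g. top-100.
# candidates = vector_search(query, top_k=100)
# Stage 2: precise cross-encoder reranking down to the top 5 passages.
# context = rerank_top_k(query, reranker, candidates, k=5)
```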
|
|
|
|
|
--- |
|
|
|
|
|
## Base Model |
|
|
- **Model**: [BAAI/bge-reranker-v2-gemma](https://huggingface.co/BAAI/bge-reranker-v2-gemma) |
|
|
- **Finetuning strategy**: **LoRA** (parameter-efficient fine-tuning) with gradient checkpointing and mixed precision (fp16/bf16 depending on GPU); see the configuration sketch after this list.
|
|
- **Rationale**: Gemma-based rerankers generally provide strong relevance modeling and support longer contexts compared to smaller rerankers. |
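A minimal PEFT sketch of this setup; the rank, alpha, and target modules below are illustrative defaults, not the exact training recipe:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# bge-reranker-v2-gemma scores pairs via a causal-LM head, so LoRA is
# applied to the attention projections of the underlying Gemma model.
model = AutoModelForCausalLM.from_pretrained(
    "BAAI/bge-reranker-v2-gemma",
    torch_dtype=torch.bfloat16,  # or torch.float16 on GPUs without bf16
)
model.gradient_checkpointing_enable()  # trade extra compute for memory

lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=16,              # illustrative rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```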
|
|
|
|
|
--- |
|
|
|
|
|
## Training Data (Offline, Local) |
|
|
We fine-tune using **open HF datasets** stored locally on HPC: |
|
|
|
|
|
### 1) BioASQ (Generated Queries) |
|
|
- Used as: (query, document) positives; negatives sampled from rolling buffer. |
|
|
- The model is specialised to handle the complex terminology and high precision required for BioASQ Task B (Biomedical Semantic QA). The reranker acts as a critical second stage in a two-stage retrieval system, filtering initial candidate lists from a PubMed-indexed retriever so that the highest-ranked documents contain the specific evidence needed for factoid and 'ideal' answer generation.
|
|
|
|
|
### 2) MIRIAD (Medical IR Instruction Dataset) |
|
|
- Used as: (question → passage) positives; negatives sampled from rolling buffer. |
|
|
- Trained on [MIRIAD's 4.4M](https://huggingface.co/datasets/miriad/miriad-4.4M) literature-grounded QA pairs, the model learns to distinguish between highly similar clinical concepts. This specialisation reduces medical hallucinations and ensures that the most scientifically accurate evidence is prioritised in a multi-stage retrieval pipeline for healthcare professionals.
|
|
|
|
|
### 3) SciDocs |
|
|
- SciDocs is a multi-task dataset covering citation prediction and co-citation analysis; training on it teaches the model to capture nuanced semantic relationships that standard bi-encoders miss. The resulting reranker serves as a high-accuracy second stage in a two-stage retrieval pipeline, significantly improving top-k relevance for complex scholarly queries.
|
|
|
|
|
--- |
|
|
|
|
|
## Methodology |
|
|
### Data Construction (Triplets) |
|
|
The training corpus is converted into reranker triplets: |
|
|
```json
{
  "query": "clinical question",
  "pos": ["relevant passage 1", "relevant passage 2"],
  "neg": ["irrelevant passage A", "irrelevant passage B"],
  "source": "dataset_name"
}
```
|
|
|
|
|
* **Positives**: from dataset relevance labels or paired question–passage examples. |
|
|
* **Negatives**: sampled from an in-memory rolling buffer (fast, scalable offline). |
|
|
* Output splits: **train / val / test** created in one run. |
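A sketch of the in-memory rolling-buffer negative sampler described above (the buffer capacity and negative count are illustrative):

```python
import random
from collections import deque

class RollingNegativeBuffer:
    """Keeps the most recently seen passages while streaming the corpus and
    samples negatives from them, avoiding a full corpus scan (fast, offline)."""

    def __init__(self, capacity: int = 10_000):
        self.buffer = deque(maxlen=capacity)  # old passages fall off the end

    def add(self, passages):
        self.buffer.extend(passages)

    def sample(self, positives, n_neg: int = 4):
        pos = set(positives)
        pool = [p for p in self.buffer if p not in pos]  # never sample a positive
        return random.sample(pool, min(n_neg, len(pool)))

# Usage while streaming (query, positives) examples:
# buffer.add(positives); neg = buffer.sample(positives, n_neg=4)
# triplet = {"query": query, "pos": positives, "neg": neg, "source": name}
```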
|
|
|
|
|
### Evaluation |
|
|
|
|
|
Evaluation computes IR ranking metrics by scoring each query against its *(pos + neg)* candidates (a per-query metric sketch follows the list below):
|
|
|
|
|
* **nDCG@10:** 0.60+ |
|
|
* **MRR@10:** 0.50+ |
|
|
* **MAP@10:** 0.40+ |
|
|
* **Hit@1:** 0.40+ |
|
|
* Metrics reported overall and broken down by data source. |