MedGRPO Team
Claude Sonnet 4.5
committed
Commit · c137994
Parent(s): 15c8be4
Update README with comprehensive MedGRPO documentation
- Added detailed submission guide with expected results format
- Documented 8 tasks with metrics and descriptions
- Explained automatic evaluation pipeline integration
- Added LLM judge scoring details (R2, R4, R5, R7, R8)
- Included score normalization methodology
- Updated HuggingFace Space metadata (emoji, tags, description)
- Added test set statistics and task distribution
- Comprehensive citation and links section
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
README.md
CHANGED
@@ -1,48 +1,180 @@

Removed (old template README): placeholder front matter (`emoji:`, `colorFrom:`, `colorTo:`, `short_description: Leaderboard`, truncated `sdk_version: 5.`), an empty `#` heading, a skeleton "Results files should have the following format and be stored as json files:" JSON example, and template notes on the submission logic in `src/submission/submit.py` and `src/submission/check_validity.py`.
---
title: MedGRPO Leaderboard
emoji: 🏥
colorFrom: blue
colorTo: purple
sdk: gradio
app_file: app.py
pinned: true
license: apache-2.0
short_description: MedGRPO Benchmark Leaderboard - 8 medical video tasks
sdk_version: 5.50.0
tags:
- leaderboard
- medical
- video-understanding
- surgical-ai
---

# MedGRPO Leaderboard

Interactive leaderboard for evaluating Video-Language Models on the **MedGRPO benchmark**: 8 medical video understanding tasks across 8 surgical datasets.

🚀 **Live Demo**: [huggingface.co/spaces/UIIAmerica/MedGRPO-Leaderboard](https://huggingface.co/spaces/UIIAmerica/MedGRPO-Leaderboard)

📄 **Paper**: [arXiv:2512.06581](https://arxiv.org/abs/2512.06581)

## Overview

This leaderboard provides a centralized platform for researchers to:
- **Submit** inference results on the MedGRPO test set
- **Automatically evaluate** across 8 diverse tasks
- **Compare** model performance on standardized metrics
- **Track** state-of-the-art progress in medical video understanding

## Features

### 🎯 8 Medical Video Tasks

| Task | Metric | Description |
|------|--------|-------------|
| **TAL** | mAP@0.5 | Temporal Action Localization - identify start/end times of surgical actions |
| **STG** | mIoU | Spatiotemporal Grounding - locate actions in space (bbox) and time |
| **Next Action** | Accuracy | Predict the next surgical step |
| **DVC** | LLM Judge | Dense Video Captioning - detailed segment descriptions |
| **VS** | LLM Judge | Video Summary - summarize entire surgical videos |
| **RC** | LLM Judge | Region Caption - describe regions indicated by bounding boxes |
| **Skill Assessment** | Accuracy | Evaluate surgical skill levels (JIGSAWS) |
| **CVS Assessment** | Accuracy | Critical View of Safety scoring (Cholec80_CVS) |
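
For reference, here is a minimal sketch of the 1-D temporal IoU that mAP@0.5 (and the temporal half of STG's mIoU) is built on; the helper below is illustrative only, not the benchmark's exact scorer, which also handles prediction-to-ground-truth matching and averaging:

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """IoU between two time segments given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# For mAP@0.5, a predicted segment counts as a hit when IoU >= 0.5:
print(temporal_iou((12.0, 30.0), (15.0, 32.0)))  # 0.75
```
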
### ⚙️ Automatic Evaluation

The leaderboard integrates directly with the MedGRPO evaluation pipeline (a sketch of the flow follows the list):
- **Validation**: Checks results file format and sample count
- **Execution**: Runs `evaluate_all_pai.py` with dataset-agnostic grouping
- **Parsing**: Extracts task-specific metrics from evaluation output
- **Ranking**: Computes normalized average score across all tasks
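
A minimal sketch of that flow, assuming a thin wrapper; the function name and the command-line invocation of `evaluate_all_pai.py` are our illustration and an assumption, not the Space's actual code:

```python
import json
import subprocess
import sys

EXPECTED_SAMPLES = 6245  # size of the MedGRPO test set

def run_evaluation(results_path: str) -> str:
    """Hypothetical wrapper mirroring the pipeline steps above."""
    # Validation: the file must be a JSON array with one entry per test sample.
    with open(results_path) as f:
        results = json.load(f)
    if not isinstance(results, list) or len(results) != EXPECTED_SAMPLES:
        raise ValueError(f"expected a list of {EXPECTED_SAMPLES} samples")

    # Execution: invoke the evaluation script (the real arguments may differ).
    proc = subprocess.run(
        [sys.executable, "evaluate_all_pai.py", results_path],
        capture_output=True, text=True, check=True,
    )

    # Parsing/Ranking: task metrics are extracted from this output and
    # averaged into the normalized leaderboard score.
    return proc.stdout
```
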
### 📊 Test Set Statistics

- **Total samples**: 6,245
- **Source datasets**: 8 (AVOS, CholecT50, CholecTrack20, Cholec80_CVS, CoPESD, EgoSurgery, NurViD, JIGSAWS)
- **Video frames**: ~103,742
## Submission Guide

### 1. Run Inference

Run your model on the MedGRPO test set (6,245 samples) to generate predictions for all 8 tasks.

### 2. Expected Results Format

Your results file should be a JSON array with this structure:

```json
[
  {
    "question": "<video>\nQuestion text...",
    "response": "Your model's answer",
    "ground_truth": "Correct answer",
    "qa_type": "tal",
    "metadata": {
      "video_id": "...",
      "fps": "1.0",
      ...
    },
    "data_source": "AVOS",
    ...
  },
  ...
]
```

**Valid qa_types** (see the sketch after this list for assembling such a file):
- `tal` - Temporal Action Localization
- `stg` - Spatiotemporal Grounding
- `next_action` - Next Action Prediction
- `dense_captioning` - Dense Video Captioning
- `video_summary` - Video Summary
- `region_caption` - Region Caption
- `skill_assessment` - Skill Assessment (JIGSAWS)
- `cvs_assessment` - CVS Assessment
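
A minimal sketch of building a results file in this shape, assuming you iterate over the released test samples; `run_model` is a hypothetical stand-in for your own inference call:

```python
import json

def run_model(question: str) -> str:
    """Hypothetical stand-in for your VLM inference call."""
    raise NotImplementedError

def build_results(test_samples: list[dict], out_path: str = "results.json") -> None:
    results = []
    for sample in test_samples:
        results.append({
            "question": sample["question"],
            "response": run_model(sample["question"]),  # your model's answer
            "ground_truth": sample["ground_truth"],
            "qa_type": sample["qa_type"],                # e.g. "tal", "stg", ...
            "metadata": sample["metadata"],
            "data_source": sample["data_source"],        # e.g. "AVOS"
        })
    with open(out_path, "w") as f:
        json.dump(results, f, indent=2)
```
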
### 3. Upload to Leaderboard

1. Visit the [leaderboard](https://huggingface.co/spaces/UIIAmerica/MedGRPO-Leaderboard)
2. Go to the **Submit Results** tab
3. Fill in:
   - **Model Name** (e.g., "Qwen2.5-VL-7B-MedGRPO")
   - **Organization** (e.g., "Your University")
   - **Contact** (optional)
4. Upload your results JSON file
5. Click **Submit to Leaderboard**

The system will (a local pre-check sketch follows the list):
- Validate your file (format + sample count)
- Run automatic evaluation (~5-10 minutes)
- Extract metrics for all 8 tasks
- Add your model to the leaderboard
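
You can replicate the first step locally before uploading. A minimal pre-check sketch, using the required keys and `qa_type` values listed above (the helper name is ours, not the Space's validator):

```python
import json

VALID_QA_TYPES = {
    "tal", "stg", "next_action", "dense_captioning",
    "video_summary", "region_caption", "skill_assessment", "cvs_assessment",
}
REQUIRED_KEYS = {"question", "response", "ground_truth",
                 "qa_type", "metadata", "data_source"}
EXPECTED_SAMPLES = 6245

def precheck(path: str) -> None:
    with open(path) as f:
        results = json.load(f)
    assert isinstance(results, list), "results file must be a JSON array"
    assert len(results) == EXPECTED_SAMPLES, \
        f"expected {EXPECTED_SAMPLES} samples, got {len(results)}"
    for i, entry in enumerate(results):
        missing = REQUIRED_KEYS - entry.keys()
        assert not missing, f"entry {i} is missing keys: {missing}"
        assert entry["qa_type"] in VALID_QA_TYPES, \
            f"entry {i} has unknown qa_type {entry['qa_type']!r}"

precheck("results.json")
```
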
## Evaluation Metrics

### Task-Specific Metrics

| Task | Metric Extracted | Details |
|------|------------------|---------|
| TAL | `mAP@0.5` | Mean Average Precision at IoU=0.5 |
| STG | `mean_iou` | Mean Intersection over Union (spatial + temporal) |
| Next Action | `Weighted Average Accuracy` | Classification accuracy |
| DVC/VS/RC | `Average LLM Judge Score` | Average of R2, R4, R5, R7, R8 (1-5 scale) |
| Skill/CVS | `accuracy` | Classification accuracy |

### LLM Judge Details

For caption tasks (DVC, VS, RC), we use **GPT-4.1** or **Gemini-Pro** with rubric-based scoring.

**5 Key Aspects** (1-5 scale each):
- **R2**: Relevance & Medical Terminology
- **R4**: Actionable Surgical Actions
- **R5**: Comprehensive Detail Level
- **R7**: Anatomical & Instrument Precision
- **R8**: Clinical Context & Coherence

**Final score** = Average of R2, R4, R5, R7, R8
### Score Normalization

To compute the **average score** fairly across tasks:
1. **LLM Judge scores** (1-5 scale) are normalized: `(score - 1) / 4` → [0, 1]
2. **Other metrics** (already 0-1) remain unchanged
3. **Average** = mean of all 8 normalized task scores (see the sketch below)
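
A minimal sketch of this arithmetic with illustrative per-task scores (the values are made up; the formula follows the three steps above):

```python
# Raw per-task scores: LLM-judge tasks on a 1-5 scale, the rest already in [0, 1].
raw_scores = {
    "tal": 0.42, "stg": 0.37, "next_action": 0.61,        # already 0-1
    "dvc": 3.8, "vs": 4.1, "rc": 3.5,                     # 1-5 LLM judge scores
    "skill_assessment": 0.70, "cvs_assessment": 0.55,     # already 0-1
}
LLM_JUDGE_TASKS = {"dvc", "vs", "rc"}

def normalize(task: str, score: float) -> float:
    # LLM judge scores map from [1, 5] to [0, 1]; other metrics pass through.
    return (score - 1) / 4 if task in LLM_JUDGE_TASKS else score

average = sum(normalize(t, s) for t, s in raw_scores.items()) / len(raw_scores)
print(f"normalized average: {average:.3f}")  # 0.594 for the values above
```
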
## Links

- 📄 **Paper**: [https://arxiv.org/abs/2512.06581](https://arxiv.org/abs/2512.06581)
- 🌐 **Project**: [https://yuhaosu.github.io/MedGRPO/](https://yuhaosu.github.io/MedGRPO/)
- 💾 **Dataset**: [https://huggingface.co/datasets/UIIAmerica/MedGRPO](https://huggingface.co/datasets/UIIAmerica/MedGRPO)
- 💻 **GitHub**: [https://github.com/YuhaoSu/MedGRPO](https://github.com/YuhaoSu/MedGRPO)
- 🏆 **Leaderboard**: [https://huggingface.co/spaces/UIIAmerica/MedGRPO-Leaderboard](https://huggingface.co/spaces/UIIAmerica/MedGRPO-Leaderboard)

## Citation

```bibtex
@article{su2025medgrpo,
  title={MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding},
  author={Su, Yuhao and Choudhuri, Anwesa and Gao, Zhongpai and Planche, Benjamin and Nguyen, Van Nguyen and Zheng, Meng and Shen, Yuhan and Innanje, Arun and Chen, Terrence and Elhamifar, Ehsan and Wu, Ziyan},
  journal={arXiv preprint arXiv:2512.06581},
  year={2025}
}
```

## License

- **Leaderboard Code**: Apache 2.0
- **Dataset**: CC BY-NC-SA 4.0 (Non-commercial, Share-alike)

## Contact

For questions or issues:
- Open an issue on [GitHub](https://github.com/YuhaoSu/MedGRPO)
- Visit the [project page](https://yuhaosu.github.io/MedGRPO/)