MedGRPO Team Claude Sonnet 4.5 committed on
Commit c137994 · 1 Parent(s): 15c8be4

Update README with comprehensive MedGRPO documentation


- Added detailed submission guide with expected results format
- Documented 8 tasks with metrics and descriptions
- Explained automatic evaluation pipeline integration
- Added LLM judge scoring details (R2, R4, R5, R7, R8)
- Included score normalization methodology
- Updated HuggingFace Space metadata (emoji, tags, description)
- Added test set statistics and task distribution
- Comprehensive citation and links section

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

Files changed (1)
  1. README.md +160 -28
README.md CHANGED
@@ -1,48 +1,180 @@
  ---
  title: MedGRPO Leaderboard
- emoji: 🥇
- colorFrom: green
- colorTo: indigo
  sdk: gradio
  app_file: app.py
  pinned: true
  license: apache-2.0
- short_description: Leaderboard for Video-LLMs on the MedGRPO Benchmark
- sdk_version: 5.43.1
  tags:
  - leaderboard
  ---

- # Start the configuration

- Most of the variables to change for a default leaderboard are in `src/env.py` (replace the path for your leaderboard) and `src/about.py` (for tasks).

- Results files should have the following format and be stored as json files:
  ```json
- {
-     "config": {
-         "model_dtype": "torch.float16", # or torch.bfloat16 or 8bit or 4bit
-         "model_name": "path of the model on the hub: org/model",
-         "model_sha": "revision on the hub",
    },
-     "results": {
-         "task_name": {
-             "metric_name": score,
-         },
-         "task_name2": {
-             "metric_name": score,
-         }
-     }
  }
  ```

- Request files are created automatically by this tool.

- If you encounter problem on the space, don't hesitate to restart it to remove the create eval-queue, eval-queue-bk, eval-results and eval-results-bk created folder.

- # Code logic for more complex edits

- You'll find
- - the main table' columns names and properties in `src/display/utils.py`
- - the logic to read all results and request files, then convert them in dataframe lines, in `src/leaderboard/read_evals.py`, and `src/populate.py`
- - the logic to allow or filter submissions in `src/submission/submit.py` and `src/submission/check_validity.py`
 
  ---
  title: MedGRPO Leaderboard
+ emoji: 🏥
+ colorFrom: blue
+ colorTo: purple
  sdk: gradio
  app_file: app.py
  pinned: true
  license: apache-2.0
+ short_description: MedGRPO Benchmark Leaderboard - 8 medical video tasks
+ sdk_version: 5.50.0
  tags:
  - leaderboard
+ - medical
+ - video-understanding
+ - surgical-ai
  ---

+ # MedGRPO Leaderboard

+ Interactive leaderboard for evaluating Video-Language Models on the **MedGRPO benchmark** - 8 medical video understanding tasks across 8 surgical datasets.
+
+ 🏆 **Live Demo**: [huggingface.co/spaces/UIIAmerica/MedGRPO-Leaderboard](https://huggingface.co/spaces/UIIAmerica/MedGRPO-Leaderboard)
+
+ 📄 **Paper**: [arXiv:2512.06581](https://arxiv.org/abs/2512.06581)
+
+ ## Overview
+
+ This leaderboard provides a centralized platform for researchers to:
+ - **Submit** inference results on the MedGRPO test set
+ - **Automatically evaluate** across 8 diverse tasks
+ - **Compare** model performance on standardized metrics
+ - **Track** state-of-the-art progress in medical video understanding
+
+ ## Features
+
+ ### 🎯 8 Medical Video Tasks
+
+ | Task | Metric | Description |
+ |------|--------|-------------|
+ | **TAL** | mAP@0.5 | Temporal Action Localization - identify start/end times of surgical actions |
+ | **STG** | mIoU | Spatiotemporal Grounding - locate actions in space (bbox) and time |
+ | **Next Action** | Accuracy | Predict the next surgical step |
+ | **DVC** | LLM Judge | Dense Video Captioning - detailed segment descriptions |
+ | **VS** | LLM Judge | Video Summary - summarize entire surgical videos |
+ | **RC** | LLM Judge | Region Caption - describe regions indicated by bounding boxes |
+ | **Skill Assessment** | Accuracy | Evaluate surgical skill levels (JIGSAWS) |
49
+
50
+ ### βš™οΈ Automatic Evaluation
51
+
52
+ The leaderboard integrates directly with the MedGRPO evaluation pipeline:
53
+ - **Validation**: Checks results file format and sample count
54
+ - **Execution**: Runs `evaluate_all_pai.py` with dataset-agnostic grouping
55
+ - **Parsing**: Extracts task-specific metrics from evaluation output
56
+ - **Ranking**: Computes normalized average score across all tasks
57
+
58
+ ### πŸ“Š Test Set Statistics
59
+
60
+ - **Total samples**: 6,245
61
+ - **Source datasets**: 8 (AVOS, CholecT50, CholecTrack20, Cholec80_CVS, CoPESD, EgoSurgery, NurViD, JIGSAWS)
62
+ - **Video frames**: ~103,742
63
+
64
+ ## Submission Guide
65
+
66
+ ### 1. Run Inference
67
+
68
+ Run your model on the MedGRPO test set (6,245 samples) to generate predictions for all 8 tasks.
69
+
70
+ ### 2. Expected Results Format
71
+
72
+ Your results file should be a JSON with this structure:
73
 
 
74
  ```json
75
+ [
76
+ {
77
+ "question": "<video>\nQuestion text...",
78
+ "response": "Your model's answer",
79
+ "ground_truth": "Correct answer",
80
+ "qa_type": "tal",
81
+ "metadata": {
82
+ "video_id": "...",
83
+ "fps": "1.0",
84
+ ...
85
  },
86
+ "data_source": "AVOS",
87
+ ...
88
+ },
89
+ ...
90
+ ]
91
+ ```
92
+
93
+ **Valid qa_types**:
94
+ - `tal` - Temporal Action Localization
95
+ - `stg` - Spatiotemporal Grounding
96
+ - `next_action` - Next Action Prediction
97
+ - `dense_captioning` - Dense Video Captioning
98
+ - `video_summary` - Video Summary
99
+ - `region_caption` - Region Caption
100
+ - `skill_assessment` - Skill Assessment (JIGSAWS)
101
+ - `cvs_assessment` - CVS Assessment
102
+
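+ Before uploading, you can sanity-check the file locally. The snippet below is an optional, illustrative check (not part of the official pipeline) that only mirrors the format rules above; it assumes your file is named `results.json`.
+
+ ```python
+ import json
+ from collections import Counter
+
+ VALID_QA_TYPES = {
+     "tal", "stg", "next_action", "dense_captioning",
+     "video_summary", "region_caption", "skill_assessment", "cvs_assessment",
+ }
+ REQUIRED_KEYS = {"question", "response", "ground_truth", "qa_type", "metadata", "data_source"}
+
+ with open("results.json") as f:
+     samples = json.load(f)
+
+ assert isinstance(samples, list), "results file must be a JSON array"
+ assert len(samples) == 6245, f"expected 6,245 samples, got {len(samples)}"
+ for s in samples:
+     assert REQUIRED_KEYS <= s.keys(), f"missing keys: {REQUIRED_KEYS - s.keys()}"
+     assert s["qa_type"] in VALID_QA_TYPES, f"invalid qa_type: {s['qa_type']}"
+
+ print(Counter(s["qa_type"] for s in samples))  # per-task sample counts
+ ```
+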
+ ### 3. Upload to Leaderboard
+
+ 1. Visit the [leaderboard](https://huggingface.co/spaces/UIIAmerica/MedGRPO-Leaderboard)
+ 2. Go to the **Submit Results** tab
+ 3. Fill in:
+    - **Model Name** (e.g., "Qwen2.5-VL-7B-MedGRPO")
+    - **Organization** (e.g., "Your University")
+    - **Contact** (optional)
+ 4. Upload your results JSON file
+ 5. Click **Submit to Leaderboard**
+
+ The system will:
+ - Validate your file (format + sample count)
+ - Run automatic evaluation (~5-10 minutes)
+ - Extract metrics for all 8 tasks
+ - Add your model to the leaderboard
+
+ ## Evaluation Metrics
+
+ ### Task-Specific Metrics
+
+ | Task | Metric Extracted | Details |
+ |------|------------------|---------|
+ | TAL | `mAP@0.5` | Mean Average Precision at IoU=0.5 |
+ | STG | `mean_iou` | Mean Intersection over Union (spatial + temporal) |
+ | Next Action | `Weighted Average Accuracy` | Classification accuracy |
+ | DVC/VS/RC | `Average LLM Judge Score` | Average of R2, R4, R5, R7, R8 (1-5 scale) |
+ | Skill/CVS | `accuracy` | Classification accuracy |
+
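+ For reference, the IoU used in the TAL and STG metrics is the standard overlap-over-union ratio; the helper below is an illustrative definition for the temporal case only, not the benchmark's exact implementation.
+
+ ```python
+ def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
+     """IoU of two time segments given as (start, end) in seconds."""
+     inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
+     union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
+     return inter / union if union > 0 else 0.0
+
+ # A prediction counts as a true positive for mAP@0.5 when IoU >= 0.5.
+ print(temporal_iou((10.0, 20.0), (12.0, 22.0)))  # 0.666...
+ ```
+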
+ ### LLM Judge Details
+
+ For caption tasks (DVC, VS, RC), we use **GPT-4.1** or **Gemini-Pro** with rubric-based scoring:
+
+ **5 Key Aspects** (1-5 scale each):
+ - **R2**: Relevance & Medical Terminology
+ - **R4**: Actionable Surgical Actions
+ - **R5**: Comprehensive Detail Level
+ - **R7**: Anatomical & Instrument Precision
+ - **R8**: Clinical Context & Coherence
+
+ **Final score** = Average of R2, R4, R5, R7, R8
+
+ ### Score Normalization
+
+ To compute the **average score** fairly across tasks:
+ 1. **LLM Judge scores** (1-5 scale) are normalized: `(score - 1) / 4` → [0, 1]
+ 2. **Other metrics** (already 0-1) remain unchanged
+ 3. **Average** = mean of all 8 normalized task scores
+
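+ The normalization is simple to reproduce locally; here is a minimal sketch in which the task names and scores are placeholders:
+
+ ```python
+ def normalize_judge(score: float) -> float:
+     """Map an LLM-judge score on the 1-5 scale to [0, 1]."""
+     return (score - 1) / 4
+
+ # Placeholder per-task results: caption tasks carry the average of
+ # R2, R4, R5, R7, R8 (1-5 scale); all other metrics are already in [0, 1].
+ judge_tasks = {"dvc": 3.8, "vs": 3.5, "rc": 4.1}
+ other_tasks = {"tal": 0.42, "stg": 0.51, "next_action": 0.63,
+                "skill_assessment": 0.70, "cvs_assessment": 0.66}
+
+ normalized = {task: normalize_judge(s) for task, s in judge_tasks.items()}
+ normalized.update(other_tasks)
+ average = sum(normalized.values()) / len(normalized)  # mean of all 8 task scores
+ print(round(average, 4))
+ ```
+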
+ ## Links
+
+ - 📄 **Paper**: [https://arxiv.org/abs/2512.06581](https://arxiv.org/abs/2512.06581)
+ - 🌐 **Project**: [https://yuhaosu.github.io/MedGRPO/](https://yuhaosu.github.io/MedGRPO/)
+ - 💾 **Dataset**: [https://huggingface.co/datasets/UIIAmerica/MedGRPO](https://huggingface.co/datasets/UIIAmerica/MedGRPO)
+ - 💻 **GitHub**: [https://github.com/YuhaoSu/MedGRPO](https://github.com/YuhaoSu/MedGRPO)
+ - 🏆 **Leaderboard**: [https://huggingface.co/spaces/UIIAmerica/MedGRPO-Leaderboard](https://huggingface.co/spaces/UIIAmerica/MedGRPO-Leaderboard)
+
+ ## Citation
+
+ ```bibtex
+ @article{su2024medgrpo,
+   title={MedGRPO: Multi-Task Reinforcement Learning for Heterogeneous Medical Video Understanding},
+   author={Su, Yuhao and Choudhuri, Anwesa and Gao, Zhongpai and Planche, Benjamin and Nguyen, Van Nguyen and Zheng, Meng and Shen, Yuhan and Innanje, Arun and Chen, Terrence and Elhamifar, Ehsan and Wu, Ziyan},
+   journal={arXiv preprint arXiv:2512.06581},
+   year={2025}
  }
  ```

+ ## License

+ - **Leaderboard Code**: Apache 2.0
+ - **Dataset**: CC BY-NC-SA 4.0 (Non-commercial, Share-alike)

+ ## Contact

+ For questions or issues:
+ - Open an issue on [GitHub](https://github.com/YuhaoSu/MedGRPO)
+ - Visit the [project page](https://yuhaosu.github.io/MedGRPO/)